[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Klabauterkerl/finnish-chopper/blob/main/fairseq_morfessor.ipynb)

## Installing Dependencies

In [None]:
# Install fairseq and other dependencies
%pip install sacrebleu sentencepiece
%pip install tensorboardX
%pip install subword-nmt
%pip install sacremoses
# Pin PyTorch: otherwise fairseq-generate fails because of the wrong PyTorch version on Colab
%pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
%pip install fairseq

## Setting up filepaths
- Execute the first cell to mount your Google Drive when running on Colab
- Execute the second cell instead to set up the file paths when running locally
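
Instead of picking a cell by hand, the environment can also be detected at runtime. The following is a minimal sketch, assuming the same directory layout as the two cells of this section; the helper names `running_in_colab` and `make_dirs` are hypothetical and not part of the notebook, and Google Drive still has to be mounted separately on Colab.

```python
import importlib.util
from pathlib import Path


def running_in_colab() -> bool:
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "google" is missing entirely.
        return False


def make_dirs(base_path: str) -> dict:
    """Create the working directories under base_path and return their paths."""
    names = ["dataset", "data-bin", "checkpoints", "logs", "evaluation"]
    paths = {name: Path(base_path) / name for name in names}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return paths
```

A call such as `make_dirs("/content/drive/MyDrive/translation_model" if running_in_colab() else "data/morfessor")` would then create the same folders in either environment.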

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Define paths for the mounted Google Drive
base_path = "/content/drive/MyDrive/translation_model"
dataset_path = f"{base_path}/dataset"
data_bin_path = f"{base_path}/data-bin"
checkpoints_path = f"{base_path}/checkpoints"
logs_path = f"{base_path}/logs"
morfessor_path = f"{base_path}/dataset/morfessor"
evaluation_folder = f"{base_path}/evaluation"

!mkdir -p "{dataset_path}"
!mkdir -p "{data_bin_path}"
!mkdir -p "{checkpoints_path}"
!mkdir -p "{logs_path}"
!mkdir -p "{morfessor_path}"
!mkdir -p "{evaluation_folder}"

In [None]:
# Define paths when running locally
base_path = "data/morfessor"
dataset_path = f"{base_path}/dataset"
data_bin_path = f"{base_path}/data-bin"
checkpoints_path = f"{base_path}/checkpoints"
logs_path = f"{base_path}/logs"
morfessor_path = f"{base_path}/dataset"
evaluation_folder = f"{base_path}/evaluation"
!mkdir -p "{dataset_path}"
!mkdir -p "{data_bin_path}"
!mkdir -p "{checkpoints_path}"
!mkdir -p "{logs_path}"
!mkdir -p "{morfessor_path}"
!mkdir -p "{evaluation_folder}"

## Downloading the data
- The data is the Finnish-English part of the Europarl v9 corpus, which can be downloaded from [here](https://www.statmt.org/europarl/v9/training/europarl-v9.fi-en.tsv.gz)
- The parallel data is then split into a Finnish and an English file

In [None]:
# Download and extract dataset
!wget -P "{dataset_path}" https://www.statmt.org/europarl/v9/training/europarl-v9.fi-en.tsv.gz
!gunzip "{dataset_path}/europarl-v9.fi-en.tsv.gz"

In [None]:
# Split dataset into two files, each containing one column of the original dataset
!cut -f1 {dataset_path}/europarl-v9.fi-en.tsv > {dataset_path}/europarl-v9.fi
!cut -f2 {dataset_path}/europarl-v9.fi-en.tsv > {dataset_path}/europarl-v9.en
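
The later cells read `train`, `dev` and `test` files with `.fi` and `.en` extensions, but the notebook does not show how they are created from the two column files. One possible way is a reproducible random split, sketched below under that assumption; the function names, the split sizes and the seed are hypothetical choices, not the author's method.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=0):
    """Shuffle aligned sentence pairs and split them into train/dev/test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return {"train": train, "dev": dev, "test": test}


def write_split(pairs, prefix):
    """Write pairs to <prefix>.fi and <prefix>.en, one sentence per line."""
    with open(f"{prefix}.fi", "w", encoding="utf-8") as f_fi, \
         open(f"{prefix}.en", "w", encoding="utf-8") as f_en:
        for fi, en in pairs:
            f_fi.write(fi.rstrip("\n") + "\n")
            f_en.write(en.rstrip("\n") + "\n")
```

Shuffling the pairs together, rather than each file on its own, keeps every Finnish sentence aligned with its English counterpart.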

## Tokenization & Normalization

Before a Morfessor model is trained on the dataset, the text has to be preprocessed so that the model sees consistent input.

Preprocessing steps taken:
- Tokenization
- Normalization
- Truecasing
- Corpus cleaning (sentence pairs in which either side is empty or longer than 50 tokens are dropped)
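
The length-based cleaning step can be illustrated in plain Python. This is a simplified sketch of the idea only: the Moses script `clean-corpus-n.perl` used below applies further checks (for example a length-ratio limit) that are omitted here, and the names `keep_pair` and `clean_corpus` are hypothetical.

```python
def keep_pair(src, tgt, min_len=1, max_len=50):
    """Return True if both sentences have between min_len and max_len tokens."""
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    return min_len <= n_src <= max_len and min_len <= n_tgt <= max_len


def clean_corpus(pairs, min_len=1, max_len=50):
    """Drop every sentence pair where either side violates the length limits."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, min_len, max_len)]
```

A pair is removed as a whole, so the Finnish and English files stay aligned line by line.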

In [None]:
# Train truecaser model for Finnish
!mosesdecoder/scripts/recaser/train-truecaser.perl \
-corpus {dataset_path}/train.fi \
-model {dataset_path}/truecase-model.fi

# Train truecaser model for English
!mosesdecoder/scripts/recaser/train-truecaser.perl \
-corpus {dataset_path}/train.en \
-model {dataset_path}/truecase-model.en

In [None]:
set_name = "train"

# Normalize punctuation and tokenize Finnish text
!cat {dataset_path}/{set_name}.fi | \
mosesdecoder/scripts/tokenizer/normalize-punctuation.perl fi | \
mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -no-escape -l fi \
> {dataset_path}/{set_name}.tok.fi

# Normalize punctuation and tokenize English text
!cat {dataset_path}/{set_name}.en | \
mosesdecoder/scripts/tokenizer/normalize-punctuation.perl en | \
mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -no-escape -l en \
> {dataset_path}/{set_name}.tok.en

# Truecase the tokenized Finnish text
!mosesdecoder/scripts/recaser/truecase.perl \
-model {dataset_path}/truecase-model.fi \
< {dataset_path}/{set_name}.tok.fi \
> {dataset_path}/{set_name}.tok.truecase.fi

# Truecase the tokenized English text
!mosesdecoder/scripts/recaser/truecase.perl \
-model {dataset_path}/truecase-model.en \
< {dataset_path}/{set_name}.tok.en \
> {dataset_path}/{set_name}.tok.truecase.en

# Clean the corpus
!perl mosesdecoder/scripts/training/clean-corpus-n.perl \
{dataset_path}/{set_name}.tok.truecase en fi \
{dataset_path}/{set_name}.tok.clean 1 50

## Morfessor preprocessing

With Morfessor, the Finnish words are preprocessed by splitting off their suffixes and inserting delimiters between the resulting segments.

Finnish suffixes can carry the following information:
- Case
- Number
- Person
- Tense & Mood
- Possession
- Derivation
- Comparative and superlative forms

The delimiter should be a string that rarely occurs in Finnish text. The segmentation cells below pass " __" as the separator that marks the splits.
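
The segmentation cells in this notebook pass `" __"` to Morfessor as `--output-format-separator`, so every non-initial segment of a word becomes its own token prefixed with `__`. The sketch below illustrates that marking and how it can be undone. It assumes hand-written example segmentations rather than real Morfessor output, and the function names are hypothetical.

```python
MORPH_SEPARATOR = " __"  # separator passed via --output-format-separator


def mark_segments(segmented_words, separator=MORPH_SEPARATOR):
    """Join the segments of each word with the separator, words with a space."""
    return " ".join(separator.join(morphs) for morphs in segmented_words)


def unmark_segments(text, separator=MORPH_SEPARATOR):
    """Undo mark_segments by deleting every occurrence of the separator."""
    return text.replace(separator, "")


# Hand-written segmentation of "asun talossa" (illustrative, not model output)
segmented = [["asu", "n"], ["talo", "ssa"]]
marked = mark_segments(segmented)
restored = unmark_segments(marked)
```

Because the separator starts with a space, whitespace tokenization of the marked text yields one token per segment, which is what the later BPE step operates on.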

In [None]:
# Learn Morfessor model from tokenized data
!morfessor -t "{dataset_path}/train.tok.clean.fi" -s "{dataset_path}/model_fi.bin"

In [None]:
# Segment the tokenized Finnish train set using the learned Morfessor model
# (the braces around `analysis` are doubled so that IPython passes them through literally)
!morfessor -l "{dataset_path}/model_fi.bin" -T - \
    --output-newlines --output-format "{{analysis}}  " --output-format-separator " __" \
    < "{dataset_path}/train.tok.clean.fi" > \
    "{dataset_path}/train.morfessor.fi"

In [None]:
# Segment the tokenized Finnish dev set using the learned Morfessor model
# (the braces around `analysis` are doubled so that IPython passes them through literally)
!morfessor -l "{dataset_path}/model_fi.bin" -T - \
    --output-newlines --output-format "{{analysis}}  " --output-format-separator " __" \
    < "{dataset_path}/dev.tok.clean.fi" > \
    "{dataset_path}/dev.morfessor.fi"

In [None]:
# Segment the tokenized Finnish test set using the learned Morfessor model
# (the braces around `analysis` are doubled so that IPython passes them through literally)
!morfessor -l "{dataset_path}/model_fi.bin" -T - \
    --output-newlines --output-format "{{analysis}}  " --output-format-separator " __" \
    < "{dataset_path}/test.tok.clean.fi" > \
    "{dataset_path}/test.morfessor.fi"

## Additional Preprocessing using BPE

After the text has been linguistically preprocessed with Morfessor, Byte Pair Encoding (BPE) is applied as an additional preprocessing step to make the model more robust.
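
The principle behind learning BPE merges can be sketched in a few lines. This is a toy illustration of the merge procedure, not the `subword-nmt` implementation, which differs in details such as how word boundaries are marked; the function names and the example word counts are made up for illustration.

```python
import collections


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = collections.Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(pair, symbols):
    """Replace every occurrence of the symbol pair with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency table."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair; first one wins ties
        vocab = {merge_pair(best, symbols): freq for symbols, freq in vocab.items()}
        merges.append(best)
    return merges


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
```

Each learned merge turns a frequent pair of symbols into a single symbol, so frequent words end up as single tokens while rare words are split into smaller, reusable units.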

In [None]:
set_name = "train"

# Learn a joint BPE model and vocabulary
!subword-nmt learn-joint-bpe-and-vocab \
    --input {dataset_path}/{set_name}.morfessor.fi {dataset_path}/{set_name}.tok.clean.en -s 32000 \
    -o {dataset_path}/bpe.codes --write-vocabulary {dataset_path}/vocab.fi {dataset_path}/vocab.en

In [None]:
set_name = "train"

# Apply the learned BPE model and vocabulary
!subword-nmt apply-bpe -c {dataset_path}/bpe.codes \
    --vocabulary {dataset_path}/vocab.fi < {dataset_path}/{set_name}.morfessor.fi > {dataset_path}/{set_name}.bpe.fi
!subword-nmt apply-bpe -c {dataset_path}/bpe.codes \
    --vocabulary {dataset_path}/vocab.en < {dataset_path}/{set_name}.tok.clean.en > {dataset_path}/{set_name}.bpe.en

## Creating the Dataset using Fairseq

In [None]:
# Create Dataset using BPE Data
!fairseq-preprocess --source-lang fi --target-lang en \
    --trainpref {dataset_path}/train.bpe --validpref {dataset_path}/dev.bpe --testpref {dataset_path}/test.bpe \
    --destdir {data_bin_path} --workers 20

## Training the Model

In [None]:
# Prefix the command with CUDA_VISIBLE_DEVICES=<indices of the GPUs to use> if some GPUs are occupied

# Train the model using the BPE dataset
# (the braces of the JSON argument are doubled so that IPython expands only the path variables)
!fairseq-train {data_bin_path} \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-interval 1 \
    --keep-last-epochs 5 --log-format simple --log-interval 100 \
    --tensorboard-logdir {logs_path} \
    --save-dir {checkpoints_path} \
    --task translation \
    --eval-bleu \
    --eval-bleu-args '{{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-epoch 20
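
The training command combines `--lr 5e-4`, `--lr-scheduler inverse_sqrt` and `--warmup-updates 4000`. The sketch below shows the shape of such a schedule: a linear warmup to the peak learning rate, followed by a decay proportional to the inverse square root of the update number. It is an illustration under the assumption that the warmup starts at 0, and the function name is hypothetical; the exact fairseq implementation may differ in details.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate after `step` updates: linear warmup, then 1/sqrt(step) decay."""
    if step < warmup_updates:
        # Linear increase from warmup_init_lr to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # Decay proportional to the inverse square root of the update number.
    return peak_lr * math.sqrt(warmup_updates / step)
```

With these values the learning rate reaches its peak at update 4000 and has halved again by update 16000, because the update count has quadrupled.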

## Generating Translations

In [None]:
# Generate translations using trained model
!fairseq-generate {data_bin_path} \
    --path {checkpoints_path}/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe \
    --scoring sacrebleu --sacrebleu \
    > {base_path}/translations_sacrebleu.txt

## Evaluation

In [None]:
output_file = f'{base_path}/translations_sacrebleu.txt'  # File generated by fairseq-generate
reordered_output_file = f'{base_path}/reordered_output.txt'  # File to save the reordered translations

# Read the output file and extract the hypotheses.
# Hypothesis lines have the form "H-<index>\t<score>\t<translation>".
translations = {}
with open(output_file, 'r', encoding='utf-8') as f:
    for line in f:
        if line.startswith('H-'):
            parts = line.rstrip('\n').split('\t')
            index = int(parts[0].split('-')[1])
            translations[index] = parts[2].strip()

# fairseq-generate does not necessarily emit sentences in their original order,
# so sort the translations by sentence index before saving them
with open(reordered_output_file, 'w', encoding='utf-8') as f:
    for i in sorted(translations):
        f.write(translations[i] + '\n')

In [None]:
# Reverse the truecasing of the reference test set (English)
!mosesdecoder/scripts/recaser/detruecase.perl \
< {dataset_path}/test.tok.clean.en \
> {evaluation_folder}/test.tok.en

# Reverse the truecasing of the source test set (Finnish)
!mosesdecoder/scripts/recaser/detruecase.perl \
< {dataset_path}/test.tok.clean.fi \
> {evaluation_folder}/test.tok.fi

# Reverse the tokenization of the reference test set (English)
!mosesdecoder/scripts/tokenizer/detokenizer.perl -l en \
< {evaluation_folder}/test.tok.en \
> {evaluation_folder}/test.detok.en

# Reverse the tokenization of the source test set (Finnish)
!mosesdecoder/scripts/tokenizer/detokenizer.perl -l fi \
< {evaluation_folder}/test.tok.fi \
> {evaluation_folder}/test.detok.fi

# Reverse the punctuation normalization of the reference test set (English)
!mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -r \
< {evaluation_folder}/test.detok.en \
> {evaluation_folder}/test.en

# Reverse the punctuation normalization of the source test set (Finnish)
!mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -r \
< {evaluation_folder}/test.detok.fi \
> {evaluation_folder}/test.fi

# Reverse the truecasing of the hypothesis translations (English)
!mosesdecoder/scripts/recaser/detruecase.perl \
< {base_path}/reordered_output.txt \
> {evaluation_folder}/reordered_output.truecase.txt

# Reverse the tokenization of the hypothesis translations (English)
!mosesdecoder/scripts/tokenizer/detokenizer.perl -l en \
< {evaluation_folder}/reordered_output.truecase.txt \
> {evaluation_folder}/reordered_output.detok.txt

# Reverse the punctuation normalization of the hypothesis translations (English)
!mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -r \
< {evaluation_folder}/reordered_output.detok.txt \
> {evaluation_folder}/reordered_output.txt


In [None]:
!sacrebleu {evaluation_folder}/test.en < {evaluation_folder}/reordered_output.txt > {evaluation_folder}/sacrebleu_score.txt

In [None]:
!comet-score -t {evaluation_folder}/reordered_output.txt -r {evaluation_folder}/test.en -s {evaluation_folder}/test.fi > {evaluation_folder}/comet_score.txt