[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Klabauterkerl/finnish-chopper/blob/main/fairseq_morfessor.ipynb)

In [None]:
# Install fairseq and other dependencies
%pip install sacrebleu sentencepiece
%pip install tensorboardX
%pip install subword-nmt
%pip install sacremoses
# Pin PyTorch: otherwise fairseq-generate fails because of the PyTorch version preinstalled on Colab
%pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
%pip install fairseq

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Define paths for the mounted Google Drive
base_path = "/content/drive/MyDrive/translation_model"
dataset_path = f"{base_path}/dataset"
data_bin_path = f"{base_path}/data-bin"
checkpoints_path = f"{base_path}/checkpoints"
logs_path = f"{base_path}/logs"

!mkdir -p "{dataset_path}"
!mkdir -p "{data_bin_path}"
!mkdir -p "{checkpoints_path}"
!mkdir -p "{logs_path}"

In [None]:
# Alternative to the Google Drive cell above: define paths when running locally
base_path = "data"
dataset_path = f"{base_path}/dataset"
data_bin_path = f"{base_path}/data-bin"
checkpoints_path = f"{base_path}/checkpoints"
logs_path = f"{base_path}/logs"

!mkdir -p "{dataset_path}"
!mkdir -p "{data_bin_path}"
!mkdir -p "{checkpoints_path}"
!mkdir -p "{logs_path}"

In [None]:
# Download and extract dataset
!wget -P "{dataset_path}" https://www.statmt.org/europarl/v9/training/europarl-v9.fi-en.tsv.gz
!gunzip "{dataset_path}/europarl-v9.fi-en.tsv.gz"

In [None]:
# Split dataset into two files, each containing one column of the original dataset
!cut -f1 {dataset_path}/europarl-v9.fi-en.tsv > {dataset_path}/europarl-v9.fi
!cut -f2 {dataset_path}/europarl-v9.fi-en.tsv > {dataset_path}/europarl-v9.en
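
The later cells read `train`, `dev` and `test` files, but the notebook does not show how they were derived from the two column files. The following is a minimal sketch of one way to do it, assuming a seeded random split; the function name, the split sizes and the seed are hypothetical and not taken from the original workflow. Pairs with an empty side are dropped so that the two files stay aligned.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=1):
    """Split aligned sentence pairs into train/dev/test (illustrative only)."""
    # Keep the two sides together so the alignment survives shuffling.
    pairs = [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]
    # Drop pairs where either side is empty.
    pairs = [(s, t) for s, t in pairs if s and t]
    random.Random(seed).shuffle(pairs)
    return {
        "dev": pairs[:dev_size],
        "test": pairs[dev_size:dev_size + test_size],
        "train": pairs[dev_size + test_size:],
    }


def write_split(pairs, prefix):
    """Write one split to <prefix>.fi and <prefix>.en."""
    with open(f"{prefix}.fi", "w", encoding="utf-8") as fi, \
         open(f"{prefix}.en", "w", encoding="utf-8") as en:
        for src, tgt in pairs:
            fi.write(src + "\n")
            en.write(tgt + "\n")
```

Reading `europarl-v9.fi` and `europarl-v9.en` line by line, passing them to `split_parallel`, and calling `write_split(splits[name], f"{dataset_path}/{name}")` for each split would produce the file names the following cells expect.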

## Tokenization & Normalization

Before a Morfessor model is trained on the dataset, the text has to be preprocessed so that the model sees consistent tokens.

Preprocessing steps taken:
- Tokenization
- Normalization
- Truecasing
- Corpus Cleaning (no sentences longer than 50 words)
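
To make the truecasing step in the list above more concrete, here is a strongly simplified sketch of the idea: learn the most frequent casing of each word from positions that are not sentence-initial, then rewrite only the first token of each sentence. This is not the Moses implementation, which handles more cases; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent surface form of each word (simplified)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its capitalisation is positional, not lexical.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(sentence, model):
    """Replace the sentence-initial token with its learned casing, if known."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


corpus = [
    "The house is big .",
    "I like the house .",
    "We met Anna in Helsinki .",
    "Anna likes the house .",
]
model = train_truecaser(corpus)
print(truecase("The house is big .", model))
print(truecase("Anna likes the house .", model))
```

The effect is that function words such as "The" are lowercased at the start of a sentence while proper nouns keep their capital letter, which reduces the number of distinct surface forms the later models have to learn.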

In [None]:
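# Fetch the Moses scripts used below if they are not already present
!test -d mosesdecoder || git clone --depth 1 https://github.com/moses-smt/mosesdecoder.git

# Train truecaser model for Finnish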
!mosesdecoder/scripts/recaser/train-truecaser.perl \
-corpus {dataset_path}/train.fi \
-model {dataset_path}/truecase-model.fi

# Train truecaser model for English
!mosesdecoder/scripts/recaser/train-truecaser.perl \
-corpus {dataset_path}/train.en \
-model {dataset_path}/truecase-model.en

In [None]:
set_name = "train"  # rerun this cell with "dev" and "test"

# Normalize punctuation and tokenize Finnish text
!cat {dataset_path}/{set_name}.fi | \
mosesdecoder/scripts/tokenizer/normalize-punctuation.perl fi | \
mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -no-escape -l fi \
> {dataset_path}/{set_name}.tok.fi

# Normalize punctuation and tokenize English text
!cat {dataset_path}/{set_name}.en | \
mosesdecoder/scripts/tokenizer/normalize-punctuation.perl en | \
mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -no-escape -l en \
> {dataset_path}/{set_name}.tok.en

# Truecase the tokenized Finnish text
!mosesdecoder/scripts/recaser/truecase.perl \
-model {dataset_path}/truecase-model.fi \
< {dataset_path}/{set_name}.tok.fi \
> {dataset_path}/{set_name}.tok.truecase.fi

# Truecase the tokenized English text
!mosesdecoder/scripts/recaser/truecase.perl \
-model {dataset_path}/truecase-model.en \
< {dataset_path}/{set_name}.tok.en \
> {dataset_path}/{set_name}.tok.truecase.en

# Clean the corpus
!perl mosesdecoder/scripts/training/clean-corpus-n.perl \
{dataset_path}/{set_name}.tok.truecase en fi \
{dataset_path}/{set_name}.tok.clean 1 50
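
The cleaning step above keeps only sentence pairs in which both sides have between 1 and 50 tokens. A rough Python equivalent of that filter is sketched below. The function name is hypothetical, and the length-ratio limit of 9 is an assumption about the script's default rather than something stated in this notebook.

```python
def keep_pair(src, tgt, min_len=1, max_len=50, max_ratio=9.0):
    """Return True if a sentence pair passes the length filter (approximation)."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Empty sides are always rejected, which also avoids division by zero.
    if n_src == 0 or n_tgt == 0:
        return False
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Assumed ratio check: reject pairs whose lengths differ too much.
    return n_src / n_tgt <= max_ratio and n_tgt / n_src <= max_ratio


print(keep_pair("tämä on testi .", "this is a test ."))
print(keep_pair(" ".join(["sana"] * 51), "one very long sentence"))
```

Filtering both sides together matters: removing a line from only one file would shift every following sentence pair out of alignment.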

## Morfessor preprocessing

With Morfessor, the Finnish words are preprocessed by splitting off their suffixes and inserting delimiters between the resulting segments.

Finnish suffixes can encode the following information:
- Case
- Number
- Person
- Tense & Mood
- Possession
- Derivation
- Comparative and superlative forms

The delimiter chosen here is "|", as it is seldom used in Finnish text.
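
As a small illustration of the delimiter convention, the sketch below joins the segments of a word with " |" and reverses the operation again. The segmentation of "talossani" is written by hand for illustration and is not actual Morfessor output; the function names are hypothetical.

```python
SEPARATOR = " |"  # same separator the segmentation command uses


def format_word(morphs):
    """Join the segments of one word, marking every continuation with '|'."""
    return SEPARATOR.join(morphs)


def desegment(line):
    """Undo the segmentation by removing separators and extra whitespace."""
    return " ".join(line.replace(SEPARATOR, "").split())


# Hand-written example: talo (house) + ssa (in) + ni (my)
segmented = format_word(["talo", "ssa", "ni"])
print(segmented)
print(desegment(segmented + "  on  suuri  "))
```

One limitation of this simple reversal is that a literal "|" token in the original text would also be removed, which is another reason to pick a delimiter that rarely occurs in the corpus.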

In [None]:
# Install Morfessor for the morphological segmentation
%pip install morfessor

# Create Morfessor directory and set its path
morfessor_path = f"{dataset_path}/morfessor"
!mkdir -p "{morfessor_path}"

# Learn Morfessor model from tokenized data
!morfessor -t "{dataset_path}/train.tok.clean.fi" -s "{morfessor_path}/model_fi.bin"

In [None]:
set_name = "train"  # rerun this cell with "dev" and "test"

# Segment Finnish tokenized data using learned Morfessor model.
# The doubled braces keep IPython from treating {analysis} as a Python variable.
!morfessor -l "{morfessor_path}/model_fi.bin" -T - \
    --output-newlines --output-format "{{analysis}}  " --output-format-separator " |" \
    < "{dataset_path}/{set_name}.tok.clean.fi" > \
    "{morfessor_path}/{set_name}.morfessor.fi"

## Additional Preprocessing using BPE

After the text has been linguistically segmented with Morfessor, Byte Pair Encoding (BPE) is applied as an additional preprocessing step to make the model more robust to rare and unseen words.
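
For readers unfamiliar with BPE, the following sketch shows the core merge loop on a toy vocabulary: count adjacent symbol pairs, merge the most frequent pair, and repeat. It only illustrates the idea and is not the `subword-nmt` implementation; the toy word frequencies and function names are made up.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    merged = "".join(pair)
    # Match the pair only at symbol boundaries (whitespace or string edge).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the learned merge operations and the final vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are written as space-separated symbols with an end-of-word marker.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(toy_vocab, num_merges=3)
print(merges)
```

In the notebook the number of merge operations is set to 32000 with `-s`, and the codes are learned jointly on the Morfessor-segmented Finnish side and the English side.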

In [None]:
set_name = "train"

# Learn a joint BPE model and vocabulary
!subword-nmt learn-joint-bpe-and-vocab \
    --input {morfessor_path}/{set_name}.morfessor.fi {dataset_path}/{set_name}.tok.clean.en -s 32000 \
    -o {morfessor_path}/bpe.codes --write-vocabulary {morfessor_path}/vocab.fi {morfessor_path}/vocab.en

In [None]:
set_name = "dev"  # rerun this cell with "train" and "test"

# Apply the learned BPE model and vocabulary
!subword-nmt apply-bpe -c {morfessor_path}/bpe.codes \
    --vocabulary {morfessor_path}/vocab.fi < {morfessor_path}/{set_name}.morfessor.fi > {morfessor_path}/{set_name}.bpe.fi
!subword-nmt apply-bpe -c {morfessor_path}/bpe.codes \
    --vocabulary {morfessor_path}/vocab.en < {dataset_path}/{set_name}.tok.clean.en > {morfessor_path}/{set_name}.bpe.en

In [None]:
# Create the binarized dataset from the BPE files written to the Morfessor directory
!fairseq-preprocess --source-lang fi --target-lang en \
    --trainpref {morfessor_path}/train.bpe --validpref {morfessor_path}/dev.bpe --testpref {morfessor_path}/test.bpe \
    --destdir {data_bin_path} --workers 20

In [None]:
# Train Model using BPE Dataset
!fairseq-train {data_bin_path} \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8192 \
    --save-interval 1 \
    --keep-last-epochs 5 --log-format simple --log-interval 100 \
    --tensorboard-logdir {logs_path} \
    --save-dir {checkpoints_path} \
    --amp \
    --eval-bleu \
    --eval-bleu-args '{{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
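
The combination of `--lr 5e-4`, `--lr-scheduler inverse_sqrt` and `--warmup-updates 4000` determines how the learning rate evolves during training. The sketch below reproduces the schedule as it is commonly described: a linear warmup to the peak rate, followed by decay proportional to the inverse square root of the update number. The function is a hypothetical re-implementation for illustration, and the initial warmup rate of 0 is an assumption about fairseq's default.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update under an inverse square root schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay with the inverse square root of the update number.
    return peak_lr * (warmup_updates / step) ** 0.5


for step in (0, 2000, 4000, 16000, 64000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate peaks at update 4000 and has halved again by update 16000, which is worth keeping in mind when interpreting the TensorBoard curves written to `{logs_path}`.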

In [None]:
# Generate translations with the best checkpoint and store them where the BLEU cell reads them
!fairseq-generate "{data_bin_path}" \
    --path "{checkpoints_path}/checkpoint_best.pt" \
    --batch-size 128 --beam 5 --remove-bpe \
    > "{base_path}/translations.txt"

In [None]:
# Compute BLEU score
!grep ^H {base_path}/translations.txt | cut -f3- > {base_path}/hyp.txt
!grep ^T {base_path}/translations.txt | cut -f2- > {base_path}/ref.txt
!mosesdecoder/scripts/generic/multi-bleu.perl {base_path}/ref.txt < {base_path}/hyp.txt
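
The `grep` and `cut` commands above rely on the layout of the `fairseq-generate` output, where `T-<id>` lines carry the reference in the second tab-separated field and `H-<id>` lines carry a score followed by the hypothesis. A Python sketch of the same extraction is shown below; it additionally pairs hypotheses and references by sentence id. The function name and the sample text are hypothetical.

```python
def parse_generate_output(text):
    """Extract hypotheses and references, aligned by sentence id."""
    refs, hyps = {}, {}
    for line in text.splitlines():
        fields = line.split("\t")
        tag, _, idx = fields[0].partition("-")
        if not idx.isdigit():
            continue  # log lines and other output
        if tag == "T" and len(fields) >= 2:
            refs[int(idx)] = fields[1]
        elif tag == "H" and len(fields) >= 3:
            hyps[int(idx)] = fields[2]  # fields[1] is the model score
    ids = sorted(refs.keys() & hyps.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]


sample = (
    "S-1\tterve maailma\nT-1\thello world\nH-1\t-0.2\thello world\n"
    "D-1\t-0.2\thello world\n"
    "S-0\tkiitos\nT-0\tthank you\nH-0\t-0.5\tthanks\n"
    "2024 | INFO | done\n"
)
hyps, refs = parse_generate_output(sample)
print(hyps)
print(refs)
```

The two lists could also be scored with `sacrebleu`, which the first cell already installs, instead of `multi-bleu.perl`.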