In this notebook we test whether backtranslation is useful for morphological inflection. We use the prepared data from SIGMORPHON 2018 Task 1. We first walk through and explain the whole process on the small Spanish training set (100 examples), and then repeat it for the three training sizes (100, 500 and 1000) and the two languages (Spanish and Basque).

The process will be:

- Train and evaluate the first inflection model.

- Train and evaluate the tagger model.

- Use the tagger to generate new tagged data, and append it to the original data.

- Train and evaluate the second inflection model, using the new training data.


In [None]:
# %%bash
# #git clone https://github.com/pytorch/fairseq.git #we just need to do it the first time
# cd fairseq
# pip install --editable ./

The Transformer architecture used for the models is provided by the sequence modeling toolkit [**Fairseq**](https://ai.facebook.com/tools/fairseq/). Installing it from GitHub produced errors, so we use the pip release instead.

**The runtime must be restarted after this step.**

In [1]:
!pip install fairseq

Collecting fairseq
  Downloading fairseq-0.10.2-cp37-cp37m-manylinux1_x86_64.whl (1.7 MB)
Collecting hydra-core
  Downloading hydra_core-1.1.1-py3-none-any.whl (145 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting portalocker
  Downloading portalocker-2.3.2-py2.py3-none-any.whl (15 kB)
Collecting antlr4-python3-runtime==4.8
  Downloading antlr4-python3-runtime-4.8.tar.gz (112 kB)
Collecting omegaconf==2.1.*
  Downloading omegaconf-2.1.1-py3-none-any.whl (74 kB)
Collecting PyYAM

We use TensorBoard to log and compare the training losses of the models.

In [2]:
!pip install tensorboardX

Collecting tensorboardX
  Downloading tensorboardX-2.4.1-py2.py3-none-any.whl (124 kB)
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.4.1


In [3]:
%reload_ext tensorboard
import os
import fairseq
import pandas as pd
import torch
import tensorflow as tf
import tensorboardX

In [4]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/Colab Notebooks/backtranslation/"

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/backtranslation


The next function post-processes the output produced when we evaluate a model. During testing, the model generates predictions for the test set.

Because evaluation uses the official SIGMORPHON 2018 script, the generated output has to be converted into the format that script expects.

In [5]:
def reformat(lemmas, inflecteds, path):
    """Write predictions as tab-separated rows: lemma, inflected form, tags."""
    # Each lemma line holds a spaced-out lemma and its tags, separated by "> ".
    tags = [lemma.split("> ")[1] for lemma in lemmas]
    lemmas = [lemma.split("> ")[0] for lemma in lemmas]
    # The evaluation format joins tags with ";".
    tags = [tag.replace(" ", ";") for tag in tags]
    # Join the characters and drop the leading "<".
    lemmas = [lemma.replace(" ", "")[1:] for lemma in lemmas]

    # Join the characters, drop the "<" and ">" markers, and turn "#" back into spaces.
    inflecteds = [inflected.replace(" ", "")[1:-1] for inflected in inflecteds]
    inflecteds = [inflected.replace("#", " ") for inflected in inflecteds]

    rows = list(zip(lemmas, inflecteds, tags))
    pd.DataFrame(rows).to_csv(path, sep="\t", header=False, index=False)
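To make the conversion concrete, here is a minimal sketch that applies the same string operations to a single pair of lines. The helper name `reformat_pair` is hypothetical, and the hypothesis string is made up for illustration; only the source-line layout is taken from the outputs shown in this notebook.

```python
def reformat_pair(source_line, hypothesis_line):
    """Convert one character-level (source, hypothesis) pair to (lemma, inflected, tags)."""
    lemma_part, tag_part = source_line.split("> ")
    lemma = lemma_part.replace(" ", "")[1:]             # join characters, drop leading "<"
    tags = tag_part.replace(" ", ";")                    # tags are joined with ";"
    inflected = hypothesis_line.replace(" ", "")[1:-1]   # drop the "<" and ">" markers
    inflected = inflected.replace("#", " ")              # "#" encodes a space
    return lemma, inflected, tags


# Illustrative input: the hypothesis below is an assumed model output.
row = reformat_pair("< m a r e a r > V IND FUT 1 PL", "< m a r e a r e m o s >")
print("\t".join(row))
```

The printed row has the three tab-separated columns that `reformat` writes for every example.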

# Preprocessing the data
To train a model, we first need to preprocess the data. The next script takes the data for one language, binarizes it, and builds the vocabulary for the model. Its inputs are the training data of the chosen size, the validation data, and the test data.
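The prepared files are character-level: judging from `reformat` and the generated outputs later in the notebook, a source line looks like `< m a r e a r > V IND FUT 1 PL` and a target line is the inflected form spelled out between `<` and `>`. The preparation script itself is not shown here, so the following is only a sketch of how such lines could be produced from a SIGMORPHON triple; the function name and the use of `#` for spaces inside the lemma are assumptions.

```python
def to_character_level(lemma, inflected, tags):
    """Turn a (lemma, inflected, tags) triple into character-level source/target lines."""

    def spell(word):
        # Assumption: spaces inside a form are encoded as "#", as reformat() expects
        # for inflected forms.
        return "< " + " ".join(word.replace(" ", "#")) + " >"

    source = spell(lemma) + " " + " ".join(tags.split(";"))
    target = spell(inflected)
    return source, target


source, target = to_character_level("marear", "marearemos", "V;IND;FUT;1;PL")
print(source)
print(target)
```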

In [None]:
#!bash ./scripts/preprocess.sh es low 

2022-01-31 16:55:43 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.lemma_tag', srcdict=None, target_lang='es.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tok

# Inflection model

We train and evaluate the inflection model. This model will take word lemmas and tags as input, and will output the inflected form.

## Training
The model is a Transformer as implemented in Fairseq; its parameters are listed in Appendix A of the

In [None]:
# !bash ./scripts/train_small.sh es lemma_tag inflected

In [None]:
# %tensorboard --logdir logs/es/low/inflected

## Evaluation
To evaluate the model we use the `evalm.py` script from SIGMORPHON 2018 Task 1. Given a file of generated data in the correct format, the script compares it against the gold-standard file and reports the accuracy (the percentage of correctly predicted inflected forms) and the average Levenshtein distance.

The evaluation is done on the test set. First, the model generates the predicted inflected forms from the lemmas and tags.
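The two metrics can be sketched as follows. This is not the code of `evalm.py`, only a minimal illustration of accuracy and average Levenshtein distance as described above; the function names and the toy inputs are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def evaluate(guesses, golds):
    """Return (accuracy in percent, average Levenshtein distance)."""
    pairs = list(zip(guesses, golds))
    correct = sum(guess == gold for guess, gold in pairs)
    total_distance = sum(levenshtein(guess, gold) for guess, gold in pairs)
    return 100.0 * correct / len(pairs), total_distance / len(pairs)


# Toy example: one exact match and one prediction that is off by one character.
accuracy, distance = evaluate(["canto", "cantas"], ["canto", "canta"])
print(f"accuracy: {accuracy:.2f}  levenshtein: {distance:.2f}")
```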

In [None]:
# !bash ./scripts/generate.sh es low lemma_tag inflected

We need to reformat the output before passing it to the script.




In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/low/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/low/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/low/inflected/predicted.txt")

We then run the evaluation script from SIGMORPHON 2018.

In [None]:
# !python ./scripts/evalm.py --guess generated/es/low/inflected/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

# Tagger

We train and evaluate the tagger model. This model will be the inverse of the inflection model; it takes inflected words as input, and produces lemmas and tags.

## Training

In [None]:
# !bash ./scripts/train_small.sh es inflected lemma_tag

In [None]:
# %tensorboard --logdir logs/es/low/lemma_tag

## Evaluation

In [None]:
# !bash ./scripts/generate.sh es low inflected lemma_tag



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/low/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/es/low/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/low/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/low/lemma_tag/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	9.30
levenshtein:	9.31


# Generation of new examples

Here we apply the backtranslation step. We use only the inflected forms of the generation dataset (5000 examples). First, we keep only the examples that do not appear in the original training set.

In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "low"

# path_gen_inflected = os.path.join(path_gen, "gen.es.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.es.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.es.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# deprocess = []
# for gen_inf in gen_inflected:
#     if gen_inf not in inflected:
#         deprocess.append(gen_inf)

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)
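The filtering step above can also be written with a set, which avoids scanning the whole training list for every candidate. This is a sketch of the same idea, not a replacement for the cell; `filter_unseen` is a hypothetical helper name and the example forms are illustrative.

```python
def filter_unseen(candidates, training_forms):
    """Keep the candidate forms that do not occur in the training data, in order."""
    seen = set(training_forms)  # set membership is O(1) on average
    return [form for form in candidates if form not in seen]


train = ["< c a n t o >", "< c a n t a s >"]
generated = ["< c a n t o >", "< c a n t a m o s >", "< c a n t a n >"]
print(filter_unseen(generated, train))
```

Like the loop in the cell, this keeps the original order and does not remove duplicates among the candidates themselves.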

Next, we use the tagger to predict the lemma and tags for these new examples. We first preprocess the data; since we only have the inflected forms, a separate preprocessing script is needed.

In [None]:
# !bash ./scripts/preprocess_bt.sh es low

2022-01-31 17:51:05 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.inflected', srcdict='data-bin/low/dict.es.inflected.txt', target_lang='es.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/low/bt', tgtdict=None, threshold_loss_scale=None, 

We generate the lemmas and the tags for the inflections.

In [None]:
# !bash ./scripts/bt.sh es low
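Once the tagger has produced a lemma and tags for each new inflected form, the synthetic pairs are appended to the original training data. The scripts that do this are not shown in the notebook, so the following is only a sketch under stated assumptions: the function name is hypothetical, and dropping tagger outputs that lack the `> ` separator is an extra safeguard added here, not something the source confirms.

```python
def augment_training_data(train_sources, train_targets, bt_sources, bt_targets):
    """Append backtranslated pairs to the original training pairs.

    bt_sources: lemma + tags predicted by the tagger (synthetic inflection inputs).
    bt_targets: the real inflected forms that were fed to the tagger.
    """
    if len(bt_sources) != len(bt_targets):
        raise ValueError("each predicted source needs exactly one inflected form")
    sources, targets = list(train_sources), list(train_targets)
    for source, target in zip(bt_sources, bt_targets):
        # Assumption: skip predictions without the "> " lemma/tag separator.
        if "> " in source:
            sources.append(source)
            targets.append(target)
    return sources, targets


sources, targets = augment_training_data(
    ["< c a n t a r > V IND PRS 1 SG"], ["< c a n t o >"],
    ["< m a r e a r > V IND FUT 1 PL", "malformed"],
    ["< m a r e a r e m o s >", "< x >"],
)
print(len(sources), len(targets))
```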

# New inflection model

Now we repeat exactly the process used for the first inflection model: we preprocess the new data, train the new inflection model, and evaluate it.

## Preprocessing

In [None]:
# !bash ./scripts/preprocess_new.sh es low 

2022-01-31 17:59:34 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low/new', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.lemma_tag', srcdict=None, target_lang='es.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0,

## Training

In [None]:
# !bash ./scripts/train_new.sh es low

In [None]:
# %tensorboard --logdir logs/es/low/inflected/bt

## Evaluation

In [None]:
# !bash ./scripts/generate_new.sh es low



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/low/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/low/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/low/inflected/bt/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/low/inflected/bt/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	34.00
levenshtein:	1.72


The process followed so far in the notebook is now repeated for the other training configurations (language and size):

# More tests
Spanish/Basque, low/med/high
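For reference, the shell steps that are repeated for each configuration can be listed programmatically. The script names and argument orders are the ones used in the cells of this notebook; the helper `pipeline_commands` is hypothetical, and the training script is passed in explicitly because only `train_small.sh` and `train_med.sh` appear here.

```python
def pipeline_commands(lang, size, train_script):
    """List the shell commands run for one language and training size, in order."""
    return [
        f"bash ./scripts/preprocess.sh {lang} {size}",
        f"bash ./scripts/{train_script} {lang} lemma_tag inflected",     # first inflection model
        f"bash ./scripts/generate.sh {lang} {size} lemma_tag inflected",
        f"bash ./scripts/{train_script} {lang} inflected lemma_tag",     # tagger
        f"bash ./scripts/generate.sh {lang} {size} inflected lemma_tag",
        f"bash ./scripts/preprocess_bt.sh {lang} {size}",
        f"bash ./scripts/bt.sh {lang} {size}",                           # tag the new forms
        f"bash ./scripts/preprocess_new.sh {lang} {size}",
        f"bash ./scripts/train_new.sh {lang} {size}",                    # second inflection model
        f"bash ./scripts/generate_new.sh {lang} {size}",
    ]


for command in pipeline_commands("eu", "low", "train_small.sh"):
    print(command)
```

The Python steps (filtering the generation data, `reformat`, and the `evalm.py` call) run between these commands, as in the Spanish walkthrough.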

## Basque low data (100 examples)

In [None]:
# !bash ./scripts/preprocess.sh eu low 

2022-01-31 18:35:15 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tok

In [None]:
# !bash ./scripts/train_small.sh eu lemma_tag inflected

Streaming output truncated to the last 5000 lines.
2022-01-31 18:44:28 | INFO | valid | epoch 165 | valid on 'valid' subset | loss 2.291 | nll_loss 1.344 | ppl 2.54 | wps 12316.8 | wpb 406.8 | bsz 31.2 | num_updates 660 | best_loss 2.291
2022-01-31 18:44:28 | INFO | fairseq_cli.train | begin sa

In [None]:
# %tensorboard --logdir logs/eu/low/inflected

In [None]:
# !bash ./scripts/generate.sh eu low lemma_tag inflected



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/low/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/low/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/low/inflected/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/low/inflected/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	19.30
levenshtein:	2.71


In [None]:
# !bash ./scripts/train_small.sh eu inflected lemma_tag

2022-01-31 19:54:36 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=32, batch_size_valid=32, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/low', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distributed_back

In [None]:
# %tensorboard --logdir logs/eu/low/lemma_tag

In [None]:
# !bash ./scripts/generate.sh eu low inflected lemma_tag



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/low/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/eu/low/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/low/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/low/lemma_tag/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	6.40
levenshtein:	9.29


In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "low"

# path_gen_inflected = os.path.join(path_gen, "gen.eu.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.eu.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.eu.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# deprocess = []
# for gen_inf in gen_inflected:
#     if gen_inf not in inflected:
#         deprocess.append(gen_inf)

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)

In [None]:
# !bash ./scripts/preprocess_bt.sh eu low

2022-01-31 19:56:58 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.inflected', srcdict='data-bin/low/dict.eu.inflected.txt', target_lang='eu.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/low/bt', tgtdict=None, threshold_loss_scale=None, 

In [None]:
# !bash ./scripts/bt.sh eu low

< e z a n > V ARGABS3 ARGABSPL ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOMASC PRES IND
< e z a n > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG ARGIO1 ARGIOSG HYP IND
< e k a r r i > V ARGABS3 ARGABSSG ARGERG2 ARGERGPL ARGIO3 ARGIOPL HYP POT
< j o a n > V ARGABS1 ARGABSPL ARGIO3 ARGIOSG PRES IND
< e g i n > V ARGABS3 ARGABSSG ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM PAST POT
< e u t s i > V ARGABS3 ARGABSSG ARGERG3 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM HYP IND
< j a r d u n > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG INFM FEM PRES IND
< e u t s i > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG ARGIO1 ARGIOPL PAST POT
< u k a n > V ARGABS3 ARGABSSG ARGERG3 ARGERGPL ARGIO3 ARGIOSG INFM FEM PRES POT
< u k a n > V ARGABS3 ARGABSSG ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM HYP IND
< e z a n > V ARGABS3 ARGABSSG ARGERG2 ARGERGPL ARGIO1 ARGIOSG IMP
< e d u n > V ARGABS3 ARGABSPL ARGERG3 ARGERGPL ARGIO1 ARGIOPL INFM FEM

In [None]:
# !bash ./scripts/preprocess_new.sh eu low 

2022-01-31 19:57:48 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/low/new', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0,

In [None]:
# !bash ./scripts/train_new.sh eu low

2022-01-31 19:57:51 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/low/new', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distribute

In [None]:
# %tensorboard --logdir logs/eu/low/inflected/bt

In [None]:
# !bash ./scripts/generate_new.sh eu low



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/low/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/low/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/low/inflected/bt/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/low/inflected/bt/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	25.20
levenshtein:	2.62


## Spanish med data (500 examples)

In [None]:
# !bash ./scripts/preprocess.sh es med

2022-01-31 20:10:27 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.lemma_tag', srcdict=None, target_lang='es.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tok

In [None]:
# !bash ./scripts/train_med.sh es lemma_tag inflected

Streaming output truncated to the last 5000 lines.
        (fc2): Linear(in_features=1024, out_features=256, bias=True)
        (final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=256, out_features=256, bias=T

In [None]:
# %tensorboard --logdir logs/es/med/inflected

In [None]:
# !bash ./scripts/generate.sh es med lemma_tag inflected



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/med/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/med/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/med/inflected/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/med/inflected/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	70.80
levenshtein:	0.62


In [None]:
# !bash ./scripts/train_med.sh es inflected lemma_tag

2022-01-31 20:32:16 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=128, batch_size_valid=128, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/med', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distributed_ba

In [None]:
# %tensorboard --logdir logs/es/med/lemma_tag

In [None]:
# !bash ./scripts/generate.sh es med inflected lemma_tag



In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/med/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/es/med/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/med/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/med/lemma_tag/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	60.10
levenshtein:	3.98


In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "med"

# path_gen_inflected = os.path.join(path_gen, "gen.es.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.es.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.es.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# deprocess = []
# for gen_inf in gen_inflected:
#     if gen_inf not in inflected:
#         deprocess.append(gen_inf)

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)

In [None]:
# !bash ./scripts/preprocess_bt.sh es med

2022-01-31 20:51:16 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.inflected', srcdict='data-bin/med/dict.es.inflected.txt', target_lang='es.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/med/bt', tgtdict=None, threshold_loss_scale=None, 

In [None]:
# !bash ./scripts/bt.sh es med

< m a r e a r > V IND FUT 1 PL
< e n l e n z a r > V COND 3 SG
< m a d r e a r s e > V COND 3 PL
< e n t a b l a r > V IND PRS 3 PL
< m a r c h a r s e > V SBJV PST 2 PL
< f o r n i c a r > V COND 2 PL
< g e r m a n i z a r > V SBJV FUT 2 SG
< r e m b o l s a r > V IND FUT 3 PL
< a b u n d a r > V IND PST 2 PL IPFV
< c o n f e s a r > V SBJV PRS 3 PL
< c o n s t r u i r > V IND PRS 1 PL
< g a r b a r > V COND 3 PL
< t a j a r > V IND FUT 2 SG
< r e e x p e d i r > V.PTCP PST MASC SG
< v i c t i m a r > V IND PST 2 PL IPFV
< l l o v i z n a r > V IND FUT 3 PL
< d o m i n a r > V SBJV PST 1 PL LGSPEC1
< a c o n s e j a r > V SBJV PST 2 PL
< t e n d e r s e > V IND PST 3 SG IPFV
< a g u j e r e a r > V IND PRS 3 PL
< a t e m o r i z a r > V SBJV PRS 2 PL
< s o c a v a r > V SBJV PST 3 PL LGSPEC1
< c o n t r a r r e s t a r > V SBJV FUT 2 PL
< e n s u c i a r > V POS IMP 1 PL
< e n l i s t a r > V POS IMP 2 PL
< v e h i

In [None]:
# !bash ./scripts/preprocess_new.sh es med 

2022-01-31 20:51:35 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med/new', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.lemma_tag', srcdict=None, target_lang='es.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0,

In [None]:
# !bash ./scripts/train_new.sh es med

2022-01-31 20:51:37 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/med/new', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distribute

In [None]:
# %tensorboard --logdir logs/es/med/inflected/bt

In [None]:
# !bash ./scripts/generate_new.sh es med

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/med/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/med/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/med/inflected/bt/predicted.txt")
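
The cell above reads the source file (`sen.txt`) and the hypothesis file (`hyp.txt`) separately and assumes they are line-aligned. A minimal sketch of a helper that makes this assumption explicit is given below; `read_parallel` is a hypothetical name that is not part of the notebook's scripts, and the length check is an added safeguard rather than something the original cell does.

```python
def read_parallel(source_path, hypothesis_path):
    """Read two line-aligned text files and return their stripped lines.

    Hypothetical helper: raises if the files do not have the same number
    of lines, since the pairs would then be misaligned.
    """
    with open(source_path, "r", encoding="utf-8") as f:
        sources = [line.strip() for line in f]
    with open(hypothesis_path, "r", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    if len(sources) != len(hypotheses):
        raise ValueError(
            f"line count mismatch: {len(sources)} sources vs "
            f"{len(hypotheses)} hypotheses"
        )
    return sources, hypotheses
```

With such a helper, the two `open` blocks could be replaced by a single call, for example `lemmas, inflecteds = read_parallel(sen_path, hyp_path)`, before passing the lists to `reformat`.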

In [None]:
# !python ./scripts/evalm.py --guess generated/es/med/inflected/bt/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	78.30
levenshtein:	0.47
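
The evaluation script reports two numbers: the percentage of predictions that match the gold form exactly, and an average Levenshtein (edit) distance between prediction and gold. The sketch below shows one way such metrics can be computed. It is an illustrative reimplementation, not the code of `evalm.py`, and the function names `levenshtein` and `score` are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(guesses, golds):
    """Return (exact-match accuracy in percent, mean edit distance)."""
    if len(guesses) != len(golds):
        raise ValueError("guesses and golds must have the same length")
    if not golds:
        raise ValueError("cannot score an empty test set")
    n = len(golds)
    correct = sum(g == t for g, t in zip(guesses, golds))
    total_distance = sum(levenshtein(g, t) for g, t in zip(guesses, golds))
    return 100.0 * correct / n, total_distance / n
```

Under this reading, a lower Levenshtein value means that wrong predictions are, on average, closer to the gold form.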


## Basque med data (500 examples)

In [None]:
# !bash ./scripts/preprocess.sh eu med 

2022-01-31 21:09:44 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tok

In [None]:
# !bash ./scripts/train_med.sh eu lemma_tag inflected

Streaming output truncated to the last 5000 lines.
epoch 049 | valid on 'valid' subset:  50% 4/8 [00:00<00:00, 17.17it/s]
epoch 049 | valid on 'valid' subset:  88% 7/8 [00:00<00:00, 19.14it/s]
2022-02-01 07:29:27 | INFO | valid | epoch 049 | valid on 'valid' subset | loss 3.546 | nll_loss 2.954 | ppl 7.75 | wps 34237.8 | wpb 1627 | bsz 125 | num_updates 196 | best_loss 3.546
2022-02-01 07:29:27 | INFO | fairseq_cli.train | begin save checkpoint
2022-02-01 07:29:29 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/eu/med/inflected/checkpoint_best.pt (epoch 49 @ 196 updates, score 3.546) (writing took 1.268678111999975 seconds)
2022-02-01 07:29:29 | INFO | fairseq_cli.train | end of epoch 49 (average epoch stats below)
2022-02-01 07:29:29 | INFO | train | epoch 049 | loss 3.737 | nll_loss 3.22 | ppl 9.32 | wps 2593.4 | ups 1.62 | wpb 1598.5 | bsz 125 | num_updates 196 | l

In [None]:
# %tensorboard --logdir logs/eu/med/inflected

In [None]:
# !bash ./scripts/generate.sh eu med lemma_tag inflected

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/med/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/med/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/med/inflected/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/med/inflected/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	81.30
levenshtein:	0.39


In [None]:
# !bash ./scripts/train_med.sh eu inflected lemma_tag

Streaming output truncated to the last 5000 lines.
        )
        (encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=256, out_features=1024, bias=True)
        (fc2): Linear(in_features=1024, out_features=256, bias=True)
        (final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (output_projection): Linear(in_features=256, out_features=64, bias=False)
  )
)
2022-02-01 07:50:09 | INFO | fairseq_cli.train | task: translation (TranslationTask)
2022-02-01 07:50:09 | INFO | fairseq_cli.train | model: transformer (TransformerModel)
2022-02-01 07:50:09 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2022-02-01 07:50:09 | INFO | fairseq_cli.train | num. model params: 7406592 (num. trained: 7406592)
2022-02-01 07:50:11 | INFO | fairseq.t

In [None]:
# %tensorboard --logdir logs/eu/med/lemma_tag

In [None]:
# !bash ./scripts/generate.sh eu med inflected lemma_tag

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/med/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/eu/med/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/med/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/med/lemma_tag/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	56.80
levenshtein:	4.05


In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "med"

# path_gen_inflected = os.path.join(path_gen, "gen.eu.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.eu.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.eu.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# seen = set(inflected)  # set lookup instead of scanning the list each time
# deprocess = [gen_inf for gen_inf in gen_inflected if gen_inf not in seen]

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)
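
The cell above keeps only the generated inflected forms that do not already occur in the training set, so that the tagger is applied to unseen words. A minimal sketch of the same filter as a reusable function follows; `select_unseen` is a hypothetical name, and, like the original loop, it preserves input order and does not remove duplicates within the generated list.

```python
def select_unseen(generated, training):
    """Return the items of `generated` that are absent from `training`.

    Order is preserved and repeated generated items are kept, mirroring
    the loop in the notebook cell.
    """
    seen = set(training)  # constant-time membership tests
    return [form for form in generated if form not in seen]
```

For example, `select_unseen(gen_inflected, inflected)` would give the list that the cell writes to the `bt.*.inflected` file.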

In [None]:
# !bash ./scripts/preprocess_bt.sh eu med

2022-02-01 08:11:22 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.inflected', srcdict='data-bin/med/dict.eu.inflected.txt', target_lang='eu.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/med/bt', tgtdict=None, threshold_loss_scale=None, 

In [None]:
# !bash ./scripts/bt.sh eu med

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size
< e z a n > V ARGABS2 ARGABSSG ARGABS2 ARGABSPL ARGERG1 ARGERGSG HYP IND
< e d u n > V ARGABS3 ARGABSPL ARGERG3 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOMASC PAST POT
< e r o a n > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG HYP POT
< a t x i k i > V ARGABS3 ARGABSSG ARGIO2 ARGIOSG PAST POT
< e t o r r i > V ARGABS1 ARGABSSG ARGIO3 ARGIOPL HYP IND
< e z a g u t u > V ARGABS3 ARGABSPL ARGERG3 ARGERGSG INFM MASC PAST IND
< e d u n > V ARGABS1 ARGABSPL ARGERG3 ARGERGPL PRES IND
< e d u n > V ARGABS3 ARGABSPL ARGERG3 ARGERGPL ARGIO1 ARGIOPL INFM FEM PRES POT
< e d u k i > V ARGABS3 ARGABSSG ARGERG3 ARGERGPL INFM FEM PAST IND
< u k a n > V ARGABS2 ARGABSSG ARGABS2 ARGABSPL ARGERG3 ARGERGPL PRES IND
< e r o a n > V ARGABS3 ARGABSSG ARGERG1 ARGERGSG INFM FEM PAST IND
< e r a b i l i > V ARGABS1 ARGABSPL ARGERG3 ARGERGSG INFM MASC PAST IND
< j a r r a i k i > V ARGABS1 ARGABSSG ARGIO3 ARGIOSG INFM MASC PRES IND
< e g i n > V ARGABS3 
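
Each line printed by the tagger consists of a character-tokenized lemma between `<` and `>` followed by the predicted morphological tags. The sketch below shows one way to turn such a line back into a lemma and a tag list. The format is inferred from the output shown here, `parse_tagged` is a hypothetical name, and how lemmas that contain spaces are encoded is not visible in this output, so that case is not handled.

```python
def parse_tagged(line):
    """Split a tagger output line into (lemma, tags).

    Assumes the layout '< c h a r s > TAG1 TAG2 ...' seen in the output.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "<" or ">" not in tokens:
        raise ValueError(f"unexpected tagger output: {line!r}")
    end = tokens.index(">")
    lemma = "".join(tokens[1:end])  # re-join the space-separated characters
    tags = tokens[end + 1:]
    return lemma, tags
```

Applied to one of the lines above, `parse_tagged("< e d u n > V ARGABS1 ARGABSPL ARGERG3 ARGERGPL PRES IND")` separates the lemma `edun` from its six tags.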

In [None]:
# !bash ./scripts/preprocess_new.sh eu med

2022-02-01 08:12:39 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/med/new', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0,

In [None]:
# !bash ./scripts/train_new.sh eu med

2022-02-01 08:12:47 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/med/new', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distribute

In [None]:
# %tensorboard --logdir logs/eu/med/inflected/bt

In [None]:
# !bash ./scripts/generate_new.sh eu med

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/med/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/med/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/med/inflected/bt/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/med/inflected/bt/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	79.50
levenshtein:	0.43


## Spanish high data (1000 examples)

In [None]:
# !bash ./scripts/preprocess.sh es high 

2022-02-01 08:42:19 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/high', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.lemma_tag', srcdict=None, target_lang='es.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, to

In [None]:
# !bash ./scripts/train_high.sh es lemma_tag inflected

Streaming output truncated to the last 5000 lines.
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=256, out_features=1024, bias=True)
        (fc2): Linear(in_features=1024, out_features=256, bias=True)
        (final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
      (2): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bi

In [None]:
# %tensorboard --logdir logs/es/high/inflected

In [None]:
# !bash ./scripts/generate.sh es high lemma_tag inflected

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/high/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/high/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/high/inflected/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/high/inflected/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	89.10
levenshtein:	0.25


In [None]:
# !bash ./scripts/train_high.sh es inflected lemma_tag

2022-02-01 09:04:36 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/high', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distributed_b

In [None]:
# %tensorboard --logdir logs/es/high/lemma_tag

In [None]:
# !bash ./scripts/generate.sh es high inflected lemma_tag

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/high/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/es/high/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/high/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/high/lemma_tag/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	68.10
levenshtein:	3.14


In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "high"

# path_gen_inflected = os.path.join(path_gen, "gen.es.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.es.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.es.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# seen = set(inflected)  # set lookup instead of scanning the list each time
# deprocess = [gen_inf for gen_inf in gen_inflected if gen_inf not in seen]

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)

In [None]:
# !bash ./scripts/preprocess_bt.sh es high

2022-02-01 09:26:20 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/high/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='es.inflected', srcdict='data-bin/high/dict.es.inflected.txt', target_lang='es.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/high/bt', tgtdict=None, threshold_loss_scale=Non

In [None]:
# !bash ./scripts/bt.sh es high

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size
< r e i t e r a r > V NEG IMP 3 PL
< m a c h u c a r > V IND PST 2 SG PFV
< p r e g u n t a r s e > V POS IMP 3 SG
< m a r c i r > V NEG IMP 2 PL
< j a q u e a r > V COND 1 PL
< d e c i r > V IND PRS 2 SG
< a t r a s a r > V SBJV PST 3 PL
< e s c a s e a r > V SBJV FUT 3 PL
< c o m p a r a r > V IND PST 1 SG IPFV
< a l e c h u g a r > V SBJV FUT 1 SG
< r e d e s c u b r i r > V COND 1 PL
< d e s e m p e ñ a r > V IND FUT 1 PL
< s u s c i t a r > V IND PRS 1 PL
< e x p a t r i a r > V SBJV PST 1 SG LGSPEC1
< e j e r c i t a r > V IND PRS 2 SG
< a g r e d i r > V.CVB PRS
< r e t r a d u c i r > V SBJV FUT 1 SG
< b o r d a r > V IND PRS 2 SG
< c l i q u e a r > V SBJV PST 2 SG LGSPEC1
< a l o c a r > V SBJV PST 3 PL LGSPEC1
< e x t r a p o l a r > V IND PRS 3 PL
< g r a t i n a r > V IND FUT 2 PL
< a v a n z a r > V NEG IMP 2 PL
< o f r e n d a r > V SBJV PST 1 PL
< a o j a r > V SBJV FUT 2 SG
< e d i t a r > V SBJV PS

In [None]:
# !bash ./scripts/preprocess_new.sh es high

In [None]:
# !bash ./scripts/train_new.sh es high

2022-02-01 09:37:25 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/high/new', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distribut

In [None]:
# %tensorboard --logdir logs/es/high/inflected/bt

In [None]:
# !bash ./scripts/generate_new.sh es high

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/es/high/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/es/high/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/es/high/inflected/bt/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/es/high/inflected/bt/predicted.txt --gold data/sigmorphon/spanish-test.txt --task 1

acccuracy:	74.60
levenshtein:	0.53


## Basque high data (1000 examples)

In [None]:
# !bash ./scripts/preprocess.sh eu high 

2022-02-01 10:14:01 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/high', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, to

In [None]:
# !bash ./scripts/train_high.sh eu lemma_tag inflected

2022-02-01 10:14:08 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/high', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distributed_b

In [None]:
# %tensorboard --logdir logs/eu/high/inflected

In [None]:
# !bash ./scripts/generate.sh eu high lemma_tag inflected

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/high/inflected/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/high/inflected/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/high/inflected/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/high/inflected/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	89.60
levenshtein:	0.20


In [None]:
# !bash ./scripts/train_high.sh eu inflected lemma_tag

Streaming output truncated to the last 5000 lines.

epoch 052 | valid on 'valid' subset:   0% 0/4 [00:00<?, ?it/s]
epoch 052 | valid on 'valid' subset:  25% 1/4 [00:02<00:06,  2.13s/it]
epoch 052 | valid on 'valid' subset:  75% 3/4 [00:02<00:00,  1.63it/s]
2022-02-01 10:43:23 | INFO | valid | epoch 052 | valid on 'valid' subset | loss 3.214 | nll_loss 2.566 | ppl 5.92 | wps 44668.6 | wpb 4146.2 | bsz 250 | num_updates 208 | best_loss 3.214
2022-02-01 10:43:23 | INFO | fairseq_cli.train | begin save checkpoint
2022-02-01 10:43:24 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/eu/high/lemma_tag/checkpoint_best.pt (epoch 52 @ 208 updates, score 3.214) (writing took 1.1772742699995433 seconds)
2022-02-01 10:43:24 | INFO | fairseq_cli.train | end of epoch 52 (average epoch stats below)
2022-02-01 10:43:24 | INFO | train | epoch 052 | loss 3.549 | nll_loss 3.034 | ppl 8

In [None]:
# %tensorboard --logdir logs/eu/high/lemma_tag

In [None]:
# !bash ./scripts/generate.sh eu high inflected lemma_tag

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/high/lemma_tag/sen.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]
# with open('generated/eu/high/lemma_tag/hyp.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/high/lemma_tag/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/high/lemma_tag/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	73.60
levenshtein:	2.35


In [None]:
# path_gen = "./data/prepared/gen"
# path_data = "./data/prepared"

# path_size = "high"

# path_gen_inflected = os.path.join(path_gen, "gen.eu.inflected")
# path_inflected = os.path.join(os.path.join(path_data, path_size), "train.eu.inflected")
# path_bt = os.path.join(path_gen, path_size)

# if not os.path.exists(path_bt):
#     os.makedirs(path_bt)

# path_bt_inflected = os.path.join(path_bt, "bt.eu.inflected")
# with open(path_inflected) as f:
#     inflected = [line.rstrip() for line in f]

# with open(path_gen_inflected) as f:
#     gen_inflected = [line.rstrip() for line in f]

# seen = set(inflected)  # set lookup instead of scanning the list each time
# deprocess = [gen_inf for gen_inf in gen_inflected if gen_inf not in seen]

# with open(path_bt_inflected, 'w') as f:
#     for item in deprocess:
#         f.write("%s\n" % item)

In [None]:
# !bash ./scripts/preprocess_bt.sh eu high

2022-02-01 11:19:32 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/high/bt', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.inflected', srcdict='data-bin/high/dict.eu.inflected.txt', target_lang='eu.lemma_tag', task='translation', tensorboard_logdir=None, testpref='./data/prepared/gen/high/bt', tgtdict=None, threshold_loss_scale=Non

In [None]:
# !bash ./scripts/bt.sh eu high

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size
< e z a n > V ARGABS3 ARGABSPL ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOMASC PRES IND
< e z a n > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG ARGIO1 ARGIOSG HYP IND
< e k a r r i > V ARGABS3 ARGABSSG ARGERG2 ARGERGPL ARGIO3 ARGIOPL HYP POT
< j o a n > V ARGABS1 ARGABSPL ARGIO3 ARGIOSG PRES IND
< e g i n > V ARGABS3 ARGABSSG ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM PAST POT
< e u t s i > V ARGABS3 ARGABSSG ARGERG3 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM HYP IND
< j a r d u n > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG INFM FEM PRES IND
< e u t s i > V ARGABS3 ARGABSSG ARGERG3 ARGERGSG ARGIO1 ARGIOPL PAST POT
< u k a n > V ARGABS3 ARGABSSG ARGERG3 ARGERGPL ARGIO3 ARGIOSG INFM FEM PRES POT
< u k a n > V ARGABS3 ARGABSSG ARGERG1 ARGERGPL ARGIO2 ARGIOSG ARGIOINFM ARGIOFEM HYP IND
< e z a n > V ARGABS3 ARGABSSG ARGERG2 ARGERGPL ARGIO1 ARGIOSG IMP
< e d u n > V ARGABS3 ARGABSPL ARGERG3 ARGERGPL ARGIO1 ARGIOPL INFM FEM

In [None]:
# !bash ./scripts/preprocess_new.sh eu high

2022-02-01 11:21:11 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/high/new', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='eu.lemma_tag', srcdict=None, target_lang='eu.inflected', task='translation', tensorboard_logdir=None, testpref='./data/prepared/test/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0

In [None]:
# !bash ./scripts/train_new.sh eu high

2022-02-01 11:21:16 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.3, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer', attention_dropout=0.3, batch_size=256, batch_size_valid=256, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/high/new', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=256, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=256, decoder_layerdrop=0, decoder_layers=4, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=256, device_id=0, disable_validation=False, distribut

In [None]:
# %tensorboard --logdir logs/eu/high/inflected/bt

In [None]:
# !bash ./scripts/generate_new.sh eu high

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


In [None]:
# lemmas = []
# inflecteds = []
# with open('generated/eu/high/inflected/bt/sen.txt', 'r') as f:
#     lemmas = [line.strip() for line in f]
# with open('generated/eu/high/inflected/bt/hyp.txt', 'r') as f:
#     inflecteds = [line.strip() for line in f]

# reformat(lemmas, inflecteds, "generated/eu/high/inflected/bt/predicted.txt")

In [None]:
# !python ./scripts/evalm.py --guess generated/eu/high/inflected/bt/predicted.txt --gold data/sigmorphon/basque-test.txt --task 1

acccuracy:	89.60
levenshtein:	0.21
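
To compare the first inflection model with the one retrained on backtranslated data, the accuracies printed in the cells above can be collected in one place. The sketch below does this for the three settings whose baseline and backtranslation scores both appear in this part of the notebook (Basque med, Spanish high, Basque high). The dictionary layout and the name `accuracy_deltas` are illustrative assumptions; the numbers are copied from the evaluation outputs.

```python
# (language, size) -> (baseline accuracy, accuracy after backtranslation)
results = {
    ("Basque", "med"): (81.30, 79.50),
    ("Spanish", "high"): (89.10, 74.60),
    ("Basque", "high"): (89.60, 89.60),
}


def accuracy_deltas(results):
    """Accuracy change in percentage points (backtranslation minus baseline)."""
    return {key: round(bt - base, 2) for key, (base, bt) in results.items()}


deltas = accuracy_deltas(results)
for (language, size), delta in deltas.items():
    base, bt = results[(language, size)]
    print(f"{language:8s} {size:5s} baseline={base:.2f} bt={bt:.2f} delta={delta:+.2f}")
```

A negative delta means the model trained on the augmented data scored lower on the test set than the first inflection model.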
