In this notebook we take the data from [SIGMORPHON 2018 task 1](https://github.com/sigmorphon/conll2018/tree/master/task1/all) and format it so that it can be fed to the [Fairseq](https://github.com/pytorch/fairseq) tools.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/Colab Notebooks/backtranslation/"

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/backtranslation


In [None]:
import pandas as pd
import os

The data is in a tab-separated format with no header row. The first column is the lemma, the second column is the inflected form, and the third column holds the tags that describe the inflection. Each row is one example. The dataset contains only verbs.
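
As a minimal sketch of this layout, the snippet below parses two made-up rows with `pandas.read_csv`. The rows are illustrative and not copied from the dataset. The `quoting` and `keep_default_na` arguments are assumptions that the loading code in this notebook does not use; they only stop the parser from reinterpreting quote characters or strings such as "null".

```python
import csv
import io

import pandas as pd

# Two illustrative rows in the lemma / inflected form / tags layout.
# The second row is an artificial edge case: "null" would become NaN by default.
sample = "hablar\thablo\tV;IND;PRS;1;SG\nnull\tnulled\tV;PST\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    keep_default_na=False,   # keep strings such as "null" as strings
)

print(df.shape)       # rows x columns
print(df.iloc[1, 0])  # lemma of the second row
```

Whether these extra arguments are needed depends on the actual contents of the files, so it is worth checking the loaded frames for missing values before relying on them.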

We will use the datasets for Spanish and Basque. Each one contains three training sets with different numbers of examples (low: 100, medium: 1000, high: 10000), a dev set and a test set.

In [None]:
INPUT = os.path.join('./data', 'sigmorphon')

es_train_low = os.path.join(INPUT, 'spanish-train-low.txt')
es_train_med = os.path.join(INPUT, 'spanish-train-medium.txt')
es_gen = os.path.join(INPUT, 'spanish-train-high.txt')
es_dev = os.path.join(INPUT, 'spanish-dev.txt')
es_test = os.path.join(INPUT, 'spanish-test.txt')

eu_train_low = os.path.join(INPUT, 'basque-train-low.txt')
eu_train_med = os.path.join(INPUT, 'basque-train-medium.txt')
eu_gen = os.path.join(INPUT, 'basque-train-high.txt')
eu_dev = os.path.join(INPUT, 'basque-dev.txt')
eu_test = os.path.join(INPUT, 'basque-test.txt')

From these, new datasets in the required format are built: three training sets and one set for backtranslation. The low set (100) is the original low file, the med set (500) is a random half of the original medium file (1000), the high set (1000) is the full original medium file, and the gen set (5000) is a random half of the original high file (10000). The dev and test sets keep exactly the same data.
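
The halving step can be sketched on synthetic data as follows. The frame `df_medium` is a stand-in for the 1000-row medium file, and `random_state=0` is an assumption added here for reproducibility. The notebook calls `sample(frac=0.5)` without a seed, so each run draws a different subset.

```python
import pandas as pd

# Stand-in for a 1000-row "medium" file (synthetic rows, for illustration only).
df_medium = pd.DataFrame({
    0: [f"lemma{i}" for i in range(1000)],
    1: [f"form{i}" for i in range(1000)],
    2: ["V;PST"] * 1000,
})

# Draw a random half of the rows without replacement.
# random_state is an assumption; it makes the draw repeatable.
df_half = df_medium.sample(frac=0.5, random_state=0)

print(len(df_medium), len(df_half))
```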

In [None]:
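df_es_train_low = pd.read_csv(es_train_low, sep="\t", header=None)
# The full medium file (1000 rows) becomes our "high" training set
df_es_train_high = pd.read_csv(es_train_med, sep="\t", header=None)
# A random half of it (500 rows) becomes our "med" training set
df_es_train_med = df_es_train_high.sample(frac=0.5)
# A random half of the high file (5000 rows) is kept for backtranslation
df_es_gen = pd.read_csv(es_gen, sep="\t", header=None)
df_es_gen = df_es_gen.sample(frac=0.5)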
df_es_dev = pd.read_csv(es_dev, sep="\t", header=None)
df_es_test = pd.read_csv(es_test, sep="\t", header=None)

df_eu_train_low = pd.read_csv(eu_train_low, sep="\t", header=None)
df_eu_train_high = pd.read_csv(eu_train_med, sep="\t", header=None)
df_eu_train_med = df_eu_train_high.sample(frac=0.5)
df_eu_gen = pd.read_csv(eu_gen, sep="\t", header=None)
df_eu_gen = df_eu_gen.sample(frac=0.5)
df_eu_dev = pd.read_csv(eu_dev, sep="\t", header=None)
df_eu_test = pd.read_csv(eu_test, sep="\t", header=None)

For each dataset, the three columns are transformed into two files. The first contains the lemmas with their tags. The second contains the inflected words.

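Since the inflection and tagger models use characters as tokens, we need to separate each character of the words, and each feature of the tag, with spaces. Words are also wrapped in `<` and `>` boundary markers, and any space inside an inflected form is replaced by `#`.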
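
As a worked example of this transformation on a single row, the sketch below applies the same string operations as the formatting function in the next cell. The lemma, inflected form and tags are made up for illustration.

```python
# One illustrative row: lemma, inflected form, tags.
lemma, inflected, tags = "hablar", "he hablado", "V;IND;PRS;PFV;1;SG"

# Source side: boundary markers, one character per token, then one tag per token.
source = " ".join("<" + lemma + ">") + " " + tags.replace(";", " ")

# Target side: inner spaces become '#', then boundary markers and character split.
target = " ".join("<" + inflected.replace(" ", "#") + ">")

print(source)  # < h a b l a r > V IND PRS PFV 1 SG
print(target)  # < h e # h a b l a d o >
```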

In [None]:
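def format(dataset):
    # Note: this name shadows the built-in format(); it is kept because process() calls it
    # Source side: "<lemma>" split into characters, followed by the tags
    lemmas = dataset.iloc[:,0]
    tags = dataset.iloc[:,2]
    lemmas = ["<" + lemma + ">" for lemma in lemmas]
    lemmas = [" ".join(lemma) for lemma in lemmas]
    # Tags are separated by ';' in the data; each tag becomes one token
    tags = [tag.replace(";", " ") for tag in tags]
    lemmas = [lemma + " " + tag for (lemma, tag) in zip(lemmas,tags)]

    # Target side: "<inflected>" split into characters, with inner spaces kept as '#'
    inflecteds = dataset.iloc[:,1]
    inflecteds = [inflected.replace(" ", "#") for inflected in inflecteds]
    inflecteds = ["<" + inflected + ">" for inflected in inflecteds]
    inflecteds = [" ".join(inflected) for inflected in inflecteds]

    return lemmas, inflecteds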

In [None]:
def process(dataset, t, lang):
    # OUTPUT is a global defined in a later cell; it must be set before this is called
    path_output = os.path.join(OUTPUT, t)
    os.makedirs(path_output, exist_ok=True)
    # The three training sizes share the 'train' prefix; dev, gen and test use their own name
    if t in ['low', 'med', 'high']:
        path_output_lemma = os.path.join(path_output, 'train.{}.lemma_tag'.format(lang))
        path_output_inflected = os.path.join(path_output, 'train.{}.inflected'.format(lang))
    else:
        path_output_lemma = os.path.join(path_output, '{}.{}.lemma_tag'.format(t, lang))
        path_output_inflected = os.path.join(path_output, '{}.{}.inflected'.format(t, lang))

    lemmas, inflecteds = format(dataset)

    # One example per line; UTF-8 so accented characters are written consistently
    with open(path_output_lemma, "w", encoding="utf-8") as f:
        f.write("\n".join(lemmas))

    with open(path_output_inflected, "w", encoding="utf-8") as f:
        f.write("\n".join(inflecteds))

In [None]:
OUTPUT = os.path.join('./data', 'prepared')
types = ['low', 'med', 'high', 'dev', 'gen', 'test']
langs = ['es','eu']

In [None]:
datasets_es = [df_es_train_low, df_es_train_med, df_es_train_high, df_es_dev, df_es_gen, df_es_test]
datasets_eu = [df_eu_train_low, df_eu_train_med, df_eu_train_high, df_eu_dev, df_eu_gen, df_eu_test]
datasets = [datasets_es, datasets_eu]

In [None]:
for (lang_datasets,lang) in zip(datasets, langs):
    for (d, t) in zip(lang_datasets, types):
        process(d, t, lang)
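
Because the `lemma_tag` and `inflected` files are meant to be parallel, a quick check that each pair has the same number of lines can catch formatting problems early. The sketch below is not part of the original pipeline: `count_lines` is a hypothetical helper, and it runs on two throwaway files in a temporary directory instead of the real files under `OUTPUT`.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demonstration on two throwaway files. In the notebook the paths would be a
# pair such as 'train.es.lemma_tag' and 'train.es.inflected' under OUTPUT.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.xx.lemma_tag")
    tgt = os.path.join(tmp, "train.xx.inflected")
    with open(src, "w", encoding="utf-8") as f:
        f.write("< a b > V PST\n< c d > V PRS")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("< a b e d >\n< c d s >")
    n_src, n_tgt = count_lines(src), count_lines(tgt)

# A mismatch would mean the source and target sides are no longer aligned.
assert n_src == n_tgt, "source and target files are not parallel"
print(n_src, n_tgt)
```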