## Compilation of `multilingual` training dataset for Sunbird language models

## Logic

#### V1: A model that translates any of the other languages to English (`mul-en`)
##### Source sentence: Any of the other languages
##### Target: English


#### V2: A model that translates English to any of the other languages (`en-mul`)
##### Source sentence: English
##### Target: Any of the other languages
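
As a rough illustration of the two directions, the sketch below builds one `mul-en` pair and one `en-mul` pair from a single translated sentence. The helper names `mul_en_pair` and `en_mul_pair` are illustrative assumptions; the `>>code<<` prefix mirrors the convention used in Part 2 of this notebook.

```python
# Hypothetical helpers for illustration only; the notebook's own functions
# are defined in the cells that follow.
sentence = {
    "English": "Uganda is focusing on farming.",
    "Luganda": "Uganda essira eritadde ku bulimi.",
}

def mul_en_pair(sentence, language):
    # V1 (mul-en): the other language is the source, English is the target.
    return {"source": sentence[language], "target": sentence["English"]}

def en_mul_pair(sentence, language, code):
    # V2 (en-mul): English is the source; a >>code<< token names the target language.
    return {"source": f">>{code}<< {sentence['English']}", "target": sentence[language]}

print(mul_en_pair(sentence, "Luganda"))
print(en_mul_pair(sentence, "Luganda", "lug"))
```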

In [39]:
import json
import pandas as pd
from itertools import chain
from sklearn.model_selection import train_test_split

In [None]:
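# Download the Sunbird language dataset if needed.
# NOTE: this URL fetches v4.0, while the loading cell below reads
# "sunbird-ug-lang-v5.0.jsonl"; make sure the two file names match.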
!wget https://transfer.sh/AvcWgi/sunbird-ug-lang-v4.0.jsonl

### Part 1: Create Multi-lingual to English target dataset (mul-en)

#### Multilingual to English dataset creation function

In [42]:
# Function to generate multiple training examples from one translated sentence.
def training_examples_from_sentence_mul_en(translated_sentence,
                                    target_language = "English"):
  if target_language not in translated_sentence:
    raise ValueError(
        f"Target language {target_language} expected in translations, but "
        f"{translated_sentence.keys()} found")

  source_languages = set(translated_sentence.keys())
  source_languages.remove(target_language)

  if not source_languages:
    raise ValueError("There should be at least one language apart from the "
                    "target.")

  training_examples = [
                        {
                          "source": translated_sentence[lang], 
                          "target": translated_sentence[target_language],
                          "source_language": lang
                        }
                        for lang in source_languages
                      ]

  return training_examples

#### Application to the Sunbird dataset

In [43]:
sunbird_df = pd.read_json("sunbird-ug-lang-v5.0.jsonl", lines=True)
sunbird_df.head()

Unnamed: 0,English,Luganda,Runyankole,Ateso,Lugbara,Acholi
0,Eggplants always grow best under warm conditions.,Bbiringanya lubeerera asinga kukulira mu mbee...,Entonga buriijo zikurira omu mbeera y'obwire e...,Epoloi ebirinyanyi ojok apakio nu emwanar akwap.,Birinyanya eyi zo kililiru ndeni angu driza ma...,Bilinyanya pol kare dongo maber ka lyeto tye
1,Farmland is sometimes a challenge to farmers.,Ettaka ly'okulimirako n'okulundirako ebiseera ...,"Eitaka ry'okuhingamu, obumwe n'obumwe nirireet...",Akiro nu alupok nes erai ationis kanejaas akoriok,Amvu ma angu eri sa'wa azini 'diyisi 'ba amvu ...,Ngom me pur i kare mukene obedo peko madit bot...
2,Farmers should be encouraged to grow more coffee.,Abalimi balina okukubirizibwa okwongera okulim...,Abahingi bashemereire kuhigwa bongyere okuhing...,Ekot aisinyikokit akoriok akoru emwanyi loepol,Le 'ba ma fe 'ba amvu 'yapi 'diyini ava kawa '...,Lupur omyero ki konygi wek nong miti me puru m...
3,Uganda is focusing on farming.,Uganda essira eritadde ku bulimi.,Uganda eteire amaani aha buhingi n'oburiisa.,Uganda nes ejai akiro nu akoru.,Kari Uganda niri eri asi'baza be amvu 'yaza ma...,Uganda tye ka keme ki lok me pur
4,Some plants die due to lack of sunlight.,Ebimera ebimu bifa olw'ebbula ly'omusana.,Ebihingwa ebimwe nibyoma ahabw'okubura omushana.,Icie ikorion etwakete naarai emamei akolong.,Ori azi 'diyi odra te ituka ma akosi.,jami apita mukene too woko pien pe ginongo cen...


In [44]:
sunbird_df.shape

(25007, 6)

In [45]:
# train/test/val split
train_df, test_df = train_test_split(sunbird_df, test_size=0.33, random_state=42)
test_df, val_df = train_test_split(test_df, test_size=0.5, random_state=42)
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)

(16754, 6)
(4126, 6)
(4127, 6)
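
The split sizes printed above follow from how `train_test_split` rounds a fractional `test_size`: the held-out count is rounded up and the remainder stays in the first part. A small sketch of that arithmetic using only the standard library (the helper name `split_sizes` is made up for this example):

```python
import math

def split_sizes(n_rows, test_size):
    # A fractional test_size is rounded up; the rest stays in the first part.
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_train, n_rest = split_sizes(25007, 0.33)   # first split: train vs. rest
n_test, n_val = split_sizes(n_rest, 0.5)     # second split: test vs. val
print(n_train, n_test, n_val)  # 16754 4126 4127
```

So roughly 67% of the rows go to training and about 16.5% each to test and validation.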


In [46]:
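# Use the full DataFrame here: label 0 is only present in train_df if the
# random split happens to place it there.
translated_sentence = sunbird_df.loc[0]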
translated_sentence

English       Eggplants always grow best under warm conditions.
Luganda       Bbiringanya lubeerera  asinga kukulira mu mbee...
Runyankole    Entonga buriijo zikurira omu mbeera y'obwire e...
Ateso          Epoloi ebirinyanyi ojok apakio nu emwanar akwap.
Lugbara       Birinyanya eyi zo kililiru ndeni angu driza ma...
Acholi             Bilinyanya pol kare dongo maber ka lyeto tye
Name: 0, dtype: object

In [47]:
translated_sentence = dict(translated_sentence)
translated_sentence

{'English': 'Eggplants always grow best under warm conditions.',
 'Luganda': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
 'Runyankole': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
 'Ateso': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
 'Lugbara': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
 'Acholi': 'Bilinyanya pol kare dongo maber ka lyeto tye'}

In [48]:
training_examples_from_sentence_mul_en(translated_sentence)

[{'source': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Eggplants always grow best under warm conditions.',
  'source_language': 'Runyankole'},
 {'source': 'Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Eggplants always grow best under warm conditions.',
  'source_language': 'Acholi'},
 {'source': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Eggplants always grow best under warm conditions.',
  'source_language': 'Lugbara'},
 {'source': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Eggplants always grow best under warm conditions.',
  'source_language': 'Ateso'},
 {'source': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Eggplants always grow best under warm conditions.',
  'source_language': 'Luganda'}]

In [49]:
train = []
for i in range(len(train_df)):
  train.append(training_examples_from_sentence_mul_en(dict(train_df.iloc[i])))

train[0]

[{'source': 'Nitwenda embaririra ya sente ezimwaheirwe.',
  'target': 'We need the accountability of funds given to you.',
  'source_language': 'Runyankole'},
 {'source': 'Wamito niang kit ma itiyo ki cene ma wamiyi',
  'target': 'We need the accountability of funds given to you.',
  'source_language': 'Acholi'},
 {'source': "Ale ki geri mini robia 'bani fe mi dri ri ayuzu ri ni.",
  'target': 'We need the accountability of funds given to you.',
  'source_language': 'Lugbara'},
 {'source': 'Ikoto iso aitodunet na itwasamatere ikapun lu ijaikinio yes.',
  'target': 'We need the accountability of funds given to you.',
  'source_language': 'Ateso'},
 {'source': "Twetaaga embalirira y'ensimbi ezakuweebwa.",
  'target': 'We need the accountability of funds given to you.',
  'source_language': 'Luganda'}]

In [50]:
test = []
for i in range(len(test_df)):
  test.append(training_examples_from_sentence_mul_en(dict(test_df.iloc[i])))

test[0]

[{'source': 'Tata akaitirwa omu kurumbwa.',
  'target': 'My father was killed in the attack.',
  'source_language': 'Runyankole'},
 {'source': 'Kineno wora I mony ne.',
  'target': 'My father was killed in the attack.',
  'source_language': 'Acholi'},
 {'source': "Ba 'di ma atinie'yo amvuta ndeniri ma alea.",
  'target': 'My father was killed in the attack.',
  'source_language': 'Lugbara'},
 {'source': 'Aponi koyarai papaka kotoma ojie kangol.',
  'target': 'My father was killed in the attack.',
  'source_language': 'Ateso'},
 {'source': 'Taata wange yafiira mu bulumbaganyi.',
  'target': 'My father was killed in the attack.',
  'source_language': 'Luganda'}]

In [51]:
val = []
for i in range(len(val_df)):
  val.append(training_examples_from_sentence_mul_en(dict(val_df.iloc[i])))

val[0]

[{'source': 'Ebihandiikirwe nibyoreeka ku enshohoza ya Rwanda aha mahe yayeyongiire  kurumba omwaka enkumi ibiri ikumi na munaana.',
  'target': 'The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'source_language': 'Runyankole'},
 {'source': 'Dul ngec meno waco ni Rwanda wel cente ma Rwanda tiyo kwede I lweny omede I mwaka alip aryo ki apar wiye aboro.',
  'target': 'The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'source_language': 'Acholi'},
 {'source': "O'duko nderi ece kini Rwanda ma aje afa marani ni 'diyi ma driari ma ongmbo tu alifu iri mudri drini arosi",
  'target': 'The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'source_language': 'Lugbara'},
 {'source': 'Itodunitos akiro nuingadatai ebe abu etosomae loka ikapun kanu ajore iyatakin kokaru ilukumin iarei itomonakanyauni.',
  'target': 'The data indicate that military expenditure in Rwanda inc

In [52]:
train_df_final = pd.DataFrame(list(chain.from_iterable(train)))
test_df_final = pd.DataFrame(list(chain.from_iterable(test)))
val_df_final = pd.DataFrame(list(chain.from_iterable(val)))
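
Each call to the example-generating function returns a list, so `train`, `test` and `val` are lists of lists; `chain.from_iterable` flattens them into one list of dicts before the `DataFrame` is built. A toy illustration with made-up placeholder rows:

```python
from itertools import chain

# Two sentences, each expanded into per-language examples (placeholder text).
nested = [
    [{"source": "a1", "source_language": "Luganda"},
     {"source": "a2", "source_language": "Acholi"}],
    [{"source": "b1", "source_language": "Luganda"}],
]

flat = list(chain.from_iterable(nested))
print(len(flat))  # 3
print(flat[2]["source"])  # b1
```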

In [53]:
source_languages = list(train_df_final["source_language"].unique())
language_codes = {
    "English": "en", "Luganda": "lug", "Runyankole": "run", 
    "Acholi": "ach", "Ateso": "teo", "Lugbara": "lgg",
    # "Luo" appears only in Flores 101; Part 2 needs a code for its >>luo<< tag.
    "Luo": "luo"
}

In [54]:
train_df_final["source"].to_csv("train.src", header=False, index=False, sep="\t", mode="a")
train_df_final["target"].to_csv("train.tgt", header=False, index=False, sep="\t", mode="a")

In [55]:
for language in source_languages:
    code = language_codes[language]
    # Select the rows for this language once, then write source and target files.
    test_subset = test_df_final[test_df_final["source_language"] == language]
    val_subset = val_df_final[val_df_final["source_language"] == language]
    test_subset["source"].to_csv(f"test_{code}.src", header=False, index=False, sep="\t", mode="a")
    test_subset["target"].to_csv(f"test_{code}.tgt", header=False, index=False, sep="\t", mode="a")
    val_subset["source"].to_csv(f"val_{code}.src", header=False, index=False, sep="\t", mode="a")
    val_subset["target"].to_csv(f"val_{code}.tgt", header=False, index=False, sep="\t", mode="a")

#### Application to the AI4D Luganda dataset

In [56]:
ai4d_df = pd.read_csv("ai4d_luganda.csv")
ai4d_df.head()

Unnamed: 0,eng,lug
0,All refugees were requested to register with t...,Abanoonyiboobubudamu bonna baasabiddwa beewand...
1,They called for a refugees' meeting yesterday.,Baayise olukungaana lw'abanoonyiboobubudamu eg...
2,Refugees had misunderstandings between thems...,Abanoonyiboobubudamu b'abadde n'obutakkaanya w...
3,We were urged to welcome refugees into our com...,Twakubirizibwa okwaniriza abanoonyiboobubudamu...
4,More development is achieved when we work toge...,Bwe tukolera awamu enkulaakulana enyingi efuni...


In [57]:
ai4d_df.rename(columns={"eng": "English", "lug": "Luganda"}, inplace=True)
ai4d_df.columns

Index(['English', 'Luganda'], dtype='object')

In [58]:
ai4d = []
for i in range(len(ai4d_df)):
  ai4d.append(training_examples_from_sentence_mul_en(ai4d_df.loc[i]))

ai4d[:5]

[[{'source': 'Abanoonyiboobubudamu bonna baasabiddwa beewandiise ewa ssentebe.',
   'target': 'All refugees were requested to register with the chairman.',
   'source_language': 'Luganda'}],
 [{'source': "Baayise olukungaana lw'abanoonyiboobubudamu eggulo.",
   'target': "They called for a refugees' meeting yesterday.",
   'source_language': 'Luganda'}],
 [{'source': "Abanoonyiboobubudamu b'abadde n'obutakkaanya wakati waabwe.",
   'target': 'Refugees had misunderstandings between   themselves.',
   'source_language': 'Luganda'}],
 [{'source': 'Twakubirizibwa okwaniriza abanoonyiboobubudamu mu bitundu byaffe.',
   'target': 'We were urged to welcome refugees into our communities.',
   'source_language': 'Luganda'}],
 [{'source': 'Bwe tukolera awamu enkulaakulana enyingi efunibwa.',
   'target': 'More development is achieved when we work together.',
   'source_language': 'Luganda'}]]

In [59]:
ai4d_dataset = pd.DataFrame(list(chain.from_iterable(ai4d)))
ai4d_dataset.head()

Unnamed: 0,source,target,source_language
0,Abanoonyiboobubudamu bonna baasabiddwa beewand...,All refugees were requested to register with t...,Luganda
1,Baayise olukungaana lw'abanoonyiboobubudamu eg...,They called for a refugees' meeting yesterday.,Luganda
2,Abanoonyiboobubudamu b'abadde n'obutakkaanya w...,Refugees had misunderstandings between thems...,Luganda
3,Twakubirizibwa okwaniriza abanoonyiboobubudamu...,We were urged to welcome refugees into our com...,Luganda
4,Bwe tukolera awamu enkulaakulana enyingi efuni...,More development is achieved when we work toge...,Luganda


In [60]:
ai4d_dataset[["source"]].to_csv("train_ai4d.src", header=None, index=None, sep="\t", mode="a")
ai4d_dataset[["target"]].to_csv("train_ai4d.tgt", header=None, index=None, sep="\t", mode="a")

#### Application to the Flores 101 dataset

In [61]:
flores_df = pd.read_csv("flores101.csv")
flores_df.head()

Unnamed: 0,lug,luo,eng
0,"Ku balaza, Banasayansi okuva mu setendekero ya...","Chieng' Wuoktich, josayans mawuok e Mbalariany...","On Monday, scientists from the Stanford Univer..."
1,Abakulira abanoonyereza bagamba nti kino kijak...,Jononro motelo wachoni ma nyalo kelo fweny mac...,Lead researchers say this may bring early dete...
2,Aba JAS 39C Gripen basasanila mu luguudo ku sa...,Ndegeno mar JAS 39C Gripen ne ogore piny e nda...,The JAS 39C Gripen crashed onto a runway at ar...
3,Omuvuzi wenyonyi yategerekeka nga omukulembeze...,Jariemb ndegeno noyangi kaka Squadron Dilokrit...,The pilot was identified as Squadron Leader Di...
4,Amawulire agakuno galaga ekimotoka kyomuliro e...,Ute fwambo ma alwora no golo ripot ni gach neg...,Local media reports an airport fire vehicle ro...


In [62]:
flores_df.rename(columns={"eng": "English", "lug": "Luganda", "luo": "Luo"}, inplace=True)
flores_df.columns

Index(['Luganda', 'Luo', 'English'], dtype='object')

In [63]:
flores = []
for i in range(len(flores_df)):
  flores.append(training_examples_from_sentence_mul_en(flores_df.loc[i]))

flores[:5]

[[{'source': "Chieng' Wuoktich, josayans mawuok e Mbalariany mar Stanford e Skul mar Thieth nolando ni negifwenyo gimanyien mitiyogo e nono tuoche ma nyalo pogo ng'injo mag del kaluwore kod kitgi: en chip moro matin ma inyalo go chapa gi printa ma bende inyalo losi kitiyo kod printa mapile mag inkjet kwom manyalo romo otonglo achiel mar Amerka e moro ka moro.",
   'target': 'On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.',
   'source_language': 'Luo'},
  {'source': "Ku balaza, Banasayansi okuva mu setendekero ya Stanford ku somero ly'ebyedagala balangirira okuvumbulwa kwa akuuma akakebera nga kasobola okusengeka obutafaali nga kasinzira kukika kyabwo: Akuuma katono akasobola okufulumizibwa ku lupapula akasobola okukolebwa ne Printa enungi ku sente entono nga emu eya US bu

In [64]:
flores_dataset = pd.DataFrame(list(chain.from_iterable(flores)))
flores_dataset.head(8)

Unnamed: 0,source,target,source_language
0,"Chieng' Wuoktich, josayans mawuok e Mbalariany...","On Monday, scientists from the Stanford Univer...",Luo
1,"Ku balaza, Banasayansi okuva mu setendekero ya...","On Monday, scientists from the Stanford Univer...",Luganda
2,Jononro motelo wachoni ma nyalo kelo fweny mac...,Lead researchers say this may bring early dete...,Luo
3,Abakulira abanoonyereza bagamba nti kino kijak...,Lead researchers say this may bring early dete...,Luganda
4,Ndegeno mar JAS 39C Gripen ne ogore piny e nda...,The JAS 39C Gripen crashed onto a runway at ar...,Luo
5,Aba JAS 39C Gripen basasanila mu luguudo ku sa...,The JAS 39C Gripen crashed onto a runway at ar...,Luganda
6,Jariemb ndegeno noyangi kaka Squadron Dilokrit...,The pilot was identified as Squadron Leader Di...,Luo
7,Omuvuzi wenyonyi yategerekeka nga omukulembeze...,The pilot was identified as Squadron Leader Di...,Luganda


In [65]:
flores_dataset[["source"]].to_csv("train_flores.src", header=None, index=None, sep="\t", mode="a")
flores_dataset[["target"]].to_csv("train_flores.tgt", header=None, index=None, sep="\t", mode="a")

#### Application to the MT560 dataset

In [66]:
mt560_df = pd.read_csv("mt560.csv")
mt560_df.head(10)

Unnamed: 0,source,english,source_language
0,Beduru gi Kuwe kod Ji Duto,Adam and Eve - Were They Real People?,luo
1,Hera umo richo mogundho.,"In fact, ""love covers a multitude of sins.""",luo
2,I mwaka me apar wiye angwen me loc pa kabaka K...,"In the fourteenth year of King Hezekiah, Senna...",ach
3,Muliraanwa wange y'ani?,Who really is my neighbor?,lug
4,Notego wang'e kuom pokne.,"He ""looked intently toward the payment of the ...",luo
5,Okuva mu Nnimi Zonna,Out of All the Languages,lug
6,Omiyo wang 'chieng' mare wuok ni jo maricho ko...,He makes his sun rise upon wicked people and g...,luo
7,Yakuwa Ayagala Obwenkanya,Jehovah Is a Lover of Justice,lug
8,Yoleka Obwenkanya ng'Okola ku Nsonga Zange,See That I Get Justice,lug
9,"""Akamwa kange kanaayogera amagezi; n'omutima g...","""The meditation of my heart will be of things ...",lug


In [67]:
mt560_df["source_language"].unique()

array(['luo', 'ach', 'lug', 'nyn'], dtype=object)

In [68]:
mt560_df.drop(columns="source_language", inplace=True)
mt560_df.rename(columns={"english": "target"}, inplace=True)
mt560_df.head()

Unnamed: 0,source,target
0,Beduru gi Kuwe kod Ji Duto,Adam and Eve - Were They Real People?
1,Hera umo richo mogundho.,"In fact, ""love covers a multitude of sins."""
2,I mwaka me apar wiye angwen me loc pa kabaka K...,"In the fourteenth year of King Hezekiah, Senna..."
3,Muliraanwa wange y'ani?,Who really is my neighbor?
4,Notego wang'e kuom pokne.,"He ""looked intently toward the payment of the ..."


In [69]:
mt560_df[["source"]].to_csv("train_mt560.src", header=None, index=None, sep="\t", mode="a")
mt560_df[["target"]].to_csv("train_mt560.tgt", header=None, index=None, sep="\t", mode="a")
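
One thing worth checking in the files written above: `to_csv` applies minimal CSV quoting by default, so a sentence that contains a double quote (several MT560 rows do) is wrapped in quotes and its inner quotes are doubled. The sketch below reproduces that behaviour with the standard-library `csv` module, which follows the same minimal-quoting rule, and contrasts it with writing one raw sentence per line. The helper `write_lines` is an assumption for illustration, not part of the original pipeline.

```python
import csv
import io

sentence = 'In fact, "love covers a multitude of sins."'

# What minimal CSV quoting does to a field containing a double quote.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow([sentence])
print(buffer.getvalue())

def write_lines(sentences, stream):
    # Hypothetical helper: one sentence per line, no quoting; embedded
    # newlines are replaced with spaces so that line counts stay aligned.
    for text in sentences:
        stream.write(text.replace("\n", " ") + "\n")

raw = io.StringIO()
write_lines([sentence], raw)
print(raw.getvalue())
```

Whether the extra quotes matter depends on how the training code reads the `.src` and `.tgt` files, so this is something to verify rather than a confirmed problem.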

### Create dataset folders and add dataset files


In [70]:
!mkdir -p multilingual-dataset


In [71]:
!mv {*.src,*.tgt} multilingual-dataset

In [72]:
!ls multilingual-dataset

test_ach.src     test_run.tgt     train_flores.src val_lgg.tgt
test_ach.tgt     test_teo.src     train_flores.tgt val_lug.src
test_lgg.src     test_teo.tgt     train_mt560.src  val_lug.tgt
test_lgg.tgt     train.src        train_mt560.tgt  val_run.src
test_lug.src     train.tgt        val_ach.src      val_run.tgt
test_lug.tgt     train_ai4d.src   val_ach.tgt      val_teo.src
test_run.src     train_ai4d.tgt   val_lgg.src      val_teo.tgt


In [73]:
!mkdir -p v7-dataset/v7.0/supervised/mul-en

In [74]:
!cp -v multilingual-dataset/*.{src,tgt} v7-dataset/v7.0/supervised/mul-en

multilingual-dataset/test_ach.src -> v7-dataset/v7.0/supervised/mul-en/test_ach.src
multilingual-dataset/test_lgg.src -> v7-dataset/v7.0/supervised/mul-en/test_lgg.src
multilingual-dataset/test_lug.src -> v7-dataset/v7.0/supervised/mul-en/test_lug.src
multilingual-dataset/test_run.src -> v7-dataset/v7.0/supervised/mul-en/test_run.src
multilingual-dataset/test_teo.src -> v7-dataset/v7.0/supervised/mul-en/test_teo.src
multilingual-dataset/train.src -> v7-dataset/v7.0/supervised/mul-en/train.src
multilingual-dataset/train_ai4d.src -> v7-dataset/v7.0/supervised/mul-en/train_ai4d.src
multilingual-dataset/train_flores.src -> v7-dataset/v7.0/supervised/mul-en/train_flores.src
multilingual-dataset/train_mt560.src -> v7-dataset/v7.0/supervised/mul-en/train_mt560.src
multilingual-dataset/val_ach.src -> v7-dataset/v7.0/supervised/mul-en/val_ach.src
multilingual-dataset/val_lgg.src -> v7-dataset/v7.0/supervised/mul-en/val_lgg.src
multilingual-dataset/val_lug.src -> v7-dataset/v7.0/supervised/mul-e

### Part 2: Create English to all languages dataset (en-mul)

In [83]:
# English-to-many case: pair the source sentence with every other language,
# prefixing the source with a >>code<< token that names the target language.
def training_examples_from_sentence_en_mul(translated_sentence,
                                    source_language = "English"):
  languages = set(translated_sentence.keys())

  if len(languages) < 2:
    raise ValueError("There must be at least two different languages, "
                     f"found {languages}")

  training_examples = []
  # Remove the configured source language rather than a hard-coded "English".
  languages.remove(source_language)
  for target_language in languages:
      source_text = (f">>{language_codes[target_language]}<< "
                     f"{translated_sentence[source_language]}")
      target_text = translated_sentence[target_language]

      training_examples.append(
                              {
                                "source": source_text, 
                                "target": target_text,
                                "target_language": target_language
                              })
      
  return training_examples

In [84]:
training_examples_from_sentence_en_mul(translated_sentence)

[{'source': '>>run<< Eggplants always grow best under warm conditions.',
  'target': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target_language': 'Runyankole'},
 {'source': '>>ach<< Eggplants always grow best under warm conditions.',
  'target': 'Bilinyanya pol kare dongo maber ka lyeto tye',
  'target_language': 'Acholi'},
 {'source': '>>lgg<< Eggplants always grow best under warm conditions.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target_language': 'Lugbara'},
 {'source': '>>teo<< Eggplants always grow best under warm conditions.',
  'target': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target_language': 'Ateso'},
 {'source': '>>lug<< Eggplants always grow best under warm conditions.',
  'target': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target_language': 'Luganda'}]
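
The `>>code<<` prefix is what tells a single `en-mul` model which language to produce. A small sketch of adding such a tag and reading it back; the function names and the regular expression are illustrative assumptions, not part of the notebook:

```python
import re

TAG_PATTERN = re.compile(r"^>>([a-z]+)<< (.*)$", re.DOTALL)

def add_language_tag(text, code):
    # Prefix the English source with the target-language token.
    return f">>{code}<< {text}"

def split_language_tag(tagged):
    # Return (code, text); raise if the line carries no tag.
    match = TAG_PATTERN.match(tagged)
    if match is None:
        raise ValueError(f"No language tag found in: {tagged!r}")
    return match.group(1), match.group(2)

tagged = add_language_tag("Uganda is focusing on farming.", "teo")
print(tagged)
print(split_language_tag(tagged))
```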

In [85]:
train = []
for i in range(len(train_df)):
  train.append(training_examples_from_sentence_en_mul(dict(train_df.iloc[i])))

train[0]

[{'source': '>>run<< We need the accountability of funds given to you.',
  'target': 'Nitwenda embaririra ya sente ezimwaheirwe.',
  'target_language': 'Runyankole'},
 {'source': '>>ach<< We need the accountability of funds given to you.',
  'target': 'Wamito niang kit ma itiyo ki cene ma wamiyi',
  'target_language': 'Acholi'},
 {'source': '>>lgg<< We need the accountability of funds given to you.',
  'target': "Ale ki geri mini robia 'bani fe mi dri ri ayuzu ri ni.",
  'target_language': 'Lugbara'},
 {'source': '>>teo<< We need the accountability of funds given to you.',
  'target': 'Ikoto iso aitodunet na itwasamatere ikapun lu ijaikinio yes.',
  'target_language': 'Ateso'},
 {'source': '>>lug<< We need the accountability of funds given to you.',
  'target': "Twetaaga embalirira y'ensimbi ezakuweebwa.",
  'target_language': 'Luganda'}]

In [86]:
test = []
for i in range(len(test_df)):
  test.append(training_examples_from_sentence_en_mul(dict(test_df.iloc[i])))

test[0]

[{'source': '>>run<< My father was killed in the attack.',
  'target': 'Tata akaitirwa omu kurumbwa.',
  'target_language': 'Runyankole'},
 {'source': '>>ach<< My father was killed in the attack.',
  'target': 'Kineno wora I mony ne.',
  'target_language': 'Acholi'},
 {'source': '>>lgg<< My father was killed in the attack.',
  'target': "Ba 'di ma atinie'yo amvuta ndeniri ma alea.",
  'target_language': 'Lugbara'},
 {'source': '>>teo<< My father was killed in the attack.',
  'target': 'Aponi koyarai papaka kotoma ojie kangol.',
  'target_language': 'Ateso'},
 {'source': '>>lug<< My father was killed in the attack.',
  'target': 'Taata wange yafiira mu bulumbaganyi.',
  'target_language': 'Luganda'}]

In [87]:
val = []
for i in range(len(val_df)):
  val.append(training_examples_from_sentence_en_mul(dict(val_df.iloc[i])))

val[0]

[{'source': '>>run<< The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'target': 'Ebihandiikirwe nibyoreeka ku enshohoza ya Rwanda aha mahe yayeyongiire  kurumba omwaka enkumi ibiri ikumi na munaana.',
  'target_language': 'Runyankole'},
 {'source': '>>ach<< The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'target': 'Dul ngec meno waco ni Rwanda wel cente ma Rwanda tiyo kwede I lweny omede I mwaka alip aryo ki apar wiye aboro.',
  'target_language': 'Acholi'},
 {'source': '>>lgg<< The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'target': "O'duko nderi ece kini Rwanda ma aje afa marani ni 'diyi ma driari ma ongmbo tu alifu iri mudri drini arosi",
  'target_language': 'Lugbara'},
 {'source': '>>teo<< The data indicate that military expenditure in Rwanda increased by two thousand eighteen.',
  'target': 'Itodunitos akiro nuingadatai ebe abu etosomae loka i

In [88]:
train_df_final_2 = pd.DataFrame(list(chain.from_iterable(train)))
test_df_final_2 = pd.DataFrame(list(chain.from_iterable(test)))
val_df_final_2 = pd.DataFrame(list(chain.from_iterable(val)))

In [91]:
# Read the target languages from the en-mul DataFrame built in this part.
target_languages = list(train_df_final_2["target_language"].unique())
target_languages

['Runyankole', 'Acholi', 'Lugbara', 'Ateso', 'Luganda']

In [92]:
train_df_final_2["source"].to_csv("train.src", header=False, index=False, sep="\t", mode="a")
train_df_final_2["target"].to_csv("train.tgt", header=False, index=False, sep="\t", mode="a")

In [93]:
# Iterate over the target languages (the original loop reused source_languages from Part 1).
for language in target_languages:
    code = language_codes[language]
    test_subset = test_df_final_2[test_df_final_2["target_language"] == language]
    val_subset = val_df_final_2[val_df_final_2["target_language"] == language]
    test_subset["source"].to_csv(f"test_{code}.src", header=False, index=False, sep="\t", mode="a")
    test_subset["target"].to_csv(f"test_{code}.tgt", header=False, index=False, sep="\t", mode="a")
    val_subset["source"].to_csv(f"val_{code}.src", header=False, index=False, sep="\t", mode="a")
    val_subset["target"].to_csv(f"val_{code}.tgt", header=False, index=False, sep="\t", mode="a")

### AI4D

In [None]:
ai4d = []
for i in range(len(ai4d_df)):
  ai4d.append(training_examples_from_sentence_en_mul(ai4d_df.loc[i]))

ai4d[:5]

In [None]:
ai4d_dataset = pd.DataFrame(list(chain.from_iterable(ai4d)))
ai4d_dataset.head()

In [None]:
ai4d_dataset[["source"]].to_csv("train_ai4d.src", header=None, index=None, sep="\t", mode="a")
ai4d_dataset[["target"]].to_csv("train_ai4d.tgt", header=None, index=None, sep="\t", mode="a")

### Flores 101

In [None]:
flores = []
for i in range(len(flores_df)):
  flores.append(training_examples_from_sentence_en_mul(flores_df.loc[i]))

flores[:5]

In [None]:
flores_dataset = pd.DataFrame(list(chain.from_iterable(flores)))
flores_dataset.head(8)

In [None]:
flores_dataset[["source"]].to_csv("train_flores.src", header=None, index=None, sep="\t", mode="a")
flores_dataset[["target"]].to_csv("train_flores.tgt", header=None, index=None, sep="\t", mode="a")


### MT560

In [None]:
mt560_df[["source"]].to_csv("train_mt560.src", header=None, index=None, sep="\t", mode="a")
mt560_df[["target"]].to_csv("train_mt560.tgt", header=None, index=None, sep="\t", mode="a")
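
The cell above writes the MT560 columns in the same direction as Part 1 (other language as source, English as target). For an `en-mul` model one would presumably want the reverse, with the language token in front of the English text. A hedged sketch on a few in-memory rows follows. It assumes the `source_language` column is still available (it was dropped earlier in this notebook, so the CSV would need to be re-read) and that the codes stored in that column are the ones wanted as tags. The function name `mt560_to_en_mul` is hypothetical.

```python
# Rows shaped like mt560.csv before the source_language column is dropped.
rows = [
    {"source": "Muliraanwa wange y'ani?",
     "english": "Who really is my neighbor?", "source_language": "lug"},
    {"source": "Hera umo richo mogundho.",
     "english": 'In fact, "love covers a multitude of sins."', "source_language": "luo"},
]

def mt560_to_en_mul(rows):
    # English becomes the tagged source; the original sentence becomes the target.
    return [
        {"source": f">>{row['source_language']}<< {row['english']}",
         "target": row["source"],
         "target_language": row["source_language"]}
        for row in rows
    ]

for example in mt560_to_en_mul(rows):
    print(example["source"], "->", example["target"])
```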

### Create dataset folders and add dataset files

In [None]:
# -p avoids the "File exists" error, since this folder was already created in Part 1.
!mkdir -p multilingual-dataset

In [None]:
!mv {*.src,*.tgt} multilingual-dataset

In [None]:
!mkdir -p v7-dataset/v7.0/supervised/en-mul


In [None]:
!cp -v multilingual-dataset/*.{src,tgt} v7-dataset/v7.0/supervised/en-mul

### Zip dataset

In [None]:
# Zip Directory
!zip -r v7-dataset.zip v7-dataset/
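
Before zipping, it can be worth confirming that every `.src` file has a matching `.tgt` file with the same number of lines, since the two are read in parallel during training. A minimal sketch using `pathlib`; the function name `check_parallel_files` is an assumption, and the directory argument should point at whichever folder is being checked:

```python
from pathlib import Path

def count_lines(path):
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def check_parallel_files(directory):
    # Return {stem: (n_src, n_tgt)} for every pair whose line counts differ
    # or whose .tgt file is missing (reported as None).
    mismatches = {}
    for src in sorted(Path(directory).glob("*.src")):
        tgt = src.with_suffix(".tgt")
        n_src = count_lines(src)
        n_tgt = count_lines(tgt) if tgt.exists() else None
        if n_src != n_tgt:
            mismatches[src.stem] = (n_src, n_tgt)
    return mismatches
```

For example, `check_parallel_files("v7-dataset/v7.0/supervised/mul-en")` returns an empty dict when every pair is aligned.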