## Compilation of `multilingual` training dataset for Sunbird language models

## Logic

#### V1: A model that translates any of the other languages to English (`mul-en`)
##### Source sentence: Any of the other languages
##### Target: English


#### V2: A model that translates English to any of the other languages (`en-mul`)
##### Source sentence: English
##### Target: Any of the other languages

In [49]:
# Import Python dependencies
import json
import pandas as pd

In [None]:
# Download the raw Sunbird dataset if needed
!wget https://transfer.sh/AvcWgi/sunbird-ug-lang-v4.0.jsonl

### Part 1: Create Multi-lingual to English target dataset (mul-en)

#### Training dataset creation logic (with an example from the Sunbird dataset)

In [50]:
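# Note: the download cell above fetches v4.0; adjust the filename to match the file you have locally.
with open("sunbird-ug-lang-v5.0.jsonl", "r", encoding="utf-8") as f:
    data = list(f)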

In [51]:
# Parse each JSON line and convert the dataset to a DataFrame (one column per language)
df = pd.DataFrame([json.loads(line) for line in data])

In [52]:
translated_sentence = json.loads(data[0])
translated_sentence.keys()

dict_keys(['English', 'Luganda', 'Runyankole', 'Ateso', 'Lugbara', 'Acholi'])

In [53]:
translated_sentence

{'English': 'Eggplants always grow best under warm conditions.',
 'Luganda': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
 'Runyankole': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
 'Ateso': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
 'Lugbara': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
 'Acholi': 'Bilinyanya pol kare dongo maber ka lyeto tye'}

In [26]:
# Function to generate multiple training examples from one translated sentence.
def training_examples_from_sentence(translated_sentence,
                                    target_language = 'English'):
  if target_language not in translated_sentence:
    raise ValueError(
        f'Target language {target_language} expected in translations, but '
        f'{translated_sentence.keys()} found')

  source_languages = set(translated_sentence.keys())
  source_languages.remove(target_language)

  if not source_languages:
    raise ValueError('There should be at least one language apart from the '
                    'target.')

  training_examples = [{'source': translated_sentence[lang], 
                        'target': translated_sentence[target_language]}
                        for lang in source_languages
                      ]

  return training_examples

In [54]:
training_examples = training_examples_from_sentence(translated_sentence)

In [55]:
training_examples

[{'source': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Eggplants always grow best under warm conditions.'}]
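Because `source_languages` is a set, the order of the generated examples can differ between runs. If a reproducible order matters, one option is to iterate over the keys in their original order instead. The sketch below illustrates this; the helper name `ordered_training_examples` is hypothetical and not part of the original notebook.

```python
def ordered_training_examples(translated_sentence, target_language="English"):
    """Build source/target pairs, keeping the key order of the input dict."""
    if target_language not in translated_sentence:
        raise ValueError(f"Target language {target_language} not found")
    target_text = translated_sentence[target_language]
    return [
        {"source": text, "target": target_text}
        for lang, text in translated_sentence.items()
        if lang != target_language
    ]

sentence = {
    "English": "Eggplants always grow best under warm conditions.",
    "Luganda": "Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu",
    "Acholi": "Bilinyanya pol kare dongo maber ka lyeto tye",
}
ordered = ordered_training_examples(sentence)
print([example["source"] for example in ordered])
```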

#### Application to the Sunbird dataset

In [56]:
# Check number of rows in dataset
len(df)

25007

In [57]:
# Build the training examples for every sentence in the dataset
c = [training_examples_from_sentence(json.loads(line)) for line in data]

In [58]:
from itertools import chain
dataset = pd.DataFrame(list(chain.from_iterable(c)))

In [59]:
# Number of language pairs after creating the training examples
dataset.shape

(125035, 2)
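The row count can be checked by hand: each sentence has six languages, so it yields five source-to-English pairs. A small sketch of that arithmetic, using the counts printed above:

```python
n_sentences = 25007               # rows in the Sunbird file, from len(df)
n_languages = 6                   # English plus five other languages
pairs_per_sentence = n_languages - 1

expected_rows = n_sentences * pairs_per_sentence
print(expected_rows)  # 125035
```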

In [60]:
# Train/test/val split: hold out 33% of the rows, then halve the hold-out into test and validation

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(dataset, test_size=0.33, random_state=42)

In [61]:
test_df, val_df = train_test_split(test_df, test_size=0.5, random_state=42)

In [62]:
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)

(83773, 2)
(20631, 2)
(20631, 2)
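The shapes above correspond to roughly 67% / 16.5% / 16.5% of the data. The sketch below reproduces the sizes with plain arithmetic. It assumes the held-out portion is rounded up at each split, which matches the numbers printed above.

```python
import math

n_total = 125035

# First split: 33% held out, the rest is training data
n_holdout = math.ceil(0.33 * n_total)
n_train = n_total - n_holdout

# Second split: the hold-out is halved into test and validation
n_val = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_val

print(n_train, n_test, n_val)  # 83773 20631 20631
```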


In [63]:
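# Write the full (unsplit) dataset. mode='a' appends, so remove old files before re-running this cell.
dataset[["source"]].to_csv('other.src', header=False, index=False, sep=' ', mode='a')
dataset[["target"]].to_csv('eng.tgt', header=False, index=False, sep=' ', mode='a')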
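One caveat with `to_csv` here: with default quoting, any field that contains the separator is wrapped in double quotes, so with `sep=' '` most sentences would be written with surrounding quotes. If plain one-sentence-per-line files are wanted, a possible alternative is to write the lines directly. The helper `write_lines` and the file name `example.src` below are hypothetical.

```python
import tempfile
from pathlib import Path

def write_lines(sentences, path):
    """Write one sentence per line as plain UTF-8 text (no CSV quoting)."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # Replace embedded line breaks so each sentence stays on one line
            f.write(str(sentence).replace("\r", " ").replace("\n", " ") + "\n")

examples = [
    {"source": "Bilinyanya pol kare dongo maber ka lyeto tye",
     "target": "Eggplants always grow best under warm conditions."},
]

# Demo in a temporary folder so no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = Path(tmp_dir) / "example.src"
    write_lines([e["source"] for e in examples], src_path)
    written = src_path.read_text(encoding="utf-8")

print(written)
```

With the notebook's DataFrame, the call would look like `write_lines(dataset["source"], "other.src")`.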

#### Application to the AI4D Luganda dataset

In [66]:
ai4d_df = pd.read_csv("ai4d_luganda.csv")
ai4d_df.head()

|   | eng | lug |
|---|---|---|
| 0 | All refugees were requested to register with t... | Abanoonyiboobubudamu bonna baasabiddwa beewand... |
| 1 | They called for a refugees' meeting yesterday. | Baayise olukungaana lw'abanoonyiboobubudamu eg... |
| 2 | Refugees had misunderstandings between thems... | Abanoonyiboobubudamu b'abadde n'obutakkaanya w... |
| 3 | We were urged to welcome refugees into our com... | Twakubirizibwa okwaniriza abanoonyiboobubudamu... |
| 4 | More development is achieved when we work toge... | Bwe tukolera awamu enkulaakulana enyingi efuni... |


In [67]:
ai4d_df.rename(columns={"eng": "English", "lug": "Luganda"}, inplace=True)
ai4d_df.columns

Index(['English', 'Luganda'], dtype='object')

In [68]:
ai4d = []
for i in range(len(ai4d_df)):
  ai4d.append(training_examples_from_sentence(ai4d_df.loc[i])[0])

ai4d

[{'source': 'Abanoonyiboobubudamu bonna baasabiddwa beewandiise ewa ssentebe.',
  'target': 'All refugees were requested to register with the chairman.'},
 {'source': "Baayise olukungaana lw'abanoonyiboobubudamu eggulo.",
  'target': "They called for a refugees' meeting yesterday."},
 {'source': "Abanoonyiboobubudamu b'abadde n'obutakkaanya wakati waabwe.",
  'target': 'Refugees had misunderstandings between   themselves.'},
 {'source': 'Twakubirizibwa okwaniriza abanoonyiboobubudamu mu bitundu byaffe.',
  'target': 'We were urged to welcome refugees into our communities.'},
 {'source': 'Bwe tukolera awamu enkulaakulana enyingi efunibwa.',
  'target': 'More development is achieved when we work together.'},
 {'source': 'Disitulikiti eziriraanye ensalo si ntebenkevu.',
  'target': 'The border districts are insecure.'},
 {'source': 'Abanoonyiboobubudamu batandise okulima okusobola okwebeezaawo.',
  'target': 'Refugees have started practicing farming so as to earn a living.'},
 {'source': 

#### Application to the Flores 101 dataset

In [69]:
flores_df = pd.read_csv("flores101.csv")
flores_df.head()

|   | lug | luo | eng |
|---|---|---|---|
| 0 | Ku balaza, Banasayansi okuva mu setendekero ya... | Chieng' Wuoktich, josayans mawuok e Mbalariany... | On Monday, scientists from the Stanford Univer... |
| 1 | Abakulira abanoonyereza bagamba nti kino kijak... | Jononro motelo wachoni ma nyalo kelo fweny mac... | Lead researchers say this may bring early dete... |
| 2 | Aba JAS 39C Gripen basasanila mu luguudo ku sa... | Ndegeno mar JAS 39C Gripen ne ogore piny e nda... | The JAS 39C Gripen crashed onto a runway at ar... |
| 3 | Omuvuzi wenyonyi yategerekeka nga omukulembeze... | Jariemb ndegeno noyangi kaka Squadron Dilokrit... | The pilot was identified as Squadron Leader Di... |
| 4 | Amawulire agakuno galaga ekimotoka kyomuliro e... | Ute fwambo ma alwora no golo ripot ni gach neg... | Local media reports an airport fire vehicle ro... |


In [70]:
flores_df.rename(columns={"eng": "English", "lug": "Luganda", "luo": "Luo"}, inplace=True)
flores_df.columns

Index(['Luganda', 'Luo', 'English'], dtype='object')

In [71]:
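# Note: each row has two source languages (Luganda and Luo), so the function
# returns two examples per row; [0] keeps only one of them.
flores = []
for i in range(len(flores_df)):
  flores.append(training_examples_from_sentence(flores_df.loc[i])[0])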

flores

[{'source': "Chieng' Wuoktich, josayans mawuok e Mbalariany mar Stanford e Skul mar Thieth nolando ni negifwenyo gimanyien mitiyogo e nono tuoche ma nyalo pogo ng'injo mag del kaluwore kod kitgi: en chip moro matin ma inyalo go chapa gi printa ma bende inyalo losi kitiyo kod printa mapile mag inkjet kwom manyalo romo otonglo achiel mar Amerka e moro ka moro.",
  'target': 'On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.'},
 {'source': 'Jononro motelo wachoni ma nyalo kelo fweny machon mag tuoche kaka kansa, tibi, ayaki, gi maleria ne jotuo manie epinje ma yutogi tin, kama kuo mar tuo kaka kansa mar thuno, nyalo bedo nus ne mar pinje man-gi yuto man malo.',
  'target': 'Lead researchers say this may bring early detection of cancer, tuberculosis, HIV and malaria to patients 
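The output above contains only Luo sources: with three columns, the function returns two examples per row and `[0]` discards one of them. If both the Luganda and the Luo examples are wanted, one option is to extend the list with every example instead. This is a sketch, not the notebook's method; `examples_for_row` is a hypothetical stand-in for the Part 1 function, and the sentences are the truncated strings shown by `flores_df.head()`.

```python
def examples_for_row(row, target_language="English"):
    """Return one source/target pair per non-target language in the row."""
    return [
        {"source": text, "target": row[target_language]}
        for lang, text in row.items()
        if lang != target_language
    ]

rows = [  # truncated text, as displayed in the table above
    {"Luganda": "Ku balaza, Banasayansi okuva mu setendekero ya...",
     "Luo": "Chieng' Wuoktich, josayans mawuok e Mbalariany...",
     "English": "On Monday, scientists from the Stanford Univer..."},
]

flores_all = []
for row in rows:
    # extend keeps every example; append(...)[0] would keep only the first
    flores_all.extend(examples_for_row(row))

print(len(flores_all))  # 2
```

With the real DataFrame, the rows could be obtained with `flores_df.to_dict("records")`.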

#### Application to the MT560 dataset

In [72]:
mt560_df = pd.read_csv("mt560.csv")
mt560_df.head()

|   | source | english | source_language |
|---|---|---|---|
| 0 | Beduru gi Kuwe kod Ji Duto | Adam and Eve - Were They Real People? | luo |
| 1 | Hera umo richo mogundho. | In fact, "love covers a multitude of sins." | luo |
| 2 | I mwaka me apar wiye angwen me loc pa kabaka K... | In the fourteenth year of King Hezekiah, Senna... | ach |
| 3 | Muliraanwa wange y'ani? | Who really is my neighbor? | lug |
| 4 | Notego wang'e kuom pokne. | He "looked intently toward the payment of the ... | luo |


In [74]:
# mt560_df.rename(columns={"eng": "English", "lug": "Luganda", "luo": "Luo"}, inplace=True)
# mt560_df.columns
mt560_df["source_language"].unique()

array(['luo', 'ach', 'lug', 'nyn'], dtype=object)

In [None]:
# mt560 = []
# for i in range(len(mt560_df)):
#   mt560.append(training_examples_from_sentence(mt560_df.loc[i])[0])

# mt560
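The commented-out loop would not work as written, because MT560 is already in long format (`source`, `english`, `source_language`) and has no `English` key for `training_examples_from_sentence` to find. A possible conversion, sketched under that assumption, maps each row directly to the source/target format. The helper name `mt560_row_to_example` is hypothetical.

```python
def mt560_row_to_example(row):
    """Map an MT560-style row to the source/target format used above."""
    return {"source": row["source"], "target": row["english"]}

rows = [  # two rows copied from the table above
    {"source": "Hera umo richo mogundho.",
     "english": 'In fact, "love covers a multitude of sins."',
     "source_language": "luo"},
    {"source": "Muliraanwa wange y'ani?",
     "english": "Who really is my neighbor?",
     "source_language": "lug"},
]

mt560_examples = [mt560_row_to_example(row) for row in rows]
print(mt560_examples[1])
```

With the real DataFrame, the rows could be obtained with `mt560_df.to_dict("records")`.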

### Putting the dataset together

#### Create the .txt files needed for the training dataset

In [19]:
language_list = list(dataset.columns)
language_codes = {
    "source": "src", "target": "tgt"
}

In [20]:
for language in language_list:
    train_df[language].to_csv(f"train.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')
    test_df[language].to_csv(f"test.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')
    val_df[language].to_csv(f"val.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')
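Since the files are opened with `mode='a'`, re-running the cell appends to existing files, and the `.src` and `.tgt` files can drift out of alignment. A quick sanity check is to compare line counts for each pair. The sketch below uses hypothetical helper names and temporary demo files.

```python
import tempfile
from pathlib import Path

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(src_path, tgt_path):
    """Raise if the source and target files have different line counts."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path} has {n_src} lines, {tgt_path} has {n_tgt}")
    return n_src

# Demo with two small temporary files
with tempfile.TemporaryDirectory() as tmp_dir:
    src = Path(tmp_dir) / "demo.src"
    tgt = Path(tmp_dir) / "demo.tgt"
    src.write_text("Muliraanwa wange y'ani?\nHera umo richo mogundho.\n", encoding="utf-8")
    tgt.write_text("Who really is my neighbor?\n"
                   'In fact, "love covers a multitude of sins."\n', encoding="utf-8")
    n_pairs = check_parallel(src, tgt)

print(n_pairs)  # 2
```

In the notebook, the same check could be run as `check_parallel("train.src", "train.tgt")`, and likewise for the test and validation files.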

**Create initial dataset folder and add dataset files**


In [21]:
!mkdir multilingual-dataset


In [22]:
!mv {*.src,*.tgt} multilingual-dataset

In [23]:
!ls multilingual-dataset/


eng.tgt   other.src test.src  test.tgt  train.src train.tgt val.src   val.tgt


**Update dataset folder structure and create archive**


In [24]:
# -p creates any missing parent folders and does not fail if the folder already exists
!mkdir -p v6-dataset/v6.0/supervised/src-tgt

In [25]:
!cp -v multilingual-dataset/*.{src,tgt} v6-dataset/v6.0/supervised/src-tgt

multilingual-dataset/other.src -> v6-dataset/v6.0/supervised/src-tgt/other.src
multilingual-dataset/test.src -> v6-dataset/v6.0/supervised/src-tgt/test.src
multilingual-dataset/train.src -> v6-dataset/v6.0/supervised/src-tgt/train.src
multilingual-dataset/val.src -> v6-dataset/v6.0/supervised/src-tgt/val.src
multilingual-dataset/eng.tgt -> v6-dataset/v6.0/supervised/src-tgt/eng.tgt
multilingual-dataset/test.tgt -> v6-dataset/v6.0/supervised/src-tgt/test.tgt
multilingual-dataset/train.tgt -> v6-dataset/v6.0/supervised/src-tgt/train.tgt
multilingual-dataset/val.tgt -> v6-dataset/v6.0/supervised/src-tgt/val.tgt


In [27]:
# Zip Directory
!zip -r v6-dataset.zip v6-dataset/


  adding: v6-dataset/ (stored 0%)
  adding: v6-dataset/v6.0/ (stored 0%)
  adding: v6-dataset/v6.0/supervised/ (stored 0%)
  adding: v6-dataset/v6.0/supervised/en-run/ (stored 0%)
  adding: v6-dataset/v6.0/supervised/en-run/test.run (deflated 64%)
  adding: v6-dataset/v6.0/supervised/en-run/val.run (deflated 65%)
  adding: v6-dataset/v6.0/supervised/en-run/train.en (deflated 64%)
  adding: v6-dataset/v6.0/supervised/en-run/val.en (deflated 63%)
  adding: v6-dataset/v6.0/supervised/en-run/train.run (deflated 65%)
  adding: v6-dataset/v6.0/supervised/en-run/test.en (deflated 63%)
  adding: v6-dataset/v6.0/supervised/lug-teo/ (stored 0%)
  adding: v6-dataset/v6.0/supervised/lug-teo/train.lug (deflated 66%)
  adding: v6-dataset/v6.0/supervised/lug-teo/val.teo (deflated 66%)
  adding: v6-dataset/v6.0/supervised/lug-teo/test.teo (deflated 66%)
  adding: v6-dataset/v6.0/supervised/lug-teo/val.lug (deflated 65%)
  adding: v6-dataset/v6.0/supervised/lug-teo/test.lug (deflated 65%)
  adding: v6-

### Part 2: Create English to all languages dataset (en-mul)

In [29]:
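# Multilingual case: generate one example for every ordered (source, target) language pair.
# The source text is prefixed with a <to_{target_language}> tag. This redefines the Part 1 function.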
def training_examples_from_sentence(translated_sentence):

  languages = set(translated_sentence.keys())

  if len(languages) < 2:
    raise ValueError("There must be at least two different languages, "
                     f"found {languages}")

  training_examples = []
  for target_language in languages:

    source_languages = languages.copy()
    source_languages.remove(target_language)

    for source_language in source_languages:
      source_text = (f"<to_{target_language}> "
                     f"{translated_sentence[source_language]}")
      target_text = translated_sentence[target_language]

      training_examples.append({'source': source_text, 
                                'target': target_text})
      
  return training_examples

In [30]:
len(training_examples_from_sentence(translated_sentence))

30
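The count of 30 follows from the number of ordered pairs of six languages, 6 × 5. A short sketch using `itertools.permutations` on the language names listed earlier:

```python
from itertools import permutations

languages = ["English", "Luganda", "Runyankole", "Ateso", "Lugbara", "Acholi"]

# Every ordered (source, target) pair with source != target
ordered_pairs = list(permutations(languages, 2))
print(len(ordered_pairs))  # 30
```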

In [31]:
training_examples_from_sentence(translated_sentence)

[{'source': '<to_Lugbara> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Eggplants always grow best under warm conditions.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': "<to_Lugbara> Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Ateso> Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.'},
 {'source': '<to_Ateso> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Epoloi ebirinyanyi ojok apa

In [32]:
# Create all pairs from dataset

m = [training_examples_from_sentence(json.loads(line)) for line in data]

In [33]:
m[0]

[{'source': '<to_Lugbara> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Eggplants always grow best under warm conditions.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': "<to_Lugbara> Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Ateso> Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.'},
 {'source': '<to_Ateso> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Epoloi ebirinyanyi ojok apa

In [34]:
len(m[0])

30

In [35]:
len(m)

25007

In [36]:
len(m)*len(m[0])

750210

In [37]:
from itertools import chain
multi_dataset = pd.DataFrame(list(chain.from_iterable(m)))

In [38]:
multi_dataset.tail(5).values

array([['<to_Runyankole> Gameteni silingi eza angiri eli vusi nzila siza ma dria',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Gamente tiyo ki cente ma dwong adada me gero ki roco gudu.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Itosomai apugan ikapun luipu kanginikaru kotoma aiduk irotin.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Gavumenti essaasaanya ssente nnyingi nnyo buli mwaka mu kuzimba amakubo.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> The government spends a lot of money every year on road construction.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.']],
      dtype=object)
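Note that the function above produces every ordered language pair, so `multi_dataset` is many-to-many rather than strictly English-to-many. If only the `en-mul` direction described under V2 is wanted, one possible approach is to generate examples with English as the only source. The sketch below assumes the same `<to_{language}>` tag format; the helper name `english_to_many_examples` is hypothetical.

```python
def english_to_many_examples(translated_sentence, source_language="English"):
    """Generate tagged examples with English as the only source language."""
    if source_language not in translated_sentence:
        raise ValueError(f"Source language {source_language} not found")
    source_text = translated_sentence[source_language]
    return [
        {"source": f"<to_{lang}> {source_text}", "target": text}
        for lang, text in translated_sentence.items()
        if lang != source_language
    ]

sentence = {
    "English": "Eggplants always grow best under warm conditions.",
    "Lugbara": "Birinyanya eyi zo kililiru ndeni angu driza ma alia.",
    "Acholi": "Bilinyanya pol kare dongo maber ka lyeto tye",
}
en_mul = english_to_many_examples(sentence)
print(en_mul[0])
```

For the six-language Sunbird sentences this would give five examples per sentence instead of thirty.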