## Compilation of `multilingual` training dataset for Sunbird language models

## Logic

#### V1: A model that translates any of the other languages to English (`mul-en`)
##### Source sentence: Any of the other languages
##### Target: English


#### V2: A model that translates English to any of the other languages (`en-mul`)
##### Source sentence: English
##### Target: Any of the other languages
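As a minimal sketch of the two variants, the snippet below builds both kinds of pairs from one shortened record. The helper names `mul_en_pairs` and `en_mul_pairs` are illustrative assumptions, and the `<to_...>` tag on the English source mirrors the tagging used in Part 2 rather than a confirmed `en-mul` input format.

```python
# Shortened record with the same shape as one line of the Sunbird JSONL file.
record = {
    "English": "Eggplants always grow best under warm conditions.",
    "Luganda": "Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu",
    "Acholi": "Bilinyanya pol kare dongo maber ka lyeto tye",
}

def mul_en_pairs(record, english="English"):
    """V1 (mul-en): every non-English sentence is a source, English is the target."""
    return [
        {"source": text, "target": record[english]}
        for lang, text in record.items()
        if lang != english
    ]

def en_mul_pairs(record, english="English"):
    """V2 (en-mul): English is the source, tagged with the requested target language."""
    return [
        {"source": f"<to_{lang}> {record[english]}", "target": text}
        for lang, text in record.items()
        if lang != english
    ]

print(mul_en_pairs(record))
print(en_mul_pairs(record))
```

Both helpers return one example per non-English language, so the two datasets have the same size and differ only in direction and tagging.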

In [145]:
# Import Python dependencies
import json
import pandas as pd
from itertools import chain
from sklearn.model_selection import train_test_split

In [None]:
# Download the raw Sunbird dataset if needed
!wget https://transfer.sh/AvcWgi/sunbird-ug-lang-v4.0.jsonl

### Part 1: Create the multilingual-to-English dataset (`mul-en`)

#### Training dataset creation logic (with an example from the Sunbird dataset)

In [146]:
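# Read the raw JSONL file: each line is one JSON object holding a sentence in every language.
# NOTE: the download cell above fetches v4.0 while this cell reads v5.0; point this at the file you actually have.
with open("sunbird-ug-lang-v5.0.jsonl", "r", encoding="utf-8") as f:
    sunbird_data = list(f)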

In [147]:
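# Wrap the raw lines in a DataFrame (a single column of unparsed JSON strings).
# It is only used below to count the rows; parsing happens later with json.loads.
sunbird_df = pd.DataFrame(sunbird_data)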
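`pd.DataFrame(sunbird_data)` wraps the raw JSON strings in a single column, which is enough for counting rows. If one column per language is wanted, each line can be parsed first. A minimal sketch, using an in-memory stand-in for the file built from two sentences quoted elsewhere in this notebook (the variable names are illustrative):

```python
import io
import json

# Stand-in for the JSONL file: one JSON object per line.
fake_file = io.StringIO(
    '{"English": "Eggplants always grow best under warm conditions.", '
    '"Luganda": "Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu"}\n'
    '{"English": "Farmland is sometimes a challenge to farmers.", '
    '"Luganda": "Ettaka ly\'okulimirako n\'okulundirako ebiseera ebimu kisoomooza abalimi"}\n'
)

# Skip blank lines and parse each remaining line into a dict.
records = [json.loads(line) for line in fake_file if line.strip()]

languages = sorted(records[0].keys())
print(len(records), languages)
```

Passing `records` to `pd.DataFrame` would then give one column per language instead of one column of raw strings.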

In [148]:
translated_sentence = json.loads(sunbird_data[0])
translated_sentence.keys()

dict_keys(['English', 'Luganda', 'Runyankole', 'Ateso', 'Lugbara', 'Acholi'])

In [149]:
translated_sentence

{'English': 'Eggplants always grow best under warm conditions.',
 'Luganda': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
 'Runyankole': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
 'Ateso': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
 'Lugbara': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
 'Acholi': 'Bilinyanya pol kare dongo maber ka lyeto tye'}

In [151]:
# Function to generate multiple training examples from one translated sentence.
def training_examples_from_sentence(translated_sentence,
                                    target_language = 'English'):
  if target_language not in translated_sentence:
    raise ValueError(
        f'Target language {target_language} expected in translations, but '
        f'{translated_sentence.keys()} found')

  source_languages = set(translated_sentence.keys())
  source_languages.remove(target_language)

  if not source_languages:
    raise ValueError('There should be at least one language apart from the '
                    'target.')

  training_examples = [{'source': translated_sentence[lang], 
                        'target': translated_sentence[target_language]}
                        for lang in source_languages
                      ]

  return training_examples

In [152]:
training_examples = training_examples_from_sentence(translated_sentence)
training_examples

[{'source': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Eggplants always grow best under warm conditions.'},
 {'source': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Eggplants always grow best under warm conditions.'}]
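Because `source_languages` is a `set` of strings, the order of the generated examples can differ between Python sessions, and the language of each source sentence is not recorded. A hedged variant (the name `tagged_examples` and the extra `source_language` field are assumptions, not part of the original pipeline) sorts the languages and keeps that information:

```python
def tagged_examples(translated_sentence, target_language="English"):
    """Like training_examples_from_sentence, but in sorted order and
    with the source language stored next to each example."""
    if target_language not in translated_sentence:
        raise ValueError(f"Target language {target_language} missing")
    return [
        {
            "source_language": lang,
            "source": translated_sentence[lang],
            "target": translated_sentence[target_language],
        }
        for lang in sorted(translated_sentence.keys())
        if lang != target_language
    ]

example = {
    "English": "Eggplants always grow best under warm conditions.",
    "Luganda": "Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu",
    "Acholi": "Bilinyanya pol kare dongo maber ka lyeto tye",
}
for row in tagged_examples(example):
    print(row["source_language"], "->", row["source"])
```

Keeping the language name makes it possible to report per-language counts or scores later without re-deriving them from the text.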

#### Application to the Sunbird dataset

In [153]:
# Check number of rows in dataset
len(sunbird_df)

25007

In [156]:
sunbird = []
for i in range(len(sunbird_df)):
  sunbird.append(training_examples_from_sentence(json.loads(sunbird_data[i])))

sunbird[:2]

[[{'source': "Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
   'target': 'Eggplants always grow best under warm conditions.'},
  {'source': 'Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
   'target': 'Eggplants always grow best under warm conditions.'},
  {'source': 'Bilinyanya pol kare dongo maber ka lyeto tye',
   'target': 'Eggplants always grow best under warm conditions.'},
  {'source': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
   'target': 'Eggplants always grow best under warm conditions.'},
  {'source': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
   'target': 'Eggplants always grow best under warm conditions.'}],
 [{'source': "Eitaka ry'okuhingamu, obumwe n'obumwe nirireetera abahingi oburemeezi.",
   'target': 'Farmland is sometimes a challenge to farmers.'},
  {'source': "Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi",
   'target': 'Farmland is sometimes a challenge to farmers.'},
  {'source': 'Ngom me

In [157]:
sunbird_dataset = pd.DataFrame(list(chain.from_iterable(sunbird)))

In [158]:
# Number of language pairs after creating the training examples
sunbird_dataset.shape

(125035, 2)

In [159]:
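# Train/test/validation split: hold out 33% of the pairs here, then split
# the held-out part evenly into test and validation in the next cell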
train_df, test_df = train_test_split(sunbird_dataset, test_size=0.33, random_state=42)

In [160]:
test_df, val_df = train_test_split(test_df, test_size=0.5, random_state=42)

In [161]:
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)

(83773, 2)
(20631, 2)
(20631, 2)
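One caveat with splitting at the pair level: the five pairs derived from one sentence share the same English target, so a random split can place the same target in both the training and the test set. If that matters for evaluation, splitting at the sentence level avoids it. A minimal standard-library sketch; the helper `split_by_sentence` is hypothetical and its proportions are chosen only to mirror the cells above:

```python
import random

def split_by_sentence(grouped_examples, holdout_fraction=0.33, seed=42):
    """Split a list of per-sentence example lists into train/test/val so that
    all examples from one sentence end up in exactly one split."""
    indices = list(range(len(grouped_examples)))
    random.Random(seed).shuffle(indices)

    n_holdout = int(round(len(indices) * holdout_fraction))
    n_test = n_holdout // 2
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_holdout]
    train_idx = indices[n_holdout:]

    def flatten(idx):
        return [ex for i in idx for ex in grouped_examples[i]]

    return flatten(train_idx), flatten(test_idx), flatten(val_idx)

# Toy input shaped like the `sunbird` list: one inner list per sentence.
toy = [
    [{"source": f"s{i}-{j}", "target": f"t{i}"} for j in range(5)]
    for i in range(100)
]
train, test, val = split_by_sentence(toy)
print(len(train), len(test), len(val))
```

With the real data, the `sunbird` list (one inner list per sentence) would take the place of `toy`.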


In [162]:
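# Write the full (unsplit) source and target sides, one sentence per line.
# A tab separator is used, as in the later cells: with sep=' ' every sentence
# containing a space would be wrapped in quotes by the CSV writer.
# mode='a' appends, so delete existing files before re-running this cell.
sunbird_dataset[["source"]].to_csv('other.src', header=False, index=False, sep='\t', mode='a')
sunbird_dataset[["target"]].to_csv('eng.tgt', header=False, index=False, sep='\t', mode='a')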

In [163]:
language_list = list(sunbird_dataset.columns)
language_codes = {
    "source": "src", "target": "tgt"
}

In [164]:
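# Write each split to a pair of parallel files: {train,test,val}.src and {train,test,val}.tgt.
# mode='a' appends, so remove old files before re-running to avoid duplicated lines.
for language in language_list:
    train_df[language].to_csv(f"train.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')
    test_df[language].to_csv(f"test.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')
    val_df[language].to_csv(f"val.{language_codes[language]}", header=False, index=False, sep='\t', mode='a')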
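Since `to_csv` applies CSV quoting rules (a field containing the separator or a quote character is wrapped in quotes), an alternative is to write one sentence per line directly and check that both sides stay aligned. A sketch under the assumption that a newline inside a sentence can safely be replaced by a space; `write_parallel` is a hypothetical helper, and the two examples are taken from the MT560 preview further down:

```python
import tempfile
from pathlib import Path

def write_parallel(examples, prefix, out_dir):
    """Write examples to <prefix>.src and <prefix>.tgt, one sentence per line."""
    out_dir = Path(out_dir)
    src_path = out_dir / f"{prefix}.src"
    tgt_path = out_dir / f"{prefix}.tgt"
    # Mode "w" overwrites, so re-running does not duplicate lines.
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for ex in examples:
            src.write(ex["source"].replace("\n", " ") + "\n")
            tgt.write(ex["target"].replace("\n", " ") + "\n")
    return src_path, tgt_path

examples = [
    {"source": "Muliraanwa wange y'ani?", "target": "Who really is my neighbor?"},
    {"source": "Hera umo richo mogundho.",
     "target": 'In fact, "love covers a multitude of sins."'},
]
with tempfile.TemporaryDirectory() as tmp:
    src_path, tgt_path = write_parallel(examples, "train", tmp)
    src_lines = src_path.read_text(encoding="utf-8").splitlines()
    tgt_lines = tgt_path.read_text(encoding="utf-8").splitlines()

# Both files must have the same number of lines to remain parallel.
assert len(src_lines) == len(tgt_lines) == len(examples)
print(tgt_lines[1])
```

The quoted English sentence is written unchanged, whereas the CSV writer would double its inner quotes and wrap the whole line in another pair.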

#### Application to the AI4D Luganda dataset

In [166]:
ai4d_df = pd.read_csv("ai4d_luganda.csv")
ai4d_df.head()

|   | eng | lug |
|---|---|---|
| 0 | All refugees were requested to register with t... | Abanoonyiboobubudamu bonna baasabiddwa beewand... |
| 1 | They called for a refugees' meeting yesterday. | Baayise olukungaana lw'abanoonyiboobubudamu eg... |
| 2 | Refugees had misunderstandings between thems... | Abanoonyiboobubudamu b'abadde n'obutakkaanya w... |
| 3 | We were urged to welcome refugees into our com... | Twakubirizibwa okwaniriza abanoonyiboobubudamu... |
| 4 | More development is achieved when we work toge... | Bwe tukolera awamu enkulaakulana enyingi efuni... |


In [167]:
ai4d_df.rename(columns={"eng": "English", "lug": "Luganda"}, inplace=True)
ai4d_df.columns

Index(['English', 'Luganda'], dtype='object')

In [168]:
ai4d = []
for i in range(len(ai4d_df)):
  ai4d.append(training_examples_from_sentence(ai4d_df.loc[i]))

ai4d[:5]

[[{'source': 'Abanoonyiboobubudamu bonna baasabiddwa beewandiise ewa ssentebe.',
   'target': 'All refugees were requested to register with the chairman.'}],
 [{'source': "Baayise olukungaana lw'abanoonyiboobubudamu eggulo.",
   'target': "They called for a refugees' meeting yesterday."}],
 [{'source': "Abanoonyiboobubudamu b'abadde n'obutakkaanya wakati waabwe.",
   'target': 'Refugees had misunderstandings between   themselves.'}],
 [{'source': 'Twakubirizibwa okwaniriza abanoonyiboobubudamu mu bitundu byaffe.',
   'target': 'We were urged to welcome refugees into our communities.'}],
 [{'source': 'Bwe tukolera awamu enkulaakulana enyingi efunibwa.',
   'target': 'More development is achieved when we work together.'}]]

In [169]:
ai4d_dataset = pd.DataFrame(list(chain.from_iterable(ai4d)))
ai4d_dataset.head()

|   | source | target |
|---|---|---|
| 0 | Abanoonyiboobubudamu bonna baasabiddwa beewand... | All refugees were requested to register with t... |
| 1 | Baayise olukungaana lw'abanoonyiboobubudamu eg... | They called for a refugees' meeting yesterday. |
| 2 | Abanoonyiboobubudamu b'abadde n'obutakkaanya w... | Refugees had misunderstandings between thems... |
| 3 | Twakubirizibwa okwaniriza abanoonyiboobubudamu... | We were urged to welcome refugees into our com... |
| 4 | Bwe tukolera awamu enkulaakulana enyingi efuni... | More development is achieved when we work toge... |


In [170]:
ai4d_dataset[["source"]].to_csv(r'train_ai4d.src', header=None, index=None, sep='\t', mode='a')
ai4d_dataset[["target"]].to_csv(r'train_ai4d.tgt', header=None, index=None, sep='\t', mode='a')

#### Application to the Flores 101 dataset

In [171]:
flores_df = pd.read_csv("flores101.csv")
flores_df.head()

|   | lug | luo | eng |
|---|---|---|---|
| 0 | Ku balaza, Banasayansi okuva mu setendekero ya... | Chieng' Wuoktich, josayans mawuok e Mbalariany... | On Monday, scientists from the Stanford Univer... |
| 1 | Abakulira abanoonyereza bagamba nti kino kijak... | Jononro motelo wachoni ma nyalo kelo fweny mac... | Lead researchers say this may bring early dete... |
| 2 | Aba JAS 39C Gripen basasanila mu luguudo ku sa... | Ndegeno mar JAS 39C Gripen ne ogore piny e nda... | The JAS 39C Gripen crashed onto a runway at ar... |
| 3 | Omuvuzi wenyonyi yategerekeka nga omukulembeze... | Jariemb ndegeno noyangi kaka Squadron Dilokrit... | The pilot was identified as Squadron Leader Di... |
| 4 | Amawulire agakuno galaga ekimotoka kyomuliro e... | Ute fwambo ma alwora no golo ripot ni gach neg... | Local media reports an airport fire vehicle ro... |


In [172]:
flores_df.rename(columns={"eng": "English", "lug": "Luganda", "luo": "Luo"}, inplace=True)
flores_df.columns

Index(['Luganda', 'Luo', 'English'], dtype='object')

In [173]:
flores = []
for i in range(len(flores_df)):
  flores.append(training_examples_from_sentence(flores_df.loc[i]))

flores[:5]

[[{'source': "Chieng' Wuoktich, josayans mawuok e Mbalariany mar Stanford e Skul mar Thieth nolando ni negifwenyo gimanyien mitiyogo e nono tuoche ma nyalo pogo ng'injo mag del kaluwore kod kitgi: en chip moro matin ma inyalo go chapa gi printa ma bende inyalo losi kitiyo kod printa mapile mag inkjet kwom manyalo romo otonglo achiel mar Amerka e moro ka moro.",
   'target': 'On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.'},
  {'source': "Ku balaza, Banasayansi okuva mu setendekero ya Stanford ku somero ly'ebyedagala balangirira okuvumbulwa kwa akuuma akakebera nga kasobola okusengeka obutafaali nga kasinzira kukika kyabwo: Akuuma katono akasobola okufulumizibwa ku lupapula akasobola okukolebwa ne Printa enungi ku sente entono nga emu eya US buli kamu.",
   'target': 'On M

In [174]:
flores_dataset = pd.DataFrame(list(chain.from_iterable(flores)))
flores_dataset.head(8)

|   | source | target |
|---|---|---|
| 0 | Chieng' Wuoktich, josayans mawuok e Mbalariany... | On Monday, scientists from the Stanford Univer... |
| 1 | Ku balaza, Banasayansi okuva mu setendekero ya... | On Monday, scientists from the Stanford Univer... |
| 2 | Jononro motelo wachoni ma nyalo kelo fweny mac... | Lead researchers say this may bring early dete... |
| 3 | Abakulira abanoonyereza bagamba nti kino kijak... | Lead researchers say this may bring early dete... |
| 4 | Ndegeno mar JAS 39C Gripen ne ogore piny e nda... | The JAS 39C Gripen crashed onto a runway at ar... |
| 5 | Aba JAS 39C Gripen basasanila mu luguudo ku sa... | The JAS 39C Gripen crashed onto a runway at ar... |
| 6 | Jariemb ndegeno noyangi kaka Squadron Dilokrit... | The pilot was identified as Squadron Leader Di... |
| 7 | Omuvuzi wenyonyi yategerekeka nga omukulembeze... | The pilot was identified as Squadron Leader Di... |


In [175]:
flores_dataset[["source"]].to_csv(r'train_flores.src', header=None, index=None, sep='\t', mode='a')
flores_dataset[["target"]].to_csv(r'train_flores.tgt', header=None, index=None, sep='\t', mode='a')

#### Application to the MT560 dataset

In [179]:
mt560_df = pd.read_csv("mt560.csv")
mt560_df.head(10)

|   | source | english | source_language |
|---|---|---|---|
| 0 | Beduru gi Kuwe kod Ji Duto | Adam and Eve - Were They Real People? | luo |
| 1 | Hera umo richo mogundho. | In fact, "love covers a multitude of sins." | luo |
| 2 | I mwaka me apar wiye angwen me loc pa kabaka K... | In the fourteenth year of King Hezekiah, Senna... | ach |
| 3 | Muliraanwa wange y'ani? | Who really is my neighbor? | lug |
| 4 | Notego wang'e kuom pokne. | He "looked intently toward the payment of the ... | luo |
| 5 | Okuva mu Nnimi Zonna | Out of All the Languages | lug |
| 6 | Omiyo wang 'chieng' mare wuok ni jo maricho ko... | He makes his sun rise upon wicked people and g... | luo |
| 7 | Yakuwa Ayagala Obwenkanya | Jehovah Is a Lover of Justice | lug |
| 8 | Yoleka Obwenkanya ng'Okola ku Nsonga Zange | See That I Get Justice | lug |
| 9 | "Akamwa kange kanaayogera amagezi; n'omutima g... | "The meditation of my heart will be of things ... | lug |


In [176]:
mt560_df["source_language"].unique()

array(['luo', 'ach', 'lug', 'nyn'], dtype=object)

In [181]:
mt560_df.drop(columns="source_language", inplace=True)
mt560_df.rename(columns={"english": "target"}, inplace=True)
mt560_df.head()

|   | source | target |
|---|---|---|
| 0 | Beduru gi Kuwe kod Ji Duto | Adam and Eve - Were They Real People? |
| 1 | Hera umo richo mogundho. | In fact, "love covers a multitude of sins." |
| 2 | I mwaka me apar wiye angwen me loc pa kabaka K... | In the fourteenth year of King Hezekiah, Senna... |
| 3 | Muliraanwa wange y'ani? | Who really is my neighbor? |
| 4 | Notego wang'e kuom pokne. | He "looked intently toward the payment of the ... |


In [182]:
mt560_df[["source"]].to_csv(r'train_mt560.src', header=None, index=None, sep='\t', mode='a')
mt560_df[["target"]].to_csv(r'train_mt560.tgt', header=None, index=None, sep='\t', mode='a')
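The `source_language` column is dropped before the files are written. If per-language statistics are wanted beforehand, the codes can be mapped to language names and counted. A minimal sketch on a few rows copied from the preview above; the mapping follows ISO 639-3 (`luo` Luo, `ach` Acholi, `lug` Luganda, `nyn` Runyankole), and the names `CODE_TO_NAME` and `rows` are illustrative:

```python
from collections import Counter

# ISO 639-3 codes found in the MT560 subset, mapped to language names.
CODE_TO_NAME = {"luo": "Luo", "ach": "Acholi", "lug": "Luganda", "nyn": "Runyankole"}

rows = [
    {"source": "Beduru gi Kuwe kod Ji Duto", "source_language": "luo"},
    {"source": "Hera umo richo mogundho.", "source_language": "luo"},
    {"source": "Muliraanwa wange y'ani?", "source_language": "lug"},
    {"source": "Okuva mu Nnimi Zonna", "source_language": "lug"},
    {"source": "Yakuwa Ayagala Obwenkanya", "source_language": "lug"},
]

# Count how many source sentences each language contributes.
counts = Counter(CODE_TO_NAME[row["source_language"]] for row in rows)
print(counts.most_common())
```

On the full DataFrame, `mt560_df["source_language"].value_counts()` would give the same kind of summary, provided it is run before the column is dropped.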

### Putting the dataset together

**Create initial dataset folder and add dataset files**


In [187]:
!mkdir -p multilingual-dataset



In [188]:
!mv {*.src,*.tgt} multilingual-dataset

zsh:1: no matches found: *.src


In [189]:
!ls multilingual-dataset/

eng.tgt          train.src        train_flores.src val.src
other.src        train.tgt        train_flores.tgt val.tgt
test.src         train_ai4d.src   train_mt560.src
test.tgt         train_ai4d.tgt   train_mt560.tgt


**Update dataset folder structure and create archive**


In [194]:
!mkdir -p v7-dataset/v7.0/supervised/mul-en

In [195]:
!cp -v multilingual-dataset/*.{src,tgt} v7-dataset/v7.0/supervised/mul-en

multilingual-dataset/other.src -> v7-dataset/v7.0/supervised/mul-en/other.src
multilingual-dataset/test.src -> v7-dataset/v7.0/supervised/mul-en/test.src
multilingual-dataset/train.src -> v7-dataset/v7.0/supervised/mul-en/train.src
multilingual-dataset/train_ai4d.src -> v7-dataset/v7.0/supervised/mul-en/train_ai4d.src
multilingual-dataset/train_flores.src -> v7-dataset/v7.0/supervised/mul-en/train_flores.src
multilingual-dataset/train_mt560.src -> v7-dataset/v7.0/supervised/mul-en/train_mt560.src
multilingual-dataset/val.src -> v7-dataset/v7.0/supervised/mul-en/val.src
multilingual-dataset/eng.tgt -> v7-dataset/v7.0/supervised/mul-en/eng.tgt
multilingual-dataset/test.tgt -> v7-dataset/v7.0/supervised/mul-en/test.tgt
multilingual-dataset/train.tgt -> v7-dataset/v7.0/supervised/mul-en/train.tgt
multilingual-dataset/train_ai4d.tgt -> v7-dataset/v7.0/supervised/mul-en/train_ai4d.tgt
multilingual-dataset/train_flores.tgt -> v7-dataset/v7.0/supervised/mul-en/train_flores.tgt
multilingual-dat

In [196]:
# Zip Directory
!zip -r v7-dataset.zip v7-dataset/

  adding: v7-dataset/ (stored 0%)
  adding: v7-dataset/v7.0/ (stored 0%)
  adding: v7-dataset/v7.0/supervised/ (stored 0%)
  adding: v7-dataset/v7.0/supervised/mul-en/ (stored 0%)
  adding: v7-dataset/v7.0/supervised/mul-en/train.tgt (deflated 64%)
  adding: v7-dataset/v7.0/supervised/mul-en/test.src (deflated 58%)
  adding: v7-dataset/v7.0/supervised/mul-en/val.src (deflated 58%)
  adding: v7-dataset/v7.0/supervised/mul-en/other.src (deflated 63%)
  adding: v7-dataset/v7.0/supervised/mul-en/train_mt560.tgt (deflated 61%)
  adding: v7-dataset/v7.0/supervised/mul-en/train_ai4d.src (deflated 69%)
  adding: v7-dataset/v7.0/supervised/mul-en/train_flores.src (deflated 59%)
  adding: v7-dataset/v7.0/supervised/mul-en/eng.tgt (deflated 91%)
  adding: v7-dataset/v7.0/supervised/mul-en/val.tgt (deflated 64%)
  adding: v7-dataset/v7.0/supervised/mul-en/test.tgt (deflated 64%)
  adding: v7-dataset/v7.0/supervised/mul-en/train.src (deflated 58%)
  adding: v7-dataset/v7.0/supervised/mul-en/train_m
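For environments without `zip` or a POSIX shell, the same folder layout and archive can be produced from Python. A sketch that uses placeholder files in a temporary directory; only the directory names are taken from the cells above:

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Placeholder for the flat dataset folder created earlier.
    flat_dir = root / "multilingual-dataset"
    flat_dir.mkdir(parents=True, exist_ok=True)  # behaves like `mkdir -p`
    (flat_dir / "train.src").write_text("source sentence\n", encoding="utf-8")
    (flat_dir / "train.tgt").write_text("target sentence\n", encoding="utf-8")

    # Copy every .src/.tgt file into the versioned layout.
    target_dir = root / "v7-dataset" / "v7.0" / "supervised" / "mul-en"
    target_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(flat_dir.iterdir()):
        if path.suffix in {".src", ".tgt"}:
            shutil.copy(path, target_dir / path.name)

    # Create v7-dataset.zip containing the v7-dataset/ directory tree.
    archive = shutil.make_archive(
        str(root / "v7-dataset"), "zip", root_dir=str(root), base_dir="v7-dataset"
    )
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

`exist_ok=True` makes the directory creation safe to re-run, which is the same effect as `mkdir -p`.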

### Part 2: Create English to all languages dataset (en-mul)

In [29]:
# Multi-lingual case: generate all examples of source and target language
def training_examples_from_sentence(translated_sentence):

  languages = set(translated_sentence.keys())

  if len(languages) < 2:
    raise ValueError("There must be at least two different languages, "
                     f"found {languages}")

  training_examples = []
  for target_language in languages:

    source_languages = languages.copy()
    source_languages.remove(target_language)

    for source_language in source_languages:
      source_text = (f"<to_{target_language}> "
                     f"{translated_sentence[source_language]}")
      target_text = translated_sentence[target_language]

      training_examples.append({'source': source_text, 
                                'target': target_text})
      
  return training_examples

In [30]:
len(training_examples_from_sentence(translated_sentence))

30
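The count of 30 is the number of ordered (source, target) language pairs: each of the six languages can be paired with the five others, giving 6 × 5 = 30. A quick check with `itertools.permutations`, using the language names from the dataset keys:

```python
from itertools import permutations

languages = ["English", "Luganda", "Runyankole", "Ateso", "Lugbara", "Acholi"]

# Ordered pairs (source_language, target_language) with source != target.
pairs = list(permutations(languages, 2))

examples_per_sentence = len(pairs)              # 6 * 5
total_examples = examples_per_sentence * 25007  # 25007 sentences in the Sunbird file
print(examples_per_sentence, total_examples)
```

The order matters here because translating Luganda to Acholi and Acholi to Luganda are separate training examples with different `<to_...>` tags.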

In [31]:
training_examples_from_sentence(translated_sentence)

[{'source': '<to_Lugbara> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Eggplants always grow best under warm conditions.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': "<to_Lugbara> Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Ateso> Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.'},
 {'source': '<to_Ateso> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Epoloi ebirinyanyi ojok apa

In [32]:
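# Create all pairs from the Sunbird dataset loaded in Part 1
# (sunbird_df and sunbird_data are the names defined there)

m = []
for i in range(len(sunbird_df)):
  m.append(training_examples_from_sentence(json.loads(sunbird_data[i])))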

In [33]:
m[0]

[{'source': '<to_Lugbara> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Epoloi ebirinyanyi ojok apakio nu emwanar akwap.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Bbiringanya lubeerera  asinga kukulira mu mbeera ya bugumu',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Lugbara> Eggplants always grow best under warm conditions.',
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': "<to_Lugbara> Entonga buriijo zikurira omu mbeera y'obwire erikutagata",
  'target': 'Birinyanya eyi zo kililiru ndeni angu driza ma alia.'},
 {'source': '<to_Ateso> Birinyanya eyi zo kililiru ndeni angu driza ma alia.',
  'target': 'Epoloi ebirinyanyi ojok apakio nu emwanar akwap.'},
 {'source': '<to_Ateso> Bilinyanya pol kare dongo maber ka lyeto tye',
  'target': 'Epoloi ebirinyanyi ojok apa

In [34]:
len(m[0])

30

In [35]:
len(m)

25007

In [36]:
len(m)*len(m[0])

750210

In [37]:
from itertools import chain
multi_dataset = pd.DataFrame(list(chain.from_iterable(m)))

In [38]:
multi_dataset.tail(5).values

array([['<to_Runyankole> Gameteni silingi eza angiri eli vusi nzila siza ma dria',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Gamente tiyo ki cente ma dwong adada me gero ki roco gudu.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Itosomai apugan ikapun luipu kanginikaru kotoma aiduk irotin.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> Gavumenti essaasaanya ssente nnyingi nnyo buli mwaka mu kuzimba amakubo.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.'],
       ['<to_Runyankole> The government spends a lot of money every year on road construction.',
        'Gavumenti neeshohoreza munonga omukwombeka enguuto eibara-mwaka.']],
      dtype=object)
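Each source string now carries its target language as a `<to_...>` prefix. For inspection or per-language counts, the tag can be separated from the sentence again. A small sketch; the helper `split_tag` and its regular expression are assumptions based on the format visible in the output above:

```python
import re
from collections import Counter

# Matches "<to_Language> sentence" as produced by training_examples_from_sentence.
TAG_PATTERN = re.compile(r"^<to_(?P<lang>[^>]+)> (?P<text>.*)$")

def split_tag(source):
    """Return (target_language, untagged_sentence) for a tagged source string."""
    match = TAG_PATTERN.match(source)
    if match is None:
        raise ValueError(f"No <to_...> tag found in: {source!r}")
    return match["lang"], match["text"]

sources = [
    "<to_Runyankole> The government spends a lot of money every year on road construction.",
    "<to_Lugbara> Eggplants always grow best under warm conditions.",
    "<to_Lugbara> Bilinyanya pol kare dongo maber ka lyeto tye",
]
per_target = Counter(split_tag(s)[0] for s in sources)
print(split_tag(sources[0]))
print(per_target)
```

Applied to the `source` column of `multi_dataset`, the same helper would show whether every target language receives the expected number of examples.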