<a href="https://colab.research.google.com/github/SunbirdAI/parallel-text-EDA/blob/main/Prepare_supplementary_translation_data_(MT560%2BFLORES101%2BAI4D).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
import gzip
import io
from IPython import display
import numpy as np
import pandas as pd
import tqdm

!pip install sacremoses
import sacremoses
display.clear_output()

# Prepare MT560 data

This is a big dataset, around 35GB compressed, but only a small part of it covers the languages we are interested in. So first we find which lines have the language codes `lug` (Luganda), `ach` (Acholi), `nyn` (Runyankore), or `luo` (Dholuo).

In [None]:
!wget https://object.pouta.csc.fi/OPUS-MT560/train.v1.lang.gz

languages = pd.read_csv('train.v1.lang.gz', engine='c', names=['code'])
num_lines = len(languages)

language_codes = ['lug', 'ach', 'nyn', 'luo']
line_languages = {}
for code in language_codes:
  lines = np.where(languages.code == code)[0]
  for l in lines:
    line_languages[l] = code

display.clear_output()

for code in language_codes:
  N = sum(value == code for value in line_languages.values())
  print(f'{N} lines of language {code}')

224749 lines of language lug
73172 lines of language ach
50379 lines of language nyn
136625 lines of language luo


If that looks OK, then remove unnecessary files and variables to make space for iterating over the full dataset.

In [None]:
del languages
!rm train.v1.lang*
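
The line-number lookup built earlier can also be produced by streaming the language file, which avoids holding every line of it in a DataFrame. The following is only a minimal sketch: it assumes the file holds one language code per line (as the `read_csv` call above implies), and `index_lines_by_language` and the toy data are hypothetical names introduced for illustration.

```python
import gzip
import io


def index_lines_by_language(lang_file, wanted_codes):
    """Map line number -> language code, keeping only the wanted codes."""
    wanted = set(wanted_codes)
    line_languages = {}
    for i, raw in enumerate(lang_file):
        code = raw.decode('utf8').strip()
        if code in wanted:
            line_languages[i] = code
    return line_languages


# Toy stand-in for train.v1.lang.gz (assumed format: one code per line).
toy = io.BytesIO(gzip.compress(b'fra\nlug\nach\ndeu\nlug\n'))
with gzip.open(toy, 'rb') as f:
    toy_index = index_lines_by_language(f, ['lug', 'ach', 'nyn', 'luo'])

print(toy_index)  # {1: 'lug', 2: 'ach', 4: 'lug'}
```

The result has the same shape as `line_languages`, so the later cells could use it unchanged.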

Now retrieve the actual sentences. This should take ~30 minutes to download and ~50 minutes to iterate over.

In [None]:
!wget https://object.pouta.csc.fi/OPUS-MT560/train.v1.eng.tok.gz
!wget https://object.pouta.csc.fi/OPUS-MT560/train.v1.src.tok.gz
display.clear_output()

In [None]:
source = []
language = []

with gzip.open('train.v1.src.tok.gz','r') as f:
  for i, line in tqdm.tqdm(enumerate(f), total=num_lines):
    if i in line_languages:
      source.append(line)
      language.append(line_languages[i])

100%|██████████| 473791770/473791770 [25:59<00:00, 303872.81it/s]


In [None]:
english = [] 

with gzip.open('train.v1.eng.tok.gz','r') as f:
  for i, line in tqdm.tqdm(enumerate(f), total=num_lines):
    if i in line_languages:
      english.append(line)

100%|██████████| 473791770/473791770 [24:08<00:00, 327054.14it/s]
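
The two passes above read files that are aligned line by line, so the same selection could be made in one loop over both. This is a sketch under that alignment assumption, not the notebook's method; `extract_pairs` and the placeholder lines are hypothetical. Note that `zip` stops at the shorter file, so a length mismatch would go unnoticed here.

```python
import gzip
import io


def extract_pairs(src_file, eng_file, line_languages):
    """Collect (source, english, code) for the selected line numbers."""
    pairs = []
    for i, (src, eng) in enumerate(zip(src_file, eng_file)):
        code = line_languages.get(i)
        if code is not None:
            pairs.append((src.decode('utf8').rstrip('\n'),
                          eng.decode('utf8').rstrip('\n'),
                          code))
    return pairs


# Toy stand-ins for the two .tok.gz files, with placeholder content.
toy_src = io.BytesIO(gzip.compress(b'src 0\nsrc 1\nsrc 2\n'))
toy_eng = io.BytesIO(gzip.compress(b'eng 0\neng 1\neng 2\n'))
with gzip.open(toy_src, 'rb') as fs, gzip.open(toy_eng, 'rb') as fe:
    toy_pairs = extract_pairs(fs, fe, {1: 'lug', 2: 'ach'})

print(toy_pairs)  # [('src 1', 'eng 1', 'lug'), ('src 2', 'eng 2', 'ach')]
```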


Detokenize the text to remove the extra spaces that tokenization inserted around punctuation.

In [None]:
detokenizer = sacremoses.MosesDetokenizer(lang='en')
source_detokenized = []
english_detokenized = []
for i in tqdm.tqdm(range(len(source)), position=0):
  source_detokenized.append(
      detokenizer.detokenize([source[i].decode('utf8')]))
  english_detokenized.append(
      detokenizer.detokenize([english[i].decode('utf8')]))

100%|██████████| 484925/484925 [06:37<00:00, 1220.82it/s]


Create a CSV file with the results.

In [None]:
mt560 = pd.DataFrame()
mt560['source'] = source_detokenized
mt560['english'] = english_detokenized
mt560['source_language'] = language
mt560.to_csv('mt560.csv.gz', index=False, compression='gzip')
mt560.sample(n=10)

Unnamed: 0,source,english,source_language
395362,'Yueyo mar sabato pod odong' ne oganda Nyasaye...,"""There remains a sabbath resting for the peopl...",luo
455831,"(b) Kiki Yakuwa ky'atulabulako, era lwaki?",(b) What kinds of warnings does Jehovah offer ...,lug
177073,"Chiege Krista, ne owuoyo gi mor ahinya kaka Jo...","His wife, Krista, spoke fondly of being influe...",luo
22071,Abakristaayo ab'amagezi era abafaayo ku mbeera...,Wise Christians who care about their own spiri...,lug
371435,"Bwe kityo, ng'eyogera ku kuzuukira, King James...","Thus, describing the resurrection, the King Ja...",lug
386065,Dine bed ni Jehova ger kendo timo gik moko e y...,It is unlikely that Jesus would have felt that...,luo
362012,Gikawo okang 'mokwongo mondo' kuom hoch ma Nya...,"They reach out ""to comfort those in any sort o...",luo
7114,Tewali ggwanga lyandisubiddwa mukisa gwa kuwul...,No nation would miss out on hearing the good n...,lug
350597,"Mukene gitamo pi lok pa Paulo ni: ""Pingo gigen...","They may think of Paul's words: ""Why should my...",ach
73116,Gin mutimme i kom Moses miyowa pwony ma pire tek.,What happened to Moses teaches us this very im...,ach


Save the results in a Drive folder.

In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!cp mt560.csv.gz "/content/gdrive/Shareddrives/Sunbird AI/Projects/NLP Technology/Data/"

# Prepare FLORES101 data

About 2,000 professionally translated sentences (the `dev` and `devtest` splits) from the public FLORES101 dataset [[link]](https://github.com/facebookresearch/flores).

In [None]:
!wget https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz
!tar xvzf flores101_dataset.tar.gz
display.clear_output()

In [None]:
flores = {}
for language in ['lug', 'luo', 'eng']:
  with open(f'flores101_dataset/dev/{language}.dev') as f:
    dev_lines = f.readlines()
  with open(f'flores101_dataset/devtest/{language}.devtest') as f:
    devtest_lines = f.readlines()
  lines = dev_lines + devtest_lines
  lines = [l.replace('\n', '') for l in lines]
  flores[language] = lines
  
flores = pd.DataFrame(flores)
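
The DataFrame is only meaningful if every language file has the same number of lines in the same order. pandas will refuse columns of unequal length, but an explicit check can report which language is off. Below is a minimal sketch using a toy directory that mimics the `dev`/`devtest` layout; `load_split_lines` and `load_aligned` are hypothetical helpers, not part of the dataset's tooling.

```python
import os
import tempfile


def load_split_lines(root, language, splits=('dev', 'devtest')):
    """Read and concatenate the given splits for one language."""
    lines = []
    for split in splits:
        path = os.path.join(root, split, f'{language}.{split}')
        with open(path, encoding='utf8') as f:
            lines.extend(line.rstrip('\n') for line in f)
    return lines


def load_aligned(root, languages):
    """Load all languages and fail loudly if their line counts differ."""
    columns = {lang: load_split_lines(root, lang) for lang in languages}
    lengths = {lang: len(col) for lang, col in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f'Line counts differ across languages: {lengths}')
    return columns


# Build a toy directory with placeholder sentences, then load it.
with tempfile.TemporaryDirectory() as root:
    for split, n in [('dev', 2), ('devtest', 1)]:
        os.makedirs(os.path.join(root, split))
        for lang in ['lug', 'eng']:
            path = os.path.join(root, split, f'{lang}.{split}')
            with open(path, 'w', encoding='utf8') as f:
                f.writelines(f'{lang} {split} {k}\n' for k in range(n))
    toy_flores = load_aligned(root, ['lug', 'eng'])

print(toy_flores['eng'])  # ['eng dev 0', 'eng dev 1', 'eng devtest 0']
```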

In [None]:
flores.sample(n=10)

Unnamed: 0,lug,luo,eng
163,Elyaato erilwanyinya ebisoro ebikambwe lyabadd...,En achiel kwom meli mag kedo gi mbome mopandi ...,"An Avenger class mine countermeasures ship, th..."
1918,"Zino zisazibwa Norway, Sweden ne New Zealand, ...","Maye ipimo mana gi Norway, Sweden kod New Zeal...","This is matched by Norway, Sweden and New Zeal..."
455,Eno ensonga tekyalina makulu nga ebilawuli by’...,Mae bedo wach mathin ka jolos rang'i chopo e o...,This is becoming less of an issue as lens manu...
1098,"Wankubadde, omuwendo kubuli kikumi ku XDR -TB ...","Kata kamano, atamalo mar XDR-TB ei oganda duto...","However, the percentage of XDR-TB in the entir..."
1851,Kyefananyirizanga naye tekitera kwenyigirwamu ...,"En machiegini kode, to ok oting'o lony mag alp...",It is related to but usually not involving alp...
335,Omukyaala omwatikirivu Sezen Aksu owa Butuluki...,"Nyarber mapiny Turkey manyinge Sezen Aksu, not...",Turkish diva Sezen Aksu performed with the Ita...
1483,"Okwegadanga kwavira ddala kubya nnono, nga biv...",Tim hero ne nikod tenruok mathoth ahinya kod y...,Romanticism had a large element of cultural de...
1489,Emisomo ja Gothic jatutumuka nyoo mubiseera wa...,Yor Gothic nentiere eng'iende ekind senchuri m...,Gothic style peaked in the period between the ...
710,Waliyo empuku entono nyo kumpi n’entikko gyoli...,Nitie bur matin machiegini kod malo ma nyaka k...,There's a tiny cave near the top that must be ...
912,Tewali kyetaagisibwa nti offune omuwendo okuva...,Bende onge dwaro ni nyaka ibed kod namba mar g...,There is also no requirement that you obtain a...


In [None]:
flores.to_csv('flores101.csv.gz', index=False, compression='gzip')
!cp flores101.csv.gz "/content/gdrive/Shareddrives/Sunbird AI/Projects/NLP Technology/Data/"

# Prepare Makerere/AI4D Luganda data

"An English-Luganda parallel corpus" [[link]](https://zenodo.org/record/4764039), containing 15,000 Luganda-English sentence pairs.

In [None]:
!wget https://zenodo.org/record/4764039/files/Luganda.csv
display.clear_output()

Some bytes in the file are not valid UTF-8, so decode it with `errors='replace'` first and then parse the resulting text.

In [None]:
with open('Luganda.csv', encoding='utf-8', errors='replace') as f:
  lines = f.readlines()

ai4d_luganda = pd.read_csv(io.StringIO(''.join(lines)))
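
To see what `errors='replace'` does, the sketch below decodes a made-up two-line CSV containing one invalid byte. The sample bytes are hypothetical and not taken from `Luganda.csv`; the standard `csv` module stands in for `pd.read_csv` to keep the example dependency-free.

```python
import csv
import io

# 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence on its own.
raw = b'English,Luganda\nCaf\xe9 owner,placeholder\n'

# Invalid bytes become U+FFFD (the replacement character) instead of raising.
text = raw.decode('utf-8', errors='replace')
rows = list(csv.reader(io.StringIO(text)))
n_replaced = text.count('\ufffd')

print(rows[1])     # ['Caf\ufffd owner', 'placeholder']
print(n_replaced)  # 1
```

Counting U+FFFD after decoding gives a quick measure of how many characters were lost, which may help decide whether the affected rows are worth keeping.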

Keep only the sentence columns, drop incomplete rows, and rename the columns to the language codes used in the FLORES101 data above.

In [None]:
ai4d_luganda = ai4d_luganda[['English', 'Luganda']]
ai4d_luganda = ai4d_luganda.dropna()
ai4d_luganda = ai4d_luganda.rename(
    columns={'Luganda': 'lug', 'English': 'eng'})
ai4d_luganda.sample(n=10)

Unnamed: 0,eng,lug
13624,Children are a blessing from God.,Abaana mukisa okuva eri Katonda.
2037,A good working rerationship between the emplo...,Enkolagana ennungi wakati w'omukozi ne mukama ...
13161,There was a reduction in the dropout rate in t...,Omuwendo gw'abaava mu ssomero gwakendeera omwa...
6369,People dying from this disease are mostly from...,Abantu abafa obulwadde buno okusinga bava mu n...
5379,I encourage my ferlow youths to never particip...,Nkubiriza bavubuka bannange obutaddamu kwenyig...
8613,The Ministry of Health inspected hospitals in ...,Munisitule y'ebyobulamu yalambudde amalwaluro ...
12099,Mobile money is the simplest and easiest way o...,Enkola y'okusindikira ensimbi ku masimu y'ekya...
14276,Substance abuse leads to increased mental illn...,Okunywa ebiragala kiviirako obuzibu ku bwongo ...
4393,We are all created as one in God's image.,ffenna twatondebwa nga omuntu omu mu kifaanany...
6509,Vanilla is a cash crop grown in Uganda.,Vanilla kirime ekivaamu ssente mu Uganda.


Upload to the Drive folder.

In [None]:
ai4d_luganda.to_csv('ai4d_luganda.csv.gz', index=False, compression='gzip')
!cp ai4d_luganda.csv.gz "/content/gdrive/Shareddrives/Sunbird AI/Projects/NLP Technology/Data/"