# Conversion of machine translation datasets to SALT v2 format

This notebook converts existing Igbo machine translation datasets from several sources into the SALT v2 format.

The new format is .jsonl, with one JSON record per line. Each record looks like this (wrapped here for readability): \
`
    {"text": {"ibo": "Anas na-agwa anyị na ọ zaghachiri, sị, \"Ndị niile na-atụ egwu Allah.\"",
  "eng": "Anas tells us that he replied, \"All those who fear Allah.\""}}
`
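As a minimal sketch of the format, the helpers below write records to a .jsonl file and read them back. The names `write_jsonl` and `read_jsonl` are hypothetical and not part of SALT; the example sentences are taken from the data shown in this notebook.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, 'w', encoding='utf-8') as outfile:
        for record in records:
            outfile.write(json.dumps(record, ensure_ascii=False) + '\n')

def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as infile:
        return [json.loads(line) for line in infile if line.strip()]

records = [
    {'text': {'ibo': 'Anas na-agwa anyị na ọ zaghachiri, sị, "Ndị niile na-atụ egwu Allah."',
              'eng': 'Anas tells us that he replied, "All those who fear Allah."'}},
    {'text': {'ibo': 'Ha ga-agakwuru, ha ga-hụ ebube m.',
              'eng': 'They will come, and they will see my glory.'}},
]

# Round-trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.jsonl')
    write_jsonl(records, path)
    loaded = read_jsonl(path)
```

Passing `ensure_ascii=False` together with an explicit UTF-8 encoding keeps Igbo characters such as `ị` and `ọ` readable in the output file instead of escaping them.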

In [60]:
from IPython import display
import numpy as np
import pandas as pd
# from sklearn.model_selection import train_test_split
import os
import random
import json
import glob
import requests
import gzip
from tqdm import tqdm

In [61]:
# define paths
OUTPUT_DIR = 'salt-translation-plus-external-datasets/'
os.makedirs(OUTPUT_DIR, exist_ok=True)

temp_dir = "temp_dir/"
os.makedirs(temp_dir, exist_ok=True)

DATA_DIR = 'v7-dataset/v7.0/supervised/'

In [50]:
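def file_to_list(path):
    # read as UTF-8 so Igbo diacritics load the same way on every platform
    with open(path, encoding='utf-8') as file:
        return [line.rstrip() for line in file]

def url_to_list(url):
    response = requests.get(url)
    # fail loudly on HTTP errors instead of treating an error page as sentences
    response.raise_for_status()
    return response.text.splitlines()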

## Test and Dev Data

TODO: Use the existing SALT dev and test data, translated into Igbo.

Pricing for Igbo Translation can be found [here](https://docs.google.com/document/d/1BwSw8CCm9q71iZ7vuMpeOckDFdPNhwoC6bJ-BZNE_bU/edit?usp=sharing).

In [None]:
# ### DOWNLOAD SALT DATA
# !wget https://sunbird-translate.s3.us-east-2.amazonaws.com/salt-translation-plus-external-datasets.zip
# !unzip salt-translation-plus-external-datasets.zip

In [7]:
languages = ['lug', 'ach', 'nyn', 'luo']

if not os.path.exists('v7-dataset'):
    !wget https://sunbird-translate.s3.us-east-2.amazonaws.com/v7-dataset.zip
    !unzip v7-dataset.zip
    display.clear_output()
    
for language in languages:
    source = file_to_list(DATA_DIR + f'mul-en/train_mt560_{language}.src')
    target = file_to_list(DATA_DIR + f'mul-en/train_mt560_{language}.tgt')

    sentences = []
    for s, t in zip(source, target):
        sentences.append({'text': {language: s, 'eng': t}})

    with open(OUTPUT_DIR + f'mt560_{language}.jsonl', 'w') as outfile:
        for entry in sentences:
            json.dump(entry, outfile)
            outfile.write('\n')

# Parallel Igbo-English Data

### MT560 Data Source
Follow the approach here to get any language available in mt560:
https://colab.research.google.com/drive/1_a_d4phiWFhcLGkom3qIfelblxbSiTSB?usp=sharing

The parallel Igbo data has `415234` samples.

In [4]:
mt560 = pd.read_csv("/Users/user/Downloads/mt560.csv.gz", engine='c')
mt560_igbo = mt560[mt560["source_language"]=="ibo"]

In [6]:
print(len(mt560_igbo), "\n")
mt560_igbo.head()

415234 



|  | source | english | source_language |
|---|---|---|---|
| 3 | Jehova Chineke ga - enyekwa ya ocheeze nke Dev... | And Jehovah God will give him the throne of Da... | ibo |
| 4 | Ka ihu anyanwụ malitere ịpụta, anyị hụrụ ntụpọ... | As the solar disk started to emerge, we saw th... | ibo |
| 5 | Kpọghachite m, m ga - alọghachikwa, n'ihi na ị... | Cause me to turn back, and I shall readily tur... | ibo |
| 7 | Ndị Ezi Omume Ga - enwu Gbaa Dị Ka Anyanwụ | The Righteous Ones Will Shine as Brightly as t... | ibo |
| 8 | Ndị ji obi ụtọ na ịdị n'otu na - ejere Jehova ... | Happy and enjoying their united service to Jeh... | ibo |


In [None]:
sentences = []

for row in mt560_igbo.itertuples():
    sentences.append({'text': {'ibo': row.source, 'eng': row.english}})

In [None]:
with open(OUTPUT_DIR + 'mt560_ibo.jsonl', 'w', encoding='UTF-8', errors='ignore') as outfile:
    for entry in sentences:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')

In [None]:
# ## UNCOMMENT IF WE CHOOSE TO SPLIT THE DATA INSTEAD INTO TRAIN, DEV, TEST
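# train, dev = train_test_split(mt560_igbo, test_size=0.01, shuffle=True)
# # hold out the first 500 rows as test and keep the rest as dev, so the two sets do not overlap
# test, dev = dev[:500], dev[500:]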

# print(len(train), len(dev), len(test))

# train_sentences = []
# dev_sentences = []
# test_sentences = []

# for row in train.itertuples():
#     train_sentences.append({'text': {'ibo': row.source, 'eng': row.english}})

# for row in dev.itertuples():
#     dev_sentences.append({'text': {'ibo': row.source, 'eng': row.english}})

# for row in test.itertuples():
#     test_sentences.append({'text': {'ibo': row.source, 'eng': row.english}})
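A train/dev/test split like the one in the commented-out cell can also be done without scikit-learn. The following is a minimal sketch that keeps the three sets disjoint; the function name `split_pairs`, its parameters, and the placeholder records are assumptions for illustration only.

```python
import random

def split_pairs(pairs, dev_size, test_size, seed=0):
    """Shuffle a copy of pairs and cut it into disjoint train/dev/test lists."""
    if dev_size + test_size > len(pairs):
        raise ValueError('dev_size + test_size exceeds the number of pairs')
    shuffled = list(pairs)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    dev = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev, test

# Placeholder records standing in for real sentence pairs.
pairs = [{'text': {'ibo': f'ibo {i}', 'eng': f'eng {i}'}} for i in range(100)]
train, dev, test = split_pairs(pairs, dev_size=10, test_size=5)
print(len(train), len(dev), len(test))
```

Because the three lists are slices of one shuffled copy, no pair can appear in more than one split.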

### IgboNLP Data [Source](https://github.com/IgnatiusEzeani/IGBONLP/tree/master/ig_en_mt/benchmark_dataset)
This data has `10792` samples.


In [58]:
igbo_mt = ["test.en", "test.ig", "train.en", "train.ig", "val.en", "val.ig"]
for data in igbo_mt:
    !wget -P $temp_dir https://raw.githubusercontent.com/IgnatiusEzeani/IGBONLP/master/ig_en_mt/benchmark_dataset/$data
    display.clear_output()

## merge all .ig and .en files in igbo_mt list
for lang in ["en", "ig"]:
    # read and write as UTF-8 so the merged files match what file_to_list expects
    with open(temp_dir+f'igbo_en.{lang}', 'w', encoding='utf-8') as outfile:
        for language_split in igbo_mt:
            if language_split.endswith(lang):
                with open(temp_dir+language_split, encoding='utf-8') as infile:
                    outfile.write(infile.read())
                outfile.write("\n")

# load merged .ig and .en data and convert to the salt data format
igbo = file_to_list(temp_dir+"igbo_en.ig")
en = file_to_list(temp_dir+"igbo_en.en")

sentences = []
for s, t in zip(igbo, en):
    sentences.append({'text': {'ibo': s, 'eng': t}})

with open(OUTPUT_DIR + 'igbo_en.jsonl', 'w', encoding='utf-8') as outfile:
    for entry in sentences:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')

# delete temporary directory
!rm -rf $temp_dir
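Since `zip` stops at the shorter of its inputs, a missing line in one of the two files would go unnoticed, and the extra newline written after each split may produce empty records if the split already ends with one. A minimal sketch of a stricter pairing step is shown below; `pair_lines` is a hypothetical helper, not part of the original pipeline, and the short word lists are only illustrative.

```python
def pair_lines(source_lines, target_lines, source_key='ibo', target_key='eng'):
    """Pair two aligned lists of sentences, refusing mismatched lengths
    and skipping pairs where either side is blank."""
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f'Line counts differ: {len(source_lines)} vs {len(target_lines)}')
    pairs = []
    for s, t in zip(source_lines, target_lines):
        s, t = s.strip(), t.strip()
        if s and t:
            pairs.append({'text': {source_key: s, target_key: t}})
    return pairs

# The blank middle line mimics the separator written between merged splits.
igbo_lines = ['Ndewo', '', 'Daalụ']
english_lines = ['Hello', '', 'Thank you']
pairs = pair_lines(igbo_lines, english_lines)
print(len(pairs))
```

Raising on a length mismatch is a deliberate choice here: a silent truncation would shift every later pair out of alignment without any visible error.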

### Masakhane Eng-Igbo Data [Source](https://github.com/masakhane-io/lafand-mt/tree/main/data/text_files/en-ibo)

This data has `10000` samples.

In [82]:
igbo_mt = ["test.en", "test.ibo", "train.en", "train.ibo", "dev.en", "dev.ibo"]
for data in igbo_mt:
    !wget -P $temp_dir https://raw.githubusercontent.com/masakhane-io/lafand-mt/main/data/text_files/en-ibo/$data
    display.clear_output()

## merge all .ibo and .en files in igbo_mt list
for lang in ["en", "ibo"]:
    # read and write as UTF-8 so the merged files match what file_to_list expects
    with open(temp_dir+f'masakhane_igbo_en.{lang}', 'w', encoding='utf-8') as outfile:
        for language_split in igbo_mt:
            if language_split.endswith(lang):
                with open(temp_dir+language_split, encoding='utf-8') as infile:
                    outfile.write(infile.read())
                outfile.write("\n")

# load merged .ibo and .en data and convert to the salt data format
igbo = file_to_list(temp_dir+"masakhane_igbo_en.ibo")
en = file_to_list(temp_dir+"masakhane_igbo_en.en")

sentences = []
for s, t in zip(igbo, en):
    sentences.append({'text': {'ibo': s, 'eng': t}})

with open(OUTPUT_DIR + 'masakhane_igbo_en.jsonl', 'w', encoding='utf-8') as outfile:
    for entry in sentences:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')

# delete temporary directory
!rm -rf $temp_dir

### Facebook No Language Left Behind (NLLB) Data [Source](https://huggingface.co/datasets/allenai/nllb/blob/main/README.md)

It was recommended to use the data only for training purposes. This data has `6110033` samples.

In [None]:
%%capture
!pip install datasets

In [71]:
from datasets import load_dataset

# the data is quite large, so it takes up to an hour to download
igbo_dataset = load_dataset("allenai/nllb", "eng_Latn-ibo_Latn")

Downloading and preparing dataset nllb/eng_Latn-ibo_Latn (download: 930.30 MiB, generated: 2.58 GiB, post-processed: Unknown size, total: 3.48 GiB) to /Users/user/.cache/huggingface/datasets/allenai___nllb/eng_Latn-ibo_Latn/1.0.0/28d4a24ef4e17a539baee89254dc6a56e75b1a7a10b1055757f2512af99f5b30...


Downloading data: 100%|██████████| 975M/975M [54:23<00:00, 299kB/s]    
                                                                                           

Dataset nllb downloaded and prepared to /Users/user/.cache/huggingface/datasets/allenai___nllb/eng_Latn-ibo_Latn/1.0.0/28d4a24ef4e17a539baee89254dc6a56e75b1a7a10b1055757f2512af99f5b30. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:01<00:00,  1.57s/it]


In [72]:
igbo_dataset 

DatasetDict({
    train: Dataset({
        features: ['translation', 'laser_score', 'source_sentence_lid', 'target_sentence_lid', 'source_sentence_source', 'source_sentence_url', 'target_sentence_source', 'target_sentence_url'],
        num_rows: 6110033
    })
})

In [75]:
igbo_dataset_df = pd.DataFrame(igbo_dataset["train"]["translation"])

In [76]:
igbo_dataset_df.head()

|  | eng_Latn | ibo_Latn |
|---|---|---|
| 0 | Anas tells us that he replied, "All those who ... | Anas na-agwa anyị na ọ zaghachiri, sị, "Ndị ni... |
| 1 | "For in one hour such great wealth has been la... | n' ihi na n' otu awa ka a lara oké akụnụba dị ... |
| 2 | They will come, and they will see my glory. | Ha ga-agakwuru, ha ga-hụ ebube m. |
| 3 | In one hour this great wealth has been ruined. | n' ihi na n' otu awa ka a lara oké akụnụba dị ... |
| 4 | Seven days shall you wait until I come to you,... | Ị ga- echere m ruo ụbọchị asaa , ruo mgbe m ga... |


In [77]:
sentences = []

for row in igbo_dataset_df.itertuples():
    sentences.append({'text': {'ibo': row.ibo_Latn, 'eng': row.eng_Latn}})

In [79]:
sentences[0]

{'text': {'ibo': 'Anas na-agwa anyị na ọ zaghachiri, sị, "Ndị niile na-atụ egwu Allah."',
  'eng': 'Anas tells us that he replied, "All those who fear Allah."'}}

In [81]:
with open(OUTPUT_DIR + 'nllb_ibo_train.jsonl', 'w', encoding='UTF-8', errors='ignore') as outfile:
    for entry in sentences:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
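The dataset also exposes a `laser_score` feature for every pair, which could be used to keep only higher-scoring pairs before writing. The sketch below works on plain dict rows shaped like the features listed above; `to_salt_records`, the `min_score` parameter, and the score values in the example are all assumptions for illustration, not values taken from the dataset or a recommended threshold.

```python
def to_salt_records(rows, min_score=None):
    """Yield SALT v2 records one at a time, optionally dropping rows
    whose laser_score is below min_score."""
    for row in rows:
        if min_score is not None and row['laser_score'] < min_score:
            continue
        translation = row['translation']
        yield {'text': {'ibo': translation['ibo_Latn'],
                        'eng': translation['eng_Latn']}}

# Example rows; the scores are made up for this sketch.
rows = [
    {'translation': {'eng_Latn': 'They will come, and they will see my glory.',
                     'ibo_Latn': 'Ha ga-agakwuru, ha ga-hụ ebube m.'},
     'laser_score': 1.2},
    {'translation': {'eng_Latn': 'In one hour this great wealth has been ruined.',
                     'ibo_Latn': "n' ihi na n' otu awa ka a lara oké akụnụba dị ..."},
     'laser_score': 1.0},
]

records = list(to_salt_records(rows, min_score=1.1))
print(len(records))
```

Using a generator means records can be written to the output file as they are produced, which avoids holding several million dicts in memory at once.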

# FLORES 200

This dataset contains about 2,000 sentences (997 in dev and 1,012 in devtest) with translations in 44 different African languages. We combine the dev and devtest splits into a single set.

In [10]:
if not os.path.exists('flores200_dataset'):
    !wget --trust-server-names https://tinyurl.com/flores200dataset
    !tar xvzf flores200_dataset.tar.gz 
    display.clear_output()

# languages = ['lug', 'eng', 'ibo', 'ewe', 'fon', 'hau', 'kam', 'kea', 'kik', 'kin',
#              'kmb', 'kon', 'lin', 'lua', 'luo', 'nso', 'nya', 'gaz', 'run', 'sag',
#              'sna', 'som', 'sot', 'ssw', 'swh', 'tir', 'tsn', 'tso', 'tum', 'twi',
#              'umb', 'wol', 'xho', 'yor', 'zul', 'aka', 'amh', 'bam', 'bem', 'cjk',
#              'dik', 'dyu', 'fuv', 'kbp']

languages = ['lug', 'luo', 'ibo']
source_sentences = {}

for language in languages:
    dev_path = glob.glob(f'flores200_dataset/dev/{language}*.dev')[0]
    devtest_path = glob.glob(f'flores200_dataset/devtest/{language}*.devtest')[0]
    source_sentences[language] = file_to_list(dev_path) + file_to_list(devtest_path)
    if not len(source_sentences[language]):
        raise ValueError(f'No text found for language {language}.')  

N = len(source_sentences['lug'])
sentences = []
for i in range(N):
    sentence = {'text': {}}
    for language in languages:
        sentence['text'][language] = source_sentences[language][i] 
    sentences.append(sentence)

with open(OUTPUT_DIR + 'flores200.jsonl', 'w') as outfile:
    for entry in sentences:
        json.dump(entry, outfile)
        outfile.write('\n')
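Building multi-language records by index relies on every language having the same number of sentences in the same order. A minimal sketch of that step with an explicit length check follows; `combine_languages` is a hypothetical helper and the sentences are placeholders, not FLORES text.

```python
def combine_languages(source_sentences):
    """Turn {language: [sentences]} into one record per sentence index,
    after checking that every language has the same number of sentences."""
    lengths = {lang: len(s) for lang, s in source_sentences.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f'Sentence counts differ across languages: {lengths}')
    languages = list(source_sentences)
    n = lengths[languages[0]]
    return [{'text': {lang: source_sentences[lang][i] for lang in languages}}
            for i in range(n)]

# Placeholder sentences keyed by language code.
source_sentences = {
    'lug': ['lug sentence 0', 'lug sentence 1'],
    'luo': ['luo sentence 0', 'luo sentence 1'],
    'ibo': ['ibo sentence 0', 'ibo sentence 1'],
}
combined = combine_languages(source_sentences)
print(combined[0])
```

Checking the lengths up front turns a potential `IndexError`, or a silently shortened output, into a clear message naming the languages that disagree.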

# Get number of words in test and dev SALT data

In [52]:
def get_number_of_eng_words(json_file_path):
    # read one JSON record per line; the context manager closes the file
    with open(json_file_path, 'r', encoding='utf-8') as infile:
        salt_data = [json.loads(line) for line in infile]

    values = [salt_data_["text"] for salt_data_ in salt_data]

    salt_data = pd.DataFrame(values)

    salt_data["no_of_enWords"] = salt_data["eng"].apply(lambda n: len(n.split()))

    mean_number_of_english_words = salt_data["no_of_enWords"].mean()
    total_number_of_english_words = sum(salt_data["no_of_enWords"])

    print("mean_number_of_english_words: ", round(mean_number_of_english_words), "\n",
        "total_number_of_english_words: ", total_number_of_english_words)

    return total_number_of_english_words, salt_data


In [63]:
salt_test = "salt-translation-plus-external-datasets/salt-test.jsonl"
salt_dev = "salt-translation-plus-external-datasets/salt-dev.jsonl"

no_of_test_eng_words, salt_test_df = get_number_of_eng_words(salt_test)
no_of_dev_eng_words, salt_dev_df = get_number_of_eng_words(salt_dev)

# total number of words in dev and test
total_ = no_of_test_eng_words + no_of_dev_eng_words
print("\n", "Total number of words in dev and test: ", total_)

mean_number_of_english_words:  9 
 total_number_of_english_words:  4571
mean_number_of_english_words:  9 
 total_number_of_english_words:  4469

 Total number of words in dev and test:  9040
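The same whitespace-based word count can be sketched without pandas, which may be convenient for a quick check on a list of records. `count_words` is a hypothetical helper; the two example sentences are taken from the SALT dev data shown in this notebook.

```python
def count_words(records, language='eng'):
    """Return (total, mean) whitespace-separated word counts for one language."""
    counts = [len(record['text'][language].split()) for record in records]
    total = sum(counts)
    mean = total / len(counts) if counts else 0.0
    return total, mean

records = [
    {'text': {'eng': 'Parents educate their children.',
              'lug': 'Abazadde basomesa abaana baabwe.'}},
    {'text': {'eng': 'The issue of land grabbing is on a rise.',
              'lug': 'Ekibba ttaka kyeyongedde nnyo.'}},
]

total, mean = count_words(records)
print(total, mean)
```

Splitting on whitespace treats punctuation attached to a word as part of that word, which matches the counting used in `get_number_of_eng_words`.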


In [42]:
salt_dev_df.head()

|  | eng | lug | ach | teo | lgg | nyn | no_of_enWords |
|---|---|---|---|---|---|---|---|
| 0 | It's the government's responsibility to teach ... | Buvunaanyizibwa bwa gavumenti okusomesa abantu... | Obedo tic pa gamente me pwonyo lwak i kom two ... | Erai aswam apugan aisisianakin itunga ke nuika... | Eri azi gamete ni imbata fezu 'ba ivile 'diyin... | N'obujunanizibwa bwa Gavumenti okwegyesa abant... | 11 |
| 1 | The issue of land grabbing is on a rise. | Ekibba ttaka kyeyongedde nnyo. | Time me mayo ngom tektek tye ka medde ameda. | Iyatasi noi akiro nuka aidem alupok. | E'yo angu opazaniri turia | Eshonga y'okwiba eitaka neyeyongyera. | 9 |
| 2 | Parents educate their children. | Abazadde basomesa abaana baabwe. | Lunyodo pwonyo lutino gi | Itosiomete auriak idwe kec. | Tipika eyi onita fe anzi eyivile 'diyini | Abazaire nibegyesa abaana baabo. | 4 |
| 3 | I passed all the questions in the examination ... | Nnatuuka ebibuuzo byonna ebyali ku lupapula lw... | Akato lapeny weng ma obedo i karatac peny. | Abu eong atub aingiseta kere luka apapula kangin. | Ma aga ozita karitasi obeta ni ma alia dria ra | Nkahika ebibuuzo byona ebyabaire biri omu kigy... | 9 |
| 4 | Several musicians held a concert in honor of t... | Abayimbi abatali bamu baakoze ekivvulu okujjuk... | Lugo wer mapol guwero wer me po pi luremgi ma ... | Apotu ayook luipu kojaikinos keda aitodiaret k... | Ba karakarau ongo co'ba 'diyi 'ye avita inzita... | Abeshongozi bamwe bakozireho ekiterane mukwiju... | 11 |


In [54]:
print("salt_dev_df: ", salt_dev_df.shape)
print("salt_test_df: ", salt_test_df.shape)

salt_dev_df:  (500, 7)
salt_test_df:  (500, 7)


# Monolingual text (web scraped)

Data was scraped from the web using [this code](https://github.com/SunbirdAI/parallel-text-EDA/tree/main/back_translation).

In [14]:
url_prefix = ('https://raw.githubusercontent.com/SunbirdAI/'
              'parallel-text-EDA/main/back_translation/data/')
english_sentences = url_to_list(url_prefix + 'eng/daily-monitor.txt')
english_sentences += url_to_list(url_prefix + 'eng/new-vision.txt')
english_sentences = [{'text': {'eng': s}} for s in english_sentences]

In [15]:
luganda_sentences = url_to_list(url_prefix + 'lug/bukedde.txt')
luganda_sentences += url_to_list(url_prefix + 'lug/makerere.txt')
luganda_sentences = [{'text': {'lug': s}} for s in luganda_sentences]

In [16]:
acholi_sentences = url_to_list(url_prefix + 'ach/acholi-online.txt')
acholi_sentences += url_to_list(url_prefix + 'ach/misc.txt')
acholi_sentences += url_to_list(url_prefix + 'ach/rupiny.txt')
acholi_sentences = [{'text': {'ach': s}} for s in acholi_sentences]

In [17]:
len(acholi_sentences), len(luganda_sentences), len(english_sentences)

(6655, 12304, 88613)

In [18]:
# write one monolingual .jsonl file per language
monolingual = {
    'eng': english_sentences,
    'lug': luganda_sentences,
    'ach': acholi_sentences,
}
for language, entries in monolingual.items():
    with open(OUTPUT_DIR + f'monolingual-{language}.jsonl', 'w') as outfile:
        for entry in entries:
            json.dump(entry, outfile)
            outfile.write('\n')
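Web-scraped text can contain blank lines and repeated sentences, although whether these particular files do is an assumption not checked here. A minimal sketch of a cleaning pass that could run before wrapping sentences into records is given below; `clean_sentences` is a hypothetical helper and the example strings are illustrative.

```python
def clean_sentences(sentences):
    """Strip whitespace, drop blank lines, and remove exact duplicates
    while keeping the first occurrence in its original position."""
    seen = set()
    cleaned = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned

raw = ['Parents educate their children.', '', '  Parents educate their children.  ',
       'The issue of land grabbing is on a rise.']
cleaned = clean_sentences(raw)
records = [{'text': {'eng': s}} for s in cleaned]
print(len(records))
```

Only exact matches after stripping are treated as duplicates, so sentences that differ in case or punctuation are kept as separate entries.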