
### Neural Machine Translation Data Labeling

Now that we have gathered our data and saved it as a `.csv` file, we are going to create translation datasets from the gathered tweets. We are going to create a translation dataset for the following South African languages:

1. Zulu
2. Xhosa
3. Sotho
4. Afrikaans
5. English

___

Topic: `NMT`

Date: `2022/07/20`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___



### NMT Dataset

We are going to use `textblob` to create a translation dataset from a large set of English sentences that were scraped from Twitter.



### Installation of `textblob`
In the following code cell we are going to install the latest version of `textblob`. 

In [1]:
!pip install textblob --upgrade -q

### Installing the package for helper functions

We are going to install a package called [`helperfns`](https://github.com/CrispenGari/helperfns). This is a package that I created, and it contains some helper functions that are useful in this task.


In [2]:
!pip install helperfns -q

### Importing packages
In the following code cell we are going to import packages that we are going to use in this notebook.

In [3]:
import numpy as np
import pandas as pd

from textblob import TextBlob

import textblob
import os
import json
import multiprocessing

from helperfns.text import clean_sentence


textblob.__version__

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


'0.17.1'

### File System

We are going to store and load data from Google Drive, so we need to mount it. In the following code cell we mount Google Drive.


In [4]:
from google.colab import drive, files

drive.mount('/content/drive')

Mounted at /content/drive


### Languages
In the following code cell we are going to define our languages.

In [5]:
class Language:
  def __init__(self, code: str, name: str):
    self.name = name
    self.code = code

  def __repr__(self) -> str:
    return f"Language: <{self.name}>"

  def __str__(self) -> str:
    return f"Language: <{self.name}>"

languages = list(map(
    lambda x: Language(*(x['code'], x['name'])), 
    [
      { "code": "af", "name": "Afrikaans" },
      { "code": "en", "name": "English" },
      { "code": "xh", "name": "Xhosa" },
      { "code": "zu", "name": "Zulu" },
      { "code": "nr", "name": "Ndebele" },
      { "code": "ts", "name": "Tsonga" },
      { "code": "tn", "name": "Tswana" },
      { "code": "ss", "name": "Swati" },
      { "code": "st", "name": "Sotho" },
      { "code": "ve", "name": "Venda" }
    ]
))

languages

[Language: <Afrikaans>,
 Language: <English>,
 Language: <Xhosa>,
 Language: <Zulu>,
 Language: <Ndebele>,
 Language: <Tsonga>,
 Language: <Tswana>,
 Language: <Swati>,
 Language: <Sotho>,
 Language: <Venda>]

In [6]:
languages[0].name, languages[0].code

('Afrikaans', 'af')

### Languages that are not being translated

```
ve - Venda
tn - Tswana
ss - Swati
nr - Ndebele (nd)
en - English
```

The mentioned languages will be filtered out when creating the dataset for language translation.

In [7]:
languages = [lang for lang in languages if lang.code not in ['ve', 'tn', 'ss', 'nr', 'en']]
languages

[Language: <Afrikaans>,
 Language: <Xhosa>,
 Language: <Zulu>,
 Language: <Tsonga>,
 Language: <Sotho>]

### Our Translation dataset

Our translation dataset will pair English with the following South African languages.

```
af-en
xh-en
zu-en
ts-en
st-en
```

### Paths

In the following code cell we are going to define the paths where all our files are stored in Google Drive.

In [8]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted Google Drive."

english_path = os.path.join(base_dir, "english/english.csv")


### Reading all English sentences

In the following code cell we are going to read all English sentences and store them in a numpy array.

In [9]:
all_texts = pd.read_csv(english_path).sentence.values

In [10]:
len(all_texts)

108631

In [11]:
all_texts[:2]

array(['so where are we with this eisenhower is farewell address',
       'to you still beautiful even after death i say my farewell'],
      dtype=object)

### `textblob_translate` function
This function takes in an English sentence and the language that we want it translated into, and returns the source sentence together with its translation.

In [12]:
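def textblob_translate(sent: str, lang: Language) -> tuple:
  # Returns a (source, translation) pair.
  blob = TextBlob(sent)
  try:
    return sent, str(blob.translate(from_lang='en', to=lang.code)).strip().lower()
  except Exception:
    # If translation fails, the source sentence is returned as its own target.
    return sent, sent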
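
Because `textblob_translate` falls back to returning the source sentence as its own target when translation fails, the resulting pairs can contain rows where `src` and `trg` are identical. A minimal sketch of filtering those rows out is shown below. The helper `drop_untranslated` is hypothetical and not part of the notebook, and it assumes that an identical pair always means a failed translation, which may not hold for very short sentences or names.

```python
def drop_untranslated(pairs):
    """Keep only (src, trg) pairs whose target differs from the source.

    Comparison ignores case and surrounding whitespace, because the
    translated side is lower-cased and stripped in the notebook.
    """
    return [
        (src, trg)
        for src, trg in pairs
        if src.strip().lower() != trg.strip().lower()
    ]


example_pairs = [
    ("good morning", "goeie more"),
    ("hello", "hello"),  # looks like a fallback pair, so it is dropped
]
filtered_pairs = drop_untranslated(example_pairs)
```

Whether to drop these rows or keep them is a judgment call; filtering at merge time instead of at translation time leaves the raw daily files untouched.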

### Defining columns
In the following code cell we are going to define the columns of our `csv` files. Note that there will be only two columns which are:

1. `src` - the English sentences (source)
2. `trg` - the translated version of a sentence (target)

In [13]:
columns = np.array(['src', 'trg'])

### `create_translation_dataset` function

This function takes in a language, translates the selected sentences into it and saves the result as a `csv` file in Google Drive.

> Note that creating such a large dataset is computationally expensive: we have `~109K` English sentences, and translating all of them can take days. So we run this notebook on a small batch of sentences each day and save each batch under a defined file name, with the sentence pairs for each language in their own folder. Later on we will write code to merge these files into a single one.

In [16]:
start_idx = 108_000
end_idx = 110_000 # slice is [start_idx, end_idx); indices past the end of all_texts are ignored
day = 110

def create_translation_dataset(lang: Language):
  src_trg = list()
  folder_path = os.path.join(base_dir, 'nmt_datasets', f'en-{lang.code}')
  # makedirs also creates the parent 'nmt_datasets' folder when it is missing
  os.makedirs(folder_path, exist_ok=True)
  save_path = os.path.join(folder_path, f'en-{lang.code}-{day}.csv')
  if lang.code != 'en':
    print()
    print(f"Creating translation dataset for en-{lang.code}")
    for txt in all_texts[start_idx:end_idx]:
      try:
        src_trg.append(textblob_translate(txt, lang))
      except Exception:
        # skip any sentence that still fails, e.g. a non-string value
        continue
    print(f"Done creating translation dataset for en-{lang.code}")
  if len(src_trg) > 0:
    print(f"Got {len(src_trg)} examples for en-{lang.code}")
    dataframe = pd.DataFrame(src_trg, columns=columns, index=None)
    # the default index is written as an unnamed first column
    dataframe.to_csv(save_path)
    print(f"Saved en-{lang.code} dataset as en-{lang.code}-{day}.csv")
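
The cell above sets `start_idx`, `end_idx` and `day` by hand before each run. As a rough sketch, the slice bounds could instead be derived from the day number, assuming the same number of sentences is translated every day. The helper `day_bounds` and its `per_day` parameter are hypothetical and are not how the notebook chose its values.

```python
def day_bounds(day: int, per_day: int, total: int) -> tuple:
    """Return (start, end) slice bounds for a 1-based day number.

    The end bound is clamped to `total`, so the last batch may be shorter.
    """
    if day < 1 or per_day < 1:
        raise ValueError("day and per_day must be positive")
    start = (day - 1) * per_day
    if start >= total:
        raise ValueError("no sentences left for this day")
    end = min(day * per_day, total)
    return start, end


# Example with made-up sizes: 2,500 sentences, 1,000 per day.
first_batch = day_bounds(1, 1000, 2500)
last_batch = day_bounds(3, 1000, 2500)
```

The returned pair can be used directly as `all_texts[start:end]`, which matches the half-open slicing used in `create_translation_dataset`.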

### Creating the dataset for each language.

In the next code cell we are going to create the dataset for each language.

In [17]:
for lang in languages:
  create_translation_dataset(lang)


Creating translation dataset for en-af
Done creating translation dataset for en-af
Got 630 examples for en-af
Saved en-af dataset as en-af-110.csv

Creating translation dataset for en-xh
Done creating translation dataset for en-xh
Got 630 examples for en-xh
Saved en-xh dataset as en-xh-110.csv

Creating translation dataset for en-zu
Done creating translation dataset for en-zu
Got 630 examples for en-zu
Saved en-zu dataset as en-zu-110.csv

Creating translation dataset for en-ts
Done creating translation dataset for en-ts
Got 630 examples for en-ts
Saved en-ts dataset as en-ts-110.csv

Creating translation dataset for en-st
Done creating translation dataset for en-st
Got 630 examples for en-st
Saved en-st dataset as en-st-110.csv
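
The note earlier says the daily files will be merged into a single file per language later on. A minimal sketch of that merge step follows, assuming each daily file has `src` and `trg` columns plus the unnamed index column that `to_csv` writes by default. The function name `merge_daily_files` and the output file name are assumptions, and the sketch uses only the standard library `csv` module rather than pandas.

```python
import csv
import glob
import os


def merge_daily_files(folder_path: str, out_path: str) -> int:
    """Merge every daily csv in folder_path into one src/trg csv.

    Returns the number of unique sentence pairs written.
    """
    out_abs = os.path.abspath(out_path)
    seen = set()
    rows = []
    for path in sorted(glob.glob(os.path.join(folder_path, "*.csv"))):
        if os.path.abspath(path) == out_abs:
            continue  # skip a previous merge result stored in the same folder
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                pair = (row["src"], row["trg"])
                if pair not in seen:  # drop exact duplicate pairs
                    seen.add(pair)
                    rows.append(pair)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["src", "trg"])
        writer.writerows(rows)
    return len(rows)
```

It could be called once per language folder, for example `merge_daily_files(folder_path, os.path.join(folder_path, 'en-af.csv'))`. Files are read in alphabetical order of their names, which is not the same as day order once day numbers pass 9.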


### References

1. [List of ISO 639-1 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)