
### Language Identification Dataset (LI)

In this notebook we are going to create a language identification dataset for the following languages:

- Zulu
- Xhosa
- Sotho
- Tsonga
- Afrikaans
- English
___

Topic: `LI`

Date: `2022/08/24`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___

### Importing packages
In the following code cell we import the packages that we are going to use in this notebook.

In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os, random

### Setting Seed

In the following code cell we are going to set the seed for reproducibility in this notebook.

In [13]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)
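
As a small illustration of what the seed buys us, the sketch below shuffles the same list twice with the same seed and checks that the order is identical. The helper `shuffled_copy` and the `sample` list are hypothetical names used only for this example; it uses a private `random.Random` generator rather than the global state that the cell above seeds.

```python
import random

SEED = 42  # same value as in the notebook


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    out = list(items)
    rng.shuffle(out)
    return out


sample = ["en", "af", "st", "zu", "ts", "xh"]  # hypothetical toy input
first = shuffled_copy(sample, SEED)
second = shuffled_copy(sample, SEED)

# Same seed and same input give the same order on every run.
print(first == second)
```

Seeding the global generators, as the notebook does, has the same effect as long as the cells are run in the same order each time.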

### File System

Our files are stored in Google Drive, so we need to mount the drive before we can read and write files there.


In [6]:
from google.colab import drive, files

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths

In the following code cell we are going to define the paths where our files are stored in Google Drive.

In [7]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

save_dir = os.path.join(base_dir, 'li_datasets')

read_dir = os.path.join(base_dir, 'nmt_datasets_pairs/za-nmt-dataset.csv')

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the google drive."
assert os.path.exists(save_dir), f"The path '{save_dir}' does not exist, check if you have mounted the google drive."
assert os.path.exists(read_dir), f"The path '{read_dir}' does not exist, check if you have mounted the google drive."
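
The three assertions stop at the first missing path. A possible variation, sketched here under the assumption that you would rather see every missing path at once, collects them in a list. The helper `missing_paths` and the `demo_*` names are hypothetical, and the example runs against a temporary directory instead of the real drive.

```python
import os
import tempfile


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as demo_base:
    demo_save = os.path.join(demo_base, "li_datasets")
    os.makedirs(demo_save, exist_ok=True)  # create the output folder
    demo_read = os.path.join(demo_base, "za-nmt-dataset.csv")  # never created
    missing = missing_paths([demo_base, demo_save, demo_read])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)  # ['za-nmt-dataset.csv']
```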

### Reading Sentences
In the following code cell we are going to read the sentences from our `za-nmt-dataset.csv` into a dataframe.

In [8]:
dataframe = pd.read_csv(read_dir)
dataframe.head(1)

Unnamed: 0.1,Unnamed: 0,en,af,st,zu,ts,xh
0,0,so where are we with this eisenhower is farewe...,so waar is ons met hierdie eisenhower afskeids...,joale re hokae ka eisenhower ena,ngakho-ke sikuphi nale-eisenhower ikheli le-fa...,so hi kwihi hi eisenhower leyi i farewell addr...,ke siphi na le ntetho idilesi ye-farewell


### Column names
In the following code cell we are going to declare column names as a numpy array. These column names are columns for our dataset.

In [9]:
column_names = np.array(['src', 'trg'])

### Extracting sentences and labelling them

In the following code cell we are going to extract all the sentences for each language in the dataframe columns and label them.

In [10]:
en_sents = [(sent, 'en') for sent in dataframe.en.values]
af_sents = [(sent, 'af') for sent in dataframe.af.values]
st_sents = [(sent, 'st') for sent in dataframe.st.values]
zu_sents = [(sent, 'zu') for sent in dataframe.zu.values]
ts_sents = [(sent, 'ts') for sent in dataframe.ts.values]
xh_sents = [(sent, 'xh') for sent in dataframe.xh.values]
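
The six comprehensions above follow the same pattern, so they can also be written as one loop over the language codes. The sketch below is a hypothetical helper, `label_sentences`, that additionally skips empty or non-string cells; whether such cells occur in this dataset is not shown in the notebook, so treat that filter as an assumption. A plain dictionary stands in for the dataframe.

```python
def label_sentences(columns, langs):
    """Pair every non-empty sentence in each language column with its language code."""
    pairs = []
    for lang in langs:
        for sent in columns[lang]:
            # Missing values read by pandas are floats (NaN), so they fail this check.
            if isinstance(sent, str) and sent.strip():
                pairs.append((sent, lang))
    return pairs


# Toy stand-in for the dataframe: one list of cells per language column.
toy_columns = {
    "en": ["farewell my love", float("nan")],
    "af": ["om mee te werk en in die volgende", ""],
}

pairs = label_sentences(toy_columns, ["en", "af"])
print(pairs)
# [('farewell my love', 'en'), ('om mee te werk en in die volgende', 'af')]
```

Because indexing a dataframe by column name and iterating over the result yields the cell values, the same function should also accept `dataframe` directly together with the list `['en', 'af', 'st', 'zu', 'ts', 'xh']`.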

### Merging and Shuffling.

In the following code cell we are going to merge our sentences for all languages and then shuffle them.

In [14]:
data = list()
data.extend(en_sents)
data.extend(af_sents)
data.extend(st_sents)
data.extend(zu_sents)
data.extend(ts_sents)
data.extend(xh_sents)

print("Total examples: {}".format(len(data)))

random.shuffle(data)

data[:10]

Total examples: 550434


[('watter kongreslid babbitt was en vir watter nuus of vermaakprogram', 'af'),
 ('_odhidaiabo pele ho ngangisano e ne e le mofumahali', 'st'),
 ('spider man universe bokaholoanyane', 'st'),
 ('ayisiyo inkcubeko yenkcubeko endikhulele kuyo ngaphambi kokuba uzalelwe oko kukwenza kwaye ukudala isizukulwane sophuhliso kwi-hormone engaqinisekanga ngamandla abo okuzala okanye injongo yexabiso',
  'xh'),
 ('the product extension make site more informative and share detailed with your site with the module you can attach video price',
  'en'),
 ('zinikwe inkuthazo esiphezulu esiphezulu kwi-sub ye-boss championship yeyakho',
  'xh'),
 ('they t even the same genre really anyone who spent five with both and the bom can see there s no ku fana ku tlula ka loko u lava ku kuma ku khorwisa ku lwisana na kereke a hi hanci ku becha eka .',
  'ts'),
 ('ku hetiseka loku teleke hi vutlhari ni ku hetiseka eka ku saseka b a a ri ehenhla swinene eka hinkwaswo leswi a a khorwisa ku ringana ku khorwisa n’we-xa-n

### Creating a Merged dataframe

In the following code cell we are going to create a merged dataframe for our language identification dataset.

In [17]:
df = pd.DataFrame(data, columns=column_names)
df.head(10)

Unnamed: 0,src,trg
0,watter kongreslid babbitt was en vir watter nu...,af
1,_odhidaiabo pele ho ngangisano e ne e le mofum...,st
2,spider man universe bokaholoanyane,st
3,ayisiyo inkcubeko yenkcubeko endikhulele kuyo ...,xh
4,the product extension make site more informati...,en
5,zinikwe inkuthazo esiphezulu esiphezulu kwi-su...,xh
6,they t even the same genre really anyone who s...,ts
7,ku hetiseka loku teleke hi vutlhari ni ku heti...,ts
8,trump -span hy moet weer president wees om hom...,af
9,any chance you could catch it in a trap and sq...,ts
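
Before splitting, it can be useful to check how many examples each label has. The sketch below counts labels in a list of `(sentence, label)` pairs with `collections.Counter`; `toy_data` is a hypothetical four-row sample, not the real dataset.

```python
from collections import Counter

# Hypothetical toy sample in the same (sentence, label) shape as `data`.
toy_data = [
    ("farewell my love", "en"),
    ("sports is bae", "en"),
    ("om mee te werk en in die volgende", "af"),
    ("e ne e le eulogy ea pele", "st"),
]

label_counts = Counter(label for _, label in toy_data)

# Print each label with its count and its share of the sample.
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / len(toy_data):.0%})")
```

On the merged dataframe, `df['trg'].value_counts()` gives the same kind of summary. Since all six language columns come from the same dataframe, the 550434 examples correspond to 91739 rows per language.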


### Splitting our dataset.
Next we are going to split our dataset into 3 subsets which are:

1. train
2. test
3. validation

The test set is taken first, as `5%` of the full dataset. The validation set is then taken as a fraction of `.05` of the remaining training data, which is about `4.75%` of the full dataset. We are going to use the `train_test_split` function from `sklearn.model_selection` for both splits.

In [20]:
train_df, test_df = train_test_split(df, test_size=.05, random_state=SEED)
train_df, val_df = train_test_split(train_df, test_size=.05, random_state=SEED)


print(f"TRAINING EXAMPLES: {len(train_df)}")
print(f"TESTING EXAMPLES: {len(test_df)}")
print(f"VALIDATION EXAMPLES: {len(val_df)}")

TRAINING EXAMPLES: 496766
TESTING EXAMPLES: 27522
VALIDATION EXAMPLES: 26146
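
The counts above can be reproduced by hand. The sketch below assumes that the held-out part of each split is rounded up to a whole number of examples, which is consistent with the printed sizes; `split_sizes` is a hypothetical helper written for this check and is not part of scikit-learn.

```python
import math


def split_sizes(n_total, test_frac=0.05, val_frac=0.05):
    """Sizes of train/test/validation for two successive splits.

    Assumption: the held-out count of each split is rounded up.
    """
    n_test = math.ceil(test_frac * n_total)   # first split: test set
    n_rest = n_total - n_test                 # what remains for train + validation
    n_val = math.ceil(val_frac * n_rest)      # second split: validation set
    n_train = n_rest - n_val
    return n_train, n_test, n_val


n_train, n_test, n_val = split_sizes(550434)
print(n_train, n_test, n_val)  # 496766 27522 26146

# Share of the full dataset that ends up in each subset.
for name, n in [("train", n_train), ("test", n_test), ("validation", n_val)]:
    print(f"{name}: {n / 550434:.2%}")
```

This makes the effect of splitting twice visible: the validation fraction applies to the remaining data, not to the full dataset.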


### Saving datasets.

In the code cells that follow we are going to save the created sets as `.csv` files.

In [27]:
train_df.reset_index(drop=True, inplace=True)
train_df.to_csv(os.path.join(save_dir, 'train.csv'), index=False)
print("Saved!")
train_df.head()

Saved!


Unnamed: 0,src,trg
0,om mee te werk en in die volgende,af
1,yi nyikele hi dragona yimbirhi ivi yi nghena e...,ts
2,e ne e le eulogy ea pele,st
3,farewell my love,en
4,__tweet t support city as hulle die enigste kl...,af


In [28]:
test_df.reset_index(drop=True, inplace=True)
test_df.to_csv(os.path.join(save_dir, 'test.csv'), index=False)
print("Saved!")
test_df.head()

Saved!


Unnamed: 0,src,trg
0,ilungele into entsha incwadi entsha leyo inkcu...,xh
1,"as u nie daaroor grap nie, is dit nie die enig...",af
2,sports is bae,en
3,die beste speler op die veld gewees hierdie he...,af
4,_ace ek sweer s era x was fa sho action sport ...,af


In [29]:
val_df.reset_index(drop=True, inplace=True)
val_df.to_csv(os.path.join(save_dir, 'val.csv'), index=False)
print("Saved!")
val_df.head()

Saved!


Unnamed: 0,src,trg
0,ek het niks oorweeg in my filmtelevisie nie en...,af
1,xihloko hi munghana eka ku tsakisa swinene na ...,ts
2,ulimi lweretoric lwenzelwe ukuba nomthelela ok...,zu
3,damn my own me all love tho focus eka mintlang...,ts
4,yintoni eyoyikisayo yile bokongeza kum umntu o...,xh


### Saving the unsplit dataset.

In [30]:
df.to_csv(os.path.join(save_dir, 'lang-identification.csv'), index=False)
print("Saved!")

Saved!
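
A quick way to gain confidence in the saved files is to read one back and compare it with what was written. The sketch below does such a round trip with the standard `csv` module on a temporary file; the rows are hypothetical, and one of them contains a comma to show that quoting keeps the sentence in a single field.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same (src, trg) layout as the saved files.
rows = [
    ("hello, world", "en"),  # the comma forces the writer to quote this field
    ("farewell my love", "en"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "val.csv")

    # Write a header plus the rows, as a stand-in for `to_csv(..., index=False)`.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["src", "trg"])
        writer.writerows(rows)

    # Read the file back and rebuild the (src, trg) pairs.
    with open(path, newline="", encoding="utf-8") as f:
        loaded = [(r["src"], r["trg"]) for r in csv.DictReader(f)]

print(loaded == rows)  # True
```

With pandas, the equivalent check would be to call `pd.read_csv` on a saved file and compare its length and columns with the dataframe that was written.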
