
### Neural Machine Translation Data Labeling

Now that we have labeled our datasets, it's time to download these small `.csv` files and merge them.
___

Topic: `NMT`

Date: `2022/08/12`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___

### Importing packages
In the following code cell we import the packages that we are going to use in this notebook.

In [1]:
import numpy as np
import pandas as pd
import os

### File System

Our files are stored in Google Drive, so we need to mount the drive to be able to read and write files there.


In [2]:
from google.colab import drive, files

drive.mount('/content/drive')

Mounted at /content/drive


### Our Translation dataset

Our translation dataset will contain pairs between English and the following South African languages.

```
af-en
xh-en
zu-en
ts-en
st-en
```

We are going to download the small files and also merge them.
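
As a rough illustration of what merging the small files means, here is a minimal sketch that does not need Colab or Google Drive. It uses `pd.concat` for brevity, which is not necessarily how this notebook does it. The helper name `merge_csv_files`, the temporary folder, and the toy sentences are assumptions made up for this example.

```python
import os
import tempfile

import pandas as pd


def merge_csv_files(folder):
    """Read every .csv file in `folder` and stack the rows into one DataFrame.

    Hypothetical helper, for illustration only.
    """
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".csv"):
            frames.append(pd.read_csv(os.path.join(folder, name)))
    return pd.concat(frames, ignore_index=True)


# Toy files standing in for the real labeled data (made-up sentences).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"src": ["hello", "thank you"], "trg": ["hallo", "dankie"]}
    ).to_csv(os.path.join(tmp, "en-af-1.csv"), index=False)
    pd.DataFrame(
        {"src": ["good night"], "trg": ["goeie nag"]}
    ).to_csv(os.path.join(tmp, "en-af-2.csv"), index=False)

    merged = merge_csv_files(tmp)

print(merged)
```

Sorting the file names makes the row order reproducible, since `os.listdir` does not guarantee any particular order.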

### Paths

In the following code cell we are going to define the paths where all our files are stored in Google Drive.

In [3]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

save_dir = os.path.join(base_dir, 'nmt_datasets_pairs')

read_dir = os.path.join(base_dir, 'nmt_datasets')

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the Google Drive."
assert os.path.exists(save_dir), f"The path '{save_dir}' does not exist, check if you have mounted the Google Drive."
assert os.path.exists(read_dir), f"The path '{read_dir}' does not exist, check if you have mounted the Google Drive."
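
The three checks above stop at the first missing path. A possible alternative, sketched here under the assumption that reporting every missing directory at once is more convenient, collects them in a list. The helper name `missing_dirs` is hypothetical, and a temporary folder stands in for the mounted drive.

```python
import tempfile
from pathlib import Path


def missing_dirs(*paths):
    """Return the given paths (as strings) that are not existing directories."""
    return [str(p) for p in paths if not Path(p).is_dir()]


# A temporary folder plays the role of the mounted drive in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nmt_datasets").mkdir()  # only the read directory exists
    missing = missing_dirs(base / "nmt_datasets", base / "nmt_datasets_pairs")

print(missing)  # lists only the path ending in 'nmt_datasets_pairs'
```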

### Downloading and Merging Our Datasets

In the following code cell we are going to download and merge the small files from each `en-<lang>` folder.

### Defining columns
In the following code cell we are going to define the columns of our `csv` files. Note that there are only two columns:

1. `src` - the English sentences (source)
2. `trg` - the translated version of a sentence (target)
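
Before the notebook cell, here is a small sketch of the two-column layout on toy data (the sentences are made up). It also shows dropping repeated `src` sentences, which the cell further down does with `drop_duplicates`, and writing the result without the index so that the CSV holds only `src` and `trg`.

```python
import io

import pandas as pd

# Toy sentence pairs; the third row repeats the first English sentence.
pairs = pd.DataFrame(
    {
        "src": ["hello", "thank you", "hello"],
        "trg": ["hallo", "dankie", "hallo"],
    }
)

# Keep the first pair for each distinct source sentence.
unique_pairs = pairs.drop_duplicates(subset=["src"])

# Write to an in-memory buffer instead of a file, without the index column.
buffer = io.StringIO()
unique_pairs.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```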

In [18]:
# The merged file has exactly two columns: English source and translated target.
columns = np.array(['src', 'trg'])
for folder in os.listdir(read_dir):
  print()
  total = 0  # number of .csv files found in this folder
  print(f"* Creating a merged dataset for the: {folder}.")
  srcs, trgs = list(), list()
  for _file in os.listdir(os.path.join(read_dir, folder)):
    file_name = os.path.join(read_dir, folder, _file)
    # Only process .csv files (case-insensitive extension check).
    if os.path.basename(file_name).split('.')[-1].lower() == 'csv':
      print(f"* downloading...{_file}")
      files.download(file_name)
      df = pd.read_csv(file_name)
      srcs.extend(list(df.src.values))
      trgs.extend(list(df.trg.values))
      total += 1
  assert len(srcs) == len(trgs), f"length of src and trg sentences must be equal but got {len(srcs)} and {len(trgs)}."
  dataframe = pd.DataFrame(np.column_stack([srcs, trgs]), columns=columns)
  # Keep only the first pair for each distinct English sentence.
  dataframe.drop_duplicates(subset=["src"], inplace=True)
  print(f"The total sentence pairs for {folder} are: {len(dataframe)}")
  # Name the merged file after the folder; `_file` would be whichever file was listed last.
  save_path = os.path.join(save_dir, f"complete-{folder}.csv")
  # index=False keeps the saved file to the two columns defined above.
  dataframe.to_csv(save_path, index=False)
  files.download(save_path)
  print(f"Found {total} files, downloaded and merged.")
  print()
  print()



* Creating a merged dataset for the: en-af.
* downloading...en-af-1.csv
* downloading...en-af-2.csv
* downloading...en-af-3.csv
* downloading...en-af-4.csv
* downloading...en-af-5.csv
* downloading...en-af-6.csv
* downloading...en-af-7.csv
* downloading...en-af-8.csv
* downloading...en-af-9.csv
* downloading...en-af-10.csv
* downloading...en-af-11.csv
* downloading...en-af-12.csv
* downloading...en-af-13.csv
* downloading...en-af-14.csv
* downloading...en-af-15.csv
* downloading...en-af-16.csv
* downloading...en-af-17.csv
* downloading...en-af-20.csv
* downloading...en-af-25.csv
* downloading...en-af-26.csv
* downloading...en-af-27.csv
* downloading...en-af-28.csv
* downloading...en-af-30.csv
* downloading...en-af-31.csv
* downloading...en-af-33.csv
* downloading...en-af-34.csv
* downloading...en-af-35.csv
* downloading...en-af-37.csv
* downloading...en-af-40 (1).csv
* downloading...en-af-40.csv
* downloading...en-af-41.csv
* downloading...en-af-43.csv
* downloading...en-af-45.csv
* downloading...en-af-48.csv
* downloading...en-af-50.csv
* downloading...en-af-51.csv
* downloading...en-af-53.csv
* downloading...en-af-55.csv
* downloading...en-af-57.csv
* downloading...en-af-59.csv
* downloading...en-af-61.csv
* downloading...en-af-63.csv
* downloading...en-af-64.csv
* downloading...en-af-66.csv
* downloading...en-af-68.csv
* downloading...en-af-70.csv
* downloading...en-af-72.csv
* downloading...en-af-74.csv
* downloading...en-af-76.csv
* downloading...en-af-78.csv
* downloading...en-af-80.csv
* downloading...en-af-82.csv
* downloading...en-af-84.csv
* downloading...en-af-86.csv
* downloading...en-af-88.csv
* downloading...en-af-90.csv
* downloading...en-af-92.csv
* downloading...en-af-94.csv
* downloading...en-af-96.csv
* downloading...en-af-98.csv
* downloading...en-af-100.csv
* downloading...en-af-104.csv
* downloading...en-af-108.csv
* downloading...en-af-110.csv

The total sentence pairs for en-af are: 91739

Found 64 files, downloaded and merged.



* Creating a merged dataset for the: en-xh.
* downloading...en-xh-1.csv
* downloading...en-xh-2.csv
* downloading...en-xh-3.csv
* downloading...en-xh-4.csv
* downloading...en-xh-5.csv
* downloading...en-xh-6.csv
* downloading...en-xh-7.csv
* downloading...en-xh-8.csv
* downloading...en-xh-9.csv
* downloading...en-xh-10.csv
* downloading...en-xh-11.csv
* downloading...en-xh-12.csv
* downloading...en-xh-13.csv
* downloading...en-xh-14.csv
* downloading...en-xh-15.csv
* downloading...en-xh-16.csv
* downloading...en-xh-17.csv
* downloading...en-xh-20.csv
* downloading...en-xh-25.csv
* downloading...en-xh-26.csv
* downloading...en-xh-27.csv
* downloading...en-xh-28.csv
* downloading...en-xh-30.csv
* downloading...en-xh-31.csv
* downloading...en-xh-33.csv
* downloading...en-xh-34.csv
* downloading...en-xh-35.csv
* downloading...en-xh-37.csv
* downloading...en-xh-40.csv
* downloading...en-xh-41.csv
* downloading...en-xh-43.csv
* downloading...en-xh-45.csv
* downloading...en-xh-48.csv
* downloading...en-xh-50.csv
* downloading...en-xh-51.csv
* downloading...en-xh-53.csv
* downloading...en-xh-55.csv
* downloading...en-xh-57.csv
* downloading...en-xh-59.csv
* downloading...en-xh-61.csv
* downloading...en-xh-63.csv
* downloading...en-xh-64.csv
* downloading...en-xh-66.csv
* downloading...en-xh-68.csv
* downloading...en-xh-70.csv
* downloading...en-xh-72.csv
* downloading...en-xh-74.csv
* downloading...en-xh-76.csv
* downloading...en-xh-78.csv
* downloading...en-xh-80.csv
* downloading...en-xh-82.csv
* downloading...en-xh-84.csv
* downloading...en-xh-86.csv
* downloading...en-xh-88.csv
* downloading...en-xh-90.csv
* downloading...en-xh-92.csv
* downloading...en-xh-94.csv
* downloading...en-xh-96.csv
* downloading...en-xh-98.csv
* downloading...en-xh-100.csv
* downloading...en-xh-104.csv
* downloading...en-xh-108.csv
* downloading...en-xh-110.csv

The total sentence pairs for en-xh are: 91739

Found 63 files, downloaded and merged.



* Creating a merged dataset for the: en-zu.
* downloading...en-zu-1.csv
* downloading...en-zu-2.csv
* downloading...en-zu-3.csv
* downloading...en-zu-4.csv
* downloading...en-zu-5.csv
* downloading...en-zu-6.csv
* downloading...en-zu-7.csv
* downloading...en-zu-8.csv
* downloading...en-zu-9.csv
* downloading...en-zu-10.csv
* downloading...en-zu-11.csv
* downloading...en-zu-12.csv
* downloading...en-zu-13.csv
* downloading...en-zu-14.csv
* downloading...en-zu-15.csv
* downloading...en-zu-16.csv
* downloading...en-zu-17.csv
* downloading...en-zu-20.csv
* downloading...en-zu-25.csv
* downloading...en-zu-26.csv
* downloading...en-zu-27.csv
* downloading...en-zu-28.csv
* downloading...en-zu-30.csv
* downloading...en-zu-31.csv
* downloading...en-zu-33.csv
* downloading...en-zu-34.csv
* downloading...en-zu-35.csv
* downloading...en-zu-37.csv
* downloading...en-zu-40.csv
* downloading...en-zu-41.csv
* downloading...en-zu-43.csv
* downloading...en-zu-45.csv
* downloading...en-zu-48.csv
* downloading...en-zu-50.csv
* downloading...en-zu-51.csv
* downloading...en-zu-53.csv
* downloading...en-zu-55.csv
* downloading...en-zu-57.csv
* downloading...en-zu-59.csv
* downloading...en-zu-61.csv
* downloading...en-zu-63.csv
* downloading...en-zu-64.csv
* downloading...en-zu-66.csv
* downloading...en-zu-68.csv
* downloading...en-zu-70.csv
* downloading...en-zu-72.csv
* downloading...en-zu-74.csv
* downloading...en-zu-76.csv
* downloading...en-zu-78.csv
* downloading...en-zu-80.csv
* downloading...en-zu-82.csv
* downloading...en-zu-84.csv
* downloading...en-zu-86.csv
* downloading...en-zu-88.csv
* downloading...en-zu-90.csv
* downloading...en-zu-92.csv
* downloading...en-zu-94.csv
* downloading...en-zu-96.csv
* downloading...en-zu-98.csv
* downloading...en-zu-100.csv
* downloading...en-zu-104.csv
* downloading...en-zu-108.csv
* downloading...en-zu-110.csv

The total sentence pairs for en-zu are: 91739

Found 63 files, downloaded and merged.



* Creating a merged dataset for the: en-ts.
* downloading...en-ts-1.csv
* downloading...en-ts-2.csv
* downloading...en-ts-3.csv
* downloading...en-ts-4.csv
* downloading...en-ts-5.csv
* downloading...en-ts-6.csv
* downloading...en-ts-7.csv
* downloading...en-ts-8.csv
* downloading...en-ts-9.csv
* downloading...en-ts-10.csv
* downloading...en-ts-11.csv
* downloading...en-ts-12.csv
* downloading...en-ts-13.csv
* downloading...en-ts-14.csv
* downloading...en-ts-15.csv
* downloading...en-ts-16.csv
* downloading...en-ts-17.csv
* downloading...en-ts-20.csv
* downloading...en-ts-25.csv
* downloading...en-ts-26.csv
* downloading...en-ts-27.csv
* downloading...en-ts-28.csv
* downloading...en-ts-30.csv
* downloading...en-ts-31.csv
* downloading...en-ts-33.csv
* downloading...en-ts-34.csv
* downloading...en-ts-35.csv
* downloading...en-ts-37.csv
* downloading...en-ts-40.csv
* downloading...en-ts-41.csv
* downloading...en-ts-43.csv
* downloading...en-ts-45.csv
* downloading...en-ts-48.csv
* downloading...en-ts-50.csv
* downloading...en-ts-51.csv
* downloading...en-ts-53.csv
* downloading...en-ts-55.csv
* downloading...en-ts-57.csv
* downloading...en-ts-59.csv
* downloading...en-ts-61.csv
* downloading...en-ts-63.csv
* downloading...en-ts-64.csv
* downloading...en-ts-66.csv
* downloading...en-ts-68.csv
* downloading...en-ts-70.csv
* downloading...en-ts-72.csv
* downloading...en-ts-74.csv
* downloading...en-ts-76.csv
* downloading...en-ts-78.csv
* downloading...en-ts-80.csv
* downloading...en-ts-82.csv
* downloading...en-ts-84.csv
* downloading...en-ts-86.csv
* downloading...en-ts-88.csv
* downloading...en-ts-90.csv
* downloading...en-ts-92.csv
* downloading...en-ts-94.csv
* downloading...en-ts-96.csv
* downloading...en-ts-98.csv
* downloading...en-ts-100.csv
* downloading...en-ts-104.csv
* downloading...en-ts-108.csv
* downloading...en-ts-110.csv

The total sentence pairs for en-ts are: 91739

Found 63 files, downloaded and merged.



* Creating a medged dataset for the: en-st.
* downloading...en-st-1.csv
* downloading...en-st-2.csv
* downloading...en-st-3.csv
* downloading...en-st-4.csv
* downloading...en-st-5.csv
* downloading...en-st-6.csv
* downloading...en-st-7.csv
* downloading...en-st-8.csv
* downloading...en-st-9.csv
* downloading...en-st-10.csv
* downloading...en-st-11.csv
* downloading...en-st-12.csv
* downloading...en-st-13.csv
* downloading...en-st-14.csv
* downloading...en-st-15.csv
* downloading...en-st-16.csv
* downloading...en-st-17.csv
* downloading...en-st-20.csv
* downloading...en-st-25.csv
* downloading...en-st-26.csv
* downloading...en-st-27.csv
* downloading...en-st-28.csv
* downloading...en-st-30.csv
* downloading...en-st-31.csv
* downloading...en-st-33.csv
* downloading...en-st-34.csv
* downloading...en-st-35.csv
* downloading...en-st-37.csv
* downloading...en-st-41.csv
* downloading...en-st-40.csv
* downloading...en-st-43.csv
* downloading...en-st-45.csv
* downloading...en-st-48.csv
* downloading...en-st-50.csv
* downloading...en-st-51.csv
* downloading...en-st-53.csv
* downloading...en-st-55.csv
* downloading...en-st-57.csv
* downloading...en-st-59.csv
* downloading...en-st-61.csv
* downloading...en-st-63.csv
* downloading...en-st-64.csv
* downloading...en-st-66.csv
* downloading...en-st-68.csv
* downloading...en-st-70.csv
* downloading...en-st-72.csv
* downloading...en-st-74.csv
* downloading...en-st-76.csv
* downloading...en-st-78.csv
* downloading...en-st-80.csv
* downloading...en-st-82.csv
* downloading...en-st-84.csv
* downloading...en-st-86.csv
* downloading...en-st-88.csv
* downloading...en-st-90.csv
* downloading...en-st-92.csv
* downloading...en-st-94.csv
* downloading...en-st-96.csv
* downloading...en-st-98.csv
* downloading...en-st-100.csv
* downloading...en-st-104.csv
* downloading...en-st-108.csv
* downloading...en-st-110.csv

The total sentence pairs for en-st are: 91739

Found 63 files, downloaded and medged.




### Creating a dataset

Now we want to combine all the language pairs into a single `.csv` file called `za-nmt-dataset.csv`, with one column for the English sentences and one column per target language.

In [32]:
columns = "en, af, st, zu, ts, xh".split(", ")
en = []
# One list of target sentences per language code.
targets = {lang: [] for lang in columns[1:]}
# Sort the listing so the read order is reproducible.
for _file in sorted(os.listdir(save_dir)):
  # The language code is the second-to-last '-'-separated part of the file name.
  # Skip anything else, such as a `za-nmt-dataset.csv` left by an earlier run.
  parts = _file.split('-')
  if len(parts) < 2 or parts[-2] not in targets:
    continue
  df = pd.read_csv(os.path.join(save_dir, _file))
  # The English sentences are taken once, from the first pair file read.
  if not en:
    en.extend(df.src.values)
  targets[parts[-2]].extend(df.trg.values)

af, st, zu, ts, xh = (targets[lang] for lang in columns[1:])
assert len(en) == len(af) == len(st) == len(zu) == len(ts) == len(xh), "The src and trg sentence lists must all have the same length."
dataframe = pd.DataFrame(np.column_stack([en, af, st, zu, ts, xh]), columns=columns)
dataframe.drop_duplicates(subset=["en"], inplace=True)
dataframe.to_csv(os.path.join(save_dir, "za-nmt-dataset.csv"))
print("Done")

Done
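
The cell above pairs the columns by row position, so it relies on every merged file listing the English sentences in the same order. As a minimal sketch of a stricter alternative, the translations could instead be aligned by the English sentence itself. The helper `align_by_source` and the toy sentences below are hypothetical and are not part of the notebook; the sketch uses plain Python so that it can run without the Drive files.

```python
def align_by_source(pairs_by_lang):
    """Align translations by English sentence rather than by row position.

    `pairs_by_lang` maps a language code to a list of (src, trg) tuples.
    Returns a list of rows shaped like {"en": ..., "<lang>": ..., ...},
    keeping only sentences that have a translation in every language.
    """
    # One src -> trg lookup per language; a repeated src keeps its last trg.
    lookups = {lang: dict(pairs) for lang, pairs in pairs_by_lang.items()}
    first_lang = next(iter(pairs_by_lang))
    rows, seen = [], set()
    # Walk the first language's sentences so the output order is predictable.
    for src, _ in pairs_by_lang[first_lang]:
        if src in seen or not all(src in lookup for lookup in lookups.values()):
            continue
        seen.add(src)
        row = {"en": src}
        row.update({lang: lookup[src] for lang, lookup in lookups.items()})
        rows.append(row)
    return rows


# Toy data standing in for the real pair files (made up for illustration).
pairs = {
    "af": [("hello", "hallo"), ("thank you", "dankie")],
    "zu": [("thank you", "ngiyabonga"), ("hello", "sawubona")],
}
rows = align_by_source(pairs)
for row in rows:
    print(row)
```

Each row can then be passed to `pd.DataFrame(rows)` to get the same column layout as `za-nmt-dataset.csv`. The trade-off is that sentences missing from any one language are dropped instead of triggering the length assertion.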
