### Custom dataset.
In this notebook we are going to leran how we can load our own custom dataset from files. The dataset that i am using was found on [this site](http://www.statmt.org/europarl/).

First we will define the path where our files are located as the base_path. In my case i am using google drive.

In [2]:
base_path = '/content/drive/MyDrive/NLP Data/seq2seq/fr-eng'

### Imports

In [3]:
import os
import torch
from torchtext.legacy import data, datasets
import json
import pandas as pd
from sklearn.model_selection import train_test_split

We have two text files for the french and english sentences with the following file names:

```py
fr = "europarl-v7.fr-en.fr"
en = "europarl-v7.fr-en.en"
```

In [4]:
fr_path = "europarl-v7.fr-en.fr"
en_path = "europarl-v7.fr-en.en"

Now let's load the text into list of strings. We are going to use the new line as the surperator of each sentence.

In [8]:
eng_sentences = open(os.path.join(base_path, en_path), encoding='utf8').read().split('\n')
fr_sentences = open(os.path.join(base_path, fr_path), encoding='utf8').read().split('\n')

### Next we will check how many examples do we have for each language.

In [9]:
print("eng: ", len(eng_sentences))
print("fr: ", len(fr_sentences))

eng:  2007724
fr:  2007724


### Creating a pandas dataframe
Creatting the pd dataframe will help us to split the sets into train and test and the convert the splitted dataframes into either `.json` or `.csv` files which are the formats that are accepted by the `torchtext`. To make this very simple Im going to use only `500` sentence french to english pairs.

In [11]:
size = 500
raw_data ={
    'eng': [sent for sent in eng_sentences[:size]],
    'fr': [sent for sent in fr_sentences[:size]],
}

dataframe = pd.DataFrame(raw_data, columns=['eng', 'fr'])

### Checking our dataframe

In [12]:
dataframe.head(4)

Unnamed: 0,eng,fr
0,Resumption of the session,Reprise de la session
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...


### Spliting the datasets.
We are going to use `sklearn` `train_test_split` to split these two datasets for the train and validation sets.

In [13]:
train, val = train_test_split(dataframe, test_size=.2)
len(train), len(val)

(400, 100)

### Creating json files.

We are going to create `json` files and save them to the `base_path` for these two sets. We will be using the `.to_json()` method to do this. 

**Note** you can also use the `.to_csv()` to create `csv` files for example:

```py
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)
```

**Note**: When you are using `.to_json()` we should pass the arg `orient="records"` so that these json files will be the files that can be accepted by the `torchtext`. Basically what this is doing is to add json files as records by removing the list `[]` brakets

In [16]:
train.to_json(os.path.join(base_path, 'train.json'), orient="records", lines=True)
val.to_json(os.path.join(base_path, 'val.json'), orient="records", lines=True)

Now each record has the following format:

```json
{"eng":"For us new members, it was the first time, and this was a very interesting process.","fr":"C' \u00e9tait pour nous, nouveaux d\u00e9put\u00e9s, la premi\u00e8re fois, et c' est un processus extr\u00eamement int\u00e9ressant."}
```

### Let's load the tokenizer models

In [17]:
import spacy
import spacy.cli
spacy.cli.download('fr_core_news_sm')
import fr_core_news_sm, en_core_web_sm
spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('fr_core_news_sm')


In [36]:
def tokenize_fr(sent):
  sent = sent.lower()
  return [tok for tok in spacy_fr.tokenizer(sent)]

def tokenize_en(sent):
  sent = sent.lower()
  return [tok for tok in spacy_en.tokenizer(sent)]

### Creating fields

In [37]:
SRC = data.Field(
    tokenize = tokenize_fr,
    init_token = "<sos>",
    eos_token = "<eos>"
)
TRG = data.Field(
    tokenize = tokenize_en,
    init_token = "<sos>",
    eos_token = "<eos>"
)

In [38]:
fields ={
    "fr": ("src", SRC),
    "eng": ("trg", TRG)
}

### We are now ready to create our dataset.

We are going to use the `TabularDataset.splits()` method to create the train and validation datasets.

In [55]:
train_data, val_data = data.TabularDataset.splits(
  base_path,
  format="json",
  train="train.json",
  validation= 'val.json',
  fields=fields
)

In [51]:
print(vars(train_data.examples[0]))

{'src': [c, ', était, pour, nous, ,, nouveaux, députés, ,, la, première, fois, ,, et, c, ', est, un, processus, extrêmement, intéressant, .], 'trg': [for, us, new, members, ,, it, was, the, first, time, ,, and, this, was, a, very, interesting, process, .]}


### Building the vocabulary
Now we are ready to build the vocabulary.

**Note** In this simple example we will build the vocab on both sets. It is recomended that _when building the vocabulary we only need to build it on the train set_.

We will be building the vocab as follows without `min_freq=2` args since our dataset is small:

**Note**: The `min_freq=2` allows us to set the minimum frequency of each word meaning a word that appears less than two times will be converted to `<unk>` token.

```py
SRC.build_vocab(train_data, val_data, max_size=1000)
TRG.build_vocab(train_data, val_data, max_size=1000)
```



In [63]:
SRC.build_vocab(train_data, val_data, max_size=1000)
TRG.build_vocab(train_data, val_data, max_size=1000)

In [64]:
TRG.vocab.itos[11]

all

### Creating iterators

 Now you can create iterators and then load the iterators to the models. Again we are going to use the `BucketIterator`.

In [65]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128

In [66]:
train_iter, val_iter = data.BucketIterator.splits(
    (train_data, val_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.src)
)

### Checking the a single batch

In [68]:
batch = next(iter(train_iter))
batch.src

tensor([[  2,   2,   2,  ...,   2,   2,   2],
        [136,  97, 244,  ..., 127, 116, 355],
        [625, 607,   0,  ...,   0, 846, 709],
        ...,
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1]], device='cuda:0')

### Resources used.

1. [This Blog Post](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95)
2. [Datasets List](http://www.statmt.org/europarl/)
3. [Alen Nie](https://anie.me/On-Torchtext/)

### Extra resources
1 [Harvard](http://nlp.seas.harvard.edu/2018/04/03/attention.html)