### French to English Translation with the `Multi30k` dataset.

Several machine translation datasets come with [torchtext.datasets](https://pytorch.org/text/stable/datasets.html#multi30k). We are going to do French to English translation, and later on we will do English to French translation. In this notebook we use `RNNs`; as we move on we will use ConvNets and Transformers with attention.

### Datasets that are available for machine translation in torchtext.

* Multi30k 
* IWSLT2016
* IWSLT2017

### Data preparation.

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
import spacy, math, random
import numpy as np
from torchtext.legacy import datasets, data
import time
from prettytable import PrettyTable
from matplotlib import pyplot as plt

**Note:** This notebook is based on this [notebook](https://github.com/CrispenGari/PyTorch-Python/blob/main/09_TorchText/03_Sequence_To_Sequence/04_Packed_Padded_Sequences%2C_Masking%2C_Inference_and_BLEU.ipynb)

### SEEDS

In [2]:
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Loading tokenizer models.
We are going to load two spaCy models for tokenization: the English model and the French model.

In [3]:
import spacy
spacy.cli.download('fr_core_news_sm')

spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


### Tokenization functions.

In [4]:
def tokenize_fr(sent):
  return [tok.text for tok in spacy_fr.tokenizer(sent)]

def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

#### Creating the Fields.
Since we are using packed padded sequences, we need to tell PyTorch how long the actual (non-padded) sequences are. Luckily, torchtext's `Field` does this for us: we just have to pass the argument `include_lengths=True`.

**Note:** Again, we only set `include_lengths=True` for the `SRC` (source) field.

In [5]:
SRC = data.Field(
    tokenize=tokenize_fr,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>",
    include_lengths=True,
)

TRG = data.Field(
    tokenize=tokenize_en,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>",
)
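
To make the effect of `lower`, `init_token`, `eos_token` and `include_lengths` concrete, here is a minimal pure-Python sketch of what a source batch conceptually looks like. It is not torchtext's implementation: the helper names `preprocess` and `pad_batch` are hypothetical, a whitespace split stands in for the spaCy tokenizer, and the batch holds tokens rather than vocabulary indices.

```python
from typing import List, Tuple

PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

def preprocess(sentence: str) -> List[str]:
    # Stand-in for the spaCy tokenizer (assumption): lower-case, then split on whitespace.
    tokens = sentence.lower().split()
    # Wrap the sentence with the start and end markers, as init_token / eos_token do.
    return [SOS] + tokens + [EOS]

def pad_batch(sentences: List[str]) -> Tuple[List[List[str]], List[int]]:
    tokenized = [preprocess(s) for s in sentences]
    # True lengths before padding; in this sketch they count <sos> and <eos>.
    lengths = [len(t) for t in tokenized]
    max_len = max(lengths)
    # Right-pad every sentence to the length of the longest one in the batch.
    padded = [t + [PAD] * (max_len - len(t)) for t in tokenized]
    return padded, lengths

batch, lengths = pad_batch(["Un chat noir", "Deux chiens courent dans le parc"])
for row, n in zip(batch, lengths):
    print(n, row)
```

The lengths are what a packed padded sequence needs: they tell the encoder where each sentence really ends, so the `<pad>` positions can be skipped.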


### Loading the `Multi30k` dataset.
This time we are using French instead of German, so the `src` extension changes from `.de` to `.fr` in the `exts` tuple.

**Note:** Again, the `SRC` extension must come first in the `exts` tuple.
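
As a rough illustration of how the `exts` tuple turns into file names, the sketch below builds the paths that would be looked up for each split and reports any that are missing on disk. The helper names are hypothetical, and the `multi30k` folder plus the `train`, `val` and `test2016` prefixes are assumptions based on the legacy torchtext defaults; adjust them if your setup differs.

```python
import os
from typing import List, Sequence

def expected_files(
    root: str,
    exts: Sequence[str],
    splits: Sequence[str] = ("train", "val", "test2016"),  # assumed default prefixes
    name: str = "multi30k",                                # assumed dataset folder
) -> List[str]:
    # Each split prefix is combined with each extension, source extension first.
    return [os.path.join(root, name, split + ext) for split in splits for ext in exts]

def missing_files(paths: Sequence[str]) -> List[str]:
    # Keep only the paths that do not exist on disk.
    return [p for p in paths if not os.path.exists(p)]

paths = expected_files(".data", (".fr", ".en"))
for p in missing_files(paths):
    print("missing:", p)
```

Running a check like this after the download step shows whether the files for a given language pair are actually present before `splits` tries to open them.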

In [17]:
train_data, valid_data, test_data = datasets.Multi30k.splits(
    root=".data",
    exts=('.fr', '.en'),
    fields = (SRC, TRG),
)

downloading training.tar.gz
training.tar.gz: 100%|██████████| 1.21M/1.21M [00:00<00:00, 1.60MB/s]
downloading validation.tar.gz
validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 242kB/s]
downloading mmt_task1_test2016.tar.gz
mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 236kB/s]

FileNotFoundError: ignored