<a href="https://colab.research.google.com/github/RahulDogra-92/Pytorch-Projects/blob/main/Build_Neural_Machine_Translator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [55]:
#Downloading the English language model from spaCy
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [56]:
#Downloading German Language Model
!python -m spacy download de

Collecting de_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9 MB)
✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


#Building Neural Machine Translator

This notebook builds a model that translates German into English using a sequence-to-sequence architecture.

#Necessary Imports

In [97]:
!pip install -U torch==1.8.0 torchtext==0.9.0

Collecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu102
    Uninstalling torch-1.9.0+cu102:
      Successfully uninstalled torch-1.9.0+cu102
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.10.0
    Uninstalling torchtext-0.10.0:
      Successfully uninstalled torchtext-0.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.10.0+cu102 requires torch==1.9.0, but you have torch 1.8.0 which is incompatible.
Successfully inst

In [98]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator, dataset

import spacy
import random
import math
import os

In [99]:
#Fix the random seeds so that results are repeatable
SEED = 2222
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [100]:
#Load the German and English language models from spaCy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [101]:
#Functions that split English and German sentences into individual tokens
def process_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]
def process_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

In [102]:
#The Field class in torchtext defines how the data is to be processed. We use Fields to tokenize each sentence,
#convert all tokens to lowercase, and add a start-of-sequence token (<sos>) at the beginning and an
#end-of-sequence token (<eos>) at the end of each sentence. A sequence-to-sequence model starts generating
#tokens as soon as it sees the <sos> token and continues until it reaches the <eos> token.

#So the Field's responsibility is to ensure that the strings are tokenized, converted to lowercase, and
#wrapped in <sos> and <eos> tokens.

Source = Field(tokenize=process_de, init_token='<sos>', eos_token='<eos>', lower=True) #tokenization of german as inputs
Target = Field(tokenize=process_en, init_token='<sos>', eos_token='<eos>', lower=True) #tokenization of english as outputs
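
To make the role of the two Field objects concrete, here is a minimal pure-Python sketch of the preprocessing they are configured to do: tokenize, lowercase, and wrap the result in <sos> and <eos>. The helper name `preprocess` and the whitespace tokenizer are assumptions made for illustration. The notebook itself uses the spaCy tokenizers defined above, and torchtext's real implementation differs in detail.

```python
def preprocess(sentence, tokenize=str.split, init_token='<sos>', eos_token='<eos>'):
    """Hypothetical helper that mimics Field(tokenize=..., lower=True, init_token=..., eos_token=...)."""
    # Split into tokens (whitespace splitting is an assumption; the notebook uses spaCy)
    tokens = [tok.lower() for tok in tokenize(sentence)]
    # Wrap the sentence in start and end markers
    return [init_token] + tokens + [eos_token]


print(preprocess("Zwei Hunde spielen im Park"))
# ['<sos>', 'zwei', 'hunde', 'spielen', 'im', 'park', '<eos>']
```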

In [103]:
#Now the fields are ready. We load the Multi30k dataset that ships with torchtext, apply the fields to it, and
#split it into training/validation/test sets. exts gives the file extensions to use for the source and target
#languages, since Multi30k also holds data for languages other than English and German.
#Field objects only work with the legacy dataset class, so it is imported here and shadows the Multi30k imported earlier.
from torchtext.legacy.datasets import Multi30k

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(Source, Target), root='.data')


In [104]:
len(train_data),len(valid_data),len(test_data)

(29000, 1014, 1000)

In [105]:
#Next we build a vocabulary for each language so that every token has an integer index; the model uses
#this index internally to represent the token (for example, as a one-hot vector).

#We call build_vocab on each field with the training data and min_freq=2, so only tokens that appear
#at least twice are added to the vocabulary. Rarer tokens are mapped to <unk>.

Source.build_vocab(train_data, min_freq=2)
Target.build_vocab(train_data, min_freq=2)
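
The effect of min_freq can be illustrated with a small pure-Python sketch. The helper `build_vocab` below, its ordering rule, and the toy corpus are assumptions for illustration only; this is not torchtext's implementation, only the same idea of counting tokens, keeping the frequent ones, and falling back to <unk> for the rest.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2, specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Hypothetical helper: map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)  # special tokens take the first indices
    # Most frequent tokens first; ties broken alphabetically so the result is deterministic
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


corpus = [['a', 'dog', 'runs'], ['a', 'cat', 'runs'], ['a', 'bird', 'sings']]
stoi, itos = build_vocab(corpus)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'runs']

# Tokens seen fewer than min_freq times fall back to the <unk> index
unk = stoi['<unk>']
print([stoi.get(tok, unk) for tok in ['a', 'dog', 'runs']])  # [4, 0, 5]
```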

In [106]:
#Next we create iterators. They split the dataset into batches, pad the sentences, and move the batches to the
#chosen device. torchtext's BucketIterator does this: each batch it returns has a src attribute and a trg
#attribute holding the sentences as token indices. BucketIterator groups sentences of similar length together
#so that each batch needs as little padding as possible.

BATCH_SIZE = 128
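
The bucketing idea described above can be sketched in plain Python. The helper `bucket_batches` and the toy data are hypothetical and only illustrate the principle of grouping similar-length sentences before padding; torchtext's BucketIterator works differently internally.

```python
import random


def bucket_batches(sentences, batch_size, pad_token='<pad>', seed=0):
    """Hypothetical helper: group similar-length sentences, then pad each batch to its own max length."""
    by_length = sorted(sentences, key=len)
    batches = [by_length[i:i + batch_size] for i in range(0, len(by_length), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle the order of batches, not their contents
    padded = []
    for batch in batches:
        width = max(len(s) for s in batch)
        padded.append([s + [pad_token] * (width - len(s)) for s in batch])
    return padded


data = [['a'], ['a', 'b', 'c', 'd'], ['a', 'b'], ['a', 'b', 'c'], ['a', 'b'], ['a', 'b', 'c', 'd', 'e']]
batches = bucket_batches(data, batch_size=2)
total_pad = sum(row.count('<pad>') for batch in batches for row in batch)
print(total_pad)  # 3 padding tokens, versus 7 if the sentences were batched in their original order
```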

In [107]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [108]:
device.type


'cpu'

In [123]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort=False,  #don't sort test/validation data
    batch_size=BATCH_SIZE, device=device)