<a href="https://colab.research.google.com/github/Rajalaxm/language_translation/blob/main/language_translation_IS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth',200)

Our data is a text file of English-German sentence pairs. First we will read the file using the function defined below.

In [2]:
#function to read raw text file
def read_text(filename):
  #open the file, read all text, and close it automatically
  with open(filename, mode='rt', encoding='utf-8') as file:
    text = file.read()
  return text

Now let's define a function to split the text into English-German pairs separated by '\n' and then split these pairs into English sentences and German sentences.

In [3]:
#split a text into sentences
def to_lines(text):
  sents= text.strip().split('\n')
  sents= [i.split('\t') for i in sents]
  return sents
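
To make the splitting concrete, here is a minimal sketch that runs the same logic on a two-line string. The sample text is a hypothetical stand-in for the file contents, not data read from deu.txt.

```python
# Hypothetical stand-in for the raw file contents: one pair per line,
# English and German separated by a tab
sample_text = "Hi.\tHallo!\nRun!\tLauf!\n"

def to_lines(text):
    # split into lines first, then split each line on the tab
    sents = text.strip().split('\n')
    return [line.split('\t') for line in sents]

pairs = to_lines(sample_text)
print(pairs)  # [['Hi.', 'Hallo!'], ['Run!', 'Lauf!']]
```

Each inner list holds the English sentence at index 0 and the German sentence at index 1, which is the column layout the later cells rely on.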

In [5]:
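data = read_text("/content/sample_data/deu.txt")
deu_eng = to_lines(data)
#keep only the first two fields of each line so the array is 2-D
#(assumption: the file may carry extra tab-separated columns after the German sentence)
deu_eng = array([pair[:2] for pair in deu_eng])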


The actual data contains over 150,000 sentence pairs. However, we will use only the first 50,000 sentence pairs to reduce the training time of the model.

In [11]:
deu_eng = deu_eng[:50000, :]

In [None]:
#Let's take a look at our data
deu_eng

Text to sequence conversion: 
To feed our data into a Seq2Seq model, we will have to convert both the input and output sentences into integer sequences of fixed length. Before that, let's visualise the length of the sentences. We will capture the lengths of all the sentences in two separate lists, one for English and one for German.

In [None]:
#empty lists
eng_l = []
deu_l = []
#populate the lists with sentence lengths
#column 0 holds the English sentences
for i in deu_eng[:, 0]:
  eng_l.append(len(i.split()))

#column 1 holds the German sentences
for i in deu_eng[:, 1]:
  deu_l.append(len(i.split()))

In [None]:
length_df = pd.DataFrame({'eng': eng_l, 'deu': deu_l})

In [None]:
length_df.hist(bins =  30)
plt.show()

The maximum length of the German sentences is 11 and that of the English phrases is 8.
Let's vectorize our text data using Keras's Tokenizer() class. It will turn our sentences into sequences of integers. Then we will pad those sequences with zeros to make all the sequences the same length.

In [None]:
#function to build a tokenizer
def tokenization(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer

In [None]:
#prepare English tokenizer 
eng_tokenizer = tokenization(deu_eng[ :,0])
eng_vocab_size = len(eng_tokenizer.word_index)+1
eng_length = 8
print('English Vocabulary Size: %d' %eng_vocab_size)

In [None]:
#prepare German tokenizer
deu_tokenizer = tokenization(deu_eng[:, 1])
deu_vocab_size = len(deu_tokenizer.word_index) + 1

deu_length = 8
print('German Vocabulary Size: %d' % deu_vocab_size)

Given below is a function to prepare the sequences. It also pads each sequence to the maximum sentence length set above.

In [None]:
#encode and pad sequences
def encode_sequences(tokenizer,length,lines):
  #integer encode sequences
  seq = tokenizer.texts_to_sequences(lines)
  #pad sequences with 0 values
  seq = pad_sequences(seq, maxlen= length,padding = 'post')
  return seq
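
To show what integer encoding and post-padding produce, here is a simplified pure-Python sketch. The helpers `build_word_index` and `encode_and_pad` are hypothetical stand-ins written for illustration; they are not Keras's implementation. In particular, this sketch numbers words in order of first appearance and only lowercases and splits on whitespace, whereas Keras's Tokenizer ranks words by frequency and also filters punctuation. The truncation step drops words from the front of an over-long sentence, which is meant to mirror the default behaviour of pad_sequences.

```python
def build_word_index(lines):
    # assign ids starting at 1 in order of first appearance; 0 is kept for padding
    word_index = {}
    for line in lines:
        for word in line.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode_and_pad(word_index, length, lines):
    encoded = []
    for line in lines:
        # integer-encode the sentence, skipping words outside the vocabulary
        seq = [word_index[w] for w in line.lower().split() if w in word_index]
        # keep only the last `length` ids if the sentence is too long
        seq = seq[-length:]
        # pad with zeros at the end (post-padding)
        seq = seq + [0] * (length - len(seq))
        encoded.append(seq)
    return encoded

lines = ["i am here", "i am not here now"]
word_index = build_word_index(lines)
padded = encode_and_pad(word_index, 4, lines)
print(padded)  # [[1, 2, 3, 0], [2, 4, 3, 5]]
```

The second sentence has five words, so with a length of 4 its first word is lost. The same thing happens to any sentence longer than the chosen length in the real pipeline.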

Model building
We will now split the data into train and test sets for model training and evaluation respectively.

In [None]:
from sklearn.model_selection import train_test_split
#random_state fixes the shuffle so the split is reproducible; 12 is an arbitrary seed
train, test = train_test_split(deu_eng, test_size=0.2, random_state=12)

It's time to encode the sentences. We will encode German sentences as the input sequences and English sentences as the target sequences. This is done for both the train and test datasets.

In [None]:
#prepare training data
trainX = encode_sequences(deu_tokenizer, deu_length,train[ :,1])
trainY = encode_sequences(eng_tokenizer,eng_length,train[ :,0])

In [None]:
#prepare validation data
testX = encode_sequences(deu_tokenizer, deu_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])

Now comes the exciting part! Let us define our Seq2Seq model architecture. We are using an Embedding layer and an LSTM layer as our encoder, and another LSTM layer followed by a Dense layer as the decoder.

In [None]:
#Build NMT Model
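def build_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
  model = Sequential()
  #encoder: embed the input tokens, then summarize the sentence in a single vector
  model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
  model.add(LSTM(units))
  #repeat the sentence vector once per output timestep
  model.add(RepeatVector(out_timesteps))
  #decoder: produce a distribution over the output vocabulary at every timestep
  model.add(LSTM(units, return_sequences=True))
  model.add(Dense(out_vocab, activation='softmax'))
  return model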
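
The RepeatVector layer is the bridge between encoder and decoder: the encoder LSTM returns one vector per sentence, while the decoder LSTM expects one input per output timestep. The NumPy sketch below imitates that shape change with small, hypothetical sizes. It illustrates the idea only and does not use Keras itself.

```python
import numpy as np

# hypothetical small sizes, chosen only to keep the arrays readable
batch, units, out_timesteps = 2, 3, 4

# stand-in for the encoder LSTM output: one vector per input sentence
encoded = np.arange(batch * units, dtype=float).reshape(batch, units)

# copy each sentence vector once per output timestep
repeated = np.repeat(encoded[:, np.newaxis, :], out_timesteps, axis=1)

print(encoded.shape)   # (2, 3)
print(repeated.shape)  # (2, 4, 3)
```

Every timestep of `repeated` carries the same sentence vector, so the decoder sees the full encoded sentence at each step it predicts.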
     

We are using RMSprop optimizer in this model as it is usually a good choice for recurrent neural networks.

In [None]:
model = build_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)
rms = optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')

We have used 'sparse_categorical_crossentropy' as the loss function because it allows us to use the target sequence as it is, instead of in one-hot encoded format. One-hot encoding the target sequences with such a large vocabulary might consume our system's entire memory.
It seems we are all set to start training our model. We will train it for 5 epochs with a batch size of 512. You may change and play with these hyperparameters. We will also use ModelCheckpoint() to save the best model, the one with the lowest validation loss. I personally prefer this method over early stopping.
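
On the choice of loss function: the sparse and one-hot forms of cross-entropy give the same value, and only the storage of the targets differs. The NumPy sketch below checks this on made-up numbers. The vocabulary size, target ids, and probabilities are all hypothetical and far smaller than in our model.

```python
import numpy as np

vocab_size = 5                  # hypothetical tiny vocabulary
targets = np.array([2, 0, 4])   # integer word ids, in the form the tokenizer produces

# hypothetical softmax outputs: one probability distribution per target word
probs = np.array([
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.2, 0.2, 0.2, 0.2, 0.2],
])

# sparse form: look up the probability of the correct id directly
sparse_loss = -np.log(probs[np.arange(len(targets)), targets])

# one-hot form: build a (3, vocab_size) matrix first, then sum over the vocabulary
one_hot = np.eye(vocab_size)[targets]
dense_loss = -np.sum(one_hot * np.log(probs), axis=1)

print(np.allclose(sparse_loss, dense_loss))  # True
print(targets.size, one_hot.size)            # 3 15
```

The one-hot targets hold vocab_size times as many numbers as the integer targets, which is the memory cost the sparse loss avoids.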

In [None]:
filename = 'model.h1.24_rajalaxmi'
checkpoint = ModelCheckpoint(filename, monitor= 'val_loss',verbose=1, save_best_only = True, mode='min')

#the sparse loss expects targets with a trailing axis of size 1
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=5, batch_size=512,
                    validation_split=0.2,
                    callbacks=[checkpoint], verbose=1)

Let's compare the training and validation loss.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'validation'])
plt.show()

Let's load the saved model to make predictions.

In [None]:
model = load_model('model.h1.24_rajalaxmi')
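#predict_classes is not available in recent Keras releases; take the argmax over the vocabulary axis instead
preds = argmax(model.predict(testX.reshape((testX.shape[0], testX.shape[1]))), axis=-1)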

In [None]:
def get_word(n, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == n:
      return word
  return None
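
The function above scans the whole vocabulary for every predicted id. As a possible alternative, the mapping can be inverted once and then queried directly. The sketch below uses a hypothetical `word_index` dictionary in the same {word: id} form as tokenizer.word_index, and `get_word_fast` is a name made up for this example. Keras's Tokenizer should also carry an `index_word` attribute with the same content, but treat that as an assumption to verify for your version.

```python
# hypothetical vocabulary, in the same {word: id} form as tokenizer.word_index
word_index = {'the': 1, 'cat': 2, 'sat': 3}

# invert once so each lookup is a dictionary access instead of a scan
index_word = {index: word for word, index in word_index.items()}

def get_word_fast(n, index_word):
    # returns None for ids outside the vocabulary, such as the padding id 0
    return index_word.get(n)

print(get_word_fast(2, index_word))  # cat
print(get_word_fast(0, index_word))  # None
```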

In [None]:
#convert predictions into text (English)
preds_text = []
for i in preds:
  temp = []
  for j in range(len(i)):
    t = get_word(i[j], eng_tokenizer)
    if j > 0:
      #skip padding and words that repeat the previous prediction
      if (t == get_word(i[j-1], eng_tokenizer)) or (t is None):
        temp.append('')
      else:
        temp.append(t)
    else:
      if t is None:
        temp.append('')
      else:
        temp.append(t)
  #one decoded sentence per prediction
  preds_text.append(' '.join(temp))
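
The decoding rule used above (drop padding, drop a word that repeats the one before it) can be tried in isolation. The sketch below is a compact variant, not the exact loop from the cell: it skips unwanted positions instead of inserting empty strings, so the output has no extra spaces. The function name `ids_to_sentence` and the vocabulary are hypothetical.

```python
def ids_to_sentence(ids, index_word):
    words = []
    previous = None
    for n in ids:
        # None for the padding id 0 or any id outside the vocabulary
        word = index_word.get(n)
        # keep the word only if it exists and differs from the previous position
        if word is not None and word != previous:
            words.append(word)
        previous = word
    return ' '.join(words)

# hypothetical {id: word} vocabulary and one predicted id sequence
index_word = {1: 'the', 2: 'cat', 3: 'sat'}
sentence = ids_to_sentence([1, 2, 2, 3, 0, 0], index_word)
print(sentence)  # the cat sat
```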

In [None]:
pred_df = pd.DataFrame({'actual': test[:, 0], 'predicted': preds_text})

In [None]:
pd.set_option('display.max_colwidth', 200)

In [None]:
pred_df.head(15)

In [None]:
pred_df.tail(15)