A Seq2Seq model is a machine learning approach used for language processing. It is also known as the 'Encoder-Decoder Architecture'. This architecture is useful for use cases where both the input and the output are sequences of elements. For example, when translating from one language to another (say English to French), the input is an English sentence, which is a sequence of words, and the output is a French sentence, which is also a sequence of words. Other such use cases include image captioning, speech processing, conversational models and text summarization.

A Seq2Seq model converts an input sequence into an output sequence. The input and output sequences may differ in length. Here is the architecture of a Seq2Seq model.
<img src='Seq2Seq-Architecture.png' style='align: right'>

The model has two core components: the 'Encoder' and the 'Decoder'.  
**Encoder -** The encoder component takes a sequence as input and returns a fixed-dimensional vector that summarizes the states of the input sequence.

**Decoder -** The decoder component turns the vector into an output sequence. The decoder is trained on both the output sequence and the fixed representation from the encoder.

The encoder component is built using an Embedding layer and an RNN (specifically an LSTM or GRU, to handle the vanishing gradient problem and the long-term dependency problem).  
The Embedding layer accepts a 2D array of input sequences of shape (batch_size, sequence_length), where batch_size is the number of input samples and sequence_length is the maximum length of any sequence. The Embedding layer then returns a 3D array of shape (batch_size, input_length, output_dim), where input_length is again the maximum sequence length and output_dim is the fixed dimension of each embedding vector.  
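
The shape change from 2D to 3D can be illustrated with a small NumPy sketch. This is not the Keras layer itself: it only mimics the lookup with a random matrix, and the sizes and names (`embedding_matrix`, `token_ids`) are assumptions chosen for illustration.

```python
import numpy as np

# Toy sizes, chosen only for illustration (not the values used later on).
vocab_size, output_dim = 10, 3
batch_size, sequence_length = 2, 4

rng = np.random.default_rng(0)
# One row of output_dim numbers per vocabulary index.
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# 2D integer input: each row is one padded sequence of word indices.
token_ids = np.array([[1, 5, 2, 0],
                      [7, 3, 0, 0]])

# Looking up every index turns the 2D array into a 3D array.
embedded = embedding_matrix[token_ids]
print(token_ids.shape)  # (2, 4)
print(embedded.shape)   # (2, 4, 3)
```

In the real layer the rows of the matrix are trainable weights rather than fixed random numbers.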

This 3D array is then passed to the RNN (LSTM/GRU). The RNN processes the first item in the sequence and produces a hidden state. That hidden state is then passed, along with the next item in the input sequence, to the next time step, which produces an updated state, and so on. Finally, the state after the last item in the sequence is returned as a vector of fixed size.
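
The recurrence can be sketched in NumPy as follows. This is a deliberately simplified plain tanh RNN, not an LSTM or GRU, with random untrained weights; the function name `encode` and all sizes are hypothetical. It only shows how a state is carried from one time step to the next and how the final state has a fixed size regardless of sequence length.

```python
import numpy as np

def encode(embedded, hidden_dim, seed=0):
    """Run a plain tanh RNN over the time axis and return the final hidden state."""
    batch_size, timesteps, input_dim = embedded.shape
    rng = np.random.default_rng(seed)
    w_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input-to-hidden weights
    w_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
    b = np.zeros(hidden_dim)
    h = np.zeros((batch_size, hidden_dim))  # initial state
    for t in range(timesteps):
        # The new state depends on the current item and the previous state.
        h = np.tanh(embedded[:, t, :] @ w_x + h @ w_h + b)
    return h

embedded = np.random.default_rng(1).normal(size=(2, 4, 3))  # (batch, time, features)
context = encode(embedded, hidden_dim=5)
print(context.shape)  # (2, 5)
```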

The decoder component is implemented using an RNN that takes the vector returned by the encoder as input and predicts the next item in the output sequence. During training, this RNN learns from the items of the output sequence, given the last state from the encoder, which provides context for the output sequence.

Let's learn the encoder-decoder model using an example. In this example we will translate Bengali sentences into English sentences. The data can be downloaded from http://www.manythings.org/anki/

So the input sequence is a 'Bengali Sentence' and the output sequence is an 'English Sentence'.

Let's first load the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

The Bengali and English sentence pairs downloaded from the given site are in a text file, separated by a tab. So we will first load the sentence pairs from the text file.

In [166]:
data = pd.read_csv('Bong-English-Sent-Pairs.txt', sep='\t', header=None)
data.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Go. | যাও। | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 1 | Go. | যান। | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 2 | Go. | যা। | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Run! | পালাও! | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |
| 4 | Run! | পালান! | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |


Column '0' corresponds to the English sentences and column '1' to the Bengali sentences. We do not need column 2, so we will drop it.

In [167]:
data.drop(2, axis=1, inplace=True)

In [168]:
data.columns = ['English_Lang','Bengali_Lang']
data.head()

|   | English_Lang | Bengali_Lang |
|---|---|---|
| 0 | Go. | যাও। |
| 1 | Go. | যান। |
| 2 | Go. | যা। |
| 3 | Run! | পালাও! |
| 4 | Run! | পালান! |


In [169]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4349 entries, 0 to 4348
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   English_Lang  4349 non-null   object
 1   Bengali_Lang  4349 non-null   object
dtypes: object(2)
memory usage: 68.1+ KB


In [170]:
data.isnull().sum()

English_Lang    0
Bengali_Lang    0
dtype: int64

Dataset has 4349 sentence pairs and there are no missing records. 

In [173]:
data_ = data.copy(deep=True)
data_.shape

(4349, 2)

Let's find the maximum length of both the Bengali and the English sentences, because we need these to determine the sequence length.

In [174]:
max_len_bong = data_.Bengali_Lang.apply(lambda x : len([w for w in x.split(' ')])).max()
print('Max Length of Bengali Phrase : ', max_len_bong)

Max Length of Bengali Phrase :  18


In [175]:
max_len_eng = data_.English_Lang.apply(lambda x : len([w for w in x.split(' ')])).max()
print('Max Length of English Phrase : ', max_len_eng)

Max Length of English Phrase :  19


In [176]:
X = data_.Bengali_Lang
Y = data_.English_Lang

In [177]:
bong_tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
bong_tokenizer.fit_on_texts(X)

eng_tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
eng_tokenizer.fit_on_texts(Y)

In [179]:
bong_vocb_size = len(bong_tokenizer.word_index) + 1
print('Bengali Vocabulary Length ', bong_vocb_size)

Bengali Vocabulary Length  3321


In [180]:
eng_vocab_size =len(eng_tokenizer.word_index) + 1
print('English Vocabulary Length ', eng_vocab_size)

English Vocabulary Length  1876


I am not going to perform NLP preprocessing such as stop word removal, stemming or lemmatization, because this is a language translation use case and each word in a sentence matters.
Next, let's split our data set into a training set and a testing set. Remember that X holds the Bengali sentences and Y the English sentences.

In [181]:
from sklearn.model_selection import train_test_split
x_train,x_test , y_train, y_test = train_test_split(X, Y , test_size=0.1, random_state=42)
print(x_train.shape , x_test.shape , y_train.shape, y_test.shape)

(3914,) (435,) (3914,) (435,)


I will now convert the text data into sequences of integers, where each integer is the index of a word in the tokenizer's vocabulary, because we cannot pass text directly to an RNN.
The Keras Tokenizer converts sentences into sequences of integers. Once the sequences are ready, I will use 'pad_sequences' to pad shorter sequences with 0 and trim longer ones. We need to convert both the input and the output sentences into their corresponding sequences.

In [183]:
# Bengali Sentences to Sequence
x_train_seq = bong_tokenizer.texts_to_sequences(x_train)
x_train_seq = tf.keras.preprocessing.sequence.pad_sequences(x_train_seq, maxlen= max_len_bong, padding='post', truncating='post')



# English Sentences to Sequence
y_train_seq = eng_tokenizer.texts_to_sequences(y_train)
y_train_seq = tf.keras.preprocessing.sequence.pad_sequences(y_train_seq, maxlen= max_len_eng, padding='post', truncating='post')


I have used lower=True in the Tokenizer class to lowercase the words in each sentence. In pad_sequences I have set maxlen to max_len_bong, which is 18, for the Bengali sentences and to max_len_eng, which is 19, for the English sentences. maxlen sets the maximum length of a sequence: any sequence shorter than maxlen is padded with 0, and any sequence longer than maxlen is trimmed.   
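
To make the padding and trimming rule concrete, here is a minimal pure-Python sketch of what padding='post' and truncating='post' do. The helper `pad_post` is hypothetical and only imitates that one behaviour; it is not the Keras implementation.

```python
def pad_post(sequences, maxlen):
    """Trim each sequence to maxlen items from the end, then pad with 0 at the end."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])                    # truncating='post': drop trailing items
        padded.append(trimmed + [0] * (maxlen - len(trimmed)))  # padding='post': zeros at the end
    return padded

print(pad_post([[4, 7], [3, 9, 2, 8, 5]], maxlen=3))
# [[4, 7, 0], [3, 9, 2]]
```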

The vocabulary size is the total number of unique words in the whole text corpus, plus one for the reserved index 0. The English sentences give a vocabulary size of 1876 and the Bengali sentences 3321. The Embedding layer needs this size to know how many words/items it has to generate embeddings for.
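
The "+ 1" in the vocabulary size can be seen in a small sketch. The helper `build_word_index` is a hypothetical stand-in that only lowercases and splits on whitespace; the Keras Tokenizer also filters punctuation by default, which is left out here.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is never assigned to a word; it is kept free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["go now", "run now", "go home now"]  # toy corpus
word_index = build_word_index(sentences)
vocab_size = len(word_index) + 1  # unique words plus the padding index 0
print(vocab_size)  # 5
```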

In [184]:
print(x_train_seq.shape, y_train_seq.shape)

(3914, 18) (3914, 19)


Now both the input and the output sequences are ready. The input sequences are sequences of integers corresponding to the words of each Bengali sentence, and the output sequences are sequences of integers corresponding to the words of each English sentence. We are ready to build the Encoder-Decoder, or Seq2Seq, model.

Encoder:
We first use an Embedding layer to generate a word embedding for each word in the input sentences.
These word embeddings are then sent to an LSTM, which keeps track of the state and returns a fixed-size vector of states after the last item in the sequence.

Decoder:
We send this fixed-size vector of states from the encoder into another LSTM, which is trained on the output sequences conditioned on the context passed from the encoder.

In [185]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim= bong_vocb_size, output_dim=512, input_length=max_len_bong))
model.add(tf.keras.layers.LSTM(units=512))
model.add(tf.keras.layers.RepeatVector(max_len_eng))
model.add(tf.keras.layers.LSTM(units=512, return_sequences=True))
model.add(tf.keras.layers.Dense(eng_vocab_size, activation='softmax'))

In the above code, the first step creates a Keras Sequential model.   

The first layer in the encoder component is an Embedding layer. It converts our words (referenced by integers in the data) into meaningful embedding vectors. The Embedding() layer takes the size of the vocabulary as its first argument, then the size of the resulting embedding vector as the next argument. Finally, because this layer is the first layer in the network, we specify the “length” of the input, i.e. the number of steps/words in each sample. In this example the embedding output has shape (batch_size, 18, 512).   

The next layer is the first of our two LSTM layers. To specify an LSTM layer, we first provide the number of units, i.e. the number of nodes in the hidden layers within the LSTM cell (the forget gate layer, the tanh squashing input layer and so on). The next argument to consider is return_sequences, which the code above leaves at its default value of False. With return_sequences=False the LSTM returns its output from the last time step only; with return_sequences=True it returns the outputs of the unrolled LSTM cell from all time steps. As we saw in the architecture diagram, only the output of the last time step is given to the decoder component, hence return_sequences=False here. <img src='Se2Seq-Return_Seq.png'>
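
In terms of array shapes, the difference between the two settings looks roughly like this. The sketch uses a random NumPy array as a stand-in for the per-time-step outputs of an LSTM; the unit count of 4 is an arbitrary toy value, not the 512 used in the model.

```python
import numpy as np

batch_size, timesteps, units = 2, 18, 4  # toy unit count for illustration

# Stand-in for the outputs an LSTM produces at every time step
# (what return_sequences=True would hand to the next layer).
all_outputs = np.random.default_rng(0).normal(size=(batch_size, timesteps, units))

# return_sequences=False keeps only the output of the last time step.
last_only = all_outputs[:, -1, :]

print(all_outputs.shape)  # (2, 18, 4)
print(last_only.shape)    # (2, 4)
```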

An unrolled LSTM has as many cells as time steps. The second LSTM has 19 cells because the English phrases have at most 19 time steps. Let's understand why we need RepeatVector. In the Encoder-Decoder model we pass the output of the last time step of the encoder LSTM layer to the LSTM layer in the decoder component. The output from the encoder needs to be passed to each and every time step of the decoder LSTM layer, but we get only a single vector from the encoder's LSTM layer. How do we pass this single output vector to every time step in the next LSTM layer? The answer is RepeatVector: we repeat the same output vector as many times as the next LSTM layer has time steps. In our use case the next LSTM layer (the decoder component) has 19 time steps, because it is trained on the output sequence, and the output sequence is an English sentence with a maximum length of 19. That is why we use RepeatVector(max_len_eng), with max_len_eng = 19, so that the output of the encoder's LSTM layer is passed to each of the 19 time steps in the next LSTM layer. The figure below makes this clearer. <img src='Repeat-Vector.png'>  
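
The repetition itself is a simple array operation. The following NumPy sketch imitates it with a hypothetical helper `repeat_vector`; the 3-dimensional context vectors are toy values standing in for the 512-dimensional encoder output.

```python
import numpy as np

def repeat_vector(x, n):
    """Turn (batch, features) into (batch, n, features) by copying each vector n times."""
    return np.repeat(x[:, np.newaxis, :], n, axis=1)

context = np.arange(6, dtype=float).reshape(2, 3)  # toy encoder output: 2 samples, 3 features
repeated = repeat_vector(context, 19)              # one copy per decoder time step
print(repeated.shape)  # (2, 19, 3)
```

Every one of the 19 time steps receives exactly the same vector, so the decoder LSTM sees the full encoder context at each step.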

The next layer is the LSTM layer of the decoder component, which is trained on the output sequence, i.e. the English sequence, given the output of the last time step of the encoder's LSTM layer. We are predicting the English words corresponding to the input Bengali words, so we need the output from every time step of this LSTM layer. Hence we have set return_sequences=True.

The next layer is a Dense (fully connected) layer whose size equals the English vocabulary size. It uses a softmax activation, which squashes each output into the range 0 to 1 so that the outputs sum to 1 and can be read as probabilities over the vocabulary.
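
As a rough illustration of what softmax does to the raw outputs of the Dense layer, here is a small NumPy version. The three-element `logits` array is a made-up example standing in for scores over a three-word vocabulary.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])  # toy scores for a 3-word vocabulary
probs = softmax(logits)
print(probs)
print(probs.argmax(axis=-1))  # [0]
```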

Next we compile the model with optimizer='rmsprop' and loss='sparse_categorical_crossentropy'. Our use case is to translate Bengali sentences into English sentences. This is basically a multiclass classification problem in which the model predicts, at each time step, a word from the target English vocabulary given the Bengali sequence as input. Because the targets are integer word indices rather than one-hot vectors, we use loss='sparse_categorical_crossentropy'.
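
The loss can be sketched in NumPy as the negative log of the probability assigned to the correct word at each time step. This is an illustrative re-implementation under toy inputs, not the Keras function; the probabilities and targets below are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """y_true: (batch, timesteps) integer ids; y_prob: (batch, timesteps, vocab) probabilities."""
    # Pick the predicted probability of the true word at every position.
    picked = np.take_along_axis(y_prob, y_true[..., np.newaxis], axis=-1)
    return -np.log(picked.squeeze(-1))

y_prob = np.array([[[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]]])  # one sentence, two time steps, 3-word vocabulary
y_true = np.array([[0, 1]])              # integer targets, no one-hot encoding needed
loss = sparse_categorical_crossentropy(y_true, y_prob)
print(loss)         # per-time-step loss
print(loss.mean())  # averaged loss
```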

In [186]:
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Let's train the model with the input sequences (x_train_seq) and the output sequences (y_train_seq). We will train for epochs=200 with batch_size=512 and validation_split=0.2.  

epochs is a hyperparameter that specifies how many times the entire dataset is passed forward and backward through the neural network. Here we want the entire dataset to be used 200 times for training the model.  

batch_size is the number of training samples processed before the weights are updated. Here batch_size=512 means that after every 512 samples in an epoch the model updates its weights through gradient descent.  

validation_split=0.2 means that 20% of the training samples are held out for validation and the remaining 80% are used for training the model.
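
With the numbers used in this notebook, these settings work out as follows. The arithmetic below is a sketch that assumes the split point is taken by truncating to an integer, which agrees with the sample counts printed by the training log.

```python
import math

n_samples = 3914         # rows in x_train_seq
validation_split = 0.2
batch_size = 512

n_train = int(n_samples * (1 - validation_split))   # samples used for weight updates
n_val = n_samples - n_train                         # samples held out for validation
updates_per_epoch = math.ceil(n_train / batch_size) # the last batch is smaller than 512

print(n_train, n_val, updates_per_epoch)  # 3131 783 7
```

So each of the 200 epochs performs 7 weight updates, for 1400 updates in total.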

In [187]:
y_train_seq = y_train_seq.reshape(y_train_seq.shape[0], y_train_seq.shape[1], 1)
y_train_seq.shape

(3914, 19, 1)

In [120]:
model.fit(x_train_seq, y_train_seq, epochs=200, batch_size=512, validation_split = 0.2, verbose=1)

Train on 3131 samples, validate on 783 samples
Epoch 1/200
...
Epoch 200/200


<tensorflow.python.keras.callbacks.History at 0x1d60b77bda0>

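Our Seq2Seq model has achieved an accuracy of 95%. Note that this accuracy is computed per time step and includes the padded positions, so it is higher than the share of fully correct sentences. We can increase it by training the model on more training samples.  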

Now we will use this trained encoder-decoder model to translate Bengali sentences from the testing data set into English. First we convert the test Bengali sentences into sequences of integers.

In [122]:
x_test_seq = bong_tokenizer.texts_to_sequences(x_test)
x_test_seq = tf.keras.preprocessing.sequence.pad_sequences(x_test_seq, maxlen=max_len_bong, padding='post', truncating='post')


In [125]:
x_test_seq

array([[  95,  100,   30, ...,    0,    0,    0],
       [  10,   31,   41, ...,    0,    0,    0],
       [  10,  970,   81, ...,    0,    0,    0],
       ...,
       [   1,  367, 2319, ...,    0,    0,    0],
       [  42,   37,  171, ...,    0,    0,    0],
       [   4, 1723,    0, ...,    0,    0,    0]])

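We will predict the output sequences by passing x_test_seq to the model. The model's predict method returns the probability of every class at each time step, so we take the argmax over the last axis to get the class with the highest probability. The original run used model.predict_classes for this, which did the same thing in one call but is no longer available in recent TensorFlow releases.

In [126]:
preds = np.argmax(model.predict(x_test_seq), axis=-1)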

In [128]:
preds

array([[  12,    5,   50, ...,    0,    0,    0],
       [  10,    3,   54, ...,    0,    0,    0],
       [  90,   16,  422, ...,    0,    0,    0],
       ...,
       [   1,   19,    7, ...,    0,    0,    0],
       [   2,  181,  155, ...,    0,    0,    0],
       [  20,  847, 1176, ...,    0,    0,    0]], dtype=int64)

Now we have the predicted output sequences. Let's get the corresponding sentence for each predicted sequence.

In [154]:
words = list(eng_tokenizer.word_index.keys())
predicted_value = [' '.join([words[y-1] for y in y_pred if y>0]) for y_pred in preds]
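
The same decoding can also be written with an explicit index-to-word mapping, which avoids relying on the order of the dictionary keys. The sketch below uses a toy `word_index`; with the real tokenizer, the Keras Tokenizer's `index_word` attribute provides this mapping directly. The helper name `decode` is hypothetical.

```python
def decode(sequence, index_word):
    """Join the words for all non-zero indices; 0 is padding and is skipped."""
    return ' '.join(index_word[i] for i in sequence if i > 0)

word_index = {'i': 1, 'you': 2, 'go': 3}             # toy vocabulary for illustration
index_word = {i: w for w, i in word_index.items()}   # invert word -> index
print(decode([1, 3, 0, 0], index_word))  # i go
```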


In [164]:
df = pd.DataFrame({
    'Input': x_test,
    'Actual': y_test,
    'Predicted': predicted_value
})

In [165]:
df.iloc[100:120,:]

|   | Input | Actual | Predicted |
|---|---|---|---|
| 1210 | আমি কখনই বলবো না। | I'll never tell. | i have learning french |
| 1200 | আমি আপনার সহায়তা করবো। | I'll assist you. | i'm your friend |
| 2247 | নিজেকে বিভ্রান্ত কোরো না। | Don't delude yourself. | don't delude yourself |
| 1257 | সে কি শিক্ষক? | Is he a teacher? | is he a teacher |
| 1075 | আপনারা কি ডাক্তার? | Are you doctors? | are you a doctor |
| 1116 | সবাই এটা দেখেছিলো। | Everyone saw it. | everybody saw it |
| 1052 | আমরা ডাক্তার। | We are doctors. | we're are |
| 3004 | তুই কি টমের উপর নজর রাখতে পারবি? | Can you keep an eye on Tom? | can you keep an eye on tom |
| 2924 | আমি এই ব্যাপারে খুব একটা ভালো নই। | I'm not very good at this. | i finally found my my well |
| 621 | টম আহত। | Tom is hurt. | tom's injured |


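We can see in the above dataframe that the model has done quite well at predicting the English sequence for several of the input Bengali sequences (for example 'don't delude yourself' and 'can you keep an eye on tom'), although some predictions, such as 'i have learning french' for 'I'll never tell.', are clearly wrong.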