# Encoder and Decoder Model

- This notebook trains the encoder-decoder model and saves the trained encoder and decoder.
- It also defines the functions that convert text sentences into numerical sequences. The conversion function `convert_data` is saved as a `.py` file.
- The next notebook, which handles prediction, uses that module (`NMT_convert_data`) to convert data into numerical sequences.

### Importing Libraries

In [36]:
import numpy as np
import os
import pandas as pd
import csv
import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
import json
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding , Bidirectional , LSTM , Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import plot_model
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector, TimeDistributed, Dense , Concatenate
import warnings
warnings.filterwarnings('ignore')

### Define path for both training data and testing data

In [37]:
train_data_path = r'C:\Users\ayush\jupyter Notebook\NMT\data\Training_data.csv'
test_data_path = r'C:\Users\ayush\jupyter Notebook\NMT\data\Testing_data.csv'

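## os.path.join with a single argument returns the path unchanged, so normalise the paths instead.
train_data_path = os.path.normpath(train_data_path)
test_data_path = os.path.normpath(test_data_path)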

## Loading the training data

In [38]:
## reading csv file
training_data = pd.read_csv(train_data_path , encoding = 'UTF-8')
training_data.sample(5)

| Unnamed: 0 | Source | Target |
|---|---|---|
| 145803 | qui va là | `<start>who goes there<end>` |
| 54895 | me connaissezvous le moins du monde | `<start>do not you know me at all<end>` |
| 31284 | je reviens bientôt | `<start>i will be back really soon<end>` |
| 153756 | jadore jouer du chopin | `<start>i love playing chopin<end>` |
| 134437 | tu ne te rappelles pas je ne loublierai jamais | `<start>you cannot remember it and i will never...` |


### Preprocessing the data for the encoder-decoder model
- The model requires its input in the form of numerical sequences.
- The data is text, so each SEQUENCE OF TEXT has to be converted to a NUMERICAL SEQUENCE.
- In this project the conversion is done word by word: each word is mapped to an integer.
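The word-wise conversion described above can be illustrated without any library. The following is only a minimal sketch of the idea, not the Keras `Tokenizer` used in this notebook: the helper names `build_word_index` and `to_sequence` are hypothetical, index 0 is left free for padding, and index 1 is assumed to be the out-of-vocabulary token.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<UKN>"):
    """Map each word to an integer; 0 is left free for padding, 1 is the OOV token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    # Most frequent words get the smallest indices (ties keep first-seen order).
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(sentence, word_index, oov_token="<UKN>"):
    """Replace every word by its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [word_index.get(word, oov) for word in sentence.lower().split()]

corpus = ["je reviens bientôt", "je reviens"]
word_index = build_word_index(corpus)
# "demain" was never seen, so it is mapped to the OOV index.
seq = to_sequence("je reviens demain", word_index)
print(word_index)
print(seq)
```

Unknown words all collapse onto the same index, which is why many 1s can appear in a converted sequence when the vocabulary is capped.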

#### Defining functions that convert text sequences to numerical sequences

In [39]:
def get_numeric_sequences(text_sequences) :
    ## num_words -> only the most frequent words (indices below num_words) are kept when converting.
    num_words = 1500
    ## initializing the tokenizer -> this maps each word to a number.
    ## oov_token -> this replaces any word that is out of vocabulary.
    token = Tokenizer(num_words=num_words, oov_token="<UKN>")
    ## fit the tokenizer -> this builds word_index, the dictionary which maps every word to a numeric value.
    token.fit_on_texts(text_sequences)
    ## limiting the dictionary to indices up to num_words (word_index itself keeps every word seen).
    word_indices = {word: index for word, index in token.word_index.items() if index <= num_words}
    ## converting the text sequences to numerical sequences.
    num_sequences = token.texts_to_sequences(text_sequences)
    ## define vocabulary size -> this is the size of the limited dictionary,
    ## plus 1 because word indices start at 1 and index 0 is reserved for padding.
    vocab_size = len(word_indices) + 1
    
    return num_sequences, token, vocab_size , word_indices

In [40]:
def get_pad_sequeces(source_num_sequences, target_num_sequences, max_common_length = None) :
    
    '''This function returns padded sequences and the maximum common length.'''
    
    ## use the supplied max_common_length if there is one (e.g. at prediction time).
    if max_common_length is not None :
        COMMON_MAX_LENGTH = max_common_length
    else :
        ## finding out the maximum length of the source and target sequences.
        source_max_len = max(len(seq) for seq in source_num_sequences)
        target_max_len = max(len(seq) for seq in target_num_sequences)
        ## the common maximum length is the larger of the two.
        COMMON_MAX_LENGTH = max(source_max_len, target_max_len)
    ## pad the sequences with zeros at the end (padding='post').
    source_padded_sequences = pad_sequences(source_num_sequences, maxlen = COMMON_MAX_LENGTH, padding='post')
    target_padded_sequences = pad_sequences(target_num_sequences, maxlen = COMMON_MAX_LENGTH, padding='post')
    
    return source_padded_sequences, target_padded_sequences, COMMON_MAX_LENGTH
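To make the padding step concrete, here is a small pure-Python sketch of what post-padding to a common length does. The helper `pad_post` is hypothetical and only imitates `pad_sequences(..., padding='post')` with its default `truncating='pre'`, which drops tokens from the front of a sequence that is longer than `maxlen`. The toy ids are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` items when the sequence is too long.
        kept = list(seq)[-maxlen:]
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

source = [[11, 17, 1], [44, 2]]
target = [[2, 13, 143, 9, 3]]

# The common length is the longest sequence on either side.
common_max_length = max(max(map(len, source)), max(map(len, target)))

source_padded = pad_post(source, common_max_length)
target_padded = pad_post(target, common_max_length)
# A sequence longer than maxlen loses its first tokens.
short = pad_post([[1, 2, 3, 4]], 2)
print(source_padded, target_padded, short)
```

This is worth keeping in mind when a stored maximum length is reused later: a longer sentence would silently lose its first words.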

In [41]:
def convert_data(source, target, max_common_length = None) :
    
    '''This function returns the complete converted set.'''
    
    ## get numerical sequences.
    source_num_sequences, source_token, source_vocab_size , word_indices_source = get_numeric_sequences(source)
    target_num_sequences, target_token, target_vocab_size , word_indices_target = get_numeric_sequences(target)
    ## get padded sequences.
    source_padded_sequences, target_padded_sequences, COMMON_MAX_LENGTH = get_pad_sequeces(source_num_sequences, target_num_sequences, max_common_length = max_common_length)
    
    return source_padded_sequences, target_padded_sequences, COMMON_MAX_LENGTH, source_vocab_size, target_vocab_size , word_indices_source , word_indices_target

In [42]:
source = training_data.Source
target = training_data.Target

In [43]:
source.shape , target.shape

((172352,), (172352,))

In [44]:
source_sequences, target_sequences, COMMON_MAX_LENGTH, source_vocab_size, target_vocab_size , word_indices_source , word_indices_target = convert_data(source, target)

In [45]:
source_sequences

array([[  11,   17,    1, ...,    0,    0,    0],
       [  44,    2,    1, ...,    0,    0,    0],
       [ 900,    3,   27, ...,    0,    0,    0],
       ...,
       [   1,    7, 1490, ...,    0,    0,    0],
       [  19,    1,    1, ...,    0,    0,    0],
       [  47,  700,    1, ...,    0,    0,    0]])

In [46]:
target_sequences

array([[   2,   13,  143, ...,    0,    0,    0],
       [   2,   69,    4, ...,    0,    0,    0],
       [   2,  190,    6, ...,    0,    0,    0],
       ...,
       [   2,  286,  103, ...,    0,    0,    0],
       [   2,    7, 1395, ...,    0,    0,    0],
       [   2,   12,  530, ...,    0,    0,    0]])

## Building Encoder - Decoder Model

### Defining Encoder

In [47]:
## 1st layer is the input layer.
encoder_input = Input(shape=(None,))
## 2nd layer is the embedding layer; mask_zero=True makes later layers ignore the zero padding.
encoder_embd = Embedding(source_vocab_size, 100, mask_zero=True)(encoder_input)
## 3rd layer is the bidirectional LSTM layer.
## The bidirectional wrapper is added because it captures sequence information from both past and future.
encoder_lstm = Bidirectional(LSTM(32, return_state=True))
## getting the output and the forward/backward states from the encoder.
encoder_output, forw_state_h, forw_state_c, back_state_h, back_state_c = encoder_lstm(encoder_embd)
## concatenating the forward and backward states (32 + 32 = 64 units each).
state_h_final = Concatenate()([forw_state_h, back_state_h])
state_c_final = Concatenate()([forw_state_c, back_state_c])

In [48]:
## Now take only states and create context vector
encoder_states= [state_h_final, state_c_final]

### Defining Decoder

In [49]:
decoder_input = Input(shape=(None,))
# The +1 for zero padding is already included in the target (English) vocab size
decoder_embd = Embedding(target_vocab_size, 100, mask_zero=True)
decoder_embedding = decoder_embd(decoder_input)
# The encoder is bidirectional, so this LSTM needs double the units (2 x 32 = 64) to accept its states
decoder_lstm = LSTM(64, return_state=True, return_sequences=True)
# Only the decoder outputs are needed for training; its own states are discarded
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
# A dense layer with softmax turns each output step into
# a probability distribution over the target vocabulary
decoder_dense = Dense(target_vocab_size, activation='softmax')
# Apply it to every time step; at prediction time each predicted word is fed back into the decoder to predict the next word
decoder_outputs = decoder_dense(decoder_outputs)

In [50]:
model = Model([encoder_input, decoder_input], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [51]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_5 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 embedding_4 (Embedding)     (None, None, 100)            150100    ['input_5[0][0]']             
                                                                                                  
 input_6 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 bidirectional_2 (Bidirecti  [(None, 64),                 34048     ['embedding_4[0][0]']         
 onal)                        (None, 32),                                                   
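The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding layer and an LSTM layer. The vocabulary size of 1501 is an assumption inferred from 150100 / 100, and the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One trainable vector of size embedding_dim per vocabulary index."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

# Encoder embedding: assumed vocabulary of 1501 indices, 100 dimensions.
encoder_embedding = embedding_params(1501, 100)
# Bidirectional LSTM: two independent LSTMs with 32 units on 100-dim inputs.
encoder_bilstm = 2 * lstm_params(100, 32)

print(encoder_embedding)  # 150100, as listed for embedding_4
print(encoder_bilstm)     # 34048, as listed for bidirectional_2
```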

In [52]:
plot_model(model , show_shapes = True)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


In [53]:
encoder_input_data = source_sequences
decoder_input_data = target_sequences[:,:-1]
decoder_target_data = target_sequences[:,1:]
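The two slices above shift the target by one position, so that at every step the decoder sees the previous word and has to predict the next one (teacher forcing). A toy illustration, with made-up ids where 2 stands for the start token, 3 for the end token and 0 for padding:

```python
import numpy as np

# Toy padded target batch; the ids are illustrative, not taken from the real data.
target_batch = np.array([
    [2, 13, 143, 3, 0],
    [2, 69, 3, 0, 0],
])

# Decoder input: everything except the last position.
decoder_in = target_batch[:, :-1]
# Decoder target: everything except the first position (the start token).
decoder_out = target_batch[:, 1:]

print(decoder_in)
print(decoder_out)
```

Position `t` of `decoder_out` is the word that follows position `t` of `decoder_in`, which is what the softmax output is trained against.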

In [None]:
model.fit([encoder_input_data , decoder_input_data] , decoder_target_data , epochs = 15)

Epoch 1/15


Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15

In [None]:
model.save(r'C:\Users\ayush\jupyter Notebook\NMT\code_files\model.keras')

In [None]:
Common_length_path = r'C:\Users\ayush\jupyter Notebook\NMT\code_files\MAXIMUM_COMMON_LENGTH.txt'
with open(Common_length_path, 'w') as file :
    file.write(str(COMMON_MAX_LENGTH))

In [None]:
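## inference encoder -> maps a source sequence to the context vector (final states).
encoder_model = Model(encoder_input, encoder_states)

## inference decoder -> takes the previous states as inputs (64 units = 2 x 32 from the bidirectional encoder).
decoder_state_input_h = Input(shape=(64,))
decoder_state_input_c = Input(shape=(64,))
decoder_states_input = [decoder_state_input_h, decoder_state_input_c]

## reuse the trained embedding, LSTM and dense layers.
dec_embd2 = decoder_embd(decoder_input)
decoder_output2, state_h2, state_c2 = decoder_lstm(dec_embd2, initial_state=decoder_states_input)
decoder_states2 = [state_h2, state_c2]
decoder_output2 = decoder_dense(decoder_output2)

## the decoder model returns the word probabilities together with the updated states.
decoder_model = Model(
                      [decoder_input] + decoder_states_input,
                      [decoder_output2] + decoder_states2)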
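One way such a pair of inference models is typically used is a greedy decoding loop: encode the source once, then repeatedly feed the last predicted word and the current states back into the decoder. The sketch below is an assumption about how the next notebook could work, not its actual code. `greedy_decode`, `fake_encode`, `fake_decode_step` and the ids `start_id` / `end_id` are all hypothetical, and the two fake functions only stand in for calls to the trained models.

```python
import numpy as np

def greedy_decode(encode, decode_step, source_seq, start_id, end_id, max_len):
    """Translate one source sequence token by token.

    encode(source_seq)       -> (h, c), the initial decoder states
    decode_step(token, h, c) -> (probabilities over the target vocab, h, c)
    """
    h, c = encode(source_seq)
    token = start_id
    output_ids = []
    for _ in range(max_len):
        probs, h, c = decode_step(token, h, c)
        # Greedy choice: take the most probable next word.
        token = int(np.argmax(probs))
        if token == end_id:
            break
        output_ids.append(token)
    return output_ids

# Hypothetical stand-ins for the trained encoder and decoder models.
def fake_encode(source_seq):
    return np.zeros(64), np.zeros(64)

def fake_decode_step(token, h, c):
    probs = np.zeros(6)
    probs[min(token + 1, 5)] = 1.0  # always predicts the next id, capped at 5
    return probs, h, c

result = greedy_decode(fake_encode, fake_decode_step, [11, 17, 1],
                       start_id=2, end_id=5, max_len=10)
print(result)
```

With the real models, `encode` and `decode_step` would presumably wrap the `predict` calls of `encoder_model` and `decoder_model`. The `max_len` bound keeps the loop from running forever if the end token is never predicted.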

## Saving Encoder and Decoder model 

- The encoder and decoder models are used in the next notebook, which predicts (translates) sentences.
- Similarly, the source and target word-index dictionaries are saved as JSON files.

In [23]:
## saving the encoder model.
encoder_model.save(r'C:\Users\ayush\jupyter Notebook\NMT\code_files\encoder_model.keras')
## saving the decoder model.
decoder_model.save(r'C:\Users\ayush\jupyter Notebook\NMT\code_files\decoder_model.keras')

In [24]:
word_index_source_path = r'C:\Users\ayush\jupyter Notebook\NMT\code_files\source_word_indices.json'
with open(word_index_source_path, 'w') as file :
    json.dump(word_indices_source, file)

In [25]:
word_index_target_path = r'C:\Users\ayush\jupyter Notebook\NMT\code_files\target_word_indices.json'
with open(word_index_target_path, 'w') as file :
    json.dump(word_indices_target, file)
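At prediction time the saved dictionaries have to be read back, and the target dictionary usually needs to be inverted so that predicted indices can be turned into words again. A minimal round-trip sketch, using a temporary directory and toy data instead of the real paths and dictionaries:

```python
import json
import os
import tempfile

# Toy word-to-index dictionary; the real one comes from the tokenizer.
word_indices = {"<UKN>": 1, "start": 2, "end": 3, "i": 4}

path = os.path.join(tempfile.mkdtemp(), "target_word_indices.json")
with open(path, "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps accented words readable in the file.
    json.dump(word_indices, file, ensure_ascii=False)

with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

# Invert the mapping to look up a word from a predicted index.
index_to_word = {index: word for word, index in loaded.items()}
print(index_to_word[2])
```

Because the words are the JSON keys and the indices are the values, the indices come back as integers and no key conversion is needed after loading.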