# <center> AI Chatbot Tensorflow Seq2seq Model</center>
### <center> Deep Learning Project_Global IA</center> 
<b>Presented by</b>								
Abonia Sojasingarayar						
M2-Artificial Intelligence-IA school		
<b>Guided  by</b>								
Yacine Aslima
Prof. AI-IA School

## Credits and Motivation

Siraj Raval - Founder of School of AI 

Andrew NG - Founder and CEO of Landing AI & Founder of deeplearning.ai

Tensorflow Community

Cornell University - For dataset



# Fundamentals:
A <b>sequence</b> is an ordered list of symbols. For example:

- A sequence of webpages visited by a user, ordered by the time of access.
- A sequence of words or characters typed on a cellphone by a user, or in a text such as a book.
- A sequence of products bought by a customer in a retail store.
- A sequence of proteins in bioinformatics.
- A sequence of symptoms observed on a patient at a hospital.

Note: we consider a sequence to be a list of symbols that does not contain numeric values. A sequence of numeric values is usually called a time series rather than a sequence, and the task of predicting a time series is called time-series forecasting.

The task of <b>sequence prediction</b> consists of predicting the next symbol of a sequence based on the previously observed symbols. For example, if a user has visited some webpages A, B, C, in that order, one may want to predict what is the next webpage that will be visited by that user to prefetch the webpage.
![title](DocImg/prediction.png)
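
As a minimal sketch of the task, the next symbol can be predicted from transition counts. This is a simple frequency baseline, not the RNN approach used later in this notebook; the function names and the toy browsing histories are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Count how often each symbol is followed by each other symbol."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, history):
    """Predict the most frequent successor of the last observed symbol."""
    last = history[-1]
    if last not in transitions:
        return None  # symbol never seen with a successor
    return transitions[last].most_common(1)[0][0]

# Hypothetical browsing histories: pages visited in order.
histories = [["A", "B", "C", "D"], ["A", "B", "C", "D"], ["B", "C", "E"]]
model = fit_transitions(histories)
print(predict_next(model, ["A", "B", "C"]))  # D
```

A baseline like this only looks at the last symbol; the recurrent models described next are meant to use the whole history.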


## RNN:

Recurrent Neural Networks, or RNNs, were designed to work with sequence prediction problems. 


Recurrent means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the preceding elements.
![title](DocImg/rnn.png)

![title](DocImg/rnns.jpg)
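
The recurrence can be sketched with a single scalar hidden state. This is only an illustration under assumed, untrained weights (`w_x`, `w_h`, `b` are hypothetical values), not the LSTM cell used later in this notebook.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)  # the new state depends on the old state
        states.append(h)
    return states

# Only the first input is non-zero, yet later states stay non-zero:
# the hidden state carries a memory of the earlier element.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```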


## Types of RNN:

- One-to-Many: An observation as input mapped to a sequence with multiple steps as an output.
- Many-to-One: A sequence of multiple steps as input mapped to a class or quantity prediction.
- Many-to-Many: A sequence of multiple steps as input mapped to a sequence with multiple steps as output.

![title](DocImg/rnntype.jpg)

### LSTM:
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence (long-term dependencies) in sequence prediction problems.
This behavior is required in complex problem domains such as machine translation and speech recognition.


![title](DocImg/lstm.png)

<i>Image Source: https://colah.github.io</i>


## Seq2Seq LSTMs or RNN Encoder-Decoders:

An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn in used as the initial hidden state of a “decoder” RNN that generates the target sentence. Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN). … it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences!

<i>Source: — Oriol Vinyals, et al., Show and Tell: A Neural Image Caption Generator, 2014</i>

… an RNN Encoder–Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence.

<i>Source: — Kyunghyun Cho, et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014</i>

![title](DocImg/encoderdecoder.png)

In [None]:
#Import packages
import warnings
with warnings.catch_warnings():
    # Filters set here are only active inside this block, so TensorFlow is imported within it
    warnings.filterwarnings("ignore",category=FutureWarning)
    import tensorflow as tf
import numpy as np
import re
import math
from tqdm import tqdm
import matplotlib.pyplot as plt

In [None]:
%load_ext tensorboard

In [None]:
tf.test.gpu_device_name()

## Dataset : Cornell Movie Dialogs Corpus
Description:
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts.

Links to download the dataset:

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

https://www.kaggle.com/rajathmc/cornell-moviedialog-corpus

220,579 conversational exchanges between 10,292 pairs of movie characters

Involves 9,035 characters from 617 movies

In total 304,713 utterances

Movie metadata included:
    
    Genres
    Release year
    IMDB rating
    Number of IMDB votes

Character metadata included:
    
    Gender (for 3,774 characters)
    Position on movie credits (3,321 characters)

README.txt (included) for details


In [None]:
#!wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
#!unzip cornell_movie_dialogs_corpus.zip

# Preprocessing or Data_Utils
<b>Step 1:</b>
Read from 'movie_conversations.txt'
Create a list of [list of line_id's]

Output Ex:['L194', 'L195', 'L196', 'L197']
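
Step 1 can be sketched on a single row as follows. The sample row is assumed to follow the field layout described in the corpus README (fields separated by ` +++$+++ `, with the list of line ids last); `parse_conversation` is a hypothetical helper name.

```python
SEPARATOR = " +++$+++ "

def parse_conversation(raw_line):
    """Return the list of line ids stored in the last field of a conversation row."""
    id_field = raw_line.split(SEPARATOR)[-1]   # "['L194', 'L195', ...]"
    id_field = id_field.strip()[1:-1]          # drop the surrounding brackets
    return [item.strip().strip("'") for item in id_field.split(",")]

# Assumed sample row in the movie_conversations.txt layout
row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"
print(parse_conversation(row))  # ['L194', 'L195', 'L196', 'L197']
```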

<b>Step 2:</b>
Read from 'movie_lines.txt'
Create a dictionary with ( key = line_id, value = text )

Ex:
They do not!

They do to!

I hope so.

She okay?

Let's go.

Wow

Okay -- you're gonna need to learn how to lie.

No

I'm kidding.  You know how sometimes you just become this "persona"?  And you don't know how to quit?

Like my fear of wearing pastels?

<b>Step 3:</b>
Get lists of all conversations as Questions and Answers
 [questions]
 [answers]   

Questions and answers come from the same conversation: each line is treated as a question, and the line that follows it is its answer.

Ex: For our first conversation

Q Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.

A Well, I thought we'd start with pronunciation, if that's okay with you.

Q Well, I thought we'd start with pronunciation, if that's okay with you.

A Not the hacking and gagging and spitting part.  Please.

Q Not the hacking and gagging and spitting part.  Please.

A Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?


<b>Step 4:</b>
Clean Text:
Convert the text to lowercase.
Replace contractions as follows:

Ex:   
    
    text = re.sub(r"i'm", "i am", text)
    
    text = re.sub(r"he's", "he is", text)
    
    text = re.sub(r"she's", "she is", text)
    
    text = re.sub(r"it's", "it is", text)
    
    text = re.sub(r"that's", "that is", text)
    
    text = re.sub(r"\'ll", " will", text)
    
    text = re.sub(r"\'ve", " have", text)
    
    text = re.sub(r"\'re", " are", text)
    
    text = re.sub(r"can't", "cannot", text)

<b>Step 5:</b>

Filter out the questions and answers that are too short or too long.

The minimum and maximum lengths are 2 and 5 words.

<b>Step 6:</b>

Count each word in the filtered questions and answers and store the counts in the vocab dictionary.

Do the same separately for the questions and for the answers, in the question and answer vocab dictionaries.

<b>Step 7:</b>

Create the vocabulary index from the words that appear at least 2 times in the vocab dictionary (the code keeps words whose count is >= 2).
	
    6281 words appear at least 2 times

<b>Step 8:</b>

For each code (<EOS>, <PAD>, <UNK>, <GO>), add an entry to the vocabulary index with a new index after the existing words.
Do the same for the question and answer vocabs.
	
    Now vocab index will be 6285

<b>Step 9:</b>

Create the index vocabulary (index to word) by inverting the vocabulary index dictionary (word to index)
	
    index vocabulary dict_items([(0, 'what'), (1, 'good'), (2, 'stuff'), (3, 'she'), (4, 'okay'), (5, 'they'),......
	......., (6283, '<PAD>'), (6284, '<EOS>'), (6285, '<UNK>'), (6286, '<GO>')])
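
Steps 6 to 9 can be sketched on a toy corpus as follows. This is a simplified illustration, not the notebook's exact code: `build_vocab` is a hypothetical helper, the sentences are made up, and the ids here are contiguous, whereas the notebook assigns `len(vocabs_to_index)+1` to each code.

```python
from collections import Counter

def build_vocab(sentences, threshold=2, codes=("<PAD>", "<EOS>", "<UNK>", "<GO>")):
    # Step 6: count every word
    counts = Counter(word for s in sentences for word in s.split())
    # Step 7: keep words seen at least `threshold` times
    word_to_index = {}
    for word, count in counts.items():
        if count >= threshold:
            word_to_index[word] = len(word_to_index)
    # Step 8: append the special codes
    for code in codes:
        word_to_index[code] = len(word_to_index)
    # Step 9: invert the mapping
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

sentences = ["the real you", "the real deal", "you okay"]
word_to_index, index_to_word = build_vocab(sentences)
print(word_to_index)
# {'the': 0, 'real': 1, 'you': 2, '<PAD>': 3, '<EOS>': 4, '<UNK>': 5, '<GO>': 6}
```

Words below the threshold ("deal", "okay") are left out and will later map to the unknown-word code.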

<b>Step 10:</b>

Add EOS tag at the end of each answer 	
	Ex: the real you   -->   the real you <EOS>

<b>Step 11:</b>

Convert each filtered question to a list of integer ids using the vocabulary index; words that are not in the vocabulary index are replaced by the id of <UNK>.
Do the same for the filtered answers.
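
Steps 10 and 11 can be sketched together as follows. The toy vocabulary and the `encode` helper are assumptions for illustration; the real ids come from the vocabulary index built above.

```python
def encode(sentence, word_to_index, add_eos=False):
    """Map words to ids, falling back to <UNK> for out-of-vocabulary words."""
    unk = word_to_index["<UNK>"]
    ids = [word_to_index.get(word, unk) for word in sentence.split()]
    if add_eos:
        ids.append(word_to_index["<EOS>"])  # mark the end of an answer
    return ids

# Hypothetical toy vocabulary index
word_to_index = {"hi": 0, "daddy": 1, "<PAD>": 2, "<EOS>": 3, "<UNK>": 4, "<GO>": 5}
print(encode("nowhere hi daddy", word_to_index, add_eos=True))  # [4, 0, 1, 3]
```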


In [None]:
#get the conversation and movie data
movie_line = "../Datasets/cornell movie-dialogs corpus/movie_lines.txt"
movie_convo = "../Datasets/cornell movie-dialogs corpus/movie_conversations.txt"

# Read both files and close them once their content is loaded
with open(movie_line, encoding='utf-8', errors='ignore') as f:
    m_lines = f.read().split('\n')
with open(movie_convo, encoding='utf-8', errors='ignore') as f:
    c_lines = f.read().split('\n')

#get conversation lines
convo_line = []
for lines in c_lines:
    _lines = lines.split(" +++$+++ ")[-1][1:-1].replace("'","").replace(" ","")
    convo_line.append(_lines.split(","))

#get movie lines
id_line = {}
for lines in m_lines:
    _lines = lines.split(" +++$+++ ")
    if len(_lines) == 5:
        id_line[_lines[0]] = _lines[4]
        
#Form questions and answers 
questions = []
answers = []

for line in convo_line:
    for i in range(len(line) -1):
        questions.append(id_line[line[i]])
        answers.append(id_line[line[i+1]])
        
#Clean and replace improper words using regular expression
def clean_text(text):
    text = text.lower()
    
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    text = re.sub(r"  "," ",text)  # collapse double spaces without joining the neighbouring words
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    
    return text

clean_questions = []
clean_answers = []

for q in questions:
    clean_questions.append(clean_text(q))
for a in answers:
    clean_answers.append(clean_text(a))
    
#minimum and maximum sentence length (in words) to keep
max_length = 5
min_length = 2

codes = ['<PAD>','<EOS>','<UNK>','<GO>']



# Filter out the questions that are too short/long
short_questions_temp = []
short_answers_temp = []

i = 0
for question in clean_questions:
    if len(question.split()) >= min_length and len(question.split()) <= max_length:
        short_questions_temp.append(question)
        short_answers_temp.append(clean_answers[i])
    i += 1

# Filter out the answers that are too short/long
shorted_q = []
shorted_a = []

i = 0
for answer in short_answers_temp:
    if len(answer.split()) >= min_length and len(answer.split()) <= max_length:
        shorted_a.append(answer)
        shorted_q.append(short_questions_temp[i])
    i += 1
   
  

#Get the count of words from filtered questions and answers  
vocab = {}

for question in shorted_q:
    for words in question.split():
        if words not in vocab:
            vocab[words] = 1
        else:
            vocab[words] +=1
for answer in shorted_a:
    for words in answer.split():
        if words not in vocab:
            vocab[words] = 1
        else:
            vocab[words] +=1
            
questions_vocabs = {}
for question in shorted_q:
    for words in question.split():
        if words not in questions_vocabs:
            questions_vocabs[words] = 1
        else:
            questions_vocabs[words] +=1
            
answers_vocabs = {}
for answer in shorted_a:
    for words in answer.split():
        if words not in answers_vocabs:
            answers_vocabs[words] = 1
        else:
            answers_vocabs[words] +=1
            
#index the words that appear at least 2 times
vocabs_to_index = {}
threshold = 2
word_num = 0
for word, count in vocab.items():
    if count >= threshold:
        vocabs_to_index[word] = word_num
        word_num += 1

#add the special codes to the vocabulary index, each with a new index after the existing entries
#same for the question and answer vocabs; 6281 in vocab dict and now 6286
for code in codes:
    vocabs_to_index[code] = len(vocabs_to_index)+1
    
for code in codes:
    questions_vocabs[code] = len(questions_vocabs)+1

for code in codes:
    answers_vocabs[code] = len(answers_vocabs)+1

#Invert the vocabulary index (word -> index) to get the index vocabulary (index -> word)
index_to_vocabs = {v_i: v for v, v_i in vocabs_to_index.items()}

#Add <EOS> to the end of every answer so that the model can learn where a sentence ends
for i in range(len(shorted_a)):
    shorted_a[i] += ' <EOS>'
  
#Convert the questions and answers to integer ids, using <UNK> for words that are not in the vocabulary index
#ex: 'nowhere hi daddy <EOS>' becomes [6285, 179, 22, 6284] because 'nowhere' is not in the vocabulary index dictionary

questions_int = []
for question in shorted_q:
    ints = []
    for word in question.split():
        if word not in vocabs_to_index:
            ints.append(vocabs_to_index['<UNK>'])
        else:
            ints.append(vocabs_to_index[word])
    questions_int.append(ints)
    
answers_int = []
for answer in shorted_a:
    ints = []
    for word in answer.split():
        if word not in vocabs_to_index:
            ints.append(vocabs_to_index['<UNK>'])
        else:
            ints.append(vocabs_to_index[word])
    answers_int.append(ints)


In [None]:
for code in codes:
  print(vocabs_to_index[code])

## Configuration:
<b>source_vocab_size</b> is the size of the questions vocabulary dictionary. In our case: 9611

<b>target_vocab_size</b> is the size of the answers vocabulary dictionary. In our case: 9636

<b>vocab_size</b> is the length of the vocabulary index dictionary plus one (as computed in the configuration code). In our case the dictionary length is 6286

The <b>learning rate</b> is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

<b>Learning rate decay:</b> An alternative to using a fixed learning rate is to instead vary the learning rate over the training process.
The way in which the learning rate changes over time (training epochs) is referred to as the learning rate schedule or learning rate decay.
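
One possible schedule is a multiplicative decay with a lower bound. The sketch below assumes the rate is multiplied by the decay factor once per epoch and never drops below a minimum; the values 0.001, 0.99 and 0.0001 mirror the configuration in this notebook, but how the training loop actually applies them is an assumption, and `decayed_learning_rate` is a hypothetical helper.

```python
def decayed_learning_rate(initial_lr, decay, min_lr, epoch):
    """Exponential decay per epoch, clipped at a minimum learning rate."""
    return max(min_lr, initial_lr * decay ** epoch)

for epoch in (0, 1, 10, 500):
    print(epoch, decayed_learning_rate(0.001, 0.99, 0.0001, epoch))
```

With these values the rate shrinks by 1% per epoch until it reaches the floor, after which it stays constant.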

The <b>keep_prob</b> value controls the dropout used when training the network: it is the probability that a unit's output is kept, so the dropout rate is 1 - keep_prob. In this notebook it is passed to DropoutWrapper as output_keep_prob, so with a value of 0.5 each LSTM cell output is kept with probability 0.5 during training. This reduces overfitting.

The <b>batch size</b> is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

The number of <b>epochs</b> is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

<b>Working Example:</b> Assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs.

This means that the dataset will be divided into 40 batches, each with five samples. The model weights will be updated after each batch of five samples.

This also means that one epoch will involve 40 batches or 40 updates to the model.

With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.
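
The arithmetic of the working example can be checked with a few lines. `training_schedule` is a hypothetical helper; it rounds up so that a final partial batch would still count as one update.

```python
import math

def training_schedule(num_samples, batch_size, epochs):
    """Return (batches per epoch, total batches over all epochs)."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs

print(training_schedule(200, 5, 1000))  # (40, 40000)
```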


In [None]:
target_vocab_size = len(answers_vocabs)
source_vocab_size = len(questions_vocabs)
vocab_size = len(index_to_vocabs)+1
embed_size = 1024
rnn_size = 1024
batch_size = 32
num_layers =  3
learning_rate = 0.001
learning_rate_decay = 0.99
min_lr = 0.0001
#keep_prob = 0.5
epochs=50
DISPLAY_STEP=30

### LSTM and DropoutWrapper

<b>class LSTMCell:</b> Long short-term memory unit (LSTM) recurrent network cell.

rnn_size: int, The number of units in the LSTM cell.

reuse: Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.


<b>class DropoutWrapper:</b> Operator adding dropout to inputs and outputs of the given cell.

cell: an RNNCell, a projection to output_size is added to it.

input_keep_prob: unit Tensor or float between 0 and 1, input keep probability; if it is constant and 1, no input dropout will be added.

output_keep_prob: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added.

In [None]:
def lstm(rnn_size, keep_prob,reuse=False):
    # LSTM cell with dropout applied to its outputs
    cell = tf.nn.rnn_cell.LSTMCell(rnn_size,reuse=reuse)
    drop = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
    return drop

# Attention Mechanism:
A neural network can be seen as an effort to mimic human brain actions in a simplified manner. The attention mechanism is likewise an attempt to implement, in deep neural networks, the action of selectively concentrating on a few relevant parts of the input while ignoring others when producing output.

### Bahdanau Attention

Bahdanau et al (2015) came up with a simple but elegant idea where they suggested that not only can all the input words be taken into account in the context vector, but relative importance should also be given to each one of them.

![title](DocImg/Battention.jpg)

<center>Overall process for Bahdanau Attention seq2seq model</center>


The first type of Attention, commonly referred to as Additive Attention, came from a paper by Dzmitry Bahdanau, which explains the less-descriptive original name. The paper aimed to improve the sequence-to-sequence model in machine translation by aligning the decoder with the relevant input sentences and implementing Attention. The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

1. Producing the Encoder Hidden States - the encoder produces a hidden state for each element in the input sequence

2. Calculating Alignment Scores - alignment scores between the previous decoder hidden state and each of the encoder’s hidden states are calculated (Note: the last encoder hidden state can be used as the first hidden state in the decoder)

3. Softmaxing the Alignment Scores - the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed

4. Calculating the Context Vector - the encoder hidden states and their respective alignment scores are multiplied to form the context vector

5. Decoding the Output - the context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output

6. The process (steps 2-5) repeats itself for each time step of the decoder until an end-of-sequence token is produced or the output is past the specified maximum length

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. https://arxiv.org/abs/1409.0473
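
Steps 2 to 4 above can be sketched in plain Python. This is only an illustration of additive scoring, score = v . tanh(W_enc h + W_dec s), with hypothetical untrained weights (identity matrices and a vector of ones); it is not the tf.contrib.seq2seq implementation used below.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(encoder_states, decoder_state, w_enc, w_dec, v):
    # Step 2: alignment score between the decoder state and each encoder state
    projected_dec = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_enc, projected_dec)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Step 3: softmax turns the scores into attention weights
    weights = softmax(scores)
    # Step 4: context vector = weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context

# Hypothetical toy values: three encoder states of size 2
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(encoder_states, decoder_state,
                                      identity, identity, [1.0, 1.0])
print(weights, context)
```

The weights sum to 1, so the context vector is a weighted average of the encoder states that leans towards the states with the highest scores.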

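TensorFlow's BahdanauAttention class has two forms. The first is the additive attention described in the paper above. The second is the normalized form, which is inspired by the weight normalization article: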

Tim Salimans, Diederik P. Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." https://arxiv.org/abs/1602.07868

<b>Class BahdanauAttention</b>

Args:

num_units: The depth of the query mechanism.

memory: The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, ...]. 

memory_sequence_length (optional): Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.

normalize: Python boolean. Whether to normalize the energy term.

probability_fn: (optional) A callable. Converts the score to probabilities. The default is tf.nn.softmax. Other options include tf.contrib.seq2seq.hardmax and tf.contrib.sparsemax.sparsemax. Its signature should be: probabilities = probability_fn(score).

score_mask_value: (optional): The mask value for score before passing into probability_fn. The default is -inf. Only used if memory_sequence_length is not None.

<b>class AttentionWrapper:</b>Wraps another RNNCell with attention.

dec_cell: An instance of RNNCell.

attention_mechanism: A list of AttentionMechanism instances or a single instance.

attention_layer_size: A list of Python integers or a single Python integer, the depth of the attention (output) layer(s). If None (default), use the context as attention at each time step. Otherwise, feed the context and cell output into the attention layer to generate attention at each time step. If attention_mechanism is a list, attention_layer_size must be a list of the same length. If attention_layer is set, this must be None. If attention_fn is set, it must be guaranteed that the outputs of attention_fn also meet the above requirements.

Sources:

1. https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/

2. https://blog.floydhub.com/attention-mechanism/

In [None]:
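def attention(rnn_size,encoder_outputs,memory_sequence_length,dec_cell):
    # memory_sequence_length: lengths of the sequences in the memory (encoder outputs), used to mask padded positions
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(rnn_size*2,encoder_outputs,
                                                                   memory_sequence_length=memory_sequence_length)
    # attention_layer_size is expected to be an integer, so use floor division
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attention_mechanism,
                                                             attention_layer_size=rnn_size//2)
    return attention_cell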

A <b>placeholder</b> is used for feeding external data into a Tensorflow computation (stuff outside the graph). Here's some documentation: https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/#feeding

TensorFlow's feed mechanism lets you inject data into any Tensor in a computation graph. A python computation can thus feed data directly into the graph.

In [None]:
input_data = tf.placeholder(tf.int32, [None, None],name='input')
target_data = tf.placeholder(tf.int32, [None, None],name='target')
input_data_len = tf.placeholder(tf.int32,[None],name='input_len')
target_data_len = tf.placeholder(tf.int32,[None],name='target_len')
lr_rate = tf.placeholder(tf.float32,name='lr')
keep_prob = tf.placeholder(tf.float32,name='keep_prob')

# <center> Encoder</center>
The LSTM network can be organized into an architecture called the Encoder-Decoder LSTM that allows the model to be used to both support variable length input sequences and to predict or output variable length output sequences.

This architecture is the basis for many advances in complex sequence prediction problems such as speech recognition and text translation.

In this architecture, an <b>encoder</b> LSTM model reads the input sequence step-by-step. After reading in the entire input sequence, the hidden state or output of this model represents an internal learned representation of the entire input sequence as a fixed-length vector. This vector is then provided as an input to the <b>decoder</b> model that interprets it as each step in the output sequence is generated.

A <b>tf.Variable</b> is used to store state in the graph. It requires an initial value. One use case is representing the weights of a neural network. Here's the documentation: (https://www.tensorflow.org/api_docs/python/tf/Variable)

A variable maintains state in the graph across calls to run(). You add a variable to the graph by constructing an instance of the class Variable.

The Variable() constructor requires an initial value for the variable, which can be a Tensor of any type and shape. The initial value defines the type and shape of the variable. After construction, the type and shape of the variable are fixed. The value can be changed using one of the assign methods.

<b>random_uniform:</b> Outputs random values from a uniform distribution. Wrapping the result in a tf.Variable keeps the generated tensor, so that it can be reused across multiple session run calls.

<b>encoder_embeddings:</b> holds a randomly initialized tensor of shape [source_vocab_size, embed_size]

For example, with an embed_size of 128 it outputs: <tf.Variable 'Variable:0' shape=(9611, 128) dtype=float32_ref>

So [9611, 128] is the shape of the embedding matrix that will be used for the embedding lookup

### Word Embedding 

Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

Word Embeddings are the texts converted into numbers and there may be different numerical representations of the same text.

Take a look at this example: sentence = "Word Embeddings are Word converted into numbers"

A word in this sentence may be "Embeddings" or "numbers", etc.

A dictionary may be the list of all unique words in the sentence. So, a dictionary may look like: ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

A vector representation of a word may be a one-hot encoded vector where 1 stands for the position where the word exists and 0 everywhere else. The vector representation of "numbers" in this format according to the above dictionary is [0,0,0,0,0,1] and that of "Converted" is [0,0,0,1,0,0].
![title](DocImg/one-hot.jpg)            ![title](DocImg/wordembed.jpg)

![title](DocImg/one-hot to wordembed.jpg)

<center><i>Source:https://confengine.com/odsc-india-2019/proposal/10176/sequence-to-sequence-learning-with-encoder-decoder-neural-network-models</i></center>
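
The one-hot example above can be reproduced with a short sketch; `one_hot` is a hypothetical helper name, and the dictionary is the one from the example.

```python
def one_hot(word, dictionary):
    """Return a vector with 1 at the position of `word` and 0 everywhere else."""
    return [1 if entry == word else 0 for entry in dictionary]

dictionary = ["Word", "Embeddings", "are", "Converted", "into", "numbers"]
print(one_hot("numbers", dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot("Converted", dictionary))  # [0, 0, 0, 1, 0, 0]
```

The vector length grows with the vocabulary size, which is one reason dense embeddings of a fixed size are preferred for large vocabularies.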

### Word2Vec

Word2Vec is one of the most popular techniques to learn word embeddings using a shallow neural network.

Mikolov introduced word2vec to the NLP community. These methods were prediction based in the sense that they provided probabilities to the words and proved to be state of the art for tasks like word analogies and word similarities. They were also able to achieve tasks like King - man + woman = Queen, which was considered an almost magical result. So let us look at the word2vec model used as of today to generate word vectors.

Word2vec is not a single algorithm but a combination of two techniques: CBOW (Continuous Bag of Words) and the Skip-gram model. Both of these are shallow neural networks which map word(s) to the target variable, which is also a word(s).

<i>source:https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/</i>

### <b>Function embedding_lookup:</b>
The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy.

![title](DocImg/embed_lookup2.jpg)

![title](DocImg/embed_lookup.jpg)

<i>Note: In the above diagram 100 is the row size of the embedding matrix, but in our case it is 9611, i.e. the size of our questions vocabulary</i>

Source: Stack Overflow
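
The row-selection behaviour can be illustrated with plain Python lists. `lookup_rows` is a hypothetical stand-in for the idea behind tf.nn.embedding_lookup, not the TensorFlow function itself, and the tiny matrix is made up for the example.

```python
def lookup_rows(params, ids):
    """Return, for each id in a batch of id sequences, the matching row of `params`."""
    return [[params[i] for i in row] for row in ids]

# Hypothetical embedding matrix: 4 words, embedding size 2
embedding_matrix = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
batch = [[2, 0], [3, 3]]  # two sequences of word ids
print(lookup_rows(embedding_matrix, batch))
# [[[2.0, 2.1], [0.0, 0.1]], [[3.0, 3.1], [3.0, 3.1]]]
```

A batch of shape [batch, time] therefore becomes a result of shape [batch, time, embed_size].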


In [None]:
encoder_embeddings = tf.Variable(tf.random_uniform([source_vocab_size, embed_size], -1, 1))
encoder_embedded = tf.nn.embedding_lookup(encoder_embeddings, input_data)

## Bidirectional_dynamic_rnn

If you want to have multiple layers that pass the information backward or forward in time, there are two ways to design this. Assume the forward layer consists of two layers F1, F2 and the backward layer consists of two layers B1, B2.

If you use tf.nn.bidirectional_dynamic_rnn the model will look like this (time flows from left to right):

![title](DocImg/bidir_rnn.png)

In [None]:
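# Use separate cell objects so that the forward and backward directions do not share one cell instance
stacked_cells = lstm(rnn_size, keep_prob)
stacked_cells_bw = lstm(rnn_size, keep_prob)

In [None]:
((encoder_fw_outputs,encoder_bw_outputs),
 (encoder_fw_final_state,encoder_bw_final_state)) = tf.nn.bidirectional_dynamic_rnn(cell_fw=stacked_cells, 
                                                                 cell_bw=stacked_cells_bw, 
                                                                 inputs=encoder_embedded, 
                                                                 sequence_length=input_data_len, 
                                                                 dtype=tf.float32)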

In [None]:
encoder_outputs = tf.concat((encoder_fw_outputs,encoder_bw_outputs),2)

In [None]:
encoder_outputs

### <b>class LSTMStateTuple</b>
Tuple used by LSTM Cells for state_size, zero_state, and output state.

Stores two elements, (c, h), in that order, where c is the hidden state and h is the output.

Only used when state_is_tuple=True.

In [None]:
encoder_state_c = tf.concat((encoder_fw_final_state.c,encoder_bw_final_state.c),1)
encoder_state_h = tf.concat((encoder_fw_final_state.h,encoder_bw_final_state.h),1)
encoder_states = tf.nn.rnn_cell.LSTMStateTuple(c=encoder_state_c,h=encoder_state_h)

In [None]:
encoder_states

# <center> Decoder</center>
### <b>strided_slice</b>
To a first order, this operation extracts a slice of size <b>end - begin</b> from a tensor input starting at the location specified by begin. The slice continues by adding stride to the begin index until all dimensions are not less than end. Note that components of stride can be negative, which causes a reverse slice.

<b>ex:</b>

begin = [1, 0, 0] and end = [2, 1, 3]. Also, all the strides are 1. Work your way backwards, from the last dimension.

Start with element [1,0,0]. Now increase the last dimension only by its stride amount, giving you [1,0,1]. Keep doing this until you reach the limit. Something like [1,0,2], [1,0,3] (end of the loop). Now in your next iteration, start by incrementing the second to last dimension and resetting the last dimension, [1,1,0]. Here the second to last dimension is equal to end[1], so move to the first dimension (third to last) and reset the rest, giving you [2,0,0]. Again you are at the first dimension’s limit, so quit the loop.

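In our case: input=target_data, begin=[0, 0], end=[batch_size, -1], strides=[1, 1]. This keeps every row of the batch and drops the last token of each target sequence, so that the <GO> token can be prepended to form the decoder input.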
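
The effect on the target batch can be sketched with plain Python lists. `make_decoder_input` is a hypothetical helper and the token ids are made up; it mirrors the slice-then-prepend idea rather than calling TensorFlow.

```python
def make_decoder_input(target_batch, go_id):
    """Drop the last token of every target row and prepend the <GO> id."""
    return [[go_id] + row[:-1] for row in target_batch]

GO = 9  # hypothetical id of the <GO> token
targets = [[5, 6, 7, 2], [8, 3, 2, 0]]
print(make_decoder_input(targets, GO))  # [[9, 5, 6, 7], [9, 8, 3, 2]]
```

Each row keeps its length, so at every time step the decoder receives the previous target token as its input.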

In [None]:
main = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
decoder_input = tf.concat([tf.fill([batch_size, 1],vocabs_to_index['<GO>']), main], 1)

In [None]:
#same process as for the encoder embeddings and lookup
decoder_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, embed_size], -1, 1))
dec_cell_inputs = tf.nn.embedding_lookup(decoder_embeddings, decoder_input)

In [None]:
dec_cell = lstm(rnn_size*2,keep_prob)

In [None]:
dec_cell

### Dense layer

Single output layer

target_vocab_size=dimensionality of the output space.

In [None]:
#output layer for decoder
dense_layer = tf.layers.Dense(target_vocab_size)

### Class TrainingHelper

A helper for use during training. Only reads inputs.

Returned sample_ids are the argmax of the RNN output logits

dec_cell_inputs=>inputs: A (structure of) input tensors.

target_data_len=>sequence_length: An int32 vector tensor.

In [None]:
train_helper = tf.contrib.seq2seq.TrainingHelper(dec_cell_inputs, target_data_len)

### Cell state initializer
<b>attention_cell.zero_state</b>

state_is_tuple and cell.zero_state are two different things. state_is_tuple is used on LSTM cells because the state of an LSTM cell is a tuple. cell.zero_state is the initializer of the state for all RNN cells.

You will generally prefer the cell.zero_state function, as it will initialize the required state class depending on whether state_is_tuple is true or not.

A TensorFlow GitHub issue also recommends cell.zero_state: "use the zero_state function on the cell object".

Another reason to prefer cell.zero_state is that it is agnostic of the type of the cell (LSTM, GRU, RNN).

In [None]:
#the attention memory is the encoder output, so it is masked with the input (question) lengths
attention_cell = attention(rnn_size,encoder_outputs,input_data_len,dec_cell)
state = attention_cell.zero_state(dtype=tf.float32, batch_size=batch_size)
state = state.clone(cell_state=encoder_states)

### Class BasicDecoder
Basic sampling decoder.

In [None]:
decoder_train = tf.contrib.seq2seq.BasicDecoder(cell=attention_cell, helper=train_helper, 
                                                  initial_state=state,
                                                  output_layer=dense_layer) 


### Class dynamic_decode
Perform dynamic decoding with decoder.

Calls initialize() once and step() repeatedly on the Decoder object.

impute_finished: Python boolean. If True, then states for batch entries which are marked as finished get copied through and the corresponding outputs get zeroed out. This causes some slowdown at each time step, but ensures that the final state and outputs have the correct values and that backprop ignores time steps that were marked as finished.

maximum_iterations: maximum allowed number of decoding steps. 

In [None]:
outputs_train, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder_train, 
                                                  impute_finished=True, 
                                                  maximum_iterations=tf.reduce_max(target_data_len))
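
The control flow described above (one call to initialize(), then repeated calls to step() until the sequence is finished or maximum_iterations is reached) can be sketched in plain Python. This is only an illustration for a single sequence, not the TensorFlow implementation: `ToyDecoder` and `toy_dynamic_decode` are hypothetical names, and the "state" is just a position in a fixed script.

```python
class ToyDecoder:
    """Hypothetical stand-in for a decoder: replays a fixed list of tokens."""

    def __init__(self, script, end_token):
        self.script = script
        self.end_token = end_token

    def initialize(self):
        # Returns (finished, initial_state); the state is a position in the script.
        return False, 0

    def step(self, time, state):
        # Emit one token and report whether it was the end token.
        token = self.script[state]
        finished = token == self.end_token
        return token, state + 1, finished


def toy_dynamic_decode(decoder, maximum_iterations):
    """Call initialize() once, then step() until finished or out of iterations."""
    finished, state = decoder.initialize()
    outputs = []
    time = 0
    while not finished and time < maximum_iterations:
        token, state, finished = decoder.step(time, state)
        outputs.append(token)
        time += 1
    return outputs


decoder = ToyDecoder(["i", "am", "fine", "<EOS>", "ignored"], "<EOS>")
print(toy_dynamic_decode(decoder, maximum_iterations=10))  # stops at <EOS>
print(toy_dynamic_decode(decoder, maximum_iterations=2))   # stops after 2 steps
```

The real dynamic_decode does the same for a whole batch at once, which is why it needs impute_finished to handle sequences that end earlier than others.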

# Greedy Search
A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence.

This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.

Example: define a sequence of 10 time steps over a vocabulary of 5 words, where each row holds one score per word for that step.


        data = [[0.1, 0.2, 0.3, 0.4, 0.5],
               [0.5, 0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.3, 0.4, 0.5],
               [0.5, 0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.3, 0.4, 0.5],
               [0.5, 0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.3, 0.4, 0.5],
               [0.5, 0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.3, 0.4, 0.5],
               [0.5, 0.4, 0.3, 0.2, 0.1]]
     
The greedy decoder returns, for each time step, the index of the word with the maximum probability at that step. These indices can then be mapped back to words in the vocabulary.

For the data above the result is:

               [4, 0, 4, 0, 4, 0, 4, 0, 4, 0]
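
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch that works on nested lists; `greedy_decode` is an illustrative name, not part of TensorFlow.

```python
def greedy_decode(step_scores):
    """Return the index of the highest-scoring word at every time step."""
    return [max(range(len(row)), key=row.__getitem__) for row in step_scores]


# 10 time steps over a vocabulary of 5 words (same values as the example above)
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.3, 0.2, 0.1]] * 5

print(greedy_decode(data))  # [4, 0, 4, 0, 4, 0, 4, 0, 4, 0]
```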

        
### Class GreedyEmbeddingHelper

A helper for use during inference.

Uses the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input.

embedding: A callable that takes a vector tensor of ids (argmax ids), or the params argument for embedding_lookup. The returned tensor will be passed to the decoder input.

start_tokens: int32 vector shaped [batch_size], the start tokens.

end_token: int32 scalar, the token that marks end of decoding.

In [None]:
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(decoder_embeddings, 
                                                          tf.fill([batch_size], vocabs_to_index['<GO>']), 
                                                          vocabs_to_index['<EOS>'])

In [None]:
decoder_infer = tf.contrib.seq2seq.BasicDecoder(cell=attention_cell, helper=infer_helper, 
                                                  initial_state=state,
                                                  output_layer=dense_layer)

In [None]:
outputs_infer, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder_infer, impute_finished=True,
                                                          maximum_iterations=tf.reduce_max(target_data_len))

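### Function identity
Returns a tensor with the same shape and contents as its input.

Here it is used to give the decoder outputs fixed names, so that they can be retrieved by name from a saved graph: outputs_train.rnn_output is named 'logits', and outputs_infer.sample_id is named 'predictions'. Note that inference_logits holds predicted word ids, not logits, despite its variable name.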

In [None]:
training_logits = tf.identity(outputs_train.rnn_output, name='logits')
inference_logits = tf.identity(outputs_infer.sample_id, name='predictions')

### Padding and Masking
When sentences of different lengths are put in the same batch, the shorter ones are padded so that all samples have a uniform length. The model must then be informed that some part of the data is actually padding and should be ignored. That mechanism is <b>masking</b>.

Example:

    [
      ["The", "weather", "will", "be", "nice", "tomorrow"],
      ["How", "are", "you", "doing", "today"],
      ["Hello", "world", "!"]
    ]

After converting the words to integer ids:

    [
      [83, 91, 1, 645, 1253, 927],
      [73, 8, 3215, 55, 927],
      [71, 1331, 4231]
    ]

After padding:

    [[  83   91    1  645 1253  927]
     [  73    8 3215   55  927    0]
     [  71 1331 4231    0    0    0]]

After masking:

    tf.Tensor(
    [[ True  True  True  True  True  True]
     [ True  True  True  True  True False]
     [ True  True  True False False False]], shape=(3, 6), dtype=bool)

### Function sequence_mask
When using the Keras Functional API or Sequential API, a mask generated by an Embedding or Masking layer is propagated through the network to any layer that is capable of using it (for example, RNN layers).

Note that in the call method of a subclassed model or layer, masks are not propagated automatically, so you need to pass a mask argument manually to any layer that needs one.

In this notebook the mask is built explicitly with tf.sequence_mask, which returns a mask tensor representing the first N positions of each sequence.

If lengths has shape [d_1, d_2, ..., d_n], the resulting mask tensor has type dtype and shape [d_1, d_2, ..., d_n, maxlen], with

    mask[i_1, i_2, ..., i_n, j] = (j < lengths[i_1, i_2, ..., i_n])

Examples:

    tf.sequence_mask([1, 3, 2], 5)
    # [[True, False, False, False, False],
    #  [True, True, True, False, False],
    #  [True, True, False, False, False]]

    tf.sequence_mask([[1, 3],[2,0]])
    # [[[True, False, False],
    #   [True, True, True]],
    #  [[True, True, False],
    #   [False, False, False]]]

Here lengths=target_data_len and maxlen=tf.reduce_max(target_data_len).

In [None]:
masks = tf.sequence_mask(target_data_len, tf.reduce_max(target_data_len), dtype=tf.float32, name='masks')
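
For a one-dimensional list of lengths, the rule mask[i, j] = (j < lengths[i]) can be written directly in plain Python. The function name `sequence_mask_1d` is made up for this sketch; it only mirrors the first tf.sequence_mask example above and does not handle nested length tensors.

```python
def sequence_mask_1d(lengths, maxlen=None):
    """Boolean mask with True for the first lengths[i] positions of row i."""
    if maxlen is None:
        maxlen = max(lengths)  # default: the longest sequence in the batch
    return [[j < n for j in range(maxlen)] for n in lengths]


for row in sequence_mask_1d([1, 3, 2], 5):
    print(row)
```

Casting the booleans to floats (1.0 for real words, 0.0 for padding) gives the kind of weights that the loss in the next section expects.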

### Function sequence_loss
Weighted cross-entropy loss for a sequence of logits.

logits: A Tensor of shape [batch_size, sequence_length, num_decoder_symbols] and dtype float. The logits correspond to the prediction across all classes at each timestep.

targets: A Tensor of shape [batch_size, sequence_length] and dtype int. The target represents the true class at each timestep.

weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.

In [None]:
cost = tf.contrib.seq2seq.sequence_loss(training_logits,target_data,masks)
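
To see what the weights do, here is a minimal plain-Python sketch of a weighted cross-entropy over a batch of sequences. It assumes the loss is summed over all positions and divided by the sum of the weights, so padded positions (weight 0) contribute nothing; `masked_sequence_loss` is an illustrative name and the function is not the TensorFlow implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_sequence_loss(logits, targets, weights):
    """Weighted average cross-entropy; positions with weight 0 are ignored."""
    total_loss = 0.0
    total_weight = 0.0
    for seq_logits, seq_targets, seq_weights in zip(logits, targets, weights):
        for step_logits, target, weight in zip(seq_logits, seq_targets, seq_weights):
            prob = softmax(step_logits)[target]
            total_loss += -math.log(prob) * weight
            total_weight += weight
    return total_loss / total_weight


# batch of 2 sequences, 2 time steps, vocabulary of 2 words
logits = [[[0.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [5.0, -5.0]]]   # last step is padding
targets = [[0, 1], [1, 1]]
weights = [[1.0, 1.0], [1.0, 0.0]]     # mask out the padded step

print(masked_sequence_loss(logits, targets, weights))
```

In this example the three unmasked steps all have uniform logits, so the loss equals log(2) regardless of what the model predicts at the padded step.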

## <center> Adam Optimizer</center>

Adam differs from classical stochastic gradient descent.

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training.

In Adam, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.

The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

<b>Adaptive Gradient Algorithm (AdaGrad)</b>, which maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

<b>Root Mean Square Propagation (RMSProp)</b>, which also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

Instead of adapting the parameter learning rates based only on the average of the squared gradients (the second moment, or uncentered variance) as in RMSProp, Adam also makes use of the average of the gradients themselves (the first moment, or mean).

Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.
Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
Adam is relatively easy to configure, and the default configuration parameters do well on most problems.

alpha (in our case lr_rate): also referred to as the learning rate or step size. It is the proportion by which the weights are updated during training (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning before the rate is adapted. Smaller values (e.g. 1.0E-5) slow learning right down during training.

Source: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/

In [None]:
optimizer = tf.train.AdamOptimizer(lr_rate)
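
The update that Adam applies to a single scalar parameter can be sketched as follows. This is a simplified illustration of the published update rule with the commonly used defaults (beta1=0.9, beta2=0.999, eps=1e-8); `adam_step` is an illustrative helper, not the code inside tf.train.AdamOptimizer.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# minimise f(x) = x**2, whose gradient is 2*x
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
print(x)
```

On the very first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1, so the parameter moves by roughly lr regardless of how large the gradient is. This is the sense in which the step size is adapted per parameter.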

# Gradient Clipping
Gradient clipping is a technique to prevent exploding gradients in very deep networks, usually in recurrent neural networks. There are many ways to clip gradients, but a common one is to rescale them so that their norm is at most a particular value. A gradient threshold is chosen in advance, and gradients whose norm exceeds this threshold are scaled down so that their norm matches the threshold. This prevents any gradient from having a norm greater than the threshold, and thus the gradients are clipped. Clipping introduces a bias in the resulting gradient values, but it keeps training stable.

<b>Why is this Useful?</b>
Training recurrent neural networks can be very difficult. Two common issues are vanishing gradients and exploding gradients. Exploding gradients occur when error gradients accumulate and become too large, resulting in an unstable network. Vanishing gradients occur when the gradient becomes too small for optimization to make progress. Gradient clipping addresses the exploding case by keeping overly large updates from disrupting the parameters during training. It does not fix vanishing gradients, which are usually handled by gated cells such as the LSTM.

![title](DocImg/gradclip.png)

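<center><i>Source: https://deepai.org/</i></center>

### clip_by_value
tf.clip_by_value clips tensor values to a specified min and max. Unlike the norm-based rescaling described above, it clips every gradient element independently, here to the range [-1, 1].

optimizer.compute_gradients computes the gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable, which is why the code below skips pairs whose gradient is None.

loss: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.

The clipped pairs are then passed to optimizer.apply_gradients, which is the second part of minimize().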

In [None]:
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)
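
The difference between clipping by value (used in the cell above) and clipping by norm (described in the text) is easiest to see on a small vector. Both helpers below are plain-Python sketches written for this comparison; they are not the TensorFlow functions.

```python
import math


def clip_by_value(grad, low, high):
    """Clip every element independently to the range [low, high]."""
    return [min(max(g, low), high) for g in grad]


def clip_by_norm(grad, max_norm):
    """Rescale the whole vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0]                     # L2 norm is 5
print(clip_by_value(grad, -1.0, 1.0))  # [1.0, -1.0]: the direction changes
print(clip_by_norm(grad, 1.0))         # about [0.6, -0.8]: the direction is kept
```

Clipping by value is simpler but can change the direction of the gradient, while clipping by norm only shortens it.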

# Padding
Before training, we convert the variable-length sequences in the dataset into fixed-length sequences by padding. We use a few special symbols to fill in the sequence.


PAD : Filler

GO : Start decoding

EOS : End of sentence

UNK : Unknown; word not in vocabulary

Consider the following query-response pair.

Q : How are you?

A : I am fine.

Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be converted to:

Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]

A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]

In this illustration the query is also reversed and padded on the left. The pad_sentence function used in this notebook keeps the word order and appends the padding on the right.

The result of padding is straightforward: the list of sentences becomes a matrix in which each row is an encoded sentence and all rows have the same length. Padding every sentence to the longest length is computationally expensive on a large dataset. Bucketing addresses this, but it is not applied here for the moment.
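
The conversion of the query-response pair shown above can be sketched as a small function. `encode_pair` is a hypothetical helper written for this illustration (it is not used elsewhere in the notebook) and it assumes that both sentences fit into the requested length.

```python
def encode_pair(query_tokens, response_tokens, length):
    """Pad a query/response pair to a fixed length, as in the illustration above."""
    # query: reversed, with padding on the left
    query = ["PAD"] * (length - len(query_tokens)) + list(reversed(query_tokens))
    # response: wrapped in GO ... EOS, with padding on the right
    response = ["GO"] + list(response_tokens) + ["EOS"]
    response = response + ["PAD"] * (length - len(response))
    return query, response


q, a = encode_pair(["How", "are", "you", "?"], ["I", "am", "fine", "."], 10)
print(q)
print(a)
```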


# Bucketing
Padding solves the problem of variable-length sequences, but consider the case of long sentences. If the longest sentence in our dataset has length 100, we need to encode all our sentences to length 100 in order not to lose any words. Now, what happens to “How are you?”? With its 4 tokens, there will be 96 PAD symbols in the encoded version of the sentence. This will overshadow the actual information in the sentence.

Bucketing largely solves this problem by putting sentences into buckets of different sizes. Consider this list of buckets: [ (5,10), (10,15), (20,25), (40,50) ]. If the length of a query is 4 and the length of its response is 4 (as in our previous example), we put this pair in the bucket (5,10). The query will be padded to length 5 and the response will be padded to length 10. While running the model (training or predicting), we use a different model for each bucket, compatible with the lengths of query and response. All these models share the same parameters and hence function exactly the same way.

If we are using the bucket (5,10), our sentences will be encoded to :

Q : [ PAD, “?”, “you”, “are”, “How” ]

A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
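
Choosing a bucket amounts to picking the first (query size, response size) pair that is large enough for both sentences. The sketch below assumes the buckets are sorted from smallest to largest and that the response length already counts the GO and EOS symbols; `choose_bucket` is an illustrative name.

```python
def choose_bucket(query_len, response_len, buckets):
    """Return the first bucket that fits both lengths, or None if none does."""
    for query_size, response_size in buckets:
        if query_len <= query_size and response_len <= response_size:
            return (query_size, response_size)
    return None


buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

# "How are you ?" has 4 tokens; GO + "I am fine ." + EOS has 6 symbols
print(choose_bucket(4, 6, buckets))   # (5, 10)
print(choose_bucket(12, 8, buckets))  # (20, 25)
print(choose_bucket(60, 5, buckets))  # None: longer than every bucket
```

Pairs that fit no bucket have to be truncated or dropped, which is a design decision left to the training pipeline.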

In [None]:
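def pad_sentence(sentence_batch, pad_int):
    # pad every sentence (a list of word ids) on the right up to the longest one in the batch
    padded_seqs = []
    seq_lens = []
    max_sentence_len = max([len(sentence) for sentence in sentence_batch])
    for sentence in sentence_batch:
        padded_seqs.append(sentence + [pad_int] * (max_sentence_len - len(sentence)))
        # keep the original length so that padding can be masked out later
        seq_lens.append(len(sentence))
    return padded_seqs, seq_lens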


# Accuracy computation
<b>np.pad</b> pads the input array according to the given pad widths.

<b>np.equal</b> compares the target and the prediction element-wise.

<b>np.mean</b> returns the average of its inputs.

    target = [1, 2, 3, 4, 5]
    print(np.pad(target, (2, 3), 'constant', constant_values=(4, 6)))
    output: [4 4 1 2 3 4 5 6 6 6]

    target='Iam good'=[1 5]___length=2,  logits='Iam doing good'=[1 7 5]___length=3
    (the argument named logits holds predicted word ids)
    if max_seq=3  then target=[1 5 0]   (after np.pad)
                       logits=[1 7 5]
    take the mean to get the average accuracy over the whole batch:
        compare the sentences using np.equal([1, 5, 0], [1, 7, 5])          output: [ True False False]
        mean of the result using    np.mean(np.equal([1, 5, 0], [1, 7, 5])) output: 0.3333333333
    

In [None]:
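def get_accuracy(target, logits):
    # both arguments are 2-D arrays of word ids: [batch_size, sequence_length]
    target = np.asarray(target)
    logits = np.asarray(logits)
    max_seq = max(target.shape[1], logits.shape[1])
    # pad the shorter array on the right with zeros so both have max_seq columns
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1])],
            'constant')

    # fraction of positions where the predicted id equals the target id
    return np.mean(np.equal(target, logits))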
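
The worked example from the text (target [1 5], prediction [1 7 5]) can be checked for a single sentence without NumPy. `token_accuracy` is an illustrative helper that follows the same steps as get_accuracy: pad the shorter list with zeros, compare element-wise, and take the mean.

```python
def token_accuracy(target, prediction, pad_value=0):
    """Share of positions where the target id and the predicted id agree."""
    max_len = max(len(target), len(prediction))
    # pad the shorter list on the right so that both have max_len entries
    target = target + [pad_value] * (max_len - len(target))
    prediction = prediction + [pad_value] * (max_len - len(prediction))
    matches = [t == p for t, p in zip(target, prediction)]
    return sum(matches) / max_len


print(token_accuracy([1, 5], [1, 7, 5]))  # one match out of three positions
```

Because padded positions are compared as well, matching padding counts as a correct prediction, so this accuracy should be read as a rough indicator rather than a precise measure.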

# Train and Test data split
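The inputs and outputs are our questions and answers. Here we split the dataset with respect to the batch size (128).

For example, the first 128 questions in the questions_int list (indices 0 to 127) form the validation set, and the questions from index 128 to the end of the list form the training set, because we need less data for validation than for training.

The same split is applied to the answers in answers_int. Note the naming in the code: train_data and val_train_data hold questions (the model inputs), while test_data and val_test_data hold the matching answers (the targets).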

In [None]:
train_data = questions_int[batch_size:]
test_data = answers_int[batch_size:]
val_train_data = questions_int[:batch_size]
val_test_data = answers_int[:batch_size]

In [None]:
len(train_data)

# Prepare train test validation set

We need to pad the sentences before using them to validate our model, as we saw in the padding section.
Our vocabulary index already contains the '<PAD>' token. Checking its index:

vocabs_to_index['<PAD>']--->6283 (the index where it is located)

For example, if the maximum sentence length in the batch is 4, padding works as follows:

    ['how','are','you']=[0, 1, 2]                      -Before padding
    ['how','are','you','<PAD>']=[0, 1, 2, 6283]        -After padding

In [None]:
pad_int = vocabs_to_index['<PAD>']

In [None]:
val_batch_x,val_batch_len = pad_sentence(val_train_data,pad_int)
val_batch_y,val_batch_len_y = pad_sentence(val_test_data,pad_int)
val_batch_x = np.array(val_batch_x)
val_batch_y = np.array(val_batch_y)

# Round the length of train data
We round the length of the training data down to a multiple of the batch size so that every batch contains the same number of sentences.

For example:

     the length of the training data is 103 and the batch size is 10
        103/10=10.3
     we cannot have 10.3 batches, so we keep 10 full batches:
        10*10=100
     The rounded length of the training data is 100; the 3 remaining sentences are not used.
        


In [None]:
no_of_batches = len(train_data) // batch_size  # integer division already rounds down
round_no = no_of_batches*batch_size
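
The rounding and the resulting batch boundaries can be checked with the numbers from the example (103 sentences, batch size 10). `batch_ranges` is an illustrative helper and is not used by the training loop below.

```python
def batch_ranges(num_examples, batch_size):
    """Start/end indices of all full batches; leftover examples are dropped."""
    num_batches = num_examples // batch_size
    rounded = num_batches * batch_size
    return [(start, start + batch_size) for start in range(0, rounded, batch_size)]


ranges = batch_ranges(103, 10)
print(len(ranges))  # 10 full batches
print(ranges[0])    # (0, 10)
print(ranges[-1])   # (90, 100): sentences 100 to 102 are left out
```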

### Sentence to sequence
A question sentence such as 'how are you' cannot be given to our RNN as text. It must first be converted into a sequence of word indices. Words that are not in the vocabulary are mapped to the index of '<UNK>'.

Ex:
    
    'how are you' to  [0, 1, 2]

In [None]:
def sentence_to_seq(sentence, vocabs_to_index):
    results = []
    for word in sentence.split(" "):
        if word in vocabs_to_index:
            results.append(vocabs_to_index[word])
        else:
            results.append(vocabs_to_index['<UNK>'])        
    return results

In [None]:
question_sentence = 'where are you'
question_sentence = sentence_to_seq(question_sentence, vocabs_to_index)
print(question_sentence)
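
To make the mapping concrete, here is a self-contained sketch with a toy vocabulary. The vocabulary, `encode`, and `decode` are invented for this illustration; the notebook itself uses vocabs_to_index and index_to_vocabs built from the dataset.

```python
toy_vocab = {"how": 0, "are": 1, "you": 2, "<UNK>": 3}


def encode(sentence, vocab):
    """Map each space-separated word to its index, falling back to <UNK>."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split(" ")]


def decode(ids, vocab):
    """Map indices back to words using the inverted vocabulary."""
    index_to_word = {index: word for word, index in vocab.items()}
    return [index_to_word[i] for i in ids]


print(encode("how are you", toy_vocab))   # [0, 1, 2]
print(encode("how are you?", toy_vocab))  # [0, 1, 3]: 'you?' is not in the vocabulary
print(decode([0, 1, 3], toy_vocab))       # ['how', 'are', '<UNK>']
```

The second call shows why punctuation should be cleaned or separated before encoding: a word with an attached question mark is treated as a different, unknown word.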

# Tf Session Run/Train Model
Only after running <b>tf.global_variables_initializer()</b> in a session do our variables hold the values we specified when declaring them (tf.Variable(tf.zeros(...)), tf.Variable(tf.random_normal(...)), ...).

From the TF doc:


    Calling tf.Variable() adds several ops to the graph:
    - A variable op that holds the variable value.
    - An initializer op that sets the variable to its initial value. This is actually a tf.assign op.
    - The ops for the initial value, such as the zeros op for the biases variable in the example, are also added to the graph.

    And also: Variable initializers must be run explicitly before other ops in your model can be run. The easiest way to do that is to add an op that runs all the variable initializers, and run that op before using the model.
    
### tqdm:

    for bs in tqdm(range(0,round_no  ,batch_size)):
    tqdm() wraps the range of batch start indices and yields each bs in turn. Each time it yields a new bs (between iterations of the loop), it also updates a progress bar in our output cell.
    
### Session
We run the session with the training op (train_op), which computes the gradients with respect to the cost and applies them at each step, and we feed it the input and target data together with their lengths.
The run returns the <b>prediction</b> for the input data, which is compared with the original target data to get the accuracy of our model. The accuracy is accumulated to obtain the average accuracy over all the batches in a single epoch.
It also returns the <b>loss</b> for each batch, which is accumulated to obtain the average loss over all the batches in a single epoch.

<b>Optional:</b> In each epoch the current tf session also takes the question_sentence we assigned before (such as 'how are you') as input and gives the predicted answer. This helps us review how relevant the predicted answers are compared with a human reply.

In [None]:
#file_writer = tf.summary.FileWriter('D:/ML Projects/Global IA/Seq2Seq-Chatbot/Notebook/model_weights/log')

In [None]:
#summaries_op = tf.summary.merge_all()

In [None]:
save_path = '/ML Projects/Global IA/Seq2Seq-Chatbot/Notebook/model_weights/model_weights'
acc_plt = []
loss_plt = []

# create the Saver once, outside the epoch loop, so that save ops are not re-added every epoch
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):
        #_, summaries_str = sess.run([train_op, summaries_op])
        #fw.add_summary(summaries_str, global_step=i)
        total_accuracy = 0.0
        total_loss = 0.0
        for bs in tqdm(range(0, round_no, batch_size)):
            index = min(bs + batch_size, round_no)

            # padding is done separately for each batch of questions and answers
            batch_x, len_x = pad_sentence(train_data[bs:index], pad_int)
            batch_y, len_y = pad_sentence(test_data[bs:index], pad_int)
            batch_x = np.array(batch_x)
            batch_y = np.array(batch_y)

            # one training step: fetch predictions and loss, and apply the clipped gradients
            pred, loss_f, opt = sess.run([inference_logits, cost, train_op],
                                         feed_dict={input_data: batch_x,
                                                    target_data: batch_y,
                                                    input_data_len: len_x,
                                                    target_data_len: len_y,
                                                    lr_rate: learning_rate,
                                                    keep_prob: 0.75})

            train_acc = get_accuracy(batch_y, pred)
            total_loss += loss_f
            total_accuracy += train_acc

        # average over the number of batches in this epoch
        total_accuracy /= (round_no // batch_size)
        total_loss /= (round_no // batch_size)
        acc_plt.append(total_accuracy)
        loss_plt.append(total_loss)

        # sample reply for the fixed question; dropout is disabled (keep_prob=1.0) for prediction
        prediction_logits = sess.run(inference_logits, {input_data: [question_sentence]*batch_size,
                                         input_data_len: [len(question_sentence)]*batch_size,
                                         target_data_len: [len(question_sentence)]*batch_size,
                                         keep_prob: 1.0,
                                         })[0]
        print('Epoch %d,Average_loss %f, Average Accuracy %f'%(epoch+1,total_loss,total_accuracy))
        print('  Inputs Words: {}'.format([index_to_vocabs[i] for i in question_sentence]))
        print('  Replied Words: {}'.format(" ".join([index_to_vocabs[i] for i in prediction_logits])))
        print('\n')
        saver.save(sess, save_path)
    
    

In [None]:
%tensorboard --logdir "/ML Projects/Global IA/Seq2Seq-Chatbot/Notebook/model_weights"

In [None]:
#Accuracy vs Epochs
plt.plot(range(epochs),acc_plt)
plt.title("Change in Accuracy")
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()

In [None]:
#loss vs Epochs
plt.plot(range(epochs),loss_plt)
plt.title("Change in loss")
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

In [None]:
import pickle

In [None]:
# the with-block closes the file after writing
with open('accuracy.p', 'wb') as f:
    pickle.dump(acc_plt, f)

In [None]:
# the with-block closes the file after writing
with open('loss.p', 'wb') as f:
    pickle.dump(loss_plt, f)

# Validation

### Get tokens:
Get the indices of all the special codes/tokens that we additionally added to the vocab dictionary:

    6283 '<PAD>'
    6284 '<EOS>'
    6285 '<UNK>'
    6286 '<GO>'

In [None]:
#get the indices of all the special codes/tokens that we additionally added to the vocab dictionary
garbage = []
for code in codes:
  print(vocabs_to_index[code])
  garbage.append(vocabs_to_index[code])

### Prepare the questions,Answers, and predicted answers
To prepare the questions, the human answers, and the bot answers for display, we clean each sentence by removing the tokens discussed in the previous step.
    
    if we reach the <EOS> token, we stop reading the sentence
    if the id is one of the additional tokens <PAD>, <UNK>, <GO>, we skip it, so the returned words contain none of these tokens

In [None]:
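#prepare the question, answer and prediction data
def print_data(i,batch_x,index_to_vocabs):
  # look up the special token ids instead of hard-coding them
  eos = vocabs_to_index['<EOS>']
  skip = {vocabs_to_index[t] for t in ('<PAD>', '<UNK>', '<GO>')}
  data = []
  for n in batch_x[i]:
    if n == eos:
      break                # end of sentence: ignore everything after it
    if n not in skip:
      data.append(index_to_vocabs[n])
  return data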

In [None]:
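ques = []
real_answer = []
pred_answer = []
# note: batch_x, batch_y and pred still hold the last training batch from the loop above;
# to score the validation set, run inference_logits on val_batch_x in a session first
for i in range(len(val_batch_x)):
  ques.append(print_data(i,batch_x,index_to_vocabs))
  real_answer.append(print_data(i,batch_y,index_to_vocabs))
  pred_answer.append(print_data(i,pred,index_to_vocabs))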

### Printing Real and predicted Answers
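The output below lets us judge how well the trained model answers one batch of batch_size (i.e. 128) question-answer pairs. Note that the cell above reads batch_x, batch_y, and pred, which hold the last training batch. To judge the validation set created earlier, predictions for val_batch_x must be computed in a session first.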

In [None]:
for i in range(len(val_batch_x)):
    print('row %d'%(i+1))
    print('QUESTION:',' '.join(ques[i]))
    print('REAL ANSWER:',' '.join(real_answer[i]))
    print('PREDICTED ANSWER:',' '.join(pred_answer[i]),'\n')

# Load save model for given sample test
The sentence to test is assigned to the variable
            
            question_sentence_2
We initialize a tf.Session with a new graph and load the model's meta graph from the saved path to use the model for prediction.
The inputs for the session run are the same as described earlier, except for keep_prob (dropout for regularization): training used 0.75, while here it is 1.0 so that no units are dropped during prediction. A single example is not enough to draw conclusions about this hyperparameter value; it is shown for illustration only.

As we do not know the target length in advance, we assign a fixed maximum length (5 in the cell below).

Because sentence_to_seq splits on spaces, the token 'doing?' is looked up with its question mark attached and is mapped to '<UNK>' if that exact string is not in the vocabulary.

Finally, we output the word ids from the vocab dictionary together with the corresponding words for the question and the answer.



In [None]:
question_sentence_2 = 'what are you doing?'
question_sentence_2 = sentence_to_seq(question_sentence_2, vocabs_to_index)
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_path + '.meta')
    loader.restore(sess, save_path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    input_data_len = loaded_graph.get_tensor_by_name('input_len:0')
    target_data_len = loaded_graph.get_tensor_by_name('target_len:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    prediction_logits = sess.run(logits, {input_data: [question_sentence_2]*batch_size,
                                         input_data_len: [len(question_sentence_2)]*batch_size,
                                         target_data_len : [5]*batch_size,
                                         keep_prob: 1.0})[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in question_sentence_2]))
print('  Question: {}'.format([index_to_vocabs[i] for i in question_sentence_2]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in prediction_logits]))
print('  Answer: {}'.format(" ".join([index_to_vocabs[i] for i in prediction_logits])))

### Event history to use with TensorBoard
Save the graph event file to visualize the computation graph in TensorBoard.
Then run the command below, with the log directory given as a relative path:
 
     env tensorboard --logdir=Notebook\model_weights

In [None]:
file_writer = tf.summary.FileWriter('D:/ML Projects/Global IA/Seq2Seq-Chatbot/Notebook/model_weights/log', sess.graph)

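### Retrieve the learned embeddings

Next, let's retrieve the word embeddings learned during training. This will be a matrix of shape (vocab_size, embedding_dim).

We will then write the weights to disk. To use the Embedding Projector, we upload two files in tab-separated format: a file of vectors (containing the embedding) and a file of metadata (containing the words).

The two cells below appear to follow the TensorFlow word-embeddings tutorial: they assume a Keras model whose first layer is the embedding layer and a TensorFlow Datasets info object, neither of which is defined earlier in this notebook. To export the embeddings of this model, the names must be adapted, for example by evaluating decoder_embeddings in a session and taking the words from index_to_vocabs.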

In [None]:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

In [None]:
import io

encoder = info.features['text'].encoder

# the with-block closes both files even if writing fails
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for num, word in enumerate(encoder.subwords):
    vec = weights[num+1] # skip 0, it's padding.
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")

# Conclusion

Training on the Cornell Movie-Dialogs Corpus produced results that need further improvement 
and closer attention to the training parameters. Adding more high-quality data should further 
improve performance. The model should also be trained with other hyper-parameters and on 
different datasets for further experimentation. 




