# Building a Chat Bot with Deep NLP

# Ajay Bhat

The project has been split into 4 parts: 

Part 1 : Data Preprocessing

Part 2 : Building the Seq2Seq model

Part 3 : Training the Seq2Seq model

Part 4 : Testing the Seq2Seq model

# Importing the libraries 

1. numpy library to work with arrays

2. tensorflow for deep learning

3. regular expression library to clean the text 

4. time library to measure the training time of each epoch

In [1]:
import numpy as np
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
tf.disable_v2_behavior()
import re
import time

Instructions for updating:
non-resource variables are not supported in the long term


# PART 1 - DATA PREPROCESSING

# Importing the dataset

We are going to load the two data files into two variables:

the "lines" dataset as lines and

the "conversations" dataset as conversations

In [2]:
lines = open('movie_lines.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')

To avoid encoding issues, pass the argument encoding = 'utf-8'

To skip any bytes that cannot be decoded instead of raising an error, pass errors = 'ignore'

.split('\n') splits the contents of the file at every newline, giving a list with one element per line
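
One detail of splitting on '\n' is worth seeing concretely: if the text ends with a newline, the resulting list ends with an empty string. The sketch below uses a made-up in-memory string instead of the real data files, so the sample text is only an assumption for illustration.

```python
# Hypothetical two-line sample standing in for the contents of a data file.
sample = "first line\nsecond line\n"

# split('\n') keeps an empty string after the final newline...
rows_split = sample.split('\n')

# ...while splitlines() does not produce that trailing empty element.
rows_splitlines = sample.splitlines()

print(rows_split)
print(rows_splitlines)
```

A trailing empty element of this kind is a likely reason for the empty last row that is skipped when the conversations are processed further down.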

In [3]:
conversations = open('movie_conversations.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')

Let's have a look at the lines dataset

In [4]:
lines

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',
 'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?',
 'L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".',
 'L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?',
 "L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured yo

Explaining with an example : 

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

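L1045 is the unique ID of the line

u0 is the ID of the user (character) speaking, here user 0

m0 is the ID of the movie, here movie 0

BIANCA is the name of the character

They do not! is the text of the line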
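
The same breakdown can be sketched in code. The row is copied from the output above; the field names (line_id, user_id and so on) are labels chosen here for illustration and are not defined by the dataset.

```python
# Sample row copied from the lines dataset shown above.
row = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"

# Illustrative labels for the five fields (assumed names, not part of the data).
field_names = ["line_id", "user_id", "movie_id", "character", "text"]

# Split on the separator and pair every value with its label.
fields = dict(zip(field_names, row.split(" +++$+++ ")))

print(fields["line_id"])
print(fields["text"])
```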

-------------------------------------------------------------------------------

Let's have a look at the conversations dataset

In [5]:
conversations

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L367', 'L368']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L401', 'L402', 'L403']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L404', 'L405', 'L406', 'L407']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L575', 'L576']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L577', 'L578']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L662', 'L663']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L693', 'L69

Explaining with an example : 

Each row corresponds to one conversation between any two characters.

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

Here the conversation is between user 0 and user 2 in movie 0

The line IDs ['L194', 'L195', 'L196', 'L197'] let us look up, in the lines dataset, the actual lines that make up this conversation between the two users
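
As a sketch of how one such row can be taken apart: the last field is written like a Python list literal, so ast.literal_eval from the standard library can turn it into a real list. This is an alternative to the string slicing used in this notebook, not the approach the notebook itself takes, and the variable names are assumptions.

```python
import ast

# Sample row copied from the conversations dataset shown above.
row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# Four fields: the two users, the movie, and the list of line IDs as text.
first_user, second_user, movie, raw_ids = row.split(" +++$+++ ")

# literal_eval safely evaluates the list literal into a list of strings.
line_ids = ast.literal_eval(raw_ids)

print(first_user, second_user, movie)
print(line_ids)
```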

# Creating a dictionary that maps each line and its id

We want to create a dataset composed of inputs and outputs, and the easiest way to do that is to maintain a dictionary.

We want to map each line to its ID.

So the key of the dictionary will be the ID of the line and the value will be the line itself.

1. declare an empty dictionary

2. iterate through all the lines of the "lines" dataset

    a) split each line with respect to " +++$+++ "

    b) take the first element as the key and the last element as the value

    c) use an if statement to make sure the line has exactly 5 elements after splitting, otherwise the fields would be shifted

In [6]:
id2line = {}
#iterate through all lines in the lines dataset
for line in lines:
    _line = line.split(' +++$+++ ')
    if len(_line) == 5:
        id2line[_line[0]] = _line[4]

Let's have a look at the dictionary we created in the above step

In [7]:
id2line

{'L1045': 'They do not!',
 'L1044': 'They do to!',
 'L985': 'I hope so.',
 'L984': 'She okay?',
 'L925': "Let's go.",
 'L924': 'Wow',
 'L872': "Okay -- you're gonna need to learn how to lie.",
 'L871': 'No',
 'L870': 'I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869': 'Like my fear of wearing pastels?',
 'L868': 'The "real you".',
 'L867': 'What good stuff?',
 'L866': "I figured you'd get to the good stuff eventually.",
 'L865': 'Thank God!  If I had to hear one more story about your coiffure...',
 'L864': "Me.  This endless ...blonde babble. I'm like, boring myself.",
 'L863': 'What crap?',
 'L862': 'do you listen to this crap?',
 'L861': 'No...',
 'L860': 'Then Guillermo says, "If you go any lighter, you\'re gonna look like an extra on 90210."',
 'L699': 'You always been this selfish?',
 'L698': 'But',
 'L697': "Then that's all you had to say.",
 'L696': 'Well, no...',
 'L695': "You never wanted to go out with 'me, did y

# Creating a list of all conversations with the IDs

We want to create a list of conversations with line IDs because we need to keep track of conversations for the training data.

1. declare an empty list

2. the last row in the conversations dataset is an empty row, so the little trick to take care of that is to skip the last row using list slicing in Python: [:-1]

   now, for each conversation in the conversations dataset

      a) split the conversation with respect to " +++$+++ "

      We want only the last part of each conversation, hence the index [-1]. This last part consists of the list of line IDs.

      We do not want the opening and closing square brackets, hence the string slicing [1:-1]

      We also want to get rid of all quotes (') and all spaces (" "), hence replace("'", "") and replace(" ", "")

      b) split the result with respect to the comma (",") and append the resulting list of line IDs to the list


In [8]:
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(' +++$+++ ')[-1][1:-1].replace("'", "").replace(" ", "")
    conversations_ids.append(_conversation.split(','))  

Let's have a look at our conversations_ids list we created above

In [9]:
conversations_ids

[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203'],
 ['L204', 'L205', 'L206'],
 ['L207', 'L208'],
 ['L271', 'L272', 'L273', 'L274', 'L275'],
 ['L276', 'L277'],
 ['L280', 'L281'],
 ['L363', 'L364'],
 ['L365', 'L366'],
 ['L367', 'L368'],
 ['L401', 'L402', 'L403'],
 ['L404', 'L405', 'L406', 'L407'],
 ['L575', 'L576'],
 ['L577', 'L578'],
 ['L662', 'L663'],
 ['L693', 'L694', 'L695'],
 ['L696', 'L697', 'L698', 'L699'],
 ['L860', 'L861'],
 ['L862', 'L863', 'L864', 'L865'],
 ['L866', 'L867', 'L868', 'L869'],
 ['L870', 'L871', 'L872'],
 ['L924', 'L925'],
 ['L984', 'L985'],
 ['L1044', 'L1045'],
 ['L49', 'L50', 'L51'],
 ['L571', 'L572', 'L573'],
 ['L579', 'L580'],
 ['L595', 'L596', 'L597'],
 ['L598', 'L599', 'L600'],
 ['L659', 'L660'],
 ['L952', 'L953'],
 ['L394', 'L395'],
 ['L396', 'L397'],
 ['L589', 'L590', 'L591'],
 ['L592', 'L593'],
 ['L756', 'L757', 'L758'],
 ['L759', 'L760'],
 ['L164', 'L165'],
 ['L319', 'L320'],
 ['L441', 'L442', 'L443', 'L444', 'L445']

# Getting the questions and answers separately

"questions" will be the input and "answers" will be the target outcome for the neural network to learn.

From conversations_ids we will get the line IDs of each conversation, and using the dictionary we created above we will get the corresponding text of each line.

1. declare two empty lists - one for questions and the other for answers

2. iterate through conversations_ids; for each conversation, separate out the questions and the answers and append them to the respective lists

Note: For the answer we use the index right after that of the question, i.e. index i+1, because the answer to a question is the line that immediately follows it in the conversation

In [10]:
questions = []
answers = []
for conversation in conversations_ids:
    for i in range(len(conversation) - 1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])
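
To see what this pairing produces, here is a small sketch on invented data (the IDs and texts are made up and are not taken from the dataset). A conversation with n lines yields n-1 pairs, and every line except the first and the last appears once as an answer and once as a question.

```python
# Toy data: the IDs and texts below are invented for illustration.
toy_id2line = {"L1": "Hi.", "L2": "Hello.", "L3": "How are you?"}
toy_conversation = ["L1", "L2", "L3"]

pairs = []
for i in range(len(toy_conversation) - 1):
    question = toy_id2line[toy_conversation[i]]    # line i is the question
    answer = toy_id2line[toy_conversation[i + 1]]  # line i+1 is its answer
    pairs.append((question, answer))

print(pairs)
```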

Let's have a look at the questions list we created above

In [11]:
questions

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "You're asking me out.  That's so cute. What's your name again?",
 "No, no, it's my fault -- we didn't have a proper introduction ---",
 'Cameron.',
 "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.",
 'Why?',
 'Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.',
 'Gosh, if only we could find Kat a boyfriend...',
 "C'esc ma tete. This is my head",
 "Right.  See?  You're ready for the quiz.",
 "I don't want to know how to say that though.  I want to know useful things. Like where the good stores are.  How much does champagne cost?  Stuff like Chat.  I have n

Let's have a look at the answers list we created above

In [12]:
answers

["Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?",
 'Forget it.',
 'Cameron.',
 "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.",
 'Seems like she could get a date easy enough...',
 'Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.',
 "That's a shame.",
 'Let me see what I can do.',
 "Right.  See?  You're ready for the quiz.",
 "I don't want to know how to say that though.  I want to know useful things. Like where the good stores are.  How much does champagne cost?  Stuff like Chat.  I have never in my life had to point out my head to someone.",
 "That's because it's such a nice one.",
 'Forget French.',
 "Well, there's someone I think might be --",
 'Where?',
 "I 

# Cleaning of the texts

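1. put everything in lowercase

2. expand the common contractions (for example "i'm" becomes "i am") and remove punctuation characters; apostrophes in other words such as "it's" or "didn't" are left as they are

3. remove the non-essential (infrequent) words

We create a function that carries out the cleaning

Steps 1 and 2 are carried out in the function below; step 3 is handled later with a frequency threshold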

In [13]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"[-()\"#/@;:<>{}+=~|.?,]", "", text)
    return text
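
The sketch below applies the same substitutions to a made-up sentence to show what the cleaning does and does not change. It rewrites the rules as a table purely for compactness; the name clean_text_sketch and the sample sentence are assumptions, and this is not meant to replace the function above.

```python
import re

# Same substitutions as in clean_text above, written as (pattern, replacement) pairs.
CONTRACTIONS = [
    (r"i'm", "i am"), (r"he's", "he is"), (r"she's", "she is"),
    (r"that's", "that is"), (r"what's", "what is"), (r"where's", "where is"),
    (r"'ll", " will"), (r"'ve", " have"), (r"'re", " are"), (r"'d", " would"),
    (r"won't", "will not"), (r"can't", "cannot"),
]
# Same character class as above: note that "!" and "'" are not in it.
PUNCTUATION = r"[-()\"#/@;:<>{}+=~|.?,]"

def clean_text_sketch(text):
    """Lowercase, expand the listed contractions, then strip the listed punctuation."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return re.sub(PUNCTUATION, "", text)

cleaned = clean_text_sketch("I'm sure. It's fine, isn't it? They do not!")
print(cleaned)
```

In the result, "i'm" is expanded and the period, comma and question mark are gone, while "it's", "isn't" and the exclamation mark survive, which matches the cleaned outputs shown further down.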

# Applying the above function for cleaning the questions


1. declare an empty list to store the questions after cleaning

2. iterate through the questions, clean each question one by one, and append it to the clean_questions list

In [14]:
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))

# Applying the clean_text function for cleaning the answers

1. declare an empty list to store the answers after cleaning

2. iterate through the answers, clean each answer one by one, and append it to the clean_answers list

In [15]:
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))

Let's have a look at the clean_questions list we created above 

In [16]:
clean_questions

['can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again',
 'well i thought we would start with pronunciation if that is okay with you',
 'not the hacking and gagging and spitting part  please',
 'you are asking me out  that is so cute what is your name again',
 "no no it's my fault  we didn't have a proper introduction ",
 'cameron',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does',
 'why',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something',
 'gosh if only we could find kat a boyfriend',
 "c'esc ma tete this is my head",
 'right  see  you are ready for the quiz',
 "i don't want to know how to say that though  i want to know useful things like where the good stores are  how much does champagne cost  stuff like chat  i have never in my life had to point out

Let's have a look at the clean_answers list we created above

In [17]:
clean_answers

['well i thought we would start with pronunciation if that is okay with you',
 'not the hacking and gagging and spitting part  please',
 "okay then how 'bout we try out some french cuisine  saturday  night",
 'forget it',
 'cameron',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does',
 'seems like she could get a date easy enough',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something',
 'that is a shame',
 'let me see what i can do',
 'right  see  you are ready for the quiz',
 "i don't want to know how to say that though  i want to know useful things like where the good stores are  how much does champagne cost  stuff like chat  i have never in my life had to point out my head to someone",
 "that is because it's such a nice one",
 'forget french',
 "well there's someone i think might be ",
 'where',
 "i counted on you to help my cause 

# Removing the not so frequent words from our corpus
We are doing this because we want to optimize the training, and for that we need only the essential words of the vocabulary.

For this we will create a dictionary mapping each word to its number of occurrences in the corpus of movie dialogues.

1. declare an empty dictionary

2. iterate through clean_questions and, for each question, count every word present

3. similarly, iterate through clean_answers and, for each answer, count every word present

In [18]:
word2count = {}
#going through the clean_questions
for question in clean_questions:
    for word in question.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
#going through the clean_answers
for answer in clean_answers:
    for word in answer.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
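
The same counting can also be sketched with collections.Counter from the standard library, which handles the "first time seen" case on its own. The toy sentences below are invented stand-ins for clean_questions and clean_answers.

```python
from collections import Counter

# Toy stand-ins for clean_questions and clean_answers.
toy_questions = ["how are you", "are you sure"]
toy_answers = ["i am fine", "yes i am"]

word_counts = Counter()
for sentence in toy_questions + toy_answers:
    # update() adds 1 for every word in the list it receives
    word_counts.update(sentence.split())

print(word_counts["you"])
print(word_counts["fine"])
```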

# Tokenization and filtering of the non-frequent words

1. We will assign a unique integer to each of the words present in the clean_questions and clean_answers list 

2. for each word, we will look up its number of occurrences (with the help of the word2count dictionary we created above) and filter out the words that do not reach the threshold, i.e. we remove the infrequent words

3. to each word that reaches the threshold, we will assign a unique integer value

In [19]:
threshold_questions = 20
threshold_answers = 20

The threshold is a hyperparameter, so please feel free to experiment with this value while training your model. But make sure not to set it too low: a lower threshold keeps more rare words, and the larger vocabulary makes it harder for the model to learn

In [20]:
# carrying out the above mentioned steps for questions
questionswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_questions:
        questionswords2int[word] = word_number
        word_number += 1
        
# carrying out the above mentioned steps for answers
answerswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_answers:
        answerswords2int[word] = word_number
        word_number += 1

You might be wondering why we created two dictionaries with exactly the same definition. Here's why:

Keeping two separate dictionaries lets us use different thresholds to filter out the infrequent words for the questions and for the answers. If we do that, the two dictionaries will be different.
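
A minimal sketch of that situation, on invented counts: the helper build_word2int is a hypothetical name introduced here, and the thresholds 20 and 10 are arbitrary example values.

```python
def build_word2int(word_counts, threshold):
    """Assign consecutive integers to the words whose count reaches the threshold."""
    word2int = {}
    for word, count in word_counts.items():
        if count >= threshold:
            word2int[word] = len(word2int)
    return word2int

# Invented counts for illustration only.
toy_counts = {"you": 30, "are": 25, "pastels": 2, "quiz": 12}

toy_questions_vocab = build_word2int(toy_counts, threshold=20)
toy_answers_vocab = build_word2int(toy_counts, threshold=10)

print(toy_questions_vocab)
print(toy_answers_vocab)
```

With the lower threshold the word "quiz" is kept, so the two vocabularies no longer match.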


Let's have a look at the questionswords2int dictionary we created above

In [21]:
questionswords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'are': 7,
 'having': 8,
 'an': 9,
 'incredibly': 10,
 'public': 11,
 'break': 12,
 'up': 13,
 'on': 14,
 'the': 15,
 'again': 16,
 'well': 17,
 'i': 18,
 'thought': 19,
 'would': 20,
 'start': 21,
 'with': 22,
 'if': 23,
 'that': 24,
 'is': 25,
 'okay': 26,
 'you': 27,
 'not': 28,
 'part': 29,
 'please': 30,
 'asking': 31,
 'me': 32,
 'out': 33,
 'so': 34,
 'cute': 35,
 'what': 36,
 'your': 37,
 'name': 38,
 'no': 39,
 "it's": 40,
 'my': 41,
 'fault': 42,
 "didn't": 43,
 'have': 44,
 'a': 45,
 'proper': 46,
 'cameron': 47,
 'thing': 48,
 'am': 49,
 'at': 50,
 'mercy': 51,
 'of': 52,
 'particularly': 53,
 'breed': 54,
 'loser': 55,
 'sister': 56,
 'cannot': 57,
 'date': 58,
 'until': 59,
 'she': 60,
 'does': 61,
 'why': 62,
 'mystery': 63,
 'used': 64,
 'to': 65,
 'be': 66,
 'really': 67,
 'popular': 68,
 'when': 69,
 'started': 70,
 'high': 71,
 'school': 72,
 'then': 73,
 'it': 74,
 'was': 75,
 'just':

Let's have a look at the answerswords2int dictionary we created above

In [22]:
answerswords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'are': 7,
 'having': 8,
 'an': 9,
 'incredibly': 10,
 'public': 11,
 'break': 12,
 'up': 13,
 'on': 14,
 'the': 15,
 'again': 16,
 'well': 17,
 'i': 18,
 'thought': 19,
 'would': 20,
 'start': 21,
 'with': 22,
 'if': 23,
 'that': 24,
 'is': 25,
 'okay': 26,
 'you': 27,
 'not': 28,
 'part': 29,
 'please': 30,
 'asking': 31,
 'me': 32,
 'out': 33,
 'so': 34,
 'cute': 35,
 'what': 36,
 'your': 37,
 'name': 38,
 'no': 39,
 "it's": 40,
 'my': 41,
 'fault': 42,
 "didn't": 43,
 'have': 44,
 'a': 45,
 'proper': 46,
 'cameron': 47,
 'thing': 48,
 'am': 49,
 'at': 50,
 'mercy': 51,
 'of': 52,
 'particularly': 53,
 'breed': 54,
 'loser': 55,
 'sister': 56,
 'cannot': 57,
 'date': 58,
 'until': 59,
 'she': 60,
 'does': 61,
 'why': 62,
 'mystery': 63,
 'used': 64,
 'to': 65,
 'be': 66,
 'really': 67,
 'popular': 68,
 'when': 69,
 'started': 70,
 'high': 71,
 'school': 72,
 'then': 73,
 'it': 74,
 'was': 75,
 'just':

# Adding the last tokens to the above two dictionaries

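PAD = Padding, used to bring sequences to the same length

EOS = End of sentence

OUT = Stands in for the words that were filtered out

SOS = Start of sentence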

In [23]:
tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']

Here we need to assign a unique integer to each token in the questionswords2int dictionary. Since the words were numbered one at a time starting from 0, we just take the current length of the dictionary and add 1 to it

In [24]:
for token in tokens:
    questionswords2int[token] = len(questionswords2int) + 1

We do the same for the answerswords2int dictionary: each token gets the current length of the dictionary plus 1

In [25]:
for token in tokens:
    answerswords2int[token] = len(answerswords2int) + 1
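
To make the resulting numbering concrete, here is the same loop on a three-word toy vocabulary (the words are invented). Because the words already occupy 0 to len-1, using len + 1 leaves one index unused; the token IDs are still unique. Whether that gap matters depends on how the vocabulary size is used later, which this section does not show.

```python
# Toy vocabulary with IDs 0..2, standing in for questionswords2int.
toy_vocab = {"hello": 0, "there": 1, "friend": 2}

tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']
for token in tokens:
    # Same rule as above: current size of the dictionary plus 1.
    toy_vocab[token] = len(toy_vocab) + 1

print(toy_vocab)
```

Here the tokens receive 4, 5, 6 and 7, and the ID 3 is never assigned.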

# Creating an inverse dictionary of the answerswords2int dictionary

We are doing so because we need the inverse mapping from integers to answers words in the implementation of the Seq2Seq architectural model. 
Also, we need this only for the answerswords dictionary and not for questionswords.

1. declare a new dictionary in python

2. use a dictionary comprehension, a one-line trick in Python, to interchange the key-value pairs and store them in the new dictionary

In [26]:
answersints2word = {w_i: w for w, w_i in answerswords2int.items()}
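
A sketch of what the inverse mapping is for: turning a list of integers back into words. The toy dictionary and the list predicted_ids are invented for illustration; in the real model the integers would come from the decoder.

```python
# Toy forward mapping, standing in for answerswords2int.
toy_word2int = {"forget": 0, "it": 1, "<EOS>": 3}

# Same one-line inversion as above.
toy_int2word = {index: word for word, index in toy_word2int.items()}

# Hypothetical sequence of integers to translate back into text.
predicted_ids = [0, 1, 3]
decoded = " ".join(toy_int2word[i] for i in predicted_ids)

print(decoded)
```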

Let's have a look at the newly created answersints2word dictionary (inverse mapped dictionary)

In [27]:
answersints2word

{0: 'can',
 1: 'we',
 2: 'make',
 3: 'this',
 4: 'quick',
 5: 'and',
 6: 'andrew',
 7: 'are',
 8: 'having',
 9: 'an',
 10: 'incredibly',
 11: 'public',
 12: 'break',
 13: 'up',
 14: 'on',
 15: 'the',
 16: 'again',
 17: 'well',
 18: 'i',
 19: 'thought',
 20: 'would',
 21: 'start',
 22: 'with',
 23: 'if',
 24: 'that',
 25: 'is',
 26: 'okay',
 27: 'you',
 28: 'not',
 29: 'part',
 30: 'please',
 31: 'asking',
 32: 'me',
 33: 'out',
 34: 'so',
 35: 'cute',
 36: 'what',
 37: 'your',
 38: 'name',
 39: 'no',
 40: "it's",
 41: 'my',
 42: 'fault',
 43: "didn't",
 44: 'have',
 45: 'a',
 46: 'proper',
 47: 'cameron',
 48: 'thing',
 49: 'am',
 50: 'at',
 51: 'mercy',
 52: 'of',
 53: 'particularly',
 54: 'breed',
 55: 'loser',
 56: 'sister',
 57: 'cannot',
 58: 'date',
 59: 'until',
 60: 'she',
 61: 'does',
 62: 'why',
 63: 'mystery',
 64: 'used',
 65: 'to',
 66: 'be',
 67: 'really',
 68: 'popular',
 69: 'when',
 70: 'started',
 71: 'high',
 72: 'school',
 73: 'then',
 74: 'it',
 75: 'was',
 76: 'ju

# Adding the End Of Sentence token to the end of every answer

We now need to add the EOS token to the end of every answer. It is very important for the decoding part of the Seq2Seq implementation.
The end of the answer is specified by the EOS token.

1. loop through all the answers in the clean_answers list and append the EOS token to each of them, one by one.

Note: Make sure to separate the last word of each answer and the EOS token with a space

In [28]:
for i in range(len(clean_answers)):
    clean_answers[i] += ' <EOS>'

Let's have a look at the modified clean_answers list after appending the EOS token

In [29]:
clean_answers

['well i thought we would start with pronunciation if that is okay with you <EOS>',
 'not the hacking and gagging and spitting part  please <EOS>',
 "okay then how 'bout we try out some french cuisine  saturday  night <EOS>",
 'forget it <EOS>',
 'cameron <EOS>',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does <EOS>',
 'seems like she could get a date easy enough <EOS>',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something <EOS>',
 'that is a shame <EOS>',
 'let me see what i can do <EOS>',
 'right  see  you are ready for the quiz <EOS>',
 "i don't want to know how to say that though  i want to know useful things like where the good stores are  how much does champagne cost  stuff like chat  i have never in my life had to point out my head to someone <EOS>",
 "that is because it's such a nice one <EOS>",
 'forget french <EOS>',
 "wel

# Translating all the questions and the answers into integers and replacing all the words that were filtered out by "< OUT>"

1. We are doing this because we want to sort all the questions and all the answers by their length.

2. The reason we sort them by length is that it optimizes the training performance.


So the step by step approach will be as follows:

declare an empty list (questions_into_int)

1. loop over all the questions in the clean_questions list

2. create an empty list to store the integer values (ints)

3. loop over all the words in a question

    if the word is not present in questionswords2int, append the integer that represents the token <OUT> to the list named ints

    else, append the unique integer associated with the word in the questionswords2int dictionary

4. outside the inner for loop, append the ints list to the questions_into_int list
    
        

In [30]:
questions_into_int = []
for question in clean_questions:
    ints = [] # list of integers, which will be the associated integer to each of the word present in that question
    for word in question.split():
        if word not in questionswords2int:
            ints.append(questionswords2int['<OUT>'])
        else:
            ints.append(questionswords2int[word])
    questions_into_int.append(ints)

Doing the same set of above steps for answers

In [31]:
answers_into_int = []
for answer in clean_answers:
    ints = [] # list of integers, one for each word present in that answer
    for word in answer.split():
        if word not in answerswords2int:
            ints.append(answerswords2int['<OUT>'])
        else:
            ints.append(answerswords2int[word])
    answers_into_int.append(ints)
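
The if/else lookup can also be sketched with dict.get, which returns a default value when the key is missing. The helper name encode and the toy dictionary are assumptions made for this example.

```python
# Toy mapping, standing in for answerswords2int.
toy_word2int = {"that": 0, "is": 1, "a": 2, "<OUT>": 4, "<EOS>": 5}

def encode(sentence, word2int):
    """Map every word to its integer, or to the <OUT> integer if it is unknown."""
    out_id = word2int["<OUT>"]
    return [word2int.get(word, out_id) for word in sentence.split()]

# "shame" is not in the toy mapping, so it becomes the <OUT> integer.
encoded = encode("that is a shame <EOS>", toy_word2int)
print(encoded)
```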

Let's have a look at the questions_into_int list

In [32]:
questions_into_int

[[0,
  1,
  2,
  3,
  4,
  8824,
  8824,
  5,
  6,
  8824,
  7,
  8,
  9,
  10,
  8824,
  11,
  12,
  13,
  14,
  15,
  8824,
  16],
 [17, 18, 19, 1, 20, 21, 22, 8824, 23, 24, 25, 26, 22, 27],
 [28, 15, 8824, 5, 8824, 5, 8824, 29, 30],
 [27, 7, 31, 32, 33, 24, 25, 34, 35, 36, 25, 37, 38, 16],
 [39, 39, 40, 41, 42, 1, 43, 44, 45, 46, 8824],
 [47],
 [15,
  48,
  25,
  47,
  18,
  49,
  50,
  15,
  51,
  52,
  45,
  53,
  8824,
  54,
  52,
  55,
  41,
  56,
  18,
  57,
  58,
  59,
  60,
  61],
 [62],
 [8824,
  63,
  60,
  64,
  65,
  66,
  67,
  68,
  69,
  60,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  60,
  78,
  79,
  52,
  74,
  80,
  81],
 [82, 23, 83, 1, 84, 85, 86, 45, 87],
 [8824, 88, 8824, 3, 25, 41, 89],
 [90, 91, 27, 7, 92, 93, 15, 8824],
 [18,
  94,
  95,
  65,
  96,
  97,
  65,
  98,
  24,
  99,
  18,
  95,
  65,
  96,
  100,
  101,
  77,
  102,
  15,
  103,
  104,
  7,
  97,
  105,
  61,
  106,
  107,
  108,
  77,
  109,
  18,
  44,
  110,
  111,
  41,
  112,
  113,
 

Let's have a look at the answers_into_int list

In [33]:
answers_into_int

[[17, 18, 19, 1, 20, 21, 22, 8824, 23, 24, 25, 26, 22, 27, 8823],
 [28, 15, 8824, 5, 8824, 5, 8824, 29, 30, 8823],
 [26, 73, 97, 1533, 1, 860, 33, 482, 387, 8824, 210, 242, 8823],
 [245, 74, 8823],
 [47, 8823],
 [15,
  48,
  25,
  47,
  18,
  49,
  50,
  15,
  51,
  52,
  45,
  53,
  8824,
  54,
  52,
  55,
  41,
  56,
  18,
  57,
  58,
  59,
  60,
  61,
  8823],
 [399, 77, 60, 84, 129, 45, 58, 865, 289, 8823],
 [8824,
  63,
  60,
  64,
  65,
  66,
  67,
  68,
  69,
  60,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  60,
  78,
  79,
  52,
  74,
  80,
  81,
  8823],
 [24, 25, 45, 1701, 8823],
 [287, 32, 91, 36, 18, 0, 128, 8823],
 [90, 91, 27, 7, 92, 93, 15, 8824, 8823],
 [18,
  94,
  95,
  65,
  96,
  97,
  65,
  98,
  24,
  99,
  18,
  95,
  65,
  96,
  100,
  101,
  77,
  102,
  15,
  103,
  104,
  7,
  97,
  105,
  61,
  106,
  107,
  108,
  77,
  109,
  18,
  44,
  110,
  111,
  41,
  112,
  113,
  65,
  114,
  33,
  41,
  89,
  65,
  115,
  8823],
 [24, 25, 116, 40, 117, 45, 1

# Sorting questions and answers by the length of the questions

We are doing this because it speeds up the training and helps to reduce the loss: when questions of similar length end up next to each other, less padding is needed during training.

1. create two empty lists, one called sorted_clean_questions and the other sorted_clean_answers

   We will place a limit on the length of the questions, because very long questions are too hard for the chatbot to learn from. This limit can also be considered a hyperparameter that can be tuned to get better performance (here let's take the limit to be 25).

2. loop over the possible lengths of the questions (up to the limit)

    for each of the questions, we need two important elements - the index of the question and the question itself

    the trick to get these two elements at the same time is to use the enumerate function


In [34]:
sorted_clean_questions = []
sorted_clean_answers = []
for length in range(1, 25 + 1):
    for i, question in enumerate(questions_into_int):
        if len(question) == length:
            sorted_clean_questions.append(question)
            sorted_clean_answers.append(answers_into_int[i])
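
The nested loop scans all questions once per length. As a sketch of an equivalent alternative, the pairs can be filtered once and then sorted by question length. Python's sort is stable, so questions of equal length keep their original order, just as in the loop above. The toy lists and the name max_length are assumptions for illustration.

```python
# Toy encoded questions and matching answers (invented values).
toy_questions = [[5, 6, 7], [1], [2, 3], [9] * 30, [4]]
toy_answers = [["a3"], ["a1"], ["a2"], ["a_long"], ["a1b"]]
max_length = 25

# Keep only the pairs whose question has between 1 and max_length words.
kept = [(q, a) for q, a in zip(toy_questions, toy_answers)
        if 1 <= len(q) <= max_length]

# Stable sort by question length; ties stay in their original order.
kept.sort(key=lambda pair: len(pair[0]))

sorted_q = [q for q, _ in kept]
sorted_a = [a for _, a in kept]

print(sorted_q)
print(sorted_a)
```

The question with 30 words is dropped, and each answer stays aligned with its question.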

Let's have a look at the sorted_clean_questions

In [35]:
sorted_clean_questions

[[47],
 [62],
 [123],
 [147],
 [135],
 [39],
 [175],
 [39],
 [182],
 [183],
 [221],
 [36],
 [62],
 [135],
 [62],
 [110],
 [375],
 [132],
 [297],
 [211],
 [36],
 [222],
 [182],
 [26],
 [62],
 [182],
 [8824],
 [455],
 [250],
 [182],
 [193],
 [8824],
 [8824],
 [669],
 [97],
 [8824],
 [8824],
 [36],
 [39],
 [36],
 [8824],
 [147],
 [90],
 [36],
 [771],
 [8824],
 [637],
 [39],
 [39],
 [939],
 [8824],
 [1121],
 [39],
 [231],
 [69],
 [211],
 [39],
 [142],
 [1267],
 [211],
 [1113],
 [1113],
 [1113],
 [1113],
 [211],
 [340],
 [149],
 [26],
 [92],
 [669],
 [36],
 [1140],
 [1262],
 [8824],
 [8824],
 [1511],
 [211],
 [1552],
 [36],
 [36],
 [211],
 [1601],
 [1601],
 [1601],
 [1601],
 [1601],
 [1601],
 [26],
 [669],
 [67],
 [669],
 [231],
 [97],
 [1630],
 [1601],
 [1601],
 [1601],
 [8824],
 [1601],
 [1601],
 [67],
 [1671],
 [674],
 [1791],
 [211],
 [38],
 [211],
 [1224],
 [211],
 [1350],
 [211],
 [17],
 [123],
 [36],
 [62],
 [1840],
 [1773],
 [211],
 [1848],
 [211],
 [211],
 [1823],
 [222],
 [1148],


Let's have a look at the sorted_clean_answers

In [36]:
sorted_clean_answers

[[15,
  48,
  25,
  47,
  18,
  49,
  50,
  15,
  51,
  52,
  45,
  53,
  8824,
  54,
  52,
  55,
  41,
  56,
  18,
  57,
  58,
  59,
  60,
  61,
  8823],
 [8824,
  63,
  60,
  64,
  65,
  66,
  67,
  68,
  69,
  60,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  60,
  78,
  79,
  52,
  74,
  80,
  81,
  8823],
 [102, 8823],
 [1529, 77, 101, 1550, 33, 149, 608, 8823],
 [27, 153, 227, 3, 6453, 8823],
 [26, 27, 7, 160, 253, 65, 1280, 97, 65, 613, 8823],
 [1387, 134, 8823],
 [27, 239, 133, 194, 226, 74, 8823],
 [196, 8823],
 [20, 27, 124, 612, 32, 45, 1512, 47, 8823],
 [41,
  2582,
  157,
  18,
  44,
  78,
  45,
  103,
  1112,
  50,
  963,
  15,
  8824,
  144,
  219,
  519,
  8823],
 [111, 8824, 93, 45, 271, 8823],
 [180, 75, 77, 45, 272, 273, 8823],
 [279,
  24,
  18,
  280,
  18,
  20,
  110,
  128,
  281,
  76,
  116,
  278,
  267,
  75,
  264,
  74,
  5,
  18,
  282,
  283,
  284,
  93,
  8824,
  152,
  5,
  41,
  8824,
  8824,
  285,
  8823],
 [116, 60, 244, 4625, 383, 297, 8823],

-------------------------------------------------------------------------------
# PART 2 - BUILDING THE SEQ2SEQ MODEL
-------------------------------------------------------------------------------

# Creating placeholders for the inputs and the targets

In TensorFlow, all variables are used as tensors. A tensor is like an advanced NumPy array that allows very fast computations in deep neural networks.

Every value that we feed into the graph at run time (the inputs, the targets, the learning rate and so on) must first be defined as what we call a TensorFlow placeholder.

A placeholder is a more advanced data structure: it reserves a slot in the graph with a given data type and shape, and the actual tensor is supplied later when the session runs.

-------------------------------------------------------------------------------

We will define a function called model_inputs, and inside this function we will create a placeholder for the inputs and a placeholder for the targets.
Then we will add the learning rate and one more hyperparameter.

In short, we will be creating placeholders to be able to use these variables in future training.

-------------------------------------------------------------------------------

1. we will start by creating a new variable called inputs, which will be the TensorFlow placeholder containing the input

we will need to call the TensorFlow placeholder function

we will pass 3 arguments to this function

the first argument is the type of the data (integers in this case)

the second argument is the dimensions of the matrix of input data. The inputs are the questions encoded as lists of unique integers, so with padding we get a 2-dimensional matrix (represented as [None, None])

the last argument is just the name we are going to give to the input.

------------------------------------------------------------------------------

We are going to do the same set of steps for targets as well

-------------------------------------------------------------------------------
Now we are going to create 2 more TensorFlow placeholders: one to hold the learning rate (a hyperparameter) and the other (keep_prob) to hold the probability of keeping a unit, which controls the dropout rate (regularization).

In [37]:
def model_inputs():
    inputs = tf.compat.v1.placeholder(tf.int32, [None,None], name = 'input')
    targets = tf.placeholder(tf.int32, [None,None], name = 'target')
    lr = tf.placeholder(tf.float32, name = 'learning_rate')
    keep_prob = tf.placeholder(tf.float32,name = 'keep_prob')
    return inputs,targets,lr,keep_prob
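
The [None, None] shape of the inputs placeholder can be pictured with a small sketch in plain Python. The helper pad_batch and the padding id 0 are assumptions made for illustration only; they are not taken from the notebook.

```python
# Hypothetical helper, not part of the notebook: pad encoded questions
# so that a batch becomes a rectangular 2-D matrix.
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three encoded questions of different lengths (token ids are made up).
batch = [[12, 7, 45], [3], [9, 9]]
padded = pad_batch(batch, pad_id=0)

for row in padded:
    print(row)

# Both dimensions depend on the batch: the number of rows is the batch size
# and the number of columns is the longest question, hence [None, None].
print(len(padded), len(padded[0]))
```

Neither dimension is known when the graph is built, which is why both are left as None.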

# Preprocessing the targets

Before we start creating the encoding layer and the decoding layers, we have to prepare a set of targets. This is because the decoder will only accept a certain format of the targets. 

Q) What exactly is this format ?

ans:  The format is two-fold.

   First, the targets need to be in batches. The RNN of the decoder will not accept single targets, i.e., single answers.
      
   Second, each of the answers in the batch must start with the SOS token.
    
-------------------------------------------------------------------------------
So the two things we will do in the function below are to create batches and add the SOS token. 

How do we add this SOS token?
Since padding keeps all the answers in a batch the same size, we take the answers of the batch and remove their last column, so that the size is unchanged once the SOS column is added.

We then concatenate a column of SOS tokens in front of the remaining columns, so that every target in the batch starts with SOS.

-------------------------------------------------------------------------------

1.create the left side of the concatenation, which is a matrix of batch_size lines and one column filled with the SOS token. 

2.then make the right side of the concatenation, which is all the answers in the batch except their last token identifier.

In [38]:
def preprocess_targets(targets, word2int, batch_size):
    left_side = tf.fill([batch_size, 1], word2int['<SOS>'])
    right_side = tf.strided_slice(targets, [0,0], [batch_size,-1], [1,1])
    preprocessed_targets = tf.concat([left_side, right_side], 1)
    return preprocessed_targets
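
The same "drop the last column, prepend SOS" idea can be checked on a tiny batch with NumPy. This is only an analogue of the function above, not the TensorFlow code itself; preprocess_targets_np and the ids used for SOS, EOS and PAD are made up for the example.

```python
import numpy as np

def preprocess_targets_np(targets, sos_id):
    """NumPy analogue of preprocess_targets: drop the last column, prepend SOS."""
    targets = np.asarray(targets)
    batch_size = targets.shape[0]
    left_side = np.full((batch_size, 1), sos_id, dtype=targets.dtype)  # column of SOS
    right_side = targets[:, :-1]                                        # all but the last column
    return np.concatenate([left_side, right_side], axis=1)

# Hypothetical ids for this example only: SOS = 1, EOS = 2, PAD = 0.
batch = [[5, 6, 2, 0],
         [7, 2, 0, 0]]
result = preprocess_targets_np(batch, sos_id=1)
print(result)
```

The output keeps the same shape as the input batch, which is the reason for removing one column before adding the SOS column.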

-------------------------------------------------------------------------------

# Creating the Encoder RNN Layer

The arguments of this function include: 

1. rnn_inputs, which corresponds to the model inputs

2. rnn_size, the number of units in each LSTM cell of the encoder

3. num_layers, the number of layers in the RNN

4. keep_prob, for dropout regularization (to improve accuracy)

5. sequence_length, which is the list of the lengths of each question in the batch.
-------------------------------------------------------------------------------

1. In TensorFlow we have an amazing class that will help us create an LSTM

assign variable lstm to tf.contrib(module).rnn(submodule).BasicLSTMCell(rnn_size) 

2. assign variable lstm_dropout to tf.contrib(module).rnn(submodule).DropoutWrapper(class)(lstm, input_keep_prob = keep_prob)

3. we are now ready to create the encoder cell

assign variable encoder_cell to tf.contrib(module).rnn(submodule).MultiRNNCell(a list that repeats lstm_dropout num_layers times, num_layers being an argument of our function)

4. now to get the encoder_state, we will get it from the bidirectional_dynamic_rnn function of TensorFlow's nn module

the above step creates a dynamic version of a bidirectional RNN (this will help make our chatbot more powerful)

the dynamic version of the bidirectional RNN takes the input and builds independent forward and backward RNNs. 

NOTE: With dynamic bidirectional RNNs we need to make sure that the input sizes of the forward cell and the backward cell match.


In [39]:
def encoder_rnn(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_length):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
    encoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
    _, encoder_state = tf.nn.bidirectional_dynamic_rnn(cell_fw = encoder_cell, 
                                                       cell_bw = encoder_cell, 
                                                       sequence_length = sequence_length, 
                                                       inputs = rnn_inputs,
                                                       dtype = tf.float32)
    return encoder_state
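
To get a feel for what "independent forward and backward RNNs" means, here is a toy sketch in plain Python. It uses a one-unit tanh cell with hypothetical fixed weights instead of an LSTM, so it only illustrates the reading direction and the role of sequence_length, not the real encoder.

```python
import math

def run_rnn(inputs, w_state=0.5, w_input=1.0):
    """Toy one-unit recurrent cell: state = tanh(w_state * state + w_input * x)."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_state * state + w_input * x)
    return state

def bidirectional_final_states(inputs, length):
    """Run the toy cell forward and backward over the first `length` real tokens only."""
    real_tokens = inputs[:length]                  # ignore padding beyond the sequence length
    forward_state = run_rnn(real_tokens)           # reads left to right
    backward_state = run_rnn(real_tokens[::-1])    # reads right to left
    return forward_state, backward_state

padded_question = [0.2, -0.4, 0.9, 0.0, 0.0]       # last two entries are padding
fw, bw = bidirectional_final_states(padded_question, length=3)
print(fw, bw)
```

The two final states differ because each direction sees the tokens in a different order, which gives the encoder two views of the same question.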

-------------------------------------------------------------------------------

# Creating the Decoder of the RNN layer

We will be doing this in three steps:

1. Decode the training set

2. Decode the validation set

3. And eventually, we will be ready to take care of the decoder of the RNN layer

# Step 1. Decoding the training set

The arguments of this function include:

1. encoder_state

2. decoder_cell

3. decoder_embedded_input

4. sequence_length

5. decoding_scope

6. output_function

7. keep_prob

8. batch_size
-------------------------------------------------------------------------------

1. The first thing we need to do is get the attention states

initialize attention_states as a 3-dimensional matrix filled with zeros

Since we are dealing with batches, the number of lines is going to be batch_size

The number of columns is going to be 1

The number of elements on the third axis is going to be decoder_cell.output_size

2. We will get the attention_keys, the attention_values, the attention_score_function and the attention_construct_function using prepare_attention(), a TensorFlow function belonging to the seq2seq submodule

the attention_keys are the keys to be compared with the target states

the attention_values are the values that we will use to construct the context vectors

the attention_score_function is used to compute the similarity between the keys and the target states

the attention_construct_function is the function used to build the attention state

3. The next step is to get the training_decoder_function that will do the decoding of the training set. 

training_decoder_function is obtained from another tensorflow function present in the seq2seq submodule called attention_decoder_fn_train()


4. the next step is to get the decoder_output, the decoder_final_state and the decoder_final_context_state (but we need only the decoder_output)

these are obtained from the dynamic_rnn_decoder function present in the seq2seq submodule of the TensorFlow library

5. the final step is to apply dropout to our decoder_output
decoder_output_dropout = tf.nn(module).dropout(decoder_output, keep_prob)

6. return the output_function(decoder_output_dropout)


In [40]:
def decode_training_set(encoder_state, decoder_cell, decoder_embedded_input, sequence_length, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    training_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_train(encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              name = "attn_dec_train")
    decoder_output, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                              training_decoder_function,
                                                                                                              decoder_embedded_input,
                                                                                                              sequence_length,
                                                                                                              scope = decoding_scope)
    decoder_output_dropout = tf.nn.dropout(decoder_output, keep_prob)
    return output_function(decoder_output_dropout)
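
The roles of the keys, the values and the score function can be sketched with NumPy. This is a generic Bahdanau-style (additive) attention written under assumptions, not the internals of prepare_attention: the function names, the weight matrices and all the sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def additive_attention(query, keys, values, w_query, w_keys, v):
    """Bahdanau-style scoring: score_i = v . tanh(W_k k_i + W_q q)."""
    scores = np.tanh(keys @ w_keys + query @ w_query) @ v   # one score per key
    weights = softmax(scores)                               # attention distribution
    context = weights @ values                              # weighted sum of the values
    return weights, context

rng = np.random.default_rng(0)
num_keys, hidden, num_units = 4, 3, 5
keys = rng.normal(size=(num_keys, hidden))      # stand-in for attention_keys
values = rng.normal(size=(num_keys, hidden))    # stand-in for attention_values
query = rng.normal(size=hidden)                 # stand-in for the current decoder state
w_keys = rng.normal(size=(hidden, num_units))   # hypothetical learned weights
w_query = rng.normal(size=(hidden, num_units))
v = rng.normal(size=num_units)

weights, context = additive_attention(query, keys, values, w_query, w_keys, v)
print(weights.sum(), context.shape)
```

The score function compares the decoder state with every key, and the resulting weights decide how much of each value goes into the context vector.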

# Step 2. Decoding the test/validation set


Here we are going to make a function very similar to the one above, but for the observations of the test set and the validation set.

These are new observations that will not be used in the training.

In this function, we will not be using TensorFlow's attention_decoder_fn_train function; instead we will be using attention_decoder_fn_inference()

This function takes 4 new arguments in addition to those of the function above:

1.sos_id

2.eos_id

3.maximum_length

4.num_words



In [41]:
def decode_test_set(encoder_state, decoder_cell, decoder_embeddings_matrix, sos_id, eos_id, maximum_length, num_words, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    test_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_inference(output_function,
                                                                              encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              decoder_embeddings_matrix,
                                                                              sos_id,
                                                                              eos_id,
                                                                              maximum_length,
                                                                              num_words,
                                                                              name = "attn_dec_inf")
    test_predictions, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                                test_decoder_function,
                                                                                                                scope = decoding_scope)
    return test_predictions
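
The reason inference needs sos_id, eos_id and maximum_length can be seen in a minimal greedy decoding loop. This is a sketch under assumptions: the decoder is replaced by a made-up lookup table, and greedy_decode is a hypothetical helper, not a TensorFlow function.

```python
def greedy_decode(next_token, sos_id, eos_id, maximum_length):
    """Feed each predicted token back in until EOS or maximum_length is reached."""
    tokens = []
    current = sos_id
    for _ in range(maximum_length):
        current = next_token(current)
        if current == eos_id:
            break
        tokens.append(current)
    return tokens

# Hypothetical stand-in for the decoder: a fixed table mapping the previous
# token id to the most likely next one (ids are made up).
SOS, EOS = 1, 2
transitions = {SOS: 10, 10: 11, 11: 12, 12: EOS}

full_answer = greedy_decode(transitions.get, SOS, EOS, maximum_length=25)
cut_answer = greedy_decode(transitions.get, SOS, EOS, maximum_length=2)
print(full_answer, cut_answer)
```

At test time there are no target answers to feed in, so decoding starts from SOS and stops either at EOS or when the maximum length is hit.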

# Step 3. Creating the Decoder RNN

This function will have the following arguments:

1.decoder_embedded_input

2.decoder_embeddings_matrix

3.encoder_state  // output of the encoder becomes input of the decoder

4.num_words //total number of words in our corpus of words

5.sequence_length 

6.rnn_size // number of units in each LSTM cell of our RNN decoder

7.num_layers // number of layers we want in our RNN decoder

8.word2int //the dictionary which we have defined earlier

9.keep_prob // for the dropout(regularization) rate

10.batch_size 

-----------------------------------------------------------------------------

1.introduce the decoding scope from TensorFlow
    with tf.variable_scope("decoding") as decoding_scope
 
2.assign the variable lstm to the TensorFlow class BasicLSTMCell in the rnn submodule and pass rnn_size as the parameter

3.apply dropout regularization to reduce overfitting and improve accuracy

assign the variable lstm_dropout to the TensorFlow class DropoutWrapper present in the rnn submodule. Pass lstm and input_keep_prob to it

4.assign the variable decoder_cell to the TensorFlow class MultiRNNCell present in the rnn submodule. Pass it a list that repeats lstm_dropout num_layers times.

5.we need to initialize the weights that will be associated with the neurons of the fully connected layer inside our decoder.

assign the variable weights to the TensorFlow function truncated_normal_initializer, which will generate a truncated normal distribution of the weights.

Pass the argument stddev (standard deviation) as 0.1 to this function

6.initialize a new variable called biases with zeros, using the TensorFlow function zeros_initializer()

7.the next step is to make the output function 

assign the variable output_function to a lambda that calls the TensorFlow function fully_connected present in the layers module

the arguments to this function will be (x, 
                                        num_words, 
                                        None (no activation function),
                                        scope = decoding_scope,
                                        weights_initializer = weights,
                                        biases_initializer = biases)
                                        

8.the next step is to get our training predictions with the help of the function we defined above, decode_training_set()

training_predictions = decode_training_set()
#refer to the function definition for the arguments and figure them out yourself ;)

9.we have to take our decoding scope and specify that we want to reuse the variables introduced in this decoding scope

10.we now need to get the test_predictions. We will get these with the function we defined called decode_test_set. 
#refer to the function definition for the arguments and figure them out yourself ;)

11.finally return the training_predictions and the test_predictions

In [42]:
def decoder_rnn(decoder_embedded_input, decoder_embeddings_matrix, encoder_state, num_words, sequence_length, rnn_size, num_layers, word2int, keep_prob, batch_size):
    with tf.variable_scope("decoding") as decoding_scope:
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
        decoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
        weights = tf.truncated_normal_initializer(stddev = 0.1)
        biases = tf.zeros_initializer()
        output_function = lambda x: tf.contrib.layers.fully_connected(x,
                                                                      num_words,
                                                                      None,
                                                                      scope = decoding_scope,
                                                                      weights_initializer = weights,
                                                                      biases_initializer = biases)
        training_predictions = decode_training_set(encoder_state,
                                                   decoder_cell,
                                                   decoder_embedded_input,
                                                   sequence_length,
                                                   decoding_scope,
                                                   output_function,
                                                   keep_prob,
                                                   batch_size)
        decoding_scope.reuse_variables()
        test_predictions = decode_test_set(encoder_state,
                                           decoder_cell,
                                           decoder_embeddings_matrix,
                                           word2int['<SOS>'],
                                           word2int['<EOS>'],
                                           sequence_length - 1,
                                           num_words,
                                           decoding_scope,
                                           output_function,
                                           keep_prob,
                                           batch_size)
    return training_predictions, test_predictions
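
What the output function does (steps 5 to 7 above) can be sketched with NumPy: weights drawn from a truncated normal distribution, biases set to zero, and a linear projection from rnn_size to num_words with no activation. The helper truncated_normal and the toy sizes are assumptions for illustration, not the notebook's values.

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    """Draw normal samples and redraw any that fall more than 2 stddev from 0."""
    samples = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(samples) > 2 * stddev
    while out_of_range.any():
        samples[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(samples) > 2 * stddev
    return samples

rng = np.random.default_rng(0)
rnn_size, num_words = 8, 20                             # toy sizes, far smaller than the notebook's
weights = truncated_normal((rnn_size, num_words), 0.1, rng)
biases = np.zeros(num_words)

decoder_output = rng.normal(size=(2, 5, rnn_size))      # (batch, time steps, rnn_size)
logits = decoder_output @ weights + biases              # one score per word, no activation
print(logits.shape)
```

Each time step of each answer ends up with one score per word of the vocabulary, and the highest score gives the predicted word.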

-----------------------------------------------------------------------------

# Building the Seq2Seq Model

This is the final function, which we will build using the functions defined above. This will be the brain of our chatbot. 

This function will take the following arguments:

1.inputs, which are the questions of the Cornell Movie-Dialogs Corpus

2.targets, which are the answers to our questions

3.keep_prob

4.batch_size

5.sequence_length

6.answers_num_words

7.questions_num_words

8.encoder_embedding_size, which is the number of dimensions of the embedding matrix for the encoder

9.decoder_embedding_size, which is the number of dimensions of the embedding matrix for the decoder

10.rnn_size

11.num_layers

12.questionswords2int, dictionary which we defined previously to preprocess the targets

-----------------------------------------------------------------------------

1.Before we get the encoder_state we need the encoder_embedded_input, so the first thing we have to do is introduce encoder_embedded_input and assign it to the TensorFlow function embed_sequence present in the layers submodule.
encoder_embedded_input = tf.contrib(module).layers(submodule).embed_sequence(inputs, #inputs is the argument we want to embed
                                                                             answers_num_words + 1, #total number of answer words
                                                                             encoder_embedding_size, #number of dimensions in the embedding matrix of the encoder
                                                                             initializer = tf.random_uniform_initializer(0, 1))
                                                                     
2.encoder_state is the output of the encoder and will be the input of the decoder. We will get this from the RNN of our encoder. We will feed the RNN with the encoder_embedded input and will return the encoder_state.

encoder_state = encoder_rnn(), function which we have defined previously
#kindly refer the function definition to figure out the arguments

3.we now need to get the preprocessed_targets, because we will need them for training

preprocessed_targets = preprocess_targets(), function which we have defined previously
#kindly refer the function definition to figure out the arguments

4.decoder_embeddings_matrix, which we will get by creating a TensorFlow variable using the Variable class. 
The Variable class takes an initial value, which here mostly fixes the dimensions of the embeddings matrix.
decoder_embeddings_matrix = tf.Variable(tf.random_uniform([questions_num_words + 1, decoder_embedding_size], #random numbers taken between 0 and 1 from a uniform distribution
                       0,
                       1))
                       
5.the next step naturally is to get the decoder_embedded_input

we will use the tensorflow function called embedding_lookup present in the nn module which will take the decoder_embeddings_matrix as argument as well as preprocessed_targets

6.now we need to assign the training_predictions and the test_predictions to the output of decoder_rnn(), the function which we defined earlier 
#kindly refer to the function definition to develop an intuition for the arguments

7.finally return the two variables training_predictions and the test_predictions
                      


In [43]:
def seq2seq_model(inputs, targets, keep_prob, batch_size, sequence_length, answers_num_words, questions_num_words, encoder_embedding_size, decoder_embedding_size, rnn_size, num_layers, questionswords2int):
    encoder_embedded_input = tf.contrib.layers.embed_sequence(inputs,
                                                              answers_num_words + 1,
                                                              encoder_embedding_size,
                                                              initializer = tf.random_uniform_initializer(0, 1))
    encoder_state = encoder_rnn(encoder_embedded_input, rnn_size, num_layers, keep_prob, sequence_length)
    preprocessed_targets = preprocess_targets(targets, questionswords2int, batch_size)
    decoder_embeddings_matrix = tf.Variable(tf.random_uniform([questions_num_words + 1, decoder_embedding_size], 0, 1))
    decoder_embedded_input = tf.nn.embedding_lookup(decoder_embeddings_matrix, preprocessed_targets)
    training_predictions, test_predictions = decoder_rnn(decoder_embedded_input,
                                                         decoder_embeddings_matrix,
                                                         encoder_state,
                                                         questions_num_words,
                                                         sequence_length,
                                                         rnn_size,
                                                         num_layers,
                                                         questionswords2int,
                                                         keep_prob,
                                                         batch_size)
    return training_predictions, test_predictions
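
Steps 4 and 5 above (the embeddings matrix and the embedding lookup) boil down to picking rows of a matrix by word id. A minimal NumPy sketch, with toy sizes and made-up token ids that are assumptions for this example only:

```python
import numpy as np

rng = np.random.default_rng(0)
num_words, embedding_size = 10, 4            # toy sizes; the notebook uses 512 columns

# One row per word id, filled with uniform random numbers between 0 and 1.
embeddings_matrix = rng.uniform(0, 1, size=(num_words + 1, embedding_size))

# A batch of 2 preprocessed targets, each 3 token ids long (ids are made up).
preprocessed_targets = np.array([[1, 5, 6],
                                 [1, 7, 2]])

# Indexing with an integer array replaces every id by its row of the matrix.
embedded_input = embeddings_matrix[preprocessed_targets]
print(embedded_input.shape)
```

The batch gains a third axis of size embedding_size, and equal ids (here the leading 1 in both answers) map to the same vector.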

-------------------------------------------------------------------------------

# PART 3 -  TRAINING THE SEQ2SEQ MODEL

-------------------------------------------------------------------------------

# Setting the hyperparameters

One training iteration works as follows: batches of inputs are fed into the neural network and forward-propagated through the encoder to get the encoder states; the encoder states are then forward-propagated, together with the targets, through the deep recurrent neural network to get the final answers/outputs. The loss between the outputs and the targets is then back-propagated into the neural network, and the weights are updated in the direction that improves the chatbot's ability to speak like a human. 

1.epochs, an epoch is basically one full pass of the above steps over all the batches of the training set (take 100; if training is taking too long, adjust it to 50, but not lower).

2.batch_size, we are setting it to be 64(usually a power of 2)

3.rnn_size, we are setting it to be 512

4.num_layers, we are setting it to be 3, can change later if necessary

5.encoding_embedding_size,(number of columns in the embedded matrix), taken as 512

6.similarly, decoding_embedding_size is taken to be as 512

7.learning_rate, we will start with 0.01

8.learning_rate_decay, which controls the percentage by which the learning_rate is reduced over the iterations of the training

9.min_learning_rate, a lower bound that stops the learning_rate from decaying too far

10.keep_probability, dropout regularization hyperparameter to prevent overfitting
Geoffrey Hinton, the master of Deep Learning and Artificial Intelligence, states in his paper on dropout that dropping out 20% of the input units and 50% of the hidden units was often found to be optimal.

In [44]:
epochs = 100

batch_size = 64

rnn_size = 512

num_layers = 3

encoding_embedding_size = 512

decoding_embedding_size = 512

learning_rate = 0.01

learning_rate_decay = 0.9

min_learning_rate = 0.0001

keep_probability = 0.5
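
How learning_rate, learning_rate_decay and min_learning_rate interact can be sketched in plain Python. This assumes the decay is applied as a multiplication once per epoch, which is an assumption of this sketch and is not shown in this part of the notebook; decay_learning_rate is a hypothetical helper.

```python
def decay_learning_rate(learning_rate, learning_rate_decay, min_learning_rate):
    """Shrink the learning rate by the decay factor, never going below the floor."""
    return max(learning_rate * learning_rate_decay, min_learning_rate)

learning_rate = 0.01
learning_rate_decay = 0.9
min_learning_rate = 0.0001

# Assumption: one decay step per epoch, for 100 epochs.
schedule = []
for epoch in range(100):
    schedule.append(learning_rate)
    learning_rate = decay_learning_rate(learning_rate, learning_rate_decay, min_learning_rate)

print(schedule[0], schedule[1], schedule[-1])
```

Under this assumption the rate shrinks by 10% per step and then stays at the lower bound for the rest of the training.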

-------------------------------------------------------------------------------

# Defining a session

We will define a TensorFlow session in which all the tensorflow training will be run.

To open a session in TensorFlow, we are going to create an object of the interactive session class.

Before starting a session, it is necessary to reset the graph, and hence:

1. reset graph using tensorflow library:
    tf.compat.v1.reset_default_graph()
    
2. define a session:
    session = tf.compat.v1.InteractiveSession()

In [45]:
tf.compat.v1.reset_default_graph()
session = tf.compat.v1.InteractiveSession()

-------------------------------------------------------------------------------

# Loading the Model Inputs

We will be using a function which we defined previously in Part 2

In [46]:
inputs, targets, lr, keep_prob = model_inputs()

-------------------------------------------------------------------------------

# Setting the sequence length

We are going to set the sequence length to the maximum length, which is 25 (as we already did at the end of Part 1 - Data Preprocessing).


We are going to use TensorFlow's placeholder_with_default function.

The arguments will be:

1.the default value, which is the maximum length (25)

2.the shape, since the default value is a single number rather than a tensor, pass None

3.the name, sequence_length

In [47]:
sequence_length = tf.placeholder_with_default(25, None, name = 'sequence_length')

-------------------------------------------------------------------------------

# Getting the shape of the input tensor

We need to get the shape of the inputs because it will be one of the arguments of a specific function we will use for training.

That function is TensorFlow's ones function (which creates a tensor of ones).

For this we will use the shape function of TensorFlow.

In [48]:
input_shape = tf.shape(inputs)

-------------------------------------------------------------------------------

# Getting the training and test predictions

We will use the function seq2seq_model we defined above

In [50]:
training_predictions, test_predictions = seq2seq_model(tf.reverse(inputs, [-1]),
                                                       targets,
                                                       keep_prob,
                                                       batch_size,
                                                       sequence_length,
                                                       len(answerswords2int),
                                                       len(questionswords2int),
                                                       encoding_embedding_size,
                                                       decoding_embedding_size,
                                                       rnn_size,
                                                       num_layers,
                                                       questionswords2int)

AttributeError: module 'tensorflow_core.compat.v1' has no attribute 'contrib'