# Building a Chat Bot with Deep NLP

The project has been split into 4 parts: 

Part 1 : Data Preprocessing

Part 2 : Building the Seq2Seq model

Part 3 : Training the Seq2Seq model

Part 4 : Testing the Seq2Seq model

# Importing the libraries 

1. numpy library to work with arrays

2. tensorflow for deep learning

3. regular expression library to clean the text 

4. time library to measure the training time of each epoch

In [None]:
import numpy as np
import tensorflow as tf
import re
import time

# PART 1 - DATA PREPROCESSING

# Importing the dataset

We are going to give two variable names for the data sets.

The dataset of "lines" as lines and

the dataset of "conversations" as conversations

In [None]:
lines = open('movie_lines.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')

To avoid encoding issues, pass the argument encoding = 'utf-8'.

To ignore insignificant decoding errors, pass errors = 'ignore'.

.split('\n') splits the contents of the file into a list of lines, one per newline character.

In [None]:
conversations = open('movie_conversations.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')

Let's have a look at the lines dataset


'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',

In [None]:
# conversation

# "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
#  "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
#  "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
#  "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
#  "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']",

# Creating a dictionary that maps each line and its id

We want to create a dataset composed of inputs and outputs, and the easiest way to do that is by maintaining a dictionary.

We want to map each line to its ID.

So the key of the dictionary will be the ID of the line and the value will be the line itself.

1. declare an empty dictionary

2. iterate through all the lines of the "lines" dataset
    a) for each line, split it on " +++$+++ "
    
    b) take the first element as the key and the last element as the value
    
    c) an if statement ensures that the line has exactly 5 fields after splitting; otherwise we might run into index-shifting issues

In [None]:
id2line = {}
#iterate through all lines in the lines dataset
for line in lines:
    _line = line.split(' +++$+++ ')
    if len(_line) == 5:
        id2line[_line[0]] = _line[4]

In [None]:
# id2line 

# 'L169795': 'And?',
#  'L261935': 'Dad!',
#  'L341879': 'What do you see, Starling?',
#  'L306525': 'Please -- was at the time brandishing your firearm, trying in his rage to shoot an acquaintance -- friend of long standing --',
#  'L612204': 'AH UGH.',
#  'L73334': 'Bruce Wayne. In the flesh.',
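The mapping step above can be tried on a few rows without loading the whole file. This is a minimal sketch: the two valid rows are copied from the sample shown earlier, while the malformed row and the name `sample_id2line` are made up for illustration.

```python
# Two sample rows in the same layout as movie_lines.txt, plus one malformed row.
sample_lines = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "L000 +++$+++ broken row",  # hypothetical bad row with only 2 fields
]

sample_id2line = {}
for line in sample_lines:
    fields = line.split(' +++$+++ ')
    # Keep only rows with exactly 5 fields: line ID, user, movie, character, text.
    if len(fields) == 5:
        sample_id2line[fields[0]] = fields[4]

print(sample_id2line)
```

The malformed row is skipped by the length check, which is what protects the dictionary from shifted fields.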

# Creating a list of all conversations with the IDs

In [None]:
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(' +++$+++ ')[-1][1:-1].replace("'", "").replace(" ", "")
    conversations_ids.append(_conversation.split(','))  

In [None]:
# conversation_ids list
# ['L194', 'L195', 'L196', 'L197'],
#  ['L198', 'L199'],
#  ['L200', 'L201', 'L202', 'L203'],
#  ['L204', 'L205', 'L206'],
#  ['L207', 'L208']
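To see what each chained call in the conversation parsing does, here is the same sequence applied step by step to one row from the sample above. The intermediate variable names are only for illustration.

```python
# One row in the layout of movie_conversations.txt (taken from the sample above).
row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

last_field = row.split(' +++$+++ ')[-1]   # the textual list of line IDs
inner = last_field[1:-1]                  # drop the surrounding [ and ]
cleaned = inner.replace("'", "").replace(" ", "")  # drop quotes and spaces
ids = cleaned.split(',')                  # one line ID per list element

print(ids)
```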

In [None]:
questions = []
answers = []
for conversation in conversations_ids:
    for i in range(len(conversation) - 1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])
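The pairing logic treats every line in a conversation (except the last) as a question and the line that follows it as its answer. A small sketch on made-up data (all names and sentences here are hypothetical) shows the effect:

```python
# Hypothetical three-line conversation.
toy_id2line = {'L1': 'hi', 'L2': 'hello', 'L3': 'how are you'}
toy_conversation = ['L1', 'L2', 'L3']

toy_questions = []
toy_answers = []
for i in range(len(toy_conversation) - 1):
    # Line i is the question, line i + 1 is the answer to it.
    toy_questions.append(toy_id2line[toy_conversation[i]])
    toy_answers.append(toy_id2line[toy_conversation[i + 1]])

print(toy_questions)
print(toy_answers)
```

Note that the middle line appears both as an answer (to the first line) and as a question (for the third line).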

Let's have a look at the answers list we created above

# Cleaning of the texts

1. put everything in lowercase

2. remove the apostrophes by expanding common contractions

3. remove all non-essential words

We create a function that carries out these activities.

Steps 1 and 2 are carried out in the function below, which also strips most punctuation symbols.

In [None]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"[-()\"#/@;:<>{}+=~|.?,]", "", text)
    return text
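To check what the cleaning does on a concrete sentence, the sketch below restates the same substitutions as a table-driven function. The name `clean_text_demo` and the test sentences are made up for illustration; the patterns are copied from the function above.

```python
import re

# Same patterns as clean_text, applied in the same order.
SUBSTITUTIONS = [
    (r"i'm", "i am"), (r"he's", "he is"), (r"she's", "she is"),
    (r"that's", "that is"), (r"what's", "what is"), (r"where's", "where is"),
    (r"\'ll", " will"), (r"\'ve", " have"), (r"\'re", " are"), (r"\'d", " would"),
    (r"won't", "will not"), (r"can't", "cannot"),
    (r"[-()\"#/@;:<>{}+=~|.?,]", ""),
]

def clean_text_demo(text):
    text = text.lower()
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text

cleaned = clean_text_demo("I'm sure she's fine. Can't you see?")
untouched = clean_text_demo("We didn't!")
print(cleaned)
print(untouched)
```

The second call shows a limitation: contractions that are not in the list (such as "didn't") and the exclamation mark pass through unchanged, which matches the sample output shown later in this notebook.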

# Applying the above function for cleaning the questions




In [None]:
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))

# Applying the clean_text function for cleaning the answers



In [None]:
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))

Let's have a look at the clean_questions list we created above 


'can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again',
 'well i thought we would start with pronunciation if that is okay with you',
 'not the hacking and gagging and spitting part  please',
 'you are asking me out  that is so cute what is your name again',
 "no no it's my fault  we didn't have a proper introduction ",
 'cameron',

Let's have a look at the clean_answers list we created above

'well i thought we would start with pronunciation if that is okay with you',
 'not the hacking and gagging and spitting part  please',
 "okay then how 'bout we try out some french cuisine  saturday  night",
 'forget it',
 'cameron',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does',
 'seems like she could get a date easy enough',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something',

In [None]:
word2count = {}
#going through the clean_questions
for question in clean_questions:
    for word in question.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
#going through the clean_answers
for answer in clean_answers:
    for word in answer.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
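The two counting loops above can also be written with `collections.Counter` from the standard library. This is an alternative sketch on made-up sentences, not the notebook's own code; the `toy_` names are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned sentences.
toy_questions = ['how are you', 'are you okay']
toy_answers = ['i am fine', 'you are kind']

toy_word2count = Counter()
for sentence in toy_questions + toy_answers:
    # update() adds one count per word in the split sentence.
    toy_word2count.update(sentence.split())

print(toy_word2count.most_common(2))
```

A `Counter` behaves like a dictionary, so it could be used wherever word2count is used later.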

# Tokenization and filtering of the non-frequent words

1. We will assign a unique integer to each of the words present in the clean_questions and clean_answers list 

2. for each word, we will look up its number of occurrences (using the word2count dictionary created above) and filter out the words that do not reach the threshold, i.e., we remove infrequent words

3. and for those words which cross the threshold, we will assign a unique integer value 

In [None]:
threshold_questions = 20
threshold_answers = 20

The threshold is a hyperparameter, so feel free to experiment with this value while training your model. But make sure not to set the threshold too low: the lower it is, the more rare words are kept, and the vocabulary may become too overwhelming for the model to learn.

In [None]:
# carrying out the above mentioned steps for questions
questionswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_questions:
        questionswords2int[word] = word_number
        word_number += 1
        
# carrying out the above mentioned steps for answers
answerswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_answers:
        answerswords2int[word] = word_number
        word_number += 1

Let's have a look at the questionswords2int dictionary we created above
 
 'dowd': 2252,
 'december': 7082,
 'zuzu': 3282,
 'impressed': 4407,
 'ditch': 7778,
 'bark': 2253,
 'slut': 0,
 'scattered': 8479,
 'pa': 4033,
 'italian': 1,
 'box': 2056,
 'thirteen': 8423,
 'gone!': 7780,
 'legit': 4409,
 'health': 3334,

Let's have a look at the answerswords2int dictionary we created above

 'dowd': 2252,
 'december': 7082,
 'zuzu': 3282,
 'impressed': 4407,
 'ditch': 7778,
 'bark': 2253,
 'slut': 0,
 'scattered': 8479,
 'pa': 4033,
 'italian': 1,
 'box': 2056,
 'thirteen': 8423,
 'gone!': 7780,
 'legit': 4409,
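The filtering step can be traced on a tiny count table. The counts below are invented for illustration; only the threshold value of 20 comes from the notebook.

```python
# Hypothetical word counts.
toy_word2count = {'you': 30, 'are': 25, 'zuzu': 3, 'okay': 20}
threshold = 20

toy_words2int = {}
word_number = 0
for word, count in toy_word2count.items():
    # Words below the threshold get no ID and will later map to <OUT>.
    if count >= threshold:
        toy_words2int[word] = word_number
        word_number += 1

print(toy_words2int)
```

Because both dictionaries in the notebook are built from the same word2count with the same threshold, they end up with identical contents, which is why the two printouts above match.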

# Adding the last tokens to the above two dictionaries

EOS = End of sentence

SOS = Start of sentence

In [None]:
tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']

In [None]:
for token in tokens:
    questionswords2int[token] = len(questionswords2int) + 1

Doing the same as above to answerswords2int dictionary

Here we need to assign a unique integer to each token in the answerswords2int dictionary. Since the words were added with IDs incremented one at a time, we simply take the current length of the dictionary and add 1 to it.

In [None]:
for token in tokens:
    answerswords2int[token] = len(answerswords2int) + 1
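It may help to trace the `len(...) + 1` rule on a small dictionary. The three-word vocabulary below is hypothetical; the token list is the one defined above.

```python
# Hypothetical vocabulary with IDs 0, 1, 2.
toy_words2int = {'you': 0, 'are': 1, 'okay': 2}
tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']

for token in tokens:
    # The dictionary grows by one entry per iteration, so each token
    # receives a different value.
    toy_words2int[token] = len(toy_words2int) + 1

print(toy_words2int)
```

The IDs stay unique, but the value equal to the original vocabulary size (3 in this sketch) is skipped. That appears to be why the embedding tables later are sized with the vocabulary length plus 1.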

# Creating an inverse dictionary of the answerswords2int dictionary

We are doing so because we need the inverse mapping from integers to answers words in the implementation of the Seq2Seq architectural model. 
Also, we need this only for the answerswords dictionary and not for questionswords.

1. declare a new dictionary in python

2. there is a single line trick in python to reverse the mapping of a dictionary,i.e, to interchange the key-value pairs and store them in a new dictionary

In [None]:
answersints2word = {w_i: w for w, w_i in answerswords2int.items()}

Let's have a look at the newly created answersints2word dictionary (inverse mapped dictionary)
 0: 'slut',
 1: 'italian',
 2: 'bend',
 3: 'power',
 4: 'skin',
 5: 'labor',
 6: 'bore',
 7: 'spaghetti',
 8: 'king!',
 9: 'sneaking',
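The inverse mapping is what turns predicted integers back into words. A minimal sketch on a made-up vocabulary (the `toy_` names and the ID sequence are hypothetical):

```python
# Hypothetical word-to-integer mapping.
toy_words2int = {'you': 0, 'are': 1, '<EOS>': 3}

# Swap keys and values with a dictionary comprehension.
toy_ints2word = {w_i: w for w, w_i in toy_words2int.items()}

# Decode a sequence of predicted IDs back into text.
decoded = ' '.join(toy_ints2word[i] for i in [0, 1, 3])
print(decoded)
```

This only works because every word has a distinct integer; duplicate values would overwrite each other in the inverted dictionary.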
 

# Adding the End Of Sentence token to the end of every answer



In [None]:
for i in range(len(clean_answers)):
    clean_answers[i] += ' <EOS>'

Let's have a look at the modified clean_answers list after appending the EOS token

'well i thought we would start with pronunciation if that is okay with you <EOS>',
 'not the hacking and gagging and spitting part  please <EOS>',
 "okay then how 'bout we try out some french cuisine  saturday  night <EOS>",
 'forget it <EOS>',
 'cameron <EOS>',
 'the thing is cameron  i am at the mercy of a particularly hideous breed of loser  my sister  i cannot date until she does <EOS>',
 'seems like she could get a date easy enough <EOS>',
 'unsolved mystery  she used to be really popular when she started high school then it was just like she got sick of it or something <EOS>',

# Translating all the questions and the answers into integers and replacing all the words that were filtered out by "<OUT>"



In [None]:
questions_into_int = []
for question in clean_questions:
    ints = [] # list of integers, which will be the associated integer to each of the word present in that question
    for word in question.split():
        if word not in questionswords2int:
            ints.append(questionswords2int['<OUT>'])
        else:
            ints.append(questionswords2int[word])
    questions_into_int.append(ints)

Doing the same set of above steps for answers

In [None]:
answers_into_int = []
for answer in clean_answers:
    ints = [] # list of integers, which will be the associated integer to each of the word present in that answer
    for word in answer.split():
        if word not in answerswords2int:
            ints.append(answerswords2int['<OUT>'])
        else:
            ints.append(answerswords2int[word])
    answers_into_int.append(ints)

Let's have a look at the questions_into_int list

[8065,
  3783,
  3552,
  6773,
  1310,
  8824,
  8824,
  6112,
  3217,
  8824,
  2293,
  72,
  4780,
  3347,
  8824,
  4244,
  5451,
  4866,
  2513,
  6233,
  8824,
  5107],
 

Let's have a look at the answers_into_int list
[3164,
  2079,
  4359,
  3783,
  134,
  3579,
  1036,
  8824,
  6166,
  8419,
  435,
  649,
  1036,
  4069,]
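The translation step can be traced on one sentence. This sketch uses `dict.get` with a default as a shorter equivalent of the if/else in the loops above; the vocabulary and sentence are made up.

```python
# Hypothetical vocabulary including the special tokens.
toy_words2int = {'you': 0, 'are': 1, 'okay': 2, '<EOS>': 5, '<OUT>': 6}

sentence = 'are you zuzu <EOS>'
out_id = toy_words2int['<OUT>']

# Unknown words (here 'zuzu') fall back to the <OUT> ID.
ints = [toy_words2int.get(word, out_id) for word in sentence.split()]
print(ints)
```

A value that repeats many times in the real output, such as 8824 in the lists above, is most likely the `<OUT>` ID standing in for filtered-out words.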

# Sorting questions and answers by the length of the questions

We are doing so because it will speed up the training and help to reduce the loss: grouping questions of similar length reduces the amount of padding during training.

1. create two empty lists in python, one called sorted_clean_questions and the other sorted_clean_answers
 
 We will place a limit on the length of the questions, because very lengthy questions will be too overwhelming for the chatbot to learn from. 
Also, this limit on the length can be considered a hyperparameter which can be tuned to get better performance (here let's take the limit to be 25).

2. loop over the possible lengths of the questions (up to the limit)
    for each of the questions, we need two elements: the index of the question and the question itself
    
    the trick to get these two elements at the same time is to use the enumerate function


In [None]:
sorted_clean_questions = []
sorted_clean_answers = []
for length in range(1, 25 + 1):
    for i in enumerate(questions_into_int):
        if len(i[1]) == length:
            sorted_clean_questions.append(questions_into_int[i[0]])
            sorted_clean_answers.append(answers_into_int[i[0]])
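The sorting loop can be checked on a few toy sequences. The data and the limit of 3 are made up for illustration (the notebook uses 25); the sketch also unpacks the `enumerate` pair into two names, which reads a little more clearly than `i[0]` and `i[1]`.

```python
# Hypothetical encoded questions and their answers (same positions).
toy_questions_into_int = [[5, 6, 7], [1], [2, 3], [9, 9, 9, 9]]
toy_answers_into_int = [[10], [11, 12], [13], [14]]
max_length = 3  # assumed limit for this sketch

sorted_q = []
sorted_a = []
for length in range(1, max_length + 1):
    for index, question in enumerate(toy_questions_into_int):
        if len(question) == length:
            sorted_q.append(question)
            # Keep the answer aligned with its question.
            sorted_a.append(toy_answers_into_int[index])

print(sorted_q)
print(sorted_a)
```

The question with four tokens exceeds the limit and is dropped together with its answer, so the two lists stay aligned.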

-------------------------------------------------------------------------------
# PART 2 - BUILDING THE SEQ2SEQ MODEL
-------------------------------------------------------------------------------

# Creating placeholders for the inputs and the targets

In TensorFlow, all variables are used as tensors. Tensors are like an advanced numpy array that allows very fast computations in Deep Neural Networks.

All variables used as tensors must be defined as what we call TensorFlow placeholders.

This is more of an advanced data structure that can contain tensors and also additional features.

-------------------------------------------------------------------------------

We will define a function called model_inputs, and inside this function we will create a placeholder for the inputs and a placeholder for the targets.
Then we will add placeholders for the learning rate and one more hyperparameter, the dropout keep probability (keep_prob).

In short, we will be creating placeholders to be able to use these variables in future training.

-------------------------------------------------------------------------------

In [None]:
def model_inputs():
    inputs = tf.placeholder(tf.int32, [None,None], name = 'input')
    targets = tf.placeholder(tf.int32, [None,None], name = 'target')
    lr = tf.placeholder(tf.float32, name = 'learning_rate')
    keep_prob = tf.placeholder(tf.float32,name = 'keep_prob')
    return inputs,targets,lr,keep_prob

# Preprocessing the targets

In [None]:
def preprocess_targets(targets, word2int, batch_size):
    left_side = tf.fill([batch_size, 1], word2int['<SOS>'])
    right_side = tf.strided_slice(targets, [0,0], [batch_size,-1], [1,1])
    preprocessed_targets = tf.concat([left_side, right_side], 1)
    return preprocessed_targets
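What preprocess_targets does to a batch is easier to see on plain Python lists. This is only an illustration of the tensor operations, with made-up IDs: the last column of every target row is dropped (the strided slice up to -1) and the `<SOS>` ID is placed in front (the fill and concat).

```python
sos_id = 7  # hypothetical ID of the <SOS> token

# Hypothetical batch of two encoded answers of equal length.
targets = [
    [11, 12, 13, 5],
    [21, 22, 5, 4],
]

# Drop the last element of each row and prepend the <SOS> ID.
preprocessed = [[sos_id] + row[:-1] for row in targets]
print(preprocessed)
```

The row length is unchanged, so the decoder input has the same shape as the original targets.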

-------------------------------------------------------------------------------

# Creating the Encoder RNN Layer

The arguments of this function include: 

1. rnn_inputs that corresponds to the model inputs

2. rnn_size is the number of input tensors of the encoder

3. num_layers, the number of layers in the RNN

4. keep_prob, for dropout regularization(to improve accuracy)

5. sequence_length, which is the list of the lengths of each question in the batch.
-------------------------------------------------------------------------------

1. In TensorFlow we have a class that will help us create an LSTM

assign variable lstm to tf.contrib(module).rnn(submodule).BasicLSTMCell(rnn_size) 

2. assign variable lstm_dropout to tf.contrib(module).rnn(submodule).DropoutWrapper(class)(lstm, keep_prob)

3. we are now ready to create the encoder cell

assign variable encoder_cell to tf.contrib(module).rnn(submodule).MultiRNNCell(a list containing the lstm_dropout cell repeated num_layers times, where num_layers is an argument of the function)

4. now we get the encoder_state from the bidirectional_dynamic_rnn function of TensorFlow's nn module

the above step creates a dynamic version of a bidirectional RNN (this will help us in making our chatbot more powerful)

the dynamic version of the bidirectional RNN takes the input and builds independent forward and backward RNNs. 

NOTE: For dynamic bidirectional RNNs, the input size of the forward cell and the backward cell must match.


In [None]:
def encoder_rnn(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_length):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
    encoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
    _, encoder_state = tf.nn.bidirectional_dynamic_rnn(cell_fw = encoder_cell, 
                                                       cell_bw = encoder_cell, 
                                                       sequence_length = sequence_length, 
                                                       inputs = rnn_inputs,
                                                       dtype = tf.float32)
    return encoder_state

-------------------------------------------------------------------------------

# Creating the Decoder of the RNN layer

# Step 1. Decoding the training set

The arguments of this function include:

1. encoder_state

2. decoder_cell

3. decoder_embedded_input

4. sequence_length

5. decoding_scope

6. output_function

7. keep_prob

8. batch_size
-------------------------------------------------------------------------------

1. The first thing we need to do is get the attention states

initialize attention_states as a 3-dimensional tensor filled with zeros

Since we are dealing with batches, the number of lines is going to be batch_size

The number of columns is 1, and the number of elements on the third axis is going to be decoder_cell.output_size

2. We will get the attention_keys, the attention_values, the attention_score_function and the attention_construct_function using prepare_attention(), a TensorFlow function belonging to the seq2seq submodule

the attention_keys are the keys that are compared with the target states

the attention_values are the values that we will use to construct the context vectors

the attention_score_function is used to compute the similarity between the keys and the target states

the attention_construct_function is the function used to build the attention state

3. The next step is to get the training_decoder_function that will do the decoding of the training set. 

training_decoder_function is obtained from another tensorflow function present in the seq2seq submodule called attention_decoder_fn_train()


4. the next step is to get the decoder_output, decoder_final_state and the decoder_final_context_state (but we need only the decoder_output)

these are obtained from the dynamic_rnn_decoder function of the seq2seq submodule in the TensorFlow library

5. the final step is to apply dropout to our decoder_output
decoder_output_dropout = tf.nn(module).dropout()

6. return the output_function(decoder_output_dropout)


In [None]:
def decode_training_set(encoder_state, decoder_cell, decoder_embedded_input, sequence_length, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    training_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_train(encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              name = "attn_dec_train")
    decoder_output, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                              training_decoder_function,
                                                                                                              decoder_embedded_input,
                                                                                                              sequence_length,
                                                                                                              scope = decoding_scope)
    decoder_output_dropout = tf.nn.dropout(decoder_output, keep_prob)
    return output_function(decoder_output_dropout)

# Step 2. Decoding the test/validation set


Here we are going to make a very similar function as above, but for the observations of the test set and the validation set.

These are new observations that will not be used in the training.


In [None]:
def decode_test_set(encoder_state, decoder_cell, decoder_embeddings_matrix, sos_id, eos_id, maximum_length, num_words, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    test_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_inference(output_function,
                                                                              encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              decoder_embeddings_matrix,
                                                                              sos_id,
                                                                              eos_id,
                                                                              maximum_length,
                                                                              num_words,
                                                                              name = "attn_dec_inf")
    test_predictions, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                                test_decoder_function,
                                                                                                                scope = decoding_scope)
    return test_predictions

# Step 3. Creating the Decoder RNN

This function will have the following arguments:

1.decoder_embedded_input

2.decoder_embeddings_matrix

3.encoder_state  // output of the encoder becomes input of the decoder

4.num_words //total number of words in our corpus of words

5.sequence_length 

6.rnn_size

7.num_layers // number of layers we want in our RNN decoder

8.word2int //the dictionary which we have defined earlier

9.keep_prob // for the dropout(regularization) rate

10.batch_size 

-----------------------------------------------------------------------------



In [None]:
def decoder_rnn(decoder_embedded_input, decoder_embeddings_matrix, encoder_state, num_words, sequence_length, rnn_size, num_layers, word2int, keep_prob, batch_size):
    with tf.variable_scope("decoding") as decoding_scope:
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
        decoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
        weights = tf.truncated_normal_initializer(stddev = 0.1)
        biases = tf.zeros_initializer()
        output_function = lambda x: tf.contrib.layers.fully_connected(x,
                                                                      num_words,
                                                                      None,
                                                                      scope = decoding_scope,
                                                                      weights_initializer = weights,
                                                                      biases_initializer = biases)
        training_predictions = decode_training_set(encoder_state,
                                                   decoder_cell,
                                                   decoder_embedded_input,
                                                   sequence_length,
                                                   decoding_scope,
                                                   output_function,
                                                   keep_prob,
                                                   batch_size)
        decoding_scope.reuse_variables()
        test_predictions = decode_test_set(encoder_state,
                                           decoder_cell,
                                           decoder_embeddings_matrix,
                                           word2int['<SOS>'],
                                           word2int['<EOS>'],
                                           sequence_length - 1,
                                           num_words,
                                           decoding_scope,
                                           output_function,
                                           keep_prob,
                                           batch_size)
    return training_predictions, test_predictions

-----------------------------------------------------------------------------

# Building the Seq2Seq Model

This is the final function, which we will build using the functions defined above. It will be the brain of our chatbot. 

This function will take the following arguments:

1.inputs which are the questions of the Cornell movie corpus dialogue dataset

2.targets, which will be the answers to our questions

3.keep_prob

4.batch_size

5.sequence_length

6.answers_num_words

7.questions_num_words

8.encoder_embedding_size, which is the number of dimensions of the embedding matrix for the encoder

9.decoder_embedding_size, which is the number of dimensions of the embedding matrix for the decoder

10.rnn_size

11.num_layers

12.questionswords2int, dictionary which we defined previously to preprocess the targets

-----------------------------------------------------------------------------




In [None]:
def seq2seq_model(inputs, targets, keep_prob, batch_size, sequence_length, answers_num_words, questions_num_words, encoder_embedding_size, decoder_embedding_size, rnn_size, num_layers, questionswords2int):
    encoder_embedded_input = tf.contrib.layers.embed_sequence(inputs,
                                                              answers_num_words + 1,
                                                              encoder_embedding_size,
                                                              initializer = tf.random_uniform_initializer(0, 1))
    encoder_state = encoder_rnn(encoder_embedded_input, rnn_size, num_layers, keep_prob, sequence_length)
    preprocessed_targets = preprocess_targets(targets, questionswords2int, batch_size)
    decoder_embeddings_matrix = tf.Variable(tf.random_uniform([questions_num_words + 1, decoder_embedding_size], 0, 1))
    decoder_embedded_input = tf.nn.embedding_lookup(decoder_embeddings_matrix, preprocessed_targets)
    training_predictions, test_predictions = decoder_rnn(decoder_embedded_input,
                                                         decoder_embeddings_matrix,
                                                         encoder_state,
                                                         questions_num_words,
                                                         sequence_length,
                                                         rnn_size,
                                                         num_layers,
                                                         questionswords2int,
                                                         keep_prob,
                                                         batch_size)
    return training_predictions, test_predictions

-------------------------------------------------------------------------------

# PART 3 -  TRAINING THE SEQ2SEQ MODEL

-------------------------------------------------------------------------------

# Setting the hyperparameters

Training consists of feeding batches of inputs into the neural network, forward-propagating them through the encoder to get the encoder states, and then forward-propagating the encoder states together with the targets through the deep recurrent neural network to get the final answers/outputs. The loss computed from the outputs and the targets is then back-propagated into the neural network, and the weights are updated so that the chatbot gets better at speaking like a human. 

1.An epoch is basically one whole iteration of the above mentioned steps (take 100; if training is taking too long, adjust it to 50, but not lower).

2.batch_size, we are setting it to be 64(usually a power of 2)

3.rnn_size, we are setting it to be 512

4.num_layers, we are setting it to be 3, can change later if necessary

5.encoding_embedding_size,(number of columns in the embedded matrix), taken as 512

6.similarly, decoding_embedding_size is taken to be as 512

7.learning_rate, we will start with 0.01

8.learning_rate_decay, which represents the percentage by which learning_rate is reduced over the iterations of the training

9.min_learning_rate, as a lower bound to take care of early stopping if the learning_rate decreases drastically

10.keep_probability, dropout regularization hyperparameter to prevent overfitting
Geoffrey Hinton, the master of Deep Learning and Artificial Intelligence, states in his paper that dropping out 20% of the input units and 50% of the hidden units was often found to be optimal.

In [None]:
epochs = 100

batch_size = 64

rnn_size = 512

num_layers = 3

encoding_embedding_size = 512

decoding_embedding_size = 512

learning_rate = 0.01

learning_rate_decay = 0.9

min_learning_rate = 0.0001

keep_probability = 0.5
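To get a feel for how learning_rate, learning_rate_decay and min_learning_rate interact, here is a small sketch. It assumes the rate is multiplied by the decay factor at each decay step and never allowed below the minimum; how and when the decay is actually applied is not shown in this part of the notebook, so treat the update rule and the 60 steps as assumptions.

```python
learning_rate = 0.01
learning_rate_decay = 0.9
min_learning_rate = 0.0001

# Assumed rule: multiply by the decay factor, but never go below the minimum.
schedule = [learning_rate]
for _ in range(60):  # 60 decay steps, an arbitrary number for this sketch
    next_rate = max(schedule[-1] * learning_rate_decay, min_learning_rate)
    schedule.append(next_rate)

print(schedule[:3])
print(schedule[-1])
```

Under this assumption a decay of 0.9 means each step keeps 90% of the previous rate, and the floor stops the rate from shrinking indefinitely.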

-------------------------------------------------------------------------------

# Defining a session

We will define a TensorFlow session in which all the tensorflow training will be run.



In [None]:
tf.reset_default_graph()
session = tf.InteractiveSession()

-------------------------------------------------------------------------------

# Loading the Model Inputs

We will be using a function which we defined previously in Part 2

In [None]:
inputs, targets, lr, keep_prob = model_inputs()

-------------------------------------------------------------------------------

# Setting the sequence length

We are going to set the sequence length to the maximum length, which will be 25 (the limit we already applied at the end of Part 1 - data preprocessing).


We are going to use TensorFlow's placeholder_with_default function.

1.The arguments will be: maximum_length(25)

2.Sequence shape, since there is no tensor to deal with, input None

3.name of the sequence_length

In [None]:
sequence_length = tf.placeholder_with_default(25, None, name = 'sequence_length')

-------------------------------------------------------------------------------

# Getting the shape of the input tensor

We need the shape of the input tensor because it is one of the arguments of a function we use for training.

That function is TensorFlow's ones function, which creates a tensor of ones.

To get the shape, we use TensorFlow's shape function.

In [None]:
input_shape = tf.shape(inputs)

-------------------------------------------------------------------------------

# Getting the training and test predictions

We will use the function seq2seq_model we defined above

In [None]:
training_predictions, test_predictions = seq2seq_model(tf.reverse(inputs, [-1]),
                                                       targets,
                                                       keep_prob,
                                                       batch_size,
                                                       sequence_length,
                                                       len(answerswords2int),
                                                       len(questionswords2int),
                                                       encoding_embedding_size,
                                                       decoding_embedding_size,
                                                       rnn_size,
                                                       num_layers,
                                                       questionswords2int)
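
The call above passes tf.reverse(inputs, [-1]) to the model, which flips each question along its last axis, i.e. it reverses the token order within every row of the batch. The plain-Python sketch below mimics that on a toy batch; the IDs and the convention that 0 stands for the PAD token are assumptions made for illustration only.

```python
def reverse_each_row(batch):
    """Reverse the token order of every sequence in a batch (list of lists)."""
    return [row[::-1] for row in batch]


# Toy batch of two padded questions; 0 is assumed to be the PAD id here.
toy_batch = [[5, 8, 2, 0, 0],
             [7, 3, 9, 4, 0]]
reversed_batch = reverse_each_row(toy_batch)
print(reversed_batch)  # [[0, 0, 2, 8, 5], [0, 4, 9, 3, 7]]
```

Because the questions are padded on the right, reversing moves the padding to the front of each row.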


--------------------------------------------------------------------------------
# Setting up the loss error, the Optimizer and Gradient Clipping

We define a new scope here that contains the two final elements we need for training:

1. The loss error: a weighted cross-entropy loss

2. The optimizer with gradient clipping: an Adam optimizer whose gradients are clipped to avoid exploding gradients

We obtain loss_error from the sequence_loss function in the seq2seq submodule of the contrib module.
Its first two arguments are training_predictions and targets, because the loss is computed from the difference between the two.
The third argument is a tensor of weights, initialized to ones with the appropriate shape.

The optimizer is an object of the AdamOptimizer class, which comes from TensorFlow's train module.

We compute the gradients with compute_gradients, a method provided by the optimizer object.

Clipping the gradients means bounding each gradient value between a minimum and a maximum, here -5 and 5, so that it cannot go below or above them. This guards against exploding gradients; it does not help with vanishing gradients, since small values are left unchanged.
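
As a rough illustration of clipping by value, the sketch below applies the same bounds of -5 and 5 to a few made-up scalar gradients. The helper names and the toy values are hypothetical; in the notebook the gradients are tensors and TensorFlow clips them element-wise.

```python
def clip_by_value(value, low, high):
    """Bound a single number to the interval [low, high]."""
    return max(low, min(value, high))


def clip_gradients(grads_and_vars, low=-5.0, high=5.0):
    """Clip every gradient in a list of (gradient, variable_name) pairs.

    Pairs whose gradient is None are skipped, mirroring the
    `if grad_tensor is not None` filter in the notebook's optimization code.
    """
    return [(clip_by_value(grad, low, high), var)
            for grad, var in grads_and_vars if grad is not None]


# Hypothetical scalar gradients, for illustration only.
toy_gradients = [(12.5, "w1"), (-0.3, "w2"), (None, "w3"), (-40.0, "w4")]
print(clip_gradients(toy_gradients))
# [(5.0, 'w1'), (-0.3, 'w2'), (-5.0, 'w4')]
```

Large values are pulled back to the bounds, while the small gradient -0.3 passes through unchanged.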

In [None]:
with tf.name_scope("optimization"):
    loss_error = tf.contrib.seq2seq.sequence_loss(training_predictions,
                                                  targets,
                                                  tf.ones([input_shape[0], sequence_length]))
    # Use the lr placeholder (fed on every training step) so that the learning rate decay actually takes effect
    optimizer = tf.train.AdamOptimizer(lr)
    gradients = optimizer.compute_gradients(loss_error)
    clipped_gradients = [(tf.clip_by_value(grad_tensor, -5., 5.), grad_variable) for grad_tensor, grad_variable in gradients if grad_tensor is not None]
    optimizer_gradient_clipping = optimizer.apply_gradients(clipped_gradients)
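
To give an intuition for what a weighted cross-entropy over sequences measures, here is a simplified plain-Python sketch. It is not TensorFlow's implementation: the function name weighted_sequence_loss, the toy probabilities and the weight matrices are all assumptions made for illustration. Each time step contributes the negative log-probability of its target token, scaled by its weight, and the total is divided by the sum of the weights.

```python
import math


def weighted_sequence_loss(correct_token_probs, weights):
    """Weighted average cross-entropy over a batch of sequences.

    correct_token_probs[b][t] is the probability the model assigned to the
    target token at time step t of sequence b; weights has the same shape.
    """
    total_loss = 0.0
    total_weight = 0.0
    for probs_row, weights_row in zip(correct_token_probs, weights):
        for prob, weight in zip(probs_row, weights_row):
            total_loss += weight * -math.log(prob)
            total_weight += weight
    return total_loss / total_weight


# Toy values: 2 sequences of 3 time steps each.
probs = [[0.9, 0.5, 0.8],
         [0.6, 0.7, 0.99]]
all_ones = [[1.0, 1.0, 1.0],
            [1.0, 1.0, 1.0]]
ignore_last_step = [[1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0]]
print(weighted_sequence_loss(probs, all_ones))
print(weighted_sequence_loss(probs, ignore_last_step))
```

With weights of all ones, as in the notebook, every time step counts equally, including padded positions. A weight of zero would exclude a position from the average.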

# Padding the sequences with the < PAD> token

Why do we need padding?

All the sentences in a batch, whether they are questions or answers, must have the same length. This is a must in Deep NLP.

We create a function called apply_padding.

It takes 2 arguments:

1. batch_of_sequences

2. word2int dictionary

The function completes the sentences with PAD tokens so that all the sentences in the batch have the same length as the longest one.



In [None]:
def apply_padding(batch_of_sequences, word2int):
    max_sequence_length = max([len(sequence) for sequence in batch_of_sequences])
    return [sequence + [word2int['<PAD>']] * (max_sequence_length - len(sequence)) for sequence in batch_of_sequences]
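
Here is a small usage sketch of apply_padding on a toy batch. The function is repeated so that the snippet runs on its own; the vocabulary and the sequences are hypothetical values chosen for illustration.

```python
def apply_padding(batch_of_sequences, word2int):
    max_sequence_length = max(len(sequence) for sequence in batch_of_sequences)
    return [sequence + [word2int['<PAD>']] * (max_sequence_length - len(sequence))
            for sequence in batch_of_sequences]


# Hypothetical toy vocabulary and batch, for illustration only.
toy_word2int = {'<PAD>': 0, 'hello': 1, 'how': 2, 'are': 3, 'you': 4}
toy_batch = [[1], [2, 3, 4], [1, 4]]
print(apply_padding(toy_batch, toy_word2int))
# [[1, 0, 0], [2, 3, 4], [1, 4, 0]]
```

Every sequence is extended on the right up to the length of the longest sequence in the batch, which is 3 here.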


--------------------------------------------------------------------------------
# Splitting the data into batches of questions and answers

We create a function called split_into_batches. Its arguments are easy to guess:

1. questions

2. answers

3. batch_size

In [None]:
def split_into_batches(questions, answers, batch_size):
    for batch_index in range(0, len(questions) // batch_size):
        start_index = batch_index * batch_size
        questions_in_batch = questions[start_index : start_index + batch_size]
        answers_in_batch = answers[start_index : start_index + batch_size]
        padded_questions_in_batch = np.array(apply_padding(questions_in_batch, questionswords2int))
        padded_answers_in_batch = np.array(apply_padding(answers_in_batch, answerswords2int))
        yield padded_questions_in_batch, padded_answers_in_batch
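
The indexing used by split_into_batches can be sketched without NumPy or padding. The helper below is hypothetical and only yields the start and stop index of each batch, which makes one detail easy to see: because the number of batches is computed with integer division, any leftover items that do not fill a whole batch are not yielded.

```python
def split_indices_into_batches(num_items, batch_size):
    """Yield (start, stop) index pairs for each full batch."""
    for batch_index in range(num_items // batch_size):
        start_index = batch_index * batch_size
        yield start_index, start_index + batch_size


batches = list(split_indices_into_batches(num_items=10, batch_size=4))
print(batches)  # [(0, 4), (4, 8)]
```

With 10 items and a batch size of 4, the last two items (indices 8 and 9) are left out.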

--------------------------------------------------------------------------------
# Splitting the questions and answers into training and validation sets

The validation set holds out 10%-15% of the data, which is not used for training the neural network. It is used to check the predictive power of the model on data it has not seen. Here we keep 15%.

In [None]:
training_validation_split = int(len(sorted_clean_questions) * 0.15)
training_questions = sorted_clean_questions[training_validation_split:]
training_answers = sorted_clean_answers[training_validation_split:]
validation_questions = sorted_clean_questions[:training_validation_split]
validation_answers = sorted_clean_answers[:training_validation_split]
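
The slicing above can be tried on toy data. In the sketch below, the function name split_training_validation and the integer lists standing in for questions and answers are assumptions made for illustration.

```python
def split_training_validation(questions, answers, validation_fraction=0.15):
    """Hold out the first `validation_fraction` of the pairs for validation."""
    split_index = int(len(questions) * validation_fraction)
    training = (questions[split_index:], answers[split_index:])
    validation = (questions[:split_index], answers[:split_index])
    return training, validation


# Toy data: 20 question/answer pairs represented by plain integers.
toy_questions = list(range(20))
toy_answers = list(range(100, 120))
(train_q, train_a), (valid_q, valid_a) = split_training_validation(toy_questions, toy_answers)
print(len(train_q), len(valid_q))  # 17 3
print(valid_q)                     # [0, 1, 2]
```

Note that the validation pairs are simply the first items of the lists. If, as the name sorted_clean_questions suggests, the data is ordered, the validation set is the first part of that ordering rather than a random sample.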

--------------------------------------------------------------------------------
# Training

In [None]:
batch_index_check_training_loss = 100
batch_index_check_validation_loss = ((len(training_questions)) // batch_size // 2) - 1
total_training_loss_error = 0
list_validation_loss_error = []
early_stopping_check = 0
early_stopping_stop = 1000
checkpoint = "chatbot_weights.ckpt" # For Windows users, replace this line of code by: checkpoint = "./chatbot_weights.ckpt"
session.run(tf.global_variables_initializer())
for epoch in range(1, epochs + 1):
    for batch_index, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(training_questions, training_answers, batch_size)):
        starting_time = time.time()
        _, batch_training_loss_error = session.run([optimizer_gradient_clipping, loss_error], {inputs: padded_questions_in_batch,
                                                                                               targets: padded_answers_in_batch,
                                                                                               lr: learning_rate,
                                                                                               sequence_length: padded_answers_in_batch.shape[1],
                                                                                               keep_prob: keep_probability})
        total_training_loss_error += batch_training_loss_error
        ending_time = time.time()
        batch_time = ending_time - starting_time
        if batch_index % batch_index_check_training_loss == 0:
            print('Epoch: {:>3}/{}, Batch: {:>4}/{}, Training Loss Error: {:>6.3f}, Training Time on 100 Batches: {:d} seconds'.format(epoch,
                                                                                                                                       epochs,
                                                                                                                                       batch_index,
                                                                                                                                       len(training_questions) // batch_size,
                                                                                                                                       total_training_loss_error / batch_index_check_training_loss,
                                                                                                                                       int(batch_time * batch_index_check_training_loss)))
            total_training_loss_error = 0
        if batch_index % batch_index_check_validation_loss == 0 and batch_index > 0:
            total_validation_loss_error = 0
            starting_time = time.time()
            for batch_index_validation, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(validation_questions, validation_answers, batch_size)):
                batch_validation_loss_error = session.run(loss_error, {inputs: padded_questions_in_batch,
                                                                       targets: padded_answers_in_batch,
                                                                       lr: learning_rate,
                                                                       sequence_length: padded_answers_in_batch.shape[1],
                                                                       keep_prob: 1})
                total_validation_loss_error += batch_validation_loss_error
            ending_time = time.time()
            batch_time = ending_time - starting_time
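            # split_into_batches yields len(validation_questions) // batch_size full batches, so average over that count
            average_validation_loss_error = total_validation_loss_error / (len(validation_questions) // batch_size)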
            print('Validation Loss Error: {:>6.3f}, Batch Validation Time: {:d} seconds'.format(average_validation_loss_error, int(batch_time)))
            learning_rate *= learning_rate_decay
            if learning_rate < min_learning_rate:
                learning_rate = min_learning_rate
            list_validation_loss_error.append(average_validation_loss_error)
            if average_validation_loss_error <= min(list_validation_loss_error):
                print('I speak better now!!')
                early_stopping_check = 0
                saver = tf.train.Saver()
                saver.save(session, checkpoint)
            else:
                print("Sorry I do not speak better, I need to practice more.")
                early_stopping_check += 1
                if early_stopping_check == early_stopping_stop:
                    break
    if early_stopping_check == early_stopping_stop:
        print("My apologies, I cannot speak better anymore. This is the best I can do.")
        break
print("Game Over")
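
The early stopping bookkeeping in the training loop can be isolated in a few lines. The sketch below replays a made-up list of validation losses; the function name replay_early_stopping and the loss values are hypothetical, and patience plays the role of early_stopping_stop.

```python
def replay_early_stopping(validation_losses, patience):
    """Replay a list of validation losses and report where training would stop.

    Returns (best_loss, stop_index); stop_index is None if the patience
    budget is never exhausted.
    """
    best_loss = float('inf')
    checks_without_improvement = 0
    for index, loss in enumerate(validation_losses):
        if loss <= best_loss:
            best_loss = loss
            checks_without_improvement = 0  # this is where a checkpoint would be saved
        else:
            checks_without_improvement += 1
            if checks_without_improvement == patience:
                return best_loss, index
    return best_loss, None


# Hypothetical validation losses from successive checks.
losses = [3.2, 2.9, 2.7, 2.8, 2.75, 2.9, 2.6]
print(replay_early_stopping(losses, patience=3))  # (2.7, 5)
```

With a patience of 3, training stops at the sixth check (index 5), after three checks in a row without improving on the best loss of 2.7. The counter resets to 0 every time a new best loss is found.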

--------------------------------------------------------------------------------

# PART 4 - TESTING THE SEQ2SEQ MODEL

--------------------------------------------------------------------------------

# Loading the weights and running the session

In [None]:
checkpoint = "./chatbot_weights.ckpt"
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())
saver = tf.train.Saver()
saver.restore(session, checkpoint)

--------------------------------------------------------------------------------
# Converting the questions from strings to list of encoding integers

In [None]:
def convert_string2int(question, word2int):
    question = clean_text(question)
    return [word2int.get(word, word2int['<OUT>']) for word in question.split()]
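
A quick way to see what convert_string2int returns is to run it on a toy vocabulary. In the sketch below, simple_clean_text is a hypothetical stand-in for the notebook's clean_text (it only lowercases the text), and the vocabulary is made up for illustration.

```python
def simple_clean_text(text):
    """Hypothetical stand-in for the notebook's clean_text: lowercase only."""
    return text.lower()


def convert_string2int(question, word2int):
    question = simple_clean_text(question)
    return [word2int.get(word, word2int['<OUT>']) for word in question.split()]


# Hypothetical toy vocabulary.
toy_word2int = {'<PAD>': 0, '<OUT>': 1, 'how': 2, 'are': 3, 'you': 4}
print(convert_string2int("How are you today", toy_word2int))  # [2, 3, 4, 1]
```

The word "today" is not in the toy vocabulary, so it is mapped to the OUT token's integer.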

--------------------------------------------------------------------------------
# Setting up the chat

In [None]:
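while True:
    question = input("You: ")
    if question == 'Goodbye':
        break
    # Encode the question, then truncate or pad it to exactly 25 tokens
    question = convert_string2int(question, questionswords2int)[:25]
    question = question + [questionswords2int['<PAD>']] * (25 - len(question))
    # The model expects a full batch, so place the question in the first row of a batch of zeros
    fake_batch = np.zeros((batch_size, 25))
    fake_batch[0] = question
    # Dropout is only needed during training, so keep every unit (keep_prob = 1) when predicting
    predicted_answer = session.run(test_predictions, {inputs: fake_batch, keep_prob: 1})[0]
    answer = ''
    # Pick the most likely word at each time step and stop at the end of the sentence
    for i in np.argmax(predicted_answer, 1):
        if answersints2word[i] == 'i':
            token = ' I'
        elif answersints2word[i] == '<EOS>':
            token = '.'
        elif answersints2word[i] == '<OUT>':
            token = ' out'
        else:
            token = ' ' + answersints2word[i]
        answer += token
        if token == '.':
            break
    print('ChatBot: ' + answer)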
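
The decoding step at the end of the chat loop can be tried without a trained model. The sketch below is a simplified plain-Python version: the names argmax and decode_answer, the toy vocabulary and the score table are all hypothetical, and the special handling of the OUT token is left out.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda index: values[index])


def decode_answer(predicted_scores, ints2word):
    """Turn one row of scores per time step into a sentence, stopping at <EOS>."""
    answer = ''
    for step_scores in predicted_scores:
        word = ints2word[argmax(step_scores)]
        if word == '<EOS>':
            return answer + '.'
        answer += ' ' + ('I' if word == 'i' else word)
    return answer


toy_ints2word = {0: '<PAD>', 1: '<EOS>', 2: 'i', 3: 'am', 4: 'fine'}
toy_scores = [[0.1, 0.0, 0.7, 0.1, 0.1],   # 'i'
              [0.0, 0.1, 0.1, 0.6, 0.2],   # 'am'
              [0.0, 0.2, 0.1, 0.1, 0.6],   # 'fine'
              [0.1, 0.8, 0.0, 0.1, 0.0],   # '<EOS>'
              [0.9, 0.0, 0.0, 0.1, 0.0]]   # ignored: comes after <EOS>
print('ChatBot:' + decode_answer(toy_scores, toy_ints2word))  # ChatBot: I am fine.
```

At each time step the word with the highest score is chosen, and decoding stops as soon as the end-of-sentence token is produced.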