**Initialization**
* I use these 3 lines of code on top of my each Notebooks because it will help to prevent any problems while reloading and reworking on a Project or Problem. And the third line of code helps to make visualization within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Dependencies**
* I have downloaded all the Libraries and Dependencies required for this Project in one particular cell.

In [4]:
#@ Downloading the Libraries and Dependencies:
# !pip install nlpia                                                       # Downloading the NLPIA Package.

import numpy as np                                                         # Module for matrix multiplication.
from nlpia.loaders import get_data 
import os
from random import shuffle                                                 # Module for shuffling the Dataset.
from IPython.display import display

from keras.models import Model
from keras.layers import Input, LSTM, Dense

**Getting the Data**
* I will use the [**Cornell Movie Dialog Dataset**](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Using the entire Cornell Movie Dialog Dataset can be computationally intensive because a few sequences have more than 2000 tokens. I will use **NLPIA** Package to load the Cornell Movie Dialog Dataset and I will pre process the Dialog Corpus.

In [5]:
#@ Getting the Data:
df = get_data("moviedialog")                                                # Accessing the Cornell Movie Dialog Corpus.

#@ Processing the Data:
input_texts = []                                                            # The array holds the input text from the Corpus.
target_texts = []                                                           # The array holds the target text from the Corpus.
input_vocabulary = set()                                                    # Holds the seen characters in the input text.
output_vocabulary = set()                                                   # Holds the seen characters in the target txt.

start_token = "\t"                                                          # Target sequence is annotated with start Token.
stop_token = "\n"                                                           # Target sequence is annotated with sop Token.
max_training_samples = min(25000, len(df) - 1)                              # Defines the lines used for Training.

for input_text, target_text in zip(df.statement, df.reply):
  target_text = start_token + target_text + stop_token                      # The Target Text needs to be wrapped with start and stop tokens.
  input_texts.append(input_text)
  target_texts.append(target_text)
  
  #@ Compiling the Vocabulary set:
  for char in input_text:
    if char not in input_vocabulary:
      input_vocabulary.add(char)
  
  for char in target_text:
    if char not in output_vocabulary:
      output_vocabulary.add(char)

#@ Inspecting the Data:
display(f"Number of samples: {len(input_texts)}")
display(f"Number of unique Input Tokens: {len(input_vocabulary)}")
display(f"Number of unique Output Tokens: {len(output_vocabulary)}")

'Number of samples: 64350'

'Number of unique Input Tokens: 44'

'Number of unique Output Tokens: 46'

**Building the Character Dictionary**
* I will convert each characters of the Input and Target Texts into one hot vectors that represent each characters. In order to generate one hot vectors I will generate token dictionaries where every character is mapped to an index. I will also generate the reverse dictionaries which will be used to convert generated index into characters. 

In [6]:
#@ Sorting the List of Characters:
input_vocabulary = sorted(input_vocabulary)
output_vocabulary = sorted(output_vocabulary)

#@ Calculating the Maximum number of Unique Characters:
input_vocab_size = len(input_vocabulary)
output_vocab_size = len(output_vocabulary)

#@ Determining the Maximum number of Sequence Tokens:
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

#@ Creating the Token Dictionaries:
input_token_index = dict([(char, i) for i,char in enumerate(input_vocabulary)])
target_token_index = dict([(char, i) for i,char in enumerate(output_vocabulary)])

#@ Creating the Reverse Token Dictionaries:
reverse_input_char_index = dict((i, char) for char,i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char,i in target_token_index.items())

#@ Inspecting the Data:
display(f"Maximum sequence length for Inputs: {max_encoder_seq_length}")
display(f"Maximum sequence length for Outputs: {max_decoder_seq_length}")

'Maximum sequence length for Inputs: 100'

'Maximum sequence length for Outputs: 102'

**Generating One Hot Encoded Training sets**
* Now, I will convert the input and target text into one hot Encoded Tensors.

In [7]:
#@ Creating character sequence Encoder and Decoder Training Set:

#@ Initializing the Tensors with zeros:
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, input_vocab_size), dtype="float32")
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, output_vocab_size), dtype="float32")
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, output_vocab_size), dtype="float32")

#@ Looping over the Training Samples:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
  #@ Looping over each character of each Samples:
  for t, char in enumerate(input_text):
    encoder_input_data[i, t, input_token_index[char]] = 1.
  for t, char in enumerate(target_text):
    decoder_input_data[i, t, target_token_index[char]] = 1.
    if t > 0:
      decoder_target_data[i, t-1, target_token_index[char]] = 1.
  # decoder_input_data[i, t+1:, target_token_index[" "]] = 1.
  # decoder_target_data[i, t:, target_token_index[" "]] = 1.

### **Sequence to Sequence Chatbot**
* I have completed all the Training set preparations by performing the tasks such as Converting the preprocessed Corpus into Input and Target Samples and creating Index Dictionaries and converting the Samples into One hot Tensors. Now, I will train the Sequence to sequence Chatbot. 

In [8]:
#@ Parameters of LSTM Neural Networks:
batch_size = 64                                 # Number of samples shown to the network before updating the weights.
epochs = 100                                    # Number of times for passing the Training.
num_neurons = 256                               # Setting the number of neuron dimensions to 256.

In [9]:
#@ Sequence to Sequence Encoder Decoder Network:

#@ Creating the Thought Encoder using Keras Functional API:
encoder_inputs = Input(shape=(None, input_vocab_size))
encoder = LSTM(num_neurons, return_state=True)                                          # Returning the internal state of LSTM.
encoder_outputs, state_h, state_c = encoder(encoder_inputs) 
encoder_states = [state_h, state_c]                                                     # First value of LSTM is the Output.

#@ Creating the Thought Decoder using Keras Functional API:
decoder_inputs = Input(shape=(None, output_vocab_size))
decoder_lstm = LSTM(
    num_neurons, return_sequences=True, return_state=True
)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)      # Passing initial state to the LSTM Layer.
decoder_dense = Dense(output_vocab_size, activation="softmax")                          
decoder_outputs = decoder_dense(decoder_outputs)                                        # Passing the output to the Softmax Layer.

#@ Creating the Sequence to Sequence Neural Network Model:
model = Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs
)

#@ Compiling the Sequence to Sequence Neural Network Model:
model.compile(
    optimizer="rmsprop",
    loss="categorical_crossentropy",                                                    # Using Categorical Crossentropy.
    metrics=["accuracy"]
)

#@ Training the Sequence to Sequence Neural Network Model:
model.fit(
    [encoder_input_data, decoder_input_data], 
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1                                                               # 10% of Samples are splitted for Validation.
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f06ee3616a0>

**Saving the Sequence to Sequence Chatbot Model**

In [None]:
#@ Saving the Sequence to Sequence Network Model:
model_structure = model.to_json()
with open("chatbot_model.json", "w") as json_file:
  json_file.write(model_structure)
model.save_weights("chatbot_model.h5")
print("Model saved successful!!")

model.load_weights("chatbot_model.h5")                         # Loading the saved Model.

**Assembling the Model for Sequence Generation**

In [10]:
#@ Creating the Response Generator Model:
encoder_model = Model(encoder_inputs, encoder_states)
thought_input = [Input(shape=(num_neurons,)), Input(shape=(num_neurons,))]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=thought_input)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

#@ Creating the Model:
decoder_model = Model(
    inputs=[decoder_inputs] + thought_input,
    outputs=[decoder_outputs] + decoder_states
)

### **Predicting the Sequence**
* I will define a Function for generating the Response of the Chatbot. This Function is the heart of Response Generation of the Chatbot which accepts one hot encoded input sequence, generates the Thought Vector and the Thought Vector generates the appropriate response by using the **Trained Network**. 

In [12]:
#@ Building the Character Based Translator:
def decode_sequence(input_seq):
  #@ Generating the Thought Vector:
  thought = encoder_model.predict(input_seq)

  target_seq = np.zeros((1, 1, output_vocab_size))                          # Initializing it as a Zero Tensor.
  target_seq[0, 0, target_token_index[stop_token]] = 1.                           # First Input Token to the decoder is the input token.

  stop_condition = False
  generated_sequence = ""
  while not stop_condition:
    output_tokens, h, c = decoder_model.predict([target_seq] + thought)     # Passing the generated Token and latest state to the Decoder.
    generated_token_idx = np.argmax(output_tokens[0, -1, :])
    generated_char = reverse_target_char_index[generated_token_idx]
    generated_sequence += generated_char
    if (generated_char == stop_token or 
        len(generated_sequence) > max_decoder_seq_length):
      stop_condition = True                                                 # Setting the condition to True will stop the Loop.
    
    target_seq = np.zeros((1, 1, output_vocab_size))
    target_seq[0, 0, generated_token_idx] = 1.
    thought = [h, c]                                                        # Updating the Thought Vector.

  return generated_sequence

**Generating the Response**
* Now, I will define a helper function to convert the Input String into a reply for the Chatbot to use.

In [13]:
def Response(input_text):
  input_seq = np.zeros((1, max_encoder_seq_length, input_vocab_size), dtype="float32")
  for t, char in enumerate(input_text):
    input_seq[0, t, input_token_index[char]] = 1.
  decoded_sentence = decode_sequence(input_seq)
  print("T2 Reply:", decoded_sentence)

In [21]:
Response("do you sing a song?")

T2 Reply: not at all.



In [20]:
Response("do you like coffee?")

T2 Reply: i don't know. i haven't seen the way you want to get the disturbers and then we can get the general to 


In [16]:
Response("do you like football?")

T2 Reply: no.

