**Initialization**
* I place these three lines of code at the top of each of my Notebooks. The first two lines automatically reload imported modules, which prevents problems caused by stale code when I return to a Project or Problem and rework it. The third line renders the visualizations inline within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Dependencies**
* I have imported all the Libraries and Dependencies required for this Project in a single cell. The commented `pip` line installs the NLPIA Package if it is not already available.

In [4]:
#@ Downloading the Libraries and Dependencies:
# !pip install nlpia                                                       # Downloading the NLPIA Package.

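from nlpia.loaders import get_data                                         # Function for loading the NLPIA Datasets.
import os
from random import shuffle                                                 # Function for shuffling the Dataset.
from IPython.display import display                                        # Function for displaying objects in the Notebook.

from keras.models import Model                                             # Keras functional API Model class.
from keras.layers import Input, LSTM, Dense                                # Layers used to build the Network.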

**Getting the Data**
* I will use the [**Cornell Movie Dialog Dataset**](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Training on the entire Dataset can be computationally intensive because a few sequences have more than 2000 tokens. I will use the **NLPIA** Package to load the Dataset and then preprocess the Dialog Corpus.

In [5]:
#@ Getting the Data:
df = get_data("moviedialog")                                                # Accessing the Cornell Movie Dialog Corpus.

#@ Processing the Data:
input_texts = []                                                            # The list holds the input texts from the Corpus.
target_texts = []                                                           # The list holds the target texts from the Corpus.
input_vocabulary = set()                                                    # Holds the characters seen in the input texts.
output_vocabulary = set()                                                   # Holds the characters seen in the target texts.

start_token = "\t"                                                          # Target sequence is annotated with a start Token.
stop_token = "\n"                                                           # Target sequence is annotated with a stop Token.
max_training_samples = min(25000, len(df) - 1)                              # Upper limit on the number of lines used for Training.

for input_text, target_text in zip(df.statement, df.reply):
  target_text = start_token + target_text + stop_token                      # The Target Text needs to be wrapped with start and stop tokens.
  input_texts.append(input_text)
  target_texts.append(target_text)
  
  #@ Compiling the Vocabulary set:
  input_vocabulary.update(input_text)                                       # Adds every character of the input text to the set.
  output_vocabulary.update(target_text)                                     # Adds every character of the target text to the set.
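
The preprocessing steps above can be tried on a few hand-written pairs before running them on the full Corpus. The sketch below is only an illustration: the `toy_pairs` list and the `max_seq_length` cap are hypothetical and are not part of the NLPIA Package or of this Notebook. It assumes that capping the number of samples and skipping very long texts is an acceptable way to keep the training cost down.

```python
#@ A minimal sketch on toy data (hypothetical pairs, not the Cornell Corpus):
toy_pairs = [
  ("Hi there.", "Hello!"),
  ("x" * 50, "Too long, skipped."),                 # Exceeds the length cap below.
  ("How are you?", "Fine, thanks."),
  ("Bye.", "See you."),                             # Never reached: the sample cap is hit first.
]

start_token = "\t"                                  # Marks the start of a target sequence.
stop_token = "\n"                                   # Marks the end of a target sequence.
max_training_samples = 2                            # Hypothetical cap on the number of pairs kept.
max_seq_length = 20                                 # Hypothetical cap on the characters per text.

input_texts, target_texts = [], []
input_vocabulary, output_vocabulary = set(), set()

for input_text, target_text in toy_pairs:
  if len(input_texts) >= max_training_samples:      # Stop once enough pairs are collected.
    break
  if len(input_text) > max_seq_length or len(target_text) > max_seq_length:
    continue                                        # Skip pairs that are too long.
  target_text = start_token + target_text + stop_token
  input_texts.append(input_text)
  target_texts.append(target_text)
  input_vocabulary.update(input_text)               # A set ignores characters it already holds.
  output_vocabulary.update(target_text)

print(input_texts)
print(target_texts)
```

Only the target texts are wrapped, so the start and stop Tokens end up in the output vocabulary and not in the input vocabulary.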

**Building the Character Dictionary**
* I will convert each character of the Input and Target Texts into a one-hot vector. To generate the one-hot vectors, I will build token dictionaries in which every character is mapped to an index. I will also build the reverse dictionaries, which are used to convert a generated index back into a character.

In [6]:
#@ Sorting the List of Characters:
input_vocabulary = sorted(input_vocabulary)
output_vocabulary = sorted(output_vocabulary)

#@ Calculating the number of Unique Characters:
input_vocab_size = len(input_vocabulary)
output_vocab_size = len(output_vocabulary)

#@ Determining the Maximum number of Sequence Tokens:
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts)

#@ Creating the Token Dictionaries:
input_token_index = {char: i for i, char in enumerate(input_vocabulary)}
target_token_index = {char: i for i, char in enumerate(output_vocabulary)}

#@ Creating the Reverse Token Dictionaries:
reverse_input_char_index = {i: char for char, i in input_token_index.items()}
reverse_target_char_index = {i: char for char, i in target_token_index.items()}
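
To make the role of the two kinds of dictionaries concrete, here is a minimal sketch of one-hot encoding and decoding on a toy vocabulary. The `toy_texts` list and the helper functions `one_hot_encode` and `one_hot_decode` are hypothetical names introduced only for illustration. Plain Python lists are used so the sketch runs without extra dependencies; an array library such as NumPy is the more common choice in practice.

```python
#@ A minimal sketch of one-hot encoding on a toy vocabulary (hypothetical data):
toy_texts = ["abc", "cab", "ba"]

vocabulary = sorted(set("".join(toy_texts)))                          # Sorted unique characters.
token_index = {char: i for i, char in enumerate(vocabulary)}          # Character to index.
reverse_token_index = {i: char for char, i in token_index.items()}    # Index to character.

vocab_size = len(vocabulary)
max_seq_length = max(len(txt) for txt in toy_texts)                   # A single integer, used for padding.

def one_hot_encode(text):
  """Return a max_seq_length x vocab_size matrix with one 1.0 per character."""
  matrix = [[0.0] * vocab_size for _ in range(max_seq_length)]
  for position, char in enumerate(text):
    matrix[position][token_index[char]] = 1.0
  return matrix

def one_hot_decode(matrix):
  """Map every non-padding row back to its character with the reverse dictionary."""
  chars = []
  for row in matrix:
    if max(row) == 0.0:                                               # An all-zero row is padding.
      continue
    chars.append(reverse_token_index[row.index(1.0)])
  return "".join(chars)

encoded = one_hot_encode("ba")
print(encoded)
print(one_hot_decode(encoded))
```

Texts shorter than the longest sequence are padded with all-zero rows, which is why the maximum sequence length has to be a single integer and not a list of lengths.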