# **Aim of the Project**
The aim of the project is to build an intelligent
conversational chatbot, Riki, that can understand
complex queries from the user and respond intelligently.


# **Business Requirement**
R-Intelligence Inc. has invested in Python, PySpark, and TensorFlow. Using the emerging technologies of Artificial Intelligence, Machine Learning, and Natural Language Processing, **Riki, the chatbot,** should make the whole conversation as realistic as talking to an actual human.

The chatbot should understand that users have different intents and make it extremely simple to work around these by presenting the users with options and recommendations that best suit their needs.

# **-- Import all the necessary Python packages**

In [5]:
import re
import numpy as np
import pandas as pd
import os
import random
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical
import warnings
warnings.filterwarnings("ignore")

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **-- Data Preparation**

## **Load Pre-trained GloVe: Global Vectors for Word Representation**
Download the GloVe model available at https://nlp.stanford.edu/projects/glove/
Specification: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

Load the GloVe word embeddings into a dictionary where the key is a unique word token
and the value is a d-dimensional vector.

In [6]:
vector_size = 25
glove_embedding = {}
# Open read-only; the context manager closes the file even if parsing fails.
with open('/content/drive/MyDrive/glove.twitter.27B.25d.txt', 'r', encoding="utf8") as f:
    for line in f:
        parts = line.split()
        # The last `vector_size` fields are the vector; everything before them is the token.
        word = " ".join(parts[:-vector_size])
        glove_embedding[word] = np.array([float(val) for val in parts[-vector_size:]])
# Constant vectors for the start ('gooooooooossss') and end ('eooooooooossss') markers
# that are later added around each target sentence.
glove_embedding['gooooooooossss'] = np.full(vector_size, 1.0)
glove_embedding['eooooooooossss'] = np.full(vector_size, 0.5)

## **Load Dataset Cornell Movie--Dialogs Corpus**

This corpus contains a large metadata-rich collection of fictional conversations extracted from
raw movie scripts:
➢ 220,579 conversational exchanges between 10,292 pairs of movie characters
➢ involves 9,035 characters from 617 movies
➢ in total 304,713 utterances

In all files the field separator is " +++$+++ ".

The file `movie_lines.txt` contains the actual text of each utterance.
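
As a minimal sketch of how one record is split on this separator, the snippet below parses a single line. The sample line is made up for illustration and is not taken from the corpus; the field names simply mirror the column names used when the file is loaded.

```python
# Field separator used by the corpus files (note the surrounding spaces).
SEPARATOR = " +++$+++ "

# Hypothetical sample record, invented for illustration only.
sample_line = "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there."

# Assumed field names, mirroring the columns used when loading the file.
field_names = ["LineID", "Character", "Movie", "Name", "Line"]

# Split the record into its five fields and label them.
fields = sample_line.split(SEPARATOR)
record = dict(zip(field_names, fields))
print(record["LineID"], "->", record["Line"])
```

Splitting on the separator together with its surrounding spaces leaves no stray whitespace in the individual fields.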

In [31]:
movie_lines_features = ["LineID", "Character", "Movie", "Name", "Line"]
movie_lines = pd.read_csv("/content/drive/MyDrive/movie_lines.txt", sep = r"\+\+\+\$\+\+\+", engine = "python", index_col = False, names = movie_lines_features)

# Using only the required columns, namely, "LineID" and "Line"
movie_lines = movie_lines[["LineID", "Line"]]

# Strip the surrounding spaces from "LineID" for further usage
movie_lines["LineID"] = movie_lines["LineID"].apply(str.strip)



Keep only consecutive lines that both contain at most 7 words, and store
them as (question, reply) dialogue pairs

In [None]:
pairs = []
lines = movie_lines["Line"]
max_words = 7
# Pair each line with the one that follows it, keeping only short utterances.
for i in range(len(lines) - 1):
  first, second = str(lines[i]), str(lines[i+1])
  if len(first.strip().split(' ')) <= max_words and len(second.strip().split(' ')) <= max_words:
    pairs.append((first, second))
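
The pairing rule can be checked on a toy list. The sketch below is only an illustration with made-up lines; `build_pairs` and `max_words` are hypothetical names and not part of the notebook.

```python
def build_pairs(lines, max_words=7):
    """Pair each line with the next one when both have at most `max_words` words."""
    pairs = []
    for first, second in zip(lines, lines[1:]):
        first, second = str(first), str(second)
        short_first = len(first.strip().split(' ')) <= max_words
        short_second = len(second.strip().split(' ')) <= max_words
        if short_first and short_second:
            pairs.append((first, second))
    return pairs

# Made-up utterances; the third one has more than seven words.
toy_lines = [
    "How are you?",
    "Fine, thanks.",
    "This line is deliberately much longer than seven words in total.",
    "Okay.",
]
toy_pairs = build_pairs(toy_lines)
print(toy_pairs)
```

Note that this rule pairs lines purely by their order in the file; it does not check that the two lines belong to the same conversation.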

## **Data Cleaning**

In [22]:
word_mapping = {"ain't": "is not","aren't": "are not","can't": "cannot","'cause": "because", "could've": "could have",
                "couldn't": "could not","didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is", 
                "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would", 
                "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have", 
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have",
                "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", 
                "it'll": "it will", "it'll've": "it will have", "it's": "it is", "let's": "let us", 
                "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not",
                "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", 
                "needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock", "oughtn't": "ought not",
                "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", 
                "she's": "she is", "should've": "should have", "shouldn't": "should not","shouldn't've": "should not have", 
                "so've": "so have","so's": "so as", "this's": "this is", "that'd": "that would","that'd've": "that would have", 
                "that's": "that is","there'd": "there would", "there'd've": "there would have",  "there's": "there is",
                "here's": "here is","they'd": "they would","they'd've": "they would have","they'll": "they will", 
                "they'll've": "they will have", "they're": "they are", "they've": "they have","to've": "to have",
                "wasn't": "was not","we'd": "we would", "we'd've": "we would have", "we'll": "we will","we'll've": "we will have", 
                "we're": "we are", "we've": "we have","weren't": "were not", "what'll": "what will", "what'll've": "what will have", 
                "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have",
                "where'd": "where did", "where's": "where is","where've": "where have", 
                "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", 
                "why've": "why have", "will've": "will have", "won't": "will not","won't've": "will not have",
                "would've": "would have","wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                "y'all'd": "you all would", "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
                "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                "you're": "you are", "you've": "you have", "'bout": "about", "intellectu": "intellectually","arwticle": "article",
                "dissconnected": "disconnected", "deaaaddddd": "dead", "y-y-y-you": "you","g-g-g-going": "going",
                "t-t-t-to": "to","muh-muh-muh-marry": "marry","Ah-ah-ah-are": "are","C-C-C-C-Candy": "Candy",
                "I-I-I-I": "I","th-th-think": "think"}

In [23]:
def clean_text(text):
    text = text.lower()
    text = re.sub('"','', text)
    text = ' '.join([word_mapping[word] if word in word_mapping else word for word in text.split(' ')])
    text = re.sub(r"'s\b", '', text)
    text = re.sub("[^a-zA-Z0-9]", " ", text) 
    tokens = [word for word in text.split()]
    return " ".join(tokens).strip()
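
To see what the cleaning steps do, here is a self-contained sketch that repeats the same steps with a deliberately tiny contraction map. `clean_text_demo` and `small_mapping` are assumptions made for illustration; the notebook itself uses `clean_text` with the full `word_mapping`.

```python
import re

# Tiny stand-in for the full contraction dictionary.
small_mapping = {"don't": "do not", "it's": "it is"}

def clean_text_demo(text, mapping=small_mapping):
    text = text.lower()                       # normalise case
    text = re.sub('"', '', text)              # drop double quotes
    # Expand contractions that match a whole space-separated word.
    text = ' '.join(mapping.get(word, word) for word in text.split(' '))
    text = re.sub(r"'s\b", '', text)          # drop possessive 's
    text = re.sub("[^a-zA-Z0-9]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())             # collapse repeated spaces

print(clean_text_demo("Don't worry, it's John's!"))  # do not worry it is john
print(clean_text_demo("Don't, please."))             # don t please
```

The second call shows a limitation of matching on space-separated words: a contraction immediately followed by punctuation is not expanded.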

In [32]:
final_pairs = []
for sent1, sent2 in pairs:
    sent1 = clean_text(sent1)
    sent2 = clean_text(sent2)
    final_pairs.append((sent1, sent2))

In [33]:
final_pairs

[('they do not', 'they do to'),
 ('they do to', 'i hope so'),
 ('i hope so', 'she okay'),
 ('she okay', 'let us go'),
 ('let us go', 'wow'),
 ('like my fear of wearing pastels', 'the real you'),
 ('the real you', 'what good stuff'),
 ('what crap', 'do you listen to this crap'),
 ('do you listen to this crap', 'no'),
 ('you always been this selfish', 'but'),
 ('but', 'then that is all you had to say'),
 ('then that is all you had to say', 'well no'),
 ('tons', 'have fun tonight'),
 ('have fun tonight', 'i believe we share an art instructor'),
 ('i believe we share an art instructor', 'you know chastity'),
 ('you know chastity', 'looks like things worked out tonight huh'),
 ('looks like things worked out tonight huh', 'hi'),
 ('you got something on your mind', 'where'),
 ('where', 'there'),
 ('forget french', 'that is because it is such a nice one'),
 ('c esc ma tete this is my head', 'let me see what i can do'),
 ('great', 'joey'),
 ('joey', 'who'),
 ('you might wanna think about it', '

# **-- Model Architecture**

## **Create two dictionaries:**
* target_word2id
* target_id2word

## **Prepare the input data with embedding.**

The input data is a list of lists:
* The outer list is a list of sentences
* Each sentence is a list of words
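
One common way to feed pre-trained vectors to a network is an embedding matrix whose row `i` holds the vector of the word with id `i`. The sketch below assumes a GloVe-style dictionary and a Keras-style word index; `build_embedding_matrix`, `toy_glove`, and `toy_word_index` are hypothetical and use made-up 3-dimensional vectors. It is an illustration of the idea, not necessarily how this notebook prepares its inputs.

```python
import numpy as np

def build_embedding_matrix(word_index, embedding, vector_size):
    """Return a (vocab_size + 1, vector_size) matrix; row 0 is reserved for padding."""
    matrix = np.zeros((len(word_index) + 1, vector_size))
    for word, idx in word_index.items():
        vector = embedding.get(word)
        if vector is not None:
            # Words missing from the embedding keep an all-zero row.
            matrix[idx] = vector
    return matrix

# Made-up vectors and ids for illustration.
toy_glove = {"hello": np.array([0.1, 0.2, 0.3]), "world": np.array([0.4, 0.5, 0.6])}
toy_word_index = {"hello": 1, "world": 2, "riki": 3}

embedding_matrix = build_embedding_matrix(toy_word_index, toy_glove, vector_size=3)
print(embedding_matrix.shape)
```

A matrix like this could be used to initialise an `Embedding` layer, so that the network receives dense vectors instead of one-hot vectors.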

In [34]:
input_docs = [sent1 for sent1, sent2 in final_pairs]
target_docs = ['gooooooooossss '+ sent2 +' eooooooooossss' for sent1, sent2 in final_pairs]

In [35]:
len(input_docs)

88424

## **LSTM encoder**

***Step 1 :*** Encode the input words into (encoder
outputs, encoder hidden state, encoder context).

In [36]:
enc_tokenizer = Tokenizer()
enc_tokenizer.fit_on_texts(input_docs)
enc_tokenized_sents = enc_tokenizer.texts_to_sequences(input_docs)

max_input_length = max([len(tokens.split(' ')) for tokens in input_docs])

input_pad_data = pad_sequences(enc_tokenized_sents, max_input_length, padding='post', value=0)
encoder_input_data = np.array(input_pad_data)

enc_target_word2id = enc_tokenizer.word_index
enc_target_id2word = dict((token, word) for word, token in enc_target_word2id.items())
enc_nbr_tokens = len(enc_target_word2id)+1
print(max_input_length, enc_nbr_tokens, encoder_input_data.shape)

18 16385 (88424, 18)


## **LSTM decoder**

***Step 2 :*** Encode the target words into (decoder
outputs, decoder hidden state, decoder context). The encoder hidden state and
encoder context (which represent the input memory) are used as the decoder's initial state.

In [37]:
# DECODER IP: <START> HELLO WORLD
dec_tokenizer = Tokenizer(split=' ', lower=False)
dec_tokenizer.fit_on_texts(target_docs)
dec_tokenized_sents = dec_tokenizer.texts_to_sequences(target_docs)

max_target_length = max([len(tokens.split(' ')) for tokens in target_docs])
    
dec_input_data = [sent[:-1] for sent in dec_tokenized_sents]    
dec_input_pad_data = pad_sequences(dec_input_data, max_target_length-1, padding='post', value=0)
decoder_input_data = np.array(dec_input_pad_data)

dec_target_word2id = dec_tokenizer.word_index
dec_target_id2word = dict((token, word) for word, token in dec_target_word2id.items())
dec_nbr_tokens = len(dec_target_word2id)+1
print(max_target_length, dec_nbr_tokens, decoder_input_data.shape)

21 16411 (88424, 20)


In [38]:
# HELLO WORLD <eos>
target_output = [sent[1:] for sent in dec_tokenized_sents]

dec_output_pad_data = pad_sequences(target_output, max_target_length-1, padding='post', value=0)
decoder_output_data = np.array(dec_output_pad_data)
decoder_output_data.shape

(88424, 20)
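
The decoder input and the decoder target are the same token sequence shifted by one position, so that at each step the decoder sees the previous token and has to predict the next one. A minimal pure-Python sketch with made-up token ids (here 1 stands for the start marker and 2 for the end marker, which is an assumption for illustration; `shift_and_pad` is a hypothetical helper):

```python
def shift_and_pad(sequence, length, pad_value=0):
    """Build post-padded decoder input (without last token) and target (without first token)."""
    def pad(seq):
        return seq + [pad_value] * (length - len(seq))

    decoder_input = sequence[:-1]   # <start> w1 w2
    decoder_target = sequence[1:]   # w1 w2 <end>
    return pad(decoder_input), pad(decoder_target)

tokens = [1, 7, 9, 2]  # <start> w1 w2 <end>
dec_in, dec_out = shift_and_pad(tokens, length=5)
print(dec_in)   # [1, 7, 9, 0, 0]
print(dec_out)  # [7, 9, 2, 0, 0]
```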

## **Train Model**

In [39]:
def training_data_generator(enc_data, dec_ip, dec_op, enc_nbr_tokens, dec_nbr_tokens, batch_size=64):
    # Yield one-hot encoded batches indefinitely, wrapping around at the end of the data.
    n_samples = len(enc_data)
    i = 0
    while True:
        end = min(i + batch_size, n_samples)
        # Separate batch names keep the full input arrays from being overwritten.
        enc_batch = to_categorical(enc_data[i:end], enc_nbr_tokens)
        dec_ip_batch = to_categorical(dec_ip[i:end], dec_nbr_tokens)
        dec_op_batch = to_categorical(dec_op[i:end], dec_nbr_tokens)
        # Restart from the beginning once the last (possibly shorter) batch is produced.
        i = 0 if end >= n_samples else end
        yield enc_batch, dec_ip_batch, dec_op_batch
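
A plausible reason for one-hot encoding batch by batch is memory: every token id becomes a vector as long as the vocabulary. The sketch below mimics the one-hot step with NumPy on a toy batch and estimates the size of the encoder input from the shapes reported above (sequence length 18, 16385 tokens). The helper names and the 4-byte `float32` element size are assumptions.

```python
import numpy as np

def one_hot(ids, num_classes):
    """One-hot encode an integer array by indexing rows of an identity matrix."""
    return np.eye(num_classes, dtype="float32")[ids]

# Toy batch: 2 sentences, 3 time steps, vocabulary of 4 ids (0 is padding).
toy_batch = np.array([[1, 2, 0], [3, 0, 0]])
encoded = one_hot(toy_batch, num_classes=4)
print(encoded.shape)  # (2, 3, 4)

def one_hot_bytes(n_samples, seq_len, vocab_size, bytes_per_value=4):
    """Memory needed for a dense one-hot tensor of the given shape."""
    return n_samples * seq_len * vocab_size * bytes_per_value

print(one_hot_bytes(512, 18, 16385) / 1e6, "MB for one batch of 512")
print(one_hot_bytes(88424, 18, 16385) / 1e9, "GB for all 88424 inputs")
```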

### **Define model**

***Step 3 :*** Use a dense layer to predict the next token from the vocabulary, given
the decoder output.

In [40]:
def define_models(n_input, n_output, n_units):
    # Training encoder: only the final hidden and cell states are kept.
    encoder_inputs = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]
    # Training decoder: starts from the encoder states and returns the full output sequence.
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    # Dense softmax layer predicts the next token at every time step.
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    # Inference encoder: maps an input sequence to its final states.
    encoder_model = Model(encoder_inputs, encoder_states)
    # Inference decoder: takes the previous token and states, returns the
    # next-token distribution together with the updated states.
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    return model, encoder_model, decoder_model

## **-- Generate the model summary**


In [41]:
model, infenc, infdec = define_models(enc_nbr_tokens, dec_nbr_tokens, 256)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None, 16385) 0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None, 16411) 0                                            
__________________________________________________________________________________________________
lstm (LSTM)                     [(None, 256), (None, 17041408    input_1[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, None, 256),  17068032    input_2[0][0]                    
                                                                 lstm[0][1]                   

In [42]:
infenc.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None, 16385)]     0         
_________________________________________________________________
lstm (LSTM)                  [(None, 256), (None, 256) 17041408  
Total params: 17,041,408
Trainable params: 17,041,408
Non-trainable params: 0
_________________________________________________________________
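
The LSTM parameter counts in these summaries can be reproduced by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, which gives 4 × (units × (input_dim + units) + units) parameters. A short check, using the vocabulary sizes and the 256 units from the cells above (`lstm_param_count` is a hypothetical helper):

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

print(lstm_param_count(16385, 256))  # encoder LSTM: 17041408
print(lstm_param_count(16411, 256))  # decoder LSTM: 17068032
```

Almost all of these weights come from the one-hot input dimension, which equals the vocabulary size.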


In [43]:
infdec.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, None, 16411) 0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 256)]        0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 256)]        0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, None, 256),  17068032    input_2[0][0]                    
                                                                 input_3[0][0]              

### **Model Compile**
Use loss ='categorical_crossentropy' and optimizer='rmsprop'

In [44]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
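
For reference, categorical cross-entropy for a single time step is the negative log of the probability that the model assigns to the correct token. The NumPy sketch below uses made-up probabilities; it is an illustration of the formula and not the Keras implementation, which also reduces the loss over time steps and samples.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-step loss: -sum(y_true * log(y_pred)) over the vocabulary axis."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One time step, vocabulary of 3 tokens, correct token is id 1.
y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.25, 0.5, 0.25]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # -log(0.5) for the single step
```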

## **Model Fit**

In [45]:
# Draw a single batch of 512 pairs from the generator; the fit below trains on this batch only.
X1, X2, y = next(training_data_generator(encoder_input_data, decoder_input_data, decoder_output_data, enc_nbr_tokens, dec_nbr_tokens, 512))

In [49]:
es = EarlyStopping(monitor='accuracy', mode='auto', verbose=1, patience=10)

In [50]:
model.fit([X1, X2], y, epochs=3*len(input_docs)//1024, callbacks=[es])

Epoch 1/259
...
Epoch 77/259
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f054d6aab38>

In [51]:
model.save('/content/drive/MyDrive/riki.h5')

# **-- Generate the prediction**

In [52]:
# input encoder shape: 1 x sent_length x nbr_of_tokens
def enc_text_to_seq(text):
    text = clean_text(text)
    # Map known words to their ids; out-of-vocabulary words are dropped.
    tokens = [enc_target_word2id[token] for token in text.split(' ') if token in enc_target_word2id]
    pad_data = np.zeros((max_input_length, ), dtype='int')
    # Truncate to max_input_length so that longer queries do not raise an IndexError.
    for i, token in enumerate(tokens[:max_input_length]):
        pad_data[i] = token
    data = to_categorical(pad_data, enc_nbr_tokens)
    data = np.expand_dims(data, axis=0)
    return data

In [53]:
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    state = infenc.predict(source)
    target_seq = np.array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    output = []
    tokens = []
    for t in range(n_steps):
        yhat, h, c = infdec.predict([target_seq] + state)
        output.append(yhat[0,0,:])
        state = [h, c]
        target_seq = yhat
        if np.argmax(yhat) != 0:
            tokens.append(np.argmax(yhat))
        else:
            break
    if len(tokens) > 0:
        return ' '.join([dec_target_id2word[token] for token in tokens if dec_target_id2word[token] != 'eooooooooossss'])
    else:
        return 'No response'
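
A common variant of greedy decoding seeds the decoder with the one-hot start token and feeds back the one-hot vector of each predicted token, rather than the raw probability vector. The sketch below illustrates that loop with a stand-in `toy_step` function in place of a trained decoder. `greedy_decode`, `toy_step`, and the toy ids are assumptions for illustration; this is not the notebook's `predict_sequence`.

```python
import numpy as np

def greedy_decode(step_fn, state, start_id, end_id, vocab_size, max_steps):
    """Greedy decoding: feed the one-hot of the previous token, take the argmax."""
    token_id = start_id
    decoded = []
    for _ in range(max_steps):
        target = np.zeros((1, 1, vocab_size))
        target[0, 0, token_id] = 1.0
        probs, state = step_fn(target, state)
        token_id = int(np.argmax(probs[0, -1]))
        # Stop at the end marker or at the padding id 0.
        if token_id == end_id or token_id == 0:
            break
        decoded.append(token_id)
    return decoded

def toy_step(target, state):
    """Stand-in decoder that always predicts (previous id + 1)."""
    vocab_size = target.shape[-1]
    prev_id = int(np.argmax(target[0, 0]))
    probs = np.zeros((1, 1, vocab_size))
    probs[0, 0, (prev_id + 1) % vocab_size] = 1.0
    return probs, state

result = greedy_decode(toy_step, state=None, start_id=1, end_id=4, vocab_size=6, max_steps=10)
print(result)  # [2, 3]
```

With a real model, `step_fn` would wrap a call to the inference decoder and return its output distribution together with the new `[h, c]` states.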

In [58]:
for i in range(1,11,1):
    print(f'Question: \t {input_docs[i]}')
    print(f'Answer: \t {final_pairs[i][1]}')
    print(f'Riki Answer: \t {predict_sequence(infenc, infdec, enc_text_to_seq(input_docs[i]), 256, dec_nbr_tokens)}')
    print('\n')

Question: 	 they do to
Answer: 	 i hope so
Riki Answer: 	 i hope so


Question: 	 i hope so
Answer: 	 she okay
Riki Answer: 	 she okay you this


Question: 	 she okay
Answer: 	 let us go
Riki Answer: 	 let us go


Question: 	 let us go
Answer: 	 wow
Riki Answer: 	 wow


Question: 	 like my fear of wearing pastels
Answer: 	 the real you
Riki Answer: 	 it real you this


Question: 	 the real you
Answer: 	 what good stuff
Riki Answer: 	 what good stuff


Question: 	 what crap
Answer: 	 do you listen to this crap
Riki Answer: 	 do do you like this crap


Question: 	 do you listen to this crap
Answer: 	 no
Riki Answer: 	 no


Question: 	 you always been this selfish
Answer: 	 but
Riki Answer: 	 but to


Question: 	 but
Answer: 	 then that is all you had to say
Riki Answer: 	 then that is all you had to say this
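
One simple way to quantify how well the answers agree is the exact-match rate between Riki's answers and the reference answers. The sketch below applies it to the ten examples printed above; `exact_match_rate` is a hypothetical helper, and exact match is only one of several possible metrics.

```python
def exact_match_rate(references, predictions):
    """Fraction of predictions that are identical to their reference answer."""
    matches = sum(ref == pred for ref, pred in zip(references, predictions))
    return matches / len(references)

# Reference answers and Riki's answers, copied from the printed output.
references = [
    "i hope so", "she okay", "let us go", "wow", "the real you",
    "what good stuff", "do you listen to this crap", "no", "but",
    "then that is all you had to say",
]
predictions = [
    "i hope so", "she okay you this", "let us go", "wow", "it real you this",
    "what good stuff", "do do you like this crap", "no", "but to",
    "then that is all you had to say this",
]

rate = exact_match_rate(references, predictions)
print(rate)  # 0.5
```

Several of the non-matching answers differ only by one or two extra trailing words, which an exact-match score does not reward.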




# **-- Project report/synopsis**

## **Objective**

This project creates a chatbot named Riki that accepts complex questions from the user and responds intelligently.


## **Technique**

Riki is developed using a supervised learning approach: an Encoder-Decoder LSTM (sequence-to-sequence) architecture implemented with Keras, in which one LSTM encodes the input sentence and a second LSTM generates the reply.

## **Python notebook description:**

**Data was prepared** from the **Cornell Movie--Dialogs** corpus, and the pre-trained **GloVe** word representations were loaded to obtain vector representations for words.

The data set was **cleaned**, tokenized, and padded. In the cells above, the model inputs are **one-hot encoded** with `to_categorical`.

In this architecture, an **encoder** LSTM model **reads the input** sequence step by step. After reading the entire input sequence, the **final states of this encoder** model represent an internal learned representation of the entire input sequence as fixed-length vectors. These states are then provided as the **initial state** of the **decoder** model, which uses them as each step in the **output sequence** is generated.

The model was compiled with **optimizer 'rmsprop'**, **loss 'categorical_crossentropy'**, and **metric 'accuracy'**, and fit for up to **3 × len(input_docs) // 1024 = 259 epochs** with early stopping, on a single batch of 512 pairs. Training led to an **accuracy of 0.9948**.

The **performance of the model** is evaluated by comparing Riki's answer with the actual answer for questions taken from the training batch; in the examples shown, the two were identical or close.

## **Learning:**

The project helped me enhance my machine learning skills: I was able to create a reasonably good chatbot that understands sequences and responds to queries with different intents with nearly accurate answers.