# Building A Conversational Chatbot

## Aim of the Project


The aim of this project is to build an intelligent
conversational chatbot, Riki, that can understand
complex user queries and respond intelligently.

## Background


R-Intelligence Inc., an AI startup, has partnered with bluedit.io, an online chat and discussion
website. The site averages over 5 million active customers across the globe and more than
100,000 active chat rooms. Because of the increased traffic, bluedit.io wants to improve its
user experience with a chatbot moderator, Riki, which helps users engage in meaningful
conversation and keeps them updated on trending topics as they chat. The AI-powered chat
experience gives customers easy access to information and a host of options.

## Dataset description

Cornell Movie-Dialogs Corpus: a large, metadata-rich collection of fictional conversations extracted from raw movie scripts (220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies).

Distributed together with: Chameleons in Imagined Conversations: A new Approach to Understanding Coordination of Linguistic Style in Dialogs. Cristian Danescu-Niculescu-Mizil and Lillian Lee. Cognitive Modeling and Computational Linguistics Workshop at ACL 2011.


In [3]:
import pandas as pd
import numpy as np
import re
import nltk
import tensorflow as tf

from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense, Bidirectional
from keras.models import Model, load_model
from keras.layers import Activation, dot, concatenate

INPUT_LENGTH = 20
OUTPUT_LENGTH = 22


## Load the data

In [4]:
txt=open('/content/movie_lines_cleaned.txt','r').readlines()

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Clean text


In [6]:
def clean_text(text):
    '''Clean text by removing unnecessary characters and altering the format of words.'''
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|]", "", text)
    text = " ".join(text.split())
    return text

In [7]:
#split the data to questions and answers
questions=[]
answers=[]
for index,sent in enumerate(txt):
  if index%2==0:
    questions.append(sent)
  else:
    answers.append(sent)


In [8]:
# Clean the data
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))
clean_answers = []    
for answer in answers:
    clean_answers.append(clean_text(answer))

In [9]:
last_que=clean_questions.pop() # to balance data

In [10]:
# Find the length of sentences (not using nltk due to processing speed)
lengths = []
# lengths.append([len(nltk.word_tokenize(sent)) for sent in clean_questions]) #nltk approach
for question in clean_questions:
    lengths.append(len(question.split()))
for answer in clean_answers:
    lengths.append(len(answer.split()))
# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])
print(np.percentile(lengths, 80))
print(np.percentile(lengths, 85))
print(np.percentile(lengths, 90))
print(np.percentile(lengths, 95))

16.0
19.0
24.0
33.0


In [11]:
# Remove questions and answers that are shorter than 2 words or longer than 20 words.
min_line_length = 2
max_line_length = 20

# Filter out the questions that are too short/long
short_questions_temp = []
short_answers_temp = []

for i, question in enumerate(clean_questions):
    if len(question.split()) >= min_line_length and len(question.split()) <= max_line_length:
        short_questions_temp.append(question)
        short_answers_temp.append(clean_answers[i])

# Filter out the answers that are too short/long
short_questions = []
short_answers = []

for i, answer in enumerate(short_answers_temp):
    if len(answer.split()) >= min_line_length and len(answer.split()) <= max_line_length:
        short_answers.append(answer)
        short_questions.append(short_questions_temp[i])
        
print(len(short_questions))
print(len(short_answers))

95845
95845


In [12]:
# Pick a random start index that leaves room for three consecutive pairs
r = np.random.randint(0, len(short_questions) - 2)

for i in range(r, r+3):
    print(short_questions[i])
    print(short_answers[i])
    print()

geez, agent desmond, it is threethirty in the morning. where are we going to sleep?
it is a piece of paper with the letter t imprinted on it. take a look.

what is it?
agent desmond, would you hold the finger for me. there's something up there.

there appears to be a contusion under the ring finger of her left hand.
cole said she was 17.



## Preprocessing for word based model

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
#choosing number of samples
num_samples = 30000  # Number of samples to train on.
short_questions = short_questions[:num_samples]
short_answers = short_answers[:num_samples]
#tokenizing the qns and answers
short_questions_tok = [nltk.word_tokenize(sent) for sent in short_questions]
short_answers_tok = [nltk.word_tokenize(sent) for sent in short_answers]

## Training data & validation data

In [15]:
#train-validation split
data_size = len(short_questions_tok)

# Use the first 80% of the data for training
training_input  = short_questions_tok[:round(data_size*(80/100))]
training_input  = [tr_input[::-1] for tr_input in training_input] # reversing the input sequence for better performance
training_output = short_answers_tok[:round(data_size*(80/100))]

# Use the remaining 20% for validation
validation_input = short_questions_tok[round(data_size*(80/100)):]
validation_input  = [val_input[::-1] for val_input in validation_input] # reversing the input sequence for better performance
validation_output = short_answers_tok[round(data_size*(80/100)):]

print('training size', len(training_input))
print('validation size', len(validation_input))

training size 24000
validation size 6000


## Word en/decoding dictionaries

In [16]:
# Create a dictionary for the frequency of the vocabulary
vocab = {}
for question in short_questions_tok:
    for word in question:
        if word not in vocab:
            vocab[word] = 1
        else:
            vocab[word] += 1

for answer in short_answers_tok:
    for word in answer:
        if word not in vocab:
            vocab[word] = 1
        else:
            vocab[word] += 1            

In [17]:
# Remove rare words from the vocabulary.
# The aim is to replace fewer than 5% of words with <UNK>.
threshold = 15
count = 0
for k,v in vocab.items():
    if v >= threshold:
        count += 1

In [18]:
print("Size of total vocab:", len(vocab))
print("Size of vocab we will use:", count)

Size of total vocab: 20027
Size of vocab we will use: 1925
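
The threshold above is chosen with a target share of `<UNK>` replacements in mind, but the notebook does not compute that share directly. The following is a minimal sketch of how it could be checked, assuming tokenized sentences as input; `unk_ratio` and the toy corpus are hypothetical names and data used only for illustration.

```python
from collections import Counter

def unk_ratio(tokenized_sentences, threshold):
    """Return the fraction of tokens whose word occurs fewer than `threshold` times."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    total = sum(counts.values())
    # Every occurrence of a rare word would be replaced by <UNK>
    rare = sum(c for c in counts.values() if c < threshold)
    return rare / total if total else 0.0

# Toy corpus (hypothetical data, for illustration only)
toy = [["i", "am", "here"], ["i", "am", "fine"], ["where", "am", "i"]]
print(unk_ratio(toy, threshold=2))  # 3 of 9 tokens are rare
```

On the real data, the same function could be applied to `short_questions_tok + short_answers_tok` with `threshold=15` to see whether the 5% target is met.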


In [19]:
#we will create dictionaries to provide a unique integer for each word.
WORD_CODE_START = 1
WORD_CODE_END=2
WORD_CODE_PADDING = 0

word_num  = 3 # codes 1 and 2 are reserved for WORD_CODE_START and WORD_CODE_END, used by the decoder later
encoding = {'START': WORD_CODE_START, 'END': WORD_CODE_END}
decoding = {WORD_CODE_START: 'START', WORD_CODE_END: 'END'}
for word, count in vocab.items():
    if count >= threshold: #get vocabularies that appear above threshold count
        encoding[word] = word_num 
        decoding[word_num ] = word
        word_num += 1

print("No. of vocab used:", word_num)

No. of vocab used: 1928


In [20]:
#include unknown token for words not in dictionary
decoding[len(encoding)+3] = '<UNK>'
encoding['<UNK>'] = len(encoding)+3

In [21]:
dict_size = word_num+3


In [22]:

np.save('word2id.npy',encoding)
np.save('id2word.npy',decoding)

## Vectorizing dataset

In [23]:
def transform(encoding, data, vector_size=20):
    """
    :param encoding: word-to-integer encoding dict built above
    :param data: list of tokenized sentences (lists of words)
    :param vector_size: size of each encoded vector
    :return: array of shape (len(data), vector_size), zero-padded on the right
    """
    transformed_data = np.zeros(shape=(len(data), vector_size))
    for i in range(len(data)):
        for j in range(min(len(data[i]), vector_size)):
            try:
                transformed_data[i][j] = encoding[data[i][j]]
            except KeyError:
                # word is not in the vocabulary
                transformed_data[i][j] = encoding['<UNK>']
    return transformed_data

In [24]:
#encoding training set
encoded_training_input = transform(
    encoding, training_input, vector_size=INPUT_LENGTH)
encoded_training_output = transform(
    encoding, training_output, vector_size=OUTPUT_LENGTH)

print('encoded_training_input', encoded_training_input.shape)
print('encoded_training_output', encoded_training_output.shape)

encoded_training_input (24000, 20)
encoded_training_output (24000, 22)


In [25]:
#encoding validation set
encoded_validation_input = transform(
    encoding, validation_input, vector_size=INPUT_LENGTH)
encoded_validation_output = transform(
    encoding, validation_output, vector_size=OUTPUT_LENGTH)

print('encoded_validation_input', encoded_validation_input.shape)
print('encoded_validation_output', encoded_validation_output.shape)

encoded_validation_input (6000, 20)
encoded_validation_output (6000, 22)
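
To make the effect of the vectorizing step concrete, here is a small sketch on toy data. It assumes the same conventions as `transform` above (code 0 for padding, a dedicated `<UNK>` code, truncation to `vector_size`); `encode_and_pad` and `toy_encoding` are hypothetical names, not part of the notebook.

```python
import numpy as np

def encode_and_pad(encoding, data, vector_size):
    """Map tokens to integer codes, truncate to vector_size, pad with zeros."""
    out = np.zeros((len(data), vector_size))
    unk = encoding['<UNK>']
    for i, tokens in enumerate(data):
        for j, tok in enumerate(tokens[:vector_size]):
            out[i, j] = encoding.get(tok, unk)  # unknown words fall back to <UNK>
    return out

# Hypothetical miniature vocabulary
toy_encoding = {'START': 1, 'END': 2, 'hello': 3, 'there': 4, '<UNK>': 5}
toy_data = [['hello', 'there'],
            ['hello', 'stranger', 'there', 'hello', 'there']]

encoded = encode_and_pad(toy_encoding, toy_data, vector_size=4)
print(encoded)
# [[3. 4. 0. 0.]
#  [3. 5. 4. 3.]]
```

The first sentence is padded with zeros, while the second is truncated to four tokens and its out-of-vocabulary word is mapped to the `<UNK>` code.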


##  Model Building


###   Sequence-to-Sequence 

In [26]:
tf.keras.backend.clear_session()

In [27]:
encoder_input = Input(shape=(INPUT_LENGTH,))
decoder_input = Input(shape=(OUTPUT_LENGTH,))

### Using glove for embedding layer

In [28]:
# load the whole embedding into memory
embeddings_index = dict()
with open('/content/drive/MyDrive/glove/glove.twitter.27B.25d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((dict_size, 25))
for word, i in encoding.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embed_layer = Embedding(input_dim=dict_size, output_dim=25,input_length=INPUT_LENGTH, trainable=True, mask_zero=True)
embed_layer.build((None,))
embed_layer.set_weights([embedding_matrix])

Loaded 1193515 word vectors.


In [29]:
encoder = embed_layer(encoder_input)
encoder = LSTM(512, return_sequences=True, unroll=True)(encoder)
encoder_last = encoder[:,-1,:]

print('encoder', encoder)
print('encoder_last', encoder_last)

decoder = embed_layer(decoder_input)
decoder = LSTM(512, return_sequences=True, unroll=True)(decoder, initial_state=[encoder_last, encoder_last])

print('decoder', decoder)

# For the plain Sequence-to-Sequence model, the output would be produced directly from the decoder:
# output = TimeDistributed(Dense(dict_size, activation="softmax"))(decoder)

encoder KerasTensor(type_spec=TensorSpec(shape=(None, 20, 512), dtype=tf.float32, name=None), name='lstm/transpose_2:0', description="created by layer 'lstm'")
encoder_last KerasTensor(type_spec=TensorSpec(shape=(None, 512), dtype=tf.float32, name=None), name='tf.__operators__.getitem/strided_slice:0', description="created by layer 'tf.__operators__.getitem'")
decoder KerasTensor(type_spec=TensorSpec(shape=(None, 22, 512), dtype=tf.float32, name=None), name='lstm_1/transpose_2:0', description="created by layer 'lstm_1'")


### Attention Mechanism
Reference: Effective Approaches to Attention-based Neural Machine Translation's Global Attention with Dot-based scoring function (Section 3, 3.1) https://arxiv.org/pdf/1508.04025.pdf

In [30]:

# Equation (7) with 'dot' score from Section 3.1 in the paper.
# Note that we reuse Softmax-activation layer instead of writing tensor calculation
attention = dot([decoder, encoder], axes=[2, 2])
attention = Activation('softmax', name='attention')(attention)
print('attention', attention)

context = dot([attention, encoder], axes=[2,1])
print('context', context)

decoder_combined_context = concatenate([context, decoder])
print('decoder_combined_context', decoder_combined_context)

# Has another weight + tanh layer as described in equation (5) of the paper
output = TimeDistributed(Dense(512, activation="tanh"))(decoder_combined_context)
output = TimeDistributed(Dense(dict_size, activation="softmax"))(output)
print('output', output)

attention KerasTensor(type_spec=TensorSpec(shape=(None, 22, 20), dtype=tf.float32, name=None), name='attention/Softmax:0', description="created by layer 'attention'")
context KerasTensor(type_spec=TensorSpec(shape=(None, 22, 512), dtype=tf.float32, name=None), name='dot_1/MatMul:0', description="created by layer 'dot_1'")
decoder_combined_context KerasTensor(type_spec=TensorSpec(shape=(None, 22, 1024), dtype=tf.float32, name=None), name='concatenate/concat:0', description="created by layer 'concatenate'")
output KerasTensor(type_spec=TensorSpec(shape=(None, 22, 1931), dtype=tf.float32, name=None), name='time_distributed_1/Reshape_1:0', description="created by layer 'time_distributed_1'")
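
The tensor shapes printed above can be reproduced with plain NumPy for a single example (no batch dimension). This is only an illustrative sketch of dot-score global attention under that assumption; `dot_attention` is a hypothetical helper, and the hidden size of 8 is chosen arbitrarily to keep the example small.

```python
import numpy as np

def dot_attention(decoder_states, encoder_states):
    """Dot-score attention for one example.

    decoder_states: (T_dec, d), encoder_states: (T_enc, d)
    Returns attention weights (T_dec, T_enc) and context vectors (T_dec, d).
    """
    scores = decoder_states @ encoder_states.T            # dot score per (decoder step, encoder step)
    scores = scores - scores.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over encoder steps
    context = weights @ encoder_states                    # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
enc = rng.normal(size=(20, 8))   # INPUT_LENGTH x hidden size
dec = rng.normal(size=(22, 8))   # OUTPUT_LENGTH x hidden size

weights, context = dot_attention(dec, enc)
combined = np.concatenate([context, dec], axis=-1)  # same role as decoder_combined_context
print(weights.shape, context.shape, combined.shape)  # (22, 20) (22, 8) (22, 16)
```

Each row of `weights` sums to 1, so every decoder step receives a convex combination of the encoder states as its context vector.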


In [31]:
model = Model(inputs=[encoder_input, decoder_input], outputs=[output])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 22)]         0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            [(None, 20)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           multiple             48275       input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 20, 512)      1101824     embedding[0][0]              

In [32]:
def batch_generator(X,y,out,steps,batch_size):
    """Yield ([encoder_input, decoder_input], target) batches indefinitely."""
    idx=0
    while True:
        batch_x = np.array(X[idx * batch_size : (idx+1) * batch_size])
        batch_y = np.array(y[idx * batch_size : (idx+1) * batch_size])
        output_y = np.array(out[idx * batch_size : (idx+1) * batch_size])
        yield [batch_x,batch_y],output_y ## Yields data
        # Wrap around after `steps` batches so that each epoch iterates over the data again
        idx = (idx + 1) % steps


In [33]:
training_encoder_input = encoded_training_input
training_decoder_input = np.zeros_like(encoded_training_output)
training_decoder_input[:, 1:-1] = encoded_training_output[:,:-2]
training_decoder_input[:, 0] = WORD_CODE_START
training_decoder_input[:, -1] = WORD_CODE_END
training_decoder_output = np.eye(dict_size)[encoded_training_output.astype('int')]

validation_encoder_input = encoded_validation_input
validation_decoder_input = np.zeros_like(encoded_validation_output)
validation_decoder_input[:, 1:-1] = encoded_validation_output[:,:-2]
validation_decoder_input[:, 0] = WORD_CODE_START
validation_decoder_input[:, -1] = WORD_CODE_END

validation_decoder_output = np.eye(dict_size)[encoded_validation_output.astype('int')]
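
The slicing in the cell above shifts each target sequence one step to the right so that the decoder sees the previous word when predicting the current one. A toy sketch of the same indexing, with made-up word codes and a made-up dictionary size of 8, shows the resulting layout:

```python
import numpy as np

START, END = 1, 2  # same roles as WORD_CODE_START and WORD_CODE_END

# One encoded answer, zero-padded to length 5 (codes are invented for illustration)
target = np.array([[5., 6., 7., 0., 0.]])

decoder_in = np.zeros_like(target)
decoder_in[:, 1:-1] = target[:, :-2]  # shift the target right by one position
decoder_in[:, 0] = START              # first decoder input is always START
decoder_in[:, -1] = END               # last position is overwritten with END
print(decoder_in)  # [[1. 5. 6. 7. 2.]]

# One-hot targets, as needed by categorical cross-entropy
toy_dict_size = 8
one_hot = np.eye(toy_dict_size)[target.astype(int)]
print(one_hot.shape)  # (1, 5, 8)
```

Indexing `np.eye(n)` with an integer array is a compact way to one-hot encode, but it materializes a dense array of size samples x length x dictionary size, which is worth keeping in mind for larger vocabularies.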

In [36]:
BATCH_SIZE =128
EPOCHS = 25
# STEPS=(np.ceil((len(training_encoder_input) / float(BATCH_SIZE))-1)).astype(np.int)

steps_per_epoch = len(training_encoder_input)//BATCH_SIZE
validation_steps=len(validation_encoder_input)//BATCH_SIZE

my_training_batch_generator=batch_generator(training_encoder_input,training_decoder_input,training_decoder_output,steps_per_epoch,BATCH_SIZE)
my_validation_batch_generator=batch_generator(validation_encoder_input,validation_decoder_input,validation_decoder_output,validation_steps,BATCH_SIZE)



In [51]:
model.fit_generator(my_training_batch_generator,
          #validation_split=0.05,
          steps_per_epoch=steps_per_epoch, epochs=EPOCHS,verbose=1,
          validation_data=my_validation_batch_generator,validation_steps=validation_steps)

model.save('chatboot_model.h5')



Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [37]:
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit_generator(my_training_batch_generator,
          #validation_split=0.05,
          steps_per_epoch=steps_per_epoch, epochs=EPOCHS,verbose=1,
          validation_data=my_validation_batch_generator,validation_steps=validation_steps)

model.save('chatboot_model.h5')



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Model testing

In [39]:
def prediction(raw_input):
    clean_input = clean_text(raw_input)
    input_tok = [nltk.word_tokenize(clean_input)]
    input_tok = [input_tok[0][::-1]]  # reversing the input sequence, as in training
    encoder_input = transform(encoding, input_tok, INPUT_LENGTH)
    decoder_input = np.zeros(shape=(len(encoder_input), OUTPUT_LENGTH))
    decoder_input[:,0] = WORD_CODE_START
    decoder_input[:,-1] = WORD_CODE_END

    for i in range(1, OUTPUT_LENGTH):
        output = model.predict([encoder_input, decoder_input]).argmax(axis=2)
        decoder_input[:,i] = output[:,i]
    return output

def decode(decoding, vector):
    """
    :param decoding: decoding dict built by word encoding
    :param vector: an encoded vector
    """
    text = ''
    for i in vector:
        if i == 0:
            break
        text += ' '
        text += decoding[i]
    return text

In [37]:
for i in range(20):
    seq_index = np.random.randint(1, len(short_questions))
    output = prediction(short_questions[seq_index])
    print ('Q:', short_questions[seq_index])
    print ('A:', decode(decoding, output[0]))

Q: not yet
A:  i saw her die . she was shot . with this gun .
Q: where were you at twelve o'clock last night?
A:  mrs. grant , governor ... i will not hurt you .
Q: maybe it is supposed to end now. maybe god would not have it any other way.
A:  i appreciate that . but i am also sorry to am of my street .
Q: ai not it the truth.
A:  <UNK> , i have <UNK> hair . i am thinking of <UNK> the speech .
Q: i do not care what you have got started. do you want to go?
A:  i got your message . where is craig ?
Q: yes, sir.
A:  look ... i have ... i have got a problem . a big problem ...
Q: yyyy... yyye... yyyess.
A:  keep , what ?
Q: yes...yes...i will explain it all. just put the gun down.
A:  get on !
Q: well, we would like to find out something about him. what does he do for a living?
A:  better in san <UNK> ? more <UNK> there ? what ?
Q: thank you... mr. shaw.
A:  <UNK> . and i want not you .
Q: monsieur, insofar as it is in my power
A:  but i want to make some <UNK> . get <UNK> <UNK> away will

In [49]:
for i in range(6):
    seq_index = np.random.randint(1, len(short_questions))
    output = prediction(short_questions[seq_index])
    print ('Q:', short_questions[seq_index])
    print ('A:', decode(decoding, output[0]))
    print ('RA:', short_answers[seq_index])

Q: rome is going to pay an allotment to the german tribes on an annual basis.
A:  get on .
RA: what deal?
Q: purely personal. i believe you might enjoy one another.
A:  mr. <UNK> time . but take it easy on me , girl .
RA: if you do not want me to pose for him, why do you want me to meet him?
Q: it is a post all vienna seeks. if you want it for your husband, come tonight.
A:  come on , baby , let 's go in the house .
RA: is not it obvious?
Q: is it the truth?
A:  i saw her die . she was shot .
RA: my heart weeps.
Q: what are you talking about, bob?
A:  its ' ... ah ... about my daughter ... .
RA: if you are that worried, maybe we should just steal one.
Q: 'night miss jenny do not let the bedbugs bite.
A:  if you were here to hurt you i would have done it already .
RA: just a man. goodnight pearl, sleep tight and do not let the bedbugs bite.


In [39]:
raw_input = input()
output = prediction(raw_input)
print (decode(decoding, output[0]))

hello mr. shaw
 i have you come from <UNK> .


In [41]:
out_last_ques= prediction(last_que)
print(last_que)
print(decode(decoding, out_last_ques[0]))

colonel durnford... william vereker. i hear you have been seeking officers?
 somebody left me a message . well where is craig and dayday ?


## Resources
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html