# Chatbot

This notebook implements a seq2seq encoder-decoder model for a conversational chatbot, without attention.

### Approach

- Load datasets: two datasets are combined to mix movie conversations with "small talk"
- Define questions and answers (Q & A)
- Expand contractions using a dict
- Clean strings
- Create embeddings (word-to-index mappings)
- Pad sequences
- Reverse encoder input
- Define seq2seq model


### Dataset references
- dataset1: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html 
- dataset2: https://www.kaggle.com/grafstor/simple-dialogs-for-chatbot

### Results
The model achieves an accuracy of 0.20.

In [1]:
#some imports

import pandas as pd 
import string
import numpy as np

from collections import OrderedDict

In [2]:
#importing tensorflow

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras import backend as K

from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
#select GPU

gpus = tf.config.experimental.list_physical_devices('GPU')
if not gpus:
    print("No GPU was detected.")
else:
    #use the second GPU when available, otherwise fall back to the first
    gpu = gpus[1] if len(gpus) > 1 else gpus[0]
    tf.config.experimental.set_visible_devices(gpu, 'GPU')
    tf.config.experimental.set_memory_growth(gpu, enable=True)
gpus

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

## Loading datasets

In [4]:
df=pd.read_csv('archive/dialogs.txt',sep='\t')

In [5]:
df.head()

Unnamed: 0,Q,ANS
0,"hi, how are you doing?",i'm fine. how about yourself?
1,i'm fine. how about yourself?,i'm pretty good. thanks for asking.
2,i'm pretty good. thanks for asking.,no problem. so how have you been?
3,no problem. so how have you been?,i've been great. what about you?
4,i've been great. what about you?,i've been good. i'm in school right now.


In [6]:
#loading files

with open('dataset_cornel/movie_lines.txt',encoding='utf-8',errors='ignore') as f:
    lines=f.read().split('\n')
with open('dataset_cornel/movie_conversations.txt',encoding='utf-8',errors='ignore') as f:
    conversations=f.read().split('\n')
lines

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',
 'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?',
 'L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".',
 'L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?',
 "L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured yo

In [7]:
#preparing the dataset

conversation_index=[] #nested list of conversation indexes
for conversation in conversations:
    conversation_index.append(conversation.split('+++$+++')[-1][2:-1].replace("'","").split(','))

In [8]:
dict_i2text={} #index to text dictionary
for line in lines[:-1]:   #skip the last element, which is empty
    key=line.split('+++$+++')[0][1:-1]
    dict_i2text[int(key)]=line.split('+++$+++')[-1]

ord_list_i2text = list(OrderedDict(sorted(dict_i2text.items())).items())
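
Each record in `movie_lines.txt` holds five fields separated by ` +++$+++ `, as the output of the loading cell shows. The following is a minimal sketch of a parser that splits the fields explicitly instead of slicing characters. `parse_movie_line` is a hypothetical helper that does not appear in the notebook, and unlike the cell above it strips the leading space from the text.

```python
SEP = " +++$+++ "

def parse_movie_line(raw):
    """Split one movie_lines.txt record into (line_id, text).

    Hypothetical helper: assumes the five-field layout
    lineID, characterID, movieID, character name, text.
    Returns None for blank or malformed records.
    """
    parts = raw.split(SEP, 4)
    if len(parts) != 5:
        return None
    raw_id = parts[0].strip()
    if not (raw_id.startswith("L") and raw_id[1:].isdigit()):
        return None
    return int(raw_id[1:]), parts[4].strip()   # "L1045" -> 1045

sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
print(parse_movie_line(sample))   # (1045, 'They do not!')
```

Returning `None` for malformed records removes the need to drop the trailing empty element by hand.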

In [9]:
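#remove long sentences, then keep a slice of the remaining lines

long_threshold=8 #maximum number of words per sentence

cut_dataset=7*1000 #number of lines to keep
cut_offset=5*1000 #number of lines to skip first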

ord_list_i2text = [x for x in ord_list_i2text if not len(x[1].split())>long_threshold][cut_offset:cut_dataset+cut_offset]

ord_list_i2text


[(26292, ' Look, I told you...'),
 (26303, ' Thank you, Officer Ripley.  That will be...'),
 (26309, ' You are a head case.  Have a donut.'),
 (26312, " Why won't you check out LV-426?"),
 (26314, ' What are you talking about. What people?'),
 (26316, ' How many colonists?'),
 (26317, ' Sixty, maybe seventy families.'),
 (26318, ' Sweet Jesus.'),
 (26322, ' Yeah.  What?'),
 (26325, ' So what do I tell this guy?'),
 (26355, " No.  There's no way!"),
 (26356, ' Hear me out...'),
 (26362, " What about you?  What's your interest in this?"),
 (26364, ' Yeah, yeah.  I saw the commercial.'),
 (26366, " That's right."),
 (26367, ' Running loaders, forklifts, that sort of thing?'),
 (26370, ' If I go.'),
 (26376, ' Yello?  Oh, Ripley.  Hi...'),
 (26378, " That's the plan.  My word on it."),
 (26385, " This floor's freezing."),
 (26387, ' Would you, Sir?'),
 (26388, ' Hey, Vasquez...you ever been mistaken for a man?'),
 (26389, ' No.  Have you?'),
 (26398, " Whoooah!  No shit?  I'm impressed."),

In [10]:
len(ord_list_i2text)

7000

In [11]:
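question=[] #defining question and answer lists
answer=[]

#NOTE: the IDs are sorted in ascending order, so this difference is always
#negative and the test is always true: every adjacent pair of kept lines is
#used as a (question, answer) pair, whether or not the IDs are consecutive
for i in range(len(ord_list_i2text)-1):
    if ord_list_i2text[i][0]-ord_list_i2text[i+1][0]!=1:
        question.append(ord_list_i2text[i][1])
        answer.append(ord_list_i2text[i+1][1])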
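
Because the list is sorted by line ID, the loop above pairs every adjacent kept line. If the intent is to pair only lines whose IDs are consecutive, a minimal sketch could look like this. `pair_consecutive` is a hypothetical name, and the rule that a reply has an ID exactly one higher is an assumption, not something the notebook states.

```python
def pair_consecutive(id_text_pairs):
    """Return (questions, answers) built only from adjacent line IDs.

    id_text_pairs: list of (line_id, text) tuples sorted by line_id.
    Assumption: a reply is the line whose ID is exactly one higher.
    """
    questions, answers = [], []
    for (id_a, text_a), (id_b, text_b) in zip(id_text_pairs, id_text_pairs[1:]):
        if id_b - id_a == 1:
            questions.append(text_a)
            answers.append(text_b)
    return questions, answers

sample = [(26316, " How many colonists?"),
          (26317, " Sixty, maybe seventy families."),
          (26318, " Sweet Jesus."),
          (26322, " Yeah.  What?")]
q, a = pair_consecutive(sample)
print(q)
print(a)
```

Applying a stricter rule like this would change the dataset size and every count reported further down. Consecutive IDs still do not guarantee that two lines belong to the same exchange; the conversation lists parsed into `conversation_index` are the more reliable source for that.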



In [12]:
dataset2=[df['Q'],df['ANS']]

for j in range(len(dataset2[0])):
    if not (len(dataset2[0][j].split())>long_threshold or (len(dataset2[1][j].split())>long_threshold)):
        question.append(dataset2[0][j])
        answer.append(dataset2[1][j])

In [13]:
len(question),len(answer)

(9382, 9382)

## Preprocessing dataset

- Set to lower case
- Expand contractions
- Remove punctuation and digits
- Add sos and eos tokens to answers
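
The first three steps can be sketched as a single function. This is an illustrative variant, not the notebook's own implementation: it lowercases before looking up contractions so that capitalised tokens such as "I'm" also match. `CONTRACTIONS` is a small stand-in for the full dictionary, and `clean_sentence` is a hypothetical name.

```python
import string

# Tiny stand-in for the notebook's contractions_dict (illustrative subset).
CONTRACTIONS = {"i'm": "i am", "what's": "what is", "won't": "will not"}

_PUNCT_DIGITS = str.maketrans("", "", string.punctuation + string.digits)

def clean_sentence(sent, contractions=CONTRACTIONS):
    """Lowercase, expand contractions token by token, drop punctuation and digits."""
    out = []
    for token in sent.lower().split():
        # Strip punctuation around the token but keep inner apostrophes,
        # so "I'm," still matches the key "i'm".
        core = token.strip(string.punctuation)
        expanded = contractions.get(core, core)
        out.append(expanded.translate(_PUNCT_DIGITS))
    # Tokens made only of punctuation or digits become empty and are dropped.
    return " ".join(w for w in out if w)

print(clean_sentence(" What's your interest in LV-426?"))
# what is your interest in lv
```

One known limitation of this sketch: keys that begin with an apostrophe, such as `'cause`, lose it during stripping and are therefore not expanded.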


In [14]:
contractions_dict = {
"im": "i am",
"dont": "do not",
"doesnt": "does not",
"theres": "there is",
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "I would",
"i'd've": "I would have",
"i'll": "I will",
"i'll've": "I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that had",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}


In [15]:
question


[' Look, I told you...',
 ' Thank you, Officer Ripley.  That will be...',
 ' You are a head case.  Have a donut.',
 " Why won't you check out LV-426?",
 ' What are you talking about. What people?',
 ' How many colonists?',
 ' Sixty, maybe seventy families.',
 ' Sweet Jesus.',
 ' Yeah.  What?',
 ' So what do I tell this guy?',
 " No.  There's no way!",
 ' Hear me out...',
 " What about you?  What's your interest in this?",
 ' Yeah, yeah.  I saw the commercial.',
 " That's right.",
 ' Running loaders, forklifts, that sort of thing?',
 ' If I go.',
 ' Yello?  Oh, Ripley.  Hi...',
 " That's the plan.  My word on it.",
 " This floor's freezing.",
 ' Would you, Sir?',
 ' Hey, Vasquez...you ever been mistaken for a man?',
 ' No.  Have you?',
 " Whoooah!  No shit?  I'm impressed.",
 " Let's go...let's go.  Cycle through!",
 " Hey, 'Top.'  What's the op?",
 ' Sir?',
 ' Yes, Hicks?',
 " Hudson, Sir.  He's Hicks.",
 " What's the question?",
 ' Sounds like you, Hicks.',
 ' Fuck you.',
 ' Anytime. 

In [16]:
# to set each string to lower case: string_list = [each_string.lower() for each_string in string_list]

In [17]:
#function to expand the contractions and remove punctuation and digits

def exp_remPunt(l):
    '''
    params: l is a list of sentences (questions or answers)
    returns: list of cleaned, lower-cased sentences
    '''
    clean_l=[]
    table = str.maketrans(dict.fromkeys(string.punctuation))
    remove_digits = str.maketrans('', '', string.digits)

    for sent in l:
        #NOTE: the lookup is case-sensitive and runs before lowercasing and
        #punctuation removal, so tokens such as "I'm" or "don't," are not expanded
        for word in sent.split():
            if word in contractions_dict:
                sent = sent.replace(word, contractions_dict[word])  #expand
                
        sent = sent.translate(remove_digits)
        clean_l.append(sent.translate(table).lower()) #remove punctuation and set to lower case
    
    return clean_l


In [18]:
question=exp_remPunt(question)
answer=exp_remPunt(answer)

In [19]:

answer

[' thank you officer ripley  that will be',
 ' you are a head case  have a donut',
 ' why will not you check out lv',
 ' what are you talking about what people',
 ' how many colonists',
 ' sixty maybe seventy families',
 ' sweet jesus',
 ' yeah  what',
 ' so what do i tell this guy',
 ' no  theres no way',
 ' hear me out',
 ' what about you  whats your interest in this',
 ' yeah yeah  i saw the commercial',
 ' thats right',
 ' running loaders forklifts that sort of thing',
 ' if i go',
 ' yello  oh ripley  hi',
 ' thats the plan  my word on it',
 ' this floors freezing',
 ' would you sir',
 ' hey vasquezyou ever been mistaken for a man',
 ' no  have you',
 ' whoooah  no shit  im impressed',
 ' lets golets go  cycle through',
 ' hey top  whats the op',
 ' sir',
 ' yes hicks',
 ' hudson sir  hes hicks',
 ' whats the question',
 ' sounds like you hicks',
 ' fuck you',
 ' anytime  anywhere',
 ' i hope you are right  i really do',
 ' are there any questions  hudson',
 ' still nothing from t

## Creating embeddings

- create a list with all the words
- deduplicate the list with a set, so that each word occurs once
- create dicts for word to index and index to word
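
The three steps above can be sketched in one function. This is a minimal illustration rather than the notebook's own code: `build_vocab` is a hypothetical name, and sorting the unique words is an addition that makes the index assignment reproducible across runs.

```python
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>"]  # indices 0, 1, 2 as in the notebook

def build_vocab(sentences):
    """Return (word_to_index, index_to_word) with special tokens first.

    Sorting the unique words is an addition of this sketch: it makes the
    index assignment reproducible across runs, unlike iterating a set.
    """
    words = sorted({w for sent in sentences for w in sent.split()})
    index_to_word = dict(enumerate(SPECIAL_TOKENS + words))
    word_to_index = {w: i for i, w in index_to_word.items()}
    return word_to_index, index_to_word

w2i, i2w = build_vocab(["how are you", "i am fine how about you"])
print(w2i["<pad>"], w2i["about"], i2w[3])   # 0 3 about
```

Building both dictionaries from the same enumeration guarantees that they are exact inverses of each other.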

In [20]:
#create list with all the words
words_list=[]

for sent in question:
    for word in sent.split():
        words_list.append(word)
        
for sent in answer:
    for word in sent.split():
        words_list.append(word)
        
words_list = set(words_list)
vocab_len=len(words_list) #length of the vocabulary without offset

vocab_len

4630

In [21]:
#define funct to create dicts
def index_to_word(words_list):
    
    d= { (index +3) : word for index,word in enumerate(words_list)}
    
    d[0]='<pad>'
    d[1]='<sos>'
    d[2]='<eos>'
    return d

def word_to_index(words_list):
    d= { word : (index +3) for index,word in enumerate(words_list)}
    
    d['<pad>']=0
    d['<sos>']=1
    d['<eos>']=2
    return d

In [22]:
dict_i2w=index_to_word(words_list) #index to word
dict_w2i=word_to_index(words_list) #word to index

In [23]:
#convert sentences to sequences of word indices
encoder_input_data=[]
decoder_input_data=[]
decoder_output_data=[]
for sent in question:
    emb_str=[]
    for word in sent.split():
        emb_str.append(dict_w2i[word])
    encoder_input_data.append(emb_str) #quest
    
for sent in answer:
    emb_str=[]
    for word in sent.split():
        emb_str.append(dict_w2i[word])
    decoder_input_data.append([1]+emb_str+[2]) #ans with sos and eos
    decoder_output_data.append(emb_str+[2]) #ans with just eos


In [24]:
decoder_output_data

[[1262, 1209, 1412, 4090, 3825, 353, 3269, 2],
 [1209, 4160, 3595, 4253, 809, 387, 3595, 2988, 2],
 [4486, 353, 1204, 1209, 439, 752, 960, 2],
 [1139, 4160, 1209, 49, 2893, 1139, 1578, 2],
 [966, 1694, 1374, 2],
 [4025, 3424, 1095, 1771, 2],
 [939, 3166, 2],
 [2782, 1139, 2],
 [2731, 1139, 3830, 1484, 2185, 1399, 2670, 2],
 [2565, 4438, 2565, 985, 2],
 [4551, 3192, 752, 2],
 [1139, 2893, 1209, 2997, 3378, 2042, 3766, 1399, 2],
 [2782, 2782, 1484, 4206, 2484, 671, 2],
 [1263, 3989, 2],
 [1134, 3588, 3892, 3825, 4333, 784, 1375, 2],
 [4619, 1484, 293, 2],
 [277, 4039, 4090, 3973, 2],
 [1263, 2484, 123, 1640, 2926, 4273, 502, 2],
 [1399, 4580, 1463, 2],
 [1184, 1209, 4596, 2],
 [1152, 3933, 2575, 2645, 2891, 4067, 3595, 2203, 2],
 [2565, 387, 1209, 2],
 [2627, 2565, 2293, 967, 3783, 2],
 [1770, 4173, 293, 1708, 1637, 2],
 [1152, 3250, 2997, 2484, 2381, 2],
 [4596, 2],
 [4349, 1933, 2],
 [3949, 4596, 2160, 1933, 2],
 [2997, 2484, 1165, 2],
 [3726, 2309, 1209, 1933, 2],
 [1749, 1209, 2],
 [

In [25]:
#bucketing threshold

buck_t1=long_threshold

'''
Only one threshold is implemented here, so there is effectively a single bucket.
The code is structured so that further buckets can be added easily if necessary.
'''
#buck_t_n=max([len(x) for x in encoder_input_data]) 


b1_enc_in=[]  #bucket 1 encoder input
b1_dec_in=[]  #bucket 1 decoder input
b1_dec_out=[] #bucket 1 decoder output

for index in range(len(encoder_input_data)):
    if len(encoder_input_data[index])<=buck_t1:
        b1_enc_in.append(encoder_input_data[index])
        b1_dec_in.append(decoder_input_data[index])
        b1_dec_out.append(decoder_output_data[index])
        
b1_dec_out

[[1262, 1209, 1412, 4090, 3825, 353, 3269, 2],
 [1209, 4160, 3595, 4253, 809, 387, 3595, 2988, 2],
 [4486, 353, 1204, 1209, 439, 752, 960, 2],
 [1139, 4160, 1209, 49, 2893, 1139, 1578, 2],
 [966, 1694, 1374, 2],
 [4025, 3424, 1095, 1771, 2],
 [939, 3166, 2],
 [2782, 1139, 2],
 [2731, 1139, 3830, 1484, 2185, 1399, 2670, 2],
 [2565, 4438, 2565, 985, 2],
 [4551, 3192, 752, 2],
 [1139, 2893, 1209, 2997, 3378, 2042, 3766, 1399, 2],
 [2782, 2782, 1484, 4206, 2484, 671, 2],
 [1263, 3989, 2],
 [1134, 3588, 3892, 3825, 4333, 784, 1375, 2],
 [4619, 1484, 293, 2],
 [277, 4039, 4090, 3973, 2],
 [1263, 2484, 123, 1640, 2926, 4273, 502, 2],
 [1399, 4580, 1463, 2],
 [1184, 1209, 4596, 2],
 [1152, 3933, 2575, 2645, 2891, 4067, 3595, 2203, 2],
 [2565, 387, 1209, 2],
 [2627, 2565, 2293, 967, 3783, 2],
 [1770, 4173, 293, 1708, 1637, 2],
 [1152, 3250, 2997, 2484, 2381, 2],
 [4596, 2],
 [4349, 1933, 2],
 [3949, 4596, 2160, 1933, 2],
 [2997, 2484, 1165, 2],
 [3726, 2309, 1209, 1933, 2],
 [1749, 1209, 2],
 [

In [26]:
len(b1_dec_out)

9213
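
The bucketing cell above uses a single threshold. As an illustration of how several buckets could be added, here is a minimal sketch that assigns each sequence to the smallest threshold that fits its length. `assign_buckets` and the thresholds `[4, 8]` are hypothetical and not part of the notebook.

```python
import bisect

def assign_buckets(sequences, thresholds):
    """Group sequence indices by the smallest threshold that fits each length.

    thresholds: ascending list of maximum lengths, e.g. [4, 8].
    Sequences longer than the last threshold are dropped, mirroring the
    single-threshold filter in the notebook.
    """
    buckets = {t: [] for t in thresholds}
    for idx, seq in enumerate(sequences):
        pos = bisect.bisect_left(thresholds, len(seq))
        if pos < len(thresholds):
            buckets[thresholds[pos]].append(idx)
    return buckets

toy = [[5, 6], [7, 8, 9, 10, 11], [3] * 9, [4, 4, 4, 4]]
print(assign_buckets(toy, [4, 8]))   # {4: [0, 3], 8: [1]}
```

The returned indices can then be used to select the matching encoder inputs, decoder inputs and decoder outputs for each bucket, so the three lists stay aligned.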

In [27]:
#function to find max len of decoder input
def get_max_(l):
    return max([len(sent) for sent in l])

In [28]:
#padding encoder and decoder inputs
b1_dec_len=get_max_(b1_dec_in)



b1_enc_in = pad_sequences(b1_enc_in,buck_t1, padding='post')
b1_dec_in = pad_sequences(b1_dec_in,b1_dec_len, padding='post')
b1_dec_out = pad_sequences(b1_dec_out,b1_dec_len, padding='post')

b1_dec_len

12

In [29]:
#one-hot encode the decoder targets, as required by the categorical_crossentropy loss
b1_dec_out=tf.keras.utils.to_categorical(b1_dec_out,(vocab_len+3))
b1_dec_out.shape

(9213, 12, 4633)

In [30]:
#last step is to reverse encoder inputs
b1_enc_in=np.flip(b1_enc_in,1)
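
To make the effect of post-padding followed by reversal concrete, here is a small pure-Python sketch of what happens to a single row. `pad_post` and `pad_and_reverse` are hypothetical helpers that only imitate `pad_sequences(..., padding='post')` and `np.flip(..., 1)` on one sequence.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Pad at the end; overlong input keeps its tail (truncating='pre' default)."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return seq + [pad_value] * (maxlen - len(seq))

def pad_and_reverse(seq, maxlen):
    """Post-pad, then flip the row, as the notebook does with np.flip(..., 1)."""
    return pad_post(seq, maxlen)[::-1]

print(pad_post([5, 6, 7], 5))         # [5, 6, 7, 0, 0]
print(pad_and_reverse([5, 6, 7], 5))  # [0, 0, 7, 6, 5]
```

After the flip, the padding sits at the front and the first word of the question is the last token the encoder reads.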

## Building models

- define model for training
- define model for inference
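
The notebook does not show how the inference models are used to generate a reply. A common approach is a greedy decoding loop, sketched below under stated assumptions: `encode` and `step` are hypothetical wrappers around the encoder and decoder inference models, and toy stand-ins replace them here so the loop runs without a trained model.

```python
SOS, EOS = 1, 2  # token indices used by the notebook's dictionaries

def greedy_decode(encode, step, question_ids, max_len=12):
    """Generate answer token ids one step at a time.

    encode(question_ids) -> state
    step(token_id, state) -> (probabilities over the vocabulary, new_state)
    Both callables are hypothetical wrappers; in the notebook they would
    call encoder_model.predict and decoder_model.predict respectively.
    """
    state = encode(question_ids)
    token, answer = SOS, []
    for _ in range(max_len):
        probs, state = step(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == EOS:
            break
        answer.append(token)
    return answer

# Toy stand-ins so the loop can be run without a trained model:
# the "decoder" emits tokens 5, 6 and then <eos>.
def fake_encode(question_ids):
    return 0

def fake_step(token_id, state):
    script = [5, 6, EOS]
    probs = [0.0] * 8
    probs[script[state]] = 1.0
    return probs, state + 1

print(greedy_decode(fake_encode, fake_step, [3, 4]))  # [5, 6]
```

With the real models, the question would first need the same cleaning, index mapping, padding and reversal as the training data, and the returned ids would be mapped back to words with the index-to-word dictionary.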



In [31]:
lat_dim=256 

In [32]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed, Embedding

In [44]:
#building the model
K.clear_session()

#encoder for training
encoder_input=Input(shape=(None,),name='encoder_input',dtype='int32')
encoder_embedding= Embedding(vocab_len+3,lat_dim,mask_zero=True,name='encoder_embedding')(encoder_input)
encoder= LSTM(lat_dim, return_state= True,name="encoder",dropout=0.4)
encoder_output,state_h,state_c =encoder(encoder_embedding)


#encoder for inference
encoder_model = Model(encoder_input, [state_h,state_c ])

#decoder
decoder_input=Input(shape=(None,),name='decoder_input',dtype='int32')
decoder_embedding= Embedding(vocab_len+3,lat_dim,mask_zero=True,name='decoder_embedding')(decoder_input)
decoder= LSTM(lat_dim, return_state= True, return_sequences=True,name='decoder',dropout=0.4)
decoder_output,_,_=decoder(decoder_embedding,initial_state=[state_h,state_c])

dense_layer=Dense(vocab_len+3,activation='softmax',name='dense_layer')
prediction= dense_layer(decoder_output)

#define model for training
model_training= Model([encoder_input,decoder_input],prediction)


#decoder for inference
dec_state_in_h = Input(shape=(lat_dim,))   #new states
dec_state_in_c = Input(shape=(lat_dim,))

decoder_output,state_h, state_c=decoder(decoder_embedding,initial_state=[dec_state_in_h,dec_state_in_c])
decoder_output=dense_layer(decoder_output) 



decoder_model= Model([decoder_input] + [dec_state_in_h,dec_state_in_c],
                    [decoder_output] + [state_h, state_c]
                    )

model_training.summary()
encoder_model.summary()
decoder_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder_embedding (Embedding)   (None, None, 256)    1186048     encoder_input[0][0]              
__________________________________________________________________________________________________
decoder_embedding (Embedding)   (None, None, 256)    1186048     decoder_input[0][0]              
____________________________________________________________________________________________

In [45]:
opt_a=tf.keras.optimizers.Adam(learning_rate=0.001,amsgrad=True) #alternative optimizer, not used below
opt_r=tf.keras.optimizers.RMSprop()
model_training.compile(loss='categorical_crossentropy', optimizer=opt_r,metrics=['accuracy'])

In [46]:
#stops when val_loss (the default monitor) has not improved for 2 epochs
early_stopping_=tf.keras.callbacks.EarlyStopping(patience=2)#,monitor='loss')

In [None]:

model_training.fit([b1_enc_in,b1_dec_in],b1_dec_out,batch_size=128,epochs=280,
                   validation_split=0.2,
                   callbacks=[early_stopping_]
                  )

Epoch 1/280
Epoch 2/280
Epoch 3/280
Epoch 4/280

In [42]:
#saving models

model_training.save('models/training_b1_nobucket.h5')
encoder_model.save('models/encoder_inf_b1_nobucket.h5')
decoder_model.save('models/decoder_inf_b1_nobucket.h5')

## Saving Dictionaries

In [43]:
import json

#note: JSON stores the integer keys of dict_i2w as strings
with open("dict_w2i_chatbot.json", "w") as a_file:
    json.dump(dict_w2i, a_file)

with open("dict_i2w_chatbot.json", "w") as a_file:
    json.dump(dict_i2w, a_file)

with open("contractions_dict.json", "w") as a_file:
    json.dump(contractions_dict, a_file)
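
JSON object keys are always strings, so the integer keys of the index-to-word dictionary come back as strings when the file is reloaded. The sketch below shows a round trip that restores them. `save_dict` and `load_index_to_word` are hypothetical helper names, and the toy mapping and temporary path are only for illustration.

```python
import json
import os
import tempfile

def save_dict(d, path):
    """Write a dictionary to JSON; the file is closed even if dump fails."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(d, f)

def load_index_to_word(path):
    """Reload an index-to-word mapping, restoring the integer keys."""
    with open(path, encoding="utf-8") as f:
        return {int(k): v for k, v in json.load(f).items()}

# Round trip on a toy mapping in a temporary directory.
toy_i2w = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "hello"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dict_i2w_toy.json")
    save_dict(toy_i2w, path)
    restored = load_index_to_word(path)
print(restored == toy_i2w)   # True
```

The word-to-index dictionary needs no such conversion, because its keys are already strings and its integer values survive the round trip unchanged.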
