<h1>Just a simple Chatbot</h1>
<br/>
<p><i>    This project was born within the Natural Language Processing course created for Coursera by Anna Potapenko, Alexey Zobnin, Anna Kozlova, Sergey Yudin and Andrei Zimovnov for the National Research University Higher School of Economics. I will use StarSpace; for reference, see the article 'StarSpace: Embed All The Things!' (arXiv:1709.03856) by Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes and Jason Weston.
</i>    </p> 
<p>My basic idea for the HSE Honor project chatbot is to make the model as simple as possible. Once it shows that it can work, even in a rudimentary way, I will add more complex techniques. To build the model I will initially use the Cornell Movie Dialogues as a dataset and take advantage of some of the functions we learned to use during the course.</p>

In [1]:
from datasets import *
import numpy as np

dataset_path = 'data/cornell/' 

In [2]:
# I use the datasets.py functions to read our data 
max_sentence_len = 25  # maximum sentence length that we consider   

data = readCornellData(dataset_path, max_len = max_sentence_len, fast_preprocessing=False) 

100%|██████████| 83097/83097 [00:05<00:00, 16121.24it/s]


In [3]:
# Now, we just explore the data
initial_data_len = len(data)
print('Size of our dataset: ', initial_data_len, '\n')
print('Three lines of our dataset: ', data[:3], '\n')

print('The same lines in a more readable form: ')
for line in data[:3]:
    que, ans = line
    print(' Q:', que, '\n', 'A:', ans)

Size of our dataset:  38659 

Three lines of our dataset:  [('there', 'where'), ('have fun tonight', 'tons'), ('what good stuff', 'the real you')] 

The same lines in a more readable form: 
 Q: there 
 A: where
 Q: have fun tonight 
 A: tons
 Q: what good stuff 
 A: the real you


<p>As indicated in the Project's instructions, the utilities with which we load conversations filter and split them into pairs of questions and answers.</p>

<h3>A very simple model</h3>
<p>At the bottom of the complexity hierarchy we can create a rule-based model. In such a model, answers are tied to a series of predefined questions. This can give adequate responses for tasks in which the possible questions and answers are limited and can be specified in detail in advance. 
Returning to our example, we can create a database of questions, each matched to a given answer, as shown below.</p>

In [4]:
ques = []
answ = []
for line in data:
    q, a = line
    ques.append(q)
    answ.append(a)

<p>If we ask 'have fun tonight' the chatbot must answer:</p>

In [5]:
question = 'have fun tonight'
idx = ques.index(question)
print(answ[idx])

tons


<p>...of course, this kind of lookup is not appropriate for a chatbot that speaks in natural language. The model above fails as soon as I ask a question not included in our list (<b>ques.index</b> raises a <b>ValueError</b>), and the set of possible questions is endless. Moreover, a chatbot should be able to give different answers to the same question.</p>
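
<p>A minimal sketch of that failure mode, on a tiny made-up list of pairs. The names <b>toy_pairs</b> and <b>rule_based_answer</b> are illustrative assumptions, not part of the course utilities; the only point is that an exact-match lookup has nothing to say about a question it has never seen.</p>

```python
# Toy data standing in for the Cornell pairs (illustrative only)
toy_pairs = [('there', 'where'), ('have fun tonight', 'tons')]
toy_ques = [q for q, _ in toy_pairs]
toy_answ = [a for _, a in toy_pairs]

def rule_based_answer(question, fallback=None):
    '''Exact-match lookup; returns fallback when the question is unknown.'''
    try:
        return toy_answ[toy_ques.index(question)]
    except ValueError:  # list.index raises ValueError for unseen questions
        return fallback

print(rule_based_answer('have fun tonight'))   # known question
print(rule_based_answer('have fun tomorrow'))  # one word changed: no match
```

<p>Changing a single word is enough to fall back to the default, which is why the rest of the notebook moves from exact matching to distances between embeddings.</p>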

<h2>...where we were with the StackOverflow assistant bot</h2>
<p>So, I now restart from where we left off with the StackOverflow assistant bot project. I will create a model in which our sample questions are used as the training dataset. Each question is transformed into an embedding, so that together they form a vector space of possible questions.</p>
<p>Then, using a distance function, the model can measure how close a never-seen question is to the known ones and rank them. Inspired by the rule-based model seen above, I will consider it a success if the model returns, as an appropriate response, one of the answers already present in our dataset.</p>

<p>To create the vector representation I use StarSpace, which proved during the HSE course to be fast and to work quite well. I will use train mode 4: the questions will be treated as examples and the answers as the corresponding labels. So, first of all, I have to create the data file in the right format.</p>

In [6]:
# Prepare the sentences for training: lowercase them and remove unusual, unhelpful characters.

import re  # imported explicitly rather than relying on 'from datasets import *'

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def prepare_text(sentence):
    '''A filter function to prepare our sentences.'''
    
    # raw strings keep the backslashes in the patterns from being read as string escapes
    GOOD_SYMBOLS_RE = re.compile(r'[^0-9a-z ]')
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;#+_]')
    REPLACE_SEVERAL_SPACES = re.compile(r'\s+')

    sentence = sentence.lower()
    sentence = REPLACE_BY_SPACE_RE.sub(' ', sentence)
    sentence = GOOD_SYMBOLS_RE.sub('', sentence)
    sentence = REPLACE_SEVERAL_SPACES.sub(' ', sentence)
    
    return sentence

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marcofosci/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# apply the filter above to the whole dataset

def prepare_data(data):
    '''A utility function to prepare the whole dataset.'''

    train_data = []
    for line in data:
        new_line = []
        for sentence in line:    
            new_line.append(prepare_text(sentence))
        train_data.append(new_line)
        
    return train_data

In [8]:
train_data = prepare_data(data)

In [9]:
# create the file for StarSpace 

def prepare_file(train_data, f_out):
    '''A function to create the correct data file for StarSpace.'''
    
    # the context manager closes the file even if a write fails
    with open(f_out, 'w', encoding='utf8') as out:
        for que, ans in train_data:
            # one example per line: question and answer separated by a tab
            print(que + '\t' + ans, file=out)

In [10]:
# I include the data size in the filename to avoid confusion in my directory with other experiments
data_file = 'starspace/data/data_' + str(initial_data_len) + '.tsv'

prepare_file(train_data, data_file)

<h5>Creation of embeddings with StarSpace</h5>
<p>At this point I will create the embedding file by running StarSpace from a terminal. Basically, I will use the configuration that we used for the third week's assignment.</p>

In [None]:
# trainMode 4 - output:starspace_4
starspace train \
-trainFile 'data/data_38658.tsv' \
-model 'data/train_38658' \
-trainMode 4 \
-lr 0.05 \
-adagrad 1 \
-ngrams 1 \
-epoch 5 \
-verbose 1 \
-similarity cosine \
-minCount 2 \
-fileFormat labelDoc \
-negSearchLimit 10 \
-dim 100 

In [11]:
# I use this function to load embeddings from a tsv file

def load_embeddings(embeddings_path):
    '''Loads pre-trained word embeddings from a tsv file.'''

    embeddings = {}
    embeddings_dim = 0

    # each line holds a word followed by its tab-separated vector components
    with open(embeddings_path, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split('\t')
            embeddings[words[0]] = np.array(words[1:], dtype=np.float32)
            embeddings_dim = len(words) - 1
        
    return embeddings, embeddings_dim

In [12]:
# I load the embeddings that have been created
embeddings_file = 'starspace/data/train_38658.tsv'
embeddings, embeddings_dim = load_embeddings(embeddings_file)

<p>Now that we have created our embeddings file, we need a function that converts the questions asked during the conversation into embeddings. We already used a similar function during the course: <b>question_to_vec</b>.</p>

In [13]:
def question_to_vec(question, embeddings, dim):
    '''Transforms a string to an embedding by filtering and averaging word embeddings.'''
    
    result = np.zeros(dim)
    cnt = 0
    sentence = question.split(' ')
    
    for word in sentence:
        if word in embeddings:
            result += embeddings[word]
            cnt += 1
            
    if cnt != 0:
        result = result / cnt
    
    return result 
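
<p>To make the averaging concrete, here is a small sketch on invented 3-dimensional embeddings. The dictionary <b>toy_embeddings</b> and the helper <b>average_vec</b> are assumptions made for this example only; the helper repeats the same idea as <b>question_to_vec</b> so that the block can run on its own.</p>

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example
toy_embeddings = {
    'have': np.array([1.0, 0.0, 0.0], dtype=np.float32),
    'fun': np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

def average_vec(question, embeddings, dim):
    '''Average the embeddings of the known words; zero vector if none is known.'''
    known = [embeddings[w] for w in question.split() if w in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

vec = average_vec('have fun tonight', toy_embeddings, 3)  # 'tonight' is unknown and skipped
empty = average_vec('xyzzy', toy_embeddings, 3)           # no known word at all
print(vec, empty)
```

<p>Two things are worth noticing: unknown words are silently ignored, and a question made only of unknown words collapses to the zero vector, so all such questions look identical to the model.</p>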

<p>We create the vector space that represents our dataset: in it we pair the index of each question with its embedding. The same index will then be essential for pairing the questions with the answers.</p>

In [14]:
def emb_dict(data):
    '''
    Transform questions into embeddings.
    Also return index dictionaries for questions and answers.
    '''
    
    n_ans = len(data)
    idx_to_emb = np.zeros((n_ans, embeddings_dim), dtype=np.float32)
    ques = {}
    answ = {}

    for i, line in enumerate(data):
        que, ans = line
        idx_to_emb[i, :] = question_to_vec(prepare_text(que), embeddings, embeddings_dim)
        ques[i] = que
        answ[i] = ans 
        
    return idx_to_emb, ques, answ

In [15]:
idx_to_emb, idx_to_que, idx_to_ans = emb_dict(data)

In [16]:
# an example to understand our data structures
import random

i = random.randrange(len(data))  # randint(0, len(data)) could return len(data), one past the last index
print('Our initial ', i, ' element in data: ', data[i], '\n')

print('The corresponding question: ', idx_to_que[i])
print('The corresponding answer: ', idx_to_ans[i], '\n')

print('The question embeddings: ', '\n', idx_to_emb[i, :])

Our initial  20717  element in data:  ('do you really need these', 'only to see') 

The corresponding question:  do you really need these
The corresponding answer:  only to see 

The question embeddings:  
 [-0.00257774  0.01850432  0.00950121 -0.00854143  0.00569524  0.01834482
 -0.00486118 -0.01803708 -0.01664543  0.01179293  0.00612149  0.01358672
 -0.00492432  0.003093   -0.01275836  0.01827809  0.01453148 -0.00889496
  0.00450172  0.01037154  0.01742019  0.01418392  0.00685231 -0.01177643
 -0.0228159   0.01592279  0.02278012 -0.00831177  0.01313655  0.01717566
  0.01569474 -0.02681505  0.01871077  0.00532841 -0.00797549 -0.0150606
  0.00750742 -0.01026774  0.0118776  -0.0206617  -0.00414936 -0.010803
  0.0241643   0.00539037 -0.00280713 -0.01279942 -0.00745639  0.0138423
 -0.00948625  0.0079759  -0.00520654  0.00251925  0.004598    0.01119206
  0.00324697  0.02887112 -0.03167317  0.01572986  0.01749413 -0.01249568
  0.00294685  0.01866975  0.008182   -0.02027527 -0.01188302  0.025

<p>Now we need a function that compares a new question with those in our vector space. I will use <b>pairwise_distances_argmin</b> from the Python <b>sklearn</b> module.</p>

In [17]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

def best_ans(question):
    '''Find the closest question in our vector space and return the corresponding answer.'''
    
    question = prepare_text(question)
    question_vec = question_to_vec(question, embeddings, embeddings_dim) 
    best = pairwise_distances_argmin(question_vec.reshape(1, -1), idx_to_emb, metric='cosine')[0]
    ans = idx_to_ans[best]
    return ans
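
<p>As a rough illustration of what the <b>cosine</b> metric measures, the sketch below computes the distance by hand (one minus the cosine of the angle between two vectors) on made-up 2-dimensional vectors. The names <b>cosine_distance</b>, <b>toy_space</b> and <b>toy_answers</b> are assumptions for this example; the real model works on the 100-dimensional StarSpace embeddings.</p>

```python
import numpy as np

def cosine_distance(u, v):
    '''1 - cos(angle between u and v); 0 means the vectors point the same way.'''
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 2-dimensional "question" vectors and their answers
toy_space = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_answers = ['first', 'second', 'third']

query = np.array([3.0, 2.9])  # much longer than the stored vectors, but nearly parallel to [1, 1]
dists = np.array([cosine_distance(query, row) for row in toy_space])
best = int(np.argmin(dists))  # index of the nearest stored vector
print(toy_answers[best], dists.round(3))
```

<p>The query is far longer than any stored vector, yet its distance to <b>[1, 1]</b> is almost zero: the cosine metric compares directions and ignores vector length, which suits averaged word embeddings whose length has no direct meaning.</p>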

<p>If I ask question <b>i</b> from the dataset, the model should reply with answer <b>i</b>. In reality this does not always happen: it depends on the training phase and on how much our embeddings over- or underfit.</p>

In [18]:
print(idx_to_que[i],'\n')
print(best_ans(idx_to_que[i]))

do you really need these 

only to see


<p>But now, if I ask a new question that is not in our dataset, I should still get an answer.</p>

In [19]:
question = 'What about her'

if question in idx_to_que.values():
    print('---- is in our DATASET ----')
else: 
    print(best_ans(question))

no way


<h4>A problem with this model</h4>
<p>A serious problem with this model is that it will always return the same answer for the same question.</p>

In [20]:
# using the same question as before
print(best_ans(question))

no way


<h4>An easy solution</h4>

<p>Consider for a moment the vector space of our questions. What happens if we consider a certain number of the nearest vectors instead of taking just the closest one? For example, what if we take the 5 vectors closest to our question and each time randomly choose the answer associated with one of those five?</p>

<p>In this way, if we have a good dataset and the training phase has produced good results, we will still get plausible answers that vary from time to time. So, instead of <b>best_ans</b>, we will have the function <b>one_of_best</b>.</p>

In [21]:
import random

from sklearn.metrics.pairwise import pairwise_distances

def one_of_best(question, n):
    '''Find the n closest questions in our vector space and return one of the corresponding answers.'''
    
    question = prepare_text(question)
    question_vec = question_to_vec(question, embeddings, embeddings_dim) 
    dist = pairwise_distances(question_vec.reshape(1, -1), idx_to_emb, metric='cosine')[0]
    ranking = np.argsort(dist)[:n]             # indices of the n nearest questions, closest first
    one_of = random.choice(ranking.tolist())   # pick one of them uniformly at random

    return idx_to_ans[one_of], ranking
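
<p>A possible refinement, sketched here under the assumption that the vector space becomes large: sorting every distance only to keep the first n is more work than necessary. <b>np.argpartition</b> can isolate the n smallest values first, and only those need to be sorted. The helper name <b>top_n_indices</b> and the distances below are invented for the example.</p>

```python
import random
import numpy as np

def top_n_indices(dist, n):
    '''Indices of the n smallest distances, ordered from closest to farthest.'''
    n = min(n, len(dist))
    candidates = np.argpartition(dist, n - 1)[:n]      # the n smallest, in arbitrary order
    return candidates[np.argsort(dist[candidates])]    # sort only those n

toy_dist = np.array([0.9, 0.1, 0.5, 0.3, 0.7])  # invented distances
ranking = top_n_indices(toy_dist, 3)
choice = random.choice(ranking.tolist())         # same random pick as in one_of_best
print(ranking, choice)
```

<p>The result is the same ranking that a full <b>np.argsort</b> would give for the first n positions, so it could replace that step without changing the behaviour of the bot.</p>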

In [22]:
print('Then we had as best answer: ', best_ans(question))

Then we had as best answer:  no way


In [23]:
ans, rank = one_of_best(question, 5)

print('Answers to the closest questions: \n')
for x in rank:
    print(idx_to_ans[x])

Answers to the closest questions: 

no way
get something going there
shes dying
of course
are you


In [24]:
print('Selected answer: ', ans)

Selected answer:  of course


In [25]:
question = 'do you know, i am very strong'
print('Another example. We have the question:')
print(question, '\n')
ans, _ = one_of_best(question, 5)
print('First answer: ', ans)
ans, _ = one_of_best(question, 5)
print('Second answer:', ans)

Another example. We have the question:
do you know, i am very strong 

First answer:  yes you are
Second answer: are you persuadable


<h4>Further improvement</h4>
<p>To further improve this model we can enlarge the starting dataset. Besides increasing the size of our vector space, and so improving the ability to recognize questions, this also provides a wider range of possible answers. On the other hand, a possible problem is that the training phase becomes considerably heavier. However, StarSpace allows us to handle large datasets.</p>

In [26]:
# I extend the dataset by taking all the examples provided by the Cornell Movie Dialogues
max_sentence_len = 2000  # the length of all sentences is less than 2000    

data = readCornellData(dataset_path, max_len = max_sentence_len, fast_preprocessing=False)

100%|██████████| 83097/83097 [00:05<00:00, 15483.32it/s]


In [27]:
# Size of our data
new_data_len = len(data)
print('Size of our dataset: ', new_data_len, '\n')    

Size of our dataset:  221272 



In [28]:
train_data = prepare_data(data)

In [29]:
# I include the data size in the filename to avoid confusion in my directory with other experiments
data_file = 'starspace/data/data_' + str(new_data_len) + '.tsv'

prepare_file(train_data, data_file)

In [None]:
# trainMode 4 - output:starspace_4
starspace train \
-trainFile 'data/data_221272.tsv' \
-model 'data/train_221272' \
-trainMode 4 \
-lr 0.05 \
-adagrad 1 \
-ngrams 1 \
-epoch 5 \
-verbose 1 \
-similarity cosine \
-minCount 2 \
-fileFormat labelDoc \
-negSearchLimit 10 \
-dim 100 

<h4>Some considerations on the embeddings provided by StarSpace</h4>
<p>- The initial dataset had <b>37,836 examples</b> loaded and a dictionary of <b>7,041 words</b>. With 100 dimensions for our embeddings, StarSpace took a total of 41 seconds to train for 5 epochs. <i>The final train error was</i>: <b>0.01311882</b></p>
<p>- The new dataset has <b>220,895 examples</b> loaded and a dictionary of <b>46,560 words</b>. With 100 dimensions for our embeddings, StarSpace took a total of 8 minutes and 52 seconds to train for 5 epochs. <i>The final train error was</i>: <b>0.01259348</b></p>   

In [30]:
# I load the embeddings that have been created
embeddings_file = 'starspace/data/train_221272.tsv'
embeddings, embeddings_dim = load_embeddings(embeddings_file)

In [31]:
# We create the vector space of our new dataset
idx_to_emb, idx_to_que, idx_to_ans = emb_dict(data)

<p>Let's now try to generate some test answers.</p>

In [32]:
question = 'are you joking?'
ans, rank = one_of_best(question, 5)

print('Five answers to the closest questions: \n')
for x in rank:
    print(idx_to_ans[x])

Five answers to the closest questions: 

dont think that i want to save myself from any embarrassment from the awkwardness of meeting anna its not that its that you can say certain things easier if youre alone please sandro do try to understand me it would look like i was trying to influence you to force you to control you and that makes me feel uncomfortable
yes i know i started to say i started to say joe that
to the penny exactly one million dollars in cash
a joke is a story with a humorous climax
i never joke about money


In [33]:
print('We have the question:')
print(question, '\n')
ans, _ = one_of_best(question, 5)
print('First answer: ', ans)
ans, _ = one_of_best(question, 5)
print('Second answer:', ans)

We have the question:
are you joking? 

First answer:  a joke is a story with a humorous climax
Second answer: i never joke about money


<h4>...a larger dataset</h4>
<p>Given how fast StarSpace is, we could experiment with the training parameters, in particular the learning rate, the dimension of the embeddings and the number of epochs. To do this properly, we first need an effective performance metric (I will work on it in the future).</p> 
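
<p>One candidate metric, sketched here purely as an assumption and not something used in this notebook, is a hits-at-k score: hold out some question/answer pairs, and count how often the true answer of a held-out question appears among the answers of its k nearest training questions. The function names and the toy vectors below are invented for illustration.</p>

```python
import numpy as np

def cosine_distances_to(query, space):
    '''Cosine distance between one vector and every row of space.'''
    norms = np.linalg.norm(space, axis=1) * np.linalg.norm(query)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    return 1.0 - (space @ query) / norms

def hits_at_k(train_vecs, train_answers, test_vecs, test_answers, k=5):
    '''Fraction of held-out questions whose true answer is among the k nearest.'''
    hits = 0
    for vec, true_ans in zip(test_vecs, test_answers):
        nearest = np.argsort(cosine_distances_to(vec, train_vecs))[:k]
        hits += any(train_answers[i] == true_ans for i in nearest)
    return hits / len(test_answers)

# Invented toy vectors and answers, only to exercise the function
train_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
train_answers = ['yes', 'no', 'maybe']
test_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
test_answers = ['yes', 'maybe']

print(hits_at_k(train_vecs, train_answers, test_vecs, test_answers, k=1))
print(hits_at_k(train_vecs, train_answers, test_vecs, test_answers, k=2))
```

<p>Exact answer matching is a strict criterion for dialogue, where many replies can be acceptable, so a score like this would probably be more useful for comparing two training configurations than as an absolute measure of quality.</p>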

<p>For the moment, I am content to test the model on a larger dataset: Open Subtitles, with over a million and a half question/answer pairs. Using this dataset at least greatly widens the variety of responses available.</p>

In [34]:
# Now I try to use data from Open Subtitles
max_sentence_len = 5000  # large limit, intended to keep all sentences regardless of length    
path = 'data/opensubs/' 
data = readOpensubsData(path, max_len=max_sentence_len, fast_preprocessing=True)

OpenSubtitles data files:   0%|          | 0/2319 [00:00<?, ?it/s]

Loading OpenSubtitles conversations in data/opensubs/.


OpenSubtitles data files:  17%|█▋        | 403/2319 [00:32<02:35, 12.29it/s]

Skipping file data/opensubs/OpenSubtitles/en/Action/2003/602_152466_207871_batoru_rowaiaru_ii_rekuiemu.xml.gz with errors.


OpenSubtitles data files:  21%|██        | 488/2319 [00:39<02:29, 12.25it/s]

Skipping file data/opensubs/OpenSubtitles/en/Action/2004/59_84873_113518_appurushdo.xml.gz with errors.


OpenSubtitles data files:  52%|█████▏    | 1196/2319 [01:56<01:49, 10.26it/s]

Skipping file data/opensubs/OpenSubtitles/en/Comedy/2003/529_124078_171007_how_to_lose_a_guy_in_10_days.xml.gz with errors.


OpenSubtitles data files:  54%|█████▎    | 1241/2319 [02:02<01:46, 10.13it/s]

Skipping file data/opensubs/OpenSubtitles/en/Comedy/2004/2480_226704_299940_little_black_book.xml.gz with errors.


OpenSubtitles data files:  75%|███████▍  | 1732/2319 [02:54<00:59,  9.90it/s]

Skipping file data/opensubs/OpenSubtitles/en/Drama/2000/179_88528_119102_batoru_rowaiaru.xml.gz with errors.


OpenSubtitles data files:  79%|███████▉  | 1834/2319 [03:04<00:48,  9.94it/s]

Skipping file data/opensubs/OpenSubtitles/en/Drama/2002/3265_149497_204017_unfaithful.xml.gz with errors.


OpenSubtitles data files:  81%|████████  | 1872/2319 [03:07<00:44,  9.98it/s]

Skipping file data/opensubs/OpenSubtitles/en/Drama/2003/1723_68784_89159_big_fish.xml.gz with errors.


OpenSubtitles data files:  83%|████████▎ | 1931/2319 [03:11<00:38, 10.06it/s]

Skipping file data/opensubs/OpenSubtitles/en/Drama/2004/146_206647_272090_eternal_sunshine_of_the_spotless_mind.xml.gz with errors.


OpenSubtitles data files:  89%|████████▉ | 2067/2319 [03:23<00:24, 10.15it/s]

Skipping file data/opensubs/OpenSubtitles/en/Family/2001/3935_19508_22105_cats__dogs.xml.gz with errors.


OpenSubtitles data files:  90%|█████████ | 2091/2319 [03:26<00:22, 10.15it/s]

Skipping file data/opensubs/OpenSubtitles/en/Horror/1922/1166_134135_184270_nosferatu_eine_symphonie_des_grauens.xml.gz with errors.


OpenSubtitles data files: 100%|██████████| 2319/2319 [03:46<00:00, 10.23it/s]
100%|██████████| 1648080/1648080 [00:35<00:00, 45889.08it/s]


In [35]:
# Now, we just explore the data
openS_data_len = len(data)
print('Size of our dataset: ', openS_data_len, '\n')
print('Three lines of our dataset: ', data[:3], '\n')

print('The same lines in a more readable form: ')
for line in data[:3]:
    que, ans = line
    print(' Q:', que, '\n', 'A:', ans)

Size of our dataset:  1616544 

Three lines of our dataset:  [('right then go straight to the office', 'don t dawdle on the way'), ('don t dawdle on the way', 'don t worry'), ('is your mother here too', 'why are you outside')] 

The same lines in a more readable form: 
 Q: right then go straight to the office 
 A: don t dawdle on the way
 Q: don t dawdle on the way 
 A: don t worry
 Q: is your mother here too 
 A: why are you outside


In [36]:
train_data = prepare_data(data)

In [37]:
# I include the data size in the filename to avoid confusion in my directory with other experiments
data_file = 'starspace/data/data_' + str(openS_data_len) + '.tsv'

prepare_file(train_data, data_file)

In [None]:
# trainMode 4 - output:starspace_4
starspace train \
-trainFile 'data/data_1616544.tsv' \
-model 'data/train_1616544' \
-trainMode 4 \
-lr 0.05 \
-adagrad 1 \
-ngrams 1 \
-epoch 5 \
-verbose 1 \
-similarity cosine \
-minCount 2 \
-fileFormat labelDoc \
-negSearchLimit 10 \
-dim 100 

<p>- The Open Subtitles dataset had <b>1,616,048 examples</b> loaded and a dictionary of <b>95,764 words</b>. With 100 dimensions for our embeddings, StarSpace took a total of 54 minutes and 10 seconds to train for 5 epochs. <i>The final train error was</i>: <b>0.01831765</b></p>

In [38]:
# I load the embeddings that have been created
embeddings_file = 'starspace/data/train_1616544.tsv'
embeddings, embeddings_dim = load_embeddings(embeddings_file)

In [39]:
# We create the vector space of our new dataset
idx_to_emb, idx_to_que, idx_to_ans = emb_dict(data)

In [40]:
question = 'What do you think, about Amazon forest?'
ans, rank = one_of_best(question, 5)

print('Answers to the closest questions: \n')
for x in rank:
    print(idx_to_ans[x])

Answers to the closest questions: 

last year i had to drive to lakewood to talk with a client and i went
last year i had to drive to lakewood to talk with a client and i went
could be fun i m tempted
i don t like that that makes me look like a big dufus doesn t it
i haven t had a vacation for ages and a few days over there


In [41]:
print('We have the question:')
print(question, '\n')
ans, _ = one_of_best(question, 5)
print('First answer: ', ans)
ans, _ = one_of_best(question, 5)
print('Second answer:', ans)

We have the question:
What do you think, about Amazon forest? 

First answer:  last year i had to drive to lakewood to talk with a client and i went
Second answer: could be fun i m tempted


In [42]:
# I save the question embeddings and the answers in pickle files to use them in the Telegram NlpHonorBot
import pickle

with open('NlpHonorBot_emb.pkl', 'wb') as f:
    pickle.dump(idx_to_emb, f)
with open('NlpHonorBot_ans.pkl', 'wb') as f:
    pickle.dump(idx_to_ans, f)

# we don't need it for the chatbot, but it is also good to save our index of questions for future tests
with open('NlpHonorBot_que.pkl', 'wb') as f:
    pickle.dump(idx_to_que, f)
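
<p>On the bot side the saved objects have to be read back. The sketch below shows a round trip with stand-in data written to a temporary directory; the file names and toy values are assumptions, and the real bot would open the <b>NlpHonorBot_*.pkl</b> files instead.</p>

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins with the same structure as idx_to_emb and idx_to_ans (invented values)
toy_emb = np.zeros((2, 3), dtype=np.float32)
toy_ans = {0: 'where', 1: 'tons'}

with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, 'toy_emb.pkl')  # hypothetical file names
    ans_path = os.path.join(tmp, 'toy_ans.pkl')
    with open(emb_path, 'wb') as f:
        pickle.dump(toy_emb, f)
    with open(ans_path, 'wb') as f:
        pickle.dump(toy_ans, f)

    # What the bot process would do at start-up
    with open(emb_path, 'rb') as f:
        loaded_emb = pickle.load(f)
    with open(ans_path, 'rb') as f:
        loaded_ans = pickle.load(f)

print(loaded_emb.shape, loaded_ans[1])
```

<p>Note that these two files only cover the dataset side. To embed an incoming message the bot presumably also needs the word embeddings, together with <b>prepare_text</b> and <b>question_to_vec</b>, since <b>one_of_best</b> relies on all three.</p>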

<h2>Problems, possible next steps and conclusions</h2>
<p>Among the problems of this model, the most evident is that we use pre-packaged phrases as answers: however many there are, they will never reproduce the variability of real conversation. Furthermore, the language used will mirror that of the dataset used for the training phase.</p>

<p>Another big problem is that this model has no memory of what has been said: every question/answer pair stands on its own and does not influence the rest of the conversation, as happens in natural language.</p>

<p>The first problem is partly solvable by implementing a <b>generative model</b> for creating responses. This solution brings a whole series of other problems related to the connection between the vector space of the questions and that of the answers. A good solution should include an <b>encoder-decoder</b> model.</p>

<p>The second problem is complex and calls for mechanisms, such as the <b>attention of sequence-to-sequence models</b>, that learn and in some way memorize significant parts of the conversation.</p>

<p>These issues go beyond the time I have available for this project. I plan to implement them in future projects. I hope this notebook can help other students understand some of the algorithms covered during the course.</p>

<p>I implemented the results of this project in the Telegram chatbot: <b>NlpHonorBot</b>.</p>

