<a href="https://colab.research.google.com/github/Rainniee/Neural-Networks-AI/blob/master/Chatbot%20With%20Movie%20Dialog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Pre-trained Word Vectors

In [0]:
# spaCy is a Python library for natural language processing.
# We use it both to tokenize text (i.e., turn text into a list of words) and for its database of word vectors.
!pip install spacy



In [0]:
# spaCy requires a "model" file, a bundle of statistical information that lets the library parse text into words and parts of speech.
# A small model such as en_core_web_sm does not include word vectors, so you'll need to download a model that does.
# For English, the recommended one is en_core_web_lg.
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz#egg=en_core_web_lg==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)
    100% |████████████████████████████████| 852.3MB 50.8MB/s 
Installing collected packages: en-core-web-lg
  Running setup.py install for en-core-web-lg ... done
Successfully installed en-core-web-lg-2.0.0

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')



In [0]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [0]:
# You can look up the word vector for a particular word using spaCy:
nlp.vocab['summarization'].vector

array([-0.33014  , -0.24604  , -0.083086 , -0.17316  , -0.27574  ,
        0.30908  ,  0.24861  ,  0.29057  ,  0.20428  , -0.9444   ,
        0.095839 ,  0.21136  ,  0.11212  ,  0.028754 , -0.034047 ,
        0.28768  , -0.14761  , -0.33668  , -0.47391  , -0.70142  ,
       -0.10779  ,  0.42247  ,  0.27796  , -0.022983 ,  0.076136 ,
        1.1443   ,  0.97622  ,  0.015171 , -0.069094 , -0.56979  ,
        0.28381  ,  0.24827  ,  0.8933   ,  0.10587  , -0.39049  ,
       -0.018418 , -0.0029734, -0.18025  , -0.24388  ,  0.084734 ,
       -0.11058  , -0.17906  , -0.83465  ,  0.29406  ,  0.03161  ,
       -0.16967  ,  0.50273  ,  0.12738  , -0.42268  , -0.036089 ,
        0.046027 ,  0.26427  ,  0.2542   ,  0.49341  , -0.23321  ,
        0.83422  , -0.84002  , -0.10217  ,  0.42499  , -0.34567  ,
        0.17311  ,  0.79635  ,  0.30024  ,  0.47831  ,  0.64396  ,
        0.041427 ,  0.039959 ,  0.40117  ,  0.21181  , -0.40258  ,
       -0.70354  , -0.085231 ,  0.21149  , -0.35306  ,  0.1419

### Parsing Corpus of Conversations



```
Now we need some data for the bot.
In particular, we need some conversations: the text of each turn, along with information about which turn is a response to which.
Researchers at Cornell University have made available a corpus of conversations, the Cornell Movie Dialog Corpus, containing "220,579 conversational exchanges between 10,292 pairs of movie characters."
The data is stored in several plain text files, which you can download by running the following cells:
```



In [0]:
!curl -L -O http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9684k  100 9684k    0     0  5915k      0  0:00:01  0:00:01 --:--:-- 5915k


In [0]:
!unzip cornell_movie_dialogs_corpus.zip

Archive:  cornell_movie_dialogs_corpus.zip
   creating: cornell movie-dialogs corpus/
  inflating: cornell movie-dialogs corpus/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/cornell movie-dialogs corpus/
  inflating: __MACOSX/cornell movie-dialogs corpus/._.DS_Store  
  inflating: cornell movie-dialogs corpus/chameleons.pdf  
  inflating: __MACOSX/cornell movie-dialogs corpus/._chameleons.pdf  
  inflating: cornell movie-dialogs corpus/movie_characters_metadata.txt  
  inflating: cornell movie-dialogs corpus/movie_conversations.txt  
  inflating: cornell movie-dialogs corpus/movie_lines.txt  
  inflating: cornell movie-dialogs corpus/movie_titles_metadata.txt  
  inflating: cornell movie-dialogs corpus/raw_script_urls.txt  
  inflating: cornell movie-dialogs corpus/README.txt  
  inflating: __MACOSX/cornell movie-dialogs corpus/._README.txt  




```
We will work with two files from this corpus:
1. movie_lines.txt
   Contains the movie lines themselves, each associated with a short unique identifier.
2. movie_conversations.txt
   Lists which lines occurred together in conversations, in the order in which they occurred.
```
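
As a rough illustration of the layout described above, the sketch below splits one record of each kind on the ` +++$+++ ` separator. The two sample records are made up to mimic the field layout, and the variable names are descriptive labels chosen for this example, not names defined by the corpus.

```python
# Made-up sample records that mimic the corpus layout; the IDs and text
# are illustrative and are not copied from the files.
SEPARATOR = " +++$+++ "

sample_line = "L101 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Where are you going?"
sample_conversation = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L101', 'L102', 'L103']"

# A movie_lines.txt record: line ID, character ID, movie ID, character name, text
line_id, character_id, movie_id, character_name, text = sample_line.split(SEPARATOR)

# A movie_conversations.txt record: two character IDs, movie ID, list of line IDs
first_character, second_character, movie, line_id_list = sample_conversation.split(SEPARATOR)

print(line_id, "->", text)
print(line_id_list)  # still a string at this point, not a Python list
```

The list of line IDs comes back as a string, so it needs one more parsing step before it can be used.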



In [0]:
# This cell and the next parse the two files and build lookup dictionaries:
# movie_lines maps each unique line ID to its text, and responses maps each line ID to the ID of the line that follows it.
movie_lines = {}
with open("./cornell movie-dialogs corpus/movie_lines.txt",
          encoding="latin1") as f:
    for line in f:
        line = line.strip()
        parts = line.split(" +++$+++ ")
        if len(parts) == 5:
            movie_lines[parts[0]] = parts[4]
        else:
            # any record that does not split into five fields gets an empty string
            movie_lines[parts[0]] = ""

In [0]:
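import json
responses = {}
for line in open("./cornell movie-dialogs corpus/movie_conversations.txt",
                 encoding="latin1"):
    line = line.strip()
    parts = line.split(" +++$+++ ")
    # the fourth field is a list of line IDs written with single quotes;
    # swapping them for double quotes makes the field valid JSON
    line_ids = json.loads(parts[3].replace("'", '"'))
    # map each line ID to the ID of the line that follows it in the conversation
    for first, second in zip(line_ids[:-1], line_ids[1:]):
        responses[first] = second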
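
The pairing step can be tried on its own with a toy list. The sketch below is one possible variation, not the notebook's method: it parses the list field with `ast.literal_eval` from the standard library instead of swapping quotes and calling `json.loads`. The line IDs are hypothetical.

```python
import ast

# Hypothetical list field, written the way Python prints a list of strings.
raw_field = "['L101', 'L102', 'L103', 'L104']"

# literal_eval parses Python literals directly, so the single quotes
# do not need to be swapped for double quotes first.
line_ids = ast.literal_eval(raw_field)

# Pair every line with the line that follows it.
toy_responses = {first: second for first, second in zip(line_ids, line_ids[1:])}
print(toy_responses)
```

The last line of a conversation never appears as a key, because nothing follows it.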

In [0]:
# To make sure everything works, the cell below prints five random pairs of conversational turns from the corpus.
# random.sample() needs a sequence in recent Python versions, so the dictionary items are converted to a list first.

import random
for pair in random.sample(list(responses.items()), 5):
    print("A:", movie_lines[pair[0]])
    print("B:", movie_lines[pair[1]])
    print()

A: They probably stopped off somewhere. Have her call me when she gets back. I've got Lyndsey here and I want to know what time to put her to bed.
B: Okay. Later.

A: From the grave?
B: MyDick.

A: My brother says he likes you, too.
B: Really?

A: Whistler!
B: Are we bringing home strays now?

A: Maybe.
B: No, you weren't



### Making a Sentence Vector



```
To make a sentence vector for each line of dialog, we will use spaCy.
```



In [0]:
# The sentence_mean() function takes the spaCy object and uses it to tokenize the string passed into the function (i.e., break it up into words).
# It then uses numpy's mean() function to average the word vectors, producing a new vector.
# The shape of the resulting vector (i.e., the number of dimensions) is the same as the shape of the individual word vectors.
import numpy as np
def sentence_mean(nlp, s):
    if s == "":
        s = " "
    doc = nlp(s, disable=['tagger', 'parser'])
    return np.mean(np.array([w.vector for w in doc]), axis=0)
sentence_mean(nlp, "This... is a test.").shape

(300,)
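
To see what the averaging does without loading the spaCy model, here is a minimal plain-Python sketch on made-up three-dimensional vectors. The names `toy_vectors` and `toy_sentence_mean` are hypothetical and exist only for this illustration; the per-dimension mean is meant to mirror what `np.mean(..., axis=0)` computes.

```python
from statistics import fmean

# Toy 3-dimensional "word vectors"; the spaCy vectors used above have 300 dimensions.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "morning": [3.0, 4.0, 0.0],
}

def toy_sentence_mean(words):
    """Average the vectors of the given words, one dimension at a time."""
    vectors = [toy_vectors[w] for w in words]
    # zip(*vectors) groups the first components, then the second, and so on.
    return [fmean(component) for component in zip(*vectors)]

print(toy_sentence_mean(["good", "morning"]))  # [2.0, 2.0, 1.0]
```

Averaging discards word order, so two sentences made of the same words in a different order get the same vector.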

### Similarity Lookups



```
The kind of "database" we will need for this is an approximate nearest neighbors lookup,
which lets you store items along with the vectors that represent them
and then run fast searches for items with similar vectors (even for query vectors that weren't in the original dataset).
```
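
For intuition, the sketch below shows the exact search that an approximate index stands in for: compare the query with every stored vector and keep the closest ones. It assumes cosine similarity as the measure and uses made-up two-dimensional vectors; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_nearest(query, items, n=1):
    """Return the n item names whose vectors are most similar to query."""
    ranked = sorted(items,
                    key=lambda name: cosine_similarity(query, items[name]),
                    reverse=True)
    return ranked[:n]

toy_items = {
    "hello": [1.0, 0.1],
    "goodbye": [0.1, 1.0],
    "hi": [0.9, 0.2],
}
print(exact_nearest([1.0, 0.0], toy_items, n=2))  # ['hello', 'hi']
```

This scan touches every stored vector on every query, which is why an approximate index becomes attractive as the collection grows.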



In [0]:
!pip install simpleneighbors

Collecting simpleneighbors
  Downloading https://files.pythonhosted.org/packages/a2/8e/b8ca38e4305bdf5c4cac5d9bf4b65022a2d3641a978b28ce92f9e4063c7b/simpleneighbors-0.0.1-py2.py3-none-any.whl
Collecting annoy (from simpleneighbors)
  Downloading https://files.pythonhosted.org/packages/9c/bf/8e3f7051d694afc086184d223e892d0fc18aca1e4147042d0521a6adedb5/annoy-1.15.1.tar.gz (643kB)
    100% |████████████████████████████████| 645kB 24.8MB/s 
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/77/cb/7a/6f3ed44099e394e0cb0b6b41213b61fe6595b726530744f2ce
Successfully built annoy
Installing collected packages: annoy, simpleneighbors
Successfully installed annoy-1.15.1 simpleneighbors-0.0.1


In [0]:
from simpleneighbors import SimpleNeighbors

# Make a new Simple Neighbors object called nns and initialize it with 300 dimensions
# (the shape of the word vectors in spaCy, and also the shape of our summary vectors).
nns = SimpleNeighbors(300)

# Sample ten thousand random conversational turns from the Cornell corpus,
# find the sentence vector for each, and add it to the database.
for i, line_id in enumerate(random.sample(list(responses.keys()), 10000)):
    # show progress
    if i % 1000 == 0:
        print(i, line_id, movie_lines[line_id])
    line_text = movie_lines[line_id]
    summary_vector = sentence_mean(nlp, line_text)
    # np.any() checks that we don't add any all-zero vectors by accident;
    # these can interfere with the nearest-neighbor search.
    if np.any(summary_vector):
        nns.add_one(line_id, summary_vector)
nns.build()

0 L574934 What's so fucking funny?
1000 L283957 Five hundred dollars.
2000 L219544 My associates did a biopsy on this man recently.  He's supposed to have a melanoma, or a carcinoma, some kind of noma. Hmmm. I can't seem to find any record of it.
3000 L113363 This is very awkward.
4000 L28440 Then what would be enough?  If we were married?
5000 L424172 Where? Where is she?
6000 L147521 You rob an associate of mine... a friend and--
7000 L23385 Close enough to walk to!
8000 L326265 Bull shit .. I'm in my prime ..
9000 L68383 Nixon lives in Saddle River, New York.
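
One likely reason all-zero vectors cause trouble, assuming the index compares vectors by angle: a vector of length zero has no direction, so its cosine similarity with anything is a division by zero. The sketch below is a hypothetical helper written for this illustration, not part of Simple Neighbors.

```python
import math

def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def safe_cosine(a, b):
    """Cosine similarity, or None when either vector has zero length."""
    denominator = norm(a) * norm(b)
    if denominator == 0.0:
        return None  # the direction of a zero vector is undefined
    return sum(x * y for x, y in zip(a, b)) / denominator

print(safe_cosine([0.0, 0.0], [1.0, 2.0]))  # None
print(safe_cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, so close to 1
```

A sentence can end up with an all-zero mean when none of its tokens has a word vector, which is the case the `np.any()` check filters out.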




```
The cell below finds the turn most similar to the string in the variable sentence. (You can change this string to whatever you want.) It uses the Simple Neighbors object to find the turn in the database with the most similar vector, and then uses the responses lookup to find the response to that turn. That response will be our bot's output.
```



In [0]:
# This code finds the turn most similar to the string in the variable sentence:
# it uses the Simple Neighbors object to find the turn in the database with the most similar vector,
# and then uses the responses lookup to find the response to that turn.
sentence = "I like making bots."
picked = nns.nearest(sentence_mean(nlp, sentence), 5)[0]
response_line_id = responses[picked]

# that response will be our bot's output.
print("Your line:\n\t", sentence)
print("Most similar turn:\n\t", movie_lines[picked])
print("Response to most similar turn:\n\t", movie_lines[response_line_id])

Your line:
	 I like making bots.
Most similar turn:
	 I like it.
Response to most similar turn:
	 Blue ruin is cheap gin in case you were wondering.
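
The whole lookup chain (query vector, most similar stored turn, response to that turn) can be sketched on a miniature hand-made corpus. Everything here is hypothetical: the IDs, the text, the two-dimensional vectors, and the exact `max()` scan that stands in for the approximate index.

```python
import math

# Hypothetical miniature corpus: line IDs, text, and hand-made 2-d vectors.
toy_lines = {
    "L1": "How are you?",
    "L2": "Fine, thanks.",
    "L3": "What time is it?",
    "L4": "Almost noon.",
}
toy_responses = {"L1": "L2", "L3": "L4"}          # turn -> reply
toy_vectors = {"L1": [1.0, 0.2], "L3": [0.1, 1.0]}  # only turns that have a reply

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def reply_to(query_vector):
    """Return (most similar stored turn, reply to that turn)."""
    picked = max(toy_vectors,
                 key=lambda line_id: cosine(query_vector, toy_vectors[line_id]))
    return toy_lines[picked], toy_lines[toy_responses[picked]]

similar, reply = reply_to([0.9, 0.3])
print("Most similar turn:", similar)
print("Response:", reply)
```

Only turns that have a recorded reply are indexed, so the final lookup in `toy_responses` always succeeds.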


### Make Online Chatbot

In [0]:
!pip install https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip

Collecting https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip
  Downloading https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip
     / 122kB 8.3MB/s
Building wheels for collected packages: semanticsimilaritychatbot
  Building wheel for semanticsimilaritychatbot (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-bzl5hifr/wheels/f7/af/8e/8a8fbef31bfbfc3b935425efa03db03825795d85f4e23f8255
Successfully built semanticsimilaritychatbot
Installing collected packages: semanticsimilaritychatbot
Successfully installed semanticsimilaritychatbot-0.0.1


In [0]:
# create a chatbot object, passing in the spaCy language object (nlp) and the number of dimensions
from semanticsimilaritychatbot import SemanticSimilarityChatbot
chatbot = SemanticSimilarityChatbot(nlp, 300)

In [0]:
# The .add_pair() method takes two strings: a turn and the response to that turn.
# We get these from the responses and movie_lines lookups,
# again sampling ten thousand pairs at random.
sample_n = 10000

for first_id, second_id in random.sample(list(responses.items()), sample_n):
    chatbot.add_pair(movie_lines[first_id], movie_lines[second_id])
chatbot.build()


In [0]:
# the .response_for() method returns a plausible response from the database, based on semantic similarity
print(chatbot.response_for("Hello computer!"))

Han, don't. It'll be all right.


In [0]:
# To add variety, the .response_for() method actually selects randomly among several similar turns.

my_turn = "The weather's nice today, don't you think?"
for i in range(5, 51, 5):
    print("picking from", i, "possible responses:")
    print(chatbot.response_for(my_turn, i))
    print()

picking from 5 possible responses:
Of course. I would like to look.

picking from 10 possible responses:
You're mad, that's your trouble, you're mad.

picking from 15 possible responses:
I don't have to do any such thing. I'm eating my lunch, okay?

picking from 20 possible responses:
Everybody does?

picking from 25 possible responses:
Yeah, Dad.  I'm happy right now.

picking from 30 possible responses:
I don't have to do any such thing. I'm eating my lunch, okay?

picking from 35 possible responses:
Who wants true? Who wants moving?

picking from 40 possible responses:
Don't bother.

picking from 45 possible responses:
Who wants true? Who wants moving?

picking from 50 possible responses:
Mack, I'm just trying to keep up with now.
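
The idea of choosing among several close matches can be sketched as follows. This is an illustration of the idea only, not the package's actual code; `pick_reply` and the reply strings are hypothetical, and the list is assumed to be ordered from best match to worst.

```python
import random

def pick_reply(ranked_replies, n, rng=random):
    """Choose at random among the first n replies of a best-first list."""
    return rng.choice(ranked_replies[:n])

ranked = ["reply A", "reply B", "reply C", "reply D"]
rng = random.Random(0)  # seeded so that repeated runs behave the same way

print(pick_reply(ranked, 1, rng))  # with n=1 this is always the best match
print(pick_reply(ranked, 3, rng))  # one of the three best matches
```

A larger `n` gives more varied replies at the cost of drifting further from the closest match, which is the trade-off visible in the output above.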



In [0]:
# The Semantic Similarity Chatbot object has a .save() method that saves the pre-built database to disk, using a filename prefix you supply.
# It saves three different files: <prefix>.annoy, <prefix>-data.pkl, and <prefix>-chatbot.pkl
chatbot.save("movielines-10k-sample")

# You can reuse a previously saved database with the .load() class method.
# This means you don't have to build the database again; just load it and start calling .response_for().
chatbot = SemanticSimilarityChatbot.load("movielines-10k-sample", nlp)

In [0]:
# download all of the files from the pre-built bot to your computer so we can use them later
from google.colab import files
files.download('movielines-10k-sample.annoy')
files.download('movielines-10k-sample-data.pkl')
files.download('movielines-10k-sample-chatbot.pkl')

#### Making it Interactive

In [0]:
# create a little interactive interface for chatting with the bot that we just built

chatbot_html = """
<style type="text/css">#log p { margin: 5px; font-family: sans-serif; }</style>
<div id="log"
     style="box-sizing: border-box;
            width: 600px;
            height: 32em;
            border: 1px grey solid;
            padding: 2px;
            overflow: scroll;">
</div>
<input type="text" id="typehere" placeholder="type here!"
       style="box-sizing: border-box;
              width: 600px;
              margin-top: 5px;">
<script>
function paraWithText(t) {
    let tn = document.createTextNode(t);
    let ptag = document.createElement('p');
    ptag.appendChild(tn);
    return ptag;
}
document.querySelector('#typehere').onchange = async function() {
    let inputField = document.querySelector('#typehere');
    let val = inputField.value;
    inputField.value = "";
    let resp = await getResp(val);
    let objDiv = document.getElementById("log");
    objDiv.appendChild(paraWithText('😀: ' + val));
    objDiv.appendChild(paraWithText('🤖: ' + resp));
    objDiv.scrollTop = objDiv.scrollHeight;
};
async function colabGetResp(val) {
    let resp = await google.colab.kernel.invokeFunction(
        'notebook.get_response', [val], {});
    return resp.data['application/json']['result'];
}
async function webGetResp(val) {
    let resp = await fetch("/response.json?sentence=" + 
        encodeURIComponent(val));
    let data = await resp.json();
    return data['result'];
}
</script>
"""

In [0]:
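import IPython
from google.colab import output

# render the chat widget and point its getResp hook at the Colab version
display(IPython.display.HTML(chatbot_html +
                             "<script>let getResp = colabGetResp;</script>"))

# Python callback invoked from the page's JavaScript; wraps the bot's reply as JSON
def get_response(val):
    resp = chatbot.response_for(val)
    return IPython.display.JSON({'result': resp})

# expose the callback under the name that colabGetResp() invokes
output.register_callback('notebook.get_response', get_response)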
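
The `webGetResp` function in the HTML above expects a server that answers `/response.json?sentence=...` with a JSON object containing a `result` key. The notebook does not show that server, so the following is only a minimal sketch using the standard library. `make_reply`, `build_payload`, `ChatHandler`, `serve`, and port 8000 are all assumptions made for this example; a real deployment would call `chatbot.response_for()` inside `make_reply`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def make_reply(sentence):
    """Hypothetical stand-in; a real server would call chatbot.response_for()."""
    return "You said: " + sentence

def build_payload(path):
    """Turn a request path into the JSON body, or None for unknown paths."""
    parsed = urlparse(path)
    if parsed.path != "/response.json":
        return None
    sentence = parse_qs(parsed.query).get("sentence", [""])[0]
    return json.dumps({"result": make_reply(sentence)})

class ChatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = build_payload(self.path)
        if payload is None:
            self.send_error(404)
            return
        body = payload.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Run the server until interrupted; not called automatically."""
    HTTPServer(("", port), ChatHandler).serve_forever()

print(build_payload("/response.json?sentence=Hello%20there"))
```

Calling `serve()` starts the server. The page would then need `let getResp = webGetResp;` in place of the Colab hook.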