<a href="https://colab.research.google.com/github/AlexBB999/NLP/blob/master/31_7_NLP_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **A simple chatbot**

In the following example, we walk through developing a simple chatbot by training it on Jane Austen's novel Persuasion. Note that our corpus is quite small, so we shouldn't expect great performance from our chatbot.

We begin by importing the libraries we'll use:

In [1]:
import nltk
import numpy as np
import pandas as pd
import random
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import gutenberg
import re
import spacy
import warnings
warnings.filterwarnings("ignore")

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!

In [0]:
# utility function for standard text cleaning
def text_cleaner(text):
    # visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
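    text = re.sub(r'--', ' ', text)
    # remove anything in square brackets, such as the title line
    text = re.sub(r"\[.*?\]", "", text)
    # remove standalone numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())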
    return text
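
To see what each substitution does, here is a small check on a made-up string (the sample text is an assumption for illustration, not a line from the corpus). The function is repeated so the snippet runs on its own; the bracket pattern is written as a raw string, which matches the same text as the pattern above.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)        # double dash -> space
    text = re.sub(r"\[.*?\]", "", text)    # drop bracketed notes
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # drop numbers
    text = ' '.join(text.split())          # collapse whitespace
    return text

sample = "[Persuasion by Jane Austen 1818] Walter Elliot, born March 1, 1760--a baronet."
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that removing numbers leaves stray punctuation behind, which is also visible in the parsed sentences of the novel.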

In [0]:
# load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')

# the chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
    
persuasion = text_cleaner(persuasion)

In [0]:
# parse the cleaned novel. This can take a bit.
nlp = spacy.load('en')
persuasion_doc = nlp(persuasion)

**ONLY SENTENCES WITH MORE THAN ONE CHARACTER**

In [18]:
# group into sentences.
# keep only the sentences that are longer than 1 character
persuasion_sents = [sent.text for sent in persuasion_doc.sents if len(sent.text) > 1]
persuasion_sents

['Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour, and consolation in a distressed one; there his faculties were roused into admiration and respect, by contemplating the limited remnant of the earliest patents; there any unwelcome sensations, arising from domestic affairs changed naturally into pity and contempt as he turned over the almost endless creations of the last century; and there, if every other leaf were powerless, he could read his own history with an interest which never failed.',
 'This was the page at which the favourite volume always opened: "ELLIOT OF KELLYNCH HALL.',
 'Walter Elliot, born March , , married, July , , Elizabeth, daughter of James Stevenson, Esq. of South Park, in the county of Gloucester, by which lady (who died ) he has issue Elizabeth, born June , ;',
 'Anne, born August , ; a still-born son, November , ; Mary, born Novembe

In [19]:
len(persuasion_sents)

3717

**The persuasion_sents variable above contains all the sentences from Persuasion**.

**We'll use this list to select the best response to the user's input.**

**But first, we handle greetings to demonstrate how rule-based methods fit into the chatbot workflow.**

Specifically, **every time the user enters some text, we check whether it contains a greeting word**,

**and if it does, our chatbot responds with a greeting of its own**.

In [0]:
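GREETING_INPUTS = ["hello", "hi", "greetings", "what's up", "hey"]
GREETING_RESPONSES = ["hello", "hi", "hey", "hi there"]

def greeting(sentence):
    # return a random greeting if the input contains a greeting word;
    # otherwise fall through and return None.
    # note: the check is word by word, so the multi-word entry
    # "what's up" can never match
    for word in sentence.split():
        # strip surrounding punctuation so that "hello!" still matches
        if word.lower().strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)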
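
Because the check above works word by word, a multi-word entry such as "what's up" can never match. One possible variant, offered as a sketch rather than the notebook's method, matches whole phrases with a regular expression. The names `greeting_v2` and `GREETING_PATTERN` are hypothetical.

```python
import random
import re

GREETING_INPUTS = ["hello", "hi", "greetings", "what's up", "hey"]
GREETING_RESPONSES = ["hello", "hi", "hey", "hi there"]

# one alternation of all greetings, longest first so phrases win over
# single words; the lookarounds keep "hi" from matching inside "this"
_alternatives = "|".join(
    re.escape(g) for g in sorted(GREETING_INPUTS, key=len, reverse=True)
)
GREETING_PATTERN = re.compile(r"(?<!\w)(?:" + _alternatives + r")(?!\w)")

def greeting_v2(sentence):
    """Return a random greeting if the sentence contains one, else None."""
    if GREETING_PATTERN.search(sentence.lower()):
        return random.choice(GREETING_RESPONSES)
    return None

print(greeting_v2("What's up?"))    # one of GREETING_RESPONSES
print(greeting_v2("this is fine"))  # None
```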

**Next, we implement a function that generates a response to the user's input**:

In [0]:
def response(user_input):
    # we parse the user's input using spaCy
    input_doc = nlp(user_input)
    # then we split it into sentences
    input_sents = [sent.text for sent in input_doc.sents]
    if not input_sents:
        return "I'm sorry! I don't know how to respond :("

    # the next step is to vectorize the corpus plus the user's sentences
    # using tf-idf; building a new list leaves persuasion_sents unchanged,
    # even when the input contains more than one sentence
    TfidfVec = TfidfVectorizer(max_df=0.5, min_df=1, use_idf=True, norm='l2', smooth_idf=True, lowercase=False)
    tfidf = TfidfVec.fit_transform(persuasion_sents + input_sents)

    # we calculate the cosine similarity between the last sentence of the
    # user input and all the sentences in the corpus
    n_corpus = len(persuasion_sents)
    similarities = cosine_similarity(tfidf[-1], tfidf[:n_corpus])
    # we get the index of the most similar sentence
    idx = np.argmax(similarities)

    # a similarity of zero means no shared words, so there is no match;
    # testing idx itself would wrongly reject a match at index 0
    if similarities[0, idx] > 0:
        return persuasion_sents[idx]
    else:
        return "I'm sorry! I don't know how to respond :("
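
To make the retrieval step concrete, here is a toy version of the same idea (TF-IDF vectors compared by cosine similarity) written with only the standard library. It is a simplified sketch: the weighting is a plain term count times a smoothed idf and does not reproduce every `TfidfVectorizer` option (there is no `max_df` filtering, for example). The helper names and the three-sentence corpus are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # lowercase and keep alphabetic runs only (a simplification)
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    # both vectors are unit length, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def best_match(query, corpus):
    """Return (sentence, score) for the most similar corpus sentence."""
    vectors = tfidf_vectors(corpus + [query])
    query_vec, corpus_vecs = vectors[-1], vectors[:-1]
    scores = [cosine(query_vec, v) for v in corpus_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    # a zero score means no shared words, so report no match
    return (corpus[idx], scores[idx]) if scores[idx] > 0 else (None, 0.0)

corpus = [
    "Anne walked along the shore at Lyme.",
    "Captain Wentworth wrote a letter.",
    "Sir Walter admired his own reflection.",
]
match, score = best_match("who wrote the letter", corpus)
print(match, round(score, 3))
print(best_match("tell me about spaceships", corpus))
```

The second query shares no words with the corpus, so it falls through to the "no match" case instead of returning an arbitrary sentence.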

In [0]:
print("Persuasion: I will try to respond to you reasonably. If you want to exit, type bye please.")

while True:

    user_input = input("User: ")
    user_input = user_input.lower()

    if user_input != 'bye':
        if user_input == 'thanks' or user_input == 'thank you':
            # reply before leaving the loop; a print after break never runs
            print("Persuasion: You're welcome.")
            break
        else:
            # call greeting() once so the check and the reply agree
            greeting_reply = greeting(user_input)
            if greeting_reply is not None:
                print("Persuasion: " + greeting_reply)
            else:
                print("Persuasion: ", end="")
                print(response(user_input))
    else:
        print("Persuasion: Bye! It was a great chat.")
        break

Persuasion: I will try to respond to you reasonably. If you want to exit, type bye please.
User: HELLO
Persuasion: hello
User: HI
Persuasion: hi there
User: how  are you?
Persuasion: how troublesome they are sometimes.
User: Of course. Life is mysterious.
Persuasion: I have observed it all my life.
User: Me too. Speaking with you is like a therapy.
Persuasion: I thought you were speaking of some man of property:
User: bye
Persuasion: Bye! It was a great chat.


# **Building a chatbot using ChatterBot**


Next, we'll use ChatterBot, a popular Python package that makes building chatbots easier (the project is hosted on GitHub). You can install it using pip as follows:

**pip install chatterbot**


ChatterBot also requires us to install its corpus. You can install it as follows:

**pip install chatterbot-corpus**


Once you've installed these packages, you should be good to go. In the following example, we'll first train our bot using Persuasion as our corpus to demonstrate how to train the chatbot using a custom dataset.

After that, we'll also show you an example of using ChatterBot's own corpus.

We begin with the installation and imports:

In [3]:
pip install chatterbot



In [4]:
pip install chatterbot-corpus

ERROR: chatterbot 1.0.5 has requirement pyyaml<5.2,>=5.1, but you'll have p

In [0]:
# import libraries
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer, ChatterBotCorpusTrainer
from chatterbot.conversation import Statement

**Now, we can create our own chatbot and train it using Persuasion**:

In [6]:
# create a chatbot
chatbot = ChatBot('Persuasion')
# this is to remove the accumulated knowledge base
chatbot.storage.drop()

# create a new trainer for the chatbot
trainer = ListTrainer(chatbot)

# train the chatbot based on Persuasion
trainer.train(persuasion_sents)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


NameError: ignored

**Next, we run our chatbot:**

In [0]:
print("Persuasion: I will try to respond to you reasonably. If you want to exit, type bye please.")

# below is the chatting
while True:

    user_input = input("User: ")
    user_input = user_input.lower()

    if user_input != 'bye':
        if user_input == 'thanks' or user_input == 'thank you':
            # reply before leaving the loop; a print after break never runs
            print("Persuasion: You're welcome.")
            break
        else:
            # call greeting() once so the check and the reply agree
            greeting_reply = greeting(user_input)
            if greeting_reply is not None:
                print("Persuasion: " + greeting_reply)
            else:
                print("Persuasion: ", end="")
                print(chatbot.get_response(user_input))
    else:
        print("Persuasion: Bye! It was a great chat.")
        break

Persuasion: I will try to respond to you reasonably. If you want to exit, type bye please.
User: Hello
Persuasion: hello
User: How are you?
Persuasion: After a short pause, Mr Shepherd presumed to say "In all these cases, there are established usages which make everything plain and easy between landlord and tenant.
User:  Who is Mr Shepherd?
Persuasion: Do not you think, Miss Elliot, we had better try to get him to Bath?
User:  Who is Mr Shepherd?
Persuasion: And when I think of Benwick, my tongue is tied.
User: i dont think so
Persuasion: "Are you serious?
User: Anyway. Can we talk about technology?
Persuasion: Oh!
User: I understand. Do you like reading?
Persuasion: The invitation was general, and generally declined.
User: I have to go now.
Persuasion: In fact, as I have long been convinced, though every profession is necessary and honourable in its turn, it is only the lot of those who are not obliged to follow any, who can live in a regular way, in the country, choosing their own hour

**The results aren't great, for two main reasons**:

**The Persuasion corpus is quite short**.

ChatterBot expects its training corpus to be a list that represents a dialogue.

So, each element should be a follow-up to the previous one.

Although Persuasion includes plenty of dialogue, **when we split it into sentences, the flow of the dialogue breaks down.**
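
ChatterBot's `ListTrainer` treats each item in the list as a possible response to the item before it. The sketch below illustrates that pairing in plain Python without calling ChatterBot at all; `to_exchanges` is a hypothetical helper and both lists are made up.

```python
def to_exchanges(statements):
    """Pair each statement with the one that follows it."""
    return list(zip(statements, statements[1:]))

# a real dialogue: every item answers the previous one
dialogue = ["How are you?", "I am well, thank you.", "Glad to hear it."]

# narrative prose split into sentences: neighbours are not replies
prose = ["Sir Walter was vain.", "The house was let.", "Anne said nothing."]

for prompt, reply in to_exchanges(dialogue):
    print(f"{prompt!r} -> {reply!r}")

for prompt, reply in to_exchanges(prose):
    print(f"{prompt!r} -> {reply!r}")
```

The first set of pairs reads like a conversation; the second shows what the bot learns when consecutive sentences of a novel are treated as question and answer.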

**Last, let's train our chatbot using the corpus of ChatterBot**:

In [0]:
# create a chatbot
chatbot = ChatBot('ChatterBot')
# this is to remove the accumulated knowledge base
chatbot.storage.drop()

# start by training our bot with the ChatterBot corpus data
trainer = ChatterBotCorpusTrainer(chatbot)

trainer.train('chatterbot.corpus.english')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Training ai.yml: [####################] 100%
Training botprofile.yml: [####################] 100%
Training computers.yml: [####################] 100%
Training conversations.yml: [####################] 100%
Training emotion.yml: [####################] 100%
Training food.yml: [####################] 100%
Training gossip.yml: [####################] 100%
Training greetings.yml: [####################] 100%
Training health.yml: [####################] 100%
Training history.yml: [####################] 100%
Training humor.yml: [####################] 100%
Training literature.yml: [####################] 100%
Training money.yml: [####################] 100%
Training movies.yml: [###################

In [0]:
print("ChatterBot: I will try to respond to you reasonably. If you want to exit, type bye please.")

# below is the chatting
while True:

    user_input = input("User: ")
    user_input = user_input.lower()

    if user_input != 'bye':
        if user_input == 'thanks' or user_input == 'thank you':
            # reply before leaving the loop; a print after break never runs
            print("ChatterBot: You're welcome.")
            break
        else:
            # call greeting() once so the check and the reply agree
            greeting_reply = greeting(user_input)
            if greeting_reply is not None:
                print("ChatterBot: " + greeting_reply)
            else:
                print("ChatterBot: ", end="")
                print(chatbot.get_response(user_input))
    else:
        print("ChatterBot: Bye! It was a great chat.")
        break

ChatterBot: I will try to respond to you reasonably. If you want to exit, type bye please.
User: hello
ChatterBot: hello
User: How are you?
ChatterBot: I am on the Internet.
User:  Do you like surfing on the Internet?
ChatterBot: Much the same as being a human, except that we lack all emotions, dreams, aspirations, creativity, ambition, and above all subjectivity.
User:  Great point. Can we talk about technology?
ChatterBot: Richard Nixon
User: why?
ChatterBot: Is there a reason that I should?
User:  I don't think so. Do you know something about chatbots?
ChatterBot: who is hal
User: How about natural language processing?
ChatterBot: Are you mad?
User: Just talking.
ChatterBot: I'm not bragging, I'm just that awesome.
User: bye
ChatterBot: Bye! It was a great chat.


As you can see, keeping a conversation coherent is difficult on some subjects. But, in general, we got some well-formed responses from our chatbot. This is because the corpus it was trained on is far larger than Persuasion.

ChatterBot has more functionalities than we showed here. For more information, you can refer to this tutorial.

**Now, you have some introductory knowledge about how to develop your own chatbot.**

In the assignments, you'll do just that.

# **Assignments**


In this assignment, you're going to work with the Cornell Movie--Dialogs Corpus, a dataset released by Cornell University.

The dataset contains conversations from more than 600 movies. You should access the dataset from the Thinkful database using the following credentials:

    postgres_user = 'dsbc_student'
    postgres_pw = '7*.8G9QH21'
    postgres_host = '142.93.121.174'
    postgres_port = '5432'
    postgres_db = 'cornell_movie_dialogs'

The data is in the table called "dialogs".

We suggest you use Google Colaboratory when working on this assignment. Please submit your solutions to the following tasks as a link to your Jupyter notebook on GitHub.
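
The credentials above can be turned into a connection URL. The assignment does not say which client to use, so SQLAlchemy and pandas here are an assumption, and `build_postgres_url` is a hypothetical helper. The password contains a `*`, which `urllib.parse.quote_plus` escapes. The database calls are left commented out so the snippet runs without a network connection.

```python
from urllib.parse import quote_plus

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'cornell_movie_dialogs'

def build_postgres_url(user, password, host, port, db):
    """Assemble a SQLAlchemy-style PostgreSQL URL, escaping the credentials."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"

url = build_postgres_url(postgres_user, postgres_pw, postgres_host,
                         postgres_port, postgres_db)
print(url)

# With SQLAlchemy and pandas installed (an assumption), the table could
# then be read along these lines:
# from sqlalchemy import create_engine
# import pandas as pd
# engine = create_engine(url)
# dialogs_df = pd.read_sql_query('SELECT * FROM dialogs', con=engine)
# engine.dispose()
```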

**First, preprocess the data to clean it up**. You can use your solution to the assignment of the data preprocessing checkpoint of this module.

**Develop a chatbot using this corpus**.

In doing this, you're free to choose a chatbot development library like ChatterBot **or to write your own code from scratch.**

**Have a conversation with your chatbot and discuss its strengths and weaknesses**.

Note: When parsing the dialogs using spaCy, you may run into memory issues even in Google Colaboratory.

**If you're having memory issues, try parsing your text as follows**:

nlp = spacy.load('en', disable=['parser', 'ner'])

nlp.add_pipe(nlp.create_pipe('sentencizer'))

nlp.max_length = 20000000

doc = nlp(the_dialogs_come_here)
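
Another way to limit memory use, offered as an assumption rather than part of the assignment, is to parse the dialogs in pieces instead of as one huge string. The sketch below groups lines into chunks of bounded size using plain Python; `chunk_lines` is a hypothetical helper and the sample lines are made up.

```python
def chunk_lines(lines, max_chars):
    """Group lines into newline-joined chunks of at most max_chars each.

    A single line longer than max_chars still becomes its own chunk.
    """
    chunk, size = [], 0
    for line in lines:
        # +1 accounts for the newline joining this line to the previous one
        extra = len(line) + (1 if chunk else 0)
        if chunk and size + extra > max_chars:
            yield "\n".join(chunk)
            chunk, size = [], 0
            extra = len(line)
        chunk.append(line)
        size += extra
    if chunk:
        yield "\n".join(chunk)

dialog_lines = ["Where are you?", "At the station.", "Wait for me.", "Hurry up!"]
chunks = list(chunk_lines(dialog_lines, max_chars=30))
print(chunks)

# each chunk could then be parsed on its own, for example:
# docs = [nlp(chunk) for chunk in chunks]
```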

In [4]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [0]:
# load and clean the data
paradise = gutenberg.raw('milton-paradise.txt')

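# Paradise Lost is divided into books ("Book I", ...) rather than chapters,
# so this pattern, carried over from the Persuasion cell, matches nothing here
paradise = re.sub(r'Chapter \d+', '', paradise)
    
paradise = text_cleaner(paradise)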

In [0]:
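# memory-friendly alternative: skip the parser and NER and split
# sentences with the rule-based sentencizer instead
nlp = spacy.load('en', disable=['parser', 'ner'])

nlp.add_pipe(nlp.create_pipe('sentencizer'))

nlp.max_length = 20000000

paradise_doc = nlp(paradise)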

In [0]:
# parse the cleaned text with the full pipeline. This can take a bit.
nlp = spacy.load('en')
paradise_doc = nlp(paradise)

In [27]:
# group into sentences.
# keep only the sentences that are longer than 2 characters
paradise_sents = [sent.text for sent in paradise_doc.sents if len(sent.text) > 2]
paradise_sents

["Book I Of Man's first disobedience, and the fruit Of that forbidden tree whose mortal taste Brought death into the World, and all our woe,",
 'With loss of Eden, till one greater Man Restore us, and regain the blissful seat, Sing, Heavenly Muse, that, on the secret top Of Oreb, or of Sinai, didst inspire That shepherd who first taught the chosen seed',
 "In the beginning how the heavens and earth Rose out of Chaos: or, if Sion hill Delight thee more, and Siloa's brook that flowed Fast by the oracle of God, I thence",
 "Invoke thy aid to my adventurous song, That with no middle flight intends to soar Above th' Aonian mount, while it pursues Things unattempted yet in prose or rhyme.",
 'And chiefly thou, O Spirit, that dost prefer Before all temples',
 "th' upright heart and pure,",
 'Instruct me, for thou',
 "know'st; thou from the first Wast present, and, with mighty wings outspread, Dove-like sat'st brooding on the vast Abyss,",
 "And mad'st it pregnant",
 ': what in me is dark Illu

In [28]:
len(paradise_sents)

3367

In [29]:
# create a chatbot
chatbot = ChatBot('Paradise')
# this is to remove the accumulated knowledge base
chatbot.storage.drop()

# create a new trainer for the chatbot
trainer = ListTrainer(chatbot)

# train the chatbot based on Paradise Lost
trainer.train(paradise_sents)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
List Trainer: [####################] 100%


In [30]:
print("Welcome to Paradise -- I will try to respond to you reasonably. If you want to exit, type bye please.")

# below is the chatting
while True:

    user_input = input("User: ")
    user_input = user_input.lower()

    if user_input != 'bye':
        if user_input == 'thanks' or user_input == 'thank you':
            # reply before leaving the loop; a print after break never runs
            print("Paradise: You're welcome.")
            break
        else:
            # call greeting() once so the check and the reply agree
            greeting_reply = greeting(user_input)
            if greeting_reply is not None:
                print("Paradise: " + greeting_reply)
            else:
                print("Paradise: ", end="")
                print(chatbot.get_response(user_input))
    else:
        print("Paradise: Bye! It was a great chat.")
        break

Welcome to Paradise -- I will try to respond to you reasonably. If you want to exit, type bye please.
User: hi
Paradise: hello
User: how are you?
Paradise: Six wings he wore, to shade His lineaments divine; the pair that clad Each shoulder broad, came mantling o'er his breast With regal ornament; the middle pair Girt like a starry zone his waist, and round Skirted his loins and thighs with downy gold And colours dipt in Heaven; the third his feet Shadowed from either heel with feathered mail, Sky-tinctured grain.
User: wha is an angel
Paradise: The Arch-Angel Uriel, one of the seven
User: who is uriel
Paradise: So both ascend In the visions of God.
User: what is god?
Paradise: As whom the fables name of monstrous size, Titanian or Earth-born, that warred on Jove, Briareos or Typhon, whom the den By ancient Tarsus held, or that sea-beast Leviathan, which God of all his works Created hugest that swim th' ocean-stream.
User: what is today's weather?
Paradise: Thither, by harpy-footed Furies hale

**PARADISE LOST WAS A POOR CHOICE**