# <center> Chatbot with Python, NLTK and MaxentClassifier </center>

In this project, we aim to build a chatbot for the cryptocurrency domain using the NLTK library and its MaxentClassifier.

<b>Resources</b>: https://medium.com/x8-the-ai-community/build-your-first-chatbot-in-python-334247814900

## 1. Build the Chatbot with NLTK

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords', quiet = True, raise_on_error = True)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('nps_chat')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import PunktSentenceTokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macbi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\macbi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\macbi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package nps_chat to
[nltk_data]     C:\Users\macbi\AppData\Roaming\nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!


In [2]:
import string
import random
import pandas as pd
from termcolor import colored
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
inputs = ("hi", "hello", "hi there")
outputs = ["hi", "hey", "hi there", "welcome"]
filename = "source.txt"

In [4]:
lem = nltk.stem.WordNetLemmatizer()
remove_punctuation = dict((ord(punct), None) for punct in string.punctuation)

In [5]:
# Turn a chat message into a bag-of-words feature dict for the classifier
def fetch_features(chat) : 
    features = {}
    for word in nltk.word_tokenize(chat) :
        features['contains({})'.format(word.lower())] = True
    return features
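
To make the feature format concrete, here is a minimal sketch of the kind of dictionary that `fetch_features` builds. It swaps `nltk.word_tokenize` for a plain whitespace split (an assumption, so that the snippet runs without NLTK data), which means punctuation would stay attached to words here. `sketch_features` is a hypothetical name used only for this illustration.

```python
def sketch_features(chat):
    """Return bag-of-words presence features for one chat message.

    Assumption: whitespace splitting stands in for nltk.word_tokenize.
    """
    features = {}
    for word in chat.split():
        features['contains({})'.format(word.lower())] = True
    return features


example = sketch_features("Hi there hi")
print(example)
```

Repeated words collapse into a single key, so the classifier only sees whether a word occurs, not how often.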

In [6]:
def lemmatise(tokens) : 
    return [lem.lemmatize(token) for token in tokens]

In [7]:
def tokenise(text) :
    return lemmatise(nltk.word_tokenize(text.lower().translate(remove_punctuation)))

In [8]:
def greet(sentence) :
    for word in sentence.split() : 
        if word.lower() in inputs : 
            return random.choice(outputs)

In [9]:
def match(user_response) : 
    resp = ''
    q_list.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer = tokenise, stop_words = 'english')
    tfidf = TfidfVec.fit_transform(q_list)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf == 0) :
        resp = resp + "Sorry! I don't know the answer to this. Would you like to try again? Type bye to exit"
        return resp
    else : 
        resp_ids = qa_dict[idx]
        resp_str = ''
        s_id = resp_ids[0]
        end = resp_ids[1]
        while s_id < end : 
            resp_str = resp_str + " " + sent_tokens[s_id]
            s_id += 1
        resp = resp + resp_str
        return resp

In [10]:
chats = nltk.corpus.nps_chat.xml_posts()[:10000]
featuresets = [(fetch_features(chat.text), chat.get('class')) for chat in chats]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.MaxentClassifier.train(train_set)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.70805        0.050
             2          -1.23669        0.851
             3          -0.90765        0.884
             4          -0.73889        0.900
             5          -0.62801        0.911
             6          -0.54673        0.921
             7          -0.48390        0.926
             8          -0.43420        0.932
             9          -0.39454        0.936
            10          -0.36269        0.940
            11          -0.33682        0.944
            12          -0.31547        0.947
            13          -0.29753        0.949
            14          -0.28218        0.951
            15          -0.26887        0.953
            16          -0.25717        0.955
            17          -0.24678        0.956
            18          -0.23748        0.957
            19          -0.22909        0.958
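
The 10% holdout used in the cell above can be illustrated on a toy list. This is only a sketch: the `labelled` data are made up, and `holdout_split` and `accuracy` are hypothetical helpers that mirror the slicing in the cell and the usual definition of accuracy (the share of correct predictions). They are not NLTK functions.

```python
def holdout_split(items, test_fraction=0.1):
    """Split items so that the first test_fraction of them becomes the test set."""
    size = int(len(items) * test_fraction)
    return items[size:], items[:size]  # (train, test), same slicing as the cell


def accuracy(predict, labelled):
    """Fraction of (features, label) pairs whose label predict() gets right."""
    if not labelled:
        return 0.0
    correct = sum(1 for features, label in labelled if predict(features) == label)
    return correct / len(labelled)


# Made-up data: 20 labelled items, every fourth one tagged as a question.
labelled = [({'id': i}, 'whQuestion' if i % 4 == 0 else 'Statement') for i in range(20)]
train_set, test_set = holdout_split(labelled)
print(len(train_set), len(test_set))

# A trivial baseline that always answers 'Statement'.
baseline_accuracy = accuracy(lambda features: 'Statement', test_set)
print(baseline_accuracy)
```

Because the test set is taken from the front of the list rather than sampled at random, the score depends on the order of the corpus.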
 

In [11]:
# Read the question bank; the with-block closes the file once it has been read
with open(filename, 'r', errors = 'ignore') as ques_bank : 
    qb_text = ques_bank.read()
qb_text = qb_text.lower()
sent_tokens = nltk.sent_tokenize(qb_text)
word_tokens = nltk.word_tokenize(qb_text)
qa_dict = {}
q_list = []
s_count = 0

In [12]:
# Each sentence classified as a question starts a new entry in q_list / qa_dict
while s_count < len(sent_tokens) : 
    result = classifier.classify(fetch_features(sent_tokens[s_count]))
    if("question" in result.lower()) : 
        # Scan forward to the next question; checking the bound first avoids an
        # IndexError when the last sentence of the file is itself a question
        next_question_id = s_count + 1
        while(next_question_id < len(sent_tokens)) : 
            next_question = classifier.classify(fetch_features(sent_tokens[next_question_id]))
            if("question" in next_question.lower()) : 
                break
            next_question_id += 1
        q_list.append(sent_tokens[s_count])
        # The answer is sent_tokens[s_count + 1 : end], capped at four sentences
        end = min(next_question_id, s_count + 5)
        qa_dict.update({len(q_list) - 1:[s_count + 1, end]})
        s_count = next_question_id
    else : 
        s_count += 1

In [13]:
flag = True
print(colored("C-3PO :\nI am C-3PO, I have all the answers. If you want to exit, type bye", "blue"))
while flag : 
    print(colored("\nYOU :", 'red', attrs = ['bold']))
    u_input = input()
    u_input = u_input.lower()
    if(u_input != 'bye') : 
        # Call greet() once so the greeting printed is the one that was tested
        greeting = greet(u_input)
        print(colored("\nC-3PO :", 'blue', attrs = ['bold']))
        if(greeting is not None) : 
            print(greeting)
        else :
            print(colored(match(u_input).strip().capitalize(), 'blue'))
            # match() appended the query to the end of q_list; drop that entry
            # (remove() could delete a stored question with the same text)
            q_list.pop()
    else : 
        flag = False
        print(colored("\nC-3PO : Bye! Take care", 'blue', attrs = ['bold']))

C-3PO :
I am C-3PO, I have all the answers. If you want to exit, type bye

YOU :
bye

C-3PO : Bye! Take care


## 2. Training the model

In [14]:
chats = nltk.corpus.nps_chat.xml_posts()[:10000]
featuresets = [(fetch_features(chat.text), chat.get('class')) for chat in chats]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.MaxentClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


In [15]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
for post in posts[100:150] : 
    print(post.text)

PART
sho*
.ACTION keeps 10-19-20sUser115s place nice and warm.
hey any guys with cams wanna play?
.ACTION sits on 10-19-20sUser68's lap.
JOIN
JOIN
any guyz wanna chat 
hi there
boo, it's a female.
hey 10-19-20sUser126
PART
i wonna chat
PART
24 f nc single mom
where did everyone gooo?
sure 10-19-20sUser126
JOIN
what did you but on e-bay
i feel like im in the wrong room
yeee haw 10-19-20sUser30
im considering changing my nickname to "ihavehotnips"
JOIN
i don't want hot pics of a female, I can look in a mirror.
hi 10-19-20sUser64
wb 10-19-20sUser139
u should 10-19-20sUser44:)
PART
PART
JOIN
single dad here
JOIN
ty 10-19-20sUser68
PART
JOIN
Hi 10-19-20sUser139
PART
JOIN
hi 10-19-20sUser138
HAHAHA
yw 10-19-20sUser139
you should make it 'iamahotnip', 10-19-20sUser44
alright
hi 10-19-20sUser139.
you're fucking hot.
i thought of that!
hi 10-19-20sUser126, its so late
lmao
ahah "iamahotniplickme"
PART


In [16]:
chats = nltk.corpus.nps_chat.xml_posts()[:10000]
def fetch_features(chat) : 
    features = {}
    for word in nltk.word_tokenize(chat) :
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(fetch_features(chat.text), chat.get('class')) for chat in chats]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.MaxentClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))    


## 3. Build a question bank

In [17]:
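# Read the FAQ file; the with-block closes the file once it has been read.
# Assumption: the file is UTF-8 encoded. Stating the encoding explicitly avoids
# garbled quotation marks on platforms whose default encoding is different.
with open('crypto_faq.txt', 'r', encoding = 'utf-8', errors = 'ignore') as ques_bank:
    qb_text = ques_bank.read()
qb_text = qb_text.lower()
sent_tokens = nltk.sent_tokenize(qb_text)
word_tokens = nltk.word_tokenize(qb_text)
qa_dict = {}
q_list = []
s_count = 0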

## 4. Preprocessing and fetching

In [18]:
# Each sentence classified as a question starts a new entry in q_list / qa_dict
while s_count < len(sent_tokens):
    result = classifier.classify(fetch_features(sent_tokens[s_count]))
    if("question" in result.lower()) :
        # Scan forward to the next question; checking the bound first avoids an
        # IndexError when the last sentence of the file is itself a question
        next_question_id = s_count + 1
        while(next_question_id < len(sent_tokens)) :
            next_question = classifier.classify(fetch_features(sent_tokens[next_question_id]))
            if("question" in next_question.lower()) :
                break
            next_question_id += 1
        q_list.append(sent_tokens[s_count])
        # The answer is sent_tokens[s_count + 1 : end], capped at four sentences
        end = min(next_question_id, s_count + 5)
        qa_dict.update({len(q_list) - 1:[s_count + 1,end]})
        s_count = next_question_id
    else:
        s_count += 1
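
The loop above pairs each question with the sentences that follow it. The sketch below replays that idea on a tiny made-up text, with an explicit bounds check while scanning ahead. `is_question` is a hypothetical stand-in for the trained classifier (it only looks for a trailing question mark), and `build_qa_index` is an illustrative name, not part of the notebook.

```python
def is_question(sentence):
    """Stand-in for classifier.classify(...). Assumption: checks for '?' only."""
    return sentence.strip().endswith('?')


def build_qa_index(sentences, max_span=5):
    """Map each question's position in q_list to its answer range [start, end)."""
    q_list, qa_dict = [], {}
    i = 0
    while i < len(sentences):
        if is_question(sentences[i]):
            # Scan forward to the next question or the end of the text.
            nxt = i + 1
            while nxt < len(sentences) and not is_question(sentences[nxt]):
                nxt += 1
            q_list.append(sentences[i])
            # Cap the answer range so that end is at most i + max_span.
            qa_dict[len(q_list) - 1] = [i + 1, min(nxt, i + max_span)]
            i = nxt
        else:
            i += 1
    return q_list, qa_dict


sentences = [
    "what is bitcoin?",
    "bitcoin is a digital currency.",
    "it runs on a blockchain.",
    "what is ether?",
    "ether powers ethereum.",
]
q_list, qa_dict = build_qa_index(sentences)
print(q_list)
print(qa_dict)
```

Here the first question maps to sentences 1 and 2 and the second to sentence 4, which is the shape `match` expects when it looks up `qa_dict[idx]`.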

In [19]:
def lemmatise(tokens):
    return [lem.lemmatize(token) for token in tokens]
remove_punctuation = dict((ord(punct), None) for punct in string.punctuation)

In [20]:
def tokenise(text):
    return lemmatise(nltk.word_tokenize(text.lower().translate(remove_punctuation)))
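
As a quick illustration of the normalisation step, the snippet below lowercases a string and strips punctuation with the same kind of `str.translate` table. It is a sketch only: it splits on whitespace and skips lemmatisation (assumptions, so that it runs without NLTK or WordNet data), and `simple_tokenise` is a hypothetical name.

```python
import string

# Same mapping as in the notebook: every punctuation code point maps to None.
remove_punctuation = dict((ord(punct), None) for punct in string.punctuation)


def simple_tokenise(text):
    """Lowercase, drop punctuation, and split on whitespace (no lemmatisation)."""
    return text.lower().translate(remove_punctuation).split()


tokens = simple_tokenise("What is Bitcoin, exactly?")
print(tokens)
```

Note that apostrophes are punctuation too, so a word such as "don't" would come out as "dont".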

## 5. Deploy the model

In [21]:
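def match(user_response):
    resp = ''
    # Vectorise the stored questions together with the query (appended last)
    q_list.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer = tokenise, stop_words = 'english')
    tfidf = TfidfVec.fit_transform(q_list)
    # Similarity of the query to every entry; the top score is the query itself,
    # so the best stored question is the second-highest one
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf == 0):
        # No stored question shares a term with the query
        resp = resp + "Sorry! I don't know the answer to this. Would you like to try again? Type bye to exit"
        return resp
    else:
        # Join the answer sentences recorded for the matched question
        resp_ids = qa_dict[idx]
        resp_str = ''
        s_id = resp_ids[0]
        end = resp_ids[1]
        while s_id < end :
            resp_str = resp_str + " " + sent_tokens[s_id]
            s_id += 1
        resp = resp + resp_str
        return resp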
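
To show why `match` picks the second-highest similarity, here is a dependency-free sketch of the same retrieval idea. It uses raw term counts rather than TF-IDF weights and a whitespace tokenizer (both simplifications, not what `TfidfVectorizer` does), and `cosine` and `best_match` are hypothetical helpers. Because the query is appended to the list it is compared against, the top score belongs to the query itself, so the best stored question is the runner-up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(questions, query):
    """Return (index, score) of the stored question most similar to query."""
    docs = questions + [query]  # the query goes last, as in match()
    vectors = [Counter(doc.lower().split()) for doc in docs]
    scores = [cosine(vectors[-1], vec) for vec in vectors]
    # Rank indices by ascending score; for a non-empty query the last entry
    # is the query compared with itself, so take the one before it.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    idx = ranked[-2]
    return idx, scores[idx]


questions = ["what is bitcoin", "how is ether taxed", "what is a blockchain"]
idx, score = best_match(questions, "bitcoin")
print(idx, round(score, 3))
```

A returned score of 0 means no stored question shares a term with the query, which is the case `match` answers with its fallback message.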

In [None]:
flag = True
print(colored("C-3PO :\nI am C-3PO, I have all the answers. If you want to exit, type bye", 'blue', attrs = ['bold']))
while flag : 
    print(colored("\nYOU :", 'red', attrs = ['bold']))
    u_input = input()
    u_input = u_input.lower()
    if(u_input != 'bye') : 
        # Call greet() once so the greeting printed is the one that was tested
        greeting = greet(u_input)
        print(colored("\nC-3PO :", 'blue', attrs = ['bold']))
        if(greeting is not None) : 
            print(greeting)
        else :
            print(colored(match(u_input).strip().capitalize(), 'blue'))
            # match() appended the query to the end of q_list; drop that entry
            # (remove() could delete a stored question with the same text)
            q_list.pop()
    else : 
        flag = False
        print(colored("\nC-3PO : Bye! Take care", 'blue', attrs = ['bold']))

C-3PO :
I am C-3PO, I have all the answers. If you want to exit, type bye

YOU :
hi

C-3PO :
hi there

YOU :
bitcoin

C-3PO :
The bitcoin protocol can change the financial landscape we see today. the protocol can act as a currency, voting mechanism, global identification and reputation application, a micro-tipper, crowdfunding platform, initiate trusts, wills and contracts, decentralized domain names, future markets, and basically everything the financial system of today can handle plus so much more. the currency application is just the beginning of this evolution of world's finances. what happens if i lose my bitcoins?

YOU :
ethereum

C-3PO :
Ethereum is a decentralized smart contracts platform that is powered by a cryptocurrency called ether. a good starting point to learn more about its workings would be the “what is ethereum?” page.

YOU :
cryptocurrency

C-3PO :
A3. cryptocurrency is a type of virtual currency that uses cryptography to secure transactions that are digitally recorded on a distributed ledger, such as a blockchain. a transaction involving cryptocurrency that is recorded on a distributed ledger is referred to as an “on-chain” transaction; a transaction that is not recorded on the distributed ledger is referred to as an “off-chain” transaction. q4.

YOU :
tax

C-3PO :
A44. information on virtual currency is available at virtual currencies (irs.gov/virtual_currency). many questions about the tax treatment of virtual currency can be answered by referring to notice 2014-21 pdf and rev. rul.

YOU :
virtual currency

C-3PO :
A1. virtual currency is a digital representation of value, other than a representation of the u.s. dollar or a foreign currency (“real currency”), that functions as a unit of account, a store of value, and a medium of exchange. some virtual currencies are convertible, which means that they have an equivalent value in real currency or act as a substitute for real currency. the irs uses the term “virtual currency” in these faqs to describe the various types of convertible virtual currency that are used as a medium of exchange, such as digital currency and cryptocurrency.

YOU :
