# Chatbot Using Python NLTK

A chatbot is a piece of Artificial Intelligence software that simulates a conversation with a user in natural language through messaging applications, websites, mobile apps, and so on. Chatbots are used across many industries for different purposes.

Here I build a simple chatbot using the NLTK (Natural Language Toolkit) library.

In [20]:
# Importing necessary libraries
import pandas as pd
import nltk
import numpy as np
import re 
from nltk.stem import wordnet # To perform lemmatization
from sklearn.feature_extraction.text import CountVectorizer # To perform bow
from sklearn.feature_extraction.text import TfidfVectorizer # To perform tfidf
from nltk import pos_tag # For parts of speech
from sklearn.metrics import pairwise_distances # To perform cosine similarity
from nltk import word_tokenize # To create tokens
from nltk.corpus import stopwords # for stop words

In [21]:
# Importing the dataset
df = pd.read_excel('dialog_talk_agent.xlsx')
df.head()

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,
3,Describe yourself,
4,tell me about yourself,


In [22]:
df.shape[0] # returns the number of rows in dataset

1592

The data contains 1592 data points and two columns: `Context`, which can be read as the query, and `Text Response`, which is the response to that query. There are null values in the dataset. If we open the dataset in Excel, we can see that the data is organized in clusters: questions of the same type are grouped together, followed by the next group of similar questions.

The null values belong to questions of the same type, whose responses can be almost identical: within each group, the response is given only for the first question and the rest are left null. So we can use **ffill()**, which replaces each null value with the previous non-null response, as below.
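To make the effect of forward filling concrete, here is a minimal sketch on a hypothetical three-row frame. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: only the first question of the first group has a response.
toy = pd.DataFrame({
    "Context": ["Define yourself", "Describe yourself", "can we chat"],
    "Text Response": ["I can help you", None, "Talking is what I do best."],
})

# ffill() copies the last non-null value downward into each null cell.
filled = toy.ffill()
print(filled["Text Response"].tolist())
```

The second row inherits the response of the first row, while the third row keeps its own response because it was never null.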

In [23]:
df.ffill(axis = 0,inplace=True) # fills the null value with the previous value.
df

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,I can help you work smarter instead of harder
3,Describe yourself,I can help you work smarter instead of harder
4,tell me about yourself,I can help you work smarter instead of harder
...,...,...
1587,can we chat,Talking is what I do best.
1588,I'll be back in a few minutes,I'll be waiting.
1589,I'll be back,All right. I'll be here.
1590,I'll get back to you in a moment,Till next time.


Let’s get into our first step, text normalization, where we convert the data to lower case, remove special characters, and then perform lemmatization.

Let us create a function that converts given text to lower case and removes special characters and numbers.

In [24]:
df1=df.head(10) # copy of first ten rows of dataset

In [25]:
# function that converts text to lower case and replaces special characters and numbers with spaces
def step1(x):
    for i in x:
        a=str(i).lower()
        p=re.sub(r'[^a-z]',' ',a) # anything that is not a lower-case letter becomes a space
        print(p)

In [26]:
step1(df1['Context'])

tell me about your personality
i want to know you better
define yourself
describe yourself
tell me about yourself
all about you
tell me some stuff about you
talk some stuff about you
talk about yourself
about yourself


We can see that our text is clean. The next step is word tokenizing, the process of converting a text string into a list of tokens.

In [29]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [30]:
# word tokenizing
s = 'tell me about your personality'
words = word_tokenize(s)

print(words)

['tell', 'me', 'about', 'your', 'personality']


The **pos_tag** function returns the part of speech of each token. The lemmatizer uses this part of speech to convert the token to its root word, as below.

In [32]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [33]:
lemma = wordnet.WordNetLemmatizer() # initializing lemmatizer
lemma.lemmatize('absorbed', pos = 'v')

'absorb'

In [36]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [37]:
pos_tag(nltk.word_tokenize(s),tagset = None) # returns the parts of speech of every word

[('tell', 'VB'),
 ('me', 'PRP'),
 ('about', 'IN'),
 ('your', 'PRP$'),
 ('personality', 'NN')]

We shall now create a function that performs all the steps mentioned above.

In [38]:
# function that performs text normalization steps

def text_normalization(text):
    text=str(text).lower() # text to lower case
    spl_char_text=re.sub(r'[^ a-z]','',text) # removing special characters
    tokens=nltk.word_tokenize(spl_char_text) # word tokenizing
    lema=wordnet.WordNetLemmatizer() # initializing lemmatization
    tags_list=pos_tag(tokens,tagset=None) # parts of speech
    lema_words=[]   # empty list 
    for token,pos_token in tags_list:
        if pos_token.startswith('V'):  # Verb
            pos_val='v'
        elif pos_token.startswith('J'): # Adjective
            pos_val='a'
        elif pos_token.startswith('R'): # Adverb
            pos_val='r'
        else:
            pos_val='n' # Noun
        lema_token=lema.lemmatize(token,pos_val) # performing lemmatization
        lema_words.append(lema_token) # appending the lemmatized token into a list
    
    return " ".join(lema_words) # returns the lemmatized tokens as a sentence 

Let’s check our function and apply it to the dataset.

In [39]:
text_normalization('telling you some stuff about me')

'tell you some stuff about me'

In [40]:
df['lemmatized_text']=df['Context'].apply(text_normalization) # applying the function to the dataset to get clean text
df.tail(15)

Unnamed: 0,Context,Text Response,lemmatized_text
1577,I need to talk to you,Good conversation really makes my day.,i need to talk to you
1578,I want to speak with you,I'm always here to lend an ear.,i want to speak with you
1579,let's have a discussion,Talking is what I do best.,let have a discussion
1580,I just want to talk,Talking is what I do best.,i just want to talk
1581,let's discuss something,Talking is what I do best.,let discuss something
1582,can I speak,Talking is what I do best.,can i speak
1583,can we talk,Talking is what I do best.,can we talk
1584,let's talk,Talking is what I do best.,let talk
1585,I want to talk to you,Talking is what I do best.,i want to talk to you
1586,can we chat,Talking is what I do best.,can we chat


We can see that our function worked well, so we applied it to our data. Our next step is to represent each text as a numeric vector, so that texts containing the same words get similar representations. We use two models for this: bag of words (bow) and tf-idf (Term Frequency-Inverse Document Frequency).

## Using bow

The bag-of-words model represents a text by the number of times each vocabulary word occurs in it, ignoring word order. For example, if our vocabulary is {Playing, is, love} and we want to vectorize the text “Playing football is love”, we get one entry per vocabulary word: (1, 1, 1). The word “football” is not in the vocabulary, so it is ignored.
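A bag-of-words vector has one entry per vocabulary word, so a three-word vocabulary gives a three-entry vector, and a word outside the vocabulary is simply ignored. The following is a minimal plain-Python sketch of that idea. The helper name `bag_of_words` is hypothetical, and the whitespace tokenization is a simplification of what `CountVectorizer` does.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["playing", "is", "love"]  # lower-cased version of the example vocabulary
vector = bag_of_words("Playing football is love", vocabulary)
print(vector)  # [1, 1, 1]
```

A word that occurs twice gets a count of 2, which is why the representation is a vector of counts rather than a simple present/absent flag.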

## bag of words

In [44]:
cv = CountVectorizer() # initializing the count vectorizer
X = cv.fit_transform(df['lemmatized_text']).toarray()

In [45]:
# returns all the unique words from the data

features = cv.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
df_bow = pd.DataFrame(X, columns = features)
df_bow.head()

Unnamed: 0,abort,about,absolutely,abysmal,actually,adore,advice,advise,affirmative,afraid,...,yeh,yep,yes,yet,you,your,youre,yours,yourself,yup
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Consider row 0 of the dataset: bow records a count of 1 for every word present in the text and 0 for the rest.

Stop words are extremely common words that would appear to be of little value in matching a user’s need and hence they are excluded from the vocabulary entirely. Below are the predefined stop words.

In [46]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [47]:
# all the stop words we have 

stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let us consider an example and try getting a response to the query.

In [48]:
Question ='Will you help me and tell me about yourself more' # considering an example query

In [49]:
# checking for stop words

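# note: the check is case-sensitive and the stop-word list is lower case, so 'Will' is kept
Q=[]
a=Question.split()
for i in a:
    if i not in stop:
        Q.append(i)
b=" ".join(Q) # joining the remaining words back into a sentence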
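The membership test in the loop above is case-sensitive, while the NLTK stop-word list is all lower case, so a capitalized word such as "Will" is not removed. The sketch below shows the difference that lower-casing before the lookup would make. The function name `remove_stop_words` is hypothetical and `stop_subset` is a small hand-picked subset of the stop-word list, used only for illustration.

```python
# Small hand-picked subset of the NLTK English stop-word list, for illustration only.
stop_subset = {"will", "you", "me", "and", "about", "yourself", "more"}

def remove_stop_words(text, stop_words, lowercase=False):
    """Drop stop words; optionally lower-case tokens before the lookup."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return " ".join(t for t in tokens if t not in stop_words)

question = "Will you help me and tell me about yourself more"
print(remove_stop_words(question, stop_subset))                  # Will help tell
print(remove_stop_words(question, stop_subset, lowercase=True))  # help tell
```

Whether "will" should count as a stop word is a design choice. Lower-casing first changes which words reach the vectorizer and therefore the similarity values.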

In [50]:
Question_lemma = text_normalization(b) # applying the function that we created for text normalizing
Question_bow = cv.transform([Question_lemma]).toarray() # applying bow

In [52]:
Question_bow

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

From the above, we can see that we took the question ‘Will you help me and tell me about yourself more’, removed the stop words, performed text normalization, and then applied bow to it. Now, to get the related response, we find the cosine similarity between the question and the lemmatized text we have.

## Cosine Similarity

Cosine similarity is a measure of similarity between two vectors. It is computed by taking the dot product of the two vectors and dividing it by the product of their norms.

#### Cosine Similarity(a, b) = Dot product(a, b) / (||a|| * ||b||)
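The formula above can be sketched in plain Python as follows. The function name and the two example vectors are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the formula defines.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# A five-word text and a three-word text that share exactly one word.
a = [1, 1, 1, 1, 1, 0, 0]
b = [0, 0, 0, 0, 1, 1, 1]
result = cosine_similarity(a, b)
print(round(result, 6))  # 0.258199, i.e. 1 / sqrt(5 * 3)
```

With 0/1 bag-of-words vectors, the dot product is the number of shared words and each norm is the square root of the number of words, so longer texts need more shared words to reach the same score.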

In [53]:
# cosine similarity for the above question we considered.

cosine_value = 1- pairwise_distances(df_bow, Question_bow, metric = 'cosine' )
(cosine_value)

array([[0.25819889],
       [0.        ],
       [0.        ],
       ...,
       [0.        ],
       [0.        ],
       [0.        ]])

In [54]:
df['similarity_bow']=cosine_value # creating a new column 

In [55]:
df_simi = pd.DataFrame(df, columns=['Text Response','similarity_bow']) # taking similarity value of responses for the question we took
df_simi 

Unnamed: 0,Text Response,similarity_bow
0,Just think of me as the ace up your sleeve.,0.258199
1,I can help you work smarter instead of harder,0.000000
2,I can help you work smarter instead of harder,0.000000
3,I can help you work smarter instead of harder,0.000000
4,I can help you work smarter instead of harder,0.288675
...,...,...
1587,Talking is what I do best.,0.000000
1588,I'll be waiting.,0.000000
1589,All right. I'll be here.,0.000000
1590,Till next time.,0.000000


In [56]:
df_simi_sort = df_simi.sort_values(by='similarity_bow', ascending=False) # sorting the values
df_simi_sort.head()

Unnamed: 0,Text Response,similarity_bow
211,I'm glad to help. What can I do for you?,0.57735
194,I'm glad to help. What can I do for you?,0.57735
184,I'm glad to help. What can I do for you?,0.408248
186,I'm glad to help. What can I do for you?,0.408248
200,I'm glad to help. What can I do for you?,0.408248


In [57]:
threshold = 0.2 # keeping only the responses whose similarity is greater than 0.2
df_threshold = df_simi_sort[df_simi_sort['similarity_bow'] > threshold] 
df_threshold

Unnamed: 0,Text Response,similarity_bow
211,I'm glad to help. What can I do for you?,0.57735
194,I'm glad to help. What can I do for you?,0.57735
184,I'm glad to help. What can I do for you?,0.408248
186,I'm glad to help. What can I do for you?,0.408248
200,I'm glad to help. What can I do for you?,0.408248
219,I'm glad to help. What can I do for you?,0.333333
728,It's my pleasure to help.,0.333333
188,I'm glad to help. What can I do for you?,0.333333
190,I'm glad to help. What can I do for you?,0.333333
191,I'm glad to help. What can I do for you?,0.333333


- Finally, for the question 'Will you help me and tell me about yourself more', the above are the responses we got using bow, along with their similarity values. We consider the response with the highest similarity.

In [58]:
index_value = cosine_value.argmax() # returns the index number of highest value
index_value

194

We can see that at index 194 we have the highest similarity text for the query we considered. Let us print the text at that position and see whether it is related or not.

In [59]:
(Question)

'Will you help me and tell me about yourself more'

In [60]:
df['Text Response'].loc[index_value] # The text at the above index becomes the response for the question

"I'm glad to help. What can I do for you?"

We can see that our model worked pretty well.
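The threshold computed earlier is only used to inspect the candidates; the response itself is picked with `argmax`, which returns an answer even when every similarity is close to zero. One possible way to combine the two is sketched below in plain Python. The function name `pick_response` and the fallback sentence are assumptions made for illustration and are not part of the notebook.

```python
def pick_response(similarities, responses, threshold=0.2,
                  fallback="Sorry, I did not understand that."):
    """Return the response with the highest similarity, or a fallback
    message when even the best score does not exceed the threshold."""
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    if similarities[best_index] > threshold:
        return responses[best_index]
    return fallback

responses = ["Hey!", "Talking is what I do best.", "Bye."]
print(pick_response([0.0, 0.58, 0.33], responses))  # Talking is what I do best.
print(pick_response([0.0, 0.10, 0.05], responses))  # Sorry, I did not understand that.
```

Like `argmax`, this returns the first of several equally good candidates.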

## Using tf-idf :

tf is **Term Frequency**, which scores how often a word occurs in the current document, and idf is **Inverse Document Frequency**, which scores how rare the word is across all documents. Here a document is a single text, say row 0 or row 1, and the documents are all the rows in the dataset.
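To see how the two scores combine, here is a minimal plain-Python sketch. It assumes the smoothed idf and L2 normalization that scikit-learn documents as the defaults of `TfidfVectorizer`, but it tokenizes on whitespace only, so it is an approximation and not a replacement for the library. The function name `tfidf_vectors` and the three example documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalized tf-idf vectors for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf (assumed to match the scikit-learn default).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors

docs = ["tell me about yourself", "describe yourself", "can we chat"]
vocabulary, vectors = tfidf_vectors(docs)
first = dict(zip(vocabulary, vectors[0]))
# "yourself" occurs in two documents and "tell" in only one, so "tell" weighs more.
print(first["tell"] > first["yourself"])  # True
```

Words that appear in many rows therefore contribute less to the similarity than words that are specific to a few rows.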

In [64]:
# using tf-idf

tfidf=TfidfVectorizer() # initializing tf-idf
x_tfidf=tfidf.fit_transform(df['lemmatized_text']).toarray() # transforming the data into array

In [65]:
# returns all the unique words from the data with a score for each word

df_tfidf=pd.DataFrame(x_tfidf,columns=tfidf.get_feature_names_out()) # scikit-learn >= 1.0; older versions use get_feature_names()
df_tfidf.head()

Unnamed: 0,abort,about,absolutely,abysmal,actually,adore,advice,advise,affirmative,afraid,...,yeh,yep,yes,yet,you,your,youre,yours,yourself,yup
0,0.0,0.407572,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.330555,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.218768,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.64179,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.64179,0.0
4,0.0,0.45379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.608937,0.0


The above are the values obtained using tf-idf. Now, using cosine similarity, let us find the response that we get with tf-idf.

## similarity

In [68]:
Question1 ='Tell me about yourself.'

In [69]:
Question_lemma1 = text_normalization(Question1)
Question_tfidf = tfidf.transform([Question_lemma1]).toarray() # applying tf-idf

In [70]:
cos=1-pairwise_distances(df_tfidf,Question_tfidf,metric='cosine')  # applying cosine similarity
cos

array([[0.56511191],
       [0.        ],
       [0.39080996],
       ...,
       [0.        ],
       [0.        ],
       [0.        ]])

In [71]:
df['similarity_tfidf']=cos # creating a new column 
df_simi_tfidf = pd.DataFrame(df, columns=['Text Response','similarity_tfidf']) # taking similarity value of responses for the question we took
df_simi_tfidf 

Unnamed: 0,Text Response,similarity_tfidf
0,Just think of me as the ace up your sleeve.,0.565112
1,I can help you work smarter instead of harder,0.000000
2,I can help you work smarter instead of harder,0.390810
3,I can help you work smarter instead of harder,0.390810
4,I can help you work smarter instead of harder,1.000000
...,...,...
1587,Talking is what I do best.,0.000000
1588,I'll be waiting.,0.000000
1589,All right. I'll be here.,0.000000
1590,Till next time.,0.000000


In [72]:
df_simi_tfidf_sort = df_simi_tfidf.sort_values(by='similarity_tfidf', ascending=False) # sorting the values
df_simi_tfidf_sort.head(10)

Unnamed: 0,Text Response,similarity_tfidf
4,I can help you work smarter instead of harder,1.0
16,I can help you work smarter instead of harder,0.771758
9,I can help you work smarter instead of harder,0.759428
8,I can help you work smarter instead of harder,0.651909
379,I should get one. It's all work and no play la...,0.594479
500,The virtual world is my playground. I'm always...,0.590474
0,Just think of me as the ace up your sleeve.,0.565112
6,I can help you work smarter instead of harder,0.514553
48,I'm not programmed for that exact question. Tr...,0.445403
24,"I'm a relatively new bot, but I'm wise beyond ...",0.434832


In [73]:
threshold = 0.2 # keeping only the responses whose similarity is greater than 0.2
df_threshold = df_simi_tfidf_sort[df_simi_tfidf_sort['similarity_tfidf'] > threshold] 
df_threshold

Unnamed: 0,Text Response,similarity_tfidf
4,I can help you work smarter instead of harder,1.0
16,I can help you work smarter instead of harder,0.771758
9,I can help you work smarter instead of harder,0.759428
8,I can help you work smarter instead of harder,0.651909
379,I should get one. It's all work and no play la...,0.594479
500,The virtual world is my playground. I'm always...,0.590474
0,Just think of me as the ace up your sleeve.,0.565112
6,I can help you work smarter instead of harder,0.514553
48,I'm not programmed for that exact question. Tr...,0.445403
24,"I'm a relatively new bot, but I'm wise beyond ...",0.434832


- Using tf-idf for the question 'Tell me about yourself.', the above are the responses we got, along with their similarity values. We consider the response with the highest similarity.

In [74]:
index_value1 = cos.argmax() # returns the index number of highest value
index_value1

4

At index 4 we have the text with the highest similarity to our question. Let us see the response to our question.

In [75]:
Question1

'Tell me about yourself.'

In [76]:
df['Text Response'].loc[index_value1]  # returns the text at that index

'I can help you work smarter instead of harder'

Using tf-idf on this second question, we got a different response, which is also a pretty good response to the question.

Now let’s build a function that returns the response to a query using Bag of Words and tf-idf. It is very simple: we just have to combine all the topics we saw earlier in this article.

## Model Using Bag of Words

In [77]:
# Function that removes stop words and lemmatizes the remaining tokens

def stopword_(text):   
    tag_list=pos_tag(nltk.word_tokenize(text),tagset=None)
    stop=stopwords.words('english')
    lema=wordnet.WordNetLemmatizer()
    lema_word=[]
    for token,pos_token in tag_list:
        if token in stop:
            continue
        if pos_token.startswith('V'):
            pos_val='v'
        elif pos_token.startswith('J'):
            pos_val='a'
        elif pos_token.startswith('R'):
            pos_val='r'
        else:
            pos_val='n'
        lema_token=lema.lemmatize(token,pos_val)
        lema_word.append(lema_token)
    return " ".join(lema_word) 

In [78]:
# defining a function that returns response to query using bow

def chat_bow(text):
    s=stopword_(text)
    lemma=text_normalization(s) # calling the function to perform text normalization
    bow=cv.transform([lemma]).toarray() # applying bow
    cosine_value = 1- pairwise_distances(df_bow,bow, metric = 'cosine' )
    index_value=cosine_value.argmax() # getting index value 
    return df['Text Response'].loc[index_value]

Let’s see some output responses for different queries.

In [79]:
chat_bow('hi there')

'Hey!'

In [80]:
chat_bow('Your are amazing')

'Terrific!'

In [81]:
chat_bow('i miss you')

"I've been right here all along!"

## Model Using tf-idf

In [82]:
# defining a function that returns response to query using tf-idf

def chat_tfidf(text):
    lemma=text_normalization(text) # calling the function to perform text normalization
    tf=tfidf.transform([lemma]).toarray() # applying tf-idf
    cos=1-pairwise_distances(df_tfidf,tf,metric='cosine') # applying cosine similarity
    index_value=cos.argmax() # getting index value 
    return df['Text Response'].loc[index_value]

Let’s see some output responses for different queries.

In [83]:
chat_tfidf('hi')

'Hey!'

In [84]:
chat_tfidf('how are you')

'Lovely, thanks.'

In [85]:
chat_tfidf('i love you')

"That's great to hear."

In [86]:
chat_tfidf('thanks for your support!')

"It's my pleasure to help."

In [87]:
chat_tfidf('will you reply accurately?')

"Oh, don't give up on me!"

In [88]:
chat_tfidf('will you marry me?')

'In the virtual sense that I can, sure.'

In [89]:
chat_tfidf('i miss you and i love you')

"I've been right here all along!"

In [90]:
chat_tfidf('ask sravya to read')

"Oops. Sorry about that. I'm still learning."

In [91]:
chat_tfidf('you are amazing and hope to see u soon.')

'Bye.'

## Conclusion:

The model we built does not learn anything: it only retrieves the stored response whose question is most similar to the query. Still, it responded pretty well.

***