# A simple chatbot using NLTK

A simple chatbot scans the input for keywords, then pulls the reply with the most matching keywords, or the most similar wording pattern, from a database.


In 1950, Alan Turing's famous article "Computing Machinery and Intelligence" was published, which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably—on the basis of the conversational content alone—between the program and a real human. 

ELIZA, published by Joseph Weizenbaum in 1966, seemed able to fool users into believing that they were conversing with a real human. It imitated the language of a psychotherapist using only 200 lines of code. [Link for ELIZA](https://www.google.com/url?q=http%3A%2F%2Fpsych.fullerton.edu%2Fmbirnbaum%2Fpsych101%2FEliza.htm%3Futm_source%3Dubisend.com%26utm_medium%3Dblog-link%26utm_campaign%3Dubisend)

The term "ChatterBot" was originally coined by Michael Mauldin (creator of the first Verbot, Julia) in 1994 to describe these conversational programs.

Other notable chatbots: PARRY (1972), A.L.I.C.E., and D.U.D.E.

In [1]:
import io                  #reading and writing files or handling streams of data.
import random              #generating random numbers and selecting random elements from a sequence.
import string              #provides functions and constants related to string operations
import warnings            #to control warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
#to convert a collection of text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.
from sklearn.metrics.pairwise import cosine_similarity
#to calculate the similarity between two vectors using cosine similarity.
warnings.filterwarnings('ignore')

TF-IDF helps to give importance to words that are frequent in a document but rare in the entire dataset.
It consists of two components:

**Term Frequency (TF):** Measures how often a word appears in a document. Here, each document is one sentence, and the corpus is the collection of all documents.

TF = Number of times a word appears in the document / Total number of words in the document

**Inverse Document Frequency (IDF):** Measures how unique a word is across all documents.

IDF = log(Total number of documents / Number of documents containing the word)

Rare words get higher IDF scores.
Common words (like "is", "the", "and") get lower IDF scores.
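
The TF and IDF definitions above can be checked by hand on a tiny corpus. The sketch below is only an illustration: the toy documents and the helper names `tf`, `idf`, and `tf_idf` are made up for this example, and it assumes the term occurs in at least one document. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math

# Toy corpus: each "document" is one sentence that is already tokenized.
docs = [
    ["chatbots", "answer", "questions"],
    ["chatbots", "help", "customers"],
    ["customers", "ask", "questions", "questions"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term).
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("questions", docs[2]))        # 2 of the 4 tokens
print(idf("help", docs))               # in 1 of 3 documents: rarer, higher IDF
print(idf("chatbots", docs))           # in 2 of 3 documents: more common, lower IDF
print(tf_idf("help", docs[1], docs))
```

A word that appeared in every document would get an IDF of log(1) = 0, which is why very common words contribute little to the final score.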

TF-IDF also has a weakness: it knows nothing about meaning, so related words such as "better" and "good" are treated as entirely different terms. Lemmatization helps here by mapping related word forms to a single dictionary form, using the WordNet dictionary that is already available through NLTK.

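Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Here d1 and d2 are the TF-IDF vectors of two documents, and ||d|| is the length (Euclidean norm) of a vector. For TF-IDF vectors, which have no negative entries, the result ranges from 0 (no shared terms) to 1 (same direction).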
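
To make the cosine similarity formula above concrete, here is a small pure-Python sketch. The function name `cosine_similarity_vec` and the sample vectors are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity`. Returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math

def cosine_similarity_vec(d1, d2):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for empty vectors (an assumption of this sketch)
    return dot / (norm1 * norm2)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, only longer
c = [0.0, 0.0, 3.0]   # shares no non-zero component with a

print(cosine_similarity_vec(a, b))  # close to 1: same direction
print(cosine_similarity_vec(a, c))  # 0.0: nothing in common
```

Because only the direction matters, a long sentence and a short sentence that use the same words in the same proportions still count as similar.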

In [2]:
import nltk

In [3]:
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # downloads the popular NLTK packages, including the punkt tokenizer and wordnet

True

In [4]:
# Read the whole corpus; the with-block closes the file automatically.
with open("chatbot.txt", "r", errors='ignore') as fh:
    data = fh.read()
data = data.lower()

The main issue with text data is that it is all strings, whereas machine learning algorithms need some sort of numerical feature vector in order to perform their task. As a first preprocessing step, the entire text is converted to lowercase (or uppercase) so that the algorithm does not treat the same word in different cases as different words.

Tokenization is the process of converting a normal text string into a list of tokens, i.e. the words that we actually want. A sentence tokenizer finds the list of sentences in a text, and a word tokenizer finds the list of words in a string.
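
As a rough picture of what the two tokenizers return, the sketch below uses regular expressions. This is a simplification and not how NLTK works internally: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helpers, and NLTK's `sent_tokenize` and `word_tokenize` handle abbreviations, contractions, and many other cases that these patterns miss.

```python
import re

def simple_sent_tokenize(text):
    # Split after ".", "!" or "?" when it is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    # A token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "A chatbot is a program. It simulates conversation!"
sentences = simple_sent_tokenize(text)
print(sentences)
print(simple_word_tokenize(sentences[0]))
```

The sentence list is what the bot searches for a reply, while the word list is what gets turned into TF-IDF features.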

Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, usually by removing suffixes. For example, stemming the words “Stems”, “Stemming”, and “Stemmed” gives a single word, “stem”.

Lemmatization: A slight variant of stemming is lemmatization. The major difference is that stemming can often create non-existent words, whereas lemmas are actual words. For example, “run” is the base form of words like “running” or “ran”, and the words “better” and “good” share the same lemma, so they are considered the same. The lemma depends on the word's part of speech (verb, noun, adjective); NLTK's WordNetLemmatizer treats a word as a noun unless a `pos` argument is given.
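
The difference between the two is easier to see with a deliberately naive stemmer. The function `naive_stem` below is a toy written for this illustration, not the Porter stemmer or any NLTK algorithm: it strips a few suffixes and undoes consonant doubling, nothing more.

```python
def naive_stem(word):
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling such as "stemm" -> "stem".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["stems", "stemmed", "stemming", "running", "studies", "ran"]:
    print(w, "->", naive_stem(w))
```

The regular forms collapse to "stem" and "run", but "studies" becomes the non-word "studie" and "ran" is left untouched. A lemmatizer avoids both problems by looking words up in a dictionary instead of cutting off suffixes.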

#### Tokenization

In [5]:
sentence_tokens = nltk.sent_tokenize(data)

In [6]:
word_tokens = nltk.word_tokenize(data)

#### Preprocessing

In [7]:
# WordNetLemmatizer reduces words to their base or dictionary form (lemma).
lemmer = nltk.stem.WordNetLemmatizer()

def lemtokens(tokens):
    # Takes a list of words (tokens) and returns the lemmatized tokens.
    return [lemmer.lemmatize(token) for token in tokens]

# Dictionary that maps each punctuation character's code point to None,
# so that str.translate() deletes those characters.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase the text, remove punctuation, split into words, then lemmatize.
    return lemtokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
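
The punctuation step can be tried on its own. The sketch below builds the same kind of translation table and applies it to a sample string; the helper name `normalize` is made up for this example, and it uses plain `str.split()` instead of `nltk.word_tokenize` and skips lemmatization so that it runs without any NLTK data.

```python
import string

# Map every punctuation character's code point to None so translate() drops it.
remove_punct = {ord(ch): None for ch in string.punctuation}

def normalize(text):
    # Lowercase, delete punctuation, then split on whitespace.
    return text.lower().translate(remove_punct).split()

print(normalize("Hello, World! What's a chat-bot?"))
```

Note that deleting punctuation also joins the pieces around apostrophes and hyphens, so "What's" becomes "whats" and "chat-bot" becomes "chatbot".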

In [8]:
#Example
lemmer.lemmatize('running', pos='v')  


'run'

In [9]:
GREETING_INPUTS = ('hello','hi','greetings', 'hey','holla')
GREETING_RESPONSES = ["hi","hello","*nods*","hey"," I am glad! You are talking."]
def greetings(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)


In [10]:
#Example
greetings("Hey")

'hey'

In [11]:
def response(user_response):
    bot_response = ''  # stores the reply
    # The user's input is added to sentence_tokens, which contains all known sentences.
    sentence_tokens.append(user_response)
    # TfidfVectorizer converts text into numerical vectors based on word importance.
    # tokenizer=LemNormalize: uses LemNormalize() to preprocess text (punctuation removal, tokenization, lemmatization).
    # stop_words='english': removes common words like "is", "the", "and" to focus on meaningful words.
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    # fit_transform(sentence_tokens): computes TF-IDF weights for all sentences.
    tfidf = TfidfVec.fit_transform(sentence_tokens)
    # tfidf[-1] is the TF-IDF vector of the user's input. cosine_similarity compares it
    # with every sentence in sentence_tokens; higher values mean more similar sentences.
    vals = cosine_similarity(tfidf[-1], tfidf)
    # argsort() orders the indices by ascending similarity. [-1] is the user's own input
    # (always the most similar to itself), so [-2] is the most similar known sentence.
    idx = vals.argsort()[0][-2]
    # Flatten to 1D, sort ascending, and take the second-highest similarity score.
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # No known sentence shares any term with the input.
        bot_response = bot_response + "I am sorry! I don't understand you"
        return bot_response
    else:
        bot_response = bot_response + sentence_tokens[idx]
        return bot_response
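
The "second-highest score" logic in `response` can be isolated with plain Python lists. In this sketch the name `best_match` and the similarity values are invented for illustration; it assumes, as in the function above, that the last entry is the user's input compared with itself and that there are at least two entries.

```python
def best_match(similarities):
    # Indices ordered by ascending similarity, like argsort().
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    idx = order[-2]              # order[-1] is the query itself
    score = similarities[idx]
    if score == 0:
        return None, 0.0         # nothing in the corpus overlaps with the query
    return idx, score

print(best_match([0.0, 0.42, 0.13, 1.0]))  # sentence 1 is the closest match
print(best_match([0.0, 0.0, 1.0]))         # no overlap, so no match
```

A score of exactly 0 means the input shares no term with any known sentence, which is the case where the bot answers that it does not understand.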


In [12]:
flag = True  # controls the loop: the chatbot keeps running while this is True
print("BOT: My name is bot. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("BOT: You are welcome..")
        else:
            # Call greetings() once so the check and the printed reply match.
            greeting = greetings(user_response)
            if greeting is not None:
                print("BOT: " + greeting)
            else:
                print("BOT: ", end="")
                print(response(user_response))
                # response() appended the input to sentence_tokens; remove it again.
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print("BOT: Bye! take care..")

BOT: My name is bot. I will answer your queries about Chatbots. If you want to exit, type Bye!


 Hello


BOT:  I am glad! You are talking.


 Benefits of using chatbots


BOT: benefits of using chatbots
improved customer service: provide 24/7 support, answer frequently asked questions, and assist customers quickly.


 Challenges of using chatbots


BOT: challenges of using chatbots
limited understanding: may struggle with complex or nuanced queries.


 Bye


BOT: Bye! take care..
