**Coding a simple chatbot from scratch in Python using NLTK**

---





At the most basic level, a chatbot is a computer program that simulates and processes human conversation (written or spoken), allowing people to interact with digital devices as if they were communicating with a real person. Chatbots can be as simple as a basic program that answers a simple query with a single-line response, or as sophisticated as a digital assistant that learns and evolves to deliver increasing levels of personalization as it gathers and processes information.





**How do chatbots work?** 

---



Driven by AI, automated rules, natural language processing (NLP), and machine learning (ML), chatbots process data to respond to requests of all kinds.
There are two main types of chatbots.

* **`Task-oriented (declarative) chatbots`** are single-purpose programs that focus on performing one function. They use rules, NLP, and very little ML to generate automated but conversational responses to user requests. Interactions with these chatbots are highly specific and structured, which makes them best suited to support and service functions: think of a robust, interactive FAQ. Task-oriented chatbots can handle common questions, such as a query about business hours, or simple transactions that do not involve many variables. Although they use NLP so that end users can interact with them conversationally, their capabilities are fairly basic. These are the most commonly used chatbots today.
* **`Data-driven and predictive (conversational) chatbots`**, often referred to as virtual assistants or digital assistants, are far more sophisticated, interactive, and personalized than task-oriented chatbots. These chatbots are context-aware and leverage natural language understanding (NLU), NLP, and ML to learn as they go. They apply predictive intelligence and analytics to enable personalization based on user profiles and past user behavior. Digital assistants can learn a user's preferences over time, make recommendations, and even anticipate needs. In addition to monitoring data and intent, they can initiate conversations. Apple's Siri and Amazon's Alexa are examples of consumer-oriented, data-driven, predictive chatbots. Advanced digital assistants can also connect several single-purpose chatbots under one roof, pull different information from each of them, and combine that information to perform a task while maintaining context, so that the chatbot never becomes "confused" about the user's intent.

**Importing libraries**

In [None]:
import io
import random
import string  # to process standard Python strings
import warnings

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

**Installing NLTK Packages**

In [None]:
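import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
nltk.download('punkt_tab', quiet=True) # assumption: recent NLTK releases read the tokenizer data from 'punkt_tab'
nltk.download('punkt') # first-time use only
nltk.download('wordnet') # first-time use only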

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Reading in the corpus**

---
The corpus is the text file from which our chatbot draws its answers to the questions.


*   Copy and paste the text of any Wikipedia page into a plain-text file named 'chatbot.txt' in the notebook's working directory.
*   I am using my data science notes file, so the bot gives data science answers.




In [None]:
# read the whole corpus; the with-block closes the file automatically
with open('chatbot.txt', 'r', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # converts to lowercase

The main issue with text data is that it is all in text format (strings), whereas machine learning algorithms need some sort of numerical feature vector to perform their task. So before we start any NLP project we need to pre-process the text to make it suitable to work with. Basic text pre-processing includes:

* Converting the entire text into `uppercase` or `lowercase`, so that the algorithm does not treat the same word in different cases as different words.

* `Tokenization`: Tokenization is the term used to describe the process of converting normal text strings into a list of tokens, i.e. the words that we actually want. A sentence tokenizer can be used to find the list of sentences, and a word tokenizer can be used to find the list of words in a string.

* `Stemming`: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. For example, if we were to stem the words “Stems”, “Stemming”, and “Stemmed”, the result would be a single word, “stem”.

* `Lemmatization`: A slight variant of stemming is lemmatization. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. So a stem, meaning the word you end up with, is not necessarily something you can look up in a dictionary, but you can look up a lemma. For example, “run” is the base form of words like “running” and “ran”, and the words “better” and “good” share the same lemma, so they are considered the same.
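
To make the stemming bullet concrete, here is a minimal sketch using NLTK's `PorterStemmer`. This stemmer is an assumption for illustration only (the notebook itself uses the WordNet lemmatizer), and it was chosen because it needs no extra data download. The word "studies" is an added example that shows how a stem can be a non-word.

```python
from nltk.stem import PorterStemmer  # rule-based stemmer, no corpus download required

stemmer = PorterStemmer()

# The first three words come from the text above; "studies" is an extra illustration.
words = ["Stems", "Stemming", "Stemmed", "studies"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

The first three words collapse to "stem", while "studies" becomes "studi", which is not a dictionary word. A lemmatizer would be expected to return "study" instead. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "running" are only reduced when `pos='v'` is given.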

**Tokenisation**

In [None]:
sent_tokens = nltk.sent_tokenize(raw) # converts to list of sentences 
word_tokens = nltk.word_tokenize(raw) # converts to list of words

**Checking the file for sentences**

In [None]:
sent_tokens[0:2]

['about jarvis\ni am an a.i system created by mr.tony stark but after ultron tried to destroy me, i was transfered to mr.asif khan for further upgrades.',
 'data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.']

**Checking the file for words**

In [None]:
word_tokens[0:2]

['about', 'jarvis']

**Preprocessing**

---


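We shall now define a function called LemTokens, which takes a list of tokens and returns their lemmas, and a function called LemNormalize, which lowercases a text, strips its punctuation, tokenizes it, and lemmatizes the resulting tokens.

In [None]:
lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    # lemmatize each token (the lemmatizer treats tokens as nouns by default)
    return [lemmer.lemmatize(token) for token in tokens]

# translation table that maps every punctuation character to None, i.e. deletes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # lowercase, strip punctuation, split into word tokens, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))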
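
The punctuation-stripping step inside LemNormalize can be tried on its own, without any NLTK data. The short sketch below rebuilds the same translation table and applies it to a made-up sentence (the sample text is hypothetical, not taken from the corpus).

```python
import string

# Same construction as in the cell above: map each punctuation code point to None.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

sample = "What is Naive Bayes?! It's a classifier."  # hypothetical input
cleaned = sample.lower().translate(remove_punct_dict)
print(cleaned)  # what is naive bayes its a classifier
```

Because the apostrophe is also punctuation, "it's" becomes "its"; that is harmless here since the vectorizer later drops such common words as stop words.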

**Keyword matching**

---


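Next, we shall define a greeting function for the bot, i.e. if a user's input is a greeting, the bot shall return a greeting response.

In [None]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    """Return a random greeting if the input contains a greeting keyword, otherwise None."""
    text = sentence.lower()
    # multi-word greetings such as "what's up" can only match against the whole input
    if any(' ' in phrase and phrase in text for phrase in GREETING_INPUTS):
        return random.choice(GREETING_RESPONSES)
    for word in text.split():
        # strip surrounding punctuation so that "hi!" still counts as "hi"
        if word.strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None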

**Generating Response**

---


**Bag of Words**

After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.

* A measure of the presence of known words.

Why is it called a “bag” of words? Because any information about the order or structure of words in the document is discarded, and the model is only concerned with whether the known words occur in the document, not where they occur.

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
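
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch of a binary bag-of-words vector; the helper name `bag_of_words` is made up for illustration and is not part of the notebook.

```python
def bag_of_words(text, vocabulary):
    """Return a binary presence vector for `text` over an ordered vocabulary."""
    words = set(text.lower().split())
    # 1 if the vocabulary term occurs anywhere in the text, 0 otherwise
    return tuple(1 if term.lower() in words else 0 for term in vocabulary)

vocabulary = ["Learning", "is", "the", "not", "great"]
vector = bag_of_words("Learning is great", vocabulary)
print(vector)  # (1, 1, 0, 0, 1)

# Word order is discarded: a shuffled sentence gives the same vector.
shuffled = bag_of_words("great is Learning", vocabulary)
```

The shuffled sentence yields the same vector, which is exactly the "bag" property described above.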

**`TF-IDF Approach`**

A problem with the Bag of Words approach is that highly frequent words start to dominate the document (i.e. they get larger scores), even though they may not carry as much “informational content”. It also gives more weight to longer documents than to shorter ones.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

`Term Frequency: is a scoring of the frequency of the word in the current document.`

TF = (Number of times term t appears in a document) / (Number of terms in the document)

`Inverse Document Frequency: is a scoring of how rare the word is across documents.`

IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents in which the term t appears.
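
As a worked example of the two formulas, the sketch below computes TF and IDF by hand on a tiny made-up corpus. The corpus, the helper names, and the choice of the natural logarithm are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: 1 + log(N / n), using the natural logarithm.

    Assumes the term occurs in at least one document, otherwise n would be zero.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus, already lowercased and tokenized.
corpus = [
    ["the", "model", "learns", "from", "data"],
    ["the", "data", "is", "noisy"],
    ["the", "classifier", "is", "naive"],
]

doc = corpus[0]
for term in ("the", "data", "model"):
    print(term, round(tf_idf(term, doc, corpus), 3))
```

All three terms occur once in the first document, so their TF is identical. "the" appears in every document and gets the lowest score, "data" appears in two documents, and "model" appears in only one and scores highest.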

**`Cosine Similarity`**

The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. Once every sentence is represented as a vector of TF-IDF weights, cosine similarity measures how close two such vectors are:

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

where d1 and d2 are two non-zero vectors.
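
A minimal NumPy sketch of the formula is shown below. The helper name `cosine_sim` and the three vectors are illustrative assumptions; the vectors are binary bag-of-words vectors over the vocabulary {Learning, is, the, not, great} used earlier.

```python
import numpy as np

def cosine_sim(d1, d2):
    """Cosine similarity of two non-zero vectors: dot(d1, d2) / (||d1|| * ||d2||)."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))

a = [1, 1, 0, 0, 1]  # "Learning is great"
b = [1, 1, 0, 1, 1]  # "Learning is not great"
c = [0, 0, 1, 0, 0]  # "the"

print(cosine_sim(a, b))  # close to 1: the sentences share most of their words
print(cosine_sim(a, c))  # 0: no words in common
```

A score of 0 means the two texts share no terms, which is the case the bot reports as not understanding the input.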

To generate a response from our bot for input questions, the concept of document similarity will be used. We define a function response which adds the user's utterance to the list of corpus sentences, converts all of them to TF-IDF vectors, and returns the corpus sentence that is most similar to the utterance by cosine similarity. If no sentence shares a term with the input (the best similarity is zero), it returns the response: “I am sorry! I don't understand you”

In [None]:
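def response(user_response):
    robo_response = ''
    # temporarily add the user's input so it is vectorized together with the corpus;
    # the main loop removes it again after the reply is printed
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # similarity of the user's input (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)
    # the highest score is the input compared with itself, so take the second highest
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # no corpus sentence shares a term with the input
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response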
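
The retrieval idea can also be tried in isolation. The sketch below is a simplified variant under stated assumptions, not the notebook's function: it uses a hypothetical three-sentence corpus, scikit-learn's default tokenizer instead of LemNormalize (so no NLTK data is needed), and it compares the query only with the corpus rows, so the list is never modified.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical three-sentence corpus standing in for sent_tokens.
sentences = [
    "naive bayes classifiers apply bayes theorem to classify data.",
    "machine learning infers patterns from data.",
    "cosine similarity compares two vectors.",
]

def best_match(query, sentences):
    """Return (index, score) of the sentence most similar to `query`."""
    # Fit on the corpus plus the query so both share one vocabulary.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences + [query])
    # Compare the query (last row) with the corpus rows only.
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
    idx = int(scores.argmax())
    return idx, float(scores[idx])

idx, score = best_match("what is machine learning?", sentences)
print(sentences[idx], round(score, 3))

# A query with no overlapping terms scores 0 against every sentence.
_, no_match_score = best_match("hello there", sentences)
print(no_match_score)
```

Leaving the query out of the comparison removes the need for the "second highest" trick, at the cost of differing slightly from the cell above.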



Finally, we write the main loop, including the lines that we want our bot to say when starting and ending a conversation, depending on the user's input.

In [None]:
flag = True
print("Jarvis: My name is Jarvis. I will answer your queries about Data science. If you want to exit, type Bye!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("Jarvis: You are welcome..")
        else:
            # call greeting() once so the printed reply is the one that was checked
            greeting_reply = greeting(user_response)
            if greeting_reply is not None:
                print("Jarvis: " + greeting_reply)
            else:
                print("Jarvis: ", end="")
                print(response(user_response))
                # response() appended the input to sent_tokens; remove it again
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("Jarvis: Bye! take care..")

Jarvis: My name is Jarvis. I will answer your queries about Data science. If you want to exit, type Bye!
Hi
Jarvis: I am glad! You are talking to me
what is naive bayse?
Jarvis: naive bayes
naive bayes classifiers are used to classify by applying the bayes' theorem.
what is machine learning?
Jarvis: machine learning
machine learning is a technique used to perform tasks by inferencing patterns from data.
jarvis
Jarvis: about jarvis
i am an a.i system created by mr.tony stark but after ultron tried to destroy me, i was transfered to mr.asif khan for further upgrades.
Thanks 
Jarvis: I am sorry! I don't understand you
thanks
Jarvis: You are welcome..
