# Simple Chatbot with NLTK and Scikit-learn

Let's create a chatbot with Python's NLTK library and Scikit-learn.

## What is NLP?
Natural Language Processing (NLP) is a way for computers to analyze, understand, and derive meaning from human language. Using NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

## Importing libraries

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import io
import random
import string # to process standard python strings
import warnings
warnings.filterwarnings('ignore')

## Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

[Natural Language Processing with Python](http://www.nltk.org/book/) provides a guide to programming for language processing.
If this is your first time using NLTK, see [Installing NLTK Data](https://www.nltk.org/data.html).

### Installing NLTK Packages




In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True)
# nltk.download('punkt')    # first-time use only
# nltk.download('wordnet')  # first-time use only

## Reading the corpus

We use the Wikipedia page for chatbots as our corpus. Copy the contents from the page and place it in a text file named ‘chatbot.txt’. However, you can use any corpus of your choice.

In [None]:
# read the corpus, ignoring any bytes that cannot be decoded
with open('chatbot.txt', errors='ignore') as f:
    raw = f.read()
raw = raw.lower() # convert the text to lowercase

## Cleaning the text


We need to convert the text (string) data into numerical vectors so that the computer can perform machine learning tasks. Therefore, before starting any NLP project, we pre-process the text to make it suitable for analysis. Basic text pre-processing includes:

**1. Tokenization**
* Converting the entire text into **uppercase** or **lowercase**, so that the algorithm does not treat the same word in different cases as different words.

* **Tokenization**: Tokenization is the process of converting normal text strings into a list of tokens, i.e. the words that we actually want. A sentence tokenizer finds the list of sentences, and a word tokenizer finds the list of words in a string.


The NLTK data package includes a pre-trained tokenizer for English.

**2. Noise Removal**
* Removing **noise**, i.e. everything that isn’t a standard number or letter.
* Removing **stop words**. Extremely common words that are of little value in helping select documents matching a user's need are sometimes excluded from the vocabulary entirely. These words are called stop words.
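
The two noise-removal steps above can be sketched in plain Python. This is only an illustration: the `STOP_WORDS` set and the `clean_tokens` helper are hypothetical names introduced here, and the stop-word list is deliberately tiny compared with a real one.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def clean_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

tokens = clean_tokens("The chatbot is a program, and it answers questions!")
print(tokens)  # ['chatbot', 'program', 'it', 'answers', 'questions']
```

Splitting on whitespace is a simplification; a trained tokenizer handles contractions and abbreviations more carefully.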


**3. Normalization**
* **Stemming**: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. For example, stemming the words “Stems”, “Stemming”, “Stemmed”, and “Stemtization” yields the single word “stem”.

Example of Stemming

| Form | Suffix | Stem |
|---|---|---|
| studies | -es | studi |
| studying | -ing | study |

* **Lemmatization**: Lemmatization is a slight variant of stemming. The major difference is that stemming can often create non-existent words, whereas lemmas are actual words. A stem is not necessarily something you can look up in a dictionary, but a lemma is. For example, “run” is the base form of words like “running” or “ran”, and “better” and “good” share the same lemma, so they are treated as the same word.

Example of Lemmatization

| Form | Morphological info | Lemma |
|---|---|---|
| studies | Third person, present form of verb | study |
| studying | Gerund of the verb | study |
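
To make the contrast concrete, here is a toy sketch in plain Python. It is not how NLTK's stemmers or `WordNetLemmatizer` work internally: `naive_stem` is a hypothetical suffix stripper and `TOY_LEMMAS` is a hand-written lookup table, both introduced only to reproduce the two examples above.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma dictionary; a real lemmatizer uses a full lexicon.
TOY_LEMMAS = {"studies": "study", "studying": "study", "ran": "run", "running": "run"}

def toy_lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ("studies", "studying"):
    print(word, "-> stem:", naive_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer needs only a list of suffixes but produces the non-word "studi", while the lemmatizer returns real words at the cost of needing a dictionary.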



Which is better?
<br>Developing a stemmer is far simpler than building a lemmatizer. Lemmatization requires more computational resources and linguistic knowledge, because dictionaries must be created that allow the algorithm to look up the proper form of each word.
Lemmatizers are beneficial for search engines, and lemmatization has further applications, such as textual bases or e-commerce search.


## Tokenization

In [None]:
sentence_tokens = nltk.sent_tokenize(raw) # convert to list of sentences 
listword_tokens = nltk.word_tokenize(raw) # convert to list of words

In [None]:
lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

## Greeting and Keyword matching

Next, we define a function for a greeting from the bot. If a user’s input matches with one of the greeting words, the bot shall return a greeting response. [ELIZA](https://en.wikipedia.org/wiki/ELIZA) uses a simple keyword matching for greetings. We use the same concept here.

In [None]:
GREETING_INPUTS = ("hello", "hi", "greetings","hey")
GREETING_RESPONSES = ["Hello, Please input your question.", "Hi! type your question please."]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

## Generating Response

### Bag of Words
After the preprocessing, we need to transform text into a meaningful vector (or array) of numbers in order to analyze the text and run algorithms.
<br>The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.

* A measure of the presence of known words.

Why is it called a “bag” of words?
<br>That is because any information about the order or structure of words in the document is discarded and the model is only **concerned with whether the known words occur in the document, not where they occur in the document.**

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
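
The vector in this example can be reproduced with a few lines of plain Python. This is a minimal sketch: `bag_of_words` is a hypothetical helper, it assumes whitespace tokenization and lowercasing, and it stores counts, which coincide with presence flags here because no word repeats.

```python
def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocabulary = ["learning", "is", "the", "not", "great"]
vector = bag_of_words("Learning is great", vocabulary)
print(vector)  # [1, 1, 0, 0, 1]
```

Word order plays no role: "great is Learning" maps to the same vector.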


<img src="BOW.jpg" style="width: 90%"/>

source: [From text to vectors with BoW and TF-IDF](https://maelfabien.github.io/machinelearning/NLP_2)



The numbers in the matrix are simply the counts of the tokens in each document. This is called the Term Frequency (TF) approach.
<br>However, raw counts have limitations, and newer approaches have largely replaced them. The logic of the TF approach is that the more frequent a word is, the more importance we attach to it within each document. This can be problematic, since common words, like cat or dog in our example, do not bring much information about the document they appear in. In other words, the words that appear most often are not the most useful for extracting information from a document. Conversely, we can leverage the fact that words that appear rarely carry a lot of information about the document.

## TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words receive larger scores even though they may not carry much informational content (a word appearing frequently does not mean it is important in the document). It also gives more weight to longer documents than to shorter ones.


One approach is to rescale the frequency of words by how often they appear across all documents, so that the scores for words like “the”, which are frequent in every document, are penalized.
<br>Instead of filling the BoW matrix with raw counts, we score each word by its term frequency multiplied by its inverse document frequency. The score is intended to reflect how important a word is to a document in a collection or corpus. This approach is called Term Frequency-Inverse Document Frequency (TF-IDF), where:



**Term Frequency: a score of how frequent the word is in the current document.**

```
TF = (Number of times term t appears in a document)/(Number of terms in the document)
```

**Inverse Document Frequency: a score of how rare the word is across documents.**

```
IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
```

<br>
For example, we have:

* Document1 = The sky is blue

* Document2 = The sky is not blue




<img src="TFIDF.png" style="width: 50%"/>


The only word that differentiates the two documents is 'not', so it is important from the TF-IDF perspective (i.e. it gets a larger score).
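
The claim about 'not' can be checked by applying the TF and IDF formulas above to the two documents. This is a sketch under stated assumptions: the natural logarithm is used (the formula does not fix a base), tokens are split on whitespace, and the helper names `tf`, `idf`, and `tf_idf` are introduced here for illustration. The numbers may differ from those in the figure and from scikit-learn's `TfidfVectorizer`, which by default uses a smoothed IDF and normalizes each row.

```python
import math

documents = ["the sky is blue", "the sky is not blue"]
tokenized = [doc.split() for doc in documents]

def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """1 + log(N / n); assumes the term occurs in at least one document."""
    n = sum(1 for tokens in all_docs if term in tokens)
    return 1 + math.log(len(all_docs) / n)

def tf_idf(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)

doc2 = tokenized[1]
scores = {term: tf_idf(term, doc2, tokenized) for term in doc2}
for term, score in scores.items():
    print(f"{term}: {score:.3f}")
```

In Document2 every word has the same term frequency (1/5), but 'not' appears in only one of the two documents, so its IDF is 1 + log(2) while the other words get an IDF of 1.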

## Cosine Similarity

The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. Once each document is represented as a vector of TF-IDF weights, cosine similarity measures how similar two documents are:

```
Cosine Similarity (d1, d2) =  Dot product(d1, d2) / (||d1|| * ||d2||)
```
where d1 and d2 are two non-zero vectors.
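
The formula translates directly into plain Python. This is a minimal sketch for dense lists of numbers: `cosine_similarity_vec` is a hypothetical name chosen to avoid clashing with scikit-learn's `cosine_similarity`, which operates on matrices instead.

```python
import math

def cosine_similarity_vec(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)  # both vectors must be non-zero

same_direction = cosine_similarity_vec([1, 1, 0], [2, 2, 0])
orthogonal = cosine_similarity_vec([1, 0, 0], [0, 1, 0])
print(round(same_direction, 6), orthogonal)
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no terms in common score 0.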


<img src="CosineSimilarity.png" style="width: 70%"/>


<br>
To generate a response from our bot for input questions, we use the concept of document similarity. We define a function `response` that compares the user’s utterance with every sentence in the corpus and returns the most similar sentence. If no sentence has any similarity to the input, it returns the response: “I am sorry, I don’t understand you.”

In [None]:
def response(user_response):
    # temporarily add the user input to the corpus so both share one vocabulary
    # (the caller removes it again after the response is printed)
    sentence_tokens.append(user_response)

    # build TF-IDF vectors for every corpus sentence plus the user input
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sentence_tokens)

    # cosine similarity between the user input (last row) and every row;
    # flatten() converts the 1 x n result into a 1-d array
    vals_cosine = cosine_similarity(tfidf[-1], tfidf).flatten()

    # the largest similarity is normally the user input matched with itself,
    # so take the index of the second largest value
    idx = vals_cosine.argsort()[-2]
    req_tfidf = vals_cosine[idx]

    # a similarity of zero means no corpus sentence shares a term with the input
    if req_tfidf == 0:
        return "I am sorry, I don't understand you."

    # otherwise return the best matching sentence from the corpus
    return sentence_tokens[idx]
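
The selection step can be illustrated without scikit-learn. The sketch below is an alternative to taking the second largest value, not the code used by `response`: it drops the final score (the query compared with itself) and then picks the maximum. The `best_match_index` helper and the similarity values are hypothetical.

```python
def best_match_index(similarities):
    """Index of the highest score, ignoring the final entry (the query itself)."""
    candidates = similarities[:-1]
    return max(range(len(candidates)), key=candidates.__getitem__)

corpus = ["a chatbot is a program", "the sky is blue"]
# Hypothetical scores: one per corpus sentence, then the query against itself.
similarities = [0.62, 0.0, 1.0]

idx = best_match_index(similarities)
print(corpus[idx])  # a chatbot is a program
```

Excluding the last entry explicitly avoids relying on the query's self-similarity being the single largest value, which may not hold when a corpus sentence is identical to the query.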

Finally, we define the lines that our bot says when starting and ending a conversation, depending on the user’s input.

In [None]:
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type 'Bye'")
while True:
    user_response = input()
    user_response = user_response.lower()
    if user_response == 'bye':
        print("ROBO: See you.")
        break
    elif user_response in ('thanks', 'thank you'):
        print("ROBO: You're welcome.")
    elif greeting(user_response) is not None:
        print("ROBO: " + greeting(user_response))
    else:
        print("ROBO: ", end = "")
        print(response(user_response))
        sentence_tokens.remove(user_response)

### End