# Building a Simple Chatbot from Scratch in Python (using NLTK)


History of chatbots dates back to 1966 when a computer program called ELIZA was invented by Weizenbaum. It imitated the language of a psychotherapist from only 200 lines of code. You can still converse with it here: [Eliza](http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm?utm_source=ubisend.com&utm_medium=blog-link&utm_campaign=ubisend).

Similarly we are creating a very basic Chatbot with Text Summarization feature utlising the Python's NLTK and SpaCy libraries.It's a very simple bot with hardly any cognitive skills,but this is done as a good effort to get more into NLP and get to know about chatbots.


## NLP
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

## Import necessary libraries

In [1]:
import io
import random
import string # to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

## NLTK
NLTK(Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.





In [2]:
!pip install nltk



### Installing NLTK Packages




In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
nltk.download('punkt') # first-time use only
nltk.download('wordnet') # first-time use only

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Reading in the corpus

For our example,we will be using the Wikipedia page for chatbots as our corpus. Copy the contents from the page and place it in a text file named ‘chatbot.txt’. However, you can use any corpus of your choice.

In [4]:
# 1. Fetch Text
import nltk
import os
from google.colab import files

uploaded = files.upload()
print("len(uploaded.keys()):", len(uploaded.keys()))

# Load text into an object called "text"
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    with open(fn, 'r', encoding='utf8', errors='ignore') as f:
        # set contents of file to your text object HERE
        text = f.read()
        f.seek(0)  # Move the file pointer to the beginning of the file
        partial_text = f.read(100)  # Read the first 100 characters
print(partial_text)


Saving burgess-busterbrown.txt to burgess-busterbrown.txt
len(uploaded.keys()): 1
User uploaded file "burgess-busterbrown.txt" with length 84663 bytes
[The Adventures of Buster Bear by Thornton W. Burgess 1920]

I

BUSTER BEAR GOES FISHING


Buster Be



The main issue with text data is that it is all in text format (strings). However, the Machine learning algorithms need some sort of numerical feature vector in order to perform the task. So before we start with any NLP project we need to pre-process it to make it ideal for working. Basic text pre-processing includes:

* Converting the entire text into **uppercase** or **lowercase**, so that the algorithm does not treat the same words in different cases as different

* **Tokenization**: Tokenization is just the term used to describe the process of converting the normal text strings into a list of tokens i.e words that we actually want. Sentence tokenizer can be used to find the list of sentences and Word tokenizer can be used to find the list of words in strings.

_The NLTK data package includes a pre-trained Punkt tokenizer for English._

* Removing **Noise** i.e everything that isn’t in a standard number or letter.
* Removing the **Stop words**. Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words
* **Stemming**: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. Example if we were to stem the following words: “Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result would be a single word “stem”.
* **Lemmatization**: A slight variant of stemming is lemmatization. The major difference between these is, that, stemming can often create non-existent words, whereas lemmas are actual words. So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma. Examples of Lemmatization are that “run” is a base form for words like “running” or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same.



## Build the TF-IDF matrix.

In [5]:
sent_tokens = nltk.sent_tokenize(text)# converts to list of sentences
word_tokens = nltk.word_tokenize(text)# converts to list of words

## Preprocessing

We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.

In [6]:
# 1.(b) Tokenize text into sentences HERE:
nltk.download('punkt')

def tokenize_sentences(text):
    sentences = text

    return nltk.sent_tokenize(text)

sentences = tokenize_sentences(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
# 1.(c) Clean and Preprocess documents (sentences) for our matrix
import string

translation = str.maketrans('', '', string.punctuation)

def preprocess_text(sentences):
    # Clean text HERE or in the parameters of BOW or TFIDF:
    #https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    #https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    cleaned = [w.lower() for w in sentences]
    cleaned = [w.translate(translation) for w in cleaned]

    return cleaned

cleaned = preprocess_text(sentences)

In [8]:
# create TFIDF (or BOW) matrix on SENTENCES
import pandas as pd

# build BOW matrix HERE (from your cleaned sentences)
# remember we need both a model ("CountVec" in the code examples)
# and a matrix ("X" in the code examples)
from sklearn.feature_extraction.text import CountVectorizer


# OR

# build TFIDF matrix HERE (from your cleaned sentences)
# remember we need both a model ("TfidfVec" in the code examples)
# and a matrix ("X" in the code examples)
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVec = TfidfVectorizer(stop_words='english') #see documentation for options!
tfidf = TfidfVec.fit_transform(cleaned)

X = pd.DataFrame(tfidf.toarray(), columns = TfidfVec.get_feature_names_out(), dtype='float32')
print(X.head())


   112      1920   26   45   71  ability  able  accept  according  acquainted  \
0  0.0  0.230881  0.0  0.0  0.0      0.0   0.0     0.0        0.0         0.0   
1  0.0  0.000000  0.0  0.0  0.0      0.0   0.0     0.0        0.0         0.0   
2  0.0  0.000000  0.0  0.0  0.0      0.0   0.0     0.0        0.0         0.0   
3  0.0  0.000000  0.0  0.0  0.0      0.0   0.0     0.0        0.0         0.0   
4  0.0  0.000000  0.0  0.0  0.0      0.0   0.0     0.0        0.0         0.0   

   ...  years  yell  yelled  yelling  yes  yesterday   yo  youll  young  youre  
0  ...    0.0   0.0     0.0      0.0  0.0        0.0  0.0    0.0    0.0    0.0  
1  ...    0.0   0.0     0.0      0.0  0.0        0.0  0.0    0.0    0.0    0.0  
2  ...    0.0   0.0     0.0      0.0  0.0        0.0  0.0    0.0    0.0    0.0  
3  ...    0.0   0.0     0.0      0.0  0.0        0.0  0.0    0.0    0.0    0.0  
4  ...    0.0   0.0     0.0      0.0  0.0        0.0  0.0    0.0    0.0    0.0  

[5 rows x 1338 columns]


## Keyword matching

Next, we shall define a function for a greeting by the bot i.e if a user’s input is a greeting, the bot shall return a greeting response.ELIZA uses a simple keyword matching for greetings. We will utilize the same concept here.

In [9]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):

    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

## Generating Response

### Bag of Words
After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.

* A measure of the presence of known words.

Why is it is called a “bag” of words? That is because any information about the order or structure of words in the document is discarded and the model is only **concerned with whether the known words occur in the document, not where they occur in the document.**

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).


### TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words start to dominate in the document (e.g. larger score), but may not contain as much “informational content”. Also, it will give more weight to longer documents than shorter documents.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

**Term Frequency: is a scoring of the frequency of the word in the current document.**

```
TF = (Number of times term t appears in a document)/(Number of terms in the document)
```

**Inverse Document Frequency: is a scoring of how rare the word is across documents.**

```
IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
```
### Cosine Similarity

Tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus

```
Cosine Similarity (d1, d2) =  Dot product(d1, d2) / ||d1|| * ||d2||
```
where d1,d2 are two non zero vectors.



To generate a response from our bot for input questions, the concept of document similarity will be used. We define a function response which searches the user’s utterance for one or more known keywords and returns one of several possible responses. If it doesn’t find the input matching any of the keywords, it returns a response:” I am sorry! I don’t understand you”

In [10]:
# build the bot reponse:
def respond(user_input):
    bot_response = ''

    # transform user query with our model HERE
    query = TfidfVec.transform([user_input])

    # get dot-product of our query-vector and matrix HERE
    cosine_sim = query.dot(X.T)

    # get index of maximum similarity HERE
    max_sim_index = cosine_sim.argmax()

    #if there's nothing like the user query in our matrix, give a standard response
    if max_sim_index == 0:
        return 'BOT: I beg your pardon? I\'m not quite sure I got your meaning.'

    # fetch the sentce from our (original) sentence vector by max_sim_index and return it
    bot_response = sentences[max_sim_index]

    return bot_response

## Adding Text Summarizer to the Chatbot

In [11]:
from heapq import nlargest

In [12]:
# Function to perform text summarization
def summarize_text(user_input):
    if user_input.lower() == "~summarize":

        # SpaCy setup
        import spacy
        from spacy.lang.en.stop_words import STOP_WORDS
        from string import punctuation
        from heapq import nlargest

        stopwords = list(STOP_WORDS)
        nlp = spacy.load('en_core_web_sm')

        # Tokenize and process the text
        doc = nlp(text)

        # Calculate word frequencies
        word_frequencies = {}
        for word in doc:
            if word.text.lower() not in stopwords and word.text.lower() not in punctuation:
                if word.text not in word_frequencies.keys():
                    word_frequencies[word.text] = 1
                else:
                    word_frequencies[word.text] += 1

        # Normalize word frequencies
        max_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = word_frequencies[word] / max_frequency

        # Calculate sentence scores
        sentence_tokens = [sent for sent in doc.sents]
        sentence_scores = {}
        for sent in sentence_tokens:
            for word in sent:
                if word.text.lower() in word_frequencies.keys():
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]

        # Select top sentences for the summary
        select_length = int(len(sentence_tokens) * 0.3)
        summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

        # Convert the summary sentences to text
        final_summary = [word.text for word in summary]
        summary_text = ' '.join(final_summary)

        # Print the initial number of characters and the number of characters in the summarized text
        print(f"\nInitial number of characters: {len(text)}")
        print(f"Number of characters in the summarized text: {len(summary_text)}\n")

        return summary_text

    # If it's not a summarization command, return None
    return None


Finally, we will feed the lines that we want our bot to say while starting and ending a conversation depending upon user’s input.

In [13]:
# Main chatbot logic
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while flag:
    user_response = input("YOU: ")
    user_response = user_response.lower()

    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ROBO: You are welcome.")
        else:
            # Check for summarization command
            summary_result = summarize_text(user_response)
            if summary_result is not None:
                print("ROBO: " + summary_result)
            else:
                if greeting(user_response) is not None:
                    print("ROBO: " + greeting(user_response))
                else:
                    print("ROBO: ", end="")
                    print(respond(user_response))
                    if user_response in sent_tokens:
                        sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! Take care.")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
YOU: Hi
ROBO: hi there
YOU: Buster Bear yawned
ROBO: It was Buster Bear.
YOU:  He shuffled along over to the Laughing Brook
ROBO: Somehow, it seemed to Buster as if the Brook were
laughing at him.
YOU: ~summarize

Initial number of characters: 82992
Number of characters in the summarized text: 45549

ROBO: Sammy Jay looked at Blacky the Crow, and Blacky looked at Chatterer,
and Chatterer looked at Happy Jack, and Happy Jack looked at Peter
Rabbit, and Peter looked at Unc' Billy Possum, and Unc' Billy looked at
Bobby Coon, and Bobby looked at Johnny Chuck, and Johnny looked at Reddy
Fox, and Reddy looked at Jimmy Skunk, and Jimmy looked at Billy Mink,
and Billy looked at Little Joe Otter, and for a minute nobody could say
a word. He saw Farmer Brown's
boy filling a great tin pail with blueberries, and he knew that Farmer
Brown's boy didn't know that Buster Bear was anywhere about, and he kne