# Building a Simple Chatbot from Scratch in Python (using NLTK)

## Import necessary libraries

In [1]:
import io
import random
import string
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## Downloading and installing NLTK
- **NLTK (Natural Language Toolkit)**:
  - Leading platform for building Python programs with human language data.
  - Easy-to-use interfaces to over 50 corpora and lexical resources like WordNet.
  - Provides text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
  - Includes wrappers for industrial-strength NLP libraries.





In [2]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


### Installing NLTK Packages




In [3]:
# Import NLTK and download the data packages used below
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True)

# On first use you may also need to download these packages explicitly:
nltk.download('punkt')      # Punkt tokenizer package
nltk.download('wordnet')    # WordNet data used by the lemmatizer
nltk.download('punkt_tab')  # Punkt data in the format newer NLTK releases expect

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MOHAMMEDG\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\MOHAMMEDG\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\MOHAMMEDG\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Reading in the corpus

We will use the Wikipedia page for chatbots as our corpus, saved locally as 'chatbot.txt'. You can use any corpus of your choice.

In [4]:
# Read the corpus; errors='ignore' skips bytes that cannot be decoded
with open('Storage/chatbot.txt', 'r', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # convert to lowercase

In [5]:
print(raw)

a chatbot (also known as a talkbot, chatterbot, bot, im bot, interactive agent, or artificial conversational entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods. such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the turing test. chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition. some chatterbots use sophisticated natural language processing systems, but many simpler systems scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.

the term "chatterbot" was originally coined by michael mauldin (creator of the first verbot, julia) in 1994 to describe these conversational programs.today, most chatbots are either accessed via virtual assistants such as google assistant and amazon al

Text data needs to be converted into numerical feature vectors for machine learning algorithms. Basic text pre-processing includes:

- **Case Conversion**: Convert text to uppercase or lowercase to ensure uniformity.
- **Tokenization**: Split text into tokens (words or sentences). NLTK's Punkt tokenizer can be used.
- **Noise Removal**: Remove non-standard characters (anything that isn't a letter or number).
- **Stop Words Removal**: Exclude common words that add little value (e.g., 'and', 'the').
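- **Stemming**: Reduce inflected words to a stem by stripping suffixes; the stem need not be a dictionary word (e.g., "running", "runs" → "run").
- **Lemmatization**: Reduce words to their dictionary form (lemma) using vocabulary and part-of-speech information, so the result is always an actual word (e.g., "ran" → "run", "better" → "good").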
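
The first few steps in the list above can be sketched with the standard library alone. This is only an illustration: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up here, the tokenization is a naive whitespace split rather than NLTK's Punkt tokenizer, and stemming and lemmatization are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to"}

def preprocess(text):
    """Lowercase, strip noise, split on whitespace, and drop stop words."""
    lowered = text.lower()                           # case conversion
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)   # noise removal
    tokens = cleaned.split()                         # naive tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("The Chatbot is a program, and it talks!"))
# ['chatbot', 'program', 'it', 'talks']
```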

## Tokenisation

In [6]:
sent_tokens = nltk.sent_tokenize(raw) #converts to list of sentences
word_tokens = nltk.word_tokenize(raw) #converts to list of words

In [7]:
for sentence in sent_tokens:
    print("-> ", sentence) 

->  a chatbot (also known as a talkbot, chatterbot, bot, im bot, interactive agent, or artificial conversational entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods.
->  such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the turing test.
->  chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition.
->  some chatterbots use sophisticated natural language processing systems, but many simpler systems scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.
->  the term "chatterbot" was originally coined by michael mauldin (creator of the first verbot, julia) in 1994 to describe these conversational programs.today, most chatbots are either accessed via virtual assistants such as google assi

In [8]:
for word in word_tokens:
    print("-> ", word) 

->  a
->  chatbot
->  (
->  also
->  known
->  as
->  a
->  talkbot
->  ,
->  chatterbot
->  ,
->  bot
->  ,
->  im
->  bot
->  ,
->  interactive
->  agent
->  ,
->  or
->  artificial
->  conversational
->  entity
->  )
->  is
->  a
->  computer
->  program
->  or
->  an
->  artificial
->  intelligence
->  which
->  conducts
->  a
->  conversation
->  via
->  auditory
->  or
->  textual
->  methods
->  .
->  such
->  programs
->  are
->  often
->  designed
->  to
->  convincingly
->  simulate
->  how
->  a
->  human
->  would
->  behave
->  as
->  a
->  conversational
->  partner
->  ,
->  thereby
->  passing
->  the
->  turing
->  test
->  .
->  chatbots
->  are
->  typically
->  used
->  in
->  dialog
->  systems
->  for
->  various
->  practical
->  purposes
->  including
->  customer
->  service
->  or
->  information
->  acquisition
->  .
->  some
->  chatterbots
->  use
->  sophisticated
->  natural
->  language
->  processing
->  systems
->  ,
->  but
->  many
->  simpler
->  sy

## Preprocessing

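We now define a function called `LemTokens`, which takes a list of tokens and returns their lemmatized forms, and a function called `LemNormalize`, which lowercases raw text, strips punctuation, tokenizes it, and lemmatizes the result.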

In [9]:
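lemmer = nltk.stem.WordNetLemmatizer()

# WordNet is a semantically oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    # Lemmatize each token (treated as a noun by default)
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that maps every punctuation character to None, i.e. deletes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize into words, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))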
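
To see what the punctuation table does on its own, here is a small sketch that uses only the standard library (no tokenization or lemmatization). The sample sentence is made up for illustration.

```python
import string

# Same construction as in the cell above: every punctuation character maps to None
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

# str.translate deletes each character whose code point maps to None
cleaned = "Hello, world! (It's a test.)".lower().translate(remove_punct_dict)
print(cleaned)  # hello world its a test
```

Note that the apostrophe is removed as well, so "it's" becomes "its" before tokenization.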

## Keyword matching

Next, we define a greeting function: if the user's input contains a greeting, the bot returns a greeting in response. ELIZA used simple keyword matching for greetings, and we apply the same idea here.

In [10]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    # Matching is word by word, so a multi-word entry such as "what's up" never matches
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting found
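
Because the matching is done on whitespace-separated words, an input such as "Hi," (with a trailing comma) is not recognized. One possible variant, sketched below under the assumption that stripping surrounding punctuation is acceptable, handles that case. The name `greeting_tolerant` and the shortened word lists are hypothetical and not part of the notebook.

```python
import random
import string

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "hey")
GREETING_RESPONSES = ["hi", "hey", "hi there", "hello"]

def greeting_tolerant(sentence):
    """Return a random greeting if any word, minus surrounding punctuation, is a known greeting."""
    for word in sentence.lower().split():
        if word.strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting found

print(greeting_tolerant("Hi, there!"))    # one of GREETING_RESPONSES, chosen at random
print(greeting_tolerant("How are you?"))  # None
```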

## Generating Response

### Bag of Words
After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.

* A measure of the presence of known words.

Why is it called a “bag” of words? That is because any information about the order or structure of words in the document is discarded and the model is only **concerned with whether the known words occur in the document, not where they occur in the document.**

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
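
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch of the binary (presence or absence) form of the model; `bag_of_words` is a hypothetical helper name, and the whitespace split stands in for a real tokenizer.

```python
def bag_of_words(text, vocabulary):
    """Return a binary presence vector of `text` over `vocabulary`."""
    words = set(text.lower().split())
    return tuple(1 if term.lower() in words else 0 for term in vocabulary)

vocabulary = ["Learning", "is", "the", "not", "great"]

print(bag_of_words("Learning is great", vocabulary))  # (1, 1, 0, 0, 1)
# Word order is discarded, so a reordered sentence yields the same vector
print(bag_of_words("great is Learning", vocabulary))  # (1, 1, 0, 0, 1)
```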


### TF-IDF Approach
The Bag of Words approach has two problems:
- Highly frequent words dominate the representation (they receive the largest scores) even though they may carry little "informational content".
- Longer documents receive more weight than shorter documents.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

**Term Frequency: a score of how frequently the word occurs in the current document.**

```
TF = (Number of times term t appears in a document)/(Number of terms in the document)
```

**Inverse Document Frequency: a score of how rare the word is across documents.**

```
IDF = 1 + log(N/n)
```
where N is the number of documents and n is the number of documents in which term t appears.

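
The two formulas above can be checked by hand on a toy corpus. The sketch below implements them literally in plain Python; the function names and the three example documents are made up for illustration. The numbers will generally differ from those produced by scikit-learn's `TfidfVectorizer`, which by default uses raw term counts, a smoothed IDF, and L2 normalization.

```python
import math

def term_frequency(term, document):
    """TF = occurrences of `term` / total number of terms in `document`."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    """IDF = 1 + log(N / n); assumes `term` occurs in at least one document."""
    n_containing = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n_containing)

def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

# Toy corpus (made up for illustration), already tokenized
documents = [
    ["the", "chatbot", "answers"],
    ["the", "user", "asks"],
    ["the", "chatbot", "learns", "the", "rules"],
]

for term in ("the", "chatbot", "answers"):
    print(term, round(tf_idf(term, documents[0], documents), 3))
```

In the first document all three terms have the same term frequency, so the ranking comes from IDF alone: "the" appears in every document and scores lowest, while "answers" appears in only one document and scores highest.
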
### Cosine Similarity

The TF-IDF weight is often used in information retrieval and text mining as a statistical measure of how important a word is to a document in a collection or corpus. Once each document is represented as a vector of TF-IDF weights, the similarity between two documents can be measured as the cosine of the angle between their vectors:

```
Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
```
where d1 and d2 are two non-zero vectors.
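
As a small worked example of the formula, the sketch below computes cosine similarity for short vectors in plain Python. The helper name `cosine_sim` is hypothetical (chosen to avoid clashing with scikit-learn's `cosine_similarity`, which the notebook uses for the real computation), and the vectors are made up.

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

same = cosine_sim((1, 1, 0, 0, 1), (1, 1, 0, 0, 1))     # identical vectors: approximately 1
partial = cosine_sim((1, 1, 0, 0, 1), (1, 0, 0, 1, 1))  # two of three terms shared
disjoint = cosine_sim((1, 0), (0, 1))                   # no shared terms: 0.0

print(same, partial, disjoint)
```

The score depends only on the direction of the vectors, not on their length, which is why it is a reasonable way to compare documents of different sizes.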



To generate a response from our bot for input questions, the concept of document similarity will be used. We define a function `response` that appends the user's utterance to the list of corpus sentences, converts all of them to TF-IDF vectors, and returns the corpus sentence most similar to the utterance. If no sentence has any similarity to the input (a score of 0), it returns the response: "I am sorry! I don't understand you".

In [11]:
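def response(user_response):
    robo_response = ''
    # Temporarily add the query to the corpus so that it is vectorized with it;
    # the caller removes it again after the reply is printed.
    sent_tokens.append(user_response)

    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)

    # Similarity of the query (last row) to every sentence, itself included
    vals = cosine_similarity(tfidf[-1], tfidf)

    # The top score is normally the query matching itself, so take the second highest
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    print(req_tfidf)  # debug output: similarity score of the best match

    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response

    robo_response = robo_response + sent_tokens[idx]
    return robo_response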
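
The function above modifies the global `sent_tokens` list and relies on the caller to undo that. As an alternative sketch, not the notebook's method, the same retrieval idea can be written without side effects by fitting the vectorizer on the corpus only and then transforming the query. The function name `best_match`, the three-sentence corpus, and the use of the default scikit-learn tokenizer instead of `LemNormalize` are assumptions made for illustration; because the query is not part of the fitted corpus here, the scores will differ slightly from those of `response`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FALLBACK = "I am sorry! I don't understand you"

def best_match(query, sentences):
    """Return the sentence most similar to `query`, or FALLBACK if nothing overlaps."""
    vectorizer = TfidfVectorizer(stop_words="english")
    sentence_vectors = vectorizer.fit_transform(sentences)  # learn vocabulary from the corpus
    query_vector = vectorizer.transform([query])            # reuse that vocabulary for the query
    scores = cosine_similarity(query_vector, sentence_vectors).flatten()
    best = scores.argmax()
    if scores[best] == 0:
        return FALLBACK
    return sentences[best]

# Toy corpus (made up for illustration)
sentences = [
    "a chatbot is a computer program that conducts a conversation.",
    "the turing test was proposed by alan turing.",
    "chatbots are used in customer service.",
]

print(best_match("who proposed the turing test?", sentences))
print(best_match("xyzzy", sentences))
```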

Finally, we define the lines the bot says when starting and ending a conversation, depending on the user's input.

In [12]:
flag = True
print("ROBO: My name is Robot. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response == 'bye':
        flag = False
        print("ROBO: Bye! take care..")
    elif user_response in ('thanks', 'thank you'):
        flag = False
        print("ROBO: You are welcome..")
    else:
        # Call greeting() once: its reply is random, so a second call could differ
        greeting_reply = greeting(user_response)
        if greeting_reply is not None:
            print("ROBO: ", greeting_reply)
        else:
            print("ROBO: ", end="")
            print(response(user_response))
            # response() appended the query to sent_tokens; remove it again
            sent_tokens.remove(user_response)

ROBO: My name is Robot. I will answer your queries about Chatbots. If you want to exit, type Bye!


 Hei


ROBO: 0.0
I am sorry! I don't understand you


 Hi


ROBO:  *nods*


 How are you doing?


ROBO: 0.24423334064209848
in 1984, a book called the policeman's beard is half constructed was published, allegedly written by the chatbot racter (though the program as released would not have been capable of doing so).


 Ok, what does this chatbot do?


ROBO: 0.09836955908885903
design
the chatbot design is the process that defines the interaction between the user and the chatbot.the chatbot designer will define the chatbot personality, the questions that will be asked to the users, and the overall interaction.it can be viewed as a subset of the conversational design.


 thanks


ROBO: You are welcome..
