# Jayithi - Python Chatbot Project – Learn to build your first chatbot using NLTK & Keras

We are going to build a chatbot using deep learning techniques. The chatbot is trained on a dataset containing categories (intents), patterns, and responses. We use a feedforward neural network built from fully connected (Dense) layers, fed with bag-of-words vectors, to classify which category the user's message belongs to. The chatbot then returns a random response from the list of responses associated with that category.

---

## **1. Introduction to Chatbots**

A chatbot is software that can communicate and perform actions in a way similar to a human.

Chatbots come in two types:
 - **Retrieval-based models**
 - **Generative-based models**

---

### Retrieval-Based Models

**What They Do:** Retrieval-based chatbots look through a predefined set of responses and pick the best one based on the user's input. 

**How They Work:**
1. **Predefined Responses:** They have a list of possible answers stored in advance.
2. **Matching:** When a user asks something, the chatbot matches the question to the most relevant response from its list.
3. **Selection:** It then selects and displays the best response it found.

**Example:** If you ask a retrieval-based chatbot, "What’s the weather today?" it might look up responses related to weather queries and choose a pre-written response about today’s weather.
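
The matching and selection steps above can be sketched in a few lines. This is only an illustration of the retrieval idea, not the approach used later in this project (which trains a neural network to do the matching). The `RESPONSES` table, the `FALLBACK` text, and the keyword-overlap score are all assumptions made up for this example.

```python
import re

# Hypothetical canned responses, each with a set of trigger keywords.
RESPONSES = {
    "weather": ({"weather", "rain", "sunny", "forecast"}, "Here is today's weather report."),
    "greeting": ({"hello", "hi", "hey"}, "Hello! How can I help?"),
}
FALLBACK = "Sorry, I do not have an answer for that."


def retrieve_response(message):
    """Return the stored response whose keywords overlap most with the message."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    best_reply, best_score = FALLBACK, 0
    for keywords, reply in RESPONSES.values():
        score = len(tokens & keywords)  # number of shared words
        if score > best_score:
            best_reply, best_score = reply, score
    return best_reply


print(retrieve_response("What's the weather today?"))
print(retrieve_response("Tell me a joke"))
```

The second call falls back to the default text, which shows the main limitation listed below: a retrieval-based bot can only answer with what it already has stored.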

**Pros:** 
- Simple to implement.
- Good at providing accurate responses if questions are predictable.

**Cons:** 
- Limited by the responses it has.
- Cannot handle unexpected questions well.



### Generative-Based Models

**What They Do:** Generative-based chatbots create responses from scratch using learned patterns from data. They generate replies based on the input they receive.

**How They Work:**
1. **Learning from Data:** They are trained on large amounts of conversation data.
2. **Generating Responses:** When you ask a question, the chatbot generates a new, unique response based on what it has learned.

**Example:** If you ask a generative-based chatbot, "Tell me a joke," it might create a new joke based on patterns it has learned from previous jokes.

**Pros:**
- Can handle a wide range of topics and questions.
- Produces more natural and varied responses.

**Cons:**
- More complex to build.
- Responses might be less predictable and require careful tuning.

## **2. Setting Up the Environment**

We begin by installing the necessary libraries:

In [4]:
# json, pickle, random, and tkinter are part of Python's standard library,
# so pip cannot install them and they are not listed here.
!pip install nltk
!pip install tensorflow
!pip install numpy
!pip install keras

## **3. Importing Libraries**

We import the required libraries. The usage of each library is given in the comments.

In [5]:
import nltk  # For Natural Language Toolkit functionalities, including tokenization and lemmatization
nltk.download('punkt')  # Tokenizer data; I downloaded this after an error in tokenization
nltk.download('wordnet')  # WordNet data; I downloaded this after an error in lemmatization
from nltk.stem import WordNetLemmatizer  # Import WordNetLemmatizer for lemmatizing words
lemmatizer = WordNetLemmatizer()  # Initialize the lemmatizer
import json  # For handling JSON data, such as loading intents
import pickle  # For saving and loading Python objects, like processed words and classes
import numpy as np  # For numerical operations and array handling
from keras.models import Sequential, load_model  # For building and loading Keras models
from keras.layers import Dense, Activation, Dropout  # For adding layers to Keras models
from keras.optimizers import SGD  # For using Stochastic Gradient Descent optimizer in Keras
import random  # For generating random numbers, such as selecting responses
import tkinter  # For creating GUI components with Tkinter
from tkinter import *  # Import Tkinter components for GUI design

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
2024-07-28 07:15:33.641106: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

## **4. Data Preparation**

We load and process the intent data:

In [6]:
words=[]  # Starting with an empty list to collect all the words we'll encounter
classes = []  # Here, I'll keep track of the unique tags or categories
documents = []  # This list will hold pairs of tokenized words and their associated tags
ignore_words = ['?', '!']  # Punctuation tokens to drop when building the vocabulary

# Let’s open and read the JSON file that contains all the intents
with open('intents.json') as data_file:
    # Now I'll load the JSON data so we can work with it
    intents = json.load(data_file)

# I’ll go through each intent to process it
for intent in intents['intents']:
    # For every pattern in the current intent
    for pattern in intent['patterns']:

        # Tokenizing each pattern into words
        w = nltk.word_tokenize(pattern)
        # Adding these words to our list of all words
        words.extend(w)
        # Storing these tokenized words along with their tag in the documents list
        documents.append((w, intent['tag']))

        # Adding the tag to our list of classes if it’s not already there
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# This cell did not work at first. This post helped: https://stackoverflow.com/questions/26693736/nltk-and-stopwords-fail-lookuperror
# After reading it, I downloaded the necessary NLTK resources.

Tokenization breaks a sentence into individual pieces called tokens. For example, the sentence "I love programming!" gets split into ["I", "love", "programming", "!"]. Turning text into these manageable chunks is the first step in making sense of it. I came across this explanation at https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/, and it really helped clarify how tokenization works. I then read https://www.datacamp.com/blog/what-is-tokenization, which explains why tokenization is crucial for tasks like text analysis and machine learning: it puts text into a form that algorithms can process.
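
To see the splitting without downloading any NLTK data, here is a minimal sketch that uses a regular expression as a simplified stand-in for `nltk.word_tokenize`. The function name `simple_tokenize` is made up for this example, and the real NLTK tokenizer handles more cases (for instance, it splits "what's" into "what" and "'s", which this regex does not).

```python
import re


def simple_tokenize(text):
    # A run of word characters becomes one token;
    # every other non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("I love programming!"))  # ['I', 'love', 'programming', '!']
```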

## **5. Text Processing**

We lemmatize and clean the text data:

In [7]:
# First, I’m lemmatizing each word, making them lowercase, and removing any words in our ignore list
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
# Then, I’m removing duplicates by converting the list to a set and sorting it
words = sorted(list(set(words)))
# Sorting the classes to have them in a consistent order
classes = sorted(list(set(classes)))
# Printing the number of documents to see how many pairs we have
print(len(documents), "documents")
# Printing the number of unique classes and listing them out
print(len(classes), "classes", classes)
# Printing the number of unique, lemmatized words and displaying them
print(len(words), "unique lemmatized words", words)

# Saving the cleaned-up words list to a pickle file
with open('words.pkl', 'wb') as f:
    pickle.dump(words, f)
# Saving the sorted classes list to another pickle file
with open('classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

# I downloaded WordNet through nltk.download('wordnet') for the lemmatization.

47 documents
9 classes ['adverse_drug', 'blood_pressure', 'blood_pressure_search', 'goodbye', 'greeting', 'hospital_search', 'options', 'pharmacy_search', 'thanks']
88 unique lemmatized words ["'s", ',', 'a', 'adverse', 'all', 'anyone', 'are', 'awesome', 'be', 'behavior', 'blood', 'by', 'bye', 'can', 'causing', 'chatting', 'check', 'could', 'data', 'day', 'detail', 'do', 'dont', 'drug', 'entry', 'find', 'for', 'give', 'good', 'goodbye', 'have', 'hello', 'help', 'helpful', 'helping', 'hey', 'hi', 'history', 'hola', 'hospital', 'how', 'i', 'id', 'is', 'later', 'list', 'load', 'locate', 'log', 'looking', 'lookup', 'management', 'me', 'module', 'nearby', 'next', 'nice', 'of', 'offered', 'open', 'patient', 'pharmacy', 'pressure', 'provide', 'reaction', 'related', 'result', 'search', 'searching', 'see', 'show', 'suitable', 'support', 'task', 'thank', 'thanks', 'that', 'there', 'till', 'time', 'to', 'transfer', 'up', 'want', 'what', 'which', 'with', 'you']


Lemmatization reduces words to their base or dictionary form (the lemma), which is really useful for processing text. For example, "running" and "ran" both reduce to "run" when they are lemmatized as verbs, while "runner" is a separate noun and keeps its own lemma. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed (for example `pos='v'`), so the code above mostly normalizes noun plurals (for example "hospitals" to "hospital"). I found out more at https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/, which explains how lemmatization simplifies text data while retaining the meaning. It differs from stemming because it looks words up in a dictionary to find the correct base form instead of stripping suffixes (https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming). I also checked out another article that highlights how lemmatization can improve the accuracy of text analysis and machine learning models by giving words a more consistent representation: https://intellipaat.com/blog/what-is-lemmatization-in-nlp/
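
The difference between a dictionary lookup and suffix stripping can be shown with a toy example. Everything here is an assumption made for illustration: the tiny `LEMMAS` table stands in for WordNet, and `toy_stem` is a crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
# Toy lemma dictionary; a real lemmatizer looks words up in WordNet instead.
LEMMAS = {"running": "run", "ran": "run", "runs": "run", "mice": "mouse"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


def toy_stem(word):
    """Strip a common suffix without checking whether the result is a real word."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["running", "ran", "mice"]:
    print(w, "-> lemma:", toy_lemmatize(w), "| stem:", toy_stem(w))
```

The lookup maps the irregular forms "ran" and "mice" to real words, while the suffix stripper leaves them untouched and turns "running" into the non-word "runn".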

## **6. Training Data Creation**

We create the training data for the model:

In [8]:
# # create our training data
# training = []
# # create an empty array for our output
# output_empty = [0] * len(classes)
# # training set, bag of words for each sentence
# for doc in documents:
#     # initialize our bag of words
#     bag = []
#     # list of tokenized words for the pattern
#     pattern_words = doc[0]
#     # lemmatize each word - create base word, in attempt to represent related words
#     pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
#     # create our bag of words array with 1, if word match found in current pattern
#     for w in words:
#         bag.append(1) if w in pattern_words else bag.append(0)

#     # output is a '0' for each tag and '1' for current tag (for each pattern)
#     output_row = list(output_empty)
#     output_row[classes.index(doc[1])] = 1

#     training.append([bag, output_row])
# # shuffle our features and turn into np.array
# random.shuffle(training)
# training = np.array(training)
# # create train and test lists. X - patterns, Y - intents
# train_x = list(training[:,0])
# train_y = list(training[:,1])
# print("Training data created")
# #Value error is being shown because we are setting an array element with a sequence, and the requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (94, 2) + inhomogeneous part.

To fix the ValueError, I removed the conversion of the combined training list to a NumPy array. Each entry in that list pairs a bag-of-words list (one element per vocabulary word) with an output row (one element per class). Because the two have different lengths, NumPy cannot pack them into one regular array. Keeping the data in plain Python lists and splitting it into `train_x` and `train_y` avoids the problem: each of those lists is rectangular on its own and is converted with `np.array` only when the model is fitted. Atticus and I worked this out in class along with Professor JoJo. He posted his solution on the discussion board, but this is one more type of solution I am trying out.

In [9]:
# I’m setting up an empty list to hold our training data
training = []
# Creating an empty output row with zeros, the length of the number of classes
output_empty = [0] * len(classes)

# For each document, I'm preparing the bag-of-words and the output row
for doc in documents:
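    # Lemmatizing and lowercasing the pattern tokens so they match the vocabulary in `words`
    pattern_words = {lemmatizer.lemmatize(w.lower()) for w in doc[0]}
    # Creating a bag-of-words where we mark '1' if the vocabulary word is in the pattern
    bag = [1 if w in pattern_words else 0 for w in words]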
    # Starting with an empty output row and setting the right class to '1'
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    # Appending the bag-of-words and the output row to our training data
    training.append([bag, output_row])

# Shuffling the training data to mix things up and improve model performance
random.shuffle(training)

# Separating the features (bag-of-words) and labels (output rows) into their own lists
train_x = [x[0] for x in training]
train_y = [x[1] for x in training]

# Letting myself know that the training data is ready for use
print("Training data created")


Training data created
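
To make the encoding concrete, here is a small self-contained sketch of what one training pair looks like. The five-word `vocabulary` and three `classes` are made up for this example (the real project has 88 words and 9 classes), and `encode` is a hypothetical helper, not a function from the notebook.

```python
# Hypothetical, already lowercased and sorted vocabulary and class list.
vocabulary = ["bye", "hello", "help", "thanks", "you"]
classes = ["goodbye", "greeting", "thanks"]


def encode(pattern_tokens, tag):
    """Return (bag-of-words vector, one-hot label) for one tokenized pattern."""
    tokens = {t.lower() for t in pattern_tokens}
    bag = [1 if w in tokens else 0 for w in vocabulary]
    one_hot = [1 if c == tag else 0 for c in classes]
    return bag, one_hot


bag, label = encode(["Hello", "you"], "greeting")
print(bag)    # [0, 1, 0, 0, 1]
print(label)  # [0, 1, 0]
```

The bag has one entry per vocabulary word and the label has one entry per class, so the two lists have different lengths. That mismatch is why packing `[bag, output_row]` pairs into a single NumPy array failed earlier.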


## **7. Training the Model**

We build and train the classifier in this step:

In [10]:
# I’m starting by setting up the model with three layers
# First, I’ll create a Sequential model to stack the layers
model = Sequential()

# Adding the first dense layer with 128 neurons and ReLU activation
# The input shape is determined by the number of features in train_x[0]
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))

# Adding a Dropout layer to help prevent overfitting
# This randomly drops 50% of the neurons during training
model.add(Dropout(0.5))

# Adding the second dense layer with 64 neurons and ReLU activation
model.add(Dense(64, activation='relu'))

# Adding another Dropout layer to further combat overfitting
model.add(Dropout(0.5))

# Adding the final output layer with neurons equal to the number of unique classes
# Using softmax activation to get probabilities for each class
model.add(Dense(len(train_y[0]), activation='softmax'))

# Now I’m setting up the optimizer
# I chose Stochastic Gradient Descent (SGD) with Nesterov accelerated gradient
# The `decay` argument from older tutorials is left out because recent Keras versions no longer support it
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Compiling the model
# Using categorical cross-entropy loss function for multi-class classification
# Setting the optimizer to our SGD instance and tracking accuracy as our metric
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Fitting the model with our training data
# Converting train_x and train_y to NumPy arrays for compatibility
# Training for 200 epochs with a batch size of 5
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)

# Saving the trained model to a file so I can use it later
model.save('chatbot_model.h5')

# Letting myself know that the model has been successfully created and saved
print("Model created")




Epoch 1/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.0621 - loss: 2.3310
Epoch 2/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.1771 - loss: 2.2393
Epoch 3/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.2540 - loss: 2.1615
Epoch 4/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.2299 - loss: 2.1587
Epoch 5/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.4575 - loss: 1.9792
Epoch 6/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.2753 - loss: 2.0126
Epoch 7/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.3628 - loss: 1.8754
Epoch 8/200
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.3859 - loss: 1.7284
Epoch 9/200
...



Model created
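
To get a feel for the loss values printed during training, here is a plain-Python sketch of the softmax activation and the categorical cross-entropy loss. It is a worked illustration of the formulas, not how Keras implements them, and the input scores are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


probs = softmax([2.0, 1.0, 0.1])           # made-up scores for three classes
print(probs)
print(categorical_crossentropy([1, 0, 0], probs))

# A model that guesses uniformly over 9 classes has loss ln(9).
uniform = [1 / 9] * 9
print(round(categorical_crossentropy([1] + [0] * 8, uniform), 4))  # 2.1972
```

The last value is close to the loss reported for the first epoch above, which is what an untrained 9-class classifier is expected to show.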


## **8. Loading the Model**

Here we load the trained model and the necessary data:

model = load_model('chatbot_model.h5')
intents = json.loads(open('intents.json').read())
words = pickle.load(open('words.pkl','rb'))
classes = pickle.load(open('classes.pkl','rb'))

## **9. Helper Functions**

We define functions for cleaning sentences, predicting classes, and generating responses:

In [11]:
def clean_up_sentence(sentence):
    # I start by tokenizing the sentence to split it into words
    # This breaks the sentence into an array of individual words
    sentence_words = nltk.word_tokenize(sentence)
    # Next, I lemmatize each word to reduce it to its base form
    # This helps in normalizing variations of the same word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words

In [12]:
# This function returns a bag-of-words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=True):
    # First, I clean up the sentence to get tokenized and lemmatized words
    sentence_words = clean_up_sentence(sentence)
    # Creating a bag-of-words array with 0s initially
    # This array will have a length equal to the number of words in our vocabulary
    bag = [0]*len(words) 
    for s in sentence_words:
        # Checking each word in the sentence against our vocabulary
        for i, w in enumerate(words):
            if w == s: 
                # If the word from the sentence is found in our vocabulary, mark it with 1
                bag[i] = 1
                if show_details:
                    print("found in bag: %s" % w)
    return(np.array(bag))

In [13]:
def predict_class(sentence, model):
    # First, I convert the sentence into a bag-of-words representation
    p = bow(sentence, words, show_details=False)
    # Predicting the class probabilities using the trained model
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    # Filtering out predictions below the threshold
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    # Sorting the results by the strength of the probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    # Preparing the final list with intent and probability
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list
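
The filtering and sorting inside `predict_class` can be tried without a trained model. In this sketch the probability lists and the five-class list are made up, and `rank_intents` is a hypothetical name for the same threshold-then-sort logic.

```python
ERROR_THRESHOLD = 0.25
# Hypothetical subset of class names, in the order the model would output them.
classes = ["goodbye", "greeting", "hospital_search", "options", "thanks"]


def rank_intents(probabilities, threshold=ERROR_THRESHOLD):
    """Keep classes above the threshold and sort them from most to least likely."""
    kept = [(i, p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [{"intent": classes[i], "probability": str(p)} for i, p in kept]


print(rank_intents([0.05, 0.60, 0.05, 0.30, 0.0]))
print(rank_intents([0.2, 0.2, 0.2, 0.2, 0.2]))  # []
```

The second call returns an empty list because no class clears the threshold. With 9 classes this can also happen in the real chatbot, so any code that reads the first prediction should handle an empty result.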


In [14]:
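def getResponse(ints, intents_json):
    # Placeholder reply for when no intent can be used (the wording is my own choice)
    fallback = "Sorry, I didn't understand that."
    # If no prediction cleared the threshold, the list is empty
    if not ints:
        return fallback
    # I get the intent from the first result in the predicted intents
    tag = ints[0]['intent']
    # Access the list of intents from the intents JSON
    list_of_intents = intents_json['intents']
    # Loop through each intent to find the matching tag
    for i in list_of_intents:
        if i['tag'] == tag:
            # Randomly select a response from the matched intent's responses
            return random.choice(i['responses'])
    # No intent in the JSON matched the predicted tag
    return fallback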

In [15]:
def chatbot_response(text):
    # Predict the class (intent) for the input text
    ints = predict_class(text, model)
    # Get a suitable response based on the predicted intent
    res = getResponse(ints, intents)
    return res

## **10. Building the GUI**

We use Tkinter to create a simple graphical user interface for the chatbot:

In [16]:
# Importing tkinter and necessary components for creating the GUI
import tkinter
from tkinter import *

def send():
    # I'm getting the message from EntryBox and removing any extra whitespace
    msg = EntryBox.get("1.0", 'end-1c').strip()
    # Clearing the EntryBox after retrieving the message
    EntryBox.delete("0.0", END)

    # If the message isn't empty, I proceed with updating the chat log
    if msg != '':
        # Making the ChatLog editable so I can add new messages
        ChatLog.config(state=NORMAL)
        # Inserting the user's message into the ChatLog
        ChatLog.insert(END, "You: " + msg + '\n\n')
        # Setting the text color and font for the ChatLog
        ChatLog.config(foreground="#442265", font=("Verdana", 12))

        # Getting the chatbot's response based on the user's message
        res = chatbot_response(msg)
        # Inserting the bot's response into the ChatLog
        ChatLog.insert(END, "Bot: " + res + '\n\n')

        # Making the ChatLog non-editable again after updating
        ChatLog.config(state=DISABLED)
        # Scrolling the ChatLog to the bottom to show the latest messages
        ChatLog.yview(END)

# Creating the main window for the GUI
base = Tk()
base.title("Hello")  # Setting the title of the window
base.geometry("400x500")  # Setting the dimensions of the window
base.resizable(width=FALSE, height=FALSE)  # Making the window non-resizable

# Creating the chat window where messages will be displayed
ChatLog = Text(base, bd=0, bg="white", height="8", width="50", font="Arial")
ChatLog.config(state=DISABLED)  # Making the ChatLog initially non-editable

# Adding a scrollbar to the ChatLog
scrollbar = Scrollbar(base, command=ChatLog.yview, cursor="heart")
ChatLog['yscrollcommand'] = scrollbar.set

# Creating the Send button that will send messages when clicked
SendButton = Button(base, font=("Verdana", 12, 'bold'), text="Send", width="12", height=5,
                    bd=0, bg="#32de97", activebackground="#3c9d9b", fg='#ffffff',
                    command=send)

# Creating the EntryBox where users can type their messages
EntryBox = Text(base, bd=0, bg="white", width="29", height="5", font="Arial")

# Placing all components on the screen
scrollbar.place(x=376, y=6, height=386)  # Positioning the scrollbar
ChatLog.place(x=6, y=6, height=386, width=370)  # Positioning the ChatLog
EntryBox.place(x=128, y=401, height=90, width=265)  # Positioning the EntryBox
SendButton.place(x=6, y=401, height=90)  # Positioning the Send button

# Running the GUI main loop to keep the window open and interactive
base.mainloop()


TclError: no display name and no $DISPLAY environment variable

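In [None]:
# https://github.com/jupyterlab/jupyterlab/issues/9660
# As discussed in the issue above, the Tkinter window cannot open here:
# the notebook environment has no display ($DISPLAY is not set), which is what the TclError reports.
# The same code worked when I ran it in VS Code. I am still looking into what to do next.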
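
One possible way around the missing display is to check for it first and fall back to a text-only chat. This is a sketch under assumptions: `can_open_gui` and `console_chat` are hypothetical helpers, the check is only a heuristic for Linux sessions where Tk needs `$DISPLAY`, and the echo function stands in for `chatbot_response`.

```python
import os
import sys


def can_open_gui(platform=sys.platform, environ=os.environ):
    """Heuristic: on Linux, Tk needs a display, which is advertised through $DISPLAY."""
    if platform.startswith("linux"):
        return bool(environ.get("DISPLAY"))
    return True  # assumption: desktop sessions on other platforms have a display


def console_chat(respond, messages):
    """Text fallback: pass each message to `respond` and collect the replies."""
    return ["Bot: " + respond(m) for m in messages]


if can_open_gui():
    print("A display is available; the Tkinter window can be started.")
else:
    print("No display found; falling back to a text-only chat.")
    # The lambda is a stand-in for chatbot_response from the cells above.
    for line in console_chat(lambda m: "(echo) " + m, ["hello"]):
        print(line)
```

In the notebook, `respond` would be `chatbot_response`, and the messages could come from `input()` in a loop instead of a fixed list.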