<a href="https://colab.research.google.com/github/Fcazarez/PersonalProjects_LLM_Modelling_ChatBot/blob/main/chatbot_tf1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM with a ChatBot

# `Problem statement`

Help people find relevant answers to basic, commonly asked questions about taxation.

# Setup environment

In [1]:
# import libraries
import nltk
from nltk.stem.lancaster import LancasterStemmer

import numpy as np
import json
import random
import pickle
import requests

import tensorflow as tf

# download the Punkt tokenizer models used by nltk.word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Load and Preprocess Data

# **`Training dataset`**
* The intents file contains the 54 most common questions asked of the CRA about taxes. The information was adapted into a JSON file usable for training purposes.
* The subject of a conversation with the bot should be closely related to taxes and similar to the questions used for training.

* reference: https://www.canada.ca/en/revenue-agency/news/newsroom/tax-tips/tax-filing-season-media-kit/tax-questions-answers.html
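
The layout of the intents file can be checked before training. The sketch below assumes the structure visible in the loaded data (a top-level `intents` list whose entries carry `tag`, `patterns`, and `responses`). The helper name `validate_intents` and the inline sample are illustrative and not part of the notebook.

```python
# Hypothetical helper: checks the keys the notebook relies on later.
REQUIRED_KEYS = ("tag", "patterns", "responses")


def validate_intents(data):
    """Return a list of human-readable problems found in an intents dict."""
    problems = []
    seen_tags = set()
    for i, intent in enumerate(data.get("intents", [])):
        # Every intent needs the three keys used during preprocessing and chat
        for key in REQUIRED_KEYS:
            if key not in intent:
                problems.append(f"intent {i}: missing key '{key}'")
        # Tags are used as class labels, so they should be unique
        tag = intent.get("tag")
        if tag in seen_tags:
            problems.append(f"intent {i}: duplicate tag '{tag}'")
        seen_tags.add(tag)
        # The chat loop picks a random response, so the list must not be empty
        if not intent.get("responses"):
            problems.append(f"intent {i}: no responses")
    return problems


# Small made-up sample in the same shape as the intents file
sample = {
    "intents": [
        {"tag": "greeting", "patterns": ["Hi there", "Hello"],
         "responses": ["Hi there, how can I help?"]},
        {"tag": "goodbye", "patterns": ["Bye"], "responses": []},
    ]
}

print(validate_intents(sample))
```

An empty `patterns` list is deliberately not reported, since the `noanswer` intent in the data has none.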

In [2]:
# load data
# define the URL of the JSON file
url = "https://raw.githubusercontent.com/Fcazarez/PersonalProjects_LLM_Modelling_ChatBot/main/intents_cra.json"

# make a GET request to the URL
response = requests.get(url)

# check if the request was successful
if response.status_code == 200:
    # parse the response as JSON
    raw_data = response.json()
    # print the raw data
    print(raw_data)
else:
    # handle the error
    print(f"Request failed with status code {response.status_code}")

{'intents': [{'tag': 'greeting', 'patterns': ['Hi there', 'How are you', 'Is anyone there?', 'Hey', 'Hola', 'Hello', 'Good day'], 'responses': ['Hello, thanks for asking', 'Good to see you again', 'Hi there, how can I help?'], 'context': ['']}, {'tag': 'goodbye', 'patterns': ['Bye', 'See you later', 'Goodbye', 'Nice chatting to you, bye', 'Till next time'], 'responses': ['See you!', 'Have a nice day', 'Bye! Come back again soon.'], 'context': ['']}, {'tag': 'thanks', 'patterns': ['Thanks', 'Thank you', "That's helpful", 'Awesome, thanks', 'Thanks for helping me'], 'responses': ['Happy to help!', 'Any time!', 'My pleasure'], 'context': ['']}, {'tag': 'noanswer', 'patterns': [], 'responses': ["Sorry, can't understand you", 'Please give me more info', 'Not sure I understand'], 'context': ['']}, {'tag': 'covid19_tax_impact', 'patterns': ['How will COVID-19 benefit payments affect your 2022 taxes?'], 'responses': ['We have entered a different phase of the pandemic, and the emergency and rec

In [3]:
stemmer = LancasterStemmer()

In [4]:
'''
Explanation:

The code checks if a 'data.pickle' file exists and loads data from it. If not, it generates the data and saves it to the 'data.pickle' file.
Inside the except block, it iterates through the intents in the 'raw_data'.
For each pattern in the intents, it tokenizes the words, extends the 'words' list, and appends the tokenized words to 'docs_x' and the tag to 'docs_y'.
It checks if the tag is not in the 'labels' list, and if not, appends it.
It lowercases and stems the words, drops '?' tokens, and creates a sorted list of unique stems.
It encodes each pattern as a bag-of-words vector (training) and each tag as a one-hot vector (output).
Finally, it converts the training and output data to numpy arrays and saves them to the 'data.pickle' file.
'''


# reminder to delete the pickle file if you change the intents file
try:
    with open('data.pickle', 'rb') as data_file:
        words, labels, training, output = pickle.load(data_file)
except FileNotFoundError:
    # no cached data yet: build the words and labels from the intents
    words = []
    labels = []
    docs_x = []
    docs_y = []

    for intent in raw_data['intents']:
        for pattern in intent['patterns']:
            # Tokenize the words
            tokenized_words = nltk.word_tokenize(pattern)
            words.extend(tokenized_words)
            docs_x.append(tokenized_words)
            docs_y.append(intent['tag'])

        if intent['tag'] not in labels:
            labels.append(intent['tag'])

    # Stem the words
    words = [stemmer.stem(w.lower()) for w in words if w != '?']
    words = sorted(list(set(words)))
    labels = sorted(labels)

    # create training and output data
    training = []
    output = []

    # Create a list of zeros with the length of labels
    out_empty = [0 for _ in range(len(labels))]

    # One-hot encoding
    for x, doc in enumerate(docs_x):
        bag = []

        # Stem the words in the document
        stemmed_words = [stemmer.stem(w.lower()) for w in doc]

        # Create a bag of words
        for w in words:
            if w in stemmed_words:
                bag.append(1)
            else:
                bag.append(0)

        # Create the output row using one-hot encoding
        output_row = out_empty[:]
        output_row[labels.index(docs_y[x])] = 1

        # Append training and output data
        training.append(bag)
        output.append(output_row)

    # Convert to numpy arrays
    training = np.array(training)
    output = np.array(output)

    # Save data to a pickle file
    with open('data.pickle', 'wb') as data_file:
        pickle.dump((words, labels, training, output), data_file)
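
To make the encoding concrete, here is a minimal sketch of the same bag-of-words and one-hot scheme on two made-up patterns. It is a simplification: `str.split` stands in for `nltk.word_tokenize` and plain lowercasing stands in for Lancaster stemming, so it runs without NLTK. All `toy_` names are hypothetical.

```python
# Two made-up (pattern, tag) pairs, in place of the intents file
toy_patterns = [
    ("hello there", "greeting"),
    ("bye for now", "goodbye"),
]

# Stand-in tokenization: lowercase and split on whitespace
toy_docs = [(text.lower().split(), tag) for text, tag in toy_patterns]

# Sorted vocabulary and sorted label list, as in the cell above
toy_vocab = sorted({w for tokens, _ in toy_docs for w in tokens})
toy_labels = sorted({tag for _, tag in toy_docs})


def encode(tokens, tag):
    """Return (bag-of-words vector, one-hot label vector) for one pattern."""
    bag = [1 if w in tokens else 0 for w in toy_vocab]
    one_hot = [1 if label == tag else 0 for label in toy_labels]
    return bag, one_hot


toy_training = [encode(tokens, tag) for tokens, tag in toy_docs]

print(toy_vocab)
print(toy_labels)
for bag, one_hot in toy_training:
    print(bag, one_hot)
```

Each bag has one position per vocabulary entry, and each label vector has one position per tag, which is why the model's input and output sizes follow from `len(training[0])` and `len(output[0])`.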


# Train the model

In [5]:
'''
Explanation:

Build the Model:

A sequential model is created using Keras.
Three layers are added to the model:
The first layer is a Dense layer with 8 neurons; its input shape is the length of one bag-of-words vector (the vocabulary size).
The second layer is another Dense layer with 8 neurons.
Neither of these two layers specifies an activation, so both are linear.
The third layer is the output layer, with one neuron per label and a softmax activation function, suitable for multi-class classification.
The model is compiled using the Adam optimizer, categorical crossentropy loss (suitable for multi-class classification), and accuracy as the metric.

Load or Train the Model:

An attempt is made to load weights from a pre-existing 'model.keras' file.
If the loading fails (for example, because the file doesn't exist), the model is trained:
Training is done using the training data and output labels with 1000 epochs and a batch size of 8.
After training, the model is saved to the 'model.keras' file.
'''

# build the model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=[len(training[0])]))
model.add(tf.keras.layers.Dense(8))
model.add(tf.keras.layers.Dense(len(output[0]), activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# reminder to delete the model file if you change the model
try:
    # Attempt to load the weights from a pre-existing model file
    model.load_weights('model.keras')
except Exception:
    # If the loading fails (e.g. no saved model yet), train the model
    model.fit(training, output, epochs=1000, batch_size=8)
    # Save the trained model to a file
    model.save('model.keras')


Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
...
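
As a rough illustration of what the softmax output layer and the categorical crossentropy loss compute for a single example, here is a plain-Python sketch. The function names and the example scores are made up for illustration, and the code omits the numerical safeguards a library implementation would apply.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# Made-up raw scores for three intents; the first intent is the true one
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = categorical_crossentropy([1, 0, 0], probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks toward 0 as the probability of the true intent approaches 1, which is what training over many epochs tries to achieve.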

In [6]:
'''
Explanation:

Function Purpose:

The function bag_of_words converts a sentence (input s) into a bag-of-words representation based on a given list of words.

Initialization:

The bag list is initialized with zeros, with its length equal to the number of unique words in the provided list.

Tokenization and Stemming:

The input sentence s is tokenized into words.
Each word is converted to lowercase and then stemmed (reduced to its root form).

Creating the Bag of Words:

The function iterates through the stemmed words and the provided list of words.
If a word from the stemmed words matches a word in the provided list, the corresponding index in the bag is set to 1.

Return:

The final bag-of-words representation is returned as a NumPy array.
'''


# create a bag of words function to be used in the chat
def bag_of_words(s, words):
    # Initialize a list of zeros with the length of the words
    bag = [0 for _ in range(len(words))]

    # Tokenize the input sentence
    tokenized_words = nltk.word_tokenize(s)

    # Stem the tokenized words and convert them to lowercase
    stemmed_words = [stemmer.stem(w.lower()) for w in tokenized_words]

    # Iterate through the stemmed words
    for w in stemmed_words:
        # Check if the word is in the provided list of words
        for i, word in enumerate(words):
            if word == w:
                # If the word is present, set the corresponding index in the bag to 1
                bag[i] = 1

    # Convert the bag list to a NumPy array and return it
    return np.array(bag)
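
The nested loop above compares every input word against every vocabulary word. One possible alternative, sketched here under the assumption that the input has already been tokenized and stemmed, is to look up positions in a dictionary. The name `bag_from_tokens` and the sample vocabulary are hypothetical.

```python
def bag_from_tokens(tokens, vocabulary):
    """Build a binary bag-of-words list using a word -> position lookup."""
    index = {word: i for i, word in enumerate(vocabulary)}
    bag = [0] * len(vocabulary)
    for token in tokens:
        i = index.get(token)  # None for words outside the vocabulary
        if i is not None:
            bag[i] = 1
    return bag


# Made-up vocabulary; the real one holds Lancaster stems from the intents file
vocabulary = ["benefit", "pay", "slip", "tax"]
print(bag_from_tokens(["tax", "slip", "unknown", "tax"], vocabulary))
# -> [0, 0, 1, 1]
```

Unknown words are ignored and repeated words still produce a single 1, matching the behavior of `bag_of_words`.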


In [9]:
'''

Explanation:

Function Purpose:

The function chat allows the user to interact with the chatbot in a conversational manner.

User Interaction Loop:

The function enters an infinite loop to allow continuous interaction until the user decides to quit by typing 'quit'.

User Input:

The user is prompted to input a message ('You: ').

Check for Quit Command:

If the user types 'quit', the loop breaks, ending the conversation.

Intent Prediction:

The bag-of-words representation of the user input is fed into the model for intent prediction.
The index with the highest predicted probability is used to identify the predicted intent.

Retrieve Responses:

The responses associated with the predicted intent are retrieved from the raw data.

Print Bot's Response:

A randomly chosen response from the retrieved responses is printed as the bot's reply ('Bot: ...').
'''


# create a chat function
def chat():
    # Print a welcome message
    print('Start talking with the bot! (type quit to stop)')

    # Start an infinite loop for the conversation
    while True:
        # Get user input
        inp = input('You: ')

        # Check if the user wants to quit the chat
        if inp.lower() == 'quit':
            break

        # Use the model to predict the intent of the user input
        results = model.predict(np.array([bag_of_words(inp, words)]), verbose=0)
        results_index = np.argmax(results)
        tag = labels[results_index]

        # Find the responses associated with the predicted intent
        for intent in raw_data['intents']:
            if intent['tag'] == tag:
                responses = intent['responses']

        # Print a randomly chosen response for the predicted intent
        print('Bot: ' + random.choice(responses))
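
The chat function always answers with the top-scoring intent, even when the model is unsure. The intents file includes a `noanswer` tag with no training patterns, so one possible refinement is to fall back to it when the highest probability is low. This is a sketch only: the helper name `pick_tag` and the 0.5 threshold are assumptions, not values from the notebook.

```python
def pick_tag(probabilities, labels, threshold=0.5, fallback="noanswer"):
    """Return the most likely label, or the fallback if confidence is low."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return fallback
    return labels[best]


# Made-up probabilities for three intents
demo_labels = ["goodbye", "greeting", "thanks"]
print(pick_tag([0.05, 0.90, 0.05], demo_labels))  # confident prediction
print(pick_tag([0.40, 0.35, 0.25], demo_labels))  # low confidence
```

Inside `chat`, the first row of `results` from `model.predict` could be passed as `probabilities`; a suitable threshold would need to be tuned on real conversations.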


In [10]:
# start the chat
chat()

Start talking with the bot! (type quit to stop)
You: hey
Bot: Good to see you again
You: hello
Bot: Hi there, how can I help?
You: What happens if you made a payment or payments toward your COVID-19 benefit overpayment? Will that appear on your T4A slip?"
Bot: If you made a repayment to the CRA between January 1 and December 31, 2022, for an excess payment received in 2020 or 2021, this amount will be shown in box 201 of your 2022 T4A slip.
You: What if you would like to make a request to deduct federal COVID-19 benefit repayments in a prior year (Form T1B)?
Bot: The CRA will reassess your return(s) to apply the deduction.
You: What should you do if there are issues with your RL-1 slip?
Bot: Quebec residents who notice issues with their RL-1 slip should contact the CRA.
You: You received a letter or a T4A slip stating that you received a COVID-19 benefit payment but you never applied. What do you do?
Bot: The CRA is working to ensure your tax information is protected from fraud.
You: S

KeyboardInterrupt: ignored