Kali the deep learning chatbot

https://kalichatbotblog.000webhostapp.com/

Python 3.9 (other versions not tested)

If you do not know how to use Git LFS to clone large files, please download the repo from the following Google Drive link: https://drive.google.com/drive/folders/15Nx9KjmZ2mlCjgvs44ONhgppfZ0GoT4f?usp=sharing

External Modules used

  • tensorflow 2.6.0
py -3.9 -m pip install tensorflow
  • tensorflow_addons 0.14.0
py -3.9 -m pip install tensorflow_addons
  • scikit-learn
py -3.9 -m pip install scikit-learn
  • pandas
py -3.9 -m pip install pandas
  • tensorflow GPU
py -3.9 -m pip install tensorflow_gpu (optional)
  • CUDA (recommended but optional)

Website modules

Flask

py -3.9 -m pip install flask (optional)

Flask-SocketIO

py -3.9 -m pip install flask-socketio (optional)

Understanding the concepts behind the Chatbot

This deep learning chatbot uses Neural Machine Translation (NMT): the network is built from Long Short-Term Memory (LSTM) units, with an attention layer added to pick out the important words in a given sentence. The encoder first turns the input into numerical values the network can work with; these pass through the network and then the attention layer, and finally the decoder produces the chatbot's English reply.
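
Below is a minimal sketch of this encoder-attention-decoder idea using plain Keras layers. It is an illustration only, not the repository's actual model (which relies on tensorflow_addons); the vocabulary size, embedding size and LSTM width are placeholder values.

import tensorflow as tf

VOCAB_SIZE = 8000   # placeholder vocabulary size
EMBED_DIM = 256
LSTM_UNITS = 512

# Encoder: turns the input token ids into a sequence of hidden states.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="original")
enc_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(enc_inputs)
enc_outputs, enc_h, enc_c = tf.keras.layers.LSTM(
    LSTM_UNITS, return_sequences=True, return_state=True)(enc_embed)

# Decoder: generates the reply tokens, seeded with the encoder's final state.
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="reply_in")
dec_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(dec_inputs)
dec_outputs, _, _ = tf.keras.layers.LSTM(
    LSTM_UNITS, return_sequences=True, return_state=True)(
        dec_embed, initial_state=[enc_h, enc_c])

# Attention: lets every decoder step focus on the most relevant encoder positions.
context = tf.keras.layers.Attention()([dec_outputs, enc_outputs])
combined = tf.keras.layers.Concatenate()([dec_outputs, context])
logits = tf.keras.layers.Dense(VOCAB_SIZE)(combined)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.summary()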

Running the Chatbot via the website

The website associated with the chatbot uses Flask-SocketIO to communicate with the backend. To begin a local instance of the server, simply run:

python3.9 webserver.py

Website image 1 · Website image 2

The UI is fairly intuitive: pressing a chat button redirects you to a webpage where the server establishes a websocket with your machine; once that completes, you can talk to the chatbot directly.

Website image 3
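
For reference, here is a hypothetical sketch of the kind of Flask-SocketIO wiring webserver.py performs; the route, event names and reply placeholder are assumptions for illustration, not taken from the repository.

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

@app.route("/")
def index():
    return "Kali chatbot"  # the real server would render the chat page here

@socketio.on("user_message")        # assumed event name
def handle_user_message(message):
    reply = "..."                   # the trained chatbot model would generate the reply here
    emit("bot_reply", reply)        # assumed event name

if __name__ == "__main__":
    socketio.run(app)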

Running the Chatbot via Console App

As provided in the run_console.py file, running the chatbot as a console app is extremely easy.

python3.9 run.py

The conversation will now be displayed in the console, similar to the desktop application. Additionally, feel free to experiment with extensions or add more functionality to the console application however you like.

console example image

Ending the program just requires closing the console; if you want, extra functionality can be added to the given code to exit on a button press, user input (such as 'exit'), or anything similar.
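
As an example, one possible way to add an 'exit' command; the function get_reply is a placeholder for however the trained model produces its response:

def get_reply(text):
    return "..."  # call into the trained chatbot model here

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":
        break
    print("Kali:", get_reply(user_input))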

Training the Chatbot

Training the chatbot is handled by the train.py file found in the user directory. Debugging logs similar to the following will appear in the console:

training example

The loss measures how far the network's output is from the expected reply (lower is better), whilst the epochs and batches represent the different portions of the data the chatbot is being trained on.

Chatbot parameters

Located in chatbot.py several adjustable parameters can be found. Notably:

CONST_TRAINING_CHECKPOINT_DIRECTORY = "training_checkpoints/"
CONST_TRAINING_FILES_DIRECTORIES = ("training_data/training_data.original", "training_data/training_data.reply")

Where the checkpoint directory is where you want the chatbot to save its state during training, and the training file directories point to the .original and .reply files used for training.

Furthermore, parameters for the training process itself include:

CONST_BUFFER_SIZE = 32000 
# limits how much we read from the IO/Stream. We wouldn't want a buffer overflow...
CONST_BATCH_SIZE = 32 
# The batch sizes can vary depending on the computation power of your computer
dataset_limit = 30000  # Limit for dataset sizes

Depending on the capability of your computer these numbers can be increased or decreased accordingly. If you find your computer often crashing, reducing the batch size and the dataset limit may solve the issue.
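
For context, here is a sketch of how these constants are typically wired into a tf.data input pipeline in TensorFlow NMT-style code; it is an assumption for illustration, not copied from chatbot.py, and the stand-in tensors are dummies.

import tensorflow as tf

CONST_BUFFER_SIZE = 32000
CONST_BATCH_SIZE = 32
dataset_limit = 30000

# Dummy stand-in data: 100 tokenized (original, reply) pairs of length 20.
originals = tf.zeros([100, 20], dtype=tf.int32)
replies = tf.zeros([100, 20], dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((originals, replies))
dataset = (dataset.take(dataset_limit)                          # cap the number of examples used
                  .shuffle(CONST_BUFFER_SIZE)                   # shuffle within a bounded buffer
                  .batch(CONST_BATCH_SIZE, drop_remainder=True) # group examples into batches
                  .prefetch(tf.data.AUTOTUNE))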

Finding the right Dataset

Finding the right dataset is crucial to the overall success of the project. Realistically, you want at least ~250,000 individual conversations to attain a somewhat realistic deep learning chatbot. I suggest using the dataset I used: Reddit data. Around 1.5 TB of disk space is needed for the raw data and the databases used to filter it.

Filtering the data into a database

Given a dataset filled with original topics / starting messages and one-to-many replies to those topics, we first have to sort through the dataset and link pairs of original topics and replies together in a SQLite database. In addition, we also do some initial filtering, such as removing hyperlinks and certain words, enforcing sentence-length limits, etc.

This is handled by the gen_database.py file found at database/gen_database.py. It creates a database for, in my case, the Reddit data I'm using to train my chatbot. The script sorts through all this data and pairs comments with the comments that reply to them, which can then be used for training the chatbot.
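
To illustrate the idea (the table and column names here are hypothetical, not necessarily those used in gen_database.py), the pairing boils down to storing each starting comment together with the reply that answers it:

import sqlite3

conn = sqlite3.connect("2015-01.db")
conn.execute("""CREATE TABLE IF NOT EXISTS parent_reply (
                    parent_id TEXT PRIMARY KEY,
                    parent TEXT,
                    reply TEXT,
                    score INTEGER)""")

def insert_pair(parent_id, parent, reply, score):
    # Only called for comments that passed filter_comment() on both sides.
    conn.execute("INSERT OR REPLACE INTO parent_reply VALUES (?, ?, ?, ?)",
                 (parent_id, parent, reply, score))
    conn.commit()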

After completion you should have a database with a structure similar to:

Database structured example

Filled with data that should look like:

Database filled example

Furthermore, if you're using my dataset, each month of data will be separated out into its own database, as depicted: Database filled example

Pairing the data into different files

Once the data has been inserted into the database, we need to look for all the pairs of conversations in the database and separate them into different files, where .original holds the starting comment / message and .reply holds the associated reply to that original message. This is done by the get_training_data.py file, which reads all the provided databases, splits off a small portion of test data for use after training, and writes the rest of the filtered data to the training_data files:

training data example
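
Conceptually, the export works roughly like the sketch below (the table and column names are carried over from the hypothetical schema above, not necessarily what get_training_data.py actually uses):

import sqlite3

def export_pairs(db_path, original_path, reply_path, limit=10000):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT parent, reply FROM parent_reply LIMIT ?", (limit,)).fetchall()
    # Append each side of the pair to the parallel .original / .reply files.
    with open(original_path, "a", encoding="utf-8") as f_orig, \
         open(reply_path, "a", encoding="utf-8") as f_reply:
        for parent, reply in rows:
            f_orig.write(parent.replace("\n", " ") + "\n")
            f_reply.write(reply.replace("\n", " ") + "\n")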

Training data parameters

Since filtering is performed during database insertion, filters can be removed from or added to this step.

In gen_database.py the filters for each sentence can be found in the following method:

    @staticmethod
    def filter_comment(comment):  # Could add filtering for specific subreddits here.
        if (len(comment.split()) > 50) or (len(comment) < 1):  # word-count / empty-message check
            return False
        elif (len(comment) > 1000):  # character-length limit
            return False
        possible_url = re.search(r"(?P<url>https?://[^\s]+)", comment)  # checking for URLs (requires `import re` at the top of the file)
        if (possible_url):
            return False
        elif (comment == "[deleted]") or (comment == "[removed]"):  # Reddit placeholders for removed content
            return False
        return True

Currently the database filters out URLs and makes sure each message exists and has an appropriate length.

Similarly, the get_training_data.py file also has path parameters that need to be addressed:

header = r'D:/Data/ChatBot/database/'
Extract_data(10000, [header + r'2015-01.db', header + r'2015-02.db']).sort_data()

The header is the directory containing all the databases the generator file has created. Furthermore, the second parameter of the Extract_data class will need to be modified depending on the number of databases created. I will most likely optimize this in a future update so Python simply reads the filenames in the header directory with the extension .db.
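
One possible way to implement that optimization, assuming it replaces the hard-coded list inside get_training_data.py:

import glob

header = r'D:/Data/ChatBot/database/'
databases = glob.glob(header + '*.db')        # pick up every generated .db file automatically
Extract_data(10000, databases).sort_data()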

If the usage of this program seems confusing, try watching my quick usage video at https://youtu.be/srREp4IqlHQ
