# Chatbot Documentation

This project was heavily inspired by the series of [YouTube videos](https://www.youtube.com/watch?v=dvOnYLDg8_Y&t=20s) by [sentdex](https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ). In this notebook, I present comprehensive documentation of my experience with this series, which involved creating a chatbot with deep learning, Python and TensorFlow.

## Contents

1. [Introduction & Collecting our Training Data](#intro)
2. [Data structure](#2)
3. [Buffering data](#3)
4. [Determining insert](#4)
5. [Building database](#5)
6. [Database to training data](#6)
7. [Training a model](#7)
8. NMT Concepts and Parameters
9. Interacting with our Chatbot
10. Further Work

# <a id='intro'></a> Introduction & Collecting our Training Data

The method used here to build this chatbot is designed to generalize to chatbots for a diverse range of applications. The main differentiating factor between chatbot implementations is the type of training data used to build them.

As with all machine learning applications, one of the biggest obstacles is collecting the relevant data and manipulating it into a form that is useful for a given task. One of the most common data sets that people tend to use for building (relatively weak) chatbots is the open-source [Cornell movie database](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), which contains 220,000 conversational exchanges between 10,300 pairs of movie characters. In addition to being open-source, it contains conversational data from different movies, different characters, and different genders, which makes for a fairly balanced dataset to train on. However, a major drawback is that the number of conversational exchanges is fairly limited.

In the era of deep learning, big data is really the fuel for our machine learning applications. As we are building a chatbot that utilizes deep learning, our model will necessarily be very data hungry, so the Cornell movie database discussed above is too small for our purposes. In search of a larger corpus of conversational exchanges to train our chatbot, we thus turn to [Reddit](https://www.reddit.com/), a popular collection of online forums where people share news, stories, and other types of content, and where users are encouraged to comment on and discuss everyone's posts. The popularity of Reddit is immense. It was measured that almost 1.69 billion users had accessed the site in March 2018 alone ([source](https://www.statista.com/statistics/443332/reddit-monthly-visitors/)).

To extract this conversational data from Reddit, we could have used the [Python Reddit API](https://praw.readthedocs.io/en/latest/), but it has some pretty strict limitations. Unfortunately, this API will not let you collect millions of rows without either taking a very long time or violating the Terms of Service. Instead, we can use a Reddit post that provides a [data dump of 1.7 Billion Reddit Comments](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?st=j9udbxta&sh=69e4fee7). This data is about 250 GB when compressed.

I will compare the performance of the chatbot on Reddit comment data of different sizes and recency, as follows:
1. A __small__ dataset of __1.4 GB__ collected from a __distant__ time: __January 2012__
2. A __medium__ sized dataset of __9.1 GB__ collected from a more __recent__ time, __June 2018__
3. A __large__ dataset of ~ __250 GB__, from between __December 2005 to June 2018__

This data was collected from this <a id='downloadc'></a> [website](http://files.pushshift.io/reddit/comments/), where the datafiles are available in compressed form - allowing for relatively fast download speeds.

# <a id='2'></a> Data Structure

If you open the following [sample Reddit post](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/?sort=top) and scroll down to the comments, you can see that the comments are arranged in a __tree-like structure__, with parent comments at the top, followed by child comments that respond to the parent above them. What we need to do is pull these threads of comments apart, and then pair them together in a parent-child (comment and reply) manner. This allows us to capture natural conversational exchanges between humans on the internet, which we can then provide as training data for our chatbot to mimic.

In this section we begin by __building our database__, which will store parent comments paired with their best child (reply) comments. We do this because a lot of these files are far too big for us to read into RAM and create training files from. For now, to keep things relatively simple, we will use __SQLite__ for our database.

The datafiles we are using are stored in the __JSON__ format. Because of this, there is a lot of unnecessary data within each of the files. When building our database, we will extract the following data from our JSON files:
* Comment score (karma)
* The comment body itself
* Subreddit
* The parent_id
* Time of creation (UTC)
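
To make the extraction concrete, here is a minimal sketch of pulling these fields out of a single line of a dump file. The sample line and its values are made up for illustration (real rows carry many more keys), and it also keeps the `name` key, which holds the comment's own id and is used later on.

```python
import json

# Hypothetical sample line; real rows in the dump carry many more fields.
sample_line = (
    '{"name": "t1_c3abc12", "parent_id": "t3_nzxyz", "score": 5, '
    '"body": "Hello\\nworld", "subreddit": "datasets", '
    '"created_utc": "1325376000", "gilded": 0}'
)

# Keys of the fields we want to keep; everything else is discarded
WANTED = ('name', 'parent_id', 'score', 'body', 'subreddit', 'created_utc')

row = json.loads(sample_line)
record = {key: row[key] for key in WANTED}
print(record)
```

Anything not listed in `WANTED` (here the `gilded` key) is simply dropped, which is what keeps the database much smaller than the raw files.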

Let's begin building the database using Python and SQLite!

In [None]:
import sqlite3  # For building our database
import json  # To parse our datafiles
from datetime import datetime

In [None]:
timeframe = '2012-01'  # Begin with our small dataset
sql_transaction = []  # Efficiently parse rows

In SQL, you ideally want to have a __big SQL transaction__ when possible. You don't want to (for example) handle millions of rows by inserting them one by one if you don't have to, because that can be incredibly inefficient. Instead, you want to build up a big transaction and then perform it all at once, as this will be much faster to execute.
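
As a rough illustration of the idea (not the code used in this notebook), the sketch below inserts 10,000 rows inside a single transaction. The `demo` table and the in-memory database are assumptions made purely for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # In-memory database, for illustration only
cur = conn.cursor()
cur.execute('CREATE TABLE demo(id INTEGER PRIMARY KEY, body TEXT)')

rows = [(i, 'comment {}'.format(i)) for i in range(10000)]

# One transaction for the whole batch: the connection context manager
# commits once at the end (or rolls back if an exception is raised).
with conn:
    conn.executemany('INSERT INTO demo(id, body) VALUES (?, ?)', rows)

inserted = cur.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
print(inserted)
```

Committing after every single row would force SQLite to finalize each write separately, which is the overhead that batching avoids.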

The code immediately below connects to the database, creating it if it doesn't already exist.

In [None]:
connection = sqlite3.connect('../data/{}.db'.format(timeframe))  # Connects to database
c = connection.cursor()  # Define cursor

Here, let's create a function that can help us store the parent_id, comment_id, the parent comment, the reply (comment), subreddit, the time, and then finally the score (votes) for the comments from the raw JSON files.

In [None]:
# Creates the parent_reply table that will hold the data extracted from the raw JSON files
def create_table():
    c.execute("""CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT 
    PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit
    TEXT, unix INT, score INT)""")

```python
# Creates the table, if it doesn't exist already:
if __name__ == '__main__':
    create_table()
```

Having discussed the structure of our data, and having created our database to store that data, we will begin working through the data.

# <a id='3'></a> Buffering Data

In this section we will begin iterating over our data files and store that information. 

Let us begin by extending the above ```if``` statement with some additional actions. We will use ```row_counter``` to record how far we are into the file that we are iterating through, and ```paired_rows``` to record how many rows of our data are parent-reply pairs, which are what we aim to use as our training data. Since the file is too large for us to hold in memory, we pass the ```buffering``` parameter to ```open``` so that the file is read in small chunks, which is fine since all we care about is one row at a time.
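
The following self-contained sketch shows the same reading pattern on a tiny temporary file, so it can be run without downloading anything. The three rows and their keys are invented stand-ins for a real dump file.

```python
import json
import os
import tempfile

# Write a tiny stand-in for a comment dump: one JSON object per line.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for i in range(3):
        tmp.write(json.dumps({'name': 't1_{}'.format(i), 'score': i}) + '\n')
    path = tmp.name

row_counter = 0
scores = []
# Iterating over the file object yields one line at a time; `buffering`
# only sets the size of the read buffer, not how many rows are returned.
with open(path, buffering=1000) as f:
    for line in f:
        row_counter += 1
        scores.append(json.loads(line)['score'])

os.remove(path)  # Clean up the temporary file
print(row_counter, scores)
```

Because only the current line is held at any moment, memory use stays roughly constant regardless of how large the file is.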

We will now read through this data, which is in the JSON format, row by row.

```python
if __name__ == '__main__':
    create_table()  # Creates the table, if it doesn't exist already
    row_counter = 0  # Record how many rows we iterate through in our data
    paired_rows = 0  # Record how many parent-child pairs we have found
    
    with open('../data/{}/RC_{}'.format(timeframe.split('-')[0], timeframe), 
              buffering=1000) as f:
        for row in f:  # Extract relevant feature data from rows of json file 
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']  
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            subreddit = row['subreddit']
```

Note the ```format_data``` function call above. This is used to normalize the comments and to convert the newline character to a word. Let's create that:

In [None]:
# Replace newlines with the token ' newlinechar ', and swap double quotes for
# single quotes so that comments can sit inside double-quoted SQL strings
def format_data(data):
    data = data.replace('\n', ' newlinechar ').replace('\r', ' newlinechar ').replace('"', "'")
    return data

We can read the data into a Python object using ```json.loads()```, which simply takes a string formatted like a json object. 

Initially, no comment will have a parent in our database. This could be because it is a top-level comment (and the parent is the Reddit post itself), or because the parent isn't in the document. However, as we go through the document we will find comments whose parents are already in our database. When this occurs, we want to store the comment together with its existing parent to capture the conversation. Once we've gone through the file(s), we'll take the database, output our comment-reply pairs as training data, train the model, and finally have our completed chatbot.
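
The pairing idea can be sketched without a database at all. In the simplified example below, a plain dictionary stands in for the SQLite table, and the three comments and their ids are hypothetical.

```python
# Hypothetical comments in file order: (comment_id, parent_id, body)
comments = [
    ('t1_a', 't3_post', 'What is SQLite?'),
    ('t1_b', 't1_a', 'A small embedded database.'),
    ('t1_c', 't1_zzz', 'Reply to a comment that is not in this file.'),
]

seen = {}   # comment_id -> body, standing in for the database
pairs = []  # (parent body, reply body)

for comment_id, parent_id, body in comments:
    parent_body = seen.get(parent_id)
    if parent_body is not None:
        # The parent was read earlier, so this is a usable exchange
        pairs.append((parent_body, body))
    seen[comment_id] = body  # Later comments may reply to this one

print(pairs)
```

Only the second comment produces a pair: the first replies to the post itself, and the parent of the third never appears in the data.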

Before we input our data to the database, we should see if we can find the parent first. We achieve this with the ```find_parent``` function:

In [None]:
# Finds the parent comment by the parent id (pid)
def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print('find_parent', e)
        return False
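
The function above builds its query with string formatting, which works for Reddit ids because they contain no quote characters. As a hedged alternative, the sketch below binds the id as a query parameter instead. The name `find_parent_param`, the trimmed-down table, and the sample row are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # Throwaway database for the demonstration
cur = conn.cursor()
cur.execute('CREATE TABLE parent_reply(parent_id TEXT PRIMARY KEY, '
            'comment_id TEXT UNIQUE, comment TEXT)')
cur.execute('INSERT INTO parent_reply VALUES (?, ?, ?)',
            ('t3_post', 't1_a', "It's a reply with 'quotes'"))


def find_parent_param(cursor, pid):
    """Hypothetical variant of find_parent that binds pid as a parameter."""
    cursor.execute(
        'SELECT comment FROM parent_reply WHERE comment_id = ? LIMIT 1', (pid,))
    result = cursor.fetchone()
    return result[0] if result is not None else False


print(find_parent_param(cur, 't1_a'))
print(find_parent_param(cur, "t1_' OR '1'='1"))
```

With a bound parameter, the second lookup is treated as an ordinary (non-matching) id rather than as part of the SQL statement.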

We can now improve our previous ```if``` statement by extending it with a call to this newly created ```find_parent``` function:

```python
if __name__ == '__main__':
    create_table()  # Creates the table, if it doesn't exist already
    row_counter = 0  # Record how many rows we iterate through in our data
    paired_rows = 0  # Record how many parent-child pairs we have found
    
    with open('../data/{}/RC_{}'.format(timeframe.split('-')[0], timeframe), 
              buffering=1000) as f:
        for row in f:  # Extract relevant feature data from rows of json file 
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']  
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)
```

# <a id='4'></a> Determining Insert

In this section we will begin building the logic required to determine whether or not to insert data, and how.

To begin, we aim to impose a restriction on _all_ comments, regardless of whether there are any other replies to the same parent. This is because we want to filter out unhelpful comments, and instead focus our training data on the best, most upvoted comments (measured by their comment score: karma). For this reason, we will only consider comments with a comment score of 2 or higher.

Here, we __require the comment score to be 2 or higher__, and also check whether there is already an existing reply to the parent and, if so, what its score is. The significance of a comment score of 2 is that some other user saw this comment on Reddit and decided to upvote it, signalling that it is useful to others. However, the value 2 is somewhat arbitrary: it is a threshold parameter that can be changed for different applications. For example, if we are dealing with a much larger dataset with many more comments, it may be helpful to set this threshold higher, for example ```score >= 15```. Regardless, for this application we will set it to 2 for now, and perhaps in future implementations experiment with different threshold values and compare the chatbot's performance.
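
As a simplified sketch of this selection logic, the snippet below applies the threshold and keeps only the highest-scoring reply per parent. A dictionary stands in for the database, and `MIN_SCORE` and the sample replies are hypothetical names and values chosen for the example.

```python
MIN_SCORE = 2  # Threshold from the text; could be raised for larger datasets

# Hypothetical replies: (parent_id, body, score)
replies = [
    ('t1_a', 'first reply', 3),
    ('t1_a', 'better reply', 10),
    ('t1_a', 'ignored reply', 1),
    ('t1_b', 'only reply', 2),
]

best = {}  # parent_id -> (body, score) of the best reply seen so far
for parent_id, body, score in replies:
    if score < MIN_SCORE:
        continue  # Below the threshold: skip regardless of other replies
    current = best.get(parent_id)
    if current is None or score > current[1]:
        best[parent_id] = (body, score)  # Insert, or replace a weaker reply

print(best)
```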

In addition, if there is an existing comment, and if our score is higher than the existing comment's score, we would like to replace it:

```python
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('../data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)

            if score >= 2:  # Impose threshold comment score
                existing_comment_score = find_existing_score(parent_id)
                if existing_comment_score:  
                    if score > existing_comment_score:
                        pass  # The replacement logic is added in the next section
```

Note this requires the ```find_existing_score``` function:

In [None]:
def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result is not None:
            return result[0]
        else:
            return False
    except Exception as e:
        #print(str(e))
        return False

It should be noted that on Reddit many comments are deleted or removed, and some comments are very long or very short. For our application, we want to make sure that comments are of an acceptable length for training, and that the comment wasn't removed or deleted. We achieve this using the following function:

In [None]:
# Checks if the comments are acceptable for training, i.e. they are of 
# sufficient length, and that they exist
def acceptable(data):
    if len(data.split(' ')) > 50 or len(data) < 1:  # Ignore long/short comments
        return False
    elif len(data) > 1000:  # Ignore long comments
        return False
    elif data == '[deleted]':  # Ignore deleted comments
        return False
    elif data == '[removed]':  # Ignore removed comments
        return False
    else:
        return True
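
As a quick sanity check of these rules, the snippet below runs a few made-up comments through the filter. The function is repeated in condensed form (same conditions) only so that the snippet runs on its own.

```python
def acceptable(data):
    # Same rules as the function above, repeated so this snippet runs alone
    if len(data.split(' ')) > 50 or len(data) < 1:
        return False
    elif len(data) > 1000:
        return False
    elif data in ('[deleted]', '[removed]'):
        return False
    return True


samples = {
    'ordinary': 'A short, ordinary reply.',
    'empty': '',
    'deleted': '[deleted]',
    'too_many_words': ' '.join(['word'] * 51),
    'too_many_chars': 'x' * 1001,
}

results = {name: acceptable(text) for name, text in samples.items()}
print(results)
```

Only the `ordinary` sample passes; each of the others trips exactly one of the rules.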

# <a id='5'></a> Building Database

Up to this point we have been working with our data and preparing the logic for how we want to insert it into a database. In this section, we will begin actually building the database.

First, we will extend our previous ```if``` statement with our newly created ```acceptable``` function. In addition, we will add code that uses SQL to replace the comment stored in the database if the score is above the threshold, the comment is acceptable, and the score is higher than that of the reply already stored for the same parent. We achieve this with the ```sql_insert_replace_comment``` function:

```python
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('../data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)

            if score >= 2:  # Impose threshold comment score
                if acceptable(body):  # Check if comment body is suitable 
                    existing_comment_score = find_existing_score(parent_id)
                    if existing_comment_score:  
                        if score > existing_comment_score:
                            sql_insert_replace_comment(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                    else:  # If there is no existing comment score
                        if parent_data:
                            sql_insert_has_parent(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                        else:
                            sql_insert_no_parent(comment_id, parent_id, body, subreddit, created_utc, score)
```

Notice here we have 3 additional function calls that do not exist yet. Let's create them:

In [None]:
# Overwrites the parent_id comment with the current comment, if it has a better score
def sql_insert_replace_comment(comment_id, parent_id, parent, comment, subreddit, created_utc, score):
    try:
        sql = """UPDATE parent_reply SET parent_id = "{}", comment_id = "{}", parent = "{}", comment = "{}", subreddit = "{}", unix = {}, score = {} WHERE parent_id = "{}";""".format(parent_id, comment_id, parent, comment, subreddit, int(created_utc), score, parent_id)
        transaction_bldr(sql)
    except Exception as e:
        print('s-UPDATE insertion',str(e))
        
        
# Inserts comment at the parent_id, if we had a comment body for that parent
def sql_insert_has_parent(comment_id, parent_id, parent, comment, subreddit, created_utc, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}","{}",{},{});""".format(parent_id, comment_id, parent, comment, subreddit, int(created_utc), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s-PARENT insertion',str(e))
        
        
# Inserts comment if there was no parent, but we still want the parent_id 
def sql_insert_no_parent(comment_id, parent_id, comment, subreddit, created_utc, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}",{},{});""".format(parent_id, comment_id, comment, subreddit, int(created_utc), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s-NO_PARENT insertion',str(e))

Finally, the last function we need to define to create this database is ```transaction_bldr```. It builds up insertion statements and commits them in groups, rather than one by one, which makes our code much more efficient:

In [None]:
# Efficiently builds up insertion statements, and commits them in groups
def transaction_bldr(sql):
    global sql_transaction
    sql_transaction.append(sql)
    if len(sql_transaction) > 1000:
        c.execute('BEGIN TRANSACTION')
        for s in sql_transaction: 
            try:
                c.execute(s)
            except Exception:  # Skip statements that fail to execute
                pass
        connection.commit()
        sql_transaction = []
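
One thing to be aware of is that the function above only commits once more than 1000 statements have been queued, so whatever is still queued when the file ends is never executed. A possible remedy, sketched here under the assumption of a hypothetical `flush_transaction` helper and a throwaway `demo` table, is to flush the remaining statements once after the main loop.

```python
import sqlite3

connection = sqlite3.connect(':memory:')  # Throwaway database for the demo
c = connection.cursor()
c.execute('CREATE TABLE demo(id INTEGER)')
connection.commit()

# Pretend these statements were still queued when the input file ended
sql_transaction = ['INSERT INTO demo VALUES ({})'.format(i) for i in range(5)]


def flush_transaction(cursor, conn, pending):
    """Hypothetical helper: run and commit whatever is still queued."""
    for statement in pending:
        try:
            cursor.execute(statement)
        except sqlite3.Error as e:
            print('flush', e)  # Report, then skip, statements that fail
    conn.commit()
    pending.clear()  # Empty the queue in place


flush_transaction(c, connection, sql_transaction)
remaining = c.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
print(remaining)
```

Calling such a helper once, right after the loop over the file has finished, would ensure the last partial batch reaches the database.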

Let us add a few lines of code that report our progress during the iteration, printing some information every 100,000 rows. Having done this, let's finally run our big ```if``` statement and build the database:

In [None]:
# Create the database using SQLite
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0

    with open('../data/{}/RC_{}'.format(timeframe.split('-')[0],timeframe), buffering=1000) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']
            parent_data = find_parent(parent_id)
            if score >= 2:
                existing_comment_score = find_existing_score(parent_id)
                if existing_comment_score:
                    if score > existing_comment_score:
                        if acceptable(body):
                            sql_insert_replace_comment(comment_id,parent_id,parent_data,body,subreddit,created_utc,score)
                else:
                    if acceptable(body):
                        if parent_data:
                            sql_insert_has_parent(comment_id,parent_id,parent_data,body,subreddit,created_utc,score)
                            paired_rows += 1
                        else:
                            sql_insert_no_parent(comment_id,parent_id,body,subreddit,created_utc,score)
                            
            if row_counter % 100000 == 0:  # Inform us our progress
                print('Total Rows Read: {}, Paired Rows: {}, Time: {}'.format(row_counter, paired_rows, str(datetime.now())))

Having finished constructing the database with SQLite, we can find it in the ```data``` directory, which is located one directory above this ```Documentation.ipynb```. The database file has the extension ```.db```. For those who are curious, it can be opened and viewed using the open-source tool [Database Browser for SQLite](https://sqlitebrowser.org/). I have decided not to commit this database to GitHub, as it is extremely large. However, running this IPython notebook yourself (after downloading the compressed files [discussed earlier](#downloadc)) will allow you to create the database for yourself, if you desire.
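
The database can also be inspected directly from Python. The sketch below counts the total and paired rows. To stay runnable on its own it uses an in-memory database with two made-up rows, so the connection string would need to be swapped for the real ```.db``` file.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # Swap in the path to the real .db file
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT
    PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit
    TEXT, unix INT, score INT)""")

# Two hypothetical rows: one paired with its parent, one without a known parent
cur.executemany(
    'INSERT INTO parent_reply VALUES (?, ?, ?, ?, ?, ?, ?)',
    [('t1_a', 't1_b', 'What is SQLite?', 'An embedded database.',
      'datasets', 1325376000, 4),
     ('t3_p', 't1_a', None, 'What is SQLite?', 'datasets', 1325375000, 7)])
conn.commit()

total = cur.execute('SELECT COUNT(*) FROM parent_reply').fetchone()[0]
paired = cur.execute(
    'SELECT COUNT(*) FROM parent_reply WHERE parent IS NOT NULL').fetchone()[0]
print('total rows:', total, 'paired rows:', paired)
```

Only rows with a non-null `parent` are usable as training pairs.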

# <a id='6'></a> Database to Training Data

Having created our database containing pairs of Reddit comments and replies, we will now generate training data from it. This training data will later be used to train our chatbot models.

The model we will be building with [TensorFlow](https://www.tensorflow.org/) is a type of [Recurrent Neural Network](https://www.tensorflow.org/tutorials/sequences/recurrent) (RNN) known as a [sequence-to-sequence](https://www.tensorflow.org/versions/r1.2/tutorials/seq2seq) model.

What we first want to do is create two files: a parent-comment file, and then a reply file - where each row corresponds to the comment in the opposite file. We create these files as follows:

In [None]:
import sqlite3
import pandas as pd

In [16]:
# Grabs comment pairs from database and append them to their respective training files 
timeframes = ['2012-01']  # Begin with just our small dataset

for timeframe in timeframes:
    connection = sqlite3.connect('../data/{}.db'.format(timeframe))
    c = connection.cursor()
    limit = 5000  # Determines how much we'll pull at a time into our pandas DF
    last_unix = 0  # Helps us buffer through the database
    cur_length = limit  
    counter = 0
    test_done = False
    
    # Continue pulling data into our dataframe, while cur_length == limit
    while cur_length == limit:
        df = pd.read_sql("SELECT * FROM parent_reply WHERE unix > {} AND parent NOT NULL AND score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit), connection)
        cur_length = len(df)
        if cur_length == 0:  # No rows left; avoids indexing an empty dataframe
            break
        last_unix = df.tail(1)['unix'].values[0]
        
        if not test_done:
            with open('../data/test.from', 'a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content+'\n')
            with open('../data/test.to', 'a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(content+'\n')
            test_done = True
        else:
            with open('../data/train.from','a', encoding='utf8') as f:
                for content in df['parent'].values:
                    f.write(content+'\n')

            with open('../data/train.to','a', encoding='utf8') as f:
                for content in df['comment'].values:
                    f.write(str(content)+'\n')
                    
        counter += 1  # Track progress
        if counter % 20 == 0:
            print(counter*limit,'rows completed so far')

100000 rows completed so far
200000 rows completed so far
300000 rows completed so far
400000 rows completed so far
500000 rows completed so far
600000 rows completed so far
700000 rows completed so far
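
Since the two files are only useful if line *i* of one corresponds to line *i* of the other, a small check like the one sketched below can be reassuring. It writes a pair of demo files to a temporary directory and reads them back in lockstep. The file names and the two sample pairs are made up for the example.

```python
import os
import tempfile

# Hypothetical pairs standing in for the (parent, comment) columns
pairs = [('How are you?', 'Fine, thanks.'),
         ('What is NMT?', 'Neural machine translation.')]

out_dir = tempfile.mkdtemp()
from_path = os.path.join(out_dir, 'demo.from')
to_path = os.path.join(out_dir, 'demo.to')

# Write each parent and its reply to the same line number of the two files
with open(from_path, 'a', encoding='utf8') as f_from, \
        open(to_path, 'a', encoding='utf8') as f_to:
    for parent, reply in pairs:
        f_from.write(parent + '\n')
        f_to.write(reply + '\n')

# Read both files back in lockstep: line i of one answers line i of the other
with open(from_path, encoding='utf8') as f_from, \
        open(to_path, encoding='utf8') as f_to:
    recovered = [(p.rstrip('\n'), r.rstrip('\n')) for p, r in zip(f_from, f_to)]

print(recovered == pairs)
```

This alignment relies on every comment occupying exactly one line, which is why newlines inside comments were replaced with a placeholder word earlier.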


Having created our training data from our database, we will now use it to train a model.

# <a id='7'></a> Training a Model

There are endless models that we could choose from and adapt to our needs. In this case, we will be using [sequence-to-sequence models](https://www.tensorflow.org/versions/r1.2/tutorials/seq2seq), since they can be used for a wide variety of applications (not just chatbots). The versatility of these models stems from the perspective that, in general, many tasks can be reduced to mapping one sequence to another (for a chatbot, a comment to its reply). This is of interest because the knowledge gained in this project can be applied to train many other models for different types of applications in the future. However, before we get ahead of ourselves, let us return to the scope of this project: building a chatbot.

TensorFlow offers an excellent [Neural Machine Translation (NMT) tutorial](https://github.com/tensorflow/nmt) that uses the latest version of [seq2seq](https://www.tensorflow.org/versions/r1.2/tutorials/seq2seq). Here, we will be building upon the set of utilities built on top of [TensorFlow's NMT code](https://github.com/tensorflow/nmt), by following another project by [Daniel Kukiela](https://github.com/daniel-kukiela): [NMT Chatbot](https://github.com/daniel-kukiela/nmt-chatbot).

Since my laptop is not powerful enough to train complex models in a reasonable time, I will instead be using [Paperspace](https://www.paperspace.com) as a cloud computing service to train the models. Other cloud computing services exist, such as [Amazon Web Services (AWS)](https://aws.amazon.com/), [Google Cloud Platform (GCP)](https://cloud.google.com/), and [Microsoft Azure](https://azure.microsoft.com/en-us/), and any of them could be used instead.

Once you are in your Virtual Machine (VM) environment (or if you choose to train the chatbot locally), run the following lines of code in a Bash terminal to start training a sample chatbot:

1. ```$ git clone --recursive https://github.com/daniel-kukiela/nmt-chatbot```
(or)
```$ git clone --branch v0.1 --recursive https://github.com/daniel-kukiela/nmt-chatbot.git``` (for a version featured in Sentdex tutorial)
2. ```$ cd nmt-chatbot```
3. ```$ pip install -r requirements.txt``` TensorFlow-GPU is one of the requirements. You also need CUDA Toolkit 8.0 and cuDNN 6.1. (Windows tutorial: https://www.youtube.com/watch?v=r7-WPbx8VuY Linux tutorial: https://pythonprogramming.net/how-to-cuda-gpu-tensorflow-deep-learning-tutorial/)
4. ```$ cd setup```
5. (optional) edit settings.py to your liking. These are a decent starting point for ~4GB of VRAM, you should first start by trying to raise vocab if you can.
6. (optional) Edit text files containing rules in the setup directory.
7. Place training data inside "new_data" folder (train.(from|to), tst2012.(from|to), tst2013.(from|to)). The repository provides some sample data for those who just want to do a quick test drive.
8. ```$ python prepare_data.py``` ...Run setup/prepare_data.py - a new folder called "data" will be created with prepared training data
9. ```$ cd ../```
10. ```$ python train.py``` Begin training

# NMT Concepts and Parameters

NMT stands for ___Neural Machine Translation___.

https://www.youtube.com/watch?v=gFxiQXnt9w4&list=PLQVvvaa0QuDdc2k5dwtDTyT9aCja0on8j&index=8