## Building a Chatbot: PyTorch

Note: This code is adapted from [https://pytorch.org/tutorials/beginner/chatbot_tutorial.html](https://pytorch.org/tutorials/beginner/chatbot_tutorial.html). We walk through it in more detail and inspect the output at each step to build a deeper understanding.

In [4]:
import csv
import random
import re
import os
import unicodedata
import codecs
import itertools
import torch

In [6]:
# Check whether CUDA is available
CUDA = torch.cuda.is_available()
device = torch.device("cuda" if CUDA else "cpu")

### Part 1: Data Preprocessing 

In [12]:
lines_filepath = os.path.join("cornell movie-dialogs corpus", "movie_lines.txt")
conv_filepath = os.path.join("cornell movie-dialogs corpus", "movie_conversations.txt")

In [13]:
## Visualizing and inspecting some lines
with open(lines_filepath, 'r', encoding='iso-8859-1') as file:
    lines = file.readlines()
for line in lines[:8]:
    print(line.strip())

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No


In [17]:
## Total number of utterances
print("Total number of utterances: {}".format(len(lines)))

Total number of utterances: 304713


In [22]:
# Splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
line_fields = ["lineID", "characterID", "movieID", "character", "text"]
lines = {}
with open(lines_filepath, 'r', encoding='iso-8859-1') as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        lineObj = {}
        for i, field in enumerate(line_fields):
            lineObj[field] = values[i]
            
        lines[lineObj['lineID']] = lineObj
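
To make the parsing step concrete, the sketch below applies the same split to one sample line taken from the output shown earlier. The helper name `parse_line` and the constant names are hypothetical and not part of the original notebook. The sketch also strips the trailing newline, which the loop above leaves in the `text` field.

```python
# Hypothetical helper (not in the original notebook) illustrating the field split.
LINE_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
DELIMITER = " +++$+++ "

def parse_line(raw_line):
    """Split one raw corpus line into a dict keyed by field name."""
    # Drop the trailing newline before splitting so it does not end up in "text"
    values = raw_line.rstrip("\n").split(DELIMITER)
    return dict(zip(LINE_FIELDS, values))

# Sample line copied from the inspected output of movie_lines.txt
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
parsed = parse_line(sample)
print(parsed["lineID"], parsed["character"], parsed["text"])
```

The resulting dictionary has the same keys as each `lineObj` built in the loop above.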

Next, we process the movie conversations dataset.

In [32]:
## Visualizing and inspecting *movie_conversations.txt*
with open(conv_filepath, 'r', encoding='iso-8859-1') as file:
    x = file.readlines()
for y in x[:8]:
    print(y.strip())

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']


In [40]:
# Groups the line fields parsed above into conversations based on *movie_conversations.txt*
conv_fields = ["character1ID", "character2ID", "movieID", "utteranceIDs"]
conversations = []
with open(conv_filepath, 'r', encoding='iso-8859-1') as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        convObj = {}
        for i, field in enumerate(conv_fields):
            convObj[field] = values[i]
        # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
        lineIds = eval(convObj["utteranceIDs"])
        # Reassemble lines
        convObj["lines"] = []
        for lineId in lineIds:
            convObj["lines"].append(lines[lineId])
        conversations.append(convObj)
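
The cell above converts the utterance ID string into a list with `eval`, which executes whatever expression the file contains. As a possible alternative (a suggestion, not what the original tutorial uses), the standard-library function `ast.literal_eval` parses only Python literals. A minimal sketch on one field value from the output shown earlier:

```python
import ast

# Same string format as the last field of a line in movie_conversations.txt
raw_ids = "['L194', 'L195', 'L196', 'L197']\n"

# literal_eval accepts only literals (lists, strings, numbers, ...),
# so it cannot run arbitrary code embedded in the data file
line_ids = ast.literal_eval(raw_ids.strip())
print(line_ids)
```

The result is a regular Python list of line IDs, which can be looked up in the `lines` dictionary exactly as before.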

For example, let's look at one element of the `conversations` list to become more familiar with its structure.

In [49]:
## conversations is a list of dictionaries
assert isinstance(conversations, list)
assert isinstance(conversations[0], dict)

print(conversations[0])

{'character1ID': 'u0', 'character2ID': 'u2', 'movieID': 'm0', 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n", 'lines': [{'lineID': 'L194', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'}, {'lineID': 'L195', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"}, {'lineID': 'L196', 'characterID': 'u0', 'movieID': 'm0', 'character': 'BIANCA', 'text': 'Not the hacking and gagging and spitting part.  Please.\n'}, {'lineID': 'L197', 'characterID': 'u2', 'movieID': 'm0', 'character': 'CAMERON', 'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]}


In [50]:
# Extracts pairs of sentences from the conversation
qa_pairs = []
for conversation in conversations:
    # Iterate over all the lines of the conversation
    for i in range(len(conversation["lines"]) - 1):
        inputLine = conversation["lines"][i]["text"].strip()
        targetLine = conversation["lines"][i+1]["text"].strip()
        # Filter out incomplete samples (if either string is empty)
        if inputLine and targetLine:
            qa_pairs.append([inputLine, targetLine])
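
To see how the pairing behaves, the sketch below runs the same logic on a small made-up conversation. The function name `extract_pairs` and the toy utterances are illustrative assumptions and do not come from the corpus.

```python
def extract_pairs(texts):
    """Pair each utterance with the one that follows it, skipping empty ones."""
    pairs = []
    # zip(texts, texts[1:]) yields (line i, line i+1) for every consecutive pair
    for current, following in zip(texts, texts[1:]):
        current, following = current.strip(), following.strip()
        if current and following:
            pairs.append([current, following])
    return pairs

# Toy conversation (made up for illustration); the fourth utterance is blank
toy_conversation = ["Hi.\n", "Hello.\n", "How are you?\n", "  \n", "Bye.\n"]
toy_pairs = extract_pairs(toy_conversation)
print(toy_pairs)
```

A conversation with n non-empty utterances yields n - 1 pairs, and every pair that touches a blank utterance is dropped.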

In [66]:
print("Number of conversation pairs in the dataset: {}".format(len(qa_pairs)))

Number of conversation pairs in the dataset: 221282


Save the `qa_pairs` dataset to disk so that we don't need to repeat the preprocessing steps each time we work with this dataset.

In [69]:
# Define path to new file
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in qa_pairs:
        writer.writerow(pair)

print("Done writing into the file")


Writing newly formatted file...
Done writing into the file


A text file named "formatted_movie_lines.txt" has now been saved in the "cornell movie-dialogs corpus" folder. It contains the sentence pairs, and we will load and work with it in the subsequent notebooks.
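
As a rough sketch of how such a file could be read back later, the example below writes a few made-up pairs with the same `csv.writer` settings and loads them again with `csv.reader`. The temporary directory and the toy pairs are assumptions for illustration only. The subsequent notebooks may load the file differently.

```python
import csv
import os
import tempfile

# Made-up pairs for illustration; one contains quotes to exercise CSV quoting
toy_pairs = [["Hi.", "Hello."], ['She said "no".', "Okay."]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "formatted_movie_lines.txt")

    # Write with the same delimiter and line terminator as the cell above
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(toy_pairs)

    # Read back with the matching delimiter; each row is [input, target]
    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded_pairs = [row for row in csv.reader(f, delimiter="\t")]

print(loaded_pairs)
```

Reading with `csv.reader` rather than a plain `split('\t')` keeps the round trip symmetric, because `csv.writer` quotes any field that contains a tab or a double quote.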