# Chatbot - NLP 2021L
#### Authors:
#### <i>Mateusz Marciniewicz</i>
#### <i>Przemysław Bedełek</i>

In [1]:
import numpy as np
import pandas as pd

## Alexa Topical-Chat

Topical-Chat is a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Link to the dataset https://github.com/alexa/Topical-Chat

In [2]:
df_topical = pd.read_csv("Datasets/topical_chat.csv")

In [3]:
df_topical["message"].head(15)

0                 Are you a fan of Google or Microsoft?
1      Both are excellent technology they are helpfu...
2      I'm not  a huge fan of Google, but I use it a...
3      Google provides online related services and p...
4      Yeah, their services are good. I'm just not a...
5      Google is leading the alphabet subsidiary and...
6      Did you know Google had hundreds of live goat...
7      It is very interesting. Google provide "Chrom...
8      I like Google Chrome. Do you use it as well f...
9      Yes.Google is the biggest search engine and G...
10                       By the way, do you like Fish? 
11     Yes. They form a sister group of tourniquets-...
12     Did you know that a seahorse is the only fish...
13     Freshwater fish only drink water through the ...
14     Interesting, they also have gills. Did you kn...
Name: message, dtype: object
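
Since the partners in Topical-Chat have no fixed roles, one simple way to obtain training examples is to pair each message with the one that follows it in the same conversation. The sketch below illustrates this on toy rows. The helper `consecutive_pairs` and the idea of a conversation identifier per row are assumptions for illustration; only the `message` column is shown above.

```python
from itertools import groupby

def consecutive_pairs(rows):
    """Turn (conversation_id, message) rows into (prompt, reply) pairs.

    Assumes rows of the same conversation are adjacent and in speaking order.
    """
    pairs = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        messages = [message.strip() for _, message in group]
        # Pair every message with the one that follows it in the same conversation.
        pairs.extend(zip(messages, messages[1:]))
    return pairs

# Toy rows; the conversation ids are hypothetical.
toy_rows = [
    (1, "Do you like tea?"),
    (1, "Yes, mostly green tea."),
    (1, "Same here."),
    (2, "Have you seen that film?"),
    (2, "Not yet."),
]
toy_pairs = consecutive_pairs(toy_rows)
```

If the CSV does have a conversation identifier column (hypothetically named `conversation_id`), the same function could be fed with `zip(df_topical["conversation_id"], df_topical["message"])`.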

## Human-robot text dataset

The dataset contains 2363 pairs of lines of text exchanged between a human and a robot.

Link to the dataset https://github.com/jackfrost1411/Generative-chatbot

In [4]:

import re
import random
data_path = "Datasets/human_text.txt"
data_path2 = "Datasets/robot_text.txt"
# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')
with open(data_path2, 'r', encoding='utf-8') as f:
  lines2 = f.read().split('\n')
lines = [re.sub(r"\[\w+\]",'hi',line) for line in lines]
lines = [" ".join(re.findall(r"\w+",line)) for line in lines]
lines2 = [re.sub(r"\[\w+\]",'',line) for line in lines2]
lines2 = [" ".join(re.findall(r"\w+",line)) for line in lines2]
# Grouping lines by response pair
pairs = list(zip(lines,lines2))

In [5]:
import numpy as np

input_docs = []
target_docs = []
input_tokens = set()
target_tokens = set()
for line in pairs[:400]:
  input_doc, target_doc = line[0], line[1]
  # Appending each input sentence to input_docs
  input_docs.append(input_doc)
  # Splitting words from punctuation  
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  # Redefine target_doc below and append it to target_docs
  target_doc = '<START> ' + target_doc + ' <END>'
  target_docs.append(target_doc)
  
  # Now we split up each sentence into words and add each unique word to our vocabulary set
  for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
    if token not in input_tokens:
      input_tokens.add(token)
  for token in target_doc.split():
    if token not in target_tokens:
      target_tokens.add(token)
input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())
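
The forward and reverse dictionaries built above map tokens to integer indices and back. As a minimal sketch of how such lookups might be used, the example below rebuilds the same kind of dictionaries on two toy sentences and round-trips a sentence through them. The helpers `encode` and `decode` are hypothetical names, not part of the notebook.

```python
import re

TOKEN_PATTERN = r"[\w']+|[^\s\w]"  # same pattern as in the cell above

toy_docs = ["hi there how are you", "here is afternoon"]

# Build a sorted vocabulary and the two lookup tables.
vocab = sorted({tok for doc in toy_docs for tok in re.findall(TOKEN_PATTERN, doc)})
features = {token: i for i, token in enumerate(vocab)}
reverse_features = {i: token for token, i in features.items()}

def encode(doc, features):
    """Map a sentence to a list of token indices, skipping unknown tokens."""
    return [features[tok] for tok in re.findall(TOKEN_PATTERN, doc) if tok in features]

def decode(indices, reverse_features):
    """Map a list of token indices back to a space-joined sentence."""
    return " ".join(reverse_features[i] for i in indices)

encoded = encode("how are you", features)
decoded = decode(encoded, reverse_features)
```

Silently skipping unknown tokens is a design choice made only to keep the sketch short; a real model would more likely reserve a dedicated unknown-token index.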

In [6]:
pairs

[('hi', 'hi there how are you'),
 ('oh thanks i m fine this is an evening in my timezone', 'here is afternoon'),
 ('how do you feel today tell me something about yourself',
  'my name is rdany but you can call me dany the r means robot i hope we can be virtual friends'),
 ('how many virtual friends have you got',
  'i have many but not enough to fully understand humans beings'),
 ('is that forbidden for you to tell the exact number',
  'i ve talked with 143 users counting 7294 lines of text'),
 ('oh i thought the numbers were much higher how do you estimate your progress in understanding human beings',
  'i started chatting just a few days ago every day i learn something new but there is always more things to be learn'),
 ('how old are you how do you look like where do you live',
  'i m 22 years old i m skinny with brown hair yellow eyes and a big smile i live inside a lab do you like bunnies'),
 ('have you seen a human with yellow eyes you asked about the bunnies i haven t seen any re

## Cornell Movie Dialogue Dataset

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.

The preprocessing code is taken from https://www.kaggle.com/shashankasubrahmanya/preprocessing-cornell-movie-dialogue-corpus/

Link to the dataset https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html


In [7]:
import pandas as pd
import pickle as pkl
import random

# Which tokenizer to use? TweetTokenizer is more robust than the vanilla tokenizer, but then,
# will the intelligence of tokenization matter in the long run when trained using DL?
from nltk.tokenize import word_tokenize, TweetTokenizer
tokenizer = TweetTokenizer(preserve_case = False)





### Create a list of dialogues

We join two files, `movie_lines.txt` and `movie_conversations.txt`, to produce a list of dialogues. This list is then stored as a `pickle` file for later processing.


In [8]:

movie_lines_features = ["LineID", "Character", "Movie", "Name", "Line"]
# The separator is a regular expression, so it is written as a raw string
movie_lines = pd.read_csv("Datasets/movie-dialogue/movie_lines.txt", sep = r"\+\+\+\$\+\+\+", engine = "python", index_col = False, names = movie_lines_features)

# Using only the required columns, namely, "LineID" and "Line"
movie_lines = movie_lines[["LineID", "Line"]].copy()

# Strip the surrounding whitespace from "LineID" so it can be matched against the IDs in the conversations
movie_lines["LineID"] = movie_lines["LineID"].apply(str.strip)



In [9]:
movie_lines.head()

  LineID          Line
0  L1045  They do not!
1  L1044   They do to!
2   L985    I hope so.
3   L984     She okay?
4   L925     Let's go.


In [10]:
movie_conversations_features = ["Character1", "Character2", "Movie", "Conversation"]
movie_conversations = pd.read_csv("Datasets/movie-dialogue/movie_conversations.txt", sep = r"\+\+\+\$\+\+\+", engine = "python", index_col = False, names = movie_conversations_features)

# Again using the required feature, "Conversation"
movie_conversations = movie_conversations["Conversation"]

In [11]:
movie_conversations.head()

0     ['L194', 'L195', 'L196', 'L197']
1                     ['L198', 'L199']
2     ['L200', 'L201', 'L202', 'L203']
3             ['L204', 'L205', 'L206']
4                     ['L207', 'L208']
Name: Conversation, dtype: object

In [12]:
# This instruction takes a lot of time because it scans movie_lines once per line ID; run it only once.
#conversation = [[str(list(movie_lines.loc[movie_lines["LineID"] == u.strip().strip("'"), "Line"])[0]).strip() for u in c.strip().strip('[').strip(']').split(',')] for c in movie_conversations]

#with open("./conversations.pkl", "wb") as handle:
 #   pkl.dump(conversation, handle)
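
The join is slow because every line ID triggers a full scan of `movie_lines`. A possible alternative, sketched here under the assumption that each conversation is stored as a Python-style list literal such as `['L194', 'L195']`, is to build a dictionary from line ID to text once and then look each ID up directly. The function name `build_conversations` and the toy data are hypothetical.

```python
import ast

def build_conversations(line_pairs, conversation_strings):
    """Resolve lists of line IDs into lists of utterances.

    line_pairs: iterable of (line_id, text) tuples.
    conversation_strings: iterable of strings like "['L194', 'L195']".
    """
    # Map each line ID to its text once, so every later lookup is a dict access.
    id_to_line = {line_id.strip(): str(text).strip() for line_id, text in line_pairs}
    conversations = []
    for raw in conversation_strings:
        # Parse the list literal safely instead of splitting on commas by hand.
        line_ids = ast.literal_eval(raw.strip())
        conversations.append([id_to_line[line_id] for line_id in line_ids])
    return conversations

# Toy data mimicking the whitespace left over by the " +++$+++ " separator.
toy_lines = [
    ("L194 ", " Can we make this quick?"),
    ("L195 ", " Well, okay."),
    ("L198 ", " What's your name again?"),
    ("L199 ", " Forget it."),
]
toy_conversation_strings = [" ['L194', 'L195']", " ['L198', 'L199']"]
toy_conversations = build_conversations(toy_lines, toy_conversation_strings)
```

With the DataFrames above, the call would presumably look like `build_conversations(zip(movie_lines["LineID"], movie_lines["Line"]), movie_conversations)`. A line ID missing from `movie_lines` raises a `KeyError` in this sketch.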

### Create context and response pairs

In [13]:
with open("./conversations.pkl", "rb") as handle:
    conversation = pkl.load(handle)

In [14]:
# Calculate the dialogue length statistics

dialogue_lengths = [len(dialogue) for dialogue in conversation]
pd.Series(dialogue_lengths).describe()

count    83097.000000
mean         3.666955
std          2.891798
min          2.000000
25%          2.000000
50%          3.000000
75%          4.000000
max         89.000000
dtype: float64

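As the statistics above show, the mean dialogue length is about 3.7 utterances and 75% of the dialogues have at most 4, so most dialogues are short enough that we can take the last utterance as the response and everything before it as the context. We have yet to figure out how to handle the few much longer dialogues (up to 89 utterances).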
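
One possible way to handle longer dialogues, offered as a sketch rather than the approach used in this notebook, is a sliding window: every utterance after the first becomes a response, and its context is limited to the few utterances just before it. The function `windowed_pairs` and its `max_context_turns` parameter are hypothetical.

```python
def windowed_pairs(dialogue, max_context_turns=3):
    """Generate (context, response) pairs from one dialogue.

    Each utterance except the first is used as a response; its context is
    the preceding utterances, truncated to the last `max_context_turns`.
    """
    pairs = []
    for end in range(1, len(dialogue)):
        start = max(0, end - max_context_turns)
        pairs.append((dialogue[start:end], dialogue[end]))
    return pairs

toy_dialogue = ["A1", "B1", "A2", "B2", "A3"]
toy_window_pairs = windowed_pairs(toy_dialogue, max_context_turns=2)
```

This yields one pair per utterance after the first, so a dialogue of 89 utterances would contribute 88 bounded-length pairs instead of a single pair with a very long context.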

In [15]:
# Generate 50 random sample pairs
indices = random.sample(range(len(conversation)), 50)
sample_context_list = []
sample_response_list = []

for index in indices:
    
    response = conversation[index][-1]
        
    context = "FS: " + conversation[index][0] + "\n"
    for i in range(1, len(conversation[index]) - 1):
        
        if i % 2 == 0:
            prefix = "FS: "
        else:
            prefix = "SS: "
            
        context += prefix + conversation[index][i] + "\n"
        
    sample_context_list.append(context)
    sample_response_list.append(response)

with open("cornell_movie_dialogue_sample.csv", "w") as handle:
    for c, r in zip(sample_context_list, sample_response_list):
        handle.write('"' + c + '"' + "#" + r + "\n")

In [16]:
def generate_pairs(conversation):
    
    context_list = []
    response_list = []
    
    for dialogue in conversation:
        
        response = word_tokenize(dialogue[-1])
        
        context = word_tokenize(dialogue[0])
        for index in range(1, len(dialogue) - 1):
            context += word_tokenize(dialogue[index])
        
        context_list.append(context)
        response_list.append(response)
    return context_list, response_list

In [17]:
context_list, response_list = generate_pairs(conversation)

In [18]:
conversation

[['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
  "Well, I thought we'd start with pronunciation, if that's okay with you.",
  'Not the hacking and gagging and spitting part.  Please.',
  "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?"],
 ["You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'],
 ["No, no, it's my fault -- we didn't have a proper introduction ---",
  'Cameron.',
  "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.",
  'Seems like she could get a date easy enough...'],
 ['Why?',
  'Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.',
  "That's a shame."],
 ['Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.'],
 ["C'esc ma tete. This is my head"

## Santa Barbara Corpus of Spoken American English

The Santa Barbara Corpus of Spoken American English is based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more.

Link to the dataset https://www.linguistics.ucsb.edu/research/santa-barbara-corpus

In [19]:
from os import listdir
from os.path import isfile, join
trn_files = [f for f in listdir("Datasets/TRN") if isfile(join("Datasets/TRN", f))]



In [20]:
sbc_files = []
for file in trn_files:
    with open ("Datasets/TRN/"+file, "r") as myfile:
        sbc_files.append(myfile.readlines())

In [21]:
sbc_files[1]

['0.00 6.52\tJAMIE:  \tHow [can you teach a three-year-old to] ta=p [2dance2].\n',
 "4.43 5.78\tHAROLD: \t    [I can't imagine teaching a] --\n",
 '6.08 6.35\t        \t                                             [2@Yeah2],\n',
 '6.35 6.73\t        \treally.\n',
 '6.73 8.16\tJAMIE:  \t... (H)=\n',
 '8.16 9.56\tMILES:  \t... Who suggested this to em.\n',
 '9.56 10.41\tHAROLD: \tI have no idea.\n',
 "10.41 13.06\t        \tIt was probably my= .. sister-in-law's idea because,\n",
 '13.06 15.01\t        \t... I think they saw= ... that movie.\n',
 '15.01 16.43\tJAMIE:  \t... Tap?\n',
 '16.43 16.98\t        \t[X] [2X2] --\n',
 '16.50 17.00\tHAROLD: \t[What] [2was the2],\n',
 '16.60 17.00\tMILES:  \t    [2<X They had X>2] --\n',
 '17.00 19.10\tHAROLD: \tthe movie with that .. really hot tap danc[er].\n',
 '19.00 19.75\tJAMIE:  \t                                          [Oh] that ki=d.\n',
 '19.75 21.82\tMILES:  \t... He was actually here two weeks ago,\n',
 '21.82 22.57\t        \tand [I m

In [22]:
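# Keep only the text after the first tab (speaker and utterance) and collapse whitespace
sbc_files_new = [[re.sub(r"\s+", " ", line.partition("\t")[2]) for line in file] for file in sbc_files]
# Replace every run of non-letter characters (brackets, digits, punctuation) with a single space
sbc_files_new = [[re.sub(r"[^A-Za-z]+", " ", line) for line in file] for file in sbc_files_new]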
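
The cleaning above keeps the speaker name inside the text and leaves continuation lines without a speaker. As a hedged alternative sketch, the transcript lines could instead be parsed into structured turns, carrying the last seen speaker forward onto continuation lines. It assumes the three tab-separated fields visible in the sample output (times, speaker, text); the function `parse_trn` is a hypothetical name, and files that deviate from this layout would need extra handling.

```python
def parse_trn(lines):
    """Parse transcript lines into (start, end, speaker, text) tuples."""
    turns = []
    speaker = None
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 3:
            continue  # skip lines that do not follow the three-field layout
        times, name, text = fields[0], fields[1].strip(), fields[2].strip()
        if name:
            # A new speaker label such as "JAMIE:" starts a new turn.
            speaker = name.rstrip(":")
        start, end = (float(t) for t in times.split())
        turns.append((start, end, speaker, text))
    return turns

toy_trn = [
    "0.00 6.52\tJAMIE:  \tHow can you teach a three-year-old to tap dance.\n",
    "4.43 5.78\tHAROLD: \tI can't imagine teaching a --\n",
    "6.35 6.73\t        \treally.\n",
]
toy_turns = parse_trn(toy_trn)
```

Keeping the speaker as a separate field makes it possible to group consecutive lines of one speaker into a single utterance before building context and response pairs.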

In [23]:
sbc_files_new[1]

['JAMIE How can you teach a three year old to ta p dance ',
 'HAROLD I can t imagine teaching a ',
 ' Yeah ',
 ' really ',
 'JAMIE H ',
 'MILES Who suggested this to em ',
 'HAROLD I have no idea ',
 ' It was probably my sister in law s idea because ',
 ' I think they saw that movie ',
 'JAMIE Tap ',
 ' X X ',
 'HAROLD What was the ',
 'MILES X They had X ',
 'HAROLD the movie with that really hot tap danc er ',
 'JAMIE Oh that ki d ',
 'MILES He was actually here two weeks ago ',
 ' and I missed him ',
 'JAMIE at the at the ja zz t ap thing or whatever ',
 'HAROLD Was he a little kid ',
 'MILES No he s sixteen now ',
 'JAMIE H No he s like ',
 ' Yeah he s a teenager ',
 ' but he teaches these classes in New York ',
 'MILES X That X boy ',
 ' he s supposed to be awe some ',
 'JAMIE Yeah ',
 ' Really fa st ',
 'PETE Hm ',
 'HAROLD But I m sure that was the the impetus ',
 'MILES Have you seen him ',
 'JAMIE No ',
 ' I just read an article on him ',
 'MILES You you probably read the same