# Chatbot - NLP 2021L
#### Authors:
#### <i>Mateusz Marciniewicz</i>
#### <i>Przemysław Bedełek</i>

## Human-robot text dataset

The dataset contains 2363 pairs of lines of text exchanged between a human and a robot.

Link to the dataset: https://github.com/jackfrost1411/Generative-chatbot

In [3]:
import re

data_path = "Datasets/human_text.txt"
data_path2 = "Datasets/robot_text.txt"

# Read each file into a list of lines, one utterance per line
with open(data_path, 'r', encoding='utf-8') as f:
  contexts = f.read().split('\n')
  # Replace bracketed tokens with 'hi', then keep only word characters
  contexts = [re.sub(r"\[\w+\]", 'hi', line) for line in contexts]
  contexts = [" ".join(re.findall(r"\w+", line)) for line in contexts]

with open(data_path2, 'r', encoding='utf-8') as f:
  responses = f.read().split('\n')
  # Drop bracketed tokens from the robot side, then keep only word characters
  responses = [re.sub(r"\[\w+\]", '', line) for line in responses]
  responses = [" ".join(re.findall(r"\w+", line)) for line in responses]
  
# sample context-response pairs
list(zip(contexts, responses))[:10]

[('hi', 'hi there how are you'),
 ('oh thanks i m fine this is an evening in my timezone', 'here is afternoon'),
 ('how do you feel today tell me something about yourself',
  'my name is rdany but you can call me dany the r means robot i hope we can be virtual friends'),
 ('how many virtual friends have you got',
  'i have many but not enough to fully understand humans beings'),
 ('is that forbidden for you to tell the exact number',
  'i ve talked with 143 users counting 7294 lines of text'),
 ('oh i thought the numbers were much higher how do you estimate your progress in understanding human beings',
  'i started chatting just a few days ago every day i learn something new but there is always more things to be learn'),
 ('how old are you how do you look like where do you live',
  'i m 22 years old i m skinny with brown hair yellow eyes and a big smile i live inside a lab do you like bunnies'),
 ('have you seen a human with yellow eyes you asked about the bunnies i haven t seen any re
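
The two regular expressions used above can be tried on a single line in isolation. The sketch below is a minimal illustration; the helper name `normalize` and the bracketed token `[name]` are hypothetical and only stand in for whatever bracketed tokens the dataset contains.

```python
import re

def normalize(line, placeholder_replacement=""):
    """Replace bracketed tokens such as [name], then keep only word characters."""
    # Step 1: substitute every [token] with the given replacement string
    line = re.sub(r"\[\w+\]", placeholder_replacement, line)
    # Step 2: keep runs of word characters and join them with single spaces
    return " ".join(re.findall(r"\w+", line))

sample = "Oh, thanks! I'm fine. [name]"  # invented example line
print(normalize(sample, "hi"))  # human side: bracketed tokens become 'hi'
print(normalize(sample))        # robot side: bracketed tokens are dropped
```

Note that punctuation is removed and apostrophes split words (`I'm` becomes `I m`), but letter case is left unchanged by these two steps.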

## Alexa Topical-Chat

Topical-Chat is a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Link to the dataset: https://github.com/alexa/Topical-Chat

In [4]:
import pandas as pd

df_topical = pd\
    .read_csv("Datasets/topical_chat.csv")[['conversation_id', 'message']]\
    .rename(columns={
        'conversation_id': 'id',
        'message': 'response'
        })

context = df_topical\
    .groupby("id")\
    .first()\
    .rename(columns={'response': 'context'})\
    .reset_index()

df_topical = df_topical[~df_topical.isin(context)]

topical_preprocessed = df_topical\
    .set_index('id')\
    .join(context.set_index('id'))\
    .reset_index()[['context', 'response']]

topical_preprocessed.sample(n=10)

| | context | response |
|---|---|---|
| 96212 | Hi! I've often wondered if there ever existed... | Yeah, that is a large number! I need to check... |
| 166106 | How's it going, did you know there are over 5... | How's it going, did you know there are over 5... |
| 93654 | Hello there do you have a favorite album? | They early games were sure primitive by today... |
| 28991 | Hello there and good day. Have you heard abou... | I like the way T.S. donated the proceeds of W... |
| 36133 | Hello, do you know the details of box office? | Well, there are some stars that aren't hot li... |
| 51050 | Are you a parent? having a 5 year old is diff... | Yeah specially if you have twins or more kids... |
| 186558 | good morning. How are you. | The huge cathode tube can be replaced by the ... |
| 16131 | Hey there, how are you doing? Are you a hock... | Okay that is crazy! I heard that an average... |
| 71167 | Hi there! Do you play video games? | Same here. I go all the way back to the orig... |
| 133616 | Hi there! Are you a fan of rap music? | That would be neat to see! In 2001, a Michig... |
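
In the sample above, row 166106 has the same text as both context and response, which suggests that the opening message of each conversation is still present among the responses. `DataFrame.isin` called with another DataFrame only matches values where both the index and the column labels agree, so it does not filter rows by content. A minimal sketch of one alternative, on invented toy data, uses `groupby(...).cumcount()` to drop each conversation's first message before joining. It is an illustration under those assumptions, not the notebook's original method, and applying it would change the pair counts reported in this notebook.

```python
import pandas as pd

# Toy stand-in for the Topical-Chat frame (values are invented for illustration)
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "response": ["hi", "hello", "how are you", "hey", "yo"],
})

# First message of each conversation, used as the context for that conversation
context = (
    df.groupby("id", sort=False)["response"]
    .first()
    .rename("context")
    .reset_index()
)

# cumcount() numbers rows within each group from 0, so "> 0" drops every opening message
replies = df[df.groupby("id").cumcount() > 0]

# Attach the conversation's opening message to each remaining reply
pairs = replies.merge(context, on="id")[["context", "response"]]
print(pairs)
```

With this filter no row can pair an opening message with itself, because the opening message never appears on the response side.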


In [5]:
contexts += list(topical_preprocessed.context)
responses += list(topical_preprocessed.response)

print(f"Total pairs count: {len(contexts)}")

Total pairs count: 190741


## Cornell Movie Dialogue Dataset

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.

The preprocessing code is taken from https://www.kaggle.com/shashankasubrahmanya/preprocessing-cornell-movie-dialogue-corpus/

Link to the dataset: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

### Create a list of dialogues

We join two files, `movie_lines.txt` and `movie_conversations.txt`, to produce a list of dialogues. The list is then saved as a `pickle` file for later processing.

In [39]:
movie_lines_features = ["LineID", "Character", "Movie", "Name", "Line"]
# Decoding this file as UTF-8 raises
# "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3767",
# so a single-byte encoding is passed explicitly (ISO-8859-1 is an assumption about the file).
movie_lines = pd.read_csv(
    "Datasets/movie-dialogue/movie_lines.txt",
    sep = r"\+\+\+\$\+\+\+",
    engine = "python",
    index_col = False,
    names = movie_lines_features,
    encoding = "iso-8859-1",
)

# Keep only the required columns, "LineID" and "Line" (copy to avoid chained-assignment warnings)
movie_lines = movie_lines[["LineID", "Line"]].copy()

# Strip the surrounding whitespace from "LineID" so it can be used as a lookup key
movie_lines["LineID"] = movie_lines["LineID"].apply(str.strip)

movie_lines.head()

In [44]:
movie_conversations_features = ["Character1", "Character2", "Movie", "Conversation"]
movie_conversations = pd.read_csv(
    "Datasets/movie-dialogue/movie_conversations.txt",
    sep = r"\+\+\+\$\+\+\+",
    engine = "python", 
    index_col = False, 
    names = movie_conversations_features
)

# Again using the required feature, "Conversation"
movie_conversations = movie_conversations["Conversation"]

movie_conversations.head()

0     ['L194', 'L195', 'L196', 'L197']
1                     ['L198', 'L199']
2     ['L200', 'L201', 'L202', 'L203']
3             ['L204', 'L205', 'L206']
4                     ['L207', 'L208']
Name: Conversation, dtype: object

In [12]:
# This statement takes a long time, so run it only once (it requires `import pickle as pkl`).
#conversation = [[str(list(movie_lines.loc[movie_lines["LineID"] == u.strip().strip("'"), "Line"])[0]).strip() for u in c.strip().strip('[').strip(']').split(',')] for c in movie_conversations]

#with open("./conversations.pkl", "wb") as handle:
#    pkl.dump(conversation, handle)
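
The commented-out statement scans the whole `movie_lines` frame once per line ID, which is likely why it is slow. A minimal sketch of a faster variant is shown below, assuming each `Conversation` value is the string form of a Python list of IDs (as in the output above). The toy frames and the name `line_by_id` are invented for illustration and are not part of the original preprocessing code.

```python
import ast

import pandas as pd

# Toy stand-ins for the two frames built above (contents invented for illustration)
movie_lines = pd.DataFrame({
    "LineID": ["L1", "L2", "L3"],
    "Line": [" Hi. ", " Hello. ", " Bye. "],
})
movie_conversations = pd.Series([" ['L1', 'L2']", " ['L2', 'L3']"])

# Build the LineID -> text mapping once, so each lookup is a dictionary access
line_by_id = {
    line_id: str(text).strip()
    for line_id, text in zip(movie_lines["LineID"], movie_lines["Line"])
}

# Parse each conversation string into a list of IDs and resolve them to text
conversation = [
    [line_by_id[line_id] for line_id in ast.literal_eval(c.strip())]
    for c in movie_conversations
]
print(conversation)
```

The resulting list has the same shape as the one the original statement builds, so it could be pickled in the same way.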

### Create context and response pairs

In [6]:
import pickle as pkl

with open("./conversations.pkl", "rb") as handle:
    conversation = pkl.load(handle)
    conversation = list(filter(lambda dialogue: len(dialogue) == 2, conversation))

conversation[:10]    

[["You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'],
 ['Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.'],
 ['How is our little Find the Wench A Date plan progressing?',
  "Well, there's someone I think might be --"],
 ['There.', 'Where?'],
 ['You got something on your mind?',
  "I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?"],
 ['You have my word.  As a gentleman', "You're sweet."],
 ['How do you get your hair to look like that?',
  "Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment."],
 ['Hi.', 'Looks like things worked out tonight, huh?'],
 ['You know Chastity?', 'I believe we share an art instructor'],
 ['Have fun tonight?', 'Tons']]

In [7]:
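def generate_pairs(dialogues):
    """Split two-line dialogues into parallel context and response lists."""
    context_list = []
    response_list = []

    for dialogue in dialogues:
        # The first line is the context, the second line is the response
        context_list.append(dialogue[0])
        response_list.append(dialogue[1])

    return context_list, response_list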

context_list, response_list = generate_pairs(conversation)

list(zip(context_list, response_list))[:10]

[("You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'),
 ('Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.'),
 ('How is our little Find the Wench A Date plan progressing?',
  "Well, there's someone I think might be --"),
 ('There.', 'Where?'),
 ('You got something on your mind?',
  "I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?"),
 ('You have my word.  As a gentleman', "You're sweet."),
 ('How do you get your hair to look like that?',
  "Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment."),
 ('Hi.', 'Looks like things worked out tonight, huh?'),
 ('You know Chastity?', 'I believe we share an art instructor'),
 ('Have fun tonight?', 'Tons')]

In [8]:
# Merge datasets
contexts += context_list
responses += response_list

In [9]:
print(f"Total pairs count: {len(contexts)}")



Total pairs count: 228832
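
Because the merged lists come from three differently formatted sources, a quick sanity check before training may be useful. The sketch below assumes that empty strings (for example from a trailing newline when splitting a file on `'\n'`) and non-string values (for example missing values in a pandas column) are unwanted. The helper `clean_pairs` is hypothetical and not part of the original notebook.

```python
def clean_pairs(contexts, responses):
    """Drop pairs where either side is not a non-empty string."""
    if len(contexts) != len(responses):
        raise ValueError("contexts and responses must have the same length")
    kept = [
        (c, r)
        for c, r in zip(contexts, responses)
        # Keep a pair only if both sides are strings with non-whitespace content
        if isinstance(c, str) and isinstance(r, str) and c.strip() and r.strip()
    ]
    cleaned_contexts = [c for c, _ in kept]
    cleaned_responses = [r for _, r in kept]
    return cleaned_contexts, cleaned_responses

# Invented toy data: the second and third pairs each have an empty side
toy_contexts = ["hi", "", "how are you"]
toy_responses = ["hello", "nothing", "   "]
print(clean_pairs(toy_contexts, toy_responses))
```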
