# Building the dataset

In this notebook, I combine three joke datasets into the single dataset that the chatbot will be trained on.

In [1]:
import pandas as pd

In [2]:
files_path = 'D:/Sarcastic Chatbot/Input/'

# First Dataset
**The Wordball Joke Dataset**, [link](https://www.kaggle.com/bfinan/jokes-question-and-answer/).

This dataset consists of three files, namely:
1. <i>qajokes1.1.2.csv</i>: with <i>75,114</i> pairs.
2. <i>t_lightbulbs.csv</i>: with <i>2,640</i> pairs.
3. <i>t_nosubject.csv</i>: with <i>32,120</i> pairs.

However, I'm not going to incorporate <i>t_lightbulbs.csv</i> in my dataset because I don't want that many examples of one topic. Besides, all the examples are similar in structure (they all start with <i>how many</i>).

Read the data files into pandas dataframes:

In [3]:
wordball_qajokes = pd.read_csv(files_path + 'qajokes1.1.2.csv', usecols=['Question', 'Answer'])
wordball_nosubj = pd.read_csv(files_path + 't_nosubject.csv', usecols=['Question', 'Answer'])

In [4]:
print(len(wordball_qajokes))
print(len(wordball_nosubj))

75114
32120


In [5]:
wordball_qajokes.head()

Unnamed: 0,Question,Answer
0,What's the best anti diarrheal prescription?,Mycheexarphlexin
1,What do you call a person who is outside a doo...,Matt
2,Which Star Trek character is a member of the m...,Jean-Luc Pickacard
3,What's the difference between a bullet and a h...,A bullet doesn't miss Harambe
4,Why was the Ethiopian baby crying?,He was having a mid-life crisis


In [6]:
wordball_nosubj.head()

Unnamed: 0,Question,Answer
0,Did you hear about the Native American man tha...,He nearly drown in his own tea pee.
1,Did you hear about the oyster who went to the ...,He pulled a muscle
2,Shall I tell you the joke about the kidnappers?,I'd better not. You might get carried away.
3,Do you like fish sticks?,"Well then, you're a gay fish."
4,Want to hear a joke about UDP?,"Never mind. you won't get it, and I won't care"


Concatenate both dataframes into one:

In [7]:
wordball = pd.concat([wordball_qajokes, wordball_nosubj], ignore_index=True)
wordball.head()

Unnamed: 0,Question,Answer
0,What's the best anti diarrheal prescription?,Mycheexarphlexin
1,What do you call a person who is outside a doo...,Matt
2,Which Star Trek character is a member of the m...,Jean-Luc Pickacard
3,What's the difference between a bullet and a h...,A bullet doesn't miss Harambe
4,Why was the Ethiopian baby crying?,He was having a mid-life crisis


In [8]:
print(f"Number of question-answer pairs in the Wordball dataset: {len(wordball)}")

Number of question-answer pairs in the Wordball dataset: 107234

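To see what `ignore_index=True` does in the concatenation above, here is a small sketch on two made-up frames (`a` and `b` are illustrative, not part of the datasets). Without it, the row labels of the second frame restart at 0, which would make label-based lookups such as `df.loc[index]` ambiguous later on.

```python
import pandas as pd

# two tiny, made-up frames standing in for the two Wordball files
a = pd.DataFrame({'Question': ['q1', 'q2'], 'Answer': ['a1', 'a2']})
b = pd.DataFrame({'Question': ['q3'], 'Answer': ['a3']})

kept = pd.concat([a, b])                      # original row labels are kept
reset = pd.concat([a, b], ignore_index=True)  # rows are relabelled 0..n-1

print(list(kept.index))   # [0, 1, 0]
print(list(reset.index))  # [0, 1, 2]
```

With the relabelled index, every row label is unique, so a lookup by label returns exactly one question-answer pair.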

## Text Preprocessing

It turns out that not all cells are of type string, so we apply the *str* function to every cell to make sure they all have the desired type.

In [9]:
wordball = wordball.applymap(str)

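A quick way to confirm that a column mixes types is to count the Python type of every cell. The sketch below does this on a made-up frame (`toy` is hypothetical); running the same dictionary comprehension on `wordball` before the cast would show which types actually occur there.

```python
import pandas as pd

# made-up frame with numbers mixed in with the strings
toy = pd.DataFrame({'Question': ['Why?', 42, 'Who?'],
                    'Answer': ['Because.', 'x', 3.5]})

# for every column, count how many cells there are of each Python type
type_counts = {
    col: toy[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in toy.columns
}
print(type_counts)
```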
Let's look at the characters used in this dataset:

In [10]:
def distinct_chars(data, cols):
    """
    This function takes in a pandas dataframe and prints all distinct characters.
    data: a pandas dataframe.
    cols: a Python list, representing names of columns for questions and answers. First item of the list should be the name 
    of the questions column and the second item should be the name of the column corresponding to answers.
    If cols is None, the first two columns of the dataframe are used.
    """
    
    if cols is None:
        cols = list(data.columns)
    
    # join all questions into one string
    questions = ' '.join(data[cols[0]])
    # join all answers into one string
    answers = ' '.join(data[cols[1]])
    
    # get distinct characters used in the data (all questions and answers)
    dis_chars = set(questions+answers)
    
    # print the distinct characters that are used in the data
    print(f"Number of distinct characters used in the dataset: {len(dis_chars)}")
    # print(dis_chars)    
    dis_chars = list(dis_chars)
    
    # Now let's print those characters in an organized way
    digits = [char for char in dis_chars if char.isdigit()]
    alphabets = [char for char in dis_chars if char.isalpha()]
    special = [char for char in dis_chars if not (char.isdigit() or char.isalpha())]
    # sort them to make them easier to read
    digits = sorted(digits)
    alphabets = sorted(alphabets)
    special = sorted(special)
    
    print(f"Digits: {digits}")
    print(f"Alphabets: {alphabets}")
    print(f"Special characters: {special}")

In [11]:
distinct_chars(wordball, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 120
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Alphabets: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ß', 'è', 'é', 'ñ', 'ó', 'ö', 'ü']
Special characters: [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\xa0', '¡', '¤', '«', '°', '»', '¿', '\u200b', '–', '—', '‘', '’', '“', '”', '…', '™', '\ufeff', '🎺']

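Beyond listing the distinct characters, it can help to know how often each one occurs, since a character that appears only a handful of times is arguably an easy candidate for removal. A minimal sketch using `collections.Counter` on made-up strings (`samples` is illustrative, not taken from the dataset):

```python
from collections import Counter

# made-up stand-ins for a few question/answer strings
samples = ["What's up?", "Not much…", "¿Qué?"]

# count every character across all strings
char_counts = Counter(''.join(samples))

# characters outside plain ASCII, with their frequencies
non_ascii = {ch: n for ch, n in char_counts.items() if ord(ch) > 127}
print(non_ascii)
```

On the real data, `''.join(samples)` would be replaced by the joined question and answer columns, as in `distinct_chars`.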

The following function replaces some characters with others, removes unwanted characters and gets rid of extra whitespace in the data.

In [12]:
def clean_text(text):
    """
    This method takes a string, applies different text preprocessing (characters replacement, removal of unwanted characters, 
    removal of extra whitespaces) operations and returns a string.
    text: a string.
    """
    import re
    
    text = str(text)
    
    # REPLACEMENT
    # replace " with ' (because they basically mean the same thing)
    # text = text.replace('\"','\'')
    text = re.sub('\"', '\'', text)
    # replace “ and ” with '
    # text = text.replace("“",'\'').replace("”",'\'')
    text = re.sub("“", '\'', text)
    text = re.sub("”", '\'', text)
    # replace ’ with '
    # text = text.replace('’','\'')
    text = re.sub('’', '\'', text)
    # replace [] and {} with ()
    #text = text.replace('[','(').replace(']',')').replace('{','(').replace('}',')')
    # raw strings avoid invalid-escape warnings for patterns such as \[
    text = re.sub(r'\[', '(', text)
    text = re.sub(r'\]', ')', text)
    text = re.sub(r'\{', '(', text)
    text = re.sub(r'\}', ')', text)
    # replace ? with itself and a whitespace preceding it
    # ex. what's your name? (we want the word name and question mark to be separate tokens)
    # text = re.sub('\?', ' ?', text)
    # creating a space between a word and the punctuation following it
    # punctuation we're using: . , : ; ' ? ! + - * / = % $ @ & ( )
    text = re.sub(r"([?.!,:;'+\-*/=%$@&()])", r" \1 ", text)
    
    
    # REMOVAL OF UNWANTED CHARACTERS
    # accept only alphanumeric and some special characters and remove all others
    # a-zA-Z0-9_ : matches any ASCII letter, any digit and the underscore.
    # \. : matches .
    # \, : matches ,
    # \: : matches :
    # \; : matches ;
    # \' : matches '
    # \? : matches ?
    # \! : matches !
    # \+ : matches +
    # \- : matches -
    # \* : matches *
    # \/ : matches /
    # \= : matches =
    # \% : matches %
    # \$ : matches $
    # \@ : matches @
    # \& : matches &
    # \( and \) : match ( and )
    # ^ is added to the beginning of the set to express that we want the regex to recognize all other characters except
    # these that are explicitly specified, so that we can omit them.
    # define the pattern
    pattern = re.compile(r"[^a-zA-Z0-9_\.\,\:\;\'\?\!\+\-\*\/\=\%\$\@\&\(\)]")
    # remove unwanted characters
    text = re.sub(pattern, ' ', text)
    
    # lower case the characters in the string
    text = text.lower()
    
    # REMOVAL OF EXTRA WHITESPACES
    # remove duplicated spaces
    text = re.sub(' +', ' ', text)
    # remove leading and trailing spaces
    text = text.strip()
    
    return text

Let's try it out:

In [13]:
clean_text("A nice quote I read    today: “Everything that you are going through is preparing you for what you asked for”. @hi % & =+-*/")

"a nice quote i read today : ' everything that you are going through is preparing you for what you asked for ' . @ hi % & = + - * /"

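To isolate the tokenisation step of `clean_text`, the sketch below applies only the punctuation-spacing substitution and the whitespace clean-up to a made-up sentence. The helper name `space_punctuation` is hypothetical; it uses the same set of punctuation characters as the function above and skips the replacement, removal and lowercasing steps.

```python
import re

def space_punctuation(text):
    # surround each listed punctuation character with spaces so it becomes its own token
    text = re.sub(r"([?.!,:;'+\-*/=%$@&()])", r" \1 ", text)
    # collapse the runs of spaces this creates and trim both ends
    return re.sub(' +', ' ', text).strip()

print(space_punctuation("What's up, doc?"))  # What ' s up , doc ?
```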
The following function prints a question-answer pair from the dataset. It will help give us a sense of what the *clean_text* function does:

In [14]:
def print_question_answer(df, index, cols):
    print(f"Question: ({index})")
    print(df.loc[index][cols[0]])
    print(f"Answer: ({index})")
    print(df.loc[index][cols[1]])

In [15]:
print("Before applying text preprocessing:")
print_question_answer(wordball, 102, ['Question', 'Answer'])
print_question_answer(wordball, 200, ['Question', 'Answer'])
print_question_answer(wordball, 88376, ['Question', 'Answer'])
print_question_answer(wordball, 94351, ['Question', 'Answer'])

Before applying text preprocessing:
Question: (102)
What's 11 & 2?
Answer: (102)
The Cowboys
Question: (200)
What did the girlfriend say to her boyfriend that was bitten by a zombie?
Answer: (200)
You're dead to me"
Question: (88376)
I think my husband is psychic! "Honey, what do you think of this outfit
Answer: (88376)
" {from other room} "You look great!"
Question: (94351)
{Thomas Edison prank call} Is your refrigerator running
Answer: (94351)
 "Yes.." YOU'RE WELCOME! *click*


Apply text preprocessing (characters replacement, removal of unwanted characters, removal of extra whitespaces):

In [16]:
wordball = wordball.applymap(clean_text)

In [17]:
print("After applying text preprocessing:")
print_question_answer(wordball, 102, ['Question', 'Answer'])
print_question_answer(wordball, 200, ['Question', 'Answer'])
print_question_answer(wordball, 88376, ['Question', 'Answer'])
print_question_answer(wordball, 94351, ['Question', 'Answer'])

After applying text preprocessing:
Question: (102)
what ' s 11 & 2 ?
Answer: (102)
the cowboys
Question: (200)
what did the girlfriend say to her boyfriend that was bitten by a zombie ?
Answer: (200)
you ' re dead to me '
Question: (88376)
i think my husband is psychic ! ' honey , what do you think of this outfit
Answer: (88376)
' ( from other room ) ' you look great ! '
Question: (94351)
( thomas edison prank call ) is your refrigerator running
Answer: (94351)
' yes . . ' you ' re welcome ! * click *


The following function applies some preprocessing operations on the data, concretely:
1. Drops unnecessary duplicate pairs (rows), keeping only one instance of each. *(For example, if the dataset contains three copies of the same question-answer pair, two of them are removed and one is kept.)*
2. Drops rows with an empty question or answer. *(These may appear because of the text cleaning above or because they are empty in the original dataset.)*
3. Drops rows with more than 30 words in either the question or the answer, or whose answer has fewer than two characters. *(Note: the 30-word limit is a hyperparameter and you can try other values.)*

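The three steps can also be written with vectorised pandas operations. The sketch below is not the function used in this notebook, just an illustration on a made-up frame (`toy` and `max_words` are assumed names) of what each step removes:

```python
import pandas as pd

# made-up, already-cleaned pairs: one duplicate, one empty question, one 1-character answer
toy = pd.DataFrame({
    'Question': ['why ?', 'why ?', 'who ?', '', 'what ?'],
    'Answer':   ['because', 'because', 'x', 'nobody', 'that'],
})
max_words = 30  # word limit per question/answer (a tunable choice)

# (1) keep the first copy of each duplicated pair
step1 = toy.drop_duplicates(keep='first')
# (2) drop rows with an empty question or answer
step2 = step1[(step1['Question'] != '') & (step1['Answer'] != '')]
# (3) drop rows that are too long or whose answer has fewer than two characters
q_words = step2['Question'].str.split(' ').str.len()
a_words = step2['Answer'].str.split(' ').str.len()
step3 = step2[(q_words <= max_words) & (a_words <= max_words) & (step2['Answer'].str.len() > 1)]

print(len(toy), len(step1), len(step2), len(step3))  # 5 4 3 2
```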
In [18]:
def preprocess_data(data, cols):
    """
    This function preprocesses the data and does the following:
    1. drops unnecessary duplicate pairs.
    2. drops rows with empty strings.
    3. drops rows with more than 30 words in either the question or the answer, 
    or if the answer has fewer than two characters.
    Arguments:
        data: a pandas dataframe.
        cols: a Python list, representing names of columns for questions and answers. First item of the list should be the name 
        of the questions column and the second item should be the name of the column corresponding to answers.
    Returns:
        a pandas dataframe.
    """
    
    
    # (1) Remove unnecessary duplicate pairs, keeping one instance of each.
    print('Removing unecessary duplicate pairs:')
    data_len_before = len(data) # len of data before removing duplicates
    print(f"# of examples before removing duplicates: {data_len_before}")
    # drop duplicates
    data = data.drop_duplicates(keep='first')
    data_len_after = len(data) # len of data after removing duplicates
    print(f"# of examples after removing duplicates: {data_len_after}")
    print(f"# of removed duplicates: {data_len_before-data_len_after}")
    
    
    # (2) Drop rows with empty strings.
    print('Removing empty string rows:')
    if cols is None:
        cols = list(data.columns)
        
    data_len_before = len(data) # len of data before removing empty strings
    print(f"# of examples before removing rows with empty question/answers: {data_len_before}")
    # I am going to use boolean masking to filter out rows with an empty question or answer
    data = data[(data[cols[0]] != '') & (data[cols[1]] != '')]
    # also, the following row results in the same as the above.
    # data = data.query('Answer != "" and Question != ""')
    data_len_after = len(data) # len of data after removing empty strings
    print(f"# of examples after removing with empty question/answers: {data_len_after}")
    print(f"# of removed empty string rows: {data_len_before-data_len_after}")
    
    
    # (3) Drop rows with more than 30 words in either the question or the answer
    # or if the answer has fewer than two characters.
    def accepted_length(qa_pair):
        # look the columns up by name; positional access such as qa_pair[0] on a
        # labelled Series is deprecated in recent pandas versions
        q_len = len(qa_pair[cols[0]].split(' '))
        a_len = len(qa_pair[cols[1]].split(' '))
        return (q_len <= 30) and (a_len <= 30) and (len(qa_pair[cols[1]]) > 1)
    
    print('Removing rows with more than 30 words in either the question or the answer:')
    data_len_before = len(data) # len of data before dropping those rows (30+ words)
    print(f"# of examples before removing rows with more than 30 words: {data_len_before}")
    # filter out rows with more than 30 words
    accepted_mask = data.apply(accepted_length, axis=1)
    data = data[accepted_mask]
    data_len_after = len(data) # len of data after dropping those rows (30+ words)
    print(f"# of examples after removing rows with more than 30 words: {data_len_after}")
    print(f"# of removed empty rows with more than 30 words: {data_len_before-data_len_after}")
    
    print("Data preprocessing is done.")
    
    return data

In [19]:
wordball = preprocess_data(wordball, ['Question', 'Answer'])

Removing unecessary duplicate pairs:
# of examples before removing duplicates: 107234
# of examples after removing duplicates: 107144
# of removed duplicates: 90
Removing empty string rows:
# of examples before removing rows with empty question/answers: 107144
# of examples after removing with empty question/answers: 107054
# of removed empty string rows: 90
Removing rows with more than 30 words in either the question or the answer:
# of examples before removing rows with more than 30 words: 107054
# of examples after removing rows with more than 30 words: 101712
# of removed empty rows with more than 30 words: 5342
Data preprocessing is done.


In [20]:
print(f"# of question-answer pairs we have left in the Wordball dataset: {len(wordball)}")

# of question-answer pairs we have left in the Wordball dataset: 101712


Let's look at the characters after cleaning the data:

In [21]:
distinct_chars(wordball, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 56
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Alphabets: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Special characters: [' ', '!', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '=', '?', '@', '_']


# Second Dataset
**reddit /r/Jokes**, [here](https://www.kaggle.com/cuddlefish/reddit-rjokes#jokes_score_name_clean.csv).

This dataset consists of two files, namely:
1. <i>jokes_score_name_clean.csv</i>: with <i>133,992</i> pairs.
2. <i>all_jokes.csv</i>

However, I'm not going to incorporate <i>all_jokes.csv</i> in the dataset because it's so messy.

In [22]:
reddit_jokes = pd.read_csv(files_path + 'jokes_score_name_clean.csv', usecols=['q', 'a'])

Let's rename the columns to have them aligned with the previous dataset:

In [23]:
reddit_jokes.rename(columns={'q':'Question', 'a':'Answer'}, inplace=True)

In [24]:
reddit_jokes.head()

Unnamed: 0,Question,Answer
0,I enjoy working in a slaughterhouse..,Everything is so cut and dry.
1,What do you call a soldier who survives Mustar...,A seasoned veteran.
2,I really like white dwarf stars...,...My favorite is Peter Dinklage.
3,Knock knock. Whose their?,The grammar police.
4,What breaks when you give it to a twelve year ...,Her hips.


In [25]:
print(len(reddit_jokes))

133328


In [26]:
distinct_chars(reddit_jokes, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 567
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '²', '³', '¹', '₂', '₄']
Alphabets: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'µ', 'º', 'Ä', 'Ñ', 'Ö', 'ß', 'à', 'á', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'þ', 'ā', 'ē', 'ě', 'ı', 'ń', 'ō', 'œ', 'ƃ', 'Ɔ', 'ǎ', 'ǐ', 'ǒ', 'ǚ', 'ǝ', 'ɐ', 'ɑ', 'ɔ', 'ə', 'ɟ', 'ɡ', 'ɥ', 'ɪ', 'ɯ', 'ɴ', 'ɹ', 'ɾ', 'ʇ', 'ʌ', 'ʍ', 'ʎ', 'ʏ', 'ʖ', 'ʘ', 'ʞ', 'ʟ', 'ʰ', 'ʲ', 'ʳ', 'ʷ', 'ʸ', 'ˈ', 'ˢ', 'Δ', 'Π', 'Σ', 'ί', 'α', 'κ', 'λ', 'μ', 'ν', 'π', 'ρ', 'ω', 'ϱ', 'А', 'Д', 'К', 'Т', 'а', 'е', 'л', 'м', 'о', 'т', 'ш', 'я', 'Ԁ', 'א', 'ב', 'ג', 'ה', 'ו', 'ז', 'ח', 'ט', 'י', 'ך', 'כ', 'ל', 'ם', 'ן', 'נ', 'ע', 'פ',

## Text Preprocessing

In [27]:
reddit_jokes = reddit_jokes.applymap(str)

Reddit data has some special tags like <i>[removed]</i> or <i>[deleted]</i> (these two mean that the comment has been removed/deleted). Also, they're written in an inconsistent way, i.e. you may find the tag <i>[removed]</i> capitalized or lowercased.<br>
The next function will address reddit tags as follows:
1. Drops rows whose question or answer is a deleted, removed or censored tag.
2. Replaces other tags found in the text with a whitespace. *(e.g. some comments have tags like <i>[censored], [gaming], [long], [request] and [dirty]</i> and we want to omit these tags from the text)*

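The two rules can be tried on single strings first. This is a minimal sketch, with hypothetical helper names (`is_dropped`, `strip_tags`), of the case-insensitive whole-field check and the non-greedy bracket pattern:

```python
import re

# whole-field tags that cause a row to be dropped
DROPPED_TAGS = {'[removed]', '[deleted]', '[censored]'}
# any square-bracketed tag, matched non-greedily
TAG_PATTERN = re.compile(r"\[(.*?)\]")

def is_dropped(text):
    # case-insensitive comparison of the whole field against the dropped tags
    return text.lower() in DROPPED_TAGS

def strip_tags(text):
    # replace each bracketed tag with a single space
    return TAG_PATTERN.sub(' ', text)

print(is_dropped('[Deleted]'))             # True
print(repr(strip_tags('[Corny] Yello?')))  # '  Yello?'
```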
In [28]:
def clean_reddit_tags(data, cols):
    """
    This function removes reddit-related tags from the data and does the following:
    1. drops rows with deleted, removed or censored tags.
    2. replaces other tags found in text with a whitespace. 
    Arguments:
        data: a pandas dataframe.
        cols: a Python list, representing names of columns for questions and answers. First item of the list should be the name 
        of the questions column and the second item should be the name of the column corresponding to answers.
    Returns:
        a pandas dataframe.
    """
    
    import re
    
    if cols is None:
        cols = list(data.columns)
    
    # First, I'm going to lowercase all the text to address these tags 
    # however, I'm not going to alter the original dataframe because I don't want text to be lowercased.
    data_copy = data.copy()
    data_copy[cols[0]] = data_copy[cols[0]].str.lower()
    data_copy[cols[1]] = data_copy[cols[1]].str.lower()
    
    # drop rows with deleted, removed or censored tags.
    dropped_tags = ['[removed]', '[deleted]', '[censored]']
    # mask is True for the rows to keep, i.e. rows where neither the question
    # nor the answer is one of the tags above
    mask = ~(data_copy[cols[0]].isin(dropped_tags) | data_copy[cols[1]].isin(dropped_tags))
    # filter the original dataframe 'data' with the mask computed on the lowercased copy,
    # so the rows we keep retain their original casing
    data = data[mask]
    # mask is True for the rows we keep, so the dropped rows are its complement
    print(f"# of rows dropped with [deleted], [removed] or [censored] tags: {(~mask).sum()}")
    
    # replaces other tags found in text with a whitespace. 
    def sub_tag(pair):
        """
        This function substitutes tags (square brackets with words inside) with whitespace.
        Arguments:
            pair: a Pandas Series, where the first item is the question and the second is the answer.
        Returns:
            pair: a Pandas Series.
        """
        # \[(.*?)\] is a regex to recognize square brackets [] with anything in between
        p = re.compile(r"\[(.*?)\]")
        # look the columns up by name; positional access (pair[0]) is deprecated in recent pandas
        pair[cols[0]] = re.sub(p, ' ', pair[cols[0]])
        pair[cols[1]] = re.sub(p, ' ', pair[cols[1]])
        
        return pair
    
    # substitute tags with whitespaces.
    data = data.apply(sub_tag, axis=1)
    
    return data

In [29]:
print("Before addressing tags:")
print_question_answer(reddit_jokes, 1825, ['Question', 'Answer'])
print_question_answer(reddit_jokes, 52906, ['Question', 'Answer'])
print_question_answer(reddit_jokes, 59924, ['Question', 'Answer'])
print_question_answer(reddit_jokes, 1489, ['Question', 'Answer'])

Before addressing tags:
Question: (1825)
How do you piss off an entire community with one word?
Answer: (1825)
[Deleted]
Question: (52906)
[Corny] What does a highlighter say when it answers the phone?
Answer: (52906)
Yello?
Question: (59924)
How do you disappoint a redditor?
Answer: (59924)
[removed]
Question: (1489)
Everything men know about women
Answer: (1489)
[   ]


**Note:** the following cell may take several seconds to finish.

In [30]:
reddit_jokes = clean_reddit_tags(reddit_jokes, ['Question', 'Answer'])

# of rows dropped with [deleted], [removed] or [censored] tags: 211


In [31]:
reddit_jokes

Unnamed: 0,Question,Answer
0,I enjoy working in a slaughterhouse..,Everything is so cut and dry.
1,What do you call a soldier who survives Mustar...,A seasoned veteran.
2,I really like white dwarf stars...,...My favorite is Peter Dinklage.
3,Knock knock. Whose their?,The grammar police.
4,What breaks when you give it to a twelve year ...,Her hips.
...,...,...
133323,Today a girl kissed me.,I wish I could post it in another subreddit.
133324,The millennium is now legal.,Who wants to be the first person to fuck time ...
133325,I haven't since last year.,(obligatory)
133326,A 17 year old male walks into a drug store...,A 17 year old male walks into a drug store. He...


In [32]:
print("After addressing tags:")
# because rows with [removed], [deleted] and [censored] tags have been dropped
# we're not going to print the rows (index=1825, index=59924) since they contain 
# those tags, or we're going to have a KeyError
print_question_answer(reddit_jokes, 52906, ['Question', 'Answer'])
print_question_answer(reddit_jokes, 1489, ['Question', 'Answer'])

After addressing tags:
Question: (52906)
  What does a highlighter say when it answers the phone?
Answer: (52906)
Yello?
Question: (1489)
Everything men know about women
Answer: (1489)
 


**Note:** the question at index 52906 now has leading whitespace, because it had the <i>[Corny]</i> tag and the function replaced it with a space. Also, the answer at index 1489 is now blank, because the original answer was just square brackets with some whitespace in between. We're going to address both of these next!

Now, let's apply the *clean_text* function on the reddit data.<br>
**Remember:** the *clean_text* function replaces some characters with others, removes unwanted characters and gets rid of extra whitespaces from the data.

In [33]:
reddit_jokes = reddit_jokes.applymap(clean_text)

In [34]:
print_question_answer(reddit_jokes, 52906, ['Question', 'Answer'])
print_question_answer(reddit_jokes, 1489, ['Question', 'Answer'])

Question: (52906)
what does a highlighter say when it answers the phone ?
Answer: (52906)
yello ?
Question: (1489)
everything men know about women
Answer: (1489)



Everything looks good!<br>
Now, let's apply the *preprocess_data* function on the data.<br>
**Remember:** the *preprocess_data* function applies the following preprocessing operations:
1. Drops unnecessary duplicate pairs (rows), keeping only one instance of each. *(For example, if the dataset contains three copies of the same question-answer pair, two of them are removed and one is kept.)*
2. Drops rows with an empty question or answer. *(These may appear because of the text cleaning above or because they are empty in the original dataset.)*
3. Drops rows with more than 30 words in either the question or the answer, or whose answer has fewer than two characters. *(Note: the 30-word limit is a hyperparameter and you can try other values.)*

In [35]:
reddit_jokes = preprocess_data(reddit_jokes, ['Question', 'Answer'])

Removing unecessary duplicate pairs:
# of examples before removing duplicates: 133117
# of examples after removing duplicates: 128036
# of removed duplicates: 5081
Removing empty string rows:
# of examples before removing rows with empty question/answers: 128036
# of examples after removing with empty question/answers: 127946
# of removed empty string rows: 90
Removing rows with more than 30 words in either the question or the answer:
# of examples before removing rows with more than 30 words: 127946
# of examples after removing rows with more than 30 words: 89001
# of removed empty rows with more than 30 words: 38945
Data preprocessing is done.


In [36]:
print(f"Number of question answer pairs in the reddit /r/Jokes dataset: {len(reddit_jokes)}")

Number of question answer pairs in the reddit /r/Jokes dataset: 89001


In [37]:
distinct_chars(reddit_jokes, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 56
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Alphabets: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Special characters: [' ', '!', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '=', '?', '@', '_']


# Third Dataset
**Question-Answer Jokes**, [here](https://www.kaggle.com/jiriroz/qa-jokes).

This dataset consists of one file, namely:
* <i>jokes.csv</i>: with <i>38,269</i> pairs.

In [38]:
qa_jokes = pd.read_csv(files_path + 'jokes.csv', usecols=['Question', 'Answer'])
qa_jokes

Unnamed: 0,Question,Answer
0,Did you hear about the Native American man tha...,He nearly drown in his own tea pee.
1,What's the best anti diarrheal prescription?,Mycheexarphlexin
2,What do you call a person who is outside a doo...,Matt
3,Which Star Trek character is a member of the m...,Jean-Luc Pickacard
4,What's the difference between a bullet and a h...,A bullet doesn't miss Harambe
...,...,...
38264,Q: Why did the pacifist /b/tard try to calm ev...,He did it for the
38265,Q: Why can't Obama poke fun at himself?,A: Because that would be racist.
38266,Why is gambling not allowed in Africa?,Because there are too many cheetahs.
38267,What do you call three witches in a hot tub?,A self-cleaning coven.


In [39]:
print(len(qa_jokes))

38269


In [40]:
distinct_chars(qa_jokes, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 237
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '³', '౪', '₄']
Alphabets: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'È', 'É', 'Ñ', 'ß', 'á', 'ä', 'å', 'æ', 'è', 'é', 'ê', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'Đ', 'ı', 'ō', 'œ', 'ʃ', 'ʅ', 'ʖ', 'Α', 'Μ', 'Ω', 'ά', 'ε', 'ζ', 'η', 'θ', 'κ', 'μ', 'π', 'ρ', 'ς', 'С', 'б', 'е', 'и', 'н', 'р', 'т', 'ь', 'ॐ', 'ಠ', 'ứ', 'づ', 'ツ', '丁', '二', '喲', '媽', '崇', '常', '清', '胖', '董', '這', '麼', '빵']
Special characters: [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\xa0', '¡', '£', '¤', '©', '«', '¬', '\xad', '®', '¯', '°', '

## Text Preprocessing

If you look at some examples in the dataset, you'll notice that some have 'Q:' at the beginning of the question and 'A:' at the beginning of the answer. We need to get rid of these prefixes because they don't convey useful information.<br>
You'll also notice examples where both 'Q:' and 'A:' appear together in either the question or the answer. I'm not going to remove these in general, because they probably convey information and are part of the joke. However, in some of them the answer has the form 'Q: question A: answer', where the question inside the answer simply repeats the question field, so we need to remove that repetition.

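The repeated-question case is the subtle one, so here is a minimal sketch of that check on plain strings. The helper name `strip_repeated_question` is hypothetical, and the second call uses a made-up pair to show that an embedded 'Q: ... A: ...' is left alone when it does not repeat the question.

```python
def strip_repeated_question(question, answer):
    # only answers that contain both markers can embed a repeated question
    if 'Q:' in answer and 'A:' in answer:
        q_start = answer.find('Q:') + 2   # first character after 'Q:'
        q_end = answer.find('A:')         # position where 'A:' starts
        # compare the embedded question with the question field
        if answer[q_start:q_end].strip() == question.strip():
            # keep only the text after 'A:'
            return answer[q_end + 2:].strip()
    return answer

print(strip_repeated_question(
    "Why does Santa have three gardens?",
    'Q: Why does Santa have three gardens? A: So he can "hoe, hoe, hoe."'))
print(strip_repeated_question("Tell me a joke", "Q: Why? A: Because."))
```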
In [41]:
def clean_qa_prefixes(data, cols):
    """
    This function removes special prefixes ('Q:' and 'A:') found in the data.
    e.g. input="Q: how's your day?" --> output=" how's your day?"
    Arguments:
        data: a pandas dataframe.
        cols: a Python list, representing names of columns for questions and answers. First item of the list should be the name 
        of the questions column and the second item should be the name of the column corresponding to answers.
    Returns:
        a pandas dataframe.
    """
    def removes_prefixes(pair):
        """
        This function removes prefixes ('Q:' and 'A:') from the question and answer.
        Examples:
        Input: question="Q: what is your favorite Space movie?", answer='A: Interstellar!'
        Output: question=' what is your favorite Space movie?', answer=' Interstellar!'
        Input: question="Q: how\'s your day?", answer='Q: how\'s your day? A: good, thanks.'
        Output: question=" how's your day?", answer='good, thanks.'
        Input: question='How old are you?', answer='old enough'
        Output: question='How old are you?', answer='old enough'
        Arguments:
            pair: a Pandas Series, where the first item is the question and the second is the answer.
        Returns:
            pair: a Pandas Series.
        """
        # look the two columns up by name; positional access such as pair[0] on a
        # labelled Series is deprecated in recent pandas versions
        q_col, a_col = cols[0], cols[1]
        # if the question contains 'Q:' and the answer contains 'A:' but doesn't contain 'Q:'
        if ('Q:' in pair[q_col]) and ('A:' in pair[a_col]) and ('Q:' not in pair[a_col]):
            pair[q_col] = pair[q_col].replace('Q:','')
            pair[a_col] = pair[a_col].replace('A:','')
        # if the answer contains both 'Q:' and 'A:'
        elif ('A:' in pair[a_col]) and ('Q:' in pair[a_col]):
            pair[q_col] = pair[q_col].replace('Q:','')
            # now we should check if the text between 'Q:' and 'A:' is the same text as the question
            # because if it is, this means that the question is repeated in the answer and we should address that.
            q_start = pair[a_col].find('Q:') + 2 # index of the start of the text that we want to extract
            q_end = pair[a_col].find('A:') # index of the end of the text that we want to extract
            q_txt = pair[a_col][q_start:q_end].strip()
            # if the question is repeated in the answer
            if q_txt == pair[q_col].strip():
                # in case the question is repeated in the answer, remove it from the answer
                pair[a_col] = pair[a_col][q_end+2:].strip()
            
        return pair
        
    return data.apply(removes_prefixes, axis=1)

In [42]:
print("Before removing unnecessary prefixes:")
print_question_answer(qa_jokes, 44, ['Question', 'Answer'])
print_question_answer(qa_jokes, 22, ['Question', 'Answer'])
print_question_answer(qa_jokes, 31867, ['Question', 'Answer'])

Before removing unnecessary prefixes:
Question: (44)
Q: What did the left leg say to the right leg?
Answer: (44)
A: That one in the middle thinks he's hard.
Question: (22)
Why does Santa have three gardens?
Answer: (22)
Q: Why does Santa have three gardens? A: So he can "hoe, hoe, hoe."
Question: (31867)
What is your favorite joke about women?
Answer: (31867)
Q: Why don't women wear watches? A: Because there is a clock on the stove.


In [43]:
qa_jokes = clean_qa_prefixes(qa_jokes, ['Question', 'Answer'])

In [44]:
print("After removing unnecessary prefixes:")
print_question_answer(qa_jokes, 44, ['Question', 'Answer'])
print_question_answer(qa_jokes, 22, ['Question', 'Answer'])
print_question_answer(qa_jokes, 31867, ['Question', 'Answer'])

After removing unnecessary prefixes:
Question: (44)
 What did the left leg say to the right leg?
Answer: (44)
 That one in the middle thinks he's hard.
Question: (22)
Why does Santa have three gardens?
Answer: (22)
So he can "hoe, hoe, hoe."
Question: (31867)
What is your favorite joke about women?
Answer: (31867)
Q: Why don't women wear watches? A: Because there is a clock on the stove.


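Notice that in the third example, both 'Q:' and 'A:' are part of the answer and convey information: the text after 'Q:' is a different question from the one in the Question column, so the answer is left unchanged.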
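To make the three cases easier to check by hand, here is a minimal standalone sketch of the same branching logic, written for plain strings instead of dataframe rows. The name `strip_qa_prefixes` is hypothetical and the third example pair is made up for illustration; neither is part of the notebook.

```python
def strip_qa_prefixes(question, answer):
    """Mirror the prefix-cleaning branches above on a single (question, answer) pair."""
    # Case 1: 'Q:' prefixes the question and 'A:' prefixes the answer.
    if 'Q:' in question and 'A:' in answer and 'Q:' not in answer:
        return question.replace('Q:', ''), answer.replace('A:', '')
    # Case 2: the answer holds both 'Q:' and 'A:'.
    if 'A:' in answer and 'Q:' in answer:
        question = question.replace('Q:', '')
        q_start = answer.find('Q:') + 2  # start of the text after 'Q:'
        q_end = answer.find('A:')        # end of the text before 'A:'
        # Strip the repeated question only if it matches the real question.
        if answer[q_start:q_end].strip() == question.strip():
            answer = answer[q_end + 2:].strip()
    return question, answer


prefixed = strip_qa_prefixes(
    "Q: What did the left leg say to the right leg?",
    "A: That one in the middle thinks he's hard.",
)
repeated = strip_qa_prefixes(
    "Why does Santa have three gardens?",
    'Q: Why does Santa have three gardens? A: So he can "hoe, hoe, hoe."',
)
# Made-up pair: the 'Q:'/'A:' text is a different joke, so it must be kept.
embedded = strip_qa_prefixes(
    "What is your favorite joke?",
    "Q: Why did the chicken cross the road? A: To get to the other side.",
)
print(prefixed)
print(repeated)
print(embedded)
```

Note that the first case leaves a leading space behind (as in the output above); this is harmless here because the later cleaning step removes extra whitespace.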

Now, let's apply the *clean_text* function to the Question-Answer Jokes data.<br>
**Remember:** the *clean_text* function replaces some characters with others, removes unwanted characters, and collapses extra whitespace.

In [45]:
qa_jokes = qa_jokes.applymap(clean_text)
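As a rough illustration of the kind of cleaning described above, the sketch below lowercases the text, normalizes a few typographic characters, drops everything outside the character set reported by *distinct_chars*, pads punctuation with spaces, and collapses whitespace. This is not the notebook's *clean_text*: the name `clean_text_sketch`, the replacement table, and the exact rules are assumptions inferred from the cleaned output shown in this notebook.

```python
import re

# Assumed replacement table: typographic apostrophes and dashes -> ASCII.
REPLACEMENTS = {'\u2019': "'", '\u2018': "'", '\u2013': '-', '\u2014': '-'}


def clean_text_sketch(text):
    """Hypothetical cleaner; an approximation, not the notebook's clean_text."""
    text = str(text).lower()
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Replace every character outside the allowed set with a space.
    text = re.sub(r"[^a-z0-9 !$%&'()*+,\-./:;=?@_]", ' ', text)
    # Surround each special character with spaces so it becomes its own token.
    text = re.sub(r"([!$%&'()*+,\-./:;=?@_])", r' \1 ', text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


print(clean_text_sketch("What's the best anti diarrheal prescription?"))
print(clean_text_sketch("Jean-Luc Pickacard"))
```

On these two inputs the sketch reproduces the cleaned form seen in the final dataset (`what ' s the best anti diarrheal prescription ?` and `jean - luc pickacard`), but other inputs may be handled differently by the real function.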

Now, let's apply the *preprocess_data* function to the data.<br>
**Remember:** the *preprocess_data* function applies the following preprocessing operations:
1. Drops duplicate pairs (rows), keeping one instance of each. *(For example, if the dataset contains three copies of the same question-answer pair, two of them are removed and one is kept.)*
2. Drops rows with an empty question or answer. *(These may appear because of the earlier cleaning steps or because they were already empty in the original dataset.)*
3. Drops rows with more than 30 words in either the question or the answer, or with an answer shorter than two characters. *(Note: the 30-word limit is a hyperparameter and you can try other values.)*
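The three steps listed above can be sketched with plain pandas operations. This is an approximation under stated assumptions, not the notebook's *preprocess_data*: the name `preprocess_sketch`, its parameters, and the final index reset are all hypothetical.

```python
import pandas as pd


def preprocess_sketch(df, cols, max_words=30, min_answer_chars=2):
    """Hypothetical stand-in for the three preprocessing steps."""
    q_col, a_col = cols
    # 1. Drop duplicate pairs, keeping the first occurrence.
    df = df.drop_duplicates(subset=cols, keep='first')
    # 2. Drop rows whose question or answer is empty (or whitespace only).
    non_empty = (df[q_col].str.strip() != '') & (df[a_col].str.strip() != '')
    df = df[non_empty]
    # 3. Drop rows that are too long, or whose answer is too short.
    q_words = df[q_col].str.split().str.len()
    a_words = df[a_col].str.split().str.len()
    keep = (
        (q_words <= max_words)
        & (a_words <= max_words)
        & (df[a_col].str.len() >= min_answer_chars)
    )
    return df[keep].reset_index(drop=True)


toy = pd.DataFrame({
    'Question': ['why ?', 'why ?', '', ' '.join(['long'] * 31), 'short ?', 'fine ?'],
    'Answer': ['because .', 'because .', 'x y', 'ok then', 'a', 'ok'],
})
result = preprocess_sketch(toy, ['Question', 'Answer'])
print(result)
```

Of the six toy rows, only two survive: one copy of the duplicated pair and the last row. The others are removed as a duplicate, an empty question, a 31-word question, and a one-character answer.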

In [46]:
qa_jokes = preprocess_data(qa_jokes, ['Question', 'Answer'])

Removing unecessary duplicate pairs:
# of examples before removing duplicates: 38269
# of examples after removing duplicates: 38187
# of removed duplicates: 82
Removing empty string rows:
# of examples before removing rows with empty question/answers: 38187
# of examples after removing with empty question/answers: 38166
# of removed empty string rows: 21
Removing rows with more than 30 words in either the question or the answer:
# of examples before removing rows with more than 30 words: 38166
# of examples after removing rows with more than 30 words: 37086
# of removed empty rows with more than 30 words: 1080
Data preprocessing is done.


In [47]:
print(f"Number of question-answer pairs in the Question-Answer Jokes dataset: {len(qa_jokes)}")

Number of question-answer pairs in the Question-Answer Jokes dataset: 37086


In [48]:
distinct_chars(qa_jokes, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 56
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Alphabets: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Special characters: [' ', '!', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '=', '?', '@', '_']


# Putting it together

Let's concatenate all the data we have to create our final dataset.

In [49]:
dataset = pd.concat([wordball, reddit_jokes, qa_jokes], ignore_index=True)
dataset.head()

Unnamed: 0,Question,Answer
0,what ' s the best anti diarrheal prescription ?,mycheexarphlexin
1,what do you call a person who is outside a doo...,matt
2,which star trek character is a member of the m...,jean - luc pickacard
3,what ' s the difference between a bullet and a...,a bullet doesn ' t miss harambe
4,why was the ethiopian baby crying ?,he was having a mid - life crisis


In [50]:
print(f"Number of question-answer pairs in the dataset: {len(dataset)}")

Number of question-answer pairs in the dataset: 227799


The three sources may share examples, so let's drop any duplicate pairs:

In [51]:
data_len_before = len(dataset) # len of data before removing duplicates
print(f"# of examples before removing duplicates: {data_len_before}")
# drop duplicates
dataset = dataset.drop_duplicates(keep='first')
data_len_after = len(dataset) # len of data after removing duplicates
print(f"# of examples after removing duplicates: {data_len_after}")
print(f"# of removed duplicates: {data_len_before-data_len_after}")

# of examples before removing duplicates: 227799
# of examples after removing duplicates: 175671
# of removed duplicates: 52128


Let's drop rows with NaN values, if there are any:

In [52]:
dataset.dropna(inplace=True)

In [53]:
dataset

Unnamed: 0,Question,Answer
0,what ' s the best anti diarrheal prescription ?,mycheexarphlexin
1,what do you call a person who is outside a doo...,matt
2,which star trek character is a member of the m...,jean - luc pickacard
3,what ' s the difference between a bullet and a...,a bullet doesn ' t miss harambe
4,why was the ethiopian baby crying ?,he was having a mid - life crisis
...,...,...
227779,how many surrealists does it take to change a ...,fish .
227784,here ' s a joke just for reddit : how many nar...,bacon
227786,what do you get when you combine a comedian an...,a brofl
227794,q : why did the pacifist / b / tard try to cal...,he did it for the


Let's make sure that all our cells are of the same type:

In [54]:
dataset = dataset.applymap(str)

In [55]:
print(f"Number of question-answer pairs in the dataset: {len(dataset)}")

Number of question-answer pairs in the dataset: 175671


In [56]:
distinct_chars(dataset, ['Question', 'Answer'])

Number of distinct characters used in the dataset: 56
Digits: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Alphabets: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Special characters: [' ', '!', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '=', '?', '@', '_']


Finally, let's save the dataset:

In [57]:
dataset.to_csv(files_path + 'dataset.csv')  # files_path already ends with '/'
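One thing to keep in mind when loading the saved file later: `to_csv` writes the index as an unnamed first column by default, and `read_csv` by default converts strings such as `null` or `nan` to NaN and parses digit-only answers as numbers. The sketch below uses a small made-up dataframe and a temporary directory (both are assumptions for illustration, not the real dataset or path) to show a round trip that keeps every cell as a string.

```python
import os
import tempfile

import pandas as pd

# Made-up sample with a gap in the index, like the deduplicated dataset.
sample = pd.DataFrame(
    {'Question': ['what is nothing ?', 'how many ?'], 'Answer': ['null', '7']},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dataset.csv')
    sample.to_csv(path)  # the index is written as the first column

    # Default parsing: 'null' becomes NaN and '7' is read as a number.
    naive = pd.read_csv(path, index_col=0)

    # Keep every cell as text and disable the default NaN markers.
    restored = pd.read_csv(
        path,
        index_col=0,
        dtype={'Question': str, 'Answer': str},
        keep_default_na=False,
    )

print(naive['Answer'].isna().tolist())
print(restored['Answer'].tolist())
```

If the original row index is not needed, passing `index=False` to `to_csv` avoids the extra column altogether.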