In this notebook, we load datasets we gathered from diffrent sources and examine each dataset seperately. The goal is to inspect them to get a better idea of each one. 

# Setting up the environment

In [6]:
#@title Setting up project paths
import os

colab_setup = False #@param {type:"boolean"}
PROJECT_PATH = "/content/drive/MyDrive/TWM/" #@param {"type":"string"}

if colab_setup:
    from google.colab import drive
    print("Mounting Google Drive...", end="", flush=True)
    drive.mount('/content/drive')
    print("Done")

else:
    # set this to the parent directory of the whole project
    PROJECT_PATH = rf"C:\Users\{os.environ['USERNAME']}\Graduation-Project"

print("PROJECT_PATH:", PROJECT_PATH)
os.chdir(PROJECT_PATH)
os.listdir()

PROJECT_PATH: C:\Users\LAPTOP\Graduation-Project


['.git',
 '.gitignore',
 'chatbot-env',
 'DataEngineering',
 'Dependencies',
 'dummy.py',
 'FineTuneing',
 'hierarchy.txt',
 'README.md',
 'requirements.txt',
 'Testing Interface.ipynb',
 'Utils']

In [37]:
#@title Importing modules

import ast
import json
import functools
from pprint import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# importing the hugging face datasets library
# copy-pasted code from their 'getting started' notebook
import pyarrow
if int(pyarrow.__version__.split('.')[1]) < 16 and int(pyarrow.__version__.split('.')[0]) == 0:
    import os
    os.kill(os.getpid(), 9)
from datasets import list_datasets, list_metrics, load_dataset, load_metric
from datasets.dataset_dict import DatasetDict

# I wrote a simple library to make it easy for me to download files and extract archives
import Utils.helperFunctions as helperFunctions
import Utils.dialogueUtils as dialogueUtils

In [10]:
#@title Environment watermark
%load_ext watermark
%watermark --author "Mohamed Hisham" --email "Mohamed00Hisham@gmail.com" --github_username "Mhmd-Hisham",
%watermark
%watermark --iversions

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Author: Mohamed Hisham

Github username: Mhmd-Hisham,

Email: Mohamed00Hisham@gmail.com

Last updated: 2022-09-26T22:45:02.095790+02:00

Python implementation: CPython
Python version       : 3.9.5
IPython version      : 8.5.0

Compiler    : MSC v.1928 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 12
Architecture: 64bit

sys       : 3.9.5 (tags/v3.9.5:0a7dcbd, May  3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
json      : 2.0.9
pandas    : 1.5.0
numpy     : 1.23.3
matplotlib: 3.6.0
pyarrow   : 9.0.0



# Dataset #1: DialogSum

~13K Dialogue

A Real-life Scenario Dialogue Summarization Dataset

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

Source: https://github.com/cylnlp/dialogsum


In [11]:
#@title Download
def get_dialogSum(dest):
    def download(dest):
        urls = [
            'https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.train.jsonl',
            'https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.test.jsonl',
            'https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.dev.jsonl',
        ]
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        files = helperFunctions.download_from_list(urls, dest, override=False)
        files = list(map(lambda f: os.path.join(dest, f), files))

        jsonl_objects = sum([helperFunctions.read_jsonl(f) for f in files], [])
        df = helperFunctions.jsonl_to_df(jsonl_objects)

        return df

    def preprocess(df):
        # reorder the columns
        return df[
            ['dialogue', 'topic', 'topic1', 'topic2', 'topic3',
             'summary', 'summary1', 'summary2','summary1', 'fname']
        ]

    def save(df, dest):
        # save for later use
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, "df1.csv"))

        # save general info about the dataset
        info_dict = {
            "name":"DialogSum", 
            "source":"https://github.com/cylnlp/dialogsum",
            "order":1
        }

        helperFunctions.save_as_json(info_dict, "info1.json", dest)

        return info_dict
    
    df = preprocess(download(dest))
    info_dict = save(df, dest)

    return df, info_dict

dest1 = "DataEngineering/Datasets/dataset1/"
df1, info_dict1 = get_dialogSum(dest1)
df1.head(n=5)

File 'dialogsum.train.jsonl' exists! Enable override to override it.
File 'dialogsum.test.jsonl' exists! Enable override to override it.
File 'dialogsum.dev.jsonl' exists! Enable override to override it.


Unnamed: 0,dialogue,topic,topic1,topic2,topic3,summary,summary1,summary2,summary1.1,fname
0,"#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. ...",get a check-up,,,,"Mr. Smith's getting a check-up, and Doctor Haw...",,,,train_0
1,"#Person1#: Hello Mrs. Parker, how have you bee...",vaccines,,,,Mrs Parker takes Ricky for his vaccines. Dr. P...,,,,train_1
2,"#Person1#: Excuse me, did you see a set of key...",find keys,,,,#Person1#'s looking for a set of keys and asks...,,,,train_2
3,#Person1#: Why didn't you tell me you had a gi...,have a girlfriend,,,,#Person1#'s angry because #Person2# didn't tel...,,,,train_3
4,"#Person1#: Watsup, ladies! Y'll looking'fine t...",dance,,,,Malik invites Nikki to dance. Nikki agrees if ...,,,,train_4


In [12]:
dialogueUtils.preview_dialogues(df1, n=3)

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
------------------------------
#Person1#: Hello Mrs. Parker, how hav

# Dataset #2: DailyDialog

~13K Dialogue

The developed DailyDialog dataset contains 13,118 multi-turn dialogues.


Source: http://yanran.li/dailydialog.html

In [13]:
def get_DailyDialog(dest:str) -> tuple[pd.DataFrame,  dict]:
    def parse_dialogue(dialogue: str) -> str:
        utterances = dialogue.split("__eou__")[:-1]

        output = ""
        for i, utterance in enumerate(utterances):
            speaker = "#Person1#: " if i&1 == 0 else "#Person2#: "
            sentences = utterance.split(" . ")
            
            sentences = [speaker+s.strip() for s in sentences if s.strip()]
            sentences = ' . \n'.join(sentences)

            if sentences.strip():
                output += sentences + " . \n"

        return output

    def download(dest: str) -> str:
        url = 'http://yanran.li/files/ijcnlp_dailydialog.zip'

        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        filename = helperFunctions.download_file(url, dest)
        filename = os.path.join(dest, filename)

        extracted_folder = helperFunctions.unzip(filename, dest)
        return extracted_folder

    def preprocess(dest: str, extracted_dir: str) -> pd.DataFrame:
        dest = os.path.join(dest, 'raw')

        custom_open = functools.partial(open, mode='r', encoding='utf-8')

        with custom_open(os.path.join(dest, extracted_dir, 'readme.txt')) as fh:
            print(fh.read())

        with custom_open(os.path.join(dest, extracted_dir, 'dialogues_text.txt')) as fh:
            dialogues = list(map(parse_dialogue, fh.readlines()))

        with custom_open(os.path.join(dest, extracted_dir, 'dialogues_topic.txt')) as fh:
            topics = fh.read().splitlines()
        topics_dict = {
            1:  "Ordinary Life",2:  "School Life", 3:  "Culture & Education",
            4:  "Attitude & Emotion", 5:  "Relationship", 6:  "Tourism ",
            7:  "Health",8:  "Work",9:  "Politics",10: "Finance"
        }
        topics = list(map(lambda t: topics_dict[int(t)], topics))

        with custom_open(os.path.join(dest, extracted_dir, 'dialogues_act.txt')) as fh:
            acts = fh.read().splitlines()

        with custom_open(os.path.join(dest, extracted_dir, 'dialogues_emotion.txt')) as fh:
            emotions = fh.read().splitlines()
        data = np.array([dialogues, topics, emotions, acts])

        columns = ['dialogue', 'topics', 'emotions', 'acts']
        df = pd.DataFrame(data.T, columns=columns)

        return df

    def save(df, dest):
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)
        acts_dict = { 1: "inform", 2: "question", 3: "directive", 4: "commissive" }

        emotions_dict = {
            0: "no emotion", 1: "anger", 2: "disgust", 3: "fear", 
            4: "happiness", 5: "sadness", 6: "surprise"
        }

        # save for later use
        df.to_csv(os.path.join(dest, "df2.csv"))

        # serialize the emotion and act mappers
        helperFunctions.save_as_json(acts_dict, "acts.json", dest)
        helperFunctions.save_as_json(emotions_dict, "emotions.json", dest)

        info_dict = {
            "name":"DailyDialog", 
            "source":"http://yanran.li/dailydialog.html",
            "order":2
        }

        helperFunctions.save_as_json(info_dict, "info2.json", dest)

        return info_dict

    extracted_dir = download(dest)
    df = preprocess(dest, extracted_dir)
    info_dict = save(df, dest)

    return df, info_dict

dest2 = "DataEngineering/Datasets/dataset2/"
df2, info_dict2 = get_DailyDialog(dest2)
df2.head(n=5)

File 'ijcnlp_dailydialog.zip' exists! Enable override to override it.
Unzipping 'DataEngineering/Datasets/dataset2/raw\ijcnlp_dailydialog.zip'.. Done!
Here are some explanations about the files:

1) dialogues_text.txt: The DailyDialog dataset which contains 11,318 transcribed dialogues.
2) dialogues_topic.txt: Each line in dialogues_topic.txt corresponds to the topic of that in dialogues_text.txt.
                        The topic number represents: {1: Ordinary Life, 2: School Life, 3: Culture & Education,
                        4: Attitude & Emotion, 5: Relationship, 6: Tourism , 7: Health, 8: Work, 9: Politics, 10: Finance}
3) dialogues_act.txt: Each line in dialogues_act.txt corresponds to the dialog act annotations in dialogues_text.txt.
                      The dialog act number represents: { 1: inform，2: question, 3: directive, 4: commissive }
4) dialogues_emotion.txt: Each line in dialogues_emotion.txt corresponds to the emotion annotations in dialogues_text.txt.
            

Unnamed: 0,dialogue,topics,emotions,acts
0,#Person1#: The kitchen stinks . \n#Person2#: I...,Ordinary Life,2 0,3 4
1,"#Person1#: So Dick , how about getting some co...",Ordinary Life,4 2 0 1 0,3 4 3 1 1
2,#Person1#: Are things still going badly with y...,Ordinary Life,0 1 0 0,2 1 3 4
3,#Person1#: Would you mind waiting a while ? . ...,Ordinary Life,0 0 0 4,3 2 1 1
4,#Person1#: Are you going to the annual party ?...,Ordinary Life,0 4 4,3 4 1


In [14]:
dialogueUtils.preview_dialogues(df2, n=3)

#Person1#: The kitchen stinks . 
#Person2#: I'll throw out the garbage . 

------------------------------
#Person1#: So Dick , how about getting some coffee for tonight ? . 
#Person2#: Coffee ? I don ’ t honestly like that kind of stuff . 
#Person1#: Come on , you can at least try a little , besides your cigarette . 
#Person2#: What ’ s wrong with that ? Cigarette is the thing I go crazy for . 
#Person1#: Not for me , Dick . 

------------------------------
#Person1#: Are things still going badly with your houseguest ? . 
#Person2#: Getting worse . 
#Person2#: Now he ’ s eating me out of house and home . 
#Person2#: I ’ Ve tried talking to him but it all goes in one ear and out the other . 
#Person2#: He makes himself at home , which is fine . 
#Person2#: But what really gets me is that yesterday he walked into the living room in the raw and I had company over ! That was the last straw . 
#Person1#: Leo , I really think you ’ re beating around the bush with this guy . 
#Person1#: I kno

# Dataset #3: Cornell Movie--Dialogs Corpus

~83K Dialogue

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
	- genres
	- release year
	- IMDB rating
	- number of IMDB votes
	- IMDB rating
- character metadata included:
	- gender (for 3,774 characters)
	- position on movie credits (3,321 characters)

Source: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Note: I decided to ignore the metadata since it's irrelevant to our project. I reconstructed the dialogues only and saved it as a Dataframe.

In [15]:
def get_CornellMovieDialogsCorpus(dest: str) -> tuple[pd.DataFrame,  dict]:
    def download(dest: str) -> str:
        url = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"

        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        filename = helperFunctions.download_file(url, dest)
        filename = os.path.join(dest, filename)

        extracted_dir = helperFunctions.unzip(filename, dest)
        extracted_dir = 'cornell_movie_dialogs_corpus/cornell movie-dialogs corpus'
        extracted_dir = os.path.join(dest, extracted_dir)

        return extracted_dir

    def preprocess(dest: str, extracted_dir: str) -> pd.DataFrame:
        line_id_to_utterance = dict()
        field_separator = ' +++$+++ '
        custom_open = functools.partial(open, mode='r', encoding='ISO-8859-1')
        with custom_open(os.path.join(extracted_dir, 'movie_lines.txt')) as fh:
            # ex movie line:
            # L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

            # line fields:
                # - lineID
                # - characterID (who uttered this phrase)
                # - movieID
                # - character name
                # - text of the utterance

            for line in fh:
                line_id, _, _, _, utterance = line.split(field_separator)
                line_id_to_utterance[line_id] = utterance.strip()

        movie_id_to_genres = dict()
        with custom_open(os.path.join(extracted_dir, 'movie_titles_metadata.txt')) as fh:
            # - contains information about each movie title
            # - fields: 
            # 	- movieID, 
            # 	- movie title,
            # 	- movie year, 
            #    	- IMDB rating,
            # 	- no. IMDB votes,
            # 	- genres in the format ['genre1','genre2',É,'genreN']
            for line in fh:
                movie_id, *_, movie_genres = line.split(field_separator)
                movie_id_to_genres[movie_id] = movie_genres

        conversations = []
        genres = []
        with custom_open(os.path.join(extracted_dir, 'movie_conversations.txt')) as fh:
            # ex line: 
            # u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

            # line fields:
                # - characterID of the first character involved in the conversation
                # - characterID of the second character involved in the conversation
                # - movieID of the movie in which the conversation occurred
                # - list of the utterances that make the conversation, in chronological 
                # 	order: ['lineID1','lineID2',É,'lineIDN']
                # 	has to be matched with movie_lines.txt to reconstruct the actual content
            for line in fh:
                *_, movie_id, conversation = line.split(field_separator)
                
                # reconstruct the conversation
                conversation = ast.literal_eval(conversation)
                conversation = map(line_id_to_utterance.get, conversation)
                conversation = '\n'.join(f"#Person{i%2+1}#: "+l for i, l in enumerate(conversation))

                # add it to our list
                conversations.append(conversation)

                # get the genres of the movie
                genres.append(movie_id_to_genres[movie_id])

        df = pd.DataFrame(np.array([conversations, genres]).T, 
                        columns=["dialogue", "genres"])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, "df3.csv"))

        info_dict = {
            "name":"Cornell Movie--Dialogs Corpus", 
            "source":"https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html",
            "order":3
        }

        helperFunctions.save_as_json(info_dict, "info3.json", dest)
        return info_dict

    extracted_dir = download(dest)
    df = preprocess(dest, extracted_dir)
    info_dict = save(df, dest)

    return df, info_dict

dest3 = "DataEngineering/Datasets/dataset3/"
df3, info_dict3 = get_CornellMovieDialogsCorpus(dest3)
df3.head(n=5)

File 'cornell_movie_dialogs_corpus.zip' exists! Enable override to override it.
Unzipping 'DataEngineering/Datasets/dataset3/raw\cornell_movie_dialogs_corpus.zip'.. Done!


Unnamed: 0,dialogue,genres
0,#Person1#: Can we make this quick? Roxanne Ko...,"['comedy', 'romance']\n"
1,#Person1#: You're asking me out. That's so cu...,"['comedy', 'romance']\n"
2,"#Person1#: No, no, it's my fault -- we didn't ...","['comedy', 'romance']\n"
3,#Person1#: Why?\n#Person2#: Unsolved mystery. ...,"['comedy', 'romance']\n"
4,"#Person1#: Gosh, if only we could find Kat a b...","['comedy', 'romance']\n"


In [16]:
dialogueUtils.preview_dialogues(df3, n=3)

#Person1#: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
#Person2#: Well, I thought we'd start with pronunciation, if that's okay with you.
#Person1#: Not the hacking and gagging and spitting part.  Please.
#Person2#: Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?
------------------------------
#Person1#: You're asking me out.  That's so cute. What's your name again?
#Person2#: Forget it.
------------------------------
#Person1#: No, no, it's my fault -- we didn't have a proper introduction ---
#Person2#: Cameron.
#Person1#: The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
#Person2#: Seems like she could get a date easy enough...
------------------------------


# Dataset #4: Commonsense-Dialogues

~11K Dialogue

Commonsense-Dialogues is a crowdsourced dataset of ~11K dialogues grounded in social contexts involving utilization of commonsense. The social contexts used were sourced from the train split of the [SocialIQA](https://leaderboard.allenai.org/socialiqa/submissions/get-started) dataset, a multiple-choice question-answering based social commonsense reasoning benchmark.

Source: https://github.com/alexa/Commonsense-Dialogues



In [21]:
def get_CommonsenseDialogues(dest: str) -> tuple[pd.DataFrame,  dict]:
    def download(dest: str) -> str:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        urls = [
            'https://raw.githubusercontent.com/alexa/Commonsense-Dialogues/main/data/test.json',
            'https://raw.githubusercontent.com/alexa/Commonsense-Dialogues/main/data/train.json',
            'https://raw.githubusercontent.com/alexa/Commonsense-Dialogues/main/data/valid.json'
        ]

        # download the files
        files = helperFunctions.download_from_list(urls, dest, override=False)
        files = list(map(lambda f: os.path.join(dest, f), files))

        # load the 3 json files into a single dataframe
        df = functools.reduce(
            lambda x, y: x.append(y),
            (map(lambda f: pd.read_json(f).T, files))
        )
        return df

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        # reorder the columns
        df = df[['turns', 'speaker', 'context']]
        df = df.rename(columns={"turns": "dialogue"})

        # reformat the dialogues and label speakers
        df['dialogue'] = df['dialogue'].apply(lambda x: '\n'.join(
            f"#Person{i%2+1}#: "+l for i, l in enumerate(x)))

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, "df4.csv"))

        info_dict = {
            "name": "Commonsense-Dialogues",
            "source": "https://github.com/alexa/Commonsense-Dialogues",
            "order": 4
        }

        helperFunctions.save_as_json(info_dict, "info4.json", dest)
        return info_dict

    df = preprocess(download(dest))
    info_dict = save(df, dest)

    return df, info_dict

dest4 = "DataEngineering/Datasets/dataset4/"
df4, info_dict4 = get_CommonsenseDialogues(dest4)

print("Dataset Shape:", df4.shape)
df4.head(n=5)

File 'test.json' exists! Enable override to override it.
File 'train.json' exists! Enable override to override it.
File 'valid.json' exists! Enable override to override it.


  lambda x, y: x.append(y),
  lambda x, y: x.append(y),


Dataset Shape: (11373, 3)


Unnamed: 0,dialogue,speaker,context
1,"#Person1#: I got so mad, I couldn't contain it...",Austin,Austin left in a huff of rage after they were ...
2,#Person1#: I was so devastated after the defea...,Austin,Austin left in a huff of rage after they were ...
3,#Person1#: I was so upset that I lost yesterda...,Austin,Austin left in a huff of rage after they were ...
4,#Person1#: Remember when I was playing with th...,Cameron,"Cameron had accidentally broken the door, so t..."
5,#Person1#: I think that Taylor is mad at me.\n...,Casey,Casey walked out of the room in a huff after t...


In [18]:
dialogueUtils.preview_dialogues(df4, n=3)

#Person1#: I got so mad, I couldn't contain it anymore
#Person2#: Did you huff off?
#Person1#: I did, I flared up into anger
#Person2#: You need to calm down, it's just a video game
#Person1#: I know, I should not let it get to me like this.
#Person2#: blow off some steam and come back
------------------------------
#Person1#: I was so devastated after the defeat.
#Person2#: You should have put more effort into it!
#Person1#: yes, I know! I left in a huff of rage after I was beaten in the video game competition.
#Person2#: You need to buckle up next time!
#Person1#: I know, I will.
------------------------------
#Person1#: I was so upset that I lost yesterday.
#Person2#: What did you play?
#Person1#: It was a video game competition.
#Person2#: Who did you play against?
#Person1#: They were players form the town next to us.
#Person2#: That happened sometimes. Better luck next time.
------------------------------


# Dataset #5: EmpatheticDialogues

~23K Dialogues

This dataset is grounded in emotional situations to facilitate training and evaluation of empathy in dialog agents. It was collected via crowdsourcing, where one participant (speaker) selects the emotion word, describes the corresponding situation and discusses it with another participant (listener). Each conversation is allowed to be 4-8 utterances long, and the average is 4.31 utterances per conversation. 

Source: 
- https://arxiv.org/abs/1811.00207
- https://github.com/facebookresearch/EmpatheticDialogues

In [23]:
def get_EmpatheticDialogues(dest: str, order: int=5) -> tuple[pd.DataFrame,  dict]:
    # commas are represented as _comma_

    def download(dest: str) -> DatasetDict:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        # the hugging face datasets library has the empathetic dialogues dataset
        # so no need to re-write a dataset downloader
        dataset = load_dataset('empathetic_dialogues')
        dataset.save_to_disk(dest)

        return dataset

    def preprocess(dataset: DatasetDict) -> pd.DataFrame:
        dataset_rows = []

        for key in dataset.keys():
            df = dataset[key].to_pandas()
            for conv_id in df['conv_id'].unique():
                conv_rows = df[df['conv_id']==conv_id]
                conv_rows.sort_values(by=['utterance_idx'])
                conv_rows = conv_rows[['utterance', 'speaker_idx', 'context']]

                turns = conv_rows['utterance'].to_list()
                speakers = list(set(conv_rows['speaker_idx'].to_list()))

                if len(speakers) != 2:
                    continue
                
                if speakers[0] != conv_rows['speaker_idx'].iloc[0]:
                    speakers = speakers[::-1]

                speaker1, speaker2 = speakers
                mapper = {speaker1:"#Person1#:", speaker2:"#Person2#:"}
                speakers = conv_rows['speaker_idx'].to_list()

                context = conv_rows['context'].iloc[0]
                conv = [mapper[speakers[i]]+turns[i] for i in range(len(turns))]
                conv = '\n'.join(conv)

                dataset_rows.append([conv, context])

        df = pd.DataFrame(
            dataset_rows, 
            columns=["dialogue", "emotion"]
        )
        return df
    
    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"))

        info_dict = {
            "name":"EmpatheticDialogues", 
            "source":"https://github.com/facebookresearch/EmpatheticDialogues",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)
        return info_dict

    dataset = download(dest)
    df = preprocess(dataset)
    info_dict = save(df, dest)

    return df, info_dict

dest5 = "DataEngineering/Datasets/dataset5"
df5, info_dict5 = get_EmpatheticDialogues(dest5, order=5)

print("Dataset Shape:", df5.shape)
df5.head(n=5)

Using custom data configuration default
Found cached dataset empathetic_dialogues (C:/Users/LAPTOP/.cache/huggingface/datasets/empathetic_dialogues/default/0.1.0/09bbeed3882a67db98c73952fb3c1c9a85af83dc78f81454c2454382fd03f6cf)


  0%|          | 0/3 [00:00<?, ?it/s]

Dataset Shape: (23078, 2)


Unnamed: 0,dialogue,emotion
0,#Person1#:I remember going to see the firework...,sentimental
1,#Person1#: it feels like hitting to blank wall...,afraid
2,#Person1#:Hi how are you doing today\n#Person2...,proud
3,#Person1#:I have never cheated on my wife.\n#P...,faithful
4,#Person1#:Job interviews always make me sweat ...,terrified


In [24]:
dialogueUtils.preview_dialogues(df5, n=3)

#Person1#:I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.
#Person2#:Was this a friend you were in love with_comma_ or just a best friend?
#Person1#:This was a best friend. I miss her.
#Person2#:Where has she gone?
#Person1#:We no longer talk.
#Person2#:Oh was this something that happened because of an argument?
------------------------------
#Person1#: it feels like hitting to blank wall when i see the darkness
#Person2#:Oh ya? I don't really see how
#Person1#:dont you feel so.. its a wonder 
#Person2#:I do actually hit blank walls a lot of times but i get by
#Person1#: i virtually thought so.. and i used to get sweatings
#Person2#:Wait what are sweatings
------------------------------
#Person1#:Hi how are you doing today
#Person2#:doing good.. how about you
#Person1#:Im good_comma_ trying to understand how someone can feel like hittin

# Dataset #6: MultiWOZ 2.2

~ 10K human-to-human dialogue

This is a multi-domain booking dataset, covering 7 domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train. The dataset was collected using a Wizard-of-Oz setup whereby conversations were conducted between two crowdworkers, a ‘wizard’ and a ‘user’. The user is provided with a goal (e.g. book a taxi to the hotel) and chats with the wizard to achieve this goal.

Source:
- https://arxiv.org/abs/2007.12720
- https://github.com/budzianowski/multiwoz/tree/master/data/MultiWOZ_2.2

In [27]:
def get_MultiWOZ_2_2(dest: str, order: int=6) -> tuple[pd.DataFrame,  dict]:
    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        dev_urls = [
            'https://www.github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.2/dev/dialogues_001.json',
            'https://www.github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.2/dev/dialogues_002.json'
        ]

        test_urls = [
            'https://www.github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.2/test/dialogues_001.json',
            'https://www.github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.2/test/dialogues_002.json'
        ]

        train_urls = ['https://www.github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.2/train/dialogues_001.json']

        train_url = train_urls[0]
        for i in range(2, 18):
            train_urls.append(train_url.replace("001", f"{i:03}"))

        dest_dev = os.path.join(dest, "dev")
        dest_test = os.path.join(dest, "test")
        dest_train = os.path.join(dest, "train")

        # download the files
        dev_files = helperFunctions.download_from_list(dev_urls, dest_dev, override=False)
        files = list(map(lambda f: os.path.join(dest_dev, f), dev_files))

        test_files = helperFunctions.download_from_list(test_urls, dest_test, override=False)
        files += list(map(lambda f: os.path.join(dest_test, f), test_files))

        train_files = helperFunctions.download_from_list(train_urls, dest_train, override=False)
        files += list(map(lambda f: os.path.join(dest_train, f), train_files))

        return files

    def parse_dialogue(dialogue: str) -> str:
        output = []

        speakers = set()
        for entry in dialogue:
            speakers.add(entry['speaker'])

        if len(speakers) != 2:
            return ""
        
        speaker1, speaker2 = list(speakers)
        if speaker1 != dialogue[0]['speaker']:
            speaker1, speaker2 = speaker2, speaker1

        mapper = {speaker1: "#Person1#:", speaker2: "#Person2#:", }
        for entry in dialogue:
            output.append(mapper[entry['speaker']] + entry['utterance'])

        return '\n'.join(output)

    def preprocess(files: list[str]) -> pd.DataFrame:
        # load the json files into a single dataframe
        df = functools.reduce(
            lambda x, y: pd.concat([x, y]),
            (map(pd.read_json, files))
        )
        # reorder the columns
        df = df[['turns', 'services', 'dialogue_id']]
        df = df.rename(columns={"turns": "dialogue"})

        df['dialogue'] = df['dialogue'].apply(parse_dialogue)

        df = df[df['dialogue'].notna()].reset_index(drop=True)
        df.index = np.arange(0, df.shape[0])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"), index=False)

        info_dict = {
            "name":"MultiWOZ 2.2", 
            "source":"https://github.com/budzianowski/multiwoz/tree/master/data/MultiWOZ_2.2",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest6 = "DataEngineering/Datasets/dataset6"
df6, info_dict6 = get_MultiWOZ_2_2(dest6)

print("Dataset Shape:", df6.shape)
df6.head(n=5)

File 'dialogues_001.json' exists! Enable override to override it.
File 'dialogues_002.json' exists! Enable override to override it.
File 'dialogues_001.json' exists! Enable override to override it.
File 'dialogues_002.json' exists! Enable override to override it.
File 'dialogues_001.json' exists! Enable override to override it.
File 'dialogues_002.json' exists! Enable override to override it.
File 'dialogues_003.json' exists! Enable override to override it.
File 'dialogues_004.json' exists! Enable override to override it.
File 'dialogues_005.json' exists! Enable override to override it.
File 'dialogues_006.json' exists! Enable override to override it.
File 'dialogues_007.json' exists! Enable override to override it.
File 'dialogues_008.json' exists! Enable override to override it.
File 'dialogues_009.json' exists! Enable override to override it.
File 'dialogues_010.json' exists! Enable override to override it.
File 'dialogues_011.json' exists! Enable override to override it.
File 'dial

Unnamed: 0,dialogue,services,dialogue_id
0,#Person1#:I'm looking for a local place to din...,"[restaurant, train]",PMUL0698.json
1,#Person1#:My husband and I are celebrating our...,"[taxi, attraction, hotel]",PMUL3233.json
2,#Person1#:I need a taxi to come to backstreet ...,[taxi],SNG01627.json
3,#Person1#:I'm looking for a place to go in the...,"[attraction, train]",MUL1719.json
4,#Person1#:I'm looking for an expensive restaur...,"[restaurant, train]",MUL0242.json


In [26]:
dialogueUtils.preview_dialogues(df6, n=3)

#Person1#:I'm looking for a local place to dine in the centre that serves chinese food.
#Person2#:I have restaurants matching your criteria in all price ranges. Do you have a preference on price?
#Person1#:I need the address, postcode and the price range.
#Person2#:Ok how about Charlie Chan, located at Regent Street City Centre. Postcode is cb21db with a cheap price. Can I help you further today?
#Person1#:I also need a train. The train should leave after 16:15 and should leave on sunday.
#Person2#:Can I have more information for the train you're needing? Where are you departing from and arriving to?
#Person1#:I am leaving from Cambridge and going to Norwich.
#Person2#:I have train TR1840 leaving at 16:36 is that okay?
#Person1#:book for 5 people and get me the reference number
#Person2#:You're all set. Reference number is NJB87PAP . Is there anything else I can help you with today?
#Person1#:No, this is all I will need. Thank you.
#Person2#:Thank for calling us today. I hope you have 

# Dataset #7: Taskmaster-2

~17K Dialogue

Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs, Taskmaster-2 consists entirely of spoken two-person dialogs. In addition, while Taskmaster-1 is almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs, as seen for example in the restaurants, flights, hotels, and movies verticals. The music browsing and sports conversations are almost exclusively search- and recommendation-based. All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced workers played the role of a 'user' and trained call center operators played the role of the 'assistant'. In this way, users were led to believe they were interacting with an automated system that “spoke” using text-to-speech (TTS) even though it was in fact a human behind the scenes. As a result, users could express themselves however they chose in the context of an automated interface.

The Taskmaster-2 dataset consists of 17,289 dialogs in the seven domains below. 

- restaurants (3276)
- food ordering (1050)
- movies (3047)
- hotels (2355)
- flights (2481)
- music (1602)
- sports (3478)

Source: 
- https://arxiv.org/abs/1909.05358
- https://github.com/google-research-datasets/Taskmaster/tree/master/TM-2-2020

In [29]:
def get_Taskmaster_2(dest: str, order: int=7) -> tuple[pd.DataFrame,  dict]:

    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        urls = [
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/flights.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/food-ordering.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/hotels.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/movies.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/music.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/restaurant-search.json',
            'https://www.github.com/google-research-datasets/Taskmaster/raw/master/TM-2-2020/data/sports.json',
        ]

        # download the files
        files = helperFunctions.download_from_list(urls, dest, override=False)
        files = list(map(lambda f: os.path.join(dest, f), files))

        return files

    def parse_dialogue(dialogue: str) -> str:
        speakers = set()
        for entry in dialogue:
            speaker = entry['speaker']
            speakers.add(speaker)
    
        if len(speakers) != 2:
            return ""
        
        speaker1, speaker2 = list(speakers)
        if (speaker1 != dialogue[0]['speaker']):
            speaker1, speaker2 = speaker2, speaker1

        mapper = {speaker1:"#Person1#:", speaker2:"#Person2#:"}

        output = []
        for entry in dialogue:
            speaker = entry['speaker']
            turn = mapper[speaker] + entry['text']
            output.append(turn)

        return '\n'.join(output)

    def preprocess(files: list[str]) -> pd.DataFrame:
        # load the json files into a single dataframe
        df = functools.reduce(lambda x, y: pd.concat([x, y]), (map(pd.read_json, files)))

        # reorder the columns
        df = df[['utterances', 'instruction_id', 'conversation_id']]
        df = df.rename(columns={'utterances':'dialogue'})

        # reformat the dialogues and label speakers
        df['dialogue'] = df['dialogue'].apply(parse_dialogue)
        df.to_csv(os.path.join(dest, "df.csv"), index=False)

        df = df[df['dialogue'].notna()].reset_index(drop=True)
        df.index = np.arange(0, df.shape[0])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"), index=False)

        info_dict = {
            "name":"Taskmaster-2", 
            "source":"https://github.com/google-research-datasets/Taskmaster/tree/master/TM-2-2020",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest7 = "DataEngineering/Datasets/dataset7"
df7, info_dict7 = get_Taskmaster_2(dest7)

print("Dataset Shape:", df7.shape)
df7.head(n=5)

File 'flights.json' exists! Enable override to override it.
File 'food-ordering.json' exists! Enable override to override it.
File 'hotels.json' exists! Enable override to override it.
File 'movies.json' exists! Enable override to override it.
File 'music.json' exists! Enable override to override it.
File 'restaurant-search.json' exists! Enable override to override it.
File 'sports.json' exists! Enable override to override it.
Dataset Shape: (17304, 3)


Unnamed: 0,dialogue,instruction_id,conversation_id
0,#Person1#:Hello. I'd like to find a round trip...,flight-12,dlg-00100680-00e0-40fe-8321-6d81b21bfc4f
1,"#Person1#:Hi, how can I help you?\n#Person2#:H...",flight-7,dlg-0027f924-7723-48bc-bf18-e41c03b0e709
2,"#Person1#:Hi, I'm looking for a flight. I need...",flight-6,dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013
3,"#Person1#:Hi assistant, need help finding a fl...",flight-11,dlg-005d7a68-35ec-4ed0-a0ab-715a499b48b7
4,#Person1#:Hi. How can I help you?\n#Person2#:H...,flight-6,dlg-006d8337-fc53-4aac-8895-b2f0caa14baa


In [30]:
dialogueUtils.preview_dialogues(df7, n=3)

#Person1#:Hello. I'd like to find a round trip commercial airline flight from San Francisco to Denver.
#Person2#:Hello, how can I help you?
#Person2#:San Francisco to Denver, got it.
#Person1#:You're really on top of things. I like that.
#Person2#:So what days are you looking to fly?
#Person2#:Hey, what else can you say?
#Person1#:I'm looking to fly out sometime today, the earliest time today, and I'll be returning in 4 days.
#Person1#:So, I would like to fly out sometime tonight and fly back in the evening in 4 days. From I'm looking to go to Denver. I'm flying out of San Francisco.
#Person2#:That sounds good, where you looking to go?
#Person2#:That's right okay we have prices starting at $337.
#Person1#:That sounds very good. I just have two preferences. I want a nonstop flight.
#Person1#:And I'd like to get an aisle seat.
#Person2#:Okay, Non-Stop and if I heard you correctly did you say you wanted to leave as early as possible and also Nile C.
#Person1#:Yes.
#Person2#:Okay, you got 

# Dataset #8: MetaLWOZ

Method: Crowdsourcing

Source: 
- https://www.microsoft.com/en-us/research/project/metalwoz/
- https://www.microsoft.com/en-us/download/58389
- https://www.microsoft.com/en-us/download/100639


Description: Meta-Learning Wizard-of-Oz (MetaLWOz) is a dataset designed to help develop models capable of predicting user responses in unseen domains. It can improve dialog systems, such as those used in voice assistants, to help users accomplish tasks such as booking a flight. This dataset is particularly suited for meta-learning dialog models or fine-tuning models with transfer-learning approaches. This dataset aims to reduce the amount of data required to train domain-specific dialog systems and it is one of the first datasets designed with meta-learning dialog models in mind.

The MetaLWOz dataset is being used as the baseline for the DSTC8 dialog competition.

In [31]:
def get_MetaLWOZ(dest: str, order: int=8) -> tuple[pd.DataFrame,  dict]:

    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        urls = [
            'https://download.microsoft.com/download/E/B/8/EB84CB1A-D57D-455F-B905-3ABDE80404E5/metalwoz-v1.zip',
            'https://download.microsoft.com/download/0/c/4/0c4a8893-cbf9-4a43-a44a-09bab9539234/metalwoz-test-v1.zip'
        ]

        filenames = helperFunctions.download_from_list(urls, dest)

        # parse the first zip file
        filename = os.path.join(dest, filenames[0])
        helperFunctions.unzip(filename, dest)

        path_to_dialogues = os.path.join(dest, "metalwoz-v1", "dialogues")
        txt_files = [os.path.join(path_to_dialogues, f) for f in os.listdir(path_to_dialogues)]

        # parse the second zip file
        filename = os.path.join(dest, filenames[1])
        helperFunctions.unzip(filename, dest)

        path_to_dialogues = os.path.join(dest, "metalwoz-test-v1")

        helperFunctions.unzip(os.path.join(path_to_dialogues, "dstc8_multiwoz2.0.zip"), path_to_dialogues)
        path_to_dialogues1 = os.path.join(path_to_dialogues, "dstc8_multiwoz2.0", "dialogues")
        txt_files += [os.path.join(path_to_dialogues1, f) for f in os.listdir(path_to_dialogues1)]

        helperFunctions.unzip(os.path.join(path_to_dialogues, "dstc8_metalwoz_heldout.zip"), path_to_dialogues)
        path_to_dialogues2 = os.path.join(path_to_dialogues, "dstc8_metalwoz_heldout", "dialogues")
        txt_files += [os.path.join(path_to_dialogues2, f) for f in os.listdir(path_to_dialogues2)]

        return txt_files

    def parse_dialogue(utterances: list[str]) -> str:
        output = []
        for i, utterance in enumerate(utterances):
            speaker = "#Person1#:" if i&1 == 0 else "#Person2#:"
            output.append(speaker + utterance.strip())

        return "\n".join(output)

    def preprocess(files: list[str]) -> pd.DataFrame:
        jsonl_objects = sum([helperFunctions.read_jsonl(f) for f in files], [])

        df = helperFunctions.jsonl_to_df(jsonl_objects)
        df = df[["turns", "task_id", "domain", "bot_id", "user_id", "id", "annotations"]]
        df = df.rename(columns={"turns":"dialogue"})

        # reformat the dialogues and label speakers
        df['dialogue'] = df['dialogue'].apply(parse_dialogue)

        df = df[df['dialogue'].notna()].reset_index(drop=True)
        df.index = np.arange(0, df.shape[0])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"), index=False)

        info_dict = {
            "name":"MetaLWOZ", 
            "source":"https://www.microsoft.com/en-us/research/project/metalwoz/",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest8 = "DataEngineering/Datasets/dataset8"
df8, info_dict8 = get_MetaLWOZ(dest8)

print("Dataset Shape:", df8.shape)
df8.head(n=5)

Downloading 'metalwoz-v1.zip'.. Done!
Downloading 'metalwoz-test-v1.zip'.. Done!
Unzipping 'DataEngineering/Datasets/dataset8\raw\metalwoz-v1.zip'.. Done!
Unzipping 'DataEngineering/Datasets/dataset8\raw\metalwoz-test-v1.zip'.. Done!
Unzipping 'DataEngineering/Datasets/dataset8\raw\metalwoz-test-v1\dstc8_multiwoz2.0.zip'.. Done!
Unzipping 'DataEngineering/Datasets/dataset8\raw\metalwoz-test-v1\dstc8_metalwoz_heldout.zip'.. Done!
Dataset Shape: (43127, 7)


Unnamed: 0,dialogue,task_id,domain,bot_id,user_id,id,annotations
0,#Person1#:Hello how may I help you?\n#Person2#...,a9203a2c,AGREEMENT_BOT,c96edf42,c05f0462,c399a493,
1,#Person1#:Hello how may I help you?\n#Person2#...,d47b54df,AGREEMENT_BOT,bcc50983,46fe62d7,2888aa3e,
2,#Person1#:Hello how may I help you?\n#Person2#...,83ad6a66,AGREEMENT_BOT,97fcd3ba,f840ce6a,17a8685a,
3,#Person1#:Hello how may I help you?\n#Person2#...,83ad6a66,AGREEMENT_BOT,843e209d,ae15d73b,b9ae2ba5,
4,#Person1#:Hello how may I help you?\n#Person2#...,a9203a2c,AGREEMENT_BOT,6e6f928c,ae15d73b,f153593e,


In [32]:
dialogueUtils.preview_dialogues(df8, n=3)

#Person1#:Hello how may I help you?
#Person2#:i am awesome
#Person1#:of course you are
#Person2#:and i own rental properties on the moon
#Person1#:i doubt you own a property in the moon
#Person2#:just kidding. i own them on Earth
#Person1#:that's a nice joke
#Person2#:because i am a billionaire!
#Person1#:i don't seem to know you
#Person2#:and i programmed you
#Person1#:i am the programmer
------------------------------
#Person1#:Hello how may I help you?
#Person2#:I am the king of the world
#Person1#:I agree that you are the king of the world
#Person2#:I can have any woman I want!
#Person1#:I agree that you can have any woman you desire.
#Person2#:Even you bot, if I were in to AIs
#Person1#:Agreed.
#Person2#:Really? you're awfully agreeable aren't you
#Person1#:I agree that I am awfully agreeable, yes.
#Person2#:Having an agreement bot seems like a useless thing to have. I need some spice in my life!
#Person1#:I really agree with that. I am rather useles.
-----------------------------

# Dataset #9: Taskmaster-1

~13K dialogues

The dataset consists of 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.

Source:
- https://github.com/google-research-datasets/Taskmaster/tree/master/TM-1-2019

In [33]:
def get_Taskmaster_1(dest: str, order: int=9) -> tuple[pd.DataFrame,  dict]:
    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        urls = [
            'https://github.com/google-research-datasets/Taskmaster/raw/master/TM-1-2019/self-dialogs.json',
            'https://github.com/google-research-datasets/Taskmaster/raw/master/TM-1-2019/woz-dialogs.json'
        ]

        # download the files
        files = helperFunctions.download_from_list(urls, dest, override=False)
        files = list(map(lambda f: os.path.join(dest, f), files))

        return files

    def preprocess(files: list[str]) -> pd.DataFrame:
        # load the json files into a single dataframe
        df = functools.reduce(lambda x, y: pd.concat([x, y]), (map(pd.read_json, files)))

        # reorder the columns
        df = df[['utterances', 'instruction_id', 'conversation_id']]
        df = df.rename(columns={'utterances':'dialogue'})

        # reformat the dialogues and label speakers
        df['dialogue'] = df['dialogue'].apply(dialogueUtils.parse_dialogue)

        df = df[df['dialogue'].notna()]#.reset_index(drop=True)

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"), index=False)

        info_dict = {
            "name":"Taskmaster-1", 
            "source":"https://github.com/google-research-datasets/Taskmaster/tree/master/TM-1-2019",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest9 = "DataEngineering/Datasets/dataset9"
df9, info_dict9 = get_Taskmaster_1(dest9)

print("Dataset Shape:", df9.shape)
df9.head(n=5)

Downloading 'self-dialogs.json'.. Done!
Downloading 'woz-dialogs.json'.. Done!
Dataset Shape: (13215, 3)


Unnamed: 0,dialogue,instruction_id,conversation_id
0,"#Person1#:Hi, I'm looking to book a table for ...",restaurant-table-2,dlg-00055f4e-4a46-48bf-8d99-4e477663eb23
1,#Person1#:Hi I would like to see if the Movie ...,movie-tickets-1,dlg-0009352b-de51-474b-9f13-a2b0b2481546
2,#Person1#:I want to watch avengers endgame\n#P...,movie-tickets-3,dlg-00123c7b-15a0-4f21-9002-a2509149ee2d
3,#Person1#:I want to order a pizza from Bertucc...,pizza-ordering-2,dlg-0013673c-31c6-4565-8fac-810e173a5c53
4,#Person1#:Hi I'd like to order two large pizza...,pizza-ordering-2,dlg-001d8bb1-6f25-4ecd-986a-b7eeb5fa4e19


In [34]:
dialogueUtils.preview_dialogues(df9, n=3)

#Person1#:Hi, I'm looking to book a table for Korean fod.
#Person2#:Ok, what area are you thinking about?
#Person1#:Somewhere in Southern NYC, maybe the East Village?
#Person2#:Ok, great.  There's Thursday Kitchen, it has great reviews.
#Person1#:That's great. So I need a table for tonight at 7 pm for 8 people. We don't want to sit at the bar, but anywhere else is fine.
#Person2#:They don't have any availability for 7 pm.
#Person1#:What times are available?
#Person2#:5 or 8.
#Person1#:Yikes, we can't do those times.
#Person2#:Ok, do you have a second choice?
#Person1#:Let me check.
#Person2#:Ok.
#Person1#:Lets try Boka, are they free for 8 people at 7?
#Person2#:Yes.
#Person1#:Great, let's book that.
#Person2#:Ok great, are there any other requests?
#Person1#:No, that's it, just book.
#Person2#:Great, should I use your account you have open with them?
#Person1#:Yes please.
#Person2#:Great. You will get a confirmation to your phone soon.
------------------------------
#Person1#:Hi I wou

# Dataset #10: Taskmaster-3

The Taskmaster-3 (aka TicketTalk) dataset consists of 23,789 movie ticketing dialogs (located in Taskmaster/TM-3-2020/data/). By "movie ticketing" we mean conversations where the customer's goal is to purchase tickets after deciding on theater, time, movie name, number of tickets, and date, or opt out of the transaction.

Source:
- https://github.com/google-research-datasets/Taskmaster/tree/master/TM-3-2020

## Load & Parse

In [41]:
def get_Taskmaster_3(dest: str, order: int=10) -> tuple[pd.DataFrame,  dict]:

    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        urls = ['https://github.com/google-research-datasets/Taskmaster/raw/master/TM-3-2020/data/data_00.json']

        for i in range(1, 20):
            urls.append(urls[0].replace("00", f"{i:02}"))


        # download the files
        files = helperFunctions.download_from_list(urls, dest, override=False)
        files = list(map(lambda f: os.path.join(dest, f), files))

        return files

    def preprocess(files: list[str]) -> pd.DataFrame:
        # load the json files into a single dataframe
        df = functools.reduce(lambda x, y: pd.concat([x, y]), (map(pd.read_json, files)))
        df = df.drop(columns=['vertical'])

        # reorder the columns
        df = df[['utterances', 'scenario', 'instructions', 'conversation_id']]
        df = df.rename(columns={'utterances':'dialogue'})

        # reformat the dialogues and label speakers
        df['dialogue'] = df['dialogue'].apply(dialogueUtils.parse_dialogue)

        df = df[df['dialogue'].notna()].reset_index(drop=True)
        df.index = np.arange(0, df.shape[0])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"))

        info_dict = {
            "name":"Taskmaster-3", 
            "source":"https://github.com/google-research-datasets/Taskmaster/tree/master/TM-3-2020",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest10 = "DataEngineering/Datasets/dataset10"
df10, info_dict10 = get_Taskmaster_3(dest10)

print("Dataset Shape:", df10.shape)
df10.head(n=5)

File 'data_00.json' exists! Enable override to override it.
File 'data_01.json' exists! Enable override to override it.
File 'data_02.json' exists! Enable override to override it.
File 'data_03.json' exists! Enable override to override it.
File 'data_04.json' exists! Enable override to override it.
File 'data_05.json' exists! Enable override to override it.
File 'data_06.json' exists! Enable override to override it.
File 'data_07.json' exists! Enable override to override it.
File 'data_08.json' exists! Enable override to override it.
File 'data_09.json' exists! Enable override to override it.
File 'data_10.json' exists! Enable override to override it.
File 'data_11.json' exists! Enable override to override it.
File 'data_12.json' exists! Enable override to override it.
File 'data_13.json' exists! Enable override to override it.
File 'data_14.json' exists! Enable override to override it.
File 'data_15.json' exists! Enable override to override it.
File 'data_16.json' exists! Enable overr

Unnamed: 0,dialogue,scenario,instructions,conversation_id
0,#Person1#:hi....am buying a ticket tonight so ...,Auto template 1 with theater name error,"SCENARIO: In the conversation below, a custome...",dlg-bca5ce0a-056f-446e-be94-3ba77b32a84f
1,#Person1#:I am looking for tickets tonight at ...,Auto template 1 with theater name error,"SCENARIO: In the conversation below, a custome...",dlg-bd494e2c-36f6-4529-8e4d-d5c4d64388ae
2,#Person1#:I need to get some tickets for a mov...,Auto template 1 with theater name error,"SCENARIO: In the conversation below, a custome...",dlg-c9064676-75fe-4d0a-83c2-497e1f2115a6
3,#Person1#:I need help finding showtimes for to...,Auto template 1 with theater name error,"SCENARIO: In the conversation below, a custome...",dlg-f7500bcf-472c-48c3-adfd-e4ec9f63bcf1
4,"#Person1#:Hello, I am interested in buying tic...",Auto template 1 with theater name error,"SCENARIO: In the conversation below, a custome...",dlg-df1f0d45-27f2-4fb0-8aaa-b6b5f5a843bb


In [42]:
dialogueUtils.preview_dialogues(df10, n=3)

#Person1#:hi....am buying a ticket tonight so we go and see a movie at AMC mountain 16
#Person2#:No problem. Is there a particular type of movie you’re looking for?
#Person1#:hhhmmmmm not at all. i dont have any in mind for now
#Person2#:Sure. I can help with that. Let me listings at AMC Mercado 24.
#Person1#:sure you can but i want to see the movie at AMC mountain 16
#Person2#:Oh, sorry about that. So you’re interested in action films at AMC Mountain 16, right?
#Person1#:yeah
#Person2#:OK. I show one action movie playing at AMC Mountain 16: No Time To Die. Remaining showtimes are 4:30pm, 6:40pm and 9:10pm. Does any of those work?
#Person1#:yeah but 9.10pm will be perfect for me
#Person2#:Great. And how many tickets?
#Person1#:myself and two other persons are going to see a movie
#Person2#:All right. Let me confirm that you’d like three tickets for No Time To Die at AMC Mountain 16 tonight at 9:10pm. Is that all correct?
#Person1#:yeah
#Person2#:Is it OK to go ahead and purchase these 

# Dataset #11: The Schema-Guided Dialogue Dataset

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, such as banks, events, media, calendar, travel, and weather.

Source:
- https://github.com/google-research-datasets/dstc8-schema-guided-dialogue


Data collection approach:
- https://arxiv.org/pdf/1801.04871.pdf

In [46]:
def get_SchemaGuidedDialogue(dest: str, order: int=11) -> tuple[pd.DataFrame,  dict]:

    def download(dest: str) -> list[str]:
        dest = os.path.join(dest, 'raw')
        os.makedirs(dest, exist_ok=True)

        dev_urls = ['https://github.com/google-research-datasets/dstc8-schema-guided-dialogue/raw/master/dev/dialogues_001.json']
        for i in range(2, 21):
            dev_urls.append(dev_urls[0].replace("001", f"{i:03}"))

        test_urls = ['https://github.com/google-research-datasets/dstc8-schema-guided-dialogue/raw/master/test/dialogues_001.json']
        for i in range(2, 35):
            test_urls.append(test_urls[0].replace("001", f"{i:03}"))

        train_urls =['https://github.com/google-research-datasets/dstc8-schema-guided-dialogue/raw/master/train/dialogues_001.json']
        for i in range(2, 128):
            train_urls.append(train_urls[0].replace("001", f"{i:03}"))

        dest_dev = os.path.join(dest, "dev")
        dest_test = os.path.join(dest, "test")
        dest_train = os.path.join(dest, "train")

        # download the files
        dev_files = helperFunctions.download_from_list(dev_urls, dest_dev, override=False)
        files = list(map(lambda f: os.path.join(dest_dev, f), dev_files))

        test_files = helperFunctions.download_from_list(test_urls, dest_test, override=False)
        files += list(map(lambda f: os.path.join(dest_test, f), test_files))

        train_files = helperFunctions.download_from_list(train_urls, dest_train, override=False)
        files += list(map(lambda f: os.path.join(dest_train, f), train_files))

        return files

    def preprocess(files: list[str]) -> pd.DataFrame:
        # load the json files into a single dataframe
        df = functools.reduce(lambda x, y: pd.concat([x, y]), (map(pd.read_json, files)))

        # reorder the columns
        df = df[['turns', 'services', 'dialogue_id']]
        df = df.rename(columns={"turns": "dialogue"})

        df['dialogue'] = df['dialogue'].apply(dialogueUtils.parse_dialogue, args=('utterance',))

        df = df[df['dialogue'].notna()].reset_index(drop=True)
        df.index = np.arange(0, df.shape[0])

        return df

    def save(df: pd.DataFrame, dest: str) -> dict:
        dest = os.path.join(dest, 'preprocessed')
        os.makedirs(dest, exist_ok=True)

        df.to_csv(os.path.join(dest, f"df{order}.csv"), index=False)

        info_dict = {
            "name":"Schema-Guided Dialogue", 
            "source":"https://github.com/google-research-datasets/dstc8-schema-guided-dialogue",
            "order": order
        }

        helperFunctions.save_as_json(info_dict, f"info{order}.json", dest)

        return info_dict

    dataset_files = download(dest)
    df = preprocess(dataset_files)
    info_dict = save(df, dest)

    return df, info_dict

dest11 = "DataEngineering/Datasets/dataset10"
df11, info_dict11 = get_SchemaGuidedDialogue(dest11)

print("Dataset Shape:", df11.shape)
df11.head(n=5)

File 'dialogues_001.json' exists! Enable override to override it.
File 'dialogues_002.json' exists! Enable override to override it.
File 'dialogues_003.json' exists! Enable override to override it.
File 'dialogues_004.json' exists! Enable override to override it.
File 'dialogues_005.json' exists! Enable override to override it.
File 'dialogues_006.json' exists! Enable override to override it.
File 'dialogues_007.json' exists! Enable override to override it.
File 'dialogues_008.json' exists! Enable override to override it.
File 'dialogues_009.json' exists! Enable override to override it.
File 'dialogues_010.json' exists! Enable override to override it.
File 'dialogues_011.json' exists! Enable override to override it.
File 'dialogues_012.json' exists! Enable override to override it.
File 'dialogues_013.json' exists! Enable override to override it.
File 'dialogues_014.json' exists! Enable override to override it.
File 'dialogues_015.json' exists! Enable override to override it.
File 'dial

Unnamed: 0,dialogue,services,dialogue_id
0,#Person1#:I want to make a restaurant reservat...,[Restaurants_2],100000
1,#Person1#:I am not in the mood to cook today. ...,[Restaurants_2],100001
2,#Person1#:I want to reserve a table at a resta...,[Restaurants_2],100002
3,#Person1#:I would like to make a restaurant re...,[Restaurants_2],100003
4,#Person1#:I want to make a dinner reservation ...,[Restaurants_2],100004


In [47]:
dialogueUtils.preview_dialogues(df11, n=3)

#Person1#:I want to make a restaurant reservation for 2 people at half past 11 in the morning.
#Person2#:What city do you want to dine in? Do you have a preferred restaurant?
#Person1#:Please find restaurants in San Jose. Can you try Sino?
#Person2#:Confirming: I will reserve a table for 2 people at Sino in San Jose. The reservation time is 11:30 am today.
#Person1#:Yes, thanks. What's their phone number?
#Person2#:Your reservation has been made. Their phone number is 408-247-8880.
#Person1#:What's their address? Do they have vegetarian options on their menu?
#Person2#:The street address is 377 Santana Row #1000. They have good vegetarian options.
#Person1#:Thanks very much.
#Person2#:Is there anything else I can help you with?
#Person1#:No, that's all. Thanks.
#Person2#:Have a great day.
------------------------------
#Person1#:I am not in the mood to cook today. I want to eat out at a restaurant instead.
#Person2#:Which area would you like me to look in? Which restaurant would you li

# Dataset #12: MSR-E2E

~10K Dialogue

Microsoft end-to-
end dialogue challenge has 10,087 dialogues in
three domains, movie-ticket booking, restaurant
reservation, and taxi booking. It also includes an
experiment platform with built-in simulators in
each domain.


Source:
- https://github.com/xiul-msr/e2e_dialog_challenge

In [None]:
urls = ['https://github.com/xiul-msr/e2e_dialog_challenge/raw/master/data/movie_all.tsv',
        'https://github.com/xiul-msr/e2e_dialog_challenge/raw/master/data/restaurant_all.tsv',
        'https://github.com/xiul-msr/e2e_dialog_challenge/raw/master/data/taxi_all.tsv']


dialogues = []
acts = []
topics = []

known_topics = ['movie-ticket booking', 'restaurant reservation', 'taxi booking']

for i, url in enumerate(urls):
    df = pd.read_csv(url, sep='\t', on_bad_lines='skip')
    for session in df['session.ID'].unique():
        session_mask = df['session.ID'] == session

        sorted_session = df[session_mask].sort_values(by=["Message.ID"])
        try:
            dialogue = sorted_session['Message.Text'].to_list()
            dialogue = [("#Person1#:" if i&1 == 0 else "#Person2#:")+l for i, l in enumerate(dialogue)]
            
            dialogues.append('\n'.join(dialogue))

        except:
            continue

        try:
            acts.append(sorted_session['Dialog-Acts'].to_list())
        
        except:
            acts.append(sorted_session['Annotation_Result'].to_list())
        
        topics.append(known_topics[i])

df12 = pd.DataFrame()
df12['dialogue'] = dialogues
df12['acts'] = acts
df12['topic'] = topics

dest12 = "Datasets/dataset12"
os.makedirs(dest12, exist_ok=True)

df12 = df12[df12['dialogue'].notna()].reset_index(drop=True)
df12.index = np.arange(0, df12.shape[0])

df12.to_csv(os.path.join(dest12, "df12.csv"), index=False)

info_dict12 = {"name":"MSR-E2E", 
               "source":"https://github.com/xiul-msr/e2e_dialog_challenge",
               "order":12}

helperFunctions.save_as_json(info_dict12, "info12.json", dest12)

# print the first 3 dialogues
for i in range(3):
    print(df12.iloc[i]['dialogue'])
    print("-"*60)

df12

#Person1#:I'd like 2 tickets to see Zoolander 2 tomorrow at Regal Meridian 16 theater in Seattle at 9:25 PM
#Person2#:Okay, your purchase of 2 tickets for Zoolander 2 is confirmed.
------------------------------------------------------------
#Person1#:Hi! are there any good foreign movies showing around Houma, Louisiana this week?
#Person2#:What date would you like me to look for a reservation?
#Person1#:How about the 9th?
#Person2#:Unfortunately, there are no foreign movies playing at this time. Do you have another genre that you're interested in?
#Person1#:Is there something that's maybe a good intelligent comedy?
#Person2#:Whiskey Tango Foxtrot is the only Adult comedy I see playing in your area. Would you like to try that?
#Person1#:I guess I'll have to. Any night showing will be fine.
#Person2#:Whisky Tango Foxtrot is playing at the AMC HOUMA PALACE 10 5737 W Park Ave., Houma, LA 70364 at 11:40am 2:15pm 5:00pm 7:40pm.  Does one of those times work?  How many tickets would you need

Unnamed: 0,dialogue,acts,topic
0,#Person1#:I'd like 2 tickets to see Zoolander ...,[request(ticket;moviename=Zoolander 2;date=tom...,movie-ticket booking
1,#Person1#:Hi! are there any good foreign movie...,"[greeting(greeting=hi), request(date), inform(...",movie-ticket booking
2,#Person1#:Show me restaurants near the space n...,[request(other;distanceconstraints=near the sp...,movie-ticket booking
3,#Person1#:Hi! I'm looking for good thriller. A...,"[greeting(greeting=Hi) , inform(moviename={The...",movie-ticket booking
4,"#Person1#:Hello\n#Person2#:Hello there, are yo...","[greeting(greeting=Hello), greeting(greeting=H...",movie-ticket booking
...,...,...,...
10076,#Person1#:I would like to have a car pick me u...,[request(taxi;pickup_location=Bellevue Hospita...,taxi booking
10077,#Person1#:Hi I would like to have a Taxi take ...,[request(taxi;greeting=Hi;date=today;pickup_ti...,taxi booking
10078,#Person1#:Hi I would like to have a Taxi take ...,[request(taxi;greeting=Hi;dropoff_location=the...,taxi booking
10079,#Person1#:I need a taxi to grab me from my hou...,[request(taxi;numberofpeople=1;date=tomorrow n...,taxi booking


# DSTC Datasets

The Dialog State Tracking Challenge (DSTC) --- formerly the Dialog State Tracking Challenge DST --- is an on-going series of research community challenge tasks. Each task released dialog data labeled with dialog state information, such as the user’s desired restaurant search query given all of the dialog history up to the current turn. The challenge is to create a “tracker” that can predict the dialog state for new dialogs. In each challenge, trackers are evaluated using held-out dialog data.

# Discarded Datasets

## The Gutenberg Dialogue Dataset

Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). The Gutenberg Dialogue dataset tries to narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. The dialogues were extracted and processed  from public-domain books made available by Project Gutenberg.


Source:
- https://arxiv.org/abs/2004.12752
- https://github.com/ricsinaruto/gutenberg-dialog


In [None]:
# # they created a mini script to download their dataset
# # so I download their github repo and then execute the script 

# url = "https://www.github.com/ricsinaruto/gutenberg-dialog/archive/refs/heads/master.zip"

# dest5 = "Datasets/dataset5"

# filename = helperFunctions.download_file(url, dest5)
# filename = os.path.join(dest5, filename)

# extracted_folder = helperFunctions.unzip(filename, dest5)

# %cd Datasets/dataset5/gutenberg-dialog-master/
# !python setup.py
# !python code/main.py -d -l=en -f1 -f2
# %cd ../../..