# References

- https://github.com/facebookresearch/EmpatheticDialogues
- https://paperswithcode.com/dataset/empatheticdialogues
- https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report028.pdf
- https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb
- https://towardsdatascience.com/dialogpt-large-scale-generative-pre-training-for-conversational-response-generation-5ceb783428dc
- https://jalammar.github.io/illustrated-gpt2/
- https://huggingface.co/blog/how-to-generate

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import os
import re

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import (
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM
)
from datasets import Dataset, list_metrics, load_metric


MODEL_NAME = "microsoft/DialoGPT-small"
model_cls = AutoModelForCausalLM
tokenizer_cls = AutoTokenizer

In [3]:
!wget https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz
!tar xzvf empatheticdialogues.tar.gz
!mv empatheticdialogues.tar.gz ../data/
!mv empatheticdialogues ../data/

--2022-05-09 23:07:27--  https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28022709 (27M) [application/gzip]
Saving to: ‘empatheticdialogues.tar.gz’


2022-05-09 23:07:29 (26.1 MB/s) - ‘empatheticdialogues.tar.gz’ saved [28022709/28022709]

empatheticdialogues/
empatheticdialogues/test.csv
empatheticdialogues/train.csv
empatheticdialogues/valid.csv


In [2]:
!ls ../data

empatheticdialogues	    hopper-dialoGPT-1
empatheticdialogues.tar.gz  processed.csv


In [3]:
!ls ../data/empatheticdialogues/

test.csv  train.csv  valid.csv


In [3]:
filename = "../data/empatheticdialogues/train.csv"
df = pd.read_csv(filename, encoding="utf-8", on_bad_lines="skip")
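
Because `on_bad_lines="skip"` silently drops malformed rows, it can be worth checking how many rows survive the load. The following is a minimal sketch on a toy CSV (the column subset and the sample rows are assumptions for illustration, not the real file) showing that a row with an extra unquoted comma is discarded:

```python
import io

import pandas as pd

# Toy CSV (hypothetical data): the last row contains an extra, unquoted comma,
# so it parses to 4 fields while the header declares only 3.
raw = (
    "conv_id,utterance_idx,utterance\n"
    "c1,1,Hello there\n"
    "c1,2,Hi\n"
    "c2,1,Well, this breaks\n"
)

# Same option as in the cell above: rows with too many fields are skipped.
kept = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# Count the data rows in the raw text (every line except the header).
total_rows = raw.count("\n") - 1
print(f"kept {len(kept)} of {total_rows} rows")
```

The same comparison against the line count of `train.csv` would give a rough idea of how much data the skip discards, though multi-line quoted fields would make a plain line count only approximate.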

In [4]:
df.head(5)

Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,


In [5]:
df

Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,
...,...,...,...,...,...,...,...,...
76663,hit:12424_conv:24848,5,sentimental,I found some pictures of my grandma in the att...,389,Yeah reminds me of the good old days. I miss ...,5|5|5_5|5|5,
76664,hit:12424_conv:24849,1,surprised,I woke up this morning to my wife telling me s...,294,I woke up this morning to my wife telling me s...,5|5|5_5|5|5,
76665,hit:12424_conv:24849,2,surprised,I woke up this morning to my wife telling me s...,389,Oh hey that's awesome! That is awesome right?,5|5|5_5|5|5,
76666,hit:12424_conv:24849,3,surprised,I woke up this morning to my wife telling me s...,294,It is soooo awesome. We have been wanting a b...,5|5|5_5|5|5,


In [6]:
base_tokenizer = tokenizer_cls.from_pretrained(MODEL_NAME)

In [7]:
df["text"] = df[["conv_id", "prompt", "utterance"]].groupby("conv_id")["utterance"].transform(lambda x: base_tokenizer.eos_token.join(x))

In [14]:
df.iloc[0]["text"]

'I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.<|endoftext|>Was this a friend you were in love with_comma_ or just a best friend?<|endoftext|>This was a best friend. I miss her.<|endoftext|>Where has she gone?<|endoftext|>We no longer talk.<|endoftext|>Oh was this something that happened because of an argument?'
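
The joined text above still contains the dataset's `_comma_` placeholder. The notebook does not replace it; the sketch below shows one possible cleanup step, assuming a literal comma is the intended character. The `sample` series is a stand-in for `df["utterance"]`.

```python
import pandas as pd

# Stand-in for df["utterance"]; the first string is taken from the output above.
sample = pd.Series(
    [
        "Although there was a lot of people_comma_ we felt like the only people in the world.",
        "Where has she gone?",
    ]
)

# Literal (non-regex) replacement of the placeholder with a real comma.
cleaned = sample.str.replace("_comma_", ",", regex=False)
print(cleaned[0])
```

If this step is used, it would need to be applied to `utterance` before the utterances are joined, so that the joined `text` column inherits the cleaned strings.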

In [11]:
train_df = pd.DataFrame({})
train_df["text"] = df[["conv_id", "prompt", "utterance"]].groupby("conv_id")["utterance"].transform(lambda x: base_tokenizer.eos_token.join(x))

In [15]:
train_df.iloc[0]["text"]

'I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.<|endoftext|>Was this a friend you were in love with_comma_ or just a best friend?<|endoftext|>This was a best friend. I miss her.<|endoftext|>Where has she gone?<|endoftext|>We no longer talk.<|endoftext|>Oh was this something that happened because of an argument?'

In [64]:
def prepare_data(df):
    # Join each conversation's utterances with the tokenizer's EOS token.
    # transform() repeats the joined string on every row of the conversation.
    text = df.groupby("conv_id")["utterance"].transform(
        lambda x: base_tokenizer.eos_token.join(x)
    )
    # Keep each distinct conversation string once, in order of first appearance.
    return pd.DataFrame({"text": text.unique()})

In [65]:
trn_df = prepare_data(df)
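
An alternative to `transform` followed by `unique` is to aggregate directly to one row per conversation. This is only a sketch under stated assumptions: `conversations_to_text` is a hypothetical helper, `EOS` is hard-coded to the `<|endoftext|>` string seen in the outputs above instead of being read from `base_tokenizer`, and the toy frame stands in for `df`. Unlike `prepare_data`, it sorts turns by `utterance_idx` explicitly and keeps `conv_id`, so two conversations with identical text would both be retained.

```python
import pandas as pd

# Assumption: same string as base_tokenizer.eos_token for DialoGPT.
EOS = "<|endoftext|>"


def conversations_to_text(frame: pd.DataFrame, eos: str = EOS) -> pd.DataFrame:
    """Hypothetical helper: return one row per conv_id with turns joined by `eos`."""
    # Order turns explicitly instead of relying on the row order of the CSV.
    ordered = frame.sort_values(["conv_id", "utterance_idx"])
    joined = ordered.groupby("conv_id", sort=False)["utterance"].agg(
        lambda turns: eos.join(turns)
    )
    # Turn the conv_id index back into a column and name the joined column "text".
    return joined.reset_index(name="text")


# Toy stand-in for df, with the turns of c1 deliberately out of order.
toy = pd.DataFrame(
    {
        "conv_id": ["c1", "c1", "c2"],
        "utterance_idx": [2, 1, 1],
        "utterance": ["Hi", "Hello", "Bye"],
    }
)
result = conversations_to_text(toy)
print(result)
```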

In [66]:
for row in trn_df[:10].iterrows():
    print(row)

(0, text    I remember going to see the fireworks with my ...
Name: 0, dtype: object)
(1, text     it feels like hitting to blank wall when i se...
Name: 1, dtype: object)
(2, text    Hi how are you doing today<|endoftext|>doing g...
Name: 2, dtype: object)
(3, text    I have never cheated on my wife.<|endoftext|>A...
Name: 3, dtype: object)
(4, text    Job interviews always make me sweat bullets_co...
Name: 4, dtype: object)
(5, text    Hi_comma_ this year_comma_ I was the first ove...
Name: 5, dtype: object)
(6, text    I lost my job last year and got really angry.<...
Name: 6, dtype: object)
(7, text    During christmas a few years ago_comma_ I did ...
Name: 7, dtype: object)
(8, text    My coworker is allowed to work remotely_comma_...
Name: 8, dtype: object)
(9, text    The other night I was alone and heard a nose c...
Name: 9, dtype: object)


In [67]:
pd.DataFrame({"text": trn_df.text.unique()})

Unnamed: 0,text
0,I remember going to see the fireworks with my ...
1,it feels like hitting to blank wall when i se...
2,Hi how are you doing today<|endoftext|>doing g...
3,I have never cheated on my wife.<|endoftext|>A...
4,Job interviews always make me sweat bullets_co...
...,...
17834,I was watching professional rodeo last night. ...
17835,I am waiting to see if I pass the GRE.<|endoft...
17836,What a scary night that was.<|endoftext|>What ...
17837,I was going through the stuff in my attic last...
