
## Get the Dataset
We need to install the `datasets` module to download the [DialogRE](https://huggingface.co/datasets/dataset-org/dialog_re) dataset.

In [1]:
! pip install datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading x

Then we can download the dataset.

In [2]:
from datasets import load_dataset

dataset = load_dataset("dataset-org/dialog_re", download_mode="force_redownload", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.45k [00:00<?, ?B/s]

dialog_re.py:   0%|          | 0.00/4.83k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/752k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/726k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1073 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/357 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/358 [00:00<?, ? examples/s]

Then view the Dataset and its contents.

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['dialog', 'relation_data'],
        num_rows: 1073
    })
    test: Dataset({
        features: ['dialog', 'relation_data'],
        num_rows: 357
    })
    validation: Dataset({
        features: ['dialog', 'relation_data'],
        num_rows: 358
    })
})

In [4]:
dataset['train'][0]

{'dialog': ["Speaker 1: It's been an hour and not one of my classmates has shown up! I tell you, when I actually die some people are gonna get seriously haunted!",
  'Speaker 2: There you go! Someone came!',
  "Speaker 1: Ok, ok! I'm gonna go hide! Oh, this is so exciting, my first mourner!",
  'Speaker 3: Hi, glad you could come.',
  'Speaker 2: Please, come in.',
  "Speaker 4: Hi, you're Chandler Bing, right? I'm Tom Gordon, I was in your class.",
  'Speaker 2: Oh yes, yes... let me... take your coat.',
  "Speaker 4: Thanks... uh... I'm so sorry about Ross, it's...",
  'Speaker 2: At least he died doing what he loved... watching blimps.',
  'Speaker 1: Who is he?',
  'Speaker 2: Some guy, Tom Gordon.',
  "Speaker 1: I don't remember him, but then again I touched so many lives.",
  'Speaker 3: So, did you know Ross well?',
  "Speaker 4: Oh, actually I barely knew him. Yeah, I came because I heard Chandler's news. D'you know if he's seeing anyone?",
  'Speaker 3: Yes, he is. Me.',
  'S

## Preprocess the Data
1. Reformat the dataset so that each relation in an item becomes its own sample
2. Preprocess each sample to get its tokens and the positional indices of the entities
3. Create a PyTorch dataset for the data
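
For step 3, a minimal sketch of a map-style dataset is shown below. It follows the `__len__` / `__getitem__` protocol that `torch.utils.data.DataLoader` accepts for map-style datasets. The class name `RelationSamples`, the `encode` callable, and the label handling (a list of label strings per sample, turned into a multi-hot target) are all assumptions for illustration; a real version would usually subclass `torch.utils.data.Dataset` and return tensors.

```python
class RelationSamples:
    """Map-style dataset over reformatted samples (one entity pair per item)."""

    def __init__(self, samples, encode, labels):
        self.samples = samples
        self.encode = encode  # hypothetical callable: sample dict -> model inputs
        self.label_to_id = {label: i for i, label in enumerate(labels)}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Multi-hot target; assumes sample['relation'] is a list of label strings.
        target = [0] * len(self.label_to_id)
        for label in sample['relation']:
            target[self.label_to_id[label]] = 1
        return self.encode(sample), target


# Toy data and a stand-in encoder (whitespace split), for illustration only.
toy_samples = [
    {'dialog': 'Speaker 1: Hi, I am Tom.', 'x': 'Speaker 1', 'y': 'Tom',
     'relation': ['per:alternate_names']},
    {'dialog': 'Speaker 1: Hi, I am Tom.', 'x': 'Tom', 'y': 'Speaker 1',
     'relation': ['per:alternate_names']},
]
labels = ['per:alternate_names', 'unanswerable']
data = RelationSamples(toy_samples, encode=lambda s: s['dialog'].split(), labels=labels)

inputs, target = data[0]
print(len(data), target)
```

Passing the encoder in as a callable keeps the tokenizer choice outside the dataset class, so the same class can be reused when the model name changes.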

### Reformat the Dataset
Convert the dataset so that each item contains a single relation.

In [5]:
def reformat_dataset(dataset, add_triggers=True):
    reformatted_dataset = []

    for item in dataset:
        dialog = item['dialog']
        relation_data = item['relation_data']

        # Join the dialog into a single string
        all_dialog = ' '.join(dialog)

        samples = []
        for x, y, r, t in zip(relation_data['x'], relation_data['y'], relation_data['r'], relation_data['t']):
            sample = {'dialog': all_dialog, 'x': x, 'y': y, 'relation': r}
            if add_triggers:
                sample['trigger'] = t
            samples.append(sample)

        reformatted_dataset.extend(samples)

    return reformatted_dataset
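
To make the effect of the reformatting concrete, the sketch below applies the same per-relation flattening to one hand-written item. The toy item and the helper name `flatten_item` are assumptions for illustration; only the field names `dialog`, `x`, `y`, `r`, and `t` are taken from the function above.

```python
def flatten_item(item, add_triggers=True):
    """Return one sample per (x, y, r, t) entry of a single dataset item."""
    all_dialog = ' '.join(item['dialog'])
    rel = item['relation_data']
    samples = []
    for x, y, r, t in zip(rel['x'], rel['y'], rel['r'], rel['t']):
        sample = {'dialog': all_dialog, 'x': x, 'y': y, 'relation': r}
        if add_triggers:
            sample['trigger'] = t
        samples.append(sample)
    return samples


# Hypothetical toy item: two annotated entity pairs on a two-turn dialog.
toy_item = {
    'dialog': ["Speaker 1: Hi, I'm Tom Gordon.", "Speaker 2: Please, come in."],
    'relation_data': {
        'x': ['Speaker 1', 'Speaker 2'],
        'y': ['Tom Gordon', 'Speaker 1'],
        'r': [['per:alternate_names'], ['unanswerable']],
        't': [[''], ['']],
    },
}

toy_samples = flatten_item(toy_item, add_triggers=False)
print(len(toy_samples))  # one sample per entry in relation_data
print(toy_samples[0])
```

Every sample carries the full joined dialog, so a dialog with many annotated pairs is duplicated once per pair.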

In [16]:
reformatted_dataset = {}
for split in dataset.keys():
    reformatted_dataset[split] = reformat_dataset(dataset[split], add_triggers=False)

print(reformatted_dataset['train'][0])

{'dialog': "Speaker 1: It's been an hour and not one of my classmates has shown up! I tell you, when I actually die some people are gonna get seriously haunted! Speaker 2: There you go! Someone came! Speaker 1: Ok, ok! I'm gonna go hide! Oh, this is so exciting, my first mourner! Speaker 3: Hi, glad you could come. Speaker 2: Please, come in. Speaker 4: Hi, you're Chandler Bing, right? I'm Tom Gordon, I was in your class. Speaker 2: Oh yes, yes... let me... take your coat. Speaker 4: Thanks... uh... I'm so sorry about Ross, it's... Speaker 2: At least he died doing what he loved... watching blimps. Speaker 1: Who is he? Speaker 2: Some guy, Tom Gordon. Speaker 1: I don't remember him, but then again I touched so many lives. Speaker 3: So, did you know Ross well? Speaker 4: Oh, actually I barely knew him. Yeah, I came because I heard Chandler's news. D'you know if he's seeing anyone? Speaker 3: Yes, he is. Me. Speaker 4: What? You... You... Oh! Can I ask you a personal question? Ho-how 

### Preprocess each Sample

In [19]:
import re
from transformers import AutoTokenizer

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [24]:
def preprocess_sample(sample, model_name="bert-base-uncased"):
    """Mark the entities in a sample's dialog, tokenize it, and locate the markers."""
    # NOTE: the tokenizer is reloaded on every call; load it once outside if this is slow.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    SUBJ_TOKEN = '[SUBJ]'
    OBJ_TOKEN = '[OBJ]'
    SEP_TOKEN = '[SEP]'
    dialog = sample['dialog']
    x = sample['x']
    y = sample['y']

    # Replace every mention of the two entities with marker tokens
    dialog = dialog.replace(x, SUBJ_TOKEN)
    dialog = dialog.replace(y, OBJ_TOKEN)

    # Build "d* [SEP] x [SEP] y", plus "[SEP] trigger" when triggers are present
    text = f"{dialog} {SEP_TOKEN} {x} {SEP_TOKEN} {y}"
    if 'trigger' in sample:
        trigger = ', '.join(sample['trigger'])
        text += f" {SEP_TOKEN} {trigger}"

    tokens = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")

    # Find entity positions
    # TODO: register [SUBJ] and [OBJ] as special tokens before tokenizing. Until
    # then WordPiece splits them (e.g. '[', 'sub', '##j', ']') and both lists are empty.
    # The indices refer to `words`, which has no leading [CLS], so they are one
    # less than the matching positions in tokens['input_ids'].
    words = tokenizer.tokenize(text)
    print(words)  # debug output
    e1_pos = [i for i, word in enumerate(words) if word == SUBJ_TOKEN]
    e2_pos = [i for i, word in enumerate(words) if word == OBJ_TOKEN]

    return tokens, (e1_pos, e2_pos)

In [25]:
preprocess_sample(reformatted_dataset['train'][0])

['speaker', '1', ':', 'it', "'", 's', 'been', 'an', 'hour', 'and', 'not', 'one', 'of', 'my', 'classmates', 'has', 'shown', 'up', '!', 'i', 'tell', 'you', ',', 'when', 'i', 'actually', 'die', 'some', 'people', 'are', 'gonna', 'get', 'seriously', 'haunted', '!', '[', 'sub', '##j', ']', ':', 'there', 'you', 'go', '!', 'someone', 'came', '!', 'speaker', '1', ':', 'ok', ',', 'ok', '!', 'i', "'", 'm', 'gonna', 'go', 'hide', '!', 'oh', ',', 'this', 'is', 'so', 'exciting', ',', 'my', 'first', 'mo', '##urne', '##r', '!', 'speaker', '3', ':', 'hi', ',', 'glad', 'you', 'could', 'come', '.', '[', 'sub', '##j', ']', ':', 'please', ',', 'come', 'in', '.', 'speaker', '4', ':', 'hi', ',', 'you', "'", 're', '[', 'ob', '##j', ']', ',', 'right', '?', 'i', "'", 'm', 'tom', 'gordon', ',', 'i', 'was', 'in', 'your', 'class', '.', '[', 'sub', '##j', ']', ':', 'oh', 'yes', ',', 'yes', '.', '.', '.', 'let', 'me', '.', '.', '.', 'take', 'your', 'coat', '.', 'speaker', '4', ':', 'thanks', '.', '.', '.', 'uh', '.'

({'input_ids': tensor([[  101,  5882,  1015,  1024,  2009,  1005,  1055,  2042,  2019,  3178,
           1998,  2025,  2028,  1997,  2026, 19846,  2038,  3491,  2039,   999,
           1045,  2425,  2017,  1010,  2043,  1045,  2941,  3280,  2070,  2111,
           2024,  6069,  2131,  5667, 11171,   999,  1031,  4942,  3501,  1033,
           1024,  2045,  2017,  2175,   999,  2619,  2234,   999,  5882,  1015,
           1024,  7929,  1010,  7929,   999,  1045,  1005,  1049,  6069,  2175,
           5342,   999,  2821,  1010,  2023,  2003,  2061, 10990,  1010,  2026,
           2034,  9587, 21737,  2099,   999,  5882,  1017,  1024,  7632,  1010,
           5580,  2017,  2071,  2272,  1012,  1031,  4942,  3501,  1033,  1024,
           3531,  1010,  2272,  1999,  1012,  5882,  1018,  1024,  7632,  1010,
           2017,  1005,  2128,  1031, 27885,  3501,  1033,  1010,  2157,  1029,
           1045,  1005,  1049,  3419,  5146,  1010,  1045,  2001,  1999,  2115,
           2465,  1012,  1
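
Plain `str.replace` also rewrites partial matches, so an entity such as `Speaker 1` would be replaced inside `Speaker 10`. A minimal sketch of a boundary-aware alternative using `re` follows. The helper name `mark_entities` and the example strings are hypothetical, and the lookaround pattern is one possible choice rather than the notebook's method.

```python
import re

SUBJ_TOKEN = '[SUBJ]'
OBJ_TOKEN = '[OBJ]'


def mark_entities(dialog, x, y):
    """Replace whole mentions of x and y with marker tokens in a single pass."""
    # If x == y, the subject marker wins because it is assigned last.
    markers = {y: OBJ_TOKEN, x: SUBJ_TOKEN}
    # Try longer entities first so a longer name wins over a shorter prefix.
    ordered = sorted(markers, key=len, reverse=True)
    # (?<!\w) and (?!\w) require a non-word character (or a string edge)
    # on both sides of the match.
    pattern = re.compile(
        r'(?<!\w)(?:' + '|'.join(re.escape(e) for e in ordered) + r')(?!\w)'
    )
    return pattern.sub(lambda m: markers[m.group(0)], dialog)


marked = mark_entities(
    "Speaker 1: Hi Speaker 10, you're Chandler Bing, right?",
    x='Speaker 1',
    y='Chandler Bing',
)
print(marked)  # [SUBJ]: Hi Speaker 10, you're [OBJ], right?
```

Doing both substitutions in one pass also means the second replacement can never touch marker text inserted by the first.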

In [15]:
from transformers import BertTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

text = "This is some example text with Speaker 1 reapeated a few times, Speaker 1, and Speaker 1 again."
tokens = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")

# Find entity positions. The entity spans several word pieces, so search for
# its token sequence instead of looking up the raw string "speaker 1".
words = tokenizer.tokenize(text)
entity_pieces = tokenizer.tokenize("Speaker 1")  # ['speaker', '1'] for the uncased model
n = len(entity_pieces)
e1_pos = [i for i in range(len(words) - n + 1) if words[i:i + n] == entity_pieces]

print(tokens)
print(words)
print(e1_pos)
# e2_pos can be found the same way for a second entity

{'input_ids': tensor([[  101,  2023,  2003,  2070,  2742,  3793,  2007,  5882,  1015,  2128,
         24065,  4383,  1037,  2261,  2335,  1010,  5882,  1015,  1010,  1998,
          5882,  1015,  2153,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

In [None]:
#TODO
#   - Replace entities in dialog with [SUBJ] [OBJ]
#       > [CLS]d*[SEP]e1[SEP]e2[SEP] where d* is as above
#   - Get positional indices of entity1 (SUBJ token) and entity2 (OBJ token) in dialog
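
The last TODO item can be sketched independently of the tokenizer. The tokenized output above shows `[SUBJ]` being split into `[`, `sub`, `##j`, `]`, so the markers first need to be registered, for example with `tokenizer.add_special_tokens({'additional_special_tokens': ['[SUBJ]', '[OBJ]']})`, followed by `model.resize_token_embeddings(len(tokenizer))` on the model that consumes them. Once each marker maps to a single id, its positions can be read straight from the encoded sequence. The helper name `marker_positions` and the ids below are made up for illustration.

```python
def marker_positions(input_ids, marker_id):
    """Return every index at which marker_id occurs in a sequence of token ids."""
    return [i for i, tok_id in enumerate(input_ids) if tok_id == marker_id]


# Hypothetical ids: 101 and 102 are [CLS] and [SEP] in bert-base-uncased;
# 30522 and 30523 stand in for the ids [SUBJ] and [OBJ] might receive if they
# were the first two tokens added to that vocabulary (an assumption).
SUBJ_ID, OBJ_ID = 30522, 30523
input_ids = [101, 30522, 1024, 7632, 1010, 2017, 1005, 2128, 30523,
             1010, 2157, 1029, 30522, 1024, 2748, 102]

e1_pos = marker_positions(input_ids, SUBJ_ID)
e2_pos = marker_positions(input_ids, OBJ_ID)
print(e1_pos, e2_pos)  # [1, 12] [8]
```

Because the search runs over `input_ids`, the indices already account for the leading `[CLS]`. With `truncation=True`, markers beyond the model's maximum length are cut off, so a list can be shorter than the number of mentions, or empty.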