<a href="https://colab.research.google.com/github/Mozzer2310/text-mining-cwk/blob/sam-experiments/bert-experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install datasets
from datasets import load_dataset

dataset = load_dataset("dataset-org/dialog_re", download_mode="force_redownload", trust_remote_code=True)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


dialog_re.py:   0%|          | 0.00/4.83k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/7.45k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/752k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/726k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1073 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/357 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/358 [00:00<?, ? examples/s]

In [2]:
dataset['test'][0]

{'dialog': ['Speaker 1: Hey, you guys! Look what I found! Look at this!  That’s my Mom’s writing! Look.',
  'Speaker 2: Me and Frank and Phoebe, Graduation 1965.',
  "Speaker 1: Y'know what that means?",
  'Speaker 3: That you’re actually 50?',
  'Speaker 1: No-no, that’s not, that’s not me Phoebe, that’s her pal Phoebe. According to her high school yearbook, they were like B.F.F. Best Friends Forever.',
  'Speaker 4: Oh!',
  'Speaker 5: That is so cool.',
  'Speaker 1: I know! So this woman probably could like have all kinds of stories about my parents, and she might even know like where my Dad is. So I looked her up, and she lives out by the beach. So maybe this weekend we could go to the beach?',
  'Speaker 4: Yeah! Yeah, we can!',
  'Speaker 6: Shoot! I can’t go, I have to work!',
  'Speaker 7: That’s too bad.',
  'Speaker 5: Ohh, big, fat bummerrr.',
  'Speaker 1: So great! Okay! Tomorrow we’re gonna drive out to Montauk.'],
 'relation_data': {'x': ['Speaker 1',
   'Speaker 1',
  

**Data Fields**
- `dialog`
    - List of dialog spoken between the speakers
- List of annotations per dialog per argument
    - `x` : First entity
    - `y` : Second entity
    - `x_type` : Type of the first entity
    - `y_type`: Type of the second entity
    - `r` : List of relations
    - `rid`: List of relation IDs
    - `t`: List of relation Trigger words

In [3]:
import re
from transformers import AutoTokenizer

In [4]:
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

In [5]:
def add_tokens(example, add_triggers=True):
    """Converts datapoint to a list of tuples containing dialog with separated entities and their relation"""

    dialog = example['dialog']
    relation_data = example['relation_data']

    data = []
    relation_labels = []

    # Join the dialog into a single string
    all_dialog = ' '.join(dialog)

    for x, y, r, t in zip(relation_data['x'], relation_data['y'], relation_data['r'], relation_data['t']):
        # Create a dictionary to store the formatted dialog
        formatted_text = f"{all_dialog} [SEP] {x} [SEP] {y}"

        # optionally include trigger words
        if add_triggers:
            triggers = ', '.join(t)
            formatted_text = f"{formatted_text} [SEP] {triggers}"

        data.append(formatted_text)
        relation_labels.append(r)

    return data, relation_labels

In [6]:
add_tokens(dataset['test'][0])

(["Speaker 1: Hey, you guys! Look what I found! Look at this!  That’s my Mom’s writing! Look. Speaker 2: Me and Frank and Phoebe, Graduation 1965. Speaker 1: Y'know what that means? Speaker 3: That you’re actually 50? Speaker 1: No-no, that’s not, that’s not me Phoebe, that’s her pal Phoebe. According to her high school yearbook, they were like B.F.F. Best Friends Forever. Speaker 4: Oh! Speaker 5: That is so cool. Speaker 1: I know! So this woman probably could like have all kinds of stories about my parents, and she might even know like where my Dad is. So I looked her up, and she lives out by the beach. So maybe this weekend we could go to the beach? Speaker 4: Yeah! Yeah, we can! Speaker 6: Shoot! I can’t go, I have to work! Speaker 7: That’s too bad. Speaker 5: Ohh, big, fat bummerrr. Speaker 1: So great! Okay! Tomorrow we’re gonna drive out to Montauk. [SEP] Speaker 1 [SEP] 50 [SEP] ",
  "Speaker 1: Hey, you guys! Look what I found! Look at this!  That’s my Mom’s writing! Look. S

In [7]:
test_data = []
test_relation_labels = []
for datapoint in dataset['test']:
    data, relation_labels = add_tokens(datapoint)
    test_data.extend(data)
    test_relation_labels.extend(relation_labels)

print(test_data[0])
print(test_relation_labels[0])

Speaker 1: Hey, you guys! Look what I found! Look at this!  That’s my Mom’s writing! Look. Speaker 2: Me and Frank and Phoebe, Graduation 1965. Speaker 1: Y'know what that means? Speaker 3: That you’re actually 50? Speaker 1: No-no, that’s not, that’s not me Phoebe, that’s her pal Phoebe. According to her high school yearbook, they were like B.F.F. Best Friends Forever. Speaker 4: Oh! Speaker 5: That is so cool. Speaker 1: I know! So this woman probably could like have all kinds of stories about my parents, and she might even know like where my Dad is. So I looked her up, and she lives out by the beach. So maybe this weekend we could go to the beach? Speaker 4: Yeah! Yeah, we can! Speaker 6: Shoot! I can’t go, I have to work! Speaker 7: That’s too bad. Speaker 5: Ohh, big, fat bummerrr. Speaker 1: So great! Okay! Tomorrow we’re gonna drive out to Montauk. [SEP] Speaker 1 [SEP] 50 [SEP] 
['per:age']


In [8]:
test_encodings = tokenizer(test_data, padding="max_length", truncation=True)
print(test_encodings[0])

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [9]:
print(tokenizer.decode(test_encodings[0].ids))

[CLS] Speaker 1 : Hey, you guys! Look what I found! Look at this! That ’ s my Mom ’ s writing! Look. Speaker 2 : Me and Frank and Phoebe, Graduation 1965. Speaker 1 : Y ' know what that means? Speaker 3 : That you ’ re actually 50? Speaker 1 : No - no, that ’ s not, that ’ s not me Phoebe, that ’ s her pal Phoebe. According to her high school yearbook, they were like B. F. F. Best Friends Forever. Speaker 4 : Oh! Speaker 5 : That is so cool. Speaker 1 : I know! So this woman probably could like have all kinds of stories about my parents, and she might even know like where my Dad is. So I looked her up, and she lives out by the beach. So maybe this weekend we could go to the beach? Speaker 4 : Yeah! Yeah, we can! Speaker 6 : Shoot! I can ’ t go, I have to work! Speaker 7 : That ’ s too bad. Speaker 5 : Ohh, big, fat bummerrr. Speaker 1 : So great! Okay! Tomorrow we ’ re gonna drive out to Montauk. [SEP] Speaker 1 [SEP] 50 [SEP] [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

**Need to encode label data**  
Use a binary multi-hot vector if multiple labels

In [10]:
def get_labels(dataset):
    all_dataset_labels = set()
    for datapoint in dataset['train']:
        for relation in  [item for sublist in datapoint['relation_data']['r'] for item in sublist]:
            all_dataset_labels.add(relation)
    return list(all_dataset_labels)

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer

# Get all possible labels
all_labels = get_labels(dataset)
print(all_labels)

# Initialize the binarizer
mlb = MultiLabelBinarizer(classes=all_labels)

# Fit and transform the labels
encoded_labels = mlb.fit_transform(test_relation_labels)

['per:positive_impression', 'per:parents', 'per:title', 'per:schools_attended', 'per:children', 'per:friends', 'per:age', 'gpe:visitors_of_place', 'per:spouse', 'per:visited_place', 'per:subordinate', 'per:dates', 'per:client', 'per:girl/boyfriend', 'per:alternate_names', 'gpe:residents_of_place', 'per:roommate', 'per:siblings', 'per:date_of_birth', 'per:other_family', 'per:origin', 'org:students', 'per:place_of_work', 'per:works', 'per:alumni', 'per:employee_or_member_of', 'per:major', 'org:employees_or_members', 'unanswerable', 'per:neighbor', 'per:boss', 'per:negative_impression', 'per:acquaintance', 'per:pet', 'per:place_of_residence']


In [12]:
print(encoded_labels[0])
print(mlb.classes_)

[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
['per:positive_impression' 'per:parents' 'per:title'
 'per:schools_attended' 'per:children' 'per:friends' 'per:age'
 'gpe:visitors_of_place' 'per:spouse' 'per:visited_place'
 'per:subordinate' 'per:dates' 'per:client' 'per:girl/boyfriend'
 'per:alternate_names' 'gpe:residents_of_place' 'per:roommate'
 'per:siblings' 'per:date_of_birth' 'per:other_family' 'per:origin'
 'org:students' 'per:place_of_work' 'per:works' 'per:alumni'
 'per:employee_or_member_of' 'per:major' 'org:employees_or_members'
 'unanswerable' 'per:neighbor' 'per:boss' 'per:negative_impression'
 'per:acquaintance' 'per:pet' 'per:place_of_residence']


In [13]:
import torch
import numpy as np
from torch import nn
from transformers import BertTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

In [14]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [15]:
# Custom Dataset class
class MultiLabelDataset(Dataset):
    def __init__(self, data, labels, tokenizer, max_len=128):
        self.data = data
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Tokenize text with context keywords as second sentence
        encoding = self.tokenizer(
            self.data[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt"
        )

        # Return inputs and labels
        return {
            "input_ids": encoding["input_ids"].squeeze(),
            "attention_mask": encoding["attention_mask"].squeeze(),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float)
        }

# Prepare dataset
dataset = MultiLabelDataset(test_data, encoded_labels, tokenizer)

# Initialize BERT model for multi-label classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=len(all_labels))

# Custom metrics function for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = (torch.sigmoid(torch.tensor(logits)) > 0.5).int().numpy()

    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="micro")
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }

# Training arguments
training_args = TrainingArguments(
    output_dir="./bert_multilabel",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

model.safetensors:   0%|          | 0.00/263M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.108807,0.0,0.0,0.0,0.0
2,0.172100,0.107374,0.0,0.0,0.0,0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


KeyboardInterrupt: 