<a href="https://colab.research.google.com/github/Mozzer2310/text-mining-cwk/blob/wills-kitchen/experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Will's Experiments

## Set-Up

### Use GPU for PyTorch

We set up PyTorch to use the GPU when one is available, falling back to the CPU otherwise:

In [87]:
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)

### Downloading the DialogRE Dataset

We download the DialogRE dataset from HuggingFace:

In [1]:
!pip install datasets -q

import datasets

dialogre = datasets.load_dataset(
    "dataset-org/dialog_re",
    download_mode="force_redownload",
    trust_remote_code=True,
)

README.md:   0%|          | 0.00/7.45k [00:00<?, ?B/s]

dialog_re.py:   0%|          | 0.00/4.83k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/752k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/726k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1073 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/357 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/358 [00:00<?, ? examples/s]

### Downloading the GloVe Word Vectors

As in the original DialogRE paper, we use GloVe word vectors. We download them and convert them into a spaCy pipeline:

In [3]:
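# Download the GloVe 6B archive (about 822 MB)
!wget -O glove.6B.zip https://nlp.stanford.edu/data/glove.6B.zip
# Convert the vectors into a spaCy pipeline saved in ./glove_vectors
!python -m spacy init vectors en glove.6B.zip glove_vectors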

import spacy

nlp = spacy.load("glove_vectors")

--2025-02-26 13:22:17--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-02-26 13:22:17--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2025-02-26 13:25:09 (4.79 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

ℹ Creating blank nlp object for language 'en'
400000it [00:07, 50716.99it/s]
✔ Successfully converted 400000 vectors
✔ Saved nlp object with vectors to ou

## Hyper-Parameter Selection

### Training

In [7]:
dialogre["train"][0]

{'dialog': ["Speaker 1: It's been an hour and not one of my classmates has shown up! I tell you, when I actually die some people are gonna get seriously haunted!",
  'Speaker 2: There you go! Someone came!',
  "Speaker 1: Ok, ok! I'm gonna go hide! Oh, this is so exciting, my first mourner!",
  'Speaker 3: Hi, glad you could come.',
  'Speaker 2: Please, come in.',
  "Speaker 4: Hi, you're Chandler Bing, right? I'm Tom Gordon, I was in your class.",
  'Speaker 2: Oh yes, yes... let me... take your coat.',
  "Speaker 4: Thanks... uh... I'm so sorry about Ross, it's...",
  'Speaker 2: At least he died doing what he loved... watching blimps.',
  'Speaker 1: Who is he?',
  'Speaker 2: Some guy, Tom Gordon.',
  "Speaker 1: I don't remember him, but then again I touched so many lives.",
  'Speaker 3: So, did you know Ross well?',
  "Speaker 4: Oh, actually I barely knew him. Yeah, I came because I heard Chandler's news. D'you know if he's seeing anyone?",
  'Speaker 3: Yes, he is. Me.',
  'S

We create a `PhraseMatcher` that matches all of the named entities in the dataset.

In [None]:
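from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Collect each distinct entity once so that no pattern is registered twice
entities = {
    entity
    for split in dialogre.values()
    for example in split
    for entity in example['relation_data']['x'] + example['relation_data']['y']
}

# Add entities to match on, keyed by the entity string itself
for entity in entities:
    matcher.add(entity, [nlp.make_doc(entity)])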
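
Each match the `PhraseMatcher` returns is a `(match_id, start, end)` triple, where `start` and `end` are token indices into the dialogue. The training cell uses these spans in two pooling steps: the token embeddings inside a span are averaged into a mention embedding, and the mention embeddings of an entity are averaged into one entity embedding. The following is a minimal plain-Python sketch of that pooling. It uses made-up 2-dimensional embeddings and hypothetical helper names (`mean_vectors`, `pool_entities`) rather than the notebook's tensors, and it assumes the match id has already been resolved to the entity string.

```python
from collections import defaultdict


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pool_entities(token_embeddings, matches):
    """Pool token embeddings into one embedding per matched entity.

    `matches` holds (entity, start, end) triples with an exclusive `end`,
    mirroring the token spans a spaCy matcher yields.
    """
    mentions = defaultdict(list)
    for entity, start, end in matches:
        # First pooling step: average the tokens inside one mention.
        mentions[entity].append(mean_vectors(token_embeddings[start:end]))
    # Second pooling step: average all mentions of the same entity.
    return {entity: mean_vectors(vecs) for entity, vecs in mentions.items()}


# Illustrative values only: five tokens with 2-dimensional embeddings.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [4.0, 4.0], [9.0, 9.0]]
matches = [("Tom Gordon", 0, 2), ("Tom Gordon", 3, 4), ("Ross", 4, 5)]

entity_embeddings = pool_entities(token_embeddings, matches)
print(entity_embeddings)
```

An entity that is never matched does not appear in the result, so any code that looks up a relation argument has to handle a missing key.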

In [89]:
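# The LSTM takes 50-dimensional word vectors; being bidirectional with hidden
# size 50 per direction, it emits a 100-dimensional embedding per token.
encoder = nn.LSTM(50, 50, bidirectional=True)
# Two concatenated entity embeddings (200 values) in, one logit per relation class (37) out.
predictor = nn.Linear(200, 37)

for example in dialogre["train"]:
    relation_data = example["relation_data"]

    dialog = nlp("\n".join(example["dialog"]))
    # torch.from_numpy always yields CPU tensors, so move the stack to `device` explicitly.
    dialog_vectors = torch.stack([
        torch.from_numpy(token.vector)
        for token in dialog
    ]).to(device)
    embeddings, _ = encoder(dialog_vectors)

    # Collect one embedding per mention of each entity in this example.
    entity_mentions = {
        entity: []
        for entity in relation_data["x"] + relation_data["y"]
    }
    for match_id, start, end in matcher(dialog):
        entity = nlp.vocab.strings[match_id]
        if entity in entity_mentions:
            # Mention embedding: mean over the tokens in the matched span.
            entity_mentions[entity].append(embeddings[start:end].mean(dim=0))

    # Entity embedding: mean over its mentions. Entities with no match are
    # dropped, since torch.stack raises an error on an empty list.
    entity_embeddings = {
        entity: torch.stack(mentions).mean(dim=0)
        for entity, mentions in entity_mentions.items()
        if mentions
    }

    loss = 0
    for x, y, rid in zip(relation_data["x"], relation_data["y"], relation_data["rid"]):
        if x not in entity_embeddings or y not in entity_embeddings:
            continue  # skip pairs with an unmatched argument
        logits = predictor(torch.cat((entity_embeddings[x], entity_embeddings[y])))
        # Multi-hot target: index rid - 1 is set for each relation id of the pair.
        truth = torch.zeros_like(logits)
        truth[torch.tensor(rid) - 1] = 1
        loss += F.binary_cross_entropy_with_logits(logits, truth)
    if torch.is_tensor(loss):  # loss is still the int 0 if no pair was scored
        loss.backward()
    break  # only the first example is processed for now; there is no optimizer step yet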
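
The loss in the cell above treats relation prediction as multi-label classification: an entity pair can hold several relations at once, so the target is a multi-hot vector and each of the 37 outputs gets its own sigmoid cross-entropy term, averaged over the classes. The sketch below reproduces that arithmetic in plain Python to make it explicit. The helper names (`multi_hot`, `bce_with_logits`) are hypothetical, the relation ids are arbitrary illustrative values, and it assumes, as the `rid - 1` indexing in the cell implies, that relation ids are 1-based.

```python
import math

NUM_RELATIONS = 37  # size of the predictor's output layer in the notebook


def multi_hot(rids, num_classes=NUM_RELATIONS):
    """Build a multi-hot target vector from 1-based relation ids."""
    target = [0.0] * num_classes
    for rid in rids:
        target[rid - 1] = 1.0
    return target


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z * t + log(1 + exp(-|z|)),
    which equals -[t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))].
    """
    terms = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(terms) / len(terms)


# Illustrative pair annotated with two relations (ids chosen arbitrarily).
target = multi_hot([1, 30])

# All-zero logits mean every class has probability 0.5.
uniform_loss = bce_with_logits([0.0] * NUM_RELATIONS, target)

# Confident, correct logits drive the loss towards zero.
good_logits = [10.0 if t == 1.0 else -10.0 for t in target]
good_loss = bce_with_logits(good_logits, target)

print(uniform_loss, good_loss)
```

With all-zero logits every term equals log 2, which gives a useful reference point: a freshly initialised model should start near that value, and the loss should fall below it as training progresses.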