
# Text Mining

## Dataset

We first download the DialogRE dataset from the Hugging Face Hub:

In [1]:
!pip install datasets -q

import datasets

dialog_re = datasets.load_dataset(
    'dataset-org/dialog_re',
    download_mode='force_redownload',
    trust_remote_code=True,
)


Generating train split: 1073 examples
Generating test split: 357 examples
Generating validation split: 358 examples

We then extract the relevant information and transform each split into a PyTorch dataset:

In [22]:
import collections
import torch

from torch.utils import data

Example = collections.namedtuple(
    'Example', ['dialog', 'subject', 'object', 'relations']
)

class DialogREDataset(data.Dataset):
    def __init__(self, dialog_re_dataset):
        super().__init__()

        self.data = []
        for example in dialog_re_dataset:
            dialog, relation_data = example['dialog'], example['relation_data']

            # Join the lines of the dialog together
            dialog = '\n'.join(dialog)

            # Extract relation data
            for sbj, obj, rids in zip(
                relation_data['x'],
                relation_data['y'],
                relation_data['rid']
            ):
                # Construct each row of data; the label is a multi-hot
                # vector over relation ids 1..37
                row = Example(
                    dialog,
                    sbj,
                    obj,
                    torch.tensor(
                        [1.0 if i in rids else 0.0 for i in range(1, 38)]
                    ),
                )

                # Add row of data
                self.data.append(row)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Separate splits
train_split = dialog_re['train']
validation_split = dialog_re['validation']
test_split = dialog_re['test']

# Transform into PyTorch datasets
train_dataset = DialogREDataset(train_split)
validation_dataset = DialogREDataset(validation_split)
test_dataset = DialogREDataset(test_split)

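# Wrap in DataLoaders. batch_size=None disables automatic batching, so each
# iteration yields a single example (dialogs vary in length)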
train_data_loader = data.DataLoader(
    train_dataset,
    batch_size=None,
    shuffle=True
)
validation_data_loader = data.DataLoader(validation_dataset, batch_size=None)
test_data_loader = data.DataLoader(test_dataset, batch_size=None)
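
To make the flattening step above concrete, here is a minimal, dependency-free sketch of what `DialogREDataset.__init__` does to a single record: one row per (subject, object) pair, each with a multi-hot label. The `flatten` helper and the sample record are hypothetical and invented for illustration; plain lists stand in for `torch.tensor`.

```python
import collections

Example = collections.namedtuple(
    "Example", ["dialog", "subject", "object", "relations"]
)

N_RELATION_TYPES = 37  # relation ids 1..37, matching range(1, 38) above


def flatten(record):
    """Turn one dialog record into one Example per (subject, object) pair."""
    dialog = "\n".join(record["dialog"])
    relation_data = record["relation_data"]
    rows = []
    for sbj, obj, rids in zip(
        relation_data["x"], relation_data["y"], relation_data["rid"]
    ):
        rid_set = set(rids)
        # Multi-hot label: position i - 1 is 1.0 when relation id i applies
        label = [
            1.0 if i in rid_set else 0.0
            for i in range(1, N_RELATION_TYPES + 1)
        ]
        rows.append(Example(dialog, sbj, obj, label))
    return rows


# Hypothetical record, invented for illustration (not taken from DialogRE)
record = {
    "dialog": ["Speaker 1: Hi, Anna.", "Speaker 2: Hello, Ben."],
    "relation_data": {
        "x": ["Speaker 1", "Speaker 2"],
        "y": ["Ben", "Anna"],
        "rid": [[30], [30, 37]],
    },
}

rows = flatten(record)
print(len(rows), rows[1].object, sum(rows[1].relations))
```

Because every pair becomes its own row, a dialog with many annotated pairs contributes many training examples that share the same text.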


## Model

### GloVe Vectors

As in the original DialogRE paper, our model is built on GloVe word vectors. We download the 300-dimensional vectors trained on 42B tokens and convert them into a spaCy pipeline:

In [3]:
!wget -nv -O glove.42B.300d.zip https://nlp.stanford.edu/data/glove.42B.300d.zip
!python -m spacy init vectors en glove.42B.300d.zip glove_vectors

import spacy

nlp = spacy.load("glove_vectors")

2025-03-03 17:11:02 URL:https://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip [1877800501/1877800501] -> "glove.42B.300d.zip" [1]
ℹ Creating blank nlp object for language 'en'
1917494it [03:27, 9229.90it/s]
✔ Successfully converted 1917494 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' setting in [initialize].
/content/glove_vectors


In [28]:
import torch
import numpy as np

from spacy import matcher
from torch import nn

class RelationExtractor(nn.Module):
    def __init__(self, n_relation_types, nlp):
        super().__init__()

        self.nlp = nlp

        self.token_vector_dims = nlp.vocab.vectors.shape[1]

        self.encoder = nn.LSTM(
            self.token_vector_dims,
            self.token_vector_dims,
            bidirectional=True,
        )

        self.classifier = nn.Linear(
            4 * self.token_vector_dims,
            n_relation_types,
        )

    def forward(self, doc, sbj, obj):
        doc = self.nlp(doc)

        # Calculate the contextual word vectors
        token_vectors = [token.vector for token in doc]
        token_vectors = np.vstack(token_vectors)
        token_vectors = torch.from_numpy(token_vectors)
        embedded_vectors, _ = self.encoder(token_vectors)

        # Calculate the subject and object entity vectors
        entity_matcher = matcher.PhraseMatcher(self.nlp.vocab, attr="LOWER")
        entity_matcher.add("SUBJECT", [self.nlp.make_doc(sbj)])
        entity_matcher.add("OBJECT", [self.nlp.make_doc(obj)])

        subject_vectors = []
        object_vectors = []
        for match_id, start, end in entity_matcher(doc):
            # Average the contextual vectors over the matched span
            entity_vector = embedded_vectors[start:end].mean(dim=0)
            if self.nlp.vocab.strings[match_id] == "SUBJECT":
                subject_vectors.append(entity_vector)
            else:
                object_vectors.append(entity_vector)

        # Average over all mentions of each entity; torch.stack raises a
        # RuntimeError if an entity is never matched in the dialog
        subject_vector = torch.stack(subject_vectors).mean(dim=0)
        object_vector = torch.stack(object_vectors).mean(dim=0)

        # Calculate logits for each relation
        return self.classifier(torch.cat([subject_vector, object_vector]))
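
The factor of 4 in the classifier's input size follows from the architecture above: a bidirectional LSTM with hidden size d emits 2d values per token, and the pooled subject and object vectors are concatenated. A small sketch of that bookkeeping, using a hypothetical helper name and assuming the 300-dimensional GloVe vectors downloaded earlier:

```python
def classifier_input_dims(token_vector_dims, bidirectional=True, n_entities=2):
    """Size of the vector fed to the linear classifier (hypothetical helper)."""
    directions = 2 if bidirectional else 1
    # The LSTM hidden size equals token_vector_dims in the model above,
    # so each token gets directions * token_vector_dims output values
    per_token = directions * token_vector_dims
    # Subject and object vectors are concatenated before classification
    return n_entities * per_token


print(classifier_input_dims(300))  # 1200
```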

## Training
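
The training loop in this section scores the 37 logits with `F.binary_cross_entropy_with_logits`, treating each relation as an independent yes/no label. As a rough illustration of what that loss computes, the following dependency-free sketch implements the standard numerically stable formula with a hypothetical `bce_with_logits` helper. It also shows that all-zero logits give a loss of ln 2 ≈ 0.6931 whatever the targets are, which is consistent with the first loss of 0.6908 printed by the untrained model.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


targets = [0.0] * 37
targets[1] = 1.0  # hypothetical example with a single active relation

loss_at_zero = bce_with_logits([0.0] * 37, targets)
print(round(loss_at_zero, 4))  # 0.6931, i.e. ln 2
```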

In [29]:
from torch import optim

import torch.nn.functional as F

EPOCHS = 1

model = RelationExtractor(37, nlp)
optimizer = optim.Adam(model.parameters())

for _ in range(EPOCHS):
    for dialog, sbj, obj, relations in train_data_loader:
        print(dialog)
        print(obj)

        optimizer.zero_grad()
        predictions = model(dialog, sbj, obj)
        loss = F.binary_cross_entropy_with_logits(predictions, relations)
        loss.backward()
        optimizer.step()  # apply the gradient update
        print(loss)

Speaker 1: So I’m thinking about asking Rachel out tonight. Y'know maybe play her that song we wrote last week.
Speaker 2: Emotional Knapsack?
Speaker 1: Yeah.
Speaker 2: Right on! Oh! Uh, but, don’t take to long okay? 'Cause uh, we're gonna test out our fake ID's tonight, right Clifford Alverez.
Speaker 1: Listen, Roland Chang, if things go well, I’m gonna be out with her all night.
Speaker 2: Dude, don't do that too me!
Emotional Knapsack
tensor(0.6908, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Speaker 1: Hey, big...
Speaker 2: Shhhh!
Speaker 1: ...spender.
Speaker 2: She's still asleep.
Speaker 1: So how'd it go?
Speaker 2: Oh, it was amazing. You know how you always think you're great in bed?
Speaker 1: The fact that you'd even ask that question shows how little you know me.
big spender


RuntimeError: stack expects a non-empty TensorList
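
The traceback above comes from `torch.stack` receiving an empty list. spaCy's `PhraseMatcher` only matches contiguous token sequences, and in the second dialog the object "big spender" is split across two utterances ("Hey, big..." and "...spender."), so `object_vectors` most likely stays empty. The sketch below reproduces the effect with a crude regex tokenizer standing in for spaCy (an assumption, made to keep the example dependency-free) and shows one possible guard: a hypothetical `mean_pool` helper that returns a fallback vector when nothing matched.

```python
import re


def tokenize(text):
    # Crude stand-in for the spaCy tokenizer: lowercase word-like tokens only
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(tokens, phrase_tokens):
    """Return (start, end) spans where phrase_tokens occurs contiguously."""
    n = len(phrase_tokens)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    ]


def mean_pool(vectors, fallback):
    """Average equal-length vectors, or return fallback when there are none."""
    if not vectors:
        return fallback
    return [sum(column) / len(vectors) for column in zip(*vectors)]


dialog = "Speaker 1: Hey, big...\nSpeaker 2: Shhhh!\nSpeaker 1: ...spender."
tokens = tokenize(dialog)

object_spans = find_phrase(tokens, tokenize("big spender"))
speaker_spans = find_phrase(tokens, tokenize("Speaker 2"))
print(object_spans)   # no contiguous match
print(speaker_spans)  # one match

# With no matched spans there is nothing to average, so the fallback is used
pooled = mean_pool([], fallback=[0.0, 0.0])
print(pooled)
```

In the model itself, the analogous change would be to check `subject_vectors` and `object_vectors` before calling `torch.stack`. Which fallback to use (for example, the mean of all contextual vectors in the dialog) is a design choice the notebook does not settle.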