
# Text Mining

## Dataset

We first download the DialogRE dataset from the Hugging Face Hub:

In [6]:
!pip install datasets -q

import datasets

dialog_re = datasets.load_dataset(
    'dataset-org/dialog_re',
    download_mode='force_redownload',
    trust_remote_code=True,
)


Generating train split: 1073 examples
Generating test split: 357 examples
Generating validation split: 358 examples

We then flatten each dialogue into one example per (subject, object) pair, encode the pair's relation IDs as a 37-dimensional multi-hot vector, and wrap each split in a PyTorch dataset:

In [13]:
import collections
import torch

from torch.utils import data

Example = collections.namedtuple(
    'Example', ['dialog', 'subject', 'object', 'relations']
)

class DialogREDataset(data.Dataset):
    def __init__(self, dialog_re_dataset):
        super().__init__()

        self.data = []
        for example in dialog_re_dataset:
            dialog, relation_data = example['dialog'], example['relation_data']

            # Join the lines of the dialog together
            dialog = '\n'.join(dialog)

            # Extract relation data
            for sbj, obj, rids in zip(
                relation_data['x'],
                relation_data['y'],
                relation_data['rid']
            ):
                # Construct each row of data
                row = Example(
                    dialog,
                    sbj,
                    obj,
                    torch.tensor(
                        [1 if i in rids else 0 for i in range(1, 38)]
                    ),
                )

                # Add row of data
                self.data.append(row)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Separate splits
train_split = dialog_re['train']
validation_split = dialog_re['validation']
test_split = dialog_re['test']

# Transform into PyTorch datasets
train_dataset = DialogREDataset(train_split)
validation_dataset = DialogREDataset(validation_split)
test_dataset = DialogREDataset(test_split)

# Apply DataLoader
train_data_loader = data.DataLoader(
    train_dataset,
    batch_size=None,
    shuffle=True,
)
validation_data_loader = data.DataLoader(validation_dataset, batch_size=None)
test_data_loader = data.DataLoader(test_dataset, batch_size=None)
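
To make the label encoding concrete, the following minimal sketch reproduces the multi-hot step from `DialogREDataset` in plain Python, with a list standing in for `torch.tensor`. The helper name `multi_hot` and the sample relation IDs are illustrative assumptions, not part of the dataset code.

```python
N_RELATION_TYPES = 37  # relation IDs run from 1 to 37, matching range(1, 38)


def multi_hot(rids, n_types=N_RELATION_TYPES):
    """Return a 0/1 list whose position i - 1 is 1 when relation ID i is present."""
    return [1 if i in rids else 0 for i in range(1, n_types + 1)]


# Hypothetical entity pair annotated with relation IDs 2 and 37
label = multi_hot([2, 37])
print(len(label), sum(label), label[1], label[36])
```

Because IDs start at 1, relation ID `i` lands at index `i - 1`, so a pair can carry several relations at once and the task is multi-label rather than multi-class.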


## GloVe Vectors

As in the original paper, we use GloVe word embeddings. We download them and convert them into a spaCy pipeline:

In [8]:
!wget -nv -O glove.42B.300d.zip https://nlp.stanford.edu/data/glove.42B.300d.zip
!python -m spacy init vectors en glove.42B.300d.zip glove_vectors

import spacy

nlp = spacy.load("glove_vectors")

2025-03-05 03:05:25 URL:https://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip [1877800501/1877800501] -> "glove.42B.300d.zip" [1]
ℹ Creating blank nlp object for language 'en'
1917494it [03:33, 8992.26it/s]
✔ Successfully converted 1917494 vectors
✔ Saved nlp object with vectors to output directory. You can now use
the path to it in your config as the 'vectors' setting in [initialize].
/content/glove_vectors


## Model

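We can then define our model. A bidirectional LSTM encodes the dialogue, the contextual vectors of the subject and object mentions are mean-pooled, and a single linear layer classifies the concatenation of the two pooled vectors.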

In [11]:
import torch
import numpy as np

from spacy import matcher
from torch import nn

class RelationExtractor(nn.Module):
    def __init__(self, hidden_dim, n_relation_types, nlp):
        super().__init__()

        token_vector_dim = nlp.vocab.vectors.shape[1]

        self.nlp = nlp

        # NOTE: the resulting contextual vector dim is 2 * hidden_dim!
        self.encoder = nn.LSTM(
            token_vector_dim,
            hidden_dim,
            bidirectional=True,
        )

        self.classifier = nn.Linear(
            4 * hidden_dim,
            n_relation_types,
        )

    def forward(self, doc, sbj, obj):
        # Tokenize the document, subject and object
        doc = self.nlp.make_doc(doc)
        sbj = self.nlp.make_doc(sbj)
        obj = self.nlp.make_doc(obj)

        # Calculate the contextual word vectors
        doc_vectors = [token.vector for token in doc]
        doc_vectors = np.vstack(doc_vectors)
        doc_vectors = torch.from_numpy(doc_vectors)
        embedded_vectors, _ = self.encoder(doc_vectors)

        # Calculate the subject and object vectors
        # Use the pipeline stored on the module rather than the global `nlp`
        entity_matcher = matcher.PhraseMatcher(self.nlp.vocab)
        entity_matcher.add('SBJ', [sbj])
        entity_matcher.add('OBJ', [obj])

        sbj_vectors = []
        obj_vectors = []
        for match_id, start, end in entity_matcher(doc):
            entity_vector = embedded_vectors[start:end].mean(dim=0)
            if self.nlp.vocab.strings[match_id] == 'SBJ':
                sbj_vectors.append(entity_vector)
            else:
                obj_vectors.append(entity_vector)

        # If the entity is not mentioned in the text, set the vector to zero
        if sbj_vectors:
            sbj_vector = torch.stack(sbj_vectors).mean(dim=0)
        else:
            sbj_vector = torch.zeros(embedded_vectors.shape[1])

        if obj_vectors:
            obj_vector = torch.stack(obj_vectors).mean(dim=0)
        else:
            obj_vector = torch.zeros_like(sbj_vector)

        return self.classifier(torch.cat([sbj_vector, obj_vector]))
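
The pooling in the forward pass above can be illustrated without any tensors. The sketch below is a simplified stand-in, assuming plain Python lists in place of the LSTM outputs; the helper names `mean_vector` and `pool_entity` and the toy numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pool_entity(contextual, spans, dim):
    """Average each mention span, then average over all mentions.

    Falls back to a zero vector when the entity is never matched.
    """
    if not spans:
        return [0.0] * dim
    mention_vectors = [mean_vector(contextual[start:end]) for start, end in spans]
    return mean_vector(mention_vectors)


# Toy contextual vectors: 4 tokens, dimension 2 (stands in for 2 * hidden_dim)
contextual = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]

sbj_vector = pool_entity(contextual, [(0, 2), (3, 4)], dim=2)  # two mentions
obj_vector = pool_entity(contextual, [], dim=2)                # no mention found

# Concatenating both entity vectors doubles the width again,
# which is why the classifier takes 4 * hidden_dim input features
features = sbj_vector + obj_vector
print(features)
```

Averaging per mention first and then across mentions gives every mention equal weight regardless of how many tokens it spans.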

## Training

In [None]:
from torch import optim
from sklearn import metrics

import numpy as np
import torch.nn.functional as F

EPOCHS = 1

model = RelationExtractor(300, 37, nlp)
optimizer = optim.AdamW(model.parameters())

def micro_f1(model, data_loader):
    # Threshold the sigmoid outputs at 0.5 and score all labels jointly
    all_predictions = []
    all_relations = []
    with torch.no_grad():
        for dialog, sbj, obj, relations in data_loader:
            predictions = torch.sigmoid(model(dialog, sbj, obj))
            all_predictions.append((predictions > 0.5).long().numpy())
            all_relations.append(relations.numpy())

    # Stack into (n_examples, n_relation_types) indicator matrices
    return metrics.f1_score(
        np.vstack(all_relations),
        np.vstack(all_predictions),
        average='micro',
    )

# Baseline: micro F1 of the untrained model on the training set
print(micro_f1(model, train_data_loader))

for _ in range(EPOCHS):
    for dialog, sbj, obj, relations in train_data_loader:
        optimizer.zero_grad()
        predictions = model(dialog, sbj, obj)
        # The loss expects float targets, so cast the 0/1 labels
        loss = F.binary_cross_entropy_with_logits(predictions, relations.float())
        loss.backward()
        optimizer.step()

    # Micro F1 on the training set after each epoch
    print(micro_f1(model, train_data_loader))


0.06535318701490808
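
The score printed above is a micro-averaged F1, which pools true positives, false positives, and false negatives over every (example, relation type) cell before computing a single F1. The following sketch shows that computation in plain Python on made-up probabilities. The helper names `threshold` and `micro_f1_from_labels` and the toy data are assumptions for illustration, and returning 0.0 for an empty denominator is a choice made here.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [1 if p > cutoff else 0 for p in probabilities]


def micro_f1_from_labels(y_true, y_pred):
    """Micro-averaged F1 over rows of 0/1 label indicators."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data: 2 examples, 3 relation types
y_true = [[1, 0, 0], [0, 1, 1]]
y_prob = [[0.9, 0.6, 0.1], [0.2, 0.7, 0.4]]
y_pred = [threshold(row) for row in y_prob]

print(micro_f1_from_labels(y_true, y_pred))
```

Here there are two true positives, one false positive, and one false negative, so the score is 2·2 / (2·2 + 1 + 1) = 2/3.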
