<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LUKE/Supervised_relation_extraction_with_LukeForEntityPairClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we are going to fine-tune [`LukeForEntityPairClassification`](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) on a supervised **relation extraction** dataset.

The goal for the model is to predict, given a sentence and the character spans of two entities within the sentence, the relationship between the entities.

The authors of LUKE fine-tuned this model on [TACRED](https://nlp.stanford.edu/projects/tacred/), a widely used supervised relation extraction dataset from Stanford University, and reported state-of-the-art results on it.

* Paper: https://arxiv.org/abs/2010.01057
* Original repository: https://github.com/studio-ousia/luke

## Read in data

Let's download the dataset, which is hosted on Dropbox.

In [None]:
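import os

import requests

# Use a Hugging Face mirror for the model downloads later in the notebook
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

def download_data():
    url = "https://www.dropbox.com/s/izi2x4sjohpzoot/relation_extraction_dataset.zip?dl=1"
    r = requests.get(url)
    r.raise_for_status()  # fail early if the download did not succeed
    # Save the archive to disk so that the next cell can extract it
    with open("relation_extraction_dataset.zip", "wb") as f:
        f.write(r.content)

download_data()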

In [3]:
import zipfile

zip_file_path = "relation_extraction_dataset.zip"

# Extract the ZIP archive
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall("extracted_data")

print("Extraction complete!")

Extraction complete!


Each row in the dataframe contains a news article and a sentence from it in which a certain relationship was found (such as "invested_in" or "founded_by"). The data was gathered with pattern matching, so it may contain some noise.

In [2]:
import pandas as pd

df = pd.read_pickle("extracted_data/relation_extraction_dataset.pkl")
df.reset_index(drop=True, inplace=True)
df.head()

| | end_idx | entities | entity_spans | match | original_article | sentence | start_idx | string_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 1024 | [Lilium, Baillie Gifford] | [[3, 9], [151, 166]] | raising $35 | Happy Friday!\n\nWe sincerely hope you and you... | 3) Lilium, a German startup that’s making an a... | 1013 | invested_in |
| 1 | 1762 | [Facebook ’s, Giphy] | [[92, 102], [148, 153]] | acquisition | Happy Friday!\n\nWe sincerely hope you and you... | Meanwhile, the UK’s watchdog on Friday announc... | 1751 | acquired_by |
| 2 | 2784 | [Global-e, Vitruvian Partners] | [[27, 35], [94, 112]] | raised $60 | Happy Friday!\n\nWe sincerely hope you and you... | Israeli e-commerce startup Global-e has raised... | 2774 | invested_in |
| 3 | 680 | [Joris Van Der Gucht, Silverfin] | [[0, 19], [35, 44]] | founder | Hg is a leading investor in tax and accounting... | Joris Van Der Gucht, co-founder at Silverfin c... | 673 | founded_by |
| 4 | 2070 | [Tim Vandecasteele, Silverfin] | [[0, 17], [71, 80]] | founder | Hg is a leading investor in tax and accounting... | Tim Vandecasteele, co-founder added: "We want ... | 2063 | founded_by |
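
The `entity_spans` column holds character offsets into `sentence`, as `[start, end]` pairs with an exclusive end. The small check below illustrates this with row 3 of the preview. Because the full sentence is truncated there, only its visible beginning is used:

```python
# Visible beginning of the sentence in row 3 of the preview above.
sentence = "Joris Van Der Gucht, co-founder at Silverfin"
entity_spans = [(0, 19), (35, 44)]

# Slicing the sentence with each (start, end) pair recovers the entity mention.
mentions = [sentence[start:end] for start, end in entity_spans]
print(mentions)
```

The recovered mentions should agree with the `entities` column of the same row.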


Let's create 2 dictionaries, one that maps each label to a unique integer, and one that does it the other way around.

In [3]:
id2label = dict()
for idx, label in enumerate(df.string_id.value_counts().index):
    id2label[idx] = label

As we can see, there are 7 labels (7 unique relationships):

In [4]:
id2label

{0: 'founded_by',
 1: 'acquired_by',
 2: 'invested_in',
 3: 'CEO_of',
 4: 'subsidiary_of',
 5: 'partners_with',
 6: 'owned_by'}

In [5]:
label2id = {v:k for k,v in id2label.items()}
label2id

{'founded_by': 0,
 'acquired_by': 1,
 'invested_in': 2,
 'CEO_of': 3,
 'subsidiary_of': 4,
 'partners_with': 5,
 'owned_by': 6}

In [6]:
df.shape

(12031, 8)

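## Define the class RelationExtractionDataset

We use the `LukeTokenizer` from Hugging Face, imported here from the `mindnlp.transformers` library, and import `mindspore` for tensor operations. Each item of the dataset is the tokenizer's encoding of one sentence together with its two entity spans, plus the integer label of the relationship.

For more information regarding these inputs, refer to the [docs](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) of `LukeForEntityPairClassification`.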


In [7]:
from mindnlp.transformers import LukeTokenizer
import mindspore

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base", task="entity_pair_classification")

class RelationExtractionDataset():
    """Relation extraction dataset."""
    def __init__(self, data):
        """
        Args:
            data : Pandas dataframe.
        """
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data.iloc[idx]

        sentence = item.sentence
        entity_spans = [tuple(x) for x in item.entity_spans]
        encoding = tokenizer(sentence, entity_spans=entity_spans, padding="max_length", truncation=True, return_tensors="ms")
        
        label_value = label2id[item.string_id]
        encoding["label"] = mindspore.Tensor([label_value], dtype=mindspore.int32)
        
        return encoding



Here we split the dataframe and instantiate the class defined above three times: once for the training set, once for the validation set and once for the test set.

In [8]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42, shuffle=False)

# define the dataset
train_dataset = RelationExtractionDataset(data=train_df)
valid_dataset = RelationExtractionDataset(data=val_df)
test_dataset = RelationExtractionDataset(data=test_df)
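
As a sanity check on the split sizes: when `test_size` is a float, scikit-learn rounds the held-out share up and gives the remainder to the training side. A short sketch of that arithmetic for the 12,031 rows reported by `df.shape`:

```python
import math

n_rows = 12031  # number of rows reported by df.shape above

# First split: 20% of all rows are held out as the test set (rounded up).
n_test = math.ceil(0.2 * n_rows)
n_train_val = n_rows - n_test

# Second split: 20% of the remaining rows are held out for validation.
n_val = math.ceil(0.2 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
```

The resulting training size agrees with the total of 7699 shown in the training log, where each example is fed to the model as a batch of one.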

In [9]:
train_dataset[0].keys()

dict_keys(['input_ids', 'entity_ids', 'entity_position_ids', 'attention_mask', 'entity_attention_mask', 'label'])

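Let's look at one example. Because the tokenizer is called with `return_tensors="ms"`, every item already carries a leading batch dimension of size 1: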

In [84]:
batch = next(iter(train_dataset))
print(batch)  # inspect the structure of the batch
print(batch["input_ids"].shape)

{'input_ids': Tensor(shape=[1, 512], dtype=Int64, value=
[[  0, 113, 894 ...   1,   1,   1]]), 'entity_ids': Tensor(shape=[1, 2], dtype=Int64, value=
[[2, 3]]), 'entity_position_ids': Tensor(shape=[1, 2, 30], dtype=Int64, value=
[[[34, 35, 36 ... -1, -1, -1],
  [46, 47, 48 ... -1, -1, -1]]]), 'attention_mask': Tensor(shape=[1, 512], dtype=Int64, value=
[[1, 1, 1 ... 0, 0, 0]]), 'entity_attention_mask': Tensor(shape=[1, 2], dtype=Int64, value=
[[1, 1]]), 'label': Tensor(shape=[1], dtype=Int32, value= [2])}
(1, 512)
Label distribution: Counter({2: 2299, 1: 1612, 0: 1558, 3: 1096, 4: 507, 5: 375, 6: 252})
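
In the output above, `entity_position_ids` has shape `[1, 2, 30]`: for each of the two entities it lists the token positions covered by the mention, right-padded with `-1`. The pure-Python sketch below mimics that layout. The helper name and the example positions are hypothetical; only the length 30 and the padding value are read from the printed tensor:

```python
MAX_MENTION_LENGTH = 30  # last dimension of entity_position_ids in the output above
PAD = -1                 # padding value visible in the output above

def pad_positions(token_positions):
    """Right-pad a list of token positions with PAD (hypothetical helper)."""
    positions = list(token_positions)[:MAX_MENTION_LENGTH]
    return positions + [PAD] * (MAX_MENTION_LENGTH - len(positions))

# Illustrative mentions: one covering tokens 34-36, the other tokens 46-48.
entity_position_ids = [pad_positions([34, 35, 36]), pad_positions([46, 47, 48])]
print(entity_position_ids[0][:5], len(entity_position_ids[0]))
```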


In [10]:
label_value = batch["label"].item()
print(id2label[label_value])

founded_by


## Define the model

Let's wrap the model in a plain Python class, `LUKE`, which holds the MindNLP `LukeForEntityPairClassification` model, the optimizer and the loss function, and provides `train`, `validate` and `test` loops. It plays the role that a PyTorch `LightningModule` plays in the PyTorch version of this notebook.

For more information regarding `LightningModule`, see the [docs](https://pytorch-lightning.readthedocs.io/en/latest/?_ga=2.56317931.1395871250.1622709933-1738348008.1615553774) of PyTorch Lightning.

In [92]:
import mindspore.nn as nn
from mindnlp.transformers import LukeForEntityPairClassification
from mindnlp.core.optim import AdamW
from mindnlp.core import no_grad

class LUKE:
    def __init__(self, label2id):
        self.model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-base", num_labels=len(label2id))
        # Note: 1e-3 is a high learning rate for fine-tuning a pretrained transformer;
        # values around 1e-5 to 5e-5 are more common.
        self.optimizer = AdamW(params=tuple(param for param in self.model.trainable_params() if param.requires_grad), lr=1e-3)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, input_ids, entity_ids, entity_position_ids, attention_mask, entity_attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, entity_ids=entity_ids,
                             entity_attention_mask=entity_attention_mask, entity_position_ids=entity_position_ids)
        return outputs

    def common_step(self, batch):
        labels = batch['label']
        # Pass everything except the label to the model, without mutating the batch
        inputs = {k: v for k, v in batch.items() if k != 'label'}
        outputs = self.forward(**inputs)
        logits = outputs.logits
        loss = self.criterion(logits, labels)
        predictions = logits.argmax(-1)
        correct = (predictions == labels).sum().item()
        accuracy = correct / labels.shape[0]
        return loss, accuracy

    def clear_grad(self):
        for param in self.model.trainable_params():
            if param.grad is not None:
                param.grad.set_value(mindspore.Tensor(0))  # reset the gradient to 0

    def train(self, train_dataloader, num_epochs):
        self.model.set_train()  # put the model in training mode
        total_batches = len(train_dataloader)
        for epoch in range(num_epochs):
            epoch_loss, epoch_accuracy = 0, 0    # totals over the whole epoch
            window_loss, window_accuracy = 0, 0  # totals over the last 50 batches
            for batch_idx, batch in enumerate(train_dataloader, start=1):
                self.clear_grad()  # clear the gradients of the previous batch
                loss, accuracy = self.common_step(batch)
                loss.backward()  # backpropagation
                grads = [param.grad for param in self.model.trainable_params() if param.grad is not None]
                self.optimizer.step(grads)  # update the parameters
                loss_value = loss.asnumpy()
                epoch_loss += loss_value
                epoch_accuracy += accuracy
                window_loss += loss_value
                window_accuracy += accuracy
                # Report the average loss and accuracy every 50 batches
                if batch_idx % 50 == 0:
                    print(f"Epoch [{epoch + 1}/{num_epochs}], Batch [{batch_idx}/{total_batches}], "
                          f"Loss: {window_loss / 50:.4f}, Accuracy: {window_accuracy / 50:.4f}")
                    window_loss, window_accuracy = 0, 0
            print(f"Epoch [{epoch + 1}/{num_epochs}] avg_loss: {epoch_loss / total_batches:.4f}, "
                  f"avg_accuracy: {epoch_accuracy / total_batches:.4f}")

    def validate(self, valid_dataloader):
        self.model.eval()
        total_loss, total_accuracy = 0, 0
        with no_grad():  # disable gradient computation
            for batch in valid_dataloader:
                loss, accuracy = self.common_step(batch)
                total_loss += loss.item()
                total_accuracy += accuracy

        print(f"Validation Loss: {total_loss / len(valid_dataloader)}, Validation Accuracy: {total_accuracy / len(valid_dataloader)}")

    def test(self, test_dataloader):
        self.model.eval()
        total_loss = 0
        with no_grad():
            for batch in test_dataloader:
                loss, _ = self.common_step(batch)
                total_loss += loss.item()

        print(f"Test Loss: {total_loss / len(test_dataloader)}")

    def __call__(self, **kwargs):
        return self.forward(**kwargs)

Let's verify a forward pass on a batch:

In [93]:
batch = next(iter(valid_dataset))
labels = batch["label"]
batch.keys()


dict_keys(['input_ids', 'entity_ids', 'entity_position_ids', 'attention_mask', 'entity_attention_mask', 'label'])

In [34]:
batch["input_ids"].shape

(1, 512)

In [35]:
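# Inspect all unique labels
unique_labels = df['string_id'].unique()
print("Unique labels:", unique_labels)

# Rebuild the label2id dictionary from the unique labels.
# This ordering differs from the value_counts() ordering used for id2label earlier,
# so id2label is rebuilt as well to keep the two mappings consistent.
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}
print("Label to ID mapping:", label2id)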

Unique labels: ['invested_in' 'acquired_by' 'founded_by' 'CEO_of' 'subsidiary_of'
 'partners_with' 'owned_by']
Label to ID mapping: {'invested_in': 0, 'acquired_by': 1, 'founded_by': 2, 'CEO_of': 3, 'subsidiary_of': 4, 'partners_with': 5, 'owned_by': 6}


In [94]:
model = LUKE(label2id)
del batch["label"]
outputs = model(**batch)


Some weights of LukeForEntityPairClassification were not initialized from the model checkpoint at studio-ousia/luke-base and are newly initialized: ['classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The initial loss should be around -ln(1/number of classes) = -ln(1/7) = 1.95:

In [95]:
criterion = nn.CrossEntropyLoss()
initial_loss = criterion(outputs.logits, labels)
print("Initial loss:", initial_loss)

Initial loss: 2.2671444
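
The expected value quoted above is the cross-entropy of a prediction that favours no class. A minimal pure-Python sketch (the function name is hypothetical, and all-zero logits stand in for an untrained classifier head):

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of a single example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

num_classes = 7
uniform_logits = [0.0] * num_classes  # every class receives the same score
loss = cross_entropy(uniform_logits, label=2)
print(round(loss, 4))
```

A freshly initialized classifier is rarely exactly uniform, so a measured initial loss somewhat above ln(7), such as the value printed above, is not unusual.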


## Train the model

Let's train the model for 7 epochs with the `train` method defined above, which prints the average loss and accuracy over every 50 batches. Note that this training loop does not implement early stopping or logging to Weights and Biases.

In [96]:
num_epochs = 7
model.train(train_dataset, num_epochs)

Epoch [1/7], Batch [50/7699], Loss: 2.0207, Accuracy: 0.1000
Epoch [1/7], Batch [100/7699], Loss: 2.1692, Accuracy: 0.0800
Epoch [1/7], Batch [150/7699], Loss: 2.1246, Accuracy: 0.1400
Epoch [1/7], Batch [200/7699], Loss: 2.1026, Accuracy: 0.1600
Epoch [1/7], Batch [250/7699], Loss: 2.1277, Accuracy: 0.1400
Epoch [1/7], Batch [300/7699], Loss: 2.0651, Accuracy: 0.1400
Epoch [1/7], Batch [350/7699], Loss: 2.0984, Accuracy: 0.1000
Epoch [1/7], Batch [400/7699], Loss: 2.1529, Accuracy: 0.0200
Epoch [1/7], Batch [450/7699], Loss: 2.0739, Accuracy: 0.0600


KeyboardInterrupt: 

In [98]:
model.validate(valid_dataset)

Validation Loss: 2.082239992773378, Validation Accuracy: 0.07532467532467532
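
To put this accuracy in context, it helps to compare it with a baseline that always predicts the most frequent training label. The sketch below computes that baseline from the training label counts printed in the "Label distribution" line earlier, under the assumption that the validation set has a similar label distribution:

```python
from collections import Counter

# Training label counts, copied from the "Label distribution" output above.
label_counts = Counter({2: 2299, 1: 1612, 0: 1558, 3: 1096, 4: 507, 5: 375, 6: 252})

# Accuracy obtained by always predicting the most frequent label.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())
print(majority_label, round(baseline_accuracy, 3))
```

A validation accuracy below this baseline suggests that the model has not yet learned a useful decision rule, which is plausible here given that training was interrupted early in the first epoch.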


In [None]:
model.test(test_dataset)

Test Loss: 2.0273792930891466


## Evaluation

Instead of calling `model.test()`, which only reports the loss, we can also manually evaluate the model on the entire test set and compute its accuracy:

In [None]:
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score

model.model.eval()

predictions_total = []
labels_total = []
with no_grad():  # no gradients are needed for evaluation
    for batch in tqdm(test_dataset):
        # separate the label from the model inputs
        labels = batch["label"]
        del batch["label"]

        # forward pass
        outputs = model.model(**batch)
        logits = outputs.logits
        predictions = logits.argmax(-1)
        predictions_total.extend(predictions.asnumpy().tolist())
        labels_total.extend(labels.asnumpy().tolist())

In [None]:
print("Accuracy on test set:", accuracy_score(labels_total, predictions_total))

## Inference

Here we test the trained model on a new, unseen sentence.

In [None]:
# The plain `LUKE` class defined above has no `load_from_checkpoint` method
# (that is a PyTorch Lightning API), so we reuse the model trained in this session.
loaded_model = model

In [None]:
test_df.iloc[0].sentence

In [None]:
from mindspore import ops

idx = 2
text = test_df.iloc[idx].sentence
entity_spans = test_df.iloc[idx].entity_spans  # character-based entity spans
entity_spans = [tuple(x) for x in entity_spans]

# Return MindSpore tensors, as in the dataset class above
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="ms")

outputs = loaded_model.model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Sentence:", text)
print("Ground truth label:", test_df.iloc[idx].string_id)
print("Predicted label:", id2label[predicted_class_idx])
print("Confidence:", ops.softmax(logits, -1).max().item())
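
The last two lines of the cell above turn the logits into a label and a confidence score: the label is the class with the largest logit, and the confidence is its softmax probability. A pure-Python sketch of that decoding step follows. The logits are made-up values for illustration, not model output, and the mapping is the `id2label` dictionary printed near the top of the notebook:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: 'founded_by', 1: 'acquired_by', 2: 'invested_in', 3: 'CEO_of',
            4: 'subsidiary_of', 5: 'partners_with', 6: 'owned_by'}

# Hypothetical logits for the 7 relation classes.
logits = [0.1, 2.3, -0.5, 0.0, 1.1, -1.2, 0.4]

probabilities = softmax(logits)
predicted_class_idx = max(range(len(logits)), key=lambda i: logits[i])
predicted_label = id2label[predicted_class_idx]
confidence = probabilities[predicted_class_idx]
print(predicted_label, round(confidence, 3))
```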