<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Reproducing_LUKE_experimental_results_Open_Entity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reproducing experimental results of LUKE on Open Entity Using Hugging Face Transformers

This notebook shows how to reproduce the state-of-the-art results on the [Open Entity entity typing dataset](https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html) reported in [the LUKE paper](https://arxiv.org/abs/2010.01057) using the Transformers library and the [fine-tuned model checkpoint](https://huggingface.co/studio-ousia/luke-large-finetuned-open-entity) available on the Model Hub.
The source code used in the experiments is also available [here](https://github.com/studio-ousia/luke/tree/master/examples/entity_typing).

There are two other related notebooks:

* [Reproducing experimental results of LUKE on TACRED Using Hugging Face Transformers](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb)
* [Reproducing experimental results of LUKE on CoNLL-2003 Using Hugging Face Transformers](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb)

In [1]:
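# When this notebook was written, LUKE was only available on the master branch.
# Recent releases of Transformers include LUKE, so `pip install transformers` should also work.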
!pip install git+https://github.com/huggingface/transformers.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-1trw6a9b
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-1trw6a9b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ...

In [2]:
import json
import torch
from tqdm import trange
from transformers import LukeTokenizer, LukeForEntityClassification

## Loading the dataset

The dataset is downloaded from the link given in [this GitHub repository](https://github.com/thunlp/ERNIE). The `test.json` file is copied to the current directory and loaded with the `load_examples` function defined below.

In [3]:
!gdown --id 1HlWw7Q6-dFSm9jNSCh4VaBf1PlGqt9im
!tar xzf /content/data.tar.gz

# Place test.json in the working directory
!cp data/OpenEntity/test.json .

Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1HlWw7Q6-dFSm9jNSCh4VaBf1PlGqt9im 

tar (child): /content/data.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
cp: cannot stat 'data/OpenEntity/test.json': No such file or directory
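
As the output above shows, the Google Drive download can fail when the link is over its access quota. The following is a minimal sketch of a guard, assuming `test.json` has been obtained some other way (for example through the browser link in the error message) and copied into the working directory. `require_file` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path object, or raise a descriptive error if it is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found. Download the Open Entity data manually and "
            "copy test.json into the working directory."
        )
    return p
```

Calling `require_file("test.json")` before `load_examples("test.json")` turns a silent download failure into an explicit error message.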


In [4]:
def load_examples(dataset_file):
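    """Load Open Entity examples from a JSON file.

    Args:
        dataset_file: Path to the JSON file (e.g. test.json).

    Returns:
        A list of dicts with the keys "text", "entity_spans"
        (character-based spans) and "label".
    """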
    with open(dataset_file, "r") as f:
        data = json.load(f)

    examples = []
    for item in data:
        examples.append(dict(
            text=item["sent"],
            entity_spans=[(item["start"], item["end"])],
            label=item["labels"]
        ))

    return examples

In [5]:
test_examples = load_examples("test.json")
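
To make the expected file layout concrete, here is a small self-contained sketch. It writes one record to a temporary file and applies the same conversion as `load_examples`. The field names (`sent`, `start`, `end`, `labels`) are the ones `load_examples` reads; the record itself is the first test example shown in this notebook, and the file name `toy.json` is only an illustration.

```python
import json
import os
import tempfile

# One record in the layout that load_examples reads (values from the first test example).
record = {
    "sent": "On late Monday night , 30th Nov 2009 , Bangladesh Police arrested Rajkhowa somewhere near Dhaka .",
    "start": 3,
    "end": 20,
    "labels": ["time"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json")  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump([record], f)
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

# Same per-item conversion as in load_examples.
examples = [
    dict(text=item["sent"], entity_spans=[(item["start"], item["end"])], label=item["labels"])
    for item in data
]

# The span is character-based, so slicing the text recovers the mention.
start, end = examples[0]["entity_spans"][0]
mention = examples[0]["text"][start:end]
print(mention)
```

Slicing the text with the span is a quick way to check that the offsets are character-based and end-exclusive.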

In [9]:
for i in range(5):
    print(test_examples[i])
    print("*"*100)

{'text': 'On late Monday night , 30th Nov 2009 , Bangladesh Police arrested Rajkhowa somewhere near Dhaka .', 'entity_spans': [(3, 20)], 'label': ['time']}
****************************************************************************************************
{'text': 'Leo W. Gerard , president of the steelworkers union , said he and several leaders of the AFL-CIO had organized joint events this week with the Sierra Club and the Alliance for Climate Protection .', 'entity_spans': [(111, 123)], 'label': ['event']}
****************************************************************************************************
{'text': 'Peace agreements will only bring further losses and push back our cause , " he added , pointing out that Abbas \'s Fatah party also maintains its own armed wing , the loosely affiliated Al - Aqsa Martyrs Brigades .', 'entity_spans': [(76, 78)], 'label': ['person']}
****************************************************************************************************
{'text

## Loading the fine-tuned model and tokenizer

We construct the model and tokenizer using the [fine-tuned model checkpoint](https://huggingface.co/studio-ousia/luke-large-finetuned-open-entity).

In [10]:
# Load the model checkpoint
model = LukeForEntityClassification.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")
model.eval()
model.to("cuda")

# Load the tokenizer
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")


Some weights of the model checkpoint at studio-ousia/luke-large-finetuned-open-entity were not used when initializing LukeForEntityClassification: ['luke.embeddings.position_ids']
- This IS expected if you are initializing LukeForEntityClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntityClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Measuring performance

We classify the entity mentions in the test set and compute micro-averaged precision, recall, and F1 over the predicted labels.
The run below reaches an F1 of about 0.782, which matches the performance reported in the [original paper](https://arxiv.org/abs/2010.01057).

In [14]:
batch_size = 128

num_predicted = 0
num_gold = 0
num_correct = 0

all_predictions = []
all_labels = []

for batch_start_idx in trange(0, len(test_examples), batch_size):
    batch_examples = test_examples[batch_start_idx:batch_start_idx + batch_size]
    texts = [example["text"] for example in batch_examples]
    entity_spans = [example["entity_spans"] for example in batch_examples]
    gold_labels = [example["label"] for example in batch_examples]

    inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
    inputs = inputs.to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)

    # Every gold label counts towards the recall denominator
    num_gold += sum(len(labels) for labels in gold_labels)
    for logits, labels in zip(outputs.logits, gold_labels):
        all_labels.append(labels)
        tmp_sample_pred = []
        # Multi-label prediction: a type is predicted when its logit is positive
        for index, logit in enumerate(logits):
            if logit > 0:
                num_predicted += 1
                predicted_label = model.config.id2label[index]
                tmp_sample_pred.append(predicted_label)
                if predicted_label in labels:
                    num_correct += 1

        all_predictions.append(tmp_sample_pred)

precision = num_correct / num_predicted
recall = num_correct / num_gold
f1 = 2 * precision * recall / (precision + recall)

print(f"\n\nprecision: {precision} recall: {recall} f1: {f1}")

100%|██████████| 16/16 [00:34<00:00,  2.19s/it]



precision: 0.7980295566502463 recall: 0.7657563025210085 f1: 0.781559903511123
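
The counting in the loop above amounts to micro-averaged precision, recall, and F1 over (example, label) pairs. The sketch below restates that computation as a standalone function; `micro_prf` is a hypothetical helper name, the toy label lists are made up for illustration, and the zero-division guards are an addition that the original loop does not have.

```python
def micro_prf(gold_labels, predicted_labels):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    Both arguments are lists with one list of label strings per example.
    """
    num_gold = sum(len(gold) for gold in gold_labels)
    num_predicted = sum(len(pred) for pred in predicted_labels)
    # A predicted label is correct when it appears among the gold labels of the same example.
    num_correct = sum(
        1
        for gold, pred in zip(gold_labels, predicted_labels)
        for label in pred
        if label in gold
    )
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 4 gold labels, 4 predictions, 3 of them correct.
toy_gold = [["person"], ["location", "place"], ["time"]]
toy_pred = [["person"], ["location"], ["event", "time"]]
precision, recall, f1 = micro_prf(toy_gold, toy_pred)
print(precision, recall, f1)
```

Applied to the lists collected above, `micro_prf(all_labels, all_predictions)` performs the same computation as the running counters in the evaluation loop.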





In [23]:
test_examples[0]

{'text': 'On late Monday night , 30th Nov 2009 , Bangladesh Police arrested Rajkhowa somewhere near Dhaka .',
 'entity_spans': [(3, 20)],
 'label': ['time']}

In [25]:
for idx, (_gold, _pred) in enumerate(zip(all_labels, all_predictions)):

    sample = test_examples[idx]
    start_char, end_char = sample["entity_spans"][0]

    print(f"{sample['text']=}") 
    print(f"{sample['entity_spans']=} \t\t {sample['text'][start_char: end_char]}") 
    print(f"{sample['label']=}") 

    print(f"Gold: {_gold} \t\t\t Pred: {_pred}")
    print("*" * 100)

    if idx>=10:
        break

sample['text']='On late Monday night , 30th Nov 2009 , Bangladesh Police arrested Rajkhowa somewhere near Dhaka .'
sample['entity_spans']=[(3, 20)] 		 late Monday night
sample['label']=['time']
Gold: ['time'] 			 Pred: ['time']
****************************************************************************************************
sample['text']='Leo W. Gerard , president of the steelworkers union , said he and several leaders of the AFL-CIO had organized joint events this week with the Sierra Club and the Alliance for Climate Protection .'
sample['entity_spans']=[(111, 123)] 		 joint events
sample['label']=['event']
Gold: ['event'] 			 Pred: ['event']
****************************************************************************************************
sample['text']='Peace agreements will only bring further losses and push back our cause , " he added , pointing out that Abbas \'s Fatah party also maintains its own armed wing , the loosely affiliated Al - Aqsa Martyrs Brigades .'
sample['en

## Detecting types of entities in a text

Finally, we detect types of entities in a text using the [fine-tuned model](https://huggingface.co/studio-ousia/luke-large-finetuned-open-entity).

In [26]:
text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
inputs.to("cuda")
outputs = model(**inputs)

predicted_indices = [index for index, logit in enumerate(outputs.logits[0]) if logit > 0]
print("Predicted entity type for Beyoncé:", [model.config.id2label[index] for index in predicted_indices])

entity_spans = [(17, 28)]  # character-based entity span corresponding to "Los Angeles"
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
inputs.to("cuda")
outputs = model(**inputs)

predicted_indices = [index for index, logit in enumerate(outputs.logits[0]) if logit > 0]
print("Predicted entity type for Los Angeles:", [model.config.id2label[index] for index in predicted_indices])

Predicted entity type for Beyoncé: ['person']
Predicted entity type for Los Angeles: ['location', 'place']
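
The cell above applies the same `logit > 0` rule twice. The sketch below factors that rule into a small helper and checks that, if each logit is read as an independent per-label score, a positive logit is the same as a sigmoid probability above 0.5. It works on plain Python floats so it runs without a GPU. `labels_above_threshold` is a hypothetical name, the label map is copied from `model.config.id2label`, and the logit values are made up for illustration rather than taken from the model.

```python
import math

# Label map copied from model.config.id2label.
id2label = {
    0: "entity", 1: "event", 2: "group", 3: "location", 4: "object",
    5: "organization", 6: "person", 7: "place", 8: "time",
}


def labels_above_threshold(logits, id2label, threshold=0.0):
    """Return every label whose logit exceeds the threshold (multi-label decision)."""
    return [id2label[index] for index, logit in enumerate(logits) if logit > threshold]


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for one mention (not real model output).
toy_logits = [-3.2, -4.0, -2.5, 1.7, -3.0, -2.8, -5.1, 0.4, -4.4]
predicted = labels_above_threshold(toy_logits, id2label)
print(predicted)

# logit > 0 and sigmoid(logit) > 0.5 select the same labels.
same_decision = all((logit > 0) == (sigmoid(logit) > 0.5) for logit in toy_logits)
print(same_decision)
```

With real model output, `outputs.logits[0].tolist()` would give the list of floats that the helper expects.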


In [27]:
text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
inputs.to("cuda")
outputs = model(**inputs)


In [32]:
inputs.to("cpu")

{'input_ids': tensor([[    0, 50265, 12674, 12695, 50265,  1074,    11,  1287,  1422,     4,
             2]]), 'entity_ids': tensor([[2]]), 'entity_position_ids': tensor([[[ 1,  2,  3,  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'entity_attention_mask': tensor([[1]])}

In [33]:
for k, v in inputs.items():
    print(f"{k}: {v}")

input_ids: tensor([[    0, 50265, 12674, 12695, 50265,  1074,    11,  1287,  1422,     4,
             2]])
entity_ids: tensor([[2]])
entity_position_ids: tensor([[[ 1,  2,  3,  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]]])
attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
entity_attention_mask: tensor([[1]])
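
The relationship between `input_ids` and `entity_position_ids` can be read off the printed tensors. The sketch below copies the values from the output above into plain lists and looks up which token IDs the entity covers. It assumes that `-1` marks padding and that ID 50265 is the special marker token the tokenizer inserts around the mention; both are inferences from the output, not statements checked against the library source.

```python
# Values copied from the tensors printed above.
input_ids = [0, 50265, 12674, 12695, 50265, 1074, 11, 1287, 1422, 4, 2]
entity_position_ids = [1, 2, 3, 4] + [-1] * 26  # padded to a fixed length of 30

# Drop the padding (assumed to be -1) and look up the covered token IDs.
positions = [p for p in entity_position_ids if p != -1]
entity_token_ids = [input_ids[p] for p in positions]
print(entity_token_ids)  # [50265, 12674, 12695, 50265]
```

Under these assumptions, the entity spans four word-piece positions: the two tokens of the mention plus the marker token on each side.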


In [31]:
model.config.id2label

{0: 'entity',
 1: 'event',
 2: 'group',
 3: 'location',
 4: 'object',
 5: 'organization',
 6: 'person',
 7: 'place',
 8: 'time'}

In [36]:
model.config

LukeConfig {
  "_name_or_path": "studio-ousia/luke-large-finetuned-open-entity",
  "architectures": [
    "LukeForEntityClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "entity_emb_size": 256,
  "entity_vocab_size": 500000,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "entity",
    "1": "event",
    "2": "group",
    "3": "location",
    "4": "object",
    "5": "organization",
    "6": "person",
    "7": "place",
    "8": "time"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "entity": 0,
    "event": 1,
    "group": 2,
    "location": 3,
    "object": 4,
    "organization": 5,
    "person": 6,
    "place": 7,
    "time": 8
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "luke",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_

In [None]:
# The entity vocabulary of the checkpoint can be downloaded from:
# https://huggingface.co/studio-ousia/luke-large-finetuned-open-entity/blob/main/entity_vocab.json
# In that JSON file, ID 2 corresponds to the [MASK] entity,
# which is the value of entity_ids shown above.

In [37]:
text = "Mahathir lives in Los Angeles."
entity_spans = [(0, 8)]  # character-based entity span corresponding to "Mahathir" (8 characters)

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
inputs.to("cuda")
outputs = model(**inputs)

predicted_indices = [index for index, logit in enumerate(outputs.logits[0]) if logit > 0]
print("Predicted entity type for Mahathir:", [model.config.id2label[index] for index in predicted_indices])


Predicted entity type for Beyoncé: ['person']
