In this notebook, we fine-tune [`LukeForEntityPairClassification`](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) on a supervised entity relation extraction dataset.

Given a sentence and the character spans of two entities within it, the model's goal is to predict the relationship between the two entities.

The authors of LUKE fine-tuned the model on [TACRED](https://nlp.stanford.edu/projects/tacred/), a widely used supervised relation extraction dataset from Stanford University, and reported state-of-the-art results on it.

* Paper: https://arxiv.org/abs/2010.01057
* Original repository: https://github.com/studio-ousia/luke

In [1]:
!pip install -q transformers 


In [2]:
!pip install -q pytorch-lightning wandb


In [3]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, xxhash, responses, multiprocess, datasets
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib

In [4]:
from transformers import LukeTokenizer, LukeForEntityPairClassification
from torch.utils.data import Dataset, DataLoader

import torch
from torch import nn
# the plain tqdm import is kept; importing tqdm.notebook as well would be shadowed by it
from tqdm import tqdm, trange

import collections

import pandas as pd
import numpy as np
import re
import os
import random

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

## Read in data

Let's mount Google Drive, where the data is stored.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Each row in the dataframe consists of a pair of entities, a sentence from a news article in which both appear, and four binary relationship labels (`Financial`, `Partner`, `People`, `Technical`). The data was gathered with pattern matching, so it might contain some noise.

In [6]:
from tqdm import tqdm, trange
import collections
from sklearn.preprocessing import OneHotEncoder

In [7]:
import pandas as pd
import numpy as np
import re

In [8]:
df= pd.read_csv("/content/drive/MyDrive/capstone/Cleaned_full_data.csv", index_col = 0)
df.head(1)

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
0,Fortino Capital,Newion,"[(138, 152), (158, 163)]",After its rapid expansion from Luxembourg into...,1.0,0.0,0.0,0.0
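In the preview above, `entity_spans` is still the raw text read from the CSV, which is why the spans are recomputed later in this notebook. As an alternative, here is a minimal sketch of parsing such a string back into tuples, assuming the column holds Python-literal text like `"[(138, 152), (158, 163)]"`. The helper name `parse_spans` is hypothetical and not part of the original notebook.

```python
import ast

def parse_spans(raw):
    """Turn a span value read from CSV into a list of (start, end) tuples."""
    # literal_eval only evaluates Python literals, so it is safer than eval
    spans = ast.literal_eval(raw) if isinstance(raw, str) else raw
    return [tuple(span) for span in spans]

raw_value = "[(138, 152), (158, 163)]"  # value copied from the preview row
spans = parse_spans(raw_value)
print(spans)
```

This could be applied to the whole column with `df["entity_spans"].apply(parse_spans)`, but it only restores whatever offsets were saved; it does not check them against the sentence.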


In [9]:
# Note: entity spans must be computed in the same notebook as the LUKE model;
# when read back from CSV they are parsed as strings instead of lists of tuples.
drop_index = []
for ind in df.index:
    sentence = df.at[ind, "sentence"]
    # re.escape makes the entity names match literally rather than as regex patterns
    match_a = re.search(re.escape(df.at[ind, "entity_a"].strip()), sentence)
    match_b = re.search(re.escape(df.at[ind, "entity_b"].strip()), sentence)
    if match_a is not None and match_b is not None:
        # End offsets are stored inclusive (end() - 1), matching the spans shown in the tables.
        # The LukeTokenizer documentation examples use exclusive end offsets, so this
        # shortens each span by one character; it is kept to stay consistent with the results below.
        # df.at avoids the chained-assignment SettingWithCopyWarning.
        df.at[ind, "entity_spans"] = [
            (match_a.start(), match_a.end() - 1),
            (match_b.start(), match_b.end() - 1),
        ]
    else:
        # re.search returns None when an entity is not found in the sentence
        drop_index.append(ind)

df = df.drop(index=drop_index)
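The span search in the cell above can be illustrated on a toy sentence. This is a minimal sketch, not the notebook's exact code: the helper name `find_inclusive_span` and the example sentence are made up, and it assumes the same convention as the data, where the end offset points at the last character of the entity.

```python
import re

def find_inclusive_span(entity, sentence):
    """Return (start, end) character offsets of `entity`, with `end` inclusive."""
    # re.escape treats the entity name as literal text, so names with
    # characters such as "+" or "." do not act as regex operators
    match = re.search(re.escape(entity.strip()), sentence)
    if match is None:
        return None
    return (match.start(), match.end() - 1)

# hypothetical sentence, for illustration only
sentence = "The round was led by Keen Venture Partners and Fortino Capital."
span = find_inclusive_span("Fortino Capital", sentence)
start, end = span
# with an inclusive end offset, the entity text is sentence[start:end + 1]
print(span, sentence[start:end + 1])
```

Because the end offset is inclusive, slicing with `sentence[start:end]` would drop the last character; code that expects exclusive offsets needs `end + 1`.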


In [10]:
len(df)

1276

In [11]:
validation_index = pd.read_csv('/content/drive/MyDrive/capstone/valid_ids.csv', header = None)

In [12]:
valid_ids = validation_index[0].tolist()

In [13]:
len(valid_ids)

166

## Train validation split

In [14]:
df

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
0,Fortino Capital,Newion,"[(138, 152), (158, 163)]",After its rapid expansion from Luxembourg into...,1.0,0.0,0.0,0.0
1,Fortino Capital,Charles Souillard,"[(128, 142), (46, 62)]","As part of the transaction, Miguel Valdes and ...",0.0,0.0,1.0,0.0
2,Fortino Capital,Miguel Valdes,"[(128, 142), (28, 40)]","As part of the transaction, Miguel Valdes and ...",0.0,0.0,1.0,0.0
3,Fortino Capital,Autodesk,"[(288, 302), (166, 173)]",Belgium's Oqton scores $40 million to 'disrupt...,0.0,0.0,1.0,0.0
4,Fortino Capital,SimplyDelivery,"[(230, 244), (0, 13)]","SimplyDelivery, the Berlin-based startup which...",1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
1271,Notion,Sentry,"[(233, 238), (304, 309)]","Akto, a Palo Alto, California-based startup bu...",0.0,0.0,0.0,0.0
1272,Accel India,Tenable,"[(143, 153), (271, 277)]","Akto, a Palo Alto, California-based startup bu...",0.0,0.0,0.0,0.0
1273,Accel India,Sentry,"[(143, 153), (304, 309)]","Akto, a Palo Alto, California-based startup bu...",0.0,0.0,0.0,0.0
1274,Tenable,Sentry,"[(271, 277), (304, 309)]","Akto, a Palo Alto, California-based startup bu...",0.0,0.0,0.0,0.0


In [15]:
# valid_ids are row ids (index labels), so select by label rather than by position
val_df = df.loc[valid_ids]

In [None]:
val_df

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
3,Fortino Capital,Autodesk,"[(288, 302), (166, 173)]",Belgium's Oqton scores $40 million to 'disrupt...,0.0,0.0,1.0,0.0
8,Fortino Capital,Efficy CRM,"[(0, 14), (172, 181)]","Fortino Capital Growth PE I, the firm as secon...",1.0,0.0,0.0,0.0
11,Fortino Capital,Kaizo,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0
15,Fortino Capital,Bonitasoft,"[(32, 46), (67, 76)]","Out on European tour, Belgium's Fortino Capita...",1.0,0.0,0.0,0.0
22,Fortino Capital,Pires,"[(133, 147), (209, 213)]",The Series A round was co-led by Luxembourg-ba...,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
1196,SoftBank Opportunity Fund,Lenny Rachitsky,"[(173, 197), (214, 228)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0
1197,SoftBank Opportunity Fund,Y Combinator,"[(173, 197), (200, 211)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0
1198,Michael Seibel,Lenny Rachitsky,"[(265, 278), (214, 228)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0
1199,Michael Seibel,Y Combinator,"[(265, 278), (200, 211)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0


In [None]:
train_ids = [i for i in df.index if i not in valid_ids]

In [None]:
len(train_ids)

1110

In [None]:
# train_ids are index labels taken from df.index, so select by label
train_df = df.loc[train_ids]

In [None]:
train_df = train_df.dropna()

## Define the PyTorch dataset and dataloaders


In our case, each item of the dataset consists of a sentence, the spans of two entities in the sentence, and a multi-hot label vector over the four relationship types.
We use `LukeTokenizer` to turn these into the inputs expected by the model, which are `input_ids`, `entity_ids`, `attention_mask`, `entity_attention_mask` and `entity_position_ids`.

For more information regarding these inputs, refer to the [docs](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) of `LukeForEntityPairClassification`.


In [43]:
from transformers import LukeTokenizer
from torch.utils.data import Dataset, DataLoader 
import torch


#tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
#model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")

In [44]:
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base", task="entity_pair_classification")


In [45]:
class RelationExtractionDataset(Dataset):
    """Relation extraction dataset."""

    def __init__(self, data):
        """
        Args:
            data : Pandas dataframe.
        """
        self.data = data
        

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data.iloc[idx]

        sentence = item.sentence
        entity_spans = [tuple(x) for x in item.entity_spans]

        # `tokenizer` is the global LukeTokenizer created above
        encoding = tokenizer(sentence, entity_spans=entity_spans, padding="max_length", truncation=True, return_tensors="pt")

        # drop the batch dimension added by return_tensors="pt"
        for k, v in encoding.items():
            encoding[k] = v.squeeze()
        # multi-hot label vector over the four relationship types, as float32
        labels = item[['Financial', 'Partner', 'People', 'Technical']]
        encoding["label"] = torch.tensor(labels.to_numpy(dtype="float32"))

        return encoding

Here we instantiate the class defined above twice: once for the training dataset and once for the validation dataset.

In [None]:

# define the dataset
train_dataset = RelationExtractionDataset(train_df)
valid_dataset = RelationExtractionDataset(data=val_df)


In [None]:
from transformers import TrainingArguments,LukeForEntityPairClassification, Trainer

Let's define the corresponding dataloaders (which allow us to iterate over the elements of the dataset):

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=1)


## Train and validate the model using the Transformers Trainer

In [None]:
model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-base", num_labels = 4, problem_type="multi_label_classification")

Some weights of the model checkpoint at studio-ousia/luke-base were not used when initializing LukeForEntityPairClassification: ['lm_head.layer_norm.weight', 'entity_predictions.transform.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'entity_predictions.transform.LayerNorm.weight', 'entity_predictions.transform.dense.weight', 'entity_predictions.transform.LayerNorm.bias', 'entity_predictions.bias', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LukeForEntityPairClassificati

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [None]:
# import only load_metric; importing datasets.Dataset here would shadow torch.utils.data.Dataset
from datasets import load_metric

In [None]:
def compute_metrics(eval_pred):
    precision = load_metric("precision")
    recall = load_metric("recall")
    f1 = load_metric("f1")
    accuracy = load_metric("accuracy")

    logits, labels = eval_pred
    # A logit >= 0 corresponds to a sigmoid probability >= 0.5, so threshold at zero
    predictions = (logits >= 0).astype(int)

    # Flatten the (num_examples, num_labels) arrays: every label decision is scored
    # as one binary prediction, and "macro" then averages over the classes 0 and 1
    true_predictions = predictions.flatten().tolist()
    true_labels = labels.astype(int).flatten().tolist()

    precision_score = precision.compute(predictions=true_predictions, references=true_labels, average="macro")["precision"]
    recall_score = recall.compute(predictions=true_predictions, references=true_labels, average="macro")["recall"]
    f1_score = f1.compute(predictions=true_predictions, references=true_labels, average="macro")["f1"]
    accuracy_score = accuracy.compute(predictions=true_predictions, references=true_labels)["accuracy"]
    return {"precision": precision_score, "recall": recall_score, "f1": f1_score, "accuracy": accuracy_score}
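The model is loaded with `problem_type="multi_label_classification"`, so each of the four logits is scored independently through a sigmoid. Thresholding the logits at zero, as `compute_metrics` does, is then the same as thresholding the sigmoid probabilities at 0.5. The sketch below illustrates this in plain Python; the function names and the example logits are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def threshold_logits(logits):
    """Multi-hot prediction: 1 where the logit is >= 0, else 0."""
    return [1 if z >= 0 else 0 for z in logits]

def threshold_probabilities(logits, cutoff=0.5):
    """Multi-hot prediction: 1 where sigmoid(logit) >= cutoff, else 0."""
    return [1 if sigmoid(z) >= cutoff else 0 for z in logits]

# made-up logits for Financial, Partner, People, Technical
example_logits = [2.3, -0.7, 0.0, -4.1]
print(threshold_logits(example_logits))
print(threshold_probabilities(example_logits))
```

A cutoff other than 0.5 would need the probability form, since it no longer corresponds to a zero logit.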

In [None]:
EPOCHS = 10
LR = 1e-5
WD = 0.01
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 8

training_args = TrainingArguments(
    # change folder name here, to avoid replacing the previous model's outputs
    output_dir="/content/drive/MyDrive/capstone/relationship_origin", 
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WD,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    load_best_model_at_end=True
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
torch.cuda.empty_cache()

In [None]:
CKPT = None
train_result = trainer.train(resume_from_checkpoint=CKPT)
trainer.save_model()
trainer.save_state()

***** Running training *****
  Num examples = 1108
  Num Epochs = 10
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 1380
  Number of trainable parameters = 274508288
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
0,0.1487,0.214158,0.852488,0.840776,0.84645,0.911145
1,0.1105,0.219434,0.859278,0.835125,0.846443,0.912651
2,0.0783,0.246933,0.872022,0.849564,0.860167,0.920181
3,0.0498,0.237089,0.862959,0.843528,0.85276,0.915663
4,0.0403,0.19526,0.885927,0.875692,0.880686,0.930723
5,0.0292,0.283839,0.849117,0.839858,0.844373,0.909639
6,0.0224,0.230777,0.879185,0.858885,0.868536,0.924699
7,0.0164,0.251195,0.868482,0.863619,0.866021,0.921687
8,0.0144,0.258808,0.868373,0.856133,0.862065,0.920181
9,0.0111,0.252111,0.863078,0.851014,0.85686,0.917169
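The step counts in the training log can be reproduced with a little arithmetic. The sketch below assumes that the number of optimizer updates per epoch is the number of batches floor-divided by the gradient accumulation steps. That rule is an assumption that matches the logged numbers for this run, not a statement about every Transformers version, and the function name `optimization_steps` is made up for illustration.

```python
import math

def optimization_steps(num_examples, batch_size, grad_accum_steps, epochs):
    """Return (updates per epoch, total updates) under the assumed floor-division rule."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # one optimizer update happens every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch, updates_per_epoch * epochs

# values taken from the training log and the hyperparameter cell
per_epoch, total = optimization_steps(
    num_examples=1108, batch_size=1, grad_accum_steps=8, epochs=10
)
effective_batch_size = 1 * 8  # per-device batch size times accumulation steps
print(per_epoch, total, effective_batch_size)
```

With 138 updates per epoch, a checkpoint is written every 138 steps, so `checkpoint-690` corresponds to the fifth evaluation (row 4 in the table above), which has the lowest validation loss (0.19526).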


***** Running Evaluation *****
  Num examples = 166
  Batch size = 1
Saving model checkpoint to /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138
Configuration saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138/config.json
Model weights saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138/special_tokens_map.json
added tokens file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-138/added_tokens.json
***** Running Evaluation *****
  Num examples = 166
  Batch size = 1
Saving model checkpoint to /content/drive/MyDrive/capstone/relationship_origin/checkpoint-276
Configuration saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-276/config.json
Model weig

In [None]:
train_df.isnull().sum()

entity_a        0
entity_b        0
entity_spans    0
sentence        0
Financial       0
Partner         0
People          0
Technical       0
dtype: int64

In [None]:
# Inference on the validation set, one example at a time

labels = []  # collects the predicted multi-hot label vector of each example
for b_id, batch in tqdm(enumerate(valid_dataset), total = len(valid_dataset)):
    del batch['label']
    # LUKE expects a leading batch dimension, e.g. (1, num_tokens) for input_ids
    for k, v in batch.items():
        batch[k] = torch.unsqueeze(v, 0)
    inputs = batch.to(device)
    with torch.no_grad():
        # pass all items of the batch dict as keyword arguments
        outputs = model(**inputs)

    logits = outputs.logits
    # a logit >= 0 corresponds to a sigmoid probability >= 0.5
    logits[logits >= 0] = 1
    logits[logits < 0] = 0
    preds = logits.cpu().detach().numpy()[0]

    labels.append(preds)


100%|██████████| 166/166 [00:07<00:00, 22.20it/s]


In [None]:
pred_val = pd.DataFrame()
pred_val['company_a'] = val_df['entity_a']
pred_val['company_b'] = val_df['entity_b']
pred_val['sentence'] = val_df['sentence']
pred_val['Financial'] = np.array(labels)[:,0]
pred_val['Partner'] = np.array(labels)[:,1]
pred_val['People'] = np.array(labels)[:,2]
pred_val['Technical'] = np.array(labels)[:,3]

In [None]:
pred_val.head()

Unnamed: 0,company_a,company_b,sentence,Financial,Partner,People,Technical
3,Fortino Capital,Autodesk,Belgium's Oqton scores $40 million to 'disrupt...,1.0,0.0,0.0,0.0
8,Fortino Capital,Efficy CRM,"Fortino Capital Growth PE I, the firm as secon...",1.0,0.0,0.0,0.0
11,Fortino Capital,Kaizo,"I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0
15,Fortino Capital,Bonitasoft,"Out on European tour, Belgium's Fortino Capita...",1.0,0.0,0.0,0.0
22,Fortino Capital,Pires,The Series A round was co-led by Luxembourg-ba...,1.0,0.0,0.0,0.0


In [None]:
pred_val.to_csv('/content/drive/MyDrive/capstone/relationship_prediction.csv')

In [18]:
## For later downstream inference use.
model1 = LukeForEntityPairClassification.from_pretrained('/content/drive/MyDrive/capstone/relationship_origin/checkpoint-690', problem_type="multi_label_classification")
model1.to(device)

LukeForEntityPairClassification(
  (luke): LukeModel(
    (embeddings): LukeEmbeddings(
      (word_embeddings): Embedding(50267, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (entity_embeddings): LukeEntityEmbeddings(
      (entity_embeddings): Embedding(500000, 256, padding_idx=0)
      (entity_embedding_dense): Linear(in_features=256, out_features=768, bias=False)
      (position_embeddings): Embedding(514, 768)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): LukeEncoder(
      (layer): ModuleList(
        (0): LukeLayer(
          (attention): LukeAttention(
            (self): LukeSelfAttention(
              (query): Linear(in_

In [19]:
NER_result = pd.read_csv('/content/drive/MyDrive/capstone/ner-full-data-val-pred - ner-full-data-val-pred.csv',header = None)

In [20]:
NER_result = NER_result.rename(columns = {0:'sentence', 1:'entities'})

In [21]:
NER_result.head()

Unnamed: 0,sentence,entities
0,"YES (Yield Engineering Systems, Inc.), a leadi...",OSAT Powertech Technology
1,"-- Menta S.A.S, a premier supplier of embedded...","Andes Technology,Menta"
2,A new supply chain report today reveals that F...,"Pegatron,Kinsus Interconnect Technology,Fuyang..."
3,A wholly-owned subsidiary of the JSE-listed Mu...,"Rectron,Mustek Limited"
4,"Abdul Hadi Jameel, business manager at NZXT Mi...",Rectron


## Process the NER result
### Generate combinations of entity pairs

In [22]:
from collections import defaultdict

In [23]:
sentence_dict = defaultdict(list)

In [24]:
for i in NER_result.index:
    entities = NER_result['entities'].iloc[i].split(',')
    sentence = NER_result['sentence'].iloc[i]
    if len(entities) > 1: # only keep sentences with at least two entities, i.e. at least one pair
        sentence_dict[sentence] = entities
    

In [25]:
len(sentence_dict)

83

In [26]:
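import itertools
pair_dict = {}
for key in sentence_dict:
    # every unordered pair of entities found in the sentence
    for subset in itertools.combinations(sentence_dict[key], 2):
        # note: a pair that occurs in several sentences keeps only the last sentence
        pair_dict[subset] = key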
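The pair generation above can be tried on toy data. This is a minimal sketch with made-up sentences and entity names; the function name `build_pair_dict` is hypothetical. It shows two properties of the approach: a sentence with n entities yields n(n-1)/2 pairs, and a pair that appears in more than one sentence ends up mapped to the last sentence processed.

```python
import itertools
import math

def build_pair_dict(sentence_entities):
    """Map every unordered entity pair to the sentence it was found in."""
    pair_to_sentence = {}
    for sentence, entities in sentence_entities.items():
        for pair in itertools.combinations(entities, 2):
            # a later sentence with the same pair overwrites the earlier one
            pair_to_sentence[pair] = sentence
    return pair_to_sentence

# toy data, for illustration only
toy_sentences = {
    "A partners with B and C.": ["A", "B", "C"],
    "B invests in C.": ["B", "C"],
}
toy_pairs = build_pair_dict(toy_sentences)
print(toy_pairs)
print(math.comb(3, 2))  # number of pairs from three entities
```

If keeping every sentence per pair matters, one option would be to key the dictionary on `(pair, sentence)` instead, at the cost of changing the later lookups.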

In [37]:
len(pair_dict)

354

In [60]:
valid_pairs = {}

In [33]:
val_df = val_df.reset_index(drop = True)

In [36]:
len(val_df)

166

In [61]:
for i in val_df.index:
    pair_1 = (val_df['entity_a'].iloc[i], val_df['entity_b'].iloc[i])
    pair_2 = (val_df['entity_b'].iloc[i], val_df['entity_a'].iloc[i]) # the reversed order is also valid
    if pair_1 in pair_dict:
        valid_pairs[pair_1] = pair_dict[pair_1]
    if pair_2 in pair_dict:
        new_key = (pair_2[1], pair_2[0]) # switch back to the order that matches the ground truth
        valid_pairs[new_key] = pair_dict[pair_2]
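The matching step above accepts a generated pair in either order but always stores it in the ground-truth order. A minimal sketch of the same logic on toy data follows; the function name `match_pairs` and the example values are made up for illustration.

```python
def match_pairs(truth_pairs, pair_dict):
    """Keep ground-truth pairs that the NER step also produced, in either order."""
    valid = {}
    for entity_a, entity_b in truth_pairs:
        # try the ground-truth order first, then the reversed order
        for candidate in ((entity_a, entity_b), (entity_b, entity_a)):
            if candidate in pair_dict:
                # always key the result by the ground-truth order
                valid[(entity_a, entity_b)] = pair_dict[candidate]
    return valid

# toy data, for illustration only
toy_pair_dict = {("B", "A"): "sentence 1", ("C", "D"): "sentence 2"}
toy_truth = [("A", "B"), ("C", "D"), ("E", "F")]
matched = match_pairs(toy_truth, toy_pair_dict)
print(matched)
```

Storing the result under the ground-truth order is what lets a later join on `entity_a` and `entity_b` line up predictions with labels.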
    

In [62]:
len(valid_pairs)

126

In [64]:
generated_entity_a = []
generated_entity_b = []
generated_sentences = []
for pair in valid_pairs:
    generated_entity_a.append(pair[0])
    generated_entity_b.append(pair[1])
    generated_sentences.append(valid_pairs[pair])
generated_df = pd.DataFrame(columns = ["entity_a","entity_b","entity_spans","sentence",'Financial', 'Partner', 'People', 'Technical'])
generated_df["entity_a"] = generated_entity_a
generated_df["entity_b"] = generated_entity_b
generated_df["sentence"] = generated_sentences 
generated_df['Financial'] = [0.0 for i in range(len(valid_pairs))]
generated_df['Partner'] =  [0.0 for i in range(len(valid_pairs))]
generated_df['People'] =  [0.0 for i in range(len(valid_pairs))]
generated_df['Technical'] =  [0.0 for i in range(len(valid_pairs))]

In [65]:
drop_index = []
for ind in generated_df.index:
    sentence = generated_df.at[ind, "sentence"]
    # re.escape makes the entity names match literally rather than as regex patterns
    match_a = re.search(re.escape(generated_df.at[ind, "entity_a"].strip()), sentence)
    match_b = re.search(re.escape(generated_df.at[ind, "entity_b"].strip()), sentence)
    if match_a is not None and match_b is not None:
        # end offsets are stored inclusive (end() - 1), as in the training data;
        # .at avoids the chained-assignment SettingWithCopyWarning
        generated_df.at[ind, "entity_spans"] = [
            (match_a.start(), match_a.end() - 1),
            (match_b.start(), match_b.end() - 1),
        ]
    else:
        # re.search returns None when an entity is not found in the sentence
        drop_index.append(ind)

generated_df = generated_df.drop(index = drop_index)


In [66]:
generated_df.head()

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
0,Fortino Capital,Kaizo,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",0.0,0.0,0.0,0.0
1,Fortino Capital,Bonitasoft,"[(32, 46), (67, 76)]","Out on European tour, Belgium's Fortino Capita...",0.0,0.0,0.0,0.0
2,Fortino Capital,Sandvik,"[(288, 302), (345, 351)]",Belgium's Oqton scores $40 million to 'disrupt...,0.0,0.0,0.0,0.0
3,Fortino Capital,Slick Software Solutions,"[(0, 14), (67, 90)]","Fortino Capital's portfolio includes Bloomon, ...",0.0,0.0,0.0,0.0
4,Fortino Capital,Keen Venture Partners,"[(108, 122), (34, 54)]",The round was led by London-based Keen Venture...,0.0,0.0,0.0,0.0


In [67]:
generated_valid_dataset = RelationExtractionDataset(data=generated_df)

In [48]:
generated_valid_dataset

<__main__.RelationExtractionDataset at 0x7f7eac4aa070>

In [68]:
labels = []  # collects the predicted multi-hot label vector of each example
for b_id, batch in tqdm(enumerate(generated_valid_dataset), total = len(generated_valid_dataset)):
    del batch['label']
    # LUKE expects a leading batch dimension, e.g. (1, num_tokens) for input_ids
    for k, v in batch.items():
        batch[k] = torch.unsqueeze(v, 0)
    inputs = batch.to(device)
    with torch.no_grad():
        # pass all items of the batch dict as keyword arguments
        outputs = model1(**inputs)

    logits = outputs.logits
    # a logit >= 0 corresponds to a sigmoid probability >= 0.5
    logits[logits >= 0] = 1
    logits[logits < 0] = 0
    preds = logits.cpu().detach().numpy()[0]

    labels.append(preds)

100%|██████████| 126/126 [00:05<00:00, 23.08it/s]


In [69]:
generated_df['Financial'] = np.array(labels)[:,0]
generated_df['Partner'] = np.array(labels)[:,1]
generated_df['People'] = np.array(labels)[:,2]
generated_df['Technical'] = np.array(labels)[:,3]

In [70]:
generated_df

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
0,Fortino Capital,Kaizo,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0
1,Fortino Capital,Bonitasoft,"[(32, 46), (67, 76)]","Out on European tour, Belgium's Fortino Capita...",1.0,0.0,0.0,0.0
2,Fortino Capital,Sandvik,"[(288, 302), (345, 351)]",Belgium's Oqton scores $40 million to 'disrupt...,1.0,0.0,0.0,0.0
3,Fortino Capital,Slick Software Solutions,"[(0, 14), (67, 90)]","Fortino Capital's portfolio includes Bloomon, ...",1.0,0.0,0.0,0.0
4,Fortino Capital,Keen Venture Partners,"[(108, 122), (34, 54)]",The round was led by London-based Keen Venture...,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
121,Box Group,Scribble Ventures,"[(205, 213), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0
122,Lux Capital,Scribble Ventures,"[(192, 202), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0
123,SoftBank Opportunity Fund,Lenny Rachitsky,"[(173, 197), (214, 228)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0
124,SoftBank Opportunity Fund,Y Combinator,"[(173, 197), (200, 211)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0


In [71]:
con_val_df = generated_df.merge(val_df, how='inner', on=['entity_a','entity_b'],
          suffixes=('_pred', '_truth'))

In [72]:
con_val_df

Unnamed: 0,entity_a,entity_b,entity_spans_pred,sentence_pred,Financial_pred,Partner_pred,People_pred,Technical_pred,entity_spans_truth,sentence_truth,Financial_truth,Partner_truth,People_truth,Technical_truth
0,Fortino Capital,Kaizo,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0
1,Fortino Capital,Bonitasoft,"[(32, 46), (67, 76)]","Out on European tour, Belgium's Fortino Capita...",1.0,0.0,0.0,0.0,"[(32, 46), (67, 76)]","Out on European tour, Belgium's Fortino Capita...",1.0,0.0,0.0,0.0
2,Fortino Capital,Sandvik,"[(288, 302), (345, 351)]",Belgium's Oqton scores $40 million to 'disrupt...,1.0,0.0,0.0,0.0,"[(288, 302), (345, 351)]",Belgium's Oqton scores $40 million to 'disrupt...,1.0,0.0,0.0,0.0
3,Fortino Capital,Slick Software Solutions,"[(0, 14), (67, 90)]","Fortino Capital's portfolio includes Bloomon, ...",1.0,0.0,0.0,0.0,"[(0, 14), (67, 90)]","Fortino Capital's portfolio includes Bloomon, ...",1.0,0.0,0.0,0.0
4,Fortino Capital,Keen Venture Partners,"[(108, 122), (34, 54)]",The round was led by London-based Keen Venture...,1.0,0.0,0.0,0.0,"[(108, 122), (34, 54)]",The round was led by London-based Keen Venture...,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122,Box Group,Scribble Ventures,"[(205, 213), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0,"[(205, 213), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0
123,Lux Capital,Scribble Ventures,"[(192, 202), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0,"[(192, 202), (216, 232)]","WellTheory, a San Francisco, CA-based provider...",0.0,0.0,0.0,0.0
124,SoftBank Opportunity Fund,Lenny Rachitsky,"[(173, 197), (214, 228)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0,"[(173, 197), (214, 228)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0
125,SoftBank Opportunity Fund,Y Combinator,"[(173, 197), (200, 211)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0,"[(173, 197), (200, 211)]","Emerge Career, a US provider of a platform for...",0.0,0.0,0.0,0.0


In [74]:
y_pred = con_val_df[['Financial_pred','Partner_pred','People_pred','Technical_pred']]
y_true = con_val_df[['Financial_truth','Partner_truth','People_truth','Technical_truth']]

## Performance evaluation on entity pairs generated from the NER results

In [78]:
print('Macro average f1 on validation dataset:', f1_score(y_true, y_pred, average='macro'))
print('Macro average precision on validation dataset:', precision_score(y_true, y_pred, average='macro'))
print('Macro average recall on validation dataset:', recall_score(y_true, y_pred, average='macro'))

Macro average f1 on validation dataset: 0.7361295681063122
Macro average precision on validation dataset: 0.7968726675623227
Macro average recall on validation dataset: 0.6945091945091946
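For two-dimensional multi-label inputs like `y_true` and `y_pred`, scikit-learn's `average='macro'` computes the score separately for each label column and takes the unweighted mean. This differs from the `compute_metrics` function used during training, which flattens all label decisions and averages over the two classes 0 and 1, so the two sets of numbers are not directly comparable. The sketch below reimplements the per-label version in plain Python on made-up data; the function names are hypothetical, and an F1 of 0.0 is assumed when a label has no true positives.

```python
def f1_for_label(y_true, y_pred):
    """Binary F1 for one label column, given lists of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # assumption: undefined scores are counted as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true_rows, y_pred_rows):
    """Unweighted mean of the per-label F1 scores."""
    num_labels = len(y_true_rows[0])
    scores = [
        f1_for_label([row[j] for row in y_true_rows], [row[j] for row in y_pred_rows])
        for j in range(num_labels)
    ]
    return sum(scores) / num_labels

# toy data with two labels, for illustration only
toy_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
toy_pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
print(macro_f1(toy_true, toy_pred))
```

Because every label contributes equally to the mean, a rare label with few positives can move the macro score as much as a frequent one.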
