In this notebook, we fine-tuned [`LukeForEntityPairClassification`](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) on a supervised entity relation extraction dataset.

Given a sentence and the character spans of two entities within it, the goal for the model is to predict the relationship between those entities.
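As a rough illustration of what one input to this task looks like, here is a minimal sketch in plain Python. The sentence, the entity names and the label are made up for illustration, and the end offsets are exclusive (as in Python slicing), which is an assumption of this sketch rather than a description of how the dataset below stores them.

```python
# Hypothetical example; the sentence and entity names are invented.
sentence = "Acme Ventures led the seed round of Widget Labs."
names = ["Acme Ventures", "Widget Labs"]

# Character spans as (start, end) pairs, with an exclusive end offset.
entity_spans = []
for name in names:
    start = sentence.index(name)
    entity_spans.append((start, start + len(name)))

# Slicing the sentence with each span recovers the entity mention.
entities = [sentence[start:end] for start, end in entity_spans]

# One training example: the text, the two spans, and the relation to predict.
example = {"sentence": sentence, "entity_spans": entity_spans, "label": "Financial"}
print(entity_spans, entities)
```

The model only sees the sentence and the two spans; the label is what it has to predict.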

The authors of LUKE fine-tuned this model on the [TACRED](https://nlp.stanford.edu/projects/tacred/) dataset, a widely used supervised relation extraction dataset from Stanford University, and obtained state-of-the-art results on it.

* Paper: https://arxiv.org/abs/2010.01057
* Original repository: https://github.com/studio-ousia/luke

In [None]:
!pip install -q transformers 


In [None]:
!pip install -q pytorch-lightning wandb


In [None]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, xxhash, responses, multiprocess, datasets
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.

In [None]:
from transformers import LukeTokenizer, LukeForEntityPairClassification
from torch.utils.data import Dataset, DataLoader

import torch
from torch import nn
from tqdm import tqdm, trange

import collections

import pandas as pd
import numpy as np
import re
import os
import random

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

In [None]:
# def seed_everything(seed=42):
#     random.seed(seed)
#     os.environ['PYTHONHASHSEED'] = str(seed)
#     np.random.seed(seed)
#     torch.manual_seed(seed)
#     torch.cuda.manual_seed(seed)
#     torch.cuda.manual_seed_all(seed)
#     # Some cudnn methods can be random even after fixing the seed 
#     # unless you tell it to be deterministic
#     torch.backends.cudnn.deterministic = True

# seed_everything(1234)

## Read in data

Let's load the data from Google Drive, which we first mount in Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Each row in the dataframe consists of a news article (identified by its URL) and a sentence from it in which a certain relationship was found (such as "invested_in" or "founded_by"). The data was gathered using patterns, so it might contain some noise.


In [None]:
df = pd.read_csv("/content/drive/MyDrive/capstone/data_relation_cleaned.csv")
df.head(1)

Unnamed: 0.1,Unnamed: 0,Company A,Company B,Sentence,Type,Degree,Url,a_start,a_end,b_start,b_end,words_start,words_end,type_cleaned
0,0,Fortino Capital,Newion,After its rapid expansion from Luxembourg into...,Investment,indirect,https://www.eu-startups.com/2021/07/luxembourg...,138,152,158,163,"[0, 5, 9, 15, 25, 30, 41, 46, 54, 58, 71, 80, ...","[4, 8, 14, 24, 29, 40, 45, 53, 57, 70, 79, 84,...",Financial


## Data Cleaning and feature engineering

In [None]:
len(df)

327

In [None]:
df.isnull().sum()

Unnamed: 0      0
Company A       0
Company B       0
Sentence        0
Type            2
Degree          0
Url             9
a_start         0
a_end           0
b_start         0
b_end           0
words_start     0
words_end       0
type_cleaned    1
dtype: int64

In [None]:
df = df.dropna(subset = ['Type'])

In [None]:
ohe = OneHotEncoder()
transformed = ohe.fit_transform(df[['type_cleaned']])
df[ohe.categories_[0]] = transformed.toarray()

In [None]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,Company A,Company B,Sentence,Type,Degree,Url,a_start,a_end,b_start,b_end,words_start,words_end,type_cleaned,Financial,Partner,People,Technical
0,0,Fortino Capital,Newion,After its rapid expansion from Luxembourg into...,Investment,indirect,https://www.eu-startups.com/2021/07/luxembourg...,138,152,158,163,"[0, 5, 9, 15, 25, 30, 41, 46, 54, 58, 71, 80, ...","[4, 8, 14, 24, 29, 40, 45, 53, 57, 70, 79, 84,...",Financial,1.0,0.0,0.0,0.0


In [None]:
# Fill in the multi-label data points: rows typed 'Investment/People' get two positive labels
df.loc[df.Type == 'Investment/People', 'Financial'] = 1.0
df.loc[df.Type == 'Investment/People', 'Partner'] = 1.0

In [None]:
df.type_cleaned.value_counts().index

Index(['Financial', 'Technical', 'People', 'Partner'], dtype='object')
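The label encoding above amounts to a multi-hot vector over the four classes: a one-hot vector for most rows, with extra positions switched on for multi-label rows. A minimal sketch of that idea in plain Python follows; the helper name `multi_hot` is hypothetical and not part of the notebook.

```python
# Column order matches the one-hot columns created above.
CLASSES = ["Financial", "Partner", "People", "Technical"]

def multi_hot(active_classes):
    """Return a float vector with 1.0 for every class in active_classes."""
    unknown = set(active_classes) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    return [1.0 if name in active_classes else 0.0 for name in CLASSES]

print(multi_hot({"Financial"}))             # single-label row
print(multi_hot({"Financial", "Partner"}))  # multi-label row
```

Keeping the labels as floats matters later on, because multi-label classification is trained with a binary cross-entropy loss per class.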

## Prepare the data in the LUKE tokenizer format

Clean up the entities' start and end indices.

In [None]:
df = df.reset_index(drop = True)

In [None]:
dropInd = []
for ind in df.index:
    sentence = df.iloc[ind, 3]      # Sentence
    wA = df.iloc[ind, 1].strip()    # Company A
    wB = df.iloc[ind, 2].strip()    # Company B
    # re.escape makes the names match literally, even if they contain
    # regex metacharacters such as "." or "("
    matchA = re.search(re.escape(wA), sentence)
    matchB = re.search(re.escape(wB), sentence)
    # re.search returns None if the name is not found
    if matchA is not None and matchB is not None:
        # Overwrite the stored spans (columns 7-10: a_start, a_end, b_start, b_end)
        # with the first match; the end index is inclusive (last character)
        df.iloc[ind, 7] = matchA.start()
        df.iloc[ind, 8] = matchA.end() - 1
        df.iloc[ind, 9] = matchB.start()
        df.iloc[ind, 10] = matchB.end() - 1
    else:
        # Drop rows where either entity cannot be found in the sentence
        dropInd.append(ind)
df = df.drop(index=dropInd)

In [None]:
len(df)

282
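The span clean-up above can be tried out on a single sentence. The sketch below is a standalone illustration with a hypothetical helper `find_span`; it escapes the entity name before searching, so that characters such as `.` are matched literally, and it returns an inclusive end index like the columns in this dataframe.

```python
import re

def find_span(name, sentence):
    """Return (start, inclusive_end) of the first literal match, or None."""
    match = re.search(re.escape(name.strip()), sentence)
    if match is None:
        return None
    return match.start(), match.end() - 1

# Made-up sentence: without escaping, "A.B" would also match "AxB".
sentence = "xAxB and A.B signed a deal"
print(find_span("A.B", sentence))
print(find_span("Missing Corp", sentence))
```

Rows for which this kind of lookup returns `None` for either entity are the ones that get dropped.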

In [None]:
data = pd.DataFrame(columns = ["entity_a","entity_b","entity_spans","sentence",'Financial', 'Partner', 'People', 'Technical'])
data['entity_a'] = df['Company A']
data['entity_b'] = df['Company B']
data['sentence'] = df['Sentence']
data['Financial'] = df['Financial']
data['Partner'] = df['Partner']
data['People'] = df['People']
data['Technical'] = df['Technical']

span = df[['a_start', 'a_end','b_start','b_end']].copy()
span['combined']= span.values.tolist()
spans = span['combined']
for index, span in enumerate(spans): 
    data.iloc[index,2] = [[(span[0], span[1]),(span[2], span[3])]]
    
data.head()

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
0,Fortino Capital,Newion,"[(138, 152), (158, 163)]",After its rapid expansion from Luxembourg into...,1.0,0.0,0.0,0.0
1,Fortino Capital,Charles Souillard,"[(128, 142), (46, 62)]","As part of the transaction, Miguel Valdes and ...",0.0,0.0,1.0,0.0
2,Fortino Capital,Miguel Valdes,"[(128, 142), (28, 40)]","As part of the transaction, Miguel Valdes and ...",0.0,0.0,1.0,0.0
3,Fortino Capital,Autodesk,"[(288, 302), (166, 173)]",Belgium's Oqton scores $40 million to 'disrupt...,0.0,0.0,1.0,0.0
4,Fortino Capital,SimplyDelivery,"[(230, 244), (0, 13)]","SimplyDelivery, the Berlin-based startup which...",1.0,0.0,0.0,0.0


## Define the PyTorch dataset and dataloaders


In our case, each item of the dataset consists of a sentence, the spans of 2 entities in the sentence, and a label of the relationship. 
We use `LukeTokenizer` to turn these into the inputs expected by the model, which are `input_ids`, `entity_ids`, `attention_mask`, `entity_attention_mask` and `entity_position_ids`.

For more information regarding these inputs, refer to the [docs](https://huggingface.co/transformers/model_doc/luke.html#lukeforentitypairclassification) of `LukeForEntityPairClassification`.


In [None]:
from transformers import LukeTokenizer
from torch.utils.data import Dataset, DataLoader 
import torch


#tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
#model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")

In [None]:
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base", task="entity_pair_classification")


In [None]:
class RelationExtractionDataset(Dataset):
    """Relation extraction dataset."""

    def __init__(self, data):
        """
        Args:
            data : Pandas dataframe.
        """
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data.iloc[idx]

        sentence = item.sentence
        entity_spans = [tuple(x) for x in item.entity_spans]

        # Uses the global `tokenizer` created above
        encoding = tokenizer(sentence, entity_spans=entity_spans, padding="max_length", truncation=True, return_tensors="pt")

        # Remove the batch dimension added by return_tensors="pt"
        for k, v in encoding.items():
            encoding[k] = v.squeeze()
        # Multi-hot label vector, stored as floats for multi-label classification
        labels = item[['Financial', 'Partner', 'People', 'Technical']]
        encoding["label"] = torch.tensor(labels.to_numpy(dtype="float32"))

        return encoding

Below, we split the data and instantiate the class defined above for the training, validation and test sets.

## Balance class ratios across the train, validation and test sets

In [None]:
from sklearn.model_selection import train_test_split

dt_fin = data.loc[(df['Financial'] == 1) & (df['Partner'] != 1.0)]
dt_part = data.loc[df['Partner'] == 1]
dt_ppl = data.loc[df['People'] == 1]
dt_tech = data.loc[df['Technical'] == 1]

train_df_fin, test_df_fin = train_test_split(dt_fin, test_size=0.2, random_state=42, shuffle=False)
train_df_fin, val_df_fin = train_test_split(train_df_fin, test_size=0.2, random_state=42, shuffle=False)

train_df_part, test_df_part = train_test_split(dt_part, test_size=0.2, random_state=42, shuffle=False)
train_df_part, val_df_part = train_test_split(train_df_part, test_size=0.2, random_state=42, shuffle=False)

train_df_ppl, test_df_ppl = train_test_split(dt_ppl, test_size=0.2, random_state=42, shuffle=False)
train_df_ppl, val_df_ppl = train_test_split(train_df_ppl, test_size=0.2, random_state=42, shuffle=False)

train_df_tech, test_df_tech = train_test_split(dt_tech, test_size=0.2, random_state=42, shuffle=False)
train_df_tech, val_df_tech = train_test_split(train_df_tech, test_size=0.2, random_state=42, shuffle=False)

train_frames = [train_df_fin, train_df_part,train_df_ppl, train_df_tech]
train_df = pd.concat(train_frames)

val_frames = [val_df_fin, val_df_part,val_df_ppl, val_df_tech]
val_df = pd.concat(val_frames)

test_frames = [test_df_fin, test_df_part,test_df_ppl, test_df_tech]
test_df = pd.concat(test_frames)

# shuffle
test_df = test_df.iloc[np.random.permutation(len(test_df))]
val_df = val_df.iloc[np.random.permutation(len(val_df))]
train_df = train_df.iloc[np.random.permutation(len(train_df))]
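To make the arithmetic of this per-class split concrete: holding out 20% for testing and then 20% of the remainder for validation gives roughly a 64/16/20 split within every class. The sketch below reproduces that in plain Python. The helper names are hypothetical, and rounding the held-out part up is an assumption about how the split sizes are computed.

```python
import math

def three_way_split(rows, holdout_pct=20):
    """Split rows in order (no shuffling) into train, validation and test lists."""
    n_test = math.ceil(len(rows) * holdout_pct / 100)
    rest, test = rows[:len(rows) - n_test], rows[len(rows) - n_test:]
    n_val = math.ceil(len(rest) * holdout_pct / 100)
    train, val = rest[:len(rest) - n_val], rest[len(rest) - n_val:]
    return train, val, test

def stratified_split(rows_by_class):
    """Split every class separately, then pool the pieces per split."""
    train, val, test = [], [], []
    for rows in rows_by_class.values():
        tr, va, te = three_way_split(rows)
        train += tr
        val += va
        test += te
    return train, val, test

# Toy data: 100 rows of one class and 50 of another.
toy = {"Financial": list(range(100)), "Technical": list(range(100, 150))}
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```

Splitting each class on its own keeps the class proportions similar in all three sets, which a single split over the whole dataframe would not guarantee for a dataset this small.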

In [None]:
train_df

Unnamed: 0,entity_a,entity_b,entity_spans,sentence,Financial,Partner,People,Technical
103,Fortino Capital,Manifold Investments,"[(133, 147), (241, 260)]",The Series A round was co-led by Luxembourg-ba...,1.0,0.0,0.0,0.0
13,Fortino Capital,Kaizo,"[(281, 295), (118, 122)]","I am excited to have joined Fortino, to streng...",1.0,0.0,0.0,0.0
158,King Yuan Electronics,Huawei,"[(135, 155), (28, 33)]","""We previously thought that Huawei would be on...",0.0,0.0,0.0,1.0
204,Andes Technology,Menta,"[(61, 76), (21, 25)]",“It is an honour for Menta to work in close pa...,0.0,0.0,0.0,1.0
228,AP Memory Technology,NXP Connect Partner Program,"[(38, 57), (159, 185)]","TAIPEI, July 26, 2020 /PRNewswire/ -- AP Memor...",0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
203,Andes Technology,Mediatek,"[(30, 45), (0, 7)]",Mediatek has already absorbed Andes Technology...,0.0,0.0,0.0,1.0
0,Fortino Capital,Newion,"[(138, 152), (158, 163)]",After its rapid expansion from Luxembourg into...,1.0,0.0,0.0,0.0
143,Macronix,Huawei Technologies Co,"[(0, 7), (137, 158)]",Macronix International Co (旺宏電子) yesterday sai...,0.0,0.0,0.0,1.0
190,Andes Technology,Amazon,"[(108, 123), (53, 58)]",By combining the RISC-V platform with solution...,0.0,0.0,0.0,1.0


In [None]:

# define the dataset
train_dataset = RelationExtractionDataset(train_df)
valid_dataset = RelationExtractionDataset(data=val_df)
test_dataset = RelationExtractionDataset(data=test_df)

In [None]:
from transformers import TrainingArguments,LukeForEntityPairClassification, Trainer

Let's define the corresponding dataloaders (which allow us to iterate over the elements of the dataset):

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=1)
test_dataloader = DataLoader(test_dataset, batch_size=1)

## Train and validate the model using the Transformers Trainer

In [None]:
model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-base", num_labels = 4, problem_type="multi_label_classification")


Some weights of the model checkpoint at studio-ousia/luke-base were not used when initializing LukeForEntityPairClassification: ['lm_head.dense.weight', 'entity_predictions.transform.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'entity_predictions.transform.dense.bias', 'entity_predictions.transform.LayerNorm.weight', 'entity_predictions.transform.LayerNorm.bias', 'lm_head.layer_norm.weight', 'entity_predictions.bias']
- This IS expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeForEntityPairClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LukeForEntityPairClassificati

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [None]:
# Import only load_metric, so that torch's Dataset class is not shadowed
from datasets import load_metric

In [None]:
def compute_metrics(p):
    precision = load_metric("precision")
    recall = load_metric("recall")
    f1 = load_metric("f1")
    accuracy = load_metric("accuracy")

    logits, labels = p
    # A logit >= 0 corresponds to a sigmoid probability >= 0.5
    predictions = (logits >= 0).astype(int)

    # Flatten the (num_examples, num_labels) arrays so that every
    # (example, label) pair counts as one binary decision
    true_predictions = predictions.reshape(-1).tolist()
    true_labels = labels.reshape(-1).astype(int).tolist()

    precision_value = precision.compute(predictions=true_predictions, references=true_labels, average="macro")["precision"]
    recall_value = recall.compute(predictions=true_predictions, references=true_labels, average="macro")["recall"]
    f1_value = f1.compute(predictions=true_predictions, references=true_labels, average="macro")["f1"]
    accuracy_value = accuracy.compute(predictions=true_predictions, references=true_labels)["accuracy"]
    return {"precision": precision_value, "recall": recall_value, "f1": f1_value, "accuracy": accuracy_value}
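The metric function thresholds the raw logits at 0, which is the same as thresholding the sigmoid probability at 0.5, and then scores all (example, label) pairs as one flat list of binary decisions. The following plain-Python sketch walks through that on toy numbers; the helper functions are hypothetical stand-ins, not the `datasets` metrics themselves.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binarize(logits):
    """Threshold logits at 0, i.e. sigmoid probability at 0.5."""
    return [[1 if z >= 0 else 0 for z in row] for row in logits]

def flatten(rows):
    return [value for row in rows for value in row]

def f1_for_class(preds, refs, cls):
    tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
    fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
    fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(preds, refs):
    """Unweighted mean of the F1 scores of the 0 and 1 classes."""
    return (f1_for_class(preds, refs, 0) + f1_for_class(preds, refs, 1)) / 2

toy_logits = [[2.0, -1.0], [-0.5, 0.3]]   # made-up model outputs
toy_labels = [[1, 0], [1, 1]]             # made-up ground truth
flat_preds = flatten(binarize(toy_logits))
flat_refs = flatten(toy_labels)
print(flat_preds, flat_refs, round(macro_f1(flat_preds, flat_refs), 4))
```

Note that with this flattening, the macro average is taken over the two binary outcomes (0 and 1) rather than over the four relation types, so correctly predicted zeros contribute to the reported scores.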

In [None]:
EPOCHS = 20
LR = 1e-5
WD = 0.01
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 8

training_args = TrainingArguments(
    # change folder name here, to avoid replacing the previous model's outputs
    output_dir="/content/drive/MyDrive/capstone/relationship_origin", 
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WD,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    load_best_model_at_end=True
)
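With a per-device batch size of 1 and 8 gradient accumulation steps, the effective batch size is 8. The training log reports 219 training examples, checkpoints every 27 steps and 540 optimization steps in total; the short sketch below shows how those numbers fit together. The helper function is hypothetical, and the floor division is an assumption inferred from the logged step counts.

```python
import math

def optimization_steps(num_examples, per_device_batch, accumulation_steps, epochs):
    """Return (updates per epoch, total updates) for a single device."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # Assumption: leftover batches that do not fill a full accumulation
    # window are not counted as a separate update.
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch, updates_per_epoch * epochs

per_epoch, total = optimization_steps(
    num_examples=219, per_device_batch=1, accumulation_steps=8, epochs=20
)
print(per_epoch, total)  # 27 540
```

This also explains the checkpoint names such as `checkpoint-27` and `checkpoint-54`: one checkpoint is saved per epoch, and step counts are measured in parameter updates, not in individual examples.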

In [None]:
train_dataset

<__main__.RelationExtractionDataset at 0x7faba8abe8d0>

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Evaluate on the validation set during training, so that the test set stays held out
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
torch.cuda.empty_cache()

In [None]:
CKPT = None
train_result = trainer.train(resume_from_checkpoint=CKPT)
trainer.save_model()
trainer.save_state()

***** Running training *****
  Num examples = 219
  Num Epochs = 20
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 540
  Number of trainable parameters = 274508288
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
***** Running Evaluation *****
  Num examples = 57
  Batch size = 1
  


Saving model checkpoint to /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27
Configuration saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27/config.json
Model weights saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27/special_tokens_map.json
added tokens file saved in /content/drive/MyDrive/capstone/relationship_origin/checkpoint-27/added_tokens.json
Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
0,0.4569,0.522645,0.636436,0.605263,0.613704,0.741228
1,0.1867,0.598757,0.67756,0.654971,0.66368,0.763158
2,0.1114,0.774439,0.652719,0.637427,0.643558,0.745614
3,0.0482,0.653414,0.761534,0.71345,0.731011,0.815789
4,0.0224,0.743485,0.749775,0.72807,0.737461,0.811404
5,0.0111,0.70744,0.774038,0.75731,0.764905,0.828947
6,0.006,0.715416,0.798951,0.780702,0.789017,0.846491
7,0.0043,0.731208,0.801011,0.774854,0.786305,0.846491
8,0.0034,0.792764,0.785121,0.774854,0.77971,0.837719
9,0.0028,0.790599,0.786495,0.769006,0.776961,0.837719



In [None]:


model.eval()  # switch off dropout for inference
labels = []   # collects the predicted multi-hot vectors
for b_id, batch in tqdm(enumerate(valid_dataset), total = len(valid_dataset)):
    # The ground-truth label is not a model input
    del batch['label']
    # LUKE expects inputs of shape (batch_size, ...), so add a batch dimension of 1
    for k,v in batch.items():
        batch[k] = torch.unsqueeze(batch[k],0)
    inputs = batch.to(device)
    with torch.no_grad():
        # Pass all items of the current batch dict to the model
        outputs = model(**inputs)

    # A logit >= 0 corresponds to a sigmoid probability >= 0.5
    preds = (outputs.logits >= 0).float().cpu().numpy()[0]
    print(preds)

    labels.append(preds)


  3%|▎         | 2/58 [00:00<00:03, 14.68it/s]

[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]


 12%|█▏        | 7/58 [00:00<00:02, 17.86it/s]

[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]


 22%|██▏       | 13/58 [00:00<00:02, 20.34it/s]

[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]


 28%|██▊       | 16/58 [00:00<00:02, 20.98it/s]

[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]


 38%|███▊      | 22/58 [00:01<00:01, 21.38it/s]

[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]


 48%|████▊     | 28/58 [00:01<00:01, 21.61it/s]

[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 0.]


 53%|█████▎    | 31/58 [00:01<00:01, 21.63it/s]

[0. 0. 0. 1.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]


 64%|██████▍   | 37/58 [00:01<00:00, 21.54it/s]

[0. 0. 0. 1.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]


 74%|███████▍  | 43/58 [00:02<00:00, 21.61it/s]

[1. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 0.]
[1. 0. 0. 0.]


 79%|███████▉  | 46/58 [00:02<00:00, 21.64it/s]

[0. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]


 90%|████████▉ | 52/58 [00:02<00:00, 21.67it/s]

[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]


100%|██████████| 58/58 [00:02<00:00, 20.98it/s]

[0. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 0.]
[0. 0. 0. 1.]





In [None]:
pred_val = pd.DataFrame()
pred_val['company_a'] = val_df['entity_a']
pred_val['company_b'] = val_df['entity_b']
pred_val['sentence'] = val_df['sentence']
pred_val['Financial'] = np.array(labels)[:,0]
pred_val['Partner'] = np.array(labels)[:,1]
pred_val['People'] = np.array(labels)[:,2]
pred_val['Technical'] = np.array(labels)[:,3]

In [None]:
pred_val.head()

Unnamed: 0,company_a,company_b,sentence,Financial,Partner,People,Technical
129,Powertech Technology,PTI,PTI currently holds 11.6% in Tera Probe and fo...,0.0,0.0,0.0,0.0
257,Rectron,Vivotek,The aim of the partnership is to further stren...,0.0,0.0,0.0,0.0
279,VIA Technologies,Linux,VIA Technologies has launched a Linux-driven c...,0.0,0.0,0.0,1.0
124,Powertech Technology,Intel,With Intel agreeing to sell its NAND flash and...,0.0,0.0,0.0,1.0
142,Macronix,Foxconn,Macronix declined to comment on Foxconn's lat...,1.0,0.0,0.0,0.0
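Each prediction row is a multi-hot vector over the four relation types, so it can be turned back into readable label names. The sketch below shows one way to do that in plain Python; the `decode` helper is hypothetical, and an all-zero vector simply means that no class passed the threshold.

```python
# Same column order as the prediction dataframe above.
LABELS = ["Financial", "Partner", "People", "Technical"]

def decode(prediction):
    """Map a multi-hot prediction vector to the list of predicted label names."""
    return [name for name, flag in zip(LABELS, prediction) if flag == 1.0]

print(decode([0.0, 0.0, 0.0, 1.0]))
print(decode([1.0, 0.0, 0.0, 0.0]))
print(decode([0.0, 0.0, 0.0, 0.0]))
```

Rows that decode to an empty list, like the first two rows in the table above, are cases where the model did not assign any relation type.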


In [None]:
pred_val.to_csv('/content/drive/MyDrive/capstone/relationship_prediction.csv')