<a href="https://colab.research.google.com/github/Heity94/TWSM_Lab/blob/main/Project/Notebooks/PH_NER_FineTune_ST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Declare whether you are on Colab or local
colab = True

In [2]:
if colab:
  
  #Mount drive
  from google.colab import drive
  drive.mount('/content/drive')

  #install required packages
  !pip install -U sentence-transformers -q

Mounted at /content/drive


In [31]:
#set path to data in Google Drive
data_path = "/content/drive/MyDrive/2022_Analytics Lab Student Projects/Data/All Topics"
data_path_group = data_path[:-10]+"Topic 1/Data_Team1/" # create new data path to access files created by Team1
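
The slice `data_path[:-10]` works only because the folder name "All Topics" happens to be exactly ten characters long. A minimal sketch of a less fragile alternative, assuming the same Drive layout as above, steps up one directory with `pathlib` instead; the names `base_dir` and `data_path_group_alt` are illustrative and not part of the notebook.

```python
from pathlib import PurePosixPath

data_path = "/content/drive/MyDrive/2022_Analytics Lab Student Projects/Data/All Topics"

# Go up one level, from ".../Data/All Topics" to ".../Data"
base_dir = PurePosixPath(data_path).parent

# Rebuild the team folder; the trailing slash keeps later string concatenation working
data_path_group_alt = str(base_dir / "Topic 1" / "Data_Team1") + "/"

# Same result as the slicing approach used in the notebook
assert data_path_group_alt == data_path[:-10] + "Topic 1/Data_Team1/"
print(data_path_group_alt)
```

If the "All Topics" folder were ever renamed, this version would still resolve to the sibling "Topic 1" folder, whereas the slice would silently produce a wrong path.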

In [14]:
# Import sentence_transformers 
from sentence_transformers import SentenceTransformer, util, InputExample, losses, evaluation
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, BinaryClassificationEvaluator, TripletEvaluator, LabelAccuracyEvaluator

In [30]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from datetime import datetime

### Load pretrained sentence transformer model

In [52]:
model_name = "all-MiniLM-L6-v2"

In [53]:
output_path = data_path_group+"NER/Training_results/"+model_name+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [41]:
# Load a pre-trained model
model = SentenceTransformer(model_name)

# Fine-tuning of the pretrained model


## Synonyms dataset

### Load data

In [54]:
syn_ont_train = pd.read_csv(data_path_group+"NER/Synonym_Dataset/NER_ontology_train_df_syn.csv", index_col=0)
syn_ont_val = pd.read_csv(data_path_group+"NER/Synonym_Dataset/NER_ontology_val_df_syn.csv", index_col=0)
syn_ont_test= pd.read_csv(data_path_group+"NER/Synonym_Dataset/NER_ontology_test_df_syn.csv", index_col=0)

In [8]:
syn_ont_train.head(3)

Unnamed: 0,entity_id,category,label,synonym,is_synonym
533794,smart contract,domain specific entity,TECHNOLOGY,intelligent process automation,0
280854,energy information system,domain specific entity,TECHNOLOGY,energy software,1
361291,archival research,research method,COLLECTION_METHOD,governmental and governmental record,1


In [40]:
# Only to test if everything is working
#syn_ont_train = syn_ont_train.iloc[:10_000].copy()

### Create sample sets for training/evaluation

In [55]:
syn_train_samples = []
for index, row in syn_ont_train.iterrows():
  inp_example = InputExample(texts=[row['entity_id'], row['synonym']], label=float(row['is_synonym']))
  syn_train_samples.append(inp_example)

In [25]:
syn_evaluation_samples = []
for index, row in syn_ont_val.iterrows():
  inp_example = InputExample(texts=[row['entity_id'], row['synonym']], label=float(row['is_synonym']))
  syn_evaluation_samples.append(inp_example)

In [26]:
syn_test_samples = []
for index, row in syn_ont_test.iterrows():
  inp_example = InputExample(texts=[row['entity_id'], row['synonym']], label=float(row['is_synonym']))
  syn_test_samples.append(inp_example)

### Baseline score of the pretrained model without fine-tuning

In [42]:
# Instantiate Binary Classification evaluator for test set
test_evaluator = BinaryClassificationEvaluator.from_input_examples(syn_test_samples, name='NER-syn-test', show_progress_bar=True)

In [56]:
# The baseline output folder must already exist in Drive; otherwise the evaluator raises an error when it writes its results
model_name_baseline = "all-MiniLM-L6-v2_baseline"
output_path_baseline = data_path_group+"NER/Training_results/"+model_name_baseline

In [57]:
# Evaluate performance on test set
test_result_baseline = test_evaluator(model, output_path=output_path_baseline)

Batches:   0%|          | 0/163 [00:00<?, ?it/s]

In [58]:
print(test_result_baseline)

0.9078516208024627


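Without any fine-tuning on the ontology, the pretrained model already scores about 0.91 on the test set. Note that the single number returned by `BinaryClassificationEvaluator` is, at least in the sentence-transformers 2.x releases as far as I know, the best average precision across the similarity measures it tries rather than plain accuracy, so it should not be read literally as "91% of the examples classified correctly"; accuracy, F1, and the chosen thresholds are written to the CSV file in the output folder. After fine-tuning, this score should be higher than the baseline.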
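
To make such a score easier to interpret, here is a minimal pure-Python sketch of the idea behind a binary classification evaluation on text pairs: compute the cosine similarity of each pair, then look for the similarity threshold that separates synonyms from non-synonyms best. The two-dimensional vectors and the function names are made up for illustration; the library's evaluator works on real embeddings, reports further metrics, and is implemented differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_threshold_accuracy(similarities, labels):
    """Try each observed similarity as a threshold; predict 1 when sim >= threshold."""
    best_acc, best_thr = 0.0, None
    for thr in sorted(set(similarities)):
        preds = [1 if s >= thr else 0 for s in similarities]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr

# Toy (embedding_a, embedding_b, is_synonym) pairs, not taken from the dataset
pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([0.5, 0.5], [0.6, 0.4], 1),
    ([0.2, 0.8], [0.9, -0.1], 0),
]
sims = [cosine(u, v) for u, v, _ in pairs]
labels = [y for _, _, y in pairs]
acc, thr = best_threshold_accuracy(sims, labels)
print(f"accuracy={acc:.2f} at threshold={thr:.3f}")
```

Because the threshold is tuned on the same pairs it is scored on, an accuracy obtained this way is optimistic; that is one reason to keep a separate validation set for choosing the threshold.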

### Fine-tuning the pretrained model

In [59]:
syn_train_dataloader = DataLoader(syn_train_samples, shuffle=True, batch_size=64)
syn_train_loss = losses.CosineSimilarityLoss(model=model)
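
For intuition, `CosineSimilarityLoss` pushes the cosine similarity of the two embeddings in each pair towards the pair's label (here 1.0 for synonyms and 0.0 otherwise), by default via a mean squared error. The following is a minimal sketch of that objective on hand-written toy vectors, assuming the default MSE; the function names are illustrative and this is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_mse(batch):
    """Mean squared error between cosine(u, v) and the target label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in batch]
    return sum(errors) / len(errors)

toy_batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical vectors labelled as synonyms: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal vectors labelled as non-synonyms: no error
    ([1.0, 0.0], [0.0, 1.0], 1.0),  # orthogonal vectors labelled as synonyms: squared error 1
]
loss = cosine_similarity_mse(toy_batch)
print(loss)
```

Only the third pair contributes to the loss, so during training the gradient would pull those two embeddings closer together.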

In [60]:
num_epochs = 1
warmup_steps = int(np.ceil(len(syn_train_dataloader) * num_epochs * 0.01))  # 1% of the training steps for warm-up (must be a whole number of steps)
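
As a worked example of this calculation: the training output in this notebook reports 8825 iterations for the single epoch, so 1% of the steps rounds up to 89 warm-up steps. The sketch below redoes that arithmetic and shows a simple linear warm-up followed by linear decay. The `lr_multiplier` helper is hypothetical and only illustrates the shape of such a schedule; it is not the scheduler code used by `model.fit`.

```python
import math

steps_per_epoch = 8825  # taken from the "Iteration" progress bar in the training output
num_epochs = 1
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * 0.01)  # 1% of all training steps

def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warm-up from 0 to 1, then linear decay back to 0 (illustrative only)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(warmup_steps)
for step in (0, 45, 89, 4457, 8825):
    print(step, round(lr_multiplier(step, warmup_steps, total_steps), 3))
```

The multiplier would be applied to the base learning rate, so the first updates on the pretrained weights stay small.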

In [61]:
evaluator = BinaryClassificationEvaluator.from_input_examples(syn_evaluation_samples)

In [62]:
model.fit(train_objectives=[(syn_train_dataloader, syn_train_loss)], 
          epochs=num_epochs, 
          warmup_steps=warmup_steps, 
          evaluator=evaluator, 
          evaluation_steps=5000, 
          output_path=output_path)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/8825 [00:00<?, ?it/s]

In [63]:
# Evaluate performance on test set
test_result = test_evaluator(model, output_path=output_path)

Batches:   0%|          | 0/163 [00:00<?, ?it/s]

In [64]:
print(test_result)

0.9986310212959735


Training the model for one epoch on the complete training dataset raised the test score from about 0.908 to 0.999, which is very close to the maximum of 1.0. A score this high suggests checking for overfitting and how to prevent it, for example with a smaller learning rate or early stopping (although with only one epoch of training there is little room to stop earlier).
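
The early-stopping idea can be sketched independently of the training loop: evaluate on the validation set at regular intervals and stop once the score has not improved for a given number of evaluations. The helper below is a minimal, hypothetical illustration in plain Python; the scores are invented for the example and are not results from this notebook, and the helper is not wired into `model.fit`.

```python
def early_stopping_index(val_scores, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted."""
    best = float("-inf")
    bad_rounds = 0
    for i, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return i
    return None

# Hypothetical validation scores, one per evaluation round (not results from this notebook)
hypothetical_scores = [0.91, 0.95, 0.97, 0.969, 0.968, 0.970]
stop_at = early_stopping_index(hypothetical_scores, patience=2)
print(stop_at)
```

With a single epoch and an evaluation every 5000 steps, only very few evaluation rounds exist, so a rule like this would need a smaller `evaluation_steps` value or more epochs to be useful.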

## Triplets dataset

### Load data

In [12]:
trip_ont_train = pd.read_csv(data_path_group+"NER/Triplets_Dataset/NER_ontology_train_triplets.csv", index_col=0)
trip_ont_val = pd.read_csv(data_path_group+"NER/Triplets_Dataset/NER_ontology_val_triplets.csv", index_col=0)
trip_ont_test = pd.read_csv(data_path_group+"NER/Triplets_Dataset/NER_ontology_test_triplets.csv", index_col=0)

In [13]:
trip_ont_train.head(3)

Unnamed: 0,entity_id,positive_example,negative_example
0,survey,consumer based standardized questionnaire appr...,biometric measurement
1,computer supported cooperative work,applications enabled collaborations,Personal Information Protection and Electronic...
2,IT skill,competencies for digital employee,floppy disk
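
Each row is a triplet of an anchor (`entity_id`), a positive example, and a negative example. One common way to score an embedding model on such data, assumed here purely for illustration, is to count how often the anchor lies closer to the positive than to the negative example. The sketch uses hand-written toy vectors in place of real embeddings, and the function names are made up.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_accuracy(triplets):
    """Share of (anchor, positive, negative) triplets with the positive closer to the anchor."""
    correct = sum(cosine_distance(a, p) < cosine_distance(a, n) for a, p, n in triplets)
    return correct / len(triplets)

toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer to the anchor: counted as correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # negative is closer to the anchor: counted as wrong
]
print(triplet_accuracy(toy_triplets))
```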
