<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width=400px style="opacity:0.7">
</center>

# BERT-based encoder fine-tuning using Hugging Face `LoraConfig`

**Model: `distilbert-base-uncased`**

**Purpose: Text classification**

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors and is part of its training material.
Currently, only participants in SupportVectors workshops may study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b>These notebooks are chapters and sections from the textbook on data science that Asif Qamar is writing, so we request that you not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Load the model and dataset from Hugging Face

In [2]:
from svlearn_lora_ft.bert_full_ft_trainer import evaluate, prepare_dataset, get_model_tokenizer_collator, get_subject_dataset
from svlearn_lora_ft import config
from pathlib import Path
from dotenv import load_dotenv
import os

load_dotenv()

project_root = Path(os.getenv("BOOTCAMP_ROOT_DIR"))

# Dataset paths
history_file = config["dataset-paths"]["history_file"]
physics_file = config["dataset-paths"]["physics_file"]
biology_file = config["dataset-paths"]["biology_file"]

eval_output_dir = config["final-ft-model-paths"]["eval_output_dir"]

# The fine-tuned LoRA adapter is stored in the best_model subdirectory
finetuned_model_dir = f'{config["final-ft-model-paths"]["subject_model_lora_peft"]}/best_model'

# Load dataset with absolute paths
train_dataset, test_dataset, label2id = get_subject_dataset(history_file, physics_file, biology_file)

# Model + tokenizer
model, tokenizer, data_collator = get_model_tokenizer_collator(num_labels=3, label2id=label2id)
# Tokenize
tokenized_train, tokenized_test = prepare_dataset(tokenizer, train_dataset, test_dataset)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/43497 [00:00<?, ? examples/s]

Map:   0%|          | 0/10875 [00:00<?, ? examples/s]

## Evaluate the base model before fine-tuning

In [3]:
# Initial evaluation on test-data (before fine-tuning)
initial_results = evaluate(model, tokenizer, data_collator, tokenized_test, eval_output_dir)
initial_results

{'eval_loss': 1.0941574573516846,
 'eval_model_preparation_time': 0.0005,
 'eval_accuracy': 0.4926896551724138,
 'eval_precision': 0.24274309631391203,
 'eval_recall': 0.4926896551724138,
 'eval_f1': 0.32524255188982854,
 'eval_runtime': 16.7905,
 'eval_samples_per_second': 647.689,
 'eval_steps_per_second': 20.25}
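
The metrics above are consistent with a freshly initialized classification head that assigns every test example to a single class. As a hedged check, assume that `evaluate` reports support-weighted precision, recall, and F1 (the helper's internals are not shown in this notebook). Under that assumption, a model that always predicts a class making up a fraction $p$ of the test set scores accuracy $p$, weighted recall $p$, weighted precision $p^2$, and weighted F1 $2p^2/(1+p)$. The sketch below derives these values from the reported accuracy alone and compares them with the reported figures; the function name is illustrative only.

```python
import math

# Metrics reported by the pre-fine-tuning evaluation above.
reported = {
    "accuracy": 0.4926896551724138,
    "precision": 0.24274309631391203,
    "recall": 0.4926896551724138,
    "f1": 0.32524255188982854,
}


def single_class_weighted_metrics(p: float) -> dict:
    """Support-weighted metrics for a model that always predicts one class.

    p is the share of the test set that belongs to the predicted class.
    Assumption: classes that are never predicted contribute precision 0.
    """
    return {
        "accuracy": p,                    # only the predicted class is ever correct
        "precision": p * p,               # per-class precision p, weighted by support p
        "recall": p,                      # per-class recall 1, weighted by support p
        "f1": 2 * p * p / (1 + p),        # per-class F1 2p/(1+p), weighted by support p
    }


expected = single_class_weighted_metrics(reported["accuracy"])

# Compare each derived value with the reported one.
matches = {
    name: math.isclose(expected[name], reported[name], abs_tol=1e-9)
    for name in reported
}
print(matches)
```

If every entry matches, the pre-fine-tuning accuracy of about 0.49 is most likely just the share of the largest class in the test set rather than evidence that the base model has learned anything about the task. The reported loss of about 1.094 is also close to $\ln 3 \approx 1.0986$, the cross-entropy of a uniform guess over three classes.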

## Load the PEFT-based fine-tuned model and evaluate it

In [4]:
from peft import PeftModel

best_finetuned_model = PeftModel.from_pretrained(model, finetuned_model_dir)

final_results = evaluate(best_finetuned_model, tokenizer, data_collator, tokenized_test, eval_output_dir)
final_results


{'eval_loss': 0.06915368139743805,
 'eval_model_preparation_time': 0.0008,
 'eval_accuracy': 0.9808735632183908,
 'eval_precision': 0.9809136817536726,
 'eval_recall': 0.9808735632183908,
 'eval_f1': 0.9808465676272272,
 'eval_runtime': 18.0293,
 'eval_samples_per_second': 603.183,
 'eval_steps_per_second': 18.858}
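
To make the effect of the LoRA adapter easier to read, the two result dictionaries can be compared side by side. The sketch below copies the relevant values from the outputs above and converts accuracy into misclassification counts. It assumes the test set holds 10875 examples, a figure taken from the second `Map` progress bar earlier in the notebook; `TEST_SIZE` is a name introduced here for illustration.

```python
# Values copied from the two evaluation outputs in this notebook.
initial = {
    "eval_loss": 1.0941574573516846,
    "eval_accuracy": 0.4926896551724138,
    "eval_f1": 0.32524255188982854,
}
final = {
    "eval_loss": 0.06915368139743805,
    "eval_accuracy": 0.9808735632183908,
    "eval_f1": 0.9808465676272272,
}

# Assumption: test-set size as shown by the second Map progress bar.
TEST_SIZE = 10875

# Print each metric before and after fine-tuning, with the absolute change.
for key in initial:
    delta = final[key] - initial[key]
    print(f"{key:15s} {initial[key]:.4f} -> {final[key]:.4f} ({delta:+.4f})")

# Convert accuracy into the number of misclassified test examples.
errors_before = round((1 - initial["eval_accuracy"]) * TEST_SIZE)
errors_after = round((1 - final["eval_accuracy"]) * TEST_SIZE)
print(f"misclassified: {errors_before} -> {errors_after} of {TEST_SIZE}")
```

Counting errors rather than quoting accuracy alone shows how many test examples the adapter still gets wrong, which is a natural starting point for error analysis.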