<a href="https://www.kaggle.com/code/ritvik1909/finetuning-with-lora?scriptVersionId=160425038" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
%%capture
!pip install accelerate==0.26.1 transformers==4.35.2 datasets==2.16.1 peft==0.7.1 evaluate==0.4.0
!pip uninstall wandb -y

In [2]:
from datasets import load_dataset
from evaluate import load as load_metric
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer



# Data preparation

In [3]:
data = load_dataset('dair-ai/emotion', 'split')
idx2label = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
data['train'][0]

{'text': 'i didnt feel humiliated', 'label': 0}

# Setting Up Tokenizer, Model and PEFT Config

In [4]:
model_id = "microsoft/deberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=1, lora_alpha=1, lora_dropout=0.1)

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(idx2label))
model = get_peft_model(model, peft_config)

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
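
For intuition about what `r=1` and `lora_alpha=1` mean in the `LoraConfig` above: LoRA freezes a weight matrix of shape d_out x d_in and learns a low-rank update B A, scaled by `lora_alpha / r`, so each adapted matrix adds `r * (d_in + d_out)` trainable parameters. The sketch below is plain-Python arithmetic, not a PEFT call. The hidden size of 768 is an assumption about `deberta-base`, and the helper names are hypothetical.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


hidden = 768  # assumed hidden size of deberta-base
full = hidden * hidden
for rank in (1, 8, 16):
    added = lora_param_count(hidden, hidden, rank)
    print(f"r={rank:>2}: {added} LoRA params vs {full} frozen ({added / full:.2%})")

print("scaling for lora_alpha=1, r=1:", lora_scaling(1, 1))
```

With `r=1` the adapter is about as small as LoRA allows, so a larger rank is one of the first things to try if the model underfits.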


In [5]:
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_data = data.map(tokenize_function, batched=True)
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})
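
`padding="max_length"` in `tokenize_function` pads every example to the model's maximum input length, which is wasteful when the texts are as short as the sample shown earlier. A common alternative is to pad each batch only to its longest member (in `transformers` this is typically handled by `DataCollatorWithPadding`). The toy sketch below uses made-up token ids and hypothetical helper names to show the difference in padded tokens. It does not call the tokenizer, and the length of 512 is an assumption.

```python
from typing import List

PAD_ID = 0  # hypothetical padding id for this toy example


def pad_to_length(batch: List[List[int]], length: int) -> List[List[int]]:
    """Right-pad (and truncate) every sequence to exactly `length` tokens."""
    return [seq[:length] + [PAD_ID] * max(0, length - len(seq)) for seq in batch]


def pad_to_longest(batch: List[List[int]]) -> List[List[int]]:
    """Right-pad every sequence to the longest sequence in this batch."""
    return pad_to_length(batch, max(len(seq) for seq in batch))


# Three short "tokenized" texts with made-up ids.
batch = [[1, 7, 9, 2], [1, 5, 2], [1, 4, 8, 6, 3, 2]]

fixed = pad_to_length(batch, 512)  # analogous to padding="max_length" (512 is assumed)
dynamic = pad_to_longest(batch)    # analogous to per-batch padding

print(sum(len(s) for s in fixed), "tokens with fixed padding")
print(sum(len(s) for s in dynamic), "tokens with per-batch padding")
```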

# Setting Up Trainer

In [6]:
training_args = TrainingArguments(
    output_dir="./logs", 
    evaluation_strategy="epoch",          
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['validation'],
)

In [7]:
finetuned_model_id = "./deberta-finetuned"

trainer.train()

trainer.model.save_pretrained(finetuned_model_id)

You're using a DebertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.5802        | 1.581024        |
| 2     | 1.5782        | 1.579686        |
| 3     | 1.5661        | 1.568622        |
| 4     | 1.5591        | 1.557094        |
| 5     | 1.5563        | 1.551797        |
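
A useful reference point for these losses: with six labels, a classifier that assigns equal probability to every class has a cross-entropy of ln 6. The short calculation below (plain Python, relying only on the six labels in `idx2label`) computes that baseline so the logged values can be read against it.

```python
import math

num_labels = 6  # len(idx2label)

# Cross-entropy of a uniform prediction: -log(1 / num_labels) = log(num_labels)
uniform_loss = math.log(num_labels)
print(f"uniform-guess cross-entropy: {uniform_loss:.4f}")

# Validation losses copied from the training log above
val_losses = [1.581024, 1.579686, 1.568622, 1.557094, 1.551797]
for epoch, loss in enumerate(val_losses, start=1):
    print(f"epoch {epoch}: {uniform_loss - loss:.4f} below the uniform baseline")
```

After five epochs the validation loss is only about 0.24 below the uniform baseline, which suggests the adapter has learned very little about the task.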




# Merging the LoRA with the base model

In [8]:
pretrained_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(idx2label))
# Attach the saved LoRA adapter to the freshly loaded base model.
peft_model = PeftModel.from_pretrained(
    pretrained_model,
    finetuned_model_id,
)

model = peft_model.merge_and_unload()

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
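
Conceptually, merging folds the low-rank update into the frozen weight, W' = W + (lora_alpha / r) B A, so the merged layer produces the same output without the extra adapter matrices. The sketch below checks this on tiny hand-written matrices in plain Python. It illustrates the idea only and is not PEFT's implementation; all matrices and helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, lora_alpha = 1, 1          # same values as the LoraConfig above
scaling = lora_alpha / r

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen weight, 2 x 2 (toy values)
B = [[0.5], [-1.0]]           # LoRA B, 2 x r
A = [[2.0, 0.25]]             # LoRA A, r x 2
x = [[1.0], [2.0]]            # one input column vector

# Adapter kept separate: W x + scaling * B (A x)
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter merged into the weight: (W + scaling * B A) x
W_merged = add(W, matmul(B, A), scale=scaling)
merged = matmul(W_merged, x)

print(unmerged, merged)
```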


# Evaluation

In [9]:
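# Run the merged model on the held-out test split.
pred_output = Trainer(model=model).predict(tokenized_data['test'])

# The predicted class is the argmax of the logits (a softmax would not change it).
predictions = pred_output.predictions.argmax(axis=1)

# average=None returns one F1 score per class instead of a single aggregate.
f1_metric = load_metric('f1')
scores = f1_metric.compute(
    references=pred_output.label_ids,
    predictions=predictions,
    average=None
)
for label, score in zip(idx2label, scores['f1']):
    print(label, score)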

sadness 0.0
joy 0.5163447251114412
love 0.0
anger 0.0
fear 0.0
surprise 0.0
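
To make the per-class scores easier to interpret, the sketch below computes per-class F1 by hand on a toy example (made-up labels, hypothetical helper name). It shows that a classifier which predicts a single class for every input gets an F1 of 0 for every other class, the same pattern as in the output above.

```python
from typing import List


def per_class_f1(references: List[int], predictions: List[int], num_classes: int) -> List[float]:
    """F1 per class: 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(r == c and p == c for r, p in zip(references, predictions))
        fp = sum(r != c and p == c for r, p in zip(references, predictions))
        fn = sum(r == c and p != c for r, p in zip(references, predictions))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy data: six classes, and a "model" that always predicts class 1.
references = [0, 1, 1, 2, 3, 4, 5, 1]
predictions = [1] * len(references)

toy_scores = per_class_f1(references, predictions, num_classes=6)
print(toy_scores)
```

In the run above only `joy` has a nonzero score, which suggests the fine-tuned model predicts `joy` for almost every test example.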
