<a href="https://colab.research.google.com/github/Hamid-Rezaei/ParsBert-Fine-tuning-for-Text-Classification-Using-LoRA/blob/master/ParsBert_Fine_tuning_for_Text_Classification_Using_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
!pip install -q transformers datasets peft
!pip install -q evaluate

In [2]:
MAX_LEN = 512
parsbert_checkpoint = "HooshvareLab/bert-base-parsbert-uncased"

# Data preparation

## Data loading

In [3]:
from datasets import load_dataset, concatenate_datasets

# Load the train and test CSV files
dataset = load_dataset("csv", data_files={
    "train": "/content/Pars-OFF_levela_train.csv",
    "test": "/content/Pars-OFF_levela_test.csv"
})

# Convert labels
def convert_label_to_int(example):
    example['label'] = 1 if example['label'] == 'OFF' else 0
    return example

dataset['train'] = dataset['train'].map(convert_label_to_int)
dataset['test'] = dataset['test'].map(convert_label_to_int)

Now, let's hold out 10% of the training file as the test split and use the provided test file as the validation set:

In [4]:
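# Hold out 10% of the training file (train_test_split names the held-out part "test")
data = dataset['train'].train_test_split(train_size=0.9, seed=42)
# Use the provided test file as the validation set
data['val'] = dataset['test']
# Re-insert the held-out split so that it is listed last in the DatasetDict
data['test'] = data.pop("test")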

Here's an overview of the dataset:

In [5]:
data

DatasetDict({
    train: Dataset({
        features: ['tweet', 'label'],
        num_rows: 8555
    })
    val: Dataset({
        features: ['tweet', 'label'],
        num_rows: 1057
    })
    test: Dataset({
        features: ['tweet', 'label'],
        num_rows: 951
    })
})

Let's check the column types and look for missing values:

In [6]:
import pandas as pd

data['train'].to_pandas().info()
data['test'].to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8555 entries, 0 to 8554
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tweet   8555 non-null   object
 1   label   8555 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 133.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 951 entries, 0 to 950
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tweet   951 non-null    object
 1   label   951 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.0+ KB


Label distribution in the training dataset:

In [7]:
label_distribution = data['train'].to_pandas()['label'].value_counts()
print(label_distribution)

label
0    5970
1    2585
Name: count, dtype: int64


As the classes are not balanced (roughly 70% non-offensive versus 30% offensive tweets), we compute a weight for the positive and the negative class and use them in the loss calculation later:

In [8]:
train_df = data['train'].to_pandas()
label_counts = train_df['label'].value_counts()
pos_weights = len(train_df) / (2 * label_counts[1])
neg_weights = len(train_df) / (2 * label_counts[0])

The final weights are:

In [9]:
POS_WEIGHT, NEG_WEIGHT = (pos_weights, neg_weights)
print(pos_weights, neg_weights)

1.6547388781431334 0.716499162479062
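
The two values above follow the usual "balanced" weighting rule, weight_c = n_samples / (n_classes * count_c). As a minimal standalone sketch (the helper name `balanced_class_weights` is hypothetical, and the label list is rebuilt from the counts printed earlier rather than loaded from the dataset), the same numbers can be reproduced in plain Python:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for a list of labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Rebuild the label column from the counts reported above (5970 negatives, 2585 positives)
labels = [0] * 5970 + [1] * 2585
weights = balanced_class_weights(labels)
print(weights[1], weights[0])
```

The rarer class gets the larger weight, so each offensive tweet contributes more to the loss than a non-offensive one.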


Then, we compute the maximum length of the `tweet` column, in characters and in words:

In [10]:
# Number of Characters
max_char = data['train'].to_pandas()['tweet'].str.len().max()
# Number of Words
max_words = data['train'].to_pandas()['tweet'].str.split().str.len().max()

print("The maximum number of characters is", max_char)
print("The maximum number of words is", max_words)

The maximum number of characters is 371
The maximum number of words is 73


## Data Processing

In [11]:
data['train'][0]

{'tweet': '@USER هاهاها یه کوچولو', 'label': 0}

## ParsBert:

#### Load the tokenizer:

In [12]:
from transformers import AutoTokenizer
parsbert_tokenizer = AutoTokenizer.from_pretrained(parsbert_checkpoint, add_prefix_space=True)

#### Define the preprocessing function for converting one row of the dataframe:

In [13]:
def parsbert_preprocessing_function(examples):
    return parsbert_tokenizer(examples['tweet'], truncation=True, max_length=MAX_LEN)

By applying the preprocessing function to the first example of our training dataset, we get the token IDs (input_ids), the segment IDs (token_type_ids) and the attention mask:

In [14]:
parsbert_preprocessing_function(data['train'][0])

{'input_ids': [2, 23, 57188, 78278, 3634, 5719, 20366, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

#### Now, let's apply the preprocessing function to the entire dataset:

In [15]:
col_to_delete = ['tweet']
# Apply the preprocessing function and remove the undesired columns
parsbert_tokenized_datasets = data.map(parsbert_preprocessing_function, batched=True, remove_columns=col_to_delete)
# Set to torch format
parsbert_tokenized_datasets.set_format("torch")

We can have a look into our tokenized training dataset:

In [16]:
parsbert_tokenized_datasets['train'][0]

{'label': tensor(0),
 'input_ids': tensor([    2,    23, 57188, 78278,  3634,  5719, 20366,     4]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1])}

#### For generating the training batches, we also need to pad the rows of a given batch to the maximum length found in the batch. For that, we will use the DataCollatorWithPadding class:

In [17]:
# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding
parsbert_data_collator = DataCollatorWithPadding(tokenizer=parsbert_tokenizer)
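
To make the idea of padding to the longest row in the batch concrete, here is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation: the helper `pad_batch` is hypothetical, the padding ID `0` is an assumption about the tokenizer's `[PAD]` token, and the second sequence is made up for illustration. The real collator also pads `token_type_ids` and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [2, 23, 57188, 78278, 3634, 5719, 20366, 4],  # first training example shown above
    [2, 3634, 4],                                  # shorter, made-up example
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Because padding is decided per batch, short tweets are never padded to `MAX_LEN`, which keeps batches small.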

# Model

## Load ParsBert Checkpoints for the Classification Task
We load the pre-trained ParsBert model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:

In [18]:
from transformers import AutoModelForSequenceClassification
parsbert_model = AutoModelForSequenceClassification.from_pretrained(parsbert_checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## LoRA setup for ParsBert classifier
We import the LoRA configuration and set some parameters for the ParsBert classifier:

*   task_type: Sequence classification
*   r (rank): Rank of the decomposition matrices
*   lora_alpha: Alpha parameter that scales the learned update by lora_alpha / r. The LoRA paper fixes alpha rather than tuning it; here it is set to 16
*   lora_dropout: Dropout probability of the LoRA layers
*   bias: Which bias parameters to train ("none", "all" or "lora_only")

In [29]:
from peft import get_peft_model, LoraConfig, TaskType
parsbert_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=32, lora_alpha=16, lora_dropout=0.1, bias="none",
)
parsbert_model = get_peft_model(parsbert_model, parsbert_peft_config)
parsbert_model.print_trainable_parameters()

trainable params: 1,181,186 || all params: 164,024,068 || trainable%: 0.7201
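
The trainable-parameter count can be checked by hand. The sketch below is a back-of-the-envelope calculation, not something PEFT reports: it assumes LoRA is attached to the query and value projections of each of the 12 encoder layers of a BERT-base model with hidden size 768, and that the classification head stays trainable. These assumptions are not stated in the notebook, but they reproduce the printed figure.

```python
hidden_size = 768       # assumption: BERT-base hidden size
num_layers = 12         # assumption: BERT-base encoder depth
adapted_per_layer = 2   # assumption: LoRA on the query and value projections
rank = 32               # r in the LoraConfig above
lora_alpha = 16         # lora_alpha in the LoraConfig above
num_labels = 2

# Each adapted projection gets two matrices: A (rank x hidden) and B (hidden x rank)
lora_params = num_layers * adapted_per_layer * 2 * rank * hidden_size
# Classification head: weight (num_labels x hidden) plus bias (num_labels)
classifier_params = num_labels * hidden_size + num_labels
trainable = lora_params + classifier_params

all_params = 164_024_068  # taken from the output above
trainable_percent = round(100 * trainable / all_params, 4)
scaling = lora_alpha / rank  # factor applied to the LoRA update

print(trainable, trainable_percent, scaling)
```

With r=32 and lora_alpha=16, the update is scaled by 0.5; raising r without raising lora_alpha shrinks that factor further.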


At this point, we have defined the tokenized dataset for training as well as the model with its LoRA layers. The following section shows how to launch training using the Hugging Face Trainer class.

# Set up the trainer

## Evaluation Metrics
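First, we define the performance metrics we will use to evaluate the model: precision, recall, F1 score and accuracy. Precision and recall are computed for the positive (offensive) class, while the F1 score is macro-averaged over both classes: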

In [20]:
import evaluate
import numpy as np

# All metrics are already predefined in the HF `evaluate` package; load them once
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # eval_pred is the tuple of predictions and labels provided by the Trainer
    predictions = np.argmax(logits, axis=-1)
    # Precision and recall use the default binary averaging (positive class = 1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    # The F1 score is macro-averaged over both classes
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer expects a dictionary where the keys are the metric names and the values are the scores.
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
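
Since precision and recall are reported for the positive class only while the F1 score is macro-averaged, the reported F1 can be higher than both precision and recall. The following pure-Python sketch illustrates this on a tiny made-up example; the helper `precision_recall_f1` is hypothetical and is not part of the `evaluate` package.

```python
def precision_recall_f1(predictions, references, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 2 offensive (1) and 4 non-offensive (0) examples
references = [1, 1, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 1]

# Binary precision and recall for the positive class
precision, recall, f1_positive = precision_recall_f1(predictions, references, positive=1)
# Macro F1 averages the per-class F1 scores
_, _, f1_negative = precision_recall_f1(predictions, references, positive=0)
macro_f1 = (f1_positive + f1_negative) / 2
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)

print(precision, recall, macro_f1, accuracy)
```

Here the majority class is predicted well, so its F1 pulls the macro average above the positive-class precision and recall.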

## Custom Trainer for Weighted Loss

As we saw earlier, the positive and negative classes are **imbalanced**, so we train the model with a **weighted cross-entropy loss** to account for that. By default, the Trainer class takes the loss directly from the model's outputs, and that loss is unweighted.

So, we need to define our **custom WeightedCELossTrainer** that **overrides the compute_loss** method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels:

In [21]:
from transformers import Trainer
import torch

class WeightedCELossTrainer(Trainer):
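    # `num_items_in_batch` is passed by recent transformers releases and is not used here
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):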
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
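
To show what the weighted loss computes, here is a small pure-Python sketch. It assumes PyTorch's documented behaviour for `CrossEntropyLoss` with a `weight` tensor and the default `reduction="mean"`: each example's loss is multiplied by the weight of its target class, and the sum is divided by the sum of those weights. The function `weighted_cross_entropy`, the logits and the rounded weights are illustrative only.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy losses."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_sum_exp - row[label]  # negative log-probability of the true class
        total += class_weights[label] * nll
        weight_sum += class_weights[label]
    return total / weight_sum

# Both examples are predicted as class 0; the second one is actually class 1
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]

unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [0.7165, 1.6547])  # [neg, pos], rounded
print(unweighted, weighted)
```

The missed positive example dominates the weighted loss, which is the effect the custom trainer relies on to counter the class imbalance.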

### Trainer Setup

#### ParsBert
The first step is to move the model to the GPU for training.

In [22]:
parsbert_model = parsbert_model.cuda()
print(parsbert_model.device)

cuda:0


Then, we set the training arguments:

In [30]:
from transformers import TrainingArguments

lr = 1e-4
batch_size = 32
num_epochs = 5

training_args = TrainingArguments(
    output_dir="parsbert-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type="constant",
    warmup_ratio=0.1,  # has no effect with the "constant" scheduler; "constant_with_warmup" would apply it
    max_grad_norm=0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)



Finally, we define the ParsBert trainer by providing the model, the training arguments and the tokenized datasets:

In [31]:
parsbert_trainer = WeightedCELossTrainer(
    model=parsbert_model,
    args=training_args,
    train_dataset=parsbert_tokenized_datasets['train'],
    eval_dataset=parsbert_tokenized_datasets["val"],
    data_collator=parsbert_data_collator,
    compute_metrics=compute_metrics
)

## Train

In [32]:
parsbert_trainer.train()

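| Epoch | Training Loss | Validation Loss | Precision | Recall | F1-score | Accuracy |
|-------|---------------|-----------------|-----------|--------|----------|----------|
| 1 | No log | 0.362003 | 0.765625 | 0.77044 | 0.833877 | 0.859981 |
| 2 | 0.455900 | 0.304721 | 0.73913 | 0.855346 | 0.846782 | 0.865658 |
| 3 | 0.455900 | 0.291622 | 0.73107 | 0.880503 | 0.849536 | 0.866604 |
| 4 | 0.321000 | 0.278727 | 0.776204 | 0.861635 | 0.865726 | 0.883633 |
| 5 | 0.321000 | 0.284059 | 0.762295 | 0.877358 | 0.863839 | 0.880795 |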



TrainOutput(global_step=1340, training_loss=0.36241759115190647, metrics={'train_runtime': 612.8496, 'train_samples_per_second': 69.797, 'train_steps_per_second': 2.187, 'total_flos': 1734744496549008.0, 'train_loss': 0.36241759115190647, 'epoch': 5.0})
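
The `global_step` value and the pattern in the Training Loss column can be explained with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the Trainer's default of logging the training loss every 500 steps (`logging_steps` is not set in the arguments above).

```python
import math

train_rows = 8555      # size of the training split
batch_size = 32
num_epochs = 5
logging_steps = 500    # assumption: Trainer default, not set explicitly above

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epoch (1-based) during which each logging step occurs
log_epochs = [
    math.ceil(step / steps_per_epoch)
    for step in range(logging_steps, total_steps + 1, logging_steps)
]

print(steps_per_epoch, total_steps, log_epochs)
```

Under these assumptions the loss is logged only during epochs 2 and 4, which would explain why epoch 1 shows "No log" and why the same training loss is repeated for epochs 2 and 3 and again for epochs 4 and 5.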

## Evaluate
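After training, we evaluate our model on the validation set. The output below was recorded in an earlier run (cell 26), so its scores differ from those in the training table above (cell 32).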

In [26]:
parsbert_trainer.evaluate()

{'eval_loss': 0.48304060101509094,
 'eval_precision': 0.5789473684210527,
 'eval_recall': 0.7610062893081762,
 'eval_f1-score': 0.7373674828043163,
 'eval_accuracy': 0.7615894039735099,
 'eval_runtime': 5.5184,
 'eval_samples_per_second': 191.54,
 'eval_steps_per_second': 6.161,
 'epoch': 5.0}

In [27]:
! huggingface-cli login



In [28]:
parsbert_model.push_to_hub("Persian-Offensive-Language-Detection-Lora")
parsbert_tokenizer.push_to_hub("Persian-Offensive-Language-Detection-Lora")

CommitInfo(commit_url='https://huggingface.co/HamidRezaei/Persian-Offensive-Language-Detection-Lora/commit/22cdb1e8742e9c6202b9bce9f66b259dc31b35c8', commit_message='Upload tokenizer', commit_description='', oid='22cdb1e8742e9c6202b9bce9f66b259dc31b35c8', pr_url=None, pr_revision=None, pr_num=None)