<a href="https://www.kaggle.com/code/aisuko/token-classification-using-lora?scriptVersionId=165126213" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Low-Rank Adaptation (LoRA) is a reparametrization method that reduces the number of trainable parameters by using low-rank representations. Instead of updating a full weight matrix, LoRA learns the update as the product of two much smaller low-rank matrices, while all the pretrained model parameters remain frozen. After training, the low-rank product can be added back to the original weights. This makes a LoRA model cheaper to train and store, because there are significantly fewer trainable parameters. Let's fine-tune a [roberta-large](https://huggingface.co/roberta-large) model with LoRA on the [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004) dataset for token classification.
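To make the idea concrete, here is a minimal pure-Python sketch with toy sizes and made-up numbers. It assumes the usual LoRA formulation, in which the frozen weight `W` is corrected by a scaled product `B @ A` of two thin matrices; the helpers `matmul`, `add`, and `scale` are illustrative and not part of any library.

```python
# Toy LoRA update: W stays frozen, only the thin matrices A and B are trained.
# All numbers are made up for illustration.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def scale(x, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[a * s for a in row] for row in x]

d_out, d_in, r = 4, 3, 1                            # in practice r is far smaller than d_out, d_in
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]    # frozen weight, d_out x d_in
B = [[1], [0], [2], [1]]                            # trainable, d_out x r
A = [[1, 2, 0]]                                     # trainable, r x d_in
scaling = 2.0                                       # stands in for lora_alpha / r

x = [[1], [1], [1]]                                 # one input column vector

# During training: base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# After training: fold the update into the weight once, then use a plain matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

full_params = d_out * d_in          # parameters in the full matrix
lora_params = r * (d_out + d_in)    # parameters in A and B together
print(y_adapter == y_merged, full_params, lora_params)
```

With these toy sizes the saving is small (7 versus 12 parameters), but it grows quickly once the matrices are 1024 wide and `r` stays at 16.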

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install evaluate==0.4.1
!pip install datasets==2.15.0
!pip install peft==0.7.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning roberta-large-with-LoRA"
os.environ["WANDB_NOTES"] = "Fine tune model with low rank approximation"
os.environ["WANDB_NAME"] = "ft-roberta-large-on-bionlp2004-lora"
os.environ["MODEL_NAME"] = "roberta-large"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `roberta-large` from `transformers`...
config.json: 100%|█████████████████████████████| 482/482 [00:00<00:00, 3.15MB/s]
┌────────────────────────────────────────────────────┐
│      Memory Usage for loading `roberta-large`      │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│  198.38 MB  │ 1.32 GB  │       5.3 GB      │
│float16│   99.19 MB  │ 677.8 MB │      2.65 GB      │
│  int8 │   49.59 MB  │ 338.9 MB │      1.32 GB      │
│  int4 │   24.8 MB   │169.45 MB │      677.8 MB     │
└───────┴─────────────┴──────────┴───────────────────┘
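The figures in this table can be approximated by hand. The sketch below is a rough reconstruction, not the tool's actual code: it assumes the parameter count printed later in this notebook for the token-classification model (the CLI may load a slightly different head, so small differences are expected) and it assumes that "Training using Adam" is about four times the model size, which is what the table's own ratios suggest.

```python
# Rough reconstruction of the memory table from a parameter count.
# ASSUMPTIONS: N_PARAMS is the count printed later in this notebook, and
# training with Adam is taken to cost about 4x the model size.

N_PARAMS = 354_321_419
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}
GIB = 1024 ** 3

def estimate(dtype):
    """Return (model size, Adam training size) in GiB for the given dtype."""
    size = N_PARAMS * BYTES_PER_PARAM[dtype] / GIB
    return size, 4 * size

for dtype in BYTES_PER_PARAM:
    size, training = estimate(dtype)
    print(f"{dtype:8s} model: {size:5.2f} GiB   training with Adam: {training:5.2f} GiB")
```

Under these assumptions the float32 row comes out at roughly 1.32 GiB for the weights and about 5.3 GiB for training, in line with the table.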


# Loading the Datasets

In [4]:
from datasets import load_dataset

dataset=load_dataset("tner/bionlp2004", split="train[:5000]")
dataset=dataset.train_test_split(test_size=0.2)
dataset


DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 1000
    })
})

In [5]:
dataset["train"][0]

{'tokens': ['Here',
  'we',
  'report',
  'the',
  'characterization',
  'of',
  'a',
  'nuclear',
  'complex',
  'from',
  'human',
  'monocytic',
  'cells',
  'that',
  'bound',
  'to',
  'a',
  'kappa',
  'B-like',
  'site',
  ',',
  "5'-CGGAGTTTCC-3",
  "'",
  ',',
  'in',
  'the',
  "5'-flanking",
  'region',
  'of',
  'the',
  'human',
  'TF',
  'gene',
  '.'],
 'tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  4,
  0,
  0,
  5,
  6,
  0,
  0,
  0,
  0,
  1,
  2,
  2,
  0,
  1,
  2,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  1,
  2,
  2,
  0]}
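The integer tags are easier to read once they are mapped to label names. The sketch below copies a slice of the sample above and uses the same `id2label` mapping that is defined in the Training section; nothing else is assumed.

```python
# Label names copied from the id2label mapping defined in the Training section.
id2label = {
    0: "O", 1: "B-DNA", 2: "I-DNA", 3: "B-protein", 4: "I-protein",
    5: "B-cell_type", 6: "I-cell_type", 7: "B-cell_line", 8: "I-cell_line",
    9: "B-RNA", 10: "I-RNA",
}

# Words 7 to 19 of the sample shown above, with their tags.
tokens = ["nuclear", "complex", "from", "human", "monocytic", "cells", "that",
          "bound", "to", "a", "kappa", "B-like", "site"]
tags = [3, 4, 0, 0, 5, 6, 0, 0, 0, 0, 1, 2, 2]

decoded = [(token, id2label[tag]) for token, tag in zip(tokens, tags)]
for token, label in decoded:
    if label != "O":
        print(f"{token:10s} {label}")
```

Here "nuclear complex" is tagged as a protein, "monocytic cells" as a cell type, and "kappa B-like site" as DNA, following the B-/I- (begin/inside) scheme.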

# Preprocess dataset

* Tokenization
* Applying the tokenization to the entire dataset

In [6]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"), add_prefix_space=True)


In [7]:
def tokenize_and_align_labels(examples):
    tokenized_inputs=tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels=[]
    for i, label in enumerate(examples["tags"]):
        word_ids=tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx=None
        label_ids=[]
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx!=previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx=word_idx
        labels.append(label_ids)
    
    tokenized_inputs["labels"]=labels
    return tokenized_inputs

tokenized_dataset=dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
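The alignment rule in `tokenize_and_align_labels` can be tried in isolation. The sketch below reimplements only the inner loop in plain Python; the `word_ids` list is a hypothetical sub-token split made up for illustration, not real tokenizer output.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special tokens and continuation sub-tokens are ignored by the loss.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned

# Hypothetical split: <s>, word 0 in 2 pieces, word 1 in 3 pieces, word 2, </s>
word_ids = [None, 0, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 2]          # B-DNA, I-DNA, I-DNA
print(align_labels(word_ids, word_labels))
```

Only one position per word carries a real label, so the loss is computed once per word no matter how many sub-tokens the tokenizer produces.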

In [8]:
from transformers import DataCollatorForTokenClassification

data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
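The collator's job is to pad every example in a batch to the same length. The sketch below mimics that behavior in plain Python under a few assumptions: right padding, a pad token id of 1 (matching the `padding_idx=1` of the RoBERTa embeddings), and -100 as the label padding value. The function `pad_batch` and the token ids are made up for illustration; the real collator returns tensors.

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded label positions use -100 so the loss skips them.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two made-up examples of different lengths.
features = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [-100, 3, 4, 0, -100]},
    {"input_ids": [0, 21, 2], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1], batch["labels"][1])
```

Padding per batch rather than to a global maximum keeps batches of short sentences small.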

# Training

In [9]:
from transformers import AutoModelForTokenClassification


id2label = {
    0: "O",
    1: "B-DNA",
    2: "I-DNA",
    3: "B-protein",
    4: "I-protein",
    5: "B-cell_type",
    6: "I-cell_type",
    7: "B-cell_line",
    8: "I-cell_line",
    9: "B-RNA",
    10: "I-RNA",
}

label2id = {
    "O": 0,
    "B-DNA": 1,
    "I-DNA": 2,
    "B-protein": 3,
    "I-protein": 4,
    "B-cell_type": 5,
    "I-cell_type": 6,
    "B-cell_line": 7,
    "I-cell_line": 8,
    "B-RNA": 9,
    "I-RNA": 10,
}


def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")


model = AutoModelForTokenClassification.from_pretrained(
    os.getenv("MODEL_NAME"), num_labels=11, id2label=id2label, label2id=label2id
)

print_trainable_parameters(model)
print(model)

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 354321419 || all params: 354321419 || trainable%: 100.00
RobertaForTokenClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-23): 24 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
       

The LoRA update is scaled by **lora_alpha/r**, so a higher **lora_alpha** value assigns more weight to the LoRA activations. For performance, try setting **bias** to **none** first, then **lora_only**, before trying **all**. The configuration below uses **lora_only**.
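As a quick check of the scaling rule, the sketch below computes the factor for the values used in this notebook (`r=16`, `lora_alpha=32`) and for a few alternatives. It assumes the plain `lora_alpha / r` scaling described above; the helper name `lora_scaling` is illustrative.

```python
def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the base output."""
    return lora_alpha / r

print("this notebook:", lora_scaling(lora_alpha=32, r=16))

# Keeping r fixed, the factor grows linearly with lora_alpha.
for alpha in (8, 16, 32, 64):
    print(f"r=16, lora_alpha={alpha:2d} -> scaling {lora_scaling(alpha, 16)}")
```

Because the factor depends on the ratio, doubling `r` while keeping `lora_alpha` fixed halves the weight given to the adapter.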

In [10]:
from peft import LoraConfig, TaskType, get_peft_model

peft_config=LoraConfig(
    # https://github.com/huggingface/peft/blob/v0.7.1/src/peft/utils/peft_types.py#L38
    task_type=TaskType.TOKEN_CLS,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="lora_only",
    target_modules=["query","key", "value", "dense"]
)


peft_model=get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 7,089,163 || all params: 361,410,582 || trainable%: 1.9615261293040944
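The trainable-parameter count can be reproduced by hand. The sketch below is one reading of where the number comes from, resting on assumptions that are not visible in the truncated model printout: that each of the 24 layers has an intermediate size of 4096, that `"dense"` matches the attention output, intermediate, and output projections, and that the classifier head is kept as a separate trainable copy for token classification.

```python
# ASSUMED roberta-large dimensions; the intermediate size is not shown in the
# truncated printout above.
HIDDEN, INTERMEDIATE, LAYERS = 1024, 4096, 24
R, NUM_LABELS = 16, 11
BASE_PARAMS = 354_321_419            # printed before LoRA was applied

def lora_params(d_in, d_out, r=R):
    """Parameters in the A (r x d_in) and B (d_out x r) matrices of one layer."""
    return r * (d_in + d_out)

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)          # query, key, value, attention output dense
    + lora_params(HIDDEN, INTERMEDIATE)      # intermediate dense
    + lora_params(INTERMEDIATE, HIDDEN)      # output dense
)
lora_total = LAYERS * per_layer
classifier = HIDDEN * NUM_LABELS + NUM_LABELS    # weight plus bias of the head

trainable = lora_total + classifier
all_params = BASE_PARAMS + trainable
print(f"trainable params: {trainable:,} || all params: {all_params:,} "
      f"|| trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the totals match the printed output exactly. Note that the sum works out without counting any bias terms as trainable; the sketch cannot explain why, it only shows that the arithmetic is consistent.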


In [11]:
from transformers import TrainingArguments, Trainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # Kept small to minimize run time while testing.
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME")
)

trainer=Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.2
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240302_121119-wyt3og33[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-roberta-large-on-bionlp2004-lora[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20roberta-large-with-LoRA[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tuning%20roberta-large-with-LoRA/runs/wyt3og33[0m
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than 

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.21645         |
| 2     | No log        | 0.183505        |




TrainOutput(global_step=250, training_loss=0.28370635986328124, metrics={'train_runtime': 329.4909, 'train_samples_per_second': 24.28, 'train_steps_per_second': 0.759, 'total_flos': 1325134270804224.0, 'train_loss': 0.28370635986328124, 'epoch': 2.0})
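The step count and the "No log" entries can be explained with a little arithmetic. The sketch below assumes two devices, which is an inference from the reported throughput (about 8,000 samples processed in 250 steps, or 32 per step) rather than something the notebook states, and it assumes the `Trainer` logs the training loss every 500 steps by default. The function `total_steps` is illustrative.

```python
import math

def total_steps(n_examples, per_device_batch, n_devices, epochs, grad_accum=1):
    """Optimizer steps for a run: batches per epoch times number of epochs."""
    per_epoch = math.ceil(n_examples / (per_device_batch * n_devices * grad_accum))
    return per_epoch * epochs

ASSUMED_DEVICES = 2              # inferred from throughput, not stated in the notebook
ASSUMED_LOGGING_STEPS = 500      # assumed default logging interval

steps = total_steps(4000, per_device_batch=16, n_devices=ASSUMED_DEVICES, epochs=2)
print("total steps:", steps)
print("training loss logged at least once:", steps >= ASSUMED_LOGGING_STEPS)
```

If that reading is right, passing a smaller `logging_steps` value to `TrainingArguments` would make the training loss appear in the table.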

In [12]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': os.getenv('MODEL_NAME'),
    'tasks': 'Token Classification',
#     'dataset_tags':'',
    'dataset':"tner/bionlp2004"
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/28.4M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.35k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-roberta-large-on-bionlp2004-lora/commit/72a0571471ddcd427e0d04ab2447a57dc6bd34d3', commit_message='End of training', commit_description='', oid='72a0571471ddcd427e0d04ab2447a57dc6bd34d3', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [13]:
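import gc
import torch

# peft_model wraps model, so delete it too; otherwise the weights stay referenced.
del trainer, tokenizer, model, peft_model
gc.collect()
torch.cuda.empty_cache()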

In [14]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import PeftConfig, PeftModel

peft_model_name="aisuko/"+os.getenv("WANDB_NAME")

peft_config=PeftConfig.from_pretrained(peft_model_name)
base_model=AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=11,
    id2label=id2label,
    label2id=label2id
)

tokenizer=AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)
peft_model=PeftModel.from_pretrained(base_model, peft_model_name, device_map="auto")

adapter_config.json:   0%|          | 0.00/592 [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


adapter_model.safetensors:   0%|          | 0.00/28.4M [00:00<?, ?B/s]

In [15]:
text="The activation of IL-2 gene expression and NF-kappa B through CD28 requires reactive oxygen production by 5-lipoxygenase."
inputs=tokenizer(text, return_tensors="pt")
# inputs=inputs.to("cuda")
inputs

{'input_ids': tensor([[    0,   133, 29997,     9, 11935,    12,   176, 10596,  8151,     8,
         33861,    12,   330, 22181,   163,   149,  7522,  2517,  3441, 34729,
         11747,   931,    30,   195,    12, 33330, 25456,  4138,  3175,     4,
             2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]])}

In [16]:
import torch

peft_model.eval()  # make sure dropout is disabled for inference

with torch.no_grad():
    logits=peft_model(**inputs).logits

tokens=inputs.tokens()

# argmax over the last dimension picks the highest-scoring label id for each token.
predictions=torch.argmax(logits, dim=2)

for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, peft_model.config.id2label[prediction]))

('<s>', 'O')
('The', 'O')
('Ġactivation', 'O')
('Ġof', 'O')
('ĠIL', 'B-DNA')
('-', 'O')
('2', 'I-DNA')
('Ġgene', 'I-DNA')
('Ġexpression', 'O')
('Ġand', 'O')
('ĠNF', 'B-protein')
('-', 'O')
('k', 'I-protein')
('appa', 'I-protein')
('ĠB', 'I-protein')
('Ġthrough', 'O')
('ĠCD', 'B-protein')
('28', 'I-protein')
('Ġrequires', 'O')
('Ġreactive', 'O')
('Ġoxygen', 'O')
('Ġproduction', 'O')
('Ġby', 'O')
('Ġ5', 'B-protein')
('-', 'O')
('lip', 'I-protein')
('oxy', 'I-protein')
('gen', 'I-protein')
('ase', 'I-protein')
('.', 'O')
('</s>', 'O')
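The per-token output is hard to read because words are split into sub-tokens, and only the first sub-token of each word was labeled during training. A minimal post-processing sketch follows. It assumes that a leading `Ġ` marks the start of a new word and that the label of a word's first sub-token is the one to trust; `group_entities` is an illustrative helper, and `pairs` is a subset of the output printed above.

```python
def group_entities(pairs, special=("<s>", "</s>")):
    """Merge (sub-token, BIO tag) pairs into (entity_type, text) spans."""
    words = []                       # each item: [word_text, tag_of_first_sub_token]
    for token, tag in pairs:
        if token in special:
            continue
        if token.startswith("Ġ") or not words:
            words.append([token.lstrip("Ġ"), tag])
        else:
            words[-1][0] += token    # continuation piece: extend the word, keep its tag

    entities = []
    current = None                   # [entity_type, list_of_words] while inside a span
    for text, tag in words:
        starts_span = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[0] != tag[2:])
        )
        if starts_span:
            current = [tag[2:], [text]]
            entities.append(current)
        elif tag.startswith("I-"):
            current[1].append(text)
        else:
            current = None
    return [(etype, " ".join(ws)) for etype, ws in entities]

# Subset of the (token, label) pairs printed above.
pairs = [
    ("<s>", "O"), ("The", "O"), ("Ġactivation", "O"), ("Ġof", "O"),
    ("ĠIL", "B-DNA"), ("-", "O"), ("2", "I-DNA"), ("Ġgene", "I-DNA"),
    ("Ġexpression", "O"), ("Ġand", "O"), ("ĠNF", "B-protein"), ("-", "O"),
    ("k", "I-protein"), ("appa", "I-protein"), ("ĠB", "I-protein"),
    ("Ġthrough", "O"), ("ĠCD", "B-protein"), ("28", "I-protein"),
    ("Ġrequires", "O"),
]
print(group_entities(pairs))
```

One limitation of this word-boundary rule is that punctuation without a leading `Ġ`, such as a sentence-final period, is attached to the preceding word, so a production version would more likely rely on the tokenizer's word or offset mappings.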
