Let's do **task-specific** fine-tuning by adapting the Llama 3.2 1B model for sentiment classification.
* Dataset: IMDB
* Model : Llama 3.2 1B
* PEFT: LoRA (Low Rank Adaptation)
* Quantization: 8bit (cast to float-16 during training)

We only need to modify a few lines of code.

In [1]:
import warnings
from pprint import pprint
import math
import wandb

#hf
import datasets
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
from transformers import DataCollatorWithPadding
from transformers import LlamaConfig, LlamaForCausalLM,LlamaForSequenceClassification
from transformers import TrainingArguments, Trainer

Upgrade the transformers, PEFT, and accelerate packages to the versions printed below.

In [2]:
import transformers, peft, accelerate
print(transformers.__version__)
print(peft.__version__)
print(accelerate.__version__)

4.45.2
0.13.2
1.0.1


## Dataset

In [3]:
ds = load_dataset('stanfordnlp/imdb')
_ = ds.pop('unsupervised')

In [3]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


In [4]:
ds["train"].features['label']

ClassLabel(names=['neg', 'pos'], id=None)

## Tokenizer

In [4]:
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id,model_max_length=1024)
# set pad token id
tokenizer.pad_token=tokenizer.eos_token

In [5]:
def tokenize(example):
    example = tokenizer(example['text'],padding=False,truncation=True)
    return example

In [6]:
tokenized_ds = ds.map(tokenize,batched=True,num_proc=12, remove_columns=['text'])
print(tokenized_ds)

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
})


In [7]:
ds_split = tokenized_ds['train'].train_test_split(test_size=0.1,seed=42)
print(ds_split)

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 22500
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 2500
    })
})


## Data Collator

In [5]:
# dataloader
data_collator = DataCollatorWithPadding(tokenizer,padding=True)
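
Because the tokenizer was called with `padding=False`, the sequences still have different lengths, and the collator pads each batch only up to its longest member (dynamic padding). The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name `pad_batch`, the toy token ids, and the use of right padding are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id):
    """Right-pad variable-length token-id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # real tokens keep mask 1, padding positions get mask 0
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# toy batch with hypothetical token ids; 0 stands in for the pad token id
batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to `model_max_length` keeps short reviews from being stretched to 1024 tokens.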

## Inference on Model with Random Initialization of weights in the classification head

We expect the model to perform poorly because the classification head is randomly initialized, even though the representation from the underlying model might be good enough.

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(model_id,num_labels=2,
                                                           pad_token_id=tokenizer.eos_token_id,)

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Llama-3.2-1B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
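
The `pad_token_id` argument matters because a decoder-only classifier scores one position per sequence, normally the last non-padding token. The sketch below is a simplified illustration of that selection, assuming right-padded inputs. The helper `last_token_index` is hypothetical and only approximates what the library does internally.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last non-padding token in a right-padded sequence."""
    if pad_token_id in input_ids:
        first_pad = input_ids.index(pad_token_id)
    else:
        first_pad = 0  # no padding found: the modulo below wraps to the last token
    return (first_pad - 1) % len(input_ids)


PAD = 128001  # padding_idx shown in the embedding layer of the printed model
padded = [128000, 10, 11, PAD, PAD]  # hypothetical token ids
unpadded = [128000, 10, 11]

print(last_token_index(padded, PAD))
print(last_token_index(unpadded, PAD))
```

Both calls select index 2, the last real token, whose hidden state is passed to the `score` layer.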


In [21]:
print(model)

LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128001)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e

In [16]:
model.config.id2label = {0:"NEGATIVE",1:"POSITIVE"}

In [17]:
from transformers import TextClassificationPipeline
classifier = TextClassificationPipeline(model=model,
                                       tokenizer=tokenizer,
                                       framework='pt',
                                       task="sentiment-analysis",
                                       device = "cuda"
                                       )

In [18]:
text = "The movie is good."
prediction = classifier(text)
print(prediction)

[{'label': 'POSITIVE', 'score': 0.982572078704834}]


In [19]:
text = "The movie is really bad..nothing new to hook us"
prediction = classifier(text)
print(prediction)

[{'label': 'POSITIVE', 'score': 0.997260332107544}]


In [20]:
text = "Very bad movie with no good story"
prediction = classifier(text)
print(prediction)

[{'label': 'POSITIVE', 'score': 0.971861720085144}]
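
The `score` values above are probabilities: for a two-label model the pipeline applies a softmax to the two logits and reports the larger one. A minimal sketch of that step follows. The logits are made up for illustration and are not values produced by the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.5, 2.5]  # hypothetical head outputs for one review

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

With a randomly initialized head the logits are arbitrary, so a high score here reflects the random weights rather than the sentiment of the text.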


Let us fine-tune the classification head, treating the rest of the model as a feature extractor.

## Fine-tune the classification head

Let's freeze the parameters of all layers except the last layer!

In [30]:
for name, param in model.named_parameters():
    # keep only the classification head trainable
    if name != "score.weight":
        param.requires_grad = False
    print(name, param.requires_grad)

model.embed_tokens.weight False
model.layers.0.self_attn.q_proj.weight False
model.layers.0.self_attn.k_proj.weight False
model.layers.0.self_attn.v_proj.weight False
model.layers.0.self_attn.o_proj.weight False
model.layers.0.mlp.gate_proj.weight False
model.layers.0.mlp.up_proj.weight False
model.layers.0.mlp.down_proj.weight False
model.layers.0.input_layernorm.weight False
model.layers.0.post_attention_layernorm.weight False
model.layers.1.self_attn.q_proj.weight False
model.layers.1.self_attn.k_proj.weight False
model.layers.1.self_attn.v_proj.weight False
model.layers.1.self_attn.o_proj.weight False
model.layers.1.mlp.gate_proj.weight False
model.layers.1.mlp.up_proj.weight False
model.layers.1.mlp.down_proj.weight False
model.layers.1.input_layernorm.weight False
model.layers.1.post_attention_layernorm.weight False
model.layers.2.self_attn.q_proj.weight False
model.layers.2.self_attn.k_proj.weight False
model.layers.2.self_attn.v_proj.weight False
model.layers.2.self_attn.o_proj

Count the number of trainable parameters

In [33]:
num_parameters = 0
for param in model.parameters():   
    if param.requires_grad:
        num_parameters += param.numel()
print(f'Number of Parameters:{num_parameters}')

Number of Parameters:4096
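The count of 4096 can be checked by hand: the only trainable tensor is `score.weight`, a bias-free linear layer mapping the 2048-dimensional hidden state to 2 labels. A short arithmetic sketch, using the sizes from the printed model:

```python
hidden_size = 2048   # width of the hidden state (from the printed model)
num_labels = 2       # neg / pos
vocab_size = 128256  # rows of embed_tokens (from the printed model)

head_params = hidden_size * num_labels           # score.weight, no bias
embedding_params = vocab_size * hidden_size      # frozen, shown for scale

print(f"classification head: {head_params}")     # 4096
print(f"embedding matrix:    {embedding_params}")
```

The trainable head is tiny next to even a single frozen component such as the embedding matrix.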


Let us load evaluation metrics (accuracy in this case)

In [34]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
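
To make the metric concrete, here is a dependency-free sketch of the same computation: take the argmax of each row of logits and compare it with the label. The helper names and the toy logits are hypothetical. The notebook itself uses `evaluate` and NumPy as shown above.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# four hypothetical examples: predictions are [0, 1, 1, 0]
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2], [1.5, 0.2]]
labels = [0, 1, 0, 0]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```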

A batch size of 8 should work in Colab because we fine-tune only the last layer; the cell below uses 12.

In [35]:
training_args = TrainingArguments( output_dir='llma32_imdb_ft',
                                  eval_strategy="steps",
                                  eval_steps=100,
                                  num_train_epochs=1,
                                  per_device_train_batch_size=12,
                                  per_device_eval_batch_size=12,
                                  bf16=False,
                                  fp16=True,
                                  tf32=False,
                                  gradient_accumulation_steps=1,
                                  adam_beta1=0.9,
                                  adam_beta2=0.999,
                                  learning_rate=2e-5,
                                  weight_decay=0.01,
                                  logging_dir='logs',
                                  logging_strategy="steps",
                                  logging_steps = 100,
                                  save_steps=100,
                                  save_total_limit=20,
                                  report_to='none',
                                )
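
These settings determine how long the run is and how often it is evaluated. The arithmetic below is a rough sketch that assumes a single device, so the effective batch size equals `per_device_train_batch_size`, and uses the 22,500 / 2,500 split produced earlier.

```python
import math

train_examples = 22500
eval_examples = 2500
per_device_batch = 12
grad_accum_steps = 1
num_epochs = 1
eval_every = 100  # eval_steps, also used for logging_steps and save_steps

# optimizer steps per epoch (single device assumed)
steps_per_epoch = math.ceil(train_examples / (per_device_batch * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every
eval_batches = math.ceil(eval_examples / per_device_batch)

print(total_steps)      # 1875
print(num_evaluations)  # 18
print(eval_batches)     # 209
```

Under these assumptions the 18 checkpoints written at `save_steps=100` stay within `save_total_limit=20`.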


In [37]:
trainer = Trainer(model=model,
                  args = training_args,
                 train_dataset=ds_split["train"],
                 eval_dataset=ds_split["test"],
                 compute_metrics=compute_metrics,
                 data_collator = data_collator)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


```python
results = trainer.train() # make this a code cell to execute
```

<img src="https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/ft_last_layer_llama321b.png">

* A simple linear classifier on top of the frozen model achieves a training accuracy of 85.8%.
* This implies that the model produces a good representation of the input sequence.

In [39]:
from transformers import TextClassificationPipeline
classifier = TextClassificationPipeline(model=model,
                                       tokenizer=tokenizer,
                                       framework='pt',
                                       task="sentiment-analysis",
                                       device = "cuda"
                                       )

## Inference 

In [40]:
text = "The movie is good."
prediction = classifier(text)
print(prediction)

[{'label': 'POSITIVE', 'score': 0.8752306699752808}]


In [41]:
text = "The movie is really bad..nothing new to hook us"
prediction = classifier(text)
print(prediction)

[{'label': 'POSITIVE', 'score': 0.6252825260162354}]


In [42]:
text = "Very bad movie with no good story"
prediction = classifier(text)
print(prediction)

[{'label': 'NEGATIVE', 'score': 0.9068325161933899}]


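For the text `"Very bad movie with no good story"`, the fine-tuned model now predicts negative sentiment with high confidence (about 0.91). The second example is still labeled positive, but with a much lower score (about 0.63, versus 0.997 before fine-tuning).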

## What is next?
* Train the model with the LoRA adapters and see how it performs
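
As a rough preview of that step, the sketch below estimates how many parameters LoRA would add. For a linear layer with `in_features` inputs and `out_features` outputs, LoRA of rank `r` trains two small matrices with `r * (in_features + out_features)` parameters in total. The layer shapes come from the printed model. The rank of 8 and the choice of `q_proj` and `v_proj` as target modules are illustrative assumptions, not settings from this notebook.

```python
def lora_params(in_features, out_features, rank):
    """Parameters in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


rank = 8          # hypothetical LoRA rank
num_layers = 16   # decoder layers in the printed model

q_proj = lora_params(2048, 2048, rank)  # q_proj: 2048 -> 2048
v_proj = lora_params(2048, 512, rank)   # v_proj: 2048 -> 512
total = num_layers * (q_proj + v_proj)

print(q_proj, v_proj, total)
```

Under these assumptions the adapters add well under a million trainable parameters, still small relative to the base model but far more than the 4096 in the classification head.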
