# AIM
This notebook fine-tunes the [LLaMA 3.2 3B Instruct model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on the [TL;DR Dataset](https://huggingface.co/datasets/trl-lib/tldr) and on a Custom Dataset, and exports the resulting models for later evaluation.

* Model_1 - Fine-Tune [LLaMA 3.2 3B Instruct model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) with TL;DR first and then with the Custom Dataset
* Model_2 - Fine-Tune [LLaMA 3.2 3B Instruct model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) with Custom Dataset only

## Installing Packages

We install the packages required for fine-tuning (this notebook runs on Colab).

In [None]:
!pip install pandas datasets
!pip install transformers torch
!pip install xformers trl peft accelerate bitsandbytes

In [None]:
import sys

sys.version

'3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]'

In [None]:
from pathlib import Path

project_root = Path.cwd().parent  # or Path().resolve().parent
sys.path.insert(0, str(project_root))
# add the project root to sys.path so that the src package can be imported

from transformers import Trainer
from src.load_dataset import load_jsonl, split_90_and_10, print_sample, TL_DR_JSON, CS_JSON
from src.utils.torch import ensure_device
from src.train_model import print_train_progress, configure_trainer, print_args, export_model, EXPORT_TLDR_FINE_TUNED, EXPORT_TLDR_CS_FINE_TUNED, EXPORT_CS_FINE_TUNED
from src.load_model import load_tokenizer, load_model, lora_config_for, apply_formatter, TOKEN_LIMIT_FOR_CS, \
    format_dataset, prep_data_collector

In [None]:
ensure_device()

We would be using this device: cuda


# Model_1
First, we fine-tune the [LLaMA 3.2 3B Instruct model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) on the [TL;DR Dataset](https://huggingface.co/datasets/trl-lib/tldr).

## Loading the TL;DR Dataset

We saved the TL;DR Dataset in JSONL format and load it from the `proc_tldr.jsonl` file. See this [notebook](https://github.com/au-nlp/project-milestone-p2-group-6/blob/main/lab/export_dataset.ipynb) for how the file was generated.

In [None]:
dataset = load_jsonl(TL_DR_JSON)


[1/8] Loading dataset...
✓ Loaded 6944 examples
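
`load_jsonl` lives in `src/load_dataset.py`, which is not shown in this notebook. A minimal sketch of what such a loader could look like, using only the standard library. The names `parse_jsonl` and `load_jsonl_sketch` are hypothetical, and the real helper may well return a `datasets.Dataset` rather than a plain list:

```python
import json
from pathlib import Path
from typing import Iterable, List, Union


def parse_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"invalid JSON on line {line_no}: {err}") from err
    return records


def load_jsonl_sketch(path: Union[str, Path]) -> List[dict]:
    """Read a .jsonl file from disk and return its records as dicts."""
    with open(path, encoding="utf-8") as fh:
        records = parse_jsonl(fh)
    print(f"✓ Loaded {len(records)} examples")
    return records
```

Keeping the parsing separate from the file handling makes it easy to check the parser on a few in-memory lines before pointing it at the full dataset.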


## Preparing the Train and Test Dataset

We split the dataset into 90% for training and 10% for testing; the test split is used as the validation set during training.

In [None]:
# Split dataset
split_dataset = split_90_and_10(dataset)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

✓ Train: 6249 | Val: 695
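
The printed sizes are consistent with a 10% test share that is rounded up to the next whole example. A small sketch of that arithmetic, assuming `split_90_and_10` behaves like a `test_size=0.1` split (the function name `split_sizes` is hypothetical, and the rounding rule is inferred from the printed numbers):

```python
import math
from typing import Tuple


def split_sizes(n_examples: int, test_fraction: float = 0.1) -> Tuple[int, int]:
    """Return (train, test) sizes, rounding the test share up."""
    n_test = math.ceil(n_examples * test_fraction)
    n_train = n_examples - n_test
    return n_train, n_test


print(split_sizes(6944))  # (6249, 695), the TL;DR dataset
print(split_sizes(1004))  # (903, 101), the custom dataset used later
```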


## Note

We need to log in to Hugging Face because the LLaMA 3.2 3B model is a gated repository: access requires approval from the repository admins.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
The token `YTA-DEV` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `YTA-DEV`


## Loading the Model and Tokenizer

This is where we load the [LLaMA 3.2 3B Instruct model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and its tokenizer.

In [None]:
# Load model and tokenizer
tokenizer = load_tokenizer()
model = load_model()


[3/8] Loading LLaMA 3.2 3B model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

✓ Model loaded in 8-bit
✓ Model size: ~3B parameters


## Configuring LoRA

In [None]:
model = lora_config_for(model)  # get the quantized model with the LoRA config applied


[4/8] Preparing model for QLoRA...


In [None]:
model.print_trainable_parameters()

# Tokenization function with proper chat formatting
print("\n[5/8] Preparing tokenization...")

✓ LoRA adapters added
trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511

[5/8] Preparing tokenization...


In [None]:
formatter = apply_formatter(tokenizer)

In [None]:
# Apply tokenization for train dataset
train_dataset = format_dataset(formatter, train_dataset, "Tokenizing train")


✓ Tokenizing train dataset...


Tokenizing train:   0%|          | 0/6249 [00:00<?, ? examples/s]

In [None]:
# Apply tokenization for validation dataset
val_dataset = format_dataset(formatter, val_dataset, "Tokenizing validation")

✓ Tokenizing validation dataset...


Tokenizing validation:   0%|          | 0/695 [00:00<?, ? examples/s]

In [None]:
print_sample(train_dataset)


✓ Sample stats:
  - Input length: 1504 tokens
  - Attention tokens: 1504 tokens
  - Truncated: No


## Training Configuration

In [None]:
training_args = configure_trainer(EXPORT_TLDR_FINE_TUNED)


[6/8] Configuring training...


In [None]:
print_args(train_dataset, training_args)

✓ Training configuration:
  - Effective batch size: 16
  - Total training steps: ~1171
  - Learning rate: 0.0002


In [None]:
data_collator = prep_data_collector(tokenizer)


[7/8] Creating data collator...
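
`prep_data_collector` is defined in `src/load_model.py` and is not shown here. For causal language modelling, a collator typically pads every example in a batch to the same length and masks the padding out of the loss. A library-free sketch of that idea, assuming each example carries `input_ids` and `attention_mask`. The function name and the `pad_token_id` value are hypothetical, and the project's helper may differ:

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(
    examples: List[Dict[str, List[int]]], pad_token_id: int
) -> Dict[str, List[List[int]]]:
    """Right-pad a batch to its longest example and build labels from input_ids."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for ex in examples:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * pad)
    return batch


batch = collate_causal_lm(
    [
        {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1]},
        {"input_ids": [8], "attention_mask": [1]},
    ],
    pad_token_id=0,
)
print(batch["labels"])  # [[5, 6, 7], [8, -100, -100]]
```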


In [None]:
# Setup trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)


## Starting the Training _(With TL;DR DataSet - Model_1)_

We observed that this run took about 1.3 hrs on a T4 GPU (`train_runtime` ≈ 4715 s in the output below).

In [None]:
print_train_progress(trainer)


[8/8] Starting training...
TRAINING IN PROGRESS




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 2.1649        | 2.28672         |
| 200  | 2.1452        | 2.253277        |
| 300  | 2.1589        | 2.239438        |




TrainOutput(global_step=391, training_loss=2.296681910219705, metrics={'train_runtime': 4714.9168, 'train_samples_per_second': 1.325, 'train_steps_per_second': 0.083, 'total_flos': 1.2578198791033651e+17, 'train_loss': 2.296681910219705, 'epoch': 1.0})
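
The configuration cell printed ~1171 total steps, while this run stopped at `global_step=391` with `epoch: 1.0`. Both numbers are consistent with an effective batch size of 16 over 6249 training examples: the estimate matches a three-epoch schedule, and the logged value matches a single epoch. The sketch below reproduces them. How `print_args` actually derives its estimate is an assumption inferred from the printed values, and both function names are hypothetical:

```python
import math


def estimated_total_steps(n_examples: int, effective_batch: int, epochs: int) -> int:
    """Rough estimate: total examples seen divided by the effective batch size."""
    return (n_examples * epochs) // effective_batch


def optimizer_steps(n_examples: int, effective_batch: int, epochs: int) -> int:
    """Step count when the last, partial batch of each epoch still counts as a step."""
    return math.ceil(n_examples / effective_batch) * epochs


print(estimated_total_steps(6249, 16, 3))  # 1171, matches the "~1171" estimate
print(optimizer_steps(6249, 16, 1))        # 391, matches global_step after one epoch
```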

In [None]:
# Save final model
export_model(trainer, tokenizer, EXPORT_TLDR_FINE_TUNED)


TRAINING COMPLETE

Saving model...


('./llama3.2-3b-qlora-summary/tokenizer_config.json',
 './llama3.2-3b-qlora-summary/special_tokens_map.json',
 './llama3.2-3b-qlora-summary/chat_template.jinja',
 './llama3.2-3b-qlora-summary/tokenizer.json')

In [None]:
!zip -r ./llama_3b_3_2.zip ./llama3.2-3b-qlora-summary # zip the model for saving

  adding: llama3.2-3b-qlora-summary/ (stored 0%)
  adding: llama3.2-3b-qlora-summary/tokenizer.json (deflated 85%)
  adding: llama3.2-3b-qlora-summary/adapter_config.json (deflated 58%)
  adding: llama3.2-3b-qlora-summary/tokenizer_config.json (deflated 96%)
  adding: llama3.2-3b-qlora-summary/README.md (deflated 65%)
  adding: llama3.2-3b-qlora-summary/chat_template.jinja (deflated 71%)
  adding: llama3.2-3b-qlora-summary/special_tokens_map.json (deflated 63%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/ (stored 0%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/tokenizer.json (deflated 85%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/adapter_config.json (deflated 58%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/scheduler.pt (deflated 61%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/tokenizer_config.json (deflated 96%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/README.md (deflated 65%)
  adding: llama3.2-3b-qlora-summary/checkpoint-300/chat_temp
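
The archive above also contains the intermediate `checkpoint-300/` directory. If only the final export is needed later, the checkpoint folders can be left out of the archive. A sketch using the standard library, assuming checkpoint folders are named `checkpoint-*` as in the listing (the function name `zip_without_checkpoints` is hypothetical):

```python
import zipfile
from pathlib import Path
from typing import List


def zip_without_checkpoints(model_dir: str, archive_path: str) -> List[str]:
    """Zip model_dir, skipping every file inside a checkpoint-* subdirectory."""
    root = Path(model_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(root)
            if any(part.startswith("checkpoint-") for part in rel.parts):
                continue  # intermediate trainer state
            # Keep the top-level folder name so unzipping recreates model_dir.
            arcname = (Path(root.name) / rel).as_posix()
            zf.write(path, arcname)
            added.append(arcname)
    return added
```

The archive path should point outside `model_dir`, otherwise the archive would try to include itself.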

## Fine-Tuning Complete _(With TL;DR DataSet - Model_1)_

We have successfully fine-tuned the LLaMA 3.2 3B model on the TL;DR Dataset and exported it to `./llama_3b_3_2.zip`. We will use this model as the starting point for further fine-tuning on the custom dataset.

## Fine-Tuning with the Custom Dataset

We now fine-tune the TL;DR fine-tuned LLaMA 3.2 3B model on the Custom Dataset.

### Note

We observed that fine-tuning with the Custom Dataset requires more GPU resources, so we changed the runtime type to an A100 GPU. After switching runtimes, we first need to load the model exported from the previous run.

In [None]:
!unzip llama_3b_3_2.zip # unzip model again for import

Archive:  llama_3b_3_2.zip
   creating: llama3.2-3b-qlora-summary/
  inflating: llama3.2-3b-qlora-summary/tokenizer.json  
  inflating: llama3.2-3b-qlora-summary/adapter_config.json  
  inflating: llama3.2-3b-qlora-summary/tokenizer_config.json  
  inflating: llama3.2-3b-qlora-summary/README.md  
  inflating: llama3.2-3b-qlora-summary/chat_template.jinja  
  inflating: llama3.2-3b-qlora-summary/special_tokens_map.json  
   creating: llama3.2-3b-qlora-summary/checkpoint-300/
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/tokenizer.json  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/adapter_config.json  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/scheduler.pt  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/tokenizer_config.json  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/README.md  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/chat_template.jinja  
  inflating: llama3.2-3b-qlora-summary/checkpoint-300/special_tokens_map.json  
  inf

## Loading the Custom Dataset

We saved the Custom Dataset in JSONL format and load it from the `custom_dataset.jsonl` file. See this [notebook](https://github.com/au-nlp/project-milestone-p2-group-6/blob/main/lab/export_dataset.ipynb) for how the file was generated.

In [5]:
# Load JSONL data (Custom Dataset)
custom_dataset = load_jsonl(CS_JSON)


2-[1/8] Loading dataset...
✓ Loaded 1004 examples


## Preparing the Train and the Test set

We use a test fraction of 0.1 (90% train, 10% test).

In [7]:
# Split dataset
split_dataset = split_90_and_10(custom_dataset)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

✓ Train: 903 | Val: 101


## Loading the Fine-Tuned Model

We load the LLaMA 3.2 3B model that was fine-tuned on the TL;DR Dataset.

In [11]:
# load the tokenizer from the previously fine-tuned model
tokenizer = load_tokenizer(EXPORT_TLDR_FINE_TUNED)

In [12]:
# first load the base model
model = load_model()
# then apply the LoRA weights exported from the TL;DR run
model = lora_config_for(model, EXPORT_TLDR_FINE_TUNED)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

✓ Fine-tuned model loaded


In [13]:
model.print_trainable_parameters()

# Tokenization function with proper chat formatting
print("\n[5/8] Preparing tokenization...")

trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511

[5/8] Preparing tokenization...


In [14]:
model.peft_config  # for verifying lora config

{'default': LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, peft_version='0.18.0', base_model_name_or_path='meta-llama/Llama-3.2-3B-Instruct', revision=None, inference_mode=False, r=16, target_modules={'q_proj', 'k_proj', 'o_proj', 'down_proj', 'gate_proj', 'up_proj', 'v_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, alora_invocation_tokens=None, use_qalora=False, qalora_group_size=16, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, target_parameters=None, arrow_config=None, ensure_weight_tying=False)}
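
The trainable-parameter count printed above can be reproduced from this config. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` parameters. The sketch below assumes the Llama 3.2 3B dimensions (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128). These values are not shown in this notebook and are stated here as an assumption:

```python
# Assumed Llama 3.2 3B dimensions (not printed in this notebook).
HIDDEN = 3072
KV_DIM = 8 * 128   # num_key_value_heads * head_dim
MLP = 8192
NUM_LAYERS = 28
RANK = 16          # r from the LoraConfig above

# (d_in, d_out) of every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count()
total = 3_237_063_680  # "all params" as printed by print_trainable_parameters()
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
# 24,313,856 trainable (0.7511%)
```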

## Note

We set the `max_length` of the tokenization function to 10_000 for the custom dataset (YouTube transcripts), since its examples can contain significantly more tokens than those in the TL;DR Dataset.

In [None]:
formatter = apply_formatter(tokenizer, token_limit=TOKEN_LIMIT_FOR_CS) # get formatter

In [17]:
# Apply tokenization for train dataset
train_dataset = format_dataset(formatter, train_dataset, "Tokenizing train")

✓ Tokenizing train dataset...


Tokenizing train:   0%|          | 0/903 [00:00<?, ? examples/s]

In [18]:
# Apply tokenization for validation dataset
val_dataset = format_dataset(formatter, val_dataset, "Tokenizing validation")

✓ Tokenizing validation dataset...


Tokenizing validation:   0%|          | 0/101 [00:00<?, ? examples/s]

In [19]:
# Show sample stats (pre-training)
print_sample(train_dataset)



✓ Sample stats:
  - Input length: 1382 tokens
  - Attention tokens: 1382 tokens
  - Truncated: No
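
`print_sample` is imported from `src/load_dataset.py` and is not shown here. A minimal sketch of how such statistics could be computed for one tokenized example, assuming `input_ids` and `attention_mask` fields and treating an example that reaches the token limit as truncated. The function name, the truncation heuristic, and the toy example are all hypothetical:

```python
from typing import Dict, List


def sample_stats(example: Dict[str, List[int]], token_limit: int) -> Dict[str, object]:
    """Summarise one tokenized example."""
    input_len = len(example["input_ids"])
    # Only positions with mask value 1 are attended to (padding is 0).
    attended = sum(example["attention_mask"])
    return {
        "input_length": input_len,
        "attention_tokens": attended,
        # Heuristic: an example that fills the limit was probably cut off.
        "truncated": input_len >= token_limit,
    }


# Toy example with the same length as the sample printed above.
toy = {"input_ids": list(range(1382)), "attention_mask": [1] * 1382}
print(sample_stats(toy, token_limit=10_000))
```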


## Training Configuration

In [20]:
# number of epochs is 3 with the custom dataset
training_args = configure_trainer(EXPORT_TLDR_CS_FINE_TUNED, num_train_epochs=3)


[6/8] Configuring training...


In [21]:
print_args(train_dataset, training_args)

✓ Training configuration:
  - Effective batch size: 16
  - Total training steps: ~169
  - Learning rate: 0.0002


In [None]:
data_collator = prep_data_collector(tokenizer)  # initialise the data collator


[7/8] Creating data collator...


In [23]:
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)


## Training (With Custom DataSet - Model_1)

In [24]:
print_train_progress(trainer)


[8/8] Starting training...
TRAINING IN PROGRESS




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.9983        | 2.143472        |




TrainOutput(global_step=171, training_loss=2.1060634607460065, metrics={'train_runtime': 6778.3477, 'train_samples_per_second': 0.4, 'train_steps_per_second': 0.025, 'total_flos': 2.0566782399297946e+17, 'train_loss': 2.1060634607460065, 'epoch': 3.0})
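
The throughput figures in `TrainOutput` follow from the runtime, the number of training examples, and the step count. A quick check that uses only values printed in this notebook (the helper name `throughput` is hypothetical):

```python
def throughput(n_train: int, epochs: int, steps: int, runtime_s: float) -> dict:
    """Recompute the Trainer's speed metrics from raw counts."""
    return {
        "hours": round(runtime_s / 3600, 2),
        "samples_per_second": round(n_train * epochs / runtime_s, 3),
        "steps_per_second": round(steps / runtime_s, 3),
    }


# TL;DR stage: 6249 examples, 1 epoch, 391 steps
tldr = throughput(6249, 1, 391, 4714.9168)
# Custom-dataset stage: 903 examples, 3 epochs, 171 steps
custom = throughput(903, 3, 171, 6778.3477)
print(tldr)
print(custom)
```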

In [25]:
# Save final model
export_model(trainer, tokenizer, EXPORT_TLDR_CS_FINE_TUNED)


TRAINING COMPLETE

Saving model...


('./final-summary/tokenizer_config.json',
 './final-summary/special_tokens_map.json',
 './final-summary/chat_template.jinja',
 './final-summary/tokenizer.json')

In [None]:
!zip -r ./final-summary.zip ./final-summary # zip model 1

  adding: final-summary/ (stored 0%)
  adding: final-summary/README.md (deflated 65%)
  adding: final-summary/checkpoint-100/ (stored 0%)
  adding: final-summary/checkpoint-100/README.md (deflated 65%)
  adding: final-summary/checkpoint-100/adapter_config.json (deflated 58%)
  adding: final-summary/checkpoint-100/training_args.bin (deflated 53%)
  adding: final-summary/checkpoint-100/special_tokens_map.json (deflated 63%)
  adding: final-summary/checkpoint-100/tokenizer_config.json (deflated 96%)
  adding: final-summary/checkpoint-100/rng_state.pth (deflated 26%)
  adding: final-summary/checkpoint-100/trainer_state.json (deflated 70%)
  adding: final-summary/checkpoint-100/chat_template.jinja (deflated 71%)
  adding: final-summary/checkpoint-100/optimizer.pt (deflated 11%)
  adding: final-summary/checkpoint-100/scheduler.pt (deflated 62%)
  adding: final-summary/checkpoint-100/adapter_model.safetensors (deflated 7%)
  adding: final-summary/checkpoint-100/tokenizer.json (deflated 85%)
 

## Training Completed

We observed that fine-tuning with the custom dataset took ~3 hrs on an A100 GPU (the logged `train_runtime` above is about 6778 s), and we exported the result to `./final-summary.zip`.

**We now have Model_1, which is fine-tuned on the TL;DR Dataset and then on the Custom Dataset.**

# Preparing Model_2

Here we fine-tune the LLaMA 3.2 3B model on the Custom Dataset only.

Since we saw that the Custom Dataset requires more GPU resources, we use an A100 GPU for this run.

## Preparing the Train and Test Dataset

We again split 90% for training and 10% for testing.

(We repeat the split because `train_dataset` and `val_dataset` were overwritten with their tokenized versions during the Model_1 run.)

In [None]:
# Split dataset
split_dataset = split_90_and_10(custom_dataset)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

## Loading Tokenizer & Model

We load the base model (LLaMA 3.2 3B Instruct).

In [None]:
# Load model and tokenizer
tokenizer = load_tokenizer()
model = load_model()

## Configuring LoRA for Fine Tuning

In [None]:
model = lora_config_for(model) # get lora config again for model 2

We would be using this device: cuda


In [None]:
model.print_trainable_parameters()

# Tokenization function with proper chat formatting
print("\n[5/8] Preparing tokenization...")

## Note

We set the `max_length` of the tokenization function to 10_000 for the custom dataset (YouTube transcripts), since its examples can contain significantly more tokens than those in the TL;DR Dataset.

In [None]:
formatter = apply_formatter(tokenizer, token_limit=TOKEN_LIMIT_FOR_CS) # get formatter

✓ Train: 903 | Val: 101


In [None]:
# Apply tokenization for train dataset
train_dataset = format_dataset(formatter, train_dataset, "Tokenizing train")

In [None]:
# Apply tokenization for validation dataset
val_dataset = format_dataset(formatter, val_dataset, "Tokenizing validation")

In [None]:
# Show sample stats (pre-training)
print_sample(train_dataset)

## Training Configuration

In [None]:
# number of epochs is 3 with the custom dataset
training_args = configure_trainer(EXPORT_CS_FINE_TUNED, num_train_epochs=3)

In [None]:
print_args(train_dataset, training_args)

In [12]:
data_collator = prep_data_collector(tokenizer)


[2/8] Configuring 8-bit quantization...


In [15]:
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)



[4/8] Preparing model for QLoRA...


## Starting the Training

This is the final fine-tuning run. We observed that it again took ~3 hrs to complete.

In [17]:
print_train_progress(trainer)

In [None]:
# Save final model
export_model(trainer, tokenizer, EXPORT_CS_FINE_TUNED)

In [18]:
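# NOTE: this command archives the TL;DR export directory again; to archive Model_2,
# point it at the directory stored in EXPORT_CS_FINE_TUNED instead.
!zip -r ./llama_3b_3_2.zip ./llama3.2-3b-qlora-summary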

# Completed
---
* Fine-tuned the **Llama 3.2 3B** model in two stages:
  * First, on the `TL;DR` dataset.
  * Second, on a custom dataset.
* Fine-tuned the **Llama 3.2 3B** model on the custom dataset only.