# 🚀 Model Fine-tuning with Unsloth

## Purpose
Fine-tunes the Qwen3-0.6B model for table of contents (TOC) extraction using LoRA (Low-Rank Adaptation) with the Unsloth framework. The model is trained to convert noisy, imperfect TOC text into clean, structured JSON.

<br>

---

## What This Notebook Does

### Model Setup
- Loads the `unsloth/Qwen3-0.6B-unsloth-bnb-4bit` base model with 4-bit quantization for memory efficiency
- Applies LoRA adapters to specific attention and feed-forward layers for parameter-efficient fine-tuning

### Training Process
- Loads the synthetic training dataset from a pickle file (the training log reports 10,000 examples)
- Formats data using chat templates with system prompts for TOC parsing instructions
- Trains for 2 epochs (batch size 4, gradient accumulation 2, learning rate 5e-5)
- Uses cosine learning rate scheduling and mixed precision training
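
The number of optimizer steps implied by these settings can be checked with simple arithmetic. The following is a minimal sketch, assuming the 10,000 examples and 2 epochs reported in this notebook's training log; `total_optimizer_steps` is a hypothetical helper, not part of TRL or Unsloth, and the trainer's own rounding may differ slightly for dataset sizes that do not divide evenly.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Hypothetical helper: estimate the effective batch size and optimizer steps."""
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Round up so a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, steps = total_optimizer_steps(10_000, 4, 2, epochs=2)
print(effective_batch, steps)  # 8 2500
```

Both values agree with the "Total batch size" and "Total steps" figures printed when training starts.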

### Model Export
- Saves LoRA adapters and tokenizer for inference
- Converts to multiple formats:
  - **16-bit merged model**: Half-precision weights with no additional quantization, for maximum quality
  - **4-bit GGUF**: Quantized format optimized for deployment and inference

## Output Models
- **16-bit merged**: `/finetuned_model/merged/16_bit_merge_temp/`
- **GGUF quantized**: `/finetuned_model/gguf/4bit_3version_gguf/`


In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

from unsloth import FastLanguageModel

## Model Loading and LoRA Configuration

Loads the Qwen3-0.6B base model in 4-bit quantization and applies LoRA adapters for parameter-efficient fine-tuning on attention and feed-forward layers.
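
As a rough check on what "parameter-efficient" means here, the sketch below counts the LoRA weights added at rank 16 across the seven target modules configured in this notebook. The layer shapes and the layer count are assumptions taken from the published Qwen3-0.6B configuration, not from this notebook, and `count_lora_parameters` is a hypothetical helper rather than an Unsloth or PEFT function.

```python
# Assumed Qwen3-0.6B dimensions (not stated in this notebook).
HIDDEN = 1024   # model hidden size
Q_OUT = 2048    # 16 attention heads x head_dim 128
KV_OUT = 1024   # 8 key/value heads x head_dim 128
MLP = 3072      # feed-forward intermediate size
NUM_LAYERS = 28

# (in_features, out_features) for each LoRA target module in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def count_lora_parameters(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = count_lora_parameters(TARGET_SHAPES, rank=16, num_layers=NUM_LAYERS)
scaling = 32 / 16  # lora_alpha / r, the factor applied to the low-rank update
print(trainable, scaling)  # 10092544 2.0
```

Under those assumptions the count agrees with the "Trainable parameters" figure that Unsloth prints when training starts.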


In [None]:
# Base model checkpoint (pre-quantized to 4-bit by Unsloth)
model_name = "unsloth/Qwen3-0.6B-unsloth-bnb-4bit"

# Load model with Unsloth (handles 4-bit automatically)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,
    dtype = None,            # None lets Unsloth auto-detect the dtype
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
)


# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Dataset Loading

Mounts Google Drive and loads the pre-generated synthetic training dataset from a pickle file. Imports the libraries needed for supervised fine-tuning with TRL and for dataset handling.
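
Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source, and it is worth checking the loaded structure before training. The sketch below assumes the dataset is a sequence of `(toc_text, json_output)` pairs in which both items are strings and the target is a JSON array; that layout is inferred from how the data is consumed later in this notebook, and `validate_pairs` is a hypothetical helper.

```python
import json


def validate_pairs(pairs):
    """Hypothetical check: return (index, reason) for every malformed example."""
    problems = []
    for index, item in enumerate(pairs):
        # Each example is expected to be a 2-item tuple or list.
        if not (isinstance(item, (tuple, list)) and len(item) == 2):
            problems.append((index, "not a 2-item pair"))
            continue
        toc_text, json_output = item
        if not isinstance(toc_text, str) or not isinstance(json_output, str):
            problems.append((index, "both items must be strings"))
            continue
        # The target must parse as a JSON array of chapter records.
        try:
            parsed = json.loads(json_output)
        except json.JSONDecodeError:
            problems.append((index, "target is not valid JSON"))
            continue
        if not isinstance(parsed, list):
            problems.append((index, "target is not a JSON array"))
    return problems


sample = [
    ("1 Introduction 1", '[{"chapter_number": 1}]'),
    ("2 Methods 15", "{not json"),
    "oops",
]
print(validate_pairs(sample))
```

An empty result means every example matched the assumed layout.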

In [None]:
import pickle
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
from google.colab import drive
drive.mount('/content/drive')

# Placeholder path: point this at your pickled dataset
data_path = "/path/to/your/dataset/data/synthetic_toc.pkl"
with open(data_path, 'rb') as f:
    training_data = pickle.load(f)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Training Data Preparation

Formats synthetic training data into chat templates with detailed system prompts for TOC parsing.

In [None]:
def prepare_data_for_training(training_data, tokenizer):
    formatted_texts = []

    system_prompt = """You are a table of contents parser. Read the input carefully and extract ONLY the numbered chapters shown.

    CRITICAL INSTRUCTIONS:
    1. Read the provided table of contents line by line
    2. Find lines that match: NUMBER SPACE TITLE PAGE_NUMBER
    3. Extract the exact chapter titles from the input text
    4. Use the exact page numbers from the input text
    5. Calculate end_page = next chapter start_page - 1
    6. Ignore lines with "Exercises", "###", "•", or standalone numbers

    FORMAT: Return JSON array with: chapter_number, chapter_title, start_page, end_page
    """
    for toc_text, json_output in training_data:
        messages = [
              {"role": "system", "content": system_prompt},
              {
                  "role": "user",
                  "content": f"""Parse this specific table of contents and extract the numbered chapters:\n\n

                  {toc_text.strip()}

                  Extract the chapters from the text above (not from any other source):"""
              },
              {"role": "assistant", "content": json_output}
        ]

        # Apply chat template directly
        formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
            enable_thinking=False
        )
        formatted_texts.append(formatted_text)

    return Dataset.from_dict({"text": formatted_texts})

train_dataset = prepare_data_for_training(training_data, tokenizer)
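
Instruction 5 of the system prompt (`end_page = next chapter start_page - 1`) can be sketched in plain Python, which is handy for spot-checking targets in the synthetic data. This is an illustration only: `fill_end_pages` is a hypothetical helper, and setting the final chapter's `end_page` to `None` when no later start page exists is an assumption, since the notebook does not say how the last chapter is handled.

```python
def fill_end_pages(chapters, last_page=None):
    """Hypothetical helper: derive end_page from the next chapter's start_page."""
    result = []
    for position, chapter in enumerate(chapters):
        entry = dict(chapter)  # copy so the input is not mutated
        if position + 1 < len(chapters):
            entry["end_page"] = chapters[position + 1]["start_page"] - 1
        else:
            # No following chapter; fall back to a caller-supplied value.
            entry["end_page"] = last_page
        result.append(entry)
    return result


chapters = [
    {"chapter_number": 1, "chapter_title": "Introduction", "start_page": 1},
    {"chapter_number": 2, "chapter_title": "Methods", "start_page": 15},
    {"chapter_number": 3, "chapter_title": "Results", "start_page": 42},
]
for entry in fill_end_pages(chapters):
    print(entry["chapter_number"], entry["start_page"], entry["end_page"])
```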

## Model Training

Configures and runs supervised fine-tuning with a cosine learning rate schedule (100 warmup steps), gradient clipping at 1.0, the 8-bit AdamW optimizer, and fp16 mixed precision.
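
The shape of a cosine schedule with linear warmup can be sketched in a few lines. This illustrates the general technique, assuming 100 warmup steps, a peak rate of 5e-5, and 2,500 total steps as in this run; `cosine_lr` is a hypothetical function, and the scheduler the trainer actually builds may differ in details.

```python
import math


def cosine_lr(step, base_lr=5e-5, warmup_steps=100, total_steps=2500):
    """Hypothetical sketch: linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 50, 100, 1300, 2500):
    print(step, f"{cosine_lr(step):.2e}")
```

The rate peaks at the end of warmup (step 100), passes half the peak at the midpoint of the decay phase, and reaches zero on the final step.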

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=4,  
        gradient_accumulation_steps=2, 
        warmup_steps=100,               
        num_train_epochs=2,             
        learning_rate=5e-5,             
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.001,            
        lr_scheduler_type="cosine",    
        max_grad_norm=1.0,             
        fp16=True,                     
        seed=3407,
        report_to="none",
    ),
)

trainer.train()

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 2 | Total steps = 2,500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 10,092,544/600,000,000 (1.68% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.8437
20,1.686
30,1.5257
40,1.3181
50,1.1105
60,0.8675
70,0.6304
80,0.5659
90,0.544
100,0.5298


KeyboardInterrupt: 

## Model Saving and Export

Saves and exports the fine-tuned model in multiple formats for different use cases:

- **LoRA adapters**: Saves only the adapter weights and tokenizer, which keeps storage minimal
- **16-bit merged model**: Creates a standalone half-precision (16-bit) model by merging the adapters into the base weights
- **GGUF quantized**: Converts to a 4-bit format (`q4_k_m`) for efficient deployment and inference
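
To get a feel for the storage trade-off between these formats, file size can be approximated as parameters times bits per weight. The sketch below is a back-of-the-envelope estimate: the parameter count of about 595 million is an assumption inferred from the 1.19 GB `model.safetensors` file in the merge log, the 4.5 bits per weight used for the quantized case is a hypothetical average (`q4_k_m` mixes several quantization types, so the real figure varies by model), and `estimated_size_gb` is a hypothetical helper.

```python
def estimated_size_gb(num_params, bits_per_weight):
    """Hypothetical estimate of weight storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 595_000_000  # assumed, inferred from the 16-bit file size

print(f"16-bit merged: {estimated_size_gb(NUM_PARAMS, 16):.2f} GB")
print(f"~4.5-bit GGUF: {estimated_size_gb(NUM_PARAMS, 4.5):.2f} GB")
```

The estimate ignores file metadata and tokenizer files, so actual sizes on disk will be somewhat larger.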

In [None]:
# Save only adapters
adapter_path = "/output/path/to/save/adapters"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

In [None]:
# merge adapters with the base model and save in 16-bit precision
merged_16bit = "/output/path/to/merged_model_16bit"
model.save_pretrained_merged(
    merged_16bit,
    tokenizer,
    save_method="merged_16bit"
)

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/1.19G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:29<00:00, 29.58s/it]


In [None]:
gguf_path = "/output/path/to/gguf_model"
model.save_pretrained_gguf(
    gguf_path,
    tokenizer,
    quantization_method="q4_k_m"
)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.81 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 104.82it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/drive/MyDrive/Projects/Finetuning_TOC_Extractor/finetuned_model/gguf/4bit_3version_gguf/pytorch_model.bin...
Done.


Unsloth: Converting qwen3 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at /content/drive/MyDrive/Projects/Finetuning_TOC_Extractor/finetuned_model/gguf/4bit_3version_gguf into f16 GGUF format.
The output location will be /content/drive/MyDrive/Projects/Finetuning_TOC_Extractor/finetuned_model/gguf/4bit_3version_gguf/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: 4bit_3version_gguf
INFO:hf-to-gguf:Model architecture: Qwen3ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...