# 🔬 Model Benchmarking and Comparison

## Purpose
Compares the original Qwen3-0.6B base model with the fine-tuned version on table-of-contents (TOC) extraction tasks. The comparison shows how fine-tuning improves the model's ability to extract page information correctly and to handle noisy inputs.

<br>

--- 

## What This Notebook Does

### Model Loading
- **Base model**: Loads the original `unsloth/Qwen3-0.6B-unsloth-bnb-4bit` for baseline comparison
- **Fine-tuned model**: Loads the merged 16-bit model saved by the previous training run, quantized to 4-bit at load time

### Benchmarking Process
- Tests both models on the same noisy TOC samples from the test dataset
- Uses identical prompting strategies for fair comparison

### Key Improvements 
The fine-tuning results show the model has learned to:
- **Extract accurate page ranges**: Calculates proper `end_page` values (next chapter's `start_page` - 1)
- **Handle noisy inputs**: More resistant to formatting inconsistencies, random characters, and OCR errors
- **Maintain JSON structure**: Consistently outputs properly formatted JSON with required fields
- **Ignore irrelevant content**: Filters out "Exercises", decorative elements, and standalone numbers as instructed
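
The page-range and filtering rules listed above can be illustrated with a small rule-based sketch. This is not how the notebook extracts chapters (the model is prompted to do it); it only makes the target behaviour concrete. The regular expression, the `parse_toc` name, and the choice to leave the last chapter's `end_page` as `None` are assumptions made for this example.

```python
import re

# Assumed pattern for "NUMBER SPACE TITLE PAGE_NUMBER" lines
CHAPTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*$")


def parse_toc(toc_text):
    """Keep numbered chapter lines and derive end_page from the next chapter."""
    chapters = []
    for line in toc_text.splitlines():
        match = CHAPTER_LINE.match(line)
        if match is None:
            # Skips "Exercises", stray text, and standalone numbers
            continue
        number, title, page = match.groups()
        chapters.append({
            "chapter_number": int(number),
            "chapter_title": title,
            "start_page": int(page),
            "end_page": None,  # assumption: unknown for the last chapter
        })
    # end_page = next chapter start_page - 1
    for current, following in zip(chapters, chapters[1:]):
        current["end_page"] = following["start_page"] - 1
    return chapters


sample = "\n".join([
    "1 Dragon Subspecies Classification Systems  1",
    "Exercises  ",
    "2 Mating Rituals and Reproductive Cycles   25",
    " Break-even analysis and sensitivity testing procedures",
    "1",
    "3 Wing Aerodynamics and Flight Patterns 68",
])

for chapter in parse_toc(sample):
    print(chapter["chapter_number"], chapter["start_page"], chapter["end_page"])
# 1 1 24
# 2 25 67
# 3 68 None
```

A fixed pattern like this breaks as soon as the noise alters the chapter lines themselves, which is the case the fine-tuned model is meant to handle.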


In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
# Import unsloth first so its patches are applied before torch is loaded
from unsloth import FastLanguageModel
import pickle
import torch
from google.colab import drive

drive.mount('/content/drive')

# Note: despite the variable name, this file holds the held-out test samples
data_path = "/content/drive/MyDrive/Projects/Finetuning_TOC_Extractor/data/synthetic_toc_test.pkl"
with open(data_path, 'rb') as f:
    training_data = pickle.load(f)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Mounted at /content/drive


## Model Loading for Comparison

Loads both the original Qwen3-0.6B base model and the fine-tuned merged model for side-by-side performance comparison. Both models are loaded in 4-bit quantization with identical parameters to ensure fair benchmarking conditions.


In [None]:
base_model_name = "unsloth/Qwen3-0.6B-unsloth-bnb-4bit"

# Load the base model with Unsloth (handles 4-bit automatically)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_name,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
)

merged_16bit = "/path/to/your/merged_16bit_model"  # Replace with your actual path

# Load the merged fine-tuned model with the same sequence length and 4-bit setting
model_finetuned, tokenizer_finetuned = FastLanguageModel.from_pretrained(
    model_name = merged_16bit,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Inference Functions Setup

Defines shared inference functions so that the base and fine-tuned models are prompted and decoded in exactly the same way. `prepare_inference_prompt` builds a chat prompt with detailed TOC parsing instructions, and `extract_chapters` runs generation with low-temperature sampling (`temperature=0.1`) and returns only the newly generated text.

In [None]:
def prepare_inference_prompt(toc_text, tokenizer):
    """Build the chat-formatted prompt (system rules + TOC text) as a string."""
    system_prompt = """You are a table of contents parser. Read the input carefully and extract ONLY the numbered chapters shown.

CRITICAL INSTRUCTIONS:
1. Read the provided table of contents line by line
2. Find lines that match: NUMBER SPACE TITLE PAGE_NUMBER
3. Extract the exact chapter titles from the input text
4. Use the exact page numbers from the input text
5. Calculate end_page = next chapter start_page - 1
6. Ignore lines with "Exercises", "###", "•", or standalone numbers

FORMAT: Return JSON array with: chapter_number, chapter_title, start_page, end_page
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"""Parse this specific table of contents and extract the numbered chapters:

{toc_text.strip()}

Extract the chapters from the text above (not from any other source):"""
        }
    ]

    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )


# Run one TOC sample through a model and return the raw generated text
def extract_chapters(model, tokenizer, toc_text):
    # Enable inference mode
    FastLanguageModel.for_inference(model)

    # Prepare the prompt
    prompt = prepare_inference_prompt(toc_text, tokenizer)

    # Tokenize and move to same device as model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Extract only the generated part
    new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    json_output = tokenizer.decode(new_tokens, skip_special_tokens=True)

    return json_output.strip()
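
`extract_chapters` returns raw text, and a model may wrap its answer in a Markdown code fence or stop before the JSON is complete. The following is a minimal sketch of how that text could be turned into Python objects before scoring. The helper name `parse_model_json` and the decision to return `None` on failure are assumptions for illustration; the notebook itself only prints the raw output.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_model_json(raw_output):
    """Return the decoded JSON value, or None if the text is not valid JSON."""
    text = raw_output.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        # Keep only what sits between the opening and closing fence
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Truncated or malformed output is reported as None
        return None


wrapped = FENCE + 'json\n[{"chapter_number": 1, "start_page": 1}]\n' + FENCE
print(parse_model_json(wrapped))
print(parse_model_json(FENCE + 'json\n[{"chapter_number": 1,'))
```

Treating unparseable output as `None` rather than raising keeps a benchmarking loop running and lets malformed responses be counted as failures.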

## Benchmarking Loop and Results

Runs both models on the same noisy TOC samples (test indices 11 to 16) and prints their outputs side by side for direct comparison. The outputs show the fine-tuned model extracting proper page ranges and maintaining JSON formatting more reliably than the base model.
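
The comparison in this notebook is done by reading the printed outputs. If a numeric score is wanted, one possible approach is to check each parsed output for the four required fields and count how many reference chapters are reproduced exactly. The sketch below assumes that both the prediction and the reference are lists of dictionaries; the function names and the shape of the reference data are hypothetical, since the ground-truth format of the test file is not shown here.

```python
REQUIRED_FIELDS = ("chapter_number", "chapter_title", "start_page", "end_page")


def has_required_fields(chapters):
    """True if the value is a list of dicts that each carry every required field."""
    return isinstance(chapters, list) and all(
        isinstance(chapter, dict) and all(field in chapter for field in REQUIRED_FIELDS)
        for chapter in chapters
    )


def chapter_accuracy(predicted, expected):
    """Fraction of expected chapters matched exactly on all required fields."""
    if not expected or not has_required_fields(predicted):
        return 0.0
    hits = sum(
        1
        for reference in expected
        if any(
            all(candidate[field] == reference[field] for field in REQUIRED_FIELDS)
            for candidate in predicted
        )
    )
    return hits / len(expected)


# Hypothetical reference and prediction for two chapters
expected = [
    {"chapter_number": 1, "chapter_title": "A", "start_page": 1, "end_page": 24},
    {"chapter_number": 2, "chapter_title": "B", "start_page": 25, "end_page": 67},
]
predicted = [
    {"chapter_number": 1, "chapter_title": "A", "start_page": 1, "end_page": 24},
    {"chapter_number": 2, "chapter_title": "B", "start_page": 25, "end_page": 25},
]
print(chapter_accuracy(predicted, expected))  # 0.5
```

Exact matching is strict: a single wrong `end_page` counts the whole chapter as a miss, which suits a task where page ranges are the main thing being measured.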

In [6]:
for idx in range(11, 17):
    # Each entry's first element is the noisy TOC text
    test_sample = training_data[idx][0]
    print("\n\n\nnew sample:\n")
    print(test_sample)

    # Run the same sample through both models
    output_original = extract_chapters(model, tokenizer, test_sample)
    output_finetuned = extract_chapters(model_finetuned, tokenizer_finetuned, test_sample)

    print("original\n")
    print(output_original)
    print("\nfinetuned")
    print(output_finetuned)




new sample:

1 Dragon Subspecies Classification Systems  1
Exercises  
2 Mating Rituals and Reproductive Cycles   25
Exercises    
3 Wing Aerodynamics and Flight Patterns 68
Exercises 
4 Magical Influence on Surrounding Ecosystems 117
Exercises 
5 Symbiotic Relationships with Other Creatures  134
Exercises   
6 Aging Processes and Lifespan Variations      154
Exercises 
7 Territorial Behavior and Nesting Habits   191
 Break-even analysis and sensitivity testing procedures
Exercises  
8 Hoard Psychology and Treasure Selection 217
Exercises  
9 Magical Energy Metabolism and Storage  241
Exercises 
10 Interaction Protocols with Human Settlements 258
Exercises 
1
11 Dietary Requirements and Hunting Strategies 284
Exercises  
12 Fire Production Mechanisms and Organ Structure 320
Exercises 
13 Environmental Adaptations Across Climates 355
Exercises    

original

```json
[
  {
    "chapter_number": 1,
    "chapter_title": "Dragon Subspecies Classification Systems",
    "start_page": 1,
  