# AToM-FM: Adaptive Transformer of Multimodal Foundation Model
## Qwen2.5 Fine-Tuning with QLoRA on RTX 4060 Ti

This notebook provides a **complete interactive environment** for:
1. Environment setup and GPU verification
2. Model loading with 4-bit quantization (QLoRA)
3. Dataset preparation and formatting
4. Training with SFTTrainer
5. Evaluation and inference
6. Model export and merging

**Hardware Target:** NVIDIA RTX 4060 Ti (8GB/16GB VRAM)  
**Model:** Qwen/Qwen2.5-3B (Optimized for Accuracy) with QLoRA adapters  
**Dataset:** tatsu-lab/alpaca (52K instruction samples)  
**Optimization:** Sequence packing enabled (~2x faster)

---
## 1. Environment Setup

In [13]:
# Install dependencies (run once, then restart the kernel).
# %pip installs into the environment of the running kernel; !pip may target
# a different Python installation, which leads to ModuleNotFoundError on import.

%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install transformers datasets accelerate peft trl bitsandbytes
%pip install wandb tensorboard evaluate sentencepiece
%pip install pyyaml omegaconf matplotlib seaborn

In [14]:
# --- AToM-FM Environment Setup ---
# Run this cell to ensure all dependencies are installed for the notebook kernel
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
%pip install transformers datasets accelerate peft trl bitsandbytes wandb pyyaml

# Log in to Weights & Biases.
# wandb.login() reads WANDB_API_KEY from the environment if it is set;
# otherwise it prompts for the key. Keep the key out of the notebook source.
import wandb
wandb.login()

In [None]:
import os
import sys
import logging
from pathlib import Path

# Add project root to path
PROJECT_ROOT = Path(os.getcwd()).parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

os.chdir(PROJECT_ROOT)
print(f"Working directory: {os.getcwd()}")

In [None]:
import torch
import transformers
import peft
import datasets
import trl
import bitsandbytes

print(f"Python:        {sys.version}")
print(f"PyTorch:       {torch.__version__}")
print(f"Transformers:  {transformers.__version__}")
print(f"PEFT:          {peft.__version__}")
print(f"Datasets:      {datasets.__version__}")
print(f"TRL:           {trl.__version__}")
print(f"BitsAndBytes:  {bitsandbytes.__version__}")
print(f"CUDA:          {torch.version.cuda}")
print(f"GPU:           {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

---
## 2. GPU Verification & VRAM Check

In [None]:
from src.utils import print_gpu_info, print_vram_usage, set_seed, setup_logging

setup_logging()
set_seed(42)

print("=" * 50)
print("  GPU Information")
print("=" * 50)
gpu_info = print_gpu_info()
print()
print_vram_usage()

# Determine recommended settings based on VRAM
if gpu_info.get("vram_total_gb", 0) >= 16:
    print("\n>> 16GB VRAM: Can use Qwen2.5-3B or even 7B with QLoRA")
elif gpu_info.get("vram_total_gb", 0) >= 8:
    print("\n>> 8GB VRAM: Recommended Qwen2.5-1.5B with QLoRA (default config)")
else:
    print("\n>> <8GB VRAM: Use Qwen2.5-0.5B with QLoRA")

---
## 3. Configuration

You can load settings from YAML files, override them inline, or both. The cell below loads the YAML config first, then lists optional inline overrides.
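
The load-then-override pattern can be pictured with a small recursive merge. This is only a sketch: `apply_overrides` is a hypothetical helper (not part of `src.utils`), and the values in `base_config` are placeholders rather than the contents of the real YAML files.

```python
from copy import deepcopy


def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge one level deeper.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Leaf value (or type mismatch): the override wins.
            merged[key] = deepcopy(value)
    return merged


# Placeholder values standing in for what the YAML files might contain.
base_config = {
    "model": {"name": "Qwen/Qwen2.5-3B", "lora": {"r": 64}},
    "training": {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 8},
}
overrides = {
    "model": {"lora": {"r": 32}},
    "training": {"per_device_train_batch_size": 1},
}

config_sketch = apply_overrides(base_config, overrides)
effective_batch = (
    config_sketch["training"]["per_device_train_batch_size"]
    * config_sketch["training"]["gradient_accumulation_steps"]
)
print("LoRA rank:", config_sketch["model"]["lora"]["r"])
print("Effective batch:", effective_batch)
```

Merging into a copy leaves the loaded defaults untouched, so an override can be dropped again without reloading the YAML.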

In [None]:
from src.utils import load_config

# Option A: Load from YAML config files
config = load_config("config")

# Option B: Override specific settings inline
# Uncomment any of these to override:

# config["model"]["name"] = "Qwen/Qwen2.5-0.5B"     # smaller model for testing
# config["model"]["name"] = "Qwen/Qwen2.5-3B"        # larger model (needs 8GB+)
# config["training"]["num_train_epochs"] = 1           # quick test
# config["training"]["max_steps"] = 100                # very quick test
# config["training"]["per_device_train_batch_size"] = 1 # reduce if OOM
# config["model"]["lora"]["r"] = 32                    # reduce LoRA rank if OOM
# config["model"]["tokenizer"]["max_length"] = 1024    # reduce seq length if OOM

print("Model:", config["model"]["name"])
print("LoRA rank:", config["model"]["lora"]["r"])
print("Epochs:", config["training"]["num_train_epochs"])
print("Batch size:", config["training"]["per_device_train_batch_size"])
print("Grad accum:", config["training"]["gradient_accumulation_steps"])
print("Effective batch:", config["training"]["per_device_train_batch_size"] * config["training"]["gradient_accumulation_steps"])
print("Learning rate:", config["training"]["learning_rate"])
print("Max seq length:", config["sft"]["max_seq_length"])

---
## 4. Load Model with QLoRA

This loads the Qwen model in 4-bit precision and applies LoRA adapters.
Only the LoRA parameters (~1-3% of total) are trainable.
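
To see where a figure in the low single-digit percent range can come from, the sketch below counts LoRA parameters for a toy model. For a weight of shape `d_out x d_in`, a rank-`r` adapter adds `A` (`r x d_in`) and `B` (`d_out x r`), so `r * (d_in + d_out)` parameters. The layer shapes and rank are made up for illustration and are not Qwen2.5's actual dimensions.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def trainable_fraction(base_params, adapter_shapes, r):
    """Share of all parameters that are trainable when only LoRA weights train."""
    lora_total = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in adapter_shapes)
    return lora_total / (base_params + lora_total)


# Hypothetical toy model: 16 adapted projections, each 1024 x 1024.
shapes = [(1024, 1024)] * 16
base = sum(d_in * d_out for d_in, d_out in shapes)

frac = trainable_fraction(base, shapes, r=16)
print(f"Trainable share: {100 * frac:.2f}%")
```

The share grows linearly with the rank and with the number of adapted projections, which is why lowering `r` is one of the suggested responses to running out of memory.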

In [None]:
from src.model import build_model_and_tokenizer, print_model_summary

print("Loading model and tokenizer...")
print(f"Model: {config['model']['name']}")
print("Quantization: 4-bit NF4 with double quantization")
print(f"LoRA rank: {config['model']['lora']['r']}")
print()

model, tokenizer = build_model_and_tokenizer(config)

print()
print_model_summary(model)
print()
print_vram_usage()

In [None]:
# Inspect LoRA adapter layers
print("LoRA adapter layers:")
print("=" * 60)
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name:60s} | shape: {str(list(param.shape)):20s} | params: {param.numel():,}")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"\nTrainable: {trainable:,} / {total:,} = {100*trainable/total:.2f}%")

---
## 5. Dataset Preparation

Default dataset: **tatsu-lab/alpaca** (52K instruction-following samples)  
Format: `{instruction, input, output}` → formatted prompt text
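
As an illustration of that mapping, here is a minimal sketch based on the prompt template published with the original Alpaca dataset. `format_alpaca_sketch` is a hypothetical stand-in: the project's own `format_instruction` in `src.dataset` may use a different template.

```python
def format_alpaca_sketch(example):
    """Turn one {instruction, input, output} record into a single prompt string."""
    instruction = example["instruction"].strip()
    context = example.get("input", "").strip()
    output = example["output"].strip()

    if context:
        # Variant used when the record carries extra input text.
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{output}"
        )
    # Variant used when the input field is empty.
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )


sample = {
    "instruction": "Summarize the following text.",
    "input": "QLoRA trains small adapters on a quantized base model.",
    "output": "QLoRA fine-tunes adapters over a quantized model.",
}
formatted = format_alpaca_sketch(sample)
no_input = format_alpaca_sketch({"instruction": "Say hi.", "input": "", "output": "Hi."})
print(formatted)
```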

In [None]:
from src.dataset import prepare_datasets, format_instruction

print(f"Loading dataset: {config['dataset']['name']}")
train_dataset, eval_dataset = prepare_datasets(config)

print(f"\nTrain samples: {len(train_dataset):,}")
print(f"Eval samples:  {len(eval_dataset):,}")
print(f"Columns: {train_dataset.column_names}")

In [None]:
# Inspect a few formatted samples
print("=" * 60)
print("Sample 1:")
print("=" * 60)
print(train_dataset[0]["text"][:800])

print("\n" + "=" * 60)
print("Sample 2:")
print("=" * 60)
print(train_dataset[1]["text"][:800])

In [None]:
# Analyze token lengths to understand memory requirements
import matplotlib.pyplot as plt
import numpy as np

# Sample 1000 examples for length analysis
sample_size = min(1000, len(train_dataset))
sample_texts = [train_dataset[i]["text"] for i in range(sample_size)]
token_lengths = [len(tokenizer.encode(t)) for t in sample_texts]

print(f"Token length statistics (sample of {sample_size}):")
print(f"  Min:    {min(token_lengths)}")
print(f"  Max:    {max(token_lengths)}")
print(f"  Mean:   {np.mean(token_lengths):.0f}")
print(f"  Median: {np.median(token_lengths):.0f}")
print(f"  95th %: {np.percentile(token_lengths, 95):.0f}")

plt.figure(figsize=(10, 4))
plt.hist(token_lengths, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=config['sft']['max_seq_length'], color='red', linestyle='--', label=f"max_seq_length={config['sft']['max_seq_length']}")
plt.xlabel('Token Length')
plt.ylabel('Count')
plt.title('Distribution of Token Lengths in Training Data')
plt.legend()
plt.tight_layout()
plt.show()

---
## 6. Training

Using `SFTTrainer` from TRL with:
- QLoRA (4-bit base + LoRA adapters)
- Gradient checkpointing (saves ~40% VRAM)
- Paged AdamW 8-bit optimizer
- NEFTune noise (improves instruction following)
- Cosine LR schedule with warmup
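
The last item, a cosine schedule with warmup, can be sketched in plain Python. The function below follows the usual definition (linear ramp to the peak rate, then half a cosine period down to zero). The step counts and peak learning rate are illustrative assumptions, not values read from this project's config.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative numbers only.
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 1000, 100, 2e-4
for s in (0, 50, 100, 550, 1000):
    lr = cosine_lr_with_warmup(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    print(f"step {s:4d}: lr = {lr:.2e}")
```

Warmup keeps early updates small while optimizer statistics are still noisy, and the cosine tail lowers the rate smoothly toward the end of training.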

In [None]:
from src.trainer import create_trainer

trainer = create_trainer(model, tokenizer, train_dataset, eval_dataset, config)

print("Trainer created!")
print(f"  Total training steps: {trainer.state.max_steps if trainer.state.max_steps > 0 else 'auto'}")
print(f"  Effective batch size: {config['training']['per_device_train_batch_size'] * config['training']['gradient_accumulation_steps']}")
print()
print_vram_usage()

In [None]:
# ==========================================
# START TRAINING
# ==========================================
# This is the main training cell. On RTX 4060 Ti with Qwen2.5-1.5B:
#   - ~3 epochs on Alpaca (52K samples) takes roughly 2-4 hours
#   - VRAM usage: ~6-7 GB with default settings
#   - If you get OOM, reduce batch_size to 1 or max_seq_length to 1024

print("Starting training...")
print("=" * 60)
print_vram_usage()
print("=" * 60)

train_result = trainer.train()

print("\n" + "=" * 60)
print("Training Complete!")
print("=" * 60)
print(f"Metrics: {train_result.metrics}")
print_vram_usage()

In [None]:
# Plot training loss
import matplotlib.pyplot as plt

log_history = trainer.state.log_history

train_losses = [(h["step"], h["loss"]) for h in log_history if "loss" in h]
eval_losses = [(h["step"], h["eval_loss"]) for h in log_history if "eval_loss" in h]

fig, ax = plt.subplots(1, 1, figsize=(10, 5))

if train_losses:
    steps, losses = zip(*train_losses)
    ax.plot(steps, losses, label="Train Loss", alpha=0.7)

if eval_losses:
    steps, losses = zip(*eval_losses)
    ax.plot(steps, losses, label="Eval Loss", marker='o', markersize=4)

ax.set_xlabel("Step")
ax.set_ylabel("Loss")
ax.set_title("AToM-FM Training Progress")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 7. Save Model

In [None]:
# Save the fine-tuned LoRA adapter
SAVE_DIR = "./models/final"
os.makedirs(SAVE_DIR, exist_ok=True)

trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

print(f"Model saved to: {SAVE_DIR}")
print("Contents:")
for f in sorted(Path(SAVE_DIR).glob("*")):
    size_mb = f.stat().st_size / 1e6
    print(f"  {f.name:40s} {size_mb:8.2f} MB")

---
## 8. Evaluation

In [None]:
# Run evaluation on the eval set
eval_metrics = trainer.evaluate()

print("Evaluation Results:")
print("=" * 40)
for key, value in eval_metrics.items():
    print(f"  {key:30s}: {value}")

---
## 9. Inference & Testing

Test the fine-tuned model with various prompts.

In [None]:
from inference import generate_response

def test_prompt(instruction, input_text="", max_new_tokens=256):
    """Helper to test a prompt and display results."""
    print(f"Instruction: {instruction}")
    if input_text:
        print(f"Input: {input_text}")
    print("-" * 40)
    response = generate_response(
        model, tokenizer, instruction, input_text,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
    )
    print(f"Response: {response}")
    print("=" * 60)
    return response

In [None]:
# Test Suite
test_cases = [
    {
        "instruction": "Explain what a neural network is in simple terms.",
        "input": "",
    },
    {
        "instruction": "Write a Python function to check if a string is a palindrome.",
        "input": "",
    },
    {
        "instruction": "Summarize the following text.",
        "input": "Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data. Instead of being explicitly programmed, these systems use algorithms to identify patterns in data and make decisions with minimal human intervention. The field has seen tremendous growth in recent years, driven by the availability of large datasets and powerful computing resources.",
    },
    {
        "instruction": "Translate the following English text to French.",
        "input": "The weather is beautiful today and I want to go for a walk in the park.",
    },
    {
        "instruction": "What are the three laws of thermodynamics? Explain each briefly.",
        "input": "",
    },
]

print("=" * 60)
print("  AToM-FM Test Suite")
print("=" * 60)

responses = []
for i, tc in enumerate(test_cases):
    print(f"\n--- Test {i+1}/{len(test_cases)} ---")
    resp = test_prompt(tc["instruction"], tc.get("input", ""))
    responses.append(resp)

---
## 10. Compare Base vs Fine-Tuned (Optional)

Temporarily disable the LoRA adapter layers to get base-model outputs, then re-enable them and compare the two responses to the same prompt.

In [None]:
# Disable LoRA adapter to get base model outputs
model.disable_adapter_layers()

prompt = "Explain what transfer learning is and why it's useful."
print("BASE MODEL (no LoRA):")
print("-" * 40)
base_response = generate_response(model, tokenizer, prompt, max_new_tokens=200, temperature=0.7)
print(base_response)

# Re-enable LoRA
model.enable_adapter_layers()

print("\n" + "=" * 60)
print("FINE-TUNED MODEL (with LoRA):")
print("-" * 40)
ft_response = generate_response(model, tokenizer, prompt, max_new_tokens=200, temperature=0.7)
print(ft_response)

---
## 11. Merge LoRA Weights (Optional)

Merge the LoRA adapter into the base model for faster inference without a PEFT dependency.
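
Conceptually, merging folds the low-rank update into the base weight: `W' = W + (alpha / r) * B @ A`, using the standard LoRA scaling. The toy example below uses plain Python lists and made-up matrices to show the arithmetic. It is a sketch of the idea, not of how PEFT implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update with the same shape as W
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy shapes: W is 2x3 (d_out x d_in), A is 1x3 (r x d_in), B is 2x1 (d_out x r).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged weight gives the same output as base path plus adapter path.
x = [[1.0], [1.0], [1.0]]
adapter_out = matmul(B, matmul(A, x))
separate = [[wx[0] + (alpha / r) * ax[0]] for wx, ax in zip(matmul(W, x), adapter_out)]
print(merged)
print(matmul(merged, x) == separate)
```

After merging, inference needs one matrix multiply per layer instead of a base multiply plus two adapter multiplies, which is where the speedup comes from.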

In [None]:
# WARNING: This requires more VRAM. Only run if you have enough memory.
# For RTX 4060 Ti 8GB, this may OOM with 1.5B+ models.

MERGE = False  # Set to True to merge

if MERGE:
    MERGED_DIR = "./models/merged"
    os.makedirs(MERGED_DIR, exist_ok=True)

    print("Merging LoRA weights into base model...")
    merged_model = model.merge_and_unload()

    print(f"Saving merged model to {MERGED_DIR}...")
    merged_model.save_pretrained(MERGED_DIR)
    tokenizer.save_pretrained(MERGED_DIR)

    print("Done! Merged model saved.")
    print("Contents:")
    for f in sorted(Path(MERGED_DIR).glob("*")):
        size_mb = f.stat().st_size / 1e6
        print(f"  {f.name:40s} {size_mb:8.2f} MB")
else:
    print("Skipping merge. Set MERGE=True above to merge LoRA into base model.")

---
## 12. TensorBoard (Optional)

View training metrics in TensorBoard.

In [None]:
# Launch TensorBoard inline
# %load_ext tensorboard
# %tensorboard --logdir ./logs

---
## 13. Custom Dataset Creation

Use this section to create your own domain-specific dataset.
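
For reference, a JSONL file in instruction/input/output format has one JSON object per line. The sketch below writes and reads such a file using only the standard library. `write_instruction_jsonl` is a hypothetical helper shown for illustration, and the project's `create_custom_dataset` may behave differently (for example, by returning a `datasets.Dataset`).

```python
import json
import tempfile
from pathlib import Path


def write_instruction_jsonl(instructions, outputs, path, inputs=None):
    """Write instruction/input/output records to `path`, one JSON object per line."""
    if len(instructions) != len(outputs):
        raise ValueError("instructions and outputs must have the same length")
    if inputs is None:
        inputs = [""] * len(instructions)
    if len(inputs) != len(instructions):
        raise ValueError("inputs must match instructions in length")

    records = [
        {"instruction": ins, "input": inp, "output": out}
        for ins, inp, out in zip(instructions, inputs, outputs)
    ]
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return records


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "processed" / "demo.jsonl"
    written = write_instruction_jsonl(
        ["What is QLoRA?"],
        ["QLoRA trains LoRA adapters on top of a 4-bit quantized base model."],
        demo_path,
    )
    loaded = [
        json.loads(line)
        for line in demo_path.read_text(encoding="utf-8").splitlines()
    ]

print(loaded[0])
```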

In [None]:
from src.dataset import create_custom_dataset

# Example: Create a small custom dataset
custom_instructions = [
    "What is AToM-FM?",
    "Explain the AToM-FM architecture.",
    "How does AToM-FM handle multimodal inputs?",
    "What datasets can AToM-FM be trained on?",
    "Compare AToM-FM with standard transformer models.",
]

custom_outputs = [
    "AToM-FM (Adaptive Transformer of Multimodal Foundation Model) is a foundation model framework designed for adaptive learning across multiple modalities including text, code, and structured data.",
    "AToM-FM uses a Qwen-based transformer backbone with QLoRA adapters for parameter-efficient fine-tuning. The architecture supports 4-bit quantization for deployment on consumer GPUs.",
    "AToM-FM processes multimodal inputs through a unified tokenization scheme that maps different data modalities into a shared embedding space before passing them through the transformer layers.",
    "AToM-FM can be trained on instruction-following datasets like Alpaca, OpenOrca, and domain-specific datasets. It also supports custom JSONL datasets with instruction/input/output format.",
    "Unlike standard transformers that require full fine-tuning, AToM-FM uses QLoRA to train only 1-3% of parameters while maintaining comparable performance. This makes it accessible on consumer hardware like the RTX 4060 Ti.",
]

custom_ds = create_custom_dataset(
    instructions=custom_instructions,
    outputs=custom_outputs,
    save_path="./data/processed/custom_atom_fm.jsonl",
)

print(f"Custom dataset created: {len(custom_ds)} samples")
print("Saved to: ./data/processed/custom_atom_fm.jsonl")
print("\nSample:")
print(custom_ds[0])

---
## 14. Cleanup & Final VRAM Check

In [None]:
print("Final VRAM usage:")
print_vram_usage()

# Uncomment to free GPU memory:
# import gc
# del model, trainer
# gc.collect()
# torch.cuda.empty_cache()
# print("\nAfter cleanup:")
# print_vram_usage()

print("\n" + "=" * 60)
print("  AToM-FM Training Pipeline Complete!")
print("=" * 60)