# Fine-Tuning Llama-3-8B for Socratic Technical Interviews

## Bachelor's Thesis: Custom SLM Training

**Goal**: Train a Small Language Model (8B parameters) to act as a Socratic technical interviewer.

**Why This Matters for ML Engineering**:
- Demonstrates understanding of model fine-tuning
- Shows ability to work with resource constraints (free GPU)
- Implements modern techniques (LoRA, 4-bit quantization)
- Creates a specialized model for a specific task

**Hardware**: Google Colab Free Tier (T4 GPU, 15GB VRAM)

**Techniques Used**:
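1. **4-bit Quantization**: Reduces the weights from ~16GB (FP16) to ~4GB (4x compression, or 8x relative to ~32GB in FP32)
2. **LoRA (Low-Rank Adaptation)**: Trains only ~0.5% of the parameters (about 42M at r=16), which sharply cuts gradient and optimizer memory
3. **Unsloth**: Advertised by its authors as roughly 2x faster, with up to 70% less memory than vanilla HuggingFace training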

---

## Part 1: Environment Setup

### Install Unsloth

Unsloth is an optimized training library that makes fine-tuning on free GPUs possible.

**Why Unsloth?**
- Manual PyTorch: Would need 40GB+ VRAM for an 8B model (rough estimate)
- Standard HuggingFace: Would need 20GB+ VRAM (rough estimate)
- Unsloth: Fits in 15GB VRAM (Colab free tier!)

**How it works**:
- Custom CUDA kernels for faster attention
- Smart gradient checkpointing
- Optimized memory layout

In [None]:
%%capture
!pip install unsloth
# Then upgrade to the latest GitHub version with the Colab extras
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

## Part 2: Load Pre-trained Model with 4-bit Quantization

### Understanding Quantization

**Normal Model (FP32/FP16)**:
- Each parameter = 32 bits (or 16 bits)
- 8 billion parameters × 16 bits = 16GB memory

**4-bit Quantized Model (NF4)**:
- Each parameter = 4 bits
- 8 billion parameters × 4 bits = 4GB memory
- **Result**: 4x memory reduction!

**Quality Trade-off**:
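- Usually a small accuracy loss (the <2% figure is a rule of thumb, not a result measured for this model)
- Well suited to fine-tuning, since the LoRA adapters train on top of the frozen 4-bit weights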
- Uses NF4 (NormalFloat4) - specially designed for neural networks
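
The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes exactly 8e9 parameters and 1 GB = 1e9 bytes. The helper name `weight_memory_gb` is made up for illustration. Real checkpoints come out slightly larger because of quantization constants and layers kept in 16-bit.

```python
# Back-of-the-envelope weight memory for an 8B-parameter model.
# Assumption: exactly 8e9 parameters and 1 GB = 1e9 bytes (weights only,
# no activations, gradients, or optimizer state).

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GB."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # illustrative round number

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("NF4", 4)]:
    print(f"{name:>9}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```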

In [None]:
from unsloth import FastLanguageModel
import torch

# Model configuration
max_seq_length = 2048  # Supports up to 2048 token context
dtype = None  # Auto-detect: Float16 for Tesla T4, Bfloat16 for Ampere+
load_in_4bit = True  # CRITICAL: Enables 4-bit quantization

# Load model and tokenizer
# This downloads a pre-quantized version (saves time!)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token = "hf_...", # Use if you want to use gated models (optional)
)

print("✅ Model loaded successfully!")
print(f"📊 Model size: ~4GB (quantized from 16GB)")
print(f"🎯 Context length: {max_seq_length} tokens")

## Part 3: Configure LoRA (Low-Rank Adaptation)

### Why LoRA?

**Problem**: Fine-tuning all 8 billion parameters requires:
- 100GB+ VRAM
- Days of training
- Massive datasets

**Solution**: LoRA (Hu et al., 2021)

**Key Insight**: 
Most model adaptations happen in a low-dimensional subspace. Instead of updating the full weight matrix W, LoRA freezes it and injects a trainable low-rank decomposition:

```
W_new = W_frozen + B × A
```

Where:
- W_frozen: Original 8B parameters (unchanged)
- B × A: Low-rank matrices (only ~0.5% of the parameters at r=16)
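
The update rule can be illustrated on a toy matrix in pure Python. This is a minimal sketch, not Unsloth's or PEFT's implementation. The sizes, values, and helper names (`matmul`, `lora_weight`) are assumptions made for the example. It includes the `lora_alpha / r` scaling that standard LoRA applies to the product B × A, and it shows that a zero-initialised B leaves the frozen weights unchanged at the start of training.

```python
# Toy LoRA update on an 8x8 weight matrix with rank r = 2 (pure Python).
# All sizes and values are illustrative, not taken from the real model.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, B, A, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

d, r, alpha = 8, 2, 2
W = [[float(i == j) for j in range(d)] for i in range(d)]  # frozen weights (identity)
A = [[0.1] * d for _ in range(r)]                          # r x d, trainable
B = [[0.0] * r for _ in range(d)]                          # d x r, zero-initialised

# With B = 0 the adapted weights equal the frozen ones before any training.
assert lora_weight(W, B, A, alpha, r) == W

B[0][0] = 1.0  # pretend a gradient step changed one entry of B
W_new = lora_weight(W, B, A, alpha, r)
print("row 0 after update:", W_new[0])
print("full matrix params:", d * d, "| LoRA params:", r * d + d * r)
```

Even at this toy size the adapter has half as many parameters as the full matrix; the gap widens quickly as the matrix dimension grows relative to the rank.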

**Parameters**:
- `r` (rank): Dimension of low-rank matrices. Higher = more capacity, more memory
  - r=8: Ultra-lightweight (recommended for <1000 examples)
  - r=16: Balanced (good for 1000-10000 examples)
  - r=32: High-capacity (for 10000+ examples)

- `lora_alpha`: Scaling factor. The update B × A is multiplied by `lora_alpha / r`, so higher = stronger adaptation
  - Typically set to r or 2×r
  - We use 16 to match our rank

- `target_modules`: Which layers to apply LoRA
  - q_proj, k_proj, v_proj: Attention mechanisms (most important)
  - o_proj: Output projection
  - gate_proj, up_proj, down_proj: MLP layers

**Memory Savings**:
- Full fine-tuning: 8B parameters to train
- LoRA (r=16, all seven target modules): ~42M parameters to train (roughly 190x fewer)
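
The trainable-parameter count can be derived by hand. The sketch below assumes the layer dimensions of the published Llama-3-8B configuration (hidden size 4096, MLP size 14336, 32 layers, and 1024-dimensional key/value projections from grouped-query attention). The names `MODULE_SHAPES` and `lora_param_count` are made up for this example. The cell in this notebook that prints the trainable parameters is the authoritative number.

```python
# Count LoRA parameters for the seven target modules.
# Assumption: dimensions follow the published Llama-3-8B config
# (hidden 4096, MLP 14336, 32 layers, 1024-dim K/V projections).

HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (input_features, output_features) of each adapted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """A is (r x in) and B is (out x r), so each module adds r * (in + out)."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * LAYERS

for rank in (8, 16, 32):
    n = lora_param_count(rank)
    print(f"r={rank:<2} -> {n / 1e6:.1f}M trainable ({n / 8.03e9:.2%} of ~8.03B)")
```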

In [None]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank: balanced choice for the merged dataset (custom + ~5k Socratic examples)
    target_modules=[
        "q_proj",   # Query projection (attention)
        "k_proj",   # Key projection (attention)
        "v_proj",   # Value projection (attention)
        "o_proj",   # Output projection (attention)
        "gate_proj", # MLP gate
        "up_proj",   # MLP up
        "down_proj", # MLP down
    ],
    lora_alpha=16,  # Scaling factor (typically = r)
    lora_dropout=0,  # No dropout (0 is the setting Unsloth optimizes for)
    bias="none",     # Don't train bias terms (saves memory)
    use_gradient_checkpointing="unsloth",  # Smart checkpointing (saves VRAM)
    random_state=3407,  # Reproducibility
    use_rslora=False,   # Rank-stabilized LoRA (not needed for our case)
    loftq_config=None,  # LoftQ quantization (not needed)
)

print("✅ LoRA adapters configured!")
print(f"📊 Trainable parameters: ~{sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.1f}M")
print(f"🔒 Frozen parameters: ~8000M")
print(f"⚡ Memory efficient: Only training {(sum(p.numel() for p in model.parameters() if p.requires_grad) / 8e9 * 100):.3f}% of model")

## Part 4: Load Training Dataset

### Dataset Format: ShareGPT

Our JSONL file contains conversations in this format:

```json
{
  "conversations": [
    {"from": "human", "value": "React uses real DOM directly."},
    {"from": "gpt", "value": "Not quite. React maintains an in-memory representation to optimize updates. Do you remember what that concept is called?"}
  ]
}
```

**Why ShareGPT format?**
- Standard format for chat models
- Supports multi-turn conversations
- Compatible with Unsloth/TRL trainers
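
Before uploading, it can help to check that every line of the JSONL file follows this layout. The validator below is a minimal sketch using only the standard library. The function name `validate_sharegpt_line` and the accepted speaker set are assumptions for illustration, not part of Unsloth or TRL.

```python
import json

# Speakers this sketch accepts (assumption based on the format shown above)
VALID_SPEAKERS = {"system", "human", "gpt"}

def validate_sharegpt_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("from") not in VALID_SPEAKERS:
            problems.append(f"turn {i}: unexpected 'from' value {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str) or not turn["value"].strip():
            problems.append(f"turn {i}: 'value' must be a non-empty string")
    return problems

good = ('{"conversations": [{"from": "human", "value": "React uses real DOM directly."}, '
        '{"from": "gpt", "value": "Not quite. Do you remember what that concept is called?"}]}')
bad = '{"conversations": [{"from": "user", "value": ""}]}'

print(validate_sharegpt_line(good))
print(validate_sharegpt_line(bad))
```

To check a whole file, apply the function to each non-empty line and report the line numbers that return a non-empty list.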

**Upload Instructions**:
1. In Colab, click the folder icon (left sidebar)
2. Upload `interviewer_training_data.jsonl`
3. Or mount Google Drive and point `data_files` at the file's path there

In [None]:
from datasets import load_dataset, concatenate_datasets

# 1) Load your existing synthetic interviewer data
custom = load_dataset(
    "json",
    data_files="interviewer_training_data.jsonl",
    split="train",
)

# 2) Load the Socratic dataset from Hugging Face
#    (full split is ~50k rows; we sample for Colab to avoid OOM)
socratic_full = load_dataset("facat/Socratic", split="train")

# Optional: subsample Socratic to keep training light (at most 5k examples)
n_socratic = min(5000, len(socratic_full))
socratic = socratic_full.shuffle(seed=3407).select(range(n_socratic))

# 3) Keep only the `conversations` field in both, to match later formatting

def keep_conversations(ds):
    # Fail early if a dataset does not use the expected ShareGPT column
    if "conversations" not in ds.column_names:
        raise ValueError(f"No 'conversations' column in {ds.column_names}")
    cols_to_drop = [c for c in ds.column_names if c != "conversations"]
    return ds.remove_columns(cols_to_drop) if cols_to_drop else ds

custom = keep_conversations(custom)
socratic = keep_conversations(socratic)

# 4) Merge them
dataset = concatenate_datasets([custom, socratic])

print(f"✅ Merged dataset loaded: {len(dataset)} examples")
print("   - Custom examples:", len(custom))
print("   - Socratic examples:", len(socratic))
print("📝 Sample conversation:")
print(dataset[0])

## Part 5: Format Dataset for Training

### Chat Template

We need to format conversations into the model's expected format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a Socratic technical interviewer...<|eot_id|>
<|start_header_id|>user<|end_header_id|>

React uses real DOM directly.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

Not quite. React maintains...<|eot_id|>
```

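Unsloth's `get_chat_template` attaches this template to the tokenizer, and its `mapping` argument translates the ShareGPT keys (`from`/`value`, `human`/`gpt`) into the roles the template expects.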
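
To make the layout concrete, here is a hand-rolled sketch that renders ShareGPT-style turns into the Llama-3 chat format. It only illustrates the string layout and does not replace the tokenizer's own template, which is what training actually uses. The names `ROLE_NAMES` and `render_llama3` are hypothetical. Note that turns are concatenated directly; the line breaks between turns in the listing above are there for readability.

```python
# Hand-rolled illustration of the Llama-3 chat layout.
# Assumption: mirrors the layout only; training uses the tokenizer's template.

ROLE_NAMES = {"system": "system", "human": "user", "gpt": "assistant"}

def render_llama3(conversation: list[dict]) -> str:
    """Render ShareGPT-style turns ({'from', 'value'}) as Llama-3 chat text."""
    parts = ["<|begin_of_text|>"]
    for turn in conversation:
        role = ROLE_NAMES[turn["from"]]
        content = turn["value"].strip()
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    return "".join(parts)

example = [
    {"from": "system", "value": "You are a Socratic technical interviewer."},
    {"from": "human", "value": "React uses real DOM directly."},
    {"from": "gpt", "value": "Not quite. What does React keep in memory?"},
]
print(render_llama3(example))
```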

In [None]:
from unsloth.chat_templates import get_chat_template

# Apply Llama-3 chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",  # Use Llama-3's official format
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

# Function to format conversations
def formatting_prompts_func(examples):
    """
    Format conversations into model's expected structure.

    Adds system prompt to guide Socratic behavior.
    """
    convos = examples["conversations"]
    texts = []

    for convo in convos:
        # Add system message at the start
        full_convo = [
            {
                "from": "system",
                "value": "You are a Senior Technical Interviewer who uses the Socratic method. " + 
                         "Ask guiding questions instead of providing direct answers. " +
                         "Help candidates discover solutions through questioning."
            }
        ] + convo

        # Apply chat template
        text = tokenizer.apply_chat_template(
            full_convo,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return {"text": texts}

# Apply formatting to entire dataset
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

print("✅ Dataset formatted for training")
print(f"📝 Sample formatted text (first 500 chars):")
print(dataset[0]["text"][:500] + "...")

## Part 6: Training Configuration

### SFTTrainer (Supervised Fine-Tuning)

**Hyperparameters Explained**:

**Learning Rate (2e-4)**:
- How fast the model adapts
- Too high: Training becomes unstable and the model can drift from its original knowledge (catastrophic forgetting)
- Too low: Training takes forever
- 2e-4 is a common starting point for LoRA fine-tuning

**Batch Size (2)**:
- How many examples processed at once
- Limited by GPU memory (T4 has 15GB)
- We use gradient accumulation to simulate larger batches

**Gradient Accumulation (4)**:
- Simulates batch size of 2×4 = 8
- Updates weights every 4 steps
- Saves memory while maintaining training quality

**Epochs (3)**:
- How many times to see each example
- Too few: Underfitting
- Too many: Overfitting (memorization)
- 3 epochs suits a small dataset (~50 examples); with the ~5k Socratic examples merged in, fewer epochs may be enough

**Warmup Steps (5)**:
- Gradually increase learning rate at start
- Prevents initial instability
- 10% of total steps is common, so raise this value when the merged dataset produces hundreds of steps

**Weight Decay (0.01)**:
- Regularization to prevent overfitting (AdamW applies it as decoupled weight decay rather than a plain L2 loss term)
- Penalizes large weights
- 0.01 is a common default
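
How these settings interact depends on the dataset size. The sketch below works through the arithmetic for two cases: ~50 custom examples alone, and the same examples merged with 5,000 sampled Socratic rows. The function `training_schedule` is made up for illustration, and the step count is an approximation for a single GPU; the trainer's own count may differ by a step or two per epoch.

```python
import math

def training_schedule(num_examples, per_device_batch=2, grad_accum=4,
                      epochs=3, warmup_steps=5):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }

# 50 = custom examples only; 5050 = custom plus 5,000 sampled Socratic rows
for n in (50, 5050):
    s = training_schedule(n)
    print(f"{n:>5} examples -> {s['total_steps']:>4} steps, "
          f"warmup = {s['warmup_fraction']:.1%} of training")
```

With the small dataset, 5 warmup steps cover a large share of training; with the merged dataset they are almost negligible, which is a reason to revisit `warmup_steps` when the data changes.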

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# NOTE: these keyword arguments follow the older TRL API; newer TRL releases
# may expect some of them (e.g. dataset_text_field) in SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # Field containing formatted conversations
    max_seq_length=max_seq_length,
    dataset_num_proc=2,  # Worker processes for dataset preprocessing
    packing=False,  # Don't pack multiple examples into one sequence (better for chat)
    args=TrainingArguments(
        # Output
        output_dir="outputs",

        # Training schedule
        per_device_train_batch_size=2,  # Batch size per GPU
        gradient_accumulation_steps=4,   # Simulate batch size of 8
        num_train_epochs=3,              # 3 passes through dataset

        # Optimization
        learning_rate=2e-4,              # Common LoRA starting point
        optim="adamw_8bit",              # 8-bit AdamW (saves memory)
        # torch.cuda.is_bf16_supported() can report True on a T4 through
        # emulation, so use Unsloth's helper to pick the precision
        fp16=not is_bfloat16_supported(),  # Use FP16 if no BF16
        bf16=is_bfloat16_supported(),      # Use BF16 if available

        # Regularization
        warmup_steps=5,                  # Learning rate warmup
        weight_decay=0.01,               # Decoupled weight decay

        # Logging
        logging_steps=1,                 # Log every step (raise for long runs)

        # Saving
        save_strategy="epoch",           # Save after each epoch
        save_total_limit=2,              # Keep only last 2 checkpoints

        # Reproducibility
        seed=3407,
    ),
)

print("✅ Trainer configured")
print(f"📊 Effective batch size: {2 * 4} (2 × 4 gradient accumulation)")
print(f"🔢 Total training steps: ~{len(dataset) * 3 // (2 * 4)}")
print("⏱️ Training time scales with the step count above")

## Part 7: Train the Model! 🚀

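The 15-30 minute estimate applies to a small dataset on Colab Free (T4 GPU). With the 5,000 sampled Socratic examples merged in, 3 epochs come to roughly 1,900 optimizer steps, so expect a considerably longer run or reduce the sample size.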

**What to Watch**:
- Loss should decrease (from ~2.0 to ~0.5)
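- If loss plateaus early: the learning rate may be too low, or the model has converged
- Training loss alone cannot reveal overfitting; hold out a validation split to check for it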
- If loss is erratic: learning rate might be too high

**Memory Usage**:
- Peak VRAM: ~12-14GB (within T4's 15GB)
- If OOM (Out of Memory): Reduce batch size to 1 and raise gradient accumulation to 8 to keep the effective batch size at 8
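
With `logging_steps=1` the raw loss curve is noisy, so a smoothed view makes the trend easier to judge. The sketch below is a plain trailing moving average. The loss values are hypothetical and only illustrate the call. After training, the real values could be collected from `trainer.state.log_history` (entries that contain a `"loss"` key), assuming the standard Hugging Face `Trainer` logging behaviour.

```python
def moving_average(values, window=5):
    """Trailing moving average; early entries average over fewer points."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical per-step losses, for illustration only (not real training output)
losses = [2.1, 1.9, 2.0, 1.6, 1.7, 1.3, 1.4, 1.1, 1.2, 0.9]
smoothed = moving_average(losses, window=3)

for step, (raw, avg) in enumerate(zip(losses, smoothed), start=1):
    print(f"step {step:>2}: loss={raw:.2f}  smoothed={avg:.2f}")
```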

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"🎮 GPU: {gpu_stats.name}")
print(f"💾 Memory: {start_gpu_memory} GB / {max_memory} GB")
print()
print("🚀 Starting training...")
print()

# Train!
trainer_stats = trainer.train()

# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print()
print("✅ Training complete!")
print(f"⏱️ Time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"📉 Final loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"💾 Peak memory: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"🎯 Memory for LoRA: {used_memory_for_lora} GB")

## Part 8: Test the Model

Let's see if our Socratic interviewer works!

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

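# Test conversation
# The tokenizer was remapped to ShareGPT keys in Part 5, so messages use
# "from"/"value" with "human"/"gpt" instead of "role"/"content".
messages = [
    {
        "from": "system",
        "value": "You are a Senior Technical Interviewer who uses the Socratic method. " +
                 "Ask guiding questions instead of providing direct answers. " +
                 "Help candidates discover solutions through questioning."
    },
    {
        "from": "human",
        "value": "React uses the real DOM directly, right?"
    }
]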

# Format with chat template
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

# Generate response
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=128,
    temperature=0.7,
    do_sample=True,
    use_cache=True
)

# Decode only the newly generated tokens (everything after the prompt)
new_tokens = outputs[0][inputs.shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print("🧑‍💼 Interviewer Response:")
print(response.strip())

## Part 9: Save the Model (HuggingFace Format)

Save LoRA adapters for future use or sharing.

In [None]:
# Save LoRA adapters
model.save_pretrained("socratic_interviewer_lora")
tokenizer.save_pretrained("socratic_interviewer_lora")

print("✅ Model saved to 'socratic_interviewer_lora/'")
print("📦 Files:")
!ls -lh socratic_interviewer_lora/

# Optional: Push to HuggingFace Hub
# model.push_to_hub("your-username/socratic-interviewer", token="hf_...")
# tokenizer.push_to_hub("your-username/socratic-interviewer", token="hf_...")

## Part 10: Export to GGUF (For Local Inference with Ollama)

### What is GGUF?

**GGUF (GPT-Generated Unified Format)** is a file format for storing LLMs, designed by the llama.cpp team.

**Why GGUF?**:
- Runs on CPU (no GPU needed!)
- Quantized (4-bit, 5-bit, 8-bit options)
- Fast inference with llama.cpp
- Compatible with Ollama (easy local deployment)

**Quantization Levels**:
- Q4_K_M: 4-bit, balanced (recommended) - ~4.9GB
- Q5_K_M: 5-bit, higher quality - ~5.7GB
- Q8_0: 8-bit, maximum quality - ~8.5GB

For the thesis, **Q4_K_M** offers a good balance of size and quality.
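
File sizes like these can be estimated from the average bits per weight (bpw) of each quantization type. The sketch below is a rough estimate under stated assumptions: about 8.03e9 parameters, 8.5 bpw for Q8_0 (each block of 32 int8 weights carries one 16-bit scale), and approximate averages for the K-quants, which vary with the mix of tensors in the model. The dictionary `APPROX_BPW` and the function `gguf_size_gb` are made up for illustration. The exported file's actual size is the number to report.

```python
# Rough GGUF size estimate from average bits per weight (bpw).
# Assumptions: ~8.03e9 parameters; Q8_0 = 8.5 bpw by construction;
# the K-quant values are approximate averages, not exact figures.

APPROX_BPW = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q8_0": 8.5}
NUM_PARAMS = 8.03e9

def gguf_size_gb(quant: str, num_params: float = NUM_PARAMS) -> float:
    """Estimate the file size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * APPROX_BPW[quant] / 8 / 1e9

for quant in APPROX_BPW:
    print(f"{quant}: ~{gguf_size_gb(quant):.1f} GB")
```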

In [None]:
# Optional: save a merged 16-bit copy (LoRA + base model).
# save_pretrained_gguf below merges on its own, so skip this if disk space is tight.
model.save_pretrained_merged(
    "socratic_interviewer_merged",
    tokenizer,
    save_method="merged_16bit",  # Save as FP16 (smaller than FP32)
)

print("✅ Model merged (LoRA + base model)")

# Export to GGUF format
model.save_pretrained_gguf(
    "socratic_interviewer_gguf",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit quantization (balanced)
)

print("✅ GGUF model exported!")
print("📦 Files:")
!ls -lh socratic_interviewer_gguf/

print()
print("📥 Download the .gguf file to your local machine")
print("   (Click folder icon → right-click .gguf file → Download)")

## Part 11: Alternative Export Options

In [None]:
# Export different quantization levels (optional)

# Q5_K_M: Higher quality, slightly larger
# model.save_pretrained_gguf(
#     "socratic_interviewer_q5",
#     tokenizer,
#     quantization_method="q5_k_m",
# )

# Q8_0: Maximum quality, largest size
# model.save_pretrained_gguf(
#     "socratic_interviewer_q8",
#     tokenizer,
#     quantization_method="q8_0",
# )

print("ℹ️ For thesis, Q4_K_M is recommended (best balance)")

---

## 🎉 Training Complete!

### What You've Accomplished:

- ✅ Fine-tuned an 8B parameter model on free hardware
- ✅ Used modern techniques: LoRA, 4-bit quantization, Unsloth
- ✅ Created a specialized Socratic interviewer
- ✅ Exported model for local inference (GGUF)

### For Your Thesis:

**Technical Contributions**:
1. Demonstrated resource-constrained ML Engineering
2. Implemented parameter-efficient fine-tuning (PEFT)
3. Created specialized dataset for Socratic teaching
4. Deployed custom model locally (cost-effective)

**Key Metrics to Report**:
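- Base model: Llama-3-8B-Instruct (8 billion parameters)
- Trainable parameters: ~42M at r=16 (~0.5% of total; report the value printed in Part 3)
- Training time: report `train_runtime` from Part 7 (~20 minutes on a T4 GPU applies to a small dataset)
- Memory usage: report the peak VRAM printed in Part 7 (expected ~12GB)
- Final model size: size of the exported `.gguf` file (roughly 4.5-5GB for Q4_K_M)
- Quantization: 4-bit (4x compression vs FP16)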

### Next Steps:

1. Download the `.gguf` file
2. Install Ollama locally
3. Import model to Ollama
4. Integrate into Next.js app
5. Compare with Groq/Gemini performance
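
For step 3, Ollama imports a local GGUF file through a `Modelfile`. The sketch below generates one from Python. The GGUF file name and the model name are hypothetical, so substitute the file actually produced in Part 10. Depending on the Ollama version, a `TEMPLATE` entry for the Llama-3 chat format may also be needed. That is an assumption to verify against the Ollama documentation.

```python
# Build the text of an Ollama Modelfile for the exported GGUF model.
# Assumptions: the .gguf path and model name below are placeholders.

SYSTEM_PROMPT = (
    "You are a Senior Technical Interviewer who uses the Socratic method. "
    "Ask guiding questions instead of providing direct answers. "
    "Help candidates discover solutions through questioning."
)

def build_modelfile(gguf_path: str, system_prompt: str = SYSTEM_PROMPT,
                    temperature: float = 0.7) -> str:
    """Return Modelfile text with the model path, sampling settings, and system prompt."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        'PARAMETER stop "<|eot_id|>"',  # stop at the Llama-3 end-of-turn token
        f'SYSTEM """{system_prompt}"""',
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file name; use the actual .gguf produced in Part 10
modelfile_text = build_modelfile("./socratic_interviewer.gguf")
print(modelfile_text)

# Save the text as a file named "Modelfile", then run in a terminal:
#   ollama create socratic-interviewer -f Modelfile
#   ollama run socratic-interviewer
```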

---

**References for Thesis**:
- LoRA: Hu et al. (2021) - "LoRA: Low-Rank Adaptation of Large Language Models"
- Quantization: Dettmers et al. (2023) - "QLoRA: Efficient Finetuning of Quantized LLMs"
- Llama-3: Meta (2024) - "The Llama 3 Herd of Models"