# AI + CS Interview Assistant: Local Fine-Tuning with LLaMA 3.2

**Goal**: Fine-tune a LLaMA 3.2 model locally to act as an expert technical interview assistant.

**Overview**:
1. **Setup**: Install dependencies and record a baseline response from the local Ollama model.
2. **Data**: Load local interview dataset.
3. **Train**: Use QLoRA (4-bit quantization) for efficient local training.
4. **Verify**: Compare responses before and after training.

**Note**: This notebook assumes an NVIDIA GPU for QLoRA. On a CPU-only machine, training will be significantly slower, and the 4-bit quantization provided by `bitsandbytes` may need specific configuration or may not be available at all.

## 1. Environment Setup
We need `transformers` and `datasets` for the model and data, `peft` for adapters, `bitsandbytes` for quantization, `trl` for the trainer, `accelerate` for device placement, and `ollama` for the baseline comparison.

In [None]:
# Install necessary libraries
!pip install -q transformers datasets peft bitsandbytes ollama trl accelerate
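Before moving on, it can help to confirm that the install succeeded. The following is a minimal sketch, not part of the original notebook: `missing_packages` is a hypothetical helper, and the list assumes each package's import name matches its pip name (with `torch` expected as a dependency).

```python
import importlib.util

# Packages imported later in this notebook (torch is assumed to arrive as a dependency)
REQUIRED = ["torch", "transformers", "datasets", "peft",
            "bitsandbytes", "ollama", "trl", "accelerate"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    # find_spec locates a top-level package without importing it
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If anything is listed as missing, rerun the install cell or restart the kernel before continuing.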

In [None]:
import os
import torch
import ollama
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Suppress excessive warnings
import warnings
warnings.filterwarnings('ignore')

## 2. Load Local Dataset
We load the `ai_cs_interview_120.json` file. Ensure this file is in the same directory.

In [None]:
dataset_file = "ai_cs_interview_120.json"

try:
    # Load dataset from local JSON
    dataset = load_dataset("json", data_files=dataset_file, split="train")
    print(f"Dataset Loaded. Size: {len(dataset)} samples")
    print("Sample Entry:", dataset[0])
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please ensure 'ai_cs_interview_120.json' exists in the notebook directory.")
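The formatting step later in this notebook reads the `instruction`, `input`, and `output` fields of each record, so it may be worth checking them up front. This is a rough sketch under that assumption: `find_invalid_records` is a hypothetical helper, and the sample records are made up for illustration rather than taken from the real file.

```python
REQUIRED_FIELDS = ("instruction", "output")  # "input" is treated as optional

def find_invalid_records(records):
    """Return indices of records missing a required, non-empty string field."""
    bad = []
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = rec.get(field) if isinstance(rec, dict) else None
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one problem is enough to flag this record
    return bad

# Hypothetical records for illustration only
sample_records = [
    {"instruction": "What is a mutex?", "input": "",
     "output": "A lock that lets one thread at a time enter a critical section."},
    {"instruction": "Define deadlock.", "output": ""},  # empty output
]
print(find_invalid_records(sample_records))  # → [1]
```

With the loaded dataset, something like `find_invalid_records(list(dataset))` should work, since each row of a `datasets.Dataset` is a plain dict.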

## 3. Baseline Inference (Before Training)
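We use the locally running Ollama instance to check how the base LLaMA 3.2 model answers interview questions *without* fine-tuning. Check which size your `llama3.2` tag resolves to (for example with `ollama show llama3.2`): if it is not the same size as the model fine-tuned below, the before/after comparison will also reflect the size difference. A size-specific tag such as `llama3.2:1b` avoids this.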

In [None]:
def query_ollama(prompt, model="llama3.2"):
    try:
        response = ollama.chat(model=model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']
    except Exception as e:
        return f"Ollama Error: {str(e)}"

# Test Prompts
test_instruction = "Explain the difference between a Process and a Thread in an OS context."

print("--- Baseline (Ollama) ---")
baseline_response = query_ollama(test_instruction)
print(f"Instruction: {test_instruction}\n")
print(f"Response:\n{baseline_response}")

## 4. Fine-Tuning Setup (QLoRA)

We will fine-tune using Hugging Face Transformers.
**Important**: Ollama stores models in GGUF format, which standard training tools cannot train directly. We will download the base weights for `Llama-3.2-1B-Instruct` (or 3B) from Hugging Face to perform the training, then save the adapter.

*Note: You need a Hugging Face token for gated repositories such as `meta-llama/Llama-3.2-1B-Instruct`; ungated community mirrors of the 1B/3B weights may also be available.*

In [None]:
# Model ID - We use 1B or 3B for local efficiency. 
# Ensure you have accepted the license on HF Hub if using meta-llama/Llama-3.2-1B-Instruct
MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"
# Or use "unsloth/Llama-3.2-1B-Instruct" for faster download if available

NEW_MODEL_NAME = "llama-3.2-interview-assistant"

# QLoRA Configuration (4-bit quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load Base Model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
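To get a feel for why 4-bit loading matters on a local GPU, here is a back-of-the-envelope sketch of weight storage at different precisions. The parameter count is an assumed round figure for a "1B" model, and the estimate covers weights only: quantization constants, activations, gradients, and optimizer state are not included.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 1_000_000_000  # assumed round figure, not the model's exact parameter count

for label, bits in [("fp16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the fp16 footprint, which is the headroom QLoRA relies on.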

### Formatting the Dataset
We convert the (Instruction, Input, Output) format into a single prompt string for training.

In [None]:
def format_prompt(sample):
    # Standard Alpaca/Instruction format
    if sample.get("input"):
        text = f"### Instruction:\n{sample['instruction']}\n\n### Input:\n{sample['input']}\n\n### Response:\n{sample['output']}"
    else:
        text = f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}"
    return {"text": text}

dataset = dataset.map(format_prompt)
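Because the same template has to be reproduced at inference time, one option is to build both the training text and the inference prompt from a single function. The sketch below is an illustrative alternative, not the notebook's own code; `build_prompt` is a hypothetical name, and it produces the same strings as `format_prompt` above.

```python
def build_prompt(instruction, input_text="", output=None):
    """Build the Alpaca-style text; omit `output` to get an inference prompt."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    # With no output, the text ends right after the response header
    parts.append(f"### Response:\n{output if output is not None else ''}")
    return "\n\n".join(parts)

# Training text (hypothetical sample) and the matching inference prompt
train_text = build_prompt("Define a semaphore.", output="A counter used for synchronization.")
infer_prompt = build_prompt("Define a semaphore.")
print(train_text.startswith(infer_prompt))  # → True
```

Keeping one source for the template makes it harder for the training and inference formats to drift apart.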

### LoRA Configuration
We configure the Low-Rank Adapter (LoRA) to train only a small percentage of parameters.

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
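To make "a small percentage of parameters" concrete: LoRA replaces the update to a `d_out x d_in` weight matrix with two low-rank factors, adding `r * (d_in + d_out)` trainable parameters, and scales the update by `lora_alpha / r`. The sketch below works this out for a single layer; the 2048 x 2048 shape is an assumption chosen for illustration, and which layers actually receive adapters depends on `target_modules`, which the config above leaves at the PEFT default.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 2048        # hypothetical layer shape, for illustration only
r, lora_alpha = 64, 16     # values from the LoraConfig above

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out
print(f"LoRA params per layer: {added:,}")          # → 262,144
print(f"Full layer params:     {full:,}")           # → 4,194,304
print(f"Ratio:                 {added / full:.2%}")  # → 6.25%
print(f"Update scaling:        {lora_alpha / r}")    # → 0.25
```

With `r=64` and `lora_alpha=16`, the adapter update is scaled by 0.25; a lower rank would shrink the trainable fraction further.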

## 5. Training
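We use the `SFTTrainer` (Supervised Fine-Tuning Trainer) from `trl`. The keyword arguments below follow older `trl` releases; newer releases move `dataset_text_field`, `max_seq_length`, and `packing` into `SFTConfig` and replace `tokenizer` with `processing_class`, so pin a `trl` version or adapt the call accordingly.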

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,          # Quick epoch for demonstration
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=torch.cuda.is_available(),
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,           # ignored by the "constant" scheduler; "constant_with_warmup" would apply it
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="none"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)

# Start training
trainer.train()
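It can be useful to estimate how many optimizer steps this configuration produces, since that determines when `save_steps` and `logging_steps` fire. The sketch below is an approximation under two assumptions: a single GPU, and a dataset of 120 samples (inferred from the filename, not verified). `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_samples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_samples / effective_batch) * epochs

NUM_SAMPLES = 120  # assumption based on the dataset filename
total_steps = optimizer_steps(NUM_SAMPLES, per_device_batch=4, grad_accum=1)

save_steps, logging_steps = 25, 10  # values from TrainingArguments above
checkpoints = list(range(save_steps, total_steps + 1, save_steps))
log_points = list(range(logging_steps, total_steps + 1, logging_steps))

print("Total steps:", total_steps)   # → 30
print("Checkpoints at:", checkpoints)  # → [25]
print("Logs at:", log_points)        # → [10, 20, 30]
```

Under these assumptions a single epoch yields only a handful of loss readings, so a smaller `logging_steps` may give a clearer picture of training progress.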

In [None]:
# Save the fine-tuned adapter
trainer.model.save_pretrained(NEW_MODEL_NAME)
tokenizer.save_pretrained(NEW_MODEL_NAME)
print(f"Model adapter saved locally at: {NEW_MODEL_NAME}")

## 6. Inference Comparison (After Training)
We now reload the model with the trained LoRA adapter to see the difference.

*Note: To run this in Ollama (outside Python), you would typically fuse this adapter with the base model and convert it to GGUF format using `llama.cpp`.*

In [None]:
from peft import PeftModel

# Reload the base model in fp16 (unquantized) so the adapter can be merged into full-precision weights
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the adapter we just trained
ft_model = PeftModel.from_pretrained(base_model, NEW_MODEL_NAME)
ft_model = ft_model.merge_and_unload() # Merge for faster inference

ft_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
ft_tokenizer.pad_token = ft_tokenizer.eos_token

In [None]:
def query_finetuned(instruction, input_text=""):
    # Format prompt exactly as in training
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
        
    inputs = ft_tokenizer(prompt, return_tensors="pt").to(ft_model.device)
    outputs = ft_model.generate(**inputs, max_new_tokens=200, use_cache=True)

    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[-1]
    response = ft_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

print("--- Fine-Tuned Model Inference ---")
ft_response = query_finetuned(test_instruction)
print(f"Instruction: {test_instruction}\n")
print(f"Response:\n{ft_response}")

## 7. Results Comparison
Compare the base Ollama model's response with the fine-tuned local model's response for the same prompt.

In [None]:
print("=== COMPARISON ===\n")
print(f"PROMPT: {test_instruction}\n")

print("[BEFORE - Ollama Base Model]")
print(baseline_response)
print("\n" + "-"*30 + "\n")

print("[AFTER - Fine-Tuned Adapter]")
print(ft_response)