# Part 2: "The Intern" (Fine-Tuning)

## Project 01 - Operation Ledger-Mind
**Course Module:** Weeks 01-03 (Prompt Engineering, Fine-Tuning, Advanced RAG)
**Scenario:** Financial Analysis of Uber Technologies (2024 Annual Report)

### Technical Requirements Checklist:
- [x] **Hugging Face Ecosystem**: transformers, peft, trl, bitsandbytes
- [x] **Base Model**: Qwen/Qwen2.5-1.5B-Instruct (Optimized for T4)
- [x] **Quantization**: 4-bit NF4 with double quantization
- [x] **Adapter Config**: LoRA (Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
- [x] **Training**: SFTTrainer for 100 steps
- [x] **Inference**: `query_intern(question)`
- [x] **Evaluation**: local ROUGE-L baseline

## 0. Setup & Dependency Installation

Standardize dependencies across Google Colab and local environments. Note: NumPy must stay below 2.0 for compatibility with `trl==0.9.6`; the install cell pins `numpy>=1.26.4,<2.0`.

In [1]:
import os
import sys
import subprocess

def is_colab():
    return 'google.colab' in str(get_ipython())

if is_colab():
    print("Detected Google Colab environment.")
    PROJECT_NAME = "ZuuCrew-AEE-Project01"
    REPO_URL = "https://github.com/Sulamaxx/ZuuCrew-AEE-Project01.git"

    # Clone or update only when the cell starts outside the repo; on a re-run the
    # working directory is already PROJECT_NAME and a second clone would nest it.
    if os.path.basename(os.getcwd()) != PROJECT_NAME:
        if not os.path.exists(PROJECT_NAME):
            print(f"Cloning repository from {REPO_URL}...")
            !git clone {REPO_URL}
        else:
            print("Repository already exists. Fetching latest updates...")
            !git -C {PROJECT_NAME} pull
        os.chdir(PROJECT_NAME)

    # Make the project's src/ directory importable
    if os.path.abspath("src") not in sys.path:
        sys.path.append(os.path.abspath("src"))

    print("Installing dependencies from requirements.txt...")
    # Pin NumPy below 2.0 first to prevent binary incompatibility with transformers/trl
    !pip install "numpy>=1.26.4,<2.0" -q
    !pip install -r requirements.txt -q

    print("Installation complete. Restart the session if you see NumPy binary incompatibility errors.")
else:
    print("Running in local environment.")

Detected Google Colab environment.
Cloning repository from https://github.com/Sulamaxx/ZuuCrew-AEE-Project01.git...
Cloning into 'ZuuCrew-AEE-Project01'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 111 (delta 54), reused 88 (delta 31), pack-reused 0 (from 0)
Receiving objects: 100% (111/111), 6.52 MiB | 15.06 MiB/s, done.
Resolving deltas: 100% (54/54), done.
Installing dependencies...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-contrib-python 4.13.0.92 requires nu
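
Binary-incompatibility errors appear when a compiled extension was built against a different NumPy major version than the one loaded at runtime, so it can help to confirm the installed version before importing anything heavy. The following is a minimal sketch of such a pre-flight check, assuming the `numpy>=1.26.4,<2.0` range from the install cell. The helper names `parse_version` and `numpy_in_supported_range` are hypothetical and not part of the project.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '1.26.4' into (1, 26, 4), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_in_supported_range(version: str) -> bool:
    """True when the version satisfies numpy>=1.26.4,<2.0 (the range pinned above)."""
    return (1, 26, 4) <= parse_version(version) < (2, 0, 0)


# Read the installed version from package metadata without importing numpy itself.
try:
    installed = metadata.version("numpy")
    status = "OK" if numpy_in_supported_range(installed) else "outside the pinned range"
    print(f"numpy {installed}: {status}")
except metadata.PackageNotFoundError:
    print("numpy is not installed")
```

A session that has already imported NumPy keeps the old build in memory after `pip install`, which is why a restart is needed before the new version takes effect.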

## 1. Environment Diagnostics & Configuration

Verifying hardware compatibility and loading the centralized configuration.

In [2]:
import torch
import yaml
import random
import numpy as np
from dotenv import load_dotenv
from pprint import pprint

# 1. Load Environment & Config
load_dotenv(".env" if os.path.exists(".env") else "../.env")
hf_token = os.getenv("HF_TOKEN")

config_path = "src/config/config.yaml" if os.path.exists("src/config/config.yaml") else "../src/config/config.yaml"
with open(config_path, "r") as f:
    CONFIG_YAML = yaml.safe_load(f)

# 2. Seed for Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# 3. Hardware Diagnostics (from Workshop)
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16

print("="*60)
print("ENVIRONMENT & GPU CHECK")
print("="*60)
print(f"PyTorch: {torch.__version__}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"Compute Dtype: {compute_dtype}")
    print(f"BF16 Support: {use_bf16}")
else:
    print("WARNING: No CUDA GPU detected. Training will fail.")
print("="*60)
pprint(CONFIG_YAML)

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
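
The dtype choice in the diagnostics cell depends only on the GPU's CUDA compute capability: a major version of 8 or higher selects bfloat16, anything else falls back to float16. A small sketch of that rule as a pure function, so it can be checked without a GPU. The function name `select_compute_dtype` is hypothetical, and it returns dtype names as strings rather than `torch` dtypes.

```python
def select_compute_dtype(capability):
    """Mirror the notebook's rule: bfloat16 needs compute capability major >= 8."""
    if capability is None:  # no CUDA device detected
        return "float16"
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


# Compute capabilities: T4 is 7.5 (Turing), A100 is 8.0 (Ampere).
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("no GPU", None)]:
    print(f"{name}: {select_compute_dtype(cap)}")
```

On the T4 named in the checklist this resolves to float16, which is why the training cell sets `fp16=not use_bf16`.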

## 2. Model & Quantization Implementation

Load the base model in 4-bit NF4 with double quantization, which also quantizes the per-block scaling constants to save additional memory.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model_id = CONFIG_YAML.get("base_model", "Qwen/Qwen2.5-1.5B-Instruct")

# 4-bit Quantization Config (Standardized for T4 and RTX)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype
)

print(f"Loading model: {base_model_id}...")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    token=hf_token
)

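tokenizer = AutoTokenizer.from_pretrained(base_model_id, token=hf_token)
# Reuse EOS as padding only if the tokenizer ships without a pad token. Overwriting an
# existing pad token with EOS can cause EOS to be masked out of the training labels.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"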
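
To see what double quantization buys, it helps to count bits per parameter. The sketch below is a rough estimate only. It assumes the block sizes described in the QLoRA paper (64 weights per block, 256 first-level constants per second-level block) and a nominal 1.5 billion parameters taken from the model name. Embeddings and other layers that are not quantized are ignored, so the real footprint is higher. The function names are hypothetical.

```python
def quant_constant_bits_per_param(block_size=64, double_quant=True, second_block_size=256):
    """Storage overhead of the quantization constants, in bits per parameter."""
    if not double_quant:
        # One 32-bit scaling constant per block of weights.
        return 32 / block_size
    # 8-bit first-level constants, plus 32-bit second-level constants shared
    # across second_block_size first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def weight_memory_gb(n_params, double_quant=True):
    """Approximate memory for 4-bit weights plus their quantization constants."""
    bits = 4 + quant_constant_bits_per_param(double_quant=double_quant)
    return n_params * bits / 8 / 1024**3


N_PARAMS = 1.5e9  # assumption: nominal size implied by the model name

for dq in (False, True):
    overhead = quant_constant_bits_per_param(double_quant=dq)
    print(f"double_quant={dq}: {overhead:.3f} extra bits/param, "
          f"{weight_memory_gb(N_PARAMS, dq):.3f} GB for weights")
```

Under these assumptions the constants cost 0.5 bits per parameter without double quantization and about 0.127 bits with it.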

## 3. LoRA Configuration (Expanded Adapters)

Injecting trainable low-rank adapter matrices (LoRA). The target modules cover the MLP projections as well as the attention projections, which gives the adapter more capacity at the cost of more trainable parameters.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    r=CONFIG_YAML.get("lora_r", 16),
    lora_alpha=CONFIG_YAML.get("lora_alpha", 32),
    lora_dropout=CONFIG_YAML.get("lora_dropout", 0.05),
    # Expanded target modules from the Workshop for better coverage
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
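
The cost of adding the MLP projections can be estimated by hand: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable parameters. The sketch below applies that formula with layer dimensions assumed for Qwen2.5-1.5B (hidden size 1536, MLP size 8960, key/value width 256, 28 layers). These numbers are assumptions to verify against `model.config`, and the figure reported by `model.print_trainable_parameters()` is the authoritative one.

```python
def lora_params(d_in, d_out, r):
    """Trainable entries in A (r x d_in) plus B (d_out x r) for one adapted layer."""
    return r * (d_in + d_out)


# Assumed dimensions for Qwen2.5-1.5B; check model.config before relying on them.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 1536, 8960, 256, 28

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def total_lora_params(r, module_names):
    """Sum the adapter parameters over the chosen modules in every layer."""
    per_layer = sum(lora_params(*SHAPES[name], r) for name in module_names)
    return per_layer * LAYERS


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
print(f"attention only: {total_lora_params(16, attention_only):,}")
print(f"attention + MLP: {total_lora_params(16, list(SHAPES)):,}")
```

Under these assumed shapes, most of the adapter parameters sit in the three MLP projections, because the MLP width is much larger than the hidden size.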

## 4. Dataset Loading & ChatML Formatting

Format the generated Uber instruction data as ChatML and hold out 10% of it for validation during training.

In [None]:
from datasets import load_dataset

train_path = "artifacts/train_data/train.jsonl"
if not os.path.exists(train_path):
    train_path = "../" + train_path

dataset = load_dataset("json", data_files=train_path, split="train")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        # ChatML Structure
        text = f"<|im_start|>system\nYou are a professional financial analyst assistant. Answer questions based on Uber's 2024 Annual Report.<|im_end|>\n<|im_start|>user\n{example['question'][i]}<|im_end|>\n<|im_start|>assistant\n{example['answer'][i]}<|im_end|>"
        output_texts.append(text)
    return output_texts

print(f"Loaded {len(dataset)} training examples.")

# Split for evaluation during training
dataset = dataset.train_test_split(test_size=0.1, seed=SEED)
train_data = dataset["train"]
val_data = dataset["test"]
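
Because the same ChatML template is needed at training time (with the answer) and at inference time (without it), one way to keep the two from drifting apart is a single builder function. This is a sketch under that assumption. The name `build_chatml` is hypothetical, and the example record is a placeholder, not a row from the dataset.

```python
SYSTEM_PROMPT = (
    "You are a professional financial analyst assistant. "
    "Answer questions based on Uber's 2024 Annual Report."
)


def build_chatml(question, answer=None):
    """Return a ChatML string; without an answer it ends at the open assistant turn."""
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    if answer is None:
        return prompt  # inference: the model continues from here
    return prompt + f"{answer}<|im_end|>"  # training: full turn with end marker


# Placeholder record for illustration only.
record = {"question": "example question", "answer": "example answer"}
print(build_chatml(record["question"], record["answer"]))
```

The training string is the inference prompt plus the answer and a closing `<|im_end|>`, so the model sees the same prefix in both settings.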

## 5. Training Execution

Executing the SFT loop. `SFTConfig` keeps the SFT-specific options (`max_seq_length`, `packing`) alongside the standard training arguments.

In [None]:
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="artifacts/intern_checkpoints",
    per_device_train_batch_size=CONFIG_YAML.get("per_device_train_batch_size", 1),
    gradient_accumulation_steps=CONFIG_YAML.get("gradient_accumulation_steps", 64),
    learning_rate=CONFIG_YAML.get("learning_rate", 2e-5),
    logging_steps=10,
    max_steps=CONFIG_YAML.get("max_steps", 100),
    save_steps=50,
    optim="paged_adamw_8bit",
    fp16=not use_bf16,
    bf16=use_bf16,
    max_seq_length=1024,
    packing=False,
    eval_strategy="steps",
    eval_steps=25,
    # dataset_text_field is omitted: the dataset has no "text" column, and formatting_func builds the text instead
    report_to="none"
)

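trainer = SFTTrainer(
    model=model,  # already wrapped by get_peft_model, so peft_config is not passed again
    tokenizer=tokenizer,  # reuse the tokenizer configured above
    train_dataset=train_data,
    eval_dataset=val_data,
    formatting_func=formatting_prompts_func,
    args=training_args
)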

trainer.train()
trainer.save_model("artifacts/intern_final_adapter")
print("Training complete. Adapters saved to artifacts/intern_final_adapter")
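
With a per-device batch of 1 and 64 accumulation steps, each optimizer step sees 64 sequences, so 100 steps cover 6,400 sequences. How many epochs that represents depends on the size of the training split. The sketch below works through the arithmetic using the fallback values from the training cell (`config.yaml` may override them). The function name and the training-set size of 900 are hypothetical.

```python
def training_budget(n_train, per_device_batch=1, grad_accum=64, max_steps=100, n_devices=1):
    """Summarize how much data a fixed number of optimizer steps consumes."""
    effective_batch = per_device_batch * grad_accum * n_devices
    sequences_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "sequences_seen": sequences_seen,
        "epochs": sequences_seen / n_train,
    }


# Hypothetical training-set size; substitute len(train_data) in the notebook.
budget = training_budget(n_train=900)
print(budget)
```

If the resulting epoch count is well above 1, the validation loss logged every 25 steps is the place to watch for overfitting.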

## 6. Inference Pipeline: `query_intern` 

The inference function builds the same ChatML prompt used in training and returns only the model's answer.

In [None]:
def query_intern(question, max_new_tokens=256):
    prompt = f"<|im_start|>system\nYou are a professional financial analyst assistant. Answer questions based on Uber's 2024 Annual Report.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=max_new_tokens, 
            temperature=0.1, 
            do_sample=True, 
            pad_token_id=tokenizer.eos_token_id
        )
    
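    # Decode only the newly generated tokens. Splitting the full text on "assistant"
    # breaks whenever the answer itself contains that word.
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return response.strip()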

# Sample test
test_q = "What were the key drivers of Uber's revenue growth in 2024?"
print(f"Q: {test_q}\nA: {query_intern(test_q)}")
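
Recovering the answer from decoded chat output by splitting on the role name is fragile, because the word "assistant" can also appear in the system prompt or in the answer itself. The toy example below illustrates the difference between splitting and slicing at the prompt boundary. It slices characters as a stand-in for slicing token IDs at the prompt length, and the strings and function names are invented for illustration.

```python
def extract_by_split(decoded):
    """Take everything after the last occurrence of the role name."""
    return decoded.split("assistant")[-1].strip()


def extract_by_slice(decoded, prompt_len):
    """Drop the known prompt prefix and keep only what was generated after it."""
    return decoded[prompt_len:].strip()


# Toy strings: what a decoded prompt and answer might look like with special tokens removed.
decoded_prompt = "system\nYou are a financial analyst assistant.\nuser\nWho helps?\nassistant\n"
answer = "An assistant to the CFO helps."
full_text = decoded_prompt + answer

print(repr(extract_by_split(full_text)))
print(repr(extract_by_slice(full_text, len(decoded_prompt))))
```

The split version loses the start of the answer because the answer contains the role name. Slicing at the prompt boundary returns it intact.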

## 7. Local Evaluation (ROUGE-L)

Score the fine-tuned model on the first five questions of the Golden Test Set using the ROUGE-L F-measure.

In [None]:
from rouge_score import rouge_scorer

test_path = "artifacts/train_data/golden_test_set.jsonl"
if not os.path.exists(test_path):
    test_path = "../" + test_path

test_dataset = load_dataset("json", data_files=test_path, split="train")
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
scores = []

n_eval = min(5, len(test_dataset))  # a small sample keeps the baseline quick to run
print(f"Evaluating {n_eval} samples from the Golden Test Set...")

for i in range(n_eval):
    question = test_dataset[i]['question']
    ground_truth = test_dataset[i]['answer']
    prediction = query_intern(question)
    
    score = scorer.score(ground_truth, prediction)['rougeL'].fmeasure
    scores.append(score)
    
    print(f"--- Sample {i+1} ---")
    print(f"Q: {question}")
    print(f"ROUGE-L: {score:.4f}")

print(f"\nAverage ROUGE-L Baseline: {np.mean(scores):.4f}")
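
ROUGE-L is based on the longest common subsequence (LCS) between the reference and the prediction: precision is the LCS length divided by the prediction length, recall is the LCS length divided by the reference length, and the F-measure is their harmonic mean. The following is a simplified sketch of that computation, not the `rouge_score` implementation. It tokenizes on whitespace and applies no stemming, so its numbers can differ from the library's.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F-measure on lowercased whitespace tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Five of six tokens are shared in order: "the cat on the mat".
print(round(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"), 4))
```

Because the metric rewards word overlap in order, a correct answer phrased differently from the reference can still score low, which is worth keeping in mind when reading the baseline.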