# Part 2: "The Intern" (Fine-Tuning)

## Project 01 - Operation Ledger-Mind
**Course Module:** Weeks 01-03 (Prompt Engineering, Fine-Tuning, Advanced RAG)
**Scenario:** Financial Analysis of Uber Technologies (2024 Annual Report)

### 📋 Technical Requirements Checklist:
- [x] **Hugging Face Ecosystem**: transformers, peft, trl, bitsandbytes
- [x] **Base Model**: Qwen/Qwen2.5-1.5B-Instruct (Optimized for T4)
- [x] **Quantization**: 4-bit NF4 with double quantization
- [x] **Adapter Config**: LoRA (Targets: q_proj, k_proj, v_proj, o_proj)
- [x] **Training**: SFTTrainer for 100 steps
- [x] **Inference**: `query_intern(question)`
- [x] **Evaluation**: local ROUGE-L baseline

## 0. Colab Setup (Git Integration)

Run this cell only in Google Colab. It clones the project repository, switches into it, and installs the dependencies.

In [1]:
import os
import sys

if 'google.colab' in str(get_ipython()):
    print("🚀 Detected Google Colab environment.")
    PROJECT_NAME = "ZuuCrew-AEE-Project01"
    REPO_URL = "https://github.com/Sulamaxx/ZuuCrew-AEE-Project01.git"
    
    if not os.path.exists(PROJECT_NAME):
        print(f"📥 Cloning repository from {REPO_URL}...")
        !git clone {REPO_URL}
    
    # Move into the project directory
    if os.path.basename(os.getcwd()) != PROJECT_NAME:
        os.chdir(PROJECT_NAME)
    print(f"✅ Working directory changed to: {os.getcwd()}")
    
    # Add src to python path for local imports
    if os.path.abspath("src") not in sys.path:
        sys.path.append(os.path.abspath("src"))
    
    # Install dependencies
    print("📦 Installing dependencies...")
    # NOTE: Explicitly upgrading numpy to >= 2.0 to avoid binary incompatibility (ValueError: numpy.dtype size changed)
    # If you encounter import errors AFTER running this, please Restart Session (Kernel -> Restart Session) and run again.
    !pip install -U "numpy>=2.0" transformers==4.44.2 datasets==2.20.0 accelerate==0.34.2 peft==0.12.0 trl==0.9.6 bitsandbytes==0.43.1 python-dotenv pyyaml rouge-score -q
else:
    print("🏠 Running in local environment.")

🚀 Detected Google Colab environment.
📥 Cloning repository from https://github.com/Sulamaxx/ZuuCrew-AEE-Project01.git...
Cloning into 'ZuuCrew-AEE-Project01'...
remote: Enumerating objects: 100, done.
remote: Counting objects: 100% (100/100), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 100 (delta 46), reused 80 (delta 26), pack-reused 0 (from 0)
Receiving objects: 100% (100/100), 6.51 MiB | 14.80 MiB/s, done.
Resolving deltas: 100% (46/46), done.
✅ Working directory changed to: /content/ZuuCrew-AEE-Project01
📦 Installing dependencies...
  Preparing metadata (setup.py) ... done
ERROR: Cannot install accelerate==0.34.2, datasets==2.20.0, numpy>=2.0, peft==0.12.0, transformers==4.44.2 and trl==0.9.6 because these package ve

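The recorded output above shows pip's resolver rejecting the requested set of pins, so that install step did not complete, and the next cell then fails on `import trl`. One way to catch this before moving on is to check that every module the notebook imports can actually be found. The sketch below is not part of the original notebook: the helper name `find_missing_modules` and the module list are assumptions derived from the install command and the import cell.

```python
import importlib.util

# Top-level module names the later cells import (assumed from the install
# command and the import cell; adjust if the notebook changes).
REQUIRED_MODULES = [
    "numpy", "torch", "transformers", "datasets", "accelerate",
    "peft", "trl", "bitsandbytes", "yaml", "dotenv", "rouge_score",
]

def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {missing}. Fix the install cell before continuing.")
else:
    print("All required modules are importable.")
```

If anything is reported missing, resolving the version conflict in the install cell (and restarting the session) comes before running the rest of the notebook.
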
## 1. Environment Diagnostics & Configuration

Verifying hardware compatibility and loading the centralized configuration.

In [2]:
import torch
import os
import sys
import yaml
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from rouge_score import rouge_scorer

# Load environment variables
env_path = ".env" if os.path.exists(".env") else "../.env"
load_dotenv(env_path)
hf_token = os.getenv("HF_TOKEN")

# Load Project Config
config_path = "src/config/config.yaml" if os.path.exists("src/config/config.yaml") else "../src/config/config.yaml"
if not os.path.exists(config_path):
    raise FileNotFoundError(f"❌ Configuration file not found at {config_path}. Current Dir: {os.getcwd()}")

with open(config_path, "r") as f:
    config = yaml.safe_load(f)

print("="*60)
print("ENVIRONMENT & GPU CHECK")
print("="*60)
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Device Name: {torch.cuda.get_device_name(0)}")
    print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"BFloat16 Support: {torch.cuda.is_bf16_supported()}")
else:
    print("⚠️ WARNING: No CUDA GPU detected. Training will fail.")
print("="*60)

ModuleNotFoundError: No module named 'trl'

## 2. Model & Quantization Implementation

Implementing 4-bit NF4 quantization with double quantization per assessment specifications.

In [None]:
base_model_id = config.get("base_model", "Qwen/Qwen2.5-1.5B-Instruct")

# 4-bit Quantization Config (NF4, double quant)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    token=hf_token
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, token=hf_token)
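# Fall back to EOS as the padding token only if the tokenizer does not define one;
# overwriting an existing pad token with EOS can mask EOS out of the training labels
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"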
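
As a rough sanity check on the memory cost of this configuration, the storage needed for NF4 weights can be estimated by hand. The sketch below is a back-of-the-envelope calculation under stated assumptions: bitsandbytes' default block sizes (64 weights per quantization block, 256 first-level constants per second-level block when double quantization is on), a nominal 1.5 billion quantized parameters taken from the model name, and no accounting for layers that stay unquantized, activations, LoRA weights, or optimizer state. The function names are hypothetical.

```python
def nf4_bits_per_param(double_quant=True, block_size=64, nested_block_size=256):
    """Approximate storage bits per quantized weight for 4-bit NF4."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit quantized scale per block, plus one fp32 constant
        # shared by every `nested_block_size` blocks
        overhead = 8.0 / block_size + 32.0 / (block_size * nested_block_size)
    else:
        # one fp32 scale per block
        overhead = 32.0 / block_size
    return weight_bits + overhead

def estimate_weight_gib(n_params, double_quant=True):
    """Approximate size in GiB of n_params quantized weights."""
    return n_params * nf4_bits_per_param(double_quant) / 8 / 1024**3

n_params = 1.5e9  # nominal figure from the model name (assumption)
for dq in (False, True):
    print(
        f"double_quant={dq}: {nf4_bits_per_param(dq):.3f} bits/param, "
        f"~{estimate_weight_gib(n_params, dq):.2f} GiB"
    )
```

Double quantization trims the per-weight overhead from 0.5 bits to roughly 0.127 bits, which is a small saving on a model this size but comes at no extra configuration cost.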

## 3. LoRA Configuration (The Adapters)

Injecting trainable low-rank adapter matrices (LoRA) into the attention projection layers.

In [None]:
peft_config = LoraConfig(
    r=config.get("lora_r", 16),
    lora_alpha=config.get("lora_alpha", 32),
    lora_dropout=config.get("lora_dropout", 0.05),
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
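
The number that `print_trainable_parameters()` reports can be cross-checked by hand: for a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices of shapes `r x in_features` and `out_features x r`. The sketch below applies that formula to the four target modules. The layer dimensions are assumptions about the Qwen2.5-1.5B architecture (hidden size 1536, key/value projection width 256, 28 decoder layers) and should be checked against `model.config`; the helper name is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# ASSUMED dimensions for Qwen2.5-1.5B; verify against model.config
HIDDEN = 1536      # hidden_size
KV_DIM = 256       # num_key_value_heads * head_dim
NUM_LAYERS = 28    # num_hidden_layers
R = 16             # default lora_r used in the cell above

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

If the figure printed by PEFT differs, the assumed dimensions or the rank in `config.yaml` are the first things to compare.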

## 4. Dataset Loading & Formatting

Formatting the generated Uber instruction data into ChatML structure.

In [None]:
# Determine path for train.jsonl
train_path = "artifacts/train_data/train.jsonl"
if not os.path.exists(train_path):
    train_path = config.get('train_data_path', './artifacts/train_data') + '/train.jsonl'
    if not os.path.exists(train_path):
        train_path = "../" + train_path

dataset = load_dataset("json", data_files=train_path, split="train")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        # ChatML Structure
        text = f"<|im_start|>system\nYou are a professional financial analyst assistant. Answer questions based on Uber's 2024 Annual Report.<|im_end|>\n<|im_start|>user\n{example['question'][i]}<|im_end|>\n<|im_start|>assistant\n{example['answer'][i]}<|im_end|>"
        output_texts.append(text)
    return output_texts

print(f"✅ Loaded {len(dataset)} training examples from {train_path}.")
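
The formatting function above indexes `example['question'][i]`, so it expects a batch: a dict that maps each column name to a list of values, and it returns one ChatML string per row. The sketch below reproduces that shape without any library dependency so the output can be inspected directly. The helper names and the placeholder batch are hypothetical; only the `question` and `answer` field names and the system prompt come from the notebook.

```python
SYSTEM_PROMPT = (
    "You are a professional financial analyst assistant. "
    "Answer questions based on Uber's 2024 Annual Report."
)

def to_chatml(system, user, assistant):
    """Render one system/user/assistant exchange in ChatML."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>"
    )

def format_batch(batch):
    """Turn a dict of column lists into a list of ChatML strings, one per row."""
    return [
        to_chatml(SYSTEM_PROMPT, question, answer)
        for question, answer in zip(batch["question"], batch["answer"])
    ]

# Placeholder rows, not real training data
batch = {
    "question": ["<question 1>", "<question 2>"],
    "answer": ["<answer 1>", "<answer 2>"],
}
texts = format_batch(batch)
print(texts[0])
```

Printing one formatted row before training is a cheap way to confirm that every turn is closed with `<|im_end|>` and that the assistant turn carries the answer.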

## 5. Training Execution (The Intern Learns)

Executing the SFT (Supervised Fine-Tuning) loop for 100 steps.

In [None]:
training_args = SFTConfig(
    output_dir="artifacts/intern_checkpoints",
    per_device_train_batch_size=1, 
    gradient_accumulation_steps=64, 
    learning_rate=2e-5,
    logging_steps=10,
    max_steps=100, 
    save_steps=50,
    optim="paged_adamw_8bit",
    fp16=not torch.cuda.is_bf16_supported() if torch.cuda.is_available() else False,
    bf16=torch.cuda.is_bf16_supported() if torch.cuda.is_available() else False,
    report_to="none",
    max_seq_length=1024,
    packing=False,
    dataset_text_field="text"  # Added as it's often required in SFTConfig even if not used by formatting_func
)

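# The model is already wrapped by get_peft_model above, so peft_config is not
# passed again; the tokenizer is passed explicitly so the trainer reuses it
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    args=training_args
)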

trainer.train()
trainer.save_model("artifacts/intern_final_adapter")
print("✅ Training Complete. Adapters saved to artifacts/intern_final_adapter")
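
With `per_device_train_batch_size=1` and `gradient_accumulation_steps=64`, each optimizer step sees 64 examples, so 100 steps consume 6,400 examples in total. How many passes over the data that represents depends on the dataset size, which the notebook only prints at runtime. The sketch below works through the arithmetic; the function name is hypothetical and `dataset_size=1000` is a placeholder, not the real size of `train.jsonl`.

```python
def training_budget(per_device_batch, grad_accum_steps, max_steps, dataset_size, num_devices=1):
    """Return (effective batch size, examples consumed, approximate epochs)."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    samples_seen = effective_batch * max_steps
    epochs = samples_seen / dataset_size
    return effective_batch, samples_seen, epochs

# dataset_size is a placeholder; substitute len(dataset) from the loading cell
effective_batch, samples_seen, epochs = training_budget(
    per_device_batch=1, grad_accum_steps=64, max_steps=100, dataset_size=1000
)
print(f"Effective batch size: {effective_batch}")
print(f"Examples consumed:    {samples_seen}")
print(f"Approximate epochs:   {epochs:.1f}")
```

A small dataset will be revisited many times under this budget, which is worth keeping in mind when reading the training loss.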

## 6. Inference Pipeline: `query_intern` 

Defining `query_intern`, the inference function that the evaluation step calls for every test question.

In [None]:
def query_intern(question):
    prompt = f"<|im_start|>system\nYou are a professional financial analyst assistant. Answer questions based on Uber's 2024 Annual Report.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=256, 
            temperature=0.1, 
            do_sample=True, 
            pad_token_id=tokenizer.eos_token_id
        )
    
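    # Decode only the newly generated tokens. Splitting the full decoded text on
    # "assistant" would truncate any answer that itself contains that word.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()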

# Sample baseline test
test_q = "What were the key drivers of Uber's revenue growth in 2024?"
print(f"Q: {test_q}")
print(f"A: {query_intern(test_q)}")

## 7. Local Evaluation (ROUGE-L)

Testing performance on the Golden Test Set.

In [None]:
# Load Golden Test Set
test_path = "artifacts/train_data/golden_test_set.jsonl"
if not os.path.exists(test_path):
    test_path = "../" + test_path
    if not os.path.exists(test_path):
        test_path = "./artifacts/train_data/golden_test_set.jsonl"

test_dataset = load_dataset("json", data_files=test_path, split="train")

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
scores = []

print(f"🧪 Evaluating 5 samples from Golden Test Set...")

for i in range(min(5, len(test_dataset))):
    question = test_dataset[i]['question']
    ground_truth = test_dataset[i]['answer']
    prediction = query_intern(question)
    
    score = scorer.score(ground_truth, prediction)['rougeL'].fmeasure
    scores.append(score)
    
    print(f"--- Sample {i+1} ---")
    print(f"Q: {question}")
    print(f"ROUGE-L: {score:.4f}")

print(f"\n🏆 Average ROUGE-L Baseline: {sum(scores)/len(scores):.4f}")
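
To make the reported number easier to interpret, here is a minimal sketch of what a ROUGE-L F-measure computes: the longest common subsequence (LCS) of tokens between reference and prediction, turned into precision, recall, and their harmonic mean. It is a simplified stand-in and not the `rouge_score` implementation: it tokenizes by lowercasing and splitting on whitespace and applies no stemming, so its values can differ from the scorer used above. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F-measure on whitespace tokens (no stemming)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy sentences for illustration only
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-L (simplified): {score:.4f}")
```

Because the metric rewards shared word order rather than factual agreement, a fluent answer with the wrong figure can still score well, so the average is best read alongside the individual predictions.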