# Day 1: Setup and Baseline Testing

This notebook covers:
1. Environment setup in Google Colab
2. Data upload and verification
3. Experiment tracking with Weights & Biases
4. Baseline model testing (zero-shot)
5. Initial performance metrics, saved to Google Drive

**Expected Time**: 2-3 hours

**GPU Required**: T4 or better (Colab Pro recommended)

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q torch transformers accelerate peft bitsandbytes datasets evaluate scikit-learn pandas numpy wandb

In [None]:
# Import libraries
import torch
import pandas as pd
import numpy as np
import json
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report
import wandb
import time

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Upload Processed Data

**Option A: Upload to Google Drive (Recommended)**
1. Upload the `processed` folder to: `MyDrive/Colab Notebooks/llm-finetuning-showdown/processed/`
2. Files needed:
   - `train.csv`
   - `val.csv`
   - `test.csv`
   - `label_mapping.json`

**Option B: Direct upload to Colab (slower)**
Use the file upload feature in Colab. Uploaded files are temporary and are lost when the runtime disconnects.
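
Whichever option you choose, it can help to confirm that all four files are in place before loading them. The sketch below is one way to do that with the standard library; `missing_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration, and the file names are the ones listed under "Files needed".

```python
from pathlib import Path

# File names taken from the "Files needed" list above
REQUIRED_FILES = ("train.csv", "val.csv", "test.csv", "label_mapping.json")


def missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_files(...)` on the folder you uploaded to should return an empty list. Any names it does return are the files that still need to be uploaded or moved.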

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Verify mount
print("\n✅ Google Drive mounted successfully!")
print("\nContents of MyDrive:")
!ls "/content/drive/MyDrive/"

In [None]:
# TROUBLESHOOTING: If you can't find your files, run this to search
# Uncomment the line below to search for train.csv in your Google Drive
# !find "/content/drive/MyDrive/" -name "train.csv" -type f 2>/dev/null

# Common locations where files might be:
# /content/drive/MyDrive/Colab Notebooks/llm-finetuning-showdown/processed/
# /content/drive/MyDrive/llm-finetuning-showdown/processed/
# /content/drive/MyDrive/processed/

In [None]:
# Load data from Google Drive
# UPDATE this path if you uploaded files to a different location
data_path = '/content/drive/MyDrive/Colab Notebooks/llm-finetuning-showdown/processed'

train_df = pd.read_csv(f'{data_path}/train.csv')
val_df = pd.read_csv(f'{data_path}/val.csv')
test_df = pd.read_csv(f'{data_path}/test.csv')

with open(f'{data_path}/label_mapping.json', 'r') as f:
    label_info = json.load(f)

print(f"✅ Train samples: {len(train_df)}")
print(f"✅ Val samples: {len(val_df)}")
print(f"✅ Test samples: {len(test_df)}")
print(f"\n✅ Number of categories: {label_info['num_labels']}")
print(f"✅ Categories: {list(label_info['label_to_id'].keys())}")

# Preview data
print(f"\n📋 Sample data:")
print(train_df.head(2))
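
As a further verification step, you may want to check that every label in the CSV files has an entry in the label mapping. The sketch below assumes the `label` column stores category names that match the keys of `label_info['label_to_id']`; if your files store numeric IDs instead, compare against the mapping's values. `unknown_labels` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def unknown_labels(labels, label_to_id):
    """Count label values that have no entry in the label_to_id mapping."""
    known = set(label_to_id)
    return Counter(label for label in labels if label not in known)
```

For example, `unknown_labels(train_df['label'], label_info['label_to_id'])` should return an empty `Counter` when the data and the mapping agree. Anything else points to a typo or a stale mapping file.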

## 3. Initialize Weights & Biases

In [None]:
# Login to W&B
wandb.login()

# Initialize project
wandb.init(
    project="llm-finetuning-showdown",
    name="day1-baseline",
    config={
        "task": "resume_classification",
        "num_labels": label_info['num_labels'],
        "model": "baseline"
    }
)

## 4. Test Baseline Model (Zero-Shot)

We'll test a pre-trained model without any fine-tuning to establish baseline performance.
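
To make the idea concrete: to my understanding, the zero-shot pipeline turns each candidate label into a natural-language-inference hypothesis (the default template is "This example is {}.") and, in single-label mode, normalizes the entailment scores across labels. The sketch below imitates only that bookkeeping in plain Python. The logits are made-up numbers, not model output, and the helper names are hypothetical.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Pair each label with its normalized score, highest score first."""
    scores = softmax(entailment_logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


labels = ["Sales", "Testing", "Database"]
hypothetical_logits = [2.0, 0.5, -1.0]  # made-up values for illustration only

print(build_hypotheses(labels))
print(rank_labels(labels, hypothetical_logits))
```

Because the model scores one hypothesis per label, inference cost grows with the number of categories, which is one reason zero-shot evaluation is slow.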

In [None]:
# Load model for zero-shot classification
# Use a general-purpose NLI model; run it on the GPU when one is available
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0 if torch.cuda.is_available() else -1,
)

# Test on a few samples
candidate_labels = list(label_info['label_to_id'].keys())
print(f"Categories: {candidate_labels}")

# Test sample, truncated to the first 512 characters (not tokens)
sample_text = test_df.iloc[0]['text'][:512]
result = classifier(sample_text, candidate_labels)
print(f"\nSample prediction:")
print(f"Text: {sample_text[:100]}...")
print(f"Predicted: {result['labels'][0]} (score: {result['scores'][0]:.3f})")
print(f"Actual: {test_df.iloc[0]['label']}")

In [None]:
# Evaluate on test set (sample for speed)
sample_size = min(100, len(test_df))  # Start with 100 samples
test_sample = test_df.sample(n=sample_size, random_state=42)

predictions = []
actuals = []

print(f"Evaluating on {sample_size} samples...")
start_time = time.time()

for _, row in test_sample.iterrows():
    text = row['text'][:512]  # Keep only the first 512 characters
    result = classifier(text, candidate_labels)
    predictions.append(result['labels'][0])
    actuals.append(row['label'])
    
    if (len(predictions) % 10) == 0:
        print(f"Processed {len(predictions)}/{sample_size}...")

eval_time = time.time() - start_time

# Calculate metrics
accuracy = accuracy_score(actuals, predictions)
f1 = f1_score(actuals, predictions, average='weighted')

print(f"\n{'='*50}")
print(f"BASELINE RESULTS (Zero-Shot)")
print(f"{'='*50}")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1-Score (weighted): {f1:.4f}")
print(f"Evaluation time: {eval_time:.2f}s")
print(f"Time per sample: {eval_time/sample_size:.2f}s")

# Log to W&B
wandb.log({
    "baseline_accuracy": accuracy,
    "baseline_f1": f1,
    "baseline_eval_time": eval_time
})

print(f"\nDetailed Classification Report:")
print(classification_report(actuals, predictions))
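
To see what `average='weighted'` means in the cell above, here is a minimal pure-Python sketch of support-weighted metrics. It is an illustration rather than scikit-learn's implementation, it assumes non-empty single-label inputs, and `weighted_scores` is a hypothetical name. It also shows why weighted recall comes out equal to accuracy.

```python
from collections import Counter


def weighted_scores(actuals, predictions):
    """Return (accuracy, weighted recall, weighted F1) for single-label data."""
    support = Counter(actuals)        # true examples per class
    predicted = Counter(predictions)  # predictions per class
    correct = Counter(a for a, p in zip(actuals, predictions) if a == p)
    total = len(actuals)

    weighted_recall = 0.0
    weighted_f1 = 0.0
    for label, n in support.items():
        tp = correct[label]
        recall = tp / n
        precision = tp / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels
        weighted_recall += recall * n / total
        weighted_f1 += f1 * n / total

    accuracy = sum(correct.values()) / total
    return accuracy, weighted_recall, weighted_f1


example_actuals = ["Sales", "Sales", "Testing", "Database"]
example_predictions = ["Sales", "Testing", "Testing", "Sales"]
print(weighted_scores(example_actuals, example_predictions))
```

In the weighted recall sum, each class's support cancels against the denominator of its recall, leaving total correct predictions divided by total samples, which is accuracy.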

## 5. Save Results to Google Drive

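In [None]:
# Save baseline results to Google Drive
from datetime import datetime

results_path = '/content/drive/MyDrive/Colab Notebooks/llm-finetuning-showdown'

# Per-category metrics from this run, so the saved file reflects the current
# evaluation instead of values copied by hand from an earlier report
report = classification_report(actuals, predictions, output_dict=True, zero_division=0)
summary_rows = ("accuracy", "micro avg", "macro avg", "weighted avg")
per_class_f1 = {
    name: stats["f1-score"]
    for name, stats in report.items()
    if name not in summary_rows and stats["support"] > 0
}
ranked = sorted(per_class_f1.items(), key=lambda item: item[1])

baseline_results = {
    "method": "baseline_zero_shot",
    "model": "facebook/bart-large-mnli",
    "hardware": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
    "date": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    "accuracy": float(accuracy),
    "f1_score": float(f1),
    "precision": float(report["weighted avg"]["precision"]),  # Weighted precision
    "recall": float(report["weighted avg"]["recall"]),  # Weighted recall (equals accuracy)
    "evaluation_time_seconds": float(eval_time),
    "time_per_sample": float(eval_time/sample_size),
    "samples_tested": sample_size,
    "num_categories": label_info['num_labels'],
    # Assumption: "best" means a perfect F1 on this sample; adjust the threshold as needed
    "best_categories": sorted(name for name, score in ranked if score == 1.0),
    # The three lowest-scoring categories that appear in the sample
    "challenging_categories": [
        {"category": name, "f1": round(float(score), 2)} for name, score in ranked[:3]
    ]
}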

with open(f'{results_path}/baseline_results.json', 'w') as f:
    json.dump(baseline_results, f, indent=2)

print(f"✅ Baseline results saved to: {results_path}/baseline_results.json")
print("\n📊 Summary:")
print(f"   Accuracy: {accuracy:.2%}")
print(f"   F1-Score: {f1:.4f}")
print(f"   Evaluation time: {eval_time:.2f}s")
print(f"   Hardware: {baseline_results['hardware']}")
print("\n✅ You can access this file from your Google Drive!")

## 6. Next Steps

**✅ Baseline Complete! Record your results:**

Update your `RESULTS_TRACKER.md` with:
- Baseline accuracy: ____% (from above)
- Baseline F1-score: ____ (from above)
- Evaluation time: ____s (from above)
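
If you would rather not copy the numbers by hand, the three entries can be generated from the saved `baseline_results.json`. This is a small sketch under the assumption that the file uses the keys written by the save cell (`accuracy`, `f1_score`, `evaluation_time_seconds`); the function names are hypothetical.

```python
import json


def tracker_lines(results):
    """Format the three tracker entries from a baseline results dict."""
    return [
        f"- Baseline accuracy: {results['accuracy']:.2%}",
        f"- Baseline F1-score: {results['f1_score']:.4f}",
        f"- Evaluation time: {results['evaluation_time_seconds']:.2f}s",
    ]


def tracker_lines_from_file(path):
    """Load a results JSON file and format its tracker entries."""
    with open(path, "r", encoding="utf-8") as f:
        return tracker_lines(json.load(f))
```

Printing `"\n".join(tracker_lines_from_file(...))` with the path to your results file gives text you can paste into `RESULTS_TRACKER.md`.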

**Day 2 Tasks (Tomorrow or continue today):**
- [ ] Implement full fine-tuning script
- [ ] Train model on resume classification task (3-4 hours)
- [ ] Compare results with baseline

**Expected Improvement:**
- Full fine-tuning target: 80-95% accuracy
- LoRA target: 75-90% accuracy
- QLoRA target: 70-88% accuracy

**✅ You've established your baseline! This is the performance floor that your fine-tuned models will beat.**

Beyond higher accuracy, the fine-tuned models should also give much faster inference than zero-shot classification.

**To Do:**
- [ ] Save baseline_results.json to your local project
- [ ] Update experiment tracking spreadsheet
- [ ] Prepare for Day 2 (full fine-tuning)

In [None]:
# Finish W&B run
wandb.finish()