# Evaluate Fine-tuned Model on Full Dataset

This notebook evaluates your fine-tuned GPT-4o model on the entire dataset (all 250 samples).

**Steps:**
1. Retrieve your fine-tuned model name from OpenAI
2. Configure file paths and metrics
3. Load the full dataset
4. Define helper functions
5. Run inference on all samples (with latency tracking)
6. Calculate metrics
7. Save results
8. Analyze errors

## Setup

In [1]:
# Install required packages
!pip install openai pandas numpy matplotlib -q

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
from getpass import getpass

# Set OpenAI API key
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"]:
    os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key: ")

Paste your OpenAI API key: ··········


In [4]:
import pandas as pd
import numpy as np
import json
import time
from datetime import datetime
from typing import Dict, List
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

print("✓ Libraries loaded")

✓ Libraries loaded


## Step 1: Retrieve Your Fine-tuned Model

This lists up to 10 of your recent fine-tuning jobs and shows the fine-tuned model name for each job that produced one.

In [5]:
# List all fine-tuning jobs
print("Retrieving your fine-tuning jobs...")
print("=" * 70)

jobs = client.fine_tuning.jobs.list(limit=10)

print("\nRecent Fine-tuning Jobs:\n")

for i, job in enumerate(jobs.data, 1):
    print(f"{i}. Job ID: {job.id}")
    print(f"   Status: {job.status}")
    print(f"   Created: {datetime.fromtimestamp(job.created_at).strftime('%Y-%m-%d %H:%M:%S')}")

    if job.fine_tuned_model:
        print(f"   ✓ Fine-tuned Model: {job.fine_tuned_model}")
    else:
        print(f"   Model: Not yet available (status: {job.status})")

    if job.finished_at:
        print(f"   Finished: {datetime.fromtimestamp(job.finished_at).strftime('%Y-%m-%d %H:%M:%S')}")

    print()

Retrieving your fine-tuning jobs...

Recent Fine-tuning Jobs:

1. Job ID: ftjob-t9Yi16FKGtHzeZsY7zAPq7KU
   Status: succeeded
   Created: 2025-11-30 19:04:04
   ✓ Fine-tuned Model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh
   Finished: 2025-11-30 19:29:03

2. Job ID: ftjob-5o6R4f1LQDKyeDsJ0tICpclW
   Status: failed
   Created: 2025-11-29 22:31:39
   Model: Not yet available (status: failed)



In [6]:
# Set your fine-tuned model name here
# Copy it from the output above

FINE_TUNED_MODEL = "ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh"  # ← UPDATE THIS!

# Example: FINE_TUNED_MODEL = "ft:gpt-4o-2024-08-06:personal::AbCdEfGh"

print(f"Using model: {FINE_TUNED_MODEL}")

if FINE_TUNED_MODEL == "YOUR_MODEL_NAME_HERE":
    print("\n⚠️  WARNING: You need to update FINE_TUNED_MODEL with your actual model name!")
    print("Copy it from the output above.")

Using model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh
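
Copying the model name by hand is easy to get wrong. As a hedged alternative, the sketch below picks the most recently finished job whose status is `succeeded`. The helper name `latest_succeeded_model` is hypothetical, and the example objects are stand-ins that only mimic the three job fields printed above (`status`, `fine_tuned_model`, `finished_at`); their values are made up.

```python
from types import SimpleNamespace


def latest_succeeded_model(jobs):
    """Return the fine_tuned_model of the most recently finished succeeded job, or None."""
    candidates = [
        job for job in jobs
        if job.status == "succeeded" and job.fine_tuned_model and job.finished_at
    ]
    if not candidates:
        return None
    # finished_at is a Unix timestamp, so the largest value is the newest job
    return max(candidates, key=lambda job: job.finished_at).fine_tuned_model


# Stand-in job objects with made-up values, for illustration only
example_jobs = [
    SimpleNamespace(status="failed", fine_tuned_model=None, finished_at=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example:old", finished_at=100),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example:new", finished_at=200),
]

print(latest_succeeded_model(example_jobs))
```

In this notebook the same helper could presumably be called as `latest_succeeded_model(jobs.data)`, using the `jobs` object from the listing cell.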


## Step 2: Configuration

In [7]:
# File paths - UPDATE THESE
DATA_PATH = "/content/drive/MyDrive/hiring_evaluations.csv"  # ← Update this path
OUTPUT_DIR = "/content/drive/MyDrive/finetuning_output"  # Where to save results

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Metrics
METRICS = [
    "cognitive_ability",
    "experience",
    "problem_solving",
    "reliability",
    "professionalism",
    "communication"
]

print(f"Configuration:")
print(f"  Model: {FINE_TUNED_MODEL}")
print(f"  Data path: {DATA_PATH}")
print(f"  Output directory: {OUTPUT_DIR}")

Configuration:
  Model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh
  Data path: /content/drive/MyDrive/hiring_evaluations.csv
  Output directory: /content/drive/MyDrive/finetuning_output


## Step 3: Load Full Dataset

In [8]:
# Load the full dataset
print("Loading dataset...")
df = pd.read_csv(DATA_PATH)

print(f"✓ Loaded {len(df)} samples")
print(f"\nDataset overview:")
print(df[['interview_id', 'role'] + METRICS].head())

print(f"\nRole distribution:")
print(df['role'].value_counts())

Loading dataset...
✓ Loaded 250 samples

Dataset overview:
                          interview_id                             role  \
0  customer_service_representative_001  Customer Service Representative   
1  customer_service_representative_002  Customer Service Representative   
2  customer_service_representative_003  Customer Service Representative   
3  customer_service_representative_004  Customer Service Representative   
4  customer_service_representative_005  Customer Service Representative   

   cognitive_ability  experience  problem_solving  reliability  \
0                  6           1                6            5   
1                  6           5                9            5   
2                  6           5                9            5   
3                  6           5                6            5   
4                  6           4                9            5   

   professionalism  communication  
0                4              1  
1                4   

## Step 4: Helper Functions

In [9]:
def build_scoring_prompt(qa_text: str, role: str) -> str:
    """Build the scoring prompt matching the training format"""
    prompt = f"""You are evaluating a candidate interview for the role: {role}

Analyze the candidate's responses using these six metrics (each scored 1-10):

1. **Cognitive Ability (35%)**: Structured thinking, planning, logic, analytical reasoning
2. **Experience (35%)**: Relevant work history (last 10 years), demonstrated skills, accomplishments
3. **Problem Solving (15%)**: Resourcefulness, creative solutions, handling constraints
4. **Reliability (5%)**: Punctuality, follow-through, dependability signals
5. **Professionalism (5%)**: Respect for clients/rules, composure under stress
6. **Communication (5%)**: Clarity and tone (ignore filler words like um, uh, like)

CRITICAL INSTRUCTIONS:
- Return ONLY a valid JSON object
- Use these exact keys: cognitive_ability, experience, problem_solving, reliability, professionalism, communication
- Each value must be an integer from 1 to 10
- Do not include any explanations, just the JSON

Interview Transcript:
--- START TRANSCRIPT ---
{qa_text}
--- END TRANSCRIPT ---

Return your scores in this format:
{{"cognitive_ability":7,"experience":6,"problem_solving":7,"reliability":6,"professionalism":7,"communication":8}}"""
    return prompt

def run_single_inference(qa_text: str, role: str, model_name: str) -> tuple:
    """Run inference on a single sample and return (scores, latency_ms).

    On any failure every metric falls back to 5, so failed calls are counted
    as ordinary predictions in the metrics computed later.
    """
    system_message = "You are an expert interviewer evaluating candidates. You provide accurate, consistent scoring based on interview transcripts."
    user_message = build_scoring_prompt(qa_text, role)

    # Start the timer before the try block so the except handlers can always read it
    start_time = time.perf_counter()

    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0.0,
            max_tokens=512
        )

        latency_ms = (time.perf_counter() - start_time) * 1000

        # content may be None, so guard before stripping
        response_text = (response.choices[0].message.content or "").strip()

        # Clean response (remove markdown code blocks if present)
        if response_text.startswith("```"):
            response_text = response_text.split("```")[1]
            if response_text.startswith("json"):
                response_text = response_text[4:]
            response_text = response_text.strip()

        scores = json.loads(response_text)

        # Validate and clamp scores to the 1-10 range
        validated_scores = {}
        for metric in METRICS:
            if metric in scores:
                validated_scores[metric] = max(1, min(10, int(scores[metric])))
            else:
                print(f"⚠️  Missing metric '{metric}', defaulting to 5")
                validated_scores[metric] = 5

        return validated_scores, latency_ms

    except json.JSONDecodeError as e:
        latency_ms = (time.perf_counter() - start_time) * 1000
        print(f"❌ JSON decode error: {e}")
        return {metric: 5 for metric in METRICS}, latency_ms
    except Exception as e:
        latency_ms = (time.perf_counter() - start_time) * 1000
        print(f"❌ Error: {e}")
        return {metric: 5 for metric in METRICS}, latency_ms

print("✓ Helper functions loaded")

✓ Helper functions loaded
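
The helper above replaces any unusable reply with a score of 5, which makes failed calls indistinguishable from real predictions. The sketch below shows one alternative, assuming you would rather get `None` back and decide separately how to handle failures. The function name `parse_scores` and the constant `METRIC_KEYS` are hypothetical; the key names and the 1-10 range are the ones used in the prompt.

```python
import json
import re

METRIC_KEYS = [
    "cognitive_ability", "experience", "problem_solving",
    "reliability", "professionalism", "communication",
]


def parse_scores(response_text, metrics=METRIC_KEYS):
    """Return a dict of clamped integer scores, or None if the reply is unusable."""
    if not response_text:
        return None
    # Take the span from the first "{" to the last "}", which also skips
    # any markdown code fence wrapped around the JSON
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict) or any(m not in raw for m in metrics):
        return None
    try:
        # Clamp each score to the 1-10 range requested in the prompt
        return {m: max(1, min(10, int(raw[m]))) for m in metrics}
    except (TypeError, ValueError):
        return None


fenced_reply = (
    "```json\n"
    '{"cognitive_ability":7,"experience":6,"problem_solving":7,'
    '"reliability":6,"professionalism":7,"communication":12}\n'
    "```"
)
print(parse_scores(fenced_reply))
print(parse_scores("Sorry, I cannot score this."))
```

Samples that come back as `None` could then be counted and excluded from the metrics, or retried, instead of being scored as 5.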


## Step 5: Run Inference on All Samples

Expect this to take several minutes. The run recorded below needed about 7 minutes for 250 samples (428.7 s, roughly 1.7 s per sample).

In [10]:
# Run inference on all samples
print("=" * 70)
print(f"Running inference on {len(df)} samples...")
print("=" * 70)
print("This will take approximately 3-5 minutes\n")

predictions = []
latencies = []
start_time = time.time()

for idx, row in df.iterrows():
    print(f"Processing {idx + 1}/{len(df)}: {row['interview_id']}", end="\r")

    pred_scores, latency = run_single_inference(row['qa_text'], row['role'], FINE_TUNED_MODEL)
    predictions.append(pred_scores)
    latencies.append(latency)

    # Small delay to avoid rate limits
    time.sleep(0.1)

elapsed_time = time.time() - start_time
print(f"\n\n✓ Completed in {elapsed_time:.1f} seconds ({elapsed_time/len(df):.2f}s per sample)")

# Latency statistics
avg_latency = np.mean(latencies)
p50_latency = np.percentile(latencies, 50)
p95_latency = np.percentile(latencies, 95)
p99_latency = np.percentile(latencies, 99)

print("\nLatency Statistics:")
print("-" * 70)
print(f"Mean:   {avg_latency:.0f} ms")
print(f"Median: {p50_latency:.0f} ms")
print(f"P95:    {p95_latency:.0f} ms")
print(f"P99:    {p99_latency:.0f} ms")
print(f"Min:    {min(latencies):.0f} ms")
print(f"Max:    {max(latencies):.0f} ms")

Running inference on 250 samples...
This will take approximately 3-5 minutes



✓ Completed in 428.7 seconds (1.71s per sample)

Latency Statistics:
----------------------------------------------------------------------
Mean:   1614 ms
Median: 1519 ms
P95:    3570 ms
P99:    4415 ms
Min:    769 ms
Max:    5192 ms
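
The loop above relies on a fixed 0.1 s pause to stay under rate limits. A common alternative is to retry a failed call with exponentially growing delays. The sketch below is a minimal, generic version of that idea; `call_with_backoff` and its parameters are hypothetical names, and it catches every exception for simplicity, whereas real code would likely retry only on rate-limit or transient network errors.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demo with a function that fails twice before succeeding.
# A fake sleep records the delays so the example finishes instantly.
recorded_delays = []
call_count = {"n": 0}


def flaky_call():
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)  # prints: ok [1.0, 2.0]
```

Note that `run_single_inference` catches all exceptions itself, so a wrapper like this would have to go around the API call inside that function rather than around the function as a whole.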


## Step 6: Calculate Metrics

In [11]:
# Calculate evaluation metrics
print("\n" + "=" * 70)
print("EVALUATION METRICS")
print("=" * 70)

# Per-metric results
results = {}
for metric in METRICS:
    true_values = df[metric].values
    pred_values = np.array([p[metric] for p in predictions])

    mae = np.mean(np.abs(true_values - pred_values))
    rmse = np.sqrt(np.mean((true_values - pred_values) ** 2))
    exact_match = np.mean(true_values == pred_values)
    within_1 = np.mean(np.abs(true_values - pred_values) <= 1)
    within_2 = np.mean(np.abs(true_values - pred_values) <= 2)

    results[metric] = {
        'mae': mae,
        'rmse': rmse,
        'exact_match': exact_match,
        'within_1': within_1,
        'within_2': within_2
    }

# Display per-metric results
print("\nPer-Metric Performance:")
print("-" * 70)
print(f"{'Metric':<20} {'MAE':>8} {'RMSE':>8} {'Exact':>8} {'±1':>8} {'±2':>8}")
print("-" * 70)

for metric in METRICS:
    m = results[metric]
    print(f"{metric:<20} {m['mae']:>8.2f} {m['rmse']:>8.2f} "
          f"{m['exact_match']*100:>7.1f}% {m['within_1']*100:>7.1f}% {m['within_2']*100:>7.1f}%")

# Overall metrics
all_true = df[METRICS].values.flatten()
all_pred = np.array([[p[m] for m in METRICS] for p in predictions]).flatten()

overall_mae = np.mean(np.abs(all_true - all_pred))
overall_rmse = np.sqrt(np.mean((all_true - all_pred) ** 2))
overall_exact = np.mean(all_true == all_pred)
overall_within_1 = np.mean(np.abs(all_true - all_pred) <= 1)
overall_within_2 = np.mean(np.abs(all_true - all_pred) <= 2)

print("\n" + "=" * 70)
print("Overall Performance (all metrics combined):")
print("-" * 70)
print(f"Mean Absolute Error (MAE):  {overall_mae:.3f}")
print(f"Root Mean Squared Error:     {overall_rmse:.3f}")
print(f"Exact Match Accuracy:        {overall_exact*100:.1f}%")
print(f"Within ±1 Accuracy:          {overall_within_1*100:.1f}%")
print(f"Within ±2 Accuracy:          {overall_within_2*100:.1f}%")


EVALUATION METRICS

Per-Metric Performance:
----------------------------------------------------------------------
Metric                    MAE     RMSE    Exact       ±1       ±2
----------------------------------------------------------------------
cognitive_ability        0.14     0.54    92.8%    92.8%   100.0%
experience               0.20     0.81    93.6%    94.0%    94.0%
problem_solving          0.67     1.41    76.0%    76.8%    82.0%
reliability              0.54     1.12    74.8%    74.8%    98.0%
professionalism          0.40     0.78    69.6%    90.0%   100.0%
communication            0.62     1.08    61.2%    79.6%    96.8%

Overall Performance (all metrics combined):
----------------------------------------------------------------------
Mean Absolute Error (MAE):  0.431
Root Mean Squared Error:     0.996
Exact Match Accuracy:        78.0%
Within ±1 Accuracy:          84.7%
Within ±2 Accuracy:          95.1%
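
To make the columns above concrete, here is a small worked example that recomputes the same five quantities in plain Python on four made-up score pairs. The function name `score_metrics` and the sample values are illustrative only and are not taken from the dataset.

```python
import math


def score_metrics(true_values, pred_values):
    """Compute MAE, RMSE, exact-match rate, and within-1 / within-2 rates."""
    errors = [p - t for t, p in zip(true_values, pred_values)]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "exact_match": sum(e == 0 for e in errors) / n,
        "within_1": sum(abs(e) <= 1 for e in errors) / n,
        "within_2": sum(abs(e) <= 2 for e in errors) / n,
    }


# Made-up scores: the errors are 0, +3, +2 and -1
truth = [8, 5, 7, 7]
preds = [8, 8, 9, 6]

for name, value in score_metrics(truth, preds).items():
    print(f"{name:<12} {value:.3f}")
```

Here the absolute errors sum to 6, so the MAE is 6 / 4 = 1.5, while the RMSE is sqrt(14 / 4), about 1.871. RMSE weighs the single 3-point miss more heavily than MAE does, which is why the table above shows RMSE noticeably higher than MAE for metrics with occasional large errors.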


## Step 7: Save Results

In [12]:
# Create detailed results dataframe
results_df = df.copy()

# Add predictions and errors
for metric in METRICS:
    results_df[f'predicted_{metric}'] = [p[metric] for p in predictions]
    results_df[f'error_{metric}'] = results_df[f'predicted_{metric}'] - results_df[metric]

results_df['total_abs_error'] = sum(abs(results_df[f'error_{m}']) for m in METRICS)
results_df['latency_ms'] = latencies

# Save predictions
predictions_path = os.path.join(OUTPUT_DIR, "full_dataset_predictions.csv")
results_df.to_csv(predictions_path, index=False)
print(f"✓ Detailed predictions saved: {predictions_path}")

# Save metrics as JSON
metrics_data = {
    'model': FINE_TUNED_MODEL,
    'total_samples': len(df),
    'timestamp': datetime.now().isoformat(),
    'per_metric': results,
    'overall': {
        'mae': float(overall_mae),
        'rmse': float(overall_rmse),
        'exact_match': float(overall_exact),
        'within_1': float(overall_within_1),
        'within_2': float(overall_within_2)
    },
    'latency': {
        'mean_ms': float(np.mean(latencies)),
        'median_ms': float(np.median(latencies)),
        'p95_ms': float(np.percentile(latencies, 95)),
        'p99_ms': float(np.percentile(latencies, 99)),
        'min_ms': float(min(latencies)),
        'max_ms': float(max(latencies))
    }
}

metrics_path = os.path.join(OUTPUT_DIR, "full_dataset_metrics.json")
with open(metrics_path, 'w') as f:
    json.dump(metrics_data, f, indent=2)
print(f"✓ Metrics saved: {metrics_path}")

✓ Detailed predictions saved: /content/drive/MyDrive/finetuning_output/full_dataset_predictions.csv
✓ Metrics saved: /content/drive/MyDrive/finetuning_output/full_dataset_metrics.json


## Step 8: Error Analysis

In [13]:
# Error Analysis
print("\n" + "=" * 70)
print("ERROR ANALYSIS")
print("=" * 70)

# Find worst predictions
worst_10 = results_df.nlargest(10, 'total_abs_error')

print("\nTop 10 Worst Predictions (by total absolute error):")
print("-" * 70)

for rank, (_, row) in enumerate(worst_10.iterrows(), 1):
    print(f"\n#{rank}. Interview ID: {row['interview_id']}")
    print(f"   Role: {row['role']}")
    print(f"   Total Absolute Error: {row['total_abs_error']:.0f}")
    print(f"   Latency: {row['latency_ms']:.0f} ms")
    print("   Scores (Truth → Predicted):")
    for metric in METRICS:
        error_indicator = "✓" if abs(row[f'error_{metric}']) <= 1 else "✗"
        print(f"     {error_indicator} {metric:<20}: {row[metric]:>2} → {row[f'predicted_{metric}']:>2} (error: {row[f'error_{metric}']:+3.0f})")


ERROR ANALYSIS

Top 10 Worst Predictions (by total absolute error):
----------------------------------------------------------------------

#1. Interview ID: sales_representative_004
   Role: Sales Representative
   Total Absolute Error: 11
   Latency: 1732 ms
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  8 →  8 (error:  +0)
     ✗ experience          :  5 →  8 (error:  +3)
     ✗ problem_solving     :  7 →  9 (error:  +2)
     ✗ reliability         :  7 →  5 (error:  -2)
     ✗ professionalism     :  7 →  5 (error:  -2)
     ✗ communication       :  7 →  5 (error:  -2)

#2. Interview ID: general_manager_(franchise)_019
   Role: General Manager (Franchise)
   Total Absolute Error: 9
   Latency: 3835 ms
   Scores (Truth → Predicted):
     ✗ cognitive_ability   :  4 →  6 (error:  +2)
     ✓ experience          :  8 →  8 (error:  +0)
     ✗ problem_solving     :  9 →  6 (error:  -3)
     ✗ reliability         :  1 →  5 (error:  +4)
     ✓ professionalism     :  2 →  2 (err

In [14]:
# Show some examples of best predictions
best_10 = results_df.nsmallest(10, 'total_abs_error')

print("\n" + "=" * 70)
print("Top 10 Best Predictions (by total absolute error):")
print("-" * 70)

for rank, (_, row) in enumerate(best_10.iterrows(), 1):
    print(f"\n#{rank}. Interview ID: {row['interview_id']}")
    print(f"   Role: {row['role']}")
    print(f"   Total Absolute Error: {row['total_abs_error']:.0f}")
    print(f"   Latency: {row['latency_ms']:.0f} ms")
    print("   Scores (Truth → Predicted):")
    for metric in METRICS:
        error_indicator = "✓" if abs(row[f'error_{metric}']) == 0 else "~"
        print(f"     {error_indicator} {metric:<20}: {row[metric]:>2} → {row[f'predicted_{metric}']:>2} (error: {row[f'error_{metric}']:+3.0f})")


Top 10 Best Predictions (by total absolute error):
----------------------------------------------------------------------

#1. Interview ID: customer_service_representative_002
   Role: Customer Service Representative
   Total Absolute Error: 0
   Latency: 2167 ms
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  6 →  6 (error:  +0)
     ✓ experience          :  5 →  5 (error:  +0)
     ✓ problem_solving     :  9 →  9 (error:  +0)
     ✓ reliability         :  5 →  5 (error:  +0)
     ✓ professionalism     :  4 →  4 (error:  +0)
     ✓ communication       :  1 →  1 (error:  +0)

#2. Interview ID: customer_service_representative_016
   Role: Customer Service Representative
   Total Absolute Error: 0
   Latency: 1659 ms
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  6 →  6 (error:  +0)
     ✓ experience          :  5 →  5 (error:  +0)
     ✓ problem_solving     :  6 →  6 (error:  +0)
     ✓ reliability         :  5 →  5 (error:  +0)
     ✓ professionalism     : 
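
The best and worst lists suggest that error may vary by role. One way to check is to average `total_abs_error` per role. The sketch below does this in plain Python on made-up rows; `mean_abs_error_by_role` is a hypothetical helper, and the error values are illustrative rather than results from this run.

```python
from collections import defaultdict


def mean_abs_error_by_role(rows):
    """Average the total absolute error per role.

    rows: iterable of (role, total_abs_error) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # role -> [error sum, row count]
    for role, error in rows:
        totals[role][0] += error
        totals[role][1] += 1
    return {role: error_sum / count for role, (error_sum, count) in totals.items()}


# Made-up rows for illustration only
example_rows = [
    ("Sales Representative", 11),
    ("Sales Representative", 1),
    ("General Manager (Franchise)", 9),
]

for role, mean_error in mean_abs_error_by_role(example_rows).items():
    print(f"{role:<30} {mean_error:.1f}")
```

With the notebook's dataframe, the pairs could come from `zip(results_df['role'], results_df['total_abs_error'])`, or the same figure could be obtained directly with `results_df.groupby('role')['total_abs_error'].mean()`.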

## Summary

In [15]:
print("\n" + "=" * 70)
print("EVALUATION COMPLETE")
print("=" * 70)
print(f"\nSummary:")
print(f"  Model: {FINE_TUNED_MODEL}")
print(f"  Total samples evaluated: {len(df)}")
print(f"  Overall MAE: {overall_mae:.3f}")
print(f"  Overall RMSE: {overall_rmse:.3f}")
print(f"  Exact Match: {overall_exact*100:.1f}%")
print(f"  Within ±1: {overall_within_1*100:.1f}%")
print(f"  Within ±2: {overall_within_2*100:.1f}%")
print(f"\nLatency:")
print(f"  Mean: {np.mean(latencies):.0f} ms")
print(f"  P95: {np.percentile(latencies, 95):.0f} ms")
print(f"\nFiles saved:")
print(f"  - {predictions_path}")
print(f"  - {metrics_path}")


EVALUATION COMPLETE

Summary:
  Model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh
  Total samples evaluated: 250
  Overall MAE: 0.431
  Overall RMSE: 0.996
  Exact Match: 78.0%
  Within ±1: 84.7%
  Within ±2: 95.1%

Latency:
  Mean: 1614 ms
  P95: 3570 ms

Files saved:
  - /content/drive/MyDrive/finetuning_output/full_dataset_predictions.csv
  - /content/drive/MyDrive/finetuning_output/full_dataset_metrics.json
