# GPT-4o Fine-tuning for Interview Scoring

This notebook fine-tunes GPT-4o on human-scored interview evaluations.

**Dataset**: 250 interviews with human scores across 6 metrics

**Process**:
1. Data quality assessment and cleaning
2. 80-20 train-test split
3. Format data for OpenAI fine-tuning API
4. Upload and initiate fine-tuning job
5. Monitor training progress
6. Evaluate fine-tuned model

## Setup & Installation

In [1]:
# Install required packages
!pip install openai pandas numpy scikit-learn -q

In [2]:
# Mount Google Drive (for Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
from getpass import getpass

# Set OpenAI API key
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"]:
    os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key: ")

Paste your OpenAI API key: ··········


## Imports

In [4]:
import pandas as pd
import numpy as np
import json
import time
from datetime import datetime
from typing import Dict, List
from sklearn.model_selection import train_test_split
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

print("✓ Libraries loaded")

✓ Libraries loaded


## Configuration

In [5]:
# File paths - UPDATE THESE FOR YOUR ENVIRONMENT
DATA_PATH = "/content/drive/MyDrive/hiring_evaluations.csv"  # Update this path
OUTPUT_DIR = "/content/drive/MyDrive/finetuning_output"  # Where to save outputs

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Metrics configuration
METRICS = [
    "cognitive_ability",
    "experience",
    "problem_solving",
    "reliability",
    "professionalism",
    "communication"
]

# Fine-tuning configuration
MODEL_TO_FINETUNE = "gpt-4o-2024-08-06"  # GPT-4o snapshot used as the base model for fine-tuning
TRAIN_TEST_SPLIT = 0.8  # 80% train, 20% test
RANDOM_SEED = 42

print("Configuration loaded")
print(f"Data path: {DATA_PATH}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Model: {MODEL_TO_FINETUNE}")

Configuration loaded
Data path: /content/drive/MyDrive/hiring_evaluations.csv
Output directory: /content/drive/MyDrive/finetuning_output
Model: gpt-4o-2024-08-06


## 1. Load and Inspect Data

In [6]:
# Load dataset
df = pd.read_csv(DATA_PATH)

print("Dataset Overview:")
print(f"Total samples: {len(df)}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

Dataset Overview:
Total samples: 250

Columns: ['interview_id', 'role', 'qa_text', 'cognitive_ability', 'experience', 'problem_solving', 'reliability', 'professionalism', 'communication']

First few rows:


| | interview_id | role | qa_text | cognitive_ability | experience | problem_solving | reliability | professionalism | communication |
|---|---|---|---|---|---|---|---|---|---|
| 0 | customer_service_representative_001 | Customer Service Representative | Q: Have you worked in a customer-facing role f... | 6 | 1 | 6 | 5 | 4 | 1 |
| 1 | customer_service_representative_002 | Customer Service Representative | Q: Have you worked in a customer-facing role f... | 6 | 5 | 9 | 5 | 4 | 1 |
| 2 | customer_service_representative_003 | Customer Service Representative | Q: Have you worked in a customer-facing role f... | 6 | 5 | 9 | 5 | 4 | 1 |
| 3 | customer_service_representative_004 | Customer Service Representative | Q: Have you worked in a customer-facing role f... | 6 | 5 | 6 | 5 | 4 | 2 |
| 4 | customer_service_representative_005 | Customer Service Representative | Q: Have you worked in a customer-facing role f... | 6 | 4 | 9 | 5 | 4 | 3 |


In [7]:
# Check data quality
print("Data Quality Checks:")
print("=" * 70)

# Missing values
missing = df.isnull().sum()
if missing.any():
    print("⚠️  Missing values found:")
    print(missing[missing > 0])
else:
    print("✓ No missing values")

# Duplicate interview IDs
duplicates = df[df.duplicated(subset=['interview_id'], keep=False)]
if len(duplicates) > 0:
    print(f"\n⚠️  {len(duplicates)} duplicate interview_ids found")
    print(duplicates[['interview_id', 'role']].head())
else:
    print("✓ No duplicate interview_ids")

# Score range validation (should be 1-10)
print("\nScore Range Validation (expected: 1-10):")
for metric in METRICS:
    min_val = df[metric].min()
    max_val = df[metric].max()
    out_of_range = df[(df[metric] < 1) | (df[metric] > 10)]

    if len(out_of_range) > 0:
        print(f"⚠️  {metric}: {len(out_of_range)} scores out of range (min={min_val}, max={max_val})")
    else:
        print(f"✓ {metric}: All scores in range [1-10] (actual range: {min_val}-{max_val})")

Data Quality Checks:
✓ No missing values
✓ No duplicate interview_ids

Score Range Validation (expected: 1-10):
✓ cognitive_ability: All scores in range [1-10] (actual range: 4-8)
✓ experience: All scores in range [1-10] (actual range: 1-8)
✓ problem_solving: All scores in range [1-10] (actual range: 2-9)
✓ reliability: All scores in range [1-10] (actual range: 1-7)
✓ professionalism: All scores in range [1-10] (actual range: 2-7)
✓ communication: All scores in range [1-10] (actual range: 1-7)


## 2. Data Analysis

In [8]:
# Score distributions
print("Score Distributions:")
print("=" * 70)

for metric in METRICS:
    print(f"\n{metric.upper().replace('_', ' ')}:")
    print(f"  Mean: {df[metric].mean():.2f}")
    print(f"  Median: {df[metric].median():.0f}")
    print(f"  Std Dev: {df[metric].std():.2f}")
    print(f"  Distribution: {df[metric].value_counts().sort_index().to_dict()}")

Score Distributions:

COGNITIVE ABILITY:
  Mean: 6.98
  Median: 8
  Std Dev: 1.19
  Distribution: {4: 13, 6: 102, 8: 135}

EXPERIENCE:
  Mean: 7.03
  Median: 8
  Std Dev: 1.53
  Distribution: {1: 3, 4: 2, 5: 71, 8: 174}

PROBLEM SOLVING:
  Mean: 8.00
  Median: 9
  Std Dev: 1.58
  Distribution: {2: 4, 4: 3, 5: 5, 6: 49, 7: 20, 9: 169}

RELIABILITY:
  Mean: 5.02
  Median: 5
  Std Dev: 1.53
  Distribution: {1: 2, 3: 65, 5: 111, 7: 72}

PROFESSIONALISM:
  Mean: 4.63
  Median: 5
  Std Dev: 2.05
  Distribution: {2: 82, 4: 30, 5: 20, 6: 52, 7: 66}

COMMUNICATION:
  Mean: 3.51
  Median: 4
  Std Dev: 2.09
  Distribution: {1: 77, 2: 23, 3: 11, 4: 49, 5: 31, 6: 42, 7: 17}
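
The distributions above are heavily skewed toward a few scores, which matters when reading accuracy numbers later. As a rough sketch (the helper `constant_baseline` is hypothetical and not part of the original notebook), the snippet below computes how well a model would do by always predicting the most common score. It uses the cognitive ability counts printed above, which cover the full dataset rather than the test split.

```python
def constant_baseline(distribution):
    """Exact-match rate and MAE when always predicting the most common score."""
    total = sum(distribution.values())
    mode = max(distribution, key=distribution.get)
    exact = distribution[mode] / total
    mae = sum(abs(score - mode) * count for score, count in distribution.items()) / total
    return mode, exact, mae

# Counts copied from the COGNITIVE ABILITY distribution printed above
cognitive = {4: 13, 6: 102, 8: 135}
mode, exact, mae = constant_baseline(cognitive)
print(f"mode={mode} exact_match={exact:.1%} mae={mae:.3f}")
```

A fine-tuned model is only adding value on a metric if it clearly beats this kind of constant predictor.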


In [9]:
# Role distribution
print("\nRole Distribution:")
print("=" * 70)
role_counts = df['role'].value_counts()
print(role_counts)

print(f"\nTotal roles: {len(role_counts)}")
print(f"Samples per role: {role_counts.min()} - {role_counts.max()}")


Role Distribution:
role
Customer Service Representative    50
Sales Representative               50
Field Technician                   50
Home Service Technician            50
General Manager (Franchise)        50
Name: count, dtype: int64

Total roles: 5
Samples per role: 50 - 50


## 3. Data Cleaning (if needed)

Based on the analysis above, we'll clean any issues found.

In [10]:
# Create cleaned dataset
df_clean = df.copy()

print("Data Cleaning Steps:")
print("=" * 70)

# Remove duplicates if any
initial_count = len(df_clean)
df_clean = df_clean.drop_duplicates(subset=['interview_id'], keep='first')
removed_dupes = initial_count - len(df_clean)
if removed_dupes > 0:
    print(f"✓ Removed {removed_dupes} duplicate rows")
else:
    print("✓ No duplicates to remove")

# Clamp scores to 1-10 range (if needed)
scores_clamped = 0
for metric in METRICS:
    before = df_clean[metric].copy()
    df_clean[metric] = df_clean[metric].clip(lower=1, upper=10)
    scores_clamped += (df_clean[metric] != before).sum()

if scores_clamped > 0:
    print(f"✓ Clamped {scores_clamped} scores to [1-10] range")
else:
    print("✓ All scores already in [1-10] range")

# Remove rows with missing critical data
critical_cols = ['interview_id', 'role', 'qa_text'] + METRICS
before_drop = len(df_clean)
df_clean = df_clean.dropna(subset=critical_cols)
removed_na = before_drop - len(df_clean)
if removed_na > 0:
    print(f"✓ Removed {removed_na} rows with missing data")
else:
    print("✓ No missing data in critical columns")

print(f"\nFinal dataset size: {len(df_clean)} samples")
print(f"Reduction: {len(df) - len(df_clean)} samples ({((len(df) - len(df_clean))/len(df)*100):.1f}%)")

Data Cleaning Steps:
✓ No duplicates to remove
✓ All scores already in [1-10] range
✓ No missing data in critical columns

Final dataset size: 250 samples
Reduction: 0 samples (0.0%)


## 4. Train-Test Split (80-20)

In [11]:
# Perform stratified split by role to maintain distribution
train_df, test_df = train_test_split(
    df_clean,
    test_size=1-TRAIN_TEST_SPLIT,
    random_state=RANDOM_SEED,
    stratify=df_clean['role']
)

print("Train-Test Split:")
print("=" * 70)
print(f"Total samples: {len(df_clean)}")
print(f"Training samples: {len(train_df)} ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f"Test samples: {len(test_df)} ({len(test_df)/len(df_clean)*100:.1f}%)")

print("\nRole distribution in train set:")
print(train_df['role'].value_counts())

print("\nRole distribution in test set:")
print(test_df['role'].value_counts())

Train-Test Split:
Total samples: 250
Training samples: 200 (80.0%)
Test samples: 50 (20.0%)

Role distribution in train set:
role
Field Technician                   40
Customer Service Representative    40
General Manager (Franchise)        40
Sales Representative               40
Home Service Technician            40
Name: count, dtype: int64

Role distribution in test set:
role
Home Service Technician            10
Customer Service Representative    10
Field Technician                   10
Sales Representative               10
General Manager (Franchise)        10
Name: count, dtype: int64


## 5. Format Data for Fine-tuning

OpenAI fine-tuning requires JSONL input, where each line is one JSON object with this structure:
```json
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
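
A minimal sketch of one such line, using placeholder strings rather than real dataset content. It illustrates one detail that is easy to miss: transcripts contain newlines, and `json.dumps` escapes them so that each record still occupies a single physical line.

```python
import json

# Placeholder strings; the real notebook fills these from the dataset.
record = {
    "messages": [
        {"role": "system", "content": "You are an expert interviewer."},
        {"role": "user", "content": "Q: First question?\nA: First answer."},
        {"role": "assistant", "content": json.dumps({"communication": 7})},
    ]
}

# json.dumps escapes the newline inside the transcript, so the record
# stays on one physical line, as JSONL requires.
line = json.dumps(record)

# Reading the line back recovers the original structure.
restored = json.loads(line)
roles = [m["role"] for m in restored["messages"]]
print(roles)
```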

In [12]:
def build_scoring_prompt(qa_text: str, role: str) -> str:
    """Build the scoring prompt matching the inference prompt from kRuns.ipynb"""
    prompt = f"""You are evaluating a candidate interview for the role: {role}

Analyze the candidate's responses using these six metrics (each scored 1-10):

1. **Cognitive Ability (35%)**: Structured thinking, planning, logic, analytical reasoning
2. **Experience (35%)**: Relevant work history (last 10 years), demonstrated skills, accomplishments
3. **Problem Solving (15%)**: Resourcefulness, creative solutions, handling constraints
4. **Reliability (5%)**: Punctuality, follow-through, dependability signals
5. **Professionalism (5%)**: Respect for clients/rules, composure under stress
6. **Communication (5%)**: Clarity and tone (ignore filler words like um, uh, like)

CRITICAL INSTRUCTIONS:
- Return ONLY a valid JSON object
- Use these exact keys: cognitive_ability, experience, problem_solving, reliability, professionalism, communication
- Each value must be an integer from 1 to 10
- Do not include any explanations, just the JSON

Interview Transcript:
--- START TRANSCRIPT ---
{qa_text}
--- END TRANSCRIPT ---

Return your scores in this format:
{{"cognitive_ability":7,"experience":6,"problem_solving":7,"reliability":6,"professionalism":7,"communication":8}}"""
    return prompt

def format_scores_as_json(row: pd.Series) -> str:
    """Format scores as JSON string for assistant response"""
    scores = {
        "cognitive_ability": int(row['cognitive_ability']),
        "experience": int(row['experience']),
        "problem_solving": int(row['problem_solving']),
        "reliability": int(row['reliability']),
        "professionalism": int(row['professionalism']),
        "communication": int(row['communication'])
    }
    return json.dumps(scores)

def create_training_example(row: pd.Series) -> dict:
    """Create a single training example in OpenAI format"""
    system_message = "You are an expert interviewer evaluating candidates. You provide accurate, consistent scoring based on interview transcripts."
    user_message = build_scoring_prompt(row['qa_text'], row['role'])
    assistant_message = format_scores_as_json(row)

    return {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message}
        ]
    }

print("✓ Formatting functions defined")

✓ Formatting functions defined


In [13]:
# Convert training data to JSONL format
print("Converting training data to JSONL format...")

train_examples = []
for idx, row in train_df.iterrows():
    train_examples.append(create_training_example(row))

# Save training file
train_file_path = os.path.join(OUTPUT_DIR, "train.jsonl")
with open(train_file_path, 'w') as f:
    for example in train_examples:
        f.write(json.dumps(example) + '\n')

print(f"✓ Training file saved: {train_file_path}")
print(f"  Total examples: {len(train_examples)}")

# Show sample
print("\nSample training example:")
print(json.dumps(train_examples[0], indent=2)[:1000] + "...")

Converting training data to JSONL format...
✓ Training file saved: /content/drive/MyDrive/finetuning_output/train.jsonl
  Total examples: 200

Sample training example:
{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert interviewer evaluating candidates. You provide accurate, consistent scoring based on interview transcripts."
    },
    {
      "role": "user",
      "content": "You are evaluating a candidate interview for the role: Field Technician\n\nAnalyze the candidate's responses using these six metrics (each scored 1-10):\n\n1. **Cognitive Ability (35%)**: Structured thinking, planning, logic, analytical reasoning\n2. **Experience (35%)**: Relevant work history (last 10 years), demonstrated skills, accomplishments\n3. **Problem Solving (15%)**: Resourcefulness, creative solutions, handling constraints\n4. **Reliability (5%)**: Punctuality, follow-through, dependability signals\n5. **Professionalism (5%)**: Respect for clients/rules, composure unde

In [14]:
# Convert test data to JSONL format (for later evaluation)
print("Converting test data to JSONL format...")

test_examples = []
for idx, row in test_df.iterrows():
    test_examples.append(create_training_example(row))

# Save test file
test_file_path = os.path.join(OUTPUT_DIR, "test.jsonl")
with open(test_file_path, 'w') as f:
    for example in test_examples:
        f.write(json.dumps(example) + '\n')

print(f"✓ Test file saved: {test_file_path}")
print(f"  Total examples: {len(test_examples)}")

# Also save test data as CSV for easier evaluation later
test_csv_path = os.path.join(OUTPUT_DIR, "test_set.csv")
test_df.to_csv(test_csv_path, index=False)
print(f"✓ Test CSV saved: {test_csv_path}")

Converting test data to JSONL format...
✓ Test file saved: /content/drive/MyDrive/finetuning_output/test.jsonl
  Total examples: 50
✓ Test CSV saved: /content/drive/MyDrive/finetuning_output/test_set.csv


## 6. Validate Training Data

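Before uploading, run basic local checks on the data format. The cell below validates only the first 10 training examples: message structure, role order, non-empty content, and that the assistant response is valid JSON with integer scores from 1 to 10.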

In [15]:
# Basic validation checks
print("Validating training data format...")
print("=" * 70)

validation_errors = []

for i, example in enumerate(train_examples[:10]):  # Check first 10
    # Check structure
    if "messages" not in example:
        validation_errors.append(f"Example {i}: Missing 'messages' key")
        continue

    messages = example["messages"]

    # Check message count (should be 3: system, user, assistant)
    if len(messages) != 3:
        validation_errors.append(f"Example {i}: Expected 3 messages, got {len(messages)}")

    # Check roles
    expected_roles = ["system", "user", "assistant"]
    actual_roles = [msg["role"] for msg in messages]
    if actual_roles != expected_roles:
        validation_errors.append(f"Example {i}: Role sequence mismatch. Expected {expected_roles}, got {actual_roles}")

    # Check content exists
    for j, msg in enumerate(messages):
        if "content" not in msg or not msg["content"]:
            validation_errors.append(f"Example {i}, message {j}: Missing or empty content")

    # Validate assistant response is valid JSON
    try:
        assistant_content = messages[2]["content"]
        scores = json.loads(assistant_content)

        # Check all required keys present
        required_keys = set(METRICS)
        actual_keys = set(scores.keys())
        if required_keys != actual_keys:
            validation_errors.append(f"Example {i}: Score keys mismatch. Expected {required_keys}, got {actual_keys}")

        # Check score ranges
        for key, value in scores.items():
            if not isinstance(value, int) or value < 1 or value > 10:
                validation_errors.append(f"Example {i}: Score '{key}' = {value} is invalid (must be int 1-10)")

    except json.JSONDecodeError:
        validation_errors.append(f"Example {i}: Assistant response is not valid JSON")

if validation_errors:
    print("⚠️  Validation errors found:")
    for error in validation_errors[:10]:  # Show first 10 errors
        print(f"  - {error}")
    if len(validation_errors) > 10:
        print(f"  ... and {len(validation_errors) - 10} more errors")
else:
    print("✓ All validation checks passed!")

print(f"\nSamples validated: {min(10, len(train_examples))}")
print(f"Total training samples: {len(train_examples)}")

Validating training data format...
✓ All validation checks passed!

Samples validated: 10
Total training samples: 200
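
The check above covers only the first 10 in-memory examples. As a sketch under stated assumptions, the hypothetical helper `validate_jsonl_lines` below (not part of the original notebook) applies the same kind of checks to every line of a JSONL file and reports problems by line number. The key set mirrors the `METRICS` list from the configuration cell.

```python
import json

# Mirrors METRICS from the configuration cell
REQUIRED_KEYS = {
    "cognitive_ability", "experience", "problem_solving",
    "reliability", "professionalism", "communication",
}
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_lines(lines):
    """Return a list of error strings, one per problem found."""
    errors = []
    for n, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {n}: not valid JSON")
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list):
            errors.append(f"line {n}: missing 'messages' list")
            continue
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if roles != EXPECTED_ROLES:
            errors.append(f"line {n}: unexpected roles {roles}")
            continue
        try:
            scores = json.loads(messages[2]["content"])
        except (json.JSONDecodeError, KeyError, TypeError):
            errors.append(f"line {n}: assistant content is not valid JSON")
            continue
        if not isinstance(scores, dict) or set(scores) != REQUIRED_KEYS:
            errors.append(f"line {n}: score keys mismatch")
            continue
        for key, value in scores.items():
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
                errors.append(f"line {n}: score '{key}' = {value!r} is invalid")
    return errors

# Small self-check with one good line and two bad ones
good = json.dumps({"messages": [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "assistant", "content": json.dumps({k: 5 for k in REQUIRED_KEYS})},
]})
bad_roles = good.replace('"user"', '"assistant"')
demo_errors = validate_jsonl_lines([good, bad_roles, "not json"])
print(demo_errors)
```

In the notebook, the saved file could be checked with `with open(train_file_path) as f: errors = validate_jsonl_lines(f)`.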


## 7. Upload Training File to OpenAI

In [16]:
# Upload training file
print("Uploading training file to OpenAI...")
print("=" * 70)

try:
    with open(train_file_path, 'rb') as f:
        train_file = client.files.create(
            file=f,
            purpose='fine-tune'
        )

    print(f"✓ Training file uploaded successfully!")
    print(f"  File ID: {train_file.id}")
    print(f"  Filename: {train_file.filename}")
    print(f"  Bytes: {train_file.bytes:,}")
    print(f"  Status: {train_file.status}")

    # Save file ID for later reference
    file_info = {
        "train_file_id": train_file.id,
        "filename": train_file.filename,
        "bytes": train_file.bytes,
        "upload_timestamp": datetime.now().isoformat(),
        "num_examples": len(train_examples)
    }

    file_info_path = os.path.join(OUTPUT_DIR, "file_info.json")
    with open(file_info_path, 'w') as f:
        json.dump(file_info, f, indent=2)

    print(f"\n✓ File info saved: {file_info_path}")

except Exception as e:
    print(f"❌ Error uploading file: {e}")
    raise

Uploading training file to OpenAI...
✓ Training file uploaded successfully!
  File ID: file-9AY2fj4MCN9kJtnYMzZUeh
  Filename: train.jsonl
  Bytes: 1,708,822
  Status: processed

✓ File info saved: /content/drive/MyDrive/finetuning_output/file_info.json


## 8. Create Fine-tuning Job

In [17]:
# Create fine-tuning job
print("Creating fine-tuning job...")
print("=" * 70)

try:
    job = client.fine_tuning.jobs.create(
        training_file=train_file.id,
        model=MODEL_TO_FINETUNE,
        suffix="interview-scorer",  # Will be appended to model name
        hyperparameters={
            "n_epochs": "auto"  # Let OpenAI determine optimal epochs
        }
    )

    print(f"✓ Fine-tuning job created successfully!")
    print(f"  Job ID: {job.id}")
    print(f"  Model: {job.model}")
    print(f"  Status: {job.status}")
    print(f"  Created at: {datetime.fromtimestamp(job.created_at)}")

    # Save job info
    job_info = {
        "job_id": job.id,
        "model": job.model,
        "status": job.status,
        "created_at": job.created_at,
        "training_file": job.training_file,
        "hyperparameters": job.hyperparameters.to_dict() if hasattr(job.hyperparameters, 'to_dict') else {}
    }

    job_info_path = os.path.join(OUTPUT_DIR, "job_info.json")
    with open(job_info_path, 'w') as f:
        json.dump(job_info, f, indent=2)

    print(f"\n✓ Job info saved: {job_info_path}")
    print(f"\n⏳ Fine-tuning is now in progress. This may take 10-30 minutes.")
    print(f"   You can monitor progress in the next cell or at:")
    print(f"   https://platform.openai.com/finetune/{job.id}")

except Exception as e:
    print(f"❌ Error creating fine-tuning job: {e}")
    raise

Creating fine-tuning job...
✓ Fine-tuning job created successfully!
  Job ID: ftjob-t9Yi16FKGtHzeZsY7zAPq7KU
  Model: gpt-4o-2024-08-06
  Status: validating_files
  Created at: 2025-11-30 19:04:04

✓ Job info saved: /content/drive/MyDrive/finetuning_output/job_info.json

⏳ Fine-tuning is now in progress. This may take 10-30 minutes.
   You can monitor progress in the next cell or at:
   https://platform.openai.com/finetune/ftjob-t9Yi16FKGtHzeZsY7zAPq7KU


## 9. Monitor Fine-tuning Progress

This cell will check the job status periodically until completion.

In [18]:
# Monitor fine-tuning progress
print("Monitoring fine-tuning job...")
print("=" * 70)
print(f"Job ID: {job.id}")
print("This will check status every 60 seconds.\n")

from IPython.display import clear_output  # time is already imported in the Imports cell

while True:
    try:
        # Retrieve current job status
        current_job = client.fine_tuning.jobs.retrieve(job.id)

        clear_output(wait=True)

        print("Fine-tuning Job Status")
        print("=" * 70)
        print(f"Job ID: {current_job.id}")
        print(f"Status: {current_job.status}")
        print(f"Model: {current_job.model}")
        print(f"Created: {datetime.fromtimestamp(current_job.created_at)}")

        if current_job.finished_at:
            print(f"Finished: {datetime.fromtimestamp(current_job.finished_at)}")
            duration = current_job.finished_at - current_job.created_at
            print(f"Duration: {duration // 60} minutes {duration % 60} seconds")

        if current_job.fine_tuned_model:
            print(f"\n✓ Fine-tuned Model: {current_job.fine_tuned_model}")

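        # The API can return an Error object whose fields are all None even when
        # the job succeeds, so check the fields rather than the object's truthiness.
        error = current_job.error
        if error and (getattr(error, "code", None) or getattr(error, "message", None)):
            print(f"\n❌ Error: {error}")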

        # Get training metrics if available
        if hasattr(current_job, 'result_files') and current_job.result_files:
            print(f"\nResult files: {current_job.result_files}")

        # Check if job is complete
        if current_job.status in ['succeeded', 'failed', 'cancelled']:
            print("\n" + "=" * 70)
            if current_job.status == 'succeeded':
                print("✓ Fine-tuning completed successfully!")
                print(f"\nYour fine-tuned model: {current_job.fine_tuned_model}")
                print("\nYou can now use this model for inference in the next section.")

                # Save final job info
                final_job_info = {
                    "job_id": current_job.id,
                    "status": current_job.status,
                    "fine_tuned_model": current_job.fine_tuned_model,
                    "created_at": current_job.created_at,
                    "finished_at": current_job.finished_at,
                    "trained_tokens": getattr(current_job, 'trained_tokens', None),
                    "hyperparameters": current_job.hyperparameters.to_dict() if hasattr(current_job.hyperparameters, 'to_dict') else {}
                }

                final_job_path = os.path.join(OUTPUT_DIR, "final_job_info.json")
                with open(final_job_path, 'w') as f:
                    json.dump(final_job_info, f, indent=2)
                print(f"\nFinal job info saved: {final_job_path}")

            else:
                print(f"❌ Fine-tuning {current_job.status}")
            break

        print("\nNext check in 60 seconds...")
        time.sleep(60)

    except KeyboardInterrupt:
        print("\n\nMonitoring stopped by user.")
        print(f"Job is still running. Check status at: https://platform.openai.com/finetune/{job.id}")
        break
    except Exception as e:
        print(f"\n❌ Error checking job status: {e}")
        print("Retrying in 60 seconds...")
        time.sleep(60)

Fine-tuning Job Status
Job ID: ftjob-t9Yi16FKGtHzeZsY7zAPq7KU
Status: succeeded
Model: gpt-4o-2024-08-06
Created: 2025-11-30 19:04:04
Finished: 2025-11-30 19:29:03
Duration: 24 minutes 59 seconds

✓ Fine-tuned Model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh

❌ Error: Error(code=None, message=None, param=None)

Result files: ['file-PmthLM9WcYswwgCeTJvPcn']

✓ Fine-tuning completed successfully!

Your fine-tuned model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh

You can now use this model for inference in the next section.

Final job info saved: /content/drive/MyDrive/finetuning_output/final_job_info.json
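
The monitoring loop above runs until the job reaches a terminal status, with no upper bound on waiting time. The following is a sketch of the same polling pattern with a timeout, written as a hypothetical helper `wait_for_terminal_status` that is not part of the original notebook. The status source, sleep function, and clock are passed in so the logic can be exercised without calling the API.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_terminal_status(fetch_status, poll_seconds=60.0, timeout_seconds=3600.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)

# Demo with a fake status source and a no-op sleep
fake_statuses = iter(["validating_files", "running", "running", "succeeded"])
final = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda s: None)
print(final)
```

In the notebook, `fetch_status` could be `lambda: client.fine_tuning.jobs.retrieve(job.id).status`, which is the same call the loop above makes.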


## 10. Test Fine-tuned Model (Single Inference)

Run a single test inference to verify the model works.

In [22]:
# Load the fine-tuned model name
# If you're running this cell separately, update this with your model name
FINE_TUNED_MODEL = current_job.fine_tuned_model  # From previous cell
# Or manually set it: FINE_TUNED_MODEL = "ft:gpt-4o-2024-08-06:...:...:..."

print(f"Testing fine-tuned model: {FINE_TUNED_MODEL}")
print("=" * 70)

Testing fine-tuned model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh


In [20]:
# Select a test sample
test_sample = test_df.iloc[0]

print("Test Sample:")
print(f"Interview ID: {test_sample['interview_id']}")
print(f"Role: {test_sample['role']}")
print(f"\nTranscript (first 500 chars):")
print(test_sample['qa_text'][:500] + "...")

print("\n" + "=" * 70)
print("Ground Truth Scores:")
for metric in METRICS:
    print(f"  {metric}: {test_sample[metric]}")

Test Sample:
Interview ID: home_service_technician_002
Role: Home Service Technician

Transcript (first 500 chars):
Q: Do you have a valid driver's license?
A: Yes, I do have a valid driver's license. I've had it for over five years since I got my first job doing deliveries for a home appliance store.

Q: Have you worked with HVAC systems for at least 2 years?
A: Yes, I've been working with HVAC systems for about 3 years. I started as an apprentice and have since worked on various types of systems, including residential and commercial units.

Q: Are you available to work on weekends?
A: Yes, I'm available to ...

Ground Truth Scores:
  cognitive_ability: 8
  experience: 8
  problem_solving: 9
  reliability: 7
  professionalism: 7
  communication: 6


In [21]:
# Run inference
print("\nRunning inference with fine-tuned model...\n")

system_message = "You are an expert interviewer evaluating candidates. You provide accurate, consistent scoring based on interview transcripts."
user_message = build_scoring_prompt(test_sample['qa_text'], test_sample['role'])

try:
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message}
        ],
        temperature=0.0,
        max_tokens=512
    )

    predicted_response = response.choices[0].message.content
    print("Model Response:")
    print(predicted_response)

    # Parse and compare
    try:
        predicted_scores = json.loads(predicted_response)

        print("\n" + "=" * 70)
        print("Comparison (Ground Truth vs Predicted):")
        print("=" * 70)
        print(f"{'Metric':<20} {'Truth':>6} {'Predicted':>10} {'Diff':>6}")
        print("-" * 70)

        total_diff = 0
        for metric in METRICS:
            truth = test_sample[metric]
            pred = predicted_scores.get(metric, 0)
            diff = pred - truth
            total_diff += abs(diff)
            print(f"{metric:<20} {truth:>6} {pred:>10} {diff:>+6}")

        print("-" * 70)
        print(f"Mean Absolute Error: {total_diff / len(METRICS):.2f}")

    except json.JSONDecodeError:
        print("\n⚠️  Response is not valid JSON")

except Exception as e:
    print(f"❌ Error during inference: {e}")


Running inference with fine-tuned model...

Model Response:
{"cognitive_ability": 8, "experience": 8, "problem_solving": 9, "reliability": 7, "professionalism": 6, "communication": 4}

Comparison (Ground Truth vs Predicted):
Metric                Truth  Predicted   Diff
----------------------------------------------------------------------
cognitive_ability         8          8     +0
experience                8          8     +0
problem_solving           9          9     +0
reliability               7          7     +0
professionalism           7          6     -1
communication             6          4     -2
----------------------------------------------------------------------
Mean Absolute Error: 0.50
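
The scoring prompt assigns percentage weights to the six metrics, while the mean absolute error above treats them equally. As an illustration only, the sketch below combines the scores into a single weighted value. The weights are copied from the prompt; combining them as a linear weighted sum is an assumption, since the notebook never computes a composite score.

```python
# Weights copied from the scoring prompt; the linear combination is an assumption.
WEIGHTS = {
    "cognitive_ability": 0.35,
    "experience": 0.35,
    "problem_solving": 0.15,
    "reliability": 0.05,
    "professionalism": 0.05,
    "communication": 0.05,
}

def weighted_score(scores):
    """Weighted sum of the six metric scores."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

# Values from the single-sample comparison above
truth = {"cognitive_ability": 8, "experience": 8, "problem_solving": 9,
         "reliability": 7, "professionalism": 7, "communication": 6}
predicted = {"cognitive_ability": 8, "experience": 8, "problem_solving": 9,
             "reliability": 7, "professionalism": 6, "communication": 4}

print(f"truth={weighted_score(truth):.2f} predicted={weighted_score(predicted):.2f}")
```

Under this assumption the two composites work out to 7.95 and 7.80, because both misses fall on metrics weighted at 5%.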


## 11. Evaluation on Test Set

In [26]:
# Helper function for inference with latency tracking
def run_single_inference(qa_text: str, role: str, model_name: str) -> tuple[Dict[str, int], float]:
    """Run inference on a single sample and return (scores, latency_ms)"""
    system_message = "You are an expert interviewer evaluating candidates. You provide accurate, consistent scoring based on interview transcripts."
    user_message = build_scoring_prompt(qa_text, role)

    try:
        start_time = time.time()

        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0.0,
            max_tokens=512
        )

        latency_ms = (time.time() - start_time) * 1000

        response_text = response.choices[0].message.content.strip()

        # Clean response (remove markdown code blocks if present)
        if response_text.startswith("```"):
            response_text = response_text.split("```")[1]
            if response_text.startswith("json"):
                response_text = response_text[4:]
            response_text = response_text.strip()

        scores = json.loads(response_text)

        # Validate and clamp scores
        validated_scores = {}
        for metric in METRICS:
            if metric in scores:
                validated_scores[metric] = max(1, min(10, int(scores[metric])))
            else:
                print(f"⚠️  Missing metric '{metric}', defaulting to 5")
                validated_scores[metric] = 5

        return validated_scores, latency_ms

    # start_time is always set here: it is the first statement in the try block.
    # Note that failed samples fall back to a score of 5 for every metric,
    # which affects the evaluation metrics computed later.
    except json.JSONDecodeError as e:
        latency_ms = (time.time() - start_time) * 1000
        print(f"❌ JSON decode error: {e}")
        return {metric: 5 for metric in METRICS}, latency_ms
    except Exception as e:
        latency_ms = (time.time() - start_time) * 1000
        print(f"❌ Error: {e}")
        return {metric: 5 for metric in METRICS}, latency_ms

print("✓ Inference function loaded")

✓ Inference function loaded


In [27]:
# Run inference on all test samples
print("=" * 70)
print(f"Running inference on {len(test_df)} test samples...")
print("=" * 70)
print("This will take approximately 1-2 minutes\n")

predictions = []
latencies = []
start_time = time.time()

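# Use a positional counter: after the shuffled split, the DataFrame index holds
# the original row labels, not positions 0..len(test_df)-1.
for i, (_, row) in enumerate(test_df.iterrows(), start=1):
    print(f"Processing {i}/{len(test_df)}: {row['interview_id']}", end="\r")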

    pred_scores, latency = run_single_inference(row['qa_text'], row['role'], FINE_TUNED_MODEL)
    predictions.append(pred_scores)
    latencies.append(latency)

    # Small delay to avoid rate limits
    time.sleep(0.1)

elapsed_time = time.time() - start_time
print(f"\n\n✓ Completed in {elapsed_time:.1f} seconds ({elapsed_time/len(test_df):.2f}s per sample)")

# Latency statistics
avg_latency = np.mean(latencies)
p50_latency = np.percentile(latencies, 50)
p95_latency = np.percentile(latencies, 95)
p99_latency = np.percentile(latencies, 99)

print("\nLatency Statistics:")
print("-" * 70)
print(f"Mean:   {avg_latency:.0f} ms")
print(f"Median: {p50_latency:.0f} ms")
print(f"P95:    {p95_latency:.0f} ms")
print(f"P99:    {p99_latency:.0f} ms")
print(f"Min:    {min(latencies):.0f} ms")
print(f"Max:    {max(latencies):.0f} ms")

Running inference on 50 test samples...
This will take approximately 1-2 minutes



✓ Completed in 107.1 seconds (2.14s per sample)

Latency Statistics:
----------------------------------------------------------------------
Mean:   2041 ms
Median: 1686 ms
P95:    3859 ms
P99:    4360 ms
Min:    949 ms
Max:    4475 ms
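
With only 50 samples, the P99 figure is determined almost entirely by the two slowest calls. `np.percentile` uses linear interpolation by default, so P99 falls between the 49th and 50th ordered values. The sketch below reproduces that calculation by hand on made-up latencies (not the notebook's data) to show where the number comes from.

```python
import numpy as np

# Illustrative latencies in milliseconds (made-up values, not the notebook's data).
latencies_demo = np.arange(1, 51) * 100.0   # 100, 200, ..., 5000

# NumPy's default method interpolates linearly between the two order
# statistics surrounding the fractional rank q/100 * (n - 1).
n = len(latencies_demo)
rank = 0.99 * (n - 1)                        # 48.51 for n = 50
lower = int(np.floor(rank))
frac = rank - lower
ordered = np.sort(latencies_demo)
manual_p99 = ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])

# The hand calculation agrees with NumPy's result.
assert np.isclose(manual_p99, np.percentile(latencies_demo, 99))
print(f"P99 = {manual_p99:.0f} ms, max = {ordered[-1]:.0f} ms")
```

For a test set this small, the median and P95 are the more stable figures; P99 is best read as "close to the slowest call".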


In [28]:
# Calculate evaluation metrics
print("\n" + "=" * 70)
print("EVALUATION METRICS")
print("=" * 70)

# Per-metric results
results = {}
for metric in METRICS:
    true_values = test_df[metric].values
    pred_values = np.array([p[metric] for p in predictions])

    mae = np.mean(np.abs(true_values - pred_values))
    rmse = np.sqrt(np.mean((true_values - pred_values) ** 2))
    exact_match = np.mean(true_values == pred_values)
    within_1 = np.mean(np.abs(true_values - pred_values) <= 1)
    within_2 = np.mean(np.abs(true_values - pred_values) <= 2)

    results[metric] = {
        'mae': mae,
        'rmse': rmse,
        'exact_match': exact_match,
        'within_1': within_1,
        'within_2': within_2
    }

# Display per-metric results
print("\nPer-Metric Performance:")
print("-" * 70)
print(f"{'Metric':<20} {'MAE':>8} {'RMSE':>8} {'Exact':>8} {'±1':>8} {'±2':>8}")
print("-" * 70)

for metric in METRICS:
    m = results[metric]
    print(f"{metric:<20} {m['mae']:>8.2f} {m['rmse']:>8.2f} "
          f"{m['exact_match']*100:>7.1f}% {m['within_1']*100:>7.1f}% {m['within_2']*100:>7.1f}%")

# Overall metrics
all_true = test_df[METRICS].values.flatten()
all_pred = np.array([[p[m] for m in METRICS] for p in predictions]).flatten()

overall_mae = np.mean(np.abs(all_true - all_pred))
overall_rmse = np.sqrt(np.mean((all_true - all_pred) ** 2))
overall_exact = np.mean(all_true == all_pred)
overall_within_1 = np.mean(np.abs(all_true - all_pred) <= 1)
overall_within_2 = np.mean(np.abs(all_true - all_pred) <= 2)

print("\n" + "=" * 70)
print("Overall Performance (all metrics combined):")
print("-" * 70)
print(f"Mean Absolute Error (MAE):  {overall_mae:.3f}")
print(f"Root Mean Squared Error:     {overall_rmse:.3f}")
print(f"Exact Match Accuracy:        {overall_exact*100:.1f}%")
print(f"Within ±1 Accuracy:          {overall_within_1*100:.1f}%")
print(f"Within ±2 Accuracy:          {overall_within_2*100:.1f}%")


EVALUATION METRICS

Per-Metric Performance:
----------------------------------------------------------------------
Metric                    MAE     RMSE    Exact       ±1       ±2
----------------------------------------------------------------------
cognitive_ability        0.20     0.63    90.0%    90.0%   100.0%
experience               0.18     0.73    94.0%    94.0%    94.0%
problem_solving          0.88     1.55    66.0%    68.0%    78.0%
reliability              0.72     1.26    66.0%    66.0%    98.0%
professionalism          0.50     0.88    64.0%    86.0%   100.0%
communication            0.78     1.19    50.0%    76.0%    96.0%

Overall Performance (all metrics combined):
----------------------------------------------------------------------
Mean Absolute Error (MAE):  0.543
Root Mean Squared Error:     1.091
Exact Match Accuracy:        71.7%
Within ±1 Accuracy:          80.0%
Within ±2 Accuracy:          94.3%
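
To make the table easier to read, here is a minimal sketch of the same five statistics computed on a toy example. The function name `agreement_metrics` and the toy scores are illustrative and not part of this notebook. The toy data has nine exact matches and one miss by 2 points, which gives the same pattern as the `cognitive_ability` row (exact and ±1 both at 90%, ±2 at 100%): a row like that means every miss was off by exactly 2.

```python
import numpy as np

def agreement_metrics(true_values, pred_values):
    """MAE, RMSE, and exact / within-k rates for two equal-length score arrays."""
    true_arr = np.asarray(true_values, dtype=float)
    pred_arr = np.asarray(pred_values, dtype=float)
    abs_err = np.abs(true_arr - pred_arr)
    return {
        "mae": float(abs_err.mean()),
        "rmse": float(np.sqrt((abs_err ** 2).mean())),
        "exact_match": float((abs_err == 0).mean()),
        "within_1": float((abs_err <= 1).mean()),
        "within_2": float((abs_err <= 2).mean()),
    }

# Toy example (made-up scores): nine exact matches and one miss by 2 points.
toy_true = [6, 5, 9, 3, 4, 4, 6, 5, 9, 3]
toy_pred = [6, 5, 9, 3, 4, 4, 6, 5, 9, 5]
toy_result = agreement_metrics(toy_true, toy_pred)
print(toy_result)

# Every metric is scored on the same 50 samples, so the overall MAE is the
# plain average of the per-metric MAEs reported in the table above.
per_metric_mae = [0.20, 0.18, 0.88, 0.72, 0.50, 0.78]
overall_from_table = sum(per_metric_mae) / len(per_metric_mae)
print(f"{overall_from_table:.3f}")
```

The same averaging holds for the exact-match and within-k rates, which is a quick consistency check on the reported overall figures.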


In [29]:
# Create detailed results dataframe
results_df = test_df.copy()

# Add predictions and errors
for metric in METRICS:
    results_df[f'predicted_{metric}'] = [p[metric] for p in predictions]
    results_df[f'error_{metric}'] = results_df[f'predicted_{metric}'] - results_df[metric]

results_df['total_abs_error'] = sum(abs(results_df[f'error_{m}']) for m in METRICS)

# Add latency data
results_df['latency_ms'] = latencies

# Save predictions
predictions_path = os.path.join(OUTPUT_DIR, "test_predictions.csv")
results_df.to_csv(predictions_path, index=False)
print(f"\n✓ Detailed predictions saved: {predictions_path}")

# Save metrics as JSON
metrics_data = {
    'model': FINE_TUNED_MODEL,
    'test_samples': len(test_df),
    'timestamp': datetime.now().isoformat(),
    'per_metric': results,
    'overall': {
        'mae': overall_mae,
        'rmse': overall_rmse,
        'exact_match': overall_exact,
        'within_1': overall_within_1,
        'within_2': overall_within_2
    },
    'latency': {
        'mean_ms': float(np.mean(latencies)),
        'median_ms': float(np.median(latencies)),
        'p95_ms': float(np.percentile(latencies, 95)),
        'p99_ms': float(np.percentile(latencies, 99)),
        'min_ms': float(min(latencies)),
        'max_ms': float(max(latencies))
    }
}

metrics_path = os.path.join(OUTPUT_DIR, "evaluation_metrics.json")
with open(metrics_path, 'w') as f:
    json.dump(metrics_data, f, indent=2)
print(f"✓ Metrics saved: {metrics_path}")


✓ Detailed predictions saved: /content/drive/MyDrive/finetuning_output/test_predictions.csv
✓ Metrics saved: /content/drive/MyDrive/finetuning_output/evaluation_metrics.json
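
The `json.dump` call above works because the stored statistics are `np.float64` values, and `np.float64` subclasses Python's `float`. NumPy integers and booleans do not get the same treatment and raise a `TypeError`. If counts or flags computed with NumPy are added to `metrics_data` later, a `default=` hook is one way to handle them. The helper name `to_builtin` and the payload below are illustrative assumptions.

```python
import json
import numpy as np

def to_builtin(value):
    """Convert NumPy scalars and arrays to plain Python objects for json.dump."""
    if isinstance(value, np.generic):      # np.int64, np.float64, np.bool_, ...
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

# Made-up payload mixing NumPy scalar types.
payload = {"mae": np.float64(0.5), "n_samples": np.int64(50), "exact": np.bool_(True)}

# json only calls `default` for objects it cannot encode on its own.
encoded = json.dumps(payload, default=to_builtin)
print(encoded)
```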


In [30]:
# Error Analysis
print("\n" + "=" * 70)
print("ERROR ANALYSIS")
print("=" * 70)

# Find worst predictions
worst_5 = results_df.nlargest(5, 'total_abs_error')

print("\nTop 5 Worst Predictions (by total absolute error):")
print("-" * 70)

for rank, (_, row) in enumerate(worst_5.iterrows(), start=1):
    print(f"\n#{rank}. Interview ID: {row['interview_id']}")
    print(f"   Role: {row['role']}")
    print(f"   Total Absolute Error: {row['total_abs_error']:.0f}")
    print("   Scores (Truth → Predicted):")
    for metric in METRICS:
        error_indicator = "✓" if abs(row[f'error_{metric}']) <= 1 else "✗"
        print(f"     {error_indicator} {metric:<20}: {row[metric]:>2} → {row[f'predicted_{metric}']:>2} (error: {row[f'error_{metric}']:+3.0f})")


ERROR ANALYSIS

Top 5 Worst Predictions (by total absolute error):
----------------------------------------------------------------------

#1. Interview ID: customer_service_representative_015
   Role: Customer Service Representative
   Total Absolute Error: 8
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  6 →  6 (error:  +0)
     ✓ experience          :  5 →  5 (error:  +0)
     ✗ problem_solving     :  9 →  6 (error:  -3)
     ✗ reliability         :  3 →  5 (error:  +2)
     ✓ professionalism     :  4 →  4 (error:  +0)
     ✗ communication       :  4 →  1 (error:  -3)

#2. Interview ID: general_manager_(franchise)_028
   Role: General Manager (Franchise)
   Total Absolute Error: 8
   Scores (Truth → Predicted):
     ✗ cognitive_ability   :  4 →  6 (error:  +2)
     ✓ experience          :  8 →  8 (error:  +0)
     ✗ problem_solving     :  9 →  6 (error:  -3)
     ✓ reliability         :  5 →  5 (error:  +0)
     ✗ professionalism     :  4 →  2 (error:  -2)
     ✓ comm
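
The `error_*` columns are computed as predicted minus truth, so a negative value means the model under-scored the candidate. Both cases listed above show `problem_solving` going from 9 to 6, which may point to a systematic bias rather than random noise. A minimal sketch for checking this is to compare the mean signed error with the mean absolute error per metric. The data below is made up for illustration; on the real data the same expressions would run over `results_df` and the full `METRICS` list.

```python
import pandas as pd

# Two metrics only, to keep the illustration short.
DEMO_METRICS = ["problem_solving", "communication"]

# Made-up signed errors (predicted minus truth) for four interviews.
toy_errors = pd.DataFrame({
    "error_problem_solving": [-3, -3, 0, 0],
    "error_communication": [-3, 0, 1, 0],
})

bias = pd.DataFrame(
    {
        # Mean signed error: far from zero suggests consistent over- or under-scoring.
        "mean_error": [toy_errors[f"error_{m}"].mean() for m in DEMO_METRICS],
        # Mean absolute error: overall size of the misses, ignoring direction.
        "mean_abs_error": [toy_errors[f"error_{m}"].abs().mean() for m in DEMO_METRICS],
    },
    index=DEMO_METRICS,
)
print(bias)
```

When the two columns have nearly the same magnitude, almost all misses go in one direction; when the signed mean is near zero, the errors cancel out and look more like noise.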

In [31]:
# Show some examples of good predictions
best_5 = results_df.nsmallest(5, 'total_abs_error')

print("\n" + "=" * 70)
print("Top 5 Best Predictions (by total absolute error):")
print("-" * 70)

for rank, (_, row) in enumerate(best_5.iterrows(), start=1):
    print(f"\n#{rank}. Interview ID: {row['interview_id']}")
    print(f"   Role: {row['role']}")
    print(f"   Total Absolute Error: {row['total_abs_error']:.0f}")
    print("   Scores (Truth → Predicted):")
    for metric in METRICS:
        error_indicator = "✓" if abs(row[f'error_{metric}']) == 0 else "~"
        print(f"     {error_indicator} {metric:<20}: {row[metric]:>2} → {row[f'predicted_{metric}']:>2} (error: {row[f'error_{metric}']:+3.0f})")


Top 5 Best Predictions (by total absolute error):
----------------------------------------------------------------------

#1. Interview ID: field_technician_002
   Role: Field Technician
   Total Absolute Error: 0
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  6 →  6 (error:  +0)
     ✓ experience          :  5 →  5 (error:  +0)
     ✓ problem_solving     :  6 →  6 (error:  +0)
     ✓ reliability         :  3 →  3 (error:  +0)
     ✓ professionalism     :  2 →  2 (error:  +0)
     ✓ communication       :  1 →  1 (error:  +0)

#2. Interview ID: field_technician_008
   Role: Field Technician
   Total Absolute Error: 0
   Scores (Truth → Predicted):
     ✓ cognitive_ability   :  6 →  6 (error:  +0)
     ✓ experience          :  5 →  5 (error:  +0)
     ✓ problem_solving     :  9 →  9 (error:  +0)
     ✓ reliability         :  3 →  3 (error:  +0)
     ✓ professionalism     :  2 →  2 (error:  +0)
     ✓ communication       :  1 →  1 (error:  +0)

#3. Interview ID: general_man

## 12. Summary & Next Steps

In [32]:
print("Fine-tuning Summary")
print("=" * 70)
print(f"\nDataset:")
print(f"  Total samples: {len(df_clean)}")
print(f"  Training samples: {len(train_df)}")
print(f"  Test samples: {len(test_df)}")
print(f"\nFiles created:")
print(f"  Train JSONL: {train_file_path}")
print(f"  Test JSONL: {test_file_path}")
print(f"  Test CSV: {test_csv_path}")
print(f"\nOpenAI Resources:")
print(f"  Training file ID: {train_file.id}")
print(f"  Job ID: {job.id}")
if hasattr(current_job, 'fine_tuned_model') and current_job.fine_tuned_model:
    print(f"  Fine-tuned model: {current_job.fine_tuned_model}")
print(f"\nOutput directory: {OUTPUT_DIR}")
print("\n" + "=" * 70)
print("Next Steps:")
print("  1. For full evaluation, use the kRuns.ipynb notebook with your fine-tuned model")
print("  2. Update CURRENT_MODEL in kRuns.ipynb to your fine-tuned model name")
print("  3. Run inference ONCE per test sample (no self-consistency needed for fine-tuned)")
print("  4. Compare performance against base GPT-4o")
print("=" * 70)

Fine-tuning Summary

Dataset:
  Total samples: 250
  Training samples: 200
  Test samples: 50

Files created:
  Train JSONL: /content/drive/MyDrive/finetuning_output/train.jsonl
  Test JSONL: /content/drive/MyDrive/finetuning_output/test.jsonl
  Test CSV: /content/drive/MyDrive/finetuning_output/test_set.csv

OpenAI Resources:
  Training file ID: file-9AY2fj4MCN9kJtnYMzZUeh
  Job ID: ftjob-t9Yi16FKGtHzeZsY7zAPq7KU
  Fine-tuned model: ft:gpt-4o-2024-08-06:cleanagent:interview-scorer:ChhOUzkh

Output directory: /content/drive/MyDrive/finetuning_output

Next Steps:
  1. For full evaluation, use the kRuns.ipynb notebook with your fine-tuned model
  2. Update CURRENT_MODEL in kRuns.ipynb to your fine-tuned model name
  3. Run inference ONCE per test sample (no self-consistency needed for fine-tuned)
  4. Compare performance against base GPT-4o
