# Comparative Evaluation: Race Vector vs Prompt Engineering

This notebook evaluates our Latent Race Vector method against a standard Prompt Engineering baseline.

We measure:
- **Identity Preservation:** Face similarity, LPIPS
- **Structural Consistency:** SSIM, Pose difference
- **Disentanglement:** Metric scores
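
As a rough illustration of the identity-preservation idea, face similarity is often computed as the cosine similarity between face embeddings of the source and edited images. The sketch below assumes that definition; it is not confirmed to be what `ComparativeEvaluator` does, and the function name `cosine_similarity` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for face embeddings (assumption:
# a real pipeline would get these from a face recognition model).
source = [0.2, 0.1, 0.7, 0.4]
edited = [0.25, 0.05, 0.65, 0.45]
unrelated = [-0.6, 0.8, -0.1, 0.0]

same_identity = cosine_similarity(source, edited)
different_identity = cosine_similarity(source, unrelated)
```

Under this definition, a score near 1 suggests the edit kept the identity, while a lower score suggests identity drift. LPIPS works the other way around: it is a perceptual distance, so lower values mean the edited image stays closer to the source.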

In [None]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, Image

# Add the project root to the path so the `src` package can be imported
sys.path.insert(0, '../../')

from src.evaluation.comparative_eval import ComparativeEvaluator

%matplotlib inline

## 1. Run Evaluation Pipeline
This step is slow because it generates images and then computes metrics on them.

In [None]:
evaluator = ComparativeEvaluator()

# Run the comparison (fewer samples run faster; more samples give more reliable averages)
df_results = evaluator.run_comparison(num_samples=10)

## 2. Quantitative Results
Mean of each metric, grouped by method and target race.

In [None]:
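# Load saved results only if the pipeline was not run in this session;
# otherwise a stale CSV would silently replace the fresh df_results
results_path = '../../experiments/comparative_results/comparative_metrics.csv'
if 'df_results' not in globals():
    # Raises FileNotFoundError if no saved results exist either
    df_results = pd.read_csv(results_path)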

# Group by method and target race
summary = df_results.groupby(['method', 'target_race'])[
    ['face_similarity', 'background_ssim', 'overall_score', 'overall_ssim', 'lpips']
].mean()

display(summary)
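
The grouped table stacks the two methods in separate rows, which makes direct comparison harder. One possible follow-up, sketched here on made-up toy numbers (not results from this evaluation), is to unstack the `method` level so both methods sit side by side. The label values `"vector"` and `"prompt"` and the target labels `"A"` and `"B"` are assumptions; the actual values in `comparative_metrics.csv` may differ.

```python
import pandas as pd

# Toy rows with invented numbers, for illustration only.
toy = pd.DataFrame({
    "method": ["vector", "vector", "prompt", "prompt"],
    "target_race": ["A", "B", "A", "B"],
    "face_similarity": [0.80, 0.70, 0.60, 0.50],
    "lpips": [0.20, 0.30, 0.40, 0.50],
})

per_method = toy.groupby(["method", "target_race"])[["face_similarity", "lpips"]].mean()

# Move `method` from the row index into the columns so the methods sit side by side.
side_by_side = per_method.unstack("method")

# Per-target gap (vector minus prompt) for a single metric.
face_gap = side_by_side["face_similarity"]["vector"] - side_by_side["face_similarity"]["prompt"]
```

When reading such gaps, keep the direction of each metric in mind: higher is better for similarity scores such as SSIM, while lower is better for LPIPS.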

## 3. Visual Qualitative Comparison

In [None]:
import glob

output_dir = '../../experiments/comparative_results'
vector_samples = sorted(glob.glob(f"{output_dir}/vector_sample_*.png"))[:4]
prompt_samples = sorted(glob.glob(f"{output_dir}/prompt_sample_*.png"))[:4]

print("Vector Modification Samples:")
for p in vector_samples:
    display(Image(filename=p))

print("\nPrompt Engineering Samples:")
for p in prompt_samples:
    display(Image(filename=p))
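
One caveat about the cell above: `sorted` orders paths lexicographically, so if the file names end in an unpadded index (an assumption, since the exact naming pattern is not shown here), `sample_10` would sort before `sample_2` and the first four paths might not be samples 0 to 3. A minimal sketch of a natural-sort key, with the hypothetical helper name `natural_key`:

```python
import re


def natural_key(path):
    """Split a path into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", path)]


# Hypothetical file names following the vector_sample_*.png pattern.
paths = [
    "out/vector_sample_10.png",
    "out/vector_sample_2.png",
    "out/vector_sample_1.png",
]

lexicographic = sorted(paths)
natural = sorted(paths, key=natural_key)
```

Passing `key=natural_key` to the `sorted` calls above would keep the vector and prompt lists in the same numeric order, so that the displayed samples correspond to each other. If the indices are already zero-padded, the plain `sorted` call is sufficient.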