# LookBench Model Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SerendipityOneInc/look-bench/blob/main/notebooks/02_model_evaluation.ipynb)

This notebook walks through a **complete model evaluation** on the LookBench dataset: loading a subset, extracting features, ranking the gallery by cosine similarity, and computing every metric listed below.

We'll evaluate models using:
- **Recall@K** (K=1, 5, 10, 20)
- **MRR** (Mean Reciprocal Rank)
- **NDCG@5** (Normalized Discounted Cumulative Gain)
- **mAP** (Mean Average Precision)
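
For intuition, the four metrics can be sketched for a single query as shown here. This is a minimal illustration using common textbook definitions with binary relevance; the function names are hypothetical, and the evaluator classes shipped in the repository may differ in detail (for example, in how Recall@K treats several relevant items).

```python
import math

def recall_at_k(relevant, k):
    """Hit-rate form: 1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0

def reciprocal_rank(relevant):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, is_rel in enumerate(relevant[:k], start=1) if is_rel)
    n_rel = min(sum(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_rel + 1))
    return dcg / ideal if ideal > 0 else 0.0

def average_precision(relevant):
    """Mean of precision@rank taken at each relevant position."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranking: True marks a gallery item that matches the query.
ranking = [False, True, False, True, False]
print(recall_at_k(ranking, 1), recall_at_k(ranking, 5))
print(reciprocal_rank(ranking), average_precision(ranking))
print(round(ndcg_at_k(ranking, 5), 4))
```

MRR and mAP are then the means of `reciprocal_rank` and `average_precision` over all queries.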

📄 **Paper**: [arxiv.org/abs/2601.14706](https://arxiv.org/abs/2601.14706)

## Setup

In [None]:
# Install required packages
!pip install -q torch torchvision transformers datasets pillow pandas pyarrow pyyaml tqdm matplotlib

# Clone LookBench repository
!git clone https://github.com/SerendipityOneInc/look-bench.git
%cd look-bench

import os
import sys
sys.path.append(os.getcwd())  # repo root after `%cd look-bench` (/content/look-bench on Colab)

print("✅ Setup complete!")

In [None]:
import torch
import numpy as np
from datasets import load_dataset
from tqdm import tqdm
import matplotlib.pyplot as plt

from manager import ConfigManager, ModelManager
from metrics import RankEvaluator, MRREvaluator, NDCGEvaluator, MAPEvaluator

print("✅ Imports successful!")

## Load LookBench Dataset

Choose which subset to evaluate:
- `real_studio_flat` - Easy: single-item retrieval
- `aigen_studio` - Medium: AI-generated studio images
- `real_streetlook` - Hard: multi-item outfit retrieval
- `aigen_streetlook` - Hard: AI-generated street looks

In [None]:
# Load dataset from Hugging Face
print("Loading LookBench dataset...")
dataset = load_dataset("srpone/look-bench")

# Choose subset to evaluate
subset_name = 'real_studio_flat'  # Change this to evaluate other subsets

query_data = dataset[subset_name]['query']
gallery_data = dataset[subset_name]['gallery']

print(f"\n✅ Evaluating on: {subset_name}")
print(f"   Query samples: {len(query_data)}")
print(f"   Gallery samples: {len(gallery_data)}")

## Load Model

Available models: `clip`, `siglip`, `dinov2`

In [None]:
# Initialize managers
config_manager = ConfigManager('configs/config.yaml')
model_manager = ModelManager(config_manager)

# Choose model to evaluate
model_name = 'clip'  # Change to 'siglip' or 'dinov2' to test other models

# Load model
print(f"Loading {model_name} model...")
model, model_wrapper = model_manager.load_model(model_name)
transform = model_manager.get_transform(model_name)

model.eval()
if torch.cuda.is_available():
    model = model.cuda()
    print("✅ Model moved to CUDA")
else:
    print("⚠️ Running on CPU (slower)")

print(f"✅ Model loaded: {model_name}")

## Extract Features

This may take a few minutes depending on dataset size and GPU availability.

In [None]:
def extract_features_from_dataset(data, model, transform, batch_size=32):
    """Extract features from a dataset"""
    features = []
    labels = []
    
    with torch.no_grad():
        for i in tqdm(range(0, len(data), batch_size), desc="Extracting features"):
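            # Slicing a Hugging Face Dataset returns a dict of columns, not rows,
            # so fetch each sample by integer index to get per-sample dicts.
            batch_data = [data[j] for j in range(i, min(i + batch_size, len(data)))]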
            
            batch_images = []
            batch_labels = []
            
            for idx, sample in enumerate(batch_data):
                img = sample['image']
                img_tensor = transform(img)
                batch_images.append(img_tensor)
                
                # Use item_ID as label if available, otherwise use index
                label = sample.get('item_ID', i + idx)
                batch_labels.append(label)
            
            # Stack batch
            batch_tensor = torch.stack(batch_images)
            if torch.cuda.is_available():
                batch_tensor = batch_tensor.cuda()
            
            # Extract features
            batch_features = model(batch_tensor)
            features.append(batch_features.cpu())
            labels.extend(batch_labels)
    
    features = torch.cat(features, dim=0)
    return features.numpy(), np.array(labels)

# Extract features
print("\n📊 Extracting query features...")
query_features, query_labels = extract_features_from_dataset(query_data, model, transform)

print("\n📊 Extracting gallery features...")
gallery_features, gallery_labels = extract_features_from_dataset(gallery_data, model, transform)

print(f"\n✅ Feature extraction complete!")
print(f"   Query features shape: {query_features.shape}")
print(f"   Gallery features shape: {gallery_features.shape}")

## L2 Normalization

Normalize features for cosine similarity computation.
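
The reason this works: once every row has unit length, the dot product of two rows equals their cosine similarity. A minimal NumPy sketch on toy vectors follows; the `l2_normalize` helper and its `eps` guard against zero-norm rows are illustrative additions, not part of the LookBench code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm; eps avoids division by zero."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

a = np.array([[3.0, 4.0], [1.0, 0.0]])
b = np.array([[4.0, 3.0], [0.0, 0.0]])  # second row is all zeros

# Cosine similarity from the definition, for the first rows of a and b.
cosine = a[0] @ b[0] / (np.linalg.norm(a[0]) * np.linalg.norm(b[0]))

# Dot product of the normalized rows gives the same value at [0, 0].
sim = l2_normalize(a) @ l2_normalize(b).T
print(cosine, sim[0, 0])
```

With the guard in place, an all-zero feature row yields a similarity of 0 instead of `nan`.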

In [None]:
# L2 normalize features
query_features = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)
gallery_features = gallery_features / np.linalg.norm(gallery_features, axis=1, keepdims=True)

print("✅ Features normalized")

## Compute Similarity Matrix

In [None]:
# Compute cosine similarity
similarity_matrix = np.dot(query_features, gallery_features.T)
sorted_indices = np.argsort(-similarity_matrix, axis=1)

print(f"✅ Similarity matrix computed")
print(f"   Shape: {similarity_matrix.shape}")
print(f"   Range: [{similarity_matrix.min():.4f}, {similarity_matrix.max():.4f}]")
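
A full `argsort` over every row is convenient here because the complete ranking is passed to the evaluators later on. If only the top K matches are needed and the gallery is large, a partial sort can be cheaper. A hedged sketch using `np.argpartition` on random toy scores (the `top_k_indices` helper is hypothetical and not part of the repository):

```python
import numpy as np

def top_k_indices(similarity, k):
    """Return, per row, the indices of the k largest scores in descending order."""
    k = min(k, similarity.shape[1])
    # argpartition places the k largest entries (unordered) in the last k columns.
    part = np.argpartition(similarity, -k, axis=1)[:, -k:]
    # Sort only those k candidates by descending score.
    part_scores = np.take_along_axis(similarity, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    return np.take_along_axis(part, order, axis=1)

rng = np.random.default_rng(0)
toy_similarity = rng.random((4, 50))
top5 = top_k_indices(toy_similarity, 5)
print(top5.shape)
```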

## Evaluate with All Metrics

This cell computes Recall@K, MRR, NDCG@5, and mAP for every query and stores the mean of each metric as a percentage.

In [None]:
# Initialize evaluators
rank_evaluator = RankEvaluator(top_k=[1, 5, 10, 20])
mrr_evaluator = MRREvaluator()
ndcg_evaluator = NDCGEvaluator(k=5)
map_evaluator = MAPEvaluator()

results = {}

# Recall@K
print("Computing Recall@K...")
for k in [1, 5, 10, 20]:
    scores = []
    for i in range(len(query_labels)):
        score = rank_evaluator.metric_eval(
            sorted_indices[i],
            k,
            query_labels[i],
            gallery_labels
        )
        scores.append(score)
    results[f'Recall@{k}'] = np.mean(scores) * 100

# MRR
print("Computing MRR...")
mrr_scores = []
for i in range(len(query_labels)):
    score = mrr_evaluator.metric_eval(
        sorted_indices[i],
        None,
        query_labels[i],
        gallery_labels
    )
    mrr_scores.append(score)
results['MRR'] = np.mean(mrr_scores) * 100

# NDCG@5
print("Computing NDCG@5...")
ndcg_scores = []
for i in range(len(query_labels)):
    score = ndcg_evaluator.metric_eval(
        sorted_indices[i],
        5,
        query_labels[i],
        gallery_labels
    )
    ndcg_scores.append(score)
results['NDCG@5'] = np.mean(ndcg_scores) * 100

# mAP
print("Computing mAP...")
map_scores = []
for i in range(len(query_labels)):
    score = map_evaluator.metric_eval(
        sorted_indices[i],
        None,
        query_labels[i],
        gallery_labels
    )
    map_scores.append(score)
results['mAP'] = np.mean(map_scores) * 100

print("\n✅ All metrics computed!")
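
As a sanity check on the per-query loop, Recall@K can also be computed in one shot from a boolean match matrix. The sketch below assumes the hit-rate definition (a query counts as correct if any of its top K results shares its label), which may not match `RankEvaluator.metric_eval` exactly; `recall_at_k_vectorized` is a hypothetical helper, and the arrays are toy data.

```python
import numpy as np

def recall_at_k_vectorized(sorted_idx, q_labels, g_labels, ks=(1, 5, 10, 20)):
    """Percentage of queries with at least one label match in the top k."""
    # ranked_labels[i, r] is the label of the gallery item at rank r for query i.
    ranked_labels = g_labels[sorted_idx]
    matches = ranked_labels == q_labels[:, None]
    return {f"Recall@{k}": float(matches[:, :k].any(axis=1).mean() * 100) for k in ks}

toy_sim = np.array([[0.9, 0.1, 0.5],
                    [0.2, 0.8, 0.7],
                    [0.3, 0.6, 0.1]])
toy_sorted = np.argsort(-toy_sim, axis=1)
toy_q = np.array([10, 30, 10])   # query labels
toy_g = np.array([10, 20, 30])   # gallery labels
toy_recall = recall_at_k_vectorized(toy_sorted, toy_q, toy_g, ks=(1, 2))
print(toy_recall)
```

In the toy data only the first query has its match at rank 1, while all three have a match within the top 2.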

## Results

In [None]:
# Print results
print(f"\n{'='*60}")
print(f"Evaluation Results on {subset_name}")
print(f"Model: {model_name}")
print(f"{'='*60}")
for metric, value in results.items():
    print(f"{metric:15s}: {value:6.2f}%")
print(f"{'='*60}")

## Visualize Top Retrievals

Let's visualize some retrieval results to understand model performance.

In [None]:
def visualize_retrieval(query_idx, top_k=5):
    """Visualize top-k retrievals for a query"""
    query_sample = query_data[query_idx]
    top_indices = sorted_indices[query_idx][:top_k]
    
    fig, axes = plt.subplots(1, top_k + 1, figsize=(4 * (top_k + 1), 4))
    
    # Show query
    axes[0].imshow(query_sample['image'])
    axes[0].set_title('Query\n' + f"Category: {query_sample.get('category', 'N/A')}\n" +
                     f"Label: {query_labels[query_idx]}", 
                     fontsize=10, fontweight='bold')
    axes[0].axis('off')
    
    # Show top retrievals
    for i, gallery_idx in enumerate(top_indices):
        gallery_sample = gallery_data[int(gallery_idx)]  # cast NumPy index to a plain int
        sim = similarity_matrix[query_idx, gallery_idx]
        match = gallery_labels[gallery_idx] == query_labels[query_idx]
        
        axes[i + 1].imshow(gallery_sample['image'])
        title = f"Rank {i+1}\nSim: {sim:.3f}\nLabel: {gallery_labels[gallery_idx]}"
        if match:
            title += "\n✓ MATCH"
            axes[i + 1].set_title(title, fontsize=10, color='green', fontweight='bold')
        else:
            title += "\n✗ NO MATCH"
            axes[i + 1].set_title(title, fontsize=10, color='red')
        axes[i + 1].axis('off')
    
    plt.tight_layout()
    plt.show()

# Visualize a few examples
for i in range(min(3, len(query_data))):
    visualize_retrieval(i, top_k=5)

## Evaluate on Multiple Subsets (Optional)

Uncomment and run this cell to evaluate on all LookBench subsets.

In [None]:
# # Evaluate on all subsets
# all_results = {}
# subsets = ['real_studio_flat', 'aigen_studio', 'real_streetlook', 'aigen_streetlook']
# 
# for subset in subsets:
#     print(f"\n{'='*60}")
#     print(f"Evaluating on {subset}...")
#     print(f"{'='*60}")
#     
#     # Load subset
#     query_data_sub = dataset[subset]['query']
#     gallery_data_sub = dataset[subset]['gallery']
#     
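#     # Extract features and evaluate
#     # ... (copy the feature extraction and evaluation code here,
#     #      using query_data_sub / gallery_data_sub so `results` is recomputed)
#     
#     all_results[subset] = dict(results)  # store a copy per subset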
# 
# # Print summary
# import pandas as pd
# df = pd.DataFrame(all_results).T
# print("\n" + "="*60)
# print("Summary Across All Subsets")
# print("="*60)
# print(df)
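
One way to fill in the placeholder above is to wrap the extraction and evaluation steps in a function that returns a fresh metrics dict, then call it once per subset. In this sketch `evaluate_subset` is a hypothetical stand-in that returns dummy zeros, not LookBench results; in practice its body would hold the feature extraction and metric code from the earlier cells. `format_summary` is likewise an illustrative helper.

```python
def evaluate_subset(subset):
    """Hypothetical stand-in: replace the body with the real pipeline."""
    results = {}  # fresh dict per call, so subsets never share state
    results["Recall@1"] = 0.0  # dummy value, not a real score
    results["MRR"] = 0.0       # dummy value, not a real score
    return results

def format_summary(all_results):
    """Render {subset: {metric: value}} as a fixed-width text table."""
    metrics = list(next(iter(all_results.values())))
    header = f"{'subset':20s}" + "".join(f"{m:>12s}" for m in metrics)
    rows = [
        f"{name:20s}" + "".join(f"{res[m]:12.2f}" for m in metrics)
        for name, res in all_results.items()
    ]
    return "\n".join([header] + rows)

subsets = ['real_studio_flat', 'aigen_studio', 'real_streetlook', 'aigen_streetlook']
all_results = {subset: evaluate_subset(subset) for subset in subsets}
summary = format_summary(all_results)
print(summary)
```

Returning a new dict from each call avoids every subset ending up with the metrics of the last one evaluated.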

## Next Steps

1. **Try different models**: Change `model_name` to `'siglip'` or `'dinov2'`
2. **Evaluate other subsets**: Change `subset_name` to test different difficulty levels
3. **Integrate custom models**: See `03_custom_model.ipynb`
4. **Fine-tune models**: Use LookBench for model training and evaluation

### Useful Links

- 📄 **Paper**: https://arxiv.org/abs/2601.14706
- 🏠 **Project**: https://serendipityoneinc.github.io/look-bench-page/
- 🤗 **Dataset**: https://huggingface.co/datasets/srpone/look-bench
- 💻 **GitHub**: https://github.com/SerendipityOneInc/look-bench