# LookBench Data Exploration

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SerendipityOneInc/look-bench/blob/main/notebooks/00_data_exploration.ipynb)

This notebook demonstrates how to download, explore, and analyze the **LookBench** dataset.

📄 **Paper**: [LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval](https://arxiv.org/abs/2601.14706)

🏠 **Project Page**: [https://serendipityoneinc.github.io/look-bench-page/](https://serendipityoneinc.github.io/look-bench-page/)

🤗 **Dataset**: [https://huggingface.co/datasets/srpone/look-bench](https://huggingface.co/datasets/srpone/look-bench)

## Setup and Installation

First, let's install the required packages.

In [None]:
# Install look-bench (includes datasets)
%pip install -q look-bench

# Install matplotlib for visualization
%pip install -q matplotlib

print("✅ Installation complete!")

## 1. Download LookBench Dataset

The LookBench dataset has 5 configs that need to be loaded separately:
- **real_studio_flat**: Real studio flat-lay photos (Easy)
- **aigen_studio**: AI-generated lifestyle images (Medium)
- **real_streetlook**: Real street outfit photos (Hard)
- **aigen_streetlook**: AI-generated street outfits (Hard)
- **noise**: Noise/distractor images
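
The list above can also be kept as a small lookup table, so later cells can label or order configs by difficulty. The sketch below copies the names and difficulty tags from the list. `CONFIG_INFO`, `DIFFICULTY_ORDER`, and `configs_by_difficulty` are illustrative names, not part of the `look-bench` package, and sorting `noise` last is an assumption, since the list gives it no difficulty.

```python
# Illustrative lookup table built from the list above (not part of look-bench).
CONFIG_INFO = {
    "real_studio_flat": ("Real studio flat-lay photos", "Easy"),
    "aigen_studio": ("AI-generated lifestyle images", "Medium"),
    "real_streetlook": ("Real street outfit photos", "Hard"),
    "aigen_streetlook": ("AI-generated street outfits", "Hard"),
    "noise": ("Noise/distractor images", None),  # no difficulty listed
}

# Assumed ordering; configs without a difficulty sort last.
DIFFICULTY_ORDER = {"Easy": 0, "Medium": 1, "Hard": 2, None: 3}


def configs_by_difficulty(info=CONFIG_INFO):
    """Return config names sorted from easiest to hardest, ties by name."""
    return sorted(info, key=lambda name: (DIFFICULTY_ORDER[info[name][1]], name))


for name in configs_by_difficulty():
    description, difficulty = CONFIG_INFO[name]
    print(f"{name:<20} {difficulty or '-':<8} {description}")
```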

In [None]:
from datasets import load_dataset
from collections import Counter
import matplotlib.pyplot as plt

# Available configs
configs = ['aigen_streetlook', 'aigen_studio', 'real_streetlook', 'real_studio_flat', 'noise']

print("Downloading LookBench dataset from Hugging Face...")
print("This may take a few minutes on first run...\n")

# Load all configs
dataset = {}
for config_name in configs:
    print(f"  Loading {config_name}...")
    dataset[config_name] = load_dataset("srpone/look-bench", config_name)

print("\n✅ Dataset downloaded successfully!")
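
The loading loop above stops at the first config that fails, for example on a network error. A minimal sketch of a more tolerant loop follows. `load_configs_safely` and its `loader` argument are hypothetical names introduced here, and the stub loader stands in for `load_dataset` so the example runs offline.

```python
def load_configs_safely(config_names, loader):
    """Load each config with `loader`, collecting failures instead of aborting.

    `loader` is any callable taking a config name; in this notebook it would
    be something like: lambda name: load_dataset("srpone/look-bench", name)
    """
    loaded, failed = {}, {}
    for name in config_names:
        try:
            loaded[name] = loader(name)
        except Exception as exc:  # broad on purpose: report and keep going
            failed[name] = f"{type(exc).__name__}: {exc}"
    return loaded, failed


# Offline demonstration with a stub loader (hypothetical, not the real API).
def stub_loader(name):
    if name == "noise":
        raise ConnectionError("simulated network failure")
    return {"query": [], "gallery": []}


loaded, failed = load_configs_safely(["real_studio_flat", "noise"], stub_loader)
print("loaded:", sorted(loaded))
print("failed:", failed)
```

Passing the loader in as an argument keeps the retry-or-skip logic separate from the download call, so the helper can be exercised without network access.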

## 2. Explore Dataset Structure

In [None]:
print("="*70)
print(" Dataset Structure")
print("="*70)

print(f"\nAvailable configs: {list(dataset.keys())}")
print(f"Total configs: {len(dataset.keys())}\n")

for config_name in dataset.keys():
    print(f"📁 {config_name}:")
    subset_data = dataset[config_name]
    for split_name in subset_data.keys():
        num_samples = len(subset_data[split_name])
        print(f"   ├─ {split_name}: {num_samples:,} samples")
    print()

## 3. Dataset Summary

In [None]:
total_images = 0
total_categories = set()

for config_name in dataset.keys():
    subset_data = dataset[config_name]
    for split_name in subset_data.keys():
        split_data = subset_data[split_name]
        total_images += len(split_data)
        
        # Collect unique categories via column access, which avoids
        # decoding every image just to read one metadata field
        if 'category' in split_data.column_names:
            total_categories.update(split_data['category'])

print("="*70)
print(" Dataset Summary Report")
print("="*70)
print(f"\n📈 Overall Statistics:")
print(f"  • Total configs: {len(dataset.keys())}")
print(f"  • Total images: {total_images:,}")
print(f"  • Unique categories: {len(total_categories)}")
print(f"\n📋 Available configs:")
for config in sorted(dataset.keys()):
    print(f"  • {config}")

## 4. Compare Configs

In [None]:
print("="*70)
print(" Cross-Config Comparison")
print("="*70)

print("\n📊 Config Comparison:")
print(f"{'Config':<25} {'Query':<10} {'Gallery':<10} {'Total':<10}")
print("-" * 60)

total_queries = 0
total_gallery = 0

for config_name in sorted(dataset.keys()):
    subset_data = dataset[config_name]
    num_queries = len(subset_data['query']) if 'query' in subset_data else 0
    num_gallery = len(subset_data['gallery']) if 'gallery' in subset_data else 0
    total = num_queries + num_gallery
    
    print(f"{config_name:<25} {num_queries:<10,} {num_gallery:<10,} {total:<10,}")
    total_queries += num_queries
    total_gallery += num_gallery

print("-" * 60)
print(f"{'TOTAL':<25} {total_queries:<10,} {total_gallery:<10,} {total_queries + total_gallery:<10,}")
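
Raw counts are easier to compare once each config is expressed as a share of the whole benchmark and as a gallery-to-query ratio. The sketch below does that arithmetic on plain integers. `size_summary` is a hypothetical helper, and the counts are made-up placeholders, not real LookBench sizes.

```python
def size_summary(split_sizes):
    """Compute per-config totals, share of all images, and gallery/query ratio.

    `split_sizes` maps config name -> {"query": int, "gallery": int};
    a missing split counts as 0, mirroring the table above.
    """
    totals = {
        name: sizes.get("query", 0) + sizes.get("gallery", 0)
        for name, sizes in split_sizes.items()
    }
    grand_total = sum(totals.values())
    summary = {}
    for name, sizes in split_sizes.items():
        queries = sizes.get("query", 0)
        gallery = sizes.get("gallery", 0)
        summary[name] = {
            "total": totals[name],
            "share_pct": 100.0 * totals[name] / grand_total if grand_total else 0.0,
            # Ratio is undefined for configs without queries (e.g. distractors)
            "gallery_per_query": gallery / queries if queries else None,
        }
    return summary


# Placeholder numbers for illustration only.
example = {
    "config_a": {"query": 100, "gallery": 300},
    "config_b": {"gallery": 600},
}
for name, row in size_summary(example).items():
    print(name, row)
```

With the loaded data, the input could be built as `{name: {split: len(ds) for split, ds in splits.items()} for name, splits in dataset.items()}`.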

## 5. Analyze Individual Config Statistics

In [None]:
def analyze_config_statistics(dataset, config_name):
    """Analyze and print statistics for a specific config"""
    print("\n" + "="*70)
    print(f" Statistics for '{config_name}'")
    print("="*70)
    
    if config_name not in dataset:
        print(f"Config '{config_name}' not found!")
        return
    
    subset_data = dataset[config_name]
    
    # Query statistics
    if 'query' in subset_data:
        query_data = subset_data['query']
        print(f"\n📊 Query Split ({len(query_data):,} samples):")
        
        # Category distribution
        categories = [sample['category'] for sample in query_data]
        category_counts = Counter(categories)
        print(f"\n  Categories ({len(category_counts)} unique):")
        for cat, count in category_counts.most_common(10):
            print(f"    • {cat}: {count}")
        
        # Task distribution
        if 'task' in query_data[0]:
            tasks = [sample['task'] for sample in query_data]
            task_counts = Counter(tasks)
            print(f"\n  Tasks:")
            for task, count in task_counts.items():
                print(f"    • {task}: {count} ({count/len(query_data)*100:.1f}%)")
        
        # Difficulty distribution
        if 'difficulty' in query_data[0]:
            difficulties = [sample['difficulty'] for sample in query_data]
            diff_counts = Counter(difficulties)
            print(f"\n  Difficulty levels:")
            for diff, count in diff_counts.items():
                print(f"    • {diff}: {count} ({count/len(query_data)*100:.1f}%)")
        
        # Attribute statistics
        if 'main_attribute' in query_data[0]:
            main_attrs = [sample['main_attribute'] for sample in query_data]
            attr_counts = Counter(main_attrs)
            print(f"\n  Main attributes ({len(attr_counts)} unique):")
            for attr, count in attr_counts.most_common(5):
                print(f"    • {attr}: {count}")
    
    # Gallery statistics
    if 'gallery' in subset_data:
        gallery_data = subset_data['gallery']
        print(f"\n📚 Gallery Split ({len(gallery_data):,} samples):")
        
        categories = [sample['category'] for sample in gallery_data]
        category_counts = Counter(categories)
        print(f"\n  Categories ({len(category_counts)} unique):")
        for cat, count in category_counts.most_common(10):
            print(f"    • {cat}: {count}")

# Analyze main configs (skip noise for detailed analysis)
main_configs = ['real_studio_flat', 'aigen_studio', 'real_streetlook', 'aigen_streetlook']
for config_name in main_configs:
    if config_name in dataset:
        analyze_config_statistics(dataset, config_name)
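
`analyze_config_statistics` repeats the same count-and-percentage pattern for categories, tasks, difficulty levels, and attributes. One way to factor it out is sketched below. `value_distribution` is a hypothetical helper, and the sample labels are invented for illustration.

```python
from collections import Counter


def value_distribution(values, top_k=None):
    """Return (value, count, percent) tuples, most frequent first.

    `values` is any iterable of hashable labels, such as one metadata column.
    `top_k` limits the output to the k most common values (None keeps all).
    """
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, 100.0 * count / total)
        for value, count in counts.most_common(top_k)
    ]


# Invented labels for illustration only.
labels = ["dress", "shoes", "dress", "bag", "dress", "shoes", "dress", "shoes"]
for value, count, pct in value_distribution(labels, top_k=2):
    print(f"    • {value}: {count} ({pct:.1f}%)")
```

An empty input yields an empty list, so the helper never divides by zero on a split with no samples.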

## 6. Display Sample Images

In [None]:
def display_sample_images(dataset, config_name, num_samples=4):
    """Display sample images from the dataset"""
    if config_name not in dataset:
        print(f"Config '{config_name}' not found!")
        return
    
    subset_data = dataset[config_name]
    if 'query' not in subset_data:
        print(f"Query split not found in '{config_name}'!")
        return
    
    query_data = subset_data['query']
    num_samples = min(num_samples, len(query_data))
    
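    # Size the grid from num_samples instead of hard-coding 2x2,
    # so values other than 4 neither overflow nor leave empty frames
    ncols = 2
    nrows = max(1, (num_samples + ncols - 1) // ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(12, 5 * nrows), squeeze=False)
    axes = axes.flatten()

    # Hide any axes that will not receive an image
    for ax in axes[num_samples:]:
        ax.axis('off')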
    
    for idx in range(num_samples):
        sample = query_data[idx]
        ax = axes[idx]
        
        # Display image
        ax.imshow(sample['image'])
        ax.axis('off')
        
        # Create title with metadata
        title = f"Category: {sample['category']}\n"
        if 'main_attribute' in sample:
            title += f"Attribute: {sample['main_attribute']}\n"
        if 'task' in sample:
            title += f"Task: {sample['task']}"
        
        ax.set_title(title, fontsize=9, pad=10)
    
    plt.suptitle(f"Sample Images from '{config_name}'", fontsize=14, y=0.995)
    plt.tight_layout()
    plt.show()

# Display samples from two of the main configs
for config_name in ['real_studio_flat', 'real_streetlook']:
    if config_name in dataset:
        display_sample_images(dataset, config_name, num_samples=4)

## Summary

✅ Dataset exploration completed successfully!

📚 **Next Steps:**
1. Review the statistics above
2. Run `01_quickstart.ipynb` to test model inference
3. Run `02_model_evaluation.ipynb` for full benchmark evaluation

💡 **Resources:**
- 📄 Paper: https://arxiv.org/abs/2601.14706
- 🤗 Dataset: https://huggingface.co/datasets/srpone/look-bench
- 🏠 Project Page: https://serendipityoneinc.github.io/look-bench-page/
- 💻 GitHub: https://github.com/SerendipityOneInc/look-bench