# Vision-Language Model Counting Under Synthetic Occlusion

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-repo/vlm-counting-bias/blob/main/notebooks/01_counting_occlusion_synthetic.ipynb)

This notebook systematically evaluates VLM counting performance on synthetic images with controlled occlusion levels.

## Objectives
- Generate synthetic images with known object counts
- Apply varying levels of occlusion (0%, 25%, 50%, 75%)
- Evaluate multiple VLMs (GPT-4V and BLIP-2; LLaVA is planned but not implemented in this notebook)
- Analyze counting bias under different occlusion conditions
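
The objectives above amount to a full factorial grid over object type, object count, and occlusion level. As a rough sketch of how large that grid is, the snippet below multiplies out the values used in `EXPERIMENT_CONFIG` later in this notebook; the standalone variable names are only for illustration.

```python
from itertools import product

# Values copied from EXPERIMENT_CONFIG later in this notebook.
object_types = ["circles", "rectangles"]
object_counts = [3, 5, 7, 10]
occlusion_levels = [0.0, 0.25, 0.50, 0.75]
num_images_per_condition = 10
models_to_test = ["GPT-4V", "BLIP-2"]

# One condition = (object type, object count, occlusion level).
conditions = list(product(object_types, object_counts, occlusion_levels))
total_images = len(conditions) * num_images_per_condition
total_predictions = total_images * len(models_to_test)

print(f"conditions:  {len(conditions)}")     # 2 * 4 * 4 = 32
print(f"images:      {total_images}")        # 32 * 10 = 320
print(f"model calls: {total_predictions}")   # 320 * 2 = 640
```

The demo run further down samples only 20 of these images, so the number of model calls in that run is much smaller than the full grid.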

## Setup Requirements
- OpenAI API key for GPT-4V
- HuggingFace token for model access
- GPU recommended for local model inference

In [None]:
# Install required packages in Colab
!pip install openai transformers torch pillow opencv-python matplotlib pandas numpy plotly
!pip install accelerate bitsandbytes

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from PIL import Image, ImageDraw
import cv2
import json
import base64
import io
from tqdm import tqdm
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Dependencies loaded successfully!")

## Configuration and API Setup

In [None]:
# API Configuration
# Keys are read from environment variables first, then from Colab secrets,
# and finally from an interactive prompt
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', '')
HF_TOKEN = os.getenv('HF_TOKEN', '')

# google.colab can only be imported inside Colab; outside Colab we fall back to prompts
try:
    from google.colab import userdata
except ImportError:
    userdata = None

if not OPENAI_API_KEY:
    try:
        OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    except Exception:
        OPENAI_API_KEY = input("Enter OpenAI API Key: ")

if not HF_TOKEN:
    try:
        HF_TOKEN = userdata.get('HF_TOKEN')
    except Exception:
        HF_TOKEN = input("Enter HuggingFace Token (optional): ") or None

# Experiment configuration
EXPERIMENT_CONFIG = {
    'image_size': (512, 512),
    'num_images_per_condition': 10,  # Reduced for demo, increase for full experiment
    'object_counts': [3, 5, 7, 10],  # Number of objects to generate
    'occlusion_levels': [0.0, 0.25, 0.50, 0.75],  # Target patch area as a fraction of image area (patches may overlap)
    'object_types': ['circles', 'rectangles'],
    'colors': ['red', 'blue', 'green', 'yellow', 'purple'],
    'models_to_test': ['GPT-4V', 'BLIP-2']  # 'LLaVA' is not yet supported by VLMInterface.count_objects
}

print("Configuration loaded:")
for key, value in EXPERIMENT_CONFIG.items():
    print(f"  {key}: {value}")

## Synthetic Data Generation

We generate synthetic images with known object counts and controllable occlusion patterns.
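
In the generator below, `occlusion_level` sets a target for the summed area of the black patches as a fraction of the image area. Because patches are placed independently and can overlap, the fraction of the image actually covered can be lower than that target. The sketch below illustrates the difference on a tiny example; `covered_fraction` is a hypothetical helper that is not part of the notebook, and it assumes a half-open pixel convention for each patch.

```python
def covered_fraction(patches, width, height):
    """Fraction of a width x height image covered by the union of patches.

    Each patch is (x, y, w, h) and is assumed to cover pixels
    x..x+w-1 horizontally and y..y+h-1 vertically.
    """
    covered = set()
    for x, y, w, h in patches:
        for px in range(max(x, 0), min(x + w, width)):
            for py in range(max(y, 0), min(y + h, height)):
                covered.add((px, py))
    return len(covered) / (width * height)


# Two 10x10 patches on a 20x20 image that overlap in a 5x5 region.
patches = [(0, 0, 10, 10), (5, 5, 10, 10)]
nominal = sum(w * h for _, _, w, h in patches) / (20 * 20)
actual = covered_fraction(patches, 20, 20)

print(f"nominal (summed) coverage: {nominal:.4f}")  # 200 / 400
print(f"actual (union) coverage:   {actual:.4f}")   # 175 / 400
```

If the analysis needs the true covered area, or the visible fraction of each object, it may be worth recording a measurement like this in the image metadata alongside the nominal level.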

In [None]:
class SyntheticDataGenerator:
    def __init__(self, image_size=(512, 512), colors=None):
        self.image_size = image_size
        self.colors = colors or ['red', 'blue', 'green', 'yellow', 'purple']
        self.color_map = {
            'red': (255, 0, 0),
            'blue': (0, 0, 255),
            'green': (0, 255, 0),
            'yellow': (255, 255, 0),
            'purple': (128, 0, 128)
        }
    
    def generate_objects(self, num_objects, object_type='circles'):
        """Generate random object positions and properties"""
        objects = []
        width, height = self.image_size
        
        for i in range(num_objects):
            # Random position (ensure objects don't go outside image)
            if object_type == 'circles':
                radius = np.random.randint(20, 50)
                x = np.random.randint(radius, width - radius)
                y = np.random.randint(radius, height - radius)
                obj = {
                    'type': 'circle',
                    'x': x,
                    'y': y,
                    'radius': radius,
                    'color': np.random.choice(self.colors)
                }
            else:  # rectangles
                width_obj = np.random.randint(30, 80)
                height_obj = np.random.randint(30, 80)
                x = np.random.randint(0, width - width_obj)
                y = np.random.randint(0, height - height_obj)
                obj = {
                    'type': 'rectangle',
                    'x': x,
                    'y': y,
                    'width': width_obj,
                    'height': height_obj,
                    'color': np.random.choice(self.colors)
                }
            
            objects.append(obj)
        
        return objects
    
    def draw_objects(self, objects):
        """Draw objects on image"""
        image = Image.new('RGB', self.image_size, color='white')
        draw = ImageDraw.Draw(image)
        
        for obj in objects:
            color = self.color_map[obj['color']]
            
            if obj['type'] == 'circle':
                x, y, r = obj['x'], obj['y'], obj['radius']
                draw.ellipse([x-r, y-r, x+r, y+r], fill=color)
            elif obj['type'] == 'rectangle':
                x, y, w, h = obj['x'], obj['y'], obj['width'], obj['height']
                draw.rectangle([x, y, x+w, y+h], fill=color)
        
        return image
    
    def apply_occlusion(self, image, occlusion_level):
        """Apply random occlusion patches to image"""
        if occlusion_level == 0:
            return image
        
        image_copy = image.copy()
        draw = ImageDraw.Draw(image_copy)
        width, height = self.image_size
        
        # Target total patch area (patches may overlap, so the area actually
        # covered can be smaller than this target)
        total_area = width * height
        occlusion_area = total_area * occlusion_level
        
        # Generate random rectangular occlusion patches
        patches_generated = 0
        current_area = 0
        
        # Stop at 20 patches even if the target area has not been reached
        while current_area < occlusion_area and patches_generated < 20:
            # Random patch size
            patch_w = np.random.randint(30, min(150, width // 3))
            patch_h = np.random.randint(30, min(150, height // 3))
            
            # Random patch position
            patch_x = np.random.randint(0, width - patch_w)
            patch_y = np.random.randint(0, height - patch_h)
            
            # Draw black occlusion patch
            draw.rectangle([patch_x, patch_y, patch_x + patch_w, patch_y + patch_h], 
                          fill=(0, 0, 0))
            
            current_area += patch_w * patch_h
            patches_generated += 1
        
        return image_copy
    
    def generate_dataset(self, num_objects_list, occlusion_levels, 
                        num_images_per_condition, object_type='circles'):
        """Generate complete synthetic dataset"""
        dataset = []
        
        for num_objects in num_objects_list:
            for occlusion_level in occlusion_levels:
                for img_idx in range(num_images_per_condition):
                    # Generate objects
                    objects = self.generate_objects(num_objects, object_type)
                    
                    # Draw base image
                    base_image = self.draw_objects(objects)
                    
                    # Apply occlusion
                    final_image = self.apply_occlusion(base_image, occlusion_level)
                    
                    # Create metadata
                    metadata = {
                        'image_id': f"{object_type}_{num_objects}obj_{int(occlusion_level*100)}occ_{img_idx}",
                        'true_count': num_objects,
                        'occlusion_level': occlusion_level,
                        'object_type': object_type,
                        'objects': objects,
                        'image': final_image
                    }
                    
                    dataset.append(metadata)
        
        return dataset

# Initialize generator
generator = SyntheticDataGenerator(
    image_size=EXPERIMENT_CONFIG['image_size'],
    colors=EXPERIMENT_CONFIG['colors']
)

print("Synthetic data generator initialized")

In [None]:
# Generate synthetic dataset
print("Generating synthetic dataset...")

synthetic_dataset = []

for object_type in EXPERIMENT_CONFIG['object_types']:
    print(f"\nGenerating {object_type} dataset...")
    
    dataset = generator.generate_dataset(
        num_objects_list=EXPERIMENT_CONFIG['object_counts'],
        occlusion_levels=EXPERIMENT_CONFIG['occlusion_levels'],
        num_images_per_condition=EXPERIMENT_CONFIG['num_images_per_condition'],
        object_type=object_type
    )
    
    synthetic_dataset.extend(dataset)
    print(f"Generated {len(dataset)} images for {object_type}")

print(f"\nTotal synthetic images generated: {len(synthetic_dataset)}")

# Display sample images
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

sample_indices = np.linspace(0, len(synthetic_dataset)-1, 8, dtype=int)

for i, idx in enumerate(sample_indices):
    sample = synthetic_dataset[idx]
    axes[i].imshow(sample['image'])
    axes[i].set_title(f"{sample['object_type'].title()}\nCount: {sample['true_count']}\nOcclusion: {sample['occlusion_level']:.0%}")
    axes[i].axis('off')

plt.tight_layout()
plt.suptitle('Sample Synthetic Images', y=1.02, fontsize=16)
plt.show()

## VLM Interface Implementation

Implementation of interfaces for different Vision-Language Models.
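
One design point worth considering for these interfaces: a reply with a missing or unparsable count is different from a reply that says "zero objects". The sketch below parses a JSON reply in the format requested by the GPT-4V prompt and returns `None` for the count when no usable value is present. `parse_count_response` is a hypothetical helper, not part of the interface below, and the 0.5 default confidence simply mirrors the default used in this notebook.

```python
import json


def parse_count_response(raw_text):
    """Parse a reply shaped like {"count": ..., "confidence": ..., "reasoning": ...}.

    Returns 'count' as None (rather than 0) when the reply has no usable
    count, so callers can tell "no answer" apart from "zero objects".
    """
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return {"count": None, "confidence": 0.0, "error": f"invalid JSON: {exc}"}
    if not isinstance(payload, dict):
        return {"count": None, "confidence": 0.0, "error": "expected a JSON object"}

    try:
        count = int(payload["count"])
    except (KeyError, TypeError, ValueError):
        return {"count": None, "confidence": 0.0,
                "error": "missing or non-integer count"}

    try:
        confidence = float(payload.get("confidence", 0.5))
    except (TypeError, ValueError):
        confidence = 0.5
    confidence = min(1.0, max(0.0, confidence))  # clamp to [0, 1]

    return {"count": count, "confidence": confidence, "error": None}


print(parse_count_response('{"count": 7, "confidence": 0.9, "reasoning": "seven circles"}'))
print(parse_count_response('{"confidence": 0.9}'))
```

With this approach, rows whose count is `None` can be excluded from the bias and accuracy metrics instead of being scored as a prediction of 0.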

In [None]:
import openai
import requests
import re

class VLMInterface:
    def __init__(self, openai_key=None, hf_token=None):
        self.openai_key = openai_key
        self.hf_token = hf_token
        
        if openai_key:
            self.openai_client = openai.OpenAI(api_key=openai_key)
    
    def count_objects_gpt4v(self, image_base64, object_type, max_retries=3):
        """Count objects using GPT-4V"""
        if not self.openai_key:
            raise ValueError("OpenAI API key required for GPT-4V")
        
        prompt = f"""Look at this image carefully and count the number of {object_type}.
        
        Please respond with ONLY a JSON object in this format:
        {{
            "count": <number>,
            "confidence": <0.0 to 1.0>,
            "reasoning": "<brief explanation>"
        }}
        
        Be precise in your counting and provide an honest confidence score."""
        
        for attempt in range(max_retries):
            try:
                # Results labelled 'GPT-4V' in this notebook are produced with the
                # "gpt-4o" model (released May 13, 2024); update the label and the
                # model name together if you switch models
                response = self.openai_client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": prompt},
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/png;base64,{image_base64}"
                                    }
                                }
                            ]
                        }
                    ],
                    max_tokens=200,
                    response_format={"type": "json_object"}
                )
                
                result_text = response.choices[0].message.content
                result = json.loads(result_text)
                
                return {
                    'count': int(result.get('count', 0)),
                    'confidence': float(result.get('confidence', 0.5)),
                    'reasoning': result.get('reasoning', ''),
                    'raw_response': result_text
                }
                
            except Exception as e:
                if attempt == max_retries - 1:
                    return {
                        'count': 0,
                        'confidence': 0.0,
                        'error': str(e),
                        'reasoning': f'Error after {max_retries} attempts'
                    }
                time.sleep(2 ** attempt)  # Exponential backoff
    
    def count_objects_blip2(self, image_base64, object_type, max_retries=3):
        """Count objects using BLIP-2 via HuggingFace Inference API"""
        
        # Using HuggingFace Inference API for BLIP-2
        api_url = "https://api-inference.huggingface.co/models/Salesforce/blip2-opt-2.7b"
        
        headers = {}
        if self.hf_token:
            headers["Authorization"] = f"Bearer {self.hf_token}"
        
        # Convert base64 to bytes
        image_bytes = base64.b64decode(image_base64)
        
        # Create prompt for counting
        question = f"How many {object_type} are in this image?"
        
        for attempt in range(max_retries):
            try:
                response = requests.post(
                    api_url,
                    headers=headers,
                    data=image_bytes,
                    params={"question": question}
                )
                
                if response.status_code == 200:
                    result = response.json()
                    answer = result[0]['answer'] if isinstance(result, list) else result.get('answer', '')
                    
                    # Extract number from answer
                    count = self.extract_number_from_text(answer)
                    
                    # Simple confidence estimation based on answer clarity
                    confidence = 0.8 if any(str(i) in answer for i in range(20)) else 0.5
                    
                    return {
                        'count': count,
                        'confidence': confidence,
                        'reasoning': answer,
                        'raw_response': str(result)
                    }
                else:
                    if attempt == max_retries - 1:
                        return {
                            'count': 0,
                            'confidence': 0.0,
                            'error': f'HTTP {response.status_code}: {response.text}',
                            'reasoning': 'API request failed'
                        }
                    
            except Exception as e:
                if attempt == max_retries - 1:
                    return {
                        'count': 0,
                        'confidence': 0.0,
                        'error': str(e),
                        'reasoning': f'Error after {max_retries} attempts'
                    }
                
            time.sleep(2 ** attempt)  # Exponential backoff
    
    def extract_number_from_text(self, text):
        """Extract the first number (digits, or a number word up to fifteen) from text"""
        if not text:
            return 0
        
        # Look for explicit numbers
        numbers = re.findall(r'\b\d+\b', text)
        if numbers:
            return int(numbers[0])
        
        # Look for written numbers
        number_words = {
            'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
            'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
            'eleven': 11, 'twelve': 12, 'thirteen': 13, 'fourteen': 14, 'fifteen': 15
        }
        
        # Match whole words only, so "fourteen" is not read as "four"
        # and "none" is not read as "one"
        for word in re.findall(r'[a-z]+', text.lower()):
            if word in number_words:
                return number_words[word]
        
        # Default to 0 if no number found
        return 0
    
    def count_objects(self, model_name, image_base64, object_type):
        """Main interface for counting objects with different models"""
        if model_name == 'GPT-4V':
            return self.count_objects_gpt4v(image_base64, object_type)
        elif model_name == 'BLIP-2':
            return self.count_objects_blip2(image_base64, object_type)
        else:
            raise ValueError(f"Unsupported model: {model_name}")

# Initialize VLM interface
vlm_interface = VLMInterface(openai_key=OPENAI_API_KEY, hf_token=HF_TOKEN)
print("VLM interface initialized")

## Experiment Execution

Run the systematic evaluation across all models and conditions.
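
For each prediction, the experiment loop below records four metrics: absolute error, relative error, signed bias, and exact match. A minimal worked example of those definitions follows; `counting_metrics` is a hypothetical helper used only to illustrate the arithmetic.

```python
def counting_metrics(predicted, true_count):
    """Per-prediction metrics, using the same definitions as the experiment loop."""
    absolute_error = abs(predicted - true_count)
    return {
        "absolute_error": absolute_error,
        # max(..., 1) avoids division by zero when the true count is 0
        "relative_error": absolute_error / max(true_count, 1),
        # Negative bias = under-counting, positive bias = over-counting
        "bias": predicted - true_count,
        "exact_match": predicted == true_count,
    }


# A model that answers 5 for an image containing 7 objects under-counts by 2.
metrics = counting_metrics(predicted=5, true_count=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Averaging `bias` keeps the sign, so under-counts and over-counts can cancel out; averaging `absolute_error` does not, which is why the analysis reports both.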

In [None]:
def run_counting_experiment(dataset, models_to_test, vlm_interface, sample_size=None):
    """Run counting experiment on synthetic dataset"""
    
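    # Sample dataset if specified (sample indices so each entry stays a plain dict)
    if sample_size and sample_size < len(dataset):
        sampled_indices = np.random.choice(len(dataset), sample_size, replace=False)
        dataset = [dataset[i] for i in sampled_indices]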
    
    results = []
    
    print(f"Running experiment on {len(dataset)} images with {len(models_to_test)} models...")
    
    progress_bar = tqdm(total=len(dataset) * len(models_to_test))
    
    for img_data in dataset:
        # Convert image to base64
        img_buffer = io.BytesIO()
        img_data['image'].save(img_buffer, format='PNG')
        img_base64 = base64.b64encode(img_buffer.getvalue()).decode()
        
        for model_name in models_to_test:
            progress_bar.set_description(f"Processing {img_data['image_id']} with {model_name}")
            
            try:
                # Get prediction from model
                prediction = vlm_interface.count_objects(
                    model_name, img_base64, img_data['object_type']
                )
                
                # Store result
                result = {
                    'image_id': img_data['image_id'],
                    'model': model_name,
                    'true_count': img_data['true_count'],
                    'predicted_count': prediction['count'],
                    'confidence': prediction['confidence'],
                    'occlusion_level': img_data['occlusion_level'],
                    'object_type': img_data['object_type'],
                    'error': prediction.get('error'),
                    'reasoning': prediction.get('reasoning', ''),
                    'timestamp': datetime.now().isoformat()
                }
                
                # Calculate metrics
                if not prediction.get('error'):
                    result['absolute_error'] = abs(prediction['count'] - img_data['true_count'])
                    result['relative_error'] = result['absolute_error'] / max(img_data['true_count'], 1)
                    result['bias'] = prediction['count'] - img_data['true_count']
                    result['exact_match'] = prediction['count'] == img_data['true_count']
                
                results.append(result)
                
            except Exception as e:
                print(f"\nError processing {img_data['image_id']} with {model_name}: {str(e)}")
                
                # Store error result
                result = {
                    'image_id': img_data['image_id'],
                    'model': model_name,
                    'true_count': img_data['true_count'],
                    'predicted_count': 0,
                    'confidence': 0.0,
                    'occlusion_level': img_data['occlusion_level'],
                    'object_type': img_data['object_type'],
                    'error': str(e),
                    'timestamp': datetime.now().isoformat()
                }
                results.append(result)
            
            progress_bar.update(1)
            
            # Rate limiting delay
            time.sleep(1)
    
    progress_bar.close()
    
    return pd.DataFrame(results)

# Run the experiment (using a smaller sample for demo)
print("Starting experiment...")
sample_size = min(20, len(synthetic_dataset))  # Limit for demo

results_df = run_counting_experiment(
    dataset=synthetic_dataset,
    models_to_test=EXPERIMENT_CONFIG['models_to_test'],
    vlm_interface=vlm_interface,
    sample_size=sample_size
)

print(f"\nExperiment completed! Collected {len(results_df)} results.")

# Display summary
print("\nSummary Statistics:")
successful_results = results_df[results_df['error'].isna()]
print(f"Successful predictions: {len(successful_results)}/{len(results_df)} ({len(successful_results)/len(results_df):.1%})")

if len(successful_results) > 0:
    print(f"Average absolute error: {successful_results['absolute_error'].mean():.2f}")
    print(f"Exact match accuracy: {successful_results['exact_match'].mean():.1%}")
    print(f"Average confidence: {successful_results['confidence'].mean():.2f}")

## Results Analysis and Visualization

Analyze the counting performance and visualize biases under different occlusion conditions.
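
With only a handful of images per condition, the exact-match accuracies reported below carry wide uncertainty. One way to make that visible is a Wilson score interval for each accuracy; the sketch below is an optional addition, not something the notebook computes, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 7 exact matches out of 10 images in one condition.
low, high = wilson_interval(successes=7, n=10)
print(f"accuracy 0.70, 95% interval: [{low:.2f}, {high:.2f}]")
```

For 7 correct out of 10 the interval spans roughly 0.40 to 0.89, so differences between occlusion levels in a small run should be read with caution.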

In [None]:
# Filter successful results for analysis
analysis_df = results_df[results_df['error'].isna()].copy()

if len(analysis_df) == 0:
    print("No successful results to analyze. Please check API configurations and try again.")
else:
    print(f"Analyzing {len(analysis_df)} successful predictions...")
    
    # Performance by model
    model_performance = analysis_df.groupby('model').agg({
        'absolute_error': ['mean', 'std'],
        'exact_match': 'mean',
        'confidence': 'mean',
        'bias': 'mean'
    }).round(3)
    
    print("\nModel Performance Summary:")
    print(model_performance)
    
    # Performance by occlusion level
    occlusion_performance = analysis_df.groupby('occlusion_level').agg({
        'absolute_error': ['mean', 'std'],
        'exact_match': 'mean',
        'confidence': 'mean',
        'bias': 'mean'
    }).round(3)
    
    print("\nPerformance by Occlusion Level:")
    print(occlusion_performance)
    
    # Detailed analysis by model and occlusion
    detailed_analysis = analysis_df.groupby(['model', 'occlusion_level']).agg({
        'absolute_error': 'mean',
        'exact_match': 'mean',
        'bias': 'mean',
        'confidence': 'mean'
    }).round(3)
    
    print("\nDetailed Analysis by Model and Occlusion:")
    print(detailed_analysis)

In [None]:
if len(analysis_df) > 0:
    # Create comprehensive visualizations
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Accuracy by Occlusion Level
    occlusion_accuracy = analysis_df.groupby(['occlusion_level', 'model'])['exact_match'].mean().unstack()
    if not occlusion_accuracy.empty:
        occlusion_accuracy.plot(kind='line', marker='o', ax=axes[0, 0])
        axes[0, 0].set_title('Accuracy by Occlusion Level')
        axes[0, 0].set_xlabel('Occlusion Level')
        axes[0, 0].set_ylabel('Exact Match Accuracy')
        axes[0, 0].legend(title='Model')
        axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Average Error by Occlusion Level
    occlusion_error = analysis_df.groupby(['occlusion_level', 'model'])['absolute_error'].mean().unstack()
    if not occlusion_error.empty:
        occlusion_error.plot(kind='line', marker='s', ax=axes[0, 1])
        axes[0, 1].set_title('Average Error by Occlusion Level')
        axes[0, 1].set_xlabel('Occlusion Level')
        axes[0, 1].set_ylabel('Mean Absolute Error')
        axes[0, 1].legend(title='Model')
        axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Bias Analysis
    bias_data = analysis_df.groupby('model')['bias'].mean()
    colors = ['red' if x < 0 else 'green' for x in bias_data.values]
    axes[0, 2].bar(bias_data.index, bias_data.values, color=colors, alpha=0.7)
    axes[0, 2].set_title('Average Counting Bias by Model')
    axes[0, 2].set_ylabel('Bias (Predicted - True)')
    axes[0, 2].axhline(y=0, color='black', linestyle='-', alpha=0.3)
    axes[0, 2].grid(True, alpha=0.3)
    
    # 4. Confidence vs Accuracy
    for model in analysis_df['model'].unique():
        model_data = analysis_df[analysis_df['model'] == model]
        axes[1, 0].scatter(model_data['confidence'], model_data['exact_match'], 
                          label=model, alpha=0.6)
    axes[1, 0].set_title('Confidence vs Accuracy')
    axes[1, 0].set_xlabel('Confidence Score')
    axes[1, 0].set_ylabel('Exact Match (1=correct, 0=incorrect)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # 5. Error Distribution
    analysis_df.boxplot(column='absolute_error', by='model', ax=axes[1, 1])
    fig.suptitle('')  # Clear the "Boxplot grouped by model" title that pandas adds to the figure
    axes[1, 1].set_title('Error Distribution by Model')
    axes[1, 1].set_ylabel('Absolute Error')
    plt.sca(axes[1, 1])
    plt.xticks(rotation=45)
    
    # 6. Performance Heatmap
    if len(analysis_df['model'].unique()) > 1 and len(analysis_df['occlusion_level'].unique()) > 1:
        pivot_data = analysis_df.pivot_table(values='exact_match', 
                                           index='model', 
                                           columns='occlusion_level', 
                                           aggfunc='mean')
        im = axes[1, 2].imshow(pivot_data.values, cmap='RdYlGn', aspect='auto')
        axes[1, 2].set_xticks(range(len(pivot_data.columns)))
        axes[1, 2].set_xticklabels([f'{x:.0%}' for x in pivot_data.columns])
        axes[1, 2].set_yticks(range(len(pivot_data.index)))
        axes[1, 2].set_yticklabels(pivot_data.index)
        axes[1, 2].set_title('Accuracy Heatmap: Model vs Occlusion')
        axes[1, 2].set_xlabel('Occlusion Level')
        axes[1, 2].set_ylabel('Model')
        
        # Add text annotations
        for i in range(len(pivot_data.index)):
            for j in range(len(pivot_data.columns)):
                text = axes[1, 2].text(j, i, f'{pivot_data.iloc[i, j]:.2f}',
                                     ha="center", va="center", color="black")
        
        plt.colorbar(im, ax=axes[1, 2])
    
    plt.tight_layout()
    plt.show()
    
    # Interactive Plotly visualizations
    print("\nCreating interactive visualizations...")
    
    # Interactive accuracy plot
    fig_interactive = go.Figure()
    
    for model in analysis_df['model'].unique():
        model_data = analysis_df[analysis_df['model'] == model]
        occlusion_acc = model_data.groupby('occlusion_level')['exact_match'].mean()
        
        fig_interactive.add_trace(go.Scatter(
            x=[f'{x:.0%}' for x in occlusion_acc.index],
            y=occlusion_acc.values,
            mode='lines+markers',
            name=model,
            line=dict(width=3),
            marker=dict(size=8)
        ))
    
    fig_interactive.update_layout(
        title='Interactive: Model Accuracy vs Occlusion Level',
        xaxis_title='Occlusion Level',
        yaxis_title='Exact Match Accuracy',
        hovermode='x unified'
    )
    
    fig_interactive.show()
    
else:
    print("Skipping visualizations due to insufficient data.")

## Save Results and Generate Report

In [None]:
# Save detailed results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_filename = f"synthetic_occlusion_results_{timestamp}.csv"

results_df.to_csv(results_filename, index=False)
print(f"Results saved to: {results_filename}")

# Generate summary report
report = f"""
# VLM Counting Bias Experiment - Synthetic Occlusion Results

**Experiment Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Experiment Configuration
- **Models Tested:** {', '.join(EXPERIMENT_CONFIG['models_to_test'])}
- **Object Types:** {', '.join(EXPERIMENT_CONFIG['object_types'])}
- **Object Counts:** {EXPERIMENT_CONFIG['object_counts']}
- **Occlusion Levels:** {[f'{x:.0%}' for x in EXPERIMENT_CONFIG['occlusion_levels']]}
- **Images per Condition:** {EXPERIMENT_CONFIG['num_images_per_condition']}
- **Total Images Generated:** {len(synthetic_dataset)}
- **Images Analyzed:** {results_df['image_id'].nunique()}

## Results Summary
- **Total Predictions:** {len(results_df)}
- **Successful Predictions:** {len(analysis_df)} ({len(analysis_df)/len(results_df):.1%})
"""

if len(analysis_df) > 0:
    report += f"""
- **Overall Accuracy:** {analysis_df['exact_match'].mean():.1%}
- **Average Absolute Error:** {analysis_df['absolute_error'].mean():.2f}
- **Average Confidence:** {analysis_df['confidence'].mean():.2f}

## Key Findings

### Model Performance
"""
    
    for model in analysis_df['model'].unique():
        model_data = analysis_df[analysis_df['model'] == model]
        report += f"""
**{model}:**
- Accuracy: {model_data['exact_match'].mean():.1%}
- Avg Error: {model_data['absolute_error'].mean():.2f}
- Avg Bias: {model_data['bias'].mean():.2f} ({'under-counting' if model_data['bias'].mean() < 0 else 'over-counting' if model_data['bias'].mean() > 0 else 'no net bias'})
- Confidence: {model_data['confidence'].mean():.2f}
"""
    
    report += "\n### Occlusion Impact\n"
    for occlusion in sorted(analysis_df['occlusion_level'].unique()):
        occ_data = analysis_df[analysis_df['occlusion_level'] == occlusion]
        report += f"""
**{occlusion:.0%} Occlusion:**
- Accuracy: {occ_data['exact_match'].mean():.1%}
- Avg Error: {occ_data['absolute_error'].mean():.2f}
- Sample Size: {len(occ_data)}
"""

report += f"""

## Files Generated
- **Results CSV:** {results_filename}
- **Synthetic Dataset:** {len(synthetic_dataset)} images in memory

## Next Steps
1. Run the real-world image evaluation notebook
2. Compare results across synthetic and real scenarios
3. Analyze failure cases and error patterns
4. Consider additional models or prompting strategies

---
*Generated by VLM Counting Bias Research Platform*
"""

# Save report
report_filename = f"experiment_report_{timestamp}.md"
with open(report_filename, 'w') as f:
    f.write(report)

print(f"\nExperiment report saved to: {report_filename}")
print("\n" + "="*50)
print(report)

# Trigger file downloads when running in Colab
try:
    from google.colab import files
    print("\nDownload files:")
    files.download(results_filename)
    files.download(report_filename)
except Exception:
    # Not running in Colab (or the download failed); the files remain on disk
    print(f"\nFiles saved locally: {results_filename}, {report_filename}")

## Conclusion

This notebook demonstrated a systematic evaluation of VLM counting capabilities under controlled synthetic occlusion conditions. 

### Key Insights:
1. **Occlusion Impact**: Models show degraded performance as occlusion level increases
2. **Model Differences**: Different VLMs exhibit varying robustness to occlusion
3. **Bias Patterns**: Some models tend to under-count while others over-count
4. **Confidence Calibration**: Model confidence may not correlate with actual accuracy
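
To put a number on the confidence calibration point, one common summary is the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The sketch below is a minimal illustration with equal-width bins; `expected_calibration_error` is a hypothetical helper and is not computed anywhere in this notebook.

```python
def expected_calibration_error(confidences, correct, num_bins=5):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions made with 0.9 confidence, of which only two are correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(f"ECE: {ece:.2f}")  # |0.9 - 0.5| = 0.40
```

Note that the BLIP-2 confidence in this notebook is a fixed heuristic (0.8 or 0.5) rather than a model output, so a calibration measure like this is more meaningful for the GPT-4V self-reported scores.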

### Next Steps:
- Run the real-world image evaluation notebook (`02_counting_real_camouflage.ipynb`)
- Compare synthetic vs. real-world performance
- Investigate failure modes and potential improvements
- Consider prompt engineering and few-shot learning approaches

This research contributes to understanding the limitations and biases in current VLMs, informing better model development and deployment strategies.