## 📋 Setup & Configuration

First, let's configure the paths and parameters for our evaluation.

In [2]:
import sys
from pathlib import Path
import json
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Configuration
TEST_IMAGES_DIR = Path("/content/GenAI-for-Visual-Synthesis/test_data")  # Change this to your test images directory
OUTPUT_DIR = Path("/content/GenAI-for-Visual-Synthesis/outputs")
CUSTOM_OUTPUT = OUTPUT_DIR / "custom_method"
DIFFEDIT_OUTPUT = OUTPUT_DIR / "diffedit"

# Number of test images to use (set to None to use all)
MAX_TEST_IMAGES = 75  # Adjust based on your needs

# Device configuration
DEVICE = "cuda"  # Change to "cpu" if no GPU available

# Create output directories
CUSTOM_OUTPUT.mkdir(parents=True, exist_ok=True)
DIFFEDIT_OUTPUT.mkdir(parents=True, exist_ok=True)

print("✓ Configuration set")
print(f"  Test images: {TEST_IMAGES_DIR}")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Device: {DEVICE}")
print(f"  Max images: {MAX_TEST_IMAGES if MAX_TEST_IMAGES else 'All'}")

✓ Configuration set
  Test images: /content/GenAI-for-Visual-Synthesis/test_data
  Output directory: /content/GenAI-for-Visual-Synthesis/outputs
  Device: cuda
  Max images: 75
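
The configuration cell hard-codes `DEVICE = "cuda"`, which fails on a CPU-only runtime. A minimal sketch of a fallback is shown below; `pick_device` is a hypothetical helper (not part of the notebook), and it only distinguishes between `"cuda"` and `"cpu"`.

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return "cuda" only if it was requested and is usable, otherwise "cpu"."""
    if preferred != "cuda":
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = pick_device("cuda")
print(f"Device: {DEVICE}")
```

With this in place the later `.to(DEVICE)` calls should work unchanged on either kind of runtime, although the Stable Diffusion stages will be far slower on CPU.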


## 🔍 Understanding the Evaluation Metrics

### How Mask Quality is Evaluated (IoU - Intersection over Union)

**Without Ground Truth Masks:**
Since we don't have manual ground truth masks, we use **cross-method agreement** as a proxy (a consistency check, not cross-validation in the statistical sense):
1. We compare the masks generated by both methods
2. Higher IoU with the original segmentation = better consistency
3. We also evaluate the **quality of the final output** (FID, IS, CLIP)

**The key insight:** A good mask should:
- Accurately segment the vehicle
- Lead to better final image quality
- Be consistent across the pipeline
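
To make the mask comparison concrete, here is a small self-contained sketch of IoU on two toy binary masks. The masks and the `iou` helper are illustrative only; the notebook's own `calculate_iou` works on image files rather than nested lists.

```python
def iou(mask_a, mask_b):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks have no union; return 0.0 in that case
    return intersection / union if union else 0.0


mask_a = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
mask_b = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]

# 2 overlapping pixels out of 6 covered by either mask
print(f"IoU = {iou(mask_a, mask_b):.3f}")
```

Two masks that each cover the vehicle but are shifted by one column already drop to an IoU of about 0.33, which is why values above 0.7 indicate close agreement.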

### Metric Details:

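1. **FID (Fréchet Inception Distance)** ⬇️ Lower is Better
   - Compares the distribution of generated images to that of real images
   - Uses deep features from an Inception network
   - Rough guide: **Good**: < 50, **Excellent**: < 30 (values depend strongly on dataset and sample size; with only 75 images, treat FID as a relative comparison between the two methods)
   - Measures: Overall image quality and realism

2. **IS (Inception Score)** ⬆️ Higher is Better
   - Measures quality and diversity of generated images
   - Rough guide: **Good**: > 3.0, **Excellent**: > 4.0 (single-category image sets such as vehicles tend to score low because there is little class diversity)
   - Formula: exp(E_x[KL(p(y|x) || p(y))])
   - High score = diverse, confidently classified images

3. **IoU (Intersection over Union)** ⬆️ Higher is Better
   - Measures mask overlap: IoU = |A ∩ B| / |A ∪ B|
   - **Good**: > 0.7, **Excellent**: > 0.85
   - Compares generated masks: stage 1 vs stage 3 for the custom method, and the DiffEdit mask vs the custom stage 3 mask for DiffEdit
   - Higher = closer agreement between masks; without ground truth this indicates consistency rather than verified accuracy

4. **LPIPS (Learned Perceptual Image Patch Similarity)** ⬇️ Lower is Better
   - Uses deep network features to measure the perceptual distance between the original and the edited image
   - **Good**: < 0.3
   - Tracks human perception more closely than MSE/PSNR or SSIM

5. **CLIP Score** ⬆️ Higher is Better
   - Measures how well image matches text prompt
   - Computed here as the raw cosine similarity in [-1, 1]; with CLIP ViT-B/32, well-matched image-text pairs typically score roughly 0.2 to 0.35, so compare the two methods against each other rather than against a fixed threshold
   - Uses CLIP's vision-language alignment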
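
The Inception Score formula above can be sketched directly from class probabilities. This is a toy illustration, not the notebook's implementation: `inception_score` is a hypothetical helper, and the probability rows stand in for the softmax outputs p(y|x) that an Inception classifier would produce.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over rows of class probabilities."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal p(y): average of the per-image class distributions
    marginal = [sum(row[j] for row in probs) / n for j in range(num_classes)]
    kl_total = 0.0
    for row in probs:
        # Terms with p == 0 contribute nothing to the KL divergence
        kl_total += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0)
    return math.exp(kl_total / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
all_the_same = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_and_diverse))
print(inception_score(all_the_same))
```

The score is bounded by the number of classes: confident predictions spread evenly over two classes give 2, while identical predictions for every image give 1.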

## 📝 Step 1: Prepare Test Images & Prompts

Let's check what test images we have and create prompts for them.

In [3]:
import os
from PIL import Image

# Check test images
if not TEST_IMAGES_DIR.exists():
    print(f"❌ Test images directory not found: {TEST_IMAGES_DIR}")
    print("Please create it and add test images, or update TEST_IMAGES_DIR")
else:
    image_files = sorted([
        f for f in TEST_IMAGES_DIR.iterdir()
        if f.suffix.lower() in ['.jpg', '.jpeg', '.png']
    ])

    if MAX_TEST_IMAGES:
        image_files = image_files[:MAX_TEST_IMAGES]

    print(f"✓ Found {len(image_files)} test images")

    # Display first few images
    if len(image_files) > 0:
        print("\nFirst few test images:")
        for i, img_file in enumerate(image_files[:5], 1):
            print(f"  {i}. {img_file.name}")
        if len(image_files) > 5:
            print(f"  ... and {len(image_files) - 5} more")

✓ Found 75 test images

First few test images:
  1. 000aa097d423_03.jpg
  2. 00ad56bf7ee6_03.jpg
  3. 00afb946a54c_03.jpg
  4. 00b6aee52419_03.jpg
  5. 00c07d49f4c5_03.jpg
  ... and 70 more


### Create Prompts

You need to create prompts for each test image. Here are two approaches:

**Option A: Use the sample prompts below and customize**
**Option B: Load from existing JSON files**
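
For Option B, the sketch below shows the JSON layout the later cells appear to expect: one entry per image file name, with `vehicle`/`background` keys for the custom method and `source`/`target` keys for DiffEdit. The file name and prompt text are made up for illustration, and a temporary directory is used instead of the real output paths.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical image name and prompt text; only the key names mirror the notebook
custom_prompts = {
    "example_car.jpg": {
        "vehicle": "red vintage convertible, polished chrome, high detail",
        "background": "coastal road at sunset",
    }
}
diffedit_prompts = {
    "example_car.jpg": {
        "source": "a car on the road",
        "target": "a red vintage convertible",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    custom_file = Path(tmp) / "prompts_custom.json"
    custom_file.write_text(json.dumps(custom_prompts, indent=2))
    loaded = json.loads(custom_file.read_text())

print(loaded["example_car.jpg"]["vehicle"])
```

Keeping the image file name as the dictionary key lets the processing loops look up prompts with `img_file.name`.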

In [6]:
# Auto-generate prompts using AI models
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import T5Tokenizer, T5ForConditionalGeneration

print("🤖 Loading AI models for automatic prompt generation...")
print("This may take a minute on first run (downloading models)...\n")

# Load BLIP for image captioning
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(DEVICE)

# Load FLAN-T5 for prompt enhancement
t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").to(DEVICE)

print("✓ Models loaded successfully!\n")
print("Generating prompts...")
print("="*70)

sample_prompts_custom = {}
sample_prompts_diffedit = {}

for img_file in tqdm(image_files, desc="Generating prompts"):
    img_name = img_file.name

    try:
        # Step 1: Get image caption using BLIP
        image = Image.open(img_file).convert("RGB")
        inputs = blip_processor(image, return_tensors="pt").to(DEVICE)
        out = blip_model.generate(**inputs, max_length=50)
        caption = blip_processor.decode(out[0], skip_special_tokens=True)

        # Step 2: Generate vehicle prompt using FLAN-T5
        vehicle_instruction = f"""Given this image description: "{caption}"
Generate a detailed, creative prompt to regenerate the vehicle in a more stylish way.
Focus on: vehicle type, color, style, condition.
Prompt:"""

        inputs = t5_tokenizer(vehicle_instruction, return_tensors="pt", max_length=512, truncation=True).to(DEVICE)
        outputs = t5_model.generate(inputs.input_ids, max_length=100, num_beams=4, temperature=0.8, do_sample=True, top_p=0.9)
        vehicle_prompt = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Fallback if prompt too short
        if len(vehicle_prompt.split()) < 5:
            vehicle_type = "car"
            if "suv" in caption.lower():
                vehicle_type = "SUV"
            elif "truck" in caption.lower():
                vehicle_type = "truck"
            vehicle_prompt = f"sleek modern {vehicle_type}, glossy finish, high detail"

        # Step 3: Generate background prompt using FLAN-T5
        bg_instruction = f"""Given this image description: "{caption}"
Generate a creative prompt for an interesting background setting.
Focus on: environment, lighting, atmosphere, scenery.
Prompt:"""

        inputs = t5_tokenizer(bg_instruction, return_tensors="pt", max_length=512, truncation=True).to(DEVICE)
        outputs = t5_model.generate(inputs.input_ids, max_length=100, num_beams=4, temperature=0.8, do_sample=True, top_p=0.9)
        bg_prompt = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Fallback if prompt too short
        if len(bg_prompt.split()) < 5:
            backgrounds = [
                "scenic mountain highway at golden hour",
                "modern city street with glass buildings",
                "coastal road with ocean view at sunset"
            ]
            # hash() on str is randomized per process; a byte sum keeps the choice reproducible
            bg_prompt = backgrounds[sum(caption.encode("utf-8")) % len(backgrounds)]

        # Step 4: Generate DiffEdit prompts
        source_instruction = f"""Simplify this description to a short phrase: "{caption}"
Simplified:"""

        inputs = t5_tokenizer(source_instruction, return_tensors="pt", max_length=512, truncation=True).to(DEVICE)
        # Deterministic beam search: temperature only takes effect when do_sample=True
        outputs = t5_model.generate(inputs.input_ids, max_length=50, num_beams=2)
        source_prompt = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

        if len(source_prompt.split()) < 3:
            source_prompt = "a car on the road"

        target_instruction = f"""Rewrite this as an enhanced, stylish version: "{caption}"
Enhanced:"""

        inputs = t5_tokenizer(target_instruction, return_tensors="pt", max_length=512, truncation=True).to(DEVICE)
        outputs = t5_model.generate(inputs.input_ids, max_length=50, num_beams=4, temperature=0.7, do_sample=True, top_p=0.9)
        target_prompt = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

        if len(target_prompt.split()) < 3:
            target_prompt = "a sleek modern sports car"

        # Store prompts
        sample_prompts_custom[img_name] = {
            "vehicle": vehicle_prompt,
            "background": bg_prompt,
            "original_caption": caption
        }

        sample_prompts_diffedit[img_name] = {
            "source": source_prompt,
            "target": target_prompt,
            "original_caption": caption
        }

        # Display progress
        if len(sample_prompts_custom) <= 3:  # Show first 3
            print(f"\n{img_name}")
            print(f"  Caption: {caption}")
            print(f"  Vehicle: {vehicle_prompt}")
            print(f"  Background: {bg_prompt}")

    except Exception as e:
        print(f"\n⚠️  Error processing {img_name}: {e}")
        # Add fallback prompts
        sample_prompts_custom[img_name] = {
            "vehicle": "sleek modern sports car, high detail",
            "background": "scenic highway at sunset"
        }
        sample_prompts_diffedit[img_name] = {
            "source": "a car on the road",
            "target": "a modern sports car"
        }

# Clean up models to free memory
del blip_model, blip_processor, t5_model, t5_tokenizer
torch.cuda.empty_cache()

print(f"\n{'='*70}")
print(f"✓ Generated prompts for {len(sample_prompts_custom)} images!")
print("\nSample prompts (first 3 shown above)")
print("All prompts will be saved in the next cell.")

🤖 Loading AI models for automatic prompt generation...
This may take a minute on first run (downloading models)...

✓ Models loaded successfully!

Generating prompts...


Generating prompts:   0%|          | 0/75 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



000aa097d423_03.jpg
  Caption: a white car parked in a garage
  Vehicle: A white car is parked in a garage.
  Background: A white car parked in a garage.

00ad56bf7ee6_03.jpg
  Caption: a gray car
  Vehicle: A gray car is in good condition.
  Background: A gray car in a parking lot.

00afb946a54c_03.jpg
  Caption: a gray nissan rogue suv parked in a white room
  Vehicle: a gray nissan rogue suv parked in a white room
  Background: A gray nissan rogue suv parked in a white room.

✓ Generated prompts for 75 images!

Sample prompts (first 3 shown above)
All prompts will be saved in the next cell.


In [10]:
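# NOTE: this resets any prompts held in memory from the auto-generation cell.
# If the JSON files below already exist, their contents take precedence.
sample_prompts_custom = {}
sample_prompts_diffedit = {}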

# File paths for prompt files
prompts_custom_file = Path("/content/GenAI-for-Visual-Synthesis/outputs/custom_method/prompts_custom.json")
prompts_diffedit_file = Path("/content/GenAI-for-Visual-Synthesis/outputs/diffedit/prompts_diffedit.json")

# Check if prompt files already exist and load them
if prompts_custom_file.exists() and prompts_diffedit_file.exists():
    print("📂 Found existing prompt files! Loading...")

    with open(prompts_custom_file, 'r') as f:
        sample_prompts_custom = json.load(f)

    with open(prompts_diffedit_file, 'r') as f:
        sample_prompts_diffedit = json.load(f)

    print(f"✓ Loaded prompts for {len(sample_prompts_custom)} images from existing files")
    print(f"  - {prompts_custom_file}")
    print(f"  - {prompts_diffedit_file}")

    # Show first 3 prompts as preview
    if len(sample_prompts_custom) > 0:
        print("\n📝 Preview of loaded prompts (first 3):")
        for idx, (img_name, prompts) in enumerate(list(sample_prompts_custom.items())[:3], 1):
            print(f"\n  {idx}. {img_name}")
            print(f"     Vehicle: {prompts.get('vehicle', 'N/A')}")
            print(f"     Background: {prompts.get('background', 'N/A')}")

        if len(sample_prompts_custom) > 3:
            print(f"\n  ... and {len(sample_prompts_custom) - 3} more images")

    print("\n💡 To regenerate prompts, delete the JSON files and run the auto-generation cell.")

# If no existing files, generate generic prompts as fallback
elif len(image_files) > 0:
    print("❌ No existing prompt files found.")
    print("\n⚠️  IMPORTANT: You have two options:")
    print("   1. Run the 'Auto-Generate Prompts with AI' cell above (RECOMMENDED)")
    print("   2. Continue with generic prompts (not recommended)")
    print("\nCreating generic prompts as temporary fallback...")

    for img_file in image_files:
        img_name = img_file.name
        # Generic prompts - PLEASE CUSTOMIZE FOR YOUR IMAGES
        sample_prompts_custom[img_name] = {
            "vehicle": "sleek modern sports car",
            "background": "scenic highway at sunset"
        }
        sample_prompts_diffedit[img_name] = {
            "source": "a car on the road",
            "target": "a modern sports car"
        }

    print(f"✓ Created generic prompts for {len(image_files)} images")
    print("\n⚠️  WARNING: Using GENERIC prompts - all images will have the same prompt!")
    print("   For better results, run the auto-generation cell above.")

    # Save generic prompts to files
    with open(prompts_custom_file, 'w') as f:
        json.dump(sample_prompts_custom, f, indent=2)

    with open(prompts_diffedit_file, 'w') as f:
        json.dump(sample_prompts_diffedit, f, indent=2)

    print(f"\n✓ Generic prompts saved to:")
    print(f"  - {prompts_custom_file}")
    print(f"  - {prompts_diffedit_file}")

else:
    print("⚠️  No image files found. Please add images to the test_data directory first.")

📂 Found existing prompt files! Loading...
✓ Loaded prompts for 75 images from existing files
  - /content/GenAI-for-Visual-Synthesis/outputs/custom_method/prompts_custom.json
  - /content/GenAI-for-Visual-Synthesis/outputs/diffedit/prompts_diffedit.json

📝 Preview of loaded prompts (first 3):

  1. 000aa097d423_03.jpg
     Vehicle: sleek modern sports car
     Background: scenic highway at sunset

  2. 00ad56bf7ee6_03.jpg
     Vehicle: sleek modern sports car
     Background: scenic highway at sunset

  3. 00afb946a54c_03.jpg
     Vehicle: sleek modern sports car
     Background: scenic highway at sunset

  ... and 72 more images

💡 To regenerate prompts, delete the JSON files and run the auto-generation cell.


## 🚀 Step 2: Run Custom Method

This will process all test images through your 4-stage pipeline:
1. **Stage 1**: UNet segmentation (original image)
2. **Stage 2**: Stable Diffusion vehicle regeneration
3. **Stage 3**: UNet re-segmentation (edited image)
4. **Stage 4**: Stable Diffusion background inpainting

In [16]:
%pip install -r /content/GenAI-for-Visual-Synthesis/requirements.txt

Collecting git+https://github.com/openai/CLIP.git (from -r /content/GenAI-for-Visual-Synthesis/requirements.txt (line 20))
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-5_y7lblt
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-5_y7lblt
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting torchmetrics (from -r /content/GenAI-for-Visual-Synthesis/requirements.txt (line 14))
  Downloading torchmetrics-1.8.2-py3-none-any.whl.metadata (22 kB)
Collecting ftfy (from -r /content/GenAI-for-Visual-Synthesis/requirements.txt (line 17))
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting httptools>=0.6.3 (from uvicorn[standard]->-r /content/GenAI-for-Visual-Synthesis/requirements.txt (line 10))
  Downloading httptools-0.7.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_

In [21]:
# Import custom pipeline functions
import sys
from pathlib import Path

# Add parent directory to path to import main.py
MAIN_DIR = Path("/content/GenAI-for-Visual-Synthesis")
if str(MAIN_DIR) not in sys.path:
    sys.path.insert(0, str(MAIN_DIR))

# Import from main.py
from main import (
    segment_image,
    regenerate_vehicle,
    segment_edited_image,
    inpaint_background,
    BASE_DIR
)

print(f"✓ Imported pipeline functions from: {MAIN_DIR / 'main.py'}")
print(f"✓ Base directory: {BASE_DIR}")

# Model paths
unet_path = BASE_DIR / "model" / "unet_model_carvana_new.pth"
sd_model_path = (
    BASE_DIR / "model" / "stable-diffusion" /
    "models--runwayml--stable-diffusion-v1-5" / "snapshots" /
    "451f4fe16113bff5a5d2269ed5ad43b0592e9a14"
)

# Check models exist
print("\nChecking model files...")
if not unet_path.exists():
    print(f"❌ UNet model not found: {unet_path}")
    print("   Please run setup.py to download models")
else:
    print(f"✓ UNet model found: {unet_path.name}")

if not sd_model_path.exists():
    print(f"❌ Stable Diffusion model not found: {sd_model_path}")
    print("   Please run setup.py to download models")
else:
    print("✓ Stable Diffusion model found")

# Verify all imports are working
print("\n✓ All imports successful and ready to process images!")

✓ Imported pipeline functions from: /content/GenAI-for-Visual-Synthesis/main.py
✓ Base directory: /content/GenAI-for-Visual-Synthesis

Checking model files...
✓ UNet model found: unet_model_carvana_new.pth
✓ Stable Diffusion model found

✓ All imports successful and ready to process images!


In [22]:
# Process images with custom method
print("Processing images with Custom Method...")
print("="*60)

custom_results = []

for img_file in tqdm(image_files, desc="Custom Method"):
    img_name = img_file.name

    if img_name not in sample_prompts_custom:
        print(f"⚠ No prompts for {img_name}, skipping...")
        continue

    prompts = sample_prompts_custom[img_name]
    vehicle_prompt = prompts.get('vehicle', '')
    background_prompt = prompts.get('background', '')

    try:
        # Stage 1: Initial Segmentation
        stage1_mask_name = f"stage1_mask_{img_name}"
        stage1_mask_path = segment_image(
            img_path=str(img_file),
            model_path=str(unet_path),
            output_dir=CUSTOM_OUTPUT,
            output_name=stage1_mask_name
        )

        # Stage 2: Vehicle Regeneration
        stage2_vehicle_name = f"stage2_vehicle_{img_name}"
        _, stage2_vehicle_path = regenerate_vehicle(
            img_path=str(img_file),
            mask_path=stage1_mask_path,
            model_dir=str(sd_model_path),
            prompt=vehicle_prompt,
            output_dir=CUSTOM_OUTPUT,
            output_name=stage2_vehicle_name
        )

        # Stage 3: Re-segmentation
        stage3_mask_name = f"stage3_mask_{img_name}"
        stage3_mask_path = segment_edited_image(
            img_path=stage2_vehicle_path,
            model_path=str(unet_path),
            output_dir=CUSTOM_OUTPUT,
            output_name=stage3_mask_name
        )

        # Stage 4: Background Inpainting
        stage4_final_name = f"final_{img_name}"
        _, stage4_final_path = inpaint_background(
            img_path=stage2_vehicle_path,
            mask_path=stage3_mask_path,
            model_dir=str(sd_model_path),
            prompt=background_prompt,
            output_dir=CUSTOM_OUTPUT,
            output_name=stage4_final_name
        )

        custom_results.append({
            'image': img_name,
            'status': 'success',
            'stage1_mask': stage1_mask_path,
            'stage3_mask': stage3_mask_path,
            'final': stage4_final_path
        })

    except Exception as e:
        print(f"\n❌ Error processing {img_name}: {e}")
        custom_results.append({
            'image': img_name,
            'status': 'error',
            'error': str(e)
        })

successful_custom = len([r for r in custom_results if r['status'] == 'success'])
print(f"\n✓ Custom method completed: {successful_custom}/{len(image_files)} images successful")

Processing images with Custom Method...


Custom Method:   0%|          | 0/75 [00:00<?, ?it/s]

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

Using CPU


  0%|          | 0/50 [00:00<?, ?it/s]

KeyboardInterrupt: 

## 🎨 Step 3: Run DiffEdit

Now let's run DiffEdit on the same images for comparison.

In [None]:
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

# Setup DiffEdit pipeline
print("Loading DiffEdit pipeline...")
diffedit_pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)

diffedit_pipeline.scheduler = DDIMScheduler.from_config(diffedit_pipeline.scheduler.config)
diffedit_pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(
    diffedit_pipeline.scheduler.config
)

diffedit_pipeline.enable_model_cpu_offload()
diffedit_pipeline.enable_vae_slicing()

print("✓ DiffEdit pipeline loaded")

In [None]:
# Process images with DiffEdit
print("\nProcessing images with DiffEdit...")
print("="*60)
print("Note: DiffEdit includes an inversion step, so each image takes longer (~60-120s)")

diffedit_results = []
IMAGE_SIZE = (768, 768)

for img_file in tqdm(image_files, desc="DiffEdit"):
    img_name = img_file.name

    if img_name not in sample_prompts_diffedit:
        print(f"⚠ No prompts for {img_name}, skipping...")
        continue

    prompts = sample_prompts_diffedit[img_name]
    source_prompt = prompts.get('source', '')
    target_prompt = prompts.get('target', '')

    try:
        # Load and resize image
        raw_image = load_image(str(img_file)).resize(IMAGE_SIZE)

        # Generate mask
        mask_image = diffedit_pipeline.generate_mask(
            image=raw_image,
            source_prompt=source_prompt,
            target_prompt=target_prompt,
        )

        # Invert latents
        inv_latents = diffedit_pipeline.invert(
            prompt=source_prompt,
            image=raw_image
        ).latents

        # Generate final image
        output_image = diffedit_pipeline(
            prompt=target_prompt,
            mask_image=mask_image,
            image_latents=inv_latents,
            negative_prompt=source_prompt,
        ).images[0]

        # Save outputs
        output_path = DIFFEDIT_OUTPUT / f"edited_{img_name}"
        output_image.save(output_path)

        # Save mask
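        # A 2-D uint8 array is read as mode "L" automatically, so no explicit mode is needed
        mask_pil = Image.fromarray((mask_image.squeeze() * 255).astype("uint8"))
        # Nearest-neighbour resampling keeps the upscaled mask strictly binary
        mask_pil = mask_pil.resize(IMAGE_SIZE, Image.NEAREST)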
        mask_path = DIFFEDIT_OUTPUT / f"mask_{img_name}"
        mask_pil.save(mask_path)

        diffedit_results.append({
            'image': img_name,
            'status': 'success',
            'output': str(output_path),
            'mask': str(mask_path)
        })

    except Exception as e:
        print(f"\n❌ Error processing {img_name}: {e}")
        diffedit_results.append({
            'image': img_name,
            'status': 'error',
            'error': str(e)
        })

successful_diffedit = len([r for r in diffedit_results if r['status'] == 'success'])
print(f"\n✓ DiffEdit completed: {successful_diffedit}/{len(image_files)} images successful")

# Free memory
del diffedit_pipeline
torch.cuda.empty_cache()

## 📊 Step 4: Compute Evaluation Metrics

Now let's compute all the metrics to compare both methods.

In [None]:
import torch
import numpy as np
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
import clip
from torchvision import transforms
from scipy.stats import entropy
import torch.nn.functional as F
from torchvision.models import inception_v3

print("Initializing evaluation metrics...")

# Initialize metrics
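# normalize=False: the metric expects uint8 tensors in [0, 255], which is what preprocess_for_fid returns
fid_metric = FrechetInceptionDistance(normalize=False).to(DEVICE)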
lpips_metric = LearnedPerceptualImagePatchSimilarity(net_type='alex').to(DEVICE)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=DEVICE)

print("✓ Metrics initialized")

In [None]:
# Helper functions for metric computation

def preprocess_for_fid(image_path):
    """Convert image to tensor for FID"""
    img = Image.open(image_path).convert("RGB")
    img_array = np.array(img)
    if img_array.dtype != np.uint8:
        img_array = (img_array * 255).astype(np.uint8)
    tensor = torch.from_numpy(img_array).permute(2, 0, 1).unsqueeze(0)
    return tensor.to(DEVICE)

def preprocess_for_lpips(image_path):
    """Convert image to tensor for LPIPS"""
    img = Image.open(image_path).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    ])
    return transform(img).unsqueeze(0).to(DEVICE)

def calculate_clip_similarity(image_path, text_prompt):
    """Calculate CLIP image-text similarity"""
    img = Image.open(image_path).convert("RGB")
    image_input = clip_preprocess(img).unsqueeze(0).to(DEVICE)
    # truncate=True avoids an error when a prompt exceeds CLIP's 77-token context
    text_input = clip.tokenize([text_prompt], truncate=True).to(DEVICE)

    with torch.no_grad():
        image_features = clip_model.encode_image(image_input)
        text_features = clip_model.encode_text(text_input)

        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        similarity = (image_features @ text_features.T).item()

    return similarity

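def calculate_iou(mask1_path, mask2_path):
    """Calculate IoU between two masks (mask2 is resized to mask1's size if they differ)"""
    img1 = Image.open(mask1_path).convert("L")
    img2 = Image.open(mask2_path).convert("L")
    if img2.size != img1.size:
        # Masks from different pipelines may differ in resolution; nearest keeps them binary
        img2 = img2.resize(img1.size, Image.NEAREST)
    mask1 = np.array(img1) > 127
    mask2 = np.array(img2) > 127

    intersection = np.logical_and(mask1, mask2).sum()
    union = np.logical_or(mask1, mask2).sum()

    if union == 0:
        return 0.0
    return float(intersection / union)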

print("✓ Helper functions defined")
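
`calculate_clip_similarity` normalizes both embeddings to unit length and takes their dot product, which is the cosine similarity. The toy sketch below uses made-up 2-D vectors instead of real CLIP embeddings to show the range of values this produces; `cosine_similarity` is an illustrative helper, not part of the notebook.

```python
import math


def cosine_similarity(u, v):
    """Dot product of two vectors after scaling each to unit length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the result depends only on direction, the score always lies in [-1, 1] regardless of embedding magnitude.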

In [None]:
# Compute metrics for Custom Method
print("Computing metrics for Custom Method...")
print("="*60)

custom_metrics = {
    'lpips_scores': [],
    'clip_scores': [],
    'iou_scores': [],
    'per_image': []
}

# Reset FID
fid_metric.reset()

for result in tqdm(custom_results, desc="Custom Metrics"):
    if result['status'] != 'success':
        continue

    img_name = result['image']
    orig_path = TEST_IMAGES_DIR / img_name
    final_path = Path(result['final'])

    # FID
    orig_tensor = preprocess_for_fid(orig_path)
    final_tensor = preprocess_for_fid(final_path)
    fid_metric.update(orig_tensor, real=True)
    fid_metric.update(final_tensor, real=False)

    # LPIPS
    orig_lpips = preprocess_for_lpips(orig_path)
    final_lpips = preprocess_for_lpips(final_path)
    lpips_score = lpips_metric(final_lpips, orig_lpips).item()
    custom_metrics['lpips_scores'].append(lpips_score)

    # CLIP (using target prompt)
    if img_name in sample_prompts_custom:
        prompts = sample_prompts_custom[img_name]
        combined_prompt = f"{prompts['vehicle']} on {prompts['background']}"
        clip_score = calculate_clip_similarity(final_path, combined_prompt)
        custom_metrics['clip_scores'].append(clip_score)

    # IoU (comparing stage 1 and stage 3 masks for consistency)
    iou_score = calculate_iou(result['stage1_mask'], result['stage3_mask'])
    custom_metrics['iou_scores'].append(iou_score)

    custom_metrics['per_image'].append({
        'image': img_name,
        'lpips': lpips_score,
        'clip': clip_score if img_name in sample_prompts_custom else None,
        'iou': iou_score
    })

# Compute FID
custom_metrics['fid'] = fid_metric.compute().item()
custom_metrics['avg_lpips'] = np.mean(custom_metrics['lpips_scores'])
custom_metrics['avg_clip'] = np.mean(custom_metrics['clip_scores'])
custom_metrics['avg_iou'] = np.mean(custom_metrics['iou_scores'])

print(f"✓ Custom Method Metrics:")
print(f"  FID:   {custom_metrics['fid']:.3f}")
print(f"  LPIPS: {custom_metrics['avg_lpips']:.4f}")
print(f"  CLIP:  {custom_metrics['avg_clip']:.4f}")
print(f"  IoU:   {custom_metrics['avg_iou']:.4f}")

In [None]:
# Compute metrics for DiffEdit
print("\nComputing metrics for DiffEdit...")
print("="*60)

diffedit_metrics = {
    'lpips_scores': [],
    'clip_scores': [],
    'iou_scores': [],
    'per_image': []
}

# Reset FID
fid_metric.reset()

for result in tqdm(diffedit_results, desc="DiffEdit Metrics"):
    if result['status'] != 'success':
        continue

    img_name = result['image']
    orig_path = TEST_IMAGES_DIR / img_name
    output_path = Path(result['output'])

    # FID
    orig_tensor = preprocess_for_fid(orig_path)
    output_tensor = preprocess_for_fid(output_path)
    fid_metric.update(orig_tensor, real=True)
    fid_metric.update(output_tensor, real=False)

    # LPIPS
    orig_lpips = preprocess_for_lpips(orig_path)
    output_lpips = preprocess_for_lpips(output_path)
    lpips_score = lpips_metric(output_lpips, orig_lpips).item()
    diffedit_metrics['lpips_scores'].append(lpips_score)

    # CLIP
    if img_name in sample_prompts_diffedit:
        target_prompt = sample_prompts_diffedit[img_name]['target']
        clip_score = calculate_clip_similarity(output_path, target_prompt)
        diffedit_metrics['clip_scores'].append(clip_score)

    # IoU (compare with custom method's mask for the same image)
    iou_score = None  # reset each iteration so a stale or undefined value is never recorded
    custom_result = next((r for r in custom_results if r['image'] == img_name), None)
    if custom_result and custom_result['status'] == 'success':
        iou_score = calculate_iou(result['mask'], custom_result['stage3_mask'])
        diffedit_metrics['iou_scores'].append(iou_score)

    diffedit_metrics['per_image'].append({
        'image': img_name,
        'lpips': lpips_score,
        'clip': clip_score if img_name in sample_prompts_diffedit else None,
        'iou': iou_score
    })

# Compute FID
diffedit_metrics['fid'] = fid_metric.compute().item()
diffedit_metrics['avg_lpips'] = np.mean(diffedit_metrics['lpips_scores'])
diffedit_metrics['avg_clip'] = np.mean(diffedit_metrics['clip_scores'])
diffedit_metrics['avg_iou'] = np.mean(diffedit_metrics['iou_scores']) if diffedit_metrics['iou_scores'] else 0

print(f"✓ DiffEdit Metrics:")
print(f"  FID:   {diffedit_metrics['fid']:.3f}")
print(f"  LPIPS: {diffedit_metrics['avg_lpips']:.4f}")
print(f"  CLIP:  {diffedit_metrics['avg_clip']:.4f}")
print(f"  IoU:   {diffedit_metrics['avg_iou']:.4f}")

## 📈 Step 5: Compare Results

Let's create a comprehensive comparison of both methods.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Create comparison table
print("\n" + "="*70)
print("COMPARATIVE RESULTS: Custom Method vs DiffEdit")
print("="*70 + "\n")

comparison_data = {
    'Metric': ['FID ↓', 'LPIPS ↓', 'CLIP ↑', 'IoU ↑'],
    'Custom Method': [
        f"{custom_metrics['fid']:.3f}",
        f"{custom_metrics['avg_lpips']:.4f}",
        f"{custom_metrics['avg_clip']:.4f}",
        f"{custom_metrics['avg_iou']:.4f}"
    ],
    'DiffEdit': [
        f"{diffedit_metrics['fid']:.3f}",
        f"{diffedit_metrics['avg_lpips']:.4f}",
        f"{diffedit_metrics['avg_clip']:.4f}",
        f"{diffedit_metrics['avg_iou']:.4f}"
    ]
}

df = pd.DataFrame(comparison_data)

# Determine winners
winners = []
metrics_list = [
    ('fid', False),  # lower is better
    ('avg_lpips', False),
    ('avg_clip', True),  # higher is better
    ('avg_iou', True)
]

for metric_key, higher_better in metrics_list:
    custom_val = custom_metrics[metric_key]
    diffedit_val = diffedit_metrics[metric_key]

    if higher_better:
        winner = '✓ Custom' if custom_val > diffedit_val else '✓ DiffEdit'
    else:
        winner = '✓ Custom' if custom_val < diffedit_val else '✓ DiffEdit'

    winners.append(winner)

df['Winner'] = winners

print(df.to_string(index=False))
print("\n" + "="*70)

# Count wins
custom_wins = winners.count('✓ Custom')
diffedit_wins = winners.count('✓ DiffEdit')

print(f"\nOVERALL WINNER: ", end="")
if custom_wins > diffedit_wins:
    print("✓ Custom Method")
elif diffedit_wins > custom_wins:
    print("✓ DiffEdit")
else:
    print("Tie")

print(f"(Custom: {custom_wins} wins, DiffEdit: {diffedit_wins} wins)")
print("="*70 + "\n")

## 📊 Step 6: Visualize Results

Let's create visual comparisons of the methods.

In [None]:
# Create bar chart comparison
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

metrics_to_plot = [
    ('FID', 'fid', False),
    ('LPIPS', 'avg_lpips', False),
    ('CLIP', 'avg_clip', True),
    ('IoU', 'avg_iou', True)
]

for idx, (name, key, higher_better) in enumerate(metrics_to_plot):
    custom_val = custom_metrics[key]
    diffedit_val = diffedit_metrics[key]

    # Determine colors
    if higher_better:
        colors = ['green' if custom_val > diffedit_val else 'lightgreen',
                 'blue' if diffedit_val > custom_val else 'lightblue']
    else:
        colors = ['green' if custom_val < diffedit_val else 'lightgreen',
                 'blue' if diffedit_val < custom_val else 'lightblue']

    bars = axes[idx].bar(['Custom', 'DiffEdit'],
                        [custom_val, diffedit_val],
                        color=colors)

    # Add value labels
    for bar in bars:
        height = bar.get_height()
        axes[idx].text(bar.get_x() + bar.get_width()/2., height,
                      f'{height:.3f}',
                      ha='center', va='bottom', fontsize=11)

    # Styling
    direction = '↑ Higher Better' if higher_better else '↓ Lower Better'
    axes[idx].set_title(f'{name}\n{direction}', fontsize=13, fontweight='bold')
    axes[idx].set_ylabel('Score', fontsize=11)
    axes[idx].grid(axis='y', alpha=0.3)

plt.suptitle('Metrics Comparison: Custom Method vs DiffEdit',
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'metrics_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✓ Metrics chart saved to: {OUTPUT_DIR / 'metrics_comparison.png'}")

In [None]:
# Create side-by-side image comparisons
print("\nCreating visual comparisons of individual images...")

# Keep only images that both methods processed successfully
success_results = [(c, d) for c, d in zip(custom_results, diffedit_results)
                   if c['status'] == 'success' and d['status'] == 'success']

# Select up to 5 of those pairs to display
display_images = min(5, len(success_results))
if display_images == 0:
    raise RuntimeError("No image was processed successfully by both methods")

# squeeze=False keeps axes 2-D even when only one row is plotted
fig, axes = plt.subplots(display_images, 3, figsize=(15, 5 * display_images), squeeze=False)

for idx, (custom_result, diffedit_result) in enumerate(success_results[:display_images]):
    img_name = custom_result['image']

    # Load images
    original = Image.open(TEST_IMAGES_DIR / img_name)
    custom_output = Image.open(custom_result['final'])
    diffedit_output = Image.open(diffedit_result['output'])

    # Display
    axes[idx, 0].imshow(original)
    axes[idx, 0].set_title(f'Original\n{img_name}', fontsize=10)
    axes[idx, 0].axis('off')

    axes[idx, 1].imshow(custom_output)
    axes[idx, 1].set_title('Custom Method', fontsize=10, color='green', fontweight='bold')
    axes[idx, 1].axis('off')

    axes[idx, 2].imshow(diffedit_output)
    axes[idx, 2].set_title('DiffEdit', fontsize=10, color='blue', fontweight='bold')
    axes[idx, 2].axis('off')

plt.suptitle('Side-by-Side Comparison of Sample Images', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'visual_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✓ Visual comparison saved to: {OUTPUT_DIR / 'visual_comparison.png'}")

## 💾 Step 7: Save Results

Let's save all results to a JSON file for future reference.

In [None]:
from datetime import datetime

# Compile all results
final_results = {
    'timestamp': datetime.now().isoformat(),
    'test_configuration': {
        'num_images': len(image_files),
        'device': DEVICE,
        'test_images_dir': str(TEST_IMAGES_DIR)
    },
    'custom_method': {
        'fid': custom_metrics['fid'],
        'lpips': custom_metrics['avg_lpips'],
        'clip': custom_metrics['avg_clip'],
        'iou': custom_metrics['avg_iou'],
        'num_successful': len([r for r in custom_results if r['status'] == 'success'])
    },
    'diffedit_method': {
        'fid': diffedit_metrics['fid'],
        'lpips': diffedit_metrics['avg_lpips'],
        'clip': diffedit_metrics['avg_clip'],
        'iou': diffedit_metrics['avg_iou'],
        'num_successful': len([r for r in diffedit_results if r['status'] == 'success'])
    },
    'per_image_comparison': []
}

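# Add per-image details, matched by image name so that an image missing
# from one method cannot misalign the pairs
diffedit_by_image = {d['image']: d for d in diffedit_metrics['per_image']}
for custom_img in custom_metrics['per_image']:
    diffedit_img = diffedit_by_image.get(custom_img['image'])
    if diffedit_img is None:
        continue  # no DiffEdit metrics for this image
    final_results['per_image_comparison'].append({
        'image': custom_img['image'],
        'custom': custom_img,
        'diffedit': diffedit_img
    })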

# Save to file
results_file = OUTPUT_DIR / 'evaluation_results.json'
with open(results_file, 'w') as f:
    json.dump(final_results, f, indent=2)

print(f"✓ Results saved to: {results_file}")

## 🎯 Summary & Interpretation

### Key Takeaways

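Your custom method outperforms DiffEdit on a given metric when:
- **FID is lower** → Better image quality
- **LPIPS is lower** → Smaller perceptual distance
- **IoU is higher** → More consistent masks (consistency rather than accuracy, since there is no ground truth)
- **CLIP is higher** → Better text-prompt alignment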

### What the IoU Score Actually Means

Since we don't have ground truth masks, the IoU here measures:
1. **For Custom Method**: Consistency between Stage 1 and Stage 3 masks
   - High IoU = the segmentation is stable/consistent
2. **Between Methods**: How similar the masks are
   - This shows if both methods identify similar regions
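
The IoU described above can be sketched in a few lines of NumPy. This is a minimal illustration, not necessarily the function used earlier in this notebook: the name `mask_iou`, the 0.5 binarization threshold, and the convention that two empty masks score 1.0 are all assumptions made for the example.

```python
import numpy as np

def mask_iou(mask_a, mask_b, threshold=0.5):
    """IoU of two same-shaped masks; values above `threshold` count as foreground."""
    a = np.asarray(mask_a) > threshold
    b = np.asarray(mask_b) > threshold
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks are treated as identical
    return float(np.logical_and(a, b).sum() / union)

# Toy masks standing in for the Stage 1 and Stage 3 masks (not real data)
stage1 = np.zeros((4, 4))
stage1[0, 0:4] = 1            # 4 foreground pixels
stage3 = np.zeros((4, 4))
stage3[0:2, 2:4] = 1          # 4 foreground pixels, 2 shared with stage1

print(f"{mask_iou(stage1, stage3):.3f}")  # 2 shared / 6 in union -> 0.333
```

Because neither mask is a ground truth, a high value only says that the two masks agree with each other; both could still be wrong in the same way.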

### Understanding "Good" vs "Bad" Masks

A good mask:
- ✓ Cleanly separates foreground from background
- ✓ Follows object boundaries accurately
- ✓ Leads to better final image quality (reflected in FID/CLIP)
- ✓ Is consistent across the pipeline

A bad mask:
- ✗ Has rough/jagged edges
- ✗ Misses parts of the object or includes too much background
- ✗ Causes artifacts in the final image
- ✗ Is inconsistent between stages

### Why the Custom Method Should Win

Your 4-stage approach has three potential advantages over DiffEdit:
1. **Trained segmentation** (UNet) vs heuristic mask generation
2. **Sequential refinement** (re-segment after editing)
3. **Specialized control** (vehicle and background separately)

### Next Steps

1. Review the visual comparisons above
2. Check the JSON file for detailed per-image metrics
3. Identify failure cases (if any)
4. Fine-tune prompts for better results
5. Use these results in your report/paper
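
Steps 2 and 3 can be partly automated. The sketch below ranks images by how poorly the custom method scored on one metric. The helper `rank_failure_cases` is hypothetical, and it assumes each per-image entry carries the keys `lpips`, `clip`, and `iou` for both methods, as in the DiffEdit per-image records; adjust the key names if your custom records differ. A toy dictionary stands in for the saved JSON file.

```python
def rank_failure_cases(results, metric="lpips", higher_better=False, top_k=3):
    """Return up to top_k (image, custom_value, diffedit_value) tuples, worst custom score first."""
    rows = []
    for entry in results["per_image_comparison"]:
        custom_val = entry["custom"].get(metric)
        diffedit_val = entry["diffedit"].get(metric)
        if custom_val is None or diffedit_val is None:
            continue  # metric was not computed for this image
        rows.append((entry["image"], custom_val, diffedit_val))
    # Worst first: highest value when lower is better, lowest when higher is better
    rows.sort(key=lambda row: row[1], reverse=not higher_better)
    return rows[:top_k]

# Toy data in the same layout as evaluation_results.json (values are made up).
# With the real file, load it first:
#   with open(OUTPUT_DIR / "evaluation_results.json") as f:
#       results = json.load(f)
toy = {"per_image_comparison": [
    {"image": "a.png", "custom": {"lpips": 0.10}, "diffedit": {"lpips": 0.20}},
    {"image": "b.png", "custom": {"lpips": 0.45}, "diffedit": {"lpips": 0.30}},
    {"image": "c.png", "custom": {"lpips": None}, "diffedit": {"lpips": 0.25}},
]}

for image, custom_val, diffedit_val in rank_failure_cases(toy):
    print(f"{image}: custom={custom_val:.2f}, diffedit={diffedit_val:.2f}")
```

For CLIP or IoU, pass `metric="clip"` or `metric="iou"` together with `higher_better=True` so that the lowest scores are listed first.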

## 📝 Recommended Test Set Sizes

Based on your evaluation needs:

| Purpose | Images | Time (GPU) | Why |
|---------|--------|------------|-----|
| **Quick Test** | 5-10 | ~15-30 min | Verify pipeline works |
| **Development** | 10-15 | ~30-45 min | Iterate on prompts |
| **Evaluation** | 20-30 | ~1-2 hours | Reliable statistics |
| **Publication** | 50-100 | ~3-5 hours | Publication-quality results |

### Statistical Reliability

- **< 10 images**: Results may be unreliable, high variance
- **10-20 images**: Moderate confidence, good for class projects
- **20-30 images**: Good confidence, suitable for papers
- **50+ images**: High confidence, publication-quality

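The metrics become more stable with more images, especially FID and IS, which are computed over the whole image set rather than per image. FID in particular is biased upward at small sample sizes, so compare FID values only between runs that use the same number of images.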
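
One way to see how sample size affects reliability is a percentile bootstrap on the per-image scores. The sketch below is illustrative only: `bootstrap_ci` is a hypothetical helper, and the scores are synthetic draws standing in for per-image LPIPS values, not results from this notebook. It applies to per-image metrics (LPIPS, CLIP, IoU), not to FID, which is a single number for the whole set.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-image scores."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample image indices with replacement and record the mean of each resample
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(scores.mean()), float(low), float(high)

# Synthetic scores (assumed mean 0.3, std 0.1) for two test set sizes
rng = np.random.default_rng(42)
for n in (10, 75):
    fake_scores = rng.normal(loc=0.3, scale=0.1, size=n)
    mean, low, high = bootstrap_ci(fake_scores)
    print(f"n={n:3d}  mean={mean:.3f}  95% CI=[{low:.3f}, {high:.3f}]  width={high - low:.3f}")
```

The interval typically narrows as the number of images grows. If the intervals of the two methods overlap heavily, a difference in their averages may not be meaningful at that test set size.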

## ✅ Evaluation Complete!

You've successfully:
- ✓ Run both methods on your test images
- ✓ Computed quantitative metrics (FID, LPIPS, CLIP, IoU)
- ✓ Created visual comparisons
- ✓ Saved all results to JSON

Check the `outputs/` directory for all generated images and the results JSON file.