# Synthetic Crash Scene Generation Pipeline — Demo

**End-to-end pipeline:** Crash report text → Structured scene → Depth-conditioned image → Video

This notebook demonstrates the full multi-model pipeline running on a free Google Colab T4 GPU.

### Pipeline Stages:
1. **Parse** — Groq LLM extracts structured scene representation from crash report
2. **Depth** — Depth Anything V2 + programmatic manipulation creates depth conditioning
3. **Image** — ControlNet + SDXL generates depth-conditioned dashcam image
4. **Video** — Stable Video Diffusion (SVD) animates the generated image into a ~3.5-second dashcam video
5. **Evaluate** — CLIP score + YOLO verification measures quality

### VRAM Management:
Each model is loaded, used, and unloaded before the next one starts, so only one large model occupies the T4's ~15GB of usable VRAM at a time.
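
The load, use, unload pattern can be sketched as below. This is a minimal illustration and does not reproduce the project's `VRAMManager`; `run_stage`, `FakeModel`, and `load_fake` are hypothetical names introduced here. The CUDA cache is only cleared when PyTorch and a GPU are present.

```python
import gc
import weakref
from typing import Callable, List, TypeVar

M = TypeVar("M")
R = TypeVar("R")


def _release_cuda_cache() -> None:
    """Return cached CUDA blocks to the driver if PyTorch and a GPU are available."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_stage(load_model: Callable[[], M], work: Callable[[M], R]) -> R:
    """Load one model, run `work` on it, then drop it before returning."""
    model = load_model()
    try:
        return work(model)
    finally:
        # Drop the only reference held here, then collect and clear the cache.
        del model
        gc.collect()
        _release_cuda_cache()


class FakeModel:
    """Stand-in for a large model (hypothetical, for illustration only)."""

    def __call__(self, x: int) -> int:
        return x * 2


tracker: List[weakref.ref] = []


def load_fake() -> FakeModel:
    model = FakeModel()
    tracker.append(weakref.ref(model))  # lets us check the model was freed
    return model


result = run_stage(load_fake, lambda m: m(21))
model_freed = tracker[0]() is None
```

Running each stage inside a function means the model reference goes out of scope on return. The value returned by `work` must not hold on to the model or its tensors, otherwise the memory stays allocated.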

## Setup

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to project
%cd /content/drive/MyDrive/tesla_crash_synth

# Install dependencies
!pip install -q torch torchvision transformers diffusers accelerate
!pip install -q openai pydantic python-dotenv pillow numpy opencv-python scipy
!pip install -q ultralytics  # For YOLO evaluation
!pip install -e . -q

In [None]:
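# Set API key (paste your Groq key here, or leave empty to load it from .env)
import os
GROQ_API_KEY = ""  # <-- paste your key if not using .env
if GROQ_API_KEY:
    # Only set when provided, so an empty string never shadows a key from .env
    os.environ["GROQ_API_KEY"] = GROQ_API_KEY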

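# Verify GPU
import torch
assert torch.cuda.is_available(), "No GPU found: select a T4 runtime in Colab"
print(f"GPU: {torch.cuda.get_device_name(0)}")
# The device-properties attribute is `total_memory` (bytes)
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")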

## Test Scenarios

Four diverse crash scenarios testing different conditions:
- Weather and lighting (rain, night, clear)
- Road type (highway, intersection, residential)
- Incident type (hydroplane, pedestrian, side-impact, rear-end)
- Object types (guardrail, pedestrian, vehicle, truck)

In [None]:
test_scenarios = {
    "wet_highway": "Vehicle traveling 45mph on wet highway in heavy rain, hydroplaned and hit guardrail on the left side",
    "night_pedestrian": "Pedestrian crossed outside crosswalk at night on a residential street, struck by vehicle going 25mph, pedestrian was wearing dark clothing",
    "intersection_crash": "Vehicle ran red light at 35mph at urban intersection, T-bone side impact with cross traffic vehicle going 30mph",
    "rear_end_highway": "Car rear-ended stopped truck at highway on-ramp in clear weather, going approximately 40mph, truck was loaded with cargo",
}

## Stage 1: Parse All Scenarios

The enhanced parser extracts:
- Basic crash fields (speed, weather, incident type)
- **Scene objects with spatial positions** (for depth map manipulation)
- **Temporal description** (for video generation)
- Rich image prompt (for diffusion model)
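
As a rough picture of what the parsed result looks like, here is a sketch of the structure using standard-library dataclasses. The field names are the ones this notebook reads from `scenario` below. The class names, the `scenario_from_json` helper, and the example values are assumptions, and the real parser likely uses pydantic models (the notebook calls `model_dump()` on scene objects).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SceneObject:
    type: str               # e.g. "guardrail", "pedestrian"
    distance_m: float       # distance ahead of the ego vehicle, in meters
    lateral_position: str   # illustrative values: "left", "center", "right"
    action: str             # what the object is doing


@dataclass
class Scenario:
    incident_type: str
    weather: str
    lighting: str
    road_type: str
    road_condition: str
    temporal_description: str
    scene_objects: List[SceneObject] = field(default_factory=list)


def scenario_from_json(text: str) -> Scenario:
    """Build a Scenario from a JSON string such as an LLM reply."""
    data = json.loads(text)
    objects = [SceneObject(**o) for o in data.pop("scene_objects", [])]
    return Scenario(scene_objects=objects, **data)


# Hand-written example reply for illustration, not real model output.
reply = json.dumps({
    "incident_type": "hydroplane",
    "weather": "heavy rain",
    "lighting": "day",
    "road_type": "highway",
    "road_condition": "wet",
    "temporal_description": "Vehicle loses traction and drifts left into the guardrail.",
    "scene_objects": [
        {"type": "guardrail", "distance_m": 15.0,
         "lateral_position": "left", "action": "static"},
    ],
})

scenario = scenario_from_json(reply)
scenario_dict = asdict(scenario)  # plain dicts, similar in spirit to model_dump()
```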

In [None]:
from utils.parser import LLMParser

parser = LLMParser()
parsed_scenarios = {}

for name, report in test_scenarios.items():
    print(f"\n--- Parsing: {name} ---")
    scenario = parser.parse(report)
    parsed_scenarios[name] = scenario
    
    print(f"  Incident: {scenario.incident_type}")
    print(f"  Weather: {scenario.weather}, Lighting: {scenario.lighting}")
    print(f"  Road: {scenario.road_type}, Condition: {scenario.road_condition}")
    print(f"  Scene objects ({len(scenario.scene_objects)}):")
    for obj in scenario.scene_objects:
        print(f"    - {obj.type}: {obj.distance_m}m, lateral={obj.lateral_position}, action={obj.action}")
    print(f"  Temporal: {scenario.temporal_description[:120]}...")

## Stage 2: Depth Map Generation

For each scenario, we:
1. Create a base depth map matching the road type (perspective geometry)
2. **Programmatically manipulate** the depth map using scene object positions

This is NOT just running a model — we're applying geometric reasoning to place objects at correct distances for ControlNet conditioning.
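
The two steps above can be sketched with NumPy as follows. This is a simplified illustration, not the project's `DepthGenerator`: the function names, the linear ground ramp, the 50 m range, and the "near = 1.0" convention are all assumptions made for the example.

```python
import numpy as np


def base_depth(h: int = 64, w: int = 64, horizon: float = 0.4) -> np.ndarray:
    """Road-like depth map: 0.0 (far) at the horizon, 1.0 (near) at the bottom row."""
    rows = np.arange(h, dtype=np.float32)
    horizon_row = int(h * horizon)
    # Linear ramp below the horizon; true perspective would be nonlinear.
    ramp = np.clip((rows - horizon_row) / (h - 1 - horizon_row), 0.0, 1.0)
    return np.repeat(ramp[:, None], w, axis=1)


def place_object(depth: np.ndarray, distance_m: float, lateral: str,
                 max_range_m: float = 50.0, horizon: float = 0.4) -> np.ndarray:
    """Paint a flat rectangle whose depth value and size follow its distance."""
    h, w = depth.shape
    nearness = 1.0 - min(distance_m, max_range_m) / max_range_m  # 1 near, 0 far
    horizon_row = int(h * horizon)
    # The object's base sits on the ground row that has the same nearness.
    foot_row = horizon_row + int(round(nearness * (h - 1 - horizon_row)))
    size = max(2, int(round(nearness * h * 0.5)))  # nearer objects look bigger
    center_x = {"left": w // 4, "center": w // 2, "right": 3 * w // 4}[lateral]
    top = max(0, foot_row - size)
    left = max(0, center_x - size // 2)
    right = min(w, center_x + size // 2 + 1)
    out = depth.copy()  # leave the base map untouched
    out[top:foot_row + 1, left:right] = nearness
    return out


base = base_depth()
manipulated = place_object(base, distance_m=10.0, lateral="left")
```

An upright object has roughly constant depth over its whole silhouette, which is why the rectangle is filled with a single value instead of following the ground ramp.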

In [None]:
from utils.vram_manager import VRAMManager
from utils.depth_generator import DepthGenerator
import matplotlib.pyplot as plt
import os

os.makedirs("outputs", exist_ok=True)
vram = VRAMManager()
depth_gen = DepthGenerator(vram_manager=vram)

depth_maps = {}
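# One column per scenario; squeeze=False keeps axes 2-D for any scenario count
fig, axes = plt.subplots(2, len(parsed_scenarios), figsize=(20, 10), squeeze=False)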

for i, (name, scenario) in enumerate(parsed_scenarios.items()):
    # Base depth (road geometry only)
    base_depth = depth_gen.create_base_depth(scenario.road_type)
    
    # Manipulated depth (with objects placed at correct distances)
    obj_dicts = [obj.model_dump() for obj in scenario.scene_objects]
    manipulated = depth_gen.manipulate_depth(base_depth, obj_dicts) if obj_dicts else base_depth
    
    depth_img = depth_gen.depth_to_pil(manipulated)
    depth_maps[name] = depth_img
    depth_img.save(f"outputs/{name}_depth.png")
    
    # Visualize base vs manipulated
    axes[0, i].imshow(base_depth, cmap='viridis')
    axes[0, i].set_title(f"{name}\n(base depth)")
    axes[0, i].axis('off')
    
    axes[1, i].imshow(manipulated, cmap='viridis')
    axes[1, i].set_title(f"{name}\n(+ {len(obj_dicts)} objects)")
    axes[1, i].axis('off')

plt.suptitle("Depth Maps: Base Geometry (top) vs Object-Manipulated (bottom)", fontsize=14)
plt.tight_layout()
plt.savefig("outputs/depth_comparison.png", dpi=150)
plt.show()
print("Depth maps generated for all scenarios")

## Stage 3: ControlNet Image Generation

Generate depth-conditioned dashcam images using ControlNet + SDXL.

**Key improvement over raw SDXL:** The depth map tells the model WHERE objects should appear in 3D space, producing better spatial accuracy for pedestrians, vehicles, and road elements.

In [None]:
from utils.controlnet_generator import ControlNetGenerator

# Unload depth model before loading ControlNet (VRAM management)
depth_gen.unload_model()

cn_gen = ControlNetGenerator(vram_manager=vram)
generated_images = {}

for name, scenario in parsed_scenarios.items():
    print(f"\nGenerating: {name}")
    image = cn_gen.generate_from_scenario(
        scenario=scenario,
        depth_image=depth_maps[name],
        controlnet_conditioning_scale=0.25,  # lower = less tiling from programmatic depth
    )
    generated_images[name] = image
    image.save(f"outputs/{name}_image.png")
    print(f"  Saved: outputs/{name}_image.png")

vram.snapshot("after_all_images")

In [None]:
# Display all generated images
fig, axes = plt.subplots(1, 4, figsize=(24, 6))
for i, (name, img) in enumerate(generated_images.items()):
    axes[i].imshow(img)
    axes[i].set_title(name.replace('_', ' ').title(), fontsize=12)
    axes[i].axis('off')

plt.suptitle("Generated Dashcam Images (ControlNet + SDXL)", fontsize=14)
plt.tight_layout()
plt.savefig("outputs/all_images.png", dpi=150)
plt.show()

## Stage 3b: Crash-Moment Keyframes

Attempt to generate temporal keyframes: "3 seconds before", "1 second before", and "impact moment".

**Honest expectation:** Pre-crash keyframes should look reasonable. The impact moment will likely be imperfect — this is documented as part of the evaluation.

In [None]:
# Generate keyframes for one scenario (to save time)
demo_scenario = "wet_highway"
print(f"Generating temporal keyframes for: {demo_scenario}")

keyframes = cn_gen.generate_crash_keyframes(
    scenario=parsed_scenarios[demo_scenario],
    depth_image=depth_maps[demo_scenario],
)

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for i, (timepoint, kf_img) in enumerate(keyframes.items()):
    axes[i].imshow(kf_img)
    axes[i].set_title(timepoint.replace('_', ' '), fontsize=12)
    axes[i].axis('off')
    kf_img.save(f"outputs/{demo_scenario}_{timepoint}.png")

plt.suptitle(f"Temporal Keyframes: {demo_scenario.replace('_', ' ').title()}", fontsize=14)
plt.tight_layout()
plt.savefig("outputs/keyframes.png", dpi=150)
plt.show()

## Stage 4: Video Generation (SVD)

Animate a generated dashcam image into a 3.5-second video using Stable Video Diffusion.

**Why SVD instead of Wan2.1:** Loading Wan2.1 exhausts the ~12GB of system RAM on Colab's free tier and crashes the session. SVD is an image-to-video model that needs about 6GB of VRAM, so it is lighter and animates our ControlNet output directly.

**Important:** ControlNet must be fully unloaded first to free memory.
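
The clip length follows directly from the frame count and the playback rate. A small sketch of the arithmetic, using the 25 frames and 7 fps from the cell below (the helper names are hypothetical):

```python
import math


def clip_duration_s(num_frames: int, fps: float) -> float:
    """Length of a clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float) -> int:
    """Smallest frame count that covers the requested duration."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return math.ceil(seconds * fps)


duration = clip_duration_s(25, 7)          # 25 / 7, about 3.57 s
frames_5s = frames_for_duration(5.0, 7)    # frames needed for a 5-second clip
print(f"{duration:.2f} s, {frames_5s} frames for 5 s")
```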

In [None]:
import gc
import torch
from utils.video_generator import VideoGenerator

# Fully unload ControlNet to free both VRAM and system RAM
cn_gen.unload_model()
del cn_gen
gc.collect()
torch.cuda.empty_cache()

print(f"VRAM after cleanup: {torch.cuda.memory_allocated()/1024**3:.1f} GB")

# Load SVD (image-to-video, ~6GB)
video_gen = VideoGenerator(vram_manager=vram)

# Animate one generated image (hardcoded here to the wet-highway scenario)
demo_name = "wet_highway"
print(f"Generating video for: {demo_name}")

frames = video_gen.generate(
    image=generated_images[demo_name],
    num_frames=25,  # 25 frames = ~3.5s at 7fps
)
video_gen.export_video(frames, f"outputs/{demo_name}_video.mp4")

video_gen.unload_model()
print("Video generation complete")

## Stage 5: Evaluation

Quantitative quality assessment:
- **CLIP score** — How well does the generated image match the text prompt?
- **YOLO verification** — Did the model actually render the objects we requested?
- **Quality grade** — Human-readable assessment based on metrics
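
The object-verification step can be pictured as a set comparison between requested scene objects and detector labels. The sketch below is an assumption about how such a check might work, not the code in `ScenarioEvaluator`: the mapping table and `verify_objects` are hypothetical, and only the output keys `found`, `missing`, and `detection_rate` mirror the report printed below. The label strings are COCO class names, which YOLO models trained on COCO emit.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical mapping from scene-object types to COCO class names.
# None marks types that a COCO-trained detector has no class for.
SCENE_TO_COCO: Dict[str, Optional[str]] = {
    "pedestrian": "person",
    "vehicle": "car",
    "truck": "truck",
    "guardrail": None,
}


def verify_objects(requested: Iterable[str], detected: Iterable[str]) -> Dict[str, object]:
    """Compare requested scene objects against detector labels."""
    detected_set = set(detected)
    found: List[str] = []
    missing: List[str] = []
    unverifiable: List[str] = []
    for obj_type in requested:
        coco_name = SCENE_TO_COCO.get(obj_type)
        if coco_name is None:
            unverifiable.append(obj_type)   # cannot be checked either way
        elif coco_name in detected_set:
            found.append(obj_type)
        else:
            missing.append(obj_type)
    checked = len(found) + len(missing)
    return {
        "found": found,
        "missing": missing,
        "unverifiable": unverifiable,
        # Rate over checkable objects only; None when nothing can be checked.
        "detection_rate": len(found) / checked if checked else None,
    }


result = verify_objects(
    requested=["pedestrian", "vehicle", "guardrail"],
    detected=["car", "traffic light"],
)
```

Keeping unverifiable types out of the denominator avoids penalizing an image for objects the detector could never confirm, such as guardrails.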

In [None]:
from utils.evaluator import ScenarioEvaluator
import json

evaluator = ScenarioEvaluator()

eval_results = {}
for name, scenario in parsed_scenarios.items():
    print(f"\nEvaluating: {name}")
    report = evaluator.evaluate_scenario(
        image=generated_images[name],
        scenario=scenario,
        label="controlnet_depth",
    )
    eval_results[name] = report
    
    print(f"  CLIP score: {report['clip_score']}")
    print(f"  Objects found: {report['object_verification']['found']}")
    print(f"  Objects missing: {report['object_verification']['missing']}")
    print(f"  Detection rate: {report['object_verification']['detection_rate']}")
    print(f"  Quality: {report['quality_assessment']}")

# Save all evaluations
with open("outputs/evaluation_report.json", "w") as f:
    json.dump(eval_results, f, indent=2, default=str)
print("\nFull evaluation saved to outputs/evaluation_report.json")

## VRAM Usage Report

Profiling GPU memory across all pipeline stages — demonstrates VRAM-aware orchestration.
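
A per-stage profiler of this kind can be sketched as below. This is not the project's `VRAMManager`; `StageProfiler` and its reader function are hypothetical, and only the report shape (`stages`, `stage`, `peak_mb`) is taken from the plotting code in the next cell. The default reader uses `torch.cuda.max_memory_allocated()` and resets the peak counter after each reading.

```python
from typing import Callable, Dict, List


def _torch_peak_mb() -> float:
    """Peak allocated CUDA memory in MB since the last reset (0.0 without a GPU)."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    peak = torch.cuda.max_memory_allocated() / 1024**2
    torch.cuda.reset_peak_memory_stats()  # next snapshot measures only its own stage
    return peak


class StageProfiler:
    """Records one peak-memory reading per named pipeline stage."""

    def __init__(self, read_peak_mb: Callable[[], float] = _torch_peak_mb):
        self._read = read_peak_mb
        self._stages: List[Dict[str, object]] = []

    def snapshot(self, stage: str) -> None:
        self._stages.append({"stage": stage, "peak_mb": round(self._read(), 1)})

    def report(self) -> Dict[str, object]:
        return {"stages": list(self._stages)}


# Usage with a fake reader so the example runs without a GPU.
fake_readings = iter([5120.0, 9800.4])
profiler = StageProfiler(read_peak_mb=lambda: next(fake_readings))
profiler.snapshot("after_depth")
profiler.snapshot("after_all_images")
profile = profiler.report()
```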

In [None]:
vram.print_report()

# Visualize VRAM usage over time
report = vram.report()
stages = [s["stage"] for s in report["stages"]]
peaks = [s["peak_mb"] for s in report["stages"]]

plt.figure(figsize=(14, 5))
plt.bar(range(len(stages)), peaks, color='steelblue')
plt.axhline(y=15360, color='red', linestyle='--', label='T4 VRAM limit (15GB)')
plt.xticks(range(len(stages)), stages, rotation=45, ha='right', fontsize=8)
plt.ylabel('Peak VRAM (MB)')
plt.title('GPU Memory Usage Across Pipeline Stages')
plt.legend()
plt.tight_layout()
plt.savefig("outputs/vram_usage.png", dpi=150)
plt.show()

## Summary & Failure Analysis

### What Worked
- Depth conditioning improves spatial accuracy over raw SDXL
- Scene object parsing from crash reports → structured depth manipulation
- Video generation produces temporally consistent dashcam footage
- VRAM-aware sequential loading keeps pipeline within T4 limits

### Known Limitations
- **Crash moment quality**: Diffusion models can't reliably render physical collision dynamics
- **Small objects**: Pedestrians at distance remain hard to render accurately
- **Temporal consistency**: SVD is conditioned on the still image only, with no text prompt, so later frames can drift and the motion is not steered toward the crash described in the report
- **Object detection gap**: YOLO can't verify all object types (guardrails, debris)

### ML Concepts Demonstrated
- Diffusion conditioning theory (depth maps as geometric priors)
- VRAM-aware multi-model pipeline orchestration
- LLM-structured scene understanding (CrashAgent-inspired)
- Programmatic depth manipulation (geometric reasoning)
- Video diffusion (Stable Video Diffusion, image-to-video)
- Automated evaluation methodology (CLIP, YOLO)
- Honest failure analysis with quantitative metrics

In [None]:
# Final summary table
print("\n" + "="*70)
print("FINAL EVALUATION SUMMARY")
print("="*70)
print(f"{'Scenario':<25} {'CLIP':>8} {'Det. Rate':>10} {'Quality':>30}")
print("-"*70)
for name, report in eval_results.items():
    clip = report['clip_score']
    det = report['object_verification']['detection_rate']
    qual = report['quality_assessment'].split(' — ')[0]
    print(f"{name:<25} {clip:>8.4f} {det:>10.2f} {qual:>30}")
print("="*70)