# Notebook 01: VLM Feasibility Study for Causal Reasoning

**Author:** Antonio Guillen-Perez
**Date:** November 12, 2025

## 1. Objective

The primary goal of this study is to validate the core hypothesis of the "Causal Scrutinizer" project: **Can a powerful, open-source Vision-Language Model (VLM), running on local hardware (a single RTX 3090), perform high-quality causal reasoning on complex driving scenarios?**

This notebook documents the final, successful experiment in our feasibility study. It shows that a VLM can follow and reason about a *sequence of events* when given clean, schematic visual input. This result is the green light for the main engineering phase of the project: building a full-scale Scenario Renderer for the Waymo Open Motion Dataset.

## 2. Setup: Loading the GGUF Model with `llama-cpp-python`

Initial experiments with `transformers` and `bitsandbytes` showed numerical instability when quantizing large models such as Gemma 3 on the fly, so we switched to a more stable inference toolchain.

Our final, stable stack consists of:
- **Model:** `google/gemma-3-12b-it-qat-q4_0-gguf` - The official, quantization-aware trained 4-bit version of Gemma 3 12B.
- **Vision Projector:** `mmproj-model-f16-12B.gguf` - The corresponding multimodal projector required to connect the vision encoder to the language model.
- **Inference Backend:** `llama-cpp-python` with full CUDA support, which is highly optimized for running GGUF models on local GPUs.
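
If the two GGUF files are not yet on disk, they can be fetched from the Hub before loading. The sketch below is one way to do that, assuming the `huggingface_hub` package is installed and that the files in the repository carry the same names used in this notebook. The helpers `missing_files` and `fetch_missing` are hypothetical names introduced for illustration, and the repository may require accepting the model license and authenticating first.

```python
import os

# Repo id and filenames are taken from this notebook; MODEL_DIR mirrors the loading cell.
REPO_ID = "google/gemma-3-12b-it-qat-q4_0-gguf"
FILENAMES = ["gemma-3-12b-it-q4_0.gguf", "mmproj-model-f16-12B.gguf"]
MODEL_DIR = "../models/"


def missing_files(model_dir, filenames):
    """Return the filenames that are not present in model_dir."""
    return [f for f in filenames if not os.path.isfile(os.path.join(model_dir, f))]


def fetch_missing(model_dir=MODEL_DIR, filenames=FILENAMES, repo_id=REPO_ID):
    """Download only the files that are missing; return the downloaded paths."""
    names = missing_files(model_dir, filenames)
    if not names:
        return []
    # Imported lazily so the check above works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=model_dir)
        for name in names
    ]
```

Calling `fetch_missing()` once before the loading cell should leave both files in `../models/`; it does nothing when they are already there.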

The following code cell initializes this entire pipeline.

In [None]:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
from PIL import Image
from IPython.display import display, Image as IPImage
import base64
from io import BytesIO
import os

# --- 1. SETUP THE MODEL --- 
# Define paths relative to the project root directory
MODEL_DIR = "../models/"
LLM_FILENAME = "gemma-3-12b-it-q4_0.gguf"
PROJECTOR_FILENAME = "mmproj-model-f16-12B.gguf"

llm_path = os.path.join(MODEL_DIR, LLM_FILENAME)
projector_path = os.path.join(MODEL_DIR, PROJECTOR_FILENAME)

# Fail fast if either file is missing, so later cells never run without `llm`
missing = [p for p in (llm_path, projector_path) if not os.path.exists(p)]
if missing:
    raise FileNotFoundError(
        f"Model or projector file not found: {missing}. "
        "Please download the GGUF files from the Hugging Face Hub."
    )

print("--- Initializing Chat Handler with Vision Projector... ---")
# Note: Llama.from_pretrained will download from HF if repo_id is used.
# We are loading from local paths directly.
chat_handler = Llava15ChatHandler(clip_model_path=projector_path)

print("--- Chat Handler loaded. Initializing GGUF model... ---")
llm = Llama(
    model_path=llm_path,
    chat_handler=chat_handler,
    n_gpu_layers=-1,  # Offload all layers to GPU
    n_ctx=8096,       # Context window size in tokens
    verbose=False
)
print("--- Multimodal model loaded successfully. ---")
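
The context window has to hold the image embeddings, the text prompt, and the generated answer together. The arithmetic below is a rough budget check for the request made later in this notebook (four keyframes, `max_tokens=700`, `n_ctx=8096`). The value of 256 tokens per image and the 200-token prompt estimate are assumptions: the number of embeddings this projector and chat handler actually produce was not measured here. `context_budget` is a hypothetical helper.

```python
def context_budget(n_images, tokens_per_image, prompt_tokens, max_new_tokens, n_ctx):
    """Return how many context tokens remain after images, prompt, and generation."""
    used = n_images * tokens_per_image + prompt_tokens + max_new_tokens
    return n_ctx - used


remaining = context_budget(
    n_images=4,            # four keyframes, as in this notebook
    tokens_per_image=256,  # ASSUMPTION: not measured for this projector
    prompt_tokens=200,     # ASSUMPTION: rough estimate for system + user text
    max_new_tokens=700,    # matches max_tokens in the inference cell
    n_ctx=8096,            # matches the Llama(...) call above
)
print(remaining)  # 6172
```

A negative result would mean the request cannot fit and either `n_ctx` must grow or fewer frames must be sent.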

## 3. The Core Experiment: Causal Analysis of a Dynamic Scenario

This experiment tests the VLM's ability to analyze a sequence of events. We provide it with four keyframes from a simple, programmatically generated GIF of a collision scenario.

The keyframes are selected to tell a story:
1.  **Frame 1:** The initial approach of both vehicles.
2.  **Frame 2:** The moment of imminent risk, just before impact.
3.  **Frame 3:** The moment of impact.
4.  **Frame 4:** The final state after the collision.
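
The notebook selects these frames with fixed indices, which only works if the GIF is long enough. As a minimal sketch, a guard like the one below keeps the requested indices inside the valid range and drops duplicates that clamping may create. `safe_keyframe_indices` is a hypothetical helper, not part of the original pipeline.

```python
def safe_keyframe_indices(requested, n_frames):
    """Clamp indices to [0, n_frames - 1] and drop duplicates, preserving order."""
    if n_frames < 1:
        raise ValueError("GIF must contain at least one frame")
    seen = set()
    result = []
    for i in requested:
        clamped = max(0, min(i, n_frames - 1))
        if clamped not in seen:
            seen.add(clamped)
            result.append(clamped)
    return result


# A 30-frame GIF leaves the storytelling indices untouched.
print(safe_keyframe_indices([5, 10, 15, 29], 30))  # [5, 10, 15, 29]
# A 12-frame GIF clamps index 15 to 11, which then duplicates the final frame.
print(safe_keyframe_indices([5, 10, 15, 11], 12))  # [5, 10, 11]
```

When fewer than four frames come back, the GIF is too short to tell the full four-step story and the scenario should probably be regenerated.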

The prompt is designed to explicitly ask the model to move beyond simple description and perform a causal analysis.

In [None]:
# --- Display the GIF for context ---
GIF_PATH = "archive/assets/collision_scenario_v2.gif"
print(f"Displaying the input scenario from: {GIF_PATH}")
display(IPImage(filename=GIF_PATH))

In [None]:
# --- 2. PREPARE MULTI-IMAGE INPUTS ---

# Helper function to convert PIL Image to a Data URI
def pil_image_to_data_uri(image: Image.Image) -> str:
    """Converts a PIL Image object to a base64-encoded data URI."""
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode('utf-8')
    return f"data:image/jpeg;base64,{img_str}"

# Load the GIF and select keyframes
print(f"--- Loading GIF and selecting keyframes from '{GIF_PATH}'... ---")
gif_image = Image.open(GIF_PATH)

# Our "storytelling" keyframe selection strategy
key_frame_indices = [5, 10, 15, gif_image.n_frames - 1] # Early approach -> Imminent risk -> Impact -> Final state
print(f"Selected keyframe indices: {key_frame_indices}")

key_frames = []
for i in key_frame_indices:
    gif_image.seek(i)
    key_frames.append(gif_image.convert("RGB").copy())

# Construct the structured prompt
system_prompt = "You are a world-class expert in autonomous vehicle safety and causal reasoning. Your task is to analyze a sequence of visual frames and provide a clear, concise, and logical explanation of the events."
user_prompt_text = (
    "The following images are sequential keyframes from a simplified, top-down traffic scenario. "
    "The red and blue shapes are vehicles.\n\n"
    "1. **Describe the sequence of events** from the first frame to the last.\n"
    "2. **Identify the primary causal event** that defines this scenario and explain your reasoning.\n"
    "3. Based on the vehicle dynamics, what was the most critical safety failure?"
)

# Build the content list for the user message
user_content = []
for frame in key_frames:
    user_content.append({"type": "image_url", "image_url": {"url": pil_image_to_data_uri(frame)}})
user_content.append({"type": "text", "text": user_prompt_text})

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content}
]

# --- 3. RUN INFERENCE ---
print("\n--- Generating response for multi-image input... ---")
response = llm.create_chat_completion(
    messages=messages,
    max_tokens=700,      # Allow for a detailed response
    temperature=0.1      # Low temperature: focused, near-repeatable output (not strictly deterministic)
)

assistant_response = response['choices'][0]['message']['content']

print("\n--- MODEL OUTPUT (Gemma 3 12B - Multi-Frame Causal Analysis) ---")
print(assistant_response)
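
Because the analysis in the next section depends on the exact text the model produced, it can help to store each run on disk. The sketch below writes the prompt, sampling parameters, and response to a JSON file, replacing the base64 image payloads with a placeholder so the record stays small. `redact_images` and `save_run` are hypothetical helpers, and the file name in the usage note is an assumption.

```python
import json
from datetime import datetime, timezone


def redact_images(messages):
    """Copy chat messages, replacing inline image data URIs with a placeholder."""
    redacted = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append({"type": "image_url", "image_url": {"url": "<image omitted>"}})
                else:
                    parts.append(dict(part))
            content = parts
        redacted.append({"role": message["role"], "content": content})
    return redacted


def save_run(path, messages, response_text, params):
    """Write one inference run to a JSON file and return the stored record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "messages": redact_images(messages),
        "response": response_text,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)
    return record
```

With the variables from the cell above, a call could look like `save_run("run_01.json", messages, assistant_response, {"max_tokens": 700, "temperature": 0.1})`.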

## 4. Analysis and Conclusion

The model's output on this scenario is a clear success and supports our core hypothesis.

**Key Successes:**
1.  **Temporal Understanding:** The model correctly narrated the sequence of events, identifying the approach, entry into the intersection, and the final collision.
2.  **Causal Inference:** It successfully identified the **"failure of one or both vehicles to yield the right-of-way"** as the *primary causal event*, correctly distinguishing the cause from the effect (the collision itself).
3.  **Application of Latent Knowledge:** The model applied prior knowledge of traffic rules ("right-of-way") to the abstract scene, which suggests that what it learned about real traffic carries over to schematic inputs.
4.  **Expert Framework:** It correctly broke down the critical safety failure into the standard AV stack components (Perception, Prediction, Decision-Making), adopting the persona of a true safety expert.
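
The checks above were made by reading the output. As a rough, repeatable complement, a keyword screen can flag which of these themes a response fails to mention. This is only a sketch: `find_missing_concepts` is a hypothetical helper, the keyword lists are assumptions drawn from the terms used in this section, and a keyword match does not show that the reasoning is correct.

```python
# ASSUMPTION: keyword lists chosen from the terms in the analysis above.
EXPECTED_CONCEPTS = {
    "temporal": ["approach", "sequence"],
    "causal": ["right-of-way", "yield"],
    "av_stack": ["perception", "prediction", "decision"],
}


def find_missing_concepts(response_text, concepts=EXPECTED_CONCEPTS):
    """Return the concept names for which no keyword appears in the response."""
    text = response_text.lower()
    return [
        name
        for name, keywords in concepts.items()
        if not any(keyword in text for keyword in keywords)
    ]
```

Running `find_missing_concepts(assistant_response)` after inference returns an empty list when every theme is mentioned at least once.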

**Conclusion:** The feasibility study is complete and successful. We have shown that a locally run Gemma 3 12B model can perform sophisticated, multi-frame causal reasoning on schematic visual inputs. This gives us an evidence-based foundation to proceed with the primary engineering task of building the full `ScenarioRenderer` for the Waymo dataset.