# LLaVA-1.5-7B VQA Pipeline Test

**Model**: `llava-hf/llava-1.5-7b-hf`  
**Task**: Visual Question Answering (VQA) on fetal ultrasound images  
**Status**: ❌ FAILED - GPU memory constraints

## Problem

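LLaVA-1.5-7B does not fit in the ~8GB of VRAM on the RTX 4070 laptop GPU, even with 8-bit quantization: the quantized weights alone take ~7GB, which leaves too little room for activations and the CUDA context.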

## Hardware

- GPU: RTX 4070 (Laptop)
- VRAM: ~8GB
- Model size: ~7B parameters (~14GB in FP16, ~7GB in 8-bit)
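
The size figures above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, counting weights only (activations, KV cache, and CUDA context come on top); the function name `weight_memory_gb` is illustrative, and `7e9` is the rounded parameter count from the list above:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Since the 8-bit weights alone come to about 7GB, very little of an ~8GB card is left for anything else.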

## Error

```
ValueError: Some modules are dispatched on the CPU or the disk. 
Make sure you have enough GPU RAM to run the model.
```
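
This error appears to be raised when `device_map="auto"` cannot place every module on the GPU and assigns some of them to `"cpu"` or `"disk"`, which the 8-bit loading path rejects. For a model that does load, `model.hf_device_map` is a dict from module name to device, so offloaded modules can be listed with a small helper. This is a sketch only: `find_offloaded_modules` and the module names in the example map are hypothetical.

```python
def find_offloaded_modules(device_map: dict) -> dict:
    """Return the entries of a device map that are placed on CPU or disk.

    GPU placements appear as integer indices (or strings such as "cuda:0").
    """
    return {name: dev for name, dev in device_map.items() if dev in ("cpu", "disk")}


# Hypothetical device map for illustration; real module names will differ.
example_map = {
    "vision_tower": 0,
    "language_model.layers.0": 0,
    "language_model.layers.31": "cpu",
}
print(find_offloaded_modules(example_map))
```

An empty result means every module was placed on a GPU.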

## 1. Imports

In [None]:
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
import pandas as pd
from PIL import Image
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Load Dataset

In [None]:
# Load Excel annotations
DATA_DIR = Path(r"C:\Users\elyas\Workspace\PyCharm\fada-v3\data\Fetal Ultrasound Labeled")
IMAGE_DIR = Path(r"C:\Users\elyas\Workspace\PyCharm\fada-v3\data\Fetal Ultrasound\Non_standard_NT")

excel_path = DATA_DIR / "Non_standard_NT_image_list.xlsx"
df = pd.read_excel(excel_path)

print(f"Dataset size: {len(df)} images")
print(f"\nColumns: {df.columns.tolist()}")

## 3. Prepare VQA Data

In [None]:
q_cols = [col for col in df.columns if col.startswith('Q')]

vqa_data = []
for idx, row in df.head(5).iterrows():
    img_path = IMAGE_DIR / row['Image Name']
    if not img_path.exists():
        continue
    
    for q_col in q_cols:
        question = q_col.split('\n', 1)[1] if '\n' in q_col else q_col
        question = question[:100]
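        raw_answer = row[q_col]
        # Check for missing values before casting: str(NaN) is the string 'nan',
        # so pd.notna() on the already-cast string would always be True.
        if pd.isna(raw_answer):
            continue
        answer = str(raw_answer).strip()

        if answer.lower() not in ['nan', 'none', '']: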
            vqa_data.append({
                'image_path': str(img_path),
                'question': question,
                'answer': answer
            })

print(f"Prepared {len(vqa_data)} VQA examples")

## 4. Test Image Loading

In [None]:
test_img = Image.open(vqa_data[0]['image_path']).convert('RGB')
print(f"Image size: {test_img.size}")
test_img

## 5. Attempt Model Loading (8-bit Quantization)

**This cell will fail due to insufficient GPU memory.**

In [None]:
model_id = "llava-hf/llava-1.5-7b-hf"

try:
    print("Loading processor...")
    processor = AutoProcessor.from_pretrained(model_id)
    
    print("Loading model with 8-bit quantization...")
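    # Passing load_in_8bit directly is deprecated in recent transformers releases;
    # a BitsAndBytesConfig is the supported way to request 8-bit weights.
    from transformers import BitsAndBytesConfig

    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )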
    
    print("Model loaded successfully!")
    print(f"Device map: {model.hf_device_map}")
    
except Exception as e:
    print(f"\n❌ FAILED: {type(e).__name__}")
    print(f"Error: {str(e)}")
    print("\nLikely reason: the RTX 4070 laptop GPU (~8GB) does not have enough free VRAM for LLaVA-1.5-7B")
    print("Even with 8-bit quantization, the model requires ~7-8GB VRAM")

## 6. Test VQA (If Model Loaded)

**This cell will not run due to model loading failure.**

In [None]:
if 'model' in locals():
    question = "What organ is shown in this ultrasound image?"
    
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": f"Question: {question}"}
            ]
        }
    ]
    
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=test_img, text=prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=100)
    
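    # generate() returns prompt + completion; decode only the newly generated tokens
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    response = processor.decode(new_tokens, skip_special_tokens=True).strip()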
    print(f"Question: {question}")
    print(f"Response: {response}")
else:
    print("Model not loaded - skipping VQA test")

## Summary

### Why LLaVA-1.5-7B Failed

1. **Model Size**: 7 billion parameters
2. **Memory Requirements**:
   - FP16: ~14GB VRAM
   - 8-bit quantization: ~7-8GB VRAM
   - 4-bit quantization: ~4-5GB VRAM (quality degradation)
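3. **Available VRAM**: The RTX 4070 laptop GPU has ~8GB, and the OS, display, and other processes already occupy part of it
4. **Result**: The 8-bit weights (~7-8GB) do not fit in the VRAM that remains, so `device_map="auto"` dispatches some modules to CPU/disk and loading fails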
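
The 4-bit option in the list above was not tried in this notebook. A minimal sketch of how it could be requested through `BitsAndBytesConfig`, assuming `bitsandbytes` is installed. The NF4 settings are common defaults rather than values from this project, the wrapper name `load_llava_4bit` is illustrative, and whether the result fits in ~8GB alongside activations is untested here.

```python
def load_llava_4bit(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    """Load LLaVA with 4-bit NF4 weights (untested on this hardware)."""
    # Imports are kept inside the function so defining it has no side effects.
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        LlavaForConditionalGeneration,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4 rather than plain FP4
        bnb_4bit_compute_dtype=torch.float16,  # matrix multiplications run in FP16
    )
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map={"": 0},  # pin every module to GPU 0 instead of allowing offload
    )
    return processor, model
```

Pinning the device map to GPU 0 means that a model which does not fit should fail with an out-of-memory error rather than being partially offloaded.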

### Alternative Approaches

1. **Smaller VLMs**:
   - BLIP-2 with Flan-T5-base (~1B parameters) ✅ SUCCESS
   - InstructBLIP with smaller backbones

2. **Cloud GPUs**:
   - Google Colab with A100/V100
   - AWS/Azure with larger GPUs

3. **Hybrid Approach (Recommended)**:
   - Phase 1: Classification + template responses
   - Phase 2: Add smaller VLM when dataset grows
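
Phase 1 of the hybrid approach could look roughly like the following: a classifier predicts a label, and the answer is filled into a fixed template. This is a sketch under assumptions: the label names, the confidence threshold, and `template_answer` are all hypothetical, and the classifier itself is not shown.

```python
ANSWER_TEMPLATE = "The image most likely shows the fetal {label}."
FALLBACK_ANSWER = "The structure in this image could not be identified with confidence."


def template_answer(label: str, confidence: float, threshold: float = 0.5) -> str:
    """Turn a classifier prediction into a templated VQA-style answer."""
    if confidence < threshold:
        return FALLBACK_ANSWER
    return ANSWER_TEMPLATE.format(label=label)


# Hypothetical classifier outputs for illustration.
print(template_answer("brain", 0.92))
print(template_answer("abdomen", 0.31))
```

The fallback branch keeps low-confidence predictions from being presented as definite answers.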

### Decision

Proceeding with BLIP-2 as it fits in available GPU memory and provides acceptable VQA performance.