# Florence-2 VQA Pipeline Test

**Model**: `microsoft/Florence-2-base`  
**Task**: Visual Question Answering (VQA) on fetal ultrasound images  
**Status**: ❌ FAILED - SDPA attribute error at load time, then a `NoneType` error during inference

This notebook documents an attempt to use Microsoft's Florence-2 base model for VQA on fetal ultrasound images from the FADA dataset.

## 1. Setup and Imports

In [None]:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
import pandas as pd
from PIL import Image
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## 2. Load Dataset

In [None]:
# Load Excel annotation file
DATA_DIR = Path(r"C:\Users\elyas\Workspace\PyCharm\fada-v3\data\Fetal Ultrasound Labeled")
IMAGE_DIR = Path(r"C:\Users\elyas\Workspace\PyCharm\fada-v3\data\Fetal Ultrasound\Non_standard_NT")

excel_path = DATA_DIR / "Non_standard_NT_image_list.xlsx"
df = pd.read_excel(excel_path)

print(f"Dataset size: {len(df)} images")
print(f"\nColumns: {df.columns.tolist()}")

## 3. Prepare VQA Data

In [None]:
q_cols = [col for col in df.columns if col.startswith('Q')]
print(f"Questions per image: {len(q_cols)}")

vqa_data = []
for idx, row in df.head(5).iterrows():
    img_path = IMAGE_DIR / row['Image Name']
    if not img_path.exists():
        continue
    
    for q_col in q_cols:
        question = q_col.split('\n', 1)[1] if '\n' in q_col else q_col
        question = question[:100]
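        raw_answer = row[q_col]
        
        # Test for missing values before casting: str(NaN) is 'nan', which pd.notna() treats as present
        answer = str(raw_answer).strip()
        if pd.notna(raw_answer) and answer.lower() not in ['nan', 'none', '']: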
            vqa_data.append({
                'image_path': str(img_path),
                'question': question,
                'answer': answer
            })

print(f"Prepared {len(vqa_data)} VQA examples")
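
The header parsing and answer filtering in the cell above can be pulled into two small helpers so they are easy to check in isolation. This is a minimal sketch, not part of the original pipeline: the names `extract_question` and `is_valid_answer` are hypothetical, the sample headers in the comments are invented for illustration, and the extra `.strip()` is an assumption about what the column headers need.

```python
import math

def extract_question(column_name, max_len=100):
    """Return the question text from a column header such as 'Q1\\nIs the ... ?'."""
    # Headers may carry an ID on the first line and the question on the rest.
    text = column_name.split("\n", 1)[1] if "\n" in column_name else column_name
    # Truncate to the same 100-character limit used in the cell above.
    return text.strip()[:max_len]

def is_valid_answer(value):
    """True when a cell holds a usable answer rather than a missing value."""
    if value is None:
        return False
    # Empty Excel cells arrive from pandas as float NaN.
    if isinstance(value, float) and math.isnan(value):
        return False
    return str(value).strip().lower() not in {"nan", "none", ""}
```

With these helpers the inner loop reduces to `if is_valid_answer(row[q_col]): ...`, and the missing-value check runs on the raw cell rather than on its string form.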

## 4. Test Image Loading

In [None]:
test_img = Image.open(vqa_data[0]['image_path']).convert('RGB')
print(f"Image size: {test_img.size}")
test_img

## 5. Load Florence-2 Model (First Attempt)

**Error Encountered**: `'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'`

In [None]:
model_name = "microsoft/Florence-2-base"

try:
    processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16
    ).to("cuda")
    print("Model loaded successfully!")
except AttributeError as e:
    print(f"❌ ERROR: {e}")
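    print("\nThe model class loaded via trust_remote_code lacks the `_supports_sdpa` attribute,")
    print("which is looked up when selecting SDPA (Scaled Dot-Product Attention).")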

## 6. Load Florence-2 Model (Second Attempt with Eager Attention)

**Workaround**: Pass `attn_implementation="eager"` so that loading does not go through the SDPA code path.

In [None]:
try:
    processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        attn_implementation="eager"  # Workaround for SDPA issue
    ).to("cuda")
    print("Model loaded successfully with eager attention!")
    print(f"Model device: {next(model.parameters()).device}")
    print(f"Model dtype: {next(model.parameters()).dtype}")
except Exception as e:
    print(f"❌ ERROR: {e}")
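
The two loading attempts can be folded into one retry helper. The sketch below is illustrative only: `load_with_attention_fallback` is a hypothetical name, and the loader is passed in as a callable so the retry logic can be exercised without downloading the model.

```python
def load_with_attention_fallback(load_fn, implementations=(None, "eager")):
    """Call load_fn once per attention implementation until one succeeds.

    Returns (model, implementation_used). None means "library default".
    """
    if not implementations:
        raise ValueError("at least one attention implementation is required")
    last_error = None
    for impl in implementations:
        # Only pass the keyword when an explicit implementation is requested.
        kwargs = {} if impl is None else {"attn_implementation": impl}
        try:
            return load_fn(**kwargs), impl
        except AttributeError as err:
            # e.g. "... object has no attribute '_supports_sdpa'"
            last_error = err
    raise last_error

# Intended use in this notebook (not executed here):
# model, impl = load_with_attention_fallback(
#     lambda **kw: AutoModelForCausalLM.from_pretrained(
#         model_name, trust_remote_code=True, torch_dtype=torch.float16, **kw
#     )
# )
```

Catching only `AttributeError` keeps unrelated failures, such as a missing CUDA device or a network error, visible instead of silently retrying them.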

## 7. Test VQA Inference

**Error Encountered**: `'NoneType' object has no attribute 'shape'`

In [None]:
question = "What organ is visible in this ultrasound image?"
prompt = f"<VQA> {question}"

try:
    inputs = processor(text=prompt, images=test_img, return_tensors="pt").to("cuda", torch.float16)
    
    print("Input tensors prepared:")
    for key, value in inputs.items():
        if isinstance(value, torch.Tensor):
            print(f"  {key}: shape={value.shape}, dtype={value.dtype}")
    
    print("\nGenerating response...")
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=100)
    
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"\nGenerated response: {generated_text}")
    
except AttributeError as e:
    print(f"\n❌ ERROR during inference: {e}")
    print("\nA value expected to be a tensor was None somewhere inside model.generate().")
except Exception as e:
    print(f"\n❌ ERROR: {type(e).__name__}: {e}")
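
The cell above prints only the exception message, which does not show where the `None` came from. A small traceback helper can report the deepest frame, which helps tell the vision encoder apart from the text decoder or the generation utilities. This is a sketch using only the standard library: `innermost_frame` is a hypothetical name, and `fake_forward` is a stand-in for the real model call.

```python
import traceback

def innermost_frame(exc):
    """Return (filename, line number, function name) of the deepest traceback frame."""
    frames = traceback.extract_tb(exc.__traceback__)
    if not frames:
        return None
    last = frames[-1]  # frames are ordered from outermost to innermost
    return last.filename, last.lineno, last.name

# Stand-in for the failing call: reads .shape on a value that is None.
def fake_forward(hidden=None):
    return hidden.shape

try:
    fake_forward()
except AttributeError as err:
    location = innermost_frame(err)
    print(f"Failed in {location[2]} ({location[0]}:{location[1]})")
```

In the inference cell, calling `innermost_frame(e)` inside the `except AttributeError` branch would name the function in which the `None` value was accessed.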

## Summary

### Errors Encountered

1. **SDPA Compatibility Issue**:
   - Error: `'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'`
   - Workaround: Use `attn_implementation="eager"`

2. **Inference `NoneType` Error**:
   - Error: `'NoneType' object has no attribute 'shape'`
   - Cause: Not confirmed. A value expected to be a tensor was `None` during `model.generate`; a vision encoder incompatibility is one possible explanation

### Root Cause Analysis

Both errors point either to a mismatch between Florence-2's custom model code and the installed transformers version, or to an initialization step this notebook is missing. Possible causes:

1. **Version Mismatch**: The model may require a specific transformers version
2. **Custom Code Issues**: The `trust_remote_code=True` flag loads custom model code that may have bugs
3. **Preprocessing Requirements**: The model may need specific image preprocessing
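
To check the version-mismatch hypothesis, the installed package versions can be recorded next to the errors. The sketch below uses only the standard library. The package list is an assumption: `timm` and `einops` are included because the Florence-2 remote code is believed to depend on them, and `collect_versions` is a hypothetical helper name.

```python
from importlib import metadata

def collect_versions(packages=("torch", "transformers", "timm", "einops")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in collect_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these values alongside the two error messages makes it possible to compare this environment against one in which the model loads and generates correctly.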

### Recommendations

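1. **Alternative Models**: Consider BLIP-2, LLaVA, or InstructBLIP, which have native transformers implementations and do not rely on `trust_remote_code`
2. **Report Issue**: This appears to be a model compatibility bug; a report should include the PyTorch and transformers versions and the full traceback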

### Conclusion

**Florence-2 is NOT recommended for this project** because its integration with the transformers library proved unstable in this environment. Proceeding with BLIP-2 instead.