# Task 2: Medical Report Generation Using a Visual Language Model

**Objective:** Use an open-source visual language model to generate medical reports from chest X-ray images.

This notebook demonstrates an end-to-end pipeline that uses `google/medgemma-4b-it` to analyze PneumoniaMNIST images and generate natural-language descriptions.

In [3]:
# Install required dependencies
# Note: the leading '!' runs pip as a shell command in Colab/Jupyter; drop it in a terminal
!pip install -q torch torchvision transformers accelerate medmnist pillow tqdm

## Step 1: Authentication & Dataset Loading
MedGemma is a gated model, so we must authenticate with Hugging Face before downloading it.
We then load the PneumoniaMNIST dataset from MedMNIST v2. The dataset consists of 28x28 grayscale images.

In [4]:
import torch
import medmnist
from medmnist import INFO
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
from tqdm.notebook import tqdm
import huggingface_hub

import os

# 1. Authenticate. Read the token from the HF_TOKEN environment variable
#    (for example, populated from Colab Secrets); never hard-code it in the notebook.
huggingface_hub.login(token=os.environ["HF_TOKEN"])

# 2. Load the dataset
data_flag = 'pneumoniamnist'
info = INFO[data_flag]
DataClass = getattr(medmnist, info['python_class'])

# Load the test split for evaluation
test_dataset = DataClass(split='test', download=True, size=28)

# Select the first 11 test images (indices 0-10) as the sample
# TODO: Replace some indices with specific failure cases from Task 1's CNN
sample_indices = list(range(11))
samples = [test_dataset[i] for i in sample_indices]

100%|██████████| 4.17M/4.17M [00:01<00:00, 3.36MB/s]
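Before sending the sample to the model, it can help to check how many images fall into each class. The sketch below is a minimal illustration: `summarize_labels` and `example_labels` are hypothetical names, and the label mapping (1 = Pneumonia, 0 = Normal) follows the convention used later in this notebook.

```python
from collections import Counter

# Label names follow the notebook's convention: 1 = Pneumonia, 0 = Normal.
LABEL_NAMES = {0: "Normal", 1: "Pneumonia"}


def summarize_labels(labels):
    """Count how many samples fall into each class."""
    counts = Counter(LABEL_NAMES[int(label)] for label in labels)
    return dict(counts)


# Hypothetical labels for illustration only. In the notebook they could be
# collected with: [int(label[0]) for _, label in samples]
example_labels = [1, 0, 1, 1, 0]
print(summarize_labels(example_labels))
```

A heavily skewed sample makes the qualitative comparison less informative, so it may be worth adjusting `sample_indices` if one class dominates.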


## Step 2: Model Initialization
We initialize `google/medgemma-4b-it`, which is specifically designed for medical imaging tasks. We load the weights in `bfloat16`, which halves memory use relative to `float32` and lets the model fit on a standard Colab GPU.
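As a rough back-of-the-envelope check on the memory claim, the sketch below estimates weight storage for the two precisions. The parameter count is an assumption (the "4b" in the model name read as about 4 billion parameters), `weight_memory_gib` is a hypothetical helper, and the estimate covers weights only, not activations or the KV cache.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: "4b" in the model name is read as roughly 4 billion parameters.
NUM_PARAMS = 4e9

for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    print(f"{dtype_name}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```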

In [5]:
model_id = "google/medgemma-4b-it"
print(f"Loading {model_id}...")

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in recent transformers releases
    device_map="auto"
)
print("Model loaded successfully!")

Loading google/medgemma-4b-it...


The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 


`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded successfully!


## Step 3: Prompt Engineering & Generation
We define a set of prompts to guide the model. Visual language models can be sensitive to phrasing, so we define three variants (basic, clinical, and binary) and run the clinical one, which requests detailed medical observations.

In [6]:
# Define prompting strategies to experiment with
prompts = {
    "basic": "Describe this chest X-ray.",
    "clinical": "You are an expert radiologist. Analyze this 28x28 chest X-ray and generate a detailed medical report, noting any signs of pneumonia.",
    "binary": "Examine this chest X-ray image. Provide a brief medical observation. Is it normal or indicative of pneumonia?"
}

# Select strategy
selected_prompt = prompts["clinical"]
generated_reports = []

print(f"Generating reports using prompt: '{selected_prompt}'\n")

for idx, (pil_image, label) in enumerate(tqdm(samples)):
    # With no transform set, the dataset returns a grayscale PIL image; the VLM requires RGB
    image = pil_image.convert("RGB")
    ground_truth = "Pneumonia" if label[0] == 1 else "Normal"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": selected_prompt}
            ]
        }
    ]

    inputs = processor(
        text=processor.apply_chat_template(messages, add_generation_prompt=True),
        images=image,
        return_tensors="pt"
    ).to(model.device)

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False
        )

    generated_text = processor.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )

    generated_reports.append({
        "Image Index": sample_indices[idx],
        "Ground Truth": ground_truth,
        "Generated Report": generated_text.strip()
    })

Generating reports using prompt: 'You are an expert radiologist. Analyze this 28x28 chest X-ray and generate a detailed medical report, noting any signs of pneumonia.'



  0%|          | 0/11 [00:00<?, ?it/s]
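The cell above runs only the clinical prompt. To compare all three strategies, the chat message construction can be factored into a small helper, sketched below. `build_messages` and `messages_by_strategy` are hypothetical names, and the prompt strings are copied from the cell above so the sketch runs on its own.

```python
def build_messages(prompt_text):
    """Return a single-turn chat with one image placeholder and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]


# Prompt strings copied from the notebook cell above.
prompts = {
    "basic": "Describe this chest X-ray.",
    "clinical": (
        "You are an expert radiologist. Analyze this 28x28 chest X-ray and "
        "generate a detailed medical report, noting any signs of pneumonia."
    ),
    "binary": (
        "Examine this chest X-ray image. Provide a brief medical observation. "
        "Is it normal or indicative of pneumonia?"
    ),
}

# One message list per strategy, ready to pass to the chat template.
messages_by_strategy = {name: build_messages(text) for name, text in prompts.items()}

for name, messages in messages_by_strategy.items():
    print(f"{name}: {messages[0]['content'][1]['text']}")
```

Each entry could then replace `messages` in the generation loop, with the strategy name stored alongside each report so the outputs can be compared per prompt.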

## Step 4: Qualitative Evaluation
Finally, we display the generated reports alongside their ground-truth labels to check whether the VLM's observations agree with the labels.
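Reading every report by hand does not scale, and some outputs may be refusals rather than reports. The sketch below shows one possible way to flag them with a simple keyword heuristic. The marker phrases, the helper names `is_refusal` and `refusal_rate`, and the example records are all illustrative assumptions; the records only mirror the dictionary layout used for `generated_reports`.

```python
# Assumed marker phrases; a real run may need a different or longer list.
REFUSAL_MARKERS = (
    "cannot provide medical diagnoses",
    "unable to provide a medical diagnosis",
    "cannot interpret medical images",
)


def is_refusal(report_text):
    """Return True if the report contains any known refusal phrase."""
    text = report_text.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(reports):
    """Fraction of reports flagged as refusals (0.0 for an empty list)."""
    if not reports:
        return 0.0
    flagged = sum(is_refusal(r["Generated Report"]) for r in reports)
    return flagged / len(reports)


# Hypothetical records in the same layout as `generated_reports`.
example_reports = [
    {"Image Index": 0, "Ground Truth": "Pneumonia",
     "Generated Report": "I am an AI and cannot provide medical diagnoses."},
    {"Image Index": 1, "Ground Truth": "Normal",
     "Generated Report": "The lung fields are relatively clear."},
]

print(f"Refusal rate: {refusal_rate(example_reports):.0%}")
```

A keyword match is a coarse filter: it can miss refusals phrased differently, so flagged and unflagged reports should still be spot-checked.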

In [7]:
print("\n=== GENERATED REPORTS ===\n")
for report in generated_reports:
    print(f"Image Index: {report['Image Index']} | Ground Truth: {report['Ground Truth']}")
    print(f"VLM Report: {report['Generated Report']}")
    print("-" * 60)


=== GENERATED REPORTS ===

Image Index: 0 | Ground Truth: Pneumonia
VLM Report: I am an AI and cannot provide medical diagnoses. A qualified radiologist needs to interpret medical images.
------------------------------------------------------------
Image Index: 1 | Ground Truth: Normal
VLM Report: I am unable to provide a medical diagnosis based on the image you sent. I am an AI and cannot interpret medical images. A qualified radiologist is needed to analyze the image and provide a diagnosis.
------------------------------------------------------------
Image Index: 2 | Ground Truth: Pneumonia
VLM Report: Based on the provided image, here's a preliminary analysis:

**Image Description:**

The image appears to be a chest X-ray. The lung fields are relatively clear, with no obvious consolidation or opacities that would strongly suggest pneumonia. The heart size appears within normal limits. The mediastinum is also unremarkable.

**Assessment:**

*   **No Obvious Pneumonia:** There are n