# Multimodal Medical Classification: OmniMedVQA Data Exploration

This notebook performs **data exploration** on the OmniMedVQA dataset: loading the data, ensuring schema consistency, and analyzing the distribution of question types and modalities.

## Environment Setup

### Import Libraries

In [None]:
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter
import json
import os

### Initialize Paths

In [None]:
from src.config import QA_DIR

## Visualize the Schema

### Inspect Schema Consistency

In [None]:
# Get list of all JSON files
json_files = [os.path.join(QA_DIR, f) for f in os.listdir(QA_DIR) if f.endswith(".json")]

# Store schema (column names) of each file
schema_dict = {}

# Load JSON file and store its columns
for f in json_files:
    try:
        df = pd.read_json(f)
        schema_dict[os.path.basename(f)] = set(df.columns)
    except Exception as e:
        print(f"Error reading {f}: {e}")

# Display schemas for each file
for fname, cols in schema_dict.items():
    print(f"\n{fname} ({len(cols)} columns):")
    print(sorted(cols))

### Compare to Reference Schema

In [None]:
# Count frequency of each schema across all JSON files
schemas = [tuple(sorted(cols)) for cols in schema_dict.values()]

# Find most common schema (reference schema) used in the dataset
reference_schema, _ = Counter(schemas).most_common(1)[0]
print("\nReference schema:", reference_schema)

# Compare each file's schema to the reference schema
for fname, cols in schema_dict.items():
    extra = cols - set(reference_schema)
    missing = set(reference_schema) - cols
    if extra or missing:
        print(f"\nWARNING: {fname}")
        if extra:
            print("  Extra columns:", extra)
        if missing:
            print("  Missing columns:", missing)

#### Fix Schema Issues

In [None]:
# Loop through all JSON files to fix schema issues
for f in json_files:
    with open(f, "r", encoding="utf-8") as file:
        data = json.load(file)
    
    # Track if any modifications were made
    modified = False

    # Fix "modality" to "modality_type"
    for entry in data:
        if "modality" in entry:
            entry["modality_type"] = entry.pop("modality")
            modified = True
    
    # Save corrected JSON back to file
    if modified:
        print(f"Fixed schema in {os.path.basename(f)}")
        with open(f, "w", encoding="utf-8") as file:
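            # ensure_ascii=False keeps any non-ASCII text readable instead of \u-escaping it
            json.dump(data, file, indent=2, ensure_ascii=False)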

#### Note on Schema Inconsistency

While inspecting the JSON files, we found that `Chest CT Scan.json` contains a single entry using the key `modality` instead of `modality_type`.

The cell above renames `modality` to `modality_type` in that entry and rewrites the file in place, so that every file matches the reference schema.
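
The same rename can be written as a small pure function over in-memory records, which makes it easy to test before any file is overwritten. This is only a minimal sketch: `rename_key` is a hypothetical helper, not part of the notebook's code, and the sample records are made up.

```python
def rename_key(records, old="modality", new="modality_type"):
    """Return (fixed_records, n_renamed) without mutating the input list."""
    fixed, n_renamed = [], 0
    for entry in records:
        entry = dict(entry)  # shallow copy so the caller's data stays untouched
        if old in entry:
            value = entry.pop(old)
            # If both keys are present, keep the existing `new` value
            entry.setdefault(new, value)
            n_renamed += 1
        fixed.append(entry)
    return fixed, n_renamed


# Made-up records that mimic the inconsistency described above
sample = [
    {"question": "q1", "modality": "CT"},
    {"question": "q2", "modality_type": "MR"},
]
fixed, n = rename_key(sample)
print(n, fixed[0])
```

Returning the number of renamed entries gives the same signal as the `modified` flag in the cell above, so the file only needs rewriting when the count is non-zero.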

## Loading the Unified Dataset

In [None]:
# Point at local JSON files
dataset = load_dataset("json", data_files=json_files, split="train")
df: pd.DataFrame = dataset.to_pandas()  # type: ignore

### Sanity Check

In [None]:
print(df.iloc[0])

## Dataset Overview

### Count Samples

In [None]:
total_qa = len(df)
print(f"Number of QA items: {total_qa}")

In [None]:
unique_images = df['image_path'].nunique()
print(f"Number of unique images: {unique_images}")

In [None]:
num_datasets = df['dataset'].nunique()
print(f"Number of datasets represented: {num_datasets}")

## Class Distribution

### Question Types

In [None]:
# Count number of QA items per question type
question_type_counts = df['question_type'].value_counts()
print("QA items per question type:\n", question_type_counts)

# Visualize with a bar plot
plt.figure(figsize=(10,5))
sns.barplot(
    x=question_type_counts.index,
    y=question_type_counts.values,
)
plt.xticks(rotation=45)
plt.ylabel("Number of QA items")
plt.xlabel("Question Type")
plt.title("Distribution of Question Types in OmniMedVQA")
plt.tight_layout()
plt.show()

#### Note on Question Type Distribution

The dataset is heavily dominated by Disease Diagnosis (55,387 items), followed by Anatomy Identification (16,448) and Modality Recognition (11,565).

Less common types such as Other Biological Attributes (3,498) and Lesion Grading (2,098) may need special attention during modeling (for example, class weighting or stratified evaluation) so that the model does not underperform on them.
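
One common way to give rare question types more influence is inverse-frequency class weighting. The sketch below applies the "balanced" heuristic, `total / (n_classes * count)`, to the counts quoted in this note. It assumes question type is the stratum being balanced, which may not match the final modeling setup.

```python
# Counts copied from the note above
counts = {
    "Disease Diagnosis": 55387,
    "Anatomy Identification": 16448,
    "Modality Recognition": 11565,
    "Other Biological Attributes": 3498,
    "Lesion Grading": 2098,
}

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
weights = {name: total / (n_classes * c) for name, c in counts.items()}

# Print from the smallest weight (most frequent type) to the largest
for name, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {w:.2f}")
```

With these weights every question type contributes the same total weight, so the weighted sample count still equals the original total.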

### Ground Truth Answers per Question Type

In [None]:
for qtype in df['question_type'].unique():
    answers = df[df['question_type'] == qtype]['gt_answer'].value_counts()
    top_answers = answers.head(10)
    bottom_answers = answers.tail(10)
    print(f"\nTop 10 answers for question type: {qtype}")
    print(top_answers)
    print(f"\nBottom 10 answers for question type: {qtype}")
    print(bottom_answers)

In [None]:
# Inspect ground-truth answers that are longer than three words
long_ans = df[df["gt_answer"].str.split().str.len() > 3]
print(long_ans.head())

#### Note on Answer Variability and Long-Tail Effects

Some question types are heavily skewed toward a few frequent answers:

- Disease Diagnosis: `No` and `No, It's normal.` account for ~7,400 QA items.
- Modality Recognition: `MRI` and `CT` dominate.

Some question types also contain answers that appear very rarely (sometimes only once). For example:

- Modality Recognition: `Histopathology.` appears 8 times.
- Disease Diagnosis: `Fundus neoplasm` appears once.

This sparsity could make supervised learning on rare classes challenging and may require targeted strategies like oversampling or class weighting.

Some semantically identical answers differ in punctuation, capitalization, or minor wording, e.g.:

- `x_ray.` vs `X-ray`
- `Dermoscopic imaging` vs `Dermoscopy` vs `Dermoscopy.`
- `Fundus photography` vs `fundus photography.` vs `fundus photography`

Preprocessing steps such as lowercasing, stripping punctuation, and mapping variants to canonical forms may be beneficial.

Despite the long tail and the answer variability, all major modalities and question types are represented, which is promising for building a generalizable multimodal model.
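
The normalization steps mentioned above (lowercasing, stripping punctuation, mapping variants to canonical forms) could look roughly like the sketch below. Both `normalize_answer` and the `CANONICAL` map are hypothetical. The map covers only the variants listed in this note and would need extending after a fuller inspection of `gt_answer`.

```python
import string

# Hypothetical variant-to-canonical map; extend after inspecting the data
CANONICAL = {
    "x ray": "x-ray",
    "xray": "x-ray",
    "dermoscopic imaging": "dermoscopy",
}


def normalize_answer(answer: str) -> str:
    """Lowercase, unify separators, trim edge punctuation, map known variants."""
    text = answer.strip().lower()
    # Treat underscores and hyphens as spaces (this also affects other hyphenated answers)
    text = text.replace("_", " ").replace("-", " ")
    # Remove punctuation only at the edges, so "it's" keeps its apostrophe
    text = text.strip(string.punctuation + " ")
    # Collapse repeated whitespace
    text = " ".join(text.split())
    return CANONICAL.get(text, text)


for raw in ["x_ray.", "X-ray", "Dermoscopic imaging", "Dermoscopy.", "fundus photography."]:
    print(f"{raw!r} -> {normalize_answer(raw)!r}")
```

Applying the function to a copy of the column (for example, `df["gt_answer"].map(normalize_answer)`) and comparing `nunique()` before and after would show how many variants were merged.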

### Dataset-Level Distribution

In [None]:
dataset_counts = df['dataset'].value_counts()
print("\nNumber of QA items per dataset:\n", dataset_counts)

print("\nTop 5 most represented datasets:")
print(dataset_counts.head())

print("\nBottom 5 least represented datasets:")
print(dataset_counts.tail())

# Identify top 5 and bottom 5 datasets
top5 = dataset_counts.head(5).index
bottom5 = dataset_counts.tail(5).index

# Assign colors
colors = []
for ds in dataset_counts.index:
    if ds in top5:
        colors.append("green")
    elif ds in bottom5:
        colors.append("red")
    else:
        colors.append("lightgray")

# Plot
plt.figure(figsize=(14,6))
sns.barplot(
    x=dataset_counts.index,
    y=dataset_counts.values,
    palette=colors,
    hue=dataset_counts.index,
    legend=False
)
plt.yscale("log")
plt.xticks(rotation=90)
plt.ylabel("Number of QA items")
plt.xlabel("Dataset")
plt.title("QA Items per Dataset in OmniMedVQA")
plt.tight_layout()
plt.show()

#### Note on Dataset Imbalance

RadImageNet alone contributes 56,697 QA items (>60% of the total), while several datasets at the bottom (e.g., Pulmonary Chest MC with 38 items) are very small. This dataset-level imbalance is not necessarily a problem, as long as all modalities are adequately represented.
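
One way to probe whether modalities are adequately represented is to count how many source datasets back each modality. A modality that comes from a single dataset may be harder to generalize from. The sketch below uses made-up records with the notebook's `dataset` and `modality_type` column names. It illustrates the check only and says nothing about OmniMedVQA itself.

```python
from collections import defaultdict

# Toy records; dataset names and modality assignments are invented
records = [
    {"dataset": "DS_A", "modality_type": "MR"},
    {"dataset": "DS_A", "modality_type": "CT"},
    {"dataset": "DS_B", "modality_type": "CT"},
    {"dataset": "DS_C", "modality_type": "OCT"},
]

# Collect the set of source datasets seen for each modality
datasets_per_modality = defaultdict(set)
for row in records:
    datasets_per_modality[row["modality_type"]].add(row["dataset"])

# Modalities that rely on exactly one source dataset
single_source = sorted(
    m for m, ds in datasets_per_modality.items() if len(ds) == 1
)
print(single_source)
```

On the real data, the same loop could run over `df[["dataset", "modality_type"]].to_dict("records")`.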

## Modalities

### Modality Counts

In [None]:
modality_counts = df['modality_type'].value_counts()
print("Number of unique modalities:", df['modality_type'].nunique())
print("\nNumber of QA items per modality:\n", modality_counts)

plt.figure(figsize=(8,8))
plt.pie(list(modality_counts.values), labels=list(modality_counts.index), autopct="%1.1f%%", startangle=140)
plt.title("Distribution of Modalities in OmniMedVQA")
plt.tight_layout()
plt.show()

#### Note on Modality Distribution

The OmniMedVQA dataset includes 8 distinct modalities. While MR (Magnetic Resonance Imaging) dominates with ~35.8% of QA items, followed by CT (~17.8%) and Ultrasound (~12.3%), the less frequent modalities such as OCT (5.2%), Fundus Photography (6.1%), and Microscopy Images (7.5%) still have a substantial number of QA items (4,646–5,680), which should be sufficient for model training.

Although there is a skew toward MR and CT, all clinically relevant modalities are represented, reducing the risk that models will completely ignore underrepresented modalities. However, care may still be needed during training or evaluation, for example by reweighting rare modalities or reporting per-modality metrics, so that they are not overshadowed by MR and CT.
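
On the evaluation side, one option is to report accuracy per modality and its unweighted mean (macro average) alongside overall accuracy, so that a dominant modality cannot hide poor performance on a rare one. The sketch below uses toy labels and predictions; `per_group_accuracy` is a hypothetical helper.

```python
from collections import defaultdict


def per_group_accuracy(groups, y_true, y_pred):
    """Return {group: accuracy} computed separately for each group label."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Toy example: four MR items (all correct), two OCT items (one correct)
modalities = ["MR", "MR", "MR", "MR", "OCT", "OCT"]
y_true = ["A", "B", "A", "A", "B", "A"]
y_pred = ["A", "B", "A", "A", "A", "A"]

acc = per_group_accuracy(modalities, y_true, y_pred)
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # overall accuracy
macro = sum(acc.values()) / len(acc)  # every modality counts equally

print(acc)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the overall accuracy looks better than the macro average because the well-predicted modality has more items, which is the effect a per-modality report is meant to expose.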