<a href="https://colab.research.google.com/github/AhlamBashiti1/MedCUI_ML_Project/blob/main/Mapping_Images_Texts_CUI_Clusters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🎯 Data Preprocessing: Merging Captions and Concepts

We begin preparing the data for **multi-label classification** by merging information from two sources:

📂 **Input Files**:
- 🖼️ File 1 (`*_captions.csv`): contains the image identifier (column `ID`) and the corresponding **image caption**.
- 🧠 File 2 (`*_concepts.csv`): contains the same `ID` column and a list of **clinical CUIs** (Concept Unique Identifiers), which represent medical concepts.

🔗 **Goal**: Merge both files into a single dataset with the columns `ID`, `Caption`, and `CUIs`.

The merge is keyed on the shared `ID` column. It is a **crucial step**: every image needs both its **caption** and its **medical annotations (CUIs)** before cluster mapping and multi-label encoding can be applied.

♻️ **Applied to all splits**:
- ✅ Training set
- ✅ Validation set
- ✅ Test set

This ensures that the full dataset is consistently formatted for downstream processing like cluster mapping and multi-label encoding.
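
As a minimal sketch of the merge described above, the toy frames below stand in for the captions and concepts files. All rows are made up for illustration (the IDs, captions, and CUIs are hypothetical). The `validate="one_to_one"` argument is an optional safety check that is not part of the original cells; it assumes each `ID` appears at most once per file.

```python
import pandas as pd

# Toy stand-ins for the captions and concepts files (values are illustrative only)
captions_df = pd.DataFrame({
    "ID": ["img_001", "img_002", "img_003"],
    "Caption": ["Chest X-ray", "Head CT", "Knee MRI"],
})
cuis_df = pd.DataFrame({
    "ID": ["img_001", "img_002", "img_004"],
    "CUIs": ["C0000001;C0000002", "C0000003", "C0000004"],
})

# Inner join on the shared key: only IDs present in both frames survive.
merged_df = pd.merge(captions_df, cuis_df, on="ID", how="inner", validate="one_to_one")

# An outer join with indicator=True reveals which IDs the inner join drops.
check_df = pd.merge(captions_df, cuis_df, on="ID", how="outer", indicator=True)
unmatched_ids = sorted(check_df.loc[check_df["_merge"] != "both", "ID"])

print(merged_df)
print("IDs missing from one file:", unmatched_ids)
```

Running the indicator check before saving is a cheap way to notice images that have a caption but no concepts, or the reverse.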


## 📥 Clone Dataset Repository

We start by cloning the **ROCOv2 Radiology** dataset repository from GitHub:

```bash
!git clone https://github.com/sctg-development/ROCOv2-radiology.git
```

In [21]:
!git clone https://github.com/sctg-development/ROCOv2-radiology.git

fatal: destination path 'ROCOv2-radiology' already exists and is not an empty directory.


In [9]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np


In [13]:

# Load both CSV files
captions_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/train_captions.csv')
cuis_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/train_concepts.csv')

# Inner-join on the shared 'ID' column (keeps only IDs present in both files)
merged_df = pd.merge(captions_df, cuis_df, on='ID')

# Save the result as CSV
output_path = '/content/MergeAllTrain.csv'
merged_df.to_csv(output_path, index=False)

print(f"Merged file saved as '{output_path}'")




Merged file saved as '/content/MergeAllTrain.csv'


In [14]:
# Load both CSV files
captions_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/valid_captions.csv')
cuis_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/valid_concepts.csv')

# Inner-join on the shared 'ID' column (keeps only IDs present in both files)
merged_df = pd.merge(captions_df, cuis_df, on='ID')

# Save the result as CSV
output_path = '/content/MergeAllValid.csv'
merged_df.to_csv(output_path, index=False)

print(f"Merged file saved as '{output_path}'")

Merged file saved as '/content/MergeAllValid.csv'


In [15]:
# Load both CSV files
captions_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/test_captions.csv')
cuis_df = pd.read_csv('/content/ROCOv2-radiology/source_dataset/test_concepts.csv')

# Inner-join on the shared 'ID' column (keeps only IDs present in both files)
merged_df = pd.merge(captions_df, cuis_df, on='ID')

# Save the result as CSV
output_path = '/content/MergeAllTest.csv'
merged_df.to_csv(output_path, index=False)

print(f"Merged file saved as '{output_path}'")

Merged file saved as '/content/MergeAllTest.csv'
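
The three cells above repeat the same load, merge, and save steps. One possible way to fold them into a loop is sketched below. The helper names `merge_split` and `merge_all_splits` and the `SPLIT_OUTPUTS` mapping are hypothetical conveniences, not part of the original notebook; the file-name pattern and output paths are taken from the cells above.

```python
from pathlib import Path

import pandas as pd


def merge_split(source_dir, split, output_path):
    """Merge <split>_captions.csv and <split>_concepts.csv on 'ID' and save the result."""
    source_dir = Path(source_dir)
    captions_df = pd.read_csv(source_dir / f"{split}_captions.csv")
    cuis_df = pd.read_csv(source_dir / f"{split}_concepts.csv")
    # Default inner join: keeps only IDs present in both files
    merged_df = pd.merge(captions_df, cuis_df, on="ID")
    merged_df.to_csv(output_path, index=False)
    return merged_df


# Split name -> output file, mirroring the paths used in the cells above
SPLIT_OUTPUTS = {
    "train": "/content/MergeAllTrain.csv",
    "valid": "/content/MergeAllValid.csv",
    "test": "/content/MergeAllTest.csv",
}


def merge_all_splits(source_dir="/content/ROCOv2-radiology/source_dataset",
                     outputs=SPLIT_OUTPUTS):
    """Run merge_split for every split and report the row counts."""
    for split, output_path in outputs.items():
        merged_df = merge_split(source_dir, split, output_path)
        print(f"Merged file saved as '{output_path}' ({len(merged_df)} rows)")
```

Calling `merge_all_splits()` in a Colab session where the repository has been cloned would then replace the three separate cells.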


## 🧠 Concept Clustering and Multi-Label Encoding

After merging captions and CUIs, we proceed to map each CUI to a **cluster label** and transform the data into a format suitable for multi-label classification.

🔍 **Step 1: Normalize and Map CUIs**
- CUIs are normalized: each row's CUI string is split on `;`, surrounding whitespace is stripped, and letters are uppercased.
- Using a precomputed mapping file, each CUI is mapped to a **cluster index** (e.g., `C0002871` → `Cluster 2`).
- For each image, we collect all mapped cluster labels into a list.

⚠️ If none of an image’s CUIs appear in the mapping, the image is labeled `"UNKNOWN"` and **removed** from the dataset, so every remaining row has at least one cluster label.

🧪 Example:
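
A minimal sketch of the mapping step on made-up data. The pairing `C0002871` → cluster `2` is the example given above; the other CUIs, their cluster assignments, and the image IDs are hypothetical and chosen only to show the three possible outcomes.

```python
# Hypothetical CUI -> cluster mapping (only C0002871 -> 2 comes from the text above)
cui_to_cluster = {"C0002871": 2, "C0000001": 0, "C0000002": 3}


def get_clusters_from_cuis(cuis_str):
    """Split a ';'-separated CUI string and return sorted, de-duplicated cluster labels."""
    cuis = [c.strip().upper() for c in str(cuis_str).split(";") if c.strip()]
    clusters = {str(cui_to_cluster[cui]) for cui in cuis if cui in cui_to_cluster}
    return sorted(clusters) if clusters else ["UNKNOWN"]


rows = {
    "img_001": "C0002871;C0000001",     # both CUIs are mapped
    "img_002": " c0000002 ; C9999999",  # one mapped after normalization, one unmapped
    "img_003": "C9999999",              # nothing mapped, so the row would be dropped
}

labels = {image_id: get_clusters_from_cuis(cuis) for image_id, cuis in rows.items()}
kept = {image_id: lab for image_id, lab in labels.items() if lab != ["UNKNOWN"]}

print(labels)
print(kept)
```

Here `img_001` receives the labels `['0', '2']`, `img_002` receives `['3']`, and `img_003` is marked `['UNKNOWN']` and filtered out.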



---

🏷️ **Step 2: Multi-Label Encoding**
- We apply **MultiLabelBinarizer** to convert the list of cluster labels into a **multi-hot encoded vector**.
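- Each cluster becomes a column named after its cluster index (`0`, `1`, ...) holding 0 or 1 for absence or presence.
- To keep the columns consistent, the encoder should be **fitted on the training set** and reused on the validation and test sets. As written, `process_caption_cui_file` fits a new encoder on each split; the columns still agree because all four clusters occur in every split (see the printed `classes` in the output).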

🧾 **Final Output Columns**:
- `ID`, `Caption`, `CUIs` — original data
- `clusters` — comma-separated cluster labels
- `multi_hot_vector` — full binary vector
- One column per cluster class (for model training)

✅ This prepares the data for training a **multi-label classifier**, where each image can be associated with multiple concept clusters.
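
The fit-once pattern mentioned in Step 2 can be sketched as follows. The label lists are toy data, not taken from the dataset, and this is an illustration of the scikit-learn calls rather than the notebook's own code.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cluster-label lists (illustrative only)
train_labels = [["0", "2"], ["1"], ["1", "3"]]
val_labels = [["2"], ["0", "3"]]

mlb = MultiLabelBinarizer()

# Fit on the training labels only: this fixes the column order.
y_train = mlb.fit_transform(train_labels)

# Reuse the fitted encoder so validation columns match the training columns.
y_val = mlb.transform(val_labels)

print(list(mlb.classes_))
print(y_train.tolist())
print(y_val.tolist())
```

With this pattern the column layout is fixed by the training split. A label that appears only in validation or test data is ignored by `transform` with a warning, rather than silently adding a column.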


In [24]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def process_caption_cui_file(
    merged_path,              # Path to merged file with ID, Caption, CUIs
    cui_cluster_path,         # Path to CSV with CUI → cluster
    output_path               # Where to save final file
):
    # 1. Load files
    image_cuis_df = pd.read_csv(merged_path)
    cui_cluster_df = pd.read_csv(cui_cluster_path)

    # 2. Normalize CUIs
    image_cuis_df['CUIs'] = image_cuis_df['CUIs'].astype(str).str.strip().str.upper()
    cui_cluster_df['CUI'] = cui_cluster_df['CUI'].astype(str).str.strip().str.upper()

    # 3. Create CUI → Cluster mapping
    cui_to_cluster = dict(zip(cui_cluster_df['CUI'], cui_cluster_df['cluster']))

    # 4. Map CUIs to cluster labels
    def get_clusters_from_cuis(cuis_str):
        # CUIs within a row are ';'-separated
        cuis = [c.strip().upper() for c in str(cuis_str).split(';') if c.strip()]
        clusters = [str(cui_to_cluster[cui]) for cui in cuis if cui in cui_to_cluster]
        # sorted() keeps the label order deterministic across runs
        return sorted(set(clusters)) if clusters else ["UNKNOWN"]

    image_cuis_df['cluster_labels'] = image_cuis_df['CUIs'].apply(get_clusters_from_cuis)
    image_cuis_df['clusters'] = image_cuis_df['cluster_labels'].apply(
        lambda x: ",".join(x) if x else "UNKNOWN"
    )

    # 5. Remove UNKNOWN rows (copy so the later column assignment does not hit a view)
    image_cuis_df = image_cuis_df[image_cuis_df['clusters'] != "UNKNOWN"].reset_index(drop=True).copy()

    # 6. Multi-hot encode
    mlb = MultiLabelBinarizer()
    y_multi_hot = mlb.fit_transform(image_cuis_df['cluster_labels'])
    class_names = mlb.classes_

    # 7. Prepare multi-hot DataFrame
    y_df = pd.DataFrame(y_multi_hot, columns=class_names)
    image_cuis_df['multi_hot_vector'] = y_df.apply(lambda row: row.tolist(), axis=1)

    # 8. Final merge and save
    final_df = pd.concat([
        image_cuis_df[['ID', 'Caption', 'CUIs', 'clusters', 'multi_hot_vector']],
        y_df.reset_index(drop=True)
    ], axis=1)

    final_df.to_csv(output_path, index=False)
    print(f"Saved: {output_path} — shape: {final_df.shape} — classes: {list(class_names)}")


In [25]:
# Set paths for each split
train_file = "/content/MergeAllTrain.csv"
val_file = "/content/MergeAllValid.csv"
test_file = "/content/MergeAllTest.csv"
cui_cluster_file = "/content/clustered_cui2.csv"

# Process each split
process_caption_cui_file(train_file, cui_cluster_file, "Train_with_clusters.csv")
process_caption_cui_file(val_file, cui_cluster_file, "Val_with_clusters.csv")
process_caption_cui_file(test_file, cui_cluster_file, "Test_with_clusters.csv")


Saved: Train_with_clusters.csv — shape: (59957, 9) — classes: ['0', '1', '2', '3']
Saved: Val_with_clusters.csv — shape: (9904, 9) — classes: ['0', '1', '2', '3']
Saved: Test_with_clusters.csv — shape: (9927, 9) — classes: ['0', '1', '2', '3']
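
One practical detail when reloading these files: `to_csv` writes the `multi_hot_vector` lists as text such as `"[1, 0, 1, 0]"`, so they come back as strings. The sketch below shows two ways to recover the label matrix. The CSV content is a made-up stand-in with the same nine columns as the saved files; the rows are illustrative only.

```python
import ast
import io

import pandas as pd

# Stand-in for a few rows of Train_with_clusters.csv (values are illustrative only)
csv_text = '''ID,Caption,CUIs,clusters,multi_hot_vector,0,1,2,3
img_001,Chest X-ray,C0000001;C0000002,"0,2","[1, 0, 1, 0]",1,0,1,0
img_002,Head CT,C0000003,3,"[0, 0, 0, 1]",0,0,0,1
'''
df = pd.read_csv(io.StringIO(csv_text))

# Option 1: parse the stringified list back into a Python list.
df["multi_hot_vector"] = df["multi_hot_vector"].apply(ast.literal_eval)

# Option 2: rebuild the label matrix from the per-cluster columns.
cluster_cols = ["0", "1", "2", "3"]
y = df[cluster_cols].to_numpy()

print(df["multi_hot_vector"].tolist())
print(y.shape)
```

The per-cluster columns are usually the more convenient source for model training, since they load directly as integers.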
