1. Mount Google Drive to load the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Load a Pretrained ResNet18 Model**

Use a ResNet18 model pretrained on ImageNet as a feature extractor.

Remove the final classification layer to get 512-dimensional feature vectors from each image.

These vectors represent the visual characteristics of MRI slices.

**Extract Image Features**

For each patient’s MRI folder, load all image slices.

Pass each slice through the ResNet model to obtain feature vectors.

Average all slice-level vectors to get one feature vector per patient.
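
The averaging step can be sketched on toy data. This is a minimal illustration that assumes each slice has already been converted to a 512-dimensional vector; the random arrays below are stand-ins, not real ResNet outputs.

```python
import numpy as np

# Stand-in for ResNet outputs: 3 hypothetical slices, 512 features each.
rng = np.random.default_rng(0)
slice_features = [rng.normal(size=512) for _ in range(3)]

# Averaging over the slice axis yields one 512-dimensional vector per patient.
patient_vector = np.mean(slice_features, axis=0)
print(patient_vector.shape)  # (512,)
```

Because the mean is taken across slices, the result has the same length regardless of how many slices a patient has.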

**Combine with Structured EHR Data**

Append structured features, namely age and binary-encoded gender (1 for "M", 0 otherwise), to the MRI feature vector.

This creates a richer patient representation.
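
As a rough sketch of the combination step: `combine_features` is a hypothetical helper introduced only for illustration, and the zero vector is a placeholder for a patient's averaged MRI features.

```python
import numpy as np

def combine_features(image_vector, age, gender):
    """Append age and a binary gender flag (1 for "M", 0 otherwise) to the image vector."""
    ehr = np.array([age, 1 if gender == "M" else 0], dtype=float)
    return np.concatenate([image_vector, ehr])

image_vector = np.zeros(512)  # placeholder for averaged MRI features
combined = combine_features(image_vector, age=74, gender="F")

print(combined.shape)          # (514,)
print(combined[-2:].tolist())  # [74.0, 0.0]
```

The two EHR values sit at the end of the vector, so the combined representation has 514 entries.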

**Standardize Features**

Standardize all features to zero mean and unit variance with StandardScaler so that no feature dominates the clustering distance purely because of its scale.
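
A small sketch of what the scaler does, using a toy matrix whose two columns are assumed to mimic a small-valued image feature and an age column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: column 0 mimics a small-scale image feature, column 1 mimics age.
X = np.array([[0.1, 40.0],
              [0.2, 60.0],
              [0.3, 80.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so the age column no longer outweighs the image feature by magnitude alone.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```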

**Cluster Patients (KMeans)**

Apply KMeans clustering (with 4 clusters) to group patients based on their combined MRI and EHR features.
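
To illustrate how KMeans assigns cluster indices, here is a minimal sketch on six made-up 2D points forming two well-separated groups. The data and the choice of two clusters are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: one near (0, 0), one near (5, 5).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Points in the same group share a label; which group gets 0 or 1 is arbitrary.
print(labels)
```

The cluster indices carry no meaning by themselves, which is why clusters have to be inspected before any names are attached to them.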


**Assign Disease Labels**

After inspecting the clusters, manually map each cluster to a disease name using a dictionary.
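
One way to sketch the mapping step, including a check for clusters that were left out of the dictionary. The partial dictionary and the toy `Cluster` values are assumptions for illustration.

```python
import pandas as pd

# Deliberately partial mapping, to show what happens to an unmapped cluster.
cluster_to_disease = {0: "Cardiomyopathy", 1: "Myocardial_Infarction"}

clusters = pd.Series([0, 1, 2], name="Cluster")
disease = clusters.map(cluster_to_disease)  # unmapped clusters become NaN

# List the cluster ids that have no disease name yet.
unmapped = sorted(clusters[disease.isna()].unique().tolist())
print(unmapped)  # [2]
```

Checking for unmapped ids before saving can help avoid silently writing missing labels to the output file.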

In [None]:

import os
import numpy as np
import pandas as pd
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
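# `weights=` replaces the deprecated `pretrained=True` argument (torchvision >= 0.13).
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)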
resnet.fc = torch.nn.Identity()
resnet = resnet.to(device)
resnet.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])


def extract_features_from_folder(folder_path):
    """Extracts average ResNet features from all MRI slices in a folder."""
    features = []
    if not os.path.exists(folder_path):
        return np.zeros(512)

    for file in os.listdir(folder_path):
        if file.endswith(".png"):
            img_path = os.path.join(folder_path, file)
            try:
                img = Image.open(img_path).convert("RGB")
                x = transform(img).unsqueeze(0).to(device)
                with torch.no_grad():
                    feat = resnet(x).cpu().numpy().flatten()
                features.append(feat)
            except Exception as e:
                print(f"Skipping {img_path}: {e}")

    if len(features) == 0:
        return np.zeros(512)

    return np.mean(features, axis=0)

metadata_path = "/content/drive/MyDrive/heart_mri_ct/cleaned_structured_ehr.csv"
df = pd.read_csv(metadata_path)

base_path = "/content/drive/MyDrive/heart_mri_ct/processed_mri_images"
df["folder_path"] = df["folder_path"].apply(lambda x: os.path.join(base_path, x))

all_features = []

for idx, row in tqdm(df.iterrows(), total=len(df)):
    folder = row["folder_path"]
    f_img = extract_features_from_folder(folder)

    f_full = np.concatenate([f_img, [row["age"], 1 if row["gender"] == "M" else 0]])
    all_features.append(f_full)


X = np.array(all_features)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["Cluster"] = kmeans.fit_predict(X_scaled)


cluster_to_disease = {
    0: "Cardiomyopathy",
    1: "Myocardial_Infarction",
    2: "Heart_Failure",
    3: "Normal"
}

df["Disease"] = df["Cluster"].map(cluster_to_disease)

df = df.drop(columns=["Cluster"])
output_path = "/content/drive/MyDrive/heart_mri_ct/final_labeled_patients.csv"
df.to_csv(output_path, index=False)

print("\n✅ Disease labels generated successfully!")
print(df.head())


100%|██████████| 150/150 [00:00<00:00, 2893.79it/s]


✅ Disease labels generated successfully!
    patient_id  age gender modality  num_slices  \
0  Patient_001   74      F      MRI          84   
1  Patient_002   40      M      MRI          79   
2  Patient_003   80      M      MRI          77   
3  Patient_004   41      F      MRI          72   
4  Patient_005   46      M      MRI          57   

                                         folder_path                Disease  
0  /content/drive/MyDrive/heart_mri_ct/processed_...         Cardiomyopathy  
1  /content/drive/MyDrive/heart_mri_ct/processed_...  Myocardial_Infarction  
2  /content/drive/MyDrive/heart_mri_ct/processed_...                 Normal  
3  /content/drive/MyDrive/heart_mri_ct/processed_...          Heart_Failure  
4  /content/drive/MyDrive/heart_mri_ct/processed_...  Myocardial_Infarction  





**Add ICD-10 Codes and Save the Final Labeled Dataset**

Map each disease label to its ICD-10 code. The resulting CSV includes all patients with an assigned disease label and the corresponding code.

In [None]:
import pandas as pd

input_csv = "/content/drive/MyDrive/heart_mri_ct/final_labeled_patients.csv"
df = pd.read_csv(input_csv)

ICD10_CODES = {
    "Cardiomyopathy": "I42",
    "Myocardial_Infarction": "I21",
    "Heart_Failure": "I50",
    "Normal": "Z00"
}

df["ICD10_Code"] = df["Disease"].map(ICD10_CODES)

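# Assign the result back; calling fillna(..., inplace=True) on a selected column triggers a pandas FutureWarning.
df["ICD10_Code"] = df["ICD10_Code"].fillna("Unknown")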

output_csv = "/content/drive/MyDrive/heart_mri_ct/final_labeled_patients_with_icd10.csv"
df.to_csv(output_csv, index=False)

print("✅ ICD-10 codes assigned successfully!")
print(df.head())


✅ ICD-10 codes assigned successfully!
    patient_id  age gender modality  num_slices  \
0  Patient_001   74      F      MRI          84   
1  Patient_002   40      M      MRI          79   
2  Patient_003   80      M      MRI          77   
3  Patient_004   41      F      MRI          72   
4  Patient_005   46      M      MRI          57   

                                         folder_path                Disease  \
0  /content/drive/MyDrive/heart_mri_ct/processed_...         Cardiomyopathy   
1  /content/drive/MyDrive/heart_mri_ct/processed_...  Myocardial_Infarction   
2  /content/drive/MyDrive/heart_mri_ct/processed_...                 Normal   
3  /content/drive/MyDrive/heart_mri_ct/processed_...          Heart_Failure   
4  /content/drive/MyDrive/heart_mri_ct/processed_...  Myocardial_Infarction   

  ICD10_Code  
0        I42  
1        I21  
2        Z00  
3        I50  
4        I21  


