Step 1: Create a Base Data Directory

In [1]:
import os

# Create a base directory for all datasets
BASE_DIR = './data'
os.makedirs(BASE_DIR, exist_ok=True)
print("Base data directory created:", BASE_DIR)

Base data directory created: ./data


Step 2: Download the Datasets
A. Kaggle Datasets
Make sure your Kaggle API credentials are configured (i.e. your kaggle.json is in the correct folder, ~/.kaggle by default). Run the commands below to download and unzip the Kaggle datasets.

In [2]:
#!pip install kaggle

In [3]:
# Ensure Kaggle API credentials are set up (replace the placeholders with your own values; never share a real key)
!mkdir -p ~/.kaggle
!echo '{"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_KAGGLE_API_KEY"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

# Download and unzip Kaggle datasets into consistent directories under ./data
!kaggle datasets download -d tawsifurrahman/tuberculosis-tb-chest-xray-dataset -p ./data/tuberculosis --unzip
!kaggle datasets download -d rahimanshu/cardiomegaly-disease-prediction-using-cnn -p ./data/cardiomegaly --unzip
!kaggle datasets download -d adnanenasser/atelectasis -p ./data/atelectasis --unzip
!kaggle datasets download -d samiulbari/pulmonary-edema-classified-by-nih -p ./data/edema --unzip
!kaggle datasets download -d ivanadityamaulana/chest-xray-dataset-fibrosispleural-thickening -p ./data/pleural_thickening --unzip

Dataset URL: https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset
License(s): copyright-authors
Dataset URL: https://www.kaggle.com/datasets/rahimanshu/cardiomegaly-disease-prediction-using-cnn
License(s): CC0-1.0
Dataset URL: https://www.kaggle.com/datasets/adnanenasser/atelectasis
License(s): unknown
Dataset URL: https://www.kaggle.com/datasets/samiulbari/pulmonary-edema-classified-by-nih
License(s): CC0-1.0
Dataset URL: https://www.kaggle.com/datasets/ivanadityamaulana/chest-xray-dataset-fibrosispleural-thickening
License(s): unknown
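
Hard-coding an API key in a notebook cell exposes it to anyone who can read the notebook. As a minimal sketch, the credentials file could instead be written from values supplied at runtime. The helper name `write_kaggle_credentials` and the demo values are hypothetical; only the `kaggle.json` layout (`username`, `key`) and the `600` permission are taken from the cell above.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir):
    """Write kaggle.json into config_dir with owner-only permissions."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    cred_path = config_dir / "kaggle.json"
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # Same effect as `chmod 600` on POSIX systems
    return cred_path


# Demo with dummy values in a throwaway directory (hypothetical values).
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials("demo-user", "demo-key", demo_dir)
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(demo_path.name, oct(demo_mode))
```

In a real session the username and key would come from somewhere private, such as environment variables or an interactive prompt, and `config_dir` would be `~/.kaggle`.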


B. Hugging Face Datasets
Install and use the Hugging Face datasets library to load the following datasets.

In [4]:
# Install the datasets library
!pip install datasets

from datasets import load_dataset

# Load Hugging Face datasets in streaming mode
hf_datasets = {}
hf_datasets["Pneumonia"]    = load_dataset("MadElf1337/Pneumonia_Images", streaming=True)
hf_datasets["Effusion"]     = load_dataset("lagobellojp/synthetic_effusion_dataset_12", streaming=True)
hf_datasets["Emphysema"]    = load_dataset("Alwaly/cardiology_dataset-resampled-emphysema-test", streaming=True)
hf_datasets["Fibrosis"]     = load_dataset("Alwaly/cardiology_dataset-resampled-fibrosis", streaming=True)
hf_datasets["Infiltration"] = load_dataset("CHENHJDJSD/infiltration_train_test_large", streaming=True)
hf_datasets["Nodule"]       = load_dataset("Alwaly/cardiology_dataset-resampled-nodule", streaming=True)
hf_datasets["Pneumothorax"] = load_dataset("Alwaly/cardiology_dataset-resampled-pneumothorax", streaming=True)

print("Hugging Face datasets loaded in streaming mode.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Hugging Face datasets loaded in streaming mode.


In [5]:
import os
import json

export_dir = "./data/hf_export"
os.makedirs(export_dir, exist_ok=True)  # Ensure the export directory exists

for name, dataset in hf_datasets.items():
    dataset_path = os.path.join(export_dir, name)
    os.makedirs(dataset_path, exist_ok=True)  # Make a folder for each dataset

    try:
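        # Iterating the multi-split container yields split names, so pick a split first
        split_name = next(iter(dataset))
        sample = next(iter(dataset[split_name]))  # First record of that split
        print(f"Saving {name} to {dataset_path}...")

        # Save a sample to confirm streaming works; default=str stringifies
        # values such as PIL images that JSON cannot encode
        with open(os.path.join(dataset_path, "sample.json"), "w") as f:
            json.dump(sample, f, indent=4, default=str)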

        print(f"✅ {name} exported successfully.")

    except Exception as e:
        print(f"❌ Failed to save {name}: {e}")


Saving Pneumonia to ./data/hf_export/Pneumonia...
✅ Pneumonia exported successfully.
Saving Effusion to ./data/hf_export/Effusion...
✅ Effusion exported successfully.
Saving Emphysema to ./data/hf_export/Emphysema...
✅ Emphysema exported successfully.
Saving Fibrosis to ./data/hf_export/Fibrosis...
✅ Fibrosis exported successfully.
Saving Infiltration to ./data/hf_export/Infiltration...
✅ Infiltration exported successfully.
Saving Nodule to ./data/hf_export/Nodule...
✅ Nodule exported successfully.
Saving Pneumothorax to ./data/hf_export/Pneumothorax...
✅ Pneumothorax exported successfully.
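
One subtlety when sampling from a multi-split container: `load_dataset` called without a `split` argument returns a dict-like object keyed by split name, and iterating a dict yields its keys. The sketch below uses a plain `dict` with made-up records as a stand-in (the record contents are hypothetical) to show the difference between taking the first key and taking the first record of a split.

```python
# Stand-in for a multi-split dataset container: split name -> iterable of records.
fake_dataset = {
    "train": [{"image": "<img 0>", "label": 1}, {"image": "<img 1>", "label": 0}],
    "test": [{"image": "<img 2>", "label": 1}],
}

# Iterating the container itself yields a split name, not a record.
first_item = next(iter(fake_dataset))

# To get a record, index into a split and iterate that instead.
first_record = next(iter(fake_dataset[first_item]))

print(first_item)
print(first_record)
```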


Step 3: Create a Unified DataFrame
Each dataset provides labels for only one pathology. We therefore create a label vector of length 12 (one entry per pathology), where 1 marks a pathology as present, 0 as absent, and -1 as unknown.

A. Define Pathologies and Label Mapping

In [6]:
import numpy as np
import pandas as pd

# Define the 12 pathologies
pathologies = ["Pneumonia", "Tuberculosis", "Cardiomegaly", "Atelectasis", "Edema",
               "Effusion", "Emphysema", "Fibrosis", "Infiltration", "Nodule",
               "Pleural-thickening", "Pneumothorax"]

def create_label_vector(target_pathology, positive_label=1):
    label_vector = np.full(len(pathologies), -1, dtype=np.int8)
    if target_pathology in pathologies:
        label_vector[pathologies.index(target_pathology)] = positive_label
    return label_vector
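
To make the encoding concrete, the sketch below builds the vector for an Edema-positive image and derives the mask of known entries. It mirrors `create_label_vector` with plain Python lists in place of NumPy arrays so that it runs without dependencies; the names `make_label_vector` and `known_mask` are illustrative only.

```python
PATHOLOGIES = ["Pneumonia", "Tuberculosis", "Cardiomegaly", "Atelectasis", "Edema",
               "Effusion", "Emphysema", "Fibrosis", "Infiltration", "Nodule",
               "Pleural-thickening", "Pneumothorax"]


def make_label_vector(target, positive_label=1):
    """Return -1 (unknown) everywhere except at the target pathology's index."""
    vector = [-1] * len(PATHOLOGIES)
    if target in PATHOLOGIES:
        vector[PATHOLOGIES.index(target)] = positive_label
    return vector


def known_mask(vector):
    """1 where the label is known (0 or 1), 0 where it is unknown (-1)."""
    return [0 if v == -1 else 1 for v in vector]


edema_vector = make_label_vector("Edema")
edema_mask = known_mask(edema_vector)
print(edema_vector)
print(edema_mask)
```

Only the Edema entry (index 4) is known, so only that entry would contribute to a masked loss for this image.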

B. Process Kaggle Datasets
Scan the Kaggle dataset folders and extract image paths.




In [7]:
import glob

def process_kaggle_dataset(dataset_dir, pathology_name):
    # Find all image files recursively in the folder
    img_files = glob.glob(os.path.join(dataset_dir, "**", "*.*"), recursive=True)
    img_files = [f for f in img_files if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp'))]

    data = []
    positive_keyword = pathology_name.lower()
    negative_keyword = "normal"

    for path in img_files:
        parent_folder = os.path.basename(os.path.dirname(path)).lower()
        if positive_keyword in parent_folder:
            label_vector = create_label_vector(pathology_name, positive_label=1)  # 1 at this pathology's index, -1 (unknown) elsewhere
        elif negative_keyword in parent_folder:
            label_vector = np.zeros(len(pathologies), dtype=np.int8)  # Normal image: all 12 pathologies marked absent
        else:
            continue  # Skip files in unrecognized folders
        data.append({'image_path': path, 'label_vector': label_vector})

    df = pd.DataFrame(data)
    return df

# Map Kaggle datasets to pathology names and their folder paths
kaggle_mappings = {
    "Tuberculosis": os.path.join(BASE_DIR, "tuberculosis"),
    "Cardiomegaly": os.path.join(BASE_DIR, "cardiomegaly"),
    "Atelectasis": os.path.join(BASE_DIR, "atelectasis"),
    "Edema": os.path.join(BASE_DIR, "edema"),
    "Pleural-thickening": os.path.join(BASE_DIR, "pleural_thickening")
}
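
The labelling rule in `process_kaggle_dataset` is a substring match on the image's immediate parent folder. The sketch below isolates that rule so it can be tried on a few paths. The paths are hypothetical (the notebook does not show the real folder names), and `classify_by_parent_folder` is an illustrative helper, not part of the pipeline.

```python
import os


def classify_by_parent_folder(path, pathology_name):
    """Return 'positive', 'negative', or None from the image's parent folder name."""
    parent = os.path.basename(os.path.dirname(path)).lower()
    if pathology_name.lower() in parent:
        return "positive"
    if "normal" in parent:
        return "negative"
    return None  # Unrecognized folder: the image would be skipped


# Hypothetical paths, for illustration only.
examples = [
    ("data/tuberculosis/Tuberculosis/img_001.png", "Tuberculosis"),
    ("data/tuberculosis/Normal/img_002.png", "Tuberculosis"),
    ("data/pleural_thickening/Pleural_Thickening/img_003.png", "Pleural-thickening"),
]
results = [classify_by_parent_folder(path, name) for path, name in examples]
print(results)
```

The third case shows a pitfall of this rule: a folder spelled with an underscore does not contain the hyphenated pathology name, so its images would be skipped silently. Checking how many rows each dataset contributes is a cheap way to catch this.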

C. Process Hugging Face Datasets
Extract or export images from the Hugging Face datasets so they are saved locally.

In [8]:
from PIL import Image

def process_hf_dataset(hf_dataset, pathology_name, split="train", export_dir=None):
    os.makedirs(export_dir, exist_ok=True)
    data = []

    for i, example in enumerate(hf_dataset[split]):
        img = example["image"]
        if not isinstance(img, Image.Image):
            img = Image.fromarray(img)

        # Convert modes that JPEG cannot store (e.g. RGBA, P, LA) to RGB
        if img.mode not in ('RGB', 'L'):
            img = img.convert('RGB')

        img_filename = f"{pathology_name}_{split}_{i}.jpg"
        img_path = os.path.join(export_dir, img_filename)
        img.save(img_path)

        # Use label field if available
        label = example.get("label", None)
        if label is not None:
            if label == 1:
                label_vector = create_label_vector(pathology_name, positive_label=1)  # 1 at this pathology's index, -1 (unknown) elsewhere
            elif label == 0:
                label_vector = np.zeros(len(pathologies), dtype=np.int8)  # Negative example: all 12 pathologies marked absent
            else:
                continue  # Skip if label is neither 0 nor 1
        else:
            # Default to positive if no label (less ideal, but maintains compatibility)
            label_vector = create_label_vector(pathology_name, positive_label=1)

        data.append({'image_path': img_path, 'label_vector': label_vector})

    df = pd.DataFrame(data)
    return df

# Map Hugging Face datasets to pathology names; set an export directory for each
hf_mappings = {
    "Pneumonia": hf_datasets["Pneumonia"],
    "Effusion": hf_datasets["Effusion"],
    "Emphysema": hf_datasets["Emphysema"],
    "Fibrosis": hf_datasets["Fibrosis"],
    "Infiltration": hf_datasets["Infiltration"],
    "Nodule": hf_datasets["Nodule"],
    "Pneumothorax": hf_datasets["Pneumothorax"]
}
export_base = os.path.join(BASE_DIR, "hf_export")
os.makedirs(export_base, exist_ok=True)

D. Combine All DataFrames
Process each dataset and combine them.

In [9]:
df_list = []

# Process Kaggle datasets
for pathology, folder in kaggle_mappings.items():
    if os.path.exists(folder):
        df_kaggle = process_kaggle_dataset(folder, pathology)
        df_list.append(df_kaggle)
    else:
        print(f"Folder for {pathology} not found at {folder}. Please download the dataset.")

# Process Hugging Face datasets
for pathology, ds in hf_mappings.items():
    export_dir = os.path.join(export_base, pathology)
    df_hf = process_hf_dataset(ds, pathology, split="train", export_dir=export_dir)
    df_list.append(df_hf)

# Create the unified DataFrame
combined_df = pd.concat(df_list, ignore_index=True)
print("Combined dataset size:", combined_df.shape)
combined_df.to_csv("./data/combined_dataset.csv", index=False)


Combined dataset size: (40555, 2)
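
Because the `label_vector` column holds arrays, writing the DataFrame to CSV stores each vector as text rather than as numbers, so it has to be parsed again if the file is reloaded. A minimal sketch of an explicit round trip follows, using the standard `csv` module and a space-separated encoding. The encoding and the helper names are assumptions for illustration, not what the cell above does.

```python
import csv
import io


def encode_vector(vector):
    """Turn [1, -1, 0] into the string '1 -1 0'."""
    return " ".join(str(v) for v in vector)


def decode_vector(text):
    """Turn '1 -1 0' back into [1, -1, 0]."""
    return [int(v) for v in text.split()]


rows = [
    {"image_path": "a.jpg", "label_vector": [1, -1, -1]},
    {"image_path": "b.jpg", "label_vector": [0, 0, 0]},
]

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["image_path", "label_vector"])
writer.writeheader()
for row in rows:
    writer.writerow({"image_path": row["image_path"],
                     "label_vector": encode_vector(row["label_vector"])})

# Read the rows back and decode the vectors.
buffer.seek(0)
restored = [{"image_path": r["image_path"],
             "label_vector": decode_vector(r["label_vector"])}
            for r in csv.DictReader(buffer)]
print(restored)
```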


Step 4: Define PyTorch Dataset and DataLoader



In [10]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ChestXrayDataset(Dataset):
    def __init__(self, dataframe, transform=None):
        self.dataframe = dataframe.reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        img_path = row['image_path']
        image = Image.open(img_path).convert('RGB')
        label_vector = np.array(row['label_vector'], dtype=np.float32)
        mask = (label_vector != -1).astype(np.float32)

        if self.transform:
            image = self.transform(image)

        return image, torch.tensor(label_vector), torch.tensor(mask)

# Define image transformations (with augmentation for training)
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
}

# Split the DataFrame into training and validation sets (80/20 split)
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(combined_df, test_size=0.2, random_state=42)

train_dataset = ChestXrayDataset(train_df, transform=data_transforms['train'])
val_dataset = ChestXrayDataset(val_df, transform=data_transforms['val'])

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, num_workers=2)
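
As a quick sanity check on the loaders, the number of batches per epoch can be worked out by hand from the dataset size reported in Step 3. The sketch assumes the validation share is rounded up (which is how scikit-learn treats a float `test_size`, to the best of my knowledge) and that the final partial batch is kept, which is the `DataLoader` default.

```python
import math

n_total = 40555      # Rows in the combined DataFrame
batch_size = 8

n_val = math.ceil(n_total * 20 / 100)   # 20% validation share, rounded up
n_train = n_total - n_val

# With drop_last=False (the default), the last partial batch still counts.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(n_train, n_val, train_batches, val_batches)
```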


Step 5: Define the Model



In [11]:
import torch.nn as nn
from torchvision import models

class ChestXrayClassifier(nn.Module):
    def __init__(self, num_classes=12, dropout_p=0.5):
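        super().__init__()
        # `pretrained=True` is deprecated since torchvision 0.13; the weights enum selects the same ImageNet checkpoint
        self.base_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)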
        num_ftrs = self.base_model.fc.in_features

        # Remove the original fully connected layer
        self.base_model.fc = nn.Identity()

        # Extra hidden layer for additional complexity
        self.fc1 = nn.Linear(num_ftrs, 512)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=dropout_p)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.base_model(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Initialize the model and move it to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ChestXrayClassifier(num_classes=len(pathologies), dropout_p=0.5).to(device)
print(model)


Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 72.6MB/s]


ChestXrayClassifier(
  (base_model): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequent
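
For a sense of scale, the size of the new classification head can be computed by hand. The sketch relies on ResNet-50 exposing a 2048-dimensional feature vector once its `fc` layer is replaced by `nn.Identity`; the other sizes come from the class definition above.

```python
in_features = 2048   # Width of ResNet-50's pooled feature vector
hidden = 512         # Output size of fc1
num_classes = 12     # Output size of fc2, one logit per pathology

fc1_params = in_features * hidden + hidden          # weights + biases
fc2_params = hidden * num_classes + num_classes     # weights + biases
head_params = fc1_params + fc2_params

print(fc1_params, fc2_params, head_params)
```

The head therefore adds roughly one million trainable parameters on top of the backbone, almost all of them in `fc1`.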

Step 6: Define Loss Function and Optimizer



In [12]:
import torch.optim as optim

def masked_bce_loss(outputs, targets, mask):
    criterion = nn.BCEWithLogitsLoss(reduction='none')
    loss = criterion(outputs, targets)
    loss = loss * mask  # Apply the mask to ignore unknown labels
    return loss.sum() / mask.sum().clamp(min=1)  # clamp avoids division by zero if no label in the batch is known

optimizer = optim.Adam(model.parameters(), lr=0.001)
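
To see what the mask does to the loss, here is the same computation worked through on three logits in plain Python. It is a sketch of the arithmetic only: the numerically stable form `max(x, 0) - x*t + log(1 + exp(-|x|))` is the standard way to evaluate binary cross-entropy on logits, and the input values are made up.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy evaluated on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def masked_bce(logits, targets, mask):
    """Average the per-entry loss over known entries only (mask == 1)."""
    total = sum(bce_with_logits(x, t) * m for x, t, m in zip(logits, targets, mask))
    return total / sum(mask)


logits = [0.0, 2.0, -1.0]
targets = [1.0, -1.0, 0.0]   # -1 marks an unknown label
mask = [1.0, 0.0, 1.0]       # The middle entry is ignored

loss = masked_bce(logits, targets, mask)

# The same value computed directly from the two known entries.
expected = (math.log(2.0) + math.log1p(math.exp(-1.0))) / 2
print(round(loss, 4))
```

The unknown entry still produces a finite per-entry value, but multiplying by the mask zeroes it out, and dividing by `sum(mask)` averages over the two known labels rather than all three.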


Step 7: Train and Validate the Model



In [13]:
import torch
import torch.optim as optim
from tqdm.notebook import tqdm, trange
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

num_epochs = 7
best_metric = 0.0  # We'll use accuracy to track the best model

for epoch in trange(num_epochs, desc="Epochs", leave=True):
    model.train()
    train_loss = 0.0
    for images, labels, masks in tqdm(train_loader, desc=f"Training Epoch {epoch+1}", leave=False):
        images, labels, masks = images.to(device), labels.to(device), masks.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = masked_bce_loss(outputs, labels, masks)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)

    # Validation phase
    model.eval()
    val_loss = 0.0
    all_targets, all_preds, all_masks = [], [], []  # Masks record which labels are known (not -1)
    with torch.no_grad():  # Gradients are not needed during validation
        for images, labels, masks in tqdm(val_loader, desc="Validation", leave=False):
            images, labels, masks = images.to(device), labels.to(device), masks.to(device)
            outputs = model(images)
            loss = masked_bce_loss(outputs, labels, masks)
            val_loss += loss.item()

            probs = torch.sigmoid(outputs).cpu().numpy()
            preds = (probs > 0.5).astype(int)

            all_targets.append(labels.cpu().numpy())
            all_preds.append(preds)
            all_masks.append(masks.cpu().numpy())  # Collect masks

    val_loss /= len(val_loader)

    # Concatenate across all validation batches
    all_targets = np.concatenate(all_targets, axis=0)
    all_preds = np.concatenate(all_preds, axis=0)
    all_masks = np.concatenate(all_masks, axis=0)

    # Ensure 2D arrays
    num_classes = len(pathologies)
    if all_targets.ndim == 1:
        all_targets = all_targets.reshape(-1, num_classes)
    if all_preds.ndim == 1:
        all_preds = all_preds.reshape(-1, num_classes)
    if all_masks.ndim == 1:
        all_masks = all_masks.reshape(-1, num_classes)

    # Convert to appropriate types
    all_targets = all_targets.astype(np.int64)
    all_preds = all_preds.astype(np.int64)
    all_masks = all_masks.astype(np.bool_)  # Masks should be boolean (True for known, False for -1)

    # Mask out -1 values (unknown labels) by converting to binary format
    valid_targets = np.where(all_targets == -1, 0, all_targets)  # Replace -1 with 0
    valid_mask = (all_targets != -1)  # True where labels are known (0 or 1)

    # Apply mask to targets and preds for metric calculation
    masked_targets = valid_targets[valid_mask]
    masked_preds = all_preds[valid_mask]

    # Calculate metrics only on known labels
    if masked_targets.size > 0:  # Ensure there are valid labels to evaluate
        accuracy = accuracy_score(masked_targets, masked_preds)
        precision = precision_score(masked_targets, masked_preds, average='macro', zero_division=0)
        recall = recall_score(masked_targets, masked_preds, average='macro', zero_division=0)
    else:
        accuracy = precision = recall = 0.0  # Default to 0 if no valid labels

    print(f"\nEpoch {epoch+1}/{num_epochs} - Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
    print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")

    # Save a checkpoint; despite the Keras-style .h5 name, the file is a PyTorch state_dict, not HDF5
    checkpoint_path = f"CXR_Pathologies_epoch_{epoch+1}.h5"
    torch.save(model.state_dict(), checkpoint_path)
    print(f"Model saved as {checkpoint_path}")

    # Update best model based on accuracy
    if accuracy > best_metric:
        best_metric = accuracy
        best_model_path = "CXR_Pathologies_best.h5"
        torch.save(model.state_dict(), best_model_path)
        print(f"Best model updated (Accuracy: {best_metric:.4f}) and saved as {best_model_path}")


Epoch 1/7 - Train Loss: 0.1715, Val Loss: 0.1451
Accuracy: 0.9237, Precision: 0.7470, Recall: 0.8673
Model saved as CXR_Pathologies_epoch_1.h5
Best model updated (Accuracy: 0.9237) and saved as CXR_Pathologies_best.h5



Epoch 2/7 - Train Loss: 0.1511, Val Loss: 0.1495
Accuracy: 0.9182, Precision: 0.7397, Recall: 0.8978
Model saved as CXR_Pathologies_epoch_2.h5



Epoch 3/7 - Train Loss: 0.1425, Val Loss: 0.1361
Accuracy: 0.9364, Precision: 0.7851, Recall: 0.7614
Model saved as CXR_Pathologies_epoch_3.h5
Best model updated (Accuracy: 0.9364) and saved as CXR_Pathologies_best.h5



Epoch 4/7 - Train Loss: 0.1387, Val Loss: 0.1315
Accuracy: 0.9379, Precision: 0.7897, Recall: 0.7732
Model saved as CXR_Pathologies_epoch_4.h5
Best model updated (Accuracy: 0.9379) and saved as CXR_Pathologies_best.h5



Epoch 5/7 - Train Loss: 0.1382, Val Loss: 0.1305
Accuracy: 0.9357, Precision: 0.7751, Recall: 0.8418
Model saved as CXR_Pathologies_epoch_5.h5



Epoch 6/7 - Train Loss: 0.1331, Val Loss: 0.1635
Accuracy: 0.9315, Precision: 0.7911, Recall: 0.6481
Model saved as CXR_Pathologies_epoch_6.h5



Epoch 7/7 - Train Loss: 0.1314, Val Loss: 0.1271
Accuracy: 0.9418, Precision: 0.8027, Recall: 0.7903
Model saved as CXR_Pathologies_epoch_7.h5
Best model updated (Accuracy: 0.9418) and saved as CXR_Pathologies_best.h5
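
The metrics above pool every known label from all pathologies into a single binary problem, so `average='macro'` averages over the two classes 0 and 1 rather than over the 12 pathologies. Per-pathology figures can be obtained by applying the mask column by column. Below is a dependency-free sketch with made-up values (three pathologies, four images); `per_column_accuracy` is an illustrative helper, not part of the training loop.

```python
def per_column_accuracy(targets, preds):
    """Accuracy for each column, counting only entries whose target is not -1."""
    n_cols = len(targets[0])
    accuracies = []
    for col in range(n_cols):
        pairs = [(t[col], p[col]) for t, p in zip(targets, preds) if t[col] != -1]
        if pairs:
            correct = sum(1 for t, p in pairs if t == p)
            accuracies.append(correct / len(pairs))
        else:
            accuracies.append(None)  # No known labels for this column
    return accuracies


# Made-up targets (-1 = unknown) and thresholded predictions.
targets = [
    [1, -1, -1],
    [0, 0, 0],
    [-1, 1, -1],
    [0, 0, 0],
]
preds = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]
column_acc = per_column_accuracy(targets, preds)
print(column_acc)
```

A breakdown like this makes it easier to spot a pathology that the pooled accuracy hides, for example one whose positives are rarely detected.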
