Chest X-ray Classification: Pneumonia Detection

This project builds a binary image classification model that detects pneumonia from chest X-ray images using deep learning. The dataset contains X-ray images labeled as either ‘NORMAL’ or ‘PNEUMONIA’, with a notable class imbalance favoring the Pneumonia category. To address this, the model uses class weighting and data augmentation to improve generalization. Images are resized to a uniform shape and fed into a Convolutional Neural Network (CNN) that uses dropout and batch normalization, and training uses early stopping to limit overfitting. The input pipeline uses TensorFlow’s tf.data API, and the setup code can use a distributed strategy such as a TPU when one is available. The final model is evaluated with accuracy and AUC to assess its ability to distinguish between healthy and pneumonia-affected lungs.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This section sets up the environment and prepares for distributed training. It imports essential libraries for data handling, model training, and visualization. The code attempts to detect a TPU (Tensor Processing Unit) for accelerated training and falls back to the default strategy (GPU or CPU) if a TPU is not available. This ensures hardware compatibility and performance optimization across different environments. It also prints the number of training replicas and the TensorFlow version to verify setup details.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import matplotlib.pyplot


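try:
    # Locate the TPU; this raises if no TPU is attached to the runtime.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)

except Exception:
    # No TPU available: fall back to the default strategy (GPU or CPU).
    strategy = tf.distribute.get_strategy()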

print("number of replicas:", strategy.num_replicas_in_sync)

print(tf.__version__)


number of replicas: 1
2.18.0


This section defines key constants for the training pipeline. AUTOTUNE lets TensorFlow tune data-loading parallelism and prefetching automatically. EPOCHS sets the maximum number of passes over the training data. GCS_PATH specifies the location of the Chest X-ray dataset in Google Drive. BATCH_SIZE is 16 per replica, scaled by the number of replicas in sync (TPU cores or GPUs) for distributed training. IMAGE_SIZE sets a uniform target size for all input images, which is required for feeding them into the model.

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
EPOCHS = 25
GCS_PATH = '/content/drive/MyDrive/Colab Notebooks/chest_xray/'
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
IMAGE_SIZE = (160, 160)

This block loads the Chest X-ray image data from the specified training, testing, and validation directories using TensorFlow’s image_dataset_from_directory utility. Each dataset is created with inferred labels based on subdirectory names, shuffled for randomness, and resized to a uniform shape (IMAGE_SIZE). The images are loaded in RGB format with a fixed batch size of 32 and consistent seeding for reproducibility. This creates structured tf.data.Dataset objects ready for preprocessing and model training.

In [None]:
from tensorflow.keras.utils import image_dataset_from_directory
train_ds_raw=image_dataset_from_directory('/content/drive/MyDrive/Colab Notebooks/chest_xray/train', labels="inferred", shuffle=True, batch_size = 32, label_mode='int', seed=123, image_size=IMAGE_SIZE, color_mode='rgb')
test_ds_raw= image_dataset_from_directory('/content/drive/MyDrive/Colab Notebooks/chest_xray/test',labels="inferred", shuffle=True, batch_size = 32, label_mode='int', seed=123, image_size=IMAGE_SIZE,color_mode='rgb')
val_ds_raw= image_dataset_from_directory('/content/drive/MyDrive/Colab Notebooks/chest_xray/val',labels="inferred", shuffle=True, batch_size = 32, label_mode='int', seed=123, image_size=IMAGE_SIZE, color_mode='rgb')


Found 5216 files belonging to 2 classes.
Found 624 files belonging to 2 classes.
Found 16 files belonging to 2 classes.


After initially loading the data with image_dataset_from_directory, I noticed that the validation folder contained only 16 images, which is too few for reliable evaluation. To address this, I merged the training and validation image file paths using tf.io.gfile.glob, then performed a custom 80/20 split with train_test_split to obtain a larger validation set. The split is random and unseeded (no random_state is passed), so the exact per-class counts can vary between runs. This gives a more stable validation signal during model training.

In [None]:
filenames = tf.io.gfile.glob(str(GCS_PATH + '/train/*/*'))
filenames.extend(tf.io.gfile.glob(str(GCS_PATH + '/val/*/*')))

training_filenames, validation_filenames = train_test_split(filenames, test_size = 0.2)
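The split above is random over the merged list, so the class ratio in each part is only approximately preserved. As a point of comparison, here is a minimal plain-Python sketch of a seeded, stratified 80/20 split. It is not what the notebook does: the helper names (`stratified_split`, `class_of`), the seed, and the file names are all hypothetical, chosen for illustration. scikit-learn's `train_test_split` offers the same idea through its `stratify` and `random_state` arguments.

```python
import random
from collections import defaultdict

def class_of(path):
    """Class name is taken to be the parent directory of the file."""
    return path.split("/")[-2]

def stratified_split(paths, test_fraction=0.2, seed=42):
    """Split paths so that each class contributes the same fraction to validation."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for p in paths:
        by_class[class_of(p)].append(p)

    train, val = [], []
    for _, group in sorted(by_class.items()):
        group = sorted(group)
        rng.shuffle(group)
        n_val = round(len(group) * test_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Hypothetical file names, for illustration only.
paths = [f"train/NORMAL/n{i}.jpeg" for i in range(25)]
paths += [f"train/PNEUMONIA/p{i}.jpeg" for i in range(75)]

train, val = stratified_split(paths)
print(len(train), len(val))
print(sum(class_of(p) == "NORMAL" for p in val), "NORMAL images in validation")
```

With stratification, both parts keep the 1:3 class ratio of the toy input, which keeps validation metrics comparable across runs.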

This block counts the number of Normal and Pneumonia images in the newly created training set by checking for the respective class names in the file paths. It uses list comprehensions to filter filenames that contain either “NORMAL” or “PNEUMONIA” and prints the count for each class. This helps verify the class distribution after the custom split and ensures that any imbalance is identified early in the process.

In [None]:
COUNT_NORMAL = len([filename for filename in training_filenames if "NORMAL" in filename])
print("Normal images count in training set: " + str(COUNT_NORMAL))

COUNT_PNEUMONIA = len([filename for filename in training_filenames if "PNEUMONIA" in filename])
print("Pneumonia images count in training set: " + str(COUNT_PNEUMONIA))


Normal images count in training set: 1057
Pneumonia images count in training set: 3131


This block defines the image preprocessing pipeline. The decode_img function reads an image file, decodes it as a JPEG, converts it to a float tensor, and resizes it to the defined IMAGE_SIZE. The try/except around the decode step is meant to return a zero-filled dummy image if decoding fails (e.g., due to corruption); because Dataset.map traces the function into a graph, a Python except clause does not catch decode errors that only surface when the dataset is iterated, so this fallback should not be relied on. The get_label function extracts the class label from the file path by checking whether the parent folder is “PNEUMONIA” and returns a binary label (1 for Pneumonia, 0 for Normal). The comparison string must match the folder name exactly: a misspelling such as 'PNUEMONIA' never matches, so every image would be labeled 0. The process_path function combines these steps to return a tuple of image and label, making it suitable for use in a tf.data pipeline.

In [None]:
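def decode_img(img_path):
    img = tf.io.read_file(img_path)
    try:
        img = tf.image.decode_jpeg(img, channels=3)
    except Exception:
        # Intended fallback: a zero-filled dummy image if decoding fails.
        # Caveat: inside Dataset.map this function is traced into a graph, so
        # this handler only catches errors raised while tracing, not decode
        # errors that occur later when the dataset is iterated.
        return tf.zeros([*IMAGE_SIZE, 3], dtype=tf.float32)
    # Scale pixel values to floats in [0, 1], then resize to the target shape.
    img = tf.image.convert_image_dtype(img, tf.float32)
    return tf.image.resize(img, IMAGE_SIZE)

def get_label(file_path):
    # The class name is the parent directory of the image file.
    parts = tf.strings.split(file_path, '/')
    label_str = parts[-2]
    # 1 for Pneumonia, 0 for Normal.
    return tf.cast(label_str == 'PNEUMONIA', tf.int32)

def process_path(file_path):
    label = get_label(file_path)
    img = decode_img(file_path)
    return img, label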
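Because the label comes from a string comparison, it is worth checking on a couple of paths before training. The following is a plain-Python analogue of get_label, offered as a sketch rather than the notebook's code: `label_from_path` and the example paths are hypothetical. It shows that a misspelled class name raises no error and simply labels everything 0.

```python
def label_from_path(file_path, positive_class="PNEUMONIA"):
    """Return 1 if the parent directory equals positive_class, else 0."""
    parts = file_path.split("/")
    return int(parts[-2] == positive_class)

# Hypothetical paths, for illustration only.
paths = [
    "chest_xray/train/NORMAL/example_a.jpeg",
    "chest_xray/train/PNEUMONIA/example_b.jpeg",
]

correct = [label_from_path(p) for p in paths]
misspelled = [label_from_path(p, positive_class="PNUEMONIA") for p in paths]

print(correct)     # [0, 1]
print(misspelled)  # [0, 0] -> every image would be treated as Normal
```

A quick check like this, or counting the labels produced by the real pipeline, catches the problem before any training time is spent.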


This section creates TensorFlow datasets for training and validation using the previously prepared file paths. Each dataset is built by slicing the filenames into a tf.data.Dataset, then mapping each file path through the process_path function to load and preprocess the image-label pairs. The datasets are then batched according to the computed BATCH_SIZE and prefetched using AUTOTUNE to optimize performance by overlapping data loading with model execution. This setup ensures efficient, scalable input pipelines for training deep learning models.

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices(training_filenames).map(process_path, num_parallel_calls = AUTOTUNE)
val_ds = tf.data.Dataset.from_tensor_slices(validation_filenames).map(process_path, num_parallel_calls = AUTOTUNE)

train_ds = train_ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
val_ds = val_ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)

This block prints the number of batches in the training and validation datasets and then attempts to count the individual images. It uses tf.data.experimental.cardinality() to determine the number of batches (262 and 66). The second call applies unbatch() first, but the cardinality of an unbatched dataset is not statically known, so TensorFlow reports -2 (tf.data.experimental.UNKNOWN_CARDINALITY) for both sets rather than an image count. The image counts are available directly as len(training_filenames) and len(validation_filenames).

In [None]:
TRAIN_IMAGE_BATCHES = tf.data.experimental.cardinality(train_ds).numpy()
VALIDATION_IMAGES_BATCHES = tf.data.experimental.cardinality(val_ds).numpy()

print(TRAIN_IMAGE_BATCHES, VALIDATION_IMAGES_BATCHES)

TRAIN_IMAGES = tf.data.experimental.cardinality(train_ds.unbatch()).numpy()
VALIDATION_IMAGES = tf.data.experimental.cardinality(val_ds.unbatch()).numpy()

print(TRAIN_IMAGES, VALIDATION_IMAGES)

262 66
-2 -2
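The batch counts can be cross-checked with simple arithmetic: without drop_remainder, a dataset of n examples yields ceil(n / batch_size) batches. The sketch below uses only numbers that appear in this notebook's outputs; the helper name `num_batches` is hypothetical, and the training count is taken as the sum of the two printed class counts.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(num_examples / batch_size)

# Training split: 1057 Normal + 3131 Pneumonia images, BATCH_SIZE = 16 (one replica).
train_images = 1057 + 3131
print(num_batches(train_images, 16))  # 262

# Raw directory datasets, batched at 32 by image_dataset_from_directory.
print(num_batches(5216, 32))  # 163
print(num_batches(624, 32))   # 20
```

The first value matches the 262 training batches printed above; the other two match the step counts shown in the training and evaluation progress bars.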


This block calculates class weights to address class imbalance between Normal and Pneumonia images in the training set. It computes the total number of training samples, then derives weights inversely proportional to class frequencies using a standard formula. These weights are stored in the class_weight dictionary, which will later be passed to the model during training to give more importance to the minority class and help the model learn balanced representations. The computed weights are also printed for reference.

In [None]:
total = COUNT_NORMAL + COUNT_PNEUMONIA
weight_for_0 = total / (2 * COUNT_NORMAL)
weight_for_1 = total / (2 * COUNT_PNEUMONIA)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))


Weight for class 0: 1.98
Weight for class 1: 0.67
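The formula used here is weight_c = total / (n_classes * count_c). A small standalone sketch, with a hypothetical helper name `balanced_class_weights`, reproduces the printed values from the class counts and shows the property that motivates the formula: after weighting, each class contributes the same total weight to the loss.

```python
def balanced_class_weights(counts):
    """Weights inversely proportional to class frequency."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Class counts printed earlier: 0 = Normal, 1 = Pneumonia.
counts = {0: 1057, 1: 3131}
weights = balanced_class_weights(counts)

print({label: round(w, 2) for label, w in weights.items()})  # {0: 1.98, 1: 0.67}

# Total weight per class is equal (half of the 4188 samples each).
print(round(weights[0] * counts[0]), round(weights[1] * counts[1]))  # 2094 2094
```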


This block defines the CNN model architecture for binary classification of chest X-ray images. It uses a Sequential model with an initial rescaling layer followed by data augmentation layers like RandomZoom, RandomShear, and RandomFlip to improve generalization. The model includes two convolutional blocks with ReLU activations, batch normalization, dropout for regularization, and max pooling for downsampling. After flattening, the output passes through three dense layers with dropout and batch normalization before reaching the final output layer with a sigmoid activation for binary classification. The model is compiled using the Adam optimizer, binary cross-entropy loss, and tracks both accuracy and AUC as performance metrics. An EarlyStopping callback is set to prevent overfitting by restoring the best weights when validation loss stops improving.
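To get a feel for the size of this architecture, the feature-map shapes can be traced by hand. The sketch below assumes the Keras defaults that the code relies on (Conv2D with 'valid' padding and stride 1, MaxPooling2D with a 2x2 window and stride 2); the helper names are hypothetical. It is meant as a cross-check against model.summary(), not a replacement for it.

```python
def conv_valid(size, kernel=3, stride=1):
    """Output size of a 'valid' convolution along one spatial dimension."""
    return (size - kernel) // stride + 1

def pool_valid(size, pool=2):
    """Output size of a 'valid' max pool whose stride equals the pool size."""
    return (size - pool) // pool + 1

size = 160                  # input images are 160 x 160
size = conv_valid(size)     # Conv2D(64, 3)  -> 158
size = pool_valid(size)     # MaxPooling2D() -> 79
size = conv_valid(size)     # Conv2D(32, 3)  -> 77
size = pool_valid(size)     # MaxPooling2D() -> 38

flat_features = size * size * 32              # Flatten after the 32-filter block
dense_512_params = flat_features * 512 + 512  # weights + biases of Dense(512)

print(size, flat_features, dense_512_params)  # 38 46208 23659008
```

Under these assumptions almost all trainable parameters sit in the first dense layer, which is worth keeping in mind when interpreting the gap between training and validation metrics.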

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import RandomShear, Rescaling, RandomZoom, Dense, Conv2D, MaxPooling2D, RandomFlip, Dropout, BatchNormalization, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping


model = Sequential()
model.add(Rescaling(1./255, input_shape=(160,160,3)))
model.add(RandomZoom(0.2))
model.add(RandomShear())
model.add(RandomFlip())

model.add(Conv2D(filters = 64, kernel_size=3, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(MaxPooling2D())

model.add(Conv2D(filters = 32, kernel_size=3, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(MaxPooling2D())



model.add(Flatten())

model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))


model.add(Dense(1, activation='sigmoid'))

ES = EarlyStopping(monitor = "val_loss", patience = 5, restore_best_weights = True)

model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy', tf.keras.metrics.AUC()])

model.summary()




This line trains the model on the raw training dataset (train_ds_raw) for up to 25 epochs, using the raw test dataset (test_ds_raw) as validation data. Both datasets were already batched at 32 by image_dataset_from_directory, which is why each epoch runs 163 steps (5,216 / 32); the batch_size = 16 argument does not re-batch them. The previously defined class_weight dictionary gives higher importance to the minority class (Normal, class 0, weight 1.98) to address class imbalance. The EarlyStopping callback monitors validation loss and stops training if it doesn’t improve for 5 consecutive epochs, restoring the best model weights. Training metrics are stored in the history object for later visualization. Note that the train_ds and val_ds pipelines built from the custom 80/20 split are not used in this call.
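The patience rule can be illustrated without training anything. The function below is a simplified plain-Python model of early stopping on a monitored loss (any decrease counts as an improvement); it is a sketch of the idea, not Keras's implementation, and `early_stopping_trace` is a hypothetical name. It is applied to the five validation losses visible in the training log and to a made-up sequence.

```python
def early_stopping_trace(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training is never stopped.

    Epochs are numbered from 1. Training stops once `patience` consecutive
    epochs have passed without a new best validation loss.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss values of epochs 1-5 from the training log.
logged = [11.9656, 24.2266, 6.7582, 0.5507, 1.0173]
print(early_stopping_trace(logged, patience=5))  # (None, 4)

# Made-up losses, for illustration only: best at epoch 2, then five worse epochs.
toy = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99]
print(early_stopping_trace(toy, patience=5))     # (7, 2)
```

With restore_best_weights = True, the weights kept at the end are those of the best epoch rather than of the epoch at which training stopped.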

In [None]:
history = model.fit(train_ds_raw, validation_data = test_ds_raw, batch_size = 16, epochs=25, callbacks = [ES], class_weight=class_weight)

Epoch 1/25
163/163 ━━━━━━━━━━━━━━━━━━━━ 274s 2s/step - accuracy: 0.7906 - auc: 0.8998 - loss: 0.4163 - val_accuracy: 0.6250 - val_auc: 0.5000 - val_loss: 11.9656
Epoch 2/25
163/163 ━━━━━━━━━━━━━━━━━━━━ 47s 290ms/step - accuracy: 0.9063 - auc: 0.9659 - loss: 0.2449 - val_accuracy: 0.6250 - val_auc: 0.5000 - val_loss: 24.2266
Epoch 3/25
163/163 ━━━━━━━━━━━━━━━━━━━━ 82s 293ms/step - accuracy: 0.9203 - auc: 0.9727 - loss: 0.2152 - val_accuracy: 0.6234 - val_auc: 0.5283 - val_loss: 6.7582
Epoch 4/25
163/163 ━━━━━━━━━━━━━━━━━━━━ 81s 289ms/step - accuracy: 0.9326 - auc: 0.9793 - loss: 0.1840 - val_accuracy: 0.7676 - val_auc: 0.8540 - val_loss: 0.5507
Epoch 5/25
163/163 ━━━━━━━━━━━━━━━━━━━━ 82s 287ms/step - accuracy: 0.9443 - auc: 0.9853 - loss: 0.1520 - val_accuracy: 0.7163 - val_auc: 0.8798 - val_loss: 1.0173
Epoch 6/25

This block evaluates the trained model on the test dataset using the evaluate() method, which returns the loss followed by the compiled metrics: binary cross-entropy loss, accuracy, and AUC, in that order. The results are printed as a final assessment of the model’s performance. Because the same test dataset was passed as validation_data during training (and therefore drove early stopping), these numbers are not a fully independent estimate of how well the model generalizes.

In [None]:
results = model.evaluate(test_ds_raw)
print(results)

20/20 ━━━━━━━━━━━━━━━━━━━━ 4s 202ms/step - accuracy: 0.8395 - auc: 0.9229 - loss: 0.3853
[0.4344245195388794, 0.8269230723381042, 0.9048486948013306]
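The printed list is easier to read with names attached. The sketch below assumes the order is loss first, then the metrics in the order they were passed to compile() (accuracy, then AUC); the metric names are supplied by hand rather than read from the model. It also converts the accuracy into a count over the 624 test images.

```python
# Values copied from the evaluate() output above.
results = [0.4344245195388794, 0.8269230723381042, 0.9048486948013306]

# Assumed order: loss, then metrics as listed in model.compile().
names = ["loss", "accuracy", "auc"]
report = dict(zip(names, results))

for name, value in report.items():
    print(f"{name}: {value:.4f}")

# Accuracy expressed as a count over the 624 test images.
test_images = 624
correct = round(report["accuracy"] * test_images)
print(correct, "correct,", test_images - correct, "misclassified")
```

Note that the progress bar shows running averages over batches, while the returned list holds the final values over the whole dataset, which is why the two differ slightly.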
