<a href="https://colab.research.google.com/github/Lalith-Lavu/Pneumonia-Detection-from-Chest-X-Rays/blob/main/Pneumonia_Detection_from_Chest_X_Rays.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# 1. Dataset: download the public covid-chestxray-dataset archive (about 525 MB) from GitHub
# and unpack it into covid-chestxray-dataset-master/
!wget https://github.com/ieee8023/covid-chestxray-dataset/archive/refs/heads/master.zip
!unzip -q master.zip

# 2. Build the model on a VGG16 base pretrained on ImageNet
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # Freeze the convolutional base; only the new head can be trained

model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid') # Binary: Pneumonia or Normal
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print("Model compiled: frozen VGG16 base with an untrained binary classification head.")

--2026-02-23 16:43:31--  https://github.com/ieee8023/covid-chestxray-dataset/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/ieee8023/covid-chestxray-dataset/zip/refs/heads/master [following]
--2026-02-23 16:43:31--  https://codeload.github.com/ieee8023/covid-chestxray-dataset/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.113.9
Connecting to codeload.github.com (codeload.github.com)|140.82.113.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 550535079 (525M) [application/zip]
Saving to: ‘master.zip’


2026-02-23 16:43:44 (40.9 MB/s) - ‘master.zip’ saved [550535079/550535079]

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
58889256/58889256

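As a rough check on what freezing the base leaves to train, the head's parameter count can be worked out by hand. This sketch assumes VGG16's last convolutional block outputs 512 channels, so `GlobalAveragePooling2D` produces a 512-element vector; `dense_params` is a helper name made up for illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

pooled_features = 512  # assumption: channel count of VGG16's last conv block
hidden = dense_params(pooled_features, 128)  # Dense(128, activation='relu')
output = dense_params(128, 1)                # Dense(1, activation='sigmoid')
trainable_head = hidden + output  # pooling and dropout add no parameters

print(hidden, output, trainable_head)  # 65664 129 65793
```

Under that assumption, roughly 66 thousand parameters would be trainable, a small fraction of the frozen base.
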
In [5]:
import pandas as pd
import numpy as np
import os

# Load the metadata
metadata_path = 'covid-chestxray-dataset-master/metadata.csv'
metadata_df = pd.read_csv(metadata_path)

# Keep only rows labelled 'Pneumonia' or 'No Finding' (treated as Normal).
# .copy() gives an independent DataFrame, so adding columns below does not
# trigger pandas' SettingWithCopyWarning.
filtered_df = metadata_df[metadata_df['finding'].isin(['Pneumonia', 'No Finding'])].copy()

# Define image directory
image_dir = 'covid-chestxray-dataset-master/images'

# Create full image paths from the 'filename' column
filtered_df['filepath'] = filtered_df['filename'].apply(lambda x: os.path.join(image_dir, x))

# Filter out rows where the image file does not exist (optional but good practice)
filtered_df = filtered_df[filtered_df['filepath'].apply(os.path.exists)]

# Map 'No Finding' to 'Normal' for clarity in labels
filtered_df['class'] = filtered_df['finding'].apply(lambda x: 'Normal' if x == 'No Finding' else 'Pneumonia')

# For demonstration, draw a small random sample to stand in for a test set.
# In a real project the test set would be held out before any training.
# Take up to test_samples_per_class rows from each class, so the sample is
# balanced whenever both classes have enough images.
test_samples_per_class = 20 # Adjust as needed

# Sampling each group explicitly keeps the 'class' column and avoids
# groupby.apply's deprecated handling of grouping columns.
sampled_groups = [
    group.sample(n=min(test_samples_per_class, len(group)), random_state=42)
    for _, group in filtered_df.groupby('class')
]
test_df = pd.concat(sampled_groups, ignore_index=True) if sampled_groups else filtered_df.iloc[0:0]

if test_df.empty:
    print("Warning: No suitable images found for testing after filtering and sampling.")
    print("Please check the dataset content and paths.")
else:
    print(f"Created a test DataFrame with {len(test_df)} samples.")
    print(test_df['class'].value_counts())

# Display the first few rows of the test DataFrame
display(test_df.head())

Created a test DataFrame with 40 samples.
class
Normal       20
Pneumonia    20
Name: count, dtype: int64




Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,...,folder,filename,doi,url,license,clinical_notes,other_notes,Unnamed: 29,filepath,class
0,38,0.0,F,61.0,No Finding,Unclear,Y,N,N,,...,images,F051E018-DAD1-4506-AD43-BE4CA29E960B.jpeg,,https://www.sirm.org/2020/03/08/covid-19-caso-13/,,"Female, 61 years old, smoker. In November 2019...",Credit to UOC Radiology ASST Bergamo Est Direc...,,covid-chestxray-dataset-master/images/F051E018...,Normal
1,316,7.0,M,54.0,No Finding,Y,Y,Y,N,Y,...,images,1-s2.0-S2213716520301168-gr1_lrg-c.png,10.1016/j.jgar.2020.04.024,https://www.sciencedirect.com/science/article/...,CC BY-NC-ND 4.0,A 54-year old man presented to the emergency d...,,,covid-chestxray-dataset-master/images/1-s2.0-S...,Normal
2,218,9.0,F,39.0,No Finding,Y,,,,,...,images,16631_1_2.jpg,,https://www.eurorad.org/case/16631,CC BY-NC-SA 4.0,"A female patient, 39-years-old, fever (38.1℃) ...",,,covid-chestxray-dataset-master/images/16631_1_...,Normal
3,38,0.0,F,61.0,No Finding,Unclear,Y,N,N,,...,images,5083A6B7-8983-472E-A427-570A3E03DDEE.jpeg,,https://www.sirm.org/2020/03/08/covid-19-caso-13/,,"Female, 61 years old, smoker. In November 2019...",Credit to UOC Radiology ASST Bergamo Est Direc...,,covid-chestxray-dataset-master/images/5083A6B7...,Normal
4,318,2.0,M,45.0,No Finding,Y,,Y,Y,Y,...,images,1-s2.0-S1341321X20301124-gr3_lrg-b.png,10.1016/j.jiac.2020.03.018,https://www.sciencedirect.com/science/article/...,CC BY-NC-ND 4.0,Initial and serial laboratory results are show...,12 hours after initiating VV-ECMO.,,covid-chestxray-dataset-master/images/1-s2.0-S...,Normal
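
The per-class sampling step can also be sketched without pandas, which makes the `min(n, len(group))` cap easier to see. The toy records and the helper name `sample_per_class` are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def sample_per_class(records, label_key, n_per_class, seed=42):
    """Draw up to n_per_class records from each class, without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    sampled = []
    for label in sorted(by_class):  # sorted for a deterministic class order
        group = by_class[label]
        sampled.extend(rng.sample(group, min(n_per_class, len(group))))
    return sampled

# Hypothetical toy data: 5 "Normal" rows and 2 "Pneumonia" rows.
records = [{"filename": f"n{i}.png", "class": "Normal"} for i in range(5)]
records += [{"filename": f"p{i}.png", "class": "Pneumonia"} for i in range(2)]

subset = sample_per_class(records, "class", n_per_class=3)
counts = {}
for rec in subset:
    counts[rec["class"]] = counts.get(rec["class"], 0) + 1
print(counts)  # {'Normal': 3, 'Pneumonia': 2}
```

When a class has fewer rows than requested, the cap returns all of them, so the sample is only balanced if every class has at least `n_per_class` rows, as was the case for the 20/20 split above.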


In [6]:
# Create an ImageDataGenerator for the test set
test_datagen = ImageDataGenerator(rescale=1./255)

# Prepare the test generator from the DataFrame
# We need to specify image size and batch size
img_height, img_width = 224, 224
batch_size = 32

if not test_df.empty:
    test_generator = test_datagen.flow_from_dataframe(
        dataframe=test_df,
        x_col='filepath',
        y_col='class',
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='binary',
        shuffle=False # Keep file order so predictions line up with the labels
    )
    print("Test data generator created.")
else:
    print("Cannot create test generator as test_df is empty.")
    test_generator = None


Found 40 validated image filenames belonging to 2 classes.
Test data generator created.
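
One thing worth checking before any training: `rescale=1./255` maps pixels to the range [0, 1], whereas the Keras documentation for `tf.keras.applications.vgg16.preprocess_input` describes converting RGB to BGR and zero-centering each channel against ImageNet means, without scaling. The sketch below mimics both transforms on a single pixel in plain Python so the input ranges can be compared. It is an illustration, not the Keras implementation; `rescale_pixel` and `caffe_style_pixel` are hypothetical helpers.

```python
# Per-channel ImageNet means in BGR order, as used by Keras "caffe"-mode
# preprocessing (quoted from memory of the Keras source; verify before relying on it).
IMAGENET_BGR_MEANS = (103.939, 116.779, 123.68)

def rescale_pixel(rgb):
    """What ImageDataGenerator(rescale=1./255) does to one pixel."""
    return tuple(c / 255.0 for c in rgb)

def caffe_style_pixel(rgb):
    """RGB -> BGR, then subtract the per-channel mean (no scaling)."""
    r, g, b = rgb
    bgr = (b, g, r)
    return tuple(c - m for c, m in zip(bgr, IMAGENET_BGR_MEANS))

mid_gray = (128, 128, 128)
print(rescale_pixel(mid_gray))      # values in [0, 1]
print(caffe_style_pixel(mid_gray))  # values roughly in [-124, 152]
```

If the mismatch matters for the frozen features, one option is to pass `preprocessing_function=tf.keras.applications.vgg16.preprocess_input` to `ImageDataGenerator` in place of `rescale`.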


### Model Evaluation

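Now, let's evaluate the model on the prepared test data. The VGG16 base carries ImageNet weights, but the classification head is randomly initialized and nothing has been trained on chest X-rays, so accuracy on this balanced two-class sample should sit near chance (about 50%).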
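
To make "near chance" concrete, here is a minimal sketch of the baseline an uninformative model would produce on a balanced 20/20 sample. The constant 0.5 output is an assumption for illustration, not what the untrained head actually emits, and `binary_crossentropy` is a hand-written helper rather than the Keras loss.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical balanced test set: 20 Normal (0) and 20 Pneumonia (1).
labels = [0] * 20 + [1] * 20
# An uninformative model that outputs 0.5 for every image.
probs = [0.5] * len(labels)

chance_loss = binary_crossentropy(labels, probs)
# Predicting a single class for everything scores 50% on a balanced set.
constant_accuracy = sum(1 for y in labels if y == 1) / len(labels)

print(round(chance_loss, 4), constant_accuracy)  # 0.6931 0.5
```

A loss close to ln 2 (about 0.693) together with accuracy around 0.5 is therefore what "has learned nothing yet" looks like for this setup.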

In [7]:
if test_generator is not None:
    print("Evaluating the untrained model...")
    loss, accuracy = model.evaluate(test_generator)
    print(f"Test Loss: {loss:.4f}")
    print(f"Test Accuracy: {accuracy:.4f}")
else:
    print("Model evaluation skipped because test data generator could not be created.")


Evaluating the untrained model...
2/2 ━━━━━━━━━━━━━━━━━━━━ 22s 4s/step - accuracy: 0.4583 - loss: 0.7517
Test Loss: 0.7285
Test Accuracy: 0.5000
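
Accuracy alone does not show which class the errors fall on, which matters for a screening task. Because the generator was built with `shuffle=False`, predictions can be lined up with labels in file order. The sketch below thresholds sigmoid outputs at 0.5 and counts the four outcomes. The probabilities are made up; in the notebook the inputs would presumably come from `model.predict(test_generator)` and `test_generator.classes`, assuming Pneumonia maps to 1 (worth confirming via `test_generator.class_indices`).

```python
def confusion_counts(y_true, p_pred, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels (1 = positive class)."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up example: 1 = Pneumonia, 0 = Normal (an assumption about label order).
y_true = [1, 1, 1, 0, 0, 0]
p_pred = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, p_pred)
sensitivity = tp / (tp + fn)  # share of Pneumonia cases caught
specificity = tn / (tn + fp)  # share of Normal cases correctly cleared
print(tp, fp, tn, fn)  # 2 1 2 1
```

Sensitivity and specificity computed this way separate missed pneumonia cases from false alarms, which a single accuracy figure cannot do.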
