# Data Exploration

In this notebook, we explore the chest X-ray dataset used for early disease detection. We look at sample images and examine how the four classes (COVID-19, Normal, Pneumonia, and Tuberculosis) are distributed.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import numpy as np

# Set the path to the dataset
DATASET_PATH = '../data/chest_xray/'  # Update this path as necessary

# Load the dataset information
train_df = pd.read_csv(os.path.join(DATASET_PATH, 'train.csv'))
val_df = pd.read_csv(os.path.join(DATASET_PATH, 'val.csv'))
test_df = pd.read_csv(os.path.join(DATASET_PATH, 'test.csv'))

# Display the first few rows of the training dataset
train_df.head()
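
Before plotting anything, it can help to sanity-check the split files. The sketch below is not part of the original notebook: `summarize_split` and `overlapping_paths` are hypothetical helpers, and the toy DataFrames stand in for `train_df` and `val_df`. It assumes only the `image_path` and `label` columns that the later cells use.

```python
import pandas as pd


def summarize_split(df: pd.DataFrame, name: str) -> dict:
    """Return basic sanity-check statistics for one split (hypothetical helper)."""
    return {
        "split": name,
        "rows": len(df),
        # Rows with no label cannot be used for supervised training
        "missing_labels": int(df["label"].isna().sum()),
        # The same file listed twice within one split
        "duplicate_paths": int(df["image_path"].duplicated().sum()),
        "classes": sorted(df["label"].dropna().unique()),
    }


def overlapping_paths(a: pd.DataFrame, b: pd.DataFrame) -> set:
    """Image paths that appear in both splits (a possible sign of leakage)."""
    return set(a["image_path"]) & set(b["image_path"])


# Toy data standing in for train_df and val_df (assumed column names)
toy_train = pd.DataFrame({
    "image_path": ["a.png", "b.png", "b.png"],
    "label": ["Normal", "Pneumonia", None],
})
toy_val = pd.DataFrame({
    "image_path": ["b.png", "c.png"],
    "label": ["Pneumonia", "COVID-19"],
})

summary = summarize_split(toy_train, "train")
leak = overlapping_paths(toy_train, toy_val)
print(summary)
print("Paths shared between train and val:", leak)
```

On the real data, the same helpers could be called with `train_df`, `val_df`, and `test_df`. A non-empty overlap between splits would be worth investigating before training.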

In [2]:
# Visualize the distribution of classes in the training dataset
plt.figure(figsize=(10, 6))
sns.countplot(x='label', data=train_df)
plt.title('Distribution of Classes in Training Dataset')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()
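
The count plot covers the training split only. As a rough complement, the class shares of several splits can be compared in a single table. This is a minimal sketch: `class_proportions` is a hypothetical helper, and the toy DataFrames stand in for the real splits, assuming the `label` column used above.

```python
import pandas as pd


def class_proportions(splits: dict) -> pd.DataFrame:
    """Rows are class labels, columns are split names, values are each class's share of that split."""
    columns = {
        name: df["label"].value_counts(normalize=True)
        for name, df in splits.items()
    }
    # Classes absent from a split show up as NaN after alignment; report them as 0
    return pd.DataFrame(columns).fillna(0.0).sort_index()


# Toy data standing in for train_df and val_df
toy_splits = {
    "train": pd.DataFrame({"label": ["Normal", "Normal", "Pneumonia", "COVID-19"]}),
    "val": pd.DataFrame({"label": ["Normal", "Pneumonia"]}),
}

proportions = class_proportions(toy_splits)
print(proportions)
```

With the real data, the call might look like `class_proportions({"train": train_df, "val": val_df, "test": test_df})`. Large differences between columns would suggest that the splits are not stratified by class.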

In [3]:
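# Function to display a sample of images from the dataset
def display_sample_images(df, num_images=5):
    # Never request more images than the DataFrame contains
    num_images = min(num_images, len(df))
    plt.figure(figsize=(15, 10))
    for i in range(num_images):
        row = df.iloc[i]
        img_path = os.path.join(DATASET_PATH, row['image_path'])
        plt.subplot(1, num_images, i + 1)
        # Use a context manager so the file handle is closed after drawing
        with Image.open(img_path) as img:
            # X-rays are grayscale; without cmap='gray', single-channel
            # images are rendered with matplotlib's default color map
            plt.imshow(img, cmap='gray')
        plt.title(row['label'])
        plt.axis('off')
    plt.show()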

# Display sample images from the training dataset
display_sample_images(train_df, num_images=5)
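
The call above shows the first rows of the DataFrame, which may all belong to one class if the CSV is sorted by label. One possible way to get a more representative preview is to draw a few rows per class first. The sketch below uses a hypothetical `sample_per_class` helper and toy data, and assumes the same `image_path` and `label` columns.

```python
import pandas as pd


def sample_per_class(df: pd.DataFrame, n_per_class: int = 1, seed: int = 0) -> pd.DataFrame:
    """Return up to n_per_class randomly chosen rows for every label."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows reproducibly
    # head() keeps at most n rows per group, so small classes do not raise an error
    picked = shuffled.groupby("label").head(n_per_class)
    return picked.sort_values("label").reset_index(drop=True)


# Toy data standing in for train_df
toy_df = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(6)],
    "label": ["Normal", "Normal", "Normal", "Pneumonia", "Pneumonia", "Tuberculosis"],
})

one_each = sample_per_class(toy_df, n_per_class=1)
two_each = sample_per_class(toy_df, n_per_class=2)
print(one_each)
```

The result could then be passed to the existing function, for example `display_sample_images(sample_per_class(train_df), num_images=4)`, to show one image for each of the four classes.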

## Summary

In this notebook, we explored the chest X-ray dataset by visualizing the distribution of classes and displaying sample images. This exploration helps us understand the dataset better and prepares us for the model training phase.