# Note: Steps for the Chest X-Ray Classification Project

This project aims to develop a Convolutional Neural Network (CNN) for binary classification of chest X-ray images (normal vs pneumonia). The following are the key steps to complete this project:

---

## Step 1: Setup and Environment
1. Import necessary libraries (TensorFlow, Keras, etc.).
2. Verify the environment setup in JupyterLab.
3. Set up paths for the dataset.

---

## Step 2: Dataset Preprocessing
1. Load images from dataset directories (`train`, `val`, `test`) and their corresponding labels.
2. Resize images to a consistent size for the CNN input.
3. Normalize pixel values for faster convergence.
4. Apply data augmentation to the training set (e.g., rotations, flips, zooms).

---

## Step 3: Building the CNN
1. Design a CNN model with convolutional and pooling layers.
2. Use dropout and/or regularization to reduce overfitting.
3. Compile the model with appropriate loss, optimizer, and metrics (accuracy, sensitivity, specificity).
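
To make the layer design concrete, the sketch below traces how an input shrinks through a stack of 3x3 convolutions (no padding) and 2x2 max-pooling layers. It uses the 150x150 input size from the preprocessing code in Step 2. The three-block layout and the filter counts (32, 64, 128) are assumptions chosen for illustration, not the project's final architecture.

```python
def conv_out(size, kernel=3, stride=1):
    """Output size of a 'valid' (unpadded) convolution along one dimension."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output size of a max-pooling layer whose stride equals the pool size."""
    return size // pool


def trace_shapes(input_size=150, filters=(32, 64, 128)):
    """Return the (height, width, channels) shape after each conv + pool block."""
    shapes = []
    size = input_size
    for n_filters in filters:  # filter counts are illustrative assumptions
        size = pool_out(conv_out(size))
        shapes.append((size, size, n_filters))
    return shapes


shapes = trace_shapes()
for block, shape in enumerate(shapes, start=1):
    print(f"After block {block}: {shape}")

# Number of features handed to the dense layers after flattening
h, w, c = shapes[-1]
flattened = h * w * c
print("Flattened features:", flattened)
```

Tracing the shapes this way shows how quickly the flattened feature count grows, which is one reason dropout or regularization is usually applied before the dense layers.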

---

## Step 4: Training the Model
1. Use data generators for training and validation sets.
2. Train the model for a sufficient number of epochs.
3. Use early stopping to limit overfitting and checkpoints to keep the best-performing weights.
4. Monitor performance metrics during training.
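
Early stopping can be sketched without any framework: training halts once the validation loss has failed to improve for a set number of epochs (the patience). The function name, loss values, and patience below are made up for illustration; in Keras, comparable behaviour is available through the `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs, or at the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 4
losses = [0.60, 0.45, 0.40, 0.38, 0.39, 0.41, 0.40, 0.42]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"Stop after epoch {stop_epoch}; best weights are from epoch {best_epoch}")
```

Tracking `best_epoch` mirrors what a checkpoint does: the weights worth keeping are those from the best epoch, not the last one.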

---

## Step 5: Evaluation
1. Evaluate the model on the test set.
2. Plot accuracy and loss curves for training and validation.
3. Generate a classification report (precision, recall, F1-score).
4. Plot the confusion matrix.
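
Sensitivity and specificity, the project's target metrics, follow directly from the confusion matrix counts. A minimal sketch, assuming pneumonia is encoded as label 1 and normal as label 0 (an assumption; the actual mapping should be checked against the data generator), using made-up labels and a hypothetical helper name:

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Compute (sensitivity, specificity) for binary labels.

    Assumes both classes appear in y_true, otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # true positive rate (recall of the positive class)
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Made-up labels for illustration: 1 = pneumonia, 0 = normal (assumed encoding)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"Sensitivity: {sens:.2f}, Specificity: {spec:.2f}")
```

Sensitivity is the same quantity the classification report lists as recall for the positive class; specificity is the recall of the negative class.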

---

## Step 6: Fine-Tuning
1. Adjust model hyperparameters (e.g., learning rate, number of layers).
2. Apply transfer learning with a pre-trained model if needed.
3. Re-train the model to achieve the target metrics (90% sensitivity and 90% specificity).

---

## Step 7: Final Analysis and Documentation
1. Summarize the model’s performance.
2. Compare results against the target metrics.
3. Document the process and findings.


### Step 1: Setup and Environment

In this step, we will:
1. Import the required libraries for data preprocessing, model creation, and evaluation.
2. Verify the environment to ensure the required tools are installed.
3. Set up paths for the dataset.

In [1]:
# Import necessary libraries
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

# Verify TensorFlow version
print("TensorFlow version:", tf.__version__)

# Set up dataset paths
base_dir = os.path.join("..", "data", "chest_xray")  # Portable path; avoids the invalid "\d" and "\c" escape sequences
train_dir = os.path.join(base_dir, "train")
test_dir = os.path.join(base_dir, "test")  # Keep only train and test directories

# Check if directories exist
print("Train directory exists:", os.path.exists(train_dir))
print("Test directory exists:", os.path.exists(test_dir))


TensorFlow version: 2.17.0
Train directory exists: True
Test directory exists: True
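
Beyond confirming that the directories exist, it can be useful to count the images in each class folder, since the two classes may not be equally represented. A small sketch: the helper name `count_images_per_class` and the list of file extensions are assumptions, and it expects one subfolder per class, which is the layout `flow_from_directory` works with.

```python
import os


def count_images_per_class(root_dir, extensions=(".jpeg", ".jpg", ".png")):
    """Return {class_name: image_count} for each subfolder of root_dir."""
    counts = {}
    for class_name in sorted(os.listdir(root_dir)):
        class_path = os.path.join(root_dir, class_name)
        if not os.path.isdir(class_path):
            continue  # skip stray files next to the class folders
        counts[class_name] = sum(
            1 for name in os.listdir(class_path)
            if name.lower().endswith(extensions)
        )
    return counts


# Example call, using train_dir from the cell above:
# print(count_images_per_class(train_dir))
```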


### Step 2: Dataset Preprocessing

In this step, we will:
1. Load and preprocess images from the dataset.
2. Standardize the input size and normalize pixel values.
3. Apply data augmentation to the training set for better generalization.


In [2]:
# Image dimensions
IMG_HEIGHT, IMG_WIDTH = 150, 150
BATCH_SIZE = 32

# Data augmentation for the training set
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,  # Normalize pixel values to [0, 1]
    rotation_range=20,  # Random rotation
    width_shift_range=0.2,  # Random horizontal shift
    height_shift_range=0.2,  # Random vertical shift
    shear_range=0.2,  # Shear transformations
    zoom_range=0.2,  # Zoom
    horizontal_flip=True,  # Flip images horizontally
    fill_mode="nearest"  # Fill in missing pixels
)

# Only rescaling for validation and test sets (no augmentation)
val_test_datagen = ImageDataGenerator(rescale=1.0 / 255)

# Load images from directories
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode="binary"  # Binary classification (normal/pneumonia)
)

# val_dir is not defined in Step 1, so define it here before use
# (assumes the dataset's `val` folder listed in the Step 2 outline)
val_dir = os.path.join(base_dir, "val")

val_generator = val_test_datagen.flow_from_directory(
    val_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode="binary"
)

test_generator = val_test_datagen.flow_from_directory(
    test_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    class_mode="binary",
    shuffle=False  # Maintain order for evaluation
)


Found 5216 images belonging to 2 classes.

### Explanation of Step 2: Dataset Preprocessing

In this step, we preprocess the chest X-ray images to prepare them for training, validation, and testing. Here's what each part of the code does:

1. **Image Dimensions**:
   - All images are resized to `150x150` pixels to ensure consistency across the dataset and compatibility with the CNN model.

2. **Batch Size**:
   - Images are processed in batches of 32 to efficiently load and train on data without exhausting memory.

3. **Data Augmentation (Training Set)**:
   - **`rescale=1.0 / 255`**: Normalizes pixel values to the range [0, 1] for faster convergence during training.
   - **`rotation_range=20`**: Randomly rotates images within a range of `-20 degrees` to `+20 degrees`.
   - **`width_shift_range=0.2` and `height_shift_range=0.2`**: Randomly shifts the image horizontally and vertically by up to 20% of the width/height.
   - **`shear_range=0.2`**: Applies a shearing transformation, which shifts one part of the image more than another and produces a "slanted" version. In Keras the value is a shear angle in degrees, so `0.2` is a very slight shear. (https://youtube.com/shorts/-lXkrjeB6Ls?si=CIIeZJAzOI7CgN27)
   - **`zoom_range=0.2`**: Randomly zooms in or out by up to 20%.
   - **`horizontal_flip=True`**: Randomly flips images horizontally.
   - **`fill_mode="nearest"`**: Fills in any missing pixels after transformations with the nearest pixel values.

4. **Validation and Test Sets**:
   - Only pixel normalization (`rescale=1.0 / 255`) is applied. No augmentation is used, so these sets represent real-world, unaltered data.
   - Raw pixel values range from 0 (black) to 255 (white). Inputs at that scale tend to slow convergence and make training less stable, which is why they are rescaled to [0, 1].

5. **Data Generators**:
   - **`train_generator`**: Loads images from the `train` directory and applies augmentation.
   - **`val_generator`**: Loads images from the `val` directory for validation without augmentation.
   - **`test_generator`**: Loads images from the `test` directory for final evaluation with `shuffle=False`, so predictions stay in the same order as the labels. `shuffle` defaults to `True`, so the training and validation generators shuffle their images.
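
The rescaling and horizontal flip described above can be illustrated on a tiny array with NumPy. This is only a sketch of what the two operations do to pixel values: the 2x3 "image" is made up, and `ImageDataGenerator` applies the flip at random rather than on every image.

```python
import numpy as np

# A made-up 2x3 grayscale "image" with raw 8-bit pixel values
image = np.array([[0, 128, 255],
                  [64, 32, 16]], dtype=np.uint8)

# Equivalent of rescale=1.0 / 255: map [0, 255] onto [0, 1]
rescaled = image.astype(np.float32) * (1.0 / 255)
print("Min/max after rescaling:", rescaled.min(), rescaled.max())

# Equivalent of a horizontal flip: reverse the column order
flipped = np.fliplr(image)
print(flipped)
```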

**Why Preprocessing is Important**:
- **Data Augmentation**: Introduces variability in the training data, improving the model’s ability to generalize to unseen data.
- **Normalization**: Scales pixel values to a uniform range, making training more efficient and stable.
- **Batch Processing**: Loads data in chunks to optimize memory usage and speed up training.
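
As a quick check on the batch arithmetic: the generator output reports 5216 training images, so with a batch size of 32 one pass over the training set takes the number of batches computed below. The variable names are illustrative.

```python
import math

num_train_images = 5216  # from the "Found 5216 images" generator output
batch_size = 32          # BATCH_SIZE in the preprocessing code

# The last batch may be smaller than the rest, so round up
batches_per_epoch = math.ceil(num_train_images / batch_size)
print("Batches per epoch:", batches_per_epoch)
```

This value should match `len(train_generator)`, which is a convenient way to confirm the generator was set up as intended.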

This preprocessing step ensures the model is trained on a well-prepared dataset, improving its performance and generalization ability.
