# Deep Learning for Pneumonia Detection: A Neural Network Approach

[Link for the Chest Pneumonia X-ray Dataset](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia)

1. Raw Data
	•	Input Data: This is your initial dataset, which could be images, text, or other data types.
	•	Preprocessing: Perform necessary transformations like resizing, normalization, data augmentation (in case of images), or text tokenization to prepare the data for model input.

2. Training
	•	Training Set: Split the raw data into training and validation sets. The training set is used to teach the model by providing labeled data.
	•	Model Training: During training, the model learns by minimizing the loss function (e.g., CrossEntropyLoss) using backpropagation and optimization (e.g., Adam optimizer).
	•	Epochs: The model goes through multiple epochs, learning from the data and adjusting its weights.

3. Validation
	•	Validation Set: This data is held out from training and used to monitor the model’s performance, which helps detect overfitting. It also guides hyperparameter tuning and model selection (e.g., learning rate, model architecture).
	•	Model Evaluation: At the end of each epoch, the model is evaluated on the validation set, and the best-performing model (lowest validation loss) is saved.
	•	Metrics: You track various performance metrics like accuracy, recall, and F1 score to evaluate the model’s performance on the validation set.

4. Test
	•	Test Set: After training is complete, the model is tested on unseen data (test set) to evaluate its generalization ability.
	•	Model Prediction: You use the model to make predictions on the test set.
	•	Metrics: Evaluate the test set’s performance using metrics like accuracy, recall, F1 score, etc.

5. Ensemble Prediction
	•	Ensemble Models: Instead of relying on just one model, you can combine predictions from multiple models (e.g., EfficientNet-B1, B2, and B3 in this notebook).
	•	Ensemble Strategy: Common strategies include averaging the predicted probabilities from the different models (soft voting) or taking a majority vote over their predicted class labels (hard voting).
	•	Final Prediction: The final prediction combines the outputs of multiple models. This notebook averages the predictions from the three models.

6. Results
	•	Final Evaluation: After combining predictions from the ensemble models, you evaluate the final predictions against the true labels from the test set.
	•	Final Metrics: You compute the final accuracy, recall, F1 score, etc., for the ensemble model.
	•	Interpretation: Analyze the results to understand how well the model is performing and decide whether any improvements are needed.

### Raw Data -> Preprocessing -> Train -> Validate -> Test -> Ensemble Prediction -> Final Results
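
The ensemble step in the pipeline above can be sketched in a few lines. This is a minimal, pure-Python illustration of soft voting, assuming each model has already produced per-class probabilities; the probability values and the helper name `soft_vote` are made up for illustration and are not taken from the notebook.

```python
# Hypothetical softmax outputs from three models for four images.
# Each row is [P(class 0), P(class 1)]; the numbers are invented for illustration.
probs_model_1 = [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
probs_model_2 = [[0.80, 0.20], [0.30, 0.70], [0.35, 0.65], [0.10, 0.90]]
probs_model_3 = [[0.70, 0.30], [0.60, 0.40], [0.45, 0.55], [0.30, 0.70]]


def soft_vote(*model_probs):
    """Average per-class probabilities across models, then take the argmax."""
    n_models = len(model_probs)
    predictions = []
    for per_image in zip(*model_probs):  # one tuple of rows (one per model) per image
        avg = [sum(col) / n_models for col in zip(*per_image)]
        predictions.append(avg.index(max(avg)))  # class with the largest mean probability
    return predictions


ensemble_preds = soft_vote(probs_model_1, probs_model_2, probs_model_3)
print(ensemble_preds)
```

Note that for the third image the first model alone would predict class 0, while the averaged probabilities favor class 1; this is the kind of disagreement an ensemble is meant to smooth out.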


## Import Packages

In [None]:
import torch 
import random
import numpy as np
import os

# Define preprocessing and augmentation pipelines for image data
# Standardization: Ensure all input images have the same dimensions, pixel range, and scale. 
#                  This is crucial for models that expect a consistent input format (e.g., ResNet, EfficientNet).
# Normalization: Normalize pixel values to have a mean of 0 and a standard deviation of 1, which helps models converge faster during training.
# Data Augmentation (Training Only): Apply random transformations like flips, rotations, or crops to artificially expand the training dataset and improve model generalization.
from torchvision import transforms

# Load image datasets organized in a folder hierarchy
# ImageFolder is designed to work with image datasets stored in a folder-based structure, where each subdirectory corresponds to a specific class label. 
# It automatically: Scans the directory / Maps subdirectory names to class labels / Creates a dataset with file paths and their respective labels.
# For custom datasets (e.g., datasets not organized into folders), you might need to write your own dataset class by subclassing torch.utils.data.Dataset.
from torchvision.datasets import ImageFolder

# Handle data loading during training and evaluation
# The DataLoader takes a dataset (like one created using ImageFolder) and provides an iterable over that dataset, with features such as:
# 1. Batching:
#   • Automatically divides the dataset into smaller subsets (batches) of a specified size.
#   • Batching is critical for training neural networks efficiently, as processing data in chunks speeds up computation using GPUs.
# 2. Shuffling:
#   • Randomly rearranges the order of the data at the beginning of each epoch.
#   • Shuffling prevents the model from learning unintended patterns in the order of the data (e.g., if the data is sorted by class).
# 3. Parallel Data Loading:
#   • Allows you to use multiple worker processes (num_workers) to load data in parallel. This speeds up data loading for large datasets.
# 4. Sampling:
#   • Supports customized sampling techniques using samplers, such as randomly selecting data points or applying weighted sampling.
# 5. Customizable Preprocessing:
#   • The dataset applies its transformations (e.g., normalization, resizing) to each sample as the DataLoader fetches it; a custom collate_fn can further control how samples are combined into a batch.
from torch.utils.data import DataLoader

# Why Use EfficientNet?
# Pre-trained models like EfficientNet are generally trained on large, publicly available datasets, with ImageNet being the most common. 
# ImageNet is a massive dataset containing millions of labeled images across 1,000 categories, such as dogs, cats, cars, etc. 
# EfficientNet, specifically, has been pre-trained on ImageNet and is therefore very good at identifying general patterns and features in images—edges, textures, shapes, etc.—that are common across a wide variety of images.
# If you’re using EfficientNet pre-trained on ImageNet to classify chest X-ray images (a medical task), while the model has never seen chest X-ray images during pre-training, the features it learned from ImageNet (like textures, edges, etc.) are still helpful for distinguishing patterns in the X-ray images. You just need to fine-tune the model on your specific X-ray dataset.
# Efficiency: EfficientNet models are reported to match or exceed the ImageNet accuracy of older architectures such as ResNet and VGG with fewer parameters and less computation.
#             They are designed using compound scaling, which optimizes depth, width, and resolution, leading to better performance while using fewer resources.
# Pre-trained Weights: The pre-trained models allow you to leverage transfer learning, which can greatly improve performance on your task (especially with limited data). 
#                      Instead of training a model from scratch, you can fine-tune a pre-trained EfficientNet model, saving time and computational resources.
# Multiple Variants: EfficientNet comes in several variants (B0 to B7), with each having a different trade-off between speed and accuracy. For example:
#   •	B0: Smallest and fastest but less accurate.
#   •	B7: Largest and most accurate but slower and more resource-heavy.
# This notebook uses B1, B2, and B3, which are middle-ground options, balancing performance and computational efficiency.
!pip install efficientnet-pytorch==0.7.1
from efficientnet_pytorch import EfficientNet

# The torch.nn module contains many neural network components, such as layers, activation functions, and loss functions. (Optimizers live in torch.optim.)
import torch.nn as nn

# get_cosine_schedule_with_warmup is a function provided by Hugging Face’s transformers library. It’s a learning rate scheduler that adjusts the learning rate in a cosine annealing fashion.
# Cosine annealing means the learning rate starts high, gradually decreases following a cosine curve, and then flattens out near the end of training. This schedule often helps achieve better training performance and faster convergence.
# Cosine Annealing helps improve the performance of the model by gradually reducing the learning rate, enabling finer weight adjustments later in training.
# Warm-up steps: The scheduler also includes “warm-up” steps, where the learning rate starts small and gradually increases to the initial learning rate over a set number of steps. This is done to avoid large updates at the start, which can destabilize the training.
# Warm-up helps stabilize training by preventing large gradient steps at the beginning, which can cause unstable updates when weights are initialized randomly.
# By gradually decreasing the learning rate and using warm-up, it helps the model converge more effectively and avoids oscillations.
from transformers import get_cosine_schedule_with_warmup

from sklearn.metrics import accuracy_score # Function to calculate accuracy
from sklearn.metrics import recall_score   # Function to calculate recall
from sklearn.metrics import f1_score       # Function to calculate F1 score
from tqdm.notebook import tqdm             # Progress bar for tracking training progress

## Setting the Seed

In [None]:
## import torch 
## import random
## import numpy as np
## import os


## Setting the Seed
# Defines a fixed random seed (50 in this case) for consistent results.
seed = 50
# Sets PYTHONHASHSEED, which controls Python’s hash randomization. The interpreter reads this variable at startup, so setting it here affects only Python processes launched afterwards; to cover the current process, set it before starting Python.
os.environ['PYTHONHASHSEED'] = str(seed)
# Sets the seed for Python’s built-in random module. This ensures any random operations from this module produce the same results across runs.
random.seed(seed)
# Sets the seed for NumPy’s random number generator. This ensures consistency in NumPy’s random operations.
np.random.seed(seed)

## Configuring PyTorch
# Sets the seed for PyTorch’s CPU operations. Any random behavior (e.g., initialization of weights) will produce consistent results.
torch.manual_seed(seed)
# Sets the seed for PyTorch’s CUDA backend on a single GPU. Ensures reproducibility for GPU-based operations.
torch.cuda.manual_seed(seed)
# Sets the seed for all GPUs (if multiple GPUs are being used). This ensures reproducibility when using multiple devices.
torch.cuda.manual_seed_all(seed)

## Configuring cuDNN
# Forces cuDNN to use deterministic algorithms. This ensures that operations like convolution produce consistent results.
torch.backends.cudnn.deterministic = True
# Disables cuDNN’s auto-tuner, which selects the best convolution algorithm based on the hardware and input sizes. While the auto-tuner can improve performance, it can introduce run-to-run variability. Setting it to False prioritizes reproducibility.
torch.backends.cudnn.benchmark = False
# Disables cuDNN entirely. This guarantees deterministic behavior but may significantly reduce training and inference speed. This line is optional and often omitted unless required for debugging.
torch.backends.cudnn.enabled = False
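
To make the effect of seeding concrete, here is a minimal sketch using only Python's built-in `random` module. It is an illustration of the idea, not part of the notebook's pipeline; the helper name `draw` is hypothetical. The same principle applies to the NumPy and PyTorch generators seeded above.

```python
import random


def draw(seed, n=3):
    """Re-seed the generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(50)
second = draw(50)

# Re-seeding with the same value replays exactly the same sequence.
print(first == second)
```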

## GPU Setting

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Data Preparation

In [1]:
# Locally: You could set data_path = '/path/to/your/dataset/'.
data_path = '/kaggle/input/chest-xray-pneumonia/chest_xray/'

train_path = data_path + 'train/'
valid_path = data_path + 'val/'
test_path = data_path + 'test/'

## Define Transforms

In [None]:
## from torchvision import transforms


## transforms.Compose allows chaining multiple transformations into a single pipeline.
# How Transforms Work
# 1. Data Augmentation:
#   •	Augmentations like flipping, rotating, and cropping create diverse variations of the dataset, making the model less prone to overfitting.
#   •	During training, these augmentations are applied on-the-fly to each batch.
# 2. Normalization:
#   •	Brings pixel values into a consistent range to make training stable.
#   •	mean and std values are specific to the dataset (ImageNet in this case).
# 3. Pipeline:
#   •	Each image goes through the transformations sequentially. For example:
#   •	An image is resized → cropped → augmented → converted to a tensor → normalized.

## transforms.Normalize
# Whether you can use the ImageNet normalization values (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) or need to calculate your own depends on your specific use case:
# 1. Using ImageNet Values (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225)
#   •	Using a pre-trained model (e.g., ResNet, EfficientNet, etc.) from ImageNet: These models expect inputs to be normalized with these values, regardless of whether you’re training or fine-tuning on a new dataset.
#   •	Your dataset is similar to ImageNet: If your dataset consists of natural images (e.g., photos of animals, objects, or everyday scenes), the ImageNet values are likely a good approximation.
# 2. Calculating Dataset-Specific Values: 
#    You should compute the mean and standard deviation for your specific dataset when:
#   •	Your dataset differs significantly from ImageNet: Medical imaging (e.g., X-rays, MRIs) / Grayscale or non-natural images (e.g., thermal or satellite imagery) / Cartoon, stylized, or domain-specific images.
#   •	You are training a model from scratch: Without a pre-trained model, it’s better to normalize based on the statistics of your dataset to achieve better model performance.

# Chain multiple transformations for training data.
transform_train = transforms.Compose([      
    # Resize: Ensures all images have a uniform size, regardless of their original dimensions.                                  
    # Resize images to 250x250 pixels.
    transforms.Resize((250, 250)),    
    # CenterCrop: Focuses on the central part of the image, discarding less relevant peripheral areas.
    # Crop the central 180x180 region from the resized image.      
    transforms.CenterCrop(180),             
    # RandomHorizontalFlip & RandomVerticalFlip: Introduces variation by flipping images, helping the model generalize better.
    # Randomly flip the image horizontally with a 50% chance.
    transforms.RandomHorizontalFlip(0.5),   
    # Randomly flip the image vertically with a 20% chance.
    transforms.RandomVerticalFlip(0.2), 
    # RandomRotation: Adds robustness by simulating images taken from different angles.    
    # Rotate the image randomly within ±20 degrees.
    transforms.RandomRotation(20),          
    # ToTensor: Converts the image to a PyTorch tensor and scales pixel values to [0, 1].
    transforms.ToTensor(),                  
    # Normalize: Standardizes image pixel values using pre-calculated ImageNet statistics for mean and std, improving convergence during training.
    # Normalize pixel values using mean and std for each channel (RGB).
    transforms.Normalize((0.485, 0.456, 0.406),  
                         (0.229, 0.224, 0.225))  
])                        

# Chain transformations for test/validation data.
transform_test = transforms.Compose([   
    # Resize images to 250x250 pixels (same as training for consistency).
    transforms.Resize((250, 250)),      
    # Crop the central 180x180 region.
    transforms.CenterCrop(180),         
    # Convert the image to a tensor.
    transforms.ToTensor(),              
    # Normalize using ImageNet mean and std.
    transforms.Normalize((0.485, 0.456, 0.406),  
                         (0.229, 0.224, 0.225))
])
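
The arithmetic behind `ToTensor` followed by `Normalize` can be sketched for a single RGB pixel. This is a pure-Python illustration under the assumption of 8-bit input values; the helper names `normalize_pixel` and `denormalize_pixel` are hypothetical and do not exist in torchvision.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Mimic ToTensor (divide by 255) followed by Normalize for one RGB pixel."""
    scaled = [value / 255.0 for value in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


def denormalize_pixel(normalized):
    """Invert the normalization to recover values in [0, 1], e.g. for plotting."""
    return [x * s + m for x, m, s in zip(normalized, IMAGENET_MEAN, IMAGENET_STD)]


mid_gray = normalize_pixel((128, 128, 128))
restored = denormalize_pixel(mid_gray)
print([round(v, 3) for v in mid_gray])
print([round(v, 3) for v in restored])
```

Because each channel has its own mean and standard deviation, a gray pixel ends up with three different normalized values. The inverse function is handy when displaying images that have already passed through the pipeline.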

## Define Datasets

In [None]:
## from torchvision.datasets import ImageFolder

# Requirement: Your dataset folder must already be structured so that each class has its own subfolder containing all the images for that class.
'''
dataset/
├── train/
│   ├── class_1/
│   │   ├── img1.jpg
│   │   ├── img2.jpg
│   ├── class_2/
│       ├── img3.jpg
│       ├── img4.jpg
├── val/
    ├── class_1/
    │   ├── img5.jpg
    │   ├── img6.jpg
    ├── class_2/
        ├── img7.jpg
        ├── img8.jpg
'''

# root=train_path:
#   Specifies the root directory where the training images are stored.
#   train_path points to a folder containing subfolders, each representing a class, with images inside.
# transform=transform_train:
#   Applies the transformations defined in the transform_train pipeline to every image in the dataset.
#   These transformations include resizing, flipping, rotation, and normalization, which are essential for augmenting the training data and preparing it for the model.
datasets_train = ImageFolder(root = train_path, transform = transform_train)

datasets_valid = ImageFolder(root = valid_path, transform = transform_test) 
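
As a rough illustration of how class labels are derived from the folder structure, the sketch below mimics the alphabetical folder-to-index mapping that `ImageFolder` exposes as `class_to_idx`. It is a stdlib-only approximation, not the torchvision implementation; the helper name is hypothetical, and the folder names `NORMAL` and `PNEUMONIA` are assumed to match the dataset's layout.

```python
import os
import tempfile


def class_to_idx_from_folders(root):
    """Return {class_name: index} using alphabetically sorted subfolder names."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Build a throwaway directory tree with one subfolder per class.
with tempfile.TemporaryDirectory() as root:
    for name in ("PNEUMONIA", "NORMAL"):
        os.makedirs(os.path.join(root, name))
    mapping = class_to_idx_from_folders(root)

print(mapping)
```

With the real dataset, `datasets_train.class_to_idx` can be printed to confirm which index corresponds to which class before interpreting recall or F1 scores.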

## Dataloader

In [None]:
## Ensure reproducibility when working with PyTorch’s DataLoader in a multi-threaded or parallel environment

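# Seeds NumPy and Python’s random module inside each DataLoader worker process with a deterministic, per-worker value.
# This keeps random operations that run inside workers (e.g., augmentations that use these libraries) consistent across runs; the shuffling order itself is controlled by the generator passed to the DataLoader.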
# worker_id: Each worker in a DataLoader is identified by a unique worker_id.
def seed_worker(worker_id):
    # torch.initial_seed(): Returns the initial random seed for the current worker.
    # torch.initial_seed() % 2**32: Since torch.initial_seed() generates a large number, it’s reduced to a 32-bit range for compatibility with NumPy and Python’s random module.
    worker_seed = torch.initial_seed() % 2**32
    # np.random.seed(worker_seed): Sets the seed for NumPy random number generation.
    np.random.seed(worker_seed)
    # random.seed(worker_seed): Sets the seed for Python’s built-in random module.
    random.seed(worker_seed)

# Creates a random number generator (torch.Generator) specifically for controlling deterministic behavior in DataLoader sampling.
# torch.Generator(): Creates an independent random number generator object.
g = torch.Generator()
# g.manual_seed(0):
#   • Sets the seed of this generator to 0, ensuring deterministic sampling in the DataLoader.
g.manual_seed(0)

In [None]:
## from torch.utils.data import DataLoader


# Defines the number of samples that will be processed together in one batch.
# The batch size determines how many samples (images) are passed through the model at once during each training step (or iteration)
# Smaller batches consume less memory but may increase training time, while larger batches are faster but require more memory. A batch size of 8 balances these concerns.
batch_size = 8

loader_train = DataLoader(
    # dataset: The dataset object you want to load (e.g., ImageFolder, custom dataset).
    dataset = datasets_train, 
    # batch_size: Number of samples to load in each batch.
    batch_size = batch_size, 
    # shuffle: If True, the data is shuffled at the start of each epoch.
    shuffle = True, 
    # worker_init_fn: Allows you to set seeds for random operations in workers for reproducibility.
    worker_init_fn = seed_worker,
    # Provides a random number generator (torch.Generator) initialized with a fixed seed.
    generator = g, 
    # num_workers: Number of subprocesses used for data loading. Setting this to a higher value speeds up loading but increases CPU usage.
    # If you have more CPU cores available, you can increase the num_workers to 4 or 8. On a typical machine with 4–8 cores, setting num_workers to 4–8 should work well.
    num_workers = 8
    # drop_last (not set here, so it defaults to False): If True, drops the last incomplete batch if the dataset size isn’t divisible by batch_size.
)

loader_valid = DataLoader(
    dataset = datasets_valid, 
    batch_size = batch_size, 
    shuffle = False, 
    worker_init_fn = seed_worker,
    generator = g, 
    num_workers = 8
)
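
As a quick check on what `batch_size` and `drop_last` imply, the number of batches per epoch (the value `len(loader_train)` returns) can be estimated by hand. The sketch below is plain arithmetic rather than a call into PyTorch; the sample count of 1003 and the helper name `num_batches` are hypothetical.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Batches per epoch: floor when the last partial batch is dropped, else ceil."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(1003, 8))                  # last batch holds the 3 leftover samples
print(num_batches(1003, 8, drop_last=True))  # leftover samples are skipped
```

This count matters later because the learning rate scheduler's warm-up and total step counts are expressed in batches.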

## Setting up Models

In [None]:
## from efficientnet_pytorch import EfficientNet

# When to Use EfficientNet:
#	1.	Image Classification Tasks:
#	•	EfficientNet is particularly suited for image classification, so if your problem involves classifying images into predefined categories, EfficientNet is a good choice. It can handle tasks from binary classification (e.g., distinguishing between “healthy” and “sick” images) to multi-class classification (e.g., identifying objects or animals from a set of classes).
#	2.	Limited Computational Resources:
#	•	EfficientNet is known for its efficiency—it provides a good trade-off between model size and performance. If you are working with constrained resources (e.g., limited memory or computational power), EfficientNet can be a great choice because it performs well with fewer parameters and computations compared to other large CNNs (like ResNet or VGG).
#	3.	Transfer Learning with Pretrained Models:
#	•	EfficientNet has pretrained models on large datasets like ImageNet. If you want to take advantage of transfer learning (fine-tuning the pretrained model on your own dataset), EfficientNet is an excellent option. Transfer learning allows the model to use the features it learned on large datasets to improve its performance on your specific dataset.
#	4.	Large Datasets:
#	•	If you have a large dataset and need a model that can scale to handle increasing complexity, EfficientNet is a great choice. It scales well across multiple versions (EfficientNet-B0 to EfficientNet-B7), allowing you to select a model size that fits your dataset and computing power.
#	5.	High Accuracy with Efficiency:
#	•	EfficientNet provides high accuracy with a small number of parameters, making it ideal for situations where you need a model that performs well but doesn’t consume excessive computational resources.

# When Not to Use EfficientNet:
#	1.	Very Small Datasets:
#	•	EfficientNet, like most deep learning models, requires a sufficient amount of data to train effectively. If you have a very small dataset, it may lead to overfitting. In such cases, you may want to consider simpler models or use techniques like data augmentation or pretraining to mitigate this.
#	2.	Non-Image Data:
#	•	EfficientNet is specifically designed for image data. If you’re working with non-image data (e.g., text, time series, or structured data), EfficientNet is not suitable. For text or sequence data, you would use models like LSTM, GRU, or Transformers. For structured data, simpler models like random forests or gradient boosting might be more appropriate.
#	3.	Real-Time or Low-Latency Requirements:
#	•	While EfficientNet is efficient in terms of accuracy and computation, for extremely low-latency applications (e.g., real-time video processing or embedded devices), you may need to optimize the model further or consider smaller models like MobileNet or SqueezeNet, which are specifically designed for such scenarios.
#	4.	Not Suitable for Extremely High-Speed Inference Needs:
#	•	Although EfficientNet is efficient in terms of accuracy and computational cost, for extremely high-speed inference tasks (e.g., edge devices with limited resources), it might still be too large compared to models optimized for inference at the edge. In such cases, you might prefer models optimized for inference speed, like MobileNet, ShuffleNet, or Tiny YOLO.
#	5.	Tasks Outside Image Classification:
#	•	EfficientNet is designed primarily for image classification. If you’re working on tasks that require other types of model architectures (e.g., object detection, segmentation, or generative tasks), EfficientNet may not be the best fit, although you could adapt it as a backbone for other tasks (e.g., object detection with EfficientNet as the feature extractor).

# EfficientNet.from_pretrained(): This function loads a pre-trained EfficientNet model. The model weights are pre-trained on the ImageNet dataset and can be used for transfer learning.
# 'efficientnet-b1', 'efficientnet-b2', 'efficientnet-b3': These are specific versions of the EfficientNet model, where:
#   •	B1: A smaller version with fewer parameters.
#   •	B2 and B3: Successively larger models with more parameters and higher accuracy but also more computational requirements.
# num_classes=2: This specifies the number of output classes for your task. In this case, it’s set to 2, which means you’re performing binary classification. For multi-class classification, you would adjust this value according to the number of classes in your dataset.
efficientnet_b1 = EfficientNet.from_pretrained('efficientnet-b1', num_classes = 2) 
efficientnet_b2 = EfficientNet.from_pretrained('efficientnet-b2', num_classes = 2)
efficientnet_b3 = EfficientNet.from_pretrained('efficientnet-b3', num_classes = 2) 

# Allocation of GPU to each model
# .to(device): This moves the model to the specified device, either CPU or GPU. The device variable is typically set to either "cpu" or "cuda" based on whether a GPU is available for computation. This ensures the models are placed on the right hardware for training or inference.
efficientnet_b1 = efficientnet_b1.to(device)
efficientnet_b2 = efficientnet_b2.to(device)
efficientnet_b3 = efficientnet_b3.to(device)

# Collect the three EfficientNet models (B1, B2, B3) in a list called models_list so they can be iterated over later for training, validation, testing, and ensembling.
models_list = [efficientnet_b1, efficientnet_b2, efficientnet_b3]

In [None]:
## Loop over a list of models (models_list) and print the number of parameters for each model
# This line iterates over each model in the models_list using enumerate, which provides both the index (idx) and the model itself (model) in each iteration.
for idx, model in enumerate(models_list):
    # Calculates the total number of parameters in the model
    #   •	model.parameters() returns an iterator over all the parameters of the model (weights and biases in the layers).
    #   •	param.numel() returns the total number of elements (parameters) in the given tensor param.
    #   •	The sum() function sums up the number of parameters across all layers of the model.
    num_params = sum(param.numel() for param in model.parameters())
    print(f'Model{idx+1} | Number of Parameters: {num_params}')

## Loss Function

In [None]:
## import torch.nn as nn

# nn.CrossEntropyLoss() is a loss function used for classification problems where the model outputs raw, unnormalized scores (logits) for each class, and the target labels are integers representing class indices.
# CrossEntropyLoss is particularly useful for classification problems, and it’s most commonly applied when you have two or more mutually exclusive classes. It works well for both binary classification (two classes) and multi-class classification (more than two classes).

# The loss function combines two steps:
#   1. LogSoftmax: It applies the logarithm of softmax to the model’s output, turning the logits into log-probabilities over the classes.
#   2. Negative Log Likelihood Loss: It calculates how much the predicted probabilities differ from the true class labels. The lower the loss, the better the model’s prediction matches the true labels.

# If you’re working with a dataset like a binary or multi-class classification problem (for example, pneumonia vs. non-pneumonia images), and you have two classes (class_1 and class_2), this loss function would compare the model’s predicted class probabilities for each input image against the true class label (e.g., 0 or 1) and compute how far off the prediction is.
# In this case: class index 0: Healthy (the “NORMAL” folder) / class index 1: Sick (the “PNEUMONIA” folder), because ImageFolder assigns indices to class folders in alphabetical order.

# When to Use CrossEntropyLoss:
# 1.	Classification Tasks:
# •	Binary Classification: When you have two mutually exclusive classes (e.g., “positive” vs “negative”).
# •	Multi-Class Classification: When you have more than two classes and each sample belongs to exactly one class (e.g., classifying images into “cat,” “dog,” “bird”).
# 2.	Logit Outputs (No Softmax Layer Needed):
# •	nn.CrossEntropyLoss expects one logit per class and applies log-softmax internally, so the model should not end with its own softmax layer. For binary classification this means two output logits, as with num_classes = 2 in this notebook.
# •	If the model instead outputs a single logit for the positive class (e.g., “sick”), use nn.BCEWithLogitsLoss, which applies the sigmoid internally.
# 3.	Multi-Class Problems (More than Two Classes):
# •	When you have more than two categories and each input can belong to exactly one of the categories, such as identifying animals (e.g., “dog,” “cat,” “bird”).

# When Not to Use CrossEntropyLoss:
# 1.	Multi-Label Classification:
# •	When a sample can belong to multiple classes at once (e.g., an image can contain both a “dog” and a “cat”), CrossEntropyLoss is not suitable. Instead, use a binary cross-entropy loss with a sigmoid per class (e.g., nn.BCEWithLogitsLoss).
# •	In multi-label classification, each class is treated independently, and the model outputs a separate probability for each class.
# 2.	Regression Problems:
# •	If your task involves predicting continuous values (e.g., predicting house prices or temperatures), CrossEntropyLoss should not be used. In such cases, you should use Mean Squared Error Loss (MSELoss) or other regression loss functions, depending on your task.
# 3.	Ordinal Regression:
# •	When your classes have an ordinal relationship (e.g., predicting a rating scale from 1 to 5), where the classes are ordered, but not necessarily equally spaced, you might want to consider using a loss function designed for ordinal regression, like Ordinal Cross-Entropy Loss or Mean Squared Error Loss for ordinal tasks.

criterion = nn.CrossEntropyLoss()
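
The two steps described above (log-softmax, then negative log-likelihood) can be worked through by hand for a single sample. This is a pure-Python sketch of the computation, not PyTorch's implementation; the logit values are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum_exp for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -log_softmax(logits)[target]


logits = [2.0, 0.5]  # hypothetical raw scores for [class 0, class 1]
loss_if_class_0 = cross_entropy(logits, target=0)
loss_if_class_1 = cross_entropy(logits, target=1)

print(round(loss_if_class_0, 4))  # small loss: the model favors class 0
print(round(loss_if_class_1, 4))  # larger loss: the model is confidently wrong
```

The loss is small when the largest logit belongs to the true class and grows as the model puts more weight on the wrong class, which is the signal backpropagation uses to adjust the weights.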

## Optimizer

In [None]:
# AdamW is a variant of the Adam optimizer with weight decay regularization, which helps prevent overfitting by shrinking the weights toward zero during training. 
# The “W” in AdamW stands for Weight Decay: the decay is applied directly to the weights instead of being added to the gradient as an L2 penalty, which often generalizes better than standard Adam with L2 regularization.

optimizer1 = torch.optim.AdamW(
    # models_list[0] refers to the first model (EfficientNet-B1) in your list of models.
    # .parameters() retrieves the parameters (weights and biases) of the model that will be optimized.
    models_list[0].parameters(), 
    # lr=0.0006: This sets the learning rate to 0.0006. The learning rate controls how large the steps will be when updating the model’s parameters. 
    #   A lower learning rate means smaller updates.
    lr = 0.0006, 
    # weight_decay=0.001: This applies a weight decay of 0.001. It adds a penalty to large weights, which helps prevent overfitting and promotes generalization.
    weight_decay = 0.001
)

optimizer2 = torch.optim.AdamW(
    models_list[1].parameters(), 
    lr = 0.0006, 
    weight_decay = 0.001
)

optimizer3 = torch.optim.AdamW(
    models_list[2].parameters(), 
    lr = 0.0006, 
    weight_decay = 0.001
)
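
To show what "decoupled" weight decay means in practice, here is a sketch of a single AdamW update for one scalar parameter. It follows the commonly documented update rule and is not PyTorch's optimized implementation. The learning rate and weight decay match the values used above; the betas and epsilon are assumed to be the usual defaults, and the parameter and gradient values are made up.

```python
import math


def adamw_step(param, grad, m, v, step, lr=0.0006, weight_decay=0.001,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a single scalar parameter."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(new_param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate, plus the small shrinkage from weight decay.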

## Scheduler

In [None]:
## from transformers import get_cosine_schedule_with_warmup


# Defines the number of epochs (full passes over the entire training dataset) for the training process. Each model will make 20 passes over the dataset.
epochs = 20 

# This line initializes a learning rate scheduler for the first optimizer (optimizer1). The scheduler will adjust the learning rate during training according to a cosine annealing pattern, with a warm-up phase in the beginning.
scheduler1 = get_cosine_schedule_with_warmup(
    optimizer1, 
    # num_warmup_steps = len(loader_train) * 3:
    #   • len(loader_train) gives the number of batches in one epoch (since loader_train is the DataLoader for the training dataset).
    #   • * 3: The warm-up phase will last for 3 epochs, meaning the learning rate will increase gradually from 0 to the initial learning rate during the first 3 epochs. 
    #     This helps prevent large updates at the beginning of training, which could destabilize the learning process.
    num_warmup_steps = len(loader_train) * 3, 
    # num_training_steps = len(loader_train) * epochs:
    #   • This defines the total number of steps in the entire training process. 
    #     It is calculated as the number of batches in one epoch (len(loader_train)) multiplied by the number of epochs (epochs = 20).
    #   • This total number of steps is used to compute the cosine decay over the training process, so that the learning rate gradually decreases during the later stages of training following a cosine curve.
    num_training_steps = len(loader_train) * epochs
)

scheduler2 = get_cosine_schedule_with_warmup(
    optimizer2, 
    num_warmup_steps = len(loader_train) * 3, 
    num_training_steps = len(loader_train) * epochs
)

scheduler3 = get_cosine_schedule_with_warmup(
    optimizer3, 
    num_warmup_steps = len(loader_train) * 3, 
    num_training_steps = len(loader_train) * epochs
)
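
To make the schedule concrete, here is a minimal pure-Python sketch of the learning-rate multiplier produced by a linear warm-up followed by a half-cosine decay. It is intended to mirror the default behaviour of `get_cosine_schedule_with_warmup`, but it is an illustration, not the library's code. The function name `lr_multiplier` and the value of 100 batches per epoch are assumptions made for this example.

```python
import math

def lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warm-up: the factor rises from 0 to 1
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warm-up phase that has elapsed (0 -> 1)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half-cosine decay: the factor falls from 1 to 0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 0.0006
steps_per_epoch = 100   # hypothetical stand-in for len(loader_train)
epochs = 20
warmup_steps = steps_per_epoch * 3
total_steps = steps_per_epoch * epochs

# Learning rate at the start of a few selected epochs
for epoch in (0, 1, 3, 10, 20):
    step = epoch * steps_per_epoch
    lr = base_lr * lr_multiplier(step, warmup_steps, total_steps)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Because the scheduler is stepped once per batch inside the training loop, both `num_warmup_steps` and `num_training_steps` are counted in batches rather than epochs.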

## Training and Validation

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from tqdm.notebook import tqdm


# train() runs the training and validation loops for the given number of epochs, reports accuracy, recall, and F1 score on the validation set after each epoch, and saves the weights from the epoch with the lowest validation loss.
def train(model, 
          loader_train, 
          loader_valid, 
          criterion, 
          optimizer, 
          scheduler = None, 
          epochs = 10, 
          save_file = 'model_state_dict.pth'):
    
    # Track the lowest (best) validation loss seen so far.
    # Starting from infinity guarantees that the first epoch's validation loss counts as an improvement.
    valid_loss_min = np.inf 

    # Loop through the total number of epochs
    for epoch in range(epochs):
        
        print(f'Epoch [{epoch+1}/{epochs}] \n-----------------------------')
        
        ## Training
        # Set model to training mode
        model.train()
        # Initialize training loss for this epoch        
        epoch_train_loss = 0 
        # Loop through the batches of data
        for images, labels in tqdm(loader_train):
            # Move images and labels to the device (GPU/CPU)
            images = images.to(device)
            labels = labels.to(device)
            # Reset gradients in the optimizer
            optimizer.zero_grad()
            # Forward pass: Get model predictions for input images
            outputs = model(images)
            # Calculate the loss by comparing predictions with actual labels
            loss = criterion(outputs, labels)
            # Add current batch's loss to the total epoch loss
            epoch_train_loss += loss.item()
            # Backpropagation to calculate gradients 
            loss.backward()       
            # Update model weights using gradients
            optimizer.step()      
            # If a scheduler is provided, update the learning rate (once per batch)
            if scheduler is not None: 
                scheduler.step() 

        # Print the average training loss for this epoch
        print(f'\tTraining Loss: {epoch_train_loss / len(loader_train):.4f}')
        
        ## Validation
        # Set model to evaluation mode 
        model.eval()      
        # Initialize validation loss for this epoch   
        epoch_valid_loss = 0
        # List to store predictions for validation 
        preds_list = []     
        # List to store actual labels for validation
        true_list = []       
        
        # Disable gradient calculation during validation
        with torch.no_grad(): 
            for images, labels in loader_valid:
                images = images.to(device)
                labels = labels.to(device)
                
                outputs = model(images)
                loss = criterion(outputs, labels)
                epoch_valid_loss += loss.item()
                
                # Get predictions and true labels (moving to CPU for calculation)
                preds = torch.max(outputs.cpu(), dim=1)[1].numpy() 
                true = labels.cpu().numpy() 
    
                preds_list.extend(preds)
                true_list.extend(true)
                
        # Calculate accuracy, recall, and F1 score for validation
        val_accuracy = accuracy_score(true_list, preds_list)
        val_recall = recall_score(true_list, preds_list)
        val_f1_score = f1_score(true_list, preds_list)

        # Print validation loss, accuracy, recall, and F1 score
        print(f'\tValidation Loss: {epoch_valid_loss / len(loader_valid):.4f}')
        print(f'\tAccuracy: {val_accuracy:.4f} / Recall: {val_recall:.4f} / F1 Score: {val_f1_score:.4f}')
        
        ## Finding the Optimal Model Weights
        # "Saving the best model" means storing the weights from the epoch with the lowest validation loss so far.
        # Later epochs may overfit the training data, so keeping this checkpoint preserves the weights most likely to generalize to new, unseen data.
        # If the current validation loss is no higher than the previous minimum, save the model weights
        if epoch_valid_loss <= valid_loss_min: 
            print(f'\t### Validation Loss Decreased ({valid_loss_min / len(loader_valid):.4f} --> {epoch_valid_loss / len(loader_valid):.4f}). Saving model')
            # Save the model's state dict (weights) to a file
            torch.save(model.state_dict(), save_file) 
            valid_loss_min = epoch_valid_loss  # Update the minimum validation loss to the current epoch's loss
    
    # Load the saved state dict (the weights from the epoch with the lowest validation loss) and return it.
    # Note: this returns the weights, not the model; the caller passes them to model.load_state_dict().
    return torch.load(save_file)
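
The checkpoint rule inside `train()` can be illustrated without any model. The sketch below replays that comparison over a made-up list of validation losses and reports the epochs at which the saved file would be overwritten. The helper `epochs_that_save` and the loss values are hypothetical and exist only for this example.

```python
import math

def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which the checkpoint would be overwritten."""
    best = math.inf
    saved = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss <= best:   # same comparison as in train()
            saved.append(epoch)
            best = loss
    return saved

# Hypothetical validation losses for six epochs
losses = [0.52, 0.41, 0.44, 0.39, 0.47, 0.39]
print(epochs_that_save(losses))  # [1, 2, 4, 6]
```

Because the comparison uses `<=`, a tie overwrites the checkpoint with the later epoch (epoch 6 above). Switching to `<` would keep the earliest epoch that reached the minimum.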

In [None]:
# Training the First Model 
model_state_dict = train(
    # model=models_list[0]: The first model in the models_list is selected, which is EfficientNet B1.
    model = models_list[0],
    loader_train = loader_train, 
    loader_valid = loader_valid,
    criterion = criterion, 
    optimizer = optimizer1,
    scheduler = scheduler1,
    epochs = epochs
)

# Loading the Best Model Weights
# This line takes the best model weights returned from the training process and loads them back into EfficientNet B1 (the first model in the list).
# This ensures that after training, the model is set to the version that performed best on the validation set.
models_list[0].load_state_dict(model_state_dict)

In [None]:
model_state_dict = train(
    model = models_list[1],
    loader_train = loader_train, 
    loader_valid = loader_valid,
    criterion = criterion, 
    optimizer = optimizer2,
    scheduler = scheduler2,
    epochs = epochs
)

models_list[1].load_state_dict(model_state_dict)

In [None]:
model_state_dict = train(
    model = models_list[2],
    loader_train = loader_train, 
    loader_valid = loader_valid,
    criterion = criterion, 
    optimizer = optimizer3,
    scheduler = scheduler3,
    epochs = epochs
)

models_list[2].load_state_dict(model_state_dict)

## Perform Testing

In [None]:
# datasets_test: This loads the test data using the ImageFolder class. 
# It assumes the images in test_path are organized into subfolders, each corresponding to a different class. 
# The transform_test function is applied to each image to prepare them for the model (e.g., resizing, normalization).
datasets_test = ImageFolder(root = test_path, transform = transform_test)

# loader_test: This creates a DataLoader for the test dataset, specifying the batch size, number of workers for data loading, and the seed for reproducibility.
loader_test = DataLoader(
    dataset = datasets_test, 
    batch_size = batch_size, 
    shuffle = False, 
    worker_init_fn = seed_worker,
    generator = g, 
    num_workers = 2
)

In [None]:
# Purpose of the predict() function:
# This function is designed to evaluate a trained model on a test dataset. 
# It makes predictions for each image in the test set, compares them to the true labels, and returns the results. 
# You can use this function to:
# •	Get a list of predicted labels for the test set.
# •	Optionally return the true labels along with the predictions for further evaluation.


def predict(model, loader_test, return_true = False):
    # model.eval(): This sets the model to evaluation mode, which affects layers like dropout and batch normalization. It’s important to do this before making predictions.
    model.eval()    
    # preds_list and true_list: These are lists to store the predicted class labels (preds_list) and the actual true labels (true_list) for the images in the test dataset.
    # Initialize list to store predictions
    preds_list = [] 
    # Initialize list to store true labels
    true_list = []  
    
    # Disable gradient computation for predictions
    # with torch.no_grad(): This context disables gradient calculation, which saves memory and speeds up inference since gradients are not needed for testing.
    with torch.no_grad(): 
        # For each batch in loader_test, the images and their corresponding labels are moved to the appropriate device (GPU or CPU).
        for images, labels in loader_test:
            # Move images to device (GPU/CPU)
            images = images.to(device)  
            labels = labels.to(device)  
            
            # Get model's predictions
            outputs = model(images)  
            
            # Get predicted class labels
            # torch.max(outputs, dim=1) returns a (values, indices) pair for each row of outputs; [1] selects the indices, which are the predicted class labels.
            # .cpu(): Ensures the data is moved back to the CPU as tensors on GPUs can't be directly converted to NumPy arrays.
            # .numpy(): Converts the tensor to a NumPy array for easier handling.
            preds = torch.max(outputs.cpu(), dim=1)[1].numpy()  
            # Get true labels
            true = labels.cpu().numpy()  
            
            # Add predictions to the list
            preds_list.extend(preds)  
            # Add true labels to the list
            true_list.extend(true)    
    
    # If return_true is True, the function returns both the true labels (true_list) and the predictions (preds_list).
    if return_true:
        # Return both true labels and predictions
        return true_list, preds_list  
    else:
        # Only return predictions
        return preds_list  
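
The label-selection step in `predict()` takes the index of the largest output in each row. The sketch below shows the same idea on plain Python lists. It also shows that applying softmax first does not change the chosen class, which is why the raw outputs can be used directly. The helpers `softmax` and `argmax` and the logit values are assumptions made for this example, not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical model outputs for a batch of three images and two classes
batch_logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, -0.4]]

preds = [argmax(row) for row in batch_logits]
preds_from_probs = [argmax(softmax(row)) for row in batch_logits]

print(preds)             # [0, 1, 1]
print(preds_from_probs)  # [0, 1, 1]
```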

In [None]:
true_list, preds_list1 = predict(
    model = models_list[0], 
    loader_test = loader_test, 
    return_true = True
)

In [None]:
preds_list2 = predict(
    model = models_list[1], 
    loader_test = loader_test
)

In [None]:
preds_list3 = predict(
    model = models_list[2], 
    loader_test = loader_test
)

In [None]:
# Print the evaluation results for the efficientnet-b1 model
print('#'*5, 'efficientnet-b1 Model Prediction Evaluation Scores', '#'*5)
print(f'Accuracy: {accuracy_score(true_list, preds_list1):.4f}')
print(f'Recall: {recall_score(true_list, preds_list1):.4f}')
print(f'F1 Score: {f1_score(true_list, preds_list1):.4f}')

In [None]:
# Print the evaluation results for the efficientnet-b2 model
print('#'*5, 'efficientnet-b2 Model Prediction Evaluation Scores', '#'*5)
print(f'Accuracy: {accuracy_score(true_list, preds_list2):.4f}')
print(f'Recall: {recall_score(true_list, preds_list2):.4f}')
print(f'F1 Score: {f1_score(true_list, preds_list2):.4f}')

In [None]:
# Print the evaluation results for the efficientnet-b3 model
print('#'*5, 'efficientnet-b3 Model Prediction Evaluation Scores', '#'*5)
print(f'Accuracy: {accuracy_score(true_list, preds_list3):.4f}')
print(f'Recall: {recall_score(true_list, preds_list3):.4f}')
print(f'F1 Score: {f1_score(true_list, preds_list3):.4f}')
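
For reference, the three scores printed above can be derived by hand from the confusion counts. The following sketch is a plain-Python illustration of the binary definitions, treating label 1 as the positive class (the default for `recall_score` and `f1_score`). The helper `binary_scores` and the label lists are hypothetical and are not results from this notebook.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of actual positives that were found
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted positives that were correct
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, recall, f1

# Made-up labels: 5 positive cases and 3 negative cases
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]

accuracy, recall, f1 = binary_scores(y_true, y_pred)
print(f'Accuracy: {accuracy:.4f}')  # Accuracy: 0.6250
print(f'Recall: {recall:.4f}')      # Recall: 0.8000
print(f'F1 Score: {f1:.4f}')        # F1 Score: 0.7273
```

`ImageFolder` assigns class indices in alphabetical order of the folder names. Assuming the test folders are named `NORMAL` and `PNEUMONIA`, pneumonia is label 1, so recall here measures the share of pneumonia cases the model detects.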

## Ensemble Prediction

In [None]:
# Why Use Ensemble?
# Ensemble methods combine several models to improve overall prediction performance.
# Because different models tend to make different mistakes, combining their predictions reduces the impact of errors from any one model.
# Here the predicted class labels of three models (EfficientNet-B1, EfficientNet-B2, and EfficientNet-B3) are averaged and rounded, which amounts to a majority vote. This can reduce the variance of the individual models and lead to better generalization on the test data.

ensemble_preds = []

for i in range(len(preds_list1)):
    # pred_element = int(np.round((preds_list1[i] + preds_list2[i] + preds_list3[i]) / 3)):
    #   • preds_list1[i], preds_list2[i], and preds_list3[i] are the class labels (0 or 1) predicted by the three models (EfficientNet-B1, B2, and B3, respectively).
    #   • Their sum is divided by 3 to get the average, which can only be 0, 1/3, 2/3, or 1.
    #   • np.round(...) rounds the average to the nearest integer, so the result is 1 when at least two of the three models predict 1. This is a majority vote.
    #   • int(...) converts the rounded float back to an integer class label.
    pred_element = int(np.round((preds_list1[i] + preds_list2[i] + preds_list3[i]) / 3))
    ensemble_preds.append(pred_element)
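
The sketch below contrasts the two averaging strategies mentioned in the overview. Hard voting averages class labels, as the loop above does. Soft voting averages predicted probabilities before thresholding. Soft voting is not what this notebook does, since `predict()` returns labels only. The functions `hard_vote` and `soft_vote` and all sample values are hypothetical.

```python
def hard_vote(*model_preds):
    """Majority vote over per-model lists of 0/1 class labels."""
    n_models = len(model_preds)
    return [int(2 * sum(votes) > n_models) for votes in zip(*model_preds)]

def soft_vote(*model_probs, threshold=0.5):
    """Average per-model positive-class probabilities, then apply a threshold."""
    n_models = len(model_probs)
    return [int(sum(probs) / n_models >= threshold) for probs in zip(*model_probs)]

# Made-up positive-class probabilities from three models for four images
probs1 = [0.90, 0.20, 0.55, 0.40]
probs2 = [0.60, 0.10, 0.30, 0.70]
probs3 = [0.40, 0.30, 0.60, 0.55]

# The class labels each model would output on its own (probability >= 0.5)
preds1 = [1, 0, 1, 0]
preds2 = [1, 0, 0, 1]
preds3 = [0, 0, 1, 1]

print(hard_vote(preds1, preds2, preds3))  # [1, 0, 1, 1]
print(soft_vote(probs1, probs2, probs3))  # [1, 0, 0, 1]
```

The two strategies disagree on the third image: two models lean weakly toward class 1 while one leans strongly toward class 0, so the averaged probability falls below 0.5 even though the majority of labels is 1.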

## Result

In [None]:
print('#'*5, 'Final Ensemble Results Evaluation Scores', '#'*5)
print(f'Accuracy: {accuracy_score(true_list, ensemble_preds):.4f}')
print(f'Recall: {recall_score(true_list, ensemble_preds):.4f}')
print(f'F1 Score: {f1_score(true_list, ensemble_preds):.4f}')