# Image Classification using CNN Architectures | Assignment

1. What is a Convolutional Neural Network (CNN), and how does it differ from
traditional fully connected neural networks in terms of architecture and performance on
image data?
   - A **Convolutional Neural Network (CNN)** is a specialized type of deep learning model designed primarily for processing and analyzing visual data such as images and videos. Unlike traditional **fully connected neural networks (FNNs)**, where every neuron in one layer is connected to every neuron in the next, CNNs are built to take advantage of the **spatial structure** of images. They use **convolutional layers** that apply small filters or kernels across local regions of the input image to automatically detect important features like edges, textures, shapes, and patterns. These filters share weights across the image, which drastically reduces the number of trainable parameters compared to fully connected networks and helps the model generalize better. CNNs also include **pooling layers**, which downsample the feature maps, making the network more efficient and less sensitive to small shifts or distortions in the input image. Toward the end of the architecture, **fully connected layers** are often used for the final classification or regression tasks.

     In contrast, fully connected networks treat all input features equally and lose the spatial relationships between pixels when an image is flattened into a one-dimensional vector, making them inefficient for image processing. CNNs, on the other hand, preserve spatial hierarchies and learn from local to global features in a structured way. This architectural difference allows CNNs to achieve **superior performance** on image-related tasks such as classification, object detection, and facial recognition while being computationally more efficient and less prone to overfitting. In summary, CNNs outperform traditional neural networks on image data because they effectively capture spatial dependencies, require fewer parameters, and learn feature representations automatically without manual feature extraction.
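To make the weight-sharing argument concrete, the following is a minimal sketch that counts trainable parameters for one fully connected layer and one convolutional layer applied to the same input. The input size (224×224×3), the 64 output units or filters, and the 3×3 kernel are illustrative assumptions, not values taken from the answer above.

```python
# Illustrative parameter count: dense layer vs. convolutional layer.
# Assumed (hypothetical) setup: 224x224 RGB input, 64 output units / filters.

def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus one bias per filter; independent of image height and width."""
    return in_channels * out_channels * kernel * kernel + out_channels

height, width, channels = 224, 224, 3
fc = dense_params(height * width * channels, 64)
conv = conv_params(channels, 64, kernel=3)

print(f"Fully connected: {fc:,} parameters")
print(f"Convolutional:   {conv:,} parameters")
print(f"Ratio:           {fc / conv:,.0f}x")
```

Under these assumptions the dense layer needs 9,633,856 parameters and the convolutional layer 1,792, because the same small filter is reused at every image position.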


2. Discuss the architecture of LeNet-5 and explain how it laid the foundation
for modern deep learning models in computer vision. Include references to its original
research paper.
  - LeNet-5, developed by Yann LeCun and colleagues in 1998, is one of the earliest and most influential Convolutional Neural Network (CNN) architectures. It was introduced in the research paper “Gradient-Based Learning Applied to Document Recognition” (LeCun et al., 1998, Proceedings of the IEEE). It was designed primarily for handwritten digit recognition on the MNIST dataset and laid the groundwork for the deep learning revolution in computer vision. The LeNet-5 architecture consists of seven layers with trainable parameters, excluding the input. It takes a 32×32 grayscale image as input and passes it through alternating convolutional and subsampling (pooling) layers followed by fully connected layers. The first convolutional layer (C1) uses six 5×5 filters to extract basic features such as edges, producing six 28×28 feature maps. The second layer (S2) performs 2×2 average pooling, halving each map to 14×14 and making the representation less sensitive to small translations. The third layer (C3) applies sixteen 5×5 filters to learn more complex features, producing 10×10 maps, and is followed by another pooling layer (S4), which reduces them to 5×5. The fifth layer (C5) is a convolutional layer with 120 feature maps of size 1×1; because its 5×5 filters cover the entire S4 output, it behaves like a fully connected layer. The sixth layer (F6) is a fully connected layer with 84 neurons, leading finally to an output layer with 10 neurons representing the digit classes (0–9).

    LeNet-5 introduced several groundbreaking ideas such as local receptive fields, weight sharing, and subsampling, which drastically reduced computational complexity while preserving spatial relationships in images. Its design allowed the model to automatically learn hierarchical feature representations—from simple edges to complex shapes—without manual feature extraction. Although limited by the computational power of the 1990s, LeNet-5 became the conceptual blueprint for later, deeper networks such as AlexNet (2012), VGGNet (2014), and ResNet (2015). These modern CNNs expanded upon LeNet’s principles using larger datasets, more layers, and faster hardware. In essence, LeNet-5 demonstrated that end-to-end learning through convolution and pooling could effectively perform image recognition tasks, establishing the foundation for the modern era of deep learning in computer vision.
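The layer sizes described above can be checked with the standard convolution output-size formula. The sketch below is only an arithmetic trace of the feature-map sizes, not an implementation of the network; the helper name `out_size` is hypothetical.

```python
# Trace the spatial size of LeNet-5 feature maps with the standard formula:
# out = (in + 2 * padding - kernel) // stride + 1

def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    return (size + 2 * padding - kernel) // stride + 1

# (layer name, kernel size, stride, number of output maps)
layers = [
    ("C1", 5, 1, 6),    # convolution
    ("S2", 2, 2, 6),    # 2x2 average pooling
    ("C3", 5, 1, 16),   # convolution
    ("S4", 2, 2, 16),   # 2x2 average pooling
    ("C5", 5, 1, 120),  # convolution covering the whole S4 map
]

size = 32  # 32x32 grayscale input
shapes = {}
for name, kernel, stride, maps in layers:
    size = out_size(size, kernel, stride)
    shapes[name] = (maps, size, size)
    print(f"{name}: {maps} maps of {size}x{size}")
```

The trace gives 28, 14, 10, 5, and 1 pixels per side, so the S4 output holds 16 × 5 × 5 = 400 values feeding the 120 units of C5.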

3. Compare and contrast AlexNet and VGGNet in terms of design principles,
number of parameters, and performance. Highlight key innovations and limitations of
each.
   - AlexNet and VGGNet are two landmark convolutional neural network (CNN) architectures that significantly advanced the field of deep learning in computer vision. AlexNet, introduced by Krizhevsky, Sutskever, and Hinton in 2012, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin and marked the beginning of the deep learning revolution. It consists of 8 learned layers (5 convolutional and 3 fully connected) and popularized several techniques that are now standard: the ReLU activation function for faster training, dropout for regularization, and GPU acceleration to handle large-scale data efficiently. AlexNet also used local response normalization (LRN) and overlapping max pooling to improve generalization and feature extraction. It has around 60 million parameters, which was massive for its time, and achieved a top-5 error rate of about 15.3% on ImageNet. However, its design relied on large filter sizes (e.g., 11×11, 5×5) and multiple fully connected layers, which hold most of its parameters, making it computationally heavy and prone to overfitting.

     In contrast, VGGNet, proposed by Simonyan and Zisserman in 2014, focused on architectural simplicity and depth as its core design principle. VGGNet explored the effect of increasing depth and introduced models with 16 or 19 weight layers (VGG16 and VGG19) using a uniform design of 3×3 convolution filters and 2×2 max pooling throughout the network. Stacking several 3×3 convolutions covers the same receptive field as a single larger filter (three stacked 3×3 layers see a 7×7 region) while using fewer parameters and adding more non-linearities, which lets VGGNet learn complex features efficiently. It achieved a top-5 error rate of about 7.3% on ImageNet, significantly outperforming AlexNet. However, this performance came at the cost of a large increase in parameters (about 138 million for VGG16), which made VGGNet computationally expensive and memory-intensive. Despite its heavy architecture, VGGNet’s simplicity, modular design, and depth inspired the development of more advanced models such as ResNet and Inception.

     In summary, AlexNet pioneered the practical application of deep CNNs with innovations like ReLU, dropout, and GPU training, while VGGNet refined CNN architecture by emphasizing depth and uniform convolutional design. AlexNet demonstrated the potential of deep learning on large-scale image data, and VGGNet established the design principles of deeper and more structured CNNs. The main limitation of AlexNet was its large filter sizes and overfitting tendency, while VGGNet’s drawback was its high computational and memory cost. Nonetheless, both architectures were pivotal in shaping the evolution of modern deep learning models in computer vision.
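The trade-off between large and small filters can be illustrated with a short calculation. The sketch below compares a single 7×7 convolution with a stack of three 3×3 convolutions; the channel count of 256 and the assumption that every layer maps C channels to C channels (stride 1, biases ignored) are simplifications chosen for illustration.

```python
# Compare one large convolution with a stack of 3x3 convolutions that
# covers the same receptive field. Channel count C = 256 is an assumption.

def receptive_field(kernels):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

def conv_weights(kernels, channels):
    """Weight count (biases ignored) when every layer maps C -> C channels."""
    return sum(k * k * channels * channels for k in kernels)

C = 256
single = [7]          # one 7x7 layer
stacked = [3, 3, 3]   # three 3x3 layers, VGG style

rf_single, rf_stacked = receptive_field(single), receptive_field(stacked)
w_single, w_stacked = conv_weights(single, C), conv_weights(stacked, C)

print(f"Single 7x7 : receptive field {rf_single}, weights {w_single:,}")
print(f"Three 3x3  : receptive field {rf_stacked}, weights {w_stacked:,}")
print(f"Stack uses {w_stacked / w_single:.0%} of the weights")
```

Both options see a 7×7 region, but the stack needs 27C² weights instead of 49C², roughly 55% as many, and applies three non-linearities instead of one.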

4. What is transfer learning in the context of image classification? Explain
how it helps in reducing computational costs and improving model performance with
limited data.
    - **Transfer learning** in the context of **image classification** refers to the process of using a **pre-trained deep learning model**, which has already learned useful visual features from a large dataset (such as ImageNet), and adapting it to a new but related image classification task. Training a neural network from scratch requires vast amounts of labeled data and computational power; transfer learning instead lets the model reuse patterns such as edges, textures, and shapes that its earlier layers have already learned. These pre-learned features act as a strong foundation, enabling the model to learn new classes or tasks more efficiently. By fine-tuning only the later layers or adding new classification layers on top of the existing architecture, the model adapts to the new dataset with relatively little data. This approach significantly **reduces computational cost**, because most of the network's parameters are frozen and only a small part is updated, and it **improves model performance** when training data is limited, because the model benefits from the general knowledge already embedded in its weights. In essence, transfer learning accelerates training, typically improves accuracy, and reduces the risk of overfitting, making it one of the most effective techniques in modern image classification.
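As a rough illustration of the cost saving, the sketch below counts how many parameters are frozen and how many are trained when a VGG16 convolutional base is kept fixed and only a new head is trained. The head (flatten, a 256-unit dense layer, a 5-class output) and the 224×224 input are hypothetical choices for this example.

```python
# Count frozen vs. trainable parameters for a frozen VGG16 convolutional base
# with a small new classification head. The head is a hypothetical example.

# VGG16 convolutional blocks: (number of 3x3 conv layers, output channels)
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def conv_base_params(blocks, in_channels=3):
    """Weights and biases of all 3x3 convolutions in the base."""
    total = 0
    for n_layers, out_channels in blocks:
        for _ in range(n_layers):
            total += 3 * 3 * in_channels * out_channels + out_channels
            in_channels = out_channels
    return total

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

frozen = conv_base_params(VGG16_BLOCKS)

# A 224x224 input is halved by each of the five pooling layers: 224 / 2**5 = 7
flat = 7 * 7 * 512
head = dense_params(flat, 256) + dense_params(256, 5)

print(f"Frozen base parameters : {frozen:,}")
print(f"Trainable head params  : {head:,}")
print(f"Share that is trained  : {head / (frozen + head):.1%}")
```

With these assumptions roughly 14.7 million parameters stay frozen and about 6.4 million are trained, so gradients and optimizer state are needed for under a third of the network.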


5. Describe the role of residual connections in ResNet architecture. How do
they address the vanishing gradient problem in deep CNNs?
   - In the ResNet (Residual Network) architecture, residual connections—also known as skip connections—play a crucial role in enabling the training of very deep convolutional neural networks. Introduced by He et al. (2015) in the paper “Deep Residual Learning for Image Recognition,” residual connections were designed to address the vanishing gradient problem, which commonly occurs when networks become very deep. In traditional deep CNNs, as the number of layers increases, gradients propagated backward during training can become extremely small, causing earlier layers to learn very slowly or stop learning altogether. This limits the depth and performance of the model.

     Residual connections solve this issue by allowing the input of a layer to bypass one or more intermediate layers and be directly added to the output of those layers. Mathematically, instead of learning a direct mapping $H(x)$, the network learns a residual function $F(x) = H(x) - x$, and the block output is then formed as $H(x) = F(x) + x$. This simple addition ensures that if deeper layers fail to learn useful transformations, the network can still preserve the original input information through the skip connection. Because the derivative of $F(x) + x$ with respect to $x$ is $F'(x) + 1$, the identity term gives gradients a direct path backward through every block, so they flow more easily during backpropagation and are far less likely to vanish.

     By facilitating smoother gradient propagation, residual connections make it feasible to train networks with hundreds or even thousands of layers without degradation in accuracy. They also improve convergence speed and generalization performance. In essence, ResNet’s residual connections enable deeper and more stable learning by reformulating the learning objective into a simpler residual mapping, effectively overcoming one of the key challenges in training very deep CNNs.
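The effect of the identity path on gradients can be seen in a toy scalar model. In the sketch below each "layer" is simply F(x) = w·x with a small weight; the values of `w` and `depth` are arbitrary assumptions chosen to make the effect visible, and the model is not a real CNN.

```python
# Toy scalar model of gradient flow through a deep stack of layers.
# Each "layer" is F(x) = w * x with a small weight w (an assumption chosen
# to make the vanishing effect visible); this is not a real CNN.

def plain_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for y = F(F(...F(x))): a product of `depth` factors w."""
    grad = 1.0
    for _ in range(depth):
        grad *= w
    return grad

def residual_gradient(w: float, depth: int) -> float:
    """Same stack with skip connections, y = x + F(x): each factor is 1 + w."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w
    return grad

w, depth = 0.01, 50
plain = plain_gradient(w, depth)
residual = residual_gradient(w, depth)

print(f"Plain stack    : gradient = {plain:.3e}")
print(f"Residual stack : gradient = {residual:.3f}")
```

In the plain stack the gradient shrinks to a vanishingly small number, while with skip connections it stays on the order of one because every factor contains the identity term. Real networks are more complicated than this scalar picture, so it should be read only as an intuition for the "+ x" term.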

6. Implement the LeNet-5 architecture using TensorFlow or PyTorch to
classify the MNIST dataset. Report the accuracy and training time.
(Include your Python code and output in the code box below.)

In [1]:
# LeNet-5 Implementation on MNIST using PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

# -------------------------------
# 1. Device Configuration
# -------------------------------
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# -------------------------------
# 2. Data Loading and Preprocessing
# -------------------------------
transform = transforms.Compose([
    transforms.Resize((32, 32)),      # LeNet-5 expects 32x32 input
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1000, shuffle=False)

# -------------------------------
# 3. Define LeNet-5 Architecture
# -------------------------------
class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.tanh(self.conv1(x))
        x = self.pool(x)
        x = self.tanh(self.conv2(x))
        x = self.pool(x)
        x = x.view(-1, 16 * 5 * 5)
        x = self.tanh(self.fc1(x))
        x = self.tanh(self.fc2(x))
        x = self.fc3(x)
        return x

# -------------------------------
# 4. Initialize Model, Loss, and Optimizer
# -------------------------------
model = LeNet5().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# -------------------------------
# 5. Training Loop
# -------------------------------
num_epochs = 5
start_time = time.time()

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")

training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.2f} seconds")

# -------------------------------
# 6. Evaluation
# -------------------------------
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # index of the highest logit per image
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")


Using device: cpu


100%|██████████| 9.91M/9.91M [00:00<00:00, 37.8MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 1.10MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 10.1MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 1.52MB/s]


Epoch [1/5], Loss: 0.2558
Epoch [2/5], Loss: 0.0775
Epoch [3/5], Loss: 0.0542
Epoch [4/5], Loss: 0.0432
Epoch [5/5], Loss: 0.0333

Training completed in 221.60 seconds
Test Accuracy: 98.43%
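The figures printed above can be turned into a few derived numbers for the report. This is a small arithmetic sketch: the accuracy, epoch count, and training time are copied from the output above, while the dataset sizes (60,000 training and 10,000 test images) are the standard MNIST split sizes and are not printed by the cell.

```python
# Derived summary of the run above. Accuracy, epochs, and time come from the
# printed output; the dataset sizes are the standard MNIST splits (assumed).

test_accuracy_pct = 98.43
num_epochs = 5
training_time_s = 221.60
n_train, n_test = 60_000, 10_000

errors = round(n_test * (1 - test_accuracy_pct / 100))
sec_per_epoch = training_time_s / num_epochs
images_per_sec = n_train * num_epochs / training_time_s

print(f"Misclassified test images : {errors} of {n_test}")
print(f"Time per epoch            : {sec_per_epoch:.2f} s")
print(f"Training throughput       : {images_per_sec:.0f} images/s")
```

The throughput applies to the CPU run shown above and would differ on other hardware.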


7. Use a pre-trained VGG16 model (via transfer learning) on a small custom
dataset (e.g., flowers or animals). Replace the top layers and fine-tune the model.
Include your code and result discussion.
(Include your Python code and output in the code box below.)

In [None]:
# Transfer Learning using Pre-trained VGG16 on a Small Custom Dataset (Flowers)

import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
import time

# -------------------------------
# 1. Load Pre-trained VGG16 Model
# -------------------------------
# Load the VGG16 model without the top (fully connected) layers
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the base model layers to retain pre-trained ImageNet features
for layer in base_model.layers:
    layer.trainable = False

# -------------------------------
# 2. Add Custom Classification Layers
# -------------------------------
model = Sequential([
    base_model,
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(5, activation='softmax')  # Example: 5 flower classes
])

# -------------------------------
# 3. Data Preparation
# -------------------------------
# Assume directory structure:
# dataset/
#   ├── train/
#   │    ├── daisy/
#   │    ├── rose/
#   │    ├── sunflower/
#   │    ├── tulip/
#   │    └── dandelion/
#   ├── val/
#   │    ├── daisy/
#   │    ├── rose/
#   │    └── ...

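# VGG16's ImageNet weights expect the Keras VGG16 preprocessing
# (RGB -> BGR plus per-channel mean subtraction), not a plain 1/255 rescale.
from tensorflow.keras.applications.vgg16 import preprocess_input

train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Validation images get the same preprocessing but no augmentation
val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)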

train_dir = 'dataset/train'
val_dir = 'dataset/val'

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)

val_generator = val_datagen.flow_from_directory(
    val_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)

# -------------------------------
# 4. Compile and Train the Model
# -------------------------------
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

start_time = time.time()

history = model.fit(
    train_generator,
    epochs=5,
    validation_data=val_generator
)

training_time = time.time() - start_time

# -------------------------------
# 5. Fine-tuning (Optional)
# -------------------------------
# Unfreeze the last four layers of the base (block5: three conv layers and its pooling layer)
for layer in base_model.layers[-4:]:
    layer.trainable = True

model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

fine_tune_history = model.fit(
    train_generator,
    epochs=3,
    validation_data=val_generator
)

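# -------------------------------
# 6. Evaluate the Model
# -------------------------------
# Measure elapsed time here so that it covers both head training and fine-tuning
total_training_time = time.time() - start_time

loss, acc = model.evaluate(val_generator)
print(f"\nFinal Validation Accuracy: {acc*100:.2f}%")
print(f"Total Training Time: {total_training_time:.2f} seconds")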


8. Write a program to visualize the filters and feature maps of the first
convolutional layer of AlexNet on an example input image.
(Include your Python code and output in the code box below.)

In [None]:
# Visualizing Filters and Feature Maps of the First Convolutional Layer in AlexNet

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# -------------------------------
# 1. Load Pre-trained AlexNet Model
# -------------------------------
alexnet = torchvision.models.alexnet(weights='IMAGENET1K_V1')
alexnet.eval()  # set to evaluation mode

# -------------------------------
# 2. Load and Preprocess an Example Image
# -------------------------------
# Use any sample image path (replace 'sample.jpg' with your image file)
img_path = 'sample.jpg'

# Apply the preprocessing that the ImageNet-trained AlexNet expects
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open(img_path).convert('RGB')
input_tensor = transform(image).unsqueeze(0)  # add batch dimension

# -------------------------------
# 3. Visualize Filters of First Conv Layer
# -------------------------------
filters = alexnet.features[0].weight.data.clone()

print(f"Shape of first layer filters: {filters.shape}")  # (64, 3, 11, 11)

# Normalize filters to 0-1 for visualization
filters = (filters - filters.min()) / (filters.max() - filters.min())

# Plot the first 32 filters in a 4 x 8 grid
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    if i < 32:
        img = filters[i].permute(1, 2, 0).numpy()
        ax.imshow(img)
        ax.axis('off')
plt.suptitle("Filters of the First Convolutional Layer in AlexNet", fontsize=14)
plt.show()

# -------------------------------
# 4. Extract Feature Maps from First Conv Layer
# -------------------------------
with torch.no_grad():
    feature_maps = alexnet.features[0](input_tensor)

print(f"Feature map shape: {feature_maps.shape}")  # (1, 64, 55, 55) for a 224x224 input

# Normalize for visualization
feature_maps = feature_maps.squeeze(0)
feature_maps = (feature_maps - feature_maps.min()) / (feature_maps.max() - feature_maps.min())

# Plot first 16 feature maps
fig, axes = plt.subplots(4, 4, figsize=(10, 8))
for i, ax in enumerate(axes.flat):
    if i < 16:
        ax.imshow(feature_maps[i].cpu().numpy(), cmap='gray')
        ax.axis('off')
plt.suptitle("Feature Maps from the First Convolutional Layer (AlexNet)", fontsize=14)
plt.show()


9. Train a GoogLeNet (Inception v1) or its variant using a standard dataset
like CIFAR-10. Plot the training and validation accuracy over epochs and analyze
overfitting or underfitting.
(Include your Python code and output in the code box below.)

In [None]:
# Train GoogLeNet (Inception v1) on CIFAR-10 and analyze overfitting/underfitting

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision import models
import matplotlib.pyplot as plt
import time

# -------------------------------------
# 1. Device configuration
# -------------------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# -------------------------------------
# 2. Data Preprocessing and Loading
# -------------------------------------
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                             download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128,
                                           shuffle=True, num_workers=2)

test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                            download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128,
                                          shuffle=False, num_workers=2)

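# -------------------------------------
# 3. Build GoogLeNet (Randomly Initialized) with a 10-Class Output
# -------------------------------------
# aux_logits=False disables the auxiliary classifiers so the model returns a
# single logits tensor in training mode; with them enabled it returns a named
# tuple, which nn.CrossEntropyLoss cannot consume directly.
model = models.googlenet(weights=None, num_classes=10,
                         aux_logits=False, init_weights=True)  # train from scratch
model = model.to(device)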

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# -------------------------------------
# 4. Train the Model
# -------------------------------------
num_epochs = 10
train_acc_history = []
val_acc_history = []

start_time = time.time()

for epoch in range(num_epochs):
    model.train()
    correct, total, running_loss = 0, 0, 0.0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()

        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    train_acc = 100 * correct / total
    train_acc_history.append(train_acc)

    # Validation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    val_acc = 100 * correct / total
    val_acc_history.append(val_acc)

    print(f"Epoch [{epoch+1}/{num_epochs}] | Train Acc: {train_acc:.2f}% | Val Acc: {val_acc:.2f}%")

training_time = time.time() - start_time
print(f"\nTraining Completed in {training_time:.2f} seconds")

# -------------------------------------
# 5. Plot Training and Validation Accuracy
# -------------------------------------
plt.figure(figsize=(8,5))
plt.plot(range(1, num_epochs+1), train_acc_history, label='Training Accuracy')
plt.plot(range(1, num_epochs+1), val_acc_history, label='Validation Accuracy')
plt.title("GoogLeNet on CIFAR-10: Training vs Validation Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.grid(True)
plt.show()

# -------------------------------------
# 6. Analysis of Overfitting or Underfitting
# -------------------------------------
if val_acc_history[-1] < train_acc_history[-1] - 5:
    print("Model shows signs of OVERFITTING — high training accuracy but lower validation accuracy.")
elif val_acc_history[-1] < 70 and train_acc_history[-1] < 70:
    print("Model may be UNDERFITTING — both training and validation accuracy are low.")
else:
    print("Model seems well-balanced — no major overfitting or underfitting detected.")
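The check above looks only at the final epoch. A per-epoch view can be sketched as follows, reusing the same thresholds (a gap of 5 points and an accuracy floor of 70%). The function name `diagnose` is hypothetical, and the two accuracy lists are made-up placeholder values, not results from the training run.

```python
# Per-epoch diagnosis from accuracy histories. The numbers below are
# made-up placeholder values, not results from the training run above.

def diagnose(train_acc, val_acc, gap_threshold=5.0, low_threshold=70.0):
    """Return the best validation epoch (1-based) and a label for each epoch."""
    labels = []
    for tr, va in zip(train_acc, val_acc):
        if tr - va > gap_threshold:
            labels.append("overfitting")
        elif tr < low_threshold and va < low_threshold:
            labels.append("underfitting")
        else:
            labels.append("balanced")
    best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
    return best_epoch, labels

train_hist = [45.0, 62.0, 74.0, 83.0, 90.0, 95.0]  # hypothetical values
val_hist = [44.0, 60.0, 71.0, 76.0, 77.0, 76.5]    # hypothetical values

best_epoch, labels = diagnose(train_hist, val_hist)
for epoch, label in enumerate(labels, start=1):
    print(f"Epoch {epoch}: {label}")
print(f"Best validation accuracy at epoch {best_epoch}")
```

Passing `train_acc_history` and `val_acc_history` from the training loop instead of the placeholder lists would show at which epoch the gap opens, which is a natural point for early stopping.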


10. You are working in a healthcare AI startup. Your team is tasked with
developing a system that automatically classifies medical X-ray images into normal,
pneumonia, and COVID-19. Due to limited labeled data, what approach would you
suggest using among CNN architectures discussed (e.g., transfer learning with ResNet
or Inception variants)? Justify your approach and outline a deployment strategy for
production use.
(Include your Python code and output in the code box below.)
    - Why transfer learning with ResNet (brief)

       ResNet-style models (ResNet-50/101) provide a strong balance of representational power and trainability because residual connections permit much deeper models to be optimized reliably. Pretrained ImageNet ResNets capture generic visual features in early layers (edges, textures) that transfer well to medical images; this is well-established in surveys of transfer learning for medical imaging. With limited labeled X-rays, starting from pretrained weights and fine-tuning only the head (and later some deeper blocks) reduces required training data and compute while improving generalization.

       High-level training strategy (recommended)

       Preprocessing & augmentation: resize, intensity normalization, robust augmentations (rotation, translation, random contrast, simulated noise, elastic transforms).

       Stage-wise training:

       Stage 1 (head training): freeze backbone, train new classifier head (few epochs).

      Stage 2 (fine-tuning): unfreeze last N blocks (e.g., last ResNet stage), train with a small learning rate. Optionally do multistage transfer learning using intermediate medical pretraining if available.

      Loss & balancing: use class weights or focal loss if classes are imbalanced (COVID may be rarer).

      Validation: use stratified k-fold cross-validation (patient-level splits) to avoid data leakage. Report AUROC per class, sensitivity, specificity, PPV/NPV, and calibration (Brier score).

      Explainability & failure modes: generate Grad-CAM maps for clinicians to inspect important regions.

      Uncertainty & triage: produce a calibrated probability and flag low-confidence cases for human review (human-in-the-loop). Use MC-dropout or deep ensembles for uncertainty estimation.

      Robustness checks: test on external datasets and varied acquisition settings (portable X-rays, different hospitals).

      Clinical validation: prospective study comparing model + clinician vs clinician alone, with appropriate IRB and regulatory steps.

      Deployment strategy (production)

      Containerize model (Docker), serve with a model server (TorchServe/TF-Serving) behind a secure REST/gRPC API.

      Inference pipeline: DICOM ingest → preproc (windowing, resizing) → model → postproc (thresholding, calibrated probability) → Grad-CAM overlay for review → store result in PACS/EMR.

      Monitoring & MLOps: log inputs, predictions, confidences, and clinician feedback; track data drift and performance metrics; implement automated alerts if performance drops.

      Retraining & change management: follow a documented lifecycle and change control (data curation, retraining criteria, validation) in line with regulatory guidance for AI/ML medical devices.

      Privacy & security: encrypt data at rest and transit; follow HIPAA/GDPR as applicable; perform threat modelling for adversarial inputs.

      Human-in-the-loop: model outputs are advisory—present probability, heatmap, and a clear statement that the model is a decision-support tool.

      Safety & regulatory notes

      This is a diagnostic-adjacent, high-risk task. Do not deploy for autonomous decision making. Perform prospective clinical validation and seek regulatory approvals (FDA/CE) as required. The FDA has evolving guidance for AI/ML SaMD and lifecycle management; incorporate those recommendations early.
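The patient-level split recommended in the validation step is easy to get wrong, so a minimal sketch may help. It assumes each record is a `(patient_id, image_path, label)` tuple; the record layout, file names, and the function `patient_level_split` are hypothetical. The sketch produces a single hold-out split and does not stratify by class, so it is a simplification of the stratified k-fold scheme described above.

```python
import random
from collections import defaultdict

def patient_level_split(records, val_fraction=0.2, seed=0):
    """Split (patient_id, image_path, label) records so that no patient
    appears in both the training and the validation set."""
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record[0]].append(record)

    # Shuffle patients, not images, so all images of a patient stay together
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_ids, train_ids = patients[:n_val], patients[n_val:]

    train = [r for p in train_ids for r in by_patient[p]]
    val = [r for p in val_ids for r in by_patient[p]]
    return train, val

# Hypothetical records: 10 patients with 2 images each
CLASSES = ["normal", "pneumonia", "covid"]
records = [
    (f"patient_{i:02d}", f"patient_{i:02d}_view{j}.png", CLASSES[i % 3])
    for i in range(10)
    for j in range(2)
]

train_records, val_records = patient_level_split(records)
train_patients = {r[0] for r in train_records}
val_patients = {r[0] for r in val_records}

print(f"Train images: {len(train_records)}, validation images: {len(val_records)}")
print(f"Patients in both sets: {len(train_patients & val_patients)}")
```

Splitting by image instead of by patient would let near-duplicate views of the same person land on both sides and inflate validation scores. For the full stratified, grouped k-fold version, a library splitter such as scikit-learn's `StratifiedGroupKFold` is one option.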

In [None]:
# resnet_transfer_xray.py
# Transfer learning (ResNet50) for 3-class X-ray classification + Grad-CAM
# NOTE: This is a template. Replace dataset paths and tune hyperparams.

import os
import time
import copy
import numpy as np
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader
import torch.nn.functional as F

# ---------- config ----------
DATA_DIR = "xray_dataset"  # expected: train/val/test subfolders with class subdirs
BATCH_SIZE = 32
NUM_EPOCHS_HEAD = 5
NUM_EPOCHS_FINE = 5
NUM_CLASSES = 3
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL_SAVE = "resnet50_xray.pt"
# ----------------------------

# ---------- transforms ----------
train_tf = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(224, scale=(0.85,1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    # ImageFolder loads images as 3-channel RGB, so use the ImageNet statistics
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
# ------------------------------

# ---------- datasets & loaders ----------
train_ds = datasets.ImageFolder(os.path.join(DATA_DIR, "train"), transform=train_tf)
val_ds   = datasets.ImageFolder(os.path.join(DATA_DIR, "val"), transform=val_tf)
test_ds  = datasets.ImageFolder(os.path.join(DATA_DIR, "test"), transform=val_tf)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
test_loader  = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)

class_names = train_ds.classes
print("Classes:", class_names)

# ---------- model setup ----------
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# replace final FC
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, NUM_CLASSES)
model = model.to(DEVICE)

# Freeze the whole backbone initially; only the new fc head stays trainable
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
# scheduler optional
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

# ---------- training helper ----------
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for imgs, labels in loader:
        imgs = imgs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return running_loss/total, correct/total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    all_probs = []
    all_labels = []
    with torch.no_grad():
        for imgs, labels in loader:
            imgs = imgs.to(device)
            labels = labels.to(device)
            outputs = model(imgs)
            probs = F.softmax(outputs, dim=1)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * imgs.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
            all_probs.append(probs.cpu().numpy())
            all_labels.append(labels.cpu().numpy())
    return running_loss/total, correct/total, np.concatenate(all_probs), np.concatenate(all_labels)

# ---------- Stage 1: train head ----------
print("Stage 1: training head only")
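best_val_acc = 0.0
# start from the current weights so best_model_wts is always defined, even if val_acc never improves
best_model_wts = copy.deepcopy(model.state_dict())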
history = {"train_loss":[], "train_acc":[], "val_loss":[], "val_acc":[]}
for epoch in range(NUM_EPOCHS_HEAD):
    t0 = time.time()
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, DEVICE)
    val_loss, val_acc, _, _ = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step()
    history["train_loss"].append(train_loss)
    history["train_acc"].append(train_acc)
    history["val_loss"].append(val_loss)
    history["val_acc"].append(val_acc)
    print(f"Epoch {epoch+1}/{NUM_EPOCHS_HEAD}  train_loss={train_loss:.4f} train_acc={train_acc:.4f}  val_loss={val_loss:.4f} val_acc={val_acc:.4f}  time={time.time()-t0:.1f}s")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model_wts = copy.deepcopy(model.state_dict())

# ---------- Stage 2: fine-tune last layers ----------
print("Stage 2: fine-tuning last layers")
# unfreeze last conv block (layer4) and fc
for name, param in model.named_parameters():
    if "layer4" in name or "fc" in name:
        param.requires_grad = True

optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(NUM_EPOCHS_FINE):
    t0 = time.time()
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, DEVICE)
    val_loss, val_acc, _, _ = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step()
    history["train_loss"].append(train_loss)
    history["train_acc"].append(train_acc)
    history["val_loss"].append(val_loss)
    history["val_acc"].append(val_acc)
    print(f"Fine Epoch {epoch+1}/{NUM_EPOCHS_FINE}  train_loss={train_loss:.4f} train_acc={train_acc:.4f}  val_loss={val_loss:.4f} val_acc={val_acc:.4f}  time={time.time()-t0:.1f}s")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model_wts = copy.deepcopy(model.state_dict())

# save best model
model.load_state_dict(best_model_wts)
torch.save(model.state_dict(), MODEL_SAVE)
print(f"Saved best model with val_acc={best_val_acc:.4f}")

# ---------- Evaluate on test set ----------
test_loss, test_acc, test_probs, test_labels = evaluate(model, test_loader, criterion, DEVICE)
print(f"Test Acc: {test_acc:.4f}  Test Loss: {test_loss:.4f}")

# ---------- Grad-CAM utility (simple) ----------
# This is a minimal Grad-CAM implementation for ResNet last conv layer 'layer4'
class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.model.eval()
        self.gradients = None
        self.activations = None
        self.target_layer = target_layer
        # register hooks
        def forward_hook(module, inp, out):
            self.activations = out.detach()
        def backward_hook(module, grad_in, grad_out):
            self.gradients = grad_out[0].detach()
        target_layer.register_forward_hook(forward_hook)
        # register_backward_hook is deprecated for modules; use the full backward hook instead
        target_layer.register_full_backward_hook(backward_hook)

    def __call__(self, input_tensor, class_idx=None):
        input_tensor = input_tensor.to(next(self.model.parameters()).device)
        output = self.model(input_tensor)
        if class_idx is None:
            class_idx = output.argmax(dim=1).item()
        loss = output[0, class_idx]
        self.model.zero_grad()
        loss.backward()  # each call runs a fresh forward pass, so the graph need not be retained
        grads = self.gradients[0]            # C x H x W
        acts  = self.activations[0]          # C x H x W
        weights = grads.mean(dim=(1,2))      # C
        cam = (weights.view(-1,1,1) * acts).sum(dim=0)
        cam = F.relu(cam)
        cam = cam - cam.min()
        if cam.max() > 0:
            cam = cam / cam.max()
        cam_np = cam.cpu().numpy()
        return cam_np

# Example usage of GradCAM on one test image
import matplotlib.pyplot as plt
model.eval()
sample_img, sample_label = test_ds[0]  # already a normalized tensor (val_tf is applied by the dataset)
input_tensor = sample_img.unsqueeze(0)
gcam = GradCAM(model, model.layer4)
cam_map = gcam(input_tensor)
# undo the ImageNet normalization so the image displays with its original colors
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
img_np = np.clip(sample_img.permute(1,2,0).numpy() * std + mean, 0.0, 1.0)
h, w = img_np.shape[:2]
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.title("Input")
plt.imshow(img_np)
plt.axis('off')
plt.subplot(1,2,2)
plt.title("Grad-CAM")
plt.imshow(img_np)
# stretch the coarse CAM (7x7 for a 224x224 input) over the full image so the overlay lines up
plt.imshow(cam_map, cmap='jet', alpha=0.45, extent=(-0.5, w-0.5, h-0.5, -0.5), interpolation='bilinear')
plt.axis('off')
plt.show()
