# Guide Dog Classifier: Fine-Tuning Without Optimization  

**Objectives**  
- Load the pre-trained model with frozen-head weights  
- Unfreeze the convolutional layers and fine-tune on our dataset with a low learning rate
- Encounter and analyze OOM errors on a 4 GB GPU
- Profile VRAM usage with `nvidia-smi`
- Reset the cache and kernel

Let’s start!

# Importing the necessary libraries (again) and setting up the DataLoaders, transforms, etc.

In [1]:
# Core Python and system utilities
import os
import time

# Numerical and data handling
import numpy as np

# PyTorch and related libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Metrics and evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Progress tracking
from tqdm import tqdm


# reproducibility
torch.manual_seed(42)
np.random.seed(42)


In [2]:
# CUDA check and setup
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    device = torch.device("cuda")
    torch.cuda.manual_seed_all(42)
    torch.cuda.empty_cache()  # Clear any residual memory
    torch.backends.cudnn.benchmark = False  # Disable for deterministic results
    torch.backends.cudnn.enabled = True
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Initial VRAM Allocated: {torch.cuda.memory_allocated(0)/1024**2:.2f} MB")
    print(f"Initial VRAM Reserved: {torch.cuda.memory_reserved(0)/1024**2:.2f} MB")
else:
    device = torch.device("cpu")
    print("Using CPU. Performance may be slow.")

# Define dataset paths
dataset_root = "./dataset"
train_path = os.path.join(dataset_root, "train")
val_path = os.path.join(dataset_root, "val")
test_path = os.path.join(dataset_root, "test")

CUDA Available: True
GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Initial VRAM Allocated: 0.00 MB
Initial VRAM Reserved: 0.00 MB


In [3]:
# Get the transforms from the model weights 
transform = models.EfficientNet_B3_Weights.IMAGENET1K_V1.transforms()
# Define custom training transforms
train_transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomResizedCrop(300, scale=(0.8, 1.0)),
    transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET), #Applies learned augmentation policies for robustness.
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])


In [4]:
# Load datasets using ImageFolder 
train_dataset = datasets.ImageFolder(train_path)
val_dataset = datasets.ImageFolder(val_path)
test_dataset = datasets.ImageFolder(test_path)

# Create DataLoaders
batch_size = 64
num_workers = 1  # number of subprocesses
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers, 
    pin_memory=True,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=True
)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=True
)

train_dataset.transform = train_transform
val_dataset.transform = transform
test_dataset.transform = transform

# Verify dataset sizes and classes
try:
    assert len(train_dataset.classes) == 2, "Expected 2 classes: guide_dogs, non_guide_dogs"
    print(f"Train samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print(f"Test samples: {len(test_dataset)}")
    print(f"Classes: {train_dataset.classes}")
except AssertionError as e:
    print(f"Dataset Error: {e}")
    raise
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

Train samples: 2016
Validation samples: 224
Test samples: 572
Classes: ['guide_dogs', 'non_guide_dogs']


## Configuring EfficientNet-B3 for Fine-Tuning  

#### Model Loading

In [5]:
model = models.efficientnet_b3(weights=None)
# replace classifier
model.classifier = nn.Sequential(
    nn.Linear(1536, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 1)
)
# load frozen-head weights
model.load_state_dict(torch.load("./model/frozen_model.pth",weights_only=True))


<All keys matched successfully>

### How does model loading actually work?
- First, we define the skeleton of the model (its architecture).
- The weights are then loaded from the state dict that we saved in the previous notebook with `torch.save(model.state_dict(), "./model/frozen_model.pth")`.
- For security reasons we pass `weights_only=True`, so `torch.load` restores only tensors and other plain data from that file rather than arbitrary pickled Python objects. <br>

**Note**: unlike the first time, when we created the model with `model = models.efficientnet_b3(weights='IMAGENET1K_V1')`, we did not load the pre-trained ImageNet weights here. We replaced `weights='IMAGENET1K_V1'` with `weights=None` so the pretrained model is not loaded again.
  <br> **But why?** <br>
The `torch.save` call stored every weight of the model, so `model.load_state_dict(torch.load("./model/frozen_model.pth",weights_only=True))` restores all of them, including the ones that were frozen. There is no need to load them twice.
The `weights` parameter already defaults to `None`, but I set it explicitly anyway.

Saving and loading a model this way falls under a broader topic called *serialization/deserialization*, which you can explore on your own.
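To make the save/load round trip concrete, here is a minimal sketch on a tiny stand-in network. The helper `build_tiny_model` and the in-memory buffer are hypothetical and only for illustration; the notebook itself uses EfficientNet-B3 and a file on disk. The point is that the architecture has to be rebuilt first, and `load_state_dict` then fills it with the saved tensors.

```python
import io

import torch
import torch.nn as nn


def build_tiny_model() -> nn.Module:
    # Hypothetical stand-in for the real architecture ("the skeleton").
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))


torch.manual_seed(0)
source = build_tiny_model()

# Serialize only the state dict (parameter tensors), not the Python object.
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# Deserialize into a freshly built model with different random weights.
restored = build_tiny_model()
result = restored.load_state_dict(torch.load(buffer, weights_only=True))

# Compare every tensor of the two models.
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(source.state_dict().values(), restored.state_dict().values())
)
print(result)
print("weights identical:", weights_match)
```

If the rebuilt architecture did not match the saved keys, `load_state_dict` would raise an error by default, which is why the classifier head has to be replaced before loading.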

#### Let's review which layers we'll unfreeze to fine-tune this model


In [6]:
# print(model)

In [7]:
for p in model.parameters():
    p.requires_grad = True

In [8]:
for idx, param in enumerate(model.parameters()):
    print(f"({idx}, {param.requires_grad})", end=' ')

(0, True) (1, True) (2, True) (3, True) (4, True) (5, True) (6, True) (7, True) (8, True) (9, True) (10, True) (11, True) (12, True) (13, True) (14, True) (15, True) (16, True) (17, True) (18, True) (19, True) (20, True) (21, True) (22, True) (23, True) (24, True) (25, True) (26, True) (27, True) (28, True) (29, True) (30, True) (31, True) (32, True) (33, True) (34, True) (35, True) (36, True) (37, True) (38, True) (39, True) (40, True) (41, True) (42, True) (43, True) (44, True) (45, True) (46, True) (47, True) (48, True) (49, True) (50, True) (51, True) (52, True) (53, True) (54, True) (55, True) (56, True) (57, True) (58, True) (59, True) (60, True) (61, True) (62, True) (63, True) (64, True) (65, True) (66, True) (67, True) (68, True) (69, True) (70, True) (71, True) (72, True) (73, True) (74, True) (75, True) (76, True) (77, True) (78, True) (79, True) (80, True) (81, True) (82, True) (83, True) (84, True) (85, True) (86, True) (87, True) (88, True) (89, True) (90, True) (91, True

In [9]:
# Send the model to the GPU
model = model.to(device)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable}")

Trainable params: 11090473


In [10]:
!nvidia-smi

Tue Jun  3 17:16:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   46C    P0             11W /   65W |     157MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Fine-Tuning and OOM Demonstration

We fine-tune the model with:
- **Loss**: Binary cross-entropy on raw logits (`BCEWithLogitsLoss`).
- **Optimizer**: Adam with learning rate 0.0001 (lower for fine-tuning).
- **Batch Size**: 64 (as set in the DataLoaders above), which increases VRAM demand.
- **Epochs**: 10 (if successful).
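The training cell uses `nn.BCEWithLogitsLoss`, which takes raw logits rather than probabilities. As a rough pure-Python sketch of the math (not PyTorch's actual implementation), the fused form below gives the same value as applying a sigmoid and then plain binary cross-entropy, while avoiding the logarithm of a value that may round to 0 or 1. The function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_on_probability(p: float, y: float) -> float:
    # Plain binary cross-entropy on a probability p and a label y in {0, 1}.
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(x: float, y: float) -> float:
    # Numerically stable form on the raw logit x:
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


for logit, label in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    two_step = bce_on_probability(sigmoid(logit), label)
    fused = bce_with_logits(logit, label)
    print(f"logit={logit:+.1f} label={label:.0f} two_step={two_step:.6f} fused={fused:.6f}")
```

Because the model outputs logits, any prediction step has to apply a sigmoid before thresholding at 0.5 (or, equivalently, threshold the logit at 0).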

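With every layer unfrozen, the backward pass has to keep activations and gradients for the whole network, so VRAM usage can exceed what our 4GB GPU offers and trigger an OOM error.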

**Note**: If training succeeds (unlikely), accuracy may reach ~85–90%.
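A back-of-the-envelope estimate helps to see where the memory goes. The sketch below assumes 32-bit floats and the Adam optimizer, which keeps two extra buffers per parameter; the parameter count is the one printed by the notebook. It covers only the static cost (weights, gradients, optimizer state), not the activations.

```python
TRAINABLE_PARAMS = 11_090_473  # "Trainable params" printed by the notebook
BYTES_PER_FP32 = 4             # assumption: plain float32 training, no mixed precision
MIB = 1024 ** 2


def to_mib(num_bytes: int) -> float:
    return num_bytes / MIB


weights_bytes = TRAINABLE_PARAMS * BYTES_PER_FP32
gradient_bytes = weights_bytes      # one gradient value per trainable weight
adam_bytes = 2 * weights_bytes      # Adam's first and second moment estimates
static_bytes = weights_bytes + gradient_bytes + adam_bytes

print(f"Weights        : {to_mib(weights_bytes):8.2f} MiB")
print(f"Gradients      : {to_mib(gradient_bytes):8.2f} MiB")
print(f"Adam state     : {to_mib(adam_bytes):8.2f} MiB")
print(f"Static subtotal: {to_mib(static_bytes):8.2f} MiB")
```

The weights alone come to roughly 42 MiB, close to the allocated memory PyTorch reports right after the model is moved to the GPU. Even with gradients and Adam state the static cost stays well under 200 MiB, which suggests that most of the gigabytes in use at the time of the OOM are activations stored for the backward pass, and those grow with batch size and image resolution.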

## Memory Profiling Function
A small helper to profile memory usage during training.

In [11]:

def memorytracking():
    """Print allocated, reserved, and peak allocated CUDA memory in MB."""
    used_mem = torch.cuda.memory_allocated()
    reserved_mem = torch.cuda.memory_reserved()
    peak_mem = torch.cuda.max_memory_allocated()
    print(f"  Allocated Memory    : {used_mem / (1024 ** 2):.2f} MB", end=' ')
    print(f"  Reserved Memory      : {reserved_mem / (1024 ** 2):.2f} MB", end=' ')
    print(f"  Peak Allocated Memory: {peak_mem / (1024 ** 2):.2f} MB")

memorytracking()


  Allocated Memory    : 42.78 MB   Reserved Memory      : 58.00 MB   Peak Allocated Memory: 42.78 MB


# The Training Loop
Here we set the learning rate low so that the model can adapt to the data without drastically changing the pre-trained weights.

In [12]:
# Loss and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training function
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=5):
    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}
    
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        train_preds, train_targets = [], []
      
        
        for data, target in train_loader:
            data, target = data.to(device), target.float().to(device).view(-1, 1)  # Float targets, shape (batch_size, 1)
            optimizer.zero_grad()
            output = model(data)  # Output: (batch_size, 1) logits
            loss = criterion(output, target)  # BCEWithLogitsLoss
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * data.size(0)

            # Apply sigmoid for predictions
            probs = torch.sigmoid(output)
            preds = (probs > 0.5).float()  # Threshold at 0.5
            train_preds.extend(preds.cpu().numpy().flatten())
            train_targets.extend(target.cpu().numpy().flatten())

        # Validation
        model.eval()
        val_loss = 0.0
        val_preds, val_targets = [], []
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to(device), target.float().to(device).view(-1, 1)
                output = model(data)  # (batch_size, 1) logits
                loss = criterion(output, target)
                val_loss += loss.item() * data.size(0)

                probs = torch.sigmoid(output)
                preds = (probs > 0.5).float()
                val_preds.extend(preds.cpu().numpy().flatten())
                val_targets.extend(target.cpu().numpy().flatten())

        # Metrics
        train_loss /= len(train_loader.dataset)
        val_loss /= len(val_loader.dataset)
        train_acc = accuracy_score(train_targets, train_preds)
        val_acc = accuracy_score(val_targets, val_preds)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

        torch.cuda.empty_cache()

        print(f"Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
        memorytracking()
    
    return history

In [13]:
# Attempt to train
history = train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=10)


OutOfMemoryError: CUDA out of memory. Tried to allocate 792.00 MiB. GPU 0 has a total capacity of 3.69 GiB of which 154.94 MiB is free. Including non-PyTorch memory, this process has 3.52 GiB memory in use. Of the allocated memory 3.41 GiB is allocated by PyTorch, and 15.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

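# OOM in a Nutshell
Here is the output of my error: <br>
![OOM](./images/OOM.png)

The message is fairly self-explanatory: PyTorch asked for another 792 MiB while only about 155 MiB of the GPU's 3.69 GiB usable capacity was free, so we exceeded what the 4GB card can provide.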
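The numbers in the error message can be pulled out and compared directly. The sketch below is a small hypothetical helper (`parse_oom` is not part of PyTorch) that assumes the message wording shown in this notebook; other PyTorch versions may phrase it differently.

```python
import re

# Excerpt of the error text from the failed training run.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 792.00 MiB. "
    "GPU 0 has a total capacity of 3.69 GiB of which 154.94 MiB is free."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def size_in_mib(value: str, unit: str) -> float:
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    # Each pattern captures a number and its unit from one phrase of the message.
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    capacity = re.search(r"total capacity of ([\d.]+) (MiB|GiB)", message)
    free = re.search(r"of which ([\d.]+) (MiB|GiB) is free", message)
    if not (requested and capacity and free):
        raise ValueError("message does not match the expected OOM wording")
    requested_mib = size_in_mib(*requested.groups())
    free_mib = size_in_mib(*free.groups())
    return {
        "requested_mib": requested_mib,
        "capacity_mib": size_in_mib(*capacity.groups()),
        "free_mib": free_mib,
        "shortfall_mib": requested_mib - free_mib,
    }


info = parse_oom(OOM_MESSAGE)
for key, value in info.items():
    print(f"{key:>14}: {value:.2f}")
```

The shortfall is the gap between the single allocation PyTorch attempted and the memory that was still free at that moment; it is not the total amount the full training step would have needed.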


## Memory Profiling with NVIDIA-SMI

We use `nvidia-smi` to check VRAM usage after the failed fine-tuning attempt. Unfreezing every layer and training with a batch size of 64 pushed the memory demand past what our 4GB RTX 3050 Ti can provide, causing the OOM error. The reading below shows 3627MiB of 4096MiB still in use after the failure.

Run `nvidia-smi` to inspect GPU memory.

In [14]:
torch.cuda.empty_cache()
!nvidia-smi

Tue Jun  3 17:17:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0             11W /   65W |    3627MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
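
Reading the table by eye works, but the same numbers can also be collected programmatically. The sketch below uses `nvidia-smi`'s CSV query mode and parses its output; the helper names are illustrative, and `query_gpu_memory` assumes `nvidia-smi` is on the PATH. The parser is demonstrated on a sample line with the values from the table above, so it runs without a GPU.

```python
import subprocess

# nvidia-smi's query mode prints one CSV line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list:
    """Turn lines such as '3627, 4096' into (used_mib, total_mib) tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory() -> list:
    # Requires an NVIDIA driver; raises CalledProcessError if the command fails.
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Sample input with the values shown in the table above.
sample = "3627, 4096\n"
for used, total in parse_memory_csv(sample):
    print(f"{used} MiB / {total} MiB ({used / total:.1%} used)")
```

Calling `query_gpu_memory()` before and after a training attempt gives a quick way to log how much VRAM the process is holding.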
                                                

## Successful Run on a Powerful GPU

On a GPU with more VRAM (e.g., an RTX 3080 or a cloud instance with a V100), fine-tuning with unfrozen layers at this batch size is far more likely to succeed without OOM errors. <br>
Since our 4GB GPU hit OOM, we’ll address this in Notebook 4 with optimization techniques.

## Evaluation

If fine-tuning completed any epochs before OOM, we evaluate the model on the test set (572 images). If OOM occurred early, evaluation will be deferred to Notebook 4 after optimization.

Metrics:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
# Evaluation function
def evaluate_model(model, loader):
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.float().to(device).view(-1, 1)
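            output = model(data)  # (batch_size, 1) logits
            # The model returns logits, so apply sigmoid before thresholding at 0.5
            preds.extend((torch.sigmoid(output) > 0.5).float().cpu().numpy().flatten())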
            targets.extend(target.cpu().numpy().flatten())
    
    acc = accuracy_score(targets, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(targets, preds, average="binary")
    return acc, precision, recall, f1, preds, targets

# Evaluate
test_acc, test_precision, test_recall, test_f1, test_preds, test_targets = evaluate_model(model, test_loader)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test F1-Score: {test_f1:.4f}")

## Training Curves

If fine-tuning ran for any epochs, we plot training and validation loss/accuracy to assess progress. If OOM stopped training early, this will be addressed in Notebook 4.


In [None]:
# Plot loss and accuracy
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history["train_loss"], label="Train Loss")
plt.plot(history["val_loss"], label="Validation Loss")
plt.title("Training and Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history["train_acc"], label="Training Accuracy")
plt.plot(history["val_acc"], label="Validation Accuracy")
plt.title("Training and Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()

plt.tight_layout()
plt.show()

## Reset Cache and Shut Down Kernel

To ensure a clean state, we clear GPU memory and shut down the kernel. After running the cell below, the kernel will stop. Open `3_Fine_Tuning_Memory_Optimization.ipynb` in a new Jupyter session to apply memory optimization.

**Note**: Restart Jupyter or open the next notebook manually after shutdown.

In [None]:
from IPython import get_ipython
import torch

# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("GPU memory cleared.")

# Shut down the kernel
get_ipython().kernel.do_shutdown(restart=True)