### Section 1: Introduction

In this homework assignment, we investigate how hyperparameters such as the initial learning rate, learning rate schedule, weight decay, and data augmentation affect deep neural networks. A central tension in deep learning is the balance between optimization and regularization.

Optimization is controlled by the initial learning rate and the learning rate schedule. Regularization is controlled by, among other things, weight decay and data augmentation. As a result, the values of these hyperparameters are critical for the performance of deep neural networks.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
from torch.utils.data import DataLoader, Subset
import random
from tqdm.notebook import tqdm
import os, json, time
from torch.optim.lr_scheduler import CosineAnnealingLR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CIFAR-10 with Augmentation.
- Random crop (padding 4), random horizontal flip, per-channel normalization
- Batch size: 128
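
To make the crop-and-flip augmentation concrete, here is a minimal pure-Python sketch of the idea on a single-channel toy image. The helper `random_crop_flip` and the 4x4 input are hypothetical and only illustrate the mechanics (zero-pad, random crop back to the original size, flip with probability 0.5); the notebook itself relies on `transforms.RandomCrop` and `transforms.RandomHorizontalFlip`.

```python
import random


def random_crop_flip(img, size, padding, rng):
    """Zero-pad a 2D image (list of rows), take a random size x size crop,
    then flip it horizontally with probability 0.5."""
    h, w = len(img), len(img[0])
    # Zero-pad every side by `padding` pixels.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = img[r][c]
    # Random top-left corner such that the crop stays inside the padded image.
    top = rng.randint(0, h + 2 * padding - size)
    left = rng.randint(0, w + 2 * padding - size)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    # Horizontal flip: reverse each row.
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop


rng = random.Random(0)
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]  # toy 4x4 "image"
augmented = random_crop_flip(image, size=4, padding=1, rng=rng)
for row in augmented:
    print(row)
```

The output keeps the input size, so the network sees shifted and mirrored versions of each image without any change to the model.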

In [None]:
# Reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


# Transforms
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914,0.4822,0.4465), (0.247,0.243,0.261))
])
transform_eval = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914,0.4822,0.4465), (0.247,0.243,0.261))
])

# Load full train for splitting, and test
full_train = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
val_view   = datasets.CIFAR10(root='./data', train=True, download=False, transform=transform_eval)
test_set   = datasets.CIFAR10(root='./data', train=False, download=False, transform=transform_eval)

# Create a hold-out validation split (e.g., 40k train / 10k val)
num_train = len(full_train)  # 50,000
val_size = 10000
indices = np.arange(num_train)
np.random.shuffle(indices)
val_indices = indices[:val_size] # 10,000
train_indices = indices[val_size:] # 40,000

train_set = Subset(full_train, train_indices)
val_set   = Subset(val_view,   val_indices)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_set,   batch_size=128, shuffle=False, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_set,  batch_size=128, shuffle=False, num_workers=2, pin_memory=True)



### Section 2: ResNet
Please describe the unique architectural characteristics of ResNet-18, possibly by comparing it to earlier neural networks (e.g., VGG-18).
<div style="font-size:85%">

| Aspect | ResNet-18 | VGG-18 |
|--------|-----------|--------|
| Philosophy | Enable depth via residual learning so optimization remains tractable. | Improve representation by stacking many small $3 \times 3$ convolutions sequentially. |
| Core mapping | $H(x) = F(x) + x$ with identity skip (or projection) shortcut. | $H(x)$ learned directly with no shortcut path. |
| Residual function | $F(x) = \mathrm{BN}(W_2\,\sigma(\mathrm{BN}(W_1 x)))$ (BasicBlock: Conv $3\times3$–BN–ReLU–Conv $3\times3$–BN). | Typical two-layer block: $H(x) = \sigma(W_2\,\sigma(W_1 x))$ (Conv $3\times3$–ReLU–Conv $3\times3$–ReLU). |
| Gradient flow | $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H(x)}\left(1 + \frac{\partial F}{\partial x}\right)$ preserves signal via the identity path. | $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H(x)} \cdot \frac{\partial H}{\partial x}$ must traverse all layers; more prone to vanishing. |
| Depth handling | Scales to $50/101/152+$ layers due to stable optimization from skips. | Deeper variants degrade without special tricks; training gets harder with depth. |
| Normalization | BatchNorm after each convolution; ReLU after the first conv's BN and again after the residual addition. | Original VGG used ReLU without BatchNorm (often added in modern reproductions). |
| Downsampling | Stride-2 conv at stage starts; identity or $1\times1$ projection for skip when shape changes. | Max pooling between stacks; downsampling via pooling only. |
| Classifier head | Global average pooling $\rightarrow$ single fully connected layer $\rightarrow$ softmax. | Large fully connected layers (e.g., $4096 \rightarrow 4096$) $\rightarrow$ softmax. |
| Parameters | Fewer for similar accuracy, largely because global average pooling replaces the large FC layers. | More, dominated by the large FC layers. |
| Regularization effects | Skip paths act like implicit ensemble/shortcut regularization; BN stabilizes. | Relies more on explicit regularization and careful initialization. |

</div>
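
The gradient-flow row of the table can be illustrated with a scalar toy model. This is only a sketch under strong assumptions: each layer is reduced to a single local derivative $\partial F/\partial x$, and nonlinearities and BatchNorm are ignored. The function names and the value 0.1 are hypothetical.

```python
def plain_chain_gradient(local_grads):
    """d(output)/d(input) for a plain stack: product of the local derivatives."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_chain_gradient(local_grads):
    """Same quantity for stacked blocks H(x) = F(x) + x:
    each factor becomes (1 + dF/dx) because of the identity path."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g


local = [0.1] * 18  # hypothetical small local derivative in each of 18 layers
plain = plain_chain_gradient(local)
residual = residual_chain_gradient(local)
print(f"plain: {plain:.3e}  residual: {residual:.3e}")
```

With small local derivatives the plain product collapses toward zero, while the residual product stays at order one. The toy also shows the flip side: the residual factor can grow with depth, which is one reason normalization is still used alongside the skip connections.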



In [None]:
# defining resnet models

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion *
                               planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.in_planes = 64

        # This is the "stem"
        # For CIFAR (32x32 images), it does not perform downsampling
        # It should downsample for ImageNet
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        # Four stages; stages 2-4 each halve the spatial resolution (stride 2)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512*block.expansion, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out


def ResNet18():
    return ResNet(BasicBlock, [2, 2, 2, 2])


def ResNet34():
    return ResNet(BasicBlock, [3, 4, 6, 3])


def ResNet50():
    return ResNet(Bottleneck, [3, 4, 6, 3])


def ResNet101():
    return ResNet(Bottleneck, [3, 4, 23, 3])


def ResNet152():
    return ResNet(Bottleneck, [3, 8, 36, 3])


def test_resnet18():
    net = ResNet18()
    y = net(torch.randn(1, 3, 32, 32))
    print(y.size())

In [4]:
model = ResNet18().to(device)

Functions to train for one epoch and to evaluate on a data loader.
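
Both functions accumulate `loss.item() * x.size(0)` and divide by the total number of samples. Since `nn.CrossEntropyLoss` returns a batch mean by default, this yields a per-sample epoch mean even when the last batch is smaller (the 40,000-image training split with batch size 128 ends with a batch of 64). A small sketch with hypothetical batch losses shows the difference from naively averaging the batch means:

```python
def epoch_mean(batch_means, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    total = sum(batch_sizes)
    running = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return running / total


batch_sizes = [128, 128, 64]    # last batch is smaller
batch_means = [1.0, 1.0, 4.0]   # hypothetical per-batch mean losses

weighted = epoch_mean(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)
print(weighted, naive)
```

The naive average over-weights the small final batch; the weighted version matches what one would get by averaging over all samples directly.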

In [5]:
criterion = nn.CrossEntropyLoss()

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * x.size(0)
        _, pred = out.max(1)
        correct += pred.eq(y).sum().item()
        total += x.size(0)
    return running_loss / total, correct / total

@torch.no_grad()
def evaluate(model, loader, criterion):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        out = model(x)
        loss = criterion(out, y)
        running_loss += loss.item() * x.size(0)
        _, pred = out.max(1)
        correct += pred.eq(y).sum().item()
        total += x.size(0)
    return running_loss / total, correct / total

In [6]:
class ExperimentLogger:
    def __init__(self):
        self.results = []
        self.best_state = None
        self.best_label = None
        self.best_val_acc = -1.0
    def add(self, label, history, final_train, final_val, state_dict=None, metadata=None):
        meta = dict(metadata or {})
        meta.setdefault('added_at', time.strftime("%Y-%m-%d %H:%M:%S"))
        safe_history = {}
        for k, v in history.items():
            arr = np.asarray(v)
            safe_history[k] = arr.tolist()
        entry = {
            'label': label,
            'history': safe_history,
            'final_train': (float(final_train[0]), float(final_train[1])),
            'final_val':   (float(final_val[0]),   float(final_val[1])),
            'metadata': meta
        }
        self.results.append(entry)
        if final_val[1] > self.best_val_acc and state_dict is not None:
            self.best_val_acc = float(final_val[1])
            self.best_state = {k: v.cpu().clone() for k, v in state_dict.items()}
            self.best_label = label
    def summary(self):
        for r in self.results:
            tt = r['final_train']; vv = r['final_val']
            print(f"{r['label']}: Train(L={tt[0]:.4f}, A={tt[1]:.4f}) | Val(L={vv[0]:.4f}, A={vv[1]:.4f})")
    def save_all(self, history_dir="history", models_dir="models", overwrite=False):
        os.makedirs(history_dir, exist_ok=True)
        os.makedirs(models_dir, exist_ok=True)
        for r in self.results:
            label = r['label']
            fname = os.path.join(history_dir, f"{label}.json")
            if os.path.exists(fname) and not overwrite:
                fname = os.path.join(history_dir, f"{label}_{int(time.time())}.json")
            with open(fname, "w") as f:
                json.dump(r, f, indent=4)
        if self.best_state is not None and self.best_label is not None:
            torch.save(self.best_state, os.path.join(models_dir, f"{self.best_label}_best.pt"))
    def load_results(self, history_dir="history"):
        self.results = []
        for fname in sorted(os.listdir(history_dir)):
            if fname.endswith(".json"):
                with open(os.path.join(history_dir, fname), "r") as f:
                    self.results.append(json.load(f))
    def select(self, label_prefix=None):
        out = self.results
        if label_prefix is not None:
            out = [r for r in out if str(r['label']).startswith(label_prefix)]
        return out
    
    
logger = ExperimentLogger()

### Section 3: Initial Learning Rate Study (15 Epochs)

Three experiments with lr ∈ {0.1, 0.01, 0.001}, no weight decay or schedule.

In [7]:
lrs = [0.1, 0.01, 0.001]
best_lr, best_acc, best_epoch = 0,0,0
print(f"<--- LR Sweep --->\n=== Best accuracy on validation set so far ===")
for lr in lrs:
    model = ResNet18().to(device)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0)
    history = {'train_loss':[], 'train_acc':[], 'val_loss':[], 'val_acc':[], 'lr':[]}
    for epoch in range(1, 16):
        tl, ta = train_one_epoch(model, train_loader, optimizer, criterion)
        vl, va = evaluate(model, val_loader, criterion)
        history['train_loss'].append(tl)
        history['train_acc'].append(ta)
        history['val_loss'].append(vl)
        history['val_acc'].append(va)
        history['lr'].append(optimizer.param_groups[0]['lr'])

        if best_acc < va:
            best_acc = va
            best_lr = lr
            best_epoch = epoch
            print(f"LR={lr} | val acc={va} | epoch={epoch}")
            
    
    cpu_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
    logger.add(label=f"sec3_lr={lr}", history=history, final_train=(tl, ta), final_val=(vl, va), state_dict=cpu_state, metadata={'base_lr': lr, 'section': 3})

<--- LR Sweep --->
=== Best accuracy on validation set so far ===
LR=0.1 | val acc=0.3898 | epoch=1
LR=0.1 | val acc=0.4905 | epoch=2
LR=0.1 | val acc=0.5356 | epoch=3
LR=0.1 | val acc=0.6257 | epoch=4
LR=0.1 | val acc=0.691 | epoch=5
LR=0.1 | val acc=0.7009 | epoch=6
LR=0.1 | val acc=0.745 | epoch=7
LR=0.1 | val acc=0.7708 | epoch=8
LR=0.1 | val acc=0.7798 | epoch=9
LR=0.1 | val acc=0.8077 | epoch=10
LR=0.1 | val acc=0.8156 | epoch=12
LR=0.1 | val acc=0.8348 | epoch=13
LR=0.1 | val acc=0.8431 | epoch=15
LR=0.01 | val acc=0.8506 | epoch=9
LR=0.01 | val acc=0.856 | epoch=11
LR=0.01 | val acc=0.8671 | epoch=13


<i>LR **0.01** gave the highest validation accuracy, **86.71%** (reached at epoch 13).</i>
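
Note that the figure above is a best-epoch accuracy, whereas `logger.add` receives the metrics of the last epoch, so the two numbers for the same run can differ. A minimal sketch of extracting both from a validation history follows; the helper name and the accuracy values are hypothetical.

```python
def peak_and_final(val_acc):
    """Return (1-based best epoch, best accuracy, last-epoch accuracy)."""
    best_idx = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best_idx + 1, val_acc[best_idx], val_acc[-1]


val_acc = [0.80, 0.85, 0.87, 0.84]  # hypothetical per-epoch validation accuracy
epoch, peak, final = peak_and_final(val_acc)
print(f"best epoch={epoch}, peak={peak}, final={final}")
```

When comparing runs, it helps to state explicitly which of the two quantities is being reported.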

### Section 4: Learning Rate Schedule (300 Epochs)

Use best initial lr from Section 3. Compare:
* Constant lr for 300 epochs
* Cosine annealing such that lr → 0 over 300 epochs
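
For reference, the cosine schedule can be written in closed form as $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos(\pi t / T_{\max})\right)$. The sketch below evaluates this formula in plain Python; `cosine_lr` is a hypothetical helper, and it assumes the learning rate is not modified by anything other than the scheduler.

```python
import math


def cosine_lr(base_lr, t, t_max, eta_min=0.0):
    """Closed-form cosine annealing value after t scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))


schedule = [cosine_lr(0.01, t, 300) for t in range(301)]
print(schedule[0], schedule[150], schedule[299], schedule[300])
```

Because the training loop calls `scheduler.step()` at the end of each epoch, epoch $k$ trains with the value at $t = k - 1$. The last epoch therefore uses a very small but nonzero learning rate, and the value at $t = 300$ (exactly zero) is never used for an update.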

In [8]:
use_cosine, best_acc = False, 0
num_epochs = 300
print(f"<--- Annealing Check --->\n=== Best accuracy on validation set so far ===")
for use_schedule in [False, True]:
    model = ResNet18().to(device)
    optimizer = optim.SGD(model.parameters(), lr=best_lr, momentum=0.9, weight_decay=0.0)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs) if use_schedule else None
    history = {'train_loss':[], 'train_acc':[], 'val_loss':[], 'val_acc':[], 'lr':[]}
    for epoch in range(1, num_epochs + 1):
        tl, ta = train_one_epoch(model, train_loader, optimizer, criterion)
        vl, va = evaluate(model, val_loader, criterion)
        history['train_loss'].append(tl)
        history['train_acc'].append(ta)
        history['val_loss'].append(vl)
        history['val_acc'].append(va)
        history['lr'].append(optimizer.param_groups[0]['lr'])
        if scheduler:
            scheduler.step()
        if best_acc < va:
            best_acc = va
            use_cosine = use_schedule
            print(f"AnnealingLR={use_schedule}| val acc={va} | epoch={epoch}")
            
    label = f"sec4_{'cosine' if use_schedule else 'constant'}_lr={best_lr}"
    cpu_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
    logger.add(label=label, history=history, final_train=(tl, ta), final_val=(vl, va), state_dict=cpu_state, metadata={'schedule': bool(use_schedule), 'section': 4})

<--- Annealing Check --->
=== Best accuracy on validation set so far ===
AnnealingLR=False| val acc=0.5346 | epoch=1
AnnealingLR=False| val acc=0.6327 | epoch=2
AnnealingLR=False| val acc=0.7045 | epoch=3
AnnealingLR=False| val acc=0.7799 | epoch=4
AnnealingLR=False| val acc=0.8053 | epoch=6
AnnealingLR=False| val acc=0.8266 | epoch=8
AnnealingLR=False| val acc=0.845 | epoch=9
AnnealingLR=False| val acc=0.8694 | epoch=13
AnnealingLR=False| val acc=0.8853 | epoch=17
AnnealingLR=False| val acc=0.8882 | epoch=20
AnnealingLR=False| val acc=0.8888 | epoch=29
AnnealingLR=False| val acc=0.8971 | epoch=30
AnnealingLR=False| val acc=0.8992 | epoch=31
AnnealingLR=False| val acc=0.9083 | epoch=41
AnnealingLR=False| val acc=0.9095 | epoch=45
AnnealingLR=False| val acc=0.9101 | epoch=50
AnnealingLR=False| val acc=0.9117 | epoch=51
AnnealingLR=False| val acc=0.9123 | epoch=56
AnnealingLR=False| val acc=0.916 | epoch=62
AnnealingLR=False| val acc=0.9189 | epoch=81
AnnealingLR=False| val acc=0.9221 | 

<i>Cosine annealing improves the validation accuracy to **93.08%**.</i>

### Section 5: Weight Decay Study (300 Epochs)

Keep best lr & cosine schedule. Compare weight decay λ ∈ {5e-4, 1e-2}.
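
As a reminder of what λ does in SGD, weight decay adds $\lambda w$ to the gradient before the momentum update. The scalar sketch below mirrors that update rule (no dampening, no Nesterov); `sgd_step` is a hypothetical helper, and momentum is set to 0 here only to isolate the shrinkage effect, whereas the experiments use momentum 0.9.

```python
def sgd_step(w, grad, buf, lr, momentum, weight_decay):
    """One scalar SGD step with L2 weight decay folded into the gradient."""
    g = grad + weight_decay * w                      # decay term pulls w toward 0
    buf = g if buf is None else momentum * buf + g   # momentum buffer
    w = w - lr * buf
    return w, buf


w, buf = 1.0, None
for _ in range(3):
    # Zero loss gradient: only the weight decay term acts on w.
    w, buf = sgd_step(w, grad=0.0, buf=buf, lr=0.01, momentum=0.0, weight_decay=0.01)
print(w)
```

With zero loss gradient, each step multiplies the weight by $(1 - \text{lr} \cdot \lambda)$, so the shrinkage per step depends on the product of learning rate and λ. Under a cosine schedule the effective decay therefore also fades as the learning rate is annealed.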

In [9]:
wds = [5e-4, 1e-2]
best_wd, best_acc = 0, 0
print(f"<--- WD Sweep --->\n=== Best accuracy on validation set so far ===")
for wd in wds:
    model = ResNet18().to(device)
    optimizer = optim.SGD(model.parameters(), lr=best_lr, momentum=0.9, weight_decay=wd)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs) if use_cosine else None
    history = {'train_loss':[], 'train_acc':[], 'val_loss':[], 'val_acc':[], 'lr':[]}
    for epoch in range(1, num_epochs + 1):
        tl, ta = train_one_epoch(model, train_loader, optimizer, criterion)
        vl, va = evaluate(model, val_loader, criterion)
        history['train_loss'].append(tl)
        history['train_acc'].append(ta)
        history['val_loss'].append(vl)
        history['val_acc'].append(va)
        history['lr'].append(optimizer.param_groups[0]['lr'])
        if scheduler:
            scheduler.step()
            
        if best_acc < va:
            best_wd = wd
            best_acc = va
            print(f"Best Weight Decay={best_wd} | val acc={va} | epoch={epoch}")
            
    label = f"sec5_wd={wd:g}"
    cpu_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
    logger.add(label=label, history=history, final_train=(tl, ta), final_val=(vl, va), state_dict=cpu_state, metadata={'weight_decay': wd, 'section': 5})

<--- WD Sweep --->
=== Best accuracy on validation set so far ===
Best Weight Decay=0.0005 | val acc=0.556 | epoch=1
Best Weight Decay=0.0005 | val acc=0.6185 | epoch=2
Best Weight Decay=0.0005 | val acc=0.7131 | epoch=3
Best Weight Decay=0.0005 | val acc=0.783 | epoch=4
Best Weight Decay=0.0005 | val acc=0.7986 | epoch=7
Best Weight Decay=0.0005 | val acc=0.8268 | epoch=8
Best Weight Decay=0.0005 | val acc=0.8499 | epoch=10
Best Weight Decay=0.0005 | val acc=0.8577 | epoch=12
Best Weight Decay=0.0005 | val acc=0.8627 | epoch=15
Best Weight Decay=0.0005 | val acc=0.8702 | epoch=17
Best Weight Decay=0.0005 | val acc=0.8806 | epoch=18
Best Weight Decay=0.0005 | val acc=0.8809 | epoch=21
Best Weight Decay=0.0005 | val acc=0.8872 | epoch=23
Best Weight Decay=0.0005 | val acc=0.8931 | epoch=28
Best Weight Decay=0.0005 | val acc=0.8937 | epoch=32
Best Weight Decay=0.0005 | val acc=0.8986 | epoch=34
Best Weight Decay=0.0005 | val acc=0.8991 | epoch=37
Best Weight Decay=0.0005 | val acc=0.9029

<i>With weight decay λ = **0.01**, validation accuracy improved to **95.03%**.</i>

### Section 6: Custom Batch Normalization

Implement BN where the batch mean and variance are detached from the autograd graph (treated as constants in backprop), then replace all nn.BatchNorm2d layers.
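
To see what detaching changes, the sketch below compares the input gradient of standard BN with the detached variant on a single 1D feature, using the standard BN backward formula $\frac{\partial L}{\partial x_i} = \frac{\gamma}{\sigma}\left(g_i - \overline{g} - \hat{x}_i\,\overline{g\hat{x}}\right)$, where $g$ is the upstream gradient. The helper `bn_input_grads` and the toy values are hypothetical; this is an illustration of the math, not the layer used in the experiments.

```python
import math


def bn_input_grads(x, dy, gamma=1.0, eps=1e-5):
    """Input gradients for one BN feature: (standard BN, detached-statistics BN)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # biased variance, as in training-mode BN
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mean) * inv_std for v in x]
    mean_dy = sum(dy) / n
    mean_dy_xhat = sum(d * h for d, h in zip(dy, xhat)) / n
    # Standard BN: gradient also flows through the batch mean and variance.
    full = [gamma * inv_std * (d - mean_dy - h * mean_dy_xhat) for d, h in zip(dy, xhat)]
    # Detached BN: mean and variance are constants, so BN is a plain affine rescale.
    detached = [gamma * inv_std * d for d in dy]
    return full, detached


x = [1.0, 2.0, 4.0, 7.0]
dy = [1.0] * 4            # upstream gradient of loss = sum(outputs)
full, detached = bn_input_grads(x, dy)
print(full)
print(detached)
```

For a loss equal to the sum of the outputs, standard BN passes (numerically) zero gradient to the input, because shifting or scaling the batch does not change that sum. The detached variant passes $\gamma / \sigma$ to every element, so the two layers have the same forward pass but different backward passes.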

In [None]:
class MyBatchNorm2d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta  = nn.Parameter(torch.zeros(num_features))
        self.eps = eps
        self.momentum = momentum
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var',  torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # compute batch stats
            mean = x.mean(dim=(0,2,3), keepdim=True)
            var  = x.var(dim=(0,2,3), keepdim=True, unbiased=False)

            # update running stats without tracking gradients (important: keeps them out of the autograd graph and saves GPU memory)
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum) \
                                 .add_(self.momentum * mean.view(-1))
                self.running_var .mul_(1 - self.momentum) \
                                 .add_(self.momentum * var.view(-1))

            # detach mean/var
            m, v = mean.detach(), var.detach()
        else:
            # inference: use running stats
            m = self.running_mean.view(1, -1, 1, 1)
            v = self.running_var .view(1, -1, 1, 1)

        # normalize and apply scale+shift
        x_norm = (x - m) / torch.sqrt(v + self.eps)
        return self.gamma.view(1, -1, 1, 1) * x_norm \
             + self.beta .view(1, -1, 1, 1)

In [None]:
# toy input to check the layers
x = torch.randn(4, 16, 8, 8, device=device, requires_grad=True)
bn = MyBatchNorm2d(16).to(device)
out = bn(x).sum()
out.backward()

print(bn.running_mean.grad, bn.running_var.grad) 
print(x.grad.shape, bn.gamma.grad.shape, bn.beta.grad.shape)


None None
torch.Size([4, 16, 8, 8]) torch.Size([16]) torch.Size([16])


In [12]:
# Monkey-patch and build model
OriginalBatchNorm2d = nn.BatchNorm2d
nn.BatchNorm2d = MyBatchNorm2d
model_bn = ResNet18().to(device)
nn.BatchNorm2d = OriginalBatchNorm2d # restore for safety

In [13]:
# Verify Model
for name, module in model_bn.named_modules():
    if isinstance(module, (MyBatchNorm2d, OriginalBatchNorm2d)):
        print(f"{name:30s} -> {module.__class__.__name__}")

bn1                            -> MyBatchNorm2d
layer1.0.bn1                   -> MyBatchNorm2d
layer1.0.bn2                   -> MyBatchNorm2d
layer1.1.bn1                   -> MyBatchNorm2d
layer1.1.bn2                   -> MyBatchNorm2d
layer2.0.bn1                   -> MyBatchNorm2d
layer2.0.bn2                   -> MyBatchNorm2d
layer2.0.shortcut.1            -> MyBatchNorm2d
layer2.1.bn1                   -> MyBatchNorm2d
layer2.1.bn2                   -> MyBatchNorm2d
layer3.0.bn1                   -> MyBatchNorm2d
layer3.0.bn2                   -> MyBatchNorm2d
layer3.0.shortcut.1            -> MyBatchNorm2d
layer3.1.bn1                   -> MyBatchNorm2d
layer3.1.bn2                   -> MyBatchNorm2d
layer4.0.bn1                   -> MyBatchNorm2d
layer4.0.bn2                   -> MyBatchNorm2d
layer4.0.shortcut.1            -> MyBatchNorm2d
layer4.1.bn1                   -> MyBatchNorm2d
layer4.1.bn2                   -> MyBatchNorm2d


In [14]:
best_per_section = {}

for run in logger.results:
    sec = run['metadata'].get('section', None)
    if sec is None:
        continue
    
    if (sec not in best_per_section or
        run['final_val'][1] > best_per_section[sec]['final_val'][1]):
        best_per_section[sec] = run

# Print out the winners
for sec, winner in sorted(best_per_section.items()):
    acc = winner['final_val'][1]
    print(f"Section {sec} best: {winner['label']} | val_acc = {acc:.4f}")

Section 3 best: sec3_lr=0.01 | val_acc = 0.8457
Section 4 best: sec4_cosine_lr=0.01 | val_acc = 0.9303
Section 5 best: sec5_wd=0.01 | val_acc = 0.9499


In [None]:
# Confirm if correct variables are stored
print(f"Best LR: {best_lr}")
print(f"Use Annealing: {use_cosine}")
print(f"Best Weight Decay Rate: {best_wd}")

Best LR: 0.01
Use Annealing: True
Best Weight Decay Rate: 0.01


In [16]:
# Train with best setting discovered (e.g., best_lr, cosine, best weight decay)
optimizer = optim.SGD(model_bn.parameters(), lr=best_lr, momentum=0.9, weight_decay=best_wd)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs) if use_cosine else None

history = {'train_loss':[], 'train_acc':[], 'val_loss':[], 'val_acc':[], 'lr':[]}
for epoch in tqdm(range(1, num_epochs + 1),desc="Epochs",unit="epoch"):
    tl, ta = train_one_epoch(model_bn, train_loader, optimizer, criterion)
    vl, va = evaluate(model_bn, val_loader, criterion)
    history['train_loss'].append(tl)
    history['train_acc'].append(ta)
    history['val_loss'].append(vl)
    history['val_acc'].append(va)
    history['lr'].append(optimizer.param_groups[0]['lr'])
    if scheduler:
        scheduler.step()
cpu_state = {k: v.cpu().clone() for k, v in model_bn.state_dict().items()}
logger.add(label="sec6_bn", history=history, final_train=(tl, ta), final_val=(vl, va), state_dict=cpu_state, metadata={'section': 6})


#### Final Test Accuracy
Load the best checkpoint, evaluate on test_loader, and report accuracy.

In [17]:
print("\n=== Experiment Summary ===")
logger.summary()
print(f"\nSelected best by Val Acc: {logger.best_label} (Val Acc={logger.best_val_acc:.4f})")


=== Experiment Summary ===
sec3_lr=0.1: Train(L=0.3652, A=0.8728) | Val(L=0.4697, A=0.8431)
sec3_lr=0.01: Train(L=0.2497, A=0.9126) | Val(L=0.5292, A=0.8457)
sec3_lr=0.001: Train(L=0.3998, A=0.8605) | Val(L=0.5184, A=0.8288)
sec4_constant_lr=0.01: Train(L=0.0003, A=1.0000) | Val(L=0.5699, A=0.9278)
sec4_cosine_lr=0.01: Train(L=0.0002, A=1.0000) | Val(L=0.5040, A=0.9303)
sec5_wd=0.0005: Train(L=0.0012, A=1.0000) | Val(L=0.2401, A=0.9384)
sec5_wd=0.01: Train(L=0.0233, A=1.0000) | Val(L=0.1867, A=0.9499)
sec6_bn: Train(L=1.9471, A=0.2491) | Val(L=1.9234, A=0.2477)

Selected best by Val Acc: sec5_wd=0.01 (Val Acc=0.9499)


In [18]:
# Loading best model
best_model = ResNet18().to(device)
best_model.load_state_dict(logger.best_state)

# Testing best model
test_loss, test_acc = evaluate(best_model, test_loader, criterion)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

Test Loss: 0.1876, Test Accuracy: 0.9494


In [19]:
logger.save_all(history_dir="history", models_dir="models")