# Tuning

In this homework, you'll tune some hyperparameters to try to improve the performance of a network. This will involve setting up a convolutional network for the CIFAR-10 dataset, setting up TensorBoard for logging, and then experimenting. I am _not_ going to be setting specific accuracy targets for different grades because there is too much randomness in the training process. Moreover, achieving high accuracy is easier if you have access to a lot of computational resources, so there are some equity issues with just grading by final model performance.

Instead, I'm going to ask you to explain your tuning _process_. That is, for each experiment you run (each set of hyperparameters you try), explain why you ran that experiment and what happened. Based on your observations, what changes did you make for the next run? As long as you have explained your reasoning and it corresponds to the principles we've talked about in class, you'll do fine. Be sure to set up your logging in a way that indicates the hyperparameter values used for each run.

**IMPORTANT: Please zip up and submit your TensorBoard log files with your homework. That will help me to see what you were looking at as you went through your tuning process.**

I'm also deliberately giving you no starter code for this homework. I understand that a lot of it will just be copy-paste from past classes/labs/homeworks but I still think there is some value in going from a blank document to a complete program.

## My Program
Well, any good machine learning Jupyter notebook starts with a gajillion imports.

In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
torchvision.disable_beta_transforms_warning()
import torchvision.transforms.v2 as transforms
import torch.utils.tensorboard as tb
import datetime
import os

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'
print(device)

log_dir = 'homework5_logs'
data_dir = '../scratch/data/torch/cifar'

cuda


## Data
The next thing we want to do is import and sanitize our data, as well as set up our transforms.

In [16]:
transform = transforms.Compose([
    transforms.ToImage(),
    transforms.ConvertImageDtype(),
])

cifar = torchvision.datasets.CIFAR10(data_dir, download=True, transform=transform)
train_size = int(0.8 * len(cifar))
train_data, valid_data = torch.utils.data.random_split(cifar, [train_size, len(cifar) - train_size])

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

print(len(cifar))

Files already downloaded and verified
50000


In [17]:
mean = []
for x, _ in cifar:
    mean.append(torch.mean(x, dim=(1, 2)))
mean = torch.stack(mean, dim=0).mean(dim=0)
std = []
for x, _ in cifar:
    std.append(((x - mean[:,np.newaxis,np.newaxis]) ** 2).mean(dim=(1, 2)))
std = torch.stack(std, dim=0).mean(dim=0).sqrt()
print(mean, std)

tensor([0.4914, 0.4822, 0.4465]) tensor([0.2470, 0.2435, 0.2616])


In [18]:
cifar_mean = (0.4914, 0.4822, 0.4465)
cifar_std = (0.2470, 0.2435, 0.2616)

normalize = transforms.Normalize(cifar_mean, cifar_std)
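
For reference, the per-channel statistics computed above and the normalization they feed can be written out as follows. The notation is my own rather than the assignment's: $x_{n,c,i,j}$ is the pixel at row $i$, column $j$ of channel $c$ in image $n$, with $N$ images of height $H$ and width $W$.

$$
\mu_c = \frac{1}{NHW}\sum_{n,i,j} x_{n,c,i,j},
\qquad
\sigma_c = \sqrt{\frac{1}{NHW}\sum_{n,i,j}\left(x_{n,c,i,j} - \mu_c\right)^2},
\qquad
\hat{x}_{n,c,i,j} = \frac{x_{n,c,i,j} - \mu_c}{\sigma_c}
$$

Averaging per-image means gives the same $\mu_c$ as the global mean because every CIFAR-10 image has the same size.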

## Setting up our Classes
Now it's time to set up our CNN class. This will look very similar to what I did in Homework 4, although we'll now be tuning some hyperparameters to get better performance.

In [19]:
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
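            nn.Conv2d(3, 16, 3),    # 3x32x32 -> 16x30x30
            nn.MaxPool2d(2),        # -> 16x15x15
            nn.Conv2d(16, 32, 3),   # -> 32x13x13
            nn.MaxPool2d(2),        # -> 32x6x6
            nn.Conv2d(32, 64, 3),   # -> 64x4x4
            nn.AvgPool2d(2),        # -> 64x2x2
            nn.Flatten(),           # -> 256 features
            nn.Linear(256, 10)      # one logit per CIFAR-10 class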
        )
    
    def forward(self, x):
        return self.model(x)
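
The `256` in the final `nn.Linear` follows from the layer sizes. As a quick sanity check, the sketch below traces the spatial size through the stack with the standard output-size formula for unpadded convolutions and pooling, `(size - kernel) // stride + 1`. The helper name `out_size` is my own, and the strides are assumed to be the PyTorch defaults (1 for `Conv2d`, the kernel size for the pooling layers).

```python
def out_size(size, kernel, stride=1):
    """Output length along one spatial dim for an unpadded conv or pool."""
    return (size - kernel) // stride + 1

size = 32  # CIFAR-10 images are 32x32
# (layer type, kernel size, stride), mirroring the Sequential above
layers = [
    ("conv", 3, 1), ("pool", 2, 2),
    ("conv", 3, 1), ("pool", 2, 2),
    ("conv", 3, 1), ("pool", 2, 2),
]

trace = [size]
for _, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    trace.append(size)

channels = 64  # output channels of the last convolution
flat_features = channels * size * size

print(trace)          # [32, 30, 15, 13, 6, 4, 2]
print(flat_features)  # 256
```

If the kernel sizes, padding, or number of layers change during tuning, the input size of the linear layer has to change with them.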


## The Training Loop
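We also want to define a training loop. It trains with SGD, logs the loss and training accuracy for every batch, logs validation accuracy once per epoch, and encodes the hyperparameters in the TensorBoard run name so that runs can be told apart.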

In [20]:
def train(model_name="", model_class=CNN, lr=1e-3, epochs=10, batch_size=64, momentum=0.9, weight_decay=0):

    data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, shuffle=False)

    print(device)

    network = model_class().to(device)
    loss = nn.CrossEntropyLoss()
    opt = optim.SGD(network.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)

    name = model_name + '-' + "homework5-cnn"
    name += '-lr-' + str(lr) + '-bs-' + str(batch_size) + '-mom-' + str(momentum) + '-wght-' + str(weight_decay)
    logger = tb.SummaryWriter(os.path.join(log_dir, name))
    global_step = 0

    for i in range(epochs):

        network.train()
        for batch_xs, batch_ys in data_loader:

            batch_xs = batch_xs.to(device)
            batch_ys = batch_ys.to(device)
            batch_xs = normalize(batch_xs)

            preds = network(batch_xs)
            loss_val = loss(preds, batch_ys)
            opt.zero_grad()
            loss_val.backward()
            opt.step()

            logger.add_scalar('loss', loss_val, global_step=global_step)
            logger.add_scalar('training accuracy', (preds.argmax(dim=1) == batch_ys).float().mean(), global_step=global_step)

            global_step += 1
        
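        network.eval()
        correct = 0
        total = 0
        with torch.no_grad():  # no gradients are needed for evaluation
            for batch_xs, batch_ys in valid_loader:

                batch_xs = batch_xs.to(device)
                batch_ys = batch_ys.to(device)
                batch_xs = normalize(batch_xs)

                preds = network(batch_xs)
                correct += (preds.argmax(dim=1) == batch_ys).sum().item()
                total += batch_ys.size(0)

        # Count examples rather than averaging batch accuracies, so a smaller final batch is not over-weighted
        logger.add_scalar('validation accuracy', correct / total, global_step=global_step)

    logger.close()  # flush any pending events to disk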
    return network
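
Because `train` takes every hyperparameter as a keyword argument, a batch of runs can also be described as data and expanded with a loop. The sketch below is one way to do that, not the approach this notebook takes: `make_grid` is a hypothetical helper, and the values passed to it are only illustrative.

```python
from itertools import product

def make_grid(lrs, settings, start=1):
    """Cross learning rates with (momentum, weight_decay) settings."""
    configs = []
    # settings vary slowest, so consecutive runs share momentum and weight decay
    pairs = product(settings, lrs)
    for i, ((momentum, weight_decay), lr) in enumerate(pairs, start=start):
        configs.append({
            "model_name": f"cnn_model{i}",
            "lr": lr,
            "momentum": momentum,
            "weight_decay": weight_decay,
        })
    return configs

# Illustrative grid: three learning rates under three optimizer settings
configs = make_grid(
    lrs=[1e-3, 5e-3, 1e-2],
    settings=[(0.9, 0), (0.8, 0), (0.9, 5e-3)],
)

for cfg in configs:
    print(cfg)
    # In the notebook, each run would be: model = train(**cfg)
```

Keeping the configurations in a list also makes it straightforward to save each returned model under its `model_name` in the same loop.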

In [21]:
# FIRST BATCH

# -------------------------------------------------

cnn_model1 = train(model_name="cnn_model1")
cnn_model2 = train(model_name="cnn_model2", lr=5e-3)
cnn_model3 = train(model_name="cnn_model3", lr=1e-2)
cnn_model4 = train(model_name="cnn_model4", momentum=0.8)
cnn_model5 = train(model_name="cnn_model5", lr=5e-3, momentum=0.8)
cnn_model6 = train(model_name="cnn_model6", lr=1e-2, momentum=0.8)
cnn_model7 = train(model_name="cnn_model7", weight_decay=5e-3)
cnn_model8 = train(model_name="cnn_model8", lr=5e-3, weight_decay=5e-3)
cnn_model9 = train(model_name="cnn_model9",lr=1e-2, weight_decay=5e-3)

# state_dict is a method and must be called; make sure the output folder exists first
os.makedirs("homework5_models", exist_ok=True)
torch.save(cnn_model1.state_dict(), "homework5_models/cnn_model1.pt")
torch.save(cnn_model2.state_dict(), "homework5_models/cnn_model2.pt")
torch.save(cnn_model3.state_dict(), "homework5_models/cnn_model3.pt")
torch.save(cnn_model4.state_dict(), "homework5_models/cnn_model4.pt")
torch.save(cnn_model5.state_dict(), "homework5_models/cnn_model5.pt")
torch.save(cnn_model6.state_dict(), "homework5_models/cnn_model6.pt")
torch.save(cnn_model7.state_dict(), "homework5_models/cnn_model7.pt")
torch.save(cnn_model8.state_dict(), "homework5_models/cnn_model8.pt")
torch.save(cnn_model9.state_dict(), "homework5_models/cnn_model9.pt")

# SECOND BATCH

# --------------------------------------------------

cnn_model10 = train(model_name="cnn_model10",lr=5e-3, momentum=0.85)
cnn_model11 = train(model_name="cnn_model11",lr=1e-2, momentum=0.85)
cnn_model12 = train(model_name="cnn_model12",lr=3e-2, momentum=0.85)
cnn_model13 = train(model_name="cnn_model13",lr=5e-3, momentum=0.75)
cnn_model14 = train(model_name="cnn_model14",lr=1e-2, momentum=0.75)
cnn_model15 = train(model_name="cnn_model15",lr=3e-2, momentum=0.75)
cnn_model16 = train(model_name="cnn_model16",lr=5e-3, momentum=0.85, weight_decay=5e-4)
cnn_model17 = train(model_name="cnn_model17",lr=1e-2, momentum=0.85, weight_decay=5e-4)
cnn_model18 = train(model_name="cnn_model18",lr=3e-2, momentum=0.85, weight_decay=5e-4)

torch.save(cnn_model10.state_dict(), "homework5_models/cnn_model10.pt")
torch.save(cnn_model11.state_dict(), "homework5_models/cnn_model11.pt")
torch.save(cnn_model12.state_dict(), "homework5_models/cnn_model12.pt")
torch.save(cnn_model13.state_dict(), "homework5_models/cnn_model13.pt")
torch.save(cnn_model14.state_dict(), "homework5_models/cnn_model14.pt")
torch.save(cnn_model15.state_dict(), "homework5_models/cnn_model15.pt")
torch.save(cnn_model16.state_dict(), "homework5_models/cnn_model16.pt")
torch.save(cnn_model17.state_dict(), "homework5_models/cnn_model17.pt")
torch.save(cnn_model18.state_dict(), "homework5_models/cnn_model18.pt")

# THIRD BATCH

# -------------------------------------------------- 

cnn_model19 = train(model_name="cnn_model19", lr=1e-2, momentum=0.92)
cnn_model20 = train(model_name="cnn_model20", lr=9e-3, momentum=0.92)
cnn_model21 = train(model_name="cnn_model21", lr=1.1e-2, momentum=0.92)
cnn_model22 = train(model_name="cnn_model22", lr=1e-2, weight_decay=5e-5)
cnn_model23 = train(model_name="cnn_model23", lr=5e-3, epochs=15, weight_decay=5e-5)
cnn_model24 = train(model_name="cnn_model24", lr=1e-2, epochs=15)

torch.save(cnn_model19.state_dict(), "homework5_models/cnn_model19.pt")
torch.save(cnn_model20.state_dict(), "homework5_models/cnn_model20.pt")
torch.save(cnn_model21.state_dict(), "homework5_models/cnn_model21.pt")
torch.save(cnn_model22.state_dict(), "homework5_models/cnn_model22.pt")
torch.save(cnn_model23.state_dict(), "homework5_models/cnn_model23.pt")
torch.save(cnn_model24.state_dict(), "homework5_models/cnn_model24.pt")

# FOURTH BATCH

# -------------------------------------------------- 

cnn_model25 = train(model_name="cnn_model25", lr=5e-3, epochs=20, weight_decay=5e-5)
cnn_model26 = train(model_name="cnn_model26", lr=7e-3, epochs=20, weight_decay=5e-5)
cnn_model27 = train(model_name="cnn_model27", lr=9e-3, epochs=20, weight_decay=5e-5)

torch.save(cnn_model25.state_dict(), "homework5_models/cnn_model25.pt")
torch.save(cnn_model26.state_dict(), "homework5_models/cnn_model26.pt")
torch.save(cnn_model27.state_dict(), "homework5_models/cnn_model27.pt")


cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
cuda
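
To reuse a saved model later, the usual PyTorch pattern is to save the dictionary returned by `state_dict()` and load it into a freshly constructed network of the same architecture. The sketch below shows that round trip with a small stand-in `nn.Linear` model and a temporary file, both of which are placeholders rather than anything from this notebook.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; in the notebook this would be one of the trained CNNs
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pt")

    # state_dict() is called here: it returns a dict of parameter tensors
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

# Every parameter tensor should match after the round trip
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

For the networks here, the restore step would presumably construct `CNN()` in place of the stand-in and call `eval()` before running inference.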


In [27]:
%reload_ext tensorboard
%tensorboard --logdir={log_dir} --port 20005

ERROR: Failed to launch TensorBoard (exited with 255).
Contents of stderr:
TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

E0303 12:47:20.139552 140571579766592 program.py:300] TensorBoard could not bind to port 20005, it was already in use
ERROR: TensorBoard could not bind to port 20005, it was already in use
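
The error above means another process, most likely an earlier TensorBoard instance, is already listening on port 20005. One possible workaround, sketched here with only the standard library, is to ask the operating system for an unused port and pass that to `--port` instead. The helper name `find_free_port` is my own.

```python
import socket

def find_free_port():
    """Ask the OS for a TCP port that is currently unused on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]

port = find_free_port()
print(port)
```

In the notebook, the magic could then be written as `%tensorboard --logdir={log_dir} --port {port}`. The port is released before TensorBoard starts, so another process could in principle claim it in between; stopping the stale TensorBoard process is the more direct fix.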