# Bystrov Mikhail. Homework 3

In [233]:
!wget https://raw.githubusercontent.com/yandexdataschool/Practical_DL/refs/heads/fall25/week03_convnets/cifar.py

--2025-09-25 14:33:27--  https://raw.githubusercontent.com/yandexdataschool/Practical_DL/refs/heads/fall25/week03_convnets/cifar.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2396 (2.3K) [text/plain]
Saving to: ‘cifar.py.15’


2025-09-25 14:33:27 (48.2 MB/s) - ‘cifar.py.15’ saved [2396/2396]



In [234]:
import numpy as np
from cifar import load_cifar10
X_train, y_train, X_val, y_val, X_test, y_test = load_cifar10("cifar_data")

class_names = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer',
                        'dog', 'frog', 'horse', 'ship', 'truck'])

In [235]:
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
import matplotlib.pyplot as plt
%matplotlib inline

## Task 1: small convolution net
### First step

Let's create a mini convolutional network with roughly the following architecture:
* Input layer
* 3x3 convolution with 10 filters and _ReLU_ activation
* 2x2 pooling (or set previous convolution stride to 3)
* Flatten
* Dense layer with 100 neurons and _ReLU_ activation
* 10% dropout
* Output dense layer.


__Convolutional layers__ in torch are just like all other layers, but with a specific set of parameters:

__`...`__

__`model.add_module('conv1', nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)) # convolution`__

__`model.add_module('pool1', nn.MaxPool2d(2)) # max pooling 2x2`__

__`...`__


Once you're done (and compute_loss no longer raises errors), train it with the __Adam__ optimizer with default parameters (feel free to modify the code above).

If everything is right, you should get at least __50%__ validation accuracy.

In [236]:
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=(3,3)),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Flatten(),
    nn.Linear(32 * 15 * 15, 100),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(in_features=100, out_features=10)
).to(device)

In [237]:
# model = nn.Sequential(
#     nn.Conv2d(3, 32, kernel_size=(3,3)),
#     nn.MaxPool2d((2, 2)),
#     nn.ReLU(),
#     nn.Conv2d(32, 64, kernel_size=(3,3)),
#     nn.MaxPool2d((2, 2)),
#     nn.ReLU(),
#     nn.Flatten(),
#     nn.Linear(64 * 6 * 6, 128),
#     nn.ReLU(),
#     nn.Linear(128, 10)
# )

In [238]:
opt = torch.optim.Adam(model.parameters())

train_loss = []
val_accuracy = []

# An auxiliary function that yields shuffled mini-batches for neural network training
def iterate_minibatches(X, y, batchsize):
    indices = np.random.permutation(np.arange(len(X)))
    for start in range(0, len(indices), batchsize):
        ix = indices[start: start + batchsize]
        yield X[ix], y[ix]

In [239]:
def compute_loss(X_batch, y_batch):
    X_batch = torch.as_tensor(X_batch, dtype=torch.float32, device=device)
    y_batch = torch.as_tensor(y_batch, dtype=torch.int64, device=device)
    logits = model(X_batch)
    return F.cross_entropy(logits, y_batch).mean()

In [240]:
import time
num_epochs = 0 # total amount of full passes over training data
batch_size = 50  # number of samples processed in one SGD iteration

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
        # train on batch
        loss = compute_loss(X_batch, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

    # And a full pass over the validation data:
    model.train(False)     # disable dropout / use averages for batch_norm
    with torch.no_grad():  # do not store intermediate activations
        for X_batch, y_batch in iterate_minibatches(X_val, y_val, batch_size):
            logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
            y_pred = logits.argmax(-1).detach().to("cpu").numpy()
            val_accuracy.append(np.mean(y_batch == y_pred))

    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(X_train) // batch_size :])))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        np.mean(val_accuracy[-len(X_val) // batch_size :]) * 100))

In [241]:
model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
with torch.no_grad():  # gradients are not needed for evaluation
    for X_batch, y_batch in iterate_minibatches(X_test, y_test, 500):
        logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
        y_pred = logits.max(1)[1].detach().to("cpu").numpy()
        test_batch_acc.append(np.mean(y_batch == y_pred))

test_accuracy = np.mean(test_batch_acc)

print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

Final results:
  test accuracy:		10.00 %


__Hint:__ If you don't want to compute shapes by hand, just plug in any shape (e.g. 1 unit) and run compute_loss. You will see something like this:

__`RuntimeError: size mismatch, m1: [5 x 1960], m2: [1 x 64] at /some/long/path/to/torch/operation`__

See the __1960__ there? That's your actual input shape.
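If you would rather compute the shape than read it from an error message, the standard output-size formula for convolution and pooling layers is enough. The sketch below is a minimal illustration in plain Python; the helper name `conv_out_size` is hypothetical, and it assumes square inputs, dilation 1, and the default floor rounding.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

# First-step network: 32x32 input -> 3x3 conv without padding -> 2x2 max pooling
side = conv_out_size(32, kernel=3)              # 3x3 conv shrinks 32 to 30
side = conv_out_size(side, kernel=2, stride=2)  # 2x2 pooling halves 30 to 15
flat_features = 32 * side * side                # 32 output channels

# Deeper network: 3x3 convs with padding=1 keep the size, each pooling halves it
deep_side = 32
for layer in ["conv", "conv", "pool", "conv", "conv", "pool", "conv", "pool"]:
    if layer == "conv":
        deep_side = conv_out_size(deep_side, kernel=3, padding=1)
    else:
        deep_side = conv_out_size(deep_side, kernel=2, stride=2)
deep_flat_features = 512 * deep_side * deep_side  # 512 channels after the last conv

print(side, flat_features, deep_side, deep_flat_features)
```

The results match the `32 * 15 * 15` and `512 * 4 * 4` input sizes of the first dense layers used in this notebook.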

## Task 2: adding normalization

* Add batch norm (with default params) between convolution and ReLU
  * nn.BatchNorm*d (1d for dense, 2d for conv)
  * usually better to put them after linear/conv but before nonlinearity
* Re-train the network with the same optimizer. It should reach at least 60% validation accuracy at its peak.
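To make it concrete what a 2d batch norm layer computes in training mode, here is a minimal NumPy sketch. It is an illustration rather than the PyTorch implementation: the function name `batch_norm_2d` is hypothetical, and the running statistics that the real layer keeps for evaluation mode are left out.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (N, C, H, W)."""
    # Statistics are computed per channel, over the batch and both spatial axes
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)  # biased variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-channel scale (gamma) and shift (beta)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # made-up activations
y = batch_norm_2d(x, gamma=np.ones(4), beta=np.zeros(4))

channel_means = y.mean(axis=(0, 2, 3))  # close to 0 for every channel
channel_stds = y.std(axis=(0, 2, 3))    # close to 1 for every channel
print(channel_means, channel_stds)
```

Whatever the scale of the incoming activations, each channel leaves the layer with roughly zero mean and unit variance, which is why it is usually placed before the nonlinearity.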



In [242]:
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=(3,3)),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Flatten(),
    nn.Linear(32 * 15 * 15, 100),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(in_features=100, out_features=10)
).to(device)

In [243]:
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=(3,3), padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(3,3), padding=1),
    nn.BatchNorm2d(64),
    nn.MaxPool2d((2, 2)),
    nn.LeakyReLU(0.1),
    nn.Conv2d(64, 128, kernel_size=(3,3), padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=(3,3), padding=1),
    nn.BatchNorm2d(256),
    nn.MaxPool2d((2, 2)),
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=(3,3), padding=1),
    nn.BatchNorm2d(512),
    nn.MaxPool2d((2, 2)),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(512 * 4 * 4, 256),
    nn.LeakyReLU(0.1),
    nn.Dropout(0.25),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(128, 10)
).to(device)

In [244]:
import os

opt = torch.optim.Adam(model.parameters())

train_loss = []
val_accuracy = []

num_epochs = 100 # total amount of full passes over training data
batch_size = 50  # number of samples processed in one SGD iteration

best_accuracy = -1.0
best_epoch = -1

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
        # train on batch
        loss = compute_loss(X_batch, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

    # And a full pass over the validation data:
    model.train(False)     # disable dropout / use averages for batch_norm
    with torch.no_grad():  # do not store intermediate activations
        epoch_val_accuracy = [] # Calculate accuracy for the current epoch's validation pass
        for X_batch, y_batch in iterate_minibatches(X_val, y_val, batch_size):
            logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
            y_pred = logits.argmax(-1).detach().to("cpu").numpy()
            epoch_val_accuracy.append(np.mean(y_batch == y_pred))

        current_accuracy = np.mean(epoch_val_accuracy) # Mean accuracy for the current epoch
        val_accuracy.append(current_accuracy) # Append epoch accuracy to the list


    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(X_train) // batch_size :])))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        current_accuracy * 100)) # Use current_accuracy here

    # Early stopping: stop once validation accuracy has not improved for 10 epochs
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        best_epoch = epoch
        torch.save(model.state_dict(), "best_state.pt")
    elif epoch - best_epoch >= 10:
        print(f"  Validation accuracy has not improved for 10 epochs. Stopping early at epoch {epoch + 1}.")
        break # early stopping


# Load the best model state
if os.path.exists("best_state.pt"):
    model.load_state_dict(torch.load("best_state.pt"))
    print("Loaded best model state.")


model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
with torch.no_grad():
    for X_batch, y_batch in iterate_minibatches(X_test, y_test, 500):
        logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
        y_pred = logits.max(1)[1].detach().cpu().numpy()
        test_batch_acc.append(np.mean(y_batch == y_pred))

test_accuracy = np.mean(test_batch_acc)

print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

Epoch 1 of 100 took 12.634s
  training loss (in-iteration): 	1.556915
  validation accuracy: 			56.37 %
Epoch 2 of 100 took 12.612s
  training loss (in-iteration): 	1.103406
  validation accuracy: 			64.16 %
Epoch 3 of 100 took 12.681s
  training loss (in-iteration): 	0.878950
  validation accuracy: 			72.51 %
Epoch 4 of 100 took 12.725s
  training loss (in-iteration): 	0.730524
  validation accuracy: 			75.87 %
Epoch 5 of 100 took 12.389s
  training loss (in-iteration): 	0.622831
  validation accuracy: 			76.32 %
Epoch 6 of 100 took 12.318s
  training loss (in-iteration): 	0.519368
  validation accuracy: 			79.01 %
Epoch 7 of 100 took 12.319s
  training loss (in-iteration): 	0.435477
  validation accuracy: 			81.66 %
Epoch 8 of 100 took 12.359s
  training loss (in-iteration): 	0.357722
  validation accuracy: 			79.79 %
Epoch 9 of 100 took 12.417s
  training loss (in-iteration): 	0.290713
  validation accuracy: 			81.06 %
Epoch 10 of 100 took 12.504s
  training loss (in-iteration): 	0.

KeyboardInterrupt: 

## Task 3: Data Augmentation

torchvision provides a powerful tool, `transforms`, for image preprocessing and data augmentation.

Here's how it works: we define a pipeline that
* makes random crops of the data (augmentation)
* randomly rotates the image by up to 30 degrees (augmentation)
* randomly flips the image horizontally (augmentation)
* then normalizes it (preprocessing)

In [None]:
from torchvision import transforms
means = np.array((0.4914, 0.4822, 0.4465))  # statistics from dataset documentation
stds = np.array((0.2023, 0.1994, 0.2010))

transform_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomRotation([-30, 30]),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])

In [None]:
from torchvision.datasets import CIFAR10
train_loader = CIFAR10("./cifar_data/", train=True, transform=transform_augment)

train_dataloader = torch.utils.data.DataLoader(
    train_loader,  batch_size=32, shuffle=True, num_workers=1)

In [None]:
# for (x_batch, y_batch) in train_dataloader:

#     print('X:', type(x_batch), x_batch.shape)
#     print('y:', type(y_batch), y_batch.shape)

#     for i, img in enumerate(x_batch.numpy()[:8]):
#         plt.subplot(2, 4, i+1)
#         plt.imshow(img.transpose([1,2,0]) * stds + means )


#     raise NotImplementedError("Please use this code in your training loop")
    # TODO use this in your training loop

When testing, we don't need random crops; we just normalize with the same statistics.

In [None]:
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])

test_loader = CIFAR10("./cifar_data/", train=False, transform=transform_test)
test_dataloader = torch.utils.data.DataLoader(
    test_loader,  batch_size=32, shuffle=False, num_workers=1)



In [None]:
opt = torch.optim.Adam(model.parameters())

train_loss = []
val_accuracy = []

num_epochs = 0

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for X_batch, y_batch in train_dataloader: # Use train_dataloader
        # move data to device
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        # train on batch
        logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
        loss = F.cross_entropy(logits, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

    # And a full pass over the validation data (using test_dataloader as validation here):
    model.train(False)     # disable dropout / use averages for batch_norm
    with torch.no_grad():  # do not store intermediate activations
        val_batch_acc = []
        for X_batch, y_batch in test_dataloader: # Use test_dataloader for validation
            # move data to device
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
            y_pred = logits.argmax(-1).detach().to("cpu").numpy()
            val_batch_acc.append(np.mean(y_batch.to("cpu").numpy() == y_pred)) # Move y_batch to cpu for comparison

        val_accuracy.append(np.mean(val_batch_acc))


    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(train_dataloader) :]))) # Adjusted to use len(train_dataloader)
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        np.mean(val_accuracy[-1]) * 100))


# Evaluate on the test set after training
model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
with torch.no_grad():
    for X_batch, y_batch in test_dataloader: # Use test_dataloader for final evaluation
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
        y_pred = logits.max(1)[1].detach().to("cpu").numpy()
        test_batch_acc.append(np.mean(y_batch.to("cpu").numpy() == y_pred)) # Move y_batch to cpu for comparison


test_accuracy = np.mean(test_batch_acc)

print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve CIFAR10 image classification.

This notebook is intended as a sequel to seminar 3, please give it a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network that has as high __accuracy__ as you can push it.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it while you iterate.

## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 50% (50% points)
    * 60% (60% points)
    * 65% (70% points)
    * 70% (80% points)
    * 75% (90% points)
    * 80% (full points)
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 80%.
 * In other words, base milestones must be beaten without pre-trained nets (and such net must be present in the e-mail). After that, you can use whatever you want.
* You __can__ use validation data for training, but you __can't__ do anything with test data apart from running the evaluation procedure.

## Tips on what can be done:


 * __Network size__
   * MOAR neurons,
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others - over 500.
   * Way to go: stop when validation score is 10 iterations past maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * Converge faster and sometimes reach better optima
     * It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch will do the rest
       * Can be done manually or with the weight_decay parameter of an optimizer ([for example SGD's doc](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
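The "stop when the validation score is 10 iterations past its maximum" rule can be sketched as a small tracker that is independent of any framework. This is one possible arrangement, not the course's reference solution: the class name `EarlyStopping`, its `patience` parameter, and the score values are all made up for illustration.

```python
class EarlyStopping:
    """Tracks the best validation score and signals when to stop training."""

    def __init__(self, patience=10):
        self.patience = patience            # epochs to wait after the best score
        self.best_score = float("-inf")
        self.best_epoch = -1

    def step(self, epoch, score):
        """Record this epoch's score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch         # a good place to save a checkpoint
            return False
        return epoch - self.best_epoch >= self.patience

# Made-up validation accuracies, with a short patience to keep the demo small
scores = [0.50, 0.60, 0.65, 0.64, 0.63, 0.65, 0.62]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopper.best_score, stopped_at)
```

In a real training loop, the branch that updates `best_epoch` is where the model weights would be saved, so that the best state can be restored after stopping.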
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take long without GPU. That's okay.
     * If you are CPU-only, we still recommend that you try a simple convolutional architecture
     * a perfect option is to set it up to run overnight and check on it in the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely an overkill.
     * __To reduce computation__ time by a constant factor in exchange for some accuracy drop, try the __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the time of the default (stride=1) one, because it produces 4x fewer output positions.
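The 1/4 figure for stride=2 follows from counting multiply-accumulate operations: the cost of a convolution is proportional to the number of output positions. The sketch below is a rough estimate only, assuming a square input and ignoring bias terms and memory traffic; the helper name `conv_macs` is hypothetical.

```python
def conv_macs(in_size, in_ch, out_ch, kernel=3, stride=1, padding=1):
    """Approximate multiply-accumulate count of one square 2d convolution."""
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    # Each output position of each output channel needs kernel*kernel*in_ch products
    return out_size * out_size * kernel * kernel * in_ch * out_ch

macs_stride1 = conv_macs(32, in_ch=3, out_ch=32, stride=1)  # 32x32 output map
macs_stride2 = conv_macs(32, in_ch=3, out_ch=32, stride=2)  # 16x16 output map
ratio = macs_stride2 / macs_stride1

print(macs_stride1, macs_stride2, ratio)
```

Halving the output resolution along both axes leaves a quarter of the positions, so the operation count drops by the same factor.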

   
### Data augmentation
   * Getting a 5x larger dataset for free is great
     * Zoom-in+slice = move
     * Rotate+zoom (to remove black stripes)
     * Add noise (Gaussian or Bernoulli)
   * Simple way to do that (if you have PIL/Image):
     * ```from scipy.misc import imrotate,imresize``` (note: these functions were removed from recent SciPy releases; `scipy.ndimage.rotate` or Pillow's `Image.rotate` and `Image.resize` are alternatives)
     * and a bit of slicing
     * Other cool libraries: cv2, skimage, PIL/Pillow
   * A more advanced way is to use torchvision transforms:
    ```
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    trainset = torchvision.datasets.CIFAR10(root=path_to_cifar_like_in_seminar, train=True, download=True, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

    ```
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
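For the manual route, the "move", flip and noise ideas above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: images are channel-first arrays of shape `(3, 32, 32)` with values in `[0, 1]`, the function names are hypothetical, and the random array only stands in for a real CIFAR-10 image.

```python
import numpy as np

def random_shift(img, max_shift, rng):
    """Zero-pad the image, then crop it back to its original size at a random offset."""
    _, h, w = img.shape
    padded = np.pad(img, ((0, 0), (max_shift, max_shift), (max_shift, max_shift)))
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    return padded[:, top:top + h, left:left + w]

def random_hflip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    return img[:, :, ::-1] if rng.random() < p else img

def add_gaussian_noise(img, rng, sigma=0.05):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return img + rng.normal(0.0, sigma, size=img.shape)

rng = np.random.default_rng(0)
img = rng.random((3, 32, 32))  # stand-in for one image
augmented = add_gaussian_noise(random_hflip(random_shift(img, 4, rng), rng), rng)
print(augmented.shape)
```

Each call produces a slightly different version of the same image, so applying it on the fly in the batch generator gives the network new variations every epoch.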
   
