# Fashion MNIST - Andrew Wang

1. Implement a [SqueezeNet convolutional neural network](https://arxiv.org/pdf/1602.07360.pdf) architecture.
2. Modify the architecture to use [depthwise separable convolutions](https://arxiv.org/pdf/1704.04861.pdf) for better on-device performance.
3. Train model to classify images of fashion items supplied in the [FashionMNIST dataset](https://github.com/zalandoresearch/fashion-mnist).

# Final Accuracy Results: 

Note that the cells below can only be run after `train_model` has been defined later in this notebook. They are copied here from the Training Models section for convenience.

For additional convenience, these two models were saved and sent along with this notebook in a zip file. They can be loaded using the code near the bottom of this notebook, just above the Results Summary section.

**Demonstration of >80% Accuracy on Test Set for Base SqueezeNet:**

In [0]:
# Ran for 15 Epochs
train_model(False, conv_layer_params, fire_layer_params, [0.0003], [300], root)

Learning Rate: 0.0003; Batch Size: 300
Cuda Available
Epoch: 1, Batch_idx:    20 Loss: 2.305
Epoch: 1, Batch_idx:    40 Loss: 2.303
Epoch: 1, Batch_idx:    60 Loss: 2.304
Epoch: 1, Batch_idx:    80 Loss: 2.304
Epoch: 1, Batch_idx:   100 Loss: 2.304
Epoch: 1, Batch_idx:   120 Loss: 2.303
Epoch: 1, Batch_idx:   140 Loss: 2.304
Epoch: 1, Batch_idx:   160 Loss: 2.303
Epoch: 1, Batch_idx:   180 Loss: 2.303
Epoch: 1, Batch_idx:   200 Loss: 2.303
Epoch: 2, Batch_idx:    20 Loss: 2.303
Epoch: 2, Batch_idx:    40 Loss: 2.303
Epoch: 2, Batch_idx:    60 Loss: 2.304
Epoch: 2, Batch_idx:    80 Loss: 2.303
Epoch: 2, Batch_idx:   100 Loss: 2.302
Epoch: 2, Batch_idx:   120 Loss: 2.304
Epoch: 2, Batch_idx:   140 Loss: 2.303
Epoch: 2, Batch_idx:   160 Loss: 2.273
Epoch: 2, Batch_idx:   180 Loss: 2.081
Epoch: 2, Batch_idx:   200 Loss: 1.688
Epoch: 3, Batch_idx:    20 Loss: 1.240
Epoch: 3, Batch_idx:    40 Loss: 1.103
Epoch: 3, Batch_idx:    60 Loss: 1.021
Epoch: 3, Batch_idx:    80 Loss: 0.982
Epoch: 3, 

**Demonstration of >80% Accuracy on Test Set for SqueezeNet with DWS:**

In [0]:
# Ran for 3 epochs:
train_model(True, conv_layer_params, fire_layer_params, [0.01], [100], root)

Learning Rate: 0.01; Batch Size: 100
Cuda Available
Epoch: 1, Batch_idx:    60 Loss: 1.444
Epoch: 1, Batch_idx:   120 Loss: 0.970
Epoch: 1, Batch_idx:   180 Loss: 0.863
Epoch: 1, Batch_idx:   240 Loss: 0.765
Epoch: 1, Batch_idx:   300 Loss: 0.744
Epoch: 1, Batch_idx:   360 Loss: 0.694
Epoch: 1, Batch_idx:   420 Loss: 0.650
Epoch: 1, Batch_idx:   480 Loss: 0.634
Epoch: 1, Batch_idx:   540 Loss: 0.599
Epoch: 1, Batch_idx:   600 Loss: 0.588
Epoch: 2, Batch_idx:    60 Loss: 0.578
Epoch: 2, Batch_idx:   120 Loss: 0.545
Epoch: 2, Batch_idx:   180 Loss: 0.531
Epoch: 2, Batch_idx:   240 Loss: 0.523
Epoch: 2, Batch_idx:   300 Loss: 0.492
Epoch: 2, Batch_idx:   360 Loss: 0.517
Epoch: 2, Batch_idx:   420 Loss: 0.500
Epoch: 2, Batch_idx:   480 Loss: 0.484
Epoch: 2, Batch_idx:   540 Loss: 0.508
Epoch: 2, Batch_idx:   600 Loss: 0.448
Epoch: 3, Batch_idx:    60 Loss: 0.448
Epoch: 3, Batch_idx:   120 Loss: 0.432
Epoch: 3, Batch_idx:   180 Loss: 0.441
Epoch: 3, Batch_idx:   240 Loss: 0.428
Epoch: 3, Ba

## Setup and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

root = '/content/drive/My Drive/Colab Notebooks/'

In [2]:
# Import the packages we'll need
import keras
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.optim import Adam

import torchvision.datasets as datasets
import torchvision.transforms as transforms

Using TensorFlow backend.


In [3]:
# Load the Fashion MNIST dataset through torchvision.datasets

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize([0.5], [0.5])])

batch_size = 100
num_batches = 60000 // batch_size  # integer number of batches per epoch

train_data = datasets.FashionMNIST('',   train=True,  transform=transform, target_transform=None, download=True)
test_data  = datasets.FashionMNIST(root, train=False, transform=transform, target_transform=None, download=True)
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True,  num_workers=2)
test_data_loader  = torch.utils.data.DataLoader(test_data,  batch_size=batch_size, shuffle=False, num_workers=2)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting FashionMNIST/raw/train-images-idx3-ubyte.gz to FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting FashionMNIST/raw/train-labels-idx1-ubyte.gz to FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting FashionMNIST/raw/t10k-images-idx3-ubyte.gz to FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to FashionMNIST/raw
Processing...
Done!


In [0]:
# # Plot a few examples. Note that the images are black and white and
# # that the labels are provided as integer values 0-9.
# num_examples = 9
# indexes = np.random.randint(x_train.shape[0], size=(num_examples))

# dim = int(np.sqrt(num_examples))

# fig = plt.figure()
# k = 1
# for idx in indexes:
#     ax = fig.add_subplot(dim, dim, k)
#     ax.imshow(x_train[idx][:, :, 0])
#     ax.set_title('Class: %d' % y_train[idx])
#     k += 1
# fig.tight_layout()

## Implement and modify SqueezeNet
SqueezeNet models are neural networks about 50x smaller than the original AlexNet model that achieve comparable accuracy on ImageNet classification. Their size and runtime performance make them a good choice for computer vision models meant to be used on-device. You can consult the [original paper here](https://arxiv.org/pdf/1602.07360.pdf) as well as open source implementations in various frameworks ([Keras](https://github.com/rcmalli/keras-squeezenet/blob/master/keras_squeezenet/squeezenet.py), [TensorFlow](https://github.com/tensorflow/tpu/blob/master/models/official/squeezenet/squeezenet_model.py#L61), [PyTorch](https://github.com/pytorch/vision/blob/master/torchvision/models/squeezenet.py)).

1. Implement the most basic configuration of SqueezeNet. The model should take a Fashion MNIST image as input and output a vector of probabilities corresponding to the model's confidence that the image belongs to each class.

2. Modify the SqueezeNet architecture to use depthwise separable convolutions instead of regular convolutions for better runtime performance.
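
To make the motivation for step 2 concrete, the sketch below compares weight counts for a regular convolution and a textbook depthwise separable convolution (a per-channel k x k filter followed by a 1x1 pointwise convolution). Biases and batch normalization are ignored, and the helper names are hypothetical, chosen only for this illustration. The example sizes match the 3x3 expand branch of the first Fire layer configured later in the notebook.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights in a regular k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# 3x3 expand branch of the first Fire layer: 16 squeeze channels -> 64 filters
k, c_in, c_out = 3, 16, 64
std = standard_conv_weights(k, c_in, c_out)
sep = separable_conv_weights(k, c_in, c_out)
print(std, sep, round(sep / std, 3))
```

The ratio equals 1/c_out + 1/k^2, so the saving grows with the kernel size and the number of output channels; for 1x1 convolutions there is no saving at all.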

In [0]:
# Classes describing layers and models

class DWSLayer(nn.Module):
  def __init__(self, input_channels, output_channels, kernel_size, padding=0):
    super(DWSLayer, self).__init__()
    self.input_channels = input_channels
    self.output_channels = output_channels
    self.kernel_size = kernel_size
    self.padding = padding

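    # A single 1->1 convolution whose weights are shared by every input channel
    # (forward() applies it to each channel in turn).
    self.depthwise = nn.Conv2d(1, 1, self.kernel_size, padding=self.padding)
    # 1x1 convolution that mixes information across channels.
    self.pointwise = nn.Conv2d(input_channels, output_channels, 1)
    self.batchnorm = nn.BatchNorm2d(output_channels)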


  def forward(self, x):
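    # Depthwise convolution: apply the shared kernel to each channel separately
    # and concatenate the results along the channel dimension.
    out = self.depthwise(x[:,0].unsqueeze(1))
    for in_channel in range(1, x.shape[1]):
      out = torch.cat((out, self.depthwise(x[:,in_channel].unsqueeze(1))), 1)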

    # Pointwise convolution:
    out = self.pointwise(out)
    out = self.batchnorm(out)
    return out


class FireLayer(nn.Module):
  def __init__(self, input_channels, num_s11, num_e11, num_e33, DWS=False):
    super(FireLayer, self).__init__()
    self.input_channels = input_channels
    self.num_s11 = num_s11
    self.num_e11 = num_e11
    self.num_e33 = num_e33

    self.s11 = nn.Conv2d(input_channels, num_s11, 1)
    self.activ_f1 = nn.ReLU(inplace=True)
    self.e11 = nn.Conv2d(num_s11, num_e11, 1)
    self.e33 = nn.Conv2d(num_s11, num_e33, 3, padding=1)
    self.activ_f2 = nn.ReLU(inplace=True)

    if DWS:
      self.s11 = DWSLayer(input_channels, num_s11, 1)
      self.e11 = DWSLayer(num_s11, num_e11, 1)
      self.e33 = DWSLayer(num_s11, num_e33, 3, padding=1)


  def forward(self, x):
    # Squeeze
    x = self.s11(x)
    x = self.activ_f1(x)

    # Expand
    output_e11 = self.e11(x)
    output_e33 = self.e33(x)
    output = torch.cat((output_e11, output_e33), 1)
    output = self.activ_f2(output)
    return output


class SqueezeNet(nn.Module):
  def __init__(self, input_shape, n_classes, conv_layer_params, fire_layer_params, DWS=False):
    '''
    Args
    ----
    input_shape : HxWxD
    n_classes : Number of classes to predict from
    conv_layer_params : 2x2 array where clp[i] = [num_filters_i, kernel_size_i]
    fire_layer_params : 8x3 array where flp[i] = [s11_i, e11_i, e33_i]
    '''
    super(SqueezeNet, self).__init__()
    self.input_shape = input_shape
    self.n_classes = n_classes

    # For readability:
    clp = conv_layer_params.tolist()
    flp = fire_layer_params.tolist()
    flp2 = flp[0]
    flp3 = flp[1]
    flp4 = flp[2]
    flp5 = flp[3]
    flp6 = flp[4]
    flp7 = flp[5]
    flp8 = flp[6]
    flp9 = flp[7]

    self.conv1       = nn.Conv2d(input_shape[-1], clp[0][0], clp[0][1], padding=(int)((clp[0][1] - 1)/2))
    self.relu1       = nn.ReLU(inplace=True)
    self.fire_layers = nn.Sequential(
      FireLayer(        clp[0][0],  flp2[0], flp2[1], flp2[2], DWS),
      FireLayer(flp2[1] + flp2[2],  flp3[0], flp3[1], flp3[2], DWS),
      FireLayer(flp3[1] + flp3[2],  flp4[0], flp4[1], flp4[2], DWS),
      FireLayer(flp4[1] + flp4[2],  flp5[0], flp5[1], flp5[2], DWS),
      FireLayer(flp5[1] + flp5[2],  flp6[0], flp6[1], flp6[2], DWS),
      FireLayer(flp6[1] + flp6[2],  flp7[0], flp7[1], flp7[2], DWS),
      FireLayer(flp7[1] + flp7[2],  flp8[0], flp8[1], flp8[2], DWS),
      FireLayer(flp8[1] + flp8[2],  flp9[0], flp9[1], flp9[2], DWS)
    )
    self.conv10      = nn.Conv2d(flp9[1] + flp9[2], clp[1][0], clp[1][1], padding=(int)((clp[1][1] - 1)/2))
    self.relu2       = nn.ReLU(inplace=True)
    # Global Avg Pool
    self.avgpool2d   = nn.AvgPool2d(input_shape[0])
    # Softmax computed as part of CrossEntropy 
    self.linear      = nn.Linear(clp[1][0], self.n_classes)
    
    if DWS:
      self.conv1  = DWSLayer(  input_shape[-1], clp[0][0], clp[0][1], padding=(int)((clp[0][1] - 1)/2))
      self.conv10 = DWSLayer(flp9[1] + flp9[2], clp[1][0], clp[1][1], padding=(int)((clp[1][1] - 1)/2))

    
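  def forward(self, x):
    # Note: relu1 and relu2 are defined in __init__ but are not applied here;
    # the only nonlinearities are the ReLUs inside the Fire layers.
    # `device` is the module-level variable set in the Training Models section.
    x = self.conv1(x).to(device) # Convert to torch.cuda.FloatTensor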
    x = self.fire_layers(x)
    x = self.conv10(x)
    x = self.avgpool2d(x)
    x = torch.flatten(x, start_dim=1, end_dim=-1)
    x = self.linear(x)
    return x


def build_squeezenet(conv_layer_params, fire_layer_params, input_shape=(28, 28, 1), n_classes=10):
  model = SqueezeNet(input_shape, n_classes, conv_layer_params, fire_layer_params)
  return model

def build_squeezenet_depthwise(conv_layer_params, fire_layer_params, input_shape=(28, 28, 1), n_classes=10):
  model = SqueezeNet(input_shape, n_classes, conv_layer_params, fire_layer_params, DWS=True)
  return model
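
The `DWSLayer` above applies one shared 1->1 kernel to every channel inside a Python loop. The depthwise step in the MobileNets paper instead learns a separate filter per input channel, which PyTorch can express in a single call through the `groups` argument of `nn.Conv2d`. The sketch below is a hypothetical alternative (the class name `GroupedDWSLayer` is not part of this notebook), and because it has more depthwise weights it would not reproduce the parameter counts or accuracies reported here.

```python
import torch
import torch.nn as nn


class GroupedDWSLayer(nn.Module):
    """Hypothetical variant of DWSLayer built on a grouped convolution."""

    def __init__(self, input_channels, output_channels, kernel_size, padding=0):
        super().__init__()
        # groups=input_channels gives each input channel its own k x k filter
        self.depthwise = nn.Conv2d(input_channels, input_channels, kernel_size,
                                   padding=padding, groups=input_channels)
        # 1x1 convolution that mixes information across channels
        self.pointwise = nn.Conv2d(input_channels, output_channels, 1)
        self.batchnorm = nn.BatchNorm2d(output_channels)

    def forward(self, x):
        return self.batchnorm(self.pointwise(self.depthwise(x)))


layer = GroupedDWSLayer(16, 64, 3, padding=1)
x = torch.randn(2, 16, 28, 28)          # (batch, channels, height, width)
out = layer(x)
n_params = sum(p.numel() for p in layer.parameters())
print(out.shape, n_params)
```

Since the whole depthwise step is one convolution call rather than a loop with repeated `torch.cat`, this form may also avoid some of the per-channel overhead, though that was not measured in this notebook.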

In [5]:
# Default layer architectures from Figure 2 of the SqueezeNet paper:
conv_layer_params = np.asarray([[  96, 7],
                                [1000, 1]])
fire_layer_params = np.asarray([[16,  64,  64],
                                [16,  64,  64],
                                [32, 128, 128],
                                [32, 128, 128],
                                [48, 192, 192],
                                [48, 192, 192],
                                [64, 256, 256],
                                [64, 256, 256]])

# Demonstrating model size comparison:
model = build_squeezenet(conv_layer_params, fire_layer_params)
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params)
print(model)

model_dw = build_squeezenet_depthwise(conv_layer_params, fire_layer_params)
pytorch_total_params = sum(p.numel() for p in model_dw.parameters())
print(pytorch_total_params)
print(model_dw)

1249026
SqueezeNet(
  (conv1): Conv2d(1, 96, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (relu1): ReLU(inplace=True)
  (fire_layers): Sequential(
    (0): FireLayer(
      (s11): Conv2d(96, 16, kernel_size=(1, 1), stride=(1, 1))
      (activ_f1): ReLU(inplace=True)
      (e11): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      (e33): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (activ_f2): ReLU(inplace=True)
    )
    (1): FireLayer(
      (s11): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
      (activ_f1): ReLU(inplace=True)
      (e11): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      (e33): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (activ_f2): ReLU(inplace=True)
    )
    (2): FireLayer(
      (s11): Conv2d(128, 32, kernel_size=(1, 1), stride=(1, 1))
      (activ_f1): ReLU(inplace=True)
      (e11): Conv2d(32, 128, kernel_size=(1, 1), stride=(1, 1))
      (e33): Conv2d(32, 128, kernel_size=(3, 3), 

## Training Models

In [6]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if torch.cuda.is_available():
  print("Cuda Available")
  model.to(device)
  model_dw.to(device)

Cuda Available



**Training:**

In [0]:
def train_model(DWS, conv_layer_params, fire_layer_params, learning_rates, batch_sizes, root):
  # Initialization before loops:
  train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=2, shuffle=True,  num_workers=2)
  test_data_loader  = torch.utils.data.DataLoader(test_data,  batch_size=2, shuffle=False, num_workers=2)
  model = build_squeezenet(conv_layer_params, fire_layer_params)
  device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 

  tr_loss_history = []
  tr_acc_history = []

  for batch_size in batch_sizes:
    num_batches = int(60000/batch_size)
    num_checks = 10
    batches_per_check = int(num_batches/num_checks)
    train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True,  num_workers=2)
    test_data_loader  = torch.utils.data.DataLoader(test_data,  batch_size=batch_size, shuffle=False, num_workers=2)

    early_stop_thresh = 0.001
    patience = int(num_checks / 3)

    for lr in learning_rates:
      stop_flag_idx = -1
      print('Learning Rate: ' + str(lr) + '; Batch Size: ' + str(batch_size))
      
      model = build_squeezenet(conv_layer_params, fire_layer_params)
      if DWS:
        model = build_squeezenet_depthwise(conv_layer_params, fire_layer_params)

      if torch.cuda.is_available():
        print("Cuda Available")
        model.to(device)

      optimizer = Adam(model.parameters(), lr = lr)
      criterion = nn.CrossEntropyLoss()
      curr_loss_history = []    

      for epoch in range(3):
        running_loss = 0.0
        for mini_batch, d in enumerate(train_data_loader, 0):
          # print(mini_batch)
          inputs, class_labels = d
          if torch.cuda.is_available():
            inputs, class_labels = d[0].to(device), d[1].to(device)

          optimizer.zero_grad()
          train_output = model(inputs)
          batch_loss = criterion(train_output.squeeze(), class_labels)
          batch_loss.backward()
          optimizer.step()

          # Print loss
          running_loss += batch_loss.item()
          if mini_batch % batches_per_check == batches_per_check - 1:
              print('Epoch: %d, Batch_idx: %5d Loss: %.3f' %
                    (epoch + 1, mini_batch + 1, running_loss / batches_per_check))
              curr_loss_history.append(running_loss)

              # Early stopping:
              if len(curr_loss_history) > 3:
                if stop_flag_idx == -1:
                  if abs(running_loss - curr_loss_history[-2]) < early_stop_thresh:
                    stop_flag_idx = len(curr_loss_history) - 1
                  # Else continue
                else:
                  if abs(running_loss - curr_loss_history[stop_flag_idx]) < early_stop_thresh:
                    if (len(curr_loss_history) - 1 - stop_flag_idx) > patience:
                      print("Early stopping: Curr Batch Loss = " + str(running_loss) + 
                                          ", Prev Batch Loss = " + str(curr_loss_history[-2]) + 
                                                ", Patience = " + str(patience))
                      print("Ended on batch: " + str(len(curr_loss_history)))
                      stop_flag_idx = -1
                      break
              running_loss = 0.0

      print('Finished Training')
      tr_loss_history.append(curr_loss_history)
      
      # Final accuracy after training:
      correct = 0.0
      total = 0.0
      with torch.no_grad():
          for data in test_data_loader:
              # .to(device) is a no-op on CPU, so no separate CUDA check is needed
              images, labels = data[0].to(device), data[1].to(device)
              outputs = model(images)
              # outputs has shape (batch, n_classes); take the highest-scoring class
              _, predicted = torch.max(outputs, 1)
              total += labels.size(0)
              correct += (predicted == labels).sum().item()

      print('Accuracy of the network on the 10000 test images: %d %%' % (
          100 * correct / total))
      tr_acc_history.append(float(correct / total))

      # Save model parameters:
      PATH =   root +     'model_lr' + str(lr) + '_batchsize_' + str(batch_size) + '_acc_' + str(100.0*correct/total)
      if DWS:
        PATH = root + 'DWS_model_lr' + str(lr) + '_batchsize_' + str(batch_size) + '_acc_' + str(100.0*correct/total)
      torch.save({
              'epoch': epoch,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'loss': batch_loss
              }, PATH)
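
One detail of the early-stopping logic in `train_model`: the `break` only leaves the inner mini-batch loop, so the next epoch still starts afterwards. A small helper that the epoch loop can also consult is one possible way to structure this. The sketch below is an illustration under assumptions, not the logic used to produce the results in this notebook; the class name `EarlyStopper` and the example loss values are made up.

```python
class EarlyStopper:
    """Hypothetical helper: stop once the loss has failed to improve by at
    least `min_delta` for `patience` consecutive checks."""

    def __init__(self, patience=3, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_checks = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Meaningful improvement: remember it and reset the counter
            self.best = loss
            self.stale_checks = 0
        else:
            self.stale_checks += 1
        return self.stale_checks >= self.patience


stopper = EarlyStopper(patience=3, min_delta=0.001)
losses = [2.30, 1.20, 0.90, 0.9005, 0.8998, 0.9001, 0.85]  # made-up values
stopped_at = None
for i, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = i
        break
print(stopped_at)
```

In a nested loop, the result of `should_stop` could be stored in a flag that is checked after the mini-batch loop so that the epoch loop exits as well.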

**Demonstration of >80% Accuracy on Test Set for Base SqueezeNet:**

In [0]:
# Ran for 15 Epochs
train_model(False, conv_layer_params, fire_layer_params, [0.0003], [300], root)

**Demonstration of >80% Accuracy on Test Set for SqueezeNet with DWS:**

In [0]:
# Ran for 3 epochs:
train_model(True, conv_layer_params, fire_layer_params, [0.01], [100], root)

**Loading Saved Training Models:**

In [0]:
# Base SqueezeNet: lr_0.0003; batch_size_300
base_SN_PATH = root + 'model_lr0.0003_batchsize_300_acc_85.54'

# DWS SqueezeNet: lr_0.01; batch_size_100
DWS_SN_PATH = root + 'DWS_model_lr0.01_batchsize_100_acc_84.62'

In [17]:
device = torch.device("cuda")
model = build_squeezenet(conv_layer_params, fire_layer_params)
model.load_state_dict(torch.load(base_SN_PATH)['model_state_dict'])
model.to(device)

model_dw = build_squeezenet_depthwise(conv_layer_params, fire_layer_params)
model_dw.load_state_dict(torch.load(DWS_SN_PATH)['model_state_dict'])
model_dw.to(device)

SqueezeNet(
  (conv1): DWSLayer(
    (depthwise): Conv2d(1, 1, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (pointwise): Conv2d(1, 96, kernel_size=(1, 1), stride=(1, 1))
    (batchnorm): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (relu1): ReLU(inplace=True)
  (fire_layers): Sequential(
    (0): FireLayer(
      (s11): DWSLayer(
        (depthwise): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
        (pointwise): Conv2d(96, 16, kernel_size=(1, 1), stride=(1, 1))
        (batchnorm): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (activ_f1): ReLU(inplace=True)
      (e11): DWSLayer(
        (depthwise): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
        (pointwise): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
        (batchnorm): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (e33): DWSLayer(
        (depthwise): Conv2d(1, 1, ke
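
The loading cell above hard-codes `torch.device("cuda")`, so it assumes a GPU runtime. A minimal sketch of a device-agnostic round trip is shown below, using `map_location` in `torch.load`. The tiny `nn.Linear` model and the temporary path are stand-ins chosen for illustration; in this notebook the model would come from `build_squeezenet(...)` and the path from `base_SN_PATH` or `DWS_SN_PATH`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model and path, used only to keep the example self-contained
model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "checkpoint_demo")
torch.save({"model_state_dict": model.state_dict()}, path)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# map_location remaps tensors that were saved on a different device
checkpoint = torch.load(path, map_location=device)

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model_state_dict"])
restored.to(device)
restored.eval()  # inference mode: BatchNorm layers use their running statistics
```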

## Results Summary

This notebook demonstrates that both a base SqueezeNet model, as described in "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", and a SqueezeNet model augmented with depthwise separable (DWS) layers, as described in "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", are able to achieve >80% accuracy on the 10,000 test images from the Fashion MNIST dataset. Specifically, the base SqueezeNet achieved an accuracy of 85% after 15 epochs with a learning rate of 0.0003 and a batch size of 300, and the SqueezeNet using DWS layers achieved an accuracy of 84% after 3 epochs with a learning rate of 0.01 and a batch size of 100.

The key takeaways from these experiments are:

1. The base SqueezeNet architecture and its Fire layers achieve decent performance in a multiclass image classification setting and show potential for further improvement from longer training alone.
2. Using DWS layers within the SqueezeNet architecture and its Fire layers substantially decreases model size (from 1249026 parameters to 761014, a reduction of about 39% under this particular architecture) while achieving comparable performance, with similar indications of potential improvement: for both models, the loss printed every tenth of an epoch had not converged by the end of training.

Therefore, to further improve both models, they should be trained to convergence rather than for a fixed number of epochs. Hyperparameter tuning can additionally be done on the learning rate, batch_size, and the macroarchitecture defined by conv_layer_params and fire_layer_params.

In terms of macroarchitecture, the models can be further pared down by adding the maxpool layers shown in Figure 2 of the SqueezeNet paper (omitted for this exercise) and by introducing network pruning; accuracy may be improved by implementing bypass connections between layers (also suggested by the SqueezeNet paper); and layers can otherwise be added or dropped based on external resource constraints.

As a final footnote: the base SqueezeNet model was observed to train faster than the DWS SqueezeNet despite having more parameters. The exact cause was not determined, but it is likely the overhead of creating multiple new tensors and the additional computation steps (including a Python loop) that replace a single nn.Conv2d layer, which should slow down both the forward and backward passes during training.
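
The two parameter counts quoted in this summary can be reproduced by hand from the layer configuration, which also shows where the savings come from. The sketch below is plain Python with hypothetical helper names; `dws_params` models the `DWSLayer` exactly as written in this notebook (one shared 1->1 kernel, a 1x1 pointwise convolution, and `BatchNorm2d`), not a textbook depthwise separable layer.

```python
# Layer sizes copied from conv_layer_params / fire_layer_params in this notebook
conv_layer_params = [[96, 7], [1000, 1]]
fire_layer_params = [[16, 64, 64], [16, 64, 64], [32, 128, 128], [32, 128, 128],
                     [48, 192, 192], [48, 192, 192], [64, 256, 256], [64, 256, 256]]


def conv_params(c_in, c_out, k):
    """nn.Conv2d: k*k*c_in*c_out weights plus c_out biases."""
    return k * k * c_in * c_out + c_out


def dws_params(c_in, c_out, k):
    """DWSLayer as written in this notebook."""
    shared_depthwise = k * k + 1             # single 1->1 kernel plus its bias
    pointwise = conv_params(c_in, c_out, 1)  # 1x1 channel-mixing convolution
    batchnorm = 2 * c_out                    # learnable scale and shift
    return shared_depthwise + pointwise + batchnorm


def count(layer_params, in_channels=1, n_classes=10):
    (c1, k1), (c10, k10) = conv_layer_params
    total = layer_params(in_channels, c1, k1)   # conv1
    c_in = c1
    for s11, e11, e33 in fire_layer_params:
        total += layer_params(c_in, s11, 1)     # squeeze 1x1
        total += layer_params(s11, e11, 1)      # expand 1x1
        total += layer_params(s11, e33, 3)      # expand 3x3
        c_in = e11 + e33                        # expand outputs are concatenated
    total += layer_params(c_in, c10, k10)       # conv10
    total += c10 * n_classes + n_classes        # final nn.Linear
    return total


base, dws = count(conv_params), count(dws_params)
print(base, dws, f"{100 * (base - dws) / base:.1f}% fewer")
# prints: 1249026 761014 39.1% fewer
```

Most of the remaining parameters in both models sit in `conv10` (512 -> 1000 channels), which is a 1x1 convolution and therefore gains nothing from the DWS substitution.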