# Tutorial: Compressing AlexNet on the CIFAR-10 Dataset, Achieving `97.83x` Compression

In this tutorial, we compress the AlexNet neural network on the CIFAR-10 dataset using Condensa. We target two different objectives: reducing the total model memory footprint, and reducing the inference latency of the compressed model.


In [51]:
from google.colab import drive
drive.mount('/content/drive')
import matplotlib


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
%cd "/content/drive/My Drive/notebooks"
!pip install kora -q
from kora import drive
drive.link_nbs()

/content/drive/My Drive/notebooks

We define the AlexNet network architecture in PyTorch as shown below:

In [4]:
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=5),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

We instantiate this class as `model`:




In [35]:
model = AlexNet()
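
As a quick check on the architecture defined above, the sketch below traces the spatial size of a 32x32 CIFAR-10 image through the `features` stack using the standard output-size formula for convolution and pooling layers. It is plain Python and does not touch the model itself; the helper name `conv_out` is made up for this illustration. The result shows why the classifier is declared as `nn.Linear(256, num_classes)`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) for each spatial layer in `features`, in order.
layers = [
    (11, 4, 5),  # Conv2d(3, 64)
    (2, 2, 0),   # MaxPool2d
    (5, 1, 2),   # Conv2d(64, 192)
    (2, 2, 0),   # MaxPool2d
    (3, 1, 1),   # Conv2d(192, 384)
    (3, 1, 1),   # Conv2d(384, 256)
    (3, 1, 1),   # Conv2d(256, 256)
    (2, 2, 0),   # MaxPool2d
]

size = 32  # CIFAR-10 images are 32x32
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)

flattened = 256 * size * size  # channels x height x width after `features`
print(size, flattened)  # 1 256
```

The feature map shrinks to 1x1 with 256 channels, so flattening it in `forward` yields exactly the 256 inputs the linear classifier expects.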

## Load Pre-Trained Weights

We load a pre-trained set of weights into the model from the `AlexNet.pth` file included with this notebook.

In [36]:
model.load_state_dict(torch.load('AlexNet.pth', map_location=torch.device('cpu')))

<All keys matched successfully>

## Preparing for Compression

Let's make sure CUDA is enabled in PyTorch.

In [37]:
assert torch.cuda.is_available()

In [38]:
!pip install condensa



We now create PyTorch data loaders for the training, test, and validation datasets. To save space, we wrap the data loading code into two utility functions: `cifar_train_val_loader` and `cifar_test_loader` (please refer to `util.py` in the current `notebooks` folder for the full code).

In [39]:
import util
import torchvision.datasets as datasets

In [40]:
dataset = datasets.CIFAR10

trainloader, valloader = util.cifar_train_val_loader(dataset, train_batch_size=128, val_batch_size=128)
testloader = util.cifar_test_loader(dataset, batch_size=128)

Files already downloaded and verified
Files already downloaded and verified




Files already downloaded and verified




The utilities above split the original training set into training and validation sets (using a 9:1 split) and perform data normalization for all datasets. They also utilize Condensa's `GPUDataLoader` to enable fast data prefetching and collation.
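
The 9:1 split can be pictured as a shuffle of sample indices followed by a cut. The sketch below is an illustration only: `util.py` is not reproduced here, so the function name `train_val_split`, the fixed seed, and the use of Python's `random` module are assumptions rather than the notebook's actual code.

```python
import random

def train_val_split(num_samples, val_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into train and validation lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_val = round(num_samples * val_fraction)
    return indices[num_val:], indices[:num_val]

# The CIFAR-10 training set contains 50,000 images.
train_idx, val_idx = train_val_split(50_000)
print(len(train_idx), len(val_idx))  # 45000 5000
```

Index lists like these could then be passed to `torch.utils.data.Subset` to build the two datasets.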

We now define our loss criterion:

In [55]:
criterion = nn.CrossEntropyLoss().cuda()

Finally, we set our logging level to `INFO` so that Condensa prints out intermediate updates.

In [56]:
import logging
logging.basicConfig(level=logging.INFO, format='%(message)s')

## Two Different Compression Strategies

In this tutorial, we will explore two different ways of compressing the AlexNet network: one targeted at reducing the total model memory footprint (named `MEM`) and the other at reducing inference runtime latency (named `FLOP`).

### MEM Scheme

The `MEM` scheme aims to reduce the total model memory footprint (number of bytes required to store the non-zero elements of the compressed model). To this end, we perform a combination of _pruning_ (clipping model parameters to zero) and _quantization_ (using 16-bit floating point representation to store model weights instead of 32-bit). Expressing this scheme in Condensa is fairly straightforward using the built-in [`Compose`](https://nvlabs.github.io/condensa/modules/schemes.html#composition) scheme as shown below:

In [57]:
import condensa
from condensa.schemes import Compose, Prune, Quantize

MEM = Compose([Prune(0.02), Quantize(condensa.float16)])

Here, the `Compose` operator successively applies pruning followed by quantization to the model. The pruning density, or the ratio of non-zero parameters in the compressed model to the original one, is specified as 0.02 (2%). Condensa includes a number of other common schemes, including structured and block pruning; see the API documentation for the full list.
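
To make the two stages concrete, here is a minimal plain-Python sketch of pruning to a target density followed by rounding to 16-bit floats. It assumes the weights to keep are selected by magnitude, and the helper names `prune` and `to_float16` are invented for this illustration; it is not Condensa's implementation. A density of 0.2 is used so the effect is visible on ten values.

```python
import struct

def prune(weights, density):
    """Keep the `density` fraction of largest-magnitude weights; zero the rest."""
    n_keep = max(1, round(density * len(weights)))
    # Indices of the n_keep largest-magnitude entries.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def to_float16(weights):
    """Round each value to the nearest IEEE 754 half-precision number."""
    return [struct.unpack('e', struct.pack('e', w))[0] for w in weights]

weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4, -0.02, 0.07, 0.6, -0.09]
compressed = to_float16(prune(weights, density=0.2))
nonzero = sum(1 for w in compressed if w != 0.0)
print(nonzero / len(weights))  # 0.2
```

Only the two largest-magnitude weights survive, and each surviving value is stored with half-precision rounding error.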

### FLOP Scheme

While the `MEM` scheme is effective at reducing the number of non-zero elements in a model, this may not directly translate into improvements in actual inference runtime. Most modern CPUs and GPUs are unable to detect individual zero elements and bypass computations on them in hardware. Instead, to realize speedups on such architectures, we perform filter pruning, which removes entire filters (3D blocks) at once from convolutional layers. This enables the weight tensors to be physically reshaped in the compressed model. We call this the `FLOP` scheme in this tutorial, and use the [`FilterPrune`](https://nvlabs.github.io/condensa/modules/schemes.html#filter-pruning) scheme in Condensa to define it.
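
The idea of physically shrinking the tensors can be sketched in plain Python. This illustration assumes filters are ranked by their L2 norm, and the helpers `filter_norms` and `prune_filters` are hypothetical names, not part of Condensa. Weights are stored as nested lists indexed `[out_channel][in_channel][kernel_element]`.

```python
import math

def filter_norms(weight):
    """L2 norm of each filter in a layer."""
    return [math.sqrt(sum(w * w for chan in filt for w in chan)) for filt in weight]

def prune_filters(weight, next_weight, density):
    """Keep the `density` fraction of filters with the largest L2 norm.

    Also drops the matching input channels of the following layer, so both
    tensors become physically smaller.
    """
    norms = filter_norms(weight)
    n_keep = max(1, round(density * len(weight)))
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    pruned = [weight[i] for i in keep]
    pruned_next = [[filt[i] for i in keep] for filt in next_weight]
    return pruned, pruned_next, keep

# Layer with 4 filters over 2 input channels (1x1 kernels for brevity).
weight = [[[0.9], [0.8]], [[0.01], [0.02]], [[0.5], [-0.7]], [[0.03], [0.0]]]
# Following layer: 3 filters over the 4 channels produced above.
next_weight = [[[1.0], [2.0], [3.0], [4.0]],
               [[5.0], [6.0], [7.0], [8.0]],
               [[9.0], [10.0], [11.0], [12.0]]]

pruned, pruned_next, keep = prune_filters(weight, next_weight, density=0.5)
print(keep, len(pruned), len(pruned_next[0]))  # [0, 2] 2 2
```

Because whole filters disappear, the next layer loses the corresponding input channels too, which is what lets dense hardware kernels run on smaller tensors.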

In [58]:
from condensa.schemes import FilterPrune
FLOP = FilterPrune(0.5)

## Setting up the Optimizer

To recover any accuracy lost due to compression, Condensa comes with a set of _optimizers_. Each optimizer takes a pre-trained model, applies the compression scheme, and tries to recover the original accuracy either directly or iteratively. In this tutorial, we'll be using Condensa's L-C optimizer. We instantiate it as follows:

In [59]:
lc = condensa.opt.LC(steps=35,                             # L-C iterations
                     l_optimizer=condensa.opt.lc.SGD,      # L-step sub-optimizer
                     l_optimizer_params={'momentum':0.95}, # L-step sub-optimizer parameters
                     lr=0.01,                              # Initial learning rate
                     lr_end=1e-4,                          # Final learning rate
                     mb_iterations_per_l=3000,             # Mini-batch iterations per L-step
                     mb_iterations_first_l=30000,          # Mini-batch iterations for first L-step
                     mu_init=1e-3,                         # Initial value of `mu`
                     mu_multiplier=1.1,                    # Multiplier for `mu`
                     mu_cap=10000,                         # Maximum value of `mu`
                     debugging_flags={'custom_model_statistics':
                                      condensa.util.cnn_statistics})


[Condensa] LC ENGINE CONFIG [steps=35, l_optimizer=<class 'condensa.opt.lc.sgd.SGD'>, l_optimizer_params={'momentum': 0.95}, lr=0.01, lr_end=0.0001, lr_decay=None, lr_schedule=None, lr_multiplier=None, mb_iterations_per_l=3000, mb_iterations_first_l=30000, mu_init=0.001, mu_multiplier=1.1, mu_cap=10000, distributed=False, debugging_flags={'custom_model_statistics': <function cnn_statistics at 0x7f44c9603e60>}]


Each optimizer in Condensa has its own set of hyper-parameters which must be specified manually by the user. A full description of hyper-parameter tuning is beyond the scope of this tutorial, but for additional information on what each hyper-parameter represents and tips on finding its optimal value, we refer you to the Condensa paper. In this notebook, we run the L-C algorithm for 35 iterations using the hyper-parameter values shown above. 
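
For intuition about the `mu` settings, the sketch below tabulates a capped geometric schedule using the values passed to the optimizer above. It assumes `mu` starts at `mu_init` and is multiplied by `mu_multiplier` after every L-C iteration, never exceeding `mu_cap`; the exact point at which Condensa applies the update may differ, and `mu_schedule` is a name invented for this illustration.

```python
def mu_schedule(steps, mu_init, mu_multiplier, mu_cap):
    """Value of mu at each L-C iteration under a capped geometric schedule."""
    schedule = []
    mu = mu_init
    for _ in range(steps):
        schedule.append(mu)
        mu = min(mu * mu_multiplier, mu_cap)  # grow, but never past the cap
    return schedule

schedule = mu_schedule(steps=35, mu_init=1e-3, mu_multiplier=1.1, mu_cap=10000)
print(f"first mu = {schedule[0]:.4g}, last mu = {schedule[-1]:.4g}")
```

Under this assumption, `mu` grows roughly 25-fold over the 35 iterations and stays far below the cap, so `mu_cap` does not come into play in this configuration.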

## Compressing the Model

Once the optimizer is instantiated, we can go ahead and perform the actual compression using the [`Compressor`](https://nvlabs.github.io/condensa/modules/compressor.html#model-compressor) class and its [`run`](https://nvlabs.github.io/condensa/modules/compressor.html#condensa.compressor.Compressor.run) method. **Note:** the next two cells may take a while to execute!

In [None]:
compressor_MEM  = condensa.Compressor(lc,
                                      MEM,
                                      model,
                                      trainloader,
                                      testloader,
                                      valloader,
                                      criterion)
w_MEM  = compressor_MEM.run()

In [None]:
compressor_FLOP = condensa.Compressor(lc,
                                      FLOP,
                                      model,
                                      trainloader,
                                      testloader,
                                      valloader,
                                      criterion)

w_FLOP = compressor_FLOP.run()

We specify the optimizer, scheme, input model, training, test, and validation sets, and the loss criterion to create an instance of the `Compressor` class. Since the optimizer is specified as a parameter, we are able to easily experiment with alternative optimizers in Condensa.

In the above snippets, `w_MEM` and `w_FLOP` contain the models compressed using the `MEM` and `FLOP` schemes, respectively. We can now save these to disk:

In [None]:
torch.save(w_MEM.state_dict(), 'AlexNet_MEM.pth')
torch.save(w_FLOP.state_dict(), 'AlexNet_FLOP.pth')

Condensa also records various statistics about the compression process. These can be retrieved using the `statistics` member of the compressor object as follows:

In [None]:
for k,v in compressor_MEM.statistics.items():
    print('{}: {}'.format(k, v))

In [None]:
for k,v in compressor_FLOP.statistics.items():
    print('{}: {}'.format(k, v))

## Results

Condensa achieves top-1 test accuracies of **77.49%** and **76.81%** for the MEM and FLOP schemes, respectively, compared to the baseline accuracy of **77.07%** for AlexNet. For more complex models, it is possible to further improve accuracies via model fine-tuning.

### Compression and Runtime Reductions

Using the MEM scheme, we reduce the model memory footprint by **97.83x**. Additionally, we achieve a **55.6%** reduction in FLOPs using the FLOP scheme.
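
As a back-of-the-envelope check on these figures, the arithmetic below computes an idealized memory ratio for the MEM scheme and the FLOP ratio implied by a 55.6% reduction. The idealization (every weight pruned at the same density, only non-zero values stored, no index overhead) is an assumption made for this sketch, not a description of how Condensa measures footprint.

```python
density = 0.02        # fraction of weights kept by Prune(0.02)
bytes_dense = 4       # float32 storage per weight
bytes_compressed = 2  # float16 storage per weight

# Idealized memory ratio: dense float32 bytes over stored float16 bytes.
nominal_ratio = bytes_dense / (density * bytes_compressed)
print(round(nominal_ratio, 2))  # 100.0

# A 55.6% FLOP reduction leaves 44.4% of the original operations.
flop_reduction = 0.556
flop_ratio = 1 / (1 - flop_reduction)
print(f"{flop_ratio:.2f}x fewer FLOPs")
```

The reported 97.83x sits just under the idealized 100x; the notebook does not break down where the difference comes from. The FLOP figure corresponds to a little over 2x fewer operations, which is a count of arithmetic work and not a measured latency.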