# Convert Distiller Post-Train Quantization Models to "Native" PyTorch

## Background

As of version 1.3, PyTorch comes with built-in quantization functionality. Details are available in the [PyTorch quantization documentation](https://pytorch.org/docs/stable/quantization.html). Distiller's and PyTorch's implementations are completely unrelated. An advantage of PyTorch's built-in quantization is that it offers optimized 8-bit execution on CPU and export to GLOW. PyTorch doesn't offer optimized 8-bit execution on GPU (as of version 1.4).

At the moment we are still keeping Distiller's separate API and implementation, but we've added the capability to convert a **post-training quantization** model created in Distiller to a "Distiller-free" model, composed entirely of PyTorch built-in quantized modules.

Distiller's quantized layers are actually simulated in FP32. Hence, when comparing a Distiller model to a PyTorch built-in model, both running on CPU, the latter will be significantly faster. However, a Distiller model on a GPU is still likely to be faster than a PyTorch model on CPU. So experimenting with Distiller and converting to PyTorch at the end could be useful. Mileage may vary, of course, depending on the actual HW setup.
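
To make "simulated in FP32" concrete, here is a minimal sketch of a quantize-then-dequantize step in plain Python. It is not Distiller's implementation: the function name `fake_quantize` and the convention `real = scale * (q - zero_point)` are assumptions chosen for illustration.

```python
def fake_quantize(x, scale, zero_point, num_bits=8):
    """Snap x to an unsigned integer grid, then map it back to a float.

    Hypothetical helper for illustration; assumes real = scale * (q - zero_point).
    """
    q_min, q_max = 0, 2 ** num_bits - 1
    q = round(x / scale) + zero_point
    q = max(q_min, min(q_max, q))      # clamp to the representable range
    return (q - zero_point) * scale    # the result is still a regular float


scale, zero_point = 0.1, 0
print(fake_quantize(0.337, scale, zero_point))   # snaps to the nearest grid point
print(fake_quantize(100.0, scale, zero_point))   # saturates at the top of the range
```

The value carries the rounding and clipping error of 8-bit quantization, but all arithmetic still runs in floating point, which is why no speedup should be expected from the simulated model.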

Let's see how the conversion works.

In [None]:
import torch
import matplotlib.pyplot as plt
import os
import math
import torchnet as tnt
from ipywidgets import widgets, interact
from copy import deepcopy
from collections import OrderedDict

import distiller
from distiller.models import create_model
import distiller.quantization as quant

# Load some common code and configure logging
# We do this so we can see the logging output coming from
# Distiller function calls
%run './distiller_jupyter_helpers.ipynb'
msglogger = config_notebooks_logger()

## Create Model

In [None]:
# By default, the model is moved to the GPU and parallelized (wrapped with torch.nn.DataParallel)
# If no GPU is available, a non-parallel model is created on the CPU
model = create_model(pretrained=True, dataset='imagenet', arch='resnet18', parallel=True)

## Create Data Loaders

We create separate data loaders for GPU and CPU. Set `batch_size` and `num_workers` to optimal values that match your HW setup.

(Note that we reset the seed before creating each data loader, so that both loaders use the same subset of the test set.)
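
The effect of re-seeding can be sketched with the standard `random` module. This is only a stand-in: `pick_subset` is a hypothetical helper, and the way Distiller actually samples the subset is internal to its data loading code.

```python
import random


def pick_subset(num_samples, subset_size, seed):
    """Hypothetical stand-in for seeded subset selection."""
    random.seed(seed)                  # reset the global RNG state
    indices = list(range(num_samples))
    random.shuffle(indices)
    return sorted(indices[:subset_size])


# Re-seeding before each call gives both "loaders" the same samples
gpu_subset = pick_subset(1000, 100, seed=0)
cpu_subset = pick_subset(1000, 100, seed=0)
print(gpu_subset == cpu_subset)
```

Without the second seed reset, the second shuffle would continue from the RNG state left by the first one, and the two subsets would generally differ.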

In [None]:
# We use Distiller's built-in data loading functionality for ImageNet

distiller.set_seed(0)

subset_size = 1.0 # To save time, can set to value < 1.0
dataset = 'imagenet'
dataset_path = os.path.expanduser('/data2/datasets/imagenet')

batch_size_gpu = 256
num_workers_gpu = 10
_, _, test_loader_gpu, _ = distiller.apputils.load_data(
    dataset, dataset_path, batch_size_gpu, num_workers_gpu,
    effective_test_size=subset_size, fixed_subset=True, test_only=True)

In [None]:
distiller.set_seed(0)
batch_size_cpu = 44
num_workers_cpu = 10
_, _, test_loader_cpu, _ = distiller.apputils.load_data(
    dataset, dataset_path, batch_size_cpu, num_workers_cpu,
    effective_test_size=subset_size, fixed_subset=True, test_only=True)

## Define Evaluation Function

In [None]:
def eval_model(data_loader, model, device, print_freq=10):
    print('Evaluating model')
    criterion = torch.nn.CrossEntropyLoss().to(device)
    
    loss = tnt.meter.AverageValueMeter()
    classerr = tnt.meter.ClassErrorMeter(accuracy=True, topk=(1, 5))

    total_samples = len(data_loader.sampler)
    batch_size = data_loader.batch_size
    total_steps = math.ceil(total_samples / batch_size)
    print('{0} samples ({1} per mini-batch)'.format(total_samples, batch_size))

    # Switch to evaluation mode
    model.eval()

    for step, (inputs, target) in enumerate(data_loader):
        with torch.no_grad():
            inputs, target = inputs.to(device), target.to(device)
            # compute output from model
            output = model(inputs)

            # compute loss and measure accuracy
            loss.add(criterion(output, target).item())
            classerr.add(output.data, target)
            
            if (step + 1) % print_freq == 0:
                print('[{:3d}/{:3d}] Top1: {:.3f}  Top5: {:.3f}  Loss: {:.3f}'.format(
                      step + 1, total_steps, classerr.value(1), classerr.value(5), loss.mean), flush=True)
    print('----------')
    print('Overall ==> Top1: {:.3f}  Top5: {:.3f}  Loss: {:.3f}'.format(
        classerr.value(1), classerr.value(5), loss.mean), flush=True)

## Post-Train Quantize with Distiller

In [None]:
quant_mode = {'activations': 'ASYMMETRIC_UNSIGNED', 'weights': 'SYMMETRIC'}
stats_file = "../examples/quantization/post_train_quant/stats/resnet18_quant_stats.yaml"
dummy_input = distiller.get_dummy_input(input_shape=model.input_shape)

quantizer = quant.PostTrainLinearQuantizer(
    deepcopy(model), bits_activations=8, bits_parameters=8, mode=quant_mode,
    model_activation_stats=stats_file, overrides=None
)
quantizer.prepare_model(dummy_input)
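
As a rough illustration of the two modes named in `quant_mode`, the sketch below derives quantization parameters from range statistics. It uses the common convention `real = scale * (q - zero_point)`; the function names are hypothetical and Distiller's internal formulas and conventions may differ.

```python
def symmetric_params(abs_max, num_bits=8):
    """Signed range centred on zero, so the zero-point is always 0."""
    q_max = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    return abs_max / q_max, 0


def asymmetric_unsigned_params(min_val, max_val, num_bits=8):
    """Unsigned range [0, 2^bits - 1] mapped onto [min_val, max_val].

    Assumes a non-degenerate range (max_val > min_val after including zero).
    """
    # Extend the range to include zero so that 0.0 is exactly representable
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    q_max = 2 ** num_bits - 1                  # 255 for 8 bits
    scale = (max_val - min_val) / q_max
    zero_point = round(-min_val / scale)
    return scale, zero_point


print(symmetric_params(1.27))                  # e.g. weights with |w| <= 1.27
print(asymmetric_unsigned_params(-1.0, 3.0))   # e.g. activations in [-1, 3]
```

The asymmetric mode spends the full integer range on the observed interval, which tends to suit activations with skewed distributions, while the symmetric mode avoids a zero-point for weights.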

## Convert to PyTorch Built-In

In [None]:
# Here we trigger the conversion via the Quantizer instance. Later on we show another way which does not
# require the quantizer
pyt_model = quantizer.convert_to_pytorch(dummy_input)

# Note that the converted model is automatically moved to the CPU, regardless
# of the device of the Distiller model
print('Distiller model device:', distiller.model_device(quantizer.model))
print('PyTorch model device:', distiller.model_device(pyt_model))

## Run Evaluation
### Distiller Model on GPU (if available)

In [None]:
if torch.cuda.is_available():
    %time eval_model(test_loader_gpu, quantizer.model, 'cuda')

### Distiller Model on CPU

In [None]:
if torch.cuda.is_available():
    print('Creating CPU copy of Distiller model')
    cpu_model = distiller.make_non_parallel_copy(quantizer.model).cpu()
else:
    cpu_model = quantizer.model
%time eval_model(test_loader_cpu, cpu_model, 'cpu', print_freq=60)

### PyTorch Model on CPU

We expect the PyTorch model on CPU to be much faster than the Distiller model on CPU.

In [None]:
%time eval_model(test_loader_cpu, pyt_model, 'cpu', print_freq=60)

## For the Extra-Curious: Comparing the Models

1. Distiller takes care of quantizing the inputs within the quantized modules, whereas PyTorch quantized modules assume the input is already quantized. Hence, for cases where a module's input is not quantized, we explicitly add a quantization operation for the input. The first layer in the model, `conv1` in ResNet18, is such a case.
2. Both Distiller and native PyTorch support fused ReLU. In Distiller, this is somewhat obscurely indicated by the `clip_half_range` attribute inside `output_quant_settings`. In PyTorch, the module type is explicitly `QuantizedConvReLU2d`.
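
One way to see why a ReLU can be folded into a quantized op: if the output is quantized to an unsigned range that starts at zero, the clamp at the bottom of the range already zeroes out negative values. The sketch below checks this in plain Python; the helper names are hypothetical and it does not reproduce either library's implementation.

```python
def quantize_unsigned(x, scale, num_bits=8):
    """Quantize to [0, 2^bits - 1] with the zero-point fixed at 0."""
    q = round(x / scale)
    return max(0, min(2 ** num_bits - 1, q))


def relu_then_quantize(x, scale):
    """Explicit ReLU followed by the same quantization."""
    return quantize_unsigned(max(x, 0.0), scale)


# The clamp alone gives the same integers as ReLU + quantization
for x in (-2.0, -0.3, 0.0, 0.7, 5.0):
    assert quantize_unsigned(x, 0.05) == relu_then_quantize(x, 0.05)
print('clamping at zero matches an explicit ReLU')
```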

In [None]:
print('conv1\n')
print('DISTILLER:\n{}\n'.format(quantizer.model.module.conv1))
print('PyTorch:\n{}\n'.format(pyt_model.conv1))

Example of internal layers which don't require explicit input quantization:

In [None]:
print('layer1.0.conv1')
print(pyt_model.layer1[0].conv1)
print('\nlayer1.0.add')
print(pyt_model.layer1[0].add)

### Automatic de-quantization <--> quantization in the model

For each quantized module in the Distiller implementation, we quantize the input and de-quantize the output.
So, if the user explicitly sets "internal" modules to run in FP32, this is transparent to the other quantized modules (at the cost of redundant quant-dequant operations).

When converting to PyTorch we remove these redundant operations, and keep just the required ones in case the user explicitly decided to run some modules in FP32.
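
The idea of keeping only the required conversions can be sketched for a simple chain of layers. This is a hypothetical planner, not the converter's actual logic: it assumes a purely sequential model whose input and output are FP32, and it inserts a conversion only where the data domain changes.

```python
def plan_conversions(layers):
    """layers: list of (name, is_quantized) pairs in execution order.

    Returns the op sequence with 'quant'/'dequant' added only at
    FP32 <-> quantized boundaries (illustrative assumption).
    """
    plan = []
    in_quantized_domain = False            # the model input arrives as FP32
    for name, is_quantized in layers:
        if is_quantized and not in_quantized_domain:
            plan.append('quant')
        elif not is_quantized and in_quantized_domain:
            plan.append('dequant')
        plan.append(name)
        in_quantized_domain = is_quantized
    if in_quantized_domain:
        plan.append('dequant')             # assumed: the caller expects FP32
    return plan


print(plan_conversions([('conv1', True), ('conv2', True), ('fc', True)]))
print(plan_conversions([('conv1', True), ('conv2', False), ('fc', True)]))
```

In the fully quantized chain there is one `quant` at the start and one `dequant` at the end. Keeping `conv2` in FP32 adds exactly one `dequant` before it and one `quant` after it.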

For an example, consider a ResNet "basic block" with a residual connection that contains a downsampling convolution. Let's see how such a block looks in our fully-quantized, converted model:

In [None]:
print(pyt_model.layer2[0])

We can see that all layers are either built-in quantized PyTorch modules, or identity operations representing fused operations. The entire block is quantized, so we don't see any quant-dequant operations in the middle.

Now let's create a new quantized model, and this time leave the 'downsample' module in FP32:

In [None]:
overrides = OrderedDict(
    [('layer2.0.downsample.0', OrderedDict([('bits_activations', None), ('bits_weights', None)]))]
)
new_quantizer = quant.PostTrainLinearQuantizer(
    deepcopy(model), bits_activations=8, bits_parameters=8, mode=quant_mode,
    model_activation_stats=stats_file, overrides=overrides
)
new_quantizer.prepare_model(dummy_input)

new_pyt_model = new_quantizer.convert_to_pytorch(dummy_input)

print(new_pyt_model.layer2[0])

We can see a few differences:
1. The `downsample` module now contains a de-quantize op before the actual convolution.
2. The `add` module now contains a quantize op before the actual add. Note that the add operation accepts 2 inputs. In this case the first input (index 0) comes from the `conv2` module, which is quantized. The second input (index 1) comes from the `downsample` module, which we kept in FP32. So, we only need to quantize the input at index 1. We can see this is indeed what is happening, by looking at the `ModuleDict` inside the `quant` module, and noticing it has only a single key for index "1".

Let's see how the `add` module would look if we also kept the `conv2` module in FP32:

In [None]:
overrides = OrderedDict(
    [('layer2.0.downsample.0', OrderedDict([('bits_activations', None), ('bits_weights', None)])),
     ('layer2.0.conv2', OrderedDict([('bits_activations', None), ('bits_weights', None)]))]
)
new_quantizer = quant.PostTrainLinearQuantizer(
    deepcopy(model), bits_activations=8, bits_parameters=8, mode=quant_mode,
    model_activation_stats=stats_file, overrides=overrides
)
new_quantizer.prepare_model(dummy_input)

new_pyt_model = new_quantizer.convert_to_pytorch(dummy_input)

print(new_pyt_model.layer2[0].add)

We can see that now both inputs to the add module are being quantized.
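
The per-input decision for a multi-input module such as `add` can be summarized in a few lines. The helper below is hypothetical and only mirrors the rule described in this section: an input needs a quantize op when the module producing it was left in FP32.

```python
def inputs_to_quantize(producer_is_quantized):
    """Return the input indices (as string keys) that need a quantize op.

    producer_is_quantized[i] says whether the module feeding input i
    is quantized. Illustrative helper, not part of Distiller.
    """
    return [str(i) for i, quantized in enumerate(producer_is_quantized)
            if not quantized]


# conv2 quantized, downsample in FP32
print(inputs_to_quantize([True, False]))
# both conv2 and downsample in FP32
print(inputs_to_quantize([False, False]))
```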

## Another API for Conversion

In some cases we don't have the actual quantizer, for example if the Distiller quantized model was loaded from a checkpoint. In those cases we can call a `distiller.quantization` module-level function. (In fact, the Quantizer method we used earlier is a wrapper around this function.)

### Save Distiller model to checkpoint

In [None]:
# Save the Distiller model to a checkpoint ('checkpoint.pth.tar' in the current directory)
distiller.apputils.save_checkpoint(0, 'resnet18', quantizer.model)

### Load Checkpoint

The model is quantized when the checkpoint is loaded.

In [None]:
loaded_model = create_model(False, dataset='imagenet', arch='resnet18', parallel=True)
loaded_model = distiller.apputils.load_lean_checkpoint(loaded_model, 'checkpoint.pth.tar')

### Convert and Evaluate

In [None]:
# Convert
loaded_pyt_model = distiller.quantization.convert_distiller_ptq_model_to_pytorch(loaded_model, dummy_input)

# Run evaluation
%time eval_model(test_loader_cpu, loaded_pyt_model, 'cpu', print_freq=60)

# Cleanup
os.remove('checkpoint.pth.tar')