## Weight Quantization

Neural network models can take up a lot of space on disk, with the original AlexNet being over 200 MB in float format for example. Almost all of that size is taken up with the weights for the neural connections, since there are often many millions of these in a single model. Because they're all slightly different floating point numbers, simple compression formats like zip don't compress them well.
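The claim about generic compression can be checked with a minimal sketch. The weights here are an assumption: random Gaussian values standing in for a trained layer, not weights from a real model. The sketch compresses the raw float32 bytes with `zlib` (the algorithm behind zip) and reports how much space is saved.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for trained weights: 200,000 small float32 values.
weights = rng.normal(0.0, 0.05, size=200_000).astype(np.float32)

raw_bytes = weights.tobytes()
compressed = zlib.compress(raw_bytes)

# Fraction of the original size that remains after compression.
float_ratio = len(compressed) / len(raw_bytes)
print("raw: %d bytes, zlib: %d bytes (%.0f%% of the original)"
      % (len(raw_bytes), len(compressed), 100 * float_ratio))
```

Most of the bytes in each float are mantissa bits that look random to the compressor, which is why the saving stays small.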

Training neural networks is done by applying many tiny nudges to the weights, and these small increments typically need floating point precision to work. Taking a pre-trained model and running inference is very different. If you think about recognizing an object in a photo you've just taken, the network has to ignore all the noise, lighting changes, and other non-essential differences between it and the training examples it's seen before, and focus on the important similarities instead. Because of this robustness, networks seem to treat low-precision calculations as just another source of noise, and still produce accurate results even with numerical formats that hold less information.

Once again we do our regular imports.



In [4]:
import numpy as np
np.random.seed(1337)  # for reproducibility
from sklearn.cluster import KMeans
import torch 
import torchvision
import torch.nn as nn
import torchvision.datasets as dsets
import torchvision.transforms as transforms
from torch.autograd import Variable
%matplotlib inline
import matplotlib.pyplot as plt

### Hyperparameters

In [5]:
num_epochs = 5
batch_size = 100
learning_rate = 0.001
use_reg = True

### Downloading the MNIST dataset

In [6]:
train_dataset = dsets.MNIST(root='../../data/lab6',
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='../../data/lab6',
                           train=False, 
                           transform=transforms.ToTensor())

Files already downloaded


### Dataloader

In [7]:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

### Define the network

In [8]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU())
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.fc1 = nn.Linear(7*7*32, 300)
        self.fc2 = nn.Linear(300, 10)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.fc2(out)
        return out

<b>The function below reinitializes the weights of the network and defines the required loss criterion and optimizer.</b>

In [9]:
def reset_model():
    net = Net()
    net = net.cuda()

    # Loss and Optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
    return net,criterion,optimizer

### Initializing the model

In [10]:
net, criterion, optimizer = reset_model()

### Defining an L1 Regularizer

In [11]:
def l1_regularizer(net, loss, beta):
    l1_crit = nn.L1Loss(reduction='sum')  # sum of absolute differences; replaces the deprecated size_average=False
    reg_loss = 0
    for param in net.parameters():
        target = Variable((torch.FloatTensor(param.size()).zero_()).cuda())
        reg_loss += l1_crit(param, target)
        
    loss += beta * reg_loss
    return loss

### Training function

In [12]:
# Train the Model

def training(net, reset = True):
    if reset == True:
        net, criterion, optimizer = reset_model()
    else:
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
    
    net.train()
    for epoch in range(num_epochs):
        total_loss = 0
        accuracy = []
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda()
            labels = labels.cuda()
            temp_labels = labels
            images = Variable(images)
            labels = Variable(labels)

            # Forward + Backward + Optimize
            optimizer.zero_grad()
            outputs = net(images)
            loss = criterion(outputs, labels)

            if use_reg == True :
                loss = l1_regularizer(net,loss,beta=0.001)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()  # loss.data[0] no longer works on 0-dim tensors in recent PyTorch
            _, predicted = torch.max(outputs.data, 1)
            correct = (predicted == temp_labels).sum().item()
            accuracy.append(correct/float(batch_size))

        print('Epoch: %d, Loss: %.4f, Accuracy: %.4f' %(epoch+1,total_loss, (sum(accuracy)/float(len(accuracy)))))
    
    return net

### Testing function

In [13]:
# Test the Model
def testing(net):
    net.eval() 
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.cuda()
        labels = labels.cuda()
        images = Variable(images)
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()  # convert to a Python int so the percentage below is a float

    print('Test Accuracy of the network on the 10000 test images: %.2f %%' % (100.0 * correct / total))

### Training and testing the network

In [14]:
reset = True
net = training(net, reset)
testing(net)

Epoch: 1, Loss: 592.2687, Accuracy: 0.9522
Epoch: 2, Loss: 231.6676, Accuracy: 0.9769
Epoch: 3, Loss: 191.3403, Accuracy: 0.9789
Epoch: 4, Loss: 173.4457, Accuracy: 0.9804
Epoch: 5, Loss: 164.5340, Accuracy: 0.9814
Test Accuracy of the network on the 10000 test images: 98.12 %


### Uniform Quantization

The simplest motivation for quantization is to shrink file sizes by storing the min and max for each layer, and then compressing each float value to an eight-bit integer representing the closest real number in a linear set of 256 within the range.

In the function below, we pass 8 as the number of bits, meaning that each weight of the network is represented with only 8 bits when stored on disk. In other words, we use only 2^8 = 256 clusters, so each weight is represented as an 8-bit integer between 0 and 255.

Thus, before the weights are used at test time, they need to be projected back into the original weight space using the following equation, where $W_{index}$ is the stored 8-bit integer:

$$
W_{i} = \min + \dfrac{\max - \min}{255} \cdot W_{index}
$$
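As a minimal sketch of the equation above, the snippet below stores a layer as 8-bit indices plus its min and max, then projects the indices back into the weight space. The helper names `encode_uint8` and `decode_uint8` and the random test weights are assumptions made for illustration; they are not part of this notebook.

```python
import numpy as np

def encode_uint8(weights):
    """Map float weights to 8-bit indices plus the (min, max) needed to decode."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:  # all weights identical: every index is 0
        return np.zeros(weights.shape, dtype=np.uint8), w_min, w_max
    indices = np.round((weights - w_min) / scale).astype(np.uint8)
    return indices, w_min, w_max

def decode_uint8(indices, w_min, w_max):
    """Apply W_i = min + (max - min) / 255 * W_index."""
    return w_min + (w_max - w_min) / 255.0 * indices.astype(np.float32)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000).astype(np.float32)  # hypothetical layer

indices, w_min, w_max = encode_uint8(weights)
restored = decode_uint8(indices, w_min, w_max)

step = (w_max - w_min) / 255.0
max_error = float(np.abs(weights - restored).max())
print("step size: %.6f, largest reconstruction error: %.6f" % (step, max_error))
```

Because every weight is rounded to the nearest of 256 evenly spaced levels, the reconstruction error is bounded by half a step.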

In [15]:
def uniform_quantize(weight, bits):
    print('-------------------------LAYER---------------------------')
    print("Number of unique parameters before quantization: " + str(len(np.unique(weight))))
    n_clusters = 2**bits
    
    maxim = np.amax(weight)
    minim = np.amin(weight)
    step= (maxim-minim)/(n_clusters - 1)

    clusters=[]

    for i in range(0,n_clusters):
        clusters.append(minim)
        minim+=step

    for i in range(0,len(weight)):
        dist= (clusters-weight[i])**2     
        weight[i]=clusters[np.argmin(dist)]
        
    print("Number of unique parameters after quantization: " + str(len(np.unique(weight))))
    
    return weight  

### Quantizing the weights and biases

Different numbers of bits can be used for representing the weights and biases. The exact number of bits is a design choice and may depend on the complexity of the task at hand, since using too few bits can result in poor performance. Here, we use 8 bits for quantizing the weights and 1 bit for the biases.

In [16]:
for m in net.modules():
    if isinstance(m,nn.Conv2d) or isinstance(m,nn.BatchNorm2d) or isinstance(m,nn.Linear):
        temp_weight = m.weight.data.cpu().numpy()
        dims = temp_weight.shape
        temp_weight = temp_weight.flatten()
        temp_weight = uniform_quantize(temp_weight, 8)
        temp_weight=np.reshape(temp_weight,dims)
        m.weight.data = (torch.FloatTensor(temp_weight).cuda())
        
        temp_bias = m.bias.data.cpu().numpy()
        dims = temp_bias.shape
        temp_bias = temp_bias.flatten()
        temp_bias = uniform_quantize(temp_bias, 1)
        temp_bias = np.reshape(temp_bias,dims)
        m.bias.data = (torch.FloatTensor(temp_bias).cuda())

-------------------------LAYER---------------------------
Number of unique parameters before quantization: 400
Number of unique parameters after quantization: 122
-------------------------LAYER---------------------------
Number of unique parameters before quantization: 16
Number of unique parameters after quantization: 2
-------------------------LAYER---------------------------
Number of unique parameters before quantization: 16
Number of unique parameters after quantization: 11
-------------------------LAYER---------------------------
Number of unique parameters before quantization: 16
Number of unique parameters after quantization: 2
-------------------------LAYER---------------------------
Number of unique parameters before quantization: 2304
Number of unique parameters after quantization: 136
-------------------------LAYER---------------------------
Number of unique parameters before quantization: 16
Number of unique parameters after quantization: 2
-------------------------LAYER--

Now that each weight has been replaced with the value of its nearest cluster, we can test the network with the modified weights.

In [17]:
testing(net)

Test Accuracy of the network on the 10000 test images: 98.30 %
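To get a feel for how the choice of bit width affects the approximation, the following sketch measures the mean squared error of uniform quantization at several bit widths. It uses the same nearest-level idea as `uniform_quantize`, but vectorized; the function name `quantize_to_levels` and the random weights are assumptions for illustration only.

```python
import numpy as np

def quantize_to_levels(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels."""
    levels = np.linspace(weights.min(), weights.max(), 2 ** bits)
    # Distance from every weight to every level, then pick the closest level.
    nearest = np.abs(weights[:, None] - levels[None, :]).argmin(axis=1)
    return levels[nearest]

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=2000)  # hypothetical layer weights

errors = {}
for bits in (1, 2, 4, 8):
    quantized = quantize_to_levels(weights, bits)
    errors[bits] = float(np.mean((weights - quantized) ** 2))
    print("%d bits: %3d levels, mean squared error %.2e"
          % (bits, 2 ** bits, errors[bits]))
```

The error shrinks quickly as bits are added. This measures only how well the weights are approximated; the effect on test accuracy still has to be checked on the network itself.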


## Non-uniform quantization

In the previous method, we divided the weight space into equally spaced cluster heads. However, instead of forcing the cluster heads to be equally spaced, it makes more sense to learn them. A common practice is to learn the cluster centers from the distribution of the weights using k-means clustering. Here, we define a function that applies k-means to the weight values.

$$
\min\sum_{i}^{mn}\sum_{j}^{k}\lVert w_{i}-c_{j}\rVert_{2}^{2}
$$
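As a small illustration of this objective, the sketch below clusters a toy set of weights with scikit-learn's `KMeans` and replaces each weight with its assigned center. The toy weights, the variable names, and the choice of `n_init=10` are assumptions; the objective is evaluated with each weight paired with the center it is assigned to.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=500).astype(np.float32)  # hypothetical layer

k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
# KMeans expects a 2-D array: one row per weight, one column per feature.
labels = kmeans.fit_predict(weights.reshape(-1, 1))

codebook = kmeans.cluster_centers_.ravel()  # the k learned centers c_j
quantized = codebook[labels]                # each w_i replaced by its center

objective = float(np.sum((weights - quantized) ** 2))
print("unique values after quantization:", len(np.unique(quantized)))
print("sum of squared distances to assigned centers: %.4f" % objective)
print("bits needed per weight index:", int(np.log2(k)))
```

With this scheme only the codebook and one small index per weight need to be stored, instead of a full float per weight.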

In [18]:
num_clusters = 8
kmeans = KMeans(n_clusters=num_clusters, random_state=0, max_iter=500, verbose=0)  # precompute_distances was removed in scikit-learn 1.0

In [19]:
def non_uniform_quantize(weights):
    print("---------------------------Layer--------------------------------")
    print("Number of unique parameters before quantization: " + str(len(np.unique(weights))))
    weights = np.reshape(weights,[weights.shape[0],1])
    print(weights.shape)
    kmeans_fit = kmeans.fit(weights)
    clusters = kmeans_fit.cluster_centers_
    
    for i in range(0,len(weights)):
        dist= (clusters-weights[i])**2     
        weights[i]=clusters[np.argmin(dist)]
        
    print("Number of unique parameters after quantization: " + str(len(np.unique(weights))))
    
    return weights  

We reset the model and retrain the network, since the current weights have already been uniformly quantized.

In [20]:
reset = True
net = training(net, reset)
testing(net)

Epoch: 1, Loss: 549.0652, Accuracy: 0.9510
Epoch: 2, Loss: 243.9295, Accuracy: 0.9738
Epoch: 3, Loss: 201.0722, Accuracy: 0.9773
Epoch: 4, Loss: 179.7706, Accuracy: 0.9792
Epoch: 5, Loss: 163.9932, Accuracy: 0.9803
Test Accuracy of the network on the 10000 test images: 97.17 %


Non-uniform quantization on the weights and biases

In [21]:
for m in net.modules():
    if isinstance(m,nn.Conv2d) or isinstance(m,nn.BatchNorm2d) or isinstance(m,nn.Linear):
        temp_weight = m.weight.data.cpu().numpy()
        dims = temp_weight.shape
        temp_weight = temp_weight.flatten()
        temp_weight = non_uniform_quantize(temp_weight)
        temp_weight=np.reshape(temp_weight,dims)
        m.weight.data = (torch.FloatTensor(temp_weight).cuda())
        
        temp_bias = m.bias.data.cpu().numpy()
        dims = temp_bias.shape
        temp_bias = temp_bias.flatten()
        temp_bias = non_uniform_quantize(temp_bias)
        temp_bias = np.reshape(temp_bias,dims)
        m.bias.data = (torch.FloatTensor(temp_bias).cuda())

---------------------------Layer--------------------------------
Number of unique parameters before quantization: 400
(400, 1)
Number of unique parameters after quantization: 8
---------------------------Layer--------------------------------
Number of unique parameters before quantization: 16
(16, 1)
Number of unique parameters after quantization: 8
---------------------------Layer--------------------------------
Number of unique parameters before quantization: 16
(16, 1)
Number of unique parameters after quantization: 8
---------------------------Layer--------------------------------
Number of unique parameters before quantization: 16
(16, 1)
Number of unique parameters after quantization: 8
---------------------------Layer--------------------------------
Number of unique parameters before quantization: 2304
(2304, 1)
Number of unique parameters after quantization: 8
---------------------------Layer--------------------------------
Number of unique parameters before quantization: 16
(1

In [22]:
testing(net)

Test Accuracy of the network on the 10000 test images: 97.61 %


### Retraining the network

Here, 8 clusters can be too few to keep the network at the same accuracy: a drop of almost 3% is possible, although the run shown above did not lose accuracy. One solution is to retrain the network. This lets the other weights compensate for those weights whose rounding to the nearest cluster center caused the drop in performance. Accuracy can be recovered significantly by retraining the network and then non-uniformly quantizing the weights again.

#### Exercise

In [23]:
# reset = False
# net = training(net, reset)
# perform non-uniform quantization
# testing(net)

### References

1. https://arxiv.org/pdf/1412.6115.pdf