# SparseProp Usage Guide

This notebook is a guide to using SparseProp effectively. It walks through the steps needed to accelerate backpropagation with SparseProp, both for individual layers and for an entire network.

As an introduction, SparseProp provides a low-level CPU implementation of backpropagation for layers whose weights have unstructured sparsity. More specifically, if we have a sparse fully connected or convolution layer, SparseProp can speed up backpropagation through it on CPU. We further integrate SparseProp with the PyTorch framework, providing the *SparseLinear* and *SparseConv2d* modules as drop-in replacements for PyTorch's *Linear* and *Conv2d* modules, respectively. Further details of our algorithms can be found in [our paper](https://arxiv.org/abs/2302.04852).

If you haven't already installed *SparseProp*, make sure you have PyTorch installed, and then simply run the following cell:

In [None]:
%pip install sparseprop

Now, let's get started! This guide only covers the case where a single thread does all the computation. Run the following cell to limit both *PyTorch* and *SparseProp* to a single thread.

In [1]:
import torch
import sparseprop

torch.set_num_threads(1)
sparseprop.set_num_threads(1)

Also, let's set the random seeds to get consistent results.

In [2]:
import random
import numpy as np

seed = 10
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

f"All seeds were set to {seed}."

'All seeds were set to 10.'

## Individual Layer

Let's say we have a *Linear* module that is 98% sparse. For the sake of argument, let's actually create such a module. We assume that the input and output dimensions are 768 and 3072, respectively, but any other dimensions work just as well.

In [3]:
from torch.nn import Linear

linear = Linear(768, 3072) # input size of 768, and output size of 3072

# prune the module randomly to 98% unstructured sparsity
with torch.no_grad():

    # generate a random mask with roughly 98% sparsity
    mask = torch.rand_like(linear.weight) > 0.98

    # apply the mask to the module
    linear.weight.mul_(mask.float())

f"Our module's sparsity is now {(linear.weight == 0).float().mean().item():.2f}."

"Our module's sparsity is now 0.98."

So now we actually have a 98% sparse module, called `linear`. Let's see how long forward and backward steps take on this module. Assuming the batch size is 2048, we generate a synthetic batch of data.

In [4]:
X = torch.randn(2048, 768) # batch_size x input_dimension

# the following two lines tell PyTorch to keep the gradients for the input tensor
X.requires_grad_()
X.retain_grad()

y = torch.randn(2048, 3072) # batch_size x output_dimension

Now we measure the time required for the forward and backward steps of the `linear` module.

In [5]:
import time

# time the forward step
start = time.time()
O = linear(X)
pytorch_forward_time = time.time() - start
f"The forward pass took {pytorch_forward_time:.3f} seconds."

'The forward pass took 0.083 seconds.'

In [6]:
# calculate the mse loss
L = torch.mean((y - O) ** 2)

# time the backward step
start = time.time()
L.backward()
pytorch_backward_time = time.time() - start
f"The backward pass took {pytorch_backward_time:.3f} seconds."

'The backward pass took 0.196 seconds.'

Notice we haven't exploited *SparseProp*'s implementations yet. Let's see how much speedup we can get if we utilize SparseProp.

To do so, we only need one line of code:

In [7]:
from sparseprop.modules import SparseLinear

# this line will convert your pytorch module to a sparseprop module
sparse_linear = SparseLinear.from_dense(linear)

print(sparse_linear)

SparseLinear([3072, 768], sp=0.98, nnz=46972)
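
As a quick sanity check on the printed summary, the reported number of nonzeros (`nnz`) can be related to the sparsity level by simple arithmetic. The snippet below is only a worked calculation using the shape and `nnz` shown above; the variable names are illustrative and not part of SparseProp.

```python
# shape and nnz copied from the printed SparseLinear summary above
out_features, in_features = 3072, 768
nnz = 46972

total = out_features * in_features  # number of weights in the dense layer
density = nnz / total               # fraction of weights that are nonzero
sparsity = 1.0 - density            # fraction of weights that are zero

print(f"total weights: {total}")
print(f"density: {density:.4f}")
print(f"sparsity: {sparsity:.4f}")
```

The sparsity computed this way rounds to the `sp=0.98` shown in the summary.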


Now that we have a *SparseProp* module, let's again compute the forward and backward times:

In [8]:
# time the forward step
start = time.time()
O = sparse_linear(X)
sparseprop_forward_time = time.time() - start
f"The forward pass took {sparseprop_forward_time:.3f} seconds."

'The forward pass took 0.038 seconds.'

In [9]:
# calculate the mse loss
L = torch.mean((y - O) ** 2)

# time the backward step
start = time.time()
L.backward()
sparseprop_backward_time = time.time() - start
f"The backward pass took {sparseprop_backward_time:.3f} seconds."

'The backward pass took 0.084 seconds.'

The numbers you get depend heavily on your CPU architecture, but you should generally see a non-trivial speedup with *SparseProp* over PyTorch's implementations. Run the following cell to compare the two:

In [10]:
print(f"Forward speedup: {pytorch_forward_time / sparseprop_forward_time:.2f}x")
print(f"Backward speedup: {pytorch_backward_time / sparseprop_backward_time:.2f}x")

Forward speedup: 2.22x
Backward speedup: 2.35x
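
A single `time.time()` measurement, as used in the cells above, can be noisy. If you want steadier numbers, one common approach is to add a few warm-up runs and report the median of several repetitions. The helper below is a minimal sketch of that idea; `benchmark`, `warmup`, and `repeats` are hypothetical names chosen for this example and are not part of SparseProp or PyTorch.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of `fn()` in seconds.

    `benchmark`, `warmup`, and `repeats` are illustrative names,
    not part of SparseProp or PyTorch.
    """
    # warm-up runs absorb one-time costs (memory allocation, cold caches)
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)

    # the median is less sensitive to outliers than the mean
    return statistics.median(timings)


# stand-in workload; in this notebook it could be `lambda: linear(X)`
elapsed = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median time: {elapsed:.6f} seconds")
```

To time a backward pass this way, each repetition would need its own forward pass, since PyTorch frees the autograd graph after `backward()` by default.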


If you have a `Conv2d` module instead of a `Linear` one, you can again use *SparseProp* to gain speedups. The interface is exactly the same, with only one difference. If your module is called `conv`, you can do:

```
from sparseprop.modules import SparseConv2d

sparse_conv = SparseConv2d.from_dense(conv, vectorizing_over_on=False)
```

The only difference from the *Linear* case is an additional boolean argument, `vectorizing_over_on`. As described in [the paper](https://arxiv.org/abs/2302.04852), we have two implementations for the convolution case: one performs the vectorization over the batch size, and the other over the output dimension. This argument specifies which of the two implementations to use. A quick rule of thumb is that if the input width and height are small (e.g., less than 32), then `vectorizing_over_on=False` is faster.

Alternatively, the `sparsify_conv2d_auto` function can automatically determine the faster value of `vectorizing_over_on`.

```
from sparseprop.modules import sparsify_conv2d_auto

sparse_conv = sparsify_conv2d_auto(conv, input_shape, verbose=True)
```

Notice that you will need to feed the `input_shape` to this method, which should look something like (`batch_size`, `input_channels`, `input_height`, `input_width`). This method will create two sparse modules, one with `vectorizing_over_on=False` and the other one with `vectorizing_over_on=True`, run a randomly generated batch through both, and return the faster module based on forward+backward time.

## Full Network

Now assume you have a sparse network instead of just one layer. *SparseProp* offers tools that can seamlessly process your network object and replace its layers with their corresponding sparse counterparts with only one or two lines of extra code. Let's go through an example of this case.

Consider the scenario where you have a sparse model pre-trained on a large dataset (e.g., ImageNet). Let's say you want to fine-tune this sparse model on a smaller dataset (e.g., ImageNette), while keeping the sparsity mask fixed. This process is called *sparse transfer learning*.
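
To make "keeping the sparsity mask fixed" concrete, here is a minimal sketch for a plain dense PyTorch layer. It does not show how SparseProp works internally; it only illustrates the idea by zeroing the gradient of pruned weights with a tensor hook, so that they stay at zero during training. The layer sizes, the mask, and the data are all made up for this example.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(16, 8)

# hypothetical mask: True marks the weights that survive pruning
mask = torch.rand_like(layer.weight) > 0.9
with torch.no_grad():
    layer.weight.mul_(mask)

# zero the gradient of pruned weights on every backward pass
layer.weight.register_hook(lambda grad: grad * mask)

optimizer = torch.optim.SGD(
    layer.parameters(), lr=5e-3, momentum=0.9, weight_decay=1e-4
)

for _ in range(3):
    x = torch.randn(4, 16)
    loss = layer(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# pruned weights are still exactly zero after the updates
print(bool(torch.all(layer.weight[~mask] == 0)))
```

Weight decay and momentum do not disturb the mask here, because both contribute zero for a weight whose value and gradient are zero.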

We have provided the checkpoint for a 95% uniform sparse ResNet18 model pre-trained on ImageNet at `models/resnet18_ac_dc_500_epochs_sp=0.95_uniform.pt`. Let's go ahead and load this model!

We start by fixing the random seed so we get consistent results.

In [10]:
# set the seed everywhere
seed = 11
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

f"All seeds were set to {seed}."

'All seeds were set to 11.'

Now let's create a directory to store the log file and initialize a `Logger`. 

In [11]:
import os
from utils import Logger

outdir = "results-finetune-resnet18-imagenette/"
os.makedirs(outdir, exist_ok=False)
logger = Logger(outdir)

It's time to load the checkpoint:

In [None]:
from torchvision.models import resnet18

model = resnet18() # initialize the network
ckpt = torch.load('models/resnet18_ac_dc_500_epochs_sp=0.95_uniform.pt', map_location='cpu') # read the checkpoint from the file
model.load_state_dict(ckpt) # load the checkpoint into the network

Run the following block of code to print the sparsity level of each layer. Since the network is 95% uniformly sparse, we expect all the layers (except the first and last ones) to have a sparsity of exactly 95%.

You will notice we have used the function `apply_to_all_modules_with_types(model, types, fn)`. This function iterates through the layers of `model`, and if the type of a layer is in the `types` list, it applies the `fn` function to it and returns the results.

In [13]:
from pprint import pformat # just to print a dictionary nicely
from utils import apply_to_all_modules_with_types
from sparseprop.utils import sparsity

logger.log("Sparsity per layer:")
logger.log(pformat(apply_to_all_modules_with_types(
    model,
    [torch.nn.Linear, torch.nn.Conv2d], # we only want the sparsity of linear and conv2d modules
    lambda name, module: f'{sparsity(module):.3f}') # calculate the sparsity for each module
, indent=4))

Sparsity per layer:
OrderedDict([   ('conv1', '0.000'),
                ('layer1.0.conv1', '0.950'),
                ('layer1.0.conv2', '0.950'),
                ('layer1.1.conv1', '0.950'),
                ('layer1.1.conv2', '0.950'),
                ('layer2.0.conv1', '0.950'),
                ('layer2.0.conv2', '0.950'),
                ('layer2.0.downsample.0', '0.950'),
                ('layer2.1.conv1', '0.950'),
                ('layer2.1.conv2', '0.950'),
                ('layer3.0.conv1', '0.950'),
                ('layer3.0.conv2', '0.950'),
                ('layer3.0.downsample.0', '0.950'),
                ('layer3.1.conv1', '0.950'),
                ('layer3.1.conv2', '0.950'),
                ('layer4.0.conv1', '0.950'),
                ('layer4.0.conv2', '0.950'),
                ('layer4.0.downsample.0', '0.950'),
                ('layer4.1.conv1', '0.950'),
                ('layer4.1.conv2', '0.950'),
                ('fc', '0.000')])
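
The actual `apply_to_all_modules_with_types` comes from the `utils` module imported above. For readers who want to see the idea in isolation, the sketch below is a guess at equivalent behavior, not the original implementation: `apply_to_modules_with_types` and `weight_sparsity` are hypothetical names, and the toy model exists only for this demonstration.

```python
from collections import OrderedDict

import torch


def apply_to_modules_with_types(model, types, fn):
    """Hypothetical stand-in for the notebook's `apply_to_all_modules_with_types`."""
    results = OrderedDict()
    for name, module in model.named_modules():
        # isinstance also matches subclasses of the listed types
        if isinstance(module, tuple(types)):
            results[name] = fn(name, module)
    return results


def weight_sparsity(module):
    """Fraction of exactly-zero entries in `module.weight` (illustrative helper)."""
    return (module.weight == 0).float().mean().item()


toy = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 4),
)
with torch.no_grad():
    toy[3].weight.zero_()  # make the linear layer fully sparse for the demo

report = apply_to_modules_with_types(
    toy,
    [torch.nn.Linear, torch.nn.Conv2d],
    lambda name, module: f"{weight_sparsity(module):.3f}",
)
print(report)
```

Only the `Conv2d` and `Linear` entries appear in the report; the `ReLU` and `Flatten` modules are skipped because their types are not in the list.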


This model is pre-trained on the ImageNet dataset, which consists of 1000 classes. However, for fine-tuning, we will be using the ImageNette dataset, which only has 10 classes. As a result, we will need to replace the classifier layer in order to adapt the model to this specific task.

In [14]:
model.fc = torch.nn.Linear(
    model.fc.in_features, # number of input features
    10, # number of classes in imagenette
    bias=model.fc.bias is not None # keep the bias if exists
)

Now let's get our dataset and dataloaders ready. We directly load the ImageNette dataset from the *SparseML* library. You can run the following command to install it in your environment.

In [None]:
%pip install sparseml

Now that the library is installed, we can load the dataset.

In [16]:
from torch.utils.data import DataLoader
from sparseml.pytorch.datasets import ImagenetteDataset, ImagenetteSize

# load the datasets
train_dataset, test_dataset = [ImagenetteDataset(
    root='/dev/shm/', # store the dataset in /dev/shm/ to map the dataset to memory and avoid data loading overheads
    train=train,
    dataset_size=ImagenetteSize.s320,
    image_size=224
) for train in [True, False]]

# prepare the dataloaders
train_loader, test_loader = [DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,
    shuffle=True
) for dataset in [train_dataset, test_dataset]]

logger.log(f'Total number of training batches: {len(train_loader)}')



already downloaded imagenette ImagenetteSize.s320
already downloaded imagenette ImagenetteSize.s320
Total number of training batches: 51


Now here's where *SparseProp* comes into play. As explained in the paper, we replace each Linear or Conv2d layer in a network with a sparse one, if the following conditions are met:

- It's at least 80% sparse.
- The sparse module is faster than the original dense one (in terms of forward+backward time).
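
The two conditions above can be sketched as a small decision function. This is only an illustration of the rule as described here, not the code inside SparseProp; `choose_module`, `min_sparsity`, and the candidate labels are hypothetical names. The timings are rounded from the verbose log for `layer1.0.conv1` shown later in this notebook.

```python
def choose_module(sparsity, timings, min_sparsity=0.8):
    """Return the name of the candidate to keep.

    `timings` maps a candidate name to (forward_time, backward_time)
    and must contain a "dense" entry. All names here are illustrative.
    """
    # condition 1: layers below the sparsity threshold stay dense
    if sparsity < min_sparsity:
        return "dense"

    # condition 2: keep whichever candidate has the lowest forward+backward time
    total = {name: fwd + bwd for name, (fwd, bwd) in timings.items()}
    return min(total, key=total.get)


# timings rounded from the verbose log of layer1.0.conv1
timings = {
    "dense": (0.6538, 1.5997),
    "sparse, voo=False": (1.3577, 1.9766),
    "sparse, voo=True": (0.2425, 0.5415),
}
print(choose_module(0.95, timings))  # prints "sparse, voo=True"
print(choose_module(0.00, timings))  # prints "dense"
```

Because `"dense"` is listed first, a tie in total time keeps the dense module, which matches the requirement that the sparse module be strictly faster.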

This behavior is implemented in the `swap_modules_with_sparse` method in `sparseprop.utils`. Let's do this!

In [17]:
from sparseprop.utils import swap_modules_with_sparse

input_shape = next(iter(train_loader))[0].shape # we need the shape of our data

# here's where the magic happens
model = swap_modules_with_sparse(model, input_shape, inplace=True, verbose=True)

------------------------------
keeping the module conv1 dense...
------------------------------
module Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) took 0.6537895202636719 fwd and 1.5996980667114258 bwd
module SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=False) took 1.3576796054840088 fwd and 1.9765965938568115 bwd
module SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True) took 0.24248600006103516 fwd and 0.5414636135101318 bwd
going with SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True) with full time of 0.783949613571167
module layer1.0.conv1 replaced with SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True)
------------------------------
module Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) took 0.6515367031097412 fwd and 1.5962450504302979 bwd
module SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=False) took 1.3252687454223633 fwd and 1.96

Let's see what our model looks like:

In [18]:
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): SparseConv2d([64, 64, 3, 3], sp=0.95, nnz=1843, s=1, p=1, voo=True)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): SparseConv2d([64,

Now that we have our model and dataloaders ready, let's create our loss criterion and optimizer objects.

In [19]:
# loss and optim
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9, weight_decay=1e-4)

Finally, let's create our `Finetuner` object to perform one epoch of training. This object handles training and validation, and also records detailed timings of the network.

In [20]:
from utils import Finetuner

# initialize the finetuner
finetuner = Finetuner(
    model,
    optimizer,
    schedular=None, # we could pass an lr scheduler here; this example does not need one.
    loss_fn=loss_fn,
    log_freq=1, # how often to log (in batches). 1 means that it will log after processing every batch.
    save_freq=1, # how often to save the checkpoint (in epochs). 1 means that it will save a checkpoint after each epoch.
    logger=logger
).finetune(train_loader, test_loader, epochs=1)

[Train] Epoch 1, Step 1: loss=2.7053, acc=0.0820
Timings: avg_end_to_end_forward=11.1022, avg_end_to_end_backward=11.5243, avg_end_to_end_minibatch=22.6428, avg_module_forward_sum=6.9712, avg_module_backward_sum=7.5827
[Train] Epoch 1, Step 2: loss=2.6545, acc=0.0898
Timings: avg_end_to_end_forward=11.0292, avg_end_to_end_backward=11.3527, avg_end_to_end_minibatch=22.3985, avg_module_forward_sum=6.9279, avg_module_backward_sum=7.5711
[Train] Epoch 1, Step 3: loss=2.5691, acc=0.0924
Timings: avg_end_to_end_forward=11.0056, avg_end_to_end_backward=11.2854, avg_end_to_end_minibatch=22.3066, avg_module_forward_sum=6.9138, avg_module_backward_sum=7.5587
[Train] Epoch 1, Step 4: loss=2.4842, acc=0.1113
Timings: avg_end_to_end_forward=10.9948, avg_end_to_end_backward=11.2466, avg_end_to_end_minibatch=22.2568, avg_module_forward_sum=6.9054, avg_module_backward_sum=7.5530
[Train] Epoch 1, Step 5: loss=2.3990, acc=0.1437
Timings: avg_end_to_end_forward=10.9984, avg_end_to_end_backward=11.2293, a

Notice that in addition to the loss and accuracy metrics, this script also reports the time spent in each part of the process. The timings include:

- `avg_end_to_end_forward`: the average time spent in the forward pass, i.e., the `model(inputs)` line.
- `avg_end_to_end_backward`: the average time spent in the backward pass, i.e., the `loss.backward()` line.
- `avg_end_to_end_minibatch`: the average time spent processing a minibatch. This includes forward pass, backward pass, loss calculation, optimization step, etc. Note that loading the data into memory is not included.
- `avg_module_forward_sum`: the average time spent in the forward function of the modules `torch.nn.Linear`, `torch.nn.Conv2d`, `SparseLinear`, and `SparseConv2d`.
- `avg_module_backward_sum`: the average time spent in the backward function of the modules `torch.nn.Linear`, `torch.nn.Conv2d`, `SparseLinear`, and `SparseConv2d`.
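
To relate these numbers to each other, the sketch below parses one `Timings` line from the log above and derives two quantities: the share of the forward and backward passes spent inside the linear and convolution modules, and the per-minibatch time not covered by the forward and backward passes. The parsing code is an illustration written for this guide, not part of the `Finetuner`.

```python
import re

# first "Timings" line copied from the training log above
line = (
    "Timings: avg_end_to_end_forward=11.1022, avg_end_to_end_backward=11.5243, "
    "avg_end_to_end_minibatch=22.6428, avg_module_forward_sum=6.9712, "
    "avg_module_backward_sum=7.5827"
)

# parse every `name=value` pair into a dictionary of floats
timings = {name: float(value) for name, value in re.findall(r"(\w+)=([\d.]+)", line)}

# share of each pass spent inside the Linear/Conv2d-type modules
forward_share = timings["avg_module_forward_sum"] / timings["avg_end_to_end_forward"]
backward_share = timings["avg_module_backward_sum"] / timings["avg_end_to_end_backward"]

# time per minibatch not accounted for by the forward and backward passes
other = (
    timings["avg_end_to_end_minibatch"]
    - timings["avg_end_to_end_forward"]
    - timings["avg_end_to_end_backward"]
)

print(f"module share of forward pass:  {forward_share:.1%}")
print(f"module share of backward pass: {backward_share:.1%}")
print(f"time outside forward/backward: {other:.4f}")
```

The remaining share of each pass goes to modules that are not timed individually, such as the batch normalization and activation layers.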


## Conclusion

In this notebook, we provided step-by-step examples of how to use *SparseProp*'s sparse implementations to speed up a single layer as well as an entire network. For the latter, we took a 95% uniform sparse ResNet18 model (pre-trained on ImageNet) and fine-tuned it on the ImageNette dataset.