# SparseProp Usage Guide

This notebook is a guide to using SparseProp effectively. It walks through how to apply SparseProp to individual layers, and to an entire network, to accelerate backpropagation.

As an introduction, SparseProp provides a low-level CPU implementation of backpropagation for layers whose weights have unstructured sparsity. More specifically, if we have a sparse fully connected or convolution layer, SparseProp can speed up backpropagation through it on CPU. We further integrate SparseProp with the PyTorch framework, providing the *SparseLinear* and *SparseConv2d* modules as drop-in replacements for PyTorch's *Linear* and *Conv2d* modules, respectively. Further details of our algorithms can be found in [our paper](https://arxiv.org/abs/2302.04852).

Now, let's get started! We first set the random seeds to get consistent results.

In [1]:
import random
import numpy as np
import torch

seed = 10
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

print(f"All seeds were set to {seed}.")

All seeds were set to 10.


## Individual Layer

Let's say we have a *Linear* module that is 95% sparse. For the sake of argument, let's actually create such a module. We assume that the input and output dimensions are both 512, but any other dimensions work just as well.

In [2]:
from torch.nn import Linear

linear = Linear(512, 512) # 512 input features, and 512 output features

# prune the module randomly to 95% unstructured sparsity
with torch.no_grad():

    # generate a random mask with roughly 95% sparsity
    mask = torch.rand_like(linear.weight) > 0.95

    # apply the mask to the module
    linear.weight.mul_(mask.float())

print(f"Our module's sparsity is now {(linear.weight == 0).float().mean().item():.2f}.")

Our module's sparsity is now 0.95.


So now we actually have a 95% sparse module, called `linear`. Let's see how long forward and backward steps take on this module. Assuming the batch size is 256, we generate a synthetic batch of data.

In [3]:
X = torch.randn(256, 512) # batch_size x input_dimension

# the following two lines tell PyTorch to store the gradients for the input tensor
X.requires_grad_()
X.retain_grad()

y = torch.randn(256, 512) # batch_size x output_dimension

Now we measure the time required for the forward and backward steps of the `linear` module.

In [4]:
import time

# time the forward step
start = time.time()
O = linear(X)
pytorch_forward_time = time.time() - start
f"The forward pass took {pytorch_forward_time:.3f} seconds."

'The forward pass took 0.012 seconds.'

In [5]:
# calculate the mse loss
L = torch.mean((y - O) ** 2)

# time the backward step
start = time.time()
L.backward()
pytorch_backward_time = time.time() - start
f"The backward pass took {pytorch_backward_time:.3f} seconds."

'The backward pass took 0.027 seconds.'

So the forward and backward passes took 12 and 27 milliseconds, respectively. But notice we haven't used *SparseProp*'s implementations yet. Let's see how much speedup we can get with SparseProp.

To do so, we only need one line of code:

In [6]:
from sparseprop.modules import SparseLinear

# this line will convert your pytorch module to a sparseprop module
sparse_linear = SparseLinear.from_dense(linear)

sparse_linear

SparseLinear([512, 512], sp=0.95, nnz=13190)
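
The printed summary can be sanity-checked with a little arithmetic. The sketch below assumes that `sp` stands for the sparsity ratio and `nnz` for the number of nonzero weights; that reading is inferred from the output above, not taken from the library's documentation.

```python
# Values copied from the repr above: SparseLinear([512, 512], sp=0.95, nnz=13190)
out_features, in_features = 512, 512
nnz = 13190  # assumed meaning: number of nonzero weights

total_weights = out_features * in_features

# Sparsity is the fraction of weights that are exactly zero.
sparsity = 1 - nnz / total_weights

print(f"{nnz} of {total_weights} weights are nonzero (sparsity = {sparsity:.4f})")
```

The result rounds to 0.95, which matches both the `sp` field and the sparsity we measured right after pruning.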

We now have a *SparseProp* module, which will benefit from our high-performance sparse implementations. Let's again compute the forward and backward times:

In [7]:
# time the forward step
start = time.time()
O = sparse_linear(X)
sparseprop_forward_time = time.time() - start
f"The forward pass took {sparseprop_forward_time:.3f} seconds."

'The forward pass took 0.003 seconds.'

In [8]:
# calculate the mse loss
L = torch.mean((y - O) ** 2)

# time the backward step
start = time.time()
L.backward()
sparseprop_backward_time = time.time() - start
f"The backward pass took {sparseprop_backward_time:.3f} seconds."

'The backward pass took 0.004 seconds.'

You should see a significant speedup with *SparseProp* compared to PyTorch's implementations. Run the following cell to compare the two:

In [11]:
print(f"Forward speedup: {pytorch_forward_time / sparseprop_forward_time:.2f}x")
print(f"Backward speedup: {pytorch_backward_time / sparseprop_backward_time:.2f}x")

Forward speedup: 4.67x
Backward speedup: 6.26x
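
These speedups come from single measurements, so they can vary noticeably from run to run. A minimal sketch of a steadier measurement is given here, assuming a few warm-up calls and a median over several repetitions are acceptable; `time_call` is a hypothetical helper, not part of SparseProp or PyTorch, and the workload is a stand-in.

```python
import statistics
import time


def time_call(fn, warmup=3, repeats=10):
    """Return the median wall-clock time (in seconds) of calling fn()."""
    # Warm-up calls are executed but not recorded.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional slow outliers than the mean.
    return statistics.median(samples)


# Stand-in workload so the sketch runs without any model.
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"Median time: {elapsed * 1e3:.3f} ms")
```

In this notebook, `fn` could be something like `lambda: sparse_linear(X)`. To include the backward pass, the forward pass has to be re-run in every repetition, because PyTorch frees the graph after `backward()` by default, so it is simplest to time forward and backward together.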


If you have a `Conv2d` module instead of a `Linear` one, you can again use *SparseProp* to gain speedups. The interface is the same, with one difference. If your module is called `conv`, you can do:

```
from sparseprop.modules import SparseConv2d

sparse_conv = SparseConv2d.from_dense(conv, vectorizing_over_on=False)
```

The only difference from the *Linear* case is the additional boolean argument `vectorizing_over_on`. As described in the paper, we have two implementations for the convolution case, one performing the vectorization over the batch size, and the other over the output dimension. This argument specifies which of the two implementations to use. A quick rule of thumb is that if the input width and height are small (e.g., less than 32) then `vectorizing_over_on=False` is faster.
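
The rule of thumb can be written down as a small helper. This is only a sketch: `guess_vectorizing_over_on` is a hypothetical function, the threshold of 32 is the example value from the text, and returning `True` for larger inputs is an assumption, since the rule above only covers the small-input case.

```python
def guess_vectorizing_over_on(input_height, input_width, small_threshold=32):
    """Heuristic starting value for `vectorizing_over_on` (hypothetical helper)."""
    # Small inputs: the rule of thumb says vectorizing_over_on=False is faster.
    is_small = input_height < small_threshold and input_width < small_threshold
    # Assumption: for larger inputs, try the other implementation first.
    return not is_small


print(guess_vectorizing_over_on(16, 16))
print(guess_vectorizing_over_on(224, 224))
```

The returned value could then be passed as `vectorizing_over_on` to `SparseConv2d.from_dense`. Since this is only a heuristic, timing both settings on your own input shape remains the more reliable check.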

Alternatively, the `sparsify_conv2d_auto` function can automatically determine the correct value of `vectorizing_over_on`.

```
from sparseprop.modules import sparsify_conv2d_auto

sparse_conv = sparsify_conv2d_auto(conv, input_shape, verbose=True)
```

Note that you need to pass `input_shape` to this function; it should look something like (`batch_size`, `input_channels`, `input_height`, `input_width`). The function creates two sparse modules, one with `vectorizing_over_on=False` and the other with `vectorizing_over_on=True`, runs a randomly generated batch through both, and returns the faster module based on forward+backward time.
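
The selection strategy described above (build both candidates, time each one, keep the faster) can be sketched generically. This is not the library's implementation: `pick_faster` is a hypothetical helper, and the candidates are stand-in workloads instead of real sparse modules.

```python
import time


def pick_faster(candidates, run_once):
    """Time run_once(candidate) for each candidate; return the fastest name and all timings."""
    timings = {}
    for name, candidate in candidates.items():
        start = time.perf_counter()
        run_once(candidate)  # e.g. one forward+backward pass on a random batch
        timings[name] = time.perf_counter() - start

    best = min(timings, key=timings.get)
    return best, timings


# Stand-in candidates: each value is the number of seconds the fake workload sleeps.
candidates = {"slow": 0.05, "fast": 0.001}
best, timings = pick_faster(candidates, run_once=time.sleep)
print(f"Selected: {best}")
```

With real modules, `candidates` would hold the two `SparseConv2d` variants and `run_once` would run a forward and backward pass on a batch of shape `input_shape`.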