This notebook compares sparse and dense matrix operations and illustrates why we opted for pytorch_sparse.
The key operation here is the product of a masked (sparse) matrix with a dense matrix (here of shape n x 200). We build a large weight matrix to show that this product works best with torch_sparse.spmm(). Before every operation, the memory is cleaned up with gc.collect() and torch.cuda.empty_cache(), and the temporary variable 'temp' is reassigned, just to be sure.
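
To make the operation concrete, here is a minimal pure-Python sketch of what a sparse-times-dense product computes when the sparse matrix is given in COO form (row indices, column indices, values). The function name coo_spmm and the toy data are illustrative assumptions; this is not how torch_sparse implements spmm().

```python
def coo_spmm(rows, cols, values, n_rows, dense):
    """Multiply a COO sparse matrix with n_rows rows by a dense matrix (list of rows)."""
    n_out_cols = len(dense[0])
    out = [[0.0] * n_out_cols for _ in range(n_rows)]
    # Each non-zero (r, c, v) adds v * dense[c] to output row r.
    for r, c, v in zip(rows, cols, values):
        for j in range(n_out_cols):
            out[r][j] += v * dense[c][j]
    return out

# 2 x 3 sparse matrix with entries (0, 1) = 2.0 and (1, 2) = -1.0
rows, cols, values = [0, 1], [1, 2], [2.0, -1.0]
Y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # dense 3 x 2 matrix
result = coo_spmm(rows, cols, values, 2, Y)
print(result)  # [[6.0, 8.0], [-5.0, -6.0]]
```

Only the non-zero entries are visited, which is why the cost scales with the number of non-zeros rather than with n x n.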

In [1]:
%%capture
pip install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html

In [2]:
import torch
import numpy as np
import torch_sparse
from torch_sparse import spmm
import gc # garbage collector to "flush" the memory

In [3]:
if not torch.cuda.is_available():
    raise SystemExit("you need a GPU with CUDA to run this notebook!")

In [4]:
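def compute_mask(w, topk_percentage):
    """
    Get the indices of the parameters to keep.

    Keeps every entry of the NumPy array w that is at or above the
    topk_percentage quantile of w, i.e. roughly the top
    (1 - topk_percentage) fraction of the values.
    """
    threshold = np.quantile(w.reshape(-1), topk_percentage)
    return np.where(w >= threshold)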
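
A small usage sketch of compute_mask on a toy array; the array and the value 0.8 are chosen for illustration only. It shows that the function keeps the entries at or above the given quantile, so 0.8 keeps the top 20% of the values here.

```python
import numpy as np

def compute_mask(w, topk_percentage):
    # Same logic as the notebook cell: threshold at the given quantile.
    threshold = np.quantile(w.reshape(-1), topk_percentage)
    return np.where(w >= threshold)

w = np.arange(10, dtype=float).reshape(2, 5)  # values 0.0 .. 9.0
keep_rows, keep_cols = compute_mask(w, 0.8)

# The 0.8 quantile of 0..9 is 7.2, so only the entries 8.0 and 9.0 survive.
print(keep_rows.tolist(), keep_cols.tolist())  # [1, 1] [3, 4]
```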

In case the graphics card has not been used at all since powering up, we create a tiny tensor and move it to the GPU so that torch.cuda.memory_snapshot() returns at least one memory segment.

In [30]:
if len(torch.cuda.memory_snapshot()) == 0:
    torch.tensor(8).cuda()
assert len(torch.cuda.memory_snapshot()) > 0

[{'device': 0,
  'address': 21533556736,
  'total_size': 2097152,
  'allocated_size': 512,
  'active_size': 512,
  'segment_type': 'small',
  'blocks': [{'size': 512, 'state': 'active_allocated'},
   {'size': 2096640, 'state': 'inactive'}]}]

In [32]:
available_memory = torch.cuda.memory_snapshot()[0]['total_size']  # size in bytes of the first cached segment
n = int(available_memory / 4)  # dimension of square matrix
sparsity_1 = 0.99
# sparsity_2=0.7
nnz = int((1 - sparsity_1) * n)  # number of non-sparsified values (1% of n, not of n*n)
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
values = torch.randn(nnz)
X_sparse = torch.sparse_coo_tensor([rows, cols], values, size=(n, n)).cuda().requires_grad_(True)
Y_dense = torch.randn((n, 200)).cuda().requires_grad_(True)
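
As a back-of-the-envelope check of the sizes involved, the sketch below computes the float32 footprint of the tensors defined in this cell, assuming the first cached segment is 2097152 bytes as in the snapshot shown earlier. The variable names and the COO estimate (two int64 indices plus one float32 value per non-zero) are illustrative assumptions, not measurements.

```python
BYTES_FLOAT32 = 4
BYTES_INT64 = 8

segment_bytes = 2097152        # 'total_size' from the snapshot shown earlier
n = segment_bytes // 4         # dimension of the square matrix
nnz = int((1 - 0.99) * n)      # number of non-zero entries

dense_x_bytes = n * n * BYTES_FLOAT32                    # X stored densely
dense_y_bytes = n * 200 * BYTES_FLOAT32                  # Y_dense, also the size of the product
coo_x_bytes = nnz * (2 * BYTES_INT64 + BYTES_FLOAT32)    # rough COO estimate

print(f"n = {n}, nnz = {nnz}")
print(f"dense X: {dense_x_bytes / 2**30} GiB")
print(f"Y_dense: {dense_y_bytes / 10**9} GB")
print(f"COO X:   {coo_x_bytes / 10**6} MB")
```

Under these assumptions the dense version of X needs 1024 GiB, while Y_dense and the result of the product each take about 0.42 GB.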

In [35]:
gc.collect()
torch.cuda.empty_cache()

temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = torch.sparse.mm(X_sparse, Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit torch.sparse.mm(X_sparse, Y_dense)

del temp
gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.419535872
GPU memory cached: 0.421527552
Difference in allocated GPU memory: 0.4194304
Difference in cached GPU memory: 0.8598323199999999
101 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [36]:
gc.collect()
torch.cuda.empty_cache()
temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = spmm(torch.tensor([rows, cols], dtype=torch.int64).cuda(), values.cuda(), n, n, Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit spmm(torch.tensor([rows, cols], dtype=torch.int64).cuda(), values.cuda(), n, n, Y_dense)

del temp
gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.419535872
GPU memory cached: 0.421527552
Difference in allocated GPU memory: 0.41953536
Difference in cached GPU memory: 0.44040192
4.39 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [37]:
gc.collect()
torch.cuda.empty_cache()

temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = torch.matmul(X_sparse.to_dense().cuda(), Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit torch.matmul(X_sparse.to_dense().cuda(), Y_dense)

del temp
gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.419535872
GPU memory cached: 0.421527552


RuntimeError: CUDA out of memory. Tried to allocate 1024.00 GiB (GPU 0; 6.00 GiB total capacity; 400.10 MiB already allocated; 4.29 GiB free; 402.00 MiB reserved in total by PyTorch)

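Conclusion: in this benchmark, torch.sparse.mm() is roughly 23 times slower than torch_sparse.spmm() (101 ms vs. 4.39 ms per loop), even though the timed spmm() call also copies the indices and values to the GPU. It also grows the cached GPU memory about twice as much (0.86 GB vs. 0.44 GB), while the allocated memory grows by about 0.42 GB in both cases. torch.matmul() cannot carry out the operation at all: converting X_sparse to a dense matrix would require 1024 GiB, so it runs out of memory.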
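
The timings in this notebook come from %timeit. CUDA kernels are launched asynchronously, so a common precaution when timing GPU code is to synchronize before reading the clock. Below is a minimal sketch, assuming a hypothetical helper time_call that is not part of the notebook; on a GPU one would pass torch.cuda.synchronize as the sync argument.

```python
import time

def time_call(fn, sync=None, repeats=10):
    """Return the mean wall-clock seconds per call of fn()."""
    if sync is None:
        sync = lambda: None   # no-op for CPU-only code
    fn()                      # warm-up call, not timed
    sync()                    # wait for pending work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()                    # wait for all timed work to finish
    return (time.perf_counter() - start) / repeats

# CPU-only demonstration; on a GPU this could look like
#   time_call(lambda: torch.sparse.mm(X_sparse, Y_dense), sync=torch.cuda.synchronize)
elapsed = time_call(lambda: sum(range(1000)))
print(f"{elapsed * 1e6:.1f} µs per call")
```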