This notebook compares sparse and dense matrix operations and illustrates why we opted for pytorch_sparse.
The key operation here is the product of a masked (sparse) matrix with a dense matrix. We build a large weight matrix to show that this product is fastest, and reserves the least GPU memory, with torch_sparse.spmm(). Before every operation, the memory is cleaned up and the temporary variable 'temp' is reassigned, just to be sure.
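
To make the operation concrete, here is a minimal pure-Python sketch of a product between a sparse matrix in COO form (row indices, column indices, values) and a dense matrix. The helper `coo_spmm` and the `toy_*` names are hypothetical and only illustrate the arithmetic. This is not how torch_sparse.spmm() is implemented.

```python
def coo_spmm(rows, cols, values, num_rows, dense):
    """Multiply a COO sparse matrix (num_rows rows) by a dense matrix given as a list of rows."""
    num_out_cols = len(dense[0])
    out = [[0.0] * num_out_cols for _ in range(num_rows)]
    # Each stored entry (r, c, v) adds v * dense[c] to output row r;
    # entries that are not stored are zero and cost nothing.
    for r, c, v in zip(rows, cols, values):
        for j in range(num_out_cols):
            out[r][j] += v * dense[c][j]
    return out


# Toy 3x3 sparse matrix with two stored entries: (0, 1) = 2.0 and (2, 0) = -1.0
toy_rows = [0, 2]
toy_cols = [1, 0]
toy_values = [2.0, -1.0]
toy_dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

toy_result = coo_spmm(toy_rows, toy_cols, toy_values, 3, toy_dense)
print(toy_result)  # [[6.0, 8.0], [0.0, 0.0], [-1.0, -2.0]]
```

The work scales with the number of stored entries times the number of dense columns, not with the full size of the sparse matrix, which is why a sparse kernel can handle shapes that are impossible to densify.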

In [45]:
%%capture
pip install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html

In [46]:
import torch
import numpy as np
import torch_sparse
from torch_sparse import spmm
import gc # garbage collector to "flush" the memory

In [47]:
if not torch.cuda.is_available():
    raise SystemExit("you need a GPU with CUDA to run this notebook!")

In [48]:
def compute_mask(w, topk_percentage):
    """
    get the indices of the parameters to keep.
    """
    threshold = np.quantile(w.reshape(-1), topk_percentage)
    return np.where(w >= threshold)

In [49]:
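# NOTE: memory_snapshot()[0]['total_size'] is the size in bytes of the first segment
# held by PyTorch's caching allocator, not the total free GPU memory
available_memory = torch.cuda.memory_snapshot()[0]['total_size']
n = int(available_memory / 4) # dimension of square matrix
sparsity_1 = 0.99
# sparsity_2=0.7
nnz = int((1-sparsity_1)*n) # number of non-zero values (1% of n, not of n*n)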
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
values = torch.randn(nnz)
X_sparse = torch.sparse_coo_tensor([rows,cols], values, size=(n,n)).cuda().requires_grad_(True)
Y_dense = torch.randn((n,200)).cuda().requires_grad_(True)

In [50]:
gc.collect()
torch.cuda.empty_cache()

temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = torch.sparse.mm(X_sparse, Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit torch.sparse.mm(X_sparse, Y_dense)
del temp
gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.41953536
GPU memory cached: 0.421527552
Difference in allocated GPU memory: 0.4194304
Difference in cached GPU memory: 0.8598323199999999
99.2 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [51]:
gc.collect()
torch.cuda.empty_cache()

temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = spmm(torch.tensor([rows, cols], dtype=torch.int64).cuda(), values.cuda(), n, n, Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit spmm(torch.tensor([rows, cols], dtype=torch.int64).cuda(), values.cuda(), n, n, Y_dense)
del temp

gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.41953536
GPU memory cached: 0.421527552
Difference in allocated GPU memory: 0.41953536
Difference in cached GPU memory: 0.44040192
4.31 ms ± 8.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [52]:
gc.collect()
torch.cuda.empty_cache()

temp = torch.Tensor(8)
initial_allocated_memory = torch.cuda.memory_allocated(device=None)/10**9
init_cached_memory = torch.cuda.memory_reserved(device=None)/10**9
print(f'GPU memory allocated: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU memory cached: {torch.cuda.memory_reserved(device=None)/10**9}')
temp = torch.matmul(X_sparse.to_dense().cuda(), Y_dense)
print(f'Difference in allocated GPU memory: {torch.cuda.memory_allocated(device=None)/10**9 - initial_allocated_memory}')
print(f'Difference in cached GPU memory: {torch.cuda.memory_reserved(device=None)/10**9 - init_cached_memory}')

%timeit torch.matmul(X_sparse.to_dense().cuda(), Y_dense)

del temp
gc.collect()
torch.cuda.empty_cache()

GPU memory allocated: 0.41953536
GPU memory cached: 0.421527552


RuntimeError: CUDA out of memory. Tried to allocate 1024.00 GiB (GPU 0; 6.00 GiB total capacity; 400.10 MiB already allocated; 4.29 GiB free; 402.00 MiB reserved in total by PyTorch)
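
The size of the failed request can be checked with a little arithmetic. The sketch below assumes n = 2**19, which is not printed anywhere in the notebook but is the value at which a dense float32 n x n matrix takes exactly the 1024 GiB shown in the error. It also assumes the usual COO layout of two int64 indices plus one float32 value per stored entry. The `*_assumed` names are hypothetical.

```python
n_assumed = 2**19          # assumption: inferred from the 1024 GiB request, not printed above
bytes_per_float32 = 4

# Dense copy of the n x n matrix, as requested by X_sparse.to_dense()
dense_x_bytes = n_assumed * n_assumed * bytes_per_float32
print(dense_x_bytes / 2**30)   # 1024.0 (GiB)

# Dense n x 200 float32 matrix, the shape of Y_dense and of the product
dense_y_bytes = n_assumed * 200 * bytes_per_float32
print(dense_y_bytes / 10**9)   # 0.4194304 (GB)

# COO storage: two int64 indices and one float32 value per stored entry
nnz_assumed = int((1 - 0.99) * n_assumed)
coo_bytes = nnz_assumed * (2 * 8 + 4)
print(nnz_assumed, coo_bytes)
```

Under these assumptions the n x 200 figure matches the roughly 0.42 GB growth in allocated memory reported for both sparse products, while the sparse operand itself needs only about a hundred kilobytes.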

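Conclusion: torch.sparse.mm() is about 23 times slower than torch_sparse.spmm() (99.2 ms vs. 4.31 ms per loop) and reserves roughly twice as much additional GPU memory (0.86 GB vs. 0.44 GB cached), although the allocated memory grows by about 0.42 GB in both cases. torch.matmul() cannot carry out the operation at all: converting the sparse matrix to dense requests 1024 GiB on a GPU with 6 GiB, so it runs out of memory.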