# SD212: Graph mining

# Lab 1: Sparse matrices

The objective of this lab is to understand the structure and main properties of [sparse matrices](https://en.wikipedia.org/wiki/Sparse_matrix).

You will learn to code your own sparse matrices to understand their underlying structure. <br>Note that in the other labs, we will only use sparse matrices of [SciPy](https://www.scipy.org/scipylib/index.html). 

## Import

In [1]:
import numpy as np

In [2]:
from scipy import sparse

## Coordinate format

In [3]:
# random matrix (dense format)
A_dense = np.random.randint(2, size = (5,10))

In [4]:
A_dense

array([[1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 1, 1, 0],
       [0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 1, 1]])

In [5]:
A_coo = sparse.coo_matrix(A_dense)

In [6]:
A_coo

<5x10 sparse matrix of type '<class 'numpy.int32'>'
	with 21 stored elements in COOrdinate format>

In [7]:
A_coo.shape

(5, 10)

In [8]:
A_coo.nnz

21

In [9]:
print(A_coo.row)
print(A_coo.col)
print(A_coo.data)

[0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4]
[0 5 6 8 2 5 6 7 8 2 3 4 6 7 0 7 8 3 5 8 9]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [10]:
# there might be zeros in data!
row = A_coo.row
col = A_coo.col
data = np.random.randint(5, size=len(A_coo.data))
shape = A_coo.shape

In [11]:
B_coo = sparse.coo_matrix((data, (row, col)), shape)

In [12]:
B_coo

<5x10 sparse matrix of type '<class 'numpy.int32'>'
	with 21 stored elements in COOrdinate format>

In [13]:
B_coo.toarray()

array([[4, 0, 0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 2, 3, 2, 0, 0],
       [0, 0, 0, 1, 2, 0, 1, 0, 0, 0],
       [2, 0, 0, 0, 0, 0, 0, 4, 4, 0],
       [0, 0, 0, 3, 0, 3, 0, 0, 3, 4]])

In [14]:
B_coo.nnz

21

In [15]:
np.sum(B_coo.data > 0)

17

In [16]:
B_coo.eliminate_zeros()

In [17]:
B_coo

<5x10 sparse matrix of type '<class 'numpy.int32'>'
	with 17 stored elements in COOrdinate format>

## ~To do~

Complete the function below that converts a dense matrix into a sparse matrix in COO format. 

Needless to say...
* don't use `scipy`
* don't use any loop

**Hint:** Use `np.nonzero`

In [18]:
nnz = A_dense.nonzero()
A_dense[nnz]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [19]:
class SparseCOO():
    def __init__(self, data: np.ndarray, row: np.ndarray, col: np.ndarray, shape: tuple):
        self.data = data
        self.row = row
        self.col = col
        self.shape = shape

In [20]:
def dense_to_coo(A):
    '''Convert dense matrix to sparse in COO format.
    
    Parameters
    ----------
    A : np.ndarray
        Dense matrix
        
    Returns
    -------
    A_coo : SparseCOO
        Sparse matrix in COO format.
    '''
    # to be modified
    row, col = A.nonzero()
    data = A[row, col] # or data = A[A!=0]
    shape = A.shape
    return SparseCOO(data, row, col, shape)

In [21]:
def test_equality(A, B, attributes):
    return [np.all(getattr(A, a) == getattr(B, a)) for a in attributes]

In [22]:
# test
A_dense = np.random.randint(2, size = (5,10))
A_coo = sparse.coo_matrix(A_dense)
A_coo_ = dense_to_coo(A_dense)

In [23]:
test_equality(A_coo, A_coo_, ["data", "row", "col"])

[True, True, True]

## CSR format

The CSR (Compressed Sparse Row) format is the most efficient for arithmetic operations (see below).

In [24]:
A_dense

array([[1, 1, 1, 1, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 1, 1, 1],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]])

In [25]:
A_csr = sparse.csr_matrix(A_dense)

In [26]:
A_csr

<5x10 sparse matrix of type '<class 'numpy.intc'>'
	with 34 stored elements in Compressed Sparse Row format>

In [27]:
A_csr.shape

(5, 10)

In [28]:
A_csr.nnz

34

In [29]:
print(A_csr.indices)
print(A_csr.indptr)
print(A_csr.data)

[0 1 2 3 5 8 9 0 1 2 3 4 6 7 8 0 1 2 3 5 6 7 9 1 5 7 8 9 1 4 5 6 8 9]
[ 0  7 15 23 28 34]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [30]:
A_csr[3, 4]

0

In [31]:
A_csr[3]

<1x10 sparse matrix of type '<class 'numpy.intc'>'
	with 5 stored elements in Compressed Sparse Row format>

In [32]:
A_csr[3].toarray()

array([[0, 1, 0, 0, 0, 1, 0, 1, 1, 1]], dtype=int32)

In [33]:
# data might have zeros!
indices = A_csr.indices
indptr = A_csr.indptr
data = np.random.randint(5, size=len(A_csr.data))
shape = A_csr.shape

In [34]:
B_csr = sparse.csr_matrix((data, indices, indptr), shape)

In [35]:
B_csr

<5x10 sparse matrix of type '<class 'numpy.int32'>'
	with 34 stored elements in Compressed Sparse Row format>

In [36]:
B_csr.eliminate_zeros()

In [37]:
B_csr

<5x10 sparse matrix of type '<class 'numpy.int32'>'
	with 30 stored elements in Compressed Sparse Row format>

In [38]:
# from COO format
row = [0, 0, 1, 2, 2]
col = [2, 3, 0, 1, 2]
data = np.ones(5)
A_csr = sparse.csr_matrix((data, (row, col)), shape = (3, 4))

In [39]:
A_csr.toarray()

array([[0., 0., 1., 1.],
       [1., 0., 0., 0.],
       [0., 1., 1., 0.]])

In [40]:
# equivalently
A_coo = sparse.coo_matrix((data, (row, col)), shape = (3, 4))
A_csr = sparse.csr_matrix(A_coo)

In [41]:
A_csr.toarray()

array([[0., 0., 1., 1.],
       [1., 0., 0., 0.],
       [0., 1., 1., 0.]])

## ~To do~

Complete the function below that converts a sparse matrix from COO format to CSR format.

Again...
* don't use `scipy`
* don't use any loop

**Hint:** Use ``np.unique`` and ``np.cumsum``.

In [42]:
A_coo = sparse.coo_matrix(A_dense)
A_csr = sparse.csr_matrix(A_dense)
A_coo.col, A_csr.indices # they are the same
A_coo.row, np.unique(A_coo.row, return_counts=True)

(array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
        2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], dtype=int32),
 (array([0, 1, 2, 3, 4], dtype=int32), array([7, 8, 8, 5, 6], dtype=int64)))

In [96]:
indptr = np.zeros(shape[0] + 1, dtype=int)
row_indices, counts = np.unique(A_coo.row, return_counts=True)
indptr[row_indices + 1] = counts
indptr
indptr = np.cumsum(indptr)

indptr, A_csr.indptr

(array([ 0,  7, 14, 21, 26, 32], dtype=int32),
 array([ 0,  7, 14, 21, 26, 32], dtype=int32))

In [43]:
class SparseCSR():
    def __init__(self, data: np.ndarray, indices: np.ndarray, indptr: np.ndarray, shape: tuple):
        self.data = data
        self.indices = indices
        self.indptr = indptr
        self.shape = shape

In [97]:
def coo_to_csr(A_coo):
    '''Convert a sparse matrix from COO to CSR format.
    
    Parameters
    ----------
    A_coo : SparseCSR
        Sparse matrix in COO format.
        
    Returns
    -------
    A_csr : SparseCSR
        Sparse matrix in CSR format.
    '''
    # to be modified
    data = A_coo.data
    
    indices = A_coo.col
    
    shape = A_coo.shape
    
    indptr = np.zeros(shape[0] + 1, dtype=int)
    row_indices, counts = np.unique(A_coo.row, return_counts=True)
    indptr[row_indices + 1] = counts
    indptr = np.cumsum(indptr)
    
    return SparseCSR(data, indices, indptr, shape)

In [98]:
def dense_to_csr(A):
    '''Convert dense matrix to sparse in CSR format.
    
    Parameters
    ----------
    A : np.ndarray
        Dense matrix
        
    Returns
    -------
    A_csr : SparseCSR
        Sparse matrix in CSR format.
    '''
    return coo_to_csr(sparse.coo_matrix(A))

In [99]:
# test
A_dense = np.random.randint(2, size = (5,10))
A_csr = sparse.csr_matrix(A_dense)
A_csr_ = dense_to_csr(A_dense)

In [100]:
test_equality(A_csr, A_csr_, ["data", "indices", "indptr"])

[True, True, True]

## Diagonal format

In [101]:
A_diag = sparse.diags(np.arange(5))

In [102]:
A_diag

<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements (1 diagonals) in DIAgonal format>

In [103]:
A_diag.toarray()

array([[0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 2., 0., 0.],
       [0., 0., 0., 3., 0.],
       [0., 0., 0., 0., 4.]])

In [104]:
A_diag.diagonal()

array([0., 1., 2., 3., 4.])

In [105]:
A = sparse.csr_matrix(A_diag)

In [106]:
A, A.data, A.indices, A.indptr

(<5x5 sparse matrix of type '<class 'numpy.float64'>'
 	with 4 stored elements in Compressed Sparse Row format>,
 array([1., 2., 3., 4.]),
 array([1, 2, 3, 4], dtype=int32),
 array([0, 0, 1, 2, 3, 4], dtype=int32))

## ~To do~

Complete the following function that returns a sparse CSR matrix with the pseudo-inverse vector on the diagonal.

**Example:** pseudo inverse of (0, 1, 2) -> (0, 1, 1/2)

**Hint:** Use the property of sparse matrices!

In [246]:
def get_pseudo_inverse(vector):
    '''Return a sparse matrix with pseudo-inverse on the diagonal.
    
    Parameters
    ----------
    vector : np.ndarray
        Input vector. 
        
    Returns
    -------
    A_csr : sparse.csr_matrix
        Sparse matrix in scipy CSR format.
    '''    
    # to be modified
    Dinv = sparse.csr_matrix(vector, dtype=float)
    Dinv.data = np.reciprocal(Dinv.data) #elelment-wise inverse of a vector
    return Dinv

In [247]:
# test
get_pseudo_inverse(np.arange(3)).toarray()

array([[0. , 1. , 0.5]])

## Operations

Usual arithmetic operations apply to sparse matrices. The only contraint is to have a sparse matrix on the **left-hand side** of the operator.

In [202]:
A_dense

array([[0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 1, 1, 1, 1],
       [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
       [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]])

In [203]:
A = sparse.csr_matrix(A_dense)

In [204]:
n_row, n_col = A.shape

In [205]:
A.dot(np.ones(n_col, dtype=int)) #degree of each node

array([5, 5, 8, 5, 8], dtype=int32)

In [206]:
A.T.dot(np.ones(n_row, dtype=int))

array([4, 2, 5, 1, 2, 4, 5, 2, 3, 3], dtype=int32)

In [207]:
# observe the format of the transpose
A.T

<10x5 sparse matrix of type '<class 'numpy.intc'>'
	with 31 stored elements in Compressed Sparse Column format>

In [208]:
A.T.dot(A)

<10x10 sparse matrix of type '<class 'numpy.intc'>'
	with 84 stored elements in Compressed Sparse Column format>

In [209]:
A.dot(A.T)

<5x5 sparse matrix of type '<class 'numpy.intc'>'
	with 25 stored elements in Compressed Sparse Row format>

In [210]:
A.data = np.random.choice((1,2,3,4), size = len(A.data))

In [217]:
A.data

array([3, 1, 3, 4, 1, 2, 4, 3, 3, 1, 3, 2, 4, 1, 3, 3, 3, 2, 1, 3, 4, 3,
       1, 4, 3, 2, 1, 3, 4, 3, 1])

In [218]:
B = A > 1

In [231]:
B

<5x10 sparse matrix of type '<class 'numpy.bool_'>'
	with 19 stored elements in Compressed Sparse Row format>

In [235]:
# Explain the following warning...
B = A < 1
# There is no element that is less than 1 in A, so the resulting answer will be an empty matrix, which is ineffeicient to store in a CSR matrix.
#Usually, sparse matrices hold values bigger than 0.
# All sparse data saatisfy the condition...

In [236]:
B

<5x10 sparse matrix of type '<class 'numpy.bool_'>'
	with 19 stored elements in Compressed Sparse Row format>

In [237]:
B_dense = np.random.randint(2, size = (5,10))
B = sparse.csr_matrix(B_dense)

In [238]:
2 * A + 5 * B

<5x10 sparse matrix of type '<class 'numpy.intc'>'
	with 41 stored elements in Compressed Sparse Row format>

## To do

Complete the following function that normalizes a sparse CSR matrix with non-negative entries so that each row sums to 1 (or to 0 if the whole row is zero). 

**Hint:** Use the above function ``get_pseudo_inverse``.

In [256]:
get_pseudo_inverse(A.data), A.data

(<1x31 sparse matrix of type '<class 'numpy.float64'>'
 	with 31 stored elements in Compressed Sparse Row format>,
 array([3, 1, 3, 4, 1, 2, 4, 3, 3, 1, 3, 2, 4, 1, 3, 3, 3, 2, 1, 3, 4, 3,
        1, 4, 3, 2, 1, 3, 4, 3, 1]))

In [341]:
row_weights = A.dot(np.ones(A.shape[1], dtype=int))

array([13, 14, 15, 21, 19, 19, 16, 18, 14, 20, 14, 15, 14, 14, 15, 14, 14,
       13, 12, 11], dtype=int32)

In [255]:
def normalize_rows(A):
    '''Normalize the rows of a CSR matrix so that all sum to 1 (or 0).
    
    Parameters
    ----------
    A : sparse.csr_matrix
        Input matrix (non-negative entries).
    
    Returns
    -------
    A_norm : sparse.csr_matrix
        Normalized matrix. 
    
    '''
    A.data = 
    return None

SyntaxError: invalid syntax (<ipython-input-255-1def34668bf3>, line 15)

## To do

Complete the following method that returns the dot product of a sparse CSR matrix with a vector.

* No loop allowed!

In [258]:
np.add.reduceat?

[1;31mDocstring:[0m
reduceat(a, indices, axis=0, dtype=None, out=None)

Performs a (local) reduce with specified slices over a single axis.

For i in ``range(len(indices))``, `reduceat` computes
``ufunc.reduce(a[indices[i]:indices[i+1]])``, which becomes the i-th
generalized "row" parallel to `axis` in the final result (i.e., in a
2-D array, for example, if `axis = 0`, it becomes the i-th row, but if
`axis = 1`, it becomes the i-th column).  There are three exceptions to this:

* when ``i = len(indices) - 1`` (so for the last index),
  ``indices[i+1] = a.shape[axis]``.
* if ``indices[i] >= indices[i + 1]``, the i-th generalized "row" is
  simply ``a[indices[i]]``.
* if ``indices[i] >= len(a)`` or ``indices[i] < 0``, an error is raised.

The shape of the output depends on the size of `indices`, and may be
larger than `a` (this happens if ``len(indices) > a.shape[axis]``).

Parameters
----------
a : array_like
    The array to act on.
indices : array_like
    Paired indices, comma sepa

In [None]:
class SparseCSR():
    def __init__(self, data: np.ndarray, indices: np.ndarray, indptr: np.ndarray, shape: tuple):
        self.data = data
        self.indices = indices
        self.indptr = indptr
        self.shape = shape
        
    def dot(self, x: np.ndarray) -> np.ndarray:
        '''Sparse-vector dot product.'''
        # to be modified
        return None

## Slicing

Sparse matrices can be sliced like numpy arrays. The CSR format is more efficient for row slicing (although column slicing is possible), while the CSC format is more efficient for column slicing.

In [260]:
A = sparse.csr_matrix(A_dense)
A

<5x10 sparse matrix of type '<class 'numpy.intc'>'
	with 31 stored elements in Compressed Sparse Row format>

In [261]:
A[:2]

<2x10 sparse matrix of type '<class 'numpy.intc'>'
	with 10 stored elements in Compressed Sparse Row format>

In [262]:
A[1:4,2:]

<3x8 sparse matrix of type '<class 'numpy.intc'>'
	with 14 stored elements in Compressed Sparse Row format>

In [264]:
A[np.array([0,2,4])] #cherry-picking rows

<3x10 sparse matrix of type '<class 'numpy.intc'>'
	with 21 stored elements in Compressed Sparse Row format>

## ~To do~

Consider the following matrix:

In [323]:
A = sparse.csr_matrix(np.random.randint(2, size = (20,30)))
A

<20x30 sparse matrix of type '<class 'numpy.intc'>'
	with 305 stored elements in Compressed Sparse Row format>

Extract the 10 rows of largest sums and build the corresponding matrix.

In [328]:
A.sum(axis=1).T

matrix([[13, 14, 15, 21, 19, 19, 16, 18, 14, 20, 14, 15, 14, 14, 15, 14,
         14, 13, 12, 11]])

In [339]:
A.dot(np.ones(A.shape[1], dtype=int))

array([13, 14, 15, 21, 19, 19, 16, 18, 14, 20, 14, 15, 14, 14, 15, 14, 14,
       13, 12, 11], dtype=int32)

In [329]:
np.argsort(A.dot(np.ones(A.shape[1], dtype=int)))[:10]

array([19, 18, 17,  0,  8, 10, 12, 13, 15, 16], dtype=int64)

#### Solution

In [340]:
A[np.argsort(-A.dot(np.ones(A.shape[1], dtype=int)))[:10]]

<10x30 sparse matrix of type '<class 'numpy.intc'>'
	with 172 stored elements in Compressed Sparse Row format>

## Bonus

Complete all methods of the following CSR class.

In [None]:
class SparseCSR():
    def __init__(self, data: np.ndarray, indices: np.ndarray, indptr: np.ndarray, shape: tuple):
        self.data = data
        self.indices = indices
        self.indptr = indptr
        self.shape = shape
        
    def dot(self, x: np.ndarray) -> np.ndarray:
        '''Sparse-vector dot product.'''
        # to be modified
        prod = np.zeros(shape[0])
        x[indices]
        return None

    def dot_array(self, X: np.ndarray) -> np.ndarray:
        '''Sparse-array dot product.'''
        # to be modified
        return None
    
    def dot_sparse(self, X: SparseCSR) -> SparseCSR:
        '''Sparse-sparse dot product.'''
        # to be modified
        return None
    
    def add_sparse(self, X: SparseCSR) -> SparseCSR:
        '''Add a sparse matrix.'''
        # to be modified
        return None
    
    def slice_row(self, index: np.ndarray) -> SparseCSR:
        '''Slice rows of a sparse matrix.'''
        # to be modified
        return None
    
    def slice_col(self, index: np.ndarray) -> SparseCSR:
        '''Slice columns of a sparse matrix.'''
        # to be modified
        return None
    
    def eliminate_zeros(self) -> SparseCSR:
        '''Eliminate zeros of a sparse matrix.'''
        # to be modified
        return None