## Sparse Matrices

Once a well-trained machine learning model has been deployed, the data ingestion pipeline for that model will also be deployed. That pipeline will consist of a collection of tools and systems used to fetch, transform, and feed data to the machine learning system in production.

However, that pipeline should not be finalized while the machine learning model it feeds is still in development. Finalizing data ingestion before models have been run and your hypotheses about the business use case have been tested often leads to substantial rework. Early experiments almost always fail, so be careful about investing large amounts of time in a data ingestion pipeline until there is enough accumulated evidence that a deployed model will help the business.

Instead of building a complete data ingestion pipeline, data scientists will often use sparse matrices during the development and testing of a machine learning model. Sparse matrices are used to represent complex sets of data (e.g., word counts) in a way that reduces the use of computer memory and processing time.

In [1]:
import numpy as np
from scipy import sparse

A sparse matrix is one in which most of the values are zero. The sparsity of a matrix is the number of zero-valued elements divided by the total number of elements; if it is greater than 0.5, the matrix is considered sparse.

In [2]:
# 100 x 1000 matrix of random 0s and 1s
A = np.random.randint(0, 2, 100000).reshape(100, 1000)
sparsity = 1.0 - (np.count_nonzero(A) / A.size)
print(round(sparsity, 4))

0.5015
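
The same sparsity calculation can be applied to a matrix that is already stored in a SciPy sparse format. The sketch below is one way to do it; the helper name `sparsity` and the small example matrix are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import sparse


def sparsity(m):
    """Fraction of zero-valued entries in a dense array or SciPy sparse matrix."""
    if sparse.issparse(m):
        # count_nonzero() skips any zeros that happen to be stored explicitly
        nonzero = m.count_nonzero()
        size = m.shape[0] * m.shape[1]
    else:
        nonzero = np.count_nonzero(m)
        size = m.size
    return 1.0 - nonzero / size


# Illustrative 4 x 3 matrix with 2 non-zero entries out of 12
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [1, 0, 0],
                  [0, 0, 0]])

print(round(sparsity(dense), 4))                      # 0.8333
print(round(sparsity(sparse.csr_matrix(dense)), 4))   # 0.8333
```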


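**!! Many common NumPy functions, such as `np.dot`, are not aware of sparse matrices and may raise errors or return unexpected results. Use the sparse matrix's own methods (for example, `.dot`) instead.**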
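
As a minimal sketch of the sparse-aware alternative, the example below multiplies a small CSR matrix by a dense vector and by itself using the matrix's `.dot` method and the `@` operator. The matrix `M` and vector `v` are made-up values for illustration.

```python
import numpy as np
from scipy import sparse

# Illustrative 3 x 3 sparse matrix and dense vector
M = sparse.csr_matrix(np.array([[1, 0, 0],
                                [0, 0, 2],
                                [0, 3, 0]]))
v = np.array([1, 2, 3])

# Sparse matrix times dense vector returns a dense 1-D array
print(M.dot(v))   # [1 6 6]
print(M @ v)      # [1 6 6]

# Sparse matrix times sparse matrix stays sparse
P = M @ M
print(sparse.issparse(P))   # True
print(P.toarray())
```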

The most commonly used sparse matrix formats include the `coo_matrix`, which is built from the `COO`rdinates and values of the non-zero entries.  
When many entries share a row or column, storing the full coordinates of every entry is redundant. Instead, we can record where each row or column starts in the list of stored values and keep only the remaining index for each entry. When the entries are grouped by column, this is the `CSC` (compressed sparse column) format.  
Like the `CSC` format, there is a `CSR` (compressed sparse row) format for entries grouped by row.
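
To make the difference between the three formats concrete, the sketch below builds one small matrix (made-up values) and prints the arrays each format stores. In CSR and CSC, `indptr` marks where each row or column starts within `indices` and `data`.

```python
import numpy as np
from scipy import sparse

# Illustrative 3 x 4 matrix with three non-zero entries
dense = np.array([[0, 0, 5, 0],
                  [0, 0, 0, 0],
                  [7, 0, 0, 8]])

# COO: one (row, col, value) triple per non-zero entry
coo = sparse.coo_matrix(dense)
print(coo.row, coo.col, coo.data)         # [0 2 2] [2 0 3] [5 7 8]

# CSR: row i owns the slice indptr[i]:indptr[i+1] of indices (columns) and data
csr = coo.tocsr()
print(csr.indptr, csr.indices, csr.data)  # [0 1 1 3] [2 0 3] [5 7 8]

# CSC: column j owns the slice indptr[j]:indptr[j+1] of indices (rows) and data
csc = coo.tocsc()
print(csc.indptr, csc.indices, csc.data)  # [0 1 1 2 3] [2 0 2] [7 5 8]
```

Note that the empty middle row costs CSR only one repeated value in `indptr`, whereas the dense array stores four zeros for it.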

In [3]:
# Generate a 10 by 100 matrix with random values from a Poisson distribution
A = np.random.poisson(0.3, (10,100))
B = sparse.coo_matrix(A)
C = B.todense()

print("A",type(A),A.shape,"\n"
      "B",type(B),B.shape,"\n"
      "C",type(C),C.shape,"\n")

A <class 'numpy.ndarray'> (10, 100) 
B <class 'scipy.sparse.coo.coo_matrix'> (10, 100) 
C <class 'numpy.matrix'> (10, 100) 



In [4]:
print(A[:2])

[[0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 2 0 0 2 2 0 1 3 0 0 0
  0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 1 2 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1
  0 1 0 0 0 2 2 2 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
  2 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0]]
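
Rows like the ones printed above are mostly zeros, which is where the memory savings come from. The sketch below gives a rough comparison of the bytes used by a dense array and by the three arrays behind its CSR version. The matrix size, the Poisson rate of 0.05, and the seed are arbitrary choices for illustration, and the exact byte counts depend on the integer types SciPy selects.

```python
import numpy as np
from scipy import sparse

# Illustrative 1000 x 1000 count matrix; most entries are zero at this rate
rng = np.random.default_rng(0)
dense = rng.poisson(0.05, (1000, 1000))
csr = sparse.csr_matrix(dense)

# The dense array stores every element, zero or not
dense_bytes = dense.nbytes

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
print("non-zero entries:", csr.nnz, "of", dense.size)
```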


Because the coordinate format is the easiest to create, it is common to build a matrix in that format first and then convert it to another, more efficient format. Let us first show how to create a matrix from coordinates:

In [5]:
rows = [0,1,2,8]
cols = [1,0,4,8]
vals = [1,2,1,4]

A = sparse.coo_matrix((vals, (rows, cols)))

Then convert it to a CSR matrix:

In [6]:
B = A.tocsr()

In [7]:
print(B)

  (0, 1)	1
  (1, 0)	2
  (2, 4)	1
  (8, 8)	4
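
Two details of building from coordinates are worth a quick sketch. When no shape is given, it is inferred from the largest row and column index, so passing `shape` explicitly keeps any trailing empty rows or columns. Repeated coordinates are kept as separate entries in COO and summed when converting to CSR, which is convenient for accumulating counts. The triples below are made-up values.

```python
from scipy import sparse

# Illustrative (row, col, value) triples; coordinate (0, 1) appears twice
rows = [0, 0, 1, 0]
cols = [1, 3, 2, 1]
vals = [1, 2, 5, 4]

# An explicit shape keeps the empty last row and last column
coo = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 5))
csr = coo.tocsr()

print(coo.nnz)        # 4: COO stores the duplicate entries separately
print(csr.nnz)        # 3: the two (0, 1) entries were summed into one
print(csr.toarray())
```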


Because this introduction to sparse matrices is aimed at data ingestion, we need to be able to:

- concatenate matrices (e.g., add a new user to a recommender matrix)
- read and write the matrices to and from disk

In [8]:
# Matrix merge example: append C as a new row below B
C = sparse.csr_matrix(np.array([0,1,0,0,2,0,0,0,1]).reshape(1,9))
D = sparse.vstack([B,C])
print(D.todense())

[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]
 [0 1 0 0 2 0 0 0 1]]


In [9]:
print("Shape of original matrix:", B.shape)
print("Shape of new row:", C.shape)
print("Shape of stacked matrix:", D.shape)

Shape of original matrix: (9, 9)
Shape of new row: (1, 9)
Shape of stacked matrix: (10, 9)
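
For the second requirement, reading and writing matrices to and from disk, one option is SciPy's `save_npz` and `load_npz` functions. The sketch below round-trips a small matrix through a temporary directory; the matrix values and the file name `matrix.npz` are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Illustrative 3 x 3 CSR matrix
M = sparse.csr_matrix(np.array([[0, 1, 0],
                                [2, 0, 0],
                                [0, 0, 4]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matrix.npz")  # placeholder file name
    sparse.save_npz(path, M)          # write the matrix to disk
    loaded = sparse.load_npz(path)    # read it back into memory

# The sparse format and the values survive the round trip
print(loaded.format)                  # csr
print((M != loaded).nnz == 0)         # True
```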
