# SciPy - Library of scientific algorithms for Python

Parts of this notebook have been taken from:

[http://github.com/jrjohansson/scientific-python-lectures](http://github.com/jrjohansson/scientific-python-lectures).

The other notebooks in this lecture series are indexed at [http://jrjohansson.github.io](http://jrjohansson.github.io).

## Introduction

The SciPy framework builds on top of the low-level NumPy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. Some of the topics that SciPy covers are:


* Special functions ([scipy.special](http://docs.scipy.org/doc/scipy/reference/special.html))
* Integration ([scipy.integrate](http://docs.scipy.org/doc/scipy/reference/integrate.html))
* Optimization ([scipy.optimize](http://docs.scipy.org/doc/scipy/reference/optimize.html))
* Interpolation ([scipy.interpolate](http://docs.scipy.org/doc/scipy/reference/interpolate.html))
* Fourier Transforms ([scipy.fftpack](http://docs.scipy.org/doc/scipy/reference/fftpack.html))
* Signal Processing ([scipy.signal](http://docs.scipy.org/doc/scipy/reference/signal.html))
* Linear Algebra ([scipy.linalg](http://docs.scipy.org/doc/scipy/reference/linalg.html))
* Sparse Eigenvalue Problems ([scipy.sparse](http://docs.scipy.org/doc/scipy/reference/sparse.html))
* Statistics ([scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html))
* Multi-dimensional image processing ([scipy.ndimage](http://docs.scipy.org/doc/scipy/reference/ndimage.html))
* File IO ([scipy.io](http://docs.scipy.org/doc/scipy/reference/io.html))


Each of these submodules provides a number of functions and classes that can be used to solve problems in their respective topics.

In this lecture we will look at how to use some of these subpackages.

To access the SciPy package in a Python program, we start by importing everything from the `scipy` module.

In [1]:
from scipy import *

If we only need to use part of the SciPy framework we can selectively include only those modules we are interested in. For example, to include the linear algebra package under the name `la`, we can do:

In [2]:
import scipy.linalg as la

## Sparse matrices

Sparse matrices are often useful in numerical simulations of large systems, when the problem can be described in matrix form and the matrices or vectors contain mostly zeros.

SciPy has good support for sparse matrices, including basic linear algebra operations (such as equation solving and eigenvalue calculations).

There are many possible strategies for storing sparse matrices efficiently. Some of the most common are the so-called **coordinate form (COO)**, **list of lists (LIL) form**, and the **compressed sparse column (CSC)** and **compressed sparse row (CSR)** forms.

*Advantages*
- CSR or CSC
  - Most computations can be implemented efficiently
- COO or LIL format
  - Easy initialization
  - Adding elements is efficient

*Disadvantages*
- CSR or CSC
  - The layout is not intuitive
  - Not easy to initialize
- COO or LIL format
  - Arithmetic operations and other computations are not efficient

For more information about sparse matrices, see e.g. [Wikipedia](http://en.wikipedia.org/wiki/Sparse_matrix).
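The trade-off listed above suggests a common workflow: assemble the matrix in a format that is easy to fill, then convert it once to a compressed format before computing. The following is a minimal sketch of that idea, not the only way to do it; the variable names and the 3x3 values are made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Illustrative (row, column, value) triplets for the nonzero entries
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([4.0, 1.0, 5.0, 2.0])

# COO is easy to build directly from triplets
A_build = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Convert once to CSR before doing arithmetic or row access
A_compute = A_build.tocsr()

print(A_compute.format)
print(A_compute.toarray())
```

The conversion copies the data into the new layout, so it is usually done once after assembly rather than inside a loop.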


## Good ol' examples

Let's inspect the following matrix:

|   |   |   |   |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 3 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |


In [3]:
from scipy.sparse import *

In [4]:
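# dense matrix, built as a NumPy array
# (the NumPy aliases in the top-level scipy namespace, such as scipy.array, are deprecated)
import numpy as np
M = np.array([[1,0,0,0], [0,3,0,0], [0,1,1,0], [1,0,0,1]])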
M

array([[1, 0, 0, 0],
       [0, 3, 0, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 1]])

When we create a sparse matrix we have to choose which format it should be stored in. For example, to store it in CSR format:

In [5]:
# convert from dense to sparse
A = csr_matrix(M)
A

<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

From sparse to dense

In [6]:
# convert from sparse to dense
A.todense()

matrix([[1, 0, 0, 0],
        [0, 3, 0, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]], dtype=int64)

A more efficient way to create a sparse matrix is to create an empty one and populate it using matrix indexing. This avoids creating a potentially large dense matrix first.

In [7]:
A = lil_matrix((4,4)) # empty 4x4 sparse matrix
A[0,0] = 1
A[1,1] = 3
A[2,2] = A[2,1] = 1
A[3,3] = A[3,0] = 1
A

<4x4 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in LInked List format>

In [8]:
A.todense()

matrix([[1., 0., 0., 0.],
        [0., 3., 0., 0.],
        [0., 1., 1., 0.],
        [1., 0., 0., 1.]])

## Operations

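We can compute with sparse matrices much as we do with dense matrices. Note that for these sparse matrix classes both `*` and `.dot()` perform matrix multiplication, while `.multiply()` is the element-wise product: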

In [9]:
(A * A).todense()

matrix([[1., 0., 0., 0.],
        [0., 9., 0., 0.],
        [0., 4., 1., 0.],
        [2., 0., 0., 1.]])

In [10]:
A.dot(A).todense()

matrix([[1., 0., 0., 0.],
        [0., 9., 0., 0.],
        [0., 4., 1., 0.],
        [2., 0., 0., 1.]])

In [11]:
A.multiply(A).todense()

matrix([[1., 0., 0., 0.],
        [0., 9., 0., 0.],
        [0., 1., 1., 0.],
        [1., 0., 0., 1.]])
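As a further illustration of the operations above, here is a small sketch that rebuilds the same 4x4 matrix, compares the matrix product with the element-wise product, and solves a linear system with `scipy.sparse.linalg.spsolve`. The names `A_demo`, `b` and `x`, and the right-hand side values, are assumptions chosen for the example.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

# Same matrix as in the examples above, stored in CSC format
A_demo = csc_matrix(np.array([[1., 0., 0., 0.],
                              [0., 3., 0., 0.],
                              [0., 1., 1., 0.],
                              [1., 0., 0., 1.]]))

# Matrix product (the @ operator) versus element-wise product
matrix_product = (A_demo @ A_demo).toarray()
elementwise = A_demo.multiply(A_demo).toarray()

# Solve A_demo x = b for an illustrative right-hand side
b = np.array([1., 6., 3., 2.])
x = spsolve(A_demo, b)

print(matrix_product)
print(elementwise)
print(x)
```

Using `@` makes the intent explicit and behaves the same way for NumPy arrays, which avoids confusion with the element-wise meaning of `*` on dense arrays.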

## CSR, CSC and COO

There are a few different formats in which a sparse matrix might be encoded.


Recall our original matrix:

In [12]:
M

array([[1, 0, 0, 0],
       [0, 3, 0, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 1]])

## COO - Coordinate format

In [13]:
A_coo = coo_matrix(A)
A_coo

<4x4 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in COOrdinate format>

We have access to the inner data structure

In [14]:
A_coo.row

array([0, 1, 2, 2, 3, 3], dtype=int32)

In [15]:
A_coo.col

array([0, 1, 1, 2, 0, 3], dtype=int32)

In [16]:
A_coo.data

array([1., 3., 1., 1., 1., 1.])

The problem with this data structure is that we cannot index elements efficiently. How can we get the elements in row 3? We would have to scan the `row` array until we find the first 3, which is at index 4 (the fifth entry).

Doing this for every row is too slow even for matrices with a few hundred rows.
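To make the cost concrete, here is a rough sketch of what a row lookup on COO data amounts to. The helper `coo_row_entries` is hypothetical (it is not part of SciPy); it compares every stored entry against the target row, so its cost grows with the total number of nonzeros rather than with the size of the row.

```python
import numpy as np
from scipy.sparse import coo_matrix

A_coo_demo = coo_matrix(np.array([[1, 0, 0, 0],
                                  [0, 3, 0, 0],
                                  [0, 1, 1, 0],
                                  [1, 0, 0, 1]]))

def coo_row_entries(coo, target_row):
    """Hypothetical helper: return (columns, values) of one row of a COO matrix."""
    # The comparison touches every stored entry: O(nnz) work per lookup
    mask = coo.row == target_row
    return coo.col[mask], coo.data[mask]

cols_in_row, vals_in_row = coo_row_entries(A_coo_demo, 3)
print(cols_in_row, vals_in_row)
```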

What if we had a pointer to the exact starting position of every row?

## CSR - Compressed Sparse Row format

In [17]:
A_csr = csr_matrix(A)
A_csr.todense()

matrix([[1., 0., 0., 0.],
        [0., 3., 0., 0.],
        [0., 1., 1., 0.],
        [1., 0., 0., 1.]])

In [18]:
A_csr.indices

array([0, 1, 1, 2, 0, 3], dtype=int32)

In [19]:
A_csr.indptr

array([0, 1, 2, 4, 6], dtype=int32)

In [20]:
A_csr.data

array([1., 3., 1., 1., 1., 1.])

In this case:

- `indices`: the column index of each stored value, same as `.col` in COO format
- `data`: the stored values themselves
- `indptr`: the row pointer. `indptr[i]` is the position in `indices` and `data` where row `i` starts, and `indptr[i+1]` is the position where it ends (exclusive)
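One way to see where `indptr` comes from is to derive it by hand from the COO `row` array shown earlier: count the stored entries in each row and take the running total. This is only a sketch of the idea under the assumption that the entries are already sorted by row; it is not how SciPy performs the conversion internally.

```python
import numpy as np

# The COO row indices of the example matrix, sorted by row
coo_rows = np.array([0, 1, 2, 2, 3, 3])
n_rows = 4

# Number of stored entries in each row
counts = np.bincount(coo_rows, minlength=n_rows)

# Running total with a leading zero: row i occupies indptr[i]:indptr[i+1]
indptr = np.concatenate(([0], np.cumsum(counts)))

print(counts)
print(indptr)
```

The result has one more entry than the number of rows, and `np.diff(indptr)` recovers the per-row counts.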

Let's extract row 3

In [21]:
target_row = 3

row_start = A_csr.indptr[target_row]
row_end = A_csr.indptr[target_row+1]

In [22]:
row_columns = A_csr.indices[row_start:row_end]
row_columns

array([0, 3], dtype=int32)

In [23]:
row_data = A_csr.data[row_start:row_end]
row_data

array([1., 1.])

Double check with the original matrix

In [24]:
M[target_row,:]

array([1, 0, 0, 1])
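The same check can be repeated for every row. The sketch below uses a hypothetical helper, `csr_row_to_dense`, that rebuilds one dense row from the three CSR arrays and compares it with the dense matrix; the names `M_demo` and `A_demo` are local to this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

M_demo = np.array([[1, 0, 0, 0],
                   [0, 3, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
A_demo = csr_matrix(M_demo)

def csr_row_to_dense(csr, i):
    """Hypothetical helper: rebuild row i as a dense 1-D array from the CSR arrays."""
    start, end = csr.indptr[i], csr.indptr[i + 1]
    dense_row = np.zeros(csr.shape[1], dtype=csr.dtype)
    # Scatter the stored values into their column positions
    dense_row[csr.indices[start:end]] = csr.data[start:end]
    return dense_row

all_rows_match = all(
    np.array_equal(csr_row_to_dense(A_demo, i), M_demo[i])
    for i in range(A_demo.shape[0])
)
print(all_rows_match)
```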

## WARNING! 

CSR is made for fast **row** access, CSC for fast **column** access. Do not mix them up.

Let's compare the speeds

In [25]:
big_matrix_csr = random(10000, 5000, density=0.05, format='csr')
big_matrix_csr

<10000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2500000 stored elements in Compressed Sparse Row format>

In [26]:
big_matrix_csc = big_matrix_csr.tocsc()
big_matrix_csc

<10000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2500000 stored elements in Compressed Sparse Column format>

In [27]:
%%timeit
for row_index in range(big_matrix_csr.shape[0]):
    # Do something
    useless_row = big_matrix_csr[row_index]

610 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%%timeit
for row_index in range(big_matrix_csc.shape[0]):
    # Do something
    useless_row = big_matrix_csc[row_index]

43.5 s ± 1.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Ouch! That is roughly a 70x difference

It's very easy to waste a lot of time by using a data structure in the wrong way.
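A practical consequence is to convert the matrix once, outside the loop, to the format that matches the access pattern. The sketch below uses a much smaller random matrix than the one timed above (the sizes and names are assumptions for illustration) and loops over rows in CSR and over columns in CSC.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small random matrix for illustration only
small_csr = sparse_random(200, 100, density=0.05, format='csr')

# Convert once, before the loop, when column access is needed
small_csc = small_csr.tocsc()

# Row-wise loop on CSR: each row is a contiguous slice of the stored data
row_sums = np.array([small_csr[i].sum() for i in range(small_csr.shape[0])])

# Column-wise loop on CSC: each column is a contiguous slice of the stored data
col_sums = np.array([small_csc[:, j].sum() for j in range(small_csc.shape[1])])

print(row_sums.shape, col_sums.shape)
```

For simple reductions like these, `small_csr.sum(axis=1)` and `small_csr.sum(axis=0)` avoid the Python loop entirely; the loops are shown only to mirror the access pattern discussed above.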

### Another example for speed-critical code

We want the indices of the items seen by each user.


The simple implementation is the following, where users are rows and items are columns:

In [29]:
URM = random(100000, 1500, density=0.01, format='csr')
URM

<100000x1500 sparse matrix of type '<class 'numpy.float64'>'
	with 1500000 stored elements in Compressed Sparse Row format>

In [30]:
%%timeit
for user_id in range(URM.shape[0]):
    # Do something
    user_seen_items = URM[user_id].indices

6.33 s ± 796 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [31]:
%%timeit
for user_id in range(URM.shape[0]): 
    # Do something
    user_seen_items = URM.indices[URM.indptr[user_id]:URM.indptr[user_id+1]]

62.1 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Even for this simple operation there is a difference of 2 orders of magnitude



Why?

Let's see what the first version of the code does...

#### Step 1 - slicing row

In [32]:
user_id = 5
user_row = URM[user_id]
user_row

<1x1500 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

#### Step 2 - get column indices

In [33]:
user_seen_items = user_row.indices
user_seen_items

array([  11,   23,   53,  405,  414,  576,  668,  702, 1168, 1214, 1216,
       1236, 1304, 1366, 1472], dtype=int32)

Now, let's go into the second version of the code

#### Step 1 - look up the row boundaries

In [34]:
user_id = 5
user_indices_start = URM.indptr[user_id]
user_indices_end = URM.indptr[user_id + 1]

user_indices_start, user_indices_end

(75, 90)

#### Step 2 - get column indices

In [35]:
user_seen_items = URM.indices[user_indices_start:user_indices_end]
user_seen_items

array([  11,   23,   53,  405,  414,  576,  668,  702, 1168, 1214, 1216,
       1236, 1304, 1366, 1472], dtype=int32)

### Did you see the difference?

The reason for the big performance loss is that the first version builds a new sparse matrix for each user. We don't need the matrix itself, only one of its attributes, and the second version reads that attribute directly from the arrays of the original matrix.
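The same idea can be pushed further by working on the CSR arrays for all users at once. The sketch below is one possible approach on a tiny made-up matrix (`URM_demo` and its values are assumptions): `np.diff` on `indptr` gives the number of items per user, and `np.split` cuts `indices` at the row boundaries without building any sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny illustrative user-item matrix: 3 users, 4 items
URM_demo = csr_matrix(np.array([[0., 1., 0., 2.],
                                [0., 0., 0., 0.],
                                [3., 0., 4., 5.]]))

# Number of stored items per user, straight from the row pointer
items_per_user = np.diff(URM_demo.indptr)

# One array of item indices per user, cut at the interior row boundaries
seen_items = np.split(URM_demo.indices, URM_demo.indptr[1:-1])

print(items_per_user)
for user_id, items in enumerate(seen_items):
    print(user_id, items)
```

A user with no interactions simply gets an empty array, which is consistent with two equal consecutive entries in `indptr`.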

## Further reading

* http://www.scipy.org - The official web page for the SciPy project.
* http://docs.scipy.org/doc/scipy/reference/tutorial/index.html - A tutorial on how to get started using SciPy. 
* https://github.com/scipy/scipy/ - The SciPy source code. 