## Iterable comprehensions

To initialize an iterable in Python, it can be convenient to use what is called a `comprehension`. It is a way to build an iterable from another iterable, using a for loop.

In [7]:
[x for x in range(0,10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

If statements can also be used in list comprehensions:

In [8]:
our_list = [x for x in range(0,10) if x%2 == 1]
print(our_list)

[1, 3, 5, 7, 9]
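As a rough illustration of what the comprehension does, the sketch below builds the same list with an explicit `for` loop and compares the two results. The names `odd_loop` and `odd_comprehension` are introduced here only for the example.

```python
# Explicit loop: start from an empty list and append matching values.
odd_loop = []
for x in range(0, 10):
    if x % 2 == 1:
        odd_loop.append(x)

# Equivalent list comprehension with the same filter.
odd_comprehension = [x for x in range(0, 10) if x % 2 == 1]

print(odd_loop)                       # [1, 3, 5, 7, 9]
print(odd_loop == odd_comprehension)  # True
```

The comprehension packs the loop, the condition, and the `append` into a single expression.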


We can write similar statements for sets and for dictionaries:

In [9]:
{x - 1 for x in our_list}

{0, 2, 4, 6, 8}

### Using Python’s enumerate()
You can use `enumerate()` in a loop in almost the same way that you use the original iterable object. Instead of putting the iterable directly after `in` in the `for` loop, you put it inside the parentheses of `enumerate()`. You also have to change the loop variable a little bit, as shown in this example:

In [10]:
values = ["a", "b", "c"]

When you use `enumerate()`, the function gives you back two loop variables: 

1. The `count` of the current iteration
2. The `value` of the item at the current iteration

In [11]:
for (count, value) in enumerate(values):
    print(count, value)

0 a
1 b
2 c


In [12]:
{x:y for (x,y) in enumerate(our_list)}

{0: 1, 1: 3, 2: 5, 3: 7, 4: 9}
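`enumerate()` also accepts an optional `start` argument that sets the first count. The short sketch below, which uses a hypothetical dictionary name `numbered`, shows one way it might be combined with a dictionary comprehension.

```python
values = ["a", "b", "c"]

# Count from 1 instead of the default 0.
numbered = {count: value for count, value in enumerate(values, start=1)}

print(numbered)  # {1: 'a', 2: 'b', 3: 'c'}
```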

## Map function

The `map()` function is one of Python's built-in functions. It can be used to apply a function to each element of an iterable. It returns an iterator, which can be converted into the desired iterable.

In [13]:
map(lambda a : a**2, our_list)

<map at 0x1c758f44e80>

In [14]:
our_list

[1, 3, 5, 7, 9]

In [15]:
list(map(lambda a : a**2, our_list))

[1, 9, 25, 49, 81]
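To make the iterator behaviour more concrete, here is a small sketch comparing `map()` with a list comprehension, and showing that a map object can only be consumed once. The helper `square` and the variable names are assumptions made for this example.

```python
our_list = [1, 3, 5, 7, 9]


def square(a):
    """Return the square of a number."""
    return a ** 2


# map() with a named function and the equivalent list comprehension.
squares_map = list(map(square, our_list))
squares_comprehension = [a ** 2 for a in our_list]
print(squares_map == squares_comprehension)  # True

# A map object is an iterator: once consumed, it is empty.
lazy = map(square, our_list)
first_pass = list(lazy)
second_pass = list(lazy)
print(first_pass)   # [1, 9, 25, 49, 81]
print(second_pass)  # []
```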

## Zip function

The `zip()` function is also one of Python's built-in functions. It is used to create an iterator of tuples from iterables: the i-th tuple contains the i-th element of each input iterable.

In [16]:
x = [1,2,3]
y = [4,5,6]

list(zip(x,y))

[(1, 4), (2, 5), (3, 6)]
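Two details of `zip()` are worth a quick sketch: it stops at the shortest input, and applying it to unpacked pairs undoes the pairing. The variable names below are chosen for illustration only.

```python
x = [1, 2, 3]
y = [4, 5, 6, 7]  # one element longer than x

# zip() stops as soon as the shortest iterable is exhausted.
pairs = list(zip(x, y))
print(pairs)  # [(1, 4), (2, 5), (3, 6)]

# Unpacking the pairs into zip() separates them again (as tuples).
xs, ys = zip(*pairs)
print(xs)  # (1, 2, 3)
print(ys)  # (4, 5, 6)
```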

# Sparse data with SciPy

In the rest of the lecture notes, the word “matrix” will be used a lot. If you are not comfortable with it, you can just think of it as a two-dimensional array. When such an object contains a very small amount of information compared to its size (a matrix with mostly zeros or a dataframe with a lot of NaN values), the data is called sparse.

## Why sparse data?

Sparse data is present in a lot of domains:


- Search engines: a graph of the web is sparse, so using ranking algorithms like PageRank to rank websites requires calculations involving sparse data.

- Natural Language Processing: working on a set of text documents and a vocabulary requires building sparse graphs. The computations required use highly sparse vectors and matrices.

- Digital image processing: when the processed images have a lot of black pixels.

- Studying communication networks: communication networks or social networks are often sparse (lots of nodes but few links). Using community detection algorithms or other tools requires computations with sparse matrices (since a network/graph can be represented by its adjacency matrix).


First, sparse data should be stored in an efficient way. Indeed, we do not want to use a large amount of memory for storing a lot of zeros.

Second, computations with sparse data can be a problem if we use regular dense two-dimensional arrays in Python. Indeed, multiplying two sparse matrices will involve multiplying and adding a lot of zeros, which are useless operations that take time.
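The memory argument can be illustrated with a rough sketch. It assumes a 1000 x 1000 matrix whose only nonzero entries are on the diagonal, and it uses SciPy's `csr_matrix`, one of the sparse formats discussed later in these notes. The exact sparse size depends on the index dtype SciPy picks, so only the comparison is shown.

```python
import numpy as np
from scipy.sparse import csr_matrix

n = 1000

# Dense matrix with ones on the diagonal and zeros everywhere else.
dense = np.zeros((n, n), dtype=np.float64)
dense[np.arange(n), np.arange(n)] = 1.0

sparse = csr_matrix(dense)

# Memory used by the dense array versus the three arrays of the sparse format.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense_bytes)                 # 8000000
print(sparse_bytes < dense_bytes)  # True
```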

## How is it stored and handled?

The SciPy library provides a sparse matrix module, `scipy.sparse`. It offers several types that store sparse data in a memory-efficient way and implement basic arithmetic operations so that computations with sparse matrices focus on “useful” operations.

We will talk about the three most important sparse matrix formats provided by SciPy. To choose the right format for the right application, it will be important to consider what kind of operations we want to do with the matrix:

- accessing data often

- entering new data often

- vector matrix multiplication

- …

### The coordinate format: COO

A COO matrix is a sparse matrix in the coordinate format. The coordinates of the nonzero values and the values themselves are stored. It takes the form of three arrays:

    - an array i of row indices

    - an array j of column indices

    - an array data of data

The value at position (i[k], j[k]) in our matrix is data[k].
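The rule above can be checked by hand. The sketch below fills a dense array from the three arrays `i`, `j`, and `data`, and compares it with what `coo_matrix` produces. The sample values are chosen for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

i = np.array([0, 3, 1, 0])     # row indices
j = np.array([0, 3, 1, 2])     # column indices
data = np.array([4, 5, 7, 9])  # values

# Place data[k] at position (i[k], j[k]) by hand.
dense = np.zeros((4, 4), dtype=data.dtype)
for k in range(len(data)):
    dense[i[k], j[k]] = data[k]

coo = coo_matrix((data, (i, j)), shape=(4, 4))
print(np.array_equal(dense, coo.toarray()))  # True
```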

Advantages:

    - easy to construct

    - facilitates fast conversion among sparse formats

    - very fast conversion to and from CSR/CSC formats

Disadvantages:

    - does not support direct arithmetic operations

    - no efficient slicing

Usage suggestion: Use it to construct a sparse matrix and directly convert it to one of the two formats we will present next. It will then allow fast operations.

When initializing a COO matrix, one can specify the type of the values that will be stored in the matrix:

In [17]:
import numpy as np
from scipy.sparse import coo_matrix

In [18]:
coo_matrix((3,4), dtype = np.int64).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int64)

In [19]:
row  = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
print(coo_matrix((data, (row, col)), shape=(4, 4)))

  (0, 0)	4
  (3, 3)	5
  (1, 1)	7
  (0, 2)	9


In [20]:
print(coo_matrix((data, (row, col)), shape=(4, 4)).toarray())

[[4 0 9 0]
 [0 7 0 0]
 [0 0 0 0]
 [0 0 0 5]]
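Following the usage suggestion above, a typical workflow might look like the sketch below: build the matrix in COO format, convert it with `tocsr()`, and then do arithmetic and slicing on the converted matrix. The vector `v` is an assumption made for the example.

```python
import numpy as np
from scipy.sparse import coo_matrix

row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])

# Construct in COO format, then convert for fast operations.
coo = coo_matrix((data, (row, col)), shape=(4, 4))
csr = coo.tocsr()

# Matrix-vector product and row slicing on the CSR matrix.
v = np.array([1, 1, 1, 1])
print(csr @ v)           # [13  7  0  5]
print(csr[0].toarray())  # [[4 0 9 0]]
```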


### The compressed sparse column format: CSC

A CSC matrix is a compressed sparse column matrix. The row indices for each column are stored in an array (indices) and the values are stored in a second array (data). A third array (indptr) specifies which range of the two previous arrays belongs to each column. More precisely, the indices in indices[indptr[i]:indptr[i+1]] give the row indices where the data in data[indptr[i]:indptr[i+1]] is. So for each column i, we are given the row indices and the corresponding values. Notice that the values of a single column are stored consecutively in data, but the elements of a single row are not. That is why this format is preferred if calculations require column accesses and if we want to access slices of columns quickly.

Advantages:

    - efficient arithmetic operations CSC + CSC, CSC * CSC, etc.

    - efficient column slicing

    - fast matrix vector products (CSR may be faster)

Disadvantages:

    - slow row slicing operations (consider CSR)

Such a sparse matrix can be initialized from a dense matrix (NumPy two-dimensional array or NumPy matrix) or from another sparse matrix, whatever its format. It can also be initialized in the same way as a COO matrix, or by directly giving the three needed arrays:

In [21]:
from scipy.sparse import csc_matrix

In [22]:
csc_matrix((3,4), dtype = np.int8).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [23]:
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])

In [24]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0,2,2,0,1,2])
data = np.array([1,2,3,4,5,6])

In [25]:
csc_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]], dtype=int32)
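The `indptr`, `indices`, and `data` arrays can also be passed to `csc_matrix` directly. The sketch below does so and walks through the columns to show how `indptr` delimits each one.

```python
import numpy as np
from scipy.sparse import csc_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

# Build the matrix from the three CSC arrays.
csc = csc_matrix((data, indices, indptr), shape=(3, 3))

# For each column, indptr gives the slice of indices/data that belongs to it.
for column in range(3):
    start, end = indptr[column], indptr[column + 1]
    print(column, indices[start:end], data[start:end])
# 0 [0 2] [1 2]
# 1 [2] [3]
# 2 [0 1 2] [4 5 6]

print(csc.toarray())
# [[1 0 4]
#  [0 0 5]
#  [2 3 6]]
```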

In [26]:
dense_matrix = np.array([[1, 0, 4],[0, 0, 5],[2, 3, 6]])
print(csc_matrix(dense_matrix))

  (0, 0)	1
  (2, 0)	2
  (2, 1)	3
  (0, 2)	4
  (1, 2)	5
  (2, 2)	6


### The compressed sparse row format: CSR

A CSR matrix is a compressed sparse row matrix. The format is the same as the CSC format, but here the indices array stores column indices and the indptr array gives the ranges of the indices and data arrays that correspond to a specific row (instead of a specific column).

This format is used in exactly the same way as CSC, but row slicing is efficient (column slicing is slow) and matrix-vector products are typically faster.
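As a minimal sketch, the same 3 x 3 matrix used in the CSC examples can be stored in CSR format to see how the three arrays change. The vector `v` is an assumption made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_matrix = np.array([[1, 0, 4], [0, 0, 5], [2, 3, 6]])
csr = csr_matrix(dense_matrix)

# indptr now delimits rows, and indices holds column indices.
print(csr.indptr)   # [0 2 3 6]
print(csr.indices)  # [0 2 2 0 1 2]
print(csr.data)     # [1 4 5 2 3 6]

# Matrix-vector product.
v = np.array([1, 1, 1])
print(csr @ v)      # [ 5  5 11]
```

Here `indptr` and `indices` happen to match the CSC example because the nonzero pattern of this matrix is symmetric. Only `data` is ordered differently.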

# Sparse data with Pandas

The Pandas library also provides handling of sparse data structures. It is mainly used to optimize the memory usage of a sparse dataframe.

In [27]:
import pandas as pd
df = pd.DataFrame(dense_matrix)
sdf = df.astype(pd.SparseDtype("float", 0.))

The functionalities of such a dataframe are the same but some functions might take longer, especially the printing functions, since the table needs to be reconstructed from the sparse information to be able to print.
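To get a feel for the memory savings, the sketch below compares a dense and a sparse dataframe holding the same, mostly zero, data. The frame size, the random seed, and the variable names are assumptions for illustration, and the exact byte counts depend on the pandas version, so only the comparison is printed.

```python
import numpy as np
import pandas as pd

# A 10,000 x 4 table of zeros with at most 100 nonzero entries.
rng = np.random.default_rng(0)
values = np.zeros((10_000, 4))
values[rng.integers(0, 10_000, size=100), rng.integers(0, 4, size=100)] = 1.0

df = pd.DataFrame(values)
sdf = df.astype(pd.SparseDtype("float", 0.0))

dense_bytes = df.memory_usage(deep=True).sum()
sparse_bytes = sdf.memory_usage(deep=True).sum()

print(sparse_bytes < dense_bytes)  # True
print(sdf.sparse.density)          # fraction of values actually stored
```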

Note that to build a dataframe from a SciPy sparse matrix, a specific function has to be used: `from_spmatrix()`. It is available through the `DataFrame.sparse` accessor.

In [29]:
import scipy.sparse
mat = scipy.sparse.eye(3)
pd.DataFrame.sparse.from_spmatrix(mat)

     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
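The conversion also works in the other direction. As a hedged sketch, the `DataFrame.sparse` accessor offers `to_coo()`, which should return a SciPy COO matrix when all columns are sparse. The names `back` and `sdf_from_scipy` are introduced only for this example.

```python
import pandas as pd
import scipy.sparse

# Identity matrix stored in a SciPy sparse format.
mat = scipy.sparse.eye(3, format="csr")

# SciPy sparse matrix -> sparse dataframe -> SciPy COO matrix.
sdf_from_scipy = pd.DataFrame.sparse.from_spmatrix(mat)
back = sdf_from_scipy.sparse.to_coo()

print(back.shape)                                # (3, 3)
print((back.toarray() == mat.toarray()).all())   # True
```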


In [34]:
x = [1, 2, 3]
y = [4, 5, 6]
z = ["a", "b", "c"]

[(x[i], y[i], z[i]) for i in range(0, 3)]

[(1, 4, 'a'), (2, 5, 'b'), (3, 6, 'c')]
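The index-based comprehension above gives the same result as `zip()` applied to the three lists. A quick sketch of the comparison, with variable names chosen for illustration:

```python
x = [1, 2, 3]
y = [4, 5, 6]
z = ["a", "b", "c"]

# Index-based comprehension and the zip() equivalent.
triples_comprehension = [(x[i], y[i], z[i]) for i in range(0, 3)]
triples_zip = list(zip(x, y, z))

print(triples_zip)                           # [(1, 4, 'a'), (2, 5, 'b'), (3, 6, 'c')]
print(triples_comprehension == triples_zip)  # True
```

The `zip()` version does not need the length of the lists to be written out by hand.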