# BLU10 - Learning Notebook - Part 2 of 3 - Rating Matrix

In [1]:
import numpy as np
import pandas as pd
import scipy as sp

from scipy.sparse import random, coo_matrix, lil_matrix, dok_matrix, csr_matrix, csc_matrix

# 1 Creating a ratings matrix

## 1.1 Community matrix

As we know, the community matrix represents our entire community (customers, users, whatever you wanna call them!) in a single matrix. The community matrix should be the single source of truth when developing a recommender system.

<br>

If you pause for a moment and think, there are multiple ways for a user to show interest in a specific product or service. It really depends on the recommendations that we are trying to make. You can't, for instance, rate every item that you buy at the supermarket; normally you show interest in an item by buying it (and this could be reinforced by the number of times you buy that item).

**Let's see an example of a community matrix:**

Take $U = \{Ana, Miguel, Beatriz\}$, and $I = \{Bananas, Water, Milk\}$. 

We represent $U \times I$, aka the community matrix, as:

$$\begin{bmatrix}(Ana, Bananas) & (Ana, Water) & (Ana, Milk)\\ (Miguel, Bananas) & (Miguel, Water) & (Miguel, Milk)\\ (Beatriz, Bananas) & (Beatriz, Water) & (Beatriz, Milk)\end{bmatrix}$$

However, as we already know, the community matrix is not a thing *per se*, as these combinations should convey more information - hence let's jump to user opinions!
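
As a minimal sketch, the Cartesian product $U \times I$ above can be enumerated with `itertools.product`. The variable names (`users`, `items`, `community`) are illustrative and not part of the original notebook.

```python
from itertools import product

users = ["Ana", "Miguel", "Beatriz"]
items = ["Bananas", "Water", "Milk"]

# Every (user, item) pair in row-major order: all items for Ana, then Miguel, ...
pairs = list(product(users, items))

# Split the flat list into one row per user, mirroring the matrix layout.
n_items = len(items)
community = [pairs[i * n_items:(i + 1) * n_items] for i in range(len(users))]

for row in community:
    print(row)
```

Each row of `community` holds the pairs for one user, in the same order as the matrix written above.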

## 1.2 Types of data

Users manifest their opinion about an item in different ways.

### Explicit and implicit feedback

Feedback is said to be explicit when provided by the user and implicit if inferred based on user actions (e.g., clicks).

Implicit feedback usually takes the form of unary data (buys/does not buy).

### Rating scale

We write $S$ for the set of possible ratings. For example, in a 1-5 star rating system, $r_{u, i} \in S = \{1, 2, 3, 4, 5\}$.

| Type of data    | Description                          | Rating scale (examples) | Explicit/Implicit |  
|-----------------|--------------------------------------|-------------------------|-------------------|
| Numeric         | Continuous ratings                   | $S = [1, 5]$            | Explicit          |
| Ordinal         | Ordered categories                   | $S = \{1, 2, 3, 4, 5\}$ | Explicit          |
| Binary          | Good or bad  (e.g., Upvote/Downvote) | $S = \{-1, 1\}$         | Explicit          |
| Unary           | User action  (e.g., Click, Purchase) | $S = \{1\}$             | Implicit          |

*Table 1: Different types of data and rating scales*
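
To make the rating scales in Table 1 a bit more concrete, here is a small sketch under assumed data: a hypothetical vector of purchase counts is turned into unary feedback, and a hypothetical list of star ratings is checked against the ordinal scale $S$.

```python
import numpy as np

# Hypothetical purchase counts for one user; NaN marks "no interaction".
counts = np.array([3, np.nan, 1])

# Unary feedback: any observed interaction becomes 1, missing stays missing.
unary = np.where(np.isnan(counts), np.nan, 1.0)

# Ordinal explicit ratings must belong to S = {1, 2, 3, 4, 5}.
S = {1, 2, 3, 4, 5}
ratings = [5, 3, 1]  # hypothetical star ratings
all_valid = all(r in S for r in ratings)

print(unary, all_valid)
```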

## 1.3 Ratings matrix

Consider the following ratings matrix $R$, with $S = \{1, 2, 3, 4, 5\}$, where each row is a user and each column is a product (read each value as the number of times a user bought an item):

$$\begin{bmatrix}1 &  & 2\\ 1 & 5 & \\  & 2 & 1\end{bmatrix}$$

## 1.4 Representing vectors

Let's go bit by bit, starting with the first row of the matrix, corresponding to:

$$\begin{bmatrix}(Ana, Bananas) & (Ana, Water) & (Ana, Milk)\end{bmatrix}$$

To clarify, $I_{Ana} = \{Bananas, Milk\}$ and $(Ana, Water) \notin R$. Right? - (in plain Portuguese, Ana bought Bananas and Milk but did not buy Water).

At the core of NumPy is the homogeneous (i.e., all elements of the same type) n-dimensional array.

Ana's row corresponds to the following NumPy array:

```
┌───┬───┬───┐
│ 1 │   │ 2 │
└───┴───┴───┘
```

We can create a NumPy array using `numpy.array` with an array-like object, a standard Python list in this case.

In [2]:
ana = np.array([1, np.nan, 2])  # np.nan marks the missing (Ana, Water) entry

ana

array([ 1., nan,  2.])
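
The set $I_{Ana}$ defined above can be recovered from the array itself. The sketch below assumes an `items` array holding the column labels in the same order as the matrix columns; that array is an illustration, not part of the original notebook.

```python
import numpy as np

items = np.array(["Bananas", "Water", "Milk"])  # assumed column labels
ana = np.array([1, np.nan, 2])

# NaN never compares equal to itself, so use np.isnan rather than == to find gaps.
rated_mask = ~np.isnan(ana)

# Items with a recorded value for Ana.
items_ana = set(items[rated_mask].tolist())
print(items_ana)
```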

## 1.5 Representing matrices

And you may be thinking: *If each user is an array, multiple arrays are multiple users, right?* 

**YES!**

The following is an example of the community matrix that we have above translated into a ratings matrix $R$ - it's also commonly called a Customers X Products matrix - intuitively the cross between customers and products:
```
┌───┬───┬───┐
│ 1 │   │ 2 │
├───┼───┼───┤
│ 1 │ 5 │   │
├───┼───┼───┤
│   │ 2 │ 1 │
└───┴───┴───┘
```

In [3]:
R = [[1, np.nan, 2], [1, 5, np.nan], [np.nan, 2, 1]]
R = np.array(R)
R

array([[ 1., nan,  2.],
       [ 1.,  5., nan],
       [nan,  2.,  1.]])

Let's select Beatriz: 

In [4]:
R[2]

array([nan,  2.,  1.])
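
Rows are users, so indexing along the second axis selects a product instead. A short sketch, rebuilding $R$ so that it runs on its own:

```python
import numpy as np

R = np.array([[1, np.nan, 2], [1, 5, np.nan], [np.nan, 2, 1]])

water = R[:, 1]         # column 1: every user's value for Water
miguel_water = R[1, 1]  # row 1 (Miguel), column 1 (Water)

# Row indices of the users with a recorded value for Water.
water_buyers = np.flatnonzero(~np.isnan(water))

print(water, miguel_water, water_buyers)
```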

## 1.6 Matrix attributes

Some important attributes of any `ndarray`, to keep in mind.

In [5]:
ndims = R.ndim
nrows = R.shape[0]
ncols = R.shape[1] 
dtype = R.dtype

print("R is a {}-dimensional, {} by {} matrix, of {} elements.".format(ndims, nrows, ncols, dtype))

R is a 2-dimensional, 3 by 3 matrix, of float64 elements.


Cool, so $R$ has two dimensions (customers and products), 3 customers (rows) and 3 products (columns).

## 1.7 Saving the matrix

We can save the matrix to a binary file in NumPy `.npy` format.

Note that `save` is a stand-alone function and not an array method.

In [6]:
np.save('data/interim/ratings_matrix', R);

Alternatively, we can dump the matrix into a `.csv` file, as we would typically do.

In [7]:
np.savetxt("data/interim/ratings_matrix.csv", R, delimiter=",")
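
For completeness, here is a sketch of reading both files back with `np.load` and `np.loadtxt`. It writes to a temporary directory rather than `data/interim/` so that it runs anywhere; that choice is an assumption of this example only.

```python
import os
import tempfile

import numpy as np

R = np.array([[1, np.nan, 2], [1, 5, np.nan], [np.nan, 2, 1]])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "ratings_matrix")  # np.save appends ".npy"
    csv_path = os.path.join(tmp, "ratings_matrix.csv")

    np.save(npy_path, R)
    np.savetxt(csv_path, R, delimiter=",")

    # Read both files back while the temporary directory still exists.
    R_npy = np.load(npy_path + ".npy")
    R_csv = np.loadtxt(csv_path, delimiter=",")

# NaN != NaN, so compare with equal_nan=True.
print(np.allclose(R, R_npy, equal_nan=True), np.allclose(R, R_csv, equal_nan=True))
```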

# 2 Sparse Matrices

These matrices can get quite big as you add users or products!

Huge matrices require a lot of memory, and large ratings matrices are usually very sparse, as recorded ratings are relatively rare. Think of Netflix: you, as a user, have not provided ratings for the vast majority of the movies and TV shows (if any). This means that most of the ratings matrix is full of zeros or missing values, and only a few entries hold information.

<img src="https://i.imgflip.com/4hg29a.jpg" />

This allocation is a waste of resources, as missing values and actual data take up the same space, but only the latter holds any information.

In practice, this leads to matrices that don't fit in memory, despite having a manageable amount of data.

And as your company grows and you get more users or more products, this problem gets even worse!

The premise of sparse data structures is that we *store only non-zero values*, and assume the rest of them are zeros.

**Sparse matrices** allow us to mitigate these problems:
* They are less memory-intensive, as they squeeze out the zeros and store only relevant values;
* Operations ignore zero values, i.e., the majority of the cells.
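
A back-of-the-envelope calculation makes the point. The community size and fill rate below are hypothetical, and the sparse estimate assumes the simplest possible layout: one `float64` value plus two `int32` coordinates per stored entry.

```python
n_users, n_items = 1_000_000, 100_000  # hypothetical community size
bytes_per_value = 8                     # float64

# Dense storage pays for every cell, filled or not.
dense_bytes = n_users * n_items * bytes_per_value

# Assume only 1 cell in 1000 holds a rating.
nnz = n_users * n_items // 1000

# Sparse storage: value (8 bytes) + row index (4 bytes) + column index (4 bytes).
sparse_bytes = nnz * (8 + 4 + 4)

print(dense_bytes / 1e9, "GB dense vs", sparse_bytes / 1e9, "GB sparse")
```

Under these assumptions the dense matrix needs 800 GB, while the stored entries fit in 1.6 GB.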

## 2.1 Sparse Matrices in SciPy

The `scipy.sparse` module implements sparse matrices based on regular NumPy arrays.

For the sake of objectivity, let's compare the sizes of a sparse versus a regular matrix.

We use `scipy.sparse.random` (imported above as `random`) to generate a sparse matrix of a given shape and density (don't worry about this concept, we will explore it better in the next unit; we are just creating a random sparse matrix here), with randomly distributed values.

In [8]:
def sparse_matrix_nbytes(M):
    return M.data.nbytes + M.indptr.nbytes + M.indices.nbytes


A = random(10 ** 3, 10 ** 5, density=.01, format='csr')
sparse_matrix_nbytes(A) / A.toarray().nbytes

0.015005005

So, there's that - the sparse matrix only takes up about 1.5% of the space of the equivalent dense matrix - huge savings!

Let's explore how sparse matrices work and exemplify some implementations (more can be seen in the appendix).

### 2.1.1 Dictionary of Keys (DOK)

The most straightforward implementation of a sparse matrix is as a dictionary of keys, in which the keys are tuples that represent indices.

```
┌───┬───┬───┐          
│ 2 │ 0 │ 0 │          {  
├───┼───┼───┤            (0, 0): 2,
│ 0 │ 5 │ 0 │ → DoK →    (1, 1): 5,
├───┼───┼───┤            (2, 1): 3,
│ 0 │ 3 │ 0 │          }
└───┴───┴───┘ 
```

In [9]:
B = random(5, 5, density=.04, format='dok', random_state=42)

B.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.30424224, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])

In [10]:
dict(B)

{(3, 1): 0.3042422429595377}
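
To see the idea without SciPy, here is a minimal hand-rolled sketch that converts the small matrix from the diagram above into a plain Python dictionary and back. It illustrates the concept only and does not show how `dok_matrix` is implemented internally.

```python
import numpy as np

dense = np.array([[2, 0, 0],
                  [0, 5, 0],
                  [0, 3, 0]])

# Keep only the non-zero cells, keyed by (row, column).
dok = {(int(i), int(j)): int(dense[i, j]) for i, j in zip(*np.nonzero(dense))}
print(dok)

# Rebuild the dense matrix: start from zeros and fill in the stored cells.
rebuilt = np.zeros_like(dense)
for (i, j), value in dok.items():
    rebuilt[i, j] = value
```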

### 2.1.2 Compressed Sparse (CS)

Although the DOK implementation is quite easy to understand, the most used format is the **Compressed Sparse (CS)** and this is the one we are going to use going forward. It has Row and Column variants.

The **Compressed Sparse Row (CSR)** format uses three arrays:
* `data`, the value vector containing all non-zero values in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order)
* `indptr`, the index pointer, which indicates at which element of the value vector each row starts
* `indices`, which contains the column indices (which column each of the values comes from).

```
┌───┬───┬───┐                   
│ 1 │ 0 │ 1 │          Matrix data:    [1, 1, 1] 
├───┼───┼───┤          
│ 0 │ 0 │ 0 │ → CSR →  Matrix indptr:  [0, 2, 2, 3]
├───┼───┼───┤          
│ 0 │ 0 │ 1 │          Matrix indices: [0, 2, 2]
└───┴───┴───┘ 
```

In fact, the index pointers tell us the start and stop positions of each row's slice of `data`: row `i` is given by `data[indptr[i]:indptr[i+1]]`. In the example above:
* The first row is given by `data[0:2]`
* The second row is given by `data[2:2]`
* The third row is given by `data[2:3]`.
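
The slicing rule above can be sketched as a manual decoder that rebuilds the dense matrix from the three CSR arrays of the diagram. This is an illustration of the format, assuming we know the number of columns (`n_cols`), and not SciPy's actual implementation.

```python
import numpy as np

data = np.array([1, 1, 1])
indptr = np.array([0, 2, 2, 3])
indices = np.array([0, 2, 2])
n_cols = 3  # the number of columns is not stored in the three arrays

n_rows = len(indptr) - 1
dense = np.zeros((n_rows, n_cols), dtype=data.dtype)

for row in range(n_rows):
    start, stop = indptr[row], indptr[row + 1]
    # indices[start:stop] are the columns of this row's non-zero values.
    dense[row, indices[start:stop]] = data[start:stop]

print(dense)
```

The second row has `start == stop`, so its slice is empty and the row stays all zeros.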

For a better visualization, check the CSR representation as displayed on a more advanced paper on ["Dynamic-CSR"](https://www.semanticscholar.org/paper/Dynamic-CSR-%3A-A-Format-for-Dynamic-Sparse-Matrix-King-Gilray/cee342df5f4e93747d5d2ff9804b8129f818768c#citing-papers) *[citation: King, James et al. “Dynamic-CSR : A Format for Dynamic Sparse-Matrix Updates.” (2016).]*.

![Compressed Sparse Row Representation](./media/csr.jpg)


The **Compressed Sparse Column (CSC)** format is similar, but the pointers refer to columns and the indices to the rows.

When comparing the two types of Compressed Sparse matrices:
* `CSR` provides efficient row slicing (i.e., accessing and operating on row vectors) but slow column slicing
* `CSC` provides efficient column slicing (i.e., accessing and operating on column vectors) but slow row slicing.
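
As a small sketch of how this plays out in practice, one option is to keep the matrix in `CSR` for row access and convert with `tocsc()` before column access. The matrix below is the one from the CSR diagram, and the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix(np.array([[1, 0, 1],
                         [0, 0, 0],
                         [0, 0, 1]]))

# Row slicing is the cheap operation on CSR.
row0 = M[0, :].toarray()

# Convert once to CSC before doing column slicing.
M_csc = M.tocsc()
col2 = M_csc[:, 2].toarray()

print(row0, col2, sep="\n")
```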



In [11]:
E = random(5, 5, density=.2, format='csr', random_state=65)

E.toarray()

array([[0.22027153, 0.        , 0.        , 0.16514066, 0.        ],
       [0.        , 0.        , 0.73870729, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.92758757, 0.        , 0.        , 0.        , 0.        ],
       [0.10069371, 0.        , 0.        , 0.        , 0.        ]])

In [12]:
E.data

array([0.22027153, 0.16514066, 0.73870729, 0.92758757, 0.10069371])

In [13]:
E.indptr

array([0, 2, 3, 3, 4, 5], dtype=int32)

In [14]:
E.indices

array([0, 3, 2, 0, 0], dtype=int32)

## 2.2 Creating Sparse Matrices

Back to our ratings matrix $R$ from the previous section:

```
    ┌───┬───┬───┐                   
    │ 1 │   │ 2 │
    ├───┼───┼───┤          
R = │ 1 │ 5 │   │
    ├───┼───┼───┤          
    │   │ 2 │ 1 │
    └───┴───┴───┘ 
```

In this section, we build sparse representations of $R$.

We start from our standard array, with the missing entries now written as zeros.

In [15]:
data = np.array([1, 0, 2, 1, 5, 0, 0, 2, 1]).reshape(3, 3)
data

array([[1, 0, 2],
       [1, 5, 0],
       [0, 2, 1]])
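
The same zero-filled array can also be derived from the NaN-based $R$ of the first section. The sketch below uses `np.nan_to_num` and assumes that a missing entry means "no recorded purchase", which is reasonable for purchase counts but may not hold for other rating scales.

```python
import numpy as np

R = np.array([[1, np.nan, 2], [1, 5, np.nan], [np.nan, 2, 1]])

# Replace NaN with 0, then cast the counts back to integers.
data_from_R = np.nan_to_num(R, nan=0).astype(int)
print(data_from_R)
```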

### 2.2.1 DOK

The use-case for `DOK` is incremental construction.

In [16]:
F = dok_matrix((3, 3))

for i in range(3):
    for j in range(3):
        F[i, j] = data[i, j]

F.toarray()

array([[1., 0., 2.],
       [1., 5., 0.],
       [0., 2., 1.]])
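
Incremental construction is most useful when the data arrives one event at a time. As a sketch, assume a hypothetical purchase log of `(user index, item index)` pairs; each event bumps one cell of the `DOK` matrix.

```python
from scipy.sparse import dok_matrix

# Hypothetical purchase log: one (user index, item index) pair per purchase.
purchases = (
    [(0, 0), (0, 2), (0, 2)]      # Ana: 1 x Bananas, 2 x Milk
    + [(1, 0)] + [(1, 1)] * 5     # Miguel: 1 x Bananas, 5 x Water
    + [(2, 1)] * 2 + [(2, 2)]     # Beatriz: 2 x Water, 1 x Milk
)

F = dok_matrix((3, 3), dtype=int)

# Cells that were never touched read as 0, so += works from the first event.
for user, item in purchases:
    F[user, item] += 1

print(F.toarray())
```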

### 2.2.2 Compressed Sparse

NumPy arrays can easily be converted to the `CSR` format, so that we can efficiently operate on them.

In [17]:
H_ = csr_matrix(data)
H_

<3x3 sparse matrix of type '<class 'numpy.int32'>'
	with 6 stored elements in Compressed Sparse Row format>

In [18]:
H_.data

array([1, 2, 1, 5, 2, 1], dtype=int32)

In [19]:
H_.indptr

array([0, 2, 4, 6], dtype=int32)

In [20]:
H_.indices

array([0, 2, 0, 1, 1, 2], dtype=int32)

The process is exactly the same to convert to `CSC`.

In [21]:
H_ = csc_matrix(data)
H_

<3x3 sparse matrix of type '<class 'numpy.int32'>'
	with 6 stored elements in Compressed Sparse Column format>
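
When the dense array is too big to build in the first place, the compressed formats can also be constructed straight from `(row, column, value)` triplets, using the `csr_matrix((data, (row, col)), shape=...)` constructor form. The triplets below are the non-zero entries of $R$; the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# One entry per recorded purchase count: (user index, item index, count).
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 2, 0, 1, 1, 2])
vals = np.array([1, 2, 1, 5, 2, 1])

H = csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(H.toarray())
print(H.indptr, H.indices)
```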

### 2.2.3 From pandas DataFrame to scipy sparse

If you have a pandas DataFrame (containing only numerical values, of course), converting it to scipy sparse takes a single step: pass its underlying values (`df.values`) to the sparse constructor.
This allows you to use pandas to do cool feature engineering, plot some things and pretend you actually understand what the data is telling you.

In [22]:
df = pd.DataFrame({
    'Bananas': [1,1,0],
    'Water': [0,5,2],
    'Milk': [2,0,1]
    },
    index=['Ana', 'Miguel', 'Beatriz']
)
df

|         | Bananas | Water | Milk |
|---------|---------|-------|------|
| Ana     | 1       | 0     | 2    |
| Miguel  | 1       | 5     | 0    |
| Beatriz | 0       | 2     | 1    |


In [23]:
H_ = csc_matrix(df.values)
H_

<3x3 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Column format>

In [24]:
H_.toarray()

array([[1, 0, 2],
       [1, 5, 0],
       [0, 2, 1]], dtype=int64)
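
Note that the sparse matrix keeps only the numbers: the user and product labels stay behind in the DataFrame. One possible approach, sketched below with illustrative variable names, is to keep the index and columns as plain lists and use them to translate positions back into names.

```python
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame(
    {"Bananas": [1, 1, 0], "Water": [0, 5, 2], "Milk": [2, 0, 1]},
    index=["Ana", "Miguel", "Beatriz"],
)

H = csr_matrix(df.values)
users = df.index.to_list()    # position -> user name
items = df.columns.to_list()  # position -> product name

# Which products did Miguel buy, and how many times?
row = H[users.index("Miguel"), :]  # 1 x 3 sparse row
bought = {items[j]: int(v) for j, v in zip(row.indices, row.data)}
print(bought)
```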