# Fun Time! Application on Recommender Systems

We've seen basic operations for loading, filtering, and processing datasets using NumPy, Pandas, and SciPy.

Now we've arrived at the fun part: implementing recommender systems using these libraries.

We will work with recommendation lists of length $n$, i.e., each algorithm will return a list of $n$ items.


In [5]:
import numpy as np
from scipy import sparse

## Dataset Loading

In [22]:
urm_csr = sparse.load_npz("data/urm_csr.npz")
urm_csc = sparse.load_npz("data/urm_csc.npz")

## Constants

In [20]:
recommendation_length = 10
(num_users, num_items), num_interactions = urm_csr.shape, urm_csr.nnz

rng = np.random.default_rng()

print(recommendation_length, num_users, num_items)

10 944 1683


## Recommender: Random

![Random Recommender](images/random.jpg)

The name is more or less self-explanatory. But just to make it clear, it recommends $n$ random items from the catalog.


In [71]:
def random_item_recommender():
    return rng.permutation(np.arange(num_items))[:recommendation_length]
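
As a variation, the same idea can be sketched with `Generator.choice`, which samples distinct indices directly. The function name `random_item_recommender_sketch`, its explicit parameters, and the fixed seed are assumptions made for this illustration only; they are not part of the notebook's code.

```python
import numpy as np

def random_item_recommender_sketch(num_items, recommendation_length, rng):
    # Sample `recommendation_length` distinct item indices from range(num_items).
    return rng.choice(num_items, size=recommendation_length, replace=False)

# Fixed seed (hypothetical) so that the demo is reproducible.
rng_demo = np.random.default_rng(42)
recs = random_item_recommender_sketch(1683, 10, rng_demo)
print(recs)
```

Passing the catalog size and the generator as arguments keeps the function independent of global state, which makes it easier to test.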

## Recommender: Top Popular

This is one of the most basic recommenders out there. It just recommends the most popular items to all users.

The interesting part is counting the number of times users have interacted with each item.

When we have a CSC sparse matrix, we can get the number of elements stored in each column using the `indptr` attribute ([reference](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)) and the `np.ediff1d` function.

More specifically, for a matrix like this:

```python
[[1, 0, 4],
 [0, 0, 5],
 [2, 3, 6]]
```
Here `indptr = np.array([0, 2, 3, 6])` (it always has one more element than the number of columns), `indices = np.array([0, 2, 2, 0, 1, 2])`, and `data = np.array([1, 2, 3, 4, 5, 6])`. For column $i$, the slice `indptr[i]:indptr[i+1]` selects the entries of `indices` (row numbers) and `data` (values) stored in that column. To find where each value sits in the matrix, we do the following:




In [80]:
example_matrix = sparse.csc_matrix(np.array([[1, 0, 4], [0, 0, 5], [2, 3, 6]]))

for column_idx in range(3):
    row_ranges = example_matrix.indices[example_matrix.indptr[column_idx]:example_matrix.indptr[column_idx+1]]
    values = example_matrix.data[example_matrix.indptr[column_idx]:example_matrix.indptr[column_idx+1]]
    print(f"{column_idx = } - {row_ranges = } - {values = }")

column_idx = 0 - row_ranges = array([0, 2], dtype=int32) - values = array([1, 2])
column_idx = 1 - row_ranges = array([2], dtype=int32) - values = array([3])
column_idx = 2 - row_ranges = array([0, 1, 2], dtype=int32) - values = array([4, 5, 6])
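
Going the other way, the three arrays are enough to rebuild the matrix. A minimal sketch, using the `(data, indices, indptr)` form of the `sparse.csc_matrix` constructor; the variable name `rebuilt` is only for illustration.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

# Rebuild the 3x3 matrix from its CSC representation.
rebuilt = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))
print(rebuilt.toarray())
# [[1 0 4]
#  [0 0 5]
#  [2 3 6]]
```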


Thanks to the way `indptr` is constructed, we can deduce that the number of stored (non-zero) elements in column $i$ is simply `indptr[i+1] - indptr[i]`, which is what `np.ediff1d(indptr)` computes for every column at once. After counting the non-zero elements in each column, we sort the column indices with `np.argsort` and take the last $n$ of them in reverse order (`argsort` sorts in ascending order, so the most popular items are at the end).

In [81]:
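def top_popular_item_recommender():
    # Interactions stored per item (column); argsort is ascending, so reverse before taking the top n.
    item_popularity = np.ediff1d(urm_csc.indptr)
    return np.argsort(item_popularity)[::-1][:recommendation_length]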
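
To see the counting and sorting steps on something small, here is a sketch that applies them to the 3x3 example matrix from before. The names `toy`, `item_popularity`, and `top_2` are made up for this illustration.

```python
import numpy as np
from scipy import sparse

toy = sparse.csc_matrix(np.array([[1, 0, 4], [0, 0, 5], [2, 3, 6]]))

# Number of stored elements in each column (i.e., interactions per item).
item_popularity = np.ediff1d(toy.indptr)
print(item_popularity)  # [2 1 3]

# argsort is ascending, so reverse it and keep the first 2 indices.
top_2 = np.argsort(item_popularity)[::-1][:2]
print(top_2)  # [2 0]
```

Column 2 holds three values and column 0 holds two, so items 2 and 0 are the two most popular ones.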

### NumPy

**DISCLAIMER**: I'm not the author of this cheat sheet. All credits go to their respective authors.

![Nice NumPy Cheat Sheet](images/numpy_cheat_sheet.pdf)

### Pandas

**DISCLAIMER**: I'm not the author of this cheat sheet. All credits go to their respective authors.

![Nice Pandas Cheat Sheet](images/pandas_cheat_sheet.pdf)

### SciPy

**DISCLAIMER**: I'm not the author of this cheat sheet. All credits go to their respective authors.

![Nice SciPy Cheat Sheet](images/scipy_cheat_sheet.pdf)

## Some exercises for you

### NumPy

1. What is the difference between `np.loadtxt` and `np.genfromtxt`?
2. What does the `np.vectorize` function do?

### Pandas

1. What is the main benefit of using Pandas file readers, such as `read_csv` or `read_excel`, instead of `np.loadtxt`? What are some limitations that we find in NumPy?
2. Can we create a new column using attribute notation, i.e., `data_df.new_col = <series>`? If not, why?

### SciPy

1. What do you think a LIL matrix could be used for? What are the differences between a Python dictionary and a sparse matrix? Can they be equivalent?

## NumPy

NumPy is **THE** foundation of the whole data science stack in Python. Most of the libraries are built on top of NumPy data structures. Make sure you take your time to understand them.

The basic data structure of NumPy is the `ndarray`, i.e., an array of `n` dimensions that comes with **LOTS** of methods to work with it.
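
A tiny sketch of what that looks like in practice. The array `matrix` is a made-up example; `ndim`, `shape`, `sum`, and `T` are standard `ndarray` attributes and methods.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.ndim)         # 2
print(matrix.shape)        # (2, 3)
print(matrix.sum(axis=0))  # [5 7 9]
print(matrix.T.shape)      # (3, 2)
```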

NumPy is really important in the field because it is **FAST**: most of its routines are implemented in C, and they can run orders of magnitude faster than their pure Python counterparts.

In [4]:
import numpy as np # Convention, just memorize that np.<something> means a numpy method

### You have to think in vectors

Most of the operations are highly optimized to be done in a vectorized way (i.e., without explicit `for` loops).

Let's see an example of this. First we will prepare four arrays.

Two arrays will be implemented as Python `lists` (`huge_array_1` & `huge_array_2`) and two arrays will be NumPy `arrays` (`numpy_huge_array_1` & `numpy_huge_array_2`). We will see the difference in speed by measuring two things: 

1. Data structure efficiency
2. Algorithm "efficiency"

We're not going to do something utterly complicated here. We are just going to add the two arrays element by element 😅


In [5]:
huge_array_1 = list(range(10_000_000))
huge_array_2 = [x * x for x in range(10_000_000)]

numpy_huge_array_1 = np.array(huge_array_1)
numpy_huge_array_2 = np.array(huge_array_2)

In [6]:
def sum_naive_python_loop():
    a = []
    for arr1, arr2 in zip(huge_array_1, huge_array_2):
        a.append(arr1 + arr2)
        
def sum_naive_python_list_comprehension():
    a = [arr1 + arr2 for arr1, arr2 in zip(huge_array_1, huge_array_2)]
    
def sum_naive_numpy_loop():
    a = []
    for arr1, arr2 in zip(numpy_huge_array_1, numpy_huge_array_2):
        a.append(arr1 + arr2)
        
def sum_naive_numpy_list_comprehension():
    a = [arr1 + arr2 for arr1, arr2 in zip(numpy_huge_array_1, numpy_huge_array_2)]
    
def sum_numpy_approved():
    a = numpy_huge_array_1 + numpy_huge_array_2

Let's time all five versions, from the super naive Python `for` loop over Python lists to the vectorized NumPy sum.

In [8]:
%timeit sum_naive_python_loop()
%timeit sum_naive_python_list_comprehension()
%timeit sum_naive_numpy_loop()
%timeit sum_naive_numpy_list_comprehension()
%timeit sum_numpy_approved()

1.29 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.02 s ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.55 s ± 196 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.18 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
26.6 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
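
In the timings above, looping over NumPy arrays is noticeably slower than looping over lists (3.55 s vs 1.29 s), while the vectorized sum is by far the fastest. One likely contributor, sketched below on small made-up arrays, is that iterating over an `ndarray` hands back NumPy scalar objects instead of plain Python `int`s, whereas the vectorized version runs the whole loop in compiled code. Take this as an illustration, not a full profiling analysis.

```python
import numpy as np

small_list = list(range(5))
small_array = np.arange(5)

print(type(small_list[0]))   # <class 'int'>
print(type(small_array[0]))  # a NumPy integer scalar type, e.g. numpy.int64

# The vectorized sum gives the same values as the element-wise Python loop.
loop_sum = [a + b for a, b in zip(small_list, small_list)]
vector_sum = small_array + small_array
print(vector_sum.tolist() == loop_sum)  # True
```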


## Pandas

In [None]:
import pandas as pd

## Cheat Sheets for ya

Some people like to have everything condensed into a single PDF that they can check whenever they have a doubt. So, here you have them 👌

### Matplotlib

**DISCLAIMER**: I'm not the author of this cheat sheet. All credits go to their respective authors.

![Nice Matplotlib Cheat Sheet](images/matplotlib_cheat_sheet.pdf)