# BLU10 - Learning Notebook - Part 3 of 3 - Non-personalized Recommendations

In [1]:
import os
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import numpy as np
import pandas as pd

It is finally time to implement our first basic Recommender System!

# 1 Non-personalized RS

The core function of any RS is to identify useful items for the user.

Going back to our framework, non-personalized RS typically include the base model (users, items, and ratings).

We consider users, however, as providers of ratings, ignoring user preferences at recommendation time.

![Recommender Systems Framework](./media/recommender_systems_framework_base.png)

*Fig.1 - RS framework with a community and the basic model.*

**The rationale is that a generic user also likes something that is liked by many users.**

If we are unable to predict the utility of an item for a particular user, then we recommend an item with high utility for many users. This approach is particularly relevant in the absence of information about the user preferences.

Non-personalized algorithms are a useful way to get a feel for building an RS and to practice our NumPy skills.

# 2 Loading the data

First, we read the data into Python and NumPy.

In [2]:
def read_data():
    
    path = os.path.join('data', 'ml-latest-small', 'ratings.csv')
    data = np.genfromtxt(path, delimiter=',', skip_header=1, usecols=[0, 1, 2])
    return data


data = read_data()
pd.DataFrame(data).head(5)

Unnamed: 0,0,1,2
0,1.0,31.0,2.5
1,1.0,1029.0,3.0
2,1.0,1061.0,3.0
3,1.0,1129.0,2.0
4,1.0,1172.0,4.0


What we have here is:
* User identification (ID) in the first column
* Item identification (ID) in the second column
* Rating in the third column.



We have 100004 user-item pairs, each with the rating that the user gave to the item:

In [3]:
data.shape[0]

100004

**We'll work with this data in array form throughout the lecture.**

For more information, explore the `../data/ml-latest-small/` folder.

# 3 Building the ratings matrix

The second step then is to transform this representation into a ratings matrix, with:
* Users as rows
* Items as columns
* Ratings as values.

We use the unique values for users and items, storing the indices that can be used to reconstruct the original array.

Then, we create a matrix, all filled with zeros, the size we want:
* The number of unique users is the number of rows
* The number of unique items is the number of columns.

Finally, we fill in the rating values, using the stored indexes, in a vectorized way.

## 3.1 Building it with NumPy

In [4]:
def make_ratings(data):
    
    users, user_pos = np.unique(data[:, 0], return_inverse=True)
    items, item_pos = np.unique(data[:, 1], return_inverse=True)
    
    R = np.zeros((len(users), len(items)))
    R[user_pos, item_pos] = data[:, 2]
    
    return R


R = make_ratings(data)
R

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

Take your time to read through and experiment with the code as you go.
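
A small experiment helps to see what `return_inverse` does. The sketch below uses a made-up toy array (`toy`, not part of the dataset) with the same three columns and repeats the steps of `make_ratings` on it.

```python
import numpy as np

# Toy data (hypothetical): user ID, item ID, rating.
toy = np.array([
    [10, 500, 4.0],
    [10, 700, 3.0],
    [42, 500, 5.0],
])

# `users` holds the sorted unique IDs; `user_pos` maps each row of `toy`
# to the position of its user ID inside `users`.
users, user_pos = np.unique(toy[:, 0], return_inverse=True)
items, item_pos = np.unique(toy[:, 1], return_inverse=True)

# Same fill step as in make_ratings, on the toy data.
R_toy = np.zeros((len(users), len(items)))
R_toy[user_pos, item_pos] = toy[:, 2]

print(users, user_pos)  # [10. 42.] [0 0 1]
print(items, item_pos)  # [500. 700.] [0 1 0]
print(R_toy)
```

Note that the rows and columns of the resulting matrix are positions, not the original IDs: `users[k]` gives the user ID behind row `k`, and `items[k]` the item ID behind column `k`.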

**How many values do you think will be non-zero in this matrix?**

In [5]:
len(R[R>0])

100004

Exactly the number of rows we had in our tabular format!

## 3.2 Building it with Pandas 

Another beautiful constructor that we can use, and haven't spoken about yet, is the Pandas `pivot` method. We often want to keep the user and item labels as indexes in a data frame instead of working with a bare NumPy array, and Pandas comes to the rescue.

You have to use the pivot method on a dataframe. It takes as arguments: 
- index: The row index (normally the User)
- columns: The column indexes (normally the Product)
- values: The values of the matrix (normally the Ratings)
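
To see the three arguments in action, here is a minimal sketch on a made-up data frame. The column names `user`, `item` and `rating` are chosen for illustration only; the notebook's own data frame uses the integer labels 0, 1 and 2.

```python
import pandas as pd

# Toy ratings (hypothetical values); column names are for illustration only.
toy = pd.DataFrame({
    "user": [1, 1, 2],
    "item": [31, 1029, 31],
    "rating": [2.5, 3.0, 4.0],
})

# One row per user, one column per item; missing pairs become NaN.
toy_matrix = toy.pivot(index="user", columns="item", values="rating")
print(toy_matrix)
```

Keep in mind that `pivot` raises a `ValueError` when the same (index, column) pair appears more than once, so it assumes each user rated each item at most once.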

In [6]:
pd.DataFrame(data).pivot(index=0, columns=1, values=2).head(5)

1,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,161084.0,161155.0,161594.0,161830.0,161918.0,161944.0,162376.0,162542.0,162672.0,163949.0
0,,,,,,,,,,,,,,,,,,,,,
1.0,,,,,,,,,,,...,,,,,,,,,,
2.0,,,,,,,,,,4.0,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,4.0,...,,,,,,,,,,
5.0,,,4.0,,,,,,,,...,,,,,,,,,,


Nice! We can power this up by replacing the missing values with zeros, using `fillna()`!

In [7]:
pd.DataFrame(data).pivot(index=0, columns=1, values=2).fillna(0).head(5)

1,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,161084.0,161155.0,161594.0,161830.0,161918.0,161944.0,162376.0,162542.0,162672.0,163949.0
0,,,,,,,,,,,,,,,,,,,,,
1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


A thing of beauty. One line of code! But remember that with great power comes great responsibility: this matrix is really hungry and eats a lot of space!
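
To put a rough number on "a lot of space": a dense matrix stores one 8-byte float per user-item pair, whether or not a rating exists. The sketch below uses the shape and rating count reported in this notebook; the comparison with the three-column table is only a back-of-the-envelope estimate.

```python
import numpy as np

n_users, n_items = 671, 9066   # shape of R in this notebook
n_ratings = 100004             # number of rows in the original data

bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes per float64

# Dense matrix: one value per (user, item) pair, rated or not.
dense_bytes = n_users * n_items * bytes_per_value
# Original table: three values (user, item, rating) per existing rating.
triplet_bytes = n_ratings * 3 * bytes_per_value

print(f"dense matrix:  {dense_bytes / 1024**2:.1f} MiB")
print(f"triplet table: {triplet_bytes / 1024**2:.1f} MiB")
```

The dense form is roughly twenty times larger here, and the gap grows quickly as users and items are added.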

# 4 Density and Sparsity score

Now, we compute the Density score and the Sparsity of the ratings matrix.

## 4.1 Density 

We will use the array method `nonzero`, which returns the indices of the non-zero elements as a tuple of arrays (one array per dimension). This is another way to select the elements that are not zero in a matrix. If the matrix contained negative values, `R[R>0]` would miss them, while `nonzero` would not.

In [8]:
R.nonzero()

(array([  0,   0,   0, ..., 670, 670, 670]),
 array([  30,  833,  859, ..., 4597, 4610, 4696]))

We can compute the density score, as: 

$$Density = \frac{|R'|}{|R|}$$

Where $|R'|$ is the number of elements that are not zero in the matrix and $|R|$ is the total number of elements.

In [9]:
R[R.nonzero()].size / R.size

0.016439141608663475

Holy moly, at least now we know what we are up against! Only about 1.6% of the entries in the matrix are non-zero. Density is the number of non-zero elements in a matrix divided by its total number of elements.

## 4.2 Sparsity

Sparsity is the opposite - the number of elements that are zero in a matrix over the total elements in the matrix! Simply put we can also consider:  

$$Sparsity = 1- \frac{|R'|}{|R|}$$

In [10]:
1 - R[R.nonzero()].size / R.size

0.9835608583913366

They complement each other, and both are important attributes when speaking of ratings matrices. In this case:
- **This matrix is ~2% dense and ~98% sparse!**
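
Both scores can be wrapped in small helpers. This is a minimal sketch on a made-up matrix `M`; the function names `density` and `sparsity` are not part of the notebook, and `np.count_nonzero` is used as a shortcut for counting the non-zero entries.

```python
import numpy as np

def density(M):
    # Fraction of entries that are not zero.
    return np.count_nonzero(M) / M.size

def sparsity(M):
    # Fraction of entries that are zero.
    return 1 - density(M)

# Toy matrix (hypothetical): 2 users, 4 items, 3 ratings.
M = np.array([
    [5.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 4.0, 0.0],
])

print(density(M), sparsity(M))  # 0.375 0.625
```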

# 5 Aggregated opinion

Again, the most important idea about non-personalization is that we predict the utility for the entire community.

Perhaps the oldest RS is aggregated opinion, i.e., most popular/hated (Think Billboard or [IMDb Bottom 100](https://www.imdb.com/chart/bottom)).

## 5.1 Most-rated

According to popular opinion, the most popular items are the ones with most ratings. (see what we did here?) 

We start by checking which elements are greater than zero.

In [11]:
def is_rating(R):
    return np.greater(R, 0)

is_rating(R)

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])

Recall that each row corresponds to a user and each column to an item, so we can sum the results in each column to know how many real ratings exist for that item.

In [23]:
def count_ratings(R):
    R_ = is_rating(R)
    return R_.sum(axis=0)

len(count_ratings(R))  # one count per item

9066

Now, we can have a function that retrieves the top-$N$ most-rated items.

In [13]:
def most_rated(R, n):
    R_ = count_ratings(R)
    return np.negative(R_).argsort()[:n]


most_rated(R, 3)

array([321, 266, 284])

## 5.2 Yeah, but what if most ratings are negative?

We can extend the function above to mimic another popular algorithm, "Highest % of Top Ratings".

Let's say a positive rating is anything strictly above 3 (e.g., 3.5 stars or more).

In [14]:
def count_positive_ratings(R, threshold):
    R_ = is_above_threshold(R, threshold)
    return R_.sum(axis=0)


def is_above_threshold(R, threshold):
    return np.greater(R, threshold)


count_positive_ratings(R, 3)

array([182,  51,  24, ...,   1,   0,   1])

Now, we just need to count the number of positive ratings and sort the resulting array.

In [15]:
def most_positive_ratings(R, threshold, n):
    R_ = count_positive_ratings(R, threshold)
    return np.negative(R_).argsort()[:n]


most_positive_ratings(R, 3, 3)

array([284, 321, 266])
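
Note that the function above ranks items by the count of positive ratings. A literal "highest % of top ratings" would divide that count by the number of ratings each item received. Here is a minimal sketch of that variant, assuming zeros mark missing ratings as in `R`; the names `share_positive_ratings` and `R_toy` are made up for this example.

```python
import numpy as np

def share_positive_ratings(R, threshold):
    # Number of ratings per item (zeros are treated as "no rating").
    n_ratings = np.greater(R, 0).sum(axis=0)
    # Number of ratings strictly above the threshold per item.
    n_positive = np.greater(R, threshold).sum(axis=0)
    # Items without any rating keep a share of 0 instead of dividing by zero.
    share = np.zeros(R.shape[1])
    np.divide(n_positive, n_ratings, out=share, where=n_ratings > 0)
    return share

# Toy matrix (hypothetical): 3 users x 3 items, the last item is unrated.
R_toy = np.array([
    [5.0, 2.0, 0.0],
    [4.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
])

print(share_positive_ratings(R_toy, 3))
```

With few ratings the share is noisy: an item with a single 5-star rating gets 100%, so in practice this is often combined with a minimum number of ratings.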

# 6 Powering up with Summary Statistics

Until now we have only used counts: the number of ratings and the number of positive ratings. But we can rely on good old statistics to help us out here.

Probably the most popular non-personalized algorithm is the average rating.

Popularized at first by Amazon and eBay and then by IMDb and Netflix, among others, this is a basic yet widely used algorithm.

The first step is to remove the zeros, so that they don't affect our average.

In [16]:
def remove_zeros(R):
    R_ = R.copy()
    R_[R_ == 0] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    
    return R_


remove_zeros(R)

array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [ 4., nan, nan, ..., nan, nan, nan],
       [ 5., nan, nan, ..., nan, nan, nan]])

And now, we can safely compute the average rating per item and sort the array.

In [22]:
R.shape

(671, 9066)

In [21]:
def mean_ratings(R):
    R_ = remove_zeros(R)
    return np.nanmean(R_, axis=0)


len(mean_ratings(R))  # one mean per item

9066

In [18]:
def best_mean_rating(R, n):
    R_ = mean_ratings(R)
    return np.negative(R_).argsort()[:n]


best_mean_rating(R, 3)

array([9065, 8119, 8125])

There are alternatives, such as computing the "mean rating for users that liked this item", that we don't explore.

It's also increasingly popular to show a histogram alongside mean ratings, to give a sense of the distribution of ratings, or to normalize the mean by the number of ratings so that items with few ratings do not get an advantage.
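
One common way to keep items with few ratings from dominating is a damped mean, which pulls each item's mean towards the global mean. This is a sketch of that idea, not something the functions in this notebook do; the name `damped_mean_ratings` and the damping strength `k` are assumptions made for illustration.

```python
import numpy as np

def damped_mean_ratings(R, k=5):
    # Zeros are treated as "no rating", as elsewhere in this notebook.
    rated = R > 0
    n_ratings = rated.sum(axis=0)
    sum_ratings = R.sum(axis=0)
    global_mean = R[rated].mean()
    # Behaves like adding k imaginary ratings equal to the global mean.
    return (sum_ratings + k * global_mean) / (n_ratings + k)

# Toy matrix (hypothetical): item 0 has a single 5-star rating,
# item 1 has four ratings averaging 4.75, item 2 has four low ratings.
R_toy = np.array([
    [5.0, 5.0, 2.0],
    [0.0, 4.5, 3.0],
    [0.0, 5.0, 2.0],
    [0.0, 4.5, 3.0],
])

damped = damped_mean_ratings(R_toy, k=2)
print(damped)
print(np.argmax(damped))  # item 1 now ranks first
```

With the plain mean, item 0 would win with a single rating of 5. After damping, item 1 comes out on top because its high mean is backed by more ratings.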

# 7 Association rules

Perhaps one of the most interesting (and also very popular) non-personalized algorithms is "people that buy X, also buy Y".

These are called association rules. Here, and for the sake of conciseness, we use `mlxtend` to implement some of them. 

(Yes, we are cheating. We should be implementing it with NumPy, but for simplicity's sake we will use `mlxtend`!)

## 7.1 Apriori

Apriori is used to identify common item sets, i.e., stuff that usually goes together:
* We identify individual items that satisfy a minimum occurrence threshold
* Then, we extend the item sets, adding one item at a time 
* Every time we check if the resulting item set satisfies the specified threshold
* The algorithm stops when there are no more items to add that meet the threshold. 
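
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration on a made-up boolean matrix, not how `mlxtend` implements it; the function name `toy_apriori` and the matrix `B` are hypothetical.

```python
import numpy as np

def toy_apriori(B, min_support):
    """Level-wise search for frequent item sets in a boolean users x items matrix."""
    n_items = B.shape[1]

    def support(itemset):
        # Fraction of users (rows) that contain every item in the set.
        return B[:, list(itemset)].all(axis=1).mean()

    # Step 1: individual items that meet the threshold.
    current = [(i,) for i in range(n_items) if support((i,)) >= min_support]
    frequent = {s: support(s) for s in current}

    # Steps 2 to 4: grow the sets by one item until nothing new qualifies.
    while current:
        candidates = set()
        for s in current:
            # Only extend with frequent items that come after the last one,
            # so each item set is generated once, in sorted order.
            for i in range(s[-1] + 1, n_items):
                if (i,) in frequent:
                    candidates.add(s + (i,))
        current = sorted(c for c in candidates if support(c) >= min_support)
        frequent.update({c: support(c) for c in current})

    return frequent

# Toy matrix (hypothetical): 4 users x 3 items, True = user rated the item.
B = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
], dtype=bool)

result = toy_apriori(B, min_support=0.5)
for itemset, sup in result.items():
    print(itemset, sup)
```

The key idea is that a set can only be frequent if all of its subsets are frequent, which is why the search only extends sets that already met the threshold.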

`mlxtend` expects a one-hot input, i.e., 0/1 or True/False.

(Unfortunately, `mlxtend` only supports dataframes at this point.)

In [19]:
def get_apriori_itemsets(R, min_support=0.3):
    R_ = pd.DataFrame(R > 0)
    return apriori(R_, min_support)


get_apriori_itemsets(R)

Unnamed: 0,support,itemsets
0,0.368107,(0)
1,0.339791,(100)
2,0.433681,(232)
3,0.482861,(266)
4,0.463487,(284)
5,0.508197,(321)
6,0.317437,(406)
7,0.408346,(427)
8,0.363636,(472)
9,0.320417,(521)


## 7.2 Support

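Support is the fraction of users whose ratings contain the item set, where $U_{i, j}$ is the set of users that rated both $i$ and $j$ and $m$ is the total number of users: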

$$Support\{i, j\} = \frac{|U_{i, j}|}{|U|} = \frac{|U_{i, j}|}{m}$$

## 7.3 Confidence

Given two sets, the confidence refers to how frequently the item $j$ is purchased, given that item $i$ was purchased, as:

$$Confidence\{i \to j \} = \frac{Support\{i, j\}}{Support\{i\}}$$

Or, in a more familiar way, confidence is the conditional probability of $j$ given $i$:

$$P(j|i) = \frac{P(i \cap j)}{P(i)}$$

However, do $i$ and $j$ occur for the same users for a reason, or is it random? What if $j$ is simply a very popular item?

## 7.4 Lift

Meet the bananas trap: just because people buy bananas most times, it doesn't mean bananas go well with soap.

Fortunately, there is a better way: lift, a metric that takes the popularity of both items into consideration.

$$Lift\{i, j\} = \frac{Support\{i, j\}}{Support\{i\} * Support\{j\}}$$

The denominator is the probability that $i$ and $j$ would appear together if they were independent, so lift asks whether $i$ makes $j$ more probable or not: a lift above 1 means they co-occur more often than chance alone would suggest. Think of lift as a metric that is able to take into account the actual popularity of the item.
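
The three metrics are easy to compute by hand with NumPy. Here is a minimal sketch on a made-up boolean matrix `B` (hypothetical data, two items only), where column 0 plays the role of $i$ and column 1 the role of $j$.

```python
import numpy as np

# Toy boolean matrix (hypothetical): 5 users x 2 items, True = user rated the item.
B = np.array([
    [True,  True],
    [True,  True],
    [True,  False],
    [False, True],
    [False, False],
])

support_i = B[:, 0].mean()               # 3 of 5 users rated i
support_j = B[:, 1].mean()               # 3 of 5 users rated j
support_ij = (B[:, 0] & B[:, 1]).mean()  # 2 of 5 users rated both

confidence = support_ij / support_i          # P(j | i)
lift = support_ij / (support_i * support_j)  # above 1: i makes j more likely

print(confidence, lift)
```

Here the confidence is 2/3 and the lift is about 1.11, so users who rated $i$ are slightly more likely to have rated $j$ than a random user.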

In [20]:
def get_rules(R, min_support=.3, min_threshold=1.2):
    itemsets = get_apriori_itemsets(R, min_support=min_support)  # use the argument, not a hard-coded 0.3
    return association_rules(itemsets, metric="lift", min_threshold=min_threshold)


get_rules(R)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(232),(953),0.433681,0.348733,0.302534,0.697595,2.000367,0.151295,2.153621
1,(953),(232),0.348733,0.433681,0.302534,0.867521,2.000367,0.151295,4.274794
2,(266),(284),0.482861,0.463487,0.326379,0.675926,1.458348,0.102578,1.655525
3,(284),(266),0.463487,0.482861,0.326379,0.70418,1.458348,0.102578,1.748153
4,(321),(266),0.508197,0.482861,0.344262,0.677419,1.402927,0.098874,1.60313
5,(266),(321),0.482861,0.508197,0.344262,0.712963,1.402927,0.098874,1.713379
6,(266),(525),0.482861,0.453055,0.33383,0.691358,1.525991,0.115067,1.772101
7,(525),(266),0.453055,0.482861,0.33383,0.736842,1.525991,0.115067,1.965127
8,(321),(284),0.508197,0.463487,0.321908,0.633431,1.366663,0.086365,1.463607
9,(284),(321),0.463487,0.508197,0.321908,0.694534,1.366663,0.086365,1.610009
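
As a sanity check, the formulas from the previous sections can be applied to the first rule in the table above, (232) -> (953). The three input values are copied from that row; the leverage line assumes the usual definition, support minus the product of the individual supports.

```python
# Values copied from the first rule in the table above: (232) -> (953).
antecedent_support = 0.433681
consequent_support = 0.348733
support = 0.302534

confidence = support / antecedent_support
lift = support / (antecedent_support * consequent_support)
# Assumed definition: how far the joint support is from independence.
leverage = support - antecedent_support * consequent_support

print(round(confidence, 4), round(lift, 4), round(leverage, 4))
```

The results agree with the `confidence`, `lift` and `leverage` columns of that row, up to the rounding of the inputs.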


Wrapping up: 
- Non-personalized recommenders do not take into account specific users' preferences or characteristics.
- Non-personalized approaches are the simplest way to build recommendation engines.
- It's really important to know how to handle matrix sparsity, as it will impact your workflow until the end.

Now, we have the foundations to tackle more complex recommendation approaches.

Time to practice!