# BLU10 - Learning Notebook - Part 3 of 3 - Non-personalized Recommendations

In [None]:
import os
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import numpy as np
import pandas as pd

# 1 Non-personalized RS

The core function of any RS is to identify useful items for the user.

Going back to our framework, non-personalized RS typically include the base model (users, items, and ratings).

We consider users, however, as providers of ratings, ignoring user preferences at recommendation time.

![Recommender Systems Framework](./media/recommender_systems_framework.png)

*Fig.1 - RS framework with a community, the basic and extended models.*

The rationale is that an item liked by many users is also likely to be liked by a generic user.

If we are unable to predict the utility of an item for a particular user, then we recommend an item with high utility for many users.

This approach is particularly relevant in the absence of information about the user preferences.

Non-personalized algorithms are useful to get a sense of how to build an RS and to practice our NumPy skills.

# 2 Loading the data

First, we read the data into Python and NumPy.

In [None]:
def read_data():
    
    path = os.path.join('data', 'ml-latest-small', 'ratings.csv')
    data = np.genfromtxt(path, delimiter=',', skip_header=1, usecols=[0, 1, 2])
    return data


data = read_data()
data

Although a printed NumPy array is not as polished as a Pandas dataframe, what we have here is:
* User identification in the first column
* Item identification in the second column
* Rating in the third column.

This array is, in short, our set of recorded ratings $R'$, presented in long-form.

For more information, explore the `../data/ml-latest-small/` folder.

# 3 Building the ratings matrix

The second step then is to transform this representation into a ratings matrix, with:
* Users as rows
* Items as columns
* Ratings as values.

We use the unique values for users and items, storing the indices that can be used to reconstruct the original array.

Then, we create a matrix, all filled with zeros, the size we want:
* The number of unique users is the number of rows
* The number of unique items is the number of columns.

Finally, we fill in the rating values, using the stored indices, in a vectorized way.

In [None]:
def make_ratings(data):
    
    users, user_pos = np.unique(data[:, 0], return_inverse=True)
    items, item_pos = np.unique(data[:, 1], return_inverse=True)
    
    R = np.zeros((len(users), len(items)))
    R[user_pos, item_pos] = data[:, 2]
    
    return R


R = make_ratings(data)
R

Take your time to read through and experiment with the code as you go.
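
To see what `return_inverse=True` gives us, here is a minimal sketch on a tiny, made-up array (the ids and ratings in `toy` are hypothetical, not taken from the dataset). It repeats the same steps as `make_ratings`, so that the positions and the resulting matrix are small enough to inspect by hand.

```python
import numpy as np

# Hypothetical long-form data: (user id, item id, rating).
toy = np.array([
    [10, 500, 4.0],
    [10, 700, 3.0],
    [42, 500, 5.0],
    [99, 900, 1.0],
])

# return_inverse maps each original value to its position among the sorted uniques.
users, user_pos = np.unique(toy[:, 0], return_inverse=True)
items, item_pos = np.unique(toy[:, 1], return_inverse=True)

toy_R = np.zeros((len(users), len(items)))
toy_R[user_pos, item_pos] = toy[:, 2]  # one (row, column) assignment per rating

print(user_pos)  # [0 0 1 2]
print(item_pos)  # [0 1 0 2]
print(toy_R)
```

User 10 rated two items, so position 0 shows up twice in `user_pos`, and the first row of `toy_R` holds two ratings.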

# 4 Sparsity score

Now, we compute the sparsity score of the ratings matrix.

We will use the array method `nonzero`, which returns the indices (one array per dimension) of the elements that are non-zero.

In [None]:
R.nonzero()

As we've seen, we compute the sparsity score as the fraction of cells that hold a recorded rating:

$$Sparsity = \frac{|R'|}{|R|}$$

In [None]:
R[R.nonzero()].size / R.size

Holy moly, at least now we know what we are up against!
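
As a sanity check, the same score can be sketched on a toy matrix with `np.count_nonzero`, which counts the recorded ratings without building index arrays first. The matrix `toy_R` and the helper name `filled_fraction` are made up for illustration. With this definition, a low score means a sparse matrix.

```python
import numpy as np

# Hypothetical 3 users x 3 items matrix, zeros mean "no rating".
toy_R = np.array([
    [4.0, 3.0, 0.0],
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])


def filled_fraction(R):
    # Recorded ratings divided by the total number of cells.
    return np.count_nonzero(R) / R.size


score = filled_fraction(toy_R)
print(score)  # 4 recorded ratings out of 9 cells
```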

# 5 Aggregated opinion

Again, the most important idea about non-personalization is that we predict the utility for the entire community.

Perhaps the oldest RS is aggregated opinion, i.e., most popular/hated (think Billboard or [IMDb Bottom 100](https://www.imdb.com/chart/bottom)).

## 5.1 Most-rated

We can think of the most popular as the most rated.

We start by checking which elements are greater than zero.

In [None]:
def is_rating(R):
    return np.greater(R, 0)


is_rating(R)

Recalling that each row corresponds to a user and each column to an item, we can sum the results in each column to know how many ratings exist for that item.

In [None]:
def count_ratings(R):
    R_ = is_rating(R)
    return R_.sum(axis=0)


count_ratings(R)

Now, we can have a function that retrieves the top-$N$ most-rated items.

In [None]:
def most_rated(R, n):
    R_ = count_ratings(R)
    return np.negative(R_).argsort()[:n]


most_rated(R, 3)
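
The sorting trick deserves a closer look: `argsort` sorts in ascending order, so negating the counts puts the largest ones first. The sketch below uses made-up counts and, as an optional extra, passes `kind="stable"` so that tied items keep their column order.

```python
import numpy as np

counts = np.array([2, 5, 1, 5])  # hypothetical number of ratings per item

# Negating turns "largest count" into "smallest value", which argsort puts first.
# kind="stable" keeps the lower column index first when counts tie.
top_2 = np.argsort(-counts, kind="stable")[:2]
print(top_2)  # [1 3]
```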

## 5.2 % > X

We can extend the function above to mimic another popular algorithm, "% of people that like this item".

Let's say a positive rating is anything strictly above 3 (e.g., 4 stars).

In [None]:
def count_positive_ratings(R, threshold):
    R_ = is_above_threshold(R, threshold)
    return R_.sum(axis=0)


def is_above_threshold(R, threshold):
    return np.greater(R, threshold)


count_positive_ratings(R, 3)

Now, we just need to sort the positive-rating counts and keep the top-$N$ items.

In [None]:
def most_positive_ratings(R, threshold, n):
    R_ = count_positive_ratings(R, threshold)
    return np.negative(R_).argsort()[:n]


most_positive_ratings(R, 3, 3)
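
The functions above rank items by the number of positive ratings. If we want an actual percentage, one possible sketch is to divide by the number of ratings each item received. The helper name `share_positive` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3 users x 3 items matrix, zeros mean "no rating".
toy_R = np.array([
    [4.0, 3.0, 0.0],
    [5.0, 0.0, 0.0],
    [2.0, 0.0, 1.0],
])


def share_positive(R, threshold):
    n_ratings = (R > 0).sum(axis=0)
    n_positive = (R > threshold).sum(axis=0)
    # Items without any rating get a share of 0 instead of a division by zero.
    return np.divide(n_positive, n_ratings,
                     out=np.zeros(R.shape[1]), where=n_ratings > 0)


shares = share_positive(toy_R, 3)
print(shares)
```

Keep in mind that a share rewards items with very few ratings: a single 5-star rating yields 100%, which the raw count does not.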

# 6 Summary statistics

Probably the most popular non-personalized algorithm is the average rating.

Popularized at first by Amazon and eBay, and then by IMDb and Netflix, among others, this is a basic yet widely used algorithm.

The first step is to remove the zeros, so that they don't affect our average.

In [None]:
def remove_zeros(R):
    R_ = R.copy()
    R_[R_ == 0] = np.nan
    
    return R_


remove_zeros(R)

And now, we can safely compute the average rating per item and sort the array.

In [None]:
def mean_ratings(R):
    R_ = remove_zeros(R)
    return np.nanmean(R_, axis=0)


mean_ratings(R)

In [None]:
def best_mean_rating(R, n):
    R_ = mean_ratings(R)
    return np.negative(R_).argsort()[:n]


best_mean_rating(R, 3)

There are alternatives, such as computing the "mean rating for users that liked this item", that we don't explore.

It's also increasingly popular to show a histogram alongside mean ratings, to give a sense of the distribution of ratings.
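
A rating histogram for a single item could be sketched as follows. The function name `rating_histogram` and its `step` and `max_rating` parameters are hypothetical, and the sketch assumes ratings are multiples of `step` (half-star steps up to 5.0 here).

```python
import numpy as np

# Hypothetical 4 users x 2 items matrix, zeros mean "no rating".
toy_R = np.array([
    [4.0, 3.0],
    [5.0, 0.0],
    [4.0, 0.5],
    [0.0, 3.0],
])


def rating_histogram(R, item, step=0.5, max_rating=5.0):
    ratings = R[:, item]
    ratings = ratings[ratings > 0]  # drop the missing ratings
    n_bins = int(round(max_rating / step))
    values = step * np.arange(1, n_bins + 1)  # 0.5, 1.0, ..., 5.0
    # Map each rating to the index of its value (0.5 -> 0, 1.0 -> 1, ...).
    bins = np.rint(ratings / step).astype(int) - 1
    counts = np.bincount(bins, minlength=n_bins)
    return values, counts


values, counts = rating_histogram(toy_R, item=0)
print(dict(zip(values.tolist(), counts.tolist())))
```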

# 7 Association rules

Perhaps one of the most interesting (and also very popular) non-personalized algorithms is "people that buy X, also buy Y".

These are called association rules. Here, and for the sake of conciseness, we use `mlxtend` to implement some of them. 

(Yes, we are cheating. We should be implementing it with NumPy!)

## 7.1 Apriori

Apriori is used to identify frequent item sets, i.e., stuff that usually goes together:
* We identify individual items that satisfy a minimum occurrence threshold
* Then, we extend the item sets, adding one item at a time 
* Every time we check if the resulting item set satisfies the specified threshold
* The algorithm stops when there are no more items to add that meet the threshold. 
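
The steps above can be sketched in plain Python and NumPy. This is a simplified illustration, not the `mlxtend` implementation: the function name `apriori_sketch` and the toy matrix `B` are made up, and the candidate generation simply extends each frequent set with a frequent item of larger index.

```python
import numpy as np


def apriori_sketch(B, min_support):
    """Return {item set: support} for a boolean users-by-items matrix B."""

    def support(itemset):
        # Fraction of users (rows) that have every item in the set.
        return B[:, list(itemset)].all(axis=1).mean()

    # Step 1: individual items that meet the threshold.
    singles = [i for i in range(B.shape[1]) if support((i,)) >= min_support]
    current = [(i,) for i in singles]
    frequent = {}
    while current:  # stops when no extended set meets the threshold
        frequent.update({s: support(s) for s in current})
        # Step 2: extend each frequent set by one item at a time.
        candidates = {s + (i,) for s in current for i in singles if i > s[-1]}
        # Step 3: keep only the extended sets that still meet the threshold.
        current = sorted(c for c in candidates if support(c) >= min_support)
    return frequent


# Hypothetical one-hot matrix: 4 users (rows), 3 items (columns).
B = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=bool)

frequent = apriori_sketch(B, min_support=0.5)
print(frequent)
```

On this toy matrix, the pair of items 0 and 2 and the full triple are dropped, because only one user in four has them together.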

`mlxtend` expects a one-hot input, i.e., 0/1 or True/False.

(Unfortunately, `mlxtend` only supports dataframes at this point. We are still cheating.)

In [None]:
def get_apriori_itemsets(R, min_support=0.3):
    R_ = pd.DataFrame(R > 0)
    return apriori(R_, min_support)


get_apriori_itemsets(R)

## 7.2 Support

Support is the fraction of users whose ratings contain the item set, so:

$$Support\{i, j\} = \frac{|U_{i, j}|}{|U|} = \frac{|U_{i, j}|}{m}$$
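
As a minimal sketch, support can be computed directly on a one-hot matrix with NumPy. The matrix `B` and the helper name `support` are made up for illustration.

```python
import numpy as np

# Hypothetical one-hot matrix: rows are users, columns are items.
B = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=bool)


def support(B, items):
    # |U_{items}| / m: users that have all the items, over all users.
    return B[:, items].all(axis=1).sum() / B.shape[0]


print(support(B, [0, 1]))  # 0.5
```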

## 7.3 Confidence

Given two items, the confidence is how frequently item $j$ is purchased, given that item $i$ was purchased:

$$Confidence\{i \to j \} = \frac{Support\{i, j\}}{Support\{i\}}$$

Or, in a more familiar way, confidence is the conditional probability of $j$ given $i$:

$$P(j|i) = \frac{P(i \cap j)}{P(i)}$$
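
The following sketch computes confidence on a made-up one-hot matrix (the names `B`, `support` and `confidence` are illustrative). It also shows that confidence is not symmetric: swapping $i$ and $j$ changes the denominator.

```python
import numpy as np

# Hypothetical one-hot matrix: rows are users, columns are items.
B = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
], dtype=bool)


def support(B, items):
    return B[:, items].all(axis=1).mean()


def confidence(B, i, j):
    # Support{i, j} / Support{i}, i.e., an estimate of P(j | i).
    return support(B, [i, j]) / support(B, [i])


conf_2_to_1 = confidence(B, 2, 1)  # everyone with item 2 also has item 1
conf_1_to_2 = confidence(B, 1, 2)  # only two of the three users with item 1 have item 2
print(conf_2_to_1, conf_1_to_2)
```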

However, do $i$ and $j$ occur for the same users for a reason, or is it random? What if $j$ is simply a very popular item?

## 7.4 Lift

Meet the bananas trap: just because people buy bananas most times, it doesn't mean bananas go well with soap.

Fortunately, there is a better way.

The lift metric takes into consideration the popularity of both items.

$$Lift\{i, j\} = \frac{Support\{i, j\}}{Support\{i\} * Support\{j\}}$$

The denominator is the probability that $i$ and $j$ appear together if they were independent, so lift tells us whether $i$ makes $j$ more probable (lift above 1) or not.
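
The bananas trap can be reproduced with a few made-up baskets. In this hypothetical data, 8 out of 10 users buy bananas and 5 buy soap, with the overlap you would expect by chance. Confidence looks impressive, while lift reveals there is no real association.

```python
import numpy as np

# Hypothetical purchases for 10 users.
bananas = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
soap = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], dtype=bool)

support_both = (bananas & soap).mean()

# Confidence{soap -> bananas}: high, simply because bananas are popular.
confidence = support_both / soap.mean()

# Lift{soap, bananas}: close to 1, so soap does not make bananas more likely.
lift = support_both / (soap.mean() * bananas.mean())

print(confidence, lift)
```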

In [None]:
def get_rules(R, min_support=.3, min_threshold=1.2):
    itemsets = get_apriori_itemsets(R, min_support=min_support)
    return association_rules(itemsets, metric="lift", min_threshold=min_threshold)


get_rules(R)

Now, we have the foundations to tackle more complex recommendation approaches.

Time to practice!