# Recommendations Using the Mean

## Imports

In [1]:
import numpy as np
import pandas as pd

## Load Data

Only loading a subset of the original data set for proof of concept reasons.

In [2]:
# 80/20 split earlier
df_train = pd.read_csv('../Data/training_data_subset.csv')
df_test = pd.read_csv('../Data/testing_data_subset.csv')

In [3]:
df_train.head(2)

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style,for_testing
0,"['Grocery & Gourmet Food', 'Candy & Chocolate'...","YumEarth Organic Gummy Bears, 10 Count","['B008CC8UXC', 'B00C25LO8S', 'B073RWDCMD', 'B0...",YumEarth,"129,438 in Grocery & Gourmet Food (","['B008CC8UXC', 'B00C25LNWA', 'B008CC8ULY', 'B0...",Grocery,,B008B7JNRA,3.0,True,A35KP4ROS9KWPO,,"{'Size:': ' 10 Count', 'Style:': ' Natural Gum...",False
1,"['Grocery & Gourmet Food', 'Jams, Jellies & Sw...",Bell Plantation Powdered PB2 Bundle: 1 Peanut ...,"['B06W9N8X9H', 'B06X15V3DC', 'B01ENYJX3S', 'B0...",PB2,"1,214 in Grocery & Gourmet Food (",,Grocery,18.49,B00H9H56QA,5.0,True,AVAMZWS7AAI1S,,{'Size:': ' Pack of 2 (1 each flavor)'},False


In [4]:
df_test.head(2)

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style,for_testing
0,"['Grocery & Gourmet Food', 'Snack Foods', 'Bar...","Grocery &amp; Gourmet Food"" />","['B01MT0QDPO', 'B00NL17FE4', 'B01NBM9OJN', 'B0...",Nature Valley,"16,921 in Grocery & Gourmet Food (",,Grocery,18.04,B001E6GFR6,5.0,True,A2IUE299OONA73,,,True
1,"['Grocery & Gourmet Food', 'Snack Foods', 'Chi...",Gourmet Basics Smart Fries 4-Flavor Variety Pa...,"['B0763SHX4W', 'B0040FIHS8', 'B00FYR5HS4', 'B0...",Gourmet Basics,"53,167 in Grocery & Gourmet Food (",,Grocery,23.99,B003AZ2ECY,4.0,True,A38NO7J1TK4R1W,,,True


### RMSE

In [5]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

### Evaluation method

In [6]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    
    ids_to_estimate = zip(df_test.reviewerID, df_test.asin)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = df_test.overall.values
    return compute_rmse(estimated, real)

## Well-known Solutions to the Recommendation Problem

### Content-based filtering

*Recommend based on the user's rating history.* 

Generic expression (notice how this is kind of a 'row-based' approach):

$$\newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}]$$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_u = \frac{\sum_{i' \in I(u)} r_{u,i'}}{|I(u)|} $$

In [7]:
def content_mean(product_id, user_id):
    """ Simple content-filtering based on mean ratings. """
    
    user_condition = df_train.reviewerID != user_id
    movie_condition = df_train.asin == product_id
    ratings_by_others = df_train.loc[user_condition & movie_condition]
    if ratings_by_others.empty:
        return 4.0
    else:
        return df_train.loc[user_condition, 'overall'].mean()

In [8]:
# Test model
print('RMSE for content mean: %s' % evaluate(content_mean))

RMSE for content mean: 1.1813128290169375


### Collaborative filtering

*Recommend based on other user's rating histories.* 

Generic expression (notice how this is kind of a 'col-based' approach):

$$\newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{u' \in U(i)} [r_{u',i}]$$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

In [9]:
def collaborative_mean(user_id, product_id):
    """ Simple collaborative filter based on mean ratings. """
    
    user_condition = df_train.reviewerID != user_id
    movie_condition = df_train.asin == product_id
    ratings_by_others = df_train.loc[user_condition & movie_condition]
    if ratings_by_others.empty:
        return 4.0
    else:
        return ratings_by_others.overall.mean()

In [10]:
# Test model
print(f'RMSE for collaborative mean is: {evaluate(collaborative_mean)}.')

RMSE for collaborative mean is: 1.2344329617156664.


The rating for the collaborative mean is worse (higher error) than the rating using the content mean above.

### Generalizations of the aggregation function for content-based filtering: incorporating similarities

Possibly incorporating metadata about items, which makes the term 'content' make more sense now.

$$ r_{u,i} = k \sum_{i' \in I(u)} sim(i, i') \; r_{u,i'} $$

$$ r_{u,i} = \bar r_u + k \sum_{i' \in I(u)} sim(i, i') \; (r_{u,i'} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{i' \in I(u)} |sim(i,i')|} $$

and $\bar r_u$ is the average rating of user u:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$

### Generalizations of the aggregation function for collaborative filtering: incorporating similarities

Possibly incorporating metadata about users.

$$ r_{u,i} = k \sum_{u' \in U(i)} sim(u, u') \; r_{u',i} $$

$$ r_{u,i} = \bar r_u + k \sum_{u' \in U(i)} sim(u, u') \; (r_{u',i} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{u' \in U(i)} |sim(u,u')|} $$

and $\bar r_u$ is the average rating of user u:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$

## Summary
- The content mean (simple average of the product) is better (has a lower error) than the collaborative mean (simple average of the users).
- However, this RMSE of 1.1813 is still worse than the best baseline function in the earlier version.
- Therefore, will use custom similarity functions in the next version of the notebooks for improvements.

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  