# Recommendations Using the Mean

## Imports

In [None]:
import numpy as np
import pandas as pd

## Load Data

Only a subset of the original data set is loaded, as a proof of concept.

In [None]:
# The data was split 80/20 into training and test sets in an earlier step.
df_train = pd.read_csv('../Data/training_data_subset.csv')
df_test = pd.read_csv('../Data/testing_data_subset.csv')

In [None]:
df_train.head(2)

In [None]:
df_test.head(2)

### RMSE

In [None]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
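
As a quick sanity check, here is a minimal worked example of the RMSE computation. The rating arrays are made up for illustration and are not from the data set. It also shows that a single `NaN` prediction turns the whole score into `NaN`.

```python
import numpy as np

def compute_rmse(y_pred, y_true):
    """Compute Root Mean Squared Error."""
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

# Toy ratings, made up for illustration only.
y_true = np.array([5.0, 3.0, 4.0, 1.0])
y_pred = np.array([4.0, 3.0, 5.0, 3.0])

# Errors are -1, 0, 1, 2; squared errors are 1, 0, 1, 4; their mean is 1.5.
rmse = compute_rmse(y_pred, y_true)

# One NaN among the predictions propagates through np.mean.
rmse_with_nan = compute_rmse(np.array([4.0, np.nan, 5.0, 3.0]), y_true)
```

Here `rmse` equals the square root of 1.5, and `rmse_with_nan` is `NaN`.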

### Evaluation method

In [None]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    
    ids_to_estimate = zip(df_test.reviewerID, df_test.asin)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = df_test.overall.values
    return compute_rmse(estimated, real)

## Well-known Solutions to the Recommendation Problem

### Content-based filtering

*Recommend based on the user's rating history.* 

Generic expression (note that this is a 'row-based' approach: it aggregates over one user's row of the rating matrix):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}]$$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_u = \frac{\sum_{i' \in I(u)} r_{u,i'}}{|I(u)|} $$

In [None]:
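def content_mean(user_id, product_id):
    """ Simple content-filtering based on mean ratings. """

    # Mean of all training ratings by this user; product_id is not used.
    # The result is NaN if the user has no ratings in df_train.
    user_condition = df_train.reviewerID == user_id
    return df_train.loc[user_condition, 'overall'].mean()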

In [None]:
# Specific example
content_mean('ACOICLIJQYECU', '4639725043')

In [None]:
# Long-running process (several hours). Uncomment to run.
# print('RMSE for content mean: %s' % evaluate(content_mean))

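In case the output is cleared: the RMSE using the content mean is `nan`.
A likely cause is that the mean of an empty selection is `NaN`, so any test user without ratings in the training subset gets a `NaN` prediction, which then propagates through `np.mean`.

TODO: Research ways to fix.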
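
One possible fix, sketched here under assumptions, is to fall back to the global mean rating when a user has no training ratings. The function name `content_mean_with_fallback`, the choice of fallback, and the toy `df_train` are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy training data, made up for illustration; the notebook loads df_train from CSV.
df_train = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2'],
    'asin': ['p1', 'p2', 'p1'],
    'overall': [5.0, 3.0, 2.0],
})

# Fallback for users without training ratings (an assumption, not the author's choice).
global_mean = df_train['overall'].mean()

def content_mean_with_fallback(user_id, product_id):
    """User-mean prediction that falls back to the global mean for unseen users."""
    user_ratings = df_train.loc[df_train.reviewerID == user_id, 'overall']
    if user_ratings.empty:
        return global_mean
    return user_ratings.mean()

known = content_mean_with_fallback('u1', 'p9')    # mean of this user's ratings
unknown = content_mean_with_fallback('u3', 'p1')  # unseen user: global mean
```

With this fallback every prediction is a finite number, so the RMSE can no longer become `NaN` because of unseen users.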

### Collaborative filtering

*Recommend based on other users' rating histories.*

Generic expression (note that this is a 'column-based' approach: it aggregates over one item's column of the rating matrix):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{u' \in U(i)} [r_{u',i}] $$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

In [None]:
def collaborative_mean(user_id, product_id):
    """ Simple collaborative filter based on mean ratings. """

    # Ratings of this product by every user except the target user.
    user_condition = df_train.reviewerID != user_id
    product_condition = df_train.asin == product_id
    ratings_by_others = df_train.loc[user_condition & product_condition]
    if ratings_by_others.empty:
        # Default rating when nobody else has rated the product.
        return 4.0
    else:
        return ratings_by_others.overall.mean()

In [None]:
# Specific example
collaborative_mean('ACOICLIJQYECU', '4639725043')

For this user and product, the collaborative mean predicts a higher rating than the content mean above.

In [None]:
# Long-running process (several hours). Uncomment to run.
# print(f'RMSE for collaborative mean is: {evaluate(collaborative_mean)}.')

In case the output is cleared: the RMSE for the collaborative mean is 1.0631020146075487.
This is the best result so far.
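
The evaluation is slow because every prediction scans the whole training frame. A faster variant, sketched below under assumptions, computes each product's mean once with `groupby` and looks it up for every test row. It assumes that no (user, product) pair from the test set also appears in the training set, so excluding the target user's own rating makes no difference. The toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the notebook loads these frames from CSV.
df_train = pd.DataFrame({
    'reviewerID': ['u1', 'u2', 'u3', 'u1'],
    'asin': ['p1', 'p1', 'p2', 'p2'],
    'overall': [5.0, 3.0, 2.0, 4.0],
})
df_test = pd.DataFrame({
    'reviewerID': ['u3', 'u2', 'u1'],
    'asin': ['p1', 'p2', 'p3'],
    'overall': [4.0, 2.0, 5.0],
})

# One pass over the training data: mean rating per product.
item_means = df_train.groupby('asin')['overall'].mean()

# Look up each test product's mean; unseen products get the same 4.0 default.
predicted = df_test['asin'].map(item_means).fillna(4.0).to_numpy()

rmse = np.sqrt(np.mean((predicted - df_test['overall'].to_numpy()) ** 2))
```

This replaces one full scan per test row with a single aggregation, at the cost of dropping the per-user exclusion.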

### Generalizations of the aggregation function for content-based filtering: incorporating similarities

The similarity may incorporate metadata about the items, which is where the term 'content' starts to make sense.

$$ r_{u,i} = k \sum_{i' \in I(u)} sim(i, i') \; r_{u,i'} $$

$$ r_{u,i} = \bar r_u + k \sum_{i' \in I(u)} sim(i, i') \; (r_{u,i'} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{i' \in I(u)} |sim(i,i')|} $$

and $\bar r_u$ is the average rating of user $u$:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
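
The two expressions above can be sketched as follows. The item feature vectors, the use of cosine similarity for $sim(i, i')$, and the function names are assumptions made for illustration; the notebook does not fix a particular similarity.

```python
import numpy as np

# Hypothetical item feature vectors (e.g. category flags), made up for illustration.
item_features = {
    'p1': np.array([1.0, 0.0]),
    'p2': np.array([1.0, 0.0]),
    'p3': np.array([0.0, 1.0]),
}

# Ratings by a single user u for the items in I(u).
user_ratings = {'p2': 5.0, 'p3': 1.0}

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors (one possible choice of sim)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_weighted(item_id, ratings, mean_centered=False):
    """Similarity-weighted aggregate of the user's own ratings."""
    sims = np.array([cosine_sim(item_features[item_id], item_features[j])
                     for j in ratings])
    r = np.array(list(ratings.values()))
    norm = np.abs(sims).sum()
    if norm == 0:
        return float(r.mean())  # no similar items: fall back to the plain user mean
    k = 1.0 / norm              # normalizing factor
    if mean_centered:
        r_bar = r.mean()
        return float(r_bar + k * np.sum(sims * (r - r_bar)))
    return float(k * np.sum(sims * r))

pred = content_weighted('p1', user_ratings)
pred_centered = content_weighted('p1', user_ratings, mean_centered=True)
```

Note that when all similarities are non-negative, $k \sum sim(i, i') = 1$, so the mean-centered form gives the same value as the plain weighted form. The two differ only when some similarities are negative.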


### Generalizations of the aggregation function for collaborative filtering: incorporating similarities

The similarity may incorporate metadata about the users.

$$ r_{u,i} = k \sum_{u' \in U(i)} sim(u, u') \; r_{u',i} $$

$$ r_{u,i} = \bar r_u + k \sum_{u' \in U(i)} sim(u, u') \; (r_{u',i} - \bar r_{u'}) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{u' \in U(i)} |sim(u,u')|} $$

and $\bar r_u$ is the average rating of user $u$ (likewise $\bar r_{u'}$ for user $u'$):

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
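
The first, plain weighted expression in this section can be sketched as below. The toy rating matrix, the choice of cosine similarity over co-rated items for $sim(u, u')$, and the function names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy user-item rating matrix (rows: users, columns: items); NaN means "not rated".
ratings = pd.DataFrame(
    {'p1': [5.0, 4.0, 1.0], 'p2': [4.0, 5.0, 2.0], 'p3': [np.nan, 4.0, 1.0]},
    index=['u1', 'u2', 'u3'],
)

def user_sim(u, v):
    """Cosine similarity over the items both users rated (one possible choice of sim)."""
    both = ratings.loc[[u, v]].dropna(axis=1)
    a, b = both.loc[u].to_numpy(), both.loc[v].to_numpy()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def collaborative_weighted(user_id, item_id):
    """Similarity-weighted average of other users' ratings for item_id."""
    others = ratings[item_id].drop(user_id).dropna()  # U(i) without the target user
    sims = np.array([user_sim(user_id, v) for v in others.index])
    norm = np.abs(sims).sum()
    if norm == 0:
        return float(others.mean())  # no usable similarities: plain item mean
    return float(np.sum(sims * others.to_numpy()) / norm)

pred = collaborative_weighted('u1', 'p3')
```

Because `u1` is more similar to `u2` than to `u3` in this toy matrix, the prediction is pulled above the plain item mean of 2.5 toward `u2`'s rating of 4.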

## Summary
- TODO

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  