# Modeling <sup>[1]</sup>

### Evaluation: performance criterion

Common performance criteria for recommendation systems include:

- RMSE: $\sqrt{\frac{\sum(\hat y - y)^2}{n}}$
- Precision / Recall / F-scores
- ROC curves
- Cost curves
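
The classification-style criteria above need a notion of a "relevant" item. As a rough sketch, one option is to treat a rating as relevant when it is at or above a threshold. The function name `precision_recall_f1` and the threshold of 4 are assumptions made for this illustration; they are not defined elsewhere in this notebook.

```python
import numpy as np

def precision_recall_f1(y_pred, y_true, threshold=4.0):
    """Precision, recall and F1 after thresholding ratings (threshold is an assumption)."""
    pred_pos = np.asarray(y_pred, dtype=float) >= threshold   # predicted as relevant
    true_pos = np.asarray(y_true, dtype=float) >= threshold   # actually relevant
    hits = np.sum(pred_pos & true_pos)
    precision = hits / pred_pos.sum() if pred_pos.sum() else 0.0
    recall = hits / true_pos.sum() if true_pos.sum() else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up predictions and true ratings, for illustration only
p, r, f1 = precision_recall_f1([4.5, 3.0, 4.2, 2.0], [5, 4, 3, 1])
print(p, r, f1)
```

RMSE remains the criterion used in the rest of this notebook; the sketch only shows how the other criteria could be computed from the same kind of data.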

## Imports

In [43]:
import numpy as np
import pandas as pd

# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)
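# use fully qualified option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.precision', 3, 'display.notebook_repr_html', True)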

# init random gen
np.random.seed(2)

## Load Data

In [44]:
# 80/20 split earlier
df_train = pd.read_csv('../Data/training_data.csv')
df_test = pd.read_csv('../Data/testing_data.csv')

In [45]:
df_train.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,main_cat,asin,details,overall,verified,reviewerID,reviewText,summary,for_testing
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,A1J205ZK25TZ6W,I make the best brewed iced tea with this yell...,Best for brewed iced tea.,False
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,3.0,True,ACOICLIJQYECU,I have recently started drinking hot tea again...,Not Bad for iced Tea,False


In [46]:
df_test.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,main_cat,asin,details,overall,verified,reviewerID,reviewText,summary,for_testing
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,A29RCQA5G0B1BA,I like pretty much all of Lipton's tea... I ju...,A Great Cuppa...!,False
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,A3ALL8JW7604P7,My second favorite tea! Drink lots it'll make...,Good Stuff for those who like a strong tea,False


TODO: May have an issue with the for_testing column that was created previously. Or I'm misreading what it does.

### RMSE

In [47]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

### Evaluation method

In [48]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    
    ids_to_estimate = zip(df_test.reviewerID, df_test.asin)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = df_test.overall.values
    return compute_rmse(estimated, real)

### Baseline function

In [49]:
# Baseline: predict the midpoint of the 1-5 rating scale for every (user, product) pair
def baseline_function(user_id, product_id):
    return 3

In [50]:
print('RMSE for baseline function: %s' % evaluate(baseline_function))

RMSE for baseline function: 1.792272271858807


The goal is to beat an RMSE of 1.7923 in all later models.  
An RMSE of 0 means every prediction matches the true rating exactly.  
On a 1-5 scale no single prediction can be off by more than 4 (5 - 1), so the RMSE can never exceed 4.
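
To make the range of possible RMSE values concrete, here is a small check on made-up ratings. It re-declares `compute_rmse` so the snippet runs on its own; the arrays are invented for illustration.

```python
import numpy as np

def compute_rmse(y_pred, y_true):
    """Root Mean Squared Error, same definition as earlier in this notebook."""
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

# Made-up true ratings at the two extremes of the 1-5 scale
y_true = np.array([1.0, 1.0, 5.0, 5.0])

perfect = compute_rmse(y_true.copy(), y_true)                   # every prediction exact
worst = compute_rmse(np.array([5.0, 5.0, 1.0, 1.0]), y_true)    # every prediction off by 4
midpoint = compute_rmse(np.full_like(y_true, 3.0), y_true)      # every prediction off by 2

print(perfect, worst, midpoint)
```

Predicting the midpoint caps each individual error at 2, which is why the constant-3 baseline cannot have an RMSE above 2.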

In [51]:
def hard_coded_5_function(user_id, product_id):
    return 5

In [52]:
print('RMSE for hard coded most common rating: %s' % evaluate(hard_coded_5_function))

RMSE for hard coded most common rating: 1.2060183302315464


This is lower than the baseline, which makes sense because the majority of the reviews are 5s.

In [53]:
def hard_coded_4_function(user_id, product_id):
    return 4

In [54]:
print('RMSE for hard coded 4: %s' % evaluate(hard_coded_4_function))

RMSE for hard coded 4: 1.1547121089969605


A hard-coded 4 does even better, which makes sense because the reviews span the full range of ratings.
Although 5 is the most common rating, the constant that minimizes RMSE is the mean rating, and the lower RMSE shows that 4 is closer to that mean than 5 is.
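
The reason is that, for a constant prediction $c$, the squared RMSE splits into $\mathrm{Var}(r) + (c - \bar r)^2$, so the error grows with the distance between $c$ and the mean rating $\bar r$. The sketch below checks this on a made-up, 5-heavy set of ratings; the numbers are illustrative and are not taken from the review data.

```python
import numpy as np

# Made-up ratings: 5 is the most common value, but low ratings pull the mean down
ratings = np.array([5, 5, 5, 5, 5, 5, 4, 3, 1, 1], dtype=float)

def rmse_of_constant(c):
    """RMSE obtained by predicting the constant c for every rating."""
    return np.sqrt(np.mean((c - ratings) ** 2))

rmse_by_constant = {c: rmse_of_constant(c) for c in (1, 2, 3, 4, 5)}
mean_rating = ratings.mean()
rmse_at_mean = rmse_of_constant(mean_rating)

for c, value in rmse_by_constant.items():
    print(f'constant {c}: RMSE {value:.4f}')
print(f'mean {mean_rating:.2f}: RMSE {rmse_at_mean:.4f}')
```

On this toy data the mode is 5 but the mean is 3.9, so predicting 4 beats predicting 5, and predicting the mean itself beats every integer constant.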

## Well-known Solutions to the Recommendation Problem

### Content-based filtering

*Recommend based on the user's rating history.* 

Generic expression (notice how this is kind of a 'row-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}]$$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_u = \frac{\sum_{i' \in I(u)} r_{u,i'}}{|I(u)|} $$

In [55]:
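def content_mean(user_id, product_id):
    """ Simple content-filtering based on mean ratings. """
    
    # product_id is unused: the prediction depends only on the user's own rating history
    user_condition = df_train.reviewerID == user_id
    # returns NaN when the user has no ratings in df_train
    return df_train.loc[user_condition, 'overall'].mean()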

In [56]:
# Specific example
content_mean('ACOICLIJQYECU', '4639725043')

3.5

In [57]:
# Longer running process
print('RMSE for content mean: %s' % evaluate(content_mean))

RMSE for content mean: nan


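TODO: Research ways to fix. The NaN most likely comes from test-set reviewers who have no ratings in the training set: the mean of an empty selection is NaN, and a single NaN prediction makes the overall RMSE NaN.  
In case the output is cleared, the RMSE using content mean is nan.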
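
One possible way to avoid a NaN prediction, sketched here on hypothetical toy data, is to fall back to the global training mean whenever a reviewer has no ratings in the training set. The names `toy_train`, `user_means` and `content_mean_with_fallback` are assumptions for this example, and the choice of fallback value is a design decision rather than something this notebook has settled on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train
toy_train = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2'],
    'asin':       ['p1', 'p2', 'p1'],
    'overall':    [5.0, 3.0, 2.0],
})

# The mean over an empty selection is NaN, which is what an unseen reviewer produces
unseen_mean = toy_train.loc[toy_train.reviewerID == 'u3', 'overall'].mean()

# Precompute the fallback and the per-user means once
global_mean = toy_train['overall'].mean()
user_means = toy_train.groupby('reviewerID')['overall'].mean()

def content_mean_with_fallback(user_id, product_id):
    """Per-user training mean, or the global training mean for unseen users."""
    return float(user_means.get(user_id, global_mean))

seen = content_mean_with_fallback('u1', 'p1')     # u1 rated 5 and 3 in training
unseen = content_mean_with_fallback('u3', 'p1')   # u3 is not in training
print(unseen_mean, seen, unseen)
```

Precomputing `user_means` with a single `groupby` also avoids filtering the whole training frame on every call.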

### Collaborative filtering

*Recommend based on other users' rating histories.*

Generic expression (notice how this is kind of a 'col-based' approach):

$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits}r_{u,i} = \aggr_{u' \in U(i)} [r_{u',i}] $$

A simple example using the mean as an aggregation function:

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

In [58]:
def collaborative_mean(user_id, product_id):
    """ Simple collaborative filter based on mean ratings. """
    
    user_condition = df_train.reviewerID != user_id
    product_condition = df_train.asin == product_id
    ratings_by_others = df_train.loc[user_condition & product_condition]
    if ratings_by_others.empty:
        return 4.0
    else:
        return ratings_by_others.overall.mean()
    

In [59]:
# Specific example
collaborative_mean('ACOICLIJQYECU', '4639725043')

4.238095238095238

For this reviewer and product, the collaborative mean (4.24) is higher than the content mean above (3.5).

In [60]:
# Longer running process (several hours)
print(f'RMSE for collaborative mean is: {evaluate(collaborative_mean)}.')

RMSE for collaborative mean is: 1.0631020146075487.


In case the output is cleared, the RMSE for collaborative mean is 1.0631020146075487.
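
Because `collaborative_mean` filters the whole training frame once per test row, the evaluation is slow. A possible alternative, sketched here on hypothetical toy frames, computes the same "mean rating by other users" with grouped sums and counts: the per-product totals minus the reviewer's own contribution. The frame and column names (`toy_train`, `toy_test`, `item_sum`, `own_sum` and so on) are assumptions for this example, and the 4.0 fallback mirrors the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train / df_test
toy_train = pd.DataFrame({
    'reviewerID': ['u1', 'u2', 'u3', 'u1'],
    'asin':       ['p1', 'p1', 'p1', 'p2'],
    'overall':    [5.0, 4.0, 3.0, 2.0],
})
toy_test = pd.DataFrame({
    'reviewerID': ['u1', 'u4', 'u2'],
    'asin':       ['p1', 'p1', 'p3'],
    'overall':    [4.0, 5.0, 1.0],
})

DEFAULT_RATING = 4.0  # same fallback as collaborative_mean when nobody else rated the product

# Total rating sum and count per product
item = (toy_train.groupby('asin')['overall']
        .agg(item_sum='sum', item_n='count').reset_index())
# The reviewer's own contribution to each product, to be subtracted
own = (toy_train.groupby(['reviewerID', 'asin'])['overall']
       .agg(own_sum='sum', own_n='count').reset_index())

merged = (toy_test.merge(item, on='asin', how='left')
          .merge(own, on=['reviewerID', 'asin'], how='left')
          .fillna({'item_sum': 0.0, 'item_n': 0, 'own_sum': 0.0, 'own_n': 0}))

others_sum = merged['item_sum'] - merged['own_sum']
others_n = merged['item_n'] - merged['own_n']
# Avoid dividing by zero: rows with no other raters get the default rating
pred = (others_sum / others_n.where(others_n > 0)).fillna(DEFAULT_RATING)

rmse = np.sqrt(np.mean((pred.to_numpy() - merged['overall'].to_numpy()) ** 2))
print(pred.tolist(), rmse)
```

On the toy data the three predictions are 3.5 (the mean of the two other raters), 4.0 (the mean of all three raters) and 4.0 (the fallback for an unrated product).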

### Hybrid solutions

The literature has lots of examples of systems that try to combine the strengths
of the two main approaches. This can be done in a number of ways:

- Combine the predictions of a content-based system and a collaborative system.
- Incorporate content-based techniques into a collaborative approach.
- Incorporate collaborative techniques into a content-based approach.
- Construct a general unifying model that incorporates both content-based and collaborative characteristics.
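
The first option in the list can be sketched as a weighted blend of two estimators that share the `(user_id, product_id)` signature used in this notebook. The helper `make_hybrid`, the weight `alpha` and the two toy estimators are hypothetical, introduced only for this illustration.

```python
def make_hybrid(content_f, collaborative_f, alpha=0.5):
    """Return an estimator that blends two estimators with weight alpha (an assumption)."""
    def hybrid(user_id, product_id):
        content = content_f(user_id, product_id)
        collaborative = collaborative_f(user_id, product_id)
        return alpha * content + (1 - alpha) * collaborative
    return hybrid

# Toy stand-ins for content_mean and collaborative_mean
def toy_content(user_id, product_id):
    return 3.5

def toy_collaborative(user_id, product_id):
    return 4.5

hybrid = make_hybrid(toy_content, toy_collaborative, alpha=0.25)
blended = hybrid('some_user', 'some_product')   # 0.25 * 3.5 + 0.75 * 4.5
print(blended)
```

Since the returned function has the same signature as the other estimators, it could be passed to `evaluate` directly; how to pick `alpha` is left open here.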

### Generalizations of the aggregation function for content-based filtering: incorporating similarities

The similarity function can incorporate metadata about items, which is where the term 'content' comes from.

$$ r_{u,i} = k \sum_{i' \in I(u)} sim(i, i') \; r_{u,i'} $$

$$ r_{u,i} = \bar r_u + k \sum_{i' \in I(u)} sim(i, i') \; (r_{u,i'} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{i' \in I(u)} |sim(i,i')|} $$

and $\bar r_u$ is the average rating of user u:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
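
As a minimal sketch of the mean-centred formula above, the function below takes one user's ratings for the items in $I(u)$ together with the similarity of each of those items to the target item. The function name and the similarity values are made up; how $sim(i, i')$ is actually computed (for example from item metadata) is left open.

```python
import numpy as np

def predict_from_similar_items(user_ratings, sims):
    """Mean-centred, similarity-weighted prediction for a single user.

    user_ratings: the user's ratings r_{u,i'} for each item i' in I(u)
    sims:         sim(i, i') between the target item i and each i' (assumed given)
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    sims = np.asarray(sims, dtype=float)
    user_mean = user_ratings.mean()
    norm = np.abs(sims).sum()
    if norm == 0:
        # No similar items: fall back to the user's mean rating
        return user_mean
    k = 1.0 / norm
    return user_mean + k * np.sum(sims * (user_ratings - user_mean))

# Made-up example: the user rated three items 5, 3 and 4
pred = predict_from_similar_items([5.0, 3.0, 4.0], [0.9, 0.1, 0.5])
print(pred)
```

The item rated 5 is by far the most similar to the target, so the prediction is pulled above the user's mean of 4.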


### Generalizations of the aggregation function for collaborative filtering: incorporating similarities

The similarity function can incorporate metadata about users.

$$ r_{u,i} = k \sum_{u' \in U(i)} sim(u, u') \; r_{u',i} $$

$$ r_{u,i} = \bar r_u + k \sum_{u' \in U(i)} sim(u, u') \; (r_{u',i} - \bar r_{u'}) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{u' \in U(i)} |sim(u,u')|} $$

and $\bar r_u$ is the average rating of user u:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
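
The first, uncentred formula of this section can be sketched as follows. The similarity used here is cosine similarity over the items both users have rated, which is only one possible choice and is an assumption of this example, as are the toy rating matrix and the function names.

```python
import numpy as np
import pandas as pd

# Hypothetical user x item rating matrix; NaN means "not rated"
toy_ratings = pd.DataFrame(
    {'p1': [5.0, 4.0, 1.0], 'p2': [4.0, 5.0, 2.0], 'p3': [np.nan, 4.0, 1.0]},
    index=['u1', 'u2', 'u3'],
)

def cosine_sim(a, b):
    """Cosine similarity restricted to the items rated by both users."""
    both = (a.notna() & b.notna()).to_numpy()
    if not both.any():
        return 0.0
    x, y = a.to_numpy()[both], b.to_numpy()[both]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict_from_similar_users(user, item):
    """r_{u,i} = k * sum over u' in U(i) of sim(u, u') * r_{u',i}."""
    mask = toy_ratings[item].notna().to_numpy() & (toy_ratings.index != user)
    raters = toy_ratings.index[mask]
    sims = np.array([cosine_sim(toy_ratings.loc[user], toy_ratings.loc[v]) for v in raters])
    norm = np.abs(sims).sum()
    if norm == 0:
        return float('nan')  # no usable neighbours
    return float(sims @ toy_ratings.loc[raters, item].to_numpy() / norm)

pred = predict_from_similar_users('u1', 'p3')
print(pred)
```

Cosine similarity on raw, all-positive ratings discriminates weakly between users, which is one motivation for the mean-centred variant.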

### Create a user index
Indexing by `reviewerID` makes it more convenient to retrieve information for a specific user.

In [61]:
user_info = df_test.set_index('reviewerID')
user_info.head(3) 

reviewerID,category,description,title,also_buy,brand,rank,main_cat,asin,details,overall,verified,reviewText,summary,for_testing
A29RCQA5G0B1BA,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,I like pretty much all of Lipton's tea... I ju...,A Great Cuppa...!,False
A3ALL8JW7604P7,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,My second favorite tea! Drink lots it'll make...,Good Stuff for those who like a strong tea,False
A19XLYBG7REBLS,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,excellent,Five Stars,False
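
With such an index in place, all rows for one reviewer can be pulled out with `.loc`. The sketch below uses a hypothetical toy frame rather than the review data; the names `toy` and `toy_user_info` are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data with the same key columns as the review data
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u2', 'u1'],
    'asin':       ['p1', 'p1', 'p2'],
    'overall':    [5.0, 3.0, 4.0],
})
toy_user_info = toy.set_index('reviewerID')

# A list key makes .loc return a DataFrame even if the reviewer has a single row
u1_rows = toy_user_info.loc[['u1']]
u1_mean = u1_rows['overall'].mean()
print(u1_rows)
print(u1_mean)
```

The index is not unique, since a reviewer can appear in many rows, so a lookup returns every review written by that reviewer.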


## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  