# Recommendations Using Hybrid solutions <sup>1</sup>

Lots of examples of systems that try to combine the strengths of the two main approaches.
This can be done in a number of ways:

- Combine the predictions of a content-based system and a collaborative system.
- Incorporate content-based techniques into a collaborative approach.
- Incorporate collaborative techniques into a content-based approach.
- Unifying model.

## Imports

In [1]:
import numpy as np
import pandas as pd
from scipy.special import logsumexp

## Load Data

Only loading a subset of the original data set for proof of concept reasons.

In [2]:
# 80/20 split earlier
df_train = pd.read_csv('../Data/training_data_subset.csv')
df_test = pd.read_csv('../Data/testing_data_subset.csv')

In [3]:
# Also, load the entire original dataset to make a user index.
# Note: If using the full train/test dataset several of the methods below can takes 2-6 hours to run.
df_original = pd.read_csv('../Data/eda_data.csv')

In [4]:
df_train.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style,for_testing
0,"['Grocery & Gourmet Food', 'Candy & Chocolate'...","YumEarth Organic Gummy Bears, 10 Count","['B008CC8UXC', 'B00C25LO8S', 'B073RWDCMD', 'B0...",YumEarth,"129,438 in Grocery & Gourmet Food (","['B008CC8UXC', 'B00C25LNWA', 'B008CC8ULY', 'B0...",Grocery,,B008B7JNRA,3.0,True,A35KP4ROS9KWPO,,"{'Size:': ' 10 Count', 'Style:': ' Natural Gum...",False
1,"['Grocery & Gourmet Food', 'Jams, Jellies & Sw...",Bell Plantation Powdered PB2 Bundle: 1 Peanut ...,"['B06W9N8X9H', 'B06X15V3DC', 'B01ENYJX3S', 'B0...",PB2,"1,214 in Grocery & Gourmet Food (",,Grocery,18.49,B00H9H56QA,5.0,True,AVAMZWS7AAI1S,,{'Size:': ' Pack of 2 (1 each flavor)'},False
2,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",Cruz De Malta 1/2 Kilo Yerba Mate,"['B01M1N9U23', 'B07H4Y4HS3', 'B07C1GJB5Q', 'B0...",Cruz de Malta,"4,168 in Grocery & Gourmet Food (",,Grocery,8.96,B001UO90BA,3.0,True,A3TBH6LM7PSZOM,,{'Size:': ' 500 Grams (1/2 Kilo)'},False
3,"['Grocery & Gourmet Food', 'Candy & Chocolate'...","Sheila G Brownie Brittle, Traditional Walnut, ...",,Sheila G,"814,161 in Grocery & Gourmet Food (",,Grocery,,B00AKSDGL2,5.0,True,A1CFMGYN17AJTT,,{'Flavor:': ' Toffee Crunch'},False
4,"['Grocery & Gourmet Food', 'Canned, Jarred & P...","Thai Red Curry Meal Kit by Marion's Kitchen, 5...","['B07HQY62FP', 'B07HQW7J3Y', 'B01LEC66KA', 'B0...",Marion's Kitchen,"50,529 in Grocery & Gourmet Food (","['B071GZNTM1', 'B07HQY62FP', 'B07HQW7J3Y', 'B0...",Grocery,34.95,B0141KOO5Q,5.0,True,A1V4KHN1SPG1PS,,{'Flavor:': ' Thai Green Curry'},False


In [5]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083170 entries, 0 to 1083169
Data columns (total 14 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   category    1083170 non-null  object 
 1   title       1083170 non-null  object 
 2   also_buy    926546 non-null   object 
 3   brand       1075197 non-null  object 
 4   rank        1039163 non-null  object 
 5   also_view   577060 non-null   object 
 6   main_cat    1081896 non-null  object 
 7   price       750231 non-null   float64
 8   asin        1083170 non-null  object 
 9   overall     1083170 non-null  float64
 10  verified    1083170 non-null  bool   
 11  reviewerID  1083170 non-null  object 
 12  vote        149247 non-null   float64
 13  style       559212 non-null   object 
dtypes: bool(1), float64(3), object(10)
memory usage: 108.5+ MB


### Create a user index
To retrieve information given a specific user_id in a more convenient way.

In [15]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [16]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    
    ids_to_estimate = zip(df_test.reviewerID, df_test.asin)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = df_test.overall.values
    return compute_rmse(estimated, real)

## Similarity functions

### Euclidean 'similarity'

$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}}$$

In [20]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

### Cosine similarity

$$ sim(x,y) = \frac{(x . y)}{\sqrt{(x . x) (y . y)}} $$

In [21]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

### Pearson correlation

$$ sim(x,y) = \frac{(x - \bar x).(y - \bar y)}{\sqrt{(x - \bar x).(x - \bar x) * (y - \bar y)(y - \bar y)}} $$

In [22]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(logsumexp(s1_c ** 2)) * np.sum(logsumexp(s2_c ** 2)))

### Jaccard similarity

$$ sim(x,y) = \frac{(x . y)}{(x . x) + (y . y) - (x . y)} $$

In [23]:
def jaccard(s1, s2):
    dotp = np.sum(s1 * s2)
    return dotp / (np.sum(s1 ** 2) + np.sum(s2 ** 2) - dotp)

In [28]:
class Recommender(object):
    def __init__(self, similarity=pearson):
        self.overall_mean = df_test['overall'].mean()
        self.all_user_profiles = df_test.pivot_table('overall', index='asin', columns='reviewerID')
        self._similarity = similarity
        
    @property
    def similarity(self):
        return self._similarity
    
    @similarity.setter
    def similarity(self, value):
        self._similarity = value
    
    def estimate_product(self, user_id, product_id):
        all_ratings = df_test.loc[df_test.asin == product_id]
        if all_ratings.empty:
            return self.overall_mean
        all_ratings.set_index('reviewerID', inplace=True)
        their_ids = all_ratings.index
        their_ratings = all_ratings.overall
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: self.similarity(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'overall': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.overall, weights=ratings_sims.sim)

    def estimate_user(self):
        return self.overall_mean

    def estimate(self, user_id, movie_id):
        return 0.5 * self.estimate_user() + 0.5 * self.estimate_product(user_id, movie_id)

In [29]:
rec = Recommender(euclidean)
print('RMSE for recommender estimate class using euclidean: %s' % evaluate(rec.estimate))

RMSE for recommender estimate class using euclidean: 0.5895197895124215


In [30]:
rec = Recommender(cosine)
print('RMSE for recommender estimate class using cosine: %s' % evaluate(rec.estimate))

RMSE for recommender estimate class using cosine: 0.6383217858949748


In [31]:
rec = Recommender(pearson)
print('RMSE for recommender estimate class using pearson: %s' % evaluate(rec.estimate))

RMSE for recommender estimate class using pearson: 0.6392390495737882


In [32]:
rec = Recommender(jaccard)
print('RMSE for recommender estimate class using jaccard: %s' % evaluate(rec.estimate))

RMSE for recommender estimate class using jaccard: 0.6052358162303417


## Summary
- Evaluated estimated recommendations using several hybrid methods.
- In all cases, the estimates were:
    - fairly close to one another.
    - significantly better than all previous notebook versions.
- The best performing function was euclidean with a RMSE of 0.5895.
- The worse performing function was pearson with a RMSE of 0.6392.

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  