## Recommender systems

Nowadays, recommender systems are used to personalize your experience on the web, telling you what to buy, where to eat or even who you should be friends with. People’s tastes vary, but generally follow patterns. People tend to like things that are similar to other things they like, and they tend to have similar taste as other people they are close with. Recommender systems try to capture these patterns to help predict what else you might like.

### Types
- Content-Based (Similarity between items)
- Collaborative Filtering (Similarity between user's behaviers)
    - Model-Based Collaborative filtering (SVD)
    - Memory-Based Collaborative Filtering (cosine similarity)
        - user-item filtering
        - item-item filtering
 
### Data
- [MovieLens 100K Dataset](https://grouplens.org/datasets/movielens/100k/)
- 100k movie ratings
- 943 users
- 1682 movies

In [1]:
import numpy as np
import pandas as pd
import tools as t
from sklearn.metrics.pairwise import pairwise_distances

In [2]:
#reading
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=header)

In [3]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)  

Number of users = 943 | Number of movies = 1682


In [59]:
from sklearn import cross_validation as cv
train_data, test_data = cv.train_test_split(df, test_size=0.25)

In [56]:
train_data.describe()
train_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
78949,927,763,4,879181749
14895,59,71,3,888205574
39149,144,500,4,888105419
35031,389,181,4,879915806
5843,194,511,4,879520991


### Create a user-item rating matrix

<img src="user-item.png">

In [57]:
def user_item_rating(data):
    rowIDs = df['user_id']
    colIDs = df['item_id']
    A = np.zeros((rowIDs.max(),colIDs.max()))
    A[rowIDs-1,colIDs-1] = df['rating']
    return A

In [60]:
train_data_matrix = user_item_rating(train_data)
test_data_matrix = user_item_rating(test_data)

print train_data_matrix.shape, test_data_matrix.shape
print "Train Matrix ", train_data_matrix[:10]
print
print "Test Matrix ", test_data_matrix[:10]

(943, 1682) (943, 1682)
Train Matrix  [[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]]

Test Matrix  [[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]]


### Calculate Cosine Similarity
<img src="user_sim.gif">
<img src="item_sim.gif">

#### Hint: look for pairwise_distances

In [89]:
def cosine_similarity_user(data):
    return cosine_similarity(data)

In [90]:
def cosine_similarity_item(data):
    return cosine_similarity(data)

In [217]:
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity_user(train_data_matrix)
item_similarity = cosine_similarity_item(train_data_matrix.T)

In [97]:
print user_similarity.shape, item_similarity.shape
print user_similarity[0][1]
print item_similarity[0][1]

(943, 943) (1682, 1682)
0.166930983869
0.4023821783


### Predictions
- user-item filtering
- item-item filtering

<img src="user_predict.gif">
<img src="item_predict.gif">

In [226]:
def predict(ratings, similarity, type='item'):
    if type == 'item':
#         result = np.dot(ratings,similarity)/ np.array([np.abs(similarity).sum(axis=1)])
        print np.shape(ratings)
        print np.shape(similarity)
        result = np.dot(ratings,similarity)
    if type == 'user':
        rating_mean = np.mean(ratings,axis=1)
        rating_diff = (ratings - rating_mean[:,np.newaxis])
        result = rating_mean[:,np.newaxis] + np.dot(similarity,rating_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return result

In [227]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
# user_prediction = predict(train_data_matrix, user_similarity, type='user')
# print train_data_matrix[1,1] == 1
# print np.shape(train_data_matrix)
# print np.shape(train_data_matrix[:,1681])
# print np.shape(user_similarity[0,:])
# print np.mean(train_data_matrix[2,:])
item_prediction

(943, 1682)
(1682, 1682)


array([[ 2513.62295555,  2436.79369503,  2314.54464929, ...,
         2618.84922329,  2618.68126852,  2618.23078906],
       [ 1078.94141117,  1034.33102824,   985.94514636, ...,
         1125.46379592,  1125.48930533,  1125.3408924 ],
       [  854.74437862,   821.39808256,   782.37893219, ...,
          895.51459334,   895.55562691,   895.43551464],
       ..., 
       [ 4756.54211025,  4597.12612512,  4369.72513806, ...,
         5103.34579456,  5103.12166657,  5102.37909277],
       [ 5331.82083275,  5155.30396766,  4898.1087534 , ...,
         5712.67882335,  5712.41250438,  5711.59255174],
       [ 5678.98220959,  5495.3368623 ,  5221.06791887, ...,
         6081.46298093,  6081.1612561 ,  6080.25628339]])

In [213]:
print item_prediction[0]
print user_prediction[0]

[ 2.12156153  2.03402604  2.05774686 ...,  1.35864862  2.08447116
  2.34978973]
[ 3.08952651  1.29865335  0.89410707 ...,  2.77339047  2.77594018
  2.77632291]


### Evatuate

In [214]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
#     print prediction
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
#     print ground_truth
    return sqrt(mean_squared_error(prediction, ground_truth))

In [216]:
print 'Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix))
print 'User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))


Item-based CF RMSE: 1.76137179504
User-based CF RMSE: 1.67722221628
