## Recommender systems

Recommender systems personalize your experience on the web: they suggest what to buy, where to eat, or even who you should be friends with. People's tastes vary, but they generally follow patterns. People tend to like things that are similar to other things they like, and their tastes tend to resemble those of people they are close to. Recommender systems try to capture these patterns to predict what else you might like.

### Types
- Content-Based (Similarity between items)
- Collaborative Filtering (Similarity between users' behaviors)
    - Model-Based Collaborative Filtering (SVD)
    - Memory-Based Collaborative Filtering (cosine similarity)
        - user-item filtering
        - item-item filtering
 
### Data
- [MovieLens 100K Dataset](https://grouplens.org/datasets/movielens/100k/)
- 100k movie ratings
- 943 users
- 1682 movies
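
The counts above imply that most user-movie pairs have no rating. As a quick back-of-the-envelope check (using only the numbers listed above), the density of the rating matrix can be computed like this:

```python
# Counts taken from the dataset description above.
n_ratings = 100000
n_users = 943
n_items = 1682

# Fraction of all user-movie pairs that actually carry a rating.
density = n_ratings / float(n_users * n_items)
sparsity = 1.0 - density

print('Density: {:.2%} | Sparsity: {:.2%}'.format(density, sparsity))
```

Roughly 6.3% of the cells are filled, which is why the matrices printed later in this notebook consist mostly of zeros.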

In [1]:
import numpy as np
import pandas as pd
import tools as t
from sklearn.metrics.pairwise import pairwise_distances

In [2]:
# Read the tab-separated ratings file
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=header)

In [3]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)  

Number of users = 943 | Number of movies = 1682


In [4]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

In [5]:
train_data.describe()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| count | 75000.0 | 75000.0 | 75000.0 | 75000.0 |
| mean | 462.69028 | 426.1074 | 3.5262 | 883549700.0 |
| std | 266.818143 | 331.115172 | 1.127261 | 5350006.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 254.0 | 175.0 | 3.0 | 879448900.0 |
| 50% | 448.0 | 322.0 | 4.0 | 882843700.0 |
| 75% | 682.0 | 631.0 | 4.0 | 888267500.0 |
| max | 943.0 | 1682.0 | 5.0 | 893286600.0 |


### Create a user-item rating matrix

<img src="user-item.png">

In [6]:
def user_item_rating(data):
    matrix = np.zeros((n_users, n_items))
    # your code here
    return matrix
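
One possible way to fill in the exercise above is sketched below. It is not necessarily the intended solution: the function name `build_user_item_matrix` and the toy frame are made up for illustration, and the sketch assumes that user and item IDs run contiguously from 1 (which the `describe()` output suggests) and that a rating of 0 means "not rated".

```python
import numpy as np
import pandas as pd

def build_user_item_matrix(data, n_users, n_items):
    """Return an (n_users, n_items) array of ratings; 0 means 'not rated'."""
    matrix = np.zeros((n_users, n_items))
    # IDs in the data start at 1, so shift them to 0-based array indices.
    rows = data['user_id'].values - 1
    cols = data['item_id'].values - 1
    matrix[rows, cols] = data['rating'].values
    return matrix

# Tiny hand-made frame (not MovieLens data) to show the resulting layout.
toy = pd.DataFrame({'user_id': [1, 1, 2, 3],
                    'item_id': [1, 3, 2, 3],
                    'rating':  [5, 3, 4, 1]})
toy_matrix = build_user_item_matrix(toy, n_users=3, n_items=3)
```

Each row of `toy_matrix` is one user and each column one item, so user 1 ends up with a 5 in the first column and a 3 in the third.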

In [12]:
train_data_matrix = user_item_rating(train_data)
test_data_matrix = user_item_rating(test_data)

print train_data_matrix.shape, test_data_matrix.shape
print "Train Matrix ", train_data_matrix[:10]
print
print "Test Matrix ", test_data_matrix[:10]

(943L, 1682L) (943L, 1682L)
Train Matrix  [[ 5.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]]

Test Matrix  [[ 5.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]]


### Calculate Cosine Similarity
<img src="user_sim.gif">
<img src="item_sim.gif">

#### Hint: look up `pairwise_distances` (with `metric='cosine'` it returns distances, so similarity is 1 minus the result)

In [7]:
def cosine_similarity(data):
    # your code here
    return similarity
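
As a rough illustration of what the exercise above asks for, here is a NumPy-only sketch of row-wise cosine similarity. The name `cosine_similarity_sketch` and the toy matrix are hypothetical. For rows that contain at least one rating, `1 - pairwise_distances(data, metric='cosine')` should give the same values.

```python
import numpy as np

def cosine_similarity_sketch(data):
    """Cosine similarity between every pair of rows in `data`."""
    norms = np.sqrt((data ** 2).sum(axis=1))
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    normalized = data / norms[:, np.newaxis]
    # Dot products of unit-length rows are the cosines of the angles between them.
    return normalized.dot(normalized.T)

# Toy ratings (not MovieLens data): rows 0 and 2 are identical, row 1 shares no item with them.
toy = np.array([[5., 0., 3.],
                [0., 4., 0.],
                [5., 0., 3.]])
toy_similarity = cosine_similarity_sketch(toy)
```

Passing the matrix as is compares users (rows); passing its transpose compares items, which mirrors the `train_data_matrix` and `train_data_matrix.T` calls in the next cell.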

In [14]:
user_similarity = cosine_similarity(train_data_matrix)
item_similarity = cosine_similarity(train_data_matrix.T)

In [15]:
print user_similarity.shape, item_similarity.shape
print user_similarity[0][1]
print item_similarity[0][1]

(943L, 943L) (1682L, 1682L)
0.884271334295
0.716230386274


### Predictions
- user-item filtering
- item-item filtering

<img src="user_sim.gif">
<img src="item_sim.gif">

In [8]:
def predict(ratings, similarity, type='item'):
    if type == 'item':
        pass  # your code here
    elif type == 'user':
        pass  # your code here
    return
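
The sketch below shows one common formulation of memory-based predictions; it is an assumption, not necessarily the formula in the images above. Item-based predictions are similarity-weighted averages of a user's ratings, and user-based predictions add a similarity-weighted average of other users' deviations to each user's mean rating. The name `predict_sketch` and the parameter `kind` (used instead of `type` to avoid shadowing the builtin) are hypothetical.

```python
import numpy as np

def predict_sketch(ratings, similarity, kind='item'):
    """Predict a full (n_users, n_items) rating matrix from a similarity matrix."""
    if kind == 'item':
        # Normalise each item column by the total similarity it receives.
        weights = np.abs(similarity).sum(axis=0)
        weights = np.where(weights == 0, 1.0, weights)
        return ratings.dot(similarity) / weights
    elif kind == 'user':
        # Work with deviations from each user's mean to offset rating bias.
        mean_user_rating = ratings.mean(axis=1, keepdims=True)
        ratings_diff = ratings - mean_user_rating
        weights = np.abs(similarity).sum(axis=1, keepdims=True)
        weights = np.where(weights == 0, 1.0, weights)
        return mean_user_rating + similarity.dot(ratings_diff) / weights
    raise ValueError("kind must be 'item' or 'user'")

# Sanity check on toy data: with an identity similarity matrix,
# each prediction falls back to the original rating.
toy_ratings = np.array([[4., 2.],
                        [2., 4.]])
toy_item_pred = predict_sketch(toy_ratings, np.eye(2), kind='item')
toy_user_pred = predict_sketch(toy_ratings, np.eye(2), kind='user')
```

Note that the user mean here is taken over all items, including unrated zeros; averaging only over rated items is a reasonable alternative.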

In [None]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [None]:
print item_prediction[0]
print user_prediction[0]

### Evaluate

In [20]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Score only the entries that have a rating in the ground truth (non-zero cells)
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))
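
To see what the non-zero mask does, here is a small NumPy-only equivalent applied to made-up numbers. The name `masked_rmse` and the toy arrays are hypothetical; the idea is the same as in `rmse` above.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells that are rated (non-zero) in ground_truth."""
    mask = ground_truth.nonzero()
    errors = prediction[mask] - ground_truth[mask]
    return np.sqrt(np.mean(errors ** 2))

# Only two cells are rated in the toy ground truth; the 9s sit on
# unrated cells and are ignored by the mask.
toy_truth = np.array([[5., 0.],
                      [0., 1.]])
toy_prediction = np.array([[3., 9.],
                           [9., 1.]])
toy_score = masked_rmse(toy_prediction, toy_truth)
```

The two scored errors are 2 and 0, so `toy_score` is the square root of 2, about 1.41.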

In [21]:
print 'Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix))
#print 'User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))


Item-based CF RMSE: 3.45542821619
