#### Types of Recommendation Engine
1. <b>Collaborative Filtering:</b> relies on the "wisdom of the crowd" (other users' ratings); commonly used and generally gives better results<br>
    a. Memory Based CF uses cosine similarity<br>
    b. Model Based CF uses SVD
2. <b>Content Based:</b> relies on similarity between items

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white')
%matplotlib inline

### Data

In [2]:
columnNames = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=columnNames)
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [3]:
movieTitles = pd.read_csv('Movie_Id_Titles')
movieTitles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [4]:
df = pd.merge(df, movieTitles, on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


In [11]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()
print(n_users, n_items)

944 1682


### Train Test Split

In [8]:
from sklearn.model_selection import train_test_split
trainData, testData = train_test_split(df, test_size=0.25)
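
The split above is random, so the RMSE values reported further down will differ from run to run. A minimal sketch, assuming a fixed seed is acceptable for this analysis: passing `random_state` to `train_test_split` makes the split repeatable. The toy frame and the seed value `42` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings standing in for df
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'item_id': [1, 2, 1, 3, 2, 3, 1, 2],
    'rating':  [5, 3, 4, 2, 1, 5, 4, 3],
})

# Same seed, same split: the two calls select identical rows
toyTrain, toyTest = train_test_split(toy, test_size=0.25, random_state=42)
toyTrainAgain, toyTestAgain = train_test_split(toy, test_size=0.25, random_state=42)

print(len(toyTrain), len(toyTest))
print(toyTest.index.equals(toyTestAgain.index))
```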

## Memory Based CF
- Easy to implement
- Hard to scale
1. <b>Item-Item CF:</b> "users who liked this item also liked ..."
2. <b>User-Item CF:</b> "users who are similar to you also liked ..."

In [14]:
trainDataMatrix = np.zeros((n_users, n_items))
for line in trainData.itertuples():
    trainDataMatrix[line[1]-1, line[2]-1] = line[3]

testDataMatrix = np.zeros((n_users, n_items))
for line in testData.itertuples():
    testDataMatrix[line[1]-1, line[2]-1] = line[3]
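
In the loops above, `itertuples()` yields the row index at position 0, so `line[1]`, `line[2]` and `line[3]` are `user_id`, `item_id` and `rating`. The same user-item matrix can be filled without a Python loop by using NumPy integer-array indexing. This is only a sketch on a hypothetical toy frame; it assumes 1-based IDs and at most one rating per (user, item) pair. The sample output earlier shows a `user_id` of 0, and with the `-1` offset such a user lands on index `-1`, which NumPy treats as the last row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with 1-based IDs
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 3, 4, 1],
})

toyMatrix = np.zeros((3, 3))
# Row index = user_id - 1, column index = item_id - 1
rows = toy['user_id'].to_numpy() - 1
cols = toy['item_id'].to_numpy() - 1
toyMatrix[rows, cols] = toy['rating'].to_numpy()

print(toyMatrix)
```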

In [16]:
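from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' returns the cosine *distance* (1 - cosine similarity),
# so despite their names these matrices hold distances, not similarities.
userSimilarity = pairwise_distances(trainDataMatrix, metric='cosine')
itemSimilarity = pairwise_distances(trainDataMatrix.T, metric='cosine')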
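
It may help to check what `pairwise_distances(..., metric='cosine')` returns. The sketch below, on a hypothetical 3x3 toy matrix, compares it with scikit-learn's `cosine_similarity`: the distance is one minus the similarity, so identical rows get 0 and orthogonal rows get 1.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Hypothetical toy ratings: rows 0 and 1 are alike, row 2 shares no items with them
toy = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 3.0],
    [0.0, 2.0, 0.0],
])

toyDistance = pairwise_distances(toy, metric='cosine')
toySimilarity = cosine_similarity(toy)

# Cosine distance is defined as 1 - cosine similarity
print(np.allclose(toySimilarity, 1.0 - toyDistance))
print(np.round(toyDistance, 3))
```

If similarity weights are what is wanted, `1 - pairwise_distances(...)` or `cosine_similarity(...)` would provide them; whether that changes the scores reported here was not measured.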

In [22]:
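def predict(ratings, similarity, type='user'):
    # `type` selects user-based ('user') or item-based ('item') prediction
    if type == 'user':
        # Per-user mean over the whole row (unrated items count as 0)
        meanUserRating = ratings.mean(axis=1)
        ratingsDiff = ratings - meanUserRating[:, np.newaxis]
        # Weighted average of other users' deviations, added back to the user's mean
        pred = meanUserRating[:, np.newaxis] + similarity.dot(ratingsDiff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings across items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred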
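
Written out, the two branches of `predict` compute the following, as read directly from the code. The symbols are notation introduced here: $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{r}_u$ is the mean of row $u$ (including zeros for unrated items), and $s$ is the matrix passed as `similarity`.

```latex
% User-based branch (type == 'user')
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert s_{u,v} \rvert}

% Item-based branch (type == 'item')
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```

In the user-based case the sums run over all users $v$; in the item-based case they run over all items $j$.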

In [23]:
itemPrediction = predict(trainDataMatrix, itemSimilarity, type='item')
userPrediction = predict(trainDataMatrix, userSimilarity, type='user')

### RMSE

In [26]:
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(preds, gts):
    # Score only the cells that hold an actual rating in the ground truth
    mask = gts.nonzero()
    preds = preds[mask].flatten()
    gts = gts[mask].flatten()
    return sqrt(mean_squared_error(gts, preds))

In [28]:
rmse(userPrediction, testDataMatrix), rmse(itemPrediction, testDataMatrix)

(3.1197914788713734, 3.4473383907846094)
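
The `rmse` helper scores predictions only where the test matrix holds a rating, since a 0 there means "not rated" rather than a rating of zero. A small NumPy-only sketch of the same masking idea follows; the function name `maskedRmse` and the toy arrays are hypothetical.

```python
import numpy as np

def maskedRmse(preds, gts):
    # Keep only positions with a known (nonzero) ground-truth rating
    mask = gts != 0
    return float(np.sqrt(np.mean((preds[mask] - gts[mask]) ** 2)))

toyTruth = np.array([[5.0, 0.0],
                     [0.0, 3.0]])
toyPreds = np.array([[4.0, 2.0],
                     [1.0, 3.0]])

# Only (0, 0) and (1, 1) are scored: squared errors 1 and 0, mean 0.5
toyScore = maskedRmse(toyPreds, toyTruth)
print(toyScore)
```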

## Model Based CF
- Based on matrix factorization
- Scales better to large, sparse rating matrices
- Generally more accurate than memory-based CF

In [32]:
sparsity = round(1.0 - len(df)/float(n_users*n_items), 3)
print('Sparsity', sparsity*100, '%')

Sparsity 93.7 %
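
Sparsity here is the fraction of user-item cells with no rating. As a sketch on a hypothetical toy matrix, the same quantity can be read straight off a filled matrix with `np.count_nonzero`, assuming 0 marks a missing rating:

```python
import numpy as np

# Hypothetical 3 users x 4 items matrix with 3 known ratings
toyRatings = np.array([
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
])

# Fraction of cells that are empty: 1 - 3/12
toySparsity = 1.0 - np.count_nonzero(toyRatings) / toyRatings.size
print(toySparsity)
```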


### Singular Value Decomposition (SVD)

In [33]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

In [37]:
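# Rank-20 truncated SVD of the training matrix
u, s, vt = svds(trainDataMatrix, k=20)
sDiagMatrix = np.diag(s)
# Multiply the factors back together; the low-rank result serves as the predictions
xPred = np.dot(np.dot(u, sDiagMatrix), vt)
print('SVD-based CF RMSE:', rmse(xPred, testDataMatrix))

SVD-based CF RMSE: 2.7097071273455335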
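
To see what the truncated SVD step does, the sketch below applies `svds` with `k=2` to a small hypothetical matrix and checks two properties: the reconstruction has rank at most `k`, and its Frobenius error equals the energy in the discarded singular values. The toy matrix, the seed and the choice of `k` are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical 6 users x 5 items matrix with ratings 0..5
rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(6, 5)).astype(float)

k = 2  # svds requires k < min(toy.shape)
u, s, vt = svds(toy, k=k)
toyApprox = u @ np.diag(s) @ vt

# Compare with the full SVD: the error comes from the singular values left out
fullSingularValues = np.linalg.svd(toy, compute_uv=False)
reconstructionError = np.linalg.norm(toy - toyApprox)
expectedError = np.sqrt(np.sum(np.sort(fullSingularValues)[::-1][k:] ** 2))

print(toyApprox.shape, np.linalg.matrix_rank(toyApprox))
print(reconstructionError, expectedError)
```

A larger `k` keeps more of the original matrix, including its noise, while a smaller `k` smooths more aggressively; the value 20 used in the notebook is one point on that trade-off.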
