# Item-Item Collaborative Filtering Experiment

Created on Aug 18, 2017 by Tina Bu

## Requirements

- Find a Python package that implements item-item collaborative filtering.
- The library should take a user × item matrix (view-based or purchase-based) and build an item-item similarity matrix from it.
- Using that item-item similarity matrix and a given user's past view history, find the top 6 items to recommend.
- The matrix is binary (viewed / not viewed, or purchased / not purchased), not ratings as discussed in paper 1.
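
The requirements above can be sketched in plain Python before looking at any library. This is only an illustration under assumptions: the `views` data, the function names, and the scoring rule (sum of similarities to the items in the user's history) are hypothetical choices, not part of the task statement or of any package evaluated below.

```python
from collections import defaultdict
from math import sqrt

def item_similarity(user_items):
    """Cosine similarity between items, from binary user -> set(items) data."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    for a in item_users:
        for b in item_users:
            if a == b:
                continue
            overlap = len(item_users[a] & item_users[b])
            if overlap:
                # For 0/1 vectors the dot product is the co-occurrence count
                # and each vector norm is sqrt(number of users of the item).
                sims[a][b] = overlap / sqrt(len(item_users[a]) * len(item_users[b]))
    return sims

def recommend(user_items, sims, user, n=6):
    """Rank unseen items by summed similarity to the user's history (assumed rule)."""
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims[item].items():
            if other not in seen:
                scores[other] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Hypothetical toy view history
views = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
    "u4": {"A", "D"},
}
sims = item_similarity(views)
print(recommend(views, sims, "u2"))  # ['C', 'D']
```

The double loop over items is quadratic, which is fine for a toy example but is the part a real library would optimize.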

## Libraries


In [38]:
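# Note: evaluate, print_perf, data.split() and algo.train() belong to the Surprise 1.0.x API
# used throughout this notebook. Surprise 1.1+ removed them in favor of
# surprise.model_selection.cross_validate and algo.fit().
from surprise import SVD, Dataset, evaluate, print_perf
from surprise import prediction_algorithms as pa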
from collections import defaultdict


## 0. Download MovieLens Data

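Available at https://grouplens.org/datasets/movielens/, or as a built-in dataset in Surprise.

The ml-100k set contains 100,000 ratings from 943 users on 1,682 movies (rounded to 1000 users and 1700 movies on the download page).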

In [5]:
# Load the movielens-100k dataset (download it if needed)
data = Dataset.load_builtin('ml-100k')
# format:'user item rating timestamp'

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/Tina/.surprise_data/ml-100k


In [24]:
# and split it into 3 folds for cross-validation.
data.split(n_folds = 3)

## 1. Item-Item Similarity Matrix


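The paper http://files.grouplens.org/papers/www10_sarwar.pdf compares cosine, correlation, and adjusted cosine similarity for item-item filtering. Plain cosine similarity is chosen here. Note that on rating data the paper reports the lowest MAE for adjusted cosine, not plain cosine; since the target matrix in this experiment is binary, there is no per-user rating scale to adjust for, so plain cosine is kept.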
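
For reference, the cosine and adjusted cosine measures can be written as follows. The notation is an assumption chosen for this note: $R_{u,i}$ is the rating of user $u$ on item $i$, $\bar{R}_u$ is the mean rating of user $u$, and $U$ is the set of users who rated both items.

$$
\mathrm{sim}_{\cos}(i, j) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert \, \lVert \vec{j} \rVert}
$$

$$
\mathrm{sim}_{\mathrm{adj}}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
$$

With a binary matrix, the plain cosine reduces to the number of users who touched both items divided by the square root of the product of the two item counts.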

In [30]:
# Define the similarity options

sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }

# Define the algorithm used
algo = pa.knns.KNNBasic(sim_options=sim_options)

In [31]:
# Evaluate model performance
perf = evaluate(algo, data, measures = ['MAE'], verbose = 1)

Evaluating MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
MAE:  0.8190
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
MAE:  0.8233
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
MAE:  0.8253
------------
------------
Mean MAE : 0.8225
------------
------------


In [32]:
print_perf(perf)

        Fold 1  Fold 2  Fold 3  Mean    
MAE     0.8190  0.8233  0.8253  0.8225  
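
As a reminder of what the numbers above measure, here is a minimal sketch of the mean absolute error. The `fold` predictions are made up for illustration; only the three per-fold values are taken from the output above, and the reported mean matches their simple average.

```python
def mean_absolute_error(pairs):
    """MAE over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical predictions for a single fold
fold = [(4, 3.5), (2, 2.8), (5, 4.1)]
print(round(mean_absolute_error(fold), 4))

# Per-fold MAE values reported above; their simple average gives the "Mean" column
fold_mae = [0.8190, 0.8233, 0.8253]
print(round(sum(fold_mae) / len(fold_mae), 4))  # 0.8225
```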


## 2. Most Similar Item

In [33]:
# Retrieve the whole training set
trainset = data.build_full_trainset()
algo.train(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [36]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
# This will take several minutes.
# The anti-testset contains every pair where user u and item i are both known
# but the rating r_ui is missing from the trainset. Since r_ui is unknown, its
# "true" value is filled with global_mean, the mean of all training ratings.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
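
To make the anti-testset idea concrete, the following is a rough pure-Python equivalent. It is a sketch of the concept, not Surprise's implementation: the function name is reused only for readability and the `train` triples are hypothetical. For ml-100k the result has 943 × 1,682 − 100,000 = 1,486,126 pairs, which is why the prediction step takes minutes.

```python
def build_anti_testset(ratings):
    """ratings: list of (user, item, rating) triples from the training set.

    Returns every (user, item) pair that has no rating, with the global
    mean rating filled in as a placeholder for the unknown true value.
    """
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    return [(u, i, global_mean) for u in users for i in items if (u, i) not in rated]

# Hypothetical training triples
train = [("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4)]
anti = build_anti_testset(train)
print(anti)  # [('u2', 'B', 4.0)]
```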

## 3. Top 6 Recommendation

In [34]:
def get_top_n(predictions, n = 6):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user.
                Default is 6.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
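
The expected input shape is easier to see on toy data. The sketch below uses a stand-in `Prediction` namedtuple whose field names are assumed to mirror the five-field unpacking used in `get_top_n`; the values are invented. It also uses `heapq.nlargest` instead of a full sort, a variant that avoids sorting every user's complete list when only a few items are needed.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in for the prediction tuples (assumed fields: uid, iid, r_ui, est, details)
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def top_n_heap(predictions, n=6):
    """Same result shape as get_top_n, using a partial selection per user."""
    per_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, pairs, key=lambda pair: pair[1])
        for uid, pairs in per_user.items()
    }

# Hypothetical predictions
preds = [
    Prediction("u1", "A", None, 4.2, {}),
    Prediction("u1", "B", None, 3.1, {}),
    Prediction("u1", "C", None, 4.8, {}),
    Prediction("u2", "A", None, 2.5, {}),
]
print(top_n_heap(preds, n=2))
# {'u1': [('C', 4.8), ('A', 4.2)], 'u2': [('A', 2.5)]}
```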

In [39]:
top_n = get_top_n(predictions, n=6)

In [40]:
# Print the top 6 recommended items for each user in the test set
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

938 ['1571', '1568', '1569', '1577', '1581', '1580']
302 ['1536', '1309', '1122', '1310', '1542', '1543']
904 ['1606', '1309', '1310', '1673', '711', '1354']
916 ['1156', '1593', '651', '1201', '603', '127']
601 ['1653', '32', '199', '432', '89', '1593']
346 ['1653', '87', '71', '239', '1654', '28']
877 ['1369', '435', '482', '190', '1256', '1267']
61 ['1571', '1568', '1569', '1308', '1573', '1577']
910 ['1571', '1568', '1569', '1577', '1581', '1580']
527 ['1536', '693', '478', '436', '481', '602']
836 ['1619', '1618', '711', '1027', '1484', '1674']
854 ['1122', '116', '1571', '1568', '1569', '1573']
621 ['64', '47', '144', '288', '234', '12']
619 ['1599', '864', '732', '164', '1606', '1309']
682 ['1614', '1156', '1122', '1593', '1674', '1618']
241 ['1309', '1653', '1122', '1310', '1599', '1614']
901 ['742', '215', '77', '186', '385', '684']
707 ['1653', '1671', '1026', '1678', '1679', '1680']
292 ['23', '1012', '47', '1377', '93', '269']
850 ['1362', '1667', '1670', '1610', '1414', '1

## Crab

http://muricoca.github.io/crab/install.html

https://github.com/muricoca/crab/tree/master/doc

Documentation is sparse, and the project no longer appears to be maintained.

In [1]:
import pandas as pd

# pass in column names for each CSV and read them using pandas. 
# Column names available in the readme file

#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
 encoding='latin-1')
#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
 encoding='latin-1')

#Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols,
 encoding='latin-1')

In [2]:
ratings.head()

   user_id  movie_id  rating  unix_timestamp
0      196       242       3       881250949
1      186       302       3       891717742
2       22       377       1       878887116
3      244        51       2       880606923
4      166       346       1       886397596


In [10]:
import crab

In [7]:
# Tried different installation methods but still cannot import modules
# No 

from crab.models import MatrixPreferenceDataModel
#Build the model
model = MatrixPreferenceDataModel(movies.data)

from crab.metrics import pearson_correlation
from crab.similarities import ItemSimilarity
#Build the similarity
similarity = ItemSimilarity(model, pearson_correlation)

from crab.recommenders.knn import ItemBasedRecommender

#Build the Item based recommender
recommender = ItemBasedRecommender(model, similarity, with_preference=True)

ImportError: No module named 'crab.models'

In [63]:
# Recommend items for all users
for user_id in range(1000):
    recommender.recommend(user_id)

NameError: name 'recommender' is not defined

## Graphlab

https://turi.com/products/create/docs/graphlab.toolkits.recommender.html

In [53]:
# Load the pre-divided data, in which the test set holds exactly 10 ratings per user,
# i.e. 9430 rows in total (943 users x 10).

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_base.shape, ratings_test.shape

((90570, 4), (9430, 4))
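
A split of this kind (a fixed number of held-out ratings per user) can be reproduced on other data with a few lines of Python. This is a sketch under assumptions: the function name, the random selection, and the toy rows are hypothetical, and it is not necessarily how GroupLens generated `ua.base` and `ua.test`.

```python
import random
from collections import defaultdict

def holdout_per_user(ratings, n_test=10, seed=0):
    """Split (user, item, rating) rows so each user contributes n_test test rows."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)
    base, test = [], []
    for rows in by_user.values():
        rows = rows[:]          # copy so the caller's data is left untouched
        rng.shuffle(rows)
        test.extend(rows[:n_test])
        base.extend(rows[n_test:])
    return base, test

# Hypothetical data: 5 users with 20 ratings each
toy = [(u, i, 3) for u in range(5) for i in range(20)]
base, test = holdout_per_user(toy, n_test=10)
print(len(base), len(test))  # 50 50
```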

In [55]:
import graphlab
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)

ImportError: No module named 'graphlab'

In [57]:
from graphlab.recommender import item_similarity_recommender
m = item_similarity_recommender.create(train_data, target = "rating", similarity_type='cosine', only_top_k=6)

ImportError: No module named 'graphlab'

In [None]:
# Return a score prediction for the user ids and item ids in the provided data set
m.predict(train_data)

In [None]:
# Recommend the k highest scored items for each user.

m.recommend(k = 6)

## Spark's MLlib

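Well suited to large datasets: http://spark.apache.org/docs/1.0.0/mllib-collaborative-filtering.html

Note that MLlib's collaborative filtering is based on ALS matrix factorization rather than on an item-item similarity matrix.

Not tested in this experiment.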

## References

- Surprise documentation: http://surprise.readthedocs.io/en/stable/

- Surprise developer slides: https://www.slideshare.net/PoleSystematicParisRegion/collaborative-filtering-for-recommendation-systems-in-python-nicolas-hug

- Blog on Surprise: https://medium.com/@m_n_malaeb/the-easy-guide-for-building-python-collaborative-filtering-recommendation-system-in-2017-d2736d2e92a8

- Graphlab blog: https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/

- Spark MLlib: https://spark.apache.org/docs/1.4.0/api/python/pyspark.mllib.html#module-pyspark.mllib.recommendation

## CF From Data Science Homework

In [None]:
import math

import numpy as np
import scipy.linalg as la
import scipy.sparse as sp

def process(ratings, movies, P):
    # P is a permutation of the row indices, used to shuffle before the 90/10 split.
    movieIds = sorted(ratings["movieId"].unique())
    movieNames = [movies[movies["movieId"] == mid].iloc[0]["title"] for mid in movieIds]
    d = {mid: i for i, mid in enumerate(movieIds)}  # movieId -> column index
    Rs = np.asarray(ratings["rating"])[P]  # assumes the MovieLens column name "rating"
    uids = np.asarray(ratings["userId"])[P] - 1
    mids = np.asarray(ratings["movieId"].apply(lambda x: d[x]))[P]
    shape = uids.max() + 1, mids.max() + 1
    split = int(math.floor(float(len(ratings)) * 9 / 10))
    X_tr = sp.coo_matrix((Rs[:split], (uids[:split], mids[:split])), shape=shape).toarray()
    X_te = sp.coo_matrix((Rs[split:], (uids[split:], mids[split:])), shape=shape).toarray()
    return X_tr, X_te, movieNames

def recommend(X, U, V, movieNames):
    # For each user, return the unrated movie with the highest predicted score.
    names = []
    for i, Xi in enumerate(X):
        D = (Xi == 0).astype(int)  # 1 for movies the user has not rated
        name = movieNames[U[i, :].dot(sp.diags(D).dot(V).T).argmax()]
        names.append(name)
    return names

def error(X, U, V):
    # Mean squared error over the observed (non-zero) entries only.
    return (((U.dot(V.T) - X)[X > 0]) ** 2).mean()

def train(X, X_te, k, U, V, niters=51, lam=10, verbose=False):
    # Alternating least squares: solve for V with U fixed, then for U with V fixed.
    m, n = X.shape
    D = lam * np.eye(k) / 2.
    W = (X > 0).astype(int)
    W_te = (X_te > 0).astype(int)
    print("{0:5}|{1:10}|{2:10}".format("Iter", "Train Err", "Test Err"))
    for i in range(niters):
        for j, Wj in enumerate(W.T):
            V[j, :] = la.solve(U.T.dot(sp.diags(Wj).dot(U)) + D, U.T.dot(X[:, j]))
        for j, Wj in enumerate(W):
            U[j, :] = la.solve(V.T.dot(sp.diags(Wj).dot(V)) + D, V.T.dot(X[j, :]))
        if verbose and i % 5 == 0:
            print("{0:5d}|{1:10.4f}|{2:10.4f}".format(i, error(X, U, V), error(X_te, U, V)))
    return U, V
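
Read as math, the two inner loops of `train` appear to perform the following alternating least squares updates. The notation is an assumption made for this note: $W$ is the 0/1 mask of observed entries of $X$, $k$ is the number of latent factors, and $\lambda$ corresponds to `lam`.

$$
V_{j,:} = \left( U^\top \operatorname{diag}(W_{:,j}) \, U + \tfrac{\lambda}{2} I_k \right)^{-1} U^\top X_{:,j}
$$

$$
U_{i,:} = \left( V^\top \operatorname{diag}(W_{i,:}) \, V + \tfrac{\lambda}{2} I_k \right)^{-1} V^\top X_{i,:}
$$

Because $X$ is zero wherever $W$ is zero, the right-hand sides need no explicit mask.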