# Lab 8: Recommender System

In this assignment, we implement user-based and item-based collaborative filtering.

## 1. Dataset

In this assignment, we will use the MovieLens-100K dataset. It contains 100,000 ratings from 943 users on 1,682 movies.

In [1]:
from math import sqrt
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors


# 1. load data
user_ratings_train = pd.read_csv('./ml-100k/u1.base',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

user_ratings_test = pd.read_csv('./ml-100k/u1.test',
                            sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])

movie_info =  pd.read_csv('./ml-100k/u.item', 
                          sep='|', names=['movie_id','title'], usecols=[0,1],
                          encoding="ISO-8859-1")

user_ratings_train = pd.merge(movie_info, user_ratings_train)
user_ratings_test = pd.merge(movie_info, user_ratings_test)

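# 2. get the rating matrix. Each row is a user, and each column is a movie title.
#    Movie ids that share a title end up in one column (pivot_table averages their ratings by default).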
user_ratings_train = user_ratings_train.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')

user_ratings_test = user_ratings_test.pivot_table(index=['user_id'],
                                        columns=['title'],
                                        values='rating')




user_ratings_train = user_ratings_train.reindex(
                            index=user_ratings_train.index.union(user_ratings_test.index), 
                            columns=user_ratings_train.columns.union(user_ratings_test.columns) )

user_ratings_test = user_ratings_test.reindex(
                            index=user_ratings_train.index.union(user_ratings_test.index), 
                            columns=user_ratings_train.columns.union(user_ratings_test.columns) )

print(user_ratings_train.shape)
print(user_ratings_test.shape)

(943, 1664)
(943, 1664)
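
The pivot-and-reindex step above can be illustrated on a tiny example. This is a minimal sketch on hypothetical toy ratings (the frames `train_long` and `test_long` are made up for illustration and are not MovieLens data); it shows how reindexing both matrices to the union of users and titles gives them the same shape, with `NaN` wherever a rating is absent.

```python
import pandas as pd

# Hypothetical toy ratings in long format (not taken from MovieLens).
train_long = pd.DataFrame({
    "user_id": [1, 1, 2],
    "title": ["A", "B", "A"],
    "rating": [5, 3, 4],
})
test_long = pd.DataFrame({
    "user_id": [2, 3],
    "title": ["B", "C"],
    "rating": [2, 1],
})

# One row per user, one column per title.
train = train_long.pivot_table(index="user_id", columns="title", values="rating")
test = test_long.pivot_table(index="user_id", columns="title", values="rating")

# Align both matrices on the union of users and titles.
all_users = train.index.union(test.index)
all_titles = train.columns.union(test.columns)
train = train.reindex(index=all_users, columns=all_titles)
test = test.reindex(index=all_users, columns=all_titles)

print(train.shape, test.shape)
```

After alignment, position `[i, j]` refers to the same user and title in both matrices, which is what lets the evaluation loop index the train and test arrays with the same pair of indices.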


## Task 1. User-based CF

* Use pearson correlation to get the similarity between different users.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
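
The three steps above can be sketched end to end on a toy matrix. This is only an illustration under stated assumptions: the ratings, the held-out values, and the helper `predict_user_based` are hypothetical, and neighbours are chosen directly as the `k` users with the highest Pearson similarity who rated the movie, which is a different selection rule from the `NearestNeighbors` fit used in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rating matrix (rows: users, columns: movies); NaN = unrated.
ratings = pd.DataFrame(
    [[5.0, 4.0, np.nan, 1.0],
     [4.0, 5.0, 2.0, 1.0],
     [1.0, 2.0, 5.0, 4.0],
     [5.0, 5.0, 1.0, np.nan]],
    index=["u1", "u2", "u3", "u4"],
    columns=["m1", "m2", "m3", "m4"],
)

user_means = ratings.mean(axis=1)
# DataFrame.corr correlates columns, so transpose to correlate users.
# Each pair is correlated over the movies both users rated.
sim = ratings.T.corr(method="pearson")


def predict_user_based(user, movie, k=2):
    """Mean-centred, similarity-weighted prediction from the k most similar raters."""
    raters = ratings[movie].dropna().index.drop(user, errors="ignore")
    neighbours = sim.loc[user, raters].dropna().nlargest(k)
    denom = neighbours.abs().sum()
    if denom == 0:
        return float(user_means.loc[user])  # fall back to the user's mean
    deviations = ratings.loc[neighbours.index, movie] - user_means.loc[neighbours.index]
    return float(user_means.loc[user] + (neighbours * deviations).sum() / denom)


# Hypothetical held-out ratings used only to illustrate the MAE computation.
held_out = {("u1", "m3"): 2.0, ("u4", "m4"): 1.0}
preds = [predict_user_based(u, m) for (u, m) in held_out]
mae = float(np.mean(np.abs(np.array(preds) - np.array(list(held_out.values())))))
print(preds, mae)
```

Dividing by the sum of absolute similarities keeps the correction term bounded by the largest neighbour deviation, even when some neighbours are negatively correlated with the target user.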

In [2]:
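# NearestNeighbors does not accept NaN, so fill each user's missing ratings with that user's mean rating
mean_ratings_by_user = user_ratings_train.mean(axis=1)
# transpose so fillna matches the per-user means to the user columns, then transpose back
user_ratings_train_nan_filled = user_ratings_train.T.fillna(mean_ratings_by_user).T

# Pearson correlation between every pair of users (corr works column-wise, hence the transpose)
network = user_ratings_train_nan_filled.T.corr(method='pearson').values

# fit NearestNeighbors on the correlation matrix; with the default metric, a user's neighbors
# are the users whose rows of `network` are closest in Euclidean distance
NearestNeighborsModel = NearestNeighbors(n_neighbors=5).fit(network)

# X=None queries the training points themselves; a point is not returned as its own neighbor
neighbors_distance, neighbors_ind = NearestNeighborsModel.kneighbors(X=None)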

In [3]:
from sklearn.metrics import mean_absolute_error

# prepare train and test matrices
user_data_train = user_ratings_train_nan_filled.values
user_data_test  = user_ratings_test.values

# input for mean_absolute_error
truth, pred  = [], []

# loop over each value of the test set
for user_id, user_ratings in enumerate(user_data_train):
    for video_id, video_rating in enumerate(user_ratings):
        # ignore null test ratings
        if np.isnan(user_data_test[user_id, video_id]): continue
        
        # get the neighbors of current user to predict
        neighbors = neighbors_ind[user_id]
        
        # get the ratings given by the neighbors via train
        neighbor_ratings = user_data_train[neighbors]
        
        # get rating for the video
        video_ratings = neighbor_ratings[:, video_id]
        
        # get biases for each user
        biases    = mean_ratings_by_user.values[neighbors]
        self_bias = mean_ratings_by_user.values[user_id]
        
        # get the similarity between the current user and each neighbor
        sim_scores = network[user_id][neighbors]
        
        # predicted rating: the user's mean plus the similarity-weighted average
        # of the neighbors' mean-centered ratings
        score = self_bias + (np.sum((np.multiply(sim_scores, video_ratings - biases))) / np.sum(sim_scores))
        
        # save to compute error later
        truth.append(user_data_test[user_id, video_id])
        pred.append(score)
        

MAE = mean_absolute_error(truth, pred)
print(f'MAE for User-based CF is {MAE}')  

MAE for User-based CF is 0.81393842325719
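
The prediction above normalizes by `np.sum(sim_scores)`. A common variant normalizes by the sum of absolute similarities instead, which matters when a neighbor's similarity is negative. The sketch below compares the two on made-up numbers; the helper `weighted_offset` and its inputs are hypothetical and are not part of the notebook.

```python
import numpy as np


def weighted_offset(sim_scores, deviations, use_abs=True):
    """Similarity-weighted average of the neighbors' mean-centered ratings."""
    sim_scores = np.asarray(sim_scores, dtype=float)
    deviations = np.asarray(deviations, dtype=float)
    denom = np.abs(sim_scores).sum() if use_abs else sim_scores.sum()
    if np.isclose(denom, 0.0):
        return 0.0  # no usable neighbors: apply no correction to the user's mean
    return float(np.dot(sim_scores, deviations) / denom)


# Hypothetical neighbors: one positively and one negatively correlated.
sims = [0.5, -0.4]
devs = [1.0, -1.0]

plain = weighted_offset(sims, devs, use_abs=False)   # denominator close to 0.1
bounded = weighted_offset(sims, devs, use_abs=True)  # denominator 0.9
print(plain, bounded)
```

With the plain sum, the two similarities nearly cancel in the denominator and the offset grows to roughly 9 rating points, while the absolute-value version stays near 1. Whether this changes the MAE reported above depends on how often the selected neighbors have negative similarities, which is not measured here.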


## Task 2. Item-based CF
* Use cosine similarity to get the similarity between different items.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
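
As a starting point for the steps above, here is a minimal sketch of item-based prediction on a toy matrix. The matrix `R` and the helper `predict_item_based` are hypothetical. Missing ratings are encoded as 0 before computing cosine similarity, which is a simplifying assumption rather than the required approach for this task.

```python
import numpy as np

# Hypothetical toy matrix: rows are users, columns are items; 0 marks "unrated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
    [5.0, 5.0, 1.0, 0.0],
])

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)


def predict_item_based(user, item, k=2):
    """Weighted average of the user's ratings on the k items most similar to `item`."""
    rated = np.flatnonzero(R[user])  # items this user has rated
    rated = rated[rated != item]
    if rated.size == 0:
        return float("nan")
    # indices of the k rated items with the highest similarity to the target item
    top = rated[np.argsort(item_sim[item, rated])[::-1][:k]]
    weights = item_sim[item, top]
    if np.isclose(weights.sum(), 0.0):
        return float(R[user, rated].mean())
    return float(weights @ R[user, top] / weights.sum())


pred = predict_item_based(user=0, item=2)
print(pred)
```

Because the toy ratings are non-negative, every cosine similarity is non-negative too, so the prediction is a convex combination of the user's own ratings on the neighboring items.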

In [4]:
user_ratings_test

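| user_id | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 939 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 940 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 941 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 942 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |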
