# Collaborative Filter Recommender

Here we generate recommendations for each user with the Surprise library, which is designed for building recommender systems in Python and provides implementations of the most popular collaborative filtering algorithms.

We have filtered the data down to 1000 users so that the notebook runs within the RAM that Colab offers.

- Starting from the file **ratings_filtrado.csv**.

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
datos = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eoi_recomendaciones/datos/ratings_filtratos.csv')

In [None]:
datos.head()

Unnamed: 0.1,Unnamed: 0,user_id,imdb_id,rating,time
0,0,116,tt0061418,8,1564732986
1,1,116,tt0083658,7,1569771038
2,2,116,tt0095327,9,1547623046
3,3,116,tt0095765,10,1563411771
4,4,116,tt0096283,9,1565151708


In [None]:
peliculas = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eoi_recomendaciones/datos/medias.csv')

In [None]:
peliculas.head()

Unnamed: 0.1,Unnamed: 0,imdb_id,conteo,media_rating,imdb_title,genre
0,0,tt0000008,1,5.0,Edison Kinetoscopic Record of a Sneeze,Documentary|Short
1,1,tt0419773,1,8.0,Gespenster,Drama
2,2,tt0419766,1,6.0,Garçon stupide,Comedy|Drama|Romance
3,3,tt0419765,1,8.0,We Have the Following News,Comedy
4,4,tt4048050,1,10.0,My Golden Days,Drama


In [None]:
peliculas = peliculas[['imdb_id', 'imdb_title']]

In [None]:
peliculas.head()

Unnamed: 0,imdb_id,imdb_title
0,tt0000008,Edison Kinetoscopic Record of a Sneeze
1,tt0419773,Gespenster
2,tt0419766,Garçon stupide
3,tt0419765,We Have the Following News
4,tt4048050,My Golden Days


In [None]:
# datos = datos.merge(peliculas, on='imdb_id')

In [None]:
datos.head()

Unnamed: 0.1,Unnamed: 0,user_id,imdb_id,rating,time
0,0,116,tt0061418,8,1564732986
1,1,116,tt0083658,7,1569771038
2,2,116,tt0095327,9,1547623046
3,3,116,tt0095765,10,1563411771
4,4,116,tt0096283,9,1565151708


In [None]:
! pip install surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633964 sha256=c09c4bb9be1fd492910451762120a0f5caffbae474e8602e21053ad4ea708216
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [None]:
from surprise import SVD, accuracy, Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

In [None]:
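# Reader() defaults to rating_scale=(1, 5), but these ratings go up to 10;
# Reader(rating_scale=(1, 10)) would keep estimates from being clipped at 5.
reader = Reader()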
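
`Reader()` with no arguments assumes ratings on a 1 to 5 scale, and Surprise clips every estimate to that range by default. The ratings in this dataset go up to 10, so it may be worth passing `rating_scale=(1, 10)` (assuming 1 is the lowest rating). The pure-Python sketch below only illustrates the effect of clipping; `clip_estimate` is a hypothetical helper, not part of Surprise, and the raw estimates are made-up numbers.

```python
def clip_estimate(est, rating_scale):
    """Clamp a raw estimate to the (lowest, highest) rating scale."""
    lowest, highest = rating_scale
    return min(max(est, lowest), highest)


# Made-up raw estimates for three items in a dataset rated up to 10.
raw_estimates = [6.8, 8.3, 9.1]

clipped_default = [clip_estimate(e, (1, 5)) for e in raw_estimates]
clipped_full = [clip_estimate(e, (1, 10)) for e in raw_estimates]

print(clipped_default)  # every estimate collapses to the upper bound 5
print(clipped_full)     # the original values and their ordering survive
```

When many estimates tie at the upper bound like this, a top-N list can end up being decided by item order rather than by predicted preference.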

In [None]:
data = Dataset.load_from_df(datos[['user_id', 'imdb_id', 'rating']], reader)

In [None]:
svd = SVD()
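
As a rough picture of what `SVD()` learns: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below evaluates that rule by hand; `predict_rating` and all the numbers are illustrative assumptions, not values taken from the fitted model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """r_hat = mu + b_u + b_i + q_i . p_u (biased matrix-factorization rule)."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


# Made-up parameters for one user and one item, with two latent factors.
estimate = predict_rating(
    global_mean=7.0,
    user_bias=0.5,
    item_bias=-0.2,
    user_factors=[0.1, 0.2],
    item_factors=[0.3, 0.4],
)
print(round(estimate, 2))
```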

In [None]:
trainset = data.build_full_trainset() 

In [None]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f2c9513fb90>

In [None]:
testset = trainset.build_anti_testset()
predictions = svd.test(testset)
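
`build_anti_testset()` returns every (user, item) pair that is absent from the training set, with the unknown true rating filled in by the global mean by default. A minimal pure-Python sketch of that idea, using a made-up ratings list and a hypothetical `anti_testset` helper:

```python
def anti_testset(ratings):
    """Return (user, item, fill) for every pair the user has not rated.

    `fill` is the mean of all known ratings, standing in for the missing truth.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    fill = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]


# Made-up ratings: user 116 rated both films, user 170 rated only one.
ratings = [(116, 'tt0061418', 8), (116, 'tt0083658', 7), (170, 'tt0061418', 10)]
pairs = anti_testset(ratings)
print(pairs)
```

Only the pair for user 170 and the film they have not rated comes back, which is why predictions on the anti-testset are the candidates for recommendation.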

In [None]:
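# The anti-testset has no true ratings (they are filled with the global mean),
# so this RMSE does not measure accuracy on real ratings.
accuracy.rmse(predictions)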

RMSE: 2.2311


2.231081285495108
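
RMSE is the square root of the mean squared difference between true and estimated ratings. Because the anti-testset has no true ratings (they are filled with the global mean), the figure above measures how far the estimates sit from that mean rather than from held-out ratings. The sketch below computes RMSE by hand on made-up pairs; `rmse` is a hypothetical helper, not Surprise's `accuracy.rmse`.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up held-out ratings and the estimates a model might give for them.
held_out = [(8, 7.0), (7, 7.0), (9, 10.0)]
print(rmse(held_out))
```

For a real accuracy estimate, one option is to hold out part of the known ratings, for example with the `train_test_split` imported earlier, and compute RMSE on those.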

In [None]:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
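
To see what such a function returns without fitting a model, the sketch below applies the same idea to hand-written tuples shaped like Surprise predictions (user id, item id, true rating, estimate, details). `top_n_per_user` is a hypothetical variant, not the notebook's function, and it swaps the full sort for `heapq.nlargest`.

```python
from collections import defaultdict
import heapq


def top_n_per_user(predictions, n=10):
    """Group (uid, iid, true_r, est, details) tuples by user, keep the n best."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    # nlargest returns the n highest estimates in descending order.
    return {uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for uid, items in by_user.items()}


# Made-up predictions for two users.
fake_predictions = [
    (116, 'tt0169858', None, 9.4, {}),
    (116, 'tt1121794', None, 9.1, {}),
    (116, 'tt0037515', None, 8.2, {}),
    (170, 'tt0061418', None, 7.5, {}),
]
result = top_n_per_user(fake_predictions, n=2)
print(result)
```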



In [None]:
top_n = get_top_n(predictions, n=10)


# We can see the recommendations for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

116 ['tt0169858', 'tt1121794', 'tt12361974', 'tt1950186', 'tt8332922', 'tt8367814', 'tt8772262', 'tt9426210', 'tt9541602', 'tt0037515']
170 ['tt0061418', 'tt0083658', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474', 'tt0265666']
254 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474']
268 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474']
407 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474']
416 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474']
689 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765', 'tt0096283', 'tt0104652', 'tt0119698', 'tt0156887', 'tt0167261', 'tt0253474']
699 ['tt0061418', 'tt0083658', 'tt0095327', 'tt0095765

In [None]:
# create an empty dataframe
reco_final = pd.DataFrame()

In [None]:
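# Same as before, but store the result in a DataFrame
frames = []
for uid, user_ratings in top_n.items():
    recousuarios = [iid for (iid, _) in user_ratings]
    # One row per recommended item; the scalar user id is broadcast
    frames.append(pd.DataFrame({'user_id': uid, 'imdb_id': recousuarios}))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
reco_final = pd.concat(frames)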

In [None]:
# reco_final['imdb_id'] = reco_final['imdb_id'].astype(int)
reco_final['user_id'] = reco_final['user_id'].astype(int)

In [None]:
reco_final

Unnamed: 0,user_id,imdb_id
0,116,tt0169858
1,116,tt1121794
2,116,tt12361974
3,116,tt1950186
4,116,tt8332922
...,...,...
5,71700,tt0104652
6,71700,tt0119698
7,71700,tt0156887
8,71700,tt0167261


In [None]:
reco_final = pd.merge(reco_final,peliculas, on = 'imdb_id')

In [None]:
reco_final

Unnamed: 0,user_id,imdb_id,imdb_title
0,116,tt0169858,Neon Genesis Evangelion: The End of Evangelion
1,116,tt1121794,Sword of the Stranger
2,116,tt12361974,Zack Snyder's Justice League
3,116,tt1950186,Ford v Ferrari
4,116,tt8332922,A Quiet Place Part II
...,...,...,...
9995,9662,tt0443706,Zodiac
9996,24375,tt0443706,Zodiac
9997,24375,tt0407887,The Departed
9998,31107,tt0407887,The Departed


In [None]:
# We benchmark the following algorithms, which we have already verified do not cause problems. SVD(), SlopeOne(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()

In [None]:
# reload the original data
data = Dataset.load_from_df(datos[['user_id', 'imdb_id', 'rating']], reader)

In [None]:
from surprise import (SVD, SlopeOne, NormalPredictor, KNNBaseline, KNNBasic,
                      KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering)
from surprise.model_selection import cross_validate

benchmark = []
# Iterate over all the algorithms
for algoritmo in [SVD(), SlopeOne(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Cross-validate (5 folds by default) and measure RMSE
    results = cross_validate(algoritmo, data, measures=['RMSE'], verbose=False)

    # Average the metrics over the folds and record the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([type(algoritmo).__name__], index=['Algorithm'])])
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
...

Algorithm,test_rmse,fit_time,test_time
BaselineOnly,2.883014,0.238912,0.058474
SVD,2.883368,2.92545,0.102218
KNNBaseline,2.893026,0.339864,0.612445
KNNWithMeans,2.894137,0.127061,0.541732
KNNBasic,2.897619,0.101418,0.514346
KNNWithZScore,2.898696,0.187768,0.571041
CoClustering,2.901692,1.942315,0.056415
SlopeOne,2.906331,3.370088,0.950382
NormalPredictor,2.987208,0.083487,0.083188
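
`cross_validate` splits the ratings into folds (five by default), trains on all but one fold, and tests on the held-out fold; the benchmark cell then averages the per-fold metrics. A minimal sketch of the fold splitting, with a hypothetical `k_fold_indices` helper and no shuffling (Surprise's own splitter shuffles by default):

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for each of n_splits contiguous folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


folds = list(k_fold_indices(10, n_splits=5))
for train, test in folds:
    print(len(train), test)
```

Each rating lands in exactly one test fold, so every algorithm in the table is scored on ratings it did not see during training.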


In [None]:
# Generate recommendations to later show some.

In [None]:
usuario_ejemplo = int(datos['user_id'].sample(n=1, random_state=123456).iloc[0])

In [None]:
usuario_ejemplo

8703

In [None]:
reco_final[reco_final.user_id==usuario_ejemplo]

Unnamed: 0,user_id,imdb_id,imdb_title
112,8703,tt0061418,Bonnie and Clyde
1102,8703,tt0083658,Blade Runner
2088,8703,tt0095765,Cinema Paradiso
3076,8703,tt0096283,My Neighbor Totoro
4062,8703,tt0104652,Porco Rosso
5055,8703,tt0119698,Princess Mononoke
6042,8703,tt0156887,Perfect Blue
7036,8703,tt0167261,The Lord of the Rings: The Two Towers
8019,8703,tt0253474,The Pianist
9085,8703,tt0095327,Grave of the Fireflies
