---
## **Exercise: Collaborative Filtering**

**Model Based**

**Using the rating.csv and anime.csv datasets, build a recommendation system with the following steps:**

* Merge the two datasets (rating.csv and anime.csv) to obtain the columns ['user_id', 'anime_id', 'rating', 'name']
* Compare the SVD and ALS algorithms
* Tune the algorithm you consider better

After obtaining the best model, predict the ratings for the following anime:

* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257

For users:

* 50
* 200
* 400
* 800

In what order would you recommend these anime to each user?

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# import surprise

# for loading the dataset
from surprise import Reader, Dataset

# for the algorithms
from surprise import SVD, BaselineOnly

# for modeling
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV

# for the evaluation metrics
from surprise import accuracy

ModuleNotFoundError: No module named 'surprise'

In [None]:
df_rating = pd.read_csv('rating.csv').drop(columns='Unnamed: 0')
df_rating.head()

Unnamed: 0,user_id,anime_id,rating
0,1,8074,10.0
1,1,11617,10.0
2,1,11757,10.0
3,1,15451,10.0
4,2,11771,10.0


In [None]:
df_anime = pd.read_csv('anime.csv')
df_anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [None]:
df = pd.merge(left=df_rating,right=df_anime,on='anime_id')
df = df[['user_id','anime_id','rating_x','name']]

In [None]:
df.rename(columns={'rating_x':'rating'}, inplace=True)

In [None]:
df

Unnamed: 0,user_id,anime_id,rating,name
0,1,8074,10.0,Highschool of the Dead
1,3,8074,6.0,Highschool of the Dead
2,5,8074,2.0,Highschool of the Dead
3,12,8074,6.0,Highschool of the Dead
4,14,8074,6.0,Highschool of the Dead
...,...,...,...,...
77863,963,27909,6.0,Otome Hime
77864,979,7549,8.0,Quiz Magic Academy: The Original Animation 2
77865,992,1044,4.0,Taiyou no Ouji: Horus no Daibouken
77866,995,2571,6.0,Mitsubachi Maya no Bouken
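
Both files carry a column called `rating` (the user's own score in rating.csv and what appears to be a per-anime average in anime.csv), which is why the merge produces `rating_x` and needs a rename afterwards. The sketch below reproduces this on invented rows that only assume the same column names, and shows one way to avoid the clash: select just `anime_id` and `name` from the anime table before merging.

```python
import pandas as pd

# Toy stand-ins for rating.csv and anime.csv (all values are invented)
toy_rating = pd.DataFrame({
    'user_id': [1, 1, 2],
    'anime_id': [10, 20, 10],
    'rating': [9.0, 7.0, 8.0],
})
toy_anime = pd.DataFrame({
    'anime_id': [10, 20],
    'name': ['Anime A', 'Anime B'],
    'rating': [8.5, 7.2],
})

# Merging everything: the clashing 'rating' columns get _x / _y suffixes
merged_all = pd.merge(left=toy_rating, right=toy_anime, on='anime_id')
print(merged_all.columns.tolist())

# Merging only the needed columns: no clash, so no rename is required
merged_slim = pd.merge(left=toy_rating, right=toy_anime[['anime_id', 'name']], on='anime_id')
print(merged_slim.columns.tolist())
```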


In [None]:
df.isna().sum()

user_id     0
anime_id    0
rating      0
name        0
dtype: int64

In [None]:
df['rating'].unique()

array([10.,  6.,  2.,  7.,  9.,  8.,  4.,  5.,  3.,  1.])

In [None]:
# Define reader
reader = Reader(rating_scale=(1,10))

# Load dataset
data = Dataset.load_from_df(df=df.loc[:,'user_id':'rating'], reader=reader)
# The DataFrame passed to df must have its columns in this order: user -> item -> rating. With a different order, Surprise misreads the columns.
data

<surprise.dataset.DatasetAutoFolds at 0x1ccf6849400>
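
The label slice `'user_id':'rating'` only works because the three columns happen to sit next to each other in the right order. A more defensive option is to pick the columns by name and check the ratings against the scale given to `Reader` before loading. The helper `to_surprise_frame` below is hypothetical (it is not part of Surprise) and runs on invented rows:

```python
import pandas as pd

def to_surprise_frame(frame, user_col, item_col, rating_col, rating_scale):
    """Hypothetical helper: return a 3-column frame in user -> item -> rating order."""
    low, high = rating_scale
    out = frame[[user_col, item_col, rating_col]].copy()
    # Fail early if a rating lies outside the scale that will be given to Reader
    if not out[rating_col].between(low, high).all():
        raise ValueError('rating outside the declared rating_scale')
    return out

# Columns deliberately stored in the "wrong" order
toy = pd.DataFrame({
    'name': ['Anime A', 'Anime B'],
    'rating': [10.0, 6.0],
    'anime_id': [8074, 8074],
    'user_id': [1, 3],
})
ordered = to_surprise_frame(toy, 'user_id', 'anime_id', 'rating', rating_scale=(1, 10))
print(ordered.columns.tolist())
```

The returned frame could then be passed to `Dataset.load_from_df` in place of the slice.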

In [None]:
data.df

Unnamed: 0,user_id,anime_id,rating
0,1,8074,10.0
1,3,8074,6.0
2,5,8074,2.0
3,12,8074,6.0
4,14,8074,6.0
...,...,...,...
77863,963,27909,6.0
77864,979,7549,8.0
77865,992,1044,4.0
77866,995,2571,6.0


## Data Splitting

In [None]:
train_set,test_set = train_test_split(data=data, test_size=0.2, random_state=0)

## Cross Validation

### 1. SVD

In [None]:

model_svd = SVD(random_state=10)


cv_svd = cross_validate(
    algo=model_svd,
    data=data,
    cv=5,
    n_jobs=-1,
    verbose=True, 
    measures=['mae', 'rmse'] 
)

cv_svd

Evaluating MAE, RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     0.9109  0.9129  0.9127  0.9167  0.9129  0.9132  0.0019  
RMSE (testset)    1.2044  1.2045  1.2037  1.2086  1.2014  1.2045  0.0023  
Fit time          1.30    1.29    1.27    1.24    0.78    1.18    0.20    
Test time         0.26    0.26    0.27    0.31    0.15    0.25    0.05    


{'test_mae': array([0.9109001 , 0.91288457, 0.9127245 , 0.91673464, 0.91291983]),
 'test_rmse': array([1.20441786, 1.20449376, 1.20367772, 1.20858312, 1.20141587]),
 'fit_time': (1.3049209117889404,
  1.2926025390625,
  1.267455816268921,
  1.242506504058838,
  0.7756330966949463),
 'test_time': (0.2590975761413574,
  0.26036596298217773,
  0.2658874988555908,
  0.3085782527923584,
  0.14597368240356445)}
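
With `cv=5`, the ratings are split into five folds, and each fold serves once as the test set while the other four are used for fitting. That is why the table has five columns of scores. The following is a plain numpy sketch of that idea, not Surprise's own splitter; `kfold_indices` and the toy size of 23 ratings are made up for illustration.

```python
import numpy as np

def kfold_indices(n_ratings, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each rating lands in exactly one test fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_ratings)
    folds = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, test_idx

splits = list(kfold_indices(n_ratings=23, n_splits=5))
for train_idx, test_idx in splits:
    print(len(train_idx), 'train ratings,', len(test_idx), 'test ratings')
```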

In [None]:
print(cv_svd['test_mae'].mean(), 'is the mean MAE of the SVD model')
print(cv_svd['test_rmse'].mean(), 'is the mean RMSE of the SVD model')

0.9132327260832233 is the mean MAE of the SVD model
1.2045176658148877 is the mean RMSE of the SVD model
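
Both metrics are in rating points on the 1 to 10 scale. MAE averages the absolute errors, while RMSE squares them first, so a few large misses push RMSE above MAE. That is consistent with the gap between roughly 0.91 and 1.20 here. A small numpy sketch on invented predictions (the function names are local helpers, not the `surprise.accuracy` API):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Invented ratings: three small errors and one large miss (2.0 predicted as 5.0)
actual = [10.0, 6.0, 2.0, 6.0]
predicted = [9.0, 6.5, 5.0, 6.0]

print('MAE :', mae(actual, predicted))
print('RMSE:', rmse(actual, predicted))
```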


### 2. ALS

In [None]:


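bsl_options = {
    'method':'als',
    'n_epoch':5,     # Surprise's documented key is 'n_epochs'; as spelled here the entry is ignored
    'reg_u':12,      # regularization for the user baselines
    'reg_i':5        # regularization for the item baselines
}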
model_als = BaselineOnly(bsl_options=bsl_options)


cv_als = cross_validate(
    algo=model_als,
    data=data,
    cv=5,
    n_jobs=-1,
    verbose=True, 
    measures=['mae', 'rmse']) 

cv_als

Evaluating MAE, RMSE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     0.9084  0.9205  0.9174  0.9146  0.9210  0.9164  0.0046  
RMSE (testset)    1.1936  1.2102  1.2075  1.1987  1.2105  1.2041  0.0068  
Fit time          0.16    0.14    0.20    0.14    0.12    0.15    0.03    
Test time         0.32    0.31    0.23    0.17    0.11    0.23    0.08    


{'test_mae': array([0.90838447, 0.92049731, 0.91738703, 0.91456679, 0.92104595]),
 'test_rmse': array([1.19357325, 1.21022634, 1.20750828, 1.19867742, 1.21051118]),
 'fit_time': (0.1574852466583252,
  0.13599133491516113,
  0.20294761657714844,
  0.13646173477172852,
  0.11662864685058594),
 'test_time': (0.3181018829345703,
  0.3141329288482666,
  0.22547650337219238,
  0.17408037185668945,
  0.10899019241333008)}

In [None]:
print(cv_als['test_mae'].mean(), 'is the mean MAE of the ALS model')
print(cv_als['test_rmse'].mean(), 'is the mean RMSE of the ALS model')

0.9163763095793737 is the mean MAE of the ALS model
1.204099292645402 is the mean RMSE of the ALS model
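
Note that `BaselineOnly` has no latent factors: it predicts the global mean plus a user bias and an item bias, and `'method':'als'` only selects how those biases are fitted. The sketch below is a pure-Python reading of that alternating procedure on four invented ratings. The function name and the default values in its signature are placeholders, and the loop is an approximation of the idea rather than Surprise's actual code.

```python
from collections import defaultdict

def fit_baselines_als(ratings, n_epochs=10, reg_u=15, reg_i=10):
    """ratings: list of (user, item, rating). Returns (global_mean, bu, bi)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Fix the user biases, solve for each item bias (shrunk by reg_i)
        for i, rows in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rows) / (reg_i + len(rows))
        # Fix the item biases, solve for each user bias (shrunk by reg_u)
        for u, rows in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rows) / (reg_u + len(rows))
    return mu, bu, bi

toy = [(1, 'a', 10.0), (1, 'b', 8.0), (2, 'a', 6.0), (3, 'b', 4.0)]
mu, bu, bi = fit_baselines_als(toy, n_epochs=5, reg_u=12, reg_i=5)

# Baseline prediction for user 1 on item 'a'
print(mu + bu[1] + bi['a'])
```

Larger `reg_u` and `reg_i` shrink the biases toward zero, which matters most for users and items with few ratings.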


## Hyperparameter Tuning

### 1. SVD

In [None]:

hyperparam = {
    'n_epochs':[5,10,20],
    'lr_all':[0.002, 0.005], 
    'reg_all':[0.02, 0.04, 0.06] 
}

gridsearch_svd = GridSearchCV(
    algo_class=SVD, 
    param_grid=hyperparam,
    n_jobs=-1,
    cv=5,
    measures=['mae', 'rmse']
)


gridsearch_svd.fit(data)

In [None]:
display(gridsearch_svd.best_params, gridsearch_svd.best_score)

{'mae': {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.06},
 'rmse': {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.06}}

{'mae': 0.9066976236580405, 'rmse': 1.1933922904423628}
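
A grid search tries every combination of the listed values and cross-validates each one, so the cost grows multiplicatively. The sketch below expands the same grid with `itertools.product` to show how many fits that implies. It only illustrates the counting and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

hyperparam = {
    'n_epochs': [5, 10, 20],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.02, 0.04, 0.06],
}

# Every combination of one value per hyperparameter
keys = list(hyperparam)
combinations = [dict(zip(keys, values)) for values in product(*hyperparam.values())]

n_folds = 5
print(len(combinations), 'combinations,', len(combinations) * n_folds, 'model fits in total')
```

The best combination reported above sits at the upper edge of all three ranges, which suggests that a wider grid might still find better values.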

In [None]:
print(cv_svd['test_mae'].mean(), 'is the mean MAE of the SVD model before tuning')
print(cv_svd['test_rmse'].mean(), 'is the mean RMSE of the SVD model before tuning')
print()
print(gridsearch_svd.best_score['mae'], 'is the mean MAE of the SVD model after tuning')
print(gridsearch_svd.best_score['rmse'], 'is the mean RMSE of the SVD model after tuning')

0.9132327260832233 is the mean MAE of the SVD model before tuning
1.2045176658148877 is the mean RMSE of the SVD model before tuning

0.9066976236580405 is the mean MAE of the SVD model after tuning
1.1933922904423628 is the mean RMSE of the SVD model after tuning
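
To see what the three tuned hyperparameters do, it helps to look at a single stochastic gradient step of a biased matrix factorization model: `lr_all` scales the step, `reg_all` pulls every parameter back toward zero, and `n_epochs` is how many passes over the ratings repeat this. The sketch below is a simplified numpy version of the update as commonly described for this kind of model; `sgd_step` and the starting values are assumptions for illustration, not Surprise's code.

```python
import numpy as np

def sgd_step(r_ui, mu, bu, bi, pu, qi, lr_all=0.005, reg_all=0.02):
    """One SGD update for a single observed rating r_ui."""
    err = r_ui - (mu + bu + bi + qi @ pu)          # prediction error
    bu_new = bu + lr_all * (err - reg_all * bu)    # user bias
    bi_new = bi + lr_all * (err - reg_all * bi)    # item bias
    pu_new = pu + lr_all * (err * qi - reg_all * pu)  # user factors
    qi_new = qi + lr_all * (err * pu - reg_all * qi)  # item factors
    return bu_new, bi_new, pu_new, qi_new, err

rng = np.random.default_rng(0)
mu, bu, bi = 7.8, 0.0, 0.0            # invented global mean, zero biases
pu = rng.normal(0, 0.1, size=4)       # small random user factors
qi = rng.normal(0, 0.1, size=4)       # small random item factors

errors = []
for _ in range(200):                  # repeatedly fit one rating of 10.0
    bu, bi, pu, qi, err = sgd_step(10.0, mu, bu, bi, pu, qi)
    errors.append(abs(err))

print('first error:', errors[0], 'last error:', errors[-1])
```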


### 2. ALS

In [None]:
hyperparam = {
    'bsl_options':{
        'method':['als'],
        'n_epoch':[5,10,20],   # Surprise's documented key is 'n_epochs'; as spelled here the values are ignored
        'reg_u':[12,20],
        'reg_i':[5,10]
    }
}

gridsearch_als = GridSearchCV(
    algo_class=BaselineOnly, 
    param_grid=hyperparam,
    n_jobs=-1,
    cv=5,
    measures=['mae', 'rmse']
)


gridsearch_als.fit(data)

In [None]:
display(gridsearch_als.best_params, gridsearch_als.best_score)

{'mae': {'bsl_options': {'method': 'als',
   'n_epoch': 5,
   'reg_u': 12,
   'reg_i': 5}},
 'rmse': {'bsl_options': {'method': 'als',
   'n_epoch': 5,
   'reg_u': 12,
   'reg_i': 5}}}

{'mae': 0.9175430905764352, 'rmse': 1.2051714078774112}

In [None]:
print(cv_als['test_mae'].mean(), 'is the mean MAE of the ALS model before tuning')
print(cv_als['test_rmse'].mean(), 'is the mean RMSE of the ALS model before tuning')
print()
print(gridsearch_als.best_score['mae'], 'is the mean MAE of the ALS model after tuning')
print(gridsearch_als.best_score['rmse'], 'is the mean RMSE of the ALS model after tuning')

0.9163763095793737 is the mean MAE of the ALS model before tuning
1.204099292645402 is the mean RMSE of the ALS model before tuning

0.9175430905764352 is the mean MAE of the ALS model after tuning
1.2051714078774112 is the mean RMSE of the ALS model after tuning
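
The best grid entry is the same configuration as the untuned model, so the small gap between "before" and "after" is plausibly just the effect of different random folds. It is within the fold-to-fold standard deviation reported earlier. One more thing worth checking: according to the Surprise documentation, the baseline option for the number of ALS iterations is spelled `n_epochs`. Because the options are passed as a plain dict, a key spelled `n_epoch` would raise no error and the default would be used instead. The sketch below mimics that behaviour with a hypothetical reader function; the default values in it are assumptions.

```python
def read_als_options(bsl_options):
    """Hypothetical reader that mimics a dict lookup with fallback defaults."""
    return {
        'n_epochs': bsl_options.get('n_epochs', 10),
        'reg_u': bsl_options.get('reg_u', 15),
        'reg_i': bsl_options.get('reg_i', 10),
    }

misspelled = read_als_options({'method': 'als', 'n_epoch': 5, 'reg_u': 12, 'reg_i': 5})
correct = read_als_options({'method': 'als', 'n_epochs': 5, 'reg_u': 12, 'reg_i': 5})

# The misspelled key is silently ignored, so the fallback value is used
print('misspelled key ->', misspelled['n_epochs'])
print('correct key    ->', correct['n_epochs'])
```

Under that reading, the three `n_epoch` values in the grid would all train the same model, and only `reg_u` and `reg_i` actually vary.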


## Prediction with the Best Model

In [None]:
# Define best model
best_model = gridsearch_svd.best_estimator['mae']

# Fitting
best_model.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1cc99d80dc0>

In [None]:
best_model.predict(uid=0, iid=1)

Prediction(uid=0, iid=1, r_ui=None, est=9.084471063008571, details={'was_impossible': False})

In [None]:
list_user = [50,200,400,800]
list_anime = [11061,6438,1010,1257]

In [None]:
df_result = pd.DataFrame(columns=['user_id', 'anime_id'])
df_result

Unnamed: 0,user_id,anime_id


In [None]:
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the frame once
rows = [{'user_id': user, 'anime_id': item} for user in list_user for item in list_anime]
df_result = pd.DataFrame(rows, columns=['user_id', 'anime_id'])
df_result

Unnamed: 0,user_id,anime_id
0,50,11061
1,50,6438
2,50,1010
3,50,1257
4,200,11061
5,200,6438
6,200,1010
7,200,1257
8,400,11061
9,400,6438


In [None]:
list_rating_predict = []

for index, value in df_result.iterrows():
    rating_estimate = best_model.predict(uid=value['user_id'], iid=value['anime_id'])
    list_rating_predict.append(rating_estimate.est)  # .est is the predicted rating (same as index 3)

list_rating_predict

[9.723652188238214,
 7.265788412190737,
 7.336779024022364,
 7.849194640422156,
 10,
 8.944392036363155,
 8.757318595771554,
 9.580018119468624,
 8.44779515777702,
 6.462006346461142,
 6.017348830677101,
 6.959551030741963,
 9.428946338479149,
 7.933931088080923,
 7.889811765590983,
 8.367705846466073]
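
The bare `10` in the list is consistent with clipping: by default `predict` clips the estimate to the `rating_scale` given to the `Reader`, here (1, 10), so a raw estimate above 10 is returned as the upper bound itself. A minimal sketch of that clipping, with an invented raw value (`clip_to_scale` is a local helper, not a Surprise function):

```python
def clip_to_scale(raw_estimate, rating_scale=(1, 10)):
    """Clamp an estimate to the rating scale, returning the bound itself when exceeded."""
    low, high = rating_scale
    return min(high, max(low, raw_estimate))

print(clip_to_scale(10.37))  # above the scale: the integer upper bound comes back
print(clip_to_scale(7.5))    # inside the scale: unchanged
print(clip_to_scale(0.2))    # below the scale: the lower bound comes back
```

A clipped value hides how far above the scale the raw estimate was, so ties at 10 cannot be ranked against each other without the unclipped numbers.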

In [None]:
df_result['rating'] = list_rating_predict
df_result

Unnamed: 0,user_id,anime_id,rating
0,50,11061,9.723652
1,50,6438,7.265788
2,50,1010,7.336779
3,50,1257,7.849195
4,200,11061,10.0
5,200,6438,8.944392
6,200,1010,8.757319
7,200,1257,9.580018
8,400,11061,8.447795
9,400,6438,6.462006


In [None]:
df_result = pd.merge(left=df_result,right=df,on='anime_id')[['user_id_x','anime_id','rating_x','name']].drop_duplicates()

In [None]:
df_result = df_result.rename(columns={'user_id_x':'user_id','rating_x':'rating'})

In [None]:
def recommend(user):
    # Rank the candidate anime for one user by predicted rating, highest first
    return df_result[df_result['user_id']==user].sort_values('rating', ascending=False)

In [None]:
recommend(50)

Unnamed: 0,user_id,anime_id,rating,name
0,50,11061,9.723652,Hunter x Hunter (2011)
464,50,1257,7.849195,Saint Seiya: Meiou Hades Juuni Kyuu-hen
456,50,1010,7.336779,Ranma ½: Chou Musabetsu Kessen! Ranma Team vs....
452,50,6438,7.265788,Detective Conan OVA 09: The Stranger in 10 Yea...
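
Applying the same sort to every user answers the exercise question in one pass. The sketch below rebuilds the prediction table from the values printed in `list_rating_predict` above (copied as-is, in loop order) and lists the anime IDs per user from highest to lowest predicted rating. The names `pred` and `order` are introduced here for illustration.

```python
import pandas as pd

list_user = [50, 200, 400, 800]
list_anime = [11061, 6438, 1010, 1257]

# Predicted ratings copied from the list_rating_predict output above
list_rating_predict = [
    9.723652188238214, 7.265788412190737, 7.336779024022364, 7.849194640422156,
    10, 8.944392036363155, 8.757318595771554, 9.580018119468624,
    8.44779515777702, 6.462006346461142, 6.017348830677101, 6.959551030741963,
    9.428946338479149, 7.933931088080923, 7.889811765590983, 8.367705846466073,
]

# Same user-by-anime loop order as used when the predictions were made
pairs = [(user, item) for user in list_user for item in list_anime]
pred = pd.DataFrame(pairs, columns=['user_id', 'anime_id'])
pred['rating'] = list_rating_predict

# Sort within each user by predicted rating, highest first
ranked = pred.sort_values(['user_id', 'rating'], ascending=[True, False])
order = ranked.groupby('user_id')['anime_id'].apply(list)
print(order)
```

For user 50 this reproduces the order in the table above. Hunter x Hunter (2011), anime_id 11061, ranks first for all four users.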
