## **Collaborative Filtering Exercise**

**Using the anime and rating datasets, build a recommendation system according to the following scheme:**

* Merge the two datasets so that the information in the anime dataset is available.
* Compare the SVD and ALS algorithms.
* Tune the algorithm that you think performs better.

After obtaining the best model, predict the ratings of the following anime:

* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

For users:

* 50
* 200
* 400
* 800

In what order would you recommend these anime to each user?

## **Import libraries**

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns

# Dataset formatting
from surprise import Reader
from surprise import Dataset

from surprise import SVD            # SVD
from surprise import BaselineOnly   # baseline estimates (biases fitted with ALS below)

from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split
from surprise.model_selection import GridSearchCV

## **Load dataset & preprocessing**

In [35]:
df_rating = pd.read_csv('rating.csv')
df_rating

Unnamed: 0.1,Unnamed: 0,user_id,anime_id,rating
0,47,1,8074,10.0
1,81,1,11617,10.0
2,83,1,11757,10.0
3,101,1,15451,10.0
4,153,2,11771,10.0
...,...,...,...,...
77863,96433,999,11757,6.0
77864,96434,999,16498,9.0
77865,96435,999,21881,5.0
77866,96436,999,22319,8.0


In [36]:
# Drop the leftover index column, which carries no information
df_rating = df_rating.drop(columns='Unnamed: 0')
df_rating.head(10)

Unnamed: 0,user_id,anime_id,rating
0,1,8074,10.0
1,1,11617,10.0
2,1,11757,10.0
3,1,15451,10.0
4,2,11771,10.0
5,3,20,8.0
6,3,154,6.0
7,3,170,9.0
8,3,199,10.0
9,3,225,9.0


In [37]:
df_anime = pd.read_csv('anime.csv')
df_anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [38]:
# Merge df_rating and df_anime --> left join on the anime_id column
df_merged = pd.merge(df_rating, df_anime, how='left', on=['anime_id'])
df_merged 

Unnamed: 0,user_id,anime_id,rating_x,name,genre,type,episodes,rating_y,members
0,1,8074,10.0,Highschool of the Dead,"Action, Ecchi, Horror, Supernatural",TV,12,7.46,535892
1,1,11617,10.0,High School DxD,"Comedy, Demons, Ecchi, Harem, Romance, School",TV,12,7.70,398660
2,1,11757,10.0,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance",TV,25,7.83,893100
3,1,15451,10.0,High School DxD New,"Action, Comedy, Demons, Ecchi, Harem, Romance,...",TV,12,7.87,266657
4,2,11771,10.0,Kuroko no Basket,"Comedy, School, Shounen, Sports",TV,25,8.46,338315
...,...,...,...,...,...,...,...,...,...
77863,999,11757,6.0,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance",TV,25,7.83,893100
77864,999,16498,9.0,Shingeki no Kyojin,"Action, Drama, Fantasy, Shounen, Super Power",TV,25,8.54,896229
77865,999,21881,5.0,Sword Art Online II,"Action, Adventure, Fantasy, Game, Romance",TV,24,7.35,537892
77866,999,22319,8.0,Tokyo Ghoul,"Action, Drama, Horror, Mystery, Psychological,...",TV,12,8.07,618056
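
Both frames carry a `rating` column, which is why the merge above produces `rating_x` and `rating_y`. The sketch below, on a small made-up pair of frames, shows one way to name the two columns explicitly and to check that the left join behaved as expected. The toy data and the suffix names are assumptions for illustration; `suffixes`, `validate`, and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for df_rating and df_anime (hypothetical values)
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [10, 20, 30],
    "rating": [8.0, 9.0, 7.0],
})
anime = pd.DataFrame({
    "anime_id": [10, 20],
    "name": ["A", "B"],
    "rating": [7.5, 8.1],
})

merged = pd.merge(
    ratings, anime,
    how="left", on="anime_id",
    suffixes=("_user", "_avg"),   # explicit names instead of rating_x / rating_y
    validate="many_to_one",       # raises if anime_id is duplicated in the anime frame
    indicator=True,               # adds a _merge column: 'both' or 'left_only'
)

# Ratings whose anime_id has no match in the anime frame (name will be NaN)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["user_id", "anime_id"]])
```

A left join keeps every rating row, so any unmatched `anime_id` shows up with a missing name rather than being dropped silently.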


In [39]:
# Drop the columns that are not used
df_merged = df_merged.drop(columns=['type', 'episodes', 'rating_y', 'members'])

# Rename column 'rating_x' to 'user_rating'
df_merged = df_merged.rename(columns={'rating_x':'user_rating'})
df_merged

Unnamed: 0,user_id,anime_id,user_rating,name,genre
0,1,8074,10.0,Highschool of the Dead,"Action, Ecchi, Horror, Supernatural"
1,1,11617,10.0,High School DxD,"Comedy, Demons, Ecchi, Harem, Romance, School"
2,1,11757,10.0,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance"
3,1,15451,10.0,High School DxD New,"Action, Comedy, Demons, Ecchi, Harem, Romance,..."
4,2,11771,10.0,Kuroko no Basket,"Comedy, School, Shounen, Sports"
...,...,...,...,...,...
77863,999,11757,6.0,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance"
77864,999,16498,9.0,Shingeki no Kyojin,"Action, Drama, Fantasy, Shounen, Super Power"
77865,999,21881,5.0,Sword Art Online II,"Action, Adventure, Fantasy, Game, Romance"
77866,999,22319,8.0,Tokyo Ghoul,"Action, Drama, Horror, Mystery, Psychological,..."


In [40]:
df_merged.describe()
# ratings range from 1 to 10

Unnamed: 0,user_id,anime_id,user_rating
count,77868.0,77868.0,77868.0
mean,517.812786,10721.879116,7.855268
std,278.020509,9033.079184,1.53807
min,1.0,1.0,1.0
25%,288.0,2273.0,7.0
50%,529.0,9513.0,8.0
75%,753.0,16592.0,9.0
max,999.0,34240.0,10.0


In [41]:
# Pivot into a user-item matrix (mostly NaN, because each user rates only a few anime)
user_item_rating_matrix = df_merged.pivot_table(values='user_rating', index ='user_id', columns ='anime_id')
user_item_rating_matrix

anime_id,1,5,6,7,8,15,16,17,18,19,...,33338,33341,33372,33421,33524,33558,33569,33964,34103,34240
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,8.0,,,6.0,,6.0,6.0,,...,,,,,,,,,,
7,,,,,,,,,,,...,,7.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,,,,,,,,,,9.0,...,,,,,,,,,,
996,,,,,,,,,,,...,,,,,,,,,,
997,9.0,,,,,,,,,,...,,,,,,,,,,
998,,,,,,,,,,,...,,,,,,,,,,


* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

In [42]:
user_item_rating_matrix.loc[[50,200,400,800], [11061,6438,1010,1257]]

anime_id,11061,6438,1010,1257
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
50,10.0,,,
200,,,,
400,9.0,,,
800,,,,


The user-item rating matrix consists of 940 users and 4510 anime.
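
To get a feel for how sparse this matrix is, the short calculation below uses only numbers already reported in this notebook: 77868 ratings (the `count` row of `describe()`), 940 users, and 4510 anime. It is a back-of-the-envelope sketch, not part of the modelling pipeline.

```python
# Figures taken from the outputs above
n_ratings = 77868
n_users = 940
n_items = 4510

# Share of user-anime cells that actually contain a rating
density = n_ratings / (n_users * n_items)

print(f"Filled cells: {density:.2%}")   # about 1.84%
```

Fewer than 2 in 100 cells are observed, which is why the model has to estimate the remaining ones.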

## **Modeling**

In [43]:
reader = Reader(rating_scale=(1, 10))

data = Dataset.load_from_df(df_merged[['user_id', 'anime_id', 'user_rating']], reader)
data 

<surprise.dataset.DatasetAutoFolds at 0x27611f53b48>

## **Validation**

In [44]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=1) 

### **SVD**

In [45]:
algo_svd = SVD()

algo_svd.fit(trainset)
prediction_svd = algo_svd.test(testset)

In [46]:
accuracy.rmse(prediction_svd) 

RMSE: 1.2020


1.2019680632656489
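
For reference, RMSE and MAE can be computed by hand from pairs of true and predicted ratings. The sketch below uses four made-up pairs; the helper names `rmse` and `mae` are illustrative and are not the Surprise functions.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Hypothetical (true rating, predicted rating) pairs
pairs = [(10.0, 9.0), (8.0, 9.0), (7.0, 7.0), (6.0, 8.0)]

print(round(rmse(pairs), 4))   # errors 1, -1, 0, -2 -> sqrt(6 / 4)
print(mae(pairs))
```

Because the errors are squared before averaging, RMSE penalises the single error of 2 more heavily than MAE does.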

### **ALS**

In [47]:
bsl_options = {'method': 'als',
               'n_epochs': 10,
               'reg_u': 15,
               'reg_i': 10
               }

algo_als = BaselineOnly(bsl_options=bsl_options)

algo_als.fit(trainset)
prediction_als = algo_als.test(testset)

Estimating biases using als...
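
`BaselineOnly` does not factorise the matrix. It predicts the global mean plus a user bias and an item bias, and the `als` method estimates those biases by alternating between the two. The pure-Python sketch below follows that idea on toy data. It is a simplified illustration under stated assumptions, not Surprise's actual implementation, and the function names are hypothetical.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=10, reg_u=15, reg_i=10):
    """Estimate the global mean, user biases and item biases by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}

    for _ in range(n_epochs):
        # Update item biases with user biases held fixed
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Update user biases with item biases held fixed
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

def baseline_predict(mu, b_u, b_i, u, i):
    """Unknown users or items fall back to a bias of zero."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

# Toy ratings: (user, item, rating)
toy = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i2", 2)]
mu, b_u, b_i = als_baselines(toy)
print(baseline_predict(mu, b_u, b_i, "u1", "i1"))
```

Larger `reg_u` and `reg_i` values shrink the biases toward zero, which matters most for users and anime with few ratings.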


In [48]:
accuracy.rmse(prediction_als)

RMSE: 1.2128


1.2127696615627046

SVD has the lower error on the test split (RMSE 1.2020 versus 1.2128), so hyperparameter tuning is applied to the SVD model.

## **Cross Validation**

### **SVD**

In [49]:
cv_svd = cross_validate(algo_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2131  1.1989  1.1982  1.2097  1.2123  1.2064  0.0066  
MAE (testset)     0.9190  0.9085  0.9107  0.9148  0.9195  0.9145  0.0044  
Fit time          12.94   13.44   14.00   12.70   12.25   13.07   0.60    
Test time         0.51    0.53    0.69    0.44    0.45    0.52    0.09    


In [50]:
print('RMSE cv mean', cv_svd['test_rmse'].mean())

RMSE cv mean 1.206446046384325


### **ALS**

In [51]:
cv_als = cross_validate(algo_als, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2122  1.2238  1.2092  1.2138  1.2062  1.2130  0.0060  
MAE (testset)     0.9250  0.9326  0.9177  0.9261  0.9223  0.9247  0.0049  
Fit time          0.51    0.56    0.63    0.56    0.56    0.56    0.04    
Test time         0.57    0.30    0.25    0.25    0.57    0.39    0.15    


In [52]:
print('RMSE cv mean', cv_als['test_rmse'].mean())

RMSE cv mean 1.2130479533952339


## **Hyperparameter tuning**

In [53]:
# Tuning SVD
hyperparam_space = {
    'n_epochs':[5, 10, 20, 30],     # number of iterations
    'lr_all':[0.002, 0.005],        # learning rate
    'reg_all':[0.02, 0.4, 0.6]      # regularization
}

grid_search = GridSearchCV(SVD, hyperparam_space, measures=['rmse', 'mae'], cv=5)

grid_search.fit(data)
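
A grid search fits one model per parameter combination per fold, so the cost grows quickly with the size of the grid. The sketch below enumerates the same grid with `itertools.product` to count the fits. It only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the cell above
hyperparam_space = {
    'n_epochs': [5, 10, 20, 30],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.02, 0.4, 0.6],
}
cv = 5

# One dict per parameter combination
combos = [
    dict(zip(hyperparam_space, values))
    for values in product(*hyperparam_space.values())
]
n_fits = len(combos) * cv

print(len(combos), "combinations x", cv, "folds =", n_fits, "model fits")
```

With roughly 13 seconds per SVD fit in the cross-validation output above, this grid is expected to take a while to run.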

In [54]:
print('RMSE')
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

print('\nMAE')
print(grid_search.best_score['mae'])
print(grid_search.best_params['mae'])

RMSE
1.20597989714718
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}

MAE
0.9139524051340322
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}


In [55]:
# Example of tuning the ALS method
# param_grid = {'bsl_options': {'method': ['als'],
#                               'n_epochs': [5,10,15], 
#                               'reg_u': [12, 18, 27], 
#                               'reg_i': [5,50,100]}
#               }

# gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3)

# gs.fit(data)

## **Model with Hyperparameter Tuning**

In [56]:
svd_tuned = SVD(n_epochs = 20, lr_all = 0.005, reg_all = 0.02)
cv_svd_tuned = cross_validate(svd_tuned, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1994  1.1931  1.2223  1.2147  1.2039  1.2067  0.0105  
MAE (testset)     0.9093  0.9073  0.9264  0.9184  0.9078  0.9138  0.0075  
Fit time          15.89   17.46   20.53   21.03   20.28   19.04   2.01    
Test time         0.65    0.68    0.77    0.98    0.84    0.79    0.12    


In [57]:
# Compare RMSE before and after tuning
print('RMSE cv mean before tuning:', cv_svd['test_rmse'].mean())
print('RMSE cv mean after tuning:', cv_svd_tuned['test_rmse'].mean())

RMSE cv mean before tuning: 1.206446046384325
RMSE cv mean after tuning: 1.2066772111720447

The best parameters found by the grid search (`n_epochs=20`, `lr_all=0.005`, `reg_all=0.02`) are the same as the default values of `SVD` in Surprise, so the tuned model is effectively the default model. The small difference between the two RMSE values comes from the random fold splits and the random initialisation, not from the tuning.


## **Prediction results**

* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

In [58]:
users = [50, 200, 400, 800]
anime_ids = [11061, 6438, 1010, 1257]
titles = ['Hunter x Hunter (2011)', 'Detective Conan OVA 09', 'Ranma ½', 'Saint Seiya: Meiou Hades Juuni Kyuu-hen']

# Build one row per (user, anime) pair together with its title.
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first.
rows = [
    {'user_id': i, 'anime_id': j, 'title': k}
    for i in users
    for j, k in zip(anime_ids, titles)
]
df_test = pd.DataFrame(rows, columns=['user_id', 'anime_id', 'title'])

df_test 

Unnamed: 0,user_id,anime_id,title
0,50,11061,Hunter x Hunter (2011)
1,50,6438,Detective Conan OVA 09
2,50,1010,Ranma ½
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
4,200,11061,Hunter x Hunter (2011)
5,200,6438,Detective Conan OVA 09
6,200,1010,Ranma ½
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
8,400,11061,Hunter x Hunter (2011)
9,400,6438,Detective Conan OVA 09


In [59]:
df_test.iloc[:, :-1]

Unnamed: 0,user_id,anime_id
0,50,11061
1,50,6438
2,50,1010
3,50,1257
4,200,11061
5,200,6438
6,200,1010
7,200,1257
8,400,11061
9,400,6438


In [60]:
df_merged.iloc[:, [1,3]]

Unnamed: 0,anime_id,name
0,8074,Highschool of the Dead
1,11617,High School DxD
2,11757,Sword Art Online
3,15451,High School DxD New
4,11771,Kuroko no Basket
...,...,...
77863,11757,Sword Art Online
77864,16498,Shingeki no Kyojin
77865,21881,Sword Art Online II
77866,22319,Tokyo Ghoul


In [61]:
df_hasil = pd.merge(df_test.iloc[:, :-1], df_merged.iloc[:, [1,3]], how='inner', on='anime_id')
df_hasil = df_hasil.drop_duplicates(ignore_index=True).sort_values('user_id')
df_hasil

Unnamed: 0,user_id,anime_id,name
0,50,11061,Hunter x Hunter (2011)
4,50,6438,Detective Conan OVA 09: The Stranger in 10 Yea...
8,50,1010,Ranma ½: Chou Musabetsu Kessen! Ranma Team vs....
12,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
1,200,11061,Hunter x Hunter (2011)
5,200,6438,Detective Conan OVA 09: The Stranger in 10 Yea...
9,200,1010,Ranma ½: Chou Musabetsu Kessen! Ranma Team vs....
13,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
2,400,11061,Hunter x Hunter (2011)
6,400,6438,Detective Conan OVA 09: The Stranger in 10 Yea...


In [62]:
# Define the model with the best hyperparameters from the grid search
svd_predict = SVD(n_epochs=20, lr_all=0.005, reg_all=0.02)

# Fit on the training split
svd_predict.fit(trainset)

# Collect the predicted ratings
y = []

# Predict the rating for each (user, anime) row
for index, row in df_test.iterrows():
    est = svd_predict.predict(row['user_id'], row['anime_id'])
    y.append(est.est)   # same value as est[3]; Prediction is a namedtuple
    
df_test['predicted_rating'] = y

df_test.sort_values(by=['user_id', 'predicted_rating'], ascending=[True, False], inplace=True)
df_test

Unnamed: 0,user_id,anime_id,title,predicted_rating
0,50,11061,Hunter x Hunter (2011),9.825665
2,50,1010,Ranma ½,8.01289
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,7.966056
1,50,6438,Detective Conan OVA 09,7.319418
4,200,11061,Hunter x Hunter (2011),10.0
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,9.211445
6,200,1010,Ranma ½,8.738796
5,200,6438,Detective Conan OVA 09,8.681264
8,400,11061,Hunter x Hunter (2011),8.469168
11,400,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,6.803702


In [63]:
est

Prediction(uid=800, iid=1257, r_ui=None, est=8.388786038168126, details={'was_impossible': False})
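
The object printed above is a named tuple, so the estimate can be read by field name as well as by position. The sketch below rebuilds a stand-in with the same field names using only the standard library. The stand-in is an assumption for illustration; in the notebook the object comes from `svd_predict.predict(...)`.

```python
from collections import namedtuple

# Stand-in with the same field names as the object printed above (illustrative only)
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

pred = Prediction(
    uid=800,
    iid=1257,
    r_ui=None,                         # no true rating was supplied
    est=8.388786038168126,             # predicted rating
    details={"was_impossible": False},
)

print(pred.est)               # access by name
print(pred[3] == pred.est)    # positional index 3 is the same field
```

Reading `pred.est` is less error-prone than `pred[3]`, because it does not depend on the order of the fields.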

In [64]:
df_test[df_test['user_id'] == 50]

Unnamed: 0,user_id,anime_id,title,predicted_rating
0,50,11061,Hunter x Hunter (2011),9.825665
2,50,1010,Ranma ½,8.01289
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,7.966056
1,50,6438,Detective Conan OVA 09,7.319418


In [65]:
df_test[df_test['user_id'] == 200]

Unnamed: 0,user_id,anime_id,title,predicted_rating
4,200,11061,Hunter x Hunter (2011),10.0
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,9.211445
6,200,1010,Ranma ½,8.738796
5,200,6438,Detective Conan OVA 09,8.681264


In [66]:
df_test[df_test['user_id'] == 400]

Unnamed: 0,user_id,anime_id,title,predicted_rating
8,400,11061,Hunter x Hunter (2011),8.469168
11,400,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,6.803702
10,400,1010,Ranma ½,6.313438
9,400,6438,Detective Conan OVA 09,5.978582


In [67]:
df_test[df_test['user_id'] == 800]

Unnamed: 0,user_id,anime_id,title,predicted_rating
12,800,11061,Hunter x Hunter (2011),9.709039
15,800,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,8.388786
14,800,1010,Ranma ½,8.046597
13,800,6438,Detective Conan OVA 09,7.899487
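
The predicted rating of exactly 10.0 for user 200 is most likely a clipped value. A biased SVD estimate is the global mean plus the user bias, the item bias, and the dot product of the latent factors, and Surprise clips the result to the rating scale by default. The sketch below shows that rule with made-up biases and factors; the function name and the numbers are hypothetical.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(1, 10)):
    """Biased SVD estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical values: a generous user and a well-liked anime
clipped = svd_estimate(mu=7.86, b_u=1.2, b_i=1.1, p_u=[0.1, -0.2], q_i=[0.3, 0.1])

# Same factors with neutral biases stay inside the scale
unclipped = svd_estimate(mu=7.86, b_u=0.0, b_i=0.0, p_u=[0.1, -0.2], q_i=[0.3, 0.1])

print(clipped, round(unclipped, 2))
```

Clipping means that several anime can tie at 10.0 for the same user, so ties at the top of a ranking should be read with care.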


## **Anime recommendations for a single user**

In [68]:
df_merged[df_merged['user_id']==1]

Unnamed: 0,user_id,anime_id,user_rating,name,genre
0,1,8074,10.0,Highschool of the Dead,"Action, Ecchi, Horror, Supernatural"
1,1,11617,10.0,High School DxD,"Comedy, Demons, Ecchi, Harem, Romance, School"
2,1,11757,10.0,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance"
3,1,15451,10.0,High School DxD New,"Action, Comedy, Demons, Ecchi, Harem, Romance,..."


In [69]:
df_merged['anime_id'].nunique()

4510

In [70]:
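# Score every anime for a single user
user_id = 1

# Unique anime_id values with their matching names.
# Deduplicating on anime_id keeps both lists aligned, even when a name is missing or repeated.
anime_lookup = df_merged[['anime_id', 'name']].drop_duplicates(subset='anime_id')
anime = list(anime_lookup['anime_id'])
name = list(anime_lookup['name'])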

In [71]:
svd_predict = SVD(n_epochs=20, lr_all=0.005, reg_all=0.02)
svd_predict.fit(trainset)

# Predict the score of every anime for user 1
anime_score = [svd_predict.predict(user_id, anime_id).est for anime_id in anime]
anime_score

[9.383967304744894,
 9.332344227024613,
 9.514573714537217,
 9.393542346100094,
 9.640615413816704,
 8.597392936575813,
 8.158190296088877,
 9.415298127845567,
 9.900780635333211,
 8.019519238891728,
 8.20545860956802,
 8.629763324488389,
 8.199684395658196,
 8.668958527407806,
 9.398121758830518,
 8.20924150415762,
 7.951267976050533,
 8.30236344490301,
 7.676520996433003,
 7.8453961663317076,
 7.960918750772441,
 8.370562857180262,
 9.64900322663943,
 7.7452155179405295,
 8.671291331054338,
 8.393654756915224,
 8.97592328610027,
 7.953701112615388,
 7.82193193387221,
 8.561265312471871,
 8.759505091175269,
 8.278019297329253,
 9.937617026263103,
 8.193902659575219,
 8.203849910811629,
 9.321190423782008,
 8.52686326525258,
 8.06488475599561,
 7.93635444193846,
 8.71434368752996,
 8.694696777760837,
 8.073978519447722,
 8.863976977531763,
 8.978135445394615,
 8.862805360648661,
 8.301778179613873,
 9.29083160215705,
 7.7885712522823844,
 8.696735707716565,
 8.61653005504899,
 7.786026

In [72]:
# Recommendations for a single user
recomToUser = pd.DataFrame({
                            'anime_id': anime, 
                            'title':name,
                            'score': anime_score
                            }).sort_values(by='score', ascending=False)

recomToUser.head(20)

Unnamed: 0,anime_id,title,score
586,11061,Hunter x Hunter (2011),10.0
32,5114,Fullmetal Alchemist: Brotherhood,9.937617
8,199,Sen to Chihiro no Kamikakushi,9.900781
289,9253,Steins;Gate,9.900472
297,9969,Gintama&#039;,9.898299
378,15417,Gintama&#039;: Enchousen,9.833903
895,6114,Rainbow: Nisha Rokubou no Shichinin,9.80748
557,4181,Clannad: After Story,9.799438
81,24415,Kuroko no Basket 3rd Season,9.795918
92,31043,Boku dake ga Inai Machi,9.786147
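
The ranking above scores every anime, including the ones the user has already rated. A common follow-up is to drop those before taking the top N. The sketch below does this on a four-row sample. The scores are copied from the outputs above, and anime 11757 (Sword Art Online) is one of the titles user 1 has already rated. The frame and variable names are illustrative.

```python
import pandas as pd

# Small sample of predicted scores for user 1 (values from the outputs above)
scores = pd.DataFrame({
    "anime_id": [11061, 5114, 11757, 199],
    "score": [10.0, 9.937617, 9.514574, 9.900781],
})

# anime_id values the user has already rated
already_rated = {11757}

# Keep only unseen anime, then rank by predicted score
unseen = scores[~scores["anime_id"].isin(already_rated)]
top_n = unseen.sort_values("score", ascending=False).head(2)

print(top_n)
```

In the notebook itself, the set of rated anime could be taken from `df_merged[df_merged['user_id'] == user_id]['anime_id']`.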
