## Final Exam (UAS): Social Media Analytics

#### Alief Rachmandias Darmawan - A12.2020.06416

## Anime Recommendation and User Rating Prediction Using Item-Based Collaborative Filtering

#### Background

- Anime is a form of animated entertainment that originated in Japan. During the pandemic, anime became a popular viewing choice across many audiences. A recommendation system is therefore useful for people who are just starting to watch anime.

#### Requirements and Output

- In this experiment I build an anime recommendation system to help people who are new to anime. I also build a user rating prediction that estimates how a given user would rate a particular anime.
- The output is a list of recommended anime for a given title, plus a list of recommended anime for a given user based on that user's similarity to other users. The implementation uses an Item-Based Collaborative Filtering recommender.

#### Data Sources and References
- https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database
- https://www.kaggle.com/code/benroshan/content-collaborative-anime-recommendation/notebook
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4121831
- https://jurnal.stmik-mi.ac.id/index.php/jcb/article/view/167/189

#### The user_rating dataset is hosted on Google Drive because the file is too large to upload to GitHub
- http://surl.li/eicif

#### Importing the Required Libraries

In [1]:
# Import the libraries used in this notebook

import pandas as pd
import numpy as np
import scipy as sp
import re
from sklearn.metrics.pairwise import cosine_similarity
import operator
%matplotlib inline

- pandas for DataFrame processing
- numpy for matrix computation
- scipy for building the sparse matrix
- re (regular expressions) for data cleaning
- cosine_similarity, taken from the sklearn module
- operator, which exports operator functions (used later for itemgetter)

## Data Preparation

Item-Based Collaborative Filtering is a recommendation approach based on similarity between items. Two items are considered similar when users tend to give them similar ratings. The items a user has already rated serve as the reference for finding other items that resemble them.
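As a rough illustration of the idea (not the notebook's actual pipeline), the sketch below compares items by the cosine similarity of their rating vectors. The item names `A`, `B`, `C`, the ratings, and the helper `cosine` are all made up for this example.

```python
import math

# Hypothetical ratings: one list per item, one position per user (0 = not rated)
ratings = {
    "A": [5, 4, 0, 1],
    "B": [4, 5, 0, 2],
    "C": [1, 0, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Score every other item against item "A" and pick the closest one
scores = {name: cosine(ratings["A"], vec) for name, vec in ratings.items() if name != "A"}
most_similar = max(scores, key=scores.get)
print(most_similar, round(scores[most_similar], 3))
```

Items `A` and `B` are rated alike by the same users, so `B` ranks above `C` as a recommendation for someone who liked `A`.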

#### The data comes from Kaggle and was crawled from the myanimelist.net website

In [3]:
anime = pd.read_csv('anime.csv')
usr_rating = pd.read_csv('user_rating.csv')

In [3]:
## Info for the anime dataset
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
## Info for the user_rating dataset
usr_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7813737 entries, 0 to 7813736
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   user_id   int64
 1   anime_id  int64
 2   rating    int64
dtypes: int64(3)
memory usage: 178.8 MB


#### Inspecting the anime dataset

In [5]:
anime.head()

,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


#### 1. anime
- anime_id: unique myanimelist.net ID that identifies an anime
- name: title of the anime
- genre: genres of the anime
- type: type of anime (Movie, TV, OVA, etc.)
- episodes: number of episodes in the anime
- rating: rating of the anime
- members: number of community members in the anime's group

#### After inspecting the anime dataset, I clean the name column because it contains several unneeded symbols

In [4]:
## Strip leftover HTML entities from the name column

def text_cleaning(text):
    text = re.sub(r'&quot;', '', text)
    # Removes every apostrophe entity, e.g. "Gintama&#039;" becomes "Gintama"
    text = re.sub(r'&#039;', '', text)
    text = re.sub(r'&amp;', 'and', text)
    
    return text

anime['name'] = anime['name'].apply(text_cleaning)
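
As an alternative sketch, Python's standard `html.unescape` decodes all HTML entities at once instead of listing them one by one. Note that it keeps the decoded characters (apostrophes, quotes, `&`) rather than removing them, so the resulting titles would differ from the ones produced by `text_cleaning` above. The sample titles are illustrative values, not rows taken from the dataset.

```python
import html

# Illustrative raw titles containing HTML entities
raw_titles = ["Gintama&#039;", "Fate/stay night &amp; more", "&quot;Example&quot;"]

# Decode every entity back to the character it stands for
decoded = [html.unescape(title) for title in raw_titles]
print(decoded)
```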

#### Inspecting the user_rating dataset

In [5]:
usr_rating.head()

,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


#### 2. user_rating
- user_id: ID of the user
- anime_id: the anime rated by the user
- rating: the rating given by the user. A value of -1 means the user watched the anime but did not give a rating.

#### Here I replace -1 with null (NaN) so that it does not distort the averages computed later.

In [6]:
usr_rating['rating'] = usr_rating['rating'].replace(-1, np.nan)
usr_rating.head()

,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,
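
To see why the replacement matters for averages, here is a minimal sketch with made-up ratings for a single user. pandas skips NaN when computing a mean by default, whereas a literal -1 is counted as if it were a real score.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one user: two real scores and one "watched, not rated" flag
raw = pd.Series([8, 6, -1])

mean_with_flag = raw.mean()                          # -1 is treated as a real score
mean_without_flag = raw.replace(-1, np.nan).mean()   # NaN is skipped by default
print(mean_with_flag, mean_without_flag)
```

With the flag left in place the user's average is pulled down to about 4.33; after the replacement it is 7.0, the mean of the two real ratings.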


#### This analysis uses only anime of type TV

In [7]:
# Keep only anime whose type is TV

anime_tv = anime[anime['type'] == 'TV']
anime_tv.head()

,anime_id,name,genre,type,episodes,rating,members
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351


## The analysis shows that the two datasets are related through anime_id, so I merge them

In [8]:
# Merge the datasets on the anime_id column

merged = usr_rating.merge(anime_tv, left_on = 'anime_id', right_on = 'anime_id', suffixes= ['_user', ''])
merged.rename(columns = {'rating_user':'user_rating'}, inplace = True)
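
A small sketch of what this merge does, using made-up frames in place of `usr_rating` and `anime_tv`. `DataFrame.merge` performs an inner join by default, so ratings for anime that are not in the TV-only table are dropped, and the `suffixes` argument resolves the clash between the two `rating` columns.

```python
import pandas as pd

# Hypothetical stand-ins for usr_rating and anime_tv
ratings = pd.DataFrame({"user_id": [1, 1, 2], "anime_id": [10, 20, 10], "rating": [8, 7, 9]})
tv_only = pd.DataFrame({"anime_id": [10], "name": ["Show A"], "rating": [8.5]})

# Inner join on anime_id; the user's rating gets the "_user" suffix
joined = ratings.merge(tv_only, on="anime_id", suffixes=("_user", ""))
joined = joined.rename(columns={"rating_user": "user_rating"})
print(joined)
```

The rating for `anime_id` 20 disappears because that ID has no match in `tv_only`.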

In [9]:
# Keep only users with user_id <= 10000 to keep processing manageable

merged=merged[['user_id', 'name', 'user_rating']]
merged_sub= merged[merged.user_id <= 10000]
merged_sub.head()

,user_id,name,user_rating
0,1,Naruto,
1,3,Naruto,8.0
2,5,Naruto,6.0
3,6,Naruto,
4,10,Naruto,


## Modeling

This modeling step uses item-based collaborative filtering with cosine similarity.

#### I build a pivot table with user_id as the index and name as the columns; the values come from user_rating. The pivot table is the basis for computing similarity between users and between anime based on user_rating

In [10]:
piv = merged_sub.pivot_table(index=['user_id'], columns=['name'], values='user_rating')

In [11]:
print(piv.shape)
piv.head(17)

(9387, 2704)


name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,2.0,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
11,,,,,,,,,,,...,,,,,,,,,,
12,,,,,,,,,,,...,,,,,,,,,,


- Above is the resulting pivot table. For example, user_id 17 rated the anime 91 Days 8.0 but has not yet rated 11eyes.
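
The reshaping can be sketched on a tiny made-up table. The frame `toy` and its values are hypothetical; the `pivot_table` call mirrors the one used on `merged_sub`.

```python
import pandas as pd

# Hypothetical long-format ratings (user_id, name, user_rating)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["Naruto", "Bleach", "Naruto", "Bleach"],
    "user_rating": [8.0, 6.0, 7.0, 9.0],
})

# One row per user, one column per anime; unrated pairs become NaN
toy_piv = toy.pivot_table(index="user_id", columns="name", values="user_rating")
print(toy_piv)
```

User 2 never rated Bleach in this toy data, so that cell is NaN, just like the many empty cells in the real pivot table.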

#### Next, normalize the pivot table

In [12]:
# Normalize each user's ratings: subtract the user's mean and divide by the user's rating range
piv_norm = piv.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)


# Fill missing values with 0, transpose so rows are anime and columns are users,
# then drop users whose column is entirely 0 (users with no usable ratings)
piv_norm.fillna(0, inplace=True)
piv_norm = piv_norm.T
piv_norm = piv_norm.loc[:, (piv_norm != 0).any(axis=0)]
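
A worked sketch of the normalization on two made-up users may help. The helper `normalize` is a hypothetical restatement of the lambda above. A user whose ratings are all identical (or who rated only one anime) has a rating range of 0, so every value becomes NaN, is later filled with 0, and the user is removed by the final filter.

```python
import numpy as np
import pandas as pd

def normalize(row):
    """Mean-center a user's ratings and scale by the user's rating range."""
    return (row - row.mean()) / (row.max() - row.min())

# Hypothetical users: one with varied ratings, one with a single rating
varied = normalize(pd.Series([8.0, 6.0, 10.0, np.nan]))
single = normalize(pd.Series([7.0, np.nan, np.nan, np.nan]))
print(varied.tolist())
print(single.tolist())
```

For the first user the mean is 8 and the range is 4, so the ratings 8, 6, 10 map to 0, -0.5, 0.5. For the second user the result is NaN everywhere.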

#### Convert the pivot table to a sparse matrix so that it can be used in the cosine similarity computation

In [13]:
# Convert the data to a sparse matrix for the cosine similarity computation

piv_sparse = sp.sparse.csr_matrix(piv_norm.values)

#### The matrix is then used to compute cosine similarity in two parts: item_similarity (by name) and user_similarity (by user_id). Item similarity uses the matrix as is, while user similarity uses the transposed matrix

In [14]:
item_similarity = cosine_similarity(piv_sparse)
user_similarity = cosine_similarity(piv_sparse.T)
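
The effect of the transpose can be checked on a small made-up matrix. The sketch below uses a plain NumPy helper, `cosine_matrix`, as a hypothetical stand-in for row-wise pairwise cosine similarity; it is not sklearn's implementation and assumes no row is all zeros.

```python
import numpy as np

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m (rows must be non-zero)."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical normalized ratings: 3 anime (rows) x 4 users (columns)
toy = np.array([
    [ 0.5, -0.5, 0.0,  0.2],
    [ 0.4, -0.3, 0.1,  0.0],
    [-0.5,  0.5, 0.3, -0.1],
])

item_sim = cosine_matrix(toy)     # rows are anime, so the result is 3 x 3
user_sim = cosine_matrix(toy.T)   # rows are users, so the result is 4 x 4
print(item_sim.shape, user_sim.shape)
```

Both results are symmetric with ones on the diagonal, which matches the structure of the `item_similarity` and `user_similarity` arrays in this notebook.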

In [15]:
item_similarity

array([[ 1.        ,  0.17415138,  0.2523907 , ...,  0.02542878,
        -0.04674506, -0.04144116],
       [ 0.17415138,  1.        ,  0.15930535, ...,  0.02523458,
        -0.00898976, -0.03582197],
       [ 0.2523907 ,  0.15930535,  1.        , ...,  0.02015541,
        -0.0470111 , -0.02820349],
       ...,
       [ 0.02542878,  0.02523458,  0.02015541, ...,  1.        ,
         0.00828822, -0.00189062],
       [-0.04674506, -0.00898976, -0.0470111 , ...,  0.00828822,
         1.        ,  0.54829839],
       [-0.04144116, -0.03582197, -0.02820349, ..., -0.00189062,
         0.54829839,  1.        ]])

In [16]:
user_similarity

array([[ 1.        ,  0.06481888,  0.16768669, ...,  0.07505091,
         0.04112757,  0.0684037 ],
       [ 0.06481888,  1.        ,  0.07261421, ...,  0.07543479,
         0.02228038,  0.00893266],
       [ 0.16768669,  0.07261421,  1.        , ...,  0.12750498,
         0.05095768, -0.01356194],
       ...,
       [ 0.07505091,  0.07543479,  0.12750498, ...,  1.        ,
         0.17109752,  0.01595078],
       [ 0.04112757,  0.02228038,  0.05095768, ...,  0.17109752,
         1.        , -0.02431102],
       [ 0.0684037 ,  0.00893266, -0.01356194, ...,  0.01595078,
        -0.02431102,  1.        ]])

#### Next, put the similarity matrices into DataFrames

In [17]:
# Wrap the similarity matrices in DataFrame objects

item_sim_df = pd.DataFrame(item_similarity, index = piv_norm.index, columns = piv_norm.index)
user_sim_df = pd.DataFrame(user_similarity, index = piv_norm.columns, columns = piv_norm.columns)

In [18]:
item_sim_df.head()

name,.hack//Roots,.hack//Sign,.hack//Tasogare no Udewa Densetsu,009-1,07-Ghost,11eyes,12-sai.: Chicchana Mune no Tokimeki,3 Choume no Tama: Uchi no Tama Shirimasenka?,30-sai no Hoken Taiiku,91 Days,...,"Zone of the Enders: Dolores, I",Zukkoke Knight: Don De La Mancha,ef: A Tale of Melodies.,ef: A Tale of Memories.,gdgd Fairies,gdgd Fairies 2,iDOLM@STER Xenoglossia,s.CRY.ed,xxxHOLiC,xxxHOLiC Kei
name,,,,,,,,,,,,,,,,,,,,,
.hack//Roots,1.0,0.174151,0.252391,-0.005796,0.049679,0.04527,0.007651,0.0,0.033394,-0.031335,...,-0.015857,0.0,-0.022133,-0.011535,0.0,0.0,0.027336,0.025429,-0.046745,-0.041441
.hack//Sign,0.174151,1.0,0.159305,0.020343,0.05195,0.028632,-0.002528,0.0,0.035474,-0.035404,...,-0.007762,0.0,-0.011227,-0.015026,-0.009463,-0.016353,-0.010918,0.025235,-0.00899,-0.035822
.hack//Tasogare no Udewa Densetsu,0.252391,0.159305,1.0,0.0468,0.02626,0.039341,-0.009715,0.0,0.001181,0.004111,...,0.009741,0.0,-0.021078,-0.024342,0.0,0.0,0.007474,0.020155,-0.047011,-0.028203
009-1,-0.005796,0.020343,0.0468,1.0,0.000487,0.011121,0.065982,0.0,0.0574,-0.07441,...,0.007242,0.0,0.003135,0.007257,0.0,0.0,0.0,-0.032589,0.007402,0.015183
07-Ghost,0.049679,0.05195,0.02626,0.000487,1.0,0.083046,0.009231,-0.009091,0.065631,-0.027509,...,0.007273,0.0,-0.03242,-0.033768,5.7e-05,4.2e-05,0.016826,0.017564,-0.032459,-0.035136


In [19]:
user_sim_df.head()

user_id,3,5,7,8,10,11,12,14,16,17,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,,,,,,,,,,,,,,,,,,,,,
3,1.0,0.064819,0.167687,0.020365,0.125977,0.04523309,0.108629,0.005399,0.077388,0.248132,...,-0.109099,-0.145602,-0.017873,0.043,0.054435,0.027708,-0.015971,0.075051,0.041128,0.068404
5,0.064819,1.0,0.072614,0.016531,0.031589,0.005460964,0.094486,0.107463,0.007147,0.13126,...,0.004197,-0.054089,0.044714,0.012716,0.105504,0.169411,-0.014658,0.075435,0.02228,0.008933
7,0.167687,0.072614,1.0,-0.013531,0.020497,-0.01521354,0.061616,0.122953,0.048481,0.165547,...,0.0,0.0,0.000686,0.020499,0.069642,0.092922,0.019655,0.127505,0.050958,-0.013562
8,0.020365,0.016531,-0.013531,1.0,-0.052705,0.01892828,0.002786,-0.055035,0.0,0.001459,...,-0.228218,0.0,0.019007,0.043206,0.035985,-0.03675,0.0,-0.041096,-0.018445,0.106221
10,0.125977,0.031589,0.020497,-0.052705,1.0,-1.665335e-16,0.008811,0.044415,0.0,0.071511,...,0.0,0.0,0.0,0.208672,-0.015725,0.039774,0.0,0.259914,0.058329,-0.050271


#### Next, define several functions to display recommendations and predict ratings

In [20]:
# Prints the top 10 recommended anime by highest similarity score

def anime_rec(anime_name):
    count = 1
    print('Anime yang sama dengan {} adalah:\n'.format(anime_name))
    for item in item_sim_df.sort_values(by = anime_name, ascending = False).index[1:11]:
        print('No. {}: {}'.format(count, item))
        count +=1

In [21]:
# Prints the 10 users with the highest similarity to the given user

def top_users(user):
    
    if user not in piv_norm.columns:
        return('Tidak ada data yang tersedia untuk user {}'.format(user))
    
    print('User yang memiliki kesamaan paling tinggi:\n')
    # Sort once; the first entry is normally the user itself (similarity 1.0), so skip it
    top = user_sim_df[user].sort_values(ascending=False).iloc[1:11]
    for sim_user, sim in top.items():
        print('User #{0}, Similarity value: {1:.2f}'.format(sim_user, sim))

In [22]:
# Collects the highest-rated anime of each of the 10 most similar users and returns the 5 anime that appear most often, with their counts

def similar_user_recs(user):
    
    if user not in piv_norm.columns:
        return('Tidak ada data user yang tersedia {}'.format(user))
    
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:11]
    best = []
    most_common = {}
    
    for i in sim_users:
        max_score = piv_norm.loc[:, i].max()
        best.append(piv_norm[piv_norm.loc[:, i]==max_score].index.tolist())
    for i in range(len(best)):
        for j in best[i]:
            if j in most_common:
                most_common[j] += 1
            else:
                most_common[j] = 1
    sorted_list = sorted(most_common.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_list[:5]    
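
The frequency count inside `similar_user_recs` can also be expressed with `collections.Counter` from the standard library. This is only an alternative sketch on made-up data; the nested list `best` stands in for the per-user lists built in the function above.

```python
from collections import Counter

# Hypothetical top-rated anime lists, one list per similar user
best = [["Naruto", "Bleach"], ["Naruto"], ["One Piece", "Bleach"], ["Naruto"]]

# Flatten the lists and count how often each title appears
counts = Counter(title for titles in best for title in titles)
top = counts.most_common(2)
print(top)
```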

In [23]:
# Computes a similarity-weighted average of the ratings given by similar users to estimate the rating the input user would give the input anime

def predicted_rating(anime_name, user):
    # Sort once; the first entry is normally the user itself, so take the next 999 most similar users
    sims = user_sim_df[user].sort_values(ascending=False).iloc[1:1000]
    rating_list = []
    weight_list = []
    for sim_user, similarity in sims.items():
        rating = piv.loc[sim_user, anime_name]
        if np.isnan(rating):
            continue  # skip neighbours who have not rated this anime
        rating_list.append(rating*similarity)
        weight_list.append(similarity)
    return sum(rating_list)/sum(weight_list)    
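
To make the weighted average concrete, here is a minimal hand-sized sketch. The similarity values and ratings are invented; `None` marks a neighbour who has not rated the anime and is skipped, as in `predicted_rating`.

```python
# Hypothetical neighbours: (similarity to the target user, rating given to the anime)
neighbours = [(0.8, 9.0), (0.5, 7.0), (0.2, None)]

weighted_sum = 0.0
weight_total = 0.0
for similarity, rating in neighbours:
    if rating is None:
        continue  # neighbour has not rated this anime
    weighted_sum += similarity * rating
    weight_total += similarity

# (0.8 * 9 + 0.5 * 7) / (0.8 + 0.5)
predicted = weighted_sum / weight_total
print(round(predicted, 3))
```

The prediction lands between the two observed ratings and closer to 9, because the neighbour who gave 9 is more similar to the target user. If no neighbour has rated the anime, the total weight is 0 and the division fails, so a guard may be worth adding in practice.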

In [36]:
anime_rec('Noragami')

Anime yang sama dengan Noragami adalah:

No. 1: Noragami Aragoto
No. 2: No Game No Life
No. 3: Haikyuu!!
No. 4: Barakamon
No. 5: Hataraku Maou-sama!
No. 6: Zankyou no Terror
No. 7: Boku dake ga Inai Machi
No. 8: Kuroko no Basket 2nd Season
No. 9: Shokugeki no Souma
No. 10: Gin no Saji 2nd Season


In [37]:
top_users(100)

User yang memiliki kesamaan paling tinggi:

User #8501, Similarity value: 0.84
User #4621, Similarity value: 0.83
User #8440, Similarity value: 0.82
User #569, Similarity value: 0.81
User #9906, Similarity value: 0.81
User #1930, Similarity value: 0.77
User #3236, Similarity value: 0.73
User #2813, Similarity value: 0.73
User #822, Similarity value: 0.68
User #2388, Similarity value: 0.68


In [38]:
df = merged_sub[merged_sub.user_id == 100].name
df.head(10)

355379     Highschool of the Dead
451707             Ao no Exorcist
541544           Sword Art Online
2056889                Durarara!!
4068919          Gakkou no Kaidan
Name: name, dtype: object

In [39]:
df = merged_sub[merged_sub.user_id == 8501].name
df.head(10)

3120                       Naruto
43576                  Elfen Lied
357827     Highschool of the Dead
544949           Sword Art Online
686832         Shingeki no Kyojin
1008528            Dragon Ball GT
1020902                   Pokemon
1030154         Digimon Adventure
1040594             Dragon Ball Z
1066458      Digimon Adventure 02
Name: name, dtype: object

In [40]:
similar_user_recs(100)

[('Dragon Ball', 2),
 ('Dragon Ball GT', 2),
 ('Shingeki no Kyojin', 2),
 ('Sword Art Online', 2),
 ('Mirai Nikki (TV)', 2)]

In [41]:
predicted_rating('Death Note', 100)

8.717063828469831