## Anime Recommendation System with Content-Based Filtering
**Content-based filtering** uses the features of an item to recommend other items similar to what the user likes, based on their previous actions or explicit feedback [[1]](https://developers.google.com/machine-learning/recommendation/content-based/basics#:~:text=Content%2Dbased%20filtering%20uses%20item,previous%20actions%20or%20explicit%20feedback.). It relies on vectors and *cosine similarity*: each word in a document receives a weight, so each document is represented by a vector. A common weighting is **TF-IDF** [[2]](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for *term frequency–inverse document frequency*, a numerical statistic that indicates how important a word is to a document within a collection or corpus. **Cosine similarity** [[3]](https://en.wikipedia.org/wiki/Cosine_similarity) measures how close one vector is to another. Comparing the words of each document in this way identifies the documents that are statistically most similar. Note that the code in this notebook builds its vectors from raw word counts (`CountVectorizer`) rather than TF-IDF weights.

One example of content-based filtering is an e-commerce application that recommends products similar to the one the user is currently viewing. In this project I use data from Kaggle [[4]](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database) to build a simple anime recommendation system based on genre and broadcast type.
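
To make the cosine similarity idea concrete, here is a minimal sketch in plain Python. The four-word vocabulary, the three toy vectors, and the helper name `cosine` are assumptions made for illustration; they are not taken from the dataset.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: [comedy, magic, tv, movie]
anime_a = [1, 1, 1, 0]  # "Comedy, Magic TV"
anime_b = [1, 1, 1, 0]  # same words as anime_a
anime_c = [1, 0, 0, 1]  # "Comedy Movie"

same_words = cosine(anime_a, anime_b)  # identical word sets: approximately 1.0
one_shared = cosine(anime_a, anime_c)  # one shared word: 1 / sqrt(3 * 2)
```

Two documents with exactly the same words score approximately 1.0 (floating-point rounding can push the value slightly above or below), and the score falls as fewer words are shared.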

In [1]:
# Import python package
import numpy as np
import pandas as pd

In [2]:
# Load file anime.csv and show top 5 data
df = pd.read_csv('anime.csv')
df.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


The data in anime.csv is loaded into the dataframe `df`, which has the following columns:
- anime_id : anime identification number
- name : anime title
- genre : anime genres
- type : broadcast type (for example TV or Movie)
- episodes : number of episodes
- rating : average score given by users of the MyAnimeList website (myanimelist.net)
- members : number of community members in the anime's group on the site

In [3]:
# show data dimension
df.shape

(12294, 7)

The dataframe has 12,294 rows and 7 columns.

In [4]:
# check NaN or null values
df.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [5]:
# check data description
dataDesc=[]
for i in df.columns:
    dataDesc.append([
        i,
        df[i].dtypes,
        df[i].isna().sum(),
        round((((df[i].isna().sum())/len(df))*100),2),
        df[i].nunique(),
        df[i].drop_duplicates().sample(2).values
    ])

pd.DataFrame(dataDesc, columns=[
    'Data Features',
    'Data Types',
    'Null',
    'Null Percentage',
    'Unique',
    'Unique Sample'
])

| | Data Features | Data Types | Null | Null Percentage | Unique | Unique Sample |
|---|---|---|---|---|---|---|
| 0 | anime_id | int64 | 0 | 0.0 | 12294 | [14993, 9526] |
| 1 | name | object | 0 | 0.0 | 12292 | [Shin Seiki Inma Seiden, Channel 5.5 4th Season] |
| 2 | genre | object | 62 | 0.5 | 3264 | [Drama, Fantasy, Romance, School, Psychologica... |
| 3 | type | object | 25 | 0.2 | 6 | [Music, Special] |
| 4 | episodes | object | 0 | 0.0 | 187 | [330, 260] |
| 5 | rating | float64 | 230 | 1.87 | 598 | [1.67, 8.37] |
| 6 | members | int64 | 0 | 0.0 | 6706 | [22770, 23745] |


Several columns contain null values, but the share is small (below 5%), so the nulls in the feature columns are replaced with an empty string (`""`). The 'genre' and 'type' features are then used to build the vectors.

In [6]:
# define feature variable 
feature = ['genre', 'type']

In [7]:
# replace null values in the `feature` columns with "" (empty string)
for i in feature:
    df[i] = df[i].fillna("")

In [8]:
# check null values
df.isna().sum()

anime_id      0
name          0
genre         0
type          0
episodes      0
rating      230
members       0
dtype: int64

In [9]:
# build function to combine 'genre' data and 'type' data
def combo(x):
    recom = x['genre'] + " " + x['type']
    return recom

In [10]:
# define column for combined data
df['combo_features'] = df.apply(combo, axis=1)

This produces a new column, combo_features, which combines the contents of the 'genre' and 'type' columns.

In [11]:
# show top 5 data
df.head()

| | anime_id | name | genre | type | episodes | rating | members | combo_features |
|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | Drama, Romance, School, Supernatural Movie |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 | Action, Adventure, Drama, Fantasy, Magic, Mili... |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 | Action, Comedy, Historical, Parody, Samurai, S... |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 | Sci-Fi, Thriller TV |
| 4 | 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 | Action, Comedy, Historical, Parody, Samurai, S... |


In [12]:
# import package
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
CV = CountVectorizer()

In [14]:
# vectorize each data in `combo_features`
anime_matrix = CV.fit_transform(df['combo_features'])

In [15]:
# import package
from sklearn.metrics.pairwise import cosine_similarity

In [16]:
# count score from cosine_similarity from vectors in anime_matrix
cos_score = cosine_similarity(anime_matrix)
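
The introduction describes TF-IDF weighting, while the cells above use `CountVectorizer`, which stores raw token counts. The sketch below compares the two on three short strings (the third row is made up for illustration) and also shows how the default tokenizer lowercases text and splits on punctuation, so "Sci-Fi" becomes the two tokens `sci` and `fi`. `TfidfVectorizer` is shown only as an alternative to try, not as what this notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Sci-Fi, Thriller TV",
    "Drama, Romance, School, Supernatural Movie",
    "Sci-Fi, Drama TV",  # made-up row, for illustration only
]

# Raw counts, as in the notebook
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs)
tokens = sorted(count_vec.vocabulary_)  # learned vocabulary, alphabetical
count_sim = cosine_similarity(count_matrix)

# TF-IDF weights: tokens that appear in fewer documents weigh more
tfidf_vec = TfidfVectorizer()
tfidf_sim = cosine_similarity(tfidf_vec.fit_transform(docs))

print(tokens)
print("count :", round(float(count_sim[0, 2]), 4))
print("tf-idf:", round(float(tfidf_sim[0, 2]), 4))
```

With raw counts, rows 0 and 2 share three of their four tokens and score 0.75. TF-IDF gives the unshared token `thriller` a larger weight because it occurs in only one document, so the same pair scores lower.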

The recommendation model is now built, but it cannot yet produce results. Next, a variable `watched` is defined to hold the title of an anime that has already been watched; this is the document that will be compared against all the others.

In [17]:
# define watched anime as watched variable
watched = "Amagi Brilliant Park"

In [18]:
# check the index of watched anime
df[df['name'] == "Amagi Brilliant Park"].index.item()

1100

In [19]:
# check the name of anime from index
df[df.index == 1100]['name'].values[0]

'Amagi Brilliant Park'

The watched anime stored in `watched` is "Amagi Brilliant Park", which sits at index 1100. Next, helper functions are defined to convert between an anime name and its index. Using the index of the watched anime, the model looks up the cosine similarity between that anime and every other anime in the dataframe.
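
The data description table reports 12292 unique names across 12294 rows, so at least one title appears more than once, and `.index.item()` only works when exactly one row matches. Below is a hedged sketch of a more tolerant lookup. The function name `find_anime_index` and the toy frame are hypothetical, and the sketch assumes the default integer index used in this notebook, so that row labels match the row positions of the similarity matrix.

```python
import pandas as pd

def find_anime_index(frame, title):
    """Return the label of the first row whose 'name' equals `title`.

    Raises KeyError when the title is absent. When the title appears
    more than once, the first match is returned (a choice made here).
    """
    matches = frame.index[frame["name"] == title]
    if len(matches) == 0:
        raise KeyError(f"anime not found: {title!r}")
    return matches[0]

# Toy frame with a made-up duplicate title, for illustration only
toy = pd.DataFrame({"name": ["Title A", "Title B", "Title A"]})
first_match = find_anime_index(toy, "Title A")

try:
    find_anime_index(toy, "Title C")
    missing_raises = False
except KeyError:
    missing_raises = True
```

Returning the first match is only one possible policy; raising an error on duplicates, or asking the caller to disambiguate by `anime_id`, would work as well.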

In [20]:
# define function to get index from anime name
def index_by_anime(anime):
    return df[df['name'] == anime].index.item()

In [21]:
# define function to get anime name from index
def anime_by_index(index):
    return df[df.index == index]['name'].values[0]

In [22]:
# index of the watched anime
index_watched = index_by_anime(watched)

In [23]:
# show all cosine similarity score of all anime data
cos_score[index_watched]

array([0.        , 0.40824829, 0.38490018, ..., 0.        , 0.        ,
       0.        ])

In [24]:
# get every score by index
list(enumerate(cos_score[index_watched]))

[(0, 0.0),
 (1, 0.408248290463863),
 (2, 0.3849001794597505),
 (3, 0.2886751345948129),
 (4, 0.3849001794597505),
 (5, 0.4714045207910318),
 (6, 0.2357022603955159),
 (7, 0.0),
 (8, 0.19245008972987526),
 (9, 0.3849001794597505),
 (10, 0.2041241452319315),
 (11, 0.0),
 (12, 0.3849001794597505),
 (13, 0.19245008972987526),
 (14, 0.4714045207910318),
 (15, 0.0),
 (16, 0.2357022603955159),
 (17, 0.1825741858350554),
 (18, 0.0),
 (19, 0.19245008972987526),
 (20, 0.5163977794943223),
 (21, 0.0),
 (22, 0.408248290463863),
 (23, 0.3651483716701108),
 (24, 0.0),
 (25, 0.2041241452319315),
 (26, 0.4714045207910318),
 (27, 0.1825741858350554),
 (28, 0.1825741858350554),
 (29, 0.4364357804719848),
 (30, 0.408248290463863),
 (31, 0.2041241452319315),
 (32, 0.5163977794943223),
 (33, 0.0),
 (34, 0.2041241452319315),
 (35, 0.0),
 (36, 0.25819888974716115),
 (37, 0.0),
 (38, 0.2041241452319315),
 (39, 0.5163977794943223),
 (40, 0.2357022603955159),
 (41, 0.3333333333333334),
 (42, 0.5163977794943223)

In [25]:
# the list of the score are saved in a variable
similarity_anime = list(enumerate(cos_score[index_watched]))

Once all cosine similarity scores are available, they are sorted from highest to lowest and the first result is excluded. In this run the first result is the anime in `watched` itself (index 1100), with a score of 1.0, displayed as 1.0000000000000002 because of floating-point rounding.
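
Dropping the first sorted entry removes the watched anime only when it happens to sort first. The sorted output in cell 28 shows eleven entries tied at the top score, and Python's sort keeps tied entries in their original index order, so a tied title with a lower index than the watched one would be dropped in its place. The sketch below, with a hypothetical helper `rank_similar` and made-up scores, filters by index so the result does not depend on tie order.

```python
def rank_similar(scores, watched_index, top_n=10):
    """Rank (index, score) pairs by score, excluding the watched item by index."""
    ranked = [(i, s) for i, s in enumerate(scores) if i != watched_index]
    ranked.sort(key=lambda pair: pair[1], reverse=True)  # stable for ties
    return ranked[:top_n]

# Made-up scores: index 2 is the watched item, index 0 has identical features
toy_scores = [1.0, 0.5, 1.0, 0.2]
watched_index = 2

# Notebook-style approach: sort everything, then drop the first entry
naive = sorted(enumerate(toy_scores), key=lambda x: x[1], reverse=True)[1:]

# Index-based filtering
filtered = rank_similar(toy_scores, watched_index)

print(naive)     # still contains index 2, the watched item
print(filtered)  # index 2 removed, index 0 kept
```

For "Amagi Brilliant Park" both approaches give the same list, because index 1100 is the lowest index among the tied entries.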

In [27]:
# sort the score from the highest to lowest, save as sorted_anime except the highest score
sorted_anime = sorted(similarity_anime, key=lambda x: x[1], reverse=True)[1:]

In [28]:
# show all score sorted from highest to lowest
sorted(similarity_anime, key=lambda x: x[1], reverse=True)[:]

[(1100, 1.0000000000000002),
 (2686, 1.0000000000000002),
 (4250, 1.0000000000000002),
 (5017, 1.0000000000000002),
 (5491, 1.0000000000000002),
 (5901, 1.0000000000000002),
 (6626, 1.0000000000000002),
 (7354, 1.0000000000000002),
 (7512, 1.0000000000000002),
 (10671, 1.0000000000000002),
 (10672, 1.0000000000000002),
 (1503, 0.8660254037844388),
 (1743, 0.8660254037844388),
 (1944, 0.8660254037844388),
 (2227, 0.8660254037844388),
 (2244, 0.8660254037844388),
 (2659, 0.8660254037844388),
 (2773, 0.8660254037844388),
 (2803, 0.8660254037844388),
 (2889, 0.8660254037844388),
 (2921, 0.8660254037844388),
 (3235, 0.8660254037844388),
 (4699, 0.8660254037844388),
 (5001, 0.8660254037844388),
 (5182, 0.8660254037844388),
 (5527, 0.8660254037844388),
 (5579, 0.8660254037844388),
 (5938, 0.8660254037844388),
 (6496, 0.8660254037844388),
 (8186, 0.8660254037844388),
 (8187, 0.8660254037844388),
 (8188, 0.8660254037844388),
 (8285, 0.8660254037844388),
 (11060, 0.8660254037844388),
 (1425, 0.8

In [30]:
anime_by_index(sorted_anime[1][0])

'Mahou no Yousei Persia'

In [31]:
# present the top 10 recommendations
print("Top 10 Anime Recommendation:", watched)
print("=" * 50)
for idx, _score in sorted_anime[:10]:
    print(anime_by_index(idx))

Top 10 Anime Recommendation: Amagi Brilliant Park
==================================================
Ojamajo Doremi
Mahou no Yousei Persia
Yokuwakaru Gendaimahou
Magical Nyan Nyan Taruto
Mahou Shoujo Nante Mou Ii Desukara. 2nd Season
Mahou Shoujo Nante Mou Ii Desukara.
Maji de Otaku na English! Ribbon-chan: Eigo de Tatakau Mahou Shoujo - The TV
Maji de Otaku na English! Ribbon-chan: Eigo de Tatakau Mahou Shoujo
Unko-san: Tsuiteru Hito ni Shika Mienai Yousei
Unko-san: Tsuiteru Hito ni Shika Mienai Yousei Junjou Ha


This simple recommendation system returns 10 anime whose genre and broadcast type are similar to those of "Amagi Brilliant Park". The model works well, but many results share identical or very high scores, which suggests that the features used are not varied enough to tell titles apart. The system uses only one title as its reference, but it could be extended to use more than one title so that the recommendations improve.
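
As a rough illustration of using more than one reference title, one simple option (an assumption on my part, not something this notebook implements) is to average the similarity rows of all watched titles and rank the remaining anime by that average. The function name `recommend_from_many` and the 4x4 matrix are made up for the example.

```python
import numpy as np

def recommend_from_many(cos_score, watched_indices, top_n=10):
    """Rank items by their mean similarity to several watched items."""
    scores = np.asarray(cos_score, dtype=float)
    watched = set(watched_indices)
    # Mean similarity of every item to the watched items
    profile = scores[list(watched_indices)].mean(axis=0)
    # Highest mean first; a stable sort keeps index order among ties
    order = np.argsort(-profile, kind="stable")
    picks = [int(i) for i in order if int(i) not in watched]
    return picks[:top_n]

# Made-up symmetric similarity matrix for four items
toy_matrix = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.6, 0.3, 1.0],
])

picks = recommend_from_many(toy_matrix, [0, 1], top_n=2)
print(picks)
```

In the notebook this could be called with `cos_score` and the indices of several watched titles. Averaging treats every watched title equally; weighting by the user's own ratings would be another option.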