# **Recommendation System Project: Film Recommendations, 1996 - 2018**

## **Import Library**

In [198]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

2.19.0


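**Insight:**

- Pandas: data manipulation and analysis
- Numpy: numerical computation
- Matplotlib: data visualization
- Scikit-learn (TfidfVectorizer, cosine_similarity): text vectorization and similarity computation
- Pathlib and zipfile: file path and archive handling
- TensorFlow/Keras: building neural network models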

## **Loading Dataset**

Load the datasets with pandas

In [199]:
links_df = pd.read_csv("data/links.csv")
movies_df = pd.read_csv("data/movies.csv")
ratings_df = pd.read_csv("data/ratings.csv")
tags_df = pd.read_csv("data/tags.csv")

In [200]:
links_df.head(5)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [201]:
movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [202]:
ratings_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [203]:
tags_df.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


**Insight:**

- This project uses four datasets: links, movies, ratings and tags
- In the ratings dataset, each row represents one rating given to one movie by one user. Its columns are userId, movieId, rating and timestamp
- The links dataset maps each movie to external film data sources, namely IMDB and TMDB. Its columns are movieId, imdbId and tmdbId
- The movies dataset holds film information, namely title and genre. Its columns are movieId, title and genres
- The tags dataset holds tags given by one user to one film. Its columns are userId, movieId, tag and timestamp


## **Exploratory Data Analysis (EDA)**

### **Viewing Dataset Information**

In [204]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


**Insight:**

Based on the output above, the movies dataset has 9742 entries. It has three variables: movieId, title (the film title) and genres.


In [205]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


**Insight:**

Based on the output above, the ratings dataset has 100836 entries. It has four variables: userId, movieId, rating and timestamp.

In [206]:
links_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


**Insight:**

Based on the output above, the links dataset has 9742 entries. It has three variables: movieId, imdbId and tmdbId. The tmdbId column has only 9734 non-null values, so 8 values are missing.

In [207]:
tags_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


**Insight:**

Based on the output above, the tags dataset has 3683 entries. It has four variables: userId, movieId, tag and timestamp.

In [208]:
print('Number of users: ', len(ratings_df['userId'].unique()))
print('Number of films: ', len(movies_df['movieId']))
print('Number of ratings: ', len(ratings_df['rating']))

Number of users:  610
Number of films:  9742
Number of ratings:  100836


**Insight:**

There are 610 users who gave ratings, and the movies dataset lists 9742 films. In total, users gave 100836 ratings.

### **Checking for Missing Values**

In [209]:
links_df.isna().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

In [210]:
ratings_df.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [211]:
movies_df.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [212]:
tags_df.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

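**Insight:**

Checking the four datasets with isna() shows that ratings, movies and tags have no missing values (NaN). The links dataset has 8 missing values in the tmdbId column. Because links_df is not used in the preprocessing and modeling steps below, these values are left as they are.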
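The four separate isna() checks above can also be collected into one table. The following is a minimal sketch on toy stand-ins for the datasets (the `frames` dictionary and its values are hypothetical, not the real data); it illustrates the idea and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the datasets (hypothetical values, not the real data)
frames = {
    "links": pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, np.nan, 15602.0]}),
    "movies": pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]}),
}

# One row per dataset, one column per variable.
# A NaN in this summary means the dataset has no such column.
missing_summary = pd.DataFrame(
    {name: df.isna().sum() for name, df in frames.items()}
).T
print(missing_summary)

# Names of the datasets that contain at least one missing value
with_missing = [name for name, df in frames.items() if df.isna().any().any()]
print(with_missing)
```

With the real data, the dictionary would hold links_df, movies_df, ratings_df and tags_df instead of the toy frames.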

### **Descriptive Statistics of the Ratings Dataset**

In [213]:
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


**Insight:**

The output above shows that the minimum rating given by users is 0.5 and the maximum is 5, so the rating scale runs from 0.5 to 5. The mean rating is about 3.5.

### **Checking for Duplicate Values in the Datasets**

In [214]:
print("Number of duplicate rows in links.csv: ", links_df.duplicated().sum())
print("Number of duplicate rows in movies.csv: ", movies_df.duplicated().sum())
print("Number of duplicate rows in ratings.csv: ", ratings_df.duplicated().sum())
print("Number of duplicate rows in tags.csv: ", tags_df.duplicated().sum())

Number of duplicate rows in links.csv:  0
Number of duplicate rows in movies.csv:  0
Number of duplicate rows in ratings.csv:  0
Number of duplicate rows in tags.csv:  0


**Insight:**

Checking the four datasets with duplicated() shows no identical (duplicate) rows.
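Note that duplicated() without arguments only flags rows that match in every column. As a hedged illustration on toy data (the frame below is hypothetical), the `subset` parameter can additionally check whether the same user rated the same film more than once:

```python
import pandas as pd

# Toy ratings: user 1 rated movie 10 twice with different scores (hypothetical data)
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 10],
    "rating": [4.0, 3.0, 5.0],
})

# Rows identical in every column
full_row_dupes = int(ratings.duplicated().sum())

# Rows that repeat an earlier (userId, movieId) pair, whatever the rating
key_dupes = int(ratings.duplicated(subset=["userId", "movieId"]).sum())

print(full_row_dupes, key_dupes)
```

Whether such a key-based check is needed for ratings.csv is an assumption here; the notebook itself only runs the full-row check.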

## **Data Preprocessing**

This stage merges several datasets into the main dataset used for modeling

In [215]:
# Merge the movies dataset with ratings (left join on movieId)
main_movies_df = pd.merge(movies_df, ratings_df[['movieId', 'rating', 'userId']], on='movieId', how='left')

main_movies_df

Unnamed: 0,movieId,title,genres,rating,userId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0,1.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0,5.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5,7.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2.5,15.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5,17.0
...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.0,184.0
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.5,184.0
100851,193585,Flint (2017),Drama,3.5,184.0
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.5,184.0


In [216]:
main_movies_df.isna().sum()

movieId     0
title       0
genres      0
rating     18
userId     18
dtype: int64

In [217]:
print("Number of duplicate rows: ", main_movies_df.duplicated().sum())

Number of duplicate rows:  0


**Insight:**

- This stage produces a dataframe that combines the movies and ratings datasets. main_movies_df is used to build the film recommendation system.
- After merging movies and ratings, missing values were checked with isna(). The rating and userId variables each have 18 missing values. Because this is a left join on movies, these rows correspond to films that have no rating from any user. This is handled in the Data Preparation stage.
- No duplicate rows were found after merging
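To see where missing values come from after a left join, the sketch below uses toy data (hypothetical frames, not the real datasets) and the `indicator=True` option of pandas merge, which adds a `_merge` column marking rows that exist only on the left side.

```python
import pandas as pd

# Toy data: movie 3 has no rating at all (hypothetical values)
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# Left join keeps every movie; unrated movies get NaN in rating and userId
merged = movies.merge(ratings, on="movieId", how="left", indicator=True)

# Rows marked "left_only" are movies without any matching rating
unrated = merged.loc[merged["_merge"] == "left_only", ["movieId", "title"]]
print(unrated)

# userId becomes float because NaN cannot be stored in an int64 column
print(merged[["rating", "userId"]].isna().sum())
print(merged["userId"].dtype)
```

This also explains why userId appears as 1.0, 5.0, 7.0 and so on in main_movies_df: the column is cast to float to hold the NaN values.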


## **Data Preparation**

### **Handling Missing Values**

In [218]:
# Remove rows with missing values using dropna()
main_movies_clean_df = main_movies_df.dropna()
main_movies_clean_df

Unnamed: 0,movieId,title,genres,rating,userId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0,1.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0,5.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5,7.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2.5,15.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5,17.0
...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.0,184.0
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.5,184.0
100851,193585,Flint (2017),Drama,3.5,184.0
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.5,184.0


In [219]:
# Re-check missing values in main_movies_clean_df
main_movies_clean_df.isnull().sum()

movieId    0
title      0
genres     0
rating     0
userId     0
dtype: int64

**Insight:**

The first step of Data Preparation is to remove the missing values in main_movies_df with dropna(). A second check then confirms that the problem has been handled. The output above shows no remaining missing values, so the data is ready for the next step.


### **Converting Series to Lists**

In [220]:
# Convert the series (movieId, title and genres) to lists
movie_id = main_movies_clean_df['movieId'].tolist()
movie_title = main_movies_clean_df['title'].tolist()
movie_genre = main_movies_clean_df['genres'].tolist()

print(len(movie_id))
print(len(movie_title))
print(len(movie_genre))

100836
100836
100836


In [221]:
# Build a DataFrame from a dictionary of (movieId, title and genres), then drop duplicate rows
movie_new = pd.DataFrame({
    'id': movie_id,
    'title': movie_title,
    'genres': movie_genre,
}).drop_duplicates()

movie_new

Unnamed: 0,id,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
215,2,Jumanji (1995),Adventure|Children|Fantasy
325,3,Grumpier Old Men (1995),Comedy|Romance
377,4,Waiting to Exhale (1995),Comedy|Drama|Romance
384,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100833,193585,Flint (2017),Drama
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


**Insight:**

Three variables are taken from main_movies_clean_df: movieId, title and genres. Each is converted to a list. A dictionary is then built with key:value pairs for these variables, and the dataframe movie_new is created from it. Because each film appears once per rating, drop_duplicates() is applied so that every film is kept only once.
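The same result can be reached without the intermediate lists. The following is a minimal sketch on toy data (the `merged` frame is a hypothetical stand-in for main_movies_clean_df) that compares the list-based route with a direct column selection; it is an alternative, not the notebook's own method.

```python
import pandas as pd

# Toy stand-in for main_movies_clean_df: one row per rating (hypothetical values)
merged = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title": ["A", "A", "B"],
    "genres": ["Comedy", "Comedy", "Drama"],
    "rating": [4.0, 3.5, 5.0],
})

# Route used in the notebook: lists -> dictionary -> DataFrame
via_lists = pd.DataFrame({
    "id": merged["movieId"].tolist(),
    "title": merged["title"].tolist(),
    "genres": merged["genres"].tolist(),
}).drop_duplicates()

# Direct route: select the columns, rename movieId, drop duplicates
direct = (
    merged[["movieId", "title", "genres"]]
    .rename(columns={"movieId": "id"})
    .drop_duplicates()
)

print(via_lists)
print(direct)
```

Both routes keep the index of the first occurrence of each film, which is why movie_new shows index values such as 0, 215 and 325.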

### **Formatting the Genres Variable**

In [249]:
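# Work on an explicit copy so the column assignment does not touch movie_new
movies_formatted_df = movie_new.copy()
# Remove hyphens from genre names (literal match, not a regular expression)
movies_formatted_df['genres'] = movies_formatted_df['genres'].str.replace('-', '', regex=False)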
movies_formatted_df

Unnamed: 0,id,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
215,2,Jumanji (1995),Adventure|Children|Fantasy
325,3,Grumpier Old Men (1995),Comedy|Romance
377,4,Waiting to Exhale (1995),Comedy|Drama|Romance
384,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100833,193585,Flint (2017),Drama
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


**Insight:**

Some values in the genres variable contain a hyphen ("-"), such as Sci-Fi and Film-Noir. To simplify vectorization later, the hyphen is removed without adding a space, so each of these genres stays a single token (SciFi, FilmNoir).


### **Removing Films Without a Clear Genre**

In [273]:
fix_movies_df = movies_formatted_df.iloc[:]

fix_movies_df['genres'].unique()

array(['Adventure|Animation|Children|Comedy|Fantasy',
       'Adventure|Children|Fantasy', 'Comedy|Romance',
       'Comedy|Drama|Romance', 'Comedy', 'Action|Crime|Thriller',
       'Adventure|Children', 'Action', 'Action|Adventure|Thriller',
       'Comedy|Horror', 'Adventure|Animation|Children', 'Drama',
       'Action|Adventure|Romance', 'Crime|Drama', 'Drama|Romance',
       'Action|Comedy|Crime|Drama|Thriller', 'Comedy|Crime|Thriller',
       'Crime|Drama|Horror|Mystery|Thriller', 'Drama|SciFi',
       'Children|Drama', 'Adventure|Drama|Fantasy|Mystery|SciFi',
       'Mystery|SciFi|Thriller', 'Children|Comedy', 'Drama|War',
       'Action|Crime|Drama', 'Action|Adventure|Fantasy',
       'Comedy|Drama|Thriller', 'Mystery|Thriller',
       'Animation|Children|Drama|Musical|Romance',
       'Crime|Mystery|Thriller', 'Adventure|Drama', 'Drama|Thriller',
       'Comedy|Crime', 'Action|SciFi|Thriller',
       'Action|Comedy|Horror|Thriller', 'Comedy|Drama', 'Documentary',
       'Action

In [274]:
fix_movies_df[fix_movies_df['genres'] == '(no genres listed)']

Unnamed: 0,id,title,genres
97487,114335,La cravate (1957),(no genres listed)
98182,122888,Ben-hur (2016),(no genres listed)
98216,122896,Pirates of the Caribbean: Dead Men Tell No Tal...,(no genres listed)
98608,129250,Superfast! (2015),(no genres listed)
98718,132084,Let It Be Me (1995),(no genres listed)
98932,134861,Trevor Noah: African American (2013),(no genres listed)
99331,141131,Guardians (2016),(no genres listed)
99357,141866,Green Room (2015),(no genres listed)
99391,142456,The Brand New Testament (2015),(no genres listed)
99459,143410,Hyena Road,(no genres listed)


In [285]:
fix_movies_df = fix_movies_df.replace('(no genres listed)', np.nan)
fix_movies_df = fix_movies_df.dropna()

fix_movies_df

Unnamed: 0,id,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
215,2,Jumanji (1995),Adventure|Children|Fantasy
325,3,Grumpier Old Men (1995),Comedy|Romance
377,4,Waiting to Exhale (1995),Comedy|Drama|Romance
384,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100833,193585,Flint (2017),Drama
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


**Insight:**

Some films have an unclear genre, namely "(no genres listed)". To maintain data quality, these rows are removed.
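As an alternative to replacing the placeholder with NaN and calling dropna(), the rows can be filtered with a boolean mask. This is a minimal sketch on a toy frame (hypothetical values); it gives the same rows as the replace-and-drop route when the frame has no other missing values.

```python
import pandas as pd

# Toy film table with one placeholder genre (hypothetical values)
films = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genres": ["Comedy", "(no genres listed)", "Drama"],
})

# Keep only rows whose genre is not the placeholder
kept = films[films["genres"] != "(no genres listed)"]
print(kept)
```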

## **Building Model**

### **Model Development with Content Based Filtering**

Develop a film recommendation system with a content based filtering approach based on film genres

#### **Preparing the Data for CBF Modeling**

In [277]:
data_cbf = fix_movies_df.iloc[:]
data_cbf.sample(5)

Unnamed: 0,id,title,genres
39628,2097,Something Wicked This Way Comes (1983),Children|Drama|Fantasy|Mystery|Thriller
5673,207,"Walk in the Clouds, A (1995)",Drama|Romance
92891,87028,"Roommate, The (2011)",Drama|Thriller
13037,475,In the Name of the Father (1993),Drama
30446,1367,101 Dalmatians (1996),Adventure|Children|Comedy


#### **Finding Important Feature Representations**

In [278]:
# Initialize TfidfVectorizer
tfid = TfidfVectorizer()

# Compute idf on the genres data
tfid.fit(data_cbf['genres'])

# Map the integer feature indices to feature names
tfid.get_feature_names_out()

array(['action', 'adventure', 'animation', 'children', 'comedy', 'crime',
       'documentary', 'drama', 'fantasy', 'filmnoir', 'horror', 'imax',
       'musical', 'mystery', 'romance', 'scifi', 'thriller', 'war',
       'western'], dtype=object)

**Insight:**

Vectorization with TfidfVectorizer yields the following features:
['action', 'adventure', 'animation', 'children', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 'filmnoir', 'horror', 'imax', 'musical', 'mystery', 'romance', 'scifi', 'thriller', 'war', 'western']


In [279]:
# Transform the genres into a TF-IDF matrix (the vectorizer was already fitted above)
tfidf_matrix = tfid.transform(data_cbf['genres'])

# Check the size of the TF-IDF matrix
tfidf_matrix.shape

(9690, 19)

**Insight:**

The resulting matrix has shape (9690, 19). The value 9690 is the number of films and 19 is the number of genre features.

In [280]:
# Convert the sparse TF-IDF matrix to a dense matrix with todense()
tfidf_matrix.todense()

matrix([[0.        , 0.41677501, 0.51640289, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.5122897 , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.57821416, 0.        , 0.81588503, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]], shape=(9690, 19))

In [301]:
# Build a dataframe to inspect the TF-IDF matrix
# Columns hold the film genres
# Rows hold the film titles

pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tfid.get_feature_names_out(),
    index=data_cbf['title'],
)

title,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,filmnoir,horror,imax,musical,mystery,romance,scifi,thriller,war,western
Toy Story (1995),0.000000,0.416775,0.516403,0.504783,0.267318,0.0,0.0,0.000000,0.483075,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Jumanji (1995),0.000000,0.512290,0.000000,0.620467,0.000000,0.0,0.0,0.000000,0.593784,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Grumpier Old Men (1995),0.000000,0.000000,0.000000,0.000000,0.570321,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.821422,0.0,0.0,0.0,0.0
Waiting to Exhale (1995),0.000000,0.000000,0.000000,0.000000,0.504506,0.0,0.0,0.466351,0.000000,0.0,0.0,0.0,0.0,0.0,0.726630,0.0,0.0,0.0,0.0
Father of the Bride Part II (1995),0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Black Butler: Book of the Atlantic (2017),0.435735,0.000000,0.614840,0.000000,0.318275,0.0,0.0,0.000000,0.575159,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
No Game No Life: Zero (2017),0.000000,0.000000,0.683099,0.000000,0.353609,0.0,0.0,0.000000,0.639012,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Flint (2017),0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
Bungo Stray Dogs: Dead Apple (2018),0.578214,0.000000,0.815885,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


**Insight:**

The TF-IDF matrix output above shows that Toy Story (1995) has the genres adventure, animation, children, comedy and fantasy. This is visible from the nonzero values 0.416775 for adventure, 0.516403 for animation, 0.504783 for children, 0.267318 for comedy and 0.483075 for fantasy. Comedy has the lowest weight of the five because it is the most common of these genres across films.


### **Cosine Similarity**

In [295]:
# Compute cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.        , 0.81355337, 0.15245713, ..., 0.        , 0.42132539,
        0.26731811],
       [0.81355337, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.15245713, 0.        , 1.        , ..., 0.        , 0.        ,
        0.57032099],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.42132539, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.26731811, 0.        , 0.57032099, ..., 0.        , 0.        ,
        1.        ]], shape=(9690, 9690))

**Insight:**

This step computes cosine similarity on tfidf_matrix (a sparse matrix) with the cosine_similarity function from the sklearn library. The output above is the similarity matrix as an array, where each entry is the similarity between a pair of films and the diagonal is 1 because every film is identical to itself.
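Cosine similarity is the dot product of two vectors divided by the product of their lengths. Because TfidfVectorizer normalizes each row to unit length by default, the similarity matrix here equals the plain matrix product of the TF-IDF matrix with its transpose. The sketch below checks this on toy genre strings (hypothetical data).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy genre strings (hypothetical)
genres = ["Adventure|Children|Fantasy", "Adventure|Fantasy", "Drama"]

# Rows of X are L2-normalized by default
X = TfidfVectorizer().fit_transform(genres)

sim = cosine_similarity(X)       # pairwise cosine similarity
dot = (X @ X.T).toarray()        # plain dot products between rows

print(np.round(sim, 4))
print(np.allclose(sim, dot))
```

Films that share no genre, such as the third toy entry against the other two, get a similarity of 0.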

In [293]:
# Build a dataframe from cosine_sim with film titles as both rows and columns
cosine_sim_df = pd.DataFrame(cosine_sim, index=data_cbf['title'], columns=data_cbf['title'])
print("Shape: ", cosine_sim_df.shape)

# Inspect the similarity matrix for a random sample of films
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape:  (9690, 9690)


title,Snow White (1916),Mixed Nuts (1994),Cleanskin (2012),"Knight's Tale, A (2001)","Hello, Dolly! (1969)"
Taken (2008),0.0,0.0,0.857362,0.316767,0.0
Smokey and the Bandit II (1980),0.0,0.589839,0.0,0.762115,0.207249
M (1931),0.0,0.0,0.528993,0.0,0.0
"Story of Film: An Odyssey, The (2011)",0.0,0.0,0.0,0.0,0.0
Clerks II (2006),0.0,1.0,0.0,0.449525,0.351365
Still Alice (2014),0.0,0.0,0.405348,0.0,0.0
48 Hrs. (1982),0.0,0.400121,0.591628,0.516986,0.140589
Garfield: A Tail of Two Kitties (2006),0.0,0.347155,0.0,0.156055,0.121978
Ben X (2007),0.0,0.0,0.405348,0.0,0.0
27 Dresses (2008),0.511964,0.570321,0.0,0.788197,0.616084


**Insight:**

Shape (9690, 9690) is the size of the similarity matrix for the film data. In the output above, for example, Taken (2008) is similar to Cleanskin (2012) with a similarity value of 0.857362 and to Knight's Tale, A (2001) with a value of 0.316767.

### **Getting Film Recommendations**

In [310]:
def movie_recommendations(movie_title, similarity_data=cosine_sim_df, items=data_cbf[['title', 'genres']], k=5):
    # Use argpartition to move the k + 1 largest similarity scores
    # (the query film itself plus k candidates) to the end of the index array
    # without sorting the whole column
    # The dataframe column is converted to a NumPy array first
    # range(start, stop, step)
    index = similarity_data.loc[:, movie_title].to_numpy().argpartition(range(-1, -(k + 2), -1))

    # Take the titles with the highest similarity, from most to least similar
    closest = similarity_data.columns[index[-1:-(k + 2): -1]]

    # Drop movie_title so the query film does not appear in its own recommendations
    closest = closest.drop(movie_title, errors='ignore')

    # Attach the genres and return at most k recommendations
    return pd.DataFrame(closest).merge(items).head(k)

**Insight:**

This function recommends films for a given film title, using the similarity data computed earlier.

In short, it works as follows:

1. Find the indices of the most similar films: the first line uses .argpartition() to quickly find the positions (indices) of the films most similar to movie_title without sorting the whole column.
2. Take the film titles: based on those positions, the second line takes the titles of the most similar films, ordered from the highest similarity down.
3. Remove the query film: the third line uses .drop() to remove movie_title from the list, so that the film does not recommend itself.
4. Return the final result: finally, the function merges the cleaned list of titles with the items data (title and genres) and returns the top k recommendations as a DataFrame.
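The same four steps can be sketched with a full sort, which is slower than argpartition on large data but easier to read and verify. The example below is a simplified variant on toy data, not the notebook's function: `recommend_sketch`, the toy `films` table and the extra `similarity` column are hypothetical, and it assumes film titles are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy film table (hypothetical values); titles are assumed to be unique
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [
        "Adventure|Children|Fantasy",
        "Adventure|Children",
        "Comedy|Romance",
        "Adventure|Fantasy",
    ],
})

# Similarity matrix labeled with film titles on both axes
sim = cosine_similarity(TfidfVectorizer().fit_transform(films["genres"]))
sim_df = pd.DataFrame(sim, index=films["title"], columns=films["title"])

def recommend_sketch(title, similarity=sim_df, items=films, k=2):
    # Similarity of every film to the query film, without the query itself
    scores = similarity[title].drop(title)
    # Full sort, then keep the k highest scores
    top = scores.sort_values(ascending=False).head(k)
    # Attach genres and the similarity score to the selected titles
    result = items.set_index("title").loc[top.index].assign(similarity=top.values)
    return result.rename_axis("title").reset_index()

result = recommend_sketch("A")
print(result)
```

For film "A", the two films that share genres with it are returned and the unrelated comedy is left out. Showing the similarity score next to each title also makes ties visible, which matter in this dataset because many films have exactly the same genre set.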

In [311]:
# Get recommendations for films similar to Toy Story (1995)
movie_recommendations('Toy Story (1995)')

Unnamed: 0,title,genres
0,Moana (2016),Adventure|Animation|Children|Comedy|Fantasy
1,"Tale of Despereaux, The (2008)",Adventure|Animation|Children|Comedy|Fantasy
2,Shrek the Third (2007),Adventure|Animation|Children|Comedy|Fantasy
3,"Wild, The (2006)",Adventure|Animation|Children|Comedy|Fantasy
4,"Adventures of Rocky and Bullwinkle, The (2000)",Adventure|Animation|Children|Comedy|Fantasy


**Insight:**

This is a trial to find films similar to Toy Story (1995). Based on genre, the films returned are Moana (2016), Tale of Despereaux, The (2008), Shrek the Third (2007), Wild, The (2006) and Adventures of Rocky and Bullwinkle, The (2000). All five have exactly the same genre set as Toy Story (1995): Adventure|Animation|Children|Comedy|Fantasy.

### **Model Development with Collaborative Filtering**

## **Evaluating Model**

## **Hyperparameter Tuning**

## **Evaluating Model after Tuning**