# Collecting Data

Install the Kaggle API so the dataset can be downloaded directly from Kaggle (https://www.kaggle.com/).

In [1]:
!pip install kaggle
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json



Download the dataset directly from Kaggle.

In [2]:
!kaggle datasets download gargmanas/movierecommenderdataset

Downloading movierecommenderdataset.zip to /content
  0% 0.00/846k [00:00<?, ?B/s]
100% 846k/846k [00:00<00:00, 111MB/s]


Extract the .zip file.

In [3]:
!unzip /content/movierecommenderdataset.zip

Archive:  /content/movierecommenderdataset.zip
  inflating: movies.csv              
  inflating: ratings.csv             


# Import Library

In [4]:
import pandas as pd
import numpy as np
import tensorflow as tf
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Data Preparation

## Import Data

The data files are in .csv format, so pandas is used to read them into the notebook.

This project uses _Content-Based Filtering_. The recommender suggests films to a user based on how similar their features (here, the film genres) are to those of films the user has liked before.

For that reason, the movies file alone is enough to build a recommender with _Content-Based Filtering_.

In [5]:
movies = pd.read_csv('/content/movies.csv')

Display the movies data.

In [6]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


## Data Cleaning

### Duplicated Data

Check the number of unique movieId and title values.

In [7]:
unique_movie_ids = movies['movieId'].nunique()
unique_titles = movies['title'].nunique()

if unique_movie_ids != unique_titles:
    print("The number of unique 'movieId' and 'title' values differs.")
    print(f"Unique 'movieId' count: {unique_movie_ids}")
    print(f"Unique 'title' count: {unique_titles}")
else:
    print("The number of unique 'movieId' and 'title' values is the same.")

The number of unique 'movieId' and 'title' values differs.
Unique 'movieId' count: 9742
Unique 'title' count: 9737


A problem shows up here: movieId and title should have the same number of unique values.

Check both columns for duplicated data.

In [8]:
duplicate_movie_ids = movies[movies.duplicated('movieId')]
duplicate_movie_ids

Unnamed: 0,movieId,title,genres


In [9]:
duplicate_titles = movies[movies['title'].duplicated(keep=False)]
duplicate_titles

Unnamed: 0,movieId,title,genres
650,838,Emma (1996),Comedy|Drama|Romance
2141,2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller
4169,6003,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller
5601,26958,Emma (1996),Romance
5854,32600,Eros (2004),Drama
5931,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller
6932,64997,War of the Worlds (2005),Action|Sci-Fi
9106,144606,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Romance|Thriller
9135,147002,Eros (2004),Drama|Romance
9468,168358,Saturn 3 (1980),Sci-Fi|Thriller


The title column contains duplicates: some films share a title but have different movieId values, and their listed genres differ as well. For each duplicated title, the row that lists fewer genres is dropped.

Drop the rows at index [4169, 5601, 5854, 6932, 9468].

In [10]:
drop_index = [4169, 5601, 5854, 6932, 9468]

movies = movies.drop(index=drop_index)
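
The hard-coded index list works for this snapshot of the data. As a rough sketch of how the same rule (keep the copy of each title that lists the most genres) could be applied without typing indices by hand, the example below uses a small hypothetical frame `toy` built from four of the duplicated rows shown earlier; the helper name `keep_richest_genres` is an assumption for illustration, not part of the notebook.

```python
import pandas as pd

def keep_richest_genres(df):
    """Keep, for each title, the row whose 'genres' string lists the most genres."""
    # Count genres per row by splitting on the '|' separator.
    genre_count = df['genres'].str.split('|').str.len()
    # Sort so the row with the most genres comes first within each title,
    # keep that first row, then restore the original row order.
    ranked = df.assign(_n=genre_count).sort_values(['title', '_n'], ascending=[True, False])
    return ranked.drop_duplicates('title').drop(columns='_n').sort_index()

# Hypothetical mini-frame using rows copied from the duplicate-title output.
toy = pd.DataFrame({
    'movieId': [838, 26958, 32600, 147002],
    'title': ['Emma (1996)', 'Emma (1996)', 'Eros (2004)', 'Eros (2004)'],
    'genres': ['Comedy|Drama|Romance', 'Romance', 'Drama', 'Drama|Romance'],
}, index=[650, 5601, 5854, 9135])

deduped = keep_richest_genres(toy)
print(deduped)
```

On this sample the rows at index 5601 and 5854 are removed, which agrees with the manual drop list for those two titles.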

In [11]:
unique_movie_ids = movies['movieId'].nunique()
unique_titles = movies['title'].nunique()

if unique_movie_ids != unique_titles:
    print("The number of unique 'movieId' and 'title' values differs.")
    print(f"Unique 'movieId' count: {unique_movie_ids}")
    print(f"Unique 'title' count: {unique_titles}")
else:
    print("The number of unique 'movieId' and 'title' values is the same.")

The number of unique 'movieId' and 'title' values is the same.


The duplicate data problem is resolved.

# Data Preprocessing

In [12]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Another problem appears here: a film can have more than one genre, but the genres of each film are stored in a single string separated by '|'.

The next cell splits that string on '|' and then applies one-hot encoding so that each genre gets its own column.

In [13]:
movies['genres'] = movies['genres'].str.split('|')

one_hot_encoding = movies['genres'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='int')

movies = pd.concat([movies, one_hot_encoding], axis=1)

movies.drop('genres', axis=1, inplace=True)

movies

Unnamed: 0,movieId,title,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),1,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero (2017),0,1,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9739,193585,Flint (2017),0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
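
For comparison, pandas also provides `Series.str.get_dummies`, which splits on a separator and one-hot encodes in a single call. The sketch below is an alternative to the cell above, not the notebook's own method; it runs on a hypothetical three-row frame named `sample`, and the column order differs from the output above because `get_dummies` sorts the genre names alphabetically.

```python
import pandas as pd

# Hypothetical mini-frame using the first three rows of movies.csv.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Split each genres string on '|' and create one 0/1 column per genre.
genre_flags = sample['genres'].str.get_dummies(sep='|')

# Attach the indicator columns and drop the original string column.
encoded = pd.concat([sample.drop(columns='genres'), genre_flags], axis=1)
print(encoded)
```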


Display information about the movies dataset.

In [14]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9737 entries, 0 to 9741
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   movieId             9737 non-null   int64 
 1   title               9737 non-null   object
 2   Adventure           9737 non-null   int64 
 3   Animation           9737 non-null   int64 
 4   Children            9737 non-null   int64 
 5   Comedy              9737 non-null   int64 
 6   Fantasy             9737 non-null   int64 
 7   Romance             9737 non-null   int64 
 8   Drama               9737 non-null   int64 
 9   Action              9737 non-null   int64 
 10  Crime               9737 non-null   int64 
 11  Thriller            9737 non-null   int64 
 12  Horror              9737 non-null   int64 
 13  Mystery             9737 non-null   int64 
 14  Sci-Fi              9737 non-null   int64 
 15  War                 9737 non-null   int64 
 16  Musical             9737

There is a column named '(no genres listed)'. It needs to be inspected first to see what it represents.

In [15]:
movies[movies['(no genres listed)'] == 1]

Unnamed: 0,movieId,title,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
8517,114335,La cravate (1957),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8684,122888,Ben-hur (2016),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8687,122896,Pirates of the Caribbean: Dead Men Tell No Tal...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8782,129250,Superfast! (2015),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8836,132084,Let It Be Me (1995),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8902,134861,Trevor Noah: African American (2013),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9033,141131,Guardians (2016),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9053,141866,Green Room (2015),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9070,142456,The Brand New Testament (2015),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9091,143410,Hyena Road,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


The '(no genres listed)' column marks films whose genre is not described in this dataset; there are 34 such films. Because this recommender uses genre as the attribute for measuring similarity, these films are excluded as irrelevant to the system being built.

In [16]:
movies = movies[movies['(no genres listed)'] != 1]

The '(no genres listed)' column is dropped as well.

In [17]:
movies = movies.drop(columns='(no genres listed)')

Check the number of rows and columns again.

In [18]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9703 entries, 0 to 9741
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movieId      9703 non-null   int64 
 1   title        9703 non-null   object
 2   Adventure    9703 non-null   int64 
 3   Animation    9703 non-null   int64 
 4   Children     9703 non-null   int64 
 5   Comedy       9703 non-null   int64 
 6   Fantasy      9703 non-null   int64 
 7   Romance      9703 non-null   int64 
 8   Drama        9703 non-null   int64 
 9   Action       9703 non-null   int64 
 10  Crime        9703 non-null   int64 
 11  Thriller     9703 non-null   int64 
 12  Horror       9703 non-null   int64 
 13  Mystery      9703 non-null   int64 
 14  Sci-Fi       9703 non-null   int64 
 15  War          9703 non-null   int64 
 16  Musical      9703 non-null   int64 
 17  Documentary  9703 non-null   int64 
 18  IMAX         9703 non-null   int64 
 19  Western      9703 non-null 

After cleaning and preprocessing, the data contains 9703 films.

# Model Development with _Content-Based Filtering_

The model uses _Content-Based Filtering_, with genre as the attribute that measures the degree of similarity between films.

Make sure once more that no data is duplicated.

In [19]:
data = movies.drop_duplicates('movieId')
data

Unnamed: 0,movieId,title,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,...,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir
0,1,Toy Story (1995),1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),1,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero (2017),0,1,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9739,193585,Flint (2017),0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Display information about the data that will be used.

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9703 entries, 0 to 9741
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movieId      9703 non-null   int64 
 1   title        9703 non-null   object
 2   Adventure    9703 non-null   int64 
 3   Animation    9703 non-null   int64 
 4   Children     9703 non-null   int64 
 5   Comedy       9703 non-null   int64 
 6   Fantasy      9703 non-null   int64 
 7   Romance      9703 non-null   int64 
 8   Drama        9703 non-null   int64 
 9   Action       9703 non-null   int64 
 10  Crime        9703 non-null   int64 
 11  Thriller     9703 non-null   int64 
 12  Horror       9703 non-null   int64 
 13  Mystery      9703 non-null   int64 
 14  Sci-Fi       9703 non-null   int64 
 15  War          9703 non-null   int64 
 16  Musical      9703 non-null   int64 
 17  Documentary  9703 non-null   int64 
 18  IMAX         9703 non-null   int64 
 19  Western      9703 non-null 

Build a pivot table with 'movieId' as the index.

In [21]:
pivot_table = data.pivot_table(index="movieId", aggfunc="sum")
pivot_table.head()


movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Compute the degree of similarity by applying _Cosine Similarity_ to the pivot table.

In [22]:
cosine_sim = cosine_similarity(pivot_table)
cosine_sim.shape

(9703, 9703)
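
Cosine similarity divides the dot product of two vectors by the product of their lengths. For 0/1 genre vectors this reduces to the number of shared genres divided by the square root of the product of the two genre counts. As a small worked check in plain Python (the helper name `cosine` is only illustrative), Toy Story lists five genres and Jumanji lists three, all of which Toy Story shares, so the score should be 3 / sqrt(5 * 3), which is the Toy Story / Jumanji value that appears in the similarity matrix.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Genre order: Adventure, Animation, Children, Comedy, Fantasy
toy_story = [1, 1, 1, 1, 1]
jumanji = [1, 0, 1, 0, 1]

score = cosine(toy_story, jumanji)
print(round(score, 6))  # 3 / sqrt(15) -> 0.774597
```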

Build a dataframe from cosine_sim with film titles as both the rows and the columns.


In [23]:
# Build a dataframe from cosine_sim with film titles as both rows and columns
cosine_sim_df = pd.DataFrame(cosine_sim, index=data['title'], columns=data['title'])
print('Shape:', cosine_sim_df.shape)

cosine_sim_df

Shape: (9703, 9703)


title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.000000,0.774597,0.316228,0.258199,0.447214,0.000000,0.316228,0.632456,0.000000,0.258199,...,0.447214,0.316228,0.316228,0.447214,0.0,0.670820,0.774597,0.00000,0.316228,0.447214
Jumanji (1995),0.774597,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.816497,0.000000,0.333333,...,0.000000,0.000000,0.000000,0.000000,0.0,0.288675,0.333333,0.00000,0.000000,0.000000
Grumpier Old Men (1995),0.316228,0.000000,1.000000,0.816497,0.707107,0.000000,1.000000,0.000000,0.000000,0.000000,...,0.353553,0.000000,0.500000,0.000000,0.0,0.353553,0.408248,0.00000,0.000000,0.707107
Waiting to Exhale (1995),0.258199,0.000000,0.816497,1.000000,0.577350,0.000000,0.816497,0.000000,0.000000,0.000000,...,0.288675,0.408248,0.816497,0.000000,0.0,0.288675,0.333333,0.57735,0.000000,0.577350
Father of the Bride Part II (1995),0.447214,0.000000,0.707107,0.577350,1.000000,0.000000,0.707107,0.000000,0.000000,0.000000,...,0.500000,0.000000,0.707107,0.000000,0.0,0.500000,0.577350,0.00000,0.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Black Butler: Book of the Atlantic (2017),0.670820,0.288675,0.353553,0.288675,0.500000,0.288675,0.353553,0.000000,0.500000,0.288675,...,0.750000,0.353553,0.353553,0.500000,0.0,1.000000,0.866025,0.00000,0.707107,0.500000
No Game No Life: Zero (2017),0.774597,0.333333,0.408248,0.333333,0.577350,0.000000,0.408248,0.000000,0.000000,0.000000,...,0.577350,0.408248,0.408248,0.577350,0.0,0.866025,1.000000,0.00000,0.408248,0.577350
Flint (2017),0.000000,0.000000,0.000000,0.577350,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.707107,0.707107,0.000000,0.0,0.000000,0.000000,1.00000,0.000000,0.000000
Bungo Stray Dogs: Dead Apple (2018),0.316228,0.000000,0.000000,0.000000,0.000000,0.408248,0.000000,0.000000,0.707107,0.408248,...,0.707107,0.500000,0.000000,0.707107,0.0,0.707107,0.408248,0.00000,1.000000,0.000000


Define a function that returns film recommendations sorted by the highest degree of similarity.

In [24]:
def movie_recommendations(movie, similarity_data=cosine_sim_df, items=data[['title', 'movieId']], k=15):
    """
    Recommend films based on genre similarity.

    Parameters
    ---
    movie : str
        Title of the film to find recommendations for.
    similarity_data : pd.DataFrame
        Symmetric similarity dataframe with film titles as both
        index and columns.
    items : pd.DataFrame
        Titles and movieIds used to label the recommended films.
    k : int
        Number of recommendations to return.
    """

    # Use argpartition to find the positions of the largest similarity scores
    # without fully sorting the column (the dataframe column is converted to numpy).
    # range(start, stop, step) lists the last k+1 positions so that every entry
    # read by the slice below is in its final sorted place.
    index = similarity_data.loc[:,movie].to_numpy().argpartition(
        range(-1, -(k+2), -1))

    # Take the titles with the highest similarity from those positions
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # Drop the queried film so it does not appear in the recommendations
    closest = closest.drop(movie, errors='ignore')

    recommendations = pd.DataFrame(closest).merge(items).head(k)

    # Add a similarity score column
    recommendations['similarity_score'] = recommendations['title'].apply(lambda x: similarity_data.loc[movie, x])

    return recommendations
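
When `argpartition` is given a sequence of positions, only those positions are guaranteed to hold the indices a full sort would place there; everything outside them is known only to be no larger (or no smaller), in no particular order. The sketch below illustrates this on a made-up score array (`scores` and `k` are hypothetical values, not taken from the dataset).

```python
import numpy as np

# Made-up similarity scores for six items.
scores = np.array([0.2, 1.0, 0.5, 0.9, 0.1, 0.7])
k = 3

# Ask for the last k positions (-1, -2, -3) to be in their final sorted place.
order = scores.argpartition(range(-1, -(k + 1), -1))

# Read those k positions back to front: index of the largest score first.
top_k = order[-1:-(k + 1):-1]
print(top_k.tolist())  # [1, 3, 5]
```

The slice should read no more positions than were passed to `argpartition`; otherwise the extra entries are not guaranteed to be the next-largest scores.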

# Testing

In [25]:
movie_recommendations('Toy Story (1995)')

Unnamed: 0,title,movieId,similarity_score
0,"Tale of Despereaux, The (2008)",65577,1.0
1,Moana (2016),166461,1.0
2,Shrek the Third (2007),53121,1.0
3,Antz (1998),2294,1.0
4,"Wild, The (2006)",45074,1.0
5,"Monsters, Inc. (2001)",4886,1.0
6,Toy Story 2 (1999),3114,1.0
7,Turbo (2013),103755,1.0
8,"Adventures of Rocky and Bullwinkle, The (2000)",3754,1.0
9,The Good Dinosaur (2015),136016,1.0


The table above lists film recommendations for a user who likes Toy Story (1995), ranked by genre similarity. The function defaults to k=15, although only 10 rows appear in the output shown here.