Installing and importing the Pandas library

In [87]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


In [88]:
import pandas as pd

Reading the .csv file

In [89]:
movies = pd.read_csv("NetflixOriginals.csv", encoding="latin1")

Displaying the first few rows of the dataset

In [90]:
movies.head()

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi


Displaying dataset information (number of rows and columns, the data type of each column, and the number of non-null values in each column)

In [91]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


Counting the null values in each column

In [92]:
movies.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

Renaming the "IMDB Score" column to "Score"

In [93]:
movies.rename(columns={'IMDB Score' : 'Score'}, inplace=True)

Replacing every occurrence of "romantic" in the "Genre" column with "romance"

In [94]:
movies['Genre'] = movies['Genre'].str.replace('romantic', 'romance')
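
A small check of how this replacement behaves, as a sketch on made-up genre strings (the sample values are illustrative assumptions, not rows taken from the dataset). `Series.str.replace` matches case-sensitively by default, so a capitalized "Romantic" would be left untouched unless the pattern is adjusted.

```python
import pandas as pd

# Made-up genre strings for illustration only (not taken from the dataset).
sample_genres = pd.Series(["Romantic comedy", "Action/romantic drama", "Thriller"])

# regex=False treats the pattern as a literal substring.
replaced = sample_genres.str.replace("romantic", "romance", regex=False)

# The capitalized "Romantic" is unchanged because matching is case-sensitive.
print(replaced.tolist())
```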

Creating a "Genre + Language" column that joins the "Genre" and "Language" columns with a "/" separator

In [95]:
movies['Genre + Language'] = movies['Genre'] + "/" + movies['Language']
movies

Unnamed: 0,Title,Genre,Premiere,Runtime,Score,Language,Genre + Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese,Documentary/English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish,Thriller/Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian,Science fiction/Drama/Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English,Horror thriller/English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi,Mystery/Hindi
...,...,...,...,...,...,...,...
579,Taylor Swift: Reputation Stadium Tour,Concert Film,"December 31, 2018",125,8.4,English,Concert Film/English
580,Winter on Fire: Ukraine's Fight for Freedom,Documentary,"October 9, 2015",91,8.4,English/Ukranian/Russian,Documentary/English/Ukranian/Russian
581,Springsteen on Broadway,One-man show,"December 16, 2018",153,8.5,English,One-man show/English
582,Emicida: AmarElo - It's All For Yesterday,Documentary,"December 8, 2020",89,8.6,Portuguese,Documentary/Portuguese


Installing the Scikit-learn library

In [96]:
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable


Importing several libraries:
1. "CountVectorizer" to convert text into a vector representation
2. "cosine_similarity" to compute the similarity between vectors
3. "numpy" & "scipy.sparse" for numerical operations and matrix representation

In [97]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import scipy.sparse as sp

Defining the "custom_tokenizer" function used by the "CountVectorizer" for the "Genre + Language" column: it replaces "/" with a space, then splits the text into separate tokens

In [98]:
def custom_tokenizer(text):
    return text.replace("/", " ").split()
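
To make the tokenizer's behavior concrete, here is a minimal sketch that applies the same function to one value of the "Genre + Language" column shown earlier. Note that a multi-word genre such as "Science fiction" ends up as two separate tokens.

```python
def custom_tokenizer(text):
    # Replace "/" with a space, then split on whitespace.
    return text.replace("/", " ").split()

tokens = custom_tokenizer("Science fiction/Drama/Italian")
print(tokens)  # ['Science', 'fiction', 'Drama', 'Italian']
```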

Converting the text in the "Genre + Language" column into a numeric representation (a matrix of token counts)

In [99]:
genre_language_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
genre_language_matrix = genre_language_vectorizer.fit_transform(movies['Genre + Language'])
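
As a rough illustration of what the resulting matrix holds, the sketch below fits a `CountVectorizer` on two sample strings in the same "Genre/Language" format. The strings and variable names are illustrative assumptions. `token_pattern=None` is passed only to signal that the default pattern is unused when a custom tokenizer is supplied. Because `CountVectorizer` lowercases text by default, the learned vocabulary is lowercase.

```python
from sklearn.feature_extraction.text import CountVectorizer

def custom_tokenizer(text):
    # Same tokenizer as above: "/" becomes a space, then split on whitespace.
    return text.replace("/", " ").split()

# Illustrative sample strings (assumed for this sketch).
sample_docs = ["Crime drama/English", "Horror/Crime drama/English"]

vectorizer = CountVectorizer(tokenizer=custom_tokenizer, token_pattern=None)
matrix = vectorizer.fit_transform(sample_docs)

# Vocabulary terms ordered by their column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)             # ['crime', 'drama', 'english', 'horror']
print(matrix.toarray())  # one row per string, one column per term
```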



Redefining the "custom_tokenizer" function for the "CountVectorizer" of the "Genre" column: it replaces "/" and "-" with spaces, then splits the text into separate tokens. This definition overrides the one from the earlier cell.

In [100]:
def custom_tokenizer(text):
    return text.replace("/", " ").replace("-", " ").split()

Converting the text in the "Genre" column into a numeric representation (a matrix of token counts)

In [101]:
genre_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
genre_matrix = genre_vectorizer.fit_transform(movies['Genre'])

Combining the two numeric matrices built from the 'Genre + Language' and 'Genre' columns into a single matrix

In [113]:
combined_matrix = sp.hstack([genre_language_matrix, genre_matrix])
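
A minimal sketch of what the horizontal stack does, using two tiny made-up count matrices: the result keeps the same rows and places the columns side by side. One consequence worth noting is that genre tokens appear in both blocks, so they weigh more than language tokens in the combined vectors.

```python
import numpy as np
import scipy.sparse as sp

# Made-up counts for two films; columns: crime, drama, english
genre_language_block = sp.csr_matrix(np.array([[1, 1, 1], [0, 1, 1]]))
# Made-up counts for the same two films; columns: crime, drama
genre_block = sp.csr_matrix(np.array([[1, 1], [0, 1]]))

# Same number of rows, columns concatenated left to right.
combined = sp.hstack([genre_language_block, genre_block])
print(combined.shape)  # (2, 5)
print(combined.toarray())
```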

Computing the similarity matrix, in which each entry is the cosine similarity score between two films

In [114]:
similarity_matrix = cosine_similarity(combined_matrix)
similarity_matrix

array([[1.        , 0.        , 0.        , ..., 0.20412415, 0.57735027,
        0.8660254 ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.20412415, 0.        , 0.        , ..., 1.        , 0.        ,
        0.23570226],
       [0.57735027, 0.        , 0.        , ..., 0.        , 1.        ,
        0.66666667],
       [0.8660254 , 0.        , 0.        , ..., 0.23570226, 0.66666667,
        1.        ]])
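
Cosine similarity between two count vectors is their dot product divided by the product of their lengths. A minimal pure-Python sketch on two made-up vectors follows; the helper name `cosine` is hypothetical and this is not the scikit-learn implementation.

```python
import math

def cosine(u, v):
    # Assumes both vectors are non-zero; a zero vector would divide by zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 1, 1, 0]  # made-up token counts for one film
b = [1, 1, 1, 1]  # made-up token counts for another film

print(round(cosine(a, b), 7))  # 0.8660254
```

Floating-point rounding in this kind of calculation is also why identical vectors can score 0.9999999999999999 rather than exactly 1.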

Looking up the index of the film for which recommendations are requested, by its title

In [115]:
film_req = movies[movies['Title']=='The Irishman'].index[0]
film_req

561

Sorting the recommendation list in descending order of similarity score, with ties broken by the IMDb score from the "Score" column

In [116]:
recommendations = sorted(list(enumerate(similarity_matrix[film_req])), reverse=True, key=lambda vector: (vector[1], movies.iloc[vector[0]]['Score']))
recommendations = [(idx, score) for idx, score in recommendations if idx != film_req]
recommendations

[(500, 0.9999999999999999),
 (429, 0.9999999999999999),
 (230, 0.9999999999999999),
 (116, 0.9999999999999999),
 (287, 0.9128709291752769),
 (262, 0.8451542547285165),
 (508, 0.7999999999999999),
 (491, 0.7999999999999999),
 (461, 0.7999999999999999),
 (361, 0.7999999999999999),
 (233, 0.7999999999999999),
 (565, 0.7745966692414834),
 (562, 0.7745966692414834),
 (549, 0.7745966692414834),
 (488, 0.7745966692414834),
 (454, 0.7745966692414834),
 (470, 0.7745966692414834),
 (475, 0.7745966692414834),
 (477, 0.7745966692414834),
 (438, 0.7745966692414834),
 (418, 0.7745966692414834),
 (389, 0.7745966692414834),
 (406, 0.7745966692414834),
 (409, 0.7745966692414834),
 (411, 0.7745966692414834),
 (369, 0.7745966692414834),
 (374, 0.7745966692414834),
 (376, 0.7745966692414834),
 (381, 0.7745966692414834),
 (322, 0.7745966692414834),
 (336, 0.7745966692414834),
 (299, 0.7745966692414834),
 (317, 0.7745966692414834),
 (269, 0.7745966692414834),
 (291, 0.7745966692414834),
 (247, 0.77459666924
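
The sort key in this step is a tuple, so Python compares the similarity first and falls back to the score only when similarities tie. A small sketch with made-up (index, similarity) pairs and a made-up score lookup:

```python
# Made-up data: (film index, similarity) pairs and a score per film index.
pairs = [(0, 0.8), (1, 1.0), (2, 0.8), (3, 1.0)]
scores = {0: 6.1, 1: 5.5, 2: 7.3, 3: 6.9}

# Sort by similarity first, then by score, both in descending order.
ranked = sorted(pairs, key=lambda p: (p[1], scores[p[0]]), reverse=True)
print(ranked)  # [(3, 1.0), (1, 1.0), (2, 0.8), (0, 0.8)]
```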

Displaying the top 10 recommended films

In [None]:
for i in recommendations[0:10]:
    similarity_score = i[1]
    film_index = i[0]
    if similarity_score > 0:
        print(movies.iloc[film_index].Title)

Displaying the full details of all recommended films

In [124]:
recommendations = sorted(list(enumerate(similarity_matrix[film_req])), reverse=True, key=lambda vector: (vector[1], movies.iloc[vector[0]]['Score']))
for film_index, similarity_score in recommendations:
    if similarity_score > 0 and film_index != film_req:
        title = movies.iloc[film_index]['Title']
        imdb_score = movies.iloc[film_index]['Score']
        genre = movies.iloc[film_index]['Genre']
        print(f"Title: {title}, Similarity Score: {similarity_score}, Genre: {genre} , IMDb Score: {imdb_score}")

Title: El Camino: A Breaking Bad Movie, Similarity Score: 0.9999999999999999, Genre: Crime drama , IMDb Score: 7.3
Title: The Highwaymen, Similarity Score: 0.9999999999999999, Genre: Crime drama , IMDb Score: 6.9
Title: Lost Girls, Similarity Score: 0.9999999999999999, Genre: Crime drama , IMDb Score: 6.1
Title: Òlòt?ré, Similarity Score: 0.9999999999999999, Genre: Crime drama , IMDb Score: 5.5
Title: The Outsider, Similarity Score: 0.9128709291752769, Genre: Crime drama , IMDb Score: 6.3
Title: 1922, Similarity Score: 0.8451542547285165, Genre: Horror/Crime drama , IMDb Score: 6.3
Title: On My Skin, Similarity Score: 0.7999999999999999, Genre: Crime drama , IMDb Score: 7.3
Title: Soni, Similarity Score: 0.7999999999999999, Genre: Crime drama , IMDb Score: 7.2
Title: Ferry, Similarity Score: 0.7999999999999999, Genre: Crime drama , IMDb Score: 7.1
Title: The Crimes That Bind, Similarity Score: 0.7999999999999999, Genre: Crime drama , IMDb Score: 6.6
Title: Rogue City, Similarity Score:

Creating the recommend() function

In [107]:
def recommend(movie):
    film_req = movies[movies['Title']==movie].index[0]
    recommendations = sorted(list(enumerate(similarity_matrix[film_req])), reverse=True, key=lambda vector: (vector[1], movies.iloc[vector[0]]['Score']))
    recommendations = [(idx, score) for idx, score in recommendations if idx != film_req]
    for i in recommendations[0:10]:
        similarity_score = i[1]
        film_index = i[0]
        if similarity_score > 0:
            print(movies.iloc[film_index].Title)
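
A variant that returns the titles instead of printing them can be easier to reuse, for example in an app. The sketch below is one possible shape, not the notebook's function: the name `recommend_titles`, its parameters, and the toy data are assumptions, and it assumes the DataFrame has the default 0..n-1 index so that index labels match row positions in the similarity matrix. It returns an empty list when the title is not found, whereas `.index[0]` raises an `IndexError` in that case.

```python
import numpy as np
import pandas as pd

def recommend_titles(title, movies, similarity_matrix, top_n=10):
    # Hypothetical helper: find the requested film, or give up gracefully.
    matches = movies.index[movies["Title"] == title]
    if len(matches) == 0:
        return []
    film_req = matches[0]
    ranked = (
        movies.assign(Similarity=similarity_matrix[film_req])
        .drop(index=film_req)          # exclude the requested film itself
        .query("Similarity > 0")       # keep only films with some overlap
        .sort_values(["Similarity", "Score"], ascending=False)
    )
    return ranked["Title"].head(top_n).tolist()

# Toy data, made up for this sketch.
toy_movies = pd.DataFrame({"Title": ["A", "B", "C", "D"],
                           "Score": [7.0, 6.0, 8.0, 5.0]})
toy_similarity = np.array([[1.0, 0.5, 0.5, 0.0],
                           [0.5, 1.0, 0.2, 0.1],
                           [0.5, 0.2, 1.0, 0.3],
                           [0.0, 0.1, 0.3, 1.0]])

print(recommend_titles("A", toy_movies, toy_similarity))  # ['C', 'B']
print(recommend_titles("Z", toy_movies, toy_similarity))  # []
```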

Calling the recommend() function

In [108]:
recommend("The Irishman")

El Camino: A Breaking Bad Movie
The Highwaymen
Lost Girls
Òlòt?ré
The Outsider
1922
On My Skin
Soni
Ferry
The Crimes That Bind


Importing the pickle library to serialize Python objects into a format that can be saved to disk

In [109]:
import pickle

Saving movies and similarity_matrix to .pkl files

In [110]:
with open('movies_list.pkl', 'wb') as f:  # the file is closed automatically
    pickle.dump(movies, f)

In [111]:
with open('similarity.pkl', 'wb') as f:  # the file is closed automatically
    pickle.dump(similarity_matrix, f)

Loading the data saved in the .pkl file

In [112]:
pickle.load(open('movies_list.pkl','rb'))

Unnamed: 0,Title,Genre,Premiere,Runtime,Score,Language,Genre + Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese,Documentary/English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish,Thriller/Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian,Science fiction/Drama/Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English,Horror thriller/English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi,Mystery/Hindi
...,...,...,...,...,...,...,...
579,Taylor Swift: Reputation Stadium Tour,Concert Film,"December 31, 2018",125,8.4,English,Concert Film/English
580,Winter on Fire: Ukraine's Fight for Freedom,Documentary,"October 9, 2015",91,8.4,English/Ukranian/Russian,Documentary/English/Ukranian/Russian
581,Springsteen on Broadway,One-man show,"December 16, 2018",153,8.5,English,One-man show/English
582,Emicida: AmarElo - It's All For Yesterday,Documentary,"December 8, 2020",89,8.6,Portuguese,Documentary/Portuguese
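
A quick way to confirm that a saved object survives the round trip is to dump it, load it back, and compare. The sketch below does this with a small made-up dictionary in a temporary directory; the file name and data are assumptions for illustration. Since unpickling can execute arbitrary code, `pickle.load` should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

# Made-up object standing in for the saved data.
data = {"titles": ["Film A", "Film B"], "similarity": [[1.0, 0.5], [0.5, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    # Context managers close the file handles even if an error occurs.
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
```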


Streamlit link: https://recommendflix.streamlit.app