<center>

## ***Company-based Capstone Project***
#### **Team ID		: C24-MR01**

Team Member	: 

(ML) M002D4KY2877 - Auvarifqi Putra Diandra

(ML) M010D4KY3370 - Rafi Madani

(ML) M002D4KY2625 - Iskandar Muda Rizky Parlambang

(MD) A010D4KY4202 - Muhammad Adryan Haska Putra

(MD) A297D4KX4551 - Vena Feranica

(CC) C002D4KY1032 - Muhammad Naufal

(CC) C459D4KY0090 - Jamaludin Ahmad Rifai



---

## Import Library

In [2]:
import pandas as pd
import numpy as np
import urllib.request
import zipfile
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from numpy import where
import os

In [34]:
import warnings
warnings.filterwarnings("ignore")

## Functions

In [35]:
# mengecek nilai null
def check_null(df):
    col_na = df.isnull().sum().sort_values(ascending=True)
    percent = col_na / len(df)
    missing_data = pd.concat([col_na, percent], axis=1, keys=['Total', 'Percent'])

    if (missing_data[missing_data['Total'] > 0].shape[0] == 0):
        print("Tidak ditemukan missing value pada dataset")
    else:
        print(missing_data[missing_data['Total'] > 0])

In [36]:
# cek outlier di tiap fitur
def check_outlier(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)

    # Menghitung RUB dan RLB.
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5*IQR
    upper_limit = Q3 + 1.5*IQR

    # Menampilkan banyaknya outlier pada atribut.
    outliers = (df < lower_limit) | (df > upper_limit)
    print ("Outlier pada tiap atribut:")
    print(outliers.sum())

    return outliers

In [37]:
# handle outlier di suatu column 
def outlier_handling(column, df):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    median = df.loc[(df[column] >= lower_range) | (df[column] <= upper_range), column].median()
    df[column] = np.where((df[column] < lower_range) | (df[column] > upper_range), median, df[column])

## Data Preprocessing and Import And Read Dataset

In [38]:
# Get the current directory of the scrip

data_url_1 = 'https://drive.usercontent.google.com/download?id=1rACBSh5FWqP5S_xMn3Ty382BSjGZC6U0&export=download&authuser=0&confirm=t&uuid=ee5921d6-dc36-4593-8662-f5e7490f590f&at=APZUnTXz447GE_ox2yw3NvJM1NLN%3A1717769617938'
# Get the current working directory
current_dir = os.getcwd()

data_path = os.path.join(current_dir, 'data')
print(data_path)

# Create the target directory if it doesn't exist
os.makedirs(data_path, exist_ok=True)

# Define the full path for the downloaded file
target_file_path = os.path.join(data_path, '10000-movie.csv')

if not os.path.exists(target_file_path):
    # Download the file to the target directory
    urllib.request.urlretrieve(data_url_1, target_file_path)
    print(f'File downloaded and saved to {target_file_path}')
else:
    print(f'File already exists at {target_file_path}, skipping download.')


/Users/rafimadani/Documents/nbs-ml/data
File already exists at /Users/rafimadani/Documents/nbs-ml/data/10000-movie.csv, skipping download.


In [39]:
movies  = pd.read_csv(target_file_path)

movies = movies.iloc[:, 0:7]
movies = movies.dropna(subset=['cast', 'crew'])
movies.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7528 entries, 0 to 7531
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7528 non-null   object
 1   title     7528 non-null   object
 2   overview  7480 non-null   object
 3   genres    7528 non-null   object
 4   keywords  7528 non-null   object
 5   cast      7528 non-null   object
 6   crew      7528 non-null   object
dtypes: object(7)
memory usage: 470.5+ KB


## 1. Recommender System Based On Overview/Synopsys

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
movies['overview'] = movies['overview'].fillna('')
tfidf_matrix = tfidf.fit_transform(movies['overview'])
tfidf_matrix.shape

(7528, 24202)

In [41]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [42]:
indices = pd.Series(movies.index, index=movies['id'])

In [43]:
def get_recommendations(movie_id, cosine_sim=cosine_sim):
    try:
        id_to_find = movie_id
        title = movies.loc[movies['id'] == id_to_find, 'title'].values[0]

        # Get indices corresponding to the title
        idx = indices[movie_id]
        
        # Convert idx to a list if it's not already
        if not isinstance(idx, list):
            idx = [idx]

        sim_scores = []
        for index in idx:
            # Retrieve cosine similarities for the current index
            cosine_sims = cosine_sim[index]
            
            # Extend sim_scores with the enumerated cosine similarities
        sim_scores.extend(list(enumerate(cosine_sims)))

        # Sort the sim_scores list based on similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # Retrieve top 5 similar movies
        sim_scores = sim_scores[1:6]
        # Extract movie indices from sim_scores
        movie_indices = [i[0] for i in sim_scores]

        return movies['id'].iloc[movie_indices]
    except IndexError as e:
    # Handle the error
        return("Movie ID is not found in dataset")


In [32]:
movie_id ="823464"
similar_movie_ids = get_recommendations(movie_id)

if isinstance(similar_movie_ids, str):
    print(similar_movie_ids)  # Print error message if the provided ID is not found
else:
    title = movies.loc[movies['id'] == movie_id, 'title'].values[0]
    print(f"Movies similar to '{title}' (ID: {movie_id}):")
    for movie_id in similar_movie_ids:
        movie_title = movies.loc[movies['id'] == movie_id, 'title'].values[0]
        print("- {}".format(movie_title))

KeyError: '823464'

In [48]:
movies

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,653346,Kingdom of the Planet of the Apes,Several generations in the future following Ca...,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...","[{'id': 11195, 'name': 'empire'}, {'id': 4152,...","[{'cast_id': 9, 'character': 'Noa', 'credit_id...","[{'credit_id': '5de6f63611386c001354710d', 'de..."
1,929590,Civil War,"In the near future, a group of war journalists...","[{'id': 10752, 'name': 'War'}, {'id': 28, 'nam...","[{'id': 1589, 'name': 'sniper'}, {'id': 242, '...","[{'cast_id': 228, 'character': 'Lee', 'credit_...","[{'credit_id': '61eb0f9b31644b0059dd097a', 'de..."
2,823464,Godzilla x Kong: The New Empire,"Following their explosive showdown, Godzilla a...","[{'id': 878, 'name': 'Science Fiction'}, {'id'...","[{'id': 11100, 'name': 'giant monster'}, {'id'...","[{'cast_id': 10, 'character': 'Dr. Ilene Andre...","[{'credit_id': '608879ed66e4690040e33c01', 'de..."
3,719221,Tarot,When a group of friends recklessly violate the...,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...","[{'id': 2591, 'name': 'tarot cards'}, {'id': 1...","[{'cast_id': 14, 'character': 'Haley', 'credit...","[{'credit_id': '5ef61aec13af5f0035502b54', 'de..."
4,614933,Atlas,A brilliant counterterrorism analyst with a de...,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...","[{'id': 10364, 'name': 'mission'}, {'id': 310,...","[{'cast_id': 10, 'character': 'Atlas Shepherd'...","[{'credit_id': '5d2895dcbe4b3632c49d840d', 'de..."
...,...,...,...,...,...,...,...
7527,943397,Shooting My Life's Script,Everything changes in Fani's life when the opp...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 321218, 'name': 'book adaptation'}]","[{'cast_id': 1, 'character': 'Fani', 'credit_i...","[{'credit_id': '6218c84319ab59004294e554', 'de..."
7528,245627,Abattoir,A reporter unearths an urban legend about a ho...,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...","[{'id': 1526, 'name': 'home'}, {'id': 3358, 'n...","[{'cast_id': 4, 'character': 'Julia', 'credit_...","[{'credit_id': '52fe4f09c3a36847f82b8b59', 'de..."
7529,296626,Finders Keepers,A haunted doll teaches one little girl why chi...,"[{'id': 9648, 'name': 'Mystery'}, {'id': 53, '...","[{'id': 9712, 'name': 'possession'}, {'id': 11...","[{'cast_id': 1, 'character': 'Alyson Simon', '...","[{'credit_id': '54433346c3a3683e0e0039dd', 'de..."
7530,338,"Good Bye, Lenin!",Alex Kerner's mother was in a coma while the B...,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 1157, 'name': 'husband wife relationsh...","[{'cast_id': 4, 'character': 'Alex', 'credit_i...","[{'credit_id': '52fe4239c3a36847f800d99b', 'de..."


## 2. Recommender System Based On Cast and Genre

In [44]:
from ast import literal_eval
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(literal_eval)

In [45]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [46]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

In [47]:
movies['director'] = movies['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)

In [48]:
movies

Unnamed: 0,id,title,overview,genres,keywords,cast,crew,director
0,653346,Kingdom of the Planet of the Apes,Several generations in the future following Ca...,"[Science Fiction, Adventure, Action]","[empire, kingdom, gorilla]","[Owen Teague, Freya Allan, Kevin Durand]","[{'credit_id': '5de6f63611386c001354710d', 'de...",Wes Ball
1,929590,Civil War,"In the near future, a group of war journalists...","[War, Action, Drama]","[sniper, new york city, race against time]","[Kirsten Dunst, Wagner Moura, Cailee Spaeny]","[{'credit_id': '61eb0f9b31644b0059dd097a', 'de...",Alex Garland
2,823464,Godzilla x Kong: The New Empire,"Following their explosive showdown, Godzilla a...","[Science Fiction, Action, Adventure]","[giant monster, sequel, dinosaur]","[Rebecca Hall, Brian Tyree Henry, Dan Stevens]","[{'credit_id': '608879ed66e4690040e33c01', 'de...",Adam Wingard
3,719221,Tarot,When a group of friends recklessly violate the...,"[Horror, Thriller]","[tarot cards, fate, slasher]","[Harriet Slater, Adain Bradley, Avantika]","[{'credit_id': '5ef61aec13af5f0035502b54', 'de...",Spenser Cohen
4,614933,Atlas,A brilliant counterterrorism analyst with a de...,"[Science Fiction, Action]","[mission, artificial intelligence (a.i.), expl...","[Jennifer Lopez, Simu Liu, Sterling K. Brown]","[{'credit_id': '5d2895dcbe4b3632c49d840d', 'de...",Brad Peyton
...,...,...,...,...,...,...,...,...
7527,943397,Shooting My Life's Script,Everything changes in Fani's life when the opp...,"[Romance, Comedy]",[book adaptation],"[Bela Fernandes, Xande Valois, Alanys Santos]","[{'credit_id': '6218c84319ab59004294e554', 'de...",Pedro Antônio
7528,245627,Abattoir,A reporter unearths an urban legend about a ho...,"[Horror, Thriller]","[home, haunted house, based on comic]","[Jessica Lowndes, Joe Anderson, Lin Shaye]","[{'credit_id': '52fe4f09c3a36847f82b8b59', 'de...",Darren Lynn Bousman
7529,296626,Finders Keepers,A haunted doll teaches one little girl why chi...,"[Mystery, Thriller, Horror]","[possession, profession, evil doll]","[Jaime Pressly, Kylie Rogers, Tobin Bell]","[{'credit_id': '54433346c3a3683e0e0039dd', 'de...",Alexander Yellen
7530,338,"Good Bye, Lenin!",Alex Kerner's mother was in a coma while the B...,"[Comedy, Drama]","[husband wife relationship, coma, bureaucracy]","[Daniel Brühl, Katrin Sass, Chulpan Khamatova]","[{'credit_id': '52fe4239c3a36847f800d99b', 'de...",Wolfgang Becker


In [49]:
#clean data
movies = movies.drop(columns=['crew'])
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [50]:
#apply clean data
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(clean_data)

In [51]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

movies['soup'] = movies.apply(create_soup, axis=1)

In [52]:
print(movies['soup'])

0       empire kingdom gorilla owenteague freyaallan k...
1       sniper newyorkcity raceagainsttime kirstenduns...
2       giantmonster sequel dinosaur rebeccahall brian...
3       tarotcards fate slasher harrietslater adainbra...
4       mission artificialintelligence(a.i.) explorati...
                              ...                        
7527    bookadaptation belafernandes xandevalois alany...
7528    home hauntedhouse basedoncomic jessicalowndes ...
7529    possession profession evildoll jaimepressly ky...
7530    husbandwiferelationship coma bureaucracy danie...
7531     aziacosta jaclynjose monconfiado macalejandre...
Name: soup, Length: 7528, dtype: object


In [53]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['soup'])

In [54]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [55]:
#reset index
movies = movies.reset_index()
indices = pd.Series(movies.index, index=movies['id'])

In [28]:
get_recommendations('Avatar', cosine_sim2)

'Movie ID is not found in dataset'

In [57]:
movie_id ="823464"
similar_movie_ids = get_recommendations(movie_id,cosine_sim2)

if isinstance(similar_movie_ids, str):
    print(similar_movie_ids)  # Print error message if the provided ID is not found
else:
    title = movies.loc[movies['id'] == movie_id, 'title'].values[0]
    print(f"Movies similar to '{title}' (ID: {movie_id}):")
    for movie_id in similar_movie_ids:
        movie_title = movies.loc[movies['id'] == movie_id, 'title'].values[0]
        print("- {}".format(movie_title))

Movies similar to 'Godzilla x Kong: The New Empire' (ID: 823464):
- Jurassic World Dominion
- Ape vs Mecha Ape
- Thor: Ragnarok
- Guardians of the Galaxy Vol. 2
- Transformers: Age of Extinction


In [25]:
movies

Unnamed: 0,index,id,title,overview,genres,keywords,cast,director,soup
0,0,653346,Kingdom of the Planet of the Apes,Several generations in the future following Ca...,"[sciencefiction, adventure, action]","[empire, kingdom, gorilla]","[owenteague, freyaallan, kevindurand]",wesball,empire kingdom gorilla owenteague freyaallan k...
1,1,929590,Civil War,"In the near future, a group of war journalists...","[war, action, drama]","[sniper, newyorkcity, raceagainsttime]","[kirstendunst, wagnermoura, caileespaeny]",alexgarland,sniper newyorkcity raceagainsttime kirstenduns...
2,2,823464,Godzilla x Kong: The New Empire,"Following their explosive showdown, Godzilla a...","[sciencefiction, action, adventure]","[giantmonster, sequel, dinosaur]","[rebeccahall, briantyreehenry, danstevens]",adamwingard,giantmonster sequel dinosaur rebeccahall brian...
3,3,719221,Tarot,When a group of friends recklessly violate the...,"[horror, thriller]","[tarotcards, fate, slasher]","[harrietslater, adainbradley, avantika]",spensercohen,tarotcards fate slasher harrietslater adainbra...
4,4,614933,Atlas,A brilliant counterterrorism analyst with a de...,"[sciencefiction, action]","[mission, artificialintelligence(a.i.), explor...","[jenniferlopez, simuliu, sterlingk.brown]",bradpeyton,mission artificialintelligence(a.i.) explorati...
...,...,...,...,...,...,...,...,...,...
7523,7527,943397,Shooting My Life's Script,Everything changes in Fani's life when the opp...,"[romance, comedy]",[bookadaptation],"[belafernandes, xandevalois, alanyssantos]",pedroantônio,bookadaptation belafernandes xandevalois alany...
7524,7528,245627,Abattoir,A reporter unearths an urban legend about a ho...,"[horror, thriller]","[home, hauntedhouse, basedoncomic]","[jessicalowndes, joeanderson, linshaye]",darrenlynnbousman,home hauntedhouse basedoncomic jessicalowndes ...
7525,7529,296626,Finders Keepers,A haunted doll teaches one little girl why chi...,"[mystery, thriller, horror]","[possession, profession, evildoll]","[jaimepressly, kylierogers, tobinbell]",alexanderyellen,possession profession evildoll jaimepressly ky...
7526,7530,338,"Good Bye, Lenin!",Alex Kerner's mother was in a coma while the B...,"[comedy, drama]","[husbandwiferelationship, coma, bureaucracy]","[danielbrühl, katrinsass, chulpankhamatova]",wolfgangbecker,husbandwiferelationship coma bureaucracy danie...
