<a href="https://colab.research.google.com/github/Robby-Akbar/ProjectNLP/blob/main/colab/recommended_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Movie Recommendation with TF-IDF

In [11]:
import pandas as pd
import ast
import numpy as np

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

##Prepare Data

In [2]:
#Load data from link
url = 'https://raw.githubusercontent.com/Robby-Akbar/ProjectNLP/main/output/data/'
dataset = pd.read_csv(url+"dataset_mod.csv")

In [3]:
# Parse the stringified list columns back into Python lists
for column in ['genres', 'keywords', 'cast']:
    dataset[column] = dataset[column].apply(ast.literal_eval)
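
The parsing step is needed because `read_csv` loads list-valued columns as plain strings. A minimal sketch of the effect, using a hypothetical two-row frame (`toy` is not part of the notebook):

```python
import ast

import pandas as pd

# Hypothetical frame that mimics how a list column looks right after read_csv.
toy = pd.DataFrame({"genres": ["['Adventure', 'Fantasy']", "['Comedy']"]})

# Before parsing, every cell is a plain string.
print(type(toy["genres"][0]).__name__)  # str

# literal_eval only evaluates Python literals, unlike eval().
toy["genres"] = toy["genres"].apply(ast.literal_eval)

print(toy["genres"][0])       # ['Adventure', 'Fantasy']
print(len(toy["genres"][1]))  # 1
```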

In [4]:
dataset.head()

Unnamed: 0,genres,id,original_title,overview,tagline,keywords,cast,director
0,"[Adventure, Fantasy, Family]",8844,Jumanji,siblings judy peter discover enchanted board g...,roll the dice and unleash the excitement!,"[jealousy, toy, boy, friendship, friends, riva...","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",John Lasseter
1,"[Romance, Comedy]",15602,Grumpier Old Men,family wedding reignites ancient feud nextdoor...,still yelling. still fighting. still ready for...,"[board game, disappearance, based on children'...","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Joe Johnston
2,"[Comedy, Drama, Romance]",31357,Waiting to Exhale,"cheated on, mistreated stepped on, women holdi...",friends are the people who let you be yourself...,"[fishing, best friend, duringcreditsstinger, o...","[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Howard Deutch
3,[Comedy],11862,Father of the Bride Part II,"george banks recovered daughter's wedding, rec...",just when his world is back to normal... he is...,"[based on novel, interracial relationship, sin...","[Whitney Houston, Angela Bassett, Loretta Devi...",Forest Whitaker
4,"[Action, Crime, Drama, Thriller]",949,Heat,"obsessive master thief, neil mccauley leads to...",a los angeles crime saga,"[baby, midlife crisis, confidence, aging, daug...","[Steve Martin, Diane Keaton, Martin Short, Kim...",Charles Shyer


In [5]:
# Drop any rows that contain NaN, then confirm that none remain
dataset.dropna(inplace=True)
dataset.isnull().sum()

genres            0
id                0
original_title    0
overview          0
tagline           0
keywords          0
cast              0
director          0
dtype: int64

In [6]:
# Split each overview and tagline into a list of words
dataset['overview'] = dataset['overview'].apply(lambda x: x.split())
dataset['tagline'] = dataset['tagline'].apply(lambda x: x.split())

In [7]:
# Combine all list-valued features into one token list per movie
dataset['features'] = dataset['overview'] + dataset['genres'] + dataset['tagline'] + dataset['keywords'] + dataset['cast']
# Join the tokens into a single space-separated string
dataset['features'] = dataset['features'].apply(lambda x: " ".join(x))
# director is a plain string, so append it after joining
dataset['features'] = dataset['features'] + ' ' + dataset['director']
dataset['features'].head()

0    siblings judy peter discover enchanted board g...
1    family wedding reignites ancient feud nextdoor...
2    cheated on, mistreated stepped on, women holdi...
3    george banks recovered daughter's wedding, rec...
4    obsessive master thief, neil mccauley leads to...
Name: features, dtype: object

In [8]:
print(dataset['features'][0])

siblings judy peter discover enchanted board game opens door magical world, unwittingly invite alan adult trapped inside game 26 years living room. alan's hope freedom finish game, proves risky three find running giant rhinoceroses, evil monkeys terrifying creatures. Adventure Fantasy Family roll the dice and unleash the excitement! jealousy toy boy friendship friends rivalry boy next door new toy toy comes to life Tom Hanks Tim Allen Don Rickles Jim Varney Wallace Shawn John Ratzenberger Annie Potts John Morris Erik von Detten Laurie Metcalf R. Lee Ermey Sarah Freeman Penn Jillette John Lasseter


##Setting Up TF-IDF

In [9]:
# Vectorize the documents with TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    min_df=5, max_features=16000, strip_accents='unicode', lowercase=True,
    analyzer='word', token_pattern=r'\w+', ngram_range=(1, 3), max_df=0.7, use_idf=True,
    smooth_idf=True, sublinear_tf=True, stop_words='english'
)

# Compute the feature matrix
tf_idf_matrix = tfidf_vectorizer.fit_transform(dataset['features'])
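
To make the vectorizer settings easier to inspect, here is a small sketch on a hypothetical three-document corpus. It keeps `token_pattern`, `stop_words`, and `sublinear_tf` from the cell above but drops `min_df=5` and `max_df=0.7`, which would filter out almost everything in a corpus this small, and it uses bigrams instead of trigrams. The names `toy_docs`, `toy_vectorizer`, and `toy_matrix` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus; each string plays the role of one 'features' row.
toy_docs = [
    "a toy comes to life",
    "a boy and his toy",
    "crime saga in los angeles",
]

toy_vectorizer = TfidfVectorizer(
    lowercase=True, token_pattern=r"\w+", ngram_range=(1, 2),
    stop_words="english", sublinear_tf=True,
)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per unigram or bigram kept in the vocabulary.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# Each row is L2-normalized (the default norm='l2'), so its length is 1.
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(np.allclose(row_norms, 1.0))
```

Printing the vocabulary shows which n-grams survive stop-word removal, which is a quick way to sanity-check a `token_pattern` before fitting on the full dataset.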

In [21]:
import pickle

# Save the TF-IDF matrix; the with-block closes the file handle
with open("tf_idf_matrix.p", "wb") as f:
    pickle.dump(tf_idf_matrix, f)
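
The saved matrix can be read back with `pickle.load`. The sketch below round-trips a hypothetical sparse matrix through a temporary file rather than touching `tf_idf_matrix.p`. Note that the matrix alone is not enough to vectorize new text later; for that, the fitted `tfidf_vectorizer` would presumably need to be saved as well.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for tf_idf_matrix.
toy_matrix = sparse.csr_matrix(np.array([[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_matrix.p")
    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The round trip preserves both shape and values.
print(restored.shape == toy_matrix.shape)
print((restored != toy_matrix).nnz == 0)
```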

In [10]:
# Check the shape of the matrix
tf_idf_matrix.shape

(19943, 16000)

In [12]:
# TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the plain dot
# product from linear_kernel equals cosine similarity.
# linear_kernel skips the redundant re-normalization, so it is faster.
similarity = linear_kernel(tf_idf_matrix, tf_idf_matrix)
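
The equivalence claimed in the comment above can be checked directly. This is a small sketch on a hypothetical corpus with default vectorizer settings, not the notebook's data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents; the default norm='l2' makes every row unit length.
toy_docs = ["batman returns to gotham", "batman forever", "space tourists"]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(dot, cos))           # True
print(np.allclose(np.diag(dot), 1.0))  # True: each document matches itself
```

The equivalence depends on the rows being unit length, so it would no longer hold if the vectorizer were created with `norm=None`.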

In [13]:
# Check the shape
similarity.shape

(19943, 19943)

##Recommendation Function

In [16]:
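# Reset the index so row labels match row positions in the similarity matrix
# (dropna may leave gaps in the original index)
movies_indices = dataset.reset_index()
titles = movies_indices['original_title']
# Map each title to its row position
indices = pd.Series(movies_indices.index, index=movies_indices['original_title'])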

In [17]:
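def get_recommendations(title):
    """Return the 10 titles most similar to `title`."""
    # Row position of the movie in the similarity matrix
    index = indices[title]
    # Pair every movie position with its similarity to `title`, best first
    score = list(enumerate(similarity[index]))
    score = sorted(score, key=lambda x: x[1], reverse=True)
    ranked_positions = [i[0] for i in score]
    # Skip the first entry, which is normally the movie itself
    return titles.iloc[ranked_positions[1:11]]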

In [19]:
get_recommendations("The Dark Knight")

10847      The Dark Knight Rises
972               Batman Returns
110               Batman Forever
9134         Law Abiding Citizen
13007                    Tokarev
15108    Kidnapping Mr. Heineken
17857             Space Tourists
4700               Kaitei gunkan
12696           Reasonable Doubt
12415                Just Wright
Name: original_title, dtype: object
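
The full `similarity` matrix has 19943 × 19943 entries, which is roughly 3 GB as dense float64. One possible alternative, sketched below on a hypothetical four-movie catalogue, scores a single row against the TF-IDF matrix on demand. The helper `recommend`, the frame `toy`, and the choice to take the first match when a title appears more than once are all assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical miniature catalogue standing in for `dataset`.
toy = pd.DataFrame({
    "original_title": ["Batman Returns", "Batman Forever", "Space Tourists", "Heat"],
    "features": [
        "batman gotham penguin crime",
        "batman gotham riddler crime",
        "space tourists orbit documentary",
        "crime thief los angeles",
    ],
})
toy_matrix = TfidfVectorizer().fit_transform(toy["features"])


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    matches = np.flatnonzero(toy["original_title"].to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"unknown title: {title}")
    index = int(matches[0])  # assumption: use the first row if a title repeats
    # Score one row against every movie instead of building the full matrix.
    scores = linear_kernel(toy_matrix[index:index + 1], toy_matrix).ravel()
    scores[index] = -1.0  # exclude the query movie itself
    best = np.argsort(scores)[::-1][:top_n]
    return toy["original_title"].iloc[best]


print(recommend("Batman Returns").tolist())  # ['Batman Forever', 'Heat']
```

Excluding the query by position, rather than by dropping the first sorted entry, also stays correct when another movie has an identical feature vector.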