# Simple Recommender System in Python

This code is from the [DataCamp tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python).

The dataset is [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset/data) on Kaggle.

Download the dataset from Kaggle with an API token to save on students' data quota; more details are available [here](https://www.kaggle.com/general/74235).

In [1]:
# Upload kaggle.json
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"irfanchairurrachman","key":"309b51d1349ccbc69c0aa34c8a2c408c"}'}

In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle -la

total 16
drwxr-xr-x 2 root root 4096 May  3 06:30 .
drwx------ 1 root root 4096 May  3 06:30 ..
-rw------- 1 root root   75 May  3 06:30 kaggle.json


In [3]:
!kaggle datasets download -d rounakbanik/the-movies-dataset

Downloading the-movies-dataset.zip to /content
 99% 225M/228M [00:02<00:00, 111MB/s] 
100% 228M/228M [00:02<00:00, 103MB/s]


In [4]:
!unzip the-movies-dataset.zip

Archive:  the-movies-dataset.zip
  inflating: credits.csv             
  inflating: keywords.csv            
  inflating: links.csv               
  inflating: links_small.csv         
  inflating: movies_metadata.csv     
  inflating: ratings.csv             
  inflating: ratings_small.csv       


In [5]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [None]:
metadata.shape

(45466, 24)

In [6]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.618207215133889


In [7]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


In [8]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4555, 24)

In [9]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
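
As a quick sanity check of the formula, the sketch below recomputes one score by hand with plain numbers. The values of `v`, `R`, `m`, and `C` are copied from the outputs in this notebook (The Shawshank Redemption); the helper name `weighted_rating_scalar` is hypothetical and only mirrors `weighted_rating()` for scalars.

```python
def weighted_rating_scalar(v, R, m, C):
    """IMDB-style weighted rating for plain numbers.

    v: number of votes for the movie
    R: average rating of the movie
    m: minimum votes required to be listed
    C: mean vote across the whole dataset
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values taken from the outputs in this notebook
C = 5.618207215133889
m = 160.0

shawshank = weighted_rating_scalar(v=8358.0, R=8.5, m=m, C=C)
print(round(shawshank, 6))

# A made-up movie with very few votes is pulled towards the mean C,
# even though its raw average is higher
few_votes = weighted_rating_scalar(v=10.0, R=9.0, m=m, C=C)
print(few_votes < shawshank)
```

The second call illustrates the design choice behind the formula: a high average backed by only a handful of votes contributes little, so the score stays close to `C`.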

In [10]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [11]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


In [None]:
# Print plot overviews of the first 5 movies
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [None]:
# Use q_movies here because metadata is too large
q_movies['overview'].head()

314      Framed in the 1940s for the double murder of h...
834      Spanning the years 1945 to 1955, a chronicle o...
10309    Raj is a rich, carefree, happy-go-lucky second...
12481    Batman raises the stakes in his war on crime. ...
2843     A ticking-time-bomb insomniac and a slippery s...
Name: overview, dtype: object

In [12]:
# Reset the index so that it starts from 0
q_movies = q_movies.reset_index(drop=True)

In [13]:
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(30)

Unnamed: 0,title,vote_count,vote_average,score
0,The Shawshank Redemption,8358.0,8.5,8.445869
1,The Godfather,6024.0,8.5,8.425439
2,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
3,The Dark Knight,12269.0,8.3,8.265477
4,Fight Club,9678.0,8.3,8.256385
5,Pulp Fiction,8670.0,8.3,8.251406
6,Schindler's List,4436.0,8.3,8.206639
7,Whiplash,4376.0,8.3,8.205404
8,Spirited Away,3968.0,8.3,8.196055
9,Life Is Beautiful,3643.0,8.3,8.187171


In [60]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
q_movies['overview'] = q_movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(q_movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4606, 19694)

In [61]:
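#Array mapping from feature integer indices to feature names.
#Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
tfidf.get_feature_names()[5000:5010]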

['did',
 'didn',
 'dido',
 'die',
 'died',
 'diego',
 'dies',
 'diesel',
 'diet',
 'dietary']

In [62]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

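# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product computed by linear_kernel equals the cosine similarity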
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
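
The reason `linear_kernel` can stand in for cosine similarity here can be checked on a toy example. The sketch below uses only the standard library and two made-up vectors; the helper names `l2_normalize`, `dot`, and `cosine` are hypothetical and not part of scikit-learn.

```python
import math

def dot(a, b):
    """Plain dot product, which is what a linear kernel computes."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed from its definition."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

# Two made-up term-weight vectors
a = [0.0, 2.0, 1.0]
b = [1.0, 1.0, 0.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# On unit-length vectors the dot product and the cosine similarity coincide
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))
```

Skipping the explicit normalization step is what makes the dot product cheaper than a general cosine similarity call on rows that are already unit length.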

In [None]:
cosine_sim.shape

(4555, 4555)

In [None]:
cosine_sim[1]

array([0.00522362, 1.        , 0.01249039, ..., 0.        , 0.01420965,
       0.01535064])

In [None]:
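#Construct a reverse map of indices and movie titles.
#Duplicates must be removed on the titles (the Series index); calling drop_duplicates()
#on the Series only compares the integer positions, which are already unique.
indices = pd.Series(q_movies.index, index=q_movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]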

In [None]:
indices[:10]

title
The Shawshank Redemption       0
The Godfather                  1
Dilwale Dulhania Le Jayenge    2
The Dark Knight                3
Fight Club                     4
Pulp Fiction                   5
Schindler's List               6
Whiplash                       7
Spirited Away                  8
Life Is Beautiful              9
dtype: int64

In [63]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return q_movies['title'].iloc[movie_indices]
    # return idx
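
The slice `sim_scores[1:11]` assumes the queried movie itself always sorts into position 0. A minimal alternative sketch, assuming a plain list of similarity scores, excludes the queried index explicitly and uses `heapq.nlargest` instead of sorting the whole row. The function name `top_k_similar` and the toy scores are hypothetical.

```python
import heapq

def top_k_similar(scores, idx, k=10):
    """Return the k (index, score) pairs with the highest score, skipping idx itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Toy similarity row for the movie at position 1 (its self-similarity is 1.0)
row = [0.1, 1.0, 0.4, 0.0, 0.7]

print(top_k_similar(row, idx=1, k=2))  # [(4, 0.7), (2, 0.4)]
```

Excluding the index explicitly also behaves sensibly when another row ties with the query at similarity 1.0, for example a duplicated entry with an identical overview.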

In [64]:
get_recommendations('The Dark Knight Rises')

4403                             Batman Forever
3                               The Dark Knight
1588                             Batman Returns
768                                      Batman
503                  Batman: Under the Red Hood
1553                           Batman: Year One
469     Batman: The Dark Knight Returns, Part 1
2254                          Batman: Bad Blood
2255                          Batman: Bad Blood
324     Batman: The Dark Knight Returns, Part 2
Name: title, dtype: object

In [None]:
get_recommendations('The Godfather')

10       The Godfather: Part II
3405                 Blood Ties
670     The Godfather: Part III
2835              Live by Night
1141                   Sinister
4436      The Cold Light of Day
2311                        Joe
503           Road to Perdition
1370         Death at a Funeral
1529          The Addams Family
Name: title, dtype: object

In [None]:
# Not yet based on credits, genres, and keywords

In [14]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')
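
Dropping rows by hard-coded positions works for this exact file but is brittle. The sketch below shows one possible alternative on toy data: coerce the ids with `pd.to_numeric(..., errors='coerce')` and drop whatever fails to parse. It also drops repeated ids before merging. That second step is an assumption: the qualified set grows from 4555 to 4606 rows after the merge in this notebook, which suggests that some ids repeat in the merged tables, but this is not verified here. All frames and values below are made up.

```python
import pandas as pd

# Toy stand-ins for `metadata` and `credits`; all values are made up
toy_meta = pd.DataFrame({
    'id': ['862', '8844', 'not-an-id'],
    'title': ['Toy Story', 'Jumanji', 'Broken row'],
})
toy_credits = pd.DataFrame({
    'id': [862, 862, 8844],
    'cast': ['a', 'a', 'b'],
})

# Coerce non-numeric ids to NaN, drop those rows, then cast to int for merging
toy_meta['id'] = pd.to_numeric(toy_meta['id'], errors='coerce')
toy_meta = toy_meta.dropna(subset=['id'])
toy_meta['id'] = toy_meta['id'].astype('int64')

# Drop repeated ids on the right-hand side so the merge cannot multiply rows
merged = toy_meta.merge(toy_credits.drop_duplicates(subset='id'), on='id')
print(merged.shape)
```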

In [15]:
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [16]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4606, 27)

In [17]:
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [18]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
841,The Godfather,6024.0,8.5,8.425439
10397,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12589,The Dark Knight,12269.0,8.3,8.265477
2870,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23868,Whiplash,4376.0,8.3,8.205404
5529,Spirited Away,3968.0,8.3,8.196055
2231,Life Is Beautiful,3643.0,8.3,8.187171


In [19]:
# Reset the index so that it starts from 0
q_movies = q_movies.reset_index(drop=True)

In [23]:
q_movies[['title', 'vote_count', 'vote_average', 'score','cast']].head(10)

Unnamed: 0,title,vote_count,vote_average,score,cast
0,The Shawshank Redemption,8358.0,8.5,8.445869,"[{'cast_id': 3, 'character': 'Andy Dufresne', ..."
1,The Godfather,6024.0,8.5,8.425439,"[{'cast_id': 5, 'character': 'Don Vito Corleon..."
2,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453,"[{'cast_id': 1, 'character': 'Raj Malhotra', '..."
3,The Dark Knight,12269.0,8.3,8.265477,"[{'cast_id': 35, 'character': 'Bruce Wayne / B..."
4,Fight Club,9678.0,8.3,8.256385,"[{'cast_id': 4, 'character': 'The Narrator', '..."
5,Pulp Fiction,8670.0,8.3,8.251406,"[{'cast_id': 2, 'character': 'Vincent Vega', '..."
6,Schindler's List,4436.0,8.3,8.206639,"[{'cast_id': 14, 'character': 'Oskar Schindler..."
7,Whiplash,4376.0,8.3,8.205404,"[{'cast_id': 5, 'character': 'Andrew Neimann',..."
8,Spirited Away,3968.0,8.3,8.196055,"[{'cast_id': 3, 'character': 'Chihiro Ogino (v..."
9,Life Is Beautiful,3643.0,8.3,8.187171,"[{'cast_id': 7, 'character': 'Dora', 'credit_i..."


In [37]:
q_movies[['cast']]

Unnamed: 0,cast
0,"[{'cast_id': 3, 'character': 'Andy Dufresne', ..."
1,"[{'cast_id': 5, 'character': 'Don Vito Corleon..."
2,"[{'cast_id': 1, 'character': 'Raj Malhotra', '..."
3,"[{'cast_id': 35, 'character': 'Bruce Wayne / B..."
4,"[{'cast_id': 4, 'character': 'The Narrator', '..."
...,...
4601,"[{'cast_id': 1, 'character': 'Tim Avery', 'cre..."
4602,"[{'cast_id': 5, 'character': 'Will', 'credit_i..."
4603,"[{'cast_id': 11, 'character': 'Terl', 'credit_..."
4604,"[{'cast_id': 1, 'character': 'Edward', 'credit..."


In [38]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    q_movies[feature] = q_movies[feature].apply(literal_eval)
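
To make the parsing step concrete, the sketch below applies `literal_eval` to a single stringified cell (the `production_companies` value shown for Toy Story above) and extracts the names. The wrapper `safe_literal_eval` is a hypothetical helper, not part of the tutorial; it returns an empty list for missing or malformed cells instead of raising.

```python
from ast import literal_eval

# A stringified list of dicts, as stored in the CSV
raw = "[{'name': 'Pixar Animation Studios', 'id': 3}]"

parsed = literal_eval(raw)            # now a real list of dicts
names = [item['name'] for item in parsed]
print(names)  # ['Pixar Animation Studios']

def safe_literal_eval(value):
    """Parse a stringified Python literal; fall back to [] on bad input."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []

print(safe_literal_eval(float('nan')))   # missing cell -> []
print(safe_literal_eval("[{'name': "))   # truncated cell -> []
```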

In [39]:
# Import Numpy
import numpy as np

In [40]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [41]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [43]:
# Define new director, cast, genres and keywords features that are in a suitable form.
q_movies['director'] = q_movies['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    q_movies[feature] = q_movies[feature].apply(get_list)

In [45]:
# Print the new features of the first 5 films
q_movies[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

Unnamed: 0,title,cast,director,keywords,genres
0,The Shawshank Redemption,"[Tim Robbins, Morgan Freeman, Bob Gunton]",Frank Darabont,"[prison, corruption, police brutality]","[Drama, Crime]"
1,The Godfather,"[Marlon Brando, Al Pacino, James Caan]",Francis Ford Coppola,"[italy, love at first sight, loss of father]","[Drama, Crime]"
2,Dilwale Dulhania Le Jayenge,"[Shah Rukh Khan, Kajol, Amrish Puri]",Aditya Chopra,[musical],"[Comedy, Drama, Romance]"
3,The Dark Knight,"[Christian Bale, Michael Caine, Heath Ledger]",Christopher Nolan,"[dc comics, crime fighter, secret identity]","[Drama, Action, Crime]"
4,Fight Club,"[Edward Norton, Brad Pitt, Meat Loaf]",David Fincher,"[support group, dual identity, nihilism]",[Drama]


In [46]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [47]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    q_movies[feature] = q_movies[feature].apply(clean_data)
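
Why strip the spaces from names? A word-level vectorizer splits on whitespace, so two different people who share a first name would otherwise share a token. The sketch below illustrates this with a simple whitespace tokenizer standing in for the vectorizer; the helper names and the second person are hypothetical and used for illustration only.

```python
def tokens(names):
    """Split every name on whitespace, roughly as a word tokenizer would."""
    return {tok for name in names for tok in name.lower().split()}

def clean(names):
    """Same idea as clean_data(): lowercase and remove spaces."""
    return [name.replace(" ", "").lower() for name in names]

movie_a = ["Christopher Nolan"]
movie_b = ["Christopher Lee"]  # hypothetical second person, for illustration only

# Without cleaning, the shared first name looks like common ground
print(tokens(movie_a) & tokens(movie_b))                # {'christopher'}

# After cleaning, each person becomes one unique token
print(tokens(clean(movie_a)) & tokens(clean(movie_b)))  # set()
```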

In [48]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [49]:
# Create a new soup feature
q_movies['soup'] = q_movies.apply(create_soup, axis=1)

In [52]:
q_movies[['soup']].head(5)

Unnamed: 0,soup
0,prison corruption policebrutality timrobbins m...
1,italy loveatfirstsight lossoffather marlonbran...
2,musical shahrukhkhan kajol amrishpuri adityach...
3,dccomics crimefighter secretidentity christian...
4,supportgroup dualidentity nihilism edwardnorto...


In [54]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(q_movies['soup'])

In [55]:
count_matrix.shape

(4606, 10413)

In [57]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

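# Reset index of your main DataFrame and construct reverse mapping as before
q_movies = q_movies.reset_index(drop=True)
indices = pd.Series(q_movies.index, index=q_movies['title'])
# Keep only the first position for titles that occur more than once
indices = indices[~indices.index.duplicated(keep='first')]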
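
What the count-based similarity measures can be sketched without scikit-learn: each soup becomes a bag of token counts, and the cosine of two bags grows with the number of shared tokens. The soups below are shortened, made-up examples in the style of the `soup` column, and `count_cosine` is a hypothetical helper, not the library function.

```python
import math
from collections import Counter

def count_cosine(soup_a, soup_b):
    """Cosine similarity between two whitespace-tokenized bags of words."""
    a, b = Counter(soup_a.split()), Counter(soup_b.split())
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

# Made-up, shortened soups for illustration
soup_1 = "dccomics crimefighter christianbale christophernolan action"
soup_2 = "dccomics christianbale christophernolan action crime"
soup_3 = "prison timrobbins frankdarabont drama"

print(count_cosine(soup_1, soup_2))  # 4 shared tokens out of 5 each
print(count_cosine(soup_1, soup_3))  # no shared tokens
```

Raw counts weight every shared token equally, so a shared director or lead actor counts as much as any other token; with TF-IDF, tokens that appear in many soups would be down-weighted.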

In [65]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

3               The Dark Knight
199               Batman Begins
47                 The Prestige
3701    Kidnapping Mr. Heineken
2995                     Faster
3075                     Takers
3684                 The Double
1186                    Bronson
2518          Escape to Victory
2553             Gangster Squad
Name: title, dtype: object

In [66]:
get_recommendations('The Godfather', cosine_sim2)

672    The Godfather: Part III
10      The Godfather: Part II
60              Apocalypse Now
50                    Scarface
160                       Heat
261              Carlito's Way
321          On the Waterfront
378          Dog Day Afternoon
406              Donnie Brasco
640                    Serpico
Name: title, dtype: object