## Movie Recommender


The data comes from Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset) and consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.\
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.\
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.\
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.\
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.\
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

#### Simple Recommender
Recommends the same list to all users, ranked by an IMDB-style weighted rating score: \
Weighted Rating (WR) = (v/(v+m) * R) + (m/(v+m) * C)
* v is the number of votes for the movie;
* m is the minimum number of votes required to be listed in the chart;
* R is the average rating of the movie;
* C is the mean vote across the whole dataset.
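
To make the effect of the formula concrete, here is a minimal sketch with hypothetical numbers (the values of `C`, `m`, and both movies are assumptions chosen for illustration, not taken from the dataset). A movie with very few votes is pulled strongly toward the global mean `C`, while a movie with many votes keeps a score close to its own average `R`.

```python
def weighted_rating(v: float, R: float, m: float, C: float) -> float:
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values, for illustration only.
C = 6.0   # assumed global mean vote
m = 100   # assumed minimum-votes threshold

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)       # high average, few votes
many_votes = weighted_rating(v=10_000, R=8.5, m=m, C=C)  # slightly lower average, many votes

print(round(few_votes, 3), round(many_votes, 3))
```

With these inputs the 9.0-rated movie with only 10 votes ends up well below the 8.5-rated movie with 10,000 votes, which is the behaviour the chart relies on.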

In [1]:
import pandas as pd
import numpy as np

movies = pd.read_csv('movies_metadata.csv',low_memory=False)
movies.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [2]:
# check for null values
print(movies.isnull().sum())
print('-'*30)
print('shape:{}'.format(movies.shape))

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
------------------------------
shape:(45466, 24)


In [3]:
# data cleaning, preparation or manipulation

# fill missing votes and revenue with 0 so they can be used in numeric calculations
column2=['vote_average','vote_count','revenue']

for x in column2:
    movies[x]=movies[x].fillna(0)

In [4]:
print(movies.describe())

            revenue       runtime  vote_average    vote_count
count  4.546600e+04  45203.000000  45466.000000  45466.000000
mean   1.120787e+07     94.128199      5.617466    109.882836
std    6.432813e+07     38.407810      1.925171    491.279576
min    0.000000e+00      0.000000      0.000000      0.000000
25%    0.000000e+00     85.000000      5.000000      3.000000
50%    0.000000e+00     95.000000      6.000000     10.000000
75%    0.000000e+00    107.000000      6.800000     34.000000
max    2.787965e+09   1256.000000     10.000000  14075.000000


From the summary above, vote_average ranges from 0 to 10, with a mean of about 5.6.

In [5]:
# C is the mean vote; m is the 0.8 quantile of vote_count, used as the minimum-votes threshold
C = movies['vote_average'].mean()
m = movies['vote_count'].quantile(0.8)


def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
movies['weighted_rating'] = movies.apply(weighted_rating, axis=1)
movies.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,weighted_rating
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.680947


In [6]:
# keep only movies whose vote_count is at least m (the 0.8 quantile)

movies_1 = movies.loc[movies['vote_count']>=m]
movies_1.shape

(9151, 25)
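
The thresholding step can be sketched with the standard library alone. The vote counts below are hypothetical, and `statistics.quantiles` with `method="inclusive"` is used here because it applies the same linear interpolation that pandas uses by default for `quantile`.

```python
import statistics

# Hypothetical vote counts, for illustration only.
vote_counts = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# n=5 returns the 20th, 40th, 60th and 80th percentiles; index 3 is the 0.8 quantile.
m = statistics.quantiles(vote_counts, n=5, method="inclusive")[3]

# Keep only entries with at least m votes, mirroring the `vote_count >= m` filter.
qualified = [v for v in vote_counts if v >= m]

print(m, qualified)
```

Roughly the top 20% of entries by vote count survive the filter; ties at the threshold can push the share slightly above 20%.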

In [7]:
# sort weighted rating descending

simple_recommender= movies_1[['title','vote_average','vote_count','weighted_rating']].sort_values('weighted_rating', ascending = False).reset_index(drop=True)
simple_recommender.head(10)

| | title | vote_average | vote_count | weighted_rating |
|---|---|---|---|---|
| 0 | Dilwale Dulhania Le Jayenge | 9.1 | 661.0 | 8.855096 |
| 1 | The Shawshank Redemption | 8.5 | 8358.0 | 8.482858 |
| 2 | The Godfather | 8.5 | 6024.0 | 8.476272 |
| 3 | Your Name. | 8.5 | 1030.0 | 8.366549 |
| 4 | The Dark Knight | 8.3 | 12269.0 | 8.289112 |
| 5 | Fight Club | 8.3 | 9678.0 | 8.286212 |
| 6 | Pulp Fiction | 8.3 | 8670.0 | 8.284618 |
| 7 | Schindler's List | 8.3 | 4436.0 | 8.270101 |
| 8 | Whiplash | 8.3 | 4376.0 | 8.269696 |
| 9 | Spirited Away | 8.3 | 3968.0 | 8.266619 |


#### Content-based recommenders
Suggest similar items based on movie attributes such as genre, director, description and actors.

##### Based on overview (looked up by title)

In [8]:
# use the overview (plot description) as the basis for recommendations
print('Total data NA:{}'.format(movies['overview'].isna().sum()))
print('-'*60)
movies['overview'] = movies['overview'].fillna('')
print(movies['overview'].head())

Total data NA:954
------------------------------------------------------------
0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object


In [9]:
# use only the first 10,000 movies (the similarity matrix for the full dataset would be too large)
movies_2 = movies[0:10000]

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',lowercase=True)
matrix = vectorizer.fit_transform(movies_2['overview'])
matrix.shape

(10000, 32350)
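
Each row of this matrix is one overview and each column is one term. As a rough sketch of what the weights mean, the function below computes TF-IDF by hand using the smoothed IDF that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The whitespace tokenizer and the three short documents are assumptions made for illustration; `TfidfVectorizer` uses its own tokenizer and also removes stop words here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]  # simplistic tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical mini-corpus.
docs = ["toys come alive", "toys fight evil", "a quiet family wedding"]
vecs = tfidf(docs)
print(vecs[0])
```

In the first document, "toys" receives a lower weight than "alive" because it also appears in the second document, so it is less distinctive.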

In [11]:
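# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()[6000:6005].tolist()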

['commences', 'commendable', 'comment', 'commentary', 'commentator']

In [12]:
# Calculating Cosine Similarity
from sklearn.metrics.pairwise import linear_kernel

# TF-IDF rows are L2-normalized by default, so their dot product (linear_kernel)
# equals the cosine similarity and is cheaper to compute
cosine_sim = linear_kernel(matrix, matrix)
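
Using a plain dot product here is valid only because the vectors have unit length. A small standard-library sketch of that equivalence, with two hypothetical vectors:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical vectors, for illustration only.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

print(dot(a, b))                              # raw dot product, not a similarity in [0, 1]
print(cosine(a, b))                           # cosine similarity
print(dot(l2_normalize(a), l2_normalize(b)))  # same value after normalization
```

For vectors that are not normalized, such as raw term counts, the dot product and the cosine similarity differ, so the normalizing version is needed.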

In [13]:
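# map each title to its row position; for duplicate titles keep the first occurrence
# (drop_duplicates() on the Series would compare the positions, not the titles)
indices = pd.Series(movies_2.index, index=movies_2['title'])
indices = indices[~indices.index.duplicated(keep='first')]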
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
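
The lookup in the recommender needs exactly one row position per title. One way to guarantee that is to keep the first occurrence of each title, sketched below with a plain dictionary (the title list is hypothetical, and keeping the first occurrence is a design choice, not the only option).

```python
# Hypothetical titles; "Heat" appears twice.
titles = ["Toy Story", "Heat", "Sabrina", "Heat"]

title_to_index = {}
for position, title in enumerate(titles):
    # setdefault only stores the position the first time a title is seen.
    title_to_index.setdefault(title, position)

print(title_to_index)
```

If remakes with the same title need to be told apart, a key such as `(title, release_year)` would be one alternative.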

In [14]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and keep the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return movies_2['title'].iloc[movie_indices]

In [16]:
# Recommender based on title

title='Toy Story'

get_recommendations(title)

2997                                    Toy Story 2
8327                                      The Champ
1071                          Rebel Without a Cause
3057                                Man on the Moon
1932                                      Condorman
485                                          Malice
5797                                  Class of 1984
7254                                 Africa Screams
6944                               Rivers and Tides
7615    The First $20 Million Is Always the Hardest
Name: title, dtype: object
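
Slicing `[1:11]` assumes the query movie always sorts into first place. That may not hold in every case: a movie with an empty overview has an all-zero TF-IDF vector, so its similarity with itself is 0. A more defensive sketch, using only the standard library, excludes the query position explicitly and avoids sorting the full list (`top_k_similar` is a hypothetical helper, not part of the notebook):

```python
import heapq


def top_k_similar(scores, self_index, k=10):
    """Return the positions of the k highest scores, skipping the query item itself."""
    candidates = ((score, i) for i, score in enumerate(scores) if i != self_index)
    # nlargest compares (score, position) tuples, so ties go to the larger position.
    best = heapq.nlargest(k, candidates)
    return [i for _, i in best]


# Hypothetical similarity row for the item at position 0.
scores = [1.0, 0.2, 0.9, 0.0, 0.5]
print(top_k_similar(scores, self_index=0, k=2))

# A row where the query item has zero similarity with itself.
zero_row = [0.0, 0.3, 0.1]
print(top_k_similar(zero_row, self_index=0, k=2))
```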

##### Combination based on genre, keywords and cast

In [17]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# this recommender only uses the first 10,000 movies (to keep the data small)
# .copy() avoids pandas' SettingWithCopyWarning when the id column is reassigned below
movies_data = movies[0:10000].copy()

credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
movies_data['id'] = movies_data['id'].apply(lambda x: x.replace('-','')).astype('int')

# merge or lookup data based on id
movies_data = movies_data.merge(credits, on='id')
movies_data = movies_data.merge(keywords, on='id')


In [18]:
from ast import literal_eval

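column=['cast','keywords','genres','crew']

# parse each stringified list of dicts and keep only the 'name' values
def get_names(items):
    return [item['name'] for item in items] if isinstance(items, list) else []

for col in column:
    movies_data[col]=movies_data[col].fillna('[]').apply(literal_eval).apply(get_names)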
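
To show what this parsing step does to a single cell, here is a small standalone sketch. `extract_names` is a hypothetical helper and the sample string only imitates the format seen in the `genres` column; the error handling is an added assumption for malformed cells.

```python
from ast import literal_eval


def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed cell: treat as empty
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(extract_names(sample))
print(extract_names("[]"))
print(extract_names("{broken"))
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.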


In [19]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['genres'])

movies_data['soup'] = movies_data.apply(create_soup, axis=1)
movies_data.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,weighted_rating,cast,crew,keywords,soup
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,7.680947,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[John Lasseter, Joss Whedon, Andrew Stanton, J...","[jealousy, toy, boy, friendship, friends, riva...",jealousy toy boy friendship friends rivalry bo...
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.873964,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Larry J. Franco, Jonathan Hensleigh, James Ho...","[board game, disappearance, based on children'...",board game disappearance based on children's b...
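
Because the soup is later split into word tokens, a multi-word name such as "Robin Williams" contributes "robin" and "williams" separately and can match unrelated people who share a first name. A possible refinement, sketched below as an assumption rather than what the notebook does, is to collapse each name or keyword into a single token and to cap the number of cast members (`to_token`, `build_soup` and `max_cast` are hypothetical names).

```python
def to_token(name: str) -> str:
    """Collapse a multi-word name into one lowercase token, e.g. 'Tom Hanks' -> 'tomhanks'."""
    return name.replace(" ", "").lower()


def build_soup(keywords, cast, genres, max_cast=3):
    """Join keywords, the first max_cast cast members, and genres into one string."""
    parts = (
        [to_token(k) for k in keywords]
        + [to_token(c) for c in cast[:max_cast]]
        + [to_token(g) for g in genres]
    )
    return " ".join(parts)


print(build_soup(
    keywords=["board game", "disappearance"],
    cast=["Robin Williams", "Jonathan Hyde", "Kirsten Dunst", "Bradley Pierce"],
    genres=["Adventure", "Fantasy"],
))
```

Whether this improves the recommendations depends on the data, so it is worth comparing the results with and without the change.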


In [20]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies_data['soup'])

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
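# map each title to its row position; for duplicate titles keep the first occurrence
indices = pd.Series(movies_data.index, index=movies_data['title'])
indices = indices[~indices.index.duplicated(keep='first')]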

In [25]:
def get_recommendations(title, cosine_sim=cosine_sim2):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movies_data['title'].iloc[movie_indices]

In [31]:
title='Men in Black II'

get_recommendations(title)

1520                                Men in Black
4672                   Megiddo: The Omega Code 2
2815                                Total Recall
8853        The Hitch Hikers Guide to the Galaxy
7014    Metalstorm: The Destruction of Jared-Syn
1416                               That Darn Cat
1823                              Small Soldiers
2947                           How I Won the War
1180                          Return of the Jedi
1355                               Mars Attacks!
Name: title, dtype: object