## Movie Recommender


The data comes from Kaggle: https://www.kaggle.com/rounakbanik/the-movies-dataset
The dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.\
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.\
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.\
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.\
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.\
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

#### Simple Recommender
Recommends the same movies to every user, ranked by the IMDB weighted rating score. \
Weighted Rating = (v/(v+m) * R) + (m/(v+m) * C)
* v is the number of votes for the movie;
* m is the minimum votes required to be listed in the chart;
* R is the average rating of the movie;
* C is the mean vote across the whole report.
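
As a quick check of the formula, the sketch below evaluates it for a few made-up inputs. The standalone function signature and all sample values are illustrative assumptions and are not taken from the dataset. A movie with few votes is pulled toward the global mean C, while a movie with many votes keeps a score close to its own average R.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical values, chosen only to show how the formula behaves.
m, C = 100, 6.0
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)       # pulled strongly toward C
many_votes = weighted_rating(v=10_000, R=9.0, m=m, C=C)  # stays close to R
equal_votes = weighted_rating(v=100, R=8.0, m=m, C=C)    # v == m: midpoint of R and C

print(round(few_votes, 3), round(many_votes, 3), equal_votes)
```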

In [1]:
import pandas as pd
import numpy as np

movies = pd.read_csv('movies_metadata.csv',low_memory=False)
movies.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [2]:
# check for null values
print(movies.isnull().sum())
print('-'*30)
print('shape:{}'.format(movies.shape))

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
------------------------------
shape:(45466, 24)


In [3]:
# data cleaning, preparation or manipulation

column2=['vote_average','vote_count','revenue']

for x in column2:
    movies[x]=movies[x].fillna(0)

In [4]:
print(movies.describe())

            revenue       runtime  vote_average    vote_count
count  4.546600e+04  45203.000000  45466.000000  45466.000000
mean   1.120787e+07     94.128199      5.617466    109.882836
std    6.432813e+07     38.407810      1.925171    491.279576
min    0.000000e+00      0.000000      0.000000      0.000000
25%    0.000000e+00     85.000000      5.000000      3.000000
50%    0.000000e+00     95.000000      6.000000     10.000000
75%    0.000000e+00    107.000000      6.800000     34.000000
max    2.787965e+09   1256.000000     10.000000  14075.000000


From the data: vote_average has a mean of about 5.6 and a maximum of 10.

In [5]:
# C is the mean vote_average; m is the 0.8 quantile of vote_count (minimum votes required)
C = movies['vote_average'].mean()
m = movies['vote_count'].quantile(0.8)


def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
movies['weighted_rating'] = movies.apply(weighted_rating, axis=1)
movies.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,weighted_rating
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.680947


In [6]:
# keep only movies whose vote_count is at least m (the 0.8 quantile)

movies_1 = movies.loc[movies['vote_count']>=m]
movies_1.shape

(9151, 25)
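
The threshold m is a quantile, so roughly the top 20% of movies by vote count pass the filter (9,151 of 45,466 rows here). The sketch below shows the idea in plain Python with linear interpolation between order statistics, which is the default interpolation pandas uses for `quantile`. The helper name `linear_quantile` and the vote counts are hypothetical.

```python
def linear_quantile(values, q):
    """q-quantile using linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Hypothetical vote counts, for illustration only.
vote_counts = [0, 2, 3, 5, 8, 10, 34, 50, 120, 900, 5415]
m_demo = linear_quantile(vote_counts, 0.8)
qualified = [v for v in vote_counts if v >= m_demo]

print(m_demo, qualified)
```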

In [7]:
# sort weighted rating descending

simple_recommender= movies_1[['title','vote_average','vote_count','weighted_rating']].sort_values('weighted_rating', ascending = False).reset_index(drop=True)
simple_recommender.head(10)

| | title | vote_average | vote_count | weighted_rating |
|---|---|---|---|---|
| 0 | Dilwale Dulhania Le Jayenge | 9.1 | 661.0 | 8.855096 |
| 1 | The Shawshank Redemption | 8.5 | 8358.0 | 8.482858 |
| 2 | The Godfather | 8.5 | 6024.0 | 8.476272 |
| 3 | Your Name. | 8.5 | 1030.0 | 8.366549 |
| 4 | The Dark Knight | 8.3 | 12269.0 | 8.289112 |
| 5 | Fight Club | 8.3 | 9678.0 | 8.286212 |
| 6 | Pulp Fiction | 8.3 | 8670.0 | 8.284618 |
| 7 | Schindler's List | 8.3 | 4436.0 | 8.270101 |
| 8 | Whiplash | 8.3 | 4376.0 | 8.269696 |
| 9 | Spirited Away | 8.3 | 3968.0 | 8.266619 |


#### Content-based recommenders
Suggests movies similar to a given one, using features such as genre, director, description and actors.

##### Based on overview (looked up by title)

In [8]:
# use the overview (plot description) as the basis for recommendations
print('Total data NA:{}'.format(movies['overview'].isna().sum()))
print('-'*60)
movies['overview'] = movies['overview'].fillna('')
print(movies['overview'].head())

Total data NA:954
------------------------------------------------------------
0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object


In [39]:
# use only the first 20,000 rows (the full dataset is too big)
movies_2 = movies[0:20000]
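
The size limit is presumably driven by the dense pairwise similarity matrix, which grows with the square of the number of rows. A back-of-the-envelope estimate, assuming 8-byte float64 values (the helper name `dense_matrix_gb` is hypothetical):

```python
def dense_matrix_gb(n_rows, bytes_per_value=8):
    """Approximate size in GB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1e9


subset_gb = dense_matrix_gb(20_000)   # the subset used here
full_gb = dense_matrix_gb(45_466)     # all rows of movies_metadata.csv

print(f"{subset_gb:.1f} GB for the subset vs {full_gb:.1f} GB for the full data")
```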

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',lowercase=True)
matrix = vectorizer.fit_transform(movies_2['overview'])
matrix.shape

(20000, 47487)

In [41]:
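# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
vectorizer.get_feature_names_out()[6000:6005].tolist()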

['buddah', 'budderball', 'buddha', 'buddhahood', 'buddhism']
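
To make the matrix values less opaque, here is a minimal pure-Python sketch of TF-IDF weighting. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that `TfidfVectorizer` applies by default, but the tokenisation is simplified to a whitespace split with no stop-word removal, so the numbers will not match scikit-learn on real text. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows


docs = ["toys come alive", "toys fight toys", "board game adventure"]
vocab, vectors = tfidf_vectors(docs)
# In the first document, the rarer word "alive" gets more weight than "toys".
print(vectors[0][vocab.index("alive")], vectors[0][vocab.index("toys")])
```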

In [42]:
# Calculating Cosine Similarity
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(matrix, matrix)
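
`linear_kernel` computes plain dot products. That equals cosine similarity here because `TfidfVectorizer` L2-normalises each row by default. The small check below illustrates the identity on two made-up vectors; the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Arbitrary example vectors.
a, b = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
dot_after_normalizing = dot(l2_normalize(a), l2_normalize(b))
cosine_of_raw = cosine(a, b)

print(dot_after_normalizing, cosine_of_raw)  # the two values agree up to rounding
```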

In [43]:
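indices = pd.Series(movies_2.index, index=movies_2['title'])
# keep the first row per title so that indices[title] returns a single position
indices = indices[~indices.index.duplicated(keep='first')]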
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

In [44]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return movies_2['title'].iloc[movie_indices]

In [45]:
# Recommender based on title

title='Toy Story'

get_recommendations(title)

15348               Toy Story 3
2997                Toy Story 2
10301    The 40 Year Old Virgin
8327                  The Champ
1071      Rebel Without a Cause
11399    For Your Consideration
1932                  Condorman
3057            Man on the Moon
485                      Malice
11606              Factory Girl
Name: title, dtype: object
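
The slice `sim_scores[1:11]` relies on the queried movie sorting first with similarity 1. That may not hold when a movie has an empty overview, since its vector is all zeros. A hedged alternative is to exclude the query index explicitly and pick the top k with `heapq.nlargest`, which also avoids sorting the whole row. The function name and the sample scores are hypothetical.

```python
import heapq


def top_k_similar(scores, query_idx, k=10):
    """Return the indices of the k highest scores, skipping the query itself."""
    candidates = ((score, i) for i, score in enumerate(scores) if i != query_idx)
    return [i for score, i in heapq.nlargest(k, candidates)]


# Made-up similarity row for a movie at position 0.
row = [1.0, 0.2, 0.9, 0.0, 0.5]
nearest = top_k_similar(row, query_idx=0, k=3)
print(nearest)
```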

##### Combination based on keywords, cast, genres and director

In [46]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# this recommender uses only the first 10,000 rows (to keep the data small)
movies_data = movies[0:10000].copy()  # copy the slice to avoid SettingWithCopyWarning

credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
movies_data['id'] = movies_data['id'].apply(lambda x: x.replace('-','')).astype('int')

# merge (look up) cast, crew and keywords by movie id
movies_data = movies_data.merge(credits, on='id')
movies_data = movies_data.merge(keywords, on='id')


In [47]:
from ast import literal_eval

column=['cast','keywords','genres']

# parse each stringified list of dicts and keep only the 'name' values
for col in column:
    movies_data[col]=movies_data[col].fillna('[]').apply(literal_eval).apply(lambda items: [i['name'] for i in items] if isinstance(items, list) else [])
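
The parsing step can be tried on a single value. The sketch below uses a sample string in the same shape as the `genres` column; the helper name `extract_names` and the extra error handling are assumptions added for illustration.

```python
from ast import literal_eval


def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


# Illustrative sample value.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
genres = extract_names(sample)
print(genres)
```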


In [48]:
movies_data['crew'] = movies_data['crew'].fillna('[]').apply(literal_eval)

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []
movies_data['director'] = movies_data['crew'].apply(get_director)
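
For reference, this is how the director lookup behaves on a small hand-made crew list. The variant below uses `dict.get` so that entries without a `job` key are skipped instead of raising; that change and the sample names are assumptions for illustration.

```python
import math


def get_director(crew):
    """Return the name of the first crew member whose job is 'Director', else NaN."""
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return math.nan


# Hypothetical crew entries.
crew = [
    {"job": "Producer", "name": "Person A"},
    {"job": "Director", "name": "Person B"},
    {"name": "Person C"},  # no 'job' key
]
director = get_director(crew)
print(director)
```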

In [49]:
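def create_soup(x):
    # Note: ' '.join(x['title']) joins the characters of the title, not its words;
    # the resulting single-letter tokens are ignored by CountVectorizer's default tokenizer.
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['title'])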
movies_data['soup'] = movies_data.apply(create_soup, axis=1)
movies_data['soup'] = movies_data['soup']+" "+ movies_data['director'].astype(str)
movies_data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,weighted_rating,cast,crew,keywords,director,soup
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,7.680947,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...",John Lasseter,jealousy toy boy friendship friends rivalry bo...
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,6.873964,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[board game, disappearance, based on children'...",Joe Johnston,board game disappearance based on children's b...
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,6.189249,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, best friend, duringcreditsstinger, o...",Howard Deutch,fishing best friend duringcreditsstinger old m...
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,5.812777,"[Whitney Houston, Angela Bassett, Loretta Devi...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[based on novel, interracial relationship, sin...",Forest Whitaker,based on novel interracial relationship single...
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,5.681495,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlife crisis, confidence, aging, daug...",Charles Shyer,baby midlife crisis confidence aging daughter ...
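
One detail of `create_soup` is worth checking in isolation: calling `' '.join(...)` on a string inserts a space between every character. The sketch below applies the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer` (tokens of two or more word characters) to show that a title joined this way contributes no tokens, whereas the plain title does.

```python
import re

# Default token_pattern documented for CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

title = "Toy Story"
joined_chars = " ".join(title)  # spaces between every character

tokens_from_joined = TOKEN_PATTERN.findall(joined_chars.lower())
tokens_from_title = TOKEN_PATTERN.findall(title.lower())

print(repr(joined_chars), tokens_from_joined, tokens_from_title)
```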


In [50]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies_data['soup'])

In [51]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
indices = pd.Series(movies_data.index, index=movies_data['title'])
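
Raw counts are not normalised, which is why `cosine_similarity` is used here instead of a plain dot product. A minimal bag-of-words version in pure Python, with made-up soup strings and a hypothetical helper name:

```python
import math
from collections import Counter


def count_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical soups, for illustration only.
soup_a = "alien secretagent comedy sciencefiction"
soup_b = "alien comedy sequel sciencefiction"
soup_c = "wedding romance comedy"

print(count_cosine(soup_a, soup_b), count_cosine(soup_a, soup_c))
```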

In [52]:
movies_data['release_date']=movies_data['release_date'].fillna(0)
movies_data['year']=movies_data['release_date'].str[0:4]
movies_data['new_title']=movies_data['title']+" ("+ movies_data['year']+')'

In [53]:
def get_recommendations(title, cosine_sim=cosine_sim2):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movies_data['new_title'].iloc[movie_indices]

In [54]:
title='Men in Black II'

get_recommendations(title)

1520                                Men in Black (1997)
4672                   Megiddo: The Omega Code 2 (2001)
2815                                Total Recall (1990)
7484                   Babylon 5: A Call to Arms (1999)
7014    Metalstorm: The Destruction of Jared-Syn (1983)
8853        The Hitch Hikers Guide to the Galaxy (1981)
1823                              Small Soldiers (1998)
4665                                 Big Trouble (2002)
1416                               That Darn Cat (1997)
1820                                  Armageddon (1998)
Name: new_title, dtype: object

In [55]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim2[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # shortlist the 49 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:50]
    movie_indices = [i[0] for i in sim_scores]

    # use a local name that does not shadow the global `movies` DataFrame
    candidates = movies_data.iloc[movie_indices][['new_title', 'vote_count', 'vote_average','weighted_rating']]
    # keep movies with at least m votes; copy so the assignments below act on a new frame
    qualified = candidates[candidates['vote_count'] >= m].copy()
    qualified['vote_count'] = qualified['vote_count'].astype(int)
    qualified['vote_average'] = np.round(qualified['vote_average'],1)
    qualified['weighted_rating'] = np.round(qualified['weighted_rating'],1)

    # rank the shortlist by weighted rating and return the top 10
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(10)
    return qualified


In [57]:
pd.options.mode.chained_assignment = None 

title='Men in Black II'
improved_recommendations(title)

| | new_title | vote_count | vote_average | weighted_rating |
|---|---|---|---|---|
| 2858 | Fight Club (1999) | 9678 | 8.3 | 8.3 |
| 351 | Forrest Gump (1994) | 8147 | 8.2 | 8.2 |
| 256 | Star Wars (1977) | 6778 | 8.1 | 8.1 |
| 1180 | Return of the Jedi (1983) | 4763 | 7.9 | 7.9 |
| 348 | The Crow (1994) | 980 | 7.3 | 7.2 |
| 5774 | Treasure Planet (2002) | 980 | 7.2 | 7.1 |
| 2815 | Total Recall (1990) | 1745 | 7.1 | 7.1 |
| 1105 | The Abyss (1989) | 822 | 7.1 | 7.0 |
| 1520 | Men in Black (1997) | 4521 | 6.9 | 6.9 |
| 1961 | The Negotiator (1998) | 593 | 6.8 | 6.7 |
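
Note that similarity only decides which movies make the shortlist. The final order depends on weighted rating alone, which is why a highly rated but loosely related film can rank above the closest match. A stripped-down sketch of that two-stage logic, with a hypothetical helper name and made-up candidates:

```python
def rerank_by_rating(candidates, min_votes, top_n=10):
    """Drop candidates below the vote threshold, then order by weighted rating."""
    qualified = [c for c in candidates if c["vote_count"] >= min_votes]
    ranked = sorted(qualified, key=lambda c: c["weighted_rating"], reverse=True)
    return ranked[:top_n]


# Hypothetical shortlist, already ordered from most to least similar.
shortlist = [
    {"title": "Close match, few votes", "vote_count": 12, "weighted_rating": 6.0},
    {"title": "Close match, average rating", "vote_count": 4500, "weighted_rating": 6.9},
    {"title": "Loose match, high rating", "vote_count": 9600, "weighted_rating": 8.3},
]
result = [c["title"] for c in rerank_by_rating(shortlist, min_votes=50)]
print(result)
```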
