This method uses a content-based approach: it incorporates the user's genre preferences and recommends movies similar to those the user has rated highly.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [2]:
# Load movie data. The converters are left disabled, so `genres` is read as a plain string.
movies_org = pd.read_csv("../0_data/input/movies_metadata.csv")  # , converters={"genres": literal_eval, "tag": literal_eval}

In [3]:
movies = movies_org[['tmdbId', 'original_title', 'genres', 'overview', 'popularity', 'release_date', 'tagline', 'vote_average', 'vote_count']]

In [4]:
movies.head()

Unnamed: 0,tmdbId,original_title,genres,overview,popularity,release_date,tagline,vote_average,vote_count
0,862,Toy Story,"['Animation', 'Comedy', 'Family']","Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,,7.7,5415.0
1,8844,Jumanji,"['Adventure', 'Fantasy', 'Family']",When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,Roll the dice and unleash the excitement!,6.9,2413.0
2,15602,Grumpier Old Men,"['Romance', 'Comedy']",A family wedding reignites the ancient feud be...,11.7129,1995-12-22,Still Yelling. Still Fighting. Still Ready for...,6.5,92.0
3,31357,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']","Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,Friends are the people who let you be yourself...,6.1,34.0
4,11862,Father of the Bride Part II,['Comedy'],Just when George Banks has recovered from his ...,8.387519,1995-02-10,Just When His World Is Back To Normal... He's ...,5.7,173.0


In [5]:
movies.shape

(45463, 9)

In [10]:
# assign() builds a new DataFrame, so no SettingWithCopyWarning is raised
movies = movies.assign(tagline=movies['tagline'].fillna(''),
                       overview=movies['overview'].fillna(''))

In [11]:
# Concatenate tagline and overview (no separator is inserted between them)
movies = movies.assign(description=movies['tagline'] + movies['overview'])

In [13]:
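# Unigrams and bigrams with English stop words removed. min_df=1 keeps every term;
# newer scikit-learn releases may reject the integer 0 for min_df.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')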
tfidf_matrix = tf.fit_transform(movies['description'])

In [14]:
tfidf_matrix.shape

(45463, 1109556)

In [15]:
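# TF-IDF rows are L2-normalized by default, so the linear kernel (dot product) equals cosine similarity
# http://scikit-learn.org/stable/modules/metrics.html#linear-kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)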
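
The cell above relies on `TfidfVectorizer` producing unit-length rows (its default `norm='l2'`), which is why a plain dot product can stand in for cosine similarity. A minimal check on a made-up three-document corpus (the `docs` list is illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (hypothetical; stands in for movies['description'])
docs = [
    "a toy cowboy and a space ranger become friends",
    "a space ranger fights an evil emperor",
    "two old neighbours keep feuding over a woman",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# L2 norm of every row; with the default settings each should be 1
row_norms = np.sqrt(toy_tfidf.multiply(toy_tfidf).sum(axis=1))

dot = linear_kernel(toy_tfidf, toy_tfidf)      # plain dot products
cos = cosine_similarity(toy_tfidf, toy_tfidf)  # explicitly normalized

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

If both checks print `True`, the two functions are interchangeable for this matrix, and `linear_kernel` simply skips the redundant normalization step.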

In [17]:
movies = movies.reset_index()
titles = movies['original_title']
indices = pd.Series(movies.index, index=movies['original_title'])
indices.head(2)

original_title
Toy Story    0
Jumanji      1
dtype: int64

In [18]:
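def get_recommendations(title):
    idx = indices[title]
    # Duplicate titles make the lookup return a Series; use the first match
    if isinstance(idx, pd.Series):
        if len(idx) > 1:
            print("ALERT: Multiple values")
        idx = idx.iloc[0]
    # Rank every movie by similarity to the query, highest first
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0 (assumed to be the query movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]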
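
The function above indexes into a precomputed dense matrix. With 45463 movies that matrix has 45463 × 45463 entries, roughly 16.5 GB as float64. A possible alternative, sketched here on a toy corpus, is to compute a single similarity row on demand and pick the top entries with `np.argpartition`. The helper name `top_n_similar` and the `docs` list are assumptions made for illustration, and the sketch assumes the matrix has at least two rows and `n >= 1`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_n_similar(matrix, row, n=3):
    """Return (indices, scores) of the n rows most similar to `row`, excluding itself."""
    # One row against all rows: a 1 x N result instead of N x N
    scores = linear_kernel(matrix[row], matrix).ravel()
    scores[row] = -np.inf                      # never recommend the query itself
    n = min(n, len(scores) - 1)
    top = np.argpartition(-scores, n - 1)[:n]  # unordered top n
    top = top[np.argsort(-scores[top])]        # order those n by score
    return top, scores[top]

# Toy corpus (hypothetical)
docs = [
    "batman fights the joker in gotham city",
    "batman returns to gotham to face the penguin",
    "a toy cowboy befriends a space ranger",
    "the joker terrorizes gotham",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

idx, scores = top_n_similar(toy_matrix, row=0, n=2)
print(idx, scores)
```

This trades one matrix-vector product per query for not holding the full similarity matrix in memory.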

In [19]:
get_recommendations('The Dark Knight').head(10)

ALERT: Multiple values


18252                                The Dark Knight Rises
150                                         Batman Forever
1328                                        Batman Returns
21193    Batman Unmasked: The Psychology of the Dark Kn...
15511                           Batman: Under the Red Hood
20231              Batman: The Dark Knight Returns, Part 2
41973                                The Lego Batman Movie
585                                                 Batman
25266                                    Batman vs Dracula
18035                                     Batman: Year One
Name: original_title, dtype: object
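
The "ALERT: Multiple values" message appears because several rows share the title, so the label lookup returns a Series instead of a single position. A small sketch of that behaviour, using made-up titles and a hypothetical helper `first_position`:

```python
import pandas as pd

# Hypothetical titles; "Batman" appears twice, as remakes do in the real data
titles_demo = pd.Series(["Batman", "Jumanji", "Batman"])
indices_demo = pd.Series(titles_demo.index, index=titles_demo)

unique_hit = indices_demo["Jumanji"]  # a single position
dup_hit = indices_demo["Batman"]      # a Series holding both positions

def first_position(lookup, title):
    """Return the first row position for `title`, whatever shape the lookup returns."""
    hit = lookup[title]
    return int(hit.iloc[0]) if isinstance(hit, pd.Series) else int(hit)

print(first_position(indices_demo, "Jumanji"))
print(first_position(indices_demo, "Batman"))
```

Taking the first match is a convenience, not a disambiguation: for remakes it may pick a different film than the one intended, so a lookup by `tmdbId` would be more reliable.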

In [20]:
get_recommendations('Toy Story').head(10)

2997                                      Toy Story 2
15348                                     Toy Story 3
24522                                       Small Fry
23842                     Andy Hardy's Blonde Trouble
8327                                        The Champ
10301                          The 40 Year Old Virgin
43424                Andy Kaufman Plays Carnegie Hall
3057                                  Man on the Moon
38473    Superstar: The Life and Times of Andy Warhol
42718    Andy Peters: Exclamation Mark Question Point
Name: original_title, dtype: object

In [21]:
get_recommendations('Doctor Who: Last Christmas').head(10)

313                          The Santa Clause
16101                 Как я провёл этим летом
38357                            Le père Noël
41984                 The Spirit of Christmas
22177    The Life & Adventures of Santa Claus
2285                   Miracle on 34th Street
34846              Il Natale che quasi non fu
25099                              Santa Who?
36893    The Life & Adventures of Santa Claus
8856               Silent Night, Deadly Night
Name: original_title, dtype: object

In [23]:
movies[movies['original_title'] == 'Doctor Who: Last Christmas']['genres'].values[0]

"['Science Fiction', 'Adventure', 'Drama', 'Fantasy']"

In [24]:
movies[movies['original_title'] == 'Doctor Who: Last Christmas']['overview'].values[0]

'The Doctor and Clara face their Last Christmas. Trapped on an Arctic base, under attack from terrifying creatures, who are you going to call? Santa Claus!'
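
The output of In [23] shows that `genres` is stored as a quoted string rather than a list, because the `literal_eval` converter in the loading cell is commented out. A minimal sketch of how such a value could be parsed follows. The space-stripping step is an assumption added for illustration, so that a two-word genre becomes a single token:

```python
from ast import literal_eval

# The string exactly as printed by In [23]
raw = "['Science Fiction', 'Adventure', 'Drama', 'Fantasy']"

genres_list = literal_eval(raw)  # a real Python list of str

# Hypothetical normalization: drop inner spaces so each genre is one token
genre_tokens = " ".join(g.replace(" ", "").lower() for g in genres_list)

print(genres_list)
print(genre_tokens)
```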

In [25]:
get_recommendations('Inception').head(10)

7460                 Cypher
10908           Renaissance
44311                   III
9095              The Brave
25448    Closer to the Moon
29073         Dear Murderer
32013              Hollywoo
22618      The Monkey's Paw
779               Lone Star
20358    En kvinnas ansikte
Name: original_title, dtype: object

In [26]:
movies[movies['original_title'] == 'III']['overview'].values[0]

"A small European town, where sisters Ayia and Mirra live, gets struck down by an unknown disease which takes many lives. Following their mother's death, the younger sister falls ill. Having realized that conventional medicine is useless in the face of the sister's disease, Ayia seeks help from Father Herman, a parish priest and a close family friend. In his house she finds books that are very far from the conventional religion. She gets to know that only penetration into Mirra's sick subconscious mind and discovery of the true cause of her disease will give her a chance to save her sister. Ayia is ready to go through this terrifying ritual, dive into the depths of the subconscious mind, and face the demons residing there."

In [27]:
movies[movies['original_title'] == 'Inception']['overview'].values[0]

'Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a chance to regain his old life as payment for a task considered to be impossible: "inception", the implantation of another person\'s idea into a target\'s subconscious.'

In [28]:
popularity_df = movies[['popularity', 'vote_average', 'vote_count']]
popularity_df.corr()

Unnamed: 0,popularity,vote_average,vote_count
popularity,1.0,0.154399,0.559965
vote_average,0.154399,1.0,0.123607
vote_count,0.559965,0.123607,1.0


## Include genre in TF-IDF

In [29]:
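# genres is a string here, so 2*movies['genres'] repeats it twice, doubling the genre term counts
movies['description_genre'] = movies['description'] + 2*movies['genres']
# Rows with missing genres become NaN after the concatenation; blank them out
movies['description_genre'] = movies['description_genre'].fillna('')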
movies

Unnamed: 0,level_0,index,tmdbId,original_title,genres,overview,popularity,release_date,tagline,vote_average,vote_count,description,description_genre
0,0,0,862,Toy Story,"['Animation', 'Comedy', 'Family']","Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,,7.7,5415.0,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ..."
1,1,1,8844,Jumanji,"['Adventure', 'Fantasy', 'Family']",When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,Roll the dice and unleash the excitement!,6.9,2413.0,Roll the dice and unleash the excitement!When ...,Roll the dice and unleash the excitement!When ...
2,2,2,15602,Grumpier Old Men,"['Romance', 'Comedy']",A family wedding reignites the ancient feud be...,11.712900,1995-12-22,Still Yelling. Still Fighting. Still Ready for...,6.5,92.0,Still Yelling. Still Fighting. Still Ready for...,Still Yelling. Still Fighting. Still Ready for...
3,3,3,31357,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']","Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,Friends are the people who let you be yourself...,6.1,34.0,Friends are the people who let you be yourself...,Friends are the people who let you be yourself...
4,4,4,11862,Father of the Bride Part II,['Comedy'],Just when George Banks has recovered from his ...,8.387519,1995-02-10,Just When His World Is Back To Normal... He's ...,5.7,173.0,Just When His World Is Back To Normal... He's ...,Just When His World Is Back To Normal... He's ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45458,45458,45458,439050,رگ خواب,"['Drama', 'Family']",Rising and falling between a man and woman.,0.072051,,Rising and falling between a man and woman,4.0,1.0,Rising and falling between a man and womanRisi...,Rising and falling between a man and womanRisi...
45459,45459,45459,111109,Siglo ng Pagluluwal,['Drama'],An artist struggles to finish his work while a...,0.178241,2011-11-17,,9.0,3.0,An artist struggles to finish his work while a...,An artist struggles to finish his work while a...
45460,45460,45460,67758,Betrayal,"['Action', 'Drama', 'Thriller']","When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,A deadly game of wits.,3.8,6.0,A deadly game of wits.When one of her hits goe...,A deadly game of wits.When one of her hits goe...
45461,45461,45461,227506,Satana likuyushchiy,[],"In a small town live two brothers, one a minis...",0.003503,1917-10-21,,0.0,0.0,"In a small town live two brothers, one a minis...","In a small town live two brothers, one a minis..."
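
The expression `2*movies['genres']` works because multiplying a string by an integer repeats it. A small sketch of that behaviour on two made-up rows (the Series contents are illustrative, and `dtype=object` is set explicitly to mirror how `read_csv` typically stores text):

```python
import numpy as np
import pandas as pd

desc = pd.Series(["A heist inside a dream.", "No genre listed."], dtype=object)
genres = pd.Series(["['Action', 'Thriller']", np.nan], dtype=object)

# String * int repeats the string; a missing genre propagates as NaN
combined = desc + 2 * genres

print(combined[0])
print(combined.isna().tolist())
print(combined.fillna("").tolist()[1] == "")
```

Note that a row with missing genres loses its whole description after `fillna('')`, since the NaN swallows the concatenation.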


In [30]:
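# Despite the variable names, this builds a raw term-count matrix (CountVectorizer), not TF-IDF.
# min_df=1 keeps every term; newer scikit-learn releases may reject the integer 0.
tf_new = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')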
tfidf_matrix_new = tf_new.fit_transform(movies['description_genre'])

In [31]:
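# Count vectors are not length-normalized, so this is a plain dot product rather than a
# true cosine similarity; cosine_similarity (imported above) would normalize the rows
cosine_sim_new = linear_kernel(tfidf_matrix_new, tfidf_matrix_new)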
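
With raw counts, `linear_kernel` and `cosine_similarity` can rank documents differently, because the dot product grows with document length. A toy illustration (the three `docs` strings are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "space adventure",  # query
    "space adventure",  # identical to the query
    "space adventure space adventure space adventure drama romance war",  # long, repetitive
]

counts = CountVectorizer().fit_transform(docs)

dot = linear_kernel(counts, counts)      # unnormalized dot products
cos = cosine_similarity(counts, counts)  # length-normalized

# The dot product prefers the long document; cosine prefers the identical one
print(dot[0, 1], dot[0, 2])
print(cos[0, 1], cos[0, 2])
```

This is one plausible reason why the count-based recommendations lean toward movies with long descriptions; the notebook's results are left as computed with the dot product.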

In [32]:
tf_new.vocabulary_['scifi']

878800

In [34]:
# movies = movies.reset_index()
titles = movies['original_title']
indices = pd.Series(movies.index, index=movies['original_title'])
indices.head(2)

original_title
Toy Story    0
Jumanji      1
dtype: int64

In [35]:
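def get_recommendations_new(title):
    idx = indices[title]
    # Duplicate titles make the lookup return a Series; use the first match
    if isinstance(idx, pd.Series):
        if len(idx) > 1:
            print("ALERT: Multiple values")
        idx = idx.iloc[0]
    # Rank every movie by the genre-augmented similarity, highest first
    sim_scores = sorted(enumerate(cosine_sim_new[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0 (assumed to be the query movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]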

In [36]:
get_recommendations_new('Doctor Who: Last Christmas').head(10)

4149                    The Lost World
14332                           Аэлита
9952            Le Voyage dans la Lune
20105                       Slipstream
39666           キングスグレイブ ファイナルファンタジーXV
43642                             Okja
10466    Left Behind III: World at War
24905                         サカサマのパテマ
11743                         銀色の髪のアギト
33525                          Alraune
Name: original_title, dtype: object

In [51]:
movies[movies['original_title'] == 'Doctor Who: Last Christmas']['overview'].values[0]

'The Doctor and Clara face their Last Christmas. Trapped on an Arctic base, under attack from terrifying creatures, who are you going to call? Santa Claus!'

In [37]:
get_recommendations_new('Inception').head(10)

16763                         I Am Number Four
36238                                 Pandemic
5311                           Minority Report
17885                            Ticking Clock
39409                             Seventy-Nine
1092                                 The Abyss
6221                       The Matrix Reloaded
8138     Sky Captain and the World of Tomorrow
16891                      L: change the WorLd
19394                                   Looper
Name: original_title, dtype: object

In [38]:
get_recommendations_new('Avatar').head(10)

7473       Frank Herbert's Dune
9952     Le Voyage dans la Lune
10192            Fantastic Four
3872         Dungeons & Dragons
2526                   Superman
2527                Superman II
21067              Man of Steel
26564            Thor: Ragnarok
26567            Doctor Strange
38479            The Nostalgist
Name: original_title, dtype: object

### IMDb Weighted Average

In [39]:
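# v: number of votes for each movie
vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')

# R: average rating of each movie (astype('int') truncates, e.g. 7.7 becomes 7)
vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')

# C: mean rating across all movies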
C = vote_averages.mean()
C

5.244896612406511

In [40]:
m = vote_counts.quantile(0.95)
m

434.0

In [41]:
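def weighted_rating(x):
    # Reads the module-level m (vote threshold) and C (mean rating) defined above
    v = x['vote_count']
    R = x['vote_average']
    # Blend the movie's own rating R with the global mean C, weighted by vote count
    return (v/(v+m) * R) + (m/(m+v) * C)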
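
To make the formula concrete, here is a small standalone sketch that plugs in the values of `C` and `m` computed above. The function name `weighted_rating_value` is hypothetical. The first call uses the vote count and truncated rating that the notebook reports for Planet of the Apes in the results table further down, so it should land close to that row's `wr` value:

```python
def weighted_rating_value(v, R, m, C):
    """IMDb-style weighted rating: shrink R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 5.244896612406511  # mean vote_average, from the cell above
m = 434.0              # 95th-percentile vote_count, from the cell above

wr_apes = weighted_rating_value(958, 7, m, C)     # many votes: stays near R
wr_none = weighted_rating_value(0, 7, m, C)       # no votes: falls back to C
wr_huge = weighted_rating_value(10**9, 7, m, C)   # very many votes: approaches R

print(round(wr_apes, 6), round(wr_none, 6), round(wr_huge, 6))
```

The threshold `m` controls how quickly a movie's own rating overtakes the global mean: at `v == m` the two are weighted equally.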

## Improved Recommendations

In [42]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim_new[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]  # 25 closest movies, skipping the query itself
    movie_indices = [i[0] for i in sim_scores]

    movies_x = movies.iloc[movie_indices][['original_title', 'vote_count', 'vote_average']]
    vote_counts = movies_x[movies_x['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies_x[movies_x['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # Note: the local m only drives the filter below; weighted_rating() reads the
    # global m and C, so the local C is unused
    qualified = movies_x[(movies_x['vote_count'] >= m) & (movies_x['vote_count'].notnull()) &
                       (movies_x['vote_average'].notnull())].copy()  # copy avoids SettingWithCopyWarning
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
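
The filtering and scoring step can also be written without `apply` and with `m` and `C` passed explicitly. The sketch below is one way to do that on a made-up candidate table. The function name `rank_by_weighted_rating`, its parameters, and the `demo` rows are all assumptions for illustration. Unlike the notebook, it keeps the unrounded `vote_average`:

```python
import pandas as pd

def rank_by_weighted_rating(candidates, m, C, min_votes):
    """Filter by vote count, then sort by weighted rating (illustrative helper)."""
    keep = candidates["vote_count"].notna() & candidates["vote_average"].notna()
    keep &= candidates["vote_count"] >= min_votes
    qualified = candidates.loc[keep].copy()  # explicit copy: safe to add columns
    v = qualified["vote_count"]
    R = qualified["vote_average"]
    # Vectorized form of the weighted rating formula
    qualified["wr"] = v / (v + m) * R + m / (v + m) * C
    return qualified.sort_values("wr", ascending=False)

# Hypothetical candidates
demo = pd.DataFrame({
    "original_title": ["Popular", "Obscure", "Unrated", "Solid"],
    "vote_count": [5000.0, 3.0, None, 800.0],
    "vote_average": [7.0, 9.0, None, 6.5],
})

ranked = rank_by_weighted_rating(demo, m=434.0, C=5.24, min_votes=100)
print(ranked[["original_title", "wr"]])
```

Passing `m` and `C` as arguments makes it explicit which threshold and mean the score uses, which is easy to lose track of when a helper reads them from the enclosing scope.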

In [43]:
improved_recommendations('Doctor Who: Last Christmas').head(10)

Unnamed: 0,original_title,vote_count,vote_average,wr
2416,Planet of the Apes,958,7,6.452791
43642,Okja,795,7,6.380216
9952,Le Voyage dans la Lune,314,7,5.981665
10561,Zathura: A Space Adventure,808,6,5.736139
24905,サカサマのパテマ,85,7,5.532341
39666,キングスグレイブ ファイナルファンタジーXV,201,6,5.483914
5276,Silent Running,179,6,5.465392
34775,Doctor Who: The Next Doctor,51,7,5.429454
7401,Explorers,120,6,5.408457
7473,Frank Herbert's Dune,114,6,5.40198


In [44]:
improved_recommendations('Avatar').head(10)

Unnamed: 0,original_title,vote_count,vote_average,wr
23358,X-Men: Days of Future Past,6155,7,6.884396
26567,Doctor Strange,5880,7,6.879361
9952,Le Voyage dans la Lune,314,7,5.981665
21067,Man of Steel,6462,6,5.952478
2526,Superman,1042,6,5.777971
2527,Superman II,642,6,5.695432
1811,Small Soldiers,522,6,5.657202
2528,Superman III,500,5,5.113796
10192,Fantastic Four,3040,5,5.030594
13566,Dragonball Evolution,475,2,3.549269
