# Motivation/Introduction
For a few years now, I have been keeping a record of the movies I've watched, their genres, and my personal ratings. I use these records to pick what to watch next, but doing this by hand can be a hassle. So, I wanted to see if I could build something to help automate the process of finding new movies to watch.

A simple movie recommendation system would do exactly this. So, this project both helps me in my quest for entertainment and, as an added benefit, gives me a good, basic foundation in machine learning and data science.

As you will see, the recommender combines demographic filtering, content-based filtering, and (as a bonus) collaborative filtering, so recommendations can draw on overall vote statistics, movie attributes, and user ratings.

I started off by importing the TMDB 5000 Movie Dataset, a dataset containing a whole slew of data for 5000+ movies.

In [1]:
import pandas as pd
import numpy as np

credits = pd.read_csv('./tmdb_5000_credits.csv')
movies = pd.read_csv('./tmdb_5000_movies.csv')

I then took a look at each of the datasets, including what features they have and how the data is formatted.

In [2]:
credits.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [3]:
movies.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


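Here, I renamed the "movie_id" feature in the credits df to "id" so that I could merge the two dfs on that feature. I also renamed the credits "title" feature to "tittle" so that it would not collide with the "title" feature already in the movies df. Here is a look at the new merged df.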

In [4]:
credits.columns = ['id','tittle','cast','crew']
movies = movies.merge(credits, on = 'id')
movies.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


# Demographic Filtering
To score each of the movies for the demographic filtering process, I used IMDB's weighted rating calculation:

$$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

So, first I calculated C by taking the average of the 'vote_average' feature:

In [5]:
C = movies['vote_average'].mean()
print("Value of C: ", C)

Value of C:  6.092171559442016


After seeing that the average rating was about 6.1, I chose a value for m, the minimum number of votes required to be listed in the chart. This was a somewhat arbitrary selection, but I chose the 85th percentile as the cutoff, meaning that a movie had to have more votes than 85% of the movies in the list to be included.

In [6]:
m = movies['vote_count'].quantile(0.85)
print("Value of m: ", m)

Value of m:  1300.6999999999998
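
The fractional value of m is most likely a result of interpolation: as far as I know, pandas' quantile() defaults to linear interpolation between the two nearest sorted values. The sketch below reimplements that idea in plain Python on a made-up list of vote counts. The helper name linear_quantile and the toy numbers are my own assumptions and are not part of the notebook.

```python
import math

def linear_quantile(values, q):
    """Return the q-th quantile using linear interpolation between sorted values."""
    ordered = sorted(values)
    # Fractional (0-based) position of the quantile within the sorted data.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Hypothetical vote counts, not taken from the dataset.
toy_vote_counts = [10, 20, 30, 40]
toy_m = linear_quantile(toy_vote_counts, 0.85)
print(toy_m)  # lands a little more than halfway between 30 and 40
```

With only four values, the 85% position falls between the third and fourth entries, which is why the result is not one of the original numbers.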


Then, I filtered the qualifying movies based on the value of m. Essentially, this step kept the top 15% of movies by vote count. As you can see, 721 movies qualified.

In [7]:
qualifying_movies = movies.copy().loc[movies['vote_count'] >= m]
qualifying_movies.shape

(721, 23)

Now I was ready to calculate the ratings, which I did below, with v as the vote count and R as the vote average. I then stored these weighted ratings as a new feature called 'score'.

In [8]:
def weighted_rating(x, m = m, C = C):
    v, R = x['vote_count'], x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)

qualifying_movies['score'] = qualifying_movies.apply(weighted_rating, axis = 1)

Finally, I sorted the qualifying movies by their weighted scores, allowing me to see the top 25 movies according to the IMDB weighted rating calculation (the first 10 are shown below). Having watched about half of the movies below, I could tell that this weighted rating system works pretty well.

In [9]:
qualifying_movies = qualifying_movies.sort_values('score', ascending = False)
qualifying_movies[['title', 'vote_count', 'vote_average', 'score']].head(25)

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.170528
662,Fight Club,9413,8.3,8.031958
3232,Pulp Fiction,8428,8.3,8.00482
65,The Dark Knight,12002,8.2,7.993903
3337,The Godfather,5893,8.4,7.982719
96,Inception,13752,8.1,7.926504
809,Forrest Gump,7927,8.2,7.902889
95,Interstellar,10867,8.1,7.885368
329,The Lord of the Rings: The Return of the King,8064,8.1,7.821125
1990,The Empire Strikes Back,5879,8.2,7.818138
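
As a sanity check, the first row of the table above can be reproduced by hand from the printed values of C and m. This is a standalone sketch: the function takes plain numbers instead of a DataFrame row, and the second movie (same average, only 50 votes) is a made-up example to show how few votes pull a score toward C.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values printed earlier in this notebook.
C = 6.092171559442016
m = 1300.6999999999998

# First row of the table above: The Shawshank Redemption.
shawshank = weighted_rating(v=8205, R=8.5, m=m, C=C)
print(round(shawshank, 6))

# A hypothetical movie with the same 8.5 average but only 50 votes
# ends up much closer to the global mean C.
few_votes = weighted_rating(v=50, R=8.5, m=m, C=C)
print(round(few_votes, 6))
```

The more votes a movie has relative to m, the more its own average dominates the score.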


At this point, I had finished the demographic filtering portion, and technically I was done creating the recommendation system, although it was still very surface-level. To improve the recommendations, I added content-based filtering.

# Content-Based Filtering
To implement content-based filtering, I took into account the following "content": directors/actors, genres, and plot keywords. Specifically, I got the name of the director, and for the actors, genres, and plot keywords, I took the top 3 from each feature. These features are stored in the CSV as strings that look like lists of dictionaries, so I first converted them into actual Python objects using literal_eval.

In [10]:
from ast import literal_eval

for feature in ['cast', 'crew', 'keywords', 'genres']:
    movies[feature] = movies[feature].apply(literal_eval)
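
To show what this conversion does to a single cell, here is a small sketch using a shortened genres value in the same shape as the raw data. The string is abbreviated by hand for illustration.

```python
from ast import literal_eval

# A shortened genres cell in the same shape as the raw CSV column:
# one string holding a list of {"id", "name"} dictionaries.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval only evaluates Python literals, so it is safer than eval().
parsed_genres = literal_eval(raw_genres)
genre_names = [genre["name"] for genre in parsed_genres]

print(type(raw_genres).__name__, "->", type(parsed_genres).__name__)
print(genre_names)
```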

Then, I used get_director() to get the director from the crew feature and get_top3() to get the top 3 from the cast, keywords, and genres features. I then redefined these features (and created a new director feature) using the defined functions. These new features are shown for the first 3 movies below.

In [11]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

def get_top3(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return [] # Return empty list in case of invalid data

movies['director'] = movies['crew'].apply(get_director)

for feature in ['cast', 'keywords', 'genres']:
    movies[feature] = movies[feature].apply(get_top3)

movies[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"


Now, with these new features, I lowercased all of the strings and removed the spaces within them so that the vectorizer wouldn't confuse two different people who share part of a name (Sean Connery and Sean Penn, for example). After this, the features looked like this:

In [12]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Clean the director name if it is a string; if it is missing, return an empty string
        return str.lower(x.replace(" ", "")) if isinstance(x, str) else ''

for feature in ['cast', 'keywords', 'director', 'genres']:
    movies[feature] = movies[feature].apply(clean_data)

movies[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Avatar,"[samworthington, zoesaldana, sigourneyweaver]",jamescameron,"[cultureclash, future, spacewar]","[action, adventure, fantasy]"
1,Pirates of the Caribbean: At World's End,"[johnnydepp, orlandobloom, keiraknightley]",goreverbinski,"[ocean, drugabuse, exoticisland]","[adventure, fantasy, action]"
2,Spectre,"[danielcraig, christophwaltz, léaseydoux]",sammendes,"[spy, basedonnovel, secretagent]","[action, adventure, crime]"
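
The reason for removing the spaces is easier to see with a toy example. The sketch below uses a simple whitespace split as a stand-in for the vectorizer's tokenization (an assumption on my part, since the real tokenizer has more rules), and the helper name tokens is hypothetical.

```python
def tokens(names, strip_spaces):
    """Split a list of names into lowercase tokens, optionally joining each name first."""
    result = set()
    for name in names:
        cleaned = name.replace(" ", "") if strip_spaces else name
        result.update(cleaned.lower().split())
    return result

movie_a_cast = ["Sean Connery"]  # illustrative one-person cast lists
movie_b_cast = ["Sean Penn"]

# Without cleaning, both movies share the token "sean".
shared_raw = tokens(movie_a_cast, strip_spaces=False) & tokens(movie_b_cast, strip_spaces=False)
# After cleaning, "seanconnery" and "seanpenn" have nothing in common.
shared_clean = tokens(movie_a_cast, strip_spaces=True) & tokens(movie_b_cast, strip_spaces=True)

print(shared_raw)
print(shared_clean)
```

Without the cleaning step, the shared first name would count as overlapping content and inflate the similarity between two unrelated movies.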


Now that all of the "content" data was ready, I combined these features into one new feature so that I could easily feed it into the vectorizer.

In [13]:
def join_data(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

movies['content_data'] = movies.apply(join_data, axis=1)

Now that the data was all ready, I needed to define a recommendation function and use it to generate recommendations given a specific movie title. To do this, I first created a reverse mapping from a movie's title to its index in the df (the title is the input to the function). Then, for the recommendation function, I used the following steps:
1. Get the movie index given title.
2. Get similarity scores for that movie with all other movies.
3. Sort by highest similarity scores and return the 10 highest (ignoring the 1st because it's the original movie).


In [14]:
indices = pd.Series(movies.index, index = movies['title']).drop_duplicates()

def get_recommendations(title, cosine_sim):
    sim_scores = list(enumerate(cosine_sim[indices[title]]))
    highest_sim = sorted(sim_scores, key = lambda x: x[1], reverse = True)[1:11]
    return movies['title'].iloc[[i[0] for i in highest_sim]]
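
One caveat about the [1:11] slice: it assumes the queried movie always sorts first. That should hold in most cases, but if another movie had an identical content vector (a tie at the top similarity), the slice could drop the wrong entry. The sketch below is a possible alternative, not what the notebook does; it excludes the query by index instead, using a made-up similarity row and a hypothetical helper name.

```python
def top_similar(sim_row, query_index, k=10):
    """Return the indices of the k most similar items, excluding the query itself."""
    scored = [(i, score) for i, score in enumerate(sim_row) if i != query_index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

# Hypothetical similarity row for the movie at index 2.
# Index 0 happens to tie with the query at a similarity of 1.0.
toy_row = [1.0, 0.3, 1.0, 0.7, 0.0]

# Slice-based ranking, as in get_recommendations: skip whatever sorts first.
ranked = sorted(enumerate(toy_row), key=lambda pair: pair[1], reverse=True)
naive = [i for i, _ in ranked[1:4]]

print(naive)                                   # still contains the query index 2
print(top_similar(toy_row, query_index=2, k=3))
```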

With the recommendation function defined, I used CountVectorizer to turn each movie's content_data string into a vector of token counts, and then computed the cosine similarity between every pair of these vectors to measure how similar two movies are. Sample movie recommendation outputs are shown below.

In [15]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['content_data'])

cosine_sim = cosine_similarity(count_matrix, count_matrix)

df2 = movies.reset_index() # Reset index of the main df
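
To make the count-vector and cosine-similarity step more concrete, here is a small pure-Python sketch for two movies. It splits on whitespace rather than using CountVectorizer's own tokenizer and stop-word list, so treat it as an approximation. The two strings are rebuilt by hand from the cleaned rows shown earlier.

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity between the token-count vectors of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# content_data strings rebuilt by hand from the cleaned rows shown earlier.
avatar = ("cultureclash future spacewar samworthington zoesaldana "
          "sigourneyweaver jamescameron action adventure fantasy")
spectre = ("spy basedonnovel secretagent danielcraig christophwaltz "
           "léaseydoux sammendes action adventure crime")

similarity = cosine_from_counts(avatar, spectre)
print(similarity)
```

Each string has 10 distinct tokens and the two share only "action" and "adventure", so the similarity comes out at about 0.2.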

In [16]:
get_recommendations('The Dark Knight Rises', cosine_sim)

65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: title, dtype: object

In [17]:
get_recommendations('The Godfather', cosine_sim)

867      The Godfather: Part III
2731      The Godfather: Part II
4638    Amidst the Devil's Wings
2649           The Son of No One
1525              Apocalypse Now
1018             The Cotton Club
1170     The Talented Mr. Ripley
1209               The Rainmaker
1394               Donnie Brasco
1850                    Scarface
Name: title, dtype: object

# Collaborative Filtering

Now, this was already good enough for my purposes, but it still felt incomplete because it didn't take a user's preferences into account: it recommended the same movies regardless of a user's tastes. So, even though I didn't really need this part, I added a collaborative filtering aspect to the recommendation system. Specifically, I used the SVD algorithm from the Surprise library, a matrix-factorization approach inspired by Singular Value Decomposition. For this part, I also needed to load another dataset that included user IDs.
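
For intuition, here is a minimal sketch of the kind of prediction rule behind SVD-style recommenders, assuming the biased matrix-factorization form described in the Surprise documentation (global mean plus user bias plus item bias plus the dot product of user and item factors). All of the numbers and the function name are made up for illustration. A fitted model learns these values from the ratings.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization prediction: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Hypothetical values; none of these come from the trained model.
estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.2,        # this user rates slightly above average
    item_bias=-0.1,       # this movie is rated slightly below average
    user_factors=[0.5, -1.0],
    item_factors=[0.4, 0.1],
)
print(round(estimate, 2))
```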

In [18]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

reader = Reader()
svd = SVD()

ratings = pd.read_csv('./ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


I then used 5-fold cross-validation to evaluate the RMSE and MAE, which show how well ratings were being predicted for movies given a user.
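
For reference, the two error measures can be computed by hand as in the sketch below. The ratings and predictions are made up purely to show the arithmetic, and the helper names are my own.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings and predictions, only to show the arithmetic.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(toy_rmse, toy_mae)
```

Both measures are in the same units as the ratings, so an RMSE below 1 means predictions are typically off by less than one rating point.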

In [19]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 5, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8972  0.8981  0.8959  0.8933  0.8952  0.8959  0.0017  
MAE (testset)     0.6907  0.6932  0.6905  0.6871  0.6898  0.6902  0.0019  
Fit time          0.58    0.54    0.53    0.49    0.52    0.53    0.03    
Test time         0.05    0.12    0.05    0.09    0.05    0.07    0.03    


{'test_rmse': array([0.89720755, 0.898102  , 0.89593369, 0.8932635 , 0.89523529]),
 'test_mae': array([0.69068737, 0.69318954, 0.69048824, 0.68710795, 0.68977411]),
 'fit_time': (0.5830888748168945,
  0.5403509140014648,
  0.5327928066253662,
  0.4886610507965088,
  0.5150940418243408),
 'test_time': (0.046504974365234375,
  0.11946702003479004,
  0.04721713066101074,
  0.09434986114501953,
  0.04523515701293945)}

Seeing that the mean RMSE was only about 0.90, I decided that it was good enough for me to train the model on the full dataset.

In [20]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x280ada490>

And that's all there is to it. Although simple, this is a good starting point for a movie recommendation system, and I can expand on this project in the future.