# We are going to create a simple recommender system. We tweak two parameters to produce our own version of the IMDB Top 250 movie list. This is far from a personalized recommender system: we simply use our own preferences for those two parameters to come up with a new list of 'best' or 'relevant' movies.

In [2]:
#Import the relevant packages: pandas (loading and manipulating the dataset) and numpy (numerical operations)
import pandas as pd
import numpy as np

#Import the dataset
df = pd.read_csv('../data/movies_metadata.csv')
df.head()


#All the way on the right-hand side, we see two important variables: vote_average and vote_count.
#These represent the average rating users gave a movie and how many users rated it. Ignore the message in red.

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### You could also check this on your own, but the movie database contains many movies that are rather obscure, meaning few users have rated them. You can already see that in the output above (Waiting to Exhale, for example, has only 34 votes). Such movies are more 'sensitive': a small group of users giving high ratings can skew our results.

#### The main challenge is, therefore, to strike the right balance between the number of votes and the rating. Let's explore the number of votes in the dataset first.

In [11]:
#The formula we will work with to compute a score is: (v/(v+m) * R) + (m/(m+v) * C)
#where v is the movie's number of votes, m the minimum number of votes required, R the movie's average rating,
#and C the mean rating across the dataset (computed further down on all movies in df, not only the filtered ones).

#Let's first explore 'm'
#Option 1: the number of votes garnered by the 80th percentile movie
#m = df['vote_count'].quantile(0.80)
#Option 2: a fixed threshold. The outputs shown in this notebook were produced with m = 500
m = 500
m

500
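To get a feel for how the choice of `m` trims the catalogue, it can help to compare several percentiles of the vote counts side by side. The sketch below is only an illustration: the `toy_votes` series contains made-up numbers (not the movie dataset), and `threshold_table` is a hypothetical helper name. It relies only on `Series.quantile`.

```python
import pandas as pd

# Hypothetical vote counts, invented for illustration (not from movies_metadata.csv)
toy_votes = pd.Series([3, 8, 12, 20, 34, 92, 173, 600, 2413, 5415], name='vote_count')

def threshold_table(votes, percentiles=(0.50, 0.80, 0.90, 0.95)):
    """For each percentile, return the candidate threshold m and how many movies have at least m votes."""
    rows = []
    for p in percentiles:
        m_candidate = votes.quantile(p)
        rows.append({
            'percentile': p,
            'm': m_candidate,
            'movies_kept': int((votes >= m_candidate).sum()),
        })
    return pd.DataFrame(rows)

table = threshold_table(toy_votes)
print(table)
```

On the real data, the same helper could be called as `threshold_table(df['vote_count'])`. A higher percentile raises `m` and leaves fewer, better-known movies.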

In [12]:
#We only wish to consider movies of a certain length; you can play around with these parameters as well to see how the output changes

#Only consider movies with a runtime of at least 45 minutes and at most 300 minutes
q_movies = df[(df['runtime'] >= 45) & (df['runtime'] <= 300)]

#Only consider movies that have garnered at least m votes
#(.copy() gives an independent DataFrame, so adding a column later does not trigger a SettingWithCopyWarning)
q_movies = q_movies[q_movies['vote_count'] >= m].copy()

#Inspect the number of movies that made the cut
q_movies.shape

(2052, 24)

In [13]:
# Calculate C, the mean vote_average across all movies in df
C = df['vote_average'].mean()
C

5.618207215133889
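With `m` and `C` in hand, the formula from the earlier cell can be checked by hand. This is a minimal sketch, assuming `m = 500` and the value of `C` printed above, and using the vote counts and ratings of two movies from the first output table. The helper name `shrunk_rating` is hypothetical. Waiting to Exhale has fewer than `m` votes and would be filtered out; it is included only to show how strongly a sparsely rated movie is pulled towards `C`.

```python
def shrunk_rating(v, R, m, C):
    """Weighted rating: (v/(v+m) * R) + (m/(m+v) * C)."""
    return (v / (v + m)) * R + (m / (m + v)) * C

m_example = 500                 # threshold printed in the earlier cell
C_example = 5.618207215133889   # mean vote_average printed above

# (vote_count, vote_average) taken from the df.head() output
toy_story = shrunk_rating(5415.0, 7.7, m_example, C_example)
waiting_to_exhale = shrunk_rating(34.0, 6.1, m_example, C_example)

print('Toy Story:', round(toy_story, 3))
print('Waiting to Exhale:', round(waiting_to_exhale, 3))
```

A movie with many votes keeps a score close to its own rating `R`, while a movie with few votes ends up close to the dataset mean `C`, whatever its raw rating.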

In [14]:
## Let's calculate the score for each movie given our parameters
# Function to compute the IMDB weighted rating for each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)

In [17]:
# Compute the score using the weighted_rating function defined above
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
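Row-wise `apply` is easy to read but can be slow on large frames. As an alternative sketch (not what this notebook runs), the same formula can be written with whole-column arithmetic. The frame below is a three-row stand-in copied from rows in this notebook's top-movies output, and the function name `weighted_rating_vectorized` is hypothetical.

```python
import pandas as pd

def weighted_rating_vectorized(frame, m, C):
    """Same formula as weighted_rating, computed on whole columns at once."""
    v = frame['vote_count']
    R = frame['vote_average']
    return (v / (v + m)) * R + (m / (m + v)) * C

# Stand-in rows copied from the output tables in this notebook
sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight'],
    'vote_count': [8358.0, 6024.0, 12269.0],
    'vote_average': [8.5, 8.5, 8.3],
})

# m and C as printed earlier in the notebook
sample['score'] = weighted_rating_vectorized(sample, m=500, C=5.618207215133889)
print(sample.sort_values('score', ascending=False))
```

On the full data this would read `q_movies['score'] = weighted_rating_vectorized(q_movies, m, C)` and should give the same scores as the `apply` version.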

In [18]:
#Actually use the score: Sort movies in descending order of their scores
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 25 movies
q_movies[['title', 'vote_count', 'vote_average', 'score', 'runtime']].head(25)

Unnamed: 0,title,vote_count,vote_average,score,runtime
314,The Shawshank Redemption,8358.0,8.5,8.337334,142.0
834,The Godfather,6024.0,8.5,8.279139,175.0
12481,The Dark Knight,12269.0,8.3,8.194988,152.0
2843,Fight Club,9678.0,8.3,8.168255,139.0
292,Pulp Fiction,8670.0,8.3,8.153774,154.0
351,Forrest Gump,8147.0,8.2,8.050712,142.0
522,Schindler's List,4436.0,8.3,8.028344,195.0
23673,Whiplash,4376.0,8.3,8.025001,105.0
15480,Inception,14075.0,8.1,8.014861,148.0
1154,The Empire Strikes Back,5998.0,8.2,8.001339,124.0
