### The Internet Movie Database (IMDB) [link](https://tutorialedge.net/python/building-imdb-top-250-clone-pandas/)
IMDB maintains a chart called the IMDB Top 250, which is a ranking of the top 250 movies according to a certain scoring metric.

All the movies in this list are non-documentary, theatrical releases with a runtime of at least 45 minutes and over 250,000 ratings.

This chart can be considered the simplest of recommenders.

Building the simple recommender is fairly straightforward. The steps are as follows:
1. Choose a metric (or score) to rate the movies on
2. Decide on the prerequisites for the movie to be featured on the chart (conditions)
3. Calculate the score for every movie that satisfies the conditions
4. Output the list of movies in decreasing order of their scores

#### The metric
Why not just use the rating?

The rating alone does not take the popularity of a movie into consideration. Ranked by rating only, a movie rated 9 by 100,000 users would be placed below a movie rated 9.5 by 100 users, even though far fewer people stand behind the second score.

\begin{equation}
\text{weighted rating (WR)} = \left(\frac{v}{v+m}\right) R + \left(\frac{m}{v+m}\right) C
\end{equation}

- $v$ is the number of votes garnered by the movie
- $m$ is the minimum number of votes required for the movie to be in the chart (the prerequisite); the higher the value of $m$, the higher the emphasis on the popularity of a movie, and therefore the higher the selectivity
- $R$ is the mean rating of the movie
- $C$ is the mean rating of all the movies in the dataset
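
Rearranging the formula can make the role of $m$ easier to see. This is only algebra on the definition above, not an additional assumption:

\begin{equation}
WR = C + \frac{v}{v+m}\,(R - C)
\end{equation}

Read this way, WR equals $C$ when $v = 0$, sits exactly halfway between $R$ and $C$ when $v = m$, and approaches $R$ as $v$ grows much larger than $m$.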

From these definitions, the more votes an individual movie has, the stronger the weight on the movie's own rating $R$ and the weaker the weight on the average rating $C$ of all movies.

In other words, the equation says: "The more votes a movie has, the more we trust its rating."
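
As a quick check of that intuition, here is a minimal sketch that plugs the two movies from the earlier example into the formula. The function name `weighted_rating_example` and the values chosen for `m` and `C` are assumptions made for illustration, not values taken from the dataset.

```python
def weighted_rating_example(v, R, m, C):
    """Weighted rating for one movie: v votes, mean rating R, threshold m, global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumed, not computed from the dataset)
m_example, C_example = 50, 5.6

# A movie rated 9 by 100,000 users versus a movie rated 9.5 by 100 users
popular = weighted_rating_example(v=100_000, R=9.0, m=m_example, C=C_example)
niche = weighted_rating_example(v=100, R=9.5, m=m_example, C=C_example)

print(round(popular, 2), round(niche, 2))
```

With these assumed values the widely rated movie keeps a score close to 9, while the movie with only 100 votes is pulled down towards `C` and ends up near 8.2, so the ordering flips compared with ranking by raw rating.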

More on the explanation: [link](https://www.reddit.com/r/statistics/comments/1niai5/imbd_weighted_average/)



In [1]:
import pandas as pd
import numpy as np
#Load the dataset into a pandas dataframe
df = pd.read_csv('../data/movies_metadata.csv')
#Display the first five movies in the dataframe
df.head()




Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [2]:
df.shape

(45466, 24)

#### Prerequisites
- $m$: the vote count at the 80th percentile; a movie needs at least this many votes
- runtime between 45 and 300 minutes

In [3]:
#Calculate the number of votes garnered by the 80th percentile movie

m = df['vote_count'].quantile(0.80)
m # 50.0: roughly 20% of the 45466 movies (about 9000) have at least 50 votes

50.0
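
To make the percentile cutoff concrete, the sketch below runs `quantile(0.80)` on a small made-up series of vote counts. The numbers and the names `votes`, `m_toy` and `share_kept` are hypothetical; only the `quantile` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical vote counts for ten movies (not from the dataset)
votes = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

# Value at the 80th percentile; pandas interpolates linearly by default
m_toy = votes.quantile(0.80)

# Share of movies whose vote count is at or above the cutoff
share_kept = (votes >= m_toy).mean()
print(m_toy, share_kept)
```

Here the cutoff falls between the eighth and ninth values, and only the two most voted movies pass it, which is the same "keep roughly the top 20%" behaviour the prerequisite relies on.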

In [4]:
#Only consider movies that are at least 45 minutes and at most 300 minutes long
q_movies = df[(df['runtime'] >= 45) & (df['runtime'] <= 300)]
#Only consider movies that have garnered at least m votes
#(copy so that adding a column later does not act on a view of df)
q_movies = q_movies[q_movies['vote_count'] >= m].copy()
#Inspect the number of movies that made the cut
q_movies.shape # almost 20% of the dataset

(8963, 24)

#### Calculate the score 

We still need $C$, the mean rating of all the movies in the dataset.


In [5]:
C = df['vote_average'].mean()
C # for comparison, IMDB's own value was reportedly 7.0 as of 2013

5.618207215134185

In [6]:
def weighted_rating(x, m=m, C=C):
    """Compute the IMDB weighted rating for a single movie.

    Args:
        x (Series): one row of the dataset after applying the prerequisites
        m (float, optional): minimum number of votes required to be on the chart. Defaults to m.
        C (float, optional): the mean rating of all the movies in the dataset. Defaults to C.

    Returns:
        float: weighted rating of the movie
    """
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)


In [7]:
# Compute the score using the weighted_rating function defined above
q_movies['score'] = q_movies.apply(weighted_rating, axis=1) # calculation is done by row 
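
Row-wise `apply` is easy to read but can be slow on large frames. The following is a sketch of a column-wise alternative on a small made-up frame; the frame `toy`, the values `m_toy` and `C_toy`, and the comparison series `row_wise` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical qualified movies (values chosen for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [5000, 60, 300],
    "vote_average": [8.1, 9.0, 7.4],
})
m_toy, C_toy = 50, 5.6

v = toy["vote_count"]
R = toy["vote_average"]
# Same formula applied to whole columns at once instead of row by row
toy["score"] = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

# Row-wise version, computed only to compare the two results
row_wise = toy.apply(
    lambda x: (x["vote_count"] / (x["vote_count"] + m_toy)) * x["vote_average"]
    + (m_toy / (x["vote_count"] + m_toy)) * C_toy,
    axis=1,
)
print(toy[["title", "score"]])
```

Because `weighted_rating` only does arithmetic on `x['vote_count']` and `x['vote_average']`, passing the whole DataFrame to it instead of using `apply` should produce the same column, although that has not been verified against the full dataset here.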


#### Output and sorting

In [8]:
out_df=q_movies.sort_values(by='score',ascending=False)
out_df.iloc[:250][["original_title","score"]]


Unnamed: 0,original_title,score
10309,Dilwale Dulhania Le Jayenge,8.855148
314,The Shawshank Redemption,8.482863
834,The Godfather,8.476278
40251,君の名は。,8.366584
12481,The Dark Knight,8.289115
...,...,...
26566,Guardians of the Galaxy Vol. 2,7.579811
3080,Papillon,7.579617
5016,Smultronstället,7.579523
15348,Toy Story 3,7.579183


### The knowledge-based recommender
1. Ask the user for the genres of movies he/she is looking for
2. Ask the user for the duration
3. Ask the user for the timeline of the movies recommended
4. Using the information collected, recommend movies to the user that have a high
weighted rating (according to the IMDB formula) and that satisfy the preceding
conditions.

But first we need to do some data cleaning and transformation.

In [9]:
df.columns


Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [10]:
#Only keep those features that we require
#(copy so that later column assignments do not act on a view of the original)
df = df[['title','genres', 'release_date', 'runtime', 'vote_average',
'vote_count']].copy()
df.head()


Unnamed: 0,title,genres,release_date,runtime,vote_average,vote_count
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",1995-10-30,81.0,7.7,5415.0
1,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",1995-12-15,104.0,6.9,2413.0
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",1995-12-22,101.0,6.5,92.0
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1995-12-22,127.0,6.1,34.0
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",1995-02-10,106.0,5.7,173.0


In [11]:
#Next, let us extract the year of release from our release_date feature:
#Convert release_date into pandas datetime format (unparseable dates become NaT)
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
print(df['release_date'][0])
#Extract year from the datetime.
#Note: `x != np.nan` is always True, so missing dates (NaT) are not skipped
#here; they end up as the string 'NaT'.
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if x
!= np.nan else np.nan)
print(df['year'].isnull().value_counts())
#No value is null at this point (missing years are the string 'NaT'), so
#fillna has nothing to fill; convert_int below maps 'NaT' to 0 instead.
df['year'] = df['year'].fillna(0)


1995-10-30 00:00:00
year
False    45466
Name: count, dtype: int64


In [12]:
df['year']

0        1995
1        1995
2        1995
3        1995
4        1995
         ... 
45461     NaT
45462    2011
45463    2003
45464    1917
45465    2017
Name: year, Length: 45466, dtype: object

In [13]:
#Helper function to convert the 'NaT' strings to 0 and all other years to integers.
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        #Anything that cannot be parsed as an integer becomes 0
        return 0
#Apply convert_int to the year feature
df['year'] = df['year'].apply(convert_int)
df['year']

0        1995
1        1995
2        1995
3        1995
4        1995
         ... 
45461       0
45462    2011
45463    2003
45464    1917
45465    2017
Name: year, Length: 45466, dtype: int64
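
The `'NaT'` strings appear because a comparison with `np.nan` never filters anything out. As a sketch of an alternative, the year can be taken from the datetime accessor, where missing dates stay real missing values and `fillna` behaves as expected. The series `dates` below is made-up sample data, and this is a suggested variant rather than what the cells above do.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
print(np.nan != np.nan)

# Hypothetical release dates: one valid, one unparseable, one missing
dates = pd.Series(["1995-10-30", "not a date", None])
parsed = pd.to_datetime(dates, errors="coerce")

# .dt.year yields NaN for NaT, so fillna works before casting to int
years = parsed.dt.year.fillna(0).astype(int)
print(years.tolist())
```

This gives the same convention as `convert_int` (0 for an unknown year) without going through strings.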

In [14]:
#Drop the release_date column
df = df.drop('release_date', axis=1)
#Display the dataframe
df.head()


Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",81.0,7.7,5415.0,1995
1,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",127.0,6.1,34.0,1995
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",106.0,5.7,173.0,1995


In [15]:
df.iloc[0]['genres']


"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

The output is a stringified list of dictionaries; we need to convert it into an actual list of dictionaries.

In [16]:
#Import the literal_eval function from ast
from ast import literal_eval
#Define a stringified list and output its type
a = "[1,2,3]"
print(type(a))
#Apply literal_eval and output type
b = literal_eval(a)
print(type(b))


<class 'str'>
<class 'list'>


In [17]:
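#Naively calling list() on the string splits it into single characters
a = "[1,2,3]"
b = list(a)
print(b)
#Convert every character with convert_int (non-digits become 0), then drop the zeros
c = [convert_int(i) for i in b]
c = [i for i in c if i != 0]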

c

['[', '1', ',', '2', ',', '3', ']']


[1, 2, 3]

In [18]:
#Convert all NaN into stringified empty lists
df['genres'] = df['genres'].fillna('[]')
#Apply literal_eval to convert to the list object
df['genres'] = df['genres'].apply(literal_eval)

In [19]:
df['genres'][0] # Toy Story is Animation, Comedy, Family

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [20]:
#Convert list of dictionaries to a list of strings
df['genres'] = df['genres'].apply(lambda x: [i['name'] for i in x] if
isinstance(x, list) else [])
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[Animation, Comedy, Family]",81.0,7.7,5415.0,1995
1,Jumanji,"[Adventure, Fantasy, Family]",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"[Romance, Comedy]",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[Comedy, Drama, Romance]",127.0,6.1,34.0,1995
4,Father of the Bride Part II,[Comedy],106.0,5.7,173.0,1995
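
The two genre steps (parse the string, then keep only the names) can be seen end to end on a single value. The sketch below uses a shortened string in the same shape as the `genres` column; the variable names and the rejected expression are illustrative assumptions.

```python
from ast import literal_eval

# A stringified list of dictionaries, in the same shape as the genres column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)            # now a real list of dicts
names = [g["name"] for g in parsed]   # keep only the genre names
print(names)

# literal_eval only accepts Python literals, so other expressions are rejected
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

That restriction to literals is the reason `literal_eval` is generally preferred over `eval` for data read from a file.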


In [21]:
#The last step is to explode the genres
#column. In other words, if a particular movie has multiple genres, we will create multiple
#copies of the movie, with each movie having one of the genres.

#Create a new feature by exploding genres
s = df.apply(lambda x:pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)

print(s)
#Name the new feature as 'genre'
s.name = 'genre'
s=s.apply(str.lower)
print(s)
#Create a new dataframe gen_df by dropping the old 'genres' feature
#and adding the new 'genre'.
gen_df = df.drop('genres', axis=1).join(s) # left join
#Print the head of the new gen_df
gen_df.head()

0        Animation
0           Comedy
0           Family
1        Adventure
1          Fantasy
           ...    
45461       Family
45462        Drama
45463       Action
45463        Drama
45463     Thriller
Length: 91106, dtype: object
0        animation
0           comedy
0           family
1        adventure
1          fantasy
           ...    
45461       family
45462        drama
45463       action
45463        drama
45463     thriller
Name: genre, Length: 91106, dtype: object


Unnamed: 0,title,runtime,vote_average,vote_count,year,genre
0,Toy Story,81.0,7.7,5415.0,1995,animation
0,Toy Story,81.0,7.7,5415.0,1995,comedy
0,Toy Story,81.0,7.7,5415.0,1995,family
1,Jumanji,104.0,6.9,2413.0,1995,adventure
1,Jumanji,104.0,6.9,2413.0,1995,fantasy
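
Recent pandas versions (0.25 and later) also provide `DataFrame.explode`, which may be a shorter route to the same one-row-per-genre layout. The sketch below runs it on a made-up miniature frame; `toy` and `toy_gen` are hypothetical names, and this is an alternative to the stack-and-join cell above rather than a description of it.

```python
import pandas as pd

# Hypothetical miniature of df after the genres column holds lists of names
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "No Genre Movie"],
    "genres": [["Animation", "Comedy"], ["Adventure"], []],
})

# One row per (movie, genre); an empty list becomes a single NaN row
toy_gen = toy.explode("genres").rename(columns={"genres": "genre"})
toy_gen["genre"] = toy_gen["genre"].str.lower()
print(toy_gen)
```

As with the left join above, a movie without genres is kept as one row with a missing `genre`, and the original index is repeated for each copy of a movie.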


#### The build_chart function
Steps:
1. Get user input on their preferences
2. Extract all movies that match the conditions set by the user
3. Calculate the values of $m$ and $C$ for only these movies and proceed to build the
chart as in the previous section

In [22]:
def build_chart(gen_df, percentile=0.8):
    #Ask for preferred genres
    print("Input preferred genre")
    genre = input().lower()
    print(genre)
    #Ask for lower limit of duration
    print("Input shortest duration")
    low_time = int(input())
    print(low_time)
    #Ask for upper limit of duration
    print("Input longest duration")
    high_time = int(input())
    print(high_time)
    #Ask for lower limit of timeline
    print("Input earliest year")
    low_year = int(input())
    print(low_year)
    #Ask for upper limit of timeline
    print("Input latest year")
    high_year = int(input())
    print(high_year)
    #Define a new movies variable to store the preferred movies. Copy the
    #contents of gen_df to movies
    movies = gen_df.copy()
    #Filter based on the condition
    movies = movies[(movies['genre'] == genre) &
                    (movies['runtime'] >= low_time) &
                    (movies['runtime'] <= high_time) &
                    (movies['year'] >= low_year) &
                    (movies['year'] <= high_year)]
    #Compute the values of C and m for the filtered movies
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
    #Only consider movies that have at least m votes. Save this in a new dataframe q_movies
    q_movies = movies.copy().loc[movies['vote_count'] >= m]
    #Calculate score using the IMDB formula
    q_movies['score'] = q_movies.apply(lambda x:(x['vote_count']/(x['vote_count']+m) * x['vote_average'])+
                                                (m/(m+x['vote_count']) * C),axis=1)
    #Sort movies in descending order of their scores
    q_movies = q_movies.sort_values('score', ascending=False)
    return q_movies


In [23]:
build_chart(gen_df,0.80).head()

Input preferred genre
k
Input shortest duration
2
Input longest duration
22
Input earliest year
33
Input latest year
33


Unnamed: 0,title,runtime,vote_average,vote_count,year,genre,score
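
The run above returns an empty chart because the inputs entered match no rows. For testing, it can help to have a variant that takes the preferences as arguments instead of calling `input()`. The sketch below is one possible shape for that; `build_chart_from_params` and the toy frame `toy_gen_df` are hypothetical and not part of the notebook.

```python
import pandas as pd

def build_chart_from_params(gen_df, genre, low_time, high_time,
                            low_year, high_year, percentile=0.8):
    """Hypothetical non-interactive variant of build_chart."""
    # between() is inclusive on both ends, matching the >= / <= filters
    movies = gen_df[(gen_df["genre"] == genre.lower()) &
                    gen_df["runtime"].between(low_time, high_time) &
                    gen_df["year"].between(low_year, high_year)].copy()
    if movies.empty:
        # Nothing matched; return the empty frame with an empty score column
        return movies.assign(score=pd.Series(dtype=float))
    C = movies["vote_average"].mean()
    m = movies["vote_count"].quantile(percentile)
    q_movies = movies[movies["vote_count"] >= m].copy()
    v, R = q_movies["vote_count"], q_movies["vote_average"]
    q_movies["score"] = (v / (v + m)) * R + (m / (v + m)) * C
    return q_movies.sort_values("score", ascending=False)

# Made-up data in the same layout as gen_df
toy_gen_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [90.0, 100.0, 95.0, 120.0],
    "vote_average": [7.0, 8.0, 6.0, 9.0],
    "vote_count": [100.0, 500.0, 10.0, 400.0],
    "year": [1995, 1999, 2001, 1998],
    "genre": ["comedy", "comedy", "comedy", "drama"],
})

chart = build_chart_from_params(toy_gen_df, "Comedy", 60, 150, 1990, 2005,
                                percentile=0.5)
empty = build_chart_from_params(toy_gen_df, "k", 2, 22, 33, 33)
print(chart[["title", "score"]])
```

On the toy data the comedy chart keeps the two most voted comedies, while the second call, using the same inputs as the run above, yields an empty frame.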


In [24]:

#Convert the cleaned (non-exploded) dataframe df into a CSV file and save it in the data folder
#Set parameter index to False as the index of the DataFrame has no inherent meaning.
df.to_csv('../data/metadata_clean.csv', index=False)
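
One thing to keep in mind when this file is loaded again: CSV has no list type, so the `genres` lists are written out as text and come back as strings. The sketch below shows the round trip on a made-up two-row frame, using an in-memory buffer instead of the real file path; the names `toy`, `buffer` and `reloaded` are assumptions for illustration.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical miniature of the cleaned dataframe
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": [["Animation", "Comedy", "Family"],
               ["Adventure", "Fantasy", "Family"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
came_back_as_str = isinstance(reloaded.loc[0, "genres"], str)
print(came_back_as_str)

# Parse the strings back into lists, as was done for the raw data
reloaded["genres"] = reloaded["genres"].apply(literal_eval)
print(reloaded.loc[0, "genres"])
```

Applying `literal_eval` again after `read_csv` restores the lists, the same way it was used on the raw metadata.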
