# Rating-based simple recommender

This recommender assumes that ratings are the only signal needed to decide whether a movie should be recommended to a user. It uses the IMDB weighted-rating formula to build movie charts for recommendation (http://answers.google.com/answers/threadview/id/507508.html).
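
For reference, the weighted rating used throughout this notebook can be written as follows. The symbols match the variable names used in the code further down: $v$ is a movie's vote count, $R$ its average vote, $m$ the minimum vote count required to enter the chart, and $C$ the mean vote across all movies.

$$
WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C = \frac{vR + mC}{v+m}
$$

As $v$ grows relative to $m$, $WR$ approaches the movie's own average $R$; for movies with few votes it is pulled toward the global mean $C$.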

In [4]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from ast import literal_eval
from dask_ml.model_selection import train_test_split

### Loading data
... and taking a quick look at table layouts

In [25]:
data = dd.read_csv('../ratings.csv').set_index('movieId')
metadata = pd.read_csv(
    '../movies_metadata.csv',
    dtype={'budget': 'object', 'id': 'object', 'popularity': 'object',
           'revenue': 'float64', 'vote_count': 'float64'},
)
data.head()

movieId,userId,rating,timestamp
1,25328,1.0,858444862
1,61186,3.5,1059599803
1,171264,5.0,938851603
1,171265,3.5,1265437174
1,171269,3.5,1460147564


In [26]:
links = pd.read_csv('../links.csv').set_index('movieId')
links.head()

movieId,imdbId,tmdbId
1,114709,862.0
2,113497,8844.0
3,113228,15602.0
4,114885,31357.0
5,113041,11862.0


In [27]:
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [28]:
metadata.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

### Computing average rating for each movie

The weighted-rating formula needs each movie's vote count, which is available in the metadata table. For consistency, the chart therefore uses the average rating from that table (`vote_average`) rather than averages derived from the ratings file. <br><br>
The per-movie averages across users in the ratings file are still computed here for reference:

In [29]:
mean_ratings = data.groupby('movieId')['rating'].mean().compute()

In [30]:
mean_ratings

movieId
1         3.888157
2         3.236953
3         3.175550
4         2.875713
5         3.079565
            ...   
176267    4.000000
176269    3.500000
176271    5.000000
176273    1.000000
176275    3.000000
Name: rating, Length: 45115, dtype: float64

In the formula, C is the mean vote across all movies. It is computed from the `vote_average` column of the metadata table:

In [31]:
C = metadata.loc[:,'vote_average'].mean()
C

5.618207215134185

m is the minimum number of votes a movie needs to appear in the chart. <br>
It is set to the 95th percentile of vote counts, so only the top 5% of movies by number of votes qualify:

In [32]:
vote_count = metadata['vote_count'].dropna().astype('int')
m = vote_count.quantile(0.95)
m

434.0
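
To make the threshold concrete, the following is a minimal sketch of what a 0.95 quantile computes, written in plain Python rather than pandas. It assumes linear interpolation between neighbouring sorted values, which is the default of `Series.quantile`; the helper name `linear_quantile` and the toy vote counts are hypothetical.

```python
import math

def linear_quantile(values, q):
    """Return the q-th quantile of values using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q      # fractional index into the sorted list
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

# Toy vote counts (made up for illustration): most movies have few votes.
toy_votes = [0, 1, 2, 3, 5, 8, 10, 20, 50, 100, 1000]
threshold = linear_quantile(toy_votes, 0.95)
qualifying = [v for v in toy_votes if v >= threshold]
```

With these toy values the threshold falls halfway between 100 and 1000, so only the movie with 1000 votes qualifies.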

Since genre charts are also generated later, the `genres` column is parsed from stringified lists of dicts into plain lists of genre names:

In [33]:
metadata['genres'] = metadata['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [34]:
metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
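
The genre clean-up above can be seen in isolation on a single raw value. This sketch assumes the raw `genres` cell is a string holding a Python-style list of dicts with a `name` key, as in the metadata preview; the function name `parse_genres` is hypothetical and only repackages the same steps.

```python
from ast import literal_eval

def parse_genres(raw):
    """Turn a stringified list of genre dicts into a list of genre names."""
    if raw is None:
        raw = '[]'                  # same role as fillna('[]') in the notebook
    parsed = literal_eval(raw)      # evaluates Python literals only, no arbitrary code
    if not isinstance(parsed, list):
        return []
    return [g['name'] for g in parsed]

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
names = parse_genres(raw)
```

Here `names` holds only the genre names, and a missing or non-list value yields an empty list.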


Only movies with at least the minimum vote count `m` are considered for the chart:

In [35]:
# NaN vote counts compare as False and are excluded; .copy() avoids a SettingWithCopyWarning when a column is added later
chart_movies = metadata.loc[metadata['vote_count'] >= m, ['title', 'vote_count', 'genres', 'popularity', 'id', 'vote_average']].copy()
chart_movies.head()

Unnamed: 0,title,vote_count,genres,popularity,id,vote_average
0,Toy Story,5415.0,"[Animation, Comedy, Family]",21.946943,862,7.7
1,Jumanji,2413.0,"[Adventure, Fantasy, Family]",17.015539,8844,6.9
5,Heat,1886.0,"[Action, Crime, Drama, Thriller]",17.924927,949,7.7
9,GoldenEye,1194.0,"[Adventure, Action, Thriller]",14.686036,710,6.6
15,Casino,1343.0,"[Drama, Crime]",10.137389,524,7.8


Computing the weighted rating for each chart movie using the IMDB formula:

In [36]:
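def weighted_rating(movie):
    # v: number of votes for this movie, R: its average vote
    # m (vote threshold) and C (global mean vote) are the globals computed above
    v = int(movie['vote_count'])
    R = movie['vote_average']
    return (v*R + m*C)/(v + m)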

In [37]:
chart_movies['Weighted_Rating'] = chart_movies.apply(weighted_rating, axis = 1)

### Preparing top charts:

In [38]:
top_charts = chart_movies.sort_values('Weighted_Rating', ascending = False).head(250)

In [39]:
top_charts.head()

Unnamed: 0,title,vote_count,genres,popularity,id,vote_average,Weighted_Rating
314,The Shawshank Redemption,8358.0,"[Drama, Crime]",51.645403,278,8.5,8.357746
834,The Godfather,6024.0,"[Drama, Crime]",41.109264,238,8.5,8.306334
12481,The Dark Knight,12269.0,"[Drama, Action, Crime, Thriller]",123.167259,155,8.3,8.208376
2843,Fight Club,9678.0,[Drama],63.869599,550,8.3,8.184899
292,Pulp Fiction,8670.0,"[Thriller, Crime]",140.950236,680,8.3,8.172155
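
As a sanity check, the top rows can be reproduced by hand from the values printed earlier in this notebook (`C` ≈ 5.618207 and `m` = 434). This is a minimal sketch in plain Python; the standalone function `weighted_rating_value` is an assumption made for illustration and differs from the row-based `weighted_rating` above.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: shrinks R toward C when v is small relative to m."""
    return (v * R + m * C) / (v + m)

C = 5.618207215134185   # mean vote_average printed earlier
m = 434.0               # 95th percentile of vote_count printed earlier

shawshank = weighted_rating_value(8358, 8.5, m, C)
godfather = weighted_rating_value(6024, 8.5, m, C)
```

Both films have the same average vote of 8.5, but The Shawshank Redemption has more votes, so its score is shrunk less toward `C` and it ranks higher.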


The chart is exported to a CSV file prefixed with this model's name ('FIRST'):

In [40]:
top_charts.to_csv('../FIRST-top_charts.csv', index = False)

### Genre-based top charts
The top charts table is first normalized to 1NF (one row per movie-genre pair) so that genre-based selection can be done:

In [41]:
genre_chart = top_charts.apply(lambda x: pd.Series(x['genres']), axis = 1).stack().reset_index(level = 1, drop = True)
genre_chart.name = 'genre'
genre_chart = top_charts.drop('genres', axis = 1).join(genre_chart).sort_values('Weighted_Rating', ascending = False)

In [42]:
genre_chart

Unnamed: 0,title,vote_count,popularity,id,vote_average,Weighted_Rating,genre
314,The Shawshank Redemption,8358.0,51.645403,278,8.5,8.357746,Drama
314,The Shawshank Redemption,8358.0,51.645403,278,8.5,8.357746,Crime
834,The Godfather,6024.0,41.109264,238,8.5,8.306334,Drama
834,The Godfather,6024.0,41.109264,238,8.5,8.306334,Crime
12481,The Dark Knight,12269.0,123.167259,155,8.3,8.208376,Drama
...,...,...,...,...,...,...,...
2997,Toy Story 2,3914.0,17.547693,863,7.3,7.132130,Comedy
2997,Toy Story 2,3914.0,17.547693,863,7.3,7.132130,Family
20910,The Great Gatsby,3885.0,17.598936,64682,7.3,7.131003,Romance
20910,The Great Gatsby,3885.0,17.598936,64682,7.3,7.131003,Drama
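
The same one-row-per-genre layout can also be produced with `DataFrame.explode`, available in pandas 0.25 and later. This is an alternative sketch on made-up rows, not the method used in this notebook; the toy titles and ratings are hypothetical.

```python
import pandas as pd

# Toy chart with a list-valued genres column (values are made up).
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'genres': [['Drama', 'Crime'], ['Comedy']],
    'Weighted_Rating': [8.1, 7.4],
})

# explode repeats each row once per list element, keeping the original index.
flat = toy.explode('genres').rename(columns={'genres': 'genre'})
```

Each movie now appears once per genre, with its other columns repeated, which is the shape the genre filter relies on.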


Genre-based selection is now a simple boolean filter on the `genre` column:

In [43]:
def generate_genre_chart(genre):
    genre_top_movies = genre_chart[genre_chart['genre'] == genre]
    return genre_top_movies

In [45]:
generate_genre_chart('Romance').head()

Unnamed: 0,title,vote_count,popularity,id,vote_average,Weighted_Rating,genre
351,Forrest Gump,8147.0,48.307194,13,8.2,8.069421,Romance
10309,Dilwale Dulhania Le Jayenge,661.0,34.457024,19404,9.1,7.720002,Romance
40882,La La Land,4745.0,19.681686,313369,7.9,7.708786,Romance
22168,Her,4215.0,13.829515,152601,7.9,7.686987,Romance
7208,Eternal Sunshine of the Spotless Mind,3758.0,12.906327,38,7.9,7.663765,Romance
