# Lab 10 - Recommender Systems \[Optional Exercise - Possible Soutions & Approaches \]

One does not necessarily have to return the top recommendations as-is from the model without considering user preferences in a collaborative filtering system. Thus, for this optional exercise, your task is to augment/modify the `get_top_recommendations` function to improve the recommendations provided to a given user.

In [1]:
# Do not modify this cell

!pip install scikit-surprise --upgrade

import pandas as pd
import numpy as np
from surprise import Dataset
from surprise import Reader
from surprise import KNNBasic

try:
    ratings = pd.read_csv('../data/movie_ratings.csv')
except:
    ratings = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2020/main/data/movie_ratings.csv')

try:
    movies_db = pd.read_csv('../data/movies_db.csv')
except:
    movies_db = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2020/main/data/movies_db.csv')

# We'll set the TMDB ID as the index for quick indexing by ID
movies_db = movies_db.set_index('tmdbId')


# The Reader class is used to parse a file containing ratings
# Since we already loaded it as a dataframe, we only need to set the rating_scale parameter.
reader = Reader(rating_scale=(0.5, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(ratings[['userId', 'tmdbId', 'rating']], reader)

sim_options_user = {
    'name': 'cosine', # there are other options as well, including pearson
    'user_based': True  # compute similarities between users
}

user_knn_model = KNNBasic(k=40, min_k=1, sim_options=sim_options_user)

# Builds a training set from the entire dataset (no splitting is done)
# Needed to use the models for recommendations
trainset = data.build_full_trainset()

# Fit each model to the training set
user_knn_model.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fdfe2423370>

**Augment/Modify the function below to improve the recommendations returned to the user**

**Hint**: consider how you can filter the recommendations returned from the model based on the attributes available in the movies_db dataset. You may also test with different user IDs.

In [2]:
def get_top_recommendations(user_id, n=10):
    # Get the IDs of movies that the user has already rated
    rated_movies = ratings.loc[ratings['userId'] == user_id, 'tmdbId']
    
    # Get the IDs of movies that were not yet rated by the user
    # Note: ~ is bitwise not
    movies_to_predict = movies_db[~movies_db.index.isin(rated_movies)].index

    # Setup dataframe to use for building and sorting the movie rating predictions for the user
    user_predictions = pd.DataFrame(movies_to_predict)

    # Predict the user's rating for each of the movies that were not previously rated
    user_predictions['predicted_rating'] = user_predictions['tmdbId'].apply(lambda movie_id: user_knn_model.predict(user_id, movie_id).est)

    # Return the top n recommendations based on the predicted score (and merge with movies_db to see movie title, genre, etc.)
    return user_predictions.merge(movies_db.reset_index()).nlargest(n, 'predicted_rating')

In [3]:
# Movies the given user (id: 1) has already rated
ratings.loc[ratings['userId'] == 1].merge(movies_db.reset_index()).sort_values('rating', ascending=False)

Unnamed: 0,userId,tmdbId,rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2
4,1,11216,4.0,tt0095765,Cinema Paradiso,"A filmmaker recalls his childhood, when he fel...",it,8.2,834.0,1988.0,Drama,Romance
13,1,97,4.0,tt0084827,Tron,As Kevin Flynn searches for proof that he inve...,en,6.6,717.0,1982.0,Science Fiction,Action
12,1,1051,4.0,tt0067116,The French Connection,Tough narcotics detective 'Popeye' Doyle is in...,en,7.4,435.0,1971.0,Action,Crime
8,1,6114,3.5,tt0103874,Dracula,When Dracula leaves the captive Jonathan Harke...,en,7.1,1087.0,1992.0,Romance,Horror
19,1,11072,3.0,tt0071230,Blazing Saddles,A town – where everyone seems to be named John...,en,7.2,619.0,1974.0,Western,Comedy
1,1,11360,3.0,tt0033563,Dumbo,Dumbo is a baby elephant born with oversized e...,en,6.8,1206.0,1941.0,Animation,Family
2,1,819,3.0,tt0117665,Sleepers,Two gangsters seek revenge on the state jail w...,en,7.3,729.0,1996.0,Crime,Drama
14,1,8393,3.0,tt0080801,The Gods Must Be Crazy,Misery is brought to a small group of Sho in t...,en,7.1,251.0,1980.0,Action,Comedy
17,1,9426,2.5,tt0091064,The Fly,When Seth Brundle makes a huge scientific and ...,en,7.1,1038.0,1986.0,Horror,Science Fiction
0,1,9909,2.5,tt0112792,Dangerous Minds,Former Marine Louanne Johnson lands a gig teac...,en,6.4,249.0,1995.0,Drama,Crime


In [4]:
get_top_recommendations(1)

Unnamed: 0,tmdbId,predicted_rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2
49,49133,5.0,tt0110299,Lamerica,"Fiore, an Italian conman, arrives in post Comm...",it,7.7,11.0,1994.0,Drama,Foreign
160,48787,5.0,tt0110604,Mute Witness,"Billy is mute, but it hasn't kept her from bec...",en,6.4,36.0,1995.0,Thriller,Foreign
268,30304,5.0,tt0114129,Picture Bride,"Riyo, an orphaned 17-year old, sails from Yoko...",en,7.4,5.0,1995.0,Drama,History
276,159185,5.0,tt0110769,"Red Firecracker, Green Firecracker",A woman inherits her father's fireworks factor...,zh,7.0,2.0,1994.0,Drama,
596,753,5.0,tt0062952,Faces,An old married man leaves his wife for a young...,en,7.1,36.0,1968.0,Drama,
632,85778,5.0,tt0110480,Maya Lin: A Strong Clear Vision,A film about the work of the artist most famou...,en,0.0,0.0,1995.0,Documentary,
636,22621,5.0,tt0113280,Heavy,Victor is a cook who works in a greasy bar/res...,en,7.7,11.0,1995.0,Drama,Romance
686,48144,5.0,tt0111424,The Day the Sun Turned Cold,,zh,7.0,2.0,1994.0,,
705,11985,5.0,tt0109066,Vive L'Amour,The film focuses on three city folks who unkno...,zh,7.3,16.0,1994.0,Drama,
781,23114,5.0,tt0027893,Little Lord Fauntleroy,An American boy turns out to be the heir of a ...,en,6.6,13.0,1936.0,Drama,Family


## Solution 1 - Simple Filtering based on "Preferred" Languages and Genres

This solution suffers from the cold start problem and assumes all rated movies' genre and language as preferred, but is a possible direction for improving the recommendation output. We can add weighing to the genres, consider the rating when deciding on preference (e.g., ratings >= 3.5 would be considered "preferred" and/or given higher weight), etc. 

In [5]:
def get_top_recommendations(user_id, n=10):
    # Get the IDs of movies that the user has already rated
    rated_movies = ratings.loc[ratings['userId'] == user_id, 'tmdbId']
    
    # !!Code Addition!!
    watched_ids = movies_db.index.isin(rated_movies)
    preferred_languages = movies_db[watched_ids].original_language.unique()
    preferred_genres = pd.concat([movies_db[watched_ids].genre_1, movies_db[watched_ids].genre_2]).unique()
    
    # !!Code Change!!
    movies_to_predict = movies_db[
        ~movies_db.index.isin(rated_movies) &
        movies_db.original_language.isin(preferred_languages) & 
        (
        movies_db.genre_1.isin(preferred_genres) |
        movies_db.genre_1.isin(preferred_genres)
        )
    ].index

    # Setup dataframe to use for building and sorting the movie rating predictions for the user
    user_predictions = pd.DataFrame(movies_to_predict)

    # Predict the user's rating for each of the movies that were not previously rated
    user_predictions['predicted_rating'] = user_predictions['tmdbId'].apply(lambda movie_id: user_knn_model.predict(user_id, movie_id).est)

    # Return the top n recommendations based on the predicted score (and merge with movies_db to see movie title, genre, etc.)
    return user_predictions.merge(movies_db.reset_index()).nlargest(n, 'predicted_rating')

In [6]:
get_top_recommendations(1)

Unnamed: 0,tmdbId,predicted_rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2
47,49133,5.0,tt0110299,Lamerica,"Fiore, an Italian conman, arrives in post Comm...",it,7.7,11.0,1994.0,Drama,Foreign
140,48787,5.0,tt0110604,Mute Witness,"Billy is mute, but it hasn't kept her from bec...",en,6.4,36.0,1995.0,Thriller,Foreign
239,30304,5.0,tt0114129,Picture Bride,"Riyo, an orphaned 17-year old, sails from Yoko...",en,7.4,5.0,1995.0,Drama,History
534,753,5.0,tt0062952,Faces,An old married man leaves his wife for a young...,en,7.1,36.0,1968.0,Drama,
567,22621,5.0,tt0113280,Heavy,Victor is a cook who works in a greasy bar/res...,en,7.7,11.0,1995.0,Drama,Romance
686,23114,5.0,tt0027893,Little Lord Fauntleroy,An American boy turns out to be the heir of a ...,en,6.6,13.0,1936.0,Drama,Family
704,85328,5.0,tt0117357,The Pompatus of Love,Four guys sit around drinking beer and talking...,en,5.3,2.0,1996.0,Comedy,Romance
920,41801,5.0,tt0116293,Female Perversions,An ambitious female attorney wallows in excess...,en,5.4,10.0,1996.0,Drama,Romance
1004,2892,5.0,tt0112362,Angel Baby,Two schizophrenics meet during therapy and fal...,en,7.7,3.0,1995.0,Drama,
1018,58911,5.0,tt0116565,Hotel de Love,"10 years ago at a party, Steven thinks he sees...",en,4.3,2.0,1996.0,Comedy,Foreign


## Solution 2 - Combined User Collaborative Filtering with Demographic Filtering \[Credit: Laila Ragheb\]

In [7]:
#set the minimum number of votes to be the 90th percentile
vote_counts = movies_db[movies_db['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = movies_db[movies_db['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
m = vote_counts.quantile(0.90)
#Filter the movies to those that meet the minimum vote count
qualified_movies = movies_db[
(movies_db['vote_count'] >= m) & 
(movies_db['vote_count'].notnull()) & 
(movies_db['vote_average'].notnull())].copy()
qualified_movies.shape

(905, 9)

In [8]:
def weighted_rating(row):
    v = row['vote_count']
    R = row['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

# Apply the weighted_rating over each row in the dataframe
qualified_movies['weighted_rating'] = qualified_movies.apply(weighted_rating, axis=1)

# Get the top 250 movies according to its calculated weighted_rating
qualified_movies = qualified_movies.nlargest(250, 'weighted_rating')
qualified_movies.head(15)

Unnamed: 0_level_0,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2,weighted_rating
tmdbId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
278,tt0111161,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,en,8.5,8358.0,1994.0,Drama,Crime,8.192319
155,tt0468569,The Dark Knight,Batman raises the stakes in his war on crime. ...,en,8.3,12269.0,2008.0,Drama,Action,8.098993
238,tt0068646,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",en,8.5,6024.0,1972.0,Drama,Crime,8.091935
550,tt0137523,Fight Club,A ticking-time-bomb insomniac and a slippery s...,en,8.3,9678.0,1999.0,Drama,,8.050805
680,tt0110912,Pulp Fiction,"A burger-loving hit man, his philosophical par...",en,8.3,8670.0,1994.0,Thriller,Crime,8.025173
27205,tt1375666,Inception,"Cobb, a skilled thief who commits corporate es...",en,8.1,14075.0,2010.0,Action,Thriller,7.937729
13,tt0109830,Forrest Gump,A man with a low IQ has accomplished great thi...,en,8.2,8147.0,1994.0,Comedy,Drama,7.921858
157336,tt0816692,Interstellar,Interstellar chronicles the adventures of a gr...,en,8.1,11187.0,2014.0,Adventure,Drama,7.899681
1891,tt0080684,The Empire Strikes Back,"The epic saga continues as Luke Skywalker, in ...",en,8.2,5998.0,1980.0,Adventure,Action,7.837999
122,tt0167260,The Lord of the Rings: The Return of the King,Aragorn is revealed as the heir to the ancient...,en,8.1,8226.0,2003.0,Adventure,Fantasy,7.836282


In [9]:
# Augment/Modify the function below to improve the recommendations returned to the user
# Hint: consider how you can filter the recommendations returned from the model based on
# the attributes available in the movies_db dataset

# You may test with different user IDs

def get_top_recommendations(user_id, n=10):

    # Get the IDs of movies that the user has already rated
    rated_movies = ratings.loc[ratings['userId'] == user_id, 'tmdbId']
    
    # Get the IDs of movies that were not yet rated by the user
    # Note: ~ is bitwise not
    movies_to_predict = qualified_movies[~qualified_movies.index.isin(rated_movies)].index

    # Setup dataframe to use for building and sorting the movie rating predictions for the user
    user_predictions = pd.DataFrame(movies_to_predict)

    # Predict the user's rating for each of the movies that were not previously rated
    user_predictions['predicted_rating'] = user_predictions['tmdbId'].apply(lambda movie_id: user_knn_model.predict(user_id, movie_id).est)

    # Return the top n recommendations based on the predicted score (and merge with movies_db to see movie title, genre, etc.)
    return user_predictions.merge(qualified_movies.reset_index()).nlargest(n, 'predicted_rating')

In [10]:
get_top_recommendations(1)

Unnamed: 0,tmdbId,predicted_rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2,weighted_rating
156,296096,5.0,tt2674426,Me Before You,A small town girl is caught between dead-end j...,en,7.6,2674.0,2016.0,Drama,Romance,7.099871
192,28178,5.0,tt1028532,Hachi: A Dog's Tale,A drama based on the true story of a college p...,en,7.7,1769.0,2009.0,Drama,Family,7.004756
58,76203,4.507253,tt2024544,12 Years a Slave,"In the pre-Civil War United States, Solomon No...",en,7.9,3787.0,2013.0,Drama,History,7.444148
43,629,4.375,tt0114814,The Usual Suspects,"Held in an L.A. interrogation room, Verbal Kin...",en,8.1,3334.0,1995.0,Drama,Crime,7.547266
232,2493,4.362429,tt0093779,The Princess Bride,"In this enchantingly cracked fairy tale, the b...",en,7.6,1518.0,1987.0,Adventure,Family,6.88152
183,335,4.35703,tt0064116,Once Upon a Time in the West,This classic western masterpiece is an epic fi...,it,8.1,1160.0,1968.0,Western,,7.022486
150,96721,4.345451,tt1979320,Rush,A biographical drama centered on the rivalry b...,en,7.7,2310.0,2013.0,Drama,Action,7.114102
1,155,4.340791,tt0468569,The Dark Knight,Batman raises the stakes in his war on crime. ...,en,8.3,12269.0,2008.0,Drama,Action,8.098993
2,238,4.3375,tt0068646,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",en,8.5,6024.0,1972.0,Drama,Crime,8.091935
122,567,4.336239,tt0047396,Rear Window,"Professional photographer L.B. ""Jeff"" Jeffries...",en,8.2,1531.0,1954.0,Drama,Mystery,7.230264


## Solution 3 - All of the above and more, with fallback \[Credit: Zeina Kandil\]

In [11]:
def get_top_recommendations_enhanced(user_id, n=10):
    # Join ratings table with movies table
    movies_rating = ratings.loc[ratings['userId'] == user_id].merge(movies_db.reset_index()).sort_values('rating', ascending=True)

    # Only consider the top 90% of movies in terms of vote count 
    vote_counts = movies_db[movies_db['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies_db[movies_db['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.90)
    qualified_movies = movies_db[
        (movies_db['vote_count'] >= m) & 
        (movies_db['vote_count'].notnull()) & 
        (movies_db['vote_average'].notnull())
    ].copy()

    # Only consider movies with a genre that the user has watched before
    genres_1 = movies_rating.loc[movies_rating['userId'] == user_id, 'genre_1'].drop_duplicates().tolist()
    genres_2 = movies_rating.loc[movies_rating['userId'] == user_id, 'genre_2'].drop_duplicates().tolist()
    genres = []
    [genres.append(x) for x in genres_1] 
    [genres.append(x) for x in genres_2 if x not in genres]
    qualified_movies = qualified_movies[
        (qualified_movies['genre_1'].isin(genres)) | 
        (qualified_movies['genre_2'].isin(genres)) 
    ]

    # Only consider movies that are in a language that the user previously watched a movie in
    languages = movies_rating.loc[movies_rating['userId'] == user_id, 'original_language'].drop_duplicates()
    qualified_movies = qualified_movies[
        (qualified_movies['original_language'].isin(languages))
    ]

    # Get the IDs of movies that the user has already rated
    rated_movies = ratings.loc[ratings['userId'] == user_id, 'tmdbId']
    
    # Get the IDs of movies that were not yet rated by the user
    # Note: ~ is bitwise not
    movies_to_predict = qualified_movies[~qualified_movies.index.isin(rated_movies)].index

    # Setup dataframe to use for building and sorting the movie rating predictions for the user
    user_predictions = pd.DataFrame(movies_to_predict)

    # Predict the user's rating for each of the movies that were not previously rated
    user_predictions['predicted_rating'] = user_predictions['tmdbId'].apply(lambda movie_id: user_knn_model.predict(user_id, movie_id).est)

    # Top recommendations considering language, genre, vote_count, and predicted_rating
    primary_recommendation = user_predictions.merge(movies_db.reset_index())

    # Top recommendations based on original function
    secondary_recommendation = get_top_recommendations(user_id)

    # Merge both outputs because sometimes primary_recommendation will be empty or less than n
    # This can be because the user did not rate many movies which leads to limited options
    two_recommendations = pd.concat([primary_recommendation, secondary_recommendation]).drop_duplicates(subset=['tmdbId'])

    # Return the top n recommendations using both functions giving priority to recommendations with the genres and languages that the user is familiar with
    return two_recommendations.nlargest(n, 'predicted_rating')

In [12]:
get_top_recommendations_enhanced(1)

Unnamed: 0,tmdbId,predicted_rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2,weighted_rating
498,28178,5.0,tt1028532,Hachi: A Dog's Tale,A drama based on the true story of a college p...,en,7.7,1769.0,2009.0,Drama,Family,
818,167073,5.0,tt2381111,Brooklyn,"In 1950s Ireland and New York, young Ellis Lac...",en,7.2,1235.0,2015.0,Drama,Romance,
846,105864,5.0,tt1979388,The Good Dinosaur,An epic journey into the world of dinosaurs wh...,en,6.6,1782.0,2015.0,Adventure,Animation,
875,296096,5.0,tt2674426,Me Before You,A small town girl is caught between dead-end j...,en,7.6,2674.0,2016.0,Drama,Romance,
873,140300,4.522834,tt2267968,Kung Fu Panda 3,"Continuing his ""legendary adventures of awesom...",en,6.7,1630.0,2016.0,Action,Adventure,
715,76203,4.507253,tt2024544,12 Years a Slave,"In the pre-Civil War United States, Solomon No...",en,7.9,3787.0,2013.0,Drama,History,
494,22954,4.5,tt1057500,Invictus,Newly elected President Nelson Mandela knows h...,en,7.0,1150.0,2009.0,Drama,History,
595,50619,4.5,tt1324999,The Twilight Saga: Breaking Dawn - Part 1,The new found married bliss of Bella Swan and ...,en,5.8,2622.0,2011.0,Adventure,Fantasy,
888,324668,4.5,tt4196776,Jason Bourne,The most dangerous former operative of the CIA...,en,5.9,2386.0,2016.0,Action,Thriller,
8,629,4.375,tt0114814,The Usual Suspects,"Held in an L.A. interrogation room, Verbal Kin...",en,8.1,3334.0,1995.0,Drama,Crime,
