# Implementing Recommender Systems - Lab

## Introduction

In this lab, you'll practice creating a recommender system model using `surprise`. You'll also get the chance to create a more complete recommender system pipeline to obtain the top recommendations for a specific user.


## Objectives

In this lab you will: 

- Use `surprise`'s built-in `Reader` class to prepare data for recommender algorithms 
- Obtain a prediction for a specific user for a particular item 
- Introduce a new user with ratings to the rating matrix and make recommendations for them 
- Create a function that will return the top n recommendations for a user 


For this lab, we will be using the MovieLens dataset, specifically the `ml-latest-small` version, which holds roughly 100,000 user ratings for many different movies. In the last lesson, you were exposed to working with `surprise` datasets. In this lab, you will also go through the process of reading a dataset into the `surprise` dataset format. To begin with, load the dataset into a Pandas DataFrame. Determine which columns are necessary for your recommendation system and drop any extraneous ones.

In [4]:
import pandas as pd
df = pd.read_csv('./ml-latest-small/ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [5]:
# Drop unnecessary columns
new_df = df.drop(columns=['timestamp'])

In [6]:
# Inspect the new DataFrame
new_df.head()

   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0


It's now time to transform the dataset into something compatible with `surprise`. In order to do this, you're going to need `Reader` and `Dataset` classes. There's a method in `Dataset` specifically for loading dataframes.

In [8]:
from surprise import Reader, Dataset
# read in values as Surprise dataset 

# Defining the rating scale 
reader = Reader(rating_scale=(new_df.rating.min(), new_df.rating.max()))

# Loading the data into Surprise format
data = Dataset.load_from_df(new_df[['userId', 'movieId', 'rating']], reader)

Let's look at how many users and items we have in our dataset. If using neighborhood-based methods, this will help us determine whether we should compute user-user or item-item similarity.

In [10]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724
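
With those counts in hand, one common heuristic is to compute similarities along the smaller dimension, because the similarity matrix is either users × users or items × items. The sketch below applies that heuristic to the numbers printed above. The heuristic itself is an assumption rather than a rule from this lab, and the resulting dictionary is the kind of `sim_options` argument that `surprise`'s KNN algorithms accept.

```python
# Counts taken from the output above.
n_users, n_items = 610, 9724

# Heuristic (an assumption, not a hard rule): compare along the smaller
# dimension so the similarity matrix stays small.
user_based = n_users <= n_items

# This dict could then be passed to a KNN model,
# e.g. KNNBasic(sim_options=sim_options).
sim_options = {'name': 'msd', 'user_based': user_based}
print(sim_options)
```

Here there are far fewer users than items, so the sketch settles on user-user similarity.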


## Determine the best model 

Now, compare the different models and see which ones perform best. For consistency's sake, use RMSE to evaluate models. Remember to cross-validate! Can you get a model with a lower average RMSE on test data than 0.869?
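
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between predicted and actual ratings, so smaller values indicate better predictions. The sketch below computes it by hand; the `rmse` helper and the ratings are made up for illustration, and `surprise` reports the same quantity for you during cross-validation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings, purely for illustration.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```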

In [12]:
# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
import numpy as np

In [13]:
## Perform a gridsearch with SVD
# ⏰ This cell may take several minutes to run
from surprise import SVD

# GridSearch with SVD
param_grid = {
    'n_factors': [50, 100],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}

gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs_svd.fit(data)

print("✅ GridSearch for SVD complete.")

✅ GridSearch for SVD complete.


In [14]:
# print out optimal parameters for SVD after GridSearch

print("Best RMSE score (SVD):", gs_svd.best_score['rmse'])
print("Best parameters (SVD):", gs_svd.best_params['rmse'])

Best RMSE score (SVD): 0.8621975588823085
Best parameters (SVD): {'n_factors': 100, 'lr_all': 0.01, 'reg_all': 0.1}


In [15]:
# cross validating with KNNBasic

algo_knn_basic = KNNBasic()
cv_knn_basic = cross_validate(algo_knn_basic, data, measures=['RMSE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9495  0.9331  0.9616  0.9477  0.9492  0.9482  0.0091  
Fit time          0.23    0.19    0.18    0.18    0.27    0.21    0.04    
Test time         1.04    1.23    1.19    1.25    1.19    1.18    0.07    


In [16]:
# print out the average RMSE score for the test set

# Average RMSE for KNNBasic
avg_rmse_knn_basic = np.mean(cv_knn_basic['test_rmse'])
print("Average RMSE (KNNBasic):", avg_rmse_knn_basic)

Average RMSE (KNNBasic): 0.948208751527855


In [17]:
# cross validating with KNNBaseline

algo_knn_baseline = KNNBaseline()
cv_knn_baseline = cross_validate(algo_knn_baseline, data, measures=['RMSE'], cv=5, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8881  0.8755  0.8715  0.8670  0.8687  0.8741  0.0075  
Fit time          0.42    0.44    0.45    0.52    0.43    0.45    0.04    
Test time         1.35    1.30    1.43    1.41    1.50    1.40    0.07    


In [18]:
# print out the average score for the test set

# Average RMSE for KNNBaseline
avg_rmse_knn_baseline = np.mean(cv_knn_baseline['test_rmse'])
print("Average RMSE (KNNBaseline):", avg_rmse_knn_baseline)

Average RMSE (KNNBaseline): 0.8741418504580011


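Based on these outputs, the best-performing model is SVD: the grid search reached an RMSE of about 0.862 with `n_factors=100`, `lr_all=0.01`, and `reg_all=0.1`, ahead of KNNBaseline (about 0.874) and KNNBasic (about 0.948). The cells below fit an SVD model with `n_factors=50` and `reg_all=0.05`; if you found a configuration that performs better, feel free to use that instead to make some predictions.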
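
To make the comparison explicit, the mean RMSE values reported in the outputs above can be collected and ranked, keeping in mind that the lowest RMSE wins. This is only a small sketch; the dictionary name and labels are illustrative, and the numbers are the rounded values from this run.

```python
# Mean cross-validated RMSE values reported in the outputs above (rounded).
mean_rmse = {
    'SVD (grid search)': 0.8622,
    'KNNBasic': 0.9482,
    'KNNBaseline': 0.8741,
}

# Lower RMSE is better, so the best model is the one with the minimum score.
best_model = min(mean_rmse, key=mean_rmse.get)

for name, score in sorted(mean_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.4f}")
print("Best:", best_model)
```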

## Making Recommendations

It's important that the output of the recommender is interpretable to people. Rather than returning the `movieId` values, it would be far more valuable to return the actual title of the movie. As a first step, let's read the movies into a DataFrame and take a peek at what information we have about them.

In [21]:
df_movies = pd.read_csv('./ml-latest-small/movies.csv')

In [22]:
df_movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


## Making simple predictions
Just as a reminder, let's look at how you make a prediction for an individual user and item. First, we'll fit the SVD model we had from before.

In [24]:
svd = SVD(n_factors=50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a519b9b980>

In [25]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.0416296162407095, details={'was_impossible': False})

This prediction is a named tuple, so each of its values can be accessed by index or by field name (for example, `.est` for the estimated rating). Now let's put our knowledge of recommendation systems to use and do something interesting: making predictions for a new user!
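
The access patterns can be tried without refitting a model. The sketch below builds a stand-in named tuple with the same field names and values as the `Prediction` printed above; in the lab itself you would use the object returned by `svd.predict(2, 4)` directly.

```python
from collections import namedtuple

# Stand-in with the same fields as the Prediction shown above; in the lab,
# use the object returned by svd.predict(2, 4) instead.
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])
pred = Prediction(uid=2, iid=4, r_ui=None, est=3.0416296162407095,
                  details={'was_impossible': False})

print(pred[3])     # access the estimate by position
print(pred.est)    # access the estimate by field name

# Tuple unpacking also works because the prediction is a tuple.
uid, iid, r_ui, est, details = pred
print(uid, iid, round(est, 2))
```

Accessing by field name is usually easier to read than remembering that the estimate sits at position 3.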

## Obtaining User Ratings 

It's great that we have working models and everything, but wouldn't it be nice to get recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that allows us to pick randomly selected movies. The function should present users with a movie and ask them to rate it. If they have not seen the movie, they should be able to skip rating it. 

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* `rating_list`: list - a collection of dictionaries in the format of `{'userId': int, 'movieId': int, 'rating': float}`

#### This function is optional, but fun :) 

In [28]:
# Step-by-step movie_rater() Implementation

import random

def movie_rater(movie_df, num=5, genre=None):
    """
    Randomly sample movies and collect user ratings.
    
    Parameters:
        movie_df (DataFrame): DataFrame with movieId, title, and genres
        num (int): Number of movies to rate
        genre (str): If specified, filter movies by genre
    
    Returns:
        List of dictionaries with structure: {'userId': 999, 'movieId': int, 'rating': float}
    """
    rating_list = []

    # Filter by genre if specified
    if genre:
        filtered_df = movie_df[movie_df['genres'].str.contains(genre, case=False, na=False)]
    else:
        filtered_df = movie_df

    # Randomly sample up to 'num' movies (fewer if the genre has fewer titles)
    sample_movies = filtered_df.sample(n=min(num, len(filtered_df)))

    print("\nPlease rate the following movies (1 to 5 stars). Enter 0 if you haven't watched it.\n")
    
    for idx, row in sample_movies.iterrows():
        print(f"🎬 {row['title']} [{row['genres']}]")
        try:
            rating = float(input("Your rating (0 if not seen): "))
        except ValueError:
            # Treat non-numeric input as "not seen"
            rating = 0.0

        # Keep only ratings inside the 1-5 range the prompt asks for
        if 1 <= rating <= 5:
            rating_list.append({'userId': 999, 'movieId': int(row['movieId']), 'rating': rating})
        print()  # Add space

    return rating_list

In [29]:
# try out the new function here!

my_ratings = movie_rater(df_movies, num=5, genre='Comedy')
print(my_ratings)


Please rate the following movies (1 to 5 stars). Enter 0 if you haven't watched it.

🎬 Daria: Is It Fall Yet? (2000) [Animation|Comedy]


Your rating (0 if not seen):  1



🎬 Paul Blart: Mall Cop (2009) [Action|Comedy|Crime]


Your rating (0 if not seen):  2



🎬 From Dusk Till Dawn (1996) [Action|Comedy|Horror|Thriller]


Your rating (0 if not seen):  3



🎬 Big Trouble in Little China (1986) [Action|Adventure|Comedy|Fantasy]


Your rating (0 if not seen):  5



🎬 Odd Life of Timothy Green, The (2012) [Comedy|Drama|Fantasy]


Your rating (0 if not seen):  4



[{'userId': 999, 'movieId': 27369, 'rating': 1.0}, {'userId': 999, 'movieId': 65802, 'rating': 2.0}, {'userId': 999, 'movieId': 70, 'rating': 3.0}, {'userId': 999, 'movieId': 3740, 'rating': 5.0}, {'userId': 999, 'movieId': 96430, 'rating': 4.0}]


If you're struggling to come up with the above function, you can use this list of user ratings to complete the next segment:

user_rating5

### Making Predictions With the New Ratings
Now that you have new ratings, you can use them to make predictions for this new user. The proper way this should work is:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 

In [33]:
## add the new ratings to the original ratings DataFrame

# Assuming original ratings DataFrame is df_ratings
df_ratings = pd.read_csv('./ml-latest-small/ratings.csv')  

# Suppose your `my_ratings` is a list of dicts from movie_rater()
# Convert it to a DataFrame
df_new_ratings = pd.DataFrame(my_ratings)

# Combine with the original ratings
df_combined = pd.concat([df_ratings, df_new_ratings], ignore_index=True)

In [34]:
# train a model using the new combined DataFrame

from surprise import Dataset, Reader

# Defining a reader object
reader = Reader(rating_scale=(0.5, 5.0))

# Loading data from DataFrame
data_combined = Dataset.load_from_df(df_combined[['userId', 'movieId', 'rating']], reader)

# Building trainset
trainset = data_combined.build_full_trainset()

# Retraining the model
svd_updated = SVD(n_factors=50, reg_all=0.05)
svd_updated.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a519e24320>

In [35]:
# make predictions for the user
# you'll probably want to create a list of tuples in the format (movie_id, predicted_score)

# Getting all movie IDs
all_movie_ids = df_movies['movieId'].unique()

# Finding movies the new user hasn't rated yet (a set keeps each membership check fast)
rated_movie_ids = set(df_new_ratings['movieId'])
unseen_movies = [movie_id for movie_id in all_movie_ids if movie_id not in rated_movie_ids]

# Predicting ratings for all unseen movies
predictions = [(movie_id, svd_updated.predict(999, movie_id).est) for movie_id in unseen_movies]

In [36]:
# order the predictions from highest to lowest rated

ranked_movies = sorted(predictions, key=lambda x: x[1], reverse=True)

For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes in the parameters:
* `user_ratings`: list - list of tuples formulated as (movie_id, predicted_rating), ordered from best to worst for this individual
* `movie_title_df`: DataFrame - a dataframe containing the movie ids and titles
* `n`: int - number of recommended movies 

The function should use a `for` loop to print out each of the top *n* recommended movies in order from best to worst.

In [38]:
# return the top n recommendations

# Creating recommended_movies() function
def recommended_movies(user_ratings, movie_title_df, n=5):
    """
    Prints top n recommended movies with titles.
    
    Parameters:
        user_ratings (list): List of (movieId, predicted_rating), sorted best to worst
        movie_title_df (DataFrame): Movie metadata
        n (int): Number of movies to recommend
    """
    print(f"\n🎉 Top {n} Movie Recommendations for You:\n")
    count = 0
    # Walk the full ranked list so IDs without a title don't shorten the output
    for movie_id, score in user_ratings:
        if count >= n:
            break
        title = movie_title_df[movie_title_df['movieId'] == movie_id]['title'].values
        if len(title) > 0:
            print(f"⭐ {title[0]} — Predicted Rating: {score:.2f}")
            count += 1

recommended_movies(ranked_movies, df_movies, 5)


🎉 Top 5 Movie Recommendations for You:

⭐ Shawshank Redemption, The (1994) — Predicted Rating: 4.28
⭐ Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) — Predicted Rating: 4.25
⭐ Lawrence of Arabia (1962) — Predicted Rating: 4.23
⭐ Grand Day Out with Wallace and Gromit, A (1989) — Predicted Rating: 4.21
⭐ Streetcar Named Desire, A (1951) — Predicted Rating: 4.20
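
When only the top *n* items are needed, sorting the entire list of predictions is not strictly necessary. As an alternative sketch, `heapq.nlargest` from the standard library selects the highest-scoring pairs directly. The `top_n` helper and the sample pairs below are hypothetical and only illustrate the idea.

```python
import heapq

def top_n(predictions, n=5):
    """Return the n (movie_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, predictions, key=lambda pair: pair[1])

# Hypothetical (movie_id, predicted_rating) pairs for illustration only.
sample_predictions = [(10, 3.2), (20, 4.6), (30, 2.9), (40, 4.1), (50, 3.8)]
print(top_n(sample_predictions, n=3))
# [(20, 4.6), (40, 4.1), (50, 3.8)]
```

The result has the same shape as `ranked_movies`, so it could be passed to `recommended_movies()` in the same way.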


## Level Up (Optional)

* Try and chain all of the steps together into one function that asks users for ratings for a certain number of movies, then all of the above steps are performed to return the top $n$ recommendations
* Make a recommender system that only returns items that come from a specified genre
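
For the genre-specific recommender, note that the `genres` column stores labels separated by `|`, as seen in the movies DataFrame above. Substring matching works for most genre names but would also accept partial names such as "Com". The sketch below shows one possible stricter check that compares whole labels; `has_genre` is a hypothetical helper, not part of `surprise` or pandas.

```python
def has_genre(genres, genre):
    """True if `genre` is one of the pipe-separated labels in `genres` (case-insensitive)."""
    labels = [label.strip().lower() for label in genres.split('|')]
    return genre.strip().lower() in labels

print(has_genre('Action|Comedy|Crime', 'comedy'))  # True
print(has_genre('Action|Comedy|Crime', 'Com'))     # False
```

With pandas, a check like this could be applied row by row, for example with `df_movies['genres'].apply(has_genre, genre='Comedy')`, to build the filter mask.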

In [40]:
# Full Recommender System by Genre

import pandas as pd
import random
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

# Loading movie metadata
df_movies = pd.read_csv('./ml-latest-small/movies.csv')
df_ratings = pd.read_csv('./ml-latest-small/ratings.csv')

def movie_rater(movie_df, num=5, genre=None):
    """
    Presents a user with random movies from a genre and asks for ratings.
    """
    if genre:
        movie_df = movie_df[movie_df['genres'].str.contains(genre, case=False, na=False)]

    selected_movies = movie_df.sample(num)
    ratings = []

    print(f"\n🎬 Rate the following {num} {genre} movies (skip by pressing Enter):\n")

    for _, row in selected_movies.iterrows():
        try:
            user_input = input(f"How would you rate '{row['title']}' (0.5 to 5)? ")
            if user_input.strip() == '':
                continue
            rating = float(user_input)
            if 0.5 <= rating <= 5.0:
                ratings.append({'userId': 999, 'movieId': int(row['movieId']), 'rating': rating})
        except ValueError:
            print("❌ Invalid input. Skipping this movie.")
            continue

    return ratings

def genre_recommender_system(num_ratings=5, genre='Comedy', top_n=5):
    """
    Main function: asks user for movie ratings, retrains model, and recommends movies of a chosen genre.
    """
    # Getting user ratings
    user_ratings = movie_rater(df_movies, num=num_ratings, genre=genre)

    if not user_ratings:
        print("⚠️ No ratings provided. Cannot make recommendations.")
        return

    df_new_ratings = pd.DataFrame(user_ratings)
    
    # Combining with original ratings
    df_combined = pd.concat([df_ratings, df_new_ratings], ignore_index=True)

    # Training model
    reader = Reader(rating_scale=(0.5, 5.0))
    data = Dataset.load_from_df(df_combined[['userId', 'movieId', 'rating']], reader)
    trainset = data.build_full_trainset()

    svd = SVD(n_factors=50, reg_all=0.05)
    svd.fit(trainset)

    # Predicting on unseen movies from the same genre
    seen_movies = df_new_ratings['movieId'].unique()
    genre_movies = df_movies[df_movies['genres'].str.contains(genre, case=False, na=False)]
    unseen_movies = genre_movies[~genre_movies['movieId'].isin(seen_movies)]

    predictions = []
    for movie_id in unseen_movies['movieId'].values:
        est_rating = svd.predict(999, movie_id).est
        predictions.append((movie_id, est_rating))

    # Sorting predictions and displaying the top N
    ranked_movies = sorted(predictions, key=lambda x: x[1], reverse=True)[:top_n]

    print(f"\n🎯 Top {top_n} Recommended {genre} Movies for You:\n")
    for movie_id, rating in ranked_movies:
        title = df_movies.loc[df_movies['movieId'] == movie_id, 'title'].values[0]
        print(f"⭐ {title} — Predicted Rating: {rating:.2f}")

# Example:
genre_recommender_system(num_ratings=5, genre='Action', top_n=5)


🎬 Rate the following 5 Action movies (skip by pressing Enter):



How would you rate 'Kick-Ass (2010)' (0.5 to 5)?  5
How would you rate 'Ballistic: Ecks vs. Sever (2002)' (0.5 to 5)?  4
How would you rate 'Jupiter Ascending (2015)' (0.5 to 5)?  6
How would you rate 'I Am Number Four (2011)' (0.5 to 5)?  4
How would you rate 'Money Talks (1997)' (0.5 to 5)?  3



🎯 Top 5 Recommended Action Movies for You:

⭐ Great Escape, The (1963) — Predicted Rating: 4.52
⭐ North by Northwest (1959) — Predicted Rating: 4.45
⭐ Fight Club (1999) — Predicted Rating: 4.45
⭐ Laputa: Castle in the Sky (Tenkû no shiro Rapyuta) (1986) — Predicted Rating: 4.42
⭐ Apocalypse Now (1979) — Predicted Rating: 4.40


## Summary

In this lab, you got the chance to implement a collaborative filtering model as well as retrieve recommendations from that model. You also got the opportunity to add your own ratings to the system to get new recommendations for yourself! Next, you will learn how to use Spark to make recommender systems.