# Developing Recommendation Systems from Real-World Datasets

## 1M Movie Dataset
In this section we build a more complete recommender system pipeline that produces the top recommendations for a specific user.

In [1]:
#Loading our dataset
import pandas as pd
df = pd.read_csv('ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [2]:
# Drop unnecessary columns
new_df = df.drop(columns='timestamp')

We transform the dataset into something compatible with `surprise`. In order to do this, we are going to need `Reader` and `Dataset` classes. There's a method in `Dataset` specifically for loading dataframes.

In [3]:
from surprise import Reader, Dataset

# read in values as Surprise dataset 
reader = Reader()
data = Dataset.load_from_df(new_df,reader)
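
A note on the rating scale: as far as we know, `Reader()` with no arguments assumes ratings run from 1 to 5, and estimates are clipped to that range. If `ratings.csv` contains half-star ratings below 1 (an assumption about this file, worth checking with `new_df['rating'].min()`), passing `rating_scale=(0.5, 5)` to `Reader` may be the better choice. The plain-Python sketch below only illustrates the clipping idea; `clip_to_scale` is a hypothetical helper, not part of `surprise`.

```python
def clip_to_scale(estimate, rating_scale=(1, 5)):
    """Clamp a raw estimate into the [low, high] rating scale."""
    low, high = rating_scale
    return max(low, min(high, estimate))

raw_estimate = 0.7  # hypothetical raw model output for a disliked movie

# With a (1, 5) scale the estimate is pushed up to the lower bound
default_clipped = clip_to_scale(raw_estimate)

# With a (0.5, 5) scale the same estimate is left untouched
half_star_clipped = clip_to_scale(raw_estimate, (0.5, 5))

print(default_clipped, half_star_clipped)
```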

Let's look at how many users and items the dataset contains. For neighborhood-based methods, these counts help us decide whether to compute user-user or item-item similarities.

In [4]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724
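
One rough way to reason about these counts is to compare how large each similarity matrix would be and how many ratings back each user or item on average. The sketch below uses only the numbers printed above. It is a back-of-the-envelope check, not a rule, and the final choice is best confirmed by cross-validation.

```python
n_users, n_items, n_ratings = 610, 9724, 100836  # values reported above

# Fraction of the user-item matrix that actually holds a rating
density = n_ratings / (n_users * n_items)

# Number of entries each similarity matrix would contain
user_user_entries = n_users * n_users
item_item_entries = n_items * n_items

# Average amount of evidence behind each user or item
avg_ratings_per_user = n_ratings / n_users
avg_ratings_per_item = n_ratings / n_items

print(f"density: {density:.4f}")
print(f"user-user entries: {user_user_entries:,}")
print(f"item-item entries: {item_item_entries:,}")
print(f"ratings per user: {avg_ratings_per_user:.1f}")
print(f"ratings per item: {avg_ratings_per_item:.1f}")
```

With far fewer users than items, the user-user matrix is much smaller and each user has many more ratings than a typical item, which is consistent with the `user_based: True` setting used in the KNN cells.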


**Determine the best model**: Now compare the different models and see which ones perform best. For consistency's sake, use RMSE to evaluate every model. Remember to cross-validate! Can you get a model with a lower average RMSE on test data than 0.869?
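
As a reminder of what the metrics measure, here is a minimal plain-Python sketch of RMSE and MAE. The helper names and the toy ratings are made up for illustration; `surprise` computes these metrics for us during cross-validation. For both metrics, lower values mean better predictions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

# Toy ratings, made up for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(toy_rmse, toy_mae)
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does.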

In [5]:
# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
import numpy as np

In [6]:
## Perform a gridsearch with SVD
# ⏰ This cell may take several minutes to run
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [7]:
# print out optimal parameters for SVD after GridSearch
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.8694549447006225, 'mae': 0.6680067046870862}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 50, 'reg_all': 0.05}}
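
To get a feel for why the grid search takes a while, the sketch below enumerates the combinations that the `params` grid describes. It uses only the standard library. `n_folds = 5` is an assumption that matches the five folds seen in the cross-validation outputs in this notebook.

```python
from itertools import product

params = {'n_factors': [20, 50, 100],
          'reg_all': [0.02, 0.05, 0.1]}

# Every combination of the listed values, as dictionaries
names = list(params)
combinations = [dict(zip(names, values))
                for values in product(*params.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds

print(len(combinations), 'combinations,', total_fits, 'model fits')
```

Note also that the best parameters are reported per metric: in the output above, RMSE and MAE each favor a different value of `n_factors`.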


In [8]:
# cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [9]:
# print out the average RMSE score for the test set
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.97399348, 0.97279161, 0.97047477, 0.97423627, 0.97250206]))
('test_mae', array([0.75338593, 0.75097489, 0.74605343, 0.75295196, 0.75163522]))
('fit_time', (1.0879662036895752, 1.1078801155090332, 1.1071605682373047, 1.0629935264587402, 1.031749963760376))
('test_time', (1.6417181491851807, 1.6572823524475098, 1.6419687271118164, 1.6733195781707764, 1.642467737197876))
-----------------------
0.9727996365991605


In [10]:
# cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [11]:
# print out the average score for the test set
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([0.88583506, 0.87480521, 0.87294868, 0.8774304 , 0.87314251]))
('test_mae', array([0.67660928, 0.66894125, 0.66541226, 0.67036099, 0.66851752]))
('fit_time', (1.1541850566864014, 1.2625689506530762, 1.3072295188903809, 1.278538465499878, 1.2829015254974365))
('test_time', (2.310779571533203, 2.4475972652435303, 2.4624223709106445, 2.4573185443878174, 2.4610376358032227))


0.8768323703793162

> Based on these outputs, the SVD model performs best. The grid search reached its lowest RMSE (about 0.869) with `n_factors = 100` and `reg_all = 0.05`, while `n_factors = 50` with the same regularization gave the lowest MAE. The cells below use `n_factors = 50`; if you found a model that performs better, feel free to use it to make some predictions.
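
The comparison can be made explicit by collecting the mean test RMSE of each model in one place. The sketch below copies the per-fold values from the outputs above; the dictionary labels are just illustrative names.

```python
from statistics import mean

# Per-fold test RMSE values copied from the outputs above
knn_basic_rmse = [0.97399348, 0.97279161, 0.97047477, 0.97423627, 0.97250206]
knn_baseline_rmse = [0.88583506, 0.87480521, 0.87294868, 0.8774304, 0.87314251]

mean_rmse = {
    'SVD (grid search best)': 0.8694549447006225,
    'KNNBasic': mean(knn_basic_rmse),
    'KNNBaseline': mean(knn_baseline_rmse),
}

# Lower RMSE is better, so list the models in ascending order
for name, score in sorted(mean_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")

best_model = min(mean_rmse, key=mean_rmse.get)
```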

**Making Recommendations:** It's important that the output of the recommender is interpretable to people. Rather than returning the `movieId` values, it would be far more valuable to return the actual title of each movie. As a first step, let's read the movies into a dataframe and take a peek at the information we have about them.

In [12]:
df_movies = pd.read_csv('movies.csv')
df_movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [13]:
#making simple predictions
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a3962b9eb0>

In [14]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=2.993807613080075, details={'was_impossible': False})

The returned `Prediction` is a namedtuple, so each value can be accessed by index or by field name (for example, the estimated rating is `est`, at index 3). Now let's put our knowledge of recommendation systems to use on something more interesting: making predictions for a new user!
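
To see the access patterns without fitting a model, the sketch below builds a stand-in namedtuple with the same field names as the printed `Prediction` above. The stand-in is for illustration only; the real object comes from `svd.predict()`.

```python
from collections import namedtuple

# Stand-in with the same field names as the printed Prediction above
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

pred = Prediction(uid=2, iid=4, r_ui=None, est=2.993807613080075,
                  details={'was_impossible': False})

by_name = pred.est   # access by field name
by_index = pred[3]   # access by position

# A namedtuple can also be unpacked like a regular tuple
uid, iid, true_rating, estimate, details = pred
print(by_name, by_index, estimate)
```

Accessing the estimate by name tends to be easier to read than a bare index when the value is used later in ranking code.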

**Obtaining User Ratings**: It's great that we have working models, but wouldn't it be nice to get recommendations tailored specifically to your preferences? That's what we'll do now. The first step is to create a function that presents randomly selected movies to a user and asks for a rating. If the user has not seen a movie, they should be able to skip it.

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* `rating_list`: list - a collection of dictionaries in the format `{'userId': int, 'movieId': int, 'rating': float}`

In [15]:
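def movie_rater(movie_df, num, genre=None):
    user_id = 1000  # ID assigned to the new user
    rating_list = []
    # Restrict the pool to one genre if requested (regex=False matches the text literally)
    if genre:
        pool = movie_df[movie_df['genres'].str.contains(genre, regex=False)]
    else:
        pool = movie_df
    while num > 0:
        movie = pool.sample(1)
        print(movie)
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            continue  # skipped movies do not count toward num
        try:
            rating = float(rating)  # store a float, as described in the return format
        except ValueError:
            print('Please enter a number or n.')
            continue
        rating_list.append({'userId': user_id,
                            'movieId': movie['movieId'].values[0],
                            'rating': rating})
        num -= 1
    return rating_list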
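
If you want stricter input handling, the parsing step can be pulled into its own small function and tested without any interactive prompt. `parse_rating` is a hypothetical helper and the 1 to 5 bounds follow the prompt text; treat this as a sketch rather than part of the exercise.

```python
def parse_rating(text, low=1.0, high=5.0):
    """Return a float rating, None for a skip ('n'), or raise ValueError."""
    text = text.strip().lower()
    if text == 'n':
        return None  # the user has not seen the movie
    rating = float(text)  # raises ValueError for non-numeric input
    if not low <= rating <= high:
        raise ValueError(f'rating must be between {low} and {high}')
    return rating

print(parse_rating('4'), parse_rating(' N '), parse_rating('3.5'))
```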

In [16]:
user_rating = movie_rater(df_movies, 4, 'Comedy')

      movieId                        title                    genres
5921    33830  Herbie: Fully Loaded (2005)  Adventure|Comedy|Romance
How do you rate this movie on a scale of 1-5, press n if you have not seen :
n
      movieId                     title  \
7180    72226  Fantastic Mr. Fox (2009)   

                                         genres  
7180  Adventure|Animation|Children|Comedy|Crime  
How do you rate this movie on a scale of 1-5, press n if you have not seen :
n
      movieId             title          genres
3881     5454  Mo' Money (1992)  Comedy|Romance
How do you rate this movie on a scale of 1-5, press n if you have not seen :
n
      movieId                                    title         genres
8050    98633  My Lucky Stars (Fuk sing go jiu) (1985)  Action|Comedy
How do you rate this movie on a scale of 1-5, press n if you have not seen :
n
      movieId                       title                    genres
2150     2863  Hard Day's Night, A (1964)  Adventure|Co

***If you're struggling to come up with the above function, you can use this list of user ratings to complete the next segment***

In [17]:
user_rating

[{'userId': 1000, 'movieId': 3177, 'rating': '3'},
 {'userId': 1000, 'movieId': 3515, 'rating': '4'},
 {'userId': 1000, 'movieId': 59814, 'rating': '3'},
 {'userId': 1000, 'movieId': 90647, 'rating': '5'}]

***Making Predictions With the New Ratings***\
Now that you have new ratings, you can use them to make predictions for this new user. The steps are:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 

In [18]:
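## add the new ratings to the original ratings DataFrame
user_ratings = pd.DataFrame(user_rating)
# ratings typed via input() may be strings; cast them so the column stays numeric
user_ratings['rating'] = user_ratings['rating'].astype(float)
new_ratings_df = pd.concat([new_df, user_ratings], axis=0, ignore_index=True)
new_data = Dataset.load_from_df(new_ratings_df,reader)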

In [19]:
# train a model using the new combined DataFrame
svd_ = SVD(n_factors= 50, reg_all=0.05)
svd_.fit(new_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a38c54e550>

In [20]:
# make predictions for the user
# build a list of tuples in the format (movie_id, predicted_score)
list_of_movies = []
for m_id in new_df['movieId'].unique():
    # .est is the estimated rating (the same value as index 3 of the Prediction)
    list_of_movies.append((m_id, svd_.predict(1000, m_id).est))

In [21]:
# order the predictions from highest to lowest rated
ranked_movies = sorted(list_of_movies, key=lambda x:x[1], reverse=True)
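
One refinement worth considering: the ranked list covers every movie in the ratings data, including the ones the new user has just rated. The sketch below filters those out before taking the top entries. The `ranked` pairs and their scores are made up for illustration, while the rated movie IDs are taken from the `user_rating` list shown earlier.

```python
# Hypothetical (movie_id, predicted_score) pairs, already sorted best first
ranked = [(3177, 4.6), (318, 4.5), (90647, 4.4), (50, 4.3), (296, 4.2)]

# Movie IDs the new user has already rated
already_rated = {3177, 3515, 59814, 90647}

# Keep only movies the user has not rated, preserving the ranking order
unseen_ranked = [(m_id, score) for m_id, score in ranked
                 if m_id not in already_rated]

top_n = unseen_ranked[:2]
print(top_n)
```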

For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes the parameters:
* `user_ratings`: list - list of tuples in the format (movie_id, predicted_score), ordered from best to worst for this user
* `movie_title_df`: DataFrame - a dataframe containing the movie ids and titles
* `n`: int - number of recommended movies

The function should use a `for` loop to print the top *n* recommended movies in order from best to worst.

In [22]:
# print the top n recommendations by title
def recommended_movies(user_ratings, movie_title_df, n):
    # user_ratings holds (movie_id, predicted_score) tuples, best first
    for idx, rec in enumerate(user_ratings[:n]):
        # .iloc[0] pulls out the title string rather than printing the whole Series
        title = movie_title_df.loc[movie_title_df['movieId'] == int(rec[0]), 'title'].iloc[0]
        print('Recommendation # ', idx+1, ': ', title, '\n')

recommended_movies(ranked_movies,df_movies,5)

Recommendation #  1 :  Dr. Strangelove or: How I Learned to Stop Worr...

Recommendation #  2 :  Silence of the Lambs, The (1991)

Recommendation #  3 :  Chinatown (1974)

Recommendation #  4 :  Usual Suspects, The (1995)

Recommendation #  5 :  Shawshank Redemption, The (1994)



**Summary:** We implemented a collaborative filtering model and retrieved recommendations from it. We also added our own ratings to the dataset to get recommendations tailored to a new user. A natural next step is learning how to build recommender systems with Spark.