### Collaborative Recommender 

This recommender uses a user's previous ratings to predict how they would rate a movie they have not seen yet. The final product is a list of unwatched movies, sorted by the user's estimated rating of each movie.

In [1]:
import random
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import (GridSearchCV, KFold, cross_validate,
                                      train_test_split)
from surprise.prediction_algorithms.matrix_factorization import SVDpp

import warnings; warnings.simplefilter('ignore')

%matplotlib inline

print('All libraries loaded successfully!')

All libraries loaded successfully!


In [2]:
ratings = pd.read_csv('train.csv')

In [3]:
ratings.shape

(10000038, 4)

In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
movies = pd.read_csv('movies.csv')

In [6]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
movies.shape

(62423, 3)

In [8]:
rated_movs = ratings.groupby('movieId').sum()
rated_movs.shape

(48213, 3)

In [9]:
users = ratings.groupby('userId').sum()
users.shape

(162541, 3)

The training set contains about 10 million ratings of 48,213 movies, made by 162,541 users.

In [10]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

data = ratings['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / ratings.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} movie-ratings'.format(ratings.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

We can see from the above that over 26% of the ratings are 4.0. Ratings from 0.5 to 2.5 together make up less than 16%.
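The percentages read off the chart can also be computed directly. The following is a minimal sketch on a toy series; the helper name `rating_shares` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def rating_shares(rating_column):
    """Return the percentage of ratings at each rating value, sorted by value."""
    shares = rating_column.value_counts(normalize=True).sort_index() * 100
    return shares.round(1)

# Toy data standing in for ratings['rating'] (hypothetical values)
toy = pd.Series([4.0, 4.0, 3.0, 5.0, 0.5, 2.5, 4.0, 3.5])
shares = rating_shares(toy)

# Combined share of the low ratings (2.5 stars and below)
low_share = shares[shares.index <= 2.5].sum()
print(shares)
print(low_share)
```

On the real data this would be called as `rating_shares(ratings['rating'])`.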

In [11]:
tags = pd.read_csv('genome_tags.csv')
tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [13]:
genome_scores = pd.read_csv('genome_scores.csv')
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [14]:
genome_scores.shape

(15584448, 3)

In [15]:
links = pd.read_csv('links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [16]:
links.shape

(62423, 3)

In [17]:
n_users = ratings.userId.unique().shape[0]
n_movies = ratings.movieId.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

Number of users = 162541 | Number of movies = 48213


Surprise ships with several built-in datasets (e.g. MovieLens) that are parsed by its Dataset module. A customized recommender system, however, usually needs its own data. In that case you load your own ratings either from a file (e.g. a CSV) or from a pandas dataframe. In both cases, you need to define a Reader object so that Surprise can parse the file or the dataframe. See the reference [here](https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset)

In [18]:
# Ratings in this dataset run from 0.5 to 5.0, so the scale must start at 0.5
reader = Reader(rating_scale=(0.5, 5))

Now we load the dataframe of ratings (one row per user and movie) with Dataset.load_from_df, passing the reader as the second argument

In [19]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

### Singular Value Decomposition (SVD)
SVD decomposes any matrix into singular vectors and singular values. Readers with a background in machine learning, particularly dimensionality reduction, will have met SVD through Principal Component Analysis (PCA). Simply put, PCA amounts to an SVD of the data after mean centering, i.e. after shifting all data points so that their mean is at the origin.
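The link between SVD and PCA can be checked numerically. The sketch below uses random toy data (an assumption for illustration only) and compares the SVD of the mean-centered matrix with an eigendecomposition of its covariance matrix.

```python
import numpy as np

# Toy data: 200 samples, 3 correlated features (hypothetical values)
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Mean-center so that every column has mean zero
Xc = X - X.mean(axis=0)

# SVD route: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (Xc.shape[0] - 1)

# PCA route: eigendecomposition of the sample covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # largest variance first
pca_variances = eigvals[order]
pca_axes = eigvecs[:, order].T

# Explained variances agree, and the principal axes agree up to sign
print(np.allclose(svd_variances, pca_variances))
print(np.allclose(np.abs(Vt), np.abs(pca_axes), atol=1e-6))
```

The rows of `Vt` are the principal axes, and the squared singular values divided by n - 1 are the variances PCA reports along them.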

- We will use the SVD algorithm from Surprise
- For training and tuning: GridSearchCV and K-fold cross-validation, scored by RMSE (the model itself is fit by stochastic gradient descent)
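To make "fit by stochastic gradient descent" concrete: Surprise's SVD is a biased matrix factorization that predicts a rating as global mean + user bias + item bias + the dot product of a user and an item factor vector. The following is a toy NumPy sketch of that idea, not the library's implementation; the function name `fit_toy_mf`, the hyperparameter values and the toy ratings are all assumptions chosen for illustration.

```python
import numpy as np

def fit_toy_mf(triples, n_users, n_items, n_factors=2, n_epochs=100,
               lr=0.02, reg=0.02, seed=0):
    """Fit a tiny biased matrix factorization by SGD; return params and per-epoch RMSE."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))   # global mean rating
    bu = np.zeros(n_users)                            # user biases
    bi = np.zeros(n_items)                            # item biases
    pu = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    qi = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    history = []
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])
            # Gradient steps with L2 regularization
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = pu[u].copy()
            pu[u] += lr * (err * qi[i] - reg * pu[u])
            qi[i] += lr * (err * pu_old - reg * qi[i])
        # Training RMSE after this epoch
        sq_errors = [(r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])) ** 2
                     for u, i, r in triples]
        history.append(float(np.sqrt(np.mean(sq_errors))))
    return (mu, bu, bi, pu, qi), history

# Hypothetical (user, item, rating) triples with 0-based ids
toy_triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
               (2, 1, 2.0), (2, 2, 1.5), (3, 0, 4.5), (3, 2, 2.0)]
params, history = fit_toy_mf(toy_triples, n_users=4, n_items=3)
print(history[0], history[-1])
```

The training RMSE should fall across epochs. In practice the held-out RMSE from cross-validation, not the training RMSE, is what gets compared across hyperparameter settings.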

In [20]:
# Define a cross-validation iterator and time the whole run
start = time.time()

kf = KFold(n_splits=2)

svd = SVD(verbose=True)

for trainset, testset in kf.split(data):

    # Train on the training fold and predict the held-out fold
    svd.fit(trainset)
    predictions = svd.test(testset)

    # Compute and print the Root Mean Squared Error for this fold
    accuracy.rmse(predictions, verbose=True)

print("Runtime %0.2f" % (time.time() - start))

Processing epoch 0


KeyboardInterrupt: 

In [None]:
rmses = [0.8339,0.8347,0.8343,0.8345,0.8330]
rmse_avg = round(sum(rmses) / len(rmses),5)
print('The mean RMSE of the full rating set is: {}'.format(rmse_avg))

In [None]:
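# The helper functions below take `algo` as their default model, so bind it to the fitted SVD
algo = svd
algo.predict(345, 3954)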

In [None]:
ratings[ratings['userId'] == 345]

### Analysis so far....

Our SVD model has a mean RMSE of 0.83408, which is fine. Its predictions tend to be centralized, leaning toward ratings of around 3 to 4 for every movie predicted. Let's see whether this has to do with the distribution of ratings in the raw data, and what we would get if we subsampled it to have a more even number of ratings in each bin.
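One way to put an RMSE figure in context is to compare it with a naive baseline that predicts the same mean rating for everything. The following is a minimal sketch; the helper names `rmse` and `mean_baseline_rmse` and the toy numbers are assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mean_baseline_rmse(train_ratings, test_ratings):
    """RMSE obtained by predicting the training-set mean for every test rating."""
    global_mean = float(np.mean(train_ratings))
    return rmse(test_ratings, np.full(len(test_ratings), global_mean))

# Toy ratings (hypothetical values)
toy_train = [4.0, 3.0, 5.0, 4.0]
toy_test = [3.0, 5.0]
baseline = mean_baseline_rmse(toy_train, toy_test)
print(baseline)
```

A model whose RMSE is clearly below this baseline is learning something beyond the average rating. One that sits close to it is mostly predicting the mean, which would match the centralized predictions described above.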

In [None]:
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.boxplot(x=ratings['rating'])

Our predictions tend to fall inside the box of the boxplot, i.e. the middle range where most ratings sit, so they may not be accurate for individual users whose ratings are more extreme. We'll try training with a more even set.

In [None]:
ratings['rating'].value_counts()

In [None]:
alt_ratings = 4

In [None]:
sub_sample = (3.0,ratings)

In [None]:
# Downsample each of these rating values to 100,000 randomly chosen rows
my_ratings = [4.0,3.0,5.0,3.5,4.5,2.0]
alt_ratings = pd.concat(
    [ratings[ratings['rating'] == rating].sample(n=100000) for rating in my_ratings],
    axis=0)

In [None]:
alt_ratings.shape

In [None]:
alt_ratings = alt_ratings.drop('timestamp',axis=1)
ratings = ratings.drop('timestamp',axis=1)

In [None]:
ratings_left = [2.5,1.0,1.5,0.5]
for rating in ratings_left:
    temp_df = ratings[ratings['rating'] == rating]
    alt_ratings = pd.concat([alt_ratings,temp_df],axis=0)

In [None]:
alt_ratings['rating'].value_counts()

Now we have 100,000 ratings for each of the values 2.0, 3.0, 3.5, 4.0, 4.5 and 5.0, plus all of the original ratings of 0.5, 1.0, 1.5 and 2.5 stars. Since we are using a subsample, let's check how many users or movies were lost.
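A direct way to count what the subsample dropped is to compare the sets of user and movie ids before and after. The following is a minimal sketch; the helper name `coverage_loss` and the toy frames are illustrative assumptions.

```python
import pandas as pd

def coverage_loss(full_df, sub_df, user_col="userId", movie_col="movieId"):
    """Count users and movies present in full_df but absent from sub_df."""
    users_lost = set(full_df[user_col]) - set(sub_df[user_col])
    movies_lost = set(full_df[movie_col]) - set(sub_df[movie_col])
    return {"users_lost": len(users_lost), "movies_lost": len(movies_lost)}

# Toy frames standing in for `ratings` and `alt_ratings` (hypothetical values)
toy_full = pd.DataFrame({"userId": [1, 1, 2, 3],
                         "movieId": [10, 20, 10, 30],
                         "rating": [4.0, 3.0, 5.0, 0.5]})
toy_sub = toy_full.iloc[[0, 3]]

loss = coverage_loss(toy_full, toy_sub)
print(loss)
```

On the real data this would be called as `coverage_loss(ratings, alt_ratings)`.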

#### Now we look at the adjusted distribution with ratings randomly subsampled

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(alt_ratings['rating'], bins=10)
plt.title('Rating Distribution (adjusted)')

In [None]:
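# Cross-validate on the adjusted (subsampled) ratings rather than the full set
alt_data = Dataset.load_from_df(alt_ratings[['userId', 'movieId', 'rating']], reader)

start = time.time()

kf = KFold(n_splits=5)

alt_svd = SVD(verbose=True)

for trainset, testset in kf.split(alt_data):

    # Train on four folds and predict the held-out fold
    alt_svd.fit(trainset)
    alt_pred = alt_svd.test(testset)

    # Compute and print the Root Mean Squared Error for this fold
    accuracy.rmse(alt_pred, verbose=True)

print("Runtime %0.2f" % (time.time() - start))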

In [None]:
rmses = [0.8345,0.8351,0.8332,0.8338,0.8340]
rmse_avg = round(sum(rmses) / len(rmses),5)
print('The mean RMSE of the adjusted rating set is: {}'.format(rmse_avg))

In [None]:
ratings['userId'].value_counts().mean()

In [None]:
alt_ratings.head()

Now we will create a function that checks the predicted ratings against the ratings the user actually gave. It takes:

- an int user ID (Id)
- the name of the movie ID column (movieId)
- an int limit (the number of movies returned)
- a dataframe of ratings (df)
- the name of the user ID column (userId)
- an algorithm (algo)

In [None]:
def check_system(Id, movieId, limit, df=ratings, userId='userId', algo=algo):
    # Isolate the necessary columns from the dataframe
    df = df[[movieId, userId, 'rating']]

    # Take a subsample of the user's ratings: 10% for heavy raters, 50% otherwise
    user_df = df[df[userId] == Id]
    if user_df.shape[0] >= df[userId].value_counts().mean():
        user_df = user_df.sample(frac=.10)
    else:
        user_df = user_df.sample(frac=.50)

    # Build the dataframe to be returned
    user_df['est'] = user_df[movieId].apply(lambda x: round(algo.predict(Id, x).est, 2))
    user_df['error'] = user_df['est'] - user_df['rating']
    user_df['avg_error'] = user_df['error'].mean()

    # Keep only the first `limit` rows when a smaller limit is given
    if limit is not None and limit < user_df.shape[0]:
        user_df = user_df.head(limit)

    user_df = pd.merge(user_df, movies, on=movieId)
    return user_df[[userId, movieId, 'title', 'rating', 'est', 'error', 'avg_error']]

In [None]:
alt_ratings['userId'].value_counts().describe()

In [None]:
ratings['userId'].value_counts().describe()

Let's check how predictions turn out for users who have:

1. Rated a lot of movies (large data to pull from)
2. Rated an average number of movies
3. Rated few movies
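The three kinds of user can be picked from the per-user rating counts. The following is a minimal sketch on toy data; the helper name `representative_users` is hypothetical, and using the 25th percentile to stand for "few" is an assumption, not the rule used in this notebook.

```python
import pandas as pd

def representative_users(ratings_df, user_col="userId"):
    """Pick one heavy, one typical and one light rater from per-user rating counts."""
    counts = ratings_df[user_col].value_counts()

    def closest_to(target):
        # User whose rating count is nearest to the target value
        user = (counts - target).abs().idxmin()
        return int(user), int(counts[user])

    heaviest = counts.idxmax()
    return {
        "most": (int(heaviest), int(counts[heaviest])),
        "average": closest_to(counts.mean()),
        "few": closest_to(counts.quantile(0.25)),  # assumption: lower quartile
    }

# Toy ratings: user 1 rated 6 movies, user 2 rated 3, user 4 rated 2, user 3 rated 1
toy = pd.DataFrame({"userId": [1] * 6 + [2] * 3 + [4] * 2 + [3],
                    "movieId": list(range(12))})
picks = representative_users(toy)
print(picks)
```

On the real data this would be called as `representative_users(ratings)`.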

#### Let us first check the user who rated the largest number of movies (12,952 ratings)

In [None]:
find_user = ratings.copy()
find_user['count'] = 1
find_user = find_user.groupby('userId').sum()
find_user[find_user['count'] == 12952].head(1)

#### User 72315 has the most ratings: 12,952. Let's get 10 ratings from this user

In [None]:
check_system(72315, 'movieId', 10)

#### Let us check a user who rated an average number of movies (61 ratings)

In [None]:
find_user = ratings.copy()
find_user['count'] = 1
find_user = find_user.groupby('userId').sum()
find_user[find_user['count'] == 61].head(1)

#### User 333 has an average number of ratings: 61. Let's get 10 ratings from this user

In [None]:
check_system(333, 'movieId', 10)

#### Let us check a user who rated a below-average number of movies (14 ratings)

In [None]:
find_user = ratings.copy()
find_user['count'] = 1
find_user = find_user.groupby('userId').sum()
find_user[find_user['count'] == 14].head(1)

#### User 17 has a below-average number of ratings: 14. Let's get 10 ratings from this user

In [None]:
check_system(17, 'movieId', 10)

In [None]:
movies_df = pd.read_csv('movies.csv')
movies_df = movies_df[['movieId','title']]
movies_df.head()

### Predicting ratings 

In [None]:
trainset = data.build_full_trainset()

In [None]:
model = SVD(verbose=True)
model = model.fit(trainset)

In [None]:
test = pd.read_csv("test.csv")
test

#### Creating a submission file 

In [None]:
# This will take a while, so be patient when running it :)
def predict_rating(row):
    u = int(row["userId"])
    i = int(row["movieId"])
    # predict() takes raw ids and returns a Prediction; estimate() expects Surprise's inner ids
    return model.predict(u, i).est

test = test.assign(rating=test.apply(predict_rating, axis=1))

In [None]:
test = test.assign(Id=test.userId.astype(str)+"_"+test.movieId.astype(str))
submission = test[["Id", "rating"]]
submission.to_csv("submission_svd.csv", index=None)

In [None]:
def predict_ratings(Id, movieId, n, df=ratings, userId='userId', algo=algo):
    # Select n distinct random movies from our set
    df = df[[movieId, userId, 'rating']]
    movie_choices = df[movieId].unique()
    sampled = np.random.choice(movie_choices, n, replace=False)

    # Build the dataframe that we'll return
    predicted_df = pd.DataFrame()
    predicted_df[movieId] = sampled
    predicted_df[userId] = Id

    predicted_df['est'] = predicted_df[movieId].apply(lambda x: round(algo.predict(Id, x).est, 2))

    # Grab the titles
    predicted_df = pd.merge(predicted_df, movies_df, on=movieId)

    return predicted_df[[userId, movieId, 'title', 'est']]

In [None]:
predict_ratings(551,'movieId',10)

#### We will create a function that returns n movies, sorted by predicted user rating, from a random sample of movies

In [None]:
def predicted_top_n(Id, movieId, n, samples, df=ratings, userId='userId', algo=algo):

    df = df[[movieId, userId, 'rating']]
    movie_choices = df[movieId].unique()

    # Take out movies the user has already rated
    watched_movs = df.loc[df[userId] == Id, movieId].unique()
    unwatched = np.setdiff1d(movie_choices, watched_movs)

    # Select random movies according to 'samples' (None means every unwatched movie)
    if samples is None:
        candidates = unwatched
    elif samples <= unwatched.shape[0]:
        candidates = np.random.choice(unwatched, samples, replace=False)
    else:
        print("The sample size exceeds the available movies. Reset to {} movies".format(unwatched.shape[0]))
        candidates = unwatched

    # Build the dataframe that we'll return, best predictions first
    predicted_df = pd.DataFrame()
    predicted_df[movieId] = candidates
    predicted_df[userId] = Id
    predicted_df['est'] = predicted_df[movieId].apply(lambda x: round(algo.predict(Id, x).est, 2))
    predicted_df = predicted_df.sort_values(by='est', ascending=False)

    # Keep the top n (head returns every row when n exceeds the number of rows)
    predicted_df = predicted_df.head(n)
    predicted_df = pd.merge(predicted_df, movies_df, on=movieId)
    return predicted_df[[userId, 'title', 'est']]

### Recommending unwatched movies

The predicted_top_n function above does the following:

1. Builds a filtered list of movies that the user in question hasn't rated (i.e. hasn't watched)
2. Optionally draws a random subsample of those movies, if the caller asks for one
3. Considers every unwatched movie when no subsample size is given
4. Estimates the user's rating of each candidate movie and sorts the resulting dataframe by that estimate
5. Returns as many movies as the caller asked for

### User with the most ratings (User 72315, 12952 ratings)

In [None]:
predicted_top_n(72315, 'movieId', 10, None)

### User with an average number of ratings (User 333, 61 ratings)

In [None]:
predicted_top_n(333, 'movieId', 10, None)

### User with a below-average number of ratings (User 17, 14 ratings)

In [None]:
predicted_top_n(17, 'movieId', 10, None)

### References 

- Hug, N. (2015). Getting started with Surprise. Retrieved from https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset

- Chen, D. (2020). Recommender System — singular value decomposition (SVD) & truncated SVD. Retrieved from https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361

- Sharma, P. (2018). Comprehensive Guide to build a Recommendation Engine from scratch (in Python). Retrieved from https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/

- Deisenroth, M.P., Faisal, A.A., Ong, C.S. (2021). Mathematics for Machine Learning, pp. 98-105 and pp. 317-343