<a href="https://colab.research.google.com/github/Coolinglass/Applied-Machine-Learning-Projects/blob/master/User_Based_Collaborative_Filtering_recommender_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## MovieLens 100K Dataset

The MovieLens 100K dataset has been a standard benchmark for recommender systems for more than 20 years, which makes it a good starting point for learning about them. For non-commercial, personalised movie recommendations, see the website: https://movielens.org/

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

## Data Description


**Ratings** -- The full u data set, 100,000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a comma separated list of `user id | item id | rating | timestamp`. The timestamps are Unix seconds since 1/1/1970 UTC.
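
To make the timestamp convention concrete, here is a minimal sketch that converts two raw values (copied from the ratings preview shown later in this notebook) into readable dates. The variable names `stamps` and `dates` are illustrative only.

```python
import pandas as pd

# Two raw timestamp values taken from the ratings preview in this notebook
stamps = pd.Series([881250949, 891717742])

# Interpret the integers as seconds elapsed since 1970-01-01 00:00:00 UTC
dates = pd.to_datetime(stamps, unit="s", utc=True)

print(dates)
```

Both dates should fall inside the collection period quoted above (September 1997 to April 1998), which is a quick sanity check that the unit really is seconds.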


**Movie Information** -- Information about the items (movies); this is a comma separated list of `movie id | movie title | release date | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western`. The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once.
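
As a small illustration of the genre encoding, the sketch below turns a row of 19 flags back into genre names. The helper `decode_genres` and the example flags are hypothetical; they are not part of the notebook or the dataset files.

```python
# Genre names in the column order listed above
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def decode_genres(flags):
    """Return the names of the genres whose flag is 1."""
    if len(flags) != len(GENRES):
        raise ValueError("expected one flag per genre")
    return [name for name, flag in zip(GENRES, flags) if flag == 1]


# Hypothetical movie flagged as both Comedy and Romance
flags = [0] * len(GENRES)
flags[GENRES.index("Comedy")] = 1
flags[GENRES.index("Romance")] = 1

print(decode_genres(flags))
```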


**User Demographics** -- Demographic information about the users; this is a comma separated list of `user id | age | gender | occupation | zip code`.

## Table of Contents

[1. Reading Dataset](#Reading-Dataset)

[2. Merging Movie information to ratings dataframe](#merge)

[3. Creating train and test data & setting evaluation metric](#eval)

[4. Simple Baseline](#simplebaseline)

[6. User based Collaborative filtering with simple user mean](#usermean)

[7. User based Collaborative filtering with similarity weighted mean](#userwmean)


## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
#Reading ratings file:
ratings = pd.read_csv('ratings.csv')

#Reading Movie Info File
movie_info = pd.read_csv('movie_info.csv')

## 2.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie names are contained in a separate file. Let's merge that data with ratings and store the result in the ratings dataframe, so that each rating carries its movie title; this will be useful later on.

In [None]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [None]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie id,movie title
0,196,242,3,881250949,242,Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997)
2,22,377,1,878887116,377,Heavyweights (1994)
3,244,51,2,880606923,51,Legends of the Fall (1994)
4,166,346,1,886397596,346,Jackie Brown (1997)
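
A left merge keeps every rating even when no matching movie is found, in which case the title is simply missing. The sketch below uses two toy frames (illustrative values, with a deliberately unknown movie id 9999) and the `indicator=True` option of `merge` to count such unmatched rows; it is a sanity check one might run, not a step from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the ratings and movie_info frames (not the real files)
toy_ratings = pd.DataFrame(
    {"user_id": [1, 2, 3], "movie_id": [242, 302, 9999], "rating": [3, 3, 4]}
)
toy_movies = pd.DataFrame(
    {"movie id": [242, 302],
     "movie title": ["Kolya (1996)", "L.A. Confidential (1997)"]}
)

# indicator=True adds a '_merge' column saying where each row came from
merged = toy_ratings.merge(
    toy_movies, how="left", left_on="movie_id", right_on="movie id",
    indicator=True,
)

# Rows tagged 'left_only' had no matching movie, so their title is NaN
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```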


Let's also combine the movie id and movie title, separated by ': ', and store the result in a new column named movie.

In [None]:
#Build the label "<movie_id>: <movie title>" for every row
ratings['movie'] = ratings['movie_id'].map(str) + ': ' + ratings['movie title'].map(str)

In [None]:
ratings.head(2)

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie id,movie title,movie
0,196,242,3,881250949,242,Kolya (1996),242: Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997),302: L.A. Confidential (1997)


In [None]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

Keep the columns movie, user_id and rating in the ratings dataframe and drop all the others.

In [None]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [None]:
ratings.head(1)

Unnamed: 0,user_id,rating,movie
0,196,3,242: Kolya (1996)


In [None]:
ratings = ratings[['user_id','movie','rating']]

In [None]:
ratings.head(1)

Unnamed: 0,user_id,movie,rating
0,196,242: Kolya (1996),3


## 3. Creating Train & Test Data & Setting Evaluation Metric <a class="anchor" id="eval"></a>
To test how well a given rating prediction method performs, we first need to define our train and test sets. We will use only the train set to build the different models, and evaluate each model on the test set.

In [None]:
#Assign X as the original ratings dataframe
X = ratings.copy()

#Split into training and test datasets
X_train, X_test = train_test_split(X, test_size = 0.25, random_state=42)

In [None]:
#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
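
As a quick worked example of the metric, the sketch below computes RMSE by hand with NumPy on three made-up ratings. The squared errors are 0, 1 and 4, so the result is the square root of 5/3.

```python
import numpy as np

# Made-up true ratings and a constant prediction of 3 for each of them
y_true_toy = np.array([3, 4, 5])
y_pred_toy = np.array([3, 3, 3])

# Errors are [0, 1, 2]; squared errors are [0, 1, 4]
errors = y_true_toy - y_pred_toy

# Root of the mean squared error: sqrt((0 + 1 + 4) / 3)
rmse_value = np.sqrt(np.mean(errors ** 2))
print(rmse_value)
```

Because the errors are squared before averaging, one prediction that is off by 2 contributes as much as four predictions that are each off by 1.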

In [None]:
X_test

Unnamed: 0,user_id,movie,rating
75721,877,381: Muriel's Wedding (1994),4
80184,815,"602: American in Paris, An (1951)",3
19864,94,431: Highlander (1986),4
76699,416,875: She's So Lovely (1997),2
92991,500,182: GoodFellas (1990),2
...,...,...,...
21271,399,684: In the Line of Fire (1993),3
34014,222,"580: Englishman Who Went Up a Hill, But Came D...",3
81355,551,162: On Golden Pond (1981),5
65720,803,"988: Beautician and the Beast, The (1997)",1


## 4. Simple Baseline using average of all ratings <a class="anchor" id="simplebaseline"></a>

A simple baseline gives us the RMSE we get by averaging all the available ratings and using that average as the prediction for every user-movie pair in the test set. It also gives us a score to beat when we move on to more complex techniques; if a later model does not beat it, something probably needs to change.

In [None]:
#Define the baseline model to always return average of all available ratings
def baseline(user_id, movie):
    return X_train['rating'].mean()

In [None]:
X_train['rating'].mean()

3.5292666666666666

In [None]:
X_train

Unnamed: 0,user_id,movie,rating
98980,811,901: Mr. Magoo (1997),4
69824,804,755: Jumanji (1995),3
9928,52,287: Marvin's Room (1996),5
75599,735,181: Return of the Jedi (1983),4
95621,897,96: Terminator 2: Judgment Day (1991),5
...,...,...,...
6265,216,231: Batman Returns (1992),2
54886,343,276: Leaving Las Vegas (1995),5
76820,437,475: Trainspotting (1996),3
860,284,322: Murder at 1600 (1997),3


In [None]:
#Function to compute the RMSE score obtained on the test set by a model
def rmse_score(model):

    #Construct a list of user-movie tuples from the test dataset
    id_pairs = zip(X_test['user_id'], X_test['movie'])

    #Predict the rating for every user-movie tuple
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])

    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['rating'])

    #Return the final RMSE score

    return rmse(y_true, y_pred)

In [None]:
#Same steps as inside rmse_score, run by hand for the baseline model
id_pairs = zip(X_test['user_id'], X_test['movie'])
y_pred = np.array([baseline(user, movie) for (user, movie) in id_pairs])

In [None]:
rmse_score(baseline)

1.1244396573898978

## 6. User based Collaborative filtering with simple user mean <a class="anchor" id="usermean"></a>
For user-based CF we discussed the steps for using a weighted mean of similar users' ratings. Let's first try something simpler: predict each rating as the plain average of all the ratings given to that movie by the other users. To do that, we first create the ratings matrix using the pandas pivot_table function.

In [None]:
#Build the ratings matrix using pivot_table function
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie')

r_matrix.head()

movie,1000: Lightning Jack (1994),"1001: Stupids, The (1996)","1002: Pest, The (1997)",1003: That Darn Cat! (1997),1004: Geronimo: An American Legend (1993),"1005: Double vie de Véronique, La (Double Life of Veronique, The) (1991)",1006: Until the End of the World (Bis ans Ende der Welt) (1991),1007: Waiting for Guffman (1996),1008: I Shot Andy Warhol (1996),1009: Stealing Beauty (1996),...,992: Head Above Water (1996),993: Hercules (1997),"994: Last Time I Committed Suicide, The (1997)","995: Kiss Me, Guido (1997)","996: Big Green, The (1995)",997: Stuart Saves His Family (1995),998: Cabin Boy (1994),999: Clean Slate (1994),99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,3.0,5.0
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
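
The real matrix is too wide to read comfortably, so here is a minimal sketch of the same pivot on a toy frame with three users and two movies (made-up values). It shows that unrated user-movie pairs become NaN and that column means ignore them.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) pair that was rated
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 1],
})

# Users become rows, movies become columns; missing pairs are NaN
toy_matrix = toy.pivot_table(values="rating", index="user_id", columns="movie")
print(toy_matrix)

# Per-movie mean rating; NaN entries are skipped by default
print(toy_matrix.mean())
```

Note that `pivot_table` aggregates duplicates with the mean by default, so a user who rated the same movie twice would end up with the average of those ratings in a single cell.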


In [None]:
#User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, movie):

    #Check if movie exists in r_matrix
    if movie in r_matrix:

        #Compute the mean of all the ratings given to the movie
        mean_rating = r_matrix[movie].mean()

    else:
        #Default to average rating from the train set
        mean_rating = X_train['rating'].mean()

    return mean_rating

In [None]:
#Compute RMSE for the Mean model
rmse_score(cf_user_mean)

1.0224465207437918

We have improved the RMSE noticeably with this simple change (from about 1.124 to about 1.022), which shows that there is value in using other users' ratings to predict the rating of a movie.

## 7. User based Collaborative filtering with similarity weighted mean <a class="anchor" id="userwmean"></a>
Now let's compute the Pearson correlation between users and use these correlations as weights to predict the unknown ratings, then check the performance.

In [None]:
#Compute the Pearson Correlation using the ratings matrix with corr function from Pandas
pearson_corr = r_matrix.T.corr()

In [None]:
#Convert into pandas dataframe
pearson_corr = pd.DataFrame(pearson_corr, index=r_matrix.index, columns=r_matrix.index)

pearson_corr.head(10)

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,-0.01785714,-0.2758386,-0.688247,0.343604,0.167618,0.35613,0.669623,-0.3015113,-0.2648507,...,0.116327,-0.255377,0.3556769,0.0,0.148884,0.787562,0.4268828,-2.166933e-16,-0.4372411,0.102244
2,-0.017857,1.0,9.930137e-17,0.57735,0.0,0.411569,0.514376,0.0,0.5,0.06933752,...,0.104828,0.174078,0.1518871,0.081044,-0.09505,,0.2000817,,0.02054554,0.583333
3,-0.275839,9.930137e-17,1.0,0.207514,,-0.265949,-0.735147,0.102598,,0.5773503,...,,,-0.1705606,-0.57735,-0.158777,,-8.392497000000001e-17,,0.3370999,
4,-0.688247,0.5773503,0.2075143,1.0,,,-0.328897,0.57735,,,...,,,,,0.866025,,0.7938842,,,
5,0.343604,0.0,,,1.0,0.237095,0.239475,0.636003,,-0.0181116,...,0.121353,-0.5,0.2973177,0.5,0.678003,0.904534,-0.1607116,0.4082483,0.3185591,0.475075
6,0.167618,0.4115688,-0.2659489,,0.237095,1.0,0.145616,0.726489,0.07537784,0.362786,...,-0.144049,-0.229416,0.4193636,0.296961,0.038835,,0.03869116,0.1324532,0.1098244,0.078826
7,0.35613,0.5143759,-0.735147,-0.328897,0.239475,0.145616,1.0,0.291131,-0.1075829,0.2729831,...,-0.109807,-0.340307,0.5053534,0.592965,0.125578,0.26968,-0.08774509,,0.4660431,0.361683
8,0.669623,0.0,0.1025978,0.57735,0.636003,0.726489,0.291131,1.0,,0.3887408,...,-0.110657,,0.7644708,0.944911,0.877515,,0.3994298,-1.0,-1.532253e-16,0.239229
9,-0.301511,0.5,,,,0.075378,-0.107583,,1.0,3.4399e-16,...,0.866025,,0.0,0.755929,,,-0.5,1.0,,
10,-0.264851,0.06933752,0.5773503,,-0.018112,0.362786,0.272983,0.388741,3.4399e-16,1.0,...,-0.219333,-0.075378,1.279469e-16,-0.612372,,,0.1356748,0.5,0.1814575,-0.114645


Here we see that there are a lot of missing values. This happens when two users have no ratings in common or only one rating in common; in both cases the correlation is not defined. We can replace all these missing values with 0, since this essentially means that the provided data shows no correlation between the two users.
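
The toy example below (made-up ratings for three users) reproduces both situations: users 1 and 3 share a single movie, users 2 and 3 share none, and in both cases pandas returns NaN. Users 1 and 2 share two movies, so their correlation is defined. The names `toy_matrix`, `toy_corr` and `toy_corr_filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are movies; NaN means "not rated"
toy_matrix = pd.DataFrame(
    {
        "A": [5, 4, np.nan],
        "B": [3, 2, np.nan],
        "C": [4, np.nan, 2],
        "D": [np.nan, np.nan, 5],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

# corr works column-wise, so transpose to correlate users with users;
# each pair is computed only on the movies both users have rated
toy_corr = toy_matrix.T.corr()
print(toy_corr)

# Treat undefined correlations as "no evidence of similarity"
toy_corr_filled = toy_corr.fillna(0)
print(toy_corr_filled)
```

A correlation is also undefined when one of the two users gave identical ratings to all the common movies, because the variance is then zero.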

In [None]:
#Fill all the missing correlations with 0
pearson_corr = pearson_corr.fillna(0)

In [None]:
r_matrix

movie,1000: Lightning Jack (1994),"1001: Stupids, The (1996)","1002: Pest, The (1997)",1003: That Darn Cat! (1997),1004: Geronimo: An American Legend (1993),"1005: Double vie de Véronique, La (Double Life of Veronique, The) (1991)",1006: Until the End of the World (Bis ans Ende der Welt) (1991),1007: Waiting for Guffman (1996),1008: I Shot Andy Warhol (1996),1009: Stealing Beauty (1996),...,992: Head Above Water (1996),993: Hercules (1997),"994: Last Time I Committed Suicide, The (1997)","995: Kiss Me, Guido (1997)","996: Big Green, The (1995)",997: Stuart Saves His Family (1995),998: Cabin Boy (1994),999: Clean Slate (1994),99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,3.0,5.0
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,,,...,,,,,,,,,,5.0
940,,,,,,,,,,,...,,,,,,,,,,3.0
941,,,,,,,,4.0,,,...,,4.0,,,,,,,,
942,,,,,,,,,,,...,,,,,,,,,5.0,


Now we have the user-user similarities stored in the matrix pearson_corr. We will define a function to predict the unknown ratings in the test set using user-based collaborative filtering, with Pearson correlation as the similarity and all neighbours with positive correlation. For each user-movie pair:
1. Check whether the movie is in the train set; if it is not, predict the mean rating of the train set.
2. Calculate the mean rating of the active user.
3. Extract the active user's correlation values from the matrix pearson_corr and sort them in decreasing order.
4. Keep only the similarity scores of users with a positive correlation with the active user.
5. Drop all users who are similar to the active user but have not rated the target movie.
6. If no similar user has rated the target movie, predict the mean rating of the movie.
7. Otherwise, use the prediction formula to make the rating prediction.
<img src="pred_formula.png" style="width: 500px;"/>
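
In case the image does not render, the formula can be written out as follows. This form is read off the code in the next cell rather than from the image itself, so treat the notation as an assumption: $a$ is the active user, $p$ the target movie, $N$ the set of positively correlated users who rated $p$, $\bar{r}_a$ and $\bar{r}_b$ the mean ratings of users $a$ and $b$, and $\mathrm{sim}(a,b)$ their Pearson correlation.

$$
\hat{r}_{a,p} = \bar{r}_a + \frac{\sum_{b \in N} \mathrm{sim}(a,b)\,\left(r_{b,p} - \bar{r}_b\right)}{\sum_{b \in N} \mathrm{sim}(a,b)}
$$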

In [None]:
#User Based Collaborative Filter using Weighted Mean Ratings
def cf_user_wmean(user_id, movie_id):

    #Check if movie_id exists in r_matrix
    if movie_id in r_matrix:

        #Mean rating for active user
        ra = r_matrix.loc[user_id].mean()

        #Get the similarity scores for the user in question with every other user
        sim_scores = pearson_corr[user_id].sort_values(ascending = False)

        # Keep similarity scores for users with positive correlation with active user
        sim_scores_pos = sim_scores[sim_scores > 0]

        #Get the ratings given to the movie in question by the positively correlated users
        m_ratings = r_matrix[movie_id][sim_scores_pos.index]

        #Extract the indices containing NaN in the m_ratings series (Users who have not rated the target movie)
        idx = m_ratings[m_ratings.isnull()].index

        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()

        # If there are no ratings from similar users we cannot use this method so we predict just
        # the average rating of the movie else we use the prediction formula
        if len(m_ratings) == 0:
            #Default to average rating in the absence of ratings by similar users
            wmean_rating = r_matrix[movie_id].mean()
        else:
            #Drop the corresponding correlation scores from the sim_scores series
            sim_scores_pos = sim_scores_pos.drop(idx)

            #Subtract average rating of each user from the rating (rbp - mean(rb))
            m_ratings = m_ratings - r_matrix.loc[m_ratings.index].mean(axis = 1)

            #Compute the prediction: the active user's mean plus the similarity-weighted mean of the deviations
            #(np.dot gives the weighted sum, which is then divided by the sum of the weights)
            wmean_rating = ra + (np.dot(sim_scores_pos, m_ratings)/ sim_scores_pos.sum())

    else:
        #Default to average rating in the absence of any information on the movie in train set
        wmean_rating = X_train['rating'].mean()

    return wmean_rating
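
To see the arithmetic of the final step in isolation, here is a small worked example with made-up numbers: an active user with mean rating 3.5 and two neighbours with similarities 0.8 and 0.2. All names and values are illustrative, and the clipping at the end is an optional extra that the function above does not perform.

```python
import numpy as np

ra = 3.5                                   # mean rating of the active user
sims = np.array([0.8, 0.2])                # positive similarities of two neighbours
neighbour_ratings = np.array([5.0, 2.0])   # their ratings for the target movie
neighbour_means = np.array([4.0, 3.0])     # their own mean ratings

# How far each neighbour's rating is from that neighbour's own mean: [1.0, -1.0]
deviations = neighbour_ratings - neighbour_means

# Weighted mean of the deviations: (0.8 * 1.0 + 0.2 * -1.0) / (0.8 + 0.2) = 0.6
prediction = ra + np.dot(sims, deviations) / sims.sum()
print(prediction)

# Optional: keep the prediction inside the 1-5 rating scale
clipped = np.clip(prediction, 1, 5)
print(clipped)
```

Working with deviations from each neighbour's mean corrects for users who rate systematically high or low, which is why the prediction is anchored on the active user's own mean.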

In [None]:
rmse_score(cf_user_wmean)

0.9568512581492972

We see that the similarity-weighted approach improves performance further (an RMSE of about 0.957 versus about 1.022 for the simple mean). In the next video we will introduce a new library called the surprise library, which can be thought of as the Scikit Learn for recommender systems. It provides the tools needed to tune design parameters such as neighbourhood size, similarity measures and much more.