First, we will build a well-defined framework that lets us build and test our collaborative filtering models with little effort.
The framework will consist of the data, the evaluation metric, and a corresponding function that computes that metric for a given model.

### The Framework

Since collaborative filtering demands data on user behavior, we will use another dataset called MovieLens.
MovieLens gives us user ratings on a variety of movies and is available in various sizes.
The full version consists of more than 26,000,000 ratings applied to 45,000 movies by 270,000 users.

Download the 100K version of the dataset, which this notebook uses, from [here](https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset).

In [5]:
# Explore the data
# Load the u.user file into a df
import pandas as pd
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('../data/movielens/u.user', sep='|', names=u_cols, encoding='latin-1')

users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [6]:
# load u.item file into a df
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('../data/movielens/u.item', sep='|', names=i_cols, encoding='latin-1')

movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [7]:
# remove all information except movie ID and title (we don't need the rest since we're building collaborative filters)
movies = movies[['movie_id', 'title']]

In [8]:
# Load the u.data file into a df
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('../data/movielens/u.data', sep='\t', names=r_cols, encoding='latin-1')

ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [9]:
# drop timestamp as we don't need it
ratings = ratings.drop('timestamp', axis=1)

Although the ratings can only take 5 discrete values, we will model this as a regression problem.
Consider a case where the true rating given by a user to a movie is 5. A classification model will not distinguish between a predicted rating of 1 and one of 4; it will treat both as misclassified. A regression model, however, will penalize the 1 more than the 4, and that is the behaviour we want.
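To make the difference concrete, here is a minimal sketch that scores the two predictions from the example above under both views. The 0-1 loss and the squared error are used here as illustrative stand-ins for a classification and a regression penalty; the variable names are made up for this example.

```python
# True rating and the two candidate predictions from the example above
true_rating = 5
predictions = [1, 4]

# Classification view: 0-1 loss only asks whether the label is exactly right
zero_one_loss = [int(p != true_rating) for p in predictions]

# Regression view: squared error grows with the distance from the true rating
squared_error = [(true_rating - p) ** 2 for p in predictions]

print(zero_one_loss)   # [1, 1]
print(squared_error)   # [16, 1]
```

Both predictions cost the same under the 0-1 loss, while the squared error charges the prediction of 1 sixteen times as much as the prediction of 4.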

As we saw in [Data mining techniques](../Data-Mining-Techniques/Data-Mining.ipynb), the first step in building a supervised learning model is to construct the training and test sets. The model learns from the training dataset and is then evaluated on the test dataset.

#### Training and test data

We will split the dataset such that 75% of each user's ratings go to the training dataset and 25% go to the testing dataset.
To do this, we treat the user_id field as the target variable (y) and the ratings DataFrame as the predictor variables (x).
We then pass these two variables to scikit-learn's `train_test_split` function and `stratify` along y. This ensures that the proportion of each class (here, each user) is the same in both the training and testing datasets.

In [10]:
from sklearn.model_selection import train_test_split

# assign x as the original ratings df and y as the user_id column of ratings
x = ratings.copy()
y = ratings['user_id']

# split into training and test datasets, stratified along user_id
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, stratify=y, random_state=42)
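
The effect of `stratify` can be checked on a small, made-up dataset. The sketch below is illustrative only: the `toy` DataFrame (3 users with 8 ratings each) is an assumption for this example, not MovieLens data. With a 25% test size, each user should contribute 2 of their 8 ratings to the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings: 3 users with 8 ratings each
toy = pd.DataFrame({
    'user_id': [1] * 8 + [2] * 8 + [3] * 8,
    'movie_id': list(range(8)) * 3,
    'rating': [3, 4, 5, 2, 1, 4, 3, 5] * 3,
})

# Same call pattern as above: stratify along user_id
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy['user_id'], random_state=42
)

# Count how many ratings each user has in the test set
test_counts = toy_test['user_id'].value_counts().sort_index().to_dict()
print(test_counts)
```

The same `value_counts` check can be run on `x_test['user_id']` to confirm that every user is represented in the real test set.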

#### Evaluation

We already know from [data mining techniques](../Data-Mining-Techniques/Data-Mining.ipynb) that the root mean squared error (RMSE) is one of the most commonly used performance metrics for regressors.
`scikit-learn` already gives us an implementation of the mean squared error, so all we have to do is define a function that returns the square root of the value returned by `mean_squared_error`.
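
As a quick sanity check on the definition, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$, the sketch below computes the metric by hand with NumPy and compares it with the square root of `mean_squared_error`. The arrays `toy_true` and `toy_pred` are made up for illustration and do not come from MovieLens.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true ratings and a constant prediction of 3
toy_true = np.array([5.0, 3.0, 1.0])
toy_pred = np.array([3.0, 3.0, 3.0])

# Errors are 2, 0, -2, so the mean squared error is (4 + 0 + 4) / 3
manual_rmse = np.sqrt(np.mean((toy_true - toy_pred) ** 2))

# Same quantity via scikit-learn
sklearn_rmse = np.sqrt(mean_squared_error(toy_true, toy_pred))

print(manual_rmse, sklearn_rmse)
```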

In [11]:
from sklearn.metrics import mean_squared_error
import numpy as np

# function that computes the RMSE
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

All our collaborative filter models will take a user_id and a movie_id as input and output a floating-point number between 1 and 5. We therefore define our baseline model so that it returns 3 regardless of user_id and movie_id.

In [12]:
# define baseline model

def baseline(user_id, movie_id):
    return 3.0

In [13]:
# To test the potency of a model, we compute the RMSE obtained by that model for all user-movie pairs in the test dataset

def score(cf_model):
    # construct a list of user-movie tuples from the testing dataset
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    # predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    # extract the actual ratings given by the users in the test data
    y_true = np.array(x_test['rating'])
    # return the final RMSE score
    return rmse(y_true, y_pred)


In [14]:
# compute the RMSE obtained by the baseline model
score(baseline)

1.2488234462885457

Our aim with the rest of the models we build is to obtain an RMSE that is lower than the one obtained by our baseline.

## User-based collaborative filtering

As we saw in the [intro](../Intro.md), user-based collaborative filters find users similar to a particular user and then recommend products that those users have liked to the first user.
We will implement this idea in code and then test the performance of the resulting models using the framework we constructed above.

We start by building a ratings matrix where each row represents a user and each column a movie.
The entry in the ith row and jth column therefore denotes the rating given by user i to movie j.
Pandas provides a useful function called `pivot_table` to construct this matrix from our ratings df.
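
Before applying it to the real data, a small sketch shows what `pivot_table` does to long-format ratings. The `toy_ratings` frame below is made up for illustration; note how user-movie pairs without a rating become NaN in the resulting matrix.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [5, 3, 4, 2],
})

# Rows become users, columns become movies, cells hold the rating
toy_matrix = toy_ratings.pivot_table(values='rating', index='user_id', columns='movie_id')
print(toy_matrix)
```

User 2 never rated movie 20, so `toy_matrix.loc[2, 20]` is NaN rather than 0.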

In [15]:
# build the ratings matrix
r_matrix = x_train.pivot_table(values='rating', index='user_id', columns='movie_id')

r_matrix.head()

user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


### Mean
This is one of the simplest collaborative filters. It takes in a user_id and a movie_id and outputs the mean rating of the movie across all the users who have rated it. The rating of each user is assigned equal weight.
Since some movies appear only in the test set and not in the training set (and so are not in our ratings matrix), we will default to a rating of 3 for them, as in the baseline.

In [16]:
# user-based collaborative filter using the mean rating
def cf_user_mean(user_id, movie_id):
    # check if movie_id exists in r_matrix
    if movie_id in r_matrix:
        # compute the mean of all the rating given to the movie
        mean_rating = r_matrix[movie_id].mean()
    else:
        # default to a 3 if info doesn't exist
        mean_rating = 3.0
    return mean_rating

In [17]:
# compute the RMSE for the mean model
score(cf_user_mean)

1.0300824802393536

Since this RMSE is lower than the baseline's (about 1.03 versus 1.25), the mean model performs better than the baseline.
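
The mean filter recomputes a column mean on every call. One possible variation, sketched here on made-up data, computes all movie means once and looks them up afterwards. The names `toy_matrix`, `movie_means`, and `cf_mean_cached` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: rows are users, columns are movies
toy_matrix = pd.DataFrame(
    {10: [5.0, 4.0, np.nan], 20: [2.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Column-wise means; pandas skips NaN entries by default
movie_means = toy_matrix.mean()

def cf_mean_cached(user_id, movie_id, default=3.0):
    # Look up the precomputed mean, falling back to the default for unseen movies
    return movie_means.get(movie_id, default)

print(cf_mean_cached(1, 10), cf_mean_cached(1, 20), cf_mean_cached(1, 99))
```

The predictions are the same as those of a per-call mean; only the amount of repeated work changes.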


### Weighted mean

In the previous section, all users were given the same weight. However, it makes sense to give more weight to users whose ratings are similar to those of the user in question. We will therefore alter the mean model by introducing a weight coefficient.

We will use the cosine score as our similarity function, like the one we built in the [content-based engine](../Content-Based-Recommenders/Content-Based.ipynb).

Since scikit-learn's `cosine_similarity` function doesn't work with NaN values, we will convert all missing values to zero.
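
The following sketch shows the zero-filling step on a tiny, made-up ratings matrix (`toy_matrix` is an assumption for this example). Keep in mind that filling with zero treats "not rated" as a rating of 0, which is a simplification that makes the computation possible rather than a statement about user taste.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings matrix: 3 users (rows) x 3 movies (columns)
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, 1.0],
     [5.0, 2.0, np.nan],
     [np.nan, np.nan, 4.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)

# Replace missing ratings with 0 so that cosine_similarity accepts the input
toy_sim = cosine_similarity(toy_matrix.fillna(0))

print(np.round(toy_sim, 3))
```

Users 2 and 3 have no rated movie in common, so their similarity is 0, while users 1 and 2 both rated movie 10 highly and end up with a similarity close to 1.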

In [18]:
# create dummy ratings matrix with all null values imputed to 0
r_matrix_dummy = r_matrix.copy().fillna(0)

# import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# compute the cosine similarity matrix using the dummy ratings matrix
cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

# convert into pandas df
cosine_sim = pd.DataFrame(cosine_sim, index = r_matrix.index, columns=r_matrix.index)

cosine_sim.head(10)

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.0,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.0,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.0,0.139805,0.0,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.0,0.17506,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.0
4,0.029577,0.130237,0.139805,1.0,0.0,0.04519,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.0,0.130343,0.077357,0.15789,0.063911
5,0.245753,0.054918,0.0,0.0,1.0,0.176443,0.28186,0.132205,0.03879,0.1342,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
6,0.335853,0.190552,0.032485,0.04519,0.176443,1.0,0.394725,0.143385,0.125126,0.372679,...,0.328643,0.070809,0.135806,0.17167,0.125446,0.086464,0.230566,0.095478,0.197307,0.185268
7,0.344724,0.079399,0.043869,0.088586,0.28186,0.394725,1.0,0.215861,0.121224,0.378723,...,0.339853,0.110866,0.096055,0.10469,0.126108,0.075012,0.270071,0.020036,0.236086,0.266571
8,0.191582,0.076146,0.080968,0.199526,0.132205,0.143385,0.215861,1.0,0.116173,0.169088,...,0.150048,0.064242,0.118297,0.053969,0.168057,0.095736,0.164157,0.076269,0.089871,0.210995
9,0.057149,0.167992,0.022263,0.135013,0.03879,0.125126,0.121224,0.116173,1.0,0.152694,...,0.082819,0.0644,0.127051,0.069251,0.095673,0.0,0.131458,0.106763,0.089297,0.089583
10,0.251979,0.147376,0.059925,0.026919,0.1342,0.372679,0.378723,0.169088,0.152694,1.0,...,0.279849,0.087828,0.131888,0.111841,0.094423,0.080883,0.255758,0.063461,0.169309,0.181031


We only need to consider the cosine similarity scores of users who have a corresponding non-null rating. In other words, we need to exclude users who have not rated the movie in question.
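
As a worked example of this idea, the sketch below computes a similarity-weighted mean for one movie, i.e. the sum of similarity times rating over the users who rated it, divided by the sum of those similarities. All numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical similarities between the target user and users 1..4
sim_scores = pd.Series([0.9, 0.5, 0.1, 0.7], index=[1, 2, 3, 4])

# Ratings those users gave one movie; users 2 and 4 never rated it
m_ratings = pd.Series([5.0, np.nan, 1.0, np.nan], index=[1, 2, 3, 4])

# Keep only the users with a rating, and only their similarity scores
rated = m_ratings.dropna()
weights = sim_scores.loc[rated.index]

# Similarity-weighted mean: (0.9 * 5 + 0.1 * 1) / (0.9 + 0.1)
wmean = np.dot(weights, rated) / weights.sum()

print(wmean, rated.mean())
```

The weighted mean (about 4.6) sits much closer to the rating of the most similar user than the plain mean of 3.0 does.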

In [29]:
# user-based collaborative filter using the weighted mean rating
def cf_user_wmean(user_id, movie_id):

    # Default to a rating of 3.0 if the movie or the user is absent from the training data
    if movie_id not in r_matrix or user_id not in cosine_sim:
        return 3.0

    # Get the ratings for the movie in question, keeping only the users who rated it.
    # Do not test the whole column for NaN here: every movie has users who did not
    # rate it, so such a check always triggers the 3.0 fallback and reproduces the
    # baseline RMSE of 1.2488234462885457.
    m_ratings = r_matrix[movie_id].dropna()

    # Get the similarity scores between the user in question and those raters only
    sim_scores = cosine_sim.loc[m_ratings.index, user_id]

    # A zero similarity total would make the division below return NaN, which
    # mean_squared_error rejects with "ValueError: Input contains NaN ..."
    total_sim = sim_scores.sum()
    if total_sim == 0:
        return 3.0

    # Compute the final weighted mean
    return np.dot(sim_scores, m_ratings) / total_sim

In [30]:
score(cf_user_wmean)

### User demographic

These filters rely on the idea that users of the same demographic tend to have similar tastes. Their effectiveness depends on the assumption that women, teenagers, or people from the same area share the same taste in movies.

These filters do not take into account the ratings given by all users to a particular movie. Instead, they only look at the ratings from users who fit a certain demographic.

We will create a gender demographic filter. It will identify the gender of a user, compute the weighted mean rating of a movie by that particular gender, and return that as the predicted value.
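
A minimal sketch of such a filter on made-up data might look as follows. It is an illustration under stated assumptions, not the final implementation: it uses a plain (unweighted) mean per gender, and the names `toy_users`, `toy_ratings`, `gender_mean`, and `cf_gender_sketch` are hypothetical.

```python
import pandas as pd

# Hypothetical user demographics and ratings
toy_users = pd.DataFrame({'user_id': [1, 2, 3, 4], 'sex': ['M', 'F', 'M', 'F']})
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 4, 1],
    'movie_id': [10, 10, 10, 10, 20],
    'rating':   [1, 5, 4, 3, 1],
})

# Attach each rater's gender, then average the ratings per (movie, gender) pair
merged = toy_ratings.merge(toy_users, on='user_id')
gender_mean = merged.groupby(['movie_id', 'sex'])['rating'].mean()

# Lookup table from user_id to gender
user_sex = toy_users.set_index('user_id')['sex']

def cf_gender_sketch(user_id, movie_id, default=3.0):
    # Predict the mean rating given to the movie by users of the same gender
    key = (movie_id, user_sex.loc[user_id])
    if key in gender_mean.index:
        return gender_mean.loc[key]
    # No rating from this gender for this movie: fall back to the default
    return default

print(cf_gender_sketch(2, 10), cf_gender_sketch(1, 10), cf_gender_sketch(2, 20))
```

On the real data, the same pattern would merge `x_train` with `users` and be evaluated with `score`.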