First, we will build a well-defined framework that lets us build and test our collaborative filtering models with little effort.
The framework consists of the data, the evaluation metric, and a corresponding function that computes that metric for a given model.

### The Framework

Since collaborative filtering requires data on user behavior, we will use another dataset called MovieLens.
MovieLens gives us user ratings on a variety of movies and is available in various sizes.
The full version consists of more than 26,000,000 ratings applied to 45,000 movies by
270,000 users.

Download the 100k version of the dataset from [here](https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset).

In [1]:
# Explore the data
# Load the u.user file into a DataFrame
import pandas as pd
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('../data/movielens/u.user', sep='|', names=u_cols, encoding='latin-1')

users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [2]:
# load u.item file into a df
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('../data/movielens/u.item', sep='|', names=i_cols, encoding='latin-1')

movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [3]:
# keep only the movie ID and title (we don't need the other columns since we're building collaborative filters)
movies = movies[['movie_id', 'title']]

In [4]:
# Load the u.data file into a df
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('../data/movielens/u.data', sep='\t', names=r_cols, encoding='latin-1')

ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [5]:
# drop timestamp as we don't need it
ratings = ratings.drop('timestamp', axis=1)

Although the ratings can take only five discrete values, we will model this as a regression problem.
Consider a case where the true rating given by a user to a movie is 5. A classification model will not distinguish between a predicted rating of 1 and a predicted rating of 4; it treats both as misclassified. A regression model, however, penalizes the 1 more than the 4, which is the behaviour we want.
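
To make the difference concrete, here is a minimal sketch that compares a 0/1 classification loss with the squared error for the two predictions in this example. The helper names `misclassification_loss` and `squared_error` are hypothetical and are not part of the framework.

```python
# Illustrative helpers (hypothetical names, not part of the framework)
def misclassification_loss(y_true, y_pred):
    """Classification view: 0 if the predicted class is right, 1 otherwise."""
    return 0 if y_true == y_pred else 1

def squared_error(y_true, y_pred):
    """Regression view: the penalty grows with the distance from the truth."""
    return (y_true - y_pred) ** 2

true_rating = 5
for predicted in (1, 4):
    print(predicted,
          misclassification_loss(true_rating, predicted),
          squared_error(true_rating, predicted))
# prints:
# 1 1 16
# 4 1 1
```

Both predictions receive the same classification loss, while the squared error separates them by a factor of 16.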

As we saw in [Data mining techniques](../Data-Mining-Techniques/Data-Mining.ipynb), the first step in building a supervised learning model is to construct the training and test sets. The model learns from the training dataset and is evaluated on the test dataset.

#### Training and test data

We will split the dataset such that 75% of each user's ratings are in the training dataset and 25% are in the test dataset.
To do this, we treat the user_id field as the target variable (y) and the ratings DataFrame as the predictor variables (x).
We then pass these two variables into scikit-learn's `train_test_split` function and `stratify` along y. This ensures that the proportion of each class (here, each user) is the same in both the training and test datasets.

In [6]:
from sklearn.model_selection import train_test_split

# assign x as the original ratings df and y as the user_id column of ratings
x = ratings.copy()
y = ratings['user_id']

# split into training and test datasets, stratifies along user_id
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, stratify=y, random_state=42)
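
As a quick check of what stratifying on the user does, the sketch below runs the same kind of split on a small made-up dataset and counts how many ratings of each user land in each part. The toy arrays and variable names are assumptions for illustration only; they are not taken from MovieLens.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 5 users with 8 ratings each
toy_users = np.repeat(np.arange(1, 6), 8)
toy_rows = np.arange(len(toy_users))  # stand-in for the rating rows

rows_train, rows_test, users_train, users_test = train_test_split(
    toy_rows, toy_users, test_size=0.25, stratify=toy_users, random_state=42
)

# Count how many ratings of each user landed in each split
train_counts = {int(u): int((users_train == u).sum()) for u in np.unique(toy_users)}
test_counts = {int(u): int((users_test == u).sum()) for u in np.unique(toy_users)}
print(train_counts)
print(test_counts)
```

With eight ratings per user and `test_size=0.25`, each user should end up with six ratings in the training part and two in the test part. In the real data the per-user counts vary, so the 75/25 ratio holds only approximately for each user.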

#### Evaluation

We already know from [data mining techniques](../Data-Mining-Techniques/Data-Mining.ipynb) that the root mean squared error (RMSE) is the most commonly used performance metric for regressors.
`scikit-learn` already provides an implementation of the mean squared error, so all we have to do is define a function that returns the square root of the value returned by `mean_squared_error`.

In [8]:
from sklearn.metrics import mean_squared_error
import numpy as np

# function that computes the root mean squared error (RMSE)
def rsme(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
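
To see what this function computes, here is a small worked example that calculates the RMSE by hand with NumPy on three made-up ratings. The numbers are illustrative only; on the same inputs, `rsme` should return the same value.

```python
import numpy as np

# Toy ratings (made up for illustration)
y_true = np.array([5.0, 3.0, 4.0])
y_pred = np.array([3.0, 3.0, 3.0])

# errors: 2, 0, 1 -> squared: 4, 0, 1 -> mean: 5/3 -> square root: ~1.291
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(float(rmse_by_hand), 3))  # prints 1.291
```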

All our collaborative filtering models will take a user_id and a movie_id as input and output a floating-point number between 1 and 5. We therefore define our baseline model so that it returns 3 regardless of user_id and movie_id.

In [11]:
# define baseline model

def baseline(user_id, movie_id):
    return 3.0

In [9]:
# To test the potency of a model, we compute the RMSE obtained by that model over all user-movie pairs in the test dataset

def score(cf_model):
    # construct a list of user-movie tuples from the test dataset
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    # predict the rating for every user-movie tuple
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    # extract the actual ratings given by the users in the test data
    y_true = np.array(x_test['rating'])
    # return the final RMSE score
    return rsme(y_true, y_pred)
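
The sketch below walks through the same steps on a tiny made-up test set, so the result can be checked by hand. It is an illustration under stated assumptions: `score_on`, `constant_model`, and `toy_test` are hypothetical names, the test set is passed in explicitly instead of being read from `x_test`, and the RMSE is computed directly with NumPy.

```python
import numpy as np
import pandas as pd

# Toy test set (made up for illustration)
toy_test = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [4, 2, 3],
})

def constant_model(user_id, movie_id):
    # Hypothetical model that ignores its inputs and always predicts 3
    return 3.0

def score_on(cf_model, test_df):
    # Same steps as score, but the test set is an explicit argument
    id_pairs = zip(test_df['user_id'], test_df['movie_id'])
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(test_df['rating'])
    # errors here are 1, -1, 0 -> mean squared error 2/3
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(float(score_on(constant_model, toy_test)), 3))  # prints 0.816
```

Any function with the signature `(user_id, movie_id) -> float` can be scored this way, which is what lets the framework compare different models with one call.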


In [12]:
# compute the RMSE obtained by the baseline model
score(baseline)

1.2488234462885457

Our aim with the rest of the models we build is to obtain an RMSE lower than the one obtained by our baseline.

### User-based collaborative filtering

As we saw in the [intro](../Intro.md), user-based collaborative filters find users similar to a particular user and then recommend products that those users have liked to the first user.
We will implement this idea in code and then test its performance using the framework we just constructed.

We start by building a ratings matrix where each row represents a user and each column a movie.
Therefore, the entry in the ith row and jth column denotes the rating given by user i to movie j.
Pandas provides a useful function called `pivot_table` to construct this matrix from our ratings df.
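
As a minimal sketch of the idea, the example below applies `pivot_table` to a made-up ratings frame with two users and three movies. The toy data and the names `toy_ratings` and `toy_matrix` are assumptions for illustration; the column names mirror those of our ratings df.

```python
import pandas as pd

# Toy ratings (made up for illustration): user 1 has not rated movie 30
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'movie_id': [10, 20, 10, 20, 30],
    'rating': [5, 3, 4, 2, 1],
})

# Rows are users, columns are movies, cells hold the rating
toy_matrix = toy_ratings.pivot_table(values='rating', index='user_id', columns='movie_id')
print(toy_matrix)
```

User-movie pairs without a rating show up as `NaN` in the resulting matrix, so a real ratings matrix built this way will be mostly empty.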