# Collaborative Filtering Model #

Collaborative filtering is a method used in recommender systems to predict a user's interests based on the preferences of a larger user group. It operates under the principle that users who agreed in the past will agree in the future about certain items. The two main types are:

User-Based Collaborative Filtering: This method makes recommendations based on the similarities between users. It's effective when there are fewer users than items, but it can face challenges in scalability and changing user preferences.

Item-Based Collaborative Filtering: This approach focuses on the similarities between items, based on user ratings or interactions. It's preferred in cases where the number of items is smaller than the number of users and is generally more stable as items tend to change less frequently than user preferences.

When to Use:
User-based filtering is suitable for systems with stable user preferences and a manageable user base.

Item-based filtering works well in larger systems with more users, as it tends to be more scalable and efficient.
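
To make the distinction concrete, here is a minimal sketch on a made-up ratings table. The user names, movie labels, and ratings are hypothetical and do not come from MovieLens. It computes cosine similarity between two users over the movies both have rated (the user-based view) and between two movies over the users who rated both (the item-based view), using only the standard library.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 5.0},
    "carol": {"A": 1.0, "B": 5.0},
}

def cosine(xs, ys):
    """Cosine similarity between two equal-length rating lists."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0

def user_similarity(u, v):
    """User-based view: compare two users on the movies both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    return cosine([ratings[u][m] for m in common],
                  [ratings[v][m] for m in common])

def item_similarity(i, j):
    """Item-based view: compare two movies on the users who rated both."""
    common = sorted(u for u in ratings if i in ratings[u] and j in ratings[u])
    return cosine([ratings[u][i] for u in common],
                  [ratings[u][j] for u in common])

sim_alice_bob = user_similarity("alice", "bob")
sim_alice_carol = user_similarity("alice", "carol")
sim_a_b = item_similarity("A", "B")
print(round(sim_alice_bob, 3), round(sim_alice_carol, 3), round(sim_a_b, 3))
```

In this toy table, alice and bob rate the three movies in a similar pattern, so their similarity is higher than the similarity between alice and carol. The same function serves both views; only the choice of which vectors to compare changes.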

1. Formulating a Prediction Question:
Our primary objective is to leverage a collaborative filtering recommender algorithm to predict movie ratings by users.
"Based on the MovieLens dataset, how can we predict the rating a user would give to a movie they haven't seen yet, based on the ratings provided by users with similar viewing habits?" 

This involves identifying user similarities based on their movie rating patterns and utilizing these similarities to predict ratings for unseen movies.


2. Dataset
For this assignment, I chose the MovieLens dataset from the Kaggle repository; it contains the files movies.csv and ratings.csv. These files are commonly used to build movie recommender systems: movies.csv holds information about the movies, such as movie IDs, titles, and genres, and ratings.csv holds the users' ratings of those movies.

3. Importing Libraries:
To build a recommender system in Python, we need to import libraries such as pandas for data manipulation, numpy for numerical operations, and surprise for building and evaluating recommender systems.

4. Loading and Preprocessing the Dataset:
We'll load the datasets into data frames and preprocess them. Preprocessing can include handling missing values, encoding categorical variables, and merging datasets; here, only a missing-value check and a merge on movieId are needed.

# Importing libraries and loading the data 

In [3]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Load the datasets
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Display the first few rows of each dataset for exploration
movies_head = movies.head()
ratings_head = ratings.head()

movies_head, ratings_head


(   movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  ,
    userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  964983815
 4       1       50     5.0  964982931)

In [4]:
# Merging the datasets on 'movieId'
df = pd.merge(ratings, movies, on='movieId')

In [4]:
# Display the first few rows of the DataFrame for a quick overview

df.head()

   userId  movieId  rating   timestamp             title                                       genres
0       1        1     4.0   964982703  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1       5        1     4.0   847434962  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2       7        1     4.5  1106635946  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
3      15        1     2.5  1510577970  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
4      17        1     4.5  1305696483  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy


In [5]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

Missing Values:
userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
dtype: int64


In [6]:
df.shape

(100836, 6)

In [7]:
#Display the information of the DataFrame for a quick overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 4.6+ MB


In [8]:
# Calculate summary statistics
df.describe()


           userId       movieId    rating     timestamp
count    100836.0      100836.0  100836.0      100836.0
mean   326.127564  19435.295718  3.501557  1205946000.0
std    182.618491  35530.987199  1.042529   216261000.0
min           1.0           1.0       0.5   828124600.0
25%         177.0        1199.0       3.0  1019124000.0
50%         325.0        2991.0       3.5  1186087000.0
75%         477.0        8122.0       4.0  1435994000.0
max         610.0      193609.0       5.0  1537799000.0


5. Splitting the Dataset
Before building the recommender system model, we need to split the data into training and testing sets. This is crucial for training the model and then evaluating its performance on unseen data. 
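
As a rough illustration of what a random 75/25 split does, the sketch below shuffles a list of (user, movie, rating) rows with a fixed seed and cuts it into two disjoint parts. The helper name `split_ratings` and the toy rows are hypothetical; this is not how Surprise implements its split, only the same idea in plain Python.

```python
import random

def split_ratings(rows, test_size=0.25, seed=42):
    """Shuffle (user, item, rating) rows and split them into train and test lists."""
    rng = random.Random(seed)          # local RNG, so the global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: 20 users x 5 movies = 100 ratings
rows = [(u, m, float((u + m) % 5 + 1)) for u in range(20) for m in range(5)]

train, test = split_ratings(rows, test_size=0.25, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split, and therefore the reported error metrics, repeatable across runs. Surprise's `train_test_split` accepts a `random_state` argument for the same purpose; without it, each run of the notebook produces a slightly different split and slightly different RMSE and MAE values.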

In [9]:
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split

# Defining a reader with the rating scale
reader = Reader(rating_scale=(0.5, 5))

# Loading the dataset into Surprise's format
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# Splitting the dataset into training and testing sets (75% train, 25% test)
trainset, testset = train_test_split(data, test_size=0.25)


6. Building the Recommender System
For collaborative filtering, we use the Singular Value Decomposition (SVD) algorithm, a popular matrix factorization choice for rating prediction, and train it on the training set.
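
The sketch below shows the idea behind this kind of model: each rating is approximated as a global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector, all learned by stochastic gradient descent. Surprise's SVD uses the same prediction rule, but this is a simplified illustration and not the library's implementation; the function names and the toy data are assumptions made for the example.

```python
import random
from math import sqrt

def train_mf(train, n_factors=2, n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Fit a biased matrix factorization to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in train) / len(train)      # global mean rating
    bu = {u: 0.0 for u, _, _ in train}                 # user biases
    bi = {i: 0.0 for _, i, _ in train}                 # item biases
    # Small random latent factors for every user (p) and item (q)
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in bu}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in bi}
    for _ in range(n_epochs):                          # one epoch = one pass over train
        for u, i, r in train:
            err = r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i])))
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q

def predict(model, u, i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

def rmse(model, data):
    """Root mean square error of the model on (user, item, rating) triples."""
    return sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in data) / len(data))

# Hypothetical toy ratings with a user effect and an item effect
train = [(u, i, 2.0 + (u % 3) + 0.5 * (i % 2)) for u in range(6) for i in range(5)]

untrained = train_mf(train, n_epochs=0)            # only the global mean and random factors
trained = train_mf(train, n_epochs=200, lr=0.01)
print(rmse(untrained, train), rmse(trained, train))
```

The parameters `n_epochs`, `lr`, and `reg` play the same roles as `n_epochs`, `lr_all`, and `reg_all` in Surprise: the number of passes over the training data, the step size, and the strength of the penalty on large parameters.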

In [10]:
from surprise import KNNBasic, SVD
from surprise.model_selection import cross_validate

# Initialize the SVD algorithm
model = SVD()

# Train the model on the training dataset
model.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x12072bb50>

7. Making Predictions and Measuring Accuracy:
We make predictions on the test set and evaluate them with two metrics: RMSE (Root Mean Square Error) and MAE (Mean Absolute Error).

In [11]:
from surprise import accuracy

# Making predictions on the test set
predictions = model.test(testset)

# Compute and print RMSE and MAE
accuracy_rmse = accuracy.rmse(predictions)
accuracy_mae = accuracy.mae(predictions)


RMSE: 0.8736
MAE:  0.6712


RMSE (Root Mean Square Error) - 0.8736:
An RMSE of 0.8736 means that the root of the mean squared prediction error (the difference between actual and predicted ratings) is approximately 0.8736 rating points. If the errors average out to zero, this equals their standard deviation.
Lower RMSE values are better as they indicate smaller errors. On this dataset's 0.5-5 rating scale, an RMSE of 0.8736 suggests a moderate level of prediction error.
MAE (Mean Absolute Error) - 0.6712:
An MAE of 0.6712 means that, on average, the absolute error of the predictions is 0.6712 rating points.
Like RMSE, lower values of MAE are better. An MAE of 0.6712 is relatively low, indicating that the model has decent accuracy in predictions.
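
For reference, both metrics can be computed by hand from pairs of actual and predicted ratings. The sketch below uses four made-up pairs (not taken from the MovieLens results) to show how one large error affects RMSE more than MAE.

```python
from math import sqrt

def rmse(pairs):
    """Root mean square error over (actual, predicted) pairs."""
    return sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)

# Hypothetical (actual, predicted) ratings; the last prediction is off by 2 points
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.5), (2.0, 4.0)]

toy_rmse = rmse(pairs)
toy_mae = mae(pairs)
print(toy_rmse, toy_mae)
```

Because RMSE squares each error before averaging, it is always at least as large as MAE on the same predictions, and the gap between the two grows when a few predictions are far off.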

Interpreting These Results in Context:
Goodness of Fit: Both RMSE and MAE are relatively low on the held-out test set, which indicates that the model generalizes reasonably well and makes fairly accurate predictions for ratings it was not trained on.

8. Fine-tuning with GridSearchCV
We use GridSearchCV to search for better parameters for the SVD algorithm. This step can improve the model's performance by tuning its hyperparameters with cross-validation.

In [12]:
from surprise.model_selection import GridSearchCV

# Define a parameter grid to search over
param_grid = {
    'n_epochs': [5, 10, 20], 
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6]
}

# Use grid search to find the best parameters
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best RMSE score
best_rmse = gs.best_score['rmse']

# Combination of parameters that gave the best RMSE score
best_params = gs.best_params['rmse']

best_rmse, best_params

(0.8865458940632052, {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4})

Best RMSE Score - 0.8865:
This is the lowest (best) mean RMSE (Root Mean Square Error) achieved across the 3-fold cross-validation runs of the GridSearchCV process.
An RMSE of 0.8865 means that the root of the mean squared prediction error is about 0.8865 rating points.
On this dataset's 0.5-5 rating scale, this RMSE indicates a moderate level of accuracy: the typical prediction error is less than one rating point.

Best Parameters:
n_epochs: 20 - This indicates that the best results were obtained when the model was iterated 20 times over the training set. In the context of algorithms like SVD, an epoch is a single pass through the entire training set. More epochs can lead to a better-trained model but also increase the risk of overfitting.
lr_all: 0.005 - This is the learning rate for all parameters, which controls the size of the steps the algorithm takes during optimization. A learning rate of 0.005 suggests a balance between speed and accuracy of convergence.
reg_all: 0.4 - This refers to the regularization term, which is used to prevent overfitting by penalizing larger model parameters. A value of 0.4 is the smaller of the two values in the grid, but it is still strong regularization compared with Surprise's default of 0.02 for SVD, so the grid only explored heavily regularized models.
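
To see how much work this grid implies, the sketch below expands the same parameter grid into its individual combinations with the standard library. The helper `expand_grid` is a hypothetical name for illustration; it mirrors what a grid search enumerates but does not train any model.

```python
from itertools import product

# Same grid as in the SVD search above
param_grid = {
    "n_epochs": [5, 10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
}

def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

combos = expand_grid(param_grid)
cv = 3                              # number of cross-validation folds
n_fits = len(combos) * cv           # each combination is trained once per fold

print(len(combos), n_fits)
```

With 3 x 2 x 2 values there are 12 combinations, and 3-fold cross-validation trains each one three times, for 36 model fits in total. The reported best score is the lowest mean RMSE among those 12 combinations only.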

Interpretation and Context:
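Model Optimization: The combination of parameters leading to this RMSE value is the best setting among the 12 combinations in the grid, not necessarily the best possible setting. It represents a trade-off between overfitting and underfitting, learning speed, and regularization.

Performance Consideration: The RMSE of 0.8865 is relatively good, but it is not lower than the 0.8736 obtained with the default parameters on the test set. The two figures come from different evaluation protocols (3-fold cross-validation versus a single 75/25 split), so they are not strictly comparable, but the grid search did not clearly improve on the defaults.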

## Building the Recommender System using KNNBasic algorithm

In [13]:
from surprise.accuracy import rmse, mae

# Configure the algorithm to use user-based collaborative filtering
sim_options = {
    'name': 'cosine',
    'user_based': True  # for user-based collaborative filtering; set to False for item-based
}

# Create the KNNBasic model
model = KNNBasic(sim_options=sim_options)

# Train the model
model.fit(trainset)

# Make predictions on the test set
predictions = model.test(testset)

# Calculate and print RMSE and MAE
accuracy_rmse = rmse(predictions)
accuracy_mae = mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9736
MAE:  0.7505


Cosine Similarity Matrix Computation:
"Computing the cosine similarity matrix... Done computing similarity matrix." This message indicates that the KNN algorithm has computed the user-user similarity matrix using the cosine similarity metric.
Cosine similarity is the cosine of the angle between two vectors. Here, each vector holds one user's ratings for the movies that both users in a pair have rated (in item-based filtering, it holds one movie's ratings from the users who rated both movies). A cosine similarity close to 1 implies a high degree of similarity.

RMSE (Root Mean Square Error) - 0.9736:
RMSE measures the magnitude of the errors between predicted and actual ratings, with large errors weighted more heavily. An RMSE of 0.9736 means that the root of the mean squared prediction error is around 0.9736 points on the rating scale.
Compared to the SVD model, this RMSE is noticeably higher on the same 0.5-5 rating scale. A lower RMSE is generally desired as it indicates higher prediction accuracy.

MAE (Mean Absolute Error) - 0.7505:
MAE measures the average absolute difference between predicted and actual ratings. An MAE of 0.7505 means that, on average, the model's predictions are about 0.7505 points off from the actual ratings.
Like RMSE, a lower MAE is preferable as it indicates more accurate predictions.

Interpretation and Context:
Model Performance: The RMSE and MAE values suggest that the KNN model with cosine similarity provides moderately accurate predictions. However, its errors are higher than those of the SVD model evaluated on the same test set.

Suitability of the Model: The effectiveness of a KNN-based model can depend heavily on the nature of the data. Sparse datasets or datasets with a wide range of preferences can sometimes challenge KNN's performance.
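
The prediction step of a basic user-based KNN model can be sketched as a similarity-weighted average over the k most similar users who rated the movie. The names, similarities, and ratings below are hypothetical, and the function is a simplified illustration rather than Surprise's KNNBasic code.

```python
def knn_predict(target_user, item, ratings, sims, k=2):
    """Predict a rating as the similarity-weighted mean over the k nearest neighbors.

    ratings: user -> {item: rating}
    sims:    (target_user, other_user) -> similarity
    """
    # Candidate neighbors: other users who rated the item and have positive similarity
    candidates = [
        (sims[(target_user, v)], user_ratings[item])
        for v, user_ratings in ratings.items()
        if v != target_user and item in user_ratings and sims.get((target_user, v), 0.0) > 0
    ]
    top = sorted(candidates, reverse=True)[:k]     # k most similar neighbors
    if not top:
        return None                                # no usable neighbor for this item
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# Hypothetical data: alice has not rated movie "X"
ratings = {
    "alice": {},
    "bob":   {"X": 4.0},
    "carol": {"X": 2.0},
    "dave":  {"X": 5.0},
}
sims = {("alice", "bob"): 0.9, ("alice", "carol"): 0.3, ("alice", "dave"): 0.6}

pred_k2 = knn_predict("alice", "X", ratings, sims, k=2)
pred_k3 = knn_predict("alice", "X", ratings, sims, k=3)
print(pred_k2, pred_k3)
```

With k=2 only bob and dave contribute; with k=3 carol's low rating is also included, with a small weight, and pulls the prediction down. When ratings are sparse, few neighbors have rated a given movie, which is one reason KNN can struggle on such data.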

In [14]:
from surprise import KNNBasic
from surprise.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'k': [10, 20, 30],
    'sim_options': {
        'name': ['msd', 'cosine', 'pearson'],
        'user_based': [True]
    }
}

# Use grid search to find the best parameters for KNNBasic
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best score and parameters
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
[... the same messages repeat for every fold and parameter combination; output truncated ...]

## Comparing the results of the KNN model with the SVD model: ##
RMSE Comparison:
The SVD model has a lower RMSE (0.8736) compared to the user-based KNN model (0.9736).
A lower RMSE indicates that the SVD model's predictions are, on average, closer to the actual ratings than those of the KNN model.
The SVD model seems to better capture the variance in the user-item ratings matrix.
MAE Comparison:
Similarly, the SVD model shows a lower MAE (0.6712) than the KNN model (0.7505).
This suggests that the SVD model's predictions are more accurate on average, with less absolute error in its predictions.
Interpretation:
Model Effectiveness: The SVD model appears to be more effective for this particular dataset based on both RMSE and MAE metrics. This suggests that it might be better at capturing user preferences and predicting ratings in the dataset.

Algorithm Differences: These differences can be attributed to the underlying mechanisms of the algorithms. SVD is a matrix factorization technique that can capture complex patterns in the data, often performing well even with sparse datasets. KNN, on the other hand, relies on similarity between users or items, which might not be as effective if the dataset doesn’t exhibit strong similarity patterns or is sparse.
Context and Data Characteristics: The choice between SVD and KNN should also consider the specific characteristics of the dataset and the application's needs. For example, if interpretability is key, KNN might be preferred despite its slightly lower performance.

Conclusion:
Based on the comparison, the SVD model seems to be a better fit for the dataset in terms of both RMSE and MAE. However, the final decision should also take into account factors like computational efficiency, scalability, and the specific nature of the dataset and recommendation context.

## Item-based collaborative filtering: ##
In item-based collaborative filtering, the system makes recommendations based on the similarity between items rather than users. This approach is particularly useful when we have more users than items, as it's often easier to calculate the similarity between a smaller set of items than a larger set of users.

Here's a general outline of how to implement item-based collaborative filtering, especially using the surprise library in Python:

1. Choose the Right Algorithm
For item-based collaborative filtering, we can use algorithms like KNNBasic, but with a configuration that focuses on item similarities. In surprise, this can be done by setting the user_based field to False in the sim_options argument.

In [15]:
from surprise import KNNBasic
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split

sim_options = {
    'name': 'cosine',
    'user_based': False  # compute similarities between items
}

algo = KNNBasic(sim_options=sim_options)

In [16]:
trainset, testset = train_test_split(data, test_size=0.25)
algo.fit(trainset)


Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x1744ad7d0>

In [17]:
predictions = algo.test(testset)
# Calculate and print RMSE and MAE
accuracy_rmse = rmse(predictions)
accuracy_mae = mae(predictions)

RMSE: 0.9731
MAE:  0.7582


In [18]:
from surprise import KNNBasic
from surprise.model_selection import GridSearchCV

param_grid = {
    'k': [20, 30, 40],
    'sim_options': {
        'name': ['msd', 'cosine'],
        'min_support': [1, 5],
        'user_based': [True, False]
    }
}

# Use grid search to find the best parameters for KNNBasic
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best score and parameters
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
[... the same messages repeat for every fold and parameter combination; output truncated ...]

Best RMSE Score - 0.9143:
The best RMSE (Root Mean Square Error) achieved across all parameter combinations was 0.9143.
RMSE is a standard measure of the magnitude of the prediction error: the root of the mean squared error of the model's predictions is about 0.9143 points on the rating scale.

Best Parameters:
k: 40 - The optimal number of neighbors was found to be 40. This means the best results were achieved when each prediction considered the 40 most similar users or items.

sim_options: A dictionary indicating the best combination of similarity options:
name: 'msd' - The best similarity metric was the Mean Squared Difference. This metric performs calculations based on the squared difference between ratings.

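min_support: 1 - The minimum number of common ratings needed for the similarity calculation was 1: common users when comparing two items (user_based False), or common items when comparing two users.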
user_based: False - This indicates that the best results were achieved using item-based collaborative filtering rather than user-based.
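
The MSD similarity and the effect of min_support can be sketched as follows for two items. The similarity is taken as 1 / (msd + 1), where msd is the mean squared difference between the two items' ratings from users who rated both. The function name and the toy ratings are hypothetical, and this is an illustration of the formula rather than Surprise's own code.

```python
def msd_similarity(ratings_i, ratings_j, min_support=1):
    """MSD-based similarity between two items given {user: rating} dicts.

    Returns 0.0 when the items share fewer than min_support raters.
    """
    common = set(ratings_i) & set(ratings_j)       # users who rated both items
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_i[u] - ratings_j[u]) ** 2 for u in common) / len(common)
    return 1.0 / (msd + 1.0)                       # identical ratings give similarity 1

# Hypothetical ratings for two movies; users u1 and u2 rated both
movie_a = {"u1": 4.0, "u2": 5.0, "u3": 3.0}
movie_b = {"u1": 4.0, "u2": 4.0, "u4": 2.0}

sim_default = msd_similarity(movie_a, movie_b, min_support=1)
sim_strict = msd_similarity(movie_a, movie_b, min_support=5)
print(sim_default, sim_strict)
```

With min_support=1 a similarity is computed even from a single shared rater, which keeps more neighbors available on sparse data; a higher threshold such as 5 discards pairs with little overlap by setting their similarity to zero.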

Interpretation:
Model Performance: An RMSE of 0.9143 suggests moderate accuracy. Whether this is acceptable depends on the context of the application and the nature of the dataset.
Item-Based Filtering Effectiveness: The fact that user_based was set to False with the best performance suggests that for this particular dataset, item-based collaborative filtering is more effective than user-based.
Parameter Suitability: The combination of k=40 and msd for similarity indicates that considering a larger number of neighbors and using the MSD metric for calculating similarities between items leads to more accurate predictions.

Conclusion:
This grid search has helped identify the most effective parameters for the collaborative filtering model, specifically pointing towards an item-based approach with a fairly high number of neighbors. The RMSE indicates that there is room for improvement, and depending on the requirements of the application, further tuning or a different approach might be needed for better accuracy.

References:

Harper, F. M., & Konstan, J. A. (2015). The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), Article 19. http://dx.doi.org/10.1145/2827872

Kaggle. (2018). MovieLens [Data set]. https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset
