# Collaborative Filtering Recommendation Systems



## 1. Introduction

Collaborative Filtering (CF) is a widely used technique for building recommendation systems.
It predicts a user's preferences from the recorded interactions (ratings, clicks, purchases) of many users, on the assumption that users who agreed in the past are likely to agree again.
CF is applied in various domains, including:
- **Streaming platforms**: Netflix uses CF to recommend movies and TV shows.
- **E-commerce**: Amazon suggests products based on user behavior.
- **Music streaming**: Spotify personalizes playlists using user-item interactions.
- **Social networks**: Facebook and Instagram recommend content based on user engagement.



## 2. Examples of Collaborative Filtering in Practice

1. **Netflix**:
   - Netflix offered a $1M prize to the first team to improve the RMSE of its recommender, Cinematch, by 10%; Collaborative Filtering methods were central to the competition.
   - Reference: Bennett, J., & Lanning, S. (2007). "The Netflix Prize." https://www.cs.uic.edu/~liub/KDD-cup-2007/NetflixPrize-description.pdf
   
2. **Amazon**:
   - Amazon uses Item-Based Collaborative Filtering to recommend products.
   - Reference: Linden, G., Smith, B., & York, J. (2003). "Amazon.com recommendations: Item-to-item collaborative filtering." https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf

3. **Spotify**:
   - Spotify combines CF with other signals in hybrid models to create personalized playlists.
   - Reference: Johnson, C. (2014). "Logistic Matrix Factorization for Implicit Feedback Data." https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf



## 3. Selected Library and Functions

**Library:** Surprise (Python)
- **Why Surprise?**
  - Supports multiple Collaborative Filtering algorithms.
  - Easy evaluation of recommendation models using metrics like RMSE.

**Functions and Parameters:**
1. **Dataset**
   - `Dataset.load_builtin('ml-100k')`: Loads the MovieLens 100k dataset.
2. **User-Based CF**
   - `KNNBasic(sim_options={'name': 'cosine', 'user_based': True})`: Implements User-Based Collaborative Filtering using cosine similarity.
3. **Matrix Factorization**
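   - `SVD()`: Matrix factorization in the style popularized by Simon Funk during the Netflix Prize. Despite the name, it does not compute an exact singular value decomposition; it learns user and item factors (plus biases) by stochastic gradient descent on the observed ratings.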
4. **Model Evaluation**
   - `accuracy.rmse(predictions)`: Calculates Root Mean Squared Error for model performance evaluation.
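
To make the `'cosine'` option more concrete, here is a minimal pure-Python sketch of cosine similarity between two users, computed only over the items both have rated. The dict-based rating format and the names `cosine_sim`, `alice`, and `bob` are illustrative assumptions, not part of Surprise, and the library's own implementation may differ in its details.

```python
from math import sqrt

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating (toy format,
    assumed for illustration). Returns 0.0 if no items are shared.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Two hypothetical users who share the items m1 and m3
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m3": 5, "m4": 2}

sim_ab = cosine_sim(alice, bob)
print(round(sim_ab, 4))  # 40 / 41 -> 0.9756
```

Because only co-rated items enter the calculation, two users with a single movie in common can end up with a similarity of exactly 1.0, which is one reason neighbor-based methods can be fragile on sparse data.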

In [9]:
# Importing necessary libraries
from surprise import Dataset, Reader, KNNBasic, SVD, accuracy
from surprise.model_selection import train_test_split
import pandas as pd


## 4. Data Set Characteristics

**Dataset:** MovieLens 100k  
- **Description**: A widely used dataset containing user ratings of movies.  
- **Key Characteristics**:
  - **Number of Users**: 943.
  - **Number of Movies**: 1,682.
  - **Number of Ratings**: 100,000.
  - **Sparsity**: ~93.7% of the user-item matrix is empty (100,000 ratings out of 943 × 1,682 = 1,586,126 possible user-movie pairs).  
- **Features**:
  - User ID.
  - Movie ID.
  - Rating (1–5 scale).
  - Timestamp.
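
The sparsity figure follows directly from the three counts listed above. A short worked calculation, using only those numbers:

```python
n_users = 943
n_items = 1682
n_ratings = 100_000

# Every user-movie pair is one cell of the user-item matrix
n_cells = n_users * n_items      # 1,586,126
density = n_ratings / n_cells    # fraction of cells that hold a rating
sparsity = 1 - density           # fraction of cells that are empty

print(f"possible pairs: {n_cells:,}")
print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

So roughly 6.3% of the cells hold a rating, and about 93.7% are empty.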

In [10]:
# Loading data and inspecting its structure

# Load the built-in MovieLens 100k dataset
data_raw = Dataset.load_builtin('ml-100k')

# Convert raw data into a Pandas DataFrame for analysis
df = pd.DataFrame(data_raw.raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Check the dataset dimensions
print("\nDataset dimensions:")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of unique users: {df['user_id'].nunique()}")
print(f"Number of unique items: {df['item_id'].nunique()}")

# Display the distribution of ratings
print("\nRating distribution:")
print(df['rating'].value_counts())

First 5 rows of the dataset:
  user_id item_id  rating  timestamp
0     196     242     3.0  881250949
1     186     302     3.0  891717742
2      22     377     1.0  878887116
3     244      51     2.0  880606923
4     166     346     1.0  886397596

Dataset dimensions:
Number of rows: 100000
Number of unique users: 943
Number of unique items: 1682

Rating distribution:
rating
4.0    34174
3.0    27145
5.0    21201
2.0    11370
1.0     6110
Name: count, dtype: int64



## 5. Empirical Analysis

**Goal**: To evaluate and compare the performance of User-Based Collaborative Filtering and Matrix Factorization (SVD) on the MovieLens 100k dataset.

### Steps:
1. Load and preprocess the dataset.
2. Implement User-Based CF and Matrix Factorization.
3. Evaluate models using RMSE.
4. Interpret results.
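
Since step 3 relies on RMSE, the sketch below shows what the metric computes: the square root of the mean squared difference between true and estimated ratings. The helper name `rmse` and the toy pairs are assumptions for illustration; in the notebook itself, `accuracy.rmse` does this over the predictions returned by `test()`, each of which carries the true rating (`r_ui`) and the estimate (`est`).

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(pairs))

# Three hypothetical predictions: errors of 0.5, -1.0 and 0.5 rating points
toy = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
toy_rmse = rmse(toy)
print(round(toy_rmse, 4))  # sqrt((0.25 + 1.0 + 0.25) / 3) -> 0.7071
```

Squaring the errors means that a single large miss weighs more than several small ones, so RMSE penalizes occasional bad predictions more heavily than mean absolute error would.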


In [11]:
# Splitting the data into training and testing sets

# Split the dataset using Surprise
trainset, testset = train_test_split(data_raw, test_size=0.2, random_state=42)

# User-Based Collaborative Filtering

# Define similarity options for User-Based Collaborative Filtering
sim_options = {'name': 'cosine', 'user_based': True}  # Use cosine similarity between users

# Create the User-Based CF model
user_based_model = KNNBasic(sim_options=sim_options)

# Train the model on the training set
user_based_model.fit(trainset)

# Generate predictions for the test set
user_based_predictions = user_based_model.test(testset)

# Calculate and display the RMSE for User-Based CF
user_based_rmse = accuracy.rmse(user_based_predictions)
print(f"\nUser-Based CF RMSE: {user_based_rmse}")

# Matrix Factorization (SVD)

# Create the SVD model
svd_model = SVD()

# Train the model on the training set
svd_model.fit(trainset)

# Generate predictions for the test set
svd_predictions = svd_model.test(testset)

# Calculate and display the RMSE for SVD
svd_rmse = accuracy.rmse(svd_predictions)
print(f"SVD RMSE: {svd_rmse}")

# Making predictions for a specific user and item 

# Predict the rating for user ID 196 and movie ID 302
predicted_rating = svd_model.predict(uid=str(196), iid=str(302))
print(f"\nPredicted Rating for User 196 and Movie 302: {predicted_rating.est}")

# Comparing the results

# Print a comparison of the RMSE values for the two methods
print("\nComparison of Methods:")
print(f"User-Based CF RMSE: {user_based_rmse}")
print(f"SVD RMSE: {svd_rmse}")


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0194

User-Based CF RMSE: 1.0193536815834319
RMSE: 0.9347
SVD RMSE: 0.9346883169891882

Predicted Rating for User 196 and Movie 302: 4.077811301035312

Comparison of Methods:
User-Based CF RMSE: 1.0193536815834319
SVD RMSE: 0.9346883169891882



## 6. Interpretation of Results

The results compare two Collaborative Filtering methods, User-Based CF and Matrix Factorization (SVD), by RMSE (Root Mean Squared Error) on the held-out 20% test split. Lower RMSE means more accurate rating predictions.
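
To put the two RMSE values on a common footing, the gap can be expressed both in rating points and as a relative reduction. The numbers below are copied from the output of the run shown above; the variable names `abs_gap` and `rel_gap` are illustrative.

```python
user_based_rmse = 1.0193536815834319  # User-Based CF, from the run above
svd_rmse = 0.9346883169891882         # SVD, from the run above

abs_gap = user_based_rmse - svd_rmse  # difference in rating points
rel_gap = abs_gap / user_based_rmse   # reduction relative to User-Based CF

print(f"absolute gap: {abs_gap:.4f}")
print(f"relative gap: {rel_gap:.1%}")
```

On this split, SVD's error is about 0.085 rating points lower, a reduction of roughly 8.3% relative to User-Based CF.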

### User-Based Collaborative Filtering 

RMSE: 1.0194

This is the test-set RMSE for the User-Based CF model, measured in rating points on the 1–5 scale. It is higher than SVD's, so the model's predictions are less accurate on this split. One likely contributor is sparsity: with most ratings missing, two users often share only a few co-rated movies, which makes their cosine similarity unreliable.
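
The effect of missing ratings is easier to see in a sketch of how a neighbor-based prediction is formed: a similarity-weighted average of the ratings that the most similar users gave to the target item. The function `predict_knn`, its toy input format, and the example data are assumptions for illustration; the default `k=40` mirrors the documented default of `KNNBasic`, but this is not the library's code.

```python
def predict_knn(target_item, neighbors, k=40):
    """Similarity-weighted average of neighbors' ratings for target_item.

    neighbors: list of (similarity, ratings_dict) pairs for other users
    (toy format, assumed for illustration). Only neighbors who rated the
    item and have positive similarity contribute. Returns None when no
    such neighbor exists.
    """
    rated = [
        (sim, ratings[target_item])
        for sim, ratings in neighbors
        if target_item in ratings and sim > 0
    ]
    # Keep the k most similar neighbors among those who rated the item
    top = sorted(rated, key=lambda pair: pair[0], reverse=True)[:k]
    denom = sum(sim for sim, _ in top)
    if denom == 0:
        return None
    return sum(sim * rating for sim, rating in top) / denom

# Hypothetical neighbors: the third is similar but never rated m1
neighbors = [
    (0.9, {"m1": 4}),
    (0.6, {"m1": 2}),
    (0.8, {"m2": 5}),
]
pred = predict_knn("m1", neighbors, k=2)
print(round(pred, 2))  # (0.9*4 + 0.6*2) / (0.9 + 0.6) -> 3.2
```

The third neighbor is highly similar but contributes nothing because it has no rating for `m1`. In a sparse matrix this happens often, so many predictions rest on only a handful of neighbors.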

### Matrix Factorization (SVD)

RMSE: 0.9347

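The SVD model has a lower RMSE, showing better prediction accuracy than User-Based CF on this split.
SVD learns latent factors (low-dimensional vectors that capture hidden relationships between users and items), so it can draw on every observed rating rather than only on direct neighbors, which tends to help on sparse data.
This comparison comes from a single train/test split with default hyperparameters, so it suggests, rather than proves, that matrix factorization is the better fit for sparse datasets like MovieLens 100k. Because `SVD()` starts from random factors and no `random_state` was passed to it, the exact figure will vary slightly between runs.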
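
As a rough illustration of what learning latent factors means, the sketch below trains a tiny biased matrix factorization by stochastic gradient descent, predicting each rating as `mu + b_u + b_i + p_u · q_i`. The function `train_mf`, the toy ratings, and the hyperparameter values are assumptions chosen for a small example; this is a simplified sketch of the general technique, not Surprise's implementation.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit a biased matrix factorization with SGD; return (estimate, mu).

    ratings: list of (user, item, rating) triples (toy format).
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    b_u = {u: 0.0 for u in users}                      # user biases
    b_i = {i: 0.0 for i in items}                      # item biases
    # Small random latent vectors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def estimate(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - estimate(u, i)
            # Gradient steps with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return estimate, mu

# Hypothetical ratings: 3 users, 3 movies, 6 of 9 cells observed
toy_ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u2", "m3", 2),
    ("u3", "m2", 4), ("u3", "m3", 1),
]
estimate, mu = train_mf(toy_ratings)

n = len(toy_ratings)
baseline_mse = sum((r - mu) ** 2 for _, _, r in toy_ratings) / n
trained_mse = sum((r - estimate(u, i)) ** 2 for u, i, r in toy_ratings) / n
print(f"baseline MSE: {baseline_mse:.3f}, trained MSE: {trained_mse:.3f}")

# The model can also score a pair it never saw during training
print(f"u1 on m3: {estimate('u1', 'm3'):.2f}")
```

Every observed rating updates the factors of both its user and its item, so the model can score an unseen pair such as `("u1", "m3")` without needing neighbors who rated that exact movie.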

### Specific Prediction

Predicted Rating for User 196 and Movie 302: 4.08

The SVD model predicts that User 196 would rate Movie 302 approximately 4.08 (the run above printed 4.0778). This suggests the user is likely to enjoy the movie, since the predicted rating is well above the midpoint of the 1–5 scale.