# Recommendation System with Python
Welcome!  
This project builds a movie recommender in Python to show how recommender systems work in practice. There are two common approaches to building one:

- **Content-based filtering (CBF):** Focuses on the attributes of the items (genre, keywords, descriptions, etc.).
  *Example:* If you liked a sci-fi action film, it suggests other sci-fi action films.
- **Collaborative filtering (CF):** Focuses on how users respond to items, based on their rating or interaction history.
  *Example:* If users who liked the same movies as you also liked a certain film, that film is recommended.

## Data
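We use the **MovieLens 100K** dataset, which contains 100,000 movie ratings from 943 users on 1,682 movies. The copy of `u.data` loaded below also contains rows with `user_id` 0, which is why the notebook reports 944 distinct users.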

## Methods
In this notebook, we implement **collaborative filtering** (CF), which learns from patterns of user ratings. Collaborative filtering assumes that users with similar past behavior will rate items similarly. 

Unlike content-based filtering, CF does not require item metadata. It comes in two main flavors, and we will use both:
- <u>Memory-based:</u> use similarity between users or between items (e.g., cosine similarity on the rating matrix).
    + *User-User CF*: "Users who rate movies like you also liked …"
    + *Item-Item CF*: "Users who liked this movie also liked …"
- <u>Model-based:</u> use machine learning models to learn hidden patterns from the matrix (e.g., **Singular Value Decomposition (SVD)**).

**Why this method:**  
- Capture hidden relationships (latent factors) between users and items.  
- Can produce richer, more personalized recommendations when enough data is available.

We will:
1. Implement a memory-based CF using cosine similarity on the user-item rating matrix.
2. Implement a model-based CF using **SVD** to uncover latent factors for users and movies.

## Getting Started

In [13]:
import numpy as np
import pandas as pd

In [14]:
columns_names = ['user_id', 'item_id', 'rating', 'timestamp']
# The file is tab-separated, so pass sep='\t' to read_csv.
df = pd.read_csv('u.data', sep='\t', names=columns_names)
df.head()

|   | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |


In [15]:
movie_titles = pd.read_csv('Movie_Id_Titles')
movie_titles.head()

|   | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [16]:
# Merge them together
df = pd.merge(df, movie_titles, on='item_id')
df.head()

|   | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 | Star Wars (1977) |
| 1 | 0 | 172 | 5 | 881250949 | Empire Strikes Back, The (1980) |
| 2 | 0 | 133 | 1 | 881250949 | Gone with the Wind (1939) |
| 3 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 4 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |


In [17]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


**Train-Test Split**: randomly divide the ratings into a training set (75%) and a test set (25%).

In [18]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)
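
The call above draws a different random split on every run, so the RMSE values reported later will vary slightly between runs. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable here (the seed `42` and the toy DataFrame are illustrative and not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings table standing in for the merged MovieLens DataFrame (hypothetical values).
toy_df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "item_id": [10, 20, 10, 30, 20, 30, 10, 20],
    "rating":  [5, 3, 4, 2, 1, 5, 3, 4],
})

# random_state fixes the shuffle, so the same rows land in each set on every run.
toy_train, toy_test = train_test_split(toy_df, test_size=0.25, random_state=42)
toy_train_again, toy_test_again = train_test_split(toy_df, test_size=0.25, random_state=42)

print(len(toy_train), len(toy_test))    # 6 2
print(toy_test.equals(toy_test_again))  # True
```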

## Memory-Based Collaborative Filtering

### Concept Overview
Memory-based collaborative filtering predicts a user's preference for an item by directly comparing ratings in the existing user-item matrix. Instead of learning model parameters, it uses **similarity measures** to find relationships:

* **User-User CF**: Recommend items that similar users have liked.  
* **Item-Item CF**: Recommend items similar to the ones a user has liked.

### How does it work
1. Represent the ratings as a matrix $R$, where $R_{u,i}$ is the rating of user $u$ on item $i$.  
2. Measure the similarity between two users $u$ and $v$ (or two items $i$ and $j$) with **cosine similarity**:

![alt text](image.png)

3. Predict the rating for user $u$ on item $i$:
- **User-User CF**: Predict a user's rating by taking the user's average rating plus a weighted correction based on how similar users rated the item relative to their own averages.

![alt text](image-1.png)

- **Item-Item CF**: Predict a user's rating by taking the weighted average of the user's ratings on items that are similar to the target item.

![alt text](image-2.png)

*Note:* $\bar{r}_u$ is the average rating of user $u$; $s_{u,v}$ and $s_{i,j}$ are similarity scores.
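
Because the formulas above are embedded as images, here is a small numeric sketch of the same three steps in NumPy. It assumes the standard cosine-similarity form and the two prediction rules as described in the text. The 3 × 3 matrix and all names are illustrative, and unrated cells are stored as 0, as in the rest of this notebook.

```python
import numpy as np

# Toy rating matrix: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [1.0, 1.0, 5.0],
])

def cosine_similarity_matrix(M):
    """Cosine similarity between every pair of rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return (M @ M.T) / (norms * norms.T)

user_sim = cosine_similarity_matrix(R)    # s_{u,v}
item_sim = cosine_similarity_matrix(R.T)  # s_{i,j}

# User-user rule: the user's mean plus similarity-weighted deviations of all users.
user_mean = R.mean(axis=1, keepdims=True)  # r-bar_u (zeros included, as in this notebook)
user_pred = user_mean + user_sim @ (R - user_mean) / np.abs(user_sim).sum(axis=1, keepdims=True)

# Item-item rule: similarity-weighted average of the user's own ratings.
item_pred = R @ item_sim / np.abs(item_sim).sum(axis=1)

print(np.round(user_sim, 3))
print(np.round(user_pred, 2))
print(np.round(item_pred, 2))
```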

### Advantages
- Simple and intuitive - "people with similar tastes like similar things."
- No need for item metadata (genres, keywords, etc.).
- Good baseline to understand how collaborative filtering works.

### Limitations
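- **Scalability**: The number of pairwise similarities grows quadratically with the number of users (or items).
- **Cold-start**: Cannot recommend for new users or new items with no ratings.
- **Sparsity**: When users share few rated items, similarity scores become unreliable.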

**Build User-Item Matrices**:

In [19]:
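# Build two user-item matrices, one for training and one for testing.
# Each cell holds the rating a user gave a movie; 0 means "not rated".
# itertuples() yields (index, user_id, item_id, rating, ...), so line[1] is the user and line[2] the item.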
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]  

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

In [20]:
train_data_matrix

array([[5., 3., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
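
The loop above maps IDs to positions with `id - 1`, which relies on IDs being consecutive integers starting at 1. In this data `user_id` 0 is present, so `0 - 1 = -1` writes to the last row through NumPy's negative indexing; that works here only because the matrix has 944 rows. A possible alternative, sketched under the assumption that IDs may start at 0 or have gaps, looks up each ID's position explicitly and fills the matrix in one vectorized assignment. The helper name `build_rating_matrix` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def build_rating_matrix(frame, user_ids, item_ids):
    """Dense user-item matrix; rows/columns follow the order of user_ids/item_ids."""
    user_pos = pd.Index(user_ids).get_indexer(frame["user_id"])
    item_pos = pd.Index(item_ids).get_indexer(frame["item_id"])
    matrix = np.zeros((len(user_ids), len(item_ids)))
    matrix[user_pos, item_pos] = frame["rating"].to_numpy()
    return matrix

# Toy ratings with a user_id of 0 and gaps in both ID ranges.
toy = pd.DataFrame({
    "user_id": [0, 0, 2, 5],
    "item_id": [10, 30, 10, 20],
    "rating":  [5, 1, 3, 4],
})

# Take the ID lists from the full data so train and test matrices share one layout.
user_ids = np.sort(toy["user_id"].unique())
item_ids = np.sort(toy["item_id"].unique())

toy_matrix = build_rating_matrix(toy, user_ids, item_ids)
print(toy_matrix)
```

Passing the same `user_ids` and `item_ids` when building the train and test matrices keeps their rows and columns aligned.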

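**Compute Similarity**: `pairwise_distances` with `metric='cosine'` returns the cosine *distance* (1 minus cosine similarity), not the similarity itself. The arrays below therefore have 0 on the diagonal, and larger values mean *less* similar users or items. To obtain similarities, use `1 - pairwise_distances(...)`; the outputs in this notebook were produced with the distance matrices as written.

In [21]:
# Compute pairwise cosine distances between users and between items.
# NOTE: despite the variable names, these hold cosine distances (1 - cosine similarity).
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')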

In [22]:
user_similarity

array([[0.        , 0.8503478 , 0.97281132, ..., 0.88147318, 0.67600427,
        1.        ],
       [0.8503478 , 0.        , 0.97979365, ..., 0.86585467, 0.95211479,
        1.        ],
       [0.97281132, 0.97979365, 0.        , ..., 0.93426275, 0.97749468,
        1.        ],
       ...,
       [0.88147318, 0.86585467, 0.93426275, ..., 0.        , 0.91193451,
        1.        ],
       [0.67600427, 0.95211479, 0.97749468, ..., 0.91193451, 0.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        0.        ]])

In [23]:
item_similarity

array([[0.        , 0.70804127, 0.77852594, ..., 1.        , 1.        ,
        0.94420716],
       [0.70804127, 0.        , 0.87835353, ..., 1.        , 0.91003598,
        1.        ],
       [0.77852594, 0.87835353, 0.        , ..., 1.        , 1.        ,
        0.88409989],
       ...,
       [1.        , 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [1.        , 0.91003598, 1.        , ..., 1.        , 0.        ,
        1.        ],
       [0.94420716, 1.        , 0.88409989, ..., 1.        , 1.        ,
        0.        ]])
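
Both arrays above have 0 on the diagonal because `pairwise_distances(..., metric='cosine')` returns cosine distance, which equals 1 minus cosine similarity. A minimal sketch of the conversion on a toy matrix (the matrix is illustrative; `cosine_similarity` is scikit-learn's direct similarity function):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
])

distance = pairwise_distances(toy, metric="cosine")  # 0 = same direction
similarity = 1.0 - distance                           # 1 = same direction

print(np.round(distance, 3))
print(np.round(similarity, 3))
print(np.allclose(similarity, cosine_similarity(toy)))  # True
```

Passing `similarity` rather than `distance` to `predict` would give like-minded users the larger weights. The RMSE values reported in this notebook were computed with the distance matrices, so they would change.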

**Predict Ratings**:
- User-based CF: Adjust predictions by each user's mean rating (to account for users who rate higher or lower on average).
- Item-based CF: Use item similarity, no need for mean adjustment.

In [24]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # Use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred

In [25]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

**Evaluation**: We evaluate the predictions with Root Mean Squared Error (RMSE), which measures how close the predicted ratings are to the actual ratings in the test set.
Lower RMSE = better predictions.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Score only the cells that hold an actual rating in ground_truth (non-zero entries)
    rated = ground_truth.nonzero()
    prediction = prediction[rated].flatten()
    ground_truth = ground_truth[rated].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))

In [28]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.1127818146863335
Item-based CF RMSE: 3.4420188636509703
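
To see what the masking inside `rmse` does, here is a small sketch on toy arrays (values are illustrative). Only cells where the ground truth is non-zero contribute to the error. The NumPy-only helper is named `masked_rmse` to avoid clashing with the notebook's function and is assumed to behave the same way.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
pred = np.array([[4.0, 2.5],
                 [1.0, 5.0]])

# Scored cells: (4 - 5) = -1 and (5 - 3) = 2, so MSE = (1 + 4) / 2 = 2.5.
print(masked_rmse(pred, truth))  # sqrt(2.5), about 1.5811
```

The predictions 2.5 and 1.0 sit in unrated cells and are ignored, however far off they might be.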


### Summary

Accuracy:
- User-based CF RMSE ≈ 3.11
- Item-based CF RMSE ≈ 3.44

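=> User-based CF has the lower RMSE here (3.11 vs. 3.44), which may indicate that user-to-user comparisons capture stronger patterns on this split. Both values are high for a 1-5 rating scale; one likely contributor is that unrated cells are stored as 0 and enter the user means and weighted sums as if they were ratings.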

## Model-Based Collaborative Filtering

### Concept Overview
Model-based collaborative filtering predicts ratings by **learning hidden factors** that explain user preferences and item characteristics, instead of relying solely on direct similarity between rating patterns. The most common model-based approach is **Matrix Factorization (MF)**, where we decompose the large, sparse user-item rating matrix into the product of lower-dimensional matrices that represent **latent features** (hidden patterns). 

<u>Key idea</u>: Even if two users have not rated the same movies, they might share hidden preferences (e.g., "likes 80s sci-fi" or "prefers romantic comedies"). Matrix factorization uncovers these hidden dimensions and uses them to predict unknown ratings.

### How does it work
In this project, we will use a well-known matrix factorization method called **Singular Value Decomposition (SVD)**.

Given a user-item matrix $R$ of size $m \times n$, matrix factorization approximates $R$ as:

![alt text](image-3.png)

where:
- $R$: the original matrix (e.g., user-item ratings).
- $U$: an $m \times k$ matrix with orthonormal columns, whose rows are the user feature vectors ("latent factors").
- $S$: a $k \times k$ diagonal matrix of singular values (the strength of each factor).
- $V^T$: a $k \times n$ matrix, the transpose of $V$, whose columns are the item feature vectors.
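
A small sketch of these shapes, using NumPy's dense `np.linalg.svd` on a toy 5 × 4 matrix rather than the sparse solver used later in the notebook. The matrix values and `k = 2` are illustrative.

```python
import numpy as np

# Toy rating matrix: m = 5 users, n = 4 items, 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 4.0],
])

k = 2  # number of latent factors to keep

# np.linalg.svd returns singular values in descending order.
U_full, s_full, Vt_full = np.linalg.svd(R, full_matrices=False)
U, S, Vt = U_full[:, :k], np.diag(s_full[:k]), Vt_full[:k, :]

R_approx = U @ S @ Vt  # rank-k approximation of R

print(U.shape, S.shape, Vt.shape)  # (5, 2) (2, 2) (2, 4)
print(np.round(R_approx, 2))
```

Keeping all singular values reproduces `R` exactly; truncating to `k` factors keeps only the strongest patterns.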

### Advantages
- Handles **sparsity** better than memory-based CF.
- Scales to large datasets.
- Captures deeper, abstract relationships (latent factors) beyond simple rating overlap.

### Limitations
- Needs enough rating data to learn meaningful factors.
- Susceptible to cold-start problems (new users or new items with no ratings).
- SVD can be computationally expensive and may overfit if not regularized.

First, let's check the sparsity of the dataset:

In [None]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


Here, the dataset is **~94%** sparse, which indicates that most user-movie pairs are unrated.

Apply **Singular Value Decomposition (SVD)**: 

In [None]:
from scipy.sparse.linalg import svds

# Factorize the training matrix into three low-rank matrices.
# k is the number of latent factors to keep: k=20 keeps only the 20 largest singular values,
# which capture the most significant patterns and discard much of the noise.
u, s, vt = svds(train_data_matrix, k=20)

In [38]:
# Multiply the three SVD components to reconstruct an approximation of the original rating matrix, which contains the predicted ratings for every user-movie pair
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)

In [39]:
# Evaluate with RMSE
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.7107675909887847
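
The choice `k = 20` is one option among many. As a rough sketch of how the rank affects the fit, the loop below truncates a dense SVD of a random toy matrix at several values of `k` and measures the reconstruction error against that same matrix. The toy data, the seed, and the helper name `rank_k_approximation` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "ratings": integers 0..5, where 0 stands for "not rated".
toy_matrix = rng.integers(0, 6, size=(30, 20)).astype(float)

def rank_k_approximation(matrix, k):
    """Best rank-k approximation of matrix via a truncated dense SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

errors = {}
for k in (1, 5, 10, 20):
    approx = rank_k_approximation(toy_matrix, k)
    errors[k] = float(np.sqrt(np.mean((toy_matrix - approx) ** 2)))
    print(k, round(errors[k], 4))
```

The error against the matrix being factorized can only shrink as `k` grows, so it cannot be used to pick `k`. The held-out test matrix, scored with `rmse` as above, is the relevant yardstick.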


### Summary

Accuracy:
- Model-based CF with SVD RMSE ≈ 2.71

=> Overall, the model-based CF with SVD achieved a lower RMSE (≈ 2.71) than the memory-based methods (≈ 3.11 user-based, ≈ 3.44 item-based). On this split, the latent-factor approach predicts held-out ratings more accurately, which is consistent with it capturing hidden factors (such as decade, genre, or actor preferences) even when most user-movie pairs are unrated.

<u>Extra Note</u>: A hybrid recommender system combines collaborative filtering with content-based models. This approach can further improve accuracy and mitigate cold-start issues by using user or item metadata to make a prediction when no ratings are available.