# Collaborative Filtering via Matrix Factorization

An implementation of a simple collaborative filtering system using **matrix factorization** trained with stochastic gradient descent (SGD). The idea is to recommend items (here, movies) to users based on their previous ratings.

## Import

In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Matrix Factorization with Gradient Descent

#### `train_mf`
This function performs collaborative filtering with stochastic gradient descent.  
The goal is to learn two matrices:  
- **User feature matrix** $U$ (shape: *num_users* × *num_features*)  
- **Movie feature matrix** $M$ (shape: *num_movies* × *num_features*)  
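
To make the shapes concrete, here is a minimal sketch with hypothetical sizes (4 users, 4 movies, 5 features) and a seeded generator chosen only for illustration. It shows that multiplying $U$ by the transpose of $M$ gives one predicted rating per user-movie pair.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
num_users, num_movies, num_features = 4, 4, 5

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
U = rng.normal(0, 0.1, (num_users, num_features))   # one row per user
M = rng.normal(0, 0.1, (num_movies, num_features))  # one row per movie

# Full matrix of predicted ratings: entry [u, m] is the dot product of U[u] and M[m].
R_hat = U @ M.T

print(U.shape, M.shape, R_hat.shape)  # (4, 5) (4, 5) (4, 4)
```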


The function takes:
- **train_df**: the training dataset containing `(user_id, movie_id, rating)`
- **num_users**: total number of unique users
- **num_movies**: total number of unique movies
- **num_features**: number of latent features to learn
- **alpha**: learning rate for gradient descent
- **lambd**: regularization parameter to prevent overfitting 
- **epochs**: number of iterations over the dataset

For each training data point:
- It retrieves the corresponding user vector $U_u$ and movie vector $M_m$
- Computes the **predicted rating** as the dot product:
  $$
  \hat{r}_{u,m} = U_u \cdot M_m
  $$
- Computes the error:  
  $$
  e_{u,m} = r_{u,m} - \hat{r}_{u,m}
  $$
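- Updates the user and movie vectors using gradient descent, where $\alpha$ is the learning rate `alpha` and $\lambda$ is the regularization parameter `lambd`:  
  $$
  U_u \leftarrow U_u + \alpha \left(e_{u,m} M_m - \lambda U_u\right)
  $$
  $$
  M_m \leftarrow M_m + \alpha \left(e_{u,m} U_u - \lambda M_m\right)
  $$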
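
The update rules above can be read as gradient descent steps on a regularized squared error. A sketch of the derivation, assuming the per-rating objective below (the factors of $\frac{1}{2}$ are a common convention, not something stated in the notebook; $\alpha$ stands for `alpha` and $\lambda$ for `lambd`):

$$
J_{u,m} = \frac{1}{2}\left(r_{u,m} - U_u \cdot M_m\right)^2 + \frac{\lambda}{2}\left(\lVert U_u \rVert^2 + \lVert M_m \rVert^2\right)
$$

Writing $e_{u,m} = r_{u,m} - U_u \cdot M_m$, the gradients are:

$$
\frac{\partial J_{u,m}}{\partial U_u} = -e_{u,m} M_m + \lambda U_u, \qquad
\frac{\partial J_{u,m}}{\partial M_m} = -e_{u,m} U_u + \lambda M_m
$$

Moving each vector a step of size $\alpha$ against its gradient adds $\alpha \left(e_{u,m} M_m - \lambda U_u\right)$ to $U_u$ and $\alpha \left(e_{u,m} U_u - \lambda M_m\right)$ to $M_m$, which is the form of the two updates.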


In [28]:
def train_mf(train_df, num_users, num_movies, num_features=5, alpha=0.01, lambd=0.01, epochs=1000):
    # Start from small random values so the first predictions are close to zero
    U = np.random.normal(0, 0.1, (num_users, num_features))
    M = np.random.normal(0, 0.1, (num_movies, num_features))
    
    for epoch in range(epochs):
        for row in train_df.itertuples():
            # IDs are 1-based in the data; array rows are 0-based
            u = row.user_id - 1
            m = row.movie_id - 1
            r = row.rating
            pred = np.dot(U[u], M[m])
            error = r - pred
            
            # Keep the old user vector so both updates use pre-update values
            U_u_old = U[u].copy()
            U[u] += alpha * (error * M[m] - lambd * U[u])
            M[m] += alpha * (error * U_u_old - lambd * M[m])
            
    return U, M
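
To see what a single pass through the inner loop does, here is a minimal numeric sketch of one update. The two-dimensional vectors and the rating are made up for illustration; only `alpha` and `lambd` match the defaults of `train_mf`.

```python
import numpy as np

alpha, lambd = 0.01, 0.01        # same defaults as train_mf
U_u = np.array([0.1, 0.2])       # hypothetical user vector
M_m = np.array([0.3, 0.4])       # hypothetical movie vector
r = 4.0                          # hypothetical observed rating

pred = np.dot(U_u, M_m)          # 0.1*0.3 + 0.2*0.4, about 0.11
error = r - pred                 # about 3.89

# Compute both steps from the current values, then compare predictions.
new_U_u = U_u + alpha * (error * M_m - lambd * U_u)
new_M_m = M_m + alpha * (error * U_u - lambd * M_m)
new_pred = np.dot(new_U_u, new_M_m)

print(pred, new_pred)            # the new prediction is closer to r
```

One step moves the prediction only slightly toward the rating, which is why training repeats the loop over many epochs.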


### Predicting Ratings and Evaluating Model Performance

After training the **matrix factorization model**, we predict ratings for unseen data using the learned matrices $U$ and $M$ and measure the error with the root mean squared error (RMSE).

#### `predict()` 

The function takes:
- **test_df**: the test dataset containing `(user_id, movie_id, rating)`
- **U**: the learned user feature matrix
- **M**: the learned movie feature matrix

For each test data point:
- It retrieves the corresponding user vector $U_u$ and movie vector $M_m$
- Computes the **predicted rating** as the dot product:
  $$
  \hat{r}_{u,m} = U_u \cdot M_m
  $$
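
Because a dot product is unbounded, a raw prediction can fall outside the rating scale. One common post-processing step, which is not part of the notebook's `predict`, is to clip predictions to the valid range. The sketch below assumes a 1 to 5 scale; `clip_predictions` and its bounds are hypothetical names and values.

```python
import numpy as np

def clip_predictions(preds, low=1.0, high=5.0):
    """Clamp raw predictions to an assumed rating scale [low, high]."""
    return np.clip(np.asarray(preds, dtype=float), low, high)

# Values below 1 are raised to 1, values above 5 are lowered to 5.
clipped = clip_predictions([0.2, 3.29, 5.7])
print(clipped)
```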

In [None]:
def predict(test_df, U, M):
    preds = []
    truths = []
    
    for row in test_df.itertuples():
        u = row.user_id - 1
        m = row.movie_id - 1
        pred = np.dot(U[u], M[m])
        preds.append(pred)
        truths.append(row.rating)
    
    return preds, truths

def rmse(preds, truths):
    return np.sqrt(mean_squared_error(truths, preds))
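
For reference, RMSE is the square root of the mean of the squared differences between true and predicted ratings. A NumPy-only sketch of the same computation, using the hypothetical name `rmse_numpy`, can serve as a cross-check:

```python
import numpy as np

def rmse_numpy(preds, truths):
    """Root mean squared error computed directly with NumPy."""
    preds = np.asarray(preds, dtype=float)
    truths = np.asarray(truths, dtype=float)
    return float(np.sqrt(np.mean((truths - preds) ** 2)))

# Errors are 1 and -2, so the mean squared error is (1 + 4) / 2 = 2.5.
value = rmse_numpy([3.0, 5.0], [4.0, 3.0])
print(value)
```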

### Creating a small dataset

Create a small dataset of user-movie ratings and split it into training and test sets (70/30).

In [29]:
df = pd.DataFrame({
    'user_id': [1,1,1,2,2,3,3,3,4,4],
    'movie_id': [1,2,3,1,4,2,3,4,1,4],
    'rating': [5,4,3,4,5,2,3,5,4,4]
})

train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

### Training a matrix factorization model

In [None]:
num_users = df['user_id'].max()
num_movies = df['movie_id'].max()

U, M = train_mf(train_df, num_users, num_movies, epochs=500)

preds, truths = predict(test_df, U, M)

print('RMSE:', rmse(preds, truths))

RMSE: 2.4445903867871794
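
With so few ratings, a random split can leave a user or movie with no rating in the training set. Its row in `U` or `M` is then never updated and keeps its small random initial values, so predictions for it are typically far from the true rating. A quick check is sketched below with made-up `(user_id, movie_id)` pairs rather than the actual split; `unseen_ids` is a hypothetical helper.

```python
def unseen_ids(train_pairs, test_pairs):
    """Return (users, movies) that occur in test_pairs but never in train_pairs."""
    train_users = {u for u, _ in train_pairs}
    train_movies = {m for _, m in train_pairs}
    users = {u for u, _ in test_pairs} - train_users
    movies = {m for _, m in test_pairs} - train_movies
    return users, movies

# Hypothetical pairs for illustration only.
train_pairs = [(1, 1), (2, 1), (2, 4), (3, 3)]
test_pairs = [(1, 2), (3, 1)]

print(unseen_ids(train_pairs, test_pairs))  # (set(), {2})
```

With the notebook's data frames, the pairs could be built with `list(zip(train_df['user_id'], train_df['movie_id']))` and the same for `test_df`.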


### Comparing true and predicted ratings

In [None]:
results_df = pd.DataFrame({
    'UserID': test_df['user_id'].values,
    'MovieID': test_df['movie_id'].values,
    'True Rating': truths,
    'Predicted Rating': preds
})

results_df.head(1)

|   | UserID | MovieID | True Rating | Predicted Rating |
|---|--------|---------|-------------|------------------|
| 0 | 4      | 1       | 4           | 3.289931         |
