# Exercise: Implementing Matrix Factorization from Scratch

**Course:** Recommender Systems <br>
**Professor:** Guilherme MEDEIROS MACHADO <br>
**Topic:** Collaborative Filtering with Matrix Factorization

---

## Goal of the Exercise

The objective of this exercise is to build a movie recommender system by implementing the **Matrix Factorization** algorithm from scratch using Python. We will use the famous **MovieLens 100k** dataset. By the end of this notebook, you will have:

1.  Understood the theoretical foundations of matrix factorization.
2.  Implemented the algorithm using **Stochastic Gradient Descent (SGD)**.
3.  Trained your model on real-world movie rating data.
4.  Evaluated your model's performance using Root Mean Squared Error (RMSE).
5.  Generated personalized top-10 movie recommendations for a specific user.

This exercise forbids the use of pre-built matrix factorization libraries (like `surprise`, `lightfm`, etc.) to ensure you gain a deep understanding of the inner workings of the algorithm.

---

## The Dataset: MovieLens 100k

We will be using the MovieLens 100k dataset, a classic dataset in the recommender systems community. It contains:
* 100,000 ratings (1-5) from...
* 943 users on...
* 1682 movies.

You will need two files from this dataset:
* `u.data`: The full dataset of 100k ratings. Each row is in the format: `user_id`, `item_id`, `rating`, `timestamp`.
* `u.item`: Information about the movies (items). Each row contains the `item_id`, `movie_title`, and other metadata. We'll use it to get the movie names for our final recommendations.

Let's start by downloading and exploring the data.

In [51]:
import pandas as pd
import numpy as np
import os
from urllib.request import urlretrieve
import zipfile

# --- Download the dataset if it doesn't exist ---
if not os.path.exists('ml-100k'):
    print("Downloading MovieLens 100k dataset...")
    url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
    urlretrieve(url, 'ml-100k.zip')
    with zipfile.ZipFile('ml-100k.zip', 'r') as zip_ref:
        zip_ref.extractall()
    print("Download and extraction complete.")

# --- Load the data ---
# u.data contains the ratings
data_cols = ['user_id', 'item_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', names=data_cols)

# u.item contains movie titles
item_cols = ['item_id', 'title'] + [f'col{i}' for i in range(22)] # Remaining columns are not needed
movies_df = pd.read_csv('ml-100k/u.item', sep='|', names=item_cols, encoding='latin-1', usecols=['item_id', 'title'])

# Merge the two dataframes to have movie titles and ratings in one place
df = pd.merge(ratings_df, movies_df, on='item_id')

print("Data loaded successfully!")
df.head()

Data loaded successfully!


Unnamed: 0,user_id,item_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


---

## Part 1: Data Preparation

The raw data is a list of ratings. For matrix factorization, it's conceptually easier to think of our data as a large **user-item interaction matrix**, let's call it $R$. In this matrix:
* The rows represent users.
* The columns represent movies (items).
* The value at cell $(u, i)$, denoted $R_{ui}$, is the rating user $u$ gave to movie $i$.

This matrix is typically very **sparse**, as most users have only rated a small fraction of the available movies.

Let's create this matrix using a Pandas pivot table. This will also help us determine the number of unique users and movies.

In [52]:
# TODO:Your code here
from sklearn.model_selection import train_test_split

def split_data(df):
    df_notes = df[df["rating"].notnull()]
    train, test = train_test_split(df_notes, test_size=0.2, random_state=42)
    return (train, test)
    
def create_user_item_matrix(df):
    """
    Creates the user-item interaction matrix from the dataframe.

    Args:
        df (pd.DataFrame): The dataframe containing user_id, item_id, and rating.

    Returns:
        pd.DataFrame: A user-item matrix with users as rows, items as columns, and ratings as values.
                       NaNs indicate that a user has not rated an item.
    """
    df=df[["user_id","rating","title"]]
    train,test=split_data(df)
    R_train = train.pivot_table(index="user_id", columns="title", values="rating")
    R_test=test.pivot_table(index="user_id", columns="title", values="rating")

    all_titles = df["title"].unique()

    R_train = R_train.reindex(columns=all_titles)
    R_test = R_test.reindex(columns=all_titles)
    
    return (R_train,R_test)
    # TODO: Create a pivot table.
    # The index should be 'user_id', columns 'item_id', and values 'rating'.

In [53]:
R_train,R_test=create_user_item_matrix(df)


In [54]:
R_train.head(10)

title,Kolya (1996),L.A. Confidential (1997),Heavyweights (1994),Legends of the Fall (1994),Jackie Brown (1997),Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),"Hunt for Red October, The (1990)","Jungle Book, The (1994)",Grease (1978),"Remains of the Day, The (1993)",...,Sleepover (1995),Everest (1998),Nobody Loves Me (Keiner liebt mich) (1994),Getting Away With Murder (1996),Scream of Stone (Schrei aus Stein) (1991),Mamma Roma (1962),"Eighth Day, The (1996)",Girls Town (1996),"Silence of the Palace, The (Saimt el Qusur) (1994)",Dadetown (1995)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,,,,,,,,,5.0,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,2.0,,,5.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,1.0,,...,,,,,,,,,,
6,4.0,4.0,,,,5.0,,1.0,,3.0,...,,,,,,,,,,
7,,,,2.0,,5.0,5.0,4.0,5.0,4.0,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,4.0,,,,,,,,,,...,,,,,,,,,,
10,,4.0,,,,4.0,,,,,...,,,,,,,,,,


---

## Part 2: The Theory of Matrix Factorization

The core idea is to **decompose** our large, sparse user-item matrix $R$ (size $m \times n$) into two smaller, dense matrices:
1.  A **user-feature matrix** $P$ (size $m \times k$).
2.  An **item-feature matrix** $Q$ (size $n \times k$).

Here, $k$ is the number of **latent factors**, which is a hyperparameter we choose. These latent factors represent hidden characteristics of users and items. For movies, a factor might represent the "amount of comedy" vs. "drama", or "blockbuster" vs. "indie film". For users, a factor might represent their preference for these characteristics.



The prediction of a rating $\hat{r}_{ui}$ that user $u$ would give to item $i$ is calculated by the dot product of the user's latent vector $p_u$ and the item's latent vector $q_i$:

$$\hat{r}_{ui} = p_u \cdot q_i^T = \sum_{k=1}^{K} p_{uk} q_{ik}$$

Our goal is to find the matrices $P$ and $Q$ such that their product $P \cdot Q^T$ is as close as possible to the known ratings in our original matrix $R$. We formalize this using a **loss function**. A common choice is the sum of squared errors, with **regularization** to prevent overfitting:

$$L = \sum_{(u,i) \in \mathcal{K}} (r_{ui} - \hat{r}_{ui})^2 + \lambda \left( \sum_{u} ||p_u||^2 + \sum_{i} ||q_i||^2 \right)$$

Where:
* $\mathcal{K}$ is the set of $(u, i)$ pairs for which the rating $r_{ui}$ is known.
* $\lambda$ is the regularization parameter, another hyperparameter.

---

## Part 3: The Algorithm - Stochastic Gradient Descent (SGD)

To minimize our loss function $L$, we will use **Stochastic Gradient Descent (SGD)**. Instead of calculating the gradient over all known ratings (which is computationally expensive), SGD iterates through each known rating one by one and updates the parameters in the direction that minimizes the error for that single rating.

For each known rating $r_{ui}$:
1.  Calculate the prediction error: $e_{ui} = r_{ui} - \hat{r}_{ui}$
2.  Update the user and item latent vectors ($p_u$ and $q_i$) using the following update rules:

$$p_u \leftarrow p_u + \alpha \cdot (e_{ui} \cdot q_i - \lambda \cdot p_u)$$
$$q_i \leftarrow q_i + \alpha \cdot (e_{ui} \cdot p_u - \lambda \cdot q_i)$$

Where:
* $\alpha$ is the **learning rate**, a hyperparameter that controls the step size.

We repeat this process for a fixed number of **epochs** (iterations over the entire training dataset).

---

## Part 4: Step-by-Step Implementation

Let's build our model. First, we need to split our data into a training and a testing set.

### 4.1 Initialization

We need to initialize our user-feature matrix $P$ and item-feature matrix $Q$ with small random values.

In [55]:
def initialize_matrices(n_users, n_items, n_factors):
    """
    Initializes the user-feature (P) and item-feature (Q) matrices.

    Args:
        n_users (int): Number of users.
        n_items (int): Number of items.
        n_factors (int): Number of latent factors.

    Returns:
        tuple: A tuple containing:
            - P (np.ndarray): The user-feature matrix (n_users x n_factors).
            - Q (np.ndarray): The item-feature matrix (n_items x n_factors).
    """
    # TODO: Initialize P and Q with small random values from a standard normal distribution.
    P = np.random.normal(scale=0.1, size=(n_users, n_factors))
    Q = np.random.normal(scale=0.1, size=(n_items, n_factors))
    return (P,Q)


In [56]:
n_users, n_items = R_train.shape
n_factors=10
P,Q= initialize_matrices(n_users, n_items, n_factors)

### 4.2 The Training Loop (SGD)

This is the core of our algorithm. We will loop for a specified number of epochs. In each epoch, we will iterate over all known ratings in our training set `R_train` and update the corresponding user and item vectors in `P` and `Q`.

In [106]:
def train_model(R_train, P, Q, learning_rate, regularization, epochs):
    R_train = R_train.to_numpy()
    R_train = np.nan_to_num(R_train, nan=0.0)
    r_predict = np.zeros(R_train.shape)
    e = np.zeros(R_train.shape)
    n_users, n_items = R_train.shape
    for epoch in range(epochs):
        L=0
        for u in range(n_users):
            for i in range(n_items):
                if R_train[u][i] > 0:
                    r_predict[u][i] = P[u].dot(Q[i])
                    e[u][i]= R_train[u][i]-r_predict[u][i]
                    p_u_old = P[u].copy()
                    P[u]=P[u]+learning_rate*(e[u][i]*Q[i]-regularization*P[u])
                    Q[i]=Q[i]+learning_rate*(e[u][i]*p_u_old-regularization*Q[i])
                    L=L+(e[u][i]*e[u][i])+regularization*(np.sum(P[u]**2)+np.sum(Q[i]**2))
        print ("Epoch",epoch,"loss",L)
    
                
    return (P,Q)
                
                

            
        
    """
    Trains the matrix factorization model using SGD.

    Args:
        R_train (np.ndarray): The training user-item matrix.
        P (np.ndarray): The user-feature matrix.
        Q (np.ndarray): The item-feature matrix.
        learning_rate (float): The learning rate (alpha).
        regularization (float): The regularization parameter (lambda).
        epochs (int): The number of iterations over the training data.

    Returns:
        tuple: A tuple containing the trained P and Q matrices.
    """

In [107]:
learning_rate=0.01
regularization=0.001
epochs=50
P_train,Q_train=train_model(R_train, P, Q, learning_rate, regularization, epochs)

Epoch 0 loss 52476.447794076244
Epoch 1 loss 47967.17250196821
Epoch 2 loss 43252.015811526784
Epoch 3 loss 38603.44972172715
Epoch 4 loss 34125.742609142464
Epoch 5 loss 29948.673459851154
Epoch 6 loss 26161.009953429395
Epoch 7 loss 22800.786706234052
Epoch 8 loss 19867.704620400065
Epoch 9 loss 17337.518869102587
Epoch 10 loss 15172.630685968705
Epoch 11 loss 13329.613449995073
Epoch 12 loss 11764.44776307062
Epoch 13 loss 10435.78185088418
Epoch 14 loss 9306.619970506928
Epoch 15 loss 8344.91852647133
Epoch 16 loss 7523.510830992956
Epoch 17 loss 6819.672868137767
Epoch 18 loss 6214.539788838527
Epoch 19 loss 5692.500937225538
Epoch 20 loss 5240.64135683899
Epoch 21 loss 4848.258078671153
Epoch 22 loss 4506.456299908562
Epoch 23 loss 4207.819053301214
Epoch 24 loss 3946.1396520001313
Epoch 25 loss 3716.2056921378435
Epoch 26 loss 3513.62456582005
Epoch 27 loss 3334.682115636475
Epoch 28 loss 3176.2277059976286
Epoch 29 loss 3035.5803958000774
Epoch 30 loss 2910.452033136918
Epoch 3

---
## Part 5: Evaluation

After training, we must evaluate how well our model performs on unseen data. We will use the **Root Mean Squared Error (RMSE)**, which measures the average magnitude of the errors between predicted and actual ratings.

The formula is:
$$RMSE = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} (r_{ui} - \hat{r}_{ui})^2}$$

Where $\mathcal{T}$ is the set of ratings in our test set. A lower RMSE means better performance.

In [118]:
def calculate_rmse(R_test, P, Q):
    R_test = R_test.to_numpy()
    R_test = np.nan_to_num(R_test, nan=0.0)
    r_predict = np.zeros(R_test.shape)
    error=0
    T=0
    n_users, n_items = R_test.shape
    for u in range(n_users):
        for i in range(n_items):
            if R_test[u][i] > 0:
                r_predict[u][i] = P[u].dot(Q[i])
                error= error+ (R_test[u][i]-r_predict[u][i])**2
                T=T+1
    RMSE=np.sqrt(error/T)
    return (RMSE)
                
    """
    Calculates the Root Mean Squared Error (RMSE) on the test set.

    Args:
        R_test (np.ndarray): The testing user-item matrix.
        P (np.ndarray): The trained user-feature matrix.
        Q (np.ndarray): The trained item-feature matrix.

    Returns:
        float: The RMSE value.
    """

In [119]:
RMSE=calculate_rmse(R_test, P_train,Q_train)
print (RMSE)

1.12984496318482


---
## Part 6: Putting It All Together

Now, let's connect all the pieces. We'll set our hyperparameters, initialize our matrices, train the model, and finally evaluate it.

**Your Goal:** Tune the hyperparameters to achieve an **RMSE below 0.98**. A good model can even reach ~0.95. If your RMSE is higher, try adjusting the learning rate, regularization, number of factors, or epochs.

In [110]:
# --- Hyperparameters ---
# Number of latent factors (k)
# Learning rate (alpha)
# Regularization parameter (lambda)
# Number of epochs
k=150
learning_rate=0.001
regularization=0.1
epochs=100


# --- Initialization ---
# Remember user and item IDs are 1-based, but our numpy arrays are 0-based.
# The number of users/items from the shape of R_df is correct for 0-based indexing.

R_train,R_test=create_user_item_matrix(df)
P,Q= initialize_matrices(n_users, n_items, k)


# --- Training ---
P_train,Q_train=train_model(R_train, P, Q, learning_rate, regularization, epochs)


# --- Evaluation ---
RMSE=calculate_rmse(R_test, P_train,Q_train)
print(RMSE)


Epoch 0 loss 1120631.2228858448
Epoch 1 loss 1111981.2100844672
Epoch 2 loss 1099510.9929520423
Epoch 3 loss 1076224.7401401675
Epoch 4 loss 1027334.72756295
Epoch 5 loss 929723.4143127124
Epoch 6 loss 773523.028307528
Epoch 7 loss 599363.5521224932
Epoch 8 loss 465131.04741022893
Epoch 9 loss 379244.68014423136
Epoch 10 loss 323090.79707807803
Epoch 11 loss 283840.906180726
Epoch 12 loss 255243.46592432444
Epoch 13 loss 233820.4669629916
Epoch 14 loss 217396.90341373856
Epoch 15 loss 204546.90944788745
Epoch 16 loss 194309.70501232188
Epoch 17 loss 186022.1430143001
Epoch 18 loss 179216.1443617375
Epoch 19 loss 173554.48979433498
Epoch 20 loss 168789.7106423946
Epoch 21 loss 164737.15895537607
Epoch 22 loss 161256.9780226721
Epoch 23 loss 158241.7879144207
Epoch 24 loss 155608.119768328
Epoch 25 loss 153290.35733515996
Epoch 26 loss 151236.38542513465
Epoch 27 loss 149404.41951456256
Epoch 28 loss 147760.66523890782
Epoch 29 loss 146277.56935012716
Epoch 30 loss 144932.49790159936
Epo

---

## Part 7: Making Recommendations

The ultimate goal is to recommend movies! Now that we have our trained matrices $P$ and $Q$, we can predict the rating for *any* user-item pair, including those the user has not seen yet.

The process for a given user `user_id`:
1.  Get the user's latent vector $p_u$ from the trained matrix $P$.
2.  Calculate the predicted ratings for all items by taking the dot product of $p_u$ and the entire item-feature matrix $Q^T$.
3.  Create a list of movie titles and their predicted ratings.
4.  Filter out movies the user has already seen.
5.  Sort the remaining movies by their predicted rating in descending order.
6.  Return the top N movies.

In [111]:
def recommend_top_movies(user_id, P, Q, movie_titles_df, R_df, top_n=10):
    """
    Recommends top N movies for a given user.

    Args:
        user_id (int): The ID of the user.
        P (np.ndarray): The trained user-feature matrix.
        Q (np.ndarray): The trained item-feature matrix.
        movie_titles_df (pd.DataFrame): Dataframe with item_id and title.
        R_df (pd.DataFrame): The original user-item matrix dataframe (for checking seen movies).
        top_n (int): The number of movies to recommend.

    Returns:
        pd.DataFrame: A dataframe with the top N recommended movie titles and their predicted ratings.
    """
    movie_titles = movie_titles_df['title'].values
    R = R_df.values

    num_items = R.shape[1]
    r_predict = np.full(num_items, -np.inf) 

    for i in range(num_items):
        if np.isnan(R[user_id, i]):
            r_predict[i] = P[user_id].dot(Q[i])

    top_indices = np.argsort(r_predict)[::-1][:top_n]

    recommendations = np.vstack((movie_titles[top_indices],  np.round(r_predict[top_indices], 2))).T

    return recommendations
    
    

In [112]:
def create_R_df(df):
    df=df[["user_id","rating","title"]]
    R_df = df.pivot_table(index="user_id", columns="title", values="rating")
    
    return (R_df)

R_df=create_R_df(df)

In [113]:
recommendations= recommend_top_movies(1, P_train, Q_train, df, R_df)

In [114]:
print (recommendations)

[['Thinner (1996)' 4.64]
 ['Jumanji (1995)' 4.59]
 ['Home Alone (1990)' 4.56]
 ['When Harry Met Sally... (1989)' 4.54]
 ['Mimic (1997)' 4.5]
 ['Pulp Fiction (1994)' 4.49]
 ['Kiss the Girls (1997)' 4.49]
 ["Devil's Own, The (1997)" 4.49]
 ['Pulp Fiction (1994)' 4.47]
 ['River Wild, The (1994)' 4.47]]
