# Recommender Systems & Collaborative Filtering Notes

Recommender systems power many popular online platforms, from shopping sites like Amazon to streaming services like Netflix, by suggesting products, movies, or articles based on users' past behavior. They receive relatively modest attention in academic treatments, yet their commercial impact is enormous. The core idea is to use available rating data to predict what a user might like next, which in turn drives engagement and sales. These notes break the concepts down with detailed explanations, examples, and step-by-step walkthroughs.

---

## Basic Framework and Notation

Imagine a movie streaming site where users rate movies. To build a recommender system, we first need to define our basic components:

- **Users and Items**:  
  - $n_u$: Number of users (e.g., Alice, Bob, Carol, Dave).  
  - $n_m$: Number of movies (e.g., *Love at Last*, *Romance Forever*, etc.).  

  *Example*: If there are 4 users and 5 movies, then $n_u = 4$ and $n_m = 5$.

- **Ratings and Observations**:  
  - $y(i,j)$: Rating given by user $j$ for movie $i$.  
  - $r(i,j)$: An indicator variable; it equals 1 if user $j$ has rated movie $i$, and 0 otherwise.

  *Step-by-Step*:
  1. If Alice (user 1) rates *Love at Last* (movie 1) 5 stars, then $y(1,1) = 5$ and $r(1,1) = 1$.
  2. If Bob (user 2) hasn’t rated *Romance Forever* (movie 2), then $r(2,2) = 0$ and $y(2,2)$ is undefined.

- **Features of Items**:  
  When available, movies might have inherent features. For example:
  - Feature $X_1$: Degree of romance.
  - Feature $X_2$: Level of action.

  *Example*:  
  - *Love at Last* might have features $X(1) = [0.9, 0]$ (high romance, no action).
  - *Nonstop Car Chases* might have features $X(4) = [0.1, 1.0]$ (low romance, high action).

These definitions provide the building blocks for our prediction models.
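
As a rough illustration, the notation above can be held in NumPy arrays. Only $y(1,1) = 5$, $r(2,2) = 0$, and the two feature vectors come from the examples above; every other rating, and the array names `Y`, `R`, `m`, `x_love_at_last`, and `x_car_chases`, are made up for this sketch. Arrays are 0-indexed, so movie 1 and user 1 sit at index `[0, 0]`.

```python
import numpy as np

# Rows are movies i, columns are users j (Alice, Bob, Carol, Dave).
# np.nan marks "not rated". All values except Y[0, 0] = 5 are hypothetical.
Y = np.array([
    [5.0,    5.0,    0.0,    0.0],     # Love at Last
    [5.0,    np.nan, np.nan, 0.0],     # Romance Forever
    [np.nan, 4.0,    0.0,    np.nan],  # (unnamed third movie)
    [0.0,    0.0,    5.0,    4.0],     # Nonstop Car Chases
    [0.0,    0.0,    5.0,    np.nan],  # (unnamed fifth movie)
])

# r(i, j): 1 where a rating exists, 0 otherwise.
R = (~np.isnan(Y)).astype(int)

n_m, n_u = Y.shape   # number of movies, number of users
m = R.sum(axis=0)    # m(j): how many movies each user has rated

# Item features [romance, action] for the two movies described above.
x_love_at_last = np.array([0.9, 0.0])
x_car_chases = np.array([0.1, 1.0])

print(n_u, n_m)      # 4 5
print(R[1, 1])       # 0  (Bob has not rated Romance Forever)
```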

---

## Prediction Using Linear Models

### With Known Item Features

When the features for each movie are provided, the rating prediction for user $j$ and movie $i$ is modeled with a linear function:

$$
\hat{y}_{ij} = w^{(j)} \cdot X(i) + b^{(j)}
$$

- **Parameters**:  
  - $w^{(j)}$: Weight vector capturing user $j$'s preference for each feature.
  - $b^{(j)}$: A bias term for user $j$ to account for overall rating tendencies.

*Step-by-Step Example*:
1. Assume for Alice ($j = 1$), we choose:
   - $w^{(1)} = [5, 0]$ meaning she highly values the romance feature and ignores action.
   - $b^{(1)} = 0$ (no additional bias).
2. For a movie with features $X(i) = [0.99, 0]$:
   $$
   \hat{y}_{i1} = 5 \times 0.99 + 0 \times 0 + 0 = 4.95
   $$
   This suggests that Alice would give nearly 5 stars for a very romantic movie.
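
The same arithmetic can be checked in a few lines. This is a minimal sketch; the function name `predict_rating` and the variable names are assumptions for illustration.

```python
import numpy as np

def predict_rating(w, x, b):
    """Linear prediction w . x + b for one user and one movie."""
    return float(np.dot(w, x) + b)

w_alice = np.array([5.0, 0.0])   # values romance, ignores action
b_alice = 0.0                    # no additional bias
x_movie = np.array([0.99, 0.0])  # very romantic, no action

y_hat = predict_rating(w_alice, x_movie, b_alice)
print(round(y_hat, 2))  # 4.95
```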

### Cost Function for Rating Prediction

To learn the parameters $w^{(j)}$ and $b^{(j)}$ from data, we minimize a regularized mean squared error (MSE) computed only over the movies that user $j$ has rated:

$$
J(w^{(j)}, b^{(j)}) = \frac{1}{2m(j)} \sum_{i: r(i,j)=1} \left( w^{(j)} \cdot X(i) + b^{(j)} - y(i,j) \right)^2 + \frac{\lambda}{2m(j)} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2
$$

- **Details**:  
  - $m(j)$: Number of movies rated by user $j$.  
  - $n$: Number of features per movie (the length of $X(i)$ and $w^{(j)}$).  
  - $\lambda$: Regularization parameter that helps prevent overfitting by penalizing large weights.
  
*Step-by-Step*:
1. Calculate the error for each rated movie by subtracting the actual rating $y(i,j)$ from the predicted rating.
2. Square each error, sum them, and divide by $2m(j)$.
3. Add the regularization term: the sum of the squared weights, scaled by $\frac{\lambda}{2m(j)}$. The bias $b^{(j)}$ is not penalized.

*Note*: For a single user, $m(j)$ is a positive constant, so dividing both terms by it rescales the cost without moving the minimum. The factor is therefore often dropped to simplify calculations.
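
A minimal sketch of this per-user cost in NumPy follows. The function name `user_cost`, its argument layout, and the small example values are assumptions for illustration; the code also assumes the user has rated at least one movie, so $m(j) > 0$.

```python
import numpy as np

def user_cost(w, b, X, y, r, lam):
    """Regularized squared-error cost for a single user j.

    X   : (n_m, n) movie features
    y   : (n_m,) this user's ratings (entries where r == 0 are ignored)
    r   : (n_m,) 1 if the user rated the movie, else 0
    lam : regularization parameter lambda
    """
    rated = r == 1
    m_j = int(rated.sum())                       # m(j), assumed > 0
    errors = X[rated] @ w + b - y[rated]         # prediction minus rating
    data_term = np.sum(errors ** 2) / (2 * m_j)
    reg_term = lam / (2 * m_j) * np.sum(w ** 2)  # bias b is not penalized
    return float(data_term + reg_term)

# Hypothetical example: three movies, the second one unrated.
X = np.array([[0.9, 0.0], [1.0, 0.0], [0.1, 1.0]])
y = np.array([5.0, np.nan, 0.0])
r = np.array([1, 0, 1])

cost = user_cost(np.array([5.0, 0.0]), 0.0, X, y, r, lam=1.0)
print(cost)
```

Here the two rated movies give errors of $-0.5$ and $0.5$, so the data term is $0.5 / 4 = 0.125$, and the penalty is $25 / 4 = 6.25$.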

---

## Learning Item Features: Collaborative Filtering

When item features are not explicitly given, collaborative filtering techniques allow us to learn these features directly from the rating data.

### The Idea Behind Collaborative Filtering

Rather than relying on pre-defined item characteristics, we let the data “speak for itself” by inferring both:
- The unknown feature vector $X(i)$ for each movie.
- The user-specific parameters $w^{(j)}$ and $b^{(j)}$.

*Analogy*:  
Think of each movie as a “mystery box” with hidden traits. By observing how different friends (users) react to these mystery boxes (movies), you can infer the underlying characteristics (features) of the boxes.

### Formulating the Cost Function for Item Features

For a given movie $i$, the cost function to learn its feature vector $X(i)$ is:

$$
J(X(i)) = \frac{1}{2} \sum_{j: r(i,j)=1} \left( w^{(j)} \cdot X(i) + b^{(j)} - y(i,j) \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( X_k(i) \right)^2
$$

- **Explanation**:
  - We sum the squared differences between the predicted ratings $w^{(j)} \cdot X(i) + b^{(j)}$ and the actual ratings $y(i,j)$ for all users who rated movie $i$. The user parameters $w^{(j)}$ and $b^{(j)}$ are treated as fixed here.
  - The regularization term $\frac{\lambda}{2} \sum_{k=1}^{n} \left( X_k(i) \right)^2$ prevents the inferred features from growing too large.

*Step-by-Step*:
1. For each movie $i$, consider only the users $j$ with $r(i,j) = 1$.
2. Compute the prediction $w^{(j)} \cdot X(i) + b^{(j)}$ and compare it to the actual rating.
3. Sum the squared errors and add the regularization penalty.
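
To make the idea concrete, here is a hedged sketch that recovers one movie's features by plain gradient descent while the user parameters are held fixed. It includes each user's bias $b^{(j)}$ in the prediction, matching the linear model used earlier. The function name `learn_item_features`, the learning rate `lr`, the iteration count, and the four hypothetical users are all assumptions for illustration.

```python
import numpy as np

def learn_item_features(W, b, y, r, lam=0.1, lr=0.01, n_iters=500):
    """Estimate one movie's feature vector X(i) with user parameters fixed.

    W : (n_u, n) user weight vectors     b : (n_u,) user biases
    y : (n_u,) ratings of this movie     r : (n_u,) 1 where a rating exists
    """
    rated = r == 1
    W_r, b_r, y_r = W[rated], b[rated], y[rated]
    x = np.zeros(W.shape[1])                    # start from all-zero features
    for _ in range(n_iters):
        errors = W_r @ x + b_r - y_r            # prediction minus rating
        grad = W_r.T @ errors + lam * x         # gradient of J(X(i))
        x -= lr * grad
    return x

# Two romance fans rate the movie 5, one action fan rates it 0,
# and a fourth user has not rated it (all values hypothetical).
W = np.array([[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]])
b = np.zeros(4)
y = np.array([5.0, 5.0, 0.0, np.nan])
r = np.array([1, 1, 1, 0])

x_hat = learn_item_features(W, b, y, r)
print(np.round(x_hat, 2))  # close to [1, 0]: high romance, no action
```

The learning rate has to be small relative to the size of the user weights; with weights of magnitude 5, a value much above 0.03 would make this particular loop diverge.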

### Joint Learning: Users and Items

We combine the cost functions for both user-specific parameters and item features into a unified cost function:

$$
J(w, b, X) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( w^{(j)} \cdot X(i) + b^{(j)} - y(i,j) \right)^2 + \text{regularization terms}
$$

- **Explanation**:
  - This formulation accounts for all observed ratings by summing over every pair $(i,j)$ for which $r(i,j)=1$.
  - Regularization is applied to both the user parameters and item features.
  - The joint optimization problem can be solved using methods like gradient descent, where you update $w^{(j)}$, $b^{(j)}$, and $X(i)$ iteratively.
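
For concreteness, one common way to write out the regularization terms and the resulting gradients is sketched below. It assumes a single $\lambda$ shared by user weights and item features, leaves the biases unregularized, and introduces the shorthand $e(i,j)$ and a learning rate $\alpha$; none of these choices is fixed by the notes above.

$$
e(i,j) = w^{(j)} \cdot X(i) + b^{(j)} - y(i,j)
$$

$$
J(w, b, X) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} e(i,j)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( X_k(i) \right)^2
$$

$$
\frac{\partial J}{\partial w_k^{(j)}} = \sum_{i: r(i,j)=1} e(i,j)\, X_k(i) + \lambda w_k^{(j)}, \qquad
\frac{\partial J}{\partial b^{(j)}} = \sum_{i: r(i,j)=1} e(i,j), \qquad
\frac{\partial J}{\partial X_k(i)} = \sum_{j: r(i,j)=1} e(i,j)\, w_k^{(j)} + \lambda X_k(i)
$$

Each gradient descent step then replaces every parameter $\theta$ among $w_k^{(j)}$, $b^{(j)}$, and $X_k(i)$ with $\theta - \alpha \, \partial J / \partial \theta$, using gradients computed from the current values of all parameters.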

*Step-by-Step Example*:
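1. **Initialization**: Start with small random values for $w^{(j)}$ and $X(i)$ for all users and movies; $b^{(j)}$ can start at zero. The randomness matters: if every $w^{(j)}$ and $X(i)$ starts at exactly zero, their gradients are zero and only the biases ever change.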
2. **Prediction**: For each rating $y(i,j)$, compute the predicted rating.
3. **Error Calculation**: Calculate the difference between the predicted and actual rating.
4. **Update Parameters**: Adjust $w^{(j)}$, $b^{(j)}$, and $X(i)$ based on the gradient of the cost function.
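5. **Iteration**: Repeat until the cost stops decreasing appreciably. Because the joint problem is not convex in $w$ and $X$ together, the result is generally a local minimum.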
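
The five steps above can be sketched as a short NumPy training loop. This is a minimal illustration under stated assumptions, not a tuned implementation: the function names `cofi_cost` and `fit`, the shared L2 penalty, the hyperparameter defaults, the fixed iteration count (in place of a convergence test), and the rating matrix are all hypothetical. Unrated entries of `Y` are stored as 0 and masked out by `R`.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lam):
    """Joint squared-error cost over observed ratings plus L2 penalties."""
    err = (X @ W.T + b - Y) * R  # zero out unobserved (i, j) pairs
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))

def fit(Y, R, n_features=2, lam=0.1, lr=0.005, n_iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    # 1. Initialization: small random values (biases start at zero).
    X = 0.1 * rng.standard_normal((n_m, n_features))
    W = 0.1 * rng.standard_normal((n_u, n_features))
    b = np.zeros(n_u)
    costs = []
    for _ in range(n_iters):  # 5. Iteration (fixed number of steps here)
        # 2-3. Prediction and error, on observed ratings only.
        err = (X @ W.T + b - Y) * R
        # 4. Gradients from the current values, then a simultaneous update.
        grad_X = err @ W + lam * X
        grad_W = err.T @ X + lam * W
        grad_b = err.sum(axis=0)
        X -= lr * grad_X
        W -= lr * grad_W
        b -= lr * grad_b
        costs.append(cofi_cost(X, W, b, Y, R, lam))
    return X, W, b, costs

# Hypothetical 5-movie x 4-user ratings; R marks which entries are observed.
Y = np.array([[5, 5, 0, 0], [5, 0, 0, 0], [0, 4, 0, 0],
              [0, 0, 5, 4], [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0],
              [1, 1, 1, 1], [1, 1, 1, 0]])

X_hat, W_hat, b_hat, costs = fit(Y, R)
print(costs[0], costs[-1])  # the cost should fall substantially
```

Plotting `costs` against the iteration number is a quick way to check that the learning rate is neither too large (cost rises or oscillates) nor too small (cost barely moves).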

---

## Extension to Binary Labels

In many real-world applications, explicit ratings (like 1–5 stars) are replaced by binary signals (e.g., a “like” or a click).

### Using a Logistic Regression Approach

When ratings are binary, $y(i,j)$ is either 0 (non-engagement) or 1 (engagement). The model then predicts the probability that user $j$ likes movie $i$:

$$
P(y(i,j)=1) = g\left( w^{(j)} \cdot X(i) + b^{(j)} \right)
$$

where the logistic (sigmoid) function is defined as:

$$
g(z) = \frac{1}{1 + e^{-z}}
$$

- **Explanation**:
  - The logistic function squashes any real-valued number into the range $(0,1)$, making it suitable for probability estimation.
  - For instance, if $w^{(j)} \cdot X(i) + b^{(j)} = 2$, then
    $$
    g(2) = \frac{1}{1+e^{-2}} \approx 0.88,
    $$
    suggesting an 88% chance that the user likes the movie.
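
A quick numerical check of this value, as a minimal sketch (the function name `sigmoid` is an assumption; the formula is the $g(z)$ defined above):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); works on scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-z))

p = float(sigmoid(2.0))
print(round(p, 2))  # 0.88
```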

### Modified Cost Function

When using binary labels, we adapt the cost function to the cross-entropy loss, which is common in classification tasks:

$$
J(w, b, X) = - \sum_{(i,j): r(i,j)=1} \left[ y(i,j) \log \left(g\left( w^{(j)} \cdot X(i) + b^{(j)} \right)\right) + (1 - y(i,j)) \log \left(1 - g\left( w^{(j)} \cdot X(i) + b^{(j)} \right)\right) \right] + \text{regularization terms}
$$

*Step-by-Step*:
1. **For each rating**:
   - If $y(i,j)=1$: Add the term $-\log(g(\cdot))$.
   - If $y(i,j)=0$: Add the term $-\log(1 - g(\cdot))$.
2. **Sum over all rated pairs**.
3. **Add regularization** to keep the parameters small and avoid overfitting.
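
The three steps above can be sketched in NumPy as follows. The function name `binary_cofi_cost` and the specific L2 penalty with a shared `lam` are assumptions, since the notes leave the regularization terms unspecified. The code uses the identities $-\log g(z) = \log(1 + e^{-z})$ and $-\log(1 - g(z)) = \log(1 + e^{z})$, computed with `np.logaddexp` so that large $|z|$ does not produce `log(0)`.

```python
import numpy as np

def binary_cofi_cost(X, W, b, Y, R, lam):
    """Cross-entropy cost over observed binary labels plus L2 penalties.

    X : (n_m, n) item features   W : (n_u, n) user weights   b : (n_u,) biases
    Y : (n_m, n_u) labels in {0, 1}   R : (n_m, n_u) 1 where a label exists
    """
    Z = X @ W.T + b                              # logits for every (i, j) pair
    loss = Y * np.logaddexp(0.0, -Z) + (1 - Y) * np.logaddexp(0.0, Z)
    data_term = np.sum(loss * R)                 # keep observed pairs only
    reg_term = 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))
    return float(data_term + reg_term)

# Hypothetical labels: 3 movies, 2 users, 4 observed pairs.
Y = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
R = np.array([[1, 1], [1, 0], [0, 1]])

# With all parameters at zero every prediction is 0.5,
# so each observed pair contributes log(2).
cost0 = binary_cofi_cost(np.zeros((3, 2)), np.zeros((2, 2)), np.zeros(2), Y, R, lam=0.1)
print(cost0, 4 * np.log(2))
```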

This approach shifts the problem from predicting an exact rating to estimating the probability of engagement, which is often more useful in scenarios like ad-click prediction or binary purchase signals.

---

## Practical Implementation & Insights

- **Gradient Descent Updates**:  
  Both the user parameters ($w^{(j)}, b^{(j)}$) and item features $X(i)$ are updated iteratively using gradient descent.  
  *Step-by-Step*:
  1. Compute the gradient (partial derivative) of the cost function with respect to each parameter.
  2. Update each parameter by taking a small step against its gradient, which is the direction that reduces the cost; a learning rate sets the step size.
  3. Repeat until the cost function changes very little between iterations (convergence).

- **Collaboration Across Users**:  
  The term “collaborative filtering” stems from the idea that the system leverages the collective ratings of multiple users to infer item features.  
  *Analogy*:  
  Imagine you’re trying to decide which new restaurant to try based on reviews from many friends. Even if you have never visited that restaurant, you can predict whether you might enjoy it by understanding what your friends liked or disliked.

- **Regularization**:  
  Adding regularization (via the $\lambda$ term) is crucial to prevent overfitting, especially when the number of features is large compared to the number of available ratings.  
  *Explanation*:  
  Regularization discourages the model from fitting noise in the training data by penalizing overly complex models (i.e., very large weights).

- **Real-World Example Analogy**:  
  Returning to the “mystery box” picture: once many friends’ (users’) reactions have revealed a box’s hidden traits (like romance or action), those traits can be reused. When a new friend with similar tastes comes along, you can predict their reaction from the shared inferred traits.

- **Binary Feedback Use Cases**:  
  In many applications like online shopping or ad-click predictions, the feedback is binary (purchase/no purchase, click/no click) rather than a detailed rating.  
  *Explanation*:  
  - Logistic regression is used to model these cases because it naturally handles binary outcomes by predicting probabilities.
  - This adaptation expands the range of scenarios where collaborative filtering can be applied.
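
To close the loop, here is a minimal sketch of turning learned parameters into recommendations: score every movie for one user with $w^{(j)} \cdot X(i) + b^{(j)}$, skip movies the user has already rated, and return the highest-scoring ones. The function name `recommend`, the parameter `k`, and the hand-picked parameter values are assumptions for illustration, not learned results.

```python
import numpy as np

def recommend(X, W, b, R, user, k=2):
    """Return indices of the top-k unrated movies for one user (0-indexed)."""
    scores = X @ W[user] + b[user]              # predicted rating for every movie
    unrated = np.flatnonzero(R[:, user] == 0)   # candidates: not yet rated
    ranked = unrated[np.argsort(-scores[unrated])]
    return ranked[:k].tolist()

# Hand-picked features [romance, action] and two users:
# user 0 values romance, user 1 values action.
X = np.array([[0.9, 0.0], [1.0, 0.01], [0.99, 0.0], [0.1, 1.0], [0.0, 0.9]])
W = np.array([[5.0, 0.0], [0.0, 5.0]])
b = np.array([0.0, 0.0])
R = np.array([[1, 0], [0, 0], [0, 1], [1, 1], [0, 0]])

top_for_user0 = recommend(X, W, b, R, user=0)
top_for_user1 = recommend(X, W, b, R, user=1)
print(top_for_user0)  # [1, 2]
print(top_for_user1)  # [4, 1]
```

For binary feedback, the same ranking applies to the logits, because the sigmoid is monotonic and does not change the order of the scores.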
