# Recommender Systems and Collaborative Filtering

Recommender systems power many popular online platforms, from shopping sites like Amazon to streaming services like Netflix, by suggesting products, movies, or articles based on users' past behavior. Although they receive relatively modest academic attention, their commercial impact is enormous. The core idea is to use available rating data to predict what a user might like next, which in turn drives engagement and sales.

---

## Basic Example

Imagine you have a set of users and a set of movies. The data is typically organized in a matrix where:
- **Rows:** Represent movies (or items).
- **Columns:** Represent users.
- **Entries:** The rating that a user gave to a movie.

For example, suppose we have 4 users (Alice, Bob, Carol, Dave) and 5 movies. We define:
- $n_u$: Number of users (e.g., Alice, Bob, Carol, Dave). In our case 4.
- $n_m$: Number of movies (e.g., _Love at Last_, _Romance Forever_, etc.). In our case 5.

We introduce two functions.

**Indicator function:** Equals 1 if user $j$ has rated movie $i$, and 0 otherwise.

$$
r(i, j) = 
\begin{cases}
1 & \text{if user } j \text{ has rated movie } i, \\
0 & \text{otherwise.}
\end{cases}
$$

**Rating function:** Rating given by user $j$ for movie $i$.

$$y(i, j) = \text{the rating user } j \text{ gave movie } i.$$

**Example:**  
1. If Alice (user 1) rates _Love at Last_ (movie 1) 5 stars, then $y(1,1) = 5$ and $r(1,1) = 1$.
2. If Bob (user 2) hasn’t rated _Romance Forever_ (movie 2), then $r(2,2) = 0$.
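
As a rough sketch of how these two functions might be stored in practice, the snippet below keeps them as two NumPy arrays with movies as rows and users as columns. Only the two facts from the example above come from the text; every other rating is a made-up placeholder, and storing unrated entries as 0 is just a convenience assumed here.

```python
import numpy as np

users = ["Alice", "Bob", "Carol", "Dave"]
n_m, n_u = 5, len(users)

# Y[i, j] holds the rating user j gave movie i. Indices are 0-based here,
# so the text's y(1,1) is Y[0, 0]. All values except Y[0, 0] are hypothetical.
Y = np.array([
    [5, 5, 0, 0],
    [5, 0, 0, 0],
    [0, 4, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 5, 0],
], dtype=float)

# R[i, j] is 1 if user j rated movie i, else 0. It is stored separately
# because a 0 in Y could be either a real rating or a missing one.
R = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
])

print(Y[0, 0], R[0, 0])  # Alice's rating of movie 1 and its indicator
print(R[1, 1])           # Bob has not rated movie 2

# m(j): number of movies each user has rated (column sums of R).
ratings_per_user = R.sum(axis=0)
print(dict(zip(users, ratings_per_user.tolist())))
```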

### Prediction Using Linear Models

We want to predict how a user might rate a movie they haven't seen yet. To do this, we assume that every movie can be described by a set of features (like how “romantic” or “action-packed” it is). Each user has their own “taste profile” defined by parameters that interact with these features.

For user $j$ and movie $i$, the prediction is made using a linear model:

$$
\hat{y}^{(i, j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}
$$

Where:
- $\mathbf{x}^{(i)}$ is the feature vector for movie $i$.
- $\mathbf{w}^{(j)}$ is the weight vector (user preferences) for user $j$.
- $b^{(j)}$ is the bias (a constant that can shift the prediction).

Suppose movies have two features:
- $x_1$: How romantic the movie is.
- $x_2$: How action-packed it is.

For user Alice, who loves romance:
- Let $\mathbf{w}^{(Alice)} = [5, 0]$ (high weight for romance, zero weight for action).
- Let $b^{(Alice)} = 0$.

For a movie with features $\mathbf{x}^{(i)} = [0.99, 0]$ (very romantic, no action), the predicted rating is:

$$
\hat{y}^{(i, Alice)} = 5 \times 0.99 + 0 \times 0 + 0 = 4.95
$$

This matches our intuition that a highly romantic movie should receive a high rating from Alice.
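
The same calculation can be written as a short NumPy sketch. The function name `predict_rating` is an illustrative choice, not something defined in the text.

```python
import numpy as np

def predict_rating(w, x, b):
    """Linear prediction w . x + b for one user and one movie."""
    return float(np.dot(w, x) + b)

w_alice = np.array([5.0, 0.0])   # [romance weight, action weight]
b_alice = 0.0
x_movie = np.array([0.99, 0.0])  # very romantic, no action

print(predict_rating(w_alice, x_movie, b_alice))  # approximately 4.95
```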

---

## The Cost Function

To adjust the parameters ($\mathbf{w}^{(j)}$ and $b^{(j)}$) so that our predictions are close to the actual ratings, we define a cost function. For user $j$, the cost function (using Mean Squared Error) is:

$$
J(\mathbf{w}^{(j)}, b^{(j)}) = \frac{1}{2\,m(j)} \sum_{i: r(i,j)=1} \Big( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y(i,j) \Big)^2 + \frac{\lambda}{2\,m(j)} \sum_{k=1}^{n} \big(w^{(j)}_k\big)^2
$$

- **$m(j)$**: Number of movies rated by user $j$.
- **$\lambda$**: Regularization parameter to prevent overfitting.
- **$n$**: Number of features per movie (the length of $\mathbf{x}^{(i)}$ and $\mathbf{w}^{(j)}$).
- The first term is the squared error (difference between prediction and true rating) and the second term penalizes large weights.

_Step-by-Step_:

1. Calculate the error for each rated movie by subtracting the actual rating $y(i,j)$ from the predicted rating.
2. Square each error, sum them, and divide by $2m(j)$.
3. Add the regularization term: the sum of the squared weights, scaled by $\frac{\lambda}{2m(j)}$.

_Note_: Because $m(j)$ is a constant for a given user, dividing by it does not change which parameters minimize the cost, so it is often dropped to simplify the expressions.
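
The three steps above can be sketched in NumPy as follows. The function name `user_cost`, the argument layout, and the tiny feature matrix are assumptions made for illustration; the sketch also assumes the user has rated at least one movie, so that $m(j) > 0$.

```python
import numpy as np

def user_cost(w, b, X, y, rated, lam):
    """Regularized squared-error cost for a single user.

    X     : (n_m, n) movie feature matrix
    y     : (n_m,) this user's ratings (ignored where rated == 0)
    rated : (n_m,) indicator r(i, j) for this user
    lam   : regularization parameter lambda
    """
    mask = rated.astype(bool)
    m_j = mask.sum()                      # number of movies the user rated
    errors = X[mask] @ w + b - y[mask]    # prediction minus actual rating
    squared_error = np.sum(errors ** 2) / (2 * m_j)
    regularization = lam / (2 * m_j) * np.sum(w ** 2)
    return squared_error + regularization

# Hypothetical data: three movies with two features; the user rated two of them.
X = np.array([[0.99, 0.0], [1.0, 0.01], [0.0, 1.0]])
y = np.array([5.0, 5.0, 0.0])
rated = np.array([1, 1, 0])

cost = user_cost(np.array([5.0, 0.0]), 0.0, X, y, rated, lam=1.0)
print(cost)
```

With these numbers the fit is almost perfect, so nearly all of the cost comes from the regularization term, which illustrates how $\lambda$ trades accuracy against weight size.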


### For All Users

To learn for every user, sum the cost over all users:

$$
J_{\text{total}} = \sum_{j=1}^{n_u} J(\mathbf{w}^{(j)}, b^{(j)})
$$

An optimization algorithm (like gradient descent) is then used to find the best parameters for all users.
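
For reference, differentiating the per-user cost defined earlier gives the gradients that gradient descent needs. This is the standard derivation; the learning rate $\alpha$ is a symbol introduced here, not one the text defines.

$$
\frac{\partial J}{\partial w^{(j)}_k} = \frac{1}{m(j)} \sum_{i: r(i,j)=1} \Big( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y(i,j) \Big)\, x^{(i)}_k + \frac{\lambda}{m(j)}\, w^{(j)}_k
$$

$$
\frac{\partial J}{\partial b^{(j)}} = \frac{1}{m(j)} \sum_{i: r(i,j)=1} \Big( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y(i,j) \Big)
$$

Each iteration then applies the updates:

$$
w^{(j)}_k \leftarrow w^{(j)}_k - \alpha \frac{\partial J}{\partial w^{(j)}_k}, \qquad b^{(j)} \leftarrow b^{(j)} - \alpha \frac{\partial J}{\partial b^{(j)}}
$$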

---

## Learning Item Features: Collaborative Filtering

Often, you might not have pre-defined features for movies. Collaborative filtering allows us to learn the features $\mathbf{x}^{(i)}$ directly from the data.

Rather than relying on pre-defined item characteristics, we let the data “speak for itself” by inferring both:
- The unknown feature vector $\mathbf{x}^{(i)}$ for each movie.
- The user-specific parameters $\mathbf{w}^{(j)}$ and $b^{(j)}$.

_Analogy_:  
Think of each movie as a “mystery box” with hidden traits. By observing how different friends (users) react to these mystery boxes (movies), you can infer the underlying characteristics (features) of the boxes.

### Step-by-Step Example

Imagine for movie 1:
- Alice and Bob rated it 5, while Carol and Dave rated it 0.
- Suppose the user parameters are already known: Alice's $\mathbf{w}^{(Alice)} = [5, 0]$, Bob's is similar, and Carol and Dave put near-zero weight on the first feature. A natural choice for $\mathbf{x}^{(1)}$ is then $[1, 0]$.
- Then:
  - For Alice: $5 \times 1 = 5$,
  - For Carol: if her weight on the first feature is near 0, the prediction will be near 0.
  
We create a cost function for movie $i$, treating the user parameters as fixed:

$$
J(\mathbf{x}^{(i)}) = \frac{1}{2} \sum_{j: r(i,j)=1} \Big( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y(i,j) \Big)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \big(x^{(i)}_k\big)^2
$$

The idea is to adjust $\mathbf{x}^{(i)}$ (for every movie) so that, for all users who rated it, the predictions are as accurate as possible.

_Step-by-Step_:

1. For each movie $i$, consider only the users $j$ with $r(i,j) = 1$.
2. Compute the prediction $\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$ and compare it to the actual rating $y(i,j)$.
3. Square these errors, sum them, and add the regularization penalty.
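
As a small illustration, the features of movie 1 can be recovered from the four ratings in the example. The sketch below assumes concrete user weights (the text only says Bob is similar to Alice and that Carol and Dave score the first feature low) and zero biases. Instead of gradient descent it uses the closed-form ridge-regression solution, which minimizes the same quadratic cost in one step.

```python
import numpy as np

# Hypothetical user parameters: Alice and Bob weight feature 1,
# Carol and Dave weight feature 2. Biases are zero, as in the worked example.
W = np.array([[5.0, 0.0],   # Alice
              [5.0, 0.0],   # Bob
              [0.0, 5.0],   # Carol
              [0.0, 5.0]])  # Dave
b = np.zeros(4)
y = np.array([5.0, 5.0, 0.0, 0.0])  # everyone's rating of movie 1
lam = 0.1                            # regularization strength (assumed)

# All four users rated the movie, so no masking by r(i, j) is needed here.
# Setting the gradient of J(x) to zero gives (W^T W + lam * I) x = W^T (y - b).
n = W.shape[1]
x_movie1 = np.linalg.solve(W.T @ W + lam * np.eye(n), W.T @ (y - b))
print(x_movie1)  # close to [1, 0]; regularization shrinks it slightly
```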

---

## Joint Learning: Users and Items

We combine the cost functions for both user-specific parameters and item features into a unified cost function:

$$
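J(\mathbf{w}, b, \mathbf{x}) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y(i,j) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \big(w^{(j)}_k\big)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \big(x^{(i)}_k\big)^2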
$$

- **Explanation**:
  - This formulation accounts for all observed ratings by summing over every pair $(i,j)$ for which $r(i,j)=1$.
  - Regularization is applied to both the user parameters and item features.
  - The joint optimization problem can be solved using methods like gradient descent, where you update $\mathbf{w}^{(j)}$, $b^{(j)}$, and $\mathbf{x}^{(i)}$ iteratively.

_Step-by-Step Example_:

1. **Initialization**: Start with random or small initial values for $\mathbf{w}^{(j)}$, $b^{(j)}$, and $\mathbf{x}^{(i)}$ for all users and movies.
2. **Prediction**: For each rating $y(i,j)$, compute the predicted rating.
3. **Error Calculation**: Calculate the difference between the predicted and actual rating.
4. **Update Parameters**: Adjust $\mathbf{w}^{(j)}$, $b^{(j)}$, and $\mathbf{x}^{(i)}$ based on the gradient of the cost function.
5. **Iteration**: Repeat until the cost function converges to a minimum.
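
The loop above might look like the following NumPy sketch, which uses hand-written gradients and plain batch gradient descent. The function names, the toy ratings, and the hyperparameters (`lam`, `alpha`, `iters`, two features) are all assumptions chosen for illustration, and a fixed iteration count stands in for a real convergence check.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lam):
    """Joint cost: squared error over rated pairs plus L2 penalties on W and X."""
    err = (X @ W.T + b - Y) * R   # multiply by R to ignore unrated pairs
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))

def cofi_fit(Y, R, n_features=2, lam=0.1, alpha=0.01, iters=1000, seed=0):
    n_m, n_u = Y.shape
    rng = np.random.default_rng(seed)
    # 1. Initialization: small random features and weights, zero biases.
    X = 0.1 * rng.standard_normal((n_m, n_features))
    W = 0.1 * rng.standard_normal((n_u, n_features))
    b = np.zeros(n_u)
    history = []
    for _ in range(iters):
        # 2-3. Prediction and error for every rated (movie, user) pair.
        err = (X @ W.T + b - Y) * R
        # 4. Gradients of the joint cost (the bias is not regularized).
        grad_X = err @ W + lam * X
        grad_W = err.T @ X + lam * W
        grad_b = err.sum(axis=0)
        X -= alpha * grad_X
        W -= alpha * grad_W
        b -= alpha * grad_b
        # 5. Track the cost so progress can be inspected.
        history.append(cofi_cost(X, W, b, Y, R, lam))
    return X, W, b, history

# Hypothetical ratings: 5 movies (rows) by 4 users (columns).
Y = np.array([[5, 5, 0, 0], [5, 0, 0, 0], [0, 4, 0, 0],
              [0, 0, 5, 4], [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1],
              [1, 1, 1, 1], [1, 1, 1, 0]])

X, W, b, history = cofi_fit(Y, R)
print(history[0], history[-1])       # the cost should drop substantially
print(np.round(X @ W.T + b, 1))      # predictions, including unrated pairs
```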

---

## Binary Labels

Many recommender systems do not have star ratings. Instead, they use binary signals like "liked" (1) or "not liked" (0).

For binary outcomes, we predict the probability that user $j$ likes movie $i$ using a logistic function:

$$
P\big(y^{(i,j)}=1\big) = g\Big(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}\Big)
$$

where the logistic function is:

$$
g(z) = \frac{1}{1 + e^{-z}}
$$

### Cost Function

For binary labels, the cost function uses **binary cross-entropy loss** instead of mean squared error:

$$
J(\mathbf{w}^{(j)}, b^{(j)}) = -\frac{1}{m(j)} \sum_{i: r(i,j)=1} \left[ y(i,j)\log\big(g(z)\big) + (1-y(i,j))\log\big(1-g(z)\big) \right] + \text{regularization}
$$

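Here $z = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$ is the same linear score used in the prediction above. The change mirrors the move from linear regression to logistic regression.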

_Step-by-Step_:
1. **For each rating**:
   - If $y(i,j)=1$: Add the term $-\log(g(\cdot))$.
   - If $y(i,j)=0$: Add the term $-\log(1 - g(\cdot))$.
2. **Sum over all movies rated by user $j$** and divide by $m(j)$.
3. **Add regularization** to keep the parameters small and avoid overfitting.

This approach shifts the problem from predicting an exact rating to estimating the probability of engagement, which is often more useful in scenarios like ad-click prediction or binary purchase signals.
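
A minimal NumPy sketch of the per-user binary cost follows. The text leaves the regularization term unspecified, so the sketch assumes the same $\frac{\lambda}{2m(j)}\sum_k (w^{(j)}_k)^2$ penalty used for the squared-error cost. The clipping constant `eps` is an implementation detail added here to avoid taking $\log(0)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_user_cost(w, b, X, y, rated, lam=0.0):
    """Binary cross-entropy cost for one user over the movies they rated."""
    mask = rated.astype(bool)
    m_j = mask.sum()
    p = sigmoid(X[mask] @ w + b)          # predicted probability of a "like"
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)          # keep log() finite
    labels = y[mask]
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return loss + lam / (2 * m_j) * np.sum(w ** 2)

# Hypothetical data: three movies, two features, the user labeled two of them.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
y = np.array([1.0, 0.0, 0.0])
rated = np.array([1, 1, 0])

# With all-zero parameters every prediction is 0.5, so the cost is log(2).
print(binary_user_cost(np.zeros(2), 0.0, X, y, rated))
# Weights that agree with the labels give a lower cost.
print(binary_user_cost(np.array([3.0, -3.0]), 0.0, X, y, rated))
```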

---

## Mean Normalization: Handling the Cold Start Problem

When a new user (or an item) has very few ratings, predictions can be very poor. Mean normalization adjusts the ratings so that every movie’s ratings are centered around zero.

1. **Compute the Mean Rating:**  
   For each movie $i$, calculate:

$$\mu_i = \frac{1}{\text{number of ratings for } i} \sum_{j: r(i,j)=1} y(i,j)$$

   For example, if movie 1 has ratings 5, 5, 0, 0 from four users, then $\mu_1 = 2.5$.

2. **Normalize the Ratings:**  
   Replace each rating with:

$$y_{\text{norm}}(i,j) = y(i,j) - \mu_i$$

   This centers the ratings so that on average, they are zero.

3. **Make Predictions and Re-adjust:**  
   Predict using the model:

$$\hat{y}_{\text{norm}}^{(i,j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$$

   Then add the mean back to get the final prediction:

$$\hat{y}^{(i,j)} = \hat{y}_{\text{norm}}^{(i,j)} + \mu_i$$

### Why It Helps

For a new user with no ratings, the parameters might default to zeros (because of regularization), leading to a prediction of 0 for every movie. By adding the movie’s mean rating back, the system instead predicts the average rating—much more reasonable than zero.
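
The effect can be checked with a short NumPy sketch. The first row reuses the ratings 5, 5, 0, 0 from the example above; the second row, the extra unrated column for a new user, and the feature matrix `X` are hypothetical.

```python
import numpy as np

# Rows are movies, columns are users; the last column is a brand-new user.
Y = np.array([[5, 5, 0, 0, 0],
              [4, 0, 0, 3, 0]], dtype=float)
R = np.array([[1, 1, 1, 1, 0],
              [1, 0, 0, 1, 0]])

# Per-movie mean over rated entries only (guarding against movies with no ratings).
counts = R.sum(axis=1)
mu = (Y * R).sum(axis=1) / np.maximum(counts, 1)

# Center the rated entries; unrated entries stay at 0 and are masked by R anyway.
Y_norm = (Y - mu[:, None]) * R

# A new user whose parameters are all zero gets a normalized prediction of 0,
# so adding the mean back yields each movie's average rating.
X = np.array([[1.0, 0.0], [0.2, 0.8]])   # hypothetical learned movie features
w_new, b_new = np.zeros(2), 0.0
pred_norm = X @ w_new + b_new
pred = pred_norm + mu
print(mu, pred)
```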

---

## Implementing Collaborative Filtering in TensorFlow

TensorFlow is not only for deep neural networks; it can also be used to implement collaborative filtering by automatically computing gradients.

### Automatic Differentiation (Auto Diff)

Instead of manually calculating derivatives for the cost function, TensorFlow’s **gradient tape** records operations and computes gradients automatically.
  
**Example Code Outline:**

```python
import tensorflow as tf

# Initialize variable w (for demonstration)
w = tf.Variable(3.0)
x = 1.0  # Example input
y = 1.0  # True value
alpha = 0.01  # Learning rate

# Use gradient tape to compute the derivative automatically
for step in range(30):
    with tf.GradientTape() as tape:
        f = w * x           # Model prediction
        J = (f - y) ** 2    # Squared-error cost
    dJdw = tape.gradient(J, w)  # Equals 2 * (w * x - y) * x
    w.assign_sub(alpha * dJdw)  # Gradient descent update of w
print(w.numpy())  # Moves from 3.0 toward 1.0 (about 2.09 after 30 steps)
```

This example uses a simple quadratic cost function to illustrate gradient descent.

### Applying to Collaborative Filtering

In collaborative filtering:
- **Parameters:** Include both user parameters ($\mathbf{w}^{(j)}$, $b^{(j)}$) and item features ($\mathbf{x}^{(i)}$).
- **Cost Function:** Is a combination of prediction error (either squared error or cross-entropy) and regularization.
- **Optimizer:** You can use basic gradient descent or more advanced ones like Adam, which TensorFlow supports.

The key is writing the cost function in TensorFlow. The optimizer and gradient tape then handle computing the necessary updates for all parameters.
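
A minimal sketch of this idea for the squared-error case is shown below, with hypothetical toy data and hyperparameters. To stay close to the single-variable example, it applies the plain gradient descent update by hand with `assign_sub`; an optimizer such as Adam could take over that step instead.

```python
import tensorflow as tf

# Hypothetical toy data: 3 movies (rows) by 2 users (columns).
Y = tf.constant([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0]])
R = tf.constant([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # 1 where a rating exists
n_m, n_u, n = 3, 2, 2   # movies, users, features
lam = 0.1               # regularization strength (assumed)
alpha = 0.01            # learning rate (assumed)

tf.random.set_seed(0)
X = tf.Variable(tf.random.normal((n_m, n), stddev=0.1))  # movie features
W = tf.Variable(tf.random.normal((n_u, n), stddev=0.1))  # user weights
b = tf.Variable(tf.zeros((1, n_u)))                      # user biases

def cofi_cost(X, W, b):
    """Squared error over rated pairs plus L2 penalties on W and X."""
    err = (tf.matmul(X, W, transpose_b=True) + b - Y) * R
    return 0.5 * tf.reduce_sum(err ** 2) + 0.5 * lam * (
        tf.reduce_sum(W ** 2) + tf.reduce_sum(X ** 2))

initial_cost = float(cofi_cost(X, W, b))
for step in range(500):
    with tf.GradientTape() as tape:
        cost = cofi_cost(X, W, b)
    # One call returns the gradients for all three parameter groups.
    dX, dW, db = tape.gradient(cost, [X, W, b])
    X.assign_sub(alpha * dX)
    W.assign_sub(alpha * dW)
    b.assign_sub(alpha * db)
final_cost = float(cofi_cost(X, W, b))
print(initial_cost, final_cost)
```

Only `cofi_cost` encodes anything specific to collaborative filtering. Switching to binary labels would mean changing that function while the training loop stays the same.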

---

## Finding Similar (Related) Items

Once the model learns item features $\mathbf{x}^{(i)}$, these vectors can be used to find similar items.

Use the squared Euclidean distance between two feature vectors:

$$
d(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = \sum_{l=1}^{n} \Big( x^{(i)}_l - x^{(k)}_l \Big)^2
$$

A smaller distance indicates that the movies are more similar.

### Practical Example

- **Step 1:** For a given movie (say, "Romantic Movie A") with feature vector $\mathbf{x}^{(A)}$, compute the distance to every other movie.
- **Step 2:** Select the movies with the smallest distances.
- **Analogy:**  
  Imagine each movie as a point in a multi-dimensional space (each dimension representing an abstract quality like romance or action). Movies that lie close together in this space are likely to be similar in style or content.
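
The two steps can be sketched in NumPy as follows. The feature matrix is hypothetical and `most_similar` is an illustrative name, not something defined in the text.

```python
import numpy as np

def most_similar(X, item, k=2):
    """Indices of the k items closest to `item` by squared Euclidean distance."""
    dists = np.sum((X - X[item]) ** 2, axis=1)  # distance to every item
    dists[item] = np.inf                        # exclude the query item itself
    return np.argsort(dists)[:k]

# Hypothetical learned features: columns might loosely mean [romance, action].
X = np.array([[0.9, 0.1],   # item 0: the query, e.g. "Romantic Movie A"
              [1.0, 0.0],   # item 1
              [0.1, 0.9],   # item 2
              [0.7, 0.3],   # item 3
              [0.0, 1.0]])  # item 4

print(most_similar(X, 0, k=2))  # items 1 and 3 are nearest to item 0
```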

---

## Limitations and Extensions

### Cold Start Problem
- **New Items:** With few ratings, it is hard to learn good features.
- **New Users:** With little interaction data, the system might initially predict the average rating.
- **Mitigation:**  
  Techniques such as mean normalization help, and sometimes additional data (e.g., demographic or content information) is combined with collaborative filtering.

### Integrating Side Information
Collaborative filtering by itself does not naturally include additional details (like a movie's genre, director, or user demographics). **Content-based filtering** uses such features directly, and hybrid models combine them with collaborative signals to improve recommendations.