# Big Data Lecture Notes: Collaborative Filtering

## Part 1: Introduction to Recommender Systems

### What is a Recommender System? 🤔

A recommender system is a tool that predicts a user's preference for an item based on their historical behavior. You interact with these systems every day when services like Amazon suggest products to buy, Netflix recommends movies to watch, or Spotify curates playlists for you.

To make these recommendations, the system relies on user feedback, which can be:
* **Explicit Feedback**: Direct input from the user, like giving a movie a 5-star rating.
* **Implicit Feedback**: Indirect information gathered from a user's behavior, such as clicks, viewing time, number of shares, or purchase history.

The two most common approaches for building these systems are **Content-Based Filtering** and **Collaborative Filtering**.

---

### Collaborative Filtering vs. Content-Based Filtering

#### Content-Based Filtering
The main idea here is to **recommend items that are similar to items a user has previously liked**.
* **How it works**: If you like a movie with a specific actor and genre, this system will recommend other movies featuring the same actor or belonging to the same genre.
* **Requirement**: This method needs a lot of detailed information (metadata) about the items themselves.

#### Collaborative Filtering (CF)
The main idea of Collaborative Filtering is to use the preferences and behaviors of **many users** to make recommendations. It operates on a simple premise: "If person A has the same opinion as person B on an issue, A is more likely to have B's opinion on a different issue."
* **How it works**: It finds users with tastes similar to yours and recommends items that those similar users have liked.
* **Advantage**: It's **domain-free**, meaning it doesn't need to know anything about the items' content (like genre or actors). It only needs the user-item interaction data (e.g., who rated what).


---

## Part 2: Types of Collaborative Filtering

Collaborative Filtering can be broken down into two main categories: Memory-Based and Model-Based.

### Memory-Based Collaborative Filtering

This approach, also known as the neighborhood-based approach, uses the entire user-item dataset to calculate similarities and make recommendations.

* **User-Based CF**: This method finds users who have rated items similarly to the target user. It then recommends items that these "similar users" liked but that the target user has not yet seen.
* **Item-Based CF**: This method finds items that are similar to the items the target user has rated highly. It recommends these similar items.

#### Measuring Similarity
To find "similar" users or items, we need to calculate a similarity score. Two common methods are:

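* **Cosine Similarity**: This measures the cosine of the angle between two rating vectors. It's not about the magnitude of the ratings, but the orientation. A score close to **1** means the users have very similar tastes. With non-negative ratings (such as 1 to 5 stars) the score always falls between 0 and 1; negative scores, down to **-1** for opposite tastes, only appear when ratings can be negative, for example after subtracting each user's mean rating.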

    $\text{sim}(u, u') = \frac{\sum_{i \in I_{uu'}} r_{ui} r_{u'i}}{\sqrt{\sum_{i \in I_{uu'}} r_{ui}^2} \sqrt{\sum_{i \in I_{uu'}} r_{u'i}^2}}$
    * **Note**: This formula calculates the similarity between two users, *u* and *u'*.
    * $r_{ui}$: The rating given by user *u* to item *i*.
    * $I_{uu'}$: The set of items that both users have rated. All three sums run over this set only.

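* **Euclidean Distance**: This measures the straight-line distance between two users in the rating space, $d(u, u') = \sqrt{\sum_i (r_{ui} - r_{u'i})^2}$, summed over the items both users have rated. A **smaller distance** means the users have more similar tastes. Unlike cosine similarity, it is sensitive to the magnitude of the ratings, not just their orientation.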
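
Both measures can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the user names, movie titles, and ratings are made up, each user is assumed to be stored as a dictionary mapping item to rating, and returning `0.0` or infinity when two users share no items is a choice made for this sketch.

```python
import math

def co_rated(ratings_a, ratings_b):
    """Items that both users have rated (each argument maps item -> rating)."""
    return sorted(set(ratings_a) & set(ratings_b))

def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between the two users' rating vectors over co-rated items."""
    items = co_rated(ratings_a, ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in items)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in items))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in items))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: no overlap is treated as "no evidence of similarity"
    return dot / (norm_a * norm_b)

def euclidean_distance(ratings_a, ratings_b):
    """Straight-line distance between the two users over co-rated items."""
    items = co_rated(ratings_a, ratings_b)
    if not items:
        return math.inf  # assumption: no overlap is treated as "infinitely far apart"
    return math.sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in items))

# Hypothetical ratings on a 1-5 scale.
alice = {"Inception": 5, "Titanic": 1, "Up": 4}
bob = {"Inception": 4, "Titanic": 2, "Up": 5, "Alien": 3}
carol = {"Inception": 1, "Titanic": 5, "Up": 2}

print(cosine_similarity(alice, bob), cosine_similarity(alice, carol))
print(euclidean_distance(alice, bob), euclidean_distance(alice, carol))
```

In this toy data Alice and Bob agree on every shared movie to within one star, so Bob scores as more similar to Alice than Carol does under both measures.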

#### Predicting a Rating
Once we find similar users, we can predict a target user's rating for an item they haven't seen. This is often done using a weighted average of the ratings from the most similar users.

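$\hat{r}_{ui} = k \sum_{u' \in U} \text{sim}(u, u') \, r_{u'i}$ where $k = 1 / \sum_{u' \in U} |\text{sim}(u, u')|$
* **Note**: This formula predicts the rating $\hat{r}_{ui}$ that user *u* will give to item *i*.
* It's a weighted sum of the ratings ($r_{u'i}$) given by other users (*u'*) to that same item.
* $U$ here is the set of the most similar users who have rated item *i* (not to be confused with the user-factor matrix U introduced later).
* The weight is the similarity score ($\text{sim}(u, u')$) between the target user and each of those users.
* $k$ is a normalization factor: dividing by the total absolute similarity turns the sum into a weighted average, which stays within the rating scale as long as the similarities are non-negative.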
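
The prediction formula above can be sketched as follows. The function name, the neighbor names, and the numbers are hypothetical, and returning `None` when no neighbor has rated the item is an assumption made for this sketch.

```python
def predict_rating(similarities, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings for one item.

    similarities:     maps neighbor -> sim(u, u') with the target user u
    neighbor_ratings: maps neighbor -> that neighbor's rating of the item;
                      neighbors who have not rated the item are simply absent
    """
    raters = [n for n in similarities if n in neighbor_ratings]
    total_weight = sum(abs(similarities[n]) for n in raters)  # this is 1 / k
    if total_weight == 0:
        return None  # assumption: no usable neighbor means no prediction
    weighted_sum = sum(similarities[n] * neighbor_ratings[n] for n in raters)
    return weighted_sum / total_weight

# Hypothetical similarities to the target user, and ratings of one unseen movie.
sims = {"bob": 0.9, "carol": 0.5, "dave": 0.1}
ratings_for_item = {"bob": 5, "carol": 3}  # dave has not rated it

prediction = predict_rating(sims, ratings_for_item)
print(prediction)
```

Because Bob is weighted more heavily than Carol, the prediction lands between their two ratings but closer to Bob's.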

### Model-Based Collaborative Filtering

Instead of using the entire dataset for every prediction, this approach builds a **model** from the data to learn user and item profiles.

#### Matrix Factorization
A very powerful model-based technique is **Matrix Factorization**. The idea is to approximate the very large and sparse user-item rating matrix by the product of two much smaller, dense matrices:
1.  A **user-factor matrix (U)**: Represents each user in terms of a small number of "latent factors".
2.  An **item-factor matrix (V)**: Represents each item in terms of the same latent factors.

These **latent factors** are hidden features that the algorithm learns from the data. For movies, these factors might represent dimensions like "comedy vs. drama," "action-packed vs. slow-paced," or "blockbuster vs. arthouse film."
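
To make the idea concrete, here is a minimal sketch in which the factor values are written by hand purely for illustration. In practice they are learned from the ratings, and learned factors are not guaranteed to have such a clean interpretation. The names `user_factors`, `item_factors`, and `predict` are hypothetical.

```python
# Each row has two latent factors: [affinity for action, affinity for romance].
user_factors = [
    [1.0, 0.1],  # user 0 mostly likes action
    [0.2, 1.0],  # user 1 mostly likes romance
]
item_factors = [
    [4.5, 0.5],  # item 0: an action movie
    [0.5, 4.5],  # item 1: a romance movie
    [2.5, 2.5],  # item 2: a bit of both
]

def predict(user, item):
    """Predicted rating: dot product of the user's and the item's factor vectors."""
    return sum(uf * vf for uf, vf in zip(user_factors[user], item_factors[item]))

# Full predicted rating matrix (U times V-transpose), one row per user.
predicted = [[predict(u, i) for i in range(len(item_factors))]
             for u in range(len(user_factors))]

for row in predicted:
    print(row)
```

Two small matrices (2 x 2 and 3 x 2 here) are enough to produce a prediction for every user-item pair, including pairs that were never rated.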

#### Alternating Least Squares (ALS)
**Alternating Least Squares (ALS)** is a popular algorithm for learning the user and item factor matrices (U and V).
* **Goal**: Find U and V such that their product ($U V^T$) closely approximates the known entries of the original rating matrix.
* **Process**: It works by first fixing the user matrix (U) and solving for the best item matrix (V). Then, it fixes the new V and solves for the best U. With one matrix held fixed, finding the other is an ordinary least-squares problem, which is where the name comes from. This process is repeated, *alternating* back and forth, until the error between the predicted ratings and the actual ratings stops improving or a maximum number of iterations is reached.
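
The alternating process can be sketched with NumPy as below. This is a toy version under stated assumptions, not the algorithm as any particular library implements it: a `0` in the matrix is assumed to mean "not rated", a small ridge penalty `reg` is added so that each least-squares solve is well-posed, and the function names, the rating matrix, and the default hyperparameters are all made up for the example.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor R ~ U @ V.T using only the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(iterations):
        # Fix U, then solve one small regularized least-squares problem per item.
        for i in range(n_items):
            Uo = U[mask[:, i]]  # factors of the users who rated item i
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
        # Fix the new V, then solve one problem per user.
        for u in range(n_users):
            Vo = V[mask[u]]  # factors of the items rated by user u
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
    return U, V

def rmse(R, mask, U, V):
    """Root-mean-square error over the observed ratings only."""
    err = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Hypothetical 4 users x 4 items rating matrix; 0 marks a missing rating.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = R > 0

U, V = als(R, mask)
print("RMSE on observed ratings:", rmse(R, mask, U, V))
print("Predicted matrix:\n", np.round(U @ V.T, 2))
```

The entries of `U @ V.T` at the positions that were `0` in `R` are the model's predictions for ratings that were never given.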

#### Collaborative Filtering in Spark
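Apache Spark's machine learning library (MLlib, through its DataFrame-based `spark.ml` package) has a built-in, scalable implementation of **model-based collaborative filtering** that uses the **Alternating Least Squares (ALS)** algorithm. In Python it is exposed as the `ALS` class in `pyspark.ml.recommendation`.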
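
A minimal PySpark sketch of how this might be used is shown below. The column names (`userId`, `movieId`, `rating`), the hyperparameter values, the 80/20 split, and the helper name `train_and_evaluate` are all assumptions chosen for illustration; `ratings` is assumed to be a Spark DataFrame that already contains those three columns.

```python
# Assumed hyperparameters and column names, chosen for illustration only.
ALS_PARAMS = {
    "rank": 10,                   # number of latent factors
    "maxIter": 10,                # number of alternating iterations
    "regParam": 0.1,              # regularization strength
    "userCol": "userId",
    "itemCol": "movieId",
    "ratingCol": "rating",
    "coldStartStrategy": "drop",  # drop predictions for users/items unseen in training
}

def train_and_evaluate(ratings):
    """Fit ALS on a ratings DataFrame and return (model, RMSE on a held-out split)."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where PySpark is not installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    training, test = ratings.randomSplit([0.8, 0.2], seed=42)
    model = ALS(**ALS_PARAMS).fit(training)
    predictions = model.transform(test)
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    return model, evaluator.evaluate(predictions)
```

Once a model is fitted, `model.recommendForAllUsers(5)` returns a DataFrame with the top five recommended items for each user.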