## Content
- Introduction
 - What is a recommendation system?
 - How do we formulate a rec sys problem?
 - How can we represent the dataset for rec sys?
- Collaborative Filtering
- Similarity based approaches
  - Item-item based similarity
  - User-user based similarity
  - Cold start problem
- Content based Rec Sys
- Recommendation as a Regression/Classification problem
- Types of Rec Algos

## Introduction

#### Q. What is a recommendation system/engine?
A recommender system is a system which predicts ratings a user might give to a specific item.

These predictions are then ranked and returned to the user as recommendations.

The goal of a good recommender engine is to keep the user hooked to the platform.

There are many apps that we use that make use of recommendation systems. Example:
- Netflix
- Tiktok
- Amazon
- Youtube

<br>

> **Q. How have recommender systems evolved over time?**

- **Pre 2007**
 - **Similarity based**
 - **Content based**
 - **Collaborative Filtering** algorithms were used

- **2007 - 2015**
 - During this period, **Netflix** ran a competition (the Netflix Prize, launched in late 2006) that promised a prize of one million dollars to the team that could improve its recommendation engine.
 - The winning team's solution built on the concept of **Matrix Factorization**, which became popular in this period.

- **Post 2015**
 - **Deep learning** algorithms are being used now.

![image](https://docs.google.com/uc?id=16kDmFNdeSUPoS4HyaC_AamF1ke7i7DvG)


#### Q. How do we formulate a recommender system problem?
The problem formulation here is different from what we've seen already for classification or regression.

Suppose we have $n$ users, $U_i$ with $i \in [1, n]$, and $m$ items (a product on Amazon, a movie on Netflix, a song on Spotify, etc), $I_j$ with $j \in [1, m]$.

`Goal:`
- We need to suggest a list of items that user $U_i$ will like.
- These items should be ranked (preferably).
- This is, of course, based on historical data.

**NOTE:** In practice, both numbers are huge: a large platform can have millions to billions of users (n) and millions to billions of items (m). The point is that the **scale is very large**.

![picture](https://drive.google.com/uc?export=view&id=1aSnd1pGoS6dOHVMi3p_py-9SPJ7QU976)

---

#### Q. How can we represent the dataset for recommender systems?

The dataset is represented in the form of a **matrix**, let's call it `A`.

In `A`, each row represents a user $U_i$, and each column represents an item $I_j$. This way, element $A_{ij}$ denotes user $U_i$'s interaction with product $I_j$.

This interaction can be measured in many ways, depending on context:
- Whether the YouTube video was liked **(binary value)**.
- Percentage of a song listened to on Spotify **(real valued)**.
- Whether the user spent 2 minutes reading about the product on Amazon **(binary value)**.
- Rating of the Netflix movie **(numeric value)**.
- Whether the product was bought or not **(binary value)**.

Basically, we try to represent whether the user indicated any form of interest in that product.

Naturally, `A` is an `n x m` matrix.

<br>

> **Q. What if user $U_i$ has never interacted with a specific product, say $I_2$?**

When we have billions of products, it is next to impossible for any given user to have interacted with all of them.

For example,
- A YouTube user residing in New Delhi may never have watched any German videos, owing to the language barrier.
- This leaves an entire section of German videos that this user has never interacted with.

In such cases, we leave the cell $A_{i2}$, where $I_2$ represents one such German video, **empty**.

Similarly, a lot of cells would be empty in matrix A.
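
As a minimal sketch of this representation, the snippet below builds a tiny matrix `A` from a made-up interaction log. The triples, the sizes, and the choice of `NaN` as the marker for an empty cell are all assumptions for illustration; indices are 0-based in code, while the notes count from 1.

```python
import numpy as np

# Hypothetical interaction log: (user index, item index, rating)
interactions = [(0, 1, 5.0), (0, 3, 3.0), (1, 0, 4.0), (2, 3, 1.0)]
n_users, n_items = 3, 4

# Start with every cell empty (NaN), then fill in the observed interactions
A = np.full((n_users, n_items), np.nan)
for user, item, rating in interactions:
    A[user, item] = rating

print(A)  # rows are users, columns are items, NaN marks "never interacted"
```

At real scale such a matrix would not be stored densely; a sparse format that keeps only the observed triples is the usual choice.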

![picture](https://drive.google.com/uc?export=view&id=1HQQpohiU06-jQB7zWZz8FGNrcAmOqfU8)



Suppose YouTube has about a billion users, $n \approx 10^9$, and a few hundred million videos, $m \approx 10^8$.

This means we have a total of $10^9 \times 10^8 = 10^{17}$ cells in matrix A.

It is not possible for every user to have watched (or rated) every video on YouTube.

Practically, an average user $U_i$ would have interacted with around 1000 YouTube videos, which is next to nothing in comparison to the total number of videos present.

Hence we can say that matrix A is **sparse**.

<br>

> **Q. How do we fix our matrix being sparse?**

We don't need to "fix" anything.

The main goal of building a recommender system is this: given the ratings that user $U_i$ has provided for a few items, the system should recommend new items based on the user's interests. Predicting the empty cells IS the problem at hand.

<br>

> **Q. How can we calculate the sparsity of matrix A?**

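We first compute the fraction of cells that are filled, which is the **density** of the matrix:

$density = \frac{Number \ of \ non \ empty \ cells}{Total \ No. \ of \ cells}$

Sparsity is the complementary fraction, i.e. the share of empty cells: $sparsity = 1 - density$.

As per our example of YouTube,

$density = \frac{10^9 \times 1000}{10^{17}} = 10^{-5}$, so $sparsity = 1 - 10^{-5} = 99.999\%$

This means that only 1 in 10<sup>5</sup> cells is not empty in A.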
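
The arithmetic of the YouTube example can be checked with a few lines of Python. The numbers are the rough estimates used above, not measured values, and the variable names are only illustrative.

```python
n_users = 10**9               # rows of A
n_items = 10**8               # columns of A
interactions_per_user = 1000  # assumed average number of non-empty cells per row

total_cells = n_users * n_items
non_empty_cells = n_users * interactions_per_user

fraction_filled = non_empty_cells / total_cells
print(fraction_filled)        # fraction of cells that hold a value
print(1 - fraction_filled)    # fraction of cells that are empty
```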

![picture](https://drive.google.com/uc?export=view&id=1cn2qXfypJveCvM64vJmyLssMwanT93tL)



---

## Collaborative Filtering

#### Q. We're just given the A<sub>n x m</sub> matrix, how can we recommend items to users based on this?
**Task:** Given A<sub>n x m</sub> , we need to recommend some items to user $U_i$.

Before we dive into this problem, let's first try to retrieve information about the users and the items from the given matrix A.


> **Q. How can we obtain a user / item vector from matrix A?**

We're just given the matrix A<sub>n x m</sub>, and we wish to obtain some representation of data for a user $U_i$.

Recall that for text data, we form a bag of words representation (BoW), which works as our features.

Similarly, we can have a vector of dimension $m$ x $1$, where the jth element tells whether user $U_i$ has bought/watched (shown interest in) item $I_j$, for $j \in [1, m]$.

For cases where user $U_i$ has not interacted with / shown interest in item $I_j$, we can simply put 0.

This is a very crude approach, but it gives us a vector representation of each user.

![picture](https://drive.google.com/uc?export=view&id=1NbAvRsl8In3WuJhk_6TEeXfVr_6GKLcI)


<br>

If you think about it, this is essentially the ith row of matrix A.

Similarly, we can also get an $n$ x $1$ vector that represents data for an item $I_j$, which would be the jth column of matrix A.
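
In code, reading a user vector or an item vector is just row or column slicing. The sketch below uses a made-up 3 x 4 matrix with 0 standing for "no interaction", as in the crude approach described here; indices are 0-based.

```python
import numpy as np

# Toy 3 x 4 matrix A (3 users, 4 items); 0 marks "no interaction"
A = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
])

i, j = 0, 2              # one user and one item, chosen arbitrarily
user_vector = A[i, :]    # i-th row: m entries, one per item
item_vector = A[:, j]    # j-th column: n entries, one per user

print(user_vector, item_vector)
```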

![picture](https://drive.google.com/uc?export=view&id=1r4Gk3pqEFfWxbg40nFlW9Z01S8GVc1v7)

This way of using the given matrix A to form user and item vectors is the basis of a **Collaborative Filtering (CF)** recommendation system.

Let's take a look at different approaches we can adopt to implement CF recommendation systems.


---

## Item-Item based Similarity

We already have historical data about the items bought (positively rated) by a user $U_i$.

Let's say that this user has already bought items $I_{10}$ and $I_{12}$.

> **Q. What if we find an item $I_j$ that is similar to these already bought items?**

This is a good idea.

We can find items with high similarity to the items historically bought by user $U_i$ ($I_{10}$ and $I_{12}$), as we know that this user showed interest in such items.


For example,
- If a user of Spotify likes music by Lata Mangeshkar, there is a high chance they will also like songs by Asha Bhosle.
- If a user of Netflix likes Mission Impossible 2, there is a high chance they will like other action movies like John Wick.

There is a sense of similarity among these items.

In our case,
- since our user $U_i$ likes $I_{10}$ and $I_{12}$,
- we need to find the set of items similar to $I_{10}$ and the set of items similar to $I_{12}$.
- If an item $I_k$ lies in both these sets, there is a higher chance of user $U_i$ liking item $I_k$.

This is like the **Nearest Neighbour** based approach.

<br>

> **Q. How can we find similarity between different items?**

Every $I_j$ is an n-dimensional vector. It represents the ratings given to item $I_j$ by all the n users.

So, we use the **cosine similarity** metric to measure the similarity between a specific item, say $I_{10}$, and the rest of the items.

Based on this metric value, we recommend the items that are most similar to $I_{10}$.

$sim(I_i, I_j) = cosine \ similarity(I_i, I_j) = \frac{I_i^T \cdot I_j}{||I_i|| \cdot ||I_j||} = S_{ij}$


![picture](https://drive.google.com/uc?export=view&id=1GS4UyFjq2dN-YullCGZd9P1euM48eT0T)


<br>

> **Q. How can we store these similarity scores?**

We build a **similarity matrix**, with all the $S_{ij}$ values.

Since we're talking about item-item similarity, we denote this matrix as $S^i$.

The dimensions of $S^i$ are $m$ x $m$, and $S_{ij}^i$ represents how similar two items $I_i$ and $I_j$ are.

The larger the value of $S_{ij}^i$, the more similar items $I_i$ and $I_j$ are.
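
The item-item similarity matrix can be sketched in NumPy as follows. The toy matrix is made up, empty cells are treated as 0 (an assumption carried over from the crude vector representation above), and the guard against zero-length columns is an addition for safety.

```python
import numpy as np

# Toy ratings: 3 users x 4 items, 0 = no interaction
A = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

norms = np.linalg.norm(A, axis=0)   # length of each item column
norms[norms == 0] = 1.0             # avoid division by zero for unrated items
A_unit = A / norms                  # every item column scaled to unit length

S_item = A_unit.T @ A_unit          # m x m; entry (i, j) is the cosine similarity
print(np.round(S_item, 3))
```

Items 0 and 1 were rated highly by the same users, so their similarity is close to 1, while items 0 and 2 share no raters and get a similarity of 0.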


<br>

This is known as an **Item-Item based Collaborative Filtering** recommender system. It was introduced by **Amazon** in 1998.

Here, the recommender engine compares the items that a user has already rated positively with the items that the user has not rated, and looks for similarities.

Items similar to the positively rated ones will be recommended.
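
One simple way to turn similarities into recommendations is sketched below: each unrated item is scored by summing its cosine similarity to the items the user has already rated. This scoring rule, the helper names `item_similarity` and `recommend`, and the toy data are illustrative assumptions, not the exact method used by any particular platform.

```python
import numpy as np

# Toy ratings: 4 users x 5 items, 0 = no interaction
A = np.array([
    [5.0, 4.0, 0.0, 0.0, 0.0],
    [4.0, 5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 4.0],
    [5.0, 0.0, 4.0, 0.0, 0.0],
])

def item_similarity(A):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0
    unit = A / norms
    return unit.T @ unit

def recommend(A, user, k=2):
    """Return the indices of the k unrated items most similar to the user's rated items."""
    S = item_similarity(A)
    rated = A[user] > 0
    scores = S[:, rated].sum(axis=1)   # total similarity to the items already rated
    scores[rated] = -np.inf            # never recommend something already rated
    return np.argsort(-scores)[:k]

top_items = recommend(A, user=0, k=1)
print(top_items)
```

User 0 rated items 0 and 1; item 2 is the only unrated item that shares raters with them, so it comes out on top.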

**NOTE:**
- This is one of the most basic ideas behind the YouTube recommendation engine.
- There are many added layers of complexity on top of this, but this is the core idea.

<br>

The dimensions of the item-item similarity matrix are $m$ x $m$, and m is a very large number.

> **Q. Wouldn't the calculation of the similarity matrix be a very time consuming process?**

Yes. Computing all $m$ x $m$ pairwise similarities would be very time consuming.

Back in the day, when this approach was used on platforms like Amazon, it was not affordable to spend so much time on the fly, as the user logs in.

So, this calculation did not take place when the user logged in. Rather, it was performed offline, nightly or maybe even weekly, so that when a user logs in, the recommendations based on this calculation are readily available.

![picture](https://drive.google.com/uc?export=view&id=1hZWWGQHJL08dIoWWox8TngqexalbFquu)

---



## User-User based similarity

Consider a user $U_i$.

> **Q. What if we find a user $U_j$ that is similar to $U_i$?**

Since we know both these users are similar, we can use the history of one user to give recommendations to the other.

We already have an $m$ x $1$ vector representing the data for the ith user $U_i$, which we retrieved from A.

Using **cosine similarity**, we can find the similarity between user $U_i$ and all the others, to find the most similar users.

Here also, these scores are stored in a **similarity matrix** $S^u$, where $S_{ij}^u$ represents how similar user $U_i$ is to user $U_j$.

![picture](https://drive.google.com/uc?export=view&id=1gIo0rnPPZgYHmJDF6Rpgw54jKWh_5r5_)

<br>

> **Q. Using cosine similarity, we can find the most similar users, what now? How do we use this to recommend items to user $U_i$?**

Consider that user $U_i$ has already bought items $I_{10}$ and $I_{18}$.

Using cosine similarity, we find that users $U_{10}, U_{26}, U_{58}$ are similar to user $U_i$. Their purchase history includes:
- $U_{10}$: $I_{10}$, $I_{12}$, $I_{18}$, $I_{20}$
- $U_{26}$: $I_{10}$, $I_{18}$, $I_{26}$, $I_{12}$
- $U_{58}$: $I_{18}$, $I_{12}$

Since user $U_i$ has already bought items $I_{10}$ and $I_{18}$, we leave those out.

Using a **frequency based** approach, we see that item $I_{12}$ appears in the history of all three similar users ($U_{10}, U_{26}, U_{58}$). So we can guess that user $U_i$ might also be interested in buying $I_{12}$, making it a good recommendation!
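
The frequency count from this example can be reproduced with a `Counter`. The string labels are just stand-ins for the users and items named above.

```python
from collections import Counter

already_bought = {"I10", "I18"}   # items user U_i already owns
similar_users_history = {
    "U10": ["I10", "I12", "I18", "I20"],
    "U26": ["I10", "I18", "I26", "I12"],
    "U58": ["I18", "I12"],
}

# Count how often each not-yet-bought item appears among the similar users
counts = Counter(
    item
    for history in similar_users_history.values()
    for item in history
    if item not in already_bought
)
print(counts.most_common())
```

`I12` is counted three times, while `I20` and `I26` appear once each, so `I12` is the top candidate.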

![picture](https://drive.google.com/uc?export=view&id=10sy_RR9__4hswrL8uyG-NNJR42Ehej8-)

<br>

This technique, where we find similar users is called **User-User similarity based Collaborative Filtering**.

**NOTE:**
- This approach is also similar to Nearest Neighbour approach.

<br>

> **Q. Can you think of a flaw in user-user similarity approach?**

One major problem with user-user similarity is that **User preferences can change over time**.

This can lead to bad recommendations.

In order to handle this, we prefer using item-item similarity approach, because in contrast, the ratings on given items do not change significantly over time.

> **Q. How to decide when to use user-user or item-item similarity approach?**

Consider the following rule of thumb.

When we have more users than items, i.e. n > m, and the item ratings do not change much over time after the initial period, it is better to use the item-item similarity approach.

<br>

---

## Cold Start Problem

> **Q. What if we have a new user joining the platform?**

When a new user joins, we have no ratings from this user, as they are new to the ecosystem. So, we cannot find similar users.

Suppose $U_i$ is a new user. Then all the cells of the ith row in A would be empty.

This means that the entire user vector has no data.

Every recommender system needs to build a user profile from the user's activities and behaviour on the platform.

Recommendations are made based on this history.

If there is no historical data, that is a problem.

<br>

> **Q. Similarly, what if there is a new product that's added to the platform?**

Here again, we have no historical data, as it is a new product.

We cannot find any similar items to it.

If $I_j$ represents the new item, the jth column of A would be empty.

This means that the entire item vector has no data.

![picture](https://drive.google.com/uc?export=view&id=1vR2xZ3AgfRWZq1hyBl38nlKkFQ0fT4U1)


<br>

> **Q. What is a cold start problem?**

Both these cases are called the **Cold Start Problem**.

It arises because we have no data about these new users/items, which prevents us from giving good recommendations.

The problem shows up in 3 different situations:
- For new users
- For new items
- For new communities (a new platform, where both the users and the items are new)

In these cases, we don't have enough information to make good decisions/recommendations.

<br>


---

## Content based Recommendation System


> **Q. How can we overcome cold start problem?**

A basic idea would be to recommend the most popular / frequently bought items using a frequency based approach. But this is a very generic approach, since every new user gets the same recommendations. Let's think of something else.

Consider the case of a new user.

Even though we do not have any information regarding this user's interactions with different items, we do have other additional information about this new user.
- **Location**
 - This can be used to get an idea of the items used / purchased by other users in that area.
 - A Swiggy recommender engine can make an assumption that Idli-Dosa are more likely to be liked by a user residing in Southern India.
 - We can get the approximate location of the user from their IP address.
 - Many platforms also ask for your location when you sign up.

- **Gender**
 - Useful in recommending clothes and accessories.

- **Age**

- **Type of Credit Card**
 - This too can reveal a lot of information about the user, like their spending habits, their credit limit, the brand of credit card, etc.

- **Device being used to access the platform**
 - We can assume that a user on an Apple MacBook would have more spending power than a user on a budget smartphone.

We form a new d'-dimensional vector that holds all this data, and then use **user-user similarity** on it, and recommend accordingly.

This is known as **User-user similarity based Content Filtering** Recommender systems.


![picture](https://drive.google.com/uc?export=view&id=1_h3afZD7o4PYRpiO3SJ5anALn-Z_1-8H)

<br>

> **Q. Do we have any kind of additional information that can be used in case of a new item cold start?**

Yes.

Consider a new product on Amazon. Though there is no data about user ratings, we still have additional information like:
- **Product description**
 - This would potentially be stored as a BoW.
- **Price**
- **Category of product**
 - Like electronics, clothing, sports, etc.

... and so on.

We form a new d-dimensional vector that holds all this data, and then use **item-item similarity** on it, and recommend accordingly.

Hence this is called **item-item similarity based content filtering** Recommender systems.

We can recommend this new item to those users who bought similar items from the same categories until we have sufficient information.
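
A small sketch of item-item similarity on metadata is shown below. The feature layout (a scaled price followed by a one-hot category), the catalogue entries, and the helper name `cosine` are all hypothetical; a real system would also include features from the product description.

```python
import numpy as np

# Hypothetical metadata vectors: [scaled price, is_electronics, is_clothing, is_sports]
catalog = {
    "phone":   np.array([0.8, 1.0, 0.0, 0.0]),
    "t-shirt": np.array([0.1, 0.0, 1.0, 0.0]),
    "racket":  np.array([0.3, 0.0, 0.0, 1.0]),
}
new_item = np.array([0.7, 1.0, 0.0, 0.0])   # newly listed item with no ratings yet

def cosine(u, v):
    """Cosine similarity between two metadata vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

similarities = {name: cosine(new_item, vec) for name, vec in catalog.items()}
most_similar = max(similarities, key=similarities.get)
print(most_similar, similarities)
```

The new item can then be shown to users who bought its most similar existing item, even though the new item has no column of ratings yet.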

![picture](https://drive.google.com/uc?export=view&id=1BdcUIWKgog9doqXhOXWc8F-JgSPqJrs-)

<br>

This additional information is known as **metadata**.

This process of finding user-user or item-item similarities using metadata, in order to recommend items to users, is called a **Content based Recommendation system**.

<br>

> **Q. Why is it called "Content based" Recommendation?**

Because we are not using the purchase data (the $A_{ij}$s) for finding similarity and then recommending.

Instead, we are using **features extracted from** (or provided in) the **content** of the user / item to form a d-dimensional vector representing the user/item metadata.

The point of content based filtering is that we have to know the content of both the users and the items.


![picture](https://drive.google.com/uc?export=view&id=1jLtzH5t0VkCJLYVPtOFXznvEE2lfrP2a)



#### Q. What are some advantages and disadvantages of content based filtering?

**Advantages**
- The model doesn't need any data about other users, since the recommendations are specific to this user. This makes it **easier to scale** to a large number of users.
- The model can capture the specific interests of a user, and **can recommend niche items** that very few other users are interested in.

**Disadvantages**
- Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.
- It tends to recommend items from the same categories the user has already shown interest in, and rarely anything from other categories.

<br>

#### Q. What are some advantages and disadvantages of collaborative filtering?
**Advantages**
- No domain knowledge necessary
 - We don't need domain knowledge because the embeddings are automatically learned.
- **Serendipity**
 - The model can help users discover new interests. In isolation, the ML system may not know the user is interested in a given item, but the model might still recommend it because similar users are interested in that item.

- The system doesn't need contextual features.

**Disadvantages**
- Fails in case of cold start.

---

## Recommendation as a Regression/Classification problem

We saw how we can use the metadata to create d-dimensional user and item feature vectors.

> **Q. Can we use these d-dimensional user and item vectors as features, of a Regression/Classification problem?**

Suppose that a user $U_i$ gave a rating of 4 to an item $I_j$, i.e. $A_{ij} = 4$.

Using metadata, we can obtain feature vectors for the user $U_i$ and for the item $I_j$, of dimension d' and d respectively.

If we concatenate these two feature vectors, and put the label as $A_{ij}$, which is the rating given by $U_i$ to $I_j$, we get one row of data that we can use for training a regression/classification model.

This way, we can form training data from the entries of matrix A that are non-empty.

Here, we're using both the metadata given to us and the entries of matrix A to train a regression/classification model, and predict the rating for a new pair of user and item, given their features.

Therefore, this becomes a hybrid model of sorts.
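
The construction of the training set can be sketched as follows. The metadata here is random noise and the model is plain linear least squares, both chosen only to keep the example short; any regression or classification model could be trained on the same `X` and `y`. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d_user, d_item = 5, 6, 3, 2

user_features = rng.normal(size=(n_users, d_user))   # d'-dimensional user metadata
item_features = rng.normal(size=(n_items, d_item))   # d-dimensional item metadata

# Non-empty cells of A as (user, item, rating) triples
observed = [(0, 1, 4.0), (0, 3, 2.0), (1, 1, 5.0), (2, 4, 3.0),
            (3, 0, 1.0), (4, 5, 4.0), (4, 2, 2.0), (1, 5, 3.0)]

# One training row per non-empty cell: [user features | item features] -> rating
X = np.array([np.concatenate([user_features[u], item_features[i]])
              for u, i, _ in observed])
y = np.array([r for _, _, r in observed])

# Fit a linear model with a bias term using least squares
X_bias = np.hstack([X, np.ones((len(X), 1))])
weights, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

def predict(u, i):
    """Predicted rating for any (user, item) pair, including unseen pairs."""
    x = np.concatenate([user_features[u], item_features[i], [1.0]])
    return float(x @ weights)

print(predict(2, 0))
```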

![picture](https://drive.google.com/uc?export=view&id=1nQaCej8Y31ZBWH0eT1UTY0hPHBPhY22h)

<br>

> **Q. Is there any problem with this approach?**

Consider the case when a new user is added.

We do not run into a cold start problem, as we can predict the ratings of this new user for items using their metadata.

But, in order to recommend the top 10 items, we will first have to predict the ratings for **ALL** the items. This is a problem because we can have millions of items, or more, in the real world.

So, naturally, doing so for every user would be computationally very heavy.

Hence, this model is not feasible at this scale on its own.

![picture](https://drive.google.com/uc?export=view&id=11ZLH5nKnyLK92gaa38jMkyLJNZdLT9EQ)


---

## Summary: Types of Recommendation Algorithm

The classical Recommender systems can be broken down into following types:-
1. Content based
 - Uses content based features like location, gender, etc
 - Very helpful in cold start problems
2. Collaborative filtering
 - Uses the data given in matrix A, i.e. the $A_{ij}$ values
 - Unlike content based filtering, we don't need to hand-engineer the features.
3. Hybrid models
 - Uses both content based features and  the $A_{ij}$ values

**Note:**
- Both user-user and item-item based similarity approaches can be adopted for both content based and collaborative filtering techniques.

![picture](https://drive.google.com/uc?export=view&id=1o0dHtVRrI6mAcw2mn9tw7MTLIxYONCoK)


---

## Matrix Factorization (MF)

Recall the concept of factorization from school algebra.

As per this concept, we can write a number as a product of its **factors**. For example, $6 = 2 * 3$.

Recall that we have our $n$ x $m$ matrix A.

Let's try to extend this concept to matrices by decomposing A as a product of two other matrices:

A<sub>n x m</sub> = B<sub>n x d</sub> . C<sub>d x m</sub>

This is called **Matrix Factorization (MF) / Matrix Decomposition**.

<br>

> **Q. Can we factorize matrix A into a product of 3 matrices?**

Yes.

Just as 12 can be written as both $12 = 2*6$ and $12=2*3*2$, we can similarly decompose matrix A into 3 factor matrices instead of 2.

The final dimensions should match with $n$ x $m$. If that is taken care of, then we can factorize A into even more factors.

A<sub>n x m</sub> = B<sub>n x d</sub> . C<sub>d x d'</sub> . D<sub>d' x m</sub>

But, in the context of recommender systems, decomposition into 2 factor matrices is typically used.
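
The dimension bookkeeping for two and three factors can be verified quickly in NumPy. The sizes below are arbitrary and the factor matrices are random, so this only illustrates how the shapes combine, not how the factors are found.

```python
import numpy as np

n, m, d, d_prime = 6, 5, 3, 2
rng = np.random.default_rng(0)

# Two factors: (n x d) . (d x m) -> (n x m)
B = rng.normal(size=(n, d))
C = rng.normal(size=(d, m))
two_factor = B @ C

# Three factors: (n x d) . (d x d') . (d' x m) -> (n x m)
C3 = rng.normal(size=(d, d_prime))
D = rng.normal(size=(d_prime, m))
three_factor = B @ C3 @ D

print(two_factor.shape, three_factor.shape)
```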

![picture](https://drive.google.com/uc?export=view&id=1ZZ7AAx4WrK4gcKm2lvnXLM4wEt7rLs6p)

<br>

> **Q. How is the concept of MF relevant to Recommender systems?**

Recall that our matrix A is very sparse, which means that there are a lot of missing values.

If we could approximately complete this matrix based on the values that we do have, we would essentially get an idea of how each user would rate each item. This would be very helpful in recommending new items to the users.

This technique of utilising the available values to complete the sparse matrix A is called **Matrix Completion**.

There are many ways to solve the problem of Matrix Completion. One way of achieving this is by the process of **MF**.



![picture](https://drive.google.com/uc?export=view&id=13Oe-MaTwTBY4Yx4r_NvHNmgheQ5Orb5c)

![picture](https://drive.google.com/uc?export=view&id=1EV7IHts9dj-1AtLsGWDMNReo5wPF3Kaf)

> **Q. What is the underlying assumption behind Matrix Factorization in Recommender systems?**

The fundamental assumption of Matrix Factorization based recommender systems is that $A_{ij}$, i.e. the user $U_i$'s rating for an item $I_j$, can be decomposed as a dot product between a user vector $B_i$ and an item vector $C_j$.

This idea of using Matrix Factorization for recommender systems became popular during the **Netflix prize competition** (2006-2009).

![picture](https://drive.google.com/uc?export=view&id=1CDcassowMBPLGITbNH0v0Om6470rcaSY)




> **Q. How can we go about completing the given matrix A using MF?**

Consider the matrix A, even though we don't know all of its values, since it is sparse.

Let's assume the following decomposition to be true: A<sub>n x m</sub> = B<sub>n x d</sub> . C<sup>T</sup><sub>d x m</sub>,
where <br>
n -> No of users <br>
m -> No of items <br>
d -> No of dimensions of each user / item vector (a number we choose)


C has dimensions $m$ x $d$, so we take its transpose (of dimensions $d$ x $m$) in order for the matrix product with B<sub>n x d</sub> to be defined.

This is known as an **Interaction model**, we will see why it's called that in a bit.

![picture](https://drive.google.com/uc?export=view&id=1ReFOAT3Zc7STUPyNqi1mdSvZjlhEqxjp)


<br>

#### Q. How can we represent the value of a cell $A_{ij}$ in terms of matrices B and C?

Look at what the B and C matrices look like in the figure above.

We can obtain $A_{ij}$ by taking a dot product of the ith row of matrix B and the jth row of matrix C.

Let's write these rows as vectors. Recall that, by convention, whenever we write a vector, we mean a column vector.

So, we get $B_i$ and $C_j$ as the required column vectors, each of dimension $d$ x $1$.

Hence, in order to do a dot product, we need to take the transpose of $B_i$.


$A_{ij} = B_i^T \cdot C_j$, where $B_i^T$ is $1 \times d$, $C_j$ is $d \times 1$, and $A_{ij}$ is $1 \times 1$.


<br>

Notice that our original matrix B had a dimension of $n$ x $d$, where n represents the no of users.

So we can interpret $B_i$ to be a d-dimensional vector that represents user $U_i$.

Similarly, $C_j$ can be thought of as a d-dimensional vector that represents the item $I_j$.
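
The relation between a single cell and the full product can be checked numerically. The factors below are random placeholders with `C` stored as an `m x d` matrix, matching the convention in these notes; indices are 0-based.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 4, 5, 2
B = rng.normal(size=(n, d))   # row i is the user vector B_i
C = rng.normal(size=(m, d))   # row j is the item vector C_j

i, j = 2, 3
a_ij = B[i] @ C[j]            # B_i^T . C_j, a single number
A_hat = B @ C.T               # all n x m cells at once

print(a_ij, A_hat[i, j])      # the two values agree
```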

![picture](https://drive.google.com/uc?export=view&id=1yiNaCGWOfyrf0UhlA3yVIycBlfOECmWw)

**NOTE:**
- In modelling, multiplying two sets of features together is often called an "interaction" between them, hence we call this model the **Interaction Model.**


As you can see, we're getting $A_{ij}$, i.e. the user $U_i$'s rating for an item $I_j$, by the interaction (dot product, to be exact) between the $B_i$ and $C_j$ vectors, which represent the user $U_i$ and item $I_j$ respectively.

This is the assumption we had made in the beginning of MF.

$A_{ij} = B_i^T \cdot C_j$, with $B_i^T$ of size $1 \times d$ and $C_j$ of size $d \times 1$.



#### Q. How can we find the $B_i$ and $C_j$ vectors, even though we only have a few $A_{ij}$ values in the rating matrix?

Suppose that $n=10,000$ and $m=1,000$

This means total number of cells in A = $10^4 * 10^3 = 10^7$

Since A is sparse, let's assume that out of these $10^7$ cells, only $10^5$ are non-empty.

**Task:** Given a small subset of $A_{ij}$ values, we want to compute d-dimensional vectors $B_i$ for all users ($i \in [1, n]$), and $C_j$ for all items ($j \in [1, m]$).

![picture](https://drive.google.com/uc?export=view&id=13j1uVjf7_82DJ8V_kOIQBOzlePhJnPrQ)


Though we don't know what $B_i$ and $C_j$ are, we want their dot product to be as close as possible to $A_{ij}$, for all the **non-empty** entries in A.

$A_{ij} \approx B_i^T \cdot C_j$

We can translate this into a **squared error loss**, which we'd like to minimise over all the cells that are **non-empty** in A:

$\min_{B_i, C_j} \sum_{(i, j): A_{ij} \neq NULL} (A_{ij} - B_i^T \cdot C_j)^2$

This becomes our **optimization problem** for MF based Rec Sys.

![picture](https://drive.google.com/uc?export=view&id=1_hgbqyK2jrBoMvWctc5-kym3-A0XU7sn)

<br>

> **Q. How can we solve the optimization problem of MF based Rec Sys?**

There are broadly 2 approaches to solve this:
1. **Stochastic Gradient Descent (SGD)**
 - Randomly initialise the values in $B_i$ and $C_j$ and perform SGD.
 - This can take time, as there are a huge number of users and items.
2. **Coordinate Descent Algorithms**
 - We treat $B_i$ and $C_j$ as separate blocks of coordinates.
 - We fix $B_i$ for a few iterations and update $C_j$.
 - Then, we fix $C_j$ and update $B_i$ for a few iterations.
 - This is repeated until they converge to a solution of the optimization problem.
 - When each update is solved as a least squares problem, this technique is known as **Alternating Least Squares (ALS)**.
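
The SGD option can be sketched in a few lines. This is a bare-bones illustration under several assumptions: `NaN` marks an empty cell, there is no regularisation or bias term, and the learning rate, epoch count, function names, and toy matrix are arbitrary choices rather than recommended settings.

```python
import numpy as np

def factorize_sgd(A, d=2, lr=0.01, epochs=500, seed=0):
    """Fit A ~ B @ C.T on the non-NaN cells with plain SGD."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = rng.normal(scale=0.1, size=(n, d))   # random initial user vectors
    C = rng.normal(scale=0.1, size=(m, d))   # random initial item vectors
    observed = [(i, j) for i in range(n) for j in range(m) if not np.isnan(A[i, j])]
    for _ in range(epochs):
        for idx in rng.permutation(len(observed)):
            i, j = observed[idx]
            err = A[i, j] - B[i] @ C[j]      # error on this one observed cell
            B_i_old = B[i].copy()
            B[i] += lr * err * C[j]          # gradient step (factor 2 folded into lr)
            C[j] += lr * err * B_i_old
    return B, C

def observed_mse(A, B, C):
    """Mean squared error over the non-empty cells only."""
    mask = ~np.isnan(A)
    return float(np.mean((A[mask] - (B @ C.T)[mask]) ** 2))

A = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])
B0, C0 = factorize_sgd(A, epochs=0)   # untrained factors, for comparison
B, C = factorize_sgd(A)
print(observed_mse(A, B0, C0), observed_mse(A, B, C))
```

Only the observed cells are ever visited, which is what keeps the cost proportional to the number of ratings rather than to $n \times m$.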

![picture](https://drive.google.com/uc?export=view&id=14fJvxyl_ntBtxSX9htJed3ql6Tq7mwL_)

So, using the non-empty cells of matrix A along with the optimization technique, we're able to find suitable values for $B_i$ and $C_j$.

Recall that this was the primary goal of this entire exercise.


<br>

> **Q. How can we complete the missing cells in matrix A?**

Suppose that the third user has not rated the tenth item, i.e. $A_{3, 10}$ is missing.

Recall that our underlying assumption for MF based Rec Sys was that $A_{ij} \approx B_i^T \cdot C_j$

We already computed the user vector $B_3$, using the other ratings that user $U_3$ has given, which can be found in the row $A_{3, j}$.

Similarly, we have the item vector $C_{10}$, as item $I_{10}$ would've been rated by other users, which can be found in the column $A_{i, 10}$.

So, by taking a dot product of $B_3$ and $C_{10}$, we can get an approximate value for the empty cell $A_{3, 10}$.

Similarly, we can fill up all the empty cells in matrix A.
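
Once the factors are known, completing the matrix is a single matrix product. In the sketch below, `B` and `C` are hand-picked stand-ins for learned factors (an assumption made to keep the example short), and `NaN` marks the empty cells.

```python
import numpy as np

# Hypothetical learned factors (d = 2) for 3 users and 4 items
B = np.array([[1.0, 0.5], [0.2, 1.5], [1.2, 0.1]])
C = np.array([[4.0, 1.0], [1.0, 3.0], [3.0, 0.5], [0.5, 2.5]])

A = np.array([
    [4.5, np.nan, 3.2, 1.8],
    [np.nan, 4.7, 1.3, np.nan],
    [4.9, 1.5, np.nan, 0.9],
])

A_hat = B @ C.T                                  # predicted rating for every cell
A_completed = np.where(np.isnan(A), A_hat, A)    # keep known ratings, fill the gaps
print(A_completed)
```

For each user, the filled-in cells with the highest predicted values are the natural candidates to recommend.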

> This is the relation between **matrix completion** and **recommender systems**.

![picture](https://drive.google.com/uc?export=view&id=1tPT40WRqwiE_AZl-MxlJqDiUdWSS0drZ)

Let's understand this in the context of the optimisation problem, i.e. $\min_{B_i, C_j} \sum (A_{ij} - B_i^T \cdot C_j)^2$, where the sum runs over the non-empty cells.

**Suppose we're trying to find the item vector $C_{10}$**

In order to find this, we want to plug-in **ALL** the ratings that item $I_{10}$ has received from different users, i.e. $A_{1, 10}, A_{2, 10}, A_{3, 10}, ..., A_{n, 10}$.

Out of these, we use only the ratings that are not NULL/empty.
For example, if $A_{3, 10} = NULL$, it is not included in the optimization problem.

<br>

**Suppose we're trying to find the user vector $B_{3}$**

Similarly, here we plug-in **ALL** ratings that user $U_3$ has given to different products, i.e. $A_{3, 1}, A_{3, 2}, A_{3, 3}, ..., A_{3, m}$.

If, say, $A_{3, 2}$ and $A_{3, m}$ are NULL, they are not included in the optimization problem.
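
Solving for one item vector with the user vectors held fixed is a small least squares problem that uses only the rows of users who rated that item. The sketch below shows this single step of an alternating scheme; the function name `update_item_vector` and the optional ridge term `reg` are assumptions added for illustration. The update for a user vector is symmetric, with the roles of `B` and `C` swapped.

```python
import numpy as np

def update_item_vector(A, B, j, reg=0.1):
    """Least squares solution for C_j with B fixed, using only users who rated item j."""
    rated = ~np.isnan(A[:, j])            # NULL cells are left out of the problem
    B_rated = B[rated]
    d = B.shape[1]
    lhs = B_rated.T @ B_rated + reg * np.eye(d)
    rhs = B_rated.T @ A[rated, j]
    return np.linalg.solve(lhs, rhs)

# Toy check: build A from known factors, hide one cell, recover the item vector
B_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
C_true = np.array([[3.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
A = B_true @ C_true.T
A[0, 1] = np.nan                          # user 0 has not rated item 1

C_1 = update_item_vector(A, B_true, j=1, reg=0.0)
print(C_1)                                # matches C_true[1] in this noise-free toy case
```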

![picture](https://drive.google.com/uc?export=view&id=1CEog1d0CGK0Hsw_nn0W4d8hLx8ukz3g8)

<br>

> **Q. Is there any failure/boundary case for this?**

Yes.

Suppose that there is a user, say $U_{100}$, who hasn't rated even a single item (a cold start, or an old user who doesn't rate).

This means that all values of the 100th row in A would be empty, $A_{100, 1}, A_{100, 2}, A_{100, 3}, ..., A_{100, m} = NULL$

Then, no term of the loss involves $B_{100}$, so it is impossible to find the user vector $B_{100}$ for this user.

Similarly, if there's a new item that has never been rated even once (cold start), say item $I_{150}$, then $A_{1, 150}, A_{2, 150}, A_{3, 150}, ..., A_{n, 150} = NULL$, and it becomes impossible to find the item vector $C_{150}$.

![picture](https://drive.google.com/uc?export=view&id=1Kf0ZZnqlYDIK_WWAz3aq2La90eKiRJ2x)



> **To summarize (Mental Map):**
- **Recommendation System** can be formed through **Matrix Completion**, which can be achieved through **Matrix Factorization**, which is solved using an optimiser such as **Stochastic Gradient Descent** (or Alternating Least Squares).

![picture](https://drive.google.com/uc?export=view&id=1qzZNG2re3v-5hSMCzab4BHBNOXYJE2o0)


---

## Principal Component Analysis (PCA)

We've studied PCA in the last few classes. Let's see how PCA is related to the concept of MF.

Suppose we have our data matrix, containing **standardised data** $X$ with dimensions `n x d`.

Recall that we calculated the **covariance matrix** $S$ using X, which has dimensions `d x d`.

The covariance matrix is a square and symmetric matrix.

<br>

> **Q. How did we calculate the covariance matrix?**

Using the relation, $S_{dxd} = \frac{X^T \cdot X}{n-1}$, where $X^T$ is `d x n` and $X$ is `n x d`.
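
This relation can be checked against NumPy's built-in covariance. The data below is synthetic, and the standardisation uses the sample standard deviation (`ddof=1`), which is an assumption about how the data was standardised.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
raw = rng.normal(size=(n, d)) * np.array([1.0, 5.0, 0.5]) + np.array([10.0, -2.0, 0.0])

# Standardise: zero mean and unit sample standard deviation per column
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

S = X.T @ X / (n - 1)                               # d x d covariance matrix
print(np.allclose(S, np.cov(X, rowvar=False)))      # agrees with np.cov
```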

<br>

Before PCA existed, there was this idea called **Eigen Decomposition**, which is a special type of matrix decomposition.

> **Q. How can we decompose our covariance matrix using the concept of eigen decomposition?**

We can write our $S_{dxd}$ as: $S_{dxd} = W_{dxd} \cdot \Lambda_{dxd} \cdot W_{dxd}^T$

where <br>
$W_{dxd}$ -> Matrix where the columns are the `d` **eigen vectors** ($v_1, v_2, v_3, ..., v_d$)<br>
$\Lambda_{dxd}$ -> Matrix where the diagonal elements are the **eigen values** ($λ_1, λ_2, λ_3, ..., λ_d$), and all the other elements are 0 <br>
$W_{dxd}^T$ -> The transpose of $W_{dxd}$, so it is a matrix where each row is the transpose of one of the `d` eigen vectors.

**NOTE:**
- By convention, the eigen values in $\Lambda$ are sorted in decreasing order: $λ_1 \geq λ_2 \geq λ_3 \geq ... \geq λ_d$.
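
The eigen decomposition of a covariance matrix can be reproduced with `np.linalg.eigh`, which is intended for symmetric matrices and returns eigen values in ascending order, so they are reordered here. The data is random and only centred (not fully standardised), which is enough for the identity to hold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

eigvals, W = np.linalg.eigh(S)        # columns of W are the eigen vectors
order = np.argsort(eigvals)[::-1]     # largest eigen value first
eigvals, W = eigvals[order], W[:, order]
Lam = np.diag(eigvals)                # diagonal matrix of eigen values

S_rebuilt = W @ Lam @ W.T
print(np.allclose(S_rebuilt, S))              # S = W . Lambda . W^T
print(np.allclose(W.T @ W, np.eye(3)))        # eigen vectors are orthonormal
```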

<br>

> **NOTE:**
- This is actually how PCA is solved internally, i.e. by decomposing the covariance matrix S as $S_{dxd} = W_{dxd} \cdot \Lambda_{dxd} \cdot W_{dxd}^T$
- During the PCA lecture, it was not discussed how PCA is solved internally.

![picture](https://drive.google.com/uc?export=view&id=1z1h-ac7Ub23u6gOQRlHoBXR2NoliOJf1)

<br>

> **Q. How is PCA different from Matrix Factorization?**

It's not. PCA is just a special type of MF.

Earlier, when we performed MF on $A$, we did not care what the factors $B_i$ and $C_j$ look like, as long as they satisfy the relation $A_{ij} \approx B_i^T \cdot C_j$

Whereas, in the case of PCA, we have constraints on the values of its decomposed factors.

<br>

> **Q. What are the constraints on decomposed factors in case of PCA?**

Since $W_{dxd}$ consists of the eigen vectors of a symmetric matrix, **each column is perpendicular** to the other columns.

Equivalently, each row in $W_{dxd}^T$ is perpendicular to the other rows.

Also, as we've seen, $\Lambda_{dxd}$ is a diagonal matrix.

![picture](https://drive.google.com/uc?export=view&id=1LowAgBaXLrR-1vejNwdfYCuWHSStbS6l)

---

## Singular Value Decomposition (SVD)


One drawback of the eigen decomposition used in PCA is that it can only be applied to the covariance matrix (S), which is square and symmetric.

There is another technique which can help in factorising our data matrix, called **Singular Value Decomposition (SVD)**.

SVD doesn't require a square matrix, it can be directly applied on our rectangular data matrix with dimensions `n x d`.


<br>

> **NOTE:**
- We are not going to prove these relations, we're just looking at them

<br>

> **Q. What does SVD formulation look like?**

Here also, we're trying to decompose the data matrix into a product of 3 matrices.

$X_{nxd} = U_{nxn} \cdot Σ_{nxd} \cdot V_{dxd}^T$

Assuming that the number of data points is greater than the number of dimensions, i.e. $n > d$, let's break down each of these factors:

- $Σ_{nxd}$: diagonal matrix that contains the **d singular values**, and the rest of the elements are 0. Notice the dimensions, this is a **rectangular** matrix.

- $U_{nxn}$: **Left singular vectors**, i.e. a square matrix containing the **eigen vectors** of $X \cdot X^T = S'_{nxn}$ (let), stacked along columns
 - **Note:** $X \cdot X^T$ (`n x n`) is not the same as $X^T \cdot X$ (`d x d`), which is the one related to the covariance matrix $S_{dxd}$.

- $V_{dxd}$: **Right singular vectors**, i.e. a square matrix containing the **eigen vectors** of $X^T \cdot X = (n-1) \cdot S_{dxd}$, which are the same as the eigen vectors of the covariance matrix $S_{dxd}$, stacked along columns.
 - **Note:** In the SVD formulation, $V_{dxd}$ is transposed, therefore these eigen vectors become stacked along the rows.


> **NOTE:**
- Eigenvectors and eigenvalues are only defined for square matrices, whereas singular values and singular vectors are defined for rectangular matrices also.
- Notice that $V_{dxd}$ is the same as the $W_{dxd}$ that we saw in PCA (up to the sign of each column).
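
The shapes in the SVD formulation can be inspected with `np.linalg.svd`. With `full_matrices=True`, NumPy returns the square `U`, the singular values as a 1-D array, and `V` already transposed; the rectangular $Σ$ has to be assembled by hand. The data is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: n x n, s: d values, Vt: d x d

Sigma = np.zeros((n, d))          # rectangular "diagonal" matrix
Sigma[:d, :d] = np.diag(s)

X_rebuilt = U @ Sigma @ Vt
print(U.shape, Sigma.shape, Vt.shape)
print(np.allclose(X_rebuilt, X))
```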



<br>

> **Q. How do we find singular values?**

Singular values are related to eigen values as per the following relation:

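$s_i^2 = λ_i \cdot (n-1)$

where $s_i$ is the ith singular value of $X$ and $λ_i$ is the ith eigen value of the covariance matrix $S$.

**NOTE:**
- A property of the singular values in $Σ$ is that: $s_1 \geq s_2 \geq s_3 \geq ... \geq s_d$.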
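
The link between singular values and eigen values can be verified numerically on random centred data. The squared singular values of `X` are compared with the eigen values of the covariance matrix scaled by `n - 1`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

S = X.T @ X / (n - 1)                                   # covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]          # eigen values, largest first
singular_values = np.linalg.svd(X, compute_uv=False)    # returned largest first

print(singular_values ** 2)
print(eigvals * (n - 1))
```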

![picture](https://drive.google.com/uc?export=view&id=1fhrFTWGMyBtdXyVcy3md1h1YxzZzqAuh)


![picture](https://drive.google.com/uc?export=view&id=1Ny2ihrbM6bwfdWPBY-QCcOVtOpdNx5Ir)

![picture](https://drive.google.com/uc?export=view&id=1rvuRF9hFyEWftAFKv8LpH79xziHAsAUJ)

> **Q. Why is SVD important for us?**

The concept of SVD is very closely related to PCA. However, there is another concept, called `truncated SVD`, that is very important.

We will study this shortly.

Given a data matrix $X_{nxd}$, this concept can produce very useful engineered features.

In the next lecture, we will see other special cases and applications of matrix factorization, and connect all of it to the context of Recommender Systems.

---