# Unsupervised Learning
- Unsupervised learning is 'understanding data'
- Data: $\{x^1, x^2, \ldots, x^n\}$
- $x^i \in \mathbb{R}^d$
- Build models that compress, explain, and group data.

### Dimensionality Reduction 
- Data: $\{x^1, x^2, \ldots, x^n\}$
- $x^i \in \mathbb{R}^d$
- Encoder $f: \mathbb{R}^d \to \mathbb{R}^{d'}$
- Decoder $g: \mathbb{R}^{d'} \to \mathbb{R}^d$
- Goal: $g(f(x^i)) \approx x^i$

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^n \| g(f(x^i)) - x^i \|^2
$$



---

# Introduction to Unsupervised Learning

Hello, everyone, and welcome to another lecture on machine learning foundations.

In the previous lectures, we introduced the supervised learning paradigm and went into detail on its two main types of tasks, regression and classification.

In this lecture, we are going to start on the unsupervised learning paradigm and the two main tasks associated with it: dimensionality reduction and density estimation.

In contrast to supervised learning, which had very clear goals and ways of quantifying them, unsupervised learning is typically much more vague. It is also typically used as a pre-processing step, not as an end in itself.

Loosely, unsupervised learning can be viewed as a basis for understanding data. Data in our context is simply going to be a collection of vectors.

Note that in contrast to supervised learning, which always had pairs \((x_i, y_i)\), here you have just the \(x_i\), a collection of \(D\)-dimensional vectors.

The goal of unsupervised learning is to build models that compress, explain, and group data, which is what I am broadly grouping as understanding.

We will explain what all of these mean with two specific examples, which are dimensionality reduction and density estimation.

---

## Usefulness of Unsupervised Learning

Here is an example of how unsupervised learning can be useful.

Unsupervised learning, as I mentioned, is typically not used as an end in itself, because the outputs of unsupervised learning algorithms by themselves are not useful.

But after human interpretation, and after other machine learning tasks, they can become very valuable.

For example, let us say you are a marketing manager at Coca-Cola, and your job is to collect the tweets about Coca-Cola and summarize them for your boss.

So, that is your job.

Let us say that in any given week there are 1 million tweets about Coca-Cola.

There is no way you can show all million tweets to your boss and explain what each one says. It is just not possible.

A reasonable thing to do would be to group these million tweets into 10 distinct groups.

So, maybe you have one group of tweets from people who are all just taking selfies with Coke in a new place.

Or maybe there is another group of tweets from other brands that are doing some co-branding.

Or you can have another group of tweets from people who are promoting Coke and are paid by Coke.

So, there are several groups you can potentially think of. If you can group these million tweets into 10 manageable groups, without being given any such labels in advance, and understand what the groups are by going through them, then you can go to your boss and say:

> “Well, there are 10 types of tweets that happened this week. Type 1 is tweets by people who are buying Coke for the first time and tweeting about it from a store near their house. Type 2 is tweets from businesses collaborating with Coke,” and so on.

You can easily summarize the situation for your boss this way, the boss can take it in within a reasonable amount of time, and you can get your job done.

---

Note that all the unsupervised learning algorithm does is group the tweets into 10 groups.

Anything beyond that, such as interpreting these groups, is typically the job of a human being, because the groups by themselves are meaningless.

Only when you assign coherent meaning to such a group would it be actionable or useful.

So, that is done by the human which, in this case, would be you.

That is the reason why unsupervised learning is typically not an end in itself, but rather a pre-processing stage which is used by other processes.
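
To make the point about "meaningless" groups concrete, here is a minimal sketch of a grouping step. The lecture does not name a grouping algorithm, so this uses plain k-means as a stand-in, and the six 2-D points are hypothetical placeholders for tweet feature vectors. Notice that the output is only a list of integer group ids; nothing in it says what any group is about.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points: List[Point], centers: List[Point], n_iters: int = 10) -> List[int]:
    """Plain k-means: alternate between assigning points and updating centers."""
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins the group of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its group members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty group's center as is
        centers = new_centers
    return labels


# Hypothetical 2-D stand-ins for tweet feature vectors: two obvious clumps.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

# Assumption: start from one point in each clump so the run is deterministic.
labels = kmeans(points, centers=[points[0], points[3]])
print(labels)  # group ids only; interpreting them is left to a person
```

The algorithm returns group 0 and group 1. Deciding that group 0 means, say, "selfies" and group 1 means "co-branding" is exactly the human interpretation step described above.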

---

## Dimensionality Reduction: A Concrete Example of Unsupervised Learning

Here is one example of a concrete unsupervised learning task: **dimensionality reduction**.

The goal of dimensionality reduction is **compression and simplification**.

---

### Where is Dimensionality Reduction Useful?

For example, let us say you have a genetics company and you want to export or comprehensively store the gene expression levels of a million people.

Each person has, let us say, a million genes.

So, you have computed the gene expression levels of these million genes for a million people.

In principle, this is a \(10^6 \times 10^6\) matrix — a million people each having a million genes.

That is a huge amount of data.

There is no way you can transmit this data from one lab to another lab easily.

It is just not possible.

What would be nice is if you can compress this data into a simpler format, which can be used for transmitting.

So, that is one reasonable goal to have.

Dimensionality reduction is one of the main tools that you can use for such a task.

---

### Mathematical Formulation

Formally, your data is a set of \(D\)-dimensional vectors \(x_1, x_2, \ldots, x_n\).

The goal of dimensionality reduction is to come up with **two models**, unlike the previous cases of classification and regression, where the goal was to come up with a single model.

Specifically, a dimensionality reduction algorithm produces:

- an **encoder** \(f\), and
- a **decoder** \(g\).

---

The **encoder** is a function which takes in a \(D\)-dimensional vector and outputs a \(d'\)-dimensional vector, where typically \(d' \ll D\).

Effectively, the encoder compresses a \(D\)-dimensional vector into a \(d'\)-dimensional vector.

The **decoder** aims to undo the effect of the encoder.

It takes a \(d'\)-dimensional vector and outputs a \(D\)-dimensional vector.
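
As a rough illustration of the two signatures, here is about the simplest encoder-decoder pair one could write. This is only a sketch under an assumption of my own, not a method from the lecture: the encoder keeps the first \(d'\) coordinates and the decoder pads with zeros. The helper names `make_encoder` and `make_decoder` are made up for this example.

```python
from typing import Callable, List

Vector = List[float]


def make_encoder(d_prime: int) -> Callable[[Vector], Vector]:
    """Return f: R^D -> R^{d'} that keeps only the first d' coordinates."""
    def f(x: Vector) -> Vector:
        return x[:d_prime]
    return f


def make_decoder(D: int) -> Callable[[Vector], Vector]:
    """Return g: R^{d'} -> R^D that pads the code with zeros up to length D."""
    def g(z: Vector) -> Vector:
        return z + [0.0] * (D - len(z))
    return g


D, d_prime = 5, 2
f = make_encoder(d_prime)
g = make_decoder(D)

x = [3.0, -1.0, 0.5, 0.0, 2.0]  # a D-dimensional input
z = f(x)                        # compressed code of length d'
x_hat = g(z)                    # reconstruction of length D

print(len(z), len(x_hat))       # 2 5
print(x_hat)                    # [3.0, -1.0, 0.0, 0.0, 0.0]
```

The shapes are right, but the decoder cannot recover the coordinates the encoder threw away. A good pair chooses what to keep so that as little as possible is lost.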

---

### Goal of Encoder and Decoder

The goal is that for any input \(x_i\),

\[
g(f(x_i)) \approx x_i
\]

That is, after encoding and decoding, you should get back the original input approximately.

---

### Measuring Approximation

A reasonable way of measuring how good the approximation is for a single input is to compute the squared norm of the difference:

\[
\| g(f(x_i)) - x_i \|^2
\]

If this is zero, then you have perfect reconstruction.

In practice, you want to find encoder-decoder pairs that minimize the average reconstruction error:

\[
\frac{1}{n} \sum_{i=1}^n \| g(f(x_i)) - x_i \|^2
\]
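
The average reconstruction error above can be computed directly. The sketch below does so for two hand-picked encoder-decoder pairs on a toy data set. The data and both pairs are assumptions made up for illustration: five points in \(\mathbb{R}^2\) lying exactly on the line \(x_2 = 2 x_1\), compressed to \(d' = 1\).

```python
import math
from typing import Callable, List

Vector = List[float]


def avg_reconstruction_error(
    data: List[Vector],
    f: Callable[[Vector], Vector],
    g: Callable[[Vector], Vector],
) -> float:
    """Compute (1/n) * sum_i ||g(f(x_i)) - x_i||^2."""
    total = 0.0
    for x in data:
        x_hat = g(f(x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


# Toy data in R^2 lying exactly on the line x2 = 2 * x1.
data = [[t, 2.0 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]

# Pair A: keep the first coordinate, decode by padding with a zero.
def f_a(x: Vector) -> Vector:
    return [x[0]]

def g_a(z: Vector) -> Vector:
    return [z[0], 0.0]

# Pair B: project onto the unit direction u of the line, decode by scaling u.
u = [1.0 / math.sqrt(5.0), 2.0 / math.sqrt(5.0)]

def f_b(x: Vector) -> Vector:
    return [x[0] * u[0] + x[1] * u[1]]

def g_b(z: Vector) -> Vector:
    return [z[0] * u[0], z[0] * u[1]]

err_a = avg_reconstruction_error(data, f_a, g_a)
err_b = avg_reconstruction_error(data, f_b, g_b)
print(err_a)  # 8.0
print(err_b)  # zero up to floating-point rounding
```

Both pairs compress each point to a single number, yet pair B reconstructs the data essentially perfectly because it is matched to the structure of the data. Finding such a pair automatically is what a dimensionality reduction algorithm is for.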

---
