# 1. Marginalization
The concept of _**marginalization**_ comes up very frequently in machine learning, particularly in **Bayesian learning** and **probabilistic graphical models**. It is relatively straightforward, but it must be understood in both the discrete and the continuous case, and backed by strong intuition.

## 1.1 Discrete Case
Let's start with a simple discrete example. Say we have the following table:

<img src="images/marginalization-1.png" width="400">

Now, let's ask: What is the probability that someone is experiencing symptoms? In order to solve this, intuitively we say:

> What are all of the ways in which someone can experience symptoms?

In other words, what is the probability that $Y=1$?

#### $$p(Y=1)$$

In order to find that we can just sum up the different ways in which $Y=1$! Visually that looks like:

<img src="images/marginalization-2.png" width="400">

And we can write it as:

#### $$p(Y=1) = p(Y=1, X=0) + p(Y=1, X=1) = 0.1 + 0.3 = 0.4$$

This can be rewritten with a summation for conciseness:

#### $$p(Y=1) = \sum_{x}p(Y=1, X=x)$$

Now, our specific case above can be generalized to the formula for marginalization:

#### $$p(Y=y) = \sum_{x}p(Y=y, X=x)$$

Now, keep in mind that we can do the exact same thing to the $X$ variable if we'd like; that is, we can marginalize out $Y$ and be left with only $X$. For instance, say we want to know:

> What is the probability that someone has the disease? 

Intuitively, we know that is just _the probability someone has the disease and shows symptoms_, plus _the probability someone has the disease and doesn't show symptoms_.

<img src="images/marginalization-3.png" width="400">

And we can write it as:

#### $$p(X=1) = p(Y=0, X=1) + p(Y=1, X=1) = 0.1 + 0.3 = 0.4$$
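Both sums above can be checked with a few lines of Python. This is only a sketch: the table is stored as a dictionary keyed by `(y, x)`, the helper names `marginal_y` and `marginal_x` are made up for illustration, and the `(0, 0)` entry of 0.5 is an assumption, inferred here as the value that makes the four entries sum to 1.

```python
# Joint table p(Y=y, X=x), keyed by (y, x).
# Y = symptoms, X = disease. The (0, 0) entry is an assumption,
# chosen so that the four entries sum to 1.
joint = {
    (0, 0): 0.5,  # assumed: no symptoms, no disease
    (0, 1): 0.1,  # no symptoms, disease
    (1, 0): 0.1,  # symptoms, no disease
    (1, 1): 0.3,  # symptoms, disease
}

def marginal_y(joint, y):
    """p(Y=y): sum the joint over every value of x."""
    return sum(p for (yi, xi), p in joint.items() if yi == y)

def marginal_x(joint, x):
    """p(X=x): sum the joint over every value of y."""
    return sum(p for (yi, xi), p in joint.items() if xi == x)

p_symptoms = marginal_y(joint, 1)  # p(Y=1, X=0) + p(Y=1, X=1)
p_disease = marginal_x(joint, 1)   # p(Y=0, X=1) + p(Y=1, X=1)
print(p_symptoms, p_disease)
```

The two helpers differ only in which coordinate is held fixed and which one is summed over, which is all that "marginalizing out" a variable means here.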

### 1.1.1 Discrete Case Intuition
One way to think about this in the discrete case (when we are dealing with tables) is that we are _collapsing_ the dimension that we are marginalizing out. For instance, our table above represents the _**joint distribution**_ of _disease_ and _symptoms_. If we wanted just the distribution of _symptoms_, we would need to marginalize out _disease_:

#### $$p(Symptoms) = \sum_{disease \in \{yes, no\}}p(Symptoms, Disease=disease)$$

#### $$p(Y) = \sum_{x}p(Y, X=x)$$

The key is to remember how this looks visually:

<img src="images/marginalization-4.png" width="700">

We can see that our columns in $X$ (the disease) were collapsed into a single probability column for $Y$, and the $X$ variable no longer remains.
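The collapsing picture maps directly onto summing along an array axis. The sketch below assumes the table is laid out with rows indexing $Y$ and columns indexing $X$, and again assumes 0.5 for the entry that makes the table sum to 1; the variable names are illustrative only.

```python
import numpy as np

# Joint table: rows index Y (symptoms), columns index X (disease).
# The 0.5 entry is an assumption that makes the table sum to 1.
joint = np.array([
    [0.5, 0.1],  # Y=0: X=0, X=1
    [0.1, 0.3],  # Y=1: X=0, X=1
])

# Summing along axis 1 collapses the X columns, leaving one value per y.
p_y = joint.sum(axis=1)

print(joint.shape)  # (2, 2)
print(p_y.shape)    # (2,)
```

The drop from a two-dimensional shape to a one-dimensional one is the "collapse": the $X$ axis is gone, and what remains is a distribution over $Y$ alone.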


### 1.1.2 A Confusing Convention
You may notice that if we are trying to find $p(Y)$ for all $y$, the equation to do so is written as:

#### $$p(Y) = \sum_{x}p(Y, X=x)$$

What may seem strange from a mechanical perspective (or when you think about implementing it in code) is that we seem to be missing an iterator. What we are actually trying to do is sum over all $x$ in $X$, _for each_ $y$. In code we could write it as:

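```python
import numpy as np

def marginalize_x(P_joint):
    """P_joint[y, x] holds the joint p(Y=y, X=x); returns p(Y=y) for every y."""
    n_y, n_x = P_joint.shape
    P_y = np.zeros(n_y)       # marginalized distribution
    for y in range(n_y):      # outer iterator: one result per y
        for x in range(n_x):  # inner iterator: the sum over x
            P_y[y] += P_joint[y, x]
    return P_y
```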

However, when we write the actual mathematical equation, that first iterator that occurs over all $y$ is missing. That is simply a convention, nothing more. 
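One place where this convention shows up in code is index notation. As a sketch, NumPy's `einsum` takes a subscript string in which every index kept on the output side plays the role of the implicit iterator, and every index that is dropped gets summed. The table values are illustrative, with 0.5 being an assumed entry that makes the table sum to 1.

```python
import numpy as np

# Illustrative joint table: rows index y, columns index x (0.5 is assumed).
joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

# "yx->y": y appears on the output side, so it is kept (the implicit iterator);
# x does not appear, so it is summed over (the explicit sigma).
p_y = np.einsum("yx->y", joint)

# The same computation written as an explicit double loop.
p_y_loop = np.zeros(joint.shape[0])
for y in range(joint.shape[0]):
    for x in range(joint.shape[1]):
        p_y_loop[y] += joint[y, x]

print(np.allclose(p_y, p_y_loop))
```

Reading `"yx->y"` the same way as the equation (free index $y$, summed index $x$) is one way to keep the convention straight.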

---

## 1.2 Continuous Case
Now, let's take a moment to look at our continuous case. Visually, a continuous joint distribution looks like:

<img src="images/continuous-joint.png" width="450">

Here we have two variables, $X_1$ and $X_2$, with the probability density on the vertical axis. The probability density function is written as:

#### $$p(X_1, X_2)$$

Note that the volume under this surface must equal 1, since it represents the total probability over all possible values of $X_1$ and $X_2$.

Now, what if we were asked to find $p(X_1)$? In order to do that we would need to _marginalize_ out $X_2$. Mathematically, that would look like:

#### $$p(X_1) = \int_{-\infty}^{\infty}p(X_1, X_2 = x_2)\,dx_2$$

But how can that be viewed visually? We are going through all possible values of $X_1$ (this is not written explicitly, but is implied by the marginalization convention), and for each one we want the area under the curve at that value (the area under the curve along the $X_2$ axis). Take $x_1 = -0.6$, for instance. Visually, when we fix $X_1 = -0.6$, we have the following curve, shown in pink:

<img src="images/continuous-joint-2.png" width="450">

That curve lives in only two dimensions (since $X_1$ is fixed). The integral then finds the _area_ under that curve, from $X_2 = -\infty$ to $X_2 = \infty$.

<img src="images/continuous-joint-3.png" width="450">

That _total area_, shaded in pink, is then the value of the marginal density $p(X_1)$ at $X_1 = -0.6$!
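The area for a single slice can be approximated numerically. The sketch below assumes a specific joint density (two independent standard normals), which may not be the one in the figures; it was picked because its marginal is known exactly and gives something to compare against. The function name `joint_pdf` and the integration range are likewise assumptions.

```python
import numpy as np

def joint_pdf(x1, x2):
    """Assumed joint density: two independent standard normals."""
    return np.exp(-0.5 * (x1**2 + x2**2)) / (2.0 * np.pi)

x1_fixed = -0.6
x2 = np.linspace(-8.0, 8.0, 2001)     # finite stand-in for (-inf, inf)
slice_vals = joint_pdf(x1_fixed, x2)  # the curve obtained by fixing x1

# Trapezoidal rule: area under the slice along the x2 axis.
area = np.sum((slice_vals[1:] + slice_vals[:-1]) / 2.0 * np.diff(x2))

# For this particular joint, the exact marginal is the standard normal pdf.
exact = np.exp(-0.5 * x1_fixed**2) / np.sqrt(2.0 * np.pi)
print(area, exact)
```

The integration limits are truncated to $[-8, 8]$ because the assumed density is negligible outside that range; a heavier-tailed joint would need wider limits.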

<img src="images/continuous-joint-4.png" width="450">

This can then be repeated across all values of $X_1$, until we finally end up with a probability distribution that no longer contains $X_2$ (seen in orange)! 

<img src="images/continuous-joint-5.png" width="450">

We can see that the $X_2$ dimension was collapsed, with the area associated with each fixed $X_1$ becoming the value of the marginal density at that $X_1$!
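Repeating the slice integral for every $X_1$ is again a sum along one axis, this time on a grid. The following is a rough sketch under stated assumptions: the joint is taken to be a standard bivariate normal with an arbitrarily chosen correlation `RHO = 0.5` (not necessarily the density in the figures), and the grid ranges and resolutions are illustrative.

```python
import numpy as np

RHO = 0.5  # assumed correlation between X1 and X2

def joint_pdf(x1, x2, rho=RHO):
    """Assumed joint density: standard bivariate normal with correlation rho."""
    norm = 2.0 * np.pi * np.sqrt(1.0 - rho**2)
    quad = (x1**2 - 2.0 * rho * x1 * x2 + x2**2) / (1.0 - rho**2)
    return np.exp(-0.5 * quad) / norm

x1 = np.linspace(-5.0, 5.0, 201)
x2 = np.linspace(-10.0, 10.0, 4001)
dx1 = x1[1] - x1[0]
dx2 = x2[1] - x2[0]

# density[i, j] = p(x1[i], x2[j]); broadcasting builds the whole grid at once.
density = joint_pdf(x1[:, None], x2[None, :])

# Integrate out x2 for every x1 at once (Riemann sum along axis 1).
marginal_x1 = density.sum(axis=1) * dx2

# The marginal is itself a density, so it should integrate to roughly 1.
total = marginal_x1.sum() * dx1
print(marginal_x1.shape, total)
```

This is the continuous counterpart of collapsing the columns of a table: the grid has one axis per variable, and integrating out $X_2$ removes its axis, leaving a one-dimensional density over $X_1$.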