# 1. Preliminaries
So far we have treated graphical models primarily as a data structure for encoding probability distributions: using a set of parameters that are tied to the graph structure, we can represent a probability distribution over a high-dimensional space in factored form.

It turns out that we can view the graph structure of a graphical model from a completely complementary viewpoint: as a representation of the set of independencies that the probability distribution must satisfy.

## 1.1 Independence 
We start with the independence of two events $\alpha$ and $\beta$ within a probability distribution $P$, which we write as:

> For events $\alpha$, $\beta$: $P \models \alpha \perp \beta$

Note: $\models$ is the logical symbol for satisfies, and $\perp$ is the symbol for independence. So we can read the statement above as *P satisfies that $\alpha$ is independent of  $\beta$.*

There are 3 entirely equivalent definitions of the concept of independence. So, if we go back to our statement:
> For events $\alpha$, $\beta$, $P \models \alpha \perp \beta$ if:
* Definition 1: The probability of the conjunction of the two events (written with an intersection, $\cap$, or a comma) is the product of their probabilities:
$$P(\alpha, \beta) = P(\alpha \cap \beta) = P(\alpha)P(\beta)$$
* Definition 2: This definition concerns the flow of influence. It says that if you tell me $\beta$, it doesn't affect my probability of $\alpha$. So the probability of $\alpha$ given the information about $\beta$ is the same as the probability of $\alpha$ without that information:
$$P(\alpha|\beta) = P(\alpha)$$
* Definition 3: Because probabilistic influence is symmetric, we have the exact converse of that:
$$P(\beta|\alpha) = P(\beta)$$
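To make the three definitions concrete, here is a small sketch that checks them by brute-force counting. The two-dice sample space and the particular events `alpha` and `beta` are illustrative assumptions, not part of the original example set.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: the 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def alpha(o):
    return o[0] % 2 == 0      # the first die is even

def beta(o):
    return o[0] + o[1] == 7   # the two dice sum to 7

p_a = prob(alpha)                              # 1/2
p_b = prob(beta)                               # 1/6
p_ab = prob(lambda o: alpha(o) and beta(o))    # 1/12

print(p_ab == p_a * p_b)   # Definition 1: True
print(p_ab / p_b == p_a)   # Definition 2, P(alpha | beta) = P(alpha): True
print(p_ab / p_a == p_b)   # Definition 3, P(beta | alpha) = P(beta): True
```

Using `Fraction` keeps the arithmetic exact, so the equalities can be tested directly rather than up to a floating-point tolerance.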

So this is independence of events, and we can take that exact same definition and generalize it to **independence of random variables**. 

> For random variables $X, Y, P \models X \perp Y$ if:
* $P(X, Y) = P(X)P(Y)$
* $P(X|Y) = P(X)$
* $P(Y|X) = P(Y)$

We can read these statements in two different but equivalent forms. The first is as a **universal statement**. So, for all $x, y$:

$$\forall x, y: P(x,y) = P(x)P(y)$$

This means we can think of it as a conjunction of many independence statements of the form above, one for each pair of values $x$ of $X$ and $y$ of $Y$.

The second interpretation is as an expression over **factors**. That is, we are told that the factor $P(X, Y)$ (which is the joint distribution of X and Y) is actually the product of two lower-dimensional factors: one whose scope is X, and one whose scope is Y.
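Both readings can be checked mechanically on a small table. The sketch below uses two made-up joint distributions over binary variables (the numbers are assumptions chosen for illustration): one built as a product of marginals, and one where X and Y tend to agree.

```python
from itertools import product
from math import isclose

# Hypothetical marginals; the joint is their product, so it is independent by construction.
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.6, 1: 0.4}
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

# A second hypothetical joint in which X and Y tend to take the same value.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(table, axis):
    """Sum out every variable except the one at position `axis`."""
    out = {}
    for assignment, p in table.items():
        out[assignment[axis]] = out.get(assignment[axis], 0.0) + p
    return out

def is_independent(table, tol=1e-9):
    """Universal statement: P(x, y) = P(x) P(y) must hold for every pair (x, y)."""
    px, py = marginal(table, 0), marginal(table, 1)
    return all(isclose(p, px[x] * py[y], abs_tol=tol) for (x, y), p in table.items())

print(is_independent(joint))      # True
print(is_independent(dependent))  # False
```

The check inside `is_independent` is the universal statement; the fact that `joint` was built as `p_x[x] * p_y[y]` is the factor view.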

### 1.1.1 Example of Independence 
We can now look at an example of independence using a fragment of our student network. It has 3 random variables: intelligence, difficulty, and course grade. The joint distribution has a scope of these 3 variables.

<img src="images/independence.png" height="300" width="300">

We can marginalize out the grade to get a probability distribution, that is, a factor, over the scope $I, D$. The result is the marginal distribution below (recall that to get the entry for $i^0, d^0$, we add up the first 3 rows of the joint).

<img src="images/independence-1.png" height="300" width="300">

It is not difficult to check that if we marginalize $P(I, D)$ further to get $P(I)$ and $P(D)$, then:

$$P(I,D) = P(I)P(D)$$

<img src="images/independence-2.png" height="300" width="300">

When we look at the graphical model we can see that there are no direct connections between $I$ and $D$.

<img src="images/independence-3.png" height="300" width="300">
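The marginalization steps above can be sketched in a few lines. The tables in the figures are not reproduced in the text, so the CPD values below are assumptions chosen for illustration; only the structure (I and D are both parents of G) comes from the example.

```python
from itertools import product
from math import isclose

# Assumed (illustrative) CPDs for the fragment I -> G <- D.
p_i = {"i0": 0.7, "i1": 0.3}
p_d = {"d0": 0.6, "d1": 0.4}
p_g = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}

# Joint P(I, D, G) as the product P(I) P(D) P(G | I, D).
joint = {
    (i, d, g): p_i[i] * p_d[d] * p_g[(i, d)][g]
    for i, d in product(p_i, p_d)
    for g in p_g[(i, d)]
}

# Marginalize out G: each entry of P(I, D) is the sum of three rows of the joint.
p_id = {}
for (i, d, g), p in joint.items():
    p_id[(i, d)] = p_id.get((i, d), 0.0) + p

# Marginalize P(I, D) further to get P(I) and P(D).
m_i = {i: sum(p_id[(i, d)] for d in p_d) for i in p_i}
m_d = {d: sum(p_id[(i, d)] for i in p_i) for d in p_d}

independent = all(
    isclose(p_id[(i, d)], m_i[i] * m_d[d], abs_tol=1e-9) for i, d in p_id
)
print(independent)  # True
```

Note that independence holds here by construction, because the joint was assembled from separate factors for I and D; the point of the sketch is only to show the two marginalization steps and the final product check.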

## 1.2 Conditional Independence
Now, by itself, independence is not a particularly powerful notion. This is because it only happens rarely. That is, only in very rare cases will you have random variables that are truly independent of each other. 

We will now talk about **conditional independence**, which is written as:

> For (sets of) random variables $X, Y, Z: P \models (X \perp Y | Z)$

This can be read as: *P satisfies X is independent of Y given Z*. We have 4 equivalent definitions of this property.

> $P \models (X \perp Y | Z)$ if:
* $P(X, Y|Z) = P(X|Z)P(Y|Z)$ (given Z, the joint distribution of X and Y is the product of their individual conditional distributions)
* $P(X|Y,Z) = P(X|Z)$ (given Z, Y gives us no information that changes our probability of X)
* $P(Y|X,Z) = P(Y|Z)$ (given Z, X gives us no information that changes our probability of Y)
* $P(X, Y, Z) \propto \phi_1(X, Z) \phi_2(Y,Z)$ (the joint distribution of X, Y, Z is proportional to a product of two factors: one factor over X and Z, and one factor over Y and Z)
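To see why the first two definitions in this list agree, here is a short derivation (a sketch, assuming $P(Y, Z) > 0$ so that all the conditional probabilities are defined). Starting from the definition of conditional probability and then substituting the first definition:

$$
\begin{aligned}
P(X|Y,Z) &= \frac{P(X, Y|Z)}{P(Y|Z)} \\
&= \frac{P(X|Z)P(Y|Z)}{P(Y|Z)} \\
&= P(X|Z)
\end{aligned}
$$

The third definition follows in the same way with the roles of X and Y swapped.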

### 1.2.1 Conditional Independence Example
Let's look at an example of conditional independence. Imagine that you are given two coins, and we are told that one coin is fair and one coin is biased and will come up heads 90% of the time. 

<img src="images/cond-ind-1.png" height="300" width="300">

Now, we have a process where we first pick a coin and then toss it twice. In the image above, the coin node $C$ is the coin that you pick, and $X_1$ and $X_2$ are the two tosses. Let's think about dependence and independence in this example. If you don't know which coin you picked, and you toss the coin and it comes up heads, what happens to the probability of heads in the second toss? It is higher! That is because there is now a greater chance that we picked the biased coin. However, if we are told that we picked the fair coin, we don't care about the first toss: we know it will not affect the outcome of the second toss.

So we have:

$$P \not\models X_1 \perp X_2$$
$$P \models (X_1 \perp X_2 | C)$$

In other words, if we know which coin was chosen, we gain no information about the second toss by observing the first; **we already know the coin**! When we have not been told which coin we are dealing with (C is not given), $X_1$ and $X_2$ are not independent of each other, since if we flip heads for $X_1$ then we are more likely to flip heads for $X_2$. However, if we are given the coin, then it doesn't matter what $X_1$ is; the probability of heads for $X_2$ is just the one associated with the coin we were given!
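We can put numbers on this. The sketch below assumes each coin is picked with probability 0.5 (the example does not say how the coin is chosen); the heads probabilities of 0.5 and 0.9 come from the example.

```python
from math import isclose

# Assumption: the coin is picked uniformly at random.
p_coin = {"fair": 0.5, "biased": 0.5}
# From the example: P(heads | coin).
p_heads = {"fair": 0.5, "biased": 0.9}

# Without knowing the coin: P(X1 = H) = sum_c P(c) P(H | c).
p_x1_h = sum(p_coin[c] * p_heads[c] for c in p_coin)

# The tosses are independent given the coin, so P(X1 = H, X2 = H | c) = P(H | c)^2.
p_both_h = sum(p_coin[c] * p_heads[c] ** 2 for c in p_coin)

# Seeing heads on the first toss raises the probability of heads on the second.
p_x2_h_given_x1_h = p_both_h / p_x1_h

print(round(p_x1_h, 4))             # 0.7
print(round(p_x2_h_given_x1_h, 4))  # 0.7571
```

Once the coin is given, the same calculation is trivial: $P(X_2 = H | X_1 = H, C = \text{fair})$ is just 0.5, the same as $P(X_2 = H | C = \text{fair})$, whatever the first toss was.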

### 1.2.2 Conditional Independence Example 2
<img src="images/cond-ind-2.png" height="400" width="400">

We will now go over another example that we have seen before. In this case there is again one common cause, the student's intelligence. Two things emanate from it: the student's grade and their SAT score. Once again we can generate the joint distribution:

$$P(I, S, G)$$

You can then look at the probability of S and G given $i^0$:

$$P(S, G|i^0)$$

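And we can then ask how that decomposes: is it the product of the probability of S given $i^0$ and the probability of G given $i^0$?

$$P(S|i^0) \quad \text{and} \quad P(G|i^0)$$

It is: once the common cause $I$ is known, $S$ and $G$ carry no further information about each other, so $P \models (S \perp G | I)$.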

### 1.2.3 Conditioning Can Lose Independencies
<img src="images/cond-ind-3.png" height="400" width="400">

Now consider the case where intelligence and difficulty both influence the grade. We have seen that I and D are independent in the original distribution, but they are **not independent** when we condition on the grade. That is, I and D are dependent in the conditional distribution, even though they were independent in the marginal distribution.
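A quick numeric sketch shows the effect. The CPD values are assumptions chosen for illustration (the figure's tables are not reproduced in the text), and `g3` is assumed to be the lowest grade. We condition on `g3` and compare our belief in high intelligence before and after also learning that the course is difficult.

```python
from itertools import product

# Assumed (illustrative) CPDs for I -> G <- D.
p_i = {"i0": 0.7, "i1": 0.3}
p_d = {"d0": 0.6, "d1": 0.4}
p_g = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
joint = {
    (i, d, g): p_i[i] * p_d[d] * p_g[(i, d)][g]
    for i, d in product(p_i, p_d)
    for g in p_g[(i, d)]
}

# P(I, D | G = g3): take the matching slice of the joint and renormalize.
g_obs = "g3"
norm = sum(p for (i, d, g), p in joint.items() if g == g_obs)
cond = {(i, d): joint[(i, d, g_obs)] / norm for i, d in product(p_i, p_d)}

# P(i1 | g3) versus P(i1 | g3, d1).
p_i1_given_g = cond[("i1", "d0")] + cond[("i1", "d1")]
p_i1_given_g_d1 = cond[("i1", "d1")] / (cond[("i0", "d1")] + cond[("i1", "d1")])

print(round(p_i1_given_g, 4))     # 0.0789
print(round(p_i1_given_g_d1, 4))  # 0.1091
```

Under these assumed numbers, learning that the course is difficult changes our belief about intelligence once the grade is known, so I and D are not independent given G, even though they are independent marginally.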

---

# 2. Independencies in Bayesian Networks
One of the most elegant properties of probabilistic graphical models is the intricate connection between the factorization of the distribution as a product of factors and the independence properties that it needs to satisfy. Now we're going to talk about how that connection manifests in the context of directed graphical models, or Bayesian networks.

## 2.1 Independence and Factorization
So, let's first remind ourselves why independence and factorization are related to each other. For example, the statement that $P(X, Y)$ is the product of two factors $P(X)$ and $P(Y)$ is the definition of independence:

$$P(X, Y) = P(X)P(Y)$$

At the same time, it is a factorization of the joint distribution as a product of two factors. Similarly, one of the definitions that we gave for conditional independence is that the joint distribution over X, Y and Z is proportional to a factor over X and Z times a factor over Y and Z:

$$P(X, Y, Z) \propto \phi_1(X,Z) \phi_2(Y,Z)$$

This is the definition of conditional independence:

$$(X \perp Y | Z)$$

So, once again, independence corresponds to factorization. We see that a factorization of the distribution corresponds to independencies that hold in that distribution. The question is: if we know that a distribution P factorizes over G, can we say something about the independencies that P must satisfy just by looking at the structure of the graph G?

## 2.2 Flow of Influence and d-separation
So, what independencies might hold in a probabilistic graphical model? We talked about the notion of flow of influence in a probabilistic graphical model, where we have, for example, an active trail that goes from $S$ up through $I$, down through $G$ and up to $D$ if $G$ is observed, that is, if $G$ is in $Z$. That gave us an intuition about when probabilistic influence might flow.

We can now turn this notion on its head and ask the question:

> *What happens when we know that there are no active trails on the graph, i.e. that influence can't flow?*  

So we're going to make that notion formal using the notion of **d-separation**. And we're going to say that:

> **Definition**: *X and Y are d-separated in a graph G, given a set of  observations Z, if there's no active trail between them.*

Using a more mathematical terminology:
> **Definition 2:** *$X$ and $Y$ are d-separated in G, given Z, if there is no active trail in G between X and Y given Z*.

And, our notation for this will look like:

$$\text{d-sep}_G(X, Y | Z)$$

The intuition that we would like to demonstrate in this context is that this notion that influence can't flow corresponds, much more formally, to the rigorous notion of conditional independence in any distribution that factorizes over the graph.
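The definition can be turned into a small reachability check. The sketch below is one standard way to search for active trails (a breadth-first search over node and direction pairs), simplified to single nodes X and Y rather than sets. The function name `d_separated`, the `parents` dictionary format, and the edge list of the student network are assumptions made for illustration.

```python
from collections import deque

def d_separated(parents, x, y, z):
    """True if there is no active trail between nodes x and y given the set z.

    `parents` maps every node of the DAG to the list of its parents.
    """
    z = set(z)
    children = {n: [] for n in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    # Nodes that are in z or have a descendant in z; these activate v-structures.
    activators, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in activators:
            activators.add(n)
            stack.extend(parents[n])

    # "up": the trail reached the node from one of its children.
    # "down": the trail reached the node from one of its parents.
    queue, visited = deque([(x, "up")]), set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y and node not in z:
            return False  # found an active trail from x to y
        if direction == "up" and node not in z:
            queue.extend((p, "up") for p in parents[node])
            queue.extend((c, "down") for c in children[node])
        elif direction == "down":
            if node not in z:
                # Causal chain: keep flowing down through an unobserved node.
                queue.extend((c, "down") for c in children[node])
            if node in activators:
                # V-structure: active only if the node or a descendant is observed.
                queue.extend((p, "up") for p in parents[node])
    return True

# Assumed structure of the student network: D -> G <- I -> S, G -> L.
student = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

print(d_separated(student, "D", "S", set()))   # True
print(d_separated(student, "D", "S", {"G"}))   # False
print(d_separated(student, "D", "S", {"L"}))   # False
print(d_separated(student, "S", "G", {"I"}))   # True
```

Tracking the direction of arrival is what lets the search tell a v-structure (arrive from a parent, leave to a parent) apart from the other trail shapes, which are blocked rather than activated by observing the middle node.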

## 2.3 Factorization $\rightarrow$ Independence: BNs
So let's actually prove that this is in fact the case. The theorem that we'd like to state is that if P factorizes over the graph, and a d-separation property holds in the graph, so that X and Y are d-separated given Z (there are no active trails between them), then P satisfies the corresponding conditional independence statement: X is independent of Y given Z.

> **Theorem**: *If P factorizes over G, and $\text{d-sep}_G(X, Y | Z)$, then P satisfies $(X \perp Y | Z)$*

So d-separation implies independence if the probability distribution factorizes over G. We're going to prove the theorem by example, because the example really does illustrate the main points of the derivation.

The assumption is that we have our graph G, the student network, and that the distribution factorizes over it according to the chain rule for Bayesian networks. This is a factorization that we've seen before:

$$P(D, I, G, S, L) = P(D)P(I)P(G|I,D)P(S|I)P(L|G)$$

We'd like to prove that an independence statement follows from this assumption. The statement we'd like to prove is that D is independent of S:

$$P \models (D \perp S)$$

First, let's convince ourselves that D and S are in fact d-separated in this graph. There is only one possible trail between D and S: it goes from D down to G, up to I, and down to S. Since G is not observed in this case, and neither is its descendant L, the trail is not active. So the two nodes are d-separated, and we'd like to prove that this independence holds.