# 1. Bayesian Networks - Semantics and Factorization 

We are about to dig further into the actual representation behind a bayesian network and how it is constructed from a set of factors. We will continue with our example from section 1.

Recall that our example consisted of a student who is taking a class for a grade. In this example, we represented the following variables with their first letter:

```
Grade - G (A, B, C)
Course Difficulty - D (1/0)
Student Intelligence - I (1/0)
Student SAT - S (1/0)
Reference Letter - L (1/0)
```

And our joint distribution would be represented with: 

#### $$P(G, D, I, S, L)$$

And now, we can just ask ourselves:

> "What does the grade of the student depend on?"

Well, it makes sense that the grade would depend on the intelligence of the student and the difficulty of the class. 

<img src="images/bn-1.png" height="200" width="200">

This is already a small Bayesian network! We can then take the other random variables and introduce them into the mix. For example, the SAT score of the student doesn't seem to depend on the difficulty of the course or on the grade that the student gets in the course. The only thing it's likely to depend on in the context of this model is the intelligence of the student.

<img src="images/bn-2.png" height="300" width="300">

And finally, caricaturing the way in which instructors write recommendation letters, we're going to assume that the quality of the letter depends only on the student's grade.

<img src="images/bn-3.png" height="300" width="300">

Now, the above figure is a model of the dependencies. Keep in mind that it is not set in stone (it can change); it is just a representation of how we believe the world works.

The question that you may have is:

> *How do we get this to represent a probability distribution?*

Currently it is just a bunch of nodes stuck together with edges; how do we turn that into a clear probability distribution? Well, we are going to annotate each of the nodes in the network with a **conditional probability distribution**, or **CPD**. 

<img src="images/cpd-annotated.png">

Now, each of these is a CPD. So, we have 5 nodes and consequently 5 CPDs. Some of these CPDs are kind of degenerate. For example, the difficulty CPD isn't actually conditioned on anything. It's just an unconditional probability distribution that tells us that courses are 40% likely to be difficult and 60% likely to be easy. Intelligence in this case is also unconditioned.

#### $$P(D)$$
#### $$P(I)$$

Now this gets more interesting when you look at the actual conditional probability distributions. We can see the CPD that we've already seen before for the probability of grade given intelligence and difficulty, and we've already discussed how each of its rows necessarily sums to one, because each row is a probability distribution over the variable grade. And we have two other CPDs here: the probability of SAT given intelligence and the probability of letter given grade.

#### $$P(G \; | \; I, D)$$
#### $$P(L \; | \; G)$$
#### $$P(S \; | \; I)$$

And that now is a **fully parameterized Bayesian network** and what we'll show next  is how this Bayesian network produces a joint probability distribution over these  five variables. So, here are the CPDs:

<img src="images/cpd-prob.png">

And what we're going to define now is the **chain rule** for Bayesian networks. The chain rule basically takes these different CPDs and  multiplies them together:

#### $$P(D, I, G, S, L) = P(D)*P(I)*P(G|I, D)*P(S|I)*P(L |G)$$

Now, before we think of what that means, let us first note that this is actually a  factor product in exactly the same way that we just defined.  So here, we have five factors, they have overlapping scopes and what we  end up with is a factor product that gives us a big, big factor whose scope is  five variables. So what does that translate into when we  apply the chain rule for Bayesian networks in the context of the particular  example? Assume we are trying to compute the following:

#### $$P(d^0, i^1, g^3, s^1, l^1)$$

Well, reading the matching entry off each CPD in the figure above, in the order **Difficulty**, **Intelligence**, **Grade**, **SAT**, **Letter**, we end up with:

#### $$0.6 * 0.3 * 0.02 * 0.8 * 0.01$$
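The same lookup-and-multiply step can be sketched in code. This is a minimal illustration, not part of the original notes: the CPD numbers below are the commonly used student-network values and are an assumption here (only $P(d^1) = 0.4$ and $P(i^1) = 0.3$ are stated in the text), and the names `P_D`, `P_G`, `joint`, etc. are made up for the example.

```python
# Assumed CPD values: standard student-network numbers, not read from the figure.
P_D = {0: 0.6, 1: 0.4}                      # P(D)
P_I = {0: 0.7, 1: 0.3}                      # P(I)
P_G = {                                     # P(G | I, D), keyed by (i, d); 1=A, 2=B, 3=C
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.20, 1: 0.80}}  # P(S | I)
P_L = {1: {0: 0.10, 1: 0.90},                         # P(L | G)
       2: {0: 0.40, 1: 0.60},
       3: {0: 0.99, 1: 0.01}}


def joint(d, i, g, s, l):
    """Chain rule for the student network: one CPD entry per variable."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]


p = joint(d=0, i=1, g=3, s=1, l=1)
print(f"P(d0, i1, g3, s1, l1) = {p:.3e}")
```

Under these assumed values the product is about $2.9 \times 10^{-5}$: a single full assignment is typically very unlikely, even though all assignments together sum to 1.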

So, what does that give us as a definition? 

> **Bayesian Network:** *A Bayesian Network is a directed **acyclic** graph (DAG) $G$ whose nodes represent the random variables $X_1,...,X_n$*. For each node $X_i$ in the graph, we have a CPD $P(X_i|Par_G(X_i))$, which denotes the dependence of $X_i$ on its parents in the graph $G$.

The **BN** represents a joint probability distribution via the chain rule for Bayesian networks:

#### $$P(X_1,...,X_n) = \prod_i P(X_i|Par_G(X_i))$$

### 1.1 How do we know it is a legal Probability Distribution?
We need to show, first off, that it is greater than or equal to 0. In our case this is rather trivial, since $P$ is a product of CPDs, and CPDs are nonnegative.

We must then show that it sums to 1. 

#### $$\sum P = 1$$

To show that in the context of our previous example, we can sum up over all possible assignments:

#### $$\sum_{D, I, G, S, L}P(D, I, G, S, L) = \sum_{D, I, G, S, L}P(D)*P(I)*P(G|I, D)*P(S|I)*P(L |G)$$

Above, we broke this up via the chain rule, since that is how we defined our distribution. Now, the trick that we will use to solve this is to realize that each factor only involves a small subset of the variables. This allows us to push the summations in. We can start by pushing in the summation over L:

#### $$= \sum_{D, I, G, S}P(D)*P(I)*P(G|I, D)*P(S|I)* \sum _L P(L |G)$$

Keep in mind that $\sum _L P(L |G) = 1$. This is because no matter what the value of $G$ is, the probability of $L$ is well defined and will sum to one (look at the factor in the figure above). Another way to think about this is that we are summing up over the row of the CPD $P(L|G)$, and that means that the sum must be 1. This means that the term can be replaced with one, leaving us with:

#### $$= \sum_{D, I, G, S}P(D)*P(I)*P(G|I, D)*P(S|I)$$

Now, we can do exactly the same thing with $S$:

#### $$= \sum_{D, I, G}P(D)*P(I)*P(G|I, D)* \sum_S P(S|I)$$

This too is the sum over the row of a CPD, meaning it will evaluate to 1. We now have: 

#### $$= \sum_{D, I, G}P(D)*P(I)*P(G|I, D)$$

We can now do the same thing with G:

#### $$= \sum_{D, I}P(D)*P(I)* \sum_G P(G|I, D)$$

Which again yields 1. We can repeat the same process for both $D$ and $I$, since $\sum_D P(D) = 1$ and $\sum_I P(I) = 1$, leaving us finally with just 1.
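The argument can also be checked numerically by brute force. The sketch below assumes the commonly used student-network CPD values (an assumption; the figure's table is not reproduced in the text) and uses illustrative names such as `joint`, `total` and `rows_ok`.

```python
import math
from itertools import product

# Assumed CPD values: standard student-network numbers, not read from the figure.
P_D = {0: 0.6, 1: 0.4}
P_I = {0: 0.7, 1: 0.3}
P_G = {  # P(G | I, D), keyed by (i, d); grades 1=A, 2=B, 3=C
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.20, 1: 0.80}}
P_L = {1: {0: 0.10, 1: 0.90}, 2: {0: 0.40, 1: 0.60}, 3: {0: 0.99, 1: 0.01}}


def joint(d, i, g, s, l):
    """Chain rule for Bayesian networks applied to the student network."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]


# Every row of every conditional CPD sums to 1; this is the property that
# lets each inner summation collapse to 1 when it is pushed in.
rows_ok = all(
    math.isclose(sum(row.values()), 1.0)
    for table in (P_G, P_S, P_L)
    for row in table.values()
)

# Sum the joint over all 2 * 2 * 3 * 2 * 2 = 48 assignments.
total = sum(
    joint(d, i, g, s, l)
    for d, i, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1))
)
print(rows_ok, total)
```

The enumeration touches all 48 assignments, whereas the pushed-in summations only ever sum one CPD row at a time. That difference is what makes the factorized form useful for larger networks.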

### 1.2 Terminology 
We can now define more of the terminology that is going to accompany us further on.

**P Factorizes over G**<br>
* Let $G$ be a graph over $X_1,...,X_n$
* Then, $P$ factorizes over $G$ if:
#### $$P(X_1,...,X_n) = \prod_i P(X_i|Par_G(X_i))$$

In other words, a distribution $P$ factorizes over $G$ (we can represent it over the graph $G$) if we can encode it using the chain rule for Bayesian networks.

---

<br>
# 2. Reasoning Patterns
Now that we have the bayesian network defined, we can look at some of the reasoning patterns utilized. 

### 2.1 Causal Reasoning
If we go back to our student network example with the following CPD's:

<img src="images/cpd-annotated.png">

We can now look at some of the probabilities that we would get if we took the Bayesian network, produced the joint distribution using the chain rule for Bayesian networks, and computed the values of different marginal probabilities. For instance, we could ask for the probability of getting a strong letter.

<img src="images/bn-marg-1.png">

We won't go through the calculation (it is tedious), but the probability of getting a strong letter is ~0.5. However, we can do more interesting queries. We can, for instance, condition on one variable, and ask how that changes this probability. For example, say we condition on low intelligence and use red to denote the false value:

<img src="images/bn-marg-2.png">

#### $$P(l^1|i^0) \approx 0.39$$

The probability of a strong letter is now 0.39. It is not surprising that the probability goes down: as intelligence goes down, the probability of getting a good grade goes down, and so does the probability of getting a strong letter. This is an example of **causal reasoning**, because intuitively the reasoning goes in the causal direction, from top to bottom.

We can also make things more interesting. We can ask what happens if we make the difficulty of the course low.

<img src="images/bn-marg-3.png">

#### $$P(l^1|i^0, d^0) \approx 0.51$$
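The "tedious" calculation behind these three numbers is easy to hand to a computer. The sketch below answers conditional queries by enumerating the joint distribution. The CPD values are the commonly used student-network numbers and are an assumption (the figure is not reproduced in the text); `query` is a hypothetical helper written for this illustration, not a library function.

```python
from itertools import product

# Assumed CPD values: standard student-network numbers, not read from the figure.
P_D = {0: 0.6, 1: 0.4}
P_I = {0: 0.7, 1: 0.3}
P_G = {  # P(G | I, D), keyed by (i, d); grades 1=A, 2=B, 3=C
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.20, 1: 0.80}}
P_L = {1: {0: 0.10, 1: 0.90}, 2: {0: 0.40, 1: 0.60}, 3: {0: 0.99, 1: 0.01}}


def joint(d, i, g, s, l):
    """Chain rule for Bayesian networks applied to the student network."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]


def query(target, evidence=None):
    """P(target | evidence) by enumeration; both are dicts such as {"l": 1}."""
    evidence = evidence or {}
    num = den = 0.0
    for d, i, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1)):
        world = {"d": d, "i": i, "g": g, "s": s, "l": l}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(d, i, g, s, l)
            den += p  # probability of the evidence
            if all(world[k] == v for k, v in target.items()):
                num += p  # probability of target and evidence together
    return num / den


p_letter = query({"l": 1})
p_letter_low_i = query({"l": 1}, {"i": 0})
p_letter_low_i_easy = query({"l": 1}, {"i": 0, "d": 0})
print(f"{p_letter:.2f} {p_letter_low_i:.2f} {p_letter_low_i_easy:.2f}")
```

With the assumed table the three values come out to roughly 0.50, 0.39 and 0.51. Making the course easy pushes the probability of a strong letter back up even though intelligence is low.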

### 2.2 Evidential Reasoning
We can also perform **evidential reasoning**, which goes from the bottom to the top. 

<img src="images/bn-marg-4.png">

So, we can condition on the grade and ask what happens to the probability of the parents. Imagine that there is a student who takes the class and gets a C. Initially, the probability that the class was difficult was:

#### $$P(d^1) = 0.4$$

And the probability that the student was intelligent was:

#### $$P(i^1) = 0.3$$

But, now with this additional evidence, the probability that the student was intelligent goes down:

#### $$P(i^1|g^3) \approx 0.08$$

And the probability that the class was difficult goes up:

#### $$P(d^1 | g^3) \approx 0.63$$
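As a worked check of these two posteriors, Bayes' rule can be applied by hand. The conditional entries used below ($P(g^3|i,d)$ equal to 0.3, 0.7, 0.02 and 0.2 for $(i^0,d^0)$, $(i^0,d^1)$, $(i^1,d^0)$ and $(i^1,d^1)$) are the commonly used student-network values and are an assumption here, since the figure's table is not reproduced in the text.

#### $$P(g^3) = \sum_{i, d} P(i)P(d)P(g^3|i, d) = 0.126 + 0.196 + 0.0036 + 0.024 = 0.3496$$
#### $$P(i^1|g^3) = \frac{P(i^1)\sum_d P(d)P(g^3|i^1, d)}{P(g^3)} = \frac{0.3\,(0.6 \cdot 0.02 + 0.4 \cdot 0.2)}{0.3496} = \frac{0.0276}{0.3496} \approx 0.08$$
#### $$P(d^1|g^3) = \frac{P(d^1)\sum_i P(i)P(g^3|i, d^1)}{P(g^3)} = \frac{0.4\,(0.7 \cdot 0.7 + 0.3 \cdot 0.2)}{0.3496} = \frac{0.22}{0.3496} \approx 0.63$$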

### 2.3 Intercausal Reasoning
Now, there is another type of reasoning that is not quite as standard: **intercausal reasoning**. This is reasoning that is effectively the flow of information between two causes of a single effect. 

Let's go back to our situation where the student received a grade of C, $g^3$. 

<img src="images/bn-marg-5.png">

But now, we find out that this class really is difficult. So, we are going to condition on $d^1$. And now notice that the probability that the student is intelligent has gone up (from 0.08 to 0.11). 

<img src="images/bn-marg-6.png">

Now, that is not a huge increase. We will see as we play with more bayesian networks that the changes in probability are somewhat subtle. 

In another case, assume that the student gets a B. The probability of high intelligence still goes down:

<img src="images/bn-marg-7.png">

But now if we determine that the class is hard, the probability goes up (even higher than the original probability).

<img src="images/bn-marg-8.png">
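Both intercausal updates in this section can be checked by hand. Once the difficulty is observed, $P(d^1)$ appears in the numerator and the denominator and cancels, so only $P(I)$ and the $d^1$ rows of $P(G|I,D)$ are needed. The grade CPD entries below are the commonly used student-network values and are an assumption here.

#### $$P(i^1|g^3, d^1) = \frac{P(i^1)P(g^3|i^1, d^1)}{\sum_i P(i)P(g^3|i, d^1)} = \frac{0.3 \cdot 0.2}{0.3 \cdot 0.2 + 0.7 \cdot 0.7} = \frac{0.06}{0.55} \approx 0.11$$
#### $$P(i^1|g^2, d^1) = \frac{P(i^1)P(g^2|i^1, d^1)}{\sum_i P(i)P(g^2|i, d^1)} = \frac{0.3 \cdot 0.3}{0.3 \cdot 0.3 + 0.7 \cdot 0.25} = \frac{0.09}{0.265} \approx 0.34$$

Under these values the second posterior exceeds the prior $P(i^1) = 0.3$, which is the "even higher than the original probability" effect described for the grade of B.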

### 2.4 Intercausal Reasoning Explained
Let's drill into a particular example to determine how intercausal reasoning really works. Below, we can see the purest form of intercausal reasoning.

<img src="images/intercausal-1.png">

We have two random variables $X_1$ and $X_2$. We are going to assume that they are distributed uniformly (each has a 50% probability of being 1 and a 50% probability of being 0). And we have one effect, $Y$, which is simply the deterministic **OR** of those two parents. In general, when we have a deterministic variable we will denote it with double lines.

Now, what if we condition on the evidence $Y = 1$? Before we conditioned on this evidence, $X_1$ and $X_2$ were independent of each other. However, after the conditioning, one entry in our table is removed, and we have:

<img src="images/intercausal-2.png">

Where $X_1$ and $X_2$ are now dependent on each other. Why is that the case? Well, in the above probability distribution (which is conditioned on $Y = 1$), the probability that $X_1 = 1$ is:

#### $$P(X_1 = 1 \;|\; Y = 1) = \frac{2}{3}$$

And the same thing for $X_2 = 1$:

#### $$P(X_2 = 1 \;|\; Y = 1) = \frac{2}{3}$$

Now, if we condition on $X_1 = 1$, we will remove the second row:

<img src="images/intercausal-3.png">

And all of a sudden the probability of $X_2 = 1$ is back to 50%:

#### $$P(X_2 =1 \;|\; X_1 =1, Y = 1) = 0.5$$

The reason for this is that if we know that $Y = 1$, there are two things that could have made that true. Either $X_1 = 1$ or $X_2 = 1$. If we find out that $X_1 = 1$, we have completely explained what happened. This means that we want to go back to the way it was before (50/50), since there is nothing to suggest that it should be any other way. This particular situation is known as **explaining away**, and it is when one cause explains away a reason that made us suspect a different cause. 
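The three probabilities in this example can be reproduced with a few lines of code. This is a minimal sketch of the OR network described above (uniform, independent parents and a deterministic OR child); the helper name `prob` and the use of predicates are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

# Joint table over (x1, x2, y): uniform independent parents, y = x1 OR x2.
quarter = Fraction(1, 4)
table = {(x1, x2, x1 | x2): quarter for x1, x2 in product((0, 1), repeat=2)}


def prob(event, given=lambda x1, x2, y: True):
    """P(event | given), where both arguments are predicates over (x1, x2, y)."""
    den = sum(p for w, p in table.items() if given(*w))
    num = sum(p for w, p in table.items() if given(*w) and event(*w))
    return num / den


def x2_on(x1, x2, y):
    return x2 == 1


prior = prob(x2_on)                                             # no evidence
after_y = prob(x2_on, lambda x1, x2, y: y == 1)                 # observe Y = 1
after_y_x1 = prob(x2_on, lambda x1, x2, y: y == 1 and x1 == 1)  # also observe X1 = 1
print(prior, after_y, after_y_x1)
```

Observing $Y = 1$ raises the belief in $X_2 = 1$ from 1/2 to 2/3, and additionally observing $X_1 = 1$ brings it back to 1/2. This is the explaining-away pattern.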

### 2.5 Student Aces SAT
So, let's go back to our example and look at a reasoning pattern that involves even longer paths in the graph. Let's imagine that our student gets a C, but we have an additional piece of information that they also ace the SAT. 

<img src="images/student-ace.png">

When we only had the evidence regarding the grade, the probability of the student being intelligent was only 0.08. But now we have an additional piece of conflicting evidence, and the probability goes up to 0.58.

What is going to happen to difficulty? Now explaining away is in action again, going in a different direction. If the explanation is not that the student is not smart, then why did they get a bad grade? The reason is more likely that the class is very difficult.

<img src="images/student-ace-2.png">
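To see both effects at once, the posteriors over intelligence and difficulty can be computed with and without the SAT evidence. As before, this is a sketch under assumptions: the CPD values are the commonly used student-network numbers (the figure is not reproduced in the text), and `posterior` is a hypothetical helper that enumerates the joint distribution.

```python
from itertools import product

# Assumed CPD values: standard student-network numbers, not read from the figure.
P_D = {0: 0.6, 1: 0.4}
P_I = {0: 0.7, 1: 0.3}
P_G = {  # P(G | I, D), keyed by (i, d); grades 1=A, 2=B, 3=C
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.20, 1: 0.80}}
P_L = {1: {0: 0.10, 1: 0.90}, 2: {0: 0.40, 1: 0.60}, 3: {0: 0.99, 1: 0.01}}


def posterior(var, value, evidence):
    """P(var = value | evidence) by summing the chain-rule joint over all worlds."""
    num = den = 0.0
    for d, i, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1)):
        world = {"d": d, "i": i, "g": g, "s": s, "l": l}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world contradicts the evidence
        p = P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]
        den += p
        if world[var] == value:
            num += p
    return num / den


i_given_c = posterior("i", 1, {"g": 3})
i_given_c_sat = posterior("i", 1, {"g": 3, "s": 1})
d_given_c = posterior("d", 1, {"g": 3})
d_given_c_sat = posterior("d", 1, {"g": 3, "s": 1})
print(f"{i_given_c:.2f} -> {i_given_c_sat:.2f}, {d_given_c:.2f} -> {d_given_c_sat:.2f}")
```

With the assumed table, the belief in high intelligence moves from about 0.08 to about 0.58, and the belief in a difficult class rises from about 0.63 to about 0.76. Influence here travels along the longer path from SAT through intelligence and grade to difficulty.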


---

<br>
# 3. Flow of Probabilistic Influence
We have now seen reasoning patterns where intuitively probabilistic influence starts in one node, and flows through the graph to another node. This may seem somewhat "hand wavy", but in reality this is exactly what is going on in a bayesian network. 

So, we are going to try and understand exactly when one variable $X$ can influence a variable $Y$. 

> * If X is connected to Y (X is parent of Y), then X can influence Y:
$$X \rightarrow Y$$
Ex. think of the case where intelligence is the parent of grade. If we know a student's intelligence, we gain knowledge about what we may expect their grade to be.
* If Y is a parent of X (X is a child of Y), then, as we saw with evidential reasoning, observing X can change the probability distribution of Y:
$$X \leftarrow Y$$
Ex. now think of the case where we know the student's grade but not their intelligence. Knowing the grade lets us make predictions about what their intelligence may be.
* More interesting are the cases where we have indirect influence between X and Y. Let's consider a case where we have an intervening variable $W$, and ask can X influence Y via W? 
$$X \rightarrow W \rightarrow Y$$ 
This would be a causal chain going from difficulty to letter via grade. 

<img src="images/flow.png">

> * The other arrangement can occur as well, where we go the evidential route: 
$$X \leftarrow W \leftarrow Y$$
In this case letter would influence grade which would influence difficulty 
* We then have the case where there is a common cause $W$, that has two effects, $X$ and $Y$: 
$$X \leftarrow W \rightarrow Y$$
Again, it seems to make sense that if we observe the value of the SAT, then that changes our beliefs in the student's intelligence and subsequently our probability distribution over their grade.
* The last and most interesting case, is the case of two causes that have a joint effect: 
$$X \rightarrow W \leftarrow Y$$
This is referred to as a **V-structure** for obvious reasons. In this case, imagine that we know that a student took a class, and that the class is difficult. Does that tell us anything about the student's intelligence? No! So this is the one case where probabilistic influence does not flow. 

## 3.1 Active Trails
We can now define this notion of active trail in the context of no evidence. A **trail** is a sequence of nodes that are connected to each other by single edges in the graph:

#### $$X_1 - ...- X_k $$

The fact that these edges are undirected means that they can go in either direction. So, we have seen that influence can flow from one variable to another variable in the graph. The only thing that can block an active trail is a v-structure, because that is the one case where no influence flows! So, a trail is active if it has **no** v-structures:

#### $$X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$$

With all said and done we have the following different cases that can occur:

|Case|
|---|
|X $\rightarrow$ Y |
|X $\leftarrow$ Y|
|X $\rightarrow$ W $\rightarrow$ Y|
|X $\leftarrow$ W $\leftarrow$ Y|
|X $\leftarrow$ W $\rightarrow$ Y|
|X $\rightarrow$ W $\leftarrow$ Y|


## 3.2 When can X influence Y given evidence about Z?
Now let's look at a more interesting case. Now we have some set of observations which we are going to denote as a set of variables **Z**. The question is:
> *When can X influence Y given evidence about Z?*

So, the first two cases are rather straightforward. If X is directly connected to Y in either the causal or the evidential direction, then telling me something about one of them can change my beliefs about the other.

|Case|Can X influence Y given evidence about Z?|
|---|---|
|X $\rightarrow$ Y |<span style="color: blue">**Yes**</span>|
|X $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|

Now let's look at the other 4 cases, which in general are more interesting to us. We are looking at *when can X influence Y via intervening node W?*

#### $$X \rightarrow W \rightarrow Y $$
#### $$X \leftarrow W \leftarrow Y $$
#### $$X \leftarrow W \rightarrow Y $$
#### $$X \rightarrow W \leftarrow Y $$

There are really two cases here; that where $W$ is in the evidence set $Z$, $W \in Z$, and then when $W$ is not in the evidence set, $W \not\in Z$.

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|?|?|
|X $\leftarrow$ W $\leftarrow$ Y|?|?|
|X $\leftarrow$ W $\rightarrow$ Y|?|?|
|X $\rightarrow$ W $\leftarrow$ Y|?|?|

We can start with the scenario where $W$ is not in the evidence set $Z$. In this case we **do not get to observe W**. We are asking *can X influence Y via W*? In other words, *can difficulty influence letter via grade, if grade is not observed?* For our first three cases we have the same behavior as before: the intermediate variable through which the influence flows is not observed, so there is no reason why observing X could not still change our beliefs about Y. 

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\rightarrow$ W $\leftarrow$ Y|?|?|

Let's now contrast these three cases to those where $W$ is observed, i.e. where $W$ is evidence. Let's use the following situation as an example:

<img src="images/flow-1.png">

Here we see a trail where difficulty influences the letter via grade. Note that this is not an edge in the Bayesian network; it just demonstrates the flow of influence. 

So now the question is: we know that observing difficulty can change our distribution over the letter, but what if we know (observe) the grade? For instance, we know the student received an A in the class. Then, we are told that the class is very hard. Does that change the probability distribution of the letter? No! We already know that the student got an A, and the letter only depends on the grade. So in this case, influence cannot flow through grade if grade is observed. 

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\rightarrow$ W $\leftarrow$ Y|?|?|

And what about the evidential case? Well we have already spoken about how probabilistic influence is symmetrical, hence letter cannot influence difficulty when grade is observed. 

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|?|
|X $\rightarrow$ W $\leftarrow$ Y|?|?|

The third case is the situation where we have a common cause that has two effects. That may look like the SAT changing our beliefs in grade, via intelligence.

<img src="images/flow-2.png">

However, if we are told that the student is intelligent, then there is no way for the SAT to change our probability distribution over grade. This reflects that grade and SAT are **conditionally independent** given intelligence. If we don't observe Intelligence, then Grade and SAT are **dependent**, because observing Grade gives us some information about Intelligence and therefore about SAT, and vice versa. However, if we have already observed Intelligence, then observing Grade can't affect SAT and vice versa, so they are conditionally independent.
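This conditional independence can also be read directly off the chain rule for the student network. The short derivation below uses only the factorization $P(D)P(I)P(G|I,D)P(S|I)P(L|G)$ and the fact that each CPD row sums to 1:

#### $$P(S, G \;|\; I) = \frac{1}{P(I)}\sum_{D, L} P(D)P(I)P(G|I, D)P(S|I)P(L|G)$$
#### $$= P(S|I) \sum_{D} P(D)P(G|I, D) \sum_L P(L|G) = P(S|I) \sum_{D} P(D)P(G|I, D)$$
#### $$= P(S|I)P(G|I)$$

The last step uses $P(G|I) = \sum_D P(D)P(G|I, D)$, which follows from summing the same factorization over $D$, $S$ and $L$ and dividing by $P(I)$. Since the conditional joint splits into a product of a term in $S$ and a term in $G$, the two variables are independent once $I$ is known.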

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\rightarrow$ W $\leftarrow$ Y|?|?|

Now we can talk about the last (and most interesting) case, the one where we have a **V-structure**. This case is represented by the example: "can difficulty influence intelligence via grade?"

<img src="images/flow-3.png">

Now, if grade is observed, then we have the exact case of **intercausal reasoning**. So, if $W \in Z$ then we are in the case where influence can flow!

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\rightarrow$ W $\leftarrow$ Y|?|<span style="color: blue">**Yes**</span>|

Now, we have one tricky scenario left, which is where $W \not\in Z$. In this case the naive thing to say is that if W is not in Z then it is the same case as before and influence cannot flow. However, this is not quite right. For instance, what happens if we do *not* observe grade, but we *do* observe letter? We don't observe the grade directly, but we do observe something that gives a strong indication of what value the grade took. In this case we are given evidence that needs to be explained, and we can explain it via difficulty or via intelligence. At that point we have established a connection/correlation between difficulty and intelligence, so that observing one does influence the other. So, X cannot influence Y if neither W nor any of its descendants is observed (in Z).

|Case|W $\not\in$ Z|W $\in$ Z|
|---|---|---|
|X $\rightarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\leftarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\leftarrow$ W $\rightarrow$ Y|<span style="color: blue">**Yes**</span>|<span style="color: red">**No**</span>|
|X $\rightarrow$ W $\leftarrow$ Y|<span style="color: red">**No, if W and all of its descendants are not observed (in Z)**</span>|<span style="color: blue">**Yes**</span>|

We have now created a taxonomy of when influence can flow through an intervening variable. We can now take that and define an overall model of general influence. For example, when can influence flow from S through I, through G, into D:

#### $$S - I - G -D$$

<img src="images/flow-4.png">

Let's look at a few cases! 

**I is observed**<br>
If I is observed, then it blocks the trail. If it blocks the trail then there is no more opportunity for it to flow! So, in this case influence cannot flow.

**I is not observed, and nothing else is observed**<br>
Well, you can climb up through I, but you fall down when you get to Grade, and cannot get back up. So, in this case influence cannot flow.

**I is not observed, G is observed**<br>
Well, in this case, you can climb up to I, go down to grade, and then go back up the hill into difficulty. 

We can think of it as a flow of water, except that different nodes behave differently in terms of the valve structure. In the case of a branching structure (two outward arrows), observing a variable (intelligence) closes the valve and prevents the flow of water. However, in the case of a **v-structure**, observing the variable (grade) actually opens the valve and lets the water climb back up. 

## 3.3 Active Trails Definition 
> *A trail $X_1 - ... - X_k$ is active given $Z$ if:*
* For any v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$, $X_i$ or one of its descendants is in $Z$ (**active v-structure**).
* No other $X_i$ is in $Z$.
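The two conditions translate almost directly into code. The sketch below checks them for a trail in the student network. The `PARENTS` map encodes the edges described in the text (D and I into G, I into S, G into L), while the function names are illustrative. It assumes that consecutive nodes on the trail really are joined by an edge and that only intermediate nodes need checking.

```python
# Student network as a child -> parents map.
PARENTS = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}


def descendants(node):
    """All nodes reachable from `node` by following edges downward."""
    children = [c for c, parents in PARENTS.items() if node in parents]
    found = set(children)
    for child in children:
        found |= descendants(child)
    return found


def is_active(trail, observed=()):
    """True if influence can flow along `trail` (a list of node names) given `observed`."""
    observed = set(observed)
    for prev, mid, nxt in zip(trail, trail[1:], trail[2:]):
        is_v_structure = prev in PARENTS[mid] and nxt in PARENTS[mid]
        if is_v_structure:
            # A v-structure is open only if its middle node or a descendant is observed.
            if not ({mid} | descendants(mid)) & observed:
                return False
        elif mid in observed:
            # Any other intermediate node blocks the trail once it is observed.
            return False
    return True


trail = ["S", "I", "G", "D"]
nothing_observed = is_active(trail)
i_observed = is_active(trail, {"I"})
g_observed = is_active(trail, {"G"})
l_observed = is_active(trail, {"L"})
print(nothing_observed, i_observed, g_observed, l_observed)
```

For the trail $S - I - G - D$ this reproduces the three cases discussed above: blocked when nothing is observed, blocked when I is observed, and active when G is observed. Observing only the letter also activates the trail, because L is a descendant of G.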