# Bayesian Models
We now dig further into a specific type of **Probabilistic Graphical Model**: **Bayesian Networks**. We will discuss the following:
1. What Bayesian models are
2. Independencies in Bayesian networks
3. How a Bayesian model encodes the joint distribution
4. How to do inference with Bayesian models

---

## 1. What are Bayesian Models? 
A Bayesian Network is a probabilistic graphical model (a type of statistical model) that represents a set of **random variables** and their **conditional dependencies** via a **directed acyclic graph** (DAG). Bayesian networks are often used when we want to represent *causal relationships* between the random variables. They are parameterized using **Conditional Probability Distributions** (CPDs). Each node in the network is parameterized using:

#### $$P(node|Pa(node))$$
where $Pa(node)$ represents the parents of the node in the network. We can dig into this further by looking at the following student model:

<img src="images/student_full_param.png">

If we use the library **pgmpy**, we create the above model as follows:
> 1. Define the network structure (or learn it from data)
2. Define the CPDs for the nodes (random variables)
3. Associate the CPDs with the structure

We can see this implemented below.

### 1.1 Implementation

In [20]:
# Imports needed from pgmpy
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

### 1.1.1 Set the Structure
So, with our imports taken care of, we start by defining the model structure. We define it by passing in a list of edges. Note that these edges are *directional*; for example, the tuple `('difficulty', 'grade')` means that `difficulty` influences `grade`.

In [21]:
student_model = BayesianModel([('difficulty', 'grade'), 
                       ('intelligence', 'grade'), 
                       ('grade', 'letter'), 
                       ('intelligence', 'sat')])

### 1.1.2 Setup the relationships (CPDs)
We then want to set up our relationships in the form of CPDs. A few things to note:
> 1. `variable_card`: the number of discrete states that the random variable can take on.
2. `evidence`: the parents of the random variable, i.e. $Pa(node)$.


In [22]:
difficulty_cpd = TabularCPD(variable='difficulty',
                       variable_card=2,
                       values=[[0.6, 0.4]])

In [23]:
intelligence_cpd = TabularCPD(variable='intelligence',
                              variable_card=2,
                              values=[[0.7, 0.3]])

In [24]:
grade_cpd = TabularCPD(variable='grade', 
                       variable_card=3, 
                       values=[[0.3, 0.05, 0.9,  0.5],
                               [0.4, 0.25, 0.08, 0.3],
                               [0.3, 0.7,  0.02, 0.2]],
                      evidence=['intelligence', 'difficulty'],
                      evidence_card=[2, 2])
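
The column order of `values` is easy to get wrong. Judging by the table printed for this CPD later in this section, the columns enumerate the evidence states with the first evidence variable varying slowest. The following is a minimal pure-Python sketch of that mapping; it makes no pgmpy calls, and the names `columns` and `label` are only illustrative.

```python
from itertools import product

evidence = ["intelligence", "difficulty"]
evidence_card = [2, 2]
values = [[0.3, 0.05, 0.9, 0.5],    # grade_0
          [0.4, 0.25, 0.08, 0.3],   # grade_1
          [0.3, 0.7, 0.02, 0.2]]    # grade_2

# Column k corresponds to the k-th combination of evidence states,
# with the last evidence variable changing fastest.
columns = list(product(*(range(card) for card in evidence_card)))

for col, states in enumerate(columns):
    label = ", ".join(f"{name}_{state}" for name, state in zip(evidence, states))
    distribution = [row[col] for row in values]
    print(f"P(grade | {label}) = {distribution}")
```

Reading a column top to bottom gives one full distribution over `grade`, which is why every column has to sum to 1.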

In [25]:
letter_cpd = TabularCPD(variable='letter', variable_card=2, 
                   values=[[0.1, 0.4, 0.99],
                           [0.9, 0.6, 0.01]],
                   evidence=['grade'],
                   evidence_card=[3])

In [26]:
sat_cpd = TabularCPD(variable='sat', variable_card=2,
                   values=[[0.95, 0.2],
                           [0.05, 0.8]],
                   evidence=['intelligence'],
                   evidence_card=[2])

### 1.1.3 Add the relationships (CPDs) to the Model
The next step is to add our CPDs to our model. The way in which pgmpy specifies models is highly modular, which is great because it allows us to add and remove CPDs very easily.

In [27]:
student_model.add_cpds(difficulty_cpd, intelligence_cpd, grade_cpd, letter_cpd, sat_cpd)

At this point we can check our model: `check_model` checks the network structure and verifies that the CPDs are correctly defined and sum to 1.

In [28]:
student_model.check_model()

True
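
As a rough illustration of the sum-to-1 part of this check, here is a plain-Python sketch. It is not how `check_model` is implemented; `columns_sum_to_one` is a hypothetical helper, and the root CPDs are written as a single column for uniformity.

```python
import math

def columns_sum_to_one(values, tol=1e-9):
    """Return True if every column (one parent configuration) sums to 1."""
    return all(math.isclose(sum(column), 1.0, abs_tol=tol)
               for column in zip(*values))

# Rows are states of the variable, columns are parent configurations.
cpd_tables = {
    "difficulty": [[0.6], [0.4]],
    "intelligence": [[0.7], [0.3]],
    "grade": [[0.3, 0.05, 0.9, 0.5],
              [0.4, 0.25, 0.08, 0.3],
              [0.3, 0.7, 0.02, 0.2]],
    "letter": [[0.1, 0.4, 0.99],
               [0.9, 0.6, 0.01]],
    "sat": [[0.95, 0.2],
            [0.05, 0.8]],
}

results = {name: columns_sum_to_one(table) for name, table in cpd_tables.items()}
print(results)
```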

### 1.1.4 Examine the Structure of the Graph
We can see our model with the respective CPDs incorporated:

In [29]:
student_model.get_cpds()

[<TabularCPD representing P(difficulty:2) at 0x110c828d0>,
 <TabularCPD representing P(intelligence:2) at 0x115c1e0b8>,
 <TabularCPD representing P(grade:3 | intelligence:2, difficulty:2) at 0x115c1e5c0>,
 <TabularCPD representing P(letter:2 | grade:3) at 0x115c1e438>,
 <TabularCPD representing P(sat:2 | intelligence:2) at 0x115c1e4e0>]

And we can examine specific nodes to ensure that the corresponding distributions are correct. 

In [30]:
print(student_model.get_cpds('difficulty'))

╒══════════════╤═════╕
│ difficulty_0 │ 0.6 │
├──────────────┼─────┤
│ difficulty_1 │ 0.4 │
╘══════════════╧═════╛


In [31]:
print(student_model.get_cpds('intelligence'))

╒════════════════╤═════╕
│ intelligence_0 │ 0.7 │
├────────────────┼─────┤
│ intelligence_1 │ 0.3 │
╘════════════════╧═════╛


In [32]:
print(student_model.get_cpds('grade'))

╒══════════════╤════════════════╤════════════════╤════════════════╤════════════════╕
│ intelligence │ intelligence_0 │ intelligence_0 │ intelligence_1 │ intelligence_1 │
├──────────────┼────────────────┼────────────────┼────────────────┼────────────────┤
│ difficulty   │ difficulty_0   │ difficulty_1   │ difficulty_0   │ difficulty_1   │
├──────────────┼────────────────┼────────────────┼────────────────┼────────────────┤
│ grade_0      │ 0.3            │ 0.05           │ 0.9            │ 0.5            │
├──────────────┼────────────────┼────────────────┼────────────────┼────────────────┤
│ grade_1      │ 0.4            │ 0.25           │ 0.08           │ 0.3            │
├──────────────┼────────────────┼────────────────┼────────────────┼────────────────┤
│ grade_2      │ 0.3            │ 0.7            │ 0.02           │ 0.2            │
╘══════════════╧════════════════╧════════════════╧════════════════╧════════════════╛


---

## 2. Independencies in Bayesian Networks 
Independencies implied by the structure of our Bayesian network can be categorized into 2 types:
> 1. **Local Independencies:** Every variable in the network is independent of its non-descendants given its parents. Mathematically this can be written as:<br>
<br>
$$X \perp NonDesc(X)|Pa(X)$$
where $NonDesc(X)$ is the set of variables that are not descendants of $X$ and $Pa(X)$ is the set of variables that are parents of $X$. 
2. **Global Independencies:** To discuss global independencies in Bayesian networks we need to look at the various possible network structures. Starting with the case of 2 nodes, there are only 2 possible ways for them to be connected:

<img src="images/two_nodes.png">

In the above two cases it is clear that a change in either node will affect the other. For the first case we can take the example of $difficulty \rightarrow grade$. If we increase the difficulty of the course, the probability of getting a higher grade decreases. For the second case we can take the example of $ SAT \leftarrow Intel $. If we observe a good SAT score, it becomes more likely that the student is intelligent, which increases the probability of $ i_1 $. Therefore in both cases a change in one variable leads to a change in the other.

Now, there are four possible ways of connection between 3 nodes:

<img src="images/three_nodes.png">


Now in the above cases we will see the flow of influence from $ A $ to $ C $ under various cases.

1. **Causal**: In the general case, a change in $ A $ has an effect on $ B $ (as discussed above), and this change in $ B $ in turn changes $ C $. The other case is when $ B $ is observed, i.e. we know the value of $ B $. Then a change in $ A $ cannot change $ B $, since its value is already known, and hence there is no change in $ C $, which depends only on $ B $. Mathematically we can say that: 
$$ (A \perp C | B) $$
2. **Evidential**: Similarly, in this case observing $ B $ renders $ C $ independent of $ A $. When $ B $ is not observed, influence flows from $ A $ to $ C $. Hence:
$$ (A \perp C | B) $$
3. **Common Cause**: Influence flows from $ A $ to $ C $ when $ B $ is not observed. But when $ B $ is observed, a change in $ A $ doesn't affect $ C $, since $ C $ depends only on $ B $. Hence here also:
$$ ( A \perp C | B) $$
4. **Common Evidence**: This case is a bit different from the others. When $ B $ is not observed, a change in $ A $ leads to a change in $ B $ but not in $ C $. Let's take the example of $ D \rightarrow G \leftarrow I $ (so $ A = D $, $ B = G $, $ C = I $). If we increase the difficulty of the course, the probability of getting a higher grade goes down, but this has no effect on the intelligence of the student. Now suppose $ B $ is observed: say we know that the student got a good grade. If the course then turns out to be difficult, this increases the probability that the student is intelligent, since we already know that they got a good grade. Hence in this case 
$$ (A \perp C) $$ 
and 
$$ ( A \not\perp C | B) $$
This structure is also commonly known as a **V-structure**. 
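
The V-structure behaviour can be checked numerically on $ D \rightarrow G \leftarrow I $ by brute-force enumeration. The sketch below uses plain Python with the CPD values of the student model; `prob_intelligent` is a hypothetical helper, and treating state 0 of `grade` as the good grade is an assumption suggested by the CPD values.

```python
P_D = [0.6, 0.4]                      # P(difficulty)
P_I = [0.7, 0.3]                      # P(intelligence)
P_G = {(0, 0): [0.3, 0.4, 0.3],       # P(grade | intelligence, difficulty),
       (0, 1): [0.05, 0.25, 0.7],     # keyed by (intelligence, difficulty)
       (1, 0): [0.9, 0.08, 0.02],
       (1, 1): [0.5, 0.3, 0.2]}

def joint(d, i, g):
    """P(D=d, I=i, G=g) from the chain rule of the network."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g]

def prob_intelligent(g=None, d=None):
    """P(I=1 | evidence), where g and d are optional observed states."""
    def total(i):
        return sum(joint(dd, i, gg)
                   for dd in range(2) for gg in range(3)
                   if (d is None or dd == d) and (g is None or gg == g))
    return total(1) / (total(0) + total(1))

print(f"P(I=1)            = {prob_intelligent():.3f}")
print(f"P(I=1 | D=1)      = {prob_intelligent(d=1):.3f}")
print(f"P(I=1 | G=0)      = {prob_intelligent(g=0):.3f}")
print(f"P(I=1 | G=0, D=1) = {prob_intelligent(g=0, d=1):.3f}")
```

Without the grade, observing the difficulty leaves the probability of intelligence at 0.3. Once the good grade is known, it rises to about 0.613, and learning that the course was also difficult raises it further to about 0.811.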

We can see this in greater detail by utilizing pgmpy.

### 2.1 Find Local Independencies
We can look at the independencies for specific nodes.

In [33]:
student_model.local_independencies('difficulty')

(difficulty _|_ letter, grade, intelligence, sat)

In [34]:
student_model.local_independencies('grade')

(grade _|_ letter, sat | difficulty, intelligence)

In [35]:
student_model.local_independencies(['difficulty', 'intelligence', 'sat', 'grade', 'letter'])

(difficulty _|_ letter, grade, intelligence, sat)
(intelligence _|_ letter, difficulty, grade, sat)
(sat _|_ letter, difficulty, grade | intelligence)
(grade _|_ letter, sat | difficulty, intelligence)
(letter _|_ difficulty, intelligence, sat | grade)
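
For comparison, the definition $X \perp NonDesc(X)|Pa(X)$ can be applied by hand to the edge list. This is a minimal sketch in plain Python, not pgmpy's implementation; `descendants` and `local_independence` are hypothetical helpers. Because it removes descendants, it leaves out `grade` and `letter` for `difficulty`, so its result can differ from the listing printed by a particular pgmpy release.

```python
edges = [("difficulty", "grade"), ("intelligence", "grade"),
         ("grade", "letter"), ("intelligence", "sat")]

nodes = sorted({n for edge in edges for n in edge})
parents = {n: {u for u, v in edges if v == n} for n in nodes}
children = {n: {v for u, v in edges if u == n} for n in nodes}

def descendants(node):
    """All nodes reachable from `node` by following edges forward."""
    found, stack = set(), list(children[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found

def local_independence(node):
    """Return (non-descendants excluding parents, parents) for `node`."""
    independent = set(nodes) - {node} - descendants(node) - parents[node]
    return independent, parents[node]

for node in nodes:
    independent, given = local_independence(node)
    print(node, "_|_", sorted(independent), "|", sorted(given))
```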

In [36]:
student_model.get_independencies()

(difficulty _|_ intelligence, sat)
(difficulty _|_ letter | grade)
(difficulty _|_ sat | intelligence)
(difficulty _|_ intelligence | sat)
(difficulty _|_ sat | letter, intelligence)
(difficulty _|_ letter, sat | grade, intelligence)
(difficulty _|_ letter | grade, sat)
(difficulty _|_ sat | letter, grade, intelligence)
(difficulty _|_ letter | grade, intelligence, sat)
(grade _|_ sat | intelligence)
(grade _|_ sat | letter, intelligence)
(grade _|_ sat | difficulty, intelligence)
(grade _|_ sat | letter, difficulty, intelligence)
(intelligence _|_ difficulty)
(intelligence _|_ letter | grade)
(intelligence _|_ difficulty | sat)
(intelligence _|_ letter | difficulty, grade)
(intelligence _|_ letter | grade, sat)
(intelligence _|_ letter | difficulty, grade, sat)
(letter _|_ difficulty, intelligence, sat | grade)
(letter _|_ sat | intelligence)
(letter _|_ intelligence, sat | difficulty, grade)
(letter _|_ sat | difficulty, intelligence)
(letter _|_ difficulty, sat | grade, intelligence)

### 2.2 Find Active Trail Nodes
We can also look for **active trail nodes**. We can think of active trails as paths of influence: which variables can give us information about another variable?

In [37]:
student_model.active_trail_nodes('difficulty')

{'difficulty': {'difficulty', 'grade', 'letter'}}

In [38]:
student_model.active_trail_nodes('grade')

{'grade': {'difficulty', 'grade', 'intelligence', 'letter', 'sat'}}

Notice that for `grade` every node is returned. With nothing observed, every other variable has an active trail to `grade`, so each of them provides information about it.

We can also see how the active trails from `difficulty` change when we observe `grade`.

In [39]:
student_model.active_trail_nodes('difficulty')

{'difficulty': {'difficulty', 'grade', 'letter'}}

In [40]:
student_model.active_trail_nodes('difficulty', observed='grade')

{'difficulty': {'difficulty', 'intelligence', 'sat'}}
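
To see where these sets come from, here is a hedged sketch of a standard d-separation reachability search, written in plain Python. It is not pgmpy's code: this `active_trail_nodes` is a standalone function that takes the edge list and a list of observed node names. It applies the four three-node rules from above while walking the graph.

```python
edges = [("difficulty", "grade"), ("intelligence", "grade"),
         ("grade", "letter"), ("intelligence", "sat")]

def active_trail_nodes(edges, start, observed=()):
    """Nodes reachable from `start` via an active trail given `observed`."""
    observed = set(observed)
    nodes = {n for edge in edges for n in edge}
    parents = {n: [u for u, v in edges if v == n] for n in nodes}
    children = {n: [v for u, v in edges if u == n] for n in nodes}

    # Observed nodes plus their ancestors: a V-structure only lets
    # influence through when its middle node is in this set.
    ancestors, stack = set(), list(observed)
    while stack:
        node = stack.pop()
        if node not in ancestors:
            ancestors.add(node)
            stack.extend(parents[node])

    # "up" = the trail reached the node from one of its children,
    # "down" = the trail reached the node from one of its parents.
    reachable, visited = set(), set()
    stack = [(start, "up")]
    while stack:
        node, direction = stack.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            stack.extend((p, "up") for p in parents[node])
            stack.extend((c, "down") for c in children[node])
        elif direction == "down":
            if node not in observed:        # causal chain continues
                stack.extend((c, "down") for c in children[node])
            if node in ancestors:           # V-structure is activated
                stack.extend((p, "up") for p in parents[node])
    return reachable

print(sorted(active_trail_nodes(edges, "difficulty")))
print(sorted(active_trail_nodes(edges, "difficulty", observed=["grade"])))
```

Observing `grade` blocks the causal trail to `letter` but activates the V-structure at `grade`, which is why `intelligence` and `sat` become reachable.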

---

## 3. Inference in Bayesian Models
Until now we have discussed only how to represent Bayesian networks. Now let's see how we can do inference in a Bayesian model and use it to predict values for new data points in machine learning tasks. In this section we assume that we already have our model (structure and parameters).

In inference we answer probability queries over the network given the values of some other variables. For example, we might want to know the probable grade of an intelligent student in a difficult class, given that they scored well on the SAT. To compute this from the joint distribution, we first reduce over the given variables, that is:
#### $$ I = 1, D = 1, S = 1 $$ 
and then marginalize over the other variables that is 
#### $$ L $$ 
to get 
#### $$ P(G | I=1, D=1, S=1) $$
But carrying out marginalize and reduce operations on the complete joint distribution is computationally expensive, since we need to iterate over the whole table for each operation and the size of the table is exponential in the number of variables. In graphical models we exploit the independencies to break these operations into smaller parts, which makes them much faster.
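
The reduce-then-marginalize procedure can be written out by brute force on the full joint table. This is a plain-Python sketch using the CPD values of the student model; the table layout and variable names are illustrative choices, not pgmpy's.

```python
from itertools import product

P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}   # key: (i, d)
P_L = [[0.1, 0.9], [0.4, 0.6], [0.99, 0.01]]                 # P_L[g][l]
P_S = [[0.95, 0.05], [0.2, 0.8]]                             # P_S[i][s]

# Full joint table over (d, i, g, l, s): 2 * 2 * 3 * 2 * 2 = 48 entries.
table = {(d, i, g, l, s): P_D[d] * P_I[i] * P_G[(i, d)][g] * P_L[g][l] * P_S[i][s]
         for d, i, g, l, s in product(range(2), range(2), range(3), range(2), range(2))}

# Reduce: keep only the rows consistent with I=1, D=1, S=1.
reduced = {key: p for key, p in table.items()
           if key[0] == 1 and key[1] == 1 and key[4] == 1}

# Marginalize over L, then normalize to get P(G | I=1, D=1, S=1).
unnormalized = [sum(p for key, p in reduced.items() if key[2] == g) for g in range(3)]
posterior = [u / sum(unnormalized) for u in unnormalized]

print(len(table), [round(p, 4) for p in posterior])
```

Even in this small network every operation touches a 48-entry table, and each additional binary variable doubles that size.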

One of the very basic methods of inference in Graphical Models is **Variable Elimination**.

### 3.1 Variable Elimination
We know that:

#### $$ P(D, I, G, L, S) = P(L|G) * P(S|I) * P(G|D, I) * P(D) * P(I) $$

Now let's say we just want to compute the probability of G. For that we will need to marginalize over all the other variables.

#### $$ P(G) = \sum_{D, I, L, S} P(D, I, G, L, S) $$ 
#### $$ P(G) = \sum_{D, I, L, S} P(L|G) * P(S|I) * P(G|D, I) * P(D) * P(I) $$
#### $$ P(G) = \sum_D \sum_I \sum_L \sum_S P(L|G) * P(S|I) * P(G|D, I) * P(D) * P(I) $$

Now since not all the conditional distributions depend on all the variables we can push the summations inside:

#### $$ P(G) = \sum_D P(D) \sum_I P(G|D, I) * P(I) \sum_S P(S|I) \sum_L P(L|G) $$

So, by pushing the summations inside we have saved a lot of computation because we have to now iterate over much smaller tables.
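
The nested sums can be evaluated directly, one small table at a time. The following plain-Python sketch follows the last equation term by term; `marginal_grade` is a hypothetical helper and no pgmpy calls are involved.

```python
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}   # key: (i, d)
P_L = [[0.1, 0.9], [0.4, 0.6], [0.99, 0.01]]                 # P_L[g][l]
P_S = [[0.95, 0.05], [0.2, 0.8]]                             # P_S[i][s]

def marginal_grade(g):
    """P(G=g) with each summation pushed inside as far as it goes."""
    sum_l = sum(P_L[g])                     # sum_L P(L | G), equals 1
    total = 0.0
    for d in range(2):                      # sum_D P(D) ...
        inner = 0.0
        for i in range(2):                  # sum_I P(G | D, I) P(I) ...
            sum_s = sum(P_S[i])             # sum_S P(S | I), equals 1
            inner += P_G[(i, d)][g] * P_I[i] * sum_s
        total += P_D[d] * inner
    return total * sum_l

print([round(marginal_grade(g), 4) for g in range(3)])
```

The sums over `letter` and `sat` each collapse to 1, so only the tables for `difficulty`, `intelligence` and `grade` contribute to the result.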

In [43]:
from pgmpy.inference import VariableElimination
infer = VariableElimination(student_model)
print(infer.query(['grade']) ['grade'])

╒═════════╤══════════════╕
│ grade   │   phi(grade) │
╞═════════╪══════════════╡
│ grade_0 │       0.3620 │
├─────────┼──────────────┤
│ grade_1 │       0.2884 │
├─────────┼──────────────┤
│ grade_2 │       0.3496 │
╘═════════╧══════════════╛


There can be cases in which we want to compute a conditional distribution, say 
#### $$ P(G | D=0, I=1) $$

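In such cases the evidence variables are fixed rather than summed over, and the result has to be normalized by the probability of the evidence:

#### $$ P(G | D=0, I=1) \propto \sum_L \sum_S P(L|G) * P(S| I=1) * P(G| D=0, I=1) * P(D=0) * P(I=1) $$
#### $$ P(G | D=0, I=1) \propto P(D=0) * P(I=1) * P(G | D=0, I=1) * \sum_L P(L | G) * \sum_S P(S | I=1) $$

Both sums equal 1, and the constant factors $P(D=0) * P(I=1)$ cancel in the normalization, leaving the corresponding column of the CPD of $G$.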
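
Evaluated for each grade, the product on the right-hand side gives one weight per state; dividing the weights by their sum, which is $P(D=0, I=1)$, yields the conditional distribution. A small plain-Python sketch of that step, with `unnormalized` as a hypothetical helper:

```python
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}   # key: (i, d)
P_L = [[0.1, 0.9], [0.4, 0.6], [0.99, 0.01]]                 # P_L[g][l]
P_S = [[0.95, 0.05], [0.2, 0.8]]                             # P_S[i][s]

def unnormalized(g, d=0, i=1):
    """P(D=d) P(I=i) P(G=g | D=d, I=i) sum_L P(L | g) sum_S P(S | i)."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * sum(P_L[g]) * sum(P_S[i])

weights = [unnormalized(g) for g in range(3)]
evidence_probability = sum(weights)          # P(D=0, I=1)
posterior = [w / evidence_probability for w in weights]

print(round(evidence_probability, 4), [round(p, 4) for p in posterior])
```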

In pgmpy we will just need to pass an extra argument in the case of conditional distributions:

In [44]:
print(infer.query(['grade'], 
                  evidence={'difficulty': 0, 
                            'intelligence': 1}) ['grade'])


╒═════════╤══════════════╕
│ grade   │   phi(grade) │
╞═════════╪══════════════╡
│ grade_0 │       0.9000 │
├─────────┼──────────────┤
│ grade_1 │       0.0800 │
├─────────┼──────────────┤
│ grade_2 │       0.0200 │
╘═════════╧══════════════╛


**Predicting values from new data points** <br>

Predicting values for new data points is quite similar to computing conditional probabilities. We query for the variable that we want to predict given all the other features. The only difference is that, rather than getting the probability distribution, we are interested in the most probable state of the variable.

In pgmpy this is known as a MAP query. Here's an example:

In [52]:
infer.map_query(['grade'])


{'grade': 2}

In [53]:
infer.map_query(['grade'], 
                evidence={'difficulty': 0, 
                          'intelligence': 1})

{'grade': 0}

In [55]:
infer.map_query(['grade'], 
                evidence={'difficulty': 0, 
                          'intelligence': 1, 
                          'letter': 1, 
                          'sat': 1})

{'grade': 0}
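
With all other variables observed, the most probable state is just the argmax of the product of the CPD entries that involve `grade`, since the normalizing constant does not change the ranking. The sketch below checks the fully observed query by hand in plain Python; `map_grade` is a hypothetical helper, not a pgmpy function.

```python
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}   # key: (i, d)
P_L = [[0.1, 0.9], [0.4, 0.6], [0.99, 0.01]]                 # P_L[g][l]
P_S = [[0.95, 0.05], [0.2, 0.8]]                             # P_S[i][s]

def map_grade(d, i, l, s):
    """Most probable grade state given all four other variables."""
    scores = [P_D[d] * P_I[i] * P_G[(i, d)][g] * P_L[g][l] * P_S[i][s]
              for g in range(3)]
    # Normalizing would divide every score by the same constant,
    # so the argmax can be taken on the unnormalized scores.
    return max(range(3), key=scores.__getitem__)

print(map_grade(d=0, i=1, l=1, s=1))
```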