# Course 1 Week 2
## Lecture 1. Overview of Template Models
### 1.0 Sharing Between Models & Sharing Within Models

#### Image Segmentation  
The same model can be shared for every pixel within an image  

The same model can be shared between images of the same database  

<img src="./images/image_segmentation.png">  

### 1.1 Template Variable
Template variable $X(U_1, ..., U_k)$ is instantiated (duplicated) multiple times.

* these variables are indexed by various things, for instance: timepoints, people, pixels, courses, students, etc.  

### 1.2 Template Model
Language that specifies how ground variables inherit dependencies in the model from a template.  

Useful for representing Dynamic Bayesian Networks (Networks that have replication over time).  

The following are advantages of using template models:
* Template models can often capture events that occur in a time series.
* CPDs in template models can often be copied many times.
* Template models can capture parameter sharing within a model.
* NOTE: Template models cannot represent CPDs that cannot be represented in non-template models.

## Lecture 2. Temporal Models - DBNs
### 2.1 Distribution over temporal trajectories
* discretize time by picking a time granularity: $\Delta$  
* $X^{(t)}$ - a specific instantiation of variable $X$ at time $t\Delta$
* $X^{(t:t')} = \{X^{(t)}, ..., X^{(t')} \}$ where $(t \leq t')$  
* Our goal is to be able to represent a probability distribution $P(X^{(t:t')})$ for any $t$, $t'$.   

### 2.2 Markov Assumption
If we have 1,000,000 timepoints we would need 1,000,000 representations of $X^{(t)}$ to represent this model.  This is not feasible!  

The Markov Assumption allows us to compact the model  

The chain rule for probabilities gives the following:  
$$P(X^{(0:T)}) = P(X^{(0)}) \prod^{T-1}_{t=0} P(X^{(t+1)} \mid X^{(0:t)})$$  

This reads as follows:  
The joint probability of the variables spanning the time from $0,...,T$ equals the probability of $X$ at time 0 (i.e., $P(X^{(0)})$) multiplied, for every $t$, by the probability of the state of the system at $t+1$ given all previous states at timepoints $0, ..., t$ (i.e., $P(X^{(t+1)} \mid X^{(0:t)})$).  

The **Markov Assumption**  
$$(X^{(t+1)} \bot X^{(0:t-1)}) \mid X^{(t)}$$  

The state at $t+1$ is independent of the past given the present.  

Now we can simplify the previous statement to:  
  
$$P(X^{(0:T)}) = P(X^{(0)}) \prod^{T-1}_{t=0} P(X^{(t+1)} \mid X^{(t)})$$  

**this assumption is sometimes too strong**  
To address this case, we can enrich the model to include variables that would address the missing information.  For instance, adding the random variable velocity to the model when developing a model for robot positioning.   

Semi-Markov models are another option for relaxing the assumption.  

**Even now we still have a probability distribution for every timepoint at $X^{(t)}$**  

This is where we are going to end-up with a template model...  

### 2.3 Time Invariance
For every $X^{(t)}$ we will have the following **template probability model**: $P(X' \mid X)$  

Important notation in template probability models:  

$X^{(t+1)} = X'$ denotes the next timepoint  
$X^{(t)} = X$ denotes the current timepoint  

For every single timepoint $t$, this model is replicated such that:  

$$P(X^{(t+1)} \mid X^{(t)}) = P(X' \mid X)$$  

This replication is called **time invariance**  
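The Markov assumption and time invariance together mean that a whole trajectory can be scored with one initial distribution and one shared transition model. The sketch below illustrates this on a hypothetical two-state weather chain; the state names, the numbers, and the helper `trajectory_probability` are assumptions made for illustration only.

```python
# Hypothetical two-state chain; all probabilities are illustrative.
initial = {"sunny": 0.6, "rainy": 0.4}             # P(X^(0))
transition = {                                      # P(X' | X), reused at every t
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}


def trajectory_probability(states):
    """P(x^(0:T)) = P(x^(0)) * prod_t P(x^(t+1) | x^(t))."""
    prob = initial[states[0]]
    # Walk over consecutive (current, next) pairs of the trajectory.
    for current, nxt in zip(states, states[1:]):
        prob *= transition[current][nxt]
    return prob


p = trajectory_probability(["sunny", "sunny", "rainy", "rainy"])
print(p)  # 0.6 * 0.8 * 0.2 * 0.7
```

Note that the same `transition` table is used at every step, so the number of parameters does not grow with the length of the trajectory.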

### 2.4 Transition Model  
<img src="./images/template_transition_model.png">  
<img src="./images/initial_state_description.png">  

### 2.5 Two-Time-Slice Bayesian Network (2TBN)  
A transition model (2TBN) over $X_1, ..., X_n$ is specified as a BN fragment such that:  

* the nodes include $X'_1, ..., X'_n$ and a subset of $X_1, ..., X_n$  
* only the nodes $X'_1, ..., X'_n$ have parents and a CPD  

The 2TBN defines a conditional distribution:  

$$P(X' \mid X) = \prod^n_{i=1} P(X'_i \mid Pa_{x'_i})$$

$Pa$ denotes the parents  
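As a rough sketch of the factorization above, the transition probability can be computed as a product of one CPD lookup per next-slice variable. The two binary variables `A` and `B`, the structure, and every number below are hypothetical and chosen only to make the example runnable.

```python
# Each next-slice variable maps to (parent names, CPD table).
# A trailing "'" marks a next-slice variable; other names are current-slice.
# CPD tables map a tuple of parent values to [P(var=0 | pa), P(var=1 | pa)].
two_tbn = {
    "A'": (("A",), {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}),
    "B'": (("B", "A'"), {
        (0, 0): [0.7, 0.3], (0, 1): [0.4, 0.6],
        (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
    }),
}


def transition_probability(current, nxt):
    """P(X' | X) = prod_i P(x'_i | Pa(x'_i))."""
    values = {**current, **nxt}
    prob = 1.0
    for var, (parents, cpd) in two_tbn.items():
        parent_values = tuple(values[p] for p in parents)
        prob *= cpd[parent_values][nxt[var]]
    return prob


p = transition_probability({"A": 0, "B": 1}, {"A'": 1, "B'": 1})
print(p)  # P(A'=1 | A=0) * P(B'=1 | B=1, A'=1)
```

Only the primed variables carry CPDs here, matching the definition that current-slice nodes have no parents and no CPD in the 2TBN.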

<img src="./images/2tbn.png">  

### 2.6 Dynamic Bayesian Network (DBN)
A dynamic Bayesian network (DBN) over $X_1,...,X_n$ is defined by a:  

1. 2TBN, defined as $BN_{\rightarrow}$, over $X_1, ..., X_n$  
2. a Bayesian network $BN^{(0)}$ over $X^{(0)}_1,...,X^{(0)}_n$  

### 2.7 Ground Bayesian Network (Unrolled Network)
For a trajectory over $0,...,T$ we define a ground (unrolled) network such that:  

1. the dependency model for $X^{(0)}_1,...,X^{(0)}_n$ is copied from $BN^{(0)}$  
2. the dependency model for $X^{(t)}_1,...,X^{(t)}_n$ for all $t>0$ is copied from $BN_{\rightarrow}$  
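The two copying rules above can be sketched as a small unrolling routine that only tracks graph structure. The function `unroll`, the edge-list representation, and the HMM-like example structure are assumptions for illustration, not part of the course material.

```python
def unroll(initial_edges, transition_edges, horizon):
    """Return ground-network edges for time slices 0..horizon.

    initial_edges:    (parent, child) pairs inside slice 0, taken from BN^(0).
    transition_edges: (parent, child) pairs from the 2TBN, where a trailing
                      "'" marks a next-slice variable.
    """
    def ground(name, t):
        # "X'" used at step t is X in slice t + 1; plain "X" is X in slice t.
        return (name[:-1], t + 1) if name.endswith("'") else (name, t)

    # Rule 1: slice 0 copies its dependencies from BN^(0).
    edges = [((p, 0), (c, 0)) for p, c in initial_edges]
    # Rule 2: every later slice copies its dependencies from the 2TBN.
    for t in range(horizon):
        edges += [(ground(p, t), ground(c, t)) for p, c in transition_edges]
    return edges


# Hypothetical structure: S -> S' across slices, S' -> O' within a slice.
edges = unroll(initial_edges=[("S", "O")],
               transition_edges=[("S", "S'"), ("S'", "O'")],
               horizon=2)
for edge in edges:
    print(edge)
```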

<img src="./images/ground_bayesian_network.png">   
 

## Lecture 3. Hidden Markov Models  

### 3.1 Definition  

Hidden Markov Models (HMMs) consist of a **Transition Model** and an **Observation Model**  

<img src="./images/HMM_example.png"> 
**(Left)** 2TBN HMM template model.  **(Right)** HMM ground network.  

There is often a great deal of complexity encoded in the **transition model**.  

<img src="./images/transition_model.png"> 
This is an example CPD of the transition model for the 2TBN of an HMM.  The nodes in this graph represent the four states that the variable $S'$ can take according to the CPD $P(s' \mid s)$.  Arrows represent the probability of the variable transitioning from the current state $s$ to the next state $s'$.  
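To make the two components concrete, here is a minimal sketch of an HMM with a sparse four-state transition model and a simple observation model. All state names, observation symbols, and probabilities are invented for illustration and do not come from the figure; the sketch also assumes that every time slice, including slice 0, emits an observation.

```python
# Sparse transition model P(s' | s): missing entries have probability zero.
transition = {
    "s1": {"s1": 0.3, "s2": 0.7},
    "s2": {"s3": 0.4, "s4": 0.6},
    "s3": {"s1": 0.5, "s3": 0.5},
    "s4": {"s1": 0.1, "s4": 0.9},
}
# Observation model P(o | s) over two hypothetical symbols "a" and "b".
observation = {
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.2, "b": 0.8},
    "s3": {"a": 0.5, "b": 0.5},
    "s4": {"a": 0.1, "b": 0.9},
}
initial = {"s1": 1.0}  # assume the chain always starts in s1


def joint_probability(hidden, observed):
    """Joint probability of a hidden state sequence and its observations."""
    prob = initial.get(hidden[0], 0.0) * observation[hidden[0]][observed[0]]
    for prev, cur, obs in zip(hidden, hidden[1:], observed[1:]):
        prob *= transition[prev].get(cur, 0.0) * observation[cur][obs]
    return prob


print(joint_probability(["s1", "s2", "s4"], ["a", "b", "b"]))
```

Storing only the nonzero transitions mirrors the point that HMM structure tends to show up as sparsity in the transition matrix rather than in the graph over random variables.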

### 3.2 Applications
* Robot localization
* Speech Recognition (Best success story)
* Biological sequence analysis
* Text annotation

### 3.3 Examples
**Spectrogram of speech signal**  
<img src="./images/acoustic_signal.png">  


**HMM used to determine the phoneme sequence**  
<img src="./images/word_HMM.png">  


**HMM used to predict the word from the lexicon based on phoneme sequences**  
<img src="./images/recognition_HMM.png">  

### 3.4 Summary

* HMMs are a subclass of DBNs  
* HMMs seem unstructured at the level of random variables  
* HMM structure typically manifests in sparsity and repeated elements within the transition matrix  
* HMMs are used in a wide variety of applications for modeling sequences  


## Lecture 4. Plate Models  

### 4.1 Purpose 
Useful for encoding networks that have **multiple objects of the same type**  

### 4.2 Modeling Repetition  
A plate is labeled with an index $t$ that ranges over $n$ objects $t_1, ..., t_n$, and denotes $n$ replications of everything drawn inside it.  

The plate also contains random variables $X$.  

For every value $t_i$ of the plate index, there is a ground random variable $X(t_i)$  

<img src="./images/plate_1.png">  
This plate model represents multiple coin tosses.  

### 4.3 Nested Plates
Plates within a plate are nested plates

Each random variable in the inner plate is indexed by **both** the index of that plate and the index of the plate within which it is nested.  

<img src="./images/nested_plates.png">  
Example of a nested plate. Note that $G$ and $I$ are indexed by both $s$ and $c$. This example states that each student has a "course specific" intelligence. This may or may not be an assumption that we would like.

### 4.4 Overlapping Plates
Plates that are not nested, but overlap.  

<img src="./images/overlapping_plate.png">  
The assumption here is that difficulty is a property of the course, intelligence is a property of the student, and only grade is a property of the student-course pair, depending on both intelligence and difficulty.  
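A small sketch of how the overlapping plates ground out may help: each student gets one intelligence variable, each course one difficulty variable, and each student-course pair one grade variable. The function name and the string encoding of ground variables are assumptions for illustration.

```python
def ground_overlapping_plates(students, courses):
    """Return ground edges for I(s) -> G(s,c) <- D(c)."""
    edges = []
    for s in students:
        for c in courses:
            grade = f"G({s},{c})"
            edges.append((f"I({s})", grade))   # intelligence is per student
            edges.append((f"D({c})", grade))   # difficulty is per course
    return edges


edges = ground_overlapping_plates(["s1", "s2"], ["c1", "c2", "c3"])
parents = {parent for parent, _ in edges}
print(len(edges), sorted(parents))
```

In contrast to the nested-plate version, `I(s)` here carries no course index, so the same intelligence variable is a parent of all of that student's grades.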

### 4.5 Explicit Parameter Sharing  
We may explicitly write down that certain random variables share parameters of their CPD using the notation $\theta$.  

<img src="./images/parameter_sharing.png">

### 4.6 Collective Inference   
<img src="./images/collective_inference.png">  

### 4.7 Summary  
For a **template variable** $A(U_1,...,U_k)$:  
* we have a set of **template parents** $B_1(\mathbf{U}_1),...,B_m(\mathbf{U}_m)$
* the indices of each template parent $B_i$ must be a subset of the indices of its child template variable $A$: $\mathbf{U}_i \subseteq \{U_1, ..., U_k \}$  
* we cannot have an index in the parent that does not appear in the child, because this would result in a child variable that has an unbounded (possibly infinite) number of parents
* we can define a **template CPD** $P(A \mid B_1,...,B_m)$
* given the above template variable and template parents, for any instantiation $u_1,...,u_k$ to $U_1, ..., U_k$ we would have the following **ground network**, where $\mathbf{u}_i$ is the part of the instantiation that corresponds to $\mathbf{U}_i$ and there is one such edge for each of the $m$ parents:  
$$B_1(\mathbf{u}_1) \rightarrow A(u_1,...,u_k) \leftarrow B_m(\mathbf{u}_m)$$  

### 4.8 Summary
* template for an infinite set of BNs, each induced by a different set of domain objects  
* parameters and structure are reused within a BN and across different BNs
* Models encode correlations across multiple objects, allowing collective inference
* multiple "languages", each with different trade-offs in expressive power.

## Lecture 5. Overview: Structured CPDs

### 5.1 Tabular Representations
Traditional representation, but it becomes very large when there are many variables being conditioned on.

A tabular CPD conditioned on 5 binary variables has $2^5 = 32$ rows, one per assignment to the parents.  

### 5.2 General CPD
* CPD $P(X \mid Y_1, ..., Y_k)$ specifies a distribution over $X$ for each assignment of $(y_1, ..., y_k)$
* Can use any function to specify a factor $\Phi (X,Y_1,...,Y_k)$ such that, for all $(y_1, ..., y_k)$:  
$$\sum_x \Phi (x,y_1,...,y_k) = 1$$
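The normalization condition can be checked mechanically for any function that is proposed as a CPD. The following sketch assumes discrete variables; `is_valid_cpd` and the example function `phi` are hypothetical names introduced here.

```python
from itertools import product


def is_valid_cpd(phi, x_values, parent_domains, tol=1e-9):
    """Check that phi(x, y_1..y_k) is non-negative and sums to 1 over x
    for every assignment (y_1, ..., y_k) to the parents."""
    for ys in product(*parent_domains):
        values = [phi(x, *ys) for x in x_values]
        if any(v < 0 for v in values):
            return False
        if abs(sum(values) - 1.0) > tol:
            return False
    return True


def phi(x, y1, y2):
    # A CPD given by a formula instead of a table (numbers are illustrative).
    p_one = 0.1 + 0.4 * y1 + 0.3 * y2
    return p_one if x == 1 else 1.0 - p_one


print(is_valid_cpd(phi, [0, 1], [[0, 1], [0, 1]]))
```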

### 5.3 Examples
* Deterministic CPD
* Tree-structured CPD
* Logistic CPDs and generalizations
* Noisy OR / AND
* Linear Gaussians and generalizations

### 5.4 Context-Specific Independence
$$P \models (X \bot_c Y \mid Z,c)$$  

where $c$ is an assignment to some set of variables $C$    

$P(X,Y \mid Z,c) = P(X \mid Z,c)P(Y \mid Z,c)$  
$P(X \mid Y,Z,c) = P(X \mid Z,c)$  
$P(Y \mid X,Z,c) = P(Y \mid Z,c)$  

#### example using the OR model

<img src="./images/or_model.png">  

How to solve this using the rules for context specific independence from above  
<img src="./images/contect_specific_independence_ex.jpeg">  
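As a numerical companion to the OR example, the sketch below assumes a deterministic OR, $Y = X_1 \lor X_2$, and checks one context-specific independence: once $X_1 = 1$ is known, the distribution of $Y$ no longer depends on $X_2$. The function name `p_y_given` is made up for this illustration.

```python
def p_y_given(x1, x2):
    """Deterministic OR CPD: returns [P(y=0 | x1, x2), P(y=1 | x1, x2)]."""
    y = 1 if (x1 or x2) else 0
    return [1.0 - y, float(y)]


# In the context x1 = 1, the CPD row is the same for both values of X2,
# which is the statement (Y independent of X2 given x1 = 1).
same_in_context_x1_true = p_y_given(1, 0) == p_y_given(1, 1)

# In the context x1 = 0, Y still depends on X2, so the independence is
# specific to the context and does not hold in general.
same_in_context_x1_false = p_y_given(0, 0) == p_y_given(0, 1)

print(same_in_context_x1_true, same_in_context_x1_false)
```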

## Lecture 6. Tree-Structured CPDs

### 6.1 Job Offer Tree CPD

<img src="./images/tree_structured_cpd.png">  

Certain **context specific independencies** are implied in a tree CPD as follows:  

<img src="./images/context_specific_indep_in_tree.png"> 

### 6.2 Multiplexor CPD

In this example, the $Choice$ variable determines if a connection is present in the graph for the other variables $Letter_1$ and $Letter_2$.

the $Choice$ variable acts as a **switch** 

<img src="./images/multiplexor_1.png"> 

An important conditional independence in this example is that 

$$(L_1 \bot L_2 \mid J,C)$$

In this example:  
* $Y$ copies the value of exactly ONE of its parents $Z_1, ..., Z_k$.
* $A$ takes on a single value $a \in \{1,2,...,k\}$
* $A$ tells $Y$ which parent $Z_a$ to copy

<img src="./images/multiplexor_2.png"> 

The Probability Distribution specified for $Y$ is the following:  

$$P(Y \mid A,Z_1,...,Z_k) = \begin{cases} 1, & \mbox{if } Y=Z_a \\ 0, & \mbox{otherwise} \end{cases}$$  

In other words, if $A = a$ then $Y = Z_a$.
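The multiplexer CPD is deterministic, so it can be written as a short function. This is a minimal sketch; the name `multiplexer_cpd` and the choice to pass $Z_1, ..., Z_k$ as a list are assumptions.

```python
def multiplexer_cpd(y, a, zs):
    """P(Y = y | A = a, Z_1..Z_k = zs), with the selector a in 1..k."""
    selected = zs[a - 1]          # A picks which parent Y copies
    return 1.0 if y == selected else 0.0


zs = ["low", "high", "medium"]    # hypothetical values of Z_1, Z_2, Z_3
print(multiplexer_cpd("high", 2, zs))   # Y equals Z_2
print(multiplexer_cpd("low", 2, zs))    # Y differs from Z_2
```

Because only `zs[a - 1]` is ever read, $Y$ is independent of every other $Z_i$ once the value of $A$ is known.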

## Lecture 7. Independence of Causal Influence

### 7.1 Noisy OR CPD

some cases of exponentially growing CPDs do not lend themselves to the Tree CPD

Another option is the Noisy OR CPD

<img src="./images/noisy_OR_CPD.png"> 

What context-specific independencies are induced by a noisy OR CPD?  

$(X_1 \bot_c X_2 \mid y^0)$  

Given that $y = 0$, we know that all $Z_i$'s are 0; in particular, $Z_1$ and $Z_2$ are 0, **so that blocks the trail of influence from $X_1$ to $X_2$**. This is a very subtle point - make sure you understand it!
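For reference, the usual noisy OR parameterization can be sketched in a few lines: each active cause $X_i$ independently fails to turn on $Y$ with probability $1 - \lambda_i$, and an optional leak term lets $Y$ turn on with no active cause. The parameter names `lambdas` and `leak` and all numbers are assumptions made for this illustration.

```python
def noisy_or_p_y1(x, lambdas, leak=0.0):
    """P(y = 1 | x) for a noisy OR with per-cause strengths and a leak."""
    # Probability that the leak and every active cause all fail to fire.
    p_all_fail = 1.0 - leak
    for xi, lam in zip(x, lambdas):
        if xi:
            p_all_fail *= 1.0 - lam
    return 1.0 - p_all_fail


lambdas = [0.8, 0.5]
for x in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(x, noisy_or_p_y1(x, lambdas))
```

The CPD needs one parameter per parent (plus the leak) rather than a table that is exponential in the number of parents.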

### 7.2 Independence of Causal Influence
The Noisy OR CPD can be generalized to a broader notion of causal influence called **Independence of Causal Influence**: we have a number of causes for a variable, and each of them acts independently to affect the truth of that variable.  

<img src="./images/indep_causal_inf.png">   

There are no interactions between the separate causes that act on $Z$.  

Example Models:  
* Noisy OR - most commonly used  
* Noisy AND  
* Noisy Max  
* Sigmoid CPD  

### 7.3 Sigmoid CPD
<img src="./images/sigmoid_CPD.png"> 

$$Z=w_0+\sum^k_{i=1} w_iX_i $$  

$w_i$ - weights that indicate the strength of the effect of the influence of a random variable $X_i$ on $Z$  
$w_0$ - bias term 

$Z$ is a continuous number from $-\infty$ to $+\infty$  

The sum $Z$ is converted to a probability by passing it through the sigmoid function.  

$$sigmoid(z) = \frac{e^z}{1+e^z}$$  

**Behavior of Sigmoid CPD**  

<img src="./images/behavior_1.png">

The more parent $X_i$'s that are true, the higher the output of the sigmoid function.  

Multiplying $z$ by a factor of 10 makes the slope of the sigmoid function more extreme. 

$P(y^1 \mid X_1,...,X_k) = sigmoid(w_0+\sum^k_{i=1} w_iX_i)$

The odds ratio of $Y$ is: $O(\textbf{x}) = \frac{P(y^{1}|\mathbf{x})}{P(y^{0}|\mathbf{x})}$  

It captures the relative likelihood of the two values of $Y$   

By what factor does $O(\textbf{x})$ change if the value of $X_i$ goes from 0 to 1?

Solution = $e^{w_i}$  

<img src="./images/IMG-3224.JPG">
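The odds result can be checked numerically with a minimal sketch of a sigmoid CPD. The weights below are arbitrary illustrative values, and the helper names `p_y1` and `odds` are assumptions.

```python
import math


def sigmoid(z):
    # Equivalent to e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))


def p_y1(x, w0, w):
    """P(y = 1 | x) for a sigmoid CPD with bias w0 and weights w."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


def odds(x, w0, w):
    """O(x) = P(y = 1 | x) / P(y = 0 | x)."""
    p = p_y1(x, w0, w)
    return p / (1.0 - p)


w0, w = -1.0, [0.5, 2.0, -0.7]
# Flip X_2 from 0 to 1 while holding the other parents fixed.
ratio = odds([1, 1, 0], w0, w) / odds([1, 0, 0], w0, w)
print(ratio, math.exp(w[1]))
```

The two printed numbers agree up to floating-point error because $O(\mathbf{x}) = e^{z}$, so changing $X_i$ from 0 to 1 multiplies the odds by $e^{w_i}$ regardless of the other parents.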

## Lecture 8. Continuous Variables

### 8.1 The Normal Distribution  

Normal distributions (Gaussian distributions) are parameterized by their **mean and standard deviation**.  

In a **linear Gaussian**, the parameters (the weights that define the mean, and the variance) are fixed.

In a **Conditional Linear Gaussian**, the parameters are conditioned on the value of another variable.  In the following figure, $\alpha$ and $T$ depend on the value of another variable, the Door variable.   

<img src="./images/gaussian_1.PNG">  

### 8.2 Linear Gaussian Model

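In a linear Gaussian model, the child $Y$ is normally distributed with a mean that is a linear function of the parents $X_i$ and a variance that doesn't depend at all on the values of the parents.

$$Y \sim \mathcal{N}(w_0 + \sum_i w_iX_i ;\ \sigma^{2})$$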

Let L and V be the location and velocity of a car. Assume that the CPD on the right is a linear Gaussian. Which of the following statements could possibly be consistent with that CPD?

<img src="./images/car_model.PNG">  
Answers:  
* $L^{(t+1)}$ might possibly end up far from its expected position.  

* Due to friction, the single most likely value for $L^{(t+1)}$ is $L^{(t)} + 0.9* V^{(t)} \Delta t$.
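Both answers can be illustrated with a small sketch of a linear Gaussian CPD for the car. The weights follow the $L^{(t)} + 0.9 \cdot V^{(t)} \Delta t$ form mentioned in the second answer, but the time step, the standard deviation, and the function names are assumptions, not values from the figure.

```python
import math


def linear_gaussian_mean(parents, w0, weights):
    """Mean of the child: w0 + sum_i w_i * x_i."""
    return w0 + sum(w * x for w, x in zip(weights, parents))


def linear_gaussian_pdf(y, parents, w0, weights, sigma):
    """Density of the child at y; sigma does not depend on the parents."""
    mean = linear_gaussian_mean(parents, w0, weights)
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


dt, sigma = 0.5, 2.0                    # hypothetical time step and noise
weights = [1.0, 0.9 * dt]               # parents are (L^(t), V^(t))
location, velocity = 10.0, 2.0

mean = linear_gaussian_mean([location, velocity], 0.0, weights)
print(mean)
# The density peaks at the mean, yet far-away values keep nonzero density.
print(linear_gaussian_pdf(mean, [location, velocity], 0.0, weights, sigma))
print(linear_gaussian_pdf(mean + 6.0, [location, velocity], 0.0, weights, sigma))
```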

### 8.3 Conditional Linear Gaussian Model
This introduces the idea of one or more discrete parents where the parameters of the linear Gaussian depend on the value of the discrete parents.  In this example $A$ is the discrete parent, the $X_i$ are the continuous parents, and $Y$ is the child.  

$$Y \sim \mathcal{N}(w_{a0} + \sum_i w_{ai}X_i ;\ \sigma^{2}_a)$$  

<img src="./images/clg.PNG">  
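A conditional linear Gaussian can be sketched as a lookup of linear Gaussian parameters keyed by the value of the discrete parent. The parameter table `clg_params`, its two contexts, and every number in it are hypothetical.

```python
import random

# One set of linear Gaussian parameters per value a of the discrete parent A.
clg_params = {
    0: {"w0": 0.0, "weights": [0.9], "sigma": 0.5},
    1: {"w0": 2.0, "weights": [0.5], "sigma": 1.5},
}


def clg_mean_and_sigma(a, xs):
    """Return (mean, sigma) of the child given discrete a and continuous xs."""
    params = clg_params[a]
    mean = params["w0"] + sum(w * x for w, x in zip(params["weights"], xs))
    return mean, params["sigma"]


def sample_clg(a, xs, rng):
    """Draw one sample of the child from N(mean, sigma^2)."""
    mean, sigma = clg_mean_and_sigma(a, xs)
    return rng.gauss(mean, sigma)


rng = random.Random(0)                  # seeded for repeatability
print(clg_mean_and_sigma(0, [10.0]), clg_mean_and_sigma(1, [10.0]))
print(sample_clg(1, [10.0], rng))
```

For a fixed value of $A$ this reduces to an ordinary linear Gaussian; only the discrete parent switches between parameter sets.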