$\newcommand{\bs}[1]{\boldsymbol{#1}}$
$\renewcommand{\vec}[1]{\bs{#1}}$

# 4 Graphical Models

## 4.1 Introduction
- Two key principles for building learning models are *modularity* and *abstraction*; probability theory brings both under a unified approach
- **Probabilistic Graphical Models (PGMs)** are a mathematical formalism for representing and reasoning about joint probability distributions
    - based on graphs where nodes are rvs and edges (or the lack of edges) encode conditional independence assumptions between rvs
    - useful to model complex systems and to exploit conditional independence when computing estimates or doing inference


## 4.2 Directed graphical models (Bayes nets)
- Based on *Directed Probabilistic Graphical Models (DPGMs)*, which are built on *directed acyclic graphs (DAGs)*, aka **Bayes nets/belief networks**
    - fun fact: there is nothing inherently Bayesian about them, it's just a way of defining and reasoning about prob dists


### 4.2.1 Representing the joint distribution
- A nice property of DAGs is that nodes can be put in *topological order*, so that each child $x_{i}$ comes after its parents. Given its parents $\vec{x}_{\text{par}(i)}$, node $x_i$ is conditionally independent of its other predecessors: $x_{i} \perp \vec{x}_{\text{pred}(i)\setminus\text{par}(i)}\mid \vec{x}_{\text{par}(i)}$
- Thus, applying the chain rule of probs in that order, the joint dist over $N_G$ nodes factorizes as: $p(\vec{x}_{1:N_G})=\prod_{i=1}^{N_G}p(x_i\mid\vec{x}_{\text{par}(i)})$
    - where $p(x_i\mid\vec{x}_{\text{par}(i)})$ is the *Conditional Prob Dist (CPD)* for node $i$
    - KEY ADVANTAGE of expressing dists in this way: the number of parameters needed is significantly smaller.
        - eg. if $N_G$ is the number of nodes and each rv has $K$ discrete states then in an *unstructured joint prob* we need $O(K^{N_G})$ params to specify the prob of every configuration
        - conversely, in a DAG each node is conditioned only on its parents (say at most $N_{P}$ parents per node), so we only need $O(N_{G}K^{N_{P}+1})$ params
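
To get a feel for the gap between the two parameter counts, here is a minimal sketch. The function names and the example sizes are illustrative assumptions, and the counts are exact free-parameter counts (after removing the normalization constraint) rather than the big-O bounds.

```python
def full_joint_params(n_nodes: int, k: int) -> int:
    """Free params of an unstructured joint table over n_nodes K-ary rvs."""
    # one entry per configuration, minus 1 because the table sums to 1
    return k ** n_nodes - 1


def dag_params_bound(n_nodes: int, k: int, max_parents: int) -> int:
    """Upper bound on free params when each node has at most max_parents parents."""
    # each CPT has at most K^max_parents rows, each with K-1 free entries
    return n_nodes * (k ** max_parents) * (k - 1)


# Hypothetical sizes: 10 binary nodes, at most 2 parents each
print(full_joint_params(10, 2))    # 1023
print(dag_params_bound(10, 2, 2))  # 40
```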



### 4.2.2 Examples
- Examples of how DPGMs can be useful


#### 4.2.2.1 Markov chains
- In a Markov chain the joint dist has the same form as the joint dist above (Sec.4.2.1), but now time dictates the ordering and each state depends only on a fixed window of past states
    - for a first-order Markov model (bigram model): $p(\vec{x}_{1:T})=p(x_1)\prod_{t=2}^{T}p(x_t\mid x_{t-1})$
    - for a second-order Markov model (trigram model): $p(\vec{x}_{1:T})=p(x_1, x_2)\prod_{t=3}^{T}p(x_t\mid \vec{x}_{t-2:t-1})$
    - where, in either case, the lookup table aka **Conditional Probability Table (CPT)** $\theta_{jk}$ (prob of next state $k$ given conditioning state(s) $j$, eg. $\theta_{jk}=p(x_t=k\mid x_{t-1}=j)$ for first order) is bounded to $[0,1]$ & row-normalized ($\sum_k\theta_{jk}=1$)
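
A minimal sketch of a first-order chain, where each state depends only on the previous one. The two-state transition table and the initial distribution are made-up numbers, and the helper names are hypothetical.

```python
import random

# Hypothetical 2-state chain (all numbers are assumptions for illustration)
initial = [0.5, 0.5]          # p(x_1)
theta = [[0.9, 0.1],          # theta[j][k] = p(x_t = k | x_{t-1} = j)
         [0.3, 0.7]]


def is_valid_cpt(table, tol=1e-12):
    """Check boundedness in [0, 1] and row normalization."""
    bounded = all(0.0 <= p <= 1.0 for row in table for p in row)
    normalized = all(abs(sum(row) - 1.0) < tol for row in table)
    return bounded and normalized


def joint_prob(seq):
    """p(x_{1:T}) = p(x_1) * prod_{t>=2} p(x_t | x_{t-1})."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= theta[prev][cur]
    return prob


def sample_chain(length, rng):
    """Draw x_1 from the initial dist, then each x_t given x_{t-1}."""
    state = rng.choices([0, 1], weights=initial)[0]
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices([0, 1], weights=theta[state])[0]
        seq.append(state)
    return seq


print(is_valid_cpt(theta))
print(joint_prob([0, 0, 1]))   # 0.5 * 0.9 * 0.1
print(sample_chain(10, random.Random(0)))
```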
    
    
#### 4.2.2.2 The "student" network
- This is another example: a student taking a class, described by 5 rvs (D: difficulty, I: intelligence, G: grade, L: recommendation letter, S: SAT score).
    - The joint prob is first expanded with the chain rule of probs in an order that respects the topology of the graph (Fig.4.2), and then each factor is simplified using the CI assumptions encoded in the graph (eg. in this case L is cond independent of all its other predecessors given its parent G):
    - $p(D, I, G, L, S)=p(L\mid S,G,D,I)p(S\mid G,D,I)p(G\mid D,I)p(D\mid I)p(I)=p(L\mid G)p(S\mid I)p(G\mid D,I)p(D)p(I)$
- In DPGMs formulation we can write the CPT for the $i$-th node as: $\theta_{ijk}=p(x_i=k\mid\vec{x}_{\text{par}(i)}=j)$, where we satisfy 
    - boundedness: $0\leq\theta_{ijk}\leq 1$
    - normalization: $\sum_{k=1}^{K_i}\theta_{ijk}=1$ for all $j$
    - $i\in[N_G]$ indexes nodes; $k\in[K_i]$ indexes node states ($K_i$ is num of states for $i$-th node); $j\in[J_i]$ indexes joint parent states ($J_i=\prod_{p\in\text{par}(i)}K_{p}$)
    - later on we'll see more parsimonious representations. So far, the number of params in a CPT is $O(K^{p+1})$, where $K$ is the num of states per node and $p$ the num of parent nodes
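
The factorized joint and the CPT indexing $\theta_{ijk}$ can be sketched as follows. This is only an illustration: every variable is assumed binary (an assumption made here to keep the tables small) and all CPT numbers are made up.

```python
from itertools import product

# Structure of the student network: node -> tuple of parents
parents = {"D": (), "I": (), "G": ("D", "I"), "S": ("I",), "L": ("G",)}

# cpt[node][j][k] = theta_ijk, with j the joint parent state (hypothetical numbers)
cpt = {
    "D": {(): [0.6, 0.4]},
    "I": {(): [0.7, 0.3]},
    "G": {(0, 0): [0.3, 0.7], (0, 1): [0.05, 0.95],
          (1, 0): [0.7, 0.3], (1, 1): [0.2, 0.8]},
    "S": {(0,): [0.95, 0.05], (1,): [0.2, 0.8]},
    "L": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
}


def joint(x):
    """p(D, I, G, L, S) as the product of one CPT entry per node."""
    prob = 1.0
    for node, pa in parents.items():
        j = tuple(x[p] for p in pa)     # joint parent state
        prob *= cpt[node][j][x[node]]   # theta_ijk with k = x[node]
    return prob


names = list(parents)
total = sum(joint(dict(zip(names, cfg)))
            for cfg in product((0, 1), repeat=len(names)))
# free params: each CPT row has K_i - 1 free entries
n_free = sum(len(row) - 1 for rows in cpt.values() for row in rows.values())

print(total)    # sums to 1 up to rounding
print(n_free)   # 10, versus 2**5 - 1 = 31 for the unstructured joint
```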
    
#### 4.2.2.3 Sigmoid belief nets
- A **sigmoid belief net** is a special case of a **deep generative model** (we'll discuss hierarchical deep gen models in Chapter.21) 
- (eg. Fig.4.3a) if we want to model two hidden layers (not-autoregressive) with $\vec{x}$ as visible nodes (shaded), $\vec{z}$ as hidden internal nodes ($K_l$ hidden nodes at $l$-th level), the joint prob is: $p(\vec{x},\vec{z})=p(\vec{z}_2)p(\vec{z}_1\mid\vec{z}_2)p(\vec{x}\mid\vec{z}_1)=\prod_{k=1}^{K_2}p(z_{2,k})\prod_{k=1}^{K_1}p(z_{1,k}\mid\vec{z}_2)\prod_{d=1}^{D}p(x_d\mid\vec{z}_1)$
    - the *sigmoid belief net* is the special case where all latent vars are binary and all latent CPDs are log-regs: $p(\vec{z}_l\mid\vec{z}_{l+1},\vec{\theta})=\prod_{k=1}^{K_l}\operatorname{Ber}(z_{l,k}\mid\sigma(\vec{w}_{l,k}^{\top}\vec{z}_{l+1}))$
    - and at the bottom layer we use whatever observation model fits the data, eg. a normal: $p(\vec{x}\mid\vec{z}_1,\vec{\theta})=\prod_{d=1}^{D}\mathcal{N}\left(x_d\mid\vec{w}_{1,d,\mu}^{\top}\vec{z}_1, \exp(\vec{w}_{1,d,\sigma}^{\top}\vec{z}_1)\right)$
- Fig.4.3b adds directed connections between the hidden nodes within each layer, giving the **Deep Autoregressive Network (DARN)**, which combines ideas from latent var modeling and autoregressive modeling

<img src="images/ch04222-sigmoid-belief-nets.png" width="70%">
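
A rough sketch of sampling from a two-hidden-layer sigmoid belief net with a Gaussian bottom layer. The layer sizes, weights, and the Bernoulli prior on the top layer are all hypothetical, and bias terms are omitted as in the formulas above.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))


def sample_sbn(rng, prior_z2, W1, W_mu, W_logvar):
    """Draw (z2, z1, x) top-down from a 2-hidden-layer sigmoid belief net."""
    # top layer: independent Bernoulli units (assumed prior)
    z2 = [int(rng.random() < p) for p in prior_z2]
    # middle layer: each unit is a logistic regression on z2
    z1 = [int(rng.random() < sigmoid(dot(w, z2))) for w in W1]
    # visible layer: mean w_mu^T z1, variance exp(w_sigma^T z1);
    # random.gauss expects a std dev, hence the factor 0.5 in the exponent
    x = [rng.gauss(dot(wm, z1), math.exp(0.5 * dot(wv, z1)))
         for wm, wv in zip(W_mu, W_logvar)]
    return z2, z1, x


# Hypothetical sizes and weights: K_2 = 2, K_1 = 3, D = 2
rng = random.Random(0)
prior_z2 = [0.5, 0.2]
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
W_mu = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]
W_logvar = [[0.0, 0.1, 0.0], [-0.5, 0.0, 0.0]]
z2, z1, x = sample_sbn(rng, prior_z2, W1, W_mu, W_logvar)
print(z2, z1, x)
```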

### 4.2.3 Gaussian Bayes nets
- When all variables are real-valued and every CPD is linear Gaussian, the CPD for the $i$-th node is: $p(x_i\mid\vec{x}_{\text{par}(i)})=\mathcal{N}(x_i\mid\mu_i+\vec{w}_i^{\top}\vec{x}_{\text{par}(i)},\sigma_i^2)$
- Multiplying the CPDs of all nodes gives a joint Gaussian: $p(\vec{x})=\mathcal{N}(\vec{x}\mid\vec{\mu},\mathbf{\Sigma})$, where with some manipulations we can calculate:
    - vector of outcomes $\vec{x}$ (center-shifted for mathematical convenience): $\vec{x}-\vec{\mu}=(\mathbf{I}-\mathbf{W})^{-1}\vec{e}=\mathbf{U}\mathbf{S}\vec{z}$, with the var chg: $\mathbf{U}=(\mathbf{I}-\mathbf{W})^{-1}$ and noise $\vec{e}=\mathbf{S}\vec{z}$, where $\mathbf{W}$ collects the regression weights $\vec{w}_i$, $\mathbf{S}=\operatorname{diag}(\sigma_i)$ and $\vec{z}\sim\mathcal{N}(\vec{0},\mathbf{I})$
    - covariance mat: $\operatorname{Cov}[\vec{x}-\vec{\mu}]=\operatorname{Cov}[\vec{x}]=\mathbf{U}\mathbf{S}^2\mathbf{U}^{\top}$
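
As a small worked instance of the covariance formula, assume a hypothetical two-node chain $x_1\rightarrow x_2$ with a single weight $w$ and noise std devs $\sigma_1,\sigma_2$ (these symbols are introduced here only for illustration):

$$
\mathbf{W}=\begin{pmatrix}0&0\\ w&0\end{pmatrix},\quad
\mathbf{S}=\begin{pmatrix}\sigma_1&0\\ 0&\sigma_2\end{pmatrix},\quad
\mathbf{U}=(\mathbf{I}-\mathbf{W})^{-1}=\begin{pmatrix}1&0\\ w&1\end{pmatrix}
$$

$$
\mathbf{\Sigma}=\mathbf{U}\mathbf{S}^2\mathbf{U}^{\top}
=\begin{pmatrix}1&0\\ w&1\end{pmatrix}
\begin{pmatrix}\sigma_1^2&0\\ 0&\sigma_2^2\end{pmatrix}
\begin{pmatrix}1&w\\ 0&1\end{pmatrix}
=\begin{pmatrix}\sigma_1^2 & w\sigma_1^2\\ w\sigma_1^2 & w^2\sigma_1^2+\sigma_2^2\end{pmatrix}
$$

This matches what we'd expect by hand: the child inherits $w^2\sigma_1^2$ of variance from its parent plus its own noise $\sigma_2^2$, and the two nodes covary by $w\sigma_1^2$.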
    
    
### 4.2.4 Conditional independence properties
- We say that $A$ is **Conditionally Independent** of $B$ given $C$ in graph $G$: $\vec{x}_A \perp_{G}\vec{x}_B\mid \vec{x}_C$
    - $I(G)$: set of all CI statements encoded in the graph and $I(p)$: set of CI statements that hold true in some dist $p$
    - iff $I(G)\subseteq I(p)$ (ie. the graph doesn't make CI assertions that don't hold in dist $p$) then we say $G$ is an independence map (**I-map**) for $p$, OR that $p$ is **Markov** wrt $G$
    - this lets us use the graph as a proxy for $p$'s CI properties regardless of the diversity of prob classes that may be involved
    - $G$ is a **minimal I-map** of $p$ when it's an I-map and there is no subgraph $G^{\prime}\subset G$ that is also an I-map of $p$
- Subsections below explore how to derive $I(G)$, which properties are defined by DAG

#### 4.2.4.1 Global Markov properties (d-separation)
- Graph rules to recognize CI in sets of nodes (see Fig.4.4 in book)
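- An *undirected path* $P$ is **d-separated** by a set of observed nodes $C$ iff at least one of the following three cases holds
    - i) $P$ contains a directed chain/pipe $s\rightarrow m\rightarrow t$ (or $s\leftarrow m\leftarrow t$) whose middle node $m\in C$. ii) $P$ contains a fork $s\leftarrow m\rightarrow t$ with $m\in C$. iii) $P$ contains a collider/v-structure $s\rightarrow m\leftarrow t$ where neither $m$ nor any of its descendants is in $C$ (observing $m$ or one of its descendants would instead make its parents dependent, which yields **explaining away/Berkson's paradox**)
- $A$ is d-separated from $B$ given $C$ iff every undirected path from a node in $A$ to a node in $B$ is d-separated by $C$. With this we can write the **global Markov property**: $X_A\perp X_B\mid X_C$ $\Leftrightarrow$ $A$ is d-separated from $B$ given observed $C$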
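
A sketch of a d-separation check. Note that it does not enumerate paths with the three rules above; it uses an equivalent, well-known graph test instead: restrict the DAG to the ancestors of $A\cup B\cup C$, moralize it (connect co-parents, drop directions), delete $C$, and check whether $A$ can still reach $B$. The function names are hypothetical and the example graph is the student network.

```python
def ancestral_set(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen


def d_separated(parents, a, b, c):
    """True if node sets a and b are d-separated given c in the DAG."""
    keep = ancestral_set(parents, set(a) | set(b) | set(c))
    # moralize the ancestral subgraph
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents[child])       # all parents are inside `keep`
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
            for q in ps:                # "marry" co-parents
                if q != p:
                    nbrs[p].add(q)
    # remove the observed nodes and test reachability from a to b
    blocked = set(c)
    seen, stack = set(), [n for n in a if n not in blocked]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(m for m in nbrs[n] if m not in blocked and m not in seen)
    return not (seen & set(b))


student = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
print(d_separated(student, {"D"}, {"I"}, set()))    # True: collider G unobserved
print(d_separated(student, {"D"}, {"I"}, {"G"}))    # False: explaining away
print(d_separated(student, {"D"}, {"I"}, {"L"}))    # False: descendant of G observed
print(d_separated(student, {"D"}, {"L"}, {"G"}))    # True: chain blocked by G
```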

#### 4.2.4.2 Explaining away (Berkson's paradox)
- related to **sampling bias** (selection bias), eg:
    - if we run 100 experiments of two coin tosses, but ONLY RECORD those with at least one head, we'd register approx 75 datapoints (3 of the 4 equally likely outcomes), and within the recorded data the two coins look dependent even though they were tossed independently
    - another example: two independent Normal rvs $x$ and $y$ with a common Normal child $z$, ie. $p(x,y,z)=p(x)p(y)p(z\mid x,y)$, appear correlated once we condition on a truncated measurement of the child, $p(x,y\mid z>2.5)$ (Fig.4.6 in book)
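
The coin example can be checked by exact enumeration rather than simulation. This is a minimal sketch assuming two fair, independent coins; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))        # 4 equally likely outcomes
recorded = [o for o in outcomes if "H" in o]    # keep only >= 1 head

# fraction of experiments that get recorded
p_recorded = Fraction(len(recorded), len(outcomes))

# within the recorded data: p(coin1 = H)
p_first = Fraction(sum(o[0] == "H" for o in recorded), len(recorded))

# within the recorded data: p(coin1 = H | coin2 = H)
second_heads = [o for o in recorded if o[1] == "H"]
p_first_given_second = Fraction(sum(o[0] == "H" for o in second_heads),
                                len(second_heads))

print(p_recorded)             # 3/4
print(p_first)                # 2/3
print(p_first_given_second)   # 1/2
```

Since $2/3\neq 1/2$, learning that the second coin shows heads changes our belief about the first coin, so the selection step has induced a dependence between two independent coins.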


#### 4.2.4.3 Markov blankets
- The **Markov blanket** $\text{mb}(i)$ is the smallest set of nodes that renders node $i$ CI of all the other nodes $X_{-i}$ in the graph
    - $\text{mb}(i)=\text{ch}(i) \cup \text{par}(i) \cup \text{copar}(i)$ : considering child $\text{ch}(i)$, parent $\text{par}(i)$ and co-parent $\text{copar}(i)$ nodes
    - key result: the Markov blanket only involves nodes that are in the "scope" of $X_i$, yet it gives CI from the whole rest of the graph! (eq.4.27-31, terms that don't involve $X_i$ cancel out): $p(X_i\mid X_{-i})\propto p\left(X_i\mid\text{par}(X_i)\right) \prod_{Y_{j}\in\text{ch}(X_i)}p\left(Y_j\mid\text{par}(Y_j)\right)$
    - then the **full conditional** follows: $p(x_i\mid\vec{x}_{-i})=p\left(x_i\mid\vec{x}_{\text{mb}(i)}\right)\propto p\left(x_i\mid\vec{x}_{\text{par}(i)}\right)\prod_{k\in\text{ch}(i)}p\left(x_k\mid\vec{x}_{\text{par}(k)}\right)$
        - this is connected to Gibbs sampling (eq.12.19) and Mean Field Variational Inference (eq.10.87)
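
Reading the Markov blanket off a DAG is mechanical, as in this small sketch. The function name is hypothetical and the graph is the student network stored as a node-to-parents mapping.

```python
def markov_blanket(parents, node):
    """mb(i) = parents(i) | children(i) | co-parents(i)."""
    children = {c for c, ps in parents.items() if node in ps}
    # co-parents: other parents of this node's children
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | co_parents


student = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
print(sorted(markov_blanket(student, "G")))   # ['D', 'I', 'L']
print(sorted(markov_blanket(student, "I")))   # ['D', 'G', 'S']
print(sorted(markov_blanket(student, "L")))   # ['G']
```

Note how D enters the blanket of I only as a co-parent through their shared child G.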


#### 4.2.4.4 Other Markov properties
- Basically these are the foundations to establish how to treat joint posteriors (as full joints or conditional)
- We have three key properties to reason about CI in DPGMs when focusing on a specific node $i$. The notation $A\diagdown B$ means the set $A$ with the elements of $B$ removed (we assume $B\subseteq A$)
    - 1. (G) *Global Markov property* (Sec.4.2.4.1): $X_A\perp X_B\mid X_C$ $\Leftrightarrow$ $A$ is d-separated from $B$ given observed $C$
    - 2. (L) *Local Markov property*: $i\perp \text{nd}(i)\diagdown\text{par}(i)\mid\text{par}(i)$, where $\text{nd}(i)$ are the non-descendants of $i$
    - 3. (O) *Ordered Markov property*: $i\perp \text{pred}(i)\diagdown \text{par}(i)\mid \text{par}(i)$, where $\text{pred}(i)$ are the predecessors of $i$ in the topological order
- These three properties are equivalent: $G \Rightarrow L \Rightarrow O$ and, less obviously, $O\Rightarrow L \Rightarrow G$ as well


### 4.2.5 Generation (sampling)
- Sampling is easy in DPGMs: **ancestral sampling** visits the nodes in *topological order* (parents first, then children given their sampled parents)
    - following this we are guaranteed to get independent samples from the joint $(x_1,\ldots,x_{N_G})\sim p(\vec{x}\mid\vec{\theta})$
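
A minimal sketch of ancestral sampling on a small discrete net. The three-node graph D → G ← I, its binary states, and all CPT numbers are assumptions for illustration; the topological order comes from the standard-library `graphlib` module (Python 3.9+).

```python
import random
from graphlib import TopologicalSorter

# node -> tuple of parents (hypothetical fragment of the student network)
parents = {"D": (), "I": (), "G": ("D", "I")}

# cpt[node][joint parent state] = distribution over the node's states
cpt = {
    "D": {(): [0.6, 0.4]},
    "I": {(): [0.7, 0.3]},
    "G": {(0, 0): [0.3, 0.7], (0, 1): [0.05, 0.95],
          (1, 0): [0.7, 0.3], (1, 1): [0.2, 0.8]},
}

# TopologicalSorter takes node -> predecessors, so parents always come first
order = list(TopologicalSorter(parents).static_order())


def ancestral_sample(rng):
    """Sample every node once, each conditioned on its sampled parents."""
    sample = {}
    for node in order:
        j = tuple(sample[p] for p in parents[node])
        probs = cpt[node][j]
        sample[node] = rng.choices(range(len(probs)), weights=probs)[0]
    return sample


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(5000)]
freq_I = sum(s["I"] for s in samples) / len(samples)
print(order)
print(freq_I)   # should be close to p(I = 1) = 0.3
```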

### 4.2.6 Inference
- Notation for this sub-section: $Q$ are query nodes, $V$ are visible (observed) nodes and the remaining nuisance nodes are $R=\{1,\ldots,N_G\}\diagdown (Q\cup V)$ (can represent hparams or noise).
- The posterior marginal for the query nodes $Q$ (sums for discrete rvs, integrals for continuous ones):
    - we want to infer $Q$ given $V$ (directly from the definition of conditional prob) and marginalize out nuisance vars: $p_{\vec{\theta}}(Q\mid V)=\frac{p_{\vec{\theta}}(Q,V)}{p_{\vec{\theta}}(V)}=\frac{\sum_R p_{\vec{\theta}}(Q,V,R)}{p_{\vec{\theta}}(V)}$
    - alternatively, if we don't want to single out the nuisance vars, we treat everything unobserved as hidden vars $H=Q\cup R$ and write the post dist as: $p_{\vec{\theta}}(H\mid V)=\frac{p_{\vec{\theta}}(H,V)}{p_{\vec{\theta}}(V)}=\frac{p_{\vec{\theta}}(H,V)}{\sum_{H^\prime}p_{\vec{\theta}}(H^\prime,V)}$
- Unfortunately, this is **NP-hard** in general! 
    - efficient exact solutions exist only for certain graph structures, eg. chains, trees and other sparse graphs
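
For tiny discrete nets the posterior marginal can be computed by brute-force enumeration, which also makes the cost visible: the sum over nuisance nodes grows exponentially with their number. The sketch below assumes a hypothetical binary fragment D → G ← I with made-up CPTs, query $Q=\{I\}$, visible $V=\{G\}$ and nuisance $R=\{D\}$.

```python
# Hypothetical binary CPTs (all numbers are assumptions for illustration)
p_D = [0.6, 0.4]
p_I = [0.7, 0.3]
p_G = {(0, 0): [0.3, 0.7], (0, 1): [0.05, 0.95],
       (1, 0): [0.7, 0.3], (1, 1): [0.2, 0.8]}   # p_G[(d, i)][g]


def joint(d, i, g):
    """p(D, I, G) from the DAG factorization."""
    return p_D[d] * p_I[i] * p_G[(d, i)][g]


def posterior_I_given_G(g_obs):
    """p(I | G = g_obs), marginalizing the nuisance node D."""
    # numerator: sum over R of p(Q, V, R), one entry per query state
    unnorm = [sum(joint(d, i, g_obs) for d in (0, 1)) for i in (0, 1)]
    evidence = sum(unnorm)                       # p(V)
    return [u / evidence for u in unnorm]


post = posterior_I_given_G(1)
print(post)   # post[1] = 0.267 / 0.645, up from the prior p(I = 1) = 0.3
```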

#### 4.2.6.1 Example: inference in the student network
- See book


### 4.2.7 Learning
- So far we've assumed that the graph structure $G$ and the params $\vec{\theta}$ are known. It is possible to learn both from data $\mathcal{D}$ (=$V$); here we focus on learning $\vec{\theta}$, assuming $G$ is fixed
    - so the target is (as usual) the posterior $p(\vec{\theta}\mid\mathcal{D})$, but in practice it is often easier (or the only feasible option) to compute a point estimate $\hat{\vec{\theta}}_{\text{MAP}/\text{MLE}}$
    - turns out that $\hat{\vec{\theta}}$ is often not a bad approx, since the params depend on all the data, as opposed to per-datapoint hidden vars that each depend on a single observation and are therefore more uncertain
    
#### 4.2.7.1 Learning from complete data
- Let's explore the example of a supervised generative classifier (Fig.4.9)
    - where we have $N$ observations (shaded nodes). All data is complete: for every $n$ we observe both i) the label $y_n$ and ii) the features $\vec{x}_n$ that it conditions. Additionally, the global params (shared across all observations) are: $\vec{\theta}_x$ and $\vec{\theta}_y$

<img src='images/ch0427-dpgm-example.png' width='70%'>

- Following the CI properties from the graph we can write the joint dist (factorizing the corresponding CI nodes): $p(\vec{\theta},\mathcal{D})=p(\vec{\theta}_x)p(\vec{\theta}_y)\left[\prod_{n=1}^{N}p(y_n\mid\vec{\theta}_y)p(\vec{x}_{n}\mid\vec{\theta}_x, y_n) \right]=\ldots=\left[p(\vec{\theta}_y)p(\mathcal{D}_y\mid\vec{\theta}_y)\right]\left[p(\vec{\theta}_x)p(\mathcal{D}_x\mid\vec{\theta}_x)\right]$
    - where $\mathcal{D}_y=\{y_n\}_{n=1}^{N}$ is the data needed to estimate $\vec{\theta}_y$ and $\mathcal{D}_x=\{\vec{x}_{n},y_n\}_{n=1}^{N}$ is the data needed to estimate $\vec{\theta}_x$
    - we see that things factorize nicely in a familiar format (prior $\times$ likelihood, for each)! 
    - so the posterior factorizes too and we can compute it for each CPD independently: $p(\vec{\theta}\mid\mathcal{D})\propto p(\vec{\theta},\mathcal{D})=\prod_{i=1}^{N_G}p(\vec{\theta}_i)p(\mathcal{D}_i\mid\vec{\theta}_i)\ \Rightarrow\ p(\vec{\theta}\mid\mathcal{D})=\prod_{i=1}^{N_G}p(\vec{\theta}_i\mid\mathcal{D}_i)$
    - and a point approx, eg. MLE: $\hat{\vec{\theta}}=\operatorname{argmax}_{\vec{\theta}}\prod_{i=1}^{N_G}p(\mathcal{D}_i\mid\vec{\theta}_i)$, can be computed for each node independently (see next Sec.4.2.7.2)
    
    
#### 4.2.7.2 Example: computing the MLE for CPTs
- We'll speedrun through this section. The most general expression of the likelihood (generalizing the previous example) is a product over the $N$ observations and over all $N_G$ nodes: $p(\mathcal{D}\mid\vec{\theta})=\prod_{n=1}^{N}\prod_{i=1}^{N_G}p(x_{n,i}\mid\vec{x}_{n,\text{par}(i)},\vec{\theta}_i)$
    - for discrete nodes the params are the CPT entries: $\theta_{ijk}=p(x_i=k\mid \vec{x}_{\text{par}(i)}=\vec{j})$ ie. node $i$ is in state $k$ while its parent nodes are in the joint state $\vec{j}$
    - then the *sufficient stats* are the counts of each configuration: $N_{ijk}=\sum_{n=1}^{N}\mathbb{I}(x_{n,i}=k,\ \vec{x}_{n,\text{par}(i)}=\vec{j})$
    - the MLE then is: $\hat{\theta}_{ijk}=\frac{N_{ijk}}{\sum_{k^\prime}N_{ijk^\prime}}$
- A huge problem is, again, sparsity: with small sample sizes many counts $N_{ijk}$ are zero (the *zero-count* problem), so the MLE overfits. See next Sec.4.2.7.3 for Bayesian solutions to this
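
The count-and-normalize MLE, and the zero-count issue, can be seen on a toy dataset. This sketch covers a single node with $K=2$ states and two joint parent states; the data are made up and the names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical complete data for one node i: (joint parent state j, node state k)
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1)]
K = 2

counts = Counter(data)   # counts[(j, k)] = N_ijk for this node


def mle_row(j):
    """theta_hat[j][k] = N_jk / sum_k' N_jk' for one row of the CPT."""
    row_total = sum(counts[(j, k)] for k in range(K))
    return [Fraction(counts[(j, k)], row_total) for k in range(K)]


print(mle_row(0))   # [Fraction(2, 3), Fraction(1, 3)]
print(mle_row(1))   # [Fraction(0, 1), Fraction(1, 1)]
```

Row `j = 1` assigns probability exactly 0 to state 0 simply because it never occurred in five observations, which is the zero-count problem in miniature.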


#### 4.2.7.3 Example: Computing posterior for CPTs
- In the last section we've seen how to obtain a CPT for a discrete Bayes net. The problem was *zero-count*. Here, we see a Bayesian workaround using *Dirichlet priors* on every row $\vec{\theta}_{ij}\sim\operatorname{Dir}(\vec{\alpha}_{ij}) \Rightarrow \vec{\theta}_{ij}\mid\mathcal{D}\sim\operatorname{Dir}(\mathbf{N}_{ij}+\vec{\alpha}_{ij})$ 
    - where $N_{ijk}$ is num of times node $i$ is in state $k$ while its parents are in joint state $\vec{j}$
    - we compute the posterior mean by basically adding pseudocounts to the empirical counts: $\bar{\theta}_{ijk}=\frac{N_{ijk}+\alpha_{ijk}}{\sum_{k^\prime}(N_{ijk^\prime}+\alpha_{ijk^\prime})}$
    - MAP uses $\alpha_{ijk}-1$ instead of just $\alpha_{ijk}$
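
A small sketch of the pseudocount arithmetic for one CPT row. The counts and the symmetric prior strength $\alpha=2$ are made-up values, and the function names are hypothetical.

```python
from fractions import Fraction


def posterior_mean_row(n_row, alpha):
    """(N_k + alpha) / sum_k' (N_k' + alpha) for a symmetric Dirichlet prior."""
    total = sum(n + alpha for n in n_row)
    return [Fraction(n + alpha, total) for n in n_row]


def map_row(n_row, alpha):
    """Same as the posterior mean but with alpha - 1 pseudocounts (needs alpha >= 1)."""
    total = sum(n + alpha - 1 for n in n_row)
    return [Fraction(n + alpha - 1, total) for n in n_row]


n_row = [0, 2]   # hypothetical counts with a zero entry
print(posterior_mean_row(n_row, 2))   # [Fraction(1, 3), Fraction(2, 3)]
print(map_row(n_row, 2))              # [Fraction(1, 4), Fraction(3, 4)]
```

Unlike the plain MLE, neither estimate assigns probability 0 to the state that was never observed.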


#### 4.2.7.4 Learning from incomplete data
- If we have incomplete or missing data we can no longer decompose the likelihood or the posterior into per-CPD terms based on CI, as opposed to Sec.4.2.7.1!
    - Fig.4.10 (in book) shows this; it is basically the same graph as Fig.4.9 with renamed nodes $\vec{y}_n\rightarrow \vec{z}_n$ that are now hidden variables (not observed, so not shaded)
    - the likelihood can be written as a prod over the $N$ observed $\vec{x}_n$ nodes, where for each $n$ we marginalize over its hidden node $\vec{z}_n$ (the classification label in this example): $p(\mathcal{D}\mid\vec{\theta})=\prod_{n=1}^{N}\sum_{\vec{z}_{n}}p(\vec{z}_{n}\mid\vec{\theta}_{z})p(\vec{x}_{n}\mid\vec{z}_{n},\vec{\theta}_x)$
    - since the log can't be pushed inside the sum over hidden nodes, the log-likelihood $l(\vec{\theta})=\sum_{n=1}^{N}\log{\left[\sum_{\vec{z}_{n}}\ldots\right]}$ doesn't decompose into one term per CPD. We can't compute the MLE nor posterior for each node independently! 
    - this is where optimization methods (eg. expectation maximization, EM) come to save the situation 
        - we'll focus on optimization methods for MLE and leave Bayesian inference for later chapters!
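
To make the coupling concrete, here is a sketch of the incomplete-data log-likelihood for a tiny discrete model with a binary hidden label $z_n$ and a binary observation $x_n$. All parameter values and data are made up; the point is only that both `theta_z` and `theta_x` appear inside the same logarithm.

```python
import math

# Hypothetical parameters (assumptions for illustration)
theta_z = [0.5, 0.5]                 # p(z)
theta_x = [[0.9, 0.1],               # theta_x[z][x] = p(x | z)
           [0.2, 0.8]]


def log_likelihood(xs):
    """l(theta) = sum_n log sum_z p(z) p(x_n | z)."""
    total = 0.0
    for x in xs:
        # marginalize the hidden label for this observation
        marginal = sum(theta_z[z] * theta_x[z][x] for z in range(len(theta_z)))
        total += math.log(marginal)
    return total


print(log_likelihood([0, 1, 1]))
```

With complete data the log would act on each factor separately and the terms for `theta_z` and `theta_x` would split apart; here the inner sum prevents that.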
        
        
        
#### 4.2.7.5 Using EM to fit CPTs in the incomplete data case
- 