$\newcommand{\bs}[1]{\boldsymbol{#1}}$
$\renewcommand{\vec}[1]{\bs{#1}}$

# 4 Graphical Models

## 4.1 Introduction
- Two key principles for building learning models are *modularity* and *abstraction*; probability theory brings both under a unified approach
- **Probabilistic Graphical Models (PGMs)** are a math formalism for representing and reasoning about joint prob dists
    - Based on graphs where nodes are rvs and edges (or lack of edges) represent conditional independence assumptions between rvs


## 4.2 Directed graphical models (Bayes nets)
- *Directed Probabilistic Graphical Models (DPGMs)* are PGMs built on *directed acyclic graphs (DAGs)*, aka. **Bayes Nets/belief networks**
    - fun fact: there is nothing inherently Bayesian about them, it's just a model for reasoning about prob dists


### 4.2.1 Representing the joint distribution
- A nice property of DAGs is that nodes can be put in a *topological order*, where parents always come before their children. Given such an order, each node $x_{i}$ is cond independent of its other predecessors given its parents (the *ordered Markov property*): $x_{i} \perp \vec{x}_{\text{pred}(i)\setminus\text{par}(i)}\mid \vec{x}_{\text{par}(i)}$
- Thus, applying the chain rule of probs over $N_G$ nodes, the joint dist is: $p(\vec{x}_{1:N_G})=\prod_{i=1}^{N_G}p(x_i\mid\vec{x}_{\text{par}(i)})$
    - where $p(x_i\mid\vec{x}_{\text{par}(i)})$ is the *Conditional Prob Dist (CPD)* for node $i$
    - KEY ADVANTAGE of expressing dists this way: the number of parameters needed is significantly smaller.
        - eg. if $N_G$ is the number of nodes and each rv has $K$ discrete states then in an *unstructured joint prob* we need $O(K^{N_G})$ params to specify the prob of every configuration
        - conversely, in a DAG each node depends only on its parents (say at most $N_{P}$ per node), so we only need $O(N_{G}K^{N_{P}+1})$ params
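
To get a feel for the size of that gap, here is a minimal sketch that counts free parameters in both cases. The function names are made up for illustration, and the DAG count is an upper bound that assumes every node has exactly $N_P$ parents.

```python
def unstructured_param_count(n_nodes: int, n_states: int) -> int:
    """Free parameters of a full joint table over n_nodes K-state variables."""
    # K^N entries, minus one because the table must sum to 1
    return n_states ** n_nodes - 1


def dag_param_count(n_nodes: int, n_states: int, max_parents: int) -> int:
    """Upper bound on free parameters when each node has at most max_parents parents."""
    # each CPT has K^max_parents rows, and each row has K - 1 free entries
    return n_nodes * n_states ** max_parents * (n_states - 1)


print(unstructured_param_count(10, 2))  # 2^10 - 1 = 1023
print(dag_param_count(10, 2, 2))        # 10 * 2^2 * 1 = 40
```

The unstructured count grows exponentially in the number of nodes, while the DAG count grows linearly in it (for a fixed maximum number of parents).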



### 4.2.2 Examples
- Examples of how DPGMs can be useful


#### 4.2.2.1 Markov chains
- A Markov chain factorizes the joint dist just like the one above (Sec.4.2.1), but now time dictates the node ordering
    - for a first-order Markov model (bigram): $p(\vec{x}_{1:T})=p(x_1)\prod_{t=2}^{T}p(x_t\mid x_{t-1})$
    - for a second-order Markov model (trigram): $p(\vec{x}_{1:T})=p(x_1, x_2)\prod_{t=3}^{T}p(x_t\mid \vec{x}_{t-2:t-1})$
    - in the first-order case the lookup table aka **Conditional Probability Table (CPT)** $\theta_{jk}=p(x_t=k\mid x_{t-1}=j)$ is bounded to $[0,1]$ & row-normalized ($\sum_{k}\theta_{jk}=1$); the second-order case needs a 3d table $\theta_{ijk}=p(x_t=k\mid x_{t-2}=i,x_{t-1}=j)$
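
As a small illustration of a first-order chain, where each state depends only on the previous one, the sketch below scores a sequence with an initial dist and a row-normalized CPT. The two-state numbers are invented for the example.

```python
import math


def markov_chain_log_prob(seq, init, trans):
    """log p(x_{1:T}) = log p(x_1) + sum_{t>=2} log p(x_t | x_{t-1})."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])  # trans[j][k] = p(x_t = k | x_{t-1} = j)
    return logp


# Hypothetical two-state chain (all numbers are made up)
init = [0.6, 0.4]
trans = [[0.9, 0.1],
         [0.2, 0.8]]

# every row of the CPT is a dist over the next state
assert all(abs(sum(row) - 1.0) < 1e-12 for row in trans)

p = math.exp(markov_chain_log_prob([0, 0, 1, 1], init, trans))
print(p)  # 0.6 * 0.9 * 0.1 * 0.8 = 0.0432 (up to float rounding)
```

Working in log space avoids underflow on long sequences, which is the usual reason to sum log-probs instead of multiplying probs.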
    
    
#### 4.2.2.2 The "student" network
- This is another example: a model of a student taking a class, built from 5 rvs (D: difficulty of the class, I: intelligence of the student, G: grade, L: recommendation letter, S: SAT score).
    - Joint prob: expand with the chain rule of probs in an order that respects the topology of the graph (Fig.4.2), then drop every conditioning var that the graph makes irrelevant (eg. in this case L is cond independent of all its other predecessors given its parent G):
    - $p(D, I, G, L, S)=p(L\mid S,G,D,I)p(S\mid G,D,I)p(G\mid D,I)p(D\mid I)p(I)=p(L\mid G)p(S\mid I)p(G\mid D,I)p(D)p(I)$
- In DPGMs formulation we can write the CPT for the $i$-th node as: $\theta_{ijk}=p(x_i=k\mid\vec{x}_{\text{par}(i)}=j)$, where we satisfy 
    - boundedness: $0\leq\theta_{ijk}\leq 1$
    - normalization: $\sum_{k=1}^{K_i}\theta_{ijk}=1$ for all $j$
    - $i\in[N_G]$ indexes nodes; $k\in[K_i]$ indexes node states ($K_i$ is num of states for $i$-th node); $j\in[J_i]$ indexes joint parent states ($J_i=\prod_{p\in\text{par}(i)}K_{p}$)
    - later on we'll see more parsimonious representations. So far we have the number of params in a CPT: $O(K^{p+1})$, where $K$ is the num of states per node and $p$ the num of parent nodes
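
The factorization of the student network can be checked numerically. The sketch below assumes, purely for illustration, that all five variables are binary and uses invented CPT entries. It builds the joint from the five CPTs, confirms it sums to 1, and confirms by brute force that $p(L\mid G, I)$ does not depend on $I$.

```python
from itertools import product

# Hypothetical CPTs: every variable is binary and every number is made up.
p_D = [0.6, 0.4]                            # p(D)
p_I = [0.7, 0.3]                            # p(I)
p_S_given_I = [[0.95, 0.05], [0.2, 0.8]]    # p(S | I), rows indexed by I
p_G_given_DI = {(0, 0): [0.3, 0.7], (0, 1): [0.1, 0.9],
                (1, 0): [0.7, 0.3], (1, 1): [0.4, 0.6]}  # p(G | D, I)
p_L_given_G = [[0.9, 0.1], [0.4, 0.6]]      # p(L | G), rows indexed by G


def joint(d, i, g, l, s):
    """p(D, I, G, L, S) = p(L | G) p(S | I) p(G | D, I) p(D) p(I)."""
    return (p_L_given_G[g][l] * p_S_given_I[i][s]
            * p_G_given_DI[(d, i)][g] * p_D[d] * p_I[i])


# the product of CPTs is a valid joint dist
total = sum(joint(*cfg) for cfg in product(range(2), repeat=5))


def cond_L1(g, i):
    """p(L=1 | G=g, I=i), by brute-force marginalization of the joint."""
    num = sum(joint(d, i, g, 1, s) for d, s in product(range(2), repeat=2))
    den = sum(joint(d, i, g, l, s) for d, l, s in product(range(2), repeat=3))
    return num / den


print(total)                         # 1 up to float rounding
print(cond_L1(1, 0), cond_L1(1, 1))  # both equal p(L=1 | G=1) = 0.6 up to rounding
```

Under this binary assumption the factorized model needs 1 + 1 + 2 + 4 + 2 = 10 free params, against $2^5-1=31$ for the unstructured joint.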
    
#### 4.2.2.3 Sigmoid belief nets
- A **sigmoid belief net** is a special case of a **deep generative model** (we'll discuss hierarchical deep gen models in Chapter.21) 
- (eg. Fig.4.3a) if we want to model two hidden layers (non-autoregressive) with $\vec{x}$ as visible nodes (shaded), $\vec{z}$ as hidden internal nodes ($K_l$ hidden nodes at $l$-th level), the joint prob is: $p(\vec{x},\vec{z})=p(\vec{z}_2)p(\vec{z}_1\mid\vec{z}_2)p(\vec{x}\mid\vec{z}_1)=\prod_{k=1}^{K_2}p(z_{2,k})\prod_{k=1}^{K_1}p(z_{1,k}\mid\vec{z}_2)\prod_{d=1}^{D}p(x_d\mid\vec{z}_1)$
    - the *sigmoid belief net* is the special case where all latent vars are binary and all latent CPDs are log-regs: $p(\vec{z}_l\mid\vec{z}_{l+1},\vec{\theta})=\prod_{k=1}^{K_l}\operatorname{Ber}(z_{l,k}\mid\sigma(\vec{w}_{l,k}^{\top}\vec{z}_{l+1}))$
    - and at the bottom layer we use whatever observation model fits the case, eg. normal: $p(\vec{x}\mid\vec{z}_1,\vec{\theta})=\prod_{d=1}^{D}\mathcal{N}\left(x_d\mid\vec{w}_{1,d,\mu}^{\top}\vec{z}_1, \exp(\vec{w}_{1,d,\sigma}^{\top}\vec{z}_1)\right)$
- Fig.4.3b adds directed connections between hidden nodes within each layer, giving a **Deep Autoregressive Network (DARN)**, which combines ideas from latent var modeling and autoregressive modeling
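
Because the two-layer model is a DAG, a sample can be drawn top-down (ancestral sampling): first $\vec{z}_2$, then $\vec{z}_1$, then $\vec{x}$. Below is a minimal sketch of this for the non-autoregressive sigmoid belief net with a normal bottom layer. The weights, layer sizes and function names are invented, the top layer is assumed to be independent Bernoullis, and the second argument of $\mathcal{N}$ is read as a variance.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))


def sample_sbn(prior_top, W1, W_mu, W_logvar, rng):
    """Ancestral sampling z_2 -> z_1 -> x for a two-layer sigmoid belief net."""
    # top layer: independent Bernoulli units with p(z_{2,k} = 1) = prior_top[k]
    z2 = [int(rng.random() < p) for p in prior_top]
    # middle layer: z_{1,k} ~ Ber(sigmoid(w_{1,k}^T z_2))
    z1 = [int(rng.random() < sigmoid(dot(w, z2))) for w in W1]
    # visible layer: x_d ~ N(w_mu^T z_1, exp(w_logvar^T z_1)); gauss() takes a std dev
    x = [rng.gauss(dot(wm, z1), math.sqrt(math.exp(dot(wv, z1))))
         for wm, wv in zip(W_mu, W_logvar)]
    return z2, z1, x


# Hypothetical sizes and weights: K_2 = 2, K_1 = 3, D = 2
rng = random.Random(0)
prior_top = [0.5, 0.2]
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]      # K_1 rows of length K_2
W_mu = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.5]]       # D rows of length K_1
W_logvar = [[0.0, 0.1, 0.0], [-0.5, 0.0, 0.2]]   # D rows of length K_1

z2, z1, x = sample_sbn(prior_top, W1, W_mu, W_logvar, rng)
print(z2, z1, x)
```

Sampling is the easy direction here; inferring $\vec{z}$ from an observed $\vec{x}$ is the hard part and is not covered by this sketch.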

<img src="images/ch04222-sigmoid-belief-nets.png" width="70%">

### 4.2.3 Gaussian Bayes nets
- When every CPD is linear Gaussian, the CPD for the $i$-th node is: $p(x_i\mid\vec{x}_{\text{par}(i)})=\mathcal{N}(x_i\mid\mu_i+\vec{w}_i^{\top}\vec{x}_{\text{par}(i)},\sigma_i^2)$
- Multiplying the CPDs of all nodes gives a joint Gaussian: $p(\vec{x})=\mathcal{N}(\vec{x}\mid\vec{\mu},\mathbf{\Sigma})$, where with some manipulations we can calculate:
    - vector of outcomes $\vec{x}$ (center-shifted for mathematical convenience): $\vec{x}-\vec{\mu}=(\mathbf{I}-\mathbf{W})^{-1}\vec{e}=\mathbf{U}\mathbf{S}\vec{z}$, with the change of vars $\mathbf{U}=(\mathbf{I}-\mathbf{W})^{-1}$ and noise $\vec{e}=\mathbf{S}\vec{z}$, where $\mathbf{W}$ holds the regression weights (zero for non-parents), $\mathbf{S}=\operatorname{diag}(\vec{\sigma})$ and $\vec{z}\sim\mathcal{N}(\vec{0},\mathbf{I})$
    - covariance mat: $\operatorname{Cov}[\vec{x}-\vec{\mu}]=\operatorname{Cov}[\vec{x}]=\mathbf{U}\mathbf{S}^2\mathbf{U}^{\top}$
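
The covariance formula can be tried on a tiny example. The sketch below uses a made-up 3-node chain $x_1\to x_2\to x_3$ and assumes nodes are numbered in topological order, so $\mathbf{W}$ is strictly lower triangular with `W[i][j]` the weight of parent $j$ on child $i$. In that case $\mathbf{W}$ is nilpotent and $(\mathbf{I}-\mathbf{W})^{-1}=\mathbf{I}+\mathbf{W}+\dots+\mathbf{W}^{N_G-1}$, which avoids a general matrix inverse in plain Python.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gaussian_bayes_net_cov(W, sigmas):
    """Sigma = U S^2 U^T with U = (I - W)^{-1}; W must be strictly lower triangular."""
    n = len(W)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    # U = I + W + W^2 + ... + W^(n-1): the series is finite because W is nilpotent
    U, power = eye, eye
    for _ in range(n - 1):
        power = matmul(power, W)
        U = [[u + p for u, p in zip(u_row, p_row)] for u_row, p_row in zip(U, power)]
    S2 = [[sigmas[i] ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Ut = [list(col) for col in zip(*U)]
    return matmul(matmul(U, S2), Ut)


# Hypothetical chain: x2 = 2 * x1 + e2 and x3 = 0.5 * x2 + e3 (after centering)
W = [[0.0, 0.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
sigmas = [1.0, 0.5, 2.0]

Sigma = gaussian_bayes_net_cov(W, sigmas)
for row in Sigma:
    print(row)
# [1.0, 2.0, 1.0]
# [2.0, 4.25, 2.125]
# [1.0, 2.125, 5.0625]
```

These entries match a by-hand check: $\operatorname{Var}[x_2]=2^2\cdot 1+0.5^2=4.25$ and $\operatorname{Var}[x_3]=0.5^2\cdot 4.25+2^2=5.0625$.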
    
    
### 4.2.4 Conditional independence properties
- We say that $A$ is **Conditionally Independent** of $B$ given $C$ in graph $G$: $\vec{x}_A \perp_{G}\vec{x}_B\mid \vec{x}_C$
    - $I(G)$: set of all CI statements encoded in the graph and $I(p)$: set of CI statements that hold true in some dist $p$
    - iff $I(G)\subseteq I(p)$ (ie. the graph doesn't make CI assertions that don't hold in dist $p$) then we say $G$ is an **I-map** (independence map) of $p$, OR that $p$ is **Markov** wrt $G$
    - this lets us use the graph as a proxy for $p$'s CI properties regardless of the diversity of prob classes that may be involved
    - $G$ is a **minimal I-map** of $p$ when it's an I-map and there is no $G^{\prime}\subsetneq G$ that is also an I-map of $p$
- Subsections below explore how to derive $I(G)$, ie. which CI properties are defined by a DAG