### Notes

PGM Chapter 12: Particle-Based Approximate Inference

Section 12.3
- Markov chain methods apply equally well to directed and undirected models. The algorithm is easier to present in the context of a distribution $P_\Phi$ defined in terms of a general set of factors $\Phi$.
- To apply Gibbs Sampling to a network with evidence, we first reduce all of the factors by the observations $\mathbf{e}$, so that the distribution $P_\Phi$ used in the algorithm corresponds to $P_\Phi(\mathbf{X}|\mathbf{e})$.
- Markov Chain: a Markov chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk. In the case of graphical models, this graph is not the original graph, but rather a graph whose nodes are the possible assignments to our variables $\mathbf{X}$.
- Markov chain Monte Carlo (MCMC) sampling is a process that mirrors the dynamics of the Markov chain; the process of generating an MCMC trajectory is shown in Algorithm 12.5. The sample $\mathbf{x}^{(t)}$ is drawn from the distribution $P^{(t)}$. We are interested in the limit of this process, that is, whether $P^{(t)}$ converges, and if so, to what limit.
- We have that $P^{(t+1)}(\mathbf{x}') = \sum_{\mathbf{x} \in Val(\mathbf{X})}P^{(t)}(\mathbf{x})\mathcal{T}(\mathbf{x}\rightarrow \mathbf{x'})$
- For large $t$, we expect $P^{(t)}(\mathbf{x}') \approx P^{(t+1)}(\mathbf{x}')$. A distribution that satisfies this exactly is called a stationary (or invariant) distribution. We may refer to it with $\pi(\mathbf{x}')$.
- There is no guarantee that the stationary distribution is unique. Sometimes it depends on the starting state. Situations like this occur when the chain has several distinct regions that are not reachable from each other. Chains such as this are called *reducible Markov Chains*. Convergence can also fail because of fixed cyclic behavior. Such chains are called *periodic Markov Chains*.
- We want to restrict attention to Markov Chains that have a unique stationary distribution that is reached from *any* starting distribution $P^{(0)}$. The most commonly used condition to guarantee this behavior is that the chain is *ergodic*.
- Ergodic: when the state space $Val(\mathbf{X})$ is finite, an equivalent condition (though not defined this way?) is regularity. A Markov chain is said to be *regular* if there exists some number $k$ such that for every $\mathbf{x}', \mathbf{x} \in Val(\mathbf{X})$, the probability of getting from $\mathbf{x}$ to $\mathbf{x}'$ in exactly $k$ steps is greater than zero.
- Theorem: if a finite state Markov Chain $\mathcal{T}$ is regular, then it has a unique stationary distribution.
- Ensuring regularity is usually easy - you need two conditions:
    - it is possible to get from any state to any state using a positive probability path in the state graph
    - There is a non-zero chance of remaining in a state, for every state
- Those two conditions are sufficient but not necessary. They typically hold in practice.
- For graphical models, each state is an assignment to many variables. When defining a transition model over this space, we can consider a fully general case, where a transition is from any state to any other state. However, it is often convenient to decompose the transition model, considering transitions that update only a single component of the state vector at a time, that is, only the value of a single variable.
    - Consider an extension to our Grasshopper chain, where the grasshopper lives, not on a line, but in a two-dimensional plane. In this case, the state of the system is defined via a pair of random variables $X, Y$. Although we could define a joint transition model over both dimensions simultaneously, it might be easier to have separate transition models for the $X$ and $Y$ coordinates.
- In such a case, we often define a set of transition models, $\mathcal{T}_i$, called kernels. These may help us guarantee regularity or speed convergence. To sample, we add an intermediate step that picks which $\mathcal{T}_i$ to apply. Taken together, the kernels can be thought of as defining a single global $\mathcal{T}$.
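
To make the update rule and the regularity condition from these notes concrete, here is a minimal sketch in plain Python. The 3-state transition matrix `T` and the helper names `step`, `propagate` and `is_regular` are made up for illustration and are not from the book. It repeatedly applies $P^{(t+1)}(\mathbf{x}') = \sum_{\mathbf{x}}P^{(t)}(\mathbf{x})\mathcal{T}(\mathbf{x}\rightarrow \mathbf{x}')$ from two different starting distributions and checks regularity by brute force.

```python
def step(p, T):
    """One application of P^(t+1)(x') = sum_x P^(t)(x) * T(x -> x')."""
    n = len(p)
    return [sum(p[x] * T[x][x_next] for x in range(n)) for x_next in range(n)]


def propagate(p0, T, num_steps):
    """Apply the update rule num_steps times, starting from p0."""
    p = list(p0)
    for _ in range(num_steps):
        p = step(p, T)
    return p


def is_regular(T, max_k=50):
    """Brute-force check: does some power T^k (k <= max_k) have all entries > 0?"""
    n = len(T)
    power = [row[:] for row in T]
    for _ in range(max_k):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = [
            [sum(power[i][m] * T[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return False


# Hypothetical 3-state chain (assumption for illustration); each row sums to 1.
T = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

from_left = propagate([1.0, 0.0, 0.0], T, 200)
from_right = propagate([0.0, 0.0, 1.0], T, 200)

# A chain that deterministically swaps two states is periodic, not regular.
periodic = [[0.0, 1.0], [1.0, 0.0]]
```

For this particular `T`, both starting points should end up at the same distribution, which is what the uniqueness theorem promises for a regular chain. The `max_k` cutoff is an arbitrary choice that is enough for tiny examples like this one.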


### Answer Structure

What should be mentioned?
- Gibbs rule from chain rule
- Markov Chains are between states
- Markov Chain may be badly behaved
- Maybe: The big idea is MCMC and Gibbs Sampling is the one algo?
- 

Outline
- Introduction
- Refresher?
    - What is a PGM?
    - Conversion to Gibbs Rule from Chain Rule
- Define MCMC
- Explain Metropolis-Hastings
- Gibbs 
    

# How are Monte Carlo methods used to perform inference in Probabilistic Graphical Models?

(This is the fourth answer in a seven-part series on Probabilistic Graphical Models ('PGMs'), though the first answer [link] is all you need to understand this one.)

So far, our running definition of inference has been:

"The task of using a given model of a system, complete with fitted parameters, to answer questions regarding that system."

Here, we'll discuss a general class of algorithms designed to *approximate* these answers with simulations. That is, we'll get a set of samples drawn from a distribution which approximates the distribution we're asking about.

I enjoy these techniques the most, and for one reason only: their *generality*. Exact inference algorithms demand that the graphs be sufficiently simple and factorizable. Variational inference (a type of approximate approach [link to answer 3]) demands defining spaces of approximate distributions and a means to search them effectively. Monte Carlo methods, however, demand no qualifying inspection of the graph - all graphs are fair game. This provides us with a much wider class of models to fit our reality.

This isn't to say such methods are a cure-all. Yes, there are circumstances in which we fail to get answers. But the reason for a failure, beyond the vague 'lack of convergence', is often not well understood. So we accept these problem-specific battles as the cost of supreme generality.

But before we dive in, a short review will help.

### Refresher

(This refresher is different.. we need to update the old refreshers to sound similar)

Our task is to understand a system of $n$ random variables ('RVs'), which we refer to with $\mathcal{X} = \{X_1,\cdots,X_n\}$. We take it that there exists some true but unknown joint distribution, $P$, which governs these RVs. Our goal is to answer two types of questions regarding this $P$:

1. **Probability Queries**: Compute the probabilities $P(\mathbf{Y}|\mathbf{E}=\mathbf{e})$. What is the distribution of the RVs of $\mathbf{Y}$ given we have some observation ($\mathbf{e}$) of the RVs of $\mathbf{E}$?
2. **MAP Queries**: Determine $\textrm{argmax}_\mathbf{y}P(\mathbf{Y}=\mathbf{y}|\mathbf{E}=\mathbf{e})$. That is, determine the most likely values of some RVs given an assignment of other RVs.

(Where $\mathbf{E}$ and $\mathbf{Y}$ are two arbitrary non-overlapping subsets of $\mathcal{X}$ and $\mathbf{e}$ is an observed assignment of $\mathbf{E}$. If this notation is unfamiliar, see the 'Notation Guide' section from the first answer [link]).

The idea behind PGMs is to estimate $P$ using two things:

1. Graph: a set of nodes, each of which represents an RV from $\mathcal{X}$, and a set of edges between these nodes.
2. Parameters: objects that, when paired with a graph and a certain rule, allow us to calculate probabilities of assignments of $\mathcal{X}$.

PGMs fall into two categories, Bayesian Networks ('BNs') and Markov Networks ('MNs'), depending on the specifics of these two.

A **Bayesian Network** involves a graph, denoted as $\mathcal{G}$, with *directed* edges and no directed cycles. So $\mathcal{G}$ is a DAG [link]. The parameters are Conditional Probability Distributions ('CPDs'), often given as Conditional Probability Tables ('CPTs'). These supply the terms on the right hand side of the Chain Rule, which dictates how we calculate probabilities according to a BN:

$$
P_{B}(X_1,\cdots,X_n)=\prod_{i=1}^n P_{B}(X_i|\textrm{Pa}_{X_i}^\mathcal{G})
$$
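
As a quick illustration of the Chain Rule, here is a minimal sketch in plain Python. The two-node network $I \rightarrow G$, its CPD numbers and the helper name `bn_joint` are all invented for this example. It multiplies one CPD entry per variable, each looked up by the values of that variable's parents.

```python
def bn_joint(assignment, parents, cpds):
    """Chain Rule: multiply P(X_i | Pa_{X_i}) over every variable X_i."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpds[var][parent_values][value]
    return prob


# Hypothetical network I -> G with binary variables (assumption for illustration).
parents = {"I": (), "G": ("I",)}
cpds = {
    # cpds[var][parent values][value] = P(var = value | parents = parent values)
    "I": {(): {0: 0.7, 1: 0.3}},
    "G": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}

p_11 = bn_joint({"I": 1, "G": 1}, parents, cpds)  # 0.3 * 0.9
total = sum(
    bn_joint({"I": i, "G": g}, parents, cpds) for i in (0, 1) for g in (0, 1)
)
```

Because every CPD row sums to one, the joint probabilities sum to one without any extra normalizer, which is the main practical difference from the Gibbs Rule.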

A **Markov Network**'s graph, denoted as $\mathcal{H}$, is different in that its edges are *undirected* and we may have cycles. The parameters are *functions* which map assignments of subsets of $\mathcal{X}$ to positive (nonnegative?) numbers. Those subsets, which we'll call $\mathbf{D}_i$'s, correspond to *complete subgraphs* of $\mathcal{H}$ and their union makes up the whole of $\mathcal{H}$. If we say there are $m$ of these functions, we can refer to this set as $\Phi=\{\phi_i(\cdots)\}_{i=1}^m$. With that, we say that the 'Gibbs Rule' for calculating probabilities is:

$$
P_M(X_1,\cdots,X_n) = \frac{1}{Z} \prod_{i = 1}^m \phi_i(\mathbf{D}_i)
$$

where $Z$ is a normalizer. Conceptually, it's helpful to picture the 'Gibbs Table', which lists out all unnormalized probabilities (denoted as $\tilde{P}_M(\cdots)$) for all possible assignments to $\mathcal{X}$. In an example system, $\mathcal{X}=\{C,D,I,G,S\}$, we thought of it like this:

![Title](GibbsTable_labeled.png)

From this angle, *conditioning on some observation $\mathbf{E}=\mathbf{e}$* is conceptually simple. Just filter this table to all assignments that agree with $\mathbf{E}=\mathbf{e}$ and we get the conditional distribution $P_M(\cdots|\mathbf{e})$. Since this is just a new MN, we refer to its probabilities, factors and normalizer with $P_{M|\mathbf{e}},\Phi_\mathbf{|e}$ and $Z_\mathbf{|e}$ respectively.
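
To see the Gibbs Table and the 'filter to condition' idea in action, here is a minimal sketch in plain Python. The three binary variables, the two pairwise factors and the helper names are all invented for illustration, and brute-force enumeration like this is only feasible for tiny systems.

```python
from itertools import product

# Hypothetical MN A - B - C over binary variables (assumption for illustration).
variables = ["A", "B", "C"]
factors = [
    (("A", "B"), lambda a, b: 3.0 if a == b else 1.0),
    (("B", "C"), lambda b, c: 2.0 if b == c else 1.0),
]


def unnormalized(assignment):
    """Gibbs Rule without the 1/Z: the product of all factor values."""
    value = 1.0
    for scope, phi in factors:
        value *= phi(*(assignment[name] for name in scope))
    return value


def gibbs_table():
    """Map every joint assignment (as a tuple) to its unnormalized probability."""
    table = {}
    for values in product([0, 1], repeat=len(variables)):
        table[values] = unnormalized(dict(zip(variables, values)))
    return table


def condition(table, evidence):
    """Keep only the rows that agree with the evidence, then renormalize."""
    kept = {
        row: weight
        for row, weight in table.items()
        if all(row[variables.index(name)] == val for name, val in evidence.items())
    }
    z_e = sum(kept.values())
    return {row: weight / z_e for row, weight in kept.items()}


table = gibbs_table()
Z = sum(table.values())
posterior = condition(table, {"C": 1})
```

Here `Z` plays the role of the normalizer and the sum inside `condition` plays the role of $Z_\mathbf{|e}$. The table has one row per joint assignment, so it grows exponentially in the number of RVs, which is exactly why we want something other than enumeration.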

Answer two [link] isn't a prerequisite, but we need one idea from it. That is, we can always recreate the probabilities produced by a BN's Chain Rule with another invented MN and its Gibbs Rule. Essentially, we define factors that reproduce a BN's CPDs to do so. This equivalence allows us to reason solely in terms of the Gibbs Rule, while assured that whatever we discover will also hold for BNs. So, that's what we'll do. (This is copied from answer 3!).

Whew! Ok, now..


### What's our starting point?

We are handed an MN (or a BN that we'll convert to an MN). That is, we get a graph $\mathcal{H}$ and a set of factors $\Phi$. We're interested in the distribution of a subset of RVs, $\mathbf{Y} \subset \mathcal{X}$, conditional on an observation of other RVs ($\mathbf{E}=\mathbf{e}$). We'll 'have' our answer if we can generate samples, $\mathbf{y}$'s, that come (approximately) from this distribution.

The first step is to address conditioning. All we do is determine $\mathcal{H}_{|\mathbf{e}}$ and $\Phi_{|\mathbf{e}}$ and throw away the original $\mathcal{H}$ and $\Phi$. As we've mentioned (Have we!? check!), $\mathcal{H}_{|\mathbf{e}}$ is just the subgraph of $\mathcal{H}$ created from all nodes other than $\mathbf{E}$, and $\Phi_{|\mathbf{e}}$ are the factors from $\Phi$ with the assignment $\mathbf{e}$ plugged in. For the sake of cleanliness, I'll drop the 'conditional on $\mathbf{e}$' subscript and take it that you realize we've already done the conditioning conversion.
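
'Plugging in' the evidence can be sketched for a single factor stored as a table. This is a minimal illustration in plain Python, where the factor over $(B, C)$, its values and the helper name `reduce_factor` are made up. We keep only the rows that agree with $\mathbf{e}$ and drop the observed variables from the factor's scope.

```python
def reduce_factor(scope, table, evidence):
    """Plug evidence into one factor; return (reduced scope, reduced table)."""
    keep = [i for i, name in enumerate(scope) if name not in evidence]
    new_scope = tuple(scope[i] for i in keep)
    new_table = {}
    for row, value in table.items():
        agrees = all(
            row[i] == evidence[name]
            for i, name in enumerate(scope)
            if name in evidence
        )
        if agrees:
            new_table[tuple(row[i] for i in keep)] = value
    return new_scope, new_table


# Hypothetical factor phi(B, C) as a table (assumption for illustration).
scope = ("B", "C")
table = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

reduced_scope, reduced_table = reduce_factor(scope, table, {"C": 1})
```

Applying this to every factor in $\Phi$ would give the reduced set of factors. A factor that doesn't mention any observed variable comes back unchanged.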

This means that if we can generate samples from an MN, we can answer our queries.

Finally, we're ready for the big idea.


### Markov Chain Monte Carlo (MCMC)

In a nutshell, MCMC finds a way to *sequentially* sample $\mathbf{y}$'s such that, eventually, these $\mathbf{y}$'s are distributed as $P_M(\mathbf{Y}|\mathbf{e})$.

To understand this, we first need to understand a **Markov Chain**. This is just a set of states together with transition probabilities defined between those states. These probabilities are the chances we transition to each state given the current state. For our purposes, the set of states is $Val(\mathbf{Y})$ - all possible joint assignments of $\mathbf{Y}$.

Next, **Monte Carlo** refers to the use of *simulations* to solve our problem. That is, we'll pick some starting state, $\mathbf{y}^{(0)}$, and then sample the next state, $\mathbf{y}^{(1)}$, using the transition probabilities to do so. We do this repeatedly, giving us a long list of $\mathbf{y}$'s. But here's the kicker. We may set these transition probabilities such that the amount of time we spend in a particular state, $\mathbf{y}$, is in proportion to $P_M(\mathbf{Y}=\mathbf{y}|\mathbf{e})$... eventually. The 'eventually' means that this doesn't happen until we are some number of samples deep.
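
The 'time spent in a state' idea can be sketched with a tiny simulation in plain Python. The 3-state chain `T` and the helper name `visit_frequencies` are invented for illustration. For this particular `T`, the long-run proportions work out by hand to $(0.25, 0.5, 0.25)$, so we can compare the simulated visit frequencies against them.

```python
import random

# Hypothetical transition probabilities T[s][s_next] (assumption for illustration).
T = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]


def visit_frequencies(T, start, num_steps, rng):
    """Random-walk over the states and report the fraction of time spent in each."""
    states = range(len(T))
    state = start
    visits = [0] * len(T)
    for _ in range(num_steps):
        # Sample the next state using the row of T for the current state.
        state = rng.choices(states, weights=T[state])[0]
        visits[state] += 1
    return [count / num_steps for count in visits]


rng = random.Random(0)  # fixed seed so the run is repeatable
freqs = visit_frequencies(T, start=0, num_steps=100000, rng=rng)
```

MCMC runs this logic in reverse: rather than being handed `T` and discovering the proportions, we start from the proportions we want and design transition probabilities that produce them.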

This might sound rather abstract, so let's discuss a particular algorithm.

### Gibbs Sampling - a type of MCMC.

We start by considering all variables, $\mathbf{X}=\mathcal{X}$[1], and randomly selecting a state (so that's $\mathbf{x}^{(0)}$ from $Val(\mathbf{X})$). Our transition probabilities will be such that we are forced to change this vector *one element at a time* to produce our samples. Specifically:

1. Pick out $X_1$ from $\mathbf{X}$ and list out all values of $Val(X_1)$. For example, $Val(X_1)=[x_1^1,x_1^2,x_1^3,x_1^4]$.
2. For each $x_1^i \in Val(X_1)$, create a new vector by subbing that $x_1^i$ into the $X_1$-position of $\mathbf{x}^{(0)}$ (call it $\mathbf{x}^{(0)}_{i-subbed}$).
3. Plug each $\mathbf{x}^{(0)}_{i-subbed}$ into our Gibbs Rule, giving us 4 positive numbers. Normalize these into probabilities.
4. Randomly sample from this size-4 probability vector to give us one particular $x_1^i$. Plug that number into $\mathbf{x}^{(0)}$ giving us $\mathbf{x}^{(1)}$.
5. Go back to step 1, but use $X_2$ this time.

And we keep cycling through these steps for as long as we'd like. In effect, steps 2-4 are sampling from a specially defined set of transition probabilities. As a result, if we pick a $t$ large enough, $\mathbf{x}^{(t)}$'s will come from $P_M(\mathbf{X})$[2]. Since $\mathbf{Y}$ is a subset of $\mathbf{X}$, we just select out the $\mathbf{Y}$-elements from $\mathbf{x}^{(t)}$, giving us a $\mathbf{y}^{(t)}$ which comes from $P_M(\mathbf{Y})$. Ultimately, we've generated a sample from our desired distribution.
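
The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the three-variable MN, its factors, the burn-in length and every function name are assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical toy MN A - B - C over binary variables (assumption for illustration).
VARIABLES = ["A", "B", "C"]
VALUES = [0, 1]
FACTORS = [
    (("A", "B"), lambda a, b: 3.0 if a == b else 1.0),
    (("B", "C"), lambda b, c: 2.0 if b == c else 1.0),
]


def unnormalized(assignment):
    """Gibbs Rule without the 1/Z: the product of all factor values."""
    value = 1.0
    for scope, phi in FACTORS:
        value *= phi(*(assignment[name] for name in scope))
    return value


def resample(state, var, rng):
    """Steps 1-4: resample one variable with all the others held fixed."""
    weights = [unnormalized({**state, var: v}) for v in VALUES]
    # rng.choices accepts unnormalized weights, so no explicit division is needed.
    return {**state, var: rng.choices(VALUES, weights=weights)[0]}


def gibbs(num_samples, burn_in, rng):
    state = {name: rng.choice(VALUES) for name in VARIABLES}  # x^(0)
    samples = []
    for t in range(burn_in + num_samples):
        var = VARIABLES[t % len(VARIABLES)]  # step 5: cycle through the variables
        state = resample(state, var, rng)
        if t >= burn_in:  # discard the early, not-yet-converged states
            samples.append(tuple(state[name] for name in VARIABLES))
    return samples


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = gibbs(num_samples=60000, burn_in=1000, rng=rng)
counts = Counter(samples)
estimate_111 = counts[(1, 1, 1)] / len(samples)  # exact value is 6/24 = 0.25
marginal_a1 = sum(c for s, c in counts.items() if s[0] == 1) / len(samples)
```

Notice that we never compute $Z$. Since the weights are only compared against each other for one variable at a time, any factor that doesn't involve that variable cancels out in the normalization, so a more careful version would only evaluate the factors that touch it.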

But 'a' sample? Well, $\mathbf{x}^{(t)}$ differs from $\mathbf{x}^{(t-1)}$ in at most one element, so these sequences are seriously correlated over short distances. In other words, this series isn't a set of *independent* samples. This means that to get many independent samples (which would constitute a complete answer to our query) we either have to restart this algorithm many times, or keep only samples that are sufficiently far apart in the sequence that this correlation is no threat.
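
The 'far apart' option is often called thinning. Here is a minimal sketch in plain Python of how one might measure the correlation and thin a chain. The 'sticky' two-state chain, the gap of 20 and the helper names are all invented for illustration, and the right gap is problem-specific.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def thin(samples, gap):
    """Keep every gap-th sample so the survivors are less correlated."""
    return samples[::gap]


# Hypothetical 'sticky' chain: stay put with probability 0.9, flip otherwise.
rng = random.Random(0)
state = 0
chain = []
for _ in range(20000):
    if rng.random() < 0.1:
        state = 1 - state
    chain.append(state)

lag_1 = autocorrelation(chain, 1)    # neighbours: strongly correlated
lag_20 = autocorrelation(chain, 20)  # 20 steps apart: much weaker
thinned = thin(chain, 20)
```

For this toy chain, neighbouring states are strongly correlated while states 20 steps apart are nearly uncorrelated, so keeping every 20th state gives something much closer to independent draws, at the cost of throwing most of the work away.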

### Footnotes

[1] In fact, this is actually $\mathbf{X}=\mathcal{X}\setminus \mathbf{E}$; however, I'm assuming that we've already done the conditioning, which resulted in a new system and MN.

[2] Well, $P_M(\mathbf{X}|\mathbf{e})$, in fact.