### Notes

PGM Chapter 12: Particle-Based Approximate Inference

Section 12.3
- Markov Chain methods apply equally well to directed and to undirected models. The algorithm is easier to present in the context of a distribution $P_\Phi$ defined in terms of a general set of factors $\Phi$.
- To apply Gibbs Sampling to a network with evidence, we first reduce all of the factors by the observations $\mathbf{e}$, so that the distribution $P_\Phi$ used in the algorithm corresponds to $P_\Phi(\mathbf{X}|\mathbf{e})$.
- Markov Chain: a Markov chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk. In the case of graphical models, this graph is not the original graph, but rather a graph whose nodes are the possible assignments to our variables $\mathbf{X}$.
- Markov chain Monte Carlo (MCMC) sampling is a process that mirrors the dynamics of the Markov chain; the process of generating an MCMC trajectory is shown in algorithm 12.5. The sample $\mathbf{x}^{(t)}$ is drawn from the distribution $P^{(t)}$. We are interested in the limit of this process, that is, whether $P^{(t)}$ converges, and if so, to what limit.
- We have that $P^{(t+1)}(\mathbf{x}') = \sum_{\mathbf{x} \in Val(\mathbf{X})}P^{(t)}(\mathbf{x})\mathcal{T}(\mathbf{x}\rightarrow \mathbf{x'})$
- For large $t$, we expect $P^{(t)}(\mathbf{x}') \approx P^{(t+1)}(\mathbf{x}')$. Any distribution for which this holds exactly is called a stationary (or invariant) distribution. We may refer to it with $\pi(\mathbf{x}')$.
- There is no guarantee that the stationary distribution is unique. Sometimes it depends on the starting state. Situations like this occur when the chain has several distinct regions that are not reachable from each other. Chains such as this are called *reducible Markov Chains*. Dependence on the starting state may also be due to fixed cyclic behavior; chains like this are called *periodic Markov Chains*.
- We want to restrict attention to Markov Chains that have a unique stationary distribution which is reached from *any* starting distribution $P^{(0)}$. The most commonly used condition to guarantee this behavior is that the chain is *ergodic*.
- Ergodic: When the state space $Val(\mathbf{X})$ is finite, an equivalent condition (though not the formal definition?) is regularity: a Markov chain is said to be *regular* if there exists some number $k$ such that, for every $\mathbf{x}, \mathbf{x}' \in Val(\mathbf{X})$, the probability of getting from $\mathbf{x}$ to $\mathbf{x}'$ in exactly $k$ steps is greater than zero.
- Theorem: if a finite state Markov Chain $\mathcal{T}$ is regular, then it has a unique stationary distribution.
- Ensuring regularity is usually easy - you need two conditions:
    - it is possible to get from any state to any state using a positive probability path in the state graph
    - there is a non-zero chance of remaining in a state, for all states
- Those two conditions are sufficient but not necessary. They typically hold in practice.
- For graphical models, each state is an assignment to many variables. When defining a transition model over this space, we can consider a fully general case, where a transition is from any state to any other state. However, it is often convenient to decompose the transition model, considering transitions that update only a single component of the state vector at a time, that is, only the value of a single variable.
    - Consider an extension to our Grasshopper chain, where the grasshopper lives, not on a line, but in a two-dimensional plane. In this case, the state of the system is defined via a pair of random variables $X$, $Y$. Although we could define a joint transition model over both dimensions simultaneously, it might be easier to have separate transition models for the $X$ and $Y$ coordinates.
- In such a case, we often define a set of transition models, $\mathcal{T}_i$, called kernels. These may help us guarantee regularity or speed convergence. To sample, we have to add an intermediate step that picks which $\mathcal{T}_i$ to apply. Taken together, we can think of them as defining a global $\mathcal{T}$.
- The Gibbs Chain is not necessarily regular! Whoa! However, it is if the distribution is positive. However, positivity is not a requirement.
- Gibbs sampling may take a long time to mix.
- It is NOT the case that Gibbs Sampling and the MH algorithm always satisfy the desired properties of a Markov chain.
- Picking samples at a distance is provably worse than using all samples!


### Answer Structure

What should be mentioned?
- Gibbs rule from chain rule
- Markov Chains are between states
- Markov Chain may be badly behaved
- Maybe: The big idea is MCMC and Gibbs Sampling is the one algo?
- 

Outline
- Introduction
- Refresher?
    - What is a PGM?
    - Conversion to Gibbs Rule from Chain Rule
- Define MCMC
- Explain Metropolis Hasting
- Gibbs 
    

# How are Monte Carlo methods used to perform inference in Probabilistic Graphical Models?

(This is the 4th answer in a 7 part series[link] on Probabilistic Graphical Models ('PGMs').)

So far, our running definition of inference has been:

"The task of using a given graphical model of a system, complete with fitted parameters, to answer certain questions regarding that system."

Here, we'll discuss a general class of algorithms designed to *approximate* these answers with simulations. That is, we'll get a set of samples drawn from a distribution which approximates the distribution we're asking about.

I enjoy these techniques the most, and for one reason only: their *generality*. Exact inference algorithms demand that the graphs be sufficiently simple and factorizable. Variational Inference (a type of approximate approach[link to answer 3]) demands defining a space of approximate distributions and a means to search it effectively. Monte Carlo methods, however, demand no qualifying inspection of these graphs - all graphs are fair game. This provides us with a much wider class of models to fit our reality.

This isn't to say such methods are a cure-all. Yes, there are circumstances in which we fail to get answers. But the reason for such a failure, beyond the vague 'lack of convergence', is not well understood. So we accept these problem-specific battles as the cost of the supreme generality.

But before we dive in, a short review will help.

### Refresher (same as refresher in answer 3)

In the first answer [link], we discovered why PGMs are useful tools for representing complex systems. We defined a complex system as a set of $n$ random variables (which we call $\mathcal{X}$) with a relationship we'd like to understand. We take it that there exists some true but unknown joint distribution, $P$, which governs these RVs. We take it that a 'good understanding' means we can answer two types of questions regarding this $P$:

1. **Probability Queries**: Compute the probabilities $P(\mathbf{Y}|\mathbf{e})$. What is the distribution of the RV's of $\mathbf{Y}$ given we have some observation ($\mathbf{e}$) of the RVs of $\mathbf{E}$?
2. **MAP Queries**: Determine $\textrm{argmax}_\mathbf{Y}P(\mathbf{Y}|\mathbf{e})$. That is, determine the most likely values of some RVs given an assignment of other RVs.

(Where $\mathbf{E}$ and $\mathbf{Y}$ are two arbitrary subsets of $\mathcal{X}$. If this notation is unfamiliar, see the 'Notation Guide' section from the first answer [link]).

The idea behind PGMs is to estimate $P$ using two things:

1. A graph: a set of nodes, each of which represents an RV from $\mathcal{X}$, and a set of edges between these nodes.
2. Parameters: objects that, when paired with a graph and a certain rule, allow us to calculate probabilities of assignments of $\mathcal{X}$.

PGMs fall into two main categories, Bayesian Networks ('BNs') and Markov Networks ('MNs'), depending on the specifics of these two.

A **Bayesian Network** involves a graph, denoted as $\mathcal{G}$, with *directed* edges and no directed cycles. The parameters are Conditional Probability Distributions ('CPDs'), also called Conditional Probability Tables ('CPTs') when the RVs are discrete, which are, as the naming suggests, select conditional probabilities from the BN. They give us the right hand side of the Chain Rule, which dictates how we calculate probabilities according to a BN:

$$
P_{B}(X_1,\cdots,X_n)=\prod_{i=1}^n P_{B}(X_i|\textrm{Pa}_{X_i}^\mathcal{G})
$$

A **Markov Network**'s graph, denoted as $\mathcal{H}$, is different in that its edges are *undirected* and we may have cycles. The parameters are a size $m$ set of *functions* which map assignments of $m$ subsets of $\mathcal{X}$ to nonnegative numbers. Those subsets, which we'll call $\mathbf{D}_i$'s, correspond to *complete subgraphs* of $\mathcal{H}$ and their union makes up the whole of $\mathcal{H}$. We can refer to this set as $\Phi=\{\phi_i(\cdots)\}_{i=1}^m$. With that, we say that the 'Gibbs Rule' for calculating probabilities is:

$$
P_M(X_1,\cdots,X_n) = \frac{1}{Z} \underbrace{\prod_{i = 1}^m \phi_i(\mathbf{D}_i)}_{\text{we call this }\tilde{P}_M(X_1,\cdots,X_n)}
$$

where $Z$ is a normalizer - it ensures our probabilities sum to 1.

To crystallize this idea, it's helpful to imagine the 'Gibbs Table', which lists unnormalized probabilities for all assignments. In the second answer [link], we pictured an example where $\mathcal{X}=\{C,D,I,G,S\}$ as:

![Title](GIbbsTable_labeled.png)
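
If it helps to see the Gibbs Table as code, here is a minimal sketch that builds one by brute force. It is not the $\{C,D,I,G,S\}$ example pictured above: the three binary RVs, the two factors and their values, and the names `variables`, `factors` and `unnormalized` are all made up for illustration.

```python
from itertools import product

# Hypothetical toy MN over three binary RVs A, B, C with two pairwise factors.
# The factor values are invented for illustration only.
variables = ["A", "B", "C"]
factors = [
    (("A", "B"), {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}),
    (("B", "C"), {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}),
]

def unnormalized(assignment):
    """P-tilde: the product of every factor evaluated at this assignment."""
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(assignment[v] for v in scope)]
    return value

# The 'Gibbs Table': one unnormalized entry per joint assignment.
gibbs_table = {
    values: unnormalized(dict(zip(variables, values)))
    for values in product([0, 1], repeat=len(variables))
}

# Z is the sum of the whole table; dividing by it gives proper probabilities.
Z = sum(gibbs_table.values())
probabilities = {values: p / Z for values, p in gibbs_table.items()}
```

Notice that computing $Z$ means touching every row of the table. With 3 binary RVs that's 8 rows, but the count doubles with every RV we add, which is why we'd rather not compute it.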

Lastly, we recall that the Gibbs Rule may express the Chain Rule. That is, we can always recreate the probabilities produced by a BN's Chain Rule with another invented MN and its Gibbs Rule. Essentially, we define factors that reproduce looking up a conditional probability in a BN's CPDs. This equivalence allows us to reason solely in terms of the Gibbs Rule, while assured that whatever we discover will also hold for BNs. In other words, with regards to inference, if something works for $P_M$, then it works for $P_B$.
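
As a small sanity check of that equivalence, here is a sketch with a hypothetical two-node BN, $A \rightarrow B$. The CPD numbers and the function names (`phi_a`, `phi_ab`, and so on) are assumptions made up for this example.

```python
from itertools import product

# Hypothetical two-node BN: A -> B, both binary. The CPD numbers are made up.
p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b_given_a = {(0, 0): 0.9, (0, 1): 0.1,    # P(B | A), keyed by (a, b)
               (1, 0): 0.2, (1, 1): 0.8}

# One factor per CPD: each factor simply looks up the conditional probability.
def phi_a(a):
    return p_a[a]

def phi_ab(a, b):
    return p_b_given_a[(a, b)]

def chain_rule(a, b):
    """BN probability: P(A) * P(B | A)."""
    return p_a[a] * p_b_given_a[(a, b)]

def gibbs_unnormalized(a, b):
    """MN unnormalized probability: the product of the factors."""
    return phi_a(a) * phi_ab(a, b)

# Because the factors are CPDs, the normalizer works out to 1.
Z = sum(gibbs_unnormalized(a, b) for a, b in product([0, 1], repeat=2))
```

Here the Gibbs Rule and the Chain Rule agree on every assignment, and $Z$ comes out to 1 since the factors are already conditional probabilities.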

### What's our starting point?

We are handed a MN (or a BN that we'll convert to an MN). That is, we get a graph $\mathcal{H}$ and a set of factors $\Phi$. We're interested in the distribution of a subset of RVs, $\mathbf{Y} \subset \mathcal{X}$, conditional on an observation of other RVs ($\mathbf{E}=\mathbf{e}$). We'll have our answer presumably (to both queries) if we can generate samples, $\mathbf{y}$'s, that come (approximately) from this distribution.

The first step is to address conditioning. To do so, let's steal one idea that appeared in the second answer. That is, inference in a MN conditional on $\mathbf{E}=\mathbf{e}$ gives the same answer as *unconditional* inference in a specially defined MN, with a graph we'll call $\mathcal{H}_{|\mathbf{e}}$ and a set of factors $\Phi_{|\mathbf{e}}$. $\mathcal{H}_{|\mathbf{e}}$ is $\mathcal{H}$, but with all $\mathbf{E}$ nodes and any edges involving them deleted. $\Phi_{|\mathbf{e}}$ is the set of factors $\Phi$, but with $\mathbf{E}=\mathbf{e}$ fixed as an input assignment. The point is if we can do unconditional inference, we can do conditional inference. For the sake of cleanliness, I'll drop the '$|\mathbf{e}$' subscript and take it that you realize we've already done the conditioning conversion.
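
To picture what 'fixing $\mathbf{E}=\mathbf{e}$ as an input assignment' does to a single factor, here is a rough sketch. It assumes a factor is stored as a scope plus a lookup table, which is just one convenient representation; `reduce_factor` and the example numbers are hypothetical.

```python
def reduce_factor(scope, table, evidence):
    """Fix the evidence variables in one factor and drop them from its scope."""
    kept = [i for i, v in enumerate(scope) if v not in evidence]
    new_scope = tuple(scope[i] for i in kept)
    new_table = {}
    for assignment, value in table.items():
        # Keep only the rows that agree with the evidence.
        if all(assignment[i] == evidence[v]
               for i, v in enumerate(scope) if v in evidence):
            new_table[tuple(assignment[i] for i in kept)] = value
    return new_scope, new_table

# Hypothetical factor over (A, B); suppose we observe B = 1.
scope = ("A", "B")
table = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 7.0}
reduced_scope, reduced_table = reduce_factor(scope, table, {"B": 1})
```

Applying this to every factor in $\Phi$ gives $\Phi_{|\mathbf{e}}$: the observed RVs disappear from every scope, which matches deleting their nodes from the graph.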

Finally, we're ready for the big idea.


### Markov Chain Monte Carlo (MCMC)

In a nutshell, MCMC finds a way to *sequentially* sample $\mathbf{y}$'s such that, eventually, these $\mathbf{y}$'s are distributed as $P_M(\mathbf{Y}|\mathbf{e})$.

To see this, we must first define a **Markov Chain**. All this is is a set of states and transition probabilities defined between such states. These probabilities are the chances we transition to any other state given the current state. For our purposes, the set of states is $Val(\mathbf{X})$ where $\mathbf{X}=\mathcal{X}$[1] - all possible joint assignments of all variables. For any two states $\mathbf{x},\mathbf{x}' \in Val(\mathbf{X})$, we write the transition probability as $\mathcal{T}(\mathbf{x} \rightarrow \mathbf{x}')$. We may refer to all such probabilities or the whole Markov Chain as $\mathcal{T}$.

As a simple example, suppose our system was one RV, $X$, that could take 3 possible values (so $Val(X)=[x^1,x^2,x^3]$). Then we might have this Markov Chain:

![title](MarkovChainExample.png)

Thinking generally again, to 'sample' a Markov Chain means we sample a starting $\mathbf{x}^{(0)}\in Val(\mathbf{X})$ according to some starting distribution that we'll call $P_\mathcal{T}^{(0)}$. Then, we use our $\mathcal{T}(\mathbf{x} \rightarrow \mathbf{x}')$ probabilities to determine the next state, giving us $\mathbf{x}^{(1)}$. Then we repeat, giving us a long series of $\mathbf{x}^{(t)}$'s. If we were to restart the sampling procedure many times and select out the $t$-th sample, we'd observe a distribution that we'll call $P_\mathcal{T}^{(t)}$.

So, a sample of our toy example might be: $x^1$ (33% chance) $\rightarrow x^3$ (75% chance) $\rightarrow x^2$ (50% chance) $\rightarrow x^2$ (70% chance). Simple enough, right?
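
In code, sampling a chain like this is only a few lines. The sketch below uses a hypothetical 3-state transition table: I've picked numbers that agree with the sample path just described, but the remaining entries are my own invention rather than values read off the figure.

```python
import random

# Hypothetical 3-state chain. The 0.75, 0.50 and 0.70 entries match the sample
# path described in the text; every other entry is made up.
states = ["x1", "x2", "x3"]
start = [1 / 3, 1 / 3, 1 / 3]          # P^(0): uniform starting distribution
T = {
    "x1": [0.25, 0.00, 0.75],          # T(x1 -> x1), T(x1 -> x2), T(x1 -> x3)
    "x2": [0.00, 0.70, 0.30],
    "x3": [0.40, 0.50, 0.10],
}

def sample_chain(n_steps, rng):
    """Draw x^(0) from the start distribution, then take n_steps transitions."""
    x = rng.choices(states, weights=start)[0]
    path = [x]
    for _ in range(n_steps):
        x = rng.choices(states, weights=T[x])[0]
        path.append(x)
    return path

path = sample_chain(10, random.Random(0))
```

Rerunning `sample_chain` with different seeds and collecting the $t$-th entry of each path would give an empirical picture of $P_\mathcal{T}^{(t)}$.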

By the nature of this procedure, we can figure this relation:

$$
P_\mathcal{T}^{(t+1)}(\mathbf{x}) = \sum_{\mathbf{x}' \in Val(\mathbf{X})} P_\mathcal{T}^{(t)}(\mathbf{x}') \mathcal{T}(\mathbf{x}'\rightarrow \mathbf{x})
$$

Now, for a large $t$, it's reasonable to expect $P_\mathcal{T}^{(t)}(\mathbf{x})$ to be very similar to $P_\mathcal{T}^{(t+1)}(\mathbf{x})$. Under some conditions, that's correct intuition. Whatever that common distribution is, we call it the **stationary distribution** of $\mathcal{T}$, and it's written as $\pi_\mathcal{T}$. It is the *single* distribution that works for both $P_\mathcal{T}^{(t+1)}$ and $P_\mathcal{T}^{(t)}$ in that above relation. That is, it solves:

$$
\pi_\mathcal{T}(\mathbf{x}) = \sum_{\mathbf{x}' \in Val(\mathbf{X})} \pi_\mathcal{T}(\mathbf{x}') \mathcal{T}(\mathbf{x}'\rightarrow \mathbf{x})
$$

In effect, $\pi_\mathcal{T}$ is the distribution that $P_\mathcal{T}^{(t)}$ converges to.
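
For a chain this small we can watch the convergence directly by applying the update relation over and over. This sketch uses a hypothetical 3-state transition matrix (the numbers are my own, not taken from the figure); the function name `step` is also just for illustration.

```python
# Hypothetical 3-state chain: row i holds T(x^i -> .), so each row sums to 1.
T = [
    [0.25, 0.00, 0.75],
    [0.00, 0.70, 0.30],
    [0.40, 0.50, 0.10],
]

def step(p):
    """One application of P^(t+1)(x) = sum over x' of P^(t)(x') T(x' -> x)."""
    n = len(p)
    return [sum(p[j] * T[j][i] for j in range(n)) for i in range(n)]

p = [1.0, 0.0, 0.0]       # P^(0): all mass on the first state
for t in range(200):
    p = step(p)

# If p has converged, one more step should leave it (numerically) unchanged.
residual = max(abs(a - b) for a, b in zip(p, step(p)))
```

For this particular matrix, solving the stationary equation by hand gives $\pi_\mathcal{T} = (1/6, 25/48, 5/16)$, and that's what `p` settles on. Starting from a different `p` lands in the same place, which is exactly the behavior we're hoping for.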

With that, we're ready for the big insight:

"
We may choose our Markov Chain, $\mathcal{T}$, such that $\pi_\mathcal{T} = P_M$
"

Now, conceivably, we could make our choice of $\mathcal{T}$ with $P_M$ in mind and solve for the stationary distribution to get our answer. However, in general, this isn't possible. Hence, we need our next 'MC'. That is, we'll use **Monte Carlo simulations** to solve our problem. Instead of trying to solve for $\pi_\mathcal{T}$, we execute the sampling procedure to produce a series of $\mathbf{x}^{(t)}$'s, and then we observe an empirical approximation to $\pi_\mathcal{T}$ (and hence $P_M$) after a number of iterations. If we are concerned with a subset $\mathbf{Y}$ from $\mathcal{X}$, we simply select out the $\mathbf{Y}$-elements from our series of $\mathbf{x}^{(t)}$'s and use those.

Uhh, but...

### How do we choose $\mathcal{T}$?

This is where things get hairy. Fortunately, the algorithms make this decision for us, but to understand their relative advantages, we need to understand their common aim. None of them entirely nail that aim.

In a nutshell, that aim is:

"
We'd like a $\mathcal{T}$ for which sampling will converge *quickly* to a *single* $\pi_\mathcal{T}$, equal to $P_M$, from *any* starting $P^{(0)}_\mathcal{T}$.
"

This is hard. Here are the major ways it may crash and burn:

* Imagine a $\mathcal{T}$ with two states where if you're on one, you *always* transition to the other state. This makes for a heavy dependence on $P^{(0)}_\mathcal{T}$: unless we start with exactly half the probability on each state, $P^{(t)}_\mathcal{T}$ flips back and forth forever and *never* converges to a stationary distribution $\pi_\mathcal{T}$. Cyclic behavior like this means the Markov chain is periodic - we hate periodic chains.
* A $\mathcal{T}$ with a low 'conductance' is one in which there are regions of the state space which are very hard to go between. This means that if you start in one, it'll take you an extremely long time before you explore the other. So we have a near-dependency on $P^{(0)}_\mathcal{T}$. Also, convergence will require traversing that narrow bridge, so it certainly won't be 'quick'. If dotted lines imply small transition probabilities, this is an example of a low conductance $\mathcal{T}$:

![title](LowConductance.png)

* If there exist two states such that if you're on one, you can *never* reach the other, that $\mathcal{T}$ is called *reducible* and the consequence is that there may be more than one stationary distribution.
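
The periodic and reducible failures are easy to reproduce with the smallest possible chains. The two matrices below are toy examples I've made up for this purpose.

```python
def step(p, T):
    """Push a distribution p through one transition of the chain T."""
    n = len(p)
    return [sum(p[j] * T[j][i] for j in range(n)) for i in range(n)]

# Periodic: two states that always swap.
T_periodic = [[0.0, 1.0],
              [1.0, 0.0]]
p = [1.0, 0.0]
history = []
for _ in range(4):
    p = step(p, T_periodic)
    history.append(p)
# history alternates between [0, 1] and [1, 0] and never settles, even though
# [0.5, 0.5] is unchanged by a step.
balanced = step([0.5, 0.5], T_periodic)

# Reducible: two states that can never reach each other.
T_reducible = [[1.0, 0.0],
               [0.0, 1.0]]
pi_a = [1.0, 0.0]
pi_b = [0.3, 0.7]
# Both pi_a and pi_b are unchanged by a step, so both are stationary.
after_a = step(pi_a, T_reducible)
after_b = step(pi_b, T_reducible)
```

In the reducible case, wherever we start is where we stay, so the 'answer' we'd get from sampling is entirely decided by $P^{(0)}_\mathcal{T}$.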

The protection against this unruly behavior is a set of theoretical properties you may demand of a $\mathcal{T}$. You may demand it's aperiodic, for example.

The most important one is called **detailed balance**. A $\mathcal{T}$ with this property has a $\pi_\mathcal{T}$ such that:

$$
\pi_\mathcal{T}(\mathbf{x})\mathcal{T}(\mathbf{x}\rightarrow \mathbf{x}') = \pi_\mathcal{T}(\mathbf{x}')\mathcal{T}(\mathbf{x}'\rightarrow \mathbf{x})
$$

for any pair of $\mathbf{x}, \mathbf{x}' \in Val(\mathbf{X})$.

Compare this to the equation that defines the stationary distribution. The right side of that has many more terms, and as such, it leaves many more degrees of freedom for $\pi_\mathcal{T}(\mathbf{x})$ to sit within. Intuitively, detailed balance means the stationary distribution follows from *single* step transitions, and not from large cycles of many transitions. It's a kind of 'well connectedness'. If $\mathcal{T}$ implies any two states are reachable from each other, is aperiodic and has detailed balance, then we'll converge to the unique stationary distribution from any starting distribution.
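
As a quick check that a $\pi_\mathcal{T}$ satisfying detailed balance really is stationary, sum both sides of the detailed balance equation over $\mathbf{x}'$. The only extra fact used is that the transition probabilities out of any state sum to 1:

$$
\sum_{\mathbf{x}' \in Val(\mathbf{X})} \pi_\mathcal{T}(\mathbf{x}')\mathcal{T}(\mathbf{x}'\rightarrow \mathbf{x}) = \sum_{\mathbf{x}' \in Val(\mathbf{X})} \pi_\mathcal{T}(\mathbf{x})\mathcal{T}(\mathbf{x}\rightarrow \mathbf{x}') = \pi_\mathcal{T}(\mathbf{x}) \sum_{\mathbf{x}' \in Val(\mathbf{X})} \mathcal{T}(\mathbf{x}\rightarrow \mathbf{x}') = \pi_\mathcal{T}(\mathbf{x})
$$

The far left and far right together are exactly the equation that defines the stationary distribution.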

Enough with the theoretics - let's see an algorithm.


### Gibbs Sampling - a type of MCMC.

Our first step is to uniformly sample an $\mathbf{x}^{(0)}$ from $Val(\mathbf{X})$. In Gibbs Sampling, our transition probabilities will be such that we are forced to change this vector *one element at a time* to produce our samples. Specifically:

1. Pick out $X_1$ from $\mathbf{X}$ and list out all values of $Val(X_1)$. For example, $Val(X_1)=[x_1^1,x_1^2,x_1^3,x_1^4]$.
2. For each $x_1^i \in Val(X_1)$, create a new vector by subbing that $x_1^i$ into the $X_1$-position of $\mathbf{x}^{(0)}$ (call it $\mathbf{x}^{(0)}_{i-subbed}$).
3. Plug each $\mathbf{x}^{(0)}_{i-subbed}$ into our Gibbs Rule[3], giving us 4 positive numbers, and normalize them into probabilities.
4. Randomly sample from this size-4 probability vector to give us one particular $x_1^i$. Plug that number into $\mathbf{x}^{(0)}$ giving us $\mathbf{x}^{(1)}$.
5. Go back to step 1, but use $X_2$ this time.

And we keep cycling through these steps for as long as we'd like. In effect, steps 2-4 are sampling from a specially defined set of transition probabilities. As a result, if we pick a $t$ large enough, $\mathbf{x}^{(t)}$ will come from $P_M(\mathbf{X})$[2].
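
The steps above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy MN (three binary RVs with made-up factor values), not a tuned implementation; the names `gibbs_step` and `gibbs_sample` and the burn-in length of 1000 are my own choices.

```python
import random
from itertools import product

# Hypothetical toy MN over binary A, B, C (factor values are made up).
variables = ["A", "B", "C"]
values = [0, 1]
factors = [
    (("A", "B"), {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}),
    (("B", "C"), {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}),
]

def unnormalized(x):
    """Gibbs Rule without the 1/Z: product of all factors at assignment x."""
    result = 1.0
    for scope, table in factors:
        result *= table[tuple(x[v] for v in scope)]
    return result

def gibbs_step(x, var, rng):
    """Resample one variable given the current values of all the others."""
    weights = [unnormalized({**x, var: v}) for v in values]   # steps 2-3
    # random.choices normalizes the weights for us, then samples (step 4).
    new_value = rng.choices(values, weights=weights)[0]
    return {**x, var: new_value}

def gibbs_sample(n_sweeps, rng):
    x = {v: rng.choice(values) for v in variables}            # uniform x^(0)
    samples = []
    for _ in range(n_sweeps):
        for var in variables:                                  # steps 1 and 5
            x = gibbs_step(x, var, rng)
            samples.append(dict(x))
    return samples

samples = gibbs_sample(20000, random.Random(0))
burned = samples[1000:]                                        # drop the early samples
estimate = sum(s["A"] == s["C"] for s in burned) / len(burned)

# Brute-force answer for comparison (only feasible because the model is tiny).
table = {xs: unnormalized(dict(zip(variables, xs)))
         for xs in product(values, repeat=len(variables))}
exact = sum(p for xs, p in table.items() if xs[0] == xs[2]) / sum(table.values())
```

Here `estimate` approximates the probability that $A$ and $C$ agree, and `exact` is the same quantity computed from the full table. For clarity this sketch evaluates the full product in `unnormalized` every time, which is wasteful for anything but a toy model.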

But, you may have noticed an issue. $\mathbf{x}^{(t)}$ is only different from $\mathbf{x}^{(t-1)}$ by one element, so these sequences are seriously correlated over short distances. This series isn't a set of *independent* samples!

To address this, people keep track of the **effective sample size** of their simulations. Imagine our sampling produced a series that is *perfectly* correlated - they are all the same. Clearly, this is effectively a single independent sample. At the other end, imagine there is no correlation - then we have our independence and we have as many independent samples as samples (beyond that 'large enough' $t$). So the effective sample size is a number between these two extremes. There exist some heuristics to estimate this figure from lagged correlations. It's an important figure to keep handy, as it gives you a clue as to how good our approximate inference is.
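
To give a flavor of such a heuristic, here is a sketch of one simple version: divide the number of samples by $1 + 2\sum_k \rho_k$, where $\rho_k$ is the lag-$k$ autocorrelation, and stop the sum at the first non-positive $\rho_k$. The truncation rule, the `max_lag` default and the function names are assumptions on my part; real packages use more careful variants.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((s - mean) ** 2 for s in series) / n
    if var == 0.0:
        return 1.0  # a constant series is perfectly correlated with itself
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(series, max_lag=200):
    """ESS = N / (1 + 2 * sum of lagged autocorrelations), truncated early."""
    n = len(series)
    if len(set(series)) == 1:
        return 1.0  # all samples identical: effectively one sample
    total = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = autocorrelation(series, lag)
        if rho <= 0.0:      # simple truncation rule: stop at first non-positive lag
            break
        total += rho
    return n / (1.0 + 2.0 * total)

# Two synthetic series of 2000 draws: one independent, one strongly correlated.
rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.9 * sticky[-1] + rng.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent)
ess_sticky = effective_sample_size(sticky)
```

The independent series keeps most of its 2000 samples, while the 'sticky' series, where each draw is mostly a copy of the previous one, is worth far fewer.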

But correlation isn't the only issue. Since we update one $X_i$ at a time, it's also fairly slow.

We need to get more general.

### The Metropolis-Hastings Algorithm

The issue in Gibbs Sampling is that we move through the states very slowly. It would help if we could control how fast we explore this space. To guarantee that the resulting $\pi_\mathcal{T}$ equals our $P_M$, we'd have to apply a correction to compensate for that control. Roughly, this is the idea behind the Metropolis-Hastings algorithm.

More specifically, we invent *another* Markov Chain, which we'll call $\mathcal{T}^Q$, which is our *proposal* Markov Chain. That is, it's defined over the same space ($Val(\mathbf{X})$) and is responsible for proposing the next state, given whatever current state. This is where we can invoke large leaps if we'd like. Our correction will be to sample a yes-no event according to a certain 'acceptance' probability. If we draw a yes, we transition to the proposed state. If we draw a no, we remain at the current state. Now, this acceptance probability is specially designed to ensure detailed balance. It is:

$$
\mathcal{A}(\mathbf{x}\rightarrow\mathbf{x}') = \textrm{min}\bigg(1, \frac{\tilde{P}_M(\mathbf{x}')\mathcal{T}^Q(\mathbf{x}'\rightarrow\mathbf{x})}{\tilde{P}_M(\mathbf{x})\mathcal{T}^Q(\mathbf{x}\rightarrow\mathbf{x}')} \bigg)
$$

This acceptance probability ensures that we will converge to $P_M$ from any starting distribution, given $\mathcal{T}^Q$ isn't especially badly behaved. 

For the intuiters, let's decompose $\mathcal{A}$. It's effectively made up of $\frac{\tilde{P}_M(\mathbf{x}')}{\tilde{P}_M(\mathbf{x})}$ and $\frac{\mathcal{T}^Q(\mathbf{x}'\rightarrow\mathbf{x})}{\mathcal{T}^Q(\mathbf{x}\rightarrow\mathbf{x}')}$.

$\frac{\tilde{P}_M(\mathbf{x}')}{\tilde{P}_M(\mathbf{x})}$ makes us more likely to reject states that are unfavorable according to $P_M$. Notice it uses the unnormalized probabilities - no intractable $Z$ involved! Also, since it's a ratio involving the Gibbs Rule, if $\mathbf{x}'$ and $\mathbf{x}$ share some identical terms (maybe by design of our $\mathcal{T}^Q$), it's possible some factors cancel. So we could save time by avoiding computation of the full Gibbs Rule.

$\frac{\mathcal{T}^Q(\mathbf{x}'\rightarrow\mathbf{x})}{\mathcal{T}^Q(\mathbf{x}\rightarrow\mathbf{x}')}$ makes us unlikely to accept states that are easy to transition to (according to our proposal) but difficult to return from. Such behavior is at odds with detailed balance, so it's curtailed for the sake of this property.

Now, there are some practical considerations to keep in mind. First, $\mathcal{T}^Q$ should be able to propose everything in $Val(\mathbf{X})$, otherwise we'll give a zero probability to something $P_M$ might favor. Second, a rejected proposed sample is a waste of your computer's time, so in this sense, we like high $\mathcal{A}$'s. However, very high $\mathcal{A}$'s might mean you aren't exploring the space quickly. Together, this means that $\mathcal{T}^Q$ must be tuned right to explore the space efficiently.
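
To see both ratios of $\mathcal{A}$ at work, here is a small sketch on a made-up problem: 5 states arranged in a ring, an unnormalized target `p_tilde`, and a deliberately lopsided proposal that prefers stepping clockwise. All of these choices are hypothetical and only meant to exercise the formula.

```python
import random

# Hypothetical target on 5 states arranged in a ring; weights are unnormalized.
p_tilde = [1.0, 2.0, 4.0, 2.0, 1.0]
n = len(p_tilde)

def q(x, x_new):
    """Asymmetric proposal T^Q: clockwise w.p. 0.7, counter-clockwise w.p. 0.3."""
    if x_new == (x + 1) % n:
        return 0.7
    if x_new == (x - 1) % n:
        return 0.3
    return 0.0

def propose(x, rng):
    """Sample a proposed state from T^Q(x -> .)."""
    return (x + 1) % n if rng.random() < 0.7 else (x - 1) % n

def acceptance(x, x_new):
    """The MH acceptance probability A(x -> x_new)."""
    ratio = (p_tilde[x_new] * q(x_new, x)) / (p_tilde[x] * q(x, x_new))
    return min(1.0, ratio)

def metropolis_hastings(n_steps, rng):
    x = rng.randrange(n)
    counts = [0] * n
    for _ in range(n_steps):
        x_new = propose(x, rng)
        if rng.random() < acceptance(x, x_new):
            x = x_new                  # accept: move to the proposed state
        counts[x] += 1                 # record the state we ended up in either way
    return [c / n_steps for c in counts]

frequencies = metropolis_hastings(100000, random.Random(0))
target = [w / sum(p_tilde) for w in p_tilde]
```

Even though the proposal pushes us clockwise, the time spent in each state (`frequencies`) should track `target`, and we never needed the normalizer of `p_tilde` to run the chain.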

### Let's see it

I can't think of a better visual than the one I've seen in Kevin Murphy's text (Chapter 24, see citation)[4], so I'll use that. Let's pretend our 'intractable' $P_M$ is a mixture of two Gaussian distributions, like this:

![title](Pm2humps.png)

We'd like a set of samples which occupy the horizontal axis in proportion to the height of this graph. To do so, we'll use a normal distribution centered on our current state as our proposal: $\mathcal{T}^Q(\mathbf{x}\rightarrow\mathbf{x}') = \mathcal{N}(\mathbf{x}'|\mathbf{x},v)$ where $v$ is the variance. We'll take it that we can evaluate the ratio of our intractable distribution at any two points. With that, we know how to generate proposals, accept/reject them and produce samples. From the book, this process can be represented as:

![title](MH_variances.png)
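
If you'd like to play with this yourself, here is a rough sketch of the same kind of experiment. To be clear, this is not the code behind the figure: the mixture parameters, the three proposal widths and the function names are my own assumptions. Note that `proposal_std` is the standard deviation, so it corresponds to $\sqrt{v}$.

```python
import math
import random

def p_tilde(x):
    """Unnormalized two-hump target; the parameters here are hypothetical."""
    return (0.3 * math.exp(-0.5 * ((x + 20.0) / 10.0) ** 2)
            + 0.7 * math.exp(-0.5 * ((x - 20.0) / 10.0) ** 2))

def mh_gaussian(n_steps, proposal_std, rng, x0=0.0):
    """Random-walk MH. The Gaussian proposal is symmetric, so the T^Q ratio is 1."""
    x = x0
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = rng.gauss(x, proposal_std)
        if rng.random() < min(1.0, p_tilde(x_new) / p_tilde(x)):
            x, accepted = x_new, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps

rng = random.Random(0)
results = {}
for std in (1.0, 8.0, 500.0):
    samples, rate = mh_gaussian(20000, std, rng)
    right = sum(s > 0 for s in samples) / len(samples)
    results[std] = (rate, right)
    print(f"proposal std={std}: acceptance={rate:.2f}, mass right of 0={right:.2f}")
```

A tiny proposal width accepts nearly everything but crawls, a huge one leaps everywhere but is almost always rejected, and something in between does the best job of visiting both humps. That's the tuning trade-off from the previous section.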


### What's next?

At this point, especially if you've read answers 2 and 3 as well, you may have had your fill of inference. You might be curious as to how we actually *learn* the parameters of a BN or a MN. Over the next 2 Mondays, I'll be addressing those:

[5] How are the parameters of a Markov Network learned? [link] (Posting on DATE)

[6] How are the parameters of a Bayesian Network learned? [link] (Posting on DATE)

If you're interested, please follow these!

### Footnotes

[1] In fact, this is actually $\mathbf{X}=\mathcal{X}\setminus \mathbf{E}$, however I'm assuming that we've already done conditioning. So that conditioning resulted in a new system and MN.

[2] Well, $P_M(\mathbf{X}|\mathbf{e})$, in fact.

[3] Actually, you don't need to compute the full Gibbs Rule product - you only need to consider the factors in which $X_1$ appears. Just think about it - for all the factors in which $X_1$ *doesn't* appear, their product remains constant as you plug in different assignments of $X_1$. When we normalize, this constant will be divided out, so we don't need to consider it. For large MNs, this is a huge efficiency gain!

[4] In the text, he credits Christophe Andrieu and Nando de Freitas for the code behind these visuals, so I should as well.

### Scrap 

To do so, we need to define the **stationary distribution**

As a starting point, we can figure out $\mathcal{T}$ that have properties we *don't* want.

You might look at this and think: "Ok, well how do I choose $\mathcal{T}$? Also, I don't have infinite time, so what's the biggest $t$ I need?". Both of these are seriously involved questions.

Let's address the first


We'll use $P_\mathcal{T}^{(t)}(\mathbf{x})$ to refer to the distribution of $\mathbf{x}$ at iteration $t$ created by the Markov Chain. To understand $P_\mathcal{T}^{(t)}(\mathbf{x})$, imagine restarting the sampling procedure many times and selecting out the $t$-th samples. The distribution of these samples is

Now we're ready for the major insight. First, let's use $P_\mathcal{T}^{(t)}(\mathbf{x})$  to refer to the distribution of $\mathbf{x}$ at iteration $t$ created by the Markov Chain. In other words, imagine restarting the sam

"
We may *choose* these $\mathcal{T}(\mathbf{x} \rightarrow \mathbf{x}')$'s such that, if we were to transition between states for an infinite number of iterations, the proportion of time we spend in a state $\mathbf{x}$ would be $P_M(\mathbf{X}=\mathbf{x}|\mathbf{e})$.
"

Since this is important, let's state it a bit more technically. Let's use $P_\mathcal{T}^{(t)}(\mathbf{x})$  to refer to the distribution of $\mathbf{x}$ at iteration $t$ created by the Markov Chain.


With that, we are face

The big insight is that we may *choose* these $\mathcal{T}(\mathbf{x} \rightarrow \mathbf{x}')$'s such that, if we were to transition between states for an infinite number of times, the amount of time

The big insight is that if we were to transition between states according to a particular choice of $\mathcal{T}(\mathbf{x} \rightarrow \mathbf{x}')$, then in the limit, the proportion of time we spend in a certain state $\mathbf{x}$ *is* $P_M(\mathbf{X}=\mathbf{x}|\mathbf{e})$. If we are only concerned about a $\mathbf{Y}$ that is a subset of $\mathbf{X}$, we simply select out the $\mathbf{Y}$-elements from our $\mathbf{X}$ samples. That long term behavior

Next, **Monte Carlo** refers to the use of *simulations* to solve our problem. That is, we'll pick some starting state, $\mathbf{y}^{(0)}$, and then sample the next state, $\mathbf{y}^{(1)}$, using the transition probabilities to do so. We do this repeatedly, giving us a long list of $\mathbf{y}$'s. But here's the kicker. We may set these transition probabilities such that the amount of time we spend in a particular state, $\mathbf{y}$, is in proportion to $P_M(\mathbf{Y}=\mathbf{y}|\mathbf{e})$... eventually. The 'eventually' means that this doesn't happen until we are some number of samples deep.

This might sound rather abstract, so let's discuss a particular algorithm.