### What is a Graphical Model?

Notes (PGM)

It might be a good idea to read the intro/end of all the chapters.

Chapter 1
* they separate between knowledge (the model of our system) and reasoning (how we go about answering questions regarding that model)
* PGM's models are focused on dealing with uncertainty
* The finest-grained truth of reality might be purely deterministic (think Laplace's demon), but our limited knowledge of it effectively makes it stochastic.
* Pure certainty almost never happens, so we are almost always faced with a large set of possibilities. Given that, we must decide which are probable (probability theory allows us to do this).
* 'Probabilistic graphical models use a graph-based representation as the basis for compactly encoding a complex distribution over a high-dimensional space.'
* A graph represents two perspectives:
    * It's a way of representing a set of independence statements for a distribution.
    * It's a way of representing how the calculation of the probability to an assignment to the distribution can be broken up into a product of factors.
* It turns out (in a 'deep way') that these two perspectives are equivalent.
* There are two families of graphical representations of distributions: directed and undirected. They differ in the independence relations they may encode.
* The graphical model framework has 3 advantages:
    * It gives us a general framework for models to represent our world
    * It allows for inference (probabilistic queries)
    * It allows for learning - a data driven means of determining specific models.
* P factorizes according to G iff P satisfies the local (or global) CI statements of G.
    
Chapter 3
* 'The compact representations we explore in this chapter are based on two key ideas: the representation of independence properties of the distribution, and the use of an alternative parameterization that allows us to exploit these finer-grained independencies.'
* Bayesian Networks (think CPTs) allow us to avoid representing probabilities of each joint assignment. Instead, we can use CPTs and a set of rules to produce whatever probability we are interested in. This makes for much fewer parameters.

Chapter 4
* Undirected models are simpler from the angle of inference and the CI statements.
* Undirected models sometimes require we restrict attention to discrete state spaces.
* We want to capture affinities between groups of RVs.
* A factor is a function from Val(D) to R. A nonnegative factor yields all nonnegative values
* P106: Key result: P satisfies (X perp Y | Z) if we can write P(X,Y,Z)  = f1(X,Z)f2(Y,Z)
* P106: Independence properties are simpler: I think if X and Y are separated by Z in G, then X and Y are conditionally independent given Z in any distribution that factorizes according to G.
* Parameterization is more complicated in a Markov Network - they don't correspond to conditional probs.
* A factor is inclusive of a joint distribution and a conditional prob. A joint distribution is one type of factor. Same with a conditional prob function.
* Top of P108: If we have a BN along with CPDs, we can recreate all its probabilities using a Markov Network defined over the same $\mathcal{X}$.
* Say you define a $P$ using factors. We say that $P$ factorizes according to the MN $\mathcal{H}$ if each factor's scope is a subset of RVs that forms a *complete subgraph* in $\mathcal{H}$. We could reduce the number of factors if we require that these are *maximal* complete subgraphs (maximal cliques), but that can obscure useful structure.
* P111: This is all about how conditioning on an assignment yields another simple Gibbs (?) distribution over a graph that includes all variables that aren't associated with the assignment.
* P114: Independence statements are indeed simpler:
* Hammersley-Clifford: If $P$ is a positive distribution and $\mathcal{H}$ is a MN and an I-map for $P$, then $P$ is a Gibbs distribution that factorizes over $\mathcal{H}$. According to MLPP, a positive distribution satisfies the CI statements of an $\mathcal{H}$ if it can be written as a product of factors over the maximal cliques of $\mathcal{H}$.
* Example 4.4: I can write a MN $\mathcal{H}$ that has the same CI statements as a P, but where P does not factorize over $\mathcal{H}$. What!? This is because P is not a *positive* distribution - it assigns zero probability to some assignments.
* Let $\mathcal{H}$ be a 

Elsewhere
* I-equivalence of two graphs immediately implies that any distribution P that can be factorized over one of these graphs can be factorized over the other.
* Different BNs can encode the same CI statements
* Soundness: If X and Y are d-separated, then for all distributions that factor over G, X and Y are conditionally independent.
* Completeness: If X and Y are conditionally independent in *every* P that factorizes over G, then they are d-separated.

Should I include a "the complicated parts" section?

* Many different BNs can represent the same CI statements (p87)

What topics might I touch?

* What is a Random Variable?
* Should address undirected vs directed networks
* Maybe Markov Equivalence?
* What an 'assignment' to a random variable is?
* Maybe mention the 'generative story' that comes with some directed models (does it come with the undirected versions as well?)
* I need to mention that I'm only considering the discrete end.

Possible outline:

* What are we concerned with? A probability distribution in high dimensions and answering queries regarding it.
* Set up the 'problem'. A set of variables $\mathcal{X}$ (though this is restrictive! sometimes we actually care about a growing set of random variables), an assignment, different queries...

Possible outline 2:

1. What generic problem do PGMs address?
2. What i    

2. What is a PGM?
* A PGM

What is the approach of PGMs?

* Establish a relationship between probability distributions and a graphical model.

BIG ISSUE: Can you always represent the CI Statements of a BN with a MN? No I believe we are OK, see 19.2.2 of MLPP

THIS IS NOT TRUE: If P factorizes according to the BN G, then I(P) = I(G).

See the completeness section of PGM: If X and Y aren't d-separated in G, then you could construct a P that factorizes over G where they are dependent. Also, you could pick CPDs such that they are independent. This is why G represents all P's that satisfy its CI statements.

Should mention the overall approach: I'll show the big picture for each topic and then an algorithm from that topic.. and then use the language of that algorithm to discuss the rest of that topic.

### What are Probabilistic Graphical Models and why are they useful?

![title](pgm_examples.png)

*These* are Probabilistic Graphical Models. They are arguably our most complete and promising toolkit for inferring truth from complexity. They're born from a single set of principles that enable our machines to dominate chess, diagnose disease, translate language, decipher sound, recognize images and drive cars. 'Neural Networks' and 'Probabilistic Programming' are famous signatures of the Machine Learning community simply because they are effective toolsets for applying these devices.

My aim here is to reveal the machinery behind this magic. I intend to show what they are, why we use them and how we actually use them. To do that, I'll answer 7 questions on this topic over 7 Mondays, of which this is the first. Those are:

1. What are Probabilistic Graphical Models and why are they useful?
2. What is 'exact inference' in the context of Probabilistic Graphical Models? [link] (Posting on 9/3/2018)
3. What is Variational Inference in the context of Probabilistic Graphical Models? [link] (Posting on 9/10/2018)
4. How are Monte Carlo methods used to perform inference in Probabilistic Graphical Models? [link] (Posting on 9/17/2018)
5. How are the parameters of a Bayesian Network learned? [link] (Posting on 9/24/2018)
6. How are the parameters of a Markov Network learned? [link] (Posting on 10/1/2018)
7. How is the graph structure of a Probabilistic Graphical Model learned? [link] (Posting on 10/8/2018)

I realize this is a good deal to digest, especially for internet browsing. But allow me to sell you. This information is typically delivered with a worthwhile 1000+ page textbook[link] to graduate computer scientists. We can 80-20 these ideas with just a few answers! It'll take discipline, but you'll gain a surprisingly good understanding of an absolutely foundational theory of Machine Learning.

As a compromise, I've structured things such that you need only read a subset of these answers to get a full picture. Here's a map of that structure:

![Title](Map.png)

For example, if you read $1 \rightarrow  2 \rightarrow 6 \rightarrow 7$, you'll get a complete taste.  Also, I'll include refreshers at the beginning of each answer - this should make things more self contained. (If you read these answers in sequence, I'd skip those refreshers, as they will sound redundant.)

If this sounds like a good deal to you, please follow those questions!

Now, let's start walking.

### Notation Guide

As a first stop, we'll review notation, an admittedly boring place. But, it's my unconventional belief that most confusion is due to notation. So if we wish to survive, we'll need a few tips:

* An upper case non-bold letter indicates a single random variable ('RV'). The same letter lowercased with a superscript indicates a specific value that RV may take. For example, $X=x^1$ is the *event* that the RV $X$ took on the value $x^1$. We call this event an **assignment**. The set of unique values an RV may take is given by $Val(X)$. So we might have $Val(X)=\{x^0,x^1\}$ in this case.
* A bold upper case letter indicates a *set* of RVs (like $\mathbf{X}$) and a bold lower case letter indicates a set of values they may take. For example, we may have $\mathbf{X}=\{A,B\}$ and $\mathbf{x}=\{a^3,b^1\}$. Then the event $\mathbf{X}=\mathbf{x}$ is the event that $A=a^3$ happens *and* $B=b^1$ happens. Naturally, $Val(\mathbf{X})$ is the set of all possible unique joint assignments to the RVs in $\mathbf{X}$.
* If you see $\mathbf{x}$ (or $\mathbf{y}$ or $\mathbf{z}$ etc...) within a probability expression, like $P(\mathbf{x}|\cdots)$ or $P(\cdots|\mathbf{x})$, that's always an abbreviation of the event '$\mathbf{X}=\mathbf{x}$'.
* Perhaps confusingly, we also abbreviate the event '$\mathbf{X}=\mathbf{x}$' as '$\mathbf{X}$', though this isn't a clean abbreviation. Omission of $\mathbf{x}$ means one of two things: either we mean this for *any* given $\mathbf{x}$ or for *all* possible $\mathbf{x}$'s. As an example for the latter case, 'calculate $P(\mathbf{X})$'  would mean calculate the set of probabilities $P(\mathbf{X}=\mathbf{x})$ for all $\mathbf{x}\in Val(\mathbf{X}).$ 
* $\sum_\mathbf{X}f(\mathbf{X})$ is shorthand for $\sum_{\mathbf{x}\in Val(\mathbf{X})}f(\mathbf{X}=\mathbf{x})$. This is similarly true for $\prod_\mathbf{X}(\cdot)$ and $\textrm{argmin}_\mathbf{X}(\cdot)$. Look out for this one - it can sneak in there and change things considerably.
* You may see equations like $f(A,B,C)=g(\mathbf{X})h(\mathbf{Y})$. They look strange - the RVs on the left aren't on the right! Well, in such cases, you also have something like $\mathbf{X} = \{A,B\}$ and $\mathbf{Y} = \{B,C\}$. So the equation really is $f(A,B,C)=g(A,B)h(B,C)$.
* Probability distributions are referred to with a $P$, $\textrm{Q}$, $q$  or $\pi$ with some descriptive subscripts/superscripts. Keep in mind that distributions are a special kind of *function*. Remember that!
* Everything is in reference to the *discrete* case. Unfortunately, the continuous case is *not* a simple generalization from the discrete case. The one minor exception is in the visuals. The discrete case is less friendly to graphs, so I might use some continuous distributions. As it relates to the discussion, pretend these are in fact discrete distributions with a fine granularity and an implied ordering of the values.

Almost all of this notation comes from the text Probabilistic Graphical Models[link] - one of those 1000 page monsters. That book is extremely thorough, and should be considered stop number 8.

Look, you've already done the hardest part! Onto the fun stuff - we ask:

### What generic problem do PGMs address?

Our goal is to understand a complex system. We assume the complex system manifests as $n$ RVs, which we may write as $\mathcal{X} = \{X_1,X_2,\cdots,X_n\}$ [1][2]. We take it that 'a good understanding' means we can answer two types of questions *accurately* and *efficiently* for these RVs. If we say $\mathbf{Y}$ and $\mathbf{E}$ are two given subsets of $\mathcal{X}$, then those questions are:

1. **Probability Queries**: Compute the probabilities $P(\mathbf{Y}|\mathbf{E}=\mathbf{e})$. Which means, what is the distribution of the RVs of $\mathbf{Y}$ given we have some observation of the RVs of $\mathbf{E}$?
2. **MAP Queries**: Determine $\textrm{argmax}_\mathbf{Y}P(\mathbf{Y}|\mathbf{E}=\mathbf{e})$. That is, determine the most likely assignments of RVs given an assignment of other RVs.

Before continuing, we should point a few things out:

* Since $\mathbf{Y}$ and $\mathbf{E}$ are any two subsets of $\mathcal{X}$, there is potentially a remaining set (call it $\mathbf{Z}$) that's in $\mathcal{X}$. In other words, $\mathbf{Z} = \mathcal{X} \backslash (\mathbf{Y}\cup\mathbf{E})$. This set appears left out of our questions, but is very much at play. We have to sum these RVs out, which can considerably complicate our calculations. For example, $P(\mathbf{y}|\mathbf{e})$ is actually $\sum_\mathbf{Z}P(\mathbf{y},\mathbf{Z}|\mathbf{e})$. On a note of terminology, we say $P(\mathbf{y}|\mathbf{e})$ is a 'marginal' probability, since some other RVs were summed out.
* We haven't mentioned any model yet. This set up is asking generically for probabilities and values that accurately track reality.
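
To make the two query types concrete, here is a minimal sketch on a tiny made-up joint table over three binary RVs. The roles `Y`, `E` and `Z`, the probabilities and the helper names `probability_query` and `map_query` are all assumptions for illustration; nothing here comes from a real system.

```python
from itertools import product

# Hypothetical joint P(Y, E, Z) over three binary RVs. The numbers are made up;
# they only need to be nonnegative and sum to 1. Keys are (y, e, z).
values = [0, 1]
probs = [0.10, 0.05, 0.20, 0.15, 0.05, 0.05, 0.10, 0.30]
joint = dict(zip(product(values, repeat=3), probs))


def probability_query(joint, e):
    """Return P(Y | E=e) by summing Z out and renormalizing."""
    unnormalized = {y: sum(joint[(y, e, z)] for z in values) for y in values}
    total = sum(unnormalized.values())  # this is P(E=e)
    return {y: p / total for y, p in unnormalized.items()}


def map_query(joint, e):
    """Return argmax_y P(Y=y | E=e)."""
    posterior = probability_query(joint, e)
    return max(posterior, key=posterior.get)


posterior = probability_query(joint, e=1)
best_y = map_query(joint, e=1)
```

Even in this toy, the RV `Z` never appears in the question but still has to be summed over. With many RVs, that sum is where the cost explodes.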

To this end, we are assisted by the fact that we have some, at least partial, joint observations of these RVs, $\mathcal{X}$. However, some of our $n$ RVs may *never* be observed. These are called 'hidden' variables and they will complicate our lives later on.

This set up is extremely general, and as such, this problem is extremely hard.

### The problem with joint distributions.

Our starting point, perhaps surprisingly, will be to consider the joint distribution of our RVs $\mathcal{X}$, which we aren't given in real applications (but we'll get there). We'll call that joint distribution $P$. Conceptually, we can think of this as a table that lists out all possible joint assignments of $\mathcal{X}$ and their associated probabilities. So if $\mathcal{X}$ is made up of 10 RVs, each of which can take 1 of 100 values, this table has $100^{10}$ rows, each indicating a particular assignment of $\mathcal{X}$ and its probability.

The issue is, for a complex system, this table is too big. Even if we had the crystal ball luxury of having $P$, we *can't handle it*. So now what?

### The Conditional Independence statement

We need a **compact representation** of $P$ - something that gives us all the information of that table, but without having to actually write it down. To this end, our saving grace is the **conditional independence (CI) statement**:

"
Given subsets of RVs $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ from $\mathcal{X}$, we say $\mathbf{X}$ is conditionally independent of $\mathbf{Y}$ given $\mathbf{Z}$ if

$$
P(\mathbf{x},\mathbf{y}|\mathbf{z})=P(\mathbf{x}|\mathbf{z})P(\mathbf{y}|\mathbf{z})
$$

for *all*  $\mathbf{x}\in Val(\mathbf{X})$, $\mathbf{y}\in Val(\mathbf{Y})$ and $\mathbf{z}\in Val(\mathbf{Z})$. This is written '$P$ satisfies $(\mathbf{X}\perp \mathbf{Y}|\mathbf{Z})$[3]'
"

Now, if we had sufficient time and summation abilities, we could calculate the left side and the right side for a distribution $P$. If the equation holds for all values, then, by definition, the independence statement holds. Intuitively, though not obviously from the equations, this means that if you are given the assignment of $\mathbf{Z}$, then knowing the assignment of $\mathbf{X}$ will never help you guess $\mathbf{Y}$ (and vice versa). In other words, $\mathbf{X}$ provides no information for predicting $\mathbf{Y}$ beyond what $\mathbf{Z}$ has. Similarly, you can't predict $\mathbf{X}$ from $\mathbf{Y}$ any better.
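
For a small enough table, that brute-force check can actually be done. The sketch below assumes the joint over single RVs $X$, $Y$, $Z$ is stored as a dictionary keyed by `(x, y, z)`; the function name `satisfies_ci` and both example distributions are invented for illustration.

```python
from itertools import product


def satisfies_ci(joint, tol=1e-9):
    """Check (X perp Y | Z) for a joint given as {(x, y, z): probability}."""
    xs = sorted({k[0] for k in joint})
    ys = sorted({k[1] for k in joint})
    zs = sorted({k[2] for k in joint})
    for z in zs:
        p_z = sum(joint[(x, y, z)] for x in xs for y in ys)
        if p_z == 0:
            continue  # conditioning on a zero-probability event is undefined
        for x, y in product(xs, ys):
            p_xy_given_z = joint[(x, y, z)] / p_z
            p_x_given_z = sum(joint[(x, y2, z)] for y2 in ys) / p_z
            p_y_given_z = sum(joint[(x2, y, z)] for x2 in xs) / p_z
            if abs(p_xy_given_z - p_x_given_z * p_y_given_z) > tol:
                return False
    return True


# Hypothetical joint built as P(z) P(x|z) P(y|z), so the statement holds by construction.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}
ci_joint = {
    (x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for x, y, z in product([0, 1], repeat=3)
}

# Hypothetical joint where X always equals Y, whatever Z is.
dependent_joint = {
    (x, y, z): (0.25 if x == y else 0.0) for x, y, z in product([0, 1], repeat=3)
}
```

Calling `satisfies_ci(ci_joint)` should report that the statement holds, while `satisfies_ci(dependent_joint)` should not. The loop touches every row of the table, which is exactly why this check isn't practical for a large $P$.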

Knowing such statements turns out to be massively useful - they give us that compact representation we need. To see this, let's say $(X_i \perp \mathcal{X} \backslash \{X_i\})$ for all $i \in \{1,\cdots,10\}$. This is to say, each RV is independent of all the other RVs taken together (this is mutual independence, which is stronger than requiring $(X_i \perp X_j)$ for each pair). It turns out that with these statements, we only need to know the marginal probabilities of each value for each RV (which is a total of $10\cdot100=1000$ values) and may reproduce all the probabilities of $P$. So if we are considering the case where $\mathbf{X}=\mathcal{X}$ and would like to know the probability $P(\mathbf{X}=\mathbf{x})$, we simply return $\prod_{i=1}^{10}P(X_i=x_i)$, where $x_i$ is the $i$-th element of $\mathbf{x}$.

This isn't just a saving on storage, though. This is a simplification of $P$ that will ease virtually any interaction with $P$, including summing over many assignments and finding the most likely assignment. So at this point, I'd like you to think that CI statements regarding $P$ are a requirement for wielding it.
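
As a rough sketch of that saving, the snippet below compares the two storage costs for the 10-RV, 100-value example and computes one joint probability as a product of marginals. The uniform marginals and the name `joint_probability` are assumptions chosen only to keep the example short.

```python
import math

n_rvs, n_values = 10, 100
full_table_rows = n_values ** n_rvs   # 100^10 joint assignments
marginal_entries = n_rvs * n_values   # 10 * 100 marginal probabilities

# Hypothetical marginals: every RV is uniform over its 100 values.
marginals = [[1.0 / n_values] * n_values for _ in range(n_rvs)]


def joint_probability(assignment, marginals):
    """P(X = x) as the product of P(X_i = x_i), valid when all RVs are independent."""
    return math.prod(marginals[i][x_i] for i, x_i in enumerate(assignment))


# One joint assignment, given as an index into each RV's list of values.
p = joint_probability([0, 17, 42, 99, 3, 3, 58, 7, 60, 11], marginals)
```

Ten lookups and nine multiplications replace a lookup into a table that could never be stored.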

Now put a pin in this and let's switch gears.

### The Bayesian Network

It's time to introduce the first type of Probabilistic Graphical Model - the **Bayesian Network** ('BN'). A BN refers to two things, both in relation to some $\mathcal{X}$: a BN graph (called $\mathcal{G}$) and an associated probability distribution $P_B$. $\mathcal{G}$ is a set of nodes, one for each RV of $\mathcal{X}$, and a set of *directed* edges, such that there are no directed cycles. Said differently, it's a DAG [link]. $P_B$ is a distribution with probabilities for assignments of $\mathcal{X}$ using a certain rule and Conditional Probability Tables ('CPTs'; more generally, Conditional Probability Distributions or 'CPDs'), which augment $\mathcal{G}$. That rule (called the 'Chain Rule for BNs') for determining probabilities can be written:

$$
P_B(X_1,\cdots,X_n)=\prod_{i=1}^n P_B(X_i|\textrm{Pa}_{X_i}^\mathcal{G})
$$

where $\textrm{Pa}_{X_i}^\mathcal{G}$ indicates the set of parent nodes/RVs of $X_i$ according to $\mathcal{G}$. The CPDs tell us what the $P_B(X_i|\textrm{Pa}_{X_i}^\mathcal{G})$ probabilities are. That is, a CPD lists out the probabilities of all assignments of $X_i$ given any joint assignment of $\textrm{Pa}_{X_i}^\mathcal{G}$[4]. These CPDs are the *parameters* of our model. Their form is to list out actual conditional probabilities from $P_B$.

To help, let's consider a well utilized example from that monstrous text: the 'Student Bayesian Network'. Here, we're concerned with a system of five RVs: a student's intelligence ($I$), their class's difficulty ($D$), their grade in that class ($G$), their letter of recommendation ($L$) and their SAT score ($S$). So $\mathcal{X}=\{I,D,G,L,S\}$. The BN graph along with the CPDs can be represented as:

![title](StudentBN.png)

According to our rule, we have that any joint assignment of $\mathcal{X}$ factors as:

$$
P_B(I,D,G,S,L) = P_B(I)P_B(D)P_B(G|I,D)P_B(S|I)P_B(L|G)
$$

So we would calculate a given assignment as:

$$
\begin{align}
P_B(i^1,d^0,g^2,s^1,l^0) = & P_B(i^1)P_B(d^0)P_B(g^2|i^1,d^0)P_B(s^1|i^1)P_B(l^0|g^2)\\
= & 0.3\cdot 0.6\cdot 0.08\cdot 0.8\cdot0.4 \\
= & 0.004608\\
\end{align}
$$

Not too bad, right? All this shows is that a BN along with CPDs gives us a way to calculate probabilities for assignments of $\mathcal{X}$.
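
The same calculation can be sketched in a few lines. The parent sets are read off the factorization above, and the CPDs are deliberately *partial*: only the five entries quoted in the worked example are filled in. The dictionary layout and the name `chain_rule` are my own assumptions, not anything from the text's sources.

```python
# Parents of each RV in the Student BN, as read off the factorization above.
parents = {"I": [], "D": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

# Partial CPDs: only the entries quoted in the worked example are filled in.
# Key = (value of the RV, tuple of parent values in the order listed in `parents`).
cpds = {
    "I": {("i1", ()): 0.3},
    "D": {("d0", ()): 0.6},
    "G": {("g2", ("i1", "d0")): 0.08},
    "S": {("s1", ("i1",)): 0.8},
    "L": {("l0", ("g2",)): 0.4},
}


def chain_rule(assignment, parents, cpds):
    """Multiply P(X_i | Pa(X_i)) over all RVs for one full joint assignment."""
    prob = 1.0
    for rv, pa in parents.items():
        parent_values = tuple(assignment[p] for p in pa)
        prob *= cpds[rv][(assignment[rv], parent_values)]
    return prob


assignment = {"I": "i1", "D": "d0", "G": "g2", "S": "s1", "L": "l0"}
p = chain_rule(assignment, parents, cpds)  # matches the 0.004608 computed above
```

With complete CPDs, the same function would return the probability of any joint assignment, which is all the Chain Rule promises.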

Now we're ready for:

### The big idea.

It's so big, it gets its own quote block:

"The BN graph, just those nodes and edges, implies a set of CI statements regarding its accompanying $P_B$."

It's a consequence of the Chain Rule for calculating probabilities. As a not-at-all-obvious result, a BN graph represents all $P$'s that satisfy these CI statements and each of those $P$'s could be attained with an appropriate choice of CPDs.

For a BN, one form of those CI statements is:

"
$(X_i \perp$ NonDescendants$_{X_i}|\textrm{Pa}_{X_i}^\mathcal{G})$ for $X_i \in \mathcal{X}$
"

So in the student example, we'd have this set:

$$
(L\perp I,D,S|G)\\
(S\perp D,G,L|I)\\
(G\perp S|I,D)\\
(I\perp D)\\
(D\perp I,S)\\
$$

The third statement tells us that if you already know the student's intelligence and their class's difficulty, then knowing their SAT score won't help you guess their grade. This is because the SAT score is correlated with their grade *only via* their intelligence, and you already know that.

These are referred to as the 'local semantics' of the BN graph. To complicate matters, there are almost always many other true CI statements associated with a BN graph outside of the local semantics. To determine those by inspecting the graph, we use a scary 'D-separation' algorithm that I will shamelessly not explain[link].
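
The local CI statements listed for the student example can be derived mechanically from the graph alone. Here is a minimal sketch, assuming the graph is stored as a dictionary from each RV to its parents. The function names are hypothetical, and this covers only the local semantics, not D-separation.

```python
def descendants(node, children):
    """All nodes reachable from `node` by following directed edges."""
    found, stack = set(), list(children[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def local_ci_statements(parents):
    """Map each RV to (its non-descendants excluding parents, its parents)."""
    children = {rv: [] for rv in parents}
    for rv, pa in parents.items():
        for p in pa:
            children[p].append(rv)
    statements = {}
    for rv, pa in parents.items():
        excluded = descendants(rv, children) | set(pa) | {rv}
        statements[rv] = (set(parents) - excluded, set(pa))
    return statements


parents = {"I": [], "D": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}
statements = local_ci_statements(parents)
for rv, (non_desc, pa) in statements.items():
    print(f"({rv} perp {sorted(non_desc)} | {sorted(pa)})")
```

Note that no probabilities appear anywhere in this code. The CI statements come from the nodes and edges only.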

There is a reason this is so important. Since a BN graph is a way of *representing* CI statements and such statements are a requirement for handling a complex system's joint distribution (if you had it), then this is good reason to use a BN to represent such systems. If we can accurately *represent* a system with a BN, we will be able to calculate our probability and MAP queries. Therefore, BNs will solve our problems when we're dealing with a certain class of $P$'s. This choice, unsurprisingly, is called our **representation**.

But there's an issue - I said a 'class' of $P$'s. It's not hard to invent $P$'s that come with CI statements a BN cannot represent.

So now what? Well, we have other tools, the biggest of which is...

### The Markov Network

A **Markov Network** ('MN') is likewise composed of a graph (call it $\mathcal{H}$) and a probability distribution (call it $P_M$). Though this time, the graph's edges are *undirected* and it may have cycles. The consequence is that a MN can represent a different set of CI statements. But, the lack of directionality means we can no longer use CPDs. Instead, that information is delivered with a **factor**, which is a *function* (function! remember it) that maps from an assignment of some subset of $\mathcal{X}$ to some nonnegative number. These factors are used to calculate probabilities with the 'Gibbs Rule'[5].

To understand the Gibbs Rule, we must define a **complete subgraph**. A subgraph is exactly what it sounds like - we make a subgraph by picking a set of nodes from $\mathcal{H}$ and including all edges from $\mathcal{H}$ that are between nodes from this set. A 'complete' graph is one which has every edge it can - each node has an edge to every other node.

Now, let's say $\mathcal{H}$ breaks up into a set of $m$ complete subgraphs. By 'break up', I mean that the union of all nodes and edges across these subgraphs gives us all the nodes and edges from $\mathcal{H}$. Let's write the RVs associated with the nodes of these subgraphs as $\{\mathbf{D}_i\}_{i=1}^m$ . Let's also say we have one factor (call it $\phi_i(\cdot)$) for each of these. We refer to these factors together with $\Phi$, so $\Phi=\{\phi_i(\cdot)\}_{i=1}^m$. For terminology's sake, we say that the 'scope' of the factor $\phi_i(\cdot)$ is $\mathbf{D}_i$ because $\phi_i(\cdot)$ takes an assignment of $\mathbf{D}_i$ as input.

Finally, the Gibbs Rule says we calculate a probability as:

$$
P_M(X_1,\cdots,X_n) = \frac{1}{Z} \prod_{i=1}^m \phi_i(\mathbf{D}_i)
$$

where

$$
Z = \sum_{\mathbf{x}\in Val(\mathcal{X})} \prod_{i=1}^m \phi_i(\mathbf{D}_i)
$$

(It's hidden from this notation, but we're assuming it's clear how to match up the assignment of $X_1,\cdots,X_n$ with the assignments of the $\mathbf{D}_i$'s.)

Wait - the MN was introduced because it represents a different set of CI statements. So, which ones? It's considerably simpler in the case of a MN. A MN implies the CI statement $(\mathbf{X} \perp \mathbf{Y}|\mathbf{Z})$ if all paths between $\mathbf{X}$ and $\mathbf{Y}$ go through $\mathbf{Z}$. Easy!

Now let's get specific. Below is an MN for the system $\mathcal{X}=\{A,B,C,D\}$ and the CI statements it represents:

![title](MN1.png)

As you may notice, it's not hard to write those CI statements by viewing the graph.
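
The "all paths go through $\mathbf{Z}$" rule is also easy to check by machine: remove the nodes of $\mathbf{Z}$ and see whether $\mathbf{X}$ can still reach $\mathbf{Y}$. The sketch below does this with a breadth-first search. The edge list is an assumption on my part: I'm taking the graph to be the loop A - B - C - D - A, consistent with the pairwise factors used for this example, and `separated` is a hypothetical helper name.

```python
from collections import deque


def separated(edges, xs, ys, zs):
    """True if every path from a node in xs to a node in ys passes through zs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Breadth-first search from xs that is never allowed to step onto zs.
    visited = set(xs) - set(zs)
    queue = deque(visited)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False  # found a path that avoids zs
        for nxt in neighbors.get(node, ()):
            if nxt not in visited and nxt not in zs:
                visited.add(nxt)
                queue.append(nxt)
    return True


# Assumed edge set: the four-node loop A - B - C - D - A.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

a_c_given_bd = separated(edges, {"A"}, {"C"}, {"B", "D"})
b_d_given_ac = separated(edges, {"B"}, {"D"}, {"A", "C"})
a_c_given_b = separated(edges, {"A"}, {"C"}, {"B"})
```

Under that assumed graph, conditioning on both `B` and `D` cuts `A` off from `C`, while conditioning on `B` alone leaves the path through `D` open.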

While we're here, let's write out the Gibbs Rule. By looking at this, we could identify our complete subgraphs as: $\{\{A,B\},\{B,C\},\{C,D\},\{D,A\}\}$. With that, we calculate a probability as:

$$
P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)
$$

where

$$
Z = \sum_{\mathbf{x}\in Val(\mathcal{X})}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)
$$

To repeat, each $\phi_i(\cdot,\cdot)$ is just a function that maps from its given joint assignment to some nonnegative number. So if $A$ and $B$ could only take on two values each, $\phi_1(\cdot,\cdot)$ would relate the four possible assignments to four nonnegative numbers. These functions serve as our parameters just as the CPDs did. Determining these functions brings us from a class of $P$'s we can represent with CI statements to a specific $P$ within it, defined with probabilities.
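
Here is a minimal sketch of the Gibbs Rule for this four-RV example, assuming every RV is binary. The factor values (10, 1 and 5) and the function names `agree` and `disagree` are invented; any nonnegative numbers would serve.

```python
from itertools import product

values = [0, 1]


# Hypothetical pairwise factors: any nonnegative numbers would do.
def agree(u, v):
    return 10.0 if u == v else 1.0


def disagree(u, v):
    return 1.0 if u == v else 5.0


# One factor per complete subgraph (here, one per edge of the loop).
factors = [
    (("A", "B"), agree),
    (("B", "C"), agree),
    (("C", "D"), disagree),
    (("D", "A"), agree),
]
names = ["A", "B", "C", "D"]


def unnormalized(assignment):
    """Product of all factors evaluated on one joint assignment (a dict)."""
    result = 1.0
    for scope, phi in factors:
        result *= phi(*(assignment[name] for name in scope))
    return result


all_assignments = [dict(zip(names, vals)) for vals in product(values, repeat=4)]
Z = sum(unnormalized(a) for a in all_assignments)  # the normalizer from the Gibbs Rule


def gibbs_probability(assignment):
    return unnormalized(assignment) / Z


p = gibbs_probability({"A": 0, "B": 0, "C": 0, "D": 1})
total = sum(gibbs_probability(a) for a in all_assignments)
```

Notice that computing `Z` required visiting all $2^4$ joint assignments. Unlike CPD entries, the factor values mean nothing as probabilities on their own; only the normalized product does.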

But, ahem, uhh.. there's an issue. In the BN case, I said:

"
 As a not-at-all-obvious result, a BN graph represents all $P$'s that satisfy its CI statements and each of those  $P$'s could be attained with an appropriate choice of CPDs.
"

The analogous statement is *not* true in the case of MNs. There may exist a $P$ that satisfies the CI statements of a MN graph, but we *can't* calculate its probabilities with the Gibbs Rule. Damn!

Fortunately, these squirrely $P$'s fall into a simple, though large, category: those which assign a *zero* probability to one or more assignments. This leads us to the Hammersley-Clifford theorem:

"
If $P$ is a positive distribution ($P(\mathbf{X}=\mathbf{x})>0$ for all $\mathbf{x} \in Val(\mathcal{X})$) which satisfies the CI statements of $\mathcal{H}$, then we may use the Gibbs Rule, along with a choice of complete subgraphs and associated factors, to yield the probabilities of $P$. [6]
"

And that about does it for the basics of MNs. They are just another way of representing another class of $P$'s. 

### How do BNs and MNs compare?

At this point, we're not evolved enough for a full comparison, so let's do a partial one.

First, it's clearly easier to determine CI statements in a MN - no fancy D-separation algorithm required. This follows from their simple symmetric undirected edges, which make them a natural candidate for certain problems. Broadly, MNs do better when we have decidedly associative observations - like pixels on a screen or concurrent sounds. BNs are better suited when we suspect the data attests to a causal structure. Timestamps and an outside expectation of what's producing the data are helpful for that.

Also, there's a certain overlap between a MN and a BN that'll unify our discussion in later answers. That is, the probabilities produced by the Chain Rule of any given BN can be *exactly reproduced* by the Gibbs Rule of a specially defined MN. To see this, look at the Chain Rule - $P_B(X_i|\textrm{Pa}_{X_i}^\mathcal{G})$ is just the conditional probability of some (unspecified) $X_i$ value given some assignment of the parent RVs. Well, to translate this to the Gibbs Rule, let $\mathbf{D}_i=\{X_i\}\cup\textrm{Pa}_{X_i}^\mathcal{G}$. Next, *define* $\phi_i(\mathbf{D}_i)$ to produce the same output you'd get from looking up the BN conditional probability in the CPD (which is $P_B(X_i|\textrm{Pa}_{X_i}^\mathcal{G})$). Awesome - now the Gibbs Rule is the same expression as the Chain Rule (with $Z=1$, since the product of CPDs already sums to one). This is useful because we can speak solely in terms of the Gibbs Rule and whatever we discover, we know will also work for the Chain Rule (and hence BNs). What this *doesn't* mean is that MNs are a substitute for BNs. If you were to look at this invented MN, it would likely have way more edges in its graph and therefore, fewer CI statements and therefore, a wider and more unwieldy class of $P$'s. In other words, BNs are still useful representations.
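
A tiny numerical check of that translation, assuming a hypothetical two-node BN $X \rightarrow Y$ with made-up CPDs. The factors `phi_1` and `phi_2` are defined to return exactly the CPD entries.

```python
from itertools import product

# Hypothetical two-node BN, X -> Y, with made-up CPDs.
p_x = {0: 0.7, 1: 0.3}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}


# Factors defined to return exactly the CPD entries.
def phi_1(x):
    return p_x[x]


def phi_2(x, y):
    return p_y_given_x[x][y]


assignments = list(product([0, 1], repeat=2))
Z = sum(phi_1(x) * phi_2(x, y) for x, y in assignments)

chain_rule = {(x, y): p_x[x] * p_y_given_x[x][y] for x, y in assignments}
gibbs_rule = {(x, y): phi_1(x) * phi_2(x, y) / Z for x, y in assignments}
```

Because each factor is already a normalized conditional distribution, the normalizer `Z` comes out as 1 (up to floating point) and the two dictionaries agree entry by entry.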

### But there's more to learn.

Let's say we determined our graphical model along with its parameters. How do we actually answer those queries? Well, I have three suggestions:

[2] What is 'exact inference' in the context of Probabilistic Graphical Models? How is it performed? [link] (Posting on DATE)
[3] What is Variational Inference? [link] (Posting on DATE)
[4] How are Monte Carlo methods used to perform inference in Probabilistic Graphical Models? [link] (Posting on DATE)

### Footnotes

[1] This is the one exception where we don't refer to a set of RVs with a bold uppercase letter.

[2] This actually isn't the fully general problem specification. In complete generality, the set of RVs should be allowed to grow/shrink over time. That, however, is outside what I expect to accomplish in these posts.

[3] There is a subtlety of language here. Often we'll say '$P$ satisfies these CI statements'. That means those CI statements are true for $P$, but *others may be true as well*. So it means 'these CI statements' are a subset of all $P$'s true CI statements. This technicality matters, so keep an eye out for it.

[4] If $X_i$ doesn't have any parents, then the CPD is the *unconditional* probability distribution of $X_i$.

[5] This isn't a real name I'm aware of, but the form of that distribution makes it a Gibbs distribution [link] and I'd like to maintain an analogy to BNs, which had the Chain Rule.

[6] The implication goes the other way as well: If the probabilities of $P$ can be calculated with the Gibbs Rule, then $P$ satisfies the CI statements implied by a graph whose cliques correspond to the RVs of each factor (and this direction doesn't even require $P$ to be positive). This direction, however, doesn't fit into the story I'm telling, so it sits as a lonely footnote.

### Sources

[1] Koller, Daphne; Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning series). The MIT Press. Kindle Edition. This is the source of the notation, the graphics in these answers and my appreciation of this subject.