### What is a Graphical Model?

Notes (PGM)

It might be a good idea to read the intro/end of all the chapters.

Chapter 1
* they separate between knowledge (the model of our system) and reasoning (how we go about answering questions regarding that model)
* PGMs are focused on dealing with uncertainty
* the finest-grain truth of reality might be pure determinism (think Laplace's demon), but our limited knowledge of it effectively makes it stochastic.
* pure certainty almost never happens, so we are almost always faced with a large set of possibilities. Given that, we must decide which are probable (probability theory allows us to do this).
* 'Probabilistic graphical models use a graph-based representation as the basis for compactly encoding a complex distribution over a high-dimensional space.'
* A graph represents two perspectives:
    * It's a way of representing a set of independence statements for a distribution.
    * It's a way of representing how the calculation of the probability to an assignment to the distribution can be broken up into a product of factors.
* It turns out (in a 'deep way') that these two perspectives are equivalent.
* There are two families of graphical representations of distributions: directed and undirected. They differ in the independence relations they may encode.
* The graphical model framework has 3 advantages:
    * It gives us a general framework for models to represent our world
    * It allows for inference (probabilistic queries)
    * It allows for learning - a data driven means of determining specific models.
* P factorizes according to G iff P satisfies the local (or global) CI statements of G.
    
Chapter 3
* 'The compact representations we explore in this chapter are based on two key ideas: the representation of independence properties of the distribution, and the use of an alternative parameterization that allows us to exploit these finer-grained independencies.'
* Bayesian Networks (think CPTs) allow us to avoid representing the probability of each joint assignment. Instead, we can use CPTs and a set of rules to produce whatever probability we are interested in. This makes for far fewer parameters.

Chapter 4
* Undirected models are simpler from the angle of inference and the CI statements.
* Undirected models sometimes require we restrict attention to discrete state spaces.
* We want to capture affinities between groups of RVs.
* A factor is a function from Val(D) to R. A nonnegative factor yields all nonnegative values
* P106: Key result: P satisfies (X perp Y | Z) if we can write P(X,Y,Z)  = f1(X,Z)f2(Y,Z)
* P106: Independence properties are simpler: I think if X and Y are separated by Z in G, then X and Y are independent given Z in any distribution that factorizes according to G.
* Parameterization is more complicated in a Markov Network - they don't correspond to conditional probs.
* A factor is inclusive of a joint distribution and a conditional prob. A joint distribution is one type of factor. Same with a conditional prob function.
* Top of P108: If we have a BN along with CPDs, we can recreate all its probabilities using a Markov Network defined over the same $\mathcal{X}$.
* Say you define a $P$ using factors. We say that $P$ factorizes according to the MN $\mathcal{H}$ if the scope of each factor is a subset of RVs that forms a *complete subgraph* in $\mathcal{H}$. We could reduce the number of factors if we require that these are *maximal* complete subgraphs, but that can obscure useful structure.
* P111: This is all about how conditioning on an assignment yields another simple Gibbs (?) distribution over a graph that includes all variables that aren't associated with the assignment.
* P114: Independence statements are indeed simpler:
* Hammersley-Clifford: If $P$ is a positive distribution and $\mathcal{H}$ is a MN and an I-map for $P$, then $P$ is a Gibbs distribution that factorizes over $\mathcal{H}$. According to MLPP, a positive distribution satisfies the CI statements of an $\mathcal{H}$ if it can be written as a product of factors over the maximal cliques of $\mathcal{H}$.
* Example 4.4: I can write a MN $\mathcal{H}$ that has the same CI statements as a P, but where P does not factorize over $\mathcal{H}$. What!? This is because P is not a *positive* distribution: it assigns zero probability to some assignments.
* Let $\mathcal{H}$ be a 

Elsewhere
* I-equivalence of two graphs immediately implies that any distribution P that can be factorized over one of these graphs can be factorized over the other.
* Different BNs can encode the same CI statements
* Soundness: If X and Y are d-separated, then for all distributions that factor over G, X and Y are conditionally independent.
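* Completeness: If X and Y are *not* d-separated in G, then there is some distribution P that factorizes over G in which X and Y are dependent. (It does not say that every independence in a particular P shows up as d-separation.)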

Should I include a "the complicated parts" section?

* Many different BNs can represent the same CI statements (p87)

What topics might I touch?

* What is a Random Variable?
* Should address undirected vs directed networks
* Maybe Markov Equivalence?
* What an 'assignment' to a random variable is?
* Maybe mention the 'generative story' that comes with some directed models (does it come with the undirected versions as well?)
* I need to mention that I'm only considering the discrete end.

Possible outline:

* What are we concerned with? A probability distribution in high dimensions and answering queries regarding it.
* Set up the 'problem'. A set of variables $\mathcal{X}$ (though this is restrictive! sometimes we actually care about a growing set of random variables), an assignment, different queries...

Possible outline 2:

1. What generic problem do PGMs address?
2. What i    

2. What is a PGM?
* A PGM

What is the approach of PGMs?

* Establish a relationship between probability distributions and a graphical model.

BIG ISSUE: Can you always represent the CI Statements of a BN with a MN? No I believe we are OK, see 19.2.2 of MLPP

THIS IS NOT TRUE: If P factorizes according to the BN G, then I(P) = I(G).

See the completeness section of PGM: If X and Y aren't d-separated in G, then you could construct a P that factorizes over G where they are dependent. Also, you could pick CPDs such that they are independent. This is why G represents all P's that satisfy its CI statements, including those that satisfy additional ones.

### Should all this be a series of answers and questions?

1. What are Probabilistic Graphical Models and why are they useful?
2. What is exact inference and how is it performed in a Probabilistic Graphical Model?
3. What is approximate inference and how is it performed in a Probabilistic Graphical Model?
4. How do we learn a Probabilistic Graphical Model?


**DROP THIS AS A SUMMARY OF PGM.** It makes it look unoriginal and restricts us. Maybe mention that this follows the structure of that book.

### What are Probabilistic Graphical Models and why are they useful?

Probabilistic Graphical Models are arguably our most complete and promising toolkit for inferring truth from complexity. They are a single set of principles that enable our machines to dominate chess, diagnose disease, translate language, decipher sound, recognize images and drive cars. 'Neural Networks' and 'Probabilistic Programming' are famous signatures of the ML community simply because they are effective toolsets for applying PGMs.

My aim here is to reveal the machinery behind this magic. I intend to show what they are, why we use them and how we actually use them. To do that, I'll answer X questions on this topic, of which this is the first. Those are:

1. What are Probabilistic Graphical Models and why do we use them?
2. What is 'exact inference' in the context of Probabilistic Graphical Models? How is it performed? [link] (Posting on DATE)
3. What is Variational Inference? [link] (Posting on DATE)
4. How are Monte Carlo methods used to perform inference in Probabilistic Graphical Models? [link] (Posting on DATE)
5. How are the parameters of a Markov Network learned? [link] (Posting on DATE)
6. How are the parameters of a Bayesian Network learned? [link] (Posting on DATE)
7. How is the graph structure of a PGM learned? [link] (Posting on DATE)

I realize this is a good deal to digest, especially for Quora. But this information is generally delivered in a graduate class paired with a 1000+ page textbook. We can get to the core of these valuable principles with just a few questions! It'll take a bit of discipline, which I understand is exceptionally rare for internet browsing, but afterwards, you'll have a decent understanding of a general, empirically effective, self-consistent theory of knowledge acquisition.

To help, here's a map of these questions:

![Title](Map.png)

You need only follow one path to get a complete taste. Also, I'll include recaps of requisite information at the beginning of each answer, which should make these answers more self contained.

Now, let's start walking.


### Preamble Notes

These explanations will follow the notation and some of the structure of the infamous text Probabilistic Graphical Models[link]. With that, we'll need a few survival tips:

* I'm only going to consider the discrete case for everything. This simplifies things substantially. Unfortunately, the continuous case is *not* a simple generalization from the discrete case.
* An upper case non-bold letter indicates a single random variable (RV). The same letter lowercased with a superscript indicates a specific value that RV may take. For example, $X=x^1$ is the *event* that the RV $X$ took on the value $x^1$. Sometimes we call this event an 'assignment'. The set of unique values an RV may take is given by $Val(X)$. A bold upper case letter indicates a *set* of RVs and a bold lower case letter indicates a set of values these RVs may take. $Val(\mathbf{X})$ is the set of all possible unique joint assignments to the RVs in $\mathbf{X}$. For example, we may have $\mathbf{X}=\{A,B\}$ and $\mathbf{x}=\{a^3,b^1\}$. Then the event $\mathbf{X}=\mathbf{x}$ is the event that $A=a^3$ happens *and* $B=b^1$ happens.
* Sometimes, we abbreviate the event '$\mathbf{X}=\mathbf{x}$' as '$\mathbf{x}$', under the assumption that the corresponding $\mathbf{X}$ is obvious and understood.
* Perhaps confusingly, we also abbreviate the event '$\mathbf{X}=\mathbf{x}$' as '$\mathbf{X}$', though this isn't a clean abbreviation. Omission of $\mathbf{x}$ means one of two things: either we mean this for *any* given $\mathbf{x}$ or for *all* possible $\mathbf{x}$'s. As an example for the latter case, 'calculate $P(\mathbf{X})$'  would mean calculate the set of probabilities $P(\mathbf{X}=\mathbf{x})$ for all $\mathbf{x}\in Val(\mathbf{X}).$ 
* $\sum_\mathbf{X}f(\mathbf{X})$ is shorthand for $\sum_{\mathbf{x}\in Val(\mathbf{X})}f(\mathbf{X}=\mathbf{x})$.
* There is some overloading of the notation of $P$. Sometimes it's used to refer to the true but unknown joint distribution of a system. Other times it's used to refer to the joint probabilities associated with a particular model. Context is important to disambiguate the two.

It's my unconventional belief that most statistics confusion is due to notation. If you think rigorously and refer to this section frequently, you'll understand everything here.

Now we ask:

### What generic problem do PGMs address?

Our goal is to understand a complex system. We assume the complex system manifests as $n$ RVs, which we may write as $\mathcal{X} = \{X_1,X_2,\cdots,X_n\}$ [1][2]. We take it that 'a good understanding' means we can answer two classes of questions *accurately* and *efficiently* for these RVs. If we say $\mathbf{Y}$ and $\mathbf{X}$ are two given non-intersecting subsets of $\mathcal{X}$, then those questions are:

1. **Probability Queries**: Compute the probabilities $P(\mathbf{Y}|\mathbf{X}=\mathbf{x})$. What is the distribution of the RV's of $\mathbf{Y}$ given we have some observation of the RVs of $\mathbf{X}$?
2. **MAP Queries**: Determine $\textrm{argmax}_\mathbf{y}P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})$. That is, determine the most likely values of RVs given an assignment of other RVs.

Before continuing, we should point a few things out:

* Since $\mathbf{Y}$ and $\mathbf{X}$ are any two non-intersecting subsets of $\mathcal{X}$, there is potentially a remaining set (call it $\mathbf{Z}$) that's in $\mathcal{X}$. In other words, $\mathbf{Z} = \mathcal{X} \setminus (\mathbf{Y}\cup\mathbf{X})$. This set appears left out of our questions, but is very much at play. We have to sum these RVs out, which can considerably complicate our calculations. For example, $P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})$ is actually $\sum_\mathbf{Z}P(\mathbf{Y}=\mathbf{y},\mathbf{Z}|\mathbf{X}=\mathbf{x})$. On a note of terminology, we say $P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})$ is a 'marginal' probability, since some other RVs were summed out.
* We haven't mentioned any model yet. This setup is asking generically for probabilities and values that accurately track reality.
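
To make the two query types concrete, here is a minimal sketch that answers both directly from a tiny joint table. The three binary RVs and every probability in `joint` are made up for illustration; the only thing taken from the text is the recipe: sum out $\mathbf{Z}$, condition on $\mathbf{X}=\mathbf{x}$, then either return the whole conditional distribution or its argmax.

```python
# Hypothetical joint distribution over three binary RVs, keyed by (y, x, z).
# The numbers are invented for illustration and sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.20,
    (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.15, (1, 0, 1): 0.05,
    (1, 1, 0): 0.25, (1, 1, 1): 0.15,
}

def probability_query(joint, x):
    """Return P(Y | X = x) as a dict over values of Y, summing out Z."""
    # P(Y = y, X = x) = sum over z of P(y, x, z)
    unnormalized = {y: sum(joint[(y, x, z)] for z in (0, 1)) for y in (0, 1)}
    evidence = sum(unnormalized.values())  # this is P(X = x)
    return {y: p / evidence for y, p in unnormalized.items()}

def map_query(joint, x):
    """Return argmax over y of P(Y = y | X = x)."""
    conditional = probability_query(joint, x)
    return max(conditional, key=conditional.get)

print(probability_query(joint, x=1))
print(map_query(joint, x=0))
```

This brute-force approach touches every row of the table, which is exactly what stops working once the table gets large.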

To this end, we are assisted by the fact that we have some, at least partial, joint observations of these RVs. However, some of our $n$ RVs may *never* be observed. These are called 'hidden' variables and they will complicate our lives later on.

This set up is extremely general, and as such, this problem is extremely hard.

### The problem with joint distributions.

Our starting point, perhaps surprisingly, will be to consider the joint distribution of our RVs $\mathcal{X}$, which we aren't given in real applications (but we'll get there). We'll call that joint distribution $P$. Conceptually, we can think of this as a table that lists out all possible joint assignments of $\mathcal{X}$ and their associated probabilities. So if $\mathcal{X}$ is made up of 10 RVs, each of which can take 1 of 100 values, this table has $100^{10}$ rows, each indicating a particular assignment of $\mathcal{X}$ and its probability.

The issue is, for a complex system, this table is too big. Even if we had the crystal ball luxury of having the joint distribution of our system, we *can't handle it*. So now what?
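
A quick back-of-the-envelope calculation shows how big 'too big' is. The 8 bytes per probability is my own assumption (one double-precision float per row); the rest comes straight from the example above.

```python
n_vars = 10      # number of RVs in the example
n_values = 100   # values each RV can take

rows = n_values ** n_vars   # one row per joint assignment
bytes_per_row = 8           # assumption: one 64-bit float per probability
total_bytes = rows * bytes_per_row

print(f"rows: {rows:.3e}")
print(f"storage: {total_bytes / 1e18:.0f} exabytes")
```

That is $10^{20}$ rows, before we have even tried to sum over them.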

### The Conditional Independence statement

We need a *compact* representation of $P$ - something that gives us all the information of that table, but without having to actually write it down. To this end, our saving grace is the **conditional independence (CI) statement**:

"
Given subsets of RVs $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ from $\mathcal{X}$, we say $\mathbf{X}$ is conditionally independent of $\mathbf{Y}$ given $\mathbf{Z}$ if

$$
P(\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}|\mathbf{Z}=\mathbf{z})=P(\mathbf{X}=\mathbf{x}|\mathbf{Z}=\mathbf{z})P(\mathbf{Y}=\mathbf{y}|\mathbf{Z}=\mathbf{z})
$$

for all  $\mathbf{x}\in Val(\mathbf{X})$, $\mathbf{y}\in Val(\mathbf{Y})$ and $\mathbf{z}\in Val(\mathbf{Z})$. This is written '$P$ satisfies $(\mathbf{X}\perp \mathbf{Y}|\mathbf{Z})$[2.1]'
"
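
For a table small enough to enumerate, the definition can be checked by brute force. The sketch below is one way to do it, assuming the joint is stored as a dict that lists *every* full assignment; the helper names (`marginal`, `satisfies_ci`) and the example numbers are mine. To avoid dividing by zero, it tests the equivalent product form $P(\mathbf{x},\mathbf{y},\mathbf{z})P(\mathbf{z}) = P(\mathbf{x},\mathbf{z})P(\mathbf{y},\mathbf{z})$.

```python
from itertools import product

def marginal(joint, keep):
    """Sum a full joint table down to the tuple positions listed in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def satisfies_ci(joint, xs, ys, zs, tol=1e-9):
    """Check (X ⊥ Y | Z); xs, ys, zs are tuples of positions in each assignment."""
    p_xyz = marginal(joint, xs + ys + zs)
    p_xz = marginal(joint, xs + zs)
    p_yz = marginal(joint, ys + zs)
    p_z = marginal(joint, zs)
    nx, ny = len(xs), len(ys)
    for key, p in p_xyz.items():
        x, y, z = key[:nx], key[nx:nx + ny], key[nx + ny:]
        # Product form of P(x, y | z) = P(x | z) P(y | z)
        if abs(p * p_z[z] - p_xz[x + z] * p_yz[y + z]) > tol:
            return False
    return True

# Hypothetical example: X and Y are built to be independent given Z.
prior_z = {0: 0.4, 1: 0.6}
x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
y_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}
joint = {
    (x, y, z): prior_z[z] * x_given_z[z][x] * y_given_z[z][y]
    for x, y, z in product((0, 1), repeat=3)
}

print(satisfies_ci(joint, xs=(0,), ys=(1,), zs=(2,)))  # (X ⊥ Y | Z)
print(satisfies_ci(joint, xs=(0,), ys=(1,), zs=()))    # (X ⊥ Y), no conditioning
```

In this example the first check passes and the second fails: conditioning on $Z$ is what makes $X$ and $Y$ independent.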

Now, if we had sufficient time and summation abilities, we could calculate the LHS and each of the factors on the RHS for a distribution $P$. If the equation holds for all values, then, by definition, the independence statement holds. This is sometimes stated as '$P$ satisfies this CI statement.' Intuitively, this means that if you are given the assignment of $\mathbf{Z}$, then knowing the assignment of $\mathbf{X}$ will never help you guess $\mathbf{Y}$ (and vice versa). Said differently, if $\mathbf{X}$ provides any

Knowing such statements turns out to be massively useful - they give us that compact representation we need. To see this, let's say the RVs are mutually independent, which in particular means $(X_i \perp X_j)$ for all $i \neq j$ with $i,j \in \{1,\cdots,10\}$ (that is, 'given the empty set'). This is to say, all RVs are independent of all other RVs. So instead of knowing that fat table, we only need to know the marginal probabilities of each value for each RV, which is a total of $10\cdot100=1000$ values. So if we are considering the case where $\mathbf{X}=\mathcal{X}$ and would like to know the probability $P(\mathbf{X}=\mathbf{x})$, we simply return $\prod_{i=1}^{10}P(X_i=x_i)$, where $x_i$ is the $i$-th element of $\mathbf{x}$.
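
Here is that saving as a small sketch. The uniform marginals are a placeholder I picked so the example is easy to check by hand; any 10 lists of 100 probabilities would work the same way.

```python
import math

n_vars, n_values = 10, 100

full_table_entries = n_values ** n_vars   # 100^10 entries in the joint table
independent_entries = n_vars * n_values   # 10 * 100 = 1000 marginal probabilities

def joint_probability(marginals, assignment):
    """P(X = x) as a product of per-RV marginals, valid under mutual independence."""
    return math.prod(m[x_i] for m, x_i in zip(marginals, assignment))

# Hypothetical marginals: every RV is uniform over its 100 values.
marginals = [[1.0 / n_values] * n_values for _ in range(n_vars)]

p = joint_probability(marginals, [0] * n_vars)
print(independent_entries, p)
```

We store 1000 numbers and can still recover any of the $100^{10}$ joint probabilities on demand.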

Though this isn't just a savings on storage. This is a simplification on $P$ that will ease virtually any interaction with $P$, including summing over many assignments and finding the most likely assignment. So at this point, I'd like you to think that CI statements regarding $P$ are a requirement for wielding it.

Now put a pin in this and let's switch gears.

### The Bayesian Network

It's time to introduce a Probabilistic Graphical Model. I'll do so with a specific type, the **Bayesian Network** (BN). A BN refers to two things, both in relation to some $\mathcal{X}$: a BN graph (called $\mathcal{G}$ typically) and an associated probability distribution $P$. $\mathcal{G}$ is a set of nodes, one for each RV of $\mathcal{X}$, and a set of *directed* edges, such that there are no directed cycles - it's a DAG [link]. $P$ is a distribution whose probabilities for assignments of $\mathcal{X}$ are given by a certain rule together with Conditional Probability Tables ('CPTs', also called Conditional Probability Distributions or 'CPDs'), which augment $\mathcal{G}$. That rule (called the 'Chain Rule for BNs') for determining probabilities can be written:

$$
P(X_1,\cdots,X_n)=\prod_{i=1}^n P(X_i|\textrm{Pa}_{X_i}^\mathcal{G})
$$

where $\textrm{Pa}_{X_i}^\mathcal{G}$ indicates the set of parent nodes/RVs of $X_i$ according to $\mathcal{G}$. The CPDs tell us what the $P(X_i|\textrm{Pa}_{X_i}^\mathcal{G})$ probabilities are. That is, a CPD lists out the probabilities of all assignments of $X_i$ given any joint assignment of $\textrm{Pa}_{X_i}^\mathcal{G}$[3]. Think of these CPDs as the parameters of our model. When a given $P$ has probabilities that may be calculated this way, we say '$P$ factorizes according to $\mathcal{G}$'. 

To help, let's consider a well utilized example from the text: the 'Student Bayesian Network'. Here, we're concerned with a system of five RVs: a student's intelligence ($I$), their class's difficulty ($D$), their grade in that class ($G$), their letter of recommendation ($L$) and their SAT score ($S$). So $\mathcal{X}=\{I,D,G,L,S\}$. The BN graph along with the CPDs can be represented as:

![title](StudentBN.png)

According to our rule, we have that any joint assignment of $\mathcal{X}$ factors as:

$$
P(I,D,G,S,L) = P(I)P(D)P(G|I,D)P(S|I)P(L|G)
$$

So we would calculate a given assignment as[4]:

$$
\begin{align}
P(i^1,d^0,g^2,s^1,l^0) = & P(i^1)P(d^0)P(g^2|i^1,d^0)P(s^1|i^1)P(l^0|g^2)\\
= & 0.3\cdot 0.6\cdot 0.08\cdot 0.8\cdot0.4 \\
= & 0.004608\\
\end{align}
$$

Not too bad, right? All this shows is that a BN along with CPDs gives us a way to calculate probabilities for assignments of $\mathcal{X}$.
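
The same calculation can be sketched in a few lines of code. A caveat on the numbers: only the five CPD entries used in the worked example above (0.3, 0.6, 0.08, 0.8, 0.4) are confirmed by this text. The remaining entries are what I believe the textbook's figure shows, so treat them as assumptions. The encoding (0/1 for binary RVs, 1 to 3 for grades) is also my own choice.

```python
from itertools import product

# CPDs for the Student network. Entries other than the five used in the
# worked example are assumed for illustration.
P_I = {0: 0.7, 1: 0.3}                       # P(I)
P_D = {0: 0.6, 1: 0.4}                       # P(D)
P_G = {                                      # P(G | I, D), keyed by (i, d)
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}   # P(S | I), keyed by i
P_L = {                                               # P(L | G), keyed by g
    1: {0: 0.10, 1: 0.90},
    2: {0: 0.40, 1: 0.60},
    3: {0: 0.99, 1: 0.01},
}

def joint(i, d, g, s, l):
    """Chain Rule for BNs: multiply each RV's CPD entry given its parents."""
    return P_I[i] * P_D[d] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

# The assignment from the worked example: i^1, d^0, g^2, s^1, l^0.
print(joint(i=1, d=0, g=2, s=1, l=0))

# Sanity check: the probabilities of all 48 joint assignments sum to 1.
total = sum(
    joint(i, d, g, s, l)
    for i, d, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1))
)
print(total)
```

Notice we never built the 48-row joint table to answer the question. The CPDs hold 1 + 1 + 8 + 2 + 3 = 15 free parameters, versus 47 for the full table.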

### But so what?

Well, there's a big idea hiding behind this: *the BN graph, just the graph of nodes and edges, implies a set of CI statements associated with its accompanying $P$*. It's a consequence of the Chain Rule for calculating probabilities. So a BN graph represents all $P$'s that satisfy its CI statements and each of those $P$'s could be attained with an appropriate choice of CPDs.

For a BN, one form of those CI statements are:

"
$(X_i \perp$ NonDescendants$_{X_i}|\textrm{Pa}_{X_i}^\mathcal{G})$ for all $X_i \in \mathcal{X}$
"

So in the student example, we'd have this set:

$$
(L\perp I,D,S|G)\\
(S\perp D,G,L|I)\\
(G\perp S|I,D)\\
(I\perp D)\\
(D\perp I,S)\\
$$

The third statement tells us that if you already know the student's intelligence and their class's difficulty, then knowing their SAT score won't help you guess their grade. This is because the SAT score is correlated with their grade *only via* their intelligence, and you already know that. On a point of terminology, if the graph was such that knowing their SAT score helped you guess their grade, then we would say this path is 'active' given $I$ and $D$.

These are referred to as the 'local semantics' of the BN graph. To complicate matters, there are potentially many other true CI statements associated with a BN graph outside of the local semantics. To determine those by inspecting the graph, we use a scary looking algorithm that relies on the concept of d-separation [link].
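
For the curious, here is one compact way to test d-separation in code. To be clear about what this is: it is *not* the path-tracing algorithm from the text. It's an equivalent criterion I'm swapping in because it is shorter: restrict the graph to the ancestors of the RVs involved, 'marry' any two parents that share a child, drop edge directions, delete the conditioning set, and check whether the two sets are still connected. The function name and the graph encoding (a dict from each node to its set of parents) are my own.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG described by `parents`."""
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Collect the nodes in xs, ys, zs together with all of their ancestors.
    ancestral, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in ancestral:
            ancestral.add(node)
            stack.extend(parents[node])

    # 2. Build the undirected 'moral' graph on those nodes.
    neighbors = {v: set() for v in ancestral}
    for child in ancestral:
        for p in parents[child]:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for p, q in combinations(parents[child], 2):  # marry co-parents
            neighbors[p].add(q)
            neighbors[q].add(p)

    # 3. Search from xs, never stepping onto a node in zs.
    seen, stack = set(), list(xs)
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        if node in ys:
            return False  # found a connecting path
        seen.add(node)
        stack.extend(neighbors[node] - seen)
    return True

# The Student network: I -> G <- D, I -> S, G -> L.
student = {"I": set(), "D": set(), "G": {"I", "D"}, "S": {"I"}, "L": {"G"}}

print(d_separated(student, {"G"}, {"S"}, {"I", "D"}))  # third local statement
print(d_separated(student, {"I"}, {"D"}, set()))       # (I ⊥ D)
print(d_separated(student, {"I"}, {"D"}, {"G"}))       # conditioning on G couples I and D
```

The last call is the interesting one: $I$ and $D$ are independent on their own, but once you know the grade, learning the class was easy tells you something about the student's intelligence.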

There is a reason this is so important. Since a BN graph is a way of representing CI statements and such statements are a requirement for handling a complex system's joint distribution (if you had it), then this is good reason to use a BN to represent such systems. If we can accurately represent a system with a BN, we will be able to calculate our probability queries, MAP queries and calculate expectations. Therefore, BNs will solve our problems when we're dealing with a certain class of $P$'s.

But there's an issue - I said a 'class' of $P$'s. It's not hard to invent $P$'s that come with CI statements a BN cannot represent.

So now what? Well, we have other tools, the biggest of which is.. 

### The Markov Network

A **Markov Network** (MN) is associated with a graph (called $\mathcal{H}$ frequently), which, like a BN graph, is a set of nodes and edges. This time, however, the edges are *undirected* and the graph may have cycles. The consequence is that a MN can represent a different set of CI statements. But the lack of directionality means we can no longer use CPDs. Instead, that information is delivered with a *factor*, which is a *function* that maps from some assignment of some subset of $\mathcal{X}$ to some positive number. The subset of RVs for a particular factor is called the 'scope' of that factor.

But which CI statements? It's considerably simpler in the case of a MN. A MN implies the CI statement $(\mathbf{X} \perp \mathbf{Y}|\mathbf{Z})$ if for every path between an RV in $\mathbf{X}$ and an RV in $\mathbf{Y}$, some RV of $\mathbf{Z}$ is on that path. 
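
That rule is simple enough to code directly: delete the nodes in $\mathbf{Z}$ and see whether anything in $\mathbf{X}$ can still reach anything in $\mathbf{Y}$. The sketch below does this with a plain graph search. The four-node cycle used to exercise it is a hypothetical graph I chose for illustration, and `separated` is my own name.

```python
def separated(edges, xs, ys, zs):
    """True if every path between xs and ys in the undirected graph passes through zs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    xs, ys, zs = set(xs), set(ys), set(zs)
    seen, stack = set(), list(xs)
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue  # blocked by the conditioning set, or already visited
        if node in ys:
            return False  # reached ys without crossing zs
        seen.add(node)
        stack.extend(neighbors.get(node, set()) - seen)
    return True

# Hypothetical graph: the cycle A - B - C - D - A.
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

print(separated(cycle, {"A"}, {"C"}, {"B", "D"}))  # both routes from A to C are blocked
print(separated(cycle, {"A"}, {"C"}, {"B"}))       # the route through D is still open
```

Compare this to the BN case: no directions to track, no special treatment for colliding arrows, just reachability.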

But let's get specific. Below is an MN for the system $\mathcal{X}=\{A,B,C,D\}$ and the CI statements it represents:

![title](MN1.png)

To calculate the probability of a particular assignment, we'll use the 'Gibbs Rule' for calculating probabilities.[4.1] To do so, we first have to identify a set of *complete subgraphs* that are collectively inclusive of all edges and nodes of $\mathcal{H}$[5]. A subgraph is exactly what it sounds like; we construct a subgraph by picking a subset of nodes from $\mathcal{H}$ and including all edges from $\mathcal{H}$ that are between nodes from this subset. A complete graph is one in which each node has an edge to every other node. 'Collectively inclusive' means that the union of all nodes and edges across the subgraphs yields all the nodes and edges from $\mathcal{H}$. Finally, for each complete subgraph, we define a factor that'll be used in our probability calculation. For the above, the complete subgraphs have nodes: $\{\{A,B\},\{B,C\},\{C,D\},\{D,A\}\}$. With that, we calculate a probability as:

$$
P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)
$$

where

$$
Z = \sum_{\mathbf{x}\in Val(\mathcal{X})}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)
$$

To repeat, each $\phi_i(\cdot,\cdot)$ is just a function that maps from its given joint assignment to some positive number. So if $A$ and $B$ could only take on two values each, $\phi_1(\cdot,\cdot)$ would relate each of four possible assignments to one of four positive numbers. These functions serve as our parameters just as the CPDs did. Determining these functions brings us from a class of $P$'s we can represent with CI statements to a specific $P$ within it, defined with probabilities.
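
Here's a sketch of that calculation for four binary RVs with one factor per edge of the cycle, the fourth factor being the one over $(D, A)$. The factor values are illustrative numbers I picked, not something given in this text. The mechanics are the point: multiply the factors for an assignment, then divide by $Z$, the sum of that product over all 16 assignments.

```python
from itertools import product

# Illustrative factor tables (assumed values), keyed by the pair of values in scope.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi_1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi_2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi_3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi_4(D, A)

def unnormalized(a, b, c, d):
    """Product of all factors for one joint assignment."""
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# Z sums the unnormalized product over every joint assignment of (A, B, C, D).
Z = sum(unnormalized(*x) for x in product((0, 1), repeat=4))

def prob(a, b, c, d):
    """Gibbs Rule: normalized product of factors."""
    return unnormalized(a, b, c, d) / Z

print(Z)
print(prob(0, 1, 1, 0))
```

Two things to notice. The factor values aren't probabilities and don't need to sum to anything. And computing $Z$ required a sum over *every* joint assignment, which is cheap here and a serious problem in large models.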

Now let's write the Gibbs Rule in its full generality. Let's say we've chosen a set of subsets of $\mathcal{X}$ that are collectively inclusive of $\mathcal{H}$ and correspond to complete subgraphs of $\mathcal{H}$. We can write that set as $\{\mathbf{D}_i\}_{i=1}^m$. Let's also say we have one factor for each of these, that we'll call $\phi_i(\cdot)$. We refer to these factors together with $\Phi$, so $\Phi=\{\phi_i(\cdot)\}_{i=1}^m$. Then we calculate a probability as:

$$
P(X_1,\cdots,X_n) = \frac{1}{Z} \prod_{\phi_i\in \Phi} \phi_i(\mathbf{D}_i)
$$

where

$$
Z = \sum_{\mathbf{x}\in Val(\mathcal{X})} \prod_{\phi_i\in \Phi} \phi_i(\mathbf{D}_i)
$$

(It's a bit hidden from this notation, but we're assuming it's clear how to match up the assignment of $X_1,\cdots,X_n$ with the assignments of the $\mathbf{D}_i$'s.)

And that's the Gibbs Rule! But, ahem, uhh.. there's an issue. In the BN case, I said:

"
So a BN graph represents all $P$'s that satisfy its CI statements and each of those  $P$'s could be attained with an appropriate choice of CPDs.
" (This needs to be verified)

The analogous statement is *not* true in the case of MNs. There may exist a $P$ that satisfies the CI statements of a MN graph, but we *can't* calculate its probabilities with the Gibbs Rule. F!

Fortunately, these squirrelly $P$'s fall into a simple (though large) category: those which assign a *zero* probability to some assignments. This leads us to the Hammersley-Clifford theorem:

"
If $P$ is a positive distribution ($P(\mathbf{X}=\mathbf{x})>0$ for all $\mathbf{x} \in Val(\mathcal{X})$) which satisfies the CI statements of $\mathcal{H}$, then we may use the Gibbs Rule, involving a choice of complete subgraphs and a choice of factors, to yield the probabilities of $P$. [6]
"

And that about does it for the basics of MNs. They are just another way of representing another class of $P$'s. 

### How do BNs and MNs compare?

There are a few points of comparison we may call out from this description. First, it's easier to determine CI statements in a MN. You don't require a fancy d-separation algorithm like you do for BNs. This follows from their symmetry, a consequence of their undirected edges, which can make them a natural candidate for a variety of problems: image segmentation, [continue list]. On the other hand, specification of a MN's parameters is much less natural. If we were to attempt to extract this parameter information from a group of experts, we'd have a much easier time discussing CPDs than a MN's entangled factors.

There are more pros and cons than revealed here, but we'll cover those when we cover the necessary topics.

### Wrapping it up

Let's recap. We are concerned with studying a system of RVs, $\mathcal{X}$. That study takes the form of some specific questions we'd like to answer regarding that system. As a starting point, we assume we have the system's joint distribution $P$ and realize we need some simplifications on $P$ to answer our questions. Those simplifications are CI statements. The big idea is to *represent* those CI statements with graphical models, which therefore represent a class of $P$'s that satisfy those CI statements. We nail down the specific $P$ represented with 'parameters' of our graphical model. In the case of BNs, these were the CPDs and in the case of MNs, these were the factors.

But we have more to do. Let's say we determined our graphical model along with its parameters. How do we actually answer those questions? That's addressed with:

What is 'inference' in the context of Probabilistic Graphical Models? How is it performed? [link]

Ok, but how did we determine our graphical model? Isn't it time to consider our starting point, which is just a set of observations? Yes it is:

What does it mean to 'learn' a Probabilistic Graphical Model? How is it performed? [link]


### Footnotes

[1] This is the one exception where we don't refer to a set of RVs with a bold uppercase letter.

[2] This actually isn't the fully general problem specification. In complete generality, the set of RVs should be allowed to grow/shrink over time. I'm excluding that level of complexity because it's difficult to cover without getting too far afield.

[2.1] There is a subtlety of language here. Often we'll say '$P$ satisfies these CI statements'. That means those CI statements are true for $P$, but *others may be true as well*. So it means 'these CI statements' are a subset of all $P$'s true CI statements. This technicality matters, so keep an eye out.

[3] If $X_i$ doesn't have any parents, then the CPD is the *unconditional* probability distribution of $X_i$.

[4] I'm abbreviating, for example, $P(I=i^1)$ as $P(i^1)$.

[4.1] This isn't a real name I'm aware of, but the form of that distribution makes it a Gibbs distribution [link] and I'd like to maintain an analogy to BNs, which had the Chain Rule.

[5] A collectively inclusive set of complete subgraphs is *not* necessarily unique to $\mathcal{H}$ - there may be multiple such sets that get the job done. This, unfortunately, is another choice to make and may make a considerable difference. Let's say $\mathcal{H}$ is fully connected for the $n$ RVs of $\mathcal{X}$, each of which may take one of two values. This graph implies no CI statements. One collectively inclusive set might involve one factor for each edge in $\mathcal{H}$. Each of these maps the 4 joint assignments of two variables to 4 positive numbers. So one factor involves 4 parameters. Since there are ${n \choose 2}$ edges, we have a total of $4{n \choose 2}$ parameters. But we know the joint distribution over $n$ binary valued variables has $2^n-1$ free parameters, which exceeds $4{n \choose 2}$ once $n$ is large enough. So choosing pairwise factors as our complete subgraphs restricts our ability to represent some $P$'s, even if the CI statements are satisfied (This is example 4.1 from the text). This inability to represent falls away as you increase the size of the scope of the factors. So if we had just one factor which treats all of $\mathcal{H}$ as the complete subgraph, we would have the $2^n-1$ free parameters. Because of this, some automate the choice by always using the 'maximal' complete subgraphs. These are complete subgraphs for which adding any node would make them no longer complete.

[6] The implication goes the other way as well: If the probabilities of $P$ can be calculated with the Gibbs Rule, then it's a positive distribution which satisfies the CI statements implied by a graph which has cliques of RVs that correspond to the RVs of each factor. This direction, however, doesn't fit into the story I'm telling, so it sits as a lonely footnote.

### Sources

[1] Koller, Daphne; Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning series). The MIT Press. Kindle Edition. 

### Scrap

Probabilistic Graphical Models ('PGMs') are an absolutely fundamental concept in the study of machine learning. They provide an extremely useful framework for modeling our reality, learning that model from data and reasoning with respect to that model such that we can answer questions and determine actions.

It's a broad topic populated by some mind-bending algorithms, but the core objectives and ideas are accessible and surprisingly finite. In this explanation, I plan to abstract away the head-hurting details for the sake of the big picture. I'll do that by answering these questions:

1. What generic problem do PGMs address?

2. What is a PGM and how does it address our problem?

3. How would we actually use a PGM?

Now, I know what you're thinking:

'This answer looks long and boring.'

You're right - it's long, but it's worth it. Here I'm delivering the core principles of the titanic textbook Probabilistic Graphical Models, which is typically used to define a graduate class for brainy machine learners. We will be 80-20-ing all of that in this post - this is a worthwhile read!

### What is a PGM and how does it address our problem?

Perhaps surprisingly, we will not start by considering our observations/data. When we do start there, it'll be important to know our ending position, which I'll now explain.

Here, our starting point will be a table - a joint distribution table that tells us the probabilities of all possible assignments of values to the RVs of $\mathcal{X}$. So if $\mathcal{X}$ is made up of 10 RVs, each of which can take 1 of 100 values, this table has $100^{10}$ probabilities. It's a big table that *defines* the joint distribution over $\mathcal{X}$. I'll call the generic joint distribution defined here $P$.

We want the information of this table, but we need a *compact* form of it. We cannot store or process $100^{10}$ values easily. In this effort, our saving grace will be the **conditional independence (CI) statement**:

Given subsets of RVs $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ from $\mathcal{X}$, we say $\mathbf{X}$ is conditionally independent of $\mathbf{Y}$ given $\mathbf{Z}$ if

$$
p(\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}|\mathbf{Z}=\mathbf{z})=p(\mathbf{X}=\mathbf{x}|\mathbf{Z}=\mathbf{z})p(\mathbf{Y}=\mathbf{y}|\mathbf{Z}=\mathbf{z})
$$

for all possible values of $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$. This is written '$P$ satisfies $(\mathbf{X}\perp \mathbf{Y}|\mathbf{Z})$'.

So if we had sufficient time and summation abilities, we could calculate the LHS and each of the factors on the RHS for a distribution $P$. If the equation holds for all values, then, by definition, the independence statement holds.
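As a rough sketch of that brute-force check, the snippet below tests a CI statement directly on a small joint table. The function name `satisfies_ci`, the array layout `joint[x, y, z]` and all the numbers are assumptions made for illustration. Both sides of the definition are multiplied by $p(\mathbf{z})^2$, so no division by a zero probability is needed.

```python
import numpy as np

def satisfies_ci(joint, atol=1e-12):
    """Check (X ⊥ Y | Z) for a joint table indexed as joint[x, y, z]."""
    p_z = joint.sum(axis=(0, 1))   # p(z)
    p_xz = joint.sum(axis=1)       # p(x, z), shape (nx, nz)
    p_yz = joint.sum(axis=0)       # p(y, z), shape (ny, nz)
    # p(x,y|z) = p(x|z) p(y|z)  is equivalent to  p(x,y,z) p(z) = p(x,z) p(y,z)
    lhs = joint * p_z[None, None, :]
    rhs = p_xz[:, None, :] * p_yz[None, :, :]
    return bool(np.allclose(lhs, rhs, atol=atol))

# A joint built as p(z) p(x|z) p(y|z), so the CI statement holds by construction.
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.5, 0.9],
                        [0.5, 0.1]])   # column k is p(X | Z = z^k)
p_y_given_z = np.array([[0.2, 0.6],
                        [0.8, 0.4]])   # column k is p(Y | Z = z^k)
ci_joint = p_x_given_z[:, None, :] * p_y_given_z[None, :, :] * p_z[None, None, :]

# A joint where X always equals Y, so X and Y stay dependent given Z.
dep_joint = np.zeros((2, 2, 2))
dep_joint[0, 0, :] = 0.25
dep_joint[1, 1, :] = 0.25

print(satisfies_ci(ci_joint), satisfies_ci(dep_joint))
```

The check touches every entry of the table, which is exactly why it is only feasible for tiny examples like this one.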

Knowing such statements turns out to be massively useful - they give us that compact representation we need. To see this, let's say $(X_i \perp X_j)$ for all $i \neq j$ with $i, j \in \{1,\cdots,10\}$ (that is, 'given the empty set') and, more strongly, that the RVs are mutually independent: every RV is independent of any subset of the others. So instead of knowing that fat table, we only need to know the marginal probabilities of each value for each RV, which is a total of $10\cdot100=1000$ values. So if we are considering the case where $\mathbf{X}=\mathcal{X}$ and would like to know the probability $p(\mathbf{X}=\mathbf{x})$, we simply return $\prod_{i=1}^{10}p(X_i=x_i)$, where $x_i$ is the $i$-th element of $\mathbf{x}$.
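To make the saving concrete, here is a minimal sketch that stores only the 10 marginals and computes a joint probability as their product. The marginals are random placeholders and the names (`marginals`, `joint_prob`) are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rvs, n_values = 10, 100

# One marginal per RV: row i holds p(X_i = v) for each of the 100 values v.
# These are random placeholder numbers, not anything learned from data.
marginals = rng.random((n_rvs, n_values))
marginals /= marginals.sum(axis=1, keepdims=True)

def joint_prob(x):
    """p(X = x) when all RVs are mutually independent: a product of marginals."""
    return float(np.prod(marginals[np.arange(n_rvs), x]))

full_table_size = n_values ** n_rvs   # 100^10 entries in the full joint table
factored_size = marginals.size        # 10 * 100 = 1000 entries

x = rng.integers(0, n_values, size=n_rvs)   # one joint assignment
print(full_table_size, factored_size, joint_prob(x))
```

The full table would need $100^{10}$ numbers, while the factored form answers the same query from 1000.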

But this isn't just a saving on storage. It is a simplification of $P$ that will ease virtually any interaction with $P$. So at this point, I'd like you to think of CI statements regarding $P$ as a requirement for wielding it.

But what does this have to do with a PGM? Well, a PGM is a model expressed with a **graph**. A graph is a set of nodes, each representing an RV within $\mathcal{X}$, and a set of edges (directed or undirected). For example:

![title](PGM1.png)

The big idea is *a graph represents a set of CI statements and can, therefore, represent the class of probability distributions consistent with those CI statements*.

To understand how, we have to understand how a graph, when augmented with some specifying information, represents a *specific* probability distribution $P$, like the one defined in our table.

So how? Well, there are rules.

To understand those rules, let's focus on a specific type of graph: **The Bayesian Network** (BN). A Bayesian Network is a graph where every edge is directed and there are no directed cycles (a directed acyclic graph). The effect of this is to restrict the CI statements this graph can represent. In a Bayesian Network, we associate a Conditional Probability Table (CPT, a tabular form of a conditional probability distribution, or CPD) with each node, conditioned on that node's 'parent' nodes (the RVs with a directed edge into it). A CPT is a grid telling us the *conditional* probability of the child RV's possible values given its parents' values. For example, say we were considering two RVs connected with a directed edge, $X$ and $Y$:

![title](PGM2.png)

If $X$ can only take on the values $\{x^1,x^2\}$ and $Y$ can only take on the values $\{y^1,y^2\}$, then a CPT might be:

![title](CPT1.png)

You can think of this as a way of 'generating' $Y$ given $X$. For example, if $X=x^1$, we'd flip a fair coin to determine $Y$'s value.

Now if we add a table to give the prior distribution over $X$, like this:

![title](CPT2.png)

We can suddenly use the graph and these tables to determine probabilities associated with any joint value assignment of $X$ and $Y$.
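A minimal sketch of that calculation for the two-node network follows. The numbers are hypothetical and are not meant to match the tables pictured above, apart from the fair coin for $X=x^1$ mentioned earlier.

```python
import numpy as np

# Hypothetical prior over X: p(X = x^1), p(X = x^2).
p_x = np.array([0.6, 0.4])

# Hypothetical CPT: row i is p(Y | X = x^i). The first row is the fair coin
# from the text; the second row is made up for this example.
p_y_given_x = np.array([[0.5, 0.5],
                        [0.2, 0.8]])

# Chain rule for this two-node BN: p(x, y) = p(x) * p(y | x).
joint = p_x[:, None] * p_y_given_x

# Any other probability follows from the joint, e.g. the marginal of Y.
p_y = joint.sum(axis=0)

print(joint)
print(p_y)
```

The graph tells us which tables to multiply; the tables supply the numbers.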

But let's look at a more complicated BN:

![title](StudentBN.png)

Here, we consider 5 RVs associated with a student: their class's $Difficulty$, their $Intelligence$, their $Grade$ in that class, their $Letter$ of recommendation and their $SAT$ score. Using this more complicated BN and the augmented CPDs, we can


Now, there is a worthwhile complaint at this point. In the case of a BN, we could look at the graph and know the exact form of all CPDs. It is precisely those that tell us $P(X_i|\textrm{Pa}_{X_i}^\mathcal{G})$ for all possible assignments to $X_i$ and to its parents. In the case of an MN, the factors to define are less clear. Couldn't we have defined just one factor that takes all assignments of $\mathcal{X}$ and attained any $P$ we want, including those we hope to represent with the MN above?

Yes, but that defeats the purpose. That form wouldn't enjoy the CI properties of the given MN. To determine factors that'll be faithful to the CI statements of $\mathcal{H}$, we use the following theorem that relates a CI statement and the calculation of a probability of $P$:

$P$ satisfies $(\mathbf{X} \perp \mathbf{Y}|\mathbf{Z})$ if there exist functions $f_1(\cdot,\cdot)$ and $f_2(\cdot,\cdot)$ such that $P(\mathbf{X},\mathbf{Y},\mathbf{Z})=f_1(\mathbf{X},\mathbf{Z})f_2(\mathbf{Y},\mathbf{Z})$.
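A short sketch of why this holds, assuming discrete RVs and $P(\mathbf{z})>0$. The shorthand $F_1(\mathbf{z})=\sum_{\mathbf{x}}f_1(\mathbf{x},\mathbf{z})$ and $F_2(\mathbf{z})=\sum_{\mathbf{y}}f_2(\mathbf{y},\mathbf{z})$ is introduced here only for this derivation. Summing the factorized form over $\mathbf{y}$, over $\mathbf{x}$, and over both gives

$$
\begin{aligned}
P(\mathbf{x},\mathbf{z}) &= f_1(\mathbf{x},\mathbf{z})F_2(\mathbf{z}) \\
P(\mathbf{y},\mathbf{z}) &= F_1(\mathbf{z})f_2(\mathbf{y},\mathbf{z}) \\
P(\mathbf{z}) &= F_1(\mathbf{z})F_2(\mathbf{z})
\end{aligned}
$$

and therefore

$$
P(\mathbf{x}|\mathbf{z})P(\mathbf{y}|\mathbf{z})=\frac{f_1(\mathbf{x},\mathbf{z})F_2(\mathbf{z})}{F_1(\mathbf{z})F_2(\mathbf{z})}\cdot\frac{F_1(\mathbf{z})f_2(\mathbf{y},\mathbf{z})}{F_1(\mathbf{z})F_2(\mathbf{z})}=\frac{f_1(\mathbf{x},\mathbf{z})f_2(\mathbf{y},\mathbf{z})}{F_1(\mathbf{z})F_2(\mathbf{z})}=P(\mathbf{x},\mathbf{y}|\mathbf{z})
$$

which is exactly the definition of the CI statement.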

We can look at an $\mathcal{H}$ and immediately read off the CI statements. This theorem hints as to how we should write our probability-calculation rule in terms of factors using these CI statements.

### What is a PGM and how does it address our problem? (Attempt 2)

Perhaps surprisingly, we will not start by considering our observations/data. When we do start there, it'll be important to know our ending position, which I'll now explain.

In front of us, we have a rather daunting relationship to build: that of a joint probability distribution over all values of $\mathcal{X}$ and its representation with a PGM. To start, I'll tell you

### Old intro:

This answer will be the first of X answers on this topic. I realize this is strange behavior for Quora, but Probabilistic Graphical Models attention proportionate

This is going to be the first of four answers, which will explain the core principles behind the infamous text Probabilistic Graphical Models [link]. My motivation comes from a bit of my own frustration. When studying this subject, the specifics of the nearest algorithm or theorem can fog the goal or bury the operating principle. This is particularly unfortunate for this subject, considering it's powered by only a handful of ingenious ideas.

So I felt this topic could use a high level explanation - one where we emphasize the big ideas without oversimplifying. I'll do that by answering:

1. What are Probabilistic Graphical Models and why are they useful?
2. What is 'exact inference' in the context of Probabilistic Graphical Models? How is it done? [link] (Posting on DATE)
3. What is 'approximate inference' in the context of Probabilistic Graphical Models? How is it done? [link] (Posting on DATE)
4. What does it mean to 'learn' a Probabilistic Graphical Model? How is it done? [link] (Posting on DATE)

OK, I know what you're thinking:

"
Duane, this is ridiculously long for Quora. You have no life.
"

It's not ridiculous. The actual text is a worthwhile 1000+ pages for any aspiring data scientist, ML engineer or AI researcher. We can 80-20 that with just 4 answers! It'll take some discipline, but afterwards you'll have a surprisingly good understanding of this generally inaccessible subject.

That said, this is not a substitute for the text. If you enjoy this, the natural next step is to peek at that book. It's great for gaining statistical maturity or, at the very least, observing a comprehensive, self-contained theory of knowledge acquisition.