What should I mention in this answer?

- All parameters are coupled
- We need to pick a representation of Markov networks. We will consider MNs with *positive* factors rather than merely nonnegative ones.
- With complete data, what defines parameters that are our MLE?
- What is the gradient?
- What happens with incomplete data?


What is the structure
- Intro (I don't know what I call out yet)
- Refresher but only on MNs
- Introduce the form of the MNs: Imagine any positive factors. We can recreate them XYZ way with indicator features. This allows us to bring out the parameters as coefficients that can be any real number.
- What defines the MLE?
- What is the gradient? (Should I include this?)
- What happens with missing data?


# How are the parameters of a Markov Network Learned?

INTRO

### Refresher

(Copy this from other online answers)

In the first answer [link], we discovered why PGMs are useful tools for representing complex systems. We defined a complex system as a set of $n$ random variables (which we call $\mathcal{X}$) with a relationship we'd like to understand. We take it that there exists some true but unknown joint distribution, $P$, which governs these RVs. We take it that a 'good understanding' means we can answer two types of questions regarding this $P$:

1. **Probability Queries**: Compute the probabilities $P(\mathbf{Y}|\mathbf{e})$. What is the distribution of the RVs of $\mathbf{Y}$ given we have some observation ($\mathbf{e}$) of the RVs of $\mathbf{E}$?
2. **MAP Queries**: Determine $\textrm{argmax}_\mathbf{Y}P(\mathbf{Y}|\mathbf{e})$. That is, determine the most likely values of some RVs given an assignment of other RVs.

(Where $\mathbf{E}$ and $\mathbf{Y}$ are two arbitrary subsets of $\mathcal{X}$. If this notation is unfamiliar, see the 'Notation Guide' section from the first answer [link]).

The idea behind PGMs is to estimate $P$ using two things:

1. A graph: a set of nodes, each of which represents an RV from $\mathcal{X}$, and a set of edges between these nodes.
2. Parameters: objects that, when paired with a graph and a certain rule, allow us to calculate probabilities of assignments of $\mathcal{X}$.

A **Markov Network**'s graph, denoted as $\mathcal{H}$, differs from a Bayesian Network's in that its edges are *undirected* and it may contain cycles. The parameters are a size $m$ set of *functions* (called ‘factors’) which map assignments of $m$ subsets of $\mathcal{X}$ to nonnegative numbers. Those subsets, which we'll call $\mathbf{D}_j$'s, correspond to *complete subgraphs* of $\mathcal{H}$ and their union makes up the whole of $\mathcal{H}$. We can refer to this set as $\Phi=\{\phi_j(\cdots)\}_{j=1}^m$. With that, we say that the 'Gibbs Rule' for calculating probabilities is:

$$
P_M(X_1,\cdots,X_n) = \frac{1}{Z} \underbrace{\prod_{j = 1}^m \phi_j(\mathbf{D}_j)}_{\text{we call this }\tilde{P}_M(X_1,\cdots,X_n)}
$$

where $Z$ is a normalizer - it ensures our probabilities sum to 1.

To crystallize this idea, it's helpful to imagine the 'Gibbs Table', which lists unnormalized probabilities for all assignments. In the second answer [link], we pictured an example where $\mathcal{X}=\{C,D,I,G,S\}$ as:

![Title](GIbbsTable_labeled.png)
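To make the Gibbs Table concrete, here is a minimal sketch that builds one by brute force. The three-variable chain and the factor values are made up for illustration (they are not the $\{C,D,I,G,S\}$ example pictured above), and `factors`, `unnormalized` and `gibbs_table` are hypothetical names.

```python
import itertools

# Hypothetical toy network: three binary RVs on a chain A - B - C.
# Each factor is (scope, table); scope holds positions in the assignment tuple.
factors = [
    ((0, 1), {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}),  # phi_1(A, B)
    ((1, 2), {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}),    # phi_2(B, C)
]

def unnormalized(x):
    """P-tilde(x): the product of every factor evaluated on assignment x."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(x[v] for v in scope)]
    return p

# The Gibbs Table: one row per joint assignment, holding its unnormalized measure.
gibbs_table = {x: unnormalized(x) for x in itertools.product([0, 1], repeat=3)}
Z = sum(gibbs_table.values())                    # the normalizer
P = {x: p / Z for x, p in gibbs_table.items()}   # the Gibbs Rule

for x, p_tilde in gibbs_table.items():
    print(x, p_tilde, round(P[x], 4))
```

Note that $Z$ is just the sum of the unnormalized column, which is why it depends on every factor at once.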

### Got it, so what are we doing?

We're here to determine those factors given a lump of data, $\mathcal{D}$, and a MN graph, $\mathcal{H}$. To start, we'll assume our data is complete, meaning each observation is a joint assignment to *all* RVs of $\mathcal{X}$. More specifically, let $\mathbf{X}=\mathcal{X}$ and let's write our data as: $\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^w$, where no entries in these vectors are missing.

It'll help, actually, to restrict our goal slightly. In Markov Networks, we run into problems with factors that give zero probabilities to some assignments. So, we'll be better off considering factors that avoid that. We can do so with functions called *features*, each of which maps assignments to some *real number*. If we denote these features as $f_j(\cdot)$, then we define the factors with:

$$
\phi_j(\mathbf{D}_j) = \exp(f_j(\mathbf{D}_j))
$$

Think of a feature as just another way of specifying a factor, where that factor has to be positive. Now we can rewrite our Gibbs Rule:

$$
\begin{align}
P_M(X_1,\cdots,X_n) & = \frac{1}{Z} \prod_{j = 1}^m \exp(f_j(\mathbf{D}_j)) \\
& = \frac{1}{Z} \exp\big(\sum_{j = 1}^m f_j(\mathbf{D}_j)\big) \\
\end{align}
$$

This is good, but it would be nice if we could see the parameters. Right now, they are hiding as the outputs of $f_j(\cdot)$. Here's an idea: let's *redefine* $f_j(\cdot)$ as an indicator function. It's 1 when $\mathbf{D}_j$ takes on a certain assignment[1] and 0 otherwise. Then we can say:

$$
P_M(X_1,\cdots,X_n) = \frac{1}{Z} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{D}_j)\big)
$$

where $\theta_j$ is a real-valued number. It plays the role that the output of $f_j(\cdot)$ played in the previous construction.

At this point, you might have a reasonable complaint. That is, if we originally had $m$ feature functions, each of which maps assignments to real numbers, we can't reproduce their output with *just* $m$ indicator functions and $m$ real-valued numbers. There aren't enough degrees of freedom! Yes, I've sneakily increased $m$. The point is, given any set of features, we *could* invent another *larger* set of indicator functions and $\theta_j$'s that produce the same output. The confusing part is that we are overloading notation. The $m$, $f_j$'s and $\mathbf{D}_j$'s all change when we make the switch to indicator world. Specifically, we now repeat a $\mathbf{D}_j$ for every *value* in the original $Val(\mathbf{D}_j)$. The associated $f_j(\cdot)$ for a repeated $\mathbf{D}_j$ is an indicator for that value.

So, in one sentence: our goal is to determine the $\theta_j$'s given some data $\mathcal{D}$ and a set of indicator functions $f_j(\cdot)$.[2]
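As a small sanity check of the switch to indicator world, the sketch below takes one made-up positive factor over two binary RVs and rebuilds it from indicators and real-valued weights. The table `phi` and the helper names are hypothetical. The key assumption is the natural choice $\theta = \log \phi$ for each value.

```python
import math

# Hypothetical positive factor over two binary RVs (A, B).
phi = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

# One indicator per value in Val(D); its theta is the log of that factor entry.
features = [(value, math.log(p)) for value, p in phi.items()]

def indicator(target, d):
    """1 when the assignment d equals the target value, 0 otherwise."""
    return 1.0 if d == target else 0.0

def rebuilt_phi(d):
    """exp(sum_j theta_j * f_j(d)), using only indicators and real-valued thetas."""
    return math.exp(sum(theta * indicator(target, d) for target, theta in features))

for d in phi:
    assert math.isclose(rebuilt_phi(d), phi[d])
print(len(features), "indicator features reproduce 1 factor")
```

One factor became four (indicator, weight) pairs, which is the 'sneaky' increase in $m$.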

### The Likelihood Function

To determine the $\theta_j$'s, we need the *Likelihood Function*, which was explained in answer 5 [link] as:

"
To do so, we have to introduce the **likelihood function**. A likelihood function accepts parameters and returns a number that tells us how *likely* a fixed set of data is according to those parameters. The guiding principle is to pick parameters that maximize this function, i.e. the likelihood of our data. Such parameters are called our **Maximum Likelihood Estimate ('MLE').** To write this out, it's typical to say $\boldsymbol{\theta}$ is a vector that lists out our parameters and that our likelihood function is $\mathcal{L}(\boldsymbol{\theta})$. Further, the argmax of such a function is the same as the argmax of the *log* of this function, and since log-ing turns difficult multiplication into easy addition, we maximize the log likelihood, $\log \mathcal{L}(\boldsymbol{\theta})$.
"

In the case of a MN, the parameters are those used in our specification of $P_M$, so $\boldsymbol{\theta}=[\theta_1,\cdots,\theta_m]$. With that, our goal is to find:

$$
\boldsymbol{\theta}^* = \textrm{argmax}_\boldsymbol{\theta} \log \mathcal{L}(\boldsymbol{\theta})
$$

In the case of a BN, we saw a very fortunate decomposition. That was, we could optimize pieces of $\boldsymbol{\theta}$ *independently*, paste them together at the end, and be assured that was the global optimum. In this general form of a MN, we do *not* have such good fortune. What this means is that we have to optimize all $\theta_j$'s *together*, which means we're responsible for searching a volume with dimension $m$. As you can imagine, that volume might be huge. The reason for this is that damn $Z$ term, which is the sum of the $\tilde{P}_M$ column in our Gibbs Table. It creates an interaction between all our $\theta_j$'s whereby the best choice of any one of them may depend on the settings of all the others. Because we'd like to call attention to this fact and to the fact that $Z$ is a function of $\boldsymbol{\theta}$, we'll change notation slightly as: $Z \rightarrow Z(\boldsymbol{\theta}).$

No matter, we're brave - let's charge forward.

### Optimizing $\log \mathcal{L}(\boldsymbol{\theta})$

This is where the math gets nicely intuitive. Let's write out the log-likelihood with our specification of a MN. It'll simplify things to consider the log-likelihood *per sample*, so:

$$
\begin{align}
\frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) & = \frac{1}{w} \log \big( \prod_{i=1}^w P_M(\mathbf{x}^{(i)}) \big) \\
& = \frac{1}{w} \log \big( \prod_{i=1}^w \frac{1}{Z(\boldsymbol{\theta})} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i)})\big) \big) \\
& = \frac{1}{w} \big( \sum_{i=1}^w \sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i)}) \big) - \log Z(\boldsymbol{\theta}) \\
& =  \sum_{j = 1}^m \theta_j \mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] - \log Z(\boldsymbol{\theta})
\end{align}
$$

where $\mathbf{d}_j^{(i)}$ are the $\mathbf{D}_j$ elements from the observation $\mathbf{x}^{(i)}$ and $\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]$ is the *empirical* expectation of $f_j(\mathbf{D}_j)$. In other words, it's the number of times that indicator function produced a 1 in our data, divided by our number of observations, $w$.
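To check that the last line of this derivation really equals the first, here is a small sketch that computes the per-sample log-likelihood both ways. The features, data and parameter values are all invented for illustration, and $Z(\boldsymbol{\theta})$ is computed by brute force, which is only feasible because the toy model has eight joint assignments.

```python
import itertools
import math

# Hypothetical indicator features on three binary RVs (A, B, C):
# feature j fires when x restricted to `scope` equals `value`.
FEATURES = [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 2), (0, 0)), ((1, 2), (1, 1))]
ALL_X = list(itertools.product([0, 1], repeat=3))

def f(j, x):
    scope, value = FEATURES[j]
    return 1.0 if tuple(x[v] for v in scope) == value else 0.0

def score(theta, x):
    return sum(t * f(j, x) for j, t in enumerate(theta))

def log_Z(theta):
    return math.log(sum(math.exp(score(theta, x)) for x in ALL_X))

def empirical_expectation(j, data):
    """Fraction of observations on which indicator j produced a 1."""
    return sum(f(j, x) for x in data) / len(data)

def avg_loglik_direct(theta, data):
    """(1/w) * sum_i log P_M(x^(i)), straight from the definition."""
    lz = log_Z(theta)
    return sum(score(theta, x) - lz for x in data) / len(data)

def avg_loglik_compact(theta, data):
    """sum_j theta_j * E_D[f_j] - log Z(theta), the last line of the derivation."""
    return sum(t * empirical_expectation(j, data) for j, t in enumerate(theta)) - log_Z(theta)

data = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 1)]   # made-up complete data
theta = [1.0, 1.0, 0.5, 0.5]                                     # arbitrary parameters
print(avg_loglik_direct(theta, data), avg_loglik_compact(theta, data))
```

The compact form only touches the data through the empirical expectations, so those can be computed once and reused for every candidate $\boldsymbol{\theta}$.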

Now, the cool part is when we take derivatives. I'll just state this clean result[3]:

$$
\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]
$$

Let's make sure we understand that right side. As we know, a choice of $\boldsymbol{\theta}$ tells us the distribution $P_M$. $\mathbb{E}_{\boldsymbol{\theta}}[\cdot]$ means we take expectations of something involving $\mathbf{X}$, which is distributed according to that $\boldsymbol{\theta}$-determined $P_M$. In this case, that is the indicator function $f_j(\mathbf{D}_j)$. 
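For readers who want to see where that result comes from, here is a short sketch of the derivation. It uses the definition of $Z(\boldsymbol{\theta})$ as the sum of the unnormalized measure over all joint assignments, and writes $\mathbf{d}_l$ for the $\mathbf{D}_l$ elements of the assignment $\mathbf{x}$ being summed over (the index $l$ is introduced here only to avoid clashing with $j$):

$$
\begin{align}
\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta}) & = \frac{1}{Z(\boldsymbol{\theta})} \frac{\partial }{\partial \theta_j} \sum_{\mathbf{x}\in Val(\mathbf{X})} \exp\big(\sum_{l = 1}^m \theta_l f_l(\mathbf{d}_l)\big) \\
& = \sum_{\mathbf{x}\in Val(\mathbf{X})} f_j(\mathbf{d}_j) \frac{1}{Z(\boldsymbol{\theta})} \exp\big(\sum_{l = 1}^m \theta_l f_l(\mathbf{d}_l)\big) \\
& = \sum_{\mathbf{x}\in Val(\mathbf{X})} P_M(\mathbf{x}) f_j(\mathbf{d}_j) \\
& = \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]
\end{align}
$$

The second line is the chain rule applied to each exponential, and the third recognizes the Gibbs Rule inside the sum.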

So, all this means is that the derivative of our per-sample log-likelihood is:

$$
\frac{\partial }{\partial \theta_j} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] - \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]
$$

Interesting. Let's build some intuition. The left side is how sensitive our objective (at some $\boldsymbol{\theta}$ value) is to $\theta_j$. This equation tells us that sensitivity is the *difference* of two things:

1. $\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]$: The proportion of times our $f_j$ indicator produced a 1 in our data.
2. $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$: The proportion of times you'd *expect* it to produce a 1 if $\mathbf{X}$ were generated according to a $P_M$ with parameters set as $\boldsymbol{\theta}$.

This result is by no means obvious, but it easily passes an intuitive check. Let's say $\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] \gg \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$. This means $f_j(\cdot)$ is 1 much more frequently in our data than what is expected under $\boldsymbol{\theta}$. To correct for that, $\theta_j$ is our most direct lever. Increasing it will increase $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$ and lessen that difference. As an effect, our parameters will agree more with our data. So this argument shows that $\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)]$ and $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$ tell us how to change $\theta_j$ to increase agreement with our data. Since $\mathcal{L}(\boldsymbol{\theta})$ is a measure of that agreement, it's not terribly surprising $\frac{\partial }{\partial \theta_j} \log \mathcal{L}(\boldsymbol{\theta})$ is some simple function of those two terms. It just so happens to be their difference.
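If the formula feels too convenient, it can be checked numerically. The sketch below compares the 'empirical minus expected' gradient of the per-sample log-likelihood against central finite differences on a toy model. Everything here (features, data, parameter values, function names) is hypothetical, and the model expectation is computed by brute-force enumeration.

```python
import itertools
import math

# Hypothetical indicator features on three binary RVs (A, B, C).
FEATURES = [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 2), (0, 0)), ((1, 2), (1, 1))]
ALL_X = list(itertools.product([0, 1], repeat=3))

def f(j, x):
    scope, value = FEATURES[j]
    return 1.0 if tuple(x[v] for v in scope) == value else 0.0

def score(theta, x):
    return sum(t * f(j, x) for j, t in enumerate(theta))

def log_Z(theta):
    return math.log(sum(math.exp(score(theta, x)) for x in ALL_X))

def avg_loglik(theta, data):
    return sum(score(theta, x) for x in data) / len(data) - log_Z(theta)

def model_expectation(j, theta):
    """E_theta[f_j]: sum over all assignments of P_M(x) * f_j(x)."""
    lz = log_Z(theta)
    return sum(f(j, x) * math.exp(score(theta, x) - lz) for x in ALL_X)

def gradient(theta, data):
    """Empirical expectation minus model expectation, one entry per theta_j."""
    return [sum(f(j, x) for x in data) / len(data) - model_expectation(j, theta)
            for j in range(len(theta))]

def numeric_gradient(theta, data, eps=1e-6):
    """Central finite differences of the per-sample log-likelihood."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((avg_loglik(up, data) - avg_loglik(down, data)) / (2 * eps))
    return grad

data = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 1)]   # made-up data
theta = [0.3, -0.2, 0.5, 0.1]                                    # arbitrary parameters
print(gradient(theta, data))
print(numeric_gradient(theta, data))
```

The two printed vectors should agree to several decimal places.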

Now I will prooflessly tell you another fact: $\log \mathcal{L}(\boldsymbol{\theta})$ is a *concave* function [link] of $\boldsymbol{\theta}$. For our purposes, this means $\log \mathcal{L}(\boldsymbol{\theta})$ has no local optima that aren't also global optima. So gradient ascent (equivalently, gradient descent on the negative log-likelihood), run with a suitable step size, will eventually lead you to a global optimum.

From this, we get the fact that if, for all $j$, we have:

$$
\frac{\partial }{\partial \theta_j} \log \mathcal{L}(\boldsymbol{\theta}) \Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^*} = 0
$$

then the $\boldsymbol{\theta}^*$ that accomplishes this *must* be the global optimum. If we rewrite this with our formula for the derivative, we get:

$$
\mathbb{E}_\mathcal{D}[f_j(\mathbf{D}_j)] = \mathbb{E}_{\boldsymbol{\theta}^*}[f_j(\mathbf{D}_j)]
$$

Ahh, OK - so our best choice of $\boldsymbol{\theta}^*$ will imply a $P_M$ that makes $f_j(\cdot)$ produce a 1 with the same frequency as observed in the data. That makes intuitive sense.

So this means that, with complete data, we know how to calculate the gradient of our objective, and with gradient-based methods we're assured to arrive at a global optimum (eventually). Great!
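Here is a minimal sketch of that procedure: plain gradient ascent on a toy model, stopping after a fixed number of steps, followed by a check that the fitted model's expectations match the empirical ones. The features, data, step size of 0.5 and step count of 2000 are all assumptions chosen for this toy example, not recommendations.

```python
import itertools
import math

# Hypothetical indicator features on three binary RVs (A, B, C).
FEATURES = [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 2), (0, 0)), ((1, 2), (1, 1))]
ALL_X = list(itertools.product([0, 1], repeat=3))

def f(j, x):
    scope, value = FEATURES[j]
    return 1.0 if tuple(x[v] for v in scope) == value else 0.0

def score(theta, x):
    return sum(t * f(j, x) for j, t in enumerate(theta))

def model_expectation(j, theta):
    """E_theta[f_j], by brute-force enumeration of all joint assignments."""
    weights = [math.exp(score(theta, x)) for x in ALL_X]
    return sum(w * f(j, x) for w, x in zip(weights, ALL_X)) / sum(weights)

def empirical_expectation(j, data):
    return sum(f(j, x) for x in data) / len(data)

def gradient(theta, data):
    return [empirical_expectation(j, data) - model_expectation(j, theta)
            for j in range(len(theta))]

data = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 1, 1)]   # made-up data
theta = [0.0] * len(FEATURES)                                    # start at the uniform model

for _ in range(2000):                       # fixed step count, chosen for this toy case
    g = gradient(theta, data)
    theta = [t + 0.5 * gj for t, gj in zip(theta, g)]

for j in range(len(FEATURES)):
    print(j, round(empirical_expectation(j, data), 4), round(model_expectation(j, theta), 4))
```

At the end, each pair of printed numbers should match: that is the moment-matching condition from above, reached by following the gradient.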

But now for some bad news.

### Optimizing $\log \mathcal{L}(\boldsymbol{\theta})$ is costly!

Yes - and that's a big problem.

To see this, let's ask: how do we actually calculate the $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$'s for each $j$?

Hmm, let's think about what we have:

1. We have our $\boldsymbol{\theta}$ parameter vector determined.
2. We have our indicator functions, $f_j$, along with the set of RVs, $\mathbf{D}_j$, associated with each. We know which assignments of $\mathbf{D}_j$ make $f_j$ produce a 1.

It's clearly not some simple use of this information to produce a $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$. It turns out that calculating this result is an *inference* task. Remember, we defined inference as:

"
The task of using a given graphical model of a system, complete with fitted parameters, to answer certain questions regarding that system.
"

One of those 'certain questions' is a **probability query**, where we ask for the distribution of one set of RVs given another. In our case, we need to know the distribution of $\mathbf{D}_j$ (given nothing - it's an unconditional query) according to our $\boldsymbol{\theta}$.

This isn't good - inference can be very *expensive*. Plus, we need to perform this task at *every* gradient step, so the cost is multiplied by the number of steps we take. Unfortunately, there's no trick that makes this cost disappear. But before we discuss the workarounds, let's make our lives even harder!
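To get a feel for 'expensive', the sketch below counts how many joint assignments a naive, enumerate-everything computation of $Z(\boldsymbol{\theta})$ or of $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$ would have to visit. This is the brute-force worst case only; inference algorithms that exploit the graph's structure can do much better on sparse graphs. The function name is hypothetical.

```python
import math

def brute_force_terms(cardinalities):
    """Number of joint assignments a naive sum over Val(X) must visit."""
    return math.prod(cardinalities)

for n in (10, 20, 40):
    print(n, "binary RVs ->", brute_force_terms([2] * n), "terms per pass")
```

And that count is paid again at every gradient step.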

### Missing data

What happens if some of our data is missing? That is, $\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^w$, contains missing entries. Also, to keep things simple, let's assume the entries are 'missing completely at random', meaning the pattern of missing entries doesn't depend on the values of the RVs.

In such a case, we have to adjust our log likelihood to consider all possible ways our missing variables *could be*. To see this, let's consider a system with 5 RVs: $\mathbf{X}=\mathcal{X}=\{X_1,\cdots,X_5\}$. Let's also say that our first entry is:

$$
\mathbf{x}^{(1)} = [x_1^0,\ ?,\ x_3^1, ?,\  x_5^3] 
$$

So assuming we have our parameters, what is $P_M(\mathbf{x}^{(1)})$? Well, the appropriate way to think about it is as the *marginal probability* of the assignment $X_1=x_1^0$, $X_3=x_3^1$ and $X_5=x_5^3$. This is to say, it's the sum of probabilities across all assignment vectors that agree with this assignment of $X_1$, $X_3$ and $X_5$. To say it with notation, let $h(i)$ be the number of possible assignments to the missing variables at observation $i$. So, if $Val(X_2)$ has 4 elements and $Val(X_4)$ has 5 elements, then there are $h(1)=4\times 5 = 20$ ways to assign $X_2$ and $X_4$. Let's also say $k$ indexes over all these assignments and that $\mathbf{x}^{(i),k}$ is $\mathbf{x}^{(i)}$, but with the $k$-th assignment of the missing variables filled in. With that, we can now say this:

$$
P_M(\mathbf{x}^{(i)}) = \sum_{k=1}^{h(i)} P_M(\mathbf{x}^{(i),k})
$$
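Here is a small sketch of that sum for the example observation, with `None` standing in for a missing entry. The cardinalities for $X_1$, $X_3$ and $X_5$, the chain-shaped scoring function and the weight `THETA` are invented for illustration. Only the 4 and 5 values for $X_2$ and $X_4$ come from the text.

```python
import itertools
import math

CARDS = [2, 4, 2, 5, 4]   # hypothetical |Val(X_1)|, ..., |Val(X_5)|
THETA = 0.7               # one shared, made-up weight

def score(x):
    """Hypothetical log-linear score: THETA for every adjacent pair that agrees."""
    return THETA * sum(1.0 for a, b in zip(x, x[1:]) if a == b)

ALL_X = list(itertools.product(*[range(c) for c in CARDS]))
Z = sum(math.exp(score(x)) for x in ALL_X)

def completions(x_partial):
    """Every full assignment that agrees with the observed (non-None) entries."""
    options = [range(c) if v is None else [v] for v, c in zip(x_partial, CARDS)]
    return list(itertools.product(*options))

def marginal(x_partial):
    """P_M of a partial observation: the sum of P_M over its h(i) completions."""
    return sum(math.exp(score(x)) for x in completions(x_partial)) / Z

x1 = (0, None, 1, None, 3)   # the example observation: X_2 and X_4 are missing
print(len(completions(x1)), marginal(x1))
```

The first printed number is $h(1)$, and the second is the marginal probability that enters the likelihood for this observation.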

Ok, now let's revisit our average per sample log likelihood:

$$
\begin{align}
\frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) & = \frac{1}{w} \log \big( \prod_{i=1}^w P_M(\mathbf{x}^{(i)}) \big) \\
& = \frac{1}{w} \sum_{i=1}^w \log \big( \sum_{k=1}^{h(i)} P_M(\mathbf{x}^{(i),k}) \big) \\
& = \frac{1}{w} \sum_{i=1}^w \log \Big( \sum_{k=1}^{h(i)} \exp\big(\sum_{j = 1}^m \theta_j f_j(\mathbf{d}_j^{(i),k}) \big) \Big) -\log Z(\boldsymbol{\theta})\\
\end{align}
$$

As you can probably tell, this is a much uglier objective to work with. We have sums intertwined with logs and exponentiation. One big cost of this is that we lose our concavity - $\log \mathcal{L}(\boldsymbol{\theta})$ may now have multiple local optima.

But there's another issue. Let me tell you the derivative in this case:

$$
\frac{\partial }{\partial \theta_j} \frac{1}{w} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)] - \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]
$$

What's new is $\mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$. This is the average expectation of $f_j(\mathbf{D}_j)$ given the *non-missing* observations. We use $\boldsymbol{\theta}$ to determine the expectation with respect to the missing variables. So we could write it this way:

$$
\mathbb{E}_{\mathcal{D},\boldsymbol{\theta}}[f_j(\mathbf{D}_j)] = \frac{1}{w}\sum_{i=1}^w \mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)|\mathbf{x}^{(i)}]
$$

where the conditioning is on the *observed* entries of $\mathbf{x}^{(i)}$ and the expectation is taken over its missing entries. If $\mathbf{D}_j$ is fully observed in $\mathbf{x}^{(i)}$, the $i$-th term is just $f_j(\mathbf{d}_j^{(i)})$, exactly as in the complete-data case.

And that expectation is exactly the problem. Like last time, it's another *inference* task and so it's another computational cost. It's not quite as heavy as $\mathbb{E}_{\boldsymbol{\theta}}[f_j(\mathbf{D}_j)]$ (since its weight is reduced by whatever we observe) but it definitely doesn't help.
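To see both expectations side by side, here is a sketch of the missing-data gradient on a toy model, with `None` marking a missing entry. The features, data and parameters are made up, and both inference tasks are done by brute-force enumeration. Each observation needs its own conditional expectation, computed by summing over the completions consistent with its observed entries.

```python
import itertools
import math

# Hypothetical indicator features on three binary RVs (A, B, C).
FEATURES = [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 2), (0, 0)), ((1, 2), (1, 1))]
ALL_X = list(itertools.product([0, 1], repeat=3))

def f(j, x):
    scope, value = FEATURES[j]
    return 1.0 if tuple(x[v] for v in scope) == value else 0.0

def score(theta, x):
    return sum(t * f(j, x) for j, t in enumerate(theta))

def completions(x):
    """All full assignments that agree with the observed (non-None) entries of x."""
    return [y for y in ALL_X if all(o is None or o == v for o, v in zip(x, y))]

def avg_loglik(theta, data):
    """Per-sample log-likelihood, marginalizing over each observation's missing entries."""
    log_z = math.log(sum(math.exp(score(theta, y)) for y in ALL_X))
    return sum(math.log(sum(math.exp(score(theta, y)) for y in completions(x))) - log_z
               for x in data) / len(data)

def expectation(j, theta, x):
    """E_theta[f_j | observed entries of x]; with nothing observed it is E_theta[f_j]."""
    ys = completions(x)
    weights = [math.exp(score(theta, y)) for y in ys]
    return sum(w * f(j, y) for w, y in zip(weights, ys)) / sum(weights)

def gradient(theta, data):
    nothing_observed = (None, None, None)
    return [sum(expectation(j, theta, x) for x in data) / len(data)
            - expectation(j, theta, nothing_observed)
            for j in range(len(theta))]

data = [(0, None, 0), (1, 1, None), (None, 0, 0), (1, 1, 1)]   # made-up incomplete data
theta = [0.3, -0.2, 0.5, 0.1]                                  # arbitrary parameters
print(gradient(theta, data))
```

Notice that the first term now depends on $\boldsymbol{\theta}$ too, so it can no longer be computed once up front and reused across gradient steps.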


### Scrap

This is good, but it would be nice if we could see the parameters. Right now, they are hiding as the outputs of $f_i(\cdot)$. Here's an idea: let's say $k_i$ is an index over all the values in $Val(\mathbf{D}_i)$. That is, $k_i \in \{1,\cdots,|Val(\mathbf{D}_i)|\}$ where $|Val(\mathbf{D}_i)|$ is the number of elements in $Val(\mathbf{D}_i)$. Then, let's say $f_{i,k_i}(\cdot)$ is an indicator function that is 1 if the given $\mathbf{D}_i$-value is the $k_i$-th value. In other words, $f_{i,k_i}(\cdot)$ goes off when it gets one specific value of $\mathbf{D}_i$. Let's also say $\theta_{i,k_i}$ is a real value associated with each of these. Then we could say:

$$
f_i(\mathbf{D}_i) = \sum_{k_i=1}^{|Val(\mathbf{D}_i)|} \theta_{i,k_i} f_{i,k_i}(\mathbf{D}_i)
$$

Here $\theta_{i,k_i}$ is the output of $f_i(\cdot)$ on the $k_i$-th value in $Val(\mathbf{D}_i)$. All we've done is rewrite it using indicator functions. This changes our Gibbs Rule to:

$$
\begin{align}
P_M(X_1,\cdots,X_n) & = \frac{1}{Z} \prod_{i = 1}^m\exp\big(\sum_{k_i=1}^{|Val(\mathbf{D}_i)|} \theta_{i,k_i} f_{i,k_i}(\mathbf{D}_i)\big) \\
& = \frac{1}{Z} \exp\big(\sum_{i = 1}^m \sum_{k_i=1}^{|Val(\mathbf{D}_i)|} \theta_{i,k_i} f_{i,k_i}(\mathbf{D}_i)\big) \\
\end{align}
$$

OK, this does what we want, but uhh...it's heavy on the notation, don't you think? Well, that double sum across indices can be thought of as a single sum across one fat index. So let's make this change:

$$
\sum_{i = 1}^m \sum_{k_i=1}^{|Val(\mathbf{D}_i)|} \theta_{i,k_i} f_{i,k_i}(\mathbf{D}_i) \rightarrow \sum_{i = 1}^m \theta_i f_i(\mathbf{D}_i)
$$

This is indeed a change. The $i$'s and $m$'s on the right are *redefined* such that they match the left. They mean something totally different than what they did before - I'm just reusing the $i$ and $m$ notation. Also, implicitly, the $\mathbf{D}_i$'s are now repeated, once for every value in the original $Val(\mathbf{D}_i)$.

$$
\begin{align}
\frac{\partial }{\partial \boldsymbol{\theta}} \log Z(\boldsymbol{\theta}) & = \frac{1}{Z(\boldsymbol{\theta})} \frac{\partial }{\partial \boldsymbol{\theta}} Z(\boldsymbol{\theta}) \\
& = \frac{\partial }{\partial \boldsymbol{\theta}} \log \Big(\overbrace{\sum_{\mathbf{x}\in Val(\mathbf{X})} \exp\big(\sum_{i = 1}^m \theta_i f_i(\mathbf{D}_i)\big)}^{\text{Definition of }Z(\boldsymbol{\theta})}\Big)\\
\end{align}
$$

(OLD STUFF BELOW)

Since we'd like to maximize this, the first step is always to calculate the derivative. To keep things short, I'll just tell you that:

$$
\frac{1}{w} \frac{\partial}{\partial \theta_i} \log \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_\mathcal{D}[f_i(\mathbf{D}_i)] - \mathbb{E}_{\boldsymbol{\theta}}[f_i(\mathbf{D}_i)]
$$

$$
\begin{align}
\frac{\partial}{\partial \boldsymbol{\theta}} \log \mathcal{L}(\boldsymbol{\theta})  & = w \frac{1}{Z(\boldsymbol{\theta})} \frac{\partial }{\partial \boldsymbol{\theta}} Z(\boldsymbol{\theta}) + \\ 
\end{align}
$$

### Footnotes

[1] In fact, it doesn't *have* to indicate only a single assignment. It could turn on for a whole set of assignments. That doesn't change any of the downstream math.

[2] Wait, but aren't we also given a graph $\mathcal{H}$? Yes, but that's baked into the indicator functions already, so we actually don't need it.

[3] In fact, this can be taken a step further with BLAH BLAH

### Sources

Koller, D. and Friedman, N. (2009). *Probabilistic Graphical Models: Principles and Techniques*. MIT Press. Chapter 20.