# Machine Learning (Summer 2018)

## Practice Session 5

May 17th, 2018

Ulf Krumnack

Institute of Cognitive Science
University of Osnabrück

## Today's Session

The EM Algorithm:
* Recap Lecture Slides on the EM algorithm
* Sheet 3, Assignment: Expectation Maximization

Background reading:
* Bishop (2006): *Pattern Recognition and Machine Learning*, chapter 9

# Lecture Slides on EM

Exercise 1: Missing value problem. Recap ML04 slides 14-19 and discuss the problem with your neighbor. (5 min)

Exercise 2: EM algorithm: Study ML04 slides 20-23 and prepare to explain them to the class. (10 min)

Exercise 3: Convergence of EM: Understand the proof on ML04 slide 24. (5 min)

# Soft Clustering with Gaussian Mixture


Gaussian Mixture models:
* can be used for soft clustering since they allow us to express varying degrees of certainty about the membership of individual samples
* are among the most widely used models, since Gaussian distributions fit many different kinds of data reasonably well

## Mixture Models

A mixture model with $K$ components is in general of the form:

$$ p(\mathbf{x}|\mathbf{\theta}) = \sum_{k=1}^K\pi_kp_k(\mathbf{x}|\mathbf{\theta}_k)$$
where $\sum_{k=1}^K\pi_k = 1$.

The probability density of a sample $\mathbf{x}$ given the parameter vector $\mathbf{\theta}$ is expressed as a weighted sum of $K$ individual densities $p_k$ with parameters $\mathbf{\theta}_k \subseteq {\theta}$, where the weights are the respective class probabilities $\pi_k$.

### Gaussian Mixture Models

In a Gaussian mixture model one sets

$$p_k \sim \mathcal{N}(\mu_k,\sigma_k)$$

with independent parameters $\mu_k$ and $\sigma_k$ for each component of the distribution.
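As a minimal sketch of the definition above, the following cell evaluates a one-dimensional Gaussian mixture density on a grid and checks numerically that it integrates to roughly 1. The helper names `gaussian_pdf` and `mixture_pdf` and all parameter values are assumptions chosen for illustration; they are not part of the exercise sheet.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(x, pis, mus, sigmas):
    """Weighted sum of K Gaussian densities evaluated at x."""
    x = np.asarray(x, dtype=float)
    components = [pi * gaussian_pdf(x, mu, sigma)
                  for pi, mu, sigma in zip(pis, mus, sigmas)]
    return np.sum(components, axis=0)

# Illustrative parameters (assumed, not taken from the exercise data).
pis = np.array([0.5, 0.3, 0.2])      # mixing weights, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

grid = np.linspace(-10, 10, 20001)
density = mixture_pdf(grid, pis, mus, sigmas)
area = np.sum(density) * (grid[1] - grid[0])  # simple Riemann sum
print(area)
```

Because the weights $\pi_k$ sum to 1 and every component integrates to 1, the mixture is again a valid density.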

### Sampling

If we were to randomly pick values for the parameter vector $\theta$, we would have a generative model that can produce naturally clustered data for us. We would just have to sample $\hat{x} \sim p(\mathbf{x}|\mathbf{\theta})$.
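One possible way to draw such samples is a two-stage procedure: first pick a component $k$ with probability $\pi_k$, then draw from the chosen Gaussian. The sketch below assumes the one-dimensional case; the function name `sample_mixture` and the parameter values are hypothetical.

```python
import numpy as np

def sample_mixture(n, pis, mus, sigmas, rng):
    """Draw n samples: choose a component per sample, then sample from it."""
    ks = rng.choice(len(pis), size=n, p=pis)   # latent component indices
    xs = rng.normal(mus[ks], sigmas[ks])       # one draw per chosen component
    return xs, ks

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])      # assumed mixing weights
mus = np.array([-2.0, 0.0, 3.0])     # assumed means
sigmas = np.array([0.5, 1.0, 0.8])   # assumed standard deviations

samples, labels = sample_mixture(1000, pis, mus, sigmas, rng)
print(samples[:5], labels[:5])
```

In the clustering task the labels are exactly the information that is missing: only `samples` would be observed.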

## Task: Parameter estimation

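Find the most likely model parameters for the given data. As a building block we need the posterior probability that a sample $\mathbf{x}$ was generated by model $k$. This can be calculated easily by Bayes' Theorem for each model $k$, where the (latent) prior probability for choosing model $k$ is $p(k|\mathbf{\theta}_k) = \pi_k$: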

$$
\begin{align}
p(k|\mathbf{x},\mathbf{\theta})
&=\frac{p(k|\mathbf{\theta}_k)p(\mathbf{x}|k,\mathbf{\theta}_k)}{\sum_{k'=1}^Kp(k'|\mathbf{\theta}_{k'})p(\mathbf{x}|k',\mathbf{\theta}_{k'})}\\
& = \frac{\pi_kp_k(\mathbf{x}|\mathbf{\theta}_k)}{\sum_{k'=1}^K\pi_{k'}p_{k'}(\mathbf{x}|\mathbf{\theta}_{k'})}
\end{align}
$$

We want to maximize the log likelihood given as
$$\mathcal{\ell}(\mathbf{\theta})=\sum_{i=1}^N\log p(x_i|\mathbf{\theta}) = \sum_{i=1}^N\log\left[\sum_{k=1}^K\pi_kp_k(x_i|\mathbf{\theta}_k)\right].$$

The difficulty is the sum inside the logarithm: we cannot pull the logarithm further in towards the density, and that is what makes the problem so hard. If we just *ignore* the inner sum we get an expression
$$\mathcal{\ell}_c(\mathbf{\theta}) = \sum_{i=1}^N\log p_k(x_i|\mathbf{\theta}_k)$$
which would be much nicer to compute. But now we have a free floating $k$ in the subscript of our density! Which one of the mixing distributions are we talking about here? 

Kind of all of them at once. But we need one quantity to represent all the distributions. So to get rid of the $k$ we take the expected value with respect to the latent variable $k$ and obtain a function that only depends on $\mathbf{\theta}$:
$$Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) = \mathbb{E}\left[\mathcal{\ell}_c\left(\mathbf{\theta}\middle|\mathbf{\theta}^{t-1}\right)\right]$$

The final formula looks as follows:

$$\begin{align}
Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) &= \sum_i\sum_k p\left(k\middle|x_i,\mathbf{\theta}^{t-1}\right)\log\pi_k + \sum_i\sum_k p\left(k\middle|x_i,\mathbf{\theta}^{t-1}\right)\log p_k\left(x_i\middle|\mathbf{\theta}\right)
\end{align}$$

Since $\theta^{t-1}$ is known at time $t$ we can calculate $p\left(k\middle|x_i,\mathbf{\theta}^{t-1}\right)$ with Bayes' Theorem as stated above and replace these expressions with constants $r_{i,k}.$
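To make the roles of $r_{i,k}$ and $Q$ concrete, here is a small sketch for the one-dimensional Gaussian case. It computes the responsibilities with Bayes' Theorem and then evaluates $Q$ for fixed responsibilities. The function names and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def responsibilities(x, pis, mus, sigmas):
    """r[i, k] = p(k | x_i, theta), computed with Bayes' Theorem."""
    weighted = pis * normal_pdf(x[:, None], mus, sigmas)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

def log_likelihood(x, pis, mus, sigmas):
    """Sum over samples of the log of the mixture density."""
    weighted = pis * normal_pdf(x[:, None], mus, sigmas)
    return np.sum(np.log(weighted.sum(axis=1)))

def q_function(x, r, pis, mus, sigmas):
    """Q for fixed responsibilities r: sum_i sum_k r_ik (log pi_k + log p_k(x_i))."""
    log_joint = np.log(pis) + np.log(normal_pdf(x[:, None], mus, sigmas))
    return np.sum(r * log_joint)

# Toy data and a guess for the previous parameters (all values assumed).
x = np.array([-2.1, -1.9, 0.1, 0.2, 2.8, 3.1])
pis_old = np.array([0.4, 0.3, 0.3])
mus_old = np.array([-2.0, 0.0, 3.0])
sigmas_old = np.array([1.0, 1.0, 1.0])

r = responsibilities(x, pis_old, mus_old, sigmas_old)
q_val = q_function(x, r, pis_old, mus_old, sigmas_old)
ll = log_likelihood(x, pis_old, mus_old, sigmas_old)
print(r.round(3))
print(q_val, ll)
```

Each row of `r` sums to 1, and once `r` is fixed, `q_function` contains no sum inside a logarithm.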

## Sheet 03, Assignment 3: Implement Expectation Maximization for Soft Clustering 

### Step 1) Load the data


Load the provided data set. It is stored in `em_normdistdata.txt`. We call the set $X$ and each individual data point $x \in X$. 

*Hint:* Figure out how numpy can load text data.

In [None]:
import numpy as np

def load_data(file_name):
    """
    Loads the data stored in file_name into a numpy array.
    """
    result = np.loadtxt(file_name)
    return result

In [None]:
data = load_data('em_normdistdata.txt')
assert data.shape == (200,) , "The data was not properly loaded."

*Optional:* The data consists of 200 data points drawn from three normal distributions. To get a feeling for the data set you can plot the data with the following cells. Change the number of bins to get a rough idea of what the three distributions might look like.

In [None]:
import matplotlib.pyplot as plt

plt.figure('Data overview 1')
plt.scatter(data, np.zeros_like(data))
plt.show()

In [None]:
import matplotlib.pyplot as plt

plt.figure('Data overview 2')
plt.hist(data, bins=50)
plt.show()

### Step 2) Initialize EM

Below is a class definition `NormPDF` which represents the probability density function (pdf) of the normal distribution with an additional parameter $\alpha$. The class is explained in the next cells.

In [None]:
import numpy as np
class NormPDF():
    """
    A representation of the probability density function of the normal distribution
    for the EM Algorithm.
    """

    def __init__(self, mu=0, sigma=1, alpha=1):
        """
        Initializes the normal distribution with mu, sigma and alpha.
        The defaults are 0, 1, and 1 respectively.
        """
        self.mu = mu
        self.sigma = sigma
        self.alpha = alpha
        

    def __call__(self, x):
        """
        Returns the evaluation of this normal distribution at x.
        Does not take alpha into account!
        """
        return np.exp(-(x - self.mu) ** 2 / (2 * self.sigma ** 2)) / (np.sqrt(np.pi * 2) * self.sigma)


    def __repr__(self):
        """
        A simple string representation of this instance.
        """
        return 'NormPDF({self.mu:.2f},{self.sigma:.2f},{self.alpha:.2f})'.format(self=self)

`__init__`: This is the constructor. When a new instance of the class is created this method is used. It takes the parameters `mu`, `sigma`, and `alpha`. Note that if you leave out parameters, they will be set to some default values.
So you can create `NormPDF` instances like this:

In [None]:
a = NormPDF()             # No parameters: mu = 0, sigma = 1, alpha = 1
b = NormPDF(1)            # mu = 1, sigma = 1, alpha = 1
c = NormPDF(1, alpha=0.4) # skips sigma but sets alpha, thus: mu = 1, sigma = 1, alpha = 0.4
d = NormPDF(0, 0.5)       # mu = 0, sigma = 0.5, alpha = 1
e = NormPDF(0, 0.5, 0.9)  # mu = 0, sigma = 0.5, alpha = 0.9

`__call__`: This is a very cool feature of Python. By implementing this method one can make an instance *callable*. That basically means one can use it as if it was a function. The `NormPDF` instances can be called with an x value (or a numpy array of x values) to get the evaluation of the normal distribution at x.

In [None]:
normpdf = NormPDF()
print(normpdf(0))
print(normpdf(0.5))
print(normpdf(np.linspace(-2, 2, 10)))

`__repr__`: This method will be used in Python when one calls `repr(NormPDF())`. As long as `__str__` is not implemented (which you saw in last week's sheet) `str(NormPDF())` will also use this method. This comes in handy for printing:

In [None]:
normpdf1 = NormPDF()
normpdf2 = NormPDF(1, 0.5, 0.9)
print(normpdf1)
print([normpdf1, normpdf2])

It is also possible to change the values of an instance of the NormPDF:

In [None]:
normpdf1 = NormPDF()
print(normpdf1)
print(normpdf1(np.linspace(-2, 2, 10)))

normpdf1.mu = 1
normpdf1.sigma = 2
normpdf1.alpha = 0.9
print(normpdf1)
print(normpdf1(np.linspace(-2, 2, 10)))

Now that you know how the `NormPDF` class works, it is time for the implementation of the initialization function. Here is the task again:

Write a function `gaussians = initialize_EM(data, num_distributions)` to initialize the EM.

Each normal distribution $j$ has three parameters: $\mu_j$ (the mean), $\sigma_j$ (the standard deviation), $\alpha_j$ (the proportion of the normal distribution in the mixture, that means $\sum\limits_j\alpha_j=1$).
Initialize the three parameters using a random partition of the data set into subsets $S_j$, one per distribution. Calculate each $\mu_j$ and $\sigma_j$ from $S_j$ and set $\alpha_j = \frac{|S_j|}{|X|}$.

In [None]:
def initialize_EM(data, num_distributions):
    """
    Initializes the EM algorithm by calculating num_distributions NormPDFs
    from a random partitioning of data. I.e., the data set is randomly
    divided into num_distributions parts, and each part is used to initialize
    mean, standard deviation and alpha parameter of a NormPDF object.
    
    Args:
        data (array): A collection of data.
        num_distributions (int): The number of distributions to return.
        
    Returns:
        A list of num_distributions NormPDF objects, initialized from a
        random partitioning of the data.
    """
    # draw len(data) random integers from {0, ..., num_distributions - 1}
    partition_mapping = np.random.randint(0, num_distributions, len(data))
    # initialize num_distributions many NormPDFs
    gaussians = [NormPDF() for _ in range(num_distributions)]
    
    # calculate the mean, standard deviation and alpha depending on the partition mapping
    for index, gaussian in enumerate(gaussians):
        subset = data[partition_mapping == index]
        gaussian.mu = np.mean(subset)
        gaussian.sigma = np.std(subset)
        gaussian.alpha = len(subset) / len(data)

    return gaussians

In [None]:
normpdfs_ = initialize_EM(np.linspace(-1, 1, 100), 2)
assert len(normpdfs_) == 2, "The number of initialized distributions is not correct."
# 1e-10 is 0.0000000001
assert abs(1 - sum([normpdf.alpha for normpdf in normpdfs_])) < 1e-10 , "Sum of all alphas is not 1.0!"

In [None]:
ls = np.linspace(data.min(),data.max(),100)

plt.figure()
for pdf in normpdfs_:
    plt.plot(ls,pdf(ls),label=pdf)
plt.legend()
plt.show()
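Assigning every sample a random integer can, at least in principle, leave one subset empty, which would produce a NaN mean. One possible alternative, shown here only as a sketch, shuffles the data and splits it into nearly equal parts. The function `balanced_partition_init` is hypothetical and returns plain `(mu, sigma, alpha)` tuples so that the cell runs on its own.

```python
import numpy as np

def balanced_partition_init(data, num_distributions, rng):
    """Return (mu, sigma, alpha) triples from a shuffled, near-equal split."""
    shuffled = rng.permutation(data)                       # shuffled copy
    parts = np.array_split(shuffled, num_distributions)    # non-empty if len(data) >= num_distributions
    return [(part.mean(), part.std(), len(part) / len(data)) for part in parts]

rng = np.random.default_rng(2)
toy = rng.normal(0.0, 1.0, size=200)   # stand-in for the real data set
params = balanced_partition_init(toy, 3, rng)
for mu, sigma, alpha in params:
    print('mu={:.2f} sigma={:.2f} alpha={:.2f}'.format(mu, sigma, alpha))
```

The triples could be passed to `NormPDF(mu, sigma, alpha)` if one wanted to use this variant in the notebook.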

### Step 3) Implement the expectation step

Perform a soft classification of the data samples with the normal distributions. That means: Calculate the likelihood that a data sample $x_i$ belongs to distribution $j$ given parameters $\mu_j$ and $\sigma_j$. Or in other words, what is the likelihood of $x_i$ being drawn from $N_j(\mu_j, \sigma_j)$? Once you have the likelihood, weight the result by $\alpha_j$.

As a last step normalize the results such that the likelihoods of a data sample $x_i$ sum up to $1$.

*Hint:* Store the data in a different array before you normalize it to not run into problems with partly normalized data.

In [None]:
def expectation_step(gaussians, data):
    """
    Performs the expectation step of the EM.
    
    Args:
        gaussians (list): A list of NormPDF objects.
        data (array): The data vector.
        
    Returns:
        An array of shape (len(data), len(gaussians))
        which contains, for each sample, the normalized likelihoods
        of belonging to each of the normal distributions
        (each row sums to 1).
    """

    # Calculates the likelihoods of the samples per 
    # distribution and weights the results by alpha.
    tmp = np.empty((len(data), len(gaussians)))
    for j, N in enumerate(gaussians):
        tmp[:,j] = N.alpha * N(data)

    # Normalize the results so that each row sums to 1.
    expectation = tmp / np.sum(tmp, axis=1, keepdims=True)

    return expectation
    

assert expectation_step([NormPDF(), NormPDF()], np.linspace(-2, 2, 100)).shape == (100, 2) , "Shape is not correct!"
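A small worked example can help to check an expectation step by hand. The sketch below is independent of `NormPDF` and uses two equally weighted unit-variance Gaussians at $-1$ and $1$ (values assumed for illustration). By symmetry the sample $x = 0$ must be split evenly, and for $x = -1$ the first component gets $1 / (1 + e^{-2})$.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def responsibilities(x, mus, sigmas, alphas):
    """Weighted likelihoods per sample, normalized so that each row sums to 1."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (N, 1)
    weighted = alphas * normal_pdf(x, mus, sigmas)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

r = responsibilities([-1.0, 0.0, 1.0],
                     mus=np.array([-1.0, 1.0]),
                     sigmas=np.array([1.0, 1.0]),
                     alphas=np.array([0.5, 0.5]))
print(r)
```

The first column decreases from left to right, as the samples move away from the first mean.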

### Step 4) Implement the maximization step

In the maximization step each $\mu_j$, $\sigma_j$ and $\alpha_j$ is updated. First calculate the new means:

$$\mu_j = \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij}x_i$$

That means $\mu_j$ is the weighted mean of all samples, where the weight is their likelihood of belonging to distribution $j$.

Then calculate the new $\sigma_j$. Each new $\sigma_j$ is the standard deviation of the normal distribution with the new $\mu_j$, so for the calculation you already use the new $\mu_j$:

$$\sigma_j = \sqrt{ \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij} \left(x_i - \mu_j\right)^2 }$$

To calculate the new $\alpha_j$ for each distribution $j$, just take the mean of the $p_{ij}$ over all samples $i$, i.e. $\alpha_j = \frac{1}{|X|}\sum\limits_{i=1}^{|X|} p_{ij}$.

**Caution:** For the next step it is necessary to know how much all $\mu$ and $\sigma$ changed. For that the function `maximization_step` should return a numpy array of those (absolute) changes. For example if $\mu_0$ changed from 0.1 to 0.15, $\sigma_0$ from 1 to 0.9, $\mu_1$ from 0.5 to 0.6, and $\sigma_1$, $\mu_2$, and $\sigma_2$ stayed the same, we expect the function to return `np.array([0.05, 0.1, 0.1, 0, 0, 0])` (however, the order is not important).

In [None]:
def maximization_step(gaussians, data, expectation):
    """
    Performs the maximization step of the EM.
    Modifies the gaussians by updating their mus, sigmas and alphas.
    
    Args:
        gaussians (list): A list of NormPDF objects.
        data (array): The data vector.
        expectation (array): The expectation values for each data element
            (as computed by expectation_step()).

    Returns:
        A numpy array of absolute changes in any mu or sigma, 
        that means the returned array has twice as many elements as
        the supplied list of gaussians.
    """
    changes = []

    for j, N in enumerate(gaussians):
        # calculate new parameters
        # @ is the matrix multiplication operator (behaves like numpy.matmul);
        # for two 1-d arrays it computes their dot product
        mu = expectation[:,j] @ data / np.sum(expectation[:,j])
        sigma = np.sqrt((expectation[:,j] @ (data - mu) ** 2) / np.sum(expectation[:,j]))
        alpha = np.mean(expectation[:,j])
        
        # append relevant changes
        changes += [np.abs(N.mu - mu)]
        changes += [np.abs(N.sigma - sigma)]
        
        # update gaussian
        N.mu = mu
        N.sigma = sigma
        N.alpha = alpha

    return np.array(changes)
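One way to sanity-check the update formulas: if the expectation values are hard assignments (each row is one-hot), the maximization step must reduce to the ordinary mean, standard deviation and relative size of each cluster. The following sketch uses a hypothetical, vectorized `m_step` on toy numbers and is independent of `NormPDF`.

```python
import numpy as np

def m_step(data, r):
    """Weighted mean, standard deviation and mixing weight per component."""
    nk = r.sum(axis=0)                          # effective number of samples per component
    mus = (r.T @ data) / nk
    sigmas = np.sqrt((r * (data[:, None] - mus) ** 2).sum(axis=0) / nk)
    alphas = nk / len(data)
    return mus, sigmas, alphas

data = np.array([0.0, 1.0, 2.0, 10.0, 12.0])
# Hard assignments: first three samples to component 0, last two to component 1.
r = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

mus, sigmas, alphas = m_step(data, r)
print(mus, sigmas, alphas)
```

With soft assignments the same formulas apply, only every sample then contributes to every component with its weight $p_{ij}$.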

### Step 5) Perform the complete EM and plot your results

Initialize three normal distributions whose parameters will be changed iteratively by the EM to converge close to the original distributions.

Build a loop around the iterative procedure of expectation and maximization which stops when the changes in all $\mu_j$ and $\sigma_j$ are sufficiently small.

Plot your results after each step and mark which data points belong to which normal distribution. If you don't get it to work, just plot your final solution.

*Hint:* Remember to load the data and initialize the EM before the loop.

*Hint:* A function `plot_intermediate_result` to plot your result after each step is already defined in the next cell. Take a look at what arguments it takes and try to use it in your loop.

*Hint:* To plot your final result, the first three images and the corresponding code examples in the [`plt.plot(...)` tutorial](http://matplotlib.org/users/pyplot_tutorial.html) should help you.

*Optional:* Run the code multiple times. If your results are changing, use `np.random.seed(2)` in the beginning of the cell to get consistent results (any other integer will work as well, but 2 has some good results for the example solutions).

In [None]:
%matplotlib notebook
import time
import itertools

import numpy as np
import matplotlib.pyplot as plt
# Sets the random seed to a fixed value to make results consistent
np.random.seed(2)

colors = itertools.cycle(['r', 'g', 'b', 'c', 'm', 'y', 'k'])
figure, axis = plt.subplots(1)
axis.set_xlim(-5, 5)
axis.set_ylim(-0.2, 4)
axis.set_title('Intermediate Results')
plt.figure('Final Result')

def plot_intermediate_result(gaussians, data, mapping):
    """
    Gets a list of gaussians and data input. The mapping
    parameter is a list of indices of gaussians. Each value
    corresponds to the data value at the same position and 
    maps this data value to the proper gaussian.
    """
    x = np.linspace(-5, 5, 100)
    if len(axis.lines):
        for j, N in enumerate(gaussians):
            axis.lines[j * 2].set_xdata(x)
            axis.lines[j * 2].set_ydata(N(x))
            axis.lines[j * 2 + 1].set_xdata(data[mapping == j])
            axis.lines[j * 2 + 1].set_ydata([0] * len(data[mapping == j]))
    else:
        for j, N in enumerate(gaussians):
            axis.plot(x, N(x), data[mapping == j], [0] * len(data[mapping == j]), 'x', color=next(colors), markersize=5)
    figure.canvas.draw()
    time.sleep(0.5)


# Perform the initialization.
data = load_data('em_normdistdata.txt')
gaussians = initialize_EM(data, 3)

# Loop until the changes are small enough.
eps = 0.005
changes = [float('inf')] * 2
while max(changes) > eps:
    # Iteratively apply the expectation step, followed by the maximization step.
    expectation = expectation_step(gaussians, data)
    changes = maximization_step(gaussians, data, expectation)

    # Optional: Calculate the parameters to update the plot and call the function to do it.
    plot_intermediate_result(gaussians, data, np.argmax(expectation, 1))


# Plot your final result and print the final parameters.
x = np.linspace(-5, 5, 1000)
plt.plot(x, gaussians[0](x), 'r', x, gaussians[1](x), 'g', x, gaussians[2](x), 'b')
print(gaussians)
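The loop above stops when the parameters barely change. Another quantity worth watching is the log likelihood, which according to the convergence result from the lecture should never decrease from one iteration to the next. The sketch below runs a compact EM on synthetic data and records the log likelihood before each update. The data, the starting values and the function name `em_with_loglik` are assumptions for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_with_loglik(x, pis, mus, sigmas, num_iters=50):
    """Run num_iters EM iterations and record the log likelihood before each update."""
    logliks = []
    for _ in range(num_iters):
        # E-step: weighted likelihoods and responsibilities
        weighted = pis * normal_pdf(x[:, None], mus, sigmas)   # shape (N, K)
        evidence = weighted.sum(axis=1)
        logliks.append(np.sum(np.log(evidence)))
        r = weighted / evidence[:, None]
        # M-step: closed-form updates
        nk = r.sum(axis=0)
        mus = (r.T @ x) / nk
        sigmas = np.sqrt((r * (x[:, None] - mus) ** 2).sum(axis=0) / nk)
        pis = nk / len(x)
    return np.array(logliks), (pis, mus, sigmas)

# Synthetic stand-in for the exercise data: three Gaussian clusters.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 0.5, 70),
                    rng.normal(0.0, 0.5, 60),
                    rng.normal(2.0, 0.5, 70)])

logliks, (pis, mus, sigmas) = em_with_loglik(
    x, np.full(3, 1 / 3), np.array([-1.0, 0.0, 1.0]), np.ones(3))
print(logliks[:5])
print(pis, mus, sigmas)
```

A stopping rule based on the increase of the log likelihood would be a possible alternative to the threshold on parameter changes.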

## Finding optimal parameters $\theta$

$\DeclareMathOperator*{\argmax}{arg\,max}$
In the lecture you saw a proof that if we choose
$$\mathbf{\theta}^t = \argmax_{\mathbf{\theta}} Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right)$$
then the likelihood is non-decreasing from one iteration to the next. So we want to maximize $Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right)$ with respect to the parameters $\left(\pi_1,\dots,\pi_K\right)$ and $\theta = \left(\mu_1,\dots,\mu_K,\sigma_1,\dots,\sigma_K\right)$. 

So your job is to take the derivative of 
$$\begin{align}
Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) &= \sum_i\sum_k r_{i,k}\log\pi_k + \sum_i\sum_k r_{i,k}\log p_k\left(x_i\middle|\mathbf{\theta}\right)
\end{align}$$
with respect to these variables, set it equal to 0, and solve for the variable that you are currently maximizing over. You only have to do this for the one dimensional case, i.e. 
$$p_k(x_i|\mathbf{\theta}_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)$$

### Calculate the maximizer for the $\mu_k$:


We have to compute the derivative
$$\frac{\partial}{\partial \mu_k}Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right)$$

which amounts to
$$\frac{\partial}{\partial \mu_k}\left( \sum_i\sum_k r_{i,k}\log\pi_k + \sum_i\sum_k r_{i,k}\log p_k\left(x_i\middle|\mathbf{\theta}\right)\right)
$$

and, since $\log\pi_k$ does not depend on $\mu_k$, simplifies to
$$\frac{\partial}{\partial \mu_k}\sum_i\sum_k r_{i,k}\log p_k(x_i|\mathbf{\theta})$$


The derivative of every summand is

\begin{align}
& \frac{\partial}{\partial \mu_k}\log p_k(x_i|\mathbf{\theta}) \\
& = \frac{\partial}{\partial \mu_k}\log \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right) \\
& = \frac{\partial}{\partial \mu_k}\left(\log \frac{1}{\sqrt{2\pi\sigma_k^2}} + \log\exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)\right) \\
& = \frac{\partial}{\partial \mu_k}\left(\log \frac{1}{\sqrt{2\pi\sigma_k^2}} - {\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)  \\
& = 0 + \frac{1}{2\sigma_k^2}2\left(x_i-\mu_k\right) \\
& = \frac{1}{\sigma_k^2}\left(x_i-\mu_k\right)
\end{align}

So the derivative of $Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right)$ is
\begin{align} 
\frac{\partial}{\partial \mu_k}Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right)
&= \frac{\partial}{\partial \mu_k}\sum_i\sum_k r_{i,k}\log p_k(\mathbf{x}_i|\mathbf{\theta}) \\
&= \sum_i r_{i,k}\frac{1}{\sigma_k^2}\left(x_i-\mu_k\right) \\
&= \frac{1}{\sigma_k^2}\sum_i r_{i,k}(x_i-\mu_k) \stackrel{!}{=} 0 
\end{align}

$$\Leftrightarrow \sum_i r_{i,k}x_i = \mu_k\sum_i r_{i,k}$$

and so we get for $\mu_k$:
$$\Leftrightarrow \mu_k = \frac{\sum_i r_{i,k}x_i}{\sum_ir_{i,k}}$$

### Calculate the maximizer for the $\sigma_k^2$:


\begin{align}
& \frac{\partial}{\partial \sigma_k^2}Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) \\
& =\frac{\partial}{\partial \sigma_k^2}\left( \sum_i\sum_k r_{i,k}\log\pi_k + \sum_i\sum_k r_{i,k}\log p_k\left(x_i\middle|\mathbf{\theta}\right)\right)
\end{align}

and as the first term does not depend on $\sigma_k^2$:

$$=\frac{\partial}{\partial \sigma_k^2}\left(\sum_i\sum_k r_{i,k}\log p_k\left(x_i\middle|\mathbf{\theta}\right)\right)$$

We have to compute the derivative for
$$\begin{align}
\log p_k(x_i|\mathbf{\theta}_k) 
& = \log \left(\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)\right) \\
& = \log \frac{1}{\sqrt{2\pi\sigma_k^2}} +
\log \exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)
\end{align}
$$

This amounts to
$$\begin{align}
\frac{\partial}{\partial \sigma_k^2}\log \frac{1}{\sqrt{2\pi\sigma_k^2}}
&= \frac{\partial}{\partial \sigma_k^2}\left(-\frac{1}{2}\log\left(2\pi\sigma_k^2\right)\right) = -\frac{1}{2\sigma_k^2} \\
\frac{\partial}{\partial \sigma_k^2}\log \exp\left({-\frac{\left(x_i-\mu_k\right)^2}{2\sigma_k^2}}\right)
&= \frac{1}{2\sigma_k^4}\left(x_i-\mu_k\right)^2
\end{align}
$$

So in the end we get 
\begin{align}
& \frac{\partial}{\partial \sigma_k^2}Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) \\
& = \frac{1}{2}\sum_i r_{i,k}\left(\frac{1}{\sigma_k^4}(x_i-\mu_k)^2 -\frac{1}{\sigma_k^2}\right) \stackrel{!}{=} 0
\end{align}

And hence we can conclude:
$$\Leftrightarrow \sum_i r_{i,k}(x_i-\mu_k)^2 = \sigma_k^2\sum_ir_{i,k}$$

$$\Leftrightarrow \sigma_k^2 = \frac{\sum_i r_{i,k}(x_i-\mu_k)^2}{\sum_ir_{i,k}}$$

### Calculate the maximizer for the $\pi_k$ (You need to ensure $\sum_k\pi_k =1$. You can either use a Lagrange multiplier for this or use the formula to express one of the $\pi_k$ in terms of all the others):


$$\begin{align} 
0 & =
\frac{\partial}{\partial \pi_k}\left(Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) - \lambda \left(\sum_k \pi_k - 1\right)\right)\\
&= \sum_i \frac{r_{i,k}}{\pi_k} - \lambda
\end{align}$$

$$\Leftrightarrow \sum_i \frac{r_{i,k}}{\lambda} = \pi_k $$

\begin{align}
\frac{\partial}{\partial \lambda}\left(Q\left(\mathbf{\theta},\mathbf{\theta}^{t-1}\right) - \lambda \left(\sum_k \pi_k - 1\right)\right) &= -\left(\sum_k \pi_k - 1\right) \stackrel{!}{=} 0 \Leftrightarrow \sum_k \pi_k = 1 \\
\Rightarrow \frac{1}{\lambda}\sum_k\sum_i r_{i,k} &= 1 \\
\Rightarrow \lambda &= \sum_i\sum_k r_{i,k} = N \quad \text{(since } \textstyle\sum_k r_{i,k} = 1 \text{ for every } i\text{)} \\
\Rightarrow \pi_k &= \frac{1}{N}\sum_i r_{i,k}
\end{align}
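The three closed-form maximizers can be checked numerically: for fixed $r_{i,k}$, no other parameter setting should give a larger value of $Q$. The sketch below compares the closed-form solution against randomly perturbed parameters. The random responsibilities, the toy data and the helper names are assumptions made for this check only.

```python
import numpy as np

def q_function(x, r, pis, mus, sigmas):
    """Q = sum_i sum_k r_ik * (log pi_k + log N(x_i | mu_k, sigma_k^2))."""
    log_pdf = (-0.5 * np.log(2 * np.pi * sigmas ** 2)
               - (x[:, None] - mus) ** 2 / (2 * sigmas ** 2))
    return np.sum(r * (np.log(pis) + log_pdf))

rng = np.random.default_rng(0)
N, K = 50, 3
x = rng.normal(0.0, 2.0, size=N)
r = rng.dirichlet(np.ones(K), size=N)   # arbitrary responsibilities, rows sum to 1

# Closed-form maximizers derived above.
nk = r.sum(axis=0)
mus_star = (r.T @ x) / nk
sigmas_star = np.sqrt((r * (x[:, None] - mus_star) ** 2).sum(axis=0) / nk)
pis_star = nk / N
q_star = q_function(x, r, pis_star, mus_star, sigmas_star)

# Compare against randomly perturbed parameters.
q_perturbed = []
for _ in range(200):
    mus = mus_star + rng.normal(0.0, 0.3, size=K)
    sigmas = sigmas_star * np.exp(rng.normal(0.0, 0.3, size=K))   # stays positive
    pis = pis_star * np.exp(rng.normal(0.0, 0.3, size=K))
    pis = pis / pis.sum()                                         # stays on the simplex
    q_perturbed.append(q_function(x, r, pis, mus, sigmas))

print(q_star, max(q_perturbed))
```

Such a check does not replace the derivation, but it catches sign errors and forgotten factors quickly.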

By popular request: If we don't want to use the Lagrangian the notation becomes a bit more cumbersome but the overall strategy remains the same.

$$\sum_{k=1}^K\pi_k = 1 \Leftrightarrow \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k$$

\begin{align}
\frac{\partial}{\partial \pi_k}Q &= \frac{\partial}{\partial \pi_k}\sum_{i=1}^N \left(\sum_{j=1}^{K-1}r_{ij}\log\pi_j + r_{iK}\log\left(1-\sum_{j=1}^{K-1}\pi_j\right)\right) \\
&= \sum_{i=1}^N\left(\frac{r_{ik}}{\pi_k} - \frac{r_{iK}}{1-\sum_{j=1}^{K-1}\pi_j}\right) \stackrel{!}{=} 0 \\
\Leftrightarrow \pi_k &= \left(\sum_{i=1}^{N}r_{ik}\right)\frac{\left(1 - \sum_{j=1}^{K-1}\pi_j\right)}{\sum_{i=1}^Nr_{iK}}
\end{align}

If we now sum over $k$ and use $\sum_{k=1}^K\pi_k = 1$ and $\sum_{k=1}^K r_{ik} = 1$ we get
\begin{align}
\frac{\left(1 - \sum_{j=1}^{K-1}\pi_j\right)}{\sum_{i=1}^Nr_{iK}} \sum_{k=1}^K\sum_{i=1}^Nr_{ik} &= 1 \\
\Leftrightarrow \frac{\left(1 - \sum_{j=1}^{K-1}\pi_j\right)}{\sum_{i=1}^Nr_{iK}} &= \frac{1}{N}
\end{align}