Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 03: Basics of Data Mining

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 1, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's studip folder.

This sheet contains many implementation tasks and fewer theory questions, but don't worry: to implement most of the code, you have to understand the theory.

This week's assignments make use of two packages: `numpy` and `matplotlib`. We already expected you to install those as part of sheet 1. If you did not do so, go back to those instructions or just run the following command in the `terminal`/`cmd.exe` to do so (Mac/Linux: Might require `sudo`; Windows: Use `pip` instead of `pip3`). This will also upgrade your current installation.

    pip3 install --upgrade jupyter numpy matplotlib

One note about `matplotlib`: If you run code which contains a plot like the cell below, it can sometimes take a while to execute the code and show the results. During that time the invocation count is shown as an asterisk (\*) like this:

    In [*]:

Just be patient for a few seconds. The following cell tests if `numpy` and `matplotlib` are installed and work:

In [None]:
%matplotlib notebook
import importlib.util  # the submodule has to be imported explicitly
import numpy as np
import matplotlib.pyplot as plt

assert importlib.util.find_spec('numpy') is not None, 'numpy not found'
assert importlib.util.find_spec('matplotlib') is not None, 'matplotlib not found'

figure_intro = plt.figure('Example plot')
plt.plot(np.random.randn(1000, 1))
figure_intro.canvas.draw()

## Assignment 1: Rosner test [5 Points]

The Rosner test is an iterative procedure to remove outliers of a data set via a z-test. In this exercise you will implement it and apply it to a sample data set.

### a) Outliers

First of all, think about why we use procedures like this and answer the following questions: 

What are causes for outliers? And what are our options to deal with them? 

There are different types of outliers, and they can have different causes. They can arise from measurement or technical errors when collecting data. This may be connected to a sharp cut-off in the range of measurements, which can lead to a high concentration of values at the artificial boundaries of an experiment. However, they may also reveal a true underlying effect in our data that we didn't expect or account for. This might be the case when we treat the measurements as a single distribution although there are actually two underlying distributions. Lastly, our distribution might naturally have a high variance, which makes outliers or extreme values a natural part of the distribution.

First, we need to detect probable outliers. To decide which data points to declare outliers, we have to find a model for regular, meaning "not outlying", data points. Most of the time we assume that a normal distribution underlies the data (or a mixture distribution in which each cluster is normally distributed).

One option is to calculate the z-value for each data point (a measure of the distance from the mean in units of the standard deviation). Data points with a high z-value are regarded as outliers; a common threshold is a z-value greater than 3. This can be improved by using the median instead of the mean and tweaking the threshold. The Rosner test goes one step further by iteratively calculating z-values and removing the outliers found until no more are detected. This can be done one outlier at a time or, for more efficiency, k outliers at a time.
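
To make the z-value idea concrete, here is a minimal sketch on a hypothetical ten-point sample (the numbers are invented for illustration). The median-based variant uses the median absolute deviation (MAD) as the scale estimate; that choice and the factor 1.4826 are assumptions, since the sheet does not say which robust scale to use. Note that with `np.std`'s default (population) standard deviation a single point among ten cannot reach a z-value above 3, because the extreme value inflates both the mean and the standard deviation.

```python
import numpy as np

# Hypothetical sample: nine ordinary readings and one extreme value.
sample = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0])

# Classic z-value: distance from the mean in units of the standard deviation.
z = np.abs(sample - np.mean(sample)) / np.std(sample)

# Robust variant (an assumption, not prescribed by the sheet): median and
# median absolute deviation (MAD). The factor 1.4826 rescales the MAD so
# that it estimates the standard deviation for normally distributed data.
median = np.median(sample)
mad = np.median(np.abs(sample - median))
robust_z = np.abs(sample - median) / (1.4826 * mad)

print(np.round(z, 2))         # the extreme value stays just below 3
print(np.round(robust_z, 2))  # the extreme value stands out clearly
```

This masking effect is one motivation for iterative or robust procedures instead of a single pass with a fixed threshold.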

A different approach would be to not remove the outliers completely, but to weight them according to their z-values. Lastly, an alternative to complete removal is to fill the resulting gaps with values that fit the distribution better.

### b) Rosner test

In the following you find a stub for the implementation. The dataset is already generated. Now it is your turn to write the Rosner test and detect the outliers in the data.

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

# generate dataset
data = np.random.normal(50, 20, 100)
xtr_points = np.random.normal(-50, 10, 6)

data = np.concatenate((xtr_points, data))
outliers = []

# just to check if everything is pretty
fig_rosner_data = plt.figure('The Dataset')
plt.plot(data, 'x')
fig_rosner_data.canvas.draw()

# TODO: now find the outliers!
# Add them to 'outliers' and remove them from 'data'.
z = float('inf')

while z > 3:
    stdev = np.std(data)
    m = np.mean(data)
    zs = [abs(value - m) / stdev for value in data]

    z = max(zs)
    z_index = zs.index(z)
    
    # check if we have to remove the value
    if z > 3: 
        outliers.append([z_index, data[z_index]])
        data = np.delete(data, z_index)

# plot results        
fig_rosner = plt.figure('Rosner Result')
plt.plot(data,'bx', label='cleared data')
plt.scatter([x[0] for x in outliers], [y[1] for y in outliers], c='red', marker='x', label='outliers')
plt.legend(loc='lower right');
fig_rosner.canvas.draw()

## Assignment 2: p-norm [5 Points]

A very well-known norm is the Euclidean norm, which induces the Euclidean distance. However, it is not the only norm: it is just the p-norm with $p = 2$. In this assignment you will take a look at other p-norms and see how they behave.

Implement a function `pdist` which expects a vector $x \in \mathcal{R}^n$ and a scalar $p \geq 1, p \in \mathcal{R}$ and returns the p-norm of $x$ which is defined as:

$$||x||_p = \left(\sum\limits_{i=1}^n |x_i|^p \right)^{\frac{1}{p}}$$
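
As a small worked example of the definition (the vector $x = (3, -4)$ is chosen here purely for illustration):

$$||x||_1 = |3| + |-4| = 7, \qquad ||x||_2 = \sqrt{3^2 + 4^2} = 5, \qquad ||x||_4 = \left(3^4 + 4^4\right)^{\frac{1}{4}} = 337^{\frac{1}{4}} \approx 4.28$$

As $p$ grows, the value approaches the largest absolute component, here $4$.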

*Note:* Even though the norm is only defined for $p \geq 1$, values $0 < p < 1$ are still interesting. In that case we cannot talk about a norm anymore, as the triangle inequality ($||a|| + ||b|| \geq ||a + b||$) does not hold. We will still take a look at some of these values, so your function should handle them as well.
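
The failure of the triangle inequality for $0 < p < 1$ can be checked numerically. The following is a minimal sketch; `p_value` is a hypothetical helper name that simply evaluates the formula above, and the two unit vectors are chosen for illustration.

```python
import numpy as np

def p_value(x, p):
    """Hypothetical helper: evaluates the p-norm formula for any p > 0."""
    return np.sum(np.abs(np.asarray(x, dtype=float)) ** p) ** (1 / p)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

for p in (0.5, 1, 2):
    lhs = p_value(a, p) + p_value(b, p)  # ||a|| + ||b||
    rhs = p_value(a + b, p)              # ||a + b||
    print(p, lhs, rhs, lhs >= rhs)       # the inequality fails only for p = 0.5
```

For $p = 0.5$ the left-hand side is $2$ while $||a + b||_{0.5} = (1 + 1)^2 = 4$, so the inequality is violated.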

In [None]:
import numpy as np
def pdist(x, p):
    """
    Calculates the p-norm of x.
    Also allows values between 0 and 1 for p.
    """
    if p <= 0:
        raise ValueError('p has to be > 0!')
    return np.sum(np.abs(np.array(x)) ** p) ** (1 / p)


# 1e-10 is 0.0000000001
# abs() makes sure that results which are too small fail as well.
assert abs(pdist(1, 2)      - 1)          < 1e-10 , "pdist is incorrect for x = 1, p = 2"
assert abs(pdist(2, 2)      - 2)          < 1e-10 , "pdist is incorrect for x = 2, p = 2"
assert abs(pdist([2, 1], 2) - np.sqrt(5)) < 1e-10 , "pdist is incorrect for x = [2, 1], p = 2"
assert abs(pdist(2, 0.5)    - 2)          < 1e-10 , "pdist is incorrect for x = 2, p = 0.5"

Implement another function `pdist2` which expects two vectors $x_0 \in \mathcal{R}^n, x_1 \in \mathcal{R}^n$ and a scalar $p \geq 1, p \in \mathcal{R}$ and returns the distance between $x_0$ and $x_1$ under the p-norm defined by $p$, i.e. $||x_0 - x_1||_p$. Again handle $0 < p < 1$ as well.

In [None]:
import numpy as np
def pdist2(x0, x1, p):
    """
    Calculates the distance between x0 and x1
    given the p-norm with p.
    Also allows values between 0 and 1 for p.
    """
    if p <= 0:
        raise ValueError('p has to be > 0!')
    return np.sum(np.abs(np.array(x0) - np.array(x1)) ** p) ** (1 / p)


# 1e-10 is 0.0000000001
# abs() makes sure that results which are too small fail as well.
assert abs(pdist2(1, 2, 2)           - 1)          < 1e-10 , "pdist2 is incorrect for x0 = 1, x1 = 2, p = 2"
assert abs(pdist2(2, 5, 2)           - 3)          < 1e-10 , "pdist2 is incorrect for x0 = 2, x1 = 5, p = 2"
assert abs(pdist2([2, 1], [1, 2], 2) - np.sqrt(2)) < 1e-10 , "pdist2 is incorrect for x0 = [2, 1], x1 = [1, 2], p = 2"
assert abs(pdist2([2, 1], [0, 0], 2) - np.sqrt(5)) < 1e-10 , "pdist2 is incorrect for x0 = [2, 1], x1 = [0, 0], p = 2"
assert abs(pdist2(2, 0, 0.5)         - 2)          < 1e-10 , "pdist2 is incorrect for x0 = 2, x1 = 0, p = 0.5"

Now we will compare some different p-norms. Below is some code that plots data as scatter plots.

Your task is to calculate the data to plot. The variable `data` is currently simply filled with zeros. Instead, fill it as follows:

- Use the function `np.linspace()` to create a vector of `50` evenly spaced values between `-100` and `100` (inclusive).
- Fill `data`: It should have 2500 rows. Each of the 2500 rows should contain `[x, y, d]`, where `x` is the x coordinate and `y` the y coordinate of a point, and `d` the p-norm of `(x, y)`. Use either `pdist` or `pdist2` to calculate `d`. 
- Normalize the data in `data[:,2]` (i.e. all d-values) so that they are between 0 and 1.
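
The three steps above can also be sketched without explicit Python loops. This is only one possible approach, shown on a small 3-value grid with `p = 2`; the variable names `xx`, `yy`, `points`, and `d` are illustrative and not part of the template.

```python
import numpy as np

p = 2                       # example exponent
ls = np.linspace(-1, 1, 3)  # small grid for illustration

# All (x, y) combinations, with x varying slowest.
xx, yy = np.meshgrid(ls, ls, indexing='ij')
points = np.column_stack((xx.ravel(), yy.ravel()))

# p-norm of every row at once, then scale the d-values into [0, 1].
# Dividing by the maximum is enough because the values are non-negative.
d = np.sum(np.abs(points) ** p, axis=1) ** (1 / p)
data = np.column_stack((points, d / d.max()))
print(data)
```

For the real task only `ls` and `p` change; the array then has 2500 rows.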

Run your code and take a look at your results. Darker colors mean that a value is closer to the center (0, 0) according to the p-norm used.

*Hint:* To give you an idea of what `data` should look like, here is an example for three evenly spaced values between `-1` and `1` and a p-norm with `p = 2`.

Before normalization of the d-column:

```python
data = np.array([[-1.0, -1.0, 1.41421356],
                 [-1.0,  0.0, 1.0       ],
                 [-1.0,  1.0, 1.41421356],
                 [ 0.0, -1.0, 1.0       ],
                 [ 0.0,  0.0, 0.0       ],
                 [ 0.0,  1.0, 1.0       ],
                 [ 1.0, -1.0, 1.41421356],
                 [ 1.0,  0.0, 1.0       ],
                 [ 1.0,  1.0, 1.41421356]])
```

After normalization of the d-column:

```python
data = np.array([[-1.0, -1.0, 1.0       ],
                 [-1.0,  0.0, 0.70710678],
                 [-1.0,  1.0, 1.0       ],
                 [ 0.0, -1.0, 0.70710678],
                 [ 0.0,  0.0, 0.0       ],
                 [ 0.0,  1.0, 0.70710678],
                 [ 1.0, -1.0, 1.0       ],
                 [ 1.0,  0.0, 0.70710678],
                 [ 1.0,  1.0, 1.0       ]])
```

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ColorConverter

color = ColorConverter()
figure_norms = plt.figure('p-norm comparison')

# TODO: create the linspace vector
ls = np.linspace(-100, 100, 50)
assert len(ls) == 50 , 'ls should be of length 50.'
assert (min(ls), max(ls)) == (-100, 100) , 'ls should range from -100 to 100, inclusively.'

for i, p in enumerate([1/8, 1/4, 1/2, 1, 1.5, 2, 4, 8, 128]):
    # TODO: Create a numpy array containing useful values instead of zeros.
    # data = np.zeros((2500, 3))
    data = np.array([[x, y, pdist((x, y), p)] for x in ls for y in ls])
    data[:,2] = data[:,2] / np.max(data[:,2])

    assert all(data[:,2] <= 1), 'The third column should be normalized.'

    # Plot the data.
    colors = [color.to_rgba((0.9, 0.4, 0, 0.7 * (1-a))) for a in data[:,2]]
    a = plt.subplot(3, 3, i + 1)
    plt.scatter(data[:,0], data[:,1], marker='.', color=colors)
    a.set_ylim([-100, 100])
    a.set_xlim([-100, 100])
    a.set_title('{:.3g}-norm'.format(p))
    a.set_aspect('equal')
    plt.tight_layout()
    figure_norms.canvas.draw()

## Assignment 3: Expectation Maximization [10 Points]

In this assignment you will implement the Expectation Maximization algorithm (EM) for 1D data sets.

As some parts of this exercise require more knowledge of Python than what was discussed in the practice sessions, we built a small number of templates for you to use. However, if you prefer, you are also allowed to just go ahead and implement everything yourself! **Don't forget [task b)](#b%29-EM-and-missing-values)**!

### a) Implement Expectation Maximization

Use the next cell to implement your own solution or, if you want some more guidance, skip the next cell and continue the exercise at [Step 1) Load the data](#Step-1%29-Load-the-data).

Here is an overview of what you have to do:

**1) Load the data:**

Load the provided data set. It is stored in `em_normdistdata.txt`. We call the set $X$ and each individual data point $x \in X$.

**2) Initialize EM:**

Initialize three normal distributions whose parameters will be changed iteratively by the EM to converge close to the original distributions.

Each normal distribution $j$ has three parameters: $\mu_j$ (the mean), $\sigma_j$ (the standard deviation), $\alpha_j$ (the probability of the normal distribution in the mixture, that means $\sum\limits_j\alpha_j=1$).

Initialize the three parameters using three random partitions $S_j$ of the data set. Calculate each $\mu_j$ and $\sigma_j$ and set $\alpha_j = \frac{|S_j|}{|X|}$.

**3) Implement the expectation step:**

Perform a soft classification of the data samples with the three normal distributions. That means: Calculate the probability that a data sample $x_i$ belongs to distribution $j$ given parameters $\mu_j$ and $\sigma_j$. Or in other words, what is the probability of $x_i$ being drawn from $N_j(\mu_j, \sigma_j)$? Once you have the probability, weight the result by $\alpha_j$.

As a last step normalize the results such that the probabilities of a data sample $x_i$ sum up to $1$.
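
Written as a formula, and assuming that $p_{ij}$ (the symbol used in the maximization step) denotes the normalized, weighted probability of sample $x_i$ under distribution $j$, the two parts of the expectation step combine to:

$$p_{ij} = \frac{\alpha_j \, N_j(x_i \mid \mu_j, \sigma_j)}{\sum\limits_{k} \alpha_k \, N_k(x_i \mid \mu_k, \sigma_k)}$$

The denominator is the same for all $j$ of a given sample, which is why $\sum\limits_j p_{ij} = 1$.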

**4) Implement the maximization step:**

In the maximization step each $\mu_j$, $\sigma_j$ and $\alpha_j$ is updated. First calculate the new means:

$$\mu_j = \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij}x_i$$

That means $\mu_j$ is the weighted mean of all samples, where the weight is their probability of belonging to distribution $j$.

Then calculate the new $\sigma_j$. Each new $\sigma_j$ is the standard deviation of the normal distribution with the new $\mu_j$, so for the calculation you already use the new $\mu_j$:

$$\sigma_j = \sqrt{ \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij} \left(x_i - \mu_j\right)^2 }$$

To calculate the new $\alpha_j$ for each distribution, just take the mean of $p_{ij}$ over all samples $i$ for each normal distribution $j$.
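
In the notation of the two formulas above, this update reads:

$$\alpha_j = \frac{1}{|X|} \sum\limits_{i=1}^{|X|} p_{ij}$$

Because the $p_{ij}$ of each sample sum up to $1$, the new $\alpha_j$ again sum up to $1$.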

**5) Perform the complete EM and plot your results:**

Build a loop around the iterative procedure of expectation and maximization which stops when the changes in all $\mu_j$ and $\sigma_j$ are sufficiently small.

Plot your results after each step and mark which data points belong to which normal distribution. If you don't get it to work, just plot your final solution of the distributions.

In [None]:
# Free space to implement your own solution -- either use this OR use the following step by step guide. 
# You may use scipy.stats.norm.pdf for your own implementation.





#### Step 1) Load the data


Load the provided data set. It is stored in `em_normdistdata.txt`. We call the set $X$ and each individual data point $x \in X$.

*Hint:* Find out how numpy can load text data.

In [None]:
import numpy as np
def load_data(file_name):
    """
    Loads the data stored in file_name into a numpy array.
    """
    return np.loadtxt(file_name)

assert load_data('em_normdistdata.txt').shape == (200,) , "The data was not properly loaded."

*Optional:* The data consists of 200 data points drawn from three normal distributions. To get a feeling for the data set you can plot the data with the following cell. Change the number of bins to get a rough idea of what the three distributions might look like.

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

data = load_data('em_normdistdata.txt')

fig_data_test = plt.figure('Data overview')
plt.hist(data, bins=5)
fig_data_test.canvas.draw()

#### Step 2) Initialize EM

Below is a class definition `NormPDF` which represents the probability density function (pdf) of the normal distribution with an additional parameter $\alpha$. The class is explained in the next cells.

In [None]:
import numpy as np
class NormPDF():
    """
    A representation of the probability density function of the normal distribution
    for the EM Algorithm.
    """
    def __init__(self, mu=0, sigma=1, alpha=1):
        """
        Initializes the normal distribution with mu, sigma and alpha.
        The defaults are 0, 1, and 1 respectively.
        """
        self.mu = mu
        self.sigma = sigma
        self.alpha = alpha


    def __call__(self, x):
        """
        Returns the evaluation of this normal distribution at x.
        Does not take alpha into account!
        """
        return np.exp(-(x - self.mu) ** 2 / (2 * self.sigma ** 2)) / (np.sqrt(np.pi * 2) * self.sigma)


    def __repr__(self):
        """
        A simple string representation of this instance.
        """
        return 'NormPDF({self.mu:.2f},{self.sigma:.2f},{self.alpha:.2f})'.format(self=self)

The class `NormPDF` defines several special methods: `__init__`, `__call__`, `__repr__`. Python calls these methods implicitly in certain situations, which lets instances be used in a convenient way. Note that all methods take `self` as their first parameter: this is just the Python way of passing the instance itself to the method so that it becomes possible to access its data. You can ignore it for now and just assume that the methods only need the parameters which follow.

`__init__`: This is the constructor. When a new instance of the class is created this method is used. It takes the parameters `mu`, `sigma`, and `alpha`. Note that if you leave out parameters, they will be set to some default values.
So you can create `NormPDF` instances like this:

In [None]:
a = NormPDF()             # No parameters: mu = 0, sigma = 1, alpha = 1
b = NormPDF(1)            # mu = 1, sigma = 1, alpha = 1
c = NormPDF(1, alpha=0.4) # skips sigma but sets alpha, thus: mu = 1, sigma = 1, alpha = 0.4
d = NormPDF(0, 0.5)       # mu = 0, sigma = 0.5, alpha = 1
e = NormPDF(0, 0.5, 0.9)  # mu = 0, sigma = 0.5, alpha = 0.9

`__call__`: This is a very cool feature of Python. By implementing this method one can make an instance *callable*. That basically means one can use it as if it was a function. The `NormPDF` instances can be called with an x value (or a numpy array of x values) to get the evaluation of the normal distribution at x.

In [None]:
normpdf = NormPDF()
print(normpdf(0))
print(normpdf(0.5))
print(normpdf(np.linspace(-2, 2, 10)))

`__repr__`: This method will be used in Python when one calls `repr(NormPDF())`. As long as `__str__` is not implemented (which you saw in last week's sheet) `str(NormPDF())` will also use this method. This comes in handy for printing:

In [None]:
normpdf1 = NormPDF()
normpdf2 = NormPDF(1, 0.5, 0.9)
print(normpdf1)
print([normpdf1, normpdf2])

It is also possible to change the values of an instance of the NormPDF:

In [None]:
normpdf1 = NormPDF()
print(normpdf1)
print(normpdf1(np.linspace(-2, 2, 10)))

normpdf1.mu = 1
normpdf1.sigma = 2
normpdf1.alpha = 0.9
print(normpdf1)
print(normpdf1(np.linspace(-2, 2, 10)))
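
As a sanity check of the density formula used in `__call__`, the following sketch verifies numerically that the curve encloses an area of roughly 1. It uses a standalone function `gauss_pdf` (a hypothetical name) so that it runs on its own; the parameters `mu=1.0` and `sigma=2.0` are arbitrary example values.

```python
import numpy as np

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Standalone copy of the normal density formula (hypothetical helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

# Riemann-sum approximation of the integral over [-10, 10].
area = np.sum(gauss_pdf(x, mu=1.0, sigma=2.0)) * dx
print(area)

# The peak height at x = mu is 1 / (sigma * sqrt(2 * pi)).
print(gauss_pdf(1.0, mu=1.0, sigma=2.0), 1 / (2.0 * np.sqrt(2 * np.pi)))
```

The area is slightly below 1 because the tails outside of the interval are cut off.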

Now that you know how the `NormPDF` class works, it is time for the implementation of the initialization function. Here is the task again:

Write a function `gaussians = initialize_EM(data, num_distributions)` to initialize the EM. Initialize three normal distributions whose parameters will be changed iteratively by the EM to converge close to the original distributions.

Each normal distribution $j$ has three parameters: $\mu_j$ (the mean), $\sigma_j$ (the standard deviation), $\alpha_j$ (the probability of the normal distribution in the mixture, that means $\sum\limits_j\alpha_j=1$).
Initialize the three parameters using three random partitions $S_j$ of the data set. Calculate each $\mu_j$ and $\sigma_j$ and set $\alpha_j = \frac{|S_j|}{|X|}$.

In [None]:
def initialize_EM(data, num_distributions):
    """
    Initializes the EM algorithm by calculating num_distributions NormPDFs
    from a random partitioning of data.
    """
    partition_mapping = np.random.randint(0, num_distributions, len(data))
    gaussians = [NormPDF() for i in range(num_distributions)]
    
    for index, gaussian in enumerate(gaussians):
        # all samples which were randomly assigned to this distribution
        partition = data[partition_mapping == index]
        gaussian.mu = np.mean(partition)
        gaussian.sigma = np.std(partition)
        gaussian.alpha = len(partition) / len(data)
    return gaussians


normpdfs_ = initialize_EM(np.linspace(-1, 1, 100), 2)
assert len(normpdfs_) == 2, "The number of initialized distributions is not correct."
assert abs(1 - sum([normpdf.alpha for normpdf in normpdfs_])) < 1e-10 , "Sum of all alphas is not 1.0!" # 1e-10 is 0.0000000001

#### Step 3) Implement the expectation step

Perform a soft classification of the data samples with the three normal distributions. That means: Calculate the probability that a data sample $x_i$ belongs to distribution $j$ given parameters $\mu_j$ and $\sigma_j$. Or in other words, what is the probability of $x_i$ being drawn from $N_j(\mu_j, \sigma_j)$? Once you have the probability, weight the result by $\alpha_j$.

As a last step normalize the results such that the probabilities of a data sample $x_i$ sum up to $1$.

*Hint:* Store the data in a different array before you normalize it to not run into problems with partly normalized data.
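
The idea behind the hint can be sketched with numpy broadcasting: first compute all weighted densities into one array, then normalize into a second one. The mixture parameters and samples below are hypothetical values for illustration, and `gauss_pdf` is a standalone helper, not part of the template.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Normal density, evaluated element-wise (hypothetical helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical mixture parameters and samples, chosen only for illustration.
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([1.0, 0.5, 1.5])
alphas = np.array([0.3, 0.3, 0.4])
samples = np.array([-2.5, -0.1, 0.4, 2.8])

# Shape (4, 3): row i holds alpha_j * N_j(x_i) for every distribution j.
weighted = alphas * gauss_pdf(samples[:, np.newaxis], mus, sigmas)

# Divide every row by its own sum. keepdims keeps the shape (4, 1), so
# broadcasting normalizes row-wise without touching `weighted` itself.
responsibilities = weighted / weighted.sum(axis=1, keepdims=True)
print(responsibilities.round(3))
print(responsibilities.argmax(axis=1))  # most likely distribution per sample
```

Keeping `weighted` and `responsibilities` separate avoids dividing by sums that already contain normalized entries.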

In [None]:
def expectation_step(gaussians, data):
    """
    Performs the expectation step of the EM.
    Returns an array of shape (len(data), len(gaussians))
    which contains, for each sample, the normalized
    probabilities of belonging to each of the normal
    distributions.
    """
    # Calculates the probabilities of the samples per 
    # distribution and weights the results by alpha.
    tmp = np.empty((len(data), len(gaussians)))
    for j, N in enumerate(gaussians):
        tmp[:,j] = N.alpha * N(data)

    # Normalize the results.
    expectation = np.zeros_like(tmp)
    for j, N in enumerate(gaussians):
        expectation[:,j] = tmp[:,j] / np.sum(tmp[:,:], 1)
    return expectation

assert expectation_step([NormPDF(), NormPDF()], np.linspace(-2, 2, 100)).shape == (100, 2) , "Shape is not correct!"

#### Step 4) Implement the maximization step

In the maximization step each $\mu_j$, $\sigma_j$ and $\alpha_j$ is updated. First calculate the new means:

$$\mu_j = \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij}x_i$$

That means $\mu_j$ is the weighted mean of all samples, where the weight is their probability of belonging to distribution $j$.

Then calculate the new $\sigma_j$. Each new $\sigma_j$ is the standard deviation of the normal distribution with the new $\mu_j$, so for the calculation you already use the new $\mu_j$:

$$\sigma_j = \sqrt{ \frac{1}{\sum\limits_{i=1}^{|X|} p_{ij}} \sum\limits_{i=1}^{|X|} p_{ij} \left(x_i - \mu_j\right)^2 }$$

To calculate the new $\alpha_j$ for each distribution, just take the mean of $p_{ij}$ over all samples $i$ for each normal distribution $j$.

**Caution:** For the next step it is necessary to know how much all $\mu$ and $\sigma$ changed. For that the function `maximization_step` should return a numpy array of those (absolute) changes. For example if $\mu_0$ changed from 0.1 to 0.15, $\sigma_0$ from 1 to 0.9, $\mu_1$ from 0.5 to 0.6, and $\sigma_1$ stayed the same, we expect the function to return `np.array([0.05, 0.1, 0.1, 0])` (however, the order is not important).
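
The example from the caution can be reproduced in a few lines. This is only a sketch: the ordering `[mu_0, sigma_0, mu_1, sigma_1]` and the threshold `eps` are assumptions made for illustration.

```python
import numpy as np

# Parameters before and after one maximization step, ordered as
# [mu_0, sigma_0, mu_1, sigma_1] (values taken from the example above).
old = np.array([0.10, 1.0, 0.5, 1.0])
new = np.array([0.15, 0.9, 0.6, 1.0])

changes = np.abs(new - old)
print(changes)  # approximately [0.05, 0.1, 0.1, 0], up to rounding

# Hypothetical threshold: keep iterating while any parameter moved by
# more than eps.
eps = 0.05
print(np.max(changes) > eps)
```

Because of floating point rounding the entries are only approximately equal to the stated values, so comparisons should use a tolerance (e.g. `np.allclose`).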

In [None]:
def maximization_step(gaussians, data, expectation):
    """
    Performs the maximization step of the EM.
    Modifies the gaussians by updating their mus, sigmas and alphas.
    Returns a numpy array of absolute changes in any mu or sigma, 
    that means the returned array has twice as many elements as
    the supplied list of gaussians.
    """
    changes = []
    for j, N in enumerate(gaussians):
        # calculate new parameters
        mu = np.sum(expectation[:,j] * data) / np.sum(expectation[:,j])
        sigma = np.sqrt(np.sum(expectation[:,j] * (data - mu) ** 2) / np.sum(expectation[:,j]))
        alpha = np.mean(expectation[:,j])
        
        # append relevant changes
        changes += [np.abs(N.mu - mu)]
        changes += [np.abs(N.sigma - sigma)]
        
        # update gaussian
        N.mu = mu
        N.sigma = sigma
        N.alpha = alpha
    return np.array(changes)

#### Step 5) Perform the complete EM and plot your results

Build a loop around the iterative procedure of expectation and maximization which stops when the changes in all $\mu_j$ and $\sigma_j$ are sufficiently small.

Plot your results after each step and mark which data points belong to which normal distribution. If you don't get it to work, just plot your final solution.

*Hint:* Remember to load the data and initialize the EM before the loop.

*Hint:* A function `plot_intermediate_result` to plot your result after each step is already defined in the next cell. Take a look at what arguments it takes and try to use it in your loop.

*Hint:* To plot your final result the first three images and corresponding code examples on the tutorial of [`plt.plot(...)`](http://matplotlib.org/users/pyplot_tutorial.html) should help you.

*Optional:* Run the code multiple times. If your results are changing, use `np.random.seed(2)` in the beginning of the cell to get consistent results (any other integer will work as well, but 2 has some good results for the example solutions).
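
The effect of seeding can be seen in isolation with the following sketch. It assumes that the randomness in your solution comes from numpy's global random number generator, for example from a random partitioning of the data during initialization.

```python
import numpy as np

np.random.seed(2)
first = np.random.randint(0, 3, 10)

np.random.seed(2)  # same seed, so the same sequence is generated again
second = np.random.randint(0, 3, 10)

print(first)
print(second)
print(np.array_equal(first, second))
```

Both arrays are identical, so a seeded run of the EM starts from the same random partitions every time.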

In [None]:
%matplotlib notebook
import time
import itertools

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2)

colors = itertools.cycle(['r', 'g', 'b', 'c', 'm', 'y', 'k'])
figure, axis = plt.subplots(1)
axis.set_xlim(-5, 5)
axis.set_ylim(-0.2, 4)
axis.set_title('Intermediate Results')
plt.figure('Final Result')
def plot_intermediate_result(gaussians, data, mapping):
    """
    Gets a list of gaussians and data input. The mapping
    parameter is a list of indices of gaussians. Each value
    corresponds to the data value at the same position and 
    maps this data value to the proper gaussian.
    """
    x = np.linspace(-5, 5, 100)
    if len(axis.lines):
        for j, N in enumerate(gaussians):
            axis.lines[j * 2].set_xdata(x)
            axis.lines[j * 2].set_ydata(N(x))
            axis.lines[j * 2 + 1].set_xdata(data[mapping == j])
            axis.lines[j * 2 + 1].set_ydata([0] * len(data[mapping == j]))
    else:
        for j, N in enumerate(gaussians):
            axis.plot(x, N(x), data[mapping == j], [0] * len(data[mapping == j]), 'x', color=next(colors), markersize=5)
    figure.canvas.draw()
    time.sleep(0.5)

    
# TODO: Perform the initialization.
data = load_data('em_normdistdata.txt')
gaussians = initialize_EM(data, 3)

# TODO: Loop until the changes are small enough.
eps = 0.05
changes = [float('inf')] * 2
while max(changes) > eps:
    # TODO: Iteratively apply the expectation step, followed by the maximization step.
    expectation = expectation_step(gaussians, data)
    changes = maximization_step(gaussians, data, expectation)
    
    # (Optional:) TODO: Calculate the parameters to update the plot and call the function to do it.
    plot_intermediate_result(gaussians, data, np.argmax(expectation[:], 1))
    
# TODO: Plot your final result and print the final parameters.
x = np.linspace(-5, 5, 1000)
plt.plot(x, gaussians[0](x), 'r', x, gaussians[1](x), 'g', x, gaussians[2](x), 'b')
print(gaussians)

### b) EM and missing values

Describe in your own words: How does the EM-algorithm deal with the missing value problem?

In the EM algorithm all known values are considered via their probability under the assumed distribution. In the same way, hidden (i.e. missing) values are treated as depending on the probability distribution and additionally on the known values. The complete distribution can therefore be seen as the product of two probability distributions (known and missing values).

The algorithm searches for the parameters that maximize the log-likelihood. As the log-likelihood depends on the missing values, those are averaged out. In an iterative procedure, the parameter estimate is improved (M-step), followed by averaging over the missing values using the newly obtained parameters (E-step). This leads the parameter estimate to converge to a local maximum, which hopefully is close to the real parameter value. The principle in handling missing values here is not to try to recover them, but to invent values from a model obtained through the probability distribution. In the best case this does not lead to information loss, although it generally does. However, it at least makes the existing values technically usable.
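
In common textbook notation the two steps can be summarized as follows. The symbols are assumptions and are not taken from this sheet: $Z$ stands for the missing values, $\theta$ for the parameters, and $\theta^{(t)}$ for the estimate after iteration $t$.

$$Q(\theta \mid \theta^{(t)}) = \mathrm{E}_{Z \mid X, \theta^{(t)}}\left[\log p(X, Z \mid \theta)\right], \qquad \theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(t)})$$

The expectation on the left is the "averaging over the missing values" (E-step), and the maximization on the right is the parameter update (M-step).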