# Machine Learning (Summer 2024)

## Practice Session 3: Clustering

April 30th, 2024

Ulf Krumnack & Lukas Niehaus

Institute of Cognitive Science,
University of Osnabrück

## Today's Session

* New sheet 03 
* Discussion sheet 02
    - Coin Flip
* Discussion sheet 01

# EM-Algorithm

This exercise aims to develop a more intuitive understanding of the Expectation-Maximization-algorithm (EM-algorithm). The exercise is based on [this paper](https://www.nature.com/articles/nbt1406).

Explanation:
Assume we have two coins, A and B, with different probabilities of showing heads (the coins are biased!). To create the dataset, someone picked one of the coins, flipped it ten times, and recorded for each toss whether it showed heads or tails. This was repeated five times, so we get five series of 10 coin tosses each. The results are stored in an array like the one below:

In [None]:
import numpy as np

# coin tosses in list: 1 represents Heads
data = np.array([
    [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
])

What we don't know is which series came from which coin, nor do we know each coin's probability of showing heads. The EM-Algorithm gives us a procedure to estimate these probabilities. Your task is to implement the algorithm in order to approximate the true probabilities of coins A and B. (To begin with, you may compute the probabilities by hand.)
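
As a starting point for the by-hand computation, the heads and tails per series can be counted directly. This is only a minimal sketch: the array is repeated so the block runs on its own, and the names `heads_per_series` and `tails_per_series` are chosen for illustration.

```python
import numpy as np

# Same tosses as above (1 = heads, 0 = tails), repeated so the block is self-contained.
data = np.array([
    [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
])

# Number of heads and tails in each series of 10 tosses.
heads_per_series = data.sum(axis=1)
tails_per_series = data.shape[1] - heads_per_series

print(heads_per_series)  # [5 9 8 4 7]
print(tails_per_series)  # [5 1 2 6 3]
```

If the coin behind each series were known, each probability would simply be the fraction of heads among that coin's tosses. EM is needed because these labels are hidden.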

### 2. Make an initial guess for $\theta_{A}$ and $\theta_{B}$

In order to start the algorithm you need to make an initial guess for $\theta_{A}$ and $\theta_{B}$. Later you can experiment with these variables in order to see how the algorithm behaves. $\theta_{A}$ and $\theta_{B}$ have to be different! If they are equal, both coins explain every series equally well, and the updates can never tell them apart.

In [None]:
### Your guess:
theta0 = np.array((.6, .5))
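
To see why the two guesses must differ, the following sketch runs one update with a symmetric guess. It anticipates the E-step and M-step of the following sections in compact form. The name `theta_equal` and the value 0.5 are assumptions chosen for illustration, and both coins are assumed equally likely to be picked.

```python
import numpy as np

data = np.array([
    [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
])
heads = data.sum(axis=1)[:, np.newaxis]   # shape (5, 1)
tails = data.shape[1] - heads             # shape (5, 1)

theta_equal = np.array([0.5, 0.5])        # hypothetical symmetric guess

# Likelihood of each series under each coin, shape (5, 2)
likelihood = theta_equal**heads * (1 - theta_equal)**tails
# Normalize per series: with equal thetas every weight is exactly 0.5
weights = likelihood / likelihood.sum(axis=1, keepdims=True)

# One update: weighted heads divided by weighted tosses, per coin
new_theta = (weights * heads).sum(axis=0) / (weights * (heads + tails)).sum(axis=0)
print(new_theta)  # both entries are 33/50 = 0.66
```

Both coins receive the same estimate (the overall fraction of heads), so they stay identical in every later iteration.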

### 3. Implement the Expectation-Step
Now that you have guessed initial probabilities, you can compute, for each series (10 tosses), how likely it is to have come from coin A or from coin B. More precisely, compute the likelihood of each series under your guessed $\theta_{A}$ and under your guessed $\theta_{B}$, then normalize the two values so that they sum to 1. With these probabilities you can complete the expectation step: for each series, compute the number of heads and tails you would expect from coin A and coin B respectively. Store these expected heads and tails per coin, because you will need them in the M-step.
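
For a single series, the computation described above can be sketched as follows. This is a minimal illustration, not the reference solution. It assumes both coins are equally likely to be picked a priori, and the names `series`, `expected_heads` and `expected_tails` are chosen for this example only.

```python
import numpy as np

series = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 1])  # first series from `data`
theta = np.array([0.6, 0.5])                        # guess for (theta_A, theta_B)

heads = series.sum()
tails = series.size - heads

# Likelihood of this exact sequence under coin A and under coin B
likelihood = theta**heads * (1 - theta)**tails
# Normalize so the two values sum to 1 (assumes equal prior for both coins)
weights = likelihood / likelihood.sum()

# Split the observed counts between the coins according to the weights
expected_heads = weights * heads
expected_tails = weights * tails

print(np.round(weights, 2))         # [0.45 0.55]
print(np.round(expected_heads, 2))  # [2.25 2.75]
```

The series has five heads and five tails, so coin B (guessed to be fair) explains it slightly better than coin A and receives the larger share of the counts.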

In [None]:
import numpy as np

def e_step(data, theta):
    """Apply the expectation step to create new weighted
    training examples from a given example. The weights are
    obtained as the conditional probability that `data` was
    generated with coin A or coin B respectively, assuming
    current model parameters `theta_A` and `theta_B`.
    The two new weighted training examples are then created
    by splitting the given data according to these weights.
    
    Return:
        For a single series, a 2x2 matrix listing in each row
        the data for one coin, that is the number of heads and
        tails attributed to that coin. For a 2D array of n
        series, an array of shape (n, 2, 2) holding one such
        matrix per series.
    """
    # get number of heads and tails in the given data
    heads = data.sum(axis=-1)
    tails = data.shape[-1] - heads
    if data.ndim == 2:
        heads, tails = heads[:,np.newaxis], tails[:,np.newaxis]
    # compute likelihood of observing the data under each coin
    likelihood = (theta**heads) * ((1-theta)**tails)
    # normalize to get weights (conditional probabilities);
    # keepdims=True makes this work for 1D and 2D input alike
    weights = likelihood/likelihood.sum(axis=-1, keepdims=True)
    print(weights)
    # distribute heads and tails according to weights
    return np.stack((heads*weights, tails*weights), axis=-1)

Run the E-step on the given data to obtain a split version of the data (compare with the table in Figure 1(b) of the paper):

In [None]:
expected_data = e_step(data, theta=theta0)
print(expected_data)

### 4. Implement the M-Step

Now that you have the expected heads and tails from each coin for each series, you can update your initial guess of $\theta_{A}$ and $\theta_{B}$. The goal is to find the distribution parameters that best model the expected data. Compute a new $\theta_{A}$ and $\theta_{B}$ so that you can start a new iteration.
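
The update itself is a ratio: for each coin, the expected heads attributed to it divided by the expected tosses attributed to it. The sketch below uses made-up counts (the array `toy_counts` is hypothetical and not derived from `data`) to show the arithmetic in isolation.

```python
import numpy as np

# Hypothetical expected counts, already summed over all series.
# Rows: coin A, coin B. Columns: heads, tails.
toy_counts = np.array([
    [6.0, 2.0],
    [3.0, 3.0],
])

# New estimate per coin: expected heads / (expected heads + expected tails)
new_theta = toy_counts[:, 0] / toy_counts.sum(axis=1)
print(new_theta)  # theta_A = 0.75, theta_B = 0.5
```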

In [None]:
def m_step(expected_data):
    """Maximization step: choose new parameters to maximize
    the likelihood of observing the aggregated data for both coins.
    
    Returns:
        The new model parameters `theta_A` and `theta_B` derived
        from the given split dataset.
    """
    aggregated = expected_data.sum(axis=0)
    return aggregated[:,0] / aggregated.sum(axis=1)

### 5. Putting it all together

You have implemented the E-Step and the M-Step. You may already have computed a new $\theta_{A}$ and $\theta_{B}$ which are closer to the true probabilities of coins A and B. The EM-Algorithm repeats these two steps until $\theta_{A}$ and $\theta_{B}$ converge. Turn what you have implemented so far into a function which takes the coin toss data and your guesses for $\theta_{A}$ and $\theta_{B}$.

In [None]:
# Solution
def EM_algorithm(data, theta):

    # stop after at most 50 runs
    for run in range(50):
        # use E-step from above
        expected_data = e_step(data, theta)
        # use M-step from above
        new_theta = m_step(expected_data)

        print(f"New Theta: A:{new_theta[0]:.2f}, B:{new_theta[1]:.2f}")

        # terminate if theta_A and theta_B converge
        if np.allclose(new_theta, theta):
            print(f"Algorithm terminated after {run + 1} runs.")
            return new_theta

        theta = new_theta
    return theta

Now run the EM algorithm on the given example data:

In [None]:
theta = EM_algorithm(data, theta=theta0)
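
To experiment with different initial guesses, a compact variant without printing can be handy. The helper `run_em` below is a hypothetical sketch that merges both steps into one loop. It assumes both coins are equally likely to be picked, and it is not meant to replace the functions above.

```python
import numpy as np

data = np.array([
    [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
])

def run_em(data, theta, max_iter=50):
    """Hypothetical helper: EM for two coins, without any printing."""
    heads = data.sum(axis=1)[:, np.newaxis]
    tails = data.shape[1] - heads
    theta = np.asarray(theta, dtype=float)
    for _ in range(max_iter):
        # E-step: weight of each coin for each series
        likelihood = theta**heads * (1 - theta)**tails
        weights = likelihood / likelihood.sum(axis=1, keepdims=True)
        # M-step: weighted heads divided by weighted tosses, per coin
        new_theta = (weights * heads).sum(axis=0) / (weights * (heads + tails)).sum(axis=0)
        if np.allclose(new_theta, theta):
            return new_theta
        theta = new_theta
    return theta

# Try a few starting points and compare where the algorithm ends up
for guess in [(0.6, 0.5), (0.5, 0.6), (0.9, 0.1)]:
    print(guess, "->", np.round(run_em(data, guess), 2))
```

Starting from (0.6, 0.5), the estimates should end up near 0.80 and 0.52. Swapping the initial guesses swaps the result, because the labels "A" and "B" are arbitrary: the algorithm can only separate the two coins, not name them.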