Osnabrück University - Machine Learning (Summer Term 2018) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 12

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, July 1, 2018**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Temporal probability models [4 Points]

### a) Hidden Markov Model

Explain the structure of a Hidden Markov Model. What probabilities have to be provided for such a model and how can they be specified?

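A Hidden Markov Model is a Markov process whose state is described by a single discrete random variable $X_t$. The state is hidden: it cannot be observed directly, only through an output (evidence) variable $E_t$ that depends on the current state. Three sets of probabilities have to be provided: the initial state distribution $P(X_0)$, the state transition probabilities $P(X_{t+1}|X_t)$ and the output (emission) probabilities $P(E_t|X_t)$. For finitely many states and outputs, they can be specified as a vector and two matrices.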

### b) Inference tasks

What is the goal of most likely explanation? Why is the most likely sequence not the sequence of the most likely states? Give an example where these two sequences disagree.

YOUR ANSWER HERE

### c) Speech recognition

Summarize in your own words the approach to speech recognition presented in the lecture. Explain what kind of inference problems occur and how they are solved.

YOUR ANSWER HERE

## Assignment 2: Implementing HMM [6 Points]

**a)** Implement the basic inference algorithms for HMMs. You may do so by filling in the stubs in the following class. In this implementation, we represent finite probability distributions as one-dimensional numpy arrays with values summing to one, e.g. the initial state distribution over three states $a,b,c$ with $P(a)=0.3, P(b)=0.2, P(c)=0.5$ would be represented by the array `[0.3, 0.2, 0.5]`. Transition matrices are realized by two-dimensional arrays.

If you prefer to write your own code, you will find an empty code cell below.
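
To illustrate the array representation, here is a minimal sketch (not the reference solution) of a single prediction step written as a vector-matrix product. Only the initial distribution `[0.3, 0.2, 0.5]` comes from the text above; the three-state transition matrix is made up for illustration.

```python
import numpy as np

# Initial distribution over the states a, b, c (from the text above).
p_initial = np.array([0.3, 0.2, 0.5])

# Hypothetical transition matrix: row i is P(X_{t+1} | X_t = state i).
p_transition = np.array([[0.8, 0.1, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.3, 0.6]])

# One prediction step: P(X_{t+1}) = sum_i P(X_{t+1} | X_t = i) * P(X_t = i).
p_next = p_initial @ p_transition
print(p_next, p_next.sum())
```

Because every row of the transition matrix sums to one, the result is again a probability distribution.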

In [None]:
import numpy as np

class HMM:
    """A class implementing a Hidden Markov Model. This class provides methods
    to perform the standard inference tasks.
    """
    
    def __init__(self, states, outputs, p_initial, p_transition, p_output):
        """Create a new HMM.
        
        Args:
            states (list): a list of valid states.
            outputs (list): a list of valid output symbols.
            p_initial (array_like): initial state distribution. Should add up to one.
            p_transition (array_like, ndim=2): state transition probabilities.
              Each row is a probability distribution over states, should add up to one.
            p_output (array_like, ndim=2): output emission probabilities.
              Each row is a probability distribution over output values, should add up to one.
        """
        self.states = states
        self.outputs = outputs
        self.p_initial = np.asarray(p_initial)
        self.p_transition = np.asarray(p_transition)
        self.p_output = np.asarray(p_output)
        
        # Some sanity checks (np.isclose/np.allclose tolerate floating-point rounding)
        assert self.p_initial.shape == (len(self.states),), "Invalid shape for initial state distribution."
        assert np.isclose(self.p_initial.sum(), 1.0), "Initial state probabilities do not add up to one."
        assert self.p_transition.shape == (len(self.states),len(self.states)), "Invalid shape for state transition table."
        assert np.allclose(self.p_transition.sum(axis=1), 1.0), "State transition probabilities do not add up to one."
        assert self.p_output.shape == (len(self.states),len(self.outputs)), "Invalid shape for emission table."
        assert np.allclose(self.p_output.sum(axis=1), 1.0), "Emission probabilities do not add up to one."

    
    def prediction(self, p_states):
        """Compute a prediction step, i.e. from a given
        state distribution P(X_{t}), compute the next
        state distribution P(X_{t+1}).
        
        Args:
            p_states (ndarray): the current state distribution.
            
        Returns:
            ndarray: the probability distribution for the next state.
        """
        # YOUR CODE HERE


    def forward(self, p_states, observation):
        """Compute a forward step, i.e. from a given
        state distribution P(X_{t}), and an observation
        e_{t+1} compute the next state distribution
        P(X_{t+1}).
        
        Args:
            p_states (ndarray): the current state distribution.
            observation: The next observation.
            
        Returns:
            ndarray: the state distribution P(X_{t+1}) given the new observation.
        """
        # YOUR CODE HERE


    def backward(self, p_observations, observation):
        """Compute a backward step, i.e. from a given
        output distribution P(e_{t+2,T}|X_{t+1}) and an
        observation e_{t+1} compute the previous
        output distribution P(e_{t+1,T}|X_{t}).
        
        Args:
            p_observations (ndarray): the backward message P(e_{t+2,T}|X_{t+1}).
            observation: The next observation e_{t+1}.
            
        Returns:
            ndarray: the backward message P(e_{t+1,T}|X_{t}).
        """
        # YOUR CODE HERE


    def filtering(self, observations):
        """Filter this sequence, i.e., iteratively
        determine the state probabilities given an 
        observed output sequence.
    
        Args:
            observations (list): The sequence of observations.

        Returns:
            list of ndarray: A sequence of state probability
            distributions.
        """
        # YOUR CODE HERE


    def smoothing(self, observations, k):
        """The forward-backward algorithm to determine a state 
        distribution based on past and future observations.
        
        Args:
            observations (list): The sequence of observations.
            k (int): The index for which to determine the state
            probabilities.

        Returns:
            ndarray: the state probability distributions for the
            given index k.
        """
        # YOUR CODE HERE


    def viterbi(self, observations):
        """The Viterbi algorithm. Determine the most likely sequence
        of hidden states, given an observed output sequence.
    
        Args:
            observations (list): The sequence of observations.

        Returns: two return values:
            1. list: the most likely sequence of states
            2. list of ndarray: A sequence of probability vectors,
            providing for each time t and state s the probability
            of the most likely initial sequence ending in that state.
        """
        # YOUR CODE HERE

        return sequence, P_max


In [None]:
# If you prefer to do your own implementation, place your code here ...
# YOUR CODE HERE

**The umbrella example**

The following cell initializes an HMM based on the example from the lecture (ML-12 slide 13ff). Use the cells below to check your implementation.

In [None]:
# The values of the hidden states
states = ['rain','sun']

# The possible output values
outputs = [True, False]


# The initial distribution of states
initial = [0.5, 0.5]

# The state transition table
transition = [[0.7, 0.3],
              [0.3, 0.7]]

# The output probabilities for each state
output = [[0.9,0.1],
          [0.2,0.8]]

model = HMM(states, outputs, initial, transition, output)

In [None]:
# Check the filtering example from the lecture (ML-12 slide 16)

observations = [True,True]
P = model.filtering(observations)

assert np.allclose(P[0], [0.500, 0.500], rtol=5e-2), "Bad initial distribution for filtering."
assert np.allclose(P[1], [0.818, 0.182], rtol=5e-2), "Bad filter values (step 1)."
assert np.allclose(P[2], [0.883, 0.117], rtol=5e-2), "Bad filter values (step 2)."

In [None]:
# Check the smoothing example from the lecture (ML-12 slide 21)

observations = [True,True]

assert np.allclose(model.smoothing(observations,1), [0.883, 0.117], rtol=5e-2), "Bad smoothing result (k=1)"
assert np.allclose(model.smoothing(observations,2), [0.883, 0.117], rtol=5e-2), "Bad smoothing result (k=2)"

In [None]:
# Check the most likely explanation example from the lecture (ML-12 slide 24)

observations = [True,True,False,True,True]
most_likely_sequence, P_max = model.viterbi(observations)

assert [model.states[s] for s in most_likely_sequence] == ['rain', 'rain', 'sun', 'rain', 'rain'], "Wrong sequence (Viterbi)"
assert np.allclose(P_max[0], [0.8182, 0.1818], rtol=5e-2), "Bad viterbi (step 0)."
assert np.allclose(P_max[1], [0.5155, 0.0491], rtol=5e-2), "Bad viterbi (step 1)."
assert np.allclose(P_max[2], [0.0361, 0.1237], rtol=5e-2), "Bad viterbi (step 2)."
assert np.allclose(P_max[3], [0.0334, 0.0173], rtol=5e-2), "Bad viterbi (step 3)."
assert np.allclose(P_max[4], [0.0210, 0.0024], rtol=5e-2), "Bad viterbi (step 4)."

**b)** Now use your implementation to study the behaviour of such a model.

**1.** Run your model to predict the state distributions without providing any output evidence, i.e. only use the state transition matrix. What do you observe? How does the behavior change if you provide another initial distribution? 

In [None]:
# YOUR CODE HERE

YOUR ANSWER HERE
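
One possible way to run the experiment in **1.** is sketched below. It is a standalone sketch that does not use the `HMM` class: it applies the prediction step repeatedly to the umbrella transition table. The helper name `predict_n_steps`, the number of steps and the alternative start distributions are choices made for this illustration.

```python
import numpy as np

# Transition table of the umbrella model (rows and columns: rain, sun).
transition = np.array([[0.7, 0.3],
                       [0.3, 0.7]])

def predict_n_steps(p_states, n_steps):
    """Apply the prediction step n_steps times and return all distributions."""
    history = [np.asarray(p_states, dtype=float)]
    for _ in range(n_steps):
        history.append(history[-1] @ transition)
    return history

# Compare several (hypothetical) initial distributions.
finals = {}
for start in ([0.5, 0.5], [1.0, 0.0], [0.1, 0.9]):
    finals[tuple(start)] = predict_n_steps(start, 20)[-1]
    print(start, "->", np.round(finals[tuple(start)], 4))
```

Printing the whole `history` instead of only the last entry shows how quickly the influence of the initial distribution fades.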

**2.** Drive your model by providing some observations. Compare the sequence of most likely states with the most likely sequence of states. Can you provide a case where these are different for the "umbrella" model? 

In [None]:
# YOUR CODE HERE

YOUR ANSWER HERE

# Recap (part II)

This is the second part of the recap material. These exercises do not need to be solved in order to qualify for the final exam, but working through them is highly recommended for preparation. Also, if you hit any question that should be discussed in more detail, please let us know.

## Recap 6: Neural Networks [2 Points]

### a) Neural Networks

Name three different kinds of Artificial Neural Networks discussed in the lecture.

1. Multilayer Perceptron
2. Self-Organizing Maps
3. Radial Basis Function Network 
4. Recurrent Network

### b) Backpropagation

Which of the following formulae describes the backpropagation of the error through hidden layers in a Multilayer Perceptron?
Assume they are calculated for each $k=L_H \dots 1$ and $i=1\dots N(k)$.

1. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)o_j(k)$
2. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)\delta_j(k+1)$
3. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k, k-1)\delta_j(k+1)$

The second formula accurately describes backpropagation for an MLP.
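
As a rough illustration, the second formula can be written as one vectorised numpy expression. The function name `hidden_delta` and all numbers are hypothetical; the derivative values $f^\prime(\cdot)$ are passed in precomputed so that the sketch does not depend on a particular activation function.

```python
import numpy as np

def hidden_delta(f_prime_k, w_next, delta_next):
    """delta_i(k) = f'(.)_i * sum_j w_ji(k+1, k) * delta_j(k+1).

    f_prime_k:  derivative values for the N(k) neurons of layer k
    w_next:     weights w_ji(k+1, k), shape (N(k+1), N(k))
    delta_next: error terms delta_j(k+1), shape (N(k+1),)
    """
    return f_prime_k * (w_next.T @ delta_next)

f_prime_k = np.array([0.25, 0.10])         # hypothetical derivative values
w_next = np.array([[1.0, -2.0],
                   [0.5,  0.0],
                   [0.0,  3.0]])           # hypothetical weights, 3 x 2
delta_next = np.array([0.2, -0.4, 0.1])    # hypothetical deltas of layer k+1

delta_k = hidden_delta(f_prime_k, w_next, delta_next)
print(delta_k)
```

The transpose appears because the sum runs over the neurons $j$ of layer $k+1$, i.e. over the rows of the weight matrix.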

### c) Hebb's rule
Explain Hebb's rule. Provide a formula. What is the relation to Oja's rule?

Hebb's rule models the adaptation of neurons during learning and can also be used in neural networks to adapt the weights of the neurons in order to learn a target function. The weight between two neurons increases when they fire together. For a neuron with input $\vec{x}$, weights $\vec{w}$ and output $y = \vec{w}\cdot\vec{x}$, Hebb's rule is
$$\Delta w_i = \varepsilon\cdot y\cdot x_i$$
where $\varepsilon$ is the learning rate. A problem with Hebb's rule is that weights can become arbitrarily large, since they are never *unlearned*, i.e. weights don't decrease. This can be fixed by applying Oja's rule, which adds a **weight decay** term to Hebb's rule, $\Delta w_i = \varepsilon\cdot y\cdot(x_i - y\,w_i)$, so that the weight vector stays bounded.
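
The difference between the two rules can be seen in a small simulation. The sketch below assumes a linear neuron $y = \vec{w}\cdot\vec{x}$ and uses the standard textbook forms $\Delta\vec{w} = \varepsilon\, y\, \vec{x}$ (Hebb) and $\Delta\vec{w} = \varepsilon\, y\,(\vec{x} - y\,\vec{w})$ (Oja); the data, learning rate and start weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical zero-mean 2-D inputs with most variance along the first axis.
X = rng.normal(size=(2000, 2)) * np.array([1.5, 0.5])
eps = 0.005  # learning rate

w_hebb = np.array([0.5, 0.5])
w_oja = np.array([0.5, 0.5])
for x in X:
    y = w_hebb @ x
    w_hebb = w_hebb + eps * y * x                 # Hebb: weights only grow
    y = w_oja @ x
    w_oja = w_oja + eps * y * (x - y * w_oja)     # Oja: Hebb plus decay term

print("norm after Hebb:", np.linalg.norm(w_hebb))
print("norm after Oja: ", np.linalg.norm(w_oja))
```

With Hebb's rule the weight norm keeps growing with every sample, whereas the decay term in Oja's rule keeps it close to one.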

## Recap 7: Local Methods [2 Points]

### a) Local methods

What are differences between local and global methods? What are advantages or disadvantages?

Local methods are local with respect to the input space, so the output is calculated independently for different regions of the input space, meaning that changes in one region don't affect the output of other regions.  
In contrast, when using global methods, a single example may influence the performance of the complete method for any other example.

|        | Advantages | Disadvantages |
| -----  | ---------- | ------------- |
| local  | easier to manage parameters | can get stuck in local extrema |
|        | usually more robust | |
| global | can learn more complex models | overall performance may decrease due to a single example |

### b) MLP and RBFN

Are MLPs or RBFNs local methods? Why?

An MLP is not local: the overall error of the network is used to update the weights of all neurons. An RBFN is local, because each neuron is a local function which may or may not be modified by an example, and therefore changes only have local effects.

### c)  Nearest neighbor

How does the nearest neighbor approach work? How can it be improved?

The nearest neighbour approach stores all input examples and matches previously unseen data to the *nearest* stored example. For example, I could have a (somewhat dumb) training dataset containing 3 examples, each belonging to a different class. Now, for all new data, I look at my 3 stored examples and calculate the distance from my new data point to each of those. Then I simply choose the *nearest* (depends on the metric used) and output the class of the nearest neighbour as the class of the new data point.  

This classifier can be improved by choosing not just 1 neighbour but a fixed number of *k* neighbours. Depending on whether the output is discrete or real, the output is then derived as follows:
* **discrete**: choose the most frequently occurring class (if there is a tie, toss a coin)
* **real**: calculate the mean of all *k* neighbours

This adaptation is called *K-nearest neighbour* (or neighbor if you prefer American English).
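
The procedure described above can be sketched in a few lines of numpy. This is a minimal illustration assuming Euclidean distance; the function name `knn_predict` and the toy data are hypothetical, and ties are resolved by first occurrence rather than by a coin toss.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, discrete=True):
    """Predict the output for x_new from its k nearest training examples
    (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if discrete:
        # Majority vote; ties are resolved by first occurrence, not a coin toss.
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
    # Real-valued output: mean over the k neighbours.
    return float(np.mean([y_train[i] for i in nearest]))

# Hypothetical toy data: two well-separated clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(X_train, labels, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_train, values, np.array([0.0, 0.0]), k=3, discrete=False))
```

Setting `k=1` gives the plain nearest neighbour approach from the first paragraph.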

## Recap 8: Classification [2 Points]

### a) Classifier

What is a classifier? What is the relation to a concept?

A classifier is an algorithm (of some kind) that assigns one class of a finite set of classes to a specific instance. The relation to a concept can be made clear quite easily, because a classifier with only 2 classes *IsACar* and *IsNotACar* is pretty much the same as a concept, which is a binary function saying whether an example is an instance of a specific concept or not.

### b) Comparison of classifiers

Name three different classifiers and compare them. Think about biases and assumptions, separatrices, sensitivity, locality, parameters and speed. 

* **LDA**$^1$
    * bias
        * identical a priori probability for all classes
        * all classes exhibit a Gaussian distribution with equal covariances (but different means)
    * linear separatrix, explicitly represented
    * Global
    * Sensitive to far outliers
    * No Parameters needed
    * Fast (training)
* **Nearest-neighbour**
    * distance in input space is equal to distance in output space
    * implicitly defined, piecewise linear separatrix
    * local (as discussed previously)
    * No sensitivities
    * No parameters
    * Fast training (but needs lots of memory); finding the best matching neighbour can take time
* **SVM**$^2$
    * only 2 classes
    * linear, but can be used to model non-linear separatrix using the *kernel trick*
    * local (TODO?)
    * usually sensitive to outliers, but an extension can solve that
    * No parameters
    * Fairly fast

---

$^1$ *linear discriminant analysis*  
$^2$ *support vector machine*

### c) SVM

What is a support vector? How does the kernel trick work?

A support vector is an example/instance that lies on the boundary of the margin (or rather, all support vectors *make up* the margin); support vectors are therefore the instances that are hardest to classify. The kernel trick projects the data to a higher-dimensional space in which the problem becomes linearly separable. That would normally be quite hard to compute, but since an SVM only uses inner products of vectors, the inner product of the data in that higher-dimensional projection can be computed in the original input space using a *kernel function*.
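
A small numerical check makes the last point concrete. The sketch below uses the homogeneous polynomial kernel of degree 2, $k(\vec{x},\vec{z}) = (\vec{x}\cdot\vec{z})^2$, as an example; the kernel choice, the explicit feature map `phi` and the two test vectors are assumptions for illustration, not taken from the lecture.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x . z)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, evaluated in input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in the 3-D feature space
implicit = poly_kernel(x, z)               # same value without computing phi
print(explicit, implicit)
```

Both values agree up to rounding, although `poly_kernel` never constructs the three-dimensional feature vectors.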

## Recap 9: Reinforcement Learning [2 Points]

### a)

What is an agent in terms of reinforcement learning? Name an example of an agent.

An agent in terms of reinforcement learning is an entity that is in some state, from which it can transition to other states using one of a finite set of possible actions.

An agent is defined by:
* Performance
    * A goal or task to be performed by the agent (success may be measured on a discrete or real scale)
* Environment
    * The environment in which the agent acts.
* Actuators
    * Available actions, control mechanism, *things* which the agent can use to act
* Sensors
    * Means by which information is acquired
    
So an agent is a system embedded in an environment, which it explores using its sensors and with which it can interact using its actuators (weird word).

**Example:** Expert System
* Performance
    * Answer a query appropriately
* Environment
    * User, database
* Actuators
    * Database return value
* Sensors
    * Query (text or speech input)

### b)

What is the Markov assumption? How is it related to Q-learning? Give an example for which it does not hold.

The Markov assumption says that the successor state depends only on the current state of the agent and its current action (not on earlier ones), and that the same applies to rewards, i.e. the next reward depends only on the current state and action. Q-learning relies on the Markov assumption: it only uses the reward and the estimated value of the next state to update the value of the current state-action pair. In that way, the current state is assumed to hold all information necessary to make a decision.  

**Example:**  
Suppose you have 1 container filled with 3 balls, of which 1 is red and 2 are green. For 3 days, 1 ball is drawn each day (without replacement) from that container. On day 1, a green ball is drawn. On day 2, the other green ball is drawn. Now, under the Markov assumption, you would only know that 1 green ball has just been removed, so a red and a green ball would seem equally likely to be left ($\frac{1}{2}$ each). However, if you also know which ball was drawn on the first day, then you know for certain that the red ball is left in the container.

### c) 
What does the $Q$-function express in reinforcement learning and how is it used?

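The $Q$-function $Q^\pi(s,a)$ is the action-value function; in other words, it is the expected cumulative (discounted) reward that is received when starting from state $s$, performing action $a$ and following policy $\pi$ afterwards. It is used to choose actions: in state $s$, the agent picks the action with the highest $Q$-value.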
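
How such a function can be stored and used is sketched below for the tabular case. The update is the standard textbook Q-learning rule; the function names, the learning rate `alpha`, the discount factor `gamma` and the two-state task are assumptions for illustration, not taken from the lecture.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_action(Q, s):
    """Use the Q-function: pick the action with the highest estimated value."""
    return int(np.argmax(Q[s]))

Q = np.zeros((2, 2))   # hypothetical task: 2 states x 2 actions
Q = q_update(Q, s=0, a=1, reward=1.0, s_next=1)
print(Q)
print(greedy_action(Q, 0))
```

The update only needs the current state, the action, the reward and the next state, which is where the Markov assumption enters.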

## Recap 10: Modeling Uncertainty [2 Points]

### a) Uncertainty

Why do we need to model uncertainties?

There are many reasons for which uncertainties can occur, including imprecise technology, high complexity, partially unobservable environments or simply natural uncertainty. Since uncertainty can occur, it is useful to devise strategies to deal with it, in order to be able to produce good models even in the face of uncertainty.

### b) Naive Bayes

What is a naive Bayes classifier? Why is it naive? 

A naive Bayes classifier is a probabilistic classifier that applies Bayes' Theorem to given a priori and conditional probabilities. It is naive because it assumes that all features are conditionally independent of each other given the class.

If $C$ is a cause and $E_i$, for $i = 1,\dots, n$, are the effects, then the chain rule followed by the independence assumption $P(E_i | C,E_{i+1},\dots,E_n) = P(E_i | C)$ gives
\begin{align*}
P(C,E_1, E_2,\dots, E_n) &= P(E_1 | C,E_2,\dots,E_n)P(C,E_2,\dots,E_n)\\
&= P(E_1 | C,E_2,\dots,E_n)P(E_2 | C,E_3,\dots,E_n)P(C,E_3,\dots,E_n)\\
&= \dots\\
&= P(C)\prod_{i=1}^n P(E_i | C)
\end{align*}
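
The factorisation can be turned into a classifier by evaluating $P(C)\prod_i P(E_i|C)$ for every class and normalising. The following sketch uses two classes and two observed effects; all probabilities are invented for illustration.

```python
import numpy as np

prior = np.array([0.6, 0.4])               # P(C) for two hypothetical classes
# likelihoods[i, c] = P(E_i = observed value | C = c), hypothetical numbers
likelihoods = np.array([[0.5, 0.1],
                        [0.2, 0.9]])

joint = prior * likelihoods.prod(axis=0)   # P(C) * prod_i P(E_i | C)
posterior = joint / joint.sum()            # normalise to get P(C | E_1, E_2)
print(posterior)
```

The predicted class is the one with the largest entry in `posterior`.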

### c) Probabilities

Given the following table, calculate the probability of drawing a blue candy blindly (assume the bags are chosen equally likely). Then calculate the probability that our drawn blue candy was drawn from the red bag.

|                | blue candies | green candies |
|----------------|--------------|---------------|
| **red bag**    |            5 |            10 |
| **yellow bag** |           20 |            10 |

\begin{align*}
P(b) &= \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3}\\
&= \frac{1}{2} = 0.5\\
P(r|b) &= \frac{P(b|r)P(r)}{P(b)}\\
&= \frac{\frac{1}{3}\cdot \frac{1}{2}}{\frac{1}{2}}\\
&= \frac{1}{3} \approx 0.333
\end{align*}
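
The two probabilities can be checked with exact fractions. This is a small sketch using the counts from the table and the stated assumption that both bags are equally likely; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Both bags are chosen with equal probability.
p_red = p_yellow = Fraction(1, 2)

# Share of blue candies in each bag, taken from the table.
p_blue_given_red = Fraction(5, 5 + 10)
p_blue_given_yellow = Fraction(20, 20 + 10)

# Law of total probability, then Bayes' theorem.
p_blue = p_red * p_blue_given_red + p_yellow * p_blue_given_yellow
p_red_given_blue = p_blue_given_red * p_red / p_blue

print(p_blue, p_red_given_blue)
```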