Osnabrück University - Machine Learning (Summer Term 2018) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 12

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, July 1, 2018**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Temporal probability models [4 Points]

### a) Hidden Markov Model

Explain the structure of a Hidden Markov Model. What probabilities have to be provided for such a model and how can they be specified?

A Hidden Markov Model (HMM) explains an observable sequence of data by a sequence of underlying hidden (discrete) states. Each state has certain emission probabilities for the observable variables (sensor model). Furthermore, one has to provide a table of transition probabilities, describing the likelihood of state changes. Finally, one has to provide a prior probability, describing the initial state of the system. In an HMM it is assumed that both the transition probabilities and the emission probabilities are stationary and fulfill the (first-order) Markov property, allowing for a compact representation of the model.
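The three tables named above (prior, transition model, sensor model) fully determine how an HMM generates data. The following is a minimal sketch of that generative process with illustrative numbers. It assumes, for simplicity, that the first hidden state already emits an observation, and the function name `sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p_initial = np.array([0.5, 0.5])              # prior over the first hidden state
p_transition = np.array([[0.7, 0.3],          # row i: P(next state | state i)
                         [0.3, 0.7]])
p_output = np.array([[0.9, 0.1],              # row i: P(observation | state i)
                     [0.2, 0.8]])

def sample(length):
    """Draw a hidden state sequence and the observations it emits."""
    states, observations = [], []
    state = rng.choice(2, p=p_initial)
    for _ in range(length):
        states.append(int(state))
        # The observation depends only on the current hidden state
        observations.append(int(rng.choice(2, p=p_output[state])))
        # The next state depends only on the current state (Markov property)
        state = rng.choice(2, p=p_transition[state])
    return states, observations

states, observations = sample(10)
print(states)
print(observations)
```

Only the second printed list would be visible to an observer; the inference tasks below try to recover information about the first one.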

### b) Inference tasks

What is the goal of most likely explanation? Why is the most likely sequence not the sequence of the most likely states? Give an example where these two sequences disagree.

The inference task of finding the most likely explanation is to determine the sequence of hidden states $x_{1:T}$ that is most likely to have generated a sequence of observations $e_{1:T}$. If one computed the most likely state for each time step separately, one would maximize distributions over single time steps, whereas finding the most likely sequence requires considering the joint probability over all time steps. For example, suppose states $A$ and $B$ are each the most likely state for the respective emissions $a$ and $b$. Picking the most likely state per time step may then yield the subsequence $A,B$ for the observations $a,b$, but if the transition probability from $A$ to $B$ is 0, no sequence with nonzero probability can contain the subsequence $A,B$, so the most likely sequence must differ.
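The disagreement can be checked by brute force on a small model. The sketch below uses a hypothetical three-state toy model (not from the lecture) with a single output symbol, so the observations carry no information, and it assumes the initial distribution describes the first observed time step. It enumerates all state sequences, then compares the argmax over whole sequences with the per-step argmax of the marginals.

```python
import itertools
import numpy as np

# Hypothetical toy model: three hidden states, one output symbol
states = ["A", "B", "C"]
p_initial = np.array([0.4, 0.35, 0.25])      # distribution of the first state
p_transition = np.array([[0.5, 0.0, 0.5],    # A never moves to B
                         [0.0, 1.0, 0.0],
                         [0.0, 1.0, 0.0]])
p_output = np.ones((3, 1))                   # one symbol, emitted with certainty
observations = [0, 0]

def joint(seq):
    """P(x_{1:T}, e_{1:T}) for one state sequence (given as indices)."""
    p = p_initial[seq[0]] * p_output[seq[0], observations[0]]
    for prev, cur, obs in zip(seq, seq[1:], observations[1:]):
        p *= p_transition[prev, cur] * p_output[cur, obs]
    return p

sequences = list(itertools.product(range(len(states)), repeat=len(observations)))
probs = np.array([joint(seq) for seq in sequences])

# Most likely explanation: argmax over whole sequences
best_sequence = sequences[int(probs.argmax())]

# Sequence of most likely states: argmax of each marginal separately
marginals = np.zeros((len(observations), len(states)))
for seq, p in zip(sequences, probs):
    for t, s in enumerate(seq):
        marginals[t, s] += p
stepwise = tuple(int(i) for i in marginals.argmax(axis=1))

print([states[i] for i in best_sequence], joint(best_sequence))
print([states[i] for i in stepwise], joint(stepwise))
```

In this toy model the most likely sequence is `B, B`, while the per-step choice is `A, B`, a sequence whose joint probability is zero because the transition from `A` to `B` is impossible.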

### c) Speech recognition

Summarize in your own words the approach to speech recognition presented in the lecture. Explain what kind of inference problems occur and how they are solved.

In speech recognition, one aims to reason about a sequence of words (symbols) based on a series of observations (physical signals). This mapping is usually not done directly, but through a chain of more and more abstract representations.
1. On the lowest level, the analog, time-continuous speech signal is discretized by sampling. This is a rather direct operation that does not involve more sophisticated reasoning.
1. The resulting sequence is subdivided into overlapping frames for which energy spectra are computed. This form allows for an easy extraction of elementary sound features.
1. A phone model explains what features result from a given phone. As a phone has a certain temporal extension that stretches over more than one feature frame, it is best modeled by some sequential model that takes the phone's internal structure into account, e.g., a phone HMM. This allows reconstructing the most likely phone sequence from the feature sequence using the Viterbi algorithm.
1. A word model combines phone models to describe possible phonetic realizations of a word. So the most likely word can be inferred from a given phone sequence.
1. A language model describes the sentences of a language, or in a probabilistic setting, how likely certain sequences of words are in a language. A bigram language model can be combined with a word model to form an HMM. One could use this model to infer the most likely word sequence from the sequence of most likely words, but this usually gives very poor results. To get better results, one should also include less likely words in the computation.
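The second step of the chain above (overlapping frames and their energy spectra) can be sketched in a few lines of numpy. The frame length, hop size, and Hann window used here are common choices but assumptions, not values taken from the lecture, and the function name `frame_energy_spectra` is hypothetical.

```python
import numpy as np

def frame_energy_spectra(signal, frame_length=400, hop_length=160):
    """Split a sampled signal into overlapping frames and return the
    energy spectrum (squared FFT magnitude) of each windowed frame."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)
    frames = np.stack([signal[i * hop_length : i * hop_length + frame_length]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

# One second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spectra = frame_energy_spectra(np.sin(2 * np.pi * 440 * t))
print(spectra.shape)             # (number of frames, number of frequency bins)
print(spectra.argmax(axis=1))    # strongest frequency bin in each frame
```

With 400 samples per frame at 16 kHz, neighboring frequency bins are 40 Hz apart, so the 440 Hz tone shows up in bin 11 of every frame.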

## Assignment 2: Implementing HMM [6 Points]

**a)** Implement the basic inference algorithms for HMMs. You may do so by filling in the stubs in the following class. In this implementation, we represent finite probability distributions as one-dimensional numpy arrays, with values summing up to one, e.g. the initial state distribution over three states $a,b,c$ with $P(a)=0.3, P(b)=0.2, P(c)=0.5$ would be represented by the array `[0.3, 0.2, 0.5]`. Transition matrices are realized by two-dimensional arrays.

If you prefer to write your own code, you will find an empty code cell below.

In [1]:
import numpy as np

class HMM:
    """A class implementing a Hidden Markov Model. This class provides methods
    to perform the standard inference tasks.
    """
    
    def __init__(self, states, outputs, p_initial, p_transition, p_output):
        """Create a new HMM.
        
        Args:
            states (list): a list of valid states.
            outputs (list): a list of valid output symbols.
            p_initial (array_like): initial state distribution. Should add up to one.
            p_transition (array_like, ndim=2): state transition probabilities.
              Each row is a probability distribution over states, should add up to one.
            p_output (array_like, ndim=2): output emission probabilities.
              Each row is a probability distribution over output values, should add up to one.
        """
        self.states = states
        self.outputs = outputs
        self.p_initial = np.asarray(p_initial)
        self.p_transition = np.asarray(p_transition)
        self.p_output = np.asarray(p_output)
        
        # Some sanity checks (np.isclose/np.allclose tolerate floating-point rounding)
        assert self.p_initial.shape == (len(self.states),), "Invalid shape for initial state distribution."
        assert np.isclose(self.p_initial.sum(), 1.0), "Initial state probabilities do not add up to one."
        assert self.p_transition.shape == (len(self.states),len(self.states)), "Invalid shape for state transition table."
        assert np.allclose(self.p_transition.sum(axis=1), 1.0), "State transition probabilities do not add up to one."
        assert self.p_output.shape == (len(self.states),len(self.outputs)), "Invalid shape for emission table."
        assert np.allclose(self.p_output.sum(axis=1), 1.0), "Emission probabilities do not add up to one."

    
    def prediction(self, p_states):
        """Compute a prediction step, i.e. from a given
        state distribution P(X_{t}), compute the next
        state distribution P(X_{t+1}).
        
        Args:
            p_states (ndarray): the current state distribution.
            
        Returns:
            ndarray: the probability distribution for the next state.
        """
        ### BEGIN SOLUTION
        return p_states @ self.p_transition
        ### END SOLUTION


    def forward(self, p_states, observation):
        """Compute a forward step, i.e. from a given
        state distribution P(X_{t}), and an observation
        e_{t+1} compute the next state distribution
        P(X_{t+1}).
        
        Args:
            p_states (ndarray): the current state distribution.
            observation: The next observation.
            
        Returns:
            ndarray: the normalized state distribution P(X_{t+1} | e_{1:t+1}).
        """
        ### BEGIN SOLUTION
        i = self.outputs.index(observation)
        p_trans = p_states @ self.p_transition
        p_unnormalized = p_trans * self.p_output[:,i]
        return p_unnormalized/p_unnormalized.sum()
        ### END SOLUTION


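    def backward(self, p_observations, observation):
        """Compute a backward step, i.e. from a given
        output distribution P(e_{t+2:T}|X_{t+1}) and an
        observation e_{t+1} compute the previous
        output distribution P(e_{t+1:T}|X_{t}).
        
        Args:
            p_observations (ndarray): the backward message P(e_{t+2:T}|X_{t+1}).
            observation: The observation e_{t+1}.
            
        Returns:
            ndarray: the backward message P(e_{t+1:T}|X_{t}) (not normalized).
        """
        ### BEGIN SOLUTION
        i = self.outputs.index(observation)
        # Sum over the successor state x_{t+1}: rows of p_transition index X_t,
        # so the transition matrix has to be applied from the left.
        return self.p_transition @ (self.p_output[:,i] * p_observations)
        ### END SOLUTION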


    def filtering(self, observations):
        """Filter this sequence, i.e., iteratively
        determine the state probabilities given an 
        observed output sequence.
    
        Args:
            observations (list): The sequence of observations.

        Returns:
            list of ndarray: A sequence of state probability
            distributions.
        """
        ### BEGIN SOLUTION
        P = [self.p_initial]
        for t, s in enumerate(observations):
            P.append(self.forward(P[-1], s))

        return P
        ### END SOLUTION


    def smoothing(self, observations, k):
        """The forward-backward algorithm to determine a state 
        distribution based on past and future observations.
        
        Args:
            observations (list): The sequence of observations.
            k (int): The index for which to determine the state
            probabilities.

        Returns:
            ndarray: the state probability distributions for the
            given index k.
        """
        ### BEGIN SOLUTION
        P_forward = self.p_initial
        for s in observations[:k]:
            P_forward = self.forward(P_forward, s)

        # The backward message is a vector over states (not over outputs)
        P_backward = np.ones(len(self.states))
        for s in reversed(observations[k:]):
            P_backward = self.backward(P_backward, s)

        p_unnormalized = P_forward * P_backward
        return p_unnormalized/p_unnormalized.sum()
        ### END SOLUTION


    def viterbi(self, observations):
        """The Viterbi algorithm. Determine the most likely sequence
        of hidden states, given an observed output sequence.
    
        Args:
            observations (list): The sequence of observations.

        Returns: two return values:
            1. list: the most likely sequence of states
            2. list of ndarray: A sequence of probability vectors,
            providing for each time t and state s the probability
            of the most likely initial sequence ending in that state.
        """
        ### BEGIN SOLUTION
        P_max = [self.forward(self.p_initial,observations[0])]
        backpointers = []
        for s in observations[1:]:
            i = self.outputs.index(s)
            # p_tmp[j, k] = P_max[-1][j] * P(X_{t+1}=k | X_t=j)
            p_tmp = P_max[-1][:, np.newaxis] * self.p_transition
            # Maximize over the predecessor state j (axis 0)
            P_max.append(p_tmp.max(axis=0) * self.p_output[:,i])
            backpointers.append(p_tmp.argmax(axis=0))

        sequence = [P_max[-1].argmax()]
        for pointer in reversed(backpointers):
            sequence.insert(0,pointer[sequence[0]])
        ### END SOLUTION

        return sequence, P_max


In [2]:
# If you prefer to do your own implementation, place your code here ...
### BEGIN SOLUTION
# ...
### END SOLUTION

**The umbrella example**

The following cell initializes an HMM based on the example from the lecture (ML-12 slide 13ff). Use the cells below to check your implementation.

In [3]:
# The values of the hidden states
states = ['rain','sun']

# The possible output values
outputs = [True, False]


# The initial distribution of states
initial = [0.5, 0.5]

# The state transition table
transition = [[0.7, 0.3],
              [0.3, 0.7]]

# The output probabilities for each state
output = [[0.9,0.1],
          [0.2,0.8]]

model = HMM(states, outputs, initial, transition, output)

In [4]:
# Check the filtering example from the lecture (ML-12 slide 16)

observations = [True,True]
P = model.filtering(observations)

assert np.allclose(P[0], [0.500, 0.500], rtol=5e-2), "Bad initial distribution for filtering."
assert np.allclose(P[1], [0.818, 0.182], rtol=5e-2), "Bad filter values (step 1)."
assert np.allclose(P[2], [0.883, 0.117], rtol=5e-2), "Bad filter values (step 2)."

In [5]:
# Check the smoothing example from the lecture (ML-12 slide 21)

observations = [True,True]

assert np.allclose(model.smoothing(observations,1), [0.883, 0.117], rtol=5e-2), "Bad smoothing result (k=1)"
assert np.allclose(model.smoothing(observations,2), [0.883, 0.117], rtol=5e-2), "Bad smoothing result (k=2)"

In [6]:
# Check the most likely explanation example from the lecture (ML-12 slide 24)

observations = [True,True,False,True,True]
most_likely_sequence, P_max = model.viterbi(observations)

assert [model.states[s] for s in most_likely_sequence] == ['rain', 'rain', 'sun', 'rain', 'rain'], "Wrong sequence (Viterbi)"
assert np.allclose(P_max[0], [0.8182, 0.1818], rtol=5e-2), "Bad viterbi (step 0)."
assert np.allclose(P_max[1], [0.5155, 0.0491], rtol=5e-2), "Bad viterbi (step 1)."
assert np.allclose(P_max[2], [0.0361, 0.1237], rtol=5e-2), "Bad viterbi (step 2)."
assert np.allclose(P_max[3], [0.0334, 0.0173], rtol=5e-2), "Bad viterbi (step 3)."
assert np.allclose(P_max[4], [0.0210, 0.0024], rtol=5e-2), "Bad viterbi (step 4)."

**b)** Now use your implementation to study the behaviour of such a model.

**1.** Run your model to predict the state distributions without providing any output evidence, i.e. only use the state transition matrix. What do you observe? How does the behavior change if you provide another initial distribution? 

**2.** Drive your model by providing some observations. Compare the sequence of most likely states with the most likely sequence of states. Can you provide a case where these are different for the "umbrella" model? 
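For question 1, one possible way to explore the behaviour is to apply the transition matrix repeatedly and compare different starting distributions. The sketch below is self-contained and does not use the `HMM` class; the helper name `predict` is hypothetical.

```python
import numpy as np

transition = np.array([[0.7, 0.3],
                       [0.3, 0.7]])   # transition table of the umbrella model

def predict(p_initial, steps):
    """Apply the transition model repeatedly without any evidence."""
    p = np.asarray(p_initial, dtype=float)
    history = [p]
    for _ in range(steps):
        p = p @ transition
        history.append(p)
    return np.array(history)

for start in ([1.0, 0.0], [0.1, 0.9]):
    print(start, "->", predict(start, 20)[-1].round(4))
```

For this transition table, both starting distributions end up very close to `[0.5, 0.5]` after 20 steps, which suggests that without evidence the prediction forgets the initial distribution.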

# Recap (part II)

This is the second part of the recap material. These exercises do not need to be solved in order to qualify for the final exam but it is highly recommended for preparation. Also if you hit any question that should be discussed in more detail, please let us know.

## Recap 6: Neural Networks [2 Points]

### a) Neural Networks

Name three different kinds of Artificial Neural Networks discussed in the lecture.

* The *multilayer perceptron* (MLP) consists of multiple layers of nodes through which activation is fed forward to compute an output vector for a given input pattern. It usually uses some non-linear activation function in each node and can be trained by a form of error gradient descent called backpropagation.

* A *radial basis function network* (RBFN) can be considered as a three-layer network: a given input pattern activates the hidden layer using a radial activation function. The output value is then determined as a linear combination of these values. In contrast to the MLP, the RBFN can be considered as a local classifier.

* A *self-organizing map* (SOM) is a two-layer architecture, in which a high-dimensional input space is connected to a low-dimensional grid. The SOM learns a discretized, low-dimensional representation of the input data. In contrast to MLP and RBFN, the SOM is an unsupervised approach.

### b) Backpropagation

Which of the following formulae describes the backpropagation of the error through hidden layers in a Multilayer Perceptron?
Assume they are calculated for each $k=L_H \dots 1$ and $i=1\dots N(k)$.

1. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)o_j(k)$
2. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1, k)\delta_j(k+1)$
3. $\delta_i(k) = f^\prime(o_i(k)) \sum\limits_{j=1}^{N(k+1)} w_{ji}(k, k-1)\delta_j(k+1)$

* Formula 1 uses the output instead of the deltas.
* Formula 2 is correct. 
* Formula 3 uses the wrong weights.
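Formula 2 can be written as a single matrix-vector product. The following sketch assumes a weight matrix `W_next` whose entry `[j, i]` is $w_{ji}(k+1,k)$ and uses $\tanh$ as an example activation; both the array layout and the function name `backpropagate_delta` are assumptions made for illustration.

```python
import numpy as np

def backpropagate_delta(o_k, W_next, delta_next, f_prime):
    """delta_i(k) = f'(o_i(k)) * sum_j w_ji(k+1, k) * delta_j(k+1).

    W_next[j, i] is the weight from unit i in layer k to unit j in layer k+1.
    """
    return f_prime(o_k) * (W_next.T @ delta_next)

rng = np.random.default_rng(0)
o_k = rng.normal(size=3)             # layer k has N(k) = 3 units
W_next = rng.normal(size=(2, 3))     # layer k+1 has N(k+1) = 2 units
delta_next = rng.normal(size=2)
f_prime = lambda o: 1.0 - np.tanh(o) ** 2   # assumed activation: tanh

delta_k = backpropagate_delta(o_k, W_next, delta_next, f_prime)

# The same computation written as the explicit sum of formula 2
delta_loop = np.array([
    f_prime(o_k[i]) * sum(W_next[j, i] * delta_next[j] for j in range(2))
    for i in range(3)
])
print(np.allclose(delta_k, delta_loop))
```

The transpose makes explicit that the error travels backwards along the same weights $w_{ji}(k+1,k)$ that carried the activation forward.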

### c) Hebb's rule
Explain Hebb's rule. Provide a formula. What is the relation to Oja's rule?

The idea of Hebb's rule is to strengthen connections between neurons that fire simultaneously. This is expressed by the formula
$$\Delta w_i = \varepsilon y(\vec{x}\cdot\vec{w})\cdot x_i$$
A high activation value $y(\vec{x}\cdot\vec{w})$ coinciding with high values in the input $x_i$ results in a strong adaptation. 

Oja's rule is a modification of the standard Hebb's rule. It addresses the problem that the simple form of Hebb's rule does not allow the connection weights to decrease, so that they will eventually become arbitrarily large. Oja's rule avoids this problem by introducing some multiplicative normalization.
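The difference between the two rules can be observed numerically. The sketch below assumes a linear neuron $y = \vec{x}\cdot\vec{w}$ and Oja's rule in its common form $\Delta \vec{w} = \varepsilon\, y\,(\vec{x} - y\,\vec{w})$; the toy data and the learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean toy data with most variance along the first axis (an assumption)
X = rng.normal(size=(2000, 2)) * np.array([2.0, 0.5])
eps = 0.005

w_hebb = np.array([0.1, 0.1])
w_oja = np.array([0.1, 0.1])
for x in X:
    y = x @ w_hebb
    w_hebb = w_hebb + eps * y * x                  # plain Hebb: the norm never shrinks
    y = x @ w_oja
    w_oja = w_oja + eps * y * (x - y * w_oja)      # Oja: Hebb term plus a decay term

print(np.linalg.norm(w_hebb), np.linalg.norm(w_oja))
```

Under these assumptions the Hebbian weight vector grows by many orders of magnitude, while the weight vector trained with Oja's rule stays close to unit length.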

## Recap 7: Local Methods [2 Points]

### a) Local methods

What are differences between local and global methods? What are advantages or disadvantages?

A model is termed local if the adaptation of model parameters only has local effects, i.e. it will only affect a subset of input values, located close to each other in the input space. In contrast, changing a parameter of a global model may affect all input values. Hence, local methods are considered to be more robust during training, as single (faulty) training examples only affect a part of the system. Furthermore, such methods may be easier to manage, as the effect of a single parameter is easier to understand.

### b) MLP and RBFN

Is an MLP or are RBFN local methods? Why?

RBFN are a local method as each hidden neuron has a local area of responsibility. In contrast, an MLP is global, as changing a single weight may change the input-output mapping for all input patterns.

### c) Nearest neighbor

How does the nearest neighbor approach work? How can it be improved?

In nearest neighbor learning, all training examples are stored in memory. Upon inference, when a new input is given, the most similar training example (the nearest neighbor) is retrieved and used to provide the result. In the $k$-nearest neighbor approach, not only one, but $k$ nearest neighbors are retrieved, and the result is determined by averaging over these values (or, for classification, by a majority vote). A more advanced version uses a weighted average that takes the distance of the neighbors from the given data point into account.
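The three variants described above can be sketched in one small function. Inverse-distance weighting is used here as one common choice for the weighted average; it is an assumption, as are the function name `knn_predict` and the toy data.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict a value for x from its k nearest training examples."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return y_train[nearest].mean()
    weights = 1.0 / (distances[nearest] + 1e-12)   # closer neighbors count more
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
x = np.array([1.2])
print(knn_predict(X_train, y_train, x, k=1))                 # nearest neighbor only
print(knn_predict(X_train, y_train, x, k=3))                 # plain average
print(knn_predict(X_train, y_train, x, k=3, weighted=True))  # distance-weighted
```

For the query point 1.2, the weighted prediction lies between the single nearest neighbor's value and the plain average, because the closest example dominates the weights.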

## Recap 8: Classification [2 Points]

### a) Classifier

What is a classifier? What is the relation to a concept?

A classifier assigns a class to an entity based on its attributes (attributes might be color, height, weight, shape, ..., classes might be car, house, person, banana, yes, ...). Formally, a classifier is a function $c:X\to C$ that assigns a class $c(x)\in C$ to every object $x\in X$. Hence, a concept is a special classifier with only two classes $C=\{\operatorname{true},\operatorname{false}\}$.

### b) Comparison of classifiers

Name three different classifiers and compare them. Think about biases and assumptions, separatrices, sensitivity, locality, parameters and speed. 

*Usually assuming 0 mean and only binary problems!*

| Classifier           | Biases and Assumptions | Separatrices | Sensitivity | Locality | Parameters | Speed |
|----------------------|------------------------|--------------|-------------|----------|------------|-------|
| Euclidean classifier | Voronoi tessellation around class centers | linear | sensitive to far outliers | global | none | very fast |
| Linear discriminant analysis | normally distributed data with equal covariances | linear | sensitive to far outliers | global | none | very fast |
| Quadratic classifier (e.g. QDA) | ? | conic: e.g. hyperbola, parabola, ellipse, line | sensitive to outliers | global | none | fast |
| Polynomial classifier | ? | almost arbitrary | overfitting for high degrees | global | polynomial degree | fast |
| Nearest neighbor classifier  | neighboring points have similar classes | implicit: neighbors (Voronoi cells around training data) | distance function | local | number of neighbors $k$ | $\mathcal{O}(N)$ (instant training, linear classification) |
| Bayesian classifier | expected cost is minimized | discriminant functions (probability distributions) | overlapping classifications (only probabilities), noise is modeled | global | none | varies (underlying data and method for discriminant functions, see ML-09 Slides 5f) |
| MLP (not necessarily binary) | smooth interpolation | almost arbitrary | noise sensitive | global | activation functions, learning rate | slow |
| RBFN (not necessarily binary) | locality in data/clusters | ellipses/circular | robust to noise | local | regions of responsibility, learning rate | comparably slow |
| SVM | Mercer's condition, input mapping, kernel function | high dimensional hyperplane, nonlinear in data space | handles noise with slack variables | global | none | efficient |

### c) SVM

What is a support vector? How does the kernel trick work?

Given a dataset with two classes, the support vectors are those vectors of each class that are closest to the vectors of the other class. The separatrix is then computed as the hyperplane with maximal distance (margin) to these support vectors.

In many cases, two classes cannot be separated by a simple hyperplane. However, often one can find an embedding of the data into a higher-dimensional space where it becomes linearly separable. The kernel trick uses the fact that for many tasks one does not have to compute the embedding explicitly, but it suffices to be able to compute the inner product of embedded datapoints, using an appropriate *kernel function*. This trick is most prominently used in support vector based classification, but it can also be used for other tasks like clustering and PCA.
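A small worked example of the kernel trick: for the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, the inner product in the embedding space can be computed without ever constructing the embedding. The explicit map `phi` below is the standard one for this kernel; the function names and example points are chosen for illustration only.

```python
import numpy as np

def phi(x):
    """Explicit embedding of a 2D point into a 3D feature space."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), kernel(x, z))   # both values agree up to rounding
```

The kernel needs one inner product in two dimensions, whereas the explicit route first builds two three-dimensional vectors. For higher degrees and input dimensions this gap grows quickly.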

## Recap 9: Reinforcement Learning [2 Points]

### a)

What is an agent in terms of reinforcement learning? Name an example of an agent.

An agent has sensors and can perform actions in a specified environment. (See PEAS: Performance, Environment, Actuators, Sensors)

Some possible agents are: mobile robots or game AI

### b)

What is the Markov assumption? How is it related to Q-learning? Give an example for which it does not hold.

The (first-order) Markov assumption means that the state $s_{t+1}$ depends only on its predecessor state $s_t$ and the action $a_t$ performed there, i.e. $s_{t+1} = \delta(s_t, a_t)$. This allows specifying a $Q$-function of the form $Q(s_t,a_t)$ instead of $Q(s_0,a_0,\ldots,s_t,a_t)$. The Markov assumption does not hold in situations where more information is needed than the previous state provides. For example, in sentence parsing with each word being a state, the Markov assumption does not hold, since the role of a word may depend on words much earlier in the sentence.

### c) 
What does the $Q$-function express in reinforcement learning and how is it used?

The $Q$-function provides the maximal (discounted) cumulative reward that can be achieved by performing action $a$ in state $s$ and acting optimally afterwards. The $Q$-function can be stated in a recursive form as
$$Q(s,a) = r(s,a) + \gamma\cdot\max_{a'\in\operatorname{Actions}(s')}{Q(s',a')},$$
with $\gamma$ being the discount factor and $s' = \delta(s,a)$ the successor state.
The $Q$-function can be used to define an optimal action selection policy for a reinforcement learning problem by
$$a^{\ast}(s) = \operatorname{argmax}_{a\in\operatorname{Actions}(s)}Q(s,a)$$
In $Q$-learning the $Q$-function is approximated by an iterative procedure.
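The recursive form can be turned into an iterative approximation directly. The sketch below uses a hypothetical deterministic corridor world with four states and a reward for reaching the right end. For brevity it sweeps over all state-action pairs with a known model instead of learning from sampled experience, so it illustrates the update rule rather than a full $Q$-learning agent.

```python
import numpy as np

n_states, actions = 4, (-1, +1)      # hypothetical corridor: move left or right
goal, gamma = 3, 0.9

def delta(s, a):
    """Deterministic successor function s' = delta(s, a)."""
    return min(max(s + a, 0), n_states - 1)

def reward(s, a):
    """Reward 1 for stepping into the goal state, 0 otherwise."""
    return 1.0 if s != goal and delta(s, a) == goal else 0.0

Q = np.zeros((n_states, len(actions)))
for _ in range(50):                   # repeat the sweep until the values settle
    for s in range(n_states):
        if s == goal:                 # the goal is terminal: its Q-values stay 0
            continue
        for i, a in enumerate(actions):
            Q[s, i] = reward(s, a) + gamma * Q[delta(s, a)].max()

policy = [actions[i] for i in Q.argmax(axis=1)]
print(Q.round(3))
print(policy[:goal])                  # greedy action for each non-terminal state
```

The greedy policy moves right in every non-terminal state, and the $Q$-values for moving right decay by the factor $\gamma$ with each additional step to the goal.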

Remark: There is no relation between the $Q$-function of reinforcement learning and the $Q$-function in the EM algorithm.

## Recap 10: Modeling Uncertainty [2 Points]

### a) Uncertainty

Why do we need to model uncertainties?

Uncertainty can occur due to many reasons:
* sparse data
* unreliable data (noise)
* uncertain outcomes
* high complexities

### b) Naive Bayes

What is a naive Bayes classifier? Why is it naive? 

Naive Bayes is a probabilistic classifier, i.e. a classifier that instead of a class assignment $x\mapsto c(x)\in C$ provides a probability value $P(C=c\mid X=x)$. The naive Bayes classifier applies Bayes' theorem to compute the posterior (diagnostic) probability $P(C\mid X)$ from likelihood values $P(X\mid C)$ (and prior $P(C)$) that have been learned from training data. Naivety refers to the simplifying assumption that the different features $X_1,\ldots,X_n$ are conditionally independent, given the class $C$, an assumption that may not be true in general but proves to work well in many applications.
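A minimal sketch of the posterior computation for two classes and two binary features. All probability values are hypothetical stand-ins for parameters that would be estimated from training data, and the function name `posterior` is chosen for illustration.

```python
import numpy as np

# Hypothetical parameters, as they might be estimated from training data
prior = np.array([0.6, 0.4])                 # P(C) for classes 0 and 1
# likelihood[c, i] = P(X_i = 1 | C = c) for two binary features
likelihood = np.array([[0.8, 0.3],
                       [0.1, 0.7]])

def posterior(x):
    """P(C | X = x) under the naive conditional independence assumption."""
    x = np.asarray(x)
    # P(X_i = x_i | C = c) for every class c and feature i
    per_feature = np.where(x == 1, likelihood, 1.0 - likelihood)
    # Independence given the class: the joint likelihood is a product
    unnormalized = prior * per_feature.prod(axis=1)
    return unnormalized / unnormalized.sum()

print(posterior([1, 0]))
```

The product over features is where the naive assumption enters: without it, one would need a full table of $P(X_1,\ldots,X_n\mid C)$, whose size grows exponentially with the number of features.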

### c) Probabilities

Given the following table, calculate the probability of drawing a blue candy blindly (assume the bags are chosen equally likely). Then calculate the probability that our drawn blue candy was drawn from the red bag.

|                | blue candies | green candies |
|----------------|--------------|---------------|
| **red bag**    |            5 |            10 |
| **yellow bag** |           20 |            10 |

$$P(C=b) = \frac{1}{2} \left( \frac{5}{15} + \frac{20}{30} \right) = \frac{1}{2}$$
$$P(B=r|C=b) = \frac{P(C=b|B=r)P(B=r)}{P(C=b)} = \frac{ \frac{5}{15} \frac{1}{2} }{ \frac{1}{2} } = \frac{5}{15} = \frac{1}{3}$$
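The two results can be double-checked with exact rational arithmetic. The variable and function names below are chosen for illustration only.

```python
from fractions import Fraction

candies = {"red": {"blue": 5, "green": 10},
           "yellow": {"blue": 20, "green": 10}}
p_bag = {bag: Fraction(1, 2) for bag in candies}    # bags are equally likely

def p_color_given_bag(color, bag):
    """P(C = color | B = bag), read off the table."""
    return Fraction(candies[bag][color], sum(candies[bag].values()))

# Total probability: sum over both bags
p_blue = sum(p_color_given_bag("blue", bag) * p_bag[bag] for bag in candies)
# Bayes' theorem for the red bag
p_red_given_blue = p_color_given_bag("blue", "red") * p_bag["red"] / p_blue
print(p_blue, p_red_given_blue)   # 1/2 1/3
```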