Osnabrück University - Machine Learning (Summer Term 2018) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 11

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 24, 2018**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Uncertainty and probability [6 Points]

This exercise will focus on concepts introduced in the first part of the lecture (ML-11).

### a) Modeling uncertainty

In the lecture it is claimed that probabilities can summarize several factors:

1. missing knowledge
1. incapability to devise complete models of complex domains
1. chance

Think of an example for each of these points and explain how probabilities can be applied in modeling your example.

1. If I know that the lawn sprinkler was on in 50% of the cases in which the grass is wet, then I do not know the cause for the other half of the cases, but I can still say that wet grass is due to the lawn sprinkler with a probability of 50%.
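2. The same example applies: a complete model of everything that can make grass wet is infeasible, so all unmodelled causes are summarized in the remaining 50%.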
3. Chance can be modelled by probability quite nicely, because probability essentially reflects the behaviour of chance. Think of a six-faced die: it can take any value between 1 and 6, and each outcome has a probability of $\frac{1}{6}$, which also accounts for the fact that this outcome might *not* occur.

### b) Inference by enumeration

Given the full joint distribution shown on (ML-11 slide 15), calculate the following:
1. $P(\neg toothache)$
1. $P(cavity)$
1. $P(toothache \mid cavity)$
1. $P(cavity \mid toothache \vee catch)$

|         | toothache, catch | toothache, ¬catch | ¬toothache, catch | ¬toothache, ¬catch |
|:--------|-------------------------:|--------------------------:|--------------------------:|---------------------------:|
| cavity  |                    0.108 |                     0.012 |                     0.072 |                      0.008 |
| ¬cavity |                    0.016 |                     0.064 |                     0.144 |                      0.576 |

1. $P(\neg toothache) = 0.072 + 0.008 + 0.144 + 0.576 = 0.8$
2. $P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2$
3. 
\begin{align*}
P(toothache~|~cavity) &= \frac{0.108 + 0.012}{0.2}\\
&= 0.6
\end{align*}
4. 
\begin{align*}
P(cavity~|~toothache\lor catch) &= \frac{P(cavity\land (toothache\lor catch))}{P(toothache\lor catch)}\\
&= \frac{0.108 + 0.012 + 0.072}{0.108 + 0.012 + 0.072 + 0.016 + 0.064 + 0.144}\\
&\approx 0.4615
\end{align*}
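
The four results above can be cross-checked with a short script. This is only a minimal sketch of inference by enumeration: the dictionary `JOINT` re-types the table above, and the helper names `probability` and `conditional` are made up for this illustration.

```python
# Full joint distribution from the table above,
# keyed by the truth values of (cavity, toothache, catch).
JOINT = {
    (True, True, True): 0.108,
    (True, True, False): 0.012,
    (True, False, True): 0.072,
    (True, False, False): 0.008,
    (False, True, True): 0.016,
    (False, True, False): 0.064,
    (False, False, True): 0.144,
    (False, False, False): 0.576,
}


def probability(event):
    """Sum the probabilities of all atomic events in which `event` holds."""
    return sum(p for world, p in JOINT.items() if event(*world))


def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    both = probability(lambda *world: event(*world) and given(*world))
    return both / probability(given)


p_not_toothache = probability(lambda cavity, toothache, catch: not toothache)
p_cavity = probability(lambda cavity, toothache, catch: cavity)
p_toothache_given_cavity = conditional(
    lambda cavity, toothache, catch: toothache,
    lambda cavity, toothache, catch: cavity,
)
p_cavity_given_toothache_or_catch = conditional(
    lambda cavity, toothache, catch: cavity,
    lambda cavity, toothache, catch: toothache or catch,
)

print(p_not_toothache, p_cavity)
print(p_toothache_given_cavity, p_cavity_given_toothache_or_catch)
```

Representing events as predicates over atomic events keeps the code close to the definition: every query is just a sum over the matching table cells.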

### c) Conditional probability

For each of the following statements, either prove it is true or give a counterexample.
1. If $P(a \mid b, c) = P(b \mid a, c)$, then $P(a \mid c) = P(b \mid c)$
1. If $P(a \mid b, c) = P(a)$, then $P(b \mid c) = P(b)$
1. If $P(a \mid b) = P(a)$, then $P(a \mid b, c) = P(a \mid c)$

### d) Independence and conditional independence


It is quite often useful to consider the effect of some specific propositions in the
context of some general background evidence that remains fixed, rather than in the complete
absence of information. The following questions ask you to prove more general versions of
the product rule and Bayes’ rule, with respect to some background evidence e:

1. Prove the conditionalized version of the general product rule:
$$P(X, Y \mid e) = P(X \mid Y, e)\cdot P(Y \mid e) .$$
1. Prove the conditionalized version of Bayes’ rule:
$$P(Y \mid X, e) = \frac{P(X \mid Y, e)\cdot P(Y \mid e)}{P(X \mid e)} $$
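
One possible derivation, sketched under the assumption that all conditioning events have non-zero probability (so that every fraction is defined), uses only the definition of conditional probability $P(A \mid B) = P(A, B) / P(B)$:

\begin{align*}
P(X \mid Y, e)\cdot P(Y \mid e) &= \frac{P(X, Y, e)}{P(Y, e)}\cdot\frac{P(Y, e)}{P(e)}\\
&= \frac{P(X, Y, e)}{P(e)}\\
&= P(X, Y \mid e)
\end{align*}

Applying this rule in both orders gives $P(X \mid Y, e)\cdot P(Y \mid e) = P(X, Y \mid e) = P(Y \mid X, e)\cdot P(X \mid e)$, and dividing by $P(X \mid e)$ yields the conditionalized Bayes' rule.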

### e) Naive Bayes models

Text categorization is the task of assigning a given document to one of a fixed set of
categories on the basis of the text it contains. Naive Bayes models are often used for this
task. In these models, the query variable is the document category, and the “effect” variables
are the presence or absence of each word in the language; the assumption is that words occur
independently in documents, with frequencies determined by the document category.
1. Explain precisely how such a model can be constructed, given as “training data” a set of documents that have been assigned to categories.
1. Explain precisely how to categorize a new document.
1. Is the conditional independence assumption reasonable? Discuss.

1. Given the training data, one can count how often each word appears in the documents of each category. These counts can be used to estimate the necessary prior probabilities $P(category)$ and conditional probabilities $P(word \mid category)$, which can in turn be used to answer queries.
2. Given a new document, one determines which words of the vocabulary are present, multiplies each category's prior with the conditional probabilities of these observations given that category, and assigns the category with the highest resulting probability.
3. It is not accurate, but it is reasonable in the sense that it is rarely possible to know all dependencies between variables. It is not accurate in the sense that in real life there are barely any completely independent events (consider medicine, where it is often not possible to say which factors contribute to the development of a disease).
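
The construction and the classification step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `train_naive_bayes` and `classify`, the toy documents, and the use of Laplace (add-one) smoothing to avoid zero probabilities are all assumptions that are not part of the exercise text.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Estimate priors and word likelihoods from (words, category) pairs."""
    category_counts = Counter(category for _, category in documents)
    vocabulary = set().union(*(words for words, _ in documents))
    # docs_with_word[category][word] = number of documents containing the word
    docs_with_word = defaultdict(Counter)
    for words, category in documents:
        docs_with_word[category].update(set(words))
    total = len(documents)
    priors = {c: n / total for c, n in category_counts.items()}
    # Assumption: Laplace smoothing, P(word | c) = (count + 1) / (docs in c + 2)
    likelihoods = {
        c: {w: (docs_with_word[c][w] + 1) / (category_counts[c] + 2)
            for w in vocabulary}
        for c in category_counts
    }
    return priors, likelihoods, vocabulary


def classify(words, priors, likelihoods, vocabulary):
    """Return the category with the highest posterior (computed in log space)."""
    words = set(words)
    scores = {}
    for category, prior in priors.items():
        log_score = math.log(prior)
        for w in vocabulary:
            p = likelihoods[category][w]
            # Presence contributes P(w | c), absence contributes 1 - P(w | c).
            log_score += math.log(p if w in words else 1.0 - p)
        scores[category] = log_score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
TRAINING = [
    ({"cheap", "pills", "buy"}, "spam"),
    ({"buy", "now", "cheap"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"lunch", "tomorrow", "meeting"}, "ham"),
]
model = train_naive_bayes(TRAINING)
print(classify({"cheap", "buy"}, *model))
print(classify({"meeting"}, *model))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when the vocabulary is large.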

## Assignment 2: Bayes networks [4 Points]

### a) Bayes networks

Explain in your own words the idea of a Bayes network. How is conditional independence represented in such a network? How can the full joint distribution be regained from such a network?

A Bayesian network is a probabilistic graphical model that represents the dependencies between random variables (events or other observable quantities). It is represented as a directed acyclic graph: each node is a variable that stores a conditional probability table giving its probability for every assignment of its parents' values, and each edge indicates a direct dependence (A $\to$ B means that B depends on A). Conditional independence is encoded by missing edges: given its parents, a node is conditionally independent of its non-descendants.

Using the chain rule, the full joint distribution is obtained as the product of all these conditional probabilities: $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))$.

### b) Independence in Bayes networks

Consider the Bayes network in (ML-11 slide 32):
1. If no evidence is observed, are Burglary and Earthquake independent? Prove this from the numerical semantics and from the topological semantics.
1. If we observe Alarm = true, are Burglary and Earthquake independent? Justify your answer by calculating whether the probabilities involved satisfy the definition of conditional independence.

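1. Yes, they are independent. Topologically, Burglary and Earthquake are both root nodes without parents, and the only path between them runs through their common child Alarm, which is unobserved. Numerically, summing the factorized joint distribution over all other variables leaves $P(B, E) = P(B)\cdot P(E)$.
2. No. Alarm is a common effect of both causes, so once Alarm = true is observed, learning that one cause occurred makes the other less likely ("explaining away"). Hence $P(B, E \mid a) \neq P(B \mid a)\cdot P(E \mid a)$ in general.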
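
Both questions can be checked numerically with a small script. The conditional probability values below are an assumption: they are the textbook values commonly used for this burglary network and are not copied from slide 32, so they should be replaced by the lecture's numbers if those differ. Only the three variables Burglary, Earthquake and Alarm are modelled, because the remaining nodes sum out.

```python
from itertools import product

# ASSUMED parameters (common textbook values, not taken from the slide).
P_BURGLARY = 0.001
P_EARTHQUAKE = 0.002
P_ALARM = {  # P(alarm = true | burglary, earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}

WORLDS = list(product([True, False], repeat=3))  # (burglary, earthquake, alarm)


def joint(b, e, a):
    """Factorized joint: P(b) * P(e) * P(a | b, e)."""
    p_b = P_BURGLARY if b else 1 - P_BURGLARY
    p_e = P_EARTHQUAKE if e else 1 - P_EARTHQUAKE
    p_a = P_ALARM[(b, e)] if a else 1 - P_ALARM[(b, e)]
    return p_b * p_e * p_a


def prob(event, given=lambda b, e, a: True):
    """P(event | given), computed by enumeration over all worlds."""
    numerator = sum(joint(*w) for w in WORLDS if event(*w) and given(*w))
    denominator = sum(joint(*w) for w in WORLDS if given(*w))
    return numerator / denominator


def burglary(b, e, a):
    return b


def earthquake(b, e, a):
    return e


def both(b, e, a):
    return b and e


def alarm(b, e, a):
    return a


# 1. No evidence: the joint equals the product of the marginals.
marginal_gap = abs(prob(both) - prob(burglary) * prob(earthquake))

# 2. Alarm observed: the joint differs from the product of the conditionals.
conditional_gap = abs(
    prob(both, alarm) - prob(burglary, alarm) * prob(earthquake, alarm)
)

print(marginal_gap, conditional_gap)
```

With these assumed numbers the first gap is zero up to rounding, while the second is clearly non-zero, which matches the two answers above.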

# Recap (part I)

This part of the sheet is intended to revise some topics from the lecture; a second part will follow on the next sheet. These exercises do not need to be solved in order to qualify for the final exam, but doing so is highly recommended as preparation. Also, if you hit any question that should be discussed in more detail, please let us know.

## Recap 1: Concept Learning [2 Points]

### a) Concept Learning

What is Concept Learning? Is it supervised? Is it local?

Concept learning means devising a model that has learned a concept like *car* and using it to classify whether a given example is an instance of that concept or not (boolean). It is supervised because the correct class/concept of the training data needs to be known during the training process. Concept learning searches through the hypothesis space to find the best-fitting hypotheses.

### b) Find-S
Describe the Find-S Algorithm in pseudo code. What is its inductive bias? What are its advantages and drawbacks?

**Pseudo Code:**  
```
initialize S to the most specific hypothesis (empty)
for all training samples x with c(x) = 1:
    if S is not consistent with x:
        generalize S minimally such that it is consistent with x
```
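
The pseudo code can be turned into a short runnable sketch. It assumes that hypotheses are conjunctions of attribute constraints, that `"?"` stands for "any value", and that `None` stands for the most specific (empty) hypothesis; the function name `find_s` and the toy examples are made up for this illustration.

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses over fixed-length attribute tuples.

    `examples` is a list of (attributes, label) pairs with boolean labels.
    Returns a list of constraints, or None if there is no positive example.
    """
    hypothesis = None  # most specific hypothesis: covers no instance at all
    for attributes, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            # First positive example: adopt it as is.
            hypothesis = list(attributes)
            continue
        # Minimal generalization: relax every constraint that disagrees.
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"
    return hypothesis


# Hypothetical toy data: (sky, temperature, humidity) -> enjoy sport?
EXAMPLES = [
    (("sunny", "warm", "normal"), True),
    (("sunny", "warm", "high"), True),
    (("rainy", "cold", "high"), False),
]
hypothesis = find_s(EXAMPLES)
print(hypothesis)
```

Skipping the negative examples makes one drawback visible: the algorithm never notices if the training data is inconsistent with the resulting hypothesis.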
        
**Inductive Bias:**  
Inductive bias is the set of assumptions underlying a learning algorithm, which is used in order to learn (e.g. generalize) from already seen data. Without inductive bias, one cannot learn, because no implications can be drawn from already seen data. The disadvantage of inductive bias is that it dictates *how* an algorithm learns. For Find-S, the bias is the assumption that the target concept can be expressed in the hypothesis space, i.e. as a conjunction of attribute constraints.

### c) Hypotheses space

What is the hypotheses space for Candidate-Elimination used in the lecture?

The hypothesis space for Candidate-Elimination contains all possible hypotheses consisting of conjunctions of literals. It does not contain disjunctive hypotheses.

## Recap 2: Decision Trees [2 Points]

### a) Overfitting
What is overfitting? How can it be avoided?

Overfitting is the process of training a model such that it learns too many minor details of the training data, leading to worse performance on previously unseen data, because it learns statistically irrelevant features.  

It can be avoided by finding a suitable minimum node size for decision trees, pruning the resulting tree, or employing the method of *random forests*, which uses several independent trees, each trained on a random subset of the actual training data.

### b) Pruning

Name one method for pruning a decision tree and describe it!

**Reduced Error Pruning:**  
This method replaces the subtree below a node with a leaf that is assigned the most common class of the training examples at that node, and repeats this as long as the performance on an independent validation set does not decrease.

### c) Information gain
What are entropy and information gain? Provide explanation and formulae. How are they used in ID3?

Entropy is a measure of data impurity. Consider an example of a data set, which can be labelled with one of two classes *A* or *B*. When I say that all data belongs to class *A*, but there are data for which that is not the case, then my dataset is somewhat impure, and entropy measures *how* impure a dataset is and can be calculated by:
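$$\text{Entropy}(S) = -\sum_{c\in C}p_c(S)\log_2(p_c(S))$$
where $C$ is the set of classes and $p_c(S)$ is the proportion of samples in $S$ that belong to class $c$.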

Information gain describes how much *purer* a dataset becomes if a split for feature *x* is performed, for which it uses the entropy of a dataset. Consider the example below: Let's say my data consists of 5 apples, which each have the two features **expired** and **color**:

| Apple | expired | color | tasty |
| ----- | ------- | ----- | ----- |
|   1   | true    | green | false |
|   2   | true    |  red  | false |
|   3   | false   | green | true  |
|   4   | false   |  red  | true  |
|   5   | false   | green | true  |

Before I split the data, only 60% of all apples are tasty. If I split the dataset for the feature *color*, then 50% of red apples are tasty, while about 67% of green apples are tasty. If I split the dataset for the feature *expired*, then 100% of the **not** expired apples are tasty, while 0% of the expired apples are tasty.  
So the second split gives me more information than the first split, because it reduces the entropy of the resulting datasets. In general, information gain is calculated as follows:
$$\text{Information Gain}(S,A) = \text{Entropy}(S) -\sum_{v\in values(A)} \text{Entropy}(S_v)\cdot\frac{|S_v|}{|S|}$$
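
The apple example can be worked through numerically. The sketch below re-types the table above and uses the base-2 logarithm, so entropy is measured in bits; the names `APPLES`, `entropy` and `information_gain` are chosen for this illustration only.

```python
import math
from collections import Counter

# The five apples from the table above.
APPLES = [
    {"expired": True, "color": "green", "tasty": False},
    {"expired": True, "color": "red", "tasty": False},
    {"expired": False, "color": "green", "tasty": True},
    {"expired": False, "color": "red", "tasty": True},
    {"expired": False, "color": "green", "tasty": True},
]


def entropy(rows, target="tasty"):
    """Entropy (in bits) of the target labels in `rows`."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target="tasty"):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder


gain_expired = information_gain(APPLES, "expired")
gain_color = information_gain(APPLES, "color")
print(round(entropy(APPLES), 3), round(gain_expired, 3), round(gain_color, 3))
```

Splitting on *expired* removes all impurity, so its gain equals the entropy of the unsplit data, whereas splitting on *color* gains almost nothing. ID3 would therefore pick *expired* as the split attribute at this node.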

## Recap 3: Data Mining [2 Points]

### a) Missing values

How can you deal with missing values? Name an important algorithm and explain how to use it.

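Missing values can be filled in with a prediction from linear regression or simply with the mean or median of the present data. An important algorithm is **Expectation Maximization** (EM): starting from an initial parameter guess, the E-step computes the expected values of the missing entries under the currently estimated distribution, the M-step re-estimates the distribution's parameters from the completed data, and both steps are repeated until the estimate converges.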

### b) Outliers

What are outliers? Can we detect them? If so, how?

Outliers are data samples that have extreme values compared to the rest of the dataset. For example, if I have 200 data samples which contain values between 0 and 1, except for one datum whose value is 42, then that datum is an outlier. Outliers can occur due to technical reasons (regarding measurement) or because the data naturally has high variation.  
Outliers can be detected by calculating their distance from the median $m$ in terms of the standard deviation $\sigma$:
$$z_i = \frac{|x_i - m|}{\sigma}$$
A datum with a z-value greater than 3.5 is then considered an outlier and removed from the dataset. An algorithm that employs this method is the *Rosner Test*, where outliers are removed iteratively following this scheme:
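```Python
import statistics

def remove_outliers(data, threshold=3.5):
    # Work on a copy so the caller's data stays untouched.
    data = list(data)
    while len(data) > 2:
        m = statistics.median(data)
        s = statistics.stdev(data)
        # z-value: absolute distance from the median in standard deviations
        z_values = [abs(x - m) / s for x in data] if s > 0 else [0.0]
        z_max = max(z_values)
        if z_max <= threshold:
            break
        # Remove the single most extreme datum, then re-estimate m and s.
        data.pop(z_values.index(z_max))
    return data
```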

### c) Q-function
What does the Q-function express in the EM algorithm?

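It expresses the expected value of the complete-data log-likelihood as a function of the parameters $\theta$, where the expectation over the unobserved (latent or missing) variables $Z$ is taken under the current parameter estimate $\theta_t$: $Q(\theta \mid \theta_t) = E_{Z \mid X, \theta_t}[\log P(X, Z \mid \theta)]$. The M-step then maximizes $Q$ over $\theta$.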

## Recap 4: Clustering [2 Points]

### a) Clustering

Explain the difference between single-linkage and complete-linkage clustering.

Single- and complete-linkage clustering differ only in the cluster distance they use to decide which two clusters to merge next. Single-linkage employs the *minimum* cluster distance $D_{min} = \min_{x\in X,y\in Y} d(x,y)$ and complete-linkage employs the *maximum* cluster distance $D_{max} = \max_{x\in X,y\in Y} d(x,y)$, where $d(x,y)$ is a distance function between data points, e.g. the Euclidean distance.  

Due to these differing cluster distances, single-linkage clustering prefers long thin clusters, where the minimum distance between two clusters is small but the maximum distance may be quite large. On the other hand, complete-linkage clustering prefers compact clusters.

### b) Metrics

Name three different distance measures and briefly explain them. Check the metric axioms for one of them.

\begin{align*}
D_{min} &= \min_{x\in X,y\in Y} d(x,y)\\
D_{max} &= \max_{x\in X,y\in Y} d(x,y)\\
D_{mean} &= \frac{1}{|X||Y|}\sum_{x\in X,y\in Y} d(x,y)\\
\end{align*}
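
The three cluster distances can be computed directly from their definitions. This is a minimal sketch that assumes $d$ is the Euclidean distance (via `math.dist`, available from Python 3.8) and that clusters are given as lists of coordinate tuples; the function name `cluster_distances` is made up for this illustration.

```python
import math
from itertools import product


def cluster_distances(X, Y):
    """Return (D_min, D_max, D_mean) between two clusters of points."""
    # Euclidean distance d(x, y) for every pair with x in X and y in Y.
    distances = [math.dist(x, y) for x, y in product(X, Y)]
    d_min = min(distances)
    d_max = max(distances)
    d_mean = sum(distances) / len(distances)  # len == |X| * |Y|
    return d_min, d_max, d_mean


# Hypothetical clusters on a line.
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(3.0, 0.0), (5.0, 0.0)]
d_min, d_max, d_mean = cluster_distances(X, Y)
print(d_min, d_max, d_mean)
```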

**Metric Axioms:**  
1. $D(x,y) = \begin{cases} > 0 & \text{if } x \neq y\\ = 0 & \text{else}\end{cases}$
2. $D(x,y) = D(y,x)$
3. $D(x,y) \leq D(x,z) + D(z,y)$  

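For the underlying Euclidean distance $d(x,y)$ between points:
1. holds, because the Euclidean distance is 0 for equal values ($x=y$) and greater than 0 otherwise
2. holds, because the Euclidean distance is symmetric
3. holds, because the triangle inequality is satisfied by the Euclidean distance

Note that the cluster distance $D_{min}$ does not inherit all of these properties: for $X=\{0\}$, $Z=\{0, 10\}$ and $Y=\{10\}$ we get $D_{min}(X,Z) = D_{min}(Z,Y) = 0$ but $D_{min}(X,Y) = 10$, which violates axioms 1 and 3.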

### c) Mixture models

What is a mixture model? Explain. Can you provide a formula?

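A mixture model describes the data distribution as a weighted sum of several component distributions, for example Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ with mixing weights $\pi_k \geq 0$ and $\sum_k \pi_k = 1$. It can be seen as a soft version of k-means: each datum belongs to every component with some probability instead of to exactly one cluster, and the parameters are typically fitted with the EM algorithm.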

## Recap 5: Dimension Reduction [2 Points]

### a) Visualization

Name three different data visualization techniques to visualize high dimensional data. Explain one in detail.

* PCA (principal component analysis)
* Scatterplot Matrix
* Parallel Plot

A scatterplot matrix is a matrix of all combinations of 2 features visualized in a scatterplot each (except for the combinations where $x=y$).
![Example Image](http://support.sas.com/documentation/cdl/en/grstatproc/62603/HTML/default/images/gsgscmat.gif)

### b) PCA

Draw a few data points (ASCII arts or on a sheet of paper) and mark the principal components. What are the principal components?

**Principal Components** are the axes of the n-dimensional ellipsoid that is fit to the data; the length of each axis indicates the amount of variance in that direction.


### c) Covariance matrix
What does a covariance matrix express? How is it computed from data? How is it used in PCA?

A covariance matrix contains the covariance between all pairs of elements of an n-dimensional random variable. Covariance is a measure of the joint variability of two elements: it is positive if both tend to increase together and negative if one tends to increase while the other decreases.

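For two variables $x$ and $y$ it can be calculated by
$$\text{cov}(x, y) = E\left[(x-\mu_x)\cdot(y-\mu_y)\right]$$
where $\mu_x$ is the mean of variable $x$, $\mu_y$ is the mean of variable $y$, and $E$ is the expected value (mean). For data, the expectation is estimated by averaging over all samples.

In PCA, the eigenvectors and eigenvalues of the covariance matrix are calculated: the eigenvectors are the principal components onto which the data is projected, and the eigenvalues give the variance along each of them.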
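
For two-dimensional data the whole procedure fits into a few lines. The sketch below is an illustration under stated assumptions: it divides by $n$ (the plain mean, matching the expected-value formula) and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix instead of a numerical library; the function name `pca_2d` and the sample points are made up.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, principal_axes) of 2-D data, largest variance first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the covariance matrix [[c_xx, c_xy], [c_xy, c_yy]].
    c_xx = sum((x - mean_x) ** 2 for x, _ in points) / n
    c_yy = sum((y - mean_y) ** 2 for _, y in points) / n
    c_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (c_xx + c_yy) / 2
    radius = math.hypot((c_xx - c_yy) / 2, c_xy)
    eigenvalues = (half_trace + radius, half_trace - radius)
    # Direction of the first principal axis; the second is perpendicular.
    angle = 0.5 * math.atan2(2 * c_xy, c_xx - c_yy)
    first = (math.cos(angle), math.sin(angle))
    second = (-math.sin(angle), math.cos(angle))
    return eigenvalues, (first, second)


# Hypothetical data lying exactly on the diagonal y = x.
POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
eigenvalues, axes = pca_2d(POINTS)
print(eigenvalues, axes)
```

For these points the first principal component points along the diagonal and carries all of the variance, while the second eigenvalue is zero. For higher-dimensional data a numerical eigen-solver would be used instead of the closed form.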