Osnabrück University - Machine Learning (Summer Term 2018) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 11

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 24, 2018**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Uncertainty and probability [6 Points]

This exercise focuses on concepts introduced in the first part of the lecture (ML-11).

### a) Modeling uncertainty

In the lecture it is claimed that probabilities can summarize several factors:

1. missing knowledge
1. incapability to devise complete models of complex domains
1. chance

Think of an example for each of these points and explain how probabilities can be applied in modeling your example.

### b) Inference by enumeration

Given the full joint distribution below, calculate the following:

|         | toothache, catch | toothache, ¬catch | ¬toothache, catch | ¬toothache, ¬catch |
|:--------|-----------------:|------------------:|------------------:|-------------------:|
| cavity  |                    0.108 |                     0.012 |                     0.072 |                      0.008 |
| ¬cavity |                    0.016 |                     0.064 |                     0.144 |                      0.576 |

1. $P(\neg toothache)$
1. $P(cavity)$
1. $P(toothache \mid cavity)$
1. $P(cavity \mid toothache \vee catch)$

If you are familiar with `pandas` you can use the dataframe below to find the solutions. You can of course also write code without using pandas or calculate the answers manually. 

1. This asks for the probability that Toothache is false. Summing the entries of the two toothache columns gives
$$P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2,$$
hence $P(\neg toothache) = 1 - 0.2 = 0.8$.
1. This asks for the vector of probability values for the random variable Cavity. It has two values, which we list in the order $\langle true, false\rangle$. First add up $0.108 + 0.012 + 0.072 + 0.008 = 0.2$. Then we have
$$P(Cavity) = \langle 0.2, 0.8\rangle,$$
so in particular $P(cavity) = 0.2$.
1. This asks for the vector of probability values for Toothache, given that Cavity is true.
$$P(Toothache \mid cavity) = \langle(0.108 + 0.012)/0.2, (0.072 + 0.008)/0.2\rangle = \langle 0.6, 0.4\rangle,$$
so in particular $P(toothache \mid cavity) = 0.6$.
1. This asks for the vector of probability values for Cavity, given that either Toothache or Catch is true. First compute $P(toothache\vee catch) = 0.108 + 0.012 + 0.016 + 0.064 + 0.072 + 0.144 = 0.416$. Then
$$P(Cavity \mid toothache \vee catch) =
\langle(0.108 + 0.012 + 0.072)/0.416, (0.016 + 0.064 + 0.144)/0.416\rangle =
\langle 0.4615, 0.5385 \rangle$$


In [1]:
import pandas as pd
columns = pd.MultiIndex.from_product((('toothache', '¬toothache'), ('catch', '¬catch'))) 
index = ('cavity', '¬cavity')
data = [[0.108, 0.012, 0.072, 0.008],
        [0.016, 0.064, 0.144, 0.576]]
joint_distribution = pd.DataFrame(data, index, columns)
joint_distribution

,toothache,toothache,¬toothache,¬toothache
,catch,¬catch,catch,¬catch
cavity,0.108,0.012,0.072,0.008
¬cavity,0.016,0.064,0.144,0.576


In [2]:
# 1.
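# Sum over cavity/¬cavity, then over catch/¬catch within each toothache value.
# (Series.sum(level=...) was removed in pandas 2.0; groupby(level=...) replaces it.)
joint_distribution.sum().groupby(level=0).sum()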

toothache     0.2
¬toothache    0.8
dtype: float64

In [3]:
# 2.
joint_distribution.sum(axis='columns')

cavity     0.2
¬cavity    0.8
dtype: float64

In [4]:
# 3.
# Restrict to the cavity row, marginalize out catch, and normalize by P(cavity).
cavity_row = joint_distribution.loc['cavity', :]
cavity_row.groupby(level=0).sum() / cavity_row.sum()

toothache     0.6
¬toothache    0.4
Name: cavity, dtype: float64

In [5]:
# 4. 
toothache_or_catch = joint_distribution.sum(axis='columns') - joint_distribution.loc[:, ('¬toothache', '¬catch')]
toothache_or_catch / toothache_or_catch.sum()

cavity     0.461538
¬cavity    0.538462
dtype: float64

### c) Conditional probability


For each of the following statements, either prove it is true or give a counterexample.

1. If $P(a \mid b, c) = P(b \mid a, c)$, then $P(a \mid c) = P(b \mid c)$
1. If $P(a \mid b, c) = P(a)$, then $P(b \mid c) = P(b)$
1. If $P(a \mid b) = P(a)$, then $P(a \mid b, c) = P(a \mid c)$

1. True. By the product rule we know $P(b, c)\,P(a \mid b, c) = P(a, c)\,P(b \mid a, c)$, which by assumption reduces to $P(b, c) = P(a, c)$. Dividing through by $P(c)$ gives the result.
1. False. The statement $P(a \mid b, c) = P(a)$ merely states that $a$ is independent of $b$ and $c$; it makes no claim regarding the dependence of $b$ and $c$. A counter-example: $a$ and $b$ record the results of two independent coin flips, and $c = b$. Then $P(a \mid b, c) = P(a) = 1/2$, but $P(b \mid c) = 1 \neq 1/2 = P(b)$.
1. False. While the statement $P(a \mid b) = P(a)$ implies that $a$ is independent of $b$, it does not imply that $a$ is conditionally independent of $b$ given $c$. A counter-example: $a$ and $b$ record the results of two independent coin flips, and $c$ equals the xor of $a$ and $b$. Then $P(a \mid b) = P(a) = 1/2$, but $P(a \mid b, c) = 0 \neq 1/2 = P(a \mid c)$.
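
The xor counter-example for the third statement can be checked by brute-force enumeration. The following is a minimal sketch; the helper names `prob` and `cond` are made up for this illustration, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

# Joint distribution over (a, b, c): a and b are independent fair coin flips,
# and c is the xor of a and b.
joint = {}
for a, b in product((False, True), repeat=2):
    joint[(a, b, a != b)] = Fraction(1, 4)


def prob(event):
    """Probability of all outcomes (a, b, c) for which event(a, b, c) holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))


def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)


p_a = prob(lambda a, b, c: a)
p_a_given_b = cond(lambda a, b, c: a, lambda a, b, c: b)
p_a_given_c = cond(lambda a, b, c: a, lambda a, b, c: c)
p_a_given_bc = cond(lambda a, b, c: a, lambda a, b, c: b and c)

print(p_a, p_a_given_b)            # premise holds: both values agree
print(p_a_given_bc, p_a_given_c)   # conclusion fails: the two values differ
```

Replacing the third component by `b` instead of `a != b` gives the counter-example for the second statement in the same way.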

### d) Independence and conditional independence


It is quite often useful to consider the effect of some specific propositions in the
context of some general background evidence that remains fixed, rather than in the complete
absence of information. The following questions ask you to prove more general versions of
the product rule and Bayes’ rule, with respect to some background evidence e:

1. Prove the conditionalized version of the general product rule:
$$P(X, Y \mid e) = P(X \mid Y, e)\cdot P(Y \mid e) .$$
1. Prove the conditionalized version of Bayes’ rule:
$$P(Y \mid X, e) = \frac{P(X \mid Y, e)\cdot P(Y \mid e)}{P(X \mid e)} $$

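The basic axiom to use here is the definition of conditional probability. In the following, $A$, $B$ and $E$ stand for arbitrary events, e.g. particular values of $X$ and $Y$ and the evidence $e$: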

1. We have
$$P(A, B|E) = \frac{P(A, B, E)}{P(E)}$$
and
$$P(A|B, E)\cdot P(B|E) = \frac{P(A, B, E)}{P(B, E)}\cdot \frac{P(B, E)}{P(E)}
=\frac{P(A, B, E)}{P(E)}$$ 
hence
$$P(A, B|E) = P(A|B, E)\cdot P(B|E)$$
1. The derivation here is the same as the derivation of the simple version of Bayes’ Rule. First write down the dual form of the conditionalized product rule, simply by switching A and B in the above derivation:
$$P(A, B|E) = P(B|A, E)\cdot P(A|E)$$
Therefore the two right-hand sides are equal:
$$P(B|A, E)\cdot P(A|E) = P(A|B, E)\cdot P(B|E)$$
Dividing through by $P(B|E)$ you get
$$P(A|B, E) = \frac{P(B|A, E)\cdot P(A|E)}{P(B|E)}$$


### e) Naive Bayes models

Text categorization is the task of assigning a given document to one of a fixed set of
categories on the basis of the text it contains. Naive Bayes models are often used for this
task. In these models, the query variable is the document category, and the “effect” variables
are the presence or absence of each word in the language; the assumption is that words occur
independently in documents, with frequencies determined by the document category.
1. Explain precisely how such a model can be constructed, given as “training data” a set of documents that have been assigned to categories.
1. Explain precisely how to categorize a new document.
1. Is the conditional independence assumption reasonable? Discuss.

1. The model consists of the prior probability $P(Category)$ and the conditional probabilities $P(Word_i|Category)$. For each category $c$, $P(Category = c)$ is estimated as the fraction of all documents that are of category $c$. Similarly, 
$P(Word_i = true|Category = c)$ is estimated as the fraction of documents of category $c$ that contain $Word_i$.
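1. Here, every evidence variable $Word_i$ is observed, since we can tell if any given word appears in a given document or not. Hence by Bayes' rule and the conditional independence assumption we can compute the posterior over categories as $P(Category \mid word_1,\ldots,word_n) = \alpha\, P(Category)\prod_i P(word_i \mid Category)$, where $\alpha$ is a normalization constant, and assign the document to the category with the highest posterior probability.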
1. The independence assumption is clearly violated in practice. For example, the word pair "machine learning" occurs more frequently in any given document category than would be suggested by multiplying the probabilities of "machine" and "learning".
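
The construction and the classification step can be sketched in a few lines. This is only an illustration under assumptions: the toy documents and categories are made up, and the `smoothing` parameter (add-one smoothing) is an addition not mentioned in the answer above; with `smoothing=0` the estimates reduce to the plain document fractions described there.

```python
from collections import Counter, defaultdict


def train(documents, smoothing=1.0):
    """Estimate P(Category) and P(Word_i = true | Category) from (words, category) pairs."""
    vocabulary = {w for words, _ in documents for w in words}
    docs_per_category = Counter(category for _, category in documents)
    docs_with_word = defaultdict(Counter)
    for words, category in documents:
        docs_with_word[category].update(set(words))  # count each word once per document
    prior = {c: n / len(documents) for c, n in docs_per_category.items()}
    likelihood = {
        c: {w: (docs_with_word[c][w] + smoothing) / (n + 2 * smoothing) for w in vocabulary}
        for c, n in docs_per_category.items()
    }
    return prior, likelihood


def classify(words, prior, likelihood):
    """Return the normalized posterior P(Category | word_1, ..., word_n)."""
    present = set(words)
    scores = {}
    for c, p in prior.items():
        for w, p_w in likelihood[c].items():
            # Every vocabulary word is evidence: either present or absent.
            p *= p_w if w in present else 1.0 - p_w
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical training data: (list of words, category).
training_data = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "match"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["party", "win", "election"], "politics"),
]
prior, likelihood = train(training_data)
posterior = classify(["team", "match", "win"], prior, likelihood)
print(max(posterior, key=posterior.get), posterior)
```

For realistic vocabularies the product of many small probabilities underflows, so one would typically sum log-probabilities instead.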


## Assignment 2: Bayes networks [4 Points]

### a) Bayes networks

Explain in your own words the idea of a Bayes network. How is conditional independence represented in such a network? How can the full joint distribution be regained from such a network?

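A Bayes network is a directed acyclic graph whose nodes are random variables and whose edges represent direct dependencies. Each node $X_i$ carries a conditional probability table $P(X_i\mid\textit{Parents}(X_i))$. Conditional independence is represented by the graph structure: each variable is conditionally independent of its non-descendants given its parents. The full joint distribution is regained as the product of all local conditional distributions, $P(X_1,\ldots,X_n) = \prod_{i=1}^n P(X_i\mid\textit{Parents}(X_i))$.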

### b) Independence in Bayes networks

Consider the Bayes network in (ML-11 slide 32):
1. If no evidence is observed, are Burglary and Earthquake independent? Prove this from the numerical semantics and from the topological semantics.
1. If we observe Alarm = true, are Burglary and Earthquake independent? Justify your answer by calculating whether the probabilities involved satisfy the definition of conditional independence.

1. Yes. According to (ML-11 slide 33)
\begin{align*}
  P(X_1,\ldots,X_n) = \prod_{i=1}^n P(X_i\mid\textit{Parents}(X_i))
\end{align*}
So numerically one can compute that in our case
\begin{align*}
  P(B,E,A,J,M) 
  & = P(B|\textit{Parents}(B))\cdot
  P(E|\textit{Parents}(E))\cdot
  P(A|\textit{Parents}(A))\cdot
  P(J|\textit{Parents}(J))\cdot
  P(M|\textit{Parents}(M)) \\
  & = P(B)\cdot
  P(E)\cdot
  P(A|B,E)\cdot
  P(J|A)\cdot
  P(M|A)
\end{align*}
To get $P(B,E)$ we can apply marginalization:
\begin{align*}
  P(B,E) 
  &= \sum_{a}\sum_{j}\sum_{m} P(B,E,a,j,m) \\
  &= \sum_{a}\sum_{j}\sum_{m} P(B)\cdot P(E)\cdot
  P(a|B,E)\cdot
  P(j|a)\cdot
  P(m|a) \\
  &= \sum_{a} P(B)\cdot P(E)\cdot
  P(a|B,E)\cdot 1 \cdot 1 \\
  &= P(B)\cdot P(E)\cdot 1
\end{align*}
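This shows the independence of $B$ and $E$. Topologically, $B$ has no parents and $E$ is not a descendant of $B$; since every node is conditionally independent of its non-descendants given its parents, $B$ is independent of $E$. Equivalently, the only path connecting $B$ and $E$ runs through $A$, where the arrows meet head-to-head ($B \to A \leftarrow E$). As long as neither $A$ nor one of its descendants is observed, this path is blocked, so $B$ and $E$ are d-separated.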
1.  No. We check whether $P(B,E|a) = P(B|a)P(E|a)$. First computing $P(B,E|a)$:
\begin{align} P(B,E|a) = \alpha P(a|B,E)P(B, E)
&= \alpha
\begin{cases}
 0.95 \cdot 0.001 \cdot 0.002 & \text{if $B=b$ and $E=e$} \\
 0.94 \cdot 0.001 \cdot 0.998 & \text{if $B=b$ and $E=\neg e$} \\
 0.29 \cdot 0.999 \cdot 0.002 & \text{if $B=\neg b$ and $E=e$} \\
 0.001 \cdot 0.999 \cdot 0.998 & \text{if $B=\neg b$ and $E=\neg e$}
 \end{cases}
\\
&=
\begin{cases}
 0.0008 & \text{if $B=b$ and $E=e$} \\
 0.3728 & \text{if $B=b$ and $E=\neg e$} \\
 0.2303 & \text{if $B=\neg b$ and $E=e$} \\
 0.3962 & \text{if $B=\neg b$ and $E=\neg e$}
 \end{cases}
\end{align}
where $\alpha$ is a normalization constant, chosen such that the four values sum to 1. From the normalized values we get $P(b|a) = 0.0008 + 0.3728 = 0.3736$ and $P(e|a) = 0.0008 + 0.2303 = 0.2311$. Checking whether $P(b, e|a) = P(b|a)\cdot P(e|a)$ we find 
$$P(b,e|a) = 0.0008\neq 0.0863 = 0.3736 \cdot 0.2311 = P(b|a)\cdot P(e|a)$$
showing that $B$ and $E$ are not conditionally independent given $A$.
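
The same check can be done numerically. The sketch below assumes the conditional probability table of the textbook burglary network [RN], in particular $P(a \mid b, e) = 0.95$; if the lecture slide uses different values, the dictionaries need to be adapted.

```python
from itertools import product

# Assumed CPT entries of the textbook burglary network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_GIVEN_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

# Unnormalized P(B, E, a) = P(a | B, E) * P(B) * P(E) for Alarm = true.
unnormalized = {(b, e): P_A_GIVEN_BE[(b, e)] * P_B[b] * P_E[e]
                for b, e in product((True, False), repeat=2)}
alpha = 1.0 / sum(unnormalized.values())
posterior = {be: alpha * p for be, p in unnormalized.items()}  # P(B, E | a)

# Marginals of the posterior.
p_b_given_a = posterior[(True, True)] + posterior[(True, False)]
p_e_given_a = posterior[(True, True)] + posterior[(False, True)]

print(round(posterior[(True, True)], 4))     # P(b, e | a)
print(round(p_b_given_a * p_e_given_a, 4))   # P(b | a) * P(e | a)
```

The two printed values differ by two orders of magnitude, which is the "explaining away" effect: once the alarm is known to be on, learning about an earthquake makes a burglary less likely.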

[RN, ex 14.4] 

# Recap (part I)

This part of the sheet is intended to revise some topics from the lecture; a second part will follow on the next sheet. These exercises do not need to be solved in order to qualify for the final exam, but solving them is highly recommended as preparation. Also, if you hit any question that should be discussed in more detail, please let us know.

## Recap 1: Concept Learning [2 Points]

### a) Concept Learning

What is Concept Learning? Is it supervised? Is it local?

Concept learning aims at acquiring knowledge that allows one to distinguish exemplars from non-exemplars of a given category (concept). It can be formalized as learning a unary predicate $p_c$ on the domain $X$ or equivalently an indicator function $c:X\to\{0,1\}$.

Concept learning is usually supervised: the teacher tells the learner if an example falls under the concept or not.

As soon as some metric is given on the data, there may be local and global concept learners. One may for example use a nearest neighbor learner (local) or a multilayer perceptron (global) to learn concepts.

### b) Find-S
Describe the Find-S Algorithm in pseudo code. What is its inductive bias? What are its advantages and drawbacks?

    1. Initialize h to the most specific hypothesis in H.
    2. For each positive training instance x do
           For each attribute constraint a_i in h do
               If (a_i is not satisfied by x) then
                   Replace a_i in h by the next more general constraint
                     that is satisfied by x.
               End if
           End for
       End for
    3. Output h.

Inductive Bias: The target concept can be described in its hypothesis space (in our case: it is a conjunction of features). All instances are negative instances unless demonstrated otherwise.

Drawback: it does not take negative instances into account.
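
The pseudo code can be turned into a short program. This is a minimal sketch assuming that hypotheses are conjunctions of attribute constraints, where `None` stands for "no value accepted" and `"?"` for "any value accepted"; the weather-like example data is made up for illustration.

```python
def find_s(examples):
    """Find-S for conjunctions of attribute constraints.

    Each example is a pair (attribute_tuple, is_positive).
    """
    n_attributes = len(examples[0][0])
    h = [None] * n_attributes            # most specific hypothesis
    for x, is_positive in examples:
        if not is_positive:
            continue                     # negative instances are ignored
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value             # first positive instance: copy its value
            elif h[i] != "?" and h[i] != value:
                h[i] = "?"               # next more general constraint: any value
    return h


# Hypothetical training data: (sky, temperature, humidity, wind).
examples = [
    (("sunny", "warm", "normal", "strong"), True),
    (("sunny", "warm", "high", "strong"), True),
    (("rainy", "cold", "high", "strong"), False),
    (("sunny", "warm", "high", "weak"), True),
]
print(find_s(examples))
```

Note that the negative instance has no influence on the result, which illustrates the drawback named above.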

### c) Hypotheses space

What is the hypotheses space for Candidate-Elimination used in the lecture?

The hypothesis space for Candidate-Elimination consists of conjunctions of attribute constraints, ranging from the most specific to the most general hypothesis. Restricting the hypotheses to conjunctions of features biases the learner and makes it impossible to find a disjunctive solution.

The version space on the other hand is a subset of the hypothesis space: the set of all hypotheses that are consistent with the training examples. It comprises all hypotheses between and including the general and the specific boundary.



## Recap 2: Decision Trees [2 Points]

### a) Overfitting
What is overfitting? How can it be avoided?

Overfitting means an overly specific adaptation of the learner to the training data. The learner picks up not only the general structure of the training data but also its specific noise (artifacts), and hence loses the capability to generalize to other data.

Overfitting can be detected by using a separate data set that is not used for training. If the error on this held-out data increases during training while the training error keeps decreasing, this indicates overfitting. It can be avoided by stopping training at that point or, for decision trees, by pruning.

### b) Pruning

Name one method for pruning a decision tree and describe it!

Pruning can be applied to reduce overfitting of a decision tree. Two types of pruning have been introduced in the lecture:

*Reduced error pruning:* iteratively replaces a subtree by a leaf node (labeled with the majority class of the training examples at that node) as long as this does not decrease the accuracy on a validation set.

*Rule based pruning:* translates the decision tree into a set of rules and then prunes each individual rule by removing any precondition whose removal improves the rule's accuracy on the validation set.

### c) Information gain
What are entropy and information gain? Provide explanation and formulae. How are they used in ID3?

Entropy measures the inhomogeneity (impurity) of a data set. It is the minimal expected number of bits needed to encode the class of an element drawn at random from the set:
$$E(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2p_{-}$$
where $p_{+}$ denotes the fraction of positive and $p_{-}$ that of negative examples in the data set. A set $S$ with only positive (or only negative) examples would have no entropy (i.e. $E(S)=0$), while a set with the same number of positive and negative examples has maximal entropy ($E(S)=1$).

Information gain is the expected reduction in entropy due to splitting the data set $S$ based on one attribute $A$: denote for every value $v\in\operatorname{Values}(A)$ the subset of elements from $S$ where $A=v$ by $S_v$. Then the information gain is given by
$$\operatorname{Gain}(S,A) = E(S) - \sum_{v\in\operatorname{Values}(A)}E(S_v)\cdot\frac{|S_v|}{|S|}$$
that is, the entropy values of the subsets $S_v$, weighted by their relative sizes, are subtracted from the entropy of $S$. If the subsets $S_v$ are all homogeneous ($E(S_v)=0$), then the information gain is maximal, namely $E(S)$, i.e. the data set can be fully explained by the single attribute $A$. On the other hand, if all $S_v$ have maximal entropy, no information is gained by splitting based on $A$. In practice, something between these extremes will be the case.

ID3 uses the attribute with the highest information gain as the test at the root of the decision tree and then proceeds recursively on each subset $S_v$ with the remaining attributes.
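
Both quantities are easy to compute directly from the formulae. The following sketch uses a made-up data set with two attributes, one that separates the classes perfectly and one that is uninformative; the function and attribute names are chosen for this illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy E(S) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Gain(S, A): entropy of S minus the size-weighted entropies of the subsets S_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical data: "outlook" determines the class, "wind" carries no information.
rows = [
    {"wind": "weak", "outlook": "sunny"},
    {"wind": "strong", "outlook": "sunny"},
    {"wind": "weak", "outlook": "rain"},
    {"wind": "strong", "outlook": "rain"},
]
labels = ["+", "+", "-", "-"]

print(entropy(labels))                            # balanced classes: maximal entropy
print(information_gain(rows, labels, "outlook"))  # homogeneous subsets: gain equals E(S)
print(information_gain(rows, labels, "wind"))     # subsets as mixed as S: no gain
```

On this data ID3 would choose `outlook` for the root, since it has the higher gain.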

## Recap 3: Data Mining [2 Points]

### a) Missing values

How can you deal with missing values? Name an important algorithm and explain how to use it.

Data records with missing values may be simply ignored, or one may try to "fix" the record by inserting artificial values into empty slots. The simplest way is to insert just zeros (or some other fixed value), but this will lead to poor data quality. Better approaches try to use statistical properties of the data set to introduce "natural" fillers. One approach is to use the mean of the missing attribute; however, this ignores possible dependencies between the different attributes. A more sophisticated approach is to use expectation maximization (EM) to estimate the joint probability distribution of all attributes in an iterative process. Once it is computed, one can use it to determine the most likely value for the missing data point.

### b) Outliers

What are outliers? Can we detect them? If so, how?

An outlier is a value that seemingly does not belong to the rest of the data. It is probably caused by some measurement error (but it may also reflect some real phenomenon).

A simple method to detect outliers is to consider their distances from the mean (or median) of the full data set. If this distance is too large (e.g. greater than 3 standard deviations), the data point is considered to be an outlier (z-test). The Rosner test iteratively removes such outliers until the data set does not contain any more of them.
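
The z-test criterion can be sketched as follows. The measurements are made up for illustration, and the threshold of 3 standard deviations is the example value from the text; the function name is hypothetical.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return all values that lie more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Hypothetical measurements with one suspicious value.
measurements = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0,
                10.2, 9.8, 10.0, 10.1, 9.9, 25.0]
print(z_score_outliers(measurements))
```

Since the suspicious value itself inflates both the mean and the standard deviation, this test can miss outliers in small samples, which is one reason for using the median or an iterative procedure instead.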

### c) EM algorithm
What does the Q-function express in the EM algorithm?

The EM algorithm aims at finding model parameters $\theta$ that best explain observed data $x$ (which may have missing values $h$). It does so by alternating steps of calculating the expected value of the log likelihood function $\log L(\theta,x,h)$, using the current estimated parameters $\theta_t$ (E step), and then finding parameter values $\theta_{t+1}$ that maximize this quantity (M step). The $Q$-function expresses this expected log likelihood as a function of $\theta$:
$$Q(\theta\mid\theta_{t}) = E_{h\mid x,\theta_t}[\log L(\theta,x,h)] = \int P(h\mid x,\theta_t)\cdot \log P(h\mid x,\theta)\,\mathrm{d}h+\log P(x\mid\theta)$$

## Recap 4: Clustering [2 Points]

### a) Clustering

Explain the difference between single-linkage and complete-linkage clustering.

Single-linkage clustering is based on the *minimum distance*, which defines the distance between two clusters as the distance of their closest points. Single-linkage clustering tends to produce chains.

Complete-linkage clustering is based on the *maximum distance* that defines the distance of two clusters to be the maximal distance of two of their points. Complete linkage clustering prefers compact clusters.
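
The two cluster distances differ only in taking the minimum or the maximum over all pairs of points. A small sketch with made-up 2D clusters, assuming Euclidean distance between points:

```python
from itertools import product
from math import dist


def single_linkage(cluster_a, cluster_b):
    """Cluster distance as the distance between the two closest points."""
    return min(dist(p, q) for p, q in product(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Cluster distance as the distance between the two farthest points."""
    return max(dist(p, q) for p, q in product(cluster_a, cluster_b))


# Hypothetical clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))    # closest pair: (1, 0) and (3, 0)
print(complete_linkage(cluster_a, cluster_b))  # farthest pair: (0, 0) and (6, 0)
```

In agglomerative clustering, the pair of clusters with the smallest such distance is merged in each step, so the choice of linkage decides whether elongated or compact clusters are preferred.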

### b) Metrics

Name three different distance measures and briefly explain them. Check the metric axioms for one of them.

* Hamming distance: the number of positions where two strings of equal length differ
* Chebyshev distance (also: maximum distance): maximal absolute difference in a single coordinate.
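* Minkowski distance (induced by the p-norm): family of distances, defined by the formula $\left(\sum_{i=1}^{L}|x_i-y_i|^p\right)^{1/p}$. Important special cases: city block distance (aka Manhattan, $p=1$) and Euclidean distance ($p=2$)
* Jaccard distance: for sets or binary attribute vectors; one minus the ratio of the number of attributes present in both items to the number of attributes present in at least one of them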

Metric axioms for Chebyshev distance $d(\mathbf{x},\mathbf{y}) := \max_{i=1,\ldots,L}|x_i-y_i|$:
1. Symmetry: Here we use that the absolute value of the difference is symmetric: $|a-b| = |b-a|$, hence $d(\mathbf{x},\mathbf{y}) = \max_{i=1,\ldots,L}|x_i-y_i| = \max_{i=1,\ldots,L}|y_i-x_i| = d(\mathbf{y},\mathbf{x})$
2. Coincidence (identity of indiscernibles): $d(\mathbf{x},\mathbf{x}) = \max_{i=1,\ldots,L}|x_i-x_i| = 0$. Conversely, if $d(\mathbf{x},\mathbf{y}) = 0$, then $|x_i-y_i| = 0$ for all $i$, hence $\mathbf{x} = \mathbf{y}$.
3. Triangle inequality: Here we apply that the triangle inequality holds for the absolute value of the difference: $|a-c|+|c-b|\geq|a-b|$, and hence
\begin{align}
  d(\mathbf{x},\mathbf{z}) + d(\mathbf{z},\mathbf{y}) =
\max_{i=1,\ldots,L}|x_i-z_i| + \max_{i=1,\ldots,L}|z_i-y_i| &
 \geq \max_{i=1,\ldots,L}(|x_i-z_i| + |z_i-y_i|) \\
& \geq \max_{i=1,\ldots,L}(|x_i-y_i|) = d(\mathbf{x},\mathbf{y})
\end{align}
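
The distance measures listed above can be written down directly. This is a minimal sketch with hypothetical function names; the final assertion only spot-checks the triangle inequality on one example and is of course no substitute for the proof.

```python
def hamming(s, t):
    """Number of positions at which two equally long sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def minkowski(x, y, p):
    """Distance induced by the p-norm (p=1: city block, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Maximal absolute difference in a single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 5.0)
print(hamming("karolin", "kathrin"))
print(minkowski(x, y, 1), minkowski(x, y, 2), chebyshev(x, y))

# Spot check of the triangle inequality for the Chebyshev distance.
assert chebyshev(x, y) <= chebyshev(x, z) + chebyshev(z, y)
```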


### c) Mixture models

What is a mixture model? Explain. Can you provide a formula?

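A mixture model describes a two-step process to mix different (simple) data distributions: first a subpopulation is selected at random, then a data point is drawn from the distribution of that subpopulation. Such an approach can be used to model a large population with different subpopulations, each with individual characteristics.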

Formally, one provides a specific distribution $P(X\mid Z=z)$ for every subpopulation $z$. These are mixed according to the probability $P(Z=z)$ to select an individual from that subpopulation, i.e.
$$P(X=x) = \sum_{z}P(Z=z)\cdot P(X=x\mid Z=z)$$
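
As an illustration, the sketch below implements both the density formula and the two-step sampling process for a mixture of two one-dimensional Gaussians. The component weights, means and standard deviations are made up, and the choice of Gaussian components is an assumption; any other family of distributions $P(X\mid Z=z)$ would work the same way.

```python
import random
from math import exp, pi, sqrt

# Hypothetical components: (P(Z=z), mean, standard deviation).
COMPONENTS = [(0.3, -2.0, 0.5), (0.7, 3.0, 1.0)]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2.0 * pi))


def mixture_pdf(x):
    """p(x) = sum over z of P(Z=z) * p(x | Z=z)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in COMPONENTS)


def sample(rng):
    """Two-step process: select a component according to P(Z), then draw from it."""
    weights = [w for w, _, _ in COMPONENTS]
    _, mean, std = rng.choices(COMPONENTS, weights=weights)[0]
    return rng.gauss(mean, std)


rng = random.Random(0)
draws = [sample(rng) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)       # expected to be close to 0.3 * (-2) + 0.7 * 3 = 1.5
print(mixture_pdf(3.0))  # density at the mean of the second component
```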


## Recap 5: Dimension Reduction [2 Points]

### a) Visualization

Name three different data visualization techniques to visualize high dimensional data. Explain one in detail.

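* a *scatterplot matrix* shows 2D projections of the data for all pairs of axes
* *Chernoff faces:* map parameters to facial features
* *parallel coordinates:* map the different data dimensions to different positions on the x-axis, each with its own vertical axis, and plot the corresponding values along the y-direction. Each data point becomes a polyline connecting its values on neighboring axes, so that similar data points and correlations between neighboring dimensions show up as similar line patterns.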

### b) PCA

Draw a few data points (ASCII arts or on a sheet of paper) and mark the principal components. What are the principal components?

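Principal components form a set of mutually orthogonal (and hence linearly independent) vectors. The first one points in the direction of the largest variance of the data, and each following one points in the direction of the largest remaining variance orthogonal to all previous ones. Their length is usually drawn proportional to the variance (or standard deviation) in that direction.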

### c) Covariance matrix
What does a covariance matrix express? How is it computed from data? How is it used in PCA?

The covariance matrix contains the covariance values for all pairs of coordinates; its diagonal holds the variances of the individual coordinates. A positive covariance value means that high values for the first coordinate tend to correspond to high values for the second coordinate. A negative covariance value expresses a correspondence of high values in the first coordinate with low values in the second coordinate. A value of $0$ means that the two coordinates are linearly uncorrelated.

Given a set of $n$ data points in a $d$-dimensional data space as an $n\times d$-matrix $D$, the covariance matrix is computed as $$C=\frac{1}{n-1}(D-\mu)^T\cdot(D-\mu)$$ where $\mu$ denotes the mean vector of the data set, which is subtracted from every row of $D$ (some texts normalize by $n$ instead of $n-1$).

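In PCA, the principal components are computed as eigenvectors of the covariance matrix. The corresponding eigenvalues give the variance of the data along each component, so the components are ordered by decreasing eigenvalue.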
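
The computation can be sketched with NumPy. The data set below is synthetic and made up for this example (a point cloud stretched along the diagonal), and the normalization by $n-1$ is the usual sample convention, assumed here; a different normalization only rescales the eigenvalues and leaves the principal components unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data set: large variance along the diagonal, small variance across it.
latent = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
rotation = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
D = latent @ rotation.T + np.array([5.0, -2.0])

# Covariance matrix of the centered data (d x d).
mu = D.mean(axis=0)
centered = D - mu
C = centered.T @ centered / (len(D) - 1)

# Eigen-decomposition of the symmetric matrix C, sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvectors[:, 0])  # first principal component, close to the diagonal direction (sign is arbitrary)
print(eigenvalues)         # variances along the components, roughly 9 and 0.25

# Projection of the data onto the first principal component (dimension reduction to 1D).
projected = centered @ eigenvectors[:, :1]
```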