# Machine Learning (Summer 2018)

## Practice Session 4

May 16th, 2018

Ulf Krumnack

Institute of Cognitive Science
University of Osnabrück

## Plan for the next sessions

* today: misc
* tomorrow: the EM algorithm
* next Tuesday: PCA

## Today's Session

* exercise sheet 06
* continuing MatPlotLib
* norms and metrics

# Continuing MatPlotLib


In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

## Legends, labels and titles, annotations

Usually, you should "decorate" your figure to explain what is depicted.

### Figure titles

A title can be added to each axes instance in a figure. To set the title, use the `plt.title` function (or the `ax.set_title` method of the axes instance):

In [None]:
x = np.linspace(0,2*np.pi,70)
plt.figure()
plt.plot(x, np.sin(x), x, np.cos(x))
plt.title("Sine and cosine")
plt.show()

### Axis labels

Similarly, with the methods `plt.xlabel` and `plt.ylabel` (or `ax.set_xlabel` and `ax.set_ylabel`), we can set the labels of the X and Y axes:

In [None]:
x = np.linspace(0,2*np.pi,70)
plt.figure()
plt.plot(x, np.sin(x), x, np.cos(x))
plt.xlabel("x")
plt.ylabel("y")
plt.show()
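
The `ax.set_title`, `ax.set_xlabel` and `ax.set_ylabel` variants mentioned above can be sketched as follows. This is a minimal example that assumes the axes instance is obtained from `plt.subplots()`; the names `fig` and `ax` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 70)

# plt.subplots() without arguments returns the figure and a single Axes object
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), x, np.cos(x))

# Axes-level counterparts of plt.title, plt.xlabel and plt.ylabel
ax.set_title("Sine and cosine")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Working on the axes instance directly is convenient once a figure contains more than one plot, because each call then clearly refers to one specific axes. In a script, finish with `plt.show()` to display the figure.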

### Legends

Legends can be added to curves by providing a `label` for each plot and calling the `legend` method.

In [None]:
x = np.linspace(0,2*np.pi,70)
plt.figure()
plt.plot(x, np.sin(x), label='sine')
plt.plot(x, np.cos(x), label='cosine')
plt.legend()
plt.show()

### Text annotations

* text annotations in matplotlib figures can be added using the `text` function.
* the `text` function supports LaTeX formatting just like axis label texts and titles (but beware of escape characters!)

In [None]:
x = np.linspace(-0.75, 1., 100)

plt.figure()
plt.plot(x, x**2, x, x**3)
plt.text(0.15, 0.2, r"$\alpha$", fontsize=40, color="blue")
plt.text(0.65, 0.1, "$y=x^3$", fontsize=20, color="green")
plt.show()
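
The warning about escape characters can be illustrated without any plotting. The small sketch below compares a plain and a raw string literal for the same LaTeX snippet; the variable names `plain` and `raw` are only illustrative.

```python
plain = "$\alpha$"   # "\a" is interpreted as the ASCII bell character
raw = r"$\alpha$"    # raw string: the backslash is kept literally

print(repr(plain))   # '$\x07lpha$'
print(repr(raw))     # '$\\alpha$'
print(len(plain), len(raw))
```

Only the raw string still contains the LaTeX command `\alpha`, which is why raw strings are a safe default for LaTeX labels.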

Exercises:
1. Draw a sine and a cosine curve into a graph and decorate it with axis labels, a title and a legend in the lower right corner (look up the documentation to learn how to position the legend).
1. Mark the point $(\pi,0)$ in the plot and place the label $\pi$ next to it.

## Other 2D plot styles

* in addition to the regular `plot` method, there are a number of other functions for generating different kinds of plots
* see the matplotlib plot gallery for a complete list of available plot types: http://matplotlib.org/gallery.html

In [None]:
# A scatter plot
n = 500
x = np.random.randn(n)
y = np.random.randn(n)

plt.figure()
plt.scatter(x,y)
plt.title("scatter")
plt.show()

In [None]:
# A step function
n = np.array([0,1,2,3,4,5])
plt.figure()
plt.step(n, 10-n**2, lw=2)
plt.title("step")
plt.show()

In [None]:
# A bar diagram
plt.figure()
plt.bar(n, n**2, align="center", width=0.5, alpha=0.5)
plt.title("bar")
plt.show()

In [None]:
x = np.linspace(0,2*np.pi,70)
plt.figure()
plt.fill_between(x, np.sin(x), np.cos(x), color="green", alpha=0.5)
plt.title("fill_between")
plt.show()

In [None]:
# A histogram
n = np.random.randn(100000)
plt.figure()

plt.hist(n)
plt.title("Default histogram")
plt.xlim(n.min(), n.max())
plt.show()

## Subplots

* subplots allow you to put multiple plots into one figure
* subplots are arranged in a rectangular grid
* subplots are indexed starting with 1 (not 0)!

In [None]:
rows = 2
columns = 3
N=10

plt.figure()
for i in range(rows*columns):
    plt.subplot(rows, columns, i+1)
    plt.title('Subplot {}'.format(i+1))
    plt.plot(np.arange(N), np.random.rand(N))
plt.tight_layout()
plt.show()
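
As a contrast to the 1-based index of `plt.subplot`, the grid can also be created in one call with `plt.subplots`, which returns an ordinary 0-indexed NumPy array of axes. The following is a minimal sketch of the same 2x3 grid; the names `fig` and `axes` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rows, columns, N = 2, 3, 10

# axes is a (rows, columns) array of Axes objects, indexed from 0
fig, axes = plt.subplots(rows, columns)
for i, ax in enumerate(axes.flat):
    ax.set_title('Subplot {}'.format(i + 1))
    ax.plot(np.arange(N), np.random.rand(N))
fig.tight_layout()
```

Here `axes[0, 0]` is the upper left plot and `axes[1, 2]` the lower right one.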

Exercises:

1. Create a normally distributed 2D dataset with given mean and standard deviation.
2. Create a scatter plot to display your dataset.
3. Indicate the standard deviation by adding corresponding ellipses to your plot.
4. Next to your scatter plot draw a histogram that shows the distribution of distances to the center point

In [None]:
N = 500
mean = np.array([1,1])
sigma = np.array([[2,1],[1,4]])



## Summary

* MatPlotLib provides plotting functionality
* Today we saw some basic concepts that should allow you to do most of the exercises
* We may introduce some additional functionality in future sessions
* For the curious: visit https://matplotlib.org/

# Sheet 03: Assignment 2: p-norm [5 Points]

A very well-known norm is the Euclidean norm, which induces the Euclidean distance. However, it is not the only norm: it is just one member of the family of p-norms, namely the one with $p = 2$. In this assignment you will take a look at other p-norms and see how they behave.

Implement a function `pnorm` which expects a vector $x \in \mathcal{R}^n$ and a scalar $p \geq 1, p \in \mathcal{R}$ and returns the p-norm of $x$ which is defined as:

$$||x||_p = \left(\sum\limits_{i=1}^n |x_i|^p \right)^{\frac{1}{p}}$$

*Note:* Even though the norm is only defined for $p \geq 1$, values $0 < p < 1$ are still interesting. In that case we cannot speak of a norm anymore, because the triangle inequality ($||a|| + ||b|| \geq ||a + b||$) does not hold. We will still take a look at some of these values, so your function should handle them as well.
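
To make the failing triangle inequality concrete, here is a small numeric sketch for $p = 0.5$. The helper `p_value` is a hypothetical stand-in that simply evaluates the formula above without restricting $p$; the two unit vectors are chosen only for illustration.

```python
import numpy as np

def p_value(x, p):
    # Same formula as the p-norm above, evaluated without restricting p
    return np.sum(np.abs(x) ** p) ** (1 / p)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
p = 0.5

lhs = p_value(a, p) + p_value(b, p)   # 1 + 1 = 2
rhs = p_value(a + b, p)               # (1 + 1)**2 = 4
print(lhs, rhs, lhs >= rhs)           # 2.0 4.0 False
```

The sum of the two values is smaller than the value of the sum, so the inequality is violated for this pair of vectors.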

In [None]:
import numpy as np

def pnorm(x, p):
    """
    Calculates the p-norm of x.
    
    Args:
        x (array): the vector for which the norm is to be computed.
        p (float): the p-value (a positive real number).
        
    Returns:
        The p-norm of x.
    """
    # If p is not valid, raise an error:
    if p <= 0:
        raise ValueError('p has to be > 0!')
    result = np.sum(np.abs(x) ** p, axis=-1) ** (1 / p)
    return result

In [None]:
# 1e-10 is 0.0000000001
assert abs(pnorm(1, 2)      - 1)          < 1e-10 , "pnorm is incorrect for x = 1, p = 2"
assert abs(pnorm(2, 2)      - 2)          < 1e-10 , "pnorm is incorrect for x = 2, p = 2"
assert abs(pnorm([2, 1], 2) - np.sqrt(5)) < 1e-10 , "pnorm is incorrect for x = [2, 1], p = 2"
assert abs(pnorm(2, 0.5)    - 2)          < 1e-10 , "pnorm is incorrect for x = 2, p = 0.5"
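
As an additional sanity check, the formula can be compared against NumPy's built-in `np.linalg.norm`, whose `ord` argument accepts general positive values for one-dimensional input. The sketch below uses a hypothetical stand-in `pnorm_sketch` with the same formula so that it runs on its own; the test vector is arbitrary.

```python
import numpy as np

def pnorm_sketch(x, p):
    # Hypothetical stand-in using the same formula as the p-norm definition
    return np.sum(np.abs(x) ** p) ** (1 / p)

x = np.array([3.0, -4.0, 1.0])
for p in [0.5, 1, 2, 8]:
    reference = np.linalg.norm(x, ord=p)
    assert np.isclose(pnorm_sketch(x, p), reference), "mismatch for p = {}".format(p)
    print(p, reference)
```

For $p < 1$ the NumPy result is, as in this assignment, not a norm in the mathematical sense, but the same formula is evaluated.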

Implement another function `pdist` which expects two vectors $x_0 \in \mathcal{R}^n, x_1 \in \mathcal{R}^n$ and a scalar $p \geq 1, p \in \mathcal{R}$ and returns the distance between $x_0$ and $x_1$ induced by the p-norm with parameter $p$. Again, handle $0 < p < 1$ as well.

In [None]:
import numpy as np

def pdist(x0, x1, p):
    """
    Calculates the distance between x0 and x1
    using the p-norm.
    
    Arguments:
        x0 (array): the first vector.
        x1 (array): the second vector.
        p (float): the p-value (a positive real number).
        
    Returns:
        The p-distance between x0 and x1.
    """
    result = pnorm(np.array(x0) - np.array(x1), p)
    return result

In [None]:
# 1e-10 is 0.0000000001
assert abs(pdist(1, 2, 2)           - 1)          < 1e-10 , "pdist is incorrect for x0 = 1, x1 = 2, p = 2"
assert abs(pdist(2, 5, 2)           - 3)          < 1e-10 , "pdist is incorrect for x0 = 2, x1 = 5, p = 2"
assert abs(pdist([2, 1], [1, 2], 2) - np.sqrt(2)) < 1e-10 , "pdist is incorrect for x0 = [2, 1], x1 = [1, 2], p = 2"
assert abs(pdist([2, 1], [0, 0], 2) - np.sqrt(5)) < 1e-10 , "pdist is incorrect for x0 = [2, 1], x1 = [0, 0], p = 2"
assert abs(pdist(2, 0, 0.5)         - 2)          < 1e-10 , "pdist is incorrect for x0 = 2, x1 = 0, p = 0.5"

Now we will compare some different p-norms. Below is part of the code needed to plot the data in nice scatter plots.

Your task is to calculate the data to plot. The variable `data` is currently simply filled with zeros. Instead, fill it as follows:

- Use the function `np.linspace()` to create a vector of `50` evenly spaced values between `-100` and `100` (inclusive).
- Fill `data`: `data` is the Cartesian product of the vector you created before with itself, with the norm of each point appended. It should have 2500 rows. Each of the 2500 rows should contain `[x, y, d]`, where `x` is the x coordinate and `y` the y coordinate of a point, and `d` the p-norm of `(x, y)`. Use either `pnorm` or `pdist` to calculate `d`.
- Normalize the data in `data[:,2]` (i.e. all d-values) so that they lie between 0 and 1.

Run your code and take a look at your results. Darker colors mean that a value is further away from the center (0, 0) according to the p-norm used.

*Hint:* To give you an idea of what `data` should look like, here is an example for three evenly spaced values between `-1` and `1` and a p-norm with `p = 2`.

Before normalization of the d-column:

```python
data = np.array([[-1.        , -1.        ,  1.41421356],
                 [-1.        ,  0.        ,  1.        ],
                 [-1.        ,  1.        ,  1.41421356],
                 [ 0.        , -1.        ,  1.        ],
                 [ 0.        ,  0.        ,  0.        ],
                 [ 0.        ,  1.        ,  1.        ],
                 [ 1.        , -1.        ,  1.41421356],
                 [ 1.        ,  0.        ,  1.        ],
                 [ 1.        ,  1.        ,  1.41421356]])
```

After normalization of the d-column:

```python
data = np.array([[-1.        , -1.        ,  1.        ],
                 [-1.        ,  0.        ,  0.70710678],
                 [-1.        ,  1.        ,  1.        ],
                 [ 0.        , -1.        ,  0.70710678],
                 [ 0.        ,  0.        ,  0.        ],
                 [ 0.        ,  1.        ,  0.70710678],
                 [ 1.        , -1.        ,  1.        ],
                 [ 1.        ,  0.        ,  0.70710678],
                 [ 1.        ,  1.        ,  1.        ]])
```
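
One possible way to arrive at the two tables of the hint is sketched below. It assumes `np.meshgrid` for the Cartesian product and evaluates the p-norm formula row by row in a single array expression; this is only one of several reasonable approaches, and the names `xx`, `yy`, `points` and `d` are illustrative.

```python
import numpy as np

ls = np.linspace(-1, 1, 3)
p = 2

# All (x, y) combinations; indexing='ij' makes x vary slowest, as in the hint
xx, yy = np.meshgrid(ls, ls, indexing='ij')
points = np.column_stack([xx.ravel(), yy.ravel()])

# p-norm of every row at once
d = np.sum(np.abs(points) ** p, axis=1) ** (1 / p)

data = np.column_stack([points, d])
print(data)                      # table before normalization

data[:, 2] /= data[:, 2].max()   # scale the d-column to [0, 1]
print(data)                      # table after normalization
```

Dividing by the maximum is sufficient for the normalization here because all d-values are non-negative and the smallest one is 0.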

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ColorConverter

color = ColorConverter()
figure_norms = plt.figure('p-norm comparison')

# create the linspace vector: 50 evenly spaced values from -100 to 100
ls = np.linspace(-100, 100, 50)

assert len(ls) == 50 , 'ls should be of length 50.'
assert (min(ls), max(ls)) == (-100, 100) , 'ls should range from -100 to 100, inclusively.'

for i, p in enumerate([1/8, 1/4, 1/2, 1, 1.5, 2, 4, 8, 128]):
    # Create a numpy array containing useful values instead of zeros.
    data = np.array([[x, y, pnorm((x, y), p)] for x in ls for y in ls])
    data[:,2] = data[:,2] / np.max(data[:,2])

    assert data[100,2]>0.9 and data[100,2]<1, "Wrong result for p norm, make sure you use NORM and not pdist!"
    assert all(data[:,2] <= 1), 'The third column should be normalized.'

    # Plot the data.
    colors = [color.to_rgb((1, 1-a, 1-a)) for a in data[:,2]]
    a = plt.subplot(3, 3, i + 1)
    plt.scatter(data[:,0], data[:,1], marker='.', color=colors)
    a.set_ylim([-100, 100])
    a.set_xlim([-100, 100])
    a.set_title('{:.3g}-norm'.format(p))
    a.set_aspect('equal')
    plt.tight_layout()
    figure_norms.canvas.draw()

Exercises:

1. Show that the $p$-norm is not a norm for $p<1$ (give an example where the triangle inequality is violated).
2. Replace the line `data = np.array([[x, y, pnorm((x, y), p)] for x in ls for y in ls])` by a proper numpy-ish vectorized way of coding (avoiding loops).
3. Improve the figure by showing unit "circles" of the different $p$-norms instead of the color shading.