# Machine Learning (Summer 2024)

## Practice Session 2: Introduction

April 23rd, 2024

Ulf Krumnack & Lukas Niehaus

Institute of Cognitive Science,
University of Osnabrück

## Today's Session

* Finish Python Introduction (session01)
* Numpy
* Sheet02
* Decision Trees with Scikit-Learn

# Numpy

## Background

Numpy is an extension package for Python. Numpy
* provides multidimensional arrays
* is closer to the hardware (efficiency)
* is designed for scientific computation
* supports array-oriented computing

Numpy arrays can be used for:
* values of an experiment/simulation at discrete time steps
* signals recorded by a measurement device, e.g. sound wave
* pixels of an image, grey-level or colour
* 3D data measured at different X-Y-Z positions, e.g. MRI scan

### The `import` statement

To use an extension in Python, it has to be imported first.
The recommended way to import numpy is: 

   `import numpy as np`

This does two things:
1. it loads the extension
2. it makes its data and functions available under the prefix `np.`

In [None]:
import numpy as np

## Numpy arrays

Numpy provides a datatype for $N$-dimensional arrays (`ndarray`). Such arrays can be initialized from Python lists using the `np.array` function:

In [None]:
a = np.array([0,1,2,3])

In [None]:
a

In [None]:
print(a)
print(type(a))
print(len(a))
print(a.dtype)
print(a.ndim)
print(a.shape)

Multidimensional arrays are possible:

In [None]:
b = np.array([[0,1,2],[3,4,5]])
b

In [None]:
print(type(b))
print(b.ndim)
print(b.shape)
print(b.size)

In [None]:
c = np.array([[[1,2,3], [4,5,6]], [[7,8,9], [10,11,12]]])
c

In [None]:
print(c.ndim)
print(c.shape)
print(c.size)

There are also other ways to create arrays:

In [None]:
np.arange(10)

In [None]:
np.arange(1,9,2) # start, end (exclusive),  step

In [None]:
np.linspace(0, 1, 6) # start, end, numpoints

In [None]:
np.linspace(0, 1, 5, endpoint=False)

In [None]:
np.ones((2,2))

In [None]:
np.zeros((3,3,3))

In [None]:
np.eye(4)

In [None]:
np.diag([1,2,3,4])

## Reshaping and combining arrays

In [None]:
a = np.reshape(np.arange(12),(3,4))
a

In [None]:
a.flatten()

**Attention:** flattening happens automatically in some situations (we will see examples below)
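As a small sketch of how `reshape` and `flatten` relate: `flatten` returns a copy, so changing the flattened result leaves the original array untouched, and `reshape` accepts `-1` for one dimension that Numpy should infer. The variable names below are only illustrative.

```python
import numpy as np

a = np.reshape(np.arange(12), (3, 4))

# flatten() returns a copy: modifying it does not change `a`
flat = a.flatten()
flat[0] = 100
print(a[0, 0])      # still 0

# -1 lets numpy infer the missing dimension (12 elements / 3 columns = 4 rows)
b = a.reshape(-1, 3)
print(b.shape)      # (4, 3)
```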

### Combining two arrays:

In [None]:
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])

print(a)
print(b)

In [None]:
print(np.append(a,b))

In [None]:
print(np.append(a,b,axis=0))

We can also use `np.hstack` and `np.vstack`:

In [None]:
print(np.hstack((a,b)))
print(np.vstack((a,b)))

`np.delete` removes elements from an array:

In [None]:
print(np.delete(a,1))

Notice that the array is flattened!
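The flattening can be avoided by passing the `axis` argument to `np.delete`. The following sketch removes a whole row or a whole column from a small 2×2 array; the names `without_row` and `without_col` are chosen just for this example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# axis=0: delete the row with index 1, the result stays two-dimensional
without_row = np.delete(a, 1, axis=0)
print(without_row)   # [[1 2]]

# axis=1: delete the column with index 1
without_col = np.delete(a, 1, axis=1)
print(without_col)   # [[1]
                     #  [3]]
```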

## Indexing and slicing

One-dimensional arrays behave similarly to lists:

In [None]:
a = np.arange(13)
a

Indices start at 0!

In [None]:
a[2:4]

In [None]:
a = np.reshape(np.arange(12),(3,4))
a

In [None]:
a[1,2]

A multidimensional array is basically an array of arrays:

In [None]:
a[1]

In [None]:
len(a)
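Slicing also works along several dimensions at once. The sketch below, using the same 3×4 array as above, picks out a row, a column and a rectangular block; the variable names are illustrative only.

```python
import numpy as np

a = np.reshape(np.arange(12), (3, 4))

row = a[1, :]        # second row
col = a[:, 2]        # third column
block = a[:2, 1:3]   # first two rows, columns 1 and 2

print(row)    # [4 5 6 7]
print(col)    # [ 2  6 10]
print(block)  # [[1 2]
              #  [5 6]]
```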

### Logical indexing

A boolean matrix `b` can be used for indexing a matrix `a` *of the same shape*:

In [None]:
a = np.reshape(np.arange(25),(5,5))
b = np.eye(5, dtype=bool)
print("a =\n", a)
print("b =\n", b)
print("a[b] =", a[b])
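In practice, the boolean matrix is often produced by a comparison on the array itself. A minimal sketch, assuming we want all even entries of `a` (the names `mask` and `evens` are made up for this example):

```python
import numpy as np

a = np.reshape(np.arange(25), (5, 5))

# the comparison is applied elementwise and yields a boolean array of the same shape
mask = a % 2 == 0
print(mask.shape)   # (5, 5)

# indexing with the mask returns the selected values as a one-dimensional array
evens = a[mask]
print(evens)
```

Note that the result is one-dimensional: the selected entries are returned in row-major order.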

### Exercise

Create the following array:

```
array([[ 0,  1,  3,  4],
       [10, 11, 13, 14],
       [20, 21, 23, 24]])
```

In [None]:
np.delete(np.arange(25).reshape(5,5)[::2],2,axis=1)

## Numerical operations on arrays

### Elementwise operations

In [None]:
a = np.arange(10)
print(a)

In [None]:
a + a

In [None]:
2 * a

In [None]:
a - a 

In [None]:
a * a

In [None]:
a ** 2

In [None]:
np.sqrt(a)

### Exercise
* Create a $3\times n$ array that contains `a1` in its first row, `a2` in its second row, and the sum of `a1` and `a2` in its third row

In [None]:
a1 = np.array([2,5,8,4])
a2 = np.array([7,1,1,3])

### BEGIN SOLUTION
np.vstack([a1,a2,a1+a2])
### END SOLUTION

### Matrix operations

In [None]:
a = np.reshape(np.arange(1,10),(3,3))
e = np.eye(3)
print(a)
print(e)

In [None]:
a + e # sum of two matrices

In [None]:
2 * a # multiplication with a scalar

In [None]:
a * e # pointwise multiplication (not the matrix multiplication!)

In [None]:
a.dot(e) # matrix multiplication

In [None]:
a @ e

In [None]:
a.T # matrix transposition

### Some more mathematical functions

In [None]:
data = np.random.randn(10) # standard normal distribution (mean 0, variance 1)
print(data)

In [None]:
max(data) # largest value in the dataset

In [None]:
data.max()

In [None]:
np.max(data)

In [None]:
data.argmax()

In [None]:
np.argmax(data) # index of the largest value in the dataset

In [None]:
# also works with multi-dimensional data:
a = np.random.randn(5,5)
print(a)
print(np.argmax(a))

Note that `np.argmax` returns a flat index, i.e. an index into the flattened array!
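If the row and column position is needed instead, `np.unravel_index` can convert the flat index back into a multi-dimensional one. A short sketch with a small fixed array (instead of random data, so that the result is predictable):

```python
import numpy as np

a = np.array([[1, 9, 3],
              [4, 5, 6]])

flat = np.argmax(a)                         # index into the flattened array
row, col = np.unravel_index(flat, a.shape)  # convert to (row, column)

print(flat)      # 1
print(row, col)  # 0 1
print(a[row, col] == a.max())
```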

In [None]:
np.mean(data) # the mean value of the data

In [None]:
np.std(data) # the standard deviation of the data

In [None]:
np.abs(data) # absolute value (remove sign)

In [None]:
np.sum(data) # sum of all data

In [None]:
np.prod(data) # product of all data (the old alias np.product was removed in NumPy 2.0)

#### Exercise
* Compute the mean and standard deviation of `data` (without using the built-in functions `mean` and `std`). Then compare your results with the values from `mean` and `std`.

In [None]:
my_mean = data.sum()/data.size
my_std = np.sqrt(((data-my_mean)**2).sum()/data.size)

print("Mean: {} vs. {}".format(my_mean, data.mean()))
print("Std: {} vs. {}".format(my_std, data.std()))
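The values agree because `np.std` divides by the number of samples $N$ by default (`ddof=0`). The sample standard deviation, which divides by $N-1$, can be requested with `ddof=1`. A minimal sketch with a small hand-picked dataset (the numbers are chosen for this example only):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.sum() / values.size
squared_dev = ((values - mean) ** 2).sum()   # sum of squared deviations

pop_std = np.sqrt(squared_dev / values.size)           # divide by N
sample_std = np.sqrt(squared_dev / (values.size - 1))  # divide by N - 1

print(pop_std, np.std(values))             # default: ddof=0
print(sample_std, np.std(values, ddof=1))  # ddof=1
```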

## Some general hints

Help (the docstring) for a function or value can be displayed by appending a question mark to its name:

In [None]:
np.argmax?

To search the documentation for a keyword, use `np.lookfor('create array')` (only available in NumPy versions before 2.0):

In [None]:
np.lookfor('create array')

### References

* on the web: [numpy.org](https://numpy.org/)

# Decision Trees with Scikit-Learn 

## The package Scikit-Learn (`sklearn`)

* package for machine learning in Python
* provides implementations for many algorithms discussed in the lecture
* good documentation on [scikit-learn.org](https://scikit-learn.org/stable/index.html)

In [None]:
import sklearn

print(sklearn.__version__)

Remark on `import`:
* it does two things:
  1. loads the module code
  2. adds a symbol to the local namespace
* different syntactic variants:

```python
import foo

import foo as f

from foo import bar

from foo import bar as baz
```
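To make the variants concrete, here is a small sketch using the standard library module `math` in place of the placeholder `foo`. Each variant loads the same module and differs only in the name that ends up in the local namespace.

```python
import math                     # adds the name `math`
import math as m                # adds the name `m` for the same module
from math import sqrt           # adds only the name `sqrt`
from math import sqrt as root   # adds `sqrt` under the name `root`

print(math.sqrt(16.0))  # 4.0
print(m.sqrt(16.0))     # 4.0
print(sqrt(16.0))       # 4.0
print(root(16.0))       # 4.0

print(m is math)        # True: both names refer to the same module object
```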
  

## Learning a decision tree

* Scikit-Learn implements [decision tree algorithms](https://scikit-learn.org/stable/modules/tree.html) in module [`sklearn.tree`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
* A decision tree based classifier is provided by the class [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* To use this class one has to instantiate it:

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(criterion='entropy')

## Providing training data

* Scikit-learn provides some standard datasets in the module `sklearn.datasets`
* we will again use the Iris flower dataset

In [None]:
from sklearn import datasets

iris = datasets.load_iris()

print("Data shape:", iris.data.shape)
print("Target shape:", iris.target.shape)
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)

The Iris dataset is a *labeled* dataset for a classification problem. One often refers to the data as follows:
* `X` (feature vector): the feature values for each datapoint
* `y` (class label): the class to which the datapoint belongs

In [None]:
X, y = iris.data , iris.target

Alternative way of loading just the data:

In [None]:
X, y = datasets.load_iris(return_X_y=True)

In [None]:
print(X[:5])

In [None]:
print(y)

In [None]:
# Visualize data partially
%matplotlib widget
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))

attr_num1 = 0
attr_num2 = 1

plt.scatter(X[:,attr_num1], X[:,attr_num2], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel(iris.feature_names[attr_num1])
plt.ylabel(iris.feature_names[attr_num2])
plt.show()

## Train a decision tree classifier

* in many scikit-learn classes, training is done via the `fit` method
* for training a classifier, we have to provide labeled training data:
  - feature vectors (X)
  - corresponding class labels (y)

In [None]:
decision_tree.fit(X, y)
print(decision_tree.get_depth())

## Outputting the decision tree

* The module `sklearn.tree` provides the function `export_text` to obtain a string representation:

In [None]:
from sklearn.tree import export_text

print(export_text(decision_tree, feature_names=iris.feature_names))  # , max_depth=2

## Plotting the decision tree

* Since version 0.21 (May 2019), scikit-learn can plot decision trees using `matplotlib`

In [None]:
# check for the sklearn version, it has to be at least 0.21
import sklearn
print("Scikit-Learn version: ", sklearn.__version__)

from packaging.version import Version
assert Version(sklearn.__version__) >= Version("0.21"), "Your scikit-learn is too old"

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12,12))
plot_tree(decision_tree, 
          feature_names=iris.feature_names, 
          class_names=list(iris.target_names), 
          filled=True, 
          rounded=True, 
          fontsize=8)
plt.show()

The nodes contain the following information:
* the decision (e.g. "petal width (cm) <= 0.8")
* a measure of impurity (here entropy)
* the number of training samples represented by the node (e.g. `150`)
* the distribution of samples over the classes (e.g. `[50,50,50]`)
* the (majority) class of this node 
* the color also indicates majority class and impurity
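The entropy value shown in a node can be reproduced from its class distribution. The helper `entropy` below is a hypothetical function written for this illustration (it is not part of scikit-learn) and uses the base-2 logarithm:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    # classes with zero samples contribute nothing to the entropy
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([50, 50, 50]))  # root node of the Iris tree: log2(3), about 1.585
print(entropy([0, 50, 50]))   # two equally frequent classes: 1.0
print(entropy([50, 0, 0]))    # pure node: 0.0
```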

## Plotting decision boundaries

* plotting decision boundaries is another way to understand a classifier
* only works for two- (and maybe three-) dimensional data
* idea: create a grid of points and classify them (`predict`) using the decision tree
* we demonstrate this for different pairs of features (selected from the 4-dimensional dataset)
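The grid construction can be sketched on a tiny example first. `np.meshgrid` produces the x and y coordinates of all grid points, and `np.c_` combines them into one row per point, which is the input format expected by `predict`. The grid size here is deliberately small and chosen only for illustration.

```python
import numpy as np

# a 3 x 2 grid: x in {0, 1, 2}, y in {0, 1}
xx, yy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(xx.shape)   # (2, 3)

# one row per grid point: (x, y)
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)   # (6, 2)
print(grid_points)
```

After prediction, the one-dimensional result can be brought back into grid form with `reshape(xx.shape)`.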

In [None]:
import numpy as np

# Parameters
n_classes = len(iris.target_names)
plot_colors = "ryb" # red, yellow, blue
plot_step = 0.02

plt.figure()
for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train a decision tree classifier just on the selected pair of features
    clf = DecisionTreeClassifier(criterion='entropy').fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
 
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    edgecolor='black', s=15)

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.axis("tight")
plt.show()

## Inference: the `predict` method

In [None]:
y_predict = decision_tree.predict(iris.data)
y_predict

## Evaluating the prediction

In [None]:
y == y_predict

### Evaluation metrics

The module [`sklearn.metrics`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) provides different evaluation metrics. The most popular metric for classification is *accuracy*.

In [None]:
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(y, y_predict)}")
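Accuracy is simply the fraction of correctly classified samples, so it can also be computed by hand from the elementwise comparison. A minimal sketch with made-up labels (`y_true` and `y_pred` are example data, not taken from the Iris dataset):

```python
import numpy as np

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

correct = y_true == y_pred            # boolean array, True where the prediction is right
manual_accuracy = np.mean(correct)    # True counts as 1, False as 0

print(correct.sum(), "of", correct.size, "correct")
print(manual_accuracy)                # 4 of 5 correct, i.e. 0.8
```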

## Training a small decision tree

In [None]:
X, y = iris.data , iris.target

small_decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2)
small_decision_tree.fit(X, y)
print(export_text(small_decision_tree, feature_names=iris.feature_names))

In [None]:
y_predict_small = small_decision_tree.predict(X)
y == y_predict_small

In [None]:
print(f"Accuracy: {accuracy_score(y, y_predict_small)}")

## Train and Test data

Split your dataset into two parts:
* train: used to train your model (e.g., build your decision tree)
* test: used to estimate how well your model *generalizes*, that is, how well it performs on *new*, unseen data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
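By default, `train_test_split` puts 25% of the data into the test set and produces a different random split on every call. The following sketch shows some commonly used parameters; the concrete values (`test_size=0.2`, `random_state=0`) are arbitrary choices for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the samples go into the test set
    random_state=0,   # fixed seed: the same split on every run
    stratify=y)       # keep the class proportions in both parts

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test))           # number of test samples per class
```

A fixed `random_state` makes results reproducible, which helps when comparing different models on the same split.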

In [None]:
decision_tree = DecisionTreeClassifier(criterion='entropy')  # , max_depth=2
decision_tree.fit(X_train, y_train)

y_predict_train = decision_tree.predict(X_train)
y_predict_test = decision_tree.predict(X_test)

print(f"Training accuracy: {accuracy_score(y_train, y_predict_train)*100:.2f}%")
print(f"Test accuracy: {accuracy_score(y_test, y_predict_test)*100:.2f}%")