# Basics of Machine Learning `part 2`

In [None]:
%pylab inline

# Table of Contents
<!-- MarkdownTOC autolink=true autoanchor=true bracket=round -->

- [Realworld dataset](#realworld_dataset)
- [Performance metrics in classification](#perf_metrics)
- [Model selection](#model_selection)

<a name="realworld_dataset"></a>
# Realworld dataset

In [None]:
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore') 

from basics.utils import reduce_dataset

We consider here a more complex dataset, stemming from "real-world" data: the **covtype** dataset provided within sklearn. From the sklearn user guide (https://scikit-learn.org/stable/datasets/index.html#covtype-dataset): 
> The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch’s cover type, i.e. the dominant species of tree. There are seven covertypes, making this a multiclass classification problem. Each sample has 54 features, described on the dataset’s homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.

Characteristics:

  | Element | value |
  |---------|---|
  | Classes | 7 |
  | Samples total |	581012 |
  | Dimensionality |	54 |
  | Features | int |
  
 Example of instance: ![](https://archive.ics.uci.edu/ml/assets/MLimages/Large31.jpg)
 
Dataset website: https://archive.ics.uci.edu/ml/datasets/Covertype

In [None]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
from sklearn import datasets
dataset = datasets.fetch_covtype()

In [None]:
print(dataset.data.shape)

In [None]:
plt.plot(dataset.data[1001])

In [None]:
print(np.unique(dataset.target))

In [None]:
features = dataset.data
labels = dataset.target

We will select a subset of the dataset so that results are obtained almost immediately:

In [None]:
[features, labels] = reduce_dataset(features, labels, num_obs=1660)

In [None]:
print(features.shape)
print(labels.shape)

<a name="perf_metrics"></a>
# Performance metrics in classification

## Accuracy

We have already mentioned that accuracy is a widely used metric to assess the performance of a model. It is the number of correct predictions divided by the total number of predictions.
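As a minimal sketch of this definition, accuracy can be computed by hand on a few labels. The toy arrays below are made up for illustration and are not taken from the covtype dataset.

```python
import numpy as np

# Toy labels, made up for illustration only
true_labels = np.array([1, 2, 2, 3, 1, 3, 2, 1])
pred_labels = np.array([1, 2, 3, 3, 1, 2, 2, 1])

# Accuracy = number of correct predictions / total number of predictions
num_correct = np.sum(true_labels == pred_labels)
toy_accuracy = num_correct / len(true_labels)
print('toy accuracy =', toy_accuracy)
```

Six of the eight predictions match the true labels, so the toy accuracy is 0.75.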

Let us inspect accuracy on the dataset considered. We first split the dataset into a training set and a testing set.

**Good practice**: shuffle the dataset 

In [None]:
import random

In [None]:
indexes = np.arange(len(features))
print(indexes)

In [None]:
random.shuffle(indexes)
print(indexes)

Shuffled dataset:

In [None]:
X = features[indexes]
Y = labels[indexes]

Now we can split between **training set** and **testing set**

In [None]:
train_X = X[:int(0.8 * len(X))]
train_Y = Y[:int(0.8 * len(Y))]

In [None]:
test_X = X[int(0.8 * len(X)):]
test_Y = Y[int(0.8 * len(Y)):]

Inspect total sizes:

In [None]:
print(train_X.shape)
print(test_X.shape)
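
The shuffle-and-split steps above can also be written with sklearn's `train_test_split`. This is only a sketch of an alternative, not the approach used in this notebook. The toy arrays stand in for `features` and `labels` and are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for `features` and `labels` (made up for illustration)
toy_features = np.arange(40).reshape(20, 2)
toy_labels = np.array([1, 2] * 10)

# Shuffle, then keep 20% of the observations for testing;
# random_state makes the shuffle reproducible
toy_train_X, toy_test_X, toy_train_Y, toy_test_Y = train_test_split(
    toy_features, toy_labels, test_size=0.2, shuffle=True, random_state=0
)
print(toy_train_X.shape, toy_test_X.shape)
```

Fixing `random_state` is a design choice: it gives the same split on every run, which makes scores comparable between experiments.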

Let's get a classifier of our choice (here, k-nearest neighbours):

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()

In [None]:
clf.fit(train_X, train_Y)

In [None]:
accuracy = clf.score(test_X, test_Y)
print('accuracy =', accuracy)

**QUESTION: what does it mean?**

## Confusion matrix

It is useful to have a statistic on the number of correct answers our classifier gives, but to understand its behaviour we need more information.

A very common tool is to inspect the confusion the classifier makes between classes, using what is called a confusion matrix: it counts the number of times an instance from class `i` has been predicted as class `j`, where `j` can be `i` or any other class.

<img src='./assets/confusion_matrix.png' style="width:20%"></img>

In [None]:
pred_Y = clf.predict(test_X)

In [None]:
print(pred_Y)

### Computing the confusion matrix "manually"

In [None]:
classes = np.unique(test_Y)
num_classes = len(classes)
print(num_classes, classes)

In [None]:
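# Rows index the true class, columns the predicted class.
# Cover types are labelled 1..7, hence the "- 1" to get 0-based indices.
confmat = np.zeros((num_classes, num_classes))
for obs_i in range(len(test_X)):
    pred_Y_i = pred_Y[obs_i]  # prediction already computed above
    confmat[test_Y[obs_i] - 1, pred_Y_i - 1] += 1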

In [None]:
confmat

In [None]:
plt.imshow(confmat)
plt.colorbar()

### Using sklearn

Sklearn provides a method for computing confusion matrices, see dedicated page: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
pred_Y = clf.predict(test_X)

In [None]:
confmat = confusion_matrix(test_Y, pred_Y)

In [None]:
plt.imshow(confmat)
plt.colorbar()

## Additional important metrics

### Notion of true positives and co.

We differentiate between true positives, true negatives, false positives and false negatives.

Definitions. Let us consider the case where we have to classify an image `I` as either a `cat`, a `dog`, or a `donkey`. We first consider the class `cat`:
- **True Positives (TPs)**: we know that an image `I` belongs to class `cat`, and it has been rightly labelled `cat` by the classifier
- **True Negatives (TNs)**: we know that an observation `I` does not belong to class `cat` (in our example, it can belong to class `dog` or `donkey`, we don't really care) and it has not been labelled `cat` by the classifier
- **False Positives (FPs)**: we know that an observation `I` does not belong to class `cat` (it belongs to class `dog` or `donkey`), and it has been wrongly labelled `cat` by the classifier
- **False Negatives (FNs)**: we know that an observation `I` belongs to class `cat` and it has not been labelled `cat` by the classifier.

From these categories, we can compute two useful measures: **Precision** and **Recall**. Taking the example above:
- Precision is the proportion of images rightly categorized as `cat` among all the instances categorized as a `cat`
- Recall is the proportion of images rightly categorized as `cat` among all instances that should have been categorized as `cat`
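
The four categories and the two measures can be sketched on a handful of labels. The toy arrays below are made up for illustration, with `cat` taken as the positive class.

```python
import numpy as np

# Toy ground truth and predictions, made up for illustration only
true_labels = np.array(['cat', 'cat', 'cat', 'dog', 'dog', 'donkey', 'donkey', 'cat'])
pred_labels = np.array(['cat', 'dog', 'cat', 'cat', 'dog', 'donkey', 'cat', 'cat'])

positive = 'cat'
tp = np.sum((true_labels == positive) & (pred_labels == positive))  # cats labelled cat
tn = np.sum((true_labels != positive) & (pred_labels != positive))  # non-cats not labelled cat
fp = np.sum((true_labels != positive) & (pred_labels == positive))  # non-cats labelled cat
fn = np.sum((true_labels == positive) & (pred_labels != positive))  # cats not labelled cat

toy_precision = tp / (tp + fp)  # among everything labelled cat, how much is really a cat
toy_recall = tp / (tp + fn)     # among all real cats, how many were found
print('TP', tp, 'TN', tn, 'FP', fp, 'FN', fn)
print('precision', toy_precision, 'recall', toy_recall)
```

Here 3 cats are found, 1 cat is missed, and 2 other animals are wrongly labelled `cat`, which gives a precision of 3/5 and a recall of 3/4.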

From Wikipedia:

<img src="./assets/precisionrecall.png" style="width:30%"></img>

A widely used measure taking into account Precision and Recall is the **F1 Score**:

$$\text{fscore} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
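
As a quick check of the formula, the sketch below applies it to a toy binary problem (labels made up for illustration) and compares the result with sklearn's `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary problem (1 = positive class), made up for illustration only
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
pred_labels = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

p = precision_score(true_labels, pred_labels)  # 3 TP / (3 TP + 2 FP)
r = recall_score(true_labels, pred_labels)     # 3 TP / (3 TP + 1 FN)
toy_fscore = 2 * (p * r) / (p + r)

print('formula :', toy_fscore)
print('sklearn :', f1_score(true_labels, pred_labels))
```

Note that for multiclass problems, `f1_score(..., average='macro')` averages the per-class F1 scores, which is generally not the same as applying the formula to macro-averaged precision and recall.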

### Computations

In [None]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
[precision, recall, _, _] = precision_recall_fscore_support(test_Y, pred_Y)

In [None]:
precision

In [None]:
recall

Average over all classes (macro average):

In [None]:
[precision, recall, _, _] = precision_recall_fscore_support(test_Y, pred_Y, average='macro')

In [None]:
fscore = 2 * (precision * recall) / (precision + recall)

In [None]:
print(fscore)

Does this look like the accuracy?

### Imbalanced dataset

So far we have made a strong assumption about our datasets: **each class has the same number of observations!** This is not realistic in real-world cases.

Let's inspect what happens with a dataset that has a different number of instances per class. We take the original data again, but this time we reduce the whole dataset by keeping a certain percentage of the observations per class rather than a fixed number of them.

In [None]:
features = dataset.data
labels = dataset.target
[features, labels] = reduce_dataset(features, labels, reduce_by=98)

In [None]:
print(features.shape)

As before, we shuffle and build the training and testing sets:

In [None]:
indexes = np.arange(len(features))
random.shuffle(indexes)

In [None]:
X = features[indexes]
Y = labels[indexes]

In [None]:
train_X = X[:int(0.8 * len(X))]
train_Y = Y[:int(0.8 * len(Y))]

In [None]:
test_X = X[int(0.8 * len(X)):]
test_Y = Y[int(0.8 * len(Y)):]

Inspect number of instances per class:

In [None]:
for c in np.unique(labels):
    idx_train = np.where(train_Y == c)[0]
    idx_test = np.where(test_Y == c)[0]
    print('class', c, '\t num. training obs', len(idx_train), ' | num. testing obs', len(idx_test))

In [None]:
clf = KNeighborsClassifier()

In [None]:
clf.fit(train_X, train_Y)

In [None]:
accuracy = clf.score(test_X, test_Y)
print('accuracy =', accuracy)

**QUESTIONS:** 
- Do we have a better classifier than before? 
- What does this score mean?

Exercise: plot the confusion matrix and comment on it

In [None]:
pred_Y = clf.predict(test_X)
confmat = confusion_matrix(test_Y, pred_Y)
plt.imshow(confmat)
plt.colorbar()

Let's normalise each row by its sum, so that the matrix shows proportions per true class rather than raw counts:

In [None]:
confmat = np.float32(confmat)

In [None]:
for i in range(len(confmat)):
    confmat[i,:] = confmat[i,:] / np.sum(confmat[i,:])
plt.imshow(confmat)
plt.colorbar()

Accuracy alone is limited: it is a single number dominated by the most frequent classes, so it gives little insight into how each class is handled.
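
A small sketch makes this concrete. The toy labels are made up for illustration, and the "classifier" is hypothetical: it simply predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy imbalanced problem, made up for illustration:
# 95 observations of class 1 and only 5 of class 2
true_labels = np.array([1] * 95 + [2] * 5)
# A "classifier" that always predicts the majority class
pred_labels = np.ones(100, dtype=int)

toy_accuracy = accuracy_score(true_labels, pred_labels)
# zero_division=0 sets precision to 0 for the class that is never predicted
_, toy_recall, _, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average='macro', zero_division=0
)
print('accuracy    :', toy_accuracy)
print('macro recall:', toy_recall)
```

The accuracy is 0.95 even though class 2 is never detected. The macro-averaged recall is only 0.5, because it gives both classes the same weight.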

Let's inspect the fscore:

In [None]:
[precision, recall, _, _] = precision_recall_fscore_support(test_Y, pred_Y, average='macro')
fscore = 2 * (precision * recall) / (precision + recall)

In [None]:
print('accuracy:', accuracy)
print('precision:', precision)
print('recall:', recall)
print('fscore:', fscore)

<a name="model_selection"></a>
# Model Selection

In machine learning, we usually compare various models in order to pick the best one for a particular application.

In [None]:
from sklearn.neural_network import MLPClassifier
%pylab inline

## Comparing models by varying parameters

In [None]:
MLPClassifier()

More on webpage: https://scikit-learn.org/stable/modules/neural_networks_supervised.html

<img src="assets/exerice-icon.png" style="width:80px; float:left;"></img><div style="clear:left;"></div>
**EXERCISE:** find the best MLP model in terms of `learning_rate`

In [None]:
# TODO

## Comparing classifiers

In the context of this course, for the sake of comparison, we compare classification accuracy for three classifiers:
- LDA: Linear Discriminant Analysis (linear model)
- kNN: k-Nearest Neighbours (non-linear)
- MLP: Multi-layer Perceptron (non-linear)

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

<img src="assets/exerice-icon.png" style="width:80px; float:left;"></img><div style="clear:left;"></div>
**EXERCISE:** find the best classifier between `discriminant_analysis` (LDA), `neighbors` (kNN) and `MLPClassifier` (Neural Network), trained on `train_X, train_Y` and tested on `test_X, test_Y`

In [None]:
# TODO

## Cross-validation

Model selection is usually done using cross-validation: a way to create several different splits of a dataset and run one test per split. Averaging the scores over the splits gives a statistical estimate of how well a classifier generalises, which makes it a sound basis for comparing models.

There are different ways to split a dataset. Sklearn has several methods for that (see [API](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)):

Function | Description
--- | ---
`model_selection.KFold([n_splits, shuffle, ...])` | K-Folds cross-validator
`model_selection.GroupKFold([n_splits])`	| K-fold iterator variant with non-overlapping groups.
`model_selection.StratifiedKFold([n_splits, ...])`	| Stratified K-Folds cross-validator
`model_selection.LeaveOneGroupOut()`	| Leave One Group Out cross-validator
`model_selection.LeavePGroupsOut(n_groups)`	| Leave P Group(s) Out cross-validator
`model_selection.LeaveOneOut()`	| Leave-One-Out cross-validator
`model_selection.LeavePOut(p)`	| Leave-P-Out cross-validator
`model_selection.ShuffleSplit([n_splits, ...])`	| Random permutation cross-validator
`model_selection.GroupShuffleSplit([...])`	| Shuffle-Group(s)-Out cross-validation iterator
`model_selection.StratifiedShuffleSplit([...])`	| Stratified ShuffleSplit cross-validator
`model_selection.PredefinedSplit(test_fold)`	| Predefined split cross-validator
`model_selection.TimeSeriesSplit([n_splits])`	| Time Series cross-validator

A partition of the initial dataset is usually called a fold. In the example below, we will use the **stratified k-fold**, which creates folds that preserve the percentage of samples of each class.
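
To see what "preserving the percentage of samples" means, here is a minimal sketch on toy labels. The data are made up for illustration, with twice as many observations in class 1 as in class 2.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels, made up for illustration:
# 12 observations of class 1 and 6 of class 2
toy_labels = np.array([1] * 12 + [2] * 6)
toy_features = np.zeros((len(toy_labels), 2))  # feature values do not affect the split

toy_splitter = StratifiedKFold(n_splits=3)
fold_counts = []
for toy_train_index, toy_test_index in toy_splitter.split(toy_features, toy_labels):
    # Count how many observations of each class land in the test fold
    classes, counts = np.unique(toy_labels[toy_test_index], return_counts=True)
    fold_counts.append(dict(zip(classes.tolist(), counts.tolist())))
    print('test fold class counts:', fold_counts[-1])
```

Each test fold keeps the 2:1 ratio of the full toy dataset: 4 observations of class 1 and 2 of class 2.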

In [None]:
from sklearn.model_selection import StratifiedKFold

Let's for instance declare a stratified splitting method with a number of splits equal to 4:

In [None]:
splitter = StratifiedKFold( n_splits = 4 )

Reload the data so that each class has the same number of observations:

In [None]:
features = dataset.data
labels = dataset.target
[features, labels] = reduce_dataset(features, labels, num_obs=1660)

In [None]:
splitter.split(train_X, train_Y)

In [None]:
train_index, test_index = next(splitter.split(train_X, train_Y))

In [None]:
train_X_split = train_X[train_index]
train_Y_split = train_Y[train_index]

In [None]:
test_X_split = train_X[test_index]
test_Y_split = train_Y[test_index]

In [None]:
np.unique(train_Y_split)

In [None]:
np.unique(test_Y_split)

Loop on splits:

In [None]:
for train_index, test_index in splitter.split(train_X, train_Y):
    # Do something 
    TODO = True

## Compare models with cross-validation

In [None]:
count_tests = 0

score1 = []
score2 = []
score3 = []

splitter = StratifiedKFold( n_splits = 12 )

for train_index, test_index in splitter.split(train_X, train_Y):
    
    print('Split')

    # select training and testing datasets
    train_X_split = train_X[train_index]
    train_Y_split = train_Y[train_index]
    test_X_split = train_X[test_index]
    test_Y_split = train_Y[test_index]
    
    clf1 = LinearDiscriminantAnalysis()
    clf2 = KNeighborsClassifier()
    clf3 = MLPClassifier()

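    # fit on the training part of the current split only
    clf1.fit(train_X_split, train_Y_split)
    clf2.fit(train_X_split, train_Y_split)
    clf3.fit(train_X_split, train_Y_split)

    # score on the held-out part of the current split
    s1 = clf1.score(test_X_split, test_Y_split)
    s2 = clf2.score(test_X_split, test_Y_split)
    s3 = clf3.score(test_X_split, test_Y_split)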
    
    score1.append(s1)
    score2.append(s2)
    score3.append(s3)

print('LDA:', np.mean(score1))
print('kNN:', np.mean(score2))
print('MLP:', np.mean(score3))


___
About this material: copyright Baptiste Caramiaux (write me for any questions or use of this material [email](mailto:baptiste.caramiaux@lri.fr))
___