# Basics of Machine Learning `part 2`

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


# Table of Contents
<!-- MarkdownTOC autolink=true autoanchor=true bracket=round -->

- [Performance metrics in classification](#perf_metrics)
- [Model selection]
- [Regression tasks](#toyexample)

# Libraries and dataset

In [2]:
import random
import numpy as np
from basics.utils import reduce_dataset

We consider here a more complex dataset, stemming from "real-world" data: the forest cover type dataset, which scikit-learn provides through `datasets.fetch_covtype()`.

In [3]:
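import ssl
# Workaround for SSL certificate errors when downloading the dataset:
# this disables HTTPS certificate verification (insecure, use with care)
ssl._create_default_https_context = ssl._create_unverified_context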

In [4]:
from sklearn import datasets
dataset = datasets.fetch_covtype()

In [5]:
print(dataset.data.shape)

(581012, 54)


In [6]:
print(np.unique(dataset.target))

[1 2 3 4 5 6 7]


In [7]:
features = dataset.data
labels = dataset.target

In [8]:
[features, labels] = reduce_dataset(features, labels, reduce_by=99)

In [9]:
print(features.shape)
print(labels.shape)

(5807, 54)
(5807,)


<a name="perf_metrics"></a>
# Performance metrics in classification

## Accuracy

We have already mentioned that accuracy is a widely used metric to assess the performance of a model. It is the fraction of correct predictions among all the predictions made.
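
As a minimal sketch of this definition, the snippet below computes accuracy by hand on a small set of made-up labels. The arrays `y_true` and `y_pred` are illustrative values, not taken from the dataset used in this notebook.

```python
import numpy as np

# Hypothetical ground-truth labels and predictions, for illustration only
y_true = np.array([1, 2, 2, 3, 1, 2])
y_pred = np.array([1, 2, 3, 3, 2, 2])

# Element-wise comparison gives True where the prediction is correct
correct = (y_true == y_pred)

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = correct.sum() / len(y_true)

print('correct predictions:', correct.sum(), 'out of', len(y_true))
print('accuracy =', manual_accuracy)
```

For scikit-learn classifiers, the `score` method returns this same quantity (the mean accuracy on the given data and labels).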

Let us inspect accuracy on the dataset considered. We first split the dataset into a training set and a testing set.

**Good practice**: shuffle the dataset before splitting, so that neither set depends on the original ordering of the samples.

In [15]:
import random

In [16]:
indexes = np.arange(len(features))
print(indexes)

[   0    1    2 ... 5804 5805 5806]


In [17]:
random.shuffle(indexes)
print(indexes)

[4906 5630 2044 ... 1610  666 4025]


Shuffled dataset:

In [18]:
X = features[indexes]
Y = labels[indexes]
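
The shuffle above gives a different order at every run. If a reproducible order is preferred, one possible variant is to draw the permutation from a seeded NumPy generator. The sketch below uses small stand-in arrays (`toy_features`, `toy_labels`) and an arbitrary seed; these names and values are assumptions for illustration only.

```python
import numpy as np

# Toy stand-ins for `features` and `labels` (illustrative values only)
toy_features = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
toy_labels = np.array([0, 1, 0, 1, 0])       # one label per sample

# A seeded generator produces the same permutation at every run
rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(toy_features))    # shuffled copy of 0..n-1

# Index both arrays with the same permutation so rows and labels stay aligned
shuffled_features = toy_features[perm]
shuffled_labels = toy_labels[perm]

print(perm)
print(shuffled_features)
print(shuffled_labels)
```

Using a single index array for both features and labels is what keeps each sample paired with its own label after shuffling.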

Now we can split between **training set** and **testing set**

In [21]:
train_X = X[:int(0.8 * len(X))]
train_Y = Y[:int(0.8 * len(Y))]

In [22]:
test_X = X[int(0.8 * len(X)):]
test_Y = Y[int(0.8 * len(Y)):]

Inspect sizes:

In [24]:
print(train_X.shape)
print(test_X.shape)

(4645, 54)
(1162, 54)
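
The manual shuffle and 80/20 slicing can also be done in one call with scikit-learn's `train_test_split`. The following is a sketch on toy data (`toy_X` and `toy_Y` are illustrative stand-ins for `X` and `Y`); the `random_state` value is arbitrary, and `stratify` is an optional extra that keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for `X` and `Y` (illustrative values only)
toy_X = np.arange(40).reshape(20, 2)   # 20 samples, 2 features
toy_Y = np.array([0, 1] * 10)          # two balanced classes

# 80% training / 20% testing; the function shuffles the samples internally
tr_X, te_X, tr_Y, te_Y = train_test_split(
    toy_X, toy_Y,
    test_size=0.2,      # fraction of samples kept for testing
    random_state=0,     # fixed seed for a reproducible split
    stratify=toy_Y,     # preserve class proportions in both sets
)

print(tr_X.shape)
print(te_X.shape)
```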


Let's pick a classifier of our choice (e.g. an SVM) and train it on the training set.

In [25]:
from sklearn import svm

In [26]:
clf = svm.SVC()

In [27]:
clf.fit(train_X, train_Y)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [30]:
accuracy = clf.score(test_X, test_Y)
print('accuracy =', accuracy)

accuracy = 0.49139414802065406
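
An accuracy value is easier to interpret next to a simple reference point. One common sanity check, which is not part of the original notebook, is the accuracy obtained by always predicting the most frequent class. The sketch below uses made-up labels (`toy_test_Y`); with the real data, `test_Y` would take its place.

```python
import numpy as np

# Hypothetical test labels with imbalanced classes (illustrative values only)
toy_test_Y = np.array([2, 2, 2, 1, 1, 2, 3, 2, 1, 2])

# Count how often each class appears
classes, counts = np.unique(toy_test_Y, return_counts=True)

# A "classifier" that always answers the most frequent class
majority_class = classes[np.argmax(counts)]
baseline_accuracy = counts.max() / len(toy_test_Y)

print('majority class =', majority_class)
print('baseline accuracy =', baseline_accuracy)
```

If a trained model scores close to this baseline, accuracy alone does not show that it has learned more than the class frequencies.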


___
About this material: copyright Baptiste Caramiaux (write me for any questions or use of this material [email](mailto:baptiste.caramiaux@lri.fr))
___