# Intro to Machine Learning with Classification

## Contents
1. **Loading** iris dataset
2. Splitting into **train**- and **test**-set
3. Creating a **model** and training it
4. **Predicting** test set
5. **Evaluating** the result
6. Selecting **features**

This notebook will introduce you to machine learning and classification, using our most valued Python data science toolkit: [scikit-learn](http://scikit-learn.org/).

Classification lets you automatically assign labels to new data, based on the labels of previous data. The algorithm determines by itself which features it will use to classify, so the programmer no longer has to work this out by hand (although it helps).

First, we will transform a dataset into a set of features with labels that the algorithm can use. Then we will predict labels and validate them. Finally, we will select features manually and see if we can make the prediction better.

Let's start with some imports.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

## 1. Loading iris dataset

We load the dataset from the datasets module in sklearn.

In [2]:
iris = datasets.load_iris()

This dataset contains information about iris flowers. Every entry describes a flower, more specifically its 
- sepal length
- sepal width
- petal length
- petal width

So every entry has four columns.

![Iris](https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/84f03ae1d048482471f2a9ca85b0c649730cc269/images/03_iris.png)

We can visualise the data with Pandas, a Python library to handle dataframes. This gives us a pretty table to see what our data looks like.

We will not cover Pandas in this notebook, so don't worry about this piece of code.

In [3]:
import pandas as pd
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target
df.sample(n=10)  # show 10 random rows

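|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | 2 |
| 12  | 4.8 | 3.0 | 1.4 | 0.1 | 0 |
| 11  | 4.8 | 3.4 | 1.6 | 0.2 | 0 |
| 86  | 6.7 | 3.1 | 4.7 | 1.5 | 1 |
| 36  | 5.5 | 3.5 | 1.3 | 0.2 | 0 |
| 27  | 5.2 | 3.5 | 1.5 | 0.2 | 0 |
| 55  | 5.7 | 2.8 | 4.5 | 1.3 | 1 |
| 13  | 4.3 | 3.0 | 1.1 | 0.1 | 0 |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 | 2 |
| 114 | 5.8 | 2.8 | 5.1 | 2.4 | 2 |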


There are 3 different species of irises in the dataset. Every species has 50 samples, so there are 150 entries in total.

We can confirm this by checking the "data"-element of the iris variable. The "data"-element is a 2D-array that contains all our entries. We can use its `.shape` attribute to check its dimensions.

In [4]:
iris.data.shape

(150, 4)

To get an example of the data, we can print the first ten rows:

In [5]:
print(iris.data[0:10, :]) # 0:10 gets rows 0 through 9, : gets all the columns

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
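
The slice `0:10` includes index 0 but stops before index 10, which is why exactly ten rows are printed. A minimal sketch on a made-up array (the name `demo` and its values are purely illustrative and not part of the iris data):

```python
import numpy as np

# Hypothetical 12x4 array standing in for a feature matrix
demo = np.arange(48).reshape(12, 4)

first_ten = demo[0:10, :]     # rows 0 through 9, all columns
last_two_cols = demo[:, 2:4]  # all rows, columns 2 and 3

print(first_ten.shape)      # (10, 4)
print(last_two_cols.shape)  # (12, 2)
print(first_ten[-1])        # the row with index 9: [36 37 38 39]
```

The same pattern works on `iris.data`, for example to look at only the two petal columns.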


The labels that we're looking for are in the "target"-element of the iris variable. This 1D-array contains the iris species for each of the entries.

In [6]:
iris.target.shape

(150,)

Let's have a look at the target values:

In [7]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


There are three categories, so each entry will be classified as 0, 1 or 2. To get the names of the corresponding species, we can print `target_names`.

In [8]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']
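
To tie the integer labels to the species names and double-check the "50 samples per species" claim, something like the following sketch could be used. It reloads the dataset so that it runs on its own; `np.bincount` is one of several ways to count labels.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# np.bincount counts how often each integer label (0, 1, 2) occurs
counts = np.bincount(iris.target)

for label, (name, count) in enumerate(zip(iris.target_names, counts)):
    print(f"{label} -> {name}: {count} samples")
```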


The iris variable is a dataset from sklearn and also contains a description of itself. We already provided the information you need to know about the data, but if you want to check, you can print the `DESCR` attribute of the iris dataset.

In [9]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

Now we have a good idea what our data looks like.

Our task now is to solve a **supervised** learning problem: Predict the species of an iris using the measurements that serve as our so-called **features**.

In [10]:
# First, we store the features we use and the labels we want to predict into two different variables
X = iris.data
y = iris.target

## 2. Splitting into train- and test-set

We want to evaluate our model on labelled data that the model has not seen yet. This gives us an idea of how well the model can predict new data, and helps us check that we are not [overfitting](https://en.wikipedia.org/wiki/Overfitting). If we tested and trained on the same data, the model would just learn this dataset really, really well, and we would not be able to tell anything about how it handles other data.
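
To see why a score on the training data can be misleading, here is a small sketch on made-up data (not the iris set). The labels are pure noise, so there is nothing real to learn. The names `X_noise`, `y_noise` and `clf` are illustrative only, and a decision tree is used simply as an example of a model that can memorise its training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data: 4 random features and labels that are pure noise
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 4))
y_noise = rng.integers(0, 3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.33, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = clf.score(X_te, y_te)   # accuracy on unseen data

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The tree memorises the training rows perfectly, while its accuracy on the held-out rows should land near chance level (about one in three). Only the second number says something about new data.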

So we split our dataset into a train- and test-set. Sklearn has a function to do this: `train_test_split`. Have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of this function and see if you can split `iris.data` and `iris.target` into train- and test-sets with a test-size of 33%.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =  # TODO: split iris.data and iris.target into test and train

We can now check the size of the resulting arrays. The shapes should be `(100, 4)`, `(100,)`, `(50, 4)` and `(50,)`.

In [None]:
print("X_train shape: {}, y_train shape: {}".format(X_train.shape, y_train.shape))
print("X_test  shape: {} , y_test  shape: {}".format(X_test.shape, y_test.shape))
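
The expected sizes follow from simple arithmetic. The sketch below assumes that `train_test_split` rounds the test-set size up when `test_size` is given as a fraction, which is how current scikit-learn versions behave as far as we know:

```python
import math

n_samples = 150
test_size = 0.33

n_test = math.ceil(test_size * n_samples)  # 49.5 rounded up
n_train = n_samples - n_test               # the remaining rows

print(n_train, n_test)  # 100 50
```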

## 3. Creating a model and training it

Now we will give the data to a model. We will use a Decision Tree Classifier model for this.

This model will create a decision tree based on the X_train and y_train values and include decisions like this:

![Decision tree](https://sebastianraschka.com/images/blog/2014/intro_supervised_learning/decision_tree_1.png)

Find the Decision Tree Classifier in sklearn and call its constructor. It might be useful to set the `random_state` parameter to 0; otherwise a different tree may be generated each time you run the code.

In [None]:
from sklearn import tree

model =  # TODO: create a decision tree classifier

The model is still empty and doesn't know anything. Train (fit) it with our train-data, so that it learns things about our iris-dataset.

In [None]:
model =  # TODO: fit the train-data to the model
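
To get a feeling for the decisions a fitted tree contains, the learned rules can be printed as text. This is only a sketch: it trains a separate, deliberately shallow tree (`small_tree`, with `max_depth=2` chosen for readability) on all 150 rows, so it is not the model from the exercise above.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()

# max_depth=2 keeps the printed tree small; it is an illustrative choice
small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# export_text renders the learned decisions as indented if/else rules
rules = tree.export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one threshold test on a feature, and each `class:` line is the label the tree assigns once it reaches a leaf.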

## 4. Predicting test set

We now have a model that contains a decision tree. This decision tree knows how to turn our X_train values into y_train values. We will now let it run on our X_test values and have a look at the result.

We don't want to overwrite our actual y_test values, so we store the predicted y_test values as y_pred.

In [None]:
y_pred =  # TODO: predict y_pred from X_test
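
`predict` expects a 2D array with one row per flower and returns one integer label per row. The sketch below predicts a single flower and translates the label into a species name. It uses its own classifier `clf`, trained on all 150 rows for brevity, so it illustrates the call rather than a fair evaluation; the measurements are those of the first row of the dataset.

```python
import numpy as np
from sklearn import datasets, tree

iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One flower: sepal length, sepal width, petal length, petal width (in cm)
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

predicted_label = clf.predict(new_flower)[0]         # integer class: 0, 1 or 2
predicted_name = iris.target_names[predicted_label]  # human-readable species

print(predicted_label, predicted_name)
```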

## 5. Evaluating the result

We now have y_test (the real values for X_test) and y_pred. We can print these values and compare them, to get an idea of how well the model predicted the data.

In [None]:
print(y_test)
print("-"*75)  # print a line
print(y_pred)

If we look at the values closely, we can discover that nearly all values are predicted correctly (the exact number depends on how the data was split). However, it is bothersome to compare the numbers one by one. There are only fifty of them, but what if there were five hundred? We will need an easier method to compare our results.

Luckily, this can also be found in sklearn. Google for sklearn's accuracy score and use it to compare our y_test and y_pred. This will give us the fraction of entries that were predicted correctly (a number between 0 and 1).

In [None]:
from sklearn import metrics

  # TODO: calculate accuracy score of y_test and y_pred
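
The accuracy score is simply the share of predictions that match the true labels. The following sketch checks this on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's results):

```python
import numpy as np
from sklearn import metrics

# Made-up labels, only for illustration: 4 of the 5 predictions are correct
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 1, 2, 1, 1])

manual = np.mean(y_true_demo == y_pred_demo)  # share of matching entries
from_sklearn = metrics.accuracy_score(y_true_demo, y_pred_demo)

print(manual, from_sklearn)  # 0.8 0.8
```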

That's pretty good, isn't it?

To understand what our classifier actually did, have a look at the following picture:

![Decision Tree](http://scikit-learn.org/stable/_images/sphx_glr_plot_iris_0011.png)

We see the distribution of all our features, compared with each other. Some have very clear distinctions between two categories, so our decision tree probably used those to make predictions about our data.

## 6. Selecting features

In our dataset, there are four features to describe the flowers. Using these four features, we got a pretty high accuracy when predicting the species. But maybe some of our features were not necessary. Maybe some did not improve our prediction, or even made it worse.

It's worth a try to see if a subset of features is better at predicting the labels than all features.

We still have our X_train, X_test, y_train and y_test variables. We will try removing a few columns from X_train and X_test and recalculate our accuracy.

In [None]:
# Re-create the split with a fixed random_state before fitting the selector,
# so that the selector and the transformed sets are based on the same data
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)

First, create a feature selector that will select the 2 features of X_train that best describe y_train.

(Hint: look at the imports)

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

selector =  # TODO: create a selector for the 2 best features and fit X_train and y_train to it

We can check which features our selector selected, using the following function:

In [None]:
print(selector.get_support())

It gives us an array of True and False values that represent the columns of the original X_train. The values that are marked by True are considered the most informative by the selector. Let's use the selector to select (transform) these features from the X_train values.

In [None]:
X_train_new =  # TODO: use selector to transform X_train

The dimensions of X_train have now changed:

In [None]:
X_train_new.shape
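
The boolean mask from `get_support()` is easier to read when it is mapped back to the feature names. A possible sketch, using a separate selector (`demo_selector`) fitted on all 150 rows for brevity rather than on `X_train`:

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()

# Fitted on all 150 rows for brevity; the notebook fits on X_train instead
demo_selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

mask = demo_selector.get_support()  # boolean array, one entry per column

# Keep the names of the columns that are marked True
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
reduced = demo_selector.transform(iris.data)  # only the selected columns

print(selected)
print(reduced.shape)  # (150, 2)
```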

If we want to use these values in our model, we will need to adjust X_test as well. We would get in trouble later if X_train has only 2 columns and X_test has 4. So perform the same selection on X_test.

In [None]:
X_test_new =  # TODO: use selector to transform X_test

In [None]:
X_test_new.shape

Now we can repeat the earlier steps: create a model, fit the data to it and predict our y_test values.

In [None]:
model =  # TODO: create model as before
model =  # TODO: fit model as before, but use X_train_new
y_pred =  # TODO: predict values as before, but use X_test_new

Let's have a look at the accuracy score of our new prediction. 

In [None]:
  # TODO: calculate accuracy score as before

So our new prediction, which uses only two of the four features, is better than the one that uses all of the information. The two features we used are petal length and petal width. They say more about the species of the flowers than sepal length and sepal width do.