## Supervised learning: classification

In this exercise we will work with two types of classification models. To do so, we will use functions from the scikit-learn library (imported as sklearn). This library contains many machine learning algorithms that are used in practice, so it is well worth familiarizing yourself with it.

## Libraries

In [None]:
import numpy
import pandas as pd

from sklearn import tree 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


## Read the data

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/Daphne310/TrainingCases/master/iris-data-training.csv')
iris.head()

## Data preparation 

We just loaded the iris dataset. It contains data on three species of iris. The features are measurements of the length and width of the sepal and the petal. Let's first split the data into features and targets. The iris dataset is a well-defined set for educational purposes, but in practice you will have to define the features and targets yourself before this step.

In [None]:
data = iris[['sepal_length_cm', 'sepal_width_cm',
             'petal_length_cm', 'petal_width_cm']].values
target = iris['class'].values

To get a better understanding of the data, we will print the first example. 

In [None]:
print('The features of the first example are: ', data[0])

In [None]:
print('The label of the first example is: ', target[0])

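The next step is to create a training set and a test set. The model is fitted on the training set only, and the test set is held back to measure how well it predicts examples it has not seen.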

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, 
                                                  test_size=1/3, random_state=0)

print('The number of measurements is: ', data.shape[0])
print('The number of training measurements is: ', X_train.shape[0])
print('The number of test measurements is: ', X_test.shape[0])
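A random split does not guarantee that every class is equally represented in both sets. The sketch below shows one way to check this. It makes two assumptions that are not part of the exercise: it uses the copy of the iris data bundled with scikit-learn (with integer labels) as a stand-in for the CSV file, and it passes the optional stratify argument of train_test_split to keep the class proportions similar in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the CSV file above
X, y = load_iris(return_X_y=True)

# Assumption: stratify=y is an optional addition, not used in the exercise itself
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)

# Count how many examples of each class ended up in each set
classes, train_counts = np.unique(y_tr, return_counts=True)
_, test_counts = np.unique(y_te, return_counts=True)

for label, n_train, n_test in zip(classes, train_counts, test_counts):
    print('Class', label, '-> train:', n_train, 'test:', n_test)
```

Removing the stratify argument and rerunning the cell shows how much the class counts can drift in a purely random split.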

## Decision Tree Classifier 

To construct the classifier, we follow a few simple steps: 

In [None]:
#STEP 1: Choose the type of classifier 
DTalgorithm = tree.DecisionTreeClassifier(random_state=0)

#STEP 2: Fit the model to the train data
DTalgorithm = DTalgorithm.fit(X_train, y_train)

#STEP 3: Predict the labels of the test data 
Pred_DT = DTalgorithm.predict(X_test)

Congratulations! You just made a predictive algorithm!

The variable 'Pred_DT' holds the predictions made by the algorithm. Now let's take a look at your performance using classification accuracy. 

In [None]:
DT_CA = accuracy_score(y_test, Pred_DT) 

print('The Decision Tree Classifier reached a Classification Accuracy of: ', DT_CA)
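Classification accuracy is simply the fraction of predictions that match the true labels. As a minimal sketch of what accuracy_score computes in its default setting, the same number can be calculated by hand. The labels below are hypothetical toy values, not taken from the iris data, and manual_accuracy is an illustrative helper name.

```python
# Hypothetical toy labels for illustration only
y_true = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
y_pred = ['Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-setosa']

def manual_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# 3 of the 4 toy predictions are correct
toy_accuracy = manual_accuracy(y_true, y_pred)
print('Accuracy on the toy labels: ', toy_accuracy)
```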

## K-Nearest Neighbors 

To construct the classifier, we follow a few simple steps: 

In [None]:
#STEP 1: Choose the type of classifier 
KNNalgorithm = KNeighborsClassifier(n_neighbors = 3)

#STEP 2: Fit the model to the train data
KNNalgorithm = KNNalgorithm.fit(X_train, y_train)

#STEP 3: Predict the labels of the test data 
Pred_KNN = KNNalgorithm.predict(X_test)

Congratulations! You did it again!

The variable 'Pred_KNN' holds the predictions made by the algorithm. Now let's take a look at your performance using classification accuracy. 

In [None]:
KNN_CA = accuracy_score(y_test, Pred_KNN) 

print('The K-Nearest Neighbors Classifier reached a Classification Accuracy of: ', KNN_CA)

## Did you get a good score?

By now you probably know the drill. Tuning the parameters might help to reach a better performance. For K-Nearest Neighbors the most important parameter is the number of neighbors taken into account. The code currently says 'n_neighbors = 3'. Try different values to see whether the model can be improved.
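Instead of changing the number by hand, the values can also be tried in a loop. The following is a minimal sketch, assuming scikit-learn's bundled iris data as a stand-in for the CSV file; the range of odd values from 1 to 15 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV file above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Fit one model per candidate value and store its test accuracy
k_scores = {}
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    k_scores[k] = accuracy_score(y_test, model.predict(X_test))

for k, score in k_scores.items():
    print('n_neighbors =', k, '-> accuracy:', score)
```

Keep in mind that picking the value that scores best on the test set makes that score an optimistic estimate, because the test set has then been used to make a modelling choice.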


## Evaluation 

Which model reached the highest classification accuracy in the end?

In [None]:
print('The Decision Tree Classifier reached a Classification Accuracy of: ', DT_CA)
print('The K-Nearest Neighbors Classifier reached a Classification Accuracy of: ', KNN_CA)

## Question
Is this single test enough to determine which model fits best?
Tip: change the value of random_state in the train/test split and in the decision tree model, and see how the scores change.
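One way to follow the tip systematically is to repeat the whole experiment for several values of random_state and compare the average accuracy and its spread. This is a sketch only: it assumes scikit-learn's bundled iris data as a stand-in for the CSV file, and the choice of ten seeds is arbitrary.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the CSV file above
X, y = load_iris(return_X_y=True)

dt_seed_scores, knn_seed_scores = [], []
for seed in range(10):
    # A different seed gives a different train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=seed)

    dt = tree.DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

    dt_seed_scores.append(accuracy_score(y_test, dt.predict(X_test)))
    knn_seed_scores.append(accuracy_score(y_test, knn.predict(X_test)))

print('Decision Tree: mean', np.mean(dt_seed_scores),
      'std', np.std(dt_seed_scores))
print('K-Nearest Neighbors: mean', np.mean(knn_seed_scores),
      'std', np.std(knn_seed_scores))
```

If the difference between the two means is small compared with the spread across seeds, a single split is not enough to say that one model is better than the other.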