# Classification with k-NN

### 1. The IRIS dataset

The Iris flower data set is a data set describing the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of Iris (_Iris setosa_, _Iris virginica_ and _Iris versicolor_). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

In this work, you will use the Iris dataset, considering only two of the three species of Iris.

---

We start by loading the dataset. The Iris dataset is available directly as part of `scikit-learn`, and we use its `load_iris` function to retrieve the data.

In [2]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets as data

# Load dataset and print its description
iris = data.load_iris()
print(iris.DESCR)

data_X = iris.data[50:,:]          # Select only 2 classes
data_y = 2 * iris.target[50:] - 3  # Set output to {-1, +1}

# Get dimensions 
nP = data_X.shape[0]
nF = data_X.shape[1]

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis
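
The slicing and label arithmetic in the loading cell can be checked in isolation. The sketch below is a minimal check, assuming that `load_iris` returns the samples ordered by class (50 setosa, then 50 versicolor, then 50 virginica), which is why dropping the first 50 rows leaves exactly two species. The names `X`, `y`, `labels` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Rows 50..149 hold versicolor (target 1) and virginica (target 2)
X = iris.data[50:, :]

# 2 * 1 - 3 = -1 (versicolor), 2 * 2 - 3 = +1 (virginica)
y = 2 * iris.target[50:] - 3

labels, counts = np.unique(y, return_counts=True)
print(X.shape)         # (100, 4)
print(labels, counts)  # [-1  1] [50 50]
```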

---

#### Activity 1.        

* Split the data into training and test sets. To build the train and test sets, you can use the function `train_test_split` from the module `model_selection` of `scikit-learn`. Make sure that the test set corresponds to 1/10th of your data. For reproducibility, set `random_state` to some fixed value.


**Note:** Keep in mind that the test data should not be used for any design or validation decisions.

---

In [5]:
from sklearn.model_selection import train_test_split


# Keep 90% of the data for training and 10% for testing
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.1, random_state=42)
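
With only 10 test samples, the class balance of the test set depends on the random draw. The sketch below repeats the split and inspects the resulting sizes. It also passes `stratify=data_y`, which is an optional addition not used in the original cell; it asks `train_test_split` to preserve the class proportions in both subsets. The names `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data_X = iris.data[50:, :]
data_y = 2 * iris.target[50:] - 3

# Same call as in the notebook, plus stratify (an assumption, not in the original)
X_tr, X_te, y_tr, y_te = train_test_split(
    data_X, data_y, test_size=0.1, random_state=42, stratify=data_y
)

# Sizes of the two subsets and class counts in the test set
print(X_tr.shape, X_te.shape)
test_labels, test_counts = np.unique(y_te, return_counts=True)
print(test_labels, test_counts)
```

Whether to stratify is a design choice: it removes one source of variance from a very small test set, at the cost of departing from a purely random split.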

---

#### Activity 2.  

* Plot the training data. In particular, for every pair of features/attributes, create a scatter plot where different classes are plotted with different symbols. You may find the `subplot` command from `matplotlib.pyplot` useful.


In [8]:
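from itertools import combinations

fig = plt.figure(figsize=(9,6))

idx1 = np.where(y_train == -1)[0]
idx2 = np.where(y_train == 1)[0]

# One scatter plot per pair of features (6 pairs for 4 features)
for i, (a, b) in enumerate(combinations(range(nF), 2)):
    plt.subplot(2, 3, i + 1)

    # Different symbols and colors for the two classes
    plt.plot(X_train[idx1, a], X_train[idx1, b], 'bx', label='Versicolor')
    plt.plot(X_train[idx2, a], X_train[idx2, b], 'ro', label='Virginica')

    plt.xlabel(iris.feature_names[a])
    plt.ylabel(iris.feature_names[b])

plt.legend()
plt.tight_layout()
plt.show()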


---

#### Activity 3.  

* Fit the k-nearest neighbor classifier to the training set from Activity 1, using k = 3. The classifier can be imported from the `sklearn.neighbors` module under the name `KNeighborsClassifier`.

In [23]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train, y_train)
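
To make the prediction rule concrete, here is a minimal NumPy sketch of k-NN classification on a toy dataset. It assumes Euclidean distance and an unweighted majority vote, which are the defaults of `KNeighborsClassifier`, and labels in {-1, +1} as in this notebook. The function `knn_predict` and the toy arrays are hypothetical names introduced for illustration; this is not how `scikit-learn` implements the search internally.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # With labels in {-1, +1}, the majority vote is the sign of the label sum.
    # For even k a tie (sum of 0) is possible; here it is sent to -1 arbitrarily.
    votes = y_train[nearest].sum(axis=1)
    return np.where(votes > 0, 1, -1)

# Toy data: two well-separated clusters (illustrative, not the Iris data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])
queries = np.array([[0.5, 0.5], [5.5, 5.5]])

toy_pred = knn_predict(X_toy, y_toy, queries, k=3)
print(toy_pred)  # [-1  1]
```

Note that "fitting" a k-NN classifier amounts to storing the training data; all of the work happens at prediction time.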

---

#### Activity 4.  

* Print the predictions of the test set
* Print the true classes in the test set
* Print the test set accuracy
* Print the confusion matrix of the test set
* Print the test set precision
* Print the test set recall
* Print the test set f1 score

The functions that compute these metrics can be imported from `sklearn.metrics`.

In [25]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Accuracy on the training set
train_accuracy = knn.score(X_train, y_train)
print("Training Accuracy:", train_accuracy)

# Accuracy on the test set
test_accuracy = knn.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred = knn.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Precision on the test set (support-weighted average over both classes)
test_precision = precision_score(y_test, y_pred, average='weighted')
print("Test Precision:", test_precision)

# Recall on the test set (support-weighted average over both classes)
test_recall = recall_score(y_test, y_pred, average='weighted')
print("Test Recall:", test_recall)

# F1 score on the test set (support-weighted average over both classes)
test_f1 = f1_score(y_test, y_pred, average='weighted')
print("Test F1 Score:", test_f1)


Training Accuracy: 0.9555555555555556
Test Accuracy: 0.9
Confusion Matrix:
 [[5 1]
 [0 4]]
Test Precision: 0.9199999999999999
Test Recall: 0.9
Test F1 Score: 0.901010101010101
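
The reported numbers can be reproduced by hand from the confusion matrix, which helps clarify what `average='weighted'` does: each metric is computed per class and then averaged with weights proportional to the number of true samples in that class. The sketch below assumes the `scikit-learn` convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes
cm = np.array([[5, 1],
               [0, 4]])

tp = np.diag(cm).astype(float)  # correctly classified samples per class
support = cm.sum(axis=1)        # number of true samples per class
predicted = cm.sum(axis=0)      # number of predicted samples per class

# Per-class metrics
precision = tp / predicted      # [1.0, 0.8]
recall = tp / support           # [5/6, 1.0]
f1 = 2 * precision * recall / (precision + recall)

# Support-weighted averages, as with average='weighted'
weights = support / support.sum()
accuracy = tp.sum() / cm.sum()
w_precision = precision @ weights
w_recall = recall @ weights
w_f1 = f1 @ weights

print("accuracy :", accuracy)     # 0.9
print("precision:", w_precision)  # approx. 0.92
print("recall   :", w_recall)     # approx. 0.9
print("f1       :", w_f1)         # approx. 0.901
```

With support weights, the weighted recall always coincides with the accuracy, which is why both are 0.9 here.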


#### Activity 5.  

* Fit the k-nearest neighbor classifier with k = 1, 3, 6, and 9. For each value of k, store the following metrics: accuracy, precision, recall, and f1 score. 
* Plot each metric to see how they change with different values of k.
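
Both bullets can be handled with a single loop instead of one cell per value of k. The sketch below is one possible arrangement, assuming the same data selection and split as in Activity 1 and the same support-weighted averaging as in Activity 4. The names `ks`, `scorers` and `results` are illustrative and not part of the original notebook.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data selection and split as in the earlier cells
iris = datasets.load_iris()
data_X = iris.data[50:, :]
data_y = 2 * iris.target[50:] - 3
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=0.1, random_state=42
)

ks = [1, 3, 6, 9]
scorers = {
    "accuracy": accuracy_score,
    "precision": lambda yt, yp: precision_score(yt, yp, average="weighted"),
    "recall": lambda yt, yp: recall_score(yt, yp, average="weighted"),
    "f1": lambda yt, yp: f1_score(yt, yp, average="weighted"),
}
results = {name: [] for name in scorers}

# Fit one classifier per k and store every test-set metric
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    for name, scorer in scorers.items():
        results[name].append(scorer(y_test, y_pred))

# One curve per metric as a function of k
for name, values in results.items():
    plt.plot(ks, values, marker="o", label=name)
plt.xlabel("k")
plt.ylabel("test-set score")
plt.xticks(ks)
plt.legend()
plt.show()
```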

In [27]:
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train, y_train)

In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Accuracy on the training set
train_accuracy = knn.score(X_train, y_train)
print("Training Accuracy:", train_accuracy)

# Accuracy on the test set
test_accuracy = knn.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred = knn.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Precision on the test set (support-weighted average over both classes)
test_precision = precision_score(y_test, y_pred, average='weighted')
print("Test Precision:", test_precision)

# Recall on the test set (support-weighted average over both classes)
test_recall = recall_score(y_test, y_pred, average='weighted')
print("Test Recall:", test_recall)

# F1 score on the test set (support-weighted average over both classes)
test_f1 = f1_score(y_test, y_pred, average='weighted')
print("Test F1 Score:", test_f1)


Training Accuracy: 0.9555555555555556
Test Accuracy: 0.9
Confusion Matrix:
 [[5 1]
 [0 4]]
Test Precision: 0.9199999999999999
Test Recall: 0.9
Test F1 Score: 0.901010101010101


In [31]:
knn = KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

In [33]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Accuracy on the training set
train_accuracy = knn.score(X_train, y_train)
print("Training Accuracy:", train_accuracy)

# Accuracy on the test set
test_accuracy = knn.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred = knn.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Precision on the test set (support-weighted average over both classes)
test_precision = precision_score(y_test, y_pred, average='weighted')
print("Test Precision:", test_precision)

# Recall on the test set (support-weighted average over both classes)
test_recall = recall_score(y_test, y_pred, average='weighted')
print("Test Recall:", test_recall)

# F1 score on the test set (support-weighted average over both classes)
test_f1 = f1_score(y_test, y_pred, average='weighted')
print("Test F1 Score:", test_f1)

Training Accuracy: 0.9777777777777777
Test Accuracy: 0.9
Confusion Matrix:
 [[5 1]
 [0 4]]
Test Precision: 0.9199999999999999
Test Recall: 0.9
Test F1 Score: 0.901010101010101


In [39]:
# k = 12 is used in this cell (the activity statement lists k = 9)
knn = KNeighborsClassifier(n_neighbors=12)

knn.fit(X_train, y_train)

In [41]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Accuracy on the training set
train_accuracy = knn.score(X_train, y_train)
print("Training Accuracy:", train_accuracy)

# Accuracy on the test set
test_accuracy = knn.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred = knn.predict(X_test)

# Confusion matrix (rows: true class, columns: predicted class)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Precision on the test set (support-weighted average over both classes)
test_precision = precision_score(y_test, y_pred, average='weighted')
print("Test Precision:", test_precision)

# Recall on the test set (support-weighted average over both classes)
test_recall = recall_score(y_test, y_pred, average='weighted')
print("Test Recall:", test_recall)

# F1 score on the test set (support-weighted average over both classes)
test_f1 = f1_score(y_test, y_pred, average='weighted')
print("Test F1 Score:", test_f1)

Training Accuracy: 0.9555555555555556
Test Accuracy: 1.0
Confusion Matrix:
 [[6 0]
 [0 4]]
Test Precision: 1.0
Test Recall: 1.0
Test F1 Score: 1.0
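
The note in Activity 1 says that the test set should not drive design decisions, and the choice of k is such a decision. One common alternative is to compare values of k by cross-validation on the training set only. The sketch below is a minimal illustration under stated assumptions: 5 folds and plain accuracy as the score are choices made here, not taken from the original notebook, and the names `cv_scores` and `best_k` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data selection and split as in the earlier cells
iris = datasets.load_iris()
data_X = iris.data[50:, :]
data_y = 2 * iris.target[50:] - 3
X_train, X_test, y_train, y_test = train_test_split(
    data_X, data_y, test_size=0.1, random_state=42
)

cv_scores = {}
for k in [1, 3, 6, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training data only; the test set is untouched
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores[k] = scores.mean()

# Value of k with the highest mean validation accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Selected k:", best_k)
```

The test set would then be used once, at the end, to report the performance of the classifier refit with the selected k.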
