## Classification with Scikit-learn

Classification involves using labeled (known) training examples to make accurate predictions for new, unseen input examples. In this lab we will use the classification functionality provided by the scikit-learn Python package.

The *k-Nearest Neighbour (KNN) classifier* is a simple but effective "lazy" classifier. Given a new input example, it finds the most similar previous examples for which a decision has already been made (i.e. its nearest neighbours from the training set). A prediction for the input is then made based on the majority vote among the $k$ neighbours.
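
The majority-vote idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements it; the helper name `knn_predict`, the choice of Euclidean distance, and the toy data are all assumptions made for this example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Hypothetical helper: predict a label for `query` by majority vote."""
    # Euclidean distance from the query to every training example
    distances = [math.dist(point, query) for point in train_points]
    # indices of the k closest training examples
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # the most common label among those neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy training set: two small clusters with made-up labels
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (5.1, 5.0), k=3))
```

The classifier is "lazy" in the sense that nothing is computed until a prediction is requested: all of the distance calculations happen inside `knn_predict`.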

#### Example 1: KNN Classifier

The scikit-learn package includes a number of datasets, which are useful for getting a handle on a given machine learning algorithm before using it in your own work. We will load the version of the Iris dataset which is provided by scikit-learn:

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

This dataset has four different descriptive features:

In [None]:
print(iris.feature_names)

Each example in the dataset has a class label or a "target" from three possible classes:

In [None]:
print(iris.target_names)
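
Before building a model it can help to confirm the size of the dataset and how the examples are spread across the classes. The sketch below assumes the standard copy of Iris bundled with scikit-learn and uses NumPy's `bincount` to count the examples per class; the variable name `class_counts` is just for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# rows are examples, columns are features
print(iris.data.shape)
# targets are stored as integers that index into target_names
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```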

Build a nearest neighbour classifier using $k=1$ nearest neighbour. In this case we will use the full dataset and all of the target labels for those examples:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

We can test it out by making a prediction for a new input example described by 4 feature values:

In [None]:
import numpy as np
xinput = np.array([[3.0, 5.0, 4.1, 2.0]])
# make the prediction, the output is the number of the class
pred_class_number = knn.predict(xinput)
# to get the name of the class
print( iris.target_names[pred_class_number] )
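
To see why a particular prediction was made, we can ask the fitted model which training example it treated as the nearest neighbour. The sketch below uses the `kneighbors` method of `KNeighborsClassifier`, which returns the distances to, and the row indices of, the nearest training examples. The variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

xinput = np.array([[3.0, 5.0, 4.1, 2.0]])
# both arrays have shape (number of inputs, number of neighbours)
distances, indices = knn.kneighbors(xinput)
nearest_index = indices[0, 0]
print("Nearest training example:", iris.data[nearest_index])
print("Distance to it:", distances[0, 0])
print("Its class:", iris.target_names[iris.target[nearest_index]])
```

With $k=1$, the predicted class is simply the class of this single nearest training example.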

We can also predict for multiple input examples at once:

In [None]:
xinput = np.array([[3, 5, 4, 2], [3, 5, 2, 2]])
pred_class_numbers = knn.predict(xinput)
print( iris.target_names[pred_class_numbers] )

#### Example 2: KNN Classifier

Next, we will load a CSV copy of the Pima Indian diabetes dataset from the UCI Machine Learning Repository, where the goal is to predict 1 (tested positive for diabetes) or 0 (tested negative for diabetes).

In [None]:
# load the CSV file as a numpy matrix
raw_dataset = np.loadtxt("diabetes.csv", delimiter=",")
# show the dimensions of the dataset and its first row
print(raw_dataset.shape)
raw_dataset[0,:]

The CSV file contains 768 rows (patients), each with 9 columns: 8 descriptive numeric features and the binary target value. We will separate out the descriptive columns from the target column (i.e. the class labels).

In [None]:
# columns 0-7 are the 8 descriptive features (the slice end index is exclusive)
dataset = raw_dataset[:,0:8]
target = raw_dataset[:,8]
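
Because the end index of a Python slice is exclusive, it is worth confirming the shapes after separating columns in this way. The sketch below uses a stand-in array of zeros with the 768 x 9 shape described above, not the real CSV file; the names `stand_in`, `features` and `labels` are made up for the example.

```python
import numpy as np

# stand-in for the loaded CSV: 768 rows, 9 columns
stand_in = np.zeros((768, 9))
# 0:8 selects columns 0 to 7, i.e. the 8 descriptive features
features = stand_in[:, 0:8]
# a single integer index selects one column as a 1-D array
labels = stand_in[:, 8]
print(features.shape)
print(labels.shape)
```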

Now, we will randomly split the complete dataset into a training set (used to build the model) and an unseen test set (used to try out and evaluate the model). Scikit-learn provides functionality to do this. We will specify that 20% (0.2) of the data will be used for the test set.

In [None]:
from sklearn.model_selection import train_test_split
dataset_train, dataset_test, target_train, target_test = train_test_split(dataset, target, test_size=0.2)

In [None]:
print("Training set size is %d" % dataset_train.shape[0] )
print("Test set size is %d" % dataset_test.shape[0] )
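
Because the split is random, each run produces a different training and test set, so later results will vary slightly from run to run. If a repeatable split is needed, `train_test_split` accepts a `random_state` seed. The sketch below illustrates this on made-up data; the arrays `X` and `y` and the seed value 42 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 100 examples with 3 features and a binary label
X = np.arange(300).reshape(100, 3)
y = np.array([0] * 70 + [1] * 30)

# the same seed gives the same split on every call
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train_a.shape[0], X_test_a.shape[0])
print(np.array_equal(X_test_a, X_test_b))
```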

Next, we will fit a k-nearest neighbour model to the training data using $k=3$ nearest neighbours:

In [None]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(dataset_train, target_train)
print(model)

Make predictions for the test set, based on the model that we just built:

In [None]:
predicted = model.predict(dataset_test)
predicted
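
Each prediction is the outcome of a vote among the 3 nearest neighbours. With the default uniform weighting, the `predict_proba` method reports the fraction of neighbours that voted for each class, which shows how close each vote was. The sketch below uses a made-up one-feature training set, not the diabetes data, so that the votes are easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up one-feature training set with binary labels
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

X_new = np.array([[2.5], [4.4]])
# one row per input, one column per class (ordered as in model.classes_)
vote_fractions = model.predict_proba(X_new)
print(model.classes_)
print(vote_fractions)
print(model.predict(X_new))
```

For the input 4.4, the three nearest training values are 3.0, 6.0 and 2.0, so two of the three votes go to class 0.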

In [None]:
num_pos = (predicted == 1).sum()
num_neg = (predicted == 0).sum()
print( "Number of patients predicted positive for diabetes: %d" % num_pos )
print( "Number of patients predicted negative for diabetes: %d" % num_neg )

We can compare our predictions to the "correct answer" based on the labels for the test data:

In [None]:
print("Predictions\n", predicted)
print("Correct labels\n", target_test)

We can quantify how good these predictions are by measuring *accuracy*: the fraction of test examples that were predicted correctly, ranging from 0.0 (every prediction is wrong) to 1.0 (every prediction is correct):

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(target_test, predicted)
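
Accuracy is simply the number of correct predictions divided by the total number of predictions, so it can also be computed by hand with NumPy. The sketch below checks this against `accuracy_score` using made-up label arrays (`true_labels` and `predictions` are illustrative names, not outputs of the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up labels and predictions for 8 test examples
true_labels = np.array([1, 0, 0, 1, 0, 1, 0, 0])
predictions = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# fraction of positions where the prediction matches the true label
manual_accuracy = (predictions == true_labels).mean()
print(manual_accuracy)
print(accuracy_score(true_labels, predictions))
```

Here 6 of the 8 predictions match, so both calculations give 0.75.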

In the next lab we will look at evaluation measures for classification in more detail.