# Introduction to Classification

Classification involves using labeled (known) training examples to predict a class label for unseen input examples. In this lab we will use the classification functionality provided by the *scikit-learn* Python package.

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

## Data Loading and Preparation

For the examples in this notebook, we will use the *Penguin* dataset described [here](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris).

The dataset contains records related to penguins collected from islands in the Palmer Archipelago, Antarctica. Each penguin belongs to one of 3 species ('Adelie', 'Gentoo', 'Chinstrap'). We would like to train a classification algorithm to automatically classify a record describing a penguin into one of the three species (classes).

The features in the data are as follows:

- *island*: The name of the island in Antarctica where the penguin was found ('Dream', 'Torgersen', or 'Biscoe') 
- *bill_length*: Length of the penguin's bill in mm
- *bill_depth*: Depth of the penguin's bill in mm
- *flipper_length*: Length of the penguin's flipper in mm
- *body_mass*: Weight of the penguin in grams
- *sex*: Indicates whether the penguin is male or female
- *species*: The species of this penguin (either 'Adelie', 'Gentoo', or 'Chinstrap')


In [None]:
df = pd.read_csv('penguins.csv')
df.head()

Each example in the dataset has a class label or a "target" from three possible classes. We can look at the distribution of these classes (i.e. the number of penguins in each species):

In [None]:
df["species"].value_counts()

For the purposes of classification, we will focus on the numeric features in the data and remove the non-numeric features ('island' and 'sex'). We will also separate out the feature values from the target label ('species').

In [None]:
target = df["species"].values
data = df[["bill_length", "bill_depth", "flipper_length", "body_mass"]]
data.head()

Since the numeric features in the data all have different ranges, we will apply min-max normalisation to transform them into the range [0,1]. We can use the *MinMaxScaler* in scikit-learn to do this. Note that the output will be a NumPy array.

In [None]:
normalizer = MinMaxScaler()  
data_scaled = normalizer.fit_transform(data.values)
data_scaled
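
Min-max normalisation rescales each feature column separately using (x - min) / (max - min). As a minimal sketch, the cell below applies this formula by hand to a small made-up array (the values are illustrative and not taken from the penguin data) and checks that the result agrees with *MinMaxScaler*:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy feature matrix (made-up values): 3 rows, 2 columns with very different ranges
toy = np.array([[35.0, 3000.0],
                [45.0, 4500.0],
                [55.0, 6000.0]])

# min-max normalisation by hand: (x - min) / (max - min), computed per column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
toy_manual = (toy - col_min) / (col_max - col_min)

# the same transformation using scikit-learn
toy_sklearn = MinMaxScaler().fit_transform(toy)

print(toy_manual)
print(np.allclose(toy_manual, toy_sklearn))
```

After scaling, the smallest value in each column maps to 0 and the largest maps to 1.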

## Basic Classification

As our classification algorithm in this notebook, we will use the simple but effective *k-Nearest Neighbour (KNN) classifier*. Given a new input example, this algorithm finds the most similar previous examples for which a decision has already been made (i.e. its nearest neighbours from the training set). A prediction is then made for the input based on the majority vote among the *k* neighbours.

First, we build a nearest neighbour classifier using just 1 nearest neighbour.  
In this case we will use the full dataset and all of the target labels for those examples.  
Because the ranges of the feature values are quite different, we use the scaled values in the range [0,1] so that all features have the same impact on classification.

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
m = knn.fit(data_scaled, target)
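
To make the majority-vote idea concrete, here is a from-scratch sketch of a KNN prediction for a single query point. It is an illustration only, not how scikit-learn is implemented: the function name `knn_predict` and the toy data are hypothetical, Euclidean distance is assumed, and ties are broken in whatever order `Counter` happens to return.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Predict a label for one query point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training example
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy, already-scaled training data (made-up values) with two classes
toy_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.8, 0.9], [0.9, 0.8]])
toy_y = np.array(["A", "A", "A", "B", "B"])

print(knn_predict(toy_X, toy_y, np.array([0.12, 0.18]), k=1))
print(knn_predict(toy_X, toy_y, np.array([0.85, 0.85]), k=3))
```

The first query lies next to the "A" examples, while two of the three nearest neighbours of the second query are labelled "B".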

We can test it out by making a prediction for a new input example described by 4 feature values:

In [None]:
# create our new penguin record
penguin1 = [39.1, 16.8, 180.5, 3705.0]
unseen_data = np.array([penguin1])
# apply the same min-max normalisation as before
unseen_scaled = normalizer.transform(unseen_data)

In [None]:
# make the prediction, the output is the class label
prediction = knn.predict(unseen_scaled)
prediction[0]

We can also make predictions for multiple unseen input examples at once:

In [None]:
penguin1 = [39.1, 16.8, 180.5, 3705]
penguin2 = [46.2, 14.9, 208, 5286]
penguin3 = [50.3, 18.8, 201.5, 3804]
penguin4 = [40.1, 17.3, 185, 3402]
unseen_data = np.array([penguin1, penguin2, penguin3, penguin4])
# normalize the input data
unseen_scaled = normalizer.transform(unseen_data)
# make the predictions for the 4 unseen examples
knn.predict(unseen_scaled)

## Training and Test Sets

A key task when applying a classifier is to determine how effective our classifier will be at making predictions. One way to estimate this is to divide the full dataset into two sets using a "hold-out strategy":
1. *Training set*: A set of examples used to build the classification model.
2. *Test set*: A separate set of examples that is withheld from the classifier during training, and is used afterwards to evaluate the model.

To evaluate the effectiveness of our KNN classifier on the penguin data, we will randomly split the complete dataset into a training set (used to build the model) and an unseen test set (used to try out and evaluate the model). Scikit-learn provides functionality to do this. We will specify that 30% (0.3) of the data will be used for the test set. By specifying a value for *random_state*, we can reproduce the same split, and therefore the same results, again.

In [None]:
from sklearn.model_selection import train_test_split
# use 70% for training, 30% for testing
data_train, data_test, target_train, target_test = train_test_split(data_scaled, target, 
    test_size=0.3, random_state=1)
print("Training set has %d examples" % data_train.shape[0])
print("Test set has %d examples" % data_test.shape[0])
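
Conceptually, a hold-out split just shuffles the row indices and cuts them into two groups. The sketch below shows one way this could be done with NumPy. The helper `holdout_split` and the toy data are hypothetical, and it will not reproduce the exact partition chosen by `train_test_split`, even with the same seed.

```python
import numpy as np

def holdout_split(X, y, test_size=0.3, seed=1):
    """Randomly split X and y into training and test portions."""
    rng = np.random.default_rng(seed)
    # shuffle the row indices so that the split is random
    indices = rng.permutation(len(X))
    # number of examples to withhold for testing
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# toy data (made-up values): 10 examples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array(["A", "B"] * 5)

X_train, X_test, y_train, y_test = holdout_split(toy_X, toy_y)
print("Training set has %d examples" % X_train.shape[0])
print("Test set has %d examples" % X_test.shape[0])
```

Every example ends up in exactly one of the two sets, so the test examples are never seen during training.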

Once we have performed our split, we then train our model based only on the training data:

In [None]:
model = KNeighborsClassifier(n_neighbors=1)
m = model.fit(data_train, target_train)

Make predictions for the test set, based on the model that we just trained:

In [None]:
predicted = model.predict(data_test)
# print the predictions
print("Predictions:\n%s" % predicted)
# print the number of predictions from each class
print("Class counts:\n%s" % pd.Series(predicted).value_counts())

We can compare our predictions to the "correct answer" based on the labels for the test data:

In [None]:
print("Predictions\n", predicted)
print("Correct labels\n", target_test)

We can quantitatively check how accurate these predictions are, by measuring *accuracy*, which will return a value between 0.0 (predictions are completely wrong) and 1.0 (predictions are 100% accurate):

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(target_test, predicted)
print("Accuracy=%.3f" % acc)
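
Accuracy is simply the fraction of test examples whose predicted label matches the correct label. As a small sketch with made-up labels (not actual output from the model above), it can be computed by hand as follows:

```python
import numpy as np

# made-up labels, for illustration only
true_labels = np.array(["Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"])
predicted_labels = np.array(["Adelie", "Gentoo", "Adelie", "Adelie", "Gentoo"])

# element-wise comparison: True wherever the prediction matches the correct label
correct = (true_labels == predicted_labels)
# accuracy = number of correct predictions / total number of predictions
manual_acc = correct.sum() / len(true_labels)
print("Accuracy=%.3f" % manual_acc)
```

Here 4 of the 5 predictions are correct, giving an accuracy of 0.8.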

## Selecting Parameters

Many classification algorithms have one or more parameter values which control various aspects of the learning process. In the case of a KNN classifier, the key parameter is the number of neighbours *k* considered when making a prediction. Different values for this parameter can lead to different predictions on unseen data, resulting in higher or lower levels of accuracy.

Using the training/test split that we created above, we will examine the effect of increasing the parameter for the number of neighbours *k* on the accuracy of our predictions.

In [None]:
# iterate over a range of values of k
for k in range(1, 16):
    # train a classifier with this parameter value
    model = KNeighborsClassifier(n_neighbors=k)
    m = model.fit(data_train, target_train)
    # make predictions
    predicted = model.predict(data_test)
    # evaluate the predictions
    acc = accuracy_score(target_test, predicted)
    print("K=%02d neighbours: Accuracy=%.3f" % (k, acc))

We see a little variation in the accuracy results above, although for this dataset the algorithm is not very sensitive to the choice of parameter value for KNN. However, this may not be the case for more difficult classification tasks and more complex datasets.
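
Rather than reading the results off the printed output, we could store the accuracy for each value of *k* and pick the best one programmatically. The sketch below assumes scikit-learn's bundled Iris data as a stand-in, so that it runs without *penguins.csv*, and it skips the scaling step for brevity. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# stand-in dataset (assumption: Iris is used here in place of the penguin data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# record the test accuracy for each value of k
accuracies = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, clf.predict(X_test))

# choose the k with the highest accuracy (the smallest such k if several tie)
best_k = max(accuracies, key=accuracies.get)
print("Best k=%d with accuracy=%.3f" % (best_k, accuracies[best_k]))
```

Note that choosing *k* based on test set accuracy tends to give an optimistic estimate of how well the classifier will perform on new data, since the test set has influenced the choice.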