# Learning outcomes

When you have worked through the exercises in this notebook, you will have

* built and evaluated tree-based and kNN models using a standard software library;

* run experiments to explore effects of differing feature scales on machine learning models.

# Objectives


* To apply the k-nearest neighbour (kNN) and Bagging algorithms from the Week 2 mini-videos to the classification of Iris plants based on petal and sepal sizes. This is the same dataset introduced in the Week 1 code notebook.

# Section 1 - Load the Iris dataset

In [None]:
from sklearn import datasets

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

print("The dimensions of the Iris feature matrix", iris_data.shape)

# Section 2 - Split into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
import numpy

all_ids = numpy.arange(0, iris_data.shape[0])

random_seed = 1

# Randomly split the data 50:50 into training and test sets
train_set_ids, test_set_ids = train_test_split(all_ids, test_size=0.5, train_size=0.5,
                                 random_state=random_seed, shuffle=True)

training_data = iris_data[train_set_ids, :]
training_labels = iris_labels[train_set_ids]
test_data = iris_data[test_set_ids, :]
test_labels = iris_labels[test_set_ids]

print("Size of the training data:", training_data.shape)
print("Size of the test data:", test_data.shape)
print("A peek at the range of values of the training data features:", training_data)

# Section 3 - Train and evaluate a kNN model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model_kNN = KNeighborsClassifier(n_neighbors=5)
model_kNN.fit(training_data, training_labels)
test_predictions_kNN = model_kNN.predict(test_data)

print("\n What proportion of the kNN test predictions were correct? %.2f " % accuracy_score(test_labels, test_predictions_kNN))

# Section 4 - Visually explore the data and predictions

* Use plots of iris sepal and petal characteristics (their length and/or width) to explore how distinct the three iris classes are based on their sepals and petals.
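
One possible starting point is sketched below, assuming matplotlib is available. It reloads the dataset so that the cell runs on its own, and the choice of feature pairs (sepal length vs. sepal width, petal length vs. petal width) is only an illustration, not the required answer.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the full dataset object to get feature and class names as well
iris = datasets.load_iris()
feature_names = iris.feature_names
class_names = iris.target_names

# One panel for the two sepal features, one for the two petal features
feature_pairs = [(0, 1), (2, 3)]
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, (x_idx, y_idx) in zip(axes, feature_pairs):
    for class_id, class_name in enumerate(class_names):
        # Select the rows that belong to the current class
        mask = iris.target == class_id
        ax.scatter(iris.data[mask, x_idx], iris.data[mask, y_idx],
                   label=class_name, alpha=0.7)
    ax.set_xlabel(feature_names[x_idx])
    ax.set_ylabel(feature_names[y_idx])
    ax.legend()

fig.tight_layout()
plt.show()
```

Other pairs of features can be explored by changing the indices in `feature_pairs`.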


# Section 5 - Explore the effect of the kNN hyperparameters

* Try different values of k, i.e. the number of nearest neighbours, e.g. k = 1, 2, 5, 10, 20. What effect does k have on the test accuracy?
* Try a different distance metric. See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
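
A minimal sketch of such an experiment follows. It reloads the data and repeats a 50:50 split with the same seed as Section 2 so that the cell is self-contained. The two metrics compared, `"euclidean"` and `"manhattan"`, are chosen here only as examples of values accepted by the `metric` parameter.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_data, iris_labels = datasets.load_iris(return_X_y=True)

# 50:50 split with the same seed as in Section 2
training_data, test_data, training_labels, test_labels = train_test_split(
    iris_data, iris_labels, test_size=0.5, random_state=1, shuffle=True)

# Test accuracy for every combination of k and distance metric
knn_accuracy = {}
for metric in ["euclidean", "manhattan"]:
    for k in [1, 2, 5, 10, 20]:
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        model.fit(training_data, training_labels)
        predictions = model.predict(test_data)
        knn_accuracy[(k, metric)] = accuracy_score(test_labels, predictions)
        print("k = %2d, metric = %-9s accuracy = %.2f"
              % (k, metric, knn_accuracy[(k, metric)]))
```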

# Section 6 - Train and evaluate a Bagging model

In [None]:
from sklearn.ensemble import BaggingClassifier
import math

# Set the number of features drawn to train each base estimator
# (BaggingClassifier samples features once per estimator, not at each node split)
max_feats = int(math.sqrt(training_data.shape[1]))

model_B = BaggingClassifier(n_estimators=100, max_features=max_feats, random_state=random_seed)
model_B.fit(training_data, training_labels)
test_predictions_B = model_B.predict(test_data)


print("\n What proportion of the Bagging test predictions were correct? %.2f " % accuracy_score(test_labels, test_predictions_B))

# Section 7 - Explore the effect of the Bagging hyperparameters

* Try different numbers of base classifiers, i.e. trees, e.g. ntrees = 1, 10, 100, 1000. What effect does the number of trees have on the test accuracy?
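
The loop below is one way to set up this comparison. It is a sketch that reloads the data and repeats a 50:50 split with the same seed as Section 2, and it keeps `max_features` at the value computed in Section 6. The dictionary name `bagging_accuracy` is introduced here for illustration only.

```python
import math

from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris_data, iris_labels = datasets.load_iris(return_X_y=True)
random_seed = 1

# 50:50 split with the same seed as in Section 2
training_data, test_data, training_labels, test_labels = train_test_split(
    iris_data, iris_labels, test_size=0.5, random_state=random_seed, shuffle=True)

# Same feature subset size as in Section 6
max_feats = int(math.sqrt(training_data.shape[1]))

# Test accuracy for each ensemble size
bagging_accuracy = {}
for ntrees in [1, 10, 100, 1000]:
    model = BaggingClassifier(n_estimators=ntrees, max_features=max_feats,
                              random_state=random_seed)
    model.fit(training_data, training_labels)
    predictions = model.predict(test_data)
    bagging_accuracy[ntrees] = accuracy_score(test_labels, predictions)
    print("ntrees = %4d, accuracy = %.2f" % (ntrees, bagging_accuracy[ntrees]))
```

Repeating the loop with a few different values of `random_seed` gives a sense of how much of any difference is due to chance.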

# Section 8 - Explore split into training and test sets

* How was the Iris dataset split into training and test sets? See Section 2.
* Randomly split the dataset into training and test sets such that the ratio of instances is 80:20.
* What is the effect on performance of the Bagging algorithm?
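
As a rough sketch, the 50:50 and 80:20 splits can be compared side by side by changing only `test_size`. The cell below reloads the data so that it runs on its own. The names `split_sizes` and `split_accuracy` are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris_data, iris_labels = datasets.load_iris(return_X_y=True)
random_seed = 1

split_sizes = {}
split_accuracy = {}
# test_size=0.5 gives the 50:50 split, test_size=0.2 gives the 80:20 split
for test_fraction in [0.5, 0.2]:
    train_X, test_X, train_y, test_y = train_test_split(
        iris_data, iris_labels, test_size=test_fraction,
        random_state=random_seed, shuffle=True)
    split_sizes[test_fraction] = (train_X.shape[0], test_X.shape[0])

    # Same Bagging settings as in Section 6 (2 of the 4 features per tree)
    model = BaggingClassifier(n_estimators=100, max_features=2,
                              random_state=random_seed)
    model.fit(train_X, train_y)
    split_accuracy[test_fraction] = accuracy_score(test_y, model.predict(test_X))

    print("train/test sizes = %s, accuracy = %.2f"
          % (split_sizes[test_fraction], split_accuracy[test_fraction]))
```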

# Section 9 - Train and evaluate the kNN and Bagging models with scaled features

* Read the StandardScaler documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Following that documentation, apply standard scaling to *iris_data* from Section 1 to compute scaled features.

* Train and evaluate a kNN and a Bagging model with the scaled features.

* What differences do you notice in the feature distribution and the results?
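
One way to approach these steps is sketched below. It scales the whole of `iris_data`, as the exercise asks, and then reuses a 50:50 split with the same seed as Section 2. A common alternative, not shown here, is to fit the scaler on the training rows only and apply it to the test rows. The dictionary name `scaled_accuracy` is introduced for illustration.

```python
import numpy
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris_data, iris_labels = datasets.load_iris(return_X_y=True)

# Standard scaling: each feature gets zero mean and unit variance
scaled_data = StandardScaler().fit_transform(iris_data)
print("Feature means before scaling:", iris_data.mean(axis=0))
print("Feature means after scaling: ", scaled_data.mean(axis=0).round(2))

# Split the instance ids 50:50 with the same seed as in Section 2
train_ids, test_ids = train_test_split(
    numpy.arange(iris_data.shape[0]), test_size=0.5, random_state=1, shuffle=True)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Bagging": BaggingClassifier(n_estimators=100, max_features=2, random_state=1),
}

# Train and evaluate each model on the raw and on the scaled features
scaled_accuracy = {}
for feature_label, features in [("raw", iris_data), ("scaled", scaled_data)]:
    for model_name, model in models.items():
        model.fit(features[train_ids], iris_labels[train_ids])
        predictions = model.predict(features[test_ids])
        accuracy = accuracy_score(iris_labels[test_ids], predictions)
        scaled_accuracy[(model_name, feature_label)] = accuracy
        print("%-7s on %-6s features: accuracy = %.2f"
              % (model_name, feature_label, accuracy))
```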