## ML4 Notebook

In this course, we consider binary classifiers. For now, let's view a classifier as a black box that takes in a data point $\mathbf{x}_i$ (for example, an image) and outputs a decision between one of two classes (e.g. cat or dog).

<img src="./classifier.png" title="classifier"/>

To produce a classifier we need a dataset, which we split into three subsets:

- The **training set** is used to train the classifier
- The **validation set** is used to tune the classifier's hyperparameters (we won't delve into this further in these notebooks)
- The **test set** is used to evaluate how good the classifier is 

In this notebook and the ones that follow, we are going to use a **customised version of the iris dataset**
and consider **the classification task of determining if a flower is (i) versicolor or (ii) virginica.** We will throw away the third class (setosa).

Note: multi-way classification (between 3 or more classes) is possible, but we will keep everything as two classes for simplicity.

In [None]:
import numpy as np
import sklearn.datasets # If you are running this locally, then `pip install scikit-learn` in your Python environment.

iris = sklearn.datasets.load_iris()
X = np.load('iris_standardised.npy') # Load our standardised iris data

### Creating our sets

We are going to learn classifiers using **supervised learning algorithms**. This means we need to provide a class label for each data point.

We are now going to load in the iris dataset, throw away the setosas, and split the remaining data into the three sets (train/val/test) with corresponding vectors for class labels. 

**Note: Manually splitting up a dataset is fairly tedious. Code will be provided to do it for you in the coursework lab.**

First, we are going to create a matrix `X_b` that contains only the versicolor and virginica data points. We can do this by looking at the following arrays:

- `iris.target` is an array of size $150$ where `iris.target[i]` is a numbered label for data point i
- `iris.target_names` tells you which species each numbered label corresponds to



In [None]:
print(iris.target_names)
print(iris.target)

We can check the target of data point `X[i,:]` by looking at `iris.target[i]`. If this is a 0 then it is a setosa. If this is a 1 it is a versicolor and if it is 2 it is virginica. 
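
As a small illustration of the label lookup described above, here is a sketch that maps numbered labels to species names and counts the points per class. To stay self-contained it uses stand-in arrays (`target_names_demo`, `target_demo`, both hypothetical names) that mirror the layout of `iris.target_names` and `iris.target` rather than loading the dataset.

```python
import numpy as np

# Stand-ins mirroring iris.target_names and iris.target (names are illustrative):
# 50 setosas (0), then 50 versicolors (1), then 50 virginicas (2).
target_names_demo = np.array(["setosa", "versicolor", "virginica"])
target_demo = np.repeat([0, 1, 2], 50)

# Index the names array with the labels to get a species name per data point
species = target_names_demo[target_demo]
print(species[0], species[50], species[100])  # setosa versicolor virginica

# Count how many data points carry each label
labels, counts = np.unique(target_demo, return_counts=True)
print(labels, counts)  # [0 1 2] [50 50 50]
```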

We only want versicolors and virginicas, so we filter our data accordingly.

In [None]:
valid_data = np.where(iris.target != 0)[0] # Get the indices of data points whose target isn't zero (i.e. not setosa)
X_b = X[valid_data,:] # Get corresponding data points
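
The same filtering can also be written with a boolean mask instead of `np.where`. The sketch below assumes stand-in arrays (`target_demo` and a random `X_demo` of the same shape as the iris data, both hypothetical) so that it runs without the dataset file; only the shapes matter here.

```python
import numpy as np

# Stand-in for iris.target: 50 setosas (0), 50 versicolors (1), 50 virginicas (2)
target_demo = np.repeat([0, 1, 2], 50)

# Stand-in for the data matrix X: 150 points, 4 features (random values, shape only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))

keep = target_demo != 0        # Boolean mask: True for versicolor and virginica
X_b_demo = X_demo[keep, :]     # Keeps the same rows as indexing with np.where(keep)[0]

print(X_b_demo.shape)  # (100, 4)
```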

Remember that for supervised learning algorithms, we need to provide class labels. Let's create a vector `y_b` where `y_b[i]` is the label for `X_b[i,:]`: 0 for versicolors and 1 for virginicas. We can do this easily because we know that our remaining data is 50 versicolors followed by 50 virginicas.

In [None]:
y_b = np.concatenate([np.zeros(50), np.ones(50)])
print(y_b)
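
If you would rather not rely on the ordering of the data, the labels can be derived from the targets directly. This is a sketch under the assumption that the targets look like `iris.target`; it uses a stand-in array (`target_demo`, a hypothetical name) and checks that the result agrees with the hard-coded vector.

```python
import numpy as np

# Stand-in for iris.target: 50 setosas (0), 50 versicolors (1), 50 virginicas (2)
target_demo = np.repeat([0, 1, 2], 50)

valid = np.where(target_demo != 0)[0]                     # Drop the setosas
y_from_target = (target_demo[valid] == 2).astype(float)   # 1.0 for virginica, 0.0 for versicolor

# The hard-coded version, which relies on the data being ordered by class
y_hardcoded = np.concatenate([np.zeros(50), np.ones(50)])

print(np.array_equal(y_from_target, y_hardcoded))  # True
```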

The final step is to split this into three sets. We are going to split our data with the ratio 40/20/40 into train/val/test at random.



In [None]:
np.random.seed(5) # Set a random seed

indices = np.arange(100) # The numbers 0-99
print(indices)

In [None]:
np.random.shuffle(indices)
print(indices)

In [None]:
# The first 40 shuffled indices go to train, the next 20 to val, the last 40 to test
X_train = X_b[indices[0:40]]
X_val = X_b[indices[40:60]]
X_test = X_b[indices[60:]]

y_train = y_b[indices[0:40]]
y_val = y_b[indices[40:60]]
y_test = y_b[indices[60:]]
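
A quick sanity check on a split like this is to confirm the sizes and that no data point appears in more than one set. The sketch below repeats the shuffle on its own so that it is self-contained; the names `train_idx`, `val_idx` and `test_idx` are illustrative and not part of the notebook.

```python
import numpy as np

np.random.seed(5)              # Same seed as above
indices = np.arange(100)       # The numbers 0-99
np.random.shuffle(indices)

train_idx, val_idx, test_idx = indices[0:40], indices[40:60], indices[60:]

# Sizes follow the 40/20/40 ratio
print(len(train_idx), len(val_idx), len(test_idx))  # 40 20 40

# Every index from 0-99 appears in exactly one of the three sets
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(100)))  # True

# A random split is not guaranteed to be balanced, so look at the class mix per set
y_b = np.concatenate([np.zeros(50), np.ones(50)])
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, "fraction of virginicas:", y_b[idx].mean())
```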

That's it. We'll now save these for use in later notebooks.

In [None]:
np.savez('iris_splits', X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test)
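
`np.savez` appends the `.npz` extension when the filename does not already have it, so the file to load later is `iris_splits.npz`. Here is a minimal round-trip sketch, assuming placeholder arrays and a temporary directory (both purely illustrative) so that it does not touch your real file.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays standing in for two of the real splits
demo = {"X_train": np.zeros((40, 4)), "y_train": np.zeros(40)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_splits")
    np.savez(path, **demo)                  # Written as iris_splits.npz
    saved_files = os.listdir(tmp)

    with np.load(path + ".npz") as splits:
        loaded_keys = sorted(splits.files)  # Names given as keyword arguments to savez
        X_train_loaded = splits["X_train"]  # Each array is read when it is accessed

print(saved_files)           # ['iris_splits.npz']
print(loaded_keys)           # ['X_train', 'y_train']
print(X_train_loaded.shape)  # (40, 4)
```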
