# Classification

This practical looks at:
    
* How to **use sklearn to run a classifier using gradient descent**
* **Basic good practice**: test/train split, centering data
* The ways in which you can **modify the gradient descent** algorithm
* Pointers to other implementations in sklearn

In [None]:
# Import libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn import datasets

### Dataset

We're going to be using one of `sklearn`'s built-in datasets. These are a mix of toy classification and regression problems which are useful to play with, but not as complex as real data. We're going to be using the [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) flower dataset.

From `datasets`, load the dataset using `load_iris`. As we're not interested in the metadata, pass `return_X_y=True` to the function. This will return a tuple of X, the data, and y, the labels.

In [None]:
# load the iris data into an X matrix and a y vector.
X, y = datasets.load_iris(return_X_y=True)
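
Before plotting, it can help to check what `load_iris` actually returned. The short sketch below only prints the shapes and the class labels; nothing in it is needed for the rest of the practical.

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# X holds one row per flower and one column per measurement.
print(X.shape)  # (150, 4)

# y holds one integer class label per flower.
print(y.shape)  # (150,)

# The three species are encoded as 0, 1 and 2, with 50 flowers each.
classes, counts = np.unique(y, return_counts=True)
print(classes, counts)
```
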



To get an idea of what we're working with, plot the first two dimensions of X (sepal length and sepal width) and colour the points according to their class.

In [None]:
# Plot the first two dimensions of the dataset as a scatter graph. Colour the points according to their class.

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()



The first thing to note is that the data are not centered. In general, centering is a good idea for unconstrained optimisation algorithms. We're going to be classifying the dataset, so our model will be attempting to find an appropriate boundary between the classes. If the data lie far from the origin, a bias parameter initialised to a small value will start far from its optimal value.

One caveat is that if we do center, and 0 is already a meaningful value in our data, we may struggle to interpret the resultant decision boundaries. Think about whether this is the case here.

Center the data, so that each feature is zero mean.

In [None]:
# Make each feature zero mean, and plot the first two dimensions again

X = X - np.mean(X, axis=0)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
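
As a quick sanity check on the centering step, the sketch below confirms that each feature has (numerically) zero mean afterwards and that the spread of the data is unchanged. It also shows that `StandardScaler(with_std=False)` from `sklearn.preprocessing` gives the same result; the variable names here are illustrative and not used elsewhere in the practical.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw, _ = datasets.load_iris(return_X_y=True)

# Manual centering: subtract each column's mean.
X_centered = X_raw - X_raw.mean(axis=0)

# Every feature mean should now be zero up to floating point error.
print(np.allclose(X_centered.mean(axis=0), 0.0))

# Centering shifts the data but does not rescale it.
print(np.allclose(X_centered.std(axis=0), X_raw.std(axis=0)))

# The same operation using sklearn's scaler with scaling switched off.
X_scaler = StandardScaler(with_std=False).fit_transform(X_raw)
print(np.allclose(X_scaler, X_centered))
```
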


Secondly, good practice means we need to fence off a portion of our data before we go any further. Splitting into a training set and a test set is **absolutely vital** for any model training. The held-out test set lets us detect overfitting, a problem where our solution fits the training data ever more closely but generalises less and less well to unseen data.

`sklearn` will do this for us. Let's choose a test proportion of `0.1`. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
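
Note that the cells above center the data using the mean of all 150 rows, including rows that end up in the test set. A common alternative, sketched here as one option rather than the approach this practical takes, is to split first and compute the mean on the training rows only. The `stratify` and `random_state` arguments are optional extras added for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_raw, y_all = datasets.load_iris(return_X_y=True)

# Split the uncentered data first. stratify keeps the class proportions
# equal in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.1, stratify=y_all, random_state=0
)

# Estimate the mean from the training rows only...
train_mean = X_tr.mean(axis=0)

# ...and apply that same shift to both parts.
X_tr_centered = X_tr - train_mean
X_te_centered = X_te - train_mean

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # class counts in the test set
```

The test set will generally not have exactly zero mean after this, which is expected: it is being treated as unseen data.
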

### Model

For our model, we're going to use `sklearn`'s `SGDClassifier` which, as the name suggests, trains a variety of linear classifiers using stochastic gradient descent. This means that we choose our optimisation algorithm first and then select the model. This is an unusual setup, which we're using because we're learning about vanilla gradient descent.

In practice, we more often pick a specific `sklearn` classifier, which then offers several optimisation algorithms chosen to suit that model. For example, `MLPClassifier`, a standard multilayer perceptron (aka a neural network), offers `adam`, `sgd`, and `lbfgs` through its `solver` parameter. All three are gradient-based optimisers, although `lbfgs` is a quasi-Newton method rather than a plain variant of gradient descent.

(🦉 *Note: The term SGD is somewhat overloaded here. When contrasted with batch and minibatch updates, SGD means updating after every single data point. Here, as an alternative to Adam and L-BFGS, it simply means basic gradient descent, as we've discussed in this module.*)
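
To make the batch / minibatch / single-sample distinction concrete, here is a minimal NumPy sketch of logistic regression trained by gradient descent, where the only thing that changes is how many data points feed each update. Everything in it is an assumption made for illustration: the toy two-blob dataset, the function names, the learning rate of `0.1` and the 20 epochs. It is not how `sklearn` implements its solvers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: two Gaussian blobs in 2D, labelled 0 and 1.
X_toy = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y_toy = np.repeat([0, 1], 50)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss_gradient(w, b, X_batch, y_batch):
    """Gradient of the mean logistic loss over one batch."""
    error = sigmoid(X_batch @ w + b) - y_batch
    return X_batch.T @ error / len(y_batch), error.mean()


def run_epochs(batch_size, eta=0.1, n_epochs=20):
    """Train from zero weights, updating once per batch of the given size."""
    w, b = np.zeros(2), 0.0
    n = len(y_toy)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad_w, grad_b = log_loss_gradient(w, b, X_toy[idx], y_toy[idx])
            w -= eta * grad_w
            b -= eta * grad_b
    return np.mean((sigmoid(X_toy @ w + b) > 0.5) == y_toy)


results = {}
for name, batch_size in [("batch", 100), ("minibatch", 10), ("single-sample", 1)]:
    results[name] = run_epochs(batch_size)
    updates = int(np.ceil(len(y_toy) / batch_size))
    print(f"{name:>13}: {updates:3d} updates per epoch, accuracy {results[name]:.2f}")
```

With the same number of passes over the data, the smaller the batch, the more (and noisier) updates are made per epoch.
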

Let's initialise an instance of `SGDClassifier`. This is a class which takes several variables of note when initialised:

* `loss`: the loss function, which determines the classifier. Examples include `hinge`, which will result in training a linear SVM classifier, and `log_loss`, which will result in a Logistic Regression classifier.
* `eta0`: the initial learning rate.
* `learning_rate`: the learning rate schedule. `constant` keeps the learning rate at `eta0` throughout, which is basic gradient descent. `adaptive` keeps the learning rate at `eta0` while the training loss keeps decreasing; each time the loss fails to improve by `tol` for `n_iter_no_change` consecutive epochs, the learning rate is divided by 5.
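
The `adaptive` rule can be sketched in a few lines of plain Python. This is a simplified illustration of the behaviour described above, not `sklearn`'s actual code: the function name `adaptive_schedule` and the list of per-epoch losses are made up for the example, and `n_iter_no_change` is set to `2` here (rather than `sklearn`'s default of `5`) to keep the trace short.

```python
def adaptive_schedule(epoch_losses, eta0=0.01, tol=1e-3, n_iter_no_change=2):
    """Return the learning rate used in each epoch (simplified sketch).

    The rate starts at eta0 and is divided by 5 each time the loss has
    failed to improve on the best loss so far by at least tol for
    n_iter_no_change consecutive epochs.
    """
    eta = eta0
    best_loss = float("inf")
    stalled = 0
    etas = []
    for loss in epoch_losses:
        etas.append(eta)
        if loss > best_loss - tol:
            stalled += 1  # no meaningful improvement this epoch
        else:
            stalled = 0
        best_loss = min(best_loss, loss)
        if stalled >= n_iter_no_change:
            eta /= 5.0
            stalled = 0
    return etas


# Hypothetical training losses: quick progress, then a plateau.
losses = [1.0, 0.5, 0.5, 0.5, 0.5, 0.4]
etas = adaptive_schedule(losses)
for epoch, (loss, eta) in enumerate(zip(losses, etas), start=1):
    print(f"epoch {epoch}: loss {loss:.2f}, learning rate {eta:.4f}")
```

In this trace the rate stays at `0.01` for the first four epochs and drops to roughly `0.002` once the loss has stalled twice in a row.
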

Initialise a logistic regression classifier using a constant learning rate of `0.01`.

In [None]:
# Initialise the classifier instance.

clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.01)



### Training

Once we've done the set up, the rest is easy. Simply call the `.fit` method of your classifier object and pass it `X_train` and `y_train`.

In [None]:
# Train your classifier on the training data
clf.fit(X_train, y_train)


Note that `sklearn` does a lot of things under the hood. We don't need to choose batch sizes or write loops ourselves; all of that is taken care of. Getting good results, however, requires understanding what the packages are doing. Hopefully, you have progressed towards that goal today!

### Results

`sklearn` provides various ways to see how we've done. For example, we can plot the confusion matrix easily.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Pass the trained classifier, X_test and y_test to ConfusionMatrixDisplay.from_estimator()

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()



As a final exercise, let's compare a constant learning rate to an adaptive one, in terms of the model's performance as we show it more and more data (a graph often called the 'learning curve'). As we don't have a huge amount of data, this won't be super precise, but it should give you an idea of how things might vary (for a quick summary of the various sorts of curves we can plot to analyse models, see [here](https://scikit-learn.org/stable/modules/learning_curve.html)).

In [None]:
from sklearn.model_selection import learning_curve

# Plot the learning curves for the SGD classifiers with constant learning rate, and with adaptive learning rate.
clf2 = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)

train_sizes_constant, train_scores_constant, test_scores_constant = learning_curve(
    clf2, X_train, y_train
)

clf3 = SGDClassifier(loss="log_loss", learning_rate="adaptive", eta0=0.01)

train_sizes_adaptive, train_scores_adaptive, test_scores_adaptive = learning_curve(
    clf3, X_train, y_train
)

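# learning_curve's default score for a classifier is accuracy (higher is better),
# so plot the mean cross-validated score directly rather than negating it.
plt.plot(
    train_sizes_constant,
    test_scores_constant.mean(axis=1),
    "o-",
    color="r",
    label="constant",
)

plt.plot(
    train_sizes_adaptive,
    test_scores_adaptive.mean(axis=1),
    "o-",
    color="b",
    label="adaptive",
)

plt.xlabel("Number of training examples")
plt.ylabel("Mean cross-validated accuracy")
plt.legend()  # show which curve is which
plt.show()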


What do you notice? If the curves separate, where do they do so? Remember that this is a stochastic process; run your code a few times and see if there are differences. Pay special attention to where the randomness comes from. For example, what makes the most difference: the test/train split, or the learning rate?