# Classification

This week, we will cover machine learning for classification. First, we will cover some of the basic concepts. Then we will look at some hands-on examples using Scikit-Learn.

## What is classification?

For a **classification** machine learning problem, we are interested in predicting which **class** a sample belongs to, based on one or more features of the sample. Classes are sometimes also termed labels. This is a type of supervised learning, where we train our models using labeled data. Examples of classes include "healthy" vs "disease"; or "healthy" vs "minor disease" vs "serious disease". If we have exactly two classes, then we have a **binary** classification problem. For a binary classification problem, it is common to designate one class as positive (or label 1) and the other class as negative (or label 0). If we have more than two classes, then we have a **multiclass** problem. (This is distinct from a **multilabel** problem, where a single sample can be assigned several labels at once.)

## Prediction of labeling and prediction of confidence

The minimum information that our model needs to provide about each sample is the predicted class. It is often useful for the model to also provide a confidence for the prediction, or predicted probabilities for each class.
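
As a rough illustration of the difference between a predicted class and a predicted probability, the sketch below fits a Scikit-Learn `LogisticRegression` to a tiny made-up dataset (the numbers are invented for this example only). `predict` returns one class per sample, while `predict_proba` returns one probability per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration: one feature, two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

X_new = np.array([[0.5], [2.5], [4.5]])
labels = model.predict(X_new)        # predicted class for each sample
proba = model.predict_proba(X_new)   # one column per class; each row sums to 1

print(labels)
print(proba.round(2))
```

Not every Scikit-Learn classifier provides `predict_proba`; some provide only `decision_function`, which returns a confidence score rather than a probability.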

## Confusion matrix

One of the most useful visualizations of the predictions of a classification model is the confusion matrix. This shows the number (or fraction) of predictions that fall into each combination of true class and predicted class.

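For binary classification, each prediction falls into one of four sets:

- True positive (TP): the sample is positive and the model predicts positive.
- True negative (TN): the sample is negative and the model predicts negative.
- False positive (FP): the sample is negative but the model predicts positive.
- False negative (FN): the sample is positive but the model predicts negative.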

[Show images of confusion matrices here.]
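
A minimal sketch of computing a binary confusion matrix with Scikit-Learn's `confusion_matrix`. The label arrays are made up for illustration. In Scikit-Learn's convention, rows are the true classes and columns are the predicted classes, with classes in sorted order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem with labels 0 and 1, flattening the matrix
# gives the four counts in this order.
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Passing `normalize="true"` to `confusion_matrix` returns fractions of each true class instead of counts.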

## Scoring classification models

There are several scores (or metrics) that are useful for evaluating the performance of a classification model, and for comparing different classification models.

The simplest score is accuracy.

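Accuracy is the fraction of all predictions that are correct. For binary classification this is (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.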
[Define accuracy for ]

However, there is an issue with accuracy: it treats all errors equally. Often, we specifically want to minimize false positives. Or we may want to minimize false negatives.

[Examples of cases where you would want to minimize false positives or false negatives]

In the biomedical world, the following metrics are commonly used for evaluating binary classification models:

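- Sensitivity (true positive rate): the fraction of positive samples that are correctly predicted as positive, TP / (TP + FN).
- Specificity (true negative rate): the fraction of negative samples that are correctly predicted as negative, TN / (TN + FP).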

In the machine learning world, the following related metrics are more commonly used:

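- Recall: the fraction of positive samples that are correctly predicted as positive, TP / (TP + FN). This is the same quantity as sensitivity.
- Precision: the fraction of positive predictions that are actually positive, TP / (TP + FP).

[Extend definitions for the multiclass case]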

Furthermore, the F1-score is the harmonic mean of recall and precision, and is often used as a single score to evaluate models.
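
The scores named in this section can be computed with functions from `sklearn.metrics`, as sketched below on made-up labels. Scikit-Learn has no dedicated specificity function; one option, used here, is to compute the recall of the negative class with `pos_label=0`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predictions (1 = positive, 0 = negative).
# These give TP=3, FN=1, TN=4, FP=2.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

accuracy = accuracy_score(y_true, y_pred)                 # (TP + TN) / all samples
recall = recall_score(y_true, y_pred)                     # TP / (TP + FN), i.e. sensitivity
precision = precision_score(y_true, y_pred)               # TP / (TP + FP)
specificity = recall_score(y_true, y_pred, pos_label=0)   # TN / (TN + FP)
f1 = f1_score(y_true, y_pred)                             # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, precision={precision:.3f}")
print(f"specificity={specificity:.3f}, F1={f1:.3f}")
```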

## Receiver operating characteristic (ROC) curves

Another technique for evaluating binary classification models is the receiver operating characteristic (ROC) curve.

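To compute a ROC curve, we need a model that returns a confidence score or has an adjustable decision boundary, so that we can vary the threshold at which a sample is classified as positive.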

We plot the true positive rate (or ...) on the vertical axis and the false positive rate (or ...) on the horizontal axis. Then we plot...

[Show curve examples]

After computing the ROC curve, we can calculate the area under the curve (AUC). The AUC can be used as a metric to compare models.
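
A small sketch of computing a ROC curve and its AUC with Scikit-Learn's `roc_curve` and `roc_auc_score`. The labels and confidence scores are made up for illustration. Each threshold on the score gives one point on the curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and model confidence scores for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}, TPR={t:.2f}")
print(f"AUC={auc:.2f}")
```

An AUC of 1 corresponds to a perfect ranking of positive samples above negative samples, and an AUC of 0.5 corresponds to random scores.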

## Models

There are many types of models for classification. Different models will be appropriate for different types of data. The model can be selected by testing the performance of candidate models on your data, or by experience. In the following notebooks, we will go through examples using each model.

### Perceptron

### Logistic Regression Classifier

### Support Vector Classifier

#### Non-linear SVC with kernels

### Decision Tree Classifier

### Random Forest Classifier

## Linear Binary Classification

<img src="imgs/LinearBinaryClassification3.png">

Let's now look in detail at how linear binary classification works. The prediction is based on the decision function, which is a multivariate linear function.

The decision boundary is defined by the decision function being equal to zero.

If the value of the decision function is positive, we will predict label 1. If the value of the decision function is negative, we will predict label 0.
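
The rule above can be sketched in a few lines of NumPy. The weights `w` and bias `b` are hypothetical values chosen for illustration; in practice they are learned from the data. Samples that lie exactly on the boundary are assigned label 0 here, which is an arbitrary choice for this sketch.

```python
import numpy as np

# Hypothetical parameters of a linear decision function h(x) = w . x + b.
w = np.array([2.0, -1.0])
b = -1.0

def decision_function(X):
    """Evaluate h(x) = w . x + b for each row of X."""
    return X @ w + b

def predict(X):
    """Predict label 1 where h(x) > 0 and label 0 otherwise."""
    return (decision_function(X) > 0).astype(int)

X = np.array([[2.0, 1.0],    # h = 2  -> label 1
              [0.0, 1.0],    # h = -2 -> label 0
              [1.0, 1.0]])   # h = 0  -> on the decision boundary
print(decision_function(X))
print(predict(X))
```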

## Perceptron

<img src="imgs/Perceptron2.png">

The linear perceptron is a simple classification model that finds a line (or, in higher dimensions, a plane or hyperplane) that divides the data into two classes. It can also be used for multiclass problems (see a later notebook).

So how can we find the decision function for our dataset? There are various algorithms for this. We have already introduced the perceptron model, and we will revisit it now.

In the perceptron model, we find the decision function that minimises the perceptron criterion. This loss function penalises misclassified samples in proportion to their distance from the decision boundary, which is expressed by the absolute value of the decision function. The perceptron learning algorithm is simple. We first pick a random sample. If the sample is misclassified, we update the weight vector. The algorithm iterates until convergence. The value eta is the learning rate and is usually set to 1. This algorithm has some disadvantages. The solution is not unique, and convergence is guaranteed only if the data are linearly separable. But it generally works in practice.
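
The learning algorithm described above can be sketched in NumPy as follows. This is one common variant, not necessarily the exact version shown in the figure: it visits the samples in a random order in each pass, internally uses targets of -1 and +1, and stops after a pass with no updates. The function name `train_perceptron`, the `max_epochs` limit and the toy data are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100, seed=0):
    """Train a linear perceptron; y holds labels 0 and 1. Returns (w, b)."""
    rng = np.random.default_rng(seed)
    t = np.where(y == 1, 1.0, -1.0)   # targets in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        n_updates = 0
        for i in rng.permutation(len(X)):      # samples in random order
            if t[i] * (X[i] @ w + b) <= 0:     # misclassified (or on the boundary)
                w += eta * t[i] * X[i]         # move the boundary towards the sample
                b += eta * t[i]
                n_updates += 1
        if n_updates == 0:                     # a full pass without errors: converged
            break
    return w, b

# Small linearly separable toy dataset, invented for illustration.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, 0, 0, 0])

w, b = train_perceptron(X, y)
pred = (X @ w + b > 0).astype(int)
print(w, b, pred)
```

Scikit-Learn provides a ready-made implementation in `sklearn.linear_model.Perceptron`.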

## Logistic Regression Classifier

The next model we'll look at is the logistic regression classifier. Note that despite the name this is a classification model not a regression model!

An advantage of the logistic regression classifier is that the output of the model is a probability.

<img src="imgs/LogisticRegression.png">

The logistic regression model allows us to do that. It converts the output of the decision function h to the probability of the positive class using the sigmoid function. This function squashes the output of the decision function into the range [0,1].

The probability of label 1 given the feature x is therefore the sigmoid of h(x), plotted here using the red solid line. The probability of label 0 for the same feature is 1 minus the probability of label 1. It is plotted using the blue dotted line.
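
A minimal sketch of the sigmoid function and the two class probabilities, evaluated at a few hypothetical values of the decision function h:

```python
import numpy as np

def sigmoid(h):
    """Map a decision function value h to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-h))

# Hypothetical decision function values: far on the negative side,
# on the decision boundary, and far on the positive side.
h = np.array([-5.0, 0.0, 5.0])

p1 = sigmoid(h)   # probability of label 1
p0 = 1.0 - p1     # probability of label 0

print(p1.round(3))
print(p0.round(3))
```

On the decision boundary (h = 0) both classes have probability 0.5.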

<img src="imgs/CrossEntropy2.png">

So how do we fit the logistic regression model? We minimise the cross-entropy loss. Let’s consider a single sample with index $i$ and see what the cross-entropy loss means for this sample. The probability $p_i$ for this sample is the predicted probability of class 1.

If the label for this sample is 1, the penalty will be equal to $-\log p_i$. If $p_i$ is one, it will result in zero penalty. If $p_i$ is close to zero, it will result in a large penalty. The loss function is therefore forcing the probability towards 1 for samples with label 1.

If the label for this sample is 0, the penalty will be equal to $-\log(1 - p_i)$. If $p_i$ is zero, the penalty is zero as well. If $p_i$ is close to 1, it will result in a large penalty. For samples with label 0, the loss function therefore forces the probability towards zero.

We can therefore see that minimisation of the cross entropy ensures that the probabilities $p_i$ are similar to the labels $y_i$. The solution is found using numerical methods and, because the loss is convex, convergence is guaranteed in this case.
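
The two cases above can be combined into a single per-sample expression, `-(y * log(p) + (1 - y) * log(1 - p))`, which the sketch below evaluates with NumPy. The probabilities are made up for illustration, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Per-sample cross-entropy loss for labels y (0 or 1) and probabilities p of class 1."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up labels and predicted probabilities of class 1.
y = np.array([1, 1, 0, 0])
p = np.array([0.99, 0.01, 0.01, 0.99])

losses = cross_entropy(y, p)
print(losses.round(3))    # small penalty when p agrees with y, large when it does not
print(losses.mean())      # the average over samples is the quantity that is minimised
```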

## Support Vector Classifier 

Next we'll take a look at the support vector classifier (SVC). This is also often called a support vector machine (SVM).

<img src="imgs/LinearlySeparableDataset.png" width = "300" style="float: right;">

- First, let's assume we have a linearly separable dataset
- All 3 decision boundaries result in accuracy = 1
- Which boundary is likely to generalise well?
- The red boundary is most likely to generalise well

Linearly separable datasets can be perfectly separated by a linear decision boundary, so we can achieve a classification accuracy of 1. In our example of diagnosis of heart failure, this is the case for healthy patients and patients with severe heart failure.

There are many decision boundaries with accuracy 1 for separable datasets. So how do we choose the one that is most likely to generalise well?

The red boundary seems to be the best because, unlike the other two, it is far from the samples.

**Large margin classifier (hard margin)**

<img src="imgs/HardMargin.png" width = "400" style="float: right;">

With a large margin classifier, the decision boundary is
- as far as possible from the samples
- determined by samples on the margins - **support vectors**

The support vector classifier is a large margin classifier, which means that it searches for a decision boundary that is as far as possible from the samples.

The decision boundary is determined by samples that lie on the margins and are called support vectors, here denoted by pink circles.
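
To see which samples act as support vectors, the sketch below fits Scikit-Learn's `SVC` with a linear kernel to a small separable toy dataset invented for this example. A large value of `C` is used to approximate the hard margin. The margin width is computed as 2 divided by the norm of the weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data, invented for illustration.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors:", clf.support_vectors_)

# Distance between the two margins.
margin_width = 2.0 / np.linalg.norm(clf.coef_)
print("margin width:", margin_width)
```

Only the two closest samples from opposite classes, (1, 1) and (3, 3), are support vectors here. Moving any of the other samples slightly would leave the fitted boundary unchanged.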

<img src="imgs/SoftMargin.png">

The large margin classifier can be generalised to non-separable datasets by minimising the margin violations. The decision boundary is again determined by the support vectors, which lie on or inside the margin or on the wrong side of the decision boundary.



Now let's look at an example using a support vector classifier.

# Support vector classification

In this notebook we will explore the Support Vector Classifier (SVC). A linear binary SVC is very similar to the perceptron and logistic regression in the sense that it finds the optimal hyperplane to separate two classes. These methods, however, have different objectives, which determine what each considers to be the optimal decision boundary.

There are three different SVC classifiers in the `sklearn` library:
1. `LinearSVC` implements a linear classifier optimised for performance, but does not support the kernel trick.
2. `SVC` implements SVC with the kernel trick. Setting `kernel='linear'` produces a very similar result to `LinearSVC` (not an identical one, because the default loss and solver differ) but is less efficient in terms of computational time. Setting `kernel='rbf'` produces a non-linear classifier with a Gaussian kernel.
3. `SGDClassifier` implements various classifiers that are optimised using stochastic gradient descent. Its default setting for the loss function is `loss='hinge'`, which is another implementation of a linear SVC.
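
As a rough comparison, the sketch below fits each of these classifiers to the same synthetic two-class dataset. The dataset (generated with `make_blobs`), the cluster centres and the value `C=1.0` are assumptions chosen for this example, and the accuracies are measured on the training data only.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic, well-separated two-class dataset, generated for illustration.
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

models = {
    "LinearSVC": LinearSVC(C=1.0),
    "SVC (linear kernel)": SVC(kernel="linear", C=1.0),
    "SVC (RBF kernel)": SVC(kernel="rbf", C=1.0),
    "SGDClassifier (hinge loss)": SGDClassifier(loss="hinge", random_state=0),
}

# Fit each model and record its accuracy on the training data.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: training accuracy = {score:.3f}")
```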

The SVC result also depends on the hyperparameter `C`, which controls the width of the margin and regularises the decision function. A larger `C` means a smaller margin, less regularisation, and a closer approximation of the hard margin objective. A smaller `C` means a larger margin, and a smoother boundary for a non-linear SVC. Note that `C` has the opposite role to the parameter `alpha` that we used for penalised regression (e.g. `Ridge`). This is because it multiplies the data term rather than the penalty term.
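
The effect of `C` on the margin can be checked numerically. The sketch below fits a linear `SVC` for several values of `C` on an overlapping synthetic dataset (generated with `make_blobs`; the centres, spread and `C` values are assumptions for this example) and reports the margin width, computed as 2 divided by the norm of the weight vector, together with the number of support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class dataset, generated for illustration.
X, y = make_blobs(n_samples=200, centers=[[-1.5, -1.5], [1.5, 1.5]],
                  cluster_std=1.5, random_state=0)

margins = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)   # distance between the two margins
    n_sv = len(clf.support_)                       # samples on or inside the margin
    print(f"C={C}: margin width={margins[C]:.2f}, support vectors={n_sv}")
```

The smallest `C` should give the widest margin, in line with the description above.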

## Extra techniques

Finally we will cover examples of some additional techniques that are important for classification problems:

1. Encoding classes
2. One-hot encoding
3. Creating training and test sets
4. Handling unbalanced datasets
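
As a brief preview, each of these techniques has a counterpart in Scikit-Learn. The sketch below uses a small made-up dataset with class names borrowed from the examples earlier in these notes. The specific choices (`LabelEncoder`, `OneHotEncoder`, a stratified `train_test_split`, and `class_weight="balanced"`) are one possible set of options, not the only way to handle each step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up unbalanced dataset: 16 "healthy" and 4 "disease" samples, 2 random features.
labels = np.array(["healthy"] * 16 + ["disease"] * 4)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# 1. Encoding classes as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 2. One-hot encoding: one binary column per category.
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

# 3. Training and test sets; stratify=y keeps the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 4. Unbalanced data: weight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

print(encoder.classes_, onehot.shape, np.bincount(y_train), np.bincount(y_test))
```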