# Classification

This week, we will cover machine learning for classification. First, we will cover some of the basic concepts. Then we will look at some hands-on examples using Scikit-Learn.

### What is classification?

For a **classification** machine learning problem, we are interested in predicting which **class** a sample will be in based on one or more features of the sample. Classes are sometimes also termed labels. This is a type of supervised learning where we will train our models using labeled data. Examples of classes include "healthy" vs "disease"; or "healthy" vs "minor disease" vs "serious disease". If we have exactly two classes, then we have a **binary** classification problem. For a binary classification problem, it is common to designate one class as positive (or label 1) and another class as negative (or label 0). If we have more than two classes, then we have a **multiclass** problem. (This is different from a **multilabel** problem, in which a single sample can carry several labels at once.)

### Prediction of labeling and prediction of confidence

The minimum information that our model needs to provide about each sample is the predicted class. It is often also useful for the model to provide a confidence for the prediction, or predicted probabilities for each class.
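As a rough illustration of the difference between the two kinds of output, here is a minimal sketch using Scikit-Learn. The toy data and the choice of `LogisticRegression` are assumptions made only for this example; many other Scikit-Learn classifiers expose the same `predict` and `predict_proba` methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration: one feature, two classes.
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

X_new = np.array([[1.2], [2.5], [4.2]])

# Hard predictions: one class label per sample.
labels = model.predict(X_new)

# Soft predictions: one probability per class; each row sums to 1.
probabilities = model.predict_proba(X_new)

print(labels)
print(probabilities.round(3))
```

The columns of `probabilities` follow the order of `model.classes_`, so here the second column is the probability of label 1.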

### Confusion matrix

One of the most useful visualizations of the predictions of a classification model is the confusion matrix. This will show the number (or fraction) of the predictions that fall into different sets.

For binary classification, we can define four sets. First, we need to select one class to be the "positive" class and the other class to be the "negative" class. Usually we will use label 0 as the negative class and label 1 as the positive class.

True positive (TP): The true and predicted labels are both positive.

True negative (TN): The true and predicted labels are both negative.

False positive (FP): The true label is negative but the predicted label is positive.

False negative (FN): The true label is positive but the predicted label is negative.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Confusion-Matrix-2.png" width = "250" style="float: left;">

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/confusion-matrix.png" width = "250" style="float: right;">

Confusion matrices can also be used for multiclass classification, not just binary classification.

Alright, it's time to get into the world of stats and technical terms! We're going to explore something called a 'confusion matrix'. This is a handy tool that allows us to compare our model's predictions with the actual results in our test data.

A confusion matrix is a bit like a report card for our predictions. It gives us the numbers for when we got things right and when we got them wrong, in a handy little table:

|          | Predicted 0         | Predicted 1         |
|----------|---------------------|---------------------|
| **Actual 0** | True negative (TN)  | False positive (FP) |
| **Actual 1** | False negative (FN) | True positive (TP)  |

|          | Predicted Didn't Survive | Predicted Survived  |
|----------|--------------------------|---------------------|
| **Actually Didn't Survive** | True negative (TN)  | False positive (FP) |
| **Actually Survived**       | False negative (FN) | True positive (TP)  |

The true negatives and true positives are when our predictions match reality. True negatives are when we said a person wouldn't survive, and they didn't. True positives are when we said a person would survive, and they did.

False negatives and false positives are where we slipped up. False negatives are when we said a person wouldn't survive, but they ended up surviving. False positives are when we said a person would survive, but they didn't.
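To make the table concrete, here is a small sketch that builds a confusion matrix with Scikit-Learn's `confusion_matrix`. The true and predicted labels are invented for illustration and do not come from the real Titanic data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = survived, 0 = did not survive).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() gives the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With `labels=[0, 1]` the layout matches the table above: the top-left entry is TN and the bottom-right entry is TP.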

### Scoring classification models

There are several scores (or metrics) that are useful for evaluating the performance of a classification model, and for comparing different classification models.

The simplest score is accuracy.

accuracy = $\frac{(TP + TN)}{TP + TN + FP + FN}$

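However, accuracy has a limitation: it treats every error the same way. Often, we specifically want to minimize false positives, or we specifically want to minimize false negatives, and accuracy alone does not tell us which kind of error the model is making.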

In the biomedical world, the following metrics are commonly used for evaluating binary classification models:

sensitivity = true positive rate = $\frac{TP}{TP + FN}$

specificity = true negative rate = $\frac{TN}{TN + FP}$

In the machine learning world, the following related metrics are more commonly used:

recall = sensitivity = true positive rate = $\frac{TP}{TP + FN}$

precision = $\frac{TP}{TP + FP}$

[Extend definitions for the multiclass case]

Furthermore, the $F_1$ score combines recall and precision (it is their harmonic mean), and is often used as a single score to evaluate models.

$F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
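The formulas above translate directly into a few lines of Python. This is only a sketch: the function name `classification_scores` and the four counts are hypothetical, and the function does not guard against a zero denominator.

```python
def classification_scores(tp, tn, fp, fn):
    """Compute common binary classification scores from the four counts.

    Assumes every denominator is non-zero (no division-by-zero handling).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # also called sensitivity or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
scores = classification_scores(tp=3, tn=4, fp=2, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In practice, Scikit-Learn provides `accuracy_score`, `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics`, which compute the same quantities from label arrays.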

We can calculate accuracy as the percentage of times we got it right. It's the number of true negatives and true positives divided by the total number of predictions.

Consider a team of medical researchers trying to predict whether or not a patient is at risk of heart failure:
Just like in our previous examples, accuracy is the proportion of correct predictions. But, as we noted earlier, if heart failures are rare, a model that always predicts "no heart failure" might have a high accuracy but it's not really helpful in the medical context. So, it's important to consider more than just accuracy when evaluating such predictions!

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/confusion-matrix.png' width=500px align="center">

Accuracy is basically how often our model gets its guesses right. It's calculated by adding up all the times our model correctly predicted heart failure (true positives) and correctly predicted no heart failure (true negatives), then dividing that by the total number of predictions. In other words,

accuracy = (TN + TP) / (TN + TP + FN + FP).

Accuracy might not always give us the full picture. Let's say we want to predict a rare disease that only affects 1 in every 1,000 people. If we create a model that always predicts that no one has the disease, it will be 99.9% accurate, but it's not very useful, right?
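We can check this rare-disease argument with a quick simulation. The sketch below assumes a population of exactly 1,000 people with exactly one case, and a "model" that always predicts "no disease".

```python
import numpy as np

n_people = 1000

# Exactly one person in 1,000 has the disease (label 1).
y_true = np.zeros(n_people, dtype=int)
y_true[0] = 1

# A useless "model" that always predicts "no disease" (label 0).
y_pred = np.zeros(n_people, dtype=int)

accuracy = np.mean(y_true == y_pred)

# Recall: of the people who really have the disease, how many did we find?
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The accuracy looks excellent, but the recall is zero: the model never finds the one person who is actually sick.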

The same can be applied to our **Titanic challenge**. If we made a model that simply predicted everyone on board died, it would be 62% accurate. But that model wouldn't tell us much about who actually had a better chance of survival. So, accuracy isn't always the best judge of a model's performance!

Imagine a circle that represents all the patients we predicted would experience heart failure. Now, the left part of that circle represents the patients who actually did experience heart failure.

**Precision** is like asking,

"*Out of all the patients we predicted would experience heart failure, how many actually did?*" We calculate it as:

TP / (TP + FP).

**Recall**, on the other hand, is like asking,

"*Out of all the patients who actually experienced heart failure, how many did we correctly predict?*" We calculate it as:

TP / (TP + FN).

Now, there are times when precision or recall matters more:

**High precision** is key when we really want to avoid false positives.

- For example, in court trials, we want to be sure that if we declare someone guilty, they really are.

- Another example is email spam filters. We don't want to accidentally label important emails as spam.

**High recall** is critical when we want to avoid false negatives.

- For example, in cancer screenings, a false negative could mean a patient who actually has cancer gets a clean bill of health.

Think about it, what do you think is more important for recommendation engines like YouTube, Netflix, or Spotify? Would it be precision or recall?

#### Analogy to explain precision and recall

 [Here](https://towardsdatascience.com/precision-and-recall-88a3776c8007) is a great analogy to explain Precision and Recall, and I've paraphrased it below:

Think of it like fishing with a net. If you cast a wide net into a lake and catch 80 out of 100 fish, that's 80% recall. However, you also end up with 80 rocks in your net, which means your precision is 50% since half of the net's contents are unwanted junk.

On the other hand, you could use a smaller net and focus on a specific area of the lake where there are lots of fish and no rocks. In this case, you might only catch 20 out of the 100 fish, but you'll have zero rocks. This results in 20% recall and 100% precision.
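The numbers in the fishing analogy can be checked in a couple of lines. The helper name `precision_recall` is made up for this sketch; fish play the role of positives and rocks the role of false positives.

```python
def precision_recall(fish_caught, rocks_caught, fish_in_lake):
    """Precision and recall for the fishing analogy."""
    precision = fish_caught / (fish_caught + rocks_caught)
    recall = fish_caught / fish_in_lake
    return precision, recall


# Wide net: 80 of the 100 fish, plus 80 rocks.
wide_net = precision_recall(fish_caught=80, rocks_caught=80, fish_in_lake=100)

# Small net: 20 of the 100 fish, no rocks.
small_net = precision_recall(fish_caught=20, rocks_caught=0, fish_in_lake=100)

print("wide net  (precision, recall):", wide_net)
print("small net (precision, recall):", small_net)
```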

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/precision_recall.gif' width=300px align="right">

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/recall.gif' width=300px align="left">

### Receiver operating characteristic (ROC) curves

Another technique to evaluate binary classification models is the receiver operating characteristic (ROC) curve.

We need a model that returns a confidence score or has an adjustable decision boundary.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/roc-curve.png" width = "300" style="float: right;">

We plot the true positive rate on the vertical axis against the false positive rate on the horizontal axis, with one point for each setting of the decision threshold.

After computing the ROC curve, we can calculate the area under the curve (AUC). The AUC can be used as a metric to compare models. With this metric, models with a larger AUC are ranked higher than models with a lower AUC. This assumes we are interested in models with a balance between sensitivity and specificity.
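Here is a minimal sketch of how an ROC curve and its AUC could be computed with Scikit-Learn. The labels and confidence scores are invented for illustration; in a real workflow the scores would come from a fitted model, for example via `predict_proba` or `decision_function`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model confidence scores for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95])

# Each threshold on the score gives one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve.
auc_value = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR = {f:.2f}, TPR = {t:.2f}")
print(f"AUC = {auc_value:.2f}")
```

To draw the curve, the `fpr` and `tpr` arrays can be passed to any plotting library, with `fpr` on the horizontal axis.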

## Classification Models

There are many types of models for classification. Different models will be appropriate for different types of data. The model can be selected by testing the performance of the model on your data, or by experience. In the following notebooks, we will go through examples using each model.

### Linear Binary Classification

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/LinearBinaryClassification3.png">

Let's now look in detail at how linear binary classification works. The prediction is based on a decision function, which is a multivariate linear function of the features.

The decision boundary is defined by the decision function being equal to zero.

If the value of the decision function is positive, we will predict label 1. If the value of the decision function is negative, we will predict label 0.
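The rule above can be sketched in NumPy as follows. The weights `w`, the bias `b` and the sample points are hypothetical values picked for illustration, not parameters fitted to any dataset.

```python
import numpy as np

# Hypothetical parameters of a linear decision function h(x) = w . x + b.
w = np.array([1.0, -2.0])
b = 0.5


def decision_function(X):
    """Value of the linear decision function for each row of X."""
    return X @ w + b


def predict(X):
    """Label 1 where the decision function is positive, label 0 otherwise."""
    return (decision_function(X) > 0).astype(int)


X = np.array([[3.0, 1.0], [0.0, 2.0], [1.0, 1.0]])

h_values = decision_function(X)
labels = predict(X)

print(h_values)
print(labels)
```

A point that lies exactly on the decision boundary has a decision function of zero; this sketch assigns such points to label 0, which is just one possible convention.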

### Perceptron

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Perceptron2.png">

The linear perceptron is a simple model for classification. It finds a line (or, in higher dimensions, a plane or hyperplane) that divides the data into two classes. It can also be used for multiclass problems (see a later notebook).

So how can we find the decision function for our dataset? There are various algorithms to do that. We have already introduced the perceptron model before and we will revisit it now.

In the perceptron model, we find the decision function that minimises the perceptron criterion. This loss function penalises misclassified samples proportionally to their distance from the decision boundary, which is expressed by the absolute value of the decision function. The perceptron learning algorithm is simple. We first pick a random sample. If the sample is misclassified, we update the weight vector. The algorithm iterates until convergence. The value $\eta$ is the learning rate and is usually set to 1. This algorithm has some disadvantages. It does not always have a unique solution, and it is only guaranteed to converge when the data are linearly separable. But it generally works in practice.
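The learning algorithm described above can be sketched in NumPy. This is one common variant, written under a few assumptions: labels 0/1 are mapped to targets -1/+1, the bias is handled by appending a constant feature, samples are visited in a random order in each pass, and the update for a misclassified sample is $w \leftarrow w + \eta\, t_i\, x_i$. The function names and the toy data are hypothetical.

```python
import numpy as np


def add_bias_column(X):
    """Append a constant feature so the bias is learned as an extra weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def train_perceptron(X, y, eta=1.0, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    t = np.where(y == 1, 1.0, -1.0)  # targets in {-1, +1}
    Xb = add_bias_column(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        n_updates = 0
        # Visit the samples in a random order.
        for i in rng.permutation(len(Xb)):
            # Misclassified (or exactly on the boundary): update the weights.
            if t[i] * (Xb[i] @ w) <= 0:
                w += eta * t[i] * Xb[i]
                n_updates += 1
        # A full pass without any update means the algorithm has converged.
        if n_updates == 0:
            break
    return w


def predict(X, w):
    return (add_bias_column(X) @ w > 0).astype(int)


# Toy, linearly separable data made up for illustration.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 0.5],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = train_perceptron(X, y)
predictions = predict(X, w)
print(w)
print(predictions)
```

Scikit-Learn also ships a ready-made `Perceptron` class in `sklearn.linear_model`, which is usually the more convenient choice in practice.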

### Logistic Regression Classifier

The next model we'll look at is the logistic regression classifier. Note that despite the name this is a classification model not a regression model!

An advantage of the logistic regression classifier is that the output of the model is a probability.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/LogisticRegression.png">

The logistic regression model produces this probability by converting the output of the decision function $h$ into the probability of the positive class using the sigmoid function. This function squashes the output of the decision function into the range [0,1].

Probability of label 1 given the feature x is therefore sigmoid of h(x), plotted here using the red solid line. The probability of label 0 for the same feature is 1 minus probability of label 1. It is plotted using the blue dotted line.
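As a small sketch, the sigmoid $\sigma(h) = \frac{1}{1 + e^{-h}}$ and the two class probabilities can be computed like this. The decision function values in `h` are arbitrary numbers chosen for illustration.

```python
import numpy as np


def sigmoid(h):
    """Squash decision function values into probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-h))


# Arbitrary decision function values, from strongly negative to strongly positive.
h = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

p_label_1 = sigmoid(h)        # probability of label 1
p_label_0 = 1.0 - p_label_1   # probability of label 0

print(p_label_1.round(3))
print(p_label_0.round(3))
```

On the decision boundary ($h = 0$) both classes get probability 0.5, and the further a sample is from the boundary, the closer its probability gets to 0 or 1.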

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/CrossEntropy2.png">

So how do we fit the logistic regression model? We minimise the cross entropy loss. Let’s consider a single sample with index $i$ and see what cross entropy loss means for this sample. The probability $p_i$ for this sample is the predicted probability of class 1.

If the label for this sample is 1, the penalty will be equal to $-\log p_i$. If $p_i$ is one, it will result in zero penalty. If $p_i$ is close to zero, it will result in a large penalty. The loss function is therefore forcing the probability towards 1 for samples with label 1.

If the label for this sample is 0, the penalty will be equal to $-\log(1 - p_i)$. If $p_i$ is zero, the penalty is zero as well. If $p_i$ is close to 1, it will result in a large penalty. For samples with label 0, the loss function therefore forces the probability towards zero.

We can therefore see that minimisation of cross entropy ensures that probabilities $p_i$ are similar to labels $y_i$. The solution is found using numerical methods and in this case, the convergence is guaranteed.
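The two cases can be written as a single per-sample penalty, $-\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right]$. The sketch below evaluates it for a few made-up labels and probabilities; the small clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np


def cross_entropy_per_sample(y, p, eps=1e-12):
    """Cross entropy penalty for each sample, given labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


# Made-up labels and predicted probabilities of class 1.
y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.1, 0.1, 0.9])

losses = cross_entropy_per_sample(y, p)
mean_loss = losses.mean()

print(losses.round(3))
print(round(mean_loss, 3))
```

Samples whose probability agrees with the label (the first and third) get a small penalty, while confident but wrong predictions (the second and fourth) get a large one.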

### Support Vector Classifier 

Next we'll take a look at the support vector classifier (SVC). This is also often called a support vector machine (SVM).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/LinearlySeparableDataset.png" width = "300" style="float: right;">

- First, let's assume we have a linearly separable dataset
- All 3 decision boundaries result in accuracy = 1
- Which boundary is likely to generalise well?
- The red boundary is most likely to generalise well

Linearly separable datasets can be perfectly separated by a linear decision boundary and we can achieve classification accuracy 1. In our example of diagnosis of heart failure, this is the case for healthy patients and patients with severe heart failure.

There are many decision boundaries with accuracy 1 for separable datasets. So how do we choose the one that is most likely to generalise well?

The red boundary seems to be the best because it is far from the samples, unlike the other two.

**Large margin classifier (hard margin)**

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/HardMargin.png" width = "400" style="float: right;">

With a large margin classifier, the decision boundary is
- as far as possible from the samples
- determined by samples on the margins - **support vectors**

The support vector classifier is a large margin classifier, which means that it searches for a decision boundary that is as far as possible from the samples.

The decision boundary is determined by samples that lie on the margins and are called support vectors, here denoted by pink circles.

**Large margin classifier (soft margin)**

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/SoftMargin.png" width = "400" style="float: right;">

With a soft margin classifier, the decision boundary
- minimises margin violations
- is determined by samples on or inside the margins or on the wrong side of the decision boundary - **support vectors**

Large margin classifiers can be generalised to non-separable datasets by minimising the margin violations. The decision boundary is again determined by support vectors, which lie on or inside the margin or on the wrong side of the decision boundary.
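As a minimal sketch, a linear support vector classifier can be fitted with Scikit-Learn's `SVC`. The toy data are invented for illustration, and `C=1.0` is simply the library default; the parameter `C` controls how strongly margin violations are penalised in the soft margin setting.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data made up for illustration.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel: the decision boundary is a straight line in feature space.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The samples that determine the decision boundary.
print(clf.support_vectors_)

# Predictions for two new, hypothetical samples.
new_labels = clf.predict([[2.0, 2.0], [4.0, 3.5]])
print(new_labels)
```

Only the support vectors influence the fitted boundary; removing any of the other samples and refitting would leave it unchanged.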



### Decision Tree Classifier

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Decision-tree.png" width = "400" style="float: right;">

A decision tree is a type of graph

- Nodes (questions)
- Edges (binary choices)

Sets of tests that are hierarchically organised. Each test is a _weak learner_.

Advantages of decision trees:

- Easy to interpret
- Able to handle both numerical and categorical data
- Able to handle multi-class problems
- Requires little data preparation (e.g. no normalization)
- Can be used for both classification and regression

Imagine it like a game of 20 questions. The model asks questions about the data and makes decisions based on the answers. And the cool thing? It can change or fine-tune its answers as it gets more information!
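To see the "questions" a tree actually asks, here is a small sketch using Scikit-Learn's `DecisionTreeClassifier` and `export_text`. The feature names, the data and the labels are all hypothetical and chosen only so that the tree is easy to read.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, resting heart rate]; label 1 = "at risk".
X = [[25, 60], [30, 65], [35, 70], [40, 90],
     [55, 85], [60, 95], [65, 100], [50, 62]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

# Limit the depth so that the tree stays small and easy to interpret.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned questions as indented text.
print(export_text(tree, feature_names=["age", "heart_rate"]))
```

Each line of the printed output is one test on a feature (a node), and the indentation shows how the tests are organised hierarchically.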

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/youdroppedfood.jpg' width=500px align="center">

(Image courtesy: Audrey Fukman and Andy Wright on SFoodie, via Serious Eats)

### Random Forest Classifier

Ensemble of decision trees

- Increases randomisation by restricting each tree node's choice of optimal feature to a random subset of the total feature space
- Further decorrelates predictive models
- Further decreases model variance
- Increases stability against feature noise and thus reduces the chance of overfitting

**If a tree is good, is a forest better?**

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/random_forest.jpeg' width=500px align="right">

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Random_forest_explain.png' width=500px align="right">

Think about it this way - if one tree is handy, wouldn't a whole forest be even better?

Imagine we could create a bunch of decision trees, each one asking different questions. Then, we could combine their answers to make a final prediction. This is exactly what a Random Forest does.

A Random Forest is what we call an 'ensemble model'. It's like a supergroup band made up of lots of individual musicians, all working together to create a harmonious sound. Here, each model contributes to a final, hopefully better, prediction.
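A minimal sketch of a random forest in Scikit-Learn could look like the following. The synthetic dataset from `make_classification` and all parameter values are assumptions made for illustration; `max_features="sqrt"` is the setting that restricts each split to a random subset of the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Hold out a test set so that we evaluate on data the forest has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 100 trees; each split considers only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print(f"Number of trees: {len(forest.estimators_)}")
print(f"Test accuracy: {test_accuracy:.2f}")
```

The individual trees are available in `forest.estimators_`, and the forest's prediction combines the predictions of all of them.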

## Extra techniques

Finally we will cover examples of some additional techniques that are important for classification problems:

1. Encoding classes
2. One-hot encoding
3. Creating training and test sets
4. Handling unbalanced datasets

### One-hot encoding

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/one-hot-encoding.png' width=500px align="center">

Machine learning is a bit like a kid who only likes to play with numbers and not with words. So, when we have categories like **'apple' and 'pear'**, we need to find a way to turn them into numbers.

This is where one-hot encoding comes in handy! It's a cool trick that turns these categories into something machine learning algorithms can work with. It's like giving each category its own 'on' and 'off' switch.

And guess what? There's an easy way to do this if you're using pandas, a tool in Python. It has a function called 'get_dummies' that does all the work for you.
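Here is a short sketch of `get_dummies` in action. The tiny DataFrame and the column name `fruit` are made up for illustration; `dtype=int` is passed so that the switches show up as 0 and 1 rather than as booleans.

```python
import pandas as pd

# A tiny, made-up table with one categorical column.
df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "banana"]})

# One new column per category; exactly one of them is "on" in each row.
encoded = pd.get_dummies(df, columns=["fruit"], dtype=int)

print(encoded)
```

Each new column is named after the original column and the category, for example `fruit_apple`.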

**I thought it would be interesting for you to know**: the term "one-hot" in "one-hot encoding" comes from the way digital circuits are designed. In digital electronics, a one-hot signal is a group of bits among which the legal combinations of values are only those with a single high bit (1) and all the others low (0).

## Example - The Titanic Kaggle challenge: A case study for classification

The Titanic Kaggle Challenge is a competition where you'll use data to predict who could've survived the infamous Titanic disaster.

Classification: "survived" or "not survived"

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Kaggle-Titanic-Project-Getting-Started.png' width=500px align="center">

Let's explore machine learning with a fun example from Kaggle, a competition site owned by Google. These contests can be for giggles, cash prizes, or even a job offer sometimes! The Titanic Kaggle Challenge is known as one of the classic examples for learning classification in a hands-on way.

One beginner's challenge is based on the Titanic disaster, a famous shipwreck that happened on April 15, 1912. The ship, thought to be unsinkable, hit an iceberg and sank, sadly causing 1502 out of 2224 people onboard to lose their lives because there weren't enough lifeboats.

But it's not just about guessing who made it. It's about deeply exploring the data, finding patterns, and understanding how different factors might have affected survival rates. It poses questions like 'Did socioeconomic status influence survival rates?', 'What was the impact of the "women and children first" policy?' and 'Was that policy strictly followed?'

Here's the interesting part: it seems that some people were more likely to survive than others. The challenge asks us to figure out who these folks were, using data like their names, ages, genders, and social classes.

We get a file with details about 891 passengers, including whether they survived or not. We'll use this data to teach our machine to make smart guesses.

But the real test comes with another file. This one has information on 418 passengers but doesn't tell us if they survived. That's where our machine's predictions come in!

Suggested tutorial about Kaggle's Titanic challenge: [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)

You can follow Chris White's tutorial on this subject through this Jupyter notebook: [01-intro-classification.ipynb](https://colab.research.google.com/github/ualberta-rcg/python-machine-learning/blob/main/notebooks/01-intro-classification.ipynb#scrollTo=Gd80ekh_zQu4).