<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-UofA/blob/main/Week-3-Classification-models/3.1-Classification-intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification

This week, we will cover machine learning for classification. First, we will cover some of the basic concepts. Then we will look at some hands-on examples using Scikit-Learn.

### What is classification?

In a classification task in machine learning, we want to determine the **class or group** a particular **sample** belongs to based on **one or more of its features**. These groups are sometimes referred to as **labels**. Our model aims to accurately predict the correct label for each sample.

We carry out this task through supervised learning, which means that we use a **dataset** where the **correct labels are already known** to **train** our model. For instance, we might have a dataset where samples are labeled either **"healthy"** or **"disease"**. In scenarios like this with just two possible labels, we are dealing with a binary classification problem. It is standard practice to represent one class as **positive (or label 1)** and the other as **negative (or label 0)**.

When there are more than two potential labels for our samples, we are facing a **multiclass or multilabel classification** problem. For instance, samples could be categorized as **"healthy"**, **"minor disease"**, or **"serious disease"**.

**In multiclass problems, each sample is assigned to one and only one label, while in multilabel problems, a sample can be associated with multiple labels.**

### Prediction of labeling and prediction of confidence

At the very least, our model should **indicate the predicted class for each sample**. Additionally, it can be beneficial for the model to offer a **level of confidence** in its predictions or to give the **predicted probabilities for each class**.

### Confusion matrix

Alright, it's time to get into the world of stats and technical terms! We're going to explore something called a 'confusion matrix'. This is a handy tool that allows us to **compare our model's predictions with the actual results in our test data**.

The matrix itself can be confusing for people who haven't seen it before. Maybe that's **why it is called a confusion matrix**. It's a good idea to extract some more understandable numbers from the matrix, which is why we have **precision, accuracy, recall, and F1-score**.

**First**, we need to **select** one class to be the **"positive" class** and the other class to be the **"negative" class**. Usually we will use label 0 as the negative class and label 1 as the positive class.

A confusion matrix is a bit like a report card for our predictions. It gives us the numbers for when we got things right and when we got them wrong, in a handy little table:

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Confusion-Matrix-2.png" width = "350" style="float: left;">

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/confusion-matrix.png' width=500px align="center">

**True positive (TP)**: The actual and predicted labels are both positive.

**True negative (TN)**: The actual and predicted labels are both negative.

**False positive (FP)**: The actual label is negative but the predicted label is positive.

**False negative (FN)**: The actual label is positive but the predicted label is negative.

**True negatives** and **true positives** are when our predictions match reality.

**True negatives are when we said a person wouldn't survive, and they didn't.**

True positives are when we said a person would survive, and they did.

**False negatives and false positives are where we slipped up**. False negatives are when we said a person wouldn't survive, but they ended up surviving. Ooops!

False positives are when we said a person would survive, but they didn't. Sorry!

**Confusion matrices** can also be used for multiclass classification, not just binary classification.
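As a minimal sketch of how the four counts are obtained, the snippet below tallies them for a small set of made-up labels. The `y_true` and `y_pred` lists and the `confusion_counts` helper are illustrative names, not part of the course data.

```python
# Hypothetical example labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # labels predicted by some model


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # both positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # both negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, actually positive
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=5  FP=1  FN=1
```

Scikit-Learn offers the same tally as `sklearn.metrics.confusion_matrix(y_true, y_pred)`, which for 0/1 labels returns the 2x2 array `[[TN, FP], [FN, TP]]`.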

### Classification models metrics

There are several scores (or metrics) that are useful for evaluating the performance of a classification model, and for comparing different classification models.

#### 1. Accuracy

<font color=blue>*Accuracy: correct prediction/total number of predictions.*</font>

That’s the sum of items in the diagonal divided by the total number of items!

**accuracy** = $\frac{(TP + TN)}{TP + TN + FP + FN}$

The simplest score is accuracy. **We can calculate accuracy as the percentage of times we got it right.** It's the number of true negatives and true positives divided by the total number of predictions.

Consider a team of medical researchers trying to predict whether or not a patient is at risk of heart failure. Just like in our previous examples, **accuracy is the proportion of correct predictions**.

**Issue with accuracy**

If heart failures are **rare**, a model that always predicts "**no heart failure**" might have a **high accuracy**, but it's not really helpful in a medical context. So, it's important to consider more than just accuracy when evaluating such predictions!

**So, there is an issue with accuracy.** Accuracy might not always give us the **full picture**. Let's say we want to predict a **rare disease** that only affects 1 in every 1,000 people. If we create a model that always predicts that no one has the disease, it will be 99.9% accurate, but it's not very useful, right?
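We can check the rare-disease arithmetic with a few lines of Python. This is only a sketch on a made-up population of 1,000 people with a single disease case; the "model" here is simply a list of all-zero predictions.

```python
# Hypothetical population: 1 person in 1,000 has the disease (label 1)
n_people = 1000
y_true = [1] + [0] * (n_people - 1)

# A "model" that always predicts "no disease" (label 0)
y_pred = [0] * n_people

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_people

# How many of the actual disease cases did the model find?
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.1%}")        # accuracy = 99.9%
print(f"disease cases found = {found}")    # disease cases found = 0
```

The accuracy looks excellent, yet the one patient who needed a diagnosis was missed.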

In the machine learning world, **recall and precision** are more commonly used:


#### 2. Precision

<font color=blue>*Precision: correct positive predictions/total number of positive predictions*</font>

Precision focuses on the positive class. (Its counterpart for the negative class is the negative predictive value.)

**precision** = $\frac{TP}{TP + FP}$

Precision is like asking,

"*Out of all the patients we predicted would experience heart failure, how many actually did?*"

**High precision** is key when we really want to avoid **false positives**.

- For example, in court trials, we want to be sure that if we declare someone guilty, they really are.

- Another example is email **spam filters**. We don't want to accidentally label important emails as spam.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-UofA/main/Week-3-Classification-models/imgs/Precision_recall_Representation_1052507280.png" width = "350" style="float: left;">




#### 4. Specificity

<font color=blue>Specificity: True Negative predictions / (True Negative predictions + False Positive predictions)</font>

**specificity** = **true negative rate** = $\frac{TN}{TN + FP}$

Specificity is a metric used in the context of binary classification in machine learning, which quantifies the ability of the classification model to correctly identify the negative instances from all the actual negative instances available. It's a very important metric, especially in medical diagnosis where we want to be very sure not to falsely identify a condition as being present (positive).

We want our test to have a high specificity, which means it should correctly identify as many healthy individuals as possible without labeling them as diseased.


For example, a diagnostic test with a specificity of 99% correctly identifies 99% of the individuals who do not have the disease, sparing them unnecessary distress and further testing. Such a test has a very low rate of false alarms, which is very important to prevent overdiagnosis and overtreatment.

#### 3. Recall (sensitivity)

<font color=blue>*Recall: correct positive prediction/total number of positive class (in original data)*</font>

**recall** = **true positive rate** = $\frac{TP}{TP + FN}$

The recall of the positive class is the same as sensitivity. The recall of the negative class is the same as specificity.

Recall, on the other hand, is like asking,

"*Out of all the patients who actually experienced heart failure, how many did we correctly predict?*"

**High recall** is critical when we want to avoid **false negatives**.

- For example, in cancer screenings, a false negative could mean a patient who actually has cancer gets a clean bill of health.

Think about it, what do you think is more important for recommendation engines like YouTube, Netflix, or Spotify? Would it be precision or recall?

#### 5. $F_1$ score

Furthermore, the **$F_1$ score is a combination of recall and precision**, and is often used as a single score to evaluate models.

$F_1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

Imagine a circle that represents all the patients we predicted would experience heart failure. Now, the left part of that circle represents the patients who actually did experience heart failure.
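To tie the formulas together, here is a small sketch that computes all five scores from the four confusion-matrix counts. The helper name `classification_metrics` and the counts passed to it are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the scores defined above from confusion-matrix counts.

    Assumes at least one positive prediction and one actual positive,
    otherwise precision or recall would divide by zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice, Scikit-Learn's `accuracy_score`, `precision_score`, `recall_score` and `f1_score` (all in `sklearn.metrics`) compute these directly from the label arrays.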

#### Analogy to explain precision and recall

[Here](https://towardsdatascience.com/precision-and-recall-88a3776c8007) is a great analogy to explain Precision and Recall, and I tried to summarize it below:

**Fishing with a net**

Think of it like fishing with a net. If you cast a **wide net** into a lake and **catch 80 out of 100 fish**, that's 80% **recall**. However, you also end up with 80 rocks in your net, which means your **precision is 50%** since half of the net's contents are unwanted junk.

On the other hand, you could use a **smaller net** and focus on a **specific area of the lake** where there are lots of fish and no rocks. In this case, you might only catch 20 of the 100 fish, but you'll have zero rocks. This results in **20% recall and 100% precision**.
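The numbers in the fishing analogy can be reproduced in a couple of lines. In this sketch the fish are the true positives, the rocks are the false positives, and the fish left in the lake are the false negatives; the function and argument names are just for illustration.

```python
def precision_recall(fish_caught, rocks_caught, fish_in_lake):
    """Precision and recall for one cast of the net."""
    precision = fish_caught / (fish_caught + rocks_caught)  # share of the catch that is fish
    recall = fish_caught / fish_in_lake                     # share of all fish that were caught
    return precision, recall


wide_net = precision_recall(fish_caught=80, rocks_caught=80, fish_in_lake=100)
small_net = precision_recall(fish_caught=20, rocks_caught=0, fish_in_lake=100)

print(wide_net)   # (0.5, 0.8)
print(small_net)  # (1.0, 0.2)
```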

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/precision_recall.gif' width=400px align="right">

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/recall.gif' width=400px align="left">

### Receiver-operator-characteristic (ROC) curves

ROC curves are a handy tool to assess the **performance** of binary classification models, helping us visualize how well our model is doing.

Before focusing on ROC curves, ensure your model can provide **confidence scores** or allows for tuning of the **decision threshold**.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/roc-curve.png" width = "400" style="float: right;">

In an ROC curve, we plot the true positive rate (sensitivity) against the false positive rate (1-specificity) to see how they trade off.

Subsequently, we determine the area under the curve (AUC), which serves as a valuable metric to compare different models.

With this metric, models with a **larger AUC** are **ranked higher** than models with a **lower AUC**. **This assumes we are interested in models with a balance between sensitivity and specificity.**

Generally, a model with a higher AUC is favored as it indicates a good balance between sensitivity and specificity, showcasing a model’s ability to maintain a healthy rate of true positives while minimizing false positives. Remember, we are aiming for a larger AUC to ensure the best blend of sensitivity and specificity!
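Here is a minimal sketch of how the points of an ROC curve and the AUC can be obtained with Scikit-Learn's `roc_curve` and `roc_auc_score`. The four labels and confidence scores are made up; in a real notebook the scores would come from a trained model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model confidence scores for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each decision threshold on the score gives one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

An AUC of 0.75 here means that in 3 of the 4 possible (positive, negative) pairs the positive sample received the higher score.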

## Classification Models

Classification tasks can be approached using a variety of models, each suitable for different kinds of data. The ideal approach to selecting a model is by evaluating its performance using your specific dataset or drawing from past experiences. In the subsequent notebooks, we will delve into examples utilizing various models to give you a comprehensive understanding.

### Linear Binary Classification

Linear binary classification is a type of classification algorithm that is used to separate data into two classes (hence "binary") based on a linear combination of the features of the data (hence "linear"). In other words, it is a method that uses a linear equation to categorize data into one of two groups.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/decision_boundry2.png" width = "500">


#### Decision function

In the context of linear binary classification, the linear decision function is a mathematical equation that helps in distinguishing between two classes based on a set of features or inputs. The linear decision function is represented using a linear combination of the input features and a set of weights, alongside a bias term.

In other words, the decision function is a linear function that takes a vector of input features and returns a scalar value that is used to decide the class of the input. It is represented mathematically as follows:

$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$

or

$$
f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b
$$

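We have the feature vector, $\mathbf{x}$, and the weight vector, $\mathbf{w}$, defined as:

$$
\mathbf{x} = (x_1, x_2, \ldots, x_n)^T
$$

$$
\mathbf{w} = (w_1, w_2, \ldots, w_n)^T
$$

where

- $f(\mathbf{x})$ is the decision function that we want to find.
- $w_1, w_2, \ldots, w_n$ are the weights assigned to the respective features.
- $x_1, x_2, \ldots, x_n$ represent the feature values.
- $b$ is the bias term, helping to shift the decision boundary away from the origin.

Equivalently, the bias can be absorbed into the weight vector by prepending a constant 1 to the features, $\mathbf{x} = (1, x_1, x_2, \ldots, x_n)^T$ and $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_n)^T$ with $w_0 = b$, so that $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$.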

#### Classifying the data points

The linear decision function plays a pivotal role in **classifying the data points**.

Depending on the sign of $f(\mathbf{x})$, the data point $\mathbf{x}$ is classified into one of the two classes. Typically:
   - If $f(\mathbf{x}) > 0$, we predict the label as 1.
   - If $f(\mathbf{x}) \leq 0$, we predict the label as 0.
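As a quick illustration of the sign rule, the sketch below evaluates $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ for three samples with NumPy. The weights, bias and samples are arbitrary values picked for the example, not fitted to any data.

```python
import numpy as np

# Hypothetical weights and bias for a model with two features
w = np.array([2.0, -1.0])
b = -0.5

# Three example samples, one per row
X = np.array([
    [1.0, 0.0],   # f = 2.0 - 0.5  =  1.5
    [0.0, 1.0],   # f = -1.0 - 0.5 = -1.5
    [0.25, 0.0],  # f = 0.5 - 0.5  =  0.0 (exactly on the boundary)
])

f = X @ w + b                  # decision function value for every sample
labels = (f > 0).astype(int)   # label 1 if f(x) > 0, otherwise label 0

print(labels)  # [1 0 0]
```

Note that the third sample lies exactly on the decision boundary and, following the rule above, is assigned label 0.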


#### Decision Boundary

The decision boundary is a hypersurface that separates the feature space into regions, each corresponding to a different class label. For a two-dimensional feature space, this boundary is a line, while in a three-dimensional space, it is a plane, and so on. It is established based on the decision function, typically where the decision function equals zero:

$$ f(\mathbf{x}) = 0 $$

As we said, it is a hypersurface that partitions the feature space into regions corresponding to the different classes.

Once a decision boundary is established using the training data, it can be used to classify new, unseen data points into one of the classes based on which side of the boundary they fall on.



#### **Finding the Optimal Decision Function**

In the learning phase, the goal is to find the optimal weights and bias term that minimize the errors in classification. This is often done using algorithms like the Perceptron Learning Algorithm, which iteratively updates the weights based on the misclassified samples in each iteration to find a decision function that best separates the two classes.



#### **Perceptron Criterion**

When working with a perceptron, the objective is to minimize the perceptron criterion, which penalizes misclassified samples proportionally to their distance from the decision boundary. This can be formulated mathematically as:

$$
\hat{w} = \arg\min_w \sum_{\xi_i \in M} |\mathbf{w}^T \xi_i|
$$

Where:
- $\hat{w}$: is the weight vector that minimizes the criterion.
- $\mathbf{w}$: is the weight vector.
- $\xi_i$: represents individual samples in the dataset (feature vectors with a leading 1, so that the bias is included in $\mathbf{w}$).
- $M$: is the set of all misclassified samples.
- $\mathbf{w}^T \xi_i$: is the dot product of the weight vector and the sample vector; its absolute value is proportional to the distance of the sample from the decision boundary.

This formula essentially sums the absolute distances of all misclassified points from the decision boundary, aiming to find the weight vector that minimizes this sum, thus correctly classifying as many points as possible.

Extra Materials (not mandatory)

#### **Perceptron Learning Algorithm**

The perceptron learning algorithm is a supervised learning method used primarily for binary classification problems — that is, categorizing a given input into one of two possible classes. The algorithm operates through an iterative process, continuously updating the weights associated with the inputs in order to find the optimal decision boundary that separates the classes. Here, I detail each step of the process along with the pertinent mathematical formulas:

#### **1. Initialization**

We initialize the weight vector ($\mathbf{w}$) and the bias term ($b$) with zeros or small random numbers.

$$
\mathbf{w} = [0, 0, \ldots, 0] \quad \text{(or small random values)}
$$
$$
b = 0 \quad \text{(or a small random value)}
$$

#### **2. Activation Function**

The activation function, typically a step function, determines the output label based on the input features and the current weights:

$$
f(\mathbf{x}) = \begin{cases}
  1 & \text{if } \mathbf{w}^T \mathbf{x} + b > 0 \\
  0 & \text{if } \mathbf{w}^T \mathbf{x} + b \leq 0
\end{cases}
$$

#### **3. Weight Update Rule**

For each misclassified point, we update the weights and bias using the following rules:

- **Positive mistake (false negative)**: If an instance belonging to class 1 ($y_i = 1$) is misclassified (the current weights classify it as 0), we update the weights and bias as:

$$
\mathbf{w} = \mathbf{w} + \eta \mathbf{x}_i
$$
$$
b = b + \eta
$$

- **Negative mistake (false positive)**: Conversely, if an instance belonging to class 0 ($y_i = 0$) is misclassified (the current weights classify it as 1), we update the weights and bias as:

$$
\mathbf{w} = \mathbf{w} - \eta \mathbf{x}_i
$$
$$
b = b - \eta
$$

Both cases can be written as a single rule, $\mathbf{w} = \mathbf{w} + \eta (y_i - \hat{y}_i) \mathbf{x}_i$ and $b = b + \eta (y_i - \hat{y}_i)$.

Here,
- $y_i$ is the true label (0 or 1) of the $i^{th}$ instance.
- $\hat{y}_i$ is the label predicted for the $i^{th}$ instance with the current weights.
- $\mathbf{x}_i$ is the $i^{th}$ instance.

#### **4. Learning Rate**

The learning rate ($\eta$) is a hyperparameter which controls the step size during the weight updates:

$$
\eta \in (0, 1]
$$

#### **5. Iterative Process**

We repeat the weight update rule for a fixed number of iterations or until convergence, i.e., when no more mistakes are made, or the mistakes are below a certain threshold:

$$
\text{Repeat until convergence or for a fixed number of epochs}
$$

#### **6. Outcome**

At the end of the learning process, the final weights and bias are used to define the decision function that will classify new data points according to:

$$
f(\mathbf{x}) = \begin{cases}
  1 & \text{if } \mathbf{w}^T \mathbf{x} + b > 0 \\
  0 & \text{if } \mathbf{w}^T \mathbf{x} + b \leq 0
\end{cases}
$$

### **Summary**

By following these steps, the perceptron learning algorithm iteratively finds a hyperplane that separates the two classes in the feature space, provided that the two classes are linearly separable. This involves a process of learning from mistakes, iteratively updating the weights to reduce the classification errors, eventually converging to a solution if one exists. It's a foundational algorithm in machine learning, giving rise to more complex models like neural networks in deep learning.
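The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not a reference one: it uses the combined update $\mathbf{w} \leftarrow \mathbf{w} + \eta (y_i - \hat{y}_i) \mathbf{x}_i$, which adds $\eta \mathbf{x}_i$ for a false negative and subtracts it for a false positive. The function name `train_perceptron`, the default `eta` and `max_epochs`, and the toy dataset are all assumptions made for the example.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning for labels coded as 0/1."""
    w = np.zeros(X.shape[1])  # step 1: initialise the weights ...
    b = 0.0                   # ... and the bias with zeros
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i + b > 0 else 0  # step 2: step activation
            if y_hat != y_i:
                # step 3: (y_i - y_hat) is +1 for a false negative
                # and -1 for a false positive
                w += eta * (y_i - y_hat) * x_i
                b += eta * (y_i - y_hat)
                mistakes += 1
        if mistakes == 0:  # step 5: stop after an epoch without mistakes
            break
    return w, b


# Hypothetical linearly separable data: the logical AND of two binary features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = (X @ w + b > 0).astype(int)  # step 6: classify with the final weights
print(predictions)  # [0 0 0 1]
```

Because the AND data are linearly separable, the loop stops after a handful of epochs. On data that are not separable it would simply run until `max_epochs`.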

### Logistic Regression Classifier



<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/LogisticRegression2.png">

The logistic regression model lets us predict class probabilities rather than just labels. It converts the output of the decision function $h$ into the probability of the positive class using the sigmoid function. This function squashes the output of the decision function into the range $(0, 1)$.

The probability of label 1 given the feature $x$ is therefore the sigmoid of $h(x)$, plotted here using the red solid line. The probability of label 0 for the same feature is 1 minus the probability of label 1. It is plotted using the blue dotted line.



### Logistic regression
The next model we'll look at is the logistic regression classifier. Note that, despite the name, this is a classification model, not a regression model!

An advantage of the logistic regression classifier is that the output of the model is a probability.

This is done using the sigmoid function which maps any real-valued number to the range (0, 1), making it suitable to represent a probability score.

Here is how it works in detail:

#### **1. Decision Function**

The decision function in logistic regression is given by the linear combination of input features ($\mathbf{x}$) and their respective weights ($\mathbf{w}$):

$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$

#### **2. Sigmoid Function**

The output of the decision function is then transformed using the sigmoid function to yield a probability value. The sigmoid function ($\sigma$) is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

So, applying this to our decision function, we get:

$$
P(y=1|\mathbf{x}) = \sigma(f(\mathbf{x})) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}
$$

#### **3. Probabilities of Each Class**

Using the sigmoid function, we can determine the probability that a given input ($\mathbf{x}$) belongs to class 1 ($y = 1$) or class 0 ($y = 0$):

- **Probability of $y = 1$ given $\mathbf{x}$**
  
  We already derived this above:

  $$
  P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}
  $$

- **Probability of $y = 0$ given $\mathbf{x}$**

  This can be computed as one minus the probability of $y = 1$:

  $$
  P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x}) = 1 - \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}
  $$
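A short NumPy sketch of the sigmoid and the two class probabilities follows. The weights, bias and input are arbitrary numbers chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters and a single input with two features
w = np.array([1.5, -0.5])
b = 0.25
x = np.array([1.0, 2.0])

z = w @ x + b        # decision function: 1.5 - 1.0 + 0.25 = 0.75
p1 = sigmoid(z)      # P(y=1 | x)
p0 = 1.0 - p1        # P(y=0 | x)

print(f"P(y=1|x) = {p1:.3f}, P(y=0|x) = {p0:.3f}")
print(sigmoid(0.0))  # 0.5
```

A sample exactly on the decision boundary ($f(\mathbf{x}) = 0$) gets probability 0.5 for both classes, and the further $f(\mathbf{x})$ moves from zero, the closer the probability gets to 0 or 1.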

#### **4. Logistic Regression Training**

During the training process of logistic regression, the goal is to find the best parameters $\mathbf{w}$ and $b$ that minimize the loss function, commonly the log-loss defined as:

$$
\text{Log-Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
$$

where:
- $N$ is the number of training examples
- $y_i$ is the actual label of the $i^{th}$ training example
- $\hat{y}_i$ is the predicted probability of the $i^{th}$ training example being in class 1.
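The log-loss formula translates almost directly into NumPy. The sketch below is illustrative: the function name, the made-up labels and probabilities, and the small `eps` used to keep probabilities away from exactly 0 or 1 (so that `log(0)` never occurs) are assumptions of this example.

```python
import numpy as np


def log_loss(y, y_hat, eps=1e-12):
    """Average log-loss for 0/1 labels y and predicted P(y=1) values y_hat."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid log(0);
    # eps is a practical safeguard, not part of the formula
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


y = [1, 0, 1, 0]                               # hypothetical true labels
good = log_loss(y, [0.9, 0.1, 0.8, 0.2])       # probabilities close to the labels
bad = log_loss(y, [0.2, 0.8, 0.3, 0.9])        # probabilities far from the labels

print(f"good predictions: {good:.3f}")
print(f"bad predictions:  {bad:.3f}")
```

The loss is small when the predicted probabilities agree with the labels and grows as they move apart, which is what training tries to exploit.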

#### **5. Making Predictions**

After training, we can make predictions on new data points using the following decision rule:

$$
\hat{y} = \begin{cases}
  1 & \text{if } P(y=1|\mathbf{x}) \geq 0.5 \\
  0 & \text{if } P(y=1|\mathbf{x}) < 0.5
\end{cases}
$$

### **Summary**

Logistic regression uses the sigmoid function to squash the output of the linear decision function, producing a probability score between 0 and 1. This score represents the likelihood of a given input belonging to class 1. The logistic regression model is trained to find the optimal weights and bias that minimize the log-loss over the training data. Once the optimal parameters are found, the model can make predictions on new inputs, assigning them to one of the two classes based on whether the computed probability is greater or less than a threshold, commonly set at 0.5.
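In Scikit-Learn, these pieces are wrapped up in `LogisticRegression`. The sketch below fits it to a tiny made-up one-feature dataset and shows both the predicted probabilities (`predict_proba`) and the thresholded labels (`predict`). Note that Scikit-Learn applies L2 regularization by default, so the fitted weights are not the plain minimum of the log-loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset with a single feature; label 1 for the larger values
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

proba = model.predict_proba(X)  # columns: P(y=0|x), P(y=1|x)
labels = model.predict(X)       # class with the higher probability

for x_i, p_i, label in zip(X[:, 0], proba[:, 1], labels):
    print(f"x={x_i:.0f}  P(y=1|x)={p_i:.2f}  predicted label={label}")
```

To use a decision threshold other than 0.5, we can compare `proba[:, 1]` against the threshold of our choice instead of calling `predict`.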

### Cross-entropy loss function

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/CrossEntropy2.png">

So how do we fit the logistic regression model? We minimise the cross-entropy loss. Let's consider a single sample with index $i$ and see what the cross-entropy loss means for this sample. The probability $p_i$ for this sample is the predicted probability of class 1.

If the label for this sample is 1, the penalty is equal to $-\log(p_i)$. If $p_i$ is 1, the penalty is zero. If $p_i$ is close to zero, the penalty is large. The loss function is therefore forcing the probability towards 1 for samples with label 1.

If the label for this sample is 0, the penalty is equal to $-\log(1 - p_i)$. If $p_i$ is zero, the penalty is zero as well. If $p_i$ is close to 1, the penalty is large. For samples with label 0, the loss function therefore forces the probability towards zero.

We can therefore see that minimisation of the cross-entropy ensures that the probabilities $p_i$ are similar to the labels $y_i$. The solution is found using numerical methods and, because this loss is convex in the model parameters, convergence is guaranteed.



Extra Materials (not mandatory)
### **Cross-Entropy Loss Function (or log-loss)**

#### **Definition**

The cross-entropy loss function, also known as log-loss, is a loss function used in binary and multiclass classification problems. In binary classification, it is defined for a single sample as:

$$
\text{Cross-Entropy Loss} = - [y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
$$

where:
- $y$ is the true label (0 or 1)
- $\hat{y}$ is the predicted probability of the sample being in class 1.

#### **Properties**

1. **Non-negative**: The cross-entropy loss is always non-negative, and is zero if and only if the predicted probability matches the true label exactly.
   
2. **Penalizes Confident Incorrect Predictions**: The loss grows without bound as the predicted probability of the true class approaches zero.

#### **Intuition**

The intuition behind the cross-entropy loss function can be understood from the perspective of information theory. Essentially, it measures how well the predicted probabilities align with the actual classes. It quantifies the "surprise" from observing a different outcome than predicted, with predictions that are further from the true label incurring a larger loss.

#### **Application to Logistic Regression**

In logistic regression, we use the cross-entropy loss function during the training process to find the optimal parameters. This is typically done through a process of gradient descent, where we iteratively update the parameters to minimize the loss function. The overall loss over a dataset of $N$ samples is given by the average loss over all samples:

$$
\text{Average Cross-Entropy Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
$$

#### **Multiclass Classification**

In multiclass classification scenarios, we generalize the cross-entropy loss function to handle more than two classes. The loss for a single sample in this case is given by:

$$
\text{Cross-Entropy Loss} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)
$$

where:
- $C$ is the number of classes
- $y_c$ is 1 if the true class is $c$ and 0 otherwise
- $\hat{y}_c$ is the predicted probability of the sample belonging to class $c$.
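For a single sample, the multiclass formula reduces to minus the log of the probability assigned to the true class, because every other term is multiplied by zero. A small sketch with made-up numbers for three classes:

```python
import numpy as np

# Hypothetical sample whose true class is the second of three classes
y_onehot = np.array([0.0, 1.0, 0.0])   # y_c: 1 for the true class, 0 otherwise
y_hat = np.array([0.2, 0.7, 0.1])      # predicted probabilities, summing to 1

loss = -np.sum(y_onehot * np.log(y_hat))

# Only the true-class term survives the sum
print(f"loss = {loss:.4f}, -log(0.7) = {-np.log(0.7):.4f}")
```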

### **Conclusion**

In conclusion, the cross-entropy loss function is a core component in logistic regression and many other classification algorithms. **It helps in quantifying the difference between the actual and predicted labels**, **guiding the optimization of the model parameters during training** to achieve better accuracy in classification. It is defined slightly differently for binary and multiclass classification problems, but maintains the same underlying principle of penalizing predictions that are further away from the true labels.

### Support Vector Classifier

Next we'll take a look at the support vector classifier (SVC). This is also often called a support vector machine (SVM).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/LinearlySeparableDataset.png" width = "400" style="float: right;">

- First, let's assume we have a linearly separable dataset
- All 3 decision boundaries result in accuracy = 1
- Which boundary is likely to generalise well?
- The red boundary is most likely to generalise well

Linearly separable datasets can be perfectly separated by a linear decision boundary and we can achieve classification accuracy 1. In our example of diagnosis of heart failure, this is the case for healthy patients and patients with severe heart failure.

There are many decision boundaries with accuracy 1 for separable datasets. So how do we choose the one that is most likely to generalise well?

The red boundary seems to be the best because it is far from the samples, unlike the other two.

**Large margin classifier (hard margin)**

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/HardMargin.png" width = "400" style="float: right;">

With a large margin classifier, the decision boundary is
- as far as possible from the samples
- determined by samples on the margins - **support vectors**

Support vector classifier is a large margin classifier, which means that it searches for a decision boundary that is as far as possible from the samples.

The decision boundary is determined by samples that lie on the margins and are called support vectors, here denoted by pink circles.

**Large margin classifier (soft margin)**

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/SoftMargin.png" width = "400" style="float: right;">

With a soft margin classifier, the decision boundary
- minimises margin violations
- is determined by samples on or inside the margins or on the wrong side of the decision boundary - **support vectors**

Large margin classifiers can be generalised to non-separable datasets by minimising the margin violations. The decision boundary is again determined by support vectors, which lie on or inside the margin or on the wrong side of the decision boundary.
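Here is a minimal sketch of a linear support vector classifier with Scikit-Learn's `SVC`. The six 2D points are made up and linearly separable. The parameter `C` controls how strongly margin violations are penalised: a large value (as assumed here) behaves close to a hard margin, while smaller values give a softer margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D dataset
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Large C: margin violations are heavily penalised (close to a hard margin)
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

print("support vectors:")
print(clf.support_vectors_)           # the samples that determine the boundary
print("training accuracy:", clf.score(X, y))
```

Only the samples listed in `support_vectors_` influence the position of the boundary; moving any of the other samples slightly would leave it unchanged.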

### Decision Tree Classifier

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Decision-tree.png" width = "400" style="float: right;">

A decision tree is a type of graph made up of:

- Nodes (questions)
- Edges (binary choices)

Together they form a set of tests that are hierarchically organised. Each test is a _weak learner_.

Advantages of decision trees:

- Easy to interpret
- Able to handle both numerical and categorical data
- Able to handle multi-class problems
- Requires little data preparation (e.g. no normalization)
- Can be used for both classification and regression

Imagine it like a game of 20 questions. The model asks questions about the data and makes decisions based on the answers. And the cool thing? It can change or fine-tune its answers as it gets more information!
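To see the "questions" a tree actually asks, we can fit Scikit-Learn's `DecisionTreeClassifier` to a tiny dataset and print the learned rules with `export_text`. The patients, feature names and labels below are invented for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical patients: [age in years, resting heart rate]; label 1 = disease
X = [[30, 60], [35, 85], [45, 65], [50, 90], [60, 70], [65, 95]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Print the learned questions (nodes) and answers (edges) as text
print(export_text(tree, feature_names=["age", "heart_rate"]))

# Classify a new, unseen patient
print(tree.predict([[40, 88]]))  # [1]
```

In this toy dataset the heart rate alone separates the two classes, so the tree needs only a single question.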

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/youdroppedfood.jpg' width=500px align="center">

(Image courtesy: Audrey Fukman and Andy Wright on SFoodie, via Serious Eats)

### Random Forest Classifier

Ensemble of decision trees

- Increases randomisation by restricting each tree node's choice of optimal feature to a random subset of the total feature space
- Further decorrelates predictive models
- Further decreases model variance
- Increases stability against feature noise and thus reduces the chance of overfitting

**If a tree is good, is a forest better?**

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/random_forest.jpeg' width=500px align="right">

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Random_forest_explain.png' width=500px align="right">

Think about it this way - if one tree is handy, wouldn't a whole forest be even better?

Imagine we could create a bunch of decision trees, each one asking different questions. Then, we could combine their answers to make a final prediction. This is exactly what a Random Forest does.

A Random Forest is what we call an 'ensemble model'. It's like a supergroup band made up of lots of individual musicians, all working together to create a harmonious sound. Here, each model contributes to a final, hopefully better, prediction.
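A minimal sketch of such an ensemble with Scikit-Learn's `RandomForestClassifier` is shown below. The data are synthetic (generated with `make_classification`), and the choices of 50 trees and `max_features="sqrt"` (each split considers a random subset of the features) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 50 trees; each split looks at a random subset of sqrt(n_features) features
forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
)
forest.fit(X_train, y_train)

print("number of trees:", len(forest.estimators_))
print("test accuracy:", forest.score(X_test, y_test))
```

Each entry of `forest.estimators_` is an individual decision tree, and the forest's prediction combines the predictions of all of them.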

## Extra techniques

Finally we will cover examples of some additional techniques that are important for classification problems:

1. Encoding classes
2. One-hot encoding
3. Creating training and test sets
4. Handling unbalanced datasets

### One-hot encoding

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/one-hot-encoding.png' width=500px align="right">

Machine learning is a bit like a kid who only likes to play with numbers and not with words. So, when we have categories like **'apple' and 'pear'**, we need to find a way to turn them into numbers.

This is where one-hot encoding comes in handy! It's a cool trick that turns these categories into something machine learning algorithms can work with. It's like giving each category its own 'on' and 'off' switch.

And guess what? There's an easy way to do this if you're using pandas, a tool in Python. It has a function called 'get_dummies' that does all the work for you.

**I thought it would be interesting for you to know**: the term "one-hot" in "one-hot encoding" comes from the way digital circuits are designed. In digital electronics, a one-hot signal is a group of bits among which the legal combinations of values are only those with a single high bit (1) and all the others low (0).
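To make the `get_dummies` function mentioned above concrete, here is a small sketch on a made-up column of fruit names. The DataFrame and its `fruit` column are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "pear"]})

# One new column per category; exactly one of them is "on" in each row
encoded = pd.get_dummies(df["fruit"])
print(encoded)
```

Each row of `encoded` has a single "on" entry in the column of its category, which is exactly the one-hot pattern described above.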

## Example - The Titanic Kaggle challenge: A case study for classification

The Titanic Kaggle Challenge is a competition where you'll use data to predict who could've survived the infamous Titanic disaster.

Classification: "survived" or "not survived"

<img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-3-Classification-models/imgs/Kaggle-Titanic-Project-Getting-Started.png' width=700px align="center">

Let's explore machine learning with a fun example from Kaggle, a competition site owned by Google. These contests can be for giggles, cash prizes, or even a job offer sometimes! The Titanic Kaggle Challenge is known as one of the classic examples for learning classification in a hands-on way.

One beginner's challenge is based on the Titanic disaster, a famous shipwreck that happened on April 15, 1912. The ship, thought to be unsinkable, hit an iceberg and sank, sadly causing 1502 out of 2224 people onboard to lose their lives because there weren't enough lifeboats.

But it's not just about guessing who made it. It's about deeply exploring the data, finding patterns, and understanding how different factors might have affected survival rates. It poses questions like 'Did socioeconomic status influence survival rates?' and 'What was the impact of the "women and children first" policy? Was it strictly followed?'

Here's the interesting part: it seems that some people were more likely to survive than others. The challenge asks us to figure out who these folks were, using data like their names, ages, genders, and social classes.

We get a file with details about 891 passengers, including whether they survived or not. We'll use this data to teach our machine to make smart guesses.

But the real test comes with another file, this one has information on 418 passengers, but doesn't tell us if they survived. That's where our machine's predictions come in!

Suggested tutorial about Kaggle's Titanic challenge: [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)

You can follow Chris White's tutorial on this subject through this Jupyter notebook: [01-intro-classification.ipynb](https://colab.research.google.com/github/ualberta-rcg/python-machine-learning/blob/main/notebooks/01-intro-classification.ipynb#scrollTo=Gd80ekh_zQu4).