# What is class imbalance and why is it important

* Imbalanced data refers to datasets where the target's classes occur in very different proportions. In other words, one class is over-represented compared to the other(s).
* The class imbalance scenario exists for classification problems (binary or multi-class).
* It is important to ensure we are focusing on and solving the right problem. If a dataset has 95% of one class = 0, and 5% of the other class = 1, we wouldn't be satisfied with a 95% accuracy obtained just by classifying all records as 0.

# Before taking action, understand the problem to choose the right metric

### Summary of metrics

| Metric                 | Formula                                                           | Is it appropriate for imbalanced problems | Suggested use |
|------------------------|-------------------------------------------------------------------|-------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $Accuracy$             | $\frac{TP+TN}{TP+TN+FP+FN}$                                       | <ul><li>No</li></ul>                      | <ul><li>Understand baseline/naive classifier metric</li><li>Balanced classification problems</li><li>When both classes are equally important</li></ul>                                                                                                                                                                                         |
| $Precision$            | $\frac{TP}{TP+FP}$                                                | <ul><li>Depends</li></ul>                 | <ul><li>Measures **quality**</li><li>Optimise for precision if false positives are costly, ie, you want as few wrong positive predictions as possible</li></ul> |
| $Specificity$          | $\frac{TN}{TN+FP}$                                                | <ul><li>Depends</li></ul>                 | <ul><li>Same idea as recall, but for the negative class (true negative rate)</li></ul> |
| $Recall / Sensitivity$ | $\frac{TP}{TP+FN}$                                                | <ul><li>Depends</li></ul>                 | <ul><li>Measures **coverage**</li><li>Optimise for recall if you want to catch all possible instances of the target</li></ul>                                                                                                                                                                                                                  |
| $F1 Score$             | $2*\frac{precision*recall}{precision+recall}$                     | <ul><li>Yes</li></ul>                     | <ul><li>Combines quality and coverage measures</li><li>Optimise for F1-score when you believe that precision and recall should be equally weighted</li><li>Use this when the positive class matters more than the negative class (true negatives are ignored)</li></ul> |
| $F-beta Score$         | $(1+\beta^2)*\frac{precision*recall}{(\beta^2*precision)+recall}$ | <ul><li>Yes</li></ul>                     | <ul><li>Same as F1 score, except that $\beta$ sets how much more weight recall gets than precision</li></ul> |
| $MCC$                  | $\frac{(TP*TN)-(FP*FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$     | <ul><li>Yes</li></ul>                     | <ul><li>Combines all 4 categories of the confusion matrix (TP, FP, TN, FN)</li><li>Use this when both the positive and negative class are equally important</li></ul> |
| $AUC$                  | Area under the ROC curve                                          | <ul><li>No</li></ul>                      | <ul><li>When you care equally about positive and negative classes</li><li>When you care about ranking predictions, not necessarily well-calibrated probabilities</li></ul> |
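
As a rough illustration of the formulas in the table, the sketch below computes each count-based metric directly from the four confusion-matrix buckets. The function name `metrics_from_counts` and the example counts are hypothetical, and the sketch assumes no denominator is zero. AUC is left out because it needs predicted scores rather than counts.

```python
import math

def metrics_from_counts(tp, tn, fp, fn, beta=1.0):
    """Compute the count-based metrics from the summary table.

    Assumes every denominator is non-zero (no guard against empty buckets).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "specificity": tn / (tn + fp),
        "recall": recall,
        # With beta = 1 this is the F1 score.
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical imbalanced result: 50 positives and 950 negatives in total.
m = metrics_from_counts(tp=40, tn=900, fp=50, fn=10)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

Note how accuracy stays high on these made-up counts while precision is much lower, which is the gap the table warns about.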


### Understanding metrics
##### Confusion matrix
*Image source: https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html*


![title](img/confusionMatrxiUpdated.jpg)

##### Accuracy

* Accuracy measures the proportion of observations, both positive and negative, that were correctly classified.
* It is **NOT** a good metric to use for imbalanced problems, as it is easy to get a high accuracy score by simply classifying all observations as the majority class.
* It is useful to understand the effects of a naive classifier.
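
A minimal sketch of the naive-classifier effect, assuming a made-up dataset with the 95%/5% split used in the introduction. The labels are synthetic and the "model" simply predicts the majority class for every record.

```python
# Hypothetical dataset: 950 negatives (class 0) and 50 positives (class 1).
y_true = [0] * 950 + [1] * 50

# Naive "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: how many of the 50 positives were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The 95% accuracy hides the fact that not a single positive record was detected.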

##### Precision and Recall

**Precision**
* Measures the quality of the positive class prediction made by the model.
* It answers the question: 'out of our predicted positive class, which ones are truly correct?'
* For example, on a cancer detection exercise, we would be answering: 'out of the 100 patients we predicted to have cancer, how many truly have it?'

**Recall**
* Measures the completeness (or coverage) of the positive class prediction made by the model.
* It answers the question: 'out of all possible true positives, how many has the model detected?'
* For example, on a cancer detection exercise, we would be answering: 'out of 500 patients with cancer, how many have we flagged?'

**Precision and recall are, generally, inversely correlated.**
* When a model wants to be very precise with its predictions, it normally leaves some true positives out (ie, reducing recall).
* When a model wants to capture all possible true positives, it normally does so by also predicting as positive those instances with lower probabilities (ie, reducing precision).
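
The trade-off can be sketched by sweeping the decision threshold over a small set of predicted probabilities. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l == 1 for p, l in zip(predicted, labels))
    fp = sum(p and l == 0 for p, l in zip(predicted, labels))
    fn = sum((not p) and l == 1 for p, l in zip(predicted, labels))
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (0.75, 0.55, 0.25):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
# threshold=0.75 precision=1.000 recall=0.600
# threshold=0.55 precision=0.800 recall=0.800
# threshold=0.25 precision=0.625 recall=1.000
```

Lowering the threshold catches every positive in this toy data, but only by accepting more false positives.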

##### F1 and Fbeta scores
**F-scores**
* As you can see from the formula, F1 is a special case of the Fbeta score (beta = 1), but we have included both for completeness (this way you can decide whether precision and recall should be weighted equally).
* Fbeta is tricky in the sense that it includes another parameter to tune, ie, how much more recall should weigh than precision (how much more coverage should matter than quality). With beta > 1 recall weighs more; with beta < 1 precision weighs more.
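
To get a feel for the extra parameter, the sketch below evaluates the Fbeta formula from the summary table for a hypothetical model with high precision and low recall. The values 0.9 and 0.5 are made up for illustration.

```python
def f_beta(precision, recall, beta):
    """Fbeta score as defined in the summary table."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model: very precise, but it misses half of the positives.
precision, recall = 0.9, 0.5

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta:.1f} f_beta={f_beta(precision, recall, beta):.3f}")
# beta=0.5 f_beta=0.776
# beta=1.0 f_beta=0.643
# beta=2.0 f_beta=0.549
```

The same model scores lower as beta grows, because larger beta values put more weight on its weak recall.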

**There is a problem with F-scores**
* F-scores consider only 1 class, ie, they are interested **only in the positive class** and not the negative class.
* Now, one might argue that this is OK, as what we are trying to do is predict this minority positive class.
* However, leaving the majority class out of scope doesn't seem wise either. We want our model to do as well as possible in both classes, with maybe a bit more skew towards the minority class.

**What is the problem**
* Let's review the formula. If you look at the confusion matrix and the precision and recall formulas, the TN bucket doesn't appear in any of them.
* This means that the TN bucket could hold 0 or an arbitrarily large number, and F-scores wouldn't change at all.
* In other words, the positive and negative class are not symmetric.
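
Substituting the precision and recall formulas from the summary table into the F1 formula makes the missing TN term explicit:

$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} = 2 \cdot \frac{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2\,TP}{2\,TP + FP + FN}$$

Only TP, FP and FN remain, so any value of TN yields the same F1 score.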

**2 examples to illustrate this**
* Example 1: TP = 18, TN = 1, FP = 3, FN = 2
* Example 2: TP = 1, TN = 18, FP = 2, FN = 3 (inverse of example 1)
* F1 score example 1 = 88%
* F1 score example 2 = 29%
* So... just by labelling the majority class as positive or negative, we get wildly different F1-scores!
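
The two examples can be reproduced with a few lines of Python. The helper names `f1_score` and `swap_classes` are hypothetical and written only for this sketch.

```python
def f1_score(tp, tn, fp, fn):
    """F1 from confusion-matrix counts; tn is accepted but never used."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def swap_classes(counts):
    """Relabel positives as negatives and vice versa."""
    return {"tp": counts["tn"], "tn": counts["tp"],
            "fp": counts["fn"], "fn": counts["fp"]}

example_1 = {"tp": 18, "tn": 1, "fp": 3, "fn": 2}
example_2 = swap_classes(example_1)  # tp=1, tn=18, fp=2, fn=3

f1_example_1 = f1_score(**example_1)
f1_example_2 = f1_score(**example_2)

print(f"F1 example 1 = {f1_example_1:.2f}")  # F1 example 1 = 0.88
print(f"F1 example 2 = {f1_example_2:.2f}")  # F1 example 2 = 0.29
```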

##### MCC: Matthews Correlation Coefficient

**Properties**
* Takes into account all buckets in the confusion matrix
* It is a special case of the Pearson Correlation Coefficient applied to a binary classification task (where the 2 random variables are prediction and label)
* Values between -1 and 1.
    * MCC = 0, the classifier is no better than a random flip (no correlation).
    * MCC = 1, predictions match labels (strong positive correlation).
    * MCC = -1, predictions disagree with labels (strong negative correlation).
* Perfectly symmetric for both classes (positives are not more important than negatives)
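
The link to the Pearson correlation can be checked numerically. The sketch below expands a set of confusion-matrix counts into 0/1 label and prediction vectors, computes their Pearson correlation by hand, and compares it with the MCC formula from the summary table. The counts are illustrative and `pearson` is a hypothetical helper.

```python
import math

tp, tn, fp, fn = 18, 1, 3, 2

# Expand the counts into one (label, prediction) pair per observation.
y_true = [1] * tp + [0] * tn + [0] * fp + [1] * fn
y_pred = [1] * tp + [0] * tn + [1] * fp + [0] * fn

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Pearson = {pearson(y_true, y_pred):.4f}")
print(f"MCC     = {mcc:.4f}")
```

Both lines print the same value, up to floating-point rounding.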

**2 examples to illustrate this**
* Example 1: TP = 18, TN = 1, FP = 3, FN = 2
* Example 2: TP = 1, TN = 18, FP = 2, FN = 3 (inverse of example 1)
* MCC example 1 = 0.17
* MCC example 2 = 0.17
* As you can see, given MCC is symmetrical, MCC outputs the same value if you interchange the positive and negative class.
* In addition, we get a sense here that our model is slightly better than random.
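
A short sketch that reproduces both MCC values from the formula in the summary table. The function name `mcc_from_counts` is hypothetical.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

mcc_example_1 = mcc_from_counts(tp=18, tn=1, fp=3, fn=2)
mcc_example_2 = mcc_from_counts(tp=1, tn=18, fp=2, fn=3)  # classes swapped

print(f"MCC example 1 = {mcc_example_1:.2f}")  # MCC example 1 = 0.17
print(f"MCC example 2 = {mcc_example_2:.2f}")  # MCC example 2 = 0.17
```

Swapping the classes exchanges TP with TN and FP with FN, which leaves both the numerator and the product under the square root unchanged.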

# Method 1: changing the probability thresholds to define each class

### ROC and AUC

# Method 2: under or over sampling

# Method 3: target class weights