# What is class imbalance and why is it important

* Imbalanced data refers to datasets where the target classes occur in very different proportions. In other words, one class is over-represented compared to the other(s).
* Class imbalance arises in classification problems (binary or multi-class).
* It is important to ensure we are focusing on and solving the right problem. If 95% of a dataset belongs to class 0 and 5% to class 1, a model that labels every record as 0 reaches 95% accuracy without identifying a single member of class 1, and we wouldn't be satisfied with that.
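
As a rough illustration of the 95% / 5% example, the sketch below scores a naive classifier that always predicts the majority class. The data is made up for the illustration (100 records, 5 of them positive), and the variable names are not part of any library.

```python
# Hypothetical dataset: 95 records of class 0 and 5 records of class 1.
y_true = [0] * 95 + [1] * 5

# Naive classifier: predict the majority class (0) for every record.
y_pred = [0] * len(y_true)

# Accuracy: share of records whose prediction matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class 1: share of actual positives that the classifier caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks strong, yet the classifier never finds a single record of the minority class, which is why accuracy alone is a poor guide here.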

# Before taking action, understand the problem to choose the right metric

### Summary of metrics

| Metric                  | Formula                                                           | Is is appropriate for imbalanced problems | Suggested use                                                                                                                                                                     |
|-------------------------|-------------------------------------------------------------------|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Accuracy                | $\frac{TP+TN}{TP+TN+FP+FN}$                                       | <ul><li>No</li></ul>                      | <ul><li>Establishing the score of a baseline/naive classifier</li><li>Balanced classification problems</li></ul>                                                                  |
| Precision / PPV         | $\frac{TP}{TP+FP}$                                                | <ul><li>Depends</li></ul>                 | <ul><li>Measures **quality**</li><li>Optimise for precision if false positives are costly, i.e. you want the model's positive predictions to be correct as often as possible</li></ul> |
| Recall / Sensitivity    | $\frac{TP}{TP+FN}$                                                | <ul><li>Depends</li></ul>                 | <ul><li>Measures **coverage**</li><li>Optimise for recall if you want to catch all possible instances of the target</li></ul>                                                     |
| F1 Score                | $2\cdot\frac{precision\cdot recall}{precision+recall}$                     | <ul><li>Yes</li></ul>                     | <ul><li>Combines quality and coverage measures</li><li>Optimise for F1 score when you believe that precision and recall should be equally weighted</li></ul>                      |
| F-beta Score            | $(1+\beta^2)\cdot\frac{precision\cdot recall}{(\beta^2\cdot precision)+recall}$ | <ul><li>Yes</li></ul>                     | <ul><li>Combines quality and coverage measures</li><li>Optimise for F-beta score when you believe that precision and recall should **not** be equally weighted</li></ul>          |
| MCC                     | $\frac{(TP\cdot TN)-(FP\cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$     | <ul><li>Yes</li></ul>                     | <ul><li>Matthews correlation coefficient</li><li>Combines all 4 categories of the confusion matrix (TP, FP, TN, FN)</li></ul>                                                     |
| AUC                     | Area under the ROC curve                                          | <ul><li>No</li></ul>                      | <ul><li>When you care equally about the positive and negative classes</li><li>When you care about ranking predictions, not necessarily well-calibrated probabilities</li></ul>    |
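
The formulas in the table can be checked with a few lines of Python. This is a minimal sketch using only the standard library: the function name `metrics_from_counts`, the zero-denominator convention, and the counts are assumptions made for the illustration.

```python
import math


def metrics_from_counts(tp, fp, tn, fn, beta=1.0):
    """Compute the table's metrics from confusion-matrix counts.

    Any ratio with a zero denominator is returned as 0.0, a convention
    chosen for this sketch.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    b2 = beta ** 2
    f_beta = safe_div((1 + b2) * precision * recall, b2 * precision + recall)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced problem: 50 positives in 1000 records.
counts = dict(tp=30, fp=10, tn=940, fn=20)
f1_metrics = metrics_from_counts(**counts)            # beta=1 gives the F1 score
f2_metrics = metrics_from_counts(**counts, beta=2.0)  # beta=2 weights recall more heavily

for name, value in f1_metrics.items():
    print(f"{name:>9}: {value:.3f}")
print(f"       f2: {f2_metrics['f_beta']:.3f}")
```

With these made-up counts, accuracy is 0.97 while precision is 0.75 and recall is 0.60, so the metrics that focus on the positive class give a more sober picture than accuracy does.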


### Understanding metrics
##### Confusion matrix

##### Accuracy

##### Precision

##### Recall

##### F1 and Fbeta scores

# Method 1: changing the probability thresholds to define each class

### ROC and AUC

# Method 2: under or over sampling

# Method 3: target class weights