# Missing Values

> Note:  
    - **MCAR** = Missing Completely At Random  
    - **MAR**  = Missing At Random  
    - **MNAR** = Missing Not At Random  

### Main Take-Aways

- **Missing Values**
    - are very common in real-world datasets
    - can cause severe problems when not treated properly
    - are very often not treated properly
- It is hard to decide whether data is MCAR, MAR, or MNAR => domain knowledge is needed.
- When **deleting** missing values
    - Only safe if MCAR
    - Under MAR and MNAR, deletion will introduce **bias** in the model
    - In every case, deletion reduces the overall power of the model, because less data is left
- Imputation, esp. multiple imputation, is the best approach if MAR or MCAR
- If MNAR has to be assumed, there is no good solution.
    - If missingness depends on a not-collected feature => collect this feature (so that MAR holds) or collect data without missing values.
    - If missingness depends on the missing values themselves => collect data without missing values.

---

## Possible Questions:

**What are the problems of missing data?**  
Missing data can cause severe problems _when not treated properly_, such as bias in the models.

**What are the different _types_ of missing data and how do they need to be treated?**  
<div style="display: flex; justify-content: center;"><img src="img/Type_of_Missingness.jpg" alt="Type_of_Missingness.png" width="75%" height2="614"></div>
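
As a rough illustration of the three types, the sketch below simulates each missingness mechanism on made-up data. The `age` and `income` variables, the 30% / 50% / 10% missing rates, and the thresholds are all assumptions chosen for the example, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: age is always observed, income may go missing.
age = rng.uniform(20, 70, size=n)
income = 1_000 + 40 * age + rng.normal(0, 300, size=n)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on an observed feature (age).
mar_mask = rng.random(n) < np.where(age > 50, 0.5, 0.1)

# MNAR: missingness depends on the income value itself.
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

# Compare the missing rate between older and younger respondents.
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    older = mask[age > 50].mean()
    younger = mask[age <= 50].mean()
    print(f"{name}: missing rate age>50 = {older:.2f}, age<=50 = {younger:.2f}")
```

In this simulation the MNAR mask also varies with age, simply because income and age are correlated. Observed data alone therefore cannot separate MAR from MNAR here, which is one reason domain knowledge is needed.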

**What is imputation?**  
Imputation is a way of filling in missing data. A relatively naive approach is to use the column mean for all NaNs.
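
A minimal sketch of this naive approach with pandas, assuming a made-up `height_cm` column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"height_cm": [170.0, np.nan, 180.0, 160.0, np.nan]})

# Mean of the observed values; pandas skips NaN by default.
column_mean = df["height_cm"].mean()

# Replace every NaN in the column with that mean.
df["height_imputed"] = df["height_cm"].fillna(column_mean)
print(df)
```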

**What is the difference between "MNAR on not-collected features" and "MNAR on the missing value itself"?**  
- MNAR on not-collected features => e.g. missing postal codes (PLZ) that start with the digit 0.
    - Under some circumstances, more features can be collected and imputation can then be done with _educated guesses_.
    - Alternative solution: collect more data
- MNAR on the missing value itself => e.g. overweight people do not state their weight in the survey.
    - Imputation cannot be applied because we do not know what is missing. Using the _mean weight_ would not work here, since the observed weights are not representative of the missing ones.
    - Solution: collect more data

**What is the problem with deleting data?**  
Depending on the type of missingness, deleting data can bias the model more strongly (the MNAR case).
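
The effect can be checked with a small simulation. The sketch below reuses the weight-survey idea on invented numbers (mean 80 kg, a 90 kg threshold, and the missing rates are all assumptions) and compares the mean after deletion under MCAR and under MNAR.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
weight = rng.normal(80, 15, size=n)  # hypothetical true weights in kg

# MCAR: 30% of answers are missing, regardless of the weight.
mcar_missing = rng.random(n) < 0.3

# MNAR: heavier people are more likely to skip the question.
mnar_missing = rng.random(n) < np.where(weight > 90, 0.7, 0.1)

true_mean = weight.mean()
mean_after_mcar_deletion = weight[~mcar_missing].mean()
mean_after_mnar_deletion = weight[~mnar_missing].mean()

print(f"true mean:           {true_mean:.1f}")
print(f"after MCAR deletion: {mean_after_mcar_deletion:.1f}")
print(f"after MNAR deletion: {mean_after_mnar_deletion:.1f}")
```

Under MCAR the remaining rows still look like the full population, only with fewer rows. Under MNAR the heavy respondents are underrepresented, so the mean of the remaining rows is too low.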

---

# How to handle Missing Values
<div style="display: flex; justify-content: center;"><img src="img/Handling_Missing_Data.webp" alt="Type_of_Missingness.png" width="40%" height2="614"></div>

## Deletion
- Row-Wise: Delete all rows with missing values (listwise deletion, a.k.a. complete case analysis)
- Column-Wise: Delete all columns with missing values
- Pair-Wise: Work with the Data available e.g.:
    - to compute the mean of column A, use all non-missing values in column A
    - to compute the mean of column B, use all non-missing values in column B
        - => can lead to _weird_ / unexpected results. Note: Different (number of) rows may be used for each mean!
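
The three deletion variants can be sketched with pandas as follows. The tiny DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A and B each have one missing value, C is complete.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [10.0, np.nan, 30.0, 40.0],
    "C": [5.0, 6.0, 7.0, 8.0],
})

# Row-wise: drop every row that contains at least one NaN (keeps rows 0 and 3).
row_wise = df.dropna(axis=0)

# Column-wise: drop every column that contains at least one NaN (keeps only C).
column_wise = df.dropna(axis=1)

# Pair-wise: each statistic uses whatever values are available for it.
pairwise_means = df.mean()   # NaN is skipped per column
values_used = df.count()     # number of non-missing values per column

print(row_wise, column_wise, pairwise_means, values_used, sep="\n\n")
```

`values_used` makes the caveat visible: the mean of A and the mean of B are each based on three rows, but not on the same three rows.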

## Imputation

### Simple Imputation  
Replace missing values with _some_ other value, most commonly the mean, median, or mode of the column (possibly computed on a subset of the rows). For time-series data, take the previous or next value, or use interpolation. A **common mistake** with a train/test split (or k-fold cross-validation) is computing the imputed value on the whole dataset: it must be computed on the train set **only**. _(This makes train/test and esp. k-fold cross-validation much more complex to handle.)_
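
A minimal sketch of the train-only rule, assuming a single made-up feature `x` and an already existing train/test split:

```python
import numpy as np
import pandas as pd

# Hypothetical split; the test set contains an extreme value.
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"x": [np.nan, 100.0]})

# Wrong: the mean over train and test leaks test information (26.5 here).
leaky_mean = pd.concat([train, test])["x"].mean()

# Right: compute the fill value on the train set only (2.0 here) ...
train_mean = train["x"].mean()

# ... and reuse that same value for both train and test.
train_imputed = train.fillna({"x": train_mean})
test_imputed = test.fillna({"x": train_mean})

print(leaky_mean, train_mean)
print(test_imputed)
```

With k-fold cross-validation the same steps would have to be repeated inside every fold, using only that fold's training part to compute the fill value.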

### Imputation: Missing Value Prediction  
Predict the missing values from the other features. The model can be as simple as a linear regression or as complex as a deep-learning model. Basically, we train on the rows where the value is present and use that model to predict the missing ones.
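
The simplest case, a linear regression on one other feature, could look like the sketch below. The data is simulated (`y = 3 + 2x` plus noise is an assumption for the example) and the values are removed completely at random.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: x is fully observed, y depends linearly on x.
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

# Remove about 30% of the y values.
missing = rng.random(n) < 0.3
y_obs = y.copy()
y_obs[missing] = np.nan

# Fit y = intercept + slope * x on the rows where y is observed.
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)

# Fill each gap with the value the fitted line predicts.
y_imputed = y_obs.copy()
y_imputed[missing] = intercept + slope * x[missing]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Every imputed point lies exactly on the fitted line, so the imputed column shows less spread than the real data would.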

### Multiple Imputation process  
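This method builds on the previous one. Instead of predicting each missing value once, the prediction is repeated several times, each time on a different (smaller or resampled) part of the data, which yields several completed datasets. Their outcomes are then combined into one result. Benefit: less bias, and the spread across the imputations shows how uncertain the imputed values are.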
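
A strongly simplified sketch of the idea, not a full multiple-imputation procedure: each round refits a linear regression on a bootstrap resample of the observed rows, imputes with added residual noise, and estimates the mean of `y`. The estimates are then averaged. The simulated data, the choice of `m = 20` rounds, and the bootstrap step are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical data: x is fully observed, about 30% of y is missing (MCAR).
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=n)
missing = rng.random(n) < 0.3
y_with_gaps = np.where(missing, np.nan, y)
x_obs, y_obs = x[~missing], y_with_gaps[~missing]

m = 20  # number of imputed datasets (assumed)
estimates = []
for _ in range(m):
    # Refit on a bootstrap resample so every round uses different coefficients.
    idx = rng.integers(0, len(x_obs), size=len(x_obs))
    slope, intercept = np.polyfit(x_obs[idx], y_obs[idx], deg=1)
    residual_sd = np.std(y_obs[idx] - (intercept + slope * x_obs[idx]))

    # Impute with the prediction plus random noise, then run the analysis.
    completed = y_with_gaps.copy()
    noise = rng.normal(0, residual_sd, size=missing.sum())
    completed[missing] = intercept + slope * x[missing] + noise
    estimates.append(completed.mean())

# Combine: the pooled estimate is the average over all rounds.
pooled_mean = np.mean(estimates)
between_variance = np.var(estimates, ddof=1)
print(f"pooled mean = {pooled_mean:.2f}, between-imputation variance = {between_variance:.4f}")
```

The between-imputation variance is what a single imputation cannot provide: it indicates how much the result depends on the values that were filled in.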