## <center>Missing data in supervised ML</center>
### <center>Andras Zsom</center>
<center>Lead Data Scientist and Adjunct Lecturer in Data Science</center>
<center>Center for Computation and Visualization</center>
<center>Brown University</center>
<center>Providence, RI, USA</center>

## About me
- Born and raised in Hungary
- Astrophysics PhD at MPIA, Heidelberg, Germany
- Postdoctoral researcher at MIT (still in astrophysics at the time)
- Started at Brown in December 2015 as a Data Scientist
- Promoted to Lead Data Scientist in 2017
- Teaching the course *DATA1030: Hands-on data science* to the DS master students at Brown this semester

## Data Science at Brown
- Center for Computation and Visualization
- Institutional Data group
   - Data-driven decision support and predictive modeling for Brown’s administrative units
   - Academic research on data-intensive projects

## Learning Objectives

By the end of this workshop, you will be able to
- Describe the three main types of missingness patterns
- Evaluate simple approaches for handling missing values (and why they can fail)
- Apply XGBoost to a dataset with missing values
- Apply multivariate imputation
- Apply the reduced-features model (also called the pattern submodel approach)
- Decide which approach is best for your dataset

## <font color='LIGHTGRAY'>Learning Objectives</font>

<font color='LIGHTGRAY'>By the end of this workshop, you will be able to</font>
- **Describe the three main types of missingness patterns**
- <font color='LIGHTGRAY'>Evaluate simple approaches for handling missing values (and why they can fail)</font>
- <font color='LIGHTGRAY'>Apply XGBoost to a dataset with missing values</font>
- <font color='LIGHTGRAY'>Apply multivariate imputation</font>
- <font color='LIGHTGRAY'>Apply the reduced-features model (also called the pattern submodel approach)</font>
- <font color='LIGHTGRAY'>Decide which approach is best for your dataset</font>

## Missing values often occur in datasets
- survey data: not everyone answers all the questions
- medical data: not all tests/treatments/etc are performed on all patients
- sensor can be offline or malfunctioning
- customer data: not every user uses all features of an app

## Missing values are an issue for multiple reasons

#### Conceptual reason
- missing values can introduce biases
    - bias: the samples (the data points) are not representative of the underlying distribution/population
    - any conclusion drawn from a biased dataset is also biased.
    - example: rich people tend to skip survey questions about their salaries, so the mean salary estimated from survey data tends to be lower than the true value
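
The salary example can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: the log-normal salary distribution and the response probability `p_answer` (which falls linearly with salary rank) are hypothetical choices, not estimates from real survey data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: log-normally distributed salaries (arbitrary units).
salary = rng.lognormal(mean=11.0, sigma=0.5, size=100_000)

# Assumption: the probability of answering the salary question falls as salary rises.
# People at the 10th percentile answer ~90% of the time, at the 90th only ~10%.
rank = salary.argsort().argsort() / (salary.size - 1)  # salary rank scaled to [0, 1]
p_answer = 1.0 - rank
answered = rng.random(salary.size) < p_answer

true_mean = salary.mean()
survey_mean = salary[answered].mean()

print(f"true mean salary:   {true_mean:,.0f}")
print(f"survey mean salary: {survey_mean:,.0f}")
```

In this setup roughly half of the population answers, and the survey mean falls below the true mean because high earners are underrepresented among the respondents.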


#### Practical reason
- missing values (NaN, NA) and infinite values (inf) are incompatible with most sklearn estimators
   - all values in an array need to be finite numbers, otherwise sklearn will throw a *ValueError*
- there are a few supervised ML techniques that work with missing values (e.g., XGBoost)
   - we will cover those later this semester during a follow-up lecture on missing data
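
A minimal sketch of the practical problem, assuming a tiny made-up array and `LinearRegression` as a stand-in for a typical sklearn estimator. It counts the missing entries per feature and then shows that fitting fails.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing value in each of the two features.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Count the missing entries per feature before modeling.
n_missing = np.isnan(X).sum(axis=0)
print("missing values per feature:", n_missing)

# Fitting on data with NaNs raises a ValueError during input validation.
try:
    LinearRegression().fit(X, y)
    error_message = None
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message.splitlines()[0])
```

Checking `np.isnan(X).sum(axis=0)` before fitting is a cheap way to find out which features need attention.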

# Missing data patterns

- MCAR - Missing Completely At Random
- MAR - Missing At Random
- MNAR - Missing Not At Random

## MCAR - Missing Completely At Random

- the reason the values are missing is unrelated to the data: it depends neither on the observed features nor on the missing values themselves (at most on some unobserved, unimportant variable)
   - the data sample is still representative of the underlying distribution/population
- this is the best-case scenario, but it is rare in practice
- can be diagnosed with a statistical test ([Little, 1988](https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722))
   - Python implementation [here](https://github.com/RianneSchouten/pymice)

## MCAR examples
- some people randomly fail to fill in some values in a survey 
- sensor randomly malfunctions
- apps, websites are unavailable sometimes
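
The randomly malfunctioning sensor can be sketched as follows. The readings and the 20% dropout rate are hypothetical; the point is only that a mask drawn independently of the data leaves the observed sample representative. This is an illustration, not Little's test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical sensor readings (arbitrary units).
reading = rng.normal(loc=20.0, scale=5.0, size=n)

# MCAR: every reading is dropped with the same probability (20%),
# independent of the reading itself and of any other feature.
is_missing = rng.random(n) < 0.2
observed = np.where(is_missing, np.nan, reading)

full_mean, full_std = reading.mean(), reading.std()
obs_mean, obs_std = np.nanmean(observed), np.nanstd(observed)

print(f"complete data: mean={full_mean:.2f}, std={full_std:.2f}")
print(f"observed data: mean={obs_mean:.2f}, std={obs_std:.2f}")
```

The two lines of output should agree closely, since dropping values completely at random only shrinks the sample.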

## MAR - Missing At Random

- Name is misleading! Better name would be 'Missing Conditionally at Random' but the MCAR acronym is taken. 
- the reason why feature_i contains missing values is conditional on other observed variables, but not on the value of feature_i itself
- difficult to diagnose
- still, most methods assume MAR

## MAR examples
- missing values in blood pressure data are conditional on age
   - older people are more likely to have their blood pressure measured during a regular check-up than younger people
- males are less likely to fill in a depression survey
   - this has nothing to do with their level of depression after accounting for maleness
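
The blood pressure example can be sketched with simulated data. All numbers here are made up for illustration: the age range, the linear dependence of blood pressure on age, and the measurement probabilities (90% for people aged 50 and over, 30% otherwise).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical cohort: blood pressure rises with age (made-up coefficients).
age = rng.integers(20, 80, size=n)
bp = 100 + 0.5 * age + rng.normal(0, 10, size=n)

# MAR: the chance that bp is measured depends only on age, not on bp itself.
p_measured = np.where(age >= 50, 0.9, 0.3)
measured = rng.random(n) < p_measured

df = pd.DataFrame({
    "age_group": np.where(age >= 50, "50+", "<50"),
    "bp_true": bp,
    "bp_observed": np.where(measured, bp, np.nan),
})

# Overall, the observed mean is pulled toward the older, better-measured group.
print(df[["bp_true", "bp_observed"]].mean())

# Within each age group, the observed and true means agree closely.
by_group = df.groupby("age_group")[["bp_true", "bp_observed"]].mean()
print(by_group)
```

Ignoring age gives a biased overall mean, while conditioning on age removes the bias. This is why methods that assume MAR rely on the other observed features to handle the missing values.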

## MNAR - Missing Not At Random

- the reason the variable is missing is related to the value of the variable itself
- most severe case of missingness!
- not many approaches can deal with this correctly

## MNAR examples
- depressed people are less likely to fill out a survey on depression because of their level of depression
- rich people don't fill out survey info on their salaries because they are rich
- a temperature sensor doesn't record a value because the actual temperature is outside of the sensor's range

## Takeaway
- it can be challenging to infer the missingness pattern from an incomplete dataset
   - there is a statistical test to diagnose MCAR
   - to the best of my knowledge, MAR and MNAR are difficult or impossible to diagnose from the data alone
- multiple patterns can be present in the data
   - even worse, multiple patterns can be present in one feature!
   - missing values in a feature can occur due to a mix of MCAR and MAR for example
