In [1]:
import numpy as np
import pandas as pd
import os

import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(10, 5))
plt.rc('font', size=12)

# Lecture 11 – Missing Values

## DSC 80, Spring 2022

### Announcements

- Project 2's checkpoint is due **tomorrow at 11:59PM**, and the full project is due on **Saturday, April 30th at 11:59PM**.
- Discussion section 4 (all about visualization 📊) is today, and the assignment is due for extra credit on **Saturday, April 23rd at 11:59PM**.
- Lab 4 is due on **Monday, April 25th at 11:59PM**.
    - No hidden tests, just for this lab! Public notebook tests = Gradescope tests.
    - Check [here](https://campuswire.com/c/G325FA25B/feed/767) for clarifications.
- Grades are released for Lab 2 and Project 1! Look [here](https://campuswire.com/c/G325FA25B/feed/755) for details.
- The Midterm Exam is **in-class on Wednesday, April 27th.** More details to come.
    - It will cover Lectures 1-11, Labs 1-4, Projects 1-2, and Discussions 1-4.
    - Unless you email Suraj beforehand, you **must take the exam during the section you are enrolled in**.
    - A single, two-sided cheat sheet will be allowed.
- 🚨 If 80% of the class fills out the **[Mid-Quarter Survey](https://docs.google.com/forms/d/e/1FAIpQLSd9k90fGqPKDRAnHjFBEx5kak_VtvYN5Fq5uPv9jyqrryaKeA/viewform)**, then everyone will receive an extra point on the Midterm Exam. 🚨

### Agenda

- Speeding up permutation tests.
- Missingness mechanisms.

## Speeding things up 🏃

### Recap: permutation tests

- Permutation tests help decide whether **two samples came from the same distribution**.
- In a permutation test, we simulate data under the null by **shuffling** either group labels or numerical features.
    - In effect, this **randomly assigns individuals to groups**.
- If the two distributions are numeric, we use as our test statistic the difference in group means or medians.
- If the two distributions are categorical, we use as our test statistic the total variation distance (TVD).
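
As a quick reminder of the two test statistics named above, here is a minimal sketch. The function names `difference_in_means` and `tvd` and the toy inputs are illustrative assumptions, not taken from the course code.

```python
import numpy as np

def difference_in_means(values, labels):
    """Mean of `values` where `labels` is True minus the mean where it is False."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return values[labels].mean() - values[~labels].mean()

def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions,
    each given as proportions over the same categories."""
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

# Toy numeric example: group True has mean 105, group False has mean 125.
example_diff = difference_in_means([100, 110, 120, 130], [True, True, False, False])

# Toy categorical example with two categories.
example_tvd = tvd([0.5, 0.5], [0.25, 0.75])
```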



### Speeding up permutation tests

- A permutation test, like all simulation-based hypothesis tests, generates an **approximation** of the distribution of the test statistic.
    - If we found **all** permutations, the distribution would be exact!
    - If there are $a$ elements in one group and $b$ in the other, the total number of permutations is ${a + b \choose a}$.
    - If $a = 100$ and $b = 150$, there are ${250 \choose 100} \approx 6 \cdot 10^{71}$ permutations!

- The more repetitions we use, the better our approximation will be.
- Unfortunately, our code is pretty slow, so we can't use many repetitions.
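
To get a feel for the count above, the binomial coefficient can be computed exactly with Python's standard library. This is only a sanity check on the arithmetic; the variable names are illustrative.

```python
import math

a, b = 100, 150

# Number of distinct ways to choose which a of the a + b individuals
# land in the first group.
n_permutations = math.comb(a + b, a)

# Print in scientific notation to compare against the estimate in the text.
print(f"{n_permutations:.2e}")

# Even 3000 repetitions sample only a vanishing fraction of these.
fraction_sampled = 3000 / n_permutations
```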

### Example: Birth weight and smoking 🚬

In [2]:
baby_fp = os.path.join('data', 'baby.csv')
baby = pd.read_csv(baby_fp)
smoking_and_birthweight = baby[['Maternal Smoker', 'Birth Weight']]

In [None]:
smoking_and_birthweight.head()

Recall our permutation test from last class:
- **Null hypothesis**: In the population, birth weights of smokers and non-smokers have the same distribution. The difference we saw was due to random chance.
- **Alternative hypothesis**: In the population, babies born to smokers have lower birth weights, on average.

### Timing the birth weights example ⏰

We'll use 3000 repetitions instead of 500.

In [None]:
%%time

n_repetitions = 3000
differences = []

for _ in range(n_repetitions):
    
    # Step 1: Shuffle the weights
    shuffled_weights = (
        smoking_and_birthweight['Birth Weight']
        .sample(frac=1)
        .reset_index(drop=True) # Be sure to reset the index! (Why?)
    )
    
    # Step 2: Put them in a DataFrame
    shuffled = (
        smoking_and_birthweight
        .assign(**{'Shuffled Birth Weight': shuffled_weights})
    )
    
    # Step 3: Compute the test statistic
    group_means = (
        shuffled
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled Birth Weight']
    )
    difference = group_means.diff().iloc[-1]
    
    # Step 4: Store the result
    differences.append(difference)
    
differences[:10]

### Minor improvements

**Improvement 1:** Use `np.random.permutation` instead of `df.sample`.

**Why?** This way, we don't need to shuffle the index as well. This is how you ran permutation tests in DSC 10.
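
The index issue can be seen on a tiny example. The sketch below uses a made-up four-row DataFrame (not the `baby` data) to show that pandas aligns on index labels when a Series is assigned as a column, so a shuffled Series that keeps its old index is put right back in its original order.

```python
import pandas as pd

df = pd.DataFrame({'weight': [100, 110, 120, 130]})

# .sample shuffles the rows but keeps each value's original index label.
shuffled = df['weight'].sample(frac=1, random_state=1)

# Assigning a Series aligns on index labels, so the shuffle is silently undone.
realigned = df.assign(shuffled_weight=shuffled)
undone = (realigned['shuffled_weight'] == realigned['weight']).all()

# Dropping the old index makes pandas assign by position instead.
by_position = df.assign(shuffled_weight=shuffled.reset_index(drop=True))
```

A NumPy array has no index at all, so `np.random.permutation(...)` sidesteps the alignment step entirely.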

In [5]:
to_shuffle = smoking_and_birthweight.copy()
weights = to_shuffle['Birth Weight']

In [6]:
%%timeit
np.random.permutation(weights.values)

20.2 µs ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [7]:
%%timeit
weights.sample(frac=1)

84.1 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


**Improvement 2:** Don't use `assign`; instead, add the new column in-place.

**Why?** This way, we don't create a new copy of our DataFrame on each iteration.

In [None]:
%%timeit
to_shuffle['Shuffled Birth Weight'] = np.random.permutation(weights.values)

In [None]:
%%timeit
to_shuffle.assign(**{'Shuffled Birth Weight': np.random.permutation(weights.values)})

Let's try out both of these improvements, again with 3000 repetitions.

In [None]:
%%time

n_repetitions = 3000
faster_differences = []

to_shuffle = smoking_and_birthweight.copy()
weights = to_shuffle['Birth Weight'].values

for _ in range(n_repetitions):
    
    # Step 1: Shuffle the weights
    shuffled_weights = np.random.permutation(weights)
    
    # Step 2: Put them in a DataFrame
    to_shuffle['Shuffled Birth Weight'] = shuffled_weights
    
    # Step 3: Compute the test statistic
    group_means = (
        to_shuffle
        .groupby('Maternal Smoker')
        .mean()
        .loc[:, 'Shuffled Birth Weight']
    )
    difference = group_means.diff().iloc[-1]
    
    # Step 4: Store the result
    faster_differences.append(difference)
    
faster_differences[:10]

In [None]:
bins = np.linspace(-4, 4, 20)
plt.hist(differences, density=True, ec='w', bins=bins, alpha=0.65, label='Original Test Statistics');
plt.hist(faster_differences, density=True, ec='w', bins=bins, alpha=0.65, label='Faster Test Statistics')
plt.legend();

The distribution of test statistics generated by the faster approach resembles the distribution of test statistics generated by the original approach. Both are doing the same thing!

### An _even faster_ approach

- Both of our previous approaches involved calling `groupby` inside of a loop.
- We can avoid `groupby` entirely!
- Let's start by generating a Boolean array of size `(3000, 1174)`.
    - Each row will correspond to a single permutation of the `'Maternal Smoker'` (`bool`) column.

In [3]:
is_smoker = smoking_and_birthweight['Maternal Smoker'].values
weights = smoking_and_birthweight['Birth Weight'].values

In [4]:
is_smoker

array([False, False,  True, ...,  True, False, False])

In [None]:
%%time

np.random.seed(24) # So that we get the same results each time (for lecture)

# We are still using a for-loop!
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T

In `is_smoker_permutations`, each row is a new simulation. 
- `False` means that baby comes from a non-smoking mother.
- `True` means that baby comes from a smoking mother.

In [None]:
is_smoker_permutations

In [None]:
is_smoker_permutations.shape

Note that each row has 459 `True`s and 715 `False`s – it's just the order of them that's different.

In [None]:
is_smoker_permutations.sum(axis=1)

The first row of `is_smoker_permutations` tells us that in this permutation, we'll assign baby 1 to "smoker", baby 2 to "smoker", baby 3 to "non-smoker", and so on.

In [None]:
is_smoker_permutations[0]

### Broadcasting

- If we multiply `is_smoker_permutations` by `weights`, we will create a **new** (3000, 1174) array, in which:
    - the weights of babies assigned to "smoker" are present, and
    - the weights of babies assigned to "non-smoker" are 0.
- `is_smoker_permutations` is a "mask".
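
Here is the same idea on a toy example small enough to check by hand. The arrays `toy_weights` and `toy_masks` are made up for illustration; only the shapes mirror the lecture's arrays.

```python
import numpy as np

toy_weights = np.array([100, 120, 90])
toy_masks = np.array([
    [True, False, True],   # permutation 1: babies 1 and 3 are "smokers"
    [False, True, True],   # permutation 2: babies 2 and 3 are "smokers"
])

# Shape (3,) broadcasts against shape (2, 3): the weights are reused for every row.
masked = toy_weights * toy_masks

# Divide each row sum by the number of True entries in that row.
toy_mean_smokers = masked.sum(axis=1) / toy_masks.sum(axis=1)
```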

First, let's try this on just the first permutation (i.e. the first row of `is_smoker_permutations`).

In [None]:
weights * is_smoker_permutations[0]

Now, on all of `is_smoker_permutations`:

In [None]:
weights * is_smoker_permutations

The mean of the **non-zero** entries in a row is the mean of the weights of "smoker" babies in that permutation.

_Why can't we use `.mean(axis=1)`?_

In [None]:
n_smokers = is_smoker.sum()
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_smokers

In [None]:
mean_smokers.shape

We also need to get the weights of the non-smokers in our permutations. We can do this by "inverting" the `is_smoker_permutations` mask and performing the same calculations.

In [None]:
n_non_smokers = len(is_smoker) - n_smokers  # 1174 babies in total
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
mean_non_smokers

In [None]:
test_statistics = mean_smokers - mean_non_smokers
test_statistics

### Putting it all together

In [None]:
%%time

is_smoker = smoking_and_birthweight['Maternal Smoker'].values
weights = smoking_and_birthweight['Birth Weight'].values
n_smokers = is_smoker.sum()
n_non_smokers = len(is_smoker) - n_smokers  # 1174 babies in total

is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T

mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
ultra_fast_differences = mean_smokers - mean_non_smokers

In [None]:
plt.hist(differences, density=True, ec='w', bins=bins, alpha=0.65, label='Original Test Statistics');
plt.hist(ultra_fast_differences, density=True, ec='w', bins=bins, alpha=0.65, label='Ultra Fast Test Statistics')
plt.legend();

Again, the distribution of test statistics with the "ultra-fast" simulation is similar to the original distribution of test statistics.
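
The "ultra-fast" cell still builds its permutations with a Python-level loop. One possible way to remove that loop, sketched here on synthetic data, is to argsort a matrix of random numbers so that each row becomes an independent random ordering of column positions. This is an alternative technique, not the approach used in lecture, and `fake_is_smoker`, `fake_weights`, and `rng` are made-up stand-ins because `baby.csv` is not loaded here.

```python
import numpy as np

rng = np.random.default_rng(24)

# Synthetic stand-ins with the same group sizes as the lecture data.
fake_is_smoker = np.array([True] * 459 + [False] * 715)
fake_weights = rng.normal(120, 18, size=fake_is_smoker.size)

n_repetitions = 3000

# Argsorting i.i.d. uniform draws gives one uniformly random permutation of
# column indices per row, with no Python loop.
orders = np.argsort(rng.random((n_repetitions, fake_is_smoker.size)), axis=1)
permuted_labels = fake_is_smoker[orders]  # shape (3000, 1174)

n_smokers = fake_is_smoker.sum()
n_non_smokers = fake_is_smoker.size - n_smokers
mean_smokers = (fake_weights * permuted_labels).sum(axis=1) / n_smokers
mean_non_smokers = (fake_weights * ~permuted_labels).sum(axis=1) / n_non_smokers
loop_free_differences = mean_smokers - mean_non_smokers
```

The trade-off is memory: the random matrix and its argsort are each as large as the full set of permutations.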

## Missingness mechanisms

_Good resources: [course notes](https://notes.dsc80.com/content/06/defining-missing.html), [Wikipedia](https://en.wikipedia.org/wiki/Missing_data), [this textbook page](https://stefvanbuuren.name/fimd/sec-MCAR.html)_

### Imperfect data

<center><img src="imgs/image_0.png" width=40%></center>

- When studying a problem, we are interested in understanding the **true model** in nature.
- The data generating process is the "real-world" version of the model; it generates the data that we observe.
- The recorded data is **supposed** to "well-represent" the data generating process, and subsequently the true model.

- Example: Consider the upcoming Midterm Exam.
    - The exam is meant to be a **model** of your **true** knowledge of DSC 80 concepts.
    - The data generating process should give us a sense of your true knowledge, but is influenced by the specific questions on the exam, your preparation for the exam, whether or not you are sick on the day of the exam, etc.
    - The recorded data consists of the final answers you write on the exam page.

### Imperfect data

<center><img src="imgs/image_0.png" width=40%></center>

* **Problem 1:** Your data is not representative, i.e. you collected a poor sample.
    - If the exam only asked questions about `pivot_table`, that would not give us an accurate picture of your understanding of DSC 80!
* **Problem 2:** Some of the entries are missing.
    - If you left some questions blank, why?

We will focus on the second problem.

### Types of missingness

There are four key ways in which values can be missing. It is important to distinguish between these types so that we can correctly **handle** missing data (Lecture 13).

* **Missing by design (MD)**.
* **Not missing at random (NMAR)**.
    - Also called "non-ignorable" (NI), especially in Lab 4.
* **Missing at random (MAR)**.
    - Also called "conditionally ignorable".
* **Missing completely at random (MCAR)**.
    - Also called "unconditionally ignorable".

### Missing by design (MD)

- Values in a column are missing by design if:
    - the designers of the data collection process **intentionally decided to not collect data in that column**,
    - because it can be recovered from other columns. 

- If you can determine whether a value is missing solely using other columns, then the data is missing by design.
    - For example: `'Age4'` is missing if and only if `'Number of People'` is less than 4.


<center><img src=imgs/households.png width=40%></center>

    
- Refer to [this StackExchange link](https://stats.stackexchange.com/questions/201782/meaning-of-missing-by-design-in-longitudinal-studies) for more examples.
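
The "if and only if" condition above can be checked directly. The sketch below uses a hypothetical `households` table with made-up values; the column names follow the example, but the data is not from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical household table; the values are made up for illustration.
households = pd.DataFrame({
    'Number of People': [2, 4, 3, 5],
    'Age4': [np.nan, 12.0, np.nan, 40.0],
})

# Missing by design: missingness in 'Age4' is fully determined by another column.
age4_missing = households['Age4'].isna()
fewer_than_four = households['Number of People'] < 4
is_missing_by_design = bool((age4_missing == fewer_than_four).all())
```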

### Missing by design


<center><img src="./imgs/Skiplogic.png"/></center>

**Example:** `'Car Type'` and `'Car Colour'` are missing if and only if `'Own a car?'` is `'No'`.


### Other types of missingness

- Not missing at random (NMAR).
    - The chance that a value is missing **depends on the actual missing value**!
    - Weird name, because it's still random...
    
- Missing at random (MAR).
    - The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.

- Missing completely at random (MCAR).
    - The chance that a value is missing is **completely independent** of
        - other columns, and
        - the actual missing value.

### Mom... the dog ate my data! 🐶

Consider the following (contrived) example:

- We survey 100 people for their favorite color and birth month.
- We write their answers on index cards.
    - On the left side, we write colors.
    - On the right side, we write birth months.
- A (bad) dog takes the top 10 cards from the stack and chews off the right side (birth months).
- Now ten people are missing birth months!

<center><img src='imgs/dog.png' width=50%></center>

### Discussion Question

We are now missing birth months for the first 10 people we surveyed. What is the missingness mechanism for birth months if:

1. Cards were sorted by favorite color?
2. Cards were sorted by birth month?
3. Cards were shuffled?

Remember:

- **Not missing at random (NMAR):** The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR):** The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR):** The chance that a value is missing is **completely independent** of other columns and the actual missing value.

### Discussion Question, solved

- If cards were sorted by favorite color, then:
    - The fact that a card is missing a month is **related to the favorite color**.
    - Since the missingness depends on another column, we say values are **missing at random (MAR)**.
        - The missingness doesn't depend on the actual missing values – early months are no more likely to be missing than later months, for instance.

- If cards were sorted by birth month, then:
    - The fact that a card is missing a month is **related to the missing month**.
    - Since the missingness depends on the actual missing values – early months are more likely to be missing than later months – we say values are **not missing at random (NMAR)**.

- If cards were shuffled, then:
    - The fact that a card is missing a month is **related to nothing**.
    - Since the missingness depends on nothing, we say values are **missing completely at random (MCAR)**.
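
The first two scenarios can be simulated in a few lines. This is a rough sketch with randomly generated survey answers; the helper `dog_eats_top_ten` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey of 100 people.
cards = pd.DataFrame({
    'color': rng.choice(['red', 'green', 'blue'], size=100),
    'month': rng.integers(1, 13, size=100),
})

def dog_eats_top_ten(stack):
    """Return a copy of `stack` with 'month' erased on the first 10 cards."""
    chewed = stack.reset_index(drop=True).copy()
    chewed['month'] = chewed['month'].astype(float)
    chewed.loc[:9, 'month'] = np.nan  # labels 0..9 inclusive, i.e. 10 cards
    return chewed

# Sorted by birth month: the lost months are the earliest ones (NMAR).
sorted_by_month = cards.sort_values('month')
chewed_month = dog_eats_top_ten(sorted_by_month)
lost_months = sorted_by_month['month'].iloc[:10]
kept_months = chewed_month['month'].dropna()

# Sorted by color: every chewed card has the alphabetically first color (MAR).
chewed_color = dog_eats_top_ten(cards.sort_values('color'))
colors_of_chewed = chewed_color.loc[chewed_color['month'].isna(), 'color'].unique()
```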

### The real world is messy! 🌎

- In our contrived example, the distinction between NMAR, MAR, and MCAR was relatively clear.
- However, in more practical examples, it can be hard to distinguish between types of missingness.
- Domain knowledge is often needed to understand **why** values might be missing.

### Not missing at random (NMAR)

- Data is NMAR if the chance that a value is missing **depends on the actual missing value**!
    - It could _also_ depend on other columns.
- Another term for NMAR is "non-ignorable" – the fact that data is missing is data in and of itself that we cannot ignore.

- **Example:** On an employment survey, people with really high incomes may be less likely to report their income.
    - If we **ignore** missingness and compute the mean salary, our result will be **biased** low!

- **Example:** A person doesn't take a drug test because they took drugs the day before.

- When data is NMAR, we must reason about why the data is missing using domain expertise on the data generating process – the other columns in our data won't help. 
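
The income example can be illustrated with a small simulation. Everything below is an assumption made for illustration: the lognormal income distribution, the 90th-percentile cutoff, and the two missingness rates are invented, not estimated from any survey.

```python
import numpy as np

rng = np.random.default_rng(80)

# Hypothetical incomes; the distribution and missingness rates are made up.
incomes = rng.lognormal(mean=11, sigma=0.5, size=10_000)

# NMAR mechanism: incomes in the top 10% are much more likely to go unreported.
is_high = incomes > np.quantile(incomes, 0.9)
prob_missing = np.where(is_high, 0.8, 0.05)
is_missing = rng.random(incomes.size) < prob_missing

true_mean = incomes.mean()

# Ignoring missingness means averaging only the reported incomes.
observed_mean = incomes[~is_missing].mean()
```

Because the largest values are the ones most often dropped, `observed_mean` falls below `true_mean`.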

### Missing at random (MAR)

- Data is MAR if the chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- Another term for MAR is "conditionally ignorable".

* **Example:** An elementary school teacher keeps track of the health conditions of each student in their class. One day, a student doesn't show up for a test because they are at the hospital.
    - The fact that their test score is missing has nothing to do with the test score itself.
    - But the teacher could have predicted that the score would have been missing given the other information they had about the student.

- **Example:** People who work in the service industry may be less likely to report their income.

<center><img src="imgs/tip.jpg" width="20%"/></center>

### Missing completely at random (MCAR)

- Data is MCAR if the chance that a value is missing is **completely independent** of other columns and the actual missing value.
- Another term for MCAR is "unconditionally ignorable" – the fact that a value is missing is not noteworthy.

- **Example:** There is a bank of 10 exam questions, and each student's Midterm Exam consists of a random sample of 5 questions. Each student will have missing scores for the other 5 questions; those scores will be MCAR.

- **Example:** After the Midterm Exam, I accidentally spill boba on the top of the stack. Assuming that the exams are in a random order, the exam scores that are lost due to this will still be MCAR. (Hopefully this doesn't happen!)

<center><img src="imgs/tea.jpg" width="20%"></center>



### Isn't everything NMAR? 🤔

- You can argue that many of these examples are NMAR, by arguing that the missingness depends on the value of the data that is missing.
    - For example, if a student is hospitalized, they may have lots of health problems and may not have spent much time on school, leading to their test scores being worse.
- Fair point, but with that logic _almost everything is NMAR_.
- What we really care about is **the main reason data is missing**.
- If the other columns **mostly** explain the missing value and missingness, treat it as MAR.

### Flowchart

A good strategy is to assess missingness in the following order.

<center><b>Missing by design (MD)</b></center>
<center><i>Can I determine the missing value exactly by looking at the other columns? 🤔</i></center>
$$\downarrow$$

<center><b>Not missing at random (NMAR)</b></center>
<center><i>Is there a good reason why the missingness depends on the values themselves? 🤔</i></center>
$$\downarrow$$

<center><b>Missing at random (MAR)</b></center>
<center><i>Do other columns tell me anything about the likelihood that a value is missing? 🤔</i></center>
$$\downarrow$$

<center><b>Missing completely at random (MCAR)</b></center>
<center><i>The missingness must not depend on other columns or the values themselves. 😄</i></center>

### Discussion Question

In each of the following examples, decide whether the missing data are MD, NMAR, MAR, or MCAR:

* A table for a medical study has columns for `'gender'` and `'age'`. **`'age'` has missing values**.
* Measurements from the Hubble Space Telescope are **dropped during transmission**.
* A table has a single column, `'self-reported education level'`, **which contains missing values**.
* A table of grades contains three columns, `'Version 1'`, `'Version 2'`, and `'Version 3'`. **$\frac{2}{3}$ of the entries in the table are `NaN`.**


### Why do we care again?

- If a dataset contains missing values, it is likely not an accurate picture of the data generating process.
- By identifying missingness mechanisms, we can best **fill in** missing values, to gain a better understanding of the DGP.

### Formal definition: MCAR

Suppose we have:
- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is **missing completely at random** (MCAR) if 

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: \psi)$$

That is, adding information about the dataset doesn't change the likelihood that data is missing!

### Formal definition: MAR

Suppose we have:
- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.

Data is **missing at random** (MAR) if 

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi) = \text{P}(\text{data is present} \: | \: Y_{obs},  \psi)$$

That is, MAR data is **actually MCAR**, **conditional** on $Y_{obs}$.

### Formal definition: NMAR

Suppose we have:
- A dataset $Y$ with observed values $Y_{obs}$ and missing values $Y_{mis}$.
- A parameter $\psi$ that represents all relevant information that is not part of the dataset.


Data is **not missing at random** (NMAR) if  

$$\text{P}(\text{data is present} \: | \: Y_{obs}, Y_{mis}, \psi)$$

cannot be simplified. That is, in NMAR data, **missingness is dependent on the missing value** itself.

## Assessing missingness through data

### Assessing missingness through data

- Suppose I believe that the missingness mechanism of a column is NMAR, MAR, or MCAR.
    - I've ruled out missing by design (a good first step).
- Can I check whether this is true, by looking at the data?

### Assessing NMAR

- We can't determine if data is NMAR just by looking at the data, as whether or not data is NMAR depends on the **unobserved data**.
- To establish if data is NMAR, we must:
    - **reason about the data generating process**, or
    - collect more data.

- **Example:** Consider a dataset of survey data of students' self-reported happiness. The data contains PIDs and happiness scores; nothing else. Some happiness scores are missing. **Are happiness scores likely NMAR?**

### Assessing MAR

- Data are MAR if the missingness only depends on **observed** data.
- After reasoning about the data generating process, if you establish that data is not NMAR, then it must be either MAR or MCAR.
- The more columns we have in our dataset, the "weaker the NMAR effect" is.
    - Adding more columns -> controlling for more variables -> moving from NMAR to MAR.
    - **Example:** With no other columns, income in a census is NMAR. But once we look at location, education, and occupation, incomes are closer to being MAR.

### Assessing MCAR

- For data to be MCAR, the chance that values are missing should not depend on any other column or the values themselves.
- **Example:** Consider a dataset of phones, in which we store the screen size and price of each phone. **Some prices are missing.**

| Phone | Screen Size | Price |
| --- | --- | --- |
| iPhone 13 | 6.06 | 999 |
| Galaxy Z Fold 3 | 7.6 | NaN |
| OnePlus 9 Pro | 6.7 | 799 |
| iPhone 12 Pro Max | 6.68 | NaN |

- If prices are MCAR, then **the distribution of screen size should be the same** for:
    - phones whose prices are missing, and 
    - phones whose prices aren't missing.
- **We can use a permutation test to decide between MAR and MCAR!** We are asking the question, did these two samples come from the same underlying distribution?
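
A rough sketch of such a test follows, reusing the masking trick from earlier in the lecture. It runs on synthetic phone data in which larger screens are deliberately made more likely to have a missing price, and it uses the difference in mean screen size as the test statistic. The data, the helper `mean_gap`, and the choice of statistic are all assumptions for illustration; the statistic used in the course may differ.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical phones: larger screens are made more likely to have a missing price.
screen_size = rng.normal(6.5, 0.4, size=200)
price_missing = rng.random(200) < np.where(screen_size > 6.5, 0.6, 0.1)

def mean_gap(sizes, missing_mask):
    """Mean screen size of missing-price phones minus that of the rest.
    `missing_mask` may be 1-D or 2-D (one permutation per row)."""
    n_missing = missing_mask.sum(axis=-1)
    n_present = missing_mask.shape[-1] - n_missing
    return ((sizes * missing_mask).sum(axis=-1) / n_missing
            - (sizes * ~missing_mask).sum(axis=-1) / n_present)

observed = mean_gap(screen_size, price_missing)

# Shuffle the missingness labels, one permutation per row.
permuted_masks = np.array([rng.permutation(price_missing) for _ in range(2000)])
simulated = mean_gap(screen_size, permuted_masks)

# Two-sided p-value: how often is a shuffled gap at least as extreme as the observed one?
p_value = (np.abs(simulated) >= np.abs(observed)).mean()
```

A small `p_value` here is evidence against MCAR, since it suggests that the missingness of price is related to screen size.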

## Summary, next time

### Summary, next time

- **Missing by design (MD):** Whether or not a value is missing depends entirely on the data in other columns. In other words, if we can always predict whether a value will be missing given the other columns, the data is MD.
- **Not missing at random (NMAR, also called NI):** The chance that a value is missing **depends on the actual missing value**!
- **Missing at random (MAR):** The chance that a value is missing **depends on other columns**, but **not** the actual missing value itself.
- **Missing completely at random (MCAR):** The chance that a value is missing is **completely independent** of other columns and the actual missing value.

- **Next time:** How to verify if data are MAR vs. MCAR using a permutation test. A new test statistic.

- **Important:** Refer to the [Flowchart](#Flowchart) when deciding between missingness types.