# Dealing with Missing Data

## Dropping Missing Data

df.dropna() drops every row that contains at least one missing value.

In [4]:
import pandas as pd

# Sample data to play with and clean
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, None, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Full dataset
print(df)

# Drop all rows that have any missing values in any column.
#print(df.dropna())

# Drop only rows where all values are missing
#print(df.dropna(how='all'))

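# Keep only rows with at least two non-missing values
# (with four columns, this drops rows where more than two values are missing)
#print(df.dropna(thresh=2))

# Drop all rows that have any missing values in the gender or height columns
#print(df.dropna(subset=['gender', 'height']))

# Drop all rows where either height or weight is missing.
# thresh=2 requires both subset columns to be non-missing; to drop only rows
# where both are missing, use how='all' (or thresh=1) instead.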
print(df.dropna(thresh=2, subset=['height', 'weight']))

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f     NaN   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN
    age gender  height  weight
0  27.0      f    64.0   140.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
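
The difference between "both missing" and "either missing" can be checked directly. The sketch below rebuilds the sample frame and compares `how='all'` with `thresh=2` on the `height` and `weight` subset; the result variable names are illustrative only.

```python
import pandas as pd

data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, None, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# how='all' drops a row only when every column in the subset is missing.
both_missing_dropped = df.dropna(how='all', subset=['height', 'weight'])

# thresh=2 keeps a row only when at least two subset columns are non-missing,
# so a row missing either height or weight is dropped.
either_missing_dropped = df.dropna(thresh=2, subset=['height', 'weight'])

print(both_missing_dropped.index.tolist())
print(either_missing_dropped.index.tolist())
```

Row 2 has a weight but no height, so it survives the first call and is dropped by the second.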


## When Does Missingness Matter?

1. Loss of statistical power: when many rows are thrown away, effects become harder to detect.

2. Bias: certain values are more likely to be missing than others.
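
To get a feel for how much data is at stake, it can help to count missing values per column and see how many rows survive dropping every incomplete row. This is a minimal sketch using the sample frame from the dropping section; `missing_per_column`, `rows_kept`, and `share_kept` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, None, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Share of rows that survive dropping every row with any missing value.
rows_kept = len(df.dropna())
share_kept = rows_kept / len(df)

print(missing_per_column)
print(f"{rows_kept} of {len(df)} rows kept ({share_kept:.0%})")
```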

### Missing Completely At Random (MCAR)

1. Example: a flood washes away some servers and 20% of the data is lost.
2. Unless so much data is lost that the sample becomes too small, it's fair to throw out the missing values and proceed.
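
A small simulation can illustrate why dropping MCAR values is usually safe. The sketch below assumes a hypothetical complete variable (10,000 simulated weights) and deletes 20% of it purely at random; the distribution parameters and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete data: 10,000 weights drawn from a normal distribution.
weight = pd.Series(rng.normal(loc=150, scale=25, size=10_000))

# MCAR: every value has the same 20% chance of being lost, whatever it is.
lost = rng.random(len(weight)) < 0.20
observed = weight.mask(lost)

# Series.mean() skips NaN by default, so this is the mean after dropping.
print(f"true mean:     {weight.mean():.1f}")
print(f"observed mean: {observed.mean():.1f}")
print(f"share missing: {observed.isna().mean():.1%}")
```

The two means should land close together, because the lost values are a random sample of all values.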

### Missing at Random (MAR)

1. Women are more likely to skip a question about weight, regardless of actual weight.

2. Because we can explain why data is missing using data we have, we can proceed as long as we include the variable that "explains" the missingness in our analyses.

3.  There's no way to know for certain that data is MAR, but sometimes we can assume it is.  If we find a variable in our dataset that differentiates well between missing and non-missing cases (for example, 90% of the people with missing values on the "depression" score are men), we have reason to suspect MAR.
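
One way to look for a variable that "explains" missingness is to compare the missing rate across its groups. The sketch below uses a small made-up survey; the `survey` frame and its values are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical survey in which the depression score is skipped mostly by men.
survey = pd.DataFrame({
    'gender': ['m', 'm', 'm', 'm', 'm', 'f', 'f', 'f', 'f', 'f'],
    'depression': [None, None, None, None, 12, 9, 14, None, 7, 11],
})

# Flag rows where the score is missing, then compare the rate by gender.
is_missing = survey['depression'].isna()
missing_rate_by_gender = is_missing.groupby(survey['gender']).mean()
print(missing_rate_by_gender)

# Of the rows with a missing score, what share are men?
share_of_missing_that_are_men = (survey.loc[is_missing, 'gender'] == 'm').mean()
print(share_of_missing_that_are_men)
```

A large gap between groups is a hint, not proof, that the data may be MAR with respect to that variable.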

### Missing Not at Random (MNAR)

1. LGBT individuals are less likely to answer a survey question about their sexual orientation.

2. Systematic missingness: people who would answer in a certain way (LGBT vs. heterosexual) are less likely to answer at all.

3.  Stop, do not pass go, do not collect $200.  If we throw out MNAR data, we end up with a biased sample (proportionately fewer LGBT people than in the population we want to study) and biased conclusions.

4.  NOTE: we don't know what people would have said for questions they don't answer.  MNAR is an assumption based on looking at the data and noticing what ISN'T there: abnormally low counts of LGBT people, almost no men who say they are depressed, variables with missingness where nobody picks the highest or lowest value, etc.
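
The bias from dropping MNAR data can be shown with a simulation. The sketch below assumes a hypothetical rule in which people with higher weights are far more likely to skip the question; the cutoff and skip probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete data: 10,000 simulated weights.
weight = pd.Series(rng.normal(loc=150, scale=25, size=10_000))

# MNAR: the chance of skipping depends on the unobserved value itself.
# Assumed rule: 50% nonresponse above 170, 5% otherwise.
p_skip = np.where(weight > 170, 0.50, 0.05)
skipped = rng.random(len(weight)) < p_skip
observed = weight.mask(skipped)

print(f"true mean:     {weight.mean():.1f}")
print(f"observed mean: {observed.mean():.1f}")
```

Because high values are underrepresented among the answers, the observed mean comes out below the true mean.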

So what can you do if you have MNAR data that you can't drop, or MCAR or MAR data where dropping missing values leaves your sample too small?

## Imputing Data

Imputation: "Guess" what missing data would have been and fill in that cell with our guess.

Methods range from simple to complex. The simplest approach replaces each missing value with the column's mode, mean, or median. This keeps that measure of central tendency the same but reduces variance and weakens correlations among variables.

In [7]:
import pandas as pd

# Sample data to play with.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN


In [8]:
# For each numeric column, replace the missing values with 
# the mean for that column.
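# numeric_only=True skips the non-numeric gender column, which recent
# pandas versions would otherwise refuse to average.
df.fillna(df.mean(numeric_only=True), inplace=True)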
print(df)

    age gender  height  weight
0  27.0      f   64.00   140.0
1  50.0      f   67.25   135.0
2  34.0      f   71.00   130.0
3  37.0      m   66.00   110.0
4  37.0      m   68.00   160.0
5  37.0   None   67.25   135.0
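
The effect of mean imputation on spread can be checked on a single column. This is a minimal sketch using the `height` values from the sample data; it compares the mean and sample variance before and after filling.

```python
import pandas as pd

height = pd.Series([64, None, 71, 66, 68, None], dtype='float64')

# Replace the missing heights with the mean of the observed heights.
filled = height.fillna(height.mean())

# The mean is unchanged by construction.
print(height.mean(), filled.mean())

# The variance shrinks: the filled values add rows but no extra spread.
print(height.var(), filled.var())
```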


In [9]:
# For each column, replace the missing values with the most common value 
# for that column. Useful for filling in missing categorical values.
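# As written, this command will fill in missing values for both
# numerical and categorical columns. When several values tie for most
# common, the fill value is whichever one value_counts() lists first.
df = pd.DataFrame(data)
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))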
print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    68.0   160.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0      f    68.0   160.0
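
Filling a numeric column with its most common value is rarely meaningful when every value is unique. One alternative, sketched below under the assumption that numeric columns should get the median and other columns the mode, is to treat the two kinds of columns separately; `numeric_cols` and `categorical_cols` are illustrative names.

```python
import pandas as pd

data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(exclude='number').columns

# Numeric columns: fill with the column median.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Other columns: fill with the most frequent value.
# Series.mode() ignores missing values and returns the modes in sorted order.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```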


In [14]:
# Try replacing each missing value with the median, mode, or other
# statistic of your choice.
df = pd.DataFrame(data)
df.fillna(df.median(numeric_only=True), inplace=True)
print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    67.0   135.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0   None    67.0   135.0


## Beyond Imputation

If the causes of MNAR (or of large amounts of missingness under MCAR or MAR) are easy to fix, then fixing those causes and collecting new data may be easier than imputation.

1. Run the study afresh.
2. Collect more data, with an intentional focus on the groups with the highest missingness.

Example: if a coding error in a tech usage survey means data wasn't recorded for Mac users, it may be easier to fix the coding error and run the study again (or fix the coding error and collect data from just Mac users) than to try to impute such a centrally important variable.
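
To decide where new data collection should focus, the missing rate can be ranked by group. The sketch below uses a made-up version of the tech usage survey; the `usage` frame, its `os` and `hours_online` columns, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical survey where a coding error lost 'hours_online' for Mac users.
usage = pd.DataFrame({
    'os': ['mac', 'mac', 'mac', 'windows', 'windows', 'linux', 'windows', 'linux'],
    'hours_online': [None, None, None, 5.0, 3.5, 6.0, None, 4.0],
})

# Missing rate per operating system, highest first.
missing_rate_by_os = (
    usage['hours_online'].isna()
    .groupby(usage['os'])
    .mean()
    .sort_values(ascending=False)
)
print(missing_rate_by_os)

# Groups with no recorded values at all cannot be imputed from their own data.
fully_missing_groups = missing_rate_by_os[missing_rate_by_os == 1.0].index.tolist()
print(fully_missing_groups)
```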