# Dealing with Missing Data

In [1]:
# standard imports
import numpy as np
import pandas as pd

# statistics imports
import scipy.stats as stats

# plotting imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



### Table of Contents
1. [Dealing with Missing Data](#Dealing-with-Missing-Data)
2. [Basic Imputation](#Basic-Imputation)
3. [Impute Using Linear Regression](#Impute-Using-Linear-Regression)
4. [Interpolation](#Interpolation)


### Dealing with Missing Data
[[back to top]](#Table-of-Contents)

If the dataset we're working with has missing values, there are two main ways to deal with them:

- Partial Deletion
- Imputation

There are two types of partial deletion:

1. **Listwise Deletion:** We remove every row that has a `NULL` value in any of the columns used for the analysis. In Pandas this is done with the `.dropna()` method and its default `how='any'`; pass `subset=` to restrict the check to the columns you care about, so that a row isn't dropped because of a `NULL` value in an irrelevant column. Let's say we have a data set that contains a person ID, age, weight, and height. If we want to perform analysis on age and height, but one person is missing their age, we'd exclude that person's entire row.
2. **Pairwise Deletion:** We exclude a row only from the calculations that involve the column where its value is missing, and still use its available values in other columns. Using the same example as above, if one person is missing their age, we'd exclude them when looking at age, but keep them in the data when looking at height.
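
As a rough sketch of the two approaches, the cell below builds a small, made-up version of the person ID / age / weight / height data set described above (the values and the name `people` are hypothetical) and applies each kind of deletion.

```python
import numpy as np
import pandas as pd

# hypothetical data set: person ID, age, weight, and height
people = pd.DataFrame({
    'person_id': [1, 2, 3, 4, 5],
    'age': [34, np.nan, 29, 41, 52],
    'weight': [70.5, 82.0, np.nan, 90.1, 65.3],
    'height': [172, 180, 165, 178, 160],
})

# listwise deletion: drop any row missing a value in the columns we analyze
listwise = people.dropna(subset=['age', 'height'])

# pairwise deletion: each statistic uses every row available for its own columns
mean_age = people['age'].mean()        # skips the one missing age
mean_height = people['height'].mean()  # still uses all five heights

# .corr() also works pairwise: each pair of columns uses its own complete rows
pairwise_corr = people[['age', 'weight', 'height']].corr()
```

Here person 2 disappears entirely from `listwise`, but still contributes to `mean_height`.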

Partial deletion has a few challenges associated with it:

- If a large share of rows have a `NULL` value in a particular column, removing all of them will greatly reduce the statistical power of our analysis.
- If a certain sub-population tends to have a value missing for a particular column, partial deletion may compromise the representativeness of our sample. 
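
Before deleting anything, it can help to check how much data is missing and whether it is concentrated in one sub-population. The sketch below uses a made-up `survey` frame with hypothetical `group` and `income` columns; it is one possible diagnostic, not a complete test for bias.

```python
import numpy as np
import pandas as pd

# hypothetical survey where group 'b' often leaves income blank
survey = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'income': [52.0, 48.0, 61.0, np.nan, np.nan, 39.0],
})

# fraction of missing values in each column
missing_by_column = survey.isna().mean()

# fraction of missing income within each sub-population
missing_by_group = survey['income'].isna().groupby(survey['group']).mean()

# share of each group before and after listwise deletion
share_before = survey['group'].value_counts(normalize=True)
share_after = survey.dropna()['group'].value_counts(normalize=True)
```

In this toy example group `a` makes up half of the original rows but three quarters of the rows that survive `.dropna()`.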


### Basic Imputation
[[back to top]](#Table-of-Contents)

Because of the problems with simply removing missing values, it's sometimes better to make an intelligent guess about the missing values in our data. This is called *imputation*. There are many ways to impute missing data, but we'll go over some of the basic ones here.

One of the most basic ways to impute data is to take the mean of the available data and assign it to all the missing values. A benefit of this method is that it doesn't change the mean of the column. However, because every imputed value is the same constant, it shrinks the variance of the column and weakens the correlation between the imputed variable and any other variable. We can accomplish this type of imputation in Pandas using the `.fillna()` method.
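
The cell below is a minimal sketch of mean imputation on simulated data. The variables `x` and `y`, the seed, and the 40% missingness rate are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# simulate two correlated columns (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
data = pd.DataFrame({'x': x, 'y': y})

# remove 80 of the 200 y values at random
data.loc[rng.choice(200, size=80, replace=False), 'y'] = np.nan

# mean imputation: fill every missing y with the mean of the observed y values
imputed = data.copy()
imputed['y'] = imputed['y'].fillna(imputed['y'].mean())

# compare the observed-only statistics with the imputed column
corr_observed = data['x'].corr(data['y'])    # ignores rows where y is missing
corr_imputed = imputed['x'].corr(imputed['y'])
print(data['y'].mean(), imputed['y'].mean())  # the mean is unchanged
print(data['y'].std(), imputed['y'].std())    # the spread shrinks
print(corr_observed, corr_imputed)            # the correlation weakens
```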

### Impute Using Linear Regression
[[back to top]](#Table-of-Contents)

Another method of imputation is to use a linear regression to estimate the missing values. Here, we fit an equation on the rows where the value is present, then use it to predict the missing values from the other variables available to us.

This method, unlike the previous one, amplifies the correlation between the imputed variable and the variables used in the regression. Additionally, the imputed values fall exactly on the regression line with no random scatter, which implies more certainty in the imputed values than we actually have.
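
As an illustrative sketch, the cell below fits a one-predictor regression with `scipy.stats.linregress` and uses it to fill in missing values. The simulated columns `x` and `y`, the seed, and the amount of missing data are assumptions; a real analysis would likely use more predictors.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# simulate two correlated columns (hypothetical data)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=1.0, size=200)
data = pd.DataFrame({'x': x, 'y': y})

# remove 80 of the 200 y values at random
data.loc[rng.choice(200, size=80, replace=False), 'y'] = np.nan

# fit the regression on the rows where y is observed
observed = data.dropna()
fit = stats.linregress(observed['x'], observed['y'])

# predict y for every row, then use the predictions only where y is missing
predicted = fit.intercept + fit.slope * data['x']
imputed = data.copy()
imputed['y'] = imputed['y'].fillna(predicted)

corr_observed = observed['x'].corr(observed['y'])
corr_imputed = imputed['x'].corr(imputed['y'])
print(corr_observed, corr_imputed)  # the imputed column looks more correlated
```

Every imputed point sits exactly on the fitted line, which is why the correlation goes up.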


### Interpolation
[[back to top]](#Table-of-Contents)

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points. Both `Series` and `DataFrame` objects have an `.interpolate()` method that, by default, performs linear interpolation at missing data points.

In [2]:
# create a sample DataFrame with some missing values
df = pd.DataFrame({'A': np.random.randn(5),
                   'B': [1, 2, np.nan, 4, 5],
                   'C': [np.nan, 12, 13, np.nan, np.nan]})
df

|   | A | B | C |
|---|---|---|---|
| 0 | -1.472251 | 1.0 | NaN |
| 1 | 0.553938 | 2.0 | 12.0 |
| 2 | 0.264419 | NaN | 13.0 |
| 3 | 1.203069 | 4.0 | NaN |
| 4 | 0.053248 | 5.0 | NaN |


In [3]:
# use the .interpolate() method to attempt to fill in the missing values
df.interpolate()

|   | A | B | C |
|---|---|---|---|
| 0 | -1.472251 | 1 | NaN |
| 1 | 0.553938 | 2 | 12.0 |
| 2 | 0.264419 | 3 | 13.0 |
| 3 | 1.203069 | 4 | 13.0 |
| 4 | 0.053248 | 5 | 13.0 |
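
In the output above, the missing value in column B is filled with a true linear interpolation (3), but column C behaves differently: the leading gap in row 0 stays missing, and the trailing gaps in rows 3 and 4 are filled by repeating the last known value (13.0). This follows from the default `limit_direction='forward'`. The sketch below reuses the B and C values as standalone `Series` to show how the `limit_area` and `limit_direction` parameters change this; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

b = pd.Series([1, 2, np.nan, 4, 5])
c = pd.Series([np.nan, 12, 13, np.nan, np.nan])

# a gap between two known values is filled on the straight line between them
b_filled = b.interpolate()                          # 1.0, 2.0, 3.0, 4.0, 5.0

# default: fill forward only, so the leading NaN stays and trailing NaNs repeat 13.0
c_default = c.interpolate()                         # NaN, 12.0, 13.0, 13.0, 13.0

# only fill gaps that are surrounded by known values (no extrapolation)
c_inside = c.interpolate(limit_area='inside')       # NaN, 12.0, 13.0, NaN, NaN

# fill in both directions, so the leading NaN takes the first known value
c_both = c.interpolate(limit_direction='both')      # 12.0, 12.0, 13.0, 13.0, 13.0
```

Using `limit_area='inside'` is a reasonable choice when repeating the last value past the end of the known data would be misleading.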
