## Handling Missing Data in a pandas `DataFrame`
### Working with pandas
*Curtis Miller*

In this notebook I demonstrate approaches to handling missing data in a pandas `DataFrame`. The first thing I do is create a `DataFrame` `df` and a `Series` `srs` that contain missing data. (Because the numbers in `df` are random, you should expect your results to differ.)

In [None]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import random

# Create a data frame of random numbers, some randomly censored
vals = np.random.randn(21)
vals[random.sample(range(21), 5)] = np.nan    # Set 5 randomly chosen entries to NaN
df = DataFrame(vals.reshape(7, 3), columns = ["AAA", "BBB", "CCC"])
df

In [None]:
srs = Series([2, 3, 3, 9, 8, np.nan, 8, np.nan, 4, 4, 5])
print(srs)

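Here we see methods for detecting missing data. `np.isnan()` and `isnull()` produce identical results (`True` where a value is missing), while `notnull()` produces the exact opposite (`True` where a value is present).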

In [None]:
np.isnan(df)

In [None]:
df.isnull()

In [None]:
df.notnull()    # Opposite of isnull() and isnan()
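
The Boolean masks above are usually a means to an end, such as counting missing values. Here is a minimal sketch of that idea; the `demo` frame and its values are made up for illustration so that the results are reproducible, and they are not the random `df` from this notebook.

```python
import numpy as np
import pandas as pd

# Small hand-made frame (hypothetical values, not the random df above)
demo = pd.DataFrame({"AAA": [1.0, np.nan, 3.0],
                     "BBB": [np.nan, np.nan, 6.0],
                     "CCC": [7.0, 8.0, 9.0]})

# True counts as 1 when summed, so this counts missing values per column
missing_per_column = demo.isnull().sum()

# True for each row that contains at least one missing value
rows_with_missing = demo.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```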

Here's what removing missing information looks like.

In [None]:
df.dropna()

In [None]:
print(srs.dropna())
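
By default `dropna()` discards every row that has at least one missing value, which can throw away a lot of data. The sketch below shows a few of its other parameters (`how`, `thresh`, and `subset`) on a hypothetical `demo` frame invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 4 has a single observed value
demo = pd.DataFrame({"AAA": [1.0, np.nan, np.nan, 4.0, np.nan],
                     "BBB": [5.0, np.nan, 7.0, 8.0, np.nan],
                     "CCC": [9.0, np.nan, 11.0, 12.0, 13.0]})

any_dropped = demo.dropna()                    # default: drop rows with any NaN
all_dropped = demo.dropna(how="all")           # drop only rows that are entirely NaN
thresh_dropped = demo.dropna(thresh=2)         # keep rows with at least 2 observed values
subset_dropped = demo.dropna(subset=["BBB"])   # only look at column BBB when deciding

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(thresh_dropped.index.tolist())
print(subset_dropped.index.tolist())
```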

Now let's look at more interesting approaches to filling missing information.

In [None]:
xbar = srs.mean()    # By default, ignores nan
print(xbar)

In [None]:
print(srs.fillna(0))

In [None]:
print(srs.fillna(xbar))

In [None]:
# How does the mean of this data compare to before?
srs.fillna(xbar).mean()

In [None]:
# What about the standard deviation (a measure of how dispersed data is)?
srs.std()

In [None]:
srs.fillna(xbar).std()

Filling missing data with the mean of that data is not cost-free; while the mean is preserved, other important metrics (such as the standard deviation) are affected, which may contaminate some algorithms (we made the data appear more concentrated than the original data was).
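
The shrinkage can be quantified. Each filled-in value sits exactly at the mean, so it adds nothing to the sum of squared deviations, while the denominator of the sample variance grows from $n - 1$ to $n + k - 1$ (with $n$ observed and $k$ missing values). The standard deviation is therefore scaled by $\sqrt{(n - 1)/(n + k - 1)}$. The sketch below checks this on the same values as `srs`, assuming the default `ddof=1` that pandas uses for `std()`.

```python
import numpy as np
import pandas as pd

srs = pd.Series([2, 3, 3, 9, 8, np.nan, 8, np.nan, 4, 4, 5])

n = srs.notnull().sum()    # number of observed values
k = srs.isnull().sum()     # number of missing values

# Ratio of the standard deviation after mean-filling to the original one
ratio_observed = srs.fillna(srs.mean()).std() / srs.std()

# Ratio predicted by the argument above
ratio_predicted = np.sqrt((n - 1) / (n + k - 1))

print(ratio_observed, ratio_predicted)
```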

Here's a trick: replace the missing data with *randomly generated* data whose mean and standard deviation are close to those of the original data. One way to do this is to pick random values from the observed data and fill in the missing entries with those values. This resembles a statistical technique known as bootstrapping.

I demonstrate below.

In [None]:
# Draw two values at random (with replacement) from the observed data, one for each missing entry,
# and index them by the positions of the missing values (5 and 7)
rep = Series(np.random.choice(srs[srs.notnull()], size=2), index=[5, 7])
print(rep)

In [None]:
srs.fillna(rep)

In [None]:
srs.fillna(rep).mean()

In [None]:
srs.fillna(rep).std()

While random, the mean and standard deviation of the filled-in data set are both close to those of the original data set. (Not that this approach is perfect either: resampling can only reproduce values that were already observed.)
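
The resampling cell above hard-codes the number of missing values and their positions. A possible generalization is sketched below; the helper name `fill_by_resampling` is my own invention for illustration, and the sketch assumes the series has at least one observed value to draw from.

```python
import numpy as np
import pandas as pd

def fill_by_resampling(s, rng):
    """Return a copy of s with each NaN replaced by a random draw
    (with replacement) from the observed values of s."""
    missing = s.isnull()
    observed = s[~missing].to_numpy()
    filled = s.copy()
    if missing.any():
        # One draw per missing entry, placed at the missing positions
        filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)    # seeded so the example is repeatable
srs = pd.Series([2, 3, 3, 9, 8, np.nan, 8, np.nan, 4, 4, 5])
filled = fill_by_resampling(srs, rng)

print(filled)
print(filled.mean(), filled.std())
```

Passing the generator in explicitly keeps the helper free of hidden global state, which makes results easier to reproduce.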

Now let's look at `df` again. Let's try to fill missing data.

In [None]:
df.fillna(0)

In [None]:
df.mean()

In [None]:
df.fillna(df.mean())

In [None]:
df.std()

In [None]:
df.fillna(df.mean()).std()    # Standard deviations go down in every column that had missing values

What does the "fill with random data" trick used above look like here?

In [None]:
col='AAA'
df[col][df[col].notnull()]

In [None]:
# We will fill missing data via a dict
rep_df = {col: Series(np.random.choice(df[col][df[col].notnull()],    # Create a Series of random values from col...
                                       size=df[col].isnull().sum()),    # ... as many as there are missing values
                                                                        # in col (0 if the column is complete)...
                      index=df[col][df[col].isnull()].index)    # ... and having an index corresponding to the missing values
                                                                # in the column col of df ...
          for col in df}    # ... for each column in df
rep_df

In [None]:
df.fillna(rep_df)

In [None]:
df.fillna(rep_df).mean()

In [None]:
df.fillna(rep_df).std()
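
The same idea can be written as an explicit loop, which some readers may find easier to follow than the dict comprehension. This is a sketch under my own assumptions: `fill_frame_by_resampling` and the `demo` frame are hypothetical names and values, and every column is assumed to have at least one observed value.

```python
import numpy as np
import pandas as pd

def fill_frame_by_resampling(frame, rng):
    """Return a copy of frame where each column's NaNs are replaced by
    random draws (with replacement) from that column's observed values."""
    filled = frame.copy()
    for col in filled.columns:
        missing = filled[col].isnull()
        if not missing.any():
            continue    # nothing to fill in this column
        observed = filled.loc[~missing, col].to_numpy()
        filled.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)
demo = pd.DataFrame({"AAA": [1.0, np.nan, 3.0, 4.0],
                     "BBB": [np.nan, 6.0, np.nan, 8.0],
                     "CCC": [9.0, 10.0, 11.0, 12.0]})    # CCC has no missing values
filled = fill_frame_by_resampling(demo, rng)

print(filled)
```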

As you encounter different problems you may come upon other solutions to filling in missing values. Here are some examples.

For numeric data:

* Fill in with a "neutral" value, like 0, 1, or sample mean
* Fill with tailored values to preserve select statistics (like the mean or standard deviation), randomly assigned to rows
* Fill with independently generated random numbers with same statistical properties as the data
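
As an illustration of the last bullet, the sketch below fills missing values with draws from a Normal distribution that has the same mean and standard deviation as the observed data. The choice of the Normal distribution is purely an assumption made for this example; it can produce values the data could never take (non-integers or negatives, say), so a different distribution may suit your data better.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)    # seeded so the example is repeatable
srs = pd.Series([2, 3, 3, 9, 8, np.nan, 8, np.nan, 4, 4, 5])

missing = srs.isnull()

# Independent Normal draws matching the observed mean and standard deviation
draws = rng.normal(loc=srs.mean(), scale=srs.std(), size=int(missing.sum()))

filled = srs.copy()
filled[missing] = draws

print(filled)
print(filled.mean(), filled.std())
```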

For categorical data:

* Fill with most common value
* Fill with values chosen with a frequency that would preserve observed frequencies, randomly assigned to rows
* Fill with independently generated random values chosen with the same frequency as the observed frequencies
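
Two of the categorical options can be sketched as follows: filling with the most common value, and filling with independent random draws that follow the observed frequencies. The `colors` series is a made-up example, not data from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)    # seeded so the example is repeatable
colors = pd.Series(["red", "blue", "red", None, "green", "red", None, "blue"])
missing = colors.isnull()

# Option 1: fill with the most common observed value
most_common = colors.mode()[0]
filled_mode = colors.fillna(most_common)

# Option 2: fill with random draws that follow the observed frequencies
# (value_counts ignores missing values by default)
freqs = colors.value_counts(normalize=True)
filled_random = colors.copy()
filled_random[missing] = rng.choice(freqs.index.to_numpy(),
                                    size=int(missing.sum()),
                                    p=freqs.to_numpy())

print(filled_mode)
print(filled_random)
```

Filling with the mode inflates the share of the most common category, while the random draws keep the category shares roughly intact on average.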

None of this even covers more sophisticated, model-based imputation! There are many ways to fill missing values.