# Imputation

In [51]:
import numpy as np
import pandas as pd

# read in data
df = pd.read_csv('airquality.csv')

# clean up the data a bit
df.drop(columns='Unnamed: 0', inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone     116 non-null float64
SolarR    146 non-null float64
Wind      153 non-null float64
Temp      153 non-null int64
Month     153 non-null int64
Day       153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


# Basic Imputation

The simplest approaches replace each missing value with the mean, median, or mode of its column.

Let's impute the NaN values in the Ozone and SolarR columns…

In [52]:
from sklearn.impute import SimpleImputer # imputer object from scikit-learn

# create imputer object
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# impute Ozone (ravel flattens the (n, 1) result back to one column)
df['Ozone'] = imp.fit_transform(df[['Ozone']]).ravel()

# impute SolarR
df['SolarR'] = imp.fit_transform(df[['SolarR']]).ravel()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone     153 non-null float64
SolarR    153 non-null float64
Wind      153 non-null float64
Temp      153 non-null int64
Month     153 non-null int64
Day       153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
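
The cell above uses the mean only. As a minimal sketch of how the three strategies differ, the following applies each one to a small made-up column (the values are hypothetical and are not taken from the air quality data). In scikit-learn, mode imputation is selected with `strategy='most_frequent'`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column: one missing value (row 3) and one outlier (10.0)
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # keep only the value that was filled in for the missing row
    filled[strategy] = float(imp.fit_transform(col)[3, 0])

print(filled)
```

On this toy column the mean fill is 3.75, pulled upward by the outlier, while the median and mode fills are both 2.0. That sensitivity to outliers is one reason to consider the median for skewed columns.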


# MICE Imputation

The basic idea of MICE imputation is to treat each variable with missing values as the dependent variable in a regression, with some or all of the remaining variables as its predictors. The MICE procedure cycles through these models, fitting each in turn, then uses a procedure called “predictive mean matching” (PMM) to generate random draws from the predictive distributions determined by the fitted models. These random draws become the imputed values for one imputed data set.
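
To make the predictive mean matching step more concrete, here is a simplified sketch for a single variable, written with NumPy only. The function name `pmm_impute` and the toy data are hypothetical, and this is not how statsmodels implements it: in particular, a full MICE procedure also perturbs the fitted regression parameters and cycles over all variables, which this sketch leaves out.

```python
import numpy as np

def pmm_impute(X, y, k=5, seed=None):
    """Fill NaNs in y by predictive mean matching on predictors X (simplified)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).copy()
    X = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    obs = ~np.isnan(y)

    # Fit a least-squares regression on the rows where y is observed
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta

    y_obs = y[obs]
    for i in np.flatnonzero(~obs):
        # Donors: the k observed rows whose predicted value is closest
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        # The imputed value is a random draw from the donors' observed values
        y[i] = rng.choice(y_obs[donors])
    return y

# Hypothetical toy data: y depends linearly on x, first five values missing
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y_full = 2.0 * x + rng.normal(scale=0.1, size=50)
y_missing = y_full.copy()
y_missing[:5] = np.nan

y_imputed = pmm_impute(x, y_missing, k=3, seed=0)
```

Because every imputed value is copied from an observed row, PMM never produces values outside the range of the observed data.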

First let's reload the dataset so that the NaNs return…

In [61]:
# re-read in data
df = pd.read_csv('airquality.csv')

# clean up data a bit
df.drop(columns='Unnamed: 0', inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone     116 non-null float64
SolarR    146 non-null float64
Wind      153 non-null float64
Temp      153 non-null int64
Month     153 non-null int64
Day       153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


Let's impute the NaN values in the SolarR column using MICE imputation…

In [80]:
from statsmodels.imputation import mice # mice imputer from statsmodels
import statsmodels.api as sm # for statsmodels linear regression

# Convert our data to a format that the function can handle (this is also the dataset which will be imputed upon)
miceData = mice.MICEData(df)

# regression formula
formula = 'SolarR ~ Ozone + Wind + Temp + Month + Day' # see miceData.conditional_formula to see all formulae

# Instantiate our MICE model (named miceModel so it does not shadow the imported mice module)
miceModel = mice.MICE(formula, sm.OLS, miceData)

# Fit the model
results = miceModel.fit(n_burnin=10, n_imputations=10)

print(results.summary())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


                           Results: MICE
Method:                    MICE        Sample size:          153    
Model:                     OLS         Scale                 6785.70
Dependent variable:        SolarR      Num. imputations      20     
--------------------------------------------------------------------
           Coef.   Std.Err.    t    P>|t|    [0.025   0.975]   FMI  
--------------------------------------------------------------------
Intercept  28.8272  82.0542  0.3513 0.7253 -131.9961 189.6505 0.0317
Ozone       0.7812   0.3467  2.2529 0.0243    0.1016   1.4608 0.1947
Wind        4.7025   2.3494  2.0015 0.0453    0.0977   9.3073 0.0356
Temp        2.1554   1.1280  1.9108 0.0560   -0.0555   4.3663 0.0806
Month     -10.7031   5.3572 -1.9979 0.0457  -21.2030  -0.2031 0.0131
Day        -1.0494   0.7699 -1.3630 0.1729   -2.5584   0.4596 0.0139



In [84]:
# put the imputed SolarR column from the miceData dataframe into our main dataframe
df['SolarR'] = miceData.data['SolarR']

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone     116 non-null float64
SolarR    153 non-null float64
Wind      153 non-null float64
Temp      153 non-null int64
Month     153 non-null int64
Day       153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB


Notice that SolarR no longer contains any NaNs. In fact, miceData.data has no NaNs in any column, because MICE imputes every variable with missing values as it cycles through the models. However, you should only use the column of the miceData dataset which was the dependent variable of the MICE model.

You would repeat this process (instantiating and fitting a MICE model) for every column you intend to impute.