# Fitted imputation of missing values

When working with microdata, particularly at the consumer or firm level, missing values are very common. This phenomenon is particularly troublesome when constructing a balanced panel of entities that requires a complete feature vector for modelling, since a missing value in any single feature excludes the entity from the panel. The result is an unnecessarily reduced sample size.
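
As a minimal illustration of the problem, the sketch below builds a small hypothetical firm-level table (the column names and values are invented for this example, not taken from any real dataset) and shows how dropping incomplete rows shrinks the sample even though each feature is only sparsely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 firms, 3 features, one NaN in each feature
toy = pd.DataFrame({
    "revenue":   [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "employees": [10, 20, np.nan, 40, 50, 60],
    "assets":    [7.0, 8.0, 9.0, np.nan, 11.0, 12.0],
})

share_missing = toy.isna().mean()   # share of NaNs per feature
complete = toy.dropna()             # listwise deletion: keep only complete rows

print(share_missing)
print(f"{len(complete)} of {len(toy)} rows survive listwise deletion")
```

Each feature is missing in only one row out of six, yet half of the rows are lost because the gaps fall on different entities.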



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns  # used for toy datasets

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

## load main dataset
df = sns.load_dataset('diamonds')

'''
TODO:
    - Contaminate y var and fill with linear model from carat alone  --> cross-validate
    - ^^ fit intercept --> cross-validate
    - random forest regressor --> cross-validate
    - compare best model to actual prices
'''

def contaminate(vector, share=0.1):
    '''Return a copy of `vector` with a random `share` of values set to NaN,
    together with the values that were removed.'''
    cont = np.full(len(vector), False)
    cont[:int(share * len(vector))] = True
    np.random.shuffle(cont)

    # astype returns a new float Series, so NaN can be stored and the
    # caller's data is not modified in place
    vector = vector.astype(float)
    removed_values = vector.loc[cont].copy()
    vector.loc[cont] = np.nan

    return vector, removed_values


# contaminate endog var, simulate poor data quality
df['price'], removed_prices = contaminate(df['price'])

df_orig = df.copy()  # copy contaminated df for validation later

df = df.loc[~df.price.isna()]  # subset data to valid obs


In [3]:
## Basic regression: price ~ b0 + b1*carat + e  (price in levels; intercept is fitted)

y = df.price
X = df.carat

model = Pipeline([('pol', PolynomialFeatures(1)),
                  ('ols', LinearRegression(fit_intercept=True))])

kf = KFold(n_splits=5, shuffle=True, random_state=1996)

## score model via cross-validation  (tr = train, tt = test)

sse, nobs  = 0.0, 0  # initialize sum squared errors and nobs

for tr, tt in kf.split(df):
    Xtr = df.iloc[tr][['carat']]
    ytr = df.iloc[tr][['price']]
    Xtt = df.iloc[tt][['carat']]
    ytt = df.iloc[tt][['price']]

    model.fit(Xtr, ytr)

    sse += ((model.predict(Xtt) - ytt)**2).sum()
    nobs += len(tt)

mse = sse / nobs
rmse = np.sqrt(mse)  # root mean squared error (not the mean absolute error)
rmse.iloc[0]

1551.085480715225
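
The manual loop above can also be expressed with scikit-learn's `cross_val_score`. The sketch below is only an illustration under stated assumptions: it uses a small synthetic stand-in (`toy`, with invented coefficients) rather than the diamonds data, and it weights each fold by its size so that the result matches the pooled `sse / nobs` calculation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diamonds data (assumption: not the real dataset)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
price = 4000 * carat + rng.normal(0, 500, size=500)
toy = pd.DataFrame({"carat": carat, "price": price})

kf = KFold(n_splits=5, shuffle=True, random_state=1996)

# scikit-learn maximises scores, so the MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), toy[["carat"]], toy["price"],
                          scoring="neg_mean_squared_error", cv=kf)

# Weight each fold by its size to reproduce the pooled sse / nobs of a manual loop
fold_sizes = np.array([len(tt) for _, tt in kf.split(toy)])
rmse = np.sqrt(np.sum(-neg_mse * fold_sizes) / fold_sizes.sum())
print(rmse)
```

Because the simulated noise has a standard deviation of 500, the printed value should land in that neighbourhood.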

In [4]:
## Improve by adding a quadratic term: price ~ b0 + b1*carat + b2*carat^2 + e

model = Pipeline([('pol', PolynomialFeatures(2)),
                  ('ols', LinearRegression(fit_intercept=True))])

sse, nobs  = 0.0, 0  # initialize sum squared errors and nobs

for tr, tt in kf.split(df):
    Xtr = df.iloc[tr][['carat']]
    ytr = df.iloc[tr][['price']]
    Xtt = df.iloc[tt][['carat']]
    ytt = df.iloc[tt][['price']]

    model.fit(Xtr, ytr)

    sse += ((model.predict(Xtt) - ytt)**2).sum()
    nobs += len(tt)

mse = sse / nobs
rmse = np.sqrt(mse)  # root mean squared error (not the mean absolute error)
rmse.iloc[0]



1543.8955520236411
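
Once a model has been scored, the imputation step itself, and the comparison against the removed values planned in the TODO list, might look roughly like the following. This is a sketch on synthetic data (the `toy` frame and its coefficients are assumptions made for the example), not the notebook's final procedure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the diamonds data (assumption: not the real dataset)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
true_price = 1000 * carat**2 + 2000 * carat + rng.normal(0, 300, size=500)
toy = pd.DataFrame({"carat": carat, "price": true_price})

# Blank out 10% of prices, keeping the true values for validation
missing_idx = rng.choice(toy.index, size=50, replace=False)
removed = toy.loc[missing_idx, "price"].copy()
toy.loc[missing_idx, "price"] = np.nan

# Fit on the observed rows only
observed = toy["price"].notna()
model = Pipeline([("pol", PolynomialFeatures(2)),
                  ("ols", LinearRegression())])
model.fit(toy.loc[observed, ["carat"]], toy.loc[observed, "price"])

# Fill the gaps with fitted values
toy.loc[~observed, "price"] = model.predict(toy.loc[~observed, ["carat"]])

# Compare the imputed values with the values that were removed
imputation_rmse = np.sqrt(((toy.loc[missing_idx, "price"] - removed) ** 2).mean())
print(imputation_rmse)
```

Keeping the removed values aside, as `contaminate` does, is what makes this out-of-sample check possible.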

In [6]:
## Random forest regression: ln(price) ~ 

df

       carat        cut color clarity  depth  table   price     x     y     z
0       0.23      Ideal     E     SI2   61.5   55.0   326.0  3.95  3.98  2.43
1       0.21    Premium     E     SI1   59.8   61.0   326.0  3.89  3.84  2.31
2       0.23       Good     E     VS1   56.9   65.0   327.0  4.05  4.07  2.31
3       0.29    Premium     I     VS2   62.4   58.0   334.0  4.20  4.23  2.63
4       0.31       Good     J     SI2   63.3   58.0   335.0  4.34  4.35  2.75
...      ...        ...   ...     ...    ...    ...     ...   ...   ...   ...
53935   0.72      Ideal     D     SI1   60.8   57.0  2757.0  5.75  5.76  3.50
53936   0.72       Good     D     SI1   63.1   55.0  2757.0  5.69  5.75  3.61
53937   0.70  Very Good     D     SI1   62.8   60.0  2757.0  5.66  5.68  3.56
53938   0.86    Premium     H     SI2   61.0   58.0  2757.0  6.15  6.12  3.74
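
The frame above mixes numeric and categorical columns, so a random forest (the next item in the TODO list) would need the categorical features encoded first. The following is one possible sketch, not the notebook's actual model: it uses a synthetic frame with an invented price formula, one-hot encodes `cut`, and scores the forest using the 5-fold setup from earlier cells. The choice of `n_estimators=50` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in with one numeric and one categorical feature (assumption)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, size=n),
    "cut": rng.choice(["Good", "Premium", "Ideal"], size=n),
})
cut_bonus = toy["cut"].map({"Good": 0.0, "Premium": 500.0, "Ideal": 1000.0})
toy["price"] = 4000 * toy["carat"] + cut_bonus + rng.normal(0, 300, size=n)

# One-hot encode the categorical column, pass numeric columns through unchanged
prep = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), ["cut"])],
                         remainder="passthrough")
model = Pipeline([("prep", prep),
                  ("rf", RandomForestRegressor(n_estimators=50, random_state=0))])

kf = KFold(n_splits=5, shuffle=True, random_state=1996)
neg_mse = cross_val_score(model, toy[["carat", "cut"]], toy["price"],
                          scoring="neg_mean_squared_error", cv=kf)

# Folds are equal-sized here (400 rows / 5), so a plain mean equals the pooled MSE
rmse = np.sqrt(-neg_mse.mean())
print(rmse)
```

Wrapping the encoder and the forest in one `Pipeline` keeps the encoding inside each training fold, which avoids leaking information from the test fold.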
