# Fitted interpolation

When working with microdata, particularly at the consumer or firm level, missing values are very common. In some cases an entity legitimately omits a value, such as the interest expense line item on the income statement of a company with no debt. Very often, however, data that should exist is simply missing.

This phenomenon is particularly troublesome when constructing a balanced panel of entities that requires a complete feature vector for modelling, since a missing value in any feature excludes the entity from the panel. The result is an unnecessarily curtailed sample size.
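
To get a feel for how quickly the sample shrinks, the sketch below simulates a hypothetical panel in which each of 10 features independently loses about 10% of its values. The feature count, the missing share, and the independence of the gaps are assumptions made for illustration. Under those assumptions the share of fully observed rows is 0.9^10, or roughly 35%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_features, share_missing = 10_000, 10, 0.10  # illustrative values

# hypothetical panel: every feature independently loses ~10% of its values
data = pd.DataFrame(rng.normal(size=(n_rows, n_features)))
holes = rng.random(size=data.shape) < share_missing
data = data.mask(holes)

# share of rows with no missing feature: analytical vs. simulated
expected_complete = (1 - share_missing) ** n_features
observed_complete = len(data.dropna()) / n_rows

print(f"expected: {expected_complete:.3f}, observed: {observed_complete:.3f}")
```

Even a modest missing rate per feature therefore removes most rows once complete cases are required.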

Forward-filling missing values at the entity level is one option for dealing with this issue. A more broadly applicable alternative is to estimate every missing value from the non-missing features in the same panel and fill the holes with the estimates. The utility of this method over forward-filling depends on which model is used and how well it fits.
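
For comparison, entity-level forward-filling might look like the following minimal sketch. The `panel` frame, its column names, and its values are hypothetical and serve only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# hypothetical firm-level panel; names and values are illustrative only
panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "revenue": [10.0, np.nan, 12.0, np.nan, 7.0, np.nan],
})

panel = panel.sort_values(["firm", "year"])

# forward-fill within each firm so values never leak across entities
panel["revenue_ffill"] = panel.groupby("firm")["revenue"].ffill()

print(panel)
```

Firm B's first year stays missing because there is no earlier value to carry forward, and the approach needs a time dimension in the first place. A purely cross-sectional dataset has nothing to forward-fill from, which is one reason the fitted approach applies more broadly.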

In [109]:
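import numpy as np

def contaminate(vector, share=0.1):
    '''Return a copy of a pandas Series with `share` of its values set to NaN.'''
    # boolean mask with int(share * len(vector)) True entries at random positions
    cont = np.full(len(vector), False)
    cont[:int(share * len(vector))] = True
    np.random.shuffle(cont)

    # mask() returns a new Series (upcast to float if needed), so the input is untouched
    return vector.mask(cont)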

In [104]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns  # used for toy datasets

# dataset will consist of prices and other features of diamonds
df = sns.load_dataset('diamonds')
df['price'] = np.log(df['price'])  # work with log price from here on

# one-hot encode diamond cut, color, and clarity
for cut in set(df.cut):
    df[cut] = (df.cut==cut).astype(int)

for clarity in set(df.clarity):
    df[clarity] = (df.clarity==clarity).astype(int)

for color in set(df.color):
    df[color] = (df.color==color).astype(int)

# remove extra vars + lowest class of each feature
df = df.drop(['cut','color','clarity','depth','table','x','y','z'] +
             ['Fair','I1','J'], axis=1)

In [105]:
## Build simple cross-sectional OLS using full dataset (no missing vals)

endog = df.price  # already in logs (taken when loading) for better-behaved coefs

exog = df[[col for col in df.columns if col != 'price']]

model = sm.OLS(endog=endog, exog=exog).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | price | R-squared (uncentered): | 0.995 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.995 |
| Method: | Least Squares | F-statistic: | 617800.0 |
| Date: | Sun, 12 Dec 2021 | Prob (F-statistic): | 0.0 |
| Time: | 15:27:01 | Log-Likelihood: | -43846.0 |
| No. Observations: | 53940 | AIC: | 87730.0 |
| Df Residuals: | 53922 | BIC: | 87890.0 |
| Df Model: | 18 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| carat | 2.5671 | 0.005 | 484.423 | 0.000 | 2.557 | 2.577 |
| Premium | 1.0598 | 0.014 | 78.466 | 0.000 | 1.033 | 1.086 |
| Good | 1.0857 | 0.015 | 73.327 | 0.000 | 1.057 | 1.115 |
| Ideal | 1.1023 | 0.013 | 82.681 | 0.000 | 1.076 | 1.128 |
| Very Good | 1.0618 | 0.014 | 77.605 | 0.000 | 1.035 | 1.089 |
| VS2 | 3.5193 | 0.015 | 233.545 | 0.000 | 3.490 | 3.549 |
| VVS1 | 3.6904 | 0.017 | 217.145 | 0.000 | 3.657 | 3.724 |
| SI1 | 3.4032 | 0.015 | 226.476 | 0.000 | 3.374 | 3.433 |
| VS1 | 3.6069 | 0.015 | 234.252 | 0.000 | 3.577 | 3.637 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 28849.349 | Durbin-Watson: | 1.183 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 527173.935 |
| Skew: | 2.17 | Prob(JB): | 0.0 |
| Kurtosis: | 17.687 | Cond. No. | 22.1 |


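All features have highly significant positive coefficients, which is intuitive because the dummies removed were the lowest tier of each feature. Note that the model is fitted without an intercept (hence the uncentered R-squared), so the dummy coefficients also absorb the baseline level of log price. Now we will contaminate the dataset with missing values and fill the missing observations with estimates from this regression: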

In [110]:
## Now contaminate the dataset

for col in df.columns:
    df[col] = contaminate(df[col])