# Fitted interpolation

When working with microdata, particularly at the consumer or firm level, instances of missing values are highly common. In some cases an entity will neglect to report values where it makes sense to do so, such as an interest expense line item on an income statement of a company with no debt, however very often it is the case that data which should exist is simply missing. 

This phenomena can be particularly troublesome when trying to construct a balanced panel of entities and require a full-rank feature vector for modelling, since any missing values in any feature will exclude the entity from the panel. This results in unnecessarily curtailed sample size.

Forward-filling missing values at the entity level is one option for dealing with this issue, however a more broadly applicable alternative is to estimate every missing value using the non-missing features from the same panel and fill the holes with estimated values. This utility of this method over forward-filling is subject to how well the model fits, and which model is used.

In [109]:
import pandas as pd
import numpy as np
import seaborn as sns  # used for toy datasets

## load main dataset
df = sns.load_dataset('diamonds')

'''
TODO:
    - Contaminate y var and fill with linear model from carat alone  --> cross-validate
    - ^^ fit intercept --> cross-validate
    - random forest regressor --> cross-validate
    - compare best model to actual prices
'''

def contaminate(vector, share=0.1):
    '''contaminate one vector at a time with X% NaNs'''
    cont = np.full(len(vector), False)
    cont[:int(share * len(vector))] = True
    np.random.shuffle(cont)

    removed_values = vector.loc[cont] 
    vector.loc[cont] = np.nan

    return vector, removed_values

In [14]:
## Basic regression: ln(price) ~ carat + e

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

df.price = df.price.apply(np.log)

y = df.price
X = df.carat

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=True)

## score model via cross-validation  (tr = train, tt = test)

sse, nobs  = 0.0, 0  # initialize sum squared errors and nobs

for tr, tt in kf.split(df):
    Xtr = df.iloc[tr]['carat'].values
    ytr = df.iloc[tr]['price'].values
    Xtt = df.iloc[tt]['carat'].values
    ytt = df.iloc[tt]['price'].values

    model.fit(Xtr, ytr)

    sse += ((model.predict(Xtt) - ytt)**2).sum()
    nobs += len(tt)

mse = sse / nobs
mse

ValueError: Expected 2D array, got 1D array instead:
array=[0.23 0.21 0.23 ... 0.7  0.86 0.75].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [8]:
## Build simple cross-sectional OLS using full dataset (no missing vals)

y = df.price # take log of price for better behaved coefs

X = df[[col for col in df.columns if col != 'price']]

model = LinearRegression(fit_intercept=True)

## score model via cross-validation

for 

array([2.1974591 , 0.08413693, 0.05885531, 0.0569081 , 0.05043238,
       0.9418565 , 0.88271961, 0.93159691, 1.02586593, 0.81925217,
       0.54169027, 0.72508841, 0.52804129, 0.45171285, 0.16207184,
       0.5804062 , 0.31877876, 0.52474255])

All features had highly-significant positive coeficeients, which is intuitive as the dummies removed were the lowest-tier of each feature. Now we will contaminate the endogenous variable and fill the missing observations with estimates from this regression:

In [110]:
## Now contaminate the endog variable

df.price = contaminate(df.price)