# Fitted interpolation

When working with microdata, particularly at the consumer or firm level, missing values are very common. In some cases a value is legitimately absent, such as an interest expense line item on the income statement of a company with no debt. Very often, however, data that should exist is simply missing.

This phenomenon is particularly troublesome when constructing a balanced panel of entities for a model that requires a complete feature vector, since a missing value in any feature excludes the entity from the panel. The result is an unnecessarily curtailed sample size.

Forward-filling missing values at the entity level is one option for dealing with this issue. A more broadly applicable alternative is to estimate each missing value from the non-missing features in the same panel and fill the holes with the estimates. The utility of this method over forward-filling depends on which model is used and how well it fits.
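As a rough illustration of the two options, the sketch below builds a small synthetic panel, blanks out about 10% of one variable, and fills the gaps both ways. Everything in it is an assumption made for illustration: the column names (`entity`, `period`, `x`, `y`), the linear data-generating process, and the use of `np.polyfit` as the fitting step. Because `x` is drawn independently in every period, forward-filling does poorly here by construction; with a persistent series the comparison could look quite different.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical panel: 50 entities observed over 8 periods.
n_entities, n_periods = 50, 8
panel = pd.DataFrame({
    "entity": np.repeat(np.arange(n_entities), n_periods),
    "period": np.tile(np.arange(n_periods), n_entities),
})
panel["x"] = rng.uniform(0, 10, len(panel))
panel["y"] = 1.0 + 0.8 * panel["x"] + rng.normal(0, 0.2, len(panel))

# Blank out roughly 10% of y, keeping the true values for scoring.
missing = rng.random(len(panel)) < 0.1
truth = panel.loc[missing, "y"].copy()
panel.loc[missing, "y"] = np.nan

# Option 1: forward-fill within each entity (leading gaps stay missing).
ffilled = panel.groupby("entity")["y"].ffill()

# Option 2: fit y ~ b0 + b1*x on the complete rows, then predict the gaps.
obs = panel["y"].notna()
b1, b0 = np.polyfit(panel.loc[obs, "x"], panel.loc[obs, "y"], deg=1)
fitted = panel["y"].fillna(b0 + b1 * panel["x"])

# Score both fills against the withheld values.
mse_fitted = ((fitted[missing] - truth) ** 2).mean()
mse_ffill = ((ffilled[missing] - truth) ** 2).mean()  # mean() skips rows still NaN
print(f"fitted: {mse_fitted:.3f}  forward-fill: {mse_ffill:.3f}")
print(f"rows still missing after forward-fill: {ffilled.isna().sum()}")
```

Note that the fitted fill covers every gap, whereas forward-filling cannot fill an entity's first period when that observation is the missing one.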

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns  # used for toy datasets

## load main dataset
df = sns.load_dataset('diamonds')
df['price'] = np.log(df['price'])  # work with ln(price) throughout

'''
TODO:
    - Contaminate y var and fill with linear model from carat alone  --> cross-validate
    - ^^ fit intercept --> cross-validate
    - random forest regressor --> cross-validate
    - compare best model to actual prices
'''

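def contaminate(vector, share=0.1, seed=None):
    '''Return a copy of `vector` with `share` of its values set to NaN,
    together with the values that were removed.'''
    rng = np.random.default_rng(seed)  # pass a seed for reproducible masks

    # boolean mask with exactly int(share * n) True entries, randomly placed
    cont = np.full(len(vector), False)
    cont[:int(share * len(vector))] = True
    rng.shuffle(cont)

    removed_values = vector.loc[cont].copy()
    contaminated = vector.mask(cont)  # new Series; the caller's data is untouched

    return contaminated, removed_values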

In [8]:
## Basic regression: ln(price) ~ b1*carat + e

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# no intercept: the fitted line is forced through the origin
model = LinearRegression(fit_intercept=False)
kf = KFold(n_splits=5, shuffle=True, random_state=1996)

## score model via cross-validation  (tr = train, tt = test)

sse, nobs = 0.0, 0  # running sum of squared errors and observation count

for tr, tt in kf.split(df):
    Xtr = df.iloc[tr][['carat']]
    ytr = df.iloc[tr][['price']]
    Xtt = df.iloc[tt][['carat']]
    ytt = df.iloc[tt][['price']]

    model.fit(Xtr, ytr)

    sse += ((model.predict(Xtt) - ytt)**2).sum()
    nobs += len(tt)

mse = sse / nobs
mse

price    10.233595
dtype: float64

In [9]:
## Add an intercept to OLS: ln(price) ~ b0 + b1*carat + e

model = LinearRegression(fit_intercept=True)

sse, nobs = 0.0, 0  # running sum of squared errors and observation count
for tr, tt in kf.split(df):
    Xtr = df.iloc[tr][['carat']]
    ytr = df.iloc[tr][['price']]
    Xtt = df.iloc[tt][['carat']]
    ytt = df.iloc[tt][['price']]

    model.fit(Xtr, ytr)

    sse += ((model.predict(Xtt) - ytt)**2).sum()
    nobs += len(tt)

mse = sse / nobs
mse

price    0.157746
dtype: float64
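The drop in cross-validated MSE from about 10.23 to about 0.158 is what one would expect when the target sits far from zero: without an intercept the fitted line is forced through the origin, so the slope has to absorb the level of ln(price) as well. The sketch below reproduces the effect on synthetic data. The data-generating process and the helper name `cv_mse` are assumptions for illustration, and `cross_val_score` averages per-fold MSEs rather than pooling squared errors as the loops above do (the two agree when all folds are the same size, as they are here).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for the diamonds data: a target whose level is far
# from zero (like ln(price)) and one small-valued feature (like carat).
n = 1000
X = rng.uniform(0.2, 2.0, size=(n, 1))
y = 6.0 + 2.0 * X[:, 0] + rng.normal(0, 0.3, n)

kf = KFold(n_splits=5, shuffle=True, random_state=1996)

def cv_mse(fit_intercept):
    """Mean out-of-fold MSE for OLS with or without an intercept."""
    model = LinearRegression(fit_intercept=fit_intercept)
    scores = cross_val_score(model, X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    return -scores.mean()  # sklearn reports negated MSE, so flip the sign

mse_no_intercept = cv_mse(False)
mse_intercept = cv_mse(True)
print(f"no intercept: {mse_no_intercept:.3f}  with intercept: {mse_intercept:.3f}")
```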

In [None]:
## Random forest regression: ln(price) ~ 
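The random forest cell is left unfinished in the source, so what follows is only one possible shape for that step, not the author's specification. It cross-validates a `RandomForestRegressor` with the same `KFold` settings as the linear models, but on synthetic data with a single feature. The feature choice, the concave data-generating process, and the hyperparameters (`n_estimators=100`, `min_samples_leaf=5`) are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy data with a nonlinear (concave) relationship, the kind of
# shape a straight line fits poorly and a tree ensemble can pick up.
n = 500
X = rng.uniform(0.2, 3.0, size=(n, 1))
y = 6.0 + 2.0 * np.log(X[:, 0]) + rng.normal(0, 0.2, n)

kf = KFold(n_splits=5, shuffle=True, random_state=1996)

# Hyperparameters are illustrative guesses, not tuned values.
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                               random_state=1996)

# Out-of-fold MSE, averaged across the five folds.
scores = cross_val_score(forest, X, y, cv=kf,
                         scoring="neg_mean_squared_error")
rf_mse = -scores.mean()
print(f"random forest CV MSE: {rf_mse:.3f}")
```

On the real data the natural comparison would be against the cross-validated MSE of the intercept model, scored on the same folds.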