In [267]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
from sklearn import datasets
import random
from sklearn.model_selection import train_test_split

# Case 5: Imputation

Andrew Larsen  
Matthew Chinchilla  
Rikel Djoko  

## Business Case

Our objective is to remove values from a complete dataset under several missing-data scenarios, impute them, and evaluate the impact imputation has on model performance. For this exercise we use the California housing dataset that ships with the scikit-learn Python package. We explore three missing-data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MCAR, every value has the same chance of being missing, independent of any variable. Under MAR, the chance that a value is missing depends on another, observed variable, so missingness typically affects only a subset of the data but is random within that subset. Under MNAR, missingness depends on the missing value itself, so the missing data show a definite pattern. To compare the three scenarios, we first fit a simple linear regression model on the complete California housing data to set a baseline.
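
The three mechanisms can be illustrated on synthetic data. The sketch below is not part of the analysis: the variables `x` and `z`, the sample size and the probabilities are all assumptions chosen for illustration. It builds one boolean mask per mechanism and prints how the mean of the observed values shifts.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 1_000
x = rng.normal(size=n)           # hypothetical variable that will lose values
z = rng.normal(size=n)           # hypothetical fully observed companion variable

# MCAR: every row has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of missingness depends only on the observed z
# (40% when z is above its median, 0% otherwise).
mar_mask = (z > np.median(z)) & (rng.random(n) < 0.40)

# MNAR: missingness depends on the value of x itself
# (here, the top 25% of x is always missing).
mnar_mask = x > np.quantile(x, 0.75)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = x[~mask].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean of x = {observed_mean:.3f}")
```

Only the MNAR mask is guaranteed to bias the observed mean, because it removes values based on their own size.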

## Baseline Model Development

Since we are removing and imputing data from the AveRooms column and running the model on those dataframes, we need a baseline model to compare them to. In this section, we run a linear regression model on the intact dataframe. 

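Below, we fetch the data and organize it into a single dataframe. As you can see, there are 7 explanatory variables and one target column, labeled MedInc. Note that the assignment `cal['MedInc'] = california.target` overwrites the dataset's MedInc feature with its target, the median house value (in units of 100,000 dollars), so the column labeled MedInc throughout this report holds median house value. 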

In [312]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,4.526,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,3.585,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,3.521,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,3.413,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.422,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


Here, we check for NA's. There are no NA's in the base dataset, so we can move forward with modelling. 

In [270]:
#check for NA
cal.isna().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

Here, we separate the target variable from the explanatory variables, then run a train/test split. We split the data 60/40 so that we have enough data to fit a quality model, but also enough data to test it. 

In [271]:
cols = ['HouseAge','AveRooms','AveBedrms','Population','AveOccup','Latitude','Longitude']
X = cal[cols]
y = cal["MedInc"]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.4, random_state = 0)

Below, we save the training and testing indices from the split above. This ensures that any difference in the models will purely be the result of our imputations, rather than possibly being the result of random chance. 

In [313]:
train_X_index = train_X.index.values.astype(int)
test_X_index = test_X.index.values.astype(int)
train_y_index = train_y.index.values.astype(int)
test_y_index = test_y.index.values.astype(int)

Here, we create our baseline linear regression model. The coefficients of the model are listed below. 

In [273]:
lin_model = LinearRegression()
lin_model.fit(train_X, train_y)
coef_df = pd.DataFrame(lin_model.coef_, cols, columns = ["coef"])
coef_df

Unnamed: 0,coef
HouseAge,0.005633
AveRooms,0.363187
AveBedrms,-1.348481
Population,-1.1e-05
AveOccup,-0.003401
Latitude,-0.742928
Longitude,-0.738761


Here, we predict our test data using the model created by our training data. We show the Mean Absolute Error (MAE), Mean Squared Error (MSE) and R-squared (R2) of the model below. We will continue with these naming conventions for the rest of the report. Below, we define these measures:

|Metric|Metric Description|  
| :-|:-|
|Mean Squared Error (MSE)|Defined as $\frac{1}{n}\sum{(actual - predicted)^2}$. This metric is in units of the target variable squared. Lower is better. MSE is preferable to MAE when large errors are especially bad, because squaring penalizes them more heavily.|  
|Mean Absolute Error (MAE)|Defined as $\frac{1}{n}\sum |actual - predicted|$. This metric is in the same units as the target variable. Lower is better. MAE is easier to interpret and less sensitive than MSE to a few large errors.|  
|R^2|The proportion of variance in the target variable that is explained by the model. Higher is better.|
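
As a small worked example, the three metrics can be computed by hand with NumPy. The four actual and predicted values below are made up for illustration; scikit-learn's `mean_absolute_error`, `mean_squared_error` and `r2_score` average over the test rows in the same way.

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 6.0])       # one prediction is off by 2

errors = actual - predicted
mae = np.mean(np.abs(errors))                    # average absolute error
mse = np.mean(errors ** 2)                       # average squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  R2: {r2:.3f}")
# MAE: 0.500  MSE: 1.000  R2: 0.200
```

A single error of 2 contributes 2 to the absolute-error total but 4 to the squared-error total, which is why MSE reacts more strongly to large misses.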

In [275]:
##predict
y_predict = lin_model.predict(test_X)
orig_mae = mean_absolute_error(test_y, y_predict)
orig_mse = mean_squared_error(test_y, y_predict)
# orig_rmse_val = rmse(test_y, y_predict)
orig_r2 = r2_score(test_y, y_predict)
print("MAE: %.3f"%orig_mae)
print("MSE:  %.3f"%orig_mse)
# print("RMSE:  %.3f"%orig_rmse_val)
print("R2:  %.3f"%orig_r2)

MAE: 0.684
MSE:  0.825
R2:  0.382


Below, we create a dataframe with the results of the baseline model. This allows for easier, more organized comparison of MAE, MSE and R2 as we create more models. 

In [276]:
perf_frame = pd.DataFrame({'Data':'Original',
                   'Imputation':'None',
                   'MAE': orig_mae, 
                   'MSE': orig_mse, 
                   'R2':orig_r2}, index=[0])
perf_frame

Unnamed: 0,Data,Imputation,MAE,MSE,R2
0,original,none,0.684403,0.824646,0.381605


Below, we wrap the steps above in a function that creates and evaluates the model. The function reuses the saved train/test indices, fits the model, predicts on the test set and evaluates the predictions. We will use this function to create and evaluate models for our imputed data. 

In [278]:
def LR_pipeline(df, data_description, imputation_type):
    
    # 1. Rebuild the training and test sets from the saved split indices.
    #    Column 0 is MedInc, so columns 1 onward are the explanatory variables.
    train_X = df.iloc[train_X_index, 1:]
    train_y = df["MedInc"].iloc[train_y_index]
    # Every frame passed in has already been imputed, so dropna() is a no-op here.
    test_X = df.iloc[test_X_index, 1:].dropna()
    test_y = df["MedInc"].iloc[test_y_index]
    # 2. Create and fit the linear regression model
    lin_model = LinearRegression()
    lin_model.fit(train_X, train_y)
    coef_df = pd.DataFrame(lin_model.coef_, cols, columns = ["coef"])
    print(coef_df)
    # 3. Predict on the test set
    y_predict = lin_model.predict(test_X)
    # 4. Evaluate performance
    orig_mae = mean_absolute_error(test_y, y_predict)
    orig_mse = mean_squared_error(test_y, y_predict)
    orig_r2 = r2_score(test_y, y_predict)
    perf_frame = pd.DataFrame({'Data':data_description,
                   'Imputation Type':imputation_type,
                   'MAE': orig_mae, 
                   'MSE': orig_mse, 
                   'R2':orig_r2}, index=[0])
    return perf_frame
    

The output below confirms that the LR_pipeline function reproduces the baseline model: the coefficients and metrics match those reported above, so we may continue. 

In [279]:
df = LR_pipeline(cal,"Original", "None")
df

                coef
HouseAge    0.005633
AveRooms    0.363187
AveBedrms  -1.348481
Population -0.000011
AveOccup   -0.003401
Latitude   -0.742928
Longitude  -0.738761


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,Original,,0.684403,0.824646,0.381605


## Missing Completely at Random

In this section, we create and evaluate linear regression models to predict MedInc after data is removed completely at random. The data is removed by drawing a percentage of row indices at random and setting AveRooms to NaN for those rows. We removed 1%, 5%, 10%, 20%, 33% and 50% of the data from the AveRooms column, imputed the missing values with the mean of the remaining AveRooms values, then used the LR_pipeline function to create and evaluate a model for the MedInc variable after each imputation. 
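
The removal step is the same for every percentage, so it can be sketched as a single seeded helper. `make_mcar` is a hypothetical name that does not appear in the notebook, and the toy frame stands in for the housing data; this is a minimal sketch, not the code used to produce the results in this section.

```python
import numpy as np
import pandas as pd

def make_mcar(frame, column, frac, seed=0):
    """Return a copy of `frame` with `frac` of `column` set to NaN at random."""
    out = frame.copy()                    # leave the caller's frame intact
    rng = np.random.default_rng(seed)     # seeded so the draw is repeatable
    n_missing = round(frac * len(out))
    positions = rng.choice(len(out), size=n_missing, replace=False)
    out.iloc[positions, out.columns.get_loc(column)] = np.nan
    return out

# Toy stand-in for the housing frame (values are made up).
toy = pd.DataFrame({
    "AveRooms": np.linspace(3, 8, 200),
    "AveOccup": np.linspace(2, 4, 200),
})

for frac in [0.01, 0.05, 0.10, 0.20, 0.33, 0.50]:
    damaged = make_mcar(toy, "AveRooms", frac)
    print(f"{frac:.2f} -> {damaged['AveRooms'].isna().mean():.3f} missing")
```

Fixing the seed makes each run repeatable, which the unseeded `np.random.choice` calls in the cells below do not guarantee.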

### MCAR .01

Below, we randomly set 1% of the AveRooms data equal to NaN. The output confirms that about 1% of the AveRooms column is NaN.

In [280]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.01 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.000000
HouseAge      0.000000
AveRooms      0.009981
AveBedrms     0.000000
Population    0.000000
AveOccup      0.000000
Latitude      0.000000
Longitude     0.000000
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [281]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64
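
The fill above uses the mean of the whole column, which includes rows that later land in the test set. A variant that avoids this leakage computes the mean from the training rows only. The sketch below uses a made-up frame and hypothetical position arrays `train_pos` and `test_pos`; in this notebook, `train_X_index` would play the role of `train_pos`.

```python
import numpy as np
import pandas as pd

# Toy frame; in the report this would be the frame with NaNs in AveRooms.
frame = pd.DataFrame({"AveRooms": [4.0, np.nan, 6.0, 8.0, np.nan, 100.0]})
train_pos = np.array([0, 1, 2, 3])   # hypothetical training row positions
test_pos = np.array([4, 5])          # hypothetical test row positions

# Mean of the observed training values only: (4 + 6 + 8) / 3 = 6.0
train_mean = frame["AveRooms"].iloc[train_pos].mean()

# Fill every NaN, in both training and test rows, with the training mean.
imputed = frame.copy()
imputed["AveRooms"] = imputed["AveRooms"].fillna(train_mean)

print(train_mean)                      # 6.0
print(imputed["AveRooms"].tolist())    # [4.0, 6.0, 6.0, 8.0, 6.0, 100.0]
```

With the whole-column mean, the test value 100.0 would have pulled the fill value up to 29.5.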


Here, we create a linear regression model for the dataframe with 1% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [282]:
df_MCAR01 = LR_pipeline(mcardf,"1% Imputed", "MCAR")
df = pd.concat([df, df_MCAR01])  # DataFrame.append was removed in pandas 2.0
df_MCAR01

                coef
HouseAge    0.005522
AveRooms    0.359361
AveBedrms  -1.329722
Population -0.000011
AveOccup   -0.003391
Latitude   -0.740181
Longitude  -0.736374


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,1% Imputed,MCAR,0.685918,0.828244,0.378908


MAE, MSE and R2 are slightly worse than our baseline model. 

### MCAR .05

Below, we randomly set 5% of the AveRooms data equal to NaN. The output confirms that about 5% of the AveRooms column is NaN.

In [283]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.05 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.00
HouseAge      0.00
AveRooms      0.05
AveBedrms     0.00
Population    0.00
AveOccup      0.00
Latitude      0.00
Longitude     0.00
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [284]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 5% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [285]:
df_MCAR05 = LR_pipeline(mcardf,"5% Imputed", "MCAR")
df = pd.concat([df, df_MCAR05])  # DataFrame.append was removed in pandas 2.0
df_MCAR05

                coef
HouseAge    0.004791
AveRooms    0.335665
AveBedrms  -1.187112
Population -0.000017
AveOccup   -0.003308
Latitude   -0.733795
Longitude  -0.731269


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,5% Imputed,MCAR,0.694373,0.857249,0.357157


MAE, MSE and R2 are slightly worse than our baseline model. 

### MCAR .1

Below, we randomly set 10% of the AveRooms data equal to NaN. The output confirms that about 10% of the AveRooms column is NaN.

In [286]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.1 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.1
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [287]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 10% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [288]:
df_MCAR10 = LR_pipeline(mcardf,"10% Imputed", "MCAR")
df = pd.concat([df, df_MCAR10])  # DataFrame.append was removed in pandas 2.0
df_MCAR10

                coef
HouseAge    0.002853
AveRooms    0.250160
AveBedrms  -0.659887
Population -0.000025
AveOccup   -0.002802
Latitude   -0.738214
Longitude  -0.742101


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,10% Imputed,MCAR,0.708642,0.895135,0.328747


MAE, MSE and R2 are slightly worse than our baseline model. 

### MCAR .2

Below, we randomly set 20% of the AveRooms data equal to NaN. The output confirms that about 20% of the AveRooms column is NaN.

In [289]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.2 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.2
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [290]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 20% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [291]:
df_MCAR20 = LR_pipeline(mcardf,"20% Imputed", "MCAR")
df = pd.concat([df, df_MCAR20])  # DataFrame.append was removed in pandas 2.0
df_MCAR20

                coef
HouseAge    0.002426
AveRooms    0.277721
AveBedrms  -0.883117
Population -0.000028
AveOccup   -0.002313
Latitude   -0.728774
Longitude  -0.732040


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,20% Imputed,MCAR,0.722006,0.998445,0.251276


MAE is slightly worse than our baseline model, while MSE and R2 have a difference > .1 compared to our baseline model. 

### MCAR .33

Below, we randomly set 33% of the AveRooms data equal to NaN. The output confirms that about 33% of the AveRooms column is NaN.

In [292]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.33 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.00000
HouseAge      0.00000
AveRooms      0.32999
AveBedrms     0.00000
Population    0.00000
AveOccup      0.00000
Latitude      0.00000
Longitude     0.00000
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [293]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 33% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [294]:
df_MCAR33 = LR_pipeline(mcardf,"33% Imputed", "MCAR")
df = pd.concat([df, df_MCAR33])  # DataFrame.append was removed in pandas 2.0
df_MCAR33

                coef
HouseAge    0.001316
AveRooms    0.248511
AveBedrms  -0.756279
Population -0.000032
AveOccup   -0.002904
Latitude   -0.725245
Longitude  -0.728414


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,33% Imputed,MCAR,0.733787,0.997848,0.251723


MAE is slightly worse than our baseline model, while MSE and R2 have a difference > .1 compared to our baseline model. 

### MCAR .5

Below, we randomly set 50% of the AveRooms data equal to NaN. The output confirms that about 50% of the AveRooms column is NaN.

In [295]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mcardf = cal
nidx = round(.5 * len(cal))
dropidxs = np.random.choice(len(cal), nidx, replace=False)
mcardf.loc[dropidxs, 'AveRooms'] = np.nan
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.5
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [296]:
mcardf['AveRooms'] = mcardf['AveRooms'].fillna(mcardf['AveRooms'].mean())
print(mcardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 50% of the AveRooms column imputed with the mean after the data was missing completely at random. We append the results to the results dataframe, then show the results below. 

In [297]:
df_MCAR50 = LR_pipeline(mcardf,"50% Imputed", "MCAR")
df = pd.concat([df, df_MCAR50])  # DataFrame.append was removed in pandas 2.0
df_MCAR50

                coef
HouseAge    0.000180
AveRooms    0.176238
AveBedrms  -0.142944
Population -0.000035
AveOccup   -0.002399
Latitude   -0.744367
Longitude  -0.754163


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,50% Imputed,MCAR,0.750547,0.971963,0.271134


MAE is slightly worse than our baseline model, while MSE and R2 have a difference > .1 compared to our baseline model. 

## Missing at Random

In this section, we create and evaluate models for the MedInc variable when the data is missing at random. For data to be missing at random, the probability that a value is missing depends on another, observed variable. In our case, we randomly remove a percentage of the AveRooms values among rows where AveOccup is above its median. We set 10%, 20% and 30% of the AveRooms column to NaN, which corresponds to 20%, 40% and 60% of the rows where AveOccup is above the median. 
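
The arithmetic is that half of the rows lie above the median, so removing 20% of that half removes 0.5 × 0.2 = 10% of the column. A minimal sketch of the removal step follows; `make_mar` is a hypothetical helper name and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd

def make_mar(frame, target_col, driver_col, frac_of_upper_half, seed=0):
    """Blank `target_col` for a random share of rows whose `driver_col` is above its median."""
    out = frame.copy()
    eligible = out[out[driver_col] > out[driver_col].median()]
    chosen = eligible.sample(frac=frac_of_upper_half, replace=False, random_state=seed)
    out.loc[chosen.index, target_col] = np.nan
    return out

# Toy stand-in for the housing frame (values are made up).
toy = pd.DataFrame({
    "AveRooms": np.linspace(3, 8, 1000),
    "AveOccup": np.linspace(1, 5, 1000),
})

for frac in [0.2, 0.4, 0.6]:
    damaged = make_mar(toy, "AveRooms", "AveOccup", frac)
    print(f"{frac:.0%} of upper half -> {damaged['AveRooms'].isna().mean():.0%} of column")
```

Rows at or below the median of the driver column never lose a value, which is what distinguishes this scheme from MCAR.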

### MAR .1

Here, we randomly take 20% of the rows with an AveOccup above the median, find their indexes and set those rows' AveRooms to NaN. The output below confirms that 10% of the AveRooms column is NaN.

In [298]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mardf = cal
idxs = mardf[mardf['AveOccup'] > mardf['AveOccup'].quantile(.5)]
dropidxs = idxs.sample(frac = .2, replace = False)
mardf.loc[dropidxs.index, 'AveRooms'] = np.nan
print(mardf.isna().sum() / len(mardf))

MedInc        0.0
HouseAge      0.0
AveRooms      0.1
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [299]:
mardf['AveRooms'] = mardf['AveRooms'].fillna(mardf['AveRooms'].mean())
print(mardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 10% of the AveRooms column imputed with the mean after the data was missing at random. We append the results to the results dataframe, then show the results below. 

In [300]:
df_MAR10 = LR_pipeline(mardf,"10% Imputed", "MAR")
df = pd.concat([df, df_MAR10])  # DataFrame.append was removed in pandas 2.0
df_MAR10

                coef
HouseAge    0.003833
AveRooms    0.341528
AveBedrms  -1.243998
Population -0.000019
AveOccup   -0.002038
Latitude   -0.742475
Longitude  -0.743358


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,10% Imputed,MAR,0.701955,0.860282,0.354882


MAE, MSE and R2 are slightly worse than our baseline model. 

### MAR .2

Here, we randomly take 40% of the rows with an AveOccup above the median, find their indexes and set those rows' AveRooms to NaN. The output below confirms that 20% of the AveRooms column is NaN.

In [301]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mardf = cal
idxs = mardf[mardf['AveOccup'] > mardf['AveOccup'].quantile(.5)]
dropidxs = idxs.sample(frac = .4, replace = False)
mardf.loc[dropidxs.index, 'AveRooms'] = np.nan
print(mardf.isna().sum() / len(mardf))

MedInc        0.0
HouseAge      0.0
AveRooms      0.2
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [302]:
mardf['AveRooms'] = mardf['AveRooms'].fillna(mardf['AveRooms'].mean())
print(mardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 20% of the AveRooms column imputed with the mean after the data was missing at random. We append the results to the results dataframe, then show the results below. 

In [303]:
df_MAR20 = LR_pipeline(mardf,"20% Imputed", "MAR")
df = pd.concat([df, df_MAR20])  # DataFrame.append was removed in pandas 2.0
df_MAR20

                coef
HouseAge    0.001882
AveRooms    0.315516
AveBedrms  -1.121274
Population -0.000026
AveOccup   -0.003289
Latitude   -0.739904
Longitude  -0.743836


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,20% Imputed,MAR,0.720606,0.901491,0.32398


MAE, MSE and R2 are slightly worse than our baseline model. 

### MAR .3

Here, we randomly take 60% of the rows with an AveOccup above the median, find their indexes and set those rows' AveRooms to NaN. The output below confirms that 30% of the AveRooms column is NaN.

In [304]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mardf = cal
idxs = mardf[mardf['AveOccup'] > mardf['AveOccup'].quantile(.5)]
dropidxs = idxs.sample(frac = .6, replace = False)
mardf.loc[dropidxs.index, 'AveRooms'] = np.nan
print(mardf.isna().sum() / len(mardf))

MedInc        0.0
HouseAge      0.0
AveRooms      0.3
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [305]:
mardf['AveRooms'] = mardf['AveRooms'].fillna(mardf['AveRooms'].mean())
print(mardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 30% of the AveRooms column imputed with the mean after the data was missing at random. We append the results to the results dataframe, then show the results below. 

In [306]:
df_MAR30 = LR_pipeline(mardf,"30% Imputed", "MAR")
df = pd.concat([df, df_MAR30])  # DataFrame.append was removed in pandas 2.0
df_MAR30

                coef
HouseAge    0.000151
AveRooms    0.283949
AveBedrms  -0.987892
Population -0.000032
AveOccup   -0.002408
Latitude   -0.740424
Longitude  -0.748912


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,30% Imputed,MAR,0.734953,0.925965,0.305627


MAE and R2 are slightly worse than our baseline model, while MSE is >.1 worse than our baseline.  

## Missing Not at Random

Data is missing not at random (MNAR) when the chance that a value is missing depends on the value itself, so a particular range of the data is systematically missing. In our case, we set every AveRooms value above the 75th percentile to NaN, which sets 25% of the AveRooms column to NaN. 
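
Because this scheme removes only the largest values, the mean of what remains underestimates the true mean, and mean imputation inherits that bias. The sketch below shows the effect on made-up values 1 through 100; it illustrates the mechanism and is not the notebook's housing data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AveRooms": np.arange(1.0, 101.0)})    # made-up values 1..100
true_mean = toy["AveRooms"].mean()                           # 50.5

mnar = toy.copy()
cutoff = mnar["AveRooms"].quantile(0.75)                     # 75.25
mnar.loc[mnar["AveRooms"] > cutoff, "AveRooms"] = np.nan     # blanks 76..100

observed_mean = mnar["AveRooms"].mean()                      # mean of 1..75 = 38.0
mnar["AveRooms"] = mnar["AveRooms"].fillna(observed_mean)

print(true_mean, observed_mean, mnar["AveRooms"].max())
# 50.5 38.0 75.0
```

After imputation no value exceeds the cutoff, so the model never sees a large AveRooms value for the rows that originally had one.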

### MNAR .25

In the cell below, we set any value in the top 25% of AveRooms to be NaN. That is confirmed in the output below:

In [307]:
california = datasets.fetch_california_housing()
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedInc'] = california.target
cal.head()
mnardf = cal
idxs = (mnardf['AveRooms'] > mnardf['AveRooms'].quantile(.75))
mnardf.loc[idxs, 'AveRooms'] = np.nan
print(mnardf.isna().sum() / len(mnardf))

MedInc        0.000000
HouseAge      0.000000
AveRooms      0.249952
AveBedrms     0.000000
Population    0.000000
AveOccup      0.000000
Latitude      0.000000
Longitude     0.000000
dtype: float64


Below, we impute the NaNs in the AveRooms column using the mean of the remaining (non-missing) AveRooms values. The output confirms that we have no more NaNs in the dataframe.

In [308]:
mnardf['AveRooms'] = mnardf['AveRooms'].fillna(mnardf['AveRooms'].mean())
print(mnardf.isna().sum() / len(cal))

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
dtype: float64


Here, we create a linear regression model for the dataframe with 25% of the AveRooms column imputed with the mean after the data was missing not at random. We append the results to the results dataframe, then show the results below. 

In [309]:
df_MNAR25 = LR_pipeline(mnardf,"25% Imputed", "MNAR")
df = pd.concat([df, df_MNAR25])  # DataFrame.append was removed in pandas 2.0
df_MNAR25

                coef
HouseAge   -0.001056
AveRooms    0.183896
AveBedrms   0.137195
Population -0.000039
AveOccup   -0.001847
Latitude   -0.752665
Longitude  -0.759881


Unnamed: 0,Data,Imputation Type,MAE,MSE,R2
0,25% Imputed,MNAR,0.769174,0.996447,0.252773


MAE is slightly worse than our baseline model, while MSE and R2 have a difference > .1 compared to our baseline model. 

## Conclusion

Below, we add the difference of the MAE, MSE and R2 compared to the original, non-imputed dataset to the results dataframe. 

In [310]:
df['MAE Difference'] = df['MAE'] - df['MAE'].iloc[0]
df['MSE Difference'] = df['MSE'] - df['MSE'].iloc[0]
df['R2 Difference'] = df['R2'] - df['R2'].iloc[0]

Below, we have the results of the linear regression model after every imputation we did, as well as the difference in the results compared to the baseline. Here are some notes on the result dataframe: 
- Our baseline model did the best in terms of MAE, MSE and R2 compared to all of the imputations. The MAE got worse the more data was imputed, for each imputation type. 
- MSE got worse as the imputed percentage increased for MAR, but not for MCAR: the 33% and 50% models had a smaller MSE and higher R2 than the 20% model. Because MSE squares each error, it is inflated more than MAE by a few large errors, so the 20% model likely had more large errors than the 33% and 50% models. Imputing more values with the mean would pull the model's predictions toward the mean of MedInc and away from extreme values. Each MCAR run also drew a different, unseeded random sample of rows, so part of this difference may be sampling noise. 
- MCAR and MAR had similar MAE at comparable imputed percentages (10%, 20%, and 30% or 33%), but MAR had a smaller MSE and a higher R2. MNAR performed noticeably worse in terms of MAE (0.769 at 25% imputed, versus 0.722 for MCAR at 20%), but performed similarly to MCAR in terms of MSE and R2. This suggests that the MNAR model makes many moderate errors rather than a few very large ones. 
- In general, having fewer imputed NAs results in a better model, but there isn't a huge performance difference: every MCAR and MAR model has an MAE within 10% of the baseline MAE. 
- The worst performing model in terms of MAE was the MNAR at 25% imputed. Considering that it did worse than MCAR at 50% imputed, it seems that the missingness mechanism does have an impact on the performance of the model.
- We're surprised that having 50% of a variable randomly set to the mean increases the MAE by less than 10% and the MSE by about 18%. We expected those to be much higher.
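
The relative changes behind these notes can be checked directly from the reported metrics. The sketch below copies the baseline, MCAR 50% and MNAR 25% values from the outputs above; `pct_change` is a hypothetical helper name.

```python
# Metrics copied from the results reported in this notebook.
baseline = {"MAE": 0.684403, "MSE": 0.824646}
runs = {
    "MCAR 50%": {"MAE": 0.750547, "MSE": 0.971963},
    "MNAR 25%": {"MAE": 0.769174, "MSE": 0.996447},
}

def pct_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return 100 * (value - reference) / reference

for label, metrics in runs.items():
    mae_pct = pct_change(metrics["MAE"], baseline["MAE"])
    mse_pct = pct_change(metrics["MSE"], baseline["MSE"])
    print(f"{label}: MAE +{mae_pct:.1f}%, MSE +{mse_pct:.1f}%")
# MCAR 50%: MAE +9.7%, MSE +17.9%
# MNAR 25%: MAE +12.4%, MSE +20.8%
```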

In [311]:
df

Unnamed: 0,Data,Imputation Type,MAE,MSE,R2,MAE Difference,MSE Difference,R2 Difference
0,Original,,0.684403,0.824646,0.381605,0.0,0.0,0.0
0,1% Imputed,MCAR,0.685918,0.828244,0.378908,0.001515,0.003598,-0.002698
0,5% Imputed,MCAR,0.694373,0.857249,0.357157,0.009969,0.032603,-0.024449
0,10% Imputed,MCAR,0.708642,0.895135,0.328747,0.024238,0.070488,-0.052859
0,20% Imputed,MCAR,0.722006,0.998445,0.251276,0.037602,0.173798,-0.13033
0,33% Imputed,MCAR,0.733787,0.997848,0.251723,0.049384,0.173202,-0.129882
0,50% Imputed,MCAR,0.750547,0.971963,0.271134,0.066144,0.147317,-0.110471
0,10% Imputed,MAR,0.701955,0.860282,0.354882,0.017551,0.035636,-0.026723
0,20% Imputed,MAR,0.720606,0.901491,0.32398,0.036203,0.076845,-0.057625
0,30% Imputed,MAR,0.734953,0.925965,0.305627,0.050549,0.101319,-0.075978


Often, having a significant amount of a variable imputed is seen as an issue. We used the simplest of imputation strategies, setting NAs equal to the mean, and our results weren't too far off from the results of the baseline model. The missingness mechanism does have an effect on the results of the model: in our single MNAR scenario, MNAR was the worst type of missing data, while MAR and MCAR were fairly similar. Moving forward, we recommend trying different imputation strategies, such as KNN, a regression-based imputer, or the median instead of the mean, to see if it is possible to improve on these results. 
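
As a starting point for that follow-up, two of the suggested alternatives are available in scikit-learn as `SimpleImputer` and `KNNImputer`. The sketch below runs them on a small made-up array; the data and the choice of `n_neighbors=2` are assumptions for illustration, and none of this was evaluated in the report.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up two-column data; the second column has one missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 400.0],   # an outlier that pulls the mean upward
])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Compare the value each strategy puts in the missing cell.
print("mean:  ", mean_filled[2, 1])
print("median:", median_filled[2, 1])
print("KNN:   ", knn_filled[2, 1])
```

The median ignores the outlier that inflates the mean, while the KNN fill depends on which rows are closest in the other column.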