# Working with Missing Data

**Utilizing the Boston Housing dataset**

By Daniel Serna, Bruce Granger, and Brandon de la Houssaye

In [374]:
# Import package dependencies
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
#from ml_metrics import rmse
from math import sqrt
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

In [375]:
# Load in the dataset
boston = load_boston()
print(boston.data.shape)

(506, 13)


In [376]:
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [377]:
# View the data descriptions
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [378]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

Convert the matrix to a pandas DataFrame

In [379]:

bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['MEDV'] = boston.target
bos.head()

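| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |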


In [380]:
bos.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


Start by fitting a linear regression model to the full dataset

**Create a training and testing split (e.g., a 70/30 split)**

Step 1:<br/>
Using sklearn, get the Boston Housing dataset. Fit a linear regressor to the data as a baseline.  
There is no need to do cross-validation; we are exploring the change in results.
What is the loss, and what are the goodness-of-fit parameters? This will be our baseline for comparison.

The question above is the first question of the assignment. The preceding code shows the ETL of the Boston Housing
dataset from sklearn. The steps below illustrate the initial linear regression (finding the linear regressor)
for this dataset, including the calculation of the loss and goodness-of-fit parameters. These metrics are then put
into a dataframe for review and referred to as the "baseline" hereinafter.
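
For reference, the loss and goodness-of-fit measures used throughout (MAE, MSE, RMSE, and R2) can be written out directly from their definitions. The sketch below is illustrative only: `regression_metrics` is a hypothetical helper and the toy values are made up, not taken from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mae = np.mean(np.abs(resid))                      # mean absolute error
    mse = np.mean(resid ** 2)                         # mean squared error
    rmse = np.sqrt(mse)                               # root mean squared error
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return {"mae": mae, "mse": mse, "rmse": rmse, "R2": r2}

# Toy values (hypothetical, for illustration only)
y_true_demo = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred_demo = [25.0, 22.0, 31.0, 30.0, 33.0]
demo_metrics = regression_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)
```

These hand-written versions should agree with the `sklearn.metrics` functions imported at the top of the notebook, which is what the cells below rely on.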


In [381]:
# Create training and testing sets (cross-validation not needed)
train_set = bos.sample(frac=0.7, random_state=100)
test_set = bos.drop(train_set.index)  # every row not sampled into the training set
print(train_set.shape[0])
print(test_set.shape[0])

354
152


In [382]:
train_set.head()

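| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198 | 0.03768 | 80.0 | 1.52 | 0.0 | 0.404 | 7.274 | 38.3 | 7.309 | 2.0 | 329.0 | 12.6 | 392.2 | 6.62 | 34.6 |
| 229 | 0.44178 | 0.0 | 6.2 | 0.0 | 0.504 | 6.552 | 21.4 | 3.3751 | 8.0 | 307.0 | 17.4 | 380.34 | 3.76 | 31.5 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 | 20.6 |
| 31 | 1.35472 | 0.0 | 8.14 | 0.0 | 0.538 | 6.072 | 100.0 | 4.175 | 4.0 | 307.0 | 21.0 | 376.73 | 13.04 | 14.5 |
| 315 | 0.25356 | 0.0 | 9.9 | 0.0 | 0.544 | 5.705 | 77.7 | 3.945 | 4.0 | 304.0 | 18.4 | 396.42 | 11.5 | 16.2 |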


Get the training and testing row indices for later use

In [383]:
train_index = train_set.index.values.astype(int)
test_index = test_set.index.values.astype(int)

Demonstration of using the row indices above to select consistent records

In [384]:
bos.iloc[train_index].head()

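| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198 | 0.03768 | 80.0 | 1.52 | 0.0 | 0.404 | 7.274 | 38.3 | 7.309 | 2.0 | 329.0 | 12.6 | 392.2 | 6.62 | 34.6 |
| 229 | 0.44178 | 0.0 | 6.2 | 0.0 | 0.504 | 6.552 | 21.4 | 3.3751 | 8.0 | 307.0 | 17.4 | 380.34 | 3.76 | 31.5 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 | 20.6 |
| 31 | 1.35472 | 0.0 | 8.14 | 0.0 | 0.538 | 6.072 | 100.0 | 4.175 | 4.0 | 307.0 | 21.0 | 376.73 | 13.04 | 14.5 |
| 315 | 0.25356 | 0.0 | 9.9 | 0.0 | 0.544 | 5.705 | 77.7 | 3.945 | 4.0 | 304.0 | 18.4 | 396.42 | 11.5 | 16.2 |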
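
One caveat on the demonstration above: passing index labels to `.iloc` only returns the intended rows because `bos` carries the default 0..n-1 index, so labels and positions coincide. A minimal sketch of the label-based alternative, using a small hypothetical frame (`demo`) whose labels are deliberately not positions:

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0..n-1
demo = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[100, 101, 102, 103])
train_demo = demo.sample(frac=0.5, random_state=0)
train_labels = train_demo.index

by_label = demo.loc[train_labels]      # label-based lookup: always the sampled rows
test_demo = demo.drop(train_labels)    # complement of the training rows

# Positional and label lookups agree only when the index is the default 0..n-1
default_index = demo.reset_index(drop=True)
same_rows = default_index.iloc[[0, 2]].equals(default_index.loc[[0, 2]])
print(by_label.equals(train_demo), len(test_demo), same_rows)
```

With a non-default index such as the one in `demo`, `demo.iloc[train_labels]` would raise an out-of-bounds error, so `.loc` is the safer choice whenever the stored values are labels.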


Converting the training and testing datasets back to matrix-formats

In [385]:
X_train = train_set.iloc[:, :-1].values # returns the data; excluding the target
Y_train = train_set.iloc[:, -1].values # returns the target-only
X_test = test_set.iloc[:, :-1].values # test data; excluding the target
Y_test = test_set.iloc[:, -1].values # test target-only

Fit a linear regression to the training data

In [386]:
reg = LinearRegression(normalize=True).fit(X_train, Y_train)
print(reg.score(X_train, Y_train))
print(reg.coef_)
print(reg.intercept_)
print(reg.get_params())

0.7478284701218886
[-1.35456753e-01  5.48606010e-02  5.46611167e-02  3.57648807e+00
 -2.01163242e+01  3.96567027e+00  1.33685712e-02 -1.48716658e+00
  2.99295349e-01 -9.83868843e-03 -9.45023886e-01  6.45207267e-03
 -5.77572297e-01]
36.079347688282304
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': True}


Find the variable with the largest positive coefficient value

In [387]:
# Largest positive coefficient; use max(reg.coef_, key=abs) for the largest magnitude instead
max_var = max(reg.coef_)
print('The max coef-value is {}'.format(max_var))
var_index = reg.coef_.tolist().index(max_var)
print('The variable associated with this coef-value is {}'.format(boston.feature_names[var_index]))

The max coef-value is 3.9656702708586273
The variable associated with this coef-value is RM
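
A raw coefficient's size depends on the units of its feature, so ranking variables by coefficient is more meaningful after the features are put on a common scale. The sketch below illustrates the idea on synthetic data; the feature names and the use of `StandardScaler` are assumptions made for illustration and are not part of the analysis above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two synthetic features on very different scales (hypothetical data)
small_scale = rng.normal(0.0, 0.1, n)     # e.g. a concentration-like feature
large_scale = rng.normal(0.0, 100.0, n)   # e.g. a tax-like feature
X_demo = np.column_stack([small_scale, large_scale])
y_demo = 5.0 * small_scale + 0.05 * large_scale + rng.normal(0.0, 0.1, n)

# Fit once on the raw features and once on standardized features
raw = LinearRegression().fit(X_demo, y_demo)
std = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

names = np.array(["small_scale", "large_scale"])
largest_raw = names[np.argmax(np.abs(raw.coef_))]   # largest magnitude, raw units
largest_std = names[np.argmax(np.abs(std.coef_))]   # largest magnitude, per standard deviation
print(largest_raw, largest_std)
```

In this toy setup the two rankings disagree: the small-scale feature has the larger raw coefficient, while the large-scale feature moves the target more per standard deviation.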


In [388]:
Y_pred = reg.predict(X_test)

orig_mae = mean_absolute_error(Y_test,Y_pred)
orig_mse = mean_squared_error(Y_test,Y_pred)
orig_rmse_val = sqrt(mean_squared_error(Y_test,Y_pred))
orig_r2 = r2_score(Y_test,Y_pred)
print("MAE: %.3f"%orig_mae)
print("MSE:  %.3f"%orig_mse)
print("RMSE:  %.3f"%orig_rmse_val)
print("R2:  %.3f"%orig_r2)

MAE: 3.605
MSE:  24.099
RMSE:  4.909
R2:  0.705


In [389]:
res_frame = pd.DataFrame({'data':'original',
                   'imputation':'none',
                   'mae': orig_mae, 
                   'mse': orig_mse, 
                   'rmse':orig_rmse_val, 
                   'R2':orig_r2,
                   'mae_diff':np.nan,
                   'mse_diff':np.nan,
                   'rmse_diff':np.nan,
                   'R2_diff':np.nan}, index=[0])

As a final output, the table below (object: 'res_frame') contains the loss and goodness-of-fit parameters for the regression.
As mentioned previously, this output will serve as the 'baseline' reference hereinafter.

In [390]:
res_frame

| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 3.604571 | 24.098505 | 4.909023 | 0.70494 | NaN | NaN | NaN | NaN |


**Question 2:**<br/> 
Select between 1, 5, 10, 20, 33, and 50% of your data in a single column (completely at random), 
replace the present value with a NaN, and then perform an imputation of that value.   
In each case, perform a fit with the imputed data and compare the loss and goodness of fit to your baseline.

To answer this question, we built a function that takes the percentage of observations to replace (at random)
with NaN for eventual comparison with the baseline. We chose to apply the NaN randomization to the "RM" column
and to impute the NaN values with the median of the remaining (untouched) values in that column. The function is
provided below; its final step returns a dataframe with the loss and goodness-of-fit metrics along with their
differences from the baseline metrics (e.g., the change in R2).
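
The core of that procedure (mask a random fraction of one column, then fill with the median of what remains) can be sketched in a few lines. This is a simplified illustration on synthetic data, not the function used for the results; `mask_and_impute_mcar` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def mask_and_impute_mcar(df, column, frac, seed=99):
    """Set a random fraction of `column` to NaN, then fill with the observed median."""
    out = df.copy()                                   # leave the caller's frame untouched
    masked_idx = out.sample(frac=frac, random_state=seed).index
    out.loc[masked_idx, column] = np.nan              # values removed completely at random
    n_missing = int(out[column].isna().sum())
    out[column] = out[column].fillna(out[column].median())  # median() skips NaN
    return out, n_missing

# Synthetic stand-in for the real data (values are made up)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 200), "MEDV": rng.normal(22.5, 9.0, 200)})
imputed_toy, n_missing = mask_and_impute_mcar(toy, "RM", frac=0.10)
print(n_missing, int(imputed_toy["RM"].isna().sum()))
```

Working on a copy and assigning through `.loc` keeps the original frame intact, which matters when the same data is reused for several missingness percentages.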

In [391]:
def imputeValuesQ2(imputePercent):
    # Randomly pick the rows whose RM value will be replaced
    in_sample = bos.sample(frac=imputePercent, random_state=99)
    # Remaining rows: isin() marks the sampled rows, which become NaN and are dropped
    out_sample = bos[~bos.isin(in_sample)].dropna()
    print(out_sample.shape[0] + in_sample.shape[0])
    print(bos.shape[0])

    # Blank out RM in the sampled rows, then impute with the median of the untouched rows
    in_sample['RM'] = np.nan
    in_sample['RM'] = in_sample['RM'].fillna(out_sample['RM'].median())

    # Reassemble the full dataset in its original row order
    imputed_data = pd.concat([in_sample, out_sample])
    imputed_data = imputed_data.sort_index()

    # Reuse the baseline train/test rows so the results are comparable
    train_set = imputed_data.iloc[train_index]
    test_set = imputed_data.iloc[test_index]
    X_train = train_set.iloc[:, :-1].values
    Y_train = train_set.iloc[:, -1].values
    X_test = test_set.iloc[:, :-1].values
    Y_test = test_set.iloc[:, -1].values

    reg2 = LinearRegression().fit(X_train, Y_train)
    print(reg2.score(X_train, Y_train))
    print(reg2.coef_)
    print(reg2.intercept_)
    print(reg2.get_params())
    Y_pred = reg2.predict(X_test)

    # Loss and goodness-of-fit metrics on the test set
    mae = mean_absolute_error(Y_test,Y_pred)
    mse = mean_squared_error(Y_test,Y_pred)
    rmse_val = sqrt(mse)
    r2 = r2_score(Y_test,Y_pred)
    print("MAE: %.3f"%mae)
    print("MSE:  %.3f"%mse)
    print("RMSE:  %.3f"%rmse_val)
    print("R2:  %.3f"%r2)

    # One-row summary, including differences from the baseline metrics
    temp_data_frame = pd.DataFrame({'data': str(imputePercent*100) + '% imputed',
                   'imputation':'MAR',
                   'mae': mae, 
                   'mse': mse, 
                   'rmse':rmse_val,
                   'R2':r2,
                   'mae_diff':mae-orig_mae,
                   'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val,
                   'R2_diff':r2-orig_r2
                   }, index=[0])

    print(temp_data_frame)
    return temp_data_frame
           

    

In [392]:
onePercentImpute = imputeValuesQ2(.01)
fivePercentImpute = imputeValuesQ2(.05)
tenPercentImpute = imputeValuesQ2(.10)
twentyPercentImpute = imputeValuesQ2(.20)
thirtyThreePercentImpute = imputeValuesQ2(.33)
fiftyPercentImpute = imputeValuesQ2(.50)

506
506
0.7468055150730042
[-1.35531204e-01  5.65566790e-02  5.35315368e-02  3.56399523e+00
 -2.01674165e+01  3.89739947e+00  1.26561538e-02 -1.51098229e+00
  3.03670567e-01 -1.00830077e-02 -9.46635169e-01  6.42105079e-03
 -5.82998396e-01]
36.85811612888831
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
MAE: 3.634
MSE:  24.207
RMSE:  4.920
R2:  0.704
           data imputation       mae        mse      rmse        R2  mae_diff  \
0  1.0% imputed        MAR  3.634187  24.206667  4.920027  0.703616  0.029616   

   mse_diff  rmse_diff   R2_diff  
0  0.108162   0.011004 -0.001324  
506
506
0.7549282762974729
[-1.36917661e-01  5.36955824e-02  6.78445803e-02  3.30005890e+00
 -1.94839169e+01  4.32318651e+00  1.04244282e-02 -1.45392446e+00
  2.87674290e-01 -9.82211060e-03 -9.14964592e-01  5.45539828e-03
 -5.89016396e-01]
33.4372026531222
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
MAE: 3.645
MSE:  24.653
RMSE:  4.965
R2:  0.698
    

Here is the final output for Question 2. Values of 'RM' were randomly replaced with NaN for 1%, 5%, 10%, 20%, 33%,
and 50% of the observations, respectively, and then imputed with the median. The loss and goodness-of-fit metrics for
each run are compared to the baseline in the table below. Note that the 'imputation' column reads 'MAR', although
the values in these runs were removed completely at random.

In [393]:
res_frameq2 = pd.concat([res_frame, onePercentImpute, fivePercentImpute, tenPercentImpute, twentyPercentImpute, thirtyThreePercentImpute, fiftyPercentImpute])
res_frameq2

| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 3.604571 | 24.098505 | 4.909023 | 0.70494 | NaN | NaN | NaN | NaN |
| 0 | 1.0% imputed | MAR | 3.634187 | 24.206667 | 4.920027 | 0.703616 | 0.029616 | 0.108162 | 0.011004 | -0.001324 |
| 0 | 5.0% imputed | MAR | 3.644905 | 24.652966 | 4.965175 | 0.698151 | 0.040334 | 0.554461 | 0.056153 | -0.006789 |
| 0 | 10.0% imputed | MAR | 3.659578 | 25.155434 | 5.015519 | 0.691999 | 0.055007 | 1.056929 | 0.106496 | -0.012941 |
| 0 | 20.0% imputed | MAR | 3.663828 | 25.05039 | 5.005036 | 0.693285 | 0.059256 | 0.951885 | 0.096014 | -0.011655 |
| 0 | 33.0% imputed | MAR | 3.725349 | 26.074701 | 5.106339 | 0.680743 | 0.120778 | 1.976196 | 0.197316 | -0.024196 |
| 0 | 50.0% imputed | MAR | 3.76216 | 26.861651 | 5.182823 | 0.671108 | 0.157589 | 2.763146 | 0.2738 | -0.033832 |
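
Every row in the table above carries the index 0 because each one-row frame was built with `index=[0]`. If distinct row labels are preferred, `pd.concat` accepts `ignore_index=True`. A small sketch with hypothetical stand-in frames (the `mae` values are rounded from the table above):

```python
import pandas as pd

# Hypothetical one-row result frames, each built with index=[0] as in the notebook
base = pd.DataFrame({"data": "original", "mae": 3.60}, index=[0])
run_a = pd.DataFrame({"data": "1.0% imputed", "mae": 3.63}, index=[0])
run_b = pd.DataFrame({"data": "5.0% imputed", "mae": 3.64}, index=[0])

# ignore_index=True discards the repeated 0 labels and renumbers the rows 0..n-1
stacked = pd.concat([base, run_a, run_b], ignore_index=True)
print(stacked)
```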


Moving on to Question 3, the first thing we need to decide is which variable to use as 'Z'.

'AGE' is an interesting variable, especially in Boston. The city has very strict building codes with respect to
older/historical buildings, especially when compared with other older Northeastern US cities. There is always a tension
between buyers' interest in 'history' or 'character' and their interest in 'new' and 'convenience'. For that reason,
'AGE' seems an interesting 'Z' variable, and it made sense to set the conditional (if) threshold to 77, close to the
median of AGE.

In [394]:
bos['AGE'].describe()

count    506.000000
mean      68.574901
std       28.148861
min        2.900000
25%       45.025000
50%       77.500000
75%       94.075000
max      100.000000
Name: AGE, dtype: float64

**Question 3:**<br/>
Take 2 different columns and create data “Missing at Random” when controlled for a third variable (i.e., if Variable Z is > 30, then Variables X, Y are randomly missing).  

Make runs with 10%, 20% and 30% missing data imputed via your best guess.  Repeat your fit and comparisons to the baseline.

We use AGE as the Z variable, with a threshold of 77 (approximately its median), and the columns with missing values are "RM" and "NOX". Randomly selected values in these columns are replaced with NaN and then imputed with the column median.

In [395]:
def imputeValuesQ3(imputePercent):
    # NOTE: bos['AGE'].all() collapses the column to a single boolean (True when every AGE is non-zero),
    # so this comparison is always true here; it does not select rows by AGE.
    if 77 > bos['AGE'].all():
        in_sample = bos.sample(frac=imputePercent, random_state=99)
        in_sample.shape
        out_sample = bos[~bos.isin(in_sample)].dropna()
        out_sample.shape
        print(out_sample.shape[0] + in_sample.shape[0])
        print(bos.shape[0])
        in_sample.head()
        in_sample['RM'] = np.nan
        in_sample['NOX'] = np.nan  #this is the second selected variable for application of the random application of NaN
        in_sample.head()
        out_sample['RM'].median()
        out_sample['NOX'].median()
        in_sample['RM'] = in_sample['RM'].fillna(out_sample['RM'].median())
        in_sample['NOX'] = in_sample['NOX'].fillna(out_sample['NOX'].median())
        in_sample.head()
        imputed_data = pd.concat([in_sample, out_sample])
        imputed_data = imputed_data.sort_index()
        imputed_data.head()
        train_set = imputed_data.iloc[train_index]
        test_set = imputed_data.iloc[test_index]
        train_set.head()
        X_train = train_set.iloc[:, :-1].values
        Y_train = train_set.iloc[:, -1].values
        X_test = test_set.iloc[:, :-1].values
        Y_test = test_set.iloc[:, -1].values
        reg2 = LinearRegression().fit(X_train, Y_train)
        print(reg2.score(X_train, Y_train))
        print(reg2.coef_)
        print(reg2.intercept_)
        print(reg2.get_params())
        Y_pred = reg2.predict(X_test)

        mae = mean_absolute_error(Y_test,Y_pred)
        mse = mean_squared_error(Y_test,Y_pred)
        rmse_val = sqrt(mean_squared_error(Y_test,Y_pred))
        r2 = r2_score(Y_test,Y_pred)
        print("MAE: %.3f"%mae)
        print("MSE:  %.3f"%mse)
        print("RMSE:  %.3f"%rmse_val)
        print("R2:  %.3f"%r2)
    
        temp_data_frame = pd.DataFrame({'data': str(imputePercent*100) + '% imputed',
                   'imputation':'MAR',
                   'mae': mae, 
                   'mse': mse, 
                   'rmse':rmse_val,
                   'R2':r2,
                   'mae_diff':mae-orig_mae,
                   'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val,
                   'R2_diff':r2-orig_r2
                   }, index=[0])
    
        print(temp_data_frame)
        return temp_data_frame
    
    else:
        pass    #this sets the false to, effectively, "do nothing"
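
As written, the test `77 > bos['AGE'].all()` reduces the whole column to one boolean before comparing, so the branch is taken for every call and the sampled rows are not restricted by AGE. One way to make the missingness depend on the control variable row by row is sketched below. This is an assumption-laden illustration on synthetic data: `mask_mar` is a hypothetical helper, and here `frac` is taken as a fraction of the eligible rows rather than of the whole dataset.

```python
import numpy as np
import pandas as pd

def mask_mar(df, control, threshold, targets, frac, seed=99):
    """Blank `targets` in a random `frac` of the rows where df[control] > threshold."""
    out = df.copy()
    eligible = out.index[(out[control] > threshold).to_numpy()]  # rows meeting the condition
    n_mask = int(round(frac * len(eligible)))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(eligible.to_numpy(), size=n_mask, replace=False)
    out.loc[chosen, targets] = np.nan
    return out

# A scalar check like the one in the function above is always true for non-zero ages
always_true = bool(77 > pd.Series([65.2, 78.9, 100.0]).all())

# Synthetic stand-in data (values are made up)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "AGE": rng.uniform(0, 100, 300),
    "RM": rng.normal(6.3, 0.7, 300),
    "NOX": rng.normal(0.55, 0.1, 300),
})
masked = mask_mar(toy, control="AGE", threshold=77, targets=["RM", "NOX"], frac=0.30)
missing_rows = masked["RM"].isna()
print(always_true, int(missing_rows.sum()), bool((toy.loc[missing_rows, "AGE"] > 77).all()))
```

The masked frame could then be imputed and fitted exactly as in the cells above; results from such a run would differ from the figures reported in this notebook.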
    

Store the three separate dataframes returned for the 10%, 20%, and 30% runs (values swapped to NaN), each named according to the percentage of swapped observations.

In [396]:
tenPercentImputeq3 = imputeValuesQ3(.10)
twentyPercentImputeq3 = imputeValuesQ3(.20)
thirtyPercentImputeq3 = imputeValuesQ3(.30)

506
506
0.7376725192682683
[-1.29548959e-01  5.70811931e-02  5.90935885e-03  3.14763392e+00
 -1.20775766e+01  3.92442473e+00  8.24676965e-03 -1.39313844e+00
  2.64590683e-01 -1.00143877e-02 -8.33334357e-01  5.01413805e-03
 -6.53061970e-01]
32.150447414480354
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
MAE: 3.710
MSE:  25.908
RMSE:  5.090
R2:  0.683
            data imputation       mae        mse      rmse        R2  \
0  10.0% imputed        MAR  3.710238  25.908192  5.090009  0.682782   

   mae_diff  mse_diff  rmse_diff   R2_diff  
0  0.105667  1.809687   0.180986 -0.022158  
506
506
0.7235173761350033
[-1.28817719e-01  6.16431141e-02 -2.60114156e-02  3.06191016e+00
 -8.01998102e+00  3.60372224e+00  1.06454311e-02 -1.33534905e+00
  2.65089570e-01 -1.01244074e-02 -7.91424076e-01  5.33034267e-03
 -7.08333765e-01]
31.725704828923362
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
MAE: 3.650
MSE:  25.030
RMSE:  5.003
R2:  0.694

Here is the final output for Question 3. The table compares the fits after 10%, 20%, and 30% of the 'RM' and 'NOX' values were replaced with NaN and imputed with the median, with the conditional set on AGE at 77 (approximately the variable's median). As a last step, the loss and goodness-of-fit metrics are compared with the baseline (i.e., before random replacement) results.

In [397]:
res_frameq3 = pd.concat([res_frame,tenPercentImputeq3, twentyPercentImputeq3, thirtyPercentImputeq3])
res_frameq3

| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 3.604571 | 24.098505 | 4.909023 | 0.70494 | NaN | NaN | NaN | NaN |
| 0 | 10.0% imputed | MAR | 3.710238 | 25.908192 | 5.090009 | 0.682782 | 0.105667 | 1.809687 | 0.180986 | -0.022158 |
| 0 | 20.0% imputed | MAR | 3.65037 | 25.029925 | 5.002992 | 0.693536 | 0.045799 | 0.93142 | 0.093969 | -0.011404 |
| 0 | 30.0% imputed | MAR | 3.725018 | 26.154783 | 5.114175 | 0.679763 | 0.120447 | 2.056278 | 0.205152 | -0.025177 |


**Question 4:**  

Create a Missing Not at Random pattern in which 25% of the data is missing for a single column.  Impute your data, fit the results and compare to a baseline.

For this step, we intentionally (i.e., not randomly) remove approximately 25% of a particular variable's observations. We continue to focus on the variable 'RM' and blank out the first 125 rows (the slice `[0:125]`), roughly a quarter of the 506 observations (506/4 = 126.5). As above, we impute the NaN values with the median for this column. After that is complete, we compute the loss and fit results and compare them to the baseline.
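
For contrast with the position-based removal used here, a missing-not-at-random pattern is often simulated by letting the missingness depend on the value itself. The sketch below is one such variant on synthetic data and rests on an assumption that is not part of this notebook: the largest 25% of the values are the ones that go missing. It shows how median imputation then pulls the column mean down.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column such as RM (values are made up)
rng = np.random.default_rng(3)
toy = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 400)})

# MNAR: the largest 25% of the values are the ones that go missing
cutoff = toy["RM"].quantile(0.75)
mnar = toy.copy()
mnar.loc[mnar["RM"] > cutoff, "RM"] = np.nan

observed_median = mnar["RM"].median()        # computed from the surviving values only
filled = mnar["RM"].fillna(observed_median)

# Compare the column mean before removal and after imputation
print(round(toy["RM"].mean(), 3), round(filled.mean(), 3))
```

Because every removed value lies above the cutoff and is replaced by a smaller one, the imputed column is biased low, which is the kind of distortion a value-dependent pattern introduces.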

In [398]:
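dfImputeQ4 = bos.copy()  # work on a copy so the original dataframe is not modified
rmMedian = dfImputeQ4['RM'].median()  # median taken before any values are removed

# Blank out RM for the first 125 rows (positions 0-124), then impute with the median
dfImputeQ4.iloc[0:125, dfImputeQ4.columns.get_loc('RM')] = np.nan
dfImputeQ4['RM'] = dfImputeQ4['RM'].fillna(rmMedian)

imputed_data1 = dfImputeQ4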
train_set = imputed_data1.iloc[train_index]
test_set = imputed_data1.iloc[test_index]        
X_train = train_set.iloc[:, :-1].values
Y_train = train_set.iloc[:, -1].values
X_test = test_set.iloc[:, :-1].values
Y_test = test_set.iloc[:, -1].values
reg2 = LinearRegression().fit(X_train, Y_train)

print(reg2.score(X_train, Y_train))
print(reg2.coef_)
print(reg2.intercept_)
print(reg2.get_params())

Y_pred = reg2.predict(X_test)
mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = sqrt(mean_squared_error(Y_test,Y_pred))
r2 = r2_score(Y_test,Y_pred)
print("MAE: %.3f"%mae)
print("MSE:  %.3f"%mse)
print("RMSE:  %.3f"%rmse_val)
print("R2:  %.3f"%r2)
    
Q4_data_frame = pd.DataFrame({'data': '-25% imputed',   #here is the new df containing the loss & goodness of fit metrics
                   'imputation':'MAR',
                   'mae': mae, 
                   'mse': mse, 
                   'rmse':rmse_val,
                   'R2':r2,
                   'mae_diff':mae-orig_mae,
                   'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val,
                   'R2_diff':r2-orig_r2
                   }, index=[0])
    


0.7358156464846262
[-1.31030453e-01  5.94942730e-02  2.48548877e-02  3.55346905e+00
 -2.04083178e+01  3.39619043e+00  1.72321127e-02 -1.57963740e+00
  3.04330531e-01 -9.12207278e-03 -9.69296188e-01  6.21381002e-03
 -6.48853079e-01]
41.28816076509389
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
MAE: 3.625
MSE:  24.956
RMSE:  4.996
R2:  0.694


Here is the final table comparing the loss and goodness-of-fit metrics for Q4 (intentional removal of roughly 25% of the 'RM' observations) against the baseline. The 'imputation' column still reads 'MAR', although the pattern in this run is missing not at random.

In [399]:
res_frameq4 = pd.concat([res_frame,Q4_data_frame])
res_frameq4

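| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 3.604571 | 24.098505 | 4.909023 | 0.70494 | NaN | NaN | NaN | NaN |
| 0 | -25% imputed | MAR | 3.624868 | 24.956094 | 4.995608 | 0.69444 | 0.020297 | 0.85759 | 0.086585 | -0.0105 |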


This concludes the defined assignment.