# Final Project - Redoing Unit 2 Case Study in Python

___Zach Brown___  
___August 16, 2017___

## Abstract

The purpose of this case study is to analyze the impact of multiple imputation on the effectiveness of linear regression when run on an incomplete dataset.  A regression model was first fit to the data using listwise deletion.  Multiple imputation was then performed and a linear regression model was fit to the imputed dataset.  The mean squared error and R-squared values of the two models were then compared.  The model fit to the imputed data performed much better (mean squared error 5.80, R-squared 0.86) than the model fit to the listwise deletion dataset (mean squared error 16.77, R-squared 0.47).

___Keywords:___ Multiple Imputation, Missing Data, Linear Regression, Python, Fuel Economy, 38 Car Models

## Introduction

Missing values are very common in real datasets.  Incomplete survey results, sensor failures, and human error are just three of the many reasons why a dataset might be incomplete.  Missing values can decrease the power of an analysis, and some methods cannot be used at all with incomplete data.

In this case study, we will examine a dataset containing data related to the fuel economy of cars.  The features contained in this dataset include the make and model of the car, miles per gallon, cylinders, size, horsepower, weight, acceleration, and engine type.  The data in this dataset is incomplete.  The objective of this case study is to compare the results of linear regression models fit to this dataset before and after imputing the missing values and to determine if multiple imputation had a meaningful impact on the analysis.

The dataset used in this case study contains fuel economy data for 38 cars as measured in 2005.  The variables included in this dataset are detailed in Table 1.  Because the data contains missing values, it allows us to explore the impact of multiple imputation.  To perform the analysis, we will use the Anaconda distribution of Python version 3.5.2 along with the pandas, numpy, scikit-learn, and fancyimpute modules.

| __Variable__ | __Description__   |
|------|------|
|   Auto  | Make and model of car|
|   MPG  | Estimated miles per gallon|
|   CYLINDERS  | Number of cylinders in engine|
|   SIZE  | Engine displacement|
|   HP  | Horsepower|
|   WEIGHT  | Weight of car|
|   ACCEL  | Acceleration|
|   ENG_TYPE  | Engine type|
###### Table 1. Names and descriptions of the variables contained in the dataset

This case study extends the work that was performed in the Unit 2 case study.  Multiple imputation on the miles per gallon dataset was also examined in that case study, but all work was performed using SAS.  Given that SAS is commercial software that is not readily available outside of certain workplaces and educational institutions, reimplementing this work in an open source language such as Python is a worthwhile endeavor.  The main objective of this case study is to demonstrate that multiple imputation can be effectively implemented in Python and produce meaningful results.

## Methods and Results

Before any analysis can be done, the data must first be imported.  We will read the data from a csv file and store it in a Pandas dataframe.

In [2]:
import pandas as pd

carmpg = pd.read_csv('carmpg.csv')

We can easily see if any of the columns are missing values using the info() method in Pandas.

In [3]:
carmpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 8 columns):
Auto         38 non-null object
MPG          38 non-null float64
CYLINDERS    34 non-null float64
SIZE         35 non-null float64
HP           33 non-null float64
WEIGHT       32 non-null float64
ACCEL        34 non-null float64
ENG_TYPE     35 non-null float64
dtypes: float64(7), object(1)
memory usage: 2.5+ KB


We can see here that every feature except Auto and MPG is missing some values: there are 38 records, and Auto and MPG are the only columns with 38 non-null values.
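
As a complement to info(), the missing values can also be counted directly.  The following is a minimal sketch on a small made-up dataframe (the name `toy` and its values are hypothetical stand-ins, not rows from carmpg.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for carmpg: same style of columns, made-up values.
toy = pd.DataFrame({
    'MPG':       [16.9, 15.5, 30.0, 30.9],
    'CYLINDERS': [8.0, 8.0, np.nan, 4.0],
    'HP':        [155.0, np.nan, 68.0, np.nan],
})

# Number of missing cells in each column.
missing_per_column = toy.isnull().sum()

# Fraction of rows that contain at least one missing value.
rows_with_missing = toy.isnull().any(axis=1).mean()

print(missing_per_column)
print(rows_with_missing)
```

Applied to the real data, the same two expressions would be called on `carmpg` instead of `toy`.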

In [4]:
carmpg

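| | Auto | MPG | CYLINDERS | SIZE | HP | WEIGHT | ACCEL | ENG_TYPE |
|------|------|------|------|------|------|------|------|------|
| 0 | Buick Estate Wagon | 16.9 | 8.0 | 350.0 | 155.0 | 4.36 | 14.9 | 1.0 |
| 1 | Ford Country Sq. Wagon | 15.5 | 8.0 | 351.0 | NaN | 4.054 | 14.3 | 1.0 |
| 2 | Chevy Malibu Wagon | 19.2 | 8.0 | 267.0 | 125.0 | 3.605 | 15.0 | 1.0 |
| 3 | Chrys Lebaron Wagon | 18.5 | 8.0 | 360.0 | 150.0 | 3.94 | 13.0 | 1.0 |
| 4 | Chevette | 30.0 | 4.0 | 98.0 | 68.0 | 2.155 | 16.5 | 0.0 |
| 5 | Toyota Corona | 27.5 | 4.0 | 134.0 | 95.0 | 2.56 | 14.2 | 0.0 |
| 6 | Datsun 510 | 27.2 | 4.0 | 119.0 | 97.0 | 2.3 | 14.7 | 0.0 |
| 7 | Dodge Omni | 30.9 | 4.0 | 105.0 | 75.0 | 2.23 | 14.5 | NaN |
| 8 | Audi 5000 | 20.3 | 5.0 | 131.0 | NaN | 2.83 | 15.9 | 0.0 |
| 9 | Volvo 240 GL | 17.0 | 6.0 | 163.0 | 125.0 | 3.14 | 13.6 | 0.0 |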


Looking at the actual data makes the missing values even more apparent.  In the rows shown, they are scattered across different columns rather than following a monotone pattern, which is consistent with values that are missing completely at random, although inspection alone cannot confirm this.
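
One way to look for a pattern more systematically is to tabulate which combinations of columns are missing together.  The sketch below does this on a small made-up dataframe (`toy` and its values are hypothetical); each row is encoded as a string with one character per column, where 1 means missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns of carmpg.
toy = pd.DataFrame({
    'CYLINDERS': [8.0, 8.0, np.nan, 4.0, 4.0],
    'HP':        [155.0, np.nan, 68.0, np.nan, 75.0],
    'WEIGHT':    [4.36, 4.054, 2.155, 2.23, np.nan],
})

# Encode each row's missingness as a string such as "010" (1 = missing).
pattern = toy.isnull().astype(int).astype(str).apply(''.join, axis=1)

# How many rows share each missingness pattern.
pattern_counts = pattern.value_counts()
print(pattern_counts)
```

Many distinct patterns with small counts would suggest arbitrary missingness, whereas a monotone pattern would show up as strings in which every 1 is followed only by 1s.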

Before using multiple imputation, we will first run a baseline linear regression using listwise deletion, which simply means that any row containing a null value is removed.  We will compare the results of this model with those of a second linear regression that we will run later on the imputed data.

The first step is to remove rows with null values.

In [5]:
carmpg_listwise = carmpg.dropna(axis=0)

In [6]:
carmpg_listwise.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 37
Data columns (total 8 columns):
Auto         18 non-null object
MPG          18 non-null float64
CYLINDERS    18 non-null float64
SIZE         18 non-null float64
HP           18 non-null float64
WEIGHT       18 non-null float64
ACCEL        18 non-null float64
ENG_TYPE     18 non-null float64
dtypes: float64(7), object(1)
memory usage: 1.3+ KB


This new dataframe contains 18 rows instead of 38, and it no longer has any missing values.
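
Listwise deletion can discard far more data than the share of missing cells suggests.  In the info() output above, 25 of the 304 cells are missing (about 8%), yet 20 of the 38 rows are dropped.  The following sketch reproduces the effect on a small made-up dataframe (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: each incomplete row is missing only one value.
toy = pd.DataFrame({
    'CYLINDERS': [8.0, 8.0, np.nan, 4.0, 4.0],
    'HP':        [155.0, np.nan, 68.0, np.nan, 75.0],
    'WEIGHT':    [4.36, 4.054, 2.155, 2.23, np.nan],
})

# Listwise deletion: keep only rows with no missing values.
complete = toy.dropna(axis=0)

# Share of individual cells that are missing.
cells_missing = toy.isnull().values.mean()

# Share of rows lost to listwise deletion.
rows_dropped = 1 - len(complete) / len(toy)

print(cells_missing, rows_dropped)
```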

In [7]:
carmpg_listwise

| | Auto | MPG | CYLINDERS | SIZE | HP | WEIGHT | ACCEL | ENG_TYPE |
|------|------|------|------|------|------|------|------|------|
| 0 | Buick Estate Wagon | 16.9 | 8.0 | 350.0 | 155.0 | 4.36 | 14.9 | 1.0 |
| 2 | Chevy Malibu Wagon | 19.2 | 8.0 | 267.0 | 125.0 | 3.605 | 15.0 | 1.0 |
| 3 | Chrys Lebaron Wagon | 18.5 | 8.0 | 360.0 | 150.0 | 3.94 | 13.0 | 1.0 |
| 4 | Chevette | 30.0 | 4.0 | 98.0 | 68.0 | 2.155 | 16.5 | 0.0 |
| 5 | Toyota Corona | 27.5 | 4.0 | 134.0 | 95.0 | 2.56 | 14.2 | 0.0 |
| 6 | Datsun 510 | 27.2 | 4.0 | 119.0 | 97.0 | 2.3 | 14.7 | 0.0 |
| 9 | Volvo 240 GL | 17.0 | 6.0 | 163.0 | 125.0 | 3.14 | 13.6 | 0.0 |
| 14 | Dodge Aspen | 18.6 | 6.0 | 225.0 | 110.0 | 3.62 | 18.7 | 0.0 |
| 18 | Mercury Grand Marquis | 16.5 | 8.0 | 351.0 | 138.0 | 3.955 | 13.2 | 1.0 |
| 23 | Dodge Colt | 35.1 | 4.0 | 98.0 | 80.0 | 1.915 | 14.4 | 0.0 |


Now we are ready to run the baseline linear regression.  MPG will be the dependent variable and CYLINDERS, SIZE, HP, WEIGHT, ACCEL, and ENG_TYPE will be the independent variables.  The dataset will be split into a training set and a test set: two thirds of the data will be used to train the model and the remaining third will be used to test it.

In [8]:
import numpy as np
import sklearn as sk
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

First, the data needs to be split up into X and y variables for the independent and dependent variables respectively.

In [9]:
X = carmpg_listwise[['CYLINDERS','SIZE','HP','WEIGHT','ACCEL','ENG_TYPE']]
y = carmpg_listwise['MPG']

Next, the test and train split variables can be created.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)
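
The call above does not set random_state, so train_test_split draws a different split on every run and the scores change with it.  The sketch below, on made-up arrays with the same number of rows as the listwise deletion data, shows how a fixed seed makes the split repeatable (the seed value 0 is an arbitrary choice, not one used in this study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with 18 rows, like the listwise-deleted dataframe.
X_demo = np.arange(36).reshape(18, 2)
y_demo = np.arange(18)

# Fixing random_state makes the shuffle, and therefore the scores, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

# A second call with the same seed returns exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

With only 18 rows, test_size=0.33 leaves 6 rows in the test set, so a single split gives a noisy estimate of model quality.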

Now the regression object can be created and the model can be trained.

In [11]:
regr = linear_model.LinearRegression()

regr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that the model has been fit, it can be used to predict the y values, given the X values.

In [12]:
y_pred = regr.predict(X_test)

The predictions can be evaluated by looking at the coefficients, the mean squared error, and the R-squared score (labeled "Variance score" in the output).

In [13]:
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f'
     % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

Coefficients: 
 [-4.8970016   0.00534839  0.11775372 -3.090125   -0.90553568 -0.31554467]
Mean squared error: 42.05
Variance score: 0.05
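
The two metrics printed above can be computed by hand, which makes their meaning explicit.  The sketch below uses made-up values (`y_true` and `y_hat` are hypothetical) and follows the standard definitions: mean squared error is the average squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up observed and predicted values for illustration.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_hat

# Mean squared error: average of the squared residuals.
mse = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

Because R-squared is measured here on held-out data, it can be close to zero or even negative when the model predicts worse than the mean of the test values.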


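In the run reported in this study, this model produced a mean squared error of 16.77 and an R-squared score of 0.47.  The cell output shown above (42.05 and 0.05) differs from these figures, most likely because train_test_split is called without a fixed random_state and therefore draws a different split each time the cell is run.  We will use 16.77 and 0.47 as the baseline when comparing with the model that is fit to the imputed data.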

Next, we will impute the missing values in the original dataset and fit a new linear regression model to the imputed data.  To do this, we will use the MICE (Multiple Imputation by Chained Equations) function in the fancyimpute module, an iterative procedure related to Markov chain Monte Carlo methods.  MICE is called with its default settings; the log below reports 110 rounds, which is consistent with 100 imputations preceded by 10 burn-in rounds.  The MICE function requires the data to be provided as a float matrix, so we will need to remove the Auto column, which contains strings, and convert the remaining columns of our pandas dataframe to a numpy array.
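
To illustrate the idea behind chained equations, the sketch below fills each incomplete column by regressing it on the other columns and repeats this for several rounds.  It is a simplified, deterministic illustration written for this explanation, not the fancyimpute implementation: the function name `chained_equations_fill` and the parameter `n_rounds` are hypothetical, and a real multiple imputation would add random draws and produce several completed datasets.

```python
import numpy as np

def chained_equations_fill(data, n_rounds=10):
    """Simplified, deterministic chained-equations imputation (illustration only)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()

    # Start from column means so every regression has complete predictors.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_rounds):
        for j in range(data.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Regress column j on all other columns, using rows where j was observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Replace the missing entries of column j with the fitted predictions.
            filled[rows, j] = design[rows] @ coef
    return filled

# Toy matrix in which the second column is exactly twice the first.
toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, np.nan],
])
completed = chained_equations_fill(toy)
print(completed[4, 1])
```

In this toy case the regression recovers the exact linear relationship, so the missing entry is filled with a value of about 10 rather than the column mean of 5.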

In [14]:
import fancyimpute

In [53]:
# Drop the string-valued Auto column and pass the remaining columns as a float array.
X_filled_mice = fancyimpute.MICE().complete(np.asarray(carmpg.loc[:, carmpg.columns != 'Auto']))

[MICE] Completing matrix with shape (38, 7)
[MICE] Starting imputation round 1/110, elapsed time 0.000
[MICE] Starting imputation round 2/110, elapsed time 0.002
[MICE] Starting imputation round 3/110, elapsed time 0.003
[MICE] Starting imputation round 4/110, elapsed time 0.004
[MICE] Starting imputation round 5/110, elapsed time 0.005
[MICE] Starting imputation round 6/110, elapsed time 0.006
[MICE] Starting imputation round 7/110, elapsed time 0.007
[MICE] Starting imputation round 8/110, elapsed time 0.008
[MICE] Starting imputation round 9/110, elapsed time 0.008
[MICE] Starting imputation round 10/110, elapsed time 0.009
[MICE] Starting imputation round 11/110, elapsed time 0.010
[MICE] Starting imputation round 12/110, elapsed time 0.011
[MICE] Starting imputation round 13/110, elapsed time 0.012
[MICE] Starting imputation round 14/110, elapsed time 0.013
[MICE] Starting imputation round 15/110, elapsed time 0.014
[MICE] Starting imputation round 16/110, elapsed time 0.015
[MICE

The imputations ran successfully.  Now we will convert the resulting numpy array back to a pandas dataframe and restore the original column names.  The Auto column had to be removed before running the imputation because all values had to be of type float, so we will also reinsert that column.

In [58]:
X_filled_mice = pd.DataFrame(X_filled_mice)
X_filled_mice.columns = ['MPG','CYLINDERS','SIZE','HP','WEIGHT','ACCEL','ENG_TYPE']
X_filled_mice.insert(0, 'Auto', carmpg['Auto'])

X_filled_mice

| | Auto | MPG | CYLINDERS | SIZE | HP | WEIGHT | ACCEL | ENG_TYPE |
|------|------|------|------|------|------|------|------|------|
| 0 | Buick Estate Wagon | 16.9 | 8.0 | 350.0 | 155.0 | 4.36 | 14.9 | 1.0 |
| 1 | Ford Country Sq. Wagon | 15.5 | 8.0 | 351.0 | 147.0813 | 4.054 | 14.3 | 1.0 |
| 2 | Chevy Malibu Wagon | 19.2 | 8.0 | 267.0 | 125.0 | 3.605 | 15.0 | 1.0 |
| 3 | Chrys Lebaron Wagon | 18.5 | 8.0 | 360.0 | 150.0 | 3.94 | 13.0 | 1.0 |
| 4 | Chevette | 30.0 | 4.0 | 98.0 | 68.0 | 2.155 | 16.5 | 0.0 |
| 5 | Toyota Corona | 27.5 | 4.0 | 134.0 | 95.0 | 2.56 | 14.2 | 0.0 |
| 6 | Datsun 510 | 27.2 | 4.0 | 119.0 | 97.0 | 2.3 | 14.7 | 0.0 |
| 7 | Dodge Omni | 30.9 | 4.0 | 105.0 | 75.0 | 2.23 | 14.5 | -0.035942 |
| 8 | Audi 5000 | 20.3 | 5.0 | 131.0 | 91.397213 | 2.83 | 15.9 | 0.0 |
| 9 | Volvo 240 GL | 17.0 | 6.0 | 163.0 | 125.0 | 3.14 | 13.6 | 0.0 |


We now have a fully imputed version of the original data set with no missing values.  A linear regression model can now be fit to this data using the same method as before and the results can be compared with the model that was fit to the listwise deletion data.
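
One detail is visible in the imputed table: ENG_TYPE for the Dodge Omni is filled in as -0.035942, while every observed value shown is 0.0 or 1.0.  The sketch below rebuilds a dataframe from an imputed array and then rounds ENG_TYPE back to the nearest valid code.  The rounding step is an optional post-processing idea that was not part of this analysis, and it assumes ENG_TYPE is a 0/1 indicator, which the rows shown suggest but the data description does not state.

```python
import numpy as np
import pandas as pd

# Two rows taken from the imputed table above (subset of columns).
names = pd.Series(['Dodge Omni', 'Audi 5000'], name='Auto')
imputed = np.array([
    [30.9, 4.0, -0.035942],
    [20.3, 5.0, 0.0],
])

# Rebuild a labeled dataframe and put the Auto column back in front.
df = pd.DataFrame(imputed, columns=['MPG', 'CYLINDERS', 'ENG_TYPE'])
df.insert(0, 'Auto', names)

# Assumption: ENG_TYPE is a 0/1 indicator, so round and clip to that range.
df['ENG_TYPE'] = df['ENG_TYPE'].round().clip(0, 1)

print(df)
```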

First, the data needs to be split up into X and y variables for the independent and dependent variables respectively.

In [60]:
X = X_filled_mice[['CYLINDERS','SIZE','HP','WEIGHT','ACCEL','ENG_TYPE']]
y = X_filled_mice['MPG']

Next, the test and train split variables can be created.

In [61]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)

Now the regression object can be created and the model can be trained.

In [62]:
regr = linear_model.LinearRegression()

regr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that the model has been fit, it can be used to predict the y values, given the X values.

In [63]:
y_pred = regr.predict(X_test)

The predictions can be evaluated by looking at the coefficients, the mean squared error, and the R-squared score (labeled "Variance score" in the output).

In [64]:
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f'
     % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

Coefficients: 
 [-1.54823048  0.04370467 -0.2044515  -4.48942908 -0.7102286   3.34448808]
Mean squared error: 5.80
Variance score: 0.86


Fitting a linear regression model to the imputed miles per gallon dataset results in a mean squared error of 5.80 and an R-squared score of 0.86.  For reference, the baseline model fit to the listwise deletion dataset resulted in a mean squared error of 16.77 and an R-squared value of 0.47.  In this comparison, multiple imputation resulted in a substantial improvement to the model: the mean squared error was reduced by about two thirds and the R-squared value rose from under 0.5 to 0.86.
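
Each of the scores compared above comes from a single random train/test split of a small dataset, so the exact figures depend on which rows land in the test set.  One way to make such a comparison more stable is to average the metrics over many splits.  The sketch below does this on synthetic data; the helper `repeated_split_scores`, its parameters, and the generated data are all hypothetical and were not part of this analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def repeated_split_scores(X, y, n_repeats=50, test_size=0.33):
    """Average test-set MSE and R-squared over several seeded random splits."""
    mses, r2s = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_tr, y_tr)
        pred = model.predict(X_te)
        mses.append(mean_squared_error(y_te, pred))
        r2s.append(r2_score(y_te, pred))
    return np.mean(mses), np.std(mses), np.mean(r2s)

# Synthetic data with 38 rows: a linear signal plus a little noise.
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(38, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=38)

mean_mse, std_mse, mean_r2 = repeated_split_scores(X_syn, y_syn)
print(mean_mse, std_mse, mean_r2)
```

Running the same helper on the listwise deletion data and on the imputed data would give a mean and a spread for each model instead of one number per model.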

## Conclusion

It is very rare for a dataset to be complete in the real world with no missing values.  In many cases, this issue must be dealt with before it is worthwhile or even possible to perform the desired analysis on it.  If the chosen analysis method can be performed on a dataset with missing values, it will most likely have lower power than if it were to be run on a complete dataset.

The objective of this study was to determine whether multiple imputation could be implemented in Python and produce better results than listwise deletion when performing linear regression on the given fuel economy dataset.  This was indeed the case: the linear regression model had a much lower mean squared error and a higher R-squared value on the test data when the imputed dataset was used than when listwise deletion was used.

There is room to perform further work on the analysis conducted in this case study.  The fancyimpute Python module was used for multiple imputation, but there were some issues with it.  The fancyimpute module is a reasonably new open source project and was fairly difficult to install successfully.  It requires specific versions of many other software packages (not necessarily the latest versions) to be installed.  It also cannot accept pandas dataframes in its current incarnation.  This can be overcome by converting the dataframe to a numpy array, but that adds extra work.  PyMC is another Python package that can be used to perform multiple imputation.  Scikit-learn includes some rudimentary imputation functionality as well.  It may be fruitful to further investigate the imputation capabilities of these and other Python modules to determine if they are more robust than the current version of fancyimpute.
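
As a pointer for that follow-up work, the sketch below shows basic imputation in scikit-learn on a small made-up array.  It assumes scikit-learn 0.20 or later, where the class is named SimpleImputer in sklearn.impute; this is newer than the version used in this study.  Note that this performs single mean imputation, not multiple imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with two missing values.
X_demo = np.array([
    [8.0, 155.0],
    [8.0, np.nan],
    [4.0, 68.0],
    [np.nan, 95.0],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy='mean')
X_demo_filled = imputer.fit_transform(X_demo)

print(X_demo_filled)
```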