# **CIS 520: Machine Learning**

## **Missing Data** 


- **Content Creator:** Kenneth Shinn, Siyun Hu
- **Content Reviewers:** Aditya Pratap Singh
- **Objectives:** This worksheet works through an example of missing-data imputation using both mean-based and regression-based methods, and compares the performance of these two techniques from lecture.


# Initialize Penn Grader

In [None]:
%%capture
!pip install penngrader

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import sys  # needed for sys.exit() in the autograder check below

In [None]:
# For autograder only, do not modify this cell. 
# True for Google Colab, False for autograder
NOTEBOOK = (os.getenv('IS_AUTOGRADER') is None)
if NOTEBOOK:
    print("[INFO, OK] Google Colab.")
else:
    print("[INFO, OK] Autograder.")
    sys.exit()

[INFO, OK] Google Colab.


In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW WHO
# TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 57931095 # YOUR PENN-ID GOES HERE AS AN INTEGER

In [None]:
import penngrader.grader

grader = penngrader.grader.PennGrader(homework_id = 'CIS_5200_202230_HW_Missing_Data_WS', student_id = STUDENT_ID)

PennGrader initialized with Student ID: 57931095

Make sure this correct or we will not be able to store your grade


In [None]:
import base64
import dill

# Helper function for the grading utilities
def grader_serialize(obj):
    '''Serialize a Python object with dill and return it as a base64-encoded UTF-8 string'''
    byte_serialized = dill.dumps(obj, recurse = True)
    return base64.b64encode(byte_serialized).decode("utf-8")

# **Data Preparation**

We will split a data set into training and test data. Entries are dropped at random from the feature matrix before the split, so both the training and test sets contain missing values, and we will use the two strategies to impute them. NOTE: the data in this worksheet are missing at random, meaning that whether an entry is dropped does not depend on any of the data values. Then, we will train regression models on each of the imputed training sets and see how they perform on the held-out test set!

In [None]:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from random import random
from random import seed

Let's create a dataset using sklearn's make_regression function. This function randomly generates a data set for a regression problem. 

In [None]:
X, y = make_regression(n_samples = 100000, n_features = 4)
X_missing = X.copy()

Now, let's randomly drop values from the data set: each row has a 30% chance of losing one randomly chosen feature. The dropped values are replaced with np.nan.

In [None]:
seed(1)
for i in range(len(X)):
    if random() < .3:
        X_missing[i, int(random() * 4)] = np.nan
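
As a quick sanity check on the dropping scheme above, the following sketch applies the same rule to a small synthetic array and measures how much data ends up missing. The array size, the `rng` generator, and the `*_demo` names are assumptions for illustration. It uses NumPy's random generator rather than the `random` module, so it will not reproduce the exact mask from the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative seed, not the worksheet's
X_demo = rng.normal(size=(10_000, 4))    # stand-in for X
X_demo_missing = X_demo.copy()

# Each row is selected with probability 0.3 ...
drop_rows = rng.random(len(X_demo)) < 0.3
# ... and one of its four columns is chosen uniformly at random.
drop_cols = rng.integers(0, 4, size=len(X_demo))
X_demo_missing[drop_rows, drop_cols[drop_rows]] = np.nan

nan_mask = np.isnan(X_demo_missing)
row_fraction = nan_mask.any(axis=1).mean()   # share of rows with a missing entry
cell_fraction = nan_mask.mean()              # share of all entries that are missing
per_row_max = nan_mask.sum(axis=1).max()     # at most one missing entry per row

print(f"rows with a missing value: {row_fraction:.3f}")
print(f"cells missing overall:     {cell_fraction:.3f}")
```

Because at most one entry is dropped per row, roughly 30% of rows but only about 7.5% of individual entries should be missing.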


In [None]:
# split the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_missing, X_test_missing, y_train, y_test = train_test_split(X_missing, y, test_size=0.3, random_state=42)

In [None]:
# make sure everything looks good!
print(X_train_missing[:10])
print(X_train[:10])

[[-0.02322197  0.43696322  0.16850924  0.35625915]
 [-2.02072205         nan -0.85961145 -0.56427015]
 [ 0.66788048  0.44718598  0.91242955 -1.12755079]
 [ 1.30721872 -1.31701916  0.15915672 -0.67072238]
 [ 0.9207756  -0.53624338  1.58523065  1.49596327]
 [-1.17671915 -0.7447005  -0.08264724 -1.01993779]
 [ 0.95636031  0.15601228  1.52141585 -0.81046941]
 [ 0.06524279  1.77406856  0.14961796  0.98130446]
 [ 1.14780442  1.49477241 -0.29783578 -1.0252665 ]
 [        nan -0.79242102 -0.69546528  1.66362747]]
[[-0.02322197  0.43696322  0.16850924  0.35625915]
 [-2.02072205 -1.52540395 -0.85961145 -0.56427015]
 [ 0.66788048  0.44718598  0.91242955 -1.12755079]
 [ 1.30721872 -1.31701916  0.15915672 -0.67072238]
 [ 0.9207756  -0.53624338  1.58523065  1.49596327]
 [-1.17671915 -0.7447005  -0.08264724 -1.01993779]
 [ 0.95636031  0.15601228  1.52141585 -0.81046941]
 [ 0.06524279  1.77406856  0.14961796  0.98130446]
 [ 1.14780442  1.49477241 -0.29783578 -1.0252665 ]
 [-0.80292517 -0.79242102 -0.6

## **Simple Mean Based Imputation**

Let's perform mean-based imputation on the training set and the test set separately. Each split gets its own imputer, so no statistics are shared between them. In particular, statistics computed from the test set must never be used to impute the training set, because that would cause *data leakage*.

In [None]:
mean_imp_train = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imp_test = SimpleImputer(missing_values=np.nan, strategy='mean')

X_train_mean_imp = mean_imp_train.fit_transform(X_train_missing)
X_test_mean_imp = mean_imp_test.fit_transform(X_test_missing)
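
To make the mechanics concrete, here is a minimal sketch of what mean imputation computes, using tiny made-up arrays (the values and the names `train`, `test`, and `imputer` are hypothetical, not the worksheet data). It also shows a common alternative convention, which is not the one used in this worksheet: fit the imputer on the training split only and reuse its column means on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny illustrative arrays (hypothetical values).
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0]])
test = np.array([[np.nan, 50.0]])

# strategy='mean' learns one mean per column, ignoring NaNs.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train_filled = imputer.fit_transform(train)

# The learned statistics match a NaN-aware column mean.
manual_means = np.nanmean(train, axis=0)

# Alternative convention: reuse the training-set means on the test set.
test_filled = imputer.transform(test)

print(imputer.statistics_)
print(train_filled)
print(test_filled)
```

With either convention, the key point is that nothing learned from the test split flows back into the training data.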

## **Regression Based Imputation**

Next, we perform a regression-based imputation on the dataset with missing values. Recall that regression-based imputation predicts the missing entries of a given column from the other columns.

In [None]:
reg_imp_train = IterativeImputer(missing_values=np.nan)
reg_imp_test = IterativeImputer(missing_values=np.nan)

X_train_reg_imp = reg_imp_train.fit_transform(X_train_missing)
X_test_reg_imp = reg_imp_test.fit_transform(X_test_missing)
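
The idea behind regression-based imputation can be sketched by hand for a single column. The example below is a simplified single pass with ordinary least squares on synthetic data; it is not `IterativeImputer`'s actual procedure, which cycles over the columns repeatedly and uses a different default estimator. All names and the data-generating setup are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)   # x1 depends strongly on x0
X_demo = np.column_stack([x0, x1])

# Hide about 20% of column 1.
missing = rng.random(n) < 0.2
X_demo_missing = X_demo.copy()
X_demo_missing[missing, 1] = np.nan

# Fit column 1 on column 0 using only the rows where column 1 is observed.
observed = ~missing
model = LinearRegression().fit(X_demo_missing[observed, :1],
                               X_demo_missing[observed, 1])

# Regression imputation: predict the hidden entries from column 0.
X_reg = X_demo_missing.copy()
X_reg[missing, 1] = model.predict(X_demo_missing[missing, :1])

# Mean imputation for comparison.
X_mean = X_demo_missing.copy()
X_mean[missing, 1] = np.nanmean(X_demo_missing[:, 1])

# Imputation error against the true (hidden) values.
err_reg = np.mean((X_reg[missing, 1] - X_demo[missing, 1]) ** 2)
err_mean = np.mean((X_mean[missing, 1] - X_demo[missing, 1]) ** 2)
print(f"regression imputation error: {err_reg:.4f}")
print(f"mean imputation error:       {err_mean:.4f}")
```

In this sketch the regression does well only because `x1` was constructed to depend on `x0`. When the columns are unrelated, the fitted slope is near zero and the prediction collapses toward the column mean.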

## *Question 1*


In [None]:
#@markdown Comparing these two imputation methods, which do you think will perform better? Why?
ans1 = 'Regression Based Imputation will perform better. As it makes use of other columns to continuously make a prediction of the missing values, it is expected to be more accurate in comparison to blindly substituting the mean of the feature values.' #@param {type:"string"}

In [None]:
grader.grade(test_case_id = 'test_imputation', answer = ans1)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


## **Training and Testing on the Imputed Datasets**

Now we can test your hypothesis experimentally. Let's train an OLS regression on each of the imputed data sets and compare their test MSE.



In [None]:
# training on mean imputed data
mean_imp_lm = LinearRegression().fit(X_train_mean_imp, y_train)
y_pred_mean_imp = mean_imp_lm.predict(X_test_mean_imp)
mse_mean_imp = mean_squared_error(y_test, y_pred_mean_imp)

print("Mean Imputed Data MSE: " + str(mse_mean_imp))

# training on regression imputed data
reg_imp_lm = LinearRegression().fit(X_train_reg_imp, y_train)
y_pred_reg_imp = reg_imp_lm.predict(X_test_reg_imp)
mse_reg_imp = mean_squared_error(y_test, y_pred_reg_imp)

print("Regression Imputed Data MSE: " + str(mse_reg_imp))

# training on full X_train data
lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("True X_train MSE: " + str(mse))

Mean Imputed Data MSE: 1039.0120582518593
Regression Imputed Data MSE: 1039.0904507534844
True X_train MSE: 2.5145436839611363e-26
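
The near-zero MSE on the fully observed data is consistent with `make_regression`'s default of `noise=0.0`, under which the target is an exact linear function of the features. The sketch below illustrates this on a smaller sample; the sample size, the `random_state` values, and the `noise=5.0` setting are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Default noise=0.0: y is an exact linear combination of the features.
X_clean, y_clean = make_regression(n_samples=1_000, n_features=4, random_state=0)
lm_clean = LinearRegression().fit(X_clean, y_clean)
mse_clean = mean_squared_error(y_clean, lm_clean.predict(X_clean))

# With Gaussian noise of standard deviation 5, the MSE floor is about 5**2 = 25.
X_noisy, y_noisy = make_regression(n_samples=1_000, n_features=4,
                                   noise=5.0, random_state=0)
lm_noisy = LinearRegression().fit(X_noisy, y_noisy)
mse_noisy = mean_squared_error(y_noisy, lm_noisy.predict(X_noisy))

print(f"MSE without noise: {mse_clean:.3e}")
print(f"MSE with noise=5:  {mse_noisy:.3f}")
```

This suggests that essentially all of the error in the two imputed models above comes from the imputed entries rather than from the regression itself.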


## *Question 2*

Observe the results above and answer the following questions:


In [None]:
#@markdown Which imputation technique produced the lower MSE? Why do you think that this is the case? Is this what you expected?
ans2 = 'Mean imputation produced a slightly lower MSE. This is not what I initially expected, although MSE was not the metric I first had in mind when approaching the problem. Mean imputation replaces each missing value with the mean of its feature column, which reduces the variance of the imputed feature and hence the MSE.' #@param {type:"string"}

#@markdown Think about why regression imputation didn't work significantly better here despite having "more information". (Hint: the x variables of the data-generating model are independent.)
ans3 = 'The variables are independent of each other and hence have no real correlation, so the other columns carry no information about the missing one. The regression predictions therefore add error that accumulates over the imputed features.' #@param {type:"string"}

#@markdown How might these MSE results change if there was a slight correlation between the x variables?
ans4 = 'Regression imputation will work better' #@param {type:"string"}

#@markdown Which imputation technique do you think would work better in the real world? Why?
ans5 = 'In the real world, regression imputation will work better but requires more computation for larger datasets. Hence mean imputation is a popular choice.' #@param {type:"string"}


In [None]:
grader.grade(test_case_id = 'test_observations', answer = [ans2, ans3, ans4, ans5])

Correct! You earned 4.0/4.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.
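
To explore the correlation question from Question 2, the following sketch repeats the experiment on synthetic features that are correlated with each other and compares how closely each imputer recovers the hidden values. The correlation level `rho = 0.8`, the sample size, and all variable names are assumptions for illustration; a smaller `rho` would narrow the gap between the two methods.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
n, rho = 5_000, 0.8

# Four standard-normal features with pairwise correlation rho.
cov = np.full((4, 4), rho)
np.fill_diagonal(cov, 1.0)
X_corr = rng.multivariate_normal(np.zeros(4), cov, size=n)

# Same dropping rule as the worksheet: 30% of rows lose one random feature.
X_corr_missing = X_corr.copy()
drop_rows = rng.random(n) < 0.3
drop_cols = rng.integers(0, 4, size=n)
X_corr_missing[drop_rows, drop_cols[drop_rows]] = np.nan
mask = np.isnan(X_corr_missing)

X_mean = SimpleImputer(strategy='mean').fit_transform(X_corr_missing)
X_reg = IterativeImputer(random_state=0).fit_transform(X_corr_missing)

# Mean squared error of the imputed entries against the true hidden values.
err_mean = np.mean((X_mean[mask] - X_corr[mask]) ** 2)
err_reg = np.mean((X_reg[mask] - X_corr[mask]) ** 2)
print(f"mean imputation error:       {err_mean:.3f}")
print(f"regression imputation error: {err_reg:.3f}")
```

Here the comparison is made on the imputed entries directly, rather than on the downstream regression MSE, to isolate the effect of feature correlation on imputation quality.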
