# Machine Learning Preprocessing: Handling Missing Data

In this notebook, we examine how to address missing data. Real-world datasets are often missing some values, and we have to handle those gaps before we can effectively train a machine learning algorithm on the data.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

In [1]:
# Import support libraries
import os

# Import analytical libraries
import pandas as pd
import numpy as np

# Import machine learning support
from sklearn.impute import SimpleImputer

## Load & Preview Data

In this notebook, we will preprocess purchase data by addressing missing values. Our dataset records a set of features for each client (country, age, and salary) along with whether that client purchased a product.

In [2]:
# Define data file path
purchases_file_path = os.path.join('Data', 'Data.csv')

# Load data
purchases = pd.read_csv(purchases_file_path)

In [3]:
# Preview data
display(purchases.head())

display(purchases.describe())

display(purchases.isna().sum())

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


| | Age | Salary |
|---|---|---|
| count | 9.0 | 9.0 |
| mean | 38.777778 | 63777.777778 |
| std | 7.693793 | 12265.579662 |
| min | 27.0 | 48000.0 |
| 25% | 35.0 | 54000.0 |
| 50% | 38.0 | 61000.0 |
| 75% | 44.0 | 72000.0 |
| max | 50.0 | 83000.0 |


Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [4]:
# Define features & labels
X = purchases.drop(columns='Purchased').values
y = purchases['Purchased'].values

## Address Missing Values

From just previewing the first five records, we can see that one record is missing a salary. Though it is not visible in the preview, the NA count shows that the age column also has a missing value. We need to address both before moving forward.

There are a number of ways to address missing data. One way is to simply drop records with missing data. However, this also discards the valid values those records hold in other fields, so dropping records is often discouraged, especially in a small dataset like this one.
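
To make the cost of dropping records concrete, here is a minimal sketch using a small hypothetical DataFrame. The name `toy` and its values are made up for illustration and are not the notebook's `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mirrors the shape of the purchase data
toy = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain'],
    'Age': [44.0, 27.0, np.nan, 38.0],
    'Salary': [72000.0, np.nan, 54000.0, 61000.0],
})

# Drop every record that has at least one missing value
dropped = toy.dropna()

print(len(toy), 'records before,', len(dropped), 'records after')

# The records that were dropped still held valid values in other columns
print(toy[toy.isna().any(axis=1)])
```

In this sketch, half of the records are lost even though each dropped record is missing only one field.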

In this notebook, we will instead replace missing age and salary values with the average of the respective column. We use the `SimpleImputer` class from scikit-learn (imported above), which replaces missing values with a defined value; in our case, that value is the column mean.

We configure the imputer object with two arguments:
1. `missing_values`: the placeholder that marks a value as missing (e.g. `np.nan`)
2. `strategy`: how the replacement value is computed (e.g. `'mean'`, `'median'`, or `'most_frequent'` for the mode)
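
As a rough illustration of how the choice of strategy changes the replacement value, the sketch below applies several strategies to one hypothetical column. The array `values` is an assumption made up for this example, and the `'constant'` strategy with `fill_value` is shown only for comparison; it is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single numeric column with one missing entry (row 3)
values = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(values)
    # Record the value that replaced the missing entry
    results[strategy] = filled[3, 0]

# 'constant' ignores the data and uses the supplied fill_value instead
constant_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0.0)
results['constant'] = constant_imputer.fit_transform(values)[3, 0]

for strategy, value in results.items():
    print(strategy, value)
```

The mean is pulled upward by the outlier at 10.0, while the median and mode are not, which is one reason to consider the distribution of a column before picking a strategy.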

In [5]:
# Create imputer object that replaces NaN values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [6]:
# Fit imputer object to the age & salary features, which are in the columns at index 1 and 2
imputer.fit(X[:, 1:3])

# Apply the imputer transformation based on fitted data
X[:, 1:3] = imputer.transform(X[:, 1:3])

To examine what we achieved, we will view the mean of the age and salary columns, then examine our features.

In [7]:
# Return mean age and salary
print(purchases['Age'].mean())
print(purchases['Salary'].mean())

# Examine features to identify imputed means
print('\n', X)

38.77777777777778
63777.77777777778

 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


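As seen above, the imputer calculated the mean of the age and salary columns from the non-missing values, then imputed these means into the records with missing data: the salary of 63777.78 in the fifth record and the age of 38.78 in the seventh record.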
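
One way to double-check an imputation like this is to inspect the statistics the imputer learned and confirm that no missing values remain. The sketch below does so on a small hypothetical array (`numeric` and its values are assumptions, not the notebook's data), and uses `fit_transform` as a shorthand for calling `fit` followed by `transform` on the same data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary values, not the notebook's dataset
numeric = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the column means and fill the gaps in one step
filled = imputer.fit_transform(numeric)

# Column means learned during fitting (computed from non-missing values only)
print(imputer.statistics_)

# Count of missing values left after imputation
print(np.isnan(filled).sum())
```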