# Machine Learning Preprocessing: Handling Missing Data

In this notebook, we examine how to address missing data. Real-world datasets are often incomplete, and the gaps have to be handled before the data can be fed to a machine learning algorithm.

Sources:
1. <a href='https://www.udemy.com/course/machinelearning/'>Machine Learning A-Z™: Hands-On Python & R In Data Science</a>

## Load and Preview Data

In [1]:
# Define file path to data
purchases_file_path <- file.path('Data', 'Data.csv')

# Load data
purchases <- read.csv(purchases_file_path)

In [2]:
cat('Shape:')
dim(purchases)
cat('Figure 1.')

cat('\n\nPreview:')
head(purchases, 5)
cat('Figure 2.')

cat('\n\nStructure:\n')
str(purchases)
cat('Figure 3.')

cat('\n\nSummary:')
summary(purchases)
cat('Figure 4.')

cat('\n\nNulls:')
data.frame(colSums(is.na(purchases)))
cat('Figure 5.')

Shape:

Figure 1.

Preview:

Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,,Yes


Figure 2.

Structure:
'data.frame':	10 obs. of  4 variables:
 $ Country  : Factor w/ 3 levels "France","Germany",..: 1 3 2 3 2 1 3 1 2 1
 $ Age      : int  44 27 30 38 40 35 NA 48 50 37
 $ Salary   : int  72000 48000 54000 61000 NA 58000 52000 79000 83000 67000
 $ Purchased: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 1 2
Figure 3.

Summary:

    Country       Age            Salary      Purchased
 France :4   Min.   :27.00   Min.   :48000   No :5    
 Germany:3   1st Qu.:35.00   1st Qu.:54000   Yes:5    
 Spain  :3   Median :38.00   Median :61000            
             Mean   :38.78   Mean   :63778            
             3rd Qu.:44.00   3rd Qu.:72000            
             Max.   :50.00   Max.   :83000            
             NA's   :1       NA's   :1                

Figure 4.

Nulls:

,colSums.is.na.purchases..
Country,0
Age,1
Salary,1
Purchased,0


Figure 5.

## Address Missing Values

The preview in Figure 2 shows that the fifth record is missing a salary. The NA counts in Figure 5 show that the Age column also has one missing value, although it does not appear in the first five rows. We need to address both before moving forward.
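To see where the gaps are rather than only how many there are, `which()` can be combined with `is.na()`. The sketch below is a minimal illustration: the vectors `age` and `salary` are stand-ins rebuilt from the values printed in Figure 3, not read from `Data.csv`.

```r
# Values copied from the structure output in Figure 3 (stand-ins for the CSV)
age    <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)

# which() returns the positions where the condition is TRUE
missing_age_rows    <- which(is.na(age))
missing_salary_rows <- which(is.na(salary))

missing_age_rows     # record 7 has no age
missing_salary_rows  # record 5 has no salary
```

On the full data frame, the same idea would apply to `purchases$Age` and `purchases$Salary`.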

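There are a number of ways to address missing data. One is to simply drop every record that contains a missing value. However, this also discards the valid values in that record's other fields. Here the two missing values fall in different records, so dropping would remove 2 of our 10 records. For that reason, dropping data is generally discouraged.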
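For comparison, the cost of dropping records can be sketched with `na.omit()`. This is only an illustration of the alternative, not the approach taken in this notebook; `purchases_demo` is a hypothetical two-column data frame rebuilt from the values in Figure 3.

```r
# Hypothetical stand-in for the Age and Salary columns (values from Figure 3)
purchases_demo <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

# na.omit() removes every row that contains at least one NA
purchases_complete <- na.omit(purchases_demo)

rows_before <- nrow(purchases_demo)
rows_after  <- nrow(purchases_complete)

# Number of records lost by dropping
rows_before - rows_after
```

Each dropped record still held a valid value in its other column, which is the information lost by this approach.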

In this notebook, we will instead replace each missing age or salary with the mean of its column. We will use the ifelse() function to do this. The function is vectorized, so it evaluates the condition for every element, and it takes 3 parameters:
1. The condition to check
2. The value to return where the condition is TRUE
3. The value to return where the condition is FALSE
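A small toy example may make the three parameters concrete. The vector `x` below is made up for illustration and is not part of the dataset.

```r
# Toy vector with one missing value (illustrative only)
x <- c(10, NA, 30)

# Condition: is the element NA?
# If TRUE:   use the mean of the non-missing elements
# If FALSE:  keep the element as it is
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

x_filled
```

The `na.rm = TRUE` argument matters here: without it, `mean()` would itself return `NA` and nothing would be filled in.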

In [3]:
# View salaries and ages before addressing missing data
purchases$Salary
purchases$Age

In [4]:
# Replace missing salaries with the mean of the observed salaries
purchases$Salary <- ifelse(
  is.na(purchases$Salary),               # if the salary is missing
  mean(purchases$Salary, na.rm = TRUE),  # then return the mean salary
  purchases$Salary                       # else return the current record's salary
)

# Replace missing ages with the mean of the observed ages
purchases$Age <- ifelse(
  is.na(purchases$Age),                  # if the age is missing
  mean(purchases$Age, na.rm = TRUE),     # then return the mean age
  purchases$Age                          # else return the current record's age
)

In [5]:
# View salaries and ages after addressing missing data
purchases$Salary
purchases$Age

To examine what we achieved with this, we will preview the data again.

In [6]:
head(purchases, 5)
print('Figure 6.')

Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,63777.78,Yes


[1] "Figure 6."


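As can be seen in Figure 6, the fifth record's salary, which was previously missing, has been replaced by 63777.78, the mean salary reported in Figure 4. The missing age was in the seventh record, so its replacement falls outside this five-row preview.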
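Because the preview only covers five rows, a programmatic check is a useful complement. The sketch below wraps the same mean replacement in a helper and confirms that no NAs remain. The names `impute_mean` and `purchases_demo` are hypothetical, and the data is rebuilt from Figure 3 rather than loaded from the CSV.

```r
# Hypothetical stand-in for the numeric columns (values from Figure 3)
purchases_demo <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

# Hypothetical helper: replace NAs in a numeric vector with the vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every numeric column
numeric_cols <- sapply(purchases_demo, is.numeric)
purchases_demo[numeric_cols] <- lapply(purchases_demo[numeric_cols], impute_mean)

# Count the remaining NAs per column; every count should now be zero
remaining_nas <- colSums(is.na(purchases_demo))
remaining_nas

# Inspect the two values that were filled in
purchases_demo$Age[7]
purchases_demo$Salary[5]
```

A helper like this scales to datasets with many numeric columns, whereas the explicit `ifelse()` calls above are easier to read when only one or two columns are involved.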