## Adding a variable to capture NA

In previous notebooks we learnt how to replace missing values by the mean or median, or by a value extracted at random from the variable. In other words, we learnt about mean / median imputation and random sample imputation. These methods assume that the data are missing completely at random (MCAR).

There are other methods that can be used when values are not missing at random, for example arbitrary value imputation or end of distribution imputation. However, these imputation techniques will affect the variable distribution dramatically, and are therefore not suitable for linear models.

**So what can we do if data are not MCAR and we want to use linear models?**

If data are not missing at random, it is a good idea to replace missing observations by the mean / median / mode AND **flag** those missing observations with a **Missing Indicator**. A Missing Indicator is an additional binary variable, which indicates whether the data were missing for an observation (1) or not (0).
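As a minimal sketch of what such an indicator looks like, the snippet below flags the missing observations of a toy variable. The dataframe `toy` and its column names are hypothetical and are not part of the car dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy variable with two missing observations
toy = pd.DataFrame({'wheel_base': [88.6, np.nan, 99.8, np.nan, 99.4]})

# 1 if the value is missing for that observation, 0 otherwise
toy['wheel_base_NA'] = toy['wheel_base'].isnull().astype(int)

print(toy)
```

Note that the indicator only flags the missing observations: the original variable still contains the NaN and has to be imputed separately.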


### For which variables can I add a missing indicator?

We can add a missing indicator to both numerical and categorical variables. 

#### Note

A missing indicator is never used alone. It is always used together with another imputation technique, which can be mean / median imputation for numerical variables, or frequent category imputation for categorical variables. We can also combine random sample imputation with a missing indicator, for both categorical and numerical variables.

Commonly used together:

- Mean / median imputation + missing indicator (Numerical variables)
- Frequent category imputation + missing indicator (Categorical variables)
- Random sample Imputation + missing indicator (Numerical and categorical)
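The first two combinations in the list above could be sketched roughly as follows. The helper `impute_with_indicator` and the toy train and test sets are hypothetical, introduced only for illustration; the imputation value is learnt from the train set and then applied to both sets.

```python
import numpy as np
import pandas as pd

def impute_with_indicator(train, test, variable):
    """Add <variable>_NA to both sets, then fill NA with a value learnt from train."""
    train, test = train.copy(), test.copy()

    # median for numerical variables, most frequent category otherwise
    if pd.api.types.is_numeric_dtype(train[variable]):
        value = train[variable].median()
    else:
        value = train[variable].mode()[0]

    for df in (train, test):
        # the indicator must be created before the NA are replaced
        df[variable + '_NA'] = df[variable].isnull().astype(int)
        df[variable] = df[variable].fillna(value)

    return train, test, value

# hypothetical toy data
train = pd.DataFrame({
    'wheel-base': [88.6, np.nan, 99.8, 99.4],
    'aspiration': ['std', 'std', np.nan, 'turbo'],
})
test = pd.DataFrame({
    'wheel-base': [np.nan, 96.5],
    'aspiration': [np.nan, 'turbo'],
})

train, test, median = impute_with_indicator(train, test, 'wheel-base')
train, test, mode = impute_with_indicator(train, test, 'aspiration')
```

The order of the two steps inside the loop matters: once the NA have been filled, the information about which observations were missing can no longer be recovered from the variable itself.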

### Assumptions

- Data is not missing at random
- Missing data are predictive

### Advantages

- Easy to implement
- Captures the importance of missing data, if there is any

### Limitations

- Expands the feature space
- Original variable still needs to be imputed to remove the NaN

Adding a missing indicator creates 1 additional variable for each variable in the dataset that has missing values. So if the dataset contains 10 features, and all of them have missing values, after adding the missing indicators we will have a dataset with 20 features: the original 10 features plus 10 additional binary features, which indicate for each of the original variables whether the value was missing or not. This may not be a problem in datasets with tens to a few hundred variables, but if our original dataset contains thousands of variables, adding a variable to indicate NA for each of them will leave us with very big datasets.
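The growth of the feature space can be illustrated with a small sketch. The dataset below is randomly generated and purely hypothetical: 10 features, each with one missing value, to which one indicator per feature is added.

```python
import numpy as np
import pandas as pd

# hypothetical dataset: 50 observations, 10 features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 10)),
                 columns=[f'var_{i}' for i in range(10)])

# introduce one NA per feature (row i of feature i)
for i, col in enumerate(X.columns):
    X.loc[i, col] = np.nan

# add one indicator for every variable that shows NA
cols_with_na = [c for c in X.columns if X[c].isnull().any()]
for col in cols_with_na:
    X[col + '_NA'] = X[col].isnull().astype(int)

print(X.shape)  # (50, 20)
```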

#### Important

In addition, data tend to be missing for the same observations across multiple variables, so many of the missing indicator variables often end up being similar or identical to each other.
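One possible way to spot such redundant indicators, shown here only as a sketch on hypothetical toy data, is to look for indicator columns that are exact copies of an earlier one.

```python
import numpy as np
import pandas as pd

# hypothetical data: var_a and var_c are missing for the same observations
df = pd.DataFrame({
    'var_a': [1.0, np.nan, 3.0, np.nan, 5.0],
    'var_b': [np.nan, 2.0, 2.0, np.nan, 1.0],
    'var_c': [7.0, np.nan, 9.0, np.nan, 8.0],
})

# one binary indicator per variable
indicators = df.isnull().astype(int).add_suffix('_NA')

# transpose so that each indicator becomes a row, then flag the rows
# that repeat an earlier row
mask = indicators.T.duplicated()
duplicated_indicators = mask[mask].index.tolist()

print(duplicated_indicators)  # ['var_c_NA']
```

Indicators flagged this way carry no extra information and could be dropped, although whether to do so is a modelling decision that this notebook does not cover.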

### Final note

Typically, mean / median / mode imputation is done together with adding a variable to capture those observations where the data were missing, thus covering 2 angles: if the data were missing completely at random, this is handled by the mean / median / mode imputation, and if they weren't, this is captured by the missing indicator.


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plots
import pylab
import scipy.stats as stats

# to display all the columns of the dataframe
pd.set_option('display.max_columns', None)

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# let's load the imports-85-clean-data.csv dataset

data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\car-data_rev2.csv')


In [3]:
data.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,price
0,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,13495.0
1,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,16500.0
2,1,alfa-romero,gas,,two,hatchback,rwd,front,,16500.0
3,2,audi,gas,,four,,fwd,front,99.8,13950.0
4,2,audi,gas,,four,sedan,4wd,front,99.4,17450.0


In [4]:
# let's look at the percentage of NA

data.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.175610
num-of-doors       0.000000
body-style         0.141463
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.121951
price              0.000000
dtype: float64

To add a binary missing indicator, we don't need to learn anything from the training set, so in principle we could add it to the original dataset and then separate into train and test. However, this practice is not recommended.
In addition, if you are using scikit-learn to add the missing indicator, the indicator, as it is designed, needs to learn from the train set which features contain missing data, that is, for which features the binary variable needs to be added. We will see more about different implementations of missing indicators in future notebooks. For now, let's see how to create a binary missing indicator manually.
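As a brief preview of the scikit-learn behaviour mentioned above, the sketch below fits `MissingIndicator` from `sklearn.impute` on a toy train set. The toy dataframes and their column names are hypothetical; with `features='missing-only'`, the transformer records during `fit` which features showed missing values, and adds indicators only for those.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical toy data: only 'a' and 'c' show NA in the train set
train = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [1.0, 2.0, 3.0],
                      'c': [np.nan, 5.0, 6.0]})
test = pd.DataFrame({'a': [np.nan, 2.0],
                     'b': [1.0, 2.0],
                     'c': [4.0, np.nan]})

# learn from the train set which features contain missing values
indicator = MissingIndicator(features='missing-only')
indicator.fit(train)

# positions of the features for which an indicator will be created
print(indicator.features_)

# boolean array with one column per feature learnt during fit
flags = indicator.transform(test)
print(flags)
```

With the default settings, `transform` raises an error if a feature that had no NA in the train set shows NA in the data being transformed, which is why `b` is kept complete in this toy test set.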

In [5]:
inputs = data.drop(['price'], axis = 1)
target = data.price

In [6]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs,  # predictors
    target,  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((143, 9), (62, 9))

In [7]:
# Let's explore the missing data in the train set
# the percentages should be fairly similar to those
# of the whole dataset

X_train.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.139860
num-of-doors       0.000000
body-style         0.111888
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.104895
dtype: float64

In [8]:
# add the missing indicator

# this is done very simply by using np.where from numpy
# and isnull from pandas:

X_train['aspiration_NA'] = np.where(X_train['aspiration'].isnull(), 1, 0)
X_test['aspiration_NA'] = np.where(X_test['aspiration'].isnull(), 1, 0)




In [9]:
# add the missing indicator for body-style, again with
# np.where from numpy and isnull from pandas:

X_train['body-style_NA'] = np.where(X_train['body-style'].isnull(), 1, 0)
X_test['body-style_NA'] = np.where(X_test['body-style'].isnull(), 1, 0)




In [10]:
# add the missing indicator for wheel-base, again with
# np.where from numpy and isnull from pandas:

X_train['wheel-base_NA'] = np.where(X_train['wheel-base'].isnull(), 1, 0)
X_test['wheel-base_NA'] = np.where(X_test['wheel-base'].isnull(), 1, 0)




In [11]:
X_train.head(10)

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,aspiration_NA,body-style_NA,wheel-base_NA
40,0,honda,gas,std,four,sedan,fwd,front,96.5,0,0,0
60,0,mazda,gas,std,four,sedan,fwd,front,98.8,0,0,0
56,3,mazda,gas,std,two,hatchback,rwd,front,95.3,0,0,0
101,0,nissan,gas,std,four,sedan,fwd,front,100.4,0,0,0
86,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3,0,0,0
19,1,chevrolet,gas,std,two,hatchback,fwd,front,94.5,0,0,0
155,0,toyota,gas,std,four,wagon,4wd,front,95.7,0,0,0
97,1,nissan,gas,std,four,wagon,fwd,front,94.5,0,0,0
54,1,mazda,gas,,four,sedan,fwd,front,93.1,1,0,0
184,2,volkswagen,diesel,,four,,fwd,front,97.3,1,1,0


In [12]:
# yet the original variables still show the missing values,
# which need to be replaced by any of the techniques
# we have learnt

X_train.isnull().mean()

symboling          0.000000
make               0.000000
fuel-type          0.000000
aspiration         0.139860
num-of-doors       0.000000
body-style         0.111888
drive-wheels       0.000000
engine-location    0.000000
wheel-base         0.104895
aspiration_NA      0.000000
body-style_NA      0.000000
wheel-base_NA      0.000000
dtype: float64

In [13]:
# because wheel-base is a numerical variable we can do median imputation
# the median needs to be learnt from the train set

median = X_train['wheel-base'].median()

X_train['wheel-base'] = X_train['wheel-base'].fillna(median)
X_test['wheel-base'] = X_test['wheel-base'].fillna(median)





In [14]:
# check that there are no more missing values
X_train['wheel-base'].isnull().mean()

0.0

In [15]:
# let's make a function to fill missing values with a value:
# we have used a similar function in our previous notebooks
# so you are probably familiar with it

def impute_na(df, variable, value):
    return df[variable].fillna(value)

In [16]:
# find the most frequent category, with which we will impute the NA

In [17]:
X_train.groupby('aspiration').size()

aspiration
std      100
turbo     23
dtype: int64

In [18]:
X_train.groupby('body-style').size()

body-style
convertible     5
hardtop         5
hatchback      39
sedan          62
wagon          16
dtype: int64

In [19]:
# the most frequent category for aspiration is std (100), and for body-style is sedan(62)

In [20]:
# let's impute the NA in categorical variables by the 
# most frequent category (aka the mode)
# the mode needs to be learnt from the train set

mode = X_train['aspiration'].mode()[0]
X_train['aspiration'] = impute_na(X_train, 'aspiration', mode)
X_test['aspiration'] = impute_na(X_test, 'aspiration', mode)

mode = X_train['body-style'].mode()[0]
X_train['body-style'] = impute_na(X_train, 'body-style', mode)
X_test['body-style'] = impute_na(X_test, 'body-style', mode)


In [21]:
X_train.head()

Unnamed: 0,symboling,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,aspiration_NA,body-style_NA,wheel-base_NA
40,0,honda,gas,std,four,sedan,fwd,front,96.5,0,0,0
60,0,mazda,gas,std,four,sedan,fwd,front,98.8,0,0,0
56,3,mazda,gas,std,two,hatchback,rwd,front,95.3,0,0,0
101,0,nissan,gas,std,four,sedan,fwd,front,100.4,0,0,0
86,1,mitsubishi,gas,std,four,sedan,fwd,front,96.3,0,0,0


In [22]:
X_train.groupby('body-style').size()

body-style
convertible     5
hardtop         5
hatchback      39
sedan          78
wagon          16
dtype: int64

In [23]:
X_train.groupby('aspiration').size()

aspiration
std      120
turbo     23
dtype: int64

In [None]:
# we now have 16 more observations in the sedan category of body-style,
# and 20 more in the std category of aspiration

In [24]:
# and now let's check there are no more NA
X_train.isnull().mean()

symboling          0.0
make               0.0
fuel-type          0.0
aspiration         0.0
num-of-doors       0.0
body-style         0.0
drive-wheels       0.0
engine-location    0.0
wheel-base         0.0
aspiration_NA      0.0
body-style_NA      0.0
wheel-base_NA      0.0
dtype: float64

As you can see, we now have 3 more features than in the original dataset: one missing indicator for each of the 3 variables that contained NA.