## Missing indicator

- A binary variable that indicates whether a value is missing.

- Useful for both numerical and categorical features, and often paired with other imputation methods rather than used alone. 
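As a minimal sketch of the idea, the indicator is simply the result of `isna()` cast to an integer. The `toy` frame and its column names below are made up for illustration and are not taken from the house price dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one numerical and one categorical feature
toy = pd.DataFrame({
    "area": [65.0, np.nan, 68.0, np.nan],
    "quality": ["Gd", "TA", None, "Gd"],
})

# One binary indicator per feature: 1 where the value is missing, 0 otherwise
for col in ["area", "quality"]:
    toy[f"{col}_na"] = toy[col].isna().astype(int)

print(toy)
```

The same call works for both feature types because `isna()` treats `NaN` and `None` alike.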

### Assumptions

- Typically combined with mean, median, mode, or random sample imputation, which themselves assume the data is **missing at random**; the indicator keeps a record of which values were imputed
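For the pairing with median imputation, scikit-learn's `SimpleImputer` can do both steps at once through its `add_indicator=True` parameter. The sketch below uses a made-up array `X`; it is one possible way to combine the two steps, not the approach used in the cells that follow.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical values): only the first column has a missing entry
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

# Median imputation plus an indicator column for each feature
# that contained missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
```

With the default settings, indicator columns are appended only for features that had missing values during `fit`, so the output here has three columns rather than four.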

### Considerations

- Expands the feature space

- Original variable still needs to be imputed

- Many missing indicators may end up being identical or very highly correlated
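The last consideration can be checked directly. The sketch below, on a made-up `toy` frame, drops indicators that are constant (the feature has no missing values) and indicators that duplicate another one exactly, then computes correlations among the rest. The column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the two garage features are always missing together
toy = pd.DataFrame({
    "garage_year": [2003.0, np.nan, 2001.0, np.nan, 1998.0],
    "garage_type": ["Attchd", None, "Detchd", None, "Attchd"],
    "lot_frontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "overall_qual": [7, 6, 7, 7, 8],
})

indicators = toy.isna().astype(int).add_suffix("_na")

# Constant indicators carry no information (the feature is never missing here)
constant = [c for c in indicators.columns if indicators[c].nunique() == 1]
indicators = indicators.drop(columns=constant)

# Keep only the first of any group of identical indicator columns
deduped = indicators.loc[:, ~indicators.T.duplicated()]

# Absolute correlations help spot near-duplicates that are not exactly equal
corr = deduped.corr().abs()

print(constant)
print(list(deduped.columns))
print(corr)
```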

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.3
RANDOM_STATE = 44

In [2]:
cols_to_use = [
    "OverallQual",
    "TotalBsmtSF",
    "1stFlrSF",
    "GrLivArea",
    "WoodDeckSF",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "BsmtQual",
    "FireplaceQu",
    "SalePrice"
]

In [3]:
data = pd.read_csv("../data/houseprice.csv", usecols=cols_to_use)
data.head()

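| | LotFrontage | OverallQual | MasVnrArea | BsmtQual | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | GrLivArea | FireplaceQu | GarageYrBlt | WoodDeckSF | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 7 | 196.0 | Gd | 150 | 856 | 856 | 1710 | NaN | 2003.0 | 0 | 208500 |
| 1 | 80.0 | 6 | 0.0 | Gd | 284 | 1262 | 1262 | 1262 | TA | 1976.0 | 298 | 181500 |
| 2 | 68.0 | 7 | 162.0 | Gd | 434 | 920 | 920 | 1786 | TA | 2001.0 | 0 | 223500 |
| 3 | 60.0 | 7 | 0.0 | TA | 540 | 756 | 961 | 1717 | Gd | 1998.0 | 0 | 140000 |
| 4 | 84.0 | 8 | 350.0 | Gd | 490 | 1145 | 1145 | 2198 | TA | 2000.0 | 192 | 250000 |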


In [4]:
# Split into train and test sets before imputing, so that the test set does not leak into the imputation values
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE
)

X_train.shape, X_test.shape

((1022, 11), (438, 11))

In [6]:
# Capture numerical variables
vars_num = list(data.select_dtypes(include="number").columns)
vars_num

['LotFrontage',
 'OverallQual',
 'MasVnrArea',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 'GrLivArea',
 'GarageYrBlt',
 'WoodDeckSF',
 'SalePrice']

In [7]:
# Capture categorical variables
vars_cat = list(data.select_dtypes(exclude="number").columns)
vars_cat

['BsmtQual', 'FireplaceQu']

In [13]:
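# Create imputation dict: median for numerical variables, mode for categorical ones
# Note: these statistics are computed on the full dataset (train + test);
# to avoid data leakage, compute them on X_train only
imputation_dict = data[vars_num].median().to_dict()
imputation_dict.update(data[vars_cat].mode().iloc[0].to_dict())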
imputation_dict

{'LotFrontage': 69.0,
 'OverallQual': 6.0,
 'MasVnrArea': 0.0,
 'BsmtUnfSF': 477.5,
 'TotalBsmtSF': 991.5,
 '1stFlrSF': 1087.0,
 'GrLivArea': 1464.0,
 'GarageYrBlt': 1980.0,
 'WoodDeckSF': 0.0,
 'SalePrice': 163000.0,
 'BsmtQual': 'TA',
 'FireplaceQu': 'Gd'}
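The dictionary above is built from `data`, that is, from the training and test rows together, which is why it also contains `SalePrice`. A leakage-free variant would learn the values from the training set only. The sketch below shows that variant on made-up frames; `fit_imputation_values`, `toy_train`, and `toy_test` are hypothetical names introduced here, and the helper assumes at least one categorical column is present.

```python
import numpy as np
import pandas as pd


def fit_imputation_values(train: pd.DataFrame) -> dict:
    """Learn median (numerical) and mode (categorical) values from the training set only."""
    num_cols = train.select_dtypes(include="number").columns
    cat_cols = train.select_dtypes(exclude="number").columns
    values = train[num_cols].median().to_dict()
    # mode() can return several rows on ties; iloc[0] takes the first one
    values.update(train[cat_cols].mode().iloc[0].to_dict())
    return values


# Toy frames (hypothetical values)
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0], "BsmtQual": ["TA", "TA", None]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 500.0], "BsmtQual": [None, "Ex"]})

values = fit_imputation_values(toy_train)

# The same training-set values are applied to both sets
toy_train_filled = toy_train.fillna(values)
toy_test_filled = toy_test.fillna(values)

print(values)
print(toy_test_filled)
```

The extreme test value (500.0) has no influence on the median used for imputation, which is the point of fitting on the training set alone.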

In [14]:
# Create the indicator feature names
indicators = [f"{var}_na" for var in X_train.columns]
indicators

['LotFrontage_na',
 'OverallQual_na',
 'MasVnrArea_na',
 'BsmtQual_na',
 'BsmtUnfSF_na',
 'TotalBsmtSF_na',
 '1stFlrSF_na',
 'GrLivArea_na',
 'FireplaceQu_na',
 'GarageYrBlt_na',
 'WoodDeckSF_na']
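The list above contains one indicator per column, including columns that may have no missing values at all, whose indicators would be all zeros. One way to limit the growth of the feature space is to create indicators only for columns with missing values in the training set. A minimal sketch on made-up frames follows; `add_indicators` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def add_indicators(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return a copy of df with a binary `<col>_na` indicator for each column in cols."""
    out = df.copy()
    for col in cols:
        out[f"{col}_na"] = out[col].isna().astype(int)
    return out


# Toy frames (hypothetical values): OverallQual is never missing
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0], "OverallQual": [7, 6, 8]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 70.0], "OverallQual": [5, 7]})

# Decide which columns get an indicator from the training set only
cols_with_na = [col for col in toy_train.columns if toy_train[col].isna().any()]

toy_train_ind = add_indicators(toy_train, cols_with_na)
toy_test_ind = add_indicators(toy_test, cols_with_na)

print(cols_with_na)
print(toy_test_ind)
```

Using the training-set column list for both sets keeps the train and test frames aligned on the same features.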

In [15]:
# Add missing indicators to X_train and X_test (this must run before imputation, while the NaNs are still present)
X_train[indicators] = X_train.isna().astype(int)
X_test[indicators] = X_test.isna().astype(int)

print(X_train.head())
print(X_test.head())

      LotFrontage  OverallQual  MasVnrArea BsmtQual  BsmtUnfSF  TotalBsmtSF  \
146          51.0            5         0.0       TA        506          715   
1115         93.0            8       328.0       Ex        730         1734   
758          24.0            7       360.0       Gd        195          744   
280          82.0            7       340.0       Gd        386          807   
1340         70.0            4         0.0       TA        858          858   

      1stFlrSF  GrLivArea FireplaceQu  GarageYrBlt  ...  OverallQual_na  \
146        875        875         NaN       1931.0  ...               0   
1115      1734       1734          Gd       2007.0  ...               0   
758        757       1501         NaN       1999.0  ...               0   
280       1175       1982          TA       1989.0  ...               0   
1340       872        872         NaN       1974.0  ...               0   

      MasVnrArea_na  BsmtQual_na  BsmtUnfSF_na  TotalBsmtSF_na  1stFlrSF_n

In [16]:
# Fill missing values using the imputation dict
X_train.fillna(imputation_dict, inplace=True)
X_test.fillna(imputation_dict, inplace=True)

In [17]:
X_train.isna().sum()

LotFrontage       0
OverallQual       0
MasVnrArea        0
BsmtQual          0
BsmtUnfSF         0
TotalBsmtSF       0
1stFlrSF          0
GrLivArea         0
FireplaceQu       0
GarageYrBlt       0
WoodDeckSF        0
LotFrontage_na    0
OverallQual_na    0
MasVnrArea_na     0
BsmtQual_na       0
BsmtUnfSF_na      0
TotalBsmtSF_na    0
1stFlrSF_na       0
GrLivArea_na      0
FireplaceQu_na    0
GarageYrBlt_na    0
WoodDeckSF_na     0
dtype: int64

In [18]:
X_test.isna().sum()

LotFrontage       0
OverallQual       0
MasVnrArea        0
BsmtQual          0
BsmtUnfSF         0
TotalBsmtSF       0
1stFlrSF          0
GrLivArea         0
FireplaceQu       0
GarageYrBlt       0
WoodDeckSF        0
LotFrontage_na    0
OverallQual_na    0
MasVnrArea_na     0
BsmtQual_na       0
BsmtUnfSF_na      0
TotalBsmtSF_na    0
1stFlrSF_na       0
GrLivArea_na      0
FireplaceQu_na    0
GarageYrBlt_na    0
WoodDeckSF_na     0
dtype: int64
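Zero remaining `NaN` values confirms the imputation, but not that the indicators were created before it. A simple additional check is that each indicator sums to the number of missing values its feature had before imputation. The sketch below illustrates this on a made-up `toy` frame rather than on `X_train`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})
features = list(toy.columns)

# Record the missing counts before anything is changed
na_counts = toy.isna().sum()

# Indicators first, imputation second
for col in features:
    toy[f"{col}_na"] = toy[col].isna().astype(int)
toy = toy.fillna(toy[features].median())

# Each indicator should sum to the original number of missing values
indicator_counts = toy[[f"{col}_na" for col in features]].sum()
indicator_counts.index = features

print(toy.isna().sum().sum())
print((indicator_counts == na_counts).all())
```

If the indicators had been created after `fillna`, every indicator would sum to zero and the comparison would fail.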