## Arbitrary Category Imputation

- Treat missing data as an additional label or category, for example by creating a new label called **'Missing'**.

- This is the most widely used method of missing data imputation for categorical variables.
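
As a minimal sketch of the idea, assuming a small made-up Series (the name `quality` and its values are illustrative, not taken from the house price data), missing entries can be replaced by the string `'Missing'`, which then behaves like any other category:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not from the dataset used in this notebook
quality = pd.Series(["Gd", np.nan, "TA", "Gd", np.nan], name="quality")

# Replace every missing entry with the new 'Missing' label
imputed = quality.fillna("Missing")

# 'Missing' is now counted as a regular category alongside the others
counts = imputed.value_counts()
print(counts)
```

Because the new label is just another string, any downstream encoder should treat it the same way as the original categories.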

#### Assumptions

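- Data is **not missing at random**: the fact that a value is missing is assumed to carry information, which the new category captures.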
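
One rough way to probe this assumption is to compare the target between rows where a feature is missing and rows where it is observed. The sketch below uses a made-up frame (the column names `feature` and `target` and all values are hypothetical); a visible gap between the two group means would be consistent with informative missingness, though it does not prove it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a categorical feature with gaps and a numeric target
df = pd.DataFrame({
    "feature": ["Gd", np.nan, "TA", np.nan, "Gd", "TA"],
    "target": [200, 120, 180, 110, 210, 170],
})

# Label each row by whether the feature is missing or observed
status = df["feature"].isna().map({True: "missing", False: "observed"})

# Compare the average target between the two groups
target_by_status = df.groupby(status)["target"].mean()
print(target_by_status)
```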


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.3
RANDOM_STATE = 44

In [2]:
cols_to_use = [
    "BsmtQual",
    "FireplaceQu",
    "SalePrice"
]

In [3]:
data = pd.read_csv("../data/houseprice.csv", usecols=cols_to_use)
data.head()

  BsmtQual FireplaceQu  SalePrice
0       Gd         NaN     208500
1       Gd          TA     181500
2       Gd          TA     223500
3       TA          Gd     140000
4       Gd          TA     250000


In [4]:
# Split the data into train and test sets before imputing to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE
)

X_train.shape, X_test.shape

((1022, 2), (438, 2))

In [5]:
# proportion of missing values
X_train.isna().mean()

BsmtQual       0.026419
FireplaceQu    0.465753
dtype: float64

In [6]:
# Dict for imputation with 'Missing' as value
imputation_dict = {
    "BsmtQual": "Missing",
    "FireplaceQu": "Missing"
}

imputation_dict

{'BsmtQual': 'Missing', 'FireplaceQu': 'Missing'}

In [7]:
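# Fill missing values with the 'Missing' label; reassigning avoids modifying
# in place the frames returned by train_test_split
X_train = X_train.fillna(imputation_dict)
X_test = X_test.fillna(imputation_dict)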

In [8]:
X_train.isna().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64

In [9]:
X_test.isna().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64