# Frequent Category imputation (Mode imputation)

- Replaces all occurrences of missing values within a variable with the mode, i.e. the **most frequent category**

- Applies to **categorical** variables

- For variables that have **2 or more modes**, we either pick one of the tied categories, or use arbitrary imputation instead
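
A minimal sketch of the multimodal case, using a toy series with made-up values rather than the house-price data: `Series.mode()` returns every tied category in sorted order, so taking the first element is one deterministic way to "pick one category".

```python
import pandas as pd

# Toy series (hypothetical values) in which "Gd" and "TA" are tied
s = pd.Series(["TA", "Gd", "TA", "Gd", "Ex", None])

modes = s.mode()        # all tied modes, sorted; missing values are ignored
chosen = modes.iloc[0]  # pick the first of the tied categories

imputed = s.fillna(chosen)
print(modes.tolist(), chosen)
```

Which tied category to keep is a judgment call; picking the first is simply reproducible.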

## Assumptions

- Data is **missing at random**

- The missing observations most likely look like the majority of observations

- After imputation, missing data are **blended** with the observations that genuinely belong to the most frequent category

## Good practice

Frequent category imputation + Missing indicator = Good imputation strategy
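
One way to combine the two steps, sketched with pandas on toy frames. The values, the helper name `impute_with_indicator`, and the `_NA` column suffix are all assumptions made for illustration. The indicator must be created before filling, otherwise the information about which rows were missing is lost.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values) standing in for a train/test split
train = pd.DataFrame({"FireplaceQu": ["Gd", np.nan, "Gd", "TA", np.nan]})
test = pd.DataFrame({"FireplaceQu": [np.nan, "TA"]})

# Learn the mode from the training data only
mode = train["FireplaceQu"].mode().iloc[0]

def impute_with_indicator(df, col, fill_value):
    """Hypothetical helper: add a 0/1 missing flag, then fill the column."""
    out = df.copy()
    out[col + "_NA"] = out[col].isna().astype(int)  # flag before filling
    out[col] = out[col].fillna(fill_value)
    return out

train_imp = impute_with_indicator(train, "FireplaceQu", mode)
test_imp = impute_with_indicator(test, "FireplaceQu", mode)
print(train_imp)
```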

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.3
RANDOM_STATE = 44

In [3]:
# categorical columns and target
cols_to_use = ["BsmtQual", "FireplaceQu", "SalePrice"]

In [4]:
data = pd.read_csv("../data/houseprice.csv", usecols=cols_to_use)
data.head()

|   | BsmtQual | FireplaceQu | SalePrice |
|---|----------|-------------|-----------|
| 0 | Gd       | NaN         | 208500    |
| 1 | Gd       | TA          | 181500    |
| 2 | Gd       | TA          | 223500    |
| 3 | TA       | Gd          | 140000    |
| 4 | Gd       | TA          | 250000    |


In [6]:
# Split into train and test sets; learn the mode from the train set only to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE
)

X_train.shape, X_test.shape

((1022, 2), (438, 2))

In [7]:
# Missing data proportion
X_train.isna().mean()

BsmtQual       0.026419
FireplaceQu    0.465753
dtype: float64

~47% of the values in FireplaceQu are missing, compared with ~3% in BsmtQual

In [8]:
X_train[["BsmtQual", "FireplaceQu"]].mode()

|   | BsmtQual | FireplaceQu |
|---|----------|-------------|
| 0 | TA       | Gd          |


In [9]:
# Capture the mode of the variables in a dict
imputation_dict = X_train[["BsmtQual", "FireplaceQu"]].mode().iloc[0].to_dict()
imputation_dict

{'BsmtQual': 'TA', 'FireplaceQu': 'Gd'}

In [10]:
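# Replace missing data in both sets with the modes learned from the train set.
# Reassigning avoids a possible SettingWithCopyWarning from inplace=True on the split frames.
X_train = X_train.fillna(imputation_dict)
X_test = X_test.fillna(imputation_dict)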

In [11]:
# re-check to see if the imputation worked
X_train.isna().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64

In [12]:
X_test.isna().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64