We can categorize missing values by the reason they occur:

• Missing completely at random (MCAR)—The reason for the missing data is unrelated to the
rest of the data. An example could be a respondent accidentally missing a question in a survey.

• Missing at random (MAR)—The missingness of the data can be inferred from data in one or
more other columns. For example, whether a response to a certain survey question is missing
can, to some extent, be predicted from other observed factors such as sex, age, and lifestyle.

• Missing not at random (MNAR)—There is some underlying reason for the missing values,
typically related to the unobserved values themselves. For example, people with very high
incomes tend to be hesitant about revealing them.

• Structurally missing data—Often treated as a subset of MNAR, this is data that is missing
for a logical reason. For example, when a variable representing the age of a spouse is missing,
we can infer that the given person has no spouse.
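
To make the first three mechanisms concrete, the following minimal sketch injects missing values into synthetic data. Everything in it is an assumption made for illustration: the DataFrame `df`, its `age` and `income` columns, and the chosen probabilities are hypothetical and are not part of the dataset used in this recipe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical, fully observed survey data (made up for this sketch).
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
})

# MCAR: every income value has the same 20% chance of being dropped.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends on an observed column (age).
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.4, 0.1)

# MNAR: the chance of a missing income depends on the income itself.
high_income = df["income"] > df["income"].quantile(0.9)
mnar_mask = rng.random(n) < np.where(high_income, 0.6, 0.05)

# Series.mask replaces the values with NaN wherever the mask is True.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)

# Compare the true mean with the mean of the values that remain observed.
print(f"True mean:          {df['income'].mean():,.0f}")
print(f"Observed under MNAR: {income_mnar.mean():,.0f}")
```

In this setup, the mean of the observed MNAR incomes falls below the true mean, because the highest earners are the ones most likely to be missing. This is why the mechanism matters when choosing an imputation strategy.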

In [2]:
import pandas as pd
import missingno as msno
from sklearn.impute import SimpleImputer

In [None]:
# Inspect the information about the DataFrame:
X.info()

In [None]:
# Visualize the nullity of the DataFrame:
msno.matrix(X)
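
In addition to the visual inspection, it can help to quantify the missing values per column. The sketch below uses a small made-up DataFrame, `X_demo`, as a stand-in for `X`; the same two `pandas` calls should work on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X; the values are made up for illustration.
X_demo = pd.DataFrame({
    "age": [25, np.nan, 41, 37],
    "sex": ["F", "M", None, "F"],
    "education": ["Graduate", None, None, "High school"],
})

# Count and share of missing values per column, most affected first.
missing_summary = pd.DataFrame({
    "n_missing": X_demo.isna().sum(),
    "pct_missing": X_demo.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```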

In [None]:
# Define columns with missing values per data type:
NUM_FEATURES = ["age"]
CAT_FEATURES = ["sex", "education", "marriage"]

In [None]:
# Impute numerical features:
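for col in NUM_FEATURES:
    # Fit on the training set only, so no test-set information leaks in:
    num_imputer = SimpleImputer(strategy="median")
    num_imputer.fit(X_train[[col]])
    # transform returns a 2D array; flatten it to a single column:
    X_train.loc[:, col] = num_imputer.transform(X_train[[col]]).ravel()
    X_test.loc[:, col] = num_imputer.transform(X_test[[col]]).ravel()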

In [None]:
# Impute categorical features:
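for col in CAT_FEATURES:
    # Fill with the most frequent category seen in the training set:
    cat_imputer = SimpleImputer(strategy="most_frequent")
    cat_imputer.fit(X_train[[col]])
    # transform returns a 2D array; flatten it to a single column:
    X_train.loc[:, col] = cat_imputer.transform(X_train[[col]]).ravel()
    X_test.loc[:, col] = cat_imputer.transform(X_test[[col]]).ravel()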
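
The imputers above are fitted on the training set and then applied to both sets. The following self-contained sketch shows what that means in practice, using two tiny made-up DataFrames (`X_train_demo` and `X_test_demo` are hypothetical names, not objects from this recipe): the missing test value is filled with the training median, not with a statistic computed from the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data for illustration only.
X_train_demo = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
X_test_demo = pd.DataFrame({"age": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train_demo[["age"]])

# The learned fill value is the median of the observed training values.
train_median = imputer.statistics_[0]
print(train_median)  # 30.0

X_train_demo["age"] = imputer.transform(X_train_demo[["age"]]).ravel()
X_test_demo["age"] = imputer.transform(X_test_demo[["age"]]).ravel()

# Confirm that no missing values remain in either set.
print(X_train_demo.isna().sum().sum(), X_test_demo.isna().sum().sum())
```

A similar `isna().sum()` check on `X_train` and `X_test` is a quick way to confirm that the imputation covered every column listed in `NUM_FEATURES` and `CAT_FEATURES`.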