You have a categorical feature containing missing values that you want to
replace with predicted values.

The ideal solution is to train a machine learning classifier to predict
the missing values, commonly a k-nearest neighbors (KNN) classifier.

When we have missing values in a categorical feature, our best solution is to
open our toolbox of machine learning algorithms to predict the values of the
missing observations. We can accomplish this by treating the feature with the
missing values as the target vector and the other features as the feature matrix. A
commonly used algorithm is KNN (discussed in depth later in this book), which
assigns to the missing value the most frequent class among the k nearest observations.
Alternatively, we can fill in missing values with the most frequent class of the
feature. While less sophisticated than KNN, this approach scales much better to
large datasets. In either case, it is advisable to include a binary feature
indicating which observations contain imputed values.
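
The binary indicator mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the recipe itself: the array `X_raw` and the names `was_missing`, `X_imputed_demo`, and `X_flagged` are invented for the example, and a simple most-frequent fill stands in for whichever imputation method is used.

```python
import numpy as np

# Hypothetical data: the first column is the categorical feature
X_raw = np.array([
    [np.nan, 0.87, 1.31],
    [0, 2.10, 1.45],
    [1, 1.18, 1.33],
    [1, -0.21, -1.19]])

# Record which rows lack the categorical feature BEFORE imputing
was_missing = np.isnan(X_raw[:, 0]).astype(float)

# Stand-in imputation: fill with the most frequent observed class
observed = X_raw[~np.isnan(X_raw[:, 0]), 0]
classes, counts = np.unique(observed, return_counts=True)
X_imputed_demo = X_raw.copy()
X_imputed_demo[np.isnan(X_imputed_demo[:, 0]), 0] = classes[np.argmax(counts)]

# Append the indicator as an extra binary column
X_flagged = np.hstack((X_imputed_demo, was_missing.reshape(-1, 1)))
```

The mask has to be computed before the fill, because once the values are replaced there is no way to tell imputed observations from observed ones.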

In [37]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

In [38]:
# Create feature matrix with categorical feature
X = np.array([
[0, 2.10, 1.45],
[1, 1.18, 1.33],
[0, 1.22, 1.27],
[1, -0.21, -1.19]])

In [39]:
# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([
[np.nan, 0.87, 1.31],
[np.nan, -0.67, -0.22]])

In [40]:
# Train KNN learner
classifier = KNeighborsClassifier(n_neighbors=3, weights='distance')
model = classifier.fit(X[:, 1:], X[:, 0])

In [41]:
# Predict missing values' class
imputed_values = model.predict(X_with_nan[:, 1:])
imputed_values

array([0., 1.])

In [42]:
# Join column of predicted classes with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))
X_with_imputed

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22]])

In [43]:
# Join two feature matrices
np.vstack((X_with_imputed, X))


array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
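
The KNN steps can also be wrapped in a single helper that fills the gaps in place, so rows keep their original order instead of being stacked with the imputed rows first. This is only a sketch: `knn_impute_column` is a hypothetical function written for this example, not a scikit-learn API, and it assumes the remaining columns contain no missing values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_impute_column(X, column=0, n_neighbors=3):
    """Fill NaNs in one categorical column with KNN predictions.

    Hypothetical helper; assumes the other columns have no missing values.
    Returns the filled copy and a boolean mask of the imputed rows.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, column])
    if not missing.any():
        return X, missing
    # Every other column serves as a predictor
    predictors = np.delete(X, column, axis=1)
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights='distance')
    model.fit(predictors[~missing], X[~missing, column])
    # Write predictions back into the rows that were missing
    X[missing, column] = model.predict(predictors[missing])
    return X, missing

# Same observations as in the recipe, interleaved in one matrix
X_all = np.array([
    [np.nan, 0.87, 1.31],
    [0, 2.10, 1.45],
    [1, 1.18, 1.33],
    [np.nan, -0.67, -0.22],
    [0, 1.22, 1.27],
    [1, -0.21, -1.19]])

X_filled, was_imputed = knn_impute_column(X_all)
```

The returned mask can double as the binary feature that flags imputed observations.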

An alternative solution is to fill in missing values with the feature’s most
frequent value:

In [44]:
from sklearn.impute import SimpleImputer

# Join the two feature matrices
X_complete = np.vstack((X_with_nan, X))

# Fill missing values with each feature's most frequent value
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

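array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

Both classes appear twice among the observed values, and when there is a tie
SimpleImputer uses the smallest value, so both missing entries become 0.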