# Imputing Missing Class Values

You have a categorical feature containing missing values that you want to replace with
predicted values.

The ideal solution is to train a classifier to predict the missing values from the
other features; a k-nearest neighbors (KNN) classifier is a common choice:

In [2]:
# Load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


# Create feature matrix; the first column is the categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])


# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

In [3]:
# Train KNN learner: the other features (columns 1 onward) predict the class (column 0)
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:, 1:], X[:, 0])

In [4]:
clf

KNeighborsClassifier(n_neighbors=3, weights='distance')

In [5]:
trained_model

KNeighborsClassifier(n_neighbors=3, weights='distance')

In [6]:
# Predict missing values' class
imputed_values = trained_model.predict(X_with_nan[:,1:])

In [7]:
imputed_values

array([0., 1.])
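
Because the classifier weights neighbors by distance, it can help to check how decisive each imputed class was before accepting it. The following is a minimal sketch that rebuilds the same data and model and inspects `predict_proba`; the variable name `probabilities` is introduced here for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same data as above: column 0 is the class, columns 1 onward are predictors
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

clf = KNeighborsClassifier(3, weights='distance').fit(X[:, 1:], X[:, 0])

# One row per observation, one column per class listed in clf.classes_
probabilities = clf.predict_proba(X_with_nan[:, 1:])
for row, probs in zip(X_with_nan[:, 1:], probabilities):
    print(row, dict(zip(clf.classes_, probs.round(2))))
```

A probability close to 0.5 suggests the neighbors disagree, so the imputed class for that row deserves more caution.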

In [10]:
# Join the column of predicted classes with the rows' other features
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))

In [12]:
# Join two feature matrices
np.vstack((X_with_imputed, X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
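
Stacking the matrices places the imputed rows first, which changes the row order when missing values are scattered through a single matrix. One way to keep the original order is to fill the missing entries in place with a boolean mask. The sketch below assumes every column other than the categorical one is numeric and complete; `impute_class_knn` and `X_mixed` are hypothetical names, not part of scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def impute_class_knn(X, class_col=0, n_neighbors=3):
    """Fill NaNs in one categorical column using a KNN classifier.

    Hypothetical helper: assumes all other columns are numeric and complete.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, class_col])
    if not missing.any():
        return X
    # Every column except the categorical one serves as a predictor
    predictors = np.delete(X, class_col, axis=1)
    clf = KNeighborsClassifier(n_neighbors, weights='distance')
    clf.fit(predictors[~missing], X[~missing, class_col])
    # Write predictions back into the rows that were missing
    X[missing, class_col] = clf.predict(predictors[missing])
    return X


# Same observations as above, with the incomplete rows interleaved
X_mixed = np.array([[0, 2.10, 1.45],
                    [np.nan, 0.87, 1.31],
                    [1, 1.18, 1.33],
                    [0, 1.22, 1.27],
                    [np.nan, -0.67, -0.22],
                    [1, -0.21, -1.19]])

X_filled = impute_class_knn(X_mixed)
print(X_filled[:, 0])  # [0. 0. 1. 0. 1. 1.]
```

The imputed classes match the predictions above (0 and 1), but each row stays at its original index.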

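An alternative solution is to fill in missing values with the feature’s most frequent
value. Here 0 and 1 each appear twice in the first column, and `SimpleImputer` breaks
the tie by choosing the smallest value: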

In [17]:
from sklearn.impute import SimpleImputer
# Join the two feature matrices
X_complete = np.vstack((X_with_nan, X))
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
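
Fitting the imputer on the whole matrix applies the most-frequent strategy to every column, so a missing value in a continuous column would also be filled with that column's mode. If only the categorical column should be treated this way, the imputer can be fitted on that column alone. This is a minimal sketch under that assumption; `X_mode` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_complete = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22],
                       [0, 2.10, 1.45],
                       [1, 1.18, 1.33],
                       [0, 1.22, 1.27],
                       [1, -0.21, -1.19]])

imputer = SimpleImputer(strategy='most_frequent')
X_mode = X_complete.copy()
# fit_transform expects 2D input, so index with [0] to keep the column 2D
X_mode[:, [0]] = imputer.fit_transform(X_complete[:, [0]])

# The fill value learned for the categorical column
print(imputer.statistics_)
```

The continuous columns are left untouched, and `imputer.statistics_` records the value used for the fill.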