
### Imputation Redux: Imputing Missing Categorical Data

Categorical data presents a particular challenge because, as we discussed, scikit-learn doesn't handle non-numeric data well. For instance, the `KNNImputer` will fail if we attempt to impute categorical values that haven't been encoded.



In [35]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic data
n_samples = 100
age = np.random.randint(18, 80, n_samples)
income = np.random.randint(20000, 100000, n_samples)
education = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples)
city = np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'], n_samples)


# Create a DataFrame
df = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Education': education,
    'City': city
})

# Introduce missing values
for column in df.columns:
    mask = np.random.choice([True, False], size=df.shape[0], p=[0.1, 0.9])
    df.loc[mask, column] = np.nan



# Try to use KNNImputer on the entire DataFrame
imputer = KNNImputer(n_neighbors=5)

try:
    imputed_data = imputer.fit_transform(df)
    print("\nImputed data:")
    print(imputed_data[:10])
except ValueError as e:
    print("\nError when trying to impute:")
    print(e)


Error when trying to impute:
could not convert string to float: 'High School'


There are two strategies for dealing with this. We can use scikit-learn's `SimpleImputer` with the "most frequent" strategy. Alternatively, if we want to use something like a `KNNImputer`, we'll need to encode the data first and then impute.



#### Using Pandas or Most Frequent Category

You can use pandas to replace nulls, using one of the methods we covered previously. Alternatively, you can use SimpleImputer with the `strategy='most_frequent'` option to impute missing values with the most frequent category in each column before one-hot encoding.


In [37]:
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

df_imputed = df.copy()
# Impute missing values
imp = SimpleImputer(strategy='most_frequent')
df_imputed[["Education","City"]] = imp.fit_transform(df[['Education','City']])
df_imputed


Unnamed: 0,Age,Income,Education,City
0,,57065.0,Master,New York
1,,52606.0,High School,Los Angeles
2,46.0,31534.0,Bachelor,New York
3,32.0,60397.0,PhD,Houston
4,,21016.0,PhD,Miami
...,...,...,...,...
95,46.0,54766.0,Bachelor,New York
96,35.0,93530.0,PhD,Chicago
97,43.0,81087.0,High School,New York
98,,88840.0,PhD,New York
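
The pandas route can be sketched in a few lines. This is a minimal example on a small made-up frame (`toy` is hypothetical, not the `df` above) that fills each column's nulls with that column's most frequent value:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
toy = pd.DataFrame({
    "Education": ["PhD", np.nan, "PhD", "Master", np.nan],
    "City": ["Miami", "Miami", np.nan, "Chicago", "Miami"],
})

# mode() returns a DataFrame (ties produce extra rows),
# so take row 0 to get one most-frequent value per column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column using the matching index label
toy_filled = toy.fillna(modes)
print(toy_filled)
```

This should give the same kind of result as `SimpleImputer(strategy='most_frequent')`, without leaving pandas.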


#### Encoding and Then Imputing

If we wanted to use something more sophisticated, like a `KNNImputer`, we might try encoding first and then imputing, but there is another subtle problem here.

In [38]:
from sklearn.preprocessing import OrdinalEncoder

le_education = OrdinalEncoder()
le_city = OrdinalEncoder()

df_numeric = df.copy()
# Encode each categorical column (OrdinalEncoder leaves NaN as NaN, so the gaps survive encoding)
df_numeric['Education'] = le_education.fit_transform(df[['Education']])
df_numeric['City'] = le_city.fit_transform(df[['City']])

# Impute missing values
imputer = KNNImputer(n_neighbors=5)
imputed_data_numeric = imputer.fit_transform(df_numeric)

# Create a new DataFrame with imputed values
df_imputed = pd.DataFrame(imputed_data_numeric, columns=df.columns)
df_imputed

Unnamed: 0,Age,Income,Education,City
0,56.2,57065.0,1.2,4.0
1,60.4,52606.0,1.0,2.0
2,46.0,31534.0,0.0,4.0
3,32.0,60397.0,3.0,1.0
4,50.2,21016.0,3.0,3.0
...,...,...,...,...
95,46.0,54766.0,0.0,1.6
96,35.0,93530.0,3.0,0.0
97,43.0,81087.0,1.0,1.6
98,50.2,88840.0,3.0,4.0


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        83 non-null     float64
 1   Income     93 non-null     float64
 2   Education  86 non-null     object 
 3   City       87 non-null     object 
dtypes: float64(2), object(2)
memory usage: 3.3+ KB


Here's the problem. Look at the rows where `Education` or `City` was originally missing:

In [40]:
df_imputed[(df.Education.isna()) | (df.City.isna())]

Unnamed: 0,Age,Income,Education,City
0,56.2,57065.0,1.2,4.0
5,25.0,75591.0,1.2,3.0
6,50.8,43247.0,2.0,1.4
7,38.0,44300.0,0.0,1.6
15,70.0,71214.0,1.2,0.0
27,19.0,33986.0,1.2,0.0
29,38.0,32666.0,3.0,1.6
33,39.0,72972.2,2.4,4.0
37,66.0,50535.0,2.0,1.0
38,44.0,98603.0,1.2,4.0


You'll note that `KNNImputer` fills each gap with the _mean_ of its nearest neighbors' values. That doesn't make much sense here: an `Education` code of 1.2 doesn't correspond to any category. We could round, but that defeats the purpose of using KNN. One solution is to use a more robust library. The following is a little advanced, but I've provided it here so you can refer back. For our purposes, a `SimpleImputer` should be sufficient.
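
To see the averaging in isolation, here is a minimal sketch on a made-up array (the values are hypothetical). The second column plays the role of an encoded category, and the last row is missing it:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0 is a numeric feature; column 1 is an ordinal-encoded category (0, 1, 3)
X = np.array([
    [1.0, 0.0],
    [1.1, 1.0],
    [0.9, 3.0],
    [1.0, np.nan],   # category code is missing here
])

# With three donors and n_neighbors=3, every donor is used
imputed = KNNImputer(n_neighbors=3).fit_transform(X)

# The filled value is the plain average of 0, 1 and 3, which is not a valid code
print(imputed[3, 1])
```

The imputed value is a fraction between codes, which is why the encoded columns above end up with values like 1.2.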

#### Use a Different Library!

As you might imagine, others have struggled with this, and so there are other libraries designed to address this problem. For instance, the `fancyimpute` package has both a `KNN` imputer and an `IterativeImputer` you might try. Here's an example with `KNN` from `fancyimpute`.

In [2]:
!pip install fancyimpute



#### fancyimpute's K-Nearest Neighbors (KNN) Imputer

`KNN` from `fancyimpute` won't work with categorical data directly either: it expects a numeric matrix and returns numeric estimates. To use it, first encode your data using an `OrdinalEncoder` or `LabelEncoder`, then impute, then round the imputed codes and transform them back into the categorical values you want. This is more complicated than it should be because `LabelEncoder` has no easy way to preserve nulls in your data, so the encoder has to be fit on the non-null values only.



In [3]:
import pandas as pd
import numpy as np
from fancyimpute import KNN
from sklearn.preprocessing import LabelEncoder

# Create DataFrame with missing values
data = {
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple', None, 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', None, 'Green', 'Yellow']
}

df = pd.DataFrame(data)

# Dictionary to hold LabelEncoders for each column
encoders = {}

# Replace categorical string values with numerical representations
for col in df.columns:
    le = LabelEncoder()
    not_null_mask = df[col].notnull()
    df.loc[not_null_mask, col] = le.fit_transform(df.loc[not_null_mask, col].astype(str))
    encoders[col] = le

# Use KNN to impute the missing values
knn_imputer = KNN()
df_imputed = knn_imputer.fit_transform(df)

# Round imputed values and convert to int for decoding
# Note that the imputer returns floats, and imputed codes may be fractional,
# so round to the nearest valid code before casting
df_imputed = pd.DataFrame(np.round(df_imputed), columns=df.columns).astype(int)

# Decode imputed values back to original categorical values
for col in df.columns:
    df_imputed[col] = encoders[col].inverse_transform(df_imputed[col])

print(df_imputed)

Imputing row 1/6 with 0 missing, elapsed time: 0.000
    Fruit   Color
0   Apple     Red
1  Banana  Yellow
2  Cherry     Red
3   Apple     Red
4  Banana   Green
5  Banana  Yellow


Other strategies may be applied in a similar manner, after which you can one-hot encode your data, and proceed with additional processing!
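
As a minimal sketch of that last step, the imputed result printed above can be one-hot encoded with `pd.get_dummies`. The frame is retyped by hand here so the example stands alone, and `imputed` is just an illustrative name:

```python
import pandas as pd

# The imputed result from the example above, retyped so this block runs on its own
imputed = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Cherry", "Apple", "Banana", "Banana"],
    "Color": ["Red", "Yellow", "Red", "Red", "Green", "Yellow"],
})

# One indicator column per category; no NaNs remain, so no dummy_na handling is needed
one_hot = pd.get_dummies(imputed, columns=["Fruit", "Color"])
print(one_hot.columns.tolist())
```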

Note that there is currently no elegant solution for imputation of categorical variables, and so if you want something more sophisticated than a SimpleImputer with a "most_frequent" strategy, you'll probably need to write some code.  However, we can turn the above method into our own "Imputer" class like this:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder
from fancyimpute import KNN
import pandas as pd
import numpy as np

class CategoricalKNNImputer(BaseEstimator, TransformerMixin):
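    def __init__(self, include_numeric=False, include_cols=None):
        self.encoders = {}
        self.knn_imputer = KNN()
        self.include_numeric = include_numeric
        # Extra columns to treat as categorical; None (the default) means none.
        # A mutable default such as [] would be shared across instances.
        self.include_cols = include_cols

    def fit(self, X, y=None):
        X = X.copy()

        if self.include_numeric:
            self.cols = X.columns.tolist()
        else:
            extra_cols = list(self.include_cols) if self.include_cols is not None else []
            self.cols = X.select_dtypes(include=['object', 'category']).columns.tolist() + extra_cols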
            
        for col in self.cols:
            le = LabelEncoder()
            not_null_mask = X[col].notnull()
            if not_null_mask.sum() > 0:  # Only if there are non-null values to fit
                X.loc[not_null_mask, col] = le.fit_transform(X.loc[not_null_mask, col].astype(str))
                self.encoders[col] = le
        return self
    
    def transform(self, X):
        X_original = X.copy()
        X = X.copy()
        
        for col in self.cols:
            if col in self.encoders:  # Only if encoder exists
                not_null_mask = X[col].notnull()
                X.loc[not_null_mask, col] = self.encoders[col].transform(X.loc[not_null_mask, col].astype(str))
        
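        # KNN returns a bare NumPy array, so restore the column labels and the original index
        X_imputed = self.knn_imputer.fit_transform(X)
        X_imputed = pd.DataFrame(X_imputed, columns=X.columns, index=X.index)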
        
        for col in self.cols:
            if col in self.encoders:  # Only if encoder exists
                X_imputed.loc[:, col] = np.round(X_imputed.loc[:, col])  # Rounding only categorical columns
                X_imputed[col] = X_imputed[col].astype(int)  # Converting to int before decoding
                X_imputed[col] = self.encoders[col].inverse_transform(X_imputed[col])
        
        if not self.include_numeric:
            # Put back the original non-categorical columns, missing values included
            replacements = [x for x in X.columns if x not in self.cols]
            X_imputed[replacements] = X_original[replacements]
        
        return X_imputed
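
If you want a true majority vote (the mode) among the nearest neighbors, one scikit-learn-only option is to treat the categorical column as a classification target. The sketch below swaps in `KNeighborsClassifier`, which is not what the class above does. `knn_mode_impute` is a hypothetical helper, the data is made up, and it assumes the predictor columns are numeric with no missing values:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def knn_mode_impute(frame, target, predictors, k=3):
    """Fill NaNs in `target` with the majority label among the k nearest rows.

    Hypothetical helper: assumes `predictors` are numeric and complete.
    """
    out = frame.copy()
    missing = out[target].isna()
    if missing.any() and (~missing).any():
        # Never ask for more neighbors than there are labeled rows
        k_eff = min(k, int((~missing).sum()))
        clf = KNeighborsClassifier(n_neighbors=k_eff)
        clf.fit(out.loc[~missing, predictors], out.loc[~missing, target])
        # predict() returns the most common label among the neighbors
        out.loc[missing, target] = clf.predict(out.loc[missing, predictors])
    return out

# Made-up data: two age clusters, two rows with unknown education
toy = pd.DataFrame({
    "Age": [25, 27, 26, 60, 62, 61, 24, 63],
    "Education": ["Bachelor", "Bachelor", "Master", "PhD", "PhD", "Master", np.nan, np.nan],
})

filled = knn_mode_impute(toy, target="Education", predictors=["Age"], k=3)
print(filled)
```

Because the result is always one of the observed labels, no rounding or decoding is needed. With several predictors on different scales, you would likely want to standardize them first so that no single column dominates the distance.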

