# Imputing Missing Values

You have missing values in your data and want to fill in or predict their values.

## If you have a small amount of data, predict the missing values using k-nearest neighbors (KNN)

In [1]:
pip install fancyimpute




In [2]:
import numpy as np
from fancyimpute import KNN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

In [3]:
# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000,
 n_features = 2,
 random_state = 1)

In [4]:
features

array([[-3.05837272,  4.48825769],
       [-8.60973869, -3.72714879],
       [ 1.37129721,  5.23107449],
       ...,
       [-1.91854276,  4.59578307],
       [-1.79600465,  4.28743568],
       [-6.97684609, -8.89498834]])

In [5]:
# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

In [6]:
features

array([[-3.05837272,  4.48825769],
       [-8.60973869, -3.72714879],
       [ 1.37129721,  5.23107449],
       ...,
       [-1.91854276,  4.59578307],
       [-1.79600465,  4.28743568],
       [-6.97684609, -8.89498834]])

In [7]:
standardized_features

array([[ 0.87301861,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359],
       ...,
       [ 1.18998798,  1.33439442],
       [ 1.22406396,  1.27667052],
       [-0.21664919, -1.19113343]])

In [8]:
# Replace the first feature's first value with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan

In [9]:
standardized_features

array([[        nan,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359],
       ...,
       [ 1.18998798,  1.33439442],
       [ 1.22406396,  1.27667052],
       [-0.21664919, -1.19113343]])

In [11]:
# Predict the missing values in the feature matrix
features_knn_imputed = KNN(k=5, verbose=0).fit_transform(standardized_features)

In [12]:
features_knn_imputed

array([[ 1.09553327,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359],
       ...,
       [ 1.18998798,  1.33439442],
       [ 1.22406396,  1.27667052],
       [-0.21664919, -1.19113343]])

In [13]:
# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_knn_imputed[0,0])

True Value: 0.8730186113995938
Imputed Value: 1.0955332713113226
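If fancyimpute is not available, the same experiment can be sketched with scikit-learn's own KNNImputer class (in sklearn.impute since version 0.22). This is an alternative, not the method used above: KNNImputer averages neighbors using a NaN-aware Euclidean distance, so the imputed value may differ from the fancyimpute result, and no expected output is shown.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Rebuild the same simulated, standardized feature matrix
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)
standardized = StandardScaler().fit_transform(features)

# Hide the first feature's first value
true_value = standardized[0, 0]
standardized[0, 0] = np.nan

# Impute with the 5 nearest neighbors (uniform weights by default)
imputed = KNNImputer(n_neighbors=5).fit_transform(standardized)

print("True Value:", true_value)
print("Imputed Value:", imputed[0, 0])
```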


### Alternatively, we can use scikit-learn’s SimpleImputer class to fill in missing values with the feature’s mean, median, or most frequent value. However, we will typically get worse results than KNN:

In [15]:
# Load library
from sklearn.impute import SimpleImputer

In [16]:
# Create imputer
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [17]:
mean_imputer

SimpleImputer()

In [18]:
# Impute on the standardized matrix, which contains the missing value
features_mean_imputed = mean_imputer.fit_transform(standardized_features)

In [21]:
# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", round(features_mean_imputed[0,0], 6))

True Value: 0.8730186113995938
Imputed Value: -0.000874


# Read IT

There are two main strategies for replacing missing data with substitute values, each
of which has strengths and weaknesses. First, we can use machine learning to predict
the values of the missing data. To do this we treat the feature with missing values as a
target vector and use the remaining subset of features to predict missing values.
While we can use a wide range of machine learning algorithms to impute values, a
popular choice is KNN. KNN is addressed in depth later in Chapter 14, but the short
explanation is that the algorithm uses the k nearest observations (according to some
distance metric) to predict the missing value.
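To make the idea concrete, here is a minimal NumPy sketch of KNN imputation for a single missing entry. The helper name knn_impute_value is hypothetical, and the sketch assumes an unweighted mean over the k nearest complete rows with Euclidean distance on the other features; library implementations such as fancyimpute may weight neighbors differently.

```python
import numpy as np

def knn_impute_value(X, row, col, k=5):
    """Estimate X[row, col] as the mean of that column over the k nearest rows.

    Hypothetical helper for illustration: distances use only the other columns,
    and only rows without any missing values are considered as neighbors.
    """
    other_cols = [c for c in range(X.shape[1]) if c != col]
    # Candidate neighbors: every complete row except the target row
    complete = ~np.isnan(X).any(axis=1)
    complete[row] = False
    candidates = np.flatnonzero(complete)
    # Euclidean distance from the target row to each candidate
    diffs = X[candidates][:, other_cols] - X[row, other_cols]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Average the target column over the k closest candidates
    nearest = candidates[np.argsort(distances)[:k]]
    return X[nearest, col].mean()

# Toy data: the first row is missing its first feature
X = np.array([[np.nan, 1.0],
              [1.0, 1.1],
              [3.0, 0.9],
              [10.0, 5.0],
              [12.0, 6.0]])

imputed = knn_impute_value(X, row=0, col=0, k=2)
print(imputed)
```

With k=2 the two closest rows on the second feature are the second and third rows, so the estimate is the mean of their first-feature values.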

## The downside to KNN

The downside to KNN is that in order to know which observations are closest to
the observation with the missing value, it needs to calculate the distance between
that observation and every other observation. This is reasonable in smaller datasets,
but quickly becomes problematic if a dataset has millions of observations.

## An alternative and more scalable strategy

An alternative and more scalable strategy is to fill in all missing values with some
average value. For example, in our solution we used scikit-learn to fill in missing
values with a feature’s mean value. The imputed value is often not as close to the true
value as when we used KNN, but we can scale mean-filling to data containing
millions of observations easily.
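Mean-filling scales well because it needs only one pass over each column. A minimal NumPy sketch on a made-up toy matrix (not the data from the solution) shows the whole computation:

```python
import numpy as np

# Toy feature matrix with one missing value in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

# Column means computed while ignoring missing entries
column_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column (broadcast across rows)
X_filled = np.where(np.isnan(X), column_means, X)

print(column_means)
print(X_filled)
```

No distances between observations are computed, which is why the cost grows only linearly with the number of rows.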

## If we use imputation

If we use imputation, it is a good idea to create a binary feature indicating whether or
not the observation contains an imputed value.
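One way to build such an indicator, sketched here on a made-up toy matrix, is SimpleImputer's add_indicator parameter, which appends a binary column for each feature that had missing values when the imputer was fit:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix: only the first feature has a missing value
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0],
              [5.0, 60.0]])

# Mean-impute and append an indicator column for the feature with missing data
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
```

The first two columns of X_out are the imputed features, and the third is 1 in the row where the first feature was filled in and 0 elsewhere. A downstream model can then learn whether missingness itself carries information.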