<div align="center"> <h1>Imputation of Missing Values</h1>
    <h2><a href="...">Richard Leibrandt</a></h2>
</div>

Datasets are often incomplete. For example, the column normalized-losses has missing values in the following:

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
import pandas as pd

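car_prices = pd.read_csv("../data/car_prices.csv")

# Blank out the '?' placeholders, then convert every column to numbers;
# values that cannot be parsed as numbers become NaN
features = car_prices
features = features.replace('?', '', regex=False)
features = features.apply(pd.to_numeric, errors='coerce')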

In [2]:
features.isna().sum()

symboling              0
normalized-losses     41
make                 205
fuel-type            205
aspiration           205
num-of-doors         205
body-style           205
drive-wheels         205
engine-location      205
wheel-base             0
length                 0
width                  0
height                 0
curb-weight            0
engine-type          205
num-of-cylinders     205
engine-size            0
fuel-system          205
bore                   4
stroke                 4
compression-ratio      0
horsepower             2
peak-rpm               2
city-mpg               0
highway-mpg            0
price                  4
dtype: int64
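
The count of 205 for columns such as make or engine-location does not mean that the raw file lacks those values. `pd.to_numeric` with `errors='coerce'` turns every string that is not a number into NaN, so text columns come out entirely empty. A minimal sketch on a made-up three-row frame (the values are illustrative assumptions, not taken from car_prices.csv):

```python
import pandas as pd

# Toy frame (hypothetical values): one text column and one numeric column with a '?' placeholder
toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "horsepower": ["111", "?", "154"],
})

# Same coercion as in cell 1: anything that cannot be parsed as a number becomes NaN
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

# Number of missing values per column after the conversion
missing_per_column = toy_numeric.isna().sum()
print(missing_per_column)
```

The text column loses all of its values, while the numeric column only loses the entry that held the placeholder.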

First, let's remove all columns that are completely empty:

In [3]:
features = features.dropna(axis='columns', how='all')

In [4]:
features.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205.0,205.0,205.0,205.0,205.0,205.0,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
mean,0.834146,122.0,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122,13207.129353
std,1.245307,35.442168,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443,7947.066342
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,94.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,115.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,2.0,150.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.59,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


Most machine learning and data mining algorithms do not accept inputs with missing values. There are two common strategies to deal with them:

* remove the observations that contain missing values
* impute the missing values

Removing observations with missing values is easy with pandas:

In [5]:
features.dropna(axis='index', how='any').describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0
mean,0.7375,121.3,98.235625,172.319375,65.59625,53.87875,2459.45,119.09375,3.298437,3.237313,10.145125,95.875,5116.25,26.50625,32.06875,11427.68125
std,1.189511,35.602417,5.163763,11.54886,1.946999,2.276608,480.897834,30.411186,0.267348,0.29421,3.882507,30.625708,465.290536,6.081208,6.440948,5863.789011
min,-2.0,65.0,86.6,141.1,60.3,49.4,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,15.0,18.0,5118.0
25%,0.0,94.0,94.5,165.525,64.0,52.0,2073.25,97.0,3.05,3.1075,8.7,69.0,4800.0,23.0,28.0,7383.5
50%,1.0,114.0,96.9,172.2,65.4,54.1,2338.5,110.0,3.27,3.27,9.0,88.0,5200.0,26.0,32.0,9164.0
75%,2.0,148.0,100.6,177.8,66.5,55.5,2808.75,134.5,3.55,3.41,9.4,114.0,5500.0,31.0,37.0,14559.25
max,3.0,256.0,115.6,202.6,71.7,59.8,4066.0,258.0,3.94,4.17,23.0,200.0,6600.0,49.0,54.0,35056.0


We can also consider only specific columns for the drop:

In [6]:
features.dropna(axis=0, subset=['price']).head()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


If we drop every observation that has any missing value, we keep only 160 of the 205 observations. That is a loss of about 22% of the data.
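
To make the cost of each variant concrete, the share of lost rows can be computed directly. The sketch below uses a small made-up frame (hypothetical values) rather than the car data and compares `how='any'` with a drop restricted to the price column:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap in a feature and one in the target
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0, 115.0],
    "price": [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
})

complete_rows = toy.dropna(axis="index", how="any")       # drop a row if any value is missing
priced_rows = toy.dropna(axis="index", subset=["price"])  # drop a row only if price is missing

# Fraction of observations lost by each variant
lost_any = 1 - len(complete_rows) / len(toy)
lost_price = 1 - len(priced_rows) / len(toy)
print(f"how='any': {lost_any:.0%} lost, subset=['price']: {lost_price:.0%} lost")
```

Restricting the drop to the columns that really must be known usually keeps noticeably more data.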

Let's try the other approach. We can impute the missing values, that is, infer them from the known part of the data. Below, each missing value is replaced by the mean of the known values in its column. The other strategies offered by SimpleImputer are the median, the most frequent value, and a constant fill value.
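
How much the choice of strategy matters depends on the distribution of the column. As a rough illustration, the sketch below imputes a single made-up column (hypothetical values, including one outlier) with three strategies and records the value each one writes into the gap:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with one missing entry and one outlier
X = np.array([[1.0], [2.0], [2.0], [np.nan], [15.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Row 3 holds the missing entry; store the value that replaces it
    filled[strategy] = imputer.fit_transform(X)[3, 0]

print(filled)
```

The mean is pulled towards the outlier, while the median and the most frequent value are not, which is one reason to prefer them for skewed columns.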

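Note that observations with a missing value in the target feature (here price) cannot be used to 'teach' the model, so they should be removed before imputation, for example with dropna(subset=['price']) as in cell 6. The cell below skips that step and imputes all 205 observations, including price.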

In [7]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # define the imputer object
features_imp = imputer.fit_transform(features)  # fit to the data and transform it
features_imp.shape

(205, 16)
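
The shape above shows that all 205 rows were kept and that price was imputed along with the inputs. One possible way to follow the note about the target is sketched below on a made-up frame (hypothetical values; the names `labelled`, `X` and `y` are illustrative): drop the rows without a price, set the target aside, and impute only the remaining features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values); 'price' plays the role of the target
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "bore": [3.47, 3.47, np.nan, 3.19],
    "price": [13495.0, 16500.0, np.nan, 13950.0],
})

# 1. Discard observations whose target is unknown
labelled = toy.dropna(axis="index", subset=["price"])

# 2. Separate the target so that it is never imputed
X = labelled.drop(columns="price")
y = labelled["price"]

# 3. Impute only the remaining gaps in the input features
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imp.shape)
```

With this order, the imputed means are computed only from observations that can actually be used for training.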

The output is a NumPy array. For easier inspection we can convert it back to a pandas DataFrame:

In [8]:
features_df = pd.DataFrame(features_imp, columns=features.columns)
features_df.head()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3.0,122.0,88.6,168.8,64.1,48.8,2548.0,130.0,3.47,2.68,9.0,111.0,5000.0,21.0,27.0,13495.0
1,3.0,122.0,88.6,168.8,64.1,48.8,2548.0,130.0,3.47,2.68,9.0,111.0,5000.0,21.0,27.0,16500.0
2,1.0,122.0,94.5,171.2,65.5,52.4,2823.0,152.0,2.68,3.47,9.0,154.0,5000.0,19.0,26.0,16500.0
3,2.0,164.0,99.8,176.6,66.2,54.3,2337.0,109.0,3.19,3.4,10.0,102.0,5500.0,24.0,30.0,13950.0
4,2.0,164.0,99.4,176.6,66.4,54.3,2824.0,136.0,3.19,3.4,8.0,115.0,5500.0,18.0,22.0,17450.0
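
The conversion above restores the column names but gives the rows a fresh 0..n-1 index. That is harmless here, but if rows were dropped before imputing, the original row labels would be lost. A small sketch, again on made-up data (hypothetical values), that passes the index along as well:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) whose index has a gap, as it would after a dropna
toy = pd.DataFrame(
    {"horsepower": [111.0, np.nan, 102.0], "bore": [3.47, 3.47, 3.19]},
    index=[0, 1, 3],
)

imputed = SimpleImputer(strategy="mean").fit_transform(toy)

# Reattach both the column names and the original row labels
toy_df = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
print(toy_df)
```

Reusing `columns` this way assumes that the imputer returns one output column per input column. By default SimpleImputer discards columns that are entirely empty during fitting, which is another reason to remove those columns beforehand, as done in cell 3.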
