# Handling missing values

Values can be missing for many reasons: no observation was made, a transcription error occurred, the data was corrupted, and so on. Whatever the cause, missing values need to be dealt with before modelling.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('../data/diabetes.csv')
df.head()

,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnancies    768 non-null int64
glucose        768 non-null int64
diastolic      768 non-null int64
triceps        768 non-null int64
insulin        768 non-null int64
bmi            768 non-null float64
dpf            768 non-null float64
age            768 non-null int64
diabetes       768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


At first glance, this DataFrame does not appear to have any missing values. But missing values can be encoded in a number of ways: as `'0'`, `' '` or `'?'` where you are expecting a string, or as `-1` or `0` where you expect a numerical value but the value provided makes no sense. Here, for example, a `bmi` or `insulin` of 0 is not a plausible measurement.
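One quick way to spot such disguised missing values is to count how often a suspicious sentinel (here `0`) occurs in each column. The sketch below uses a small hand-made frame, `toy`, as a stand-in for the real data; its name and values are illustrative assumptions only.

```python
import pandas as pd

# hypothetical stand-in for the diabetes data (values are made up)
toy = pd.DataFrame({
    'glucose': [148, 85, 183, 89],
    'insulin': [0, 0, 0, 94],
    'bmi': [33.6, 26.6, 0.0, 28.1],
})

# count, per column, how many entries equal the sentinel value 0
zero_counts = (toy == 0).sum()
print(zero_counts)
```

A column where zero is a legitimate value (such as a count of pregnancies) would also show up here, so the counts only flag candidates; deciding which zeros really mean "missing" still takes knowledge of the data.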

The first step is to replace all missing values with `NaN`. `NaN` is the standard marker pandas uses for missing data, and using it lets us take advantage of pandas methods such as `.dropna()` and `.fillna()`, as well as scikit-learn's imputation transformer `SimpleImputer` (called `Imputer` in scikit-learn versions before 0.20).

In [4]:
# replace the zeros that stand in for missing values with NaN;
# assigning the result back avoids relying on chained in-place modification
cols = ['insulin', 'triceps', 'bmi']
df[cols] = df[cols].replace(0, np.nan)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnancies    768 non-null int64
glucose        768 non-null int64
diastolic      768 non-null int64
triceps        541 non-null float64
insulin        394 non-null float64
bmi            757 non-null float64
dpf            768 non-null float64
age            768 non-null int64
diabetes       768 non-null int64
dtypes: float64(4), int64(5)
memory usage: 54.1 KB
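Rather than reading the non-null counts off `df.info()`, the number and share of missing values per column can be computed directly with `.isna()`. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the real data).

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the cleaned data (values are made up)
toy = pd.DataFrame({
    'insulin': [np.nan, np.nan, np.nan, 94.0],
    'bmi': [33.6, 26.6, np.nan, 28.1],
    'age': [50, 31, 32, 21],
})

# number of missing values in each column
missing_counts = toy.isna().sum()

# fraction of missing values in each column (mean of a boolean mask)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```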


### How to handle missing data

1. We could simply drop all rows that contain missing data using the pandas `.dropna()` method.

In [5]:
df_dropped = df.dropna()
df_dropped.shape

(393, 9)

But this leaves us with approximately half of the original number of rows (393 of 768), which is usually unacceptable: we would be throwing away valuable information along with the missing data.

For the same reason, it is generally unacceptable to remove columns that contain a large number of missing values.

2. We could impute the missing data, i.e. fill in the missing values. Here we can use the pandas `.fillna()` method or scikit-learn's imputer.

This is where domain knowledge is useful, but in its absence we make an educated guess as to what the missing values could be. A common approach is to replace each missing value with the **mean** or **median** of the column it is in. We can use scikit-learn's imputer class to perform this task.

In [6]:
from sklearn.impute import SimpleImputer

# instantiate the imputer
# (SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22)
# it will replace all occurrences of NaN using the 'mean', as
# specified in the strategy; the mean is calculated per column (feature)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')



In [8]:
X = df.drop('diabetes', axis=1).values
y = df.diabetes.values

# we then need to fit and transform our data using the 'imputer'
imp.fit(X)
X_transformed = imp.transform(X)
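For comparison, the pandas `.fillna()` route mentioned earlier can do a similar column-wise fill without scikit-learn. The sketch below uses the median rather than the mean and runs on a made-up frame (`toy` and its values are assumptions).

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the data after zeros were replaced with NaN
toy = pd.DataFrame({
    'insulin': [np.nan, 94.0, 168.0, np.nan, 88.0],
    'bmi': [33.6, 26.6, np.nan, 28.1, 43.1],
})

# DataFrame.median() skips NaN by default, giving one median per column
medians = toy.median()

# passing a Series to fillna fills each column with the value at its label
filled = toy.fillna(medians)
print(filled)
```

Unlike a fitted imputer, this computes the fill values from whatever frame it is given, so for a train/test workflow the medians would need to be computed on the training rows and reused on the test rows.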

With the missing values imputed, we would then train a model on the transformed data as usual.

scikit-learn provides the **Pipeline** object, which chains the transformation and training of the data together so that they can be run in fewer steps.

In [11]:
# import the modules and instantiate the imputer and estimator (regressor/classifier)
# (SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logreg = LogisticRegression()



In [12]:
# build the list of steps; each step is a tuple of (name of the step, estimator)
# in a pipeline, each step (but the last) must be a transformer
# the last step must be an estimator (classifier/regressor)
steps = [('imputation', imp), ('logistic_regression', logreg)]

# pass the list to the pipeline constructor
pipeline = Pipeline(steps)

In [13]:
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [14]:
# fit the model and predict on the test set
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

# compute the model's score
pipeline.score(X_test, y_test)



0.7619047619047619
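One benefit of putting the imputer inside the pipeline is that its fill values are learned from the training data only and then reused on the test data. The sketch below illustrates this on a tiny made-up array (the data and the variable names `pipe`, `train_means` and `X_test_imputed` are assumptions for illustration), using the `named_steps` attribute of `Pipeline` and the `statistics_` attribute of a fitted `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical toy data: two features, NaN marks a missing value
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 30.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[np.nan, 100.0]])

pipe = Pipeline([
    ('imputation', SimpleImputer(strategy='mean')),
    ('logistic_regression', LogisticRegression()),
])
pipe.fit(X_train, y_train)

# column means learned during fit, computed from X_train only
train_means = pipe.named_steps['imputation'].statistics_

# the fitted imputer fills the missing test value with the training mean
X_test_imputed = pipe.named_steps['imputation'].transform(X_test)

print(train_means)
print(X_test_imputed)
```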