Deleting observations with missing values is easy with a clever line of NumPy:

In [2]:
# Load library
import numpy as np
# Create feature matrix
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])


In [3]:
# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

Alternatively, we can drop missing observations using pandas:

In [4]:
# Load library
import pandas as pd
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()

   feature_1  feature_2
0        1.1       11.1
1        2.2       22.2
2        3.3       33.3
3        4.4       44.4
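
When only some features matter, or when a row should survive as long as it has enough data, `dropna` can be narrowed with its `subset` and `thresh` parameters. The sketch below uses a small hypothetical DataFrame (not the one from the solution) purely to show the difference between the two:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing feature_1, row 2 is missing everything
dataframe = pd.DataFrame({"feature_1": [1.1, np.nan, np.nan],
                          "feature_2": [11.1, 22.2, np.nan]})

# Drop a row only when feature_1 is missing
by_subset = dataframe.dropna(subset=["feature_1"])

# Keep every row that has at least one non-missing value
by_thresh = dataframe.dropna(thresh=1)

print(by_subset)
print(by_thresh)
```

Which variant is appropriate depends on which features the downstream model actually uses.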


Discussion
Most machine learning algorithms cannot handle any missing values in the target
and feature arrays. For this reason, we cannot ignore missing values in our data
and must address the issue during preprocessing.
The simplest solution is to delete every observation that contains one or more
missing values, a task quickly and easily accomplished using NumPy or pandas.
That said, we should be very reluctant to delete observations with missing
values. Deleting them is the nuclear option, since our algorithm loses access to
the information contained in the observation’s non-missing values.
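
Before deleting anything, it can help to count how much would be lost. The following is a minimal sketch, assuming the same `features` matrix as in the solution; the names `rows_with_missing`, `n_dropped`, and `lost_values` are illustrative only:

```python
import numpy as np

# Same feature matrix as in the solution
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])

# Boolean mask marking observations with at least one missing value
rows_with_missing = np.isnan(features).any(axis=1)

# Number of observations that deletion would remove
n_dropped = int(rows_with_missing.sum())

# Non-missing values that would be discarded along with those observations
lost_values = int((~np.isnan(features[rows_with_missing])).sum())

print(f"Observations dropped: {n_dropped} of {features.shape[0]}")
print(f"Non-missing values lost: {lost_values}")
```
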
Just as important, depending on the cause of the missing values, deleting
observations can introduce bias into our data. There are three types of missing
data:

**Missing Completely At Random (MCAR)**
The probability that a value is missing is independent of everything, including
both the observed and the unobserved values. For example, a survey respondent
rolls a die before answering a question: if she rolls a six, she skips that
question.


**Missing At Random (MAR)**
The probability that a value is missing is not completely random, but
depends only on information captured in other features. For example, a survey
asks about gender identity and annual salary, and women are more likely to
skip the salary question; however, their nonresponse depends only on
information we have captured in our gender identity feature.

**Missing Not At Random (MNAR)**
The probability that a value is missing is not random and depends on
information not captured in our features. For example, a survey asks about
annual salary and women are more likely to skip the salary question, but
we do not have a gender identity feature in our data.

It is sometimes acceptable to delete observations if they are MCAR or MAR.
However, if the value is MNAR, the fact that a value is missing is itself
information. Deleting MNAR observations can inject bias into our data because
we are removing observations produced by some unobserved systematic effect.
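
The effect of the missingness mechanism on deletion can be illustrated with a small simulation. This is only a sketch on synthetic data: the salary distribution, the missingness rates, and all variable names are assumptions made for illustration, and the MNAR case here lets missingness depend on the unobserved salary value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic salaries (hypothetical distribution, for illustration only)
salary = rng.normal(loc=50_000, scale=10_000, size=n)

# MCAR: every value has the same 30% chance of being missing
mcar = pd.Series(salary).mask(rng.random(n) < 0.3)

# MNAR: above-median salaries go missing 50% of the time, the rest 10%,
# so missingness depends on the very value we fail to observe
p_missing = np.where(salary > np.median(salary), 0.5, 0.1)
mnar = pd.Series(salary).mask(rng.random(n) < p_missing)

# Compare the mean of the full data with the means after deletion
true_mean = salary.mean()
mcar_mean = mcar.dropna().mean()
mnar_mean = mnar.dropna().mean()

print(f"True mean:           {true_mean:,.0f}")
print(f"After dropna (MCAR): {mcar_mean:,.0f}")
print(f"After dropna (MNAR): {mnar_mean:,.0f}")
```

Under these assumptions, the mean after deleting MCAR rows should stay close to the true mean, while the mean after deleting MNAR rows should be pulled noticeably downward, because high salaries are removed disproportionately.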
