<div class="alert alert-warning" role="alert">
    <b style="font-size: 1.5em;">🚧 Warning</b>
    <p>
    Removing missing values <b>assumes</b> that they are 
    missing completely at random (<code>MCAR</code>). Under any other missingness mechanism, 
    removing them may introduce <b>bias</b> into subsequent analyses and models.
    </p>
</div>
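
To make the warning concrete, here is a minimal sketch on hypothetical toy data (the `weights` series and the masked positions are invented for illustration and are not taken from `riskfactors`). The "MCAR-like" positions were hand-picked so that the mean is preserved exactly; with truly random missingness the mean is only preserved in expectation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten known weights, so the true mean is known.
weights = pd.Series(
    [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0]
)
true_mean = weights.mean()

# MCAR-like pattern: the masked positions are unrelated to the values.
mcar = weights.copy()
mcar.iloc[[1, 8]] = np.nan  # hides 110 and 180

# Non-MCAR pattern: only the two heaviest respondents are missing.
not_mcar = weights.copy()
not_mcar.iloc[[8, 9]] = np.nan  # hides 180 and 190

mcar_mean = mcar.dropna().mean()
not_mcar_mean = not_mcar.dropna().mean()

print(true_mean, mcar_mean, not_mcar_mean)
```

When missingness depends on the value itself, as in the second pattern, dropping the missing entries pulls the estimated mean below the true one.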

In [2]:
import sys
import pyprojroot
import pyreadr
import missingno
import importlib
sys.path.append(str(pyprojroot.here()))
import src.pandas_missing_extension
importlib.reload(src.pandas_missing_extension)
from src.utils import make_dir_function
from src.pandas_missing_extension import MissingMethods

In [3]:
data_dir = make_dir_function("data")
riskfactor_file = data_dir("raw", "riskfactors.rda")

In [4]:
riskfactors_df = pyreadr.read_r(riskfactor_file)['riskfactors']

First, observe the total number of observations and variables in your dataset.

In [7]:
riskfactors_df.shape

(245, 34)

# Pairwise deletion

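Pairwise deletion ignores missing values only in the calculations that involve them; no rows are removed from the dataset. Each statistic is computed from every value available for it, so different statistics may be based on different numbers of observations.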

In [8]:
riskfactors_df.weight_lbs.mean()

174.26808510638298

In [11]:
print(riskfactors_df.weight_lbs.size)
print(riskfactors_df.weight_lbs.count())

245
235


In [12]:
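# numeric_only=True restricts the mean to numeric columns (recent pandas versions
# raise an error otherwise); skipna=False propagates NaN instead of skipping it
riskfactors_df.mean(numeric_only=True, skipna=False)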


age                58.106122
weight_lbs               NaN
height_inch              NaN
bmi                      NaN
children            0.424490
health_physical     4.118367
health_mental       3.142857
health_poor              NaN
drink_days               NaN
drink_average            NaN
diet_fruit               NaN
diet_salad               NaN
diet_potato              NaN
diet_carrot              NaN
diet_vegetable           NaN
diet_juice               NaN
dtype: float64
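
The behaviour above can be reproduced on a small frame. This is a minimal sketch using a hypothetical `toy_df` (invented values, reusing three column names from the dataset only for readability): `mean()` skips `NaN` by default, and `count()` reports how many values each statistic actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from riskfactors.
toy_df = pd.DataFrame(
    {
        "weight_lbs": [150.0, np.nan, 180.0, 200.0],
        "height_inch": [65.0, 70.0, np.nan, 72.0],
        "age": [30.0, 40.0, 50.0, 60.0],
    }
)

# Pairwise deletion: each column mean uses all non-missing values in that column.
pairwise_means = toy_df.mean()  # skipna=True is the default

# Number of observations behind each mean (differs from column to column).
n_used = toy_df.count()

# With skipna=False, any column containing a NaN yields NaN.
strict_means = toy_df.mean(skipna=False)

print(pairwise_means)
print(n_used)
print(strict_means)
```

Here `weight_lbs` and `height_inch` are each averaged over three values while `age` is averaged over four, which is the defining trait of pairwise deletion.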

# Listwise deletion or complete-case analysis

Listwise deletion removes every row (case) that has at least one missing value in any of the variables under consideration, so only complete cases remain.

In [16]:
riskfactors_df.dropna(
    subset=['weight_lbs', 'height_inch'], # only check these columns for missing values
    how='any', # drop rows where any of the subset columns are missing
).shape

(234, 34)

In [17]:
riskfactors_df.dropna(
    subset=['weight_lbs', 'height_inch'], # only check these columns for missing values
    how='all', # drop rows where all of the subset columns are missing
).shape

(244, 34)
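
In the two cells above, `how='any'` keeps 234 of the 245 rows, while `how='all'` keeps 244, which means a single row is missing both `weight_lbs` and `height_inch`. The difference between the two options can be checked on a minimal sketch with a hypothetical `toy_df` (invented values, not taken from `riskfactors`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, rows 1 and 2 each miss one
# measurement, and row 3 misses both.
toy_df = pd.DataFrame(
    {
        "weight_lbs": [150.0, np.nan, 180.0, np.nan],
        "height_inch": [65.0, 70.0, np.nan, np.nan],
        "age": [30.0, 40.0, 50.0, 60.0],
    }
)
cols = ["weight_lbs", "height_inch"]

# Listwise deletion: drop a row if ANY of the subset columns is missing.
complete_cases = toy_df.dropna(subset=cols, how="any")

# Looser variant: drop a row only if ALL of the subset columns are missing.
not_all_missing = toy_df.dropna(subset=cols, how="all")

rows_lost = len(toy_df) - len(complete_cases)

print(complete_cases.index.tolist())
print(not_all_missing.index.tolist())
print(rows_lost)
```

Only `how='any'` corresponds to listwise deletion in the strict sense; `how='all'` leaves partially observed rows in place.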