# Missing Data

A simple exploration of missing data


In [1]:
# %pip install --quiet --upgrade pip 
# %pip install numpy --quiet
# %pip install PyArrow --quiet
# %pip install Pandas --quiet
# %pip install scikit-learn --quiet
# %pip install missingno --quiet
# %pip install statsmodels --quiet

### Identifying Missing Data

The first step in handling missing data is to identify where and how data is missing. In **Pandas**, this can be done using the `isna()` or `isnull()` functions. 

It can be confusing that two functions do exactly the same thing, but the [source code of Pandas](https://github.com/pandas-dev/pandas/blob/0409521665bd436a10aea7e06336066bf07ff057/pandas/core/dtypes/missing.py#L109) shows that `isnull()` is just an alias for `isna()`. As a best practice, I prefer `isna()` over `isnull()`: the other **Pandas** methods that handle missing values, such as `dropna()` and `fillna()`, follow the same naming, which makes it easier to remember.
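The alias can also be confirmed without reading the source by comparing the two functions directly. This is a small sketch; `df_check` is a throwaway frame introduced only for this illustration.

```python
import pandas as pd

# Throwaway frame used only for this check (not part of the notebook's data)
df_check = pd.DataFrame({'age': [25, None, 40]})

# At the top level both names refer to the very same function object
same_function = pd.isnull is pd.isna

# The DataFrame methods return identical boolean masks
same_result = df_check.isna().equals(df_check.isnull())

print(same_function, same_result)
```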


In [None]:
import pandas as pd

# Sample data
data = {'age': [25, 30, None, 40, 50],
        'income': [50000, None, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Identify missing values
missing_data = df.isna()
print(missing_data)

     age  income
0  False   False
1  False    True
2   True   False
3  False   False
4  False   False
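
The boolean mask is hard to read on larger frames. Summing it gives per-column counts, and taking its mean gives the missing fraction. The sketch below rebuilds the same sample data under the hypothetical name `sample` so that it runs on its own.

```python
import pandas as pd

# Same values as the sample data above, rebuilt so this block is self-contained
sample = pd.DataFrame({'age': [25, 30, None, 40, 50],
                       'income': [50000, None, 60000, 70000, 80000]})

missing_counts = sample.isna().sum()    # number of missing values per column
missing_share = sample.isna().mean()    # fraction of missing values per column
rows_with_gaps = sample[sample.isna().any(axis=1)]  # rows with at least one gap

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```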


### Little's MCAR Test

[Little's MCAR test](https://wiki.q-researchsoftware.com/wiki/Missing_Data_-_Little%27s_MCAR_Test) is used to assess whether data is **Missing Completely at Random (MCAR)**. The null hypothesis is that the data is MCAR, that is, the probability of a value being missing does not depend on the data values. If the test result is **non-significant** (i.e., p-value > 0.05), the null hypothesis is not rejected, which is consistent with MCAR but does not prove it.
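
For reference, the statistic of the test as usually stated is sketched below. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $J$ is the number of distinct missingness patterns, $n_j$ the number of rows with pattern $j$, $p_j$ the number of variables observed in pattern $j$, $p$ the total number of variables, $\bar{y}_{\text{obs},j}$ the mean of the observed variables within pattern $j$, and $\hat{\mu}_{\text{obs},j}$, $\hat{\Sigma}_{\text{obs},j}$ the corresponding parts of the maximum-likelihood estimates of the overall mean and covariance.

$$
d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right)^{\top} \hat{\Sigma}_{\text{obs},j}^{-1} \left(\bar{y}_{\text{obs},j} - \hat{\mu}_{\text{obs},j}\right),
\qquad
d^2 \sim \chi^2_{\,\sum_{j} p_j - p} \ \text{under MCAR}
$$

The function defined in this notebook simplifies this by using per-column variances in place of the full covariance matrix and its own degrees-of-freedom count, so its p-value is best read as an approximation.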


In [None]:
import numpy as np
from scipy.stats import chi2

def little_mcar_test(data : pd.DataFrame) -> float:
    """
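    Simplified implementation of Little's MCAR test.

    Uses per-column means and variances instead of the maximum-likelihood
    mean vector and covariance matrix of the original test, so the p-value
    is approximate.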
    
    Parameters
    ----------
    data : pd.DataFrame
        Dataset with missing values. `n` rows (samples) and `m` columns (features).

    Returns
    -------
    pvalue : float
        The p-value of a chi-square hypothesis test.
        Null hypothesis: data is Missing Completely At Random (MCAR).
        Alternative hypothesis: data is not MCAR.
        A p-value > 0.05 means the null hypothesis is not rejected, i.e. the data is consistent with MCAR.
    """
    overall_means = data.mean()
    variances = data.var()
    chi_squared_stat = 0
    degrees_of_freedom = 0

    # Group rows by their missingness pattern (one boolean per column)
    pattern_groups = data.groupby(data.apply(lambda row: tuple(row.isna()), axis=1))
    for pattern, group in pattern_groups:
        observed_columns = ~np.array(pattern)
        group_means = group.mean()
        residuals = group_means[observed_columns] - overall_means[observed_columns]
        
        group_size = len(group)
        group_variance = variances[observed_columns]
        chi_squared_contribution = np.sum((group_size * (residuals ** 2)) / group_variance)

        chi_squared_stat += chi_squared_contribution
        degrees_of_freedom += int(observed_columns.sum()) - 1  # Number of observed variables - 1

    # Survival function: same value as 1 - cdf, but numerically more stable
    p_value = chi2.sf(chi_squared_stat, degrees_of_freedom)
    return p_value

In [None]:
import random

# Generate 100 samples of ages and incomes where income is correlated with age (with some variance) 
min_age, max_age, number_of_samples = 20, 80, 100
ages = [random.randint(min_age, max_age) for _ in range(number_of_samples)]
min_income, max_income, income_variance = 30000, 80000, 10000
income_range = max_income - min_income
age_range = max_age - min_age
min_base_income, max_base_income = min_income - income_variance, min_income + income_variance
incomes = [
    int(random.randint(min_base_income, max_base_income) + (((age - min_age) / age_range) * income_range)) 
    for age in ages
]

# Set 10% of the values to None randomly
ten_percent_of_incomes = int(0.1 * len(incomes))
random_indices = random.sample(range(len(incomes)), ten_percent_of_incomes)
randomly_null_incomes = [None if i in random_indices else income for i, income in enumerate(incomes)]

df = pd.DataFrame({'age': ages, 'income': randomly_null_incomes })
p_value = little_mcar_test(df)
print("For randomly_null_incomes data frame:")
print(f"Missing values are likely missing completely at random (p_value = {p_value})" 
      if p_value > 0.05 else 
      f"Missing values are not missing completely at random (p_value = {p_value})")

# Set 70% of the incomes over 70,000 to None, so missingness depends on the income value itself (and, through the correlation, on age)
indices_over_70000 = [i for i, income in enumerate(incomes) if income > 70000]
seventy_percent_of_incomes_over_70000 = int(0.7 * len(indices_over_70000))
random_indices_over_70000 = random.sample(indices_over_70000, seventy_percent_of_incomes_over_70000)
randomly_null_incomes_over_70000 = [None if i in random_indices_over_70000 else income for i, income in enumerate(incomes)]

df = pd.DataFrame({'age': ages, 'income': randomly_null_incomes_over_70000 })
p_value = little_mcar_test(df)
print("For randomly_null_incomes_over_70000 data frame:")
print(f"Missing values are likely missing completely at random (p_value = {p_value})" 
      if p_value > 0.05 else 
      f"Missing values are not missing completely at random (p_value = {p_value})")

### Handling MCAR Missing Data

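For data that is **Missing Completely at Random (MCAR)**, you can often drop the rows or columns with missing values without introducing significant bias into your analysis, because MCAR data shows no systematic pattern; the cost is a smaller sample. Simple mean imputation keeps every row and preserves the column mean, but it understates the variance.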

In [None]:
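# Recreate the small sample data (df was overwritten by the MCAR test cell)
df = pd.DataFrame({'age': [25, 30, None, 40, 50], 'income': [50000, None, 60000, 70000, 80000]})
# Drop rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)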

    age   income
0  25.0  50000.0
3  40.0  70000.0
4  50.0  80000.0


In [None]:
# Impute missing values with the column mean
df['age'] = df['age'].fillna(df['age'].mean())
df['income'] = df['income'].fillna(df['income'].mean())
print(df)

     age   income
0  25.00  50000.0
1  30.00  65000.0
2  36.25  60000.0
3  40.00  70000.0
4  50.00  80000.0
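
Mean imputation, as used in the cell above, leaves the column mean unchanged but shrinks the variance, because every filled value sits exactly at the mean. The following sketch makes that visible on the `age` values; `age_col` and `age_filled` are names introduced only for this illustration.

```python
import pandas as pd

# Same 'age' values as the sample data, kept separate from the notebook's df
age_col = pd.Series([25, 30, None, 40, 50], dtype='float64')
age_filled = age_col.fillna(age_col.mean())

# pandas skips NaN by default, so the "before" figures use the 4 observed values
mean_before, mean_after = age_col.mean(), age_filled.mean()
var_before, var_after = age_col.var(), age_filled.var()

print(f"mean:     {mean_before:.2f} -> {mean_after:.2f}")
print(f"variance: {var_before:.2f} -> {var_after:.2f}")
```

The understated variance is one reason mean imputation is usually reserved for quick exploration rather than for inference.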


### Handling MAR Missing Data

When data is **Missing at Random (MAR)**, imputing missing values based on other observed variables is often the best approach. Techniques like **regression imputation** or **k-nearest neighbours (KNN) imputation** are commonly used.
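
Before relying on an observed variable for imputation, it can help to check whether missingness is actually associated with it. The sketch below simulates an assumed MAR mechanism (people over 60 skip the income question more often) and compares the mean age of rows with and without income. The data, the thresholds, and the name `df_mar` are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: income rises with age, plus noise (illustrative numbers only)
age = rng.integers(20, 81, size=n)
income = 20000 + 1000 * age + rng.normal(0, 5000, size=n)

# Assumed MAR mechanism: the chance of a missing income depends only on age
p_missing = np.where(age > 60, 0.6, 0.05)
income_observed = np.where(rng.random(n) < p_missing, np.nan, income)

df_mar = pd.DataFrame({'age': age, 'income': income_observed})

# Compare the observed variable between rows with and without income
status = df_mar['income'].isna().map({True: 'missing', False: 'observed'})
mean_age_by_status = df_mar.groupby(status)['age'].mean()
print(mean_age_by_status)
```

A clear gap between the two group means suggests that age carries information about the missingness, which is what makes age a useful predictor in the imputation cells that follow.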


In [None]:
# Imputation using linear regression
from sklearn.linear_model import LinearRegression

# Sample data
data = {'age': [25, 30, 35, 40, 50],
        'income': [50000, None, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Assuming we are imputing 'income' based on 'age'
model = LinearRegression()

# Drop rows with missing 'income' and use 'age' to predict 'income'
train_data = df.dropna(subset=['income'])
X_train = train_data[['age']]
y_train = train_data['income']
model.fit(X_train, y_train)

# Predict missing 'income' values
missing_income = df['income'].isna()
df.loc[missing_income, 'income'] = model.predict(df.loc[missing_income, ['age']])
print(df)


   age        income
0   25  50000.000000
1   30  55769.230769
2   35  60000.000000
3   40  70000.000000
4   50  80000.000000


In [None]:
# Imputation using KNN
from sklearn.impute import KNNImputer

# Sample data
data = {'age': [25, 30, 35, 40, 50],
        'income': [50000, None, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Use 2 nearest neighbors to fill missing data.
knn_imputer = KNNImputer(n_neighbors=2)

df[['age', 'income']] = knn_imputer.fit_transform(df[['age', 'income']])
print(df)

    age   income
0  25.0  50000.0
1  30.0  55000.0
2  35.0  60000.0
3  40.0  70000.0
4  50.0  80000.0


### Handling MNAR Missing Data

Handling **Missing Not at Random (MNAR)** data is the most challenging, as the missingness is related to the value of the missing data itself or some unobserved factor. The best approach often requires domain expertise or external data to model the missingness properly. However, one common strategy is to **impute with a constant** or create a separate indicator variable to flag the missingness.


In [121]:
# Sample data
data = {'age': [25, 30, 35, 40, 50], 'income': [50000, None, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Create a flag for missing 'income' data
df['income_missing'] = df['income'].isna().astype(int)

# Impute 'income' with a constant or domain-based value
df['income'] = df['income'].fillna(0)
print(df)

   age   income  income_missing
0   25  50000.0               0
1   30      0.0               1
2   35  60000.0               0
3   40  70000.0               0
4   50  80000.0               0
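
The same flag-and-fill pattern is available in `scikit-learn` through `SimpleImputer` with `add_indicator=True`, which is convenient inside a pipeline. A minimal sketch follows; the frame `raw` and the column name `income_missing` given to the indicator are choices made here, not something the imputer produces by itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

raw = pd.DataFrame({'age': [25, 30, 35, 40, 50],
                    'income': [50000, np.nan, 60000, 70000, 80000]})

# Fill with the constant 0 and append one indicator column
# for each feature that had missing values during fit (here: income only)
imputer = SimpleImputer(strategy='constant', fill_value=0, add_indicator=True)
filled = imputer.fit_transform(raw)

# The output is a plain array: original columns first, indicator columns last
flagged = pd.DataFrame(filled, columns=['age', 'income', 'income_missing'])
print(flagged)
```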


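You can also use **Multiple Imputation by Chained Equations (MICE)**, which accounts for uncertainty in the missing data by creating multiple plausible imputations. MICE is normally justified under MAR, so with MNAR data it is at best an approximation.
The `IterativeImputer` in `scikit-learn` is modelled on this approach; with its default settings it returns a single imputed dataset.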

In [131]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the experimental IterativeImputer)
from sklearn.impute import IterativeImputer

data = {
    'age': [25, 30, None, 40, 50],
    'income': [50000, 60000, None, 70000, 80000],
    'bmi': [22.5, 24.0, 23.5, None, 27.0]
}

df = pd.DataFrame(data)

# Apply MICE-style imputation. Note: IterativeImputer works only on numeric data
mice_imputer = IterativeImputer()
df_imputed = pd.DataFrame(mice_imputer.fit_transform(df), columns=df.columns, index=df.index)


print(df)
print(df_imputed)

    age   income   bmi
0  25.0  50000.0  22.5
1  30.0  60000.0  24.0
2   NaN      NaN  23.5
3  40.0  70000.0   NaN
4  50.0  80000.0  27.0
         age        income        bmi
0  25.000000  50000.000000  22.500000
1  30.000000  60000.000000  24.000000
2  31.872197  59493.007701  23.500000
3  40.000000  70000.000000  25.413455
4  50.000000  80000.000000  27.000000
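
To get several plausible imputations rather than one, `IterativeImputer` can be run repeatedly with `sample_posterior=True` and a different `random_state` each time. The sketch below draws five imputations and summarises them per cell; the number of draws and the names `df_mi`, `pooled` and `spread` are arbitrary choices for illustration, and averaging the imputed values is only a simple summary, not a full pooling of model estimates.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df_mi = pd.DataFrame({
    'age': [25, 30, None, 40, 50],
    'income': [50000, 60000, None, 70000, 80000],
    'bmi': [22.5, 24.0, 23.5, None, 27.0],
})

# One imputed dataset per seed; sample_posterior=True draws each imputed value
# from the predictive distribution instead of using the point prediction
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df_mi)
    for seed in range(5)
]
stacked = np.stack(imputations)  # shape: (draws, rows, columns)

# Per-cell average and spread across the draws
pooled = pd.DataFrame(stacked.mean(axis=0), columns=df_mi.columns)
spread = pd.DataFrame(stacked.std(axis=0), columns=df_mi.columns)

print(pooled)
print(spread)
```

Observed cells are identical in every draw, so their spread is zero; the spread of the originally missing cells gives a rough sense of how uncertain each imputation is.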


