# **MICE Data Imputation Model | Code Documentation**

## Description

Performs data imputation based on the MICE (Multiple Imputation by Chained Equations) algorithm, using the IterativeImputer class from the sklearn.impute package.

The function receives a dataset with missing numerical and categorical records, imputes them with the MICE algorithm, and returns an imputed version of the original dataset together with a statistical table comparing the original and imputed datasets.

## IterativeImputer MICE method

The Multiple Imputation by Chained Equations (MICE) method is an iterative approach to filling in missing values using predictive models. It models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
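As a minimal sketch of the behaviour described above, the snippet below runs IterativeImputer on a small toy array. The data are hypothetical and chosen only so that the two columns are strongly related; the observed entries are left untouched and only the NaN entries are filled.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): the second column is roughly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

observed = ~np.isnan(X)  # positions that were known before imputation
print(X_imputed)
```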

The IterativeImputer MICE method from scikit-learn follows these steps:

**1. Initialize Missing Values**

*   Missing values are initially replaced with a simple estimate, such as the column mean, median or most frequent value.

**2. Define an Imputation Strategy**

*   For each column with missing values, a regression model is defined.
*   This model is trained to predict the missing values using the other columns as predictors.

**3. Iterate Over Columns**

*   For each column with missing values, the dataset is split into training data (non-missing values) and test data (missing values).
*   A regression model is trained on the available values and then used to predict the missing values.
*   The imputed values are updated and used in the next iteration for further refinement.

**4. Repeat Until Convergence**

*   This process is repeated for up to max_iter imputation rounds (default = 10).
*   At each iteration, the imputed values are refined as the models adapt better to the data.
*   The final imputed values are those obtained after the last iteration.
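The four steps can be sketched by hand as follows. This is an illustrative simplification, not scikit-learn's implementation: it assumes a plain LinearRegression for every column (IterativeImputer defaults to BayesianRidge), a column-mean initialisation, and a fixed number of rounds with no convergence check. The function name `chained_imputation` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_rounds=10):
    """Hand-rolled sketch of round-robin imputation (not scikit-learn's code)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialise every missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):                      # Step 4: repeat the rounds
        for j in range(X.shape[1]):                # Step 3: visit each column
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)  # all columns except j
            known = ~missing[:, j]
            # Step 2: regress column j on the other columns, observed rows only.
            model = LinearRegression().fit(others[known], filled[known, j])
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled

# Toy data (hypothetical) where the second column is exactly twice the first.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0], [np.nan, 10.0]])
result = chained_imputation(X)
print(result)
```

Because the toy relationship is exactly linear, the two missing entries are recovered as 6 and 5 (up to floating-point error).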

Unlike traditional MICE implementations, which generate multiple imputed datasets, IterativeImputer in scikit-learn returns a single dataset with the final imputed values.
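If several imputed datasets are wanted, one option described in the scikit-learn documentation is to run IterativeImputer repeatedly with sample_posterior=True and a different random_state each time. The sketch below does this on synthetic data; the data, the number of repetitions and the pooled statistic are assumptions made for illustration, and this is not what the function documented here does.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data (hypothetical): two correlated columns, 40 values removed.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])
X[rng.choice(200, size=40, replace=False), 1] = np.nan

# One completed dataset per seed. sample_posterior=True draws each imputation
# from the estimator's predictive distribution instead of using its mean.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Pool a statistic (here the mean of the second column) across the datasets.
estimates = [c[:, 1].mean() for c in completed]
print(np.mean(estimates), np.std(estimates))
```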

## Arguments

**df**: *object, mandatory*

The original dataset, which may or may not contain missing values and can hold both numerical and categorical variables.

**interest_vars:** *list, default = None*

List of the names of the numerical and categorical columns to be used by the algorithm for the predictions and shown in the statistical analysis table. The list may include columns with no missing values; these are used only as additional predictors. If None, every column of the dataset is used and analysed.

**show_results:** *bool, default = True*

Whether to print the statistical analysis of the interest variables for the original and imputed datasets. The table is computed and returned either way.

**random_state (sklearn):** *int, RandomState instance or None, default = 100*

The seed of the pseudo random number generator to use. Randomizes selection of estimator features if n_nearest_features is not None, the imputation_order if random, and the sampling from posterior if sample_posterior=True. Use an integer for determinism.

**max_iter (sklearn):** *int, default = 10*

Maximum number of imputation rounds to perform before returning the imputations computed during the final round. A round is a single imputation of each feature with missing values.

**initial_strategy (sklearn):** *{'mean', 'median', 'most_frequent', 'constant'}, default = 'most_frequent'*

Which strategy to use to initialize the missing values.
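To see what each strategy does to the starting values, the sketch below applies SimpleImputer, which IterativeImputer uses internally for this initial pass, to a small hypothetical array. Only the first fill is shown; the later regression rounds are not run.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array: one missing value in each column.
X = np.array([[1.0, 10.0], [1.0, 20.0], [2.0, np.nan], [np.nan, 60.0]])

initial_fill = {}
for strategy in ("mean", "median", "most_frequent"):
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    # Keep the value placed in each originally missing cell.
    initial_fill[strategy] = (filled[3, 0], filled[2, 1])
    print(strategy, initial_fill[strategy])
```

In the second column every observed value occurs once, so 'most_frequent' falls back to the smallest of them.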

**prefix_sep (pandas):** *str, default = '!'*

If appending prefix, separator/delimiter to use in the binary transformation of categorical variables.
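A small sketch of how the separator appears in the dummy column names, using a hypothetical one-column data frame. For simplicity the dummies are left at the pandas default dtype and cast to float, which differs from the nullable 'boolean' dtype used in the function below.

```python
import numpy as np
import pandas as pd

prefix_sep = "!"
toy = pd.DataFrame({"stage": ["Stage 1", "Stage 2", np.nan, "Stage 1"]})  # hypothetical data

# Cast to float so that the dummy columns can hold NaN.
dummies = pd.get_dummies(toy, prefix_sep=prefix_sep, columns=["stage"]).astype(float)
print(list(dummies.columns))  # ['stage!Stage 1', 'stage!Stage 2']

# get_dummies encodes a missing category as all zeros; restore it as NaN
# so that the imputer treats the row as missing.
dummies.loc[toy["stage"].isna(), :] = np.nan
print(dummies)
```

A separator that does not occur in any column or class name makes it safe to split the dummy name back into its two parts later.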

***DISCLAIMER:*** Some arguments of this MICE function are passed through to other Python packages, as indicated next to the name of the argument.


## Outputs

**imputed_df:** *object*

The imputed version of the original dataset. This dataset has no missing values in the columns listed in the interest_vars argument.


**df_results:** *object*

The statistical analysis table, which compares the columns listed in the interest_vars argument between the original and imputed datasets. Only columns that had missing values in the original dataset appear in the table.

## Statistical analysis table

**VARIABLE:**

Name of the column. For a categorical column, one row is shown per class, labelled as "column name (class)".

**MISSINGS:**

Number and percentage of missing values in the original dataset column, shown as "count (percentage)".

**BEFORE IMPUTATION:**

For numerical variables, it shows the median and IQR (Interquartile Range) of the column before the imputation (original data) as "median value (Q1,Q3)". For categorical variables, it shows the number of records in the class and the percentage of the non-missing records of the column that it represents before the imputation (original data) as "total count (percentage)".

**AFTER IMPUTATION:**

For numerical variables, it shows the median and IQR (Interquartile Range) of the column after the imputation (imputed data) as "median value (Q1,Q3)". For categorical variables, it shows the number of records in the class and the percentage of the column that it represents after the imputation (imputed dataset) as "total count (percentage)".
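As a small illustration of the "median value (Q1,Q3)" format, the sketch below builds one such cell from a hypothetical column with a single missing value, using the same NumPy calls as the auxiliary function defined later.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, np.nan])  # hypothetical column, one missing value

median = np.nanmedian(values)                # NaN entries are ignored
q1, q3 = np.nanpercentile(values, [25, 75])  # first and third quartiles
cell = "%.2f (%.2f,%.2f)" % (median, q1, q3)
print(cell)  # 2.50 (1.75,3.25)
```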

**P-VALUE:**

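P-value of a test that compares the value distributions of the original and imputed datasets: a two-sided Mann-Whitney U test for numerical variables and a chi-square test on the 2x2 table of class counts for categorical variables.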
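The two tests can be sketched as follows. The numerical sample is synthetic and the imputed values are a hypothetical stand-in (the observed median repeated 50 times). The categorical counts are taken from the "Stage 2" row of the example output further down: 55 of 146 observed records before imputation and 55 of 1000 records after.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Numerical variable: observed values versus the completed column.
rng = np.random.default_rng(0)
before = rng.normal(loc=37.0, scale=0.8, size=200)
after = np.concatenate([before, np.full(50, np.median(before))])  # hypothetical imputations
_, p_num = mannwhitneyu(before, after, alternative="two-sided")

# Categorical class: [in class, not in class] before and after imputation.
table = np.array([
    [55, 146 - 55],
    [55, 1000 - 55],
])
_, p_cat, _, _ = chi2_contingency(table)

print(p_num, p_cat)
```

A very small categorical p-value, as in this row, indicates that the class proportion changed markedly after imputation.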

## Example using ICU data

### Initializing

First, we need to import the following Python packages that will be used in this notebook.

In [1]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

Here, we load the dataset with missing records that will be worked on.

In [2]:
path = 'df_map.csv'

main_df = pd.read_csv(path)

Here, we define the list of interest variables from main_df to be passed to the main function.

In [3]:
# Variables list
L_vars = ['demog_sex','demog_age','demog_healthcare','comor_hypertensi', 'comor_chrkidney', 'comor_chrkidney_stag', 'comor_liverdisease', 'comor_liverdisease_type', 'vital_highesttem_c', 'vital_hr']
print(main_df[L_vars].head())

  demog_sex  demog_age  demog_healthcare  comor_hypertensi  comor_chrkidney  \
0      Male       19.0             False             False            False   
1      Male       26.0             False             False            False   
2      Male       55.0              True             False            False   
3    Female       25.0             False             False            False   
4      Male       35.0             False             False             True   

  comor_chrkidney_stag  comor_liverdisease comor_liverdisease_type  \
0                  NaN               False                     NaN   
1                  NaN               False                     NaN   
2                  NaN               False                     NaN   
3                  NaN                True                    Mild   
4              Stage 2               False                     NaN   

   vital_highesttem_c  vital_hr  
0                38.8      75.0  
1                37.1      77.8  
2 

### Auxiliary functions
These functions will support the implementation of the main function (MICE).

Function to calculate the median and IQR of numerical variables:

In [4]:
def calc_median_iqr(series):
  return np.nanmedian(series), np.nanpercentile(series, 25), np.nanpercentile(series, 75)

Function to calculate p_value for numerical variables:

In [5]:
def calculate_p_value_numerical(before, after):
  before_cleaned = before.dropna()
  after_cleaned = after.dropna()

  if len(before_cleaned) > 0 and len(after_cleaned) > 0:
    _, p_value = mannwhitneyu(before_cleaned, after_cleaned, alternative='two-sided')
  else:
    p_value = np.nan  # Return NaN if one or both arrays are empty.

  return p_value

Function to calculate p_value for categorical variables:

In [6]:
def calculate_p_value_categorical(before_positive, after_positive, before_total, after_total):
  before_negative = before_total - before_positive
  after_negative = after_total - after_positive

  contingency_table = np.array([
  [before_positive, before_negative],
  [after_positive, after_negative]
  ])

  _, p_value, _, _ = chi2_contingency(contingency_table)
  return p_value

Function that identifies dummy variables:

In [7]:
def is_binary(series):
  return series.dropna().isin([0, 1]).all()
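A few hypothetical inputs show how this check behaves. The function is repeated so that the snippet runs on its own. Note the last case: a column with only missing values is also reported as binary, because all() over an empty series is True.

```python
import numpy as np
import pandas as pd

def is_binary(series):
    # True when every non-missing value is 0 or 1.
    return series.dropna().isin([0, 1]).all()

print(is_binary(pd.Series([0, 1, np.nan, 1])))   # True
print(is_binary(pd.Series([0, 2, 1])))           # False
print(is_binary(pd.Series(["Male", "Female"])))  # False
print(is_binary(pd.Series([np.nan, np.nan])))    # True (nothing left after dropna)
```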

Function that separates the data into numerical and categorical variables, so that the categorical columns can be turned into binary records that MICE can interpret.

In [8]:
def pre_imputation(df, interest_vars, prefix_sep):

    for var in list(df.columns):
        if (df[var].dtype == 'object' or df[var].dtype == 'int64') and is_binary(df[var]):
            df[var] = df[var].astype('boolean')

    # Separate categorical,numerical and binary variables:

    all_categorical_vars = []
    all_numerical_vars = []
    all_binary_vars = []

    for var in list(df.columns):
        if df[var].dtype == 'object':
            all_categorical_vars.append(var)
        elif (df[var].dtype == 'int64' or df[var].dtype == 'float64') and is_binary(df[var])==False:
            all_numerical_vars.append(var)
        elif df[var].dtype == 'boolean' or is_binary(df[var]):
            all_binary_vars.append(var)

    # Separate categorical,numerical and binary variables in interest_vars:

    interest_categorical_vars = []
    interest_numerical_vars = []
    interest_binary_vars = []

    for var in all_categorical_vars:
        if var in interest_vars:
            interest_categorical_vars.append(var)
    for var in all_numerical_vars:
        if var in interest_vars:
            interest_numerical_vars.append(var)
    for var in all_binary_vars:
        if var in interest_vars:
            interest_binary_vars.append(var)

    # Create a copy of the df to work on and transform categorical variables into dummy data:
    df_copy = pd.get_dummies(df, prefix_sep = prefix_sep, columns = interest_categorical_vars , dtype = 'boolean') # Less categorical data

    # Making sure that NaN registers keep equal to NaN instead of 0 (False) after the pd.get_dummies automatic imputation:
    interest_vars_with_new_dummies = interest_vars.copy()
    for var in interest_categorical_vars:
        same_group_cols = []
        var_prefix = var + prefix_sep
        for col in df_copy.columns:
            if col.startswith(var_prefix):
                same_group_cols.append(col)
                interest_vars_with_new_dummies.append(col)
        interest_vars_with_new_dummies.remove(var)
        # # Convert to nullable boolean type to allow np.nan
        # df_copy[same_group_cols] = df_copy[same_group_cols].astype('boolean')
        # Rows where the original value is missing get NaN in every dummy column
        # (label-based mask, so it also works when the index is not 0..n-1):
        df_copy.loc[df[var].isna(), same_group_cols] = np.nan

    vars_for_training = interest_vars_with_new_dummies + all_numerical_vars + all_binary_vars
    vars_for_training = list(set(vars_for_training))

    return df_copy, vars_for_training, interest_categorical_vars, interest_numerical_vars

Function that receives the original and imputed data frames and the lists of interest numerical and categorical variables, and builds the statistical analysis table comparing the data distribution before and after the MICE imputation.

In [9]:
def stat_analysis(df, imputed_df, interest_vars, numerical_vars, categorical_vars, show_results):

    # Create an empty DataFrame to store the results:
    df_results = pd.DataFrame(columns=["VARIABLE", "MISSINGS", "BEFORE IMPUTATION", "AFTER IMPUTATION", "P-VALUE"])

    # Loop through the variables in the dataframe BEFORE IMPUTATION:
    for var in interest_vars:

        # Identify missing values for this variable:
        n_missing = df[df[var].isna()].shape[0]
        if n_missing > 0:

            missing_percentage = n_missing / df.shape[0]

            # Numerical variables
            if var in numerical_vars:

                # Analysis BEFORE imputation (bi):
                median_iqr_bi = calc_median_iqr(df[var])

                # Analysis AFTER imputation (ai):
                median_iqr_ai = calc_median_iqr(imputed_df[var])

                # P-Value:
                pval = calculate_p_value_numerical(df[var], imputed_df[var])

                # Store the results:
                df_results = pd.concat([
                df_results,
                pd.DataFrame([{
                    "VARIABLE": var,
                    "MISSINGS": f"{n_missing} ({missing_percentage * 100:.2f}%)",
                    "BEFORE IMPUTATION": '%.2f (%.2f,%.2f)' % (median_iqr_bi[0], median_iqr_bi[1], median_iqr_bi[2]),
                    "AFTER IMPUTATION": '%.2f (%.2f,%.2f)' % (median_iqr_ai[0], median_iqr_ai[1], median_iqr_ai[2]),
                    "P-VALUE": pval
                }])
                ], ignore_index=True)

            # Categorical variables
            else:

                count_total_bi = df[var].notna().sum()
                count_total_ai = imputed_df[var].notna().sum()

                # Boolean and dummy variables:
                if is_binary(df[var]):

                    # Count positive registers of the unit BEFORE imputation:
                    count_positive_bi = df[df[var] == True].shape[0]
                    percent_positive_bi = (count_positive_bi / count_total_bi) * 100

                    # Count positive registers of the unit AFTER imputation:
                    count_positive_ai = imputed_df[imputed_df[var] == True].shape[0]
                    percent_positive_ai = (count_positive_ai/count_total_ai)* 100

                    # Calculate p-value for the analysis before VS after the imputation:
                    pval = calculate_p_value_categorical(count_positive_bi, count_positive_ai, count_total_bi, count_total_ai)

                    # Store the results:
                    df_results = pd.concat([
                    df_results,
                    pd.DataFrame([{
                        "VARIABLE": '%s'%var,
                        "MISSINGS": f"{n_missing} ({missing_percentage * 100:.2f}%)",
                        "BEFORE IMPUTATION": '%i (%.2f%%)'%(count_positive_bi, percent_positive_bi),
                        "AFTER IMPUTATION": '%i (%.2f%%)' %(count_positive_ai, percent_positive_ai),
                        "P-VALUE": pval
                        }])
                        ], ignore_index=True)

                # Object variables:
                else:

                    # Obtain the units of the variable:
                    units = df[var].dropna().unique()

                    for unit in units:

                        # Count positive registers of the unit BEFORE imputation:
                        count_positive_bi = df[df[var] == unit].shape[0]
                        percent_positive_bi = (count_positive_bi/count_total_bi)* 100

                        # Count positive registers of the unit AFTER imputation:
                        count_positive_ai = imputed_df[imputed_df[var] == unit].shape[0]
                        percent_positive_ai = (count_positive_ai/count_total_ai)* 100

                        # Calculate p-value for the analysis before VS after the imputation:
                        pval = calculate_p_value_categorical(count_positive_bi, count_positive_ai, count_total_bi, count_total_ai)

                        # Store the results:
                        df_results = pd.concat([
                        df_results,
                        pd.DataFrame([{
                            "VARIABLE": '%s (%s)'%(var,unit),
                            "MISSINGS": f"{n_missing} ({missing_percentage * 100:.2f}%)",
                            "BEFORE IMPUTATION": '%i (%.2f%%)'%(count_positive_bi, percent_positive_bi),
                            "AFTER IMPUTATION": '%i (%.2f%%)' %(count_positive_ai, percent_positive_ai),
                            "P-VALUE": pval
                        }])
                        ], ignore_index=True)

    if show_results:
        # Show the results:
        print('\nStatistical analysis of data imputation with MICE method to numerical and categorical variables\n')
        print(df_results)
        print("\nALERT: The table will only show a variable if there is some missing register in its column. Otherwise, it will not be shown in the table.\n")

    return df_results

### Running MICE step by step

Defining the main_df as the df and L_vars as the interest_vars to be worked on.

In [10]:
df = main_df
interest_vars = L_vars

In [11]:
if interest_vars is None:
    interest_vars = list(df.columns)

Calling the pre-imputation function to create a copy of the original data frame to work on.

In [12]:
# Separator placed between the column name and its classes when turning categorical records into binary variables.
prefix_sep = '!'

df_copy, vars_for_training, interest_categorical_vars, interest_numerical_vars = pre_imputation(df, interest_vars, prefix_sep)

Creating a training data frame using the assigned interest variables and all the other numerical and binary columns of the dataset.

In [13]:
df_train = df_copy.loc[:, vars_for_training]

Defining the IterativeImputer, fitting it on the training data (which stores each feature's estimator), and then applying the "transform" phase, which predicts the missing values in order without refitting. The result is the imputed data, a version of the training data frame that no longer has missing records.

In [14]:
imputer = IterativeImputer(random_state = 100 , max_iter = 10 , initial_strategy = 'most_frequent')
imputer.fit(df_train)
imputed_df = imputer.transform(df_train)

Recreating the categorical-type columns that were excluded when we turned them into binary records.

In [15]:
# Convert imputed_df to DataFrame and maintain original index
imputed_df = pd.DataFrame(imputed_df, columns=imputer.get_feature_names_out(), index=df.index)

# Add the original categorical columns
imputed_df = pd.concat([imputed_df, df[interest_categorical_vars]], axis=1)

Translating the imputed binary columns back into categorical records and dropping those binary columns, so that each categorical variable is again a single column, as in the original data frame.

In [16]:
for var in interest_categorical_vars:
    same_group_cols = []
    var_prefix = var + prefix_sep
    for col in list(df_copy.columns):
        if col.startswith(var_prefix):
            same_group_cols.append(col)
    for index, row in imputed_df[same_group_cols].iterrows():

        # Get max value in the row:
        max_value = row.max()

        # Get column name of the max value:
        max_column = row.idxmax()

        var_sufix = max_column.split(prefix_sep)[-1]
        imputed_df.loc[index, var] = var_sufix
    for col in same_group_cols:
        imputed_df.drop(col, axis=1, inplace=True)
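The same translation can be written without a row loop. The sketch below is one possible vectorized form, shown on hypothetical imputed scores: idxmax picks the dummy column with the highest value in each row, and the class name is the part after the separator.

```python
import pandas as pd

prefix_sep = "!"
imputed = pd.DataFrame({
    "stage!Stage 1": [0.7, 0.1, 0.2],
    "stage!Stage 2": [0.2, 0.8, 0.3],
    "stage!Stage 3a": [0.1, 0.1, 0.5],
})  # hypothetical imputed scores for one categorical variable

group = [c for c in imputed.columns if c.startswith("stage" + prefix_sep)]

# Column with the highest score per row, then keep the text after the separator.
imputed["stage"] = imputed[group].idxmax(axis=1).str.split(prefix_sep).str[-1]
imputed = imputed.drop(columns=group)
print(imputed["stage"].tolist())  # ['Stage 1', 'Stage 2', 'Stage 3a']
```

When two dummy columns tie, idxmax returns the first one in column order, so the result depends on how the dummies were ordered.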

Printing the statistical analysis between the original and the imputed df:

In [17]:
# The user can choose whether or not to print the table
show_results = True

df_results = stat_analysis(df, imputed_df, interest_vars, interest_numerical_vars, interest_categorical_vars, show_results = show_results)


Statistical analysis of data imputation with MICE method to numerical and categorical variables

                                       VARIABLE      MISSINGS  \
0                comor_chrkidney_stag (Stage 2)  854 (85.40%)   
1               comor_chrkidney_stag (Stage 3a)  854 (85.40%)   
2                comor_chrkidney_stag (Stage 1)  854 (85.40%)   
3               comor_chrkidney_stag (Stage 3b)  854 (85.40%)   
4                comor_chrkidney_stag (Stage 4)  854 (85.40%)   
5                comor_chrkidney_stag (Stage 5)  854 (85.40%)   
6                comor_liverdisease_type (Mild)  856 (85.60%)   
7  comor_liverdisease_type (Moderate or severe)  856 (85.60%)   

  BEFORE IMPUTATION AFTER IMPUTATION       P-VALUE  
0       55 (37.67%)       55 (5.50%)  4.136903e-34  
1       41 (28.08%)       41 (4.10%)  5.138735e-25  
2       30 (20.55%)     884 (88.40%)  4.466933e-80  
3        14 (9.59%)       14 (1.40%)  1.198273e-08  
4         5 (3.42%)        5 (0.50%)  2.118501e-03 



Running this cell displays the imputed data frame.

In [18]:
imputed_df

Unnamed: 0,treat_o2supp_suppleo2,vital_meanbp,compl_ards,comor_smoking_yn,vital_hr,adsym_fever,compl_severeliver,comor_obesity,vital_rr,inter_o2support_type___Non-invasive ventilation,...,vital_gcs,compl_acuterenal,sympt_fever,labs_creatinine_mgdl,labs_bilirubin_mgdl,comor_diabetes_yn,inter_o2support_type___Unknown,demog_sex,comor_chrkidney_stag,comor_liverdisease_type
0,0.0,96.1,0.0,0.0,75.0,1.0,0.0,1.0,16.3,1.0,...,14.0,0.0,1.0,1.379,1.063,1.0,0.0,Male,Stage 1,Mild
1,0.0,84.6,0.0,0.0,77.8,0.0,0.0,0.0,14.9,0.0,...,8.0,0.0,0.0,0.880,0.295,0.0,0.0,Male,Stage 1,Mild
2,1.0,94.5,0.0,1.0,77.0,1.0,0.0,0.0,17.2,0.0,...,16.0,0.0,1.0,0.763,0.315,0.0,0.0,Male,Stage 1,Mild
3,0.0,79.1,0.0,0.0,83.2,1.0,0.0,0.0,14.4,0.0,...,16.0,0.0,1.0,0.857,0.530,1.0,0.0,Female,Stage 1,Mild
4,0.0,96.4,0.0,0.0,62.2,1.0,0.0,1.0,16.8,0.0,...,15.0,1.0,0.0,1.264,0.353,1.0,0.0,Male,Stage 2,Mild
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,94.0,0.0,0.0,73.9,1.0,0.0,1.0,17.0,0.0,...,16.0,0.0,1.0,1.010,0.443,0.0,0.0,Male,Stage 1,Moderate or severe
996,0.0,75.9,0.0,0.0,79.8,1.0,0.0,0.0,14.9,0.0,...,16.0,1.0,0.0,0.574,0.555,0.0,0.0,Male,Stage 1,Mild
997,1.0,110.6,0.0,0.0,81.7,1.0,0.0,0.0,16.5,0.0,...,11.0,0.0,1.0,0.883,0.388,0.0,0.0,Male,Stage 1,Moderate or severe
998,1.0,82.0,0.0,0.0,71.1,0.0,0.0,1.0,17.5,0.0,...,11.0,0.0,0.0,0.733,0.802,0.0,1.0,Female,Stage 2,Mild


Running this cell displays the statistical analysis table as a data frame.

In [19]:
df_results

Unnamed: 0,VARIABLE,MISSINGS,BEFORE IMPUTATION,AFTER IMPUTATION,P-VALUE
0,comor_chrkidney_stag (Stage 2),854 (85.40%),55 (37.67%),55 (5.50%),4.136903e-34
1,comor_chrkidney_stag (Stage 3a),854 (85.40%),41 (28.08%),41 (4.10%),5.138735e-25
2,comor_chrkidney_stag (Stage 1),854 (85.40%),30 (20.55%),884 (88.40%),4.466933e-80
3,comor_chrkidney_stag (Stage 3b),854 (85.40%),14 (9.59%),14 (1.40%),1.198273e-08
4,comor_chrkidney_stag (Stage 4),854 (85.40%),5 (3.42%),5 (0.50%),0.002118501
5,comor_chrkidney_stag (Stage 5),854 (85.40%),1 (0.68%),1 (0.10%),0.6027373
6,comor_liverdisease_type (Mild),856 (85.60%),86 (59.72%),942 (94.20%),9.00409e-37
7,comor_liverdisease_type (Moderate or severe),856 (85.60%),58 (40.28%),58 (5.80%),9.00409e-37


### Complete function

In [20]:
def MICE_Imputer(df , interest_vars = None , show_results = True , random_state = 100 , max_iter = 10 , initial_strategy = 'most_frequent' , prefix_sep = '!'):

    # Use every column when no interest variables are given
    if interest_vars is None:
        interest_vars = list(df.columns)

    df_copy, vars_for_training, interest_categorical_vars, interest_numerical_vars = pre_imputation(df, interest_vars, prefix_sep)

    # Select interest variables
    df_train = df_copy.loc[:, vars_for_training]

    # Define imputer
    imputer = IterativeImputer(random_state = random_state , max_iter = max_iter , initial_strategy = initial_strategy)

    # Fit on the dataset
    imputer.fit(df_train)

    # Predict the missing values
    imputed_df = imputer.transform(df_train)

    # Revert dummies to original categories
    imputed_df = pd.DataFrame(imputed_df, columns=imputer.get_feature_names_out(), index=df.index)  # Convert imputed_df to DataFrame and maintain original index

    imputed_df = pd.concat([imputed_df, df[interest_categorical_vars]], axis=1) # Add the original categorical columns

    for var in interest_categorical_vars:
        same_group_cols = []
        var_prefix = var + prefix_sep
        for col in list(df_copy.columns):
            if col.startswith(var_prefix):
                same_group_cols.append(col)
        for index, row in imputed_df[same_group_cols].iterrows():
            max_value = row.max()         # Get max value in the row
            max_column = row.idxmax()     # Get column name of the max value
            var_sufix = max_column.split(prefix_sep)[-1]
            imputed_df.loc[index, var] = var_sufix
        for col in same_group_cols:
            imputed_df.drop(col, axis=1, inplace=True)

    # Statistical analysis between the original and the imputed df:
    df_results = stat_analysis(df, imputed_df, interest_vars, interest_numerical_vars, interest_categorical_vars, show_results = show_results)

    return imputed_df, df_results

Calling the complete function.

In [21]:
imputed_main_df, main_df_results = MICE_Imputer(main_df, L_vars)


Statistical analysis of data imputation with MICE method to numerical and categorical variables

                                       VARIABLE      MISSINGS  \
0                comor_chrkidney_stag (Stage 2)  854 (85.40%)   
1               comor_chrkidney_stag (Stage 3a)  854 (85.40%)   
2                comor_chrkidney_stag (Stage 1)  854 (85.40%)   
3               comor_chrkidney_stag (Stage 3b)  854 (85.40%)   
4                comor_chrkidney_stag (Stage 4)  854 (85.40%)   
5                comor_chrkidney_stag (Stage 5)  854 (85.40%)   
6                comor_liverdisease_type (Mild)  856 (85.60%)   
7  comor_liverdisease_type (Moderate or severe)  856 (85.60%)   

  BEFORE IMPUTATION AFTER IMPUTATION       P-VALUE  
0       55 (37.67%)       55 (5.50%)  4.136903e-34  
1       41 (28.08%)       41 (4.10%)  5.138735e-25  
2       30 (20.55%)     884 (88.40%)  4.466933e-80  
3        14 (9.59%)       14 (1.40%)  1.198273e-08  
4         5 (3.42%)        5 (0.50%)  2.118501e-03 

