# Multiple Imputation by Chained Equations (MICE)

### DATA IMPUTATION METHOD



Imputing a Single Dataset
If you only want to create a single imputed dataset, you can use KernelDataSet:

# Create kernel. 
kds = mf.KernelDataSet(
  iris_amp,
  save_all_iterations=True,
  random_state=1991
)

# Run the MICE algorithm for 3 iterations
kds.mice(3)

# Return the completed kernel data
completed_data = kds.complete_data()



kernel_inplace.complete_data(dataset=0, inplace=True)
print(iris_amp.isnull().sum(0))

Diagnostic Plotting
As of now, miceforest has four diagnostic plots available.

Distribution of Imputed-Values
We probably want to know how the imputed values are distributed. We can plot the original distribution beside the imputed distributions in each dataset by using the plot_imputed_distributions method of an ImputationKernel object:

kernel.plot_imputed_distributions(wspace=0.3, hspace=0.3)

The red line is the original data, and each black line shows the imputed values of one dataset.

Convergence of Correlation
We are probably also interested in how the imputed values converged across datasets over the iterations. The plot_correlations method shows a boxplot of the correlations between imputed values in every combination of datasets, at each iteration. This lets you see how correlated the imputations are between datasets, as well as how they converge over iterations:

kernel.plot_correlations()
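
To make the quantity behind this plot concrete, here is a minimal pure-Python sketch of pairwise correlations between imputed values at a single iteration. It does not reproduce miceforest internals. The helper names (`pearson`, `pairwise_correlations`) and the numbers are made up for illustration, and each inner list is assumed to hold the imputed values for the same missing cells in one dataset.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def pairwise_correlations(imputed_by_dataset):
    """Correlation of imputed values for every combination of two datasets."""
    return [pearson(a, b) for a, b in combinations(imputed_by_dataset, 2)]


# Hypothetical imputed values for the same five missing cells in three datasets.
imputed_by_dataset = [
    [5.1, 4.9, 6.3, 5.8, 6.1],
    [5.0, 5.2, 6.0, 5.9, 6.4],
    [5.3, 4.7, 6.5, 5.5, 6.0],
]

correlations = pairwise_correlations(imputed_by_dataset)

# A boxplot at this iteration would summarize values like these.
print(min(correlations), median(correlations), max(correlations))
```

Repeating this at every iteration gives one box per iteration, which is roughly what the plot summarizes.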

Variable Importance
We also may be interested in which variables were used to impute each variable. We can plot this information by using the plot_feature_importance method.

kernel.plot_feature_importance(dataset=0, annot=True, cmap="YlGnBu", vmin=0, vmax=1)

The numbers shown are returned from the lightgbm.Booster.feature_importance() function. Each square represents the importance of the column variable in imputing the row variable.

Mean Convergence
If our data is not missing completely at random, we may see that it takes a few iterations for our models to get the distribution of imputations right. We can plot the average value of our imputations to see if this is occurring:

kernel.plot_mean_convergence(wspace=0.3, hspace=0.4)

Our data was missing completely at random, so we don’t see any convergence occurring here.
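
The idea behind this check can be sketched without the library: track the mean of the imputed values for one variable at each iteration and see whether it has settled. The helper names (`mean_trace`, `has_stabilized`), the window and tolerance defaults, and the numbers below are assumptions for illustration. They are not part of miceforest.

```python
from statistics import mean


def mean_trace(imputations_by_iteration):
    """Mean of the imputed values at each iteration."""
    return [mean(values) for values in imputations_by_iteration]


def has_stabilized(trace, window=3, tolerance=0.05):
    """True if the last `window` means lie within `tolerance` of each other."""
    if len(trace) < window:
        return False
    recent = trace[-window:]
    return max(recent) - min(recent) <= tolerance


# Hypothetical imputed values for one variable over five iterations.
history = [
    [4.0, 4.2, 3.9],
    [4.6, 4.9, 4.4],
    [5.0, 5.1, 4.9],
    [5.0, 5.2, 4.9],
    [5.1, 5.0, 5.0],
]

trace = mean_trace(history)
print(trace)
print(has_stabilized(trace))
```

A trace that keeps drifting suggests running more iterations. A flat trace from the start is what you would expect when the data is missing completely at random.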

Using the Imputed Data
To return the imputed data, use the complete_data method:

dataset_1 = kernel.complete_data(0)

The MICE Algorithm
Multiple Imputation by Chained Equations ‘fills in’ (imputes) missing data in a dataset through an iterative series of predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables in the dataset. These iterations should be run until the imputations appear to have converged.



This process is continued until all specified variables have been imputed. Additional iterations can be run if it appears that the average imputed values have not converged, although no more than 5 iterations are usually necessary.
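
The loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the miceforest implementation. It handles exactly two numeric columns, uses a simple least-squares line in place of the library's predictive models, and starts from mean-filled values. The names `fit_line` and `toy_mice` and the data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def toy_mice(columns, iterations=5):
    """columns: dict of two equal-length lists, with None marking missing cells."""
    names = list(columns)
    missing = {k: [i for i, v in enumerate(col) if v is None]
               for k, col in columns.items()}

    # Starting point: fill every missing cell with the observed column mean.
    filled = {}
    for k, col in columns.items():
        m = mean(v for v in col if v is not None)
        filled[k] = [m if v is None else v for v in col]

    for _ in range(iterations):
        for target in names:
            predictor = next(n for n in names if n != target)
            skip = set(missing[target])
            # Train only on rows where the target was actually observed.
            train = [i for i in range(len(filled[target])) if i not in skip]
            a, b = fit_line([filled[predictor][i] for i in train],
                            [filled[target][i] for i in train])
            # Re-impute the target's missing cells from the current predictor values.
            for i in missing[target]:
                filled[target][i] = a + b * filled[predictor][i]
    return filled


# Toy data in which y is roughly twice x, with a few cells removed.
data = {
    "x": [1, 2, 3, 4, 5, 6, None, 8],
    "y": [2, 4, None, 8, 10, None, 14, 16],
}
completed = toy_mice(data, iterations=5)
print(completed)
```

Because each variable is re-imputed from the latest values of the other, early mean-filled guesses are progressively replaced by model-based ones, which is why a handful of iterations is usually enough.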



Confidence Intervals:
MICE can be used to impute missing values; however, it is important to keep in mind that these imputed values are predictions. Creating multiple datasets with different imputed values allows you to do two types of inference:

Imputed Value Distribution: A profile can be built for each imputed value, allowing you to make statements about the likely distribution of that value.
Model Prediction Distribution: With multiple datasets, you can build multiple models and create a distribution of predictions for each sample. Samples whose missing values could not be imputed with much confidence will show a larger variance in their predictions.
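
As an illustration of the first kind of inference, the sketch below builds a simple profile (mean and standard deviation) for each imputed cell across several datasets. The numbers are made up, and the layout (one inner list per dataset, one position per missing cell) is an assumption for the example rather than a miceforest data structure.

```python
from statistics import mean, stdev

# Hypothetical imputed values: one row per imputed dataset,
# one column per missing cell.
imputations = [
    [5.1, 3.0, 1.4],
    [5.0, 3.6, 1.5],
    [5.2, 2.4, 1.4],
    [5.1, 3.3, 1.3],
]

# Transpose so that each entry collects one cell's values across datasets.
profiles = [
    {"mean": mean(cell), "stdev": stdev(cell)}
    for cell in zip(*imputations)
]

# The cell with the largest spread is the one imputed with least confidence.
least_certain = max(range(len(profiles)), key=lambda i: profiles[i]["stdev"])

for i, profile in enumerate(profiles):
    print(i, round(profile["mean"], 3), round(profile["stdev"], 3))
print("least certain cell:", least_certain)
```

The same transposition works for model predictions: fit one model per completed dataset, then summarize the predictions for each sample.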



Predictive Mean Matching
miceforest can make use of a procedure called predictive mean matching (PMM) to select which values are imputed. PMM involves selecting a datapoint from the original, nonmissing data which has a predicted value close to the predicted value of the missing sample. The closest N (mean_match_candidates parameter) values are chosen as candidates, from which a value is chosen at random. This can be specified on a column-by-column basis. Going into more detail from our example above, we see how this works in practice:



This method is very useful if the variable being imputed has any of the following characteristics:

Multimodal
Integer
Skewed
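
The matching step itself is small enough to sketch in plain Python. The function below is an illustration of the procedure described above, not miceforest's code. `pmm_impute`, its `candidates` argument (standing in for mean_match_candidates), and the toy predictions are all assumed for the example.

```python
import random


def pmm_impute(observed_preds, observed_values, missing_preds,
               candidates=5, seed=0):
    """For each missing row, pick a donor at random from the `candidates`
    observed rows whose predictions are closest, and take its real value."""
    rng = random.Random(seed)
    imputed = []
    for pred in missing_preds:
        # Rank observed rows by how close their prediction is to this one.
        ranked = sorted(range(len(observed_preds)),
                        key=lambda i: abs(observed_preds[i] - pred))
        donors = ranked[:candidates]
        imputed.append(observed_values[rng.choice(donors)])
    return imputed


# Toy integer-valued variable with hypothetical model predictions.
observed_values = [1, 2, 2, 3, 5, 8, 8, 9]
observed_preds = [1.2, 1.9, 2.4, 3.1, 4.7, 7.6, 8.3, 8.8]
missing_preds = [2.2, 8.1]

imputed = pmm_impute(observed_preds, observed_values, missing_preds,
                     candidates=3)
print(imputed)
```

Because every imputed value is copied from an observed row, the result stays an integer and stays within the observed range, which the raw predictions (2.2 and 8.1 here) would not.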

Effects of Mean Matching

We can see how our variables are distributed and correlated in the graph above. Now let’s run our imputation process twice, once using mean matching, and once using the model prediction.

kernelmeanmatch = mf.ImputationKernel(ampdat, datasets=1, mean_match_candidates=5)
kernelmodeloutput = mf.ImputationKernel(ampdat, datasets=1, mean_match_candidates=0)

kernelmeanmatch.mice(2)
kernelmodeloutput.mice(2)

Let’s look at the effect on the different variables.

With Mean Matching
kernelmeanmatch.plot_imputed_distributions(wspace=0.2, hspace=0.4)

You can see the effects that mean matching has, depending on the distribution of the data. Simply returning the value from the model prediction, while it may provide a better ‘fit’, will not provide imputations with a similar distribution to the original. This may be beneficial, depending on your goal.