New Data Imputation issue arise when using train_nonmissing as True #48

KaikeWesleyReis · 2022-07-26T14:52:27Z

Hi,
First, congratulations on your package @AnotherSamWilson!

I'm facing an issue related to the ImputedData object: when I use train_nonmissing=True in the kernel definition, I get an error from the object returned by impute_new_data, but I'm still able to do the imputation by calling the complete_data function.

I prepared an example using the same dataset from your README.md. When I use the Kernel definition as you presented, i.e. train_nonmissing=False:

import pandas as pd
import miceforest as mf
from sklearn.datasets import load_iris
# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
iris['species'] = iris['species'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)
# Create kernel. 
kds = mf.ImputationKernel(iris_amp, datasets=1, save_all_iterations=True, train_nonmissing=False, random_state=1991)
# Run the MICE algorithm for 2 iterations
kds.mice(2)

# Return the completed dataset -> Just works normally
iris_complete = kds.complete_data(0)
#print(iris_complete.isnull().sum())
# New data has missing values in species column
new_missing_cols = ["sepal length (cm)", "sepal width (cm)", "species"]
iris_amp2_new = iris.iloc[range(10),:].copy()
iris_amp2_new[new_missing_cols] = mf.ampute_data(iris_amp2_new[new_missing_cols], perc=0.25,random_state=1991)

# Species column can still be imputed
iris_amp2_new_imp = kds.impute_new_data(iris_amp2_new)
iris_amp2_new_imp.complete_data(0).isnull().sum()

I have the following result:

But this exact same code with train_nonmissing=True generates this:

The complete error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in __repr__(self)
    339 
    340     def __repr__(self):
--> 341         summary_string = " " * 14 + "Class: ImputedData\n" + self._ids_info()
    342         return summary_string
    343 

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in _ids_info(self)
    347          Iterations: {self.iteration_count()}
    348   Imputed Variables: {len(self.imputation_order)}
--> 349 save_all_iterations: {self.save_all_iterations}"""
    350         return summary_string
    351 

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in iteration_count(self, datasets, variables)
    498         if len(ds_uniq) > 1:
    499             raise ValueError(
--> 500                 "iterations were not consistent across provided datasets, variables."
    501             )
    502 

ValueError: iterations were not consistent across provided datasets, variables.

From reading through your code, I suspected that this error would occur when a nonmissing column doesn't need to be imputed, but my tests show that this wasn't the cause. The object fails on a check in the model.

How could I fix this?


AnotherSamWilson · 2022-07-26T15:13:46Z

Good catch, this is a bug in how ImputedData stores the iteration count. It's storing iterations for variables that aren't being imputed, so they have an iteration of 0 while the other variables' iterations get incremented. This bug should be pretty isolated: as long as you don't try to get the global iteration count from iris_amp2_new_imp, I don't think you'll run into the error. You should still be able to use .complete_data(). Printing iris_amp2_new_imp actually calls self.iteration_count(), which is why that code is failing.
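
A minimal sketch of the mechanism described above, for illustration only. It is not miceforest's actual code: the function, the dictionary layout, and the variable names are hypothetical stand-ins, and only the error message is taken from the traceback.

```python
# Hypothetical illustration; none of these names are miceforest's API.
def iteration_count(iterations_by_variable):
    """Return the shared iteration count, or raise if variables disagree."""
    unique_counts = set(iterations_by_variable.values())
    if len(unique_counts) > 1:
        raise ValueError(
            "iterations were not consistent across provided datasets, variables."
        )
    return unique_counts.pop()


# Imputed variables were incremented to 2. "petal length (cm)" has no
# missing values in the new data, so it was tracked but stayed at 0.
tracked = {"sepal length (cm)": 2, "species": 2, "petal length (cm)": 0}

try:
    iteration_count(tracked)
except ValueError as err:
    message = str(err)

# Checking only the variables that were actually imputed avoids the error.
imputed_variables = ["sepal length (cm)", "species"]
consistent = iteration_count({name: tracked[name] for name in imputed_variables})
```

In this sketch the mixed counts {0, 2} trigger the error, while the counts for the imputed variables alone agree.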

This should be an easy fix, but I am actually in the middle of a medium-sized update to the package, so I probably won't get to it for a few days.

As a side note - Changing train_nonmissing in the code above won't functionally do anything, since all of the columns in iris_amp contain missing values. However, I could see how this bug would cause problems in real-world scenarios.
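
One way to check whether this applies to a given dataset is to list the columns that are fully observed in the training data but have missing values in the new data. The sketch below uses only pandas; the two data frames are hypothetical toy data, not the iris example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training data and the new data (hypothetical values).
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})
new = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 5.0]})

# Columns with no missing values in the training data.
complete_in_train = train.columns[train.notnull().all()].tolist()

# Of those, the columns that do have missing values in the new data.
affected = [col for col in complete_in_train if new[col].isnull().any()]
```

If `affected` is empty, no column is complete in training yet missing in the new data, which is the case the side note describes for iris_amp.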

KaikeWesleyReis · 2022-07-26T16:00:34Z

Thanks for your reply.

Regarding your side note, I used iris as an example, but my actual dataset has this characteristic (a column is complete in training, but not in testing).

My original concern was that the result from complete_data() might be misleading given the error, but I'm reassured now.

AnotherSamWilson · 2022-07-29T19:37:42Z

Fixed in e944636

AnotherSamWilson closed this as completed Jul 29, 2022
