New Data Imputation issue arise when using train_nonmissing as True #48

Closed
KaikeWesleyReis opened this issue Jul 26, 2022 · 3 comments

@KaikeWesleyReis

Hi,
First, congratulations on your package, @AnotherSamWilson!

I'm facing an issue related to the ImputedData object: when I use train_nonmissing=True in the kernel definition, I get an error from the object returned by impute_new_data, but I'm still able to do the imputation by calling its complete_data function.

I prepared an example using the same dataset from your README.md. When I use the Kernel definition as you presented, i.e. train_nonmissing=False:

import pandas as pd  # needed for pd.concat below
import miceforest as mf
from sklearn.datasets import load_iris
# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
iris['species'] = iris['species'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)
# Create kernel. 
kds = mf.ImputationKernel(iris_amp, datasets=1, save_all_iterations=True, train_nonmissing=False, random_state=1991)
# Run the MICE algorithm for 2 iterations
kds.mice(2)

# Return the completed dataset -> Just works normally
iris_complete = kds.complete_data(0)
#print(iris_complete.isnull().sum())
# New data has missing values in species column
new_missing_cols = ["sepal length (cm)", "sepal width (cm)", "species"]
iris_amp2_new = iris.iloc[range(10),:].copy()
iris_amp2_new[new_missing_cols] = mf.ampute_data(iris_amp2_new[new_missing_cols], perc=0.25,random_state=1991)

# Species column can still be imputed
iris_amp2_new_imp = kds.impute_new_data(iris_amp2_new)
iris_amp2_new_imp.complete_data(0).isnull().sum()

I have the following result:
[image: good-result]

But the exact same code with train_nonmissing=True generates this:
[image: bad-result]

The complete error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in __repr__(self)
    339 
    340     def __repr__(self):
--> 341         summary_string = " " * 14 + "Class: ImputedData\n" + self._ids_info()
    342         return summary_string
    343 

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in _ids_info(self)
    347          Iterations: {self.iteration_count()}
    348   Imputed Variables: {len(self.imputation_order)}
--> 349 save_all_iterations: {self.save_all_iterations}"""
    350         return summary_string
    351 

/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in iteration_count(self, datasets, variables)
    498         if len(ds_uniq) > 1:
    499             raise ValueError(
--> 500                 "iterations were not consistent across provided datasets, variables."
    501             )
    502 

ValueError: iterations were not consistent across provided datasets, variables.

From reading your code, my suspicion was that this error would happen when a nonmissing column doesn't need to be imputed, but my tests show that this wasn't the issue. The object fails a consistency check on the iteration counts.

How could I fix this?

@AnotherSamWilson
Owner

Good catch, this is a bug in how ImputedData stores the iteration count. It's storing iterations for variables that aren't being imputed, so they stay at iteration 0 while the other variables' iterations get incremented. This bug should be pretty isolated: as long as you don't try to get the global iteration count from iris_amp2_new_imp, I don't think you'll run into the error. You should still be able to use .complete_data(). Printing iris_amp2_new_imp actually calls self.iteration_count(), which is why that code is failing.
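To make the failure mode concrete, here is a toy model of the iteration bookkeeping described above. It is a sketch under stated assumptions, not miceforest's actual code: the dictionary layout and the standalone iteration_count function are hypothetical, and only the error message is taken from the traceback.

```python
# Toy model of the bug described above; NOT miceforest's actual code.
# The dict layout and this standalone function are hypothetical.

def iteration_count(iterations_by_variable):
    """Return the shared iteration count, or raise if the variables disagree."""
    unique_counts = set(iterations_by_variable.values())
    if len(unique_counts) > 1:
        raise ValueError(
            "iterations were not consistent across provided datasets, variables."
        )
    # An empty mapping means nothing was imputed, so report 0 iterations.
    return unique_counts.pop() if unique_counts else 0


# Buggy bookkeeping: a counter exists for every tracked variable, including
# one that has no missing values in the new data and is never imputed.
tracked = {"sepal length (cm)": 0, "species": 0, "petal length (cm)": 0}
imputed = ["sepal length (cm)", "species"]  # petal length is never imputed

for _ in range(2):  # two imputation iterations
    for variable in imputed:
        tracked[variable] += 1

try:
    iteration_count(tracked)
    buggy_result = "ok"
except ValueError as err:
    buggy_result = str(err)

# One possible fix (an assumption, not necessarily what the maintainer did):
# only consult the counters of variables that are actually imputed.
fixed_result = iteration_count({v: tracked[v] for v in imputed})

print(buggy_result)
print(fixed_result)
```

In this sketch the counters end up as 2, 2 and 0, so the consistency check raises even though every imputed variable went through the same number of iterations.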

This should be an easy fix, but I am actually in the middle of a medium-sized update to the package, so I probably won't get to it for a few days.

As a side note - Changing train_nonmissing in the code above won't functionally do anything, since all of the columns in iris_amp contain missing values. However, I could see how this bug would cause problems in real-world scenarios.

@KaikeWesleyReis
Author

Thanks for your reply.

Regarding your side note, I used iris as an example, but my actual dataset has this characteristic (a column is complete in the training data, but not in the test data).
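For anyone wanting to check whether their own data is in this situation, a minimal sketch follows. It uses plain dictionaries of column values rather than DataFrames, and the helper names and the example columns ("age", "income") are made up for illustration; none of this is part of miceforest.

```python
import math

# Hypothetical helpers, not part of miceforest or pandas.

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def complete_in_train_missing_in_new(train, new):
    """Columns with no missing values in `train` but at least one in `new`."""
    flagged = []
    for name, train_values in train.items():
        train_complete = not any(is_missing(v) for v in train_values)
        new_has_missing = any(is_missing(v) for v in new.get(name, []))
        if train_complete and new_has_missing:
            flagged.append(name)
    return flagged


# Made-up data: "age" is complete in training but has a gap in the new data.
train = {"age": [31, 45, 27], "income": [52.0, None, 61.5]}
new = {"age": [38, None], "income": [48.0, 55.0]}

flagged = complete_in_train_missing_in_new(train, new)
print(flagged)
```

Any column this flags is complete in training but has gaps in the new data, which, going by the side note above, is the case where train_nonmissing makes a functional difference.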

My original concern was that the result from complete_data() might be misleading given the error, but now I'm reassured.

@AnotherSamWilson
Owner

Fixed in e944636
