Hi,
First, congratulations on your package @AnotherSamWilson!
I'm facing an issue with the ImputedData object: when I use train_nonmissing=True in the Kernel definition, the object returned by impute_new_data raises an error, but I'm still able to perform the imputation by calling the complete_data function.
I prepared an example using the same dataset as in your README.md. When I use the Kernel definition as you presented it, i.e. train_nonmissing=False:
import pandas as pd  # needed for pd.concat below
import miceforest as mf
from sklearn.datasets import load_iris
# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris.rename({"target": "species"}, inplace=True, axis=1)
iris['species'] = iris['species'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)
# Create kernel.
kds = mf.ImputationKernel(iris_amp, datasets=1, save_all_iterations=True, train_nonmissing=False, random_state=1991)
# Run the MICE algorithm for 2 iterations
kds.mice(2)
# Return the completed dataset -> Just works normally
iris_complete = kds.complete_data(0)
#print(iris_complete.isnull().sum())
# New data has missing values in species column
new_missing_cols = ["sepal length (cm)", "sepal width (cm)", "species"]
iris_amp2_new = iris.iloc[range(10),:].copy()
iris_amp2_new[new_missing_cols] = mf.ampute_data(iris_amp2_new[new_missing_cols], perc=0.25,random_state=1991)
# Species column can still be imputed
iris_amp2_new_imp = kds.impute_new_data(iris_amp2_new)
iris_amp2_new_imp.complete_data(0).isnull().sum()
I have the following result:
But this exact same code with train_nonmissing=True generates this:
For the complete log error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
392 if cls is not object \
393 and callable(cls.__dict__.get('__repr__')):
--> 394 return _repr_pprint(obj, self, cycle)
395
396 return _default_pprint(obj, self, cycle)
/opt/conda/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
698 """A pprint that just redirects to the normal repr function."""
699 # Find newlines and replace them with p.break_()
--> 700 output = repr(obj)
701 lines = output.splitlines()
702 with p.group():
/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in __repr__(self)
339
340 def __repr__(self):
--> 341 summary_string = " " * 14 + "Class: ImputedData\n" + self._ids_info()
342 return summary_string
343
/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in _ids_info(self)
347 Iterations: {self.iteration_count()}
348 Imputed Variables: {len(self.imputation_order)}
--> 349 save_all_iterations: {self.save_all_iterations}"""
350 return summary_string
351
/opt/conda/lib/python3.7/site-packages/miceforest/ImputedData.py in iteration_count(self, datasets, variables)
498 if len(ds_uniq) > 1:
499 raise ValueError(
--> 500 "iterations were not consistent across provided datasets, variables."
501 )
502
ValueError: iterations were not consistent across provided datasets, variables.
From my research into your code, my suspicion was that this error would happen if a nonmissing column didn't need to be imputed, but my tests show that this wasn't the issue. The returned object fails an internal check.
How could I fix this?
Good catch, this is a bug in how ImputedData stores the iteration count. It's storing iterations for variables that aren't being imputed, so those sit at an iteration count of 0 while the other variables' iterations get incremented. This bug should be pretty isolated - as long as you don't try to get the global iteration count from iris_amp2_new_imp, I don't think you'll run into the error. You should still be able to use .complete_data(). Printing iris_amp2_new_imp actually calls self.iteration_count(), which is why that code is failing.
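The failure mode can be sketched with a toy version of the consistency check: if variables that were never imputed stay at iteration 0 while the imputed variables advance, the set of per-variable counts no longer collapses to a single value. The names and data structure below are hypothetical illustrations, not miceforest's actual internals:

```python
# Toy sketch (hypothetical, not miceforest internals): iteration counts
# keyed by (dataset, variable). Variables never imputed stay at 0.
iteration_counts = {
    ("dataset0", "sepal length (cm)"): 2,  # imputed for 2 iterations
    ("dataset0", "species"): 2,            # imputed for 2 iterations
    ("dataset0", "petal width (cm)"): 0,   # never imputed -> stuck at 0
}

def global_iteration_count(counts):
    """Return the single shared iteration count, mimicking the check
    that raises inside ImputedData.iteration_count()."""
    unique = set(counts.values())
    if len(unique) > 1:
        raise ValueError(
            "iterations were not consistent across provided datasets, variables."
        )
    return unique.pop()

try:
    global_iteration_count(iteration_counts)
except ValueError as err:
    print(err)  # the mix of 2s and a 0 triggers the error
```

This is why completing the data still works (it never consults the global count), while repr() on the object fails.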
This should be an easy fix, but I am actually in the middle of a medium sized update to the package, so I probably won't get to it for a few days.
As a side note - Changing train_nonmissing in the code above won't functionally do anything, since all of the columns in iris_amp contain missing values. However, I could see how this bug would cause problems in real-world scenarios.
Regarding your side note: I used iris as an example, but my actual dataset has this characteristic (a column that is complete in training but not in testing).
My original concern was that the error made my complete_data() result misleading, but now I'm reassured.