Joblib problems after using sweetviz #95

Open
fior-di-latte-byte opened this issue Jul 16, 2021 · 5 comments
@fior-di-latte-byte

fior-di-latte-byte commented Jul 16, 2021

Hello,
first off, I'd like to thank you, on behalf of many, for this awesome package.

I am using sweetviz to export a feature report in a data science context. Afterwards, the features are post-processed with an sklearn (0.24.1) ColumnTransformer that (among other things) uses a TargetEncoder for the categorical columns (category-encoders==2.2.2).

Here is the odd thing:
Whenever I call sv.analyze(features) before ColumnTransformer(...).fit_transform(features), the latter throws a rather long joblib error, complaining that a FloatingPointError occurred in a worker that uses the TargetEncoder.

Let me be clear: if I drop that single line, sv.analyze(features), everything works smoothly [on multiple systems].

This leads me to suspect that sweetviz leaves some kind of artifact behind that the ColumnTransformer subsequently stumbles over.

Does anybody have an idea what might be causing this behaviour?

Thanks in advance!
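For illustration, here is a minimal sketch of the sequence that triggers the error. The dataframe, column names, and pipeline below are made up; the real pipeline is more involved, but the structure is the same:

```python
import pandas as pd
import sweetviz as sv
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer

# Stand-in data; the real features dataframe is much larger.
features = pd.DataFrame({
    "cat_col": ["a"] * 1000 + ["b"] * 1000,
    "num_col": range(2000),
})
target = pd.Series([0, 1] * 1000)

sv.analyze(features)  # if this line is removed, everything works

postprocessor = ColumnTransformer(
    [("target_enc", TargetEncoder(), ["cat_col"])],
    remainder="passthrough",
)
# With the analyze() call above, this is where the FloatingPointError below is raised.
features_processed = postprocessor.fit_transform(features, target)
```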

Traceback (most recent call last):
File "/Users/someuser/Projekte/2021/someproject/somesrc/cli.py", line 106, in create_features
features_processed = ft_postprocessor.fit_transform(
File "/Users/someuser/Projekte/2021/someproject/somesrc/features/feature_postprocessor.py", line 155, in fit_transform
self.fit(X)
File "/Users/someuser/Projekte/2021/someproject/somesrc/features/feature_postprocessor.py", line 115, in fit
self.pipe.fit(X, y=target_for_encoding)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/memory.py", line 352, in call
return self.func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 507, in fit_transform
result = self._fit_transform(X, y, _fit_transform_one)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 434, in _fit_transform
return Parallel(n_jobs=self.n_jobs)(
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in call
if self.dispatch_one_batch(iterator):
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in init
self.results = batch()
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in call
return [func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in
return [func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in call
return self.function(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/utils.py", line 150, in fit_transform
return self.fit(X, y, **fit_params).transform(X, y)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/target_encoder.py", line 142, in fit
self.mapping = self.fit_target_encoding(X_ordinal, y)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/target_encoder.py", line 172, in fit_target_encoding
smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/pandas/core/generic.py", line 1936, in array_ufunc
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/pandas/core/arraylike.py", line 358, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
FloatingPointError: underflow encountered in exp

I am using macOS Big Sur and Python 3.8.2.

antlr4-python3-runtime==4.8
appdirs==1.4.4
astunparse==1.6.3; python_version < "3.9"
atomicwrites==1.4.0; sys_platform == "win32"
attrs==21.2.0
category-encoders==2.2.2
cfgv==3.3.0
click==8.0.1
colorama==0.4.4; platform_system == "Windows"
cycler==0.10.0
cython==0.29.24
distlib==0.3.2
filelock==3.0.12
ghp-import==2.0.1
hdbscan==0.8.27
hydra-core==1.1.0
hydra==2.5
identify==2.2.10
importlib-metadata==4.5.0
importlib-resources==5.1.4
importlib-resources==5.1.4; python_version < "3.9"
iniconfig==1.1.1
jinja2==3.0.1
joblib==1.0.1
kiwisolver==1.3.1
livereload==2.6.3
llvmlite==0.36.0
markdown==3.3.4
markupsafe==2.0.1
matplotlib==3.4.2
mergedeep==1.3.4
mkdocs-autorefs==0.2.1
mkdocs-material-extensions==1.0.1
mkdocs-material==7.1.7
mkdocs==1.2
mkdocstrings==0.15.1
nodeenv==1.6.0
numba==0.53.1
numpy==1.21.0
omegaconf==2.1.0
packaging==20.9
pandas==1.2.4
patsy==0.5.1
pillow==8.2.0
pluggy==0.13.1
pre-commit==2.13.0
py==1.10.0
pyarrow==4.0.1
pygments==2.9.0
pymdown-extensions==8.2
pyparsing==2.4.7
pytest==6.2.4
python-dateutil==2.8.1
pytkdocs==0.11.1
pytz==2021.1
pyyaml-env-tag==0.1
pyyaml==5.4.1
scikit-learn==0.24.2
scipy==1.6.1
six==1.16.0
statsmodels==0.12.2
sweetviz==2.1.2
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.1; python_version > "2.7"
tqdm==4.61.1
virtualenv==20.4.7
watchdog==2.1.2
yellowbrick @ git+https://github.com/DistrictDataLabs/yellowbrick@develop
zipp==3.4.1
zipp==3.4.1; python_version < "3.10"

@fbdesignpro
Owner

Hi @fior-di-latte-byte, thanks for the detailed report!

This is the first time I've heard of something like this, but it's definitely possible that at some point during processing the source dataframe gets modified somehow. A lot of processing happens, but it is generally done on copies of the data.

I know "Boolean" columns can get their values standardized from Y/N etc. to 0/1, but I don't think that happens in-place.

I thought a column named "index" would get renamed to "df_index", but I don't think that happens in-place either.

I tried doing a couple of before-and-after tests to see if anything changes, but I haven't seen anything obvious yet.

@fior-di-latte-byte
Author

Hi,
thanks for the answer. :-)
Actually, I also tried running the analyze method on a deep copy of the dataframe (features.copy()), but the problem persists. So the actual problem seems to be unrelated to the dataframe itself.

I already imagined this error would be hard to reproduce. But who knows, maybe someone else will show up with the same problem. If not, even better. ;-)

Anyway, thanks again.
Greetings to Canada

@fbdesignpro
Owner

Hi @fior-di-latte-byte,
thank you for the follow-up! That's very interesting, and good to know that the source is "left unharmed", at least to a good degree. :)

So it would have to be some global state or variable that is confusing joblib.

I noticed stats['count'] in the call stack. I wondered if that could have gotten overridden somehow, but creating a dummy stats object before calling analyze didn't show any change to that dummy before/after, so that was not it.

But it's probably something similar/related. Perhaps something used to set up stats causes it to end up with a large negative value (or does something similar to the other values inside that exp), hence the underflow. Hmm...
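For reference, np.exp of a large negative argument underflows to zero silently under NumPy's default settings; it only raises if the process-wide floating-point error state has been changed, which would fit the global-state theory. A minimal illustration, independent of sweetviz or the pipeline above:

```python
import numpy as np

np.exp(-1000.0)            # 0.0 under default settings, no error

np.seterr(under='raise')   # if some earlier code changed the global error state...
np.exp(-1000.0)            # ...this raises: FloatingPointError: underflow encountered in exp
```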

@CamiloSalomonT

Hi @fbdesignpro @fior-di-latte-byte ,

I ran into the same error yesterday, so here is a way to replicate the FloatingPointError. I also found that this behavior does not depend on modifying the DataFrame data, as I show in the notebook.

I hope you find it helpful,
Greetings from Colombia

@HanyHossny

Hi Guys,

I dug into the SweetViz code and found the line np.seterr(all='raise') in graph_numeric.py; commenting it out made the underflow problem disappear.

The other workaround is to call np.seterr(under='ignore') right after the SweetViz call completes. That way we reset how NumPy deals with this kind of error.
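A sketch of that second workaround (the dataframe here is just a stand-in; the important part is the np.seterr call right after analyze):

```python
import numpy as np
import pandas as pd
import sweetviz as sv

features = pd.DataFrame({"cat_col": ["a", "b", "a"], "num_col": [1, 2, 3]})  # stand-in data

report = sv.analyze(features)   # graph_numeric.py calls np.seterr(all='raise') internally
np.seterr(under='ignore')       # restore NumPy's default handling of underflow

# ...now the ColumnTransformer / TargetEncoder step can be fitted without the FloatingPointError.
```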

I hope this helps
Thanks
Hany Hossny

@fbdesignpro added the bug (Something isn't working) and working on it (A fix/update for this should be coming soon!) labels on Oct 4, 2023
@fbdesignpro added the workaround found and can't repro issue labels, and removed working on it, on Nov 15, 2023