Joblib problems after using sweetviz #95

Open
fior-di-latte-byte opened this issue Jul 16, 2021 · 5 comments
@fior-di-latte-byte

fior-di-latte-byte commented Jul 16, 2021

Hello,
first off, I'd like to thank you, on behalf of many, for this awesome package.

I am using sweetviz to export a feature report in a data science context. Afterwards, the features are post-processed with an sklearn (0.24.1) ColumnTransformer that (among other things) uses a TargetEncoder for the categorical columns (category-encoders==2.2.2).

Here is the odd thing:
Whenever I call sv.analyze(features) before ColumnTransformer(...).fit_transform(features), the latter throws a rather long joblib error, complaining that a FloatingPointError occurred in a worker that uses the TargetEncoder.

Let me be clear: if I drop that single line, sv.analyze(features), everything works smoothly [on multiple systems].

This leads me to suspect that sweetviz leaves some kind of artifact behind that the ColumnTransformer subsequently stumbles over.

Does anybody have an idea what might be causing this behaviour?

Thanks in advance!
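For illustration, here is a minimal sketch of the sequence that triggers the error. The dataframe, column names, and pipeline below are made up; the real pipeline is more involved, but the structure is the same:

```python
import pandas as pd
import sweetviz as sv
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer

# Stand-in data; the real features dataframe is much larger.
features = pd.DataFrame({
    "cat_col": ["a"] * 1000 + ["b"] * 1000,
    "num_col": range(2000),
})
target = pd.Series([0, 1] * 1000)

sv.analyze(features)  # if this line is removed, everything works

postprocessor = ColumnTransformer(
    [("target_enc", TargetEncoder(), ["cat_col"])],
    remainder="passthrough",
)
# With the analyze() call above, this is where the FloatingPointError below is raised.
features_processed = postprocessor.fit_transform(features, target)
```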

Traceback (most recent call last):
File "/Users/someuser/Projekte/2021/someproject/somesrc/cli.py", line 106, in create_features
features_processed = ft_postprocessor.fit_transform(
File "/Users/someuser/Projekte/2021/someproject/somesrc/features/feature_postprocessor.py", line 155, in fit_transform
self.fit(X)
File "/Users/someuser/Projekte/2021/someproject/somesrc/features/feature_postprocessor.py", line 115, in fit
self.pipe.fit(X, y=target_for_encoding)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/memory.py", line 352, in call
return self.func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 507, in fit_transform
result = self._fit_transform(X, y, _fit_transform_one)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 434, in _fit_transform
return Parallel(n_jobs=self.n_jobs)(
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in call
if self.dispatch_one_batch(iterator):
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in init
self.results = batch()
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in call
return [func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/joblib/parallel.py", line 262, in
return [func(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in call
return self.function(*args, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/utils.py", line 150, in fit_transform
return self.fit(X, y, **fit_params).transform(X, y)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/target_encoder.py", line 142, in fit
self.mapping = self.fit_target_encoding(X_ordinal, y)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/category_encoders/target_encoder.py", line 172, in fit_target_encoding
smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/pandas/core/generic.py", line 1936, in array_ufunc
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "/Users/someuser/Projekte/2021/someproject/.venv/lib/python3.8/site-packages/pandas/core/arraylike.py", line 358, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
FloatingPointError: underflow encountered in exp

I am using macOS Big Sur and Python 3.8.2.

antlr4-python3-runtime==4.8
appdirs==1.4.4
astunparse==1.6.3; python_version < "3.9"
atomicwrites==1.4.0; sys_platform == "win32"
attrs==21.2.0
category-encoders==2.2.2
cfgv==3.3.0
click==8.0.1
colorama==0.4.4; platform_system == "Windows"
cycler==0.10.0
cython==0.29.24
distlib==0.3.2
filelock==3.0.12
ghp-import==2.0.1
hdbscan==0.8.27
hydra-core==1.1.0
hydra==2.5
identify==2.2.10
importlib-metadata==4.5.0
importlib-resources==5.1.4
importlib-resources==5.1.4; python_version < "3.9"
iniconfig==1.1.1
jinja2==3.0.1
joblib==1.0.1
kiwisolver==1.3.1
livereload==2.6.3
llvmlite==0.36.0
markdown==3.3.4
markupsafe==2.0.1
matplotlib==3.4.2
mergedeep==1.3.4
mkdocs-autorefs==0.2.1
mkdocs-material-extensions==1.0.1
mkdocs-material==7.1.7
mkdocs==1.2
mkdocstrings==0.15.1
nodeenv==1.6.0
numba==0.53.1
numpy==1.21.0
omegaconf==2.1.0
packaging==20.9
pandas==1.2.4
patsy==0.5.1
pillow==8.2.0
pluggy==0.13.1
pre-commit==2.13.0
py==1.10.0
pyarrow==4.0.1
pygments==2.9.0
pymdown-extensions==8.2
pyparsing==2.4.7
pytest==6.2.4
python-dateutil==2.8.1
pytkdocs==0.11.1
pytz==2021.1
pyyaml-env-tag==0.1
pyyaml==5.4.1
scikit-learn==0.24.2
scipy==1.6.1
six==1.16.0
statsmodels==0.12.2
sweetviz==2.1.2
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.1; python_version > "2.7"
tqdm==4.61.1
virtualenv==20.4.7
watchdog==2.1.2
yellowbrick @ git+https://github.com/DistrictDataLabs/yellowbrick@develop
zipp==3.4.1
zipp==3.4.1; python_version < "3.10"

@fbdesignpro
Owner

Hi @fior-di-latte-byte, thanks for the detailed report!

This is the first time I've heard of something like this, but it's definitely possible that at some point during processing the source dataframe gets modified somehow. A lot of processing happens, but it is generally done on copies of the data.

I know "Boolean" columns can get their values standardized from Y/N etc. to 0/1, but I don't think that happens in-place.

I thought a column named "index" would get renamed to "df_index", but I don't think that happens in-place either.

I tried doing a couple of before-and-after tests to see if anything changes, but I haven't seen anything obvious yet.

@fior-di-latte-byte
Author

Hi,
thanks for the answer. :-)
Actually, I also tried running the analyze method on a deep copy of the dataframe (features.copy()), but the problem persists. So the actual problem seems to be unrelated to the dataframe itself.

I already imagined this error would be hard to reproduce. But who knows, maybe someone else will show up with the same problem. If not, even better. ;-)

Anyway, thanks again.
Greetings to Canada

@fbdesignpro
Owner

Hi @fior-di-latte-byte,
thank you for the follow-up! That's very interesting, and good to know that the source is "left unharmed", at least to a good degree. :)

So it would have to be some global state or variable that is confusing joblib.

I noticed stats['count'] in the call stack. I wondered if that could have gotten overridden somehow, but creating a dummy stats object before calling analyze didn't show any change to that dummy before/after, so that was not it.

But it's probably something similar/related. Perhaps something used to set up stats causes it to end up with a large negative value (or does something similar to the other values inside that exp), hence the underflow. Hmm...
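For reference, np.exp of a large negative argument underflows to zero silently under NumPy's default settings; it only raises if the process-wide floating-point error state has been changed, which would fit the global-state theory. A minimal illustration, independent of sweetviz or the pipeline above:

```python
import numpy as np

np.exp(-1000.0)            # 0.0 under default settings, no error

np.seterr(under='raise')   # if some earlier code changed the global error state...
np.exp(-1000.0)            # ...this raises: FloatingPointError: underflow encountered in exp
```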

@CamiloSalomonT

Hi @fbdesignpro @fior-di-latte-byte ,

I ran into the same error yesterday, so here is a way to replicate the FloatingPointError. I also found that this behavior does not depend on modifying the DataFrame data, as I show in the notebook.

I hope you find it helpful,
Greetings from Colombia

@HanyHossny

Hi Guys,

I dug into the SweetViz code and found the line np.seterr(all='raise') in graph_numeric.py; commenting it out made the underflow problem disappear.

The other workaround is to call np.seterr(under='ignore') right after the SweetViz call completes. That way we reset how NumPy deals with this kind of error.
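A sketch of that second workaround (the dataframe here is just a stand-in; the important part is the np.seterr call right after analyze):

```python
import numpy as np
import pandas as pd
import sweetviz as sv

features = pd.DataFrame({"cat_col": ["a", "b", "a"], "num_col": [1, 2, 3]})  # stand-in data

report = sv.analyze(features)   # graph_numeric.py calls np.seterr(all='raise') internally
np.seterr(under='ignore')       # restore NumPy's default handling of underflow

# ...now the ColumnTransformer / TargetEncoder step can be fitted without the FloatingPointError.
```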

I hope this helps
Thanks
Hany Hossny

@fbdesignpro added the bug (Something isn't working) and working on it (A fix/update for this should be coming soon!) labels on Oct 4, 2023
@fbdesignpro added the workaround found and can't repro issue labels, and removed working on it, on Nov 15, 2023