Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

el-hult · 2024-05-04T08:23:07Z

Describe the bug

The Yeo-Johnson is not a surjective transformation for negative lambdas. Therefore, the inverse transformation returns np.nan when inverse transforming values outside the range of the transform. This failure is silent, so it took me quite a while of debugging to understand this behavior.

The problematic lines are

scikit-learn/sklearn/preprocessing/_data.py

Line 3390 in 8721245

x_inv[~pos] = 1 - np.power(-(2 - lmbda) * x[~pos] + 1, 1 / (2 - lmbda))

and

scikit-learn/sklearn/preprocessing/_data.py

Line 3386 in 8721245

x_inv[pos] = np.power(x[pos] * lmbda + 1, 1 / lmbda) - 1

in which we might compute np.power(something_negative, not_integral_value), which of course returns np.nan as per https://numpy.org/doc/stable/reference/generated/numpy.power.html

Steps/Code to Reproduce

To reproduce for positive values (there is a similar problem for negative values):

import numpy as np
import sklearn.preprocessing
trans = sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
x = np.array([1,1,1e10]).reshape(-1, 1) # extreme skew
trans.fit(x)
lmbda = trans.lambdas_[0] 
print(lmbda)
assert lmbda < 0 # == -0.096 negative value

# any value `psi` for which lambda*psi+1 <= 0 will result in nan due to lacking support, since the forwards transformation 
# is not surjective on negative lambdas. In this specific case, 10*-0.096 < 1
psi = np.array([10]).reshape(-1, 1)
x = trans.inverse_transform(psi).item()
print(x)
assert np.isnan(x)

Expected Results

The code should either:

validate its inputs and raise an exception
validate its inputs and raise a warning
fail silently, but have it documented behavior

Actual Results

It just prints

-0.0962322261004418
nan

Versions

System:
    python: 3.11.3 (main, Jan 18 2024, 19:07:12) [Clang 18.0.0 (https://github.com/llvm/llvm-project 75501f53624de92aafce2f1da698
executable: /home/pyodide/this.program
   machine: Emscripten-3.1.46-wasm32-32bit

Python dependencies:
      sklearn: 1.3.1
          pip: None
   setuptools: None
        numpy: 1.26.1
        scipy: 1.11.2
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: False

The text was updated successfully, but these errors were encountered:

ogrisel · 2024-05-10T09:45:52Z

Thanks for the report and the analysis. I confirm I can reproduce on main with the provided reproducer.

I think that calling inverse_transform with negative lambda values should at least raise a warning.

Not sure if it would be helpful to raise such a warning at fit time though. Maybe some users only care about the transform without inverse_transform and raising a warning would be annoying for those users.

el-hult · 2024-05-14T13:51:25Z

I think that a warning in inverse_transform, behind a check that diagnose whether this error case is hit or not. I suppose it looks something like

tmp1 = -(2 - lmbda) * x[~pos] + 1
tmp2 = x[pos] * lmbda + 1
lambda < 0 and ( np.any( tmp1< 0) or np.any(tmp2 < 0))

Short circuiting means the np.any calls are often not computed at all, so it should be really cheap. tmp1 and tmp2 needs to be computed for the inverse_transform so that is no extra work.

EDIT: hopefully with better variable names than tmp1 and tmp2...

el-hult added Bug Needs Triage Issue requires triage labels May 4, 2024

ogrisel removed the Needs Triage Issue requires triage label May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

el-hult commented May 4, 2024 •

edited

ogrisel commented May 10, 2024

el-hult commented May 14, 2024 •

edited

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

Comments

el-hult commented May 4, 2024 • edited

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

ogrisel commented May 10, 2024

el-hult commented May 14, 2024 • edited

el-hult commented May 4, 2024 •

edited

el-hult commented May 14, 2024 •

edited