Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

Open
el-hult opened this issue May 4, 2024 · 2 comments
Open

Yeo-Johnson inverse_transform fails silently on extreme skew data #28946

el-hult opened this issue May 4, 2024 · 2 comments
Labels

Comments

@el-hult
Copy link

el-hult commented May 4, 2024

Describe the bug

The Yeo-Johnson is not a surjective transformation for negative lambdas. Therefore, the inverse transformation returns np.nan when inverse transforming values outside the range of the transform. This failure is silent, so it took me quite a while of debugging to understand this behavior.

The problematic lines are

x_inv[~pos] = 1 - np.power(-(2 - lmbda) * x[~pos] + 1, 1 / (2 - lmbda))

and

x_inv[pos] = np.power(x[pos] * lmbda + 1, 1 / lmbda) - 1

in which we might compute np.power(something_negative, not_integral_value), which of course returns np.nan as per https://numpy.org/doc/stable/reference/generated/numpy.power.html

Steps/Code to Reproduce

To reproduce for positive values (there is a similar problem for negative values):

import numpy as np
import sklearn.preprocessing
trans = sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
x = np.array([1,1,1e10]).reshape(-1, 1) # extreme skew
trans.fit(x)
lmbda = trans.lambdas_[0] 
print(lmbda)
assert lmbda < 0 # == -0.096 negative value

# any value `psi` for which lambda*psi+1 <= 0 will result in nan due to lacking support, since the forwards transformation 
# is not surjective on negative lambdas. In this specific case, 10*-0.096 < 1
psi = np.array([10]).reshape(-1, 1)
x = trans.inverse_transform(psi).item()
print(x)
assert np.isnan(x)

Expected Results

The code should either:

  1. validate its inputs and raise an exception
  2. validate its inputs and raise a warning
  3. fail silently, but have it documented behavior

Actual Results

It just prints

-0.0962322261004418
nan

Versions

System:
    python: 3.11.3 (main, Jan 18 2024, 19:07:12) [Clang 18.0.0 (https://github.com/llvm/llvm-project 75501f53624de92aafce2f1da698
executable: /home/pyodide/this.program
   machine: Emscripten-3.1.46-wasm32-32bit

Python dependencies:
      sklearn: 1.3.1
          pip: None
   setuptools: None
        numpy: 1.26.1
        scipy: 1.11.2
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: False
@el-hult el-hult added Bug Needs Triage Issue requires triage labels May 4, 2024
@ogrisel ogrisel removed the Needs Triage Issue requires triage label May 10, 2024
@ogrisel
Copy link
Member

ogrisel commented May 10, 2024

Thanks for the report and the analysis. I confirm I can reproduce on main with the provided reproducer.

I think that calling inverse_transform with negative lambda values should at least raise a warning.

Not sure if it would be helpful to raise such a warning at fit time though. Maybe some users only care about the transform without inverse_transform and raising a warning would be annoying for those users.

@el-hult
Copy link
Author

el-hult commented May 14, 2024

I think that a warning in inverse_transform, behind a check that diagnose whether this error case is hit or not. I suppose it looks something like

tmp1 = -(2 - lmbda) * x[~pos] + 1
tmp2 = x[pos] * lmbda + 1
lambda < 0 and ( np.any( tmp1< 0) or np.any(tmp2 < 0))

Short circuiting means the np.any calls are often not computed at all, so it should be really cheap. tmp1 and tmp2 needs to be computed for the inverse_transform so that is no extra work.

EDIT: hopefully with better variable names than tmp1 and tmp2...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Easy
Development

No branches or pull requests

2 participants