[BUG] SMOTENC fails with ValueError: zero-size array to reduction operation maximum which has no identity #1035

Open
Ingvar-Y opened this issue Aug 17, 2023 · 0 comments


Describe the bug

SMOTENC fit_resample fails with the NumPy error "ValueError: zero-size array to reduction operation maximum which has no identity" when execution reaches this line in SMOTENC._generate_samples:

is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))
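The NumPy error itself can be reproduced in isolation. The snippet below is a minimal sketch, assuming col_maxs ends up with zero columns at that point; that is a guess about the failing state, not something the traceback confirms. The array here is a hypothetical stand-in, not data from the report.

```python
import numpy as np

# Hypothetical stand-in for col_maxs: 3 rows but zero columns.
# (Assumption: the slice feeding col_maxs was empty along axis 1.)
col_maxs = np.zeros((3, 0))

try:
    # Same reduction as the failing line: max over an axis of length 0.
    col_maxs.max(axis=1, keepdims=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A maximum over an empty axis has no identity value, so NumPy raises instead of returning a result. Reducing over an axis of length zero is the only way this particular message is produced.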

Steps/Code to Reproduce

The cause is unclear. One possibility is that the dataset is highly imbalanced: the binary target equals 1 in only 134 of 22763 samples.
Example:

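from imblearn.over_sampling import SMOTENC
from sklearn.preprocessing import OneHotEncoder

# X, y and labels come from the original dataset (not shown here);
# labels[:10] identifies the first 10 features as categorical.
oversample = SMOTENC(
    categorical_features=labels[:10],
    categorical_encoder=OneHotEncoder(drop="if_binary", handle_unknown="ignore"),
    sampling_strategy="minority",
)
X_smotenc, y_smotenc = oversample.fit_resample(X, y)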

Using SMOTE instead works without problem.

Expected Results

The dataset is oversampled without raising an error.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[156], line 1
----> 1 trdata_smotenc, tgt_smotenc = oversample.fit_resample(ctrdata, trdata.DEF_FLG)

File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:208, in BaseSampler.fit_resample(self, X, y)
    187 """Resample the dataset.
    188 
    189 Parameters
   (...)
    205     The corresponding label of `X_resampled`.
    206 """
    207 self._validate_params()
--> 208 return super().fit_resample(X, y)

File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:112, in SamplerMixin.fit_resample(self, X, y)
    106 X, y, binarize_y = self._check_X_y(X, y)
    108 self.sampling_strategy_ = check_sampling_strategy(
    109     self.sampling_strategy, y, self._sampling_type
    110 )
--> 112 output = self._fit_resample(X, y)
    114 y_ = (
    115     label_binarize(output[1], classes=np.unique(y)) if binarize_y else output[1]
    116 )
    118 X_, y_ = arrays_transformer.transform(output[0], y_)

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:683, in SMOTENC._fit_resample(self, X, y)
    680 X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / 2
    681 X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
--> 683 X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
    685 # reverse the encoding of the categorical features
    686 X_res_cat = X_resampled[:, self.continuous_features_.size :]

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:365, in SMOTE._fit_resample(self, X, y)
    363 self.nn_k_.fit(X_class)
    364 nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
--> 365 X_new, y_new = self._make_samples(
    366     X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
    367 )
    368 X_resampled.append(X_new)
    369 y_resampled.append(y_new)

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:119, in BaseSMOTE._make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    116 rows = np.floor_divide(samples_indices, nn_num.shape[1])
    117 cols = np.mod(samples_indices, nn_num.shape[1])
--> 119 X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
    120 y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
    121 return X_new, y_new

File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:755, in SMOTENC._generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
    753 col_maxs = all_neighbors[:, :, start_idx:end_idx].sum(axis=1)
    754 # tie breaking argmax
--> 755 is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))
    756 max_idxs = rng.permutation(np.argwhere(is_max))
    757 xs, idx_sels = np.unique(max_idxs[:, 0], return_index=True)

File ~\.conda\envs\test\Lib\site-packages\numpy\core\_methods.py:41, in _amax(a, axis, out, keepdims, initial, where)
     39 def _amax(a, axis=None, out=None, keepdims=False,
     40           initial=_NoValue, where=True):
---> 41     return umr_maximum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation maximum which has no identity

Versions

System:
    python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
executable: C:\Users\user\.conda\envs\test\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 68.0.0
        numpy: 1.25.2
        scipy: 1.11.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 4
         prefix: libblas
       filepath: C:\Users\user\.conda\envs\test\Library\bin\libblas.dll
        version: 2022.1-Product
threading_layer: intel

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: vcomp
       filepath: C:\Users\user\.conda\envs\test\vcomp140.dll
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libiomp
       filepath: C:\Users\user\.conda\envs\test\Library\bin\libiomp5md.dll
        version: None