Describe the bug
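SMOTENC fit_resample fails with the NumPy error ValueError: zero-size array to reduction operation maximum which has no identity.
The error is raised at the tie-breaking col_maxs.max(axis=1, keepdims=True) call in SMOTENC._generate_samples (imblearn\over_sampling\_smote\base.py, line 755 in the traceback under Actual Results).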
Steps/Code to Reproduce
The cause is unclear. It may be related to the dataset being highly imbalanced: the binary target equals 1 in only 134 of 22763 samples.
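To put the imbalance in numbers, here is a small sketch of the arithmetic. It assumes that 22763 is the total number of samples and that sampling_strategy="minority" brings the minority class up to the size of the majority class; the variable names are illustrative only.

```python
# Counts taken from the report; variable names are hypothetical.
n_total = 22763
n_minority = 134

# Assumption: the target is binary, so every other sample is majority class.
n_majority = n_total - n_minority

# Share of positive samples in the dataset.
minority_share = n_minority / n_total

# Assumption: "minority" oversamples until both classes have the same size,
# so this many synthetic minority samples would have to be generated.
n_synthetic = n_majority - n_minority

print(f"majority={n_majority}, minority share={minority_share:.2%}, "
      f"synthetic samples needed={n_synthetic}")
```

Under these assumptions the oversampler would have to generate far more synthetic rows than there are real minority rows.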
Example:
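from imblearn.over_sampling import SMOTENC
from sklearn.preprocessing import OneHotEncoder

# X, y: feature matrix and binary target (not shown here).
# labels is assumed to list the feature names of X, the first 10 being categorical.
oversample = SMOTENC(
    categorical_features=labels[:10],
    categorical_encoder=OneHotEncoder(drop="if_binary", handle_unknown="ignore"),
    sampling_strategy="minority",
)
X_smotenc, y_smotenc = oversample.fit_resample(X, y)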
Using SMOTE instead of SMOTENC works without any problem.
Expected Results
The dataset is oversampled without raising an error.
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[156], line 1
----> 1 trdata_smotenc, tgt_smotenc = oversample.fit_resample(ctrdata, trdata.DEF_FLG)
File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:208, in BaseSampler.fit_resample(self, X, y)
187 """Resample the dataset.
188
189 Parameters
(...)
205 The corresponding label of `X_resampled`.
206 """
207 self._validate_params()
--> 208 return super().fit_resample(X, y)
File ~\.conda\envs\test\Lib\site-packages\imblearn\base.py:112, in SamplerMixin.fit_resample(self, X, y)
106 X, y, binarize_y = self._check_X_y(X, y)
108 self.sampling_strategy_ = check_sampling_strategy(
109 self.sampling_strategy, y, self._sampling_type
110 )
--> 112 output = self._fit_resample(X, y)
114 y_ = (
115 label_binarize(output[1], classes=np.unique(y)) if binarize_y else output[1]
116 )
118 X_, y_ = arrays_transformer.transform(output[0], y_)
File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:683, in SMOTENC._fit_resample(self, X, y)
680 X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / 2
681 X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
--> 683 X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
685 # reverse the encoding of the categorical features
686 X_res_cat = X_resampled[:, self.continuous_features_.size :]
File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:365, in SMOTE._fit_resample(self, X, y)
363 self.nn_k_.fit(X_class)
364 nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
--> 365 X_new, y_new = self._make_samples(
366 X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
367 )
368 X_resampled.append(X_new)
369 y_resampled.append(y_new)
File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:119, in BaseSMOTE._make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
116 rows = np.floor_divide(samples_indices, nn_num.shape[1])
117 cols = np.mod(samples_indices, nn_num.shape[1])
--> 119 X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
120 y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
121 return X_new, y_new
File ~\.conda\envs\test\Lib\site-packages\imblearn\over_sampling\_smote\base.py:755, in SMOTENC._generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
753 col_maxs = all_neighbors[:, :, start_idx:end_idx].sum(axis=1)
754 # tie breaking argmax
--> 755 is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))
756 max_idxs = rng.permutation(np.argwhere(is_max))
757 xs, idx_sels = np.unique(max_idxs[:, 0], return_index=True)
File ~\.conda\envs\test\Lib\site-packages\numpy\core\_methods.py:41, in _amax(a, axis, out, keepdims, initial, where)
39 def _amax(a, axis=None, out=None, keepdims=False,
40 initial=_NoValue, where=True):
---> 41 return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
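The final NumPy error can be reproduced in isolation. The sketch below uses a hypothetical stand-in for col_maxs with zero columns; it shows only that a max reduction over an empty axis raises this exact message, not why the array is empty inside SMOTENC.

```python
import numpy as np

# Hypothetical stand-in for col_maxs: three rows, zero columns.
col_maxs = np.zeros((3, 0))

try:
    # Same reduction as in the traceback: max over an axis of length 0.
    col_maxs.max(axis=1, keepdims=True)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

This suggests that the slice all_neighbors[:, :, start_idx:end_idx] had no columns for at least one categorical feature when the error occurred.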
Versions
System:
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)]
executable: C:\Users\user\.conda\envs\test\python.exe
machine: Windows-10-10.0.19045-SP0
Python dependencies:
sklearn: 1.3.0
pip: 23.2.1
setuptools: 68.0.0
numpy: 1.25.2
scipy: 1.11.1
Cython: None
pandas: 2.0.3
matplotlib: 3.7.2
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: mkl
num_threads: 4
prefix: libblas
filepath: C:\Users\user\.conda\envs\test\Library\bin\libblas.dll
version: 2022.1-Product
threading_layer: intel
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: vcomp
filepath: C:\Users\user\.conda\envs\test\vcomp140.dll
version: None
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libiomp
filepath: C:\Users\user\.conda\envs\test\Library\bin\libiomp5md.dll
version: None