When we use StratifiedShuffleSplit to do a train/test split stratified across classes, we would expect to need 2 samples of the least-represented class: one for test, one for train. That is not what happens: we need 3 samples of the least-represented class.
I think this is not a bug but rather a known implementation detail: looking at the code, we use _approximate_mode, which is known to be an approximation that can be off by 1 (the value that you observed). I assume that this approximation is done for computational reasons.
However, since the behaviour is surprising, I think that we could document it in a "Note" section to mention this corner case.
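To make the off-by-one concrete, here is a minimal pure-Python sketch. It assumes the allocation works like largest-remainder rounding, applied first to the train set and then to whatever is left for the test set. The names `largest_remainder` and `split_counts` are hypothetical, the class counts and split sizes are made up for illustration, and ties are broken deterministically here, whereas scikit-learn's helper breaks them at random. It is not the library's implementation.

```python
def largest_remainder(class_counts, n_draws):
    """Spread n_draws over classes in proportion to class_counts."""
    total = sum(class_counts)
    # Integer arithmetic keeps the floors and remainders exact.
    alloc = [count * n_draws // total for count in class_counts]
    remainders = [count * n_draws % total for count in class_counts]
    leftover = n_draws - sum(alloc)
    # Assumption: ties go to the lowest class index so the output is
    # reproducible; scikit-learn's helper breaks ties at random.
    order = sorted(range(len(class_counts)), key=lambda i: (-remainders[i], i))
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def split_counts(class_counts, n_train, n_test):
    """Allocate per-class train counts, then test counts from the remainder."""
    train = largest_remainder(class_counts, n_train)
    remaining = [count - n for count, n in zip(class_counts, train)]
    test = largest_remainder(remaining, n_test)
    return train, test


# Two samples in the rarest class: both can land in train, none in test.
print(split_counts([2, 10], n_train=9, n_test=3))   # ([2, 7], [0, 3])
# Three samples in the rarest class: each side gets at least one.
print(split_counts([3, 10], n_train=10, n_test=3))  # ([2, 8], [1, 2])
```

In the first call both classes have the same fractional remainder (1.5 and 7.5 before rounding), so the extra train sample can go to either class; when it goes to the rare class, nothing is left of that class for the test set.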
if n_train < n_classes:
    raise ValueError(
        "The train_size = %d should be greater or "
        "equal to the number of classes = %d" % (n_train, n_classes)
    )
if n_test < n_classes:
    raise ValueError(
        "The test_size = %d should be greater or "
        "equal to the number of classes = %d" % (n_test, n_classes)
    )
The check above is the one I'd expect to catch this issue, so I'd suggest modifying it too.
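A short sketch of why that check does not fire: it compares the total train and test sizes with the number of classes, not the number of samples each class actually receives. `check_split_sizes` mirrors the quoted conditions as a standalone function, and `classes_missing` is a hypothetical per-class check added purely for illustration; neither is scikit-learn API, and the sizes and per-class counts are made-up inputs.

```python
def check_split_sizes(n_train, n_test, n_classes):
    """Standalone mirror of the quoted check: it only looks at totals."""
    if n_train < n_classes:
        raise ValueError(
            "The train_size = %d should be greater or "
            "equal to the number of classes = %d" % (n_train, n_classes)
        )
    if n_test < n_classes:
        raise ValueError(
            "The test_size = %d should be greater or "
            "equal to the number of classes = %d" % (n_test, n_classes)
        )


def classes_missing(per_class_counts):
    """Hypothetical extra check: indices of classes that got no samples."""
    return [i for i, n in enumerate(per_class_counts) if n == 0]


# Totals of 9 train and 3 test samples satisfy the check for 2 classes...
check_split_sizes(n_train=9, n_test=3, n_classes=2)
# ...even if the test allocation per class turns out to be [0, 3].
print(classes_missing([0, 3]))  # [0]
```

Whether a per-class check like this belongs in the library, and whether it should raise or warn, is a design decision this sketch does not settle.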
Describe the bug
When we use StratifiedShuffleSplit to do a train/test split stratified across classes, we would expect to need 2 samples of the least-represented class: one for test, one for train. That is not what happens: we need 3 samples of the least-represented class.
sklearn version 1.2.1
Steps/Code to Reproduce
Expected Results
We expect to get a test set and a train set that both contain 1 example of each class when we have 2 representatives.
Actual Results
Versions