oversample and undersample always return classes as well #116

Merged
merged 1 commit into from
Oct 28, 2022


CarloLucibello (Member) commented Aug 5, 2022

Fix #113 by having the implementation adhere to the docs instead of changing the docs.
The resampled classes are now always returned.

Also, made the under/oversample calls deterministic when shuffle=false.

Since the change is breaking with respect to the previous behavior (but non-breaking with respect to the behavior described in the docs), I'm also bumping the minor version.
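
To illustrate the behavior described in this PR (resampled classes always returned, and a deterministic result when shuffle=false), here is a minimal, self-contained sketch. It is not the package's implementation: `oversample_sketch` is a hypothetical helper, and the choice to pad smaller classes by cycling through their own observations in order is an assumption made only so the result is reproducible without randomness.

```julia
using Random

# Hypothetical helper for illustration only; NOT the library's implementation.
# Observations are assumed to be stored as the columns of X.
function oversample_sketch(X::AbstractMatrix, Y::AbstractVector;
                           shuffle::Bool = true,
                           rng = Random.default_rng())
    # Group observation indices by class label, in order of first appearance.
    labels = unique(Y)
    groups = [findall(==(l), Y) for l in labels]
    maxcount = maximum(length, groups)

    inds = Int[]
    for g in groups
        append!(inds, g)
        # Assumption: pad smaller classes by cycling through their own
        # indices in order, so no randomness is involved at this step.
        for k in 1:(maxcount - length(g))
            push!(inds, g[mod1(k, length(g))])
        end
    end

    # Randomness enters only if shuffling is requested.
    shuffle && Random.shuffle!(rng, inds)

    # Always return the resampled labels together with the data.
    return X[:, inds], Y[inds]
end

# Usage: 6 observations with 3 features each, 2 imbalanced classes.
X = reshape(collect(1.0:18.0), 3, 6)
Y = ["a", "b", "b", "b", "b", "a"]
X_bal, Y_bal = oversample_sketch(X, Y; shuffle = false)
```

With shuffle=false, two calls on the same input give identical outputs in this sketch, and the returned labels line up column by column with the returned data.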

@SimonEnsemble

I'm here because I am confused about how oversample works.

# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# oversample the class "a" to match "b"
X_bal, Y_bal = oversample(X, Y)
# this results in a bigger dataset with repeated data
@assert size(X_bal) == (3,8)
@assert length(Y_bal) == 8
# now both "a", and "b" have 4 observations each
@assert sum(Y_bal .== "a") == 4
@assert sum(Y_bal .== "b") == 4

This does not hold as advertised...

Successfully merging this pull request may close these issues.

Incorrect docs for undersample and oversample