Bugfix stratify pretest sample #1004

@kristiankaufmann kristiankaufmann commented Jan 16, 2020

What does this PR do?

Fixes an edge case that arises when using TPOT on a dataset with extreme class imbalance: the randomly drawn pretest sample could contain only one class, and that single-class pretest_sample caused an error to be thrown.

Where should the reviewer start?

See the changes in the TPOT base, which introduce an init_pretest function, and the corresponding implementation in TPOTClassifier.

How should this PR be tested?

A unit test was added to verify that the pretest function works as intended.

Any background context you want to provide?

See the PR description above.

Questions:

  • Do the docs need to be updated? no
  • Does this PR add new (Python) dependencies? no

Kristian Kaufmann added 2 commits January 16, 2020 16:05
…alanced datasets

The pretest_* member variables are used to check dataset and pipeline compatibility. With highly imbalanced classification tasks, there is a chance that the sample contains only a single class. These changes ensure that at least one label for each class is included in the pretest dataset.
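
The idea in the commit message can be illustrated with a minimal, standard-library sketch. This is not the code from this PR: the function name `pretest_sample_indices` and its parameters are hypothetical, chosen only to show one way of guaranteeing that every class appears at least once in a small sample.

```python
import random


def pretest_sample_indices(labels, sample_size, seed=None):
    """Return sorted row indices of a random sample that contains every class.

    Hypothetical helper for illustration; not part of TPOT's API.
    """
    rng = random.Random(seed)

    # Reserve one representative row per class (the first occurrence of each label).
    first_index = {}
    for i, label in enumerate(labels):
        first_index.setdefault(label, i)
    required = list(first_index.values())

    # The sample cannot exceed the dataset and must hold one row per class.
    sample_size = max(min(sample_size, len(labels)), len(required))

    # Fill the remaining slots with a plain random draw from the other rows.
    required_set = set(required)
    remaining = [i for i in range(len(labels)) if i not in required_set]
    extra = rng.sample(remaining, sample_size - len(required))
    return sorted(required + extra)


# Highly imbalanced labels: 995 rows of class 0 and 5 rows of class 1.
labels = [0] * 995 + [1] * 5
idx = pretest_sample_indices(labels, sample_size=50, seed=0)
classes_in_sample = {labels[i] for i in idx}
```

In this sketch the minority class is always present because its representative row is reserved before the random draw, so the result does not depend on the seed.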

coveralls commented Jan 16, 2020

Coverage Status

Coverage decreased (-0.1%) to 96.649% when pulling 5483750 on kristiankaufmann:bugfix_stratify_pretest_sample into 410d88c on EpistasisLab:development.

Contributor

@weixuanfu weixuanfu left a comment

Thank you for finding this issue and submitting this PR. Just one suggestion below for train/test splits in classification. Maybe we could also use that stratify param in train_test_split for regression. If so, I think the fix will be a little simpler.
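
As a rough illustration of the `stratify` suggestion for classification, here is a small sketch using scikit-learn's `train_test_split`. The synthetic data, the sample size of 40, and the variable names are assumptions made for this example; they are not taken from the PR.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (an assumption for this example):
# 950 rows of class 0 and 50 rows of class 1.
rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)

# stratify=y asks for a split whose class proportions follow those of y,
# so the small sample keeps rows from the minority class.
X_pre, _, y_pre, _ = train_test_split(
    X, y, train_size=40, stratify=y, random_state=0
)

classes_in_sample = set(np.unique(y_pre).tolist())
```

Stratification is proportional, so when a class is rare enough that its expected share of the sample is below one row, a stratified sample may still omit it. That may be one reason an explicit per-class check remains useful alongside `stratify`.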

tpot/tpot.py
@weixuanfu
Contributor

Thank you for the PR. I am merging this patch as a temporary fix.

@weixuanfu weixuanfu merged commit 0de4740 into EpistasisLab:development May 13, 2020