…ments object v0.1.4)
…ify splitting process
…ncorporated that change into train_test_validate(). Also added _create_classes' arguments as kwargs to train_test_validate().
Member
I'm good to merge for now!
PR for the training / testing and validation set generation scripts.
PR adds coderdata.split.splitter which contains:
train_test_validate()
This is the main (and "public") function of the submodule. It generates train/test/validation splits and returns three individual CoderData objects, one each for train, test and validate. Arguments that can modify how the individual splits are generated are:
… _create_classes() and will influence how the classes are created.
_create_classes()
Internal "private" helper function that creates the classes needed for the stratification. Arguments (besides the dataset) are:
… split_type in train_test_validate if split_type != None
quantiles: if True, the "bin size" is such that every bin has approximately the same number of data points in the reference dataset. If set to False, the bin size is chosen to be uniform over the range of the drug response metric values.
… num_classes == 2 & quantiles == False; can be used to set a threshold for "uneven" bin size.
_filter()
Internal "private" helper function that aids in creating filtered subsets of the reference CoderData object which only contain data points that pertain to the individual train / test & validate sets.
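The two binning modes described for _create_classes() can be illustrated with a minimal sketch. This is not the code from the PR: the function name create_classes, its signature and the rank-based handling of ties are assumptions made for illustration, and the threshold option for num_classes == 2 is left out because its argument name is not preserved in this description.

```python
def create_classes(values, num_classes=2, quantiles=True):
    """Assign each value a class label in range(num_classes).

    Hypothetical stand-in for the binning idea, not the PR's helper.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    n = len(values)
    if n == 0:
        return []

    if quantiles:
        # Quantile mode: rank the values and cut the ranks into chunks of
        # (almost) equal size, so every class holds a similar number of
        # data points. Ties are broken by position (a simplification).
        order = sorted(range(n), key=lambda i: values[i])
        labels = [0] * n
        for rank, idx in enumerate(order):
            labels[idx] = rank * num_classes // n
        return labels

    # Uniform mode: split the value range [min, max] into bins of equal
    # width; class sizes then depend on how the values are distributed.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes
    if width == 0:
        return [0] * n
    # The maximum value would land in bin `num_classes`, so clamp it.
    return [min(int((v - lo) / width), num_classes - 1) for v in values]


# Illustrative drug response values (made up for this sketch).
responses = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]
print(create_classes(responses, num_classes=3, quantiles=True))
print(create_classes(responses, num_classes=3, quantiles=False))
```

With skewed values such as these, quantile mode keeps the classes balanced while uniform mode concentrates most points in the lower bins, which is the trade-off the quantiles switch controls.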
Example call:
The call detailed above would generate training, testing and validation CoderData objects based on the BeatAML dataset. The splits are generated such that the individual sets are cancer-blind, i.e. cell lines used to test drugs on in train are not present in either test or validate and vice versa. Ratios for the split sizes are 8:1:1 for train/test/validate. The split is stratified using fit_auc as a reference. Stratification is done by internally generating 5 classes (num_classes=5) using "quantiles" (this does not need to be defined in the function call since it is the default behavior; if evenly spaced classes are desired, set quantiles=False). Finally, the seed for the randomization is set to 42 to generate a reproducible split (random_state=42).
What this PR DOESN'T do:
Implement a method on the class, akin to dataset.train_test_validate(), that can be called directly on a loaded CoderData object.
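One way such a method could look in a follow-up is a thin wrapper that delegates to the module-level function. The sketch below is only an illustration under stated assumptions: Dataset is a hypothetical stand-in for the CoderData class, the rows / sample_id layout and the ratio parameter name are assumed, and the simplified train_test_validate() performs only an unstratified cancer-blind split.

```python
import random
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for a loaded CoderData object."""
    rows: list  # one dict per measurement, each with a "sample_id" key

    def train_test_validate(self, **kwargs):
        # Thin wrapper: forward all options to the module-level function.
        return train_test_validate(self, **kwargs)


def train_test_validate(data, ratio=(8, 1, 1), random_state=None):
    """Simplified cancer-blind split (no stratification).

    Whole samples are assigned to exactly one of train/test/validate,
    so no sample appears in more than one of the returned datasets.
    """
    # Shuffle the unique sample IDs reproducibly with the given seed.
    sample_ids = sorted({row["sample_id"] for row in data.rows})
    random.Random(random_state).shuffle(sample_ids)

    # Turn the ratio into two cut points over the shuffled IDs.
    total = sum(ratio)
    n = len(sample_ids)
    cut1 = n * ratio[0] // total
    cut2 = n * (ratio[0] + ratio[1]) // total
    groups = (sample_ids[:cut1], sample_ids[cut1:cut2], sample_ids[cut2:])

    # Keep only the rows that belong to each group of samples.
    splits = []
    for ids in groups:
        keep = set(ids)
        splits.append(Dataset([r for r in data.rows if r["sample_id"] in keep]))
    return tuple(splits)


# Ten made-up samples, each measured against two drugs.
rows = [{"sample_id": f"s{i}", "drug": d} for i in range(10) for d in ("a", "b")]
train, test, validate = Dataset(rows).train_test_validate(random_state=42)
print(len(train.rows), len(test.rows), len(validate.rows))
```

Keeping the logic in the module-level function and exposing it through a forwarding method means both call styles stay in sync.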