Train test validate by ymahlich · Pull Request #244 · PNNL-CompBio/coderdata

ymahlich · 2024-11-11T22:58:27Z

PR for the training / testing and validation set generation scripts.

PR adds coderdata.split.splitter which contains:

train_test_validate()
_create_classes()
_filter()

train_test_validate()

This is the main (and "public") function of the submodule. It generates train/test/validation splits and returns 3 individual CoderData objects, one each for train, test and validate. The following arguments modify how the individual splits are generated:

split_type : {'mixed-set', 'drug-blind', 'cancer-blind'} - the splitting strategy; 'cancer-blind', for example, keeps the cell lines used in train out of test and validate
ratio : tuple[int, int, int] - for example ratio=(8,1,1) would result in an 80/10/10 split between train/test/validate
stratify_by : str | None - if None, no stratification is performed. If a drug response metric (e.g. 'fit_auc') is passed, the stratification is based on that metric
random_state - seed for the randomization; set it to obtain a reproducible split
**kwargs - additional keyword arguments that are passed along to _create_classes() and influence how the classes are created.
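
To make the ratio argument concrete, here is a minimal sketch of how an integer ratio maps to split fractions. The helper name ratio_to_fractions is hypothetical and is not part of this PR; it only illustrates the arithmetic behind ratio=(8,1,1).

```python
def ratio_to_fractions(ratio):
    """Convert an integer ratio such as (8, 1, 1) into fractions that sum to 1.

    Hypothetical helper for illustration only, not part of coderdata.
    """
    if len(ratio) != 3 or any(r < 0 for r in ratio) or sum(ratio) == 0:
        raise ValueError("ratio must hold three non-negative ints, not all zero")
    total = sum(ratio)
    return tuple(r / total for r in ratio)


fractions = ratio_to_fractions((8, 1, 1))
print(fractions)  # (0.8, 0.1, 0.1), i.e. an 80/10/10 split
```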

_create_classes()

Internal "private" helper function that creates the classes needed for the stratification. Arguments (besides the dataset) are:

metric - same as stratify_by in train_test_validate if stratify_by is not None
num_classes : int - defines the number of classes that should be generated
quantiles : bool - if set to True, the "bin size" is chosen such that every bin has approximately the same number of data points in the reference dataset. If set to False, the bin size is uniform across the range of the drug response metric values.
thresh : float - can only be used if num_classes == 2 and quantiles == False; sets a threshold between the two classes, which allows for "uneven" bin sizes.
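
The three binning modes described above can be sketched with the standard library alone. This is an illustration of the idea, not the code in this PR: the function name make_classes, the parameter name use_quantiles and the default num_classes=2 are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import quantiles


def make_classes(values, num_classes=2, use_quantiles=True, thresh=None):
    """Assign each value a class label in 0..num_classes-1 (illustrative only)."""
    if thresh is not None:
        # A manual threshold only makes sense for two non-quantile classes.
        if num_classes != 2 or use_quantiles:
            raise ValueError("thresh requires num_classes == 2 and use_quantiles=False")
        edges = [thresh]
    elif use_quantiles:
        # num_classes - 1 cut points, so bins hold roughly equal counts.
        edges = quantiles(values, n=num_classes, method="inclusive")
    else:
        # Evenly spaced cut points across the observed value range.
        lo, hi = min(values), max(values)
        width = (hi - lo) / num_classes
        edges = [lo + width * i for i in range(1, num_classes)]
    # bisect_right returns how many edges are <= v, which is the class index.
    return [bisect_right(edges, v) for v in values]


values = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0]
print(make_classes(values, num_classes=2, use_quantiles=True))   # [0, 0, 0, 1, 1, 1]
print(make_classes(values, num_classes=2, use_quantiles=False))  # [0, 0, 0, 0, 0, 1]
```

With a skewed metric, the uniform bins put almost everything into one class, while the quantile bins stay balanced.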

_filter()

Internal "private" helper function that aids in creating filtered subsets of the reference CoderData object which only contain data points that pertain to the individual train / test & validate sets.
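
As a rough picture of what such filtering involves, the sketch below keeps only the rows that belong to a given set of sample identifiers. It uses plain dictionaries and lists instead of a CoderData object; the function name filter_tables and the column name sample_id are hypothetical and chosen only for this example.

```python
def filter_tables(tables, keep_ids, key="sample_id"):
    """Return a copy of `tables` keeping only rows whose `key` is in `keep_ids`.

    `tables` maps a table name to a list of row dicts (a stand-in for the
    data tables of a dataset object; this layout is an assumption).
    """
    keep = set(keep_ids)
    filtered = {}
    for name, rows in tables.items():
        if rows and key in rows[0]:
            filtered[name] = [row for row in rows if row[key] in keep]
        else:
            # Tables without the key column are passed through unchanged.
            filtered[name] = list(rows)
    return filtered


tables = {
    "experiments": [
        {"sample_id": 1, "drug": "A"},
        {"sample_id": 2, "drug": "B"},
    ],
    "drugs": [{"drug": "A"}, {"drug": "B"}],
}
train_tables = filter_tables(tables, keep_ids={1})
print(train_tables["experiments"])  # [{'sample_id': 1, 'drug': 'A'}]
```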

Example call:

import coderdata as cd
data = cd.DatasetLoader('beataml')
train, test, validate = cd.train_test_validate(
    data,
    split_type='cancer-blind',
    ratio=(8,1,1),
    stratify_by='fit_auc',
    random_state=42,
    num_classes=5,
    )

The call above generates training, testing and validation CoderData objects based on the BeatAML dataset. The splits are cancer-blind, i.e. cell lines on which drugs were tested in train are not present in either test or validate, and vice versa. The split sizes follow a ratio of 8:1:1 for train/test/validate. The split is stratified using fit_auc as the reference metric: 5 classes are generated internally (num_classes=5) using "quantiles" (this does not need to be set in the function call since it is the default behavior; if evenly spaced classes are desired, set quantiles=False). Finally, the seed for the randomization is set to 42 to generate a reproducible split (random_state=42).
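
The "blind" property can be illustrated with a small, self-contained sketch: whole groups (for example samples) are assigned to exactly one split, so no group appears in two splits. This is not the implementation in this PR. The function name blind_split and the row layout are assumptions, the ratio is applied to the number of groups rather than rows, and stratification is left out for brevity.

```python
import random


def blind_split(rows, group_key, ratio=(8, 1, 1), random_state=None):
    """Split rows so that no value of `group_key` appears in more than one split."""
    # Unique group labels, sorted so the shuffle depends only on the seed.
    groups = sorted({row[group_key] for row in rows})
    random.Random(random_state).shuffle(groups)

    total = sum(ratio)
    n_train = round(len(groups) * ratio[0] / total)
    n_test = round(len(groups) * ratio[1] / total)
    group_parts = (
        set(groups[:n_train]),
        set(groups[n_train:n_train + n_test]),
        set(groups[n_train + n_test:]),
    )
    # Every row follows its group into train, test or validate.
    return tuple(
        [row for row in rows if row[group_key] in part] for part in group_parts
    )


# Ten hypothetical samples, each tested with two drugs.
rows = [{"sample": s, "drug": d} for s in range(10) for d in ("A", "B")]
train, test, validate = blind_split(rows, "sample", ratio=(8, 1, 1), random_state=42)
print(len(train), len(test), len(validate))  # 16 2 2
```

Passing the same random_state again yields the same assignment, which is what makes the split reproducible.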

What this PR DOESN'T do:

Implement a class method akin to dataset.train_test_validate() that can be called directly on a loaded CoderData object.

sgosline · 2024-11-12T00:49:02Z

I'm good to merge for now!

ymahlich added 8 commits September 26, 2024 15:26

724451c first draft of reworked train_test_split (works with coderData experiments object v0.1.4)
2504eb8 train_test_validate now returns filtered CoderData objects
3f3ea00 remove 'disjoint-set' from list of possible split strategies
dfa0fa9 preparation for stratification implementation / added ability to define split ratios
51aae96 added pivoting the experiments table (and melting at return) to simplify splitting process
9f360d1 first draft of stratified splitting
567446d added option to use quantiles as an argument to _create_classes() & incorporated that change into train_test_validate(). Also added _cerate_classes' arguments as kwargs to train_test_validate()
b11e807 Updated documentation

ymahlich linked an issue Nov 11, 2024 that may be closed by this pull request

Stratification / balancing of train / test / validate according to drug efficacy #213

Closed

ymahlich requested review from jjacobson95 and sgosline November 11, 2024 22:58

sgosline merged commit c4011a4 into main Nov 12, 2024

ymahlich deleted the train-test-validate branch December 18, 2024 17:45
