Train test validate #244

Merged
sgosline merged 8 commits into main from
train-test-validate
Nov 12, 2024

Conversation

@ymahlich
Collaborator

PR for the training / testing and validation set generation scripts.

PR adds coderdata.split.splitter which contains:

  • train_test_validate()
  • _create_classes()
  • _filter()

train_test_validate()

This is the main (and "public") function of the submodule. It generates train/test/validation splits and returns three individual CoderData objects, one each for train, test and validate. Arguments that modify how the individual splits are generated are:

  • split_type : {'mixed-set', 'drug-blind', 'cancer-blind'} - the type of split; for example, 'cancer-blind' keeps the cell lines used in train out of test and validate
  • ratio : tuple[int, int, int] - for example ratio=(8,1,1) would result in an 80/10/10 split between train/test/validate
  • stratify_by : str | None - if None, no stratification happens. If the passed string is a drug response metric, the stratification is based on that metric
  • random_state - seed for the randomization; set it to get a reproducible split
  • **kwargs - additional keyword arguments that are passed along to _create_classes() and influence how the classes are created.
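To make the ratio argument concrete, here is a minimal sketch of how an integer ratio maps to split fractions. The helper name ratio_to_fractions is hypothetical and only for illustration; it is not part of coderdata and the actual implementation may handle this differently.

```python
def ratio_to_fractions(ratio):
    """Convert an integer ratio such as (8, 1, 1) into fractions that sum to 1.

    Hypothetical helper for illustration only, not part of coderdata.
    """
    total = sum(ratio)
    if total <= 0 or any(r < 0 for r in ratio):
        raise ValueError("ratio entries must be non-negative and sum to > 0")
    return tuple(r / total for r in ratio)


# ratio=(8, 1, 1) corresponds to an 80/10/10 split
fractions = ratio_to_fractions((8, 1, 1))
```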

_create_classes()

Internal "private" helper function that creates the classes needed for the stratification. Arguments (besides the dataset) are:

  • metric - same as stratify_by in train_test_validate if stratify_by is not None
  • num_classes : int - defines the number of classes that should be generated
  • quantiles : bool - if set to True "bin size" is such that every bin has approximately the same number of datapoints in the reference dataset. If set to False then the bin size is chosen to be uniform in the range of the drug response metric values.
  • thresh : float - can only be used if num_classes == 2 and quantiles == False; sets a threshold that splits the values into two bins of "uneven" size.
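The difference between the three binning modes can be sketched in plain Python. This is an assumption-laden illustration, not the code in this PR: bin_edges and assign_classes are hypothetical names, and the actual _create_classes() may compute its bins differently.

```python
from bisect import bisect_right


def bin_edges(values, num_classes, quantiles=True, thresh=None):
    """Return the num_classes - 1 interior cut points (hypothetical helper)."""
    ordered = sorted(values)
    if thresh is not None:
        # A manual threshold only makes sense for two evenly-spaced-mode classes
        if num_classes != 2 or quantiles:
            raise ValueError("thresh requires num_classes == 2 and quantiles=False")
        return [thresh]
    if quantiles:
        # Cut at evenly spaced ranks so each bin holds roughly the same count
        n = len(ordered)
        return [ordered[(i * n) // num_classes] for i in range(1, num_classes)]
    # Cut the value range into bins of equal width
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / num_classes
    return [lo + i * width for i in range(1, num_classes)]


def assign_classes(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up metric values with one outlier
by_quantile = assign_classes(values, bin_edges(values, 2, quantiles=True))
by_width = assign_classes(values, bin_edges(values, 2, quantiles=False))
by_thresh = assign_classes(values, bin_edges(values, 2, quantiles=False, thresh=3))
```

With the outlier in the made-up data, quantile bins stay balanced while equal-width bins put almost everything into the first class, which is the usual reason to prefer quantiles for stratification.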

_filter()

Internal "private" helper function that aids in creating filtered subsets of the reference CoderData object which only contain data points that pertain to the individual train / test & validate sets.
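As a rough sketch of what such filtering amounts to, assuming the data can be viewed as a list of records: keep only the rows whose identifier belongs to a given split. The function filter_records and the field names sample_id, drug_id and fit_auc are hypothetical here; the real _filter() operates on CoderData objects.

```python
def filter_records(records, key, keep_ids):
    """Keep only the records whose `key` value is in `keep_ids` (hypothetical helper)."""
    keep = set(keep_ids)
    return [r for r in records if r[key] in keep]


# Made-up experiment rows for illustration
experiments = [
    {"sample_id": 1, "drug_id": "A", "fit_auc": 0.42},
    {"sample_id": 1, "drug_id": "B", "fit_auc": 0.77},
    {"sample_id": 2, "drug_id": "A", "fit_auc": 0.51},
    {"sample_id": 3, "drug_id": "B", "fit_auc": 0.63},
]

# Rows belonging to a training set made up of samples 1 and 3
train_rows = filter_records(experiments, "sample_id", {1, 3})
```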

Example call:

import coderdata as cd
data = cd.DatasetLoader('beataml')
train, test, validate = cd.train_test_validate(
    data,
    split_type='cancer-blind',
    ratio=(8,1,1),
    stratify_by='fit_auc',
    random_state=42,
    num_classes=5,
    )

The call above generates training, testing and validation CoderData objects from the BeatAML dataset. The splits are cancer-blind, i.e. cell lines used to test drugs in train are not present in either test or validate and vice versa. The split sizes follow an 8:1:1 ratio for train/test/validate. The split is stratified using fit_auc as the reference metric. For the stratification, 5 classes are generated internally (num_classes=5) using "quantiles" (this does not need to be defined in the function call since it is the default behavior; if evenly spaced classes are desired, set quantiles=False). Finally, the seed for the randomization is set to 42 to generate a reproducible split (random_state=42).
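The "blind" property can be illustrated with a simplified sketch that splits on unique identifiers instead of on individual rows. This is not the implementation in this PR: blind_split is a hypothetical name, stratification is deliberately left out, and the identifiers are made up.

```python
import random


def blind_split(ids, ratio=(8, 1, 1), random_state=None):
    """Split unique ids into disjoint train/test/validate sets (hypothetical helper).

    Every row that shares an id lands in the same set, which is what makes
    the split "blind" with respect to that id. No stratification is done here.
    """
    unique = sorted(set(ids))
    rng = random.Random(random_state)  # seeded generator for reproducibility
    rng.shuffle(unique)
    total = sum(ratio)
    n = len(unique)
    n_train = round(n * ratio[0] / total)
    n_test = round(n * ratio[1] / total)
    train = set(unique[:n_train])
    test = set(unique[n_train:n_train + n_test])
    validate = set(unique[n_train + n_test:])  # remainder goes to validate
    return train, test, validate


# 200 made-up measurements spread over 20 sample ids
sample_ids = [i % 20 for i in range(200)]
train_ids, test_ids, validate_ids = blind_split(sample_ids, (8, 1, 1), random_state=42)
```

Because the split happens at the id level, the ratio applies to the number of unique ids, so the number of rows per set only approximates 8:1:1 when ids contribute different numbers of rows.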

What this PR DOESN'T do:

Implement a class method akin to dataset.train_test_validate() that can be called directly on a loaded CoderData object.

@sgosline
Member

I'm good to merge for now!

@sgosline sgosline merged commit c4011a4 into main Nov 12, 2024
@ymahlich ymahlich deleted the train-test-validate branch December 18, 2024 17:45

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Stratification / balancing of train / test / validate according to drug efficacy

2 participants