
Stratification / balancing of train / test / validate according to drug efficacy #213

Open · ymahlich opened this issue Sep 27, 2024 · 3 comments
Labels: enhancement (New feature or request), package

@ymahlich (Collaborator)

Currently, splitting the data into train/test/validate accounts only for approximate final set sizes (e.g. 80/10/10) and, if requested, creates drug-blind splits (drug IDs are unique to either train, test, or validate), cancer-blind splits (analogous to drug-blind, but based on cancer samples), or disjoint sets (currently not implemented, see also issue #212: a combination of drug- and cancer-blind where both drug IDs and cancer samples are unique to a split).
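For illustration, a drug-blind split can be expressed with scikit-learn's GroupShuffleSplit. This is only a minimal sketch with a hypothetical drug_id column, not necessarily how the package implements it:

```python
# Minimal sketch of a drug-blind split via scikit-learn's GroupShuffleSplit.
# The "drug_id" column name is a placeholder; the package's actual
# implementation may differ.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def drug_blind_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    """Split so that every drug_id ends up in exactly one of the two sets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["drug_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```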

No stratification of the individual splits according to the potential prediction class (i.e. the efficacy of the drug) takes place. With random splitting, one would expect the individual splits to reflect the overall class distribution of the full dataset (especially for sufficiently large datasets). Creating splits that reflect "the real world" might, however, not be the best strategy for training. Imagine a situation where 95% of the cancer responses to drugs show no change / very little efficacy. If the training set reflects this, the easiest outcome for a model might be to just predict everything as no/low efficacy.
To combat that, it might be worth adding an optional argument that allows stratification of the training and test sets so that they reflect a more balanced class distribution. To achieve this, we need to define which efficacy measure to base the classes on and where the cut-off should be. Another thing to keep in mind is that balancing the sets will inevitably mean not using all samples, resulting in smaller train/test/validate subsets.

@sgosline (Member) commented Oct 1, 2024

I think we should do this. It sounds like scikit-learn has some functionality to enable this with classes, so we might want to hijack it, though it will require choosing a threshold to classify a drug response and deciding which statistic to classify on (IC50, AUC, etc.). Perhaps these should be additional arguments?
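For reference, scikit-learn's train_test_split supports this directly via its stratify argument. A minimal sketch, assuming a hypothetical "auc" column and a 0.5 threshold:

```python
# Sketch of stratification with scikit-learn's train_test_split; the "auc"
# column name and the 0.5 threshold are placeholder assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df: pd.DataFrame, stat: str = "auc", thresh: float = 0.5):
    labels = (df[stat] > thresh).astype(int)  # binary efficacy class per sample
    return train_test_split(df, test_size=0.2, stratify=labels, random_state=0)
```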

@ymahlich (Collaborator, Author) commented Oct 2, 2024

My suggestion would be:

  • One additional argument (let's call it balance) that is an optional boolean, set to True by default.
  • Additional arguments, like which statistic to use, as possible **kwargs that are not declared in the function signature but are covered in the documentation. Let's assume we define kwargs['stat'] = 'ic50'; then we could check whether 'ic50' is one of the available statistics and, if so, use it to threshold over. That would also give us the flexibility to support additional statistics if we ever add new ones (e.g. imagine 'auc' weren't in there yet and we added it; then we wouldn't have to update the splitting code to account for that), or if someone compiles their own dataset and adds statistics we haven't accounted for.
  • Other **kwargs might be something like num_bins if we want to bin into more than one class/label (see the sketch after this list).
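A hypothetical sketch of what that interface could look like; the names (balance, stat, num_bins) follow the suggestions above and are not final:

```python
# Hypothetical sketch of the proposed interface; none of these names are final,
# and the dataset layout assumed here is a placeholder.
def train_test_validate(data, balance=True, **kwargs):
    """Split data into train/test/validate sets.

    Supported **kwargs (documented here rather than declared in the signature):
        stat (str): response statistic to threshold over, e.g. "ic50".
        num_bins (int): number of classes to bin the statistic into.
    """
    stat = kwargs.get("stat", "ic50")
    available = set(data["response_metric"])  # hypothetical column layout
    if stat not in available:
        raise ValueError(f"unknown response statistic: {stat!r}")
    num_bins = kwargs.get("num_bins", 2)
    # ... derive class labels from `stat` and `num_bins`, then stratify/balance
```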

Concerning scikit-learn: I think we should be able to use StratifiedGroupKFold, but I'll have to double-check / do some testing.
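A rough sketch of how StratifiedGroupKFold could combine stratification with a drug-blind grouping; column names are again placeholders:

```python
# Sketch combining stratification with drug-blind grouping via scikit-learn's
# StratifiedGroupKFold; "auc" and "drug_id" are placeholder column names.
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

def stratified_drug_blind_folds(df: pd.DataFrame, n_splits: int = 10):
    labels = (df["auc"] > 0.5).astype(int)  # class label per sample
    groups = df["drug_id"]                  # keeps each drug in a single fold
    sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in sgkf.split(df, labels, groups):
        yield df.iloc[train_idx], df.iloc[test_idx]
```

With n_splits=10, holding out one fold as test and another as validate would approximate the 80/10/10 split mentioned above.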

@jjacobson95 added the labels enhancement (New feature or request) and package on Oct 2, 2024
@ymahlich (Collaborator, Author) commented Nov 5, 2024

First draft of stratified train/test/validate splitting is implemented as of 9f360d1.

Concerns / outstanding tasks that still need to be addressed before the pull request:

The list is not necessarily in order of importance. We may want to open the PR first and move some of the outstanding tasks into their own issues.

Class generation is currently hard-coded:

  • Class generation (based on the drug response measure) is currently hard-coded as a two-class split at threshold 0.5.
  • The function doing this takes num_classes (essentially num_bins, i.e. it defines how many classes should be generated) and thresh (only possible when num_classes == 2; it can be used to set a threshold such that in two-class approaches the classes aren't equally large).
  • Due to outliers in the drug response measure, it is possible for num_classes > 2 that classes are generated which contain only one sample. This causes the stratification to fail.
  • A possible solution would be to implement a filtering step before splitting that removes outliers. This could be done by defining a range in which samples should be considered (e.g. remove everything that doesn't have a fit_auc within [0, 1]); see the sketch below.
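A minimal sketch of such a filtering-then-binning step, assuming pandas, a hypothetical "fit_auc" column, and the [0, 1] range from the example above:

```python
# Sketch of class generation with an outlier-filtering step; the "fit_auc"
# column name and the [0, 1] range are assumptions from the example above.
import pandas as pd

def make_classes(df: pd.DataFrame, num_classes: int = 2, thresh: float = 0.5,
                 valid_range: tuple = (0.0, 1.0)) -> pd.DataFrame:
    lo, hi = valid_range
    df = df[df["fit_auc"].between(lo, hi)].copy()  # drop outliers before binning
    if num_classes == 2:
        df["cls"] = (df["fit_auc"] > thresh).astype(int)
    else:
        # equal-width bins; the filtering keeps extreme values from creating
        # near-empty classes that would break stratification
        df["cls"] = pd.cut(df["fit_auc"], bins=num_classes, labels=False)
    return df
```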

Stratification vs. Balancing:

TL;DR: Stratification and balancing do two different things. Balancing is currently not implemented!

Stratification of the splits tries to guarantee that the class distribution within the individual sets (train/test/validate) reflects the class distribution of the full dataset. Logically, that also means the class distribution should be roughly equal between the individual sets. This does, however, not mean that the sets are class-balanced, i.e. that each set contains roughly the same number of samples for each class. In other words:

  • Assuming there are two classes in the full dataset and they are distributed 1:2
  • Stratification tries to make sure the classes are distributed 1:2 in train/test/validate as well
  • Balancing would try to make sure that the class distribution in train/test/validate is 1:1, which would be achieved by either over- or under-sampling (see the sketch below).
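For contrast, balancing could look like the following under-sampling sketch, assuming a "cls" label column as in the binning sketch above:

```python
# Sketch of class balancing by under-sampling; not implemented in the draft.
# Assumes a "cls" label column as produced in the binning sketch above.
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "cls", seed: int = 0) -> pd.DataFrame:
    n = df[label_col].value_counts().min()  # size of the rarest class
    # draw n rows per class -> 1:1 distribution, at the cost of dropped samples
    return df.groupby(label_col, group_keys=False).sample(n=n, random_state=seed)
```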

Testing:

  • It might be worth running a statistical test (e.g. a two-sample KS test) to check whether the stratification was successful; see the sketch below.
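A possible check with scipy's two-sample KS test, comparing the response-measure distribution of a split against the full dataset:

```python
# Sketch of a stratification sanity check using the two-sample KS test.
from scipy import stats

def split_is_consistent(full_values, split_values, alpha: float = 0.05) -> bool:
    """True if the split's distribution is indistinguishable from the full data."""
    result = stats.ks_2samp(full_values, split_values)
    return result.pvalue > alpha  # a low p-value would flag a distribution shift
```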

General Documentation:

  • Improve the documentation of the code
  • Update the train_test_validate docstring to include the new arguments
