Stratification / balancing of train / test / validate according to drug efficacy #213
Comments
I think we should do this. It sounds like scikit-learn has some functionality to enable this with classes, so we might want to hijack that, though it will require choosing a threshold to classify a drug response and deciding which statistic to classify on (IC50, AUC, etc.). Perhaps these should be additional arguments?
My suggestion would be:
Concerning scikit-learn: I think we should be able to use StratifiedGroupKFold, but I'll have to double check / do some testing.
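If StratifiedGroupKFold pans out, usage might look roughly like this (a minimal sketch, assuming a response table with hypothetical columns `response_class` for the discretised drug response and `sample_id` for the cancer sample; requires scikit-learn >= 1.0):

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

response = pd.read_csv("response.csv")       # hypothetical input table
y = response["response_class"].to_numpy()    # discretised drug response (the class label)
groups = response["sample_id"].to_numpy()    # cancer sample id, for a cancer-blind split

sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, holdout_idx in sgkf.split(response, y, groups):
    # Each fold keeps the class proportions roughly constant while ensuring
    # that no cancer sample appears in both the train and the holdout part.
    train, holdout = response.iloc[train_idx], response.iloc[holdout_idx]
    break  # a single train/holdout split only needs the first fold
```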
First draft of stratified train/test/validate splitting is implemented as of 9f360d1. Concerns / outstanding tasks that still need to be addressed before the pull request (the list is not necessarily in order of importance; potentially we want to open the PR first and move some of the outstanding tasks into their own issues):

- Class generation is currently "hard coded":
- Stratification vs. balancing: TL;DR: stratification and balancing of the dataset do two different things, and balancing is currently not implemented. Stratification of the splits tries to guarantee that the class distribution within the individual sets (train/test/validate) reflects the class distribution of the full dataset. Logically, that also means the class distribution should be roughly equal between the individual sets. This does however not mean that the sets are class balanced, i.e. that each set contains roughly the same number of samples for each class. In other words: if 95% of the full dataset falls into one class, a stratified split still produces sets in which roughly 95% of the samples fall into that class (see the sketch after this list).
- Testing:
- General documentation:
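To make the stratification vs. balancing distinction above concrete, here is a minimal sketch of what class balancing by per-class undersampling could look like (not what the code currently does; the column name `response_class` is a placeholder):

```python
import pandas as pd

def balance_classes(df: pd.DataFrame, class_col: str = "response_class",
                    random_state: int = 0) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class.

    Unlike stratification, which only preserves the (possibly skewed) class
    distribution of the full dataset within each split, this discards samples
    so that every class ends up with the same number of rows.
    """
    n_min = df[class_col].value_counts().min()
    return (
        df.groupby(class_col, group_keys=False)
          .apply(lambda g: g.sample(n=n_min, random_state=random_state))
    )
```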
Currently, splitting the data into train / test / validate only accounts for approximate final set sizes (i.e. 80/10/10) and, if asked for, creates drug-blind splits (drug ids are unique to either train, test or validate), cancer-blind splits (analogous to drug-blind but based on cancer samples), or disjoint sets (currently not implemented, see also issue #212; a combination of drug- and cancer-blind where both drug ids and cancer samples are unique to a split).
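For illustration only (not necessarily how the splitting is implemented here), a drug-blind 80/10/10 split can be sketched with scikit-learn's GroupShuffleSplit; the column name `drug_id` is a placeholder:

```python
from sklearn.model_selection import GroupShuffleSplit

def drug_blind_split(response, drug_col="drug_id", random_state=0):
    """Split into ~80/10/10 train/test/validate with disjoint drug ids."""
    # First split off roughly 20% of the drugs as a holdout pool.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=random_state)
    train_idx, holdout_idx = next(outer.split(response, groups=response[drug_col]))
    train, holdout = response.iloc[train_idx], response.iloc[holdout_idx]

    # Then divide the holdout pool into test and validate, again drug-blind.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=random_state)
    test_idx, val_idx = next(inner.split(holdout, groups=holdout[drug_col]))
    return train, holdout.iloc[test_idx], holdout.iloc[val_idx]
```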
No stratification of the individual splits according to the potential prediction class (i.e. the efficacy of the drug) takes place. By chance alone, one would expect the individual splits to reflect the overall class distribution of the full dataset (especially for sufficiently large datasets). Creating splits that reflect "the real world" might however not be the best strategy for training. Imagine a situation where 95% of the cancer responses to drugs show no change / very little efficacy. If the training set reflects this, the easiest outcome for a model might be to just predict everything as no / low efficacy.
To combat this, it might be worth adding an optional argument that allows for stratification of the training and test sets, such that they reflect a more balanced class distribution. To achieve this, we would need to define which efficacy measure to base the classes on and what the cut-off would be. Another thing to keep in mind is that balancing the sets will inevitably mean not using all samples, resulting in smaller train / test / validate subsets.
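As a rough sketch of what such an optional argument could look like (all names and the default cut-off are placeholders; assumes a dose-response AUC column where lower values mean higher efficacy):

```python
from sklearn.model_selection import train_test_split

def stratified_split(response, measure="auc", threshold=0.5, random_state=0):
    """Derive a binary efficacy class from `measure` and stratify the 80/10/10 split on it."""
    # Discretise the continuous response measure into two classes.
    classes = (response[measure] <= threshold).astype(int)

    # 80% train, 20% holdout, both reflecting the overall class distribution.
    train, holdout = train_test_split(
        response, test_size=0.2, stratify=classes, random_state=random_state
    )
    # Split the holdout in half into test and validate, again stratified.
    test, validate = train_test_split(
        holdout, test_size=0.5, stratify=classes.loc[holdout.index],
        random_state=random_state
    )
    return train, test, validate
```

Note that this only stratifies; actually balancing the classes (as discussed above) would additionally require dropping samples from the over-represented class.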