Add multilabel stratified train/val/test splits #4
…ratified splits Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
|
@copilot Remove the previous split function (the one that does not work for multilabel data). Also, the input df does not have a single "labels" column; instead, each label has its own column. The dataframe is structured as "chebi_id", "mol", [possibly other columns], "label1", "label2", ... |
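To make the requested layout concrete, here is a minimal sketch of a multilabel-aware split over rows shaped like "chebi_id", "mol", label columns. It is not the code from this PR: `greedy_multilabel_split` is a hypothetical helper, the rows are plain lists rather than a DataFrame, and the greedy "rarest label first" loop is a simplified standard-library stand-in for what a library such as `iterstrat` would do. Only the `label_start_col` name is taken from the PR summary.

```python
import random


def greedy_multilabel_split(rows, label_start_col=2, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return one split index per row (0=train, 1=val, 2=test).

    Hypothetical helper: every column from `label_start_col` onward is
    treated as a boolean label column.
    """
    rng = random.Random(seed)
    labels = [[bool(v) for v in row[label_start_col:]] for row in rows]
    n_labels = len(labels[0]) if labels else 0

    # How many rows, and how many positives per label, each split still wants.
    want_rows = [len(rows) * f for f in fractions]
    totals = [sum(lab[j] for lab in labels) for j in range(n_labels)]
    want_pos = [[totals[j] * f for f in fractions] for j in range(n_labels)]

    assignment = [None] * len(rows)
    unassigned = set(range(len(rows)))

    def place(i, key_label=None):
        # Prefer the split that most needs this label, then the emptiest split;
        # a random number breaks any remaining tie.
        def need(s):
            primary = want_pos[key_label][s] if key_label is not None else 0.0
            return (primary, want_rows[s], rng.random())

        best = max(range(len(fractions)), key=need)
        assignment[i] = best
        unassigned.discard(i)
        want_rows[best] -= 1
        for j in range(n_labels):
            if labels[i][j]:
                want_pos[j][best] -= 1

    while True:
        # Unassigned positive rows per label; labels with none left are dropped.
        remaining = {j: [i for i in unassigned if labels[i][j]] for j in range(n_labels)}
        remaining = {j: idx for j, idx in remaining.items() if idx}
        if not remaining:
            break
        # Handle the rarest label first so its few positives get spread out.
        rarest = min(remaining, key=lambda j: (len(remaining[j]), j))
        for i in sorted(remaining[rarest]):
            place(i, rarest)

    # Rows without any positive label only balance the split sizes.
    for i in sorted(unassigned):
        place(i)
    return assignment


# Toy data in the described layout: chebi_id, mol, label1, label2.
rows = [[f"CHEBI:{i}", f"mol{i}", i % 2 == 0, i % 5 == 0] for i in range(100)]
split = greedy_multilabel_split(rows)
```

Selecting label columns by position rather than by name matches the "possibly other columns" caveat: any extra metadata columns only shift the value passed as `label_start_col`.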
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Done in ab26a27. The old |
|
@copilot Lint is failing. Please reformat with ruff. Tests are failing as well. Please check the test results and fix accordingly |
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Fixed in 39e22ec. The lint failure was caused by pre-existing formatting issues in |
- `scikit-learn` and `iterative-stratification` to `pyproject.toml` dependencies
- `create_multilabel_splits` with `label_start_col: int = 2` (all columns from that index are bool label columns)
- `create_splits` function (and helpers)
- `create_multilabel_splits` from `chebi_utils/__init__.py`
- `chebi_utils/obo_extractor.py` and `tests/test_obo_extractor.py` (pre-existing lint failures in CI)

Original prompt
This section details the original issue you should resolve
<issue_title>Stratified splits for dataset</issue_title>
<issue_description>For the dataset created with the dataset builder, I need stratified splits. There is already a splitter implementation, but it is not sufficient for our multilabel dataset.
Use the `iterstrat` or `sklearn` packages to build a stratified split. Here is an example for a similar library: