Data handling needs to be restructured #10

Closed
sfluegel05 opened this issue Mar 5, 2024 · 3 comments · Fixed by #29
Labels: good first issue (Good for newcomers)

Comments

sfluegel05 commented Mar 5, 2024

Status quo

  • Data preprocessing is split into "raw" and "processed" according to the directory structure predetermined by Lightning
  • "raw" contains the chebi.obo, classes.txt, train/test/val splits (unprocessed SMILES, labels)
  • "processed" contains encoded versions of train/test/val splits (SMILES processed)

Goal

  • Have 3 preprocessing stages:
    • first stage only contains chebi.obo (raw)
    • second stage contains data without splits, but with labels attached (processed 1)
    • third stage contains encoded data (again without splits) (processed 2)
  • Splits are created "on the fly"
    • Test that they can be reproduced with some seed (compare hashes)
  • The file structure should represent this:
    • Current file paths are data/ChEBIX/chebi_version/raw and data/ChEBIX/chebi_version/processed/encoding
    • Instead, only take the parameters that are important for each step:
    • raw: data/chebi_version/raw
    • processed 1: data/chebi_version/ChEBIX/processed
    • processed 2: data/chebi_version/ChEBIX/processed/encoding
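The three-stage layout above could be expressed by a small path helper. This is only a sketch: `stage_path` and the concrete folder names are illustrative, not the actual chebai implementation.

```python
from pathlib import Path
from typing import Optional

def stage_path(base: str, chebi_version: int,
               dataset: Optional[str] = None,
               encoding: Optional[str] = None) -> Path:
    """Build the directory for each of the three preprocessing stages.

    Each stage only depends on the parameters that matter for it:
    the raw stage only on the ChEBI version, the processed stages
    additionally on the dataset (ChEBIX) and the encoding.
    """
    path = Path(base) / f"chebi_v{chebi_version}"
    if dataset is None:
        return path / "raw"                  # stage 1: chebi.obo only
    path = path / dataset / "processed"      # stage 2: labelled, unsplit data
    if encoding is not None:
        path = path / encoding               # stage 3: encoded data
    return path

assert stage_path("data", 231).as_posix() == "data/chebi_v231/raw"
assert stage_path("data", 231, "ChEBI50").as_posix() == "data/chebi_v231/ChEBI50/processed"
```

The point of the helper is that adding a new dataset or encoding never touches the raw stage, so the (slow) chebi.obo download is shared across all derived datasets.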

Things to keep in mind (for later implementations)

  • How can this work with cross-validation? -> it should be possible to get the same test set with different train/val splits
  • How to handle different versions of ChEBI and combinations of different training/test sets? -> currently, this is handled via additional files; it should also be possible dynamically
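The on-the-fly splits and the cross-validation requirement can be sketched with two seeds: a fixed seed that determines the test set, and a per-fold seed that only affects the train/val division. Function and parameter names here are illustrative assumptions, not the actual chebai API.

```python
import hashlib
import random

def split_ids(ids, test_seed=0, fold_seed=0, test_frac=0.1, val_frac=0.1):
    """Split ids into (train, val, test) on the fly.

    The test set is drawn using only the fixed test_seed, so it is
    identical across cross-validation folds; fold_seed only affects
    how the remaining ids are divided into train and val.
    """
    ids = sorted(ids)
    rng = random.Random(test_seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    rng = random.Random(fold_seed)
    rng.shuffle(rest)
    n_val = int(len(ids) * val_frac)
    return rest[n_val:], rest[:n_val], test

def split_hash(split):
    """Hash a split so reproducibility can be checked without storing files."""
    return hashlib.sha256(",".join(map(str, sorted(split))).encode()).hexdigest()

# Same test_seed, different fold_seed: the test set (and its hash) is stable,
# while the train/val division varies between folds.
_, _, test_a = split_ids(range(100), test_seed=42, fold_seed=1)
_, _, test_b = split_ids(range(100), test_seed=42, fold_seed=2)
assert split_hash(test_a) == split_hash(test_b)
```

Comparing hashes instead of full id lists is what makes the "reproduce with some seed" test cheap: a split can be verified without writing any split files to disk.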
@aditya0by0 (Collaborator) commented:
Hi @sfluegel05, I have a doubt regarding the issue: do we have to implement the above restructuring only for the ChEBI dataset, or for all the other datasets too?

@sfluegel05 (Collaborator, Author) commented:
This is only for the ChEBI datasets. The other datasets have their own structure. That should be adjusted as well at some point, but that would be a different issue.

sfluegel05 commented May 30, 2024

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

  • if chebi_version_train is set, create and process two datasets (one for the chebi_version, one for chebi_version_train)
  • when creating splits, build the training and validation splits based on the chebi_version_train data, but using the test set from chebi_version
  • build the test set as an adaptation of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
  • test the implementation: ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels); the latter should also pass the test for no overlaps

Most of the functionality for this is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.
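The invariants from the task list can be captured in a small check that works on plain id and label collections. This is a sketch: in a real test these collections would be obtained from the two ChEBIOver50 instances, whose interface is not shown here.

```python
def check_version_train_splits(test_full, test_train_version,
                               train_train_version,
                               labels_full, labels_train_version):
    """Invariants for a chebi_version / chebi_version_train dataset pair."""
    # Same molecules in both test sets ...
    assert set(test_full) == set(test_train_version)
    # ... but only the labels that also exist in the older version's classes.txt.
    assert set(labels_train_version) <= set(labels_full)
    # No overlap between the training split and the shared test set.
    assert not set(train_train_version) & set(test_train_version)

# Toy data standing in for ChEBI ids and class labels:
check_version_train_splits(
    test_full=[10, 11, 12],
    test_train_version=[12, 11, 10],
    train_train_version=[1, 2, 3],
    labels_full=["A", "B", "C"],
    labels_train_version=["A", "B"],
)
```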

@aditya0by0 aditya0by0 linked a pull request Jun 11, 2024 that will close this issue