Data handling needs to be restructured #10

Closed
sfluegel05 opened this issue Mar 5, 2024 · 3 comments · Fixed by #29
Labels: good first issue (Good for newcomers)

Comments

sfluegel05 commented Mar 5, 2024

Status quo

  • Data preprocessing is split into "raw" and "processed" according to the directory structure predetermined by Lightning
  • "raw" contains the chebi.obo, classes.txt, train/test/val splits (unprocessed SMILES, labels)
  • "processed" contains encoded versions of train/test/val splits (SMILES processed)

Goal

  • Have 3 preprocessing stages:
    • first stage only contains chebi.obo (raw)
    • second stage contains data without splits, but with labels attached (processed 1)
    • third stage contains encoded data (again without splits) (processed 2)
  • Splits are created "on the fly"
    • Test that they can be reproduced with some seed (compare hashes)
  • The file structure should represent this:
    • Current file paths are data/ChEBIX/chebi_version/raw and data/ChEBIX/chebi_version/processed/encoding
    • Instead, only take the parameters that are important for each step:
    • raw: data/chebi_version/raw
    • processed 1: data/chebi_version/ChEBIX/processed
    • processed 2: data/chebi_version/ChEBIX/processed/encoding
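The three-stage layout above could be expressed by a small path helper. This is only a sketch: `stage_path` and the concrete folder names are illustrative, not the actual chebai implementation.

```python
from pathlib import Path
from typing import Optional

def stage_path(base: str, chebi_version: int,
               dataset: Optional[str] = None,
               encoding: Optional[str] = None) -> Path:
    """Build the directory for each of the three preprocessing stages.

    Each stage only depends on the parameters that matter for it:
    the raw stage only on the ChEBI version, the processed stages
    additionally on the dataset (ChEBIX) and the encoding.
    """
    path = Path(base) / f"chebi_v{chebi_version}"
    if dataset is None:
        return path / "raw"                  # stage 1: chebi.obo only
    path = path / dataset / "processed"      # stage 2: labelled, unsplit data
    if encoding is not None:
        path = path / encoding               # stage 3: encoded data
    return path

assert stage_path("data", 231).as_posix() == "data/chebi_v231/raw"
assert stage_path("data", 231, "ChEBI50").as_posix() == "data/chebi_v231/ChEBI50/processed"
```

The point of the helper is that adding a new dataset or encoding never touches the raw stage, so the (slow) chebi.obo download is shared across all derived datasets.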

Things to keep in mind (for later implementations)

  • How can this work with cross-validation? -> it should be possible to get the same test set with different train/val splits
  • How to handle different versions of ChEBI and combinations of different training/test sets? -> currently, this is handled via additional files; it should also be possible dynamically
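The on-the-fly splits and the cross-validation requirement can be sketched with two seeds: a fixed seed that determines the test set, and a per-fold seed that only affects the train/val division. Function and parameter names here are illustrative assumptions, not the actual chebai API.

```python
import hashlib
import random

def split_ids(ids, test_seed=0, fold_seed=0, test_frac=0.1, val_frac=0.1):
    """Split ids into (train, val, test) on the fly.

    The test set is drawn using only the fixed test_seed, so it is
    identical across cross-validation folds; fold_seed only affects
    how the remaining ids are divided into train and val.
    """
    ids = sorted(ids)
    rng = random.Random(test_seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]
    rng = random.Random(fold_seed)
    rng.shuffle(rest)
    n_val = int(len(ids) * val_frac)
    return rest[n_val:], rest[:n_val], test

def split_hash(split):
    """Hash a split so reproducibility can be checked without storing files."""
    return hashlib.sha256(",".join(map(str, sorted(split))).encode()).hexdigest()

# Same test_seed, different fold_seed: the test set (and its hash) is stable,
# while the train/val division varies between folds.
_, _, test_a = split_ids(range(100), test_seed=42, fold_seed=1)
_, _, test_b = split_ids(range(100), test_seed=42, fold_seed=2)
assert split_hash(test_a) == split_hash(test_b)
```

Comparing hashes instead of full id lists is what makes the "reproduce with some seed" test cheap: a split can be verified without writing any split files to disk.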
@aditya0by0 (Collaborator) commented:
Hi @sfluegel05, I have a doubt regarding the issue: do we have to implement the above restructuring only for the ChEBI dataset, or for all the other datasets too?

@sfluegel05 (Collaborator, Author) commented:
This is only for the ChEBI datasets. The other datasets have their own structure. That should be adjusted as well at some point, but that would be a different issue.

sfluegel05 commented May 30, 2024

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

  • if chebi_version_train is set, create and process two datasets (one for the chebi_version, one for chebi_version_train)
  • when creating splits, build the training and validation splits based on the chebi_version_train data, but using the test set from chebi_version
  • build the test set as an adaptation of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
  • test the implementation: ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels); the latter should also pass the test for no overlaps

Most of the functionality for this is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.
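The invariants from the task list can be captured in a small check that works on plain id and label collections. This is a sketch: in a real test these collections would be obtained from the two ChEBIOver50 instances, whose interface is not shown here.

```python
def check_version_train_splits(test_full, test_train_version,
                               train_train_version,
                               labels_full, labels_train_version):
    """Invariants for a chebi_version / chebi_version_train dataset pair."""
    # Same molecules in both test sets ...
    assert set(test_full) == set(test_train_version)
    # ... but only the labels that also exist in the older version's classes.txt.
    assert set(labels_train_version) <= set(labels_full)
    # No overlap between the training split and the shared test set.
    assert not set(train_train_version) & set(test_train_version)

# Toy data standing in for ChEBI ids and class labels:
check_version_train_splits(
    test_full=[10, 11, 12],
    test_train_version=[12, 11, 10],
    train_train_version=[1, 2, 3],
    labels_full=["A", "B", "C"],
    labels_train_version=["A", "B"],
)
```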

@aditya0by0 aditya0by0 linked a pull request Jun 11, 2024 that will close this issue