Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data handling restructure #29

Merged
merged 19 commits into from
Jul 5, 2024
Merged

Data handling restructure #29

merged 19 commits into from
Jul 5, 2024

Conversation

aditya0by0
Copy link
Collaborator

@aditya0by0 aditya0by0 commented May 27, 2024

Goal

  • Have 3 preprocessing stages:
    • first stages only contains chebi.obo (raw)
    • second stage contains data without split, but with labels attached (processed 1)
    • third level contains encoded data (again without split) (processed 2)
  • Splits are created "on the fly"
    • Test that they can be reproduced with some seed (compare hashes)
  • The file structure should represent this:
    • Current file paths are data/ChEBIX/chebi_version/raw / data/ChEBIX/chebi_version/processed/encoding
    • Instead, only take the parameters that are important for each step:
    • raw: data/chebi_version/raw
    • processed 1: data/chebi_version/ChEBIX/processed
    • processed 2: data/chebi_version/ChEBIX/processed/encoding

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

  • if chebi_version_train is set, create and process two datasets (one for the chebi_version, one for chebi_version_train)
  • when creating splits, build the training and validation splits based on the chebi_version_train data, but using the test set from chebi_version
  • build the test set as an adaption of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
  • test the implementation: classes ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels), the latter should also pass the test for no overlaps

Most of the functionality is already implemented for that, it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.

Copy link
Collaborator

@sfluegel05 sfluegel05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation. To me, this is looking good and working as intended.

I only have two change requests:

  • Could you also update the tests for chebi (testChebiData)? I would assume sume of them don't work anymore since there are no test.pt files anymore.
  • For the seed: The cleanest way would be to embed it into the lightning config system. You wrote a custom parsing function. You could have saved yourself the effort by adding the seed as an init parameter to the class. Then, a user can add a seed to their regular data config file. And if not, some default value is used instead of raising an error. Could you do that instead?

@sfluegel05
Copy link
Collaborator

Some additional changes we talked about:

  • tests:
    • instantiate data class directly instead of loading it from a config file,
    • automatically create data if it not already present
  • store created datasplits in variables to avoid recreating them
  • overload load_processed_data instead of dataloader to avoid code duplication
  • check if the evaluation in tutorials/eval_model_basic.ipnb still works, update if necessary

@aditya0by0
Copy link
Collaborator Author

aditya0by0 commented Jun 11, 2024

Hi @sfluegel05, Please review the PR.
I request for bit more time to look into tutorials/eval_model_basic.ipynb due to some errors.

@aditya0by0 aditya0by0 added the enhancement New feature or request label Jun 11, 2024
This was linked to issues Jun 11, 2024
Updates
- Evaluation notebook
- classification.py
- utils.py
- pre-commit + some suggestions
@aditya0by0
Copy link
Collaborator Author

Hi @sfluegel05, Please review the PR. I request for bit more time to look into tutorials/eval_model_basic.ipynb due to some errors.

Hi @sfluegel05, I have made the changes for tutorials/eval_model_basic.ipynb and to other relevant .py files related to it.
Please review the PR.

Copy link
Collaborator

@sfluegel05 sfluegel05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining change requests for this PR from my side:

  • for a class with chebi_version != train_version, instead of creating additional v_{train_version} files, refer to the same files that would be used by a class that has the train_version as its main version (e.g. instead of creating two files chebi.obo and chebi_v200.obo in chebi_v231/raw, create chebi_v200/raw/chebi.obo and chebi_v231/raw/chebi.obo
  • remove commented out cells from notebook

"test": self.dynamic_df_test,
}

def load_processed_data(self, kind: str = None) -> List:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, you should add a filename parameter that allows the user to load an arbitrary file (independent of splits).

- removed list comprehension from data split logic
- used dataframe operations instead as they are faster
- remove looping for msss.split as no need for it, used `next` instead
@aditya0by0
Copy link
Collaborator Author

Also, I have updated the wiki Data-Management/Data folder structure for the new folder structure according to this PR.
Please review.

@aditya0by0 aditya0by0 removed a link to an issue Jun 23, 2024
2 tasks
@aditya0by0
Copy link
Collaborator Author

Also, can you please confirm whether merging this PR will lead to closure of the below issue too

Copy link
Collaborator

@sfluegel05 sfluegel05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Thanks for implementing.
Regarding issue #12, I'd say this solves this first part (reproducing data splits), but not the second one (reproducing runs).

@aditya0by0
Copy link
Collaborator Author

Hi @sfluegel05, as you have approved the changes. Can you please merge the PR if there are no further actions/changes left for this issue.

@sfluegel05
Copy link
Collaborator

I will merge this PR, but before that, I have another task. Since this change will require other users of this tool to change their datasets, it would be useful to have a migration script. I created an issue for that: #34

@aditya0by0 aditya0by0 mentioned this pull request Jun 26, 2024
@aditya0by0 aditya0by0 self-assigned this Jun 26, 2024
@aditya0by0 aditya0by0 mentioned this pull request Jul 3, 2024
@sfluegel05 sfluegel05 merged commit f747257 into dev Jul 5, 2024
2 checks passed
@sfluegel05 sfluegel05 deleted the feature/testing_framework branch July 5, 2024 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data handling needs to be restructured
2 participants