Data handling restructure #29

aditya0by0 · 2024-05-27T20:53:45Z

PR for Issue Data handling needs to be restructured #10

Goal

Have 3 preprocessing stages:
- the first stage only contains chebi.obo (raw)
- the second stage contains data without split, but with labels attached (processed 1)
- the third stage contains encoded data (again without split) (processed 2)
Splits are created "on the fly"
- Test that they can be reproduced with some seed (compare hashes)
The file structure should represent this:
- Current file paths are data/ChEBIX/chebi_version/raw and data/ChEBIX/chebi_version/processed/encoding
- Instead, only take the parameters that are important for each step:
  - raw: data/chebi_version/raw
  - processed 1: data/chebi_version/ChEBIX/processed
  - processed 2: data/chebi_version/ChEBIX/processed/encoding
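As a rough illustration of the proposed layout, the three stage directories could be derived from only the parameters each stage depends on. This is a minimal sketch: the function name, its parameters, and the example dataset and encoding names are hypothetical, and the `chebi_v<version>` folder naming is taken from the review comments further down.

```python
from pathlib import Path


def stage_dirs(base_dir, chebi_version, dataset_name, encoding):
    """Return the directory of each preprocessing stage.

    All names here are hypothetical; only the directory layout follows the PR text.
    """
    version_dir = Path(base_dir) / f"chebi_v{chebi_version}"
    processed = version_dir / dataset_name / "processed"
    return {
        "raw": version_dir / "raw",            # depends on the version only
        "processed_1": processed,              # version + dataset (labels attached)
        "processed_2": processed / encoding,   # version + dataset + encoding
    }


dirs = stage_dirs("data", 231, "ChEBI50", "enc")
for stage, path in dirs.items():
    print(stage, path.as_posix())
```

The raw directory is shared by every dataset built from the same ChEBI version, which is the point of moving the version to the top of the hierarchy.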

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).

Tasks

- if chebi_version_train is set, create and process two datasets (one for chebi_version, one for chebi_version_train)
- when creating splits, build the training and validation splits based on the chebi_version_train data, but use the test set from chebi_version
- build the test set as an adaptation of the chebi_version test set that has all the same entries, but only the labels that also appear in the classes.txt of chebi_version_train
- test the implementation: classes ChEBIOver50(chebi_version=231) and ChEBIOver50(chebi_version=231, chebi_version_train=200) should have the same ids in their test sets (but different numbers of labels); the latter should also pass the test for no overlaps
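The label adaptation and the two checks from the last task can be sketched roughly as follows. The entry shape (a dict with `id` and `labels`) and both function names are assumptions made for illustration, not the project's actual data format.

```python
def restrict_test_labels(test_entries, train_classes):
    """Keep every test entry, but only the labels known to the train version.

    test_entries: list of {"id": ..., "labels": {class_id: bool}} (hypothetical shape)
    train_classes: class ids as read from the classes.txt of chebi_version_train
    """
    allowed = set(train_classes)
    return [
        {
            "id": entry["id"],
            "labels": {c: v for c, v in entry["labels"].items() if c in allowed},
        }
        for entry in test_entries
    ]


def has_no_overlap(train_entries, test_entries):
    """True if no id appears in both the training and the test split."""
    train_ids = {e["id"] for e in train_entries}
    return train_ids.isdisjoint(e["id"] for e in test_entries)


test_v231 = [
    {"id": 1, "labels": {"A": True, "B": False, "C": True}},
    {"id": 2, "labels": {"A": False, "B": True, "C": False}},
]
# Class "C" is assumed to be missing from the older version's classes.txt
adapted = restrict_test_labels(test_v231, train_classes=["A", "B"])
```

The adapted test set keeps the ids of the original one and only drops label columns, so both models are evaluated on the same molecules.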

Most of the functionality is already implemented for that, it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.

sfluegel05

Thanks for the implementation. To me, this is looking good and working as intended.

I only have two change requests:

- Could you also update the tests for chebi (testChebiData)? I would assume some of them don't work anymore since there are no test.pt files anymore.
- For the seed: The cleanest way would be to embed it into the lightning config system. You wrote a custom parsing function. You could have saved yourself the effort by adding the seed as an init parameter to the class. Then, a user can add a seed to their regular data config file. And if not, some default value is used instead of raising an error. Could you do that instead?
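A minimal sketch of the second request, assuming a plain class rather than the project's actual data module: the seed is an ordinary init parameter with a default, the split is derived from it deterministically, and a hash of the split allows two runs to be compared. The class name, the default value 42, and `split_hash` are hypothetical.

```python
import hashlib
import random


class SplitSketch:
    """Hypothetical stand-in for a data class; not the project's actual class."""

    def __init__(self, ids, seed=42, test_fraction=0.2):
        # seed has a default, so a data config file may simply omit it
        self.ids = list(ids)
        self.seed = seed
        self.test_fraction = test_fraction

    def split(self):
        # A local Random instance keeps the split independent of global state
        rng = random.Random(self.seed)
        shuffled = sorted(self.ids)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * self.test_fraction)
        return {"test": shuffled[:n_test], "train": shuffled[n_test:]}


def split_hash(split):
    """Hash a split so that two runs can be compared cheaply."""
    text = repr(sorted((name, tuple(ids)) for name, ids in split.items()))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = SplitSketch(range(100)).split()
second = SplitSketch(range(100)).split()
same_split = split_hash(first) == split_hash(second)
```

Because the seed is a constructor argument, a config system that maps config keys to init parameters can pick it up without any custom parsing.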

sfluegel05 · 2024-05-30T08:35:44Z

Some additional changes we talked about:

- tests:
  - instantiate the data class directly instead of loading it from a config file
  - automatically create data if it is not already present
- store created data splits in variables to avoid recreating them
- overload load_processed_data instead of dataloader to avoid code duplication
- check if the evaluation in tutorials/eval_model_basic.ipynb still works, update if necessary
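One possible way to store the created splits, sketched with the standard library only: the splits are built on first access and reused afterwards. The class, its attributes, and the trivial split rule are hypothetical; `build_count` exists only to make the caching visible.

```python
from functools import cached_property


class LazySplits:
    """Hypothetical sketch: splits are computed on first access, then reused."""

    def __init__(self, ids):
        self.ids = list(ids)
        self.build_count = 0  # only here to make the caching observable

    def _build_splits(self):
        self.build_count += 1
        cut = len(self.ids) // 5
        return {"test": self.ids[:cut], "train": self.ids[cut:]}

    @cached_property
    def splits(self):
        # Runs once per instance; later accesses return the stored dict
        return self._build_splits()

    @property
    def test_split(self):
        return self.splits["test"]

    @property
    def train_split(self):
        return self.splits["train"]


data = LazySplits(range(10))
test_ids = data.test_split
train_ids = data.train_split
```

Both properties read from the same cached dict, so requesting the train and test splits does not trigger a second split computation.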

aditya0by0 · 2024-06-11T09:01:32Z

Hi @sfluegel05, please review the PR.
I request a bit more time to look into tutorials/eval_model_basic.ipynb due to some errors.

Updates
- Evaluation notebook
- classification.py
- utils.py
- pre-commit + some suggestions

aditya0by0 · 2024-06-12T14:22:46Z

Hi @sfluegel05, I have made the changes to tutorials/eval_model_basic.ipynb and to the .py files related to it.
Please review the PR.

sfluegel05

The remaining change requests for this PR from my side:

- for a class with chebi_version != train_version, instead of creating additional v_{train_version} files, refer to the same files that would be used by a class that has the train_version as its main version (e.g. instead of creating two files chebi.obo and chebi_v200.obo in chebi_v231/raw, create chebi_v200/raw/chebi.obo and chebi_v231/raw/chebi.obo)
- remove commented-out cells from the notebook
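The requested path resolution could look roughly like this. Both function names and the `data` base directory are illustrative assumptions; only the `chebi_v<version>/raw/chebi.obo` layout comes from the comment above.

```python
from pathlib import Path


def raw_obo_path(base_dir, version):
    """Path of the raw ontology file for one ChEBI version."""
    return Path(base_dir) / f"chebi_v{version}" / "raw" / "chebi.obo"


def raw_paths(base_dir, chebi_version, chebi_version_train=None):
    """Raw files needed by a data class (function names are hypothetical)."""
    paths = {"main": raw_obo_path(base_dir, chebi_version)}
    if chebi_version_train is not None and chebi_version_train != chebi_version:
        # Same location a class with chebi_version=chebi_version_train would use
        paths["train"] = raw_obo_path(base_dir, chebi_version_train)
    return paths


paths = raw_paths("data", 231, chebi_version_train=200)
```

With this scheme a class created later with version 200 as its main version finds the raw file already in place instead of downloading a second copy.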

sfluegel05 · 2024-06-13T15:36:37Z

chebai/preprocessing/datasets/chebi.py

+            "test": self.dynamic_df_test,
+        }
+
+    def load_processed_data(self, kind: str = None) -> List:


Here, you should add a filename parameter that allows the user to load an arbitrary file (independent of splits).
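A sketch of what such a signature might look like, under stated assumptions: the class below is a hypothetical stand-in, it uses pickle so that the example runs with the standard library only (the real class presumably loads .pt files), and the file name `extra.pkl` is made up.

```python
import os
import pickle
import tempfile
from typing import List, Optional


class ProcessedDataSketch:
    """Hypothetical stand-in; not the class from chebi.py."""

    def __init__(self, processed_dir: str, splits: dict):
        self.processed_dir = processed_dir
        self.splits = splits  # e.g. {"train": [...], "validation": [...], "test": [...]}

    def load_processed_data(
        self, kind: Optional[str] = None, filename: Optional[str] = None
    ) -> List:
        if kind is None and filename is None:
            raise ValueError("Either kind or filename is required")
        if filename is not None:
            # Arbitrary file in the processed directory, independent of splits
            with open(os.path.join(self.processed_dir, filename), "rb") as f:
                return pickle.load(f)
        return self.splits[kind]


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "extra.pkl"), "wb") as f:
        pickle.dump([{"ident": 7}], f)
    data = ProcessedDataSketch(tmp, {"test": [{"ident": 1}]})
    from_split = data.load_processed_data(kind="test")
    from_file = data.load_processed_data(filename="extra.pkl")
```

In this sketch an explicit filename takes precedence over kind; whether that is the right precedence for the real method is a design choice for the PR author.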

- removed list comprehension from data split logic
- used dataframe operations instead as they are faster
- removed looping for msss.split as there is no need for it, used `next` instead
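The `next` change can be illustrated with a toy splitter. `msss` presumably refers to a multilabel stratified shuffle-split object whose split() method is a generator; that is an assumption, and the class below is a hypothetical stand-in with the same generator shape, not that library.

```python
import random


class OneShotSplitter:
    """Hypothetical stand-in whose split() is a generator of index pairs."""

    def __init__(self, n_splits=1, test_size=0.25, random_state=0):
        self.n_splits = n_splits
        self.test_size = test_size
        self.random_state = random_state

    def split(self, X):
        rng = random.Random(self.random_state)
        for _ in range(self.n_splits):
            indices = list(range(len(X)))
            rng.shuffle(indices)
            n_test = int(len(X) * self.test_size)
            yield indices[n_test:], indices[:n_test]


items = ["a", "b", "c", "d", "e", "f", "g", "h"]
splitter = OneShotSplitter(n_splits=1, random_state=0)

# With a single split, a for-loop over split() runs exactly once ...
for train_idx_loop, test_idx_loop in splitter.split(items):
    pass

# ... so next() on the generator yields the same pair directly
train_idx, test_idx = next(splitter.split(items))
```

Using `next` makes it explicit that only one split is expected and avoids leaving loop variables behind.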

aditya0by0 · 2024-06-19T21:44:53Z

Also, I have updated the wiki page Data-Management/Data folder structure to match the new folder structure from this PR.
Please review.

aditya0by0 · 2024-06-23T15:14:11Z

Also, can you please confirm whether merging this PR will also close the issue below?

Runs should be reproducable #12

sfluegel05

This looks good. Thanks for implementing.
Regarding issue #12, I'd say this solves the first part (reproducing data splits), but not the second one (reproducing runs).

aditya0by0 · 2024-06-24T16:35:02Z

Hi @sfluegel05, as you have approved the changes, can you please merge the PR if there are no further actions/changes left for this issue?

sfluegel05 · 2024-06-26T12:59:13Z

I will merge this PR, but before that, I have another task. Since this change will require other users of this tool to change their datasets, it would be useful to have a migration script. I created an issue for that: #34

aditya0by0 and others added 2 commits May 27, 2024 22:51

Data handling restructure

89ffbd3

Merge branch 'dev' into feature/testing_framework

f07f312

aditya0by0 requested a review from sfluegel05 May 27, 2024 21:19

aditya0by0 assigned sfluegel05 May 27, 2024

sfluegel05 requested changes May 28, 2024

aditya0by0 added 5 commits June 5, 2024 11:37

Update chebi tests for dynamic splits

bd6382b

Dynamic split for chebi_version_train + changes

d8abee2

Update dynamic split tests

91aa484

Update chebi + dynamic test

22f882c

Update setup.py

dde4196

aditya0by0 requested a review from sfluegel05 June 11, 2024 09:01

aditya0by0 added the enhancement New feature or request label Jun 11, 2024

This was linked to issues Jun 11, 2024

Data handling needs to be restructured #10

Closed

Runs should be reproducable #12

Closed

Update Evaluation notebook + rel. code

aecb7e6

sfluegel added 4 commits June 13, 2024 17:22

set split variables when required instead of during setup

98342af

remove unnecessary class instantiation

89cbdb6

Merge branch 'refs/heads/dev' into feature/testing_framework

8b22601

add isort to pre-commit, reformat with isort

b2439f8

sfluegel05 reviewed Jun 13, 2024

sfluegel05 mentioned this pull request Jun 13, 2024

Data processing performance needs to be improved #32

Closed

aditya0by0 added 4 commits June 13, 2024 20:22

Update .gitignore

ec6254d

remove commented out cells - eval notebook

c1b6b0d

add filename parameter to load_processed_data

667b079

Updated chebi.py for train_version restructure

8c9dfe1

aditya0by0 linked an issue Jun 19, 2024 that may be closed by this pull request

Data processing performance needs to be improved #32

Closed

aditya0by0 removed a link to an issue Jun 19, 2024

Data processing performance needs to be improved #32

Closed

minor changes in data split code

cd03023

aditya0by0 requested a review from sfluegel05 June 19, 2024 19:02

Merge branch 'dev' into feature/testing_framework

0584345

aditya0by0 removed a link to an issue Jun 23, 2024

Runs should be reproducable #12

Closed

2 tasks

sfluegel05 approved these changes Jun 24, 2024

aditya0by0 requested a review from sfluegel05 June 24, 2024 18:11

aditya0by0 mentioned this pull request Jun 26, 2024

Code documentation #35

Merged

aditya0by0 self-assigned this Jun 26, 2024

fix: test for consistency across runs did validate the same run twice

f747257

aditya0by0 mentioned this pull request Jul 3, 2024

Data migration #37

Merged

sfluegel05 merged commit f747257 into dev Jul 5, 2024
2 checks passed

sfluegel05 deleted the feature/testing_framework branch July 5, 2024 11:49
