Data handling restructure #29
Thanks for the implementation. To me, this is looking good and working as intended.
I only have two change requests:
- Could you also update the tests for chebi (`testChebiData`)? I would assume some of them don't work anymore since there are no `test.pt` files anymore.
- For the seed: The cleanest way would be to embed it into the lightning config system. You wrote a custom parsing function. You could have saved yourself the effort by adding the seed as an init parameter to the class. Then, a user can add a seed to their regular data config file. And if not, some default value is used instead of raising an error. Could you do that instead?
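A minimal sketch of the seed request, assuming a hypothetical class name `DynamicSplitDataModule`, a hypothetical `test_fraction` parameter, and an illustrative default of 42 (none of these are taken from the PR). When the class is instantiated through the Lightning config system, constructor arguments can be set under the class's init arguments in the data config file, so a plain `seed` parameter with a default should be enough:

```python
import random
from typing import List, Optional, Tuple


class DynamicSplitDataModule:
    """Hypothetical stand-in for the data class discussed above."""

    def __init__(self, seed: Optional[int] = None, test_fraction: float = 0.2):
        # Fall back to a fixed default instead of raising an error
        self.seed = 42 if seed is None else seed
        self.test_fraction = test_fraction

    def split_ids(self, ids: List[int]) -> Tuple[List[int], List[int]]:
        """Shuffle the ids reproducibly and cut off a test portion."""
        rng = random.Random(self.seed)
        shuffled = list(ids)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * self.test_fraction)
        return shuffled[n_test:], shuffled[:n_test]
```

With this shape, omitting the seed and passing the default explicitly produce the same split, which is what makes the data splits reproducible across users.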
Some additional changes we talked about:
Hi @sfluegel05, please review the PR.
Updates - Evaluation notebook - classification.py - utils.py - pre-commit + some suggestions
Hi @sfluegel05, I have made the changes for
The remaining change requests for this PR from my side:
- for a class with `chebi_version != train_version`, instead of creating additional `v_{train_version}` files, refer to the same files that would be used by a class that has the `train_version` as its main version (e.g. instead of creating two files `chebi.obo` and `chebi_v200.obo` in `chebi_v231/raw`, create `chebi_v200/raw/chebi.obo` and `chebi_v231/raw/chebi.obo`)
- remove commented out cells from notebook
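The requested file layout can be sketched as follows, assuming a hypothetical helper `raw_obo_path` and a `data` base directory (both illustrative, not part of the PR). Each version resolves to its own `chebi_v{version}/raw/chebi.obo`, so a class with a differing `train_version` points at the directory that version would use anyway:

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def raw_obo_path(version: int, base_dir: str = "data") -> PurePosixPath:
    """Raw ontology file for one ChEBI version (hypothetical helper)."""
    return PurePosixPath(base_dir) / f"chebi_v{version}" / "raw" / "chebi.obo"


def raw_paths(chebi_version: int, train_version: Optional[int] = None) -> Dict[str, PurePosixPath]:
    """Raw files a dataset needs; no version-suffixed duplicates are created."""
    paths = {"main": raw_obo_path(chebi_version)}
    if train_version is not None and train_version != chebi_version:
        # Reuse the file that a dataset with train_version as its main version would use
        paths["train"] = raw_obo_path(train_version)
    return paths
```

For `chebi_version=231` and `train_version=200`, this yields one `chebi.obo` per version directory rather than two files inside `chebi_v231/raw`.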
"test": self.dynamic_df_test,
}

def load_processed_data(self, kind: str = None) -> List:
here, you should add a `filename` parameter that allows the user to load an arbitrary file (independent of splits).
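One way this could look, as a rough sketch only: the class name `ProcessedDataLoader`, the `{kind}.pkl` naming scheme, and the use of `pickle` are assumptions made to keep the example self-contained, and the project's own serialization would take their place.

```python
import os
import pickle
from typing import List, Optional


class ProcessedDataLoader:
    """Hypothetical stand-in; pickle replaces the project's real serialization."""

    def __init__(self, processed_dir: str):
        self.processed_dir = processed_dir

    def load_processed_data(
        self, kind: Optional[str] = None, filename: Optional[str] = None
    ) -> List:
        # An explicit filename takes precedence and is independent of the splits
        if filename is None:
            if kind is None:
                raise ValueError("Either kind or filename is required")
            filename = f"{kind}.pkl"  # assumed naming scheme
        with open(os.path.join(self.processed_dir, filename), "rb") as f:
            return pickle.load(f)
```

Existing callers that pass only `kind` keep working, while `filename` gives access to any file in the processed directory.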
- removed list comprehension from data split logic
- used dataframe operations instead as they are faster
- remove looping for msss.split as no need for it, used `next` instead
Also, I have updated the wiki
Also, can you please confirm whether merging this PR will lead to closure of the below issue too
This looks good. Thanks for implementing.
Regarding issue #12, I'd say this solves the first part (reproducing data splits), but not the second one (reproducing runs).
Hi @sfluegel05, since you have approved the changes, can you please merge the PR if there are no further actions or changes left for this issue?
I will merge this PR, but before that, I have another task. Since this change will require other users of this tool to change their datasets, it would be useful to have a migration script. I created an issue for that: #34
Goal
`chebi.obo` (raw)
data/ChEBIX/chebi_version/raw
/data/ChEBIX/chebi_version/processed/encoding
data/chebi_version/raw
data/chebi_version/ChEBIX/processed
data/chebi_version/ChEBIX/processed/encoding
A special case for the data splits is the `chebi_version_train`:
Use case
You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).
Tasks
- If `chebi_version_train` is set, create and process two datasets (one for the `chebi_version`, one for `chebi_version_train`)
- `chebi_version_train` data, but using the test set from `chebi_version`
- `chebi_version` test set that has all the same entries, but only the labels that also appear in the `classes.txt` of `chebi_version_train`
- `ChEBIOver50(chebi_version=231)` and `ChEBIOver50(chebi_version=231, chebi_version_train=200)` should have the same ids in their test sets (but different numbers of labels), the latter should also pass the test for no overlaps
Most of the functionality is already implemented for that, it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.
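The label restriction and the overlap check described in the tasks can be illustrated with toy data. This is a sketch under assumptions: the test set is represented as a dict from entry id to a set of labels, the `classes.txt` content as a list of labels, and both function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def restrict_test_labels(
    test_set: Dict[int, Set[int]], train_classes: Iterable[int]
) -> Dict[int, Set[int]]:
    """Keep every test entry, but drop labels unknown to the training version."""
    allowed = set(train_classes)
    return {entry_id: labels & allowed for entry_id, labels in test_set.items()}


def has_overlap(train_ids: Iterable[int], test_ids: Iterable[int]) -> bool:
    """True if any id appears in both the training and the test set."""
    return not set(train_ids).isdisjoint(test_ids)


# Toy example: test set of the main version, classes of the training version
test_main = {1: {10, 20}, 2: {30}, 3: {20, 40}}
classes_train_version = [10, 20]
test_for_train_version = restrict_test_labels(test_main, classes_train_version)
```

The restricted test set keeps the same ids as the original one and differs only in its labels, which matches the expectation stated in the last task.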