
Module for datasets in SED #401

Merged: 64 commits merged into main from data-fetch on Jun 14, 2024
Conversation

@zain-sohail (Member) commented May 14, 2024

This PR introduces:

  • An easy API for getting and removing datasets [stored in ./datasets/<data_name>/]
  • A datasets config file (JSON) to add or remove datasets [folder config stored in ./datasets.json]
    • Hierarchically merges the module, user, and folder files into one complete datasets.json
  • A logging setup for this module [logs are stored in the current working dir ./logs]
  • Use of platformdirs to store the user's dataset state
    • datasets.json is copied to user_config_path and updated there to keep track of state
  • Updated tutorials that use the new dataset module

These changes are arranged in two classes:

  • Dataset, to get/remove datasets; the user only accesses the class instance dataset:
    from sed.dataset import dataset
  • DatasetsManager, to configure (URL etc.) new and existing datasets:
    from sed.dataset import DatasetsManager
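A minimal, hypothetical usage sketch of the two entry points (the method names get, remove, and add, and the entry schema, are illustrative assumptions; see the linked notebook below for the actual API):

```python
from sed.dataset import dataset, DatasetsManager

# Fetch a dataset by name into ./datasets/<data_name>/
# ("get" is an assumed method name for illustration)
dataset.get("Example_Dataset")

# Remove a previously downloaded dataset again
# ("remove" is likewise an assumed method name)
dataset.remove("Example_Dataset")

# Register a new dataset entry, e.g. pointing at a Zenodo archive
# ("add", its signature, and the URL are placeholders)
DatasetsManager.add(
    "Example_Dataset",
    {"url": "https://zenodo.org/record/<record_id>/files/data.zip"},
)
```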

Please look at this notebook for the complete use case: datasets_example.ipynb.zip

Related to #249; closes #403.

@zain-sohail requested a review from rettigl on May 22, 2024 at 14:13
@rettigl (Member) commented May 22, 2024

Did not look at it in detail, but the documentation build is no longer working, so please fix this first.

@rettigl (Member) left a review comment:

I'm not really sure I understand the purpose of this dataset module. How do you intend it to be used? What is the advantage over the current lines of code (you add ~400 or so lines of code without a clear purpose or use case that is apparent to me)?
The logging part is probably nice, but I suggest separating this out into another PR. We will have to discuss which of the default outputs can be put into logs; I am not sure.

Review threads (all resolved) on:
  • sed/core/logging.py
  • sed/core/user_dirs.py
  • sed/dataset/dataset.py
@zain-sohail (Member, Author):

I'm not really sure I understand the purpose of this dataset module. How do you intend it to be used? What is the advantage over the current lines of code (you add ~400 or so lines of code without a clear purpose or use case that is apparent to me)?

I understand your point. The line count seems large because much of it is for documentation/testing. There are a few reasons why I decided to work on this:

  • Most user- and data-focused repositories have functionality to fetch datasets for the user. It's one thing for a user to run the tutorials, but another if they want to fetch the given datasets in their own pipelines.
  • New datasets can easily be added and used. We just need to add to the JSON file (I thought of adding an entry point for users so they can add other data, e.g. from Zenodo, to their datasets.json).
  • The user keeps track of the data they downloaded and doesn't re-download/re-extract it. I faced this problem because notebooks/scripts can live in different places and use relative paths. With access to the state (from datasets.json in the user config), the user can find out where the data should already exist. Moreover, with the FLASH data, where the data needs to be rearranged (likely the case for most data), I once deleted the empty directory and that made it download the whole file again.

In short, the additional complexity in the package allows for simplicity for the user.

You can check how datasets.json in your config dir is updated to see the benefit, e.g.:

  • Use load_dataset(<dataset_name>) to get the data
  • Use load_dataset(<dataset_name>, <own_path>) to get the data at a path of your choice
  • Try to reload the same dataset with another path: load_dataset(<dataset_name>, <another_path>)
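For instance, a rough sketch of that sequence (the import path, dataset name, and paths are placeholders; the comments describe the intended behavior, not guaranteed output):

```python
from sed.dataset import load_dataset  # import path assumed for illustration

# 1) Fetch into the default location; datasets.json in the user config
#    records where the data was downloaded and extracted.
load_dataset("Example_Dataset")

# 2) Fetch into an explicit path; the recorded state is updated accordingly.
load_dataset("Example_Dataset", "/data/project_a")

# 3) Request the same dataset with yet another path and check how the
#    user-level datasets.json keeps track of the existing copies instead
#    of blindly re-downloading.
load_dataset("Example_Dataset", "/data/project_b")
```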

The logging part is probably nice, but I suggest separating this out into another PR. We will have to discuss which of the default outputs can be put into logs; I am not sure.

I can do that, but I put it here because the use case was quite apparent (a good test case).
Currently, log.warning and above is shown to the user and everything else is just written to the log file.
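As a rough sketch of what such a setup can look like (not necessarily the module's exact implementation), assuming a console handler at WARNING and a file handler at DEBUG writing under ./logs:

```python
import logging
import os


def setup_logging(name: str, log_dir: str = "./logs") -> logging.Logger:
    """Console shows WARNING and above; the log file records everything."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers if the module is reloaded

    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)
    console.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    file_handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"),
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```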

@rettigl (Member) commented May 23, 2024

  • Most user- and data-focused repositories have functionality to fetch datasets for the user. It's one thing for a user to run the tutorials, but another if they want to fetch the given datasets in their own pipelines.

I see this point, but then we should look at how it is done for other packages' large-scale example data (e.g. xarray, pandas, etc.). Typically you can just import datasets from a module or so.

  • New datasets can easily be added and used. We just need to add to the JSON file (I thought of adding an entry point for users so they can add other data, e.g. from Zenodo, to their datasets.json).

For some example data this might be helpful, but for the general use case of converting data from the lab/beamline, this won't happen, no?

  • The user keeps track of the data they downloaded and doesn't re-download/re-extract it. I faced this problem because notebooks/scripts can live in different places and use relative paths. With access to the state (from datasets.json in the user config), the user can find out where the data should already exist. Moreover, with the FLASH data, where the data needs to be rearranged (likely the case for most data), I once deleted the empty directory and that made it download the whole file again.

This indeed is true. I need to check this.

I can do that, but I put it here because the use case was quite apparent (a good test case).
Currently, log.warning and above is shown to the user and everything else is just written to the log file.

We can use it here as an example then, and integrate it into the other modules later. This PR, however, then does not close #249.

@zain-sohail (Member, Author) commented May 23, 2024

I see this point, but then we should look at how it is done for other packages' large-scale example data (e.g. xarray, pandas, etc.). Typically you can just import datasets from a module or so.

Looking here, that's true, because the datasets are small:
https://scikit-learn.org/stable/datasets/toy_dataset.html

But this one, for example, downloads the dataset first; I will look at how they do their data loading procedure:
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Another example with source: https://github.com/keras-team/keras/tree/v3.3.3/keras/src/datasets

For some example data this might be helpful, but for the general use case of converting data from the lab/beamline, this won't happen, no?

In the end, this just downloads/extracts/arranges the data at the specified path. One could also do this with lab/beamtime data if a URL for that set is given. Everything afterwards happens with the SedProcessor.
The use case I imagine is looking at other interesting single-event datasets from Zenodo, for ARPES or maybe even other single-event applications. A rough sketch of the download/extract step follows below.
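As a rough illustration of that download/extract step (using requests and zipfile; the function and its details are assumptions, not the module's actual code):

```python
import os
import zipfile

import requests  # assumed dependency, for illustration only


def fetch_and_extract(url: str, target_dir: str) -> None:
    """Download an archive if it is not already present, then extract it."""
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(archive):
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(archive, "wb") as fh:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
```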

We can use it here as an example then, and integrate it into the other modules later. This PR, however, then does not close #249.

Noted.

This indeed is true. I need to check this.

Improvements can definitely be made to the current state:

  • Account for multiple paths where the same dataset can exist
  • Take snapshots of the file state after download, after extraction, and after arrangement, so those states are saved more robustly (right now only the final file state is captured)
  • Remove the default user data path; the user has to either define their own path or specifically ask to use the default
  • Use the hierarchical dict merging already in use by the config files (see the sketch below)
  • Maybe put this all in a class
  • Add an option to delete datasets
  • Log files with notebooks
  • Let the user add more datasets
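For reference, a minimal sketch of the kind of hierarchical dict merging meant above (module defaults, then user config, then folder config, with later levels overriding earlier ones); illustrative only, not the package's actual helper:

```python
def merge_dicts(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)
        else:
            merged[key] = value
    return merged


# module defaults < user config < folder config
# combined = merge_dicts(merge_dicts(module_cfg, user_cfg), folder_cfg)
```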

@zain-sohail (Member, Author):

The docs building now works:
https://github.com/OpenCOMPES/sed/actions/runs/9507832092

I have also included the tutorial notebook I provided, along with the API/config docs. It will be under "dataset" after building.

I don't really understand the benefit. The thing that takes the most time is the compilation of the notebooks. The rest takes maybe 10-20 s or so. I don't think it is worth building a complicated caching system to save <1% of the time. And the notebook building you have to do anyway, otherwise you don't know if something changed in them...

Regarding this, I will open a new issue, but it is not an urgent problem.
In essence, if nbsphinx (which compiles the notebooks) sees that our notebooks haven't changed, it will not compile them. The only way that works is if both the tutorial notebooks (because currently we copy the notebooks from tutorial/ to docs/tutorial) and the _build folder are cached, since that retains the timestamps and state nbsphinx needs. After restoring the cache, we would only copy the tutorials that have changed (a hash function can be used for that), so nbsphinx recognizes those as updated notebooks.
If somehow we didn't have to copy the notebooks, I believe we could get by with caching _build only.

@zain-sohail requested a review from rettigl on June 14, 2024 at 12:52
@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516441467

Details

  • 291 of 339 (85.84%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 91.469%

Changes missing coverage:
  sed/dataset/dataset.py: 164 of 212 changed/added lines covered (77.36%)
Totals (change from base Build 9506613369: -0.3%):
  Covered Lines: 6412
  Relevant Lines: 7010

💛 - Coveralls

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516441113

Details

  • 291 of 339 (85.84%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 91.469%

Changes missing coverage:
  sed/dataset/dataset.py: 164 of 212 changed/added lines covered (77.36%)
Totals (change from base Build 9506613369: -0.3%):
  Covered Lines: 6412
  Relevant Lines: 7010

💛 - Coveralls

@rettigl (Member) commented Jun 14, 2024

In essence, if nbsphinx (which compiles the notebooks) sees that our notebooks haven't changed, it will not compile them.

That is not sufficient: even if the notebooks themselves did not change, the output of the code might. So you need to process them in order to know.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516865701

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@zain-sohail (Member, Author):

That is not sufficient: even if the notebooks themselves did not change, the output of the code might. So you need to process them in order to know.

That's a good point. I guess there's no way to really speed it up then.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519662780

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519662421

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@rettigl (Member) left a review comment:

At first this did not work for me, because my old user datasets.json was out of date. After deleting it, it now works.
I did not check everything in detail again, but as it appears to run, LGTM.
I added a couple of things:

  • a gitignore entry for datasets
  • use tqdm.auto
  • remove existing logging handlers when initializing loggers, to prevent duplicate loggers if files are reloaded during a session

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519855726

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@rettigl (Member) commented Jun 14, 2024

The only thing I still find strange is that you get a warning every time a dataset is already there. It seems strange because there is nothing wrong, nothing to warn or even inform about, when a dataset is already there. The "using existing data" message should be enough, for my taste.

@rettigl (Member) commented Jun 14, 2024

I also found it helpful to print the log level on the console.

@rettigl (Member) commented Jun 14, 2024

Another thing I fixed now is that the rearrange action will overwrite existing files (if they are there, but missing in the config). It gave an error before.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9520169431

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@zain-sohail (Member, Author):

At first this did not work for me, because my old user datasets.json was out of date. After deleting it, it now works. I did not check everything in detail again, but as it appears to run, LGTM. I added a couple of things:

Yes, this was a problem for me too, but I think it's mostly a problem during development. The only time I can imagine it failing is when a user deletes the dataset manually rather than through the API.
If that becomes an issue, we can update the code accordingly.

Thanks for the updates/suggestions; I have incorporated them.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9521969503

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@zain-sohail merged commit f555a74 into main on Jun 14, 2024
6 checks passed
@zain-sohail deleted the data-fetch branch on June 14, 2024 at 20:34