
Module for datasets in SED #401

Merged: 64 commits merged into main from data-fetch on Jun 14, 2024
Conversation

@zain-sohail (Member) commented May 14, 2024

This PR introduces:

  • An easy API for getting and removing datasets [stored in ./datasets/<data_name>/]
  • A datasets config file (JSON) to add or remove datasets [folder config stored in ./datasets.json]
    • Hierarchically merges the module, user, and folder files into one complete datasets.json
  • A logging setup for this module [logs are stored in the current working dir ./logs]
  • Use of platformdirs to store the user's dataset state
    • datasets.json is copied to user_config_path and updated there to keep track of state
  • Updated tutorials that use the new dataset module

These changes are arranged in two classes:

  • Dataset, to get/remove datasets; the user only accesses the class instance dataset:
    from sed.dataset import dataset
  • DatasetsManager, to configure (URL etc.) new and existing datasets:
    from sed.dataset import DatasetsManager
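A minimal, hypothetical usage sketch of the two entry points (the method names get, remove, and add, and the entry schema, are illustrative assumptions; see the linked notebook below for the actual API):

```python
from sed.dataset import dataset, DatasetsManager

# Fetch a dataset by name into ./datasets/<data_name>/
# ("get" is an assumed method name for illustration)
dataset.get("Example_Dataset")

# Remove a previously downloaded dataset again
# ("remove" is likewise an assumed method name)
dataset.remove("Example_Dataset")

# Register a new dataset entry, e.g. pointing at a Zenodo archive
# ("add", its signature, and the URL are placeholders)
DatasetsManager.add(
    "Example_Dataset",
    {"url": "https://zenodo.org/record/<record_id>/files/data.zip"},
)
```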

Please look at this notebook for the complete use case: datasets_example.ipynb.zip

Related to #249; closes #403.

@zain-sohail requested a review from rettigl on May 22, 2024 at 14:13
@rettigl (Member) commented May 22, 2024

Did not look at it in detail, but the documentation build is no longer working, so please fix this first.

@rettigl (Member) left a review comment:

I'm not really sure I understand the purpose of this dataset module. How do you intend it to be used? What is the advantage over the current lines of code (you add ~400 or so lines of code without a clear purpose or use case that is apparent to me)?
The logging part is probably nice, but I suggest separating this out into another PR. We will have to discuss which of the default outputs can be put into logs; I am not sure.

Review threads (all resolved) on:
  • sed/core/logging.py
  • sed/core/user_dirs.py
  • sed/dataset/dataset.py
@zain-sohail (Member, Author):

I'm not really sure I understand the purpose of this dataset module. How do you intend it to be used? What is the advantage over the current lines of code (you add ~400 or so lines of code without a clear purpose or use case that is apparent to me)?

I understand your point. The line count seems large because much of it is for documentation/testing. There are a few reasons why I decided to work on this:

  • Most user- and data-focused repositories have functionality to fetch datasets for the user. It's one thing for a user to run the tutorials, but another if they want to fetch the given datasets in their own pipelines.
  • New datasets can easily be added and used. We just need to add to the JSON file (I thought of adding an entry point for users so they can add other data, e.g. from Zenodo, to their datasets.json).
  • The user keeps track of the data they downloaded and doesn't re-download/re-extract it. I faced this problem because notebooks/scripts can live in different places and use relative paths. With access to the state (from datasets.json in the user config), the user can find out where the data should already exist. Moreover, with the FLASH data, where the data needs to be rearranged (likely the case for most data), I once deleted the empty directory and that made it download the whole file again.

In short, the additional complexity in the package allows for simplicity for the user.

You can check how datasets.json in your config dir is updated to see the benefit, e.g.:

  • Use load_dataset(<dataset_name>) to get the data
  • Use load_dataset(<dataset_name>, <own_path>) to get the data at a path of your choice
  • Try to reload the same dataset with another path: load_dataset(<dataset_name>, <another_path>)
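For instance, a rough sketch of that sequence (the import path, dataset name, and paths are placeholders; the comments describe the intended behavior, not guaranteed output):

```python
from sed.dataset import load_dataset  # import path assumed for illustration

# 1) Fetch into the default location; datasets.json in the user config
#    records where the data was downloaded and extracted.
load_dataset("Example_Dataset")

# 2) Fetch into an explicit path; the recorded state is updated accordingly.
load_dataset("Example_Dataset", "/data/project_a")

# 3) Request the same dataset with yet another path and check how the
#    user-level datasets.json keeps track of the existing copies instead
#    of blindly re-downloading.
load_dataset("Example_Dataset", "/data/project_b")
```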

The logging part is probably nice, but I suggest separating this out into another PR. We will have to discuss which of the default outputs can be put into logs; I am not sure.

I can do that, but I put it here because the use case was quite apparent (a good test case).
Currently, log.warning and above is shown to the user and everything else is just written to the log file.
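As a rough sketch of what such a setup can look like (not necessarily the module's exact implementation), assuming a console handler at WARNING and a file handler at DEBUG writing under ./logs:

```python
import logging
import os


def setup_logging(name: str, log_dir: str = "./logs") -> logging.Logger:
    """Console shows WARNING and above; the log file records everything."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers if the module is reloaded

    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)
    console.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    file_handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"),
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```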

@rettigl (Member) commented May 23, 2024

  • Most user- and data-focused repositories have functionality to fetch datasets for the user. It's one thing for a user to run the tutorials, but another if they want to fetch the given datasets in their own pipelines.

I see this point, but then we should look at how it is done for other packages' large-scale example data (e.g. xarray, pandas, etc.). Typically you can just import datasets from a module or so.

  • New datasets can easily be added and used. We just need to add to the JSON file (I thought of adding an entry point for users so they can add other data, e.g. from Zenodo, to their datasets.json).

For some example data this might be helpful, but for the general use case of converting data from the lab/beamline, this won't happen, no?

  • The user keeps track of the data they downloaded and doesn't re-download/re-extract it. I faced this problem because notebooks/scripts can live in different places and use relative paths. With access to the state (from datasets.json in the user config), the user can find out where the data should already exist. Moreover, with the FLASH data, where the data needs to be rearranged (likely the case for most data), I once deleted the empty directory and that made it download the whole file again.

This indeed is true. I need to check this.

I can do that, but I put it here because the use case was quite apparent (a good test case).
Currently, log.warning and above is shown to the user and everything else is just written to the log file.

We can use it here as an example then, and integrate it into the other modules later. This PR, however, then does not close #249.

@zain-sohail (Member, Author) commented May 23, 2024

I see this point, but then we should look at how it is done for other packages' large-scale example data (e.g. xarray, pandas, etc.). Typically you can just import datasets from a module or so.

Looking here, that's true, because the datasets are small:
https://scikit-learn.org/stable/datasets/toy_dataset.html

But this one, for example, downloads the dataset first; I will look at how they do their data loading procedure:
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Another example with source: https://github.com/keras-team/keras/tree/v3.3.3/keras/src/datasets

For some example data this might be helpful, but for the general use case of converting data from the lab/beamline, this won't happen, no?

In the end, this just downloads/extracts/arranges the data at the specified path. One could also do this with lab/beamtime data if a URL for that set is given. Everything afterwards happens with the SedProcessor.
The use case I imagine is looking at other interesting single-event datasets from Zenodo, for ARPES or maybe even other single-event applications. A rough sketch of the download/extract step follows below.
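As a rough illustration of that download/extract step (using requests and zipfile; the function and its details are assumptions, not the module's actual code):

```python
import os
import zipfile

import requests  # assumed dependency, for illustration only


def fetch_and_extract(url: str, target_dir: str) -> None:
    """Download an archive if it is not already present, then extract it."""
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(archive):
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(archive, "wb") as fh:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
```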

We can use it here as an example then, and integrate it into the other modules later. This PR, however, then does not close #249.

Noted.

This indeed is true. I need to check this.

Improvements can definitely be made to the current state:

  • Account for multiple paths where the same dataset can exist
  • Take snapshots of the file state after download, after extraction, and after arrangement, so those states are saved more robustly (right now only the final file state is captured)
  • Remove the default user data path; the user has to either define their own path or specifically ask to use the default
  • Use the hierarchical dict merging already in use by the config files (see the sketch below)
  • Maybe put this all in a class
  • Add an option to delete datasets
  • Log files with notebooks
  • Let the user add more datasets
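For reference, a minimal sketch of the kind of hierarchical dict merging meant above (module defaults, then user config, then folder config, with later levels overriding earlier ones); illustrative only, not the package's actual helper:

```python
def merge_dicts(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)
        else:
            merged[key] = value
    return merged


# module defaults < user config < folder config
# combined = merge_dicts(merge_dicts(module_cfg, user_cfg), folder_cfg)
```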

@zain-sohail (Member, Author):

The docs building now works:
https://github.com/OpenCOMPES/sed/actions/runs/9507832092

I have also included the tutorial notebook I provided, along with the API/config docs. It will be under "dataset" after building.

I don't really understand the benefit. The thing that takes the most time is the compilation of the notebooks. The rest takes maybe 10-20 s or so. I don't think it is worth building a complicated caching system to save <1% of the time. And the notebook building you have to do anyway, otherwise you don't know if something changed in them...

Regarding this, I will open a new issue, but it is not an urgent problem.
In essence, if nbsphinx (which compiles the notebooks) sees that our notebooks haven't changed, it will not compile them. The only way that works is if both the tutorial notebooks (because currently we copy the notebooks from tutorial/ to docs/tutorial) and the _build folder are cached, since that retains the timestamps and state nbsphinx needs. After restoring the cache, we would only copy the tutorials that have changed (a hash function can be used for that), so nbsphinx recognizes those as updated notebooks.
If somehow we didn't have to copy the notebooks, I believe we could get by with caching _build only.

@zain-sohail requested a review from rettigl on June 14, 2024 at 12:52
@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516441467

Details

  • 291 of 339 (85.84%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 91.469%

Changes missing coverage:
  sed/dataset/dataset.py: 164 of 212 changed/added lines covered (77.36%)
Totals (change from base Build 9506613369: -0.3%):
  Covered Lines: 6412
  Relevant Lines: 7010

💛 - Coveralls

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516441113

Details

  • 291 of 339 (85.84%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 91.469%

Changes missing coverage:
  sed/dataset/dataset.py: 164 of 212 changed/added lines covered (77.36%)
Totals (change from base Build 9506613369: -0.3%):
  Covered Lines: 6412
  Relevant Lines: 7010

💛 - Coveralls

@rettigl (Member) commented Jun 14, 2024

In essence, if nbsphinx (which compiles the notebooks) sees that our notebooks haven't changed, it will not compile them.

That is not sufficient: even if the notebooks themselves did not change, the output of the code might. So you need to process them in order to know.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9516865701

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@zain-sohail (Member, Author):

That is not sufficient: even if the notebooks themselves did not change, the output of the code might. So you need to process them in order to know.

That's a good point. I guess there's no way to really speed it up then.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519662780

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519662421

Details

  • 341 of 356 (95.79%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.96%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6462
  Relevant Lines: 7027

💛 - Coveralls

@rettigl (Member) left a review comment:

At first this did not work for me, because my old user datasets.json was out of date. After deleting it, it now works.
I did not check everything in detail again, but as it appears to run, LGTM.
I added a couple of things:

  • a gitignore entry for datasets
  • use tqdm.auto
  • remove existing logging handlers when initializing loggers, to prevent duplicate loggers if files are reloaded during a session

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9519855726

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@rettigl (Member) commented Jun 14, 2024

The only thing I still find strange is that you get a warning every time a dataset is already there. It seems strange because there is nothing wrong, nothing to warn or even inform about, when a dataset is already there. The "using existing data" message should be enough, for my taste.

@rettigl (Member) commented Jun 14, 2024

I also found it helpful to print the log level on the console.

@rettigl (Member) commented Jun 14, 2024

Another thing I fixed now is that the rearrange action will overwrite existing files (if they are there, but missing in the config). It gave an error before.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9520169431

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@zain-sohail (Member, Author):

At first this did not work for me, because my old user datasets.json was out of date. After deleting it, it now works. I did not check everything in detail again, but as it appears to run, LGTM. I added a couple of things:

Yes, this was a problem for me too, but I think it's mostly a problem during development. The only time I can imagine it failing is when a user deletes the dataset manually rather than through the API.
If that becomes an issue, we can update the code accordingly.

Thanks for the updates/suggestions; I have incorporated them.

@coveralls (Collaborator) commented Jun 14, 2024

Pull Request Test Coverage Report for Build 9521969503

Details

  • 343 of 358 (95.81%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 91.962%

Changes missing coverage:
  sed/dataset/dataset.py: 197 of 212 changed/added lines covered (92.92%)
Totals (change from base Build 9506613369: +0.2%):
  Covered Lines: 6464
  Relevant Lines: 7029

💛 - Coveralls

@zain-sohail merged commit f555a74 into main on Jun 14, 2024
6 checks passed
@zain-sohail deleted the data-fetch branch on June 14, 2024 at 20:34