Module for datasets in SED #401
Conversation
Did not look in detail, but the documentation build is no longer working, so please fix this first.
I'm not really sure I understand the purpose of this dataset module. How do you intend it to be used? What is the advantage over the current code? You add ~400 or so lines without a clear purpose or use case that is apparent to me.
The logging part is probably nice, but I suggest separating it out into another PR. We will have to discuss which of the default outputs can be put into logs; I am not sure.
I understand your point. The line count looks large because much of it is documentation and testing. There are a few reasons why I decided to work on this:
In short, the additional complexity in the package allows for simplicity for the user. You can check how datasets.json in your config dir is updated to see the benefit.
I can do that. But I put it here because the use case was quite apparent (a good test case).
I see this point, but then we should look at how other packages handle large-scale example data (e.g. xarray, pandas, etc.). Typically you can just import datasets from a module or so.
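For reference, the lazy "fetch on first use, then cache" pattern used by such packages can be sketched like this (a stdlib-only sketch; the function and path names are illustrative, not taken from any of the packages mentioned):

```python
import os
import urllib.request

def fetch_example_data(url, cache_dir="./datasets", name="example.zip"):
    """Download an example dataset once and reuse the cached copy.

    This mirrors the lazy-download pattern packages like xarray use
    for their tutorial data: the first call downloads, later calls
    just return the cached path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, name)
    if os.path.exists(target):
        # Cached copy found: skip the download entirely.
        return target
    urllib.request.urlretrieve(url, target)
    return target
```

The advantage of this pattern is that importing the module stays cheap; the network is only touched when the data is actually requested.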
For some example data this might be helpful, but for the general use case for converting data from the lab/beamline, this won't happen, no?
This indeed is true. I need to check this.
We can use this here as an example then, and integrate it for the other modules. This PR, however, then does not close #249.
Looking here, that's true. But, for example, this downloads the dataset first. I will look at how they do their data loading procedure.
In the end, this just downloads/extracts/arranges the data at the specified path. One could also do this with lab/beamtime data if a URL for that set is given. But everything afterwards happens with the SedProcessor.
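The download/extract/arrange step can be sketched roughly like this (an illustrative stdlib sketch, not the module's actual code; function names are made up here):

```python
import os
import zipfile

def extract_and_arrange(archive_path, data_dir):
    """Unpack a downloaded archive into data_dir and return the
    extracted file paths. A processing step (e.g. SedProcessor)
    would then be pointed at these paths.
    """
    os.makedirs(data_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(data_dir)
        extracted = [os.path.join(data_dir, n) for n in zf.namelist()]
    return extracted
```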
Noted.
Improvements can definitely be made to the current state.
The docs build now works. I have also included the tutorial notebook I provided, along with API/config docs. It will appear under "dataset" after building.
Regarding this, I will open a new issue, but it is not an urgent problem.
Pull Request Test Coverage Report for Build 9516441467
💛 - Coveralls
That is not sufficient: even if the notebooks themselves did not change, the output of the code might. So you need to process them in order to know.
That's a good point. I guess there's no way to really speed it up then.
At first this did not work for me, because my old user datasets.json was out of date. After deleting it, it now works.
I did not check everything again now, but as it appears to run, LGTM.
I added a couple of things:
- gitignore for datasets
- use tqdm.auto
- remove existing logging handlers when initializing loggers, to prevent multiple same loggers if files are being reloaded during a session
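The handler-reset idea in the last point can be sketched like this (an illustrative sketch, not the module's actual code; it also prints the log level on the console, as suggested below):

```python
import logging

def get_logger(name="sed.dataset", level=logging.INFO):
    """Return a logger whose handlers are reset on every call, so
    reloading modules during a notebook session does not stack
    duplicate handlers (which would print every message twice)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Remove handlers left over from a previous initialization.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
    console = logging.StreamHandler()
    # Show the log level alongside the message on the console.
    console.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(console)
    return logger
```

Calling `get_logger` twice returns the same logger object but with only a single handler attached.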
The only thing I still find strange is that you get a warning every time a dataset is already there. It seems odd because there is nothing wrong to warn (or even info) about when a dataset already exists. The "using existing data" message should be enough, for my taste.
I also found it helpful to print the log level on the console.
Another thing I fixed now is that the rearrange action will overwrite existing files (if they are present on disk but missing in the config). It gave an error before.
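An overwriting move like the one described can be sketched as follows (an illustrative sketch under the assumption that the rearrange step boils down to moving files; not the module's actual code):

```python
import os
import shutil

def move_overwrite(src, dst):
    """Move src to dst, replacing dst if it already exists.

    shutil.move can raise an error on some platforms (e.g. Windows)
    when the destination file exists, so remove it first.
    """
    dst_dir = os.path.dirname(dst)
    if dst_dir:
        os.makedirs(dst_dir, exist_ok=True)
    if os.path.exists(dst):
        os.remove(dst)
    shutil.move(src, dst)
```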
Yes, this was a problem for me too, but I think mostly during development. The only case I can imagine it failing for a user is when they delete the dataset themselves rather than through the API. Thanks for the updates/suggestions; I have incorporated them.
This PR introduces:
- [Stored in ./datasets/<data_name>/]
- [Folder config stored in ./datasets.json]
- [Now stored in current working dir ./logs]
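For illustration, an entry in the folder config might look roughly like this (a hypothetical datasets.json fragment; the field names are illustrative, not taken from the module):

```json
{
  "WSe2": {
    "url": "https://example.org/datasets/WSe2.zip",
    "data_path": "./datasets/WSe2/"
  }
}
```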
These changes are arranged in two classes:
- Dataset, of which the user only accesses the class instance `dataset`:
  `from sed.dataset import dataset`
- DatasetsManager:
  `from sed.dataset import DatasetsManager`
Please look at this notebook for a complete use case: datasets_example.ipynb.zip
related to #249 and closes #403