add data-source persistence #237
Adds a 'persist()' method to sources, which uses the container lookup to find the appropriate function to call for that source. Each container type saves to one particular format; only dataframe is implemented here so far. Also includes GenericDataFrame, which can load anything given a set of files and a function that turns an open file into a data-frame piece.
For now, the metadata just includes a time-stamp. Ideally, it should be possible to call "refresh" to update persisted datasets and have old ones expire; persisted datasets should also allow the user to fall back to the original on-demand data.
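Since the metadata already stores an ISO-format timestamp, the expiry idea could be approached by comparing that timestamp against a maximum age. The following is only a sketch under that assumption; `is_expired`, `max_age` and the example dates are hypothetical and not part of this PR.

```python
import datetime


def is_expired(metadata, max_age, now=None):
    """Return True if a persisted dataset is older than max_age.

    Hypothetical helper: assumes metadata['timestamp'] was written with
    datetime.datetime.now().isoformat(), as the persisted metadata does.
    """
    now = now or datetime.datetime.now()
    written = datetime.datetime.fromisoformat(metadata['timestamp'])
    return now - written > max_age


# Made-up metadata for illustration only.
meta = {'type': 'persisted_dataset',
        'timestamp': datetime.datetime(2019, 1, 1, 12, 0).isoformat()}
one_day = datetime.timedelta(days=1)

# Six hours after writing: still within the allowed age.
fresh = is_expired(meta, one_day, now=datetime.datetime(2019, 1, 1, 18, 0))
# Two days after writing: older than the allowed age.
stale = is_expired(meta, one_day, now=datetime.datetime(2019, 1, 3, 12, 0))
```

A "refresh" call could then re-run the persist step for any source where this check returns True, and fall back to the original source otherwise.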
intake/config.py
Outdated
@@ -23,7 +23,8 @@
     'cache_disabled': False,
     'cache_download_progress': True,
     'logging': 'INFO',
-    'catalog_path': []
+    'catalog_path': [],
+    'persist_path': 'DEFAULT'
I think we should come to a consensus on the right way to set defaults. There are at least 3 different approaches in this file. I mildly prefer setting a meaningful default in the defaults dict; we could then build a second dict from the env vars and update the defaults with it. Does that sound reasonable?
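A minimal sketch of that approach, for discussion only. The environment variable names, the `ENV_VARS` mapping, the `load_config` helper and the default value for `persist_path` are all assumptions made for illustration; only the config keys themselves come from the diff.

```python
import os

# Meaningful defaults live in one place. The persist_path value is assumed.
defaults = {
    'logging': 'INFO',
    'catalog_path': [],
    'persist_path': '~/.intake/persisted',
}

# Hypothetical mapping from environment variable name to config key.
ENV_VARS = {
    'INTAKE_LOG_LEVEL': 'logging',
    'INTAKE_PERSIST_PATH': 'persist_path',
}


def load_config(environ=None):
    """Return the defaults updated with any values found in the environment."""
    environ = os.environ if environ is None else environ
    # Second dict: only the variables that are actually set.
    from_env = {key: environ[var]
                for var, key in ENV_VARS.items() if var in environ}
    conf = dict(defaults)   # copy, so the defaults stay untouched
    conf.update(from_env)
    return conf


# Passing a plain dict instead of os.environ keeps the example deterministic.
conf = load_config({'INTAKE_PERSIST_PATH': '/tmp/persist'})
```

With this shape there is no need for a sentinel such as 'DEFAULT': a key is either overridden by the environment or keeps its real default.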
from dask.bytes import open_files
self.files = open_files(self.url, **self.storage_options)


def read_a_file(open_file, reader, kwargs):
Might want to allow a preprocess function in here to make it more general.
Good idea. So read_a_file is the fallback? Any suggestion for what to call this argument?
I think it should be called preprocess and should happen right after read_a_file.
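Roughly what this could look like, as a sketch only. The body of read_a_file is assumed, since the diff shows just its signature; `load_piece`, `rows` and `drop_header` are hypothetical names, and the standard-library csv reader stands in for a real data-frame reader.

```python
import csv
import io


def read_a_file(open_file, reader, kwargs):
    # Assumed body: treat open_file as a context manager that yields a
    # file-like object, and hand that object to the reader.
    with open_file as f:
        return reader(f, **kwargs)


def load_piece(open_file, reader, kwargs, preprocess=None):
    """Read one file, then optionally apply preprocess to the result."""
    part = read_a_file(open_file, reader, kwargs)
    if preprocess is not None:
        part = preprocess(part)
    return part


def rows(f, **kwargs):
    # Stand-in reader: parse the open file into a list of rows.
    return list(csv.reader(f, **kwargs))


def drop_header(part):
    # Example preprocess step: discard the first row.
    return part[1:]


part = load_piece(io.StringIO("a,b\n1,2\n3,4\n"), rows, {},
                  preprocess=drop_header)
```

Leaving preprocess as None keeps the current behaviour, so existing callers would not need to change.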
intake/source/base.py
Outdated
out.description = self.description
metadata = {'type': 'persisted_dataset',
            'timestamp': datetime.datetime.now().isoformat(),
            'previous_metadata': self.metadata,
What would the consequences be of leaving metadata = self.metadata and putting the persistence-related info beside it in a new attribute rather than nesting?
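To make the two options concrete, here is a small comparison. It is illustrative only: `PersistedSource`, `persist_info` and the sample metadata keys are hypothetical names, not anything defined in this PR.

```python
import datetime

# Made-up metadata that a source might carry before persisting.
source_metadata = {'origin': 'example', 'plots': {}}

persist_info = {'type': 'persisted_dataset',
                'timestamp': datetime.datetime.now().isoformat()}

# Option 1 (as in the diff): persistence info on top, original nested.
nested = dict(persist_info, previous_metadata=source_metadata)


# Option 2 (raised here): metadata unchanged, persistence info kept in a
# separate attribute next to it.
class PersistedSource:
    def __init__(self, metadata, persist_info):
        self.metadata = metadata
        self.persist_info = persist_info   # hypothetical attribute name


flat = PersistedSource(source_metadata, persist_info)

# Code that reads the original keys works unchanged with option 2 ...
origin_flat = flat.metadata['origin']
# ... but has to reach one level deeper with option 1.
origin_nested = nested['previous_metadata']['origin']
```

The trade-off is that option 2 needs the new attribute to survive serialisation wherever the metadata already does.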
intake/container/dataframe.py
Outdated
----------
source: a DataSource instance to save
name: str or None
    Key to refer to this persisted dataset by. If nto given, will
nto -> not
intake/source/base.py
Outdated
@@ -280,6 +280,26 @@ def hvplot(self):
        """
        return self.plot

    def persist(self, name, **kwargs):
Do you have a use case in mind for name? Is there ever a circumstance where it should not be source.name?
Additional suggestions for persistence, which may appear in another PR:
plus fix knockon tests
Filled out the API with what I imagine to be the final user experience, at least for the first iteration. No tests or documentation yet, beyond basic docstrings.