Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add data-source persistence #237

Merged
merged 34 commits into from
Feb 20, 2019
Merged

add data-source persistence #237

merged 34 commits into from
Feb 20, 2019

Conversation

martindurant
Copy link
Member

adds 'persist()' method to sources, which uses the container lookup to
find the appropriate thing to call for that source. Saves to one
particular format for each container type, only dataframe implemented so
far here.

Also includes GenericDataFrame, which can load anything given a set of
files and a function which can make an open file into a data-frame piece.

adds 'persist()' method to sources, which uses the container lookup to
find the appropriate thing to call for that source. Saves to one
particular format for each container type, only dataframe implemented so
far here.

Also includes GenericDataFrame, which can load anything given a set of
files and a function which can make an open file into a data-frame piece.
@martindurant
Copy link
Member Author

For now, the metadata just includes a time-stamp. Ideally, it should be possible to call "refresh" to update persisted datasets and have old ones expire; persisted datasets should also allow the user to fall back to the original on-demand data.

intake/config.py Outdated
@@ -23,7 +23,8 @@
'cache_disabled': False,
'cache_download_progress': True,
'logging': 'INFO',
'catalog_path': []
'catalog_path': [],
'persist_path': 'DEFAULT'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should come to a consensus on what the right way to set defaults is. There are at least 3 different approaches in this file. I think I mildly prefer to set a meaningful default in the defaults dict then we could create another dict with all the values from the env vars and update the defaults with the dict from env vars. Does that sound reasonable?

intake/container/dataframe.py Outdated Show resolved Hide resolved
intake/container/dataframe.py Outdated Show resolved Hide resolved
intake/container/dataframe.py Show resolved Hide resolved
intake/container/dataframe.py Outdated Show resolved Hide resolved
intake/container/dataframe.py Outdated Show resolved Hide resolved
intake/container/dataframe.py Outdated Show resolved Hide resolved
intake/container/dataframe.py Outdated Show resolved Hide resolved
from dask.bytes import open_files
self.files = open_files(self.url, **self.storage_options)

def read_a_file(open_file, reader, kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to allow a preprocess function in here to make it more general.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. So read_a_file is the fallback? Any suggestion for what to call this argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be called preprocess and should happen right after read_a_file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intake/source/base.py Outdated Show resolved Hide resolved
out.description = self.description
metadata = {'type': 'persisted_dataset',
'timestamp': datetime.datetime.now().isoformat(),
'previous_metadata': self.metadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the consequences be of leaving metadata = self.metadata and putting the persistence-related info beside it in a new attribute rather than nesting?

----------
source: a DataSource instance to save
name: str or None
Key to refer to this persisted dataset by. If nto given, will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nto -> not

@@ -280,6 +280,26 @@ def hvplot(self):
"""
return self.plot

def persist(self, name, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a use case in mind for name? Is there ever a circumstance where it should not be source.name?

@martindurant martindurant added this to In progress in intake Jan 24, 2019
Martin Durant added 2 commits January 28, 2019 16:55
Giving sources a reliable hash (enables equality and keying) and ensuring
that sources know how they were made, if they were made by a catalog
(helps will full provanance)
@martindurant martindurant mentioned this pull request Jan 29, 2019
@martindurant
Copy link
Member Author

Additional suggestions for persistence, which may appear in another PR:

  • multiple checkpointing of some data source, such that you not only have the latest version, but can access previous ones
  • an "export" function that does the same as persist, but not into the persisted things location: you get files and a catalogue wherever specified. This could be done to an already-persisted source, which would mean just copying the files.

Martin Durant added 2 commits February 12, 2019 11:26
@martindurant martindurant changed the title WIP: add data-source persistence add data-source persistence Feb 14, 2019
@martindurant
Copy link
Member Author

Filled out the API with what I imagine to be the final user experience, at least for the first iteration. No tests or documentation yet, beyond basic docstrings.

@martindurant martindurant merged commit 54b2683 into master Feb 20, 2019
@martindurant martindurant deleted the persisting branch February 20, 2019 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
intake
  
In progress
Development

Successfully merging this pull request may close these issues.

None yet

3 participants