add data-source persistence #237

martindurant · 2019-01-19T21:05:09Z

Adds a 'persist()' method to sources, which uses the container lookup to
find the appropriate function to call for that source. Each container
type is saved to one particular format; only dataframe is implemented
here so far.

Also includes GenericDataFrame, which can load anything given a set of
files and a function that turns an open file into a data-frame piece.
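
The container lookup described above can be pictured as a small dispatch table. This is only a sketch of the pattern, not the code in this PR: the registry `PERSISTERS`, the decorator `register_persister` and the stand-in `Source` class are hypothetical names invented for illustration.

```python
# Hypothetical registry mapping a container name to the function that
# writes that kind of data. None of these names come from the PR itself.
PERSISTERS = {}


def register_persister(container):
    """Register a writer function for one container type."""
    def decorator(func):
        PERSISTERS[container] = func
        return func
    return decorator


@register_persister("dataframe")
def persist_dataframe(source, path):
    # A real implementation would write one fixed on-disk format here;
    # this stand-in only reports what it would have done.
    return {"container": "dataframe", "name": source.name, "path": path}


class Source:
    """Minimal stand-in for a data source with a container attribute."""

    def __init__(self, name, container):
        self.name = name
        self.container = container

    def persist(self, path):
        # Look up the writer by container type; fail clearly if none exists.
        try:
            writer = PERSISTERS[self.container]
        except KeyError:
            raise NotImplementedError(
                "no persist implementation for container %r" % self.container
            )
        return writer(self, path)
```

With only the dataframe writer registered, `Source("trips", "dataframe").persist(...)` dispatches to `persist_dataframe`, while any other container type raises `NotImplementedError`.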

martindurant · 2019-01-19T21:06:16Z

For now, the metadata just includes a time-stamp. Ideally, it should be possible to call "refresh" to update persisted datasets and have old ones expire; persisted datasets should also allow the user to fall back to the original on-demand data.
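
One way the time-stamp could drive expiry and fallback is sketched below. It assumes the metadata holds an ISO-format 'timestamp' string, as in the diff; the function names `is_expired` and `choose_source` and the `max_age_seconds` parameter are hypothetical, not part of this PR.

```python
import datetime


def is_expired(metadata, max_age_seconds, now=None):
    """Return True if the persisted copy is older than max_age_seconds."""
    written = datetime.datetime.fromisoformat(metadata["timestamp"])
    now = now or datetime.datetime.now()
    return (now - written).total_seconds() > max_age_seconds


def choose_source(original, persisted, metadata, max_age_seconds, now=None):
    """Prefer the persisted copy; fall back to the original if stale or absent."""
    if persisted is None or is_expired(metadata, max_age_seconds, now):
        return original
    return persisted
```

A "refresh" could then be little more than re-running the persist step whenever `is_expired` returns True.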

jsignell · 2019-01-21T14:10:15Z

intake/config.py

@@ -23,7 +23,8 @@
    'cache_disabled': False,
    'cache_download_progress': True,
    'logging': 'INFO',
-    'catalog_path': []
+    'catalog_path': [],
+    'persist_path': 'DEFAULT'


I think we should come to a consensus on the right way to set defaults. There are at least three different approaches in this file. I mildly prefer setting a meaningful default in the defaults dict, then building a second dict from the env vars and updating the defaults with it. Does that sound reasonable?
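
The approach proposed in this comment might look roughly like the following. This is a sketch only: the environment variable names in `ENV_KEYS`, the default persist path and the `load_config` function are assumptions made for illustration, not the contents of intake/config.py.

```python
import copy
import os

# Meaningful defaults live in one place.
DEFAULTS = {
    "logging": "INFO",
    "catalog_path": [],
    # Hypothetical default location, chosen only for this example.
    "persist_path": os.path.join("~", ".intake", "persist"),
}

# Hypothetical mapping from environment variable to config key.
ENV_KEYS = {
    "INTAKE_LOG_LEVEL": "logging",
    "INTAKE_PERSIST_PATH": "persist_path",
}


def load_config(environ=None):
    """Start from the defaults, then overlay any values set in the environment."""
    environ = os.environ if environ is None else environ
    from_env = {key: environ[var] for var, key in ENV_KEYS.items() if var in environ}
    conf = copy.deepcopy(DEFAULTS)  # avoid sharing mutable defaults
    conf.update(from_env)
    return conf
```

Every key then has exactly one default and one override path, which removes the need for a sentinel value such as 'DEFAULT'.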

jsignell · 2019-01-21T14:34:45Z

intake/container/dataframe.py

+        from dask.bytes import open_files
+        self.files = open_files(self.url, **self.storage_options)
+
+        def read_a_file(open_file, reader, kwargs):


Might want to allow a preprocess function in here to make it more general.

Good idea. So read_a_file is the fallback? Any suggestion for what to call this argument?

I think it should be called preprocess, and it should be applied right after read_a_file.

See http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html for instance
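
A minimal sketch of the suggestion in this thread: an optional preprocess callable applied to each piece straight after it is read. The signature extends the `read_a_file(open_file, reader, kwargs)` shown in the diff with a `preprocess` argument; the body, the CSV reader and the `drop_empty` filter are illustrative assumptions, using plain lists of rows in place of data-frame pieces.

```python
import csv
import io


def read_a_file(open_file, reader, kwargs, preprocess=None):
    """Read one piece with ``reader``, then optionally transform it."""
    with open_file as f:
        piece = reader(f, **kwargs)
    if preprocess is not None:
        piece = preprocess(piece)
    return piece


def csv_rows(f, **kwargs):
    # Stand-in reader: one dict per CSV row.
    return list(csv.DictReader(f, **kwargs))


def drop_empty(rows):
    # Example preprocess step: discard rows with an empty 'value' field.
    return [row for row in rows if row["value"]]


rows = read_a_file(
    io.StringIO("name,value\na,1\nb,\n"), csv_rows, {}, preprocess=drop_empty
)
```

Leaving `preprocess` as None keeps the current behaviour, so the hook is purely additive.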

danielballan · 2019-01-24T12:13:35Z

intake/source/base.py

+        out.description = self.description
+        metadata = {'type': 'persisted_dataset',
+                    'timestamp': datetime.datetime.now().isoformat(),
+                    'previous_metadata': self.metadata,


What would the consequences be of leaving metadata = self.metadata and putting the persistence-related info beside it in a new attribute rather than nesting?
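
To make the two layouts concrete, here is a small comparison. It is a sketch under assumptions: only the keys visible in the diff are included, and the `PersistedSource` class and its `persist_info` attribute are hypothetical names for the "new attribute" alternative.

```python
import datetime


def nested_metadata(source_metadata):
    """Layout in the diff: the original metadata moves under 'previous_metadata'."""
    return {
        "type": "persisted_dataset",
        "timestamp": datetime.datetime.now().isoformat(),
        "previous_metadata": source_metadata,
    }


class PersistedSource:
    """Alternative layout: metadata is untouched, persistence info sits beside it."""

    def __init__(self, source_metadata):
        self.metadata = source_metadata
        # Hypothetical attribute holding the persistence-related details.
        self.persist_info = {
            "type": "persisted_dataset",
            "timestamp": datetime.datetime.now().isoformat(),
        }
```

With nesting, code that reads a key such as `metadata['plots']` has to go through `metadata['previous_metadata']` first; with the separate attribute, existing lookups keep working unchanged.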

danielballan · 2019-01-24T12:14:12Z

intake/container/dataframe.py

+        ----------
+        source: a DataSource instance to save
+        name: str or None
+            Key to refer to this persisted dataset by. If nto given, will


danielballan · 2019-01-24T12:15:39Z

intake/source/base.py

@@ -280,6 +280,26 @@ def hvplot(self):
        """
        return self.plot

+    def persist(self, name, **kwargs):


Do you have a use case in mind for name? Is there ever a circumstance where it should not be source.name?

martindurant · 2019-02-08T15:27:07Z

Additional suggestions for persistence, which may appear in another PR:

- multiple checkpointing of some data source, such that you not only have the latest version, but can access previous ones
- an "export" function that does the same as persist, but not into the persisted things location: you get files and a catalogue wherever specified. This could be done to an already-persisted source, which would mean just copying the files.

martindurant · 2019-02-14T16:50:13Z

Filled out the API with what I imagine to be the final user experience, at least for the first iteration. No tests or documentation yet, beyond basic docstrings.

jsignell reviewed Jan 21, 2019

Martin Durant added 3 commits January 23, 2019 16:04

Merge branch 'master' into persisting

9fafe92

Merge branch 'master' into persisting

4561917

Rename persist -> _persist for implementation

31ee7c6

danielballan reviewed Jan 24, 2019

martindurant added this to In progress in intake Jan 24, 2019

Martin Durant added 2 commits January 28, 2019 16:55

Plumbing

b2b048a

Giving sources a reliable hash (enables equality and keying) and ensuring that sources know how they were made, if they were made by a catalog (helps will full provanance)

Upvamp textfile source

1bfff72

martindurant mentioned this pull request Jan 29, 2019

JSON support #197

Closed

Merge branch 'master' into persisting

6542747

martindurant force-pushed the master branch from 1bfff72 to 6b084f4 Compare January 29, 2019 15:17

Martin Durant added 6 commits January 29, 2019 11:11

Correct simpler review comments

d26198e

small massage

4a047f7

Merge branch 'master' into persisting

73f3a93

Begin persistance store

97e98f6

point

57ed90e

more plumbing

abaf5b2

martindurant self-assigned this Jan 31, 2019

martindurant added enhancement in progress labels Jan 31, 2019

martindurant mentioned this pull request Feb 6, 2019

Cache files get created in the current directory #254

Closed

Martin Durant added 2 commits February 12, 2019 11:26

Merge branch 'master' into persisting

6ae903f

Fill out API

a9b5f17

plus fix knockon tests

martindurant changed the title ~~WIP: add data-source persistence~~ add data-source persistence Feb 14, 2019

Martin Durant added 2 commits February 16, 2019 10:13

update some docs

c83094b

Add first test

a1f6a7f

Martin Durant added 4 commits February 17, 2019 14:56

Add zarr as persist mechanism for arrays

ff12a06

Add zarr test

4b17b77

more

e4d35fb

Add basic docs

3456ce7

martindurant mentioned this pull request Feb 18, 2019

Intake breaks merge key functionality in yaml #269

Closed

Martin Durant added 10 commits February 18, 2019 13:18

revert some posixpath occurances

b6af4d1

typo

c28d7ab

Add C-F to install

083ae64

small changes

b415da4

revert more posixpaths

2642719

Merge branch 'master' into persisting

6fa35b5

Add coverage for compressions

82fd5ff

also compress infer

3f6a1aa

simplify persist test for win

8af5527

Add nested cat docs

a3e1ef0

martindurant added needs review and removed in progress labels Feb 19, 2019

martindurant mentioned this pull request Feb 20, 2019

Caching SQL queries with expiration custom driver #273

Closed

Martin Durant added 3 commits February 20, 2019 15:12

Add docs page for persist

35f0e4d

fix for win's slow refresh

936a6b9

add sleeps for windows test

c6088ca

martindurant merged commit 54b2683 into master Feb 20, 2019

martindurant deleted the persisting branch February 20, 2019 22:03

martindurant removed the needs review label Feb 20, 2019
