
Addressed feedback. #1

Merged
merged 3 commits into from
Apr 23, 2018

Conversation

mmccarty (Member)

No description provided.


     def read_chunked(self):
-        raise Exception('read_chunked not supported for xarray containers.')
+        self._load_metadata()
+        return self._ds

     def read_partition(self, i):
         raise Exception('read_partition not supported for xarray containers.')
mmccarty (Member Author):

@martindurant read_partition is not implemented yet. I'm wondering if we need to support it.

martindurant (Member):

Right. To my mind, it should be implemented at some point, but the input is a tuple. I would keep an exception for now (but should be NotImplemented, because we may do it later).
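As a sketch of the reviewer's suggestion - using a hypothetical stand-in class, not the PR's actual code - the method would raise `NotImplementedError` rather than a bare `Exception`:

```python
class XarraySourceSketch:
    """Hypothetical stand-in for the PR's source class (illustrative only)."""

    def read_partition(self, i):
        # NotImplementedError signals that partitioned reads are a planned
        # feature, not a permanent limitation.
        raise NotImplementedError(
            'read_partition is not yet supported for xarray containers.')
```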


-    def __init__(self, urlpath, xarray_kwargs=None, metadata=None):
+    def __init__(self, urlpath, chunks, xarray_kwargs=None, metadata=None):
mmccarty (Member Author):


Added chunks to the signature and docstring.
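A minimal sketch of how the new `chunks` argument might flow through to xarray - only the signature comes from the diff above; the body and class name are assumptions:

```python
class NetCDFSourceSketch:
    """Illustrative constructor; names mirror the diff, body is assumed."""

    def __init__(self, urlpath, chunks, xarray_kwargs=None, metadata=None):
        self.urlpath = urlpath
        self.chunks = chunks                  # e.g. {'time': 100} for dask-backed arrays
        self._kwargs = xarray_kwargs or {}
        self.metadata = metadata or {}
        self._ds = None

    def _open_dataset(self):
        # Deferred import so the sketch loads without xarray installed.
        import xarray as xr
        # Passing chunks= makes xarray back the variables with dask arrays.
        self._ds = xr.open_dataset(self.urlpath, chunks=self.chunks,
                                   **self._kwargs)
```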

@@ -70,17 +77,17 @@ def _get_schema(self):

     def read(self):
         self._load_metadata()
-        return self._ds
+        return self._ds.load()
mmccarty (Member Author):


Loads all the data.


     def read_partition(self, i):
         raise Exception('read_partition not supported for xarray containers.')

     def to_dask(self):
         self._load_metadata()
-        return self._ds.to_dask_dataframe()
+        return self.read_chunked()
mmccarty (Member Author):


read_chunked will return the dataset as a lazy container, which I believe is also what we want for the to_dask case.

martindurant (Member):

Agree, this should just return the xarray directly (which points to dask arrays); this would be self._ds if you agree that read_chunked should be an exception.
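Putting the thread's conclusions together - a hedged sketch of the agreed semantics, not the merged code - the three methods would behave as follows (`ds` stands in for a dask-backed `xarray.Dataset`):

```python
class XarrayContainerSketch:
    """Summarises the semantics discussed above (illustrative only)."""

    def __init__(self, ds):
        self._ds = ds  # a dask-backed xarray.Dataset in the real plugin

    def read(self):
        # Materialise all data into memory.
        return self._ds.load()

    def to_dask(self):
        # The dataset is already lazily chunked, so hand it back as-is.
        return self._ds

    def read_chunked(self):
        # No iterable chunk form is defined yet, per the review discussion.
        raise NotImplementedError('read_chunked not yet supported')
```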

mmccarty self-assigned this Mar 27, 2018
martindurant (Member) left a comment:

I like what I see here.

I feel some other members of the team should think about the issues around xarray in general.

You should enable Travis on this repo.

         self._ds = None
         super(NetCDFSource, self).__init__(
-            container=self.container,
+            container=None,
martindurant (Member):

Not xarray?

mmccarty (Member Author):


The container is not used since we are overloading the read* methods (Intake plugins the "hard" way). I'd rather give no information than false information.


     def read_chunked(self):
-        raise Exception('read_chunked not supported for xarray containers.')
+        self._load_metadata()
+        return self._ds
martindurant (Member):

Yeah, so this is approximately correct, but it doesn't give something you can iterate over. I'm not sure whether this should be an exception just like get_partition.
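For "something you can iterate over", here is a library-free sketch of chunked iteration along the first dimension - `iter_chunks` is a hypothetical helper, not part of the PR:

```python
def iter_chunks(seq, chunk_size):
    """Yield successive blocks of `seq` along its first dimension."""
    for start in range(0, len(seq), chunk_size):
        yield seq[start:start + chunk_size]

# Ten elements in blocks of four -> three blocks, the last one short.
blocks = list(iter_chunks(list(range(10)), 4))
# blocks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```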



martindurant (Member):

The Travis script seems to be missing install dependencies.
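A minimal sketch of what the missing install section might look like - the package names and Python version are assumptions, not taken from the repo:

```yaml
language: python
python:
  - "3.6"
install:
  - pip install xarray dask netcdf4 pytest
  - pip install -e .
script:
  - pytest
```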

mmccarty (Member Author) commented Mar 31, 2018 via email

martindurant (Member):

That's fine, enjoy your relaxation

martindurant (Member):

Notes on our conversation about the future of this PR/repo (cc @jbcrail , but note that @mmccarty will presumably return to work on this when he has time). I believe the following is a reasonable course of action.

  • A new container type, "xarray", should be accepted by Intake. The built-in functionality in the source class will be overridden.
  • We must consider carefully what this means for an xarray opened on an Intake server - presumably communication will be the same as for an ndarray, which doesn't yet exist (right?). Note that being able to load netCDF/HDF from a remote location would be a huge boon - there are servers whose only job is exactly that, because it is so useful - so can we make it happen? We would need to create each variable as a dask array, where any chunk calls the server with its multi-dimensional index, and create a local xarray that stores these dask arrays in the same arrangement and with the same metadata as the remote one.
  • The natural representation of an open xarray is the open xarray object itself, and that is what discover() should return. Also, the arrays should be chunked from the start, so to_dask() is a no-op, and read() should call whatever xarray function materialises the data into memory.
  • This repo should be renamed intake-xarray and include three separate plugins: netCDF, which opens one or more files (these are separate functions in xarray); rasterio; and zarr. The last is the only one that can directly open files remotely, and we should take care to parse s3:, hdfs:, and gcs: URLs and create the mappers that zarr needs (I'll help with that). It would be nice if an unstructured zarr array returned an xarray DataArray rather than a Dataset, although maybe that should be a separate plugin. Note again that we have no array readers at all, not even numpy (never mind scientific formats).
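The per-chunk server-call idea above can be sketched without any libraries: enumerate the multi-dimensional block indices, and have each fetch request exactly one block. Both `chunk_indices` and `fetch_chunk` are illustrative names, not part of Intake:

```python
import itertools

def chunk_indices(shape, chunk_shape):
    """All multi-dimensional chunk indices for an array of `shape`
    split into blocks of `chunk_shape` (hypothetical helper)."""
    counts = [-(-s // c) for s, c in zip(shape, chunk_shape)]  # ceil division
    return list(itertools.product(*(range(n) for n in counts)))

def fetch_chunk(varname, index):
    """Stand-in for a server call: in the real design, each dask chunk
    would request exactly one block by its multi-dimensional index."""
    return {'var': varname, 'index': index}

# A 100x60 variable in 50x30 blocks -> 2x2 = 4 server calls.
calls = [fetch_chunk('temperature', ix)
         for ix in chunk_indices((100, 60), (50, 30))]
```

In the real plugin, each `fetch_chunk` result would become one block of a dask array, and the local xarray would assemble those dask arrays with the remote metadata.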

mmccarty (Member Author):

Thanks @martindurant, I'll try to find spare cycles to make progress. I went ahead and renamed this GH repo.

We should probably go ahead and merge this PR and create issues to follow up. For tracking, I moved this bullet list to an issue and created an issue for fixing CI.

martindurant (Member):

Merging to allow further progress in subsequent PRs.

martindurant merged commit d43c042 into master Apr 23, 2018
martindurant deleted the initial-feedback branch April 23, 2018 14:47