Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update streaming tutorial #1526

Merged
merged 17 commits into from
Aug 11, 2022
Merged

update streaming tutorial #1526

merged 17 commits into from
Aug 11, 2022

Conversation

bendichter
Copy link
Contributor

  • change fsspec from S3 to http protocol
  • intersphinx into dandi and fsspec

fix #1525

* change fsspec from S3 to http protocol
* intersphinx into dandi and fsspec
@bendichter
Copy link
Contributor Author

@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #1526 (3ce0a89) into dev (4d59cc7) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev    #1526   +/-   ##
=======================================
  Coverage   90.65%   90.65%           
=======================================
  Files          25       25           
  Lines        2494     2494           
  Branches      466      466           
=======================================
  Hits         2261     2261           
  Misses        148      148           
  Partials       85       85           
Flag Coverage Δ
integration 69.60% <ø> (ø)
unit 84.16% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
Comment on lines -76 to -88
# .. code-block:: python
#
# import s3fs
# import pynwb
# import h5py
#
# fs = s3fs.S3FileSystem(anon=True)
#
# f = fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", 'rb')
# file = h5py.File(f)
# io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)
#
# io.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the s3fs builds on fsspec and since the s3fs code is simpler than using the full fsspec I think it would be nice to add a subsection to the Streaming Method 2: fsspec section that shows this code example for s3fs as well. In this way you can still keep the tutorial to the two main options but still mention thes3fs option, which may be sufficient and simpler for many users.

Copy link
Contributor Author

@bendichter bendichter Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So s3fs is a specific implementation of the fsspec protocol. I changed over to using the more generic fsspec library with http for three reasons:

  1. This way we can use the http path as input, which is the same path that the ros3 driver takes, so it is clear how the two methods relate to each other.
  2. Showing it this way makes it pretty clear how you would extend this to other backends.
  3. This method also shows how to use caching, which I think is an important feature that a lot of users will want to leverage.

Do you really think this workflow is more complex than the s3fs syntax?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing the fsspec example as the main example makes sense. If you think that showing the full s3fs example in addition is too confusing, then maybe we can just add a note along the lines of "The s3fs package also builds on fsspec and can be used to access S3 in a similar manner. s3fs provides additional convenience methods for interacting with S3 more specifically while fsspec provides greater flexibility with regard to supported remote file systems. Note, when using s3fs requires the use of s3 URIs instead of http. "

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best would be if someone checks for what s3fs might be providing "on top" or in addition to pure straight fsspec. E.g. it could be some better choices of blocksize etc. I would have at least compared performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it true that s3fs provides additional convenience methods for interacting with S3 more specifically?

s3fs implements in S3FileSystem which inherits from fsspec's AsyncFileSystem abstract class. My impression is that it just implements the method of a fsspec.FileSystem without much else besides maybe credential controls that we don't use. You can switch to an s3 backend by simply changing fs=fsspec.filesystem("http") to fs=fsspec.filesystem("s3"), but then you'll have to supply the s3 path, not the http path.

Copy link
Contributor

@oruebel oruebel Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to tell from the documentation what the specific differences are, but my impression is that s3fs adds support for async, improved logging, and S3 specific implementation of the file system functions. Purely to read an NWB file, this may not be big difference. Pending further investigation, maybe what we could do is to add a note to indicate that there are a growing number of libraries build on ffspec that provide functionality specific for different file systems (e.g., S3, Google Cloud, or Azure) and link to https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations. We could then post alternate examples, such as the code using S3FS, as GitHub Gists online and link to them from the same note. This would keep the docs clean and at the same time provide examples for folks that may want to use the more specific libraries build on fsspec. Given that we don't know how all the various options perform, we could also add a "disclaimer" along those lines.

@oruebel
Copy link
Contributor

oruebel commented Aug 4, 2022

Thanks @bendichter for adding the fsspec option and thanks @yarikoptic for investing and suggesting this option

Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
Copy link
Contributor

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments / recommendations

docs/gallery/advanced_io/streaming.py Show resolved Hide resolved
Comment on lines -76 to -88
# .. code-block:: python
#
# import s3fs
# import pynwb
# import h5py
#
# fs = s3fs.S3FileSystem(anon=True)
#
# f = fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", 'rb')
# file = h5py.File(f)
# io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)
#
# io.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best would be if someone checks for what s3fs might be providing "on top" or in addition to pure straight fsspec. E.g. it could be some better choices of blocksize etc. I would have at least compared performance.

docs/gallery/advanced_io/streaming.py Outdated Show resolved Hide resolved
docs/gallery/advanced_io/streaming.py Outdated Show resolved Hide resolved
docs/gallery/advanced_io/streaming.py Outdated Show resolved Hide resolved
docs/gallery/advanced_io/streaming.py Outdated Show resolved Hide resolved
docs/gallery/general/file.py Outdated Show resolved Hide resolved
bendichter and others added 4 commits August 4, 2022 15:01
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
undo formatting changes
remove changes to images tutorial
Co-authored-by: Ryan Ly <rly@lbl.gov>
@rly
Copy link
Contributor

rly commented Aug 8, 2022

This looks good to me. And fsspec works on Windows!

Co-authored-by: Ryan Ly <rly@lbl.gov>
@oruebel
Copy link
Contributor

oruebel commented Aug 9, 2022

HDF5 just published the following blog post https://www.hdfgroup.org/2022/08/cloud-storage-options-for-hdf5/ discussing the various cloud storage options with HDF5 and some of the trade-offs to consider. If we need a reference to point to, this may be a good link to point folks to for further reading about available options.

@oruebel
Copy link
Contributor

oruebel commented Aug 9, 2022

At some point it may be useful to also look at HSDS. From a PyNWB perspective it may be as simple as handing in a file handle from h5pyd, but I'm not sure. I've never used HSDS so I'm not sure how involved the setup on the server and client side are. @yarikoptic has DANDI ever looked at HSDS.

@bendichter bendichter merged commit 92aa13f into dev Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: support fsspec / opening from a file object
4 participants