HTTP file I/O backend #320
Merged
Conversation
Collaborator (Author):
Nice work on the test suite catching all my silly mistakes!
Member:
@bemoody please could you rebase this on the main branch?
This module contains functions and classes for accessing remote files via HTTP.

The RangeTransfer class provides an efficient interface for requesting and retrieving a range of bytes from a remote file, without knowing in advance whether the server supports random access.

The NetFile class implements a standard "buffered binary file" (BufferedIOBase) object that is accessed via HTTP. This class implements its own buffering, rather than implementing the RawIOBase API and using a BufferedReader, because the buffering behavior depends on the server (if the server doesn't support random access, we want to buffer the entire file at once). We also want to enable a mode where the entire file is explicitly buffered in advance, which may be more efficient when the caller intends to read the entire file.

The openurl function provides an open-like API for creating either a text or binary file object.
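As an illustration of the "buffered binary file" contract that NetFile satisfies, here is a minimal BufferedIOBase subclass over an in-memory buffer. The class name and structure are illustrative only, not the actual NetFile implementation:

```python
import io

class BufferedBytesFile(io.BufferedIOBase):
    """Toy read-only file object over a bytes buffer, sketching the
    read()/seek()/tell() contract that a BufferedIOBase must honor."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def read(self, size=-1):
        # A negative or None size means "read to end of file".
        if size is None or size < 0:
            chunk = self._data[self._pos:]
        else:
            chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = len(self._data) + offset
        return self._pos
```

Because NetFile presents this same interface, any consumer that accepts an ordinary binary file object can read from it without knowing the data comes over HTTP.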
This module provides test cases to check that the behavior of wfdb.io._url.NetFile conforms to the standard file object API.
This function requests the file index of the specified database, to check whether it exists, but does not actually read its contents. This is a waste of bandwidth and a waste of processing time on both ends. Change this to perform a HEAD request instead.
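A sketch of the existence check described above, using the stdlib urllib for illustration (the package itself uses requests, and the helper name here is hypothetical):

```python
import urllib.request
import urllib.error

def remote_file_exists(url, timeout=10):
    # Hypothetical helper: ask the server whether the file exists via a
    # HEAD request, so no response body is transferred at all.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False
```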
This function provides a simple wrapper for requests.get, handling the common case of reading the entire contents of a binary file and raising an exception if the file doesn't exist. This function is a temporary shim to assist in porting the wfdb.io.download functions to using wfdb.io._url.
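The shim's behavior can be sketched with stdlib tools (the real code wraps requests.get; the function name below is hypothetical):

```python
import urllib.request
import urllib.error

def read_remote_bytes(url):
    # Hypothetical stand-in for the requests.get wrapper: return the
    # entire body as bytes, and raise instead of returning the server's
    # error document when the file doesn't exist.
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        raise FileNotFoundError(f"{url}: HTTP {exc.code}") from exc
```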
Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than trying to parse the error document.
Note that this will handle errors (e.g. a file that we are not authorized to access) and raise an exception, rather than trying to parse the error document. Previously, 404 errors were handled, but other types of errors were not.
Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file.
Note that this will correctly handle remote files that do not support random access.
Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file. It will also correctly handle remote files that do not support random access.
This temporary function is no longer needed.
Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file.
Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than trying to parse the error document.
Note that this will avoid downloading the index since its contents are not used. Furthermore, when trying to enumerate annotation files, this will handle errors other than 404 errors, and raise an exception as appropriate.
file:// URLs are not supported by python-requests, but it is easy enough to support them here, and can be useful for testing remote-file APIs using local data.
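A minimal sketch of how file:// support can be layered on top of an HTTP client that lacks it (illustrative only; the PR's actual dispatch may differ):

```python
import urllib.parse
import urllib.request

def open_local_or_remote(url, mode="rb"):
    # Illustrative dispatcher: handle file:// URLs locally, since
    # python-requests does not support them, and hand everything else
    # to the normal HTTP backend.
    parts = urllib.parse.urlsplit(url)
    if parts.scheme == "file":
        # Convert the URL path back into a filesystem path.
        path = urllib.request.url2pathname(parts.path)
        return open(path, mode)
    raise NotImplementedError("non-file URLs go through the HTTP backend")
```

This keeps the remote-file code paths testable against local data without a web server.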
If a normal exception occurs while reading an HTTP response, we want to read the remaining data so that the connection can be reused for another transfer. If the RangeTransfer is deleted without calling close() or __exit__(), the response object must still be explicitly closed so that it no longer counts against the connection pool limit. This doesn't happen automatically when a Response is garbage-collected, which is probably a bug in python-requests.
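The cleanup discipline described above follows the usual context-manager pattern; a generic sketch with hypothetical names (the real RangeTransfer works against a requests Response):

```python
class RangeTransferSketch:
    """Illustrative skeleton of the close-on-error discipline: drain the
    response on orderly exit so the pooled connection can be reused, and
    always close it so it stops counting against the pool limit."""

    def __init__(self, response):
        self._response = response

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None or issubclass(exc_type, Exception):
            # Normal exit or an ordinary exception: read any leftover
            # data so the connection stays reusable for the next request.
            self._response.drain()
        self._response.close()
        return False

    def __del__(self):
        # Garbage collection of the response alone does not release the
        # pool slot, so close explicitly here as a last resort.
        self._response.close()
```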
Member:
Thanks @bemoody, this sounds like a big improvement in the way remote files are handled. Please could you help to answer a couple of questions?
Collaborator (Author):
All of these methods are part of the Python file API (https://docs.python.org/3/library/io.html). The reason for implementing them is so that the NetFile object can be plugged into other libraries that expect a standard file object. In particular, this will work with the soundfile library for reading FLAC (and other audio formats).

Unfortunately, one function this *doesn't* work for is np.fromfile, but we can work around that.
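The np.fromfile limitation mentioned above can typically be worked around by reading the bytes through the file object's standard API and decoding from memory. A stdlib analogue using struct (with numpy the decode step would be np.frombuffer(data, dtype="&lt;i2")); the helper name is hypothetical:

```python
import struct

def read_int16_samples(fileobj, count):
    # Workaround pattern: np.fromfile needs a real OS-level file, so
    # instead read the raw bytes from any file-like object and decode
    # the little-endian 16-bit samples from memory.
    data = fileobj.read(2 * count)
    return list(struct.unpack("<%dh" % count, data))
```

This works with any object exposing read(), including a NetFile-style HTTP-backed file.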
Member:
Thanks for explaining. Looks good to me!
When reading remote files via HTTP(S), currently a variety of ad-hoc methods are used: e.g., there's one function for retrieving and parsing a header file, another function for retrieving a segment of a signal file, and another for retrieving an annotation file.
This isn't conducive to reading complex data formats, because every format needs to be implemented twice (once using the standard file API, for local files, and once using the requests API, for remote files). This leads to bugs and inconsistencies.
Furthermore, it is desirable to reuse sockets, set a meaningful User-Agent, and retain cookies. This requires handling all of the package's HTTP requests in a central place, rather than each function calling requests.get or requests.head.

This pull request introduces a function openurl which opens a remote URL as a file object. This object implements the standard Python file API, so you can read() and seek() and do everything else that you'd do with a normal Python file. This is analogous to the struct netfile used by libwfdb.

Here, I've simply modified the wfdb.io.download functions to use this backend instead of calling requests directly. In future pull requests I'll try to remove as much as I can of the extraneous duplicated logic.