Skip to content

Conversation

@bemoody
Copy link
Collaborator

@bemoody bemoody commented Aug 13, 2021

When reading remote files via HTTP(S), currently a variety of ad-hoc methods are used: e.g., there's one function for retrieving and parsing a header file, another function for retrieving a segment of a signal file, and another for retrieving an annotation file.

This isn't conducive to reading complex data formats, because every format needs to be implemented twice (once using the standard file API, for local files, and once using the requests API, for remote files.) This leads to bugs and inconsistencies.

Furthermore, it is desirable to reuse sockets, set a meaningful User-Agent, and retain cookies. This requires handling all of the package's HTTP requests in a central place, rather than each function calling requests.get or requests.head.

This pull request introduces a function openurl which opens a remote URL as a file object. This object implements the standard Python file API, so you can read() and seek() and do everything else that you'd do with a normal Python file. This is analogous to the struct netfile used by libwfdb.

Here, I've simply modified the wfdb.io.download functions to use this backend instead of calling requests directly. In future pull requests I'll try to remove as much as I can of the extraneous duplicated logic.

@bemoody
Copy link
Collaborator Author

bemoody commented Aug 13, 2021

Nice work on the test suite catching all my silly mistakes!

@tompollard
Copy link
Member

@bemoody please could you rebase this on the main branch?

Benjamin Moody added 24 commits September 13, 2021 12:27
This module contains functions and classes for accessing remote files
via HTTP.

The RangeTransfer class provides an efficient interface for requesting
and retrieving a range of bytes from a remote file, without knowing in
advance if the server supports random access.

The NetFile class implements a standard "buffered binary file"
(BufferedIOBase) object that is accessed via HTTP.  This class
implements its own buffering, rather than implementing the RawIOBase
API and using a BufferedReader, because the buffering behavior depends
on the server (if the server doesn't support random access, we want to
buffer the entire file at once.)  We also want to enable a mode where
the entire file is explicitly buffered in advance, which may be more
efficient when the caller intends to read the entire file.

The openurl function provides an open-like API for creating either a
text or binary file object.
This module provides test cases to check that the behavior of
wfdb.io._url.NetFile conforms to the standard file object API.
This function requests the file index of the specified database, to
check whether it exists, but does not actually read its contents.
This is a waste of bandwidth and a waste of processing time on both
ends.  Change this to perform a HEAD request instead.
This function provides a simple wrapper for requests.get, handling the
common case of reading the entire contents of a binary file and
raising an exception if the file doesn't exist.  This function is a
temporary shim to assist in porting the wfdb.io.download functions to
using wfdb.io._url.
Note that this will handle errors (e.g. a file that does not exist)
and raise an exception, rather than trying to parse the error
document.
Note that this will handle errors (e.g. a file that we are not
authorized to access) and raise an exception, rather than trying to
parse the error document.  Previously, 404 errors were handled, but
other types of errors were not.
Note that this will handle errors (e.g. a file that we are not
authorized to access) and raise an exception, rather than trying to
parse the error document.  Previously, 404 errors were handled, but
other types of errors were not.
Note that this will handle errors (e.g. a file that does not exist)
and raise an exception, rather than writing the error document to the
output file.
Note that this will correctly handle remote files that do not support
random access.
Note that this will handle errors (e.g. a file that does not exist)
and raise an exception, rather than writing the error document to the
output file.

It will also correctly handle remote files that do not support random
access.
This temporary function is no longer needed.
Note that this will handle errors (e.g. a file that does not exist)
and raise an exception, rather than writing the error document to the
output file.
Note that this will handle errors (e.g. a file that does not exist)
and raise an exception, rather than trying to parse the error
document.
Note that this will avoid downloading the index since its contents are
not used.

Furthermore, when trying to enumerate annotation files, this will
handle errors other than 404 errors, and raise an exception as
appropriate.
file:// URLs are not supported by python-requests, but it is easy
enough to support them here, and can be useful for testing remote-file
APIs using local data.
If a normal exception occurs while reading an HTTP response, we want
to read the remaining data so that the connection can be reused for
another transfer.

If the RangeTransfer is deleted without calling close() or __exit__(),
the response object must still be explicitly closed so that it no
longer counts against the connection pool limit.  This doesn't happen
automatically when a Response is garbage-collected, which is probably
a bug in python-requests.
@tompollard
Copy link
Member

thanks @bemoody, this sounds like a big improvement in the way remote files are handled. please could you help to answer a couple of questions?

  • what is the purpose of the duplicate functions (e.g. read1, readinto1)? i.e. why not call read and readinto directly?
  • what is the purpose of the readable() and seekable() methods on NetFile when these always return True?

@bemoody
Copy link
Collaborator Author

bemoody commented Sep 14, 2021 via email

@tompollard
Copy link
Member

thanks for explaining. looks good to me!

@tompollard tompollard merged commit 6be3066 into master Sep 15, 2021
@tompollard tompollard deleted the netfiles branch September 15, 2021 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants