HTTP file I/O backend #320

bemoody · 2021-08-13T22:24:24Z

When reading remote files via HTTP(S), currently a variety of ad-hoc methods are used: e.g., there's one function for retrieving and parsing a header file, another function for retrieving a segment of a signal file, and another for retrieving an annotation file.

This isn't conducive to reading complex data formats, because every format needs to be implemented twice: once using the standard file API, for local files, and once using the requests API, for remote files. This leads to bugs and inconsistencies.

Furthermore, it is desirable to reuse sockets, set a meaningful User-Agent, and retain cookies. This requires handling all of the package's HTTP requests in a central place, rather than each function calling requests.get or requests.head.

This pull request introduces a function openurl which opens a remote URL as a file object. This object implements the standard Python file API, so you can read() and seek() and do everything else that you'd do with a normal Python file. This is analogous to the struct netfile used by libwfdb.

Here, I've simply modified the wfdb.io.download functions to use this backend instead of calling requests directly. In future pull requests I'll try to remove as much as I can of the extraneous duplicated logic.
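As a rough illustration of what a file-object backend makes possible (this is a sketch, not code from this pull request): a reader written once against the file API works for local and remote data alike. read_block is a hypothetical helper, and io.BytesIO stands in for the object that openurl would return.

```python
import io


def read_block(fileobj, offset, length):
    """Read exactly `length` bytes at `offset` from any seekable binary file."""
    fileobj.seek(offset)
    data = fileobj.read(length)
    if len(data) != length:
        raise EOFError(
            f"expected {length} bytes at offset {offset}, got {len(data)}"
        )
    return data


# Stand-in for a remote file: anything with seek() and read() will do.
fake_remote = io.BytesIO(bytes(range(16)))
block = read_block(fake_remote, 4, 3)
```

The same function could be handed a file opened with the built-in open(), so the format logic only needs to be written once.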

bemoody · 2021-08-13T22:32:05Z

Nice work on the test suite catching all my silly mistakes!

tompollard · 2021-09-13T03:36:29Z

@bemoody please could you rebase this on the main branch?

This module contains functions and classes for accessing remote files via HTTP.

The RangeTransfer class provides an efficient interface for requesting and retrieving a range of bytes from a remote file, without knowing in advance if the server supports random access.

The NetFile class implements a standard "buffered binary file" (BufferedIOBase) object that is accessed via HTTP. This class implements its own buffering, rather than implementing the RawIOBase API and using a BufferedReader, because the buffering behavior depends on the server (if the server doesn't support random access, we want to buffer the entire file at once). We also want to enable a mode where the entire file is explicitly buffered in advance, which may be more efficient when the caller intends to read the entire file.

The openurl function provides an open-like API for creating either a text or binary file object.
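A minimal sketch of the "without knowing in advance" part, assuming ordinary HTTP range semantics: a server that honors the Range header answers 206 with a Content-Range header, while a server that ignores it answers 200 with the whole file. range_header and interpret_range_response are hypothetical names for illustration; this is not the RangeTransfer implementation.

```python
import re

# Matches e.g. "bytes 10-19/100" or "bytes 10-19/*".
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def range_header(start, end):
    """Build a Range header for the half-open byte interval [start, end)."""
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={start}-{end - 1}"}


def interpret_range_response(status, headers, body, start, end):
    """Return (data, total_size, is_partial) for a request of [start, end).

    `headers` is a plain dict here, so the lookup is case-sensitive.
    `total_size` is None when the server does not report it.
    """
    if status == 206:
        match = _CONTENT_RANGE.fullmatch(headers.get("Content-Range", ""))
        if match is None:
            raise ValueError("206 response without a usable Content-Range")
        first = int(match.group(1))
        total = None if match.group(3) == "*" else int(match.group(3))
        skip = start - first
        if skip < 0:
            raise ValueError("server returned a range that starts too late")
        return body[skip:skip + (end - start)], total, True
    if status == 200:
        # The Range header was ignored: the body is the entire file, so a
        # caller may as well keep all of it buffered.
        return body[start:end], len(body), False
    raise ValueError(f"unexpected HTTP status {status}")
```

In the 200 case the whole file has already been transferred, which is why buffering it in full is the natural choice for servers without random access.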

This function requests the file index of the specified database, to check whether it exists, but does not actually read its contents. This is a waste of bandwidth and a waste of processing time on both ends. Change this to perform a HEAD request instead.
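For illustration only, a HEAD-based existence check might look like the sketch below. It uses the standard library's urllib rather than the code in this pull request, and make_head_request and url_exists are hypothetical names.

```python
import urllib.error
import urllib.request


def make_head_request(url):
    """Build a request that asks for the headers only, not the body."""
    return urllib.request.Request(url, method="HEAD")


def url_exists(url, timeout=10):
    """Return True if the URL answers a HEAD request with a 2xx status.

    A 404 is reported as False; any other HTTP error is re-raised so that
    it is not mistaken for a missing resource.
    """
    try:
        with urllib.request.urlopen(
            make_head_request(url), timeout=timeout
        ) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```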

This function provides a simple wrapper for requests.get, handling the common case of reading the entire contents of a binary file and raising an exception if the file doesn't exist. This function is a temporary shim to assist in porting the wfdb.io.download functions to using wfdb.io._url.
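A sketch of the "raise an exception rather than parse the error document" idea, written as a pure function over the status code. The mapping to FileNotFoundError and PermissionError is an assumption made for this example, and check_status is a hypothetical name, not the shim itself.

```python
def check_status(status, url):
    """Raise an OSError subclass for any non-2xx HTTP status.

    The choice of exception types is illustrative only.
    """
    if 200 <= status < 300:
        return
    if status == 404:
        raise FileNotFoundError(f"{url}: not found (HTTP 404)")
    if status in (401, 403):
        raise PermissionError(f"{url}: access denied (HTTP {status})")
    raise OSError(f"{url}: HTTP error {status}")
```

Because both specific exceptions derive from OSError, a caller that only wants to know whether the read failed can catch OSError, while one that needs to tell a missing file from a forbidden one can catch the subclasses.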

If a normal exception occurs while reading an HTTP response, we want to read the remaining data so that the connection can be reused for another transfer. If the RangeTransfer is deleted without calling close() or __exit__(), the response object must still be explicitly closed so that it no longer counts against the connection pool limit. This doesn't happen automatically when a Response is garbage-collected, which is probably a bug in python-requests.
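The drain-then-close behavior can be sketched roughly as follows. DrainingTransfer is a hypothetical class wrapping any object with read() and close(); treating "normal exception" as "subclass of Exception" is an assumption of this sketch, and io.BytesIO stands in for a real HTTP response.

```python
import io


class DrainingTransfer:
    """Hypothetical wrapper around a response-like object."""

    def __init__(self, response, chunk_size=65536):
        self._response = response
        self._chunk_size = chunk_size

    def read(self, size):
        return self._response.read(size)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        response = self._response
        # Drain on a clean exit or an ordinary Exception so the connection
        # could be reused; skip draining for e.g. KeyboardInterrupt.
        if response is not None and (
            exc_type is None or issubclass(exc_type, Exception)
        ):
            while response.read(self._chunk_size):
                pass
        self.close()
        return False  # never swallow the exception

    def close(self):
        # getattr guards against __init__ having failed part-way.
        response = getattr(self, "_response", None)
        if response is not None:
            self._response = None
            response.close()

    def __del__(self):
        # Close explicitly even if the caller forgot to.
        self.close()


response = io.BytesIO(b"x" * 1000)  # stand-in for an HTTP response body
with DrainingTransfer(response) as transfer:
    head = transfer.read(10)
```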

tompollard · 2021-09-14T17:54:52Z

thanks @bemoody, this sounds like a big improvement in the way remote files are handled. please could you help to answer a couple of questions?

what is the purpose of the duplicate functions (e.g. read1, readinto1)? i.e. why not call read and readinto directly?
what is the purpose of the readable() and seekable() methods on NetFile when these always return True?

bemoody · 2021-09-14T20:44:26Z

All of these methods are part of the Python file API (https://docs.python.org/3/library/io.html). The reason for implementing them is so that the NetFile object can be plugged into other libraries that expect a standard file object. In particular, this will work with the soundfile library for reading FLAC (and other audio formats). Unfortunately, one function this *doesn't* work for is np.fromfile, but we can work around that.
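To make that concrete, here is a small sketch (not the NetFile code) of a read-only BufferedIOBase subclass over an in-memory bytes value. BytesBackedFile is a hypothetical name. Standard wrappers such as io.TextIOWrapper query methods like seekable() and use read1() on the object they wrap, which is why the class defines them even though they look trivial.

```python
import io


class BytesBackedFile(io.BufferedIOBase):
    """Hypothetical read-only file object backed by a bytes value."""

    def __init__(self, data):
        self._data = bytes(data)
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = len(self._data) + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if target < 0:
            raise ValueError("negative seek position")
        self._pos = target
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            end = len(self._data)
        else:
            end = self._pos + size
        chunk = self._data[self._pos:end]
        self._pos += len(chunk)
        return chunk

    def read1(self, size=-1):
        # For a network file this would mean "at most one underlying
        # request"; for in-memory data it is the same as read().
        return self.read(size)


# The standard text layer can now sit on top of the custom binary object.
text = io.TextIOWrapper(BytesBackedFile(b"alpha\nbeta\n"), encoding="ascii")
lines = text.readlines()
```

readinto() is inherited from BufferedIOBase here, whose default implementation is built on read().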

tompollard · 2021-09-15T15:03:43Z

thanks for explaining. looks good to me!

bemoody force-pushed the netfiles branch from 38ff040 to 75a950e · August 13, 2021 22:30

bemoody mentioned this pull request Aug 20, 2021

Adds Amazon AWS S3 support #268 #298

Closed

Benjamin Moody added 24 commits September 13, 2021 12:27

New module tests.test_url.

26fc602

This module provides test cases to check that the behavior of wfdb.io._url.NetFile conforms to the standard file object API.

_stream_header: use _get_url in place of requests.get.

b4bd57d

_stream_annotation: use _get_url in place of requests.get.

a173913

get_dbs: use _get_url in place of requests.get.

7f090fe

Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than trying to parse the error document.

get_record_list: use _get_url in place of requests.get.

3cec71f

Note that this will handle errors (e.g. a file that we are not authorized to access) and raise an exception, rather than trying to parse the error document. Previously, 404 errors were handled, but other types of errors were not.

get_annotators: use _get_url in place of requests.get.

c0a6a41

Note that this will handle errors (e.g. a file that we are not authorized to access) and raise an exception, rather than trying to parse the error document. Previously, 404 errors were handled, but other types of errors were not.

dl_full_file: use _get_url in place of requests.get.

9c5fdba

Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file.

Use openurl in place of _get_url.

d95af13

_remote_file_size: use openurl in place of requests.head.

442383b

_stream_dat: use openurl in place of requests.get.

ac3a0b5

Note that this will correctly handle remote files that do not support random access.

dl_pn_file: use openurl in place of requests.get.

95f4822

Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file. It will also correctly handle remote files that do not support random access.

dl_files: use openurl in place of requests.head.

6b93950

wfdb.io.download: remove _get_url function.

5508edc

This temporary function is no longer needed.

wfdb.io.download: do not import requests.

6ff2f00

edf2mit, wav2mit: use openurl in place of requests.get.

7b6b15f

Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than writing the error document to the output file.

get_version: use openurl in place of requests.get.

4ffdc51

Note that this will handle errors (e.g. a file that does not exist) and raise an exception, rather than trying to parse the error document.

dl_database: use openurl in place of requests.

0760fac

Note that this will avoid downloading the index since its contents are not used. Furthermore, when trying to enumerate annotation files, this will handle errors other than 404 errors, and raise an exception as appropriate.

wfdb.io.record: do not import requests.

244b8fe

wfdb.processing.evaluate: do not import requests.

649535d

openurl: support file:// URLs.

390411e

file:// URLs are not supported by python-requests, but it is easy enough to support them here, and can be useful for testing remote-file APIs using local data.
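A minimal sketch of how a file:// URL can be mapped onto a local file using only the standard library. open_file_url is a hypothetical helper for illustration and is not the code added to openurl.

```python
import pathlib
import tempfile
import urllib.parse
import urllib.request


def open_file_url(url, mode="rb"):
    """Open a file:// URL as an ordinary local file."""
    parts = urllib.parse.urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file:// URL: {url}")
    if parts.netloc not in ("", "localhost"):
        raise ValueError(f"remote host in file:// URL: {url}")
    # url2pathname undoes percent-encoding and applies platform path rules.
    path = urllib.request.url2pathname(parts.path)
    return open(path, mode)


# Round trip through a temporary file whose name needs percent-encoding.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample record.txt"
    sample.write_bytes(b"local test data\n")
    with open_file_url(sample.as_uri()) as f:
        contents = f.read()
```

This is the kind of shortcut that lets tests exercise remote-file code paths against local fixtures without a web server.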

bemoody force-pushed the netfiles branch from e9ceb8f to 34c6937 · September 13, 2021 16:56

tompollard merged commit 6be3066 into master Sep 15, 2021

tompollard deleted the netfiles branch September 15, 2021 15:03
