Further TypeScript I/O optimizations #122

bjornharrtell opened this issue May 6, 2021 · 10 comments

@bjornharrtell
Member

Some ideas for even more I/O optimization:

  • Configurable minimum feature buffer size
  • Buffer size doubling on subsequent requests (up to a default and configurable maximum)
  • Optional cache/reuse of buffered data for header and index for separate requests against the same URL

Any opinions @michaelkirk?
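
Roughly, the knobs could look something like this (just a sketch; option names are hypothetical):

```ts
// Hypothetical tuning options, names not final.
interface HttpIoOptions {
  /** Byte size of the first feature request, e.g. 64 KiB. */
  minFeatureBufferSize: number;
  /** Cap for the doubling, e.g. 4 MiB. */
  maxFeatureBufferSize: number;
  /** Reuse buffered header/index bytes across requests to the same URL. */
  cacheHeaderAndIndex: boolean;
}

// Doubling rule for subsequent requests:
const nextSize = (prev: number, opts: HttpIoOptions) =>
  Math.min(prev * 2, opts.maxFeatureBufferSize);
```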

@michaelkirk
Collaborator

michaelkirk commented May 6, 2021

> Optional cache/reuse of buffered data for header and index for separate requests against the same URL

Currently the JS API is kind of a one-shot thing. I think to support this we might want to change the API to expose something like the HttpReader that the user can repeatedly call setBBox on. What do you think? Or were you thinking of some kind of global hidden cache?
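
For illustration, the kind of shape I have in mind (an entirely hypothetical API, nothing like this exists yet):

```ts
// Hypothetical reusable reader; open() would fetch header + index once.
const reader = await HttpReader.open('https://example.com/data.fgb');

// Each query reuses the already-buffered header and index.
for await (const feature of reader.selectBbox({ minX: 0, minY: 0, maxX: 1, maxY: 1 })) {
  console.log(feature);
}
for await (const feature of reader.selectBbox({ minX: 5, minY: 5, maxX: 6, maxY: 6 })) {
  console.log(feature);
}
```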

> Buffer size doubling on subsequent requests (up to a default and configurable maximum)
> Configurable minimum feature buffer size

On a slightly related tangent, I have a question about the FGB format: Is it safe to assume that features are tightly packed (adjacent, unpadded) in the feature portion of the buffer?

If so, can we then deduce the precise byte-range of each feature using only the index, by taking the difference between adjacent leaf nodes, i.e. `Range: ${leaf_node[i].offset}-${leaf_node[i+1].offset - 1}`?

If we can in fact determine the precise byte-range for each feature, we could trivially merge adjacent byte ranges to combine requests. And, like with the new index traversal logic, there's probably a reasonable heuristic for merging non-adjacent ranges, provided they're "close enough" to be worth saving a request. That ought to give us good performance and doesn't require any user configuration.
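
To make the idea concrete, a minimal sketch, assuming packed features and leaf nodes sorted by offset (all names here are made up):

```ts
// Sketch: derive each feature's byte range from adjacent leaf-node offsets.
// `featuresByteLength` is the total size of the feature section.
interface ByteRange {
  start: number; // inclusive
  end: number;   // inclusive
}

function featureRanges(leafOffsets: number[], featuresByteLength: number): ByteRange[] {
  return leafOffsets.map((offset, i) => ({
    start: offset,
    end: (i + 1 < leafOffsets.length ? leafOffsets[i + 1] : featuresByteLength) - 1,
  }));
}

// Merge ranges whose gap is small enough that one bigger request beats an
// extra round trip. `maxGap` is the heuristic threshold in bytes.
function mergeRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  const merged: ByteRange[] = [];
  for (const r of ranges) {
    const last = merged[merged.length - 1];
    if (last !== undefined && r.start - last.end - 1 <= maxGap) {
      last.end = r.end;
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```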

All of that's predicated on my assumption that we can use the index in this way... what do you think?

If I'm mistaken, and that's not possible or inordinately complicated, then I support your two proposals.

@michaelkirk
Collaborator

michaelkirk commented May 6, 2021

Another thought: we could very crudely estimate feature size by taking feature_buffer_size / total_feature_count. By the time we're fetching features, we know how many features we're fetching, so if we want to, say, do it in approximately 4 requests, our buffer size could be:

```ts
// Try to get all the features in approximately 4 requests.
const query_count_goal = 4;
const avg_feature_size = feature_buffer_size / total_feature_count;
// feature_count = how many features this query will fetch.
const buffer_size = (feature_count * avg_feature_size) / query_count_goal;
```

@bjornharrtell
Member Author

Yes, feature data is adjacent and unpadded, so your ideas are valid. 🙂

I was thinking of an opt-in global cache, yes. The underlying implementation could also be exposed for more advanced cases, but I want to provide as simple an API as possible for common use cases.
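
Very roughly, something like this (a sketch only; the function name and the 12 KiB prefetch size are illustrative):

```ts
// Sketch: opt-in module-level cache of header + index bytes, keyed by URL.
const headerIndexCache = new Map<string, ArrayBuffer>();

async function headerAndIndex(url: string, useCache: boolean): Promise<ArrayBuffer> {
  const cached = useCache ? headerIndexCache.get(url) : undefined;
  if (cached) return cached;
  // Fetch the first 12 KiB, covering the header and the top of the index.
  const resp = await fetch(url, { headers: { Range: 'bytes=0-12287' } });
  const buf = await resp.arrayBuffer();
  if (useCache) headerIndexCache.set(url, buf);
  return buf;
}
```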

@michaelkirk
Collaborator

Some more ideas for optimizing index traversal...

To recap a bit: for a large file like https://flatgeobuf.org/examples/leaflet/large.html, the vast majority of time spent on a bbox query is waiting on round-trip network requests.

We're currently quite naive and do all of these round trips serially. This is because, when traversing the index, we don't know precisely which children to request until we have their parent.

For simplicity of illustration, I'm ignoring the "fetch the first 12kb" optimization we currently have, but you can equally imagine that this is some subtree later in the index. I'm also showing a branching factor of 2, but again, it should generalize to any branching factor.

Conceptually, it looks something like this (the filled-in nodes are the ones we need, those within our bounding box):

[diagram: binary index tree with the needed, bbox-intersecting nodes filled in, fetched as serial requests R1–R8]

So, let's say that's 8 round-trip requests in the current situation, one after the other.

One way to improve on this is to recognize that every split into separate branches during tree traversal is an opportunity to issue requests concurrently, so our conceptual example could become:

option 1: concurrent requests
[diagram: the same tree, with the [R3, R4, R5] and [R6, R7, R8] request series issued concurrently after R2]

So, after completing R2, the [R3, R4, R5] series could happen concurrently with the [R6, R7, R8] series. That helps!
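
As a sketch of what option 1 could look like (the helpers and types are hypothetical stand-ins for the real traversal code):

```ts
interface Bbox { minX: number; minY: number; maxX: number; maxY: number }
interface IndexNode { isLeaf: boolean; bbox: Bbox }
declare function fetchChildren(node: IndexNode): Promise<IndexNode[]>; // one Range request
declare function fetchFeature(node: IndexNode): Promise<Uint8Array>;
declare function intersects(a: Bbox, b: Bbox): boolean;

// Whenever traversal splits into separate branches, each branch's request
// chain proceeds concurrently instead of waiting on its siblings.
async function traverse(node: IndexNode, bbox: Bbox): Promise<Uint8Array[]> {
  if (node.isLeaf) return [await fetchFeature(node)];
  const children = await fetchChildren(node);
  const hits = children.filter((c) => intersects(c.bbox, bbox));
  const branches = await Promise.all(hits.map((c) => traverse(c, bbox)));
  return branches.flat();
}
```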

Another observation: consider how, as an optimization, we fetch the first several levels of the tree at once, obviating a couple of serial fetches without fetching too much extra data, because the tree is small near the root.

Ignore for a minute that trees can have different branching factors. In theory, since we already have the first three levels locally, we don't even need to look at the first two levels: their only purpose is to tell us which level-3 nodes we need, and we already have all of them. So we could equally skip the level-1 and level-2 nodes and start by iterating through all the level-3 nodes, continuing with the ones that intersect our bbox.

We can generalize this to any subtree within the tree, not just the root. Provided we're willing to risk fetching up to N unused nodes per request, we can leap-frog over levels (skipping their requests) by simply fetching all the grandchildren (or great-grandchildren), provided there are fewer than N of them, and then using the grandchildren's bboxes to exclude them from the traversal. A sketch of this decision rule follows the example below.

So then our above example becomes like:
option 2: skip some requests
[diagram: the same tree, with intermediate levels skipped by fetching all grandchildren of locally held nodes in one request]
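
The heart of option 2 is just a budget check before each fetch; a minimal sketch, with `maxExtraNodes` as a made-up tuning knob:

```ts
// Sketch: decide whether to skip requesting the children of `nodeCount`
// locally held nodes and instead fetch all of their grandchildren in one
// request, risking at most `maxExtraNodes` unused nodes.
function canLeapFrog(nodeCount: number, branchingFactor: number, maxExtraNodes: number): boolean {
  const grandchildCount = nodeCount * branchingFactor ** 2;
  return grandchildCount <= maxExtraNodes;
}
```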

Note that option 1 and option 2 are independent. I think we could expect benefits from doing either or both of them.

The main issue I see is that it makes the traversal ever more complicated - I'd be very interested in less complicated solutions that reduce the number of round trips.

@michaelkirk
Collaborator

I'm also extremely curious about the long delay I'm seeing between the first and second requests... I haven't had a chance to look into what that's about.

https://flatgeobuf.org/examples/leaflet/large.html

[screenshot: network waterfall for large.html showing a long gap between the first and second requests]

Can you reproduce that?

@michaelkirk
Collaborator

michaelkirk commented May 19, 2021

And another interesting question: why does the first request for the FGB happen so much later than, e.g., the first map tile requests? (My guess: maybe I did something dumb in the event listener/init?)

[screenshot: network waterfall showing the first FGB request starting well after the map tile requests]

@michaelkirk
Collaborator

> And another interesting question: why does the first request for the FGB happen so much later than, e.g., the first map tile requests? (My guess: maybe I did something dumb in the event listener/init?)

🎯 Indeed I did something dumb in the examples - fixed in #130.

@michaelkirk
Collaborator

> I'm also extremely curious about the long delay I'm seeing between the first and second requests... I haven't had a chance to look into what that's about.

Discussed in #131

Basically, because we're using Range requests to access files on another domain, CORS applies and a preflight request is issued. The CDN used by the examples isn't currently configured to cache OPTIONS requests, so we wait a long time while the CDN round-trips to the origin.
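
For reference, the sort of response headers that let browsers cache the preflight result (illustrative values, not our actual CDN config):

```
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Range
Access-Control-Max-Age: 86400
```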

@bjornharrtell
Member Author

@michaelkirk I'll see what I can do about the CDN.

@michaelkirk
Collaborator

> If so, can we then deduce the precise byte-range of each feature using only the index, by taking the difference between adjacent leaf nodes, i.e. `Range: ${leaf_node[i].offset}-${leaf_node[i+1].offset - 1}`?

By way of an update, I've got a working prototype of using the index to right-size feature fetching requests: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/batch-features

For sanity, I'm waiting to open a cleaned-up PR until some housekeeping is addressed: #133, #134
