Auxiliary index #147

Open
jratike80 opened this issue Oct 5, 2021 · 11 comments

Comments

@jratike80

I wonder if it would be possible, even in theory, to build a sidecar index file that contains only the feature ID and the offset/range of that feature in the fgb file. The use case I have in mind is finding a cadastral parcel by ID, or a building by building ID, in a large collection. The index file would be small and fast to download, and clients could keep it in the cache.

I have been reading some articles where people do not seem to miss external dBase indexes at all, or they hate the bunch of files needed for a complete shapefile. However, in a cloud environment where every request has a price tag, it could be a benefit to download the whole index into a cache and after that send HTTP range requests only for fetching the data.
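As a rough illustration of the idea in this comment, here is a minimal Python sketch of a cached ID index being used to build an HTTP range request. The in-memory layout (`ids`, `spans`) and the function name `range_header` are hypothetical, not part of FlatGeobuf. Each entry carries its own length because features in a spatially ordered file are not stored in ID order.

```python
from bisect import bisect_left

# Hypothetical sidecar contents, sorted by feature ID. Each span is the
# (byte offset, byte length) of that feature inside the .fgb file.
ids = [1001, 1002, 1007, 1020]
spans = [(6100, 800), (5000, 320), (5800, 300), (5320, 480)]


def range_header(feature_id):
    """Return the HTTP Range header value for one feature, or None if absent."""
    i = bisect_left(ids, feature_id)
    if i == len(ids) or ids[i] != feature_id:
        return None
    offset, length = spans[i]
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"
```

A client would download the sidecar once, keep it cached, and then issue one ranged GET per lookup using the returned header value.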

@bjornharrtell
Member

bjornharrtell commented Oct 5, 2021

Don't mention the sidecar.. :P

Seriously though, unless I misunderstand, you want this because the built-in index is "too large" to fetch into cache, since your case only needs a feature id/offset index and not the spatial index? That's pretty domain specific in my opinion, and I would need more convincing arguments to introduce an official sidecar concept. In hindsight I did have some regret that I did not make room for other index types that would be ignored by reader implementations that do not understand them. Perhaps there are some tricks to introduce that in a non-breaking fashion, but I'm not really into increasing complexity/adding more index types at this time. But I would rather invest in that than introduce sidecars. Of course you could have unofficial sidecars if you like.. :P

Btw, an optimized feature id / offset index should probably use delta-encoded varints to make it even smaller for the use case where it is read in full and cached. It would not be too hard to produce such a custom index by processing the existing spatial index.
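To make the delta-encoded varint suggestion concrete, here is a small Python sketch. It assumes the values (for example, feature offsets in file order) are non-negative and non-decreasing, so every delta fits an unsigned varint. The function names are illustrative only.

```python
def encode_varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_deltas(values):
    """Store each value as the varint of its difference from the previous one."""
    out = bytearray()
    prev = 0
    for v in values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decode_deltas(data):
    """Inverse of encode_deltas: rebuild the original values by running sum."""
    values, total, cur, shift = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            total += cur
            values.append(total)
            cur, shift = 0, 0
    return values
```

For four offsets such as `[5000, 5320, 5800, 6100]` this produces 8 bytes, against 32 bytes for four fixed 64-bit integers. Values that are not sorted would need a signed scheme such as zigzag encoding, which this sketch leaves out.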

@jratike80
Author

Right, the spatial index works fine, but that's the only supported index type, and find-by-attribute is a rather common use case as well. I guess that writing indexes into a place where they are reasonably easy to find with range requests makes it somewhat heavy to finalize the optimized file. At least that's the case with Cloud Optimized GeoTIFFs, so why not with fgb as well.

I was wondering whether it could be faster/simpler to first write an fgb file with the spatial index only and create another index (or indexes) as a sidecar later, in no hurry, if it feels useful. So I was just thinking; I have not played much with fgb yet, even though it seems I still have the Finnish buildings as fgb left over from some old experiments: https://latuviitta.kapsi.fi/data/mtk/rakennus.fgb. Feel free to use it if you need such data. The source is the National Land Survey of Finland and the license is CC-BY 4.0.

@bjornharrtell
Member

I can agree that there are pros and cons to externalizing indexes, but that is equally true for the built-in spatial index. It can also be argued that it could be useful to externalize attributes and store them in columnar fashion if that is the main access pattern. I did think about this when designing flatgeobuf, and my main argument for not going down that road is that I saw simplicity as much more important than flexibility.

Attribute indexes are also complex because the type of index depends on the data. There is a reason there are multiple index types in PostgreSQL, and there is a reason a full-blown database is needed for attribute indexes. My rationale is that in most use cases for static spatial data the spatial index is the primary filter, and if you need more flexibility you should probably use a real database. I don't want to implement a database; that's what file geodatabases and GeoPackage do, and I don't like the cons that come with that. Even adding a b-tree type index adds complexity to the spec and implementations that I don't want.

@bjornharrtell
Member

Another reason I didn't want to externalize the spatial index in the first place is that I wanted to constrain it to be optimal and coupled to guaranteed spatially ordered data.

@bjornharrtell
Member

bjornharrtell commented Oct 5, 2021

That said, "exotic" things like a specialized btree index on an identifier-type attribute for fast id lookup could make sense to externalize and specify in a way that makes it clear they are optional and meant for specific use cases. After all, the biggest problem with shapefile sidecars is that a couple of them are non-optional (IMHO). But still, I'm hesitant to work on it or accept it. Perhaps I will reconsider when flatgeobuf is the most popular spatial format. :P

@jratike80
Author

Databases do not work in a cloud with just HTTP range requests. SQLite/GeoPackage does work to some extent, but still too many bytes must be read. Somehow I feel that somebody, some day, will advance queryable, serverless, cloud-pay-model-optimized, read-only vector storage to support attribute queries.

@bjornharrtell
Member

bjornharrtell commented Oct 5, 2021

@jratike80 I can't say I disagree. It would be kind of fun to do a POC of such an external index against an unindexed flatgeobuf, and describe that as a potential extension point to the format. I don't think it's a very difficult thing to do actually, but it's not on my agenda at this time.
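A proof of concept along these lines might start by scanning the feature section once and recording where each feature sits. The Python sketch below assumes each feature is stored as a 4-byte little-endian length prefix followed by its payload, and that the start of the feature section is already known; both are assumptions to check against the spec. Extracting the ID attribute from each payload is left out, and the helper names are hypothetical.

```python
import struct


def scan_offsets(buf, start=0):
    """Return (offset, total_length) for each size-prefixed record in buf.

    total_length includes the 4-byte length prefix itself.
    """
    spans = []
    pos = start
    while pos + 4 <= len(buf):
        (size,) = struct.unpack_from("<I", buf, pos)
        total = 4 + size
        if pos + total > len(buf):
            raise ValueError("truncated record")
        spans.append((pos, total))
        pos += total
    return spans


def make_record(payload):
    """Build one size-prefixed record (used here only to create test data)."""
    return struct.pack("<I", len(payload)) + payload
```

Pairing each span with the ID read from its payload would give the (id, offset, length) entries that an external index needs.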

@jratike80
Author

As I said, I was just pondering, without any exact use case or need for the feature.

@bjornharrtell
Member

@jratike80 and you kind of got me interested and not completely against the concept of externalized indexes.

@pka
Member

pka commented Oct 6, 2021

I support the idea of additional indexes. I have already had a use case that needed a non-geometric index, and for this I would like to experiment with a PGM index.

Why not have an extensible index structure in FlatBuffers format, e.g.:

enum IndexType : byte { HilbertRTree, Custom }

// Declared as a table: FlatBuffers structs cannot hold vectors such as [ubyte].
table Index {
  type:IndexType;
  index:[ubyte];
}

A reader could simply skip unknown indexes.
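The skipping behaviour can be sketched in Python as follows. The framing used here (a 1-byte type code and an 8-byte little-endian length in front of each index payload) is a hand-rolled assumption for illustration, not the FlatBuffers wire format and not anything defined by FlatGeobuf. The type codes mirror the enum above.

```python
import struct

HILBERT_RTREE = 0  # hypothetical codes mirroring the IndexType enum
CUSTOM = 1
KNOWN_TYPES = {HILBERT_RTREE}

HEADER = struct.Struct("<BQ")  # type code, payload length; no padding


def pack_index(index_type, payload):
    """Frame one index payload with its type and length."""
    return HEADER.pack(index_type, len(payload)) + payload


def read_known_indexes(buf):
    """Walk all framed indexes, keeping known types and skipping the rest."""
    found = {}
    pos = 0
    while pos < len(buf):
        index_type, size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if index_type in KNOWN_TYPES:
            found[index_type] = buf[pos:pos + size]
        pos += size  # the length lets the reader jump over unknown payloads
    return found
```

The point of the design is that the length travels with every index, so a reader never has to understand a payload in order to step past it.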

@bjornharrtell
Member

bjornharrtell commented Oct 6, 2021

@pka could work, but there are two drawbacks: it would require a major revision of the spec, and it would limit index size to about 2 GB. (The latter is why I didn't contain the current spatial index in flatgeobuf in a flatbuffer message.)
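To give the 2 GB figure some scale, here is a back-of-the-envelope Python calculation. The entry sizes are assumptions: 16 bytes for a plain id/offset pair (two 64-bit integers) and 40 bytes for a spatial index node (four 64-bit doubles for the bounding box plus a 64-bit offset). Inner tree nodes and any framing overhead are ignored.

```python
MAX_FLATBUFFER_BYTES = 2**31 - 1  # assumed cap, taken as "about 2 GB"

ID_OFFSET_ENTRY_BYTES = 16  # assumption: 8-byte id + 8-byte offset
RTREE_NODE_BYTES = 40       # assumption: 4 doubles + 8-byte offset


def max_entries(bytes_per_entry, limit=MAX_FLATBUFFER_BYTES):
    """How many fixed-size entries fit under the byte limit."""
    return limit // bytes_per_entry


print(max_entries(ID_OFFSET_ENTRY_BYTES))  # id/offset pairs
print(max_entries(RTREE_NODE_BYTES))       # spatial index leaf nodes
```

Under these assumptions the cap works out to roughly 134 million id/offset pairs or roughly 54 million spatial leaf nodes, which is within reach of national-scale datasets.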

3 participants