
[RFC] Consider caching data per-layer rather than per-request #25

Closed
jerluc opened this issue Dec 31, 2019 · 1 comment · Fixed by #45
Labels: perf (Performance-related), rfc (Request For Comments)

jerluc (Member) commented Dec 31, 2019

Currently, we cache data using the request path as the key and the full response body as the value. This has the nice side effect of being very simple to implement and maintain, but it comes with drawbacks:

  1. When a request to `/_all/{z}/{x}/{y}.mvt` is made, another call to `/layer1/{z}/{x}/{y}.mvt` will miss the cache, because the cache key is based purely on the request path.
  2. When a request to `/_all/{z}/{x}/{y}.mvt` is made and a partial failure occurs, not only does the entire request fail, but none of the successful layer responses are cached, meaning a subsequent call has to recompute the entire response rather than only the failed layers.

To fix these problems, we should consider using something like `{layer}/{z}/{x}/{y}` as a cache key, and caching individual feature collections per layer response. Then, in the above two scenarios:

  1. When a request to `/_all/{z}/{x}/{y}.mvt` is made, all layer responses get cached, and another call to `/layer1/{z}/{x}/{y}.mvt` will hit the cache, because the cache key is based on the layer name.
  2. When a request to `/_all/{z}/{x}/{y}.mvt` is made and a partial failure occurs, the successful layer responses are cached, meaning a subsequent call only has to recompute the failed layers.
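
The per-layer keying proposed above can be sketched as follows. This is a minimal illustration under stated assumptions: `cacheKey` and the in-memory map are hypothetical stand-ins, not tilenol's actual API.

```go
package main

import "fmt"

// cacheKey builds a per-layer cache key of the form {layer}/{z}/{x}/{y},
// so that a layer cached while serving an /_all/... request can be reused
// by a later single-layer request for the same tile coordinates.
func cacheKey(layer string, z, x, y int) string {
	return fmt.Sprintf("%s/%d/%d/%d", layer, z, x, y)
}

func main() {
	// An /_all/3/4/5.mvt request caches each layer under its own key...
	cache := map[string][]byte{}
	for _, layer := range []string{"layer1", "layer2"} {
		cache[cacheKey(layer, 3, 4, 5)] = []byte("...tile data...")
	}
	// ...so a later /layer1/3/4/5.mvt request hits the cache.
	_, hit := cache[cacheKey("layer1", 3, 4, 5)]
	fmt.Println(hit) // true
}
```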
@jerluc jerluc added the enhancement New feature or request label Dec 31, 2019
@jerluc jerluc added perf Performance-related rfc Request For Comments and removed enhancement New feature or request labels Nov 2, 2020
jerluc (Member, Author) commented Feb 3, 2021

As a further improvement to this logic, we may even want to encode a content-based hash of the layer configuration itself as a pseudo-version number in the cache key, so that layer configuration changes are picked up instantly, rather than only when the cache entry expires.
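
The configuration-hashing idea above can be sketched like this, assuming the layer configuration can be serialized deterministically; `configHash` is a hypothetical helper, not tilenol's actual code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns a content-based pseudo-version of a layer's effective
// configuration: any change to the serialized config yields a new digest
// (invalidating cache keys that embed it), while identical configs hash
// identically across restarts.
func configHash(serializedConfig []byte) string {
	sum := sha256.Sum256(serializedConfig)
	return hex.EncodeToString(sum[:])
}

func main() {
	v1 := configHash([]byte("source: es\nindex: places\n"))
	v2 := configHash([]byte("source: es\nindex: places_v2\n"))
	fmt.Println(v1 != v2) // true: a config change produces a new version
	fmt.Println(len(v1))  // 64 hex characters
}
```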

@jerluc jerluc self-assigned this Feb 3, 2021
@jerluc jerluc changed the title Consider caching data per-layer rather than per-request [RFC] Consider caching data per-layer rather than per-request Feb 4, 2021
@jerluc jerluc added this to the v1.1.0 milestone Feb 4, 2021
@jerluc jerluc mentioned this issue Feb 10, 2021
jerluc added a commit that referenced this issue Feb 10, 2021
This changeset closes #25 by implementing a simple cache-per-layer approach (as opposed to the current approach of cache-per-URL). At a high level, this is how the new caching approach works:

- When a request comes in for one or more layers, we do the following for each layer:
  - We create a cache key by composing a stringified representation of the layer with a stringified representation of the tile request
    - A layer is stringified by composing its name and a content-based hash (SHA256) of its effective configuration; the idea here is that the hash should only change when its configuration changes, but should be the same across restarts
    - A tile request is stringified by composing the `z`, `x`, and `y` values, along with any additional query parameters
    - The cache key format roughly resembles: `{layer}@{layer-sha256}/{z}/{x}/{y}?{args}`
  - If the current layer is cacheable (configured with `nocache: false`, the default when omitted), we first check whether the layer data is already stored in the cache at the computed key; on a cache miss (or if the layer is not cacheable), we pull the layer data from its backend source
  - Also, if the current layer is cacheable, we then marshal the layer data into the raw gzipped MVT binary format and store it at the computed cache key
  - Lastly, we return the in-memory layer data to the caller
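
The key-construction steps above can be sketched as follows. This is a simplified illustration of the described key format, not tilenol's actual implementation; `tileCacheKey` and its parameters are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
)

// tileCacheKey composes the stringified layer (its name plus a SHA256
// digest of its effective configuration) with the stringified tile request
// (z/x/y plus any extra query parameters), yielding a key shaped like:
//
//	{layer}@{layer-sha256}/{z}/{x}/{y}?{args}
func tileCacheKey(layer string, config []byte, z, x, y int, args url.Values) string {
	sum := sha256.Sum256(config)
	key := fmt.Sprintf("%s@%s/%d/%d/%d", layer, hex.EncodeToString(sum[:]), z, x, y)
	if len(args) > 0 {
		key += "?" + args.Encode()
	}
	return key
}

func main() {
	key := tileCacheKey("layer1", []byte("source: es\n"), 3, 4, 5,
		url.Values{"filter": {"parks"}})
	fmt.Println(key)
}
```

Because the configuration digest sits inside the key, a reconfigured layer naturally misses the old entries, while unchanged layers keep hitting them.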

With these changes in place, we get a few major improvements:

1. **Improved cache hit ratio for multi-layer configurations**: because this now caches per layer vs. per URL/request, we should see moderate performance improvements for mixed-combination tile requests, e.g. a request to something like `/_all/z/x/y.mvt`, followed by `/layer1/z/x/y.mvt`, should hit the `layer1` cache on the second request, since `layer1` gets cached in the first request at a more granular cache key
2. **More fine-grained control for the caching behavior of multi-layer configurations**: there's still room for improvement, but between computing hash keys per layer (vs. per URL), and exposing the optional `nocache` option for each configured layer, this adds an extra level of flexibility
3. **Better path forward for reconfigurations**: previously there was some buggy cache behavior when reconfiguring a layer under the cache-per-URL implementation, as the URL doesn't change even if your layer configuration does; by using the configuration in the cache key (using the configuration hash digest), we can ensure that layer data is freshly retrieved whenever its configuration changes, and that layer cache data is reused when the configuration is the same
4. **Better resilience to cache failure**: previously, if retrieving data from the cache failed for any reason, we would fail the entire request; in this revised implementation, cache retrieval failures are treated the same as a cache miss, which should fix a broad class of potential issues (e.g. temporary network failures, bad data/deserialization issues, etc.)
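
The cache-failure fallback described in point 4 can be sketched as below; the `Cache` interface and `getLayerData` are hypothetical simplifications, not tilenol's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// Cache is a minimal cache abstraction; Get returns an error on any
// retrieval failure (temporary network failure, bad data, etc.).
type Cache interface {
	Get(key string) ([]byte, error)
}

// failingCache simulates a cache whose backend is temporarily unreachable.
type failingCache struct{}

func (failingCache) Get(string) ([]byte, error) {
	return nil, errors.New("connection refused")
}

// getLayerData treats any cache retrieval failure the same as a cache miss:
// instead of failing the whole request, it falls back to the backend source.
func getLayerData(c Cache, key string, fromSource func() []byte) []byte {
	if data, err := c.Get(key); err == nil && data != nil {
		return data // cache hit
	}
	// Cache miss OR cache failure: recompute from the backend source.
	return fromSource()
}

func main() {
	data := getLayerData(failingCache{}, "layer1@abc/3/4/5", func() []byte {
		return []byte("fresh tile data")
	})
	fmt.Println(string(data)) // fresh tile data
}
```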

That said, a couple of new questions come out of this:

* The cache configuration is still global (including things like TTL); since we now have additional cache controls per layer, should we instead consider moving the cache configuration as a whole into each layer? Is there some hybrid approach where each layer could instead reference a global cache?
* In this implementation, there is some additional serialization/deserialization overhead that has been introduced between the tilenol server and the backend cache; what is the impact of this overhead and are there alternative ways to approach layer data serialization to avoid this overhead?