
Query timeouts caused by interplay between compression, sampling strategy, and underlying data #31

@apcarp

TL;DR

A constant underlying data stream (set points, very fast sample rates) will time out when using the N_QUERIES sampling strategy if the client requests compression and the total amount of data requested is too large to be processed within 60 seconds.

Details

The response timeouts appear to be caused by a complicated interaction between query parameters, client settings, and the underlying data. I've noticed that queries using the N_QUERIES strategy time out after 60 seconds, while identical queries using the STREAM strategy still return, even after processing for a very long time (several minutes at least). Below is a pathological N_QUERIES example that times out in Firefox, plus its equivalent STREAM query.

Times out in Firefox (N_QUERIES) (502)
https://epicsweb.jlab.org/myquery/mysampler?c=R121GMES%2CR122GMES&b=2026-05-10T00%3A00%3A00&n=50000&m=history&s=10&f=0&v=6&x=n

Works in Firefox (STREAM) (< 1 second)
https://epicsweb.jlab.org/myquery/mysampler?c=R121GMES%2CR122GMES&b=2026-05-10T00%3A00%3A00&n=50000&m=history&s=10&f=0&v=6&x=s

At first I thought this was simply because the STREAM query completes before the 60 second timeout, but the timeout does not apply to the whole transfer. It is the maximum amount of time that the HTTP proxy can go without seeing data from the Tomcat server. The mysampler endpoint is set up to stream results out as they are produced, so both strategies should easily perform well enough to deliver at least one sample every 60 seconds.
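To make the distinction between an idle timeout and a total-duration timeout concrete, here is a minimal sketch. The function name and the write timings are hypothetical and only model the behavior described above; they are not taken from the proxy configuration.

```python
def trips_idle_timeout(write_times, idle_limit=60.0):
    """Return True if the gap before any write exceeds idle_limit seconds.

    write_times are seconds since the request started, in ascending order.
    This models a proxy that only cares about silence between writes,
    not about how long the whole response takes.
    """
    previous = 0.0  # the request start counts as the first reference point
    for t in write_times:
        if t - previous > idle_limit:
            return True
        previous = t
    return False


# Hypothetical response lasting over 100 s that writes every 5 s.
steady = [1.35 + 5 * i for i in range(21)]
# Hypothetical response whose first bytes only appear after 100 s.
delayed = [100.0, 101.0, 102.0]

print(trips_idle_timeout(steady))   # False: never silent for 60 s
print(trips_idle_timeout(delayed))  # True: silent for the first 100 s
```

Under this model a slow but steadily streaming response survives, while a response that is faster overall but silent for more than 60 seconds does not.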

Running this in curl shows that to be true (time_starttransfer is 1.35 seconds): the N_QUERIES query completes successfully after 104 seconds, well past the 60 second timeout.

> curl -w '%{time_starttransfer} %{time_total}\n' --no-buffer 'https://epicsweb.jlab.org/myquery/mysampler?c=R121GMES%2CR122GMES&b=2026-05-10T00%3A00%3A00&n=50000&m=history&s=10&f=0&v=6&x=n' -o /tmp/test.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3906k    0 3906k    0     0  38119      0 --:--:--  0:01:44 --:--:-- 47294
1.350663 104.943293

However, adding the --compressed flag causes the query to time out.

> curl --compressed -w '%{time_starttransfer} %{time_total}\n' --no-buffer 'https://epicsweb.jlab.org/myquery/mysampler?c=R121GMES%2CR122GMES&b=2026-05-10T00%3A00%3A00&n=50000&m=history&s=10&f=0&v=6&x=n' -o /tmp/test.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   443  100   443    0     0      7      0  0:01:03  0:01:00  0:00:03   115
60.090399 60.091799

Running a STREAM query with 100 times the data and compression enabled does not time out. The transfer starts after 0.5 seconds.

> curl --compressed -w '%{time_starttransfer} %{time_total}\n' --no-buffer 'https://epicsweb.jlab.org/myquery/mysampler?c=R121GMES%2CR122GMES&b=2026-05-10T00%3A00%3A00&n=5000000&m=history&s=10&f=0&v=6&x=s' -o /tmp/test.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1311k    0 1311k    0     0   118k      0 --:--:--  0:00:11 --:--:--  144k
0.518108 11.087084

My guess (suggested by Claude AI) is that the compression buffer doesn't fill up fast enough to emit any output, given that the data is practically identical (10 ms intervals) and that the N_QUERIES strategy is about 150-200x slower than the STREAM strategy when sampling this constant-valued data. Given that the transfer started after 0.5 s in the STREAM scenario, the math is roughly consistent: 200 x 0.5 s = 100 s before the first N_QUERIES data is transferred, which exceeds the 60 s timeout.
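The buffering guess can be illustrated with a small sketch using Python's zlib module. This is not the server's code: the record format, the record count, and the flush interval are all assumptions, and the server-side compressor may be configured differently. It only shows that a deflate stream fed highly repetitive input tends to hold its output back until it is flushed or has seen far more data.

```python
import zlib


def make_records(count):
    """Hypothetical sample records: constant value, timestamps 10 ms apart."""
    return [('{"t":%d,"v":6.0},' % (10 * i)).encode("ascii") for i in range(count)]


def bytes_emitted(records, flush_every=None):
    """Feed records into a gzip-framed deflate stream and count the bytes
    that become available to send before the stream is finished."""
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
    emitted = 0
    for index, record in enumerate(records, start=1):
        emitted += len(compressor.compress(record))
        if flush_every and index % flush_every == 0:
            # Z_SYNC_FLUSH forces all pending output out of the compressor
            emitted += len(compressor.flush(zlib.Z_SYNC_FLUSH))
    return emitted


records = make_records(500)
without_flush = bytes_emitted(records)
with_flush = bytes_emitted(records, flush_every=100)
print(without_flush, with_flush)
```

Without an explicit flush, the first count is typically little more than the gzip header, so a proxy downstream of the compressor would see almost nothing for as long as the records keep trickling in this slowly.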

Compression is a useful feature here since the uncompressed stream is around 25 Mbps, but having users' queries time out whenever they include a low-variance PV (i.e., a set point) in their channel list is a non-starter. I think a workaround is to periodically trigger a flush. I will have to look into the details some more since I'm not sure where the compression is performed (Java app, Tomcat server, or Apache reverse proxy). A fix like that will probably be needed in each function like the one linked below.

public long generateFloatStream(
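As a rough illustration of the periodic-flush idea, here is a sketch of a writer that forces a sync flush whenever nothing has been sent for a configurable interval. It is written in Python purely for illustration; the class name, the 10 second interval, the fake clock, and the record format are all hypothetical, and the real fix depends on where compression actually happens.

```python
import io
import time
import zlib


class FlushingGzipWriter:
    """Gzip-compress records into a sink, flushing if output stalls."""

    def __init__(self, sink, max_idle_seconds=10.0, clock=time.monotonic):
        self._sink = sink  # any object with a write(bytes) method
        self._compressor = zlib.compressobj(wbits=31)  # gzip framing
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._last_output = clock()

    def _send(self, data):
        if data:
            self._sink.write(data)
            self._last_output = self._clock()

    def write(self, record):
        self._send(self._compressor.compress(record))
        # Nothing reached the sink recently: push out what is buffered.
        if self._clock() - self._last_output >= self._max_idle:
            self._send(self._compressor.flush(zlib.Z_SYNC_FLUSH))

    def close(self):
        self._send(self._compressor.flush(zlib.Z_FINISH))


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
sink = io.BytesIO()
writer = FlushingGzipWriter(sink, max_idle_seconds=10.0, clock=clock)
records = [b'{"t":%d,"v":6.0},' % (10 * i) for i in range(20)]
for record in records:
    clock.now += 4.0  # pretend each record takes 4 s to produce
    writer.write(record)
sent_before_close = len(sink.getvalue())
writer.close()
print(sent_before_close, len(sink.getvalue()))
```

Each sync flush costs a few bytes of compression efficiency, so the interval is a trade-off between keeping the proxy fed and keeping the stream small. If the compression turns out to live in the Java app, java.util.zip.GZIPOutputStream has a constructor with a syncFlush flag that makes flush() behave similarly, but whether that applies here is still an open question.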
