Download timeout after 15 minutes #6927

Closed
lmaylein opened this issue May 22, 2020 · 37 comments · Fixed by #6996

Comments

@lmaylein
Contributor

We have a problem downloading large files from a Dataverse instance (v. 4.18.1 build 267-a91d370). The download of the large file in https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/TMEDTX aborts after exactly 15 minutes. I still can't tell whether this is a timeout of the Apache web server or of the Glassfish server.

The problem was reported by a user and I can reproduce it.

Can you help me? Where do I have to look for the corresponding timeout?

Apache log:

147.142.***.*** - - [22/May/2020:14:38:01 +0200] "GET /api/access/datafile/3092?gbrecs=true HTTP/1.1" 200 36631117824 "https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/TMEDTX" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0"

Glassfish log:

[2020-05-22T14:53:01.451+0200] [glassfish 4.1] [SEVERE] [] [org.glassfish.jersey.server.ServerRuntime$Responder] [tid: _ThreadID=52 _ThreadName=jk-connector(5)] [timeMillis: 1590151981451] [levelValue: 1000] [[ An I/O error has occurred while writing a response message entity to the container output stream. org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: java.lang.InterruptedException at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:97) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1154) at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:621) at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:377) at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:367) at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.process(Errors.java:267) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:297) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:254) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1028) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:372) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:381) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:344) at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221) at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1682) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:344) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:226) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) at edu.harvard.iq.dataverse.api.ApiBlockingFilter.doFilter(ApiBlockingFilter.java:168) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) at edu.harvard.iq.dataverse.api.ApiRouter.doFilter(ApiRouter.java:30) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) at org.apache.catalina.core.ApplicationDispatcher.doInvoke(ApplicationDispatcher.java:873) at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:739) at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:575) at org.apache.catalina.core.ApplicationDispatcher.doDispatch(ApplicationDispatcher.java:546) at org.apache.catalina.core.ApplicationDispatcher.dispatch(ApplicationDispatcher.java:428) at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:378) at edu.harvard.iq.dataverse.api.ApiRouter.doFilter(ApiRouter.java:34) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214) at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:316) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:160) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:734) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:673) at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:99) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:174) at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:415) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:282) at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:459) at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:167) at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:201) at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:175) at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:235) at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119) at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:284) at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:201) at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:133) at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112) at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77) at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:561) at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112) at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117) at 
org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56) at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565) at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: java.lang.InterruptedException at org.glassfish.grizzly.http.io.OutputBuffer.blockAfterWriteIfNeeded(OutputBuffer.java:973) at org.glassfish.grizzly.http.io.OutputBuffer.write(OutputBuffer.java:686) at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:355) at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:342) at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:161) at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:298) at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:229) at edu.harvard.iq.dataverse.api.DownloadInstanceWriter.writeTo(DownloadInstanceWriter.java:334) at edu.harvard.iq.dataverse.api.DownloadInstanceWriter.writeTo(DownloadInstanceWriter.java:49) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:263) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:250) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:106) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at 
org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:89) ... 66 more Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at org.glassfish.grizzly.impl.SafeFutureImpl$Sync.innerGet(SafeFutureImpl.java:356) at org.glassfish.grizzly.impl.SafeFutureImpl.get(SafeFutureImpl.java:264) at org.glassfish.grizzly.http.io.OutputBuffer.blockAfterWriteIfNeeded(OutputBuffer.java:962) ... 80 more ]]

@donsizemore
Contributor

You could try adjusting the Apache proxy timeout? It's a bit buried in the guide, on the Shibboleth page:

You may wish to also add a timeout directive to the ProxyPass line within ssl.conf. This is especially useful for larger file uploads as apache may prematurely kill the connection before the upload is processed.

e.g. ProxyPass / ajp://localhost:8009/ timeout=600 defines a timeout of 600 seconds.

If your files are in S3 there's a download-direct JVM option for Dataverse which may work around Apache entirely.

@lmaylein
Contributor Author

I changed it for a test run.

ProxyPass / ajp://localhost:8009/ timeout=1200

This changes nothing. I would have been surprised if it had, because the default for this timeout is 600 seconds, not 900 seconds.

@juancorr

> I changed it for a test run.
>
> ProxyPass / ajp://localhost:8009/ timeout=1200
>
> This changes nothing. I would have been surprised, because the default for this timeout is 600 seconds and not 900 seconds.

Hi @lmaylein and @donsizemore. We have the same problem at e-cienciaDatos and haven't been able to solve it yet. As a workaround, we have an alternative download address for large files that avoids Glassfish.

@lmaylein
Contributor Author

lmaylein commented Jun 9, 2020

Is it possible to prioritize this issue higher? This is a big problem for our customers.

@djbrooke
Contributor

djbrooke commented Jun 9, 2020

@lmaylein larger data support is certainly one of the goals of the project, but it will be challenging to prioritize specific work here in the short term. It may be helpful to get some information about the infrastructure that you're running so that we can suggest workarounds.

To discuss future plans in this area, I encourage you to attend the "Remote Storage/Large Datasets" session on June 19th at #dataverse2020: https://projects.iq.harvard.edu/dcm2020/agenda

@juancorr, can you provide some more information about the alternate address that avoids Glassfish, and how you provide that address to users?

@djbrooke
Contributor

djbrooke commented Jun 9, 2020

Also, @qqmyers, do you have any thoughts in this area, related to that work you're doing for TDL w/r/t increasing upload size limits? Is there parallel work on the download side, or is there less issue in that setup because of redirecting to S3?

@juancorr

juancorr commented Jun 9, 2020

@djbrooke, we have created a symbolic link under the Apache www path that points to the real file. It is not a clean solution, only a workaround.

You can see an example in the notes related to the WordEmbeddings.zip file here: https://doi.org/10.21950/AQ1CVX

https://doi.org/10.21950/wordembeddings.zip points to the link on the Apache server. You can see it on disk too:

$ ls -l /var/www/html/redirects/AQ1CVX/WordEmbeddings.zip
lrwxrwxrwx 1 www-data www-data 94 dic 5 2019 /var/www/html/redirects/AQ1CVX/WordEmbeddings.zip -> /usr/local/glassfish4/glassfish/domains/domain1/files/10.21950/AQ1CVX/166630f39ce-44355f90ea

@pdurbin
Member

pdurbin commented Jun 9, 2020

We discussed this a bit in Slack the other day: https://iqss.slack.com/archives/C010LA04BCG/p1590504250002200

"To answer the user above - 15 min. does sound like one of the server timeouts. We had to adjust timeouts on multiple levels in our prod. - both inside Glassfish and Apache... if it's not already documented in the guides, we should add it."

"The answer to what that person was asking was "all of the above" - the timeouts need to be increased everywhere - Apache, Glassfish and ajpproxy... But I was hoping we were already explaining it in the guide."

I hope this helps a little. Obviously, as we're saying above, we should document these settings.

Someone else who might know is @Venki18 who got 20 GB upload working: #4439 (comment)

@juancorr

juancorr commented Jun 9, 2020

We have changed all timeouts that we have seen:

glassfish/domains/domain1/config/domain.xml: <thread-pool idle-thread-timeout-seconds="3600" name="http-thread-pool"></thread-pool>
glassfish/domains/domain1/config/default-web.xml: <session-timeout>60</session-timeout>

libapache2-mod-jk/workers.properties:worker.worker1.cache_timeout=3600
libapache2-mod-jk/workers.properties:worker.worker1.socket_timeout=3600
apache2/apache2.conf:Timeout 6000

But we have not been able to get a download through Glassfish to run for more than 15 minutes. The download works when the file is served by Apache.

@qqmyers
Member

qqmyers commented Jun 9, 2020

More thoughts later, but I thought I'd report some findings. I've tried using curl with --limit-rate 1K to get the file from Heidelberg and have seen the connection die after 28, 36, and 72 minutes. Similarly, trying to get the WordEmbeddings.zip file above, I have a slow curl connection that has been downloading for 34+ minutes. So I don't think we're seeing timeouts on the overall download time. Perhaps they are related to server load, e.g. Dataverse gets so busy it stops sending packets for longer than some timeout.
(FYI - I set the rate limit at 1K to avoid load from this specific call - at this rate it would take 88 to 244 days to download the 11GB and 61 GB files!)

@qqmyers
Member

qqmyers commented Jun 9, 2020

Most of the timeouts that I'm aware of are concerned with the time to wait for a first response from Dataverse or the time between subsequent increments to the response, rather than the overall time a response takes. For example, one of the things I helped TDL with was related to uploads where Dataverse was taking too long unzipping a file and storing the individual files before responding to the uploader. There's really no equivalent during download, so I think the only things that would stop a download via Dataverse (i.e. not redirecting to S3) would be the disk or Glassfish server becoming so busy they don't respond for a while. (I guess it's still possible that there's some overall timeout happening, but that would contradict my findings: on both servers reporting problems, I have had curl downloads lasting for an hour+.)

I guess one direction that suggests for debugging would be to watch server and disk load when downloads are happening/failing. If it is Glassfish load, the only thing I've done much of is to increase memory, but others probably know more about the best ways to optimize/speed things up.

In some sense, if it's a timeout or resource issue, it isn't really something that can be fixed in the Dataverse code (aside from trying to make Dataverse more efficient/use less memory overall). The switch to allowing a redirect to S3 for downloads (and uploads) is one way that Dataverse has been adapted to help with this. There have been suggestions that the redirect mechanism could also be added for file stores (not sure if the two servers reporting issues here are on files or S3 - if the latter, turning on the S3 redirect would be useful). There's also discussion of implementing the ability to handle range requests for downloads. This is the thing needed on the server side to allow smart clients to restart downloads that fail in the middle somewhere (you just ask the server for bytes starting where you last got to). This is probably a good thing to do in general, but it is essentially a way to mask the problem that downloads are failing rather than a fix for the underlying issue. (But, if that issue is something else, like a flaky network that's not under your control, the work-around is all you can do.)
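For readers unfamiliar with range requests, here is a minimal client-side sketch of the restart idea described above. It assumes a server that honors the HTTP `Range` header, which the thread says Dataverse did not yet do at this point; the function names `range_header` and `resume_download` are hypothetical and not part of any Dataverse client.

```python
import os
import urllib.request


def range_header(offset: int) -> dict:
    """Build a Range header asking for everything from byte `offset` onward."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return {"Range": f"bytes={offset}-"}


def resume_download(url: str, path: str, chunk_size: int = 1 << 20) -> int:
    """Continue a partial download of `url` into `path`; return the file size.

    Assumes the server supports range requests. If it answers 200 instead
    of 206 (range ignored), the file is rewritten from the start.
    """
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    request = urllib.request.Request(url, headers=range_header(offset))
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honored the Range header.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(path)
```

A client would call `resume_download` again after each dropped connection until the file is complete. Note that a request starting exactly at the end of an already complete file would normally be rejected by the server (416), which this sketch does not handle.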

If these issues can't be resolved by system configuration changes, having IQSS/GDCC prioritize implementing the range requests might be the fastest way to help. (I think supporting file redirects would be more work, but I could be wrong, and, if the issue is not Glassfish, using a file redirect might not solve the problem.)

Another note - with upload issues, changes that were made to report HTTP errors in the upload UI helped in debugging the timeout errors and we could see in the browser console/developer tools which piece of software was triggering the timeout - it would be reported as the Server: in the Response Headers (so we saw something like 'AWS LB' instead of 'Apache' when it was a timeout on a load balancer, etc.). The downloads work differently and return a 200 response up front (at least in my tests here) and the failure occurs as the data streams. I don't know how one can debug the type of thing being seen with downloads except checking the logs of any proxy/load balancer/httpd service etc. between Glassfish and the user.

@lmaylein
Contributor Author

> We discussed this a bit in Slack the other day: https://iqss.slack.com/archives/C010LA04BCG/p1590504250002200

Is it possible to get access here?
Which parameters should be increased here for Glassfish or Apache?

@lmaylein
Contributor Author

lmaylein commented Jun 10, 2020

@qqmyers

Here are my test results with "limit-rate":

Without limit-rate:

curl -o t.gz https://heidata.uni-heidelberg.de/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 56 61.2G   56 34.6G    0     0  39.1M      0  0:26:43  0:15:06  0:11:37 3775k
curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly terminated.

The connection is terminated at 15:06.

With rate-limit 1M:

curl --limit-rate 1M -o t.gz https://heidata.uni-heidelberg.de/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  1 61.2G    1  916M    0     0  1021k      0 17:28:30  0:15:18 17:13:12  517k
curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly terminated.

The connection is terminated at 15:18.

With rate-limit 300K:

curl --limit-rate 300K -o t.gz https://heidata.uni-heidelberg.de/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 61.2G    0  279M    0     0   299k      0 59:29:17  0:15:52 59:13:25  298k
curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly terminated.

I still think a timeout is kicking in here. The extra seconds beyond the 15 minutes may be due to overhead in the connection between the client and the Apache server. The load on the server was very low in both attempts.
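To make the comparison explicit, the three runs above can be put side by side. This is only an illustrative sketch: the figures are copied from the curl output, and the size conversion assumes curl's abbreviations are 1024-based.

```python
def to_seconds(hms: str) -> int:
    """Convert curl's H:MM:SS 'Time Spent' column to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Assumption: curl's size suffixes are 1024-based.
UNITS = {"M": 1024**2, "G": 1024**3}


def to_bytes(size: str) -> float:
    """Convert an abbreviated size such as '916M' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]


# (rate limit, amount received, time spent) as reported by curl above.
runs = [
    ("none", "34.6G", "0:15:06"),
    ("1M", "916M", "0:15:18"),
    ("300K", "279M", "0:15:52"),
]

elapsed = [to_seconds(spent) for _, _, spent in runs]
received = [to_bytes(size) for _, size, _ in runs]

print("abort times (s):", elapsed)
print("time spread:  %.2fx" % (max(elapsed) / min(elapsed)))
print("bytes spread: %.0fx" % (max(received) / min(received)))
```

The abort times differ by only a few percent while the amount of data transferred differs by more than a hundredfold, which is what one would expect from a limit on elapsed time rather than on bytes or on idle periods.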

@lmaylein
Contributor Author

lmaylein commented Jun 10, 2020

And here are my test results when downloading directly from Glassfish (via localhost):

Without limit-rate

curl -o t.gz http://localhost:8080/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 61.2G  100 61.2G    0     0   100M      0  0:10:27  0:10:27 --:--:--  109M

Download takes less than 15 minutes - no problem

With limit-rate 1M:

curl --limit-rate 1M -o t.gz http://localhost:8080/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  1 61.2G    1  913M    0     0  1023k      0 17:25:40  0:15:13 17:10:27  922k
curl: (18) transfer closed with 64830595443 bytes remaining to read

So the problem is definitely not the connection between Apache and Glassfish.
This would also have been unlikely because @juancorr uses mod_jk and we use mod_proxy.

@poikilotherm
Contributor

poikilotherm commented Jun 10, 2020

I can confirm the abort when downloading from HeiDATA.

[image]

This was from within FZJ, which is nearer to the AS of Uni Heidelberg than the US and within the same provider network (DFN), hopefully eliminating a few possible problems along the route.

[image]

@juancorr

I think that the problem is the Glassfish keep-alive timeout, but I haven't been able to change it yet. I need to test it in Payara, but I haven't had time yet.

HTTP/1.1 200 OK
Date: Wed, 10 Jun 2020 06:58:11 GMT
Server: Apache/2.4.41 (Ubuntu)
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: PUT, GET, POST, DELETE, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Dataverse-Key
Content-disposition: attachment; filename="hesml-biomedical-benchmark.tar.gz"
Content-Length: 36502638286
Keep-Alive: timeout=15, max=299
Connection: Keep-Alive
Content-Type: application/gzip; name="hesml-biomedical-benchmark.tar.gz"

@poikilotherm
Contributor

@juancorr I don't think so - the Keep-Alive is in seconds, not minutes. And it is only used for idle connections... https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive
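As a quick check of the units, the header value from the response above can be parsed directly. This is a small sketch with a hypothetical helper name, `parse_keep_alive`; it assumes every parameter in the header is an integer, as in this example.

```python
def parse_keep_alive(value: str) -> dict:
    """Parse a Keep-Alive header value such as 'timeout=15, max=299'."""
    params = {}
    for item in value.split(","):
        name, _, number = item.strip().partition("=")
        params[name.lower()] = int(number)
    return params


params = parse_keep_alive("timeout=15, max=299")
print("idle timeout in seconds:", params["timeout"])
print("observed abort in seconds:", 15 * 60)
```

The advertised idle timeout is 15 seconds, a factor of 60 away from the 15 minute abort seen in the downloads.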

@poikilotherm
Contributor

poikilotherm commented Jun 10, 2020

I tried to reproduce the issue in our own installation, using Glassfish + NGINX + HAProxy (so HTTP reverse proxies only) and files stored on S3. This seems to work OK, running on DV 4.19.

[image]

Feel free to test and report back if things break. It's only a 1GB file, so you'll need some rate limiting... 😉

@qqmyers
Member

qqmyers commented Jun 10, 2020

FWIW: Hmm - My rate limit was much more limiting - 1K, but the time-spent reported from curl was 1:12:08 from HeiDATA (same 61GB file) for the longest run.

@lmaylein
Contributor Author

To make sure that the problem is not caused by the underlying storage infrastructure, I also ran a rate-limited copy with "scp -l". The file access does not abort there.

@qqmyers
Member

qqmyers commented Jun 15, 2020

I'm out of good ideas... doing a search on "java.lang.InterruptedException at org.glassfish.grizzly.http.io.OutputBuffer.blockAfterWriteIfNeeded" - which is in the log above - led me to https://stackoverflow.com/questions/26990616/connection-timeout-glassfish-java-ee-application which has some discussion of circumstances where load could cause timeouts. The only parameter mentioned in the answers there that I've ever tweaked is the max-thread-pool-size for the http-thread-pool - in domain.xml . I've increased that to 50 on some instances where we were debugging other memory/slow response issues. If you have a low number for that, you could try an increase.

The answer there also points to some timeout and buffer-size config options at the TCP transport layer that are also configurable in domain.xml. The timeouts are again not global timeouts but timeouts for how long to wait for more bytes, so they are consistent with failures at different times. (The issue links things together - few threads and small buffers lead to situations where a download thread could be blocked and end up waiting for longer than the timeout, so work-arounds could be a combination of more threads, bigger buffers, longer timeouts.) Making changes at this level is beyond my experience - perhaps others know more - and I hesitate to suggest trying things without knowing much about them. That said, if you're stuck, these are things you can try and then remove from your config if they don't work.

@lmaylein
Contributor Author

I tried:

<thread-pool name="http-thread-pool" max-thread-pool-size="50"></thread-pool>

and

<thread-pool name="http-thread-pool" max-thread-pool-size="200" max-queue-size="256"></thread-pool>

That didn't help.
I'll have a look at what other "buffer size" parameters are available in the domain.xml.

@pdurbin
Member

pdurbin commented Jun 16, 2020

@lmaylein I've been meaning to ask you what you think about the symlink hack/workaround @juancorr mentioned earlier: #6927 (comment)

My understanding is that you'd be using Apache (or similar) to serve up the file. Not a long term solution, of course, but perhaps it could give you some relief.

@lmaylein
Contributor Author

@pdurbin Our workaround was to split up the file. The Apache symlink would require changing the permissions on our server, because only the glassfish user has access to the file, not the Apache user. If necessary, however, we would change that.
I would like to find a general solution here, as I assume that we will have more and more large files in the future (especially in the natural sciences).

@lmaylein
Contributor Author

Okay, very interesting:
I've added a request-timeout-seconds:

      <network-config>
        <protocols>
          <protocol name="http-listener-1">
            <http request-timeout-seconds="1800" comet-support-enabled="true" max-connections="250" default-virtual-server="server">
              <file-cache></file-cache>
            </http>

instead of

      <network-config>
        <protocols>
          <protocol name="http-listener-1">
            <http comet-support-enabled="true" max-connections="250" default-virtual-server="server">
              <file-cache></file-cache>
            </http>

And this is the result:

curl --limit-rate 300K -o t.gz https://heidata.uni-heidelberg.de/api/access/datafile/3092?gbrecs=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 61.2G    0  541M    0     0   299k      0 59:31:48  0:30:49 59:00:59  213k
curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly terminated.

The connection aborts after 30 minutes, instead of 15 minutes as before.
This is interesting because I thought the default for request-timeout-seconds would be 30 seconds (!). And I thought it referred to a timeout that only applies when no data is exchanged over the connection. But my download aborts even though data was being downloaded continuously right up to that point.
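For anyone who wants to script the same change, here is a minimal sketch that sets the attribute on a fragment modelled on the excerpt above. It is an illustration only: the helper name `set_request_timeout` is hypothetical, the fragment is far smaller than a real domain.xml, and editing the file by script rather than through the application server's own admin tooling is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the domain.xml excerpt above; a real domain.xml has
# many more elements around it.
FRAGMENT = """
<network-config>
  <protocols>
    <protocol name="http-listener-1">
      <http comet-support-enabled="true" max-connections="250" default-virtual-server="server">
        <file-cache></file-cache>
      </http>
    </protocol>
  </protocols>
</network-config>
"""


def set_request_timeout(xml_text: str, listener: str, seconds: int) -> str:
    """Return xml_text with request-timeout-seconds set on the named listener."""
    root = ET.fromstring(xml_text)
    http = root.find(f"./protocols/protocol[@name='{listener}']/http")
    if http is None:
        raise LookupError(f"no <http> element for protocol {listener!r}")
    http.set("request-timeout-seconds", str(seconds))
    return ET.tostring(root, encoding="unicode")


updated = set_request_timeout(FRAGMENT, "http-listener-1", 1800)
print(updated)
```

The lookup is keyed on the protocol name so that other listeners in the same file are left untouched.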

@lmaylein
Contributor Author

Apparently the default is 15 minutes after all:
https://stackoverflow.com/questions/22666847/grizzly0023-glassfish-warning

@qqmyers
Member

qqmyers commented Jun 16, 2020

Hmm - as far as I know you're correct that all these timeouts are related to idle time not accumulated transfer times, so the coincidence with when you're seeing the connection drop is odd (but I did see a really slow connection stay up for an hour already).

The other timeouts I've seen are at the transport level - I've only seen it in the glassfish v3 docs so far: https://docs.oracle.com/cd/E19798-01/821-1753/girmh/index.html - the buffer size is in that doc too. The timeouts here are all ~ 30 seconds, so they could be involved if there is some random time when threads block etc.

@Venki18

Venki18 commented Jun 17, 2020

@pdurbin we got large file uploads and downloads working with the following settings. We are using Apache as the frontend to Glassfish.

  1. Add to Apache HTTP Web Service config /etc/httpd/conf.d/ssl.conf:
    ProxyTimeout 3600
  2. Changed GlassFish's timeout setting as follows:
    GET CURRENT VALUE
    asadmin get server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds
    SET NEW VALUE 3600s
    asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600

I hope this helps. Sorry for the delay in replying.

@lmaylein
Contributor Author

@Venki18 Thank you very much. This confirms my tests.

@qqmyers Just to rule out any coincidence, I have now also increased request-timeout-seconds to 3600. The result is as expected: the download aborted after 1:00:52.

I think the Apache ProxyTimeout parameter is not necessary. This is probably really an idle timeout.

Maybe the request-timeout-seconds parameter should be mentioned in the Dataverse documentation?

I think we can close this issue?

@juancorr

Thank you @Venki18 and @lmaylein . It works for us too.

@donsizemore
Contributor

@djbrooke if you’ll leave the issue open i’ll submit a PR for the docs.

@pdurbin
Member

pdurbin commented Jun 17, 2020

@Venki18 thank you! @lmaylein @juancorr glad it works! @donsizemore yes, yes, docs please!

donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 17, 2020
donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 17, 2020
donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 18, 2020
donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 18, 2020
donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 24, 2020
donsizemore pushed a commit to uncch-rdmc/dataverse that referenced this issue Jun 25, 2020
@lmaylein
Contributor Author

lmaylein commented Nov 9, 2020

I still have a question about this closed issue:
I hoped that this problem was solved with version 5.x or the switch to Payara.
But now that I see the parameter "http.request-timeout-seconds" in the documentation, I assume that I still have to define a very high timeout (for clients with a very slow network connection who want to retrieve a large file).

@donsizemore
Contributor

> I still have a question about this closed issue:
> I hoped that this problem was solved with version 5.x or the switch to Payara.
> But now that I see the parameter "http.request-timeout-seconds" in the documentation, I assume that I still have to define a very high timeout (for clients with a very slow network connection who want to retrieve a large file).

This became "GLASSFISH_REQUEST_TIMEOUT" for new installations, which now defaults to 30 minutes:
https://github.com/IQSS/dataverse/blob/develop/scripts/installer/interactive.config#L32
(the installer has not yet renamed its variables from GLASSFISH to PAYARA or DATAVERSE).

If you have an existing installation, you may want to bump this above the default 15 minutes:
https://guides.dataverse.org/en/latest/installation/config.html#application-server-settings

@lmaylein
Contributor Author

lmaylein commented Nov 9, 2020

I had hoped that there would be a parameter by now that defines the timeout between single packets and not for the whole duration of the download.
Some clients seem to connect over very slow network connections, so I'm afraid I have to set the timeout value very high (several hours!).
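To get a feel for how high "very high" would need to be, a rough estimate can be computed from file size and client speed. This is a back-of-the-envelope sketch: the helper `required_timeout_seconds` and the 20% safety margin are assumptions of this example, and the size uses the 61.2G figure from the curl output with 1024-based units.

```python
import math


def required_timeout_seconds(size_bytes: int, rate_bytes_per_s: float, margin: float = 1.2) -> int:
    """Seconds a transfer needs at a steady rate, padded by a safety margin."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return math.ceil(size_bytes / rate_bytes_per_s * margin)


# The 61.2G file from the curl output; 1024-based units assumed.
SIZE = round(61.2 * 1024**3)

rates = [("300 KiB/s", 300 * 1024), ("1 MiB/s", 1024**2), ("10 MiB/s", 10 * 1024**2)]
for label, rate in rates:
    seconds = required_timeout_seconds(SIZE, rate)
    print(f"{label}: {seconds} s (about {seconds / 3600:.1f} h)")
```

Without the margin, the 300 KiB/s case works out to roughly 59 hours, in line with the total time curl estimated for the rate-limited run earlier in the thread. A timeout sized for the slowest expected client is therefore measured in days rather than hours for a file of this size.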

@pdurbin
Member

pdurbin commented Nov 9, 2020

> I had hoped that there would be a parameter by now that defines the timeout between single packets and not for the whole duration of the download.

@lmaylein would that mean opening an issue upstream with Payara? Here is their issue tracker: https://github.com/payara/Payara/issues

@lmaylein
Contributor Author

I'm just not sure I've understood it all correctly. There is probably a timeout at the packet level, and the http.request-timeout-seconds timeout applies in addition. Then maybe it would not be harmful to set it as high as possible. But I don't know whether setting a very high value might create a vulnerability to DoS attacks.
