Uploads Stopping for Projects with Large Files #4572

Closed
Aurum420 opened this issue Nov 6, 2021 · 8 comments · Fixed by #4575

Comments

@Aurum420

Aurum420 commented Nov 6, 2021

The problem is that projects uploading large files tend to trigger errors such as "transient HTTP error" that halt uploading. When this happens, the only way to restart uploads on that computer is to reboot or restart the BOINC client. As a consequence, once a certain number of work units are in the upload queue, the client reports "Not requesting tasks: too many uploads in progress" and downloads halt. After all remaining work units complete, the computer sits idle.

28618			10/29/2021 8:57:33 AM	Project communication failed: attempting access to reference site	
28619	World Community Grid	10/29/2021 8:57:33 AM	Temporarily failed upload of OPN1_0084321_00498_0_r1080501240_0: transient HTTP error	
28620	World Community Grid	10/29/2021 8:57:33 AM	Backing off 00:05:28 on upload of OPN1_0084321_00498_0_r1080501240_0

To reproduce, use a 12c/24t or larger CPU to run 16 or more ARP1 work units to completion.
BOINC 7.6.16, Linux Mint 20.2, x86_64-pc-linux-gnu

The Goal is to allow any user with a bank of computers operating from a single IP address to run as many work units of any size for any BOINC project.
My goal is to turn in over 2,000 ARP work units per day which is 10% of that project's current daily progress. It doesn't seem in the spirit of BOINC to let a single simulation project run for over a year.

Anecdotal reports are that this does not happen when running a small number of large work units. My experience with ARP work units is that trying to run 16 ARP work units per 18c/36t computer is 100% guaranteed to fail, and running 8 fails too often to endure. I'm currently trying 4 and 3 ARP work units but have already seen an upload seizure. Since ARP might not checkpoint for up to nine hours, restarting wastes a lot of time, either by dumping work or by waiting for all work units to checkpoint.

It's been suggested that it's caused by too many large files uploading at once. The use of <max_file_xfers_per_project> to restrict the number of concurrent uploads has offered no benefit in keeping uploads from seizing up.

It's been suggested that having too many computers running on the same IP address is part of the problem. I thought the objective of BOINC was to get as much work done as fast as possible.

It's been suggested that perhaps <max_nbytes> is being specified too low. The following is an example of a seized WU where ARP returns 7 files. Note that <max_nbytes> takes on 3 different values:

<file>
    <name>ARP1_0028085_102_1_r1835721786_0</name>
    <nbytes>15923553.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>5072fc387af5ae4d884ec0a22044c364</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.559970</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_1</name>
    <nbytes>16475866.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>677e411747706ad3b16c6bd86742a649</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.588016</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_2</name>
    <nbytes>15998165.000000</nbytes>
    <max_nbytes>104857600.000000</max_nbytes>
    <md5_cksum>71446531f56153380749450507dc3767</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.568259</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_3</name>
    <nbytes>18614248.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>5ce5b3b488ac687b549bdc97170ecbbf</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.568259</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_4</name>
    <nbytes>16084763.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>78e54c61957a45ee11c70e167b2a0b00</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.554810</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_5</name>
    <nbytes>15363410.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
    <md5_cksum>ff167aca654c19fe7de5760bc2b245d6</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>0.000000</next_request_time>
        <time_so_far>300.531569</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_6</name>
    <nbytes>132.000000</nbytes>
    <max_nbytes>10240.000000</max_nbytes>
    <md5_cksum>e16122bf2611e311bdb0ea8b8d826897</md5_cksum>
    <status>1</status>
    <upload_url>https://upload.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>1</num_retries>
        <first_request_time>1636203793.649186</first_request_time>
        <next_request_time>1636204842.009948</next_request_time>
        <time_so_far>300.530445</time_so_far>
        <last_bytes_xferred>0.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
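
As a quick check on the <max_nbytes> hypothesis, the listing above can be parsed to compare each file's <nbytes> with its limit. The following is a minimal sketch in Python and is not part of BOINC; the function name summarize_files is hypothetical, and FRAGMENT is abbreviated to two of the seven entries with only the fields used here. In the listing above every file is below its limit; the closest is the file ending in _3, at 18614248 of 31457280 bytes.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of two <file> entries from the listing above (assumption:
# only the three fields used below are kept). Paste the full listing to check all seven.
FRAGMENT = """
<file>
    <name>ARP1_0028085_102_1_r1835721786_3</name>
    <nbytes>18614248.000000</nbytes>
    <max_nbytes>31457280.000000</max_nbytes>
</file>
<file>
    <name>ARP1_0028085_102_1_r1835721786_6</name>
    <nbytes>132.000000</nbytes>
    <max_nbytes>10240.000000</max_nbytes>
</file>
"""


def summarize_files(fragment):
    """Return (name, nbytes, max_nbytes, fraction_of_limit) for each <file> element."""
    # The fragment has several top-level elements, so wrap it in a synthetic root.
    root = ET.fromstring("<root>" + fragment + "</root>")
    rows = []
    for f in root.findall("file"):
        name = f.findtext("name")
        nbytes = float(f.findtext("nbytes"))
        max_nbytes = float(f.findtext("max_nbytes"))
        rows.append((name, nbytes, max_nbytes, nbytes / max_nbytes))
    return rows


if __name__ == "__main__":
    for name, nbytes, max_nbytes, frac in summarize_files(FRAGMENT):
        flag = "OVER LIMIT" if nbytes > max_nbytes else "ok"
        print(f"{name}: {nbytes:.0f} / {max_nbytes:.0f} bytes ({frac:.1%}) {flag}")
```

If any file were flagged OVER LIMIT, that would support the idea that <max_nbytes> is set too low; none of the seven files shown here is.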

Activating the http_debug flag produces complicated output, since the problem isn't triggered unless multiple files are affected.

29/10/2021 18:58:24 | World Community Grid | Computation for task OPNG_0098417_00141_0 finished
29/10/2021 18:58:26 | World Community Grid | [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\BOINC\ca-bundle.crt'
29/10/2021 18:58:26 | World Community Grid | [http] HTTP_OP::libcurl_exec(): ca-bundle set
29/10/2021 18:58:26 | World Community Grid | Started upload of OPNG_0098417_00141_0_r1166140054_0
29/10/2021 18:58:27 | World Community Grid | [http] [ID#27363] Info: Trying 169.47.63.74...
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: Connected to upload.worldcommunitygrid.org (169.47.63.74) port 443 (#13034)
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: ALPN, offering http/1.1
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: successfully set certificate verify locations:
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: CAfile: D:\BOINC\ca-bundle.crt
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: CApath: none
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (OUT), TLS header, Certificate Status (22):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (OUT), TLS handshake, Client hello (1):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS handshake, Server hello (2):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS handshake, Certificate (11):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS handshake, Server key exchange (12):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS handshake, Server finished (14):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (OUT), TLS change cipher, Client hello (1):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (OUT), TLS handshake, Finished (20):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS change cipher, Client hello (1):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: TLSv1.2 (IN), TLS handshake, Finished (20):
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: ALPN, server accepted to use http/1.1
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: Server certificate:
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: subject: C=US; ST=New York; L=Armonk; O=International Business Machines Corporation; CN=*.worldcommunitygrid.org
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: start date: Jun 10 00:00:00 2020 GMT
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: expire date: Sep 9 12:00:00 2022 GMT
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: subjectAltName: upload.worldcommunitygrid.org matched
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=Thawte RSA CA 2018
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: SSL certificate verify ok.
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: POST /boinc/wcg_cgi/file_upload_handler HTTP/1.1
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Host: upload.worldcommunitygrid.org
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.16.20)
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Accept: */*
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Accept-Encoding: deflate, gzip
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Accept-Language: en_GB
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Content-Length: 288
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server: Content-Type: application/x-www-form-urlencoded
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Sent header to server:
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: We are completely uploaded and fine
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: HTTP/1.1 200 OK
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Date: Fri, 29 Oct 2021 17:58:25 GMT
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Server: Apache
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Vary: Accept-Encoding
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Content-Encoding: gzip
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Content-Length: 75
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server: Content-Type: text/plain; charset=UTF-8
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Received header from server:
29/10/2021 18:58:34 | World Community Grid |
29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] Info: Connection #13034 to host upload.worldcommunitygrid.org left intact
29/10/2021 18:58:35 | World Community Grid | [http] HTTP_OP::libcurl_exec(): ca-bundle set
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Info: Found bundle for host upload.worldcommunitygrid.org: 0x1e276e0 [can pipeline]
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Info: Re-using existing connection! (#13034) with host upload.worldcommunitygrid.org
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Info: Connected to upload.worldcommunitygrid.org (169.47.63.74) port 443 (#13034)
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: POST /boinc/wcg_cgi/file_upload_handler HTTP/1.1
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Host: upload.worldcommunitygrid.org
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.16.20)
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Accept: */*
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Accept-Encoding: deflate, gzip
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Accept-Language: en_GB
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Content-Length: 142400
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Content-Type: application/x-www-form-urlencoded
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server: Expect: 100-continue
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Sent header to server:
29/10/2021 18:58:35 | World Community Grid | [http] [ID#27363] Received header from server: HTTP/1.1 100 Continue
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Info: We are completely uploaded and fine
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: HTTP/1.1 200 OK
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: Date: Fri, 29 Oct 2021 17:58:26 GMT
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: Server: Apache
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: Content-Length: 64
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: Content-Type: text/plain; charset=UTF-8
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server:
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: <data_server_reply>
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: <status>0</status>
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Received header from server: </data_server_reply>
29/10/2021 18:58:37 | World Community Grid | [http] [ID#27363] Info: Connection #13034 to host upload.worldcommunitygrid.org left intact
29/10/2021 18:58:38 | World Community Grid | Finished upload of OPNG_0098417_00141_0_r1166140054_0
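
Because http_debug output interleaves many transfers, it can help to group the header lines by their [ID#n] tag and keep only the request sizes and response statuses. The following is a rough sketch in Python, assuming the line format shown in the log above; summarize_http_ops is a hypothetical helper, and SAMPLE_LINES is a handful of lines copied from that log.

```python
import re
from collections import defaultdict

# Matches the "[http] [ID#n] Sent/Received header ..." lines in the format shown above.
HEADER_RE = re.compile(
    r"\[http\] \[ID#(\d+)\] (Sent|Received) header (?:to|from) server: ?(.*)$"
)

PREFIX = "29/10/2021 18:58:34 | World Community Grid | [http] [ID#27363] "
SAMPLE_LINES = [
    PREFIX + "Sent header to server: Content-Length: 288",
    PREFIX + "Received header from server: HTTP/1.1 200 OK",
    PREFIX + "Sent header to server: Content-Length: 142400",
    PREFIX + "Received header from server: HTTP/1.1 100 Continue",
    PREFIX + "Received header from server: HTTP/1.1 200 OK",
]


def summarize_http_ops(lines):
    """Group sent Content-Length values and received HTTP statuses by operation ID."""
    ops = defaultdict(lambda: {"sent_content_lengths": [], "statuses": []})
    for line in lines:
        m = HEADER_RE.search(line)
        if m is None:
            continue  # not a header line (e.g. TLS "Info:" chatter)
        op_id, direction, header = m.groups()
        if direction == "Sent" and header.startswith("Content-Length:"):
            ops[op_id]["sent_content_lengths"].append(int(header.split(":", 1)[1]))
        elif direction == "Received" and header.startswith("HTTP/"):
            # Keep the status code and reason phrase, e.g. "200 OK".
            ops[op_id]["statuses"].append(header.split(" ", 1)[1])
    return dict(ops)


if __name__ == "__main__":
    for op_id, info in summarize_http_ops(SAMPLE_LINES).items():
        print(op_id, info)
```

For a stalled transfer, an operation ID with a sent Content-Length but no matching final status would be the place to start looking.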
@davidpanderson
Contributor

What happens when you do "Tools / Retry pending transfers"?

@Aurum420
Author

Aurum420 commented Nov 7, 2021

What happens when you do "Tools / Retry pending transfers"?

I have one computer (Rig-18) waiting for ARP to checkpoint before a reboot. The machines are headless, and often when I remote into them BOINCmgr says "Disconnected"; the only ways I know to get "Connected" are to restart the client or reboot. From BoincTasks 1.85, retrying pending transfers does nothing. This example is unusual in that no ARP1 WUs are pending upload, just OPN1 and OPNG, but 3 ARP WUs are running.

13684	World Community Grid	11/7/2021 12:47:14 AM	Scheduler request completed	
13685	World Community Grid	11/7/2021 12:47:14 AM	Project requested delay of 121 seconds	
13686			11/7/2021 12:47:50 AM	Project communication failed: attempting access to reference site	
13687	World Community Grid	11/7/2021 12:47:50 AM	Temporarily failed upload of OPN1_0087693_00635_0_r241424194_0: transient HTTP error	
13688	World Community Grid	11/7/2021 12:47:50 AM	Backing off 00:02:50 on upload of OPN1_0087693_00635_0_r241424194_0	
13689	World Community Grid	11/7/2021 12:47:50 AM	Started upload of OPNG_0100830_00066_0_r631133531_0	
13690			11/7/2021 12:47:51 AM	Internet access OK - project servers may be temporarily down.	
13691			11/7/2021 12:47:54 AM	Project communication failed: attempting access to reference site	
13692	World Community Grid	11/7/2021 12:47:54 AM	Temporarily failed upload of OPN1_0087813_00026_0_r1011078017_0: transient HTTP error	
13693	World Community Grid	11/7/2021 12:47:54 AM	Backing off 00:03:36 on upload of OPN1_0087813_00026_0_r1011078017_0	
13694	World Community Grid	11/7/2021 12:47:54 AM	Started upload of OPNG_0100830_00066_0_r631133531_1	
13695			11/7/2021 12:47:55 AM	Internet access OK - project servers may be temporarily down.	
13696			11/7/2021 12:48:04 AM	Project communication failed: attempting access to reference site	
13697	World Community Grid	11/7/2021 12:48:04 AM	Temporarily failed upload of OPN1_0087770_00525_0_r237206635_0: transient HTTP error	
13698	World Community Grid	11/7/2021 12:48:04 AM	Backing off 00:02:01 on upload of OPN1_0087770_00525_0_r237206635_0	
13699	World Community Grid	11/7/2021 12:48:04 AM	Started upload of OPN1_0087770_00031_0_r1213581315_0	
13700			11/7/2021 12:48:05 AM	Internet access OK - project servers may be temporarily down.	
13701			11/7/2021 12:48:20 AM	Project communication failed: attempting access to reference site	
13702	World Community Grid	11/7/2021 12:48:20 AM	Temporarily failed upload of OPN1_0087770_00103_0_r1954765800_0: transient HTTP error	
13703	World Community Grid	11/7/2021 12:48:20 AM	Backing off 00:03:56 on upload of OPN1_0087770_00103_0_r1954765800_0	
13704	World Community Grid	11/7/2021 12:48:20 AM	Started upload of OPN1_0087770_00080_0_r1764371523_0	
13705			11/7/2021 12:48:21 AM	Internet access OK - project servers may be temporarily down.	
13706	World Community Grid	11/7/2021 12:49:27 AM	update requested by user	
13707	World Community Grid	11/7/2021 12:49:27 AM	Sending scheduler request: Requested by user.	
13708	World Community Grid	11/7/2021 12:49:27 AM	Not requesting tasks: some task is suspended via Manager	
13709	World Community Grid	11/7/2021 12:49:28 AM	Scheduler request completed	
13710	World Community Grid	11/7/2021 12:49:28 AM	Project requested delay of 121 seconds	
13711	World Community Grid	11/7/2021 12:51:45 AM	update requested by user	
13712	World Community Grid	11/7/2021 12:51:48 AM	Sending scheduler request: Requested by user.	
13713	World Community Grid	11/7/2021 12:51:48 AM	Not requesting tasks: some task is suspended via Manager	
13714	World Community Grid	11/7/2021 12:51:49 AM	Scheduler request completed	
13715	World Community Grid	11/7/2021 12:51:49 AM	Project requested delay of 121 seconds	
13716	World Community Grid	11/7/2021 12:54:03 AM	update requested by user	
13717	World Community Grid	11/7/2021 12:54:05 AM	Sending scheduler request: Requested by user.	
13718	World Community Grid	11/7/2021 12:54:05 AM	Not requesting tasks: some task is suspended via Manager	
13719	World Community Grid	11/7/2021 12:54:06 AM	Scheduler request completed	
13720	World Community Grid	11/7/2021 12:54:06 AM	Project requested delay of 121 seconds	
13721	World Community Grid	11/7/2021 12:56:21 AM	update requested by user	
13722	World Community Grid	11/7/2021 12:56:21 AM	Sending scheduler request: Requested by user.	
13723	World Community Grid	11/7/2021 12:56:21 AM	Not requesting tasks: some task is suspended via Manager	
13724	World Community Grid	11/7/2021 12:56:22 AM	Scheduler request completed	
13725	World Community Grid	11/7/2021 12:56:22 AM	Project requested delay of 121 seconds	
13726	World Community Grid	11/7/2021 12:56:52 AM	task ARP1_0012148_101_1 resumed by user	
13727	World Community Grid	11/7/2021 12:58:28 AM	Sending scheduler request: To fetch work.	
13728	World Community Grid	11/7/2021 12:58:28 AM	Requesting new tasks for CPU and NVIDIA GPU	
13729	World Community Grid	11/7/2021 12:58:29 AM	Scheduler request completed: got 0 new tasks	
13730	World Community Grid	11/7/2021 12:58:29 AM	No tasks sent	
13731	World Community Grid	11/7/2021 12:58:29 AM	No tasks are available for OpenPandemics - COVID 19	
13732	World Community Grid	11/7/2021 12:58:29 AM	No tasks are available for OpenPandemics - COVID-19 - GPU	
13733	World Community Grid	11/7/2021 12:58:29 AM	No tasks are available for Help Stop TB	
13734	World Community Grid	11/7/2021 12:58:29 AM	No tasks are available for Africa Rainfall Project	
13735	World Community Grid	11/7/2021 12:58:29 AM	No tasks are available for the applications you have selected.	
13736	World Community Grid	11/7/2021 12:58:29 AM	Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them	
13737	World Community Grid	11/7/2021 12:58:29 AM	Tasks for Intel GPU are available, but your preferences are set to not accept them	
13738	World Community Grid	11/7/2021 12:58:29 AM	Project requested delay of 121 seconds	
13739	World Community Grid	11/7/2021 12:58:39 AM	update requested by user	
13740	World Community Grid	11/7/2021 12:58:39 AM	Sending scheduler request: Requested by user.	
13741	World Community Grid	11/7/2021 12:58:39 AM	Requesting new tasks for CPU and NVIDIA GPU	
13742	World Community Grid	11/7/2021 12:58:40 AM	Scheduler request completed: got 0 new tasks	
13743	World Community Grid	11/7/2021 12:58:40 AM	Not sending work - last request too recent: 10 sec	
13744	World Community Grid	11/7/2021 12:58:40 AM	Project requested delay of 121 seconds
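
The backoff intervals in a message log like the one above can be pulled out with a short script, which makes it easier to see how many uploads are backed off and for how long. This is a sketch in Python, assuming the message text has the form shown in this thread ("Backing off HH:MM:SS on upload of NAME"); parse_backoffs is a hypothetical helper and SAMPLE holds only the message column of four log lines above.

```python
import re
from datetime import timedelta

# Message format as it appears in the log above.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")

SAMPLE = "\n".join([
    "Backing off 00:02:50 on upload of OPN1_0087693_00635_0_r241424194_0",
    "Backing off 00:03:36 on upload of OPN1_0087813_00026_0_r1011078017_0",
    "Backing off 00:02:01 on upload of OPN1_0087770_00525_0_r237206635_0",
    "Backing off 00:03:56 on upload of OPN1_0087770_00103_0_r1954765800_0",
])


def parse_backoffs(log_text):
    """Map each backed-off file name to its most recent backoff interval."""
    result = {}
    for m in BACKOFF_RE.finditer(log_text):
        hours, minutes, seconds, name = m.groups()
        result[name] = timedelta(
            hours=int(hours), minutes=int(minutes), seconds=int(seconds)
        )
    return result


if __name__ == "__main__":
    backoffs = parse_backoffs(SAMPLE)
    for name, delay in sorted(backoffs.items(), key=lambda item: item[1]):
        print(f"{delay}  {name}")
    print("files backed off:", len(backoffs))
```

The regular expression searches anywhere in each line, so full log lines with the leading index, project, and timestamp columns can be passed in unchanged.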

@Aurum420
Author

client: fix overly aggressive project-wide file transfer backoff policy. #4575
Will not solve this problem.

@Aurum420
Author

If all uploads get into the same "Upload pending (Project backoff)" state, and it's rare they do (see next examples), then uploads can be restarted as in these 3 screenshots:
[Screenshot: Rig-8 all backoff]
[Screenshot: Rig-8 retry one]
[Screenshot: Rig-8 retry uploading]

@Aurum420
Author

Aurum420 commented Nov 14, 2021

The typical state for stalled uploads is to have a few in the "uploading" state. Which files claim to be "uploading" without actually transferring jumps around. As long as even one file is "uploading", the uploads cannot be restarted. Then a moment later (see 2nd screenshot) they all switch to the "Upload pending, retried" state and still do not upload.
[Screenshot: Rig-20 2 waiting]
[Screenshot: Rig-20 2 waiting - minute later]

@Aurum420
Author

Aurum420 commented Dec 3, 2021

It sure would be nice if someone who appreciates the physics of file transfer would take an interest in this issue. E.g., completed ARP WUs are returned as 7 files with multiplexing. I have no idea why they're in 7 files instead of just one. It seems that one file without multiplexing would be 14 times more efficient: transferring a file involves some handshaking between the client and server, and instead of doing that once it's done 2 x 7 = 14 times, assuming multiplexing only divides the transfer between two destination servers. If multiplexed, the servers then have to recombine the files in additional transactions, wasting time, energy, and bandwidth. If those 7 files need to be kept separate, couldn't they be zipped together on the client and transferred as a single file?

@AenBleidd
Member

@Aurum420, the Project decided to have separate files. The BOINC client can't zip these files because the Project would then reject them. BOINC acts as instructed by the Project and takes no additional actions.

@Aurum420
Author

Aurum420 commented Dec 9, 2021

It has been suggested that I batch completed WUs and try to send too many at once. I do not batch and have even doubled my ISP speed.
(https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,41910_offset,1600#669488)
All of my cc_config files have this option: <report_results_immediately>1</report_results_immediately>
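
For reference, the options mentioned in this thread (<report_results_immediately>, <max_file_xfers_per_project>, and the http_debug log flag) all live in cc_config.xml. The following is a small sketch in Python that generates such a file; build_cc_config is a hypothetical helper, and the transfer limit of 2 is an illustrative value, not a recommendation (as noted earlier in the thread, limiting concurrent transfers did not stop the stalls).

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_xfers_per_project, http_debug=False):
    """Return cc_config.xml text with the options discussed in this thread."""
    root = ET.Element("cc_config")

    # Logging flags go under <log_flags>.
    log_flags = ET.SubElement(root, "log_flags")
    ET.SubElement(log_flags, "http_debug").text = "1" if http_debug else "0"

    # Behavioural settings go under <options>.
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "report_results_immediately").text = "1"
    ET.SubElement(options, "max_file_xfers_per_project").text = str(max_xfers_per_project)

    # ET.indent is available from Python 3.9; older versions emit one long line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only.
    print(build_cc_config(2, http_debug=True))
```

The generated text would be saved as cc_config.xml in the BOINC data directory, which the client reads at startup.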
