perf-macstadium-macos1015-x64-1 & 2 unable to download dacapo.jar #1363

Closed
andrew-m-leonard opened this issue Jun 3, 2020 · 29 comments

@andrew-m-leonard
Contributor

Platform:
Mac
https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.perf_x86-64_mac_xl/174/console

18:48:13 getDacapoSuite:
18:48:13 [echo] curl -Lks https://sourceforge.net/projects/dacapobench/files/latest/download -o dacapo.jar
18:49:36
18:49:36 BUILD FAILED
18:49:36 /Users/jenkins/workspace/Test_openjdk8_j9_sanity.perf_x86-64_mac_xl/openjdk-tests/TKG/scripts/build_test.xml:58: The following error occurred while executing this line:
18:49:36 /Users/jenkins/workspace/Test_openjdk8_j9_sanity.perf_x86-64_mac_xl/openjdk-tests/perf/build.xml:31: The following error occurred while executing this line:
18:49:36 /Users/jenkins/workspace/Test_openjdk8_j9_sanity.perf_x86-64_mac_xl/openjdk-tests/perf/dacapo/build.xml:44: The following error occurred while executing this line:
18:49:36 /Users/jenkins/workspace/Test_openjdk8_j9_sanity.perf_x86-64_mac_xl/openjdk-tests/perf/dacapo/build.xml:32: exec returned: 7

Tried rebuilding: same issue, although sometimes rc=35.
rc 7 : CURLE_COULDNT_CONNECT (7) Failed to connect() to host or proxy.
rc 35 : CURLE_SSL_CONNECT_ERROR (35) A problem occurred somewhere in the SSL/TLS handshake. You really want the error buffer and read the message there as it pinpoints the problem slightly more. Could be certificates (file formats, paths, permissions), passwords, and others.

Testing on other mac slaves it seems fine...
Running a simple test via the Scripting console:

println "curl -Lks https://sourceforge.net/projects/dacapobench/files/latest/download -o dacapo.jar".execute().text

shows that it does seem to download, but VERY VERY slowly, and it seemingly sometimes fails as a result

@andrew-m-leonard
Contributor Author

reference original issue: adoptium/temurin-build#1808

@sxa
Member

sxa commented Jun 3, 2020

@andrew-m-leonard If you need to do this in the future please let me know as I can move the issues between repos if required instead of opening a new one

@andrew-m-leonard
Contributor Author

I was wondering why openjdk-infrastructure does not appear in the github "transfer issue" dropdown?...seemed a bit weird!

@sxa
Member

sxa commented Jun 3, 2020

Feel free to reach out to me if you think there's an issue if it happens again

@andrew-m-leonard
Contributor Author

"Feel free to reach out to me if you think there's an issue if it happens again"

it must be an authority thing, as it's not on my dropdown, wondering if it only lists repos I have certain access to...?

@andrew-m-leonard
Contributor Author

yep it is:

To transfer an open issue to another repository, you must have write permissions on the repository the issue is in and the repository you're transferring the issue to. For more information, see "Repository permission levels for an organization."

@sxa
Member

sxa commented Jun 3, 2020

Right ... That suggests you probably don't have commit rights in this repo then :-)
Interesting that it lets you raise issues but not move them to it. I've learned something today so I'm considering that a win and giving up for the day ;-)

@karianna
Contributor

karianna commented Jun 4, 2020

You could take the approach of grabbing a copy of that binary and putting it on Jenkins master. We do this for freetype and some other libs.

@smlambert
Contributor

smlambert commented Jun 4, 2020

Agree, this can be staged on Jenkins master, we do this for core test dependencies, can have a job that pulls perf dependencies. (It is odd that it is only 2 machines that have issues with downloading from the location... so it does make me wonder if they will still have the issue fetching from the Adopt Jenkins server).

@andrew-m-leonard
Contributor Author

I'll try a fix to perf/build.xml to check if dacapo.jar already exists, in this case downloaded from Jenkins master

@andrew-m-leonard
Contributor Author

maybe add a check for DACAPO_URL

@smlambert
Contributor

smlambert commented Jun 4, 2020

Ya, recognizing this is a suboptimal way to resolve it (if it is indeed to fix a problem only seen on 2 machines, it seems we are not solving the core issue).

We are changing the location of the dacapo URL; the implication is that anyone who runs AQA perf testing anywhere would now fetch it from Adopt Jenkins, which had better always be 'up' (and where we will now need a mechanism for keeping it up to date, which pulling the latest dacapo from sourceforge already handled).

I suppose we could update perf/dacapo/build.xml to try sourceforge and, if that fails, fall back to Adopt Jenkins.
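
As a rough illustration of the "try sourceforge, then fall back" idea, here is a minimal shell sketch rather than the actual build.xml change. The function name `fetch_first`, the `FETCH` override and the `MIRROR_URL` variable are hypothetical and exist only for this example; the default command reuses the thread's `-Lks` flags and adds curl's `-f` so that HTTP error responses count as failures.

```shell
# fetch_first OUTPUT URL [URL...]
# Tries each URL in order and stops at the first successful download.
# FETCH (hypothetical override) names the download command; when unset,
# curl is used with the -Lks flags from this thread plus -f (fail on HTTP errors).
fetch_first() (
  out=$1
  shift
  for url in "$@"; do
    if ${FETCH:-curl -fLks} "$url" -o "$out"; then
      echo "fetched: $url" >&2
      exit 0
    fi
    echo "failed:  $url" >&2
  done
  # Every source failed.
  exit 1
)

# Usage (not run here; MIRROR_URL is a placeholder for a cached copy):
#   fetch_first dacapo.jar \
#     https://sourceforge.net/projects/dacapobench/files/latest/download \
#     "$MIRROR_URL"
```

Listing the public location first keeps sourceforge as the source of the latest jar, and the cached copy is only touched when that download fails.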

@andrew-m-leonard
Contributor Author

The problem occurred on a zLinux box as well this morning. I think it's possibly an issue with the host server: https://sourceforge.net/projects/dacapobench/files/latest/download

@andrew-m-leonard
Contributor Author

here's my suggestion:

           <if>
                <isset property="env.DACAPO_URL"/>
                <then>
                     <var name="curl_options" value="-Lks ${env.DACAPO_URL} -o dacapo.jar"/>
                </then>
                <else>
                     <var name="curl_options" value="-Lks https://sourceforge.net/projects/dacapobench/files/latest/download -o dacapo.jar"/>
                </else>
           </if>
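
For anyone reproducing this outside Ant, the same override can be sketched in shell. This is only an illustration of the logic above: `DACAPO_DEFAULT_URL` and `dacapo_url` are names made up for the example, and unlike `<isset>` the `:-` expansion also treats an empty `DACAPO_URL` as unset.

```shell
# Default download location taken from the build log in this issue.
DACAPO_DEFAULT_URL="https://sourceforge.net/projects/dacapobench/files/latest/download"

# Print the URL to download from: DACAPO_URL when it is set and non-empty,
# otherwise the sourceforge "latest" redirect link.
dacapo_url() {
  printf '%s\n' "${DACAPO_URL:-$DACAPO_DEFAULT_URL}"
}

# Usage (not run here):
#   curl -Lks "$(dacapo_url)" -o dacapo.jar
```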

@andrew-m-leonard
Contributor Author

They have an async javascript download timer, wondering if something "glitches" in that...?

@smlambert
Contributor

smlambert commented Jun 4, 2020

Ya, maybe it's flaky because of the redirect mechanism (we are using curl options for this, and I presume the same version of curl is on all machines...).

I have a vague recollection that we had to request an update a year or two ago, as an older version of curl did not have a particular curl option that was needed.

@smlambert
Contributor

We use the redirect link, so when a newer version of dacapo is uploaded, we just 'get it', but we could hard-code to v9.12 and see how that looks when run through Grinder on the machines in question.

@andrew-m-leonard
Contributor Author

curl seems up to date:

curl 7.64.1 (x86_64-apple-darwin19.0) libcurl/7.64.1 (SecureTransport) LibreSSL/2.8.3 zlib/1.2.11 nghttp2/1.39.2
Release-Date: 2019-03-27

@andrew-m-leonard
Contributor Author

@smlambert which way would you like to try first:

  1. Change perf/dacapo/build.xml to download from the direct URL: curl -Lks https://downloads.sourceforge.net/project/dacapobench/9.12-bach-MR1/dacapo-9.12-MR1-bach.jar -o dacapo.jar

or

  2. Place dacapo.jar on master, set DACAPO_URL to point to it, and add a check in build.xml

@smlambert
Contributor

smlambert commented Jun 4, 2020

Likely the first option, so we go to the benchmark's public location to get it. Option 3 is to use the current approach and, if it fails to redirect and find the latest file, pull it from a cached version on the Adopt server.

In either case, do we have Grinder stats for how frequently this occurs on one of these machines? That way we can grind to see if we ever do hit the same type of problem with either of these approaches in xx number of runs.

For 2nd option, I can upload the jar file with the UploadFile job and we can Grind to see how it fares. (https://ci.adoptopenjdk.net/view/Test_grinder/job/UploadFile/22/artifact/upload/dacapo-9.12-MR1-bach.jar)

@andrew-m-leonard
Contributor Author

andrew-m-leonard commented Jun 4, 2020

I suspect the failure occurs when the network route to the sourceforge.net download is slow or bad in some way... I searched back through the Test.perf jobs and found the problem is not confined to certain slaves. It has happened on the perf mac machines, a zLinux machine, an aarch64 machine and also several xLinux machines.
I would estimate it happens about 1 time in 8 or thereabouts.
@smlambert maybe we should just add a "retry"? Say, try 3 times?

01:04:55  getDacapoSuite:
01:04:55       [echo] curl -Lks https://sourceforge.net/projects/dacapobench/files/latest/download -o dacapo.jar
01:07:19  
01:07:19  BUILD FAILED
01:07:19  /home/jenkins/workspace/Test_openjdk11_hs_sanity.perf_x86-64_linux/openjdk-tests/TKG/scripts/build_test.xml:58: The following error occurred while executing this line:
01:07:19  /home/jenkins/workspace/Test_openjdk11_hs_sanity.perf_x86-64_linux/openjdk-tests/perf/build.xml:31: The following error occurred while executing this line:
01:07:19  /home/jenkins/workspace/Test_openjdk11_hs_sanity.perf_x86-64_linux/openjdk-tests/perf/dacapo/build.xml:44: The following error occurred while executing this line:
01:07:19  /home/jenkins/workspace/Test_openjdk11_hs_sanity.perf_x86-64_linux/openjdk-tests/perf/dacapo/build.xml:32: exec returned: 7
01:07:19 

Found an aarch64 failure: https://ci.adoptopenjdk.net/view/Test_perf/job/Test_openjdk11_j9_sanity.perf_aarch64_linux/111/console
note rc = 18, sounds particularly odd!:

CURLE_PARTIAL_FILE (18)
A file transfer was shorter or larger than expected. This happens when the server first reports an expected transfer size, and then delivers data that doesn't match the previously given size. 
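
To make the three return codes seen in this issue (7, 35 and 18) easier to spot in build logs, a small lookup like the one below could wrap the download step. This is an illustrative sketch only: `curl_rc_name` is a hypothetical helper, and it covers just the codes quoted in this thread plus 0.

```shell
# Map a curl exit status to the libcurl error name quoted in this thread.
curl_rc_name() {
  case "$1" in
    0)  echo "CURLE_OK" ;;
    7)  echo "CURLE_COULDNT_CONNECT" ;;
    18) echo "CURLE_PARTIAL_FILE" ;;
    35) echo "CURLE_SSL_CONNECT_ERROR" ;;
    *)  echo "unlisted (see the libcurl error code list)" ;;
  esac
}

# Usage (not run here):
#   curl -Lks "$url" -o dacapo.jar
#   rc=$?
#   [ "$rc" -eq 0 ] || echo "download failed: rc=$rc $(curl_rc_name "$rc")" >&2
```

Since `-s` silences curl's own error text, adding `-S` (`--show-error`) next to it is one way to get the underlying message into the console output as well.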

Another zLinux: https://ci.adoptopenjdk.net/view/Test_perf/job/Test_openjdk14_hs_sanity.perf_s390x_linux/92/console

@andrew-m-leonard
Contributor Author

Wondering whether we ought to have a generic "curl-with-retry" ant download task that all downloads could use...? Network issues and glitches that would succeed on a retry currently cost us a lot of failed builds.
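
To show the shape of such a wrapper, here is a minimal shell sketch of a generic retry helper. It is not the ant task itself: the function name `retry` and its argument order are assumptions made for this example, and the attempt count of 3 in the usage comment simply mirrors the "try 3 times" suggestion above.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() (
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && exit 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts (last exit status $status)" >&2
      exit "$status"
    fi
    echo "retry: attempt $attempt failed with exit status $status, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
)

# Usage (not run here):
#   retry 3 10 curl -Lks https://sourceforge.net/projects/dacapobench/files/latest/download -o dacapo.jar
```

Returning the last attempt's exit status keeps the original curl return code visible to the caller, so a persistent failure still reports rc 7, 18 or 35 rather than a generic error.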

@andrew-m-leonard
Contributor Author

andrew-m-leonard commented Jun 4, 2020

This sounds ideal? https://ant.apache.org/manual/Tasks/retry.html

This example shows how to use <retry> to wrap a task which must interact with an unreliable network resource.

@smlambert
Contributor

I was thinking the same thing re: a generic retry task in ant that we eventually convert all test fetches to use. I had not yet looked to see whether one exists. Shall we try that approach and see the outcome?

@andrew-m-leonard
Contributor Author

I've tested the ant retry task on a couple of dozen grinders: no issues at all, but no network failures either, unfortunately (Sod's law! It will happen on release day!),
but I think it's a good idea.
@smlambert please can you review this PR: adoptium/aqa-tests#1817

@karianna karianna added this to the June 2020 milestone Jun 5, 2020
@karianna karianna moved this from TODO to In Progress in infrastructure Jun 5, 2020
@sxa
Member

sxa commented Jun 11, 2020

With the aforementioned PR merged, it looks like the job is now passing, so I shall close this.

infrastructure automation moved this from In Progress to Done Jun 11, 2020