Skip to content

checksum-retry #200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

andred
Copy link

@andred andred commented Jan 17, 2019

These two patches make upload a bit more resilient against networking issues:

  • We see uploads failing with a 404, but eventually they succeed with the patch
  • we also upload files that are symlinks, and without the 2nd patch this plugin will try to do a full upload of all the files, without retrying the checksum deploy in case of network issues

This is probably not the best way to solve this, open for better suggestions :-)

André Draszik added 2 commits January 17, 2019 16:12
…loy failure)

Seeing a lot of networking issues, this is trying to make
upload a bit more tolerant to network issues.

There are two cases where we can get IOExceptions:
* from the actual apache httpclient org.apache.http.client CloseableHttpClient
* as a result of parsing the Artifactory response (JSON parsing) in
  private ArtifactoryUploadResponse uploadFile(DeployDetails details, String uploadUrl)

The 1st case is handled by existing code, but the 2nd case
isn't so far. This commit adds handling to the JSON parsing.

Signed-off-by: André Draszik <andre.draszik@jci.com>
There are two issues:
* in the mean time, upload can have succeeded (e.g. due to symlinked files)
  PreemptiveHttpClient will still keep trying to upload,
  although a checksum deploy would be enough in that
  case
* SpecDeploymentConsumer has no way of noticing that
  PreemptiveHttpClient should not be continuing in
  that case

Let's avoid it here, and delegate to SpecDeploymentConsumer so
that we also benefit from checksum deploy.

Signed-off-by: André Draszik <andre.draszik@jci.com>
@eyalbe4
Copy link
Contributor

eyalbe4 commented Feb 13, 2019

@andred,
If I understand correctly, your code overrides the http client's retry mechanism, and implements it manually, so that a retry is performed even if the PUT requests returns 404.
Do you know what is the root cause of the 404? I'm not sure that a network hiccup can cause a 404 response. I'm concerned that this code change will hide the issue on your side, as well as add redundant retries for other users.
Maybe we should dig down further and understand how one PUT return a 404 and an identical one right after does not.

@andred
Copy link
Author

andred commented Feb 14, 2019

Hi,

To answer the first part:
In a way, yes. At the moment, the http client retries itself (up to three times if I remember right), unless the server (Artifactory) returns a 404 to the POST. In that case, the upload fails, there is no retry (and the Jenkins build fails subsequently).

So this means
a) that the http client's retry is not in line with the number of retries set within the rest of the Jenkins Artifactory plugin.
b) if the same file has been successfully uploaded in the mean-time by some other means, the http-client instance doesn't know about that, and will not try to do a checksum-deploy before retrying the upload for real on subsequent attempts
c) a 404 causes the build to fail, even though Artifactory might not return a 404 the next time, e.g. because the upload succeeds on the subsequent attempt by the same http-client, or because the file appeared via a different means on the Artifactory server and checksum deploy could have been executed

b) above is a useful optimisation for me, because we upload large files (100s of MiB), where some files are just symlinks, and Artifactory doesn't really support symlinks transparently, but they work OK and fast with this change because of Artifactory's checksum deploy feature / checksum comparison.

As for the last question:
I am not sure why our Artifactory instance returns 404 sometimes. We've started to look into that, there is nothing in the Artifactory logs itself. It does look like a network issue (TCP timeouts), our IT department is still looking into that. What basically happens is that the upload stalls, and after a while Artifactory returns 404, probably because of hitting an internal timeout.

So yes, this patch here is maybe a work-around. But keep in mind that network issues are nothing unusual, so making the plugin more resilient against that doesn't seem like a bad idea.

@eyalbe4
Copy link
Contributor

eyalbe4 commented Feb 14, 2019

@andred,
I'd be really surprised if the 404 is returned by Artifactory to a PUT request.
If it does, you should find this in the requests.log, which is available also from the Artifactory UI. If you do find a 404 there, can you please share it with us?
It is more likely that the 404 is returned by a proxy somewhere between the build agent and Artifactory. In that case, there's a chance that only some of the requests go through that proxy (for example, a load balancer can cause this) and that's why your workaround works.
If indeed this is what happens, this is not a common / normal network issue / hiccup, and this workaround may hide a real issue on your environment. Apache's http client does nor support retries for 404 responses because it is wrong to do so. I therefore don't think we should support this workaround officially. The build-info client is a core components used by many systems, and it is expected to use the common http standards.
You can use your fork till the issue is resolved, but I'd recommend fixing the real cause of the issue and revert back to the master branch code once the issue is resolved.

@andred
Copy link
Author

andred commented Feb 14, 2019

Thank you for sharing your thoughts.

I think I remembered wrong, though. I looked at some older logs that our Artifactory admins shared with me last year:

user@host $2022 ~/tmp > grep 20181009173942.rootfs.ext4 ArtifactoryLogs.txt Line 19868: 20181009130036|2|REQUEST|10.32.234.132|username|PUT|/my-project-images/nightlies/2018-10-09T17%3A17%2B00%3A00/the-platform/my-project-image-the-platform-20181009173942.rootfs.ext4;build.timestamp=1539105430142;build.name=my-project+%3A%3A+yocto-my-project-pipelines+%3A%3A+space-integration-yocto-master;build.number=31|HTTP/1.1|404|0 Line 20378: 20181009130140|62995|REQUEST|10.32.234.132|username|PUT|/my-project-images/nightlies/2018-10-09T17%3A17%2B00%3A00/the-platform/my-project-image-the-platform-20181009173942.rootfs.ext4;build.timestamp=1539105430142;build.name=my-project+%3A%3A+yocto-my-project-pipelines+%3A%3A+space-integration-yocto-master;build.number=31|HTTP/1.1|404|486539264 Line 21898: 20181009131744|21|REQUEST|10.32.234.132|username|PUT|/my-project-images/nightlies/2018-10-09T17%3A17%2B00%3A00/the-platform/my-project-image-the-platform-20181009173942.rootfs.ext4;build.timestamp=1539105430142;build.name=my-project+%3A%3A+yocto-my-project-pipelines+%3A%3A+space-integration-yocto-master;build.number=31|HTTP/1.1|201|0

As can be seen, there is 404 for the PUT of initial checksum deploy, which is fine and expected.
Afterwards, there is a 404 on the PUT when actually trying to upload the 460MiB file.

Please correct me if I'm interpreting the log incorrectly.

@eyalbe4
Copy link
Contributor

eyalbe4 commented Feb 14, 2019

Thanks @andred - you are correct!
This seems like a bug in Artifactory, which RTFACT-18518 tracks.
When the 404 is returned, the Artifactory log should also include an exception, which should hint to the real cause fo the issue. The exception should be logged at the time that the 404 is returned. If you're unable to track down the exception in the log, try looking for a message that contains failed due to

@andred
Copy link
Author

andred commented Feb 14, 2019

Thanks.

Unfortunately, there is no 'failed due to' in the request.log that I have. Would it have been in a different log file? Those three lines pasted above are the only mentions of the file involved.

@eyalbe4
Copy link
Contributor

eyalbe4 commented Feb 27, 2019

@andred - I discussed this with the developers.
Once RTFACT-18518 is fixed, Artifactory will return the correct HTTP response code instead of a 404.
There is however a real problem, which can be traced in Artofactory logs at around the time of the failure.

@eyalbe4
Copy link
Contributor

eyalbe4 commented Nov 10, 2019

Since RTFACT-18518 tracks this issue, I think that this once cam be closed.

@eyalbe4 eyalbe4 closed this Nov 10, 2019
RobiNino pushed a commit to RobiNino/build-info that referenced this pull request Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants