intermittent failures in pod exec operations in jenkins-x pipelines #2582

Closed
ryandawsonuk opened this issue Feb 28, 2019 · 10 comments

ryandawsonuk commented Feb 28, 2019

Have raised jenkins-x/jx#2861.

This seems especially likely when multiple builds run in parallel. Earlier I saw:

[screenshot: several parallel builds, all failed]

They were all failures with the same error message:

java.util.concurrent.RejectedExecutionException: Task okhttp3.RealCall$AsyncCall@5a85ec0b rejected from java.util.concurrent.ThreadPoolExecutor@2e54f447[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 20]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at okhttp3.Dispatcher.enqueue(Dispatcher.java:130)
at okhttp3.RealCall.enqueue(RealCall.java:100)
at okhttp3.internal.ws.RealWebSocket.connect(RealWebSocket.java:183)
at okhttp3.OkHttpClient.newWebSocket(OkHttpClient.java:436)
at io.fabric8.kubernetes.client.dsl.internal.PodOperationsImpl.exec(PodOperationsImpl.java:267)
at io.fabric8.kubernetes.client.dsl.internal.PodOperationsImpl.exec(PodOperationsImpl.java:61)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:318)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:236)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:188)

http://jenkins.jx.35.228.195.195.nip.io/job/Activiti/job/activiti-cloud-query/job/PR-154/1/console
http://jenkins.jx.35.228.195.195.nip.io/job/Activiti/job/activiti-cloud-audit/job/PR-154/1/console
http://jenkins.jx.35.228.195.195.nip.io/job/Activiti/job/activiti-cloud-notifications-graphql/job/PR-42/1/console

In each case the Travis builds all passed without problems.

It could be related to resource usage in the cluster. I'll have to get there quickly enough to see resource usage at the time it happens.


ryandawsonuk commented Feb 28, 2019

It seems to always work on a re-trigger, presumably because then only one pipeline is running. It seems to especially affect the examples cluster, which does run the most parallel pipelines. So I had a look at the build pod resources and noted they are now up to:

[screenshot: current build pod resource requests and limits]

Whereas the defaults are:

[screenshot: default build pod resource requests and limits]


ryandawsonuk commented Mar 4, 2019

A new instance of this is http://jenkins.jx.35.242.205.159.nip.io/job/Activiti/job/activiti-cloud-notifications-service-graphql/job/develop/23/console

Note that this is on the cloud cluster, not the examples cluster.


ryandawsonuk commented Mar 4, 2019

I also had a look at the cloud cluster. During the peak there can be about 7 maven build pods running, each using about 500mCPU and 600Mi. The cluster can currently meet the requests, but the pods are actually exceeding their requests and the cluster would not be able to meet the limit numbers. Usage did not appear to hit the limits of availability while I was watching, but kubectl top nodes did get as high as 89% CPU and 79% memory on some nodes. Not sure how much this helps, as I didn't see the issue happen either, but this level of load does seem to be enough to trigger it.

I did notice that the notifications service tests seem to be especially intensive. That pod can use 1237mCPU and 1019Mi, so it can briefly go above the limits, which are currently set at 1 CPU and 1024Mi.

The cluster can add an extra node as part of the autoscaling group, but I think that would require it to be hitting its limits for a sustained period. We're talking about short-lived jobs, and if one of them fails that's actually quite an inconvenience for us.

In the process I did notice jenkins-x/jx#3267

I have put the requests back up to around 700mCPU and 900Mi as that is more representative of usage and added an extra node. This may help - we can continue to keep an eye on it.
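For reference, here is roughly what those numbers look like expressed as an inline Jenkins Kubernetes plugin pod template. This is a minimal sketch only, since our builders are actually defined in the team's builder YAML; the label and image name here are illustrative placeholders:

```groovy
// Minimal sketch, not our actual builder definition: requests raised to match
// observed usage (~700mCPU / 900Mi), limits left at the current 1 CPU / 1024Mi.
// The 'maven-builder' label and the image name are illustrative placeholders.
podTemplate(label: 'maven-builder', containers: [
    containerTemplate(
        name: 'maven',
        image: 'jenkinsxio/builder-maven',  // hypothetical image
        ttyEnabled: true,
        command: 'cat',
        resourceRequestCpu: '700m',
        resourceRequestMemory: '900Mi',
        resourceLimitCpu: '1',
        resourceLimitMemory: '1024Mi')
]) {
    node('maven-builder') {
        container('maven') {
            sh 'mvn -B clean verify'
        }
    }
}
```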


ryandawsonuk commented Mar 4, 2019

I don't think this is exclusively about resource availability but it looks like an aggravating factor.


ryandawsonuk commented Mar 5, 2019

Now changed the examples cluster maven builder pod resources to match the cloud one:

[screenshot: updated examples cluster maven builder pod resources]

Then manually started a build creating version 7.1.19 of activiti-cloud-dependencies.

Then I watched this and realised that most of the resource usage is from the dind pods:

[screenshot: dind pod resource usage]

Note the numbers are considerably higher than https://github.com/Activiti/activiti-cloud-audit/blob/7af19da9ade79af70d2292e30adf689ca6b28c8a/maven-dind.yaml#L38

Not sure whether this is related, as the failures I see in the pipeline all happen at the point where the maven build has just succeeded (in each case the "Build Release" stage is skipped), and they are on PRs, not the master branch (master actually seems to go through fine). So I guess they fail at the sh "make build" command - it must be the helm-related steps.

Perhaps that step needs a retry, although I'm not sure we even need it in the PRs. It was actually most of the example repos that failed, so I'm not sure retrying will work.

Another way to tackle this might be to limit the total number of concurrent jobs on the cluster, but that works a bit differently with the Jenkins Kubernetes plugin (see the sketch below).
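If we did go that way, the plugin caps concurrency through its own settings rather than through the usual executor counts - a minimal sketch, with an illustrative cap, label and image (there is also a global containerCap on the cloud configuration):

```groovy
// Minimal sketch: instanceCap limits how many pods this template can run at
// once; further builds queue until a pod frees up. The cap, label and image
// are illustrative, not values we have committed to.
podTemplate(label: 'maven-builder', instanceCap: 4, containers: [
    containerTemplate(name: 'maven', image: 'jenkinsxio/builder-maven',  // hypothetical image
        ttyEnabled: true, command: 'cat')
]) {
    node('maven-builder') {
        container('maven') {
            sh 'mvn -B verify'
        }
    }
}
```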

This problem does not affect the ttc cluster, which has about the same number of concurrent jobs and also uses dind. One difference is that the ttc jobs only do maven builds on PRs, not the additional make steps. So we could try doing the same as ttc and skip the helm build for PRs.


ryandawsonuk commented Mar 5, 2019

I did a kubectl describe on one of these pods named maven-dind. The reason the requests and limits are so high is that the pod contains the maven-java11 container as well as the dind container, and the pod-level resources are the totals of both added together. Note that the dind resources are therefore allocated even if no docker image is built (as is the case with a PR).

I checked with kubectl top pods during a docker build and the whole combined pod did not get above 1089mCPU and 924Mi... which is about as much as we are currently requesting for dind alone.

Seems like we could reduce the requests on both a bit and try setting each to the same value.
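A minimal sketch of what that could look like for the maven-dind template - the container names follow the kubectl describe output above, but the numbers are illustrative, chosen so the two containers sum to roughly the observed 1089mCPU / 924Mi peak:

```groovy
// Minimal sketch, not the committed settings: both containers get the same,
// slightly reduced requests, so the pod-level total (the scheduler sums the
// per-container requests) stays near the observed peak usage.
podTemplate(label: 'maven-dind', containers: [
    containerTemplate(name: 'maven-java11',
        image: 'jenkinsxio/builder-maven-java11',  // hypothetical image
        ttyEnabled: true, command: 'cat',
        resourceRequestCpu: '550m', resourceRequestMemory: '512Mi'),
    containerTemplate(name: 'dind',
        image: 'docker:dind',  // hypothetical image
        privileged: true,
        resourceRequestCpu: '550m', resourceRequestMemory: '512Mi')
]) {
    node('maven-dind') {
        container('maven-java11') {
            sh 'mvn -B clean verify'
        }
    }
}
```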


ryandawsonuk commented Mar 5, 2019

Made the change to reduce requests on the example runtime bundle and it built OK (http://jenkins.jx.35.228.195.195.nip.io/job/Activiti/job/example-runtime-bundle/job/master/160/console), so merged the same change for the others too.

Then did another manual build of activiti-cloud-dependencies to produce version 7.1.20 - that seems to go through fine from a Jenkins perspective. (Some jobs got queued on Travis, presumably because I've now run it so much that we're hitting Travis quotas.) Looks promising this time.


ryandawsonuk commented Mar 6, 2019

Examples cluster is looking better so far.

Saw an interesting example of this in the cloud cluster. The error is:

java.util.concurrent.RejectedExecutionException: Task okhttp3.RealCall$AsyncCall@5ba63f67 rejected from java.util.concurrent.ThreadPoolExecutor@3c0b8202[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 69]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at okhttp3.Dispatcher.enqueue(Dispatcher.java:130)
	at okhttp3.RealCall.enqueue(RealCall.java:100)
	at okhttp3.internal.ws.RealWebSocket.connect(RealWebSocket.java:183)
	at okhttp3.OkHttpClient.newWebSocket(OkHttpClient.java:436)
	at io.fabric8.kubernetes.client.dsl.internal.PodOperationsImpl.exec(PodOperationsImpl.java:267)
	at io.fabric8.kubernetes.client.dsl.internal.PodOperationsImpl.exec(PodOperationsImpl.java:61)

in http://jenkins.jx.35.242.205.159.nip.io/job/Activiti/job/activiti-cloud-service-common/job/develop/102/console

This happens during updatebot commands and actually gets retried: https://github.com/Activiti/activiti-cloud-service-common/blob/develop/Jenkinsfile#L60

So presumably what is happening here is that any step wrapped in container('maven') in the pipeline involves a watch being created in the Jenkins Kubernetes plugin implementation, and we see the error when that watch is lost.
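For reference, the retry-around-container pattern at that Jenkinsfile line looks roughly like this - a minimal sketch, since the exact command and retry count there may differ:

```groovy
// Minimal sketch of the pattern: each sh step inside container('maven') runs
// over a pod exec websocket, which is where the terminated-thread-pool
// RejectedExecutionException above surfaces. Wrapping the steps in retry()
// turns a lost watch into a failed attempt rather than a failed build.
retry(2) {
    container('maven') {
        sh 'updatebot push-version --kind maven'  // illustrative updatebot command
    }
}
```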


ryandawsonuk commented Mar 6, 2019

With the resource changes above this seems to be much less common now - just had a couple of fully clean runs

I don't expect it will be fully gone but it is at least less common.

@salaboy salaboy added this to the 7.1.0.M1 milestone Mar 6, 2019
ryandawsonuk added a commit to AlfrescoArchive/activiti-cloud-dependencies that referenced this issue Mar 7, 2019
takes a little while to generate updatebot PRs so spreading out the examples-level ones might further help their reliability Activiti/Activiti#2582
ryandawsonuk added a commit to AlfrescoArchive/ttc-rb-english-campaign that referenced this issue Mar 8, 2019
match to example-runtime-bundle following Activiti/Activiti#2582 (comment)
ryandawsonuk added a commit to AlfrescoArchive/ttc-connectors-dummytwitter that referenced this issue Mar 8, 2019
match to example-runtime-bundle following Activiti/Activiti#2582 (comment)
ryandawsonuk added a commit to AlfrescoArchive/ttc-connectors-reward that referenced this issue Mar 8, 2019
match to example-runtime-bundle following Activiti/Activiti#2582 (comment)
ryandawsonuk added a commit to AlfrescoArchive/ttc-query-campaign that referenced this issue Mar 8, 2019
match to example-runtime-bundle following Activiti/Activiti#2582 (comment)
@salaboy salaboy modified the milestones: 7.1.0.M1, 7.1.0.M2 Mar 8, 2019
@mteodori mteodori closed this as completed Feb 5, 2020
@mteodori mteodori modified the milestones: 7.1.0.RC1, 7.1.0.M6 Feb 5, 2020