
Jobs with many parallel tasks cause GKE cluster failures #96

Open
dleehr opened this issue Aug 12, 2019 · 16 comments

dleehr commented Aug 12, 2019

When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive, which triggers failures in the workflow. The failures are not caused by the computations themselves.

This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d

The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.

Some ideas:

  • Local k8s was 1.14, while GKE starts at 1.12. Try upgrading the cluster.
  • Upgrade the Python k8s client to 9.0 (see "KubernetesClient.follow_logs() exits before getting all logs" #60).
  • Compare to OpenShift on local infrastructure.
  • Revisit the threading logic, which relies on cwltool --parallel. While running, docker stats reported over 2000 PIDs for the calrissian container that orchestrates the workflow (a count that includes both processes and kernel threads). It's not clear that this is a problem, but it may be a symptom.
  • Revisit the design that opens a cluster API connection to watch/follow logs for every pod. Maintaining 175 parallel watch connections to the cluster may not be a solid foundation.
  • Add retry logic around the cluster API calls, e.g. with tenacity (see "Adds retry logic to handle expired upload chunk urls", DukeDSClient#261); a rough sketch follows this list.
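
For the tenacity idea, something like this minimal sketch could wrap our kubernetes API calls (the wrapped call, exception list, and backoff numbers here are illustrative assumptions, not existing calrissian code):

from kubernetes import client
from kubernetes.client.rest import ApiException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
from urllib3.exceptions import ProtocolError


@retry(
    retry=retry_if_exception_type((ApiException, ProtocolError, ConnectionError)),
    wait=wait_exponential(multiplier=1, max=60),  # back off between attempts, up to 60s
    stop=stop_after_attempt(5),                   # give up after 5 attempts
    reraise=True,
)
def read_pod(core_api: client.CoreV1Api, name: str, namespace: str):
    # Read a pod, retrying transient API/connection failures.
    return core_api.read_namespaced_pod(name, namespace)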

dleehr commented Aug 12, 2019

Logs from running sleep-echo: errors.txt


dleehr commented Aug 14, 2019

Some ideas after discussing with @hlapp:

  • It's probably fair to assume that others aren't seeing this level of trouble when running a couple hundred pods in parallel. What is it about our usage that's different?
  • Common suspects for this type of problem:
    • Is some I/O operation blocking when it should not be?
    • Are there any locks/race conditions?
    • Networking: is a long-open TCP connection timing out?

We're certainly eagerly watching logs, and I know we've encountered issues in other projects when we expected streaming connections (e.g. RabbitMQ in lando) to stay open indefinitely. So after the easy/obvious steps of upgrading dependencies and k8s itself, it's probably worth looking at the watch/log connections; a rough sketch of a reconnecting log follower is below.
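
For reference, a reconnecting log follower could look roughly like this (illustrative only; the function name and retry handling are assumptions, not our current follow_logs):

import time

from kubernetes import client, watch
from kubernetes.client.rest import ApiException
from urllib3.exceptions import ProtocolError


def follow_pod_logs(core_api: client.CoreV1Api, name: str, namespace: str, retry_delay: float = 5.0):
    # Stream a pod's logs line by line, re-attaching if the connection drops.
    while True:
        try:
            w = watch.Watch()
            for line in w.stream(core_api.read_namespaced_pod_log, name=name, namespace=namespace):
                print(f"[{name}] {line}")
            return  # stream ended normally (pod finished)
        except (ProtocolError, ApiException) as exc:
            # Connection dropped (e.g. during a master upgrade/resize); wait and re-attach.
            # A real implementation would track position (e.g. since_seconds) to avoid
            # re-reading the whole log on every reconnect.
            print(f"[{name}] log stream interrupted ({exc}); retrying in {retry_delay}s")
            time.sleep(retry_delay)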

@johnbradley
Copy link
Contributor

johnbradley commented Aug 14, 2019

I tried running the sleep-echo workflow after upgrading the Google k8s cluster software to 1.13.7-gke.19. The cluster became unresponsive after I started the sleep-echo workflow job, and I noticed an "Upgrading master" message in the Google Cloud console. I let it sit for a couple of minutes and the cluster came back; the sleep-echo workflow job was still there with a failed status.
I ran the sleep-echo workflow again, and this time it failed with a cwltool error that a JavaScript expression took too long to evaluate. The cluster remained responsive this time.
I ran the sleep-echo workflow two more times; it worked both times and the cluster remained responsive.

I also ran the sleep-echo workflow against our hardspin OpenShift test cluster and it worked fine.


dleehr commented Aug 15, 2019

I ran the sleep-echo workflow two more times; it worked both times and the cluster remained responsive.

So there's certainly a possibility that all of the troubles are due to a cluster upgrade happening in the middle of a job - not at all related to the workload. A surprise cluster upgrade certainly explains:

  1. Why k8s API connections from calrissian to the master are dropped/disconnected
  2. Why the google cloud console dashboard becomes unresponsive trying to obtain cluster info

Great detective work; it seems like the auto-upgrade is causing this.


dleehr commented Aug 15, 2019

The docs say:

When you create a new cluster using Google Cloud Platform Console, node auto-upgrade is enabled by default.

And when creating a cluster from the CLI, I see:

WARNING: In June 2019, node auto-upgrade will be enabled by default for newly created clusters and node pools. To disable it, use the `--no-enable-autoupgrade` flag.

I suggest we disable auto-upgrade when creating these clusters, in addition to adding retry logic.


johnbradley commented Aug 15, 2019

I think there are two different "master upgrades" that display the same way in the Google Cloud console.

  1. Upgrading the master software (for example to 1.13.7-gke.19)
  2. The upgrade that I consistently see during the first run of the sleep-echo workflow

They both show the following message in https://console.cloud.google.com:

Upgrading cluster master.
The values shown below are going to change soon.

There are commands to perform the software upgrade via the command line or through the Google Cloud console. My assumption is that the second kind of master upgrade is Google migrating the master to a bigger server, since workflows seem to run fine after it completes. I haven't been able to find a flag or command to perform this upgrade manually.


dleehr commented Aug 15, 2019

I see: so it's not necessarily due to a software version upgrade; it may be responding to the number of pods and migrating to a larger master.

So, disabling the automatic software upgrade wouldn't prevent GKE from migrating to a bigger master.

Do you have a screenshot of where you see that message?

@johnbradley

(Screenshot: "Screen Shot 2019-08-15 at 9 02 06 AM", showing the "Upgrading cluster master" message in the Google Cloud console.)


dleehr commented Aug 15, 2019

Thanks. Your findings match up with the behavior described in this issue, and it seems to be the designed/intended behavior: hashicorp/terraform-provider-google#3385.

And that makes sense - the master is dynamic and may be unavailable while it's scaling up to manage more nodes. That can certainly happen on other cloud platforms, so we should handle it better :)


johnbradley commented Aug 15, 2019

I came across some documentation that seems related to "Upgrading cluster master": https://github.com/GoogleCloudPlatform/gke-stateful-applications-demo#deployment-steps

It refers to the k8s master as the control plane: "If you spawn a cluster with 6 nodes it may go into a 'RECONCILING' status as the control plane's instance size may be increased."

I found that you can see this status in gcloud container clusters list while the cluster master is being upgraded.

I tried starting a cluster with 6 small nodes and only saw failures due to JavaScript evaluation taking too long.
This is more of a workaround than actually addressing the core issue of handling cluster upgrades.

@johnbradley

I added the calrissian --eval-timeout 600 flag when running the sleep-echo workflow on a Google Cloud cluster started with 6 small nodes and a larger pool, and it ran to completion with no errors. The larger pool scaled up during the job.

The --eval-timeout flag was added to avoid the "Long-running script killed" exception. The cwltool logic that checks this timeout just waits a number of seconds:
https://github.com/common-workflow-language/cwltool/blob/a9078408aafdee0e0f7c27954fb80ef1a0c39e9c/cwltool/sandboxjs.py#L176-L186
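
Roughly, that check boils down to polling a node subprocess and killing it once the timeout elapses. A simplified sketch of that pattern (illustrative only, not cwltool's actual code):

import subprocess
import time


def eval_with_timeout(js_expression: str, eval_timeout: float = 20.0) -> str:
    # Run a node one-liner and kill it if it exceeds eval_timeout seconds.
    proc = subprocess.Popen(
        ["node", "--eval", f"console.log({js_expression})"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    deadline = time.monotonic() + eval_timeout
    while proc.poll() is None:
        if time.monotonic() > deadline:
            proc.kill()
            raise RuntimeError(f"Long-running script killed after {eval_timeout} seconds")
        time.sleep(0.1)
    return proc.stdout.read().decode()

On a heavily loaded node, just starting the node process can eat into a small timeout, which may be why raising --eval-timeout avoids the error.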

@johnbradley

Running the full 24-exome-sample workflow failed even with the above tweaks.
There were a handful of errors, but the first one was:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 593, in run
    completion_result = self.wait_for_kubernetes_pod()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 372, in wait_for_kubernetes_pod
    return self.client.wait_for_completion()
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 151, in wait_for_completion
    monitor.remove(pod)
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 268, in remove
    PodMonitor.pod_names.remove(pod.metadata.name)
ValueError: list.remove(x): x not in list

I was able to look at logging around this error via the Google Cloud Console.
In Stackdriver Error Reporting I clicked "View logs" for the error, which takes you to Stackdriver Logging. I set the time limit to +/- 1 hour and cleared out the error-specific filter.
The above ValueError occurred after a successful trim command, when we tried to remove its pod from the PodMonitor.

Side thought:
I noticed in Stackdriver Logging that each log entry carries a good deal of data. I was wondering if the quantity of logging we do to stdout/stderr might be part of the larger problem.


johnbradley commented Aug 21, 2019

Attempting to summarize the different types of errors seen in the Google Cloud error logs.

  • Connection Errors

    • [Errno 111] Connection refused
    • urllib3.exceptions.ProtocolError: 'Connection broken: IncompleteRead'
  • Blocking IO Errors

    • BlockingIOError: [Errno 11] write could not complete without blocking
      at flush (/usr/local/lib/python3.6/logging/__init__.py:978)
      at emit (/usr/local/lib/python3.6/logging/__init__.py:998)
  • kubernetes.client.rest.ApiException 400, 404 and 500

    • Example: read_namespaced_pod_log, but also seen elsewhere
  • ValueError: list.remove(x): x not in list

    • at remove (/usr/local/lib/python3.6/site-packages/calrissian/k8s.py:268)
      at wait_for_completion (/usr/local/lib/python3.6/site-packages/calrissian/k8s.py:151)
      at wait_for_kubernetes_pod (/usr/local/lib/python3.6/site-packages/calrissian/job.py:372)
      at run (/usr/local/lib/python3.6/site-packages/calrissian/job.py:593)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)
  • Errors within subprocesses

    • java.io.EOFException: Unexpected end of ZLIB input stream
      at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read (MultiMemberGZIPInputStream.java:68)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext (FastQFile.java:138)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.next (FastQFile.java:125)
      at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run (AnalysisRunner.java:76)
      at java.lang.Thread.run (Thread.java:748)
  • Errors collecting output

    • File "<string>", line 1, in <module>: cwltool.errors.WorkflowException: ../input/workflow/sleep-echo.cwl:41:7: Expression evaluation error:
      at raise_from (/usr/local/lib/python3.6/site-packages/future/utils/__init__.py:400)
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:298)
      at collect_output (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:737)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:657)
    • cwltool.errors.WorkflowException: Error collecting output for parameter 'output_qc_report':
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:168)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:612)
      at run (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:191)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)
  • Javascript Timeout errors

    • These come from cwltool logic; they can be avoided by passing --eval-timeout 600.


johnbradley commented Aug 21, 2019

Some of these exceptions seem to be side effects of others. For example, ValueError: list.remove(x): x not in list seems to be a side effect of wait_for_completion exiting early.

def wait_for_completion(self):
    w = watch.Watch()
    for event in w.stream(self.core_api_instance.list_namespaced_pod, self.namespace,
                          field_selector=self._get_pod_field_selector()):

Typically w.stream blocks until we call w.stop() inside wait_for_completion. It looks like when we lose the connection to the k8s server, the stream returns (or throws an exception). This in turn causes an early call to PodMonitor.cleanup() that eventually results in the ValueError above.
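
One way to make the bookkeeping tolerant of that early exit would be a guard in remove(). A minimal sketch, assuming PodMonitor tracks pod names in a class-level list as the traceback suggests (the guard and lock here are illustrative, not calrissian's actual code):

import threading


class PodMonitor:
    # Illustrative stand-in for calrissian.k8s.PodMonitor's name tracking.
    pod_names = []
    lock = threading.Lock()

    @classmethod
    def add(cls, pod):
        with cls.lock:
            cls.pod_names.append(pod.metadata.name)

    @classmethod
    def remove(cls, pod):
        with cls.lock:
            # Guard against double-removal when wait_for_completion exits early
            # and cleanup() has already discarded this pod's name.
            if pod.metadata.name in cls.pod_names:
                cls.pod_names.remove(pod.metadata.name)

That only papers over the symptom, though; the early return from the watch stream still needs handling.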


dleehr commented Aug 28, 2019

I researched the errors and tried to find their source/cause.

Attempting to summarize the different types of errors seen in the Google Cloud error logs.

  • Connection Errors

    • [Errno 111] Connection refused
    • urllib3.exceptions.ProtocolError 'Connection broken: IncompleteRead'

These should be addressed in k8s.py: anywhere we make an API call to Kubernetes, check for errors and possibly retry.

  • Blocking IO Errors

    • BlockingIOError: [Errno 11] write could not complete without blocking
      at flush (/usr/local/lib/python3.6/logging/__init__.py:978)
      at emit (/usr/local/lib/python3.6/logging/__init__.py:998)

Could be caused by

def flush_tees():
    sys.stdout.flush()
    sys.stderr.flush()

The sys.stdout and sys.stderr streams are reopened as tee subprocesses, one of which could be blocking.

  • kubernetes.client.rest.ApiException 400, 404 and 500

    • Example: read_namespaced_pod_log, but also seen elsewhere
  • 403 - forbidden; the API responded with the user and namespace. Seems like this may be a user misconfiguration.
  • 400 - happened when reading logs; the error from the k8s API was "container not found" (a Docker RPC error). It seems to correlate with another error when pods were being deleted, so it sounds like a concurrency/timing issue.
  • 500 - happened as a result of get_current_pod. This happens when initializing a job and adding the persistent volume entries. It could have happened as a result of a cluster master upgrade, where the cluster API was not ready to return real data. I suggest we read this info early and keep it (instead of reading it for every job step).
  • 404 - happened when reading logs; the error from the k8s API was "pod not found". Sounds like the 400 error above (container not found), but maybe just a little later in the destruction of the pod.
  • ValueError: list.remove(x): x not in list
  • Errors within subprocesses

    • java.io.EOFException: Unexpected end of ZLIB input stream
      at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read (MultiMemberGZIPInputStream.java:68)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext (FastQFile.java:138)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.next (FastQFile.java:125)
      at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run (AnalysisRunner.java:76)
      at java.lang.Thread.run (Thread.java:748)

This seems to be downstream of an earlier failure in concat-gz-files. I believe that script needs to set pipefail: https://github.com/bespin-workflows/exomeseq-gatk4/blob/develop/tools/concat-gz-files.cwl

  • Errors collecting output

    • File "<string>", line 1, in <module>: cwltool.errors.WorkflowException: ../input/workflow/sleep-echo.cwl:41:7: Expression evaluation error:
      at raise_from (/usr/local/lib/python3.6/site-packages/future/utils/__init__.py:400)
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:298)
      at collect_output (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:737)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:657)
    • cwltool.errors.WorkflowException: Error collecting output for parameter 'output_qc_report':
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:168)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:612)
      at run (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:191)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)

Errors collecting output usually indicate an earlier failure. We should make sure early tools return failure codes when they fail (concat-gz-files: pipefail) and then revisit this.

  • Javascript Timeout errors

    • These come from cwltool logic; they can be avoided by passing --eval-timeout 600.

This is odd: the expressions are simple ($(inputs.message + '.txt')), yet they take over 20s to run? Looking at sandboxjs.py, there is a lot of overhead to running nodejs (a separate process for each invocation, plus multiple threads). For now we can work around it by raising the timeout, but this doesn't seem very efficient.


dleehr commented Oct 8, 2019

I'm investigating this currently and have implemented some retry logic with tenacity. I've temporarily disabled deleting pods in an attempt to sort out the noise. I've found that with the sleep-echo workflow of 175 parallel tasks, the workflow doesn't finish because some of the pods get deleted as part of the node upgrade.

[sleep-echo-5-pod-llwypgls] follow_logs start
[sleep-echo-5-pod-llwypgls] 0: hello
[sleep-echo-5-pod-llwypgls] 1: hello
[sleep-echo-5-pod-llwypgls] 2: hello
[sleep-echo-5-pod-llwypgls] 3: hello
[sleep-echo-5-pod-llwypgls] 4: hello
[sleep-echo-5-pod-llwypgls] 5: hello
[sleep-echo-5-pod-llwypgls] 6: hello
[sleep-echo-5-pod-llwypgls] 7: hello
[sleep-echo-5-pod-llwypgls] 8: hello
[sleep-echo-5-pod-llwypgls] 9: hello
[sleep-echo-5-pod-llwypgls] rpc error: code = Unknown desc = Error: No such container: 7e9bb62be850fe96f94636cbec4c92559e9b65cb31b31bdd697f3746ed40cc48
[sleep-echo-5-pod-llwypgls] follow_logs end
[sleep-echo-5-pod-llwypgls] follow_logs start
Retrying calrissian.k8s.KubernetesClient.follow_logs in 5.0 seconds as it raised (404)
Reason: Not Found

Initially this was baffling, but on further thought, Kubernetes resources such as Deployments, Jobs, and ReplicaSets are responsible for re-creating pods if they are deleted manually. Calrissian should be no different: if a pod is deleted out from under it, calrissian should be responsible for re-submitting that pod.

I considered looking into pod disruption budgets, restartPolicy, or disabling auto-scaling, but at the end of the day raw pods are not expected to be resilient. They're the lowest-level object, so we should either act as a manager of pods ourselves or use a Job if we want to delegate that responsibility. We decided in #23 to submit pods rather than Jobs since it seemed a better fit. A rough sketch of what re-submitting a deleted pod could look like is below.
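
To make the "manager of pods" idea concrete, here is a minimal sketch of re-submitting a pod when a watch reports it deleted. It uses the kubernetes Python client directly, and the function, its structure, and the fixed pod spec are illustrative assumptions, not calrissian's implementation:

from kubernetes import client, watch


def run_pod_until_done(core_api: client.CoreV1Api, pod_spec: client.V1Pod, namespace: str) -> str:
    # Create a pod and watch it to completion, re-creating it if it is deleted early.
    pod = core_api.create_namespaced_pod(namespace, pod_spec)
    while True:
        name = pod.metadata.name
        w = watch.Watch()
        deleted = False
        for event in w.stream(core_api.list_namespaced_pod, namespace,
                              field_selector=f"metadata.name={name}"):
            obj = event["object"]
            if event["type"] == "DELETED":
                # The pod was removed out from under us (e.g. during a node upgrade).
                deleted = True
                w.stop()
            elif obj.status.phase in ("Succeeded", "Failed"):
                w.stop()
                return obj.status.phase
        if deleted:
            # Re-submit the same spec and go back to watching the new pod.
            pod = core_api.create_namespaced_pod(namespace, pod_spec)
        # Otherwise the watch stream ended (e.g. dropped connection); loop and re-watch.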
