
Jobs with many parallel tasks cause GKE cluster failures #96

Open
dleehr opened this issue Aug 12, 2019 · 16 comments

dleehr commented Aug 12, 2019

When testing a moderately parallelized job on GKE (24 exome samples, 48 files), the cluster API becomes unresponsive, which triggers failures in the workflow. The failures are not caused by the computations themselves.

This is easily reproducible with a simple but highly parallel workflow that generates 175 parallel containers: https://gist.github.com/dleehr/afdcde15aef9d727fd5226beddef126d

The above sleep-echo workflow doesn't seem to cause problems on a development cluster (Docker Desktop for Mac), so I'm surprised it can overwhelm a GKE cluster.

Some ideas:

  • Local k8s was 1.14, while GKE starts at 1.12. Try upgrading the cluster.
  • Upgrade the Python k8s client to 9.0 (see "KubernetesClient.follow_logs() exits before getting all logs" #60).
  • Compare to OpenShift on local infrastructure.
  • Revisit the threading logic, which relies on cwltool --parallel. While running, docker stats reported over 2000 PIDs for the calrissian container that orchestrates the workflow (a count that includes both processes and kernel threads). It's not clear that this is a problem, but it may be a symptom.
  • Revisit the design that opens a cluster API connection to watch/follow logs for every pod. Maintaining 175 parallel watch connections to the cluster may not be a solid foundation.
  • Add retry logic around the cluster API calls, e.g. with tenacity (see "Adds retry logic to handle expired upload chunk urls", DukeDSClient#261); a rough sketch follows this list.
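
For the tenacity idea, something like this minimal sketch could wrap our kubernetes API calls (the wrapped call, exception list, and backoff numbers here are illustrative assumptions, not existing calrissian code):

from kubernetes import client
from kubernetes.client.rest import ApiException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
from urllib3.exceptions import ProtocolError


@retry(
    retry=retry_if_exception_type((ApiException, ProtocolError, ConnectionError)),
    wait=wait_exponential(multiplier=1, max=60),  # back off between attempts, up to 60s
    stop=stop_after_attempt(5),                   # give up after 5 attempts
    reraise=True,
)
def read_pod(core_api: client.CoreV1Api, name: str, namespace: str):
    # Read a pod, retrying transient API/connection failures.
    return core_api.read_namespaced_pod(name, namespace)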

dleehr commented Aug 12, 2019

Logs from running sleep-echo: errors.txt


dleehr commented Aug 14, 2019

Some ideas after discussing with @hlapp:

  • It's probably fair to assume that others aren't seeing this level of trouble when running a couple hundred pods in parallel. What is it about our usage that's different?
  • Common suspects for this type of problem:
    • Is some I/O operation blocking when it should not be?
    • Are there any locks/race conditions?
    • Networking: is a long-open TCP connection timing out?

We're certainly eagerly watching logs, and I know we've encountered issues in other projects when we expected streaming connections (e.g. RabbitMQ in lando) to stay open indefinitely. So after the easy/obvious steps of upgrading dependencies and k8s itself, it's probably worth looking at the watch/log connections; a rough sketch of a reconnecting log follower is below.
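
For reference, a reconnecting log follower could look roughly like this (illustrative only; the function name and retry handling are assumptions, not our current follow_logs):

import time

from kubernetes import client, watch
from kubernetes.client.rest import ApiException
from urllib3.exceptions import ProtocolError


def follow_pod_logs(core_api: client.CoreV1Api, name: str, namespace: str, retry_delay: float = 5.0):
    # Stream a pod's logs line by line, re-attaching if the connection drops.
    while True:
        try:
            w = watch.Watch()
            for line in w.stream(core_api.read_namespaced_pod_log, name=name, namespace=namespace):
                print(f"[{name}] {line}")
            return  # stream ended normally (pod finished)
        except (ProtocolError, ApiException) as exc:
            # Connection dropped (e.g. during a master upgrade/resize); wait and re-attach.
            # A real implementation would track position (e.g. since_seconds) to avoid
            # re-reading the whole log on every reconnect.
            print(f"[{name}] log stream interrupted ({exc}); retrying in {retry_delay}s")
            time.sleep(retry_delay)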

@johnbradley
Copy link
Contributor

johnbradley commented Aug 14, 2019

I tried running the sleep-echo workflow after upgrading the Google k8s cluster software to 1.13.7-gke.19. The cluster became unresponsive after I started the sleep-echo workflow job, and I noticed an "Upgrading master" message in the Google Cloud console. I let it sit for a couple of minutes and the cluster came back; the sleep-echo workflow job was still there with a failed status.
I ran the sleep-echo workflow again, and this time it failed with a cwltool error that a JavaScript expression took too long to evaluate. The cluster remained responsive this time.
I ran the sleep-echo workflow two more times; it worked both times and the cluster remained responsive.

I also ran the sleep-echo workflow against our hardspin OpenShift test cluster and it worked fine.


dleehr commented Aug 15, 2019

I ran the sleep-echo workflow two more times; it worked both times and the cluster remained responsive.

So there's certainly a possibility that all of the troubles are due to a cluster upgrade happening in the middle of a job - not at all related to the workload. A surprise cluster upgrade certainly explains:

  1. Why k8s API connections from calrissian to the master are dropped/disconnected
  2. Why the google cloud console dashboard becomes unresponsive trying to obtain cluster info

Great detective work; it seems like the auto-upgrade is causing this.


dleehr commented Aug 15, 2019

The docs say:

When you create a new cluster using Google Cloud Platform Console, node auto-upgrade is enabled by default.

And when creating a cluster from the CLI, I see:

WARNING: In June 2019, node auto-upgrade will be enabled by default for newly created clusters and node pools. To disable it, use the `--no-enable-autoupgrade` flag.

I suggest we disable auto-upgrade when creating these clusters, in addition to adding retry logic.


johnbradley commented Aug 15, 2019

I think there are two different "master upgrades" that display the same way in the Google Cloud console.

  1. Upgrading the master software (for example to 1.13.7-gke.19)
  2. The upgrade that I consistently see during the first run of the sleep-echo workflow

They both show the following message in https://console.cloud.google.com:

Upgrading cluster master.
The values shown below are going to change soon.

There are commands to perform the software upgrade via the command line or through the Google Cloud console. My assumption is that the second kind of master upgrade is Google migrating the master to a bigger server, since workflows seem to run fine after it completes. I haven't been able to find a flag or command to perform this upgrade manually.


dleehr commented Aug 15, 2019

I see: so it's not necessarily due to a software version upgrade; it may be responding to the number of pods and migrating to a larger master.

So, disabling the automatic software upgrade wouldn't prevent GKE from migrating to a bigger master.

Do you have a screenshot of where you see that message?

@johnbradley

(Screenshot: "Screen Shot 2019-08-15 at 9 02 06 AM", showing the "Upgrading cluster master" message in the Google Cloud console.)


dleehr commented Aug 15, 2019

Thanks. Your findings match up with the behavior described in this issue, and it seems to be the designed/intended behavior: hashicorp/terraform-provider-google#3385.

And that makes sense - the master is dynamic and may be unavailable while it's scaling up to manage more nodes. That can certainly happen on other cloud platforms, so we should handle it better :)


johnbradley commented Aug 15, 2019

I came across some documentation that seems related to "Upgrading cluster master": https://github.com/GoogleCloudPlatform/gke-stateful-applications-demo#deployment-steps

It refers to the k8s master as the control plane: "If you spawn a cluster with 6 nodes it may go into a 'RECONCILING' status as the control plane's instance size may be increased."

I found that you can see this status in gcloud container clusters list while the cluster master is being upgraded.

I tried starting a cluster with 6 small nodes and only saw failures due to JavaScript evaluation taking too long.
This is more of a workaround than actually addressing the core issue of handling cluster upgrades.

@johnbradley

I added the calrissian --eval-timeout 600 flag when running the sleep-echo workflow on a Google Cloud cluster started with 6 small nodes and a larger pool, and it ran to completion with no errors. The larger pool scaled up during the job.

The --eval-timeout flag was added to avoid the "Long-running script killed" exception. The cwltool logic that checks this timeout just waits a number of seconds:
https://github.com/common-workflow-language/cwltool/blob/a9078408aafdee0e0f7c27954fb80ef1a0c39e9c/cwltool/sandboxjs.py#L176-L186
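
Roughly, that check boils down to polling a node subprocess and killing it once the timeout elapses. A simplified sketch of that pattern (illustrative only, not cwltool's actual code):

import subprocess
import time


def eval_with_timeout(js_expression: str, eval_timeout: float = 20.0) -> str:
    # Run a node one-liner and kill it if it exceeds eval_timeout seconds.
    proc = subprocess.Popen(
        ["node", "--eval", f"console.log({js_expression})"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    deadline = time.monotonic() + eval_timeout
    while proc.poll() is None:
        if time.monotonic() > deadline:
            proc.kill()
            raise RuntimeError(f"Long-running script killed after {eval_timeout} seconds")
        time.sleep(0.1)
    return proc.stdout.read().decode()

On a heavily loaded node, just starting the node process can eat into a small timeout, which may be why raising --eval-timeout avoids the error.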

@johnbradley

Running the full 24-exome-sample workflow failed even with the above tweaks.
There were a handful of errors, but the first one was:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 593, in run
    completion_result = self.wait_for_kubernetes_pod()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 372, in wait_for_kubernetes_pod
    return self.client.wait_for_completion()
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 151, in wait_for_completion
    monitor.remove(pod)
  File "/usr/local/lib/python3.6/site-packages/calrissian/k8s.py", line 268, in remove
    PodMonitor.pod_names.remove(pod.metadata.name)
ValueError: list.remove(x): x not in list

I was able to look at logging around this error via the Google Cloud Console.
In Stackdriver Error Reporting I clicked "View logs" for the error, which takes you to Stackdriver Logging. I set the time limit to +/- 1 hour and cleared out the error-specific filter.
The above ValueError occurred after a successful trim command, when we tried to remove its pod from the PodMonitor.

Side thought:
I noticed in Stackdriver Logging that each log entry carries a good deal of data. I was wondering if the quantity of logging we do to stdout/stderr might be part of the larger problem.


johnbradley commented Aug 21, 2019

Attempting to summarize the different types of errors seen in the Google Cloud error logs.

  • Connection Errors

    • [Errno 111] Connection refused
    • urllib3.exceptions.ProtocolError: 'Connection broken: IncompleteRead'
  • Blocking IO Errors

    • BlockingIOError: [Errno 11] write could not complete without blocking
      at flush (/usr/local/lib/python3.6/logging/__init__.py:978)
      at emit (/usr/local/lib/python3.6/logging/__init__.py:998)
  • kubernetes.client.rest.ApiException 400, 404 and 500

    • Example: read_namespaced_pod_log, but also seen elsewhere
  • ValueError: list.remove(x): x not in list

    • at remove (/usr/local/lib/python3.6/site-packages/calrissian/k8s.py:268)
      at wait_for_completion (/usr/local/lib/python3.6/site-packages/calrissian/k8s.py:151)
      at wait_for_kubernetes_pod (/usr/local/lib/python3.6/site-packages/calrissian/job.py:372)
      at run (/usr/local/lib/python3.6/site-packages/calrissian/job.py:593)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)
  • Errors within subprocesses

    • java.io.EOFException: Unexpected end of ZLIB input stream
      at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read (MultiMemberGZIPInputStream.java:68)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext (FastQFile.java:138)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.next (FastQFile.java:125)
      at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run (AnalysisRunner.java:76)
      at java.lang.Thread.run (Thread.java:748)
  • Errors collecting output

    • File "<string>", line 1, in <module>: cwltool.errors.WorkflowException: ../input/workflow/sleep-echo.cwl:41:7: Expression evaluation error:
      at raise_from (/usr/local/lib/python3.6/site-packages/future/utils/__init__.py:400)
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:298)
      at collect_output (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:737)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:657)
    • cwltool.errors.WorkflowException: Error collecting output for parameter 'output_qc_report':
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:168)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:612)
      at run (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:191)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)
  • Javascript Timeout errors

    • These come from cwltool logic; they can be avoided by passing --eval-timeout 600.


johnbradley commented Aug 21, 2019

Some of these exceptions seem to be side effects of others. For example, ValueError: list.remove(x): x not in list seems to be a side effect of wait_for_completion exiting early.

def wait_for_completion(self):
    w = watch.Watch()
    for event in w.stream(self.core_api_instance.list_namespaced_pod, self.namespace,
                          field_selector=self._get_pod_field_selector()):

Typically w.stream blocks until we call w.stop() inside wait_for_completion. It looks like when we lose the connection to the k8s server, the stream returns (or throws an exception). This in turn causes an early call to PodMonitor.cleanup() that eventually results in the ValueError above.
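
One way to make the bookkeeping tolerant of that early exit would be a guard in remove(). A minimal sketch, assuming PodMonitor tracks pod names in a class-level list as the traceback suggests (the guard and lock here are illustrative, not calrissian's actual code):

import threading


class PodMonitor:
    # Illustrative stand-in for calrissian.k8s.PodMonitor's name tracking.
    pod_names = []
    lock = threading.Lock()

    @classmethod
    def add(cls, pod):
        with cls.lock:
            cls.pod_names.append(pod.metadata.name)

    @classmethod
    def remove(cls, pod):
        with cls.lock:
            # Guard against double-removal when wait_for_completion exits early
            # and cleanup() has already discarded this pod's name.
            if pod.metadata.name in cls.pod_names:
                cls.pod_names.remove(pod.metadata.name)

That only papers over the symptom, though; the early return from the watch stream still needs handling.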


dleehr commented Aug 28, 2019

I researched the errors and tried to find their source/cause.

Attempting to summarize the different types of errors seen in the Google Cloud error logs.

  • Connection Errors

    • [Errno 111] Connection refused
    • urllib3.exceptions.ProtocolError 'Connection broken: IncompleteRead'

These should be addressed in k8s.py: anywhere we make an API call to Kubernetes, check for errors and possibly retry.

  • Blocking IO Errors

    • BlockingIOError: [Errno 11] write could not complete without blocking
      at flush (/usr/local/lib/python3.6/logging/__init__.py:978)
      at emit (/usr/local/lib/python3.6/logging/__init__.py:998)

Could be caused by

def flush_tees():
    sys.stdout.flush()
    sys.stderr.flush()

The sys.stdout and sys.stderr streams are reopened as tee subprocesses, one of which could be blocking.

  • kubernetes.client.rest.ApiException 400, 404 and 500

    • Example: read_namespaced_pod_log, but also seen elsewhere
  • 403 - forbidden; the API responded with the user and namespace. Seems like this may be a user misconfiguration.
  • 400 - happened when reading logs; the error from the k8s API was "container not found" (a Docker RPC error). It seems to correlate with another error when pods were being deleted, so it sounds like a concurrency/timing issue.
  • 500 - happened as a result of get_current_pod. This happens when initializing a job and adding the persistent volume entries. It could have happened as a result of a cluster master upgrade, where the cluster API was not ready to return real data. I suggest we read this info early and keep it (instead of reading it for every job step).
  • 404 - happened when reading logs; the error from the k8s API was "pod not found". Sounds like the 400 error above (container not found), but maybe just a little later in the destruction of the pod.
  • ValueError: list.remove(x): x not in list
  • Errors within subprocesses

    • java.io.EOFException: Unexpected end of ZLIB input stream
      at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read (MultiMemberGZIPInputStream.java:68)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext (FastQFile.java:138)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.next (FastQFile.java:125)
      at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run (AnalysisRunner.java:76)
      at java.lang.Thread.run (Thread.java:748)

This seems to be downstream of an earlier failure in concat-gz-files. I believe that script needs to set pipefail: https://github.com/bespin-workflows/exomeseq-gatk4/blob/develop/tools/concat-gz-files.cwl

  • Errors collecting output

    • File "<string>", line 1, in <module>: cwltool.errors.WorkflowException: ../input/workflow/sleep-echo.cwl:41:7: Expression evaluation error:
      at raise_from (/usr/local/lib/python3.6/site-packages/future/utils/__init__.py:400)
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:298)
      at collect_output (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:737)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:657)
    • cwltool.errors.WorkflowException: Error collecting output for parameter 'output_qc_report':
      at __exit__ (/usr/local/lib/python3.6/site-packages/schema_salad/sourceline.py:168)
      at collect_output_ports (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:612)
      at run (/usr/local/lib/python3.6/site-packages/cwltool/command_line_tool.py:191)
      at runner (/usr/local/lib/python3.6/site-packages/cwltool/executors.py:264)

Errors collecting output usually indicate an earlier failure. We should make sure early tools return failure codes when they fail (concat-gz-files: pipefail) and then revisit this.

  • Javascript Timeout errors

    • These come from cwltool logic; they can be avoided by passing --eval-timeout 600.

This is odd: the expressions are simple ($(inputs.message + '.txt')), yet they take over 20s to run? Looking at sandboxjs.py, there is a lot of overhead to running nodejs (a separate process for each invocation, plus multiple threads). For now we can work around it by raising the timeout, but this doesn't seem very efficient.


dleehr commented Oct 8, 2019

I'm investigating this currently and have implemented some retry logic with tenacity. I've temporarily disabled deleting pods in an attempt to sort out the noise. I've found that with the sleep-echo workflow of 175 parallel tasks, the workflow doesn't finish because some of the pods get deleted as part of the node upgrade.

[sleep-echo-5-pod-llwypgls] follow_logs start
[sleep-echo-5-pod-llwypgls] 0: hello
[sleep-echo-5-pod-llwypgls] 1: hello
[sleep-echo-5-pod-llwypgls] 2: hello
[sleep-echo-5-pod-llwypgls] 3: hello
[sleep-echo-5-pod-llwypgls] 4: hello
[sleep-echo-5-pod-llwypgls] 5: hello
[sleep-echo-5-pod-llwypgls] 6: hello
[sleep-echo-5-pod-llwypgls] 7: hello
[sleep-echo-5-pod-llwypgls] 8: hello
[sleep-echo-5-pod-llwypgls] 9: hello
[sleep-echo-5-pod-llwypgls] rpc error: code = Unknown desc = Error: No such container: 7e9bb62be850fe96f94636cbec4c92559e9b65cb31b31bdd697f3746ed40cc48
[sleep-echo-5-pod-llwypgls] follow_logs end
[sleep-echo-5-pod-llwypgls] follow_logs start
Retrying calrissian.k8s.KubernetesClient.follow_logs in 5.0 seconds as it raised (404)
Reason: Not Found

Initially this was baffling, but on further thought, Kubernetes resources such as Deployments, Jobs, and ReplicaSets are responsible for re-creating pods if they are deleted manually. Calrissian should be no different: if a pod is deleted out from under it, calrissian should be responsible for re-submitting that pod.

I considered looking into pod disruption budgets, restartPolicy, or disabling auto-scaling, but at the end of the day raw pods are not expected to be resilient. They're the lowest-level object, so we should either act as a manager of pods ourselves or use a Job if we want to delegate that responsibility. We decided in #23 to submit pods rather than Jobs since it seemed a better fit. A rough sketch of what re-submitting a deleted pod could look like is below.
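
To make the "manager of pods" idea concrete, here is a minimal sketch of re-submitting a pod when a watch reports it deleted. It uses the kubernetes Python client directly, and the function, its structure, and the fixed pod spec are illustrative assumptions, not calrissian's implementation:

from kubernetes import client, watch


def run_pod_until_done(core_api: client.CoreV1Api, pod_spec: client.V1Pod, namespace: str) -> str:
    # Create a pod and watch it to completion, re-creating it if it is deleted early.
    pod = core_api.create_namespaced_pod(namespace, pod_spec)
    while True:
        name = pod.metadata.name
        w = watch.Watch()
        deleted = False
        for event in w.stream(core_api.list_namespaced_pod, namespace,
                              field_selector=f"metadata.name={name}"):
            obj = event["object"]
            if event["type"] == "DELETED":
                # The pod was removed out from under us (e.g. during a node upgrade).
                deleted = True
                w.stop()
            elif obj.status.phase in ("Succeeded", "Failed"):
                w.stop()
                return obj.status.phase
        if deleted:
            # Re-submit the same spec and go back to watching the new pod.
            pod = core_api.create_namespaced_pod(namespace, pod_spec)
        # Otherwise the watch stream ended (e.g. dropped connection); loop and re-watch.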
