Unexpected job state [failed] while waiting for job to be opened #37545

Closed
polyfractal opened this issue Jan 16, 2019 · 7 comments
Labels: :ml (Machine learning), >test-failure (Triaged test failures from CI)

Comments

@polyfractal (Contributor)

Not entirely sure what's going on here, but I have a rough timeline based on the Gradle and server logs.

It seems that when jobs_get_stats.yml is setting up job-stats-test, something goes wrong with the autodetect process and the job fails while opening. This has the knock-on effect of killing a bunch of tests because the job is stuck in failed status:

Unexpected job state [failed] while waiting for job to be opened

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/
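
That error is produced while waiting for the job to reach the opened state after the open request, and it fires when the job lands in failed instead. A minimal sketch of that kind of wait-and-check loop (hypothetical names, not the actual server or test-harness code):

public class WaitForJobOpenSketch {

    // Stand-in for reading the "state" field from the job stats endpoint; the real
    // code drives this through the cluster state / REST layer, not a stub like this.
    static String getJobState(String jobId) {
        throw new UnsupportedOperationException("illustrative stub");
    }

    static void waitForJobOpened(String jobId, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            String state = getJobState(jobId);
            if ("opened".equals(state)) {
                return; // setup can proceed
            }
            if ("failed".equals(state)) {
                // The branch this CI run hit: autodetect never started, so the job went
                // to [failed] instead of [opened] and every dependent test aborted.
                throw new AssertionError("Unexpected job state [" + state + "] while waiting for job to be opened");
            }
            Thread.sleep(500);
        }
        throw new AssertionError("timed out waiting for job [" + jobId + "] to open");
    }
}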

This has been happening sporadically over the last six months. It does not reproduce for me locally:

./gradlew :x-pack:plugin:integTestRunner \
  -Dtests.seed=A89CB62AEE52974F \
  -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT \
  -Dtests.method="test {p0=ml/jobs_get_stats/Test get job stats given pattern and allow_no_jobs}" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-DZ \
  -Dtests.timezone=America/Indiana/Knox \
  -Dcompiler.java=11 \
  -Druntime.java=11 \
  -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

The full log is below, but the salient error is:

Failed to launch autodetect for job job-stats-test
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch.zxk6gUel/autodetect_job-stats-test_log_6945 (No such file or directory)

which causes other tests to fail, either explicitly because they are waiting on that job to open, or because they notice unexpected persistent tasks:

"message": "1 active tasks found: xpack/ml/job[c] lRTIDwSTTjSZcjLRREK2LQ:22675 cluster:39 persistent 1545197471985 05:31:11 1.1m 127.0.0.1 node-0 job-job-stats-test expected:<0> but was:<1>",

More interesting still, the cleanup code tries to kill the job via _all but the job doesn't exist, which sounds like either the state was only partially cleaned up or there is a mismatch between a local node's and the master's cluster state. Note this is for a different job (job-post-data-job), but it seems like a similar situation:

09:52:23 [2019-01-16T14:45:06,927][WARN ][r.suppressed ] [node-0] path: /_xpack/ml/anomaly_detectors/_all/_close, params: {job_id=_all, force=true}
09:52:23 org.elasticsearch.ElasticsearchException: Failed to force close job [_all] with [1] failures, rethrowing last, all Exceptions: [the task with id job-post-data-job doesn't exist]

Full logs

[2019-01-16T14:44:43,034][WARN ][o.e.p.PersistentTasksNodeService] [node-0] task job-job-stats-test failed with an exception
org.elasticsearch.ElasticsearchException: Failed to launch autodetect for job job-stats-test
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:38) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createNativeProcess(NativeAutodetectProcessFactory.java:114) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:64) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:506) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:458) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$800(AutodetectProcessManager.java:90) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:427) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch.zxk6gUel/autodetect_job-stats-test_log_6945 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method) ~[?:?]
	at java.io.FileInputStream.open(FileInputStream.java:219) ~[?:?]
	at java.io.FileInputStream.<init>(FileInputStream.java:157) ~[?:?]
	at java.io.FileInputStream.<init>(FileInputStream.java:112) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.ProcessPipes.connectStreams(ProcessPipes.java:131) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createNativeProcess(NativeAutodetectProcessFactory.java:110) ~[?:?]
	... 10 more
[2019-01-16T14:44:44,128][WARN ][o.e.p.PersistentTasksClusterService] [node-0] persistent task job-job-stats-test failed
org.elasticsearch.ElasticsearchException: Failed to launch autodetect for job job-stats-test
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:38) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createNativeProcess(NativeAutodetectProcessFactory.java:114) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createAutodetectProcess(NativeAutodetectProcessFactory.java:64) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.create(AutodetectProcessManager.java:506) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.createProcessAndSetRunning(AutodetectProcessManager.java:458) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager.access$800(AutodetectProcessManager.java:90) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager$2.doRun(AutodetectProcessManager.java:427) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.io.FileNotFoundException: /tmp/elasticsearch.zxk6gUel/autodetect_job-stats-test_log_6945 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method) ~[?:?]
	at java.io.FileInputStream.open(FileInputStream.java:219) ~[?:?]
	at java.io.FileInputStream.<init>(FileInputStream.java:157) ~[?:?]
	at java.io.FileInputStream.<init>(FileInputStream.java:112) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130) ~[?:?]
	at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.ProcessPipes.connectStreams(ProcessPipes.java:131) ~[?:?]
	at org.elasticsearch.xpack.ml.job.process.autodetect.NativeAutodetectProcessFactory.createNativeProcess(NativeAutodetectProcessFactory.java:110) ~[?:?]
	... 10 more
[2019-01-16T14:44:44,140][WARN ][o.e.p.PersistentTasksClusterService] [node-0] The task [job-job-stats-test] wasn't found, status is not updated
[2019-01-16T14:44:44,141][WARN ][o.e.p.PersistentTasksNodeService] [node-0] notification for task [xpack/ml/job[c]] with id [job-job-stats-test] failed
org.elasticsearch.ResourceNotFoundException: the task with id [job-job-stats-test] and allocation id [63] not found
	at org.elasticsearch.persistent.PersistentTasksClusterService$2.execute(PersistentTasksClusterService.java:131) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.5.5-SNAPSHOT.jar:6.5.5-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

And finally, a bunch of these tests start to stall right when the autodetect process throws those errors:

09:44:17 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:44:13, stalled for 10.4s at: XPackRestIT.test {p0=token/10_basic/Test invalidate token}
09:44:43 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:44:43, stalled for 10.6s at: XPackRestIT.test {p0=ml/jobs_get_stats/Test get job stats given pattern and allow_no_jobs}
09:45:23 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:45:23, stalled for 10.3s at: XPackRestIT.test {p0=ml/delete_job_force/Test force delete an open job}
09:47:13 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:47:11, stalled for 11.0s at: XPackRestIT.test {p0=watcher/ack_watch/40_reset_ack_after_unmet_action_condition/Ensure that ack status is reset after unmet action condition}
09:48:26 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:48:21, stalled for 10.2s at: XPackRestIT.test {p0=ml/get_datafeed_stats/Test get stats for started datafeed}
09:50:05 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:50:05, stalled for 10.6s at: XPackRestIT.test {p0=ml/jobs_crud/Test close jobs with expression that matches}
09:50:29 HEARTBEAT J0 PID(9379@elasticsearch-ci-immutable-centos-6-1547643590318618758): 2019-01-16T14:50:29, stalled for 10.6s at: XPackRestIT.test {p0=ml/start_stop_datafeed/Test stop given expression}
09:52:10 Suite: org.elasticsearch.xpack.test.rest.XPackRestIT
polyfractal added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Jan 16, 2019
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@polyfractal (Contributor, Author)

Note: this might be related to #30300, since various tests resort to force-killing jobs when they are in FAILED status.

@benwtrent (Member)

https://github.com/elastic/elasticsearch/pull/32615/files

That PR addressed a bug we had a while back where the temp dir was being cleaned up by the OS; the error about the file not being created is similar.

I would have thought these tests already had that fix... I am reaching out to others who would know more.

@davidkyle (Member)

09:52:23 [2019-01-16T14:45:06,927][WARN ][r.suppressed ] [node-0] path: /_xpack/ml/anomaly_detectors/_all/_close, params: {job_id=_all, force=true}
09:52:23 org.elasticsearch.ElasticsearchException: Failed to force close job [_all] with [1] failures, rethrowing last, all Exceptions: [the task with id job-post-data-job doesn't exist]

This is because of the change that moved the ML config out of the cluster state. Previously a job-not-found error was returned, but now it is difficult to check the job's existence in the transport action. The problem is described in #34747.

@droberts195 (Contributor)

That PR addressed a bug we had a while back where the temp dir was being cleaned up by the OS; the error about the file not being created is similar.

The error message is the same, but the cause of the problem is almost certainly different. A huge number of tests running in GCP failed yesterday due to the extreme slowness of the VMs. We expect the named pipes to open within 10 seconds of asking the controller to start the autodetect process. I suspect that whatever was causing yesterday's slowness in GCP meant this didn't happen.
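
To make that 10 second window concrete, here is a minimal sketch of a bounded wait for a named pipe. This is not the actual NamedPipeHelper implementation, just the general retry-until-deadline idea: keep trying to open the pipe the autodetect process is expected to create, and give up once the timeout elapses.

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public final class BoundedPipeOpenSketch {

    static InputStream openNamedPipe(String path, long timeoutMillis) throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        IOException lastFailure = null;
        while (System.currentTimeMillis() < deadline) {
            try {
                // Succeeds once the native process has created and connected the pipe.
                return new FileInputStream(path);
            } catch (FileNotFoundException e) {
                // Pipe not there yet; on a healthy machine this resolves within milliseconds.
                lastFailure = e;
                Thread.sleep(100);
            }
        }
        // On a badly overloaded VM the process never gets that far within the window, which
        // surfaces as the FileNotFoundException seen in the stack traces above.
        throw lastFailure != null ? lastFailure : new FileNotFoundException(path);
    }
}

On this reading, the FileNotFoundException in the logs is just what a timed-out wait looks like, rather than evidence that the temp directory was removed.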

The problem that #32615 was working around was sub-directories of the /tmp directory getting deleted by systemd after 10 days - the tests are slow but not that slow.

Previously a job-not-found error was returned, but now it is difficult to check the job's existence in the transport action. The problem is described in #34747.

#34747 describes the problem with error reporting, but it seems to me there's a more fundamental issue here: _all can now suffer "job does not exist" errors. Whether they're reported as "job does not exist" or "job is not open" is secondary to the fact that, from a user's perspective, it seems crazy for _all to ever include something that didn't exist.

Now that evaluating _all means doing a search, I guess there's a danger that something the search finds gets deleted before the endpoint gets a chance to operate on the expansion of _all. So we should probably change the error handling of endpoints that can work on _all to silently ignore errors caused by items no longer existing by the time the main action runs.
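
A sketch of that tolerant handling (hypothetical types and names, not the actual transport-action code): when the request was a wildcard or _all expansion, an item that disappeared between expansion and execution is skipped rather than failing the whole request, while an explicitly named job still reports the error.

import java.util.ArrayList;
import java.util.List;

public final class WildcardCloseSketch {

    static class ResourceNotFound extends RuntimeException {
        ResourceNotFound(String message) { super(message); }
    }

    interface JobCloser {
        void close(String jobId) throws ResourceNotFound;
    }

    static List<String> closeJobs(List<String> expandedJobIds, boolean wasWildcard, JobCloser closer) {
        List<String> closed = new ArrayList<>();
        for (String jobId : expandedJobIds) {
            try {
                closer.close(jobId);
                closed.add(jobId);
            } catch (ResourceNotFound e) {
                if (wasWildcard == false) {
                    // The user named this job explicitly, so the error should be reported.
                    throw e;
                }
                // The job vanished after the _all expansion (e.g. the failed persistent task
                // was already removed), so silently skip it as suggested above.
            }
        }
        return closed;
    }
}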

And finally, a bunch of these tests start to stall right when the autodetect process throws those errors

It looks like the other ML tests in the list failed for the same reason as the test that's analyzed in detail - autodetect didn't create its named pipes within 10 seconds. The 10 second wait for named pipes explains why they took over 10 seconds. (On a machine that performs sensibly it should only take a few milliseconds.)

The security and watcher ones seem to have just taken that long for reasons unrelated to ML. When I saw the list I initially thought ML had left a hung task behind that the waitForPendingTasks cleanup was waiting on, but looking at the log more closely, the time seems to have been spent inside the test:

  1> [2019-01-16T08:44:02,928][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=token/10_basic/Test invalidate token] before test
  1> [2019-01-16T08:44:18,174][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=token/10_basic/Test invalidate token] after test

and:

  1> [2019-01-16T08:47:00,127][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=watcher/ack_watch/40_reset_ack_after_unmet_action_condition/Ensure that ack status is reset after unmet action condition] before test
  1> [2019-01-16T08:47:13,514][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=watcher/ack_watch/40_reset_ack_after_unmet_action_condition/Ensure that ack status is reset after unmet action condition] after test

Maybe those tests don't usually take that long, but I suspect they were slower in this build due to the unexplained general slowness in GCP yesterday.

Summary

The actual test failure reported in this issue is due to the unreasonable slowness of VMs that caused a huge number of test failures yesterday.

BUT

The problem that occurred during the force-close, triggered by the spurious failure, is one that could affect real users and that we should try to address. It's not going to be very common, so it shouldn't be considered a blocker for 6.6.0, but it's clearly not right for an _all operation to fail because it tried to operate on an item that doesn't exist.

@polyfractal (Contributor, Author)

Thanks for the detailed analysis @droberts195! Makes sense re: the slowness... I was thinking this was a legit failure because the autodetect failure didn't look like a timeout on the surface, but in retrospect it does make sense that it was just a different manifestation of the timeout stuff we're seeing elsewhere.

Going to add this symptom to the infra ticket as well, since it sounds like these named-pipe failures are a pretty extreme example of the issue (even more so than the other timeout-related failures). Thanks!

@droberts195 (Contributor)

I have raised a dedicated issue, #37959, for the underlying problem that this test failure revealed, because it's hard to quickly see the thing that needs to be fixed within this long issue.
