I encountered a situation where I had some scheduled requests that got stuck in a PENDING state. I'm not entirely sure how they got into that state, but I think it was related to pausing/unpausing them repeatedly. Once these pending requests existed, I began seeing Singularity abort periodically. I tracked it down to this in the log:
2016-04-19T11:45:52.54317 ERROR [2016-04-19 11:45:52,537] com.hubspot.singularity.scheduler.SingularityLeaderOnlyPoller: Caught an exception while running SingularityScheduledJobPoller
2016-04-19T11:45:52.54320 ! java.lang.IllegalStateException: Optional.get() cannot be called on an absent value
2016-04-19T11:45:52.54321 ! at com.google.common.base.Absent.get(Absent.java:47) ~[SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54321 ! at com.hubspot.singularity.smtp.SingularityMailer.getDestination(SingularityMailer.java:317) ~[SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54322 ! at com.hubspot.singularity.smtp.SingularityMailer.prepareTaskMail(SingularityMailer.java:275) ~[SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54324 ! at com.hubspot.singularity.smtp.SingularityMailer.sendTaskOverdueMail(SingularityMailer.java:249) ~[SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54325 ! at com.hubspot.singularity.scheduler.SingularityScheduledJobPoller.runActionOnPoll(SingularityScheduledJobPoller.java:94) ~[SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54325 ! at com.hubspot.singularity.scheduler.SingularityLeaderOnlyPoller.runActionIfLeaderAndMesosIsRunning(SingularityLeaderOnlyPoller.java:108) [SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54326 ! at com.hubspot.singularity.scheduler.SingularityLeaderOnlyPoller.access$000(SingularityLeaderOnlyPoller.java:24) [SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54327 ! at com.hubspot.singularity.scheduler.SingularityLeaderOnlyPoller$1.run(SingularityLeaderOnlyPoller.java:83) [SingularityService-shaded.jar:0.5.0]
2016-04-19T11:45:52.54327 ! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
2016-04-19T11:45:52.54328 ! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
2016-04-19T11:45:52.54329 ! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
2016-04-19T11:45:52.54330 ! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
2016-04-19T11:45:52.54331 ! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
2016-04-19T11:45:52.54331 ! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
2016-04-19T11:45:52.54332 ! at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2016-04-19T11:45:52.54332 ERROR [2016-04-19 11:45:52,539] com.hubspot.singularity.SingularityAbort: Singularity on prd-useast-mesos-platform-master-02.prd.yb0t.cc is aborting due to UNRECOVERABLE_ERROR
2016-04-19T11:45:52.54333 WARN [2016-04-19 11:45:52,539] com.hubspot.singularity.SingularityAbort: Couldn't send abort mail because no SMTP configuration is present
2016-04-19T11:45:52.54333 INFO [2016-04-19 11:45:52,540] com.hubspot.singularity.SingularityAbort: Attempting to flush logs and wait 00:00.100 ...
2016-04-19T11:45:52.65734 I0419 07:45:52.657297 9032 sched.cpp:1805] Asked to abort the driver
2016-04-19T11:45:52.74077 I0419 07:45:52.740716 9003 sched.cpp:1070] Aborting framework 'Singularity'
It looks like it was trying to send notification emails saying that tasks for these pending requests were overdue. On this particular cluster, though, I haven't configured any SMTP settings, yet it seems to have died while trying to build the email that would have been sent.
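For what it's worth, the exception is the standard Guava `Optional` failure mode, so my guess is the mailer unconditionally calls `get()` on an SMTP config that is absent when no `smtp` block exists in the configuration. A minimal sketch of that pattern (the field and method names here are my illustration, not the actual `SingularityMailer` code):

```java
import com.google.common.base.Optional;

public class MailerSketch {
  // Absent whenever no smtp block is present in the configuration.
  private final Optional<String> maybeSmtpConfiguration;

  public MailerSketch(Optional<String> maybeSmtpConfiguration) {
    this.maybeSmtpConfiguration = maybeSmtpConfiguration;
  }

  public String getDestination() {
    // get() on an absent Guava Optional throws
    // IllegalStateException("Optional.get() cannot be called on an absent value"),
    // which matches the stack trace above.
    return maybeSmtpConfiguration.get();
  }

  public void sendTaskOverdueMail() {
    // A guard like this would skip mail entirely when SMTP isn't configured,
    // instead of letting the poller throw and trigger an abort.
    if (!maybeSmtpConfiguration.isPresent()) {
      return;
    }
    String destination = getDestination();
    // ... build and send the overdue-task mail to destination ...
  }
}
```

Notably, the WARN line later in the log ("Couldn't send abort mail because no SMTP configuration is present") suggests the abort path already has exactly this kind of guard, so presumably the overdue-mail path just needs the same check.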
Regarding the odd state of the pending scheduled requests: I was able to clear it by going into ZooKeeper and manually running:

rmr /<zkNamespace>/requests/pending

I'd also like to better understand how those pending requests could have gotten into that state to begin with.
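For anyone who hits the same thing, the cleanup was just a standard zkCli session along these lines. The host/port are placeholders from my setup, and substitute your own namespace; also note that `rmr` recursively wipes every pending entry under that node, not just the stuck ones:

```
$ bin/zkCli.sh -server <zk-host>:2181
[zk: <zk-host>:2181(CONNECTED) 0] ls /<zkNamespace>/requests/pending
[zk: <zk-host>:2181(CONNECTED) 1] rmr /<zkNamespace>/requests/pending
```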