We run the standard Docker executor. Unfortunately, an administrator deleted the Docker image backing an active task.
This caused Singularity to get into an endless loop relaunching the task.
Apparently, if a task fails abnormally before it starts, the failure does not count against the cooldown limit, so the loop was left running overnight.
This filled our ZooKeeper cluster with tens of thousands of task records, which took the ZK cluster down.
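To make the cooldown problem concrete, here is a minimal sketch (not Singularity's actual scheduler code; the function and field names are hypothetical) of the accounting we expected: every TASK_FAILED should count toward the cooldown threshold, including failures that happen before the task starts (such as a failed docker pull). If pre-start failures are excluded, the relaunch loop never trips the cooldown:

```python
def should_enter_cooldown(failures, threshold=5, count_prestart=True):
    """Count recent task failures; return True once the threshold is reached.

    `failures` is a list of dicts with a boolean `started` flag indicating
    whether the task got past launch before failing. (Hypothetical model,
    for illustration only.)
    """
    counted = [f for f in failures if count_prestart or f["started"]]
    return len(counted) >= threshold


# Ten relaunch attempts, all failing at `docker pull` before the task starts:
failures = [{"started": False, "reason": "docker pull exited 1"}] * 10

# Observed behavior: pre-start failures are ignored, so cooldown never triggers
# and the scheduler relaunches forever.
assert should_enter_cooldown(failures, count_prestart=False) is False

# Expected behavior: pre-start failures count, so the loop stops at the threshold.
assert should_enter_cooldown(failures, count_prestart=True) is True
```

With pre-start failures excluded, each relaunch below fails the same way roughly once per second, which is how the task backlog accumulated overnight.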
INFO [2016-02-02 07:07:20,963] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S4 (mesos-slave6-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:20,978] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:21,188] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396841177
INFO [2016-02-02 07:07:21,197] com.hubspot.singularity.mesos.SingularityLogSupport: Fetching slave data to find log directory for task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME from uri http://mesos-slave6-prod-sc.otsql.opentable.com:5051/slave(1)/state.json
DEBUG [2016-02-02 07:07:21,205] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396841204, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
WARN [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Couldn't retrieve stdout for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME because task (true) or directory (false) wasn't present
WARN [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Couldn't retrieve stderr for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME because task (true) or directory (false) wasn't present
DEBUG [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Not sending TASK_FAILED for prod-umami-config-server - mail cooldown has 1550318 time left out of 3600000
DEBUG [2016-02-02 07:07:21,214] com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator: Handled status update for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME in 00:00.026
DEBUG [2016-02-02 07:07:21,223] com.hubspot.singularity.mesos.SingularityLogSupport: Found a directory /mnt/mesos-slave/slaves/20151217-222149-3943699466-5050-19261-S4/frameworks/Singularity/executors/prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME/runs/c0382d0d-97ef-404f-93dd-30e72c5b1b2e for task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME
INFO [2016-02-02 07:07:21,965] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S4 (mesos-slave6-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:21,981] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:22,196] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396842185
DEBUG [2016-02-02 07:07:22,209] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396842207, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
DEBUG [2016-02-02 07:07:22,218] com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator: Handled status update for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME in 00:00.022
INFO [2016-02-02 07:07:22,966] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396842966-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S8 (mesos-slave7-prod-sc.otsql.opentable.com)
DEBUG [2016-02-02 07:07:23,162] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396842966-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396843159
INFO [2016-02-02 07:07:23,967] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S8 (mesos-slave7-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:23,987] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:24,170] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396844169
DEBUG [2016-02-02 07:07:24,189] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396844188, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
INFO [2016-02-02 07:07:24,975] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S5 (mesos-slave5-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:25,009] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:25,202] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396845176
INFO [2016-02-02 07:07:25,971] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S5 (mesos-slave5-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:25,990] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:26,205] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396846183
DEBUG [2016-02-02 07:07:26,221] com.hubspot.singularity.smtp.SingularityMailer: Not sending TASK_FAILED for prod-umami-config-server - mail cooldown has 1545306 time left out of 3600000
INFO [2016-02-02 07:07:26,978] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S9 (mesos-slave8-prod-sc.otsql.opentable.com)
INFO [2016-02-02 07:07:27,006] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:27,198] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n \"errors\" : [ {\n \"status\" : 400,\n \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n } ]\n}"
) at 1454396847200
DEBUG [2016-02-02 07:07:27,242] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396847241, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
INFO [2016-02-02 07:07:27,975] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396847975-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S9 (mesos-slave8-prod-sc.otsql.opentable.com)
This is against Singularity 0.4.3.