Tasks can fail repeatedly without entering cooldown if they fail too early in the launching process #874

@stevenschlansker

We run the standard Docker executor. Unfortunately, an administrator deleted one of the Docker images used by an active task.
This caused Singularity to enter an endless loop of restarting the task.
Apparently, if a task fails abnormally before it starts running, the failure does not count against the cooldown limit. So the loop was left running overnight.
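
To make the expected behaviour concrete, here is a minimal sketch of a cooldown counter that treats every failure the same way, whether or not the task ever reached a running state. This is not Singularity's code: the class name `CooldownTracker`, its parameters, and the threshold and window values are all hypothetical and chosen only for illustration. The timestamps are the TASK_FAILED times from the log excerpt in this issue.

```python
from collections import deque


class CooldownTracker:
    """Hypothetical sliding-window failure counter (not Singularity's implementation)."""

    def __init__(self, max_failures, window_ms):
        self.max_failures = max_failures  # assumed threshold, for illustration only
        self.window_ms = window_ms        # assumed window length in milliseconds
        self._failures = deque()          # failure timestamps, oldest first

    def _evict(self, now_ms):
        # Drop failures that have aged out of the window.
        while self._failures and now_ms - self._failures[0] > self.window_ms:
            self._failures.popleft()

    def record_failure(self, timestamp_ms):
        # Every failure is counted, including a failed 'docker pull'
        # that happens before the task ever starts running.
        self._failures.append(timestamp_ms)
        self._evict(timestamp_ms)

    def in_cooldown(self, now_ms):
        self._evict(now_ms)
        return len(self._failures) >= self.max_failures


tracker = CooldownTracker(max_failures=5, window_ms=60_000)
for ts in (1454396841177, 1454396842185, 1454396843159,
           1454396844169, 1454396845176):
    tracker.record_failure(ts)

# Five launch failures within a few seconds would trip the cooldown here.
print(tracker.in_cooldown(1454396845176))
```

With a counter like this, the restart loop shown in the logs would stop after a handful of attempts instead of relaunching roughly once per second.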

This filled up our ZooKeeper cluster with tens of thousands of tasks, which caused the ZK cluster to go south.

  • Any task failure should count towards cooldown.
  • There should be some hard limit on how many tasks are kept in history, even if they haven't expired.
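
The second request, a hard cap on task history regardless of expiry, could look something like the sketch below. It is an in-memory illustration only: the class `BoundedTaskHistory` and its methods are hypothetical, and a real fix would have to delete the corresponding entries from ZooKeeper rather than from a Python dictionary.

```python
from collections import OrderedDict


class BoundedTaskHistory:
    """Hypothetical capped history store (illustration only, not Singularity code)."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()  # task id -> record, oldest first

    def add(self, task_id, record):
        """Store a record and return the ids evicted to stay under the cap."""
        self._entries[task_id] = record
        self._entries.move_to_end(task_id)
        evicted = []
        # Evict oldest entries even if they have not expired yet.
        while len(self._entries) > self.max_entries:
            old_id, _ = self._entries.popitem(last=False)
            evicted.append(old_id)
        return evicted

    def task_ids(self):
        return list(self._entries)

    def __len__(self):
        return len(self._entries)


history = BoundedTaskHistory(max_entries=2)
history.add("task-a", "TASK_FAILED")
history.add("task-b", "TASK_FAILED")
evicted = history.add("task-c", "TASK_FAILED")
print(evicted, history.task_ids())
```

The point of returning the evicted ids is that the caller can then remove the matching stored records, so the total stays bounded no matter how fast tasks fail.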
INFO  [2016-02-02 07:07:20,963] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S4 (mesos-slave6-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:20,978] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:21,188] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396841177 
INFO  [2016-02-02 07:07:21,197] com.hubspot.singularity.mesos.SingularityLogSupport: Fetching slave data to find log directory for task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME from uri http://mesos-slave6-prod-sc.otsql.opentable.com:5051/slave(1)/state.json
DEBUG [2016-02-02 07:07:21,205] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396841204, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
WARN  [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Couldn't retrieve stdout for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME because task (true) or directory (false) wasn't present
WARN  [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Couldn't retrieve stderr for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME because task (true) or directory (false) wasn't present
DEBUG [2016-02-02 07:07:21,207] com.hubspot.singularity.smtp.SingularityMailer: Not sending TASK_FAILED for prod-umami-config-server - mail cooldown has 1550318 time left out of 3600000
DEBUG [2016-02-02 07:07:21,214] com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator: Handled status update for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME in 00:00.026
DEBUG [2016-02-02 07:07:21,223] com.hubspot.singularity.mesos.SingularityLogSupport: Found a directory /mnt/mesos-slave/slaves/20151217-222149-3943699466-5050-19261-S4/frameworks/Singularity/executors/prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME/runs/c0382d0d-97ef-404f-93dd-30e72c5b1b2e for task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396840963-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME
INFO  [2016-02-02 07:07:21,965] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S4 (mesos-slave6-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:21,981] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:22,196] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396842185 
DEBUG [2016-02-02 07:07:22,209] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396842207, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
DEBUG [2016-02-02 07:07:22,218] com.hubspot.singularity.mesos.SingularityMesosSchedulerDelegator: Handled status update for prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396841965-2-mesos_slave6_prod_sc.otsql.opentable.com-FIXME in 00:00.022
INFO  [2016-02-02 07:07:22,966] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396842966-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S8 (mesos-slave7-prod-sc.otsql.opentable.com)
DEBUG [2016-02-02 07:07:23,162] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396842966-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396843159 
INFO  [2016-02-02 07:07:23,967] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S8 (mesos-slave7-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:23,987] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:24,170] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396843967-2-mesos_slave7_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396844169 
DEBUG [2016-02-02 07:07:24,189] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396844188, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
INFO  [2016-02-02 07:07:24,975] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S5 (mesos-slave5-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:25,009] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:25,202] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396844975-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396845176 
INFO  [2016-02-02 07:07:25,971] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S5 (mesos-slave5-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:25,990] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:26,205] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396845971-2-mesos_slave5_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396846183 
DEBUG [2016-02-02 07:07:26,221] com.hubspot.singularity.smtp.SingularityMailer: Not sending TASK_FAILED for prod-umami-config-server - mail cooldown has 1545306 time left out of 3600000
INFO  [2016-02-02 07:07:26,978] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S9 (mesos-slave8-prod-sc.otsql.opentable.com)
INFO  [2016-02-02 07:07:27,006] com.hubspot.singularity.mesos.SingularityMesosScheduler: 1 tasks ([prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME]) launched with status DRIVER_RUNNING
DEBUG [2016-02-02 07:07:27,198] com.hubspot.singularity.mesos.SingularityMesosScheduler: Task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396846978-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME is now TASK_FAILED (Failed to launch container: Failed to 'docker pull docker-prod-sc.otenv.com/umami-config-server:23': exit status = exited with status 1 stderr = Error: Status 400 trying to pull repository umami-config-server: "{\n  \"errors\" : [ {\n    \"status\" : 400,\n    \"message\" : \"Unsupported docker v1 repository request for 'docker-v2'\"\n  } ]\n}"
) at 1454396847200 
DEBUG [2016-02-02 07:07:27,242] com.hubspot.singularity.scheduler.SingularityScheduler: Missing 1 instances of request prod-umami-config-server (matching tasks: [prod-umami-config-server-teamcity.2015.08.26T01.30.04-1453924178939-1-mesos_slave2_prod_sc.otsql.opentable.com-FIXME]), pending request: SingularityPendingRequest [requestId=prod-umami-config-server, deployId=teamcity.2015.08.26T01.30.04, timestamp=1454396847241, pendingType=TASK_DONE, user=Optional.absent(), cmdLineArgsList=[]]
INFO  [2016-02-02 07:07:27,975] com.hubspot.singularity.mesos.SingularityMesosScheduler: Launching task prod-umami-config-server-teamcity.2015.08.26T01.30.04-1454396847975-2-mesos_slave8_prod_sc.otsql.opentable.com-FIXME slot on slave 20151217-222149-3943699466-5050-19261-S9 (mesos-slave8-prod-sc.otsql.opentable.com)

This is against Singularity 0.4.3.
