
Mesos 1.1.2 #1571

Merged
merged 58 commits into master from mesos_1 on Oct 11, 2017

Conversation

@ssalinas
Member

ssalinas commented Jun 20, 2017

This upgrades us to Mesos 1.1.2. We can't go straight to the newest version because masters in 1.2 onward will no longer accept connections from 0.x slaves, so the upgrade path would be 😭 .

The API and protos objects also change quite a bit, meaning that all of our current task history with MesosTaskInfo, Offer, etc. saved in the JSON will not be readable in the new version unless we write the code to convert it. As an alternative to keeping the pieces in JSON, which requires Singularity client users to pull in a Mesos dependency, I would propose we find a way to wrap the data in our own POJOs instead (like we have done for much of the rest of the objects).
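A minimal sketch of the wrapping idea, with hypothetical names rather than the actual Singularity classes: copy only the fields we need out of the Mesos proto into a plain POJO, so the JSON we persist carries no Mesos types and client users don't need a Mesos dependency to deserialize it.

```java
// Hypothetical wrapper POJO (not the real Singularity class): holds only the
// fields we care about, copied out of the Mesos proto at write time, so the
// persisted JSON has no dependency on Mesos proto classes.
class WrappedTaskInfo {
  private final String taskId;
  private final String slaveId;
  private final String name;

  WrappedTaskInfo(String taskId, String slaveId, String name) {
    this.taskId = taskId;
    this.slaveId = slaveId;
    this.name = name;
  }

  String getTaskId() { return taskId; }
  String getSlaveId() { return slaveId; }
  String getName() { return name; }
}
```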

TODOs

FYIs

  • --work_dir flag must be set for all mesos slaves/agents
  • internally the slave -> agent rename is there, but all API endpoints and fields still reference slave as they did before for now

Eventual things we should do to keep up with newer features:

  • executor should honor the kill_policy setting of kill messages
  • executor shutdown grace period is set in ExecutorInfo
  • slave -> agent rename
  • use labels in ExecutorInfo in favor of source

The Fun Stuff

Things that we can explore once we upgrade:

  • support --http_command_executor
  • explore use of mesos-native health checks
  • per container linux capabilities
  • partition-aware mesos

frameworks can opt-in to the new PARTITION_AWARE capability. If they do this, their tasks will not be killed when a partition is healed. This allows frameworks to define their own policies for how to handle partitioned tasks. Enabling the PARTITION_AWARE capability also introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states are intended to eventually replace the TASK_LOST state.
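As an illustration of what opting in would eventually mean for us (this is a local sketch, not Singularity code): the new states would need to be folded into our existing handling until we write real partition-aware policies. The enum below mirrors the Mesos state names from the quote; the collapsing logic is an assumption about a reasonable interim policy.

```java
// Illustration only: a local enum mirroring the new Mesos task states, plus a
// sketch of collapsing them to the legacy TASK_LOST handling until we define
// our own partition-aware policies.
enum TaskState {
  TASK_LOST, TASK_UNREACHABLE, TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR, TASK_UNKNOWN
}

class PartitionStates {
  // Interim policy sketch: treat every new partition-aware state the way we
  // currently treat TASK_LOST.
  static TaskState toLegacy(TaskState state) {
    switch (state) {
      case TASK_UNREACHABLE:
      case TASK_DROPPED:
      case TASK_GONE:
      case TASK_GONE_BY_OPERATOR:
      case TASK_UNKNOWN:
        return TaskState.TASK_LOST;
      default:
        return state;
    }
  }
}
```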

ssalinas added some commits Jun 20, 2017

@ssalinas
Member

ssalinas commented Jun 21, 2017
Updates here:

  • current code runs and schedules tasks locally. Subscription, offers, framework messages, etc. all working smoothly
  • When written to json, the new protos objects still conform to the same structure as the old ones. So, while I would like to remove mesos protos from things written to json in zk, it isn't a requirement for upgrading
  • The executor can be left on the unversioned mesos library binding for now. It can still connect to newer masters with the older library as long as it is running the newer mesos libs underneath
  • a user for the framework is now required and is used in certain isolator calls. Defaulting this to root, overridable in the yaml config
  • docker images are using 1.1.1 because mesosphere never published a 1.1.2. Not a large enough difference to build my own
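For the framework user mentioned above, the yaml override could look something like the fragment below. The key name here is a guess for illustration, not the actual Singularity config key:

```yaml
mesos:
  # Hypothetical key name: user the framework registers as, used in certain
  # isolator calls. Defaults to root.
  frameworkUser: root
```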

Also /cc @tpetr in case you're interested in the upgrade at all ;)


@ssalinas

Member

ssalinas commented Jun 23, 2017

Next round of updates:

  • finished refactoring to use the HTTP API via https://github.com/mesosphere/mesos-rxjava. The scheduler lock is still in place, but the addition of observables frees us up to do some more interesting things internally if we want to. But, this means we are free from native mesos lib bindings!
  • Likely going to leave the protos stuff as-is for now. I'll do some additional tests for backwards compatibility, but looks like we should be all set there.

@ssalinas ssalinas modified the milestone: 0.17.0 Jun 23, 2017

ssalinas added some commits Jun 23, 2017

@ssalinas ssalinas changed the title from (WIP) Mesos 1.1.2 to Mesos 1.1.2 Jun 23, 2017

@ssalinas ssalinas added the hs_staging label Jun 30, 2017

@ssalinas

Member

ssalinas commented Jun 30, 2017

Remaining TODO on this PR:

  • Update the new client so that it can take a list of mesos master hosts to attempt to connect to. Right now it's just a single one
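The connect loop this TODO describes could be as simple as the stdlib-only sketch below. The `Connector` interface is hypothetical and stands in for whatever the real mesos-rxjava subscribe call ends up being:

```java
import java.util.List;

class MasterFailover {
  // Hypothetical seam standing in for the real subscribe/connect call.
  interface Connector {
    boolean tryConnect(String host); // true if the connection succeeded
  }

  // Walk the configured master hosts in order and stop at the first one that
  // accepts the connection; returns the host we connected to, or null if none did.
  static String connectToFirstAvailable(List<String> masterHosts, Connector connector) {
    for (String host : masterHosts) {
      if (connector.tryConnect(host)) {
        return host;
      }
    }
    return null;
  }
}
```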
@ssalinas

Member

ssalinas commented Jul 14, 2017

One additional comment here that would be a future TODO. It seems that the mesos-rxjava library could likely be used to subscribe to the executor endpoint as well, even though it is listed as a scheduler-only library. That upgrade would be much easier now that the scheduler portion is written, and shouldn't require any major configuration changes in the executor.


@ssalinas ssalinas added the hs_qa label Jul 20, 2017

@baconmania

Contributor

baconmania commented Oct 9, 2017

🚂


@ssalinas ssalinas added the hs_stable label Oct 9, 2017

private Thread subscriberThread;
@Inject
public SingularityMesosSchedulerClient(SingularityConfiguration configuration, @Named(SingularityMainModule.SINGULARITY_URI_BASE) final String singularityUriBase) {


@baconmania

baconmania Oct 11, 2017

Contributor

Would it make sense to have this implement AutoCloseable?



@ssalinas

ssalinas Oct 11, 2017

Member

My eventual thought for this is to have it be able to fix #273, meaning the same singleton scheduler client could subscribe again or renegotiate its connection. The client isn't ever used in a single try-with-resources type of scope, so the only closing that would have to be done is on shutdown. Also, the dropwizard guicier we use here doesn't have the auto-close-singleton-closeables bits that the other version does
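The shape being discussed could be sketched like this (hypothetical names, not the real `SingularityMesosSchedulerClient`): the singleton can re-subscribe after a disconnect, and `close()` is only ever called once, from shutdown, rather than via try-with-resources.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a singleton scheduler client that can renegotiate
// its connection, with close() reserved for process shutdown.
class ReconnectableSchedulerClient implements AutoCloseable {
  private final AtomicBoolean subscribed = new AtomicBoolean(false);
  private final AtomicBoolean closed = new AtomicBoolean(false);

  // Safe to call again after a disconnect; only illegal once shut down.
  void subscribe() {
    if (closed.get()) {
      throw new IllegalStateException("client is shut down");
    }
    subscribed.set(true);
  }

  // On a dropped connection, a later subscribe() renegotiates.
  void onDisconnect() {
    subscribed.set(false);
  }

  boolean isSubscribed() {
    return subscribed.get();
  }

  @Override
  public void close() { // called once, e.g. from a shutdown hook
    closed.set(true);
    subscribed.set(false);
  }
}
```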


@baconmania

Contributor

baconmania commented Oct 11, 2017

🚂


@ssalinas ssalinas merged commit b5b9bfb into master Oct 11, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@ssalinas ssalinas deleted the mesos_1 branch Oct 11, 2017
