
Mesos 1.1.2 #1571

Merged: 58 commits merged into master from mesos_1 on Oct 11, 2017

Conversation

@ssalinas (Member) commented Jun 20, 2017

This upgrades us to mesos 1.1.2. We can't go straight to the newest version because masters in 1.2 onward will no longer accept connections from 0.x slaves, so the upgrade path would be 😭 .

The API and proto objects also change quite a bit, meaning that all of our current task history with MesosTaskInfo, Offer, etc. saved in the json will not be readable in the new version unless we write the code to convert it. As an alternative to keeping the pieces in json, which requires Singularity client users to pull in a mesos dep, I would propose we find a way to wrap the data in our own POJOs instead (like we have done for much of the rest of the objects); a rough sketch of the idea is below.
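For illustration, a minimal sketch of what such a wrapper could look like. The class name and the fields copied here are hypothetical, not Singularity's actual API:

```java
import org.apache.mesos.Protos;

// Hypothetical wrapper: copy only the fields we care about out of the
// Mesos proto so the task history JSON no longer embeds proto types.
public class SingularityTaskInfoView {
  private final String taskId;
  private final String name;
  private final String agentId;

  public SingularityTaskInfoView(String taskId, String name, String agentId) {
    this.taskId = taskId;
    this.name = name;
    this.agentId = agentId;
  }

  public static SingularityTaskInfoView fromProto(Protos.TaskInfo taskInfo) {
    return new SingularityTaskInfoView(
        taskInfo.getTaskId().getValue(),
        taskInfo.getName(),
        taskInfo.getSlaveId().getValue()); // the 1.1 protos still say slave_id
  }

  public String getTaskId() { return taskId; }
  public String getName() { return name; }
  public String getAgentId() { return agentId; }
}
```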

TODOs

FYIs

  • the --work_dir flag must be set for all mesos slaves/agents
  • internally the slave -> agent rename has happened, but all API endpoints and fields still reference slave as they did before, for now

Eventual things we should do to keep up with newer features:

  • the executor should honor the kill_policy setting of kill messages
  • the executor shutdown grace period is set in ExecutorInfo (see the sketch after this list)
  • slave -> agent rename
  • use labels in ExecutorInfo in place of source (also in the sketch below)
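For reference, a sketch of what those last two items look like against the 1.1 Java protos; the executor id, command, and label values here are made up:

```java
import java.util.concurrent.TimeUnit;
import org.apache.mesos.Protos;

class ExecutorInfoSketch {
  static Protos.ExecutorInfo example() {
    return Protos.ExecutorInfo.newBuilder()
        .setExecutorId(Protos.ExecutorID.newBuilder().setValue("example-executor"))
        .setCommand(Protos.CommandInfo.newBuilder().setValue("./run-executor.sh"))
        // shutdown grace period is now carried on ExecutorInfo itself
        .setShutdownGracePeriod(Protos.DurationInfo.newBuilder()
            .setNanoseconds(TimeUnit.SECONDS.toNanos(30)))
        // labels are the forward-looking replacement for the `source` field
        .setLabels(Protos.Labels.newBuilder()
            .addLabels(Protos.Label.newBuilder().setKey("source").setValue("singularity")))
        .build();
  }
}
```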

The Fun Stuff

Things that we can explore once we upgrade:

  • support --http_command_executor
  • explore use of mesos-native health checks
  • per container linux capabilities
  • partition-aware mesos

frameworks can opt-in to the new PARTITION_AWARE capability. If they do this, their tasks will not be killed when a partition is healed. This allows frameworks to define their own policies for how to handle partitioned tasks. Enabling the PARTITION_AWARE capability also introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states are intended to eventually replace the TASK_LOST state.
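Opting in, if we ever go that route, should just be a matter of adding the capability to our FrameworkInfo. A minimal sketch against the 1.1 Java protos:

```java
import org.apache.mesos.Protos.FrameworkInfo;

class PartitionAwareSketch {
  // Returns a copy of the given FrameworkInfo with PARTITION_AWARE added.
  static FrameworkInfo withPartitionAware(FrameworkInfo base) {
    return base.toBuilder()
        .addCapabilities(FrameworkInfo.Capability.newBuilder()
            .setType(FrameworkInfo.Capability.Type.PARTITION_AWARE))
        .build();
  }
}
```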

ssalinas added 2 commits Jun 20, 2017
@ssalinas (Member, Author) commented Jun 21, 2017

Updates here:

  • current code runs and schedules tasks locally. Subscription, offers, framework messages, etc. are all working smoothly
  • When written to json, the new proto objects still conform to the same structure as the old ones. So, while I would like to remove mesos protos from things written to json in zk, it isn't a requirement for upgrading (see the compatibility check after this list)
  • The executor can be left on the unversioned mesos library binding for now. It can still connect to newer masters with the older library as long as it is running the newer mesos libs underneath
  • a user for the framework is now required and is used in certain isolator calls. This defaults to root and is overridable in the yaml config
  • docker images are using 1.1.1 because mesosphere never published a 1.1.2. Not a large enough difference to warrant building my own
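A quick way to sanity check that compatibility claim, assuming Jackson with a protobuf module (e.g. HubSpot's jackson-datatype-protobuf); the JSON string here is made up:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.hubspot.jackson.datatype.protobuf.ProtobufModule;
import org.apache.mesos.Protos;

class ProtoJsonCompatCheck {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper().registerModule(new ProtobufModule());

    // JSON shaped the way the old protos serialized it; if the new protos
    // keep the same field names, this round-trips cleanly.
    String oldJson = "{\"value\":\"my-task-id\"}";
    Protos.TaskID taskId = mapper.readValue(oldJson, Protos.TaskID.class);
    System.out.println(mapper.writeValueAsString(taskId)); // {"value":"my-task-id"}
  }
}
```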

Also /cc @tpetr in case you're interested in the upgrade at all ;)

@ssalinas (Member, Author) commented Jun 23, 2017

Next round of updates:

  • finished refactoring to use the http api via https://github.com/mesosphere/mesos-rxjava . The scheduler lock is still in place, but the addition of observables frees us up to do some more interesting things internally if we want to (see the sketch after this list). Best of all, this means we are free from native mesos lib bindings!
  • Likely going to leave the protos stuff as-is for now. I'll do some additional tests for backwards compatibility, but it looks like we should be all set there.
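To gesture at what the observables buy us, a sketch of the shape of the idea (this is not the actual mesos-rxjava wiring; events are plain strings here for simplicity):

```java
import rx.Observable;
import rx.schedulers.Schedulers;

class EventStreamSketch {
  // Events from the master arrive as a stream we can filter and hand off
  // to other threads, instead of native libmesos callbacks.
  static void consume(Observable<String> events) {
    events
        .filter(e -> e.startsWith("OFFERS"))       // only care about offers here
        .observeOn(Schedulers.computation())       // process off the I/O thread
        .subscribe(
            offer -> System.out.println("handle " + offer),
            error -> System.err.println("event stream failed: " + error));
  }
}
```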
@ssalinas ssalinas modified the milestone: 0.17.0 Jun 23, 2017
ssalinas added 3 commits Jun 23, 2017
@ssalinas ssalinas changed the title from (WIP) Mesos 1.1.2 to Mesos 1.1.2 Jun 23, 2017
@ssalinas ssalinas added the hs_staging label Jun 30, 2017
@ssalinas (Member, Author) commented Jun 30, 2017

Remaining TODO on this PR:

  • Update the new client so that it can take a list of mesos master hosts to attempt to connect to. Right now it's just a single one (rough sketch below).
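Roughly what I have in mind; connect() here is a hypothetical stand-in for the client's real subscribe call, and /api/v1/scheduler is the standard Mesos HTTP scheduler endpoint:

```java
import java.net.URI;
import java.util.List;

class MasterFailoverSketch {
  // Try each configured master host in turn instead of hard-coding one URI.
  static URI firstReachable(List<String> masterHosts) {
    for (String host : masterHosts) {
      URI candidate = URI.create(String.format("http://%s/api/v1/scheduler", host));
      try {
        connect(candidate);
        return candidate;
      } catch (Exception e) {
        // fall through and try the next master
      }
    }
    throw new IllegalStateException("No mesos master reachable: " + masterHosts);
  }

  static void connect(URI uri) throws Exception { /* real connection attempt here */ }
}
```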
ssalinas added 12 commits Jul 18, 2017
@ssalinas ssalinas added the hs_qa label Jul 20, 2017
ssalinas added 8 commits Jul 21, 2017
(Mesos 1) Better backpressure and abort logic
@baconmania (Contributor) commented Oct 9, 2017

🚢

@ssalinas ssalinas added the hs_stable label Oct 9, 2017
```java
private Thread subscriberThread;

@Inject
public SingularityMesosSchedulerClient(SingularityConfiguration configuration, @Named(SingularityMainModule.SINGULARITY_URI_BASE) final String singularityUriBase) {
```


@baconmania (Contributor) commented Oct 11, 2017
Would it make sense to have this implement AutoCloseable?


@ssalinas (Member, Author) commented Oct 11, 2017
My eventual thought for this is to have it be able to fix #273, meaning the same singleton scheduler client could subscribe again or renegotiate its connection. The client isn't ever used in a single try-with-resources type of scope, so the only closing that would have to be done is on shutdown. Also, the dropwizard guicier we use here doesn't have the auto-close-singleton-closeables bits that the other version does.
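If we did eventually implement it, a shutdown-only close could be as small as this sketch (interrupting the subscriber thread is an assumed teardown, not the actual implementation):

```java
public class SingularityMesosSchedulerClient implements AutoCloseable {
  private volatile Thread subscriberThread;

  @Override
  public void close() {
    Thread thread = subscriberThread;
    if (thread != null) {
      thread.interrupt(); // stop the event-stream subscriber on shutdown
    }
  }
}
```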

@baconmania (Contributor) commented Oct 11, 2017

🚢

@ssalinas ssalinas merged commit b5b9bfb into master Oct 11, 2017
2 checks passed

  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • continuous-integration/travis-ci/push: The Travis CI build passed
@ssalinas ssalinas deleted the mesos_1 branch Oct 11, 2017