Scheduler extension #13580

Merged
1 commit merged on Nov 25, 2015
117 changes: 117 additions & 0 deletions docs/design/scheduler_extender.md
@@ -0,0 +1,117 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.1/docs/design/scheduler_extender.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Scheduler extender
Member:
Is there a reason not to call this an 'extension' ? We use nouns for things like this elsewhere in the system (viz: 'plugin').

Member:
Although I'm sure it wasn't the original intention, one reason to call it an "extender" might be so that we can reserve the term "extension" for the unified extension architecture we will hopefully develop eventually (to cover all call-out extensions like this one, admission controller extension, etc.).


There are three ways to add new scheduling rules (predicates and priority functions) to Kubernetes: (1) adding these rules to the scheduler and recompiling (described here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md), (2) implementing your own scheduler process that runs instead of, or alongside, the standard Kubernetes scheduler, or (3) implementing a "scheduler extender" process that the standard Kubernetes scheduler calls out to as a final pass when making scheduling decisions.
Contributor:
AFAICT, we can improve the docs by separating them into two sub-sections:

  • Motivation
  • Technical details/approaches. This part could be further separated into one sub-section about the algorithm/workflow and another about API changes?

Contributor:
Is it common to lay out goals and non-goals in k8s docs? Another nice thing would be to lay out goals such as:

  • Allow an external process to access the generic scheduler API, including filtering and prioritization. The ordering is: the k8s scheduler goes first, then the extension...

Contributor (author):
Thanks for the feedback, will address this separately in another PR.


This document describes the third approach. It is intended for use cases where scheduling decisions need to take into account resources that are not directly managed by the standard Kubernetes scheduler; the extender helps make scheduling decisions based on such resources. (Note that the three approaches are not mutually exclusive.)

When scheduling a pod, the extender allows an external process to filter and prioritize nodes. Two separate HTTP/HTTPS calls are issued to the extender, one for the "filter" action and one for the "prioritize" action. To use the extender, you must create a scheduler policy configuration file that specifies how to reach the extender, whether to use HTTP or HTTPS, and the timeout.

```go
// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
// it is assumed that the extender chose not to provide that extension.
type ExtenderConfig struct {
    // URLPrefix at which the extender is available
    URLPrefix string `json:"urlPrefix"`
    // Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to the extender.
    FilterVerb string `json:"filterVerb,omitempty"`
    // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to the extender.
    PrioritizeVerb string `json:"prioritizeVerb,omitempty"`
    // The numeric multiplier for the node scores that the prioritize call generates.
    // The weight should be a positive integer.
    Weight int `json:"weight,omitempty"`
```
Contributor:
There seems to be only one weight value. But shouldn't one extender be able to have multiple prioritize functions?

Additionally, why not let the remote call return the weight too?

Member:
Each extender has one priority function and one predicate function.

Contributor:
What prevents each extender from having multiple filter and priority functions? It's one HTTP roundtrip :)

Contributor:
So a way to model this is that one extender is an isolated entity that provides some extension functions to do the work.

Member:
> What prevents each extender from having multiple filter and priority functions? It's one HTTP roundtrip :)

I think it would just make things more complicated. I think it's simpler if there's one prioritize endpoint that runs one prioritize function, and one filter endpoint that runs one predicate function. Since they are presumably in the same process, it should not be hard to combine the logic into one predicate function and one priority function, so I don't think this overly restricts the generality of the extender.

The remaining `ExtenderConfig` fields:

```go
    // EnableHttps specifies whether https should be used to communicate with the extender
    EnableHttps bool `json:"enableHttps,omitempty"`
    // TLSConfig specifies the transport layer security config
    TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"`
    // HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the scheduling of the pod.
    // A prioritize timeout is ignored; k8s/other extenders' priorities are used to select the node.
    HTTPTimeout time.Duration `json:"httpTimeout,omitempty"`
}
```
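
As the review thread above notes, each extender exposes at most one filter verb and one prioritize verb, but it can fold several checks behind them. A minimal sketch of that idea for an extender written in Go (the `nodePredicate` type and `combinePredicates` helper are illustrative, not part of the proposed API):

```go
package extender

import "k8s.io/kubernetes/pkg/api"

// nodePredicate is a hypothetical signature for one extender-side check.
type nodePredicate func(pod *api.Pod, node api.Node) bool

// combinePredicates folds several checks into the single predicate that backs
// the extender's filter verb: a node passes only if every check passes.
func combinePredicates(preds ...nodePredicate) nodePredicate {
    return func(pod *api.Pod, node api.Node) bool {
        for _, p := range preds {
            if !p(pod, node) {
                return false
            }
        }
        return true
    }
}
```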

A sample scheduler policy file with extender configuration:

```json
{
  "predicates": [
    {
      "name": "HostName"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "PodFitsResources"
    }
  ],
  "priorities": [
    {
      "name": "LeastRequestedPriority",
      "weight": 1
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
      "filterVerb": "filter",
      "enableHttps": false
    }
  ]
}
```
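
At the time of this PR, a policy file like this is supplied to the standard kube-scheduler via its `--policy-config-file` flag; the `examples/scheduler-policy-config-with-extender.json` file added later in this diff is a fuller example.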

The arguments passed to the FilterVerb endpoint on the extender are the pod and the set of nodes that passed the k8s predicates. The arguments passed to the PrioritizeVerb endpoint are the pod and the set of nodes that passed both the k8s predicates and the extender's predicates.
Contributor:
We can improve the docs by adding two more sections here -- one for predicate and one for prioritization.

Contributor (author):
Will take up in another PR


```go
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
    // Pod being scheduled
    Pod api.Pod `json:"pod"`
    // List of candidate nodes where the pod can be scheduled
    Nodes api.NodeList `json:"nodes"`
}
```

The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call returns priorities for each node (schedulerapi.HostPriorityList).

The "filter" call may prune the set of nodes based on its predicates. Scores returned by the "prioritize" call are added to the k8s scores (computed through its priority functions) and used for final host selection.

Multiple extenders can be configured in the scheduler policy.
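
To make the call flow concrete, here is a minimal sketch of an extender process serving both verbs under the `urlPrefix` used in the sample policy above (the sample policy only configures `filterVerb`). It deliberately avoids the real `api`/`schedulerapi` types and decodes into trimmed-down stand-ins, so any field names beyond `pod`, `nodes`, `host`, and `score` are assumptions; a real extender would reuse the Kubernetes types.

```go
// A minimal, illustrative scheduler extender. The request/response shapes are
// trimmed-down stand-ins for api.Pod, api.NodeList and schedulerapi.HostPriority;
// only the fields used here are modeled.
package main

import (
    "encoding/json"
    "log"
    "net/http"
)

type node struct {
    Metadata struct {
        Name string `json:"name"`
    } `json:"metadata"`
}

type nodeList struct {
    Items []node `json:"items"`
}

type extenderArgs struct {
    Pod   json.RawMessage `json:"pod"` // the pod is passed through untouched in this sketch
    Nodes nodeList        `json:"nodes"`
}

type hostPriority struct {
    Host  string `json:"host"`
    Score int    `json:"score"`
}

// filter echoes back every candidate node; a real extender would prune nodes
// that fail its own predicates.
func filter(w http.ResponseWriter, r *http.Request) {
    var args extenderArgs
    if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    json.NewEncoder(w).Encode(args.Nodes)
}

// prioritize gives every node the same score; a real extender would score
// nodes based on the external resources it manages.
func prioritize(w http.ResponseWriter, r *http.Request) {
    var args extenderArgs
    if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    scores := make([]hostPriority, 0, len(args.Nodes.Items))
    for _, n := range args.Nodes.Items {
        scores = append(scores, hostPriority{Host: n.Metadata.Name, Score: 5})
    }
    json.NewEncoder(w).Encode(scores)
}

func main() {
    // Paths match urlPrefix + verb from the sample policy above.
    http.HandleFunc("/api/scheduler/filter", filter)
    http.HandleFunc("/api/scheduler/prioritize", prioritize)
    log.Fatal(http.ListenAndServe("127.0.0.1:12345", nil))
}
```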

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
5 changes: 3 additions & 2 deletions examples/examples_test.go
@@ -235,7 +235,8 @@ func TestExampleObjectSchemas(t *testing.T) {
 			"daemon": &extensions.DaemonSet{},
 		},
 		"../examples": {
-			"scheduler-policy-config": &schedulerapi.Policy{},
+			"scheduler-policy-config":               &schedulerapi.Policy{},
+			"scheduler-policy-config-with-extender": &schedulerapi.Policy{},
 		},
 		"../examples/rbd/secret": {
 			"ceph-secret": &api.Secret{},
@@ -409,7 +410,7 @@ func TestExampleObjectSchemas(t *testing.T) {
 				t.Logf("skipping : %s/%s\n", path, name)
 				return
 			}
-			if name == "scheduler-policy-config" {
+			if strings.Contains(name, "scheduler-policy-config") {
 				if err := schedulerapilatest.Codec.DecodeInto(data, expectedType); err != nil {
 					t.Errorf("%s did not decode correctly: %v\n%s", path, err, string(data))
 					return
25 changes: 25 additions & 0 deletions examples/scheduler-policy-config-with-extender.json
@@ -0,0 +1,25 @@
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extender": {
    "url": "http://127.0.0.1:12346/scheduler",
    "apiVersion": "v1beta1",
    "filterVerb": "filter",
    "prioritizeVerb": "prioritize",
    "weight": 5,
    "enableHttps": false
  }
}
27 changes: 14 additions & 13 deletions plugin/pkg/scheduler/algorithm/priorities/priorities.go
@@ -25,6 +25,7 @@ import (
 	"k8s.io/kubernetes/pkg/labels"
 	"k8s.io/kubernetes/plugin/pkg/scheduler/algorithm"
 	"k8s.io/kubernetes/plugin/pkg/scheduler/algorithm/predicates"
+	schedulerapi "k8s.io/kubernetes/plugin/pkg/scheduler/api"
 )
 
 // the unused capacity is calculated on a scale of 0-10
@@ -73,7 +74,7 @@ func getNonzeroRequests(requests *api.ResourceList) (int64, int64) {
 
 // Calculate the resource occupancy on a node. 'node' has information about the resources on the node.
 // 'pods' is a list of pods currently scheduled on the node.
-func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.HostPriority {
+func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) schedulerapi.HostPriority {
 	totalMilliCPU := int64(0)
 	totalMemory := int64(0)
 	capacityMilliCPU := node.Status.Capacity.Cpu().MilliValue()
@@ -104,7 +105,7 @@ func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) al
 		cpuScore, memoryScore,
 	)
 
-	return algorithm.HostPriority{
+	return schedulerapi.HostPriority{
 		Host: node.Name,
 		Score: int((cpuScore + memoryScore) / 2),
 	}
@@ -114,14 +115,14 @@ func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) al
 // It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
 // based on the minimum of the average of the fraction of requested to capacity.
 // Details: cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity) / 2
-func LeastRequestedPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func LeastRequestedPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	nodes, err := nodeLister.List()
 	if err != nil {
-		return algorithm.HostPriorityList{}, err
+		return schedulerapi.HostPriorityList{}, err
 	}
 	podsToMachines, err := predicates.MapPodsToMachines(podLister)
 
-	list := algorithm.HostPriorityList{}
+	list := schedulerapi.HostPriorityList{}
 	for _, node := range nodes.Items {
 		list = append(list, calculateResourceOccupancy(pod, node, podsToMachines[node.Name]))
 	}
@@ -144,7 +145,7 @@ func NewNodeLabelPriority(label string, presence bool) algorithm.PriorityFunctio
 // CalculateNodeLabelPriority checks whether a particular label exists on a node or not, regardless of its value.
 // If presence is true, prioritizes nodes that have the specified label, regardless of value.
 // If presence is false, prioritizes nodes that do not have the specified label.
-func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	var score int
 	nodes, err := nodeLister.List()
 	if err != nil {
@@ -157,7 +158,7 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 		labeledNodes[node.Name] = (exists && n.presence) || (!exists && !n.presence)
 	}
 
-	result := []algorithm.HostPriority{}
+	result := []schedulerapi.HostPriority{}
 	//score int - scale of 0-10
 	// 0 being the lowest priority and 10 being the highest
 	for nodeName, success := range labeledNodes {
@@ -166,7 +167,7 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 		} else {
 			score = 0
 		}
-		result = append(result, algorithm.HostPriority{Host: nodeName, Score: score})
+		result = append(result, schedulerapi.HostPriority{Host: nodeName, Score: score})
 	}
 	return result, nil
 }
@@ -177,21 +178,21 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 // close the two metrics are to each other.
 // Detail: score = 10 - abs(cpuFraction-memoryFraction)*10. The algorithm is partly inspired by:
 // "Wei Huang et al. An Energy Efficient Virtual Machine Placement Algorithm with Balanced Resource Utilization"
-func BalancedResourceAllocation(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func BalancedResourceAllocation(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	nodes, err := nodeLister.List()
 	if err != nil {
-		return algorithm.HostPriorityList{}, err
+		return schedulerapi.HostPriorityList{}, err
 	}
 	podsToMachines, err := predicates.MapPodsToMachines(podLister)
 
-	list := algorithm.HostPriorityList{}
+	list := schedulerapi.HostPriorityList{}
 	for _, node := range nodes.Items {
 		list = append(list, calculateBalancedResourceAllocation(pod, node, podsToMachines[node.Name]))
 	}
 	return list, nil
 }
 
-func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.HostPriority {
+func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*api.Pod) schedulerapi.HostPriority {
 	totalMilliCPU := int64(0)
 	totalMemory := int64(0)
 	score := int(0)
@@ -234,7 +235,7 @@ func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*ap
 		score,
 	)
 
-	return algorithm.HostPriority{
+	return schedulerapi.HostPriority{
 		Host: node.Name,
 		Score: score,
 	}