# 1. Introduction

Any deployed Contrail Cluster will need to be modified as time goes on,
whether to scale the cluster by adding additional Controller/Compute nodes or
to maintain the existing Compute/Storage nodes. In particular, unresponsive
compute nodes may have to be pulled off the cluster while healthy servers are
simultaneously added to replace them. There is also the common case of user
misconfiguration of the initial deployment, which may require a removal and
cleanup before that node can be re-added to the cluster.

# 2. Problem statement

In an Ansible-provisioned Contrail Cluster, we do not currently have the
ability to modify the cluster in a simplified way using Ansible. There is only
a specialized workflow for adding additional Compute Nodes to a cluster, and
there are currently no playbooks or workflows to delete any node or role. In
previous releases of Contrail, we documented manual workflows such as the
following:

[Adding a Compute Node](https://www.juniper.net/documentation/en_US/contrail4.0/topics/task/installation/add-new-compute-node-vnc.html)

[Adding a Controller Node](https://www.juniper.net/documentation/en_US/contrail4.1/topics/concept/add-node-existing-container.html)

Neither of these provides a seamless Ansible-backed workflow that allows the
user to intuitively manage their Contrail Cluster. Since we have moved all our
provisioning logic to Ansible, we need to be able to invoke the same set of
playbooks regardless of the provisioning operation we intend to perform. We
should not expect the user to manually log in to any node of the cluster in
order to expand or shrink it. All provisioning or deployment actions should
have a simple way of being invoked from an Ansible playbook, and Ansible needs
to handle all of these previously manual tasks internally.

# 3. Proposed solution

Our proposed solution has two broad aims:

1. Simplify and standardize the workflow needed to achieve any desired
   cluster deployment.
2. Maintain the ability to have fine-grained control over the deployment of
   roles.

Goal number 1 is straightforward: we wish to be able to deploy a cluster based
on a specified instances file, regardless of the action needed to reach that
state (fresh deployment/delete node/add role). If a user specifies what they
wish their cluster to look like in the instances file, the Ansible playbook
has to be able to figure out the set of actions required to have the cluster
deployed exactly as intended.

Goal number 2 relates to the granularity at which we perform these actions.
Since we currently allow a user to deploy a single role on a single node, we
need to maintain the ability to do this minimal action alone, if needed.

Further, we want our solution to be generic such that a similar workflow can
be followed regardless of the cloud orchestrator being deployed. The
complexity of each action in any orchestrator should be handled by the
playbooks for that particular orchestrator.

This spec will not handle any orchestrator-related actions, whether add or
delete. Separate spec files/blueprints will be filed for that per
orchestrator, depending on what that particular orchestrator supports.

Our solution is to first calculate a list of actions to take for each node in
the cluster when the main installation playbook, "install_contrail.yml", is
called. This playbook consumes the supplied instances.yml file and, after the
per-node list of actions has been calculated, calls the appropriate roles for
those nodes. The construction of this list of actions depends both on the
supplied instances configuration and on information retrieved from the
existing cluster (if any).

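For illustration only, a minimal instances file for a two-node cluster might
look like the sketch below. The node names, addresses and credentials are
placeholders, and the exact keys are governed by the deployer's instances
schema rather than by this spec:

```yaml
# Hypothetical instances.yml describing the intended end state of the cluster.
provider_config:
  bms:
    ssh_user: root          # placeholder credentials
    ssh_pwd: "<password>"
instances:
  server1:                  # placeholder node names and addresses
    provider: bms
    ip: 10.0.0.11
    roles:                  # roles the user wants provisioned on this node
      config:
      control:
      webui:
  server2:
    provider: bms
    ip: 10.0.0.12
    roles:
      vrouter:
```
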
Alongside this exercise, the installation playbooks, which today are organized
by the component being deployed (contrail/openstack/k8s), will be split into
more granular "role"-based playbooks that match the roles we support in the
instances file.

As the first step before deployment, the install playbook will map the state
of the existing cluster (getting the deployment details from the Config API
server) and compare it to the intended state specified in the instances file.
By comparing the two, it will come up with a per-node list of roles that have
to be provisioned or removed. Let us look at the different use cases in a
little more detail.

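As a rough sketch of this comparison, the per-node lists could be derived with
Ansible's built-in difference filter. The variable names (instance_roles,
cluster_roles, existing_node_roles) and the place in the playbook where this
runs are assumptions made for illustration, not the deployer's actual
implementation:

```yaml
# For each node: roles to add are requested in instances.yml but not yet
# deployed; roles to delete are deployed but absent from instances.yml.
- name: Compute per-node provision and delete lists
  ansible.builtin.set_fact:
    roles_to_add: "{{ instance_roles | difference(cluster_roles) }}"
    roles_to_delete: "{{ cluster_roles | difference(instance_roles) }}"
  vars:
    instance_roles: "{{ instances[inventory_hostname].roles | default({}) | list }}"
    cluster_roles: "{{ existing_node_roles[inventory_hostname] | default([]) }}"
```
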
## 3.1 Use cases

In each use case, we will use the following notation to illustrate how our
action list is calculated:

* Instances(A) - list of roles provided for node A in the instances file
* Cluster(A) - list of roles currently successfully deployed on node A

In each of the following cases, the user provides the nodes, associated roles
and Contrail configuration in the instances file, as is done today.

### 3.1.1 Fresh provision of a cluster

Since the install playbook will not be able to contact any Contrail Config API
server, it can conclude that there is no existing deployment, i.e.
Cluster(A) = [] for every node A in the cluster.

Thus our final list of roles to provision is Instances(A) for every node A in
the cluster, and we can call the appropriate Ansible role for each of these
roles on each node.

### 3.1.2 Adding a role to a node in the cluster

The install playbook will be able to contact the Contrail Config API server of
the existing deployment and calculate the list of provisioned roles for all
nodes. For one of these nodes there will be an extra role in Instances(A).
Thus, the set of roles to be provisioned for each node is
Instances(A) - Cluster(A), and we can call the appropriate Ansible role for
each of these roles.

### 3.1.3 Deleting a role from a node in the cluster

Before describing the next two scenarios, some terms used below are defined
here:

"De-registering tasks" refers to tasks that simply remove a node from the
Contrail Config API and remove it logically from the cluster. These tasks do
not need to actually run on the node where the role is being deleted.

"Cleanup tasks" refers to tasks that are called on the node for the role being
deleted. These include actions like removing containers, images and
directories related to that role.

The delete task file for a role will be composed of both of these types of
tasks. The cleanup tasks will be controlled by a cleanup flag, which will be
enabled by default.

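To make the split concrete, a delete task file for a role could be structured
as below. This is a minimal sketch: the file name delete_vrouter.yml, the
cleanup variable and the included task files are illustrative assumptions, not
the deployer's existing layout:

```yaml
# delete_vrouter.yml (hypothetical): remove the vrouter role from a node.

# De-registering tasks: remove the node's vrouter objects from the Contrail
# Config API. These run against the API server, not the node being cleaned.
- name: De-register the vrouter from the Contrail Config API
  ansible.builtin.include_tasks: deregister_vrouter.yml

# Cleanup tasks: remove containers, images and directories on the node
# itself. Skipped when the cleanup flag is turned off.
- name: Clean up vrouter containers, images and directories on the node
  ansible.builtin.include_tasks: cleanup_vrouter.yml
  when: cleanup | default(true) | bool
```
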
The install playbook will be able to contact the Contrail Config API server of
the existing deployment and calculate the list of provisioned roles for all
nodes. For one of these nodes there will be an extra role in Cluster(A).
Thus, the set of roles to be deleted for each node is
Cluster(A) - Instances(A), and we can call the appropriate delete YAML task
file for each of these roles.

### 3.1.4 Removing a node from the cluster

The instances.yml file will not have any entry for this node, but we will
still get the node's role list from the Contrail Config API server. In this
scenario, Instances(A) will be empty while Cluster(A) will contain the roles
currently provisioned on that node.

There are two cases where we remove a node:

#### 1. Node is not reachable via SSH:

In this case we will not be able to execute tasks on the target, so only the
mandatory de-registering tasks will be executed to remove the node from the
Config API server. The cleanup tasks will be skipped.

#### 2. Node is reachable via SSH:

This case is similar to 3.1.3, where we remove all the roles currently on that
node.

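One possible way for the playbook to tell these two cases apart is sketched
below; the fact name node_reachable is a placeholder, and whether the deployer
uses a ping probe or some other check is an assumption of this sketch:

```yaml
# Probe the node being removed; do not fail the play if it is unreachable.
- name: Check whether the node being removed answers over SSH
  ansible.builtin.ping:
  register: node_ping
  ignore_unreachable: true

# De-registering tasks always run; cleanup tasks run only when the probe
# succeeded (case 2 above).
- name: Record whether cleanup tasks can run on this node
  ansible.builtin.set_fact:
    node_reachable: "{{ not (node_ping.unreachable | default(false)) }}"
```
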
Additionally, if the user wishes to speed up the removal of deleted nodes from
the cluster, we provide a global cleanup flag which the user can set to "no".
This results in the cleanup tasks being skipped for all deleted nodes and
roles, so that only the de-registering tasks are run.

## 3.2 Alternatives considered

### Specifically calling an add or delete playbook

The general idea here was to have the operator manually call the appropriate
add or delete playbook based on their requirements.

PROS: This solution does not interfere with the provision code for a fresh
provision. The user can craft a very specific configuration to add to the
cluster or delete from the cluster.

CONS: There are a lot of unsupported edge cases, and exposing the delete and
add playbooks directly to the end user increases the chances of failed
deployments. In addition, there are issues with the input files that the
operator would use to delete roles; these are described below.

#### Using a separate instances file just to delete

Today, we use an instances file to provision a cluster, with the understanding
that the instances file reflects the expected end state of the cluster after
provisioning. By adding an additional file that lists only the nodes we want
to delete, we lose the complete picture of what the cluster looks like after
any such operation, since Ansible does not maintain state. In addition, we
change the meaning of the input instances file, which as of today is a
declarative representation of the cluster we want.

#### Passing a list of nodes to delete as an additional cluster parameter

This approach is flawed in a similar way to the one above, as we lose the
picture of the cluster after a delete operation. It depends on the user
manually maintaining the state in some YAML file, and it likewise departs from
the inherently declarative nature of the instances file.

## 3.3 Provision Architecture and Workflow Changes

The architecture of the Ansible playbooks for deployment remains largely
unchanged. Today we have action-specific playbooks such as
install_contrail.yml, configure_instances.yml and destroy_contrail.yml.
Alongside this exercise, the Ansible roles are expected to be split into more
granular roles that match the granularity we support in the instances file.

As a result, every role that we support in instances.yaml has to have an
appropriate delete_<role>.yml, because that is the granularity at which we
have to support role deletion.

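As an illustration of that granularity, the install playbook could include the
matching delete task file for every role that the comparison marked for
deletion on a host. The variable roles_to_delete and the delete_<role>.yml
naming follow the sketches above and are assumptions, not the current deployer
code:

```yaml
# For each role present on the node but absent from instances.yml, include
# its role-specific delete task file (e.g. delete_vrouter.yml).
- name: Delete roles that were removed from the instances file
  ansible.builtin.include_tasks: "delete_{{ item }}.yml"
  loop: "{{ roles_to_delete | default([]) }}"
```
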
If an operator tries to add and delete roles simultaneously, for example by
changing a compute node into a controller, we need to delete the old role
first, initiate a cleanup of the node, and only then provision the added role.
Even though this is not a change from the workflow an operator would follow
manually, it is a change as far as the Ansible deployer is concerned.

# 4. Implementation

The install playbook will need an additional role that contacts all nodes to
calculate the various role lists described above for each node. To calculate
the roles already deployed on the cluster, the first set of tasks will
communicate with the Contrail Config API of the existing cluster. The cluster
node_role dictionary will also be constructed as is done today; these are the
roles that are to be installed in this playbook run.

The new tasks to calculate existing roles will be implemented in a similar
fashion to the existing "openstack_host_groups" filter plugin, which
calculates the list of nodes for each openstack role. These filter plugins are
written in Python and make API calls to the Contrail Config API to construct
the dictionary of existing roles.

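From the playbook side, such a plugin could be consumed roughly as below. The
filter name contrail_existing_roles and its argument are placeholders for
whatever the plugin ends up exposing; only "openstack_host_groups" is an
existing plugin referenced by this spec:

```yaml
# Build the Cluster(A) dictionary by querying the Contrail Config API through
# a (hypothetical) filter plugin, for use in the comparisons described above.
- name: Calculate the roles already provisioned on the existing cluster
  ansible.builtin.set_fact:
    existing_node_roles: "{{ instances | contrail_existing_roles(contrail_configuration) }}"
```
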
# 5. Performance and scaling impact

N/A

# 6. Upgrade

N/A

# 7. Deprecations

Some workflows for adding compute nodes will be deprecated, and the wikis have
to be updated to reflect the use of the install playbook.

# 8. Dependencies

N/A

# 9. Testing

The install playbook needs to be tested in all supported scenarios described
above. In addition, each individual role playbook has to be tested for the
action it is expected to perform (add/delete).

# 10. Documentation Impact

New documentation has to be added for add and delete of roles.

# 11. References