
Added flows for un-manage cluster #798

Merged: 3 commits into Tendrl:master on Feb 15, 2018

Conversation

@shtripat (Member) commented Jan 5, 2018

tendrl-bug-id: #797

Signed-off-by: Shubhendu shtripat@redhat.com

@shtripat force-pushed the unmanage-cluster branch 2 times, most recently from f5532b8 to 5312b2d on January 8, 2018, 04:53
from tendrl.commons.objects.node import Node


class ClearClusterDetails(objects.BaseAtom):
Contributor:

Let's call it Unmanage or Delete cluster.

Member Author:

Will change to DeleteClusterDetails

).load()

try:
    NS._int.client.delete(
Contributor:

Use the NS.tendrl.objects.Cluster

Member Author:

Done

from tendrl.commons.event import Event
from tendrl.commons import objects
from tendrl.commons.message import Message
from tendrl.commons.objects.cluster_tendrl_context \
Contributor:

Don't import objects; use NS.tendrl.objects.ClusterTendrlContext

Member Author:

Done

    recursive=True
)
NS._int.client.delete(
    "/indexes/tags/detected_cluster_id_to_integration_id/%s" %
Contributor:

Please create an object called NS.tendrl.objects.Index and use it instead of raw read/write

Member Author:

There are different kinds of data stored under /indexes/tags, and some are based on node tags as well, e.g. /indexes/tags/tendrl/monitor, /indexes/tags/tendrl/node, etc. How many such objects would we create here?
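
For reference, one shape such a wrapper could take; a minimal, hypothetical sketch (the Index class name, and treating each tag key as a JSON list of node ids, are assumptions — NS here is the tendrl process namespace):

import json


class Index(object):
    # Hypothetical wrapper for an /indexes/tags/<tag> key that stores a
    # JSON list of node ids, hiding raw etcd read/write from callers

    def __init__(self, tag):
        self.key = "/indexes/tags/%s" % tag

    def load(self):
        # Read and decode the list of node ids stored at the tag key
        return json.loads(NS._int.client.read(self.key).value)

    def save(self, node_ids):
        # Encode and write the updated list back to the same key
        NS._int.client.write(self.key, json.dumps(node_ids))

    def delete(self):
        # Remove the tag key entirely
        NS._int.client.delete(self.key, recursive=True)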

node_ids.append(node_id)
nc = Node(node_id=node_id).load()
node_ips[node_id] = nc.fqdn
NS._int.client.delete(
Contributor:

Why do you need to delete /nodes/%node_id?

I thought we were simply unmanaging the cluster here.

Member Author:

So do you mean anything updated by tendrl-node-agent ideally shouldn't be deleted as part of un-manage? If so, later as part of forget-cluster we would need to clear all these details. Comments?

@@ -111,6 +127,36 @@ namespace.tendrl:
list: /clusters
value: clusters/$TendrlContext.integration_id
atoms:
StopNodeServices:
Contributor:

break this down into StopMonitoringServices and StopIntegrationServices

Member Author:

Done

- "tendrl/monitor"
atoms:
- tendrl.objects.Cluster.atoms.StopNodeServices
- tendrl.objects.Cluster.atoms.ClearClusterDetails
Contributor:

You can call this DeleteCluster

Member Author:

DeleteClusterDetails

atoms:
- tendrl.objects.Cluster.atoms.StopNodeServices
- tendrl.objects.Cluster.atoms.ClearClusterDetails
- tendrl.objects.Cluster.atoms.ClearMonitoringData
Contributor:

DeleteMonitoring

Member Author:

DeleteMonitoringDetails

type: Update
uuid: 333c3333-3c33-33c3-333c-c33cc3c5555c
help: Clear cluster details
ClearMonitoringData:
Contributor:

DeleteMonitoring

Member Author:

DeleteMonitoringDetails

type: Update
uuid: 333c3333-3c33-33c3-333c-c33cc3c4444c
help: Stop node services
ClearClusterDetails:
Contributor:

DeleteCluster

Member Author:

DeleteClusterDetails

@shtripat force-pushed the unmanage-cluster branch 2 times, most recently from 39030e2 to bb7ba56 on January 11, 2018, 07:36
@centos-ci (Collaborator):

Can one of the admins verify this patch?

@shtripat (Member Author) commented Feb 7, 2018

Tested the latest change along with Tendrl/monitoring-integration#317 and it works as expected. The cluster can be un-managed and managed back, and across un-manage and manage-back the integration-id for the cluster remains the same. Also, towards the end of un-manage there is a notification about the cluster state change to unmanaged and the location of the archived graphite data for the cluster.

@r0h4n I still need to work on unit tests for these changes. Will work out a separate PR for the same.

@centos-ci (Collaborator):

Can one of the admins verify this patch?

tendrl-bug-id: Tendrl#797
Signed-off-by: Shubhendu <shtripat@redhat.com>
@shtripat changed the title from "WIP - Added flows for un-manage cluster" to "Added flows for un-manage cluster" on Feb 9, 2018
@shtripat force-pushed the unmanage-cluster branch 2 times, most recently from 1af76c2 to e882668 on February 10, 2018, 14:15
tendrl-bug-id: Tendrl#797
Signed-off-by: Shubhendu <shtripat@redhat.com>
_cluster = NS.tendrl.objects.Cluster(
    integration_id=integration_id
).load()
if _cluster.status is not None and \
Contributor:

Check if cluster.is_managed == "yes"

Member Author:

OK, will add a check on cluster.is_managed as well.
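
A minimal sketch of the combined guard (assuming is_managed is a "yes"/"no" string as used elsewhere in Tendrl, and FlowExecutionFailedError from tendrl.commons.flows.exceptions):

from tendrl.commons.flows.exceptions import FlowExecutionFailedError

_cluster = NS.tendrl.objects.Cluster(
    integration_id=integration_id
).load()
# Refuse to unmanage a cluster that is not managed or already busy
if _cluster.is_managed != "yes" or \
        _cluster.status in ["importing", "syncing", "unmanaging"]:
    raise FlowExecutionFailedError(
        "Cluster (%s) is not managed or has another job running" %
        integration_id
    )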

).load()
if _cluster.status is not None and \
        _cluster.status != "" and \
        _cluster.status in \
Contributor:

I would suggest acquiring lock on "cluster" object

Member Author:

OK, will try that.
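
One way to do that with python-etcd's distributed lock; a sketch, assuming NS._int.client is the python-etcd client used elsewhere in this flow and "cluster_%s" is an illustrative lock name:

import etcd

lock = etcd.Lock(NS._int.client, "cluster_%s" % integration_id)
lock.acquire(blocking=True, lock_ttl=60)  # TTL so a dead holder expires
try:
    _cluster = NS.tendrl.objects.Cluster(
        integration_id=integration_id
    ).load()
    # ... check status and update the cluster object safely here ...
finally:
    lock.release()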

# the cluster nodes
try:
    gl_srvr_list = NS._int.client.read(
        "/indexes/tags/gluster/server"
Contributor:

This list includes all gluster nodes from all clusters. We need to change this tag to /indexes/tags/:integration_id/gluster/server

Member Author:

Actually I am not removing the tag here. I am just removing the nodes of the cluster being unmanaged from the list and writing the tag back with the remaining node ids.

Contributor:

I think we still need to improve this. read("/indexes/tags/gluster/server") gives you every gluster node irrespective of cluster id

Member Author:

If we remove the nodes of the cluster currently being unmanaged, it shouldn't matter for other clusters. Where do you see an issue with this? Also, what do you suggest if not this way?

Contributor:

I would suggest we let tendrl-gluster-integration clean up its own entry in /indexes/tags/gluster/server.

So when you shut down the tendrl-gluster-integration service during unmanage, you need to add this cleanup of tags.

The reason I don't want to load /indexes/tags/gluster/server is that it adds the overhead of going through each and every gluster node across all clusters.

Member Author:

So you mean before stopping tendrl-gluster-integration, we should invoke a flow to clean the tags. So in gluster-integration as well we would need to load the list /indexes/tags/gluster/server, remove its node-id from the list, and write it back.

So effectively each node loads this tag from etcd, removes its node-id from the list, and writes it back. In the existing case we load once on the server, all the cluster's node entries are removed from the list, and then it is written back. That is just one read and one write, whereas in the other (suggested) case it would be no. of nodes * (1 read + 1 write).

gl_srvr_list = NS._int.client.read(
    "/indexes/tags/gluster/server"
).value
gl_srvr_list = ast.literal_eval(gl_srvr_list)
Contributor:

Please don't use eval; if this is a JSON array, use the json library.

Member Author:

done
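
The change amounts to swapping the parser; a sketch, assuming the value stored at the key is a JSON-encoded list:

import json

gl_srvr_list = NS._int.client.read(
    "/indexes/tags/gluster/server"
).value
# json.loads accepts only valid JSON; ast.literal_eval would also
# evaluate arbitrary Python literals
gl_srvr_list = json.loads(gl_srvr_list)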

# Create jobs on nodes for stopping services
_job_id = str(uuid.uuid4())
params = {
    "Services[]": ["collectd"]
Contributor:

This needs to be documented, stating that Tendrl will have full control over the "collectd" service on the storage nodes, which includes shutting it down.

@julienlim @mbukatov @nthomas-redhat what's your take on this?

Contributor:

👍
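
For context, how such a stop-services job might be queued per node; a sketch assuming the NS.tendrl.objects.Job shape used in tendrl-commons (the exact payload fields here are illustrative, and node_ids is assumed to be collected earlier in the flow):

import uuid

# Queue one stop-services job per cluster node
for node_id in node_ids:
    _job_id = str(uuid.uuid4())
    payload = {
        "tags": ["tendrl/node_%s" % node_id],  # route to that node-agent
        "run": "tendrl.objects.Node.flows.StopServices",
        "status": "new",
        "parameters": {"Services[]": ["collectd"]},
        "type": "node"
    }
    NS.tendrl.objects.Job(
        job_id=_job_id,
        status="new",
        payload=payload
    ).save()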

time.sleep(5)
continue
_cluster.status = ""
_cluster.current_job['status'] = "done"
Contributor:

We have been using "finished" all over Tendrl; let's stick to that instead of "done".

Member Author:

done

@@ -157,7 +157,8 @@ def run(self):
_cluster = NS.tendrl.objects.Cluster(
integration_id=NS.tendrl_context.integration_id
).load()
_cluster.import_status = "done"
_cluster.status = ""
_cluster.current_job['status'] = "done"
Contributor:

We have been using "finished" all over Tendrl; let's stick to that instead of "done".

Member Author:

done

@codecov-io commented Feb 13, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@160b4d4).
The diff coverage is 88.21%.


@@            Coverage Diff            @@
##             master     #798   +/-   ##
=========================================
  Coverage          ?   77.47%           
=========================================
  Files             ?       91           
  Lines             ?     3481           
  Branches          ?      443           
=========================================
  Hits              ?     2697           
  Misses            ?      718           
  Partials          ?       66
Impacted Files Coverage Δ
tendrl/commons/utils/event_utils.py 0% <ø> (ø)
...cts/cluster/atoms/configure_monitoring/__init__.py 25% <0%> (ø)
...jects/cluster/atoms/is_cluster_managed/__init__.py 100% <100%> (ø)
...ts/cluster/atoms/set_cluster_unmanaged/__init__.py 100% <100%> (ø)
tendrl/commons/objects/cluster/__init__.py 100% <100%> (ø)
...luster/atoms/delete_monitoring_details/__init__.py 72.41% <72.41%> (ø)
tendrl/commons/flows/import_cluster/__init__.py 84.9% <75%> (ø)
...s/cluster/atoms/delete_cluster_details/__init__.py 84.09% <84.09%> (ø)
tendrl/commons/flows/unmanage_cluster/__init__.py 85% <85%> (ø)
...mmons/objects/node/flows/stop_services/__init__.py 86.95% <86.95%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 160b4d4...b168461.

@shtripat force-pushed the unmanage-cluster branch 3 times, most recently from 95e592c to 26304b6 on February 14, 2018, 04:41
@@ -25,8 +25,7 @@ def emit_event(resource, curr_value, msg, instance,
     plugin_instance=instance,
     message=msg,
     integration_id=integration_id or NS.tendrl_context.integration_id,
-    cluster_name=cluster_name or NS.tendrl_context.cluster_name,
-    sds_name=sds_name or NS.tendrl_context.sds_name,
+    cluster_name=cluster_name or NS.tendrl_context.cluster_name
Contributor:

Why have you removed sds_name?

Let's keep it with a default of None.

Member Author:

As confirmed by @GowthamShanmugam, it was not required; sds_name is internally derived using the integration_id.

    cluster_tendrl_context.cluster_id
)
etcd_keys_to_delete.append(
    "/indexes/tags/provisioner/%s" % integration_id
Contributor:

Let's clean this tag in the integration service.

Member Author:

Done

"/indexes/tags/provisioner/%s" % integration_id
)
etcd_keys_to_delete.append(
"/indexes/tags/tendrl/integration/%s" %
Contributor:

Let's clean this tag in the integration service.

Member Author:

done

@nthomas-redhat (Contributor) left a comment:

Please fix the pep8 issues and the most obvious Codacy issues.

env:
- TOXENV=pep8
- TOXENV=vulture
Contributor:

The vulture env is not set up. Could you please add https://github.com/Tendrl/node-agent/blob/master/tox.ini#L38 to tox.ini?

Member Author:

Done

in ["in_progress", "done", "failed"]:
if (_cluster.status is not None and
        _cluster.status != "" and
        _cluster.status in ["syncing", "importing", "unmanaging"]):
Contributor:

What is the status of a cluster which is already imported? I could see the definition as:

status:
  help: status of the cluster (importing, syncing, unmanaging, unknown)
  type: String

Do we need a state to indicate that no sync and no import is in progress? Also, what will be the value for a freshly detected cluster? And what will happen if someone triggers an import on an already imported cluster (maybe using the API)?

Member Author:

The status field maintains the status of any job running on the cluster. If import is successful, it is set to syncing by GI (gluster-integration). The state of the cluster is maintained under the global details object as healthy/unhealthy. For a freshly detected cluster the status would be empty, and if a user submits an import through the app for a cluster which already has an import running, the back-end would reject the job.

Contributor:

ok

_cluster = NS.tendrl.objects.Cluster(
    integration_id=integration_id
).load()
_cluster.status = ""
Contributor:

Isn't it better to define and set a proper state here?

Member Author:

The state of the cluster is already maintained in the global details object, and the UI uses the same.

Contributor:

ok

# Create jobs on nodes for stopping services
_job_id = str(uuid.uuid4())
params = {
    "Services[]": ["collectd"]
Contributor:

👍

run: tendrl.flows.UnmanageCluster
type: Update
uuid: 2f94a48a-05d7-408c-b400-e27827f4efed
version: 1
Contributor:

Do we need a post_run to make sure the cluster is un-managed and the services are stopped?

Member Author:

Maybe a post task to make sure the cluster is detected back again. Will check this part.

Contributor:

Please create an issue to track

Member Author:

We do sds_sync for a cluster only on node-agent restart, but during un-manage we are not (re)starting node-agent on the storage nodes, so removal of the tag indexes/tags/tendrl/integration/{int-id} should not be done. Also, we cannot depend on this to figure out that an un-managed cluster is now import-ready.

@r0h4n any other suggestions?

Member Author:

I have added a post task for the same; posted below.
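
For illustration, the general shape of such a post task (not the actual code from this PR, which is not shown here); a hypothetical atom that polls until the unmanaged cluster is detected again:

import time

from tendrl.commons import objects


class IsClusterImportReady(objects.BaseAtom):
    # Hypothetical post-run atom: wait until the unmanaged cluster is
    # detected again and hence ready for a fresh import
    def run(self):
        for _ in range(18):  # poll for up to ~3 minutes
            _cluster = NS.tendrl.objects.Cluster(
                integration_id=self.parameters[
                    "TendrlContext.integration_id"
                ]
            ).load()
            if _cluster.is_managed == "no":
                return True
            time.sleep(10)
        return False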

@shtripat force-pushed the unmanage-cluster branch 2 times, most recently from abc300f to e43c568 on February 14, 2018, 13:04
nthomas-redhat previously approved these changes on Feb 14, 2018
tendrl-bug-id: Tendrl#797
Signed-off-by: Shubhendu <shtripat@redhat.com>
@julienlim (Member):

Per discussion with @a2batic, there are a couple of UI changes related to Unmanage cluster. We will be posting updates to the design soon.

@r0h4n merged commit ee5d5fb into Tendrl:master on Feb 15, 2018
@r0h4n added this to the Archive milestone on Mar 22, 2018