Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node_agent: Atom to check service status + Alerts socket #67

Closed
wants to merge 4 commits into from

Conversation

anmolbabu
Copy link
Contributor

@anmolbabu anmolbabu commented Nov 22, 2016

This patch adds the following:

  1. check_service_status.py: Atom to check service status
  2. alert_socket.py: Opens a socket to send out alerts to etcd

tendrl-bug-id: #70
Signed-off-by: anmolbabu anmolbudugutta@gmail.com

@anmolbabu
Copy link
Contributor Author

This is modified version of #19 raised in accordance with comments on the patch and discussions around provisioning, the code separation etc...
The previous pr got closed accidentally by a force rebase of my node_agent repo.

@anmolbabu
Copy link
Contributor Author

@shtripat @nthomas-redhat @brainfunked @r0h4n Kindly review

class AlertSocketHandler(SocketServer.BaseRequestHandler):
def handle(self):
try:
data = self.request.recv(4096)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel 4096 is size of data to be read. We should define a constant for the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

try:
data = self.request.recv(4096)
alert = json.loads(data)
alertKey = '/alerts/' + alert['Host'] + "/" + alert['Plugin']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use format string like

'/alerts/%s/%s' % (alert['Host'], alert['Plugin'])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

class CheckServiceStatus(object):
def run(self, parameters):
service_name = parameters.get("service_name")
response = os.system("service %s status" % service_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use systemctl command rather

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@r0h4n
Copy link
Contributor

r0h4n commented Nov 22, 2016

Please refer to the github issue being worked upon for this PR

alertKey = '/alerts/%s/%s' % (alert['Host'], alert['Plugin'])
if 'PluginInstance' in alert:
alertKey = "%s/%s" % (alertKey, alert['PluginInstance'])
etcd_kwargs = {'port': int(config.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etcd_kwargs = {
    'port': int(config.get("common", "etcd_port")),
    'host': config.get("common", "etcd_connection")
}

looks more readable to me :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

try:
self.server.serve_forever()
except Exception as e:
print 'Exception caught'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why print and not log? Also include actual exception in log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def test_alertsSuccessfulValidate(self):
expected_alert = {
'Host': '127.0.0.1',
'Plugin': 'cpu',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this "Plugin" value expected of al alerts that collectd writes to the node agent socket?

Also, ceph_integration and gluster_integration will be writing alerts to the node agent socket,

You willl need to have a generic alert json schema using which any process can write alerts to the node agent socket

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 'Plugin' is expected to be present in any collectd dispatched threshold alert. But yes, as you are saying I'll have a generic structure for this because gluster and ceph integrations cannot be expected to send alerts in the collectd format I have a schema proposal in Tendrl/specifications#18 could you please review if the structure there is good enough

@@ -0,0 +1,12 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the Tendrl definitions for the node object and these atoms under it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self, server_address, RequestHandlerClass)


class AlertsManager(Process):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason to choose Process instead of greenlet eg: https://github.com/Tendrl/node_agent/blob/master/tendrl/node_agent/manager/manager.py#L36

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call @ line 80 server.serve_forever() will block all other threads unless its deployed as process

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to drop using "SocketServer.ThreadingMixIn, SocketServer.TCPServer" and use http://www.gevent.org/gevent.server.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

)
port = config.get(
"node_agent",
"collectd_socket_port"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should renamed as "tendrl_alerts_socket_port" since all tendrl components and collectd will be writing alerts to this socket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

try:
hostname = config.get(
"node_agent",
"collectd_socket_addr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should renamed as "tendrl_alerts_socket_addr" since all tendrl components and collectd will be writing alerts to this socket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

self.request.close()


class AlertsServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use gevent socket functionality eg: http://www.gevent.org/gevent.socket.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

EtcdRPC().client.write(key, alert)


class AlertSocketHandler(SocketServer.BaseRequestHandler):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

self, server_address, RequestHandlerClass)


class AlertsManager(Process):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to drop using "SocketServer.ThreadingMixIn, SocketServer.TCPServer" and use http://www.gevent.org/gevent.server.html

@anmolbabu
Copy link
Contributor Author

@shtripat @nthomas-redhat @r0h4n Please review

@@ -25,7 +25,8 @@ def __init__(self, flow_name, job, atoms, pre_run, post_run, uuid):

self.etcd_client = etcd.Client(**etcd_kwargs)
self.node_id = manager_utils.get_local_node_context()
self.cluster_id = self.parameters['Tendrl_context.cluster_id']
if 'Tendrl_context.cluster_id' in self.parameters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel you can use self.parameters.get("Tendrl_context.cluster_id"). This would return None if key not found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

- Service.name
name: "check whether the service is running"
run: tendrl.node_agent.objects.service.atoms.check_service_status.CheckServiceStatus
type: Create
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not a Create type atom

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt that while writing this but I found this used for check_node_up which is a similar thing(basically a check atom)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Atom names can be re-verified in future, type is not really used much. we can move on

@@ -7,6 +8,6 @@
class Cmd(object):
def run(self, parameters):
cmd = parameters.get("Node.cmd_str")
cmd = ["nohup"] + cmd.split(" ")
cmd = ["nohup"] + shlex.split(cmd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of Tendrl/commons#79 command executor comes to common module. We might need refactor later to use the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

@shtripat shtripat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now


def join(self):
LOG.info("%s joining" % self.__class__.__name__)
self._user_request_thread.join()
self._discovery_thread.join()
self.persister.join()

def append_definitions(self):
defs_path = 'tendrl_definitions_node_agent/data'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r0h4n I'll probably change this once the framework changes related to this are in because otherwise, I won't be able to dev-test this code

@anmolbabu
Copy link
Contributor Author

The spec issue for this PR is: Tendrl/specifications#40

@anmolbabu
Copy link
Contributor Author

TODO(anmolbabu): Add UTs

data = sock.recv(RECEIVE_DATA_SIZE)
alert_utils = AlertUtils()
alert = alert_utils.validate_alert_json(data)
alert['significance'] = 'HIGH'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded as HIGH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is not configurable.

And as per the plan in https://github.com/Tendrl/specifications/blob/master/specs/pluggable_alert_delivery.adoc --
"Each alert will carry additional attribute that carries its significance which will be added by respective handler. The supported significance levels can be 'HIGH', 'MEDIUM' and 'LOW'.This logic will be hardcoded in the codebase that handles the alerts."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

try:
self.alert = Alert.to_obj(alert_json_string)
self.alert.alert_id = str(uuid.uuid4())
self.alert.significance = 'HIGH'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded as HIGH

Copy link
Contributor Author

@anmolbabu anmolbabu Jan 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be overriden by custom handlers. So in effect the hard coding done here(base handler) is only like a default value which will be overriden by the respective custom handler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

r0h4n
r0h4n previously approved these changes Jan 11, 2017
Copy link
Contributor

@r0h4n r0h4n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmolbabu fix travis unit test failures, rest LGTM

@@ -0,0 +1,70 @@
from _socket import error as _socket_error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm.. what is _socket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in https://github.com/gevent/gevent/blob/master/src/gevent/server.py , gevent uses _socket which I found somewhere can throw the exception _socket._socket_error. So line 65 handles that exception. I haven't myself observed that happening though..

Copy link
Contributor

@r0h4n r0h4n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix travis unit test failures, rest LGTM

Copy link
Contributor

@nthomas-redhat nthomas-redhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@shtripat shtripat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But we need to rebase and use latest commons module. There are changes related to config loading.

tendrl-bug-id: Tendrl#70
tendrl-spec: Tendrl/specifications#18
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
tendrl-bug-id: Tendrl#70
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
tendrl-bug-id: Tendrl#70
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
tendrl-bug-id: Tendrl#70
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
@coveralls
Copy link

Coverage Status

Coverage decreased (-2.4%) to 52.666% when pulling 38136d7 on anmolbabu:monitoring into 94eb6ce on Tendrl:master.

) as ex:
LOG.error('Failed to handle data on alert socket.Error %s' % ex)

def __init__(self, alerts_socket_addr, alerts_socket_port, etcd_client):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be a UNIX socket, not a TCP socket. The canonical path for such a socket would be under /run/tendrl/message.sock. This would require root privileges, which would be available when deployed via RPM. However, for scenarios where the node agent is being run directly from the git checkout or a non-global installation directory, the socket should be created in the run directory, at the top of the installation tree.

super(AlertsManager, self).__init__()
self.hostname = alerts_socket_addr
self.port = alerts_socket_port
global db_client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on here?

@anmolbabu
Copy link
Contributor Author

Abandoning as an alternate strategy has been adopted...

@anmolbabu anmolbabu closed this Jan 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants