New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node_agent: Atom to check service status + Alerts socket #67
Conversation
This is modified version of #19 raised in accordance with comments on the patch and discussions around provisioning, the code separation etc... |
@shtripat @nthomas-redhat @brainfunked @r0h4n Kindly review |
class AlertSocketHandler(SocketServer.BaseRequestHandler): | ||
def handle(self): | ||
try: | ||
data = self.request.recv(4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel 4096 is size of data to be read. We should define a constant for the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
try: | ||
data = self.request.recv(4096) | ||
alert = json.loads(data) | ||
alertKey = '/alerts/' + alert['Host'] + "/" + alert['Plugin'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use format string like
'/alerts/%s/%s' % (alert['Host'], alert['Plugin'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
class CheckServiceStatus(object): | ||
def run(self, parameters): | ||
service_name = parameters.get("service_name") | ||
response = os.system("service %s status" % service_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use systemctl command rather
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Please refer to the github issue being worked upon for this PR |
alertKey = '/alerts/%s/%s' % (alert['Host'], alert['Plugin']) | ||
if 'PluginInstance' in alert: | ||
alertKey = "%s/%s" % (alertKey, alert['PluginInstance']) | ||
etcd_kwargs = {'port': int(config.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
etcd_kwargs = {
'port': int(config.get("common", "etcd_port")),
'host': config.get("common", "etcd_connection")
}
looks more readable to me :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
try: | ||
self.server.serve_forever() | ||
except Exception as e: | ||
print 'Exception caught' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why print and not log? Also include actual exception in log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
def test_alertsSuccessfulValidate(self): | ||
expected_alert = { | ||
'Host': '127.0.0.1', | ||
'Plugin': 'cpu', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this "Plugin" value expected of al alerts that collectd writes to the node agent socket?
Also, ceph_integration and gluster_integration will be writing alerts to the node agent socket,
You willl need to have a generic alert json schema using which any process can write alerts to the node agent socket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 'Plugin' is expected to be present in any collectd dispatched threshold alert. But yes, as you are saying I'll have a generic structure for this because gluster and ceph integrations cannot be expected to send alerts in the collectd format I have a schema proposal in Tendrl/specifications#18 could you please review if the structure there is good enough
@@ -0,0 +1,12 @@ | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the Tendrl definitions for the node object and these atoms under it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self, server_address, RequestHandlerClass) | ||
|
||
|
||
class AlertsManager(Process): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason to choose Process instead of greenlet eg: https://github.com/Tendrl/node_agent/blob/master/tendrl/node_agent/manager/manager.py#L36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call @ line 80 server.serve_forever() will block all other threads unless its deployed as process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to drop using "SocketServer.ThreadingMixIn, SocketServer.TCPServer" and use http://www.gevent.org/gevent.server.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
) | ||
port = config.get( | ||
"node_agent", | ||
"collectd_socket_port" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should renamed as "tendrl_alerts_socket_port" since all tendrl components and collectd will be writing alerts to this socket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
try: | ||
hostname = config.get( | ||
"node_agent", | ||
"collectd_socket_addr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should renamed as "tendrl_alerts_socket_addr" since all tendrl components and collectd will be writing alerts to this socket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
self.request.close() | ||
|
||
|
||
class AlertsServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use gevent socket functionality eg: http://www.gevent.org/gevent.socket.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
EtcdRPC().client.write(key, alert) | ||
|
||
|
||
class AlertSocketHandler(SocketServer.BaseRequestHandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
self, server_address, RequestHandlerClass) | ||
|
||
|
||
class AlertsManager(Process): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to drop using "SocketServer.ThreadingMixIn, SocketServer.TCPServer" and use http://www.gevent.org/gevent.server.html
70b1de3
to
e6f9d9d
Compare
@shtripat @nthomas-redhat @r0h4n Please review |
@@ -25,7 +25,8 @@ def __init__(self, flow_name, job, atoms, pre_run, post_run, uuid): | |||
|
|||
self.etcd_client = etcd.Client(**etcd_kwargs) | |||
self.node_id = manager_utils.get_local_node_context() | |||
self.cluster_id = self.parameters['Tendrl_context.cluster_id'] | |||
if 'Tendrl_context.cluster_id' in self.parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel you can use self.parameters.get("Tendrl_context.cluster_id"). This would return None if key not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
- Service.name | ||
name: "check whether the service is running" | ||
run: tendrl.node_agent.objects.service.atoms.check_service_status.CheckServiceStatus | ||
type: Create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not a Create type atom
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt that while writing this but I found this used for check_node_up which is a similar thing(basically a check atom)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atom names can be re-verified in future, type is not really used much. we can move on
@@ -7,6 +8,6 @@ | |||
class Cmd(object): | |||
def run(self, parameters): | |||
cmd = parameters.get("Node.cmd_str") | |||
cmd = ["nohup"] + cmd.split(" ") | |||
cmd = ["nohup"] + shlex.split(cmd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of Tendrl/commons#79 command executor comes to common module. We might need refactor later to use the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now
|
||
def join(self): | ||
LOG.info("%s joining" % self.__class__.__name__) | ||
self._user_request_thread.join() | ||
self._discovery_thread.join() | ||
self.persister.join() | ||
|
||
def append_definitions(self): | ||
defs_path = 'tendrl_definitions_node_agent/data' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r0h4n I'll probably change this once the framework changes related to this are in because otherwise, I won't be able to dev-test this code
The spec issue for this PR is: Tendrl/specifications#40 |
TODO(anmolbabu): Add UTs |
data = sock.recv(RECEIVE_DATA_SIZE) | ||
alert_utils = AlertUtils() | ||
alert = alert_utils.validate_alert_json(data) | ||
alert['significance'] = 'HIGH' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hardcoded as HIGH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is not configurable.
And as per the plan in https://github.com/Tendrl/specifications/blob/master/specs/pluggable_alert_delivery.adoc --
"Each alert will carry additional attribute that carries its significance which will be added by respective handler. The supported significance levels can be 'HIGH', 'MEDIUM' and 'LOW'.This logic will be hardcoded in the codebase that handles the alerts."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
try: | ||
self.alert = Alert.to_obj(alert_json_string) | ||
self.alert.alert_id = str(uuid.uuid4()) | ||
self.alert.significance = 'HIGH' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hardcoded as HIGH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be overriden by custom handlers. So in effect the hard coding done here(base handler) is only like a default value which will be overriden by the respective custom handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anmolbabu fix travis unit test failures, rest LGTM
@@ -0,0 +1,70 @@ | |||
from _socket import error as _socket_error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm.. what is _socket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in https://github.com/gevent/gevent/blob/master/src/gevent/server.py , gevent uses _socket which I found somewhere can throw the exception _socket._socket_error. So line 65 handles that exception. I haven't myself observed that happening though..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix travis unit test failures, rest LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But we need to rebase and use latest commons module. There are changes related to config loading.
tendrl-bug-id: Tendrl#70 tendrl-spec: Tendrl/specifications#18 Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
tendrl-bug-id: Tendrl#70 Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
tendrl-bug-id: Tendrl#70 Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
) as ex: | ||
LOG.error('Failed to handle data on alert socket.Error %s' % ex) | ||
|
||
def __init__(self, alerts_socket_addr, alerts_socket_port, etcd_client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supposed to be a UNIX socket, not a TCP socket. The canonical path for such a socket would be under /run/tendrl/message.sock. This would require root privileges, which would be available when deployed via RPM. However, for scenarios where the node agent is being run directly from the git checkout or a non-global installation directory, the socket should be created in the run
directory, at the top of the installation tree.
super(AlertsManager, self).__init__() | ||
self.hostname = alerts_socket_addr | ||
self.port = alerts_socket_port | ||
global db_client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on here?
Abandoning as an alternate strategy has been adopted... |
This patch adds the following:
tendrl-bug-id: #70
Signed-off-by: anmolbabu anmolbudugutta@gmail.com