Documentation on monitoring #235

Open
johrstrom opened this issue Feb 19, 2020 · 2 comments
Comments

@johrstrom
Contributor

johrstrom commented Feb 19, 2020

We have no documentation on how to monitor OOD, not even references to our own Ganglia or Prometheus exporter, or to basic Apache monitoring.

@treydock
Contributor

Once OSC/ondemand#400 is merged, it will add support for Grafana, and we can then document both Grafana and Ganglia, but that only covers integrating with monitoring.

The existing OnDemand-specific monitoring is mostly around PUNs and Apache connections. The rest isn't specific to OnDemand; it is just checking that filesystems aren't full, ports are open, Apache responds to requests, and certificates aren't expired. With Prometheus we can also more easily monitor memory and CPU levels to keep an eye out for spikes on the OnDemand host. I suppose we can cover what we provide as well as ideas of what else to monitor.
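As a rough illustration of the generic checks listed above (filesystem usage, open ports, certificate expiry), here is a minimal Python sketch using only the standard library. It is not part of OnDemand or of any existing exporter; the function names are hypothetical, and the host, port, and thresholds a site would use are assumptions to be filled in locally.

```python
import shutil
import socket
import ssl
import time
from typing import Optional


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' string of the certificate served at host:port.

    Uses the default trust store, so an expired or untrusted certificate
    raises ssl.SSLError instead of returning a value.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_cert_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a 'notAfter' string into the number of days left from now."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

A site could call these from a cron job or wrap them in whatever exporter it already runs, for example alerting when `disk_usage_fraction("/var")` exceeds a locally chosen threshold or when `days_until_cert_expiry(...)` drops below a few weeks.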

@ericfranz
Contributor

I'd like to see our approach to this redesigned first. The idea is mentioned to a degree in #235, but we would change the app so the AJAX request for the job details returns HTML for the job details pane instead of JSON. Once HTML rendering is done server side, we could take an approach similar to view template partials: a default view that can be overridden with a custom one in /etc/ood/config. That way we could embed site-specific logic, for example, "if this job is on pitzer and the job's native attribute has something about gpus, let's display graphs for GPUs", which is something the current abstraction in the app doesn't support.

At that point we could move the bulk of our custom logic to these custom views, removing this functionality from ActiveJobs. That is when I would like to document this feature.
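To make the override idea concrete, here is a language-neutral sketch written in Python; the app itself is not written in Python, and nothing here is existing ActiveJobs code. The partial filename, the directory layout under /etc/ood/config, and the shape of the job's native attribute are all assumptions for illustration only.

```python
from pathlib import Path
from typing import Any, Mapping


def resolve_partial(name: str, default_dir: str, override_dir: str) -> Path:
    """Pick the site's custom partial if it exists, else the app's default.

    override_dir stands in for a directory under /etc/ood/config; the exact
    location is hypothetical.
    """
    override = Path(override_dir) / name
    if override.is_file():
        return override
    return Path(default_dir) / name


def wants_gpu_graphs(cluster: str, native: Mapping[str, Any]) -> bool:
    """Example of site-specific logic a custom view could carry.

    Mirrors the rule quoted above: the job is on pitzer and its native
    attribute mentions GPUs. The structure of native is assumed to be a
    mapping, so the check is a plain substring search over its keys and values.
    """
    if cluster != "pitzer":
        return False
    text = " ".join(f"{key} {value}" for key, value in native.items())
    return "gpu" in text.lower()
```

With this shape, the default view stays generic and a site drops a single file into its config directory to change what the job details pane renders.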

We could add this work to the 1.8 release.


3 participants