Centralize logging system for WM #11929

Open
vkuznet opened this issue Mar 12, 2024 · 2 comments
Comments


vkuznet commented Mar 12, 2024

Impact of the new feature
A uniform, centralized logging system would bring many benefits to data operations, debugging, and monitoring of the various WM services, components, workflows, etc.

Is your feature request related to a problem? Please describe.
At the moment we have decentralized, non-overlapping logging solutions, such as LogDB, file logs within WMAgents, component logs for WM services, etc., which are tedious to navigate and require different access patterns. From the end-user point of view, e.g. data-ops, it is very cumbersome to navigate them and find specific information about different topics.

Describe the solution you'd like
We may adopt the CMS Monitoring system, based on CERN brokers (AMQ) and the MONIT backend, which uses Elastic/OpenSearch and HDFS for storing semi-structured (JSON) documents. Each log entry can be represented as a JSON document and injected into a central service, similar to WMArchive, which can proxy it to the MONIT backend. Here is a detailed plan for such a system:

  1. Logs can be represented as JSON documents, i.e. we may adopt a uniform logging format, e.g. {"service": SERVICE_NAME, "message": MESSAGE, "code": CODE, "timestamp": UNIX_TIME, "application": APPLICATION}
  2. Request MONIT topics, e.g. WMLogs, within the MONIT AMQ brokers and two backend streams: an ElasticSearch index and an HDFS area. The former can use a short retention policy, e.g. 1 month, while the latter can hold logs for up to 13 months.
  3. Set up the CMSAMQProxy service (a generalization of WMArchive) on CMSWEB to act as a proxy between clients and the MONIT infrastructure; we already have its deployment for the CMSWEB k8s infrastructure, along with Docker and k8s manifests.
  4. Logs can be injected from distributed on- and off-site locations in a similar fashion to how we inject monitoring information, e.g. via WMArchive. We can either use the WMArchive code base to inject logs (JSON docs) or develop a stand-alone Logger class to use instead of the Python logger. In the latter case, it could log to a file-based sink, to the HTTP CMSWEB AMQ proxy, or to both (see the sketch after this list).
  5. Gradually patch the various WM systems and services to inject the new logs into MONIT.
  6. Logs in Elastic/OpenSearch can be viewed in a browser, and they can be queried and sliced using the Elasticsearch query language; we may provide predefined queries for different use cases, e.g. how to trace the evolution of a workflow or how to find a specific error code.
  7. Logs can be used for additional monitoring purposes, e.g. watching workflow progress.
  8. It is possible to have subscription channels where a person or an application can subscribe to MONIT (Kafka) topics, similar to what we have with Rucio traces.
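
To make items 1 and 4 concrete, here is a minimal sketch of what a stand-alone logger could look like, built on the standard Python logging module. The MONITHandler name, the endpoint URL, and the field defaults are illustrative assumptions, not an existing WMCore or CMSAMQProxy API:

```python
import json
import time
import logging
import urllib.request


class MONITHandler(logging.Handler):
    """
    Hypothetical logging handler: formats records according to the proposed
    JSON schema and writes them to a file-based sink and/or POSTs them to an
    HTTP AMQ proxy endpoint (e.g. CMSAMQProxy on CMSWEB).
    """
    def __init__(self, service, application, endpoint=None, fileSink=None):
        super().__init__()
        self.service = service
        self.application = application
        self.endpoint = endpoint   # assumed CMSAMQProxy URL, may be None
        self.fileSink = fileSink   # optional local file-based sink

    def emit(self, record):
        # build the uniform JSON log document proposed in item 1
        doc = {"service": self.service,
               "message": record.getMessage(),
               "code": getattr(record, "code", 0),
               "timestamp": int(time.time()),
               "application": self.application}
        data = json.dumps(doc).encode("utf-8")
        if self.fileSink:
            with open(self.fileSink, "a") as fobj:
                fobj.write(data.decode("utf-8") + "\n")
        if self.endpoint:
            req = urllib.request.Request(
                self.endpoint, data=data,
                headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=5)
            except Exception:
                self.handleError(record)


# usage: attach the handler to a standard Python logger
logger = logging.getLogger("reqmgr2")
logger.addHandler(MONITHandler(service="reqmgr2", application="WMCore",
                               fileSink="/tmp/wmlogs.json"))
logger.error("workflow failed", extra={"code": 8021})
```

Since it subclasses logging.Handler, existing services could adopt it by adding one handler to their current logger configuration, without changing the logging calls themselves.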

Describe alternatives you've considered
Do nothing and keep using the existing systems.

Additional context
To implement this solution we may require the following:

  • perform an evaluation of the total log volume (see the sketch after this list)
    • log volume per individual service/application
    • rate of injection
  • discuss and establish the log structure (see the proposal above)
  • set up CMSAMQProxy and perform integration with MONIT
    • identify the number of CMSAMQProxy replicas based on the evaluated log volume and injection rates
    • set up the MONIT topic and retention policies
    • perform integration between the WM applications/services, CMSAMQProxy, and MONIT
  • patch existing services to use either the new injectors or a class which will handle log injection
  • manage authentication; the current setup is based on X509 if we deploy CMSAMQProxy on CMSWEB, and token-based authentication should be adopted later
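
For the log-volume evaluation in the first bullet, a rough estimate could be bootstrapped from the existing file-based component logs. The glob pattern and the time window below are assumptions and would need to match a real agent deployment:

```python
import os
import glob

# assumed location of WMAgent component logs and the period they cover
LOG_GLOB = "/data/srv/wmagent/current/install/*/ComponentLog*"
WINDOW_DAYS = 7

totalBytes = 0
totalLines = 0
for fname in glob.glob(LOG_GLOB):
    totalBytes += os.path.getsize(fname)
    with open(fname, errors="ignore") as fobj:
        totalLines += sum(1 for _ in fobj)

print("total volume: %.1f MB" % (totalBytes / 1024.0 / 1024.0))
print("total log records: %d" % totalLines)
print("average injection rate: %.1f docs/sec"
      % (totalLines / float(WINDOW_DAYS * 24 * 3600)))
```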

The migration can be done gradually without disruption of the existing logging solutions; we may even start with an individual service/application and verify the whole workflow. If successful, we can gradually add new services/applications.

Some training may be required for end-users to get used to Elastic/OpenSearch queries. An additional CLI interface can also be developed via proxy access to MONIT (see the sketch below).
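
Such a CLI could be a thin wrapper around an OpenSearch-style query sent through the proxy. The proxy URL, index layout, and field names below follow the schema proposed above and are assumptions, not an existing service:

```python
#!/usr/bin/env python
"""Illustrative CLI sketch for querying WM logs through a (hypothetical) proxy
to the MONIT Elastic/OpenSearch backend."""
import json
import argparse
import urllib.request


def queryLogs(url, service, code):
    "Build a simple boolean query and POST it to the assumed proxy endpoint"
    query = {"query": {"bool": {"must": [
                 {"match": {"service": service}},
                 {"match": {"code": code}}]}},
             "sort": [{"timestamp": {"order": "desc"}}],
             "size": 10}
    req = urllib.request.Request(url, data=json.dumps(query).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="query WM logs in MONIT")
    parser.add_argument("--url", default="https://cmsweb.cern.ch/wmlogs/query",
                        help="hypothetical proxy endpoint, for illustration only")
    parser.add_argument("--service", default="reqmgr2")
    parser.add_argument("--code", default="8021")
    args = parser.parse_args()
    print(json.dumps(queryLogs(args.url, args.service, args.code), indent=2))
```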

todor-ivanov (Contributor) commented

@vkuznet @amaltaro I want to make one really important remark, which we should always keep in mind while working on this.

  • Many of these logs contain sensitive information, and we must pay attention and invest additional effort to redact this information before uploading the contents to a public repository!


vkuznet commented Mar 13, 2024

@todor-ivanov, the MONIT infrastructure has "public" (open to the CERN network) and purely "private" (open to a restricted list) channels. The former is where we store our monitoring metrics/records, while the latter is where our Kubernetes log entries go, and it has restricted access. We don't need to hide anything, as the content will be restricted to the CERN network, i.e. it does not have public internet access; it will be visible only to CMS and, if we want, only to specific e-groups or users.
