
WMArchive Grafana Monitoring Down / No Data #11960

Open
hassan11196 opened this issue Apr 8, 2024 · 18 comments · May be fixed by #11967

@hassan11196

Impact of the bug
WMArchive is an essential dashboard used by P&R for day-to-day operations to investigate failing workflows and site issues. Having this monitoring down severely affects P&R operations.

Describe the bug
The WMArchive Grafana dashboard has been missing data since 27th March, as reported by Jen on Mattermost.

As I understand it, Failed Workflow Job Reports (FWJRs) are ingested into the /wmarchive/data/ endpoint by the ArchiveDataPoller.
Some FWJRs are too large to be digested by the system, and the CMSWeb cluster returns an HTTP error code 413 Request Entity Too Large.

The following ArchiveDataPoller log was pasted by @todor-ivanov in the Mattermost thread.

2024-03-29 17:32:21,876:140698532333312:INFO:ArchiveDataPoller:Found 1000 not archived documents from FWRJ db to upload to WMArchive.
2024-03-29 17:32:22,174:140698532333312:ERROR:ArchiveDataPoller:Error occurred, will retry later:
2024-03-29 17:32:22,174:140698532333312:ERROR:ArchiveDataPoller:url=https://cmsweb.cern.ch:8443/wmarchive/data/, code=413, reason=Request Entity Too Large, headers={'Date': 'Fri, 29 Mar 2024 16:32:22 GMT', 'Server': 'Apache', 'Content-Type': 'text/html', 'Content-Length': '176', 'CMS-Server-Time': 'D=32921 t=1711729942140346', 'Connection': 'close'}, result=b'<html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body>\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
2024-03-29 17:32:22,174:140698532333312:ERROR:ArchiveDataPoller:Trace back: 
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMComponent/ArchiveDataReporter/ArchiveDataPoller.py", line 57, in algorithm
    response = self.wmarchiver.archiveData(archiveDocs)
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMCore/Services/WMArchive/WMArchive.py", line 33, in archiveData
    return self["requests"].post('', {'data': data})[0]['result']
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 154, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/Utils/PortForward.py", line 66, in portMangle
    return callFunc(callObj, newUrl, *args, **kwargs)
  File "/data/srv/wmagent/v2.3.0.2/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.3.0.2/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 353, in request
    raise exc
http.client.HTTPException: url=https://cmsweb.cern.ch:8443/wmarchive/data/, code=413, reason=Request Entity Too Large, headers={'Date': 'Fri, 29 Mar 2024 16:32:22 GMT', 'Server': 'Apache', 'Content-Type': 'text/html', 'Content-Length': '176', 'CMS-Server-Time': 'D=32921 t=1711729942140346', 'Connection': 'close'}, result=b'<html>\r\n<head><title>413 Request Entity Too Large</title></head>\r\n<body>\r\n<center><h1>413 Request Entity Too Large</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

2024-03-29 17:32:22,174:140698532333312:INFO:BaseWorkerThread:ArchiveDataPoller took 0.734 secs to execute

How to reproduce it
Steps to reproduce the behavior:
Submit a "large" FWJR to the /wmarchive/data/ endpoint.
I am not sure what payload size actually qualifies as "large" and makes the endpoint return the 413 Request Entity Too Large error.
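
A hedged sketch of such a submission is below (the payload is arbitrary filler, just large enough to exceed the frontend limit; authentication and the WMCore Requests layer are omitted):

import json
import urllib.error
import urllib.request

url = "https://cmsweb.cern.ch:8443/wmarchive/data/"
# arbitrary ~20MB filler, just large enough to trigger the frontend limit
payload = json.dumps({"data": [{"padding": "x" * (20 * 1024 * 1024)}]}).encode("utf-8")
req = urllib.request.Request(url, data=payload,
                             headers={"Content-Type": "application/json"},
                             method="POST")
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as exc:
    print(exc.code, exc.reason)  # expected: 413 Request Entity Too Large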

Expected behavior
The /wmarchive/data/ endpoint should be able to handle "large" FWJRs, and the WMArchive dashboard should work as usual.

@vkuznet
Contributor

vkuznet commented Apr 8, 2024

Here is my reply from the email thread with Todor and the CMS Monitoring team:

Regarding the WMArchive issue, if you look closely at the reported error you will see that it comes from nginx, i.e.

<head><title>413 Request Entity Too Large</title></head>\r\n<body>\r\n<center><h1>413 Request Entity Too
Large</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

So it is an issue on the nginx k8s frontend, which is supposed to pass the request to the backend server (in this case WMArchive). Therefore, the actual issue is on the CMSWEB side and not on WMCore or WMArchive. In other words, our k8s frontend rejects the request based on its size. I suggest that you check with Imran/Aroosha what the current nginx limit is on k8s; most likely it needs to be increased. Please note that from the reported error we can't see the actual size of the request, since it is payload data sent over the HTTP POST method.

A long-term solution which should be put in place is compression, which can be implemented on the WMCore side to first compress the payload (e.g. with gzip) and then send it over to WMArchive. But this will require additional effort on several fronts:

  • on WMCore, to compress the payload and set up the proper MIME type for such a request
  • on WMArchive, to decompress it and properly send it to MONIT
  • on the MONIT side, to consume the compressed payload or handle it in plain form but with a large payload size (please note that the MONIT infrastructure also relies on k8s and may be subject to the same nginx limitation as in our clusters).

Finally, another approach is simply to review what is sent from WMCore and limit it based on the current nginx settings.
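
As a rough illustration of the compression step (a minimal sketch only, not actual WMCore or WMArchive code; the header choices are assumptions):

import gzip
import json

def compress_payload(docs):
    """Gzip-compress a WMArchive-style payload and return (body, headers) for an HTTP POST."""
    raw = json.dumps({"data": docs}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",  # the receiving service would have to decompress before forwarding to MONIT
    }
    return gzip.compress(raw), headers

body, headers = compress_payload([{"task": "example", "steps": []}])
print(len(body), headers)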

@amaltaro
Contributor

amaltaro commented Apr 8, 2024

I would suggest dumping at least 1 or 2 documents and inspecting their construction and where the heavy data is. We might want to refactor it in WMAgent, in case the information isn't relevant (a free-text error message, for instance, might be a good candidate to be dropped in WMArchive).
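
A quick way to do that inspection could be a small script like the sketch below, which reports which top-level keys dominate a dumped document's serialized size (the file name is hypothetical):

import json

def field_sizes(path):
    """Print the serialized size of each top-level field of a dumped FWJR document."""
    with open(path) as fd:
        doc = json.load(fd)
    sizes = {key: len(json.dumps(value)) for key, value in doc.items()}
    total = sum(sizes.values()) or 1
    for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{key:40s} {size:>12d} bytes ({100.0 * size / total:5.1f}%)")

field_sizes("fwjr_document.json")  # hypothetical local dump of one document from CouchDB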

@nikodemas
Member

Some of the logs can also be seen on the nginx controller in Kubernetes:

~ $ k logs pod/cern-magnum-ingress-nginx-controller-5b8j8 -n kube-system | grep -i large
2024/04/11 07:13:05 [error] 36#36: *2223712 client intended to send too large body: 15734904 bytes, client: 188.184.75.82, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"
2024/04/11 07:55:35 [error] 36#36: *2317077 client intended to send too large body: 23517605 bytes, client: 188.184.73.32, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"
2024/04/11 07:59:51 [error] 34#34: *2328673 client intended to send too large body: 17649066 bytes, client: 188.184.75.82, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"
2024/04/11 08:03:52 [error] 39#39: *2341664 client intended to send too large body: 15734904 bytes, client: 188.184.75.82, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"
2024/04/11 08:08:56 [error] 39#39: *2359348 client intended to send too large body: 15734904 bytes, client: 188.184.75.82, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"
2024/04/11 08:10:20 [error] 36#36: *2364292 client intended to send too large body: 19739784 bytes, client: 188.184.75.82, server: cmsweb-k8s-prodsrv2.cern.ch, request: "POST /wmarchive/data/ HTTP/1.1", host: "cmsweb-k8s-prodsrv2.cern.ch"

Although @arooshap has just raised the nginx limit from 1MB to 8MB, these requests are around 20MB, so docs like these are still failing.

@vkuznet
Contributor

vkuznet commented Apr 11, 2024

Raising the threshold may be a temporary solution, since we never really know the size of the payload. I would rather see how it can be constrained in the WM system, which by definition knows the size when it constructs the HTTP request. Now that we know the nginx threshold, the WM system should respect it.

@vkuznet vkuznet linked a pull request Apr 15, 2024 that will close this issue
@vkuznet vkuznet self-assigned this Apr 15, 2024
@vkuznet
Contributor

vkuznet commented Apr 15, 2024

I posted a first draft of the fix in #11967, which basically checks the newly created WMArchive document within the WM component before sending it over to the WMArchive service. This way we can accomplish a few things:

  • short document details, along with its size and the threshold used, will be printed to the log
  • documents above the threshold will not be sent to WMArchive

Once we investigate further the cause of such large sizes, we can inspect those documents in the local CouchDB (the document id is printed to the log) and identify the source of the large document size. After that, a more concrete remedy can be put in place to avoid the creation of such large documents, and we can restore the data flow to WMArchive.
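
For reference, the guard boils down to something like the following sketch (illustrative only; the real implementation is in #11967 and the threshold value here is an assumption):

import json
import logging

MAX_DOC_SIZE = 8 * 1024 * 1024  # assumed threshold, e.g. aligned with the current nginx limit

def filter_archivable(docs, logger=logging.getLogger("ArchiveDataPoller")):
    """Keep only documents below the threshold; log and skip the oversized ones."""
    kept = []
    for doc in docs:
        size = len(json.dumps(doc))
        if size > MAX_DOC_SIZE:
            logger.error("Skipping document %s: size %d exceeds threshold %d",
                         doc.get("_id", "unknown"), size, MAX_DOC_SIZE)
            continue
        kept.append(doc)
    return kept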

@arooshap
Member

@vkuznet we changed the nginx limit to accommodate data of any size. This setting was also applied in the older cluster, but it was not mentioned in the upgrade procedure, so it was skipped. I am letting you know in case you did not see the discussion on Mattermost!

@vkuznet
Contributor

vkuznet commented Apr 15, 2024

@arooshap, does it mean that there is no limit now on nginx? If this is the case, we will get a problem on the MONIT side, since it also has internal constraints. I hope @leggerf or @brij01 or @nikodemas can remind us what the current MONIT size limit is for the JSON documents we send; if I recall correctly, it was around 20-30MB. In other words, increasing or removing limits on nginx does not solve the entire problem of creating big WMArchive docs, since their content should not be that big for monitoring purposes, and we still must identify what those documents are and how to fix them before sending to MONIT. I just fear that by removing the nginx limit we delegate the problem from CMSWEB to MONIT, and sooner or later the MONIT team will complain about such large documents.

@arooshap
Member

@vkuznet yes, that is exactly the case. Apparently, it was decided to use these parameters after careful consideration a few years ago. I proposed that we have a discussion among the teams to decide on a concrete value for these limits (as you and @belforte mentioned, it is not good to have no limits, since that might break things in the future). Maybe I can open a ticket to discuss this, what do you think?

@belforte
Member

As Alan already suggested, we can start by looking at past statistics and pick a limit which envelopes what we do, while protecting against new problems. Then we can look at the cost of keeping it vs. the cost of reducing it and adapting the applications.

@vkuznet
Contributor

vkuznet commented Apr 15, 2024

@vkuznet yes, that is exactly the case. Apparently, it was decided to use these parameters after careful consideration a few years ago. I proposed that we have a discussion among the teams to decide on a concrete value for these limits (as you and @belforte mentioned, it is not good to have no limits, since that might break things in the future). Maybe I can open a ticket to discuss this, what do you think?

Aroosha, we may now face a few problems:

  1. By removing the nginx limit you may expose the CMSWEB infrastructure to the potential impact of large docs. We don't know their total size, their rate, or their impact on nginx performance, and if new documents start flooding nginx it can slow down the entire CMSWEB throughput.
  2. We don't know the impact on MONIT either, since these docs will now flow from the client to WMArchive to MONIT, and the MONIT team may start complaining.
  3. We don't know the impact on the memory footprint of WMArchive either, which may hiccup, or at least show memory spikes, due to large documents from the upstream client.

For item 1, I suggest you use the hey tool and run performance studies of how passing around 20-50MB JSON payloads may impact the CMSWEB nginx/FE. This can also help to identify the impact on WMArchive, item 3. But for item 2 we need input from the CMSMonitoring team to tell us the exact limit of the MONIT infrastructure, and I would suggest bounding nginx within this limit at least, because we don't have access to the MONIT infrastructure, which is bigger than CMSWEB, and the MONIT/IT team will complain if CMS docs impact the performance of Kafka, etc. Bottom line: simply removing the limit may hit us back. I would rather keep some limit, at least driven by the threshold from MONIT.
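
A small helper along these lines could generate a test payload of roughly the desired size for such a study (a hedged sketch; the filler structure, file name and the example hey invocation are assumptions, not a real FWJR or a real endpoint):

import json

def make_payload(path, target_mb=20):
    """Write a JSON document of roughly target_mb megabytes of arbitrary filler."""
    filler = "x" * 1024  # 1KB chunk
    doc = {"data": [{"blob": filler} for _ in range(target_mb * 1024)]}
    with open(path, "w") as fd:
        json.dump(doc, fd)

make_payload("large_payload.json", target_mb=20)
# then e.g.: hey -m POST -T application/json -D large_payload.json https://<test-frontend>/wmarchive/data/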

@belforte
Member

IIUC there was no limit in nginx until two weeks ago. A limit is good, but it is not a disaster to start with the same configuration as we had in the last years.

FWIW, if one puts their hands on WMArchive, I suggest critically reviewing the need for it. It dates from the times when we did not have HDFS etc. Do we really need all that info? IIUC the only user now is P&R operations (Jen), who finds some useful plots in Grafana of OK/fail vs. campaign/workflow/agent, which could also be filled with information already collected by the HTCondor spider.

@nikodemas
Member

If this is the case, we will get a problem on the MONIT side, since it also has internal constraints. I hope @leggerf or @brij01 or @nikodemas can remind us what the current MONIT size limit is for the JSON documents we send; if I recall correctly, it was around 20-30MB.

MONIT accepts documents of up to 30MB, and they generally recommend staying under this limit.

@jenimal

jenimal commented Apr 19, 2024

I have monitoring! I can see what is going on for the last 24 hrs. Thank you!

@nikodemas
Member

Just an update on my previous answer about the 30MB limit - there are some problems with going near or above the limit. If a message is larger than the limit, it will simply be rejected, so basically the data will be lost. Otherwise, if a message is only slightly under the limit and for some reason the compression done internally on MONIT's side doesn't work too well on it, it can get stuck in Flume and disturb the whole data ingestion pipeline for some time (I think this happened a few months ago). Therefore, may I ask if there are any plans to change anything regarding this issue?

@nikodemas
Member

And another update - one of the possible suggestions from @vkuznet was to send already compressed data to MONIT; however, we were just told that the MONIT infrastructure currently only accepts .json type data, so sending compressed messages wouldn't be possible :(

@belforte
Member

Do we really need all that data in MONIT? In a single document?

@amaltaro
Contributor

A reasonable/standard document should not be greater than 1 or 2MB.
Yes Nikodemas, we want to review the data structure of these documents and truncate - or delete - data that is not really relevant for WMArchive monitoring. I fear though that we will only be able to get to this in mid-May, as we are fully focused on WMAgent containerization these weeks.
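
The truncation could eventually look something like the sketch below (purely illustrative; the 4KB cap and the handling of nested fields are assumptions, not an agreed schema):

MAX_TEXT = 4 * 1024  # assumed cap for free-text fields

def slim_document(doc):
    """Return a copy of an FWJR-like dict with oversized free-text fields truncated."""
    slim = {}
    for key, value in doc.items():
        if isinstance(value, str) and len(value) > MAX_TEXT:
            slim[key] = value[:MAX_TEXT] + "...[truncated]"
        elif isinstance(value, dict):
            slim[key] = slim_document(value)  # recurse into nested structures
        else:
            slim[key] = value
    return slim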

@nikodemas
Member

@amaltaro ok, thanks for the information!
