Gathering resource stats took longer than 5 seconds #171

Closed
kvaps opened this issue Aug 17, 2020 · 12 comments


kvaps commented Aug 17, 2020

Hi, after upgrading to v1.8.0 my log is full of these messages:

21:36:47.026 [Main] INFO  LINSTOR/Controller - SYSTEM - Controller initialized
21:36:55.017 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Gathering resource stats took longer than 5 seconds [Report number 5F3AF821-00000-000000]
21:36:55.018 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Gathering resource stats took longer than 5 seconds [Report number 5F3AF821-00000-000001]
21:37:00.045 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Gathering storage pool stats took longer than 5 seconds [Report number 5F3AF821-00000-000003]
21:37:00.046 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Gathering storage pool stats took longer than 5 seconds [Report number 5F3AF821-00000-000002]
21:37:00.224 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-image-28' of node 'm1c9' added for retry.
21:37:01.185 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-6832249a-ffc4-4680-b701-d7fa44ffb519' of node 'm1c7' added for retry.
21:37:01.445 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-103-disk-0' of node 'm1c9' added for retry.
21:37:02.587 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-104-disk-0' of node 'm1c9' added for retry.
21:37:03.807 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-105-disk-0' of node 'm1c9' added for retry.
21:37:04.957 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-113-disk-0' of node 'm1c9' added for retry.
21:37:05.065 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Gathering error reports took longer than 5 seconds [Report number 5F3AF821-00000-000005]
21:37:05.065 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Gathering error reports took longer than 5 seconds [Report number 5F3AF821-00000-000004]
21:37:05.077 [TaskScheduleService] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-6832249a-ffc4-4680-b701-d7fa44ffb519' of node 'm1c12' added for retry.
21:37:06.068 [TaskScheduleService] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'pvc-944d33f5-ad07-4f72-9841-9d0ba3b2b2d6' of node 'm1c6' added for retry.
21:37:06.182 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-114-disk-0' of node 'm1c9' added for retry.
21:37:07.334 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-115-disk-0' of node 'm1c9' added for retry.
21:37:08.610 [MainWorkerPool-1] WARN  LINSTOR/Controller - SYSTEM - RetryTask: Failed resource 'one-vm-120-disk-0' of node 'm1c9' added for retry.
21:37:10.118 [grizzly-http-server-1] ERROR LINSTOR/Controller - SYSTEM - Gathering resource stats took longer than 5 seconds [Report number 5F3AF821-00000-000006]
21:37:10.119 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Gathering resource stats took longer than 5 seconds [Report number 5F3AF821-00000-000007]

5F3AF821-00000-000000.log
5F3AF821-00000-000001.log
5F3AF821-00000-000003.log
5F3AF821-00000-000002.log
5F3AF821-00000-000004.log
5F3AF821-00000-000005.log
5F3AF821-00000-000006.log
5F3AF821-00000-000007.log


kvaps commented Aug 17, 2020

Probably related to #172. The messages are gone when I disable metrics scraping.


rp- commented Aug 18, 2020

Right now I set a fixed 5-second limit for gather requests.
I'm not sure how much sense it makes in your setup to really gather resource/volume data from all your nodes.

You can use /metrics?resource=false to skip resource-based stats.
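
For example, a scrape that skips them would look something like this (a sketch; host and port are taken from the curl examples later in this thread, adjust to your setup):

# curl -s "http://localhost:3370/metrics?resource=false"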


kvaps commented Aug 18, 2020

Same problem here. Scraping also fails due to the timeout (vmagent is set to 10 seconds by default):

state=down, endpoint=http://10.112.0.198:3370/metrics?resource=false, labels={instance="10.112.0.198:3370", job="kubernetes-service-endpoints", kubernetes_endpoint_address_target_kind="Pod", kubernetes_endpoint_address_target_name="linstor-controller-0", kubernetes_endpoint_hostname="linstor-controller-0", kubernetes_endpoint_node_name="pve1", kubernetes_endpoint_port_name="restapi", kubernetes_endpoint_port_protocol="TCP", kubernetes_endpoint_ready="true", kubernetes_endpoints_name="linstor-controller", kubernetes_namespace="linstor", kubernetes_pod_container_name="linstor-controller", kubernetes_pod_container_port_name="restapi", kubernetes_pod_container_port_number="3370", kubernetes_pod_container_port_protocol="TCP", kubernetes_pod_controller_kind="StatefulSet", kubernetes_pod_controller_name="linstor-controller", kubernetes_pod_host_ip="10.28.36.159", kubernetes_pod_ip="10.112.0.198", kubernetes_pod_label_app="linstor-controller", kubernetes_pod_label_controller_revision_hash="linstor-controller-784d498d56", kubernetes_pod_label_statefulset_kubernetes_io_pod_name="linstor-controller-0", kubernetes_pod_labelpresent_app="true", kubernetes_pod_labelpresent_controller_revision_hash="true", kubernetes_pod_labelpresent_statefulset_kubernetes_io_pod_name="true", kubernetes_pod_name="linstor-controller-0", kubernetes_pod_node_name="pve1", kubernetes_pod_phase="Running", kubernetes_pod_ready="true", kubernetes_pod_uid="192a40ea-62b4-4958-854d-6e2f99d8387a", kubernetes_service_annotation_prometheus_io_path="/metrics?resource=false", kubernetes_service_annotation_prometheus_io_port="3370", kubernetes_service_annotation_prometheus_io_scrape="true", kubernetes_service_annotation_qbec_io_component="linstor", kubernetes_service_annotation_qbec_io_last_applied="H4sIAAAAAAAA/2yQvU4DQQyEex7D9UGIUiCthHgIJBpE4WwMWbFnG9tJc7p3Rz4UEn7Ku5n5dsYToLYnMm/CUOC4hgHeG++gwCPZsVWCAUYK3GEglAmQWQKjCXt+qslIsaeD3zRZKcYeCqxGCmvVH4xcDlbp/hW7J+mXXSygwGZzd/tH82qoBAXCDpn82FLN/1VGFSbOYG/sIQbzAB231JdGJyOq9laXphfWM4j42Ex4/EJ54BsliHGks/+6CodJ75TR1Fyx0s+3XakutxALh/I8nSBGHqgtty1Lc+g8fMvu/VJazy8DOHWqyc1Tq/7fZJ7nq08AAAD//wEAAP//lLaa5bkBAAA=", kubernetes_service_annotationpresent_prometheus_io_path="true", kubernetes_service_annotationpresent_prometheus_io_port="true", kubernetes_service_annotationpresent_prometheus_io_scrape="true", kubernetes_service_annotationpresent_qbec_io_component="true", kubernetes_service_annotationpresent_qbec_io_last_applied="true", kubernetes_service_cluster_ip="10.96.137.74", kubernetes_service_labelpresent_qbec_io_application="true", kubernetes_service_labelpresent_qbec_io_environment="true", kubernetes_service_name="linstor-controller", kubernetes_service_type="ClusterIP", qbec_io_application="linstor", qbec_io_environment="stage"}, last_scrape=17.183s ago, scrape_duration=10.010s, error="error when scraping \"http://10.112.0.198:3370/metrics?resource=false\" with timeout 10s: timeout"
state=down, endpoint=http://10.112.0.198:3370/metrics?resource=false, labels={instance="10.112.0.198:3370", job="kubernetes-service-endpoints", kubernetes_endpoint_address_target_kind="Pod", kubernetes_endpoint_address_target_name="linstor-controller-0", kubernetes_endpoint_hostname="linstor-controller-0", kubernetes_endpoint_node_name="pve1", kubernetes_endpoint_port_name="ssl", kubernetes_endpoint_port_protocol="TCP", kubernetes_endpoint_ready="true", kubernetes_endpoints_name="linstor-controller", kubernetes_namespace="linstor", kubernetes_pod_container_name="linstor-controller", kubernetes_pod_container_port_name="ssl", kubernetes_pod_container_port_number="3371", kubernetes_pod_container_port_protocol="TCP", kubernetes_pod_controller_kind="StatefulSet", kubernetes_pod_controller_name="linstor-controller", kubernetes_pod_host_ip="10.28.36.159", kubernetes_pod_ip="10.112.0.198", kubernetes_pod_label_app="linstor-controller", kubernetes_pod_label_controller_revision_hash="linstor-controller-784d498d56", kubernetes_pod_label_statefulset_kubernetes_io_pod_name="linstor-controller-0", kubernetes_pod_labelpresent_app="true", kubernetes_pod_labelpresent_controller_revision_hash="true", kubernetes_pod_labelpresent_statefulset_kubernetes_io_pod_name="true", kubernetes_pod_name="linstor-controller-0", kubernetes_pod_node_name="pve1", kubernetes_pod_phase="Running", kubernetes_pod_ready="true", kubernetes_pod_uid="192a40ea-62b4-4958-854d-6e2f99d8387a", kubernetes_service_annotation_prometheus_io_path="/metrics?resource=false", kubernetes_service_annotation_prometheus_io_port="3370", kubernetes_service_annotation_prometheus_io_scrape="true", kubernetes_service_annotation_qbec_io_component="linstor", kubernetes_service_annotation_qbec_io_last_applied="H4sIAAAAAAAA/2yQvU4DQQyEex7D9UGIUiCthHgIJBpE4WwMWbFnG9tJc7p3Rz4UEn7Ku5n5dsYToLYnMm/CUOC4hgHeG++gwCPZsVWCAUYK3GEglAmQWQKjCXt+qslIsaeD3zRZKcYeCqxGCmvVH4xcDlbp/hW7J+mXXSygwGZzd/tH82qoBAXCDpn82FLN/1VGFSbOYG/sIQbzAB231JdGJyOq9laXphfWM4j42Ex4/EJ54BsliHGks/+6CodJ75TR1Fyx0s+3XakutxALh/I8nSBGHqgtty1Lc+g8fMvu/VJazy8DOHWqyc1Tq/7fZJ7nq08AAAD//wEAAP//lLaa5bkBAAA=", kubernetes_service_annotationpresent_prometheus_io_path="true", kubernetes_service_annotationpresent_prometheus_io_port="true", kubernetes_service_annotationpresent_prometheus_io_scrape="true", kubernetes_service_annotationpresent_qbec_io_component="true", kubernetes_service_annotationpresent_qbec_io_last_applied="true", kubernetes_service_cluster_ip="10.96.137.74", kubernetes_service_labelpresent_qbec_io_application="true", kubernetes_service_labelpresent_qbec_io_environment="true", kubernetes_service_name="linstor-controller", kubernetes_service_type="ClusterIP", qbec_io_application="linstor", qbec_io_environment="stage"}, last_scrape=10.866s ago, scrape_duration=10.028s, error="error when scraping \"http://10.112.0.198:3370/metrics?resource=false\" with timeout 10s: timeout"

I think it would be better to implement a non-blocking caching mechanism for the metrics. Multiple agents might gather metrics at the same time, and it is unacceptable that this affects LINSTOR so severely.


rp- commented Aug 19, 2020

The next LINSTOR version will report cached values, which will tremendously speed up /metrics fetching.

If you want, you can try these two jars (controller-only upgrade):

controller-1.8.0.jar.gz
server-1.8.0.jar.gz
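
A rough sketch of how such a controller-only upgrade could be applied (the library path matches the sha256sum check further down in this thread; the systemd unit name is an assumption for a package-based install, while in a Kubernetes setup you would replace the jars in the image or pod and restart it instead):

# gunzip controller-1.8.0.jar.gz server-1.8.0.jar.gz
# cp controller-1.8.0.jar server-1.8.0.jar /usr/share/linstor-server/lib/
# systemctl restart linstor-controller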


kvaps commented Aug 19, 2020

@rp-,

# curl localhost:3370/metrics
[{"ret_code":-4611686018427386906,"message":"Exception thrown.","error_report_ids":["5F3D95EA-00000-000027"]}]

# curl localhost:3370/health 
Services not running: NetComService

5F3D95EA-00000-000027.log
controller.log
ErrorReport-5F3D95EA-00000.zip

kvaps added a commit to kvaps/kube-linstor that referenced this issue Aug 19, 2020

rp- commented Aug 20, 2020

Here is an updated version with null-pointer checks, thanks for testing.
The server.jar from above is also needed.
controller-1.8.0.jar.gz


kvaps commented Aug 20, 2020

@rp-, thank you, metrics collection now takes about 2 seconds, but /health is still not working properly:

# curl localhost:3370/health 
Services not running: NetComService

Please also check the java.io.IOException: Broken pipe exceptions in controller.log.

ErrorReport-5F3E345B-00000-000000.log
ErrorReport-5F3E345B-00000-000001.log
ErrorReport-5F3E345B-00000-000002.log
ErrorReport-5F3E345B-00000-000003.log
ErrorReport-5F3E345B-00000-000004.log

Just to be sure:

# sha256sum /usr/share/linstor-server/lib/controller-1.8.0.jar
65d0b39e75c40c57a5e0bc53498795f2d1e14a9977693298b24b04837b3f350c  /usr/share/linstor-server/lib/controller-1.8.0.jar

kvaps added a commit to kvaps/kube-linstor that referenced this issue Aug 20, 2020

rp- commented Aug 20, 2020

Are all nodes connected? And does it maybe recover after a while?

Btw, also try adding ?error_reports=false to your /metrics query; we don't have caching for that yet.
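
For example, both switches can be combined in a single query (assuming they behave as ordinary, independent query parameters):

# curl -s "http://localhost:3370/metrics?resource=false&error_reports=false"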


kvaps commented Aug 20, 2020

> Are all nodes connected?

One node is offline.

> Does it maybe recover after a while?

I don't think so, my health check kills the pod first, 5 minutes after startup :)

> Btw, also try adding ?error_reports=false to your /metrics query; we don't have caching for that yet.

Will do.


kvaps commented Aug 20, 2020

With ?error_reports=false, metrics collection takes about 0.03 seconds (see the timing sketch at the end of this comment); without it, it takes about 6-10 seconds and reports:

09:06:30.607 [grizzly-http-server-0] ERROR LINSTOR/Controller - SYSTEM - Gathering error reports took longer than 5 seconds [Report number 5F3E3C9F-00000-000001]

Anyway, I am still getting these errors in my controller.log if I run many requests to /metrics without ?error_reports=false in parallel and interrupt them:

SEVERE: An I/O error has occurred while writing a response message entity to the container output stream.

/health is working now.
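
A quick way to reproduce the timing comparison above (a sketch; the numbers will of course depend on the cluster):

# time curl -s -o /dev/null "http://localhost:3370/metrics?error_reports=false"
# time curl -s -o /dev/null "http://localhost:3370/metrics"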


kvaps commented Aug 20, 2020

Hi.
After a while, most of the nodes in the cluster go OFFLINE:

╭───────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State   ┊
╞═══════════════════════════════════════════════════════╡
┊ m1c4  ┊ SATELLITE ┊ 10.28.36.164:3367 (SSL) ┊ OFFLINE ┊
┊ m1c5  ┊ SATELLITE ┊ 10.28.36.165:3367 (SSL) ┊ OFFLINE ┊
┊ m1c6  ┊ SATELLITE ┊ 10.28.36.166:3367 (SSL) ┊ Online  ┊
┊ m1c7  ┊ SATELLITE ┊ 10.28.36.167:3367 (SSL) ┊ Online  ┊
┊ m1c8  ┊ SATELLITE ┊ 10.28.36.168:3367 (SSL) ┊ OFFLINE ┊
┊ m1c9  ┊ SATELLITE ┊ 10.28.36.169:3367 (SSL) ┊ Online  ┊
┊ m1c10 ┊ SATELLITE ┊ 10.28.36.170:3367 (SSL) ┊ OFFLINE ┊
┊ m1c12 ┊ SATELLITE ┊ 10.28.36.172:3367 (SSL) ┊ OFFLINE ┊
┊ pve1  ┊ SATELLITE ┊ 10.28.36.159:3367 (SSL) ┊ Online  ┊
┊ pve2  ┊ SATELLITE ┊ 10.28.36.160:3367 (SSL) ┊ OFFLINE ┊
┊ pve3  ┊ SATELLITE ┊ 10.28.36.161:3367 (SSL) ┊ OFFLINE ┊
╰───────────────────────────────────────────────────────╯

controller.log
ErrorReport-5F3E3C9F-00000.log.zip

After a controller restart the nodes return to a normal state:

╭───────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State   ┊
╞═══════════════════════════════════════════════════════╡
┊ m1c4  ┊ SATELLITE ┊ 10.28.36.164:3367 (SSL) ┊ Online  ┊
┊ m1c5  ┊ SATELLITE ┊ 10.28.36.165:3367 (SSL) ┊ Online  ┊
┊ m1c6  ┊ SATELLITE ┊ 10.28.36.166:3367 (SSL) ┊ Online  ┊
┊ m1c7  ┊ SATELLITE ┊ 10.28.36.167:3367 (SSL) ┊ Online  ┊
┊ m1c8  ┊ SATELLITE ┊ 10.28.36.168:3367 (SSL) ┊ Online  ┊
┊ m1c9  ┊ SATELLITE ┊ 10.28.36.169:3367 (SSL) ┊ Online  ┊
┊ m1c10 ┊ SATELLITE ┊ 10.28.36.170:3367 (SSL) ┊ Online  ┊
┊ m1c12 ┊ SATELLITE ┊ 10.28.36.172:3367 (SSL) ┊ Online  ┊
┊ pve1  ┊ SATELLITE ┊ 10.28.36.159:3367 (SSL) ┊ Online  ┊
┊ pve2  ┊ SATELLITE ┊ 10.28.36.160:3367 (SSL) ┊ Online  ┊
┊ pve3  ┊ SATELLITE ┊ 10.28.36.161:3367 (SSL) ┊ OFFLINE ┊
╰───────────────────────────────────────────────────────╯


rp- commented Aug 20, 2020

Does this cluster have just these 11 nodes?
How many error reports are there? Or, let's say, how long does a linstor err list take?
