Handle exceptions from prometheus collectors in discovery node #3044

Merged: 5 commits into master from jc-prom-metrics, May 5, 2022

@joaquincasares (Contributor) commented May 5, 2022

Description

Our Prometheus endpoints are currently returning this:

{"error":["Something caused the server to crash."],"success":false}

Our logs are showing this:

{"levelno": 40, "level": "ERROR", "msg": "Non Audius-derived exception\nTraceback (most recent call last):\n  File \"/usr/lib/python3.9/site-packages/flask/app.py\", line 1838, in full_dispatch_request\n    rv = self.dispatch_request()\n  File \"/usr/lib/python3.9/site-packages/flask/app.py\", line 1824, in dispatch_request\n    return self.view_functions[rule.endpoint](**req.view_args)\n  File \"/audius-discovery-provider/src/queries/prometheus_metrics_exporter.py\", line 43, in prometheus_metrics_exporter\n    PrometheusMetric.populate_collectors()\n  File \"/audius-discovery-provider/src/utils/prometheus_metric.py\", line 77, in populate_collectors\n    collector()\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 30, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 17:23:27,629"}

Let's wrap our collectors in try/except blocks so that an exception in one collector is logged instead of crashing the whole metrics endpoint.
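
A minimal sketch of the idea, assuming a registry of collector callables similar to what PrometheusMetric.populate_collectors() iterates over in the traceback above. The names `run_collectors`, `good_collector`, and `bad_collector` are hypothetical and not taken from this PR; the only point is that each collector runs inside its own try/except, so one failure is logged and the others still run.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def run_collectors(collectors: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every collector, logging failures instead of propagating them.

    Returns the names of the collectors that raised.
    """
    failed = []
    for name, collector in collectors.items():
        try:
            collector()
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("Collector failed: %s", name)
            failed.append(name)
    return failed


# Hypothetical collectors, for illustration only.
collected = []


def good_collector() -> None:
    collected.append("ok")


def bad_collector() -> None:
    # Mirrors the KeyError: 'name' seen in the traceback.
    task = {"task_name": "update_metrics"}
    task["name"]


failed = run_collectors({"bad": bad_collector, "good": good_collector})
```

Here `bad_collector` raises, is logged, and `good_collector` still runs, so the endpoint can keep serving whatever metrics were collected.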

Tests

Check that our prometheus endpoint continues to expose metrics:

Check that our discovery provider metrics are still being displayed:

Ensure our staging nodes are being scraped without failures:

How will this change be monitored? Are there sufficient logs?

@SidSethi (Contributor) commented May 5, 2022

@joaquincasares mind linking to the relevant dashboards in the PR description, in place of "Monitor Grafana for incoming data"?

@joaquincasares (Contributor, Author) commented

Now seeing where the issue is with the new log lines:

{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': '552ef680-2cbb-4c5b-9e1b-e7ecc39a09e9', 'task_name': 'update_metrics', 'started_at': 1651784400.00275}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,133"}
{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': '85b587a6-d232-4a23-80fc-212c762f45d2', 'task_name': 'monitoring_queue', 'started_at': 1651783267.5805767}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,134"}
{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': 'cac769ac-b530-4afd-9f38-81612d51b0cf', 'task_name': 'index_eth', 'started_at': 1651784448.7092664}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,134"}
{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': 'c2061b21-9ba3-4b33-9919-d203bc917ce4', 'task_name': 'vacuum_db', 'started_at': 1651783458.180594}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,134"}
{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': '3e9432c2-0400-4b30-b6f2-0ba79b978fdc', 'task_name': 'index_solana_plays', 'started_at': 1651784448.7501755}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,134"}
{"levelno": 40, "level": "ERROR", "msg": "Processing failed for task: {'task_id': '654dd11c-e71f-4bc2-b46c-a773b6dab5f1', 'task_name': 'index_trending', 'started_at': 1651784421.4895377}\nTraceback (most recent call last):\n  File \"/audius-discovery-provider/src/queries/get_celery_tasks.py\", line 34, in celery_tasks_prometheus_exporter\n    metric.save_time({\"task_name\": task[\"name\"]}, start_time=task[\"time_start\"])\nKeyError: 'name'", "timestamp": "2022-05-05 21:21:37,134"}
{"levelno": 20, "level": "INFO", "msg": "handle flask request", "timestamp": "2022-05-05 21:21:37,141", "method": "GET", "path": "/prometheus_metrics", "status": 200, "duration": 18, "ip": "207.193.120.71", "host": "discoveryprovider2.staging.audius.co", "params": ""}
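
The task payloads in these log lines carry the keys task_name and started_at, while the exporter was indexing name and time_start, which explains the KeyError. Below is a hedged sketch of a per-task guard using the corrected keys. `record_task_durations` and the `save_time` callback are hypothetical stand-ins for metric.save_time, whose real signature is only partly visible in the traceback.

```python
import logging
from typing import Any, Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def record_task_durations(
    tasks: Iterable[Dict[str, Any]],
    save_time: Callable[..., None],
) -> int:
    """Record one sample per task; log and skip tasks that cannot be processed.

    Returns the number of tasks recorded successfully.
    """
    recorded = 0
    for task in tasks:
        try:
            # Corrected keys: task_name (not name), started_at (not time_start).
            save_time({"task_name": task["task_name"]}, start_time=task["started_at"])
            recorded += 1
        except Exception:
            logger.exception("Processing failed for task: %s", task)
    return recorded


# Usage with one payload copied from the logs and one hypothetical malformed payload.
seen = []


def fake_save_time(labels: Dict[str, str], start_time: float) -> None:
    seen.append((labels["task_name"], start_time))


tasks = [
    {
        "task_id": "552ef680-2cbb-4c5b-9e1b-e7ecc39a09e9",
        "task_name": "update_metrics",
        "started_at": 1651784400.00275,
    },
    {"task_id": "hypothetical-malformed-task"},  # missing keys, gets logged and skipped
]

recorded = record_task_durations(tasks, fake_save_time)
```

Guarding each task separately means one malformed payload does not prevent the remaining tasks from being reported.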

@joaquincasares (Contributor, Author) commented

New metrics are now being ingested:

# HELP audius_dn_celery_running_tasks Multiprocess metric
# TYPE audius_dn_celery_running_tasks gauge
audius_dn_celery_running_tasks{pid="ser",task_name="update_metrics"} 1828.4609746932983
audius_dn_celery_running_tasks{pid="ser",task_name="monitoring_queue"} 2960.883231639862
audius_dn_celery_running_tasks{pid="ser",task_name="index_eth"} 1779.754591703415
audius_dn_celery_running_tasks{pid="ser",task_name="vacuum_db"} 2770.283311843872
audius_dn_celery_running_tasks{pid="ser",task_name="index_solana_plays"} 1779.713775396347
audius_dn_celery_running_tasks{pid="ser",task_name="index_trending"} 1806.9744653701782
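
For checking a scrape like this programmatically, here is a rough sketch that pulls the per-task values out of the text exposition output. It is a simplified parser written for this example, not an official Prometheus client API: it assumes one sample per line, no escaped quotes in label values, and no trailing timestamps. The name `parse_task_gauges` is hypothetical. The values appear to be seconds elapsed since each task's started_at, though the PR does not state this.

```python
import re
from typing import Dict

# One sample line: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(
    r'^(?P<metric>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_task_gauges(text: str, metric: str) -> Dict[str, float]:
    """Map task_name -> value for every sample of the given metric."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("metric") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        if "task_name" in labels:
            values[labels["task_name"]] = float(match.group("value"))
    return values


# Two of the sample lines from the output above.
scrape = """\
# HELP audius_dn_celery_running_tasks Multiprocess metric
# TYPE audius_dn_celery_running_tasks gauge
audius_dn_celery_running_tasks{pid="ser",task_name="update_metrics"} 1828.4609746932983
audius_dn_celery_running_tasks{pid="ser",task_name="vacuum_db"} 2770.283311843872
"""

gauges = parse_task_gauges(scrape, "audius_dn_celery_running_tasks")
```

A check along these lines could confirm that every expected task_name shows up after a deploy.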

@SidSethi (Contributor) left a comment

looks good to me! thanks for linking to stage Discprov and testing there. nice to see the fixes working!

@jonaylor89 (Contributor) left a comment

LGTM

@joaquincasares merged commit 0dec65b into master May 5, 2022
@joaquincasares deleted the jc-prom-metrics branch May 5, 2022 21:49
joaquincasares added a commit that referenced this pull request May 5, 2022
* wrapping collectors with try blocks

* remove unused e

* items not enumerate

* extract task_name, not name

* started_at, not time_start
@SidSethi changed the title from "Handle exceptions from collectors" to "Handle exceptions from prometheus collectors in discovery node" May 5, 2022
sliptype pushed a commit that referenced this pull request Sep 10, 2023
[e804aa1] [C-2248, C-2373] Use playlistUpdates, remove legacyNotifications (#3094) Dylan Jeffers
[824933e] [C-2366] Improve web notification selection performance (#3103) Dylan Jeffers
[4b8edef] [PLAT-696] Add trending-playlists/underground notifications (#3089) Dylan Jeffers
[1f9cf3e] [C-2275] Fix android drawer offsets (#3095) Dylan Jeffers
[fc14c82] [PAY-1063][PAY-1085][PAY-1086] Update UI for inaccessible gated tracks from favorites and history pages (#3100) Saliou Diallo
[b0441f5] [C-2365] Update play buttons on web and mobile to show resume when track is current (#3101) Kyle Shanks
[453910f] [C-2378] Add upload v2 feature flag (#3099) Sebastian Klingler
[962a6df] [C-2337] Remove reachability mobile web (#3090) Raymond Jacobson
[4ad5cd2] Fix visible collectibles for upload popup (#3093) Saliou Diallo
[c143078] Fix feature flag bug (#3092) Saliou Diallo
[44435b5] Fix upload prompt modal learn more url (#3091) Saliou Diallo
[c9024ad] Use chat.messagesStatus instead of selector (#3087) Reed
[38d43c4] [C-2369] Fix issue where notification poll can break app on signout (#3088) Dylan Jeffers
[90122d9] [PAY-923] DMs: Add desktop entrypoints (#3083) Marcus Pasell
[00f27e8] [PAY-907] Mobile chat reactions (#3020) Reed
[4678b89] DMs: Fix broken typecheck on main (#3086) Marcus Pasell
[756ade4] [PAY-1000][PAY-1084][PAY-1096][PAY-1097][PAY-1098] - More gated content fixes (#3085) Saliou Diallo
[820aa9d] Fix upload and repost probers tests and lint (#3076) Sebastian Klingler
[345607e] [C-2320] Fix profile socials alignment (#3079) Dylan Jeffers
[569199c] Fix prod build timeout (#3084) Sebastian Klingler
[12f6c22] Remove ports for local dev (#3082) Theo Ilie
[1940618] Fix broken Main build due to typeerror (#3080) Marcus Pasell
[eb8d47e] [PAY-1082] DMs: Dedupe sent messages (#3066) Marcus Pasell
[50a11c3] Update SDK to 2.0.3-beta.0 (#3078) Marcus Pasell
[c420fbb] Clean up NPM package lock (#3077) Marcus Pasell
[35d1124] [C-2327] Add playlist updates slice (#3063) Dylan Jeffers
[59862ad] [C-2344] Update the web playbar scrubber to respect the playback speed of podcasts (#3075) Kyle Shanks
[ffeb0d3] [C-2349] Default download on wifi only to false (#3074) Andrew Mendelsohn
[cafae41] [C-2325] Fix playlist table date-added column (#3073) Dylan Jeffers
[384a510] [PAY-927] DMs: Empty messages state (#3068) Marcus Pasell
[1132f83] Update @jup-ag/core to 2.0.0-beta.9 (#3072) Marcus Pasell
[49c0ebf] [PAY-1072] Change "Download App" icon on Settings Page (#3067) Marcus Pasell
[928dcaf] [PAY-1056] - More gated content updates and fixes (#3070) Saliou Diallo
[1e1f769] [C-2345] Move PlaybackRate drawer to common drawers map (#3071) Kyle Shanks
[f5d1251] Fix web-dist CI steps (#3069) Sebastian Klingler
[5f89800] Fix heavy rotation playlist on client (#3056) sabrina-kiam
[c0191e2] [C-2316] Add remote config for all oauth verification (#3052) Raymond Jacobson
[40f5627] [PAY-1074][PAY-1075][PAY-1076][PAY-1080] - Update availability settings states + more QA fixes (#3059) Saliou Diallo
[5be60ac] [C-2339] Update podcast control updates to also work for audiobooks (#3065) Kyle Shanks
[163ebf5] [C-2297] Add fallback flag to podcast feature (#3064) Sebastian Klingler
[f206391] [PAY-904] - Add gated content upload prompt (#3057) Saliou Diallo
[1afc4e5] [C-1344] Move probers to monorepo and make tests pass (#3061) Sebastian Klingler
[e198279] Remove random line (#3062) Saliou Diallo
[24a001b] Add playback position logic for mobile (#3051) Kyle Shanks
[d210124] [PAY-1070] Update TabSlider/SegmentedControl slider size on resize (#3044) Marcus Pasell
4 participants