uptime: not updated after a crash #180

Closed
Krysztophe opened this issue Apr 9, 2018 · 9 comments

Comments

@Krysztophe
Collaborator

pg_conf_load_time() and pg_postmaster_start_time() are not updated when a crash occurs and the postmaster restarts all its children. As a result, the uptime service does not raise an alert.

Idea: check pg_stat_activity.backend_start for some vital process like the checkpointer? (10+)
I have no idea how to track unexpected restarts before 10, though.

@Krysztophe
Collaborator Author

Plan:

  • (10+) use the backend_start from the checkpointer from pg_stat_activity

Before v10:

  • Search for the OS start time of the checkpointer process? This would work only if check_pga runs on the PG server itself, and there is a risk of mixing up processes from different instances

  • Searching in the logs is IMHO out of scope for this service
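
A minimal sketch of the 10+ part of this plan, for illustration only. It is not the code from the PR, and the names `UPTIME_QUERY`, `crash_restart_detected`, `effective_start` and the `tolerance` parameter are hypothetical. It assumes the monitoring role can read `backend_start` for the checkpointer row; if the value comes back NULL, the check cannot conclude anything.

```python
from datetime import datetime, timedelta
from typing import Optional

# Query for PostgreSQL 10+ only: pg_stat_activity.backend_type and the rows
# for auxiliary processes such as the checkpointer do not exist before 10.
UPTIME_QUERY = """
SELECT pg_postmaster_start_time() AS postmaster_start,
       (SELECT min(backend_start)
          FROM pg_stat_activity
         WHERE backend_type = 'checkpointer') AS checkpointer_start
"""


def crash_restart_detected(
    postmaster_start: datetime,
    checkpointer_start: Optional[datetime],
    tolerance: timedelta = timedelta(seconds=60),  # hypothetical default
) -> bool:
    """Return True if the checkpointer started well after the postmaster.

    After a crash, the postmaster keeps its start time but restarts its
    children, so the checkpointer's backend_start moves forward.
    """
    if checkpointer_start is None:
        # Value not visible or not available: no conclusion possible.
        return False
    return checkpointer_start - postmaster_start > tolerance


def effective_start(
    postmaster_start: datetime,
    checkpointer_start: Optional[datetime],
) -> datetime:
    """Return the timestamp an uptime check could report against."""
    if checkpointer_start is None:
        return postmaster_start
    return max(postmaster_start, checkpointer_start)
```

The two timestamps would come from running `UPTIME_QUERY` with whatever client the check already uses. The tolerance only absorbs the normal gap between postmaster start and checkpointer start, and its value is an assumption to tune.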

@Krysztophe
Collaborator Author

Krysztophe commented May 1, 2018

See PR #182 for 10+

For 9.1 to 9.6:

I'm wondering if pg_stat_get_db_stat_reset_time() and pg_stat_get_bgwriter_stat_reset_time() (9.1+) would be helpful.

Rule: if the oldest non-NULL of these dates is later than pg_postmaster_start_time(), take it as the uptime. Possibly take a given database as a reference (you never reset stats manually on template1)?
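
The rule above could be sketched as follows, purely as an illustration of the proposal. The function name `estimated_start` and its arguments are hypothetical; the reset times are assumed to have been fetched beforehand with the two functions named above, with NULLs mapped to `None`.

```python
from datetime import datetime
from typing import Iterable, Optional


def estimated_start(
    postmaster_start: datetime,
    reset_times: Iterable[Optional[datetime]],
) -> datetime:
    """Apply the proposed rule for 9.1 to 9.6.

    Take the oldest non-NULL stats reset time. If it is later than the
    postmaster start time, use it as the reference for uptime; otherwise
    keep the postmaster start time.
    """
    known = [t for t in reset_times if t is not None]
    if not known:
        return postmaster_start
    oldest = min(known)
    return oldest if oldest > postmaster_start else postmaster_start
```

Because a single old reset time keeps the result at the postmaster start time, a manual reset on one database alone does not trigger the rule; only a state where every known reset time is recent does.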

@ioguix
Member

ioguix commented May 2, 2018

Hi,

You cannot rely on the *_reset_time functions, as users can reset the stats whenever they want.

Moreover, terminating all backends to reset the shared_buffers is not a real restart on its own. What you seem to be looking for is a way to detect that a backend crashed, and when it did (and I have no idea how to do that right now).

@Krysztophe
Collaborator Author

> You can not rely on *_reset_time functions as users can use them whenever they want.

Right, I only suggest them because I cannot rely on backend start times before PG10 (remotely at least). If you reset the stats for all databases (even template1?), you usually know about it.

> terminating all backends to reset the shared_buffers is not a real restart on its own

From the user's point of view, it is: connections dropped, transactions canceled...

Such a thing is usually worth an investigation. And I know of no automated way to detect it with check_pga. Such a restart is not obvious on weekly charts.

@ioguix
Member

ioguix commented May 2, 2018

> From the user's point of view, it is: connections dropped, transactions canceled...
> Such a thing is usually worth an investigation. And I know of no automated way to detect it with check_pga.

If it is worth investigating (and it is), investigating implies you can read the logs, which are packed with WARNING/ERROR messages in such a situation :)

But I agree an alert from the monitoring system might be useful... if possible.

> Such a restart is not obvious on weekly charts.

I suppose the cache hit/miss ratio should drop after the shared buffers reset.

@Krysztophe
Collaborator Author

> I suppose the cache hit/miss ratio should drop after the shared buffers reset.

Not so obvious if you are not specifically looking for it, especially on a weekly OPM graph.

@ioguix
Member

ioguix commented May 2, 2018

Indeed. However, if you set an alert on the cache hit/miss ratio, you should catch such an event through a very low ratio.

I agree this is neither the best nor the most direct solution for this issue, but I have no other idea right now :/
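
As a rough illustration of this workaround, the hit ratio over an interval can be computed from two samples of the `blks_hit` and `blks_read` counters in `pg_stat_database`. The function names and the `threshold` value below are hypothetical and would need tuning per workload.

```python
from typing import Mapping, Optional


def hit_ratio(prev: Mapping[str, int], curr: Mapping[str, int]) -> Optional[float]:
    """Cache hit ratio between two samples of pg_stat_database counters.

    Returns None when there was no block activity, or when a counter went
    backwards (for example after a stats reset), since no ratio can be
    computed for that interval.
    """
    hit = curr["blks_hit"] - prev["blks_hit"]
    read = curr["blks_read"] - prev["blks_read"]
    if hit < 0 or read < 0 or hit + read == 0:
        return None
    return hit / (hit + read)


def low_hit_ratio_alert(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    threshold: float = 0.5,  # hypothetical value
) -> bool:
    """Return True when the interval hit ratio falls below the threshold."""
    ratio = hit_ratio(prev, curr)
    return ratio is not None and ratio < threshold
```

Computing the ratio on deltas rather than on the cumulative counters matters here: a short drop after the shared buffers are emptied is diluted in lifetime totals but stands out over a single sampling interval.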

@ioguix
Member

ioguix commented May 2, 2018

Note that even for 10+, your solution relies on an indirect side effect as well :/ A much better one, but still not a direct one...

@Krysztophe
Collaborator Author

#182 merged (thanks ioguix). I do not see a way to detect the crash and restart before v10, so I am closing this issue.
