🚨 New commits are not being built by Hydra, and channels updates are stopped due to a down database server. 🚨 #76106
Hey Jon, thanks for the ping. I noticed this too at about 17:15 UTC, when the alarms about failing channel bump jobs started firing. I went to check the log on the server which runs these updates, and saw:
so I just tested connecting to the hydra.nixos.org server ( it is down and has been for a bit. I have since escalated to @edolstra and @rbvermaa, who have keys to this particular part of the castle. |
I don't see any related incidents: https://www.hetzner-status.de/en.html |
@grahamc thanks for the quick response :) Also, I sent you an email :) to graham@grahamc.com, please take a look |
Tried to reach the database server (chef), but couldn't. Performing a hard reset now. |
Machine did not come up with hard reset, nor with rescue system. Engaged Hetzner support. |
I updated the "hydra is down" error page at https://hydra.nixos.org/ in NixOS/infra@72e03ec to link to the monitoring infra and to tickets with the label "infrastructure". The infra team is discussing options in case Hetzner support doesn't reply soon. |
We have heard back from support about the database machine ( |
We brought |
We're preparing to take a backup directly off the disk, and then we will replace the failed disk. |
The first, block-wise backup is on its way:
transferring at roughly 100MiB/s, with an ETA of about 1h30m. The receiving end is a ZFS pool:
|
transferring at roughly 350MiB/s, with an ETA of about 30m. The receiving end is a ZFS pool:
|
The
vs.
Apparently the block device compresses very well:
I'm also replicating this
On to a pg_dumpall based backup. We'll first block external access to postgres via iptables to prevent any services from causing IO and stressing the disk further:
then start postgresql and:
|
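The exact commands were not captured in this thread. As a rough sketch of the approach described (block external access first, then dump), assuming PostgreSQL listens on the default port 5432 and that a local `postgres` superuser role exists: the `DUMP_FILE` path, the `run` helper, and the function names below are hypothetical, and the script only prints what it would do unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the commands actually run during this incident.
# Assumptions: PostgreSQL on TCP port 5432, a local "postgres" superuser role,
# and an illustrative dump path.
PGPORT="${PGPORT:-5432}"
DUMP_FILE="${DUMP_FILE:-/var/backups/pg_dumpall.sql}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

block_external_postgres() {
    # Reject TCP traffic to PostgreSQL arriving on any non-loopback interface,
    # so other services cannot add IO load to the degraded disk.
    run iptables -I INPUT ! -i lo -p tcp --dport "$PGPORT" -j REJECT
}

dump_all_databases() {
    # Logical dump of every database plus global objects (roles, tablespaces).
    run pg_dumpall -U postgres -f "$DUMP_FILE"
}

block_external_postgres
dump_all_databases
```

A logical dump like this complements the block-wise copy: it only succeeds if PostgreSQL can actually read every row, so it doubles as a consistency check on the data.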
I took a snapshot and will be sending this to my secondary backup machine as soon as the first transfer finishes (likely won't finish before morning, as it is going much further away than Germany to Amsterdam, and also not going to a fancypants Packet datacenter). We've now requested that Hetzner replace the failed disk. We should be back online once that RAID array recovers. No word yet on the new server we ordered. |
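For readers unfamiliar with ZFS replication, snapshotting a dataset and sending it to a second machine generally looks something like the sketch below. The dataset, snapshot, and host names are all hypothetical (none are given in this thread), and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical sketch of snapshot-and-send replication; every name here is
# illustrative and not taken from the actual backup machines.
DATASET="${DATASET:-tank/db-image}"
SNAPSHOT="${SNAPSHOT:-pre-disk-swap}"
REMOTE="${REMOTE:-secondary-backup}"
REMOTE_DATASET="${REMOTE_DATASET:-pool/db-image}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

snapshot_cmd() {
    # Create a read-only, point-in-time snapshot of the dataset.
    printf 'zfs snapshot %s@%s\n' "$DATASET" "$SNAPSHOT"
}

send_cmd() {
    # Stream the snapshot over SSH into a dataset on the remote pool.
    printf 'zfs send %s@%s | ssh %s zfs receive %s\n' \
        "$DATASET" "$SNAPSHOT" "$REMOTE" "$REMOTE_DATASET"
}

replicate() {
    for cmd in "$(snapshot_cmd)" "$(send_cmd)"; do
        echo "+ $cmd"
        if [ "$DRY_RUN" = "0" ]; then
            sh -c "$cmd"
        fi
    done
}

replicate
```

Because the stream is sent from a snapshot, the source dataset can keep changing during a slow, long-distance transfer without affecting what arrives on the other side.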
|
|
We have a clean array!
|
Thank you @grahamc for your hard work. I (as well as many others) appreciate it :) |
At the time of writing, trying to connect to https://hydra.nixos.org/ gives the following error:
cc @edolstra @grahamc @FRidh (not sure who else is on the infrastructure team)