🚨 New commits are not being built by Hydra, and channel updates are stopped due to a down database server. 🚨 #76106

Closed
jonringer opened this issue Dec 20, 2019 · 21 comments


@jonringer
Contributor

At the time of writing, trying to connect to https://hydra.nixos.org/ gives the following error:

DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.2;user=hydra;','',...) failed: could not connect to server: Connection timed out
Is the server running on host "10.254.1.2" and accepting
TCP/IP connections on port 5432? at /nix/store/bf666x0vqgdaxfal1ywvi9any2vx9q7x-hydra-perl-deps/lib/perl5/site_perl/5.30.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/8271gj7cpcbz4mf6s0kc6905fnfh1bng-hydra-0.1.20191108.4779757/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 418
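
For anyone wanting to confirm the symptom from a host with access to the internal network, a quick check might look like the following (the 10.254.1.2 address and port 5432 come from the error above; the tooling is just an illustration, not something from this thread):

# Check whether the database host accepts TCP connections on the Postgres port.
nc -vz -w 5 10.254.1.2 5432
# Or, with the PostgreSQL client tools installed:
pg_isready -h 10.254.1.2 -p 5432 -t 5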

cc @edolstra @grahamc @FRidh (not sure who else is on the infrastructure team)

@grahamc
Member

grahamc commented Dec 20, 2019

Hey Jon, thanks for the ping. I noticed this too at about 17:15 UTC, when the alarms about failing channel-bump jobs started firing:

https://status.nixos.org/grafana/d/fBW4tL1Wz/scheduled-task-state-channels-website?orgId=1&from=1576855800000&to=1576864200000


I went to check out the log on the server which runs these updates, and saw:

Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: nix-store -r $path && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: [10.1K blob data]
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: make: *** [Makefile:50: nixos/nix-pills-raw] Error 1
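
Stripped of the journald prefixes, the failing step is this small pipeline (restated from the log above, not new code). With Hydra serving its error page instead of JSON, jq -re has nothing to extract and exits non-zero, so the && chain stops and make reports Error 1:

path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
nix-store -r $path && \
ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw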

So I tested connecting to the hydra.nixos.org server (ceres), which went fine. I then started to suspect a problem with the database server (chef). I tried to SSH in and couldn't connect. I took a look at the metrics for chef, and ... well:

https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576857600000&to=1576861200000


It is down and has been for a bit.

I have since escalated to @edolstra and @rbvermaa, who have keys to this particular part of the castle.

@grahamc changed the title from "Hydra seems to be down" to "Hydra's database server is down" on Dec 20, 2019
@grahamc
Member

grahamc commented Dec 20, 2019

I don't see any related incidents: https://www.hetzner-status.de/en.html

@jonringer
Contributor Author

@grahamc thanks for the quick response :)

Also, I sent you an email at graham@grahamc.com :) Please take a look.

@rbvermaa
Member

Tried to reach the database server (chef), but couldn't. Performing a hard reset now.

@rbvermaa
Member

The machine did not come up after the hard reset, nor with the rescue system. Engaged Hetzner support.

@grahamc
Member

grahamc commented Dec 20, 2019

This is incredible timing. Just last week, the infra team was discussing getting a replacement machine for the database, because:

https://status.nixos.org/prometheus/graph?g0.range_input=8w&g0.end_input=2019-12-20%2019%3A02&g0.expr=node_filesystem_avail_bytes%7Bmountpoint%3D%22%2Fdata%22%2Cinstance%3D%22chef%3A9100%22%7D&g0.tab=0

We had put in the request for the new machine yesterday :).

@grahamc pinned this issue on Dec 20, 2019
@grahamc changed the title from "Hydra's database server is down" to "New commits are not being built by Hydra, and channel updates are stopped due to a down database server." on Dec 20, 2019
@grahamc changed the title to "🚨 New commits are not being built by Hydra, and channel updates are stopped due to a down database server. 🚨" on Dec 20, 2019
@grahamc
Member

grahamc commented Dec 20, 2019

I updated the "hydra is down "error page at https://hydra.nixos.org/ in NixOS/infra@72e03ec to link the monitoring infra and tickets with the label "infrastructure". The infra team is discussing options in case Hetzner support doesn't reply soon.

@grahamc
Member

grahamc commented Dec 20, 2019

We have heard back from support about the database machine (chef): they say an SSD has failed. The machine has been booted into a rescue console to find out whether the failed disk was part of the root filesystem or the /data array.

@grahamc
Member

grahamc commented Dec 20, 2019

We brought chef up successfully and the /data mirror is indeed degraded. We're considering next steps.

@grahamc
Member

grahamc commented Dec 20, 2019

We're preparing to take a backup directly off the disk, and then we will replace the failed disk.

@grahamc
Member

grahamc commented Dec 20, 2019

The first, block-wise backup is on its way:

pv /dev/md127 | gzip | mbuffer -q | ssh «remote» 'mbuffer -q | gzip -d > /hydradbrecv/storage/dd'

It's transferring at roughly 100 MiB/s, with an ETA of about 1h30m.
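
As a sanity check on that ETA (assuming the whole ~447 GiB array is read end to end): 447 GiB ÷ 100 MiB/s ≈ 4600 s ≈ 76 minutes of raw transfer, so about 1h30m including pipeline and network overhead is plausible.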

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        4.77G   883G      0    238    118  12.8M
  mirror                           2.39G   442G      0    119     59  6.41M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0     59     29  3.21M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0     60     29  3.21M
  mirror                           2.38G   442G      0    118     59  6.40M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0     59     29  3.20M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0     59     29  3.20M

@grahamc
Member

grahamc commented Dec 20, 2019

gzip was a bad choice (it was saturating a core), so I switched to lz4, which isn't taking 100% of a core:

pv /dev/md127 | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/dd'

It's transferring at roughly 350 MiB/s, with an ETA of about 30m.

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        2.76G   885G      0  2.17K      0   118M
  mirror                           1.38G   443G      0  1.12K      0  59.5M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0    572      0  29.8M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0    579      0  29.8M
  mirror                           1.38G   443G      0  1.04K      0  58.1M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0    531      0  29.1M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0    535      0  29.1M
---------------------------------  -----  -----  -----  -----  -----  -----

@grahamc
Member

grahamc commented Dec 20, 2019

The dd copy arrived intact:

backup# md5sum ./dd
5db3000dc3486a2e701e976bef0c988f  ./dd

vs.

chef# pv /dev/md127 | md5sum -
 447GiB 0:18:26 [ 413MiB/s] [=================================>] 100%
5db3000dc3486a2e701e976bef0c988f  -

Apparently the block device compresses very well:

backup# du --apparent-size -h ./dd
448G	./dd

backup# du -h ./dd
99G	./dd

I'm also replicating this dd backup elsewhere:

# ssh «remote» 'zfs send -c hydradbrecv/storage@2019-12-20T21:51UTC--dd-only | mbuffer' \
    | mbuffer | zfs recv -s rpool/hydradbstorage
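
A nice property of the -s flag on zfs recv is that an interrupted long-haul transfer can be resumed instead of restarted. A sketch of how that would look with the dataset names above (this step is not shown in the thread; <token> is a placeholder for the value zfs reports):

# On the receiving machine, read the saved resume token from the partially received dataset:
zfs get -H -o value receive_resume_token rpool/hydradbstorage
# Resume the stream from the sender using that token instead of starting over:
ssh «remote» 'zfs send -t <token> | mbuffer' | mbuffer | zfs recv -s rpool/hydradbstorage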

On to a pg_dumpall-based backup. We'll first block external access to Postgres via iptables, to prevent any services from causing IO and stressing the disk further:

ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP
iptables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
iptables -I nixos-fw -p tcp --dport 5432 -j DROP

then start postgresql and:

pg_dumpall | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/pg_dumpall.sql'
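
Once the dump is safely off the machine, the matching deletes would presumably be used to reopen access (not shown in the thread; same rule arguments as above, with -I swapped for -D):

# Remove the temporary DROP rules so services can reach Postgres again.
iptables -D nixos-fw -p tcp --dport 5432 -j DROP
iptables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0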

@grahamc
Member

grahamc commented Dec 20, 2019

pg_dumpall finished:

# du --apparent-size -h ./pg_dumpall.sql
109G	./pg_dumpall.sql

# du -h ./pg_dumpall.sql
48G	./pg_dumpall.sql

I took a snapshot and will be sending this to my secondary backup machine as soon as the first transfer finishes (it likely won't finish before morning, as it is going much farther than Germany to Amsterdam, and also not to a fancypants Packet datacenter).


We've now requested that Hetzner replace the failed disk. We should be back online once the RAID array recovers. No word yet on the new server we ordered.

@grahamc
Member

grahamc commented Dec 20, 2019

It looks like Hetzner took the machine down to replace the disk at 2019-12-21 00:20 GMT:
https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576882800000&to=1576884600000

@grahamc
Member

grahamc commented Dec 20, 2019

[>....................]  recovery =  0.5% (2366080/468719424) finish=39.4min speed=197173K/sec

[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 00:39:38 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 12% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 691

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1
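
The "spare rebuilding" state implies the replacement disk was partitioned and added back into the degraded array. Those commands aren't shown in the thread, but a conventional sequence would look roughly like this (device names taken from the mdadm output; treat the exact steps as an assumption):

# Copy the partition layout from the surviving disk to the replacement disk:
sfdisk -d /dev/sdc | sfdisk /dev/sdd
# Add the new partition to the degraded mirror; md starts rebuilding onto it automatically:
mdadm --manage /dev/md127 --add /dev/sdd1
# Watch the rebuild progress:
watch -n 5 cat /proc/mdstat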

@grahamc
Member

grahamc commented Dec 21, 2019

           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:05:43 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 79% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1026

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1

@grahamc
Member

grahamc commented Dec 21, 2019

We have a clean array!

[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:13:57 2019
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1126

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      active sync   /dev/sdd1

@grahamc
Member

grahamc commented Dec 21, 2019

We're back!

https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?orgId=1&from=1576888500000&to=1576889100000

I'm closing this issue. We will be getting a new machine for the database within the next week, to address the next impending-doom problem:

[root@chef:~]# df -ha /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      439G  426G   14G  97% /data

@jonringer
Contributor Author

Thank you @grahamc for your hard work. I (as well as many others) appreciate it :)
