🚨 New commits are not being built by Hydra, and channel updates are stopped due to a down database server. 🚨 #76106

Closed
jonringer opened this issue Dec 20, 2019 · 21 comments


@jonringer
Contributor

At the time of writing, trying to connect to https://hydra.nixos.org/ gives the following error:

DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.2;user=hydra;','',...) failed: could not connect to server: Connection timed out
Is the server running on host "10.254.1.2" and accepting
TCP/IP connections on port 5432? at /nix/store/bf666x0vqgdaxfal1ywvi9any2vx9q7x-hydra-perl-deps/lib/perl5/site_perl/5.30.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/8271gj7cpcbz4mf6s0kc6905fnfh1bng-hydra-0.1.20191108.4779757/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 418
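
For anyone wanting to confirm the symptom from a host with access to the internal network, a quick check might look like the following (the 10.254.1.2 address and port 5432 come from the error above; the tooling is just an illustration, not something from this thread):

# Check whether the database host accepts TCP connections on the Postgres port.
nc -vz -w 5 10.254.1.2 5432
# Or, with the PostgreSQL client tools installed:
pg_isready -h 10.254.1.2 -p 5432 -t 5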

cc @edolstra @grahamc @FRidh (not sure who else is on the infrastructure team)

@grahamc
Member

grahamc commented Dec 20, 2019

Hey Jon, thanks for the ping. I noticed this too at about 17:15 UTC, when the alarms about failing channel-bump jobs started firing:

https://status.nixos.org/grafana/d/fBW4tL1Wz/scheduled-task-state-channels-website?orgId=1&from=1576855800000&to=1576864200000


I went to check out the log on the server which runs these updates, and saw:

Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: nix-store -r $path && \
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 20 17:00:13 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: [10.1K blob data]
Dec 20 17:02:22 webserver swqdnxb946996msj71w5i0yh9aj4480k-unit-script-update-homepage-start[13584]: make: *** [Makefile:50: nixos/nix-pills-raw] Error 1
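
Stripped of the journald prefixes, the failing step is this small pipeline (restated from the log above, not new code). With Hydra serving its error page instead of JSON, jq -re has nothing to extract and exits non-zero, so the && chain stops and make reports Error 1:

path=$(curl -LH 'Accept: application/json' https://hydra.nixos.org/job/nix-pills/master/html-split/latest | jq -re '.buildoutputs.out.path') && \
nix-store -r $path && \
ln -sfn $path/share/doc/nix-pills nixos/nix-pills-raw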

So I tested connecting to the hydra.nixos.org server (ceres), which went fine. I then started to suspect a problem with the database server (chef). I tried to SSH in and couldn't connect. I took a look at the metrics for chef, and ... well:

https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576857600000&to=1576861200000


It is down and has been for a bit.

I have since escalated to @edolstra and @rbvermaa, who have keys to this particular part of the castle.

@grahamc changed the title from "Hydra seems to be down" to "Hydra's database server is down" on Dec 20, 2019
@grahamc
Member

grahamc commented Dec 20, 2019

I don't see any related incidents: https://www.hetzner-status.de/en.html

@jonringer
Contributor Author

@grahamc thanks for the quick response :)

Also, I sent you an email at graham@grahamc.com :) Please take a look.

@rbvermaa
Member

Tried to reach the database server (chef), but couldn't. Performing a hard reset now.

@rbvermaa
Member

The machine did not come up after the hard reset, nor with the rescue system. Engaged Hetzner support.

@grahamc
Member

grahamc commented Dec 20, 2019

This is incredible timing. Just last week, the infra team was discussing getting a replacement machine for the database, because:

https://status.nixos.org/prometheus/graph?g0.range_input=8w&g0.end_input=2019-12-20%2019%3A02&g0.expr=node_filesystem_avail_bytes%7Bmountpoint%3D%22%2Fdata%22%2Cinstance%3D%22chef%3A9100%22%7D&g0.tab=0

We had put in the request for the new machine yesterday :).

@grahamc pinned this issue on Dec 20, 2019
@grahamc changed the title from "Hydra's database server is down" to "New commits are not being built by Hydra, and channel updates are stopped due to a down database server." on Dec 20, 2019
@grahamc changed the title to "🚨 New commits are not being built by Hydra, and channel updates are stopped due to a down database server. 🚨" on Dec 20, 2019
@grahamc
Member

grahamc commented Dec 20, 2019

I updated the "hydra is down "error page at https://hydra.nixos.org/ in NixOS/infra@72e03ec to link the monitoring infra and tickets with the label "infrastructure". The infra team is discussing options in case Hetzner support doesn't reply soon.

@grahamc
Member

grahamc commented Dec 20, 2019

We have heard back from support about the database machine (chef): they say an SSD has failed. The machine has been booted into a rescue console to find out whether the failed disk was part of the root filesystem or the /data array.

@grahamc
Member

grahamc commented Dec 20, 2019

We brought chef up successfully and the /data mirror is indeed degraded. We're considering next steps.

@grahamc
Member

grahamc commented Dec 20, 2019

We're preparing to take a backup directly off the disk, and then we will replace the failed disk.

@grahamc
Member

grahamc commented Dec 20, 2019

The first, block-wise backup is on its way:

pv /dev/md127 | gzip | mbuffer -q | ssh «remote» 'mbuffer -q | gzip -d > /hydradbrecv/storage/dd'

It's transferring at roughly 100 MiB/s, with an ETA of about 1h30m.
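
As a sanity check on that ETA (assuming the whole ~447 GiB array is read end to end): 447 GiB ÷ 100 MiB/s ≈ 4600 s ≈ 76 minutes of raw transfer, so about 1h30m including pipeline and network overhead is plausible.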

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        4.77G   883G      0    238    118  12.8M
  mirror                           2.39G   442G      0    119     59  6.41M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0     59     29  3.21M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0     60     29  3.21M
  mirror                           2.38G   442G      0    118     59  6.40M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0     59     29  3.20M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0     59     29  3.20M

@grahamc
Member

grahamc commented Dec 20, 2019

gzip was a bad choice (it was saturating a core), so I switched to lz4, which isn't taking 100% of a core:

pv /dev/md127 | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/dd'

It's transferring at roughly 350 MiB/s, with an ETA of about 30m.

The receiving end is a ZFS pool:

# zpool iostat -v 1
                                     capacity     operations     bandwidth
pool                               alloc   free   read  write   read  write
---------------------------------  -----  -----  -----  -----  -----  -----
hydradbrecv                        2.76G   885G      0  2.17K      0   118M
  mirror                           1.38G   443G      0  1.12K      0  59.5M
    pci-0000:01:00.0-scsi-0:0:1:0      -      -      0    572      0  29.8M
    pci-0000:01:00.0-scsi-0:0:2:0      -      -      0    579      0  29.8M
  mirror                           1.38G   443G      0  1.04K      0  58.1M
    pci-0000:01:00.0-scsi-0:0:3:0      -      -      0    531      0  29.1M
    pci-0000:01:00.0-scsi-0:0:4:0      -      -      0    535      0  29.1M
---------------------------------  -----  -----  -----  -----  -----  -----

@grahamc
Member

grahamc commented Dec 20, 2019

The dd copy arrived intact:

backup# md5sum ./dd
5db3000dc3486a2e701e976bef0c988f  ./dd

vs.

chef# pv /dev/md127 | md5sum -
 447GiB 0:18:26 [ 413MiB/s] [=================================>] 100%
5db3000dc3486a2e701e976bef0c988f  -

Apparently the block device compresses very well:

backup# du --apparent-size -h ./dd
448G	./dd

backup# du -h ./dd
99G	./dd

I'm also replicating this dd backup elsewhere:

# ssh «remote» 'zfs send -c hydradbrecv/storage@2019-12-20T21:51UTC--dd-only | mbuffer' \
    | mbuffer | zfs recv -s rpool/hydradbstorage
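
A nice property of the -s flag on zfs recv is that an interrupted long-haul transfer can be resumed instead of restarted. A sketch of how that would look with the dataset names above (this step is not shown in the thread; <token> is a placeholder for the value zfs reports):

# On the receiving machine, read the saved resume token from the partially received dataset:
zfs get -H -o value receive_resume_token rpool/hydradbstorage
# Resume the stream from the sender using that token instead of starting over:
ssh «remote» 'zfs send -t <token> | mbuffer' | mbuffer | zfs recv -s rpool/hydradbstorage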

On to a pg_dumpall-based backup. We'll first block external access to Postgres via iptables, to prevent any services from causing IO and stressing the disk further:

ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -I nixos-fw -p tcp --dport 5432 -j DROP
iptables -I nixos-fw -p tcp --dport 5432 -j DROP -i wg0
iptables -I nixos-fw -p tcp --dport 5432 -j DROP

then start postgresql and:

pg_dumpall | lz4 | mbuffer -q | ssh «remote» 'mbuffer -q | lz4 -d > /hydradbrecv/storage/pg_dumpall.sql'
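
Once the dump is safely off the machine, the matching deletes would presumably be used to reopen access (not shown in the thread; same rule arguments as above, with -I swapped for -D):

# Remove the temporary DROP rules so services can reach Postgres again.
iptables -D nixos-fw -p tcp --dport 5432 -j DROP
iptables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP
ip6tables -D nixos-fw -p tcp --dport 5432 -j DROP -i wg0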

@grahamc
Member

grahamc commented Dec 20, 2019

pg_dumpall finished:

# du --apparent-size -h ./pg_dumpall.sql
109G	./pg_dumpall.sql

# du -h ./pg_dumpall.sql
48G	./pg_dumpall.sql

I took a snapshot and will be sending this to my secondary backup machine as soon as the first transfer finishes (it likely won't finish before morning, as it is going much farther than Germany to Amsterdam, and also not to a fancypants Packet datacenter).


We've now requested that Hetzner replace the failed disk. We should be back online once the RAID array recovers. No word yet on the new server we ordered.

@grahamc
Member

grahamc commented Dec 20, 2019

It looks like Hetzner took the machine down to replace the disk at 2019-12-21 00:20 GMT:
https://status.nixos.org/grafana/d/5LANB9pZk/per-instance-metrics?orgId=1&var-instance=chef:9100&from=1576882800000&to=1576884600000

@grahamc
Member

grahamc commented Dec 20, 2019

[>....................]  recovery =  0.5% (2366080/468719424) finish=39.4min speed=197173K/sec

[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 00:39:38 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 12% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 691

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1
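
The "spare rebuilding" state implies the replacement disk was partitioned and added back into the degraded array. Those commands aren't shown in the thread, but a conventional sequence would look roughly like this (device names taken from the mdadm output; treat the exact steps as an assumption):

# Copy the partition layout from the surviving disk to the replacement disk:
sfdisk -d /dev/sdc | sfdisk /dev/sdd
# Add the new partition to the degraded mirror; md starts rebuilding onto it automatically:
mdadm --manage /dev/md127 --add /dev/sdd1
# Watch the rebuild progress:
watch -n 5 cat /proc/mdstat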

@grahamc
Member

grahamc commented Dec 21, 2019

           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:05:43 2019
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 79% complete

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1026

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      spare rebuilding   /dev/sdd1

@grahamc
Member

grahamc commented Dec 21, 2019

We have a clean array!

[root@chef:~]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Mon Mar 19 14:49:55 2018
        Raid Level : raid1
        Array Size : 468719424 (447.01 GiB 479.97 GB)
     Used Dev Size : 468719424 (447.01 GiB 479.97 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Dec 21 01:13:57 2019
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : chef:data  (local to host chef)
              UUID : c3e8f78f:39d17bae:d1522247:19c60b91
            Events : 1126

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       2       8       49        1      active sync   /dev/sdd1

@grahamc
Member

grahamc commented Dec 21, 2019

We're back!

https://status.nixos.org/grafana/d/MJw9PcAiz/hydra-jobs?orgId=1&from=1576888500000&to=1576889100000

I'm closing this issue. We will be getting a new machine for the database within the next week, to address the next impending-doom problem:

[root@chef:~]# df -ha /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      439G  426G   14G  97% /data

@jonringer
Contributor Author

Thank you @grahamc for your hard work. I (as well as many others) appreciate it :)
