Unrecoverable memory corruption if db concurrency > 6 #542

Closed
ixmatus opened this issue Mar 8, 2018 · 7 comments

@ixmatus

ixmatus commented Mar 8, 2018

We're having a fairly serious issue with our Hydra deployment: occasionally (we haven't tracked down the cause), a double free or other memory-related crash occurs that subsequent restarts of hydra-queue-runner can't recover from (i.e. it crashes again immediately on start):

Mar 08 12:16:17 hydra hydra-queue-runner[39630]: warning: 14 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: *** Error in `hydra-queue-runner': double free or corruption (fasttop): 0x00007efd78035e50 ***
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: ======= Backtrace: =========
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/g1g31ah55xdia1jdqabv1imf6mcw0nb1-glibc-2.25-49/lib/libc.so.6(+0x711e6)[0x7efdee7cd1e6]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/g1g31ah55xdia1jdqabv1imf6mcw0nb1-glibc-2.25-49/lib/libc.so.6(+0x775c6)[0x7efdee7d35c6]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/g1g31ah55xdia1jdqabv1imf6mcw0nb1-glibc-2.25-49/lib/libc.so.6(+0x77dce)[0x7efdee7d3dce]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: hydra-queue-runner(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x46)[0x42aab6]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x138)[0x7efdefc171a8]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x138579)[0x7efdefbc7579]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11callSuccessINS_3refINS_13ValidPathInfoEEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEEOS5_+0x55)[0x7efdefc1ddd5]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c548)[0x7efdefc1b548]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c6cf)[0x7efdefc1b6cf]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix10sync2asyncISt10shared_ptrINS_13ValidPathInfoEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEERKS4_IFS5_vEE+0x4b)[0x7efdefb9eedb]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11RemoteStore21queryPathInfoUncachedERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvSt10shared_ptrINS_13ValidPathInfoEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x4b)[0x7efdefbf5b1b]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x5d3)[0x7efdefc17643]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x138579)[0x7efdefbc7579]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11callSuccessINS_3refINS_13ValidPathInfoEEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEEOS5_+0x55)[0x7efdefc1ddd5]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c548)[0x7efdefc1b548]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c6cf)[0x7efdefc1b6cf]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix10sync2asyncISt10shared_ptrINS_13ValidPathInfoEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEERKS4_IFS5_vEE+0x4b)[0x7efdefb9eedb]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11RemoteStore21queryPathInfoUncachedERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvSt10shared_ptrINS_13ValidPathInfoEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x4b)[0x7efdefbf5b1b]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x5d3)[0x7efdefc17643]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x138579)[0x7efdefbc7579]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11callSuccessINS_3refINS_13ValidPathInfoEEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEEOS5_+0x55)[0x7efdefc1ddd5]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c548)[0x7efdefc1b548]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c6cf)[0x7efdefc1b6cf]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix10sync2asyncISt10shared_ptrINS_13ValidPathInfoEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEERKS4_IFS5_vEE+0x4b)[0x7efdefb9eedb]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11RemoteStore21queryPathInfoUncachedERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvSt10shared_ptrINS_13ValidPathInfoEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x4b)[0x7efdefbf5b1b]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x5d3)[0x7efdefc17643]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x138579)[0x7efdefbc7579]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11callSuccessINS_3refINS_13ValidPathInfoEEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEEOS5_+0x55)[0x7efdefc1ddd5]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c548)[0x7efdefc1b548]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c6cf)[0x7efdefc1b6cf]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix10sync2asyncISt10shared_ptrINS_13ValidPathInfoEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEERKS4_IFS5_vEE+0x4b)[0x7efdefb9eedb]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11RemoteStore21queryPathInfoUncachedERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvSt10shared_ptrINS_13ValidPathInfoEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x4b)[0x7efdefbf5b1b]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x5d3)[0x7efdefc17643]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x138579)[0x7efdefbc7579]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11callSuccessINS_3refINS_13ValidPathInfoEEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEEOS5_+0x55)[0x7efdefc1ddd5]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c548)[0x7efdefc1b548]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x18c6cf)[0x7efdefc1b6cf]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix10sync2asyncISt10shared_ptrINS_13ValidPathInfoEEEEvRKSt8functionIFvT_EERKS4_IFvNSt15__exception_ptr13exception_ptrEEERKS4_IFS5_vEE+0x4b)[0x7efdefb9eedb]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix11RemoteStore21queryPathInfoUncachedERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvSt10shared_ptrINS_13ValidPathInfoEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x4b)[0x7efdefbf5b1b]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store13queryPathInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvNS_3refINS_13ValidPathInfoEEEEES9_IFvNSt15__exception_ptr13exception_ptrEEE+0x5d3)[0x7efdefc17643]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(+0x1394ee)[0x7efdefbc84ee]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store16computeFSClosureERKSt3setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4lessIS7_ESaIS7_EERSB_bbb+0x1ca)[0x7efdefbc807a]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/fipq97y7kcbhgvgarzi7387dird5qn79-nix-1.12pre5788_e3013543/lib/libnixstore.so(_ZN3nix5Store16computeFSClosureERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERSt3setIS6_St4lessIS6_ESaIS6_EEbbb+0xce)[0x7efdefbc923e]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: hydra-queue-runner[0x471e6f]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: hydra-queue-runner[0x453e14]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: hydra-queue-runner[0x45766c]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: hydra-queue-runner[0x44f2b7]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/rf22nb44kggi8r4fbw5xa0bh79fimyj7-gcc-6.4.0-lib/lib/libstdc++.so.6(+0xb76df)[0x7efdef0db6df]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/g1g31ah55xdia1jdqabv1imf6mcw0nb1-glibc-2.25-49/lib/libpthread.so.0(+0x7234)[0x7efdf007e234]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: /nix/store/g1g31ah55xdia1jdqabv1imf6mcw0nb1-glibc-2.25-49/lib/libc.so.6(clone+0x3f)[0x7efdee84475f]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: ======= Memory map: ========
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 00400000-0048b000 r-xp 00000000 fe:05 2217217184                         /nix/store/i9hj8a3lv8lyn90kv1dhmpmqvgpvcs8w-hydra-2017-11-21/bin/hydra-queue-runner
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 0068b000-0068d000 r--p 0008b000 fe:05 2217217184                         /nix/store/i9hj8a3lv8lyn90kv1dhmpmqvgpvcs8w-hydra-2017-11-21/bin/hydra-queue-runner
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 0068d000-0068e000 rw-p 0008d000 fe:05 2217217184                         /nix/store/i9hj8a3lv8lyn90kv1dhmpmqvgpvcs8w-hydra-2017-11-21/bin/hydra-queue-runner
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 010a8000-010fe000 rw-p 00000000 00:00 0                                  [heap]
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd4c000000-7efd4c021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd4c021000-7efd50000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd54000000-7efd54021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd54021000-7efd58000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd58000000-7efd58021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd58021000-7efd5c000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd5c000000-7efd5c038000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd5c038000-7efd60000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd60000000-7efd60021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd60021000-7efd64000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd64000000-7efd64021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd64021000-7efd68000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd68000000-7efd6803f000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd6803f000-7efd6c000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd6c000000-7efd6c065000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd6c065000-7efd70000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd70000000-7efd7005f000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd7005f000-7efd74000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd74000000-7efd74021000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd74021000-7efd78000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd78000000-7efd7804c000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd7804c000-7efd7c000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd7c000000-7efd7c053000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd7c053000-7efd80000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd80000000-7efd8004f000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd8004f000-7efd84000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd84000000-7efd84057000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd84057000-7efd88000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd88000000-7efd88072000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd88072000-7efd8c000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd8c000000-7efd8c068000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd8c068000-7efd90000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd90000000-7efd90067000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd90067000-7efd94000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd94000000-7efd9404d000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9404d000-7efd98000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd987f9000-7efd987fa000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd987fa000-7efd98ffa000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd98ffa000-7efd98ffb000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd98ffb000-7efd997fb000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd997fb000-7efd997fc000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd997fc000-7efd99ffc000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd99ffc000-7efd99ffd000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd99ffd000-7efd9a7fd000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9a7fd000-7efd9a7fe000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9a7fe000-7efd9affe000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9affe000-7efd9afff000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9afff000-7efd9b7ff000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9b7ff000-7efd9b800000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9b800000-7efd9c000000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9c000000-7efd9c06b000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efd9c06b000-7efda0000000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efda07f9000-7efda07fa000 ---p 00000000 00:00 0
Mar 08 12:16:17 hydra hydra-queue-runner[39630]: 7efda07fa000-7efda0ffa000 rw-p 00000000 00:00 0
Mar 08 12:16:17 hydra systemd[1]: hydra-queue-runner.service: Main process exited, code=killed, status=6/ABRT
Mar 08 12:16:17 hydra systemd[1]: hydra-queue-runner.service: Unit entered failed state.
Mar 08 12:16:17 hydra systemd[1]: hydra-queue-runner.service: Failed with result 'signal'.
Mar 08 12:16:17 hydra systemd[1]: hydra-queue-runner.service: Service hold-off time over, scheduling restart.
Mar 08 12:16:17 hydra systemd[1]: Stopped hydra-queue-runner.service.
Mar 08 12:16:17 hydra systemd[1]: Started hydra-queue-runner.service.
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 7 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 8 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 9 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 10 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 11 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 12 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 13 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 14 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 15 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 15 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 16 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 17 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra hydra-queue-runner[39808]: warning: 18 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:18 hydra systemd[1]: hydra-queue-runner.service: Main process exited, code=killed, status=11/SEGV
Mar 08 12:16:18 hydra systemd[1]: hydra-queue-runner.service: Unit entered failed state.
Mar 08 12:16:18 hydra systemd[1]: hydra-queue-runner.service: Failed with result 'signal'.
Mar 08 12:16:19 hydra systemd[1]: hydra-queue-runner.service: Service hold-off time over, scheduling restart.
Mar 08 12:16:19 hydra systemd[1]: Stopped hydra-queue-runner.service.
Mar 08 12:16:19 hydra systemd[1]: Started hydra-queue-runner.service.
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 7 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 8 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 9 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 10 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 11 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 12 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 13 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 14 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 15 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 16 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 17 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra hydra-queue-runner[40014]: warning: 18 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:19 hydra systemd[1]: hydra-queue-runner.service: Main process exited, code=killed, status=6/ABRT
Mar 08 12:16:20 hydra systemd[1]: hydra-queue-runner.service: Unit entered failed state.
Mar 08 12:16:20 hydra systemd[1]: hydra-queue-runner.service: Failed with result 'signal'.
Mar 08 12:16:20 hydra systemd[1]: hydra-queue-runner.service: Service hold-off time over, scheduling restart.
Mar 08 12:16:20 hydra systemd[1]: Stopped hydra-queue-runner.service.
Mar 08 12:16:20 hydra systemd[1]: Started hydra-queue-runner.service.
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 7 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 8 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 10 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 11 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 12 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 13 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 14 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 14 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 15 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 16 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 17 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:20 hydra hydra-queue-runner[40200]: warning: 18 concurrent database updates; PostgreSQL may be stalled
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Main process exited, code=killed, status=6/ABRT
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Unit entered failed state.
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Failed with result 'signal'.
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Service hold-off time over, scheduling restart.
Mar 08 12:16:21 hydra systemd[1]: Stopped hydra-queue-runner.service.
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Start request repeated too quickly.
Mar 08 12:16:21 hydra systemd[1]: Failed to start hydra-queue-runner.service.
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Unit entered failed state.
Mar 08 12:16:21 hydra systemd[1]: hydra-queue-runner.service: Failed with result 'signal'.

The only way we've been able to recover from this error is by reducing max_db_concurrency to 6, letting Hydra chew through builds for a while, then increasing the max_db_concurrency.

We can't leave the max_db_concurrency at 6 because Hydra can't otherwise keep up with build volume at all and ends up taking many hours to schedule new builds being added by the evaluator.

Has anyone else encountered this before? Is it a PostgreSQL configuration issue?

@edolstra could you weigh in? I've asked on IRC and no one else has any insight into these issues.

@ixmatus
Author

ixmatus commented Mar 8, 2018

Help would be appreciated because this issue is murdering us right now.

@ixmatus
Author

ixmatus commented Mar 8, 2018

CC: @domenkozar

@domenkozar
Member

I haven't seen this one before, but I'd recommend capturing a core dump that can be inspected.

@edolstra
Member

edolstra commented Mar 9, 2018

This should be fixed by NixOS/nix@24b7398.

@ixmatus
Author

ixmatus commented Mar 9, 2018

@edolstra thank you for the quick response! I will apply that patch to our configuration and test it today.

I'm providing more context about our configuration for posterity.

We are pinned to the following version of nixpkgs:

{
  "url": "https://github.com/NixOS/nixpkgs.git",
  "rev": "74286ec9e76be7cd00c4247b9acb430c4bd9f1ce",
  "date": "2018-01-15T12:35:29-05:00",
  "sha256": "13ydgpzl5nix4gc358iy9zjd5nrrpbpwpxmfhis4aai2zmkja3ak",
  "fetchSubmodules": true
}

This is the version of Hydra we are running: 2017-11-21 (using nix-1.12pre5788_e3013543)

We have a few patches applied to the Hydra derivation: one for building pull requests off of our GitHub Enterprise instance, one fixing a hydra-notify script error, and one trimming GitHub PR store paths.

Our Hydra configuration is:

build-fallback = true
max_db_connections = 128
max_output_size = 4294967296
        
<githubstatus>
  jobs = <redacted>
  github = <redacted>
  inputs = src
  authorization = <redacted>
</githubstatus>

Our PostgreSQL configuration (some of which is cribbed from https://github.com/NixOS/nixos-org-configurations/blob/master/delft/chef.nix) is:

log_min_duration_statement = 5000
log_duration = off
log_statement = 'none'
max_connections = 250
work_mem = 16MB
shared_buffers = 4GB

# Checkpoint every 256 MB.
min_wal_size = 128MB
max_wal_size = 256MB

# We can risk losing some transactions.
synchronous_commit = off
effective_cache_size = 16GB

# We're on SSDs, random access isn't as expensive
random_page_cost = 1
effective_io_concurrency = 5

The Nix configuration has maxJobs = 10 and buildCores = 4.

The host is a bare metal machine and it has 251GB of high-speed RAM, two 500GB SSDs, and one 500GB SSD on NVMe.

The host's CPU hardware is:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping:            4
CPU MHz:             2799.829
CPU max MHz:         2800.0000
CPU min MHz:         1200.0000
BogoMIPS:            5599.96
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            25600K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts

Hydra is also configured to use four build slaves (one of which is the Hydra host itself). Here is the machines configuration:

    hydra-queue-runner@hydra x86_64-linux /etc/hydra-queue-runner/hydra-queue-runner_rsa 10 1 local
    hydra-queue-runner@hydra-slave03 x86_64-linux /etc/hydra-queue-runner/hydra-queue-runner_rsa 8 10 kvm,big-parallel,nixos-test
    hydra-queue-runner@hydra-slave04 x86_64-linux /etc/hydra-queue-runner/hydra-queue-runner_rsa 8 10 kvm,big-parallel,nixos-test
    nixbuilder@nix-osx-builder x86_64-darwin /etc/hydra-queue-runner/hydra-queue-runner_rsa 4 10

@ixmatus
Author

ixmatus commented Mar 9, 2018

I tried to debug the hydra-queue-runner crashes by running it with valgrind; unfortunately (and happily, since we have a way to stabilize Hydra now) that eliminated the crashes entirely.

I will test this patch by not running hydra-queue-runner in valgrind and see if a crash occurs. Crashes weren't easy to reproduce, but they were common enough under load during the workday that we'd see at least two or three in a day.

I will report back on this issue once we've seen whether hydra-queue-runner runs stably under heavy load.

@ixmatus
Author

ixmatus commented Mar 9, 2018

I can confirm that hydra-queue-runner has been running all day on a nixUnstable patched with the fix from @edolstra that closed this issue, and we have seen no memory-corruption-related crashes.

Additionally, we had been observing another issue that we didn't think was related: hydra-queue-runner would wedge without any feedback as to what happened. We have not seen that at all today either.

More time will tell but I think @edolstra's fix was the ticket.

edolstra added a commit to NixOS/nix that referenced this issue Apr 11, 2018
It was holding on to a Value* (i.e. a std::shared_ptr<ValidPathInfo>*)
outside of the pathInfoCache lock, so the std::shared_ptr could be
destroyed between the release of the lock and the decrement of the
std::shared_ptr refcount. This can happen if more than
'path-info-cache-size' paths are added in the meantime, *or* if
clearPathInfoCache() is called. The hydra-queue-runner queue monitor
thread periodically calls the latter, so is likely to trigger a crash.

Fixes NixOS/hydra#542.

(cherry picked from commit 24b7398)
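
The race described in the commit message can be illustrated with a minimal sketch. This is not the actual Nix patch or Nix's real cache class: `PathInfo`, `PathInfoCache`, and `entrySurvivesClear` are hypothetical names made up for illustration. The sketch shows one way to avoid this kind of bug, assuming a mutex-protected map: the lookup copies the `std::shared_ptr` while the lock is still held, so the caller owns its own reference and a later `clear()` (or eviction) cannot destroy the object underneath it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for nix::ValidPathInfo.
struct PathInfo {
    std::string path;
};

// Hypothetical, simplified cache; an assumption for illustration only.
class PathInfoCache {
public:
    void upsert(const std::string & key, std::shared_ptr<PathInfo> value) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[key] = std::move(value);
    }

    // Analogous to what the commit message calls clearPathInfoCache().
    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.clear();
    }

    // Returning a raw pointer to the map slot (a std::shared_ptr<PathInfo>*)
    // would let the caller touch the slot after the lock is released, which
    // is the hazard the commit message describes. Instead, copy the
    // shared_ptr under the lock so the refcount is incremented before any
    // other thread can erase the entry.
    std::shared_ptr<PathInfo> get(const std::string & key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(key);
        if (it == entries_.end()) return nullptr;
        return it->second;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::shared_ptr<PathInfo>> entries_;
};

// Small self-check: a value obtained from get() stays valid after clear().
bool entrySurvivesClear() {
    PathInfoCache cache;
    cache.upsert("/nix/store/example",
                 std::make_shared<PathInfo>(PathInfo{"/nix/store/example"}));
    std::shared_ptr<PathInfo> info = cache.get("/nix/store/example");
    cache.clear();  // the cache drops its reference; ours keeps the object alive
    assert(info != nullptr);
    return info->path == "/nix/store/example";
}
```

Returning the value by copy costs one atomic refcount increment per lookup, which is usually a small price for removing the window between unlocking and taking ownership.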