Failure to actually run the builds in the queue #433

FPtje · 2016-12-16T17:07:25Z

Problem description

Our private hydra build server (with @basvandijk) has had its database cleaned and currently holds one project with one jobset containing one (fresh) job. This job is put inside the Queue (as seen in Status > Queue), but it is never actually built.

When looking at the job's Summary page, the Status shows Scheduled to be built, part of evaluation 1. The build steps tab is empty. The build dependencies tab, though, shows a whole bunch of dependencies.

The Status > Latest steps page is empty, as is the Status > Latest builds. The Status > Running builds says there are no running builds. The machines in Machine status are shown as Idle.

Background

The issue has started since hydra was updated a month or so ago. In a debugging session (together with @basvandijk), the version was upgraded to revision de55303197d997c4fc5, where the issue still occurs. Jobs just wouldn't build. Only when actually running nix-store --realize <drv path> does hydra mark the job as finished.

Somewhat related to this issue is #431, in which @basvandijk describes some error with a Raspberry Pi 3 build server. He thought the Raspberry Pi stuff might cause Hydra to act this weird, but the one job currently active has nothing to do with the Raspberry Pi, which should exclude it as a cause.

Debug information

With all lvlChat print messages changed to lvlInfo (because we're too lazy to figure out how to set verbose mode), the following is printed in the journal when the configuration is switched (and hydra is started):

hydra-queue-runner[9884]: adding new machine ‘root@rpibuild3’
hydra-queue-runner[9884]: checking the queue for builds > 0...
hydra-queue-runner[9884]: loading build 1 (lumi:staging:test-stalling-net)
nix-daemon[1629]: accepted connection from pid 9884, user hydra-queue-runner (trusted)
hydra-queue-runner[9884]: status: {"status":"up","time":1481904748,"uptime":0,"pid":9884,"nrQueuedBuilds":0,"nrUnfinishedSteps":49,"nrRunnableSteps":0,"nrActiveSteps":0,"nrStepsBuilding":0,"nrStepsCopyingTo":0,"nrStepsCopyingFrom":0,"nrStepsWaiting":0,"bytesSent":0,"bytesReceived":0,"nrBuildsRead":0,"buildReadTimeMs":0,"buildReadTimeAvgMs":0,"nrBuildsDone":0,"nrStepsStarted":0,"nrStepsDone":0,"nrRetries":0,"maxNrRetries":0,"nrQueueWakeups":0,"nrDispatcherWakeups":3,"dispatchTimeMs":0,"dispatchTimeAvgMs":0,"nrDbConnections":2,"nrActiveDbUpdates":0,"memoryTokensInUse":0,"machines":{"root@rpibuild3":{"enabled":true,"currentJobs":0,"idleSince":0,"nrStepsDone":0,"disabledUntil":0,"lastFailure":0,"consecutiveFailures":0}},"jobsets":{"lumi:staging":{"shareUsed":0,"seconds":0}},"machineTypes":{},"store":{"narInfoRead":0,"narInfoReadAverted":0,"narInfoMissing":0,"narInfoWrite":0,"narInfoCacheSize":0,"narRead":0,"narReadBytes":0,"narReadCompressedBytes":0,"narWrite":0,"narWriteAverted":0,"narWriteBytes":0,"narWriteCompressedBytes":0,"narWriteCompressionTimeMs":0,"narCompressionSavings":0,"narCompressionSpeed":0}}
lumi-hydra-setup-start[9877]: updating existing user `xxxxxxx'
systemd[1]: Started lumi-hydra-setup.service.
nixos[9650]: finished switching to system configuration /nix/store/5b8d70z8x8kqkbdp55hn747kz2yp7paz-nixos-system-hera-nixos-16.09pre-git_lumi-60950208ee3b2250f46839f8f69027b3234f17a7
hydra-server[9882]: [warn] Unicode::Encoding plugin is auto-applied, please remove this from your appclass and make sure to define "encoding" config
hydra-queue-runner[9884]: added build 1 (top-level step /nix/store/9kf5k2gi719ysbqx7bnz6fnsvkhkawnw-<our-derivation>.drv, 2869 new steps)
hydra-queue-runner[9884]: got 77 new runnable steps from 1 new builds
hydra-queue-runner[9884]: step ‘/nix/store/05dqd3kryhlsg0pw8c3xxmwnwvc8d7fc-bash43-029.drv’ is now runnable

... (similar message for the other 76 new runnable steps)

hydra-queue-runner[9884]: Finished getting queued builds
hydra-server[9882]: DEPRECATION WARNING: The Regex dispatch type is deprecated.
hydra-server[9882]:   It is recommended that you convert Regex and LocalRegex
hydra-server[9882]:   methods to Chained methods. at /nix/store/0npj59p4gdzssrpf6p03j1pjgfi5xi3w-hydra-perl-deps/lib/perl5/site_perl/5.22.2/Catalyst/DispatchType/Regex.pm line 210.
hydra-server[9882]: 2016/12/16-17:12:29 Starman::Server (type Net::Server::PreFork) starting! pid(9882)
hydra-server[9882]: Resolved [*]:3000 to [0.0.0.0]:3000, IPv4
hydra-server[9882]: Binding to TCP port 3000 on host 0.0.0.0 with IPv4
hydra-server[9882]: Setting gid to "122 122 122"
hydra-server[9882]: Starman: Accepting connections at http://*:3000/
hydra-evaluator[9879]: starting evaluation of jobset ‘lumi:staging’
hydra-evaluator[9879]: considering jobset lumi:staging (last checked 1s ago)
nix-daemon[1629]: accepted connection from pid 9925, user hydra
hydra-evaluator[9879]:   jobset is unchanged, skipping
hydra-evaluator[9879]: evaluation of jobset ‘lumi:staging’ succeeded

Subsequently, this is what hydra tells us every now and then afterwards:

hydra-evaluator[9879]: starting evaluation of jobset ‘lumi:staging’
hydra-evaluator[9879]: considering jobset lumi:staging (last checked 1s ago)
nix-daemon[1629]: accepted connection from pid 9955, user hydra
hydra-evaluator[9879]:   jobset is unchanged, skipping
hydra-evaluator[9879]: evaluation of jobset ‘lumi:staging’ succeeded

Dec 16 17:50:12 hera hydra-queue-runner[11345]: status: {"status":"up","time":1481907012,"uptime":300,"pid":11345,"nrQueuedBuilds":1,"nrUnfinishedSteps":2869,"nrRunnableSteps":77,"nrActiveSteps":0,"nrStepsBuilding":0,"nrStepsCopyingTo":0,"nrStepsCopyingFrom":0,"nrStepsWaiting":0,"bytesSent":0,"bytesReceived":0,"nrBuildsRead":1,"buildReadTimeMs":622,"buildReadTimeAvgMs":622,"nrBuildsDone":0,"nrStepsStarted":0,"nrStepsDone":0,"nrRetries":0,"maxNrRetries":0,"nrQueueWakeups":0,"nrDispatcherWakeups":74,"dispatchTimeMs":0,"dispatchTimeAvgMs":0,"nrDbConnections":2,"nrActiveDbUpdates":0,"memoryTokensInUse":0,"machines":{"root@rpibuild3":{"enabled":true,"currentJobs":0,"idleSince":0,"nrStepsDone":0,"disabledUntil":0,"lastFailure":0,"consecutiveFailures":0}},"jobsets":{"lumi:staging":{"shareUsed":0,"seconds":0}},"machineTypes":{"x86_64-linux:local":{"runnable":77,"running":0,"waitTime":23100,"lastActive":0}},"store":{"narInfoRead":0,"narInfoReadAverted":0,"narInfoMissing":0,"narInfoWrite":0,"narInfoCacheSize":0,"narRead":0,"narReadBytes":0,"narReadCompressedBytes":0,"narWrite":0,"narWriteAverted":0,"narWriteBytes":0,"narWriteCompressedBytes":0,"narWriteCompressionTimeMs":0,"narCompressionSavings":0,"narCompressionSpeed":0}}

With some debug messages I figured out that hydra-queue-runner blocks on this line after mentioning that those 77 steps are "now runnable". From that point it never unblocks until the process is manually restarted.

So somehow the steps are marked as "runnable", but nothing is actually built. Restarting hydra doesn't change anything and just reproduces the exact same output as posted above.

Further debugging

I'm actually kind of lost on ideas as to what could cause this. I'll update the issue with more info as I continue, but where should I even look?

In my current setup it's easy to put debug prints in arbitrary places in the code of hydra, if necessary.

The text was updated successfully, but these errors were encountered:

FPtje · 2016-12-16T20:09:19Z

Update:

I took a look at these two lines and added the following debug prints:

printMsg(lvlError, format("Can this happen!?"));
if (!build->finishedInDB) { // FIXME: can this happen?
     printMsg(lvlError, format("YES, THIS CAN HAPPEN"));
     (*builds_)[build->id] = build;
}

I'm glad to be of service and answer the question in the comment, but I'm not yet sure what it means.

This is the contents of the builds table in in PostgreSQL, queried after the above message was observed:

-[ RECORD 1 ]--+-------------------------------------------------------------------------
id             | 1
finished       | 0
timestamp      | 1481897112
project        | lumi
jobset         | staging
job            | <job name>
nixname        | <our job's nix name>
description    | 
drvpath        | /nix/store/9kf5k2gi719ysbqx7bnz6fnsvkhkaqnw-<our-derivation>.drv
system         | x86_64-linux
license        | 
homepage       | 
maintainers    | 
maxsilent      | 7200
timeout        | 36000
ischannel      | 0
iscurrent      | 1
nixexprinput   | lumi
nixexprpath    | release.nix
priority       | 100
globalpriority | 0
starttime      | 
stoptime       | 
iscachedbuild  | 
buildstatus    | 
size           | 
closuresize    | 
releasename    | 
keep           | 0

domenkozar · 2016-12-16T22:45:38Z

@edolstra

phile314-fh · 2016-12-30T10:30:19Z

I see the exact same problem on our hydra instance. Hydra runs on and builds for a normal intel 64bit machine, no cross-compilation is involved. We are using Hydra version de55303 .

cyraxjoe · 2017-03-17T02:00:16Z

I'm in a similar situation and currently I suspect that the problem is the nix-daemon that I'm using from the nix in the system (1.11.4) trying to evaluate the builds, @FPtje which version of nix are you using in the system?

FPtje · 2017-03-17T11:52:42Z

1.11.7

cyraxjoe · 2017-03-30T07:08:44Z

@FPtje the problem that I had, has a much simpler explanation. The hydra-queue-runner wasn't able to find the ssh executable and the jobs were never evaluated. The problem was introduced because I was migrating the service to systemd and the PATH is not inherited by default into the environment. This is in an Ubuntu 16.04 server.

basvandijk · 2017-08-09T10:04:17Z

Due to a tip from @expipiplus1 I managed to fix our long standing issue with hydra not dequeueing certain jobs.

Note that I have always configured hydra as follows:

{ services.hydra.buildMachinesFiles = mkIf (config.nix.buildMachines == []) []; }

If you don't explicitly set services.hydra.buildMachinesFiles the hydra NixOS module will default to /etc/nix/machines which doesn't exist if you don specify any nix.buildMachines. (Also see #432).

In our case we don't set nix.buildMachines meaning that services.hydra.buildMachinesFiles = []; I discovered today that in this case hydra will specify localhost as a default build machine. However by default no supported system features like "kvm" or "nixos-test" are specified.

I suspect that the jobs that fail to dequeue have dependencies with certain requiredSystemFeatures which are not supported by the default localhost build machine.

The solution is to explicitly specify localhost as a build machine with the required system features:

{
  nix.buildMachines = [
    { hostName = "localhost";
      system = "x86_64-linux";
      supportedFeatures = ["kvm" "nixos-test" "big-parallel" "benchmark"];
      maxJobs = 8;
    }
  ];
}

We should probably document this behaviour in the description of the services.hydra.buildMachinesFiles option so other users don't have to go to the trouble we had to go through.

basvandijk mentioned this issue Dec 17, 2016

Set nix.package to the same version that hydra is using #434

Open

dtzWill mentioned this issue Jul 21, 2017

Queue monitor skips builds sometimes: invalid assumption re:incrementing build id? #496

Closed

dtzWill mentioned this issue Aug 8, 2017

When a build step can't be queued, report that in hydra-queue-runner #502

Closed

kquick mentioned this issue Apr 23, 2019

Add output from hydra-queue-runner for unrunnable steps (no machine). #651

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to actually run the builds in the queue #433

Failure to actually run the builds in the queue #433

FPtje commented Dec 16, 2016

FPtje commented Dec 16, 2016 •

edited

domenkozar commented Dec 16, 2016

phile314-fh commented Dec 30, 2016 •

edited

cyraxjoe commented Mar 17, 2017

FPtje commented Mar 17, 2017

cyraxjoe commented Mar 30, 2017

basvandijk commented Aug 9, 2017

Failure to actually run the builds in the queue #433

Failure to actually run the builds in the queue #433

Comments

FPtje commented Dec 16, 2016

Problem description

Background

Debug information

Further debugging

FPtje commented Dec 16, 2016 • edited

domenkozar commented Dec 16, 2016

phile314-fh commented Dec 30, 2016 • edited

cyraxjoe commented Mar 17, 2017

FPtje commented Mar 17, 2017

cyraxjoe commented Mar 30, 2017

basvandijk commented Aug 9, 2017

FPtje commented Dec 16, 2016 •

edited

phile314-fh commented Dec 30, 2016 •

edited