Buildbot administration

Josh Matthews edited this page Aug 13, 2018 · 106 revisions

Salt

We use SaltStack to configure our build master and the slaves. The sources are at: https://github.com/servo/saltfs/. See the in-tree docs and the SaltStack Administration page for more information.

Services hidden behind nginx

Logs can be found for services like intermittent-tracker and intermittent-failure-tracker in /var/log/upstart. These services can be controlled via initctl restart [servicename], while nginx is controlled with service nginx restart.
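As a minimal sketch, the helper below prints the log path for a service. It assumes every Upstart job logs to /var/log/upstart/<servicename>.log, the pattern that homu.log follows later on this page. The function name upstart_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the Upstart log path for a service name.
# Assumes the /var/log/upstart/<servicename>.log naming pattern.
upstart_log() {
    printf '/var/log/upstart/%s.log\n' "$1"
}

# Example usage on the build master (run as root):
#   initctl restart intermittent-tracker
#   less +F "$(upstart_log intermittent-tracker)"
```

The usage lines are left as comments so that sourcing the file never restarts anything by accident.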

Homu (previously Bors)

Homu is the service that watches our PRs for approvals and shepherds them into the buildbot queue. Its sources are at: https://github.com/barosl/homu

It runs.

When updating Homu, it's safer to move the Homu directory to homu_old then allow Homu to recreate it from scratch. This mimics the environment in which the updates were tested.
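The move-aside step could be sketched as below. This page does not say where the Homu directory lives, so the path is passed in as an argument; the function name set_aside_homu is hypothetical, and the refusal to overwrite an existing homu_old is an added safety assumption rather than documented practice.

```shell
#!/bin/sh
# Hypothetical helper: rename <dir> to <dir>_old so Homu can recreate <dir>.
# The caller supplies the directory; its real location is not stated here.
set_aside_homu() {
    dir=${1%/}                      # drop a trailing slash, if any
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    if [ -e "${dir}_old" ]; then
        # Do not clobber a previous backup silently.
        echo "${dir}_old already exists; remove or rename it first" >&2
        return 1
    fi
    mv "$dir" "${dir}_old"
}

# Example usage (the path is a placeholder, not the real location):
#   set_aside_homu /path/to/homu
```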

The queue of jobs can be seen at: http://build.servo.org/homu/queue/all

Buildbot Builders

Their status can be viewed at: http://build.servo.org/buildslaves

Buildbot log files can be found on the individual machines at: /home/servo/buildbot/(master|slave)/twistd.log

If you need access, create a PR against https://github.com/servo/saltfs/ that adds your account to the common/map.jinja file and your SSH public key to the common/ssh folder. To access the machines, log in as root on Linux or OSX; there are no individual accounts on slaves. If you need to test something (e.g., a reftest failure), make sure to su - servo first so that you are working in the same environment as the builds.

IP addresses

  • servo-mac1: 208.52.161.130
  • servo-mac2: 208.52.161.128
  • servo-mac3: 63.135.170.19
  • servo-master1: 52.37.76.55
  • servo-linux1: 52.88.241.130
  • servo-linux2: 52.11.58.66
  • servo-linux3: 52.34.208.74
  • servo-linux-cross1: 52.36.147.44
  • servo-linux-cross2: 52.37.172.87

Decommissioned hosts:

  • servo-master0: 96.126.125.232

The OSX builders are MacStadium quad-core Mac minis with 8 GB of RAM. The Linux machines are EC2 instances (their specs may vary).
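For convenience, the active addresses listed above can be wrapped in a small lookup, sketched here. The function name builder_ip is hypothetical, and the table is only as current as this page.

```shell
#!/bin/sh
# Hypothetical helper: map a host name from the list above to its IP address.
builder_ip() {
    case "$1" in
        servo-mac1)         echo 208.52.161.130 ;;
        servo-mac2)         echo 208.52.161.128 ;;
        servo-mac3)         echo 63.135.170.19 ;;
        servo-master1)      echo 52.37.76.55 ;;
        servo-linux1)       echo 52.88.241.130 ;;
        servo-linux2)       echo 52.11.58.66 ;;
        servo-linux3)       echo 52.34.208.74 ;;
        servo-linux-cross1) echo 52.36.147.44 ;;
        servo-linux-cross2) echo 52.37.172.87 ;;
        *) echo "unknown host: $1" >&2; return 1 ;;
    esac
}

# Example usage: log in as root, then switch to the build user.
#   ssh "root@$(builder_ip servo-linux1)"
#   su - servo
```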

Setting up a new build slave

Follow the instructions for adding a Salt minion.

Finally, add it as appropriate to the buildbot master.cfg file and then run salt servo-master1 state.apply to restart buildbot. Check the logs under /home/servo/buildbot to ensure it started properly and can see the new machine.
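One rough way to check that the master can see the new machine is to count how often its name appears in the master log, as sketched below. The helper name count_slave_mentions and the slave name servo-linux4 are hypothetical, and the sketch assumes only that the slave name shows up somewhere in twistd.log, not any particular log line format.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that mention a given slave name.
#   $1 = log file, $2 = slave name (matched as a fixed string)
count_slave_mentions() {
    grep -c -F -- "$2" "$1"
}

# Example usage on servo-master1 (servo-linux4 is a made-up name):
#   salt servo-master1 state.apply
#   count_slave_mentions /home/servo/buildbot/master/twistd.log servo-linux4
```

A count of 0 right after the restart suggests the slave has not connected yet, or that the name in master.cfg does not match.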

Adding another build flavor

  • bors/bors.cfg (possibly the Homu configuration now; unverified) needs to know about the build flavor (its string name from master.cfg below)
  • buildbot/master/master.cfg should add another type of build, hooked up in all the same ways as e.g., "mac1"
  • salt servo-master1 state.apply

Making changes

See the in-tree docs for information about how to make changes, starting with a fresh clone of the saltfs repo, through building, testing, and deploying. Some additional deploy-specific notes (e.g. clean restarts) are listed here.

If a task requires catching a buildslave between builds, trigger a graceful shutdown. Click the slave name in this list, then log into the Buildbot web UI with the username and password from the secrets doc. In the lower right corner there should be a "graceful shutdown" button. Once you click it, the buildslave will stop accepting new builds and shut down after all running builds are finished.

If there are Buildbot configuration changes, the Buildbot master must be manually restarted. See https://github.com/servo/saltfs/issues/304 for more information about handling this automatically in the future.

⚠️ Wait until Buildbot is not currently running a job before restarting!

Afterwards, always verify that the services are working by tailing the homu and buildbot logs:

# less +F /var/log/upstart/homu.log
# less +F /home/servo/buildbot/master/twistd.log

Seekrits

Secret passwords and account information are stored at the following, secured location: https://docs.google.com/document/d/1bJfq47eGfipX0R-S6rwe8InVNM7TspwlBZiPD7pRvDo/edit

The Secrets document also contains the URL to sign into the Servo AWS account. Any AWS account signin URL can be constructed in the form <account ID number>.signin.aws.amazon.com/console . The account ID number can be found in the user's fully qualified IAM ID, which looks like ID: arn:aws:iam::<account ID number>:user/<username>
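The URL construction described above can be sketched as a small function. The name aws_signin_url is hypothetical, the https:// scheme is an assumption added to the form given above, and the account ID 123456789012 in the usage line is a made-up example.

```shell
#!/bin/sh
# Hypothetical helper: build a sign-in URL from a fully qualified IAM ID.
# An IAM ARN is colon-separated: arn:aws:iam::<account ID number>:user/<username>
aws_signin_url() {
    arn=$1
    # The account ID is the fifth colon-separated field (the fourth is empty).
    account_id=$(printf '%s\n' "$arn" | cut -d: -f5)
    case "$account_id" in
        ''|*[!0-9]*)
            echo "could not find an account ID in: $arn" >&2
            return 1
            ;;
    esac
    # The https:// prefix is an assumption; the page gives only the host and path.
    printf 'https://%s.signin.aws.amazon.com/console\n' "$account_id"
}

# Example usage (made-up account ID and user name):
#   aws_signin_url arn:aws:iam::123456789012:user/alice
```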

Dealing with troubles

If homu is not picking up changes in state for PRs properly, you should first click the Synchronize button in homu for the repo. If that does not work, homu may be restarted by running this on the buildmaster:

# service homu restart

If a build is aborted due to a lost client machine, first go to the waterfall view (http://build.servo.org/waterfall), then click on the build that was aborted. Finally, click on the Rebuild button.

If you need to update the buildbot master with new configuration information or reset the builder info, the most graceful way is to run this on the buildmaster:

# su - servo
# rm /home/servo/buildbot/master/*.pyc
# buildbot restart --clean --nodaemon /home/servo/buildbot/master &

This will wait until the current builds are finished and then restart the buildbot master, hopefully not losing any in-progress information or disconnecting clients unexpectedly.

WITH EXTREME CAUTION, if the buildbot master appears to be in a very bad state, you can restart it with:

# service buildbot-master restart

Note that if buildbot has crashed for some reason, the PID file it leaves behind must be removed or else it will refuse to restart, thinking there is another buildbot-master process running. Run this before restarting:

# rm /home/servo/buildbot/master/twistd.pid
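A slightly more cautious variant, sketched below, removes the PID file only when no process with that PID exists. The function name remove_stale_pidfile is hypothetical. Note that kill -0 only shows that some process has that PID, not that it is buildbot, so a reused PID would make this sketch leave the file in place.

```shell
#!/bin/sh
# Hypothetical helper: delete a PID file only if its process is gone.
remove_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0          # nothing to clean up
    pid=$(cat "$pidfile")
    case "$pid" in
        ''|*[!0-9]*)
            echo "unexpected contents in $pidfile" >&2
            return 1
            ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID exists.
    if kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still running; leaving $pidfile in place" >&2
        return 1
    fi
    rm -f "$pidfile"
}

# Example usage on the buildmaster (run as root):
#   remove_stale_pidfile /home/servo/buildbot/master/twistd.pid
```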

http://trac.buildbot.net/wiki/RunningBuildbotOnWindows knows as much as we do about setting up Buildbot on Windows hosts.

Handling low disk space

There is a scheduled job to delete logs over 5 days old each night on each builder, so hopefully this doesn't happen. If disk space does get low, run the following to delete logs over 5 days old from the builders that have massive logs:

# find /home/servo/buildbot/master/linux-rel -mtime +5 -exec rm -rf {} \;
# find /home/servo/buildbot/master/mac-rel-wpt -mtime +5 -exec rm -rf {} \;
# find /home/servo/buildbot/master/mac-rel-css -mtime +5 -exec rm -rf {} \;
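Before deleting anything, it can help to preview what would be removed. The sketch below lists the matching entries instead of deleting them; the function name list_old_logs is hypothetical, and the -mindepth 1 option (supported by GNU and BSD find) is an addition that keeps the builder directory itself out of the results.

```shell
#!/bin/sh
# Hypothetical helper: list entries older than N days without deleting them.
#   $1 = builder log directory, $2 = age in days (defaults to 5)
list_old_logs() {
    dir=$1
    days=${2:-5}
    # -mindepth 1 skips the directory itself; -print only lists matches.
    find "$dir" -mindepth 1 -mtime "+$days" -print
}

# Example usage on the buildmaster:
#   list_old_logs /home/servo/buildbot/master/linux-rel
#   list_old_logs /home/servo/buildbot/master/mac-rel-wpt 5
```

If the listing looks right, the find commands above can then be run to do the actual deletion.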

Dealing with hosting hiccups/DDOSes / "stuck" homu

Go to the servo/servo page as an administrator and look at the webhooks on settings. For both of the ones tied into build.servo.org, check for any messages that could not be delivered and go ahead and deliver them.

Alternatively, in homu, click on Synchronize. There will be some PRs marked as pending that have no builds going in buildbot. Close or otherwise cancel those PRs and once there is nothing pending, homu will move on to the next one.

Dealing with "Actual commit differs from expected commit"

The git index state has been corrupted due to a forced homu queue entry. Log in to servo-master1, then run:

macOS:

root@servo-master1:~# salt '[builders]' cmd.run 'find /Users/servo/buildbot/slave -path "*/.git/index.lock"'
root@servo-master1:~# salt '[builders]' cmd.run 'find /Users/servo/buildbot/slave -path "*/.git/index.lock" -delete'

Linux:

root@servo-master1:~# salt '[builders]' cmd.run 'find /home/servo/buildbot/slave -path "*/.git/index.lock"'
root@servo-master1:~# salt '[builders]' cmd.run 'find /home/servo/buildbot/slave -path "*/.git/index.lock" -delete'

Here [builders] is a pattern that matches the affected builders (e.g., servo-mac* or servo-linux1). In each pair, the first command only lists the stale lock files and the second deletes them.
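The only difference between the two pairs of commands is the slave directory. As a sketch of running the same check directly on a single builder, the helpers below pick that directory from the kernel name and list any lock files; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: slave directory for a kernel name as printed by uname -s.
slave_root_for() {
    case "$1" in
        Darwin) echo /Users/servo/buildbot/slave ;;
        *)      echo /home/servo/buildbot/slave ;;
    esac
}

# Hypothetical helper: list stale git index locks under a directory.
find_index_locks() {
    find "$1" -path '*/.git/index.lock' -print
}

# Example usage on a builder (lists only; add -delete by hand once reviewed):
#   find_index_locks "$(slave_root_for "$(uname -s)")"
```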

Dealing with excessively long buildbot jobs

In the event that buildbot reports job times that wildly exceed historical job lengths, and the jobs themselves report elapsed times that do not agree with buildbot's times, kill the buildbot process on the affected builder and re-run the command to restart buildbot.
