Buildbot administration
We use SaltStack to configure our build master and the slaves. The sources are at: https://github.com/servo/saltfs/ . See the in-tree docs and the SaltStack Administration page for more information.
Logs for services like intermittent-tracker and intermittent-failure-tracker can be found in /var/log/upstart. These services can be controlled via initctl restart [servicename], while nginx is controlled with service nginx restart.
Homu is the service that watches our PRs for approvals and shepherds them into the buildbot queue. Its sources are at: https://github.com/barosl/homu
It runs.
When updating Homu, it's safer to move the Homu directory to homu_old then allow Homu to recreate it from scratch. This mimics the environment in which the updates were tested.
The queue of jobs can be seen at: http://build.servo.org/homu/queue/all
The status of the buildslaves can be viewed at: http://build.servo.org/buildslaves
Buildbot log files can be found on the individual machines at: /home/servo/buildbot/(master|slave)/twistd.log
If you need access, create a PR against https://github.com/servo/saltfs/, adding your account to the common/map.jinja file and your SSH pubkey to the common/ssh folder. To access the machines, log in as root on Linux or OSX; there are no individual accounts on slaves. If you need to test something (e.g., a reftest failure), make sure to su - servo first to simulate the build environment.
- servo-mac1: 208.52.161.130
- servo-mac2: 208.52.161.128
- servo-mac3: 63.135.170.19
- servo-master1: 52.37.76.55
- servo-linux1: 52.88.241.130
- servo-linux2: 52.11.58.66
- servo-linux3: 52.34.208.74
- servo-linux-cross1: 52.36.147.44
- servo-linux-cross2: 52.37.172.87
Decommissioned hosts:
- servo-master0: 96.126.125.232
The OSX builders are MacStadium quad core mac minis with 8GB of RAM. The Linux machines are EC2 machines (with varying specs?).
Follow the instructions for adding a Salt minion.
Finally, add it as appropriate to the buildbot master.cfg file and then run salt servo-master1 state.apply to restart buildbot. Check the /home/servo/buildbot logs to ensure it started properly and can see the new machine.
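As a rough aid for that log check, the sketch below prints the most recent log lines that mention a given machine. It assumes the master log mentions the machine by name. The helper name recent_mentions and the host servo-linux4 are hypothetical.

```shell
#!/bin/sh
# Show the most recent log lines that mention a given builder name.
# recent_mentions is a hypothetical helper, not part of buildbot or saltfs.

recent_mentions() {
    name=$1
    log=$2
    count=${3:-5}   # number of trailing matches to show (default 5)
    # -F treats the name as a fixed string rather than a regular expression.
    grep -F -- "$name" "$log" | tail -n "$count"
}

# Example, run on the master (servo-linux4 is a made-up host name):
# recent_mentions servo-linux4 /home/servo/buildbot/master/twistd.log
```

If nothing is printed, the master has not logged anything about that name, which may mean the new machine never connected.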
- bors/bors.cfg needs to know about the build flavor (its string name from master.cfg below) (homu???)
- buildbot/master/master.cfg should add another type of build, hooked up in all the same ways as, e.g., "mac1"
- salt servo-master1 state.apply
See the in-tree docs for information about how to make changes, starting with a fresh clone of the saltfs repo, through building, testing, and deploying. Some additional deploy-specific notes (e.g. clean restarts) are listed here.
If a task requires catching a buildslave between builds, trigger a graceful shutdown. Click the slave name in this list, then log into the Buildbot web UI with the username and password from the secrets doc. In the lower right corner there should be a "graceful shutdown" button. Once you click it, the buildslave will stop accepting new builds and shut down after all running builds are finished.
If there are Buildbot configuration changes, the Buildbot master must be manually restarted. See https://github.com/servo/saltfs/issues/304 for more information about handling this automatically in the future.
Afterwards, always verify that the services are working by tailing the homu and buildbot logs:
# less +F /var/log/upstart/homu.log
# less +F /home/servo/buildbot/master/twistd.log
Secret passwords and account information are stored at the following, secured location: https://docs.google.com/document/d/1bJfq47eGfipX0R-S6rwe8InVNM7TspwlBZiPD7pRvDo/edit
The Secrets document also contains the URL to sign into the Servo AWS account.
Any AWS account signin URL can be constructed in the form <account ID number>.signin.aws.amazon.com/console. The account ID number can be found in the user's fully qualified IAM ID, which looks like ID: arn:aws:iam::<account ID number>:user/<username>
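The URL construction described above can be sketched as a pair of shell functions. The function names are hypothetical, and the ARN in the example is a made-up placeholder rather than a real account.

```shell
#!/bin/sh
# Build the console sign-in URL for the account named in an IAM user ARN.
# Function names are illustrative; the example ARN is a placeholder.

# Print the account ID, which is the fifth colon-separated field of the ARN.
account_id_from_arn() {
    printf '%s\n' "$1" | cut -d: -f5
}

# Print the sign-in URL for the account that owns the given ARN.
signin_url_from_arn() {
    printf 'https://%s.signin.aws.amazon.com/console\n' "$(account_id_from_arn "$1")"
}

signin_url_from_arn 'arn:aws:iam::123456789012:user/example'
```

The fourth field of an IAM ARN (the region) is empty, which is why the account ID sits in field five.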
If homu is not picking up changes in state for PRs properly, you should first click the Synchronize button in homu for the repo. If that does not work, homu may be restarted by running this on the buildmaster:
# service homu restart
If a build is aborted due to a lost client machine, first go to the waterfall view (http://build.servo.org/waterfall), then click on the build that was aborted. Finally, click on the Rebuild button.
If you need to update the buildbot master with new configuration information or reset the builder info, the most graceful way is to run this on the buildmaster:
# su - servo
# rm /home/servo/buildbot/master/*.pyc
# buildbot restart --clean --nodaemon /home/servo/buildbot/master &
This will wait until the current builds are finished and then restart the buildbot master, hopefully not losing any in-progress information or disconnecting clients unexpectedly.
WITH EXTREME CAUTION, if the buildbot master appears to be in a very bad state, you can restart it with:
# service buildbot-master restart
Note that if buildbot has crashed for some reason, the PID file it leaves behind must be removed or else it will refuse to restart, thinking there is another buildbot-master process running. Run this before restarting:
# rm /home/servo/buildbot/master/twistd.pid
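Before deleting the PID file, it may be worth confirming that the recorded process is really gone. The following is a minimal sketch under that assumption. pid_is_stale and remove_stale_pidfile are hypothetical helpers, and the check cannot tell whether a live PID has been reused by an unrelated process.

```shell
#!/bin/sh
# Remove a twistd PID file only when no process with that PID is running.
# pid_is_stale and remove_stale_pidfile are hypothetical helper names.

# Succeed if the PID file is missing, empty, or names a process we cannot signal.
pid_is_stale() {
    pidfile=$1
    [ -s "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    # When run as root, a failure here means the process does not exist.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Delete the PID file if it is stale; otherwise leave it alone and fail.
remove_stale_pidfile() {
    if pid_is_stale "$1"; then
        rm -f "$1"
        echo "removed stale PID file: $1"
    else
        echo "process still running, leaving $1 in place"
        return 1
    fi
}

# Example, run on the buildmaster:
# remove_stale_pidfile /home/servo/buildbot/master/twistd.pid
```

If the helper reports that the process is still running, the master has not actually crashed and the PID file should be left alone.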
http://trac.buildbot.net/wiki/RunningBuildbotOnWindows knows as much as we do about setting up Buildbot on Windows hosts.
There is a scheduled job that deletes logs over 5 days old each night on each builder, so disk space should not normally run low. If it does, run the following to delete logs over 5 days old from the builders that have massive logs:
# find /home/servo/buildbot/master/linux-rel -mtime +5 -exec rm -rf {} \;
# find /home/servo/buildbot/master/mac-rel-wpt -mtime +5 -exec rm -rf {} \;
# find /home/servo/buildbot/master/mac-rel-css -mtime +5 -exec rm -rf {} \;
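Since the commands above delete immediately, a dry run that only lists the candidates can be a useful first step. The sketch below is one way to do that. old_logs is a hypothetical helper, and unlike the commands above it matches regular files only.

```shell
#!/bin/sh
# List log files older than a cutoff under a set of builder directories.
# old_logs is a hypothetical helper; it prints paths and deletes nothing.

old_logs() {
    root=$1
    days=$2
    shift 2
    for builder in "$@"; do
        # Skip builder names that have no directory under the root.
        [ -d "$root/$builder" ] || continue
        # -mtime +N matches files last modified more than N days ago.
        find "$root/$builder" -type f -mtime +"$days" -print
    done
}

# Dry run over the builders named in this section.
old_logs /home/servo/buildbot/master 5 linux-rel mac-rel-wpt mac-rel-css
```

Once the list looks right, the same find expression can be given -exec rm -f {} \; in place of -print.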
Go to the servo/servo page as an administrator and look at the webhooks on settings. For both of the ones tied into build.servo.org, check for any messages that could not be delivered and go ahead and deliver them.
Alternatively, in homu, click on Synchronize. There will be some PRs marked as pending that have no builds going in buildbot. Close or otherwise cancel those PRs and once there is nothing pending, homu will move on to the next one.
The git index state has been corrupted due to a forced homu queue entry. Log in to servo-master1, then run:
macOS:
root@servo-master1:~# salt '[builders]' cmd.run 'find /Users/servo/buildbot/slave -path "*/.git/index.lock"'
root@servo-master1:~# salt '[builders]' cmd.run 'find /Users/servo/buildbot/slave -path "*/.git/index.lock" -delete'
linux:
root@servo-master1:~# salt '[builders]' cmd.run 'find /home/servo/buildbot/slave -path "*/.git/index.lock"'
root@servo-master1:~# salt '[builders]' cmd.run 'find /home/servo/buildbot/slave -path "*/.git/index.lock" -delete'
where [builders] is a pattern that will match the affected builders (e.g. servo-mac* or servo-linux1).
In the event that buildbot reports job times that wildly exceed historical job lengths, and the jobs themselves report elapsed times that do not agree with buildbot's times, kill the buildbot process on the affected builder and re-run the command to restart buildbot.