Adding version 2 of the SDW CI scripts #40

Merged: 17 commits, merged on Mar 13, 2024. Showing changes from 12 commits.
INSTALL.md (153 changes: 68 additions & 85 deletions)
This document explains how to install the CI for Securedrop Workstation.

It involves a combination of dom0 and VM configuration on a Qubes installation, as well as steps in
Github.

This guide assumes you'll be running Qubes as a virtual machine on a hypervisor such as VMware.

# Qubes install and initial provisioning

hardware) by running `qvm-start sys-usb`.
5. Update dom0 and install `make`, referring again to
[the next section of SDW docs](https://workstation.securedrop.org/en/stable/admin/install.html#apply-dom0-updates-estimated-wait-time-15-30-minutes)

In our case, we also install `open-vm-tools` and run `sudo systemctl enable vmtoolsd`,
as our scripts use vmtoolsd to issue commands to the dom0 from the VMware API.

```
sudo qubes-dom0-update make open-vm-tools
```

6. Run any updates you see in the Qubes menu and then reboot.
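
   If you prefer the command line, roughly equivalent updates can be applied from a dom0
   terminal. This is a sketch of the standard Qubes update commands, not something specific to
   this repo:

```
sudo qubes-dom0-update
sudo qubesctl --skip-dom0 --templates state.sls update.qubes-vm
```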

7. In dom0, create the sd-dev StandaloneVM. On Qubes 4.2, you can use the fedora-38-xfce template.

```
sudo qvm-create --standalone --template fedora-38 --label red sd-dev
qvm-volume resize sd-dev:root 50G
qvm-volume resize sd-dev:private 20G
```

Also ensure that you check the box to 'Start qube automatically on boot' in the Qubes settings.
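
Alternatively, the autostart preference can be set from a dom0 terminal, which should be
equivalent to the checkbox in the Qubes settings dialog:

```
qvm-prefs sd-dev autostart True
```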

# Install dependencies on sd-dev VM

1. Open a terminal in the sd-dev VM and perform the following steps to install the core dependencies:

```
sudo dnf install rpm-build dnf-plugins-core
```

2. Set up Docker:

```
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -a -G docker user
sudo systemctl enable docker
```
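
As a quick sanity check (assuming the qube has network access, and after logging out and back in
so the group change takes effect), you can confirm Docker works for the unprivileged user:

```
sudo systemctl start docker
docker run --rm hello-world
```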

Set up the sd-dev machine to start automatically at boot.

# Snapshot the VM

At this point, if you're using VMware, you'll want to shut down and snapshot the VM, as it's now
in a good state and could be cloned to make more of them!
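
For example, with VMware Workstation this could be done with `vmrun`; the VM path and snapshot
name below are hypothetical, and the equivalent vSphere/ESXi tooling will differ:

```
vmrun -T ws stop /path/to/qubes-sd-ci-4.2.vmx soft
vmrun -T ws snapshot /path/to/qubes-sd-ci-4.2.vmx clean-base
```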

# Configure the scripts on GitHub

1. Generate a PAT in Github with full `repo:` access and ensure that the PAT is written to
`sd-dev/.sdci-ghp.txt` on the host machine that will execute `run.py`.
This will be used by `status.py`, so that the script can post git commit statuses back to Github.

2. Configure the webhook in your repository for the 'push' event, with the same secret you put in
the systemd file.

The Payload URL of the webhook should be `https://ws-ci-runner.securedrop.org/hook/postreceive` and
the Content type should be `application/json`. Ensure you keep `Enable SSL verification` turned on.

# Test

Test the CI flow with `./run.py --version 4.1 --commit [some commit hash]`.

# Options for `run.py`

There are a few options for `run.py`, which is the main entry point that the webhook service calls.

## `--version [4.1|4.2]`

Set the version number of Qubes you are going to be running on, for example, 4.1 or 4.2.

This helps the script find a VM with that version in its name, to use for the CI run.

## `--commit [sha]`

If you pass a commit hash, it will be understood that you want to run CI tests against that commit.

## `--snapshot [id]`

If you pass this option, the VM will be reverted to the given snapshot (if it exists) before being
powered up.

If you do not pass this option, a snapshot ID will be read from the config file for this
VM, and the VM will be restored to that snapshot instead. (There is never a scenario whereby
the VM is *not* restored from a snapshot first, as that is our way of guaranteeing a 'clean
start'.)

## `--update`

If you pass this flag, the system will boot the Qubes VM and run dom0, template and StandaloneVM
updates via salt in the standard Qubes way.

If you also passed `--commit`, it will be understood that you want to run CI tests immediately
after having applied the updates. In this case, it will reboot the VM after applying updates
but before running the CI test suite. This flow is useful for running 'nightly' tests.

## `--save`

If you pass this flag, the system will save a new snapshot of the VM and store the new snapshot
ID in the config file. This option is mainly meant to be used in conjunction with `--update`,
e.g. as part of an automatic routine patching procedure.


# Options for `nightlies.py`

The `nightlies.py` script is designed to be run via cron or a similar scheduler. It takes
`--branch` as an argument.

It will clone the repo, check out that branch, detect the appropriate Qubes version from that
branch, detect the latest commit, then run `run.py` with the `--update` flag and the `--commit`
hash.

This is designed to apply software updates in Qubes, stop/start the guest and then proceed with
CI.
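
As a sketch, a crontab entry on the host might look like the following; the repository path,
branch name and log location are all assumptions:

```
# Run the SDW CI nightlies at 02:00 every day
0 2 * * * cd /opt/securedrop-workstation-ci && ./nightlies.py --branch main >> /var/log/sdci-nightlies.log 2>&1
```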
README.md (77 changes: 25 additions & 52 deletions)

## About

This collection of scripts is for running the securedrop-workstation CI on a hypervisor that is
running Qubes virtual machines.

## Installation instructions

Please see the [INSTALL.md](INSTALL.md).

![Architecture diagram](SD_Qubes_CI.png)

1. The webhook in Github delivers the payload to a remote server via HTTPS.

2. The server passes that payload to a Flask service that parses the payload. This service
then posts a commit status to Github saying the build is 'queued'.

3. The Flask service executes the `run.py` script, which makes calls to a hypervisor (currently
VMware) to find a Qubes VM with a matching version, restore it from snapshot and boot it.

4. The script adds various files to the dom0 and the sd-dev StandaloneVM on that Qubes VM.

5. The script then instructs dom0 to run a command on the sd-dev StandaloneVM to clone the
SDW CI repository and then issue an RPC call to the dom0 to run the `dom0/runner.py`
script.

6. The runner.py reports a commit status back to Github (via sd-dev) that the build has started.

7. The runner.py tarballs up the codebase from the sd-dev VM, and proceeds with the
`make clone; make dev; make test` sequence, logging all output to a log file.

The runner.py detects whether each of these commands succeeds or fails. If a step fails, the
whole procedure is halted. In either case, a commit status is sent back to Github indicating
whether it was a success or a failure.

8. At the end of the process, the server copies the log file from dom0 and stores it in the
same place that the commit status links to, for viewing later.

9. The Qubes VM is then powered off.

## Parallelization

The server iterates over the Qubes VMs until it finds one that is powered off. If a VM is
powered off, it is assumed to be available for use.

If all Qubes VMs with the matching version are powered on, it is assumed that they are all busy
running CI jobs already. In this case, the script sleeps and keeps retrying periodically, for up
to 1 hour.