Merge pull request #620 from rgooch/doc-imports
Add Birther document.
rgooch committed Jun 25, 2019
2 parents b9110ea + a93e5f8 commit 02559a5
Showing 3 changed files with 165 additions and 3 deletions.
6 changes: 3 additions & 3 deletions design-docs/Dominator/README.md
@@ -373,7 +373,7 @@ The *planned* image is only pushed if the *active* image is the same as the *req
Fast, Secure re-Imaging
-----------------------

This section posits a [***Birther***](https://docs.google.com/document/d/1y7rPTuG145fdPhqaCdLu_03D1dC1UTRzyEq61ygIQeg/pub) system which is designed to leverage the **Dominator** to deploy images. This system is not implemented, so this section is currently a guide to how it would work.
This section posits a [**Birther**](../MachineBirthing/README.md) system which is designed to leverage the **Dominator** to deploy images. This system is not implemented, so this section is currently a guide to how it would work.

When a machine is re-provisioned for a different purpose, it may be wise to *re-image* it (wipe the file-system and re-install). This is typically done by sending a machine back to the **Birther** which already takes care of creating file-systems and installing the OS image. This is an expensive operation as it requires fetching the full OS image across the network.

@@ -410,7 +410,7 @@ As stated earlier, a single **Dominator** system and a single **Image Server** s

- SSD storage with 500 MB/s write throughput

In this environment, it would be possible to perform a complete system upgrade (such as when [**birthing**](https://docs.google.com/document/d/1y7rPTuG145fdPhqaCdLu_03D1dC1UTRzyEq61ygIQeg/pub) a machine) for a single machine in 2 seconds. When birthing many machines, the limiting factor is downloading the system image, as this is the largest component of network traffic. Thus, 3,600 machines per hour can be birthed *without any peer-to-peer enhancements*, with the limiting factor being bandwidth out of the **Image Server**.
In this environment, it would be possible to perform a complete system upgrade (such as when [**birthing**](../MachineBirthing/README.md) a machine) for a single machine in 2 seconds. When birthing many machines, the limiting factor is downloading the system image, as this is the largest component of network traffic. Thus, 3,600 machines per hour can be birthed *without any peer-to-peer enhancements*, with the limiting factor being bandwidth out of the **Image Server**.

A typical “large” system image upgrade changes less than 10% of the files on the system, which would require less than 100 MB of network traffic to each **sub**, which can be transferred in 0.1 seconds at maximum network speed. For such a change, 10 machines per second could be upgraded, which would be 1,000 seconds (under 17 minutes) for an upgrade of all 10,000 machines in the cluster. Again, this is *without any peer-to-peer enhancements*.

@@ -441,7 +441,7 @@ Polling speed is optimised since each **sub** stores a generation count of the f
Birthing Machines
=================

The **Dominator** system may be used to optimise the birthing of machines. The [***Birther***](https://docs.google.com/document/d/1y7rPTuG145fdPhqaCdLu_03D1dC1UTRzyEq61ygIQeg/pub) system would install a minimal payload on a machine (**subd** and an appropriate certificate authority file), start up **subd**, add the machine to the **MDB** and wait for the **Dominator** to install the system image, which will complete the birthing process. [***Birthing***](https://docs.google.com/document/d/1y7rPTuG145fdPhqaCdLu_03D1dC1UTRzyEq61ygIQeg/pub) machines is the subject of a separate document.
The **Dominator** system may be used to optimise the birthing of machines. The [**Birther**](../MachineBirthing/README.md) system would install a minimal payload on a machine (**subd** and an appropriate certificate authority file), start up **subd**, add the machine to the **MDB** and wait for the **Dominator** to install the system image, which will complete the birthing process. [**Birthing**](../MachineBirthing/README.md) machines is the subject of a separate document.

Auditing, Compliance Enforcement and Intrusion Detection
========================================================
161 changes: 161 additions & 0 deletions design-docs/MachineBirthing/README.md
@@ -0,0 +1,161 @@
Machine Birthing
================
Richard Gooch
-------------

Background
==========

Growing machine capacity in a datacentre environment is often done by rolling in multiple racks of machines, wiring them in and powering them up. Once powered up, it is common for operations staff to use scripts and other automation tools to *birth* (install and configure) the machines. These automation tools typically build on top of other tools which were designed to birth a single machine (i.e. boot from an installation CD/ISO image). The layers of tools can make the birthing process less reliable and efficient, and leave the machine in a state where it is ready for further configuration rather than being ready for actual useful work. These tools often neglect other aspects of the machine life-cycle, such as automated repairs.

This document describes the design of a fully automated, robust, reliable and efficient architecture for (re)birthing machines at large scale. The design target is that 100 racks of machines can be turned on and within an hour all the new machines are available for real work, *without any further human intervention* or any preparatory software configuration. The software system that will implement this architecture is called the **Birther**.

The **Birther** system depends on the [**Dominator**](../Dominator/README.md) system, which is likely to be the limiting factor in how quickly machines can be made available for real work. A more focussed design target for the **Birther** system is that it can **sub**ject machines to **Domination** at a higher rate (machines per second) than the **Dominator** can complete full system updates.

High-level Design
=================

The system comprises the following components:

- a **M**achine **D**ata**B**ase (**MDB**) which lists all the machines in the fleet and their properties

- a **Birther** machine which responds to PXE boot requests from machines

- a **Boot** **Server** (containing a DHCP and TFTP server), which is used to install a tiny **Bootstrap Image**

- a **Bootstrap Image** which configures the machine, enters it into the **MDB** and enables the machine for **Domination**

- a [**Dominator**](../Dominator/README.md) system which is used to install fully configured, workload-ready images

The following diagram shows how these components are connected:
![BirtherSystem Components image](../pictures/BirtherSystemComponents.svg)

The MDB
-------

The **MDB** is the sole source of truth which defines the intended state of the fleet. It lists all the known machines in the fleet and records the name, IP address, MAC address, *required* system image, repair state and so on.

The Birther
-----------

The **Birther** listens for PXE boot requests from any machine, and consults the **MDB** to determine what kind of response to send. In all cases, a response is sent. The following **MDB** states are defined:

- *unknown*: the system is not yet known in the **MDB**

- *birth*: a temporary private IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

- *healthy*: the system is known in the **MDB** and is considered healthy. The PXE response instructs the machine to boot from local media. This is an optimisation that dramatically decreases system reboot time, as the machine does not have to wait for the PXE boot timeout before booting from local media

- *rebirth*: the system is known in the **MDB** and is in need of a software repair (a **rebirth**). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Bootstrap Image**

- *clean*: the system is known in the **MDB** and needs to be cleaned (old data removed). The permanent IP address is assigned and the PXE response instructs the machine to load and boot the **Fast Bootstrap Image**

The **Birther** stores PXE boot request and response statistics in the **MDB** so that persistently failing machines can be detected.
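
The state handling above can be sketched as a lookup from **MDB** state to PXE response. This is a minimal illustration only, not the **Birther** implementation: the names `State`, `PxeResponse`, `RESPONSES` and `handle_pxe_request` are hypothetical, the **MDB** is modelled as a plain dictionary keyed by MAC address, and treating an *unknown* machine by creating a *birth* entry follows the first power-on sequence described later in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    UNKNOWN = "unknown"
    BIRTH = "birth"
    HEALTHY = "healthy"
    REBIRTH = "rebirth"
    CLEAN = "clean"


@dataclass(frozen=True)
class PxeResponse:
    # "temporary", "permanent", or None where the text does not say
    # which address is offered (hypothetical encoding).
    address: Optional[str]
    # "bootstrap", "fast-bootstrap" or "local" (hypothetical labels).
    boot: str


# One response per known MDB state, as listed in the text.
RESPONSES: Dict[State, PxeResponse] = {
    State.BIRTH: PxeResponse("temporary", "bootstrap"),
    State.HEALTHY: PxeResponse(None, "local"),
    State.REBIRTH: PxeResponse("permanent", "bootstrap"),
    State.CLEAN: PxeResponse("permanent", "fast-bootstrap"),
}


def handle_pxe_request(mdb: Dict[str, State], mac: str) -> PxeResponse:
    """Consult the (toy) MDB and always return a response."""
    state = mdb.get(mac, State.UNKNOWN)
    if state is State.UNKNOWN:
        # An unknown machine gets a new entry in state "birth".
        mdb[mac] = State.BIRTH
        state = State.BIRTH
    return RESPONSES[state]
```

A response is produced for every state, which matches the requirement that a response is sent in all cases. A real implementation would also record request and response statistics in the **MDB**; that is omitted here.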

The Boot Server
---------------

The **Boot Server** contains a DHCP and TFTP server and serves requests for the **Bootstrap Image**. It will respond to requests on the private IP network used for temporary addresses as well as requests on the main IP network for permanent addresses. The Hypervisor in SmallStack (part of the [**Dominator**](../Dominator/README.md) ecosystem) contains a **Boot Server** which is integrated with the ecosystem (including image building and distribution). Consult your favourite search engine for generic implementations.

The Bootstrap Image
-------------------

The **Bootstrap Image** contains:

- a generic kernel

- a small compressed file-system which contains:

    - a configuration tool, which is run as the *init* process

    - a copy of **subd** from the **Dominator** system and a Certificate Authority file

The configuration tool performs initial setup and then hands the machine over to the **Dominator**.

### The Fast Bootstrap Image

This is the same as the **Bootstrap Image** except that a burn-in test is not performed.

The Miracle of Birth
====================

Consider the first power on of a machine. The following sequence will ensue:

- the machine will broadcast a PXE boot request

- the **Birther** system will consult the **MDB** and see that the machine is *unknown*

- the **Birther** will assign a temporary private IP address and create a new machine entry with state *birth* in the **MDB** recording the MAC address and the assigned IP address. It will then send a PXE response to instruct the machine to load and boot the **Bootstrap Image**

- the machine will boot the kernel in the image

- the kernel will probe the machine hardware and then start the configuration tool, which will:

    - start a watchdog process that talks to a hardware watchdog device

    - run a burn-in stress and performance test

    - probe the network (using an LLDP query tool or similar) to determine its physical position in the rack, and use this information to compute its hostname and permanent IP address

    - scan the machine hardware and compute a preferred image based on burn-in test results, storage capacity, memory and number of CPUs. Examples of the image types that may be selected are:

        - compute node

        - storage node

        - debug image (if the burn-in test failed)

    - generate random encryption keys for the storage media and store them in NVRAM (discarding any old keys stored there, effectively wiping the media of any old data)

    - partition storage devices

    - create file-systems

    - set up a boot loader

    - mount and populate the new root file-system with system configuration data (/etc/fstab, hostname, network configuration, etc.)

    - copy **subd** and the Certificate Authority file to the root file-system

    - issue a request to the **MDB** to update its entry with the hostname, IP address, system image and set its state to *healthy*

    - if the **MDB** change is successful it will change the network configuration to the permanent IP address, change to the new root directory and transfer control to **subd**. At this point the machine is fully **sub**jugated

- the **Dominator** will see the new **sub** appear in the **MDB** and will install the system image. The **Dominator** will see that the **sub** is essentially empty and will direct the **sub** to fetch files at maximum speed

- the **sub** will see that the kernel is being updated (since there are no kernel files currently on the system) and will reboot once the update is complete

- the **Birther** will see a PXE boot request from the machine, will see that the machine is listed in the **MDB** and is *healthy*, and will instruct the machine to boot from local media

- the **sub** will boot its image. Assuming the image is appropriately configured, it is now ready to perform work
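
Two of the configuration tool steps in the sequence above (computing a hostname and permanent IP address from the rack position, and choosing a preferred image) can be sketched as follows. This is an illustration under stated assumptions, not the actual tool: the `Hardware` fields, the 50 TB storage threshold, the `rNNN-sNN` hostname scheme and the `10.0.0.0` base address are all hypothetical, and the rack and slot numbers are assumed to have already been obtained from an LLDP query.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Hardware:
    burn_in_passed: bool
    storage_tb: float
    memory_gb: int
    cpus: int


def preferred_image(hw: Hardware) -> str:
    """Pick an image type from the hardware scan (hypothetical policy)."""
    if not hw.burn_in_passed:
        # A failed burn-in test always selects the debug image.
        return "debug image"
    if hw.storage_tb >= 50:  # assumed threshold, not from the design
        return "storage node"
    return "compute node"


def hostname_for(rack: int, slot: int) -> str:
    """Derive a hostname from the physical position (assumed scheme)."""
    return f"r{rack:03d}-s{slot:02d}"


def permanent_ip(rack: int, slot: int, base: str = "10.0.0.0") -> str:
    """Derive a permanent address: one /24 per rack (assumed scheme).

    Only meaningful for slot numbers below 256.
    """
    return str(ipaddress.IPv4Address(base) + rack * 256 + slot)
```

Deriving the name and address purely from the physical position means no per-machine configuration has to exist before power-on, which is what allows birthing without any preparatory software configuration.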

Repairing (rebirthing) Machines
-------------------------------

If a machine is found to be persistently failing (e.g. stuck in a reboot loop), a separate automated system may decide that a **rebirthing** is required. If so, that system will set the state of the machine in the **MDB** to *rebirth* and on the next reboot the **Birther** will send a PXE boot response to boot the **Bootstrap Image**. The flow is almost the same as above for **birthing** machines, with the following exception:

- the permanent IP address is used in the PXE response

The means of detecting unhealthy machines and determining how sick they are and the steps required to heal them is the topic of another paper about **Machine Lifecycle Management**. The **Birther** and the **Dominator** are foundational components in a larger system.
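
As a rough illustration of how PXE boot request statistics could feed such a decision, the sketch below flags a machine whose recent request count suggests a reboot loop. The detection policy is explicitly outside the scope of this document, so everything here is an assumption: the function name `needs_rebirth`, the one-hour window and the threshold of five requests are hypothetical.

```python
from typing import Iterable


def needs_rebirth(
    pxe_request_times: Iterable[float],
    now: float,
    window_s: float = 3600.0,   # assumed observation window
    max_requests: int = 5,      # assumed tolerance before rebirthing
) -> bool:
    """Return True if the machine PXE-booted suspiciously often.

    Each PXE boot request corresponds to one (re)boot, so many
    requests inside a short window suggest a reboot loop.
    """
    recent = [t for t in pxe_request_times if 0 <= now - t <= window_s]
    return len(recent) > max_requests
```

A lifecycle management system using a check like this would then set the machine state in the **MDB** to *rebirth* and leave the rest to the **Birther** on the next reboot.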

Cleaning Machines
-----------------

**Cleaning** a machine is almost identical to **rebirthing**, except that the burn-in test is not performed. This is useful if a machine is re-assigned to a different owner so that any potentially sensitive data are removed before the machine is available to the new owner. The burn-in test is not needed (the machine is *healthy*), so it is best to avoid that step (which can take many minutes or even hours, depending on how exhaustive the test is). A fast re-assignment facilitates building a responsive Metal as a Service system, if so desired.

In the simplest case, data can be “cleaned” by re-making the file-systems. This limits the potential for data exfiltration to more advanced attackers. If the secure encryption features of the storage media are used, throwing away the old encryption keys is a fast and effective way to erase the storage media.

Calculating Performance Targets
===============================

One of the limitations on birthing machines is how quickly they can fetch the **Bootstrap Image** from the **Boot Server**. Consider the following environment:

- 1 GB/s (10 Gb/s) network

- 10 MB **Bootstrap Image**

- 1 GB system image

In this environment, the **Boot Server** should be able to service 100 fetches per second (1 GB/s divided by 10 MB per fetch). This is much faster than the rate at which the **Dominator** can perform full system updates (its limit is 1 machine per second, assuming it does not have any peer-to-peer enhancements). Clearly, optimising the **Birther** system would be premature, and will probably never be needed.
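
The arithmetic behind these figures can be checked with a few lines. This is only a back-of-the-envelope sketch using the environment listed above; the constant and function names are illustrative, and it assumes the network link is the sole bottleneck and is fully utilised.

```python
# Environment from the list above, in bytes and bytes per second.
NETWORK_BYTES_PER_S = 1_000_000_000   # 1 GB/s
BOOTSTRAP_IMAGE_BYTES = 10_000_000    # 10 MB
SYSTEM_IMAGE_BYTES = 1_000_000_000    # 1 GB


def transfers_per_second(link_bytes_per_s: int, payload_bytes: int) -> float:
    """Upper bound on transfers per second for a saturated link."""
    return link_bytes_per_s / payload_bytes


bootstrap_rate = transfers_per_second(NETWORK_BYTES_PER_S, BOOTSTRAP_IMAGE_BYTES)
full_update_rate = transfers_per_second(NETWORK_BYTES_PER_S, SYSTEM_IMAGE_BYTES)

print(f"Bootstrap Image fetches per second: {bootstrap_rate:.0f}")
print(f"Full system updates per second:     {full_update_rate:.0f}")
print(f"Full system updates per hour:       {full_update_rate * 3600:.0f}")
```

The two rates differ by a factor of 100, which is the ratio of the system image size to the **Bootstrap Image** size.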
1 change: 1 addition & 0 deletions design-docs/pictures/BirtherSystemComponents.svg
