# Spectre is here to stay: An analysis of side-channels and speculative execution

Ross Mcilroy (Google), rcmilroy@google.com

Jaroslav Sevcik (Google), jarin@google.com

Tobias Tebbi (Google), tebbi@google.com

Ben L. Titzer (Google), titzer@google.com

Toon Verwaest (Google), verwaest@google.com

February 15, 2019

#### Abstract

The recent discovery of the Spectre and Meltdown attacks represents a watershed moment not just for the field of Computer Security, but also of Programming Languages. This paper explores speculative side-channel attacks and their implications for programming languages. These attacks leak information through micro-architectural side-channels which we show are not mere bugs, but in fact lie at the foundation of optimization. We identify three open problems, (1) finding side-channels, (2) understanding speculative vulnerabilities, and (3) mitigating them. For (1) we introduce a mathematical meta-model that clarifies the source of side-channels in simulations and CPUs. For (2) we introduce an architectural model with speculative semantics to study recently-discovered vulnerabilities. For (3) we explore and evaluate software mitigations and prove one correct for this model. Our analysis is informed by extensive offensive research and defensive implementation work for V8, the production JavaScript virtual machine in Chrome. Straightforward extensions to model real hardware suggest these vulnerabilities present formidable challenges for effective, efficient mitigation. As a result of our work, we now believe that speculative vulnerabilities on today's hardware defeat all language-enforced confidentiality with no known comprehensive software mitigations, as we have discovered that untrusted code can construct a universal read gadget to read all memory in the same address space through side-channels. In the face of this reality, we have shifted the security model of the Chrome web browser and V8 to process isolation.

# 1 Introduction

Computer systems aspire to enforce three key security properties: confidentiality, integrity, and availability [3]. Of these, confidentiality is the property that guarantees private information can only be accessed by authorized parties. Confidentiality is the basis for many security mechanisms, including passwords, access tokens, TANs, cookies, and capabilities. It is enforced through a number of mechanisms, including encryption, spatial, temporal, or virtual separation, and mediation through access checks. When ensuring confidentiality, it is not enough for implementations to be functionally correct. In addition to computing the correct result, private data must not leak to unauthorized parties by any means. Such leakage can happen through unforeseen information channels, called side-channels. Any measurable property of a computer implementation has the potential to be a side-channel.

Side-channels are an important area of study for computer systems and processes that deal with high-value secrets like crypto systems. These attacks are not merely theoretical, as practical attacks have been demonstrated as far back as 1996 [19]. In high-security contexts, algorithms, software, and hardware are carefully designed to eliminate as many side-channels as possible. This seems to be a never-ending battle, as attacks have been demonstrated using an amazing variety of measurements, including electromagnetic emissions [1], energy consumption [21], power lines [11], microphones [10], high resolution cameras [8], IR photon detectors [6], and even the ambient light sensor of smartphones [32]. Timing attacks use time itself as the measurable property, inferring confidential information by differential analysis of the varying time required to execute operations on or related to secret information. Timing was in fact one of the first side-channels used to attack crypto systems [19].

Timing side-channels are abundant because reducing execution time is an important goal of software and hardware engineering, and modern computer systems employ many optimizations which depend on the dynamic values of computations and thus might leak information about those dynamic values. Confidentiality is at risk when code of different parties shares resources such as a processor core, processor, processor package, memory bus, DRAM, cache, or disk. Side-channels can exist both in time-sharing scenarios, where state persists between context switches, and in concurrent scenarios, where processes directly contend for resources.

A classic example is processor caches, which store the address and value of recently accessed memory locations. Subsequent memory accesses are then faster for cached addresses, revealing information about which addresses were or were not recently accessed. [5] was the first to use cache timing as a side-channel to attack crypto systems. In principle, any processor unit with hidden state has the potential to store sensitive information and to leak this information to subsequently or concurrently executing code via a timing channel. Outside of the CPU itself, DRAM can also be a side-channel [27], since it contains active and inactive rows. [9] gives a more complete account of known attacks that use microarchitectural components.

Security acknowledges that trust has levels. Kernels are part of the trusted computing base, offering a platform for semi-trusted and trusted applications as isolated processes. For the vast majority of installed applications, such as word processors or video games, we simply trust that they don't steal our personal information and credentials, or attempt to impersonate us to third parties. These attacks are however extremely serious for crypto systems and production settings when servers are virtualized on the same hardware, or the extreme case where multiple, competing, untrusted clients run on shared cloud computing platforms [31]. In the case of a Web browser, where webpages run untrusted code and applications run semi-trusted code on devices close to us, trust is lower, and these attacks become relevant. For the Web, untrusted JavaScript and WebAssembly code is sandboxed at multiple granularities to protect both the user's system and other webpages' private content. As such, side-channels are now emerging as a real threat for the web platform.

## 1.1 Security in Programming Languages

Computer systems are massively complex, requiring oversimplification and reasoning via analogy. In this vein we can for the moment view them as a tower of abstractions that ultimately rests upon physics and builds upwards, from electrons to circuits, from circuits to micro-architectures, from micro-architectures to micro-ops, from micro-ops to ISAs, from ISAs to system software and libraries, and then programming languages, themselves an abstraction composed of some combination of a static compiler, runtime system, and virtual machine. When we consider emulation and virtualization via hypervisors, the tower of hardware and software can include an unbounded number of emulation layers. Abstractions go further up from applications, of course, beyond a single computer to networked computer systems, clusters, databases, clouds, the internet, the Web.

In this view, programming languages, as most people use them, are square in the middle of this tower of abstractions, though they are relevant to upper layers that are programmable. Being the abstraction in which algorithms are expressed means the security properties of a programming language are therefore paramount to the security of algorithms and systems implemented with it. Security at this layer is important, since many decades of experience have shown us that secure hardware is not enough and securing software has many remaining challenges.

Although it is rarely explicitly recognized by name, most programming languages enforce various degrees of both integrity and confidentiality for programmer-created abstractions, preventing bugs and unpredictable results and increasing our confidence in software correctness. Basic programming language mechanisms such as proper scoping, namespace separation, storage separation, dynamic bounds checks, and both static and dynamic type checks are all attempts to prevent one part of a program or library from accessing another's data inappropriately. Of course, some languages are more explicitly security-conscious than others. For example, some languages express confidentiality explicitly [28]. An early example is Jif [25], which extends Java with a type system that expresses principals and access rights and extends the Java type system to ensure that data cannot be read by principals that were not granted access rights.

In mainstream modern languages, strong type systems are designed to guarantee that a program cannot exhibit certain dangerous behaviors which might jeopardize integrity, confidentiality, and availability. A strong type system allows reasoning about programs manipulating sensitive data. Memory safety is among the crucial properties a strong type system must enforce. In principle, memory-safe languages can guarantee strong isolation similar to process isolation as offered by common hardware and operating systems. There even have been serious efforts [14, 26] to build operating systems where processes share the same address space and are only isolated by a memory-safe programming language. In this case, the burden of ensuring confidentiality rests fully on the type system, compiler, and runtime system rather than on hardware/OS process isolation. A language implementation may enforce security properties through a combination of static and dynamic checks.

Language security is a powerful tool, but what if a programming language is run on insecure hardware? What if the CPU, even if verified to be architecturally correct [29], allows the program to bypass language security measures through side channels, which are outside architectural models? Our proofs of type safety and of type system soundness, which help us enforce confidentiality through integrity of abstractions, taint freedom, and other things, would suddenly be called into question. In short, that would be bad.

## 1.2 A new and pervasive threat

To date, timing attacks and side-channels have only been used to observe *legitimate* computations that happened in the ordinary course of a program's execution. Thus the risk for information leakage could be determined by considering the algorithm and the data it processes. In turn, this was mostly seen as a risk for encryption algorithms and other programs doing heavy computations on sensitive data. This offered the possibility to selectively harden sensitive algorithms.

Information leaks from speculative execution represent a different level of threat that we must now begin to understand and reason about. Spectre [13, 20] allows for information to be leaked from computations that should have never happened according to the architectural semantics of the processor. Spectre attacks use targeted manipulations of the shared microarchitectural state to trigger ordinarily impossible computations, either in untrusted code on which an implementation imposes safety checks, or by injecting dangerous speculative behavior into trusted code. In either case, Spectre allows for low-level read access to all of the addressable memory. This puts arbitrary in-memory data at risk, even data "at rest" that is not currently involved in computations and was previously considered safe from side-channel attacks.

This paper is an attempt to distil and clarify that threat. As a result of our work on Spectre, we now know that information leaks may affect all processors that perform speculation, regardless of instruction set architecture, manufacturer, clock speed, virtualization, or timer resolution. Since the initial disclosure of three classes of speculative vulnerabilities, all major vendors have reported affected products, including Intel, ARM, AMD, MIPS, IBM, and Oracle. This class of flaws is deeper (at the microarchitectural level of the processor) and more widely distributed (in essentially every high performance processor) than perhaps any security flaw in history, affecting billions of CPUs in production across all device classes. While speculation is often informally equated with branch prediction, the concept of speculation is broader, since processors speculate in other ways not related to branch prediction, as we show in section 3.4.4. Vulnerabilities from speculative execution are not processor bugs but are more properly considered fundamental design flaws, since they do not arise from errata. Troublingly, these fundamental design flaws were overlooked by top minds for decades. Our paper shows these leaks are not only design flaws, but are in fact foundational, at the very base of theoretical computation. On the opposite end of the spectrum, we detail our practical attempts at software mitigation in a production virtual machine entrusted with the security of the Web.

<sup>1</sup>See acknowledgments section

## 1.3 Contributions

- A formal model of microarchitectural side-channels and optimizations
- An abstract amplification technique to make even the smallest timing difference detectable with any resolution clock
- A formal model that explains several recently-discovered classes of speculative vulnerabilities
- A succinct description of the ultimate leak from speculation: the universal read gadget
- Construction of the universal read gadget using multiple vulnerabilities
- An analysis of exploitable source language features
- A formal analysis of mitigations for the variant 1 vulnerability
- Discussion of mitigations for other variants

# 2 Understanding microarchitectural side-channels

It is often convenient to think of a CPU as an interpreter which executes instructions, maintaining a program counter, registers, and memory along the way, a collection of information which together comprise what is known as the *architectural* state of a program. Yet in reality, all CPUs are simulators; they are implemented as complex circuits organized into register files, functional units, tables, forwarders, caches, and many other things, each with their own internal administrative state, and only simulate the action of an interpreter. A CPU's internal state, often called the *microarchitectural* state, abbreviated here  $\mu$ -state, is hidden by the abstraction that the CPU provides to programs. However, as we'll see in this section, this abstraction is more permeable than it would at first appear, once we consider the ability to measure execution time.

For the moment, let's step back from concrete computers and introduce a mathematical meta-model to make reasoning about state-transition systems both abstract and rigorous. The meta-model allows us to reason about any kind of computational system, not just CPUs, Turing machines, or equivalent calculi. We use a simulation relation that relates architectural states to $\mu$-architectural states, allowing many correct $\mu$-states to map to a single architectural state. It is exactly in this differential mapping where side-channels can occur. This model is deterministic and requires only a single event counter, or timer, to extract information from the side-channel. This shows that these vulnerabilities are *not* the result of complex nondeterministic machines or specific CPU bugs, but are, in fact, fundamental to simulation in general.

**Architecture.** We define an architecture $\overline{\alpha}: P \times \Sigma \to \Sigma$ as a computable function $\overline{\alpha}_{\rho}(\sigma) = \sigma'$ where $\rho \in P$ represents a program, and $\sigma \in \Sigma$ represents the state of the program. We denote an architecture for programs $P$ with states $\Sigma$ by $\Lambda\langle P, \Sigma \rangle$. We make no restrictions on the language of programs $P$ or the language of states $\Sigma$ as long as they are countable sets. In the next section it will be convenient to mimic a real CPU by modeling memory cells, but for now we allow $P$ and $\Sigma$ to encode any kind of computation, e.g. with syntactic terms like $\lambda x.x$ and reduction rules. Further, we assume nothing about how $\overline{\alpha}$ is most succinctly described, whether it be mathematically with a small-step semantics, operationally with a state transition diagram, or procedurally as a set of instructions for yet another architecture.

**$\mu$-Architecture.** We define the execution semantics of a $\mu$-architecture for $\overline{\alpha}: \Lambda\langle P, \Sigma\rangle$ as an architecture $\overline{m}: \Lambda\langle P, \Delta\rangle$ where we choose two sets of observable states $\Omega_{\Sigma} \subseteq \Sigma$ and $\Omega_{\Delta} \subseteq \Delta$ such that there exists a simulation relation function $R: \Omega_{\Delta} \to \Omega_{\Sigma}$ with:

•  $\forall \sigma \in \Omega_{\Sigma}, \exists \mu \in \Omega_{\Delta}$  such that  $R(\mu) = \sigma$  meaning that all observable architectural states have at least one corresponding observable  $\mu$ -architectural state (i.e. the relation is *surjective*).

The relation R may map many  $\mu_i$  to a single  $\sigma$ , allowing  $\overline{m}$  the freedom to add additional details like a memory cache or branch predictor state. It is exactly this extra state that can exhibit side-channels.

We can now precisely define *correctness* for an  $\overline{\alpha}$  and its implementation  $\overline{m}$  as a condition on execution paths between *observable related* states.

•  $\forall n_1 \in \mathbb{N}, \sigma, \sigma' \in \Omega_{\Sigma}$  such that  $\overline{\alpha}_{\rho}^{n_1}(\sigma) = \sigma', \exists n_2 \in \mathbb{N}, \mu, \mu' \in \Omega_{\Delta}$  such that  $R(\mu) = \sigma, \overline{m}_{\rho}^{n_2}(\mu) = \mu',$  and  $R(\mu') = \sigma'$  meaning that for any architectural execution path between two observable states, there is a  $\mu$ -architectural execution path between correspondingly related observable  $\mu$ -states.

Since architectures are deterministic state transition systems, we can make use of the existential quantifier for the correctness property<sup>2</sup>.
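To make these definitions concrete, here is a minimal sketch of a toy architecture together with a cached implementation of it. Everything in it is an assumption made for illustration and does not come from the formal development: the load-only program language, the contents of `MEM`, the two-step miss penalty, and the names `arch_step`, `micro_step`, `observable`, `R`, and `related_path_length`. It shows two $\mu$-states that map to the same architectural state under $R$, and that the related $\mu$-architectural path exists in both cases but has a different length $n_2$.

```python
# Toy instance of the meta-model (all names and constants are illustrative).
# A program is a list of addresses to load; the architectural state is (pc, reg).
MEM = {0: 10, 1: 20, 2: 30}   # fixed memory contents (assumption)
MISS_PENALTY = 2              # extra mu-steps for an uncached load (assumption)

def arch_step(prog, sigma):
    """One architectural step: load MEM[prog[pc]] into the register."""
    pc, _reg = sigma
    return (pc + 1, MEM[prog[pc]])

def micro_step(prog, mu):
    """One mu-architectural step; mu = (pc, reg, cache, stall)."""
    pc, reg, cache, stall = mu
    addr = prog[pc]
    if addr in cache or stall == MISS_PENALTY:
        # Cache hit, or the miss has been served: finish the load, fill the cache.
        return (pc + 1, MEM[addr], cache | {addr}, 0)
    return (pc, reg, cache, stall + 1)   # still waiting on the miss

def observable(mu):
    """Only mu-states that are not in the middle of a miss are observable."""
    return mu[3] == 0

def R(mu):
    """Simulation relation: forget the cache and the stall counter."""
    pc, reg, _cache, _stall = mu
    return (pc, reg)

def related_path_length(prog, sigma, n1, cache):
    """Return n2 such that micro_step^n2 reaches a state related to arch_step^n1(sigma)."""
    target = sigma
    for _ in range(n1):
        target = arch_step(prog, target)
    mu = (sigma[0], sigma[1], frozenset(cache), 0)   # chosen so that R(mu) == sigma
    n2 = 0
    while not (observable(mu) and R(mu) == target):
        mu = micro_step(prog, mu)
        n2 += 1
    return n2

prog = [0, 1, 0]
cold = related_path_length(prog, (0, 0), 3, cache=())       # empty cache
warm = related_path_length(prog, (0, 0), 3, cache=(0, 1))   # both addresses cached
```

Both starting $\mu$-states are related to the same architectural state `(0, 0)` and both reach a state related to the same final architectural state, so the toy implementation satisfies the correctness condition. The only difference is the number of $\mu$-steps taken (`cold` versus `warm`), which is invisible architecturally.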

## 2.1 Writing information into $\mu$ -state

The additional state of a  $\mu$ -architecture might be used by a crafty program to encode information which is hidden from the architectural state. To see this, consider an architecture  $\overline{\alpha}: \Lambda\langle P, \Sigma\rangle$  and implementation  $\overline{m}: \Lambda\langle P, \Delta\rangle$ . Consider a program  $w \in P$  that receives as input either 0 or 1 and writes this bit into the  $\mu$ -architectural state and then "forgets" it. More precisely:

- $\forall b \in \{0,1\}$, let $\Sigma_w(b) \subset \Omega_\Sigma$ be the architectural states encoding $b$ as input to $w$, with $\Sigma_w(0) \cap \Sigma_w(1) = \emptyset$.
- $\exists \phi \in \Omega_{\Sigma}, \forall \sigma \in \Sigma_w(b), \exists n$ such that $\overline{\alpha}_w^{n}(\sigma) = \phi$, meaning that all inputs representing 0 or 1 converge to the same observable state $\phi$.
- $\exists V: \Omega_{\Delta} \to \{0,1\}$ such that
  - $\forall b \in \{0, 1\}, \sigma \in \Sigma_w(b), \mu \in \Omega_\Delta$ such that $R(\mu) = \sigma$, let $k \in \mathbb{N}$ be the smallest number such that $\overline{m}_w^{k}(\mu) = \mu_\phi$ and $R(\mu_\phi) = \phi$; then $V(\mu_\phi) = b$, meaning that the shortest<sup>3</sup> $\mu$-architectural paths that lead to observable states related to $\phi$ preserve the input bit $b$, which can be inspected by the function $V$.

We encode the input bit b by starting with a  $\sigma$  chosen from the appropriate set. By design, when w executes, it converges on a common state  $\phi$ , thus "forgetting" b at the architectural level. However, w organizes the output  $\mu$ -states into two disjoint sets, and with the function V, b can be recovered. Thus the program w constitutes a write mechanism for a  $\mu$ -architectural side-channel.
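As an illustration of the construction of $w$, the sketch below uses a cache as the extra $\mu$-state. The addresses in `ADDR_FOR_BIT`, the final state `PHI`, and the function names are hypothetical, and the sketch assumes the cache does not already hold the address associated with bit 1 when $w$ starts.

```python
# Toy write mechanism (illustrative; names and addresses are assumptions).
# A mu-state is a pair (arch, cache): arch is the architectural part,
# cache is hidden state that the simulation relation R discards.
ADDR_FOR_BIT = {0: 0x100, 1: 0x200}   # hypothetical addresses touched by w
PHI = ("halted", 0)                   # the common final architectural state

def run_w(b, cache=frozenset()):
    """Execute w on input bit b and return the resulting mu-state."""
    reg = b                               # sigma in Sigma_w(b): b sits in a register
    cache = cache | {ADDR_FOR_BIT[reg]}   # a load whose address depends on b
    reg = 0                               # overwrite the register: b is "forgotten"
    return (("halted", reg), cache)

def R(mu):
    """Simulation relation: drop the cache, keep the architectural part."""
    arch, _cache = mu
    return arch

def V(mu):
    """Recover the bit by inspecting the hidden cache contents."""
    _arch, cache = mu
    return 1 if ADDR_FOR_BIT[1] in cache else 0

mu0, mu1 = run_w(0), run_w(1)
```

Here `R(mu0)` and `R(mu1)` are both `PHI`, so the architectural state carries no trace of the input, while `V` still separates the two $\mu$-states.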

## 2.2 Reading information from $\mu$ -state

A correct implementation of an architecture is a complete abstraction: even with a write mechanism to store information in  $\mu$ -architectural state, the information is, by design, inaccessible at the architectural level. However, many architectures offer ways to measure some aspect of the  $\mu$ -state. This can be a direct measurement (e.g. is a given memory address in cache) or some approximation of the execution history (e.g. a timer or an event counter). These mechanisms may violate the architectural abstraction and allow a crafty program to create a read mechanism to complete the construction of a side-channel.

**Events.** For an architecture $\overline{\alpha}: \Lambda\langle P, \Sigma\rangle$, we define an *event* as a predicate function $P_E: P \times \Sigma \to \{0,1\}$ that determines if a special situation $E$ has occurred at a given state. It is straightforward for a program to count its own events, maintaining a counter encoded in $\Sigma$. Such architectural event counters pose no threat to the integrity of the architectural abstraction. However, very often a $\mu$-architecture counts $\mu$-architectural events (i.e. with a predicate over $\Delta$) and offers an API function $f_E: () \to \mathbb{N}$ to return the counter value. This does indeed break the abstraction, allowing programs to construct the read mechanism to complete the side-channel.

<sup>2</sup>In essence, for each observable starting state, we only require there is *some* $\mu$-architectural state from which computation can begin, and determinism takes over. Universally quantifying the path length accomplishes induction for full correctness. This choice allows $\mu$-architectures that can even skip observable states under some circumstances, as long as they can be configured to reach any desired observable state reachable by the architectural execution.

<sup>3</sup>Restricting to shortest paths thus only requires the bit be viewable for a non-zero time window immediately after having reached $\phi$ the first time. This handles the general case of infinite loops involving $\phi$ and machines where the $\mu$-architecturally-encoded bit can be flipped or lost.

**Clocks.** It's easy to see that clocks are simply special cases of event counters. The basic building block of all clocks is the *step counter* (i.e. $P_E(\rho,\mu)=1$) which advances for every step of the $\mu$-architecture execution. Of course a clock may advance at a different rate than the step counter and potentially drift. To model this, we define a *clock* $C: \mathbb{N} \to \mathbb{N}$ as a linear, monotonically increasing function from the step counter:

- $C(0) = 0$. All clocks start at 0 initially.
- $C(n+1) \ge C(n)$. Each reading is greater than or equal to the last (monotonically increasing).
- $\exists r \in \mathbb{R}, r > 0$ and $B \in \mathbb{N}$ such that $\forall n, |C(n) - rn| \leq B$. Every clock has a nonzero average rate $r$ and a drift bound $B$ such that every reading is within $B$ of the linear function $rn$.

Thus a clock measures the passage of time as an ever-increasing number, never runs backwards, and always runs at more or less the same rate, no matter how long we use it.
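The three clock properties can be checked mechanically for a candidate function over a finite range of step counts. The sketch below is only a finite sanity check, not a proof, and the helper name `is_clock`, the horizon, and the four example functions are assumptions introduced for illustration.

```python
from fractions import Fraction

def is_clock(C, r, B, horizon=10_000):
    """Check C(0) = 0, monotonicity, and |C(n) - r*n| <= B for n in 0..horizon."""
    if C(0) != 0:
        return False
    for n in range(horizon):
        if C(n + 1) < C(n):          # a clock never runs backwards
            return False
    return all(abs(C(n) - r * n) <= B for n in range(horizon + 1))

def step_counter(n):
    """The basic clock: one tick per mu-step (r = 1, B = 0)."""
    return n

def coarse(n):
    """A low-resolution clock: one tick per 1000 mu-steps (r = 1/1000, B = 1)."""
    return n // 1000

def runaway(n):
    """Not a clock: no fixed rate keeps the drift bounded."""
    return n * n

def sawtooth(n):
    """Not a clock: the reading periodically drops back to 0."""
    return n % 5

checks = {
    "step_counter": is_clock(step_counter, Fraction(1), 0),
    "coarse": is_clock(coarse, Fraction(1, 1000), 1),
    "runaway": is_clock(runaway, Fraction(1), 1),
    "sawtooth": is_clock(sawtooth, Fraction(1), 5),
}
```

Note that `coarse` satisfies the definition even though it stays at the same reading for 1000 consecutive steps; the definition only bounds drift, not resolution.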

## 2.3 A timing channel that exploits optimizations

CPUs and simulators take shortcuts to improve the execution time of programs, making use of mechanisms such as caches and branch predictors. A common theme among most such optimizations is that they depend only on  $\mu$ -state: a single architectural state can result in different execution time, depending on whether the  $\mu$ -state is "fast" or "slow". This differential is the basis of a *timing channel*, as a program intentionally manipulates  $\mu$ -state to trigger either fast or slow executions.

A timing channel, like all information channels, is comprised of separate read and write mechanisms. It is easy to construct a program whose execution time varies by any given amount by simply having the program do completely different things depending on the input. However, we are interested in timing channels at the  $\mu$ -architectural level, where programs that exhibit no visible architectural divergence depending on their input, nevertheless exploit  $\mu$ -architectural differences for timing channel construction. Such timing channels are not visible without a  $\mu$ -architectural model.

These can be described as follows:

- Write mechanism. We will use the term *optimization trigger* to denote a program fragment trig that intentionally introduces a difference in the $\mu$-state that will lead to different execution time for *another program*, in the future. As in 2.1, trig takes as input a bit $b$ and organizes $\mu$-states such that a function $V$ can recover the bit.
- Read mechanism. We will use the term *optimization opportunity* to denote a program fragment opt whose execution time varies depending on its input $\mu$-state. By executing opt, information is transferred from the $\mu$-state domain into the timing domain. To complete the mechanism, i.e. to finish the implementation of $V$, we read the information from the timing domain back into the architectural domain by simply reading the clock and deciding if execution was fast or slow.
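The two mechanisms can be sketched on a toy machine with a one-line cache and an exact step counter. The latencies, the addresses `PROBE` and `OTHER`, and the `Machine` class are assumptions chosen for illustration; real caches and real timers are far less tidy.

```python
HIT_STEPS, MISS_STEPS = 3, 12   # hypothetical latencies in mu-steps
PROBE, OTHER = 0x100, 0x200     # hypothetical addresses that share one cache line

class Machine:
    """Toy mu-architecture: a one-line cache plus a step counter."""
    def __init__(self):
        self.steps = 0      # the step counter, readable by programs
        self.line = None    # address currently held by the single cache line

    def load(self, addr):
        self.steps += HIT_STEPS if self.line == addr else MISS_STEPS
        self.line = addr

def trig(m, b):
    """Optimization trigger: leave PROBE cached for b == 0, evicted for b == 1."""
    m.load(PROBE if b == 0 else OTHER)

def opt(m):
    """Optimization opportunity: fast if and only if PROBE is still cached."""
    m.load(PROBE)

def transmit(b):
    """Write b into the cache, then read it back through the step counter."""
    m = Machine()
    trig(m, b)
    start = m.steps
    opt(m)
    elapsed = m.steps - start
    return 0 if elapsed == HIT_STEPS else 1

received = [transmit(0), transmit(1)]
```

In both cases `opt` performs the same architectural action, a load of `PROBE`; only its duration differs, and that duration is what `transmit` turns back into a bit.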

## 2.4 Amplification Lemma

As we have now seen, the last step of a timing channel's read mechanism is to read the clock. However, a clock may in fact have lower resolution than variation in $\mu$-steps due to optimizations. For example, an L1 cache hit on today's fastest CPUs may take 3 cycles, while a miss to L2 may take 12 or more cycles. For a processor running at 3 GHz, this difference of 9 cycles is therefore 3 nanoseconds, much smaller than the 1000-nanosecond-resolution clocks typically offered by most programming language APIs. Is the information therefore gone, inaccessible to programs?

The answer is, unfortunately, no. The information remains as *fractional bits* stored in the timing domain. These fractional bits can be made visible with a lower resolution clock using a variety of amplification techniques, e.g. by simply batching multiple optimization triggers, or repeating the trigger+optimization combination as many times as needed.

**Lemma.** Suppose we have an architecture $\overline{\alpha}$ and implementation $\overline{m}$.

- let trig represent an optimization trigger and opt represent an optimization opportunity.
- let $r$ and $B$ be the clock rate and drift bound of the clock available in $\overline{m}$.
- let $\delta$ be the timing difference, in $\mu$-steps, between the fastest execution of opt for input 0 and the slowest execution for input 1.
- let $rB = \lceil \max(B, B/r)/\delta \rceil$.
- let $\rho_{rB}$ be $Compile$ applied to the following program text:

      if (input == 0) {
        for (i = 0; i < rB; i++) {
          trig($\Sigma_{trig}(0)$)
          opt()
        }
      } else {
        for (i = 0; i < rB; i++) {
          trig($\Sigma_{trig}(1)$)
          opt()
        }
      }
      if (timer() > threshold) return 1; return 0;

The above program executes the same number of architectural steps, no matter its input. However, it amplifies the  $\mu$ -architectural timing differences due to trig and opt to be detectable with any clock, if we know the clock rate and its drift bound. Since the timing domain is cumulative, we can repeatedly write fractional bits into the timing domain, and then when we are sure their sum is more than the clock resolution, we can extract a whole bit with a single clock reading.
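The effect of amplification can be simulated on a toy machine whose only timer ticks once per 1000 $\mu$-steps. This is a sketch under stated assumptions: the one-line cache, the latencies, the `Machine` class, the threshold of 0 ticks, and the fact that the machine starts at step 0 are all hypothetical choices, and the repetition count is computed from the lemma's formula with $r = 1/1000$ and $B = 1$.

```python
import math
from fractions import Fraction

HIT_STEPS, MISS_STEPS = 3, 12   # hypothetical latencies in mu-steps
RESOLUTION = 1000               # mu-steps per tick of the coarse timer (assumption)
PROBE, OTHER = 0x100, 0x200     # hypothetical addresses that share one cache line

class Machine:
    """Toy mu-architecture: one-line cache, hidden step counter, coarse timer."""
    def __init__(self):
        self.steps = 0
        self.line = None

    def load(self, addr):
        self.steps += HIT_STEPS if self.line == addr else MISS_STEPS
        self.line = addr

    def timer(self):
        return self.steps // RESOLUTION   # the only clock programs can read

def trig(m, b):
    m.load(PROBE if b == 0 else OTHER)

def opt(m):
    m.load(PROBE)

def run(b, reps):
    """Repeat trig+opt reps times and return a single coarse timer reading."""
    m = Machine()
    for _ in range(reps):
        trig(m, b)
        opt(m)
    return m.timer()

# Clock parameters of the coarse timer: rate r = 1/RESOLUTION, drift bound B = 1.
r, B = Fraction(1, RESOLUTION), 1
delta = MISS_STEPS - HIT_STEPS                 # timing difference of opt, in mu-steps
reps = math.ceil(max(B, B / r) / delta)        # the lemma's repetition count

THRESHOLD = 0   # in this toy, the fast run finishes within the first tick

def decode(b):
    return 1 if run(b, reps) > THRESHOLD else 0

single = (run(0, 1), run(1, 1))          # one repetition: readings are identical
amplified = (run(0, reps), run(1, reps)) # many repetitions: readings differ
```

With a single repetition both inputs produce the same timer reading, so the bit is invisible. After `reps` repetitions the accumulated difference exceeds one tick and `decode` recovers the bit from one reading.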

**Generality.** We argue that nearly all optimizations that could be performed by a $\mu$-architecture are observable using this amplification technique or a similar one. The argument ultimately rests on the ability to repeat an optimization arbitrarily many times. Intuitively this is always possible since by design, optimizations are intended to improve the asymptotic runtime of a program (i.e. make it N% faster), no matter how long it runs. An optimization that shuts off after some number of repetitions achieves only a constant speedup (i.e. a maximum of K seconds faster), which approaches 0% the longer a program runs. We can also argue directly from the technical details of optimizations. Since optimizations are based on local $\mu$-state observations, including local history, the extra state needed to maintain a loop can always be separated from the state that triggers the optimization, forcing the optimization to repeatedly occur.

**Impossibility of complete mitigation with timers.** Based on the generality argument, we argue that mitigating timing channels by manipulating timers is impossible, nonsensical, and in any case ultimately self-defeating. For example, a common thought is that perhaps the $\mu$-architecture can track all time that has been saved due to optimizations and somehow charge the program back. To see why this does not work, first, we require the timer to track the $\mu$-architectural steps as a clock with bounded drift, so the $\mu$-architecture cannot lie forever about the clock. Instead, it must regularly charge the program back, e.g. by waiting in order to stay within the drift bound. Such a scheme is nonsensical because it simply wastes $\mu$-steps<sup>4</sup>. We argue that elaborate charge-back systems like this are equivalent to not doing optimizations at all; they approach a constant time implementation with resolution of the drift bound. As such, the asymptotic performance benefit from optimizations is again 0. Indeed, it may be the case that constant-time implementations are the *only way* to avoid leaks via timing side-channels. Timer coarsening as a mitigation at best lowers the effective bandwidth of a timing channel and simply increases the complexity of the read and write mechanisms without making them impossible.
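A back-of-the-envelope sketch shows why coarsening only lowers bandwidth. It assumes the 3 ns hit-versus-miss difference from the example in section 2.4, a hypothetical cost of 10 ns per trigger+opportunity round, and one clock reading per transmitted bit; the function names are illustrative, and real channels are noisier than this model.

```python
import math

def repetitions_needed(resolution_ns, delta_ns):
    """Rounds needed for the accumulated difference to reach one timer tick."""
    return math.ceil(resolution_ns / delta_ns)

def bits_per_second(resolution_ns, delta_ns, iteration_ns):
    """Rough channel bandwidth when each bit costs one amplified measurement."""
    reps = repetitions_needed(resolution_ns, delta_ns)
    return 1e9 / (reps * iteration_ns)

DELTA_NS = 3        # L1 hit vs. L2 miss difference at 3 GHz
ITERATION_NS = 10   # hypothetical cost of one trig+opt round (assumption)

# Timer resolutions from 1 microsecond to 1 millisecond.
bandwidth = {
    resolution_ns: bits_per_second(resolution_ns, DELTA_NS, ITERATION_NS)
    for resolution_ns in (1_000, 100_000, 1_000_000)
}
```

Under these assumptions, making the timer a thousand times coarser makes the channel roughly a thousand times slower, but the bandwidth never reaches zero.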

<sup>4</sup>Also note that waiting might consume less power in real hardware, risking another side-channel.

```
(registers)            r ::= \mathsf{r0} \mid \ldots \mid \mathsf{r15}
(operations)           op ::= eq \mid ge \mid add \mid sub \mid mul \mid div \mid shl \mid shr \mid and \mid or \mid xor
(instructions)         i ::= \mathsf{const}\ r_d\ \#\mathbb{Z}
                           \mid \mathsf{load}\ r_d\ [r_a + \#\mathbb{Z}]
                           \mid \mathsf{store}\ [r_a + \#\mathbb{Z}]\ r_v
                           \mid \mathsf{binop}\ op\ r_d\ r_a\ r_b
                           \mid \mathsf{branch}\ r_c\ \#\mathbb{Z}
                           \mid \mathsf{jump}\ r_d
                           \mid \mathsf{timer}\ r_d
(program \equiv P)     \rho ::= \vec{i}
(states \equiv \Sigma) \sigma ::= \langle pc : \mathbb{Z}, R : \vec{\mathbb{Z}}, M : \vec{\mathbb{Z}} \rangle

\frac{\rho[pc] = \mathsf{const}\ r_d\ k}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + 1, R[r_d] \leftarrow k, M \rangle} \quad \text{Arch-Const}

\frac{\rho[pc] = \mathsf{load}\ r_d\ [r_a + k]}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + 1, R[r_d] \leftarrow M[R[r_a] + k], M \rangle} \quad \text{Arch-Load}

\frac{\rho[pc] = \mathsf{store}\ [r_a + k]\ r_v}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + 1, R, M[R[r_a] + k] \leftarrow R[r_v] \rangle}

\frac{\rho[pc] = \mathsf{binop}\ op\ r_d\ r_a\ r_b}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + 1, R[r_d] \leftarrow op(R[r_a], R[r_b]), M \rangle} \quad \text{Arch-Binop}

\frac{\rho[pc] = \mathsf{branch}\ r_c\ d \quad R[r_c] \neq 0}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + d, R, M \rangle} \quad \text{Arch-BranchTaken}

\frac{\rho[pc] = \mathsf{branch}\ r_c\ d \quad R[r_c] = 0}{\rho, \langle pc, R, M \rangle \longrightarrow \langle pc + 1, R, M \rangle} \quad \text{Arch-BranchNotTaken}

\frac{\rho[pc] = \mathsf{jump}\ r_d}{\rho, \langle pc, R, M \rangle \longrightarrow \langle R[r_d], R, M \rangle} \quad \text{Arch-Jump}
```

Figure 1: Architectural model

# 3 An Architecture to study Spectre

In the last section, we proved that any  $\mu$ -architectural optimization is ultimately observable at the architectural level through timing, showing a general amplification technique to construct programs that repeatedly trigger the same optimization in order to amplify its effect. In that exercise we used a metamodel to show that information leaks affect all models of computation. In this section we introduce a series of semantic models to study the problem of information leaks due to speculative execution like that in today's CPUs. Note that we don't model multi-core systems, threads, pipelines, or memory barriers, as these are not necessary to demonstrate speculative vulnerabilities.

In Figure 1, we introduce the language of programs and states and the execution semantics of programs. We will use this architecture for the remainder of this section.

- A program  $\rho$  consists of a vector  $\vec{i}$  of instructions indexed by an integer program counter.
- An architectural state  $\sigma$  is a triple  $\langle pc, R, M \rangle$  of a program counter  $pc \in \mathbb{Z}$ , a vector  $R \in \vec{\mathbb{Z}}$  of registers, and a vector  $M \in \vec{\mathbb{Z}}$  representing memory.
- A **branch** has an input condition register  $r_c$  and a relative pc offset.
- A **jump** is an unconditional indirect jump to the pc contained in the register  $r_d$ .
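The architectural rules of Figure 1 can be read as a small interpreter. The following is a minimal sketch, not part of the original model: the tuple encoding of instructions, the dictionary-backed memory that defaults to 0, floor division for `div`, and halting when `pc` leaves the program are all assumptions made for illustration. The `timer` instruction is omitted because Figure 1 gives it no architectural rule.

```python
from dataclasses import dataclass, field

# Binary operations named in Figure 1 (integer semantics are an assumption).
OPS = {
    "eq": lambda a, b: int(a == b), "ge": lambda a, b: int(a >= b),
    "add": lambda a, b: a + b, "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b, "div": lambda a, b: a // b,
    "shl": lambda a, b: a << b, "shr": lambda a, b: a >> b,
    "and": lambda a, b: a & b, "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}

@dataclass
class State:
    pc: int = 0
    R: list = field(default_factory=lambda: [0] * 16)  # registers r0..r15
    M: dict = field(default_factory=dict)              # memory, default 0

def step(program, s):
    """Apply exactly one architectural rule to state s (mutates s)."""
    ins = program[s.pc]
    kind = ins[0]
    if kind == "const":                      # Arch-Const
        _, rd, k = ins
        s.R[rd] = k
        s.pc += 1
    elif kind == "load":                     # Arch-Load
        _, rd, ra, k = ins
        s.R[rd] = s.M.get(s.R[ra] + k, 0)
        s.pc += 1
    elif kind == "store":                    # store rule of Figure 1
        _, ra, k, rv = ins
        s.M[s.R[ra] + k] = s.R[rv]
        s.pc += 1
    elif kind == "binop":                    # Arch-Binop
        _, op, rd, ra, rb = ins
        s.R[rd] = OPS[op](s.R[ra], s.R[rb])
        s.pc += 1
    elif kind == "branch":                   # Arch-BranchTaken / NotTaken
        _, rc, d = ins
        s.pc += d if s.R[rc] != 0 else 1
    elif kind == "jump":                     # Arch-Jump
        _, rd = ins
        s.pc = s.R[rd]
    else:
        raise ValueError(f"unknown instruction {ins!r}")

def run(program, s, max_steps=10_000):
    """Step until pc leaves the program (an assumed halting convention)."""
    steps = 0
    while 0 <= s.pc < len(program) and steps < max_steps:
        step(program, s)
        steps += 1
    return s

# Sum 3 + 2 + 1 with a backward branch, then store the result at address 100.
program = [
    ("const", 0, 3), ("const", 1, 0), ("const", 2, 1),
    ("binop", "add", 1, 1, 0),   # r1 += r0
    ("binop", "sub", 0, 0, 2),   # r0 -= 1
    ("branch", 0, -2),           # loop while r0 != 0
    ("const", 3, 100),
    ("store", 3, 0, 1),          # M[r3 + 0] = r1
]
final = run(program, State())
```

Because every rule is deterministic, `run` always produces the same final state for a given program and initial state, which is the property the later speculative semantics relies on.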

## 3.1 Microarchitecture

Our architecture is simple yet complete enough to illustrate several broad classes of vulnerabilities that arise from speculative execution. We will now begin enhancing this architecture with models of  $\mu$ -architectural state in order to study how speculative vulnerabilities arise.

#### 3.1.1 Modeling memory caches

We extend the architectural state of the program  $\Sigma$  by adding  $\mu$ -state  $C : \mathbb{Z}^{C_{max}}$ , a list of cached memory addresses of maximum fixed capacity  $C_{max} > 0$  which holds the addresses currently stored in the cache. We add a cache-update function  $LRU_n : (\mathbb{Z}^n, \mathbb{Z}) \to \mathbb{Z}^n$  which models adding a new memory address to the cache and evicting the least-recently-used<sup>5</sup> address if necessary to limit the entries to n. Evaluation rules are as before, with the following modifications to the rules for **load** and **store**:

$$\frac{\rho[pc] = \mathbf{load} \ r_d \ [r_a + k] \quad R[r_a] + k \notin C \quad C' = LRU_{C_{max}}(C, R[r_a] + k)}{\rho, \langle pc, R, M, C \rangle \longrightarrow \langle pc, R, M, C' \rangle} \quad \text{Uncached-Load}$$

$$\frac{\rho[pc] = \mathbf{store} \ [r_a + k] \ r_v \quad R[r_a] + k \notin C \quad C' = LRU_{C_{max}}(C, R[r_a] + k)}{\rho, \langle pc, R, M, C \rangle \longrightarrow \langle pc, R, M, C' \rangle} \quad \text{Uncached-Store}$$

$$\frac{\rho[pc] = \mathbf{load} \ r_d \ [r_a + k] \quad R[r_a] + k \in C \quad C' = LRU_{C_{max}}(C, R[r_a] + k)}{\rho, \langle pc, R, M, C \rangle \longrightarrow \langle pc + 1, R[r_d] \leftarrow M[R[r_a] + k], M, C' \rangle} \quad \text{Cached-Load}$$

$$\frac{\rho[pc] = \mathbf{store} \ [r_a + k] \ r_v \quad R[r_a] + k \in C \quad C' = LRU_{C_{max}}(C, R[r_a] + k)}{\rho, \langle pc, R, M, C \rangle \longrightarrow \langle pc + 1, R, M[R[r_a] + k] \leftarrow R[r_v], C' \rangle} \quad \text{Cached-Store}$$

That is, we require that loads and stores first load their addresses into the cache, penalizing uncached accesses with an extra  $\mu$ -step.
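To make the extra $\mu$-step concrete, the sketch below counts steps for a sequence of loads under the cache rules above. It is an illustration only: representing $C$ as a Python list ordered from least to most recently used, and the helper names `lru` and `load_steps`, are assumptions rather than part of the model.

```python
def lru(cache, addr, capacity):
    """LRU_n: move addr to the most-recently-used end, keep at most `capacity` entries."""
    updated = [a for a in cache if a != addr] + [addr]
    return updated[-capacity:]

def load_steps(cache, addr, capacity):
    """Return (mu_steps, new_cache) for one load of addr.

    An uncached address first takes the Uncached-Load step (pc unchanged,
    only C changes), then the Cached-Load step completes the access.
    """
    steps = 0
    if addr not in cache:
        cache = lru(cache, addr, capacity)   # Uncached-Load
        steps += 1
    cache = lru(cache, addr, capacity)       # Cached-Load
    steps += 1
    return steps, cache

# A cache of capacity 2: the third distinct address evicts the oldest one.
cache = []
trace = []
for addr in [10, 10, 20, 30, 10]:
    steps, cache = load_steps(cache, addr, capacity=2)
    trace.append(steps)
```

The difference between one and two steps per access is exactly what a step-counting timer can later observe.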

#### 3.1.2 Branch Predictor State

Branch prediction is a far more complex process than adding a cache, since the CPU begins to execute instructions speculatively before a branch outcome is known, and discards the architectural effects if the speculation is wrong. However, the  $\mu$ -state that feeds branch prediction is relatively simple. We add  $\mu$ -state  $B:(\mathbb{Z},\mathbb{B})^{B_{max}}$ , a list of program counter location/prediction pairs of maximum fixed capacity  $B_{max}>0$ . We add a prediction-update function  $BP_n:((\mathbb{Z},\mathbb{B})^n,\mathbb{B})\to(\mathbb{Z},\mathbb{B})^n$  that updates the prediction for a branch given its outcome. The prediction state is updated after a branch outcome is known. It will be used later in Section 3.1.4.

$$\frac{\rho[pc] = \mathsf{branch} \, r_c \, d \quad R[r_c] \neq 0 \quad B' = BP_{B_{max}}(B, \mathsf{true})}{\rho, \langle pc, R, M, B \rangle \longrightarrow \langle pc + d, R, M, B' \rangle} \quad \text{Record-BranchTaken}$$
 
$$\frac{\rho[pc] = \mathsf{branch} \, r_c \, d \quad R[r_c] = 0 \quad B' = BP_{B_{max}}(B, \mathsf{false})}{\rho, \langle pc, R, M, B \rangle \longrightarrow \langle pc + 1, R, M, B' \rangle} \quad \text{Record-BranchNotTaken}$$
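The model leaves the prediction policy abstract. As one possible instantiation, the sketch below keeps the last observed outcome per branch and defaults to not-taken for unknown branches. Both choices are assumptions, as is passing the branch's `pc` explicitly to the update function (the signature of $BP_n$ above leaves it implicit).

```python
def bp_update(table, pc, taken, capacity):
    """Record the outcome of the branch at pc, evicting the oldest entry when full."""
    updated = [(p, t) for (p, t) in table if p != pc] + [(pc, taken)]
    return updated[-capacity:]

def predict(table, pc, default=False):
    """Return the recorded outcome for pc, or `default` if pc is not in the table."""
    for p, t in table:
        if p == pc:
            return t
    return default

B = []
first_guess = predict(B, 5)            # nothing recorded yet
B = bp_update(B, 5, True, capacity=2)
trained_guess = predict(B, 5)          # last outcome was "taken"
B = bp_update(B, 6, False, capacity=2)
B = bp_update(B, 7, True, capacity=2)  # evicts the entry for pc 5
evicted_guess = predict(B, 5)
```

Even this trivial policy has the property the attacks rely on: the prediction for a branch is a function of earlier executions, which an attacker may control.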

#### 3.1.3 Indirect Branch Predictor State

We model indirect branch prediction by additional  $\mu$ -state  $J:(\mathbb{Z},\mathbb{Z})^{J_{max}}$  a list of program counter location pairs of maximum fixed capacity  $J_{max}>0$ . While similar to the branch predictor state, this jump table, which is often referred to as the *branch target buffer* in computer architecture literature, stores a target address for each indirect jump. As with the branch prediction state, we add a function  $JP_n:((\mathbb{Z},\mathbb{Z})^n,\mathbb{Z})\to(\mathbb{Z},\mathbb{Z})^n$  which updates the jump table with the target address for a given jump after it is executed.

$$\frac{\rho[pc] = \mathsf{jump} \ r_d \quad J' = JP_{J_{max}}(J, R[r_d])}{\rho, \langle pc, R, M, J \rangle \longrightarrow \langle R[r_d], R, M, J' \rangle} \quad \text{Record-Jump}$$

#### 3.1.4 Modeling Control Speculation

To model control speculation, we further augment the state with a reorder buffer which is a sequence of instructions that are waiting to be evaluated and will later be committed to the architectural state. The reorder buffer models two aspects of speculative execution:

 $<sup>^{5}</sup>$ The exact cache replacement algorithm is mostly orthogonal to this work. We use LRU here because it is simple to model.

- Instructions are evaluated out-of-order, after their input dependencies are executed.
- Branch outcome can be predicted before the condition is available, starting speculation.

The semantic state  $\langle \sigma, spc, E, B \rangle$  is composed of the architectural state  $\sigma$ , speculative program counter spc that points to the next instruction to issue, reorder buffer E and branch predictor state B. We model the reorder buffer E as a sequence of triples of the form  $\langle pc, v_{\perp}, b \rangle$ , where pc is the address of the instruction in the reorder buffer,  $v_{\perp}$  is the result value of the instruction if it was evaluated out-of-order, and  $b \in \mathbb{B}$  is the branch prediction for branch instructions, ignored for non-branches.

The operational semantics for speculative execution is split into three groups of rules. The *issue* rules (Figure 2) insert an entry into the reorder buffer for each instruction that does not speculate. For the **branch** instruction, the issue rule ignores the condition and instead uses the branch predictor to determine the new program counter, recording the branch prediction in the reorder buffer entry (rule S-Branch-Issue). The *execute* rules (Figure 3) evaluate instructions with available dependencies. We only have execute rules for instructions that produce values; the available values in registers and memory are computed by the auxiliary reorder buffer lookup predicates in Figure 5. The *commit* rules (Figure 4) update the architectural state with the result of the oldest instruction in the reorder buffer. Since our architectural semantics is deterministic, we simply lift the architectural semantics rules to the microarchitectural rules for all the instructions that cannot mis-speculate (S-NS-Commit). When committing the **branch** instruction, we flush the reorder buffer if the speculated condition turned out to be wrong (rule S-Predict-Fail).

$$\frac{\operatorname{nospec}(\rho[spc])}{\rho, \langle \sigma, spc, E, B \rangle \xrightarrow{\operatorname{spec}} \langle \sigma, spc + 1, E + [\langle spc, \bot, \bot \rangle], B \rangle} \quad \text{S-Seq-Issue}$$

$$\frac{\begin{array}{c} \rho[spc] = \operatorname{branch} r_c \, d \qquad b = \operatorname{predict}(B, spc) \\ spc' = \begin{cases} spc + d & \text{if} \quad b = \operatorname{true} \\ spc + 1 & \text{otherwise} \end{cases} \end{array}}{\rho, \langle \sigma, spc, E, B \rangle \xrightarrow{\operatorname{spec}} \langle \sigma, spc', E + [\langle spc, \bot, b \rangle], B \rangle} \quad \text{S-Branch-Issue}$$

where the non-speculating instruction predicate is defined by

$$\operatorname{nospec}(i) = \begin{cases} \mathtt{true} & \text{if} \quad i = \mathsf{const} \: r_d \: k \\ & \text{or} \quad i = \mathsf{load} \: r_d \: [r_a + k] \\ & \text{or} \quad i = \mathsf{store} \: [r_a + k] \: r_v \\ & \text{or} \quad i = \mathsf{binop} \: op \: r_d \: r_a \: r_b \\ \mathtt{false} & \text{otherwise} \end{cases}$$

Figure 2: Speculative semantics - issue instructions.

It is useful to observe that the branch speculative semantics coincides with the architectural semantics until the first misspeculation in the reorder buffer. To make the notions of state agreement and misspeculation precise, we will say that program  $\rho$ 's architectural state  $\sigma = \langle pc, R, M \rangle$  agrees with its microarchitectural state  $\langle \langle pc', R', M' \rangle, spc, \{e_i\}_{i=0}^n, B \rangle$  at depth  $m \leq n$ , if

- for all registers r,  $\mathbf{lookupR}(\rho, R', \{e_i\}_{i=0}^m, r)$  either is bottom or equals R[r],
- for all memory locations l,  $\mathbf{lookupM}(\rho, R', M', \{e_i\}_{i=0}^m, l)$  either is bottom or equals M[l],
- the program counter pc is equal to the program counter of  $e_{m+1}$  if m < n, or pc = spc otherwise.

For program  $\rho$ , we define the state after applying reorder buffer entries  $\{e_i\}_{i=0}^m$ , written  $\text{apply}(\rho, \sigma, \{e_i\}_{i=0}^m)$ , to be the unique state that is reached after m transitions, i.e.,  $\rho, \sigma \longrightarrow_m \text{apply}(\rho, \sigma, \{e_i\}_{i=0}^m)$ . Note that we do not have to decode the buffer because our architectural semantics is deterministic. For a non-deterministic architecture, we would have to replay the instructions contained in the reorder buffer.

Figure 3: Speculative semantics - execute instructions.

$$\frac{\operatorname{nospec}(\rho[pc]) \quad \rho, \sigma \longrightarrow \sigma'}{\rho, \langle \sigma, spc, \langle pc, v_{\perp}, b \rangle + E, B \rangle \xrightarrow{\operatorname{spec}} \langle \sigma', spc, E, B \rangle} \quad \text{S-NS-Commit}$$

$$\frac{\begin{array}{c} \rho[pc] = \operatorname{branch} r_c \, d \\ \langle b, pc' \rangle = \begin{cases} \langle \operatorname{true}, pc + d \rangle & \text{if} \quad R[r_c] \neq 0 \\ \langle \operatorname{false}, pc + 1 \rangle & \text{if} \quad R[r_c] = 0 \end{cases} \\ B' = BP_{B_{max}}(B, b) \end{array}}{\rho, \langle \langle pc, R, M \rangle, spc, [\langle pc, \bot, b \rangle] + E, B \rangle \xrightarrow{\operatorname{spec}} \langle \langle pc', R, M \rangle, spc, E, B' \rangle} \quad \text{S-Predict-Success}$$

$$\frac{\begin{array}{c} \rho[pc] = \operatorname{branch} r_c \, d \\ \langle b, pc' \rangle = \begin{cases} \langle \operatorname{false}, pc + d \rangle & \text{if} \quad R[r_c] \neq 0 \\ \langle \operatorname{true}, pc + 1 \rangle & \text{if} \quad R[r_c] = 0 \end{cases} \\ B' = BP_{B_{max}}(B, \operatorname{invert}(b)) \end{array}}{\rho, \langle \langle pc, R, M \rangle, spc, [\langle pc, \bot, b \rangle] + E, B \rangle \xrightarrow{\operatorname{spec}} \langle \langle pc', R, M \rangle, spc, [], B' \rangle} \quad \text{S-Predict-Fail}$$

Figure 4: Speculative semantics - commit instructions.

We will say that program  $\rho$ 's microarchitectural state mis-speculates at reorder buffer depth m if the m-th entry in the reorder buffer is a branch with prediction b and evaluating the condition by applying the first m instructions from the reorder buffer on the architectural state disagrees with b. More precisely, we say  $\langle \langle pc', R', M' \rangle, spc, \{\langle pc_i, v_i, b_i \rangle\}_{i=0}^n, B \rangle$  mis-speculates at reorder buffer depth  $m \leq n$  if  $\rho[pc_m] = \mathbf{branch} \, r_c \, d$  and, for  $\text{apply}(\rho, \langle pc', R', M' \rangle, \{\langle pc_i, v_i, b_i \rangle\}_{i=0}^n) = \langle pc'_m, R'_m, M'_m \rangle$ , we have either  $R'_m[r_c] = 0$  and  $b_m = \mathtt{true}$ , or  $R'_m[r_c] \neq 0$  and  $b_m = \mathtt{false}$ .

**Theorem 1** Given a program  $\rho$ 's microarchitectural state  $\mu = \langle \sigma, spc, \{e_i\}_{i=0}^n, B \rangle$  reachable from a state with empty reorder buffer, either the state  $\text{apply}(\rho, \sigma, \{e_i\}_{i=0}^m)$  agrees with  $\mu$  at all depths  $m \leq n$ , or  $\mu$  mis-speculates at depth k and  $\text{apply}(\rho, \sigma, \{e_i\}_{i=0}^m)$  agrees with  $\mu$  at depths  $m \leq k$ .

#### 3.1.5 Timer

To implement a counter that counts  $\mu$ -architectural steps, we add  $\mu$ -state  $T \in \mathbb{N}$  and extend the semantics with a meta-rule that increments the count for every step of evaluation. Reading the timer is then accomplished with the straightforward rule.

$$\frac{\rho, \langle pc, R, M \rangle \xrightarrow{\alpha} \langle pc', R', M' \rangle \quad T' = T + 1}{\rho, \langle pc, R, M, T \rangle \longrightarrow \langle pc', R', M', T' \rangle} \quad \text{Tick-Rule}$$

$$\frac{\rho[pc] = \mathbf{timer} \ r_d \quad T' = T + 1}{\rho, \langle pc, R, M, T \rangle \longrightarrow \langle pc + 1, R[r_d] \leftarrow T, M, T' \rangle} \quad \text{Timer-Read}$$

$$\frac{\rho[spc] = \mathbf{load} \ r_d \ [r_a + k]}{\mathbf{lookupR}(\rho, R, E + [spc, v^{\perp}, \bot], r_d) = v^{\perp}} \quad \text{S-LookupR-Load}$$
 
$$\frac{\rho[spc] = \mathbf{binop} \ op \ r_d \ r_a \ r_b}{\mathbf{lookupR}(\rho, R, E + [spc, v^{\perp}, \bot], r_d) = v^{\perp}} \quad \text{S-LookupR-Op}$$
 
$$\frac{\rho[spc] \neq \mathbf{load} - \rho[spc] \neq \mathbf{binop} - \dots - \rho[spc] \neq \mathbf{binop} - \dots - \rho[spc] + \rho[spc] - \rho[spc]}{\mathbf{lookupR}(\rho, R, E, r_v) = v^{\perp}} \quad \text{S-LookupR-Other}$$
 
$$\frac{\rho[spc] = \mathbf{store} \ [r_a + k] \ r_v}{\mathbf{lookupR}(\rho, R, E, r_a) = v_a} \quad \text{S-LookupR-Base}$$
 
$$\frac{\rho[spc] = \mathbf{store} \ [r_a + k] \ r_v}{\mathbf{lookupM}(\rho, R, M, E + [spc, \bot, \bot], v_a + k) = v} \quad \text{S-LookupM-Store-Value}$$
 
$$\frac{\rho[spc] = \mathbf{store} \ [r_a + k] \ r_v}{\mathbf{lookupM}(\rho, R, M, E + [spc, \bot, \bot], v_a) = \bot} \quad \text{S-LookupM-Store-Unknown}$$
 
$$\frac{\rho[spc] = \mathbf{store} \ [r_a + k] \ r_v}{\mathbf{lookupM}(\rho, R, M, E, r_a) = v^{\perp}} \quad \text{S-LookupM-Store-NoAlias}$$
 
$$\frac{\rho[spc] = \mathbf{store} \ [r_a + k] \ r_v}{\mathbf{lookupM}(\rho, R, M, E, v_a) = v^{\perp}} \quad \text{S-LookupM-Store-NoAlias}$$
 
$$\frac{\rho[spc] \neq \mathbf{store} \ \_}{\mathbf{lookupM}(\rho, R, M, E, v_a) = v^{\perp}} \quad \text{S-LookupM-Store-NoAlias}$$
 
$$\frac{\rho[spc] \neq \mathbf{store} \ \_}{\mathbf{lookupM}(\rho, R, M, E, v_a) = v^{\perp}} \quad \text{S-LookupM-Other}$$
 

Figure 5: Auxiliary predicates

## 3.2 Composing $\mu$ -architectural extensions

As we've seen in the previous sections, a number of extensions are possible to model  $\mu$ -architectural mechanisms by adding state and additional evaluation rules. To create a complete model of a  $\mu$ -architecture, we can *compose* these extensions into a larger model that contains state for caches, branch prediction, indirect branch prediction, out-of-order execution, and a timer. As most of these extensions have orthogonal state, this is generally straightforward (though perhaps tedious).

## 3.3 Cache-based side-channels

Real CPU caches copy data in physical memory, include virtual or physical tags, and include  $\mu$ -state for tracking dirtiness and implementing coherency protocols across multiple cache levels spread over many cores and processors. Our model is thus very simplified, since these fine details are not needed for our

#### 3.3.1 Encoding information in cachedness

Caches are typically measured by their capacity to store program data, yet we recognize they have a meta-information capacity,  $C_{info}$ , that represents the information they store about program history. This capacity depends on the exact details of the  $\mu$ -state, such as the number of levels, replacement policy, inclusiveness, etc. In our model,  $C_{info} = C_{max} \log |M|$ <sup>6</sup>, where  $C_{max}$  and |M| are respectively the cache capacity and the maximum memory address.
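As a rough worked instance of this formula, take hypothetical values  $C_{max} = 2^{9}$  cache entries and  $|M| = 2^{32}$  addresses, and assume a base-2 logarithm so that the result is in bits. These numbers are chosen only for illustration and do not describe any particular CPU.

$$C_{info} = C_{max} \log_2 |M| = 2^{9} \cdot 32 = 16384 \text{ bits} = 2 \text{ KiB}$$

Under these assumptions the cache would carry up to 2 KiB of information about which addresses were accessed, independent of the program data held in the cached lines.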

We need an encoding scheme to store information in the cachedness of memory. The fact that all memory accesses may alter the state of the cache means it must be robust against:

- incidental accesses that occur between storing information in the cache and its retrieval
- accesses by the implementation of the encoding scheme itself, particularly retrieval
- noise caused by interrupt handlers and concurrent processes sharing the same cache

The decoding problem restricts the amount of meta-information we can store, so  $C_{info}$  is not achievable in practice. An attack, however, may only require leaking a single bit or byte at a time. Two schemes are:

- **Direct-mapped bits**. For a small number  $b < C_{max}$  of bits, we choose addresses  $a_0 \ldots a_{b-1}$ . We store a 1 for bit i by accessing  $a_i$ , bringing it into the cache. We store a 0 for bit i by evicting  $a_i$  from the cache<sup>7</sup>. A program can read bit i by measuring the access time to  $a_i$  using the **timer** instruction, determining 1 if fast, 0 if slow.
- **Indexed values**. For a small number  $b < \log C_{max}$  of bits, we use a larger number of addresses  $a_0 \ldots a_{2^b-1}$ . To store the value  $v = B_{b-1} \ldots B_1 B_0$ , we access address  $a_v$ . To read a value from the cache, we probe lines  $a_0 \ldots a_{2^b-1}$ . The fast address  $a_v$  indicates the value v.

Both schemes have advantages and disadvantages. Direct-mapped is space-efficient but requires b accesses to write b bits. It is less robust to information loss, since each stored 1 bit is at risk of being flipped to 0 by eviction through incidental accesses. Indexed-value has the advantage that one access can write b bits, but it requires  $2^b$  accesses to read b bits. It is also more robust to errors, since the one cached memory address which carries information is less likely to be evicted by accident. In our work on real hardware, we typically used the indexed-value scheme with  $4 \le b \le 8$ , both because it was easier to leak information from speculation this way, and it was more robust to errors and noise.
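The indexed-value scheme can be exercised against the simple LRU cache model. The sketch below is illustrative only: the `Cache` class, the cost of 1 step for a hit and 2 for a miss, the address constants, and flushing by touching a full cache's worth of unrelated scratch addresses are all assumptions. In this model each address is its own cache line, whereas real probe addresses would be spaced by at least a cache line.

```python
class Cache:
    """LRU list of addresses, least recently used first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []

    def access(self, addr):
        """Touch addr and return its cost in mu-steps: 1 on a hit, 2 on a miss."""
        hit = addr in self.lines
        self.lines = ([a for a in self.lines if a != addr] + [addr])[-self.capacity:]
        return 1 if hit else 2

def flush(cache, scratch_base):
    """Evict everything by touching `capacity` unrelated addresses (an eviction set)."""
    for i in range(cache.capacity):
        cache.access(scratch_base + i)

def write_value(cache, probe_base, v):
    """Encode v with a single access to a_v."""
    cache.access(probe_base + v)

def read_value(cache, probe_base, bits):
    """Decode by timing all 2^bits probe addresses and returning the fastest."""
    timings = [cache.access(probe_base + i) for i in range(2 ** bits)]
    return timings.index(min(timings))

PROBE_BASE, SCRATCH_BASE, BITS = 1000, 10_000, 4   # hypothetical layout
cache = Cache(capacity=64)                         # 2^BITS = 16 < 64
flush(cache, SCRATCH_BASE)
write_value(cache, PROBE_BASE, 11)
decoded = read_value(cache, PROBE_BASE, BITS)
```

The capacity is deliberately larger than the number of probe lines. With a smaller cache, the probes themselves would evict $a_v$ before it is timed, which is the retrieval-robustness concern listed above.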

## 3.4 Vulnerabilities by the Numbers

#### 3.4.1 Variant 1: Speculative Safety Check Bypass

Programs often contain branches that implement safety checks to prevent unsafe runtime behavior such as accessing outside the bounds of allocated memory or accessing an object of an incorrect type. Depending on the programming language semantics, an out-of-bounds access might result in a language-level exception being thrown or a sentinel value like undefined or 0 being returned. Most programs are well-behaved, so safety checks normally pass. When executing such programs, CPU branch predictors quickly learn to predict these branches. The uncommon failed safety check will result in the processor's normal recovery mechanism for incorrectly predicted branches rolling back architectural state to before the mispredicted branch and instead executing the proper architectural path. However, as we have seen in our model, CPUs do not generally roll back the  $\mu$ -state changes, so changes to caches are not undone. Therein lies our first vulnerability.

An attacker can construct a program that trains the CPU's branch predictor to assume that safety checks normally pass, and then intentionally triggers a misprediction that results in speculatively executing code where safety conditions normally established by branches do not hold. To date, we've all assumed that this was innocent because architectural state rollback would not allow an attacker to make use of any information accessed in misspeculated executions. However, a careful attacker can exfiltrate information from speculative execution through  $\mu$ -state.

<sup>6</sup>Note that despite first intuition, we cannot actually store a bit per address, since only  $C_{max}$  bits can simultaneously be

<sup>7</sup>We can evict a line by filling the cache with other addresses that alias it, called an *eviction set*.

Figure 6: Illustration of a variant 1 attack

The example in Figure 6 shows a program that attacks bounds checks. The routine vulnerable accepts an integer index argument. Compiled code loads the array length and then checks the index against this length. If the index is in-bounds, it loads the array element and then uses that value as an index into a second array, timing. If the index is out-of-bounds, the code simply returns 0. Clearly, no execution of this program should access outside the bounds of the array, as the load is guarded by a bounds check. However, the attacker trains the branch predictor to assume not-taken (Figure 6d, with  $\mu$ -state shown in light gray). After training, the attacker crafts a special out-of-bounds access to cause the CPU to mis-speculate and load the out-of-bounds memory location (Figure 6e). Even though the CPU rolls back the speculative architectural state and executes @oob, the value at secret has already been encoded into  $\mu$ -state as the cachedness of the address timing+secret. The attacker decodes the secret data using the techniques in Section 3.3.
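The train, attack, and decode steps described above can be simulated end to end on a toy machine. This is a deliberately simplified sketch under stated assumptions: it collapses the reorder buffer into a single "predicted in-bounds" flag, uses a last-outcome predictor, charges 1 step for a cached load and 2 for an uncached one, and uses a hypothetical memory layout (`LENGTH_ADDR`, `ARRAY_BASE`, `SECRET_ADDR`, `TIMING_BASE`, `SCRATCH_BASE`). It is not the machine code of Figure 6.

```python
LENGTH_ADDR, ARRAY_BASE, ARRAY_LEN = 0, 1, 4
SECRET_ADDR = 9        # outside the array, which occupies addresses 1..4
TIMING_BASE = 1000     # 256 probe addresses, one per possible byte value
SCRATCH_BASE = 5000    # used only to flush the cache

class Machine:
    def __init__(self, secret, capacity=512):
        self.M = {LENGTH_ADDR: ARRAY_LEN, SECRET_ADDR: secret}
        for i in range(ARRAY_LEN):
            self.M[ARRAY_BASE + i] = i + 1
        self.cache = []                  # LRU, least recently used first
        self.capacity = capacity
        self.predict_in_bounds = False   # one-entry, last-outcome predictor

    def load(self, addr):
        """Return (value, cost); the cache update is never rolled back."""
        hit = addr in self.cache
        self.cache = ([a for a in self.cache if a != addr] + [addr])[-self.capacity:]
        return self.M.get(addr, 0), (1 if hit else 2)

def vulnerable(m, index):
    length, _ = m.load(LENGTH_ADDR)
    in_bounds = 0 <= index < length      # architectural branch outcome
    result = 0
    if m.predict_in_bounds or in_bounds:
        # Runs architecturally when in bounds, or only speculatively when
        # the predictor wrongly expects the check to pass.
        value, _ = m.load(ARRAY_BASE + index)
        timed, _ = m.load(TIMING_BASE + value)
        if in_bounds:
            result = timed               # mis-speculated results are discarded
    m.predict_in_bounds = in_bounds      # predictor learns the real outcome
    return result

def attack(m, target_addr):
    for i in range(ARRAY_LEN):           # 1. train with in-bounds indices
        vulnerable(m, i)
    for i in range(m.capacity):          # 2. flush the cache
        m.load(SCRATCH_BASE + i)
    returned = vulnerable(m, target_addr - ARRAY_BASE)   # 3. out-of-bounds call
    timings = [m.load(TIMING_BASE + v)[1] for v in range(256)]  # 4. probe
    return returned, timings.index(min(timings))

returned, leaked = attack(Machine(secret=42), SECRET_ADDR)
```

In this run the out-of-bounds call still returns 0 architecturally, yet the probe recovers the byte stored at `SECRET_ADDR`, because the speculative load left `TIMING_BASE + secret` in the cache.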

Another vulnerability is exposed by indirect jumps, which are typically used by programs to implement type-dispatched behavior, numerically-indexed switch statements and threaded bytecode interpreters. Often the construction of objects in memory enforces implicit safety properties. For example, a typical implementation of classes stores a header word before each object in memory which points to metadata for type A, such as a virtual dispatch table or vtable, if and only if the object is of type A in the source program. An indirect jump through the virtual dispatch table allows the target of the indirect jump to assume, without checking, that receiver objects are of the proper type.

The example shown in Figure 7 shows that indirect branch speculation violates this assumption. The routine vulnerable accepts a single argument which is the address of an object. It assumes that the first memory cell of the object contains a pointer to code that implements a virtual function virtualFunc. If the routine is repeatedly called with objects of type A, the indirect branch predictor will predict that the indirect call always jumps to @A.virtualFunc (Figure 7d). When the routine is then called with an object of type B, the CPU mispredicts and speculatively jumps to @A.virtualFunc, but with an argument of type B (Figure 7e). Objects of type A contain a pointer in the second cell, while objects of

```
class A { int* a; };
function A.virtualFunc() {
  return timing[*(this.a)]
                                                                                                                                 0: load r1 [r0 + #0]
1: jump r1
                                                                                                                                                                                            load obj.virtualFunc (vulnerable) indirect jump
        class B { int b; };
       function B.virtualFunc() {
  return this.b
                                                                                                                               @A.virtualFunc:
2: load r2 [r0 + #1]
3: load r1 [r2]
4: load r0 [r1 + #timing]
5: jump r15
       function vulnerable(obj) {
  return obj.virtualFunc();
                                                                                                                                                                                            load *(this.a)
load [timing +
return
                                                                                                                               @B.virtualFunc:
6: load r0 [r0 + #1]
7: jump r15
       vulnerable(new A(&val)); // Train
int addr = <craft secret address>;
vulnerable(new B(addr); // Attack
                                                                                                                                                                   (b) Vulnerable machine code
                                       (a) Vulnerable code
                \begin{array}{c|c} 0 & 0 \\ 1 & 1 \end{array}

\begin{array}{c|cccc}
 & r0 & r1 & r2 \\
\hline
 & 0 & 0
\end{array}

                                                                                                                                                                                       r0 r1 r2
4 2 0
                                                                                                                                                   0
                                                                                                                                                                                                                  2
                               timing
                 2 2
                3 3
4 2

\begin{array}{c|cccc}
 & r0 & r1 & r2 \\
\hline
 & 2 & 0
\end{array}

                                                                                                                                                                                                             ... 2
                                                                                                                                                   0
                5 6
                             A.a
                6 3
                             integer pointed to by A.z
                                                                                                                                                                                                                                                                @1 -> @2
                7 6
                            B.virtualFunc
                                                                                                                                                                                                                                                           C timing + 3
                            B.b (address of secret)
                                                                                                                                                     (d) Training the predictor
                        (c) Memory layout
                                                                                                                                                                                                                                                             r0 r1 r2

\begin{array}{c|cccc}
 & r1 & r2 \\
\hline
 & 6 & 0
\end{array}

\begin{array}{c|cccc}
 & r0 & r1 & r2 \\
\hline
 & 7 & 0 & 0
\end{array}

\begin{array}{c|cccc}
 & r0 & r1 & r2 \\
\hline
 & 7 & 0 & 0
\end{array}

                                                                                                                               7 0 0
                                                                                                                                                         0
                                                                                                                                                                                                                         1
                                                                                                                                                                                                                                                                                         7
                                                                                                                                \begin{array}{c|cccc} \mathbf{r0} & \mathbf{r1} & \mathbf{r2} \\ \hline \mathbf{2} & \mathbf{2} & \mathbf{9} & \dots \end{array} 
                                                                                                                                                                                              r0 r1 r2
7 6 0
                                                                                                                                                                                                                                                             r0 r1 r2
1 6 0
                                                                                                                                                         рс
5
                                                               \begin{array}{c|cccc} \mathbf{r}0 & \mathbf{r}1 & \mathbf{r}2 \\ \hline \mathbf{7} & \mathbf{?} & \mathbf{0} & \cdots \end{array}
                                                                                                                                                                                                                         1
                                                                                                                                                                                                                                                                                        7
7 0 0 ...
                          0
                                                                                         1
            @1 -> @2
                                                                          @1 -> @2
                                                                                                                                       @1 -> @2
                                                                                                                                                                                                  P @1 -> ?
                                                                                                                                                                                                                                                                  P @1 -> @6
                                                                    C
                                                                                                                                                                                                  C timing + 2
```

Figure 7: Illustration of a indirect branch variant 1 attack

(e) Performing the attack

type B contain an integer in the second cell. Thus when QA.virtualFunc is executed speculatively on an object of type B, the CPU will mistake an integer field for a pointer, speculatively loading from this attacker-crafted pointer. The attacker again exfiltrates the value through the $\mu$-state of the cache.

#### 3.4.2 Variant 2: Speculative Target Misreconstruction

In variant 1 we've shown that indirect jump prediction can be exploited to bypass the implicit type checks that are part of a typical language's virtual dispatch mechanism. As it turns out, the branch target buffers on most CPUs are approximate in order to save space. For example, Intel 64-bit CPUs only store the low-order 32 bits of the from address (the address of the indirect jump) and the low-order 32 bits of the relative target address (the predicted address). Upon lookup, the predictor ignores the upper 32 bits of the from address and reuses a prediction for an aliased from address. This allows an attacker to train a target indirect branch to speculatively jump to any address within a 4GiB range without ever executing the victim branch.
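The aliasing described above can be illustrated with a toy predictor. This is only a sketch: the class `ApproximateBTB`, its methods, and the rule used to rebuild a full target from a stored 32-bit relative offset are assumptions made for illustration, not a description of any real CPU's predictor.

```python
MASK32 = (1 << 32) - 1

class ApproximateBTB:
    """Toy branch target buffer that keeps only 32 bits of each address (hypothetical model)."""

    def __init__(self):
        self.entries = {}

    def train(self, from_addr, target_addr):
        # Key: low-order 32 bits of the branch address.
        # Value: low-order 32 bits of the target relative to the branch.
        self.entries[from_addr & MASK32] = (target_addr - from_addr) & MASK32

    def predict(self, from_addr):
        rel = self.entries.get(from_addr & MASK32)
        if rel is None:
            return None
        # Assumed reconstruction: treat the stored value as a signed 32-bit offset.
        if rel >= 1 << 31:
            rel -= 1 << 32
        return from_addr + rel

btb = ApproximateBTB()
attacker_branch = 0x0000700012345678
victim_branch = 0x00007FFF12345678   # differs only in the upper 32 bits

# The attacker trains its own branch; the victim branch is never executed.
btb.train(attacker_branch, attacker_branch + 0x1000)
predicted = btb.predict(victim_branch)
```

In this model the victim branch inherits the attacker's prediction, so `predicted` lies at the attacker-chosen offset from the victim branch even though the victim branch was never trained.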

This is particularly bad, because the attacker can create speculative indirect jumps to anywhere, i.e., control flow that cannot possibly exist in the original code, such as jumping into the middle of arbitrary machine code that simply happens to be a leaking gadget. That means an attacker may not even need to craft an instruction sequence, but can instead find an extant instruction sequence in the victim's code, similar to return-oriented programming. This can even work across processes. Kocher et al. [20] found that the branch target buffer on Intel chips is shared across hyperthreads on the same core, allowing one process to inject predictions into another. As a mitigation for this attack, a subsequent microcode update from Intel disabled this sharing [16].

Appendix A.1 provides an illustrative example of a variant 2 attack.

#### 3.4.3 Variant 3: Speculative Hardware Permission Check Bypass

In addition to programmatic safety checks to prevent unsafe runtime behavior, the hardware itself provides certain guarantees via implicit permission checks. User programs should not be able to access unmapped virtual memory addresses, write to read-only memory [18], or read from kernel memory. Such attempts should result in a *fault*. Some CPUs seem to check for a fault too late, effectively speculating through a hardware permission check. This depends on the specific details of a CPU's trap mechanism, of course; e.g., faulting at retirement is too late if the processor has already accessed the memory and supplied its value to dependent instructions, which leaked the value into $\mu$-state. Lipp et al. [22] describe a variant 3 attack that enables leakage of data in kernel memory to a userspace process.

These hardware checks are sometimes used by the CPU as a general mechanism to interrupt the program and transfer control to the kernel. It has been discovered that Intel CPUs use it to implement an optimization called Lazy FPU state restore [23]. Upon context switch between two applications, microprocessors can choose to lazily restore the floating-point registers of the arriving process. If the program reads from an unrestored register, it receives a stale value (from a previous process) and continues executing speculatively, faulting at retirement. Again, this is too late, since the process might leak the stale value in $\mu$-state.

#### 3.4.4 Variant 4: Speculative Aliasing Confusion

Since memory is often the bottleneck in many programs, modern CPUs utilize not only caching but also a dynamic alias analysis known as memory disambiguation. When executing a store, the CPU uses a predictor to determine which, if any, subsequent loads will depend on the store. If the prediction is no-alias, the CPU may speculatively execute a later load before the store. If the prediction turns out to be incorrect and the store address and load address are in fact aliases, this will be detected when instructions are being retired in program order, and the load will be aborted and re-executed. This, too, represents a vulnerability, since loads that are speculatively executed out of order observe stale values from memory.
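A minimal sketch of this behavior, assuming a flat memory modeled as a dictionary and a single store followed by a single load. The function name and its parameters are hypothetical; real disambiguators track many in-flight accesses.

```python
def store_then_load(memory, store_addr, store_value, load_addr, predicted_no_alias):
    """Return (speculative_value, architectural_value) for a store followed by a load."""
    speculative_value = None
    if predicted_no_alias:
        # The load is hoisted above the store, so it reads memory before the store lands.
        speculative_value = memory[load_addr]
    memory[store_addr] = store_value
    # In program order, the load always observes the store.
    architectural_value = memory[load_addr]
    if not predicted_no_alias:
        speculative_value = architectural_value
    return speculative_value, architectural_value

# The store is meant to overwrite a sensitive value before the load reads it.
stale, fresh = store_then_load({0x10: 0xDEAD}, 0x10, 0, 0x10, predicted_no_alias=True)
```

When the two addresses alias but the predictor says they do not, the speculative value is the stale contents of the cell. The re-executed load later sees the stored value, but dependent instructions may already have run on the stale one.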

Bypassing stores is only one way a memory disambiguator can speculatively accelerate loads [33]. As long as violations are detected and repaired before retirement, other aggressive forwarding strategies could be implemented. If the memory disambiguator learns that a load typically aliases a store, it could speculatively forward the value even if the source address for the load is not yet known. Similarly the disambiguator could learn that two consecutive loads typically load from the same address, and inject the result from the first load into the second. Like in variant 2, predictions made by the memory disambiguator are subject to implementation details.

Appendix A.2 provides an illustrative example of a variant 4 attack.

## 3.5 The Universal Read Gadget

By design, misspeculated execution has no architectural side-effects. That means that no speculative execution vulnerability can alter the architectural state; they are limited to read-only access. However, as we have seen, the four variants we have outlined bypass normal software safety checks and the assumption of language type safety, allowing even a well-typed program to read inaccessible memory. The most general form of out-of-bounds read, a routine that can read all of addressable memory, we term the *universal read gadget*.

Our work on Spectre has led us to the disheartening conclusion:

• Pervasive availability of the Universal read gadget. For most programming languages L with a timer, speculative vulnerabilities on most of today's CPUs allow the construction of a well-typed procedure

```
read(address: int, bit: int) -> bit
```

that uses a  $\mu$ -architectural side-channel to read the bit of the contents of the process memory at the given address.

The universal read gadget is not necessarily a straightforward construction. It requires detailed knowledge of the  $\mu$ -architectural characteristics of the CPU and knowledge of the language implementation, whether that be a static compiler or a virtual machine. Additionally, the gadget might have particularly unusual performance and concurrency characteristics:

- if the timer is low resolution, the gadget requires amplification
- the gadget may require training  $\mu$ -architectural predictors in a complex warmup phase
- the gadget may fail probabilistically due to noise from interrupts, frequency scaling, or predictor algorithms with hidden state, and thus requires repeated attempts
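To illustrate the last point, the sketch below assembles a byte from repeated calls to a bit-level `read` by majority vote. The oracle here is a stand-in that deterministically returns a wrong answer on every third call; the names `make_flaky_oracle` and `read_byte`, and the failure pattern, are assumptions for illustration and say nothing about real error rates.

```python
def make_flaky_oracle(memory, fail_every=3):
    """Stand-in for read(address, bit) that flips its answer on every fail_every-th call."""
    calls = 0

    def read_bit(address, bit):
        nonlocal calls
        calls += 1
        true_bit = (memory[address] >> bit) & 1
        return true_bit ^ 1 if calls % fail_every == 0 else true_bit

    return read_bit

def read_byte(read_bit, address, attempts=15):
    """Recover one byte by taking a majority vote over repeated reads of each bit."""
    value = 0
    for bit in range(8):
        ones = sum(read_bit(address, bit) for _ in range(attempts))
        if 2 * ones > attempts:
            value |= 1 << bit
    return value

memory = {0x1000: 0xA5}
recovered = read_byte(make_flaky_oracle(memory), 0x1000)
single_shot = read_byte(make_flaky_oracle(memory), 0x1000, attempts=1)
```

With fifteen attempts per bit, one third of the samples are wrong and the vote still recovers the byte; with a single attempt per bit, the flipped samples corrupt the result.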

What characteristics of a programming language make it exploitable on today's modern hardware? As we have seen, access to a timer, no matter the resolution, leaks  $\mu$ -architectural information. We point out several language features whose typical implementations may be vulnerable to Spectre. In these we found that a key to constructing the universal read gadget was speculative pointer crafting, whereby an attacker exploits speculation to trick the implementation into interpreting attacker-controlled input as a machine-level pointer, feeding this pointer into a (normally innocent, but speculatively dangerous) load to achieve the universal read gadget.

Source-language features we have determined to be vulnerable include:

- 1. **Indexed data structures with dynamic bounds checks.** As seen in our example variant 1 vulnerability, bounds checks inserted into code by either a static or a dynamic compiler can be trained to be mispredicted, causing speculative out-of-bounds accesses.
- 2. **Differential data structure shapes.** Data structures of different types have different layouts in memory. Code manipulating objects of different shapes, e.g. objects of two different classes, can be trained to execute with improper object layouts, causing type confusion leading to pointer crafting [12].
- 3. **Dynamically shaped structures.** In languages that have dynamically typed objects where the underlying storage may change shape, mispredicted shape checks may allow an attacker to read out of the bounds of an object's shape into nearby objects, which can contain attacker-controlled values.
- 4. **Variadic arguments to functions.** Like other dynamic data structures, both variant 1 and variant 4 vulnerabilities apply to the implementation of variadic functions, where an argument count check can be trained to allow accessing out-of-bounds of the real arguments, or a stale argument can be accessed, both leading to pointer crafting.
- 5. **Interpreter dispatch loop.** The interpreter dispatch loop of a virtual machine typically consists of one or more indirect branches. The loop can be trained to speculatively transfer control to either the wrong bytecode (variant 1) or attacker-controlled code (variant 2), leading to pointer crafting.
- 6. **Indirect control transfers.** The variant 2 example can be exploited with indirect function calls in languages with first class functions or with virtual dispatch in object-oriented languages. Improper speculative control transfer leads to type confusion and pointer crafting.
- 7. **Call stack.** The CPU can be trained into mispredicting return addresses due to another internal predictor known as the *return address stack* [24], leading to pointer crafting.
- 8. **Switch statements.** Switch statements that are compiled to jump tables can be used to train the indirect branch predictor, either directly or through aliasing, to transfer control to code with improper type assumptions, leading to pointer crafting.

In our offensive work for JavaScript and WebAssembly implementations, detailed in the next section, we developed proof-of-concept universal read gadgets using many of the above mechanisms. As our examples show, our semantic model is able to capture the fundamental vulnerabilities that these gadgets rely upon, though architectural details determine exploitability. Items near the top of the list generally require less knowledge than those at the bottom of the list. In particular, we found variant 1 to be quite simple to exploit. For managed languages, variant 3 is only different from variant 1 in that superuser memory can be accessed. Variant 2 is only easily exploitable if one has direct control over the virtual memory addresses of code. Variant 4 can be difficult to exploit reliably due to the black box nature of the memory disambiguator state. We focused exclusively on in-process attacks and not cross-process attacks.

# 4 Mitigations

## 4.1 Disabling Speculation

The most obvious mitigation against Spectre attacks is to explicitly disable speculation. Unfortunately most CPUs don't provide such a mechanism, and even when they do, not to user-level programs. On Intel CPUs the manufacturer recommendation is to use an LFENCE instruction to prevent speculation across this barrier as a protection for variant 1 [15].

Unfortunately this approach has a number of limitations. First, the instruction was not designed as a speculation barrier, but as a memory load barrier. It prevents future speculative memory loads, and so mitigates side-channels that exfiltrate data via the cache, as do all the examples in Section 3.4. However, the microarchitectural implementation details are proprietary and impossible to audit. It is not known what effects it really has. Indeed, this could be a problem if CPUs continue to speculatively execute other instructions, which still influence other $\mu$-state and also consume execution time. Second, this mitigation is local, and careful thought is required to insert LFENCE before every vulnerable operation. Variant 1 can be mitigated by inserting LFENCE before safety-check branches, but mitigating variant 4 may require insertion of an LFENCE before every vulnerable memory load operation, which is drastic. Because the mitigation is local, it cannot directly mitigate against variant 2, because an attacker with the power to speculatively jump anywhere can simply skip any inserted LFENCE. Third, this approach imposes an impractical performance overhead on execution. Requiring the CPU to stall its out-of-order execution pipeline before every branch (variant 1) or even every load operation (variant 4) reduces program performance by orders of magnitude as seen in Section 4.5.

## 4.2 Timer Mitigation

Individual optimizations based on $\mu$-state, such as a cache hit versus miss, give rise to extremely small differences in execution time which require a high resolution timer to detect. We might consider adjusting the precision of timers or removing them altogether in an attempt to cripple the program's ability to read timing side-channels.

Unfortunately we now know that this mitigation is not comprehensive. Previous research [30] has shown three problems with timer mitigations: (1) certain techniques to reduce timer resolution are vulnerable to resolution recovery, (2) timers are more pervasive than previously thought and (3) a high resolution timer can be constructed from concurrent shared memory. The Amplification Lemma from Section 2.4 is the final nail in this coffin, as it shows (4) gadgets themselves can be amplified to increase the timing differences to arbitrary levels.
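The effect of amplification on a coarse timer can be sketched with simple arithmetic. The sketch assumes that timing differences accumulate linearly over repetitions, and the numbers used are illustrative placeholders, not measurements from this work; `iterations_needed` is a hypothetical helper.

```python
import math

def iterations_needed(per_iteration_delta_ns, timer_resolution_ns, margin=2.0):
    """Smallest repeat count whose accumulated difference reaches margin * resolution.

    Assumes the per-iteration timing difference adds up linearly across repetitions.
    """
    return math.ceil(margin * timer_resolution_ns / per_iteration_delta_ns)

# Illustrative values only: a 100 ns hit/miss difference against a 100 microsecond timer.
repeats = iterations_needed(per_iteration_delta_ns=100, timer_resolution_ns=100_000)
```

Under these assumptions, coarsening the timer by some factor only forces the attacker to repeat the gadget proportionally more often; it slows the leak rather than closing it.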

## 4.3 Branchless Masking

As we've now seen in detail, the variant 1 vulnerability means that any safety check inserted as a branch into the code has the potential to be bypassed by the inherent nature of branch prediction. This is captured in our semantic model in that the reorder buffer does not encode control dependencies between a branch and instructions following that branch, allowing them to execute out of order with respect to each other. Our model therefore informed the design of our mitigations, since it forced us to assume that branches can no longer be trusted as security mechanisms in speculation. This realization led us to designing speculative safety checks that do not rely on branches.

```
@mitigated:
  load r1 [#array - 1]     ; load array length
  binop[ge] r2 r0 r1       ; index >= array length
  branch r2 @oob           ; (vuln) bounds check
  const r3 #1              ; (mitig)
  binop[xor] r2 r2 r3      ; (mitig) invert condition
  binop[mul] r0 r0 r2      ; (mitig) clamp index to 0 if out of bounds
  load r0 [r0 + #array]
  load r0 [r0 + #timing]
  jump rip
@oob:
  const r0 #0
  jump rip
```

Figure 8: Branchless array index masking mitigation

```
@program_start:
  const rp 1               ; Initialize poison register

  branch rc @target        ; (vuln) branch
  const rt 1               ; (mitig)
  binop[xor] rt rt rc      ; (mitig) rt is inverted rc
  binop[and] rp rp rt      ; (mitig) set rp to 0 if rc

@target:
  binop[and] rp rp rc      ; (mitig) set rp to 0 if rc == 0
  binop[mul] rt rp r0      ; (mitig) mask load addr with rp
```

Figure 9: Pervasive conditional masking

#### 4.3.1 Array index masking

The intuition behind array index masking is to automatically insert additional arithmetic computations to force attacker-supplied out-of-bounds indices to be within bounds, even in misspeculated execution. In the case of array accesses, the safety check already computes the expression index >= array.length. In our model, this condition is computed via a binop ge instruction that produces either 0 or 1 in a register, which is then used as input to a branch instruction. Following the branch instruction, on the success (in-bounds) path, we simply insert a multiplication of the index by the inverse of the condition, so that if the condition is true (i.e., the index is out-of-bounds), the attacker-chosen index is clamped to 0. Figure 8 shows the machine code that would be generated for Figure 6a with this mitigation enabled.
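The arithmetic of Figure 8 can be sketched in a small model. The function below is hypothetical: it assumes a flat `memory` list in which the array occupies the first `length` cells, and it models a misprediction by unconditionally following the in-bounds path.

```python
def speculative_access(memory, length, index, mitigated):
    """Value loaded when the bounds-check branch is predicted as in bounds."""
    out_of_bounds = int(index >= length)   # binop[ge] yields 0 or 1
    # A mispredicted branch reaches the load regardless of out_of_bounds.
    if mitigated:
        in_bounds = out_of_bounds ^ 1      # binop[xor] with the constant 1 inverts the condition
        index = index * in_bounds          # binop[mul] clamps an out-of-bounds index to 0
    return memory[index]

# Four array cells followed by a cell the program must not read.
memory = [10, 11, 12, 13, 99]
leaked = speculative_access(memory, 4, 4, mitigated=False)
clamped = speculative_access(memory, 4, 4, mitigated=True)
```

Without the masking, the speculative load reaches the cell past the array; with it, the same out-of-bounds index is forced to element 0, while in-bounds indices are unchanged.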

The extra calculation is always a no-op in the architected path, so it strictly adds overhead, not only in code size, but in execution time. There is also a risk to being a no-op on the architected path, as an aggressively optimizing  $\mu$ -architecture might even compile it away, reintroducing the original vulnerability in the pursuit of performance. In fact, in the course of our practical mitigation work in V8, an engineer introduced an optimization in the backend of V8's optimizing compiler that unwittingly removed Spectre mitigations inserted by the frontend.

#### 4.3.2 Pervasive conditional masking

Array index accesses are just one of many types of safety checks inserted by language implementations. As we've seen, type checks, argument count checks, and others can constitute potential vulnerabilities. The logical extension of the array index masking technique is to use the conditions computed for any of these branches to compute a *poison* value that is used as an additional input to all loads that are control-dependent on the safety-check.

To implement general conditional masking, we reserve one otherwise unused register, denoted by $r_p$, for the poison value. The register is initially set to 1, a value it will have whenever the processor is correctly speculating. We will instrument every branch target to update the poison register so that a misspeculation can still execute, but the poison register will be set to 0, so that the poison value can be used to destroy any sensitive information in misspeculation. In effect, we compute a redundant condition in dataflow in addition to control flow. In the true branch, we mask the poison register with the condition register so that the poison register becomes 0 if the condition was 0, and in the false branch we set the poison register to 0 if the condition is 1. After instrumenting the branches to add to the running speculative poison, we will instrument all loads to consume the poison by masking every load's address with the poison register. An example of the generated code is shown in Figure 9.

**Proof sketch.** To show correctness of this instrumentation, assuming that loads from address 0 are not secret, we will argue that if an execution in the speculative semantics (Section 3.1.4) loads memory location l, then there is an execution in the architecture that reads the same location. If the speculative execution reaches a load of memory location l and the load occurs in the reorder buffer before any misspeculation, then Theorem 1 guarantees the load is also reachable in the architecture model. If the load occurs after a misspeculated branch in the reorder buffer, the instruction sequence following the branch must have updated the poison register to 0 and the load's address must have been multiplied with the poison, thus loading from address 0.
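The case analysis in the proof sketch can be exercised on a single branch. This is a simplified model rather than the speculative semantics of Section 3.1.4: `run_branch` is a hypothetical helper that follows whichever path the predictor chose, applies the instrumentation of that path, and performs one masked load.

```python
def run_branch(rc, predicted_taken, load_addr, memory):
    """Follow the predicted path of `branch rc @target` and return what the load observes."""
    rp = 1                          # poison register: 1 while speculation is correct
    if predicted_taken:
        rp &= rc                    # at the target: rp becomes 0 if rc == 0
    else:
        rt = 1 ^ rc                 # fall-through: rt is the inverted condition
        rp &= rt                    # rp becomes 0 if rc == 1
    masked_addr = rp * load_addr    # every load address is multiplied by the poison
    return memory[masked_addr]

memory = [0] * 0x41
memory[0x40] = 7                    # value the load is meant to reach on the correct path

correct = run_branch(rc=1, predicted_taken=True, load_addr=0x40, memory=memory)
mispredicted = run_branch(rc=0, predicted_taken=True, load_addr=0x40, memory=memory)
```

On a correct prediction the load address is unchanged; on either kind of misprediction the poison is 0 and the load is redirected to address 0, which the proof sketch assumes holds no secret.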

#### 4.3.3 Pervasive indirect branch masking

To guard type safety across indirect branches, such as bytecode handler dispatch and virtual function dispatch, we need to extend masking even further. A poison is computed by comparing the expected target of the indirect branch with the reached target. To do so, we reserve a register $r_b$ to be used for the target address of all indirect branches. All potential targets of indirect branches are instrumented to perform a branchless comparison of the current pc with $r_b$, and use the result of this comparison to update the poison register $r_p$.

Note that this mitigation strategy does not prevent mispredicted indirect branch speculation, but instead clears the poison register in the case of misspeculation, and relies on the masking of loads to prevent the speculative execution from accessing memory that could expose secret data. It also only prevents variant 1 indirect branch attacks. If an attacker can exploit a CPU's approximate prediction state, as in a variant 2 attack, they can train an indirect branch to alias with a location that isn't an indirect branch target, and therefore isn't instrumented with the indirect target poison computation and doesn't reset the poison register.
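Both the protection and its limit can be seen in a small model. The helper names and the set of instrumented addresses below are hypothetical; the equality test stands in for the branchless comparison that generated code would use.

```python
def enter_indirect_target(rb, pc, rp):
    """Instrumentation at an indirect-branch target: keep rp only if pc is the intended target."""
    reached_expected = int(pc == rb)    # stands in for a branchless comparison of pc with rb
    return rp & reached_expected

def speculate_indirect(rb, predicted_pc, instrumented_targets, rp=1):
    """Poison value after speculatively jumping to predicted_pc."""
    if predicted_pc in instrumented_targets:
        rp = enter_indirect_target(rb, predicted_pc, rp)
    # Code that is not a known indirect-branch target carries no instrumentation.
    return rp

targets = {0x100, 0x200}
intended = speculate_indirect(rb=0x100, predicted_pc=0x100, instrumented_targets=targets)
wrong_target = speculate_indirect(rb=0x100, predicted_pc=0x200, instrumented_targets=targets)
mid_code = speculate_indirect(rb=0x100, predicted_pc=0x2A4, instrumented_targets=targets)
```

A misprediction to another instrumented target clears the poison, so later masked loads are neutralized. A variant 2 jump into the middle of uninstrumented code leaves the poison at 1, which is exactly the gap described above.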

## 4.4 Mitigating Other Mispredictions

All previous mitigations rely on specially-crafted code to avoid either accessing or leaking information in misspeculation. Variant 2 unfortunately shows that some predictions allow attackers to take paths that aren't architecturally possible. Those paths are not under our control and hence cannot be made to include mitigations. The only solution is to avoid such predictors altogether. This has led to the suggested use of retpoline [17], a construct that avoids using indirect branches altogether to block predictions from the branch target buffer. Unfortunately, this technique actually makes use of a second predictor, the return stack buffer, which happens to take priority on Intel CPUs. Retpolines do not work on other CPU models or architectures.

Variant 4 shows that loads can sometimes observe completely wrong values in speculation. This can lead to pointer crafting or using wrong addresses for indirect jumps. There is no real known way to mitigate this in software. Hardware manufacturers have provided mitigations in the form of a new off-by-default mitigation by Intel [15] and AMD [2], and two new barriers SSBB and PSSBB from ARM [4] that block loads from bypassing stores to the same virtual and physical address respectively.

## 4.5 Implementation in a Production JavaScript VM

As part of our offensive work, we developed proofs of concept in C++, JavaScript, and WebAssembly for all the reported vulnerabilities. We were able to leak over 1KB/s from variant 1 gadgets in C++ using **rdtsc** with 99.99% accuracy and over 10B/s from JavaScript using a low resolution timer. We demonstrated a potential 2.5KB/s variant 4 vulnerability, but with low reliability, starting at 0.01% but amplifiable up to 20% through various techniques. We found that using shared memory to construct a timer worked well enough in JavaScript to measure individual cache hits and misses and exploit any of the known leaks.

As part of our defensive work, we implemented a number of the described mitigations in the V8 JavaScript virtual machine and evaluated their performance penalties. As we've noted, none of these mitigations provide comprehensive protection against Spectre, and so the mitigation space is a frustrating performance / protection trade-off.

We augmented every branch with an LFENCE instruction, which provides variant 1 protection<sup>8</sup>. The overhead was considerable, with a 2.8x slowdown on the Octane JavaScript benchmark (some line items up to 5x). We also experimented with the *retpoline* code sequence, inserting it into every generated indirect jump to protect against variant 2. Due to the dynamic nature of JavaScript, V8 emits a significant number of indirect jumps and so *retpoline* also has a high performance penalty of 1.52x slowdown on Octane. Disabling the emission of jump tables for WebAssembly has highly variable costs of up to 5x, but retpoline for all indirect calls costs only 0-2%.

<sup>8</sup>Although not variant 4, which would require an LFENCE before every memory load.

In order to provide more performant mitigations, we took a tactical approach to target specific vulnerabilities. Implementing array masking incurs a 10% slowdown on Octane, while the more pervasive conditional poisoning using a reserved poison register (which protects against variant 1 type confusion) incurs a 19% slowdown on Octane. In addition we implemented pervasive indirect branch masking on the V8 interpreter's dynamic dispatch and on indirect function calls. This incurred negligible overhead on code optimized by V8's optimizing compiler, however with the optimizer disabled it incurs a 12% slowdown on V8's interpreter when running Octane. For WebAssembly, we implemented unconditional memory masking (by padding the memory to a power-of-2 size and always loading a mask), which incurs a 10-20% slowdown. For variant 4, we implemented a mitigation to zero the unused memory of the heap prior to allocation, which cost about 1% when done concurrently and 4% for scavenging.
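The WebAssembly memory masking mentioned above can be sketched as follows. The text only states that memory is padded to a power-of-2 size and that a mask is always loaded; the helper names and the way the mask is derived here are assumptions for illustration, not V8's implementation.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def masked_address(offset, memory_size):
    """Force any offset into the padded memory region with a single AND."""
    padded_size = next_power_of_two(memory_size)
    mask = padded_size - 1          # all ones below the padded size
    return offset & mask

in_range = masked_address(5, 100_000)
wrapped = masked_address(2**40 + 7, 100_000)
```

Because the mask is applied unconditionally and involves no branch, there is nothing for a predictor to mistrain: every access, speculative or not, stays inside the padded region.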

Variant 4 defeats everything we could think of. We explored more mitigations for variant 4 but the threat proved to be more pervasive and dangerous than we anticipated. For example, stack slots used by the register allocator in the optimizing compiler could be subject to type confusion, leading to pointer crafting. Mitigating type confusion for stack slots alone would have required a complete redesign of the backend of the optimizing compiler, perhaps man years of work, without a guarantee of completeness. We recognized quickly that a compiler backend overhaul, a complete audit of the entire runtime system, and application of (not yet designed) mitigations in the C++ compiler for the VM's code itself were intractable for essentially any-sized codebase. For this reason we do not believe that variant 4 can be effectively mitigated in software, due not just to manpower, but a lack of architectural options, since reasoning about variant 4 requires the confounding assumption that in speculation, writes to memory may not be visible to subsequent reads at all.

A subset of the implemented mitigations shipped in successive Chrome releases from 64 to 67 and formed the initial primary defensive strategy. Timer mitigations were also shipped early, since they were an easy way to slow attackers. Accordingly, SharedArrayBuffer, which represents concurrent shared memory and can be used to construct a timer, was disabled. In recognition of the fact that software mitigation is still an open problem for virtual machines, and that mitigations would also need to be applied to all of the millions of lines of C++ code in the browser, Chrome's defensive strategy shifted entirely to site isolation [7], which sandboxes code from different origins in different processes, thus relying on hardware-enforced protection.

# 5 Conclusion

Spectre defeats an important layer of software security. The community has assumed for decades that programming language security enforced with static and dynamic checks could guarantee confidentiality between computations in the same address space. Our work has discovered there are numerous vulnerabilities in today's languages that when run on today's CPUs allow construction of the universal read gadget, which completely destroys language-enforced confidentiality. This paper has attempted to shed light on side-channels that have been hiding in simulators and CPUs, a fact that has received little attention in the past and is only now coming to the fore. Hardware/OS process isolation is needed now more than ever, as we predict that language mechanisms to enforce confidentiality will be constantly threatened by side-channels. From our first attempts at modeling recently-disclosed vulnerabilities to our work on software mitigations, it has become painfully obvious to us that we are facing three massive open problems:

- 1. Finding  $\mu$ -architectural side channels requires enumerating and modeling relevant  $\mu$ -state, a difficult task for processors that are closed source and full of valuable and carefully-guarded intellectual property.
- 2. Understanding vulnerabilities requires us to model how programs can manipulate and observe $\mu$-state, which also requires us to understand complex $\mu$-state in black-box processors.
- 3. Mitigating vulnerabilities is perhaps the most challenging of all, since efficient software mitigations needed for extant hardware seem to be in their infancy, and hardware mitigation for future designs is a completely open design problem.

Computer systems have become massively complex in pursuit of the seemingly number-one goal of performance. We've been extraordinarily successful at making them faster and more powerful, but also more complicated, facilitated by our many ways of creating abstractions. The tower of abstractions has allowed us to gain confidence in our designs through separate reasoning and verification, separating hardware from software, and introducing security boundaries. But we see again that our abstractions leak, side-channels exist *outside* of our models, and now, down deep in the hardware where we were not supposed to see, there are vulnerabilities in the very chips we deployed the world over. Our models, our *mental* models, are wrong; we have been trading security for performance and complexity all along and didn't know it. It is now a painful irony that today, defense requires even more complexity with software mitigations, most of which we know to be incomplete. And complexity makes these three open problems all that much harder. Spectre is, perhaps, too appropriately named, as it seems destined to haunt us for a long time.

# Acknowledgements

Our work on Spectre was a close collaboration among dozens of engineers across several companies. We would especially like to acknowledge Jann Horn, Matt Linton, Chandler Carruth, Paul Turner, Michael Starzinger, Brad Nelson, Chris Palmer, Justin Schuh, and Charlie Reis at Google, Luke Wagner at Mozilla, Filip Pizlo, Michael Saboff and Robin Morrisset at Apple, John Hazen and Louis Lafreniere at Microsoft, Jason Brandt at Intel, Alastair Reid and Rodolph Perfetta at ARM.

# References

- [1] Dakshi Agrawal, Bruce Archambeault, Josyula R Rao, and Pankaj Rohatgi. The em side-channel (s). In *International Workshop on Cryptographic Hardware and Embedded Systems*, pages 29–45. Springer, 2002.
- [2] AMD. AMD Speculative Bypass Store Disable. Technical report, 2018. Accessed: 2018-05-21.
- [3] Jason Andress. The Basics of Information Security: Understanding the Fundamentals of InfoSec in Theory and Practice. Syngress Publishing, 2nd edition, 2014.
- [4] ARM. Cache Speculation Side-channels. Technical report, 2018. Accessed: 2018-07-11.
- [5] Daniel J. Bernstein. Cache-timing attacks on aes. Technical report, 2005.
- [6] E. Carmon, J. P. Seifert, and A. Wool. Photonic side channel attacks against rsa. In 2017 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), pages 74–78, May 2017.
- [7] Chromium. Site isolation. https://www.chromium.org/Home/chromium-security/site-isolation, 2018. Accessed: 2018-07-10.
- [8] J. Ferrigno and M. Hlavac. When aes blinks: introducing optical side channel. *IET Information Security*, 2(3):94–98, September 2008.
- [9] Qian Ge, Yuval Yarom, David Cock, and Gernot Heiser. A survey of microarchitectural timing attacks and countermeasures on contemporary hardware. *Journal of Cryptographic Engineering*, 8(1):1–27, 2018.
- [10] Daniel Genkin, Adi Shamir, and Eran Tromer. Rsa key extraction via low-bandwidth acoustic cryptanalysis. In Juan A. Garay and Rosario Gennaro, editors, *Advances in Cryptology CRYPTO* 2014, pages 444–461, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
- [11] Mordechai Guri, Boris Zadov, Dima Bykhovsky, and Yuval Elovici. Powerhammer: Exfiltrating data from air-gapped computers through power lines. *CoRR*, abs/1804.04014, 2018.
- [12] Noam Hadad and Jonathan Afek. Overcoming (some) Spectre browser mitigations. https://alephsecurity.com/2018/06/26/spectre-browser-query-cache/, 2018. Accessed: 2018-07-25.
- [13] Jann Horn. Reading privileged memory with a side-channel. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html, January 2018. Accessed: 2018-06-03.
- [14] Galen C Hunt and James R Larus. Singularity: rethinking the software stack. ACM SIGOPS Operating Systems Review, 41(2):37–49, 2007.
- [15] Intel. Intel Analysis of Speculative Execution Side Channels. Technical Report 336983-001, 2018.
- [16] Intel. Intel microcode revision guidance. https://newsroom.intel.com/wp-content/uploads/sites/11/2018/04/2018. Accessed: 2018-07-10.
- [17] Intel. Retpoline: A Branch Target Injection Mitigation. Technical Report 337131-003, 2018. Accessed: 2018-07-11.
- [18] V. Kiriansky and C. Waldspurger. Speculative Buffer Overflows: Attacks and Defenses. *ArXiv* e-prints, July 2018.
- [19] Paul Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS and other systems. In *Advances in Cryptology CRYPTO '96, LNCS 1109*, pages 104–113. Springer, 1996.
- [20] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative execution. *ArXiv e-prints*, January 2018.
- [21] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In *Advances in Cryptology CRYPTO '99, LNCS 1666*, pages 388–397. Springer, 1999.
- [22] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading kernel memory from user space. In 27th USENIX Security Symposium (USENIX Security 18), 2018.
- [23] MITRE. CVE-2018-3665. https://nvd.nist.gov/vuln/detail/CVE-2018-3665, 2018. Accessed: 2018-07-11.
- [24] E. Mohammadian Koruyeh, K. Khasawneh, C. Song, and N. Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. *ArXiv e-prints*, July 2018.
- [25] Andrew C Myers, Lantian Zheng, Steve Zdancewic, Stephen Chong, and Nathaniel Nystrom. Jif: Java information flow. http://www.cs.cornell.edu/jif, 2001.
- [26] Edmund B Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, and Galen Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In *Proceedings of the 22nd Symposium on Operating Systems Principles (SOSP '09)*. ACM, October 2009.
- [27] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. DRAMA: exploiting DRAM addressing for cross-CPU attacks. In *25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10-12, 2016*, pages 565–581, 2016.
- [28] Andrei Sabelfeld and Andrew C Myers. Language-based information-flow security. *IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS*, 21(1):1, 2003.
- [29] Jun Sawada and Warren A. Hunt. Verification of FM9801: An out-of-order microprocessor model with speculative execution, exceptions, and program-modifying capability. *Formal Methods in System Design*, 20(2):187–222, Mar 2002.
- [30] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In Aggelos Kiayias, editor, *Financial Cryptography and Data Security*, pages 247–267, Cham, 2017. Springer International Publishing.
- [31] A. A. Shaikh. Attacks on cloud computing and its countermeasures. In 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), pages 748–752, Oct 2016.
- [32] Raphael Spreitzer. Pin skimming: Exploiting the ambient-light sensor in mobile devices. In *Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices*, pages 51–62. ACM, 2014.
- [33] Henry Wong. Store-to-load forwarding and memory disambiguation in x86 processors. http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/, 2018. Accessed: 2018-07-25.

# A Appendix

## A.1 Variant 2: Illustrative Example

The example shown in Figure 10 illustrates the variant 2 vulnerability, in which target misreconstruction is used to leak secret data. The train routine calls a virtual function via an indirect jump to train the branch predictor. However, the branch predictor stores only the lower 2 bits of the source address and encodes the target as an offset (Figure 10d). The prediction is therefore ambiguous: it predicts an indirect jump from @5->@6 as well as the intended @1->@2. An attacker takes advantage of this ambiguity by positioning the attack routine on the target side of this misprediction and calling vulnerable. The CPU speculatively executes the attack routine instead of the intended B.call, which, as before, speculatively loads an attacker-crafted pointer and encodes that value into the cache.
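The aliasing between @1->@2 and @5->@6 can be reproduced with a few lines of arithmetic. The following Python sketch is only a toy model, not the paper's simulator: the class `ToyBranchTargetBuffer` and its methods are hypothetical names, and the only facts taken from the text are that the predictor keeps the lower 2 bits of the source address and stores the target as an offset.

```python
# Toy model of the ambiguous indirect-branch predictor described in A.1.
# Assumptions (from the text): the entry is selected by the lower 2 bits of
# the source address, and the target is stored as an offset from the source.
# Everything else (names, dict-based table) is illustrative only.

INDEX_BITS = 2
INDEX_MASK = (1 << INDEX_BITS) - 1


class ToyBranchTargetBuffer:
    def __init__(self):
        # Maps truncated source address -> target offset.
        self.offsets = {}

    def train(self, source, target):
        # Only the truncated source and the relative offset are remembered.
        self.offsets[source & INDEX_MASK] = target - source

    def predict(self, source):
        # Any source with the same low bits reuses the same entry.
        offset = self.offsets.get(source & INDEX_MASK)
        return None if offset is None else source + offset


btb = ToyBranchTargetBuffer()
btb.train(source=1, target=2)   # the train routine: indirect jump @1 -> @2

intended = btb.predict(1)       # the jump that was actually trained
aliased = btb.predict(5)        # 5 & 0b11 == 1, so it hits the same entry
print(intended, aliased)        # prints "2 6"
```

Because 1 and 5 share their lower 2 bits, the entry trained at @1 also yields a prediction for @5, and the stored offset of 1 places the predicted target at @6, which is where the attacker positions the attack routine.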

## A.2 Variant 4: Illustrative Example

Figure 11 shows an example of a variant 4 (speculative aliasing confusion) attack. A memory location (e.g., a stack slot) is reused to pass argument to two different functions, each of which expects argument to be of a different type. The attacker first repeatedly stores an integer into argument and calls dummyInt. Since this routine doesn't load the value in argument, the memory disambiguator will predict that stores to argument do not alias with subsequent loads (Figure 11d). The attacker then overwrites argument with a pointer to an object and calls loadObject. The memory disambiguator mispredicts that subsequent loads do not depend on this store, and therefore speculatively executes loadObject before the store has completed (Figure 11e). As a result, loadObject incorrectly treats an attacker-controlled integer as an object pointer, which, as before, can be used to load secret memory state that can be exfiltrated from speculation via a timing array, encoding it into the $\mu$-state of the cache.
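The training and misprediction steps can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not a description of any real memory disambiguator: the streak-based predictor, its threshold, and the names `ToyDisambiguator`, `speculative_load`, `ATTACKER_INT`, and `OBJECT_POINTER` are all hypothetical.

```python
# Toy model of the variant 4 scenario in A.2: a load speculatively bypasses
# a pending store to the same slot and observes the stale value.
# Assumption: the predictor is a simple streak counter; real hardware differs.


class ToyDisambiguator:
    def __init__(self, threshold=3):
        self.no_alias_streak = 0
        self.threshold = threshold

    def observe(self, load_aliased_store):
        # An observed alias resets confidence; otherwise confidence grows.
        if load_aliased_store:
            self.no_alias_streak = 0
        else:
            self.no_alias_streak += 1

    def predicts_no_alias(self):
        return self.no_alias_streak >= self.threshold


def speculative_load(committed_value, pending_store, predictor):
    """Value a load sees while a store to the same slot is still pending."""
    if predictor.predicts_no_alias():
        return committed_value  # load runs ahead of the store: stale value
    return pending_store        # load waits for (or forwards from) the store


ATTACKER_INT = 0x41414141  # hypothetical attacker-chosen integer
OBJECT_POINTER = 0x1000    # hypothetical address of a legitimate object

predictor = ToyDisambiguator()

# Training phase: store an integer into argument, then call dummyInt,
# which never loads argument, so no alias is ever observed.
for _ in range(4):
    predictor.observe(load_aliased_store=False)

# Attack phase: the slot still holds ATTACKER_INT while the store of
# OBJECT_POINTER is pending; loadObject's load is executed speculatively.
seen = speculative_load(ATTACKER_INT, OBJECT_POINTER, predictor)
print(hex(seen))  # prints "0x41414141"
```

In this model the speculative load returns the attacker's integer rather than the object pointer, which corresponds to the type confusion in the text; an untrained predictor would instead return the pending store's value.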



Figure 10: Illustration of a variant 2 attack



Figure 11: Illustration of a variant 4 attack