# IEEE Floating Point Arithmetic


Reference: [Overton](https://cs.nyu.edu/~overton/book/) Chapter 4.

In this lecture, we introduce the
[IEEE Standard for Floating-Point Arithmetic](https://en.wikipedia.org/wiki/IEEE_754).
There are many possible ways of representing real numbers on a computer, and many possible
choices for the precise behaviour of operations such as addition, multiplication, etc.
Before the 1980s each processor potentially had a different representation for 
real numbers, as well as different behaviour for operations.  
The IEEE standard, introduced in 1985, was a means to standardise this across
processors so that algorithms would produce consistent and reliable results.

This chapter may seem very low level for a mathematics course but there are
two important reasons to understand the behaviour of floating point numbers:
1. Floating point arithmetic is very precisely defined, and can even be used
in rigorous computations as we shall see in the problem sheets.
2. Failure to understand floating point arithmetic can cause many errors
in practice, with the extreme example being the [explosion of the Ariane 5 rocket](https://youtu.be/N6PWATvLQCY?t=86).


Before we begin, we load two external packages. SetRounding.jl allows us 
to set the rounding mode of floating point arithmetic. ColorBitstring.jl
implements a function `printbits`
which prints the bits of floating point numbers in colour. 
Each colour corresponds to a different part of the representation:
the <span style="color:red">sign bit</span>, the <span style="color:green">exponent bits</span>, 
and the <span style="color:blue">significand bits</span> which we shall learn about shortly.

In [1]:
using SetRounding, ColorBitstring
printbits(1.0)

0 01111111111 0000000000000000000000000000000000000000000000000000

In this chapter we discuss the following:

1. Binary representation of real numbers: Any real number can be represented by an infinite sequence of bits. 
2. Floating point numbers: we discuss how real numbers are stored on a computer with a finite number of bits.
3. Arithmetic: we discuss how arithmetic operations in floating point are exact up to rounding, and how the
rounding mode can be set. This allows for precise error bounds in computations.
4. High-precision floating point numbers: we discuss how the precision of floating point arithmetic can be increased arbitrarily
using `BigFloat`.


## 1.  Binary representation of real numbers

Before discussing how to represent real numbers with a finite number
of bits, we consider what one could do with an infinite number of bits,
that is, binary expansions.

We use the notation:
$$
(b_0.b_1b_2b_3\ldots)_2 = b_0 + {b_1 \over 2} + {b_2 \over 2^2} + {b_3 \over 2^3} + \cdots
$$
where each $b_k$ is either 0 or 1. Every number $0 \leq x < 1$ has a binary representation in this form (with $b_0 = 0$).
First we show some examples of verifying a number's binary representation:

**Example**
Consider the number `1/3`.  In decimal we all know:
$$
1/3 = 0.3333\ldots =  \sum_{k=1}^\infty {3 \over 10^k}
$$
We will see that in binary
$$
1/3 = (0.010101\ldots)_2 = \sum_{k=1}^\infty {1 \over 2^{2k}}
$$
Both results can be proven using the geometric series:
$$
\sum_{k=0}^\infty z^k = {1 \over 1 - z}
$$
provided $|z| < 1$. That is, with $z = 1/10$ we see the decimal case:
$$
3\sum_{k=1}^\infty {1 \over 10^k} = {3 \over 1 - {1 \over 10}} - 3 = {1 \over 3}
$$
A similar argument works for binary with $z = {1 \over 4}$:
$$
\sum_{k=1}^\infty {1 \over 4^k} = {1 \over 1 - 1/4} - 1 = {1 \over 3}
$$
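
The digits can also be generated mechanically: doubling $x$ shifts the binary point one place to the right, so the integer part of $2x$ is the next digit. The following is a minimal sketch of this idea using exact `Rational` arithmetic; the helper name `binarydigits` is our own and is not part of Julia or of the packages loaded above.

```julia
# Hypothetical helper: the first n binary digits b_1,…,b_n of a rational 0 ≤ x < 1
function binarydigits(x::Rational, n::Integer)
    digits = Int[]
    for _ in 1:n
        x *= 2                 # shift the binary point one place to the right
        b = x ≥ 1 ? 1 : 0      # the integer part is the next digit
        push!(digits, b)
        x -= b                 # keep only the fractional part
    end
    digits
end

binarydigits(1//3, 8)   # [0, 1, 0, 1, 0, 1, 0, 1], matching (0.01010101…)_2
```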

The extension to numbers outside the range $0 \leq x < 1$ is clear, e.g.
$$
(-101.010101\ldots)_2 = -(5 + 1/3)
$$
But for floating point numbers we handle the integer part differently so
we will not use this.




## 2. Floating point numbers

Floating point numbers are how a computer typically approximates
a real number. 

**Definition** The _floating point numbers_ are
$$
F_{S,Q,P} := F^{\rm normal}_{S,Q,P} \cup F^{\rm sub-normal}_{S,Q,P} \cup F^{\rm special}.
$$
The _normal floating point numbers_
$F^{\rm normal}_{S,Q,P} \subset {\mathbb R}$ are defined by
$$
F^{\rm normal}_{S,Q,P} = \{\pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2 : 1 \leq q < 2^Q-1 \}.
$$
The _sub-normal numbers_ $F^{\rm sub-normal}_{S,Q,P} \subset {\mathbb R}$ are defined as
$$
F^{\rm sub-normal}_{S,Q,P} = \{\pm 2^{1-S} \times (0.b_1b_2b_3\ldots b_P)_2\}.
$$
The _special numbers_ $F^{\rm special} \not\subset {\mathbb R}$ are defined later.

Note this set of real numbers has no nice algebraic structure: it is not closed under addition, subtraction, etc.,
and the normal floating point numbers do not include $0$. We will therefore need to define approximate versions of algebraic operations later.

Floating point numbers are stored using a total of $1 + Q + P$ bits, in the format
$$
sq_1\ldots q_Q b_1 \ldots b_P
$$
The first bit ($s$) is the <span style="color:red">sign bit</span>: 0 means positive and 1 means
negative. The bits $q_1\ldots q_Q$ are the <span style="color:green">exponent bits</span>:
they are the binary digits of the unsigned integer $q$: 
$$
q = (q_1\ldots q_Q)_2.
$$
Finally, the bits $b_1\ldots b_P$ are the <span style="color:blue">significand bits</span>.
If $1 \leq q < 2^Q-1$ then the bits represent the normal number
$$
x = \pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2.
$$
If $q = 0$ (i.e. all exponent bits are 0) then the bits represent the sub-normal number
$$
x = \pm 2^{1-S} \times (0.b_1b_2b_3\ldots b_P)_2.
$$
If $q = 2^Q-1$  (i.e. all exponent bits are 1) then the bits represent a special number, discussed
later.
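
To make the storage format concrete, here is a small sketch that splits the bits of a normal `Float64` into the three fields and rebuilds the number from the formula above. It uses only `bitstring` and `parse` from Base; the name `decodenormal` is a hypothetical helper of ours, and it assumes the input is a normal number (so $1 \leq q < 2^Q-1$).

```julia
# Hypothetical helper: rebuild a normal Float64 from its bit fields (S = 1023, Q = 11, P = 52)
function decodenormal(x::Float64)
    bits = bitstring(x)                      # 64-character string of '0'/'1'
    s = bits[1] == '1' ? -1 : 1              # sign bit
    q = parse(Int, bits[2:12]; base=2)       # exponent bits as an unsigned integer
    b = parse(Int, bits[13:64]; base=2)      # significand bits b_1…b_52 as an integer
    s * 2.0^(q - 1023) * (1 + b / 2^52)      # ±2^(q-S) × (1.b_1…b_P)_2
end

decodenormal(1/3) == 1/3    # true
```

Every step here is exact in `Float64` arithmetic for normal inputs, since we only scale by powers of two, which is why the comparison uses `==`.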


### IEEE floating point numbers

**Definition** IEEE has 3 standard floating point formats: 16-bit (half precision), 32-bit (single precision) and
64-bit (double precision) defined by:
$$
\begin{align*}
F_{16} &:= F_{15,5,10} \\
F_{32} &:= F_{127,8,23} \\
F_{64} &:= F_{1023,11,52}
\end{align*}
$$

In Julia these correspond to 3 different floating point types:

1.  `Float64` is a type representing $F_{64}$.
We can create a `Float64` by including a 
decimal point when writing the number: 
`1.0` is a `Float64`. `Float64` is the default format for 
scientific computing (on the _Floating Point Unit_, FPU).  
2. `Float32` is a type representing $F_{32}$.  We can create a `Float32` by appending 
`f0` when writing the number: 
`1f0` is a `Float32`. `Float32` is generally the default format for graphics (on the _Graphics Processing Unit_, GPU), 
as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation,
and more data can be fit into a GPU's limited memory.
3.  `Float16` is a type representing $F_{16}$.
It is important in machine learning where one wants to maximise the amount of data
and high accuracy is not necessarily helpful. 
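
As a quick sanity check, the parameters $Q$ and $P$ can be recovered from the types themselves. This sketch uses `sizeof` (the size in bytes) and `precision` from Base; note that `precision` counts the implicit leading 1, so it returns $P + 1$.

```julia
for T in (Float16, Float32, Float64)
    bits = 8 * sizeof(T)          # total number of bits, 1 + Q + P
    P = precision(T) - 1          # precision counts the implicit leading 1
    Q = bits - 1 - P              # what remains after the sign and significand bits
    println(T, ": Q = ", Q, ", P = ", P)
end

typeof(1.0), typeof(1f0)          # (Float64, Float32)
```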


**Example** How is the number $1/3$ stored in `Float32`?
Recall that
$$
1/3 = (0.010101\ldots)_2 = 2^{-2} (1.0101\ldots)_2 = 2^{125-127} (1.0101\ldots)_2
$$
Since
$$
125 = (1111101)_2
$$
the exponent bits are `01111101`, that is, $125$ written with $Q = 8$ binary digits.
For the significand we round the last bit to the nearest (this is explained in detail in
the section on rounding), so we have
$$
(1.010101010101010101010101\ldots)_2 \approx (1.01010101010101010101011)_2
$$
Thus the `Float32` bits for $1/3$ are:

In [2]:
printbits(1f0/3)

0 01111101 01010101010101010101011

The smallest positive normal number has $q = 1$ and all $b_k$ zero, that is, it equals
$2^{1-S}$.
For a given floating point type, it can be found using `floatmin`:

In [3]:
mn = floatmin(Float64)

2.2250738585072014e-308

In [4]:
printbits(mn)

0 00000000001 0000000000000000000000000000000000000000000000000000
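
The formula $2^{1-S}$ can be checked directly. The sketch below uses `ldexp(x, n)` from Base, which computes $x \times 2^n$ exactly, with $S = 1023$ for `Float64` and $S = 127$ for `Float32`.

```julia
floatmin(Float64) == ldexp(1.0, 1 - 1023)   # true: 2^(1-S) with S = 1023
floatmin(Float32) == ldexp(1f0, 1 - 127)    # true: 2^(1-S) with S = 127
```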

For sub-normal numbers, the simplest example is zero, which has $q=0$ and all significand bits zero:

In [5]:
printbits(0.0)

0 00000000000 0000000000000000000000000000000000000000000000000000

Unlike integers, we also have a negative zero:

In [6]:
printbits(-0.0)

1 00000000000 0000000000000000000000000000000000000000000000000000

This is treated as identical to `0.0`, except in how it interacts with special numbers.
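
A few Base operations illustrate the distinction. This is only a sketch of the behaviour, not an exhaustive list:

```julia
0.0 == -0.0           # true: the two zeros compare as equal
isequal(0.0, -0.0)    # false: isequal distinguishes them
signbit(-0.0)         # true: the sign bit is set
1/0.0, 1/-0.0         # (Inf, -Inf): the sign of zero matters when dividing
```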


If we divide the smallest normal number by two, we get a sub-normal number:

In [7]:
printbits(mn/2)

0 00000000000 1000000000000000000000000000000000000000000000000000

Can you explain the bits?


### Special numbers

The special numbers extend the real line by adding $\pm \infty$ but also a notion of "not-a-number".

**Definition**
Denote ${\rm NaN} := \{\}$ and define
$$
F^{\rm special} := \{\infty, -\infty, {\rm NaN}\}
$$

Whenever the bits of $q$ of a floating point number are all 1 then they represent an element of $F^{\rm special}$.
If all $b_k=0$, then the number represents either $\pm\infty$, called `Inf` and `-Inf`:

In [8]:
printbits(Inf)
printbits(-Inf)

0 11111111111 0000000000000000000000000000000000000000000000000000
1 11111111111 0000000000000000000000000000000000000000000000000000

All other special floating point numbers represent `NaN`: `NaN` is stored with $q=(11111111111)_2$ and at least one of the $b_k =1$:

In [9]:
printbits(NaN)

0 11111111111 1000000000000000000000000000000000000000000000000000

These are needed for undefined algebraic operations such as:

In [10]:
0/0

NaN

**Example** What happens if we change some other $b_k$ to be nonzero?
We can create bits as a string and see:

In [11]:
i=parse(UInt64, "1111111111110000000000000000000010000001000000000010000000000000"; base=2)
reinterpret(Float64,i)

NaN

Thus, there is more than one `NaN` on a computer.  Can you figure out how many there are?
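
When experimenting with this, it can be convenient to build the bit string from its three fields and test the result with `isnan` and `isinf`. The following sketch does this for two patterns; the variable names are our own.

```julia
# q all ones and only the last significand bit b_52 nonzero
nanbits = "0" * "1"^11 * "0"^51 * "1"
x = reinterpret(Float64, parse(UInt64, nanbits; base=2))
isnan(x)    # true

# q all ones and all significand bits zero
infbits = "0" * "1"^11 * "0"^52
y = reinterpret(Float64, parse(UInt64, infbits; base=2))
isinf(y)    # true
```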


## 3. Arithmetic


Arithmetic operations are done exactly _up to rounding_.
There are four basic rounding strategies: round up, round down, round towards zero, and round to nearest.
Mathematically we introduce the function ${\rm round}$:

**Definition** ${\rm round}^{\rm nearest}_{S,Q,P} : \mathbb R \rightarrow F_{S,Q,P}$ denotes 
the function that rounds a real number to the nearest floating point number. In case of a tie,
it returns the floating point number whose least significant bit is equal to zero.
We use the notation ${\rm round} := {\rm round}^{\rm nearest}$ when $S,Q,P$ are implied by context.
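
The tie-breaking rule can be observed by converting `Float64` values that lie exactly halfway between two neighbouring `Float32` values. This is a small sketch using only Base; the names `a` and `b` are ours.

```julia
# 1 + 2^(-24) is halfway between 1 and 1 + 2^(-23): the tie goes to 1,
# whose last significand bit is 0
a = Float32(1 + 2.0^-24)
a == 1f0                     # true

# 1 + 3*2^(-24) is halfway between 1 + 2^(-23) and 1 + 2^(-22): the tie goes
# to 1 + 2^(-22), whose last significand bit is 0
b = Float32(1 + 3 * 2.0^-24)
b == 1 + 2f0^-22             # true
```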



In Julia, the rounding mode is specified by tags `RoundUp`, `RoundDown`, `RoundToZero`, and
`RoundNearest`. (There are also more exotic rounding strategies `RoundNearestTiesAway` and
`RoundNearestTiesUp` that we won't use.)
 Note that these rounding modes are part
of the FPU instruction set so will be (roughly) equally fast as the default, `RoundNearest`.

Let's try rounding a `Float64` to a `Float32`.

In [12]:
printbits(1/3)  # 64 bits
printbits(Float32(1/3))  # round to nearest 32-bit

0 01111111101 0101010101010101010101010101010101010101010101010101
0 01111101 01010101010101010101011

The default rounding mode can be changed:

In [13]:
printbits(Float32(1/3,RoundDown) )

0 01111101 01010101010101010101010
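
Rounding down and rounding up together give a rigorous enclosure of the number being rounded. The sketch below brackets the `Float64` value `1/3` (which is itself only an approximation of $1/3$) between two neighbouring `Float32` values; the names `lo` and `hi` are ours.

```julia
lo = Float32(1/3, RoundDown)   # largest Float32 that is ≤ the Float64 value 1/3
hi = Float32(1/3, RoundUp)     # smallest Float32 that is ≥ the Float64 value 1/3
lo < 1/3 < hi                  # true: the value is enclosed
hi == nextfloat(lo)            # true: lo and hi are neighbouring Float32 values
```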

Or alternatively we can change the rounding mode for a chunk of code
using `setrounding`. The following changes `/` to round down:

In [14]:
setrounding(Float32, RoundDown) do
    1f0/3
end

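0.33333334f0

Note that the value printed here, `0.33333334f0`, coincides with the round-to-nearest result from the earlier example
rather than the rounded-down value whose significand bits end in `010`. A likely cause is that the compiler
evaluates the constant expression `1f0/3` before the rounding mode is changed; storing `1f0` in a variable first
is a common way to avoid this.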

In IEEE arithmetic, the arithmetic operations `+`, `-`, `*`, `/` are defined by the property that they are exact up to 
rounding.  Mathematically we denote these operations as follows:
$$
\begin{align*}
x\oplus y &:= {\rm round}(x+y) \\
x\ominus y &:= {\rm round}(x - y) \\
x\otimes y &:= {\rm round}(x * y) \\
x\oslash y &:= {\rm round}(x / y)
\end{align*}
$$
Note also that  `^` and `sqrt` are similarly exact up to rounding.

**WARNING** These operations are not associative! E.g. $(x \oplus y) \oplus z$ is not necessarily equal to $x \oplus (y \oplus z)$. 
Commutativity is preserved at least.
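
A standard small example of the failure of associativity, given here as a sketch with `Float64` inputs:

```julia
x, y, z = 0.1, 0.2, 0.3
(x + y) + z == x + (y + z)   # false: the two groupings round differently
x + y == y + x               # true: commutativity holds
```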


**Example** `1.1+0.1` gives a different result than `1.2`:

In [15]:
x=1.1
y=0.1
x + y - 1.2 # Not Zero?!?

2.220446049250313e-16

This is because ${\rm round}(1.1) \neq 1+1/10$, but rather:
$$
{\rm round}(1.1) = 1 + 2^{-4}+2^{-5} + 2^{-8}+2^{-9}+\cdots + 2^{-48}+2^{-49} + 2^{-51}= {2476979795053773 \over 2251799813685248} = 1.1 +2^{-51} - 2^{-52} - 2^{-53} - 2^{-56} - 2^{-57} - \cdots
$$
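
The rational value above can be checked in Julia. Converting a `Float64` with `Rational` is exact (unlike `rationalize`, which approximates), so the sketch below compares the stored value of `1.1` with the exact fraction $11/10$; the name `r` is ours.

```julia
r = Rational(1.1)                            # the exact value stored for 1.1
r == 2476979795053773//2251799813685248      # true
r > 11//10                                   # true: round(1.1) is slightly larger than 11/10
```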


We can bound basic arithmetic operations in terms of:

**Definition** _Machine epsilon_ is denoted $\epsilon_{\rm machine} := 2^{-P}$. 

This corresponds to the _relative error_ introduced
by changing the last bit of a normal floating point number. 

**Proposition**
$$
|x \oplus y - (x + y)| \leq |x+y| \epsilon_{\rm machine}
$$
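
The bound can be checked numerically for a particular pair of inputs. In this sketch the reference sum is computed with `big`, assuming the default `BigFloat` precision (256 bits) is enough for the sum of these two `Float64` values to be exact; the variable names are ours.

```julia
x, y = 1.1, 0.1
s = x + y                          # x ⊕ y, the rounded Float64 sum
exact = big(x) + big(y)            # the sum of the same two floats in BigFloat
ε = eps(Float64)                   # machine epsilon, 2^(-52)
abs(s - exact) ≤ abs(exact) * ε    # true
```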


### Arithmetic and special numbers

Arithmetic works differently on `Inf` and `NaN`. In particular we have:
$$
\begin{align*}
x \oslash 0 &:= \begin{cases}
    ({\rm sign}\, x) \infty & x \neq 0 \\
    {\rm NaN} & {\rm otherwise}
\end{cases}
\end{align*}
$$

In [16]:
Inf*0      # NaN
Inf+5      # Inf
(-1)*Inf   # -Inf
1/Inf      # 0
1/(-Inf)   # -0
Inf-Inf    # NaN
Inf == Inf   # true
Inf == -Inf  # false

NaN*0      # NaN
NaN+5      # NaN
1/NaN      # NaN
NaN == NaN    # false
NaN != NaN    #true

true

### Special functions

Other special functions like `cos`, `sin`, `exp`, etc. are _not_ part of the IEEE standard.
Instead, they are implemented by composing the basic arithmetic operations, which accumulate
errors. Fortunately they are all designed to have relative accuracy, that is, there exists
a reasonably small $c > 0$ such that `s = sin(x)` (that is, the Julia implementation of $\sin x$) satisfies
$$
|s - \sin x| \leq |\sin x| c\epsilon_{\rm machine}
$$
Note these special functions are written in (advanced) Julia code, for example, 
[sin](https://github.com/JuliaLang/julia/blob/d08b05df6f01cf4ec6e4c28ad94cedda76cc62e8/base/special/trig.jl#L76).


**WARNING** This is an extremely misleading statement for large `x`. Consider
the following demonstration of the statement:

In [17]:
ε = eps() # machine epsilon, 2^(-52)
x = 2*10.0^100
abs(sin(x) - sin(big(x)))  ≤  abs(sin(big(x))) * ε

true

But if we instead compute `10^100` using `BigFloat` we get a completely different
answer that even has the wrong sign!

In [18]:
x̃ = 2*big(10.0)^100
sin(x), sin(x̃)

(-0.703969872087777, 0.6911910845037462219623751594978914260403966392716944990360937340001300242965408)

This is because we commit an error on the order of roughly $2 \times 10^{100} \times \epsilon_{\rm machine} \approx 4.44 \times 10^{84}$
when we round $2 \times 10^{100}$ to the nearest float. 
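
The size of this rounding error can be inspected directly. The sketch below compares the `Float64` value with the `BigFloat` value, assuming the default `BigFloat` precision (256 bits), which is enough to hold $2 \times 10^{100}$ exactly; `eps(x)` gives the spacing between neighbouring floats near `x`.

```julia
x = 2 * 10.0^100            # Float64 approximation of 2 × 10^100
x̃ = 2 * big(10.0)^100       # the same quantity computed in BigFloat
big(x) == x̃                 # false: x is not exactly 2 × 10^100
abs(big(x) - x̃)             # absolute error committed by the Float64 computation
eps(x)                      # spacing of Float64 numbers near x, for comparison
```

An absolute error of this magnitude spans a vast number of periods of $\sin$, so the two arguments land at unrelated points of the sine wave.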

## 4. High-precision floating point numbers (*advanced*)