# numbers_


Reference: [Overton](https://cs.nyu.edu/~overton/book/)

In this lecture, we introduce the
[IEEE Standard for Floating-Point Arithmetic](https://en.wikipedia.org/wiki/IEEE_754).
There are many possible ways of representing real numbers on a computer, and many possible choices for
the precise behaviour of operations such as addition, multiplication, etc.
Before the 1980s each processor potentially had a different representation for 
real numbers, as well as different behaviour for operations.  
The IEEE standard, introduced in 1985, was a means to standardise this across
processors so that algorithms would produce consistent and reliable results.

This chapter may seem very low level for a mathematics course but there are
two important reasons to understand the behaviour of floating point numbers:
1. Floating point arithmetic is very precisely defined, and can even be used
in rigorous computations as we shall see in the problem sheets.
2. Failure to understand floating point arithmetic can cause many errors
in practice, with the extreme example being the [explosion of the Ariane 5 rocket](https://youtu.be/N6PWATvLQCY?t=86).


Before we begin, we load two external packages. SetRounding.jl allows us 
to set the rounding mode of floating point arithmetic. ColorBitstring.jl
implements the functions `printbits` and `printlnbits`,
which print the bits of floating point numbers in colour. 
Each colour corresponds to a different part of the representation:
the <span style="color:red">sign bit</span>, the <span style="color:green">exponent bits</span>, 
and the <span style="color:blue">significand bits</span> which we shall learn about shortly.

In [1]:
using SetRounding, ColorBitstring



In this chapter we discuss the following:

1. Binary representation: Any real number can be represented in binary, that is,
by an infinite sequence of 0s and 1s (bits). Here we review  binary representation.
2. Integers: There are multiple ways of representing integers on a computer. Here we discuss
the different types of integers and their representation as bits, and how arithmetic operations behave 
like modular arithmetic.
2. Floating point numbers: we discuss how real numbers are stored on a computer with a finite number of bits.
3. Arithmetic: we discuss how arithmetic operations in floating point are exact up to rounding, and how the
rounding mode can be set. This allows for precise error bounds in computations.
4. High-precision floating point numbers: we discuss how the precision of floating point arithmetic can be increased arbitrarily
using `BigFloat`.



## 1.  Binary representation

Any integer can be represented in binary format, that is, as a sequence of $0$s and $1$s.

**Definition**
For $B_0,\ldots,B_p \in \{0,1\}$ denote a non-negative integer in _binary format_ by:
$$
(B_p\ldots B_1B_0)_2 := 2^pB_p + \cdots + 2B_1 + B_0
$$
For $b_1,b_2,\ldots \in \{0,1\}$, denote a non-negative real number in _binary format_ by:
$$
(B_p\ldots B_0.b_1b_2b_3\ldots)_2 = (B_p\ldots B_0)_2 + {b_1 \over 2} + {b_2 \over 2^2} + {b_3 \over 2^3} + \cdots
$$



First we show some examples of verifying a number's binary representation:

**Example**
A simple integer example is
$$
5 = 2^2 + 2^0 = (101)_2
$$

**Example**
Consider the number `1/3`.  In decimal recall that:
$$
1/3 = 0.3333\ldots =  \sum_{k=1}^\infty {3 \over 10^k}
$$
We will see that in binary
$$
1/3 = (0.010101\ldots)_2 = \sum_{k=1}^\infty {1 \over 2^{2k}}
$$
Both results can be proven using the geometric series:
$$
\sum_{k=0}^\infty z^k = {1 \over 1 - z}
$$
provided $|z| < 1$. That is, with $z = 1/10$ we see the decimal case:
$$
3\sum_{k=1}^\infty {1 \over 10^k} = {3 \over 1 - {1 \over 10}} - 3 = {1 \over 3}
$$
A similar argument works for binary with $z = {1 \over 4}$:
$$
\sum_{k=1}^\infty {1 \over 4^k} = {1 \over 1 - 1/4} - 1 = {1 \over 3}
$$
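
The binary expansion above can also be checked digit by digit. The following is a minimal sketch using exact rational arithmetic: doubling shifts the binary point one place, and the integer part that appears is the next digit. The helper name `binarydigits` is an assumption made for this illustration and is not part of Julia or of the packages loaded above.

```julia
# `binarydigits` is a hypothetical helper: the first n binary digits b_1, ..., b_n
# of a rational number x in [0, 1), computed without any rounding.
function binarydigits(x::Rational, n::Integer)
    ds = Int[]
    for _ in 1:n
        x *= 2               # shift the binary point one place to the right
        b = floor(Int, x)    # the integer part is the next binary digit
        push!(ds, b)
        x -= b               # keep only the fractional part
    end
    ds
end

binarydigits(1//3, 8)   # [0, 1, 0, 1, 0, 1, 0, 1]
```

The digits repeat because after two steps the remainder is again `1//3`.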



## 2. Integers


On a computer one typically represents integers by a finite number of $p$ bits,
with $2^p$ possible combinations of 0s and 1s. For _unsigned integers_, that is,
non-negative integers, these bits are just the first $p$ binary digits, that is,
$(B_{p-1}\ldots B_1B_0)_2$.
 
Integers follow [modular arithmetic](https://en.wikipedia.org/wiki/Modular_arithmetic),
that is, $p$-bit numbers on a computer represent elements of ${\mathbb Z}_{2^p}$, 
the ring of integers modulo $2^p$, and arithmetic follows accordingly.

**Example (addition of 8-bit unsigned integers)** Consider the addition of
two 8-bit numbers:
$$
255 + 1 = (11111111)_2 + (00000001)_2 = (100000000)_2 = 256
$$
The result is impossible to store in just 8 bits! It would be too slow
for a computer to increase the number of bits on the fly, or to throw an error (the checks would slow down every operation).
So instead it treats the integers as elements of ${\mathbb Z}_{256}$:
$$
(255 + 1) \mod 256 = (00000000)_2 \mod 256 = 0 \mod 256
$$
We can see this in Julia:

In [2]:
x = UInt8(255)
y = UInt8(1)
printbits(x); println(" + "); printbits(y); println(" = ")
printbits(x + y)

11111111 + 
00000001 = 
00000000

**Example (multiplication of 8-bit unsigned integers)** 
Multiplication works similarly: for example,
$$
254 * 2 \mod 256 = 252 = (11111100)_2
$$
and we have

In [3]:
x = UInt8(254)
y = UInt8(2)
printbits(x); println(" * "); printbits(y); println(" = ")
printbits(x * y)

11111110 * 
00000010 = 
11111100

Signed integers use the [2's complement](https://epubs.siam.org/doi/abs/10.1137/1.9780898718072.ch3)
convention. The convention is that if the first bit is 1 then the number is negative: for $1 \leq y \leq 2^{p-1}$ the number $2^p - y$
is interpreted as $-y$.
Thus for $p = 8$ we are interpreting
$2^7 \mod 2^8$ through $(2^8-1) \mod 2^8$ as the negative numbers $-128$ through $-1$. 

**Example (converting bits to signed integers)** 
What 8-bit integer has the bits `01001001`? Adding the corresponding powers of 2 we get:

In [4]:
2^0 + 2^3 + 2^6

73

What 8-bit (signed) integer has the bits `11001001`? Because the first bit is `1` we know it's a negative 
number, hence we need to sum the corresponding powers of 2 and then subtract $2^p$ with $p = 8$:

In [5]:
2^0 + 2^3 + 2^6 + 2^7 - 2^8

-55

We can check the results using `printbits`:

In [6]:
printlnbits(Int8(73))
printbits(-Int8(55))

[31m0[0m[34m1001001[0m
[31m1[0m[34m1001001[0m

Arithmetic works precisely
the same for signed and unsigned integers.

**Example (addition of 8-bit integers)**
Consider `(-1) + 1` in 8-bit arithmetic. The number $-1$ has the same bits as
$2^8 - 1 = 255$. Thus this is equivalent to the unsigned example $255 + 1$ above and we get the correct
result of $0$. In other words:
$$
(-1 \mod 2^p) + (1 \mod 2^p) = (2^p-1 \mod 2^p) + (1 \mod 2^p) = 2^p \mod 2^p = 0 \mod 2^p
$$


**Example (multiplication of 8-bit integers)**
Consider `(-2) * 2` in 8-bit arithmetic. $-2$ has the same bits as $2^8 - 2 = 254$ and $-4$ has the
same bits as $2^8 - 4 = 252$, and hence from the previous multiplication example we get the correct result of $-4$.
In other words:
$$
(-2 \mod 2^p) * (2 \mod 2^p) = (2^p-2 \mod 2^p) * (2 \mod 2^p) = (2^{p+1}-4 \mod 2^p) = -4 \mod 2^p
$$
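
The claim that signed and unsigned arithmetic act on the same bits can be checked directly. This is a small sketch using only `reinterpret` and `bitstring` from Base; the variable names are chosen for illustration.

```julia
x = Int8(-2)
u = reinterpret(UInt8, x)          # the same 8 bits, read as an unsigned integer
bitstring(x) == bitstring(u)       # true: both are "11111110"
u == 254                           # true: 2^8 - 2
reinterpret(Int8, u * UInt8(2))    # -4: the unsigned product 252 has the bits of -4
```

Only the interpretation of the leading bit differs between `Int8` and `UInt8`; the multiplication itself is the same operation in ${\mathbb Z}_{2^8}$.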





We can find the largest and smallest instances of a type using `typemax` and `typemin`:

In [7]:
printlnbits(typemax(Int8)) # 2^7-1 = 127
printbits(typemin(Int8)) # -2^7 = -128

[31m0[0m[34m1111111[0m
[31m1[0m[34m0000000[0m

As explained, due to modular arithmetic, when we add `1` to the largest 8-bit integer we get the smallest:

In [8]:
typemax(Int8) + Int8(1) # returns typemin(Int8)

-128

In addition to `+`, `-`, and `*` we have integer division `÷`, which rounds towards zero (for positive arguments this is the same as rounding down):

In [9]:
5 ÷ 2 # equivalent to div(5,2)

2
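
For negative arguments it matters whether the quotient is truncated or rounded down. A short check, using the Base functions `div`, `fld` and `cld`:

```julia
(-5) ÷ 2      # -2: `÷` (that is, `div`) truncates towards zero
fld(-5, 2)    # -3: `fld` rounds down, towards -∞
cld(5, 2)     #  3: `cld` rounds up, towards +∞
```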

Standard division `/` (or `\`, which takes its arguments in the reverse order) creates a floating point number, which will be discussed shortly:

In [10]:
5 / 2 # alternatively 2 \ 5

2.5

**Remark (advanced)** We can also create rational numbers using `//`:

In [11]:
(1//2) + (3//4)

5//4

Rational arithmetic often leads to overflow so it
is often best to combine `big` with rationals:

In [12]:
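102324//132413023 + 23434545//4243061 + 23434545//42430534435      # overflows: the entries are Int64
big(102324)//132413023 + 23434545//4243061 + 23434545//42430534435 # BigInt entries avoid the overflow (not reached here, as the line above throws)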

LoadError: OverflowError: 3103473113053299 * 8486106887 overflowed for type Int64

See the [Introduction to Julia](Julia.ipynb) for details on `big`,
which creates a `BigInt`, which allows for an arbitrary number of bits.

## 2. Floating point numbers

Floating point numbers are a subset of real numbers that are representable using
a fixed number of bits.

**Definition** The _floating point numbers_ are
$$
F_{S,Q,P} := F^{\rm normal}_{S,Q,P} \cup F^{\rm sub-normal}_{S,Q,P} \cup F^{\rm special}.
$$
The _normal floating point numbers_
$F^{\rm normal}_{S,Q,P} \subset {\mathbb R}$ are defined by
$$
F^{\rm normal}_{S,Q,P} = \{\pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2 : 1 \leq q < 2^Q-1 \}.
$$
The _sub-normal numbers_ $F^{\rm sub-normal}_{S,Q,P} \subset {\mathbb R}$ are defined as
$$
F^{\rm sub-normal}_{S,Q,P} = \{\pm 2^{1-S} \times (0.b_1b_2b_3\ldots b_P)_2\}.
$$
The _special numbers_ $F^{\rm special} \not\subset {\mathbb R}$ are defined later.

Note this set of real numbers has no nice algebraic structure: it is not closed under addition, subtraction, etc.
We will therefore need to define approximate versions of algebraic operations later.

Floating point numbers are stored in a total of $1 + Q + P$ bits, in the format
$$
sq_1\ldots q_Q b_1 \ldots b_P
$$
The first bit ($s$) is the <span style="color:red">sign bit</span>: 0 means positive and 1 means
negative. The bits $q_1\ldots q_Q$ are the <span style="color:green">exponent bits</span>:
they are the binary digits of the unsigned integer $q$: 
$$
q = (q_1\ldots q_Q)_2.
$$
Finally, the bits $b_1\ldots b_P$ are the <span style="color:blue">significand bits</span>.
If $1 \leq q < 2^Q-1$ then the bits represent the normal number
$$
x = \pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2.
$$
If $q = 0$ (i.e. all exponent bits are 0) then the bits represent the sub-normal number
$$
x = \pm 2^{1-S} \times (0.b_1b_2b_3\ldots b_P)_2.
$$
If $q = 2^Q-1$  (i.e. all exponent bits are 1) then the bits represent a special number, discussed
later.


### IEEE floating point numbers

**Definition** IEEE has 3 standard floating point formats: 16-bit (half precision), 32-bit (single precision) and
64-bit (double precision) defined by:
$$
\begin{align*}
F_{16} &:= F_{15,5,10} \\
F_{32} &:= F_{127,8,23} \\
F_{64} &:= F_{1023,11,52}
\end{align*}
$$

In Julia these correspond to 3 different floating point types:

1.  `Float64` is a type representing $F_{64}$.
We can create a `Float64` by including a 
decimal point when writing the number: 
`1.0` is a `Float64`. `Float64` is the default format for 
scientific computing (on the _Floating Point Unit_, FPU).  
2. `Float32` is a type representing $F_{32}$.  We can create a `Float32` by including 
`f0` when writing the number: 
`1f0` is a `Float32`. `Float32` is generally the default format for graphics (on the _Graphics Processing Unit_, GPU), 
as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation,
and more data can be fit into a GPU's limited memory.
3.  `Float16` is a type representing $F_{16}$.
It is important in machine learning where one wants to maximise the amount of data
and high accuracy is not necessarily helpful. 
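
The storage format can be checked by decoding the bits of a `Float32` by hand, following the formula for normal numbers with $S = 127$, $Q = 8$, $P = 23$. The following is a minimal sketch; `decodenormal` is a hypothetical helper written for this illustration and handles normal numbers only.

```julia
# `decodenormal` is a hypothetical helper, not part of Julia or of any package.
function decodenormal(x::Float32)
    S, P = 127, 23                       # F_32 = F_{127,8,23}
    bits = reinterpret(UInt32, x)        # the 32 raw bits as an unsigned integer
    s = bits >> 31                       # sign bit
    q = Int((bits >> P) & 0xff)          # the Q = 8 exponent bits as an integer
    b = Int(bits & 0x007fffff)           # the P = 23 significand bits as an integer
    1 ≤ q < 255 || error("not a normal number")
    sgn = s == 0 ? 1.0 : -1.0
    sgn * 2.0^(q - S) * (1 + b / 2.0^P)  # ± 2^(q-S) * (1.b_1...b_P)_2, exact in Float64
end

decodenormal(1f0/3) == 1f0/3   # true
```

The result is returned as a `Float64`, which has enough bits to hold every `Float32` exactly, so the comparison is an exact one.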


**Example** How is the number $1/3$ stored in `Float32`?
Recall that
$$
1/3 = (0.010101\ldots)_2 = 2^{-2} (1.0101\ldots)_2 = 2^{125-127} (1.0101\ldots)_2
$$
Since
$$
125 = (1111101)_2
$$
the exponent bits are `01111101`.
For the significand we round the last bit to the nearest element of $F_{32}$ (this is explained in detail in
the section on rounding), so we have
$$
1.010101010101010101010101\ldots \approx 1.01010101010101010101011 \in F_{32} 
$$
Thus the `Float32` bits for $1/3$ are:

In [13]:
printbits(1f0/3)

[31m0[0m[32m01111101[0m[34m01010101010101010101011[0m

For sub-normal numbers, the simplest example is zero, which has $q=0$ and all significand bits zero:

In [14]:
printbits(0.0)

[31m0[0m[32m00000000000[0m[34m0000000000000000000000000000000000000000000000000000[0m

Unlike integers, we also have a negative zero:

In [15]:
printbits(-0.0)

[31m1[0m[32m00000000000[0m[34m0000000000000000000000000000000000000000000000000000[0m

This is treated as identical to `0.0` (except for degenerate operations as explained in special numbers).


If we divide the smallest normal number by two, we get a subnormal number:

In [16]:
mn = floatmin(Float32) # smallest normal Float32
printbits(mn/2)

[31m0[0m[32m00000000[0m[34m10000000000000000000000[0m

Can you explain the bits?

### Special normal numbers

When dealing with normal numbers there are some important constants that we will use
to bound errors.

**Definition** _Machine epsilon_ is denoted
$$
\epsilon_{{\rm m},P} := 2^{-P}.
$$
When $P$ is implied by context we use the notation $\epsilon_{\rm m}$.
The _smallest positive normal number_ corresponds to $q = 1$ and $b_k$ all zero:
$$
\min |F_{S,Q,P}^{\rm normal}| = 2^{1-S}.
$$ 
The _largest (positive) normal number_ corresponds to $q = 2^Q-2$ and $b_k$ all one:
$$
\max F_{S,Q,P}^{\rm normal} = 2^{2^Q-2-S} (1.11\ldots1)_2 = 2^{2^Q-2-S} (2-\epsilon_{\rm m})
$$


We confirm the simple bit representations:

In [17]:
S,P,Q = 127,23,8 # Float32
εₘ = 2.0^(-P)
printlnbits(Float32(2.0^(1-S))) # smallest positive normal Float32
printlnbits(Float32(2.0^(2^Q-2-S) * (2-εₘ))) # largest normal Float32

[31m0[0m[32m00000001[0m[34m00000000000000000000000[0m
[31m0[0m[32m11111110[0m[34m11111111111111111111111[0m


For a given floating point type, we can find these constants using the following functions:

In [18]:
eps(Float32),floatmin(Float32),floatmax(Float32)

(1.1920929f-7, 1.1754944f-38, 3.4028235f38)
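
Machine epsilon can also be read as the gap between $1$ and the next larger floating point number, since the latter is $(1.0\ldots01)_2 = 1 + 2^{-P}$. A quick check using the Base functions `nextfloat`, `prevfloat` and `issubnormal`:

```julia
nextfloat(1f0) - 1f0 == eps(Float32)          # true: the gap after 1 is 2^(-23)
nextfloat(1.0) - 1.0 == eps(Float64)          # true: the gap after 1 is 2^(-52)
floatmin(Float32) == 2f0^(-126)               # true: 2^(1-S) with S = 127
issubnormal(prevfloat(floatmin(Float32)))     # true: just below floatmin is sub-normal
```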

### Special numbers

The special numbers extend the real line by adding $\pm \infty$ but also a notion of "not-a-number".

**Definition**
Let ${\rm NaN}$ represent "not a number" and define
$$
F^{\rm special} := \{\infty, -\infty, {\rm NaN}\}
$$

Whenever the bits of $q$ of a floating point number are all 1 then they represent an element of $F^{\rm special}$.
If all $b_k=0$, then the number represents $\pm\infty$ (with the sign given by the sign bit), called `Inf` and `-Inf` for 64-bit floating point numbers (or `Inf16`, `Inf32`
for 16-bit and 32-bit, respectively):

In [19]:
printlnbits(Inf16)
printbits(-Inf16)

[31m0[0m[32m11111[0m[34m0000000000[0m
[31m1[0m[32m11111[0m[34m0000000000[0m

All other special floating point numbers represent ${\rm NaN}$. One particular representation of ${\rm NaN}$ 
is denoted by `NaN` for 64-bit floating point numbers (or `NaN16`, `NaN32` for 16-bit and 32-bit, respectively):

In [20]:
printbits(NaN16)

[31m0[0m[32m11111[0m[34m1000000000[0m

These are needed for undefined algebraic operations such as:

In [21]:
0/0

NaN

**Example** What happens if we change some other $b_k$ to be nonzero?
We can create bits as a string and see:

In [22]:
i = parse(UInt16, "0111110000010001"; base=2)
reinterpret(Float16, i)

NaN16

Thus, there is more than one `NaN` on a computer.  Can you figure out how many there are?


## 3. Arithmetic


Arithmetic operations are done exactly _up to rounding_.
There are three basic rounding strategies: round up/down/nearest.
Mathematically we introduce the function ${\rm round}$:

**Definition** ${\rm round}^{\rm up}_{S,Q,P} : \mathbb R \rightarrow F_{S,Q,P}$ denotes 
the function that rounds a real number up to the nearest floating point number that is greater or equal.
${\rm round}^{\rm down}_{S,Q,P} : \mathbb R \rightarrow F_{S,Q,P}$ denotes 
the function that rounds a real number down to the nearest floating point number that is less or equal.
${\rm round}^{\rm nearest}_{S,Q,P} : \mathbb R \rightarrow F_{S,Q,P}$ denotes 
the function that rounds a real number to the nearest floating point number. In case of a tie,
it returns the floating point number whose least significant bit is equal to zero.
We use the notation ${\rm round}$ when $S,Q,P$ and the rounding mode are implied by context,
with ${\rm round}^{\rm nearest}$ being the default rounding mode.



In Julia, the rounding mode is specified by tags `RoundUp`, `RoundDown`, and
`RoundNearest`. (There are also more exotic rounding strategies `RoundToZero`, `RoundNearestTiesAway` and
`RoundNearestTiesUp` that we won't use.)
 Note that these rounding modes are part
of the FPU instruction set so will be (roughly) equally fast as the default, `RoundNearest`.

Let's try rounding a `Float64` to a `Float32`.

In [23]:
printlnbits(1/3)  # 64 bits
printbits(Float32(1/3))  # round to nearest 32-bit

0011111111010101010101010101010101010101010101010101010101010101
00111110101010101010101010101011

The rounding mode used in the conversion can be specified explicitly:

In [24]:
printbits(Float32(1/3,RoundDown) )

[31m0[0m[32m01111101[0m[34m01010101010101010101010[0m

Or alternatively we can change the rounding mode for a chunk of code
using `setrounding`. The following changes `/` to round down:

In [25]:
setrounding(Float32, RoundDown) do
    1f0/3
end

0.33333334f0

In IEEE arithmetic, the arithmetic operations `+`, `-`, `*`, `/` are defined by the property that they are exact up to 
rounding.  Mathematically we denote these operations as follows:
$$
\begin{align*}
x\oplus y &:= {\rm round}(x+y) \\
x\ominus y &:= {\rm round}(x - y) \\
x\otimes y &:= {\rm round}(x * y) \\
x\odiv y &:= {\rm round}(x / y)
\end{align*}
$$
Note also that  `^` and `sqrt` are similarly exact up to rounding.

**WARNING** These operations are not associative! E.g. $(x \oplus y) \oplus z$ is not necessarily equal to $x \oplus (y \oplus z)$. 
Commutativity is preserved at least.
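
A standard illustration of the failure of associativity, here with `Float64` and the default rounding mode:

```julia
x, y, z = 0.1, 0.2, 0.3
(x + y) + z == x + (y + z)    # false
(x + y) + z, x + (y + z)      # (0.6000000000000001, 0.6)
x + y == y + x                # true: commutativity still holds
```

Each sum is rounded after every operation, so the two groupings round different intermediate results.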


**Example** `1.1+0.1` gives a different result than `1.2`:

In [26]:
x=1.1
y=0.1
x + y - 1.2 # Not Zero?!?

2.220446049250313e-16

This is because ${\rm round}(1.1) \neq 1+1/10$, but rather:
$$
{\rm round}(1.1) = 1 + 2^{-4}+2^{-5} + 2^{-8}+2^{-9}+\cdots + 2^{-48}+2^{-49} + 2^{-51}= {2476979795053773 \over 2251799813685248} = 1.1 +2^{-51} - 2^{-52} - 2^{-53} - 2^{-56} - 2^{-57} - \cdots
$$


We can bound basic arithmetic operations in terms of:

**Definition** The _normalized range_ ${\cal N}_{S,Q,P} \subset {\mathbb R}$
is the subset of real numbers that lies
between the smallest and largest normal floating point number:
$$
{\cal N}_{S,Q,P} := \{x : \min |F^{\rm normal}_{S,Q,P}| \leq |x| \leq \max F^{\rm normal}_{S,Q,P} \}
$$
When $S,Q,P$ are implied by context we use the notation ${\cal N}$.

We can use machine epsilon to determine bounds on rounding:

**Proposition** For any rounding mode,
$$
\begin{align*}
x \in {\cal N} &\Rightarrow |{\rm round}(x) - x|  \leq |x| \epsilon_{\rm m} \\
x + y \in {\cal N} &\Rightarrow |x \oplus y - (x + y)| \leq |x+y| \epsilon_{\rm m} \\
x - y \in {\cal N} &\Rightarrow |x \ominus y - (x - y)| \leq |x-y| \epsilon_{\rm m} \\
x y \in {\cal N} &\Rightarrow |x \otimes y - x y| \leq |xy| \epsilon_{\rm m} \\
x/y \in {\cal N} &\Rightarrow |x \odiv y - (x / y)| \leq |x/y| \epsilon_{\rm m}.
\end{align*}
$$
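
The first bound can be checked numerically for a particular number. The sketch below uses a `BigFloat` approximation of $1/3$ as a stand-in for the real number (an assumption: it is itself rounded, but to 256 bits by default, far beyond `Float64` precision) and rounds it to `Float64` in each of the three modes.

```julia
x = big(1)/3             # BigFloat stand-in for the real number 1/3
εₘ = eps(Float64)        # machine epsilon for Float64, 2^(-52)
for mode in (RoundNearest, RoundUp, RoundDown)
    fl = Float64(x, mode)                            # round(x) in the given mode
    println(mode, ": ", abs(fl - x) ≤ abs(x) * εₘ)   # the bound of the proposition
end
```

Every line should report `true`. For `RoundNearest` the error is in fact at most half of this bound.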


### Arithmetic and special numbers

Arithmetic works differently on `Inf` and `NaN`. In particular we have:
$$
x \odiv 0 := \begin{cases}
    ({\rm sign}\, x) \infty & x \neq 0 \\
    {\rm NaN} & {\rm otherwise}
\end{cases}
$$

In [27]:
Inf*0      # NaN
Inf+5      # Inf
(-1)*Inf   # -Inf
1/Inf      # 0
1/(-Inf)   # -0
Inf-Inf    # NaN
Inf == Inf   # true
Inf == -Inf  # false

NaN*0      # NaN
NaN+5      # NaN
1/NaN      # NaN
NaN == NaN    # false
NaN != NaN    #true

true

### Special functions

Other special functions like `cos`, `sin`, `exp`, etc. are _not_ part of the IEEE standard.
Instead, they are implemented by composing the basic arithmetic operations, which accumulate
errors. Fortunately they are all designed to have relative accuracy, that is, there exists a
reasonably small $c > 0$ such that `s = sin(x)` (that is, the Julia implementation of $\sin x$) satisfies
$$
|s - \sin x| \leq |\sin x| c\epsilon_{\rm m}
$$
Note these special functions are written in (advanced) Julia code, for example, 
[sin](https://github.com/JuliaLang/julia/blob/d08b05df6f01cf4ec6e4c28ad94cedda76cc62e8/base/special/trig.jl#L76).


**WARNING** This is an extremely misleading statement for large `x`. Consider
the following demonstration of the statement:

In [28]:
ε = eps() # machine epsilon, 2^(-52)
x = 2*10.0^100
abs(sin(x) - sin(big(x)))  ≤  abs(sin(big(x))) * ε

true

But if we instead compute `2*10^100` using `BigFloat` we get a completely different
answer that even has the wrong sign!

In [29]:
x̃ = 2*big(10.0)^100
sin(x), sin(x̃)

(-0.703969872087777, 0.6911910845037462219623751594978914260403966392716944990360937340001300242965408)

This is because we commit an error on the order of roughly $2 * 10^{100} * \epsilon_{\rm m} \approx 4.44 * 10^{84}$
when we round $2*10^{100}$ to the nearest float. 
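
The size of this error can be measured directly by comparing the `Float64` value with a `BigFloat` computation. This is a sketch under the assumption that the default 256-bit `BigFloat` precision is in use, in which case $2 \times 10^{100} = 2^{101} \times 5^{100}$ is represented exactly, because $5^{100}$ needs only 233 bits.

```julia
x = 2*10.0^100            # Float64 approximation of 2*10^100
x̃ = 2*big(10.0)^100       # BigFloat value, exact at the default 256-bit precision
err = abs(big(x) - x̃)     # absolute error committed by working in Float64
Float64(err)              # the size of the error
Float64(err / (2π))       # the number of full periods of sin spanned by the error
```

An absolute error spanning this many periods means that `sin(x)` and `sin(x̃)` are evaluated at essentially unrelated points, even though each evaluation is accurate for the argument it is given.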

## 4. High-precision floating point numbers (advanced)