# IEEE Floating Point Arithmetic


Reference: [Overton](https://cs.nyu.edu/~overton/book/) Chapter 4.

In this lecture, we introduce the
[IEEE Standard for Floating-Point Arithmetic](https://en.wikipedia.org/wiki/IEEE_754).
There are many possible ways of representing real numbers on a computer, and of
defining the precise behaviour of operations such as addition, multiplication, etc.
Before the 1980s each processor had potentially a different representation for 
real numbers, as well as different behaviour for operations.  
The IEEE standard, introduced in 1985, was a means to standardise this across
processors so that algorithms would produce consistent and reliable results.


Before we begin, we load two external packages. SetRounding.jl allows us 
to set the rounding mode of floating point arithmetic. ColorBitstring.jl
implements a function `printbits`
which prints the bits of floating point numbers in colour. 
Each colour corresponds to a different part of the representation:
the <span style="color:red">sign bit</span>, the <span style="color:green">exponent bits</span>, 
and the <span style="color:blue">significand bits</span> which we shall learn about shortly.

In [1]:
using SetRounding, ColorBitstring
printbits(1.0)

[31m0[0m[32m01111111111[0m[34m0000000000000000000000000000000000000000000000000000[0m

## Real numbers in binary

Before discussing how to represent real numbers with a finite number
of bits, we consider what one could do with an infinite number of bits,
that is, binary expansions.

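We use the notation
$$
(1.b_1b_2b_3\ldots)_2 = 1 + \sum_{k=1}^\infty {b_k \over 2^k}
$$
where each bit $b_k$ is either 0 or 1, and similarly $(0.b_1b_2b_3\ldots)_2$ denotes the same sum without the leading 1.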


**Example**
Consider the number `1/3`.  In decimal we all know:
$$
1/3 = 0.3333\ldots =  \sum_{k=1}^\infty {3 \over 10^k}
$$
We will see that in binary
$$
1/3 = (0.010101\ldots)_2 = \sum_{k=1}^\infty {1 \over 2^{2k}}
$$
Both results can be proven using the geometric series:
$$
\sum_{k=0}^\infty z^k = {1 \over 1 - z}
$$
provided $|z| < 1$. That is, with $z = 1/10$ we see the decimal case:
$$
3\sum_{k=1}^\infty {1 \over 10^k} = {3 \over 1 - {1 \over 10}} - 3 = {1 \over 3}
$$
A similar argument works for binary with $z = {1 \over 4}$:
$$
\sum_{k=1}^\infty {1 \over 4^k} = {1 \over 1 - 1/4} - 1 = {1 \over 3}
$$
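
The binary expansion can also be checked digit by digit. The following is a minimal sketch, assuming exact `Rational` arithmetic and a hypothetical helper `binarydigits` (not part of any package), that extracts the binary digits of the fractional part by repeated doubling:

```julia
# Hypothetical helper: first n binary digits after the point of x,
# found by repeatedly doubling the fractional part.
function binarydigits(x::Rational, n::Integer)
    bits = Int[]
    frac = x - floor(Int, x)      # fractional part, kept exact as a Rational
    for _ in 1:n
        frac *= 2
        b = floor(Int, frac)      # next binary digit (0 or 1)
        push!(bits, b)
        frac -= b
    end
    bits
end

binarydigits(1//3, 8)   # [0, 1, 0, 1, 0, 1, 0, 1]
```

Because the arithmetic is exact, the repeating pattern `01` continues for as many digits as requested.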




## Floating point number examples

`Float64` is a type representing real numbers using 64 bits, 
also known as double precision.
We can create a `Float64` by including a 
decimal point when writing the number: 
`1.0` is a `Float64` while `1` is an `Int`.
We use `printbits` to see the bits of a few `Float64` numbers.

First, let's check an integer stored as a `Float64`:

In [2]:
printbits(1.0)

[31m0[0m[32m01111111111[0m[34m0000000000000000000000000000000000000000000000000000[0m

The format is very different from what we saw before with `Int`:

In [3]:
bitstring(1)

"0000000000000000000000000000000000000000000000000000000000000001"

Another example is `1.3`. Although this is representable with only two base-10 digits, 
it requires an infinite number of base-2 digits, which are 
cut off in `Float64`:

In [4]:
printbits(1.3)

[31m0[0m[32m01111111111[0m[34m0100110011001100110011001100110011001100110011001101[0m

`Float32` is another type representing real numbers using 32 bits, that is also known 
as single precision.  `Float64` is now the default format for scientific computing (on the _Floating Point Unit_, FPU).  
`Float32` is generally the default format for graphics (on the _Graphics Processing Unit_, GPU), 
as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation.  
`Float16` is an important type in machine learning where one wants to maximise the amount of data
and high accuracy is not necessarily helpful. 

Here we see that the format for `Float32` looks consistent with `Float64`:

In [5]:
printbits(Float32(1.3))

[31m0[0m[32m01111111[0m[34m01001100110011001100110[0m

## Standard IEEE numbers


The bits of floats are stored in the format
$$
sq_1\ldots q_Q b_1 \ldots b_P
$$
which (provided $q_k$ are not all zero and not all one) 
represents the real number
$$
x=\pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2
$$
where the sign is $+$ if the sign bit $s$ is 0 and $-$ if $s$ is 1, 
$S$, $Q$ and $P$ are fixed constants, and
$q = (q_1q_2\ldots q_Q)_2$ is an unsigned integer satisfying
$$
0 < q < 2^Q-1.
$$


In the case of `Float32`, $S = 127$, $Q = 8$, and $P = 23$,
for a total of $1 + 8 + 23 = 32$ bits.

In the case of `Float64`, $S = 1023$, $Q = 11$, and $P = 52$,
for a total of $1 + 11 + 52 = 64$ bits.
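
To make the format concrete, here is a minimal sketch for normal `Float64` numbers that splits the 64 bits into $s$, $q$ and the significand bits, and then rebuilds the value from the formula. The helper names `decodefloat64` and `rebuild` are hypothetical, introduced only for illustration, and a sign bit of 0 is taken to mean a positive number:

```julia
# Split a Float64 into sign bit s, exponent field q, and the 52 significand bits
# (returned as an unsigned integer b, so that (1.b_1…b_P)_2 = 1 + b / 2^P).
function decodefloat64(x::Float64)
    bits = reinterpret(UInt64, x)
    s = bits >> 63                     # sign bit
    q = Int((bits >> 52) & 0x7ff)      # 11 exponent bits as an unsigned integer
    b = bits & 0x000fffffffffffff      # 52 significand bits
    (s, q, b)
end

# Rebuild a normal number from x = ± 2^(q-S) * (1.b_1…b_P)_2.
function rebuild(s, q, b; S=1023, P=52)
    significand = 1 + b / 2.0^P        # exact: fits in 53 bits
    sgn = s == 0 ? 1 : -1
    sgn * ldexp(significand, q - S)    # ldexp scales by 2^(q-S) exactly
end

s, q, b = decodefloat64(1.3)
rebuild(s, q, b) == 1.3                # true
```

The sketch only covers the normal case $0 < q < 2^Q - 1$; the other values of $q$ are interpreted differently.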

**Example** How is the number $1/3$ stored in `Float32`?
Recall that
$$
1/3 = (0.010101\ldots)_2 = 2^{-2} (1.0101\ldots)_2 = 2^{125-127} (1.0101\ldots)_2
$$
Since
$$
125 = (1111101)_2
$$
the exponent bits are `01111101`.
For the significand we round to the nearest number with 23 bits after the binary point, so we have
$$
(1.010101010101010101010101\ldots)_2 \approx (1.01010101010101010101011)_2 
$$
Thus the `Float32` bits for $1/3$ are:

In [6]:
printbits(1f0/3)

[31m0[0m[32m01111101[0m[34m01010101010101010101011[0m

The smallest positive normal number has $q = 1$ and all $b_k$ zero, that is, it equals $2^{1-S}$.  
For a given floating point type, it can be found using `floatmin`:

In [7]:
mn = floatmin(Float64)

2.2250738585072014e-308

In [8]:
printbits(mn)

[31m0[0m[32m00000000001[0m[34m0000000000000000000000000000000000000000000000000000[0m

## Subnormal numbers

Whenever $q = 0$, the number is called a subnormal number and the bits are interpreted differently.  Instead, the number represented is
$$
    x = \pm 2^{1-S} \times (0.b_1b_2b_3\ldots b_P)_2
$$
That is, the significand is now $0 + b_1/2 + \cdots$ instead of $1 + b_1/2 + \cdots$.

The simplest example is zero, which has $q=0$ and all significand bits zero:

In [9]:
printbits(0.0)

[31m0[0m[32m00000000000[0m[34m0000000000000000000000000000000000000000000000000000[0m

Unlike integers, we also have a negative zero:

In [10]:
printbits(-0.0)

[31m1[0m[32m00000000000[0m[34m0000000000000000000000000000000000000000000000000000[0m

If we divide the smallest normal number by two, we get a subnormal number:

In [11]:
printbits(mn/2)

[31m0[0m[32m00000000000[0m[34m1000000000000000000000000000000000000000000000000000[0m

Can you explain the bits?
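
One way to test an explanation is to compare against powers of two directly. The following sketch uses only Base functions (`issubnormal`, `ldexp`, `nextfloat`) and, for `Float64`, assumes $S = 1023$ and $P = 52$ as stated earlier:

```julia
mn = floatmin(Float64)

issubnormal(mn)                        # false: q = 1, so still a normal number
issubnormal(mn/2)                      # true:  q = 0
mn/2 == ldexp(1.0, -1023)              # true:  2^(1-S) * (0.1)_2 = 2^(-1023)
nextfloat(0.0) == ldexp(1.0, -1074)    # true:  smallest positive subnormal, 2^(1-S) * 2^(-P)
```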


## Special numbers

Whenever the bits of $q$ are all 1, that is, for `Float64` $q=2^{11}-1=2047=(11111111111)_2$, the number is treated differently.  If all $b_k=0$, then the number is interpreted as $\pm\infty$ according to the sign bit, written `Inf` and `-Inf`:

In [12]:
printbits(Inf)
printbits(-Inf)

[31m0[0m[32m11111111111[0m[34m0000000000000000000000000000000000000000000000000000[0m[31m1[0m[32m11111111111[0m[34m0000000000000000000000000000000000000000000000000000[0m

Another special number is `NaN`, which represents _not a number_. For example, `0/0` is not defined, so returns `NaN`:

In [13]:
0/0

NaN

`NaN` is stored with $q=(11111111111)_2$ and at least one of the $b_k =1$:

In [14]:
printbits(NaN)

[31m0[0m[32m11111111111[0m[34m1000000000000000000000000000000000000000000000000000[0m

**Example** What happens if we change some other $b_k$ to be nonzero?
We can create bits as a string and see:

In [15]:
i=parse(UInt64, "1111111111110000000000000000000010000001000000000010000000000000"; base=2)
reinterpret(Float64,i)

NaN

Thus, there is more than one `NaN` on a computer.  Can you figure out how many there are?
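
The fact that different bit patterns are all `NaN` can be observed directly. In this small sketch the two hexadecimal patterns are chosen for illustration: the first is the `NaN` printed above, the second differs only in the last significand bit.

```julia
a = reinterpret(Float64, 0x7ff8000000000000)  # bits of the NaN printed above
b = reinterpret(Float64, 0x7ff8000000000001)  # same exponent, different significand

(isnan(a), isnan(b))   # (true, true)
a === b                # false: === compares the underlying bits
```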

Arithmetic works differently on `Inf` and `NaN`:

In [16]:
Inf*0      # NaN
Inf+5      # Inf
(-1)*Inf   # -Inf
1/Inf      # 0.0
1/(-Inf)   # -0.0
Inf-Inf    # NaN
Inf == Inf   # true
Inf == -Inf  # false

NaN*0      # NaN
NaN+5      # NaN
1/NaN      # NaN
NaN == NaN    # false
NaN != NaN    # true

true
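
Since `NaN == NaN` is false, comparing with `==` cannot be used to detect a `NaN`. A short sketch of the Base functions that are usually used instead, together with the analogous behaviour of the two zeros:

```julia
x = 0/0

isnan(x)              # true: the reliable test, since x == x is false for NaN
isequal(NaN, NaN)     # true: isequal treats NaNs as equal to each other
0.0 == -0.0           # true: the two zeros compare equal...
isequal(0.0, -0.0)    # false: ...but isequal distinguishes them
(1/0.0, 1/(-0.0))     # (Inf, -Inf): the sign of zero shows up on division
```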

## Rounding

There are four basic rounding strategies: round up/down/towards zero/nearest.
These are specified by tags `RoundUp`, `RoundDown`, `RoundToZero`, and
`RoundNearest`. (There are also more exotic rounding strategies `RoundNearestTiesAway` and
`RoundNearestTiesUp` that we won't use.)
Note that these rounding modes are part
of the FPU instruction set so will be (roughly) as fast as the default, `RoundNearest`.

Let's try rounding a `Float64` to a `Float32`.

In [17]:
printbits(1/3)  # 64 bits
printbits(Float32(1/3))  # round to nearest 32-bit

[31m0[0m[32m01111111101[0m[34m0101010101010101010101010101010101010101010101010101[0m[31m0[0m[32m01111101[0m[34m01010101010101010101011[0m

A different rounding mode can be specified for the conversion:

In [18]:
printbits(Float32(1/3, RoundDown))

[31m0[0m[32m01111101[0m[34m01010101010101010101010[0m

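Alternatively, we can change the rounding mode for a chunk of code
using `setrounding`. The following asks `/` to round down. Note that the result shown is nevertheless the round-to-nearest value: the compiler may evaluate a constant expression such as `1f0/3` before the rounding mode is changed: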

In [19]:
setrounding(Float32, RoundDown) do
    1f0/3
end

0.33333334f0

## Arithmetic

A real number may require an infinite number of digits to be represented exactly.  Define the operation that takes a real number to its `Float64` representation as `round`.

The arithmetic operations `+`, `-`, `*`, `/` are defined by the property that they are exact up to rounding.  That is, if `x` and `y` are `Float64`, we have
$$
x\oplus y={\rm round}(x+y)
$$
where in this formula $\oplus$ denotes the floating point definition of `+` and $+$ denotes the mathematical definition of addition.  
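
This property can be checked numerically. The sketch below assumes that the default `BigFloat` precision (256 bits) is enough to hold the sum and product of these particular inputs exactly, so that converting back to `Float64` plays the role of `round`:

```julia
x, y = 1.1, 0.1

exactsum  = big(x) + big(y)      # exact sum of the two Float64 values
exactprod = big(x) * big(y)      # exact product (needs at most 106 bits)

x + y == Float64(exactsum)       # true: floating point + is the rounded exact sum
x * y == Float64(exactprod)      # true: likewise for *
```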

This has some bizarre effects.  For example, `1.1+0.1` gives a different result than `1.2`:

In [20]:
x=1.1
y=0.1
x + y - 1.2 # Not Zero?!?

2.220446049250313e-16

This is because ${\rm round}(1.1)\neq1+1/10$, but rather:

$$
{\rm round}(1.1) = 1 + 2^{-4}+2^{-5} + 2^{-8}+2^{-9}+\cdots + 2^{-48}+2^{-49} + 2^{-51}= {2476979795053773 \over 2251799813685248} = 1.1 +2^{-51} - 2^{-52} - 2^{-53} - 2^{-56} - 2^{-57} - \cdots
$$
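
The exact value stored for `1.1` can be confirmed with `Rational`, which converts a `Float64` without any rounding. This is a small sketch; the comparison with `11//10` measures the representation error exactly:

```julia
r = Rational(1.1)                           # exact value of the Float64 1.1

r == 2476979795053773//2251799813685248     # true: matches the fraction above
r > 11//10                                  # true: the stored value is slightly above 1.1
r - 11//10 == 1//11258999068426240          # true: the error is exactly 1/(5 * 2^51)
```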