# Floating point arithmetic

## Limitations of Digital Representations

In order to complete computations in finite space and bounded time, we replace the real numbers with a surrogate finite set $\mathbb{F}$, the floating point numbers. (The term "floating point" originally distinguished this system from "fixed point", an early alternative based on absolute errors rather than relative errors.) Most scientific computing now conforms to the IEEE 754 double precision standard.

In double precision, each member of $\mathbb{F}$ is represented using 64 bits.

In [5]:
one = bitstring(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

## Floating Point Numbers

These bits define three integers $s$, $e$, and $m$ used in the representation

$$ x = (-1)^s \cdot \left( 1 + 2^{-52}m \right) \cdot 2^e.$$

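Here $s\in\{0,1\}$ requires one bit, $e\in\{-1022,\ldots,1023\}$ requires 11 bits, and $m\in\{0,1,\ldots,2^{52}-1\}$ requires 52 bits. The exponent is stored as the nonnegative integer $e+1023$; the two remaining 11-bit patterns are reserved for special values. We can dissect a double precision number to see these parts.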

In [7]:
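function ieee(x::Float64)
    b = bitstring(x)                                    # 64-character string of 0s and 1s
    s = (b[1:1], parse(Int, b[1:1]))                    # sign bit
    e = (b[2:12], parse(Int, b[2:12], base=2) - 1023)   # exponent, with the stored offset of 1023 removed
    m = (b[13:64], parse(Int, b[13:64], base=2))        # the integer m
    return s, e, m
end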

ieee(1.0)

(("0",0),("01111111111",0),("0000000000000000000000000000000000000000000000000000",0))
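
As a cross-check on the representation formula, the following sketch rebuilds a value from its three fields. It extracts the fields with integer operations instead of string parsing. The helper name `reconstruct` is hypothetical, and the code assumes a normalized input: it does not handle zero, denormalized numbers, Inf, or NaN.

```julia
# Hypothetical helper: rebuild a normalized Float64 from its fields s, e, m.
# Assumes x is not zero, denormalized, Inf, or NaN.
function reconstruct(x::Float64)
    u = reinterpret(UInt64, x)            # the raw 64 bits as an unsigned integer
    s = Int(u >> 63)                      # 1 sign bit
    e = Int((u >> 52) & 0x7ff) - 1023     # 11 exponent bits, stored with an offset of 1023
    m = Int(u & 0x000fffffffffffff)       # 52 bits holding the integer m
    return (-1.0)^s * (1 + m * 2.0^(-52)) * 2.0^e
end

@show reconstruct(1.0) == 1.0
@show reconstruct(-0.1) == -0.1
@show reconstruct(1.0 + eps()) == 1.0 + eps()
```

Every operation in the return expression is exact for normalized inputs, so each comparison should report `true`.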

The next element of $\mathbb{F}$ after 1 is $1+\epsilon_M$, for machine epsilon $\epsilon_M=2^{-52}$.

In [8]:
@show eps()
ieee(1.0+eps())

eps() = 2.220446049250313e-16


(("0",0),("01111111111",0),("0000000000000000000000000000000000000000000000000001",1))
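
A few comparisons make the role of $\epsilon_M$ concrete. This is a minimal sketch that uses only the built-in functions `eps` and `nextfloat`; the variable name `epsM` is chosen just for illustration.

```julia
epsM = eps(Float64)                    # machine epsilon for double precision

@show epsM == 2.0^(-52)
@show nextfloat(1.0) == 1.0 + epsM     # 1 + epsM is the next element of F after 1
@show 1.0 + epsM/2 == 1.0              # a tie between 1 and 1 + epsM rounds back to 1
@show 1.0 + 0.75*epsM == 1.0 + epsM    # closer to 1 + epsM, so the sum rounds up
```

Each line should report `true`: a real number strictly between 1 and $1+\epsilon_M$ cannot be stored and is rounded to one of its two neighbors.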

There are $2^{52}$ elements of $\mathbb{F}$ equally spaced throughout $[1,2)$. After these, the exponent increases to 1 and the value of $m$ resets to zero. Thus there are also $2^{52}$ elements equally spaced throughout $[2,4)$, as well as $[4,8)$, $[1/2,1)$, and in general, $[2^{e},2^{e+1})$, where the spacing is $2^{e}\epsilon_M$. For any real number $x$, let $\text{fl}(x)$ be the element of $\mathbb{F}$ nearest to $x$. If $|x|$ lies in $[2^{e},2^{e+1})$, then $x$ is at most half a spacing away from $\text{fl}(x)$, and $|x|\ge 2^{e}$. Consequently, if we momentarily ignore the bounds on the exponent $e$,

$$ \frac{\bigl|x-\text{fl}(x)\bigr|}{\bigl|x\bigr|} \le \frac{1}{2} \epsilon_M.$$
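
The spacing and the error bound can both be checked directly. The sketch below uses `eps(x)`, which returns the spacing of $\mathbb{F}$ at `x`, and then holds $x=1/3$ as an exact rational to measure its rounding error without further rounding. The variable names are illustrative, and the example relies on IEEE division being correctly rounded, so that `1 / 3` in Float64 arithmetic equals $\text{fl}(1/3)$.

```julia
# The spacing of F within [2^e, 2^(e+1)) is 2^e * eps().
@show eps(1.0) == eps()
@show eps(2.0) == 2 * eps()
@show eps(0.5) == eps() / 2

x_exact = big(1) // 3                    # the real number 1/3 as an exact rational
fl_x = Rational{BigInt}(1 / 3)           # fl(1/3), converted to a rational without error
rel_err = abs(x_exact - fl_x) / abs(x_exact)

@show Float64(rel_err)                   # relative error, shown as a double for readability
@show rel_err <= Rational{BigInt}(eps() / 2)
```

Working in `Rational{BigInt}` keeps the comparison exact, so the final line tests the inequality itself and not a rounded version of it.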

In practice, the exponent is bounded, so $\mathbb{F}$ has a largest finite value:

In [4]:
@show R = (2.0^1023)*(1 + (2^52-1)/2^52);
ieee(R)

R = 2.0 ^ 1023 * (1 + (2 ^ 52 - 1) / 2 ^ 52) = 1.7976931348623157e308


(("0",0),("11111111110",1023),("1111111111111111111111111111111111111111111111111111",4503599627370495))

Results that should be larger than this become the special value Inf; this situation is called _overflow_.

In [5]:
nextfloat(R)

Inf
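
Overflow can be reproduced with ordinary arithmetic as well. The following sketch assumes a recent Julia version that provides the built-in `floatmax`; it rebinds `R` to that value, which should coincide with the expression computed earlier.

```julia
R = floatmax(Float64)                     # built-in name for the largest finite double

@show R == 2.0^1023 * (2 - 2.0^(-52))     # 1 + (2^52 - 1)/2^52 equals 2 - 2^(-52)
@show 2 * R                               # overflows to Inf
@show R + eps(R)                          # one more spacing past R also overflows
@show R + 1.0 == R                        # 1.0 is negligible next to the spacing eps(R)
```

The last line is a reminder that the spacing near `R` is enormous in absolute terms, even though the relative spacing is still $\epsilon_M$.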

The analogous situation near zero, in which a nonzero result is too small in magnitude to be represented, is called _underflow_. Ignoring some special "denormalized" numbers, the smallest positive element of $\mathbb{F}$ is

In [7]:
@show r = 2.0^-1022;
ieee(r)

r = 2.0 ^ -1022 = 2.2250738585072014e-308


(("0",0),("00000000001",-1022),("0000000000000000000000000000000000000000000000000000",0))

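Note that this minimum value is far smaller than $\epsilon_M$. Machine epsilon is the spacing between 1 and the next element of $\mathbb{F}$, so it measures relative precision; it is not the smallest positive number.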
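
The "denormalized" numbers set aside above fill the gap between zero and this minimum. The sketch below, which assumes the built-in functions `floatmin`, `issubnormal`, and `nextfloat` of a recent Julia version, illustrates that results below $2^{-1022}$ are still representable for a while before they finally underflow to zero.

```julia
r = floatmin(Float64)              # smallest positive normalized double

@show r == 2.0^(-1022)
@show issubnormal(r / 2)           # r/2 is still nonzero, but denormalized
@show nextfloat(0.0)               # smallest positive denormalized number
@show nextfloat(0.0) / 2 == 0.0    # nothing smaller is representable: underflow to zero
@show r < eps()                    # the minimum value is far below machine epsilon
```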