# 2.1: The Essentials

This section covers the essentials of roundoff error.

## Computer Representation of Numbers

> **FACT:** Any nonzero number $x \in \mathbb{R}$ can be represented in binary by:
> $$
     x = \pm (1.d_1d_2d_3\cdots) \cdot 2^e
  $$
> 
> where $e$ is an integer exponent (DON'T confuse it with the `exp` function), each digit $d_i$ is $0$ or $1$, and $(1.d_1d_2d_3\cdots)$ is called the mantissa.
>
> The value of the mantissa can be converted to denary (base-10) by:
> $$
    1 + \frac{d_1}{2} + \frac{d_2}{2^2} + \frac{d_3}{2^3} + \cdots
  $$
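
To see this representation in practice, here is a minimal MATLAB sketch that recovers the sign, mantissa, and first few binary digits of a sample value. It relies on the two-output form of `log2`, which returns $f \in [0.5, 1)$ and an integer $p$ with $|x| = f \cdot 2^p$. The sample value $x = -3.25$ and the variable names are illustrative choices, not part of the notes.

```matlab
% Decompose a nonzero double x as sign * m * 2^e with 1 <= m < 2
x = -3.25;                     % sample value (assumption)

[f, p] = log2(abs(x));         % abs(x) = f * 2^p with 0.5 <= f < 1
m = 2 * f;                     % rescale so that 1 <= m < 2
e = p - 1;                     % compensate in the exponent

% Extract the first 8 binary digits d_1, ..., d_8 of the mantissa
d = zeros(1, 8);
r = m - 1;                     % fractional part 0.d1d2d3...
for i = 1:8
    r = 2 * r;
    d(i) = floor(r);           % next binary digit
    r = r - d(i);
end

if x < 0, s = '-'; else, s = '+'; end
fprintf('x = %s(1.%s) * 2^%d, mantissa value %.6f\n', s, sprintf('%d', d), e, m);
```

For $x = -3.25$ this gives the mantissa $1.625$, the exponent $e = 1$, and the digits $10100000$.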
  

> #### Example 1: Binary Conversion
> 
> a)
> $$
    \begin{aligned}
        x &= -(1.101000\cdots) \cdot 2^1\\
        &= -(1 + \frac{1}{2}+\frac{1}{8}) \cdot 2\\
        &= -3.25
    \end{aligned}
  $$
>  
> b)
> $$
    \begin{aligned}
        x &= -(1.00110100\cdots) \cdot 2^6\\
        &= -(1 + \frac{1}{8}+\frac{1}{16}+\frac{1}{64}) \cdot 64\\
        &= -77
    \end{aligned}
  $$
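
The two conversions can be checked numerically. The sketch below uses a hypothetical helper, `mantissa`, that sums $1 + \sum_i d_i / 2^i$ for a vector of binary digits; the helper name and the digit vectors (the nonzero leading digits of each example) are assumptions made for illustration.

```matlab
% Hypothetical helper: value of 1.d1d2d3... for a vector d of 0/1 digits
mantissa = @(d) 1 + sum(d ./ 2.^(1:numel(d)));

xa = -mantissa([1 0 1]) * 2^1;          % part a): -(1.101) * 2^1
xb = -mantissa([0 0 1 1 0 1]) * 2^6;    % part b): -(1.001101) * 2^6

fprintf('a) x = %g\nb) x = %g\n', xa, xb);
```

All of the terms are exact in binary, so the printed values are $-3.25$ and $-77$.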
  


## Floating Point Representation

To store a number on a computer with finite precision (that is, a finite number of digits), we convert it to its **floating point representation** $f\ell(x) \approx x$, the approximation of the number that is actually held in memory. Keeping $t$ binary digits in the mantissa gives $f\ell(x) = \text{ sign}(x) \cdot (1.\tilde{d_1}\tilde{d_2}\tilde{d_3}\cdots\tilde{d_t}) \cdot 2^e$, where the exponent is restricted to $L \le e \le U$ for fixed integer bounds $L$ and $U$.

The relative rounding error of the approximation $f\ell(x)\approx x$ is quantified by $\frac{\left|f\ell(x) - x\right|}{|x|}$ (for $x \ne 0$).
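
As a rough illustration, the sketch below builds a toy $f\ell(x)$ by rounding the mantissa to $t$ binary digits and then measures the relative rounding error. This is only a model under stated assumptions: the values $x = 0.1$ and $t = 10$ are arbitrary, the exponent limits $L$ and $U$ are ignored, and `round` breaks ties away from zero rather than to even as IEEE hardware does.

```matlab
% Toy model of fl(x): keep t binary digits after the leading 1
x = 0.1;  t = 10;                  % sample value and mantissa length (assumptions)

[f, p] = log2(abs(x));             % abs(x) = f * 2^p with 0.5 <= f < 1
m = 2 * f;  e = p - 1;             % normalized form: abs(x) = m * 2^e, 1 <= m < 2

m_t = round(m * 2^t) / 2^t;        % mantissa rounded to t binary digits
flx = sign(x) * m_t * 2^e;         % toy floating point representation

rel_err = abs(flx - x) / abs(x);   % relative rounding error
fprintf('fl(x)          = %.13f\n', flx);
fprintf('relative error = %.3e\n', rel_err);
fprintf('0.5 * 2^(-t)   = %.3e\n', 0.5 * 2^(-t));
```

Rounding changes the mantissa by at most $\frac{1}{2} \cdot 2^{-t}$, and the mantissa is at least $1$, so the relative error printed here stays below $\frac{1}{2} \cdot 2^{-t}$.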

## The IEEE Standard

The IEEE standard for floating point numbers guarantees (with the default round-to-nearest mode) that $\frac{|f\ell(x) - x|}{|x|} \le \frac{1}{2} \cdot 2^{-t} = 2^{-(t+1)} = \eta = \epsilon_{mach}$. This upper bound on the relative rounding error is called the rounding unit, machine epsilon, or unit roundoff.

> **IEEE STANDARD:** Double Precision
>
> In a 64-bit block of computer memory, 1 bit is used for the sign of the number, 11 bits are used for the exponent, and the remaining 52 bits are used for the mantissa.

So, with $t = 52$, $\eta = 2^{-53} \approx 1.1 \cdot 10^{-16}$. This is the best relative accuracy we can expect when a number is stored in double precision, roughly 16 significant decimal digits.
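
These values can be checked in MATLAB. One caveat on terminology: the built-in `eps` returns the spacing between $1$ and the next larger double, $2^{-52}$, which is twice the rounding unit $\eta$ as defined in these notes. The sketch below assumes that convention and halves `eps` accordingly; the variable names are illustrative.

```matlab
% Rounding unit eta = 2^-(t+1), obtained from MATLAB's eps = 2^-t
eta_double = eps / 2;              % t = 52, so eta = 2^-53
eta_single = eps('single') / 2;    % t = 23, so eta = 2^-24

fprintf('double: eta = %.4e\n', eta_double);
fprintf('single: eta = %.4e\n', eta_single);

% 1 + eta rounds back to 1, while 1 + 2*eta is the next double after 1
fprintf('1 + eta   == 1 ? %d\n', (1 + eta_double) == 1);
fprintf('1 + 2*eta == 1 ? %d\n', (1 + 2*eta_double) == 1);
```

The first comparison is true and the second is false, which is one way to see that perturbations of relative size $\eta$ or smaller are lost when a number near $1$ is stored.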

## Problems That Can Arise with Floating Point Numbers

* undefined operations ($\frac{0}{0}, 0^0$); in IEEE arithmetic $\frac{0}{0}$ evaluates to `NaN`
* overflow/underflow (numbers whose magnitude is too big/too small to store in memory)
    * In the IEEE standard for double precision, normalized numbers have $-1022 \le e \le 1023$, so underflow occurs for $e < -1022$ and overflow for $e > 1023$
* cancellation error: the large relative error that results when computing $z=x-y$ where $x\approx y$
    * the relative error gets big because the rounding errors already present in $x$ and $y$ are divided by $|z| = |x-y|$, so the denominator is very small
    * a common place where this happens: approximating derivatives
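
Each of these problems is easy to provoke. The sketch below is illustrative only: the test function $f(x) = e^x$, the point $x_0 = 1$, and the list of step sizes are assumptions chosen because the exact derivative is known. It approximates $f'(x_0)$ with the forward difference $\frac{f(x_0+h) - f(x_0)}{h}$, which subtracts two nearly equal numbers when $h$ is small.

```matlab
% Undefined operation: IEEE arithmetic returns NaN
nan_val = 0/0;

% Overflow and underflow in double precision
big   = realmax * 2;       % Inf: magnitude too large to store
small = realmin / 2^60;    % 0: magnitude too small to store
fprintf('0/0 = %g, realmax*2 = %g, realmin/2^60 = %g\n', nan_val, big, small);

% Cancellation: forward difference approximation of f'(x0)
f = @exp;  x0 = 1;  exact = exp(1);          % sample function (assumption)
h = 10.^(-(1:2:15))';                        % step sizes 1e-1, 1e-3, ..., 1e-15
approx  = (f(x0 + h) - f(x0)) ./ h;          % subtracts two nearly equal numbers
rel_err = abs(approx - exact) ./ exact;      % relative error of the approximation

fprintf('%8.0e   %.3e\n', [h, rel_err]');    % one row per step size
```

In exact arithmetic the error would keep shrinking in proportion to $h$. In double precision it typically bottoms out for $h$ somewhere around $10^{-8}$ and then grows again, because the rounding error in the numerator is divided by an ever smaller $h$.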

## A Script Analyzing the Behavior of Roundoff Error

Let's evaluate the function $f(t) = e^{-t}\cos(10t)$ at 500 evenly spaced points $t$ on $[0,1]$ in double precision, round the values to single precision, and then compare the two approximations.

In [None]:
%%%%%%%%%%%% SETTING UP OUR VARIABLES %%%%%%%%%%%%%%%%
t = linspace(0,1,500); % Domain discretization
fd = exp(-t) .* cos(10*t); % function evaluated in double precision (DEFAULT)
fs = single(fd); % the same values rounded to single precision

%%%%%%%% COMPUTE THE RELATIVE ERROR %%%%%%%%%%%%%%%
% No abs because we want to observe the sign change, so this is the signed relative error.
% fs is cast back to double so that the subtraction is carried out in double precision.
err = (fd - double(fs)) ./ fd;

%%%%%%%%%% PLOTTING THE ERROR %%%%%%%%%%%%%%%%%
figure
plot(t, err, '*-')
xlabel('t')
ylabel('relative error')
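
As a sanity check on the script above, the relative error of rounding to single precision should never exceed the single precision rounding unit $2^{-24}$. The sketch below repeats the setup so that it runs on its own. The name `eta_single` is an illustrative choice, and `fs` is converted back to double before subtracting so that the difference is evaluated in double precision.

```matlab
% Recompute the signed relative error from the script above
t   = linspace(0, 1, 500);
fd  = exp(-t) .* cos(10*t);              % double precision values
fs  = single(fd);                        % rounded to single precision
err = (fd - double(fs)) ./ fd;           % signed relative error, in double

% Compare against the rounding unit for single precision (t = 23)
eta_single = double(eps('single')) / 2;  % 2^-24, about 5.96e-08
fprintf('max |err| = %.3e, eta = %.3e\n', max(abs(err)), eta_single);
fprintf('bound satisfied at every point? %d\n', all(abs(err) <= eta_single));
```

The plotted errors should therefore fill a band between roughly $-6 \cdot 10^{-8}$ and $6 \cdot 10^{-8}$, with no visible pattern in the sign.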