# Floating point operations and Round-Off Error



A machine is fundamentally unable to represent some numbers exactly (for example, $\sqrt{2}$). Computers work with floating point representations of numbers, which can introduce several subtle issues that must be handled carefully. 

## Floating point operations 

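For example, $(1011)_2 = 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 11$.

To convert $-27.25$ to binary, we reserve one bit to indicate the sign of the number, represent the integer part 27 as $(11011)_2$ and the fractional part $0.25 = 2^{-2}$ as $(0.01)_2$, so that $27.25 = (11011.01)_2$. 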

Question: Given a positive integer $x$, find its binary representation through an algorithm. 

$$x = 2c_0 + a_0$$
$$c_0 = 2c_1 + a_1$$
$$c_1 = 2c_2 + a_2$$
$$\vdots$$
$$c_{n-1} = 2(0) + a_n$$

where each $a_i \in \{0,1\}$. Then $x$ can be represented as $(a_n a_{n-1} \dots a_1 a_0)_2$. A more compact way to write this algorithm is given below. 

In [None]:
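def to_binary(x):
    """Return the binary digits of a positive integer x as a string."""
    digits = []
    while x > 0:
        x, a = divmod(x, 2)  # x = 2*c + a, with remainder a in {0, 1}
        digits.append(str(a))
    return "".join(reversed(digits))  # most significant digit first

print(to_binary(11))  # 1011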

Now consider a base-$\beta$ representation where $\beta \geq 2$, $x = (a_n a_{n-1} \dots a_1 a_0 \, . \, b_1 b_2 \dots b_m)_{\beta}$. This means the number $x$ is given by:

$$ x = a_n \beta^n + a_{n-1} \beta^{n - 1} + \dots + a_{1} \beta + a_0 + b_1 \beta^{-1} + b_2 \beta^{-2} + \dots + b_m \beta^{-m} $$
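As a rough illustration of the expansion above, the sketch below evaluates a digit string in base $\beta$. The function name `from_base` and the way the digits are passed in are assumptions made for this example, not notation from these notes.

```python
def from_base(int_digits, frac_digits, beta):
    """Value of (a_n ... a_0 . b_1 ... b_m) in base beta.

    int_digits  lists a_n, ..., a_0 (most significant first).
    frac_digits lists b_1, ..., b_m.
    """
    value = 0.0
    # Integer part: a_n * beta^n + ... + a_1 * beta + a_0
    for a in int_digits:
        value = value * beta + a
    # Fractional part: b_1 * beta^-1 + ... + b_m * beta^-m
    for k, b in enumerate(frac_digits, start=1):
        value += b * beta ** (-k)
    return value


# 27.25 = (11011.01)_2
print(from_base([1, 1, 0, 1, 1], [0, 1], 2))  # 27.25
```

The result is itself stored as a floating point number, so for bases and digit strings that are not exactly representable in binary the returned value is only an approximation.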

Consider the polynomial $p_3(x) = 4x^3 + 0.2x^2 - 2.1x + 5$. To evaluate this polynomial term by term at a specific point, we have to perform three additions and six multiplications, so nine floating point operations in total. Now suppose the polynomial is written equivalently as $((4x + 0.2)x - 2.1)x + 5$. We now have to perform three additions and three multiplications, for a total of six floating point operations. The nested formulation clearly takes fewer operations to evaluate at a specific point, and for $p_n(x)$ with large $n$ the savings can be very significant.


$$p_n(x) = a_n x^{n} + a_{n-1} x^{n-1} + \dots + a_1 x + a_0 = (\dots((a_n x + a_{n-1})x + a_{n-2})x + \dots + a_1)x + a_0$$


In the first representation, we have to perform $n + (n-1) + \dots + 1 = n(n+1)/2$ multiplications, which is $O(n^2)$. In the second representation, we only need to perform $n$ multiplications, so the cost is $O(n)$.
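The multiplication counts can be checked with a small sketch. The nested form is commonly known as Horner's method; the function names `eval_naive` and `eval_nested`, and the choice to pass coefficients as a list ordered $a_0, \dots, a_n$, are assumptions made for this illustration.

```python
def eval_naive(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) term by term, counting multiplications."""
    total, mults = 0.0, 0
    for k, a in enumerate(coeffs):
        term = a
        for _ in range(k):  # k multiplications for the degree-k term
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_nested(coeffs, x):
    """Evaluate the same polynomial in nested form, counting multiplications."""
    result, mults = coeffs[-1], 0
    for a in reversed(coeffs[:-1]):
        result = result * x + a  # one multiplication per remaining coefficient
        mults += 1
    return result, mults


# p_3(x) = 4x^3 + 0.2x^2 - 2.1x + 5, coefficients listed as a_0, ..., a_n
p3 = [5, -2.1, 0.2, 4]
print(eval_naive(p3, 2.0))   # (value, number of multiplications)
print(eval_nested(p3, 2.0))
```

For $p_3$ the naive version reports six multiplications and the nested version three, in line with the counts above. The two computed values may differ in the last few digits, because the operations are carried out in a different order.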

Important: No computer can express all real numbers exactly. A computer represents a number in the following form, where $\sigma$ is the sign, the $a_i$ are the digits, $\beta$ is the base, and $f$ is the exponent.

$$x = \sigma \, ( . a_1 a_2 \dots a_f a_{f+1})_\beta \cdot \beta^f$$

For example, take $x = 0.1347864213 \times 10^7$, which has ten digits, and suppose that, due to hardware constraints, our architecture can only store 9 digits after the decimal point. A natural choice is to truncate, dropping the 3 at the very end of the sequence. Another option is to round, adjusting the ninth digit according to the value of the tenth. Here both choices give $0.134786421 \times 10^7$, since the dropped digit is less than 5. Most computers perform the round-off operation. The main idea is that these operations, undertaken to accommodate the hardware, introduce some error into the representation. 
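The difference between truncation and rounding can be mimicked with Python's standard `decimal` module. This is only a sketch of a hypothetical 9-digit decimal machine: the helper name `store` and the second test value (ending in 7 instead of 3) are assumptions added for illustration, and real hardware works in binary rather than decimal.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP


def store(text, rounding):
    """Keep 9 significant digits of the number written in `text`."""
    return Context(prec=9, rounding=rounding).create_decimal(text)


for text in ["0.1347864213E7", "0.1347864217E7"]:
    chopped = store(text, ROUND_DOWN)     # truncation: drop the tenth digit
    rounded = store(text, ROUND_HALF_UP)  # round-off: adjust the ninth digit
    print(text, chopped, rounded)
```

For the first value the two results agree, because the dropped digit is 3. For the second value the dropped digit is 7, so rounding increases the ninth digit while truncation leaves it unchanged.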


## Error

Let $x_T$ be the true value and $x_A$ the approximate value. There are two ways to measure error: absolute and relative. Relative error is useful for understanding the error in the context of the scale of $x_T$. 

$$E_{\text{abs}}(x_A) = | x_T - x_A |$$
$$E_{\text{rel}}(x_A) = \frac{| x_T - x_A |}{| x_T |}$$
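The two definitions translate directly into code. In the sketch below, the function names `abs_error` and `rel_error` and the approximation $1.414$ for $\sqrt{2}$ are assumptions chosen for illustration.

```python
import math


def abs_error(x_true, x_approx):
    """Absolute error |x_T - x_A|."""
    return abs(x_true - x_approx)


def rel_error(x_true, x_approx):
    """Relative error |x_T - x_A| / |x_T| (undefined when x_T is zero)."""
    return abs(x_true - x_approx) / abs(x_true)


x_true = math.sqrt(2)  # itself only a floating point approximation of sqrt(2)
x_approx = 1.414

print(abs_error(x_true, x_approx))
print(rel_error(x_true, x_approx))

# Scaling both values changes the absolute error but not the scale-aware one.
print(abs_error(1e6 * x_true, 1e6 * x_approx))
print(rel_error(1e6 * x_true, 1e6 * x_approx))
```

The last two lines show why relative error is useful: multiplying both values by $10^6$ makes the absolute error roughly $10^6$ times larger, while the relative error stays essentially the same.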

Given $x_T = x_A + \nu$, $y_T = y_A + \epsilon$ where $|\nu|, |\epsilon| \ll 1$

Case a: 

$$x_T y_T - x_A y_A$$
$$= (x_A + \nu)(y_A + \epsilon) - x_A y_A$$
$$= x_A \epsilon + y_A \nu + \epsilon \nu \approx x_A \epsilon + y_A \nu$$

The product $\epsilon \nu$ is dropped because it is negligible compared with the first-order terms. Based on this, we can conclude that $E_{\text{abs}}(x_A y_A)$ is $O(|\nu| + |\epsilon|)$. 


Similarly, 
$$E_{\text{rel}}(x_A y_A) = \left|\frac{x_T y_T - x_A y_A}{x_T y_T}\right|$$
$$ = \left|\frac{x_T y_T - (x_T - \nu)(y_T - \epsilon)}{x_T y_T}\right|$$
$$ = \left|\frac{\nu}{x_T} + \frac{\epsilon}{y_T} - \frac{\epsilon \nu}{x_T y_T}\right| \leq E_{\text{rel}}(x_A) + E_{\text{rel}}(y_A) + E_{\text{rel}}(x_A) E_{\text{rel}}(y_A) \approx E_{\text{rel}}(x_A) + E_{\text{rel}}(y_A)$$
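A quick numerical check of the product estimate is sketched below. The sample values of $x_T$, $y_T$, $\nu$ and $\epsilon$ are arbitrary assumptions picked for illustration, and `rel_error` is a helper name introduced here.

```python
def rel_error(x_true, x_approx):
    """Relative error |x_T - x_A| / |x_T|."""
    return abs(x_true - x_approx) / abs(x_true)


# Hypothetical true values and small perturbations nu, eps
x_true, y_true = 3.0, 7.0
nu, eps = 1e-3, 2e-3
x_approx, y_approx = x_true - nu, y_true - eps

rx = rel_error(x_true, x_approx)
ry = rel_error(y_true, y_approx)
r_prod = rel_error(x_true * y_true, x_approx * y_approx)

print(r_prod)        # relative error of the product
print(rx + ry)       # first-order estimate: sum of the relative errors
print(rx * ry)       # second-order term, much smaller than rx + ry
```

With these inputs the relative error of the product is close to the sum of the individual relative errors, and the gap between them is on the order of the small second-order term.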