# Floating point operations, Round-Off Error, Arithmetic


## Floating Point Representations: How does a Computer Represent Numbers? 

Computers cannot store most real numbers exactly, such as $\pi$ or $\sqrt{2}$. Instead, they store them as floating point binary numbers with finitely many digits, which introduces round-off or truncation error. First, let's look at how a computer actually represents numbers. Computers store numbers in binary, meaning a non-negative integer $x$ written in decimal is represented as:

$$(a_n \times 2^n) + (a_{n-1} \times 2^{n-1}) + \dots + (a_1 \times 2) + a_0$$

where each $a_i$ is either $0$ or $1$. Consider the problem of algorithmically converting a decimal number to binary. One way to do this is to divide by $2$ repeatedly, recording the remainder $a_i$ and carrying the quotient $c_i$ forward at each step:

$$x = 2c_0 + a_0$$
$$c_0 = 2c_1 + a_1$$
$$c_1 = 2c_2 + a_2$$
$$\vdots$$
$$c_{n-1} = 2(0) + a_n$$

where each $a_i \in \{0,1\}$. Then, $x$ can be represented as $(a_n a_{n-1} \dots a_1 a_0)_2$. The same procedure can be written more compactly as a loop:

$$\text{DecimalToBinary}(x):$$
$$
\begin{aligned}
& \hspace{10px} x = 2c_0 + a_0 \\
&\hspace{10px} \textbf{for } i = 1, \dots, n: \\
&\hspace{10px} \quad c_{i-1} = 2c_i + a_i \\
&\hspace{10px} \quad \textbf{if } c_i = 0 \textbf{ then stop} \\
& \hspace{10px} \textbf{return} \hspace{5px} (a_n a_{n-1} ... a_1 a_0)
\end{aligned}
$$
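The loop above can be sketched in Python as follows. This is a minimal illustration that assumes $x$ is a non-negative integer; the function name `decimal_to_binary` is chosen here for illustration and is not part of the original notes.

```python
def decimal_to_binary(x: int) -> str:
    """Return the binary digits (a_n ... a_1 a_0) of a non-negative integer x."""
    if x < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if x == 0:
        return "0"
    digits = []  # collects a_0, a_1, ..., a_n (least significant digit first)
    c = x
    while c > 0:
        # One step of the recurrence c_{i-1} = 2*c_i + a_i:
        # the quotient is the next c, the remainder is the next binary digit.
        c, a = divmod(c, 2)
        digits.append(str(a))
    # The digits were produced from a_0 upward, so reverse them for printing.
    return "".join(reversed(digits))


print(decimal_to_binary(13))  # 1101
```

Each pass through the loop halves the quotient, so the number of steps grows only logarithmically with $x$.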


While computers use binary, we can create this kind of representation for any base $\beta$ (binary is the case $\beta = 2$). For example, the base-$3$ number $(210.12)_3$ would be the following in decimal:

$$x = (2 \times 3^2) + (1 \times 3^1) + (0 \times 3^0) + (1 \times 3^{-1}) + (2 \times 3^{-2}) = 21 + \tfrac{5}{9} \approx (21.556)_{10}$$
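As a quick check of this kind of expansion, the sketch below evaluates a digit string in an arbitrary base using exact rational arithmetic, so no round-off enters the conversion itself. The helper name `base_to_fraction` and its argument layout (separate lists for integer and fractional digits) are assumptions made for this example.

```python
from fractions import Fraction


def base_to_fraction(int_digits, frac_digits, beta):
    """Exact value of (int_digits . frac_digits) written in base beta."""
    value = Fraction(0)
    # Integer part: most significant digit first.
    for d in int_digits:
        value = value * beta + d
    # Fractional part: digit k after the point has weight beta**(-k).
    for k, d in enumerate(frac_digits, start=1):
        value += Fraction(d, beta**k)
    return value


x = base_to_fraction([2, 1, 0], [1, 2], 3)  # (210.12) in base 3
print(x)         # 194/9
print(float(x))  # approximately 21.5556
```

The exact value $194/9$ has a repeating decimal expansion, so any finite decimal written for it is already an approximation.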


To be more specific, computers represent numbers in the following general form (the exact standard is not relevant to the study of numerical processes, but an example would be the IEEE representation), where $\sigma = \pm 1$ is the sign, the $a_i$ are the $t$ stored digits in base $\beta$, and $e$ is an integer exponent.

$$x = \sigma \, ( . a_1 a_2 \dots a_t)_\beta \hspace{2px} \beta^e$$


Consider $x = 0.1342567823 \times 10^7$. This number has 10 digits after the decimal point, but suppose our specific computer architecture can only store 9 digits. Computer architectures generally handle this in one of two ways: truncating (chopping) the extra digits, or rounding to the nearest representable number. In this case both give the same approximate representation, $x \approx 0.134256782 \times 10^7$, because the discarded digit is 3. This introduces a slight mismatch between the true value we want to represent and the approximate value that the computer's hardware can store. A major part of Numerical Analysis is studying how these errors form, propagate, and scale, as well as how they can be handled.
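Chopping and rounding to 9 digits can be imitated with Python's standard `decimal` module, as a rough stand-in for a 9-digit decimal machine. The second value `y` is a hypothetical number added here only because its tenth digit is large enough for the two strategies to disagree.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

# Two hypothetical 9-digit "machines": one chops, one rounds to nearest.
chop = Context(prec=9, rounding=ROUND_DOWN)
rnd = Context(prec=9, rounding=ROUND_HALF_EVEN)

x = "0.1342567823E7"  # the example from the text (tenth digit is 3)
y = "0.1342567828E7"  # hypothetical value (tenth digit is 8)

# create_decimal applies the context's precision and rounding mode.
print(chop.create_decimal(x), rnd.create_decimal(x))  # 1342567.82 1342567.82
print(chop.create_decimal(y), rnd.create_decimal(y))  # 1342567.82 1342567.83
```

For `x` the two strategies agree, while for `y` they differ in the last stored digit.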

## Absolute and Relative Errors

Let $x_T$ be our true value and let $x_A = x_T - \epsilon$ be our approximated value, so that $\epsilon = x_T - x_A$. The following are two different definitions of the error associated with a given approximation. The absolute error definition is intuitive and simple, but fails to capture the scale of the quantity being measured. For example, the absolute error in the measurement of a country's GDP may be on the order of 1 million, but relative to the size of the GDP this error is far less significant than the absolute error would suggest. Thus, the relative error is a popular choice for many applications.

$$ E_{abs}(x_A) = |x_T - x_A| $$ 
$$ E_{rel}(x_A) = \frac{|x_T - x_A|}{|x_T|} $$
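The two definitions translate directly into code. In the sketch below, the function names and the sample values (a quantity of size $2 \times 10^{12}$ measured with an error of $10^6$, and a quantity of size $2$ measured with an error of $1$) are made up for illustration.

```python
def abs_error(x_true, x_approx):
    """Absolute error |x_T - x_A|."""
    return abs(x_true - x_approx)


def rel_error(x_true, x_approx):
    """Relative error |x_T - x_A| / |x_T|; undefined when x_T is zero."""
    if x_true == 0:
        raise ZeroDivisionError("relative error is undefined for x_true = 0")
    return abs(x_true - x_approx) / abs(x_true)


# Large quantity, large absolute error, tiny relative error.
big_true, big_approx = 2.0e12, 2.0e12 + 1.0e6
# Small quantity, small absolute error, large relative error.
small_true, small_approx = 2.0, 3.0

print(abs_error(big_true, big_approx), rel_error(big_true, big_approx))
print(abs_error(small_true, small_approx), rel_error(small_true, small_approx))
```

The first measurement has the larger absolute error by six orders of magnitude, yet it is by far the more accurate of the two in the relative sense.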

Let us analyze what these errors look like for products, and how they scale. 

1. Absolute Representation

Write $x_T = x_A + \epsilon$ and $y_T = y_A + \delta$. Then

$$x_Ty_T - x_Ay_A = (x_A + \epsilon)(y_A + \delta) - x_A y_A$$
$$ = x_Ay_A + x_A \delta + y_A \epsilon + \epsilon \delta - x_A y_A = x_A \delta + y_A \epsilon + \epsilon \delta \approx x_A \delta + y_A \epsilon$$
$$E_{abs}(x_Ay_A) = O(|x_A| |\delta| + |y_A| |\epsilon|)$$


2. Relative Representation 

$$ E_{rel}(x_A y_A) = \frac{|x_Ty_T - x_Ay_A|}{|x_T y_T|} = \frac{|x_Ty_T - (x_T - \epsilon)(y_T - \delta)|}{|x_T y_T|}$$
$$ = \frac{|x_T y_T - (x_T y_T - \delta x_T - \epsilon y_T + \epsilon \delta)|}{|x_T y_T|} = \frac{|\delta x_T + \epsilon y_T - \epsilon \delta|}{|x_T y_T|}$$
$$= \left| \frac{\epsilon}{x_T} + \frac{\delta}{y_T} - \frac{\epsilon \delta}{x_T y_T} \right|$$
$$ \leq E_{rel}(x_A) + E_{rel}(y_A) + E_{rel}(x_A) E_{rel}(y_A) \approx E_{rel}(x_A) + E_{rel}(y_A)$$

Since the relative error of a product is, to first order, bounded by the sum of the relative errors of its factors, it is the more effective measure of accuracy here: absolute error scales with the magnitudes of $x_A$ and $y_A$, while relative error is independent of scale and gives us a clean, additive bound.
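The product bound can be checked numerically. The sketch below uses exact rational arithmetic so that the check itself is free of round-off; the particular true values and perturbations are arbitrary choices made for this example.

```python
from fractions import Fraction


def rel_error(x_true, x_approx):
    """Relative error |x_T - x_A| / |x_T|."""
    return abs(x_true - x_approx) / abs(x_true)


# Arbitrary true values and small positive perturbations (assumed for illustration).
x_t, y_t = Fraction(3), Fraction(7)
eps, delta = Fraction(1, 1000), Fraction(1, 500)
x_a, y_a = x_t - eps, y_t - delta

e_x = rel_error(x_t, x_a)
e_y = rel_error(y_t, y_a)
e_xy = rel_error(x_t * y_t, x_a * y_a)

print(float(e_x), float(e_y), float(e_xy))
# The relative error of the product stays within the sum of the factors'
# relative errors plus the second-order cross term.
print(e_xy <= e_x + e_y + e_x * e_y)  # True
```

With these values the second-order term `e_x * e_y` is several orders of magnitude smaller than `e_x + e_y`, which is why it is usually dropped.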

## Polynomials: How many operations are needed under different representations? 

Consider the polynomial $p_3(x) = 4x^3 + 0.2x^2 - 2.1x + 5$. In order to evaluate this polynomial at a specific point, we have to perform three addition operations and six multiplication operations (three for $4x^3$, two for $0.2x^2$, and one for $2.1x$), so nine floating point operations in total. Now suppose the polynomial is written equivalently as $((4x + 0.2)x - 2.1)x + 5$. We now have to perform three additions and three multiplications, for a total of six floating point operations. The nested formulation clearly takes fewer operations to evaluate at a specific point, and for $p_n(x)$ with large $n$ the savings can be very significant.


$$p_n(x) = a_n x^{n} + a_{n-1} x^{n-1} + \dots + a_1x + a_0 = (\dots((a_n x + a_{n-1})x + a_{n-2})x + \dots + a_1)x + a_0$$


In the first representation, computing each term $a_k x^k$ from scratch takes $k$ multiplications, so we have to perform $n + (n-1) + \dots + 1 = n(n+1)/2$ multiplications in total, which is $O(n^2)$. In the second representation, we only need to perform $n$ multiplications, so the evaluation is $O(n)$. Both representations use $n$ additions.
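The two operation counts can be observed directly. The sketch below evaluates a polynomial both ways and counts multiplications. The function names and the coefficient ordering (`coeffs[k]` holds $a_k$) are choices made for this example, and the naive version deliberately recomputes each power from scratch to match the count in the text.

```python
def eval_naive(coeffs, x):
    """Evaluate sum(a_k * x**k) term by term; returns (value, multiplications)."""
    total, mults = coeffs[0], 0
    for k in range(1, len(coeffs)):
        term = coeffs[k]
        for _ in range(k):  # k multiplications to form a_k * x^k
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_nested(coeffs, x):
    """Evaluate the nested form ((a_n x + a_{n-1}) x + ...) x + a_0."""
    result, mults = coeffs[-1], 0
    for a in reversed(coeffs[:-1]):
        result = result * x + a  # one multiplication and one addition per step
        mults += 1
    return result, mults


p3 = [5, -2.1, 0.2, 4]  # coefficients a_0, a_1, a_2, a_3 of the example p_3
print(eval_naive(p3, 2.0)[1], eval_nested(p3, 2.0)[1])  # 6 3
```

In floating point arithmetic the two values may differ in their last few digits, since the operations are performed in a different order.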