> Dionysios Rigatos <br />
> dionysir@stud.ntnu.no <br />

# Theory - Numerics exercise 1

#### Using floating point numbers in Python

Computers represent numbers in base 2, as bits. Only a finite number of bits is available, and to keep number representations consistent across programming languages and platforms, standards define how a computer stores numbers internally. To represent real numbers, Python uses double precision floating point numbers (doubles). A double consists of 8 bytes, i.e. 64 bits $b_0,b_1,\dots,b_{63}$, each of which is 0 or 1. They are used to represent a number $a$ in the form
$$\mathrm{fl}(a) = (-1)^{b_{0}} \cdot 2^{e-1023} \cdot 1.b_{1}b_{2}b_{3}\dots b_{52}.$$

Here, the exponent is given as $e = b_{53}b_{54}b_{55}b_{56}b_{57}b_{58}b_{59}b_{60}b_{61}b_{62}b_{63}$, an 11-bit base 2 integer. Note that $e = 00000000000$ and $e = 11111111111$ are reserved: they are interpreted as $0$ and $\infty$ respectively (and also cover subnormal numbers and NaN), so for ordinary numbers $e$ takes values between 1 and 2046. The mantissa $1.b_{1}b_{2}b_{3}\dots b_{52}$ is a real number in base 2 with values in $[1, 2)$:
$$ 1.b_{1}b_{2}b_{3}\dots b_{52} = 1 + \sum_{j = 1}^{52}b_{j}2^{-j}. $$
Since this is a finite sum, only finitely many numbers are at our disposal. This means we cannot represent all of the infinitely many real numbers *exactly* as doubles. We can, however, guarantee a certain *precision*.
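
To make the layout concrete, here is a minimal sketch that splits a Python float into the three fields described above, using only the standard `struct` module. The helper name `decode_double` is made up for this illustration, and the reconstruction at the end assumes a normalized number ($1 \leq e \leq 2046$).

```python
import struct

def decode_double(x):
    """Split a float into (sign bit, biased exponent e, 52-bit fraction as an integer)."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # the sign bit b_0
    e = (bits >> 52) & 0x7FF            # the 11 exponent bits
    fraction = bits & ((1 << 52) - 1)   # the 52 mantissa bits b_1 ... b_52
    return sign, e, fraction

sign, e, fraction = decode_double(1.25)
# Rebuild the value from the formula for fl(a); only valid for normalized numbers.
value = (-1) ** sign * 2.0 ** (e - 1023) * (1 + fraction / 2**52)
print(sign, e, value)  # 0 1023 1.25
```

Dividing `fraction` by `2**52` turns the integer field back into the sum $\sum_j b_j 2^{-j}$ from the formula above.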

An example of a number that can be represented exactly in this format is 1.25. Since $1.25 = 1.01_2 \cdot 2^{0}$, the exponent is $e = 1023$, and the number is stored as

$$b_{0} \quad b_{53}b_{54}b_{55}b_{56}b_{57}b_{58}b_{59}b_{60}b_{61}b_{62}b_{63} \quad b_{1}b_{2}b_{3}\dots b_{52}$$
$$= \quad 0 \quad 0 1 1 1 1 1 1 1 1 1 1 \quad 01\underbrace{00\dots0}_{50 \text{ zeros}}.$$

An example of a number that cannot be represented exactly in this format is 9.4. Since $9.4 = 1.0010110011\dots_2 \cdot 2^{3}$, the exponent is $e = 1026$, and after rounding the mantissa to 52 bits the number is stored as
$$b_{0} \quad b_{53}b_{54}b_{55}b_{56}b_{57}b_{58}b_{59}b_{60}b_{61}b_{62}b_{63} \quad b_{1}b_{2}b_{3}\dots b_{52}$$
$$= \quad 0 \quad 1 0 0 0 0 0 0 0 0 1 0 \quad 0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101.$$

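The infinite binary expansion of 9.4 has to be cut off after $b_{52}$. Rounding to nearest discards the tail, which is worth $0.4 \cdot 2^{-48}$, and rounds the last bit up, which adds $2^{-49}$. If you check closely, you will find that $\mathrm{fl}(9.4) = 9.4 + 2^{-49} - 0.4 \cdot 2^{-48} = 9.4 + (1-0.8) \cdot 2^{-49} = 9.4 + 0.2 \cdot 2^{-49}$, i.e. there is an error of $0.2 \cdot 2^{-49}$.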
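
The size of this error can be checked with exact rational arithmetic. The following is a small sketch using the standard `fractions` module; the variable name `error` is only for illustration.

```python
from fractions import Fraction

# Fraction(9.4) is the exact value of the stored double fl(9.4);
# Fraction("9.4") is the exact decimal number 9.4 = 47/5.
error = Fraction(9.4) - Fraction("9.4")

# Compare with the claimed error 0.2 * 2^-49 = (1/5) * 2^-49.
print(error == Fraction(1, 5) * Fraction(1, 2**49))  # True
print(float(error))
```

Passing the float to `Fraction` converts the stored binary value without any further rounding, while passing the string parses the decimal digits exactly, so the difference is the true representation error.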


In general, one can show that for a floating point number the relative truncation error and the absolute truncation error satisfy, respectively,
$$\frac{|\mathrm{fl}(a) - a|}{|a|} \leq \epsilon_{mach} \qquad \mathrm{ and } \qquad |\mathrm{fl}(a) - a| \leq |a|\epsilon_{mach}$$
where the *machine epsilon* $\epsilon_{mach}$ is the smallest unit of precision in the mantissa. For doubles, machine epsilon is $2^{-52} \approx 2.22 \cdot 10^{-16}$, since the last binary digit in the mantissa is $b_{52}$, corresponding to the value $2^{-52}$. For other floating point standards, other precisions apply. If you are interested, you can read more about this [here](https://en.wikipedia.org/wiki/Floating-point_arithmetic).
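
As a quick sanity check, Python exposes the machine epsilon of its floats as `sys.float_info.epsilon`. The sketch below compares it with $2^{-52}$ and shows what it means in practice; the variable name `eps` is just for illustration.

```python
import sys

eps = sys.float_info.epsilon

print(eps == 2**-52)          # True
print(1.0 + eps > 1.0)        # True: 1 + eps is the next double after 1
print(1.0 + eps / 2 == 1.0)   # True: the sum is rounded back to 1
```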

#### a) How many bits of computer memory are used when storing a double?

64 bits.

#### b) How are these bits distributed among sign, exponent and significand?

1 sign bit

11 exponent bits

52 mantissa bits

#### c) What are, in absolute value, the largest and smallest numbers representable as doubles?

The smallest number in absolute value is 0. The smallest positive number of the normalized form above has e = 1 and all mantissa bits 0, which gives 2^(1-1023) = 2^-1022, approximately 2.2 * 10^-308.

The largest possible value is infinity, but that is not a number.
The largest possible finite value is found as follows:
1. The largest usable exponent is e = 2046, which gives the factor 2^(2046-1023) = 2^1023.
2. The mantissa lies in the range [1, 2) and is largest when all 52 bits are 1, which gives 2 - 2^-52.
3. So, the largest possible number is `M * 2^(e-1023)` = `(2 - 2^-52) * 2^1023` = `2 * (2^0 - 2^-53) * 2^1023` = `(2^0 - 2^-53) * 2^1024` = `2^1024 - 2^971`, approximately 1.8 * 10^308.
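
The derivation can be compared with the limits Python reports in `sys.float_info`. This is a minimal sketch; the variable name `largest` is only for illustration, and `sys.float_info.min` is the smallest positive normalized double, printed here for comparison.

```python
import sys

# Largest mantissa times largest power of two, both exactly representable.
largest = (2 - 2**-52) * 2.0**1023

print(largest == sys.float_info.max)      # True
print(int(largest) == 2**1024 - 2**971)   # True, checked with exact integers
print(largest)                            # 1.7976931348623157e+308
print(sys.float_info.min == 2.0**-1022)   # True
```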

#### d) Convert the below numbers to doubles, fl(a), and comment for each of them whether it is an exact representation or not. 
You do not need to find more than the first eight bits of the significand. You are allowed to use an online "double converter" if you like.
- 0.25
- 4.5
- 0.1

* 0.25 -> 0.01 (binary) -> 1.0 * 2^-2 -> 0 01111111101 00000000...0 (Exact)
* 4.5 -> 100.1 (binary) -> 1.001 * 2^2 -> 0 10000000001 00100000...0 (Exact)
* 0.1 -> 0.00011001... (binary) -> 1.10011001... * 2^-4 -> 0 01111111011 10011001...10011010 (Not exact: the pattern 1001 repeats forever, so the last bits are rounded)
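
These conversions can be double-checked in Python itself instead of an online converter. The sketch below prints the three bit fields and tests exactness with rational arithmetic; the helper name `bit_fields` is made up for this illustration.

```python
import struct
from fractions import Fraction

def bit_fields(x):
    """Return the sign, exponent and mantissa bits of a double as strings."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = format(bits, "064b")
    return s[0], s[1:12], s[12:]

for text in ("0.25", "4.5", "0.1"):
    x = float(text)
    sign, exponent, mantissa = bit_fields(x)
    # The double is exact if its stored value equals the decimal number.
    exact = Fraction(x) == Fraction(text)
    print(text, sign, exponent, mantissa[:8] + "...", "exact" if exact else "not exact")
```

Only the first eight mantissa bits are printed, matching what the exercise asks for; `mantissa` holds all 52 bits if the full pattern is needed.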

#### e) Find the absolute round-off error of the numbers below when represented as a double.

- 3.1415
- 6.022140857*10^23
- 0.8*10^(-10)

With round-to-nearest, the relative rounding error when converting to a double is at most 1/2 * eMach, where eMach is 2^-52 for double precision.
The absolute rounding error for a is therefore at most 1/2 * eMach * |a|, and this upper bound is what is computed below.

In [42]:
EMACH = 2**-52
EMACH

2.220446049250313e-16

In [43]:
def absolute_rounding_error(a):
    # Upper bound on |fl(a) - a| with round-to-nearest: (eMach / 2) * |a|
    return 1/2 * EMACH * abs(a)

In [44]:
print(f"The absolute rounding error of {3.1415} is {absolute_rounding_error(3.1415)}")
print(f"The absolute rounding error of {6.022140857e23} is {absolute_rounding_error(6.022140857e23)}")
print(f"The absolute rounding error of {0.8e-10} is {absolute_rounding_error(0.8e-10)}")

The absolute rounding error of 3.1415 is 3.4877656318599295e-16
The absolute rounding error of 6.022140857e+23 is 66859194.369772725
The absolute rounding error of 8e-11 is 8.881784197001252e-27


Mathematically, the bound is expressed as |fl(a) - a| <= 1/2 * eMach * |a|.
Here, eMach/2 is 2.2204 * 10^-16 / 2 = 1.1102 * 10^-16.

* 3.1415 * 1.1102 * 10^-16 = 3.4877 * 10^-16
* 6.022140857 * 10^23 * 1.1102 * 10^-16 = 6.686 * 10^7
* 0.8 * 10^-10 * 1.1102 * 10^-16 = 8.8818 * 10^-27

We can also compute the corresponding bound for the absolute truncation error (chopping the extra bits instead of rounding to nearest), which omits the 1/2 from the formula.

In [45]:
def absolute_truncation_error(a):
    return EMACH * abs(a)

In [46]:
print(f"The absolute truncation error of {3.1415} is {absolute_truncation_error(3.1415)}")
print(f"The absolute truncation error of {6.022140857e23} is {absolute_truncation_error(6.022140857e23)}")
print(f"The absolute truncation error of {0.8e-10} is {absolute_truncation_error(0.8e-10)}")

The absolute truncation error of 3.1415 is 6.975531263719859e-16
The absolute truncation error of 6.022140857e+23 is 133718388.73954545
The absolute truncation error of 8e-11 is 1.7763568394002504e-26


Mathematically, the bound is expressed as |fl(a) - a| <= eMach * |a|.
Here, eMach is 2.2204 * 10^-16.

* 3.1415 * 2.2204 * 10^-16 = 6.975 * 10^-16
* 6.022140857 * 10^23 * 2.2204 * 10^-16 = 1.337 * 10^8
* 0.8 * 10^-10 * 2.2204 * 10^-16 = 1.776 * 10^-26
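
The values above are upper bounds rather than the errors themselves. As a possible complement, the actual error |fl(a) - a| of each number can be computed exactly with the standard `fractions` module and compared with the rounding bound. This is a sketch; the name `EMACH_EXACT` is introduced here only to keep the arithmetic exact and is not part of the cells above.

```python
from fractions import Fraction

EMACH_EXACT = Fraction(1, 2**52)  # machine epsilon as an exact rational

for text in ("3.1415", "6.022140857e23", "0.8e-10"):
    a = Fraction(text)             # the exact decimal number
    fl_a = Fraction(float(text))   # the exact value of the stored double
    actual = abs(fl_a - a)
    bound = EMACH_EXACT / 2 * abs(a)
    print(f"a = {text}: actual error {float(actual):.3e}, bound {float(bound):.3e}")
    # With round-to-nearest the actual error never exceeds the bound.
    assert actual <= bound
```

The actual error depends on how close each number happens to lie to a representable double, so it can be much smaller than the bound.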