# Floating-point Numbers

## Definition 1.1.1

The set $\mathbb{F}$ of **floating-point numbers** consists of zero and all numbers of the form

$$\pm (1 + f) \times 2^n$$

where $n$ is an integer called the **exponent**, and $1 + f$ is the **significand**, in which

$$f = \sum_{i = 1}^d b_i 2^{-i}, \space b_i \in \{0, 1\}$$

for a fixed integer $d$ called the binary **precision**.
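To make the definition concrete, here is a minimal sketch that lists the members of $\mathbb{F}$ for a toy precision $d = 2$ and exponents $n \in \{-1, 0, 1\}$. The helper name `significands` is hypothetical and is not part of Julia or of the definition above.

```julia
# All significands 1 + f for precision d: f takes the values k / 2^d, k = 0, ..., 2^d - 1.
# `significands` is a hypothetical helper name used only for illustration.
significands(d) = [1 + k / 2^d for k in 0:2^d - 1]

@show significands(2)    # four equally spaced values in [1, 2)

# Positive members of F with d = 2 and exponents n = -1, 0, 1
@show [s * 2.0^n for n in -1:1 for s in significands(2)]
```

Each exponent contributes the same number of values, $2^d$, so the spacing between neighbouring floating-point numbers doubles every time $n$ increases by one.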

### Note

This definition can be understood by analogy with the scientific notation used in the decimal system, for example

$$1.32 \times 10^{-5} = 0.0000132$$

I will use the decimal number $3.3$ as an example to show how to write a decimal number in the form given in the definition.

**Step 1: Convert the integer part to binary**

In this case, the integer part is $3$, which in binary is

$$3_{10} = 11_2$$
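As a quick check of this step, Julia's built-in `string` and `parse` functions convert between bases. This is only a sanity check, not part of the derivation.

```julia
@show string(3, base = 2)          # binary digits of 3, as a string
@show parse(Int, "11", base = 2)   # and back to a decimal integer
```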

**Step 2: Convert the fractional part to binary**

A number $x$ with $0 \le x < 1$ has the binary representation

$$x = b_1 \times 2^{-1} + b_2 \times 2^{-2} + b_3 \times 2^{-3} + \dots$$

where each $b_i \in \{0, 1\}$ is a digit after the binary point.

Now we want to extract each $b_i$ from $x$. To achieve this, we multiply $x$ by $2$:

$$2x = b_1 + b_2 \times 2^{-1} + b_3 \times 2^{-2} + \dots$$

Observe that $b_1$ is now the integer part of $2x$. We record $b_1$, subtract it, and repeat the same step on the remaining fractional part $2x - b_1$ to obtain $b_2$, and so on.
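The repeated-doubling procedure just described can be sketched in Julia as follows. The function name `fraction_bits` is hypothetical, and the sketch uses exact `Rational` arithmetic (an assumption made here so that rounding in `Float64` does not disturb the digits).

```julia
# Extract the first `nbits` binary digits of a fraction 0 <= x < 1 by repeated doubling.
# `fraction_bits` is a hypothetical helper name used only for illustration.
function fraction_bits(x::Rational, nbits::Integer)
    0 <= x < 1 || throw(ArgumentError("x must satisfy 0 <= x < 1"))
    bits = Int[]
    for _ in 1:nbits
        x *= 2               # shift the binary point one place to the right
        b = floor(Int, x)    # the integer part is the next digit b_i
        push!(bits, b)
        x -= b               # keep only the fractional part and repeat
    end
    return bits
end

@show fraction_bits(3 // 10, 9)
```

For `3 // 10` the first nine digits are `0, 1, 0, 0, 1, 1, 0, 0, 1`.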

Here are the exact steps for the fractional part $0.3$:

$$
0.3 \times 2 = 0.6 \longrightarrow 0 \\
0.6 \times 2 = 1.2 \longrightarrow 1 \\
0.2 \times 2 = 0.4 \longrightarrow 0 \\
0.4 \times 2 = 0.8 \longrightarrow 0 \\
0.8 \times 2 = 1.6 \longrightarrow 1
$$

After the fifth step the fractional part is $0.6$ again, the same value as after the first step, so the digits $1001$ repeat forever. This shows that a finite decimal fraction does not always have a finite binary representation. In this case, the fractional part is

$$0.3_{10} = 0.0\overline{1001}_2$$

**Step 3: Combine the integer and fractional parts**

In this case, the result is

$$11.0\overline{1001}_2$$

**Step 4: Convert into scientific notation**

Moving the binary point one place to the left and keeping $d = 4$ digits after it gives

$$11.0\overline{1001}_2 = 1.10\overline{1001}_2 \times 2^1 \approx 1.1010_2 \times 2^1$$

The significand ($1 + f$) is $1 + 0.1010_2$. The exponent ($n$) is $1$.
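The normalization can be cross-checked with Julia's built-in `exponent` and `significand` functions. The sketch below assumes a toy precision of $d = 4$ and rounds $f$ to the nearest multiple of $2^{-d}$. The variable names are illustrative only.

```julia
x = 3.3
d = 4                                 # toy precision assumed for this example

n = exponent(x)                       # exponent n of the normalized form
s = significand(x)                    # significand 1 + f in [1, 2), at full Float64 precision
f = round((s - 1) * 2^d) / 2^d        # round f to d binary digits
f_bits = string(Int(f * 2^d), base = 2, pad = d)

@show n
@show f_bits
@show (1 + f) * 2.0^n                 # the d = 4 approximation of 3.3
```

This yields the exponent $1$, the fraction bits `1010`, and the approximation $3.25$.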

In [1]:
@show p = 22 / 7

p = 22 / 7 = 3.142857142857143


3.142857142857143

In [2]:
@show float(π)

float(π) = 3.141592653589793


3.141592653589793

In [None]:
acc = abs(π - p)
println("Absolute accuracy = $acc")
println("Relative accuracy = $(acc / π)")
println("Number of accurate digits = $(-log10(acc / π))")

Absolute accuracy = 0.0012644892673496777
Relative accuracy = 0.0004024994347707008
Number of accurate digits = 3.3952347251747166


In [6]:
x = 1.0
y = 2.0
@show bitstring(x)
@show bitstring(y)
@show sign(x), exponent(x), significand(x)
@show sign(y), exponent(y), significand(y)

bitstring(x) = "0011111111110000000000000000000000000000000000000000000000000000"
bitstring(y) = "0100000000000000000000000000000000000000000000000000000000000000"
(sign(x), exponent(x), significand(x)) = (1.0, 0, 1.0)
(sign(y), exponent(y), significand(y)) = (1.0, 1, 1.0)


(1.0, 1, 1.0)
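The bit strings above follow the IEEE 754 double-precision layout that Julia's `Float64` uses: 1 sign bit, 11 exponent bits stored with a bias of 1023, and 52 fraction bits, so $d = 52$. Here is a minimal sketch that splits the string into these fields. The helper `decode_fields` is hypothetical and assumes a normal number (not zero, subnormal, `Inf`, or `NaN`).

```julia
# Split a Float64 bit pattern into sign, exponent n, and fraction f.
# `decode_fields` is a hypothetical helper; it assumes x is a normal number.
function decode_fields(x::Float64)
    bits = bitstring(x)
    sign_bit = bits[1]                               # '0' for +, '1' for -
    n = parse(Int, bits[2:12], base = 2) - 1023      # remove the exponent bias
    f = parse(Int, bits[13:64], base = 2) / 2.0^52   # fraction f with d = 52
    return (sign_bit, n, f)
end

@show decode_fields(1.0)
@show decode_fields(2.0)

# Reassemble 3.3 from its fields using the form (1 + f) * 2^n
sign_bit, n, f = decode_fields(3.3)
@show (1 + f) * 2.0^n == 3.3
```

For `1.0` and `2.0` the fraction is zero and only the exponent field differs, which matches the `exponent` and `significand` values shown above.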