# 2 - Floating Point Numbers and Arithmetic

## 1 - Scientific Notation

In scientific notation, a number is written as a **mantissa** multiplied by a power of 10. For example, $2024$ can be written as $2.024 \times 10^3$, where $2.024$ is the mantissa.
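As a small illustration of the decomposition above, the following sketch splits a number into a mantissa and a power of 10 using Python's built-in `e` format. The helper name `split_scientific` and the default of three digits after the point are assumptions made for this example only.

```python
def split_scientific(x, digits=3):
    """Return (mantissa, exponent) with x approximately mantissa * 10**exponent."""
    # The 'e' format writes x with one digit before the point, e.g. '2.024e+03'.
    text = f"{x:.{digits}e}"
    mantissa_text, exponent_text = text.split("e")
    return float(mantissa_text), int(exponent_text)


print(split_scientific(2024))  # (2.024, 3)
```

The mantissa is rounded to the requested number of digits, so the product only approximates $x$ when $x$ has more significant digits than that.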

## 2 - Binary Representation of Real Numbers

For any positive integer $n$, its representation in base 2 is denoted by

$$ (a_k a_{k-1} \dots a_0)_2 $$

provided that $a_k 2^k + a_{k-1} 2^{k-1} + \dots + a_0 = n$, where each digit $a_i \in \{0,1\}$.
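The digits $a_k, \dots, a_0$ can be found by repeated division by 2. The sketch below is one possible way to do this; the function names `binary_digits` and `from_binary_digits` are hypothetical and chosen only for this example.

```python
def binary_digits(n):
    """Return the list [a_k, ..., a_0] of base-2 digits of a positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    digits = []
    while n > 0:
        digits.append(n % 2)  # the remainder is the next digit, starting with a_0
        n //= 2
    return digits[::-1]       # reverse so that a_k comes first


def from_binary_digits(digits):
    """Evaluate a_k 2^k + ... + a_0 for digits given as [a_k, ..., a_0]."""
    return sum(a * 2**i for i, a in enumerate(reversed(digits)))


digits = binary_digits(2024)
print(digits)                      # [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
print(from_binary_digits(digits))  # 2024
```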

A positive real number $x$ has a finite binary representation (fixed-point) denoted by:

$$ (a_k a_{k-1} \dots a_0 . b_1 \dots b_m)_2 $$

if

$$ x = \sum^k_{i=0} a_i 2^i + \sum^m_{j=1} b_j 2^{-j} $$
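To make the two sums concrete, here is a minimal sketch that evaluates a fixed-point binary string exactly using rational arithmetic. The helper `fixed_point_value` and its argument layout (two digit lists) are assumptions for this example.

```python
from fractions import Fraction


def fixed_point_value(integer_digits, fraction_digits):
    """Evaluate (a_k ... a_0 . b_1 ... b_m)_2 exactly.

    integer_digits is [a_k, ..., a_0] and fraction_digits is [b_1, ..., b_m].
    """
    k = len(integer_digits) - 1
    # First sum: a_i 2^i for i = 0..k (the list stores a_k first).
    integer_part = sum(a * 2 ** (k - idx) for idx, a in enumerate(integer_digits))
    # Second sum: b_j 2^{-j} for j = 1..m.
    fraction_part = sum(Fraction(b, 2**j) for j, b in enumerate(fraction_digits, start=1))
    return integer_part + fraction_part


value = fixed_point_value([1, 0, 1], [1, 1])  # (101.11)_2
print(value, float(value))                    # 23/4 5.75
```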

**Theorem 1**:

Any real number $x \in [1,2)$ has a binary representation with finitely or infinitely many digits, of the form

$$ x = (1.b_1b_2\dots b_n \dots)_2 $$

where $b_i$ is either $0$ or $1$.

**Proof**:

For $x \in [1,2)$, we have $0 \leq x-1 < 1$, which implies $0 \leq 2(x-1) < 2$. Let $b_1 = \lfloor 2(x-1) \rfloor$ be the integer part of $2(x-1)$, so that $b_1 \in \{0,1\}$. By the definition of the integer part, we have

$$ 0 \leq 2(x-1) - b_1 < 1 $$

and hence

$$ 0 \leq 2(2(x-1) - b_1) = 2^2 (x-1-\frac{b_1}{2}) < 2 $$

Now define $b_2 = \lfloor 2^2 (x-1-\frac{b_1}{2}) \rfloor$. As before, $b_2 \in \{0,1\}$ and

$$ 0 \leq 2^2 (x-1-\frac{b_1}{2}) - b_2 < 1 $$

By induction, we can define

$$ b_{n+1} = \lfloor 2^{n+1} (x - 1 - \sum^n_{i=1} b_i 2^{-i}) \rfloor $$

where $b_{n+1} \in \{0,1\}$, and obtain a similar estimate

$$ 0 \leq 2^{n+1} (x - 1 - \sum^n_{i=1} b_i 2^{-i}) - b_{n+1} < 1 $$

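Dividing this estimate by $2^{n+1}$ gives $0 \leq x - 1 - \sum^{n+1}_{i=1} b_i 2^{-i} < 2^{-(n+1)}$. By the comparison test with the geometric series $\sum^\infty_{i=1} 2^{-i}$, the series $\sum^\infty_{i=1} b_i 2^{-i}$ converges, and since $2^{-(n+1)} \to 0$ as $n \to \infty$, the partial sums $1 + \sum^{n+1}_{i=1} b_i 2^{-i}$ converge to $x$. Thus,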

$$ x = 1 + \sum^\infty_{i=1} b_i 2^{-i} = (1.b_1 b_2 \dots)_2 $$
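The proof is constructive, so the digits $b_1, b_2, \dots$ can be computed directly. The sketch below follows the same recursion (double the remainder, take its integer part, subtract it) and uses exact rational arithmetic so that no rounding interferes. The name `binary_expansion` and the choice of `Fraction` are assumptions for this example.

```python
from fractions import Fraction


def binary_expansion(x, n_digits):
    """Return [b_1, ..., b_n] such that x = (1.b_1 b_2 ...)_2 for x in [1, 2)."""
    x = Fraction(x)
    if not 1 <= x < 2:
        raise ValueError("x must lie in [1, 2)")
    digits = []
    remainder = x - 1               # lies in [0, 1)
    for _ in range(n_digits):
        remainder *= 2              # now lies in [0, 2)
        b = int(remainder)          # integer part, either 0 or 1
        digits.append(b)
        remainder -= b              # back in [0, 1), as in the proof
    return digits


x = Fraction(11, 10)                # the real number 1.1
digits = binary_expansion(x, 8)
print(digits)                       # [0, 0, 0, 1, 1, 0, 0, 1]

# The truncation error after n digits is below 2^{-n}.
partial = 1 + sum(Fraction(b, 2**i) for i, b in enumerate(digits, start=1))
print(0 <= x - partial < Fraction(1, 2**8))  # True
```

The digits of $1.1$ repeat with period 4 and never terminate, which is why $1.1$ has no finite binary representation.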

**Theorem 2**:

For any positive real number $y$, there exist an integer exponent $n$ and binary digits $b_1, b_2, \dots \in \{0,1\}$ such that

$$ y = 2^n(1+\sum^\infty_{i=1} b_i 2^{-i}) $$

**Proof**:

Let $y$ be the given positive real number. We can find an integer $n$ such that $y \in [2^n, 2^{n+1})$, namely $n = \lfloor \log_2 y \rfloor$. Set $x=\frac{y}{2^n}$; then $x \in [1,2)$. From Theorem 1, there exist $b_1, b_2, \dots \in \{0,1\}$ such that

$$ x = 1 + \sum^\infty_{i=1} b_i 2^{-i} $$

Now we have

$$ y = 2^nx = 2^n(1 + \sum^\infty_{i=1} b_i 2^{-i}) $$
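For a number already stored as a Python float, the normalization $y = 2^n x$ with $x \in [1,2)$ can be read off with the standard library function `math.frexp`, which returns a pair $(m, e)$ with $y = m \cdot 2^e$ and $m \in [0.5, 1)$. The wrapper `normalize` below is a hypothetical helper that shifts this to the convention used in Theorem 2.

```python
import math


def normalize(y):
    """Return (x, n) with y == x * 2**n and 1 <= x < 2, for a positive finite float y."""
    if not (y > 0 and math.isfinite(y)):
        raise ValueError("y must be a positive finite number")
    m, e = math.frexp(y)   # y == m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1    # shift one factor of 2 from the exponent into m


print(normalize(10.0))     # (1.25, 3), since 10 = 1.25 * 2^3
```

Multiplying by 2 and adjusting the exponent are both exact in binary floating point, so `math.ldexp(x, n)` recovers `y` without error.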

## 3 - Floating Point Representation

A floating point number with $t$ digits of precision in base 2, representing a nonzero real number $x$, has the form

$$ x = \pm 2^p \times (1.b_1 b_2 \dots b_{t-1})_2 = \pm 2^p (1 + \sum^{t-1}_{i=1} b_i 2^{-i}) \tag{1} $$

where $p \in \mathbb{Z}$ and $b_i \in \{0,1\}$. The part after the binary point is called the mantissa and $p$ is the exponent.

Given a precision $t$ and integers $L \leq U$, define

$$ \mathcal{F} = \{x \in \mathbb{R} : x=0 \text{ or } x \text{ of the form (1) with } L \leq p \leq U \} $$

This is the space of floating point numbers with $t$ digits and exponents ranging from $L$ to $U$. Because the set $\mathcal{F}$ is finite, each of its elements can be stored in a fixed-size block of memory in a computer. Two standard formats are implemented in hardware:
- Single-precision (float): 32 bits, $t = 24$, $L=-126$, $U=127$
- Double-precision (double): 64 bits, $t=53$, $L=-1022$, $U=1023$
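The double-precision parameters can be checked from Python, whose `float` type is a 64-bit double on essentially all current platforms (this is an assumption about the platform, not a language guarantee). Note that `sys.float_info` follows the C convention of a mantissa in $[0.5, 1)$, so its exponent bounds are shifted by one relative to $L$ and $U$ as defined here. The helper `double_parameters` is hypothetical.

```python
import sys


def double_parameters():
    """Return (t, L, U) for the platform's float type, in the convention of form (1)."""
    info = sys.float_info
    t = info.mant_dig        # number of binary digits, including the leading 1
    L = info.min_exp - 1     # C convention uses mantissa in [0.5, 1), hence the shift
    U = info.max_exp - 1
    return t, L, U


t, L, U = double_parameters()
print(t, L, U)               # 53 -1022 1023 on IEEE double platforms

# Smallest and largest positive numbers of form (1).
smallest_normal = 2.0 ** L
largest = (2.0 - 2.0 ** (1 - t)) * 2.0 ** U
print(smallest_normal == sys.float_info.min)  # True
print(largest == sys.float_info.max)          # True
```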

## 4 - IEEE Double Precision Floating Point Numbers



## 7 - Error Analysis

- We often cannot exactly evaluate functions
- Problems often do not have analytic solutions
- We need to understand the inherent error (round-off or truncation) in the way computers store numbers
- We need to quantify how close our approximate solution is to the actual solution

## 8 - Types of Error



## 9 - Floating Point Arithmetic

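Let $*$ be one of the binary operations $+,-,\times,\div$ and let $\circledast$ denote its floating point analogue. Recall that $\mathcal{F}$ is the space of floating point numbers (here 64-bit double precision), and let $\text{fl}(z)$ denote the element of $\mathcal{F}$ obtained by rounding the real number $z$. For any $x,y \in \mathcal{F}$, we define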

$$ x \circledast y = \text{fl}(x*y) $$
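This definition can be checked numerically. The sketch below computes $x * y$ exactly with rational arithmetic, rounds the exact result to a double, and compares it with the hardware result. It assumes that Python floats are IEEE doubles and that rounding is to the nearest representable number; the helper `fl` is a stand-in written for this example, not a library function.

```python
from fractions import Fraction


def fl(z):
    """Round an exact rational z to the nearest double (assumed rounding mode)."""
    return float(z)


x, y = 0.1, 0.2                          # both are elements of F once stored
exact_sum = Fraction(x) + Fraction(y)    # x + y computed without rounding
float_sum = x + y                        # floating point addition

print(fl(exact_sum) == float_sum)        # True: floating point sum equals fl(x + y)
print(Fraction(float_sum) == exact_sum)  # False: the exact sum is not in F
print(float_sum)                         # 0.30000000000000004
```

The same comparison can be repeated for the other three operations by replacing `+` on both lines.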

