In [3]:
# Packages required
import numpy as np

* Every number in a computer is stored using a finite number of digits.
* Numbers are stored as reasonable approximations.
* Loss of digits leads to roundoff errors.
* What are some possible scenarios and how to deal with them?


#### Note: In standard decimal notation, $AeB \equiv A \times 10^B$. For example, $12.625$ is written in standard form as $1.2625e1$, meaning $1.2625 \times 10^1.$

---

# Effects of roundoff error.
### The order of addition of numbers in a computer matters.

In [1]:
x = 1+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16
y = 1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1e-16+1
print("x = ", x)
print("y = ", y)

x =  1.0
y =  1.000000000000001


* The computer stores about 16 base 10 digits.
* We get 15 digits after the first nonzero digit of a number.
* In the first sum, the 16th digit is lost every time an addition is performed, so the result never moves away from $1$.
* In the second sum, $10^{-16}$ is added 10 times first, so the small terms get a chance to accumulate to $10^{-15}$, which is large enough to survive being added to $1$.
* Hence, different results!
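
The same effect can be reproduced with explicit loops. The sketch below is an illustration under assumptions, not part of the original cells: the variable names are made up, and `math.fsum` from the standard library is used as a correctly rounded reference sum.

```python
import math

# One large term and ten tiny ones, as in the cell above
terms = [1.0] + [1e-16] * 10

# Large term first: each tiny term is rounded away as it is added
big_first = 0.0
for t in terms:
    big_first += t

# Tiny terms first: they accumulate before meeting the large term
small_first = 0.0
for t in reversed(terms):
    small_first += t

# math.fsum tracks the lost low-order digits and rounds only once
reference = math.fsum(terms)

print("large term first :", big_first)
print("small terms first:", small_first)
print("math.fsum        :", reference)
```

Adding the smallest terms first is a cheap habit that often reduces roundoff; `math.fsum` is an option when the order cannot be controlled.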

---

### Subtracting two nearly equal numbers can lead to large error

Consider the function:
$$f(x)=\frac{e^{x}-e^{-x}}{x}$$

What is
$$\lim_\limits{x \rightarrow 0}f(x)?$$

Evaluating $f(x)$ numerically for smaller and smaller values of $x$ gives:

In [4]:
def f1(x):
    return (np.exp(x)-np.exp(-x))/x

print("--------------------------------------------")
print("   x                 (exp(x)-exp(-x))/x")
print("--------------------------------------------")
for i in range(5, 21):
    x = 10**(-1*i)
    print("  ", x, "            ", f1(x))

--------------------------------------------
   x                 (exp(x)-exp(-x))/x
--------------------------------------------
   1e-05              2.0000000000242046
   1e-06              1.999999999946489
   1e-07              1.9999999989472883
   1e-08              1.999999987845058
   1e-09              2.0000000544584395
   1e-10              2.000000165480742
   1e-11              2.000000165480742
   1e-12              2.0000667788622195
   1e-13              1.999511667349907
   1e-14              1.9984014443252818
   1e-15              2.1094237467877974
   1e-16              1.1102230246251565
   1e-17              0.0
   1e-18              0.0
   1e-19              0.0
   1e-20              0.0


We run into trouble!
* The smaller the value of $x$ we choose, the closer the result should be to the limit.
* However, for $x \leq 10^{-17}$ both $e^x$ and $e^{-x}$ are rounded to $1$, so $e^x - e^{-x}$ is stored as zero, causing a huge error.


A better approach, which avoids subtracting nearly equal expressions, is to use the Maclaurin series expansion.

$$ e^x = \sum\limits_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots $$

So that we end up with,
$$ \frac{e^{x}-e^{-x}}{x} = 2 + \frac{2x^2}{3!} + \frac{2x^4}{5!} + \frac{2x^6}{7!} + \cdots $$

For a small enough value of $x$ we can accurately approximate
$$f(x) \approx 2 + \frac{x^2}{3}$$

In [5]:
def f2(x):
    return 2+x*x/3

print("--------------------------------------------")
print("   x                     2+x^2/3")
print("--------------------------------------------")
for i in range(5,21):
    x = 10**(-1*i)
    print("  ", x, "            ", f2(x))

--------------------------------------------
   x                     2+x^2/3
--------------------------------------------
   1e-05              2.0000000000333333
   1e-06              2.0000000000003335
   1e-07              2.0000000000000036
   1e-08              2.0
   1e-09              2.0
   1e-10              2.0
   1e-11              2.0
   1e-12              2.0
   1e-13              2.0
   1e-14              2.0
   1e-15              2.0
   1e-16              2.0
   1e-17              2.0
   1e-18              2.0
   1e-19              2.0
   1e-20              2.0
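
To put numbers on the two errors, both formulas can be compared against a reference value. The sketch below is only one way to do this: it assumes the identity $e^{x}-e^{-x} = 2\sinh(x)$ and that `np.sinh` is accurate for small arguments, and `f_ref` is a name introduced here for illustration.

```python
import numpy as np

def f1(x):
    # Direct formula: suffers from cancellation for small x
    return (np.exp(x) - np.exp(-x)) / x

def f2(x):
    # Two-term Maclaurin approximation
    return 2 + x * x / 3

def f_ref(x):
    # Reference value via e^x - e^(-x) = 2*sinh(x); no subtraction involved
    return 2 * np.sinh(x) / x

for i in range(5, 17, 2):
    x = 10.0 ** (-i)
    ref = f_ref(x)
    err1 = abs(f1(x) - ref)
    err2 = abs(f2(x) - ref)
    print(f"x = {x:.0e}   |f1 - ref| = {err1:.2e}   |f2 - ref| = {err2:.2e}")
```

The first neglected term of the series is $\frac{2x^4}{5!}$, so for these values of $x$ the truncation error of `f2` is far below machine precision, while the roundoff error of `f1` grows as $x$ shrinks.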


### Numerical errors can cause divergent infinite sums to converge

$$ \sum\limits_{n=1}^{\infty} \frac{1}{n} \;\;\; \text{is a divergent series} \;\;\; \text{(Why?)} $$

However, once the terms become too small relative to the running partial sum, adding them no longer changes the stored value. Hence we end up with a finite value for a non-converging series.
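
The stagnation is easiest to observe in low precision. The sketch below is an illustration, not part of the original notes: it assumes NumPy's 16-bit type `np.float16` so that the loop stops quickly, and the function name `stagnating_harmonic_sum` is hypothetical. In 64-bit arithmetic the same effect occurs, but only after far more terms than a Python loop can reasonably add.

```python
import numpy as np

def stagnating_harmonic_sum(dtype=np.float16, max_terms=10_000):
    """Add 1/n in the given precision until the partial sum stops changing.

    Returns (n, total), where n is the first index whose term no longer
    changes the sum, or (None, total) if max_terms is reached first.
    """
    total = dtype(0)
    for n in range(1, max_terms + 1):
        new_total = dtype(total + dtype(1) / dtype(n))
        if new_total == total:
            return n, float(total)
        total = new_total
    return None, float(total)

n_stop, total = stagnating_harmonic_sum()
print("sum stops changing at n =", n_stop)
print("stored value of the sum  =", total)
```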

#### **How do we estimate the error for a convergent series?**

Say $ \sum\limits_{n=1}^{\infty} \frac{1}{n^2}$
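
One possible approach, offered as a sketch rather than the only answer, is to bound the neglected tail with integrals: $\frac{1}{N+1} < \sum\limits_{n=N+1}^{\infty} \frac{1}{n^2} < \frac{1}{N}$. The code below checks this bound for $N = 1000$, using the known limit $\frac{\pi^2}{6}$ of the series; the names `partial_sum`, `lower` and `upper` are introduced here for illustration.

```python
import math

def partial_sum(N):
    # Add the smallest terms first to limit roundoff
    return sum(1.0 / (n * n) for n in range(N, 0, -1))

N = 1000
s = partial_sum(N)
exact = math.pi ** 2 / 6           # known value of the infinite sum
error = exact - s                  # size of the neglected tail

lower, upper = 1 / (N + 1), 1 / N  # integral bounds on the tail
print("partial sum :", s)
print("actual error:", error)
print("bounds      :", lower, "< error <", upper)
```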

---

# Binary Numbers

We usually work with the decimal system, or base 10, which uses the digits 0-9 to represent any number. Under the hood, computers work with the binary system, or base 2, which uses only the digits 0 and 1.

$$(12.625)_{10} = 1\times 10^1+2\times 10^0+6\times 10^{-1}+2\times 10^{-2}+5\times 10^{-3}$$

$$(11.001)_2 = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} \equiv (3.125)_{10}$$
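
The two expansions above follow the same rule: each digit is multiplied by a power of the base. A minimal sketch of that rule, assuming non-negative numbers written with the digits 0-9 only (the function name `positional_value` is made up for this example):

```python
def positional_value(digits, base):
    """Evaluate a digit string such as '11.001' in the given base (2 to 10)."""
    whole, _, frac = digits.partition(".")
    value = 0.0
    # Digits left of the point carry powers base^0, base^1, ...
    for k, ch in enumerate(reversed(whole)):
        value += int(ch) * base ** k
    # Digits right of the point carry powers base^-1, base^-2, ...
    for k, ch in enumerate(frac, start=1):
        value += int(ch) * base ** (-k)
    return value

print(positional_value("11.001", 2))
print(positional_value("12.625", 10))
```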

### **Standard Binary Form:**
The standard representation of a binary number can be considered analogous to exponential formatting in base 10. The idea here is that every binary number, except $0$, can be represented as
$$ x = 1.b_1b_2b_3\cdots \times 2^{exponent}$$

---

# ALGORITHM: Decimal to Binary
Given a base $10$ decimal $d$.
1. Find the biggest power $p$ such that $2^p\leq d$ but $2^{p+1}>d$, and save the index number $p$;
2. Set  $d \leftarrow d-2^p $;
3. If $d$ is $0$, or very small compared to the starting $d$, or the process has been repeated "enough times", then stop. Otherwise repeat steps 1 and 2.
4. Obtain the binary equivalent by placing 1's at the places corresponding to the saved indices.
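
The steps above can be sketched in Python as follows. This is one possible implementation under stated assumptions: $d$ is positive, "enough times" is taken to be `max_steps=60`, "very small" means below `rel_tol` times the starting value, and `math.frexp` is used to find $p$. The function name and both parameters are hypothetical. The example uses $12.625$ from the Binary Numbers section.

```python
import math

def decimal_to_binary(d, max_steps=60, rel_tol=1e-15):
    """Return (saved powers, binary string) for a positive number d."""
    if d <= 0:
        raise ValueError("d must be positive")
    start = d
    powers = []
    for _ in range(max_steps):
        # Step 1: frexp gives d = m * 2**e with 0.5 <= m < 1, so p = e - 1
        p = math.frexp(d)[1] - 1
        powers.append(p)
        # Step 2: remove that power of two
        d -= 2.0 ** p
        # Step 3: stop when nothing (or almost nothing) is left
        if d == 0 or d < rel_tol * start:
            break
    # Step 4: place 1's at the saved positions, 0's elsewhere
    ones = set(powers)
    hi = max(powers[0], 0)
    lo = min(powers[-1], 0)
    whole = "".join("1" if p in ones else "0" for p in range(hi, -1, -1))
    frac = "".join("1" if p in ones else "0" for p in range(-1, lo - 1, -1))
    return powers, whole + ("." + frac if frac else "")

powers, bits = decimal_to_binary(12.625)
print("saved powers:", powers)
print("binary form :", bits)
```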
---

# Exercise:
Convert the base $10$ number $d = 11.5625$ to base $2$.


---

# 64-bit floating-point number
The IEEE standard divides up the $64$ bits as follows:
* 1 bit sign: $0$ for positive, $1$ for negative;
* $11$ bit exponent: the base $2$ representation of (standard binary form exponent + 1023);
* $52$ bit mantissa: the first $52$ digits after the binary point from standard binary form.

## |--- 1 sign ---|------ 11 bit exponent ------|------------------------- 52 bit fraction ---------------------------|


# Example:
$(11.5625)_{10} \equiv (1011.1001)_2$, that is $1.0111001 \times 2^3$ in standard form. It will be stored in a $64$ bit system as:

* $sign\; bit = 0$
[since the number is positive]
* $exponent = 10000000010$
[binary representation of $(3+1023)$]
* $mantissa = 0111001000000000000000000000000000000000000000000000$

So the $64$-bit representation is
$$0\;|\;10000000010\;|\;0111001000000000000000000000000000000000000000000000\;$$
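
This bit pattern can be checked directly. The sketch below assumes that Python's `float` is an IEEE 64-bit double, which is the case on common platforms, and uses the standard `struct` module to read the raw bits; `float_to_bits` is a helper name introduced here.

```python
import struct

def float_to_bits(x):
    """Split a Python float into (sign, exponent, mantissa) bit strings."""
    # Pack as a big-endian double, then reinterpret the 8 bytes as an integer
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(as_int, "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = float_to_bits(11.5625)
print(sign, exponent, mantissa, sep=" | ")
print("unbiased exponent:", int(exponent, 2) - 1023)
```

The same helper can be used to check answers to conversion exercises by hand-computing the three fields first and comparing.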

# Note:

* The biggest positive representable number is $1.11 \cdots 1 \times 2^{1023} \approx 10^{308}.$
* The smallest positive representable number in this (normalized) form is $1.00 \cdots 0 \times 2^{-1022} \approx 10^{-308}.$
* The relative spacing between two consecutive representable numbers is $2^{-52} \approx 2.22 \times 10^{-16}.$
This quantity is referred to as **machine epsilon**, and we denote $\epsilon_{mach} = 2^{-52}.$
* Given a number $d = 1.b_1 b_2 \cdots b_{52} \times 2^{exponent}$, the smallest increment we can add to it is in the 52nd digit of the mantissa, i.e. $2^{-52} \times 2^{exponent}.$
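
These limits can be inspected from Python. A short sketch, assuming NumPy's `np.finfo` and `np.spacing` as the way to query them:

```python
import numpy as np

info = np.finfo(np.float64)
print("largest finite value     :", info.max)
print("smallest normalized value:", info.tiny)
print("machine epsilon          :", info.eps)

# The absolute spacing scales with the exponent: 2^-52 * 2^exponent
print("spacing near 1.0:", np.spacing(1.0))
print("spacing near 8.0:", np.spacing(8.0))

# An increment of half the spacing near 1.0 is rounded away
print("1 + eps/2 == 1 ?", 1.0 + info.eps / 2 == 1.0)
```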

# Exercises:

1. Convert the binary number $1101101.1011$ to decimal.
2. Convert the decimal number $-66.125$ to binary format. What is the 64-bit floating point representation?
3. Show that the decimal number $0.1$ cannot be represented exactly as a finite binary number.
Use this fact to explain the output of the following statement in Python.
$$\verb+0.1*3 == 0.3+$$