
# Representation of Numbers in Python

## Section 9.1 - Base-N and Binary

The **decimal** system is a way of representing numbers that you are familiar with from elementary school. In the decimal system, a number is represented by a list of digits from 0 to 9, where each digit represents the coefficient for a power of 10.

EXAMPLE: Show the decimal expansion for 147.3.

> 147.3 = 1⋅10^2 + 4⋅10^1 + 7⋅10^0 + 3⋅10^−1

Since each digit is associated with a power of 10, the decimal system is also known as **base10** because it is based on 10 digits (0 to 9). However, there is nothing special about base10 numbers except perhaps that you are more accustomed to using them.

For example, in base3 we have the digits 0, 1, and 2, and the number

> 121 (base 3) = 1⋅3^2 + 2⋅3^1 + 1⋅3^0 = 9 + 6 + 1 = 16 (base 10)
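The positional expansion above can be sketched in a few lines of Python. The helper name `from_base` is hypothetical (it is not part of the original notebook), and it assumes the digits are 0 to 9, so it only covers bases up to 10. The built-in `int(text, base)` does the same job and is shown for comparison.

```python
def from_base(digits, base):
    """Evaluate a digit string as a sum of digit * base**position.

    Hypothetical helper for illustration; assumes digits 0-9 (base <= 10).
    """
    value = 0
    # The rightmost digit has position 0, the next one position 1, and so on.
    for position, digit in enumerate(reversed(digits)):
        value += int(digit) * base ** position
    return value


print(from_base("121", 3))  # 1*9 + 2*3 + 1*1 = 16
print(int("121", 3))        # built-in conversion, also 16
```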

A very important representation of numbers for computers is base2 or **binary** numbers. In binary, the only available digits are 0 and 1, and each digit is the coefficient of a power of 2. A digit in a binary number is also known as a bit. Note that binary numbers are still numbers, and so addition and multiplication are defined on them exactly as you learned in grade school.

**TRY IT!** Convert the number 11 (base10) into binary.

In [1]:
bin(11)  # bin(11) returns the binary representation of 11 as a string

'0b1011'
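To see what `bin` is doing, here is a minimal sketch of the usual repeated-division method: divide by 2, record the remainder, and read the remainders in reverse. The function name `to_binary` is an assumption made for this illustration and handles non-negative integers only.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary digit string.

    Hypothetical helper for illustration: repeated division by 2.
    """
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next bit, lowest first
        n //= 2
    return "".join(reversed(digits))


print(to_binary(11))                   # 1011
print("0b" + to_binary(11) == bin(11))  # True
```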

**TRY IT!** Convert 37 (base10) and 17 (base10) to binary. Add and multiply the resulting numbers in binary. Verify that the result is correct in base10.

In [9]:
# ADDING
bin(37)  # '0b100101', the binary representation of 37
bin(17)  # '0b10001', the binary representation of 17
x = 0b100101  # x is initialized with the binary literal for 37
y = 0b10001  # y is initialized with the binary literal for 17
i = x + y  # i is the sum of the two numbers
print(f"{i:b}")  # print the sum in binary

110110


In [10]:
# MULTIPLYING
x = 0b100101  # x is initialized with the binary literal for 37
y = 0b10001  # y is initialized with the binary literal for 17
j = x * y  # j is the product of the two numbers
print(f"{j:b}")  # print the product in binary

1001110101
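The exercise also asks to verify the results in base10. One possible check, sketched here with variable names chosen for this illustration, is to parse the binary strings back into integers with `int(text, 2)` and compare them with 37 + 17 = 54 and 37 ⋅ 17 = 629.

```python
x, y = 37, 17

sum_bits = format(x + y, "b")      # binary digits of the sum
product_bits = format(x * y, "b")  # binary digits of the product

# int(text, 2) parses a binary digit string back into an integer.
print(sum_bits, int(sum_bits, 2))          # 110110 54
print(product_bits, int(product_bits, 2))  # 1001110101 629
```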


## Section 9.2 - Floating Point Numbers

The number of bits is usually fixed for any given computer. Using binary representation gives us an insufficient range and precision of numbers to do relevant engineering calculations. To achieve the range of values needed with the same number of bits, we use **floating point** numbers or **float** for short.



In [6]:
import sys
sys.float_info

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

A float can then be represented as:

> $n=(-1)^{s}2^{e-1023}(1+f)$ (for 64-bit), where $s$ is the **sign indicator**, which says whether a number is positive or negative, $e$ is the **exponent**, and $f$ is the **fraction**, which is the coefficient of the exponent. Of the 64 bits, 1 is allocated to $s$, 11 to $e$, and 52 to $f$.

**TRY IT!** What is the number 1 10000000010 1000000000000000000000000000000000000000000000000000 (IEEE754) in base10?

The exponent in decimal is $1⋅2^{10}+1⋅2^{1}-1023=3.$ The fraction is $1⋅\frac{1}{2^{1}}+0⋅\frac{1}{2^{2}}+...=0.5$. Therefore $n=(-1)^{1}⋅2^{3}⋅(1+0.5)=-12.0$ (base10).
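The hand calculation can be cross-checked in Python. The sketch below uses a hypothetical helper, `decode_ieee754`, that splits a 64-character bit string into $s$, $e$, and $f$, applies the formula for normal numbers, and compares the result with what the standard `struct` module decodes from the same bits. It assumes a normal number (exponent field between 1 and 2046).

```python
import struct


def decode_ieee754(bits):
    """Decode a 64-bit pattern given as a string of 0s and 1s (spaces ignored).

    Hypothetical helper for illustration; valid for normal numbers only.
    Returns (value from the formula, value decoded by struct).
    """
    bits = bits.replace(" ", "")
    s = int(bits[0], 2)              # 1 sign bit
    e = int(bits[1:12], 2)           # 11 exponent bits
    f = int(bits[12:], 2) / 2**52    # 52 fraction bits, scaled into [0, 1)
    by_formula = (-1) ** s * 2.0 ** (e - 1023) * (1 + f)
    by_struct = struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]
    return by_formula, by_struct


pattern = "1 10000000010 " + "1" + "0" * 51
print(decode_ieee754(pattern))  # (-12.0, -12.0)
```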

We call the distance from one number to the next the **gap**. Because the fraction is multiplied by $2^{e-1023}$, the gap grows as the number represented grows. The gap at a given number can be computed using the function spacing in numpy.

In [3]:
import numpy as np

**TRY IT!** Use the spacing function to determine the gap at 1e9. Verify that adding a number to 1e9 that is less than half the gap at 1e9 results in the same number.

In [13]:
np.spacing(1e9)

1.1920928955078125e-07

In [14]:
1e9 == (1e9 + np.spacing(1e9)/3)

True
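As a complementary sketch, the standard library function `math.ulp` (available from Python 3.9) returns the same gap for positive finite floats, so the growth of the gap can be inspected without numpy. The example also checks the other side of the exercise: adding more than half the gap rounds up to the next float. The variable names are chosen for this illustration.

```python
import math

# Gap to the next float at several magnitudes (math.ulp needs Python 3.9+).
for x in [1.0, 1e3, 1e9, 1e15]:
    print(x, math.ulp(x))

gap = math.ulp(1e9)
below_half = 1e9 + gap / 3      # less than half the gap: rounds back to 1e9
above_half = 1e9 + 2 * gap / 3  # more than half the gap: rounds up to 1e9 + gap
print(below_half == 1e9, above_half == 1e9 + gap)  # True True
```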

There are special cases for the value of a floating point number when e = 0 (i.e., e = 00000000000 (base2)) and when e = 2047 (i.e., e = 11111111111 (base2)), which are reserved.

When the exponent is 0, the leading 1 in the fraction takes the value 0 instead. The result is a subnormal number, which is computed by $n=(-1)^{s}2^{-1022}(0+f)$ (note: it is -1022 instead of -1023). When the exponent is 2047 and f is nonzero, the result is “Not a Number”, which means that the number is undefined. When the exponent is 2047, f = 0, and s = 0, the result is positive infinity. When the exponent is 2047, f = 0, and s = 1, the result is minus infinity.
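These reserved encodings can be explored by assembling the bit fields directly. The sketch below uses a hypothetical helper, `from_fields`, that packs $s$, $e$, and the 52 fraction bits into 8 bytes and lets the standard `struct` module interpret them as a double. The particular fraction values are arbitrary choices for illustration.

```python
import math
import struct
import sys


def from_fields(s, e, f_bits):
    """Build a float from sign bit s, 11-bit exponent e, and 52 fraction bits.

    Hypothetical helper for illustration; f_bits is an integer below 2**52.
    """
    raw = (s << 63) | (e << 52) | f_bits
    return struct.unpack(">d", raw.to_bytes(8, "big"))[0]


pos_inf = from_fields(0, 2047, 0)        # e = 2047, f = 0, s = 0
neg_inf = from_fields(1, 2047, 0)        # e = 2047, f = 0, s = 1
not_a_number = from_fields(0, 2047, 1 << 51)  # e = 2047, f nonzero

# e = 0 with all fraction bits set: the largest subnormal number.
f = (2**52 - 1) / 2**52
largest_subnormal = from_fields(0, 0, 2**52 - 1)

print(pos_inf, neg_inf, not_a_number)
print(largest_subnormal == 2.0**-1022 * (0 + f))   # matches the subnormal formula
print(largest_subnormal < sys.float_info.min)      # just below the smallest normal number
```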

**TRY IT!** Compute the base10 value for 0 11111111110 1111111111111111111111111111111111111111111111111111 (IEEE754), the largest defined number for 64 bits, and for 0 00000000001 000000000000000000000000000000000000000000000000000 (IEEE754), the smallest positive normalized number. Note that the exponent is, respectively, e = 2046 and e = 1 to comply with the previously stated rules. Verify that Python agrees with these calculations using sys.float_info.max and sys.float_info.min.

In [4]:
largest = 2**(2046-1023) * (1 + sum(0.5**np.arange(1, 53)))
largest

1.7976931348623157e+308

In [7]:
sys.float_info.max

1.7976931348623157e+308

In [8]:
smallest = (2**(1-1023))*(1+0)
smallest

2.2250738585072014e-308

In [9]:
sys.float_info.min

2.2250738585072014e-308

Numbers that are larger than the largest representable floating point number result in **overflow**, and Python handles this case by assigning the result to inf. Numbers that are smaller than the smallest subnormal number result in **underflow**, and Python handles this case by assigning the result to 0.

**TRY IT!** Show that adding 2 to the maximum 64-bit float results in the same number. The Python float does not have sufficient precision to store the + 2 for sys.float_info.max; therefore, the operation is essentially equivalent to adding zero. Also show that adding the maximum 64-bit float to itself results in overflow and that Python assigns this overflow number to inf.

In [10]:
sys.float_info.max + 2 == sys.float_info.max

True

In [11]:
sys.float_info.max + sys.float_info.max

inf
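Not every overflowing operation in Python returns inf quietly. As a hedged illustration of behavior observed in CPython: float addition and multiplication overflow to inf, while the power operator on floats raises `OverflowError` instead. The variable names are chosen for this sketch.

```python
import sys

# Float multiplication overflows quietly to inf.
doubled = sys.float_info.max * 2

# The power operator on floats raises instead (behavior observed in CPython).
try:
    2.0 ** 1024
    raised = False
except OverflowError:
    raised = True

print(doubled, raised)
```

If a computation might overflow, it can be worth checking the result with `math.isinf` or catching `OverflowError`, depending on which operations are involved.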

**TRY IT!** The smallest subnormal number in 64-bit has s = 0, e = 00000000000, and f = 0000000000000000000000000000000000000000000000000001. Using the special rules for subnormal numbers, this results in the subnormal number $(-1)^{0}2^{1-1023}2^{-52}=2^{-1074}.$ Show that $2^{-1075}$ underflows to 0.0 and that the result cannot be distinguished from 0.0. Show that $2^{-1074}$ does not.



In [12]:
2**(-1075)

0.0

In [13]:
2**(-1075) == 0

True

In [14]:
2**(-1074)

5e-324
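The same threshold can be reached from the values reported by `sys.float_info`. This sketch assumes the standard 64-bit layout described above, where `min` is $2^{-1022}$ and `epsilon` is $2^{-52}$, so their product is the smallest subnormal number $2^{-1074}$. Halving it lands exactly halfway to zero and rounds to 0.0.

```python
import sys

# 2**-1022 * 2**-52 = 2**-1074, the smallest subnormal number.
smallest_subnormal = sys.float_info.min * sys.float_info.epsilon

# Half of that is not representable and underflows to 0.0.
halved = smallest_subnormal / 2

print(smallest_subnormal, halved)  # 5e-324 0.0
```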

## Section 9.3 - Round-off Errors

In the previous section, we discussed how floating point numbers are represented in computers as base 2 fractions. A side effect is that floating point numbers cannot be stored with perfect precision; instead, they are approximated by a finite number of bytes. The difference between an approximation of a number used in computation and its correct (true) value is called **round-off error**. It is one of the most common errors in numerical calculations.

### Representation Error

The most common form of round-off error is the representation error in floating point numbers. A simple example is representing π. We know that π is an irrational number with infinitely many digits, but when we use it, we usually use only a finite number of digits. For example, if you use only 3.14159265, there will be an error between this approximation and the true value. Another example is 1/3: the true value is 0.333333333…, so no matter how many decimal digits we choose, there is a round-off error as well.
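The representation error of 1/3 as a 64-bit float can be measured exactly with the standard `fractions` and `decimal` modules, which convert a float to its exact stored value. This is a sketch for illustration; the variable names are not from the original notebook.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(1 / 3)         # exact value of the float nearest to 1/3
error = stored - Fraction(1, 3)  # exact representation error

print(Decimal(1 / 3))  # full decimal expansion of the stored value
print(error)           # a tiny negative fraction: the stored value is slightly too small
print(float(error))
```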

### Round-off Error by Floating-Point Arithmetic

The difference between 4.9 and 4.845 should be 0.055. But if you calculate it in Python, you will see that 4.9 - 4.845 is not equal to 0.055.

In [15]:
4.9 - 4.845 == 0.055

False

Why does this happen? If we look at 4.9 - 4.845, we can see that we actually get 0.055000000000000604 instead. This is because floating point numbers cannot represent these values exactly; they are only approximations, and using them in arithmetic introduces a small error.

In [16]:
4.9 - 4.845

0.055000000000000604

In [17]:
4.8 - 4.845

-0.04499999999999993

Another example below shows that 0.1 + 0.2 + 0.3 is not equal to 0.6, which has the same cause.

In [18]:
0.1 + 0.2 + 0.3 == 0.6

False

Though the numbers cannot be made closer to their intended exact values, the round function can be useful for post-rounding so that results with inexact values become comparable to one another:

In [19]:
round(0.1 + 0.2 + 0.3, 5)  == round(0.6, 5)

True
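As an alternative to rounding both sides, the standard library offers `math.isclose`, which compares two floats within a tolerance (by default a relative tolerance of 1e-09). A minimal sketch, with a variable name chosen for this illustration:

```python
import math

total = 0.1 + 0.2 + 0.3

print(total == 0.6)              # False: exact comparison sees the round-off error
print(math.isclose(total, 0.6))  # True: equal within the default relative tolerance
```

Which tolerance is appropriate depends on the calculation, so the default should be treated as a starting point rather than a universal choice.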

### Accumulation of Round-Off Error

In the following example, we add 1/3 to the number 1 and then subtract 1/3, which gives back the same number 1. But what if we add 1/3 many times and then subtract 1/3 the same number of times? Do we still get 1? No. As the example below shows, the more times you do this, the more error accumulates.

In [20]:
# If we only do once
1 + 1/3 - 1/3

1.0

In [21]:
def add_and_subtract(iterations):
    result = 1

    for i in range(iterations):
        result += 1/3

    for i in range(iterations):
        result -= 1/3
    return result

In [22]:
# If we do this 100 times
add_and_subtract(100)

1.0000000000000002

In [23]:
# If we do this 1000 times
add_and_subtract(1000)

1.0000000000000064

In [24]:
# If we do this 10000 times
add_and_subtract(10000)

1.0000000000001166
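When many floats have to be added, one way to limit this accumulation is `math.fsum` from the standard library, which tracks the lost low-order bits during summation. The sketch below is a small illustration rather than a general remedy: it adds 0.1 ten times with a plain loop and with `math.fsum`.

```python
import math

values = [0.1] * 10

# Plain accumulation: each += rounds, and the errors add up.
running_total = 0.0
for v in values:
    running_total += v

print(running_total)       # 0.9999999999999999
print(math.fsum(values))   # 1.0
```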
