# 1 Rounding Error

## 1.1 Floating Point Formats
Because there are infinitely many real numbers, storing them in a finite number of bits requires an approximate representation. Although there are also infinitely many integers, the results of integer computations in most programs fit in 32 bits. Calculations with real numbers, by contrast, often yield quantities that cannot be represented exactly in any fixed number of bits. The result of a floating-point calculation must therefore often be rounded to fit back into its finite representation, and this rounding introduces rounding error.

The most commonly used representation for real numbers is the floating-point representation. It has a base $\beta$ (usually assumed to be even) and a precision $p$. For example, if $\beta = 10$ and $p = 3$, the number 0.1 is represented as $1.00 \times 10^{-1}$.

Not every number can be represented exactly, however. If $\beta = 2$ and $p = 24$, the decimal number 0.1 has no exact representation, because its binary expansion is an infinitely repeating fraction. It can only be approximated by a nearby 24-bit value.
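The effect can be observed directly. The sketch below assumes that Python's `struct` format `"f"` stores an IEEE 754 single-precision value (base 2, 24-bit precision); the helper name `to_float32` is made up for this illustration.

```python
import struct
from fractions import Fraction

def to_float32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored = to_float32(0.1)

# Fraction(float) converts exactly, so this is the value actually stored.
exact_stored = Fraction(stored)
error = exact_stored - Fraction(1, 10)

print(exact_stored)   # a fraction whose denominator is a power of two
print(float(error))   # small, but not zero
```

Since 1/10 cannot be written with a power-of-two denominator, the stored value necessarily differs from 0.1.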

In general, a floating-point number will be represented as $\pm d.dd \cdots d \times \beta^e$, where $d.dd \cdots d$ is called the significand and has $p$ digits.

More precisely, $\pm d_0.d_1d_2 \cdots d_{p-1} \times \beta^e$ represents the number

$\pm \left( d_0 + d_1 \beta^{-1} + \cdots + d_{p-1} \beta^{-(p-1)} \right) \beta^e, \qquad 0 \le d_i < \beta.$
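As a minimal sketch of this definition, the following function evaluates a digit string exactly using rational arithmetic. The name `fp_value` and its argument layout are illustrative choices, not part of any standard library.

```python
from fractions import Fraction

def fp_value(digits, beta, e, sign=1):
    """Exact value of sign * d0.d1...d(p-1) * beta**e."""
    if not all(0 <= d < beta for d in digits):
        raise ValueError("each digit must satisfy 0 <= d < beta")
    # d0 + d1 * beta**-1 + ... + d(p-1) * beta**-(p-1)
    significand = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
    return sign * significand * Fraction(beta) ** e

print(fp_value([1, 0, 0], beta=10, e=-1))  # 1.00 x 10^-1
print(fp_value([1, 1, 1], beta=2, e=2))    # 1.11 (binary) x 2^2
```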

The term "floating-point number" refers to a real number that can be represented exactly in the format under discussion. Two additional parameters associated with floating-point representations are the largest allowable exponent $(e_{\text{max}})$ and the smallest allowable exponent $(e_{\text{min}})$. With $\beta^p$ possible significands and $e_{\text{max}} - e_{\text{min}} + 1$ possible exponents, a floating-point number can be encoded in $\lceil \log_2 (e_{\text{max}} - e_{\text{min}} + 1) \rceil + \lceil \log_2 (\beta^p) \rceil + 1$ bits, where the final $+1$ is the sign bit.

The possible exponents in a floating-point representation depend on the range between the smallest allowable exponent $(e_{\text{min}})$ and the largest allowable exponent $(e_{\text{max}})$. To include both endpoints, we add 1 to the difference, resulting in $e_{\text{max}} - e_{\text{min}} + 1$ possible exponents.

For example, a range of integers from 1 to 5 demonstrates the need for adding 1. Without adding 1, the difference is 4, resulting in 4 possible integers (1, 2, 3, 4), excluding 5. By adding 1, we obtain 5, ensuring that all numbers from 1 to 5, including the endpoints, are considered. Thus, the inclusive range is 1, 2, 3, 4, 5.

In the context of exponents, the range is inclusive: both the smallest (\(e_{\text{min}}\)) and the largest (\(e_{\text{max}}\)) allowable exponents are valid. The difference \(e_{\text{max}} - e_{\text{min}}\) alone counts one value too few, which is why 1 is added.

For instance, if \(e_{\text{min}}\) is -5 and \(e_{\text{max}}\) is 5, the range would be 5 - (-5) + 1 = 11. This indicates 11 possible exponents: -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, and 5. All these exponents are part of the inclusive range and can be used to represent various values in the floating-point format.
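The bit count can be sketched in a few lines, rounding each logarithm up to a whole number of bits. The function names are hypothetical, and integer arithmetic is used so that the logarithm is never affected by floating-point rounding itself.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n, computed with exact integer arithmetic."""
    return (n - 1).bit_length()

def encoding_bits(beta: int, p: int, e_min: int, e_max: int) -> int:
    """Bits needed for the exponent, the significand, and the sign."""
    exponent_bits = ceil_log2(e_max - e_min + 1)  # inclusive exponent range
    significand_bits = ceil_log2(beta**p)         # beta**p possible significands
    return exponent_bits + significand_bits + 1   # +1 for the sign bit

print(encoding_bits(beta=10, p=3, e_min=-5, e_max=5))
```

With the exponent range of -5 to 5 from the example above, the 11 possible exponents need 4 bits, since 3 bits can distinguish only 8 values.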



Real numbers may not be exactly representable as floating-point numbers due to two main reasons.

1. Representation Error: The most common reason is that some real numbers have finite decimal representations but infinite repeating representations in binary. For example, the decimal number 0.1 cannot be precisely represented in binary, causing it to fall between two floating-point numbers without being exactly representable by either.

2. Out of Range: A less common reason is when a real number exceeds the range supported by the floating-point format. If a number's absolute value is too large or too small relative to the maximum and minimum representable values, it cannot be accurately represented within the floating-point format.



Floating-point representations are not always unique. For instance, both 0.01 x 10^1 and 1.00 x 10^-1 represent 0.1. A representation is considered normalized if the leading digit is nonzero. In the given example, 1.00 x 10^-1 is normalized, but 0.01 x 10^1 is not.

For example, with β = 2, p = 3, e_min = -1, and e_max = 2, there are 16 normalized floating-point numbers. Requiring normalization ensures a unique representation. However, it also makes it impossible to represent zero.
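The count of 16 can be checked by brute force. This is only an illustrative enumeration of the positive values; `normalized_numbers` is a name invented for the sketch.

```python
from fractions import Fraction
from itertools import product

def normalized_numbers(beta, p, e_min, e_max):
    """All positive values d0.d1...d(p-1) * beta**e with d0 != 0."""
    values = []
    for e in range(e_min, e_max + 1):
        for digits in product(range(beta), repeat=p):
            if digits[0] == 0:
                continue  # leading digit zero: not normalized
            significand = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
            values.append(significand * Fraction(beta) ** e)
    return sorted(values)

numbers = normalized_numbers(beta=2, p=3, e_min=-1, e_max=2)
print(len(numbers))
print([float(v) for v in numbers])
```

Every value appears exactly once, and zero is not among them.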

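A natural way to represent zero is 1.0 x β^(e_min-1), using an exponent one below the smallest allowable exponent. This choice preserves the property that the numerical ordering of nonnegative real numbers matches the lexicographic ordering of their floating-point representations.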

When the exponent is stored in a k-bit field, only 2^k - 1 values are available for use as exponents, since one must be reserved to represent zero.



Note that the "x" in a floating-point number is a notation symbol and should not be confused with the multiplication operation. The context usually makes the meaning of the "x" symbol clear. For instance, the expression (2.5 x 10^-3) x (4.0 x 10^2) involves a single floating-point multiplication operation.

## 1.2 Relative Error And Ulps

Rounding error in a floating-point result is commonly measured in two ways: as relative error and in units in the last place (ulps). The two measures are not proportional to each other. The relative error corresponding to 1/2 ulp can vary by a factor of β, depending on where the value lies between consecutive powers of β. This factor is known as the "wobble."

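The relative error between a computed floating-point number and the real number it approximates is normally written as a factor times ε, the machine epsilon, where ε = (β/2)β^(-p) is the largest relative error that rounding to the nearest floating-point number can introduce. An error of 1/2 ulp therefore corresponds to a relative error between (1/2)β^(-p) and ε. With β = 10 and p = 3, for instance, ε = 0.005.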

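The error measured in ulps can grow even though the relative error stays the same. Within one exponent range the size of an ulp is fixed while the values themselves vary by a factor of β, so the same relative error amounts to more ulps near the top of the range than near the bottom. In general, a fixed relative error, expressed in ulps, can wobble by a factor of up to β.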
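A small numerical sketch makes the difference concrete. It assumes β = 10 and p = 3, takes machine epsilon to be (β/2)β^(-p), and uses x = 12.35 (approximated by 1.24 x 10^1) and 8x = 98.8 (approximated by 9.92 x 10^1) as illustrative values; the helper names are invented for this example.

```python
from fractions import Fraction

def error_in_ulps(significand, e, exact, beta, p):
    """|d.dd...d - exact / beta**e| in units of the last place."""
    return abs(significand - exact / Fraction(beta) ** e) * beta ** (p - 1)

def error_in_epsilons(significand, e, exact, beta, p):
    """Relative error as a multiple of eps = (beta / 2) * beta**-p."""
    eps = Fraction(beta, 2) * Fraction(beta) ** -p
    approx = significand * Fraction(beta) ** e
    return abs(approx - exact) / abs(exact) / eps

beta, p = 10, 3
x = Fraction("12.35")
# (significand, exponent, exact value) for x and for 8x
cases = [(Fraction("1.24"), 1, x), (Fraction("9.92"), 1, 8 * x)]
results = [
    (error_in_ulps(d, e, exact, beta, p), error_in_epsilons(d, e, exact, beta, p))
    for d, e, exact in cases
]
for ulps, eps_multiple in results:
    print(float(ulps), float(eps_multiple))
```

The second error is eight times larger in ulps, while both have the same relative error.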

Ulps are the natural measure of the error from a single rounding: rounding to the nearest floating-point number incurs an error of at most 1/2 ulp. When analyzing the rounding error caused by various formulas, however, relative error is the better measure. Because ε can overestimate the effect of a single rounding by the wobble factor β, error estimates for formulas are tighter when β is small.

Note: Out-of-range numbers and their impact on error estimates will be discussed separately.

When only the order of magnitude of the rounding error matters, ulps and ε can be used interchangeably, since they differ by at most a factor of β. For example, if a floating-point number has an error of n ulps, approximately $\log_\beta n$ of its digits are contaminated. Similarly, if the relative error in a computation is nε, the number of contaminated digits is approximately $\log_\beta n$.

## 1.3 Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly, then round it to the nearest floating-point number. This approach becomes expensive when the operands differ greatly in size.

Assuming p = 3, 2.15 x 10^12 – 1.25 x 10^-5 would be calculated as:

x = 2.15 x 10^12
y = 0.0000000000000000125 x 10^12

x - y = 2.1499999999999999875 x 10^12, which rounds to 2.15 x 10^12. 
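The exact-then-round approach can be reproduced with Python's `decimal` module, as a sketch. The working precision of 50 digits is an arbitrary choice that is merely large enough to make this particular subtraction exact, and round-half-even is an assumed rounding mode.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

x = Decimal("2.15E12")
y = Decimal("1.25E-5")

# Enough working digits that the subtraction itself is exact.
exact = Context(prec=50).subtract(x, y)

# Round the exact difference to p = 3 significant digits.
rounded = Context(prec=3, rounding=ROUND_HALF_EVEN).plus(exact)

print(exact)
print(rounded)
```

The exact difference needs 20 significant digits, far more than the 3 digits that survive rounding.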

Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Let's suppose the number of digits kept is p, and when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding).

Then, 2.15 x 10^12 – 1.25 x 10^-5 becomes:

x = 2.15 x 10^12
y = 0.00 x 10^12

x - y = 2.15 x 10^12.

The answer is exactly the same as if the difference had been computed exactly, and then rounded.

Let's take another example: 10.1 – 9.93.

This becomes:

x = 1.01 x 10^1
y = 0.99 x 10^1

x - y = 0.02 x 10^1.

The correct answer is 0.17, while the computed difference is 0.02 x 10^1 = 0.2. The result is off by 0.03, or 30 ulps, and is wrong in every digit. How bad can the error be?
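Before turning to that question, the shift-and-discard subtraction used in both examples can be sketched with Python's `decimal` module. The function `subtract_p_digits` is a hypothetical model, not a description of any particular hardware: it assumes base 10, x ≥ y > 0, and an x that already fits in p digits.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

def subtract_p_digits(x: Decimal, y: Decimal, p: int) -> Decimal:
    """x - y keeping only p digits aligned to x's exponent.

    Digits of y shifted past the p-th position are discarded, not rounded.
    """
    last_place = Decimal(1).scaleb(x.adjusted() - (p - 1))
    y_kept = y.quantize(last_place, rounding=ROUND_DOWN)
    return Context(prec=p, rounding=ROUND_HALF_EVEN).plus(x - y_kept)

far_apart = subtract_p_digits(Decimal("2.15E12"), Decimal("1.25E-5"), p=3)
close = subtract_p_digits(Decimal("10.1"), Decimal("9.93"), p=3)
print(far_apart)
print(close, "versus the exact", Decimal("10.1") - Decimal("9.93"))
```

When the operands are far apart, discarding digits is harmless. When they are close, the discarded digit of y is exactly the one that determines the answer.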

**Theorem 1**

Using a floating-point format with parameters $\beta$ and $p$, and computing differences using $p$ digits, the relative error of the result can be as large as $\beta - 1$.

**Proof**
A relative error of $\beta - 1$ in the expression $x - y$ occurs when $x = 1.00 \cdots 0$ and $y = 0.\rho\rho \cdots \rho$, where $\rho = \beta - 1$. Here, $y$ has $p$ digits (all equal to $\rho$). The exact difference is $x - y = \beta^{-p}$. When computing the answer using only $p$ digits, however, the rightmost digit of $y$ gets shifted off, so the computed difference is $\beta^{-p+1}$. Thus, the error is $\beta^{-p+1} - \beta^{-p} = \beta^{-p}(\beta - 1)$, and the relative error is $\beta^{-p}(\beta - 1)/\beta^{-p} = \beta - 1$.

When $\beta = 2$, the absolute error can be as large as the result, and when $\beta = 10$, it can be nine times larger. To put it another way, when $\beta = 2$ we have $\epsilon = 2^{-p}$, so a relative error of 1 equals $2^p \epsilon$, and the number of contaminated digits is $\log_2(1/\epsilon) = \log_2(2^p) = p$. That is, all of the $p$ digits in the result are wrong!
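The worst case in Theorem 1 can be checked numerically for a few bases. This sketch uses exact rational arithmetic and models the shifted-off digit by truncation; the function name is made up for the illustration.

```python
from fractions import Fraction

def worst_case_relative_error(beta: int, p: int) -> Fraction:
    """Relative error of x - y for x = 1.00...0 and y = 0.rr...r, r = beta - 1."""
    x = Fraction(1)
    y = 1 - Fraction(1, beta**p)              # p digits, all equal to beta - 1
    exact = x - y                             # beta**-p
    scale = beta ** (p - 1)                   # last kept place is beta**-(p-1)
    y_kept = Fraction(int(y * scale), scale)  # shift right, drop the last digit
    computed = x - y_kept                     # beta**-(p-1)
    return abs(computed - exact) / exact

for beta, p in [(2, 24), (10, 3), (16, 6)]:
    print(beta, p, worst_case_relative_error(beta, p))
```

The result depends only on the base, not on the precision.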

Suppose one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes:

x = 1.010 x 10^1
y = 0.993 x 10^1
x – y = 0.017 x 10^1, and the answer is exact.

With a single guard digit, the relative error of the result may still be greater than $\epsilon$, as in 110 – 8.59:
x = 1.10 x 10^2
y = 0.085 x 10^2
x – y = 1.015 x 10^2
This rounds to 102, compared with the correct answer of 101.41, for a relative error of 0.006, which is greater than $\epsilon = 0.005$.
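Both guard-digit examples can be reproduced with a small model in Python's `decimal` module. As before, this is a sketch under stated assumptions: base 10, x ≥ y > 0, truncation of the shifted operand, round-half-even for the final rounding, and a hypothetical function name.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

def subtract_with_guard(x: Decimal, y: Decimal, p: int, guard: int = 1) -> Decimal:
    """x - y with y truncated to p + guard digits, then rounded to p digits."""
    last_kept = Decimal(1).scaleb(x.adjusted() - (p - 1 + guard))
    y_kept = y.quantize(last_kept, rounding=ROUND_DOWN)
    return Context(prec=p, rounding=ROUND_HALF_EVEN).plus(x - y_kept)

eps = Decimal("0.005")  # (beta / 2) * beta**-p for beta = 10, p = 3

report = []
for x, y in [(Decimal("10.1"), Decimal("9.93")), (Decimal("110"), Decimal("8.59"))]:
    computed = subtract_with_guard(x, y, p=3)
    exact = x - y
    report.append((computed, exact, abs(computed - exact) / exact))

for computed, exact, relative_error in report:
    print(computed, exact, relative_error, relative_error > eps)
```

The first subtraction is exact, while the second exceeds ε by a small margin.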

In general, the relative error of the result can be only slightly larger than $\epsilon$. More precisely, we have:

**Theorem 2**