# Floating-Point Basics

## Objective

Understand the fundamentals of IEEE-754 floating-point arithmetic:
- Representation and precision limits
- Machine epsilon and unit roundoff
- Rounding modes and error bounds
- Units in Last Place (ULP) analysis

## IEEE-754 Representation

A normal floating-point number in IEEE-754 double precision (binary64) has the form:

$$x = (-1)^s \times 1.f \times 2^{e-1023}$$

where:
- $s$ is the sign bit (1 bit)
- $e$ is the biased exponent (11 bits, bias 1023, with $1 \le e \le 2046$ for normal numbers)
- $f$ is the fractional part of the significand (52 bits)

**Total**: 64 bits = 1 + 11 + 52

The leading 1 is implicit (normalized form), giving 53 bits of precision.
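
The field layout can be checked directly by reinterpreting the 8 bytes of a double as an integer. The sketch below uses only the standard library; `decode_double` is a hypothetical helper name introduced for illustration, and the reconstruction formula applies to normal numbers only.

```python
import struct

def decode_double(x):
    """Split a Python float into its binary64 fields (sign, biased exponent, fraction)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)     # 52 bits
    return sign, exponent, fraction

s, e, f = decode_double(-6.5)
# Rebuild the value from the fields (valid for normal numbers only)
value = (-1) ** s * (1 + f / 2**52) * 2.0 ** (e - 1023)
print(s, e, hex(f), value)
```

Here $-6.5 = -1.625 \times 2^2$, so the biased exponent is $2 + 1023 = 1025$ and the fraction field encodes $0.625$.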

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append('..')

from utils.floating_point_tools import machine_epsilon, ulp_distance
from utils.error_metrics import relative_error

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['lines.linewidth'] = 2

## Machine Epsilon

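**Machine epsilon** ($\varepsilon_{\text{mach}}$) is the distance from 1 to the next larger floating-point number. Equivalently, it is the smallest power of two such that:

$$\text{fl}(1 + \varepsilon_{\text{mach}}) > 1$$

in round-to-nearest floating-point arithmetic. For IEEE-754 double precision: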

$$\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$$

The **unit roundoff** $u$ is defined as:

$$u = \frac{\varepsilon_{\text{mach}}}{2} = 2^{-53} \approx 1.11 \times 10^{-16}$$

Under round-to-nearest, $u$ bounds the relative error incurred when a real number in the normal range is rounded to a floating-point number.
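
One common way to find $\varepsilon_{\text{mach}}$ empirically is to halve a candidate until adding half of it to 1 no longer changes the result. The sketch below is an illustration under that assumption, not the `machine_epsilon` helper imported from `utils`, whose implementation is not shown here; the name `machine_epsilon_sketch` is hypothetical.

```python
def machine_epsilon_sketch():
    """Return the gap between 1.0 and the next larger Python float."""
    eps = 1.0
    # Stop as soon as 1 + eps/2 rounds back to 1 (round-to-nearest, ties to even)
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon_sketch()
print(eps == 2.0**-52)        # gap between 1.0 and the next double
print(1.0 + eps / 2 == 1.0)   # half the gap is rounded away
```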

In [None]:
# Compute machine epsilon empirically
eps_computed = machine_epsilon(np.float64)
eps_numpy = np.finfo(np.float64).eps

print("Machine Epsilon Analysis")
print("=" * 50)
print(f"Computed machine epsilon: {eps_computed:.6e}")
print(f"NumPy machine epsilon:    {eps_numpy:.6e}")
print(f"Theoretical (2^-52):      {2**-52:.6e}")
print(f"\nUnit roundoff (eps/2):    {eps_computed/2:.6e}")
print(f"Theoretical (2^-53):      {2**-53:.6e}")

# Verify the definition
print(f"\nVerification:")
print(f"1 + eps/2 == 1:           {1.0 + eps_computed/2 == 1.0}")
print(f"1 + eps > 1:              {1.0 + eps_computed > 1.0}")

## Rounding Error Bound

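For any real number $x$ in the normal range (no overflow or underflow), the floating-point representation $\text{fl}(x)$ under round-to-nearest satisfies:

$$\text{fl}(x) = x(1 + \delta), \quad |\delta| \leq u$$

IEEE-754 requires each basic operation to return the correctly rounded exact result, so the same bound holds for arithmetic on floating-point operands $x$ and $y$. This is the **fundamental axiom of floating-point arithmetic**: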

$$\text{fl}(x \circ y) = (x \circ y)(1 + \delta), \quad |\delta| \leq u$$

where $\circ \in \{+, -, \times, \div\}$.
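
The bound can be verified exactly with rational arithmetic: `fractions.Fraction` converts a float to the exact rational it stores, so $\delta$ can be computed without further rounding. This is a minimal sketch for one pair of operands; `relative_rounding_error` is a hypothetical helper name, and the check assumes the results stay in the normal range.

```python
import operator
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff for binary64

def relative_rounding_error(x, y, op):
    """Return |delta| in fl(x op y) = (x op y)(1 + delta), computed exactly."""
    exact = op(Fraction(x), Fraction(y))  # exact result on the stored operands
    computed = Fraction(op(x, y))         # correctly rounded float result
    return abs((computed - exact) / exact)

deltas = {}
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    deltas[op.__name__] = relative_rounding_error(0.1, 0.3, op)
    print(f"{op.__name__:<8} |delta| = {float(deltas[op.__name__]):.3e}")
```

Note that the operands here are the doubles nearest to 0.1 and 0.3, not the decimal values themselves; the axiom only covers the rounding of the operation.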

In [None]:
# Demonstrate rounding error for random numbers
n_samples = 10000
x_values = np.random.uniform(-1e10, 1e10, n_samples)

# Simulate rounding by converting to float32 and back
# (exaggerates effect for visualization)
x_rounded = x_values.astype(np.float32).astype(np.float64)

# Compute relative errors
rel_errors = np.abs((x_rounded - x_values) / x_values)
# Drop inf/nan, and exact zeros so that log10 below stays finite
rel_errors = rel_errors[np.isfinite(rel_errors) & (rel_errors > 0)]

# Unit roundoff of float32: the round-to-nearest bound on the relative error
u32 = np.finfo(np.float32).eps / 2

# Plot distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(rel_errors, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(u32, color='red', linestyle='--',
            label=f'float32 unit roundoff = {u32:.2e}')
plt.xlabel('Relative Error')
plt.ylabel('Frequency')
plt.title('Distribution of Rounding Errors (float64 → float32 → float64)')
plt.legend()
plt.yscale('log')

plt.subplot(1, 2, 2)
plt.hist(np.log10(rel_errors), bins=50, edgecolor='black', alpha=0.7)
plt.axvline(np.log10(u32), color='red', linestyle='--',
            label='log10(float32 unit roundoff)')
plt.xlabel('log10(Relative Error)')
plt.ylabel('Frequency')
plt.title('Log-scale Distribution')
plt.legend()

plt.tight_layout()
plt.savefig('../plots/01_rounding_error_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Maximum relative error: {np.max(rel_errors):.6e}")
print(f"Mean relative error:    {np.mean(rel_errors):.6e}")
print(f"Median relative error:  {np.median(rel_errors):.6e}")

## Units in Last Place (ULP)

The **ULP** (unit in the last place) of $x$ is the spacing between consecutive floating-point numbers near $x$.

For a normal number $x$ with biased exponent $e$ (as in the representation above), the ULP is:

$$\text{ulp}(x) = 2^{e-1023-52}$$

The ULP distance between two floats counts the steps between adjacent representable numbers needed to get from one to the other.
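
A ULP distance can be computed from bit patterns, because doubles of the same sign sort in the same order as their bit patterns read as integers. The sketch below is one possible approach and is not the `ulp_distance` helper imported from `utils`, whose implementation is not shown here; the names `float_to_ordinal` and `ulp_distance_sketch` are hypothetical, and NaN inputs are not handled.

```python
import struct

def float_to_ordinal(x):
    """Map a double to an integer whose ordering matches floating-point ordering."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))  # signed 64-bit view
    # Negative doubles are sign-magnitude: negate the magnitude so order is monotonic
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance_sketch(a, b):
    """Number of steps between adjacent doubles needed to get from a to b."""
    return abs(float_to_ordinal(a) - float_to_ordinal(b))

print(ulp_distance_sketch(1.0, 1.0 + 2.0**-52))  # adjacent doubles
print(ulp_distance_sketch(0.1 + 0.2, 0.3))       # the classic one-ULP mismatch
print(ulp_distance_sketch(1.0, 2.0))             # all doubles in [1, 2)
```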

In [None]:
# Demonstrate ULP spacing
test_values = [1.0, 10.0, 100.0, 1000.0, 1e10, 1e-10]

print("ULP Spacing Analysis")
print("=" * 70)
print(f"{'Value':<15} {'ULP Spacing':<20} {'Next Float':<20}")
print("=" * 70)

for x in test_values:
    spacing = np.spacing(x)
    next_float = np.nextafter(x, np.inf)
    print(f"{x:<15.2e} {spacing:<20.6e} {next_float:<20.16e}")

# Visualize ULP spacing across magnitudes
magnitudes = np.logspace(-10, 10, 100)
ulp_spacings = np.array([np.spacing(x) for x in magnitudes])

plt.figure(figsize=(10, 6))
plt.loglog(magnitudes, ulp_spacings, linewidth=2)
plt.loglog(magnitudes, magnitudes * np.finfo(np.float64).eps, 
           '--', label=r'$x \cdot \varepsilon_{\mathrm{mach}}$')
plt.xlabel('Magnitude of x')
plt.ylabel('ULP Spacing')
plt.title('ULP Spacing vs Magnitude')
plt.grid(True, alpha=0.3)
plt.legend()
plt.savefig('../plots/01_ulp_spacing.png', dpi=150, bbox_inches='tight')
plt.show()

## Floating-Point Number Density

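Floating-point numbers are **not uniformly distributed** on the real line. They are:
- Denser near zero
- Sparser at large magnitudes
- Roughly logarithmically distributed: every interval $[2^k, 2^{k+1})$ in the normal range contains the same number of doubles ($2^{52}$)

Because the spacing grows with magnitude, absolute precision degrades for large values, which matters whenever numbers of very different size are combined.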
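
One consequence is absorption: a small term added to a much larger one can fall below the local spacing and vanish. The sketch below illustrates this at $2^{53}$, where adjacent doubles are 2 apart; the summation order is chosen only to make the effect visible and is not a recommended summation algorithm.

```python
big = 2.0**53                  # adjacent doubles are 2.0 apart at this magnitude
absorbed = (big + 1.0 == big)  # the added 1.0 is rounded away
print(absorbed)

terms = [big] + [1.0] * 1000

# Large term first: every subsequent 1.0 is absorbed
naive = 0.0
for t in terms:
    naive += t

# Small terms first: they accumulate to 1000.0 before meeting the large term
reordered = 0.0
for t in sorted(terms):
    reordered += t

print(naive - big, reordered - big)
```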

In [None]:
# Count floating-point numbers in intervals at different magnitudes
def count_floats_in_interval(a, b):
    """Count representable float64 values in [a, b) for 0 < a <= b.

    Positive doubles sort in the same order as their bit patterns read as
    integers, so the count is the difference of the two bit patterns.
    """
    a_bits = int(np.float64(a).view(np.int64))
    b_bits = int(np.float64(b).view(np.int64))
    return b_bits - a_bits

# Sample intervals at different magnitudes
intervals = [
    (1.0, 2.0),
    (10.0, 11.0),
    (100.0, 101.0),
    (1000.0, 1001.0),
    (1e6, 1e6 + 1),
    (1e9, 1e9 + 1),
]

print("Floating-Point Number Density")
print("=" * 60)
print(f"{'Interval':<25} {'Count':<20}")
print("=" * 60)

for a, b in intervals:
    count = count_floats_in_interval(a, b)
    label = f"[{a:.1f}, {b:.1f})"
    print(f"{label:<25} {count:<20,}")

# Theoretical count for [1, 2)
theoretical_count = 2**52  # All 52-bit fraction patterns with unbiased exponent 0
print(f"\nTheoretical count in [1, 2): {theoretical_count:,}")

## Special Values

IEEE-754 defines special values:
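- **Infinity** ($\pm \infty$): Result of overflow or of dividing a nonzero finite number by zero
- **NaN** (Not a Number): Result of invalid operations (e.g., $0/0$, $\infty - \infty$)
- **Subnormal (denormal) numbers**: Nonzero numbers smaller in magnitude than the smallest normal number, stored without the implicit leading 1 and therefore with reduced precision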

These values follow specific arithmetic rules.

In [None]:
# Demonstrate special values
print("Special Values in IEEE-754")
print("=" * 60)

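# Infinity
# Division by zero uses NumPy scalars: plain Python floats raise ZeroDivisionError
inf = np.inf
print(f"Infinity: {inf}")
with np.errstate(divide='ignore'):
    print(f"1 / 0 = {np.float64(1.0) / np.float64(0.0)}")
print(f"inf + 1 = {inf + 1}")
print(f"inf * 2 = {inf * 2}")
print(f"inf / inf = {inf / inf}")

# NaN
print(f"\nNaN: {np.nan}")
with np.errstate(invalid='ignore'):
    print(f"0 / 0 = {np.float64(0.0) / np.float64(0.0)}")
print(f"inf - inf = {inf - inf}")
print(f"NaN == NaN: {np.nan == np.nan}")
print(f"isnan(NaN): {np.isnan(np.nan)}")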

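# Subnormal (denormal) numbers
# Note: finfo.tiny is an alias for smallest_normal, not the smallest subnormal
smallest_normal = np.finfo(np.float64).smallest_normal        # 2**-1022
smallest_subnormal = np.finfo(np.float64).smallest_subnormal  # 2**-1074 (NumPy >= 1.22)
print(f"\nSmallest positive normal: {smallest_normal:.6e}")
print(f"Smallest positive (subnormal): {smallest_subnormal:.6e}")
print(f"Ratio: {smallest_normal / smallest_subnormal:.1f}")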

## Key Takeaways

1. **Finite precision**: Only ~16 decimal digits of precision in float64

2. **Relative error bound**: Round-to-nearest introduces relative error ≤ $u = 2^{-53} \approx 1.11 \times 10^{-16}$ for results in the normal range

3. **Non-uniform distribution**: Floating-point numbers are logarithmically spaced

4. **ULP spacing**: Grows with magnitude, affecting absolute precision

5. **Special values**: Infinity and NaN require careful handling

These fundamental properties underpin all subsequent analysis of numerical stability.

In [None]:
# Summary statistics
print("IEEE-754 Double Precision Summary")
print("=" * 60)
info = np.finfo(np.float64)
print(f"Precision (bits):        {info.nmant + 1}")
print(f"Exponent range:          [{info.minexp}, {info.maxexp}]")
print(f"Machine epsilon:         {info.eps:.6e}")
print(f"Smallest normal:         {info.smallest_normal:.6e}")
print(f"Largest representable:   {info.max:.6e}")
print(f"Smallest positive:       {info.smallest_subnormal:.6e}")
print(f"Decimal precision:       ~{info.precision} digits")