# Floating Point Numbers

## What is a Floating Point Number?

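**Definition**: A method to approximate real numbers (including fractional values) in binary, using a base-2 version of scientific notation.

**Why "Floating Point"?** The radix (binary) point is not fixed in place: the exponent moves ("floats") it, so the same number of bits can cover both very large and very small values.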

### Key Components

- **Sign bit**: Indicates positive (0) or negative (1)
- **Exponent**: Sets the scale as a power of 2 (stored with a bias)
- **Mantissa / Significand**: Encodes the significant digits; for normalized numbers only the fractional part is stored, and the leading 1 is implicit

### Formula

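$$\text{Value} = (-1)^{\text{sign}} \times \text{mantissa} \times 2^{\text{exponent}}$$

Here "mantissa" is the full significand (1.fraction for normalized numbers) and "exponent" is the actual exponent, i.e. the stored exponent minus the bias.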

## IEEE 754 Standard Format

### 32-bit Float (Single Precision)

```
| S | Exponent (8 bits) | Mantissa (23 bits) |
| 1 | 8 bits           | 23 bits            |
```
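
The layout above can be read back out of a value with shifts and masks. The following is a minimal sketch, assuming `float` is a 32-bit IEEE 754 type on the target platform; the `fp32_fields` struct and `fp32_split` function are illustrative names, not part of any standard library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single-precision value. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, in stored (biased) form */
    uint32_t mantissa;  /* 23 bits, without the implicit leading 1 */
} fp32_fields;

/* Split a float into its fields. Assumes float is 32-bit IEEE 754. */
fp32_fields fp32_split(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* well-defined way to read raw bits */

    fp32_fields f;
    f.sign     = bits >> 31;             /* top bit */
    f.exponent = (bits >> 23) & 0xFFu;   /* next 8 bits */
    f.mantissa = bits & 0x7FFFFFu;       /* low 23 bits */
    return f;
}
```

For instance, `fp32_split(-1.0f)` should give sign 1, stored exponent 127, and mantissa 0, since -1.0 is 1.0 × 2⁰ with the sign bit set.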

### 64-bit Double (Double Precision)

```
| S | Exponent (11 bits) | Mantissa (52 bits) |
| 1 | 11 bits            | 52 bits            |
```

### Precision & Range

| Format        | Precision             | Range               |
|---------------|-----------------------|---------------------|
| 32-bit float  | ~7 decimal digits     | ~10⁻³⁸ to ~10³⁸    |
| 64-bit double | ~15–16 decimal digits | ~10⁻³⁰⁸ to ~10³⁰⁸  |
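
One way to see the precision limits in the table is to check which integers survive storage in each format. The sketch below assumes IEEE 754 `float` and `double`; the function names are hypothetical. A 24-bit significand means every integer up to 2²⁴ = 16777216 is exact in a `float`, but 16777217 is not.

```c
#include <stdint.h>

/* Returns 1 if the integer n is unchanged after being stored in a float. */
int fits_in_float(int64_t n)
{
    volatile float f = (float)n;    /* volatile forces rounding to 32-bit storage */
    return (int64_t)f == n;
}

/* Same check for double, whose significand has 53 bits. */
int fits_in_double(int64_t n)
{
    volatile double d = (double)n;  /* volatile forces rounding to 64-bit storage */
    return (int64_t)d == n;
}
```

With these helpers, `fits_in_float(16777217)` returns 0 while `fits_in_double(16777217)` returns 1; the double only starts losing integers above 2⁵³.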

## Example: Converting 12.375

### Step-by-Step Conversion

**Step 1: Convert to Binary**
- 12.375₁₀ = 1100.011₂ (12₁₀ = 1100₂; 0.375₁₀ = 0.011₂, since 0.375 = 1/4 + 1/8)

**Step 2: Normalize**
- 1100.011₂ = 1.100011₂ × 2³ (binary point shifted 3 places to the left)

**Step 3: Extract Components**
- **Sign (S)** = 0
- **Exponent (E)** = 3 + 127 (bias) = 130 = `10000010`
- **Mantissa (M)** = `10001100000000000000000` (the bits after the binary point, `100011`, padded with zeros to 23 bits; the leading `1` is implicit and not stored)

```
Bit Pattern: 0 | 10000010 | 10001100000000000000000
Hexadecimal: 0x41460000
```
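
The result can be checked by assembling the three fields and reinterpreting them as a `float`. This is a sketch under the assumption that `float` is 32-bit IEEE 754; `fp32_pack` and `fp32_bits` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Assemble a float from sign, stored (biased) exponent, and 23-bit mantissa. */
float fp32_pack(uint32_t sign, uint32_t stored_exponent, uint32_t mantissa)
{
    uint32_t bits = ((sign & 1u) << 31)
                  | ((stored_exponent & 0xFFu) << 23)
                  | (mantissa & 0x7FFFFFu);
    float x;
    memcpy(&x, &bits, sizeof x);   /* reinterpret the bit pattern as a float */
    return x;
}

/* Return the raw 32-bit pattern of a float. */
uint32_t fp32_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

Here the 23-bit mantissa `10001100000000000000000` is `0x460000`, so `fp32_pack(0, 130, 0x460000)` should compare equal to `12.375f`, and `fp32_bits(12.375f)` should equal `0x41460000`.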

## Why Is a Bias Added to the Exponent?

### The Problem Without Bias

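- A signed (e.g. two's complement) exponent field complicates comparison: negative exponents would look larger than positive ones when the bits are compared as unsigned
- Reserving bit patterns for special values (zero, infinity, NaN) becomes harder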

### IEEE 754 Bias System

- **Single Precision (32-bit)**: Bias = 127
- **Double Precision (64-bit)**: Bias = 1023
- **Formula**: $\text{Stored Exponent} = \text{Actual Exponent} + \text{Bias}$

### Benefits of Bias

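1. **Simplified Comparisons** → For positive values, a larger stored exponent always means a larger number
2. **Natural Ordering** → Bit patterns of same-sign values can be compared as unsigned integers (for negative values the order is reversed)
3. **Encodes Special Values** → The all-zeros exponent is reserved for zero and subnormals, the all-ones exponent for infinity and NaN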
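
The ordering property can be illustrated by comparing raw bit patterns instead of the values themselves. The sketch below assumes 32-bit IEEE 754 `float` and is only valid for non-negative, non-NaN inputs; `less_than_by_bits` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float. */
static uint32_t fp32_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Compare two floats using only unsigned integer comparison of their bits.
   Assumption: both inputs are non-negative and not NaN. */
int less_than_by_bits(float a, float b)
{
    return fp32_bits(a) < fp32_bits(b);
}
```

Because the biased exponent sits above the mantissa in the bit pattern, `less_than_by_bits(0.5f, 2.0f)` agrees with `0.5f < 2.0f` without any floating-point comparison taking place.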

## BF16 (Brain Floating Point) Format

### BF16 Structure

```
| S | Exponent (8 bits) | Mantissa (7 bits) |
| 1 | 8 bits           | 7 bits           |
```

### Key Characteristics

- Same 8-bit exponent as FP32 (bias = 127), so the same dynamic range
- Only 7 stored mantissa bits → ~2–3 decimal digits of precision
- Identical to the upper 16 bits of an FP32 value, which makes hardware conversion easy

### Simple Conversion

```c
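/* fp32_bits: the float's raw 32-bit pattern as a uint32_t (<stdint.h>), not its numeric value */
uint16_t bf16 = (uint16_t)(fp32_bits >> 16);  /* keeps sign, exponent, top 7 mantissa bits */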
```
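
A fuller version of the conversion has to read the float's bit pattern first, since a shift cannot be applied to a floating-point value directly. The following is a minimal sketch, assuming 32-bit IEEE 754 `float`; the function names are hypothetical, and the narrowing step truncates (rounds toward zero) rather than rounding to nearest.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 by truncation: keep the upper 16 bits of the pattern. */
uint16_t fp32_to_bf16_trunc(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read the raw bit pattern */
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32: place the 16 bits in the upper half, zero-fill the rest.
   This direction is exact, because every BF16 value is also an FP32 value. */
float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Values whose low 16 mantissa bits are already zero pass through unchanged: `12.375f` becomes `0x4146` and converts back to exactly `12.375f`. A value such as `3.14159f` loses its low bits and comes back as `3.140625f`.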

## Format Comparison (Float32, BF16, and Float16)

| Format | Sign | Exponent | Mantissa | Range           | Precision    |
|--------|------|----------|----------|-----------------|--------------|
| FP32   | 1    | 8        | 23       | ~10⁻³⁸–10³⁸    | ~7 digits    |
| BF16   | 1    | 8        | 7        | ~10⁻³⁸–10³⁸    | ~2–3 digits  |
| FP16   | 1    | 5        | 10       | ~6×10⁻⁵–6.5×10⁴ | ~3–4 digits  |
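
The precision and bias figures follow from the bit widths in the table. As a rough sketch (function names are hypothetical, and the digit count is an approximation based on log₁₀(2) ≈ 0.30103), they can be derived like this:

```c
/* Approximate decimal digits carried by a format with the given number of
   stored mantissa bits. The implicit leading 1 adds one bit of precision. */
int approx_decimal_digits(int mantissa_bits)
{
    /* p significand bits hold about p * log10(2) decimal digits */
    return (int)((mantissa_bits + 1) * 0.30103);
}

/* Exponent bias for a format with the given exponent width: 2^(e-1) - 1. */
int exponent_bias(int exponent_bits)
{
    return (1 << (exponent_bits - 1)) - 1;
}
```

For FP32 this gives 7 digits and a bias of 127; for BF16, 2 digits with the same bias of 127; for FP16, 3 digits and a bias of 15. The results are rounded down, which is why the table quotes ranges such as "~2–3 digits".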

### Common Applications

| Format | Use Case                           |
|--------|------------------------------------|
| FP32   | Scientific, GPU computing          |
| FP64   | High-precision scientific tasks    |
| BF16   | Deep learning (training/inference) |
| FP16   | Mobile, edge, low-power devices    |

## Why BF16 in AI/ML?

### Four Key Advantages

1. **Wide Range**: Can handle large/small weights
   - Same exponent range as FP32

2. **Efficient Conversion**:
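   - Simple bit shifting operation on the FP32 bit pattern
   - Truncation alone already yields a valid BF16 value; rounding to nearest is optional and only improves accuracy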

3. **Stable Gradients**:
   - Far less prone to overflow/underflow during training than FP16, because the full FP32 exponent range is kept

4. **Memory Efficient**:
   - Half the size of FP32
   - Faster data transfer
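
The range and memory points can be illustrated together. The sketch below assumes 32-bit IEEE 754 `float` and models BF16 by truncation; `bf16_round_trip` and `storage_bytes` are hypothetical helpers, not an API of any ML framework.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round-trip a float through BF16 by truncation: the sign, the exponent,
   and the top 7 mantissa bits survive; the low 16 bits are cleared. */
float bf16_round_trip(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;              /* drop the low 16 mantissa bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Storage needed for n values at a given width in bits. */
size_t storage_bytes(size_t n, unsigned bits_per_value)
{
    return n * (bits_per_value / 8u);
}
```

A magnitude such as `1e30f`, far beyond FP16's largest finite value of 65504, stays finite after `bf16_round_trip` and lands within 1% of the original, while `storage_bytes(1000000, 16)` is half of `storage_bytes(1000000, 32)`.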