### A3.2.1. Single Instruction Multiple Data Concepts

$$
\text{Speedup}_{\text{SIMD}} = \frac{n}{\lceil n / w \rceil}
$$

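where $n$ is the number of elements and $w$ is the number of SIMD lanes (elements per vector register). This is an ideal upper bound: it counts arithmetic instructions only and ignores memory bandwidth and loop overhead.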
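As a quick check of the formula, the sketch below evaluates it for a few values of $n$ and $w$. The helper name `ideal_simd_speedup` is illustrative, not from any library, and the numbers are a best case rather than a measurement.

```python
def ideal_simd_speedup(n: int, w: int) -> float:
    """Ideal speedup n / ceil(n / w): scalar op count over vector op count."""
    if n <= 0 or w <= 0:
        raise ValueError("n and w must be positive")
    vector_ops = (n + w - 1) // w  # integer ceil(n / w); a partial last vector still costs one op
    return n / vector_ops


for n, w in [(8, 8), (10, 4), (1000, 16)]:
    print(f"n={n}, w={w}: {ideal_simd_speedup(n, w):.2f}x")
```

When $n$ is not a multiple of $w$, the last vector is only partly filled, so the speedup falls short of $w$: 10 elements with 4 lanes need 3 vector operations, giving about 3.33x rather than 4x.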

**Explanation:**

**SIMD (Single Instruction, Multiple Data)** executes one instruction on multiple data elements simultaneously using wide vector registers. Instead of adding two scalars, a single SIMD add processes 4, 8, or 16 32-bit floats in one instruction, depending on the register width.
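The idea can be mimicked in plain Python by processing the data in chunks, where each chunk stands in for one vector instruction. This is only a model of the instruction count (the function `simd_style_add` is a hypothetical helper, and the Python code itself still runs element by element).

```python
def simd_style_add(a, b, lanes):
    """Add two equal-length sequences in chunks of `lanes` elements.

    Each chunk models one vector add. Returns (result, simulated_instruction_count).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = []
    vector_ops = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # One simulated vector add: the same operation applied to every lane.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        vector_ops += 1
    return result, vector_ops


a = [float(i) for i in range(8)]
b = [10.0] * 8
for lanes in (1, 4, 8):
    total, ops = simd_style_add(a, b, lanes)
    print(f"lanes={lanes}: {ops} simulated add instruction(s)")
```

With 8 elements, 1 lane needs 8 adds, 4 lanes need 2, and 8 lanes need 1, while the result is identical in every case.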

**x86 SIMD Instruction Sets:**

| ISA Extension | Register Width | Float Lanes (32-bit) | Year |
|---------------|---------------|---------------------|------|
| SSE           | 128-bit       | 4                    | 1999 |
| AVX           | 256-bit       | 8                    | 2011 |
| AVX-512       | 512-bit       | 16                   | 2017 |
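The lane counts in the table follow from dividing the register width by the element width, so other element types give different counts. The sketch below tabulates a few; the dictionary names are illustrative. It is pure arithmetic on register width: whether a given extension actually has instructions for an element type is a separate question (for example, 256-bit integer operations arrived with AVX2, not the original AVX).

```python
REGISTER_BITS = {"SSE": 128, "AVX": 256, "AVX-512": 512}
ELEMENT_BITS = {"int8": 8, "int16": 16, "float32": 32, "float64": 64}


def lanes_per_register(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector register."""
    return register_bits // element_bits


# lane_table[isa][element_type] -> lane count
lane_table = {
    isa: {elem: lanes_per_register(reg_bits, elem_bits)
          for elem, elem_bits in ELEMENT_BITS.items()}
    for isa, reg_bits in REGISTER_BITS.items()
}

for isa, row in lane_table.items():
    cells = ", ".join(f"{elem}: {lanes}" for elem, lanes in row.items())
    print(f"{isa:8s} {cells}")
```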

**ARM SIMD:**

| ISA Extension | Register Width | Float Lanes (32-bit) |
|---------------|---------------|---------------------|
| NEON          | 128-bit       | 4                    |
| SVE/SVE2      | 128–2048-bit   | 4–64 (implementation-defined) |
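Because the SVE register width is fixed by the hardware rather than by the instruction set, SVE code is usually written so the same loop works at any vector length, with a per-lane predicate masking off lanes past the end of the data. The sketch below simulates that pattern in plain Python. `vla_add` is a hypothetical name, and real SVE code would generate the predicate with an instruction such as `whilelt` instead of a Python list.

```python
def vla_add(a, b, vector_lanes):
    """Simulate a vector-length-agnostic add loop with a per-lane predicate.

    Returns (result, loop_iterations).
    """
    n = len(a)
    out = [0.0] * n
    i = 0
    iterations = 0
    while i < n:
        # Predicate: lane k is active only if element i + k exists.
        predicate = [i + k < n for k in range(vector_lanes)]
        for k, active in enumerate(predicate):
            if active:
                out[i + k] = a[i + k] + b[i + k]
        i += vector_lanes
        iterations += 1
    return out, iterations


a = [float(i) for i in range(10)]
b = [1.0] * 10
for register_bits in (128, 256, 512):
    lanes = register_bits // 32  # 32-bit float lanes
    result, iterations = vla_add(a, b, lanes)
    print(f"{register_bits}-bit vectors: {iterations} iteration(s)")
```

The loop body never mentions a specific width: 10 elements take 3 iterations at 128 bits, 2 at 256 bits, and 1 at 512 bits, with no separate scalar tail loop.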

**Requirements for SIMD:**

- **Data parallelism**: the same operation on independent elements.
- **Contiguous memory layout**: aligned, stride-1 access patterns are fastest.
- **No cross-lane dependencies**: each lane operates independently (reductions need special shuffle/horizontal ops).
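To illustrate why reductions need extra work, the sketch below sums the lanes of one simulated vector by repeatedly folding the upper half onto the lower half. This mirrors the shape of a shuffle-and-add sequence but is not tied to any specific instruction; `horizontal_sum` is a hypothetical helper.

```python
def horizontal_sum(lanes):
    """Sum a power-of-two number of lanes by folding the upper half onto the lower half.

    Returns (total, folding_steps).
    """
    values = list(lanes)
    if not values or len(values) & (len(values) - 1):
        raise ValueError("lane count must be a power of two")
    steps = 0
    while len(values) > 1:
        half = len(values) // 2
        # One cross-lane step: lane k receives lane k + lane (k + half).
        values = [values[k] + values[k + half] for k in range(half)]
        steps += 1
    return values[0], steps


total, steps = horizontal_sum([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(f"sum={total}, folding steps={steps}")
```

Eight lanes need three folding steps (8 to 4 to 2 to 1), so a reduction costs $\log_2 w$ cross-lane operations on top of the ordinary lane-wise adds.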

**Sources of SIMD Code:**

1. **Compiler auto-vectorization**: the compiler detects vectorizable loops and emits SIMD instructions for them.
2. **Intrinsics**: explicit SIMD calls in source code (e.g., `_mm256_add_ps`).
3. **Libraries**: NumPy, Eigen, etc. use SIMD internally.
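For the library route, it can be useful to see which SIMD extensions NumPy detected on the current machine. The sketch below reads `__cpu_features__`, a private and undocumented NumPy attribute whose location differs between NumPy 1.x and 2.x and which may change without notice. Treat this as an assumption-laden diagnostic, not a supported API; the helper returns an empty dict if the attribute cannot be found.

```python
import importlib


def numpy_cpu_features():
    """Return NumPy's detected CPU feature flags as a dict, or {} if unavailable.

    Relies on a private NumPy attribute; both known module paths are tried.
    """
    candidates = (
        "numpy._core._multiarray_umath",  # NumPy 2.x layout
        "numpy.core._multiarray_umath",   # NumPy 1.x layout
    )
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        features = getattr(module, "__cpu_features__", None)
        if features is not None:
            return dict(features)
    return {}


features = numpy_cpu_features()
enabled = sorted(name for name, present in features.items() if present)
print("Detected CPU features:", ", ".join(enabled) if enabled else "(unavailable)")
```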

**Example:**

Adding two arrays of 8 floats:
- Scalar: 8 scalar add instructions (e.g., `addss` on x86).
- AVX: 1 `vaddps ymm` instruction (8 floats in one 256-bit register).

```python
import time

import numpy as np

SIZE = 10_000_000
array_a = np.random.rand(SIZE).astype(np.float32)
array_b = np.random.rand(SIZE).astype(np.float32)

# Baseline: element-by-element addition in a Python loop.
start = time.perf_counter()
scalar_result = np.empty(SIZE, dtype=np.float32)
for index in range(SIZE):
    scalar_result[index] = array_a[index] + array_b[index]
scalar_time = time.perf_counter() - start

# Vectorized: one NumPy call that runs a compiled, SIMD-capable loop.
start = time.perf_counter()
vector_result = array_a + array_b
vector_time = time.perf_counter() - start

# Note: most of this gap is Python interpreter overhead, not SIMD alone.
print(f"Elements: {SIZE:,}")
print(f"Scalar loop: {scalar_time:.3f}s")
print(f"NumPy vectorized: {vector_time:.6f}s")
print(f"Speedup: {scalar_time / vector_time:.0f}x")
print(f"Results match: {np.allclose(scalar_result, vector_result)}")

# Lane count = register width / element width.
simd_widths = {
    "SSE (128-bit)": 128,
    "AVX (256-bit)": 256,
    "AVX-512 (512-bit)": 512,
}
element_bits = 32

print(f"\nTheoretical lanes for {element_bits}-bit floats:")
for name, width in simd_widths.items():
    lanes = width // element_bits
    print(f"  {name}: {lanes} lanes")
```

**References:**

[📘 Intel Corporation. *Intel Intrinsics Guide.*](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html)

[📘 Hennessy, J. & Patterson, D. (2019). *Computer Architecture: A Quantitative Approach (6th ed.).* Morgan Kaufmann.](https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1)
