### Hop-Typed Communication Model (NVLink vs NDR) for **LLaMA-3 8B**

This notebook estimates the communication **latency** and **energy** incurred when training or serving LLaMA-3 8B on one or two DGX H100 nodes (8 or 16 GPUs). Each network hop is classified as **NVLink 4** or **InfiniBand NDR 400**, so the model can account for the different bandwidths, startup latencies, and energy costs of the two link types.

---

#### Parallelism Strategies Examined
| Strategy | Communication Pattern (per transformer block) |
|----------|----------------------------------------------|
| **Data Parallel (DP)** | Ring all-reduce of each layer's gradients (sized at 2 B per element) across all GPUs. |
| **Tensor Parallel (TP)** | Reduce-scatter / all-reduce / all-gather on activation slices that have been sharded over GPUs. |
| **Pipeline Parallel (PP)** | Point-to-point transfer of block-output activations between consecutive pipeline stages. *Pipeline bubbles are **not** modeled.* |

---

#### Model Scope
- **Architecture:** LLaMA-3 8B  
  – 32 transformer layers, hidden size = 4096, grouped-query attention (8 KV heads, K/V projection width 1024)  
- **Numerics:** FP16 / BF16 (2 B per element)  
- **Sequence Length:** 2048 tokens (for activation sizing)

---

#### Hardware Link Constants
| Link            | Peak BW | Startup \( \alpha \) | Inverse BW \( \beta = 1/\text{BW} \) | Energy/bit |
|-----------------|---------|----------------------|--------------------------------------|------------|
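| **NVLink 4**    | 900 GB/s | 0.2 µs | 11 ps/B | 5 pJ |
| **NDR 400**     | 50 GB/s  | 2 µs  | 20 ps/B | 25 pJ |

Note: the NVLink \( \beta \) of 11 ps/B is the value the code cells use (`1/90e9` s/B, an effective 90 GB/s), and every result in this notebook was computed with it. The 900 GB/s peak on its own would give \( \beta \approx 1.1 \) ps/B.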

A message of size \(M\) bytes traversing \(n_{\text{NV}}\) NVLink hops and \(n_{\text{IB}}\) InfiniBand hops incurs  

$$
\text{Latency} = n_{\text{NV}}\bigl(\alpha_{\text{NV}} + \beta_{\text{NV}} M\bigr) + n_{\text{IB}}\bigl(\alpha_{\text{IB}} + \beta_{\text{IB}} M\bigr),
$$

$$
\text{Energy} = 8M\bigl(5\,n_{\text{NV}} + 25\,n_{\text{IB}}\bigr)\ \text{pJ}.
$$
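
As a worked example, the two formulas can be evaluated for one block's activations (2048 tokens × 4096 hidden × 2 B). This is a minimal sketch: `hop_cost` and the upper-case constant names are illustrative, and the constants are the values used in the notebook's code cells (NVLink \( \beta \) = `1/90e9` s/B).

```python
# Link constants as used in the notebook's code cells (assumed here, not re-derived)
ALPHA_NV, BETA_NV, EPB_NV = 0.2e-6, 1 / 90e9, 5e-12    # s, s/B, J/bit
ALPHA_IB, BETA_IB, EPB_IB = 2e-6, 1 / 50e9, 25e-12     # s, s/B, J/bit


def hop_cost(msg_bytes, n_nv, n_ib):
    """Return (latency in s, energy in J) for one message over n_nv NVLink and n_ib IB hops."""
    latency = (n_nv * (ALPHA_NV + BETA_NV * msg_bytes)
               + n_ib * (ALPHA_IB + BETA_IB * msg_bytes))
    energy = 8 * msg_bytes * (EPB_NV * n_nv + EPB_IB * n_ib)   # 8 bits per byte
    return latency, energy


M = 2048 * 4096 * 2   # bytes in one block's output activations

for label, (n_nv, n_ib) in {"intra-node": (1, 0), "cross-node": (2, 1)}.items():
    lat, eng = hop_cost(M, n_nv, n_ib)
    print(f"{label}: {lat * 1e3:.3f} ms, {eng * 1e3:.3f} mJ")
```

With these constants the two printed lines correspond to the PP‑NV and PP‑Cross rows of the results table.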

---

#### Objectives of the Notebook
1. **Quantify Communication Cost**: Report latency (ms) and energy (mJ) per operation for each strategy and hop mix.  
2. **Compare Parallel Schemes**: Highlight how DP, TP, and PP trade off communication time and energy under the same hardware assumptions.  
3. **Guide Design Decisions**: Provide first-order numbers that help decide which parallelism (or combination) is appropriate for a given training or inference workload.

---

#### Notebook Outputs
- **Per-Hop Latency** and **Energy** tables for each collective or point-to-point operation.  
- **Data Volume** moved (MB) per GPU and in aggregate.  
- Consolidated **comparative charts** for DP, TP, and PP to illustrate trade-offs.

> **Caveat:**  Results are first-order; they do not model overlap, asynchronous progress, or pipeline fill/drain bubbles. They nevertheless capture the dominant communication costs needed for quick design-space exploration.

In [36]:
import numpy as np
import pandas as pd

# ---------- Model sizes ----------
BYTES_FP16 = 2
d_model = 4096
d_ff    = 14336
seq_len = 2048

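# Self-attention params with grouped-query attention (GQA):
# Q and O projections are d_model x d_model; K and V are 1024 x d_model each (8 KV heads x 128)
attn_elems = d_model*d_model + 2*1024*d_model + d_model*d_model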
attn_bytes = attn_elems * BYTES_FP16

# MLP params: up and down projections only (the SwiGLU gate projection is not counted)
mlp_elems  = d_ff*d_model + d_model*d_ff
mlp_bytes  = mlp_elems  * BYTES_FP16

# Activation size (one block output) for PP / TP
act_bytes = seq_len * d_model * BYTES_FP16

layer_bytes = dict(ATTN=attn_bytes, MLP=mlp_bytes)
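
The cell above counts two MLP matrices per layer. LLaMA-style SwiGLU blocks also carry a gate projection of the same shape, so the sketch below shows how the per-layer byte counts would change if it were included. The names `d_kv`, `mlp_two`, and `mlp_three` are illustrative and do not appear in the notebook; the notebook's results use the two-matrix figure.

```python
BYTES_FP16 = 2
d_model, d_ff = 4096, 14336
d_kv = 1024   # K/V projection width: 8 KV heads x 128 (assumed GQA layout)

# Q and O are d_model x d_model; K and V are d_kv x d_model
attn_bytes = (2 * d_model * d_model + 2 * d_kv * d_model) * BYTES_FP16

mlp_two = 2 * d_ff * d_model * BYTES_FP16     # up + down, as in the notebook
mlp_three = 3 * d_ff * d_model * BYTES_FP16   # gate + up + down (hypothetical variant)

MIB = 2 ** 20
for name, nbytes in [("ATTN", attn_bytes),
                     ("MLP (2 matrices)", mlp_two),
                     ("MLP (3 matrices)", mlp_three)]:
    print(f"{name}: {nbytes / MIB:.0f} MiB")
```

Counting the gate would scale every MLP row in the DP results by 1.5, since the all-reduce volume is proportional to the layer's byte count.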

In [35]:
# Hop tuples (nv, ib)
def hops_8():
    H = np.empty((8,8),dtype=object)
    for i in range(8):
        for j in range(8):
            if i==j: H[i,j]=(0,0)
            else:    H[i,j]=(1,0)   # single NV hop for any pair
    return H
hop8 = hops_8()

def hops_16():
    H = np.empty((16,16),dtype=object)
    for i in range(16):
        for j in range(16):
            if i==j: H[i,j]=(0,0)
            else:
                same_node=(i//8)==(j//8)
                if same_node: H[i,j]=(1,0)
                else:         H[i,j]=(2,1)  # GPU→NV + IB + NV
    return H
hop16 = hops_16()
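
To see which ring links the 16-GPU topology makes expensive, the sketch below applies the same hop rule as the matrices above to each clockwise neighbour pair. `hop_tuple` and `ring_links` are illustrative helpers, not functions from the notebook, and the 8-GPUs-per-node layout is the one `hops_16` assumes.

```python
def hop_tuple(i, j, gpus_per_node=8):
    """(NVLink hops, IB hops) between ranks i and j, same rule as hops_16()."""
    if i == j:
        return (0, 0)
    if i // gpus_per_node == j // gpus_per_node:
        return (1, 0)          # same node: one NVLink hop
    return (2, 1)              # different nodes: NVLink + IB + NVLink


def ring_links(n, gpus_per_node=8):
    """Hop tuples for the links r -> (r + 1) % n of a clockwise ring."""
    return [hop_tuple(r, (r + 1) % n, gpus_per_node) for r in range(n)]


links16 = ring_links(16)
cross = [(r, (r + 1) % 16) for r, hops in enumerate(links16) if hops == (2, 1)]
print(cross)   # [(7, 8), (15, 0)]
```

Only two of the sixteen ring links cross nodes, but because each ring step waits for its slowest link, those two set the per-step latency.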

In [34]:
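# alpha: startup latency (s), beta: inverse bandwidth (s/B), epb: energy per bit (J/bit).
# NOTE: beta_nv = 1/90e9 (about 11 ps/B) is what all results here use; 900 GB/s peak would be 1/900e9.
alpha_nv, beta_nv, epb_nv = 0.2e-6, 1/90e9, 5e-12
alpha_ib, beta_ib, epb_ib = 2e-6, 1/50e9, 25e-12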

In [33]:
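def cost_bytes(msg_bytes, nv_hops, ib_hops):
    """Return (latency in s, energy in J) for one message of msg_bytes over the given hops."""
    lat = nv_hops*(alpha_nv+beta_nv*msg_bytes) + ib_hops*(alpha_ib+beta_ib*msg_bytes)
    eng = nv_hops*msg_bytes*8*epb_nv + ib_hops*msg_bytes*8*epb_ib   # 8 bits per byte
    return lat, eng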

In [26]:
rows=[]
def ring_dp(layer_name, layer_b, hopmat, mode_label):
    n = hopmat.shape[0]
    chunk = layer_b / n                      # bytes sent per link per step
    per_gpu_bytes = layer_b * (n - 1) / n    # bytes sent per GPU in ONE phase; both phases together send twice this
    steps = 2 * (n - 1)                      # reduce-scatter + all-gather

    total_lat = 0.0
    total_eng = 0.0

    for _ in range(steps):
        max_link_lat = 0.0
        step_energy  = 0.0

        # each rank communicates with its neighbour every step
        for r in range(n):
            sender    = r
            receiver  = (r + 1) % n          # fixed clockwise ring
            nv, ib     = hopmat[sender, receiver]
            lat, eng   = cost_bytes(chunk, nv, ib)

            max_link_lat = max(max_link_lat, lat)
            step_energy += eng               # all links consume energy

        total_lat += max_link_lat            # critical-path latency
        total_eng += step_energy

    rows.append([layer_name, mode_label,
                 per_gpu_bytes / 1e6,        # bytes -> MB (decimal)
                 total_lat  * 1e3,           # s -> ms
                 total_eng  * 1e3])          # J -> mJ


for L,B in layer_bytes.items():
    ring_dp(L,B,hop8,"DP‑8")
    ring_dp(L,B,hop16,"DP‑16")
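
Because every ring step moves the same chunk over the same links, the loop in `ring_dp` collapses to a closed form: total latency is `2 * (n - 1)` times the latency of the slowest link. The sketch below checks this for the attention layer. `link_latency` and `ring_allreduce_latency` are illustrative names, and the constants are the ones defined in the notebook.

```python
ALPHA_NV, BETA_NV = 0.2e-6, 1 / 90e9   # s, s/B (values used in the notebook)
ALPHA_IB, BETA_IB = 2e-6, 1 / 50e9


def link_latency(msg_bytes, n_nv, n_ib):
    """Latency in seconds of one message over n_nv NVLink and n_ib IB hops."""
    return (n_nv * (ALPHA_NV + BETA_NV * msg_bytes)
            + n_ib * (ALPHA_IB + BETA_IB * msg_bytes))


def ring_allreduce_latency(layer_bytes, n, slowest_hops):
    """Closed form: 2(n-1) steps, each bounded by the slowest ring link."""
    chunk = layer_bytes / n
    return 2 * (n - 1) * link_latency(chunk, *slowest_hops)


attn_bytes = 83_886_080   # attention parameter bytes from the model-size cell

dp8 = ring_allreduce_latency(attn_bytes, 8, (1, 0))     # all links intra-node
dp16 = ring_allreduce_latency(attn_bytes, 16, (2, 1))   # slowest link crosses nodes
print(f"DP-8:  {dp8 * 1e3:.3f} ms")
print(f"DP-16: {dp16 * 1e3:.3f} ms")
```

The two values agree with the ATTN latency entries that `ring_dp` records for the 8-GPU and 16-GPU rings.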

In [27]:
def tp_cost(layer_name, bytes_per_msg):
    # Two messages per layer (reduce-scatter + all-gather, i.e. one all-reduce); each message is one NVLink hop
    nv_hops,ib_hops = 1,0
    lat1,eng1 = cost_bytes(bytes_per_msg, nv_hops, ib_hops)
    lat=2*lat1; eng=2*eng1
    rows.append([layer_name,"TP‑8",bytes_per_msg*2/1e6,lat*1e3,eng*1e3])

tp_bytes = act_bytes  # 4096*seq*2
tp_cost("ATTN", tp_bytes)
tp_cost("MLP" , tp_bytes)

In [28]:
def pp_cost(label, nv_hops, ib_hops):
    lat,eng = cost_bytes(act_bytes,nv_hops,ib_hops)
    rows.append(["BLOCK",label,act_bytes/1e6,lat*1e3,eng*1e3])

pp_cost("PP‑NV",1,0)
pp_cost("PP‑Cross",2,1)

In [37]:
import re

df = pd.DataFrame(rows, columns=["Layer", "Mode", "PerGPU_MB", "Latency_ms", "Energy_mJ"])

# Derive group size from the trailing digits of the Mode label.
# The labels use a non-breaking hyphen (U+2011), so testing for "-8" or "-16"
# with an ASCII hyphen never matches; match the digits instead.
def _group_size(mode: str) -> int:
    m = re.search(r"(\d+)$", mode)
    # Pipeline-parallel point-to-point labels carry no size: assume two ranks
    return int(m.group(1)) if m else 2

df["Group"] = df["Mode"].map(_group_size)

# Aggregate volume across the group (helper column, not part of the displayed table)
df["Total_MB"] = df["PerGPU_MB"] * df["Group"]

# Show binary MiB alongside decimal MB
MB2MiB = 1_000_000 / 1_048_576         # ≈ 0.953674
df["PerGPU_MiB"] = df["PerGPU_MB"] * MB2MiB

# Re-order columns for readability
cols = ["Layer", "Mode",
        "PerGPU_MB",
        "PerGPU_MiB",
        "Latency_ms", "Energy_mJ"]
df = df[cols]

display(df.round(3))

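|   | Layer | Mode | PerGPU_MB | PerGPU_MiB | Latency_ms | Energy_mJ |
|---|-------|------|-----------|------------|------------|-----------|
| 0 | ATTN  | DP‑8     | 73.4    | 70.0  | 1.634  | 46.976  |
| 1 | ATTN  | DP‑16    | 78.643  | 75.0  | 6.713  | 176.161 |
| 2 | MLP   | DP‑8     | 205.521 | 196.0 | 4.57   | 131.533 |
| 3 | MLP   | DP‑16    | 220.201 | 210.0 | 18.667 | 493.25  |
| 4 | ATTN  | TP‑8     | 33.554  | 32.0  | 0.373  | 1.342   |
| 5 | MLP   | TP‑8     | 33.554  | 32.0  | 0.373  | 1.342   |
| 6 | BLOCK | PP‑NV    | 16.777  | 16.0  | 0.187  | 0.671   |
| 7 | BLOCK | PP‑Cross | 16.777  | 16.0  | 0.711  | 4.698   |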
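
One way to read the table is as the cost of leaving the node. The sketch below takes the latency and energy values from the table (labels written with ASCII hyphens) and computes the ratio of each cross-node configuration to its intra-node counterpart. `results` and `ratio` are illustrative names, not part of the notebook.

```python
# (latency in ms, energy in mJ) copied from the results table
results = {
    ("ATTN", "DP-8"): (1.634, 46.976),
    ("ATTN", "DP-16"): (6.713, 176.161),
    ("MLP", "DP-8"): (4.570, 131.533),
    ("MLP", "DP-16"): (18.667, 493.250),
    ("BLOCK", "PP-NV"): (0.187, 0.671),
    ("BLOCK", "PP-Cross"): (0.711, 4.698),
}


def ratio(layer, scaled, base):
    """Return (latency ratio, energy ratio) of the `scaled` mode over the `base` mode."""
    lat_s, eng_s = results[(layer, scaled)]
    lat_b, eng_b = results[(layer, base)]
    return lat_s / lat_b, eng_s / eng_b


comparisons = [("ATTN", "DP-16", "DP-8"),
               ("MLP", "DP-16", "DP-8"),
               ("BLOCK", "PP-Cross", "PP-NV")]

for layer, scaled, base in comparisons:
    lat_x, eng_x = ratio(layer, scaled, base)
    print(f"{layer} {scaled} vs {base}: latency x{lat_x:.2f}, energy x{eng_x:.2f}")
```

Under the model's assumptions, going from 8 to 16 GPUs roughly quadruples DP all-reduce latency, and a cross-node pipeline hop costs about seven times the energy of an intra-node one.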
