### Hop-Typed Communication Model (NVLink vs NDR) for **LLaMA-3 8B**

This notebook estimates the communication **latency** and **energy** incurred when training or serving LLaMA-3 8B on a DGX-H100.  Each network hop is classified as **NVLink 4** or **InfiniBand NDR 400**, allowing the model to account for the very different bandwidths, startup latencies, and energy costs of those two links.

---

#### Parallelism Strategies Examined
| Strategy | Communication Pattern (per transformer block) |
|----------|----------------------------------------------|
| **Data Parallel (DP)** | All-reduce of full-precision gradients across all GPUs. |
| **Tensor Parallel (TP)** | All-gather of the input activations, then reduce-scatter of the output activations (together equivalent in communication volume to one all-reduce). |
| **Pipeline Parallel (PP)** | Point-to-point transfer of activation checkpoints between consecutive pipeline stages. *Pipeline bubbles are **not** modeled.* |

---

We model the forward pass for tensor and pipeline parallelism, and the backward pass for data parallelism. For tensor and pipeline parallelism the backward-pass communication is taken to be identical to the forward pass, while data parallelism has no forward-pass communication.

#### Model Scope
- **Architecture:** LLaMA-3 8B  
  – 32 transformer layers, hidden size = 4096, grouped-query attention (8 KV heads, i.e. 1024-wide K/V projections)  
- **Numerics:** FP16 / BF16 (2 B per element)  
- **Sequence Length:** 2048 tokens (for activation sizing)

Note: The MLP is modeled as two projections (up and down); the SwiGLU gate projection is ignored.
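
As a rough cross-check of this scope, the per-block parameter counts can be tallied directly. The sketch below assumes 8 KV heads of dimension 128 (which gives the 1024-wide K/V projections used in the sizing code), no bias terms, and `d_ff = 14336`; the constant names are illustrative and not part of the notebook.

```python
# Per-block parameter tally for the model scope above (illustrative names).
D_MODEL = 4096
D_FF = 14336
N_KV_HEADS = 8        # assumption: grouped-query attention with 8 KV heads
HEAD_DIM = 128        # assumption: 4096 hidden size / 32 query heads
BYTES_PER_ELEM = 2    # FP16 / BF16

kv_width = N_KV_HEADS * HEAD_DIM                       # 1024

# Q and O projections are d_model x d_model; K and V are d_model x kv_width.
attn_params = 2 * D_MODEL * D_MODEL + 2 * kv_width * D_MODEL

# The notebook keeps only the up and down projections; SwiGLU adds a third (gate).
mlp_params_no_gate = 2 * D_FF * D_MODEL
mlp_params_with_gate = 3 * D_FF * D_MODEL

for name, n in [("attention", attn_params),
                ("MLP (no gate)", mlp_params_no_gate),
                ("MLP (with gate)", mlp_params_with_gate)]:
    print(f"{name:16s} {n:>12,d} params  {n * BYTES_PER_ELEM / 1e6:8.1f} MB")
```

Dropping the gate projection removes one third of the MLP parameters, so the MLP gradient volume used for data parallelism is a lower bound on the real one.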

---

#### Hardware Link Constants
| Link | Peak BW | Startup α | Inverse BW β | Energy/bit |
|------|---------|-----------|--------------|------------|
| **NVLink 4** | [900 GB/s](https://www.nvidia.com/en-us/data-center/nvlink/) | ~0 | [1.11 ps/B](https://www.nvidia.com/en-us/data-center/nvlink/) | [1.3 pJ/bit](https://www.nvidia.com/en-us/data-center/nvlink-c2c/) (NVLink-C2C figure) |
| **NDR 400** | [50 GB/s](https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf) | ~0 | [20 ps/B](https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf) | ~20 pJ/bit (calculated from 8 W typical power consumption at 400 Gb/s) |


A message of size $M$ bytes traversing $n_{\text{NV}}$ NVLink hops and $n_{\text{IB}}$ InfiniBand hops incurs

$$
\text{Latency} = n_{\text{NV}}\bigl(\alpha_{\text{NV}} + \beta_{\text{NV}} M\bigr) \;+\; n_{\text{IB}}\bigl(\alpha_{\text{IB}} + \beta_{\text{IB}} M\bigr),
$$

$$
\text{Energy} = 8M\bigl(1.3\,n_{\text{NV}} + 20\,n_{\text{IB}}\bigr)\ \text{pJ},
$$

where the factor 8 converts bytes to bits and 1.3 and 20 are the per-bit energies (pJ/bit) from the table above.
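
To make the cost model concrete, here is a minimal worked example that plugs the constants from the Hardware Link Constants table into the latency formula and the per-bit energies (1.3 pJ/bit for NVLink, 20 pJ/bit for NDR). The function name `message_cost` and the 16 MiB message size are illustrative choices, not part of the notebook.

```python
# Link constants copied from the Hardware Link Constants table (startup alpha taken as 0).
BETA_NV, BETA_IB = 1.11e-12, 20e-12   # seconds per byte
EPB_NV, EPB_IB = 1.3e-12, 20e-12      # joules per bit

def message_cost(msg_bytes, n_nv, n_ib):
    """Return (latency in s, energy in J) for one message over the given hop mix."""
    latency = (n_nv * BETA_NV + n_ib * BETA_IB) * msg_bytes
    energy = 8 * msg_bytes * (n_nv * EPB_NV + n_ib * EPB_IB)
    return latency, energy

M = 16 * 2**20  # 16 MiB, an illustrative message size

for name, (n_nv, n_ib) in {"intra-node": (1, 0), "cross-node": (2, 1)}.items():
    lat, eng = message_cost(M, n_nv, n_ib)
    print(f"{name}: {lat * 1e6:.1f} us, {eng * 1e3:.3f} mJ")
```

Under these constants the cross-node path costs roughly 373 µs and 3.0 mJ, against about 19 µs and 0.17 mJ for a single NVLink hop, so the one InfiniBand hop dominates both figures.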

---

#### Objectives of the Notebook
1. **Quantify Communication Cost:** Report latency (ms) and energy (mJ) per collective or point-to-point operation for each strategy and hop mix.  
2. **Compare Parallel Schemes:** Highlight how DP, TP, and PP trade off communication time and energy under the same hardware assumptions.  
3. **Guide Design Decisions:** Provide first-order numbers that help decide which parallelism (or combination) is appropriate for a given training or inference workload.

---

#### Notebook Outputs
- **Per-Hop Latency** and **Energy** tables for each collective or point-to-point operation.  
- **Data Volume** moved (MB) per GPU and in aggregate.  
- Consolidated **comparative charts** for DP, TP, and PP to illustrate trade-offs.

> **Caveat:**  Results are first-order; they do not model overlap, asynchronous progress, or pipeline fill/drain bubbles. They nevertheless capture the dominant communication costs needed for quick design-space exploration.

In [1]:
import numpy as np, pandas as pd

# ---------- Hardware Link Constants ----------
alpha_nv, beta_nv, epb_nv = 0e-6, 1.11e-12, 1.3e-12
alpha_ib, beta_ib, epb_ib = 0e-6, 20e-12, 20e-12

# ---------- Model sizes ----------
BYTES_FP16 = 2
d_model = 4096
d_ff    = 14336
seq_len = 2048

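# Self-attention params (GQA: K and V projections are 1024 wide)
attn_elems = d_model*d_model + 2*1024*d_model + d_model*d_model   # Q + (K, V) + O
attn_param_bytes = attn_elems * BYTES_FP16    # renamed so it is not overwritten below

# MLP params (up and down projections only, no gate)
mlp_elems  = d_ff*d_model + d_model*d_ff
mlp_bytes  = mlp_elems  * BYTES_FP16

# Activation size (one block output) for PP / TP
attn_bytes = seq_len * d_model * BYTES_FP16
MLP_bytes = attn_bytes * 4                    # not used below

# NOTE: ATTN is sized by activations while MLP is sized by parameters.
# The result tables in this notebook were produced with exactly these values.
layer_bytes = dict(ATTN=attn_bytes, MLP=mlp_bytes)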

rows=[]
num_gpu_list = [4, 6, 8, 10, 32, 48]


In [2]:
# Hop tuples (nv, ib)
def hops_8():
    H = np.empty((8,8),dtype=object)
    for i in range(8):
        for j in range(8):
            if i==j: H[i,j]=(0,0)
            else:    H[i,j]=(1,0)   # single NV hop for any pair
    return H

def hops_16():
    H = np.empty((16,16),dtype=object)
    for i in range(16):
        for j in range(16):
            if i==j: H[i,j]=(0,0)
            else:
                same_node=(i//8)==(j//8)
                if same_node: H[i,j]=(1,0)
                else:         H[i,j]=(2,1)  # GPU→NV + IB + NV
    return H

def hop_9():
    H = np.empty((9,9),dtype=object)
    for i in range(9):
        for j in range(9):
            if i==j: H[i,j]=(0,0)
            else: 
                same_node=(i//8)==(j//8)
                if same_node: H[i,j]=(1,0)
                else:         H[i,j]=(2,1)  # GPU→NV + IB + NV
    return H

def hops(num_gpus):
    H = np.empty((num_gpus,num_gpus),dtype=object)
    for i in range(num_gpus):
        for j in range(num_gpus):
            if i==j: H[i,j]=(0,0)
            else:
                same_node=(i//8)==(j//8)
                if same_node: H[i,j]=(1,0)
                else:         H[i,j]=(2,1)  # GPU→NV + IB + NV
    return H

# Test the correctness of the function via the handwritten ones
hop8 = hops_8()
hop9 = hop_9()
hop16 = hops_16()
assert np.all(hops(8) == hop8)
assert np.all(hops(16) == hop16)
assert np.all(hops(9) == hop9) 
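
The same hop rule can be used to see how many links of a clockwise ring stay inside a node and how many cross nodes, which is what drives the ring collectives later on. This is a small sketch under the same assumption of 8 GPUs per node; `pair_hops` and `ring_link_mix` are hypothetical helper names, not functions from the notebook.

```python
GPUS_PER_NODE = 8  # assumption: one DGX-H100 node holds 8 GPUs

def pair_hops(i, j, gpus_per_node=GPUS_PER_NODE):
    """(nv_hops, ib_hops) between ranks i and j, using the same rule as hops()."""
    if i == j:
        return (0, 0)
    if i // gpus_per_node == j // gpus_per_node:
        return (1, 0)          # same node: one NVLink hop
    return (2, 1)              # different nodes: NV + IB + NV

def ring_link_mix(num_gpus):
    """Count (intra-node, cross-node) links in the ring r -> (r + 1) % num_gpus."""
    links = [pair_hops(r, (r + 1) % num_gpus) for r in range(num_gpus)]
    cross = sum(1 for _, ib in links if ib > 0)
    return num_gpus - cross, cross

for n in [4, 6, 8, 10, 32, 48]:
    intra, cross = ring_link_mix(n)
    print(f"{n:2d} GPUs: {intra} intra-node links, {cross} cross-node links")
```

Any ring that spans more than one node has at least two cross-node links (one out, one wrapping back), so its per-step critical path is set by the InfiniBand hop.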

In [3]:
def cost_bytes(lat_bytes, nv_hops, ib_hops):
    """Return (latency in s, energy in J) for a message of lat_bytes bytes over the given hop counts."""
    lat = nv_hops*(alpha_nv+beta_nv*lat_bytes) + ib_hops*(alpha_ib+beta_ib*lat_bytes)
    eng = nv_hops*lat_bytes*8*epb_nv + ib_hops*lat_bytes*8*epb_ib
    return lat, eng

In [4]:
def ring_dp(layer_name, layer_b, hopmat, mode_label):
    n = hopmat.shape[0]
    chunk = layer_b / n                      # bytes sent per link per step
    per_gpu_bytes = layer_b * (n - 1) / n    # bytes sent per GPU in one phase (reduce-scatter or all-gather)
    steps = 2 * (n - 1)                      # reduce-scatter + all-gather

    total_lat = 0.0
    total_eng = 0.0

    for _ in range(steps):
        max_link_lat = 0.0
        step_energy  = 0.0

        # each rank communicates with its neighbour every step
        for r in range(n):
            sender    = r
            receiver  = (r + 1) % n          # fixed clockwise ring
            nv, ib     = hopmat[sender, receiver]
            lat, eng   = cost_bytes(chunk, nv, ib)

            max_link_lat = max(max_link_lat, lat)
            step_energy += eng               # all links consume energy

        total_lat += max_link_lat            # critical-path latency
        total_eng += step_energy

    rows.append([layer_name, mode_label,
                 per_gpu_bytes / 1e6,        # bytes -> MB (decimal)
                 total_lat  * 1e3,           # s -> ms
                 total_eng  * 1e3])          # J -> mJ

for L,B in layer_bytes.items():
    for num_gpu in num_gpu_list:
        ring_dp(L,B,hops(num_gpu),f"DP-{num_gpu}")
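
Because every step of the ring moves equal-sized chunks, the loop above collapses to a closed form: total latency is `2(n-1)` times the cost of the slowest link, and total energy is `2(n-1)` times the summed energy of all links. The sketch below is a standalone restatement of that idea under the same link constants (startup latency taken as 0, 8 GPUs per node); `link_cost` and `ring_allreduce_closed_form` are illustrative names.

```python
def link_cost(msg_bytes, nv_hops, ib_hops,
              beta_nv=1.11e-12, beta_ib=20e-12,   # s per byte
              epb_nv=1.3e-12, epb_ib=20e-12):     # J per bit
    """(latency in s, energy in J) for one chunk over one ring link; alpha assumed 0."""
    lat = (nv_hops * beta_nv + ib_hops * beta_ib) * msg_bytes
    eng = 8 * msg_bytes * (nv_hops * epb_nv + ib_hops * epb_ib)
    return lat, eng

def ring_allreduce_closed_form(msg_bytes, n, gpus_per_node=8):
    """Closed-form latency and energy of a clockwise ring all-reduce over n GPUs."""
    chunk = msg_bytes / n
    steps = 2 * (n - 1)                      # reduce-scatter + all-gather
    # Ring links r -> (r + 1) % n whose endpoints sit on different nodes.
    cross = sum(1 for r in range(n)
                if r // gpus_per_node != ((r + 1) % n) // gpus_per_node)
    intra = n - cross
    lat_intra, eng_intra = link_cost(chunk, 1, 0)
    lat_cross, eng_cross = link_cost(chunk, 2, 1)
    step_lat = lat_cross if cross else lat_intra   # slowest link sets the step time
    step_eng = intra * eng_intra + cross * eng_cross
    return steps * step_lat, steps * step_eng

msg = 2048 * 4096 * 2   # the ATTN entry of layer_bytes (seq_len * d_model * 2 B)
for n in (8, 10):
    lat, eng = ring_allreduce_closed_form(msg, n)
    print(f"n={n}: {lat * 1e3:.3f} ms, {eng * 1e3:.3f} mJ")
```

For the ATTN size this reproduces the jump in the result table between DP-8 (0.033 ms) and DP-10 (0.671 ms): adding a second node puts an InfiniBand hop on the critical path of every step.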
    


In [5]:
def tp_cost(layer_name, bytes_per_msg, hopmat, ring_dp=ring_dp, num_gpus=8):
    # all-gather + reduce-scatter, modeled as one ring all-reduce over the TP group
    ring_dp(layer_name, bytes_per_msg, hopmat, "TP-" + str(num_gpus))

for num_gpu in num_gpu_list:
    tp_cost("ATTN", layer_bytes["ATTN"], hops(num_gpu), num_gpus=num_gpu)
    tp_cost("MLP" , layer_bytes["MLP"], hops(num_gpu), num_gpus=num_gpu)


In [6]:
def pp_cost(label, act_bytes, nv_hops, ib_hops):
    # One point-to-point transfer between consecutive pipeline stages
    lat,eng = cost_bytes(act_bytes,nv_hops,ib_hops)
    rows.append(["BLOCK",label,act_bytes/1e6,lat*1e3,eng*1e3])   # MB, ms, mJ

# Intra-node stages are one NVLink hop apart, matching hops(); cross-node is NV + IB + NV
pp_cost("PP-NV-ATTN",layer_bytes["ATTN"],1,0)
pp_cost("PP-Cross-ATTN",layer_bytes["ATTN"],2,1)
pp_cost("PP-NV-MLP",layer_bytes["MLP"],1,0)
pp_cost("PP-Cross-MLP",layer_bytes["MLP"],2,1)


In [7]:
df=pd.DataFrame(rows,columns=["Layer","Mode","PerGPU_MB","Latency_ms","Energy_mJ"])

# Derive group size from the numeric suffix of the Mode label (e.g. "DP-48" -> 48)
def _group_size(mode: str) -> int:
    suffix = mode.replace("\u2011", "-").rsplit("-", 1)[-1]
    # Pipeline-parallel point-to-point labels have no numeric suffix: assume two ranks
    return int(suffix) if suffix.isdigit() else 2

df["Group"] = df["Mode"].map(_group_size)

df["Total_MB"] = df["PerGPU_MB"] * df["Group"]

# Show binary MiB alongside decimal MB
MB2MiB = 1_000_000 / 1_048_576         # ≈ 0.953674
df["PerGPU_MiB"] = df["PerGPU_MB"] * MB2MiB

# Re-order columns for readability
cols = ["Layer","Mode",
        "PerGPU_MB",
        "PerGPU_MiB",
        "Latency_ms","Energy_mJ"]
df = df[cols]

display(df.round(3))

Unnamed: 0,Layer,Mode,PerGPU_MB,PerGPU_MiB,Latency_ms,Energy_mJ
0,ATTN,DP-4,12.583,12.0,0.028,1.047
1,ATTN,DP-6,13.981,13.333,0.031,1.745
2,ATTN,DP-8,14.68,14.0,0.033,2.443
3,ATTN,DP-10,15.099,14.4,0.671,13.433
4,ATTN,DP-32,16.253,15.5,0.722,32.974
5,ATTN,DP-48,16.428,15.667,0.73,49.993
6,MLP,DP-4,176.161,168.0,0.391,14.657
7,MLP,DP-6,195.734,186.667,0.435,24.428
8,MLP,DP-8,205.521,196.0,0.456,34.199
9,MLP,DP-10,211.393,201.6,9.394,188.055


# Adding forward/backward pass

In [9]:
df_forward = df.copy()
df_backward = df.copy()
df_forward['Workload'] = 'forward pass'
df_backward['Workload'] = 'backward pass'
value_cols = ["PerGPU_MB", "PerGPU_MiB", "Latency_ms", "Energy_mJ"]

# DP forward pass has no communication, so zero out every DP row
dp_forward_mask = df_forward['Mode'].str.contains('DP')
df_forward.loc[dp_forward_mask, value_cols] = 0

# Otherwise for TP and PP, the backward pass has the same communication as the forward pass
df_processed = pd.concat([df_forward, df_backward], ignore_index=True)

cols = ["Layer", "Mode", "Workload",
        "PerGPU_MB", "PerGPU_MiB",
        "Latency_ms", "Energy_mJ"]
df_processed = df_processed[cols]

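# Sort by parallelism mode first, then by num_gpus in descending order
def mode_sort_key(mode):
    # Extract the prefix (DP, TP, PP) and the number of GPUs;
    # normalise non-breaking hyphens (U+2011) in labels so the split works
    parts = mode.replace('\u2011', '-').split('-')
    prefix = parts[0]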
    
    # For PP modes that don't follow the standard pattern
    if prefix.startswith('PP'):
        return (2, 0)  # Place PP modes after DP and TP
    
    # For standard modes with numbers
    try:
        if len(parts) > 1:
            num_gpus = int(parts[1])
        else:
            num_gpus = 0
    except ValueError:
        num_gpus = 0
    
    # Order by prefix (DP=0, TP=1, PP=2), then by -num_gpus for descending order
    prefix_order = {'DP': 0, 'TP': 1, 'PP': 2}.get(prefix, 3)
    return (prefix_order, -num_gpus)

df_processed['mode_sort_key'] = df_processed['Mode'].apply(mode_sort_key)
df_processed = df_processed.sort_values(by=["mode_sort_key", "Layer", "Workload"]).reset_index(drop=True)
df_processed = df_processed.drop(columns=['mode_sort_key'])

# Display the final rounded DataFrame
display(df_processed.round(3))

Unnamed: 0,Layer,Mode,Workload,PerGPU_MB,PerGPU_MiB,Latency_ms,Energy_mJ
0,ATTN,DP-48,backward pass,16.428,15.667,0.73,49.993
1,ATTN,DP-48,forward pass,0.0,0.0,0.0,0.0
2,MLP,DP-48,backward pass,229.988,219.333,10.221,699.898
3,MLP,DP-48,forward pass,0.0,0.0,0.0,0.0
4,ATTN,DP-32,backward pass,16.253,15.5,0.722,32.974
5,ATTN,DP-32,forward pass,0.0,0.0,0.0,0.0
6,MLP,DP-32,backward pass,227.541,217.0,10.112,461.635
7,MLP,DP-32,forward pass,0.0,0.0,0.0,0.0
8,ATTN,DP-10,backward pass,15.099,14.4,0.671,13.433
9,ATTN,DP-10,forward pass,0.0,0.0,0.0,0.0
