# VLM Roofline Modeling: Vision & Text Encoders

This notebook provides a comprehensive roofline analysis for the **VLM portion** of smolVLA, specifically the **Vision Encoder** and **Text Encoder** (Language Model).

**Source of Truth**: `model-preparation/full/tests/smolvla_test_vectors/model_shape.txt`

We explore:
1.  **Max Throughput**: Theoretical limits on the Alveo U280.
2.  **Resource Constraints**: Memory usage and bandwidth bottlenecks.
3.  **Quantization**: Impact of INT8 and INT4 precision on performance.
4.  **Acceleration Ideas**: Strategies to improve performance beyond simple quantization.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('seaborn-v0_8-paper')
plt.rcParams.update({
    'font.size': 12,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10,
    'figure.dpi': 150,
    'lines.linewidth': 2
})
%matplotlib inline


## 1. Hardware Specifications (Alveo U280)

We define the peak performance for each precision. INT8 and INT4 peaks are higher than FP32 on FPGAs because narrow integer operations use DSP slices more efficiently and can also be implemented in logic (LUTs).

*   **Frequency**: 300 MHz
*   **FP32 Peak**: 5.41 TFLOP/s
*   **INT8 Peak**: ~18.6 TOPS (DSP-based estimate)
*   **INT4 Peak**: ~37.2 TOPS (Assumed 2x INT8 via logic/packing)
*   **Memory Bandwidth**: 460 GB/s (Peak), 300 GB/s (Realistic)

In [None]:
FREQ = 300e6
BW_PEAK = 460e9
BW_REAL = 300e9

# Peak Compute (Ops/s)
P_FP32 = 5.41e12
P_BF16 = 5.41e12 # Assumed same as FP32 for DSPs without native BF16
P_INT8 = 18.6e12
P_INT4 = 37.2e12

print(f"FP32 Peak: {P_FP32 / 1e12:.2f} TFLOP/s")
print(f"INT8 Peak: {P_INT8 / 1e12:.2f} TOPS")
print(f"INT4 Peak: {P_INT4 / 1e12:.2f} TOPS")
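
The ridge point of each ceiling, peak compute divided by memory bandwidth, is the operational intensity above which a kernel stops being memory-bound. A minimal sketch using the numbers above follows; the `ridge_point` helper and the `PEAKS` dictionary are illustrative names, not part of the original notebook.

```python
# Ridge point: the OI (Ops/Byte) where the memory roof meets a compute ceiling.
BW_REAL = 300e9  # "realistic" bandwidth from the spec list above (bytes/s)
PEAKS = {"FP32": 5.41e12, "INT8": 18.6e12, "INT4": 37.2e12}  # ops/s

def ridge_point(peak_ops, bandwidth):
    """Return the OI at which a kernel becomes compute-bound."""
    return peak_ops / bandwidth

ridges = {name: ridge_point(peak, BW_REAL) for name, peak in PEAKS.items()}
for name, oi in ridges.items():
    print(f"{name}: ridge at {oi:.1f} Ops/Byte")
```

Kernels whose OI falls below the ridge for their precision are limited by bandwidth rather than compute under this model.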


## 2. Model Dimensions

We extract dimensions from `model_weights.h` and `model_shape.txt`.

### Vision Encoder (`vision_model`)
*   Layers: 12
*   Hidden Dim ($V_D$): 768
*   FFN Dim ($V_{FFN}$): 3072
*   Attention: Standard ($Q=K=V=768$)
*   Patch Embed: Conv2d weight of shape [768, 3, 16, 16] (out channels, in channels, kernel height, kernel width)

### Text Encoder (`text_model`)
*   Layers: 16
*   Hidden Dim ($T_D$): 960
*   FFN Dim ($T_{FFN}$): 2560
*   **Attention (GQA/MQA)**:
    *   $Q_{dim} = 960$
    *   $K_{dim} = 320$
    *   $V_{dim} = 320$
    *   $Out_{dim} = 960$

### Connector
*   Projection: 12288 -> 960

### LM Head
*   Projection: 960 -> 49280 (vocabulary size)

In [None]:
# Vision Encoder
V_LAYERS = 12
V_D = 768
V_FFN = 3072
V_ATTN_D = 768 # Q, K, V, Out all 768

# Text Encoder
T_LAYERS = 16
T_D = 960
T_FFN = 2560
T_Q_D = 960
T_K_D = 320
T_V_D = 320
T_OUT_D = 960

# Connector
CONN_IN = 12288
CONN_OUT = 960

# LM Head
VOCAB = 49280

# Batch size and sequence lengths
B = 1
S = 50 # Sequence Length (Text)
S_V = 256 # Sequence Length (Vision, 16x16 patches)
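
Two ratios follow directly from these dimensions and are worth checking before modeling: how much narrower the K/V projections are than Q, and how many vision tokens fit in one connector input. The sketch below assumes equal head sizes for Q and K/V, which is an assumption since head counts are not listed here; the variable names on the left are illustrative.

```python
# Derived ratios from the dimensions listed above.
T_Q_D, T_K_D = 960, 320      # text attention projection widths
V_D, CONN_IN = 768, 12288    # vision hidden dim, connector input dim

# Under GQA with equal head sizes (assumption), this is the number of
# query heads that share one key/value head.
q_per_kv = T_Q_D // T_K_D

# Number of vision-token feature vectors that concatenate to CONN_IN.
tokens_per_conn_input, remainder = divmod(CONN_IN, V_D)

print(f"Query heads per KV head: {q_per_kv}")
print(f"Vision tokens per connector input: {tokens_per_conn_input} (remainder {remainder})")
```

A zero remainder is consistent with the connector consuming a whole number of concatenated vision tokens, though it does not prove that this is how the model groups them.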


## 3. Resource Constraints: Memory Usage

We calculate the total memory required to store weights for each encoder under different precisions.

In [None]:
def calc_params_vision(layers, D, FFN):
    # Per-layer parameter count:
    #   Attn:  4 * (D * D + D)       -> Q, K, V, Out weights plus biases
    #   MLP:   2 * D * FFN + FFN + D -> FC1, FC2 weights plus biases
    #   Norms: 4 * D                 -> 2 LayerNorms * (scale + bias)
    # Patch embedding and position embeddings are not counted here.
    attn = 4 * (D * D + D)
    mlp = 2 * (D * FFN) + FFN + D
    norms = 4 * D
    return layers * (attn + mlp + norms)

def calc_params_text(layers, D, FFN, Q_D, K_D, V_D, Out_D):
    # Per-layer parameter count (weights only):
    #   Attn:  Q_D*D + K_D*D + V_D*D + Out_D*D
    #          model_shape.txt does not explicitly list biases for the text
    #          attention projections; they may be present but are not counted.
    #   MLP:   3 * D * FFN -> Gate and Up are D -> FFN, Down is FFN -> D
    #   Norms: 4 * D
    attn_weights = (Q_D * D) + (K_D * D) + (V_D * D) + (Out_D * D)
    mlp_weights = 3 * (D * FFN)
    norms = 4 * D
    return layers * (attn_weights + mlp_weights + norms)

v_params = calc_params_vision(V_LAYERS, V_D, V_FFN)
t_params = calc_params_text(T_LAYERS, T_D, T_FFN, T_Q_D, T_K_D, T_V_D, T_OUT_D)

# Connector & Head
conn_params = CONN_IN * CONN_OUT
head_params = VOCAB * T_D

total_params = v_params + t_params + conn_params + head_params

print(f"Vision Params: {v_params / 1e6:.2f} M")
print(f"Text Params:   {t_params / 1e6:.2f} M")
print(f"Connector:     {conn_params / 1e6:.2f} M")
print(f"LM Head:       {head_params / 1e6:.2f} M")
print(f"Total Params:  {total_params / 1e6:.2f} M")

precisions = {
    'FP32': 4,
    'BF16': 2,
    'INT8': 1,
    'INT4': 0.5
}

print("\n--- Memory Usage (GB) ---")
for name, bytes_per_param in precisions.items():
    mem_gb = total_params * bytes_per_param / 1e9
    print(f"{name}: {mem_gb:.2f} GB")
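
If weights are streamed from off-chip memory, every weight must be read at least once per forward pass at batch size 1, so weight bytes divided by bandwidth gives a lower bound on latency. A hedged sketch of that bound follows; `min_stream_time_s` is a hypothetical helper, and the parameter count in the example is a round placeholder, not the total computed above.

```python
def min_stream_time_s(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on the time to read every weight once from memory."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s

BW_REAL = 300e9          # bytes/s, the "realistic" bandwidth from Section 1
EXAMPLE_PARAMS = 300e6   # placeholder; substitute total_params in the notebook

for name, nbytes in {"FP32": 4, "BF16": 2, "INT8": 1, "INT4": 0.5}.items():
    t = min_stream_time_s(EXAMPLE_PARAMS, nbytes, BW_REAL)
    print(f"{name}: >= {t * 1e3:.2f} ms per pass")
```

The bound ignores activations and any weights that can be kept on-chip, so it is a floor on memory time rather than a latency prediction.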


## 4. Throughput & Roofline Analysis

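We calculate the Operational Intensity (OI) of the dominant kernels: the number of operations divided by the bytes moved, assuming weights, inputs, and outputs each cross the memory interface exactly once and all share the same precision.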

In [None]:
def calc_oi_linear(M, K, N, dtype_bytes):
    # Matrix Multiply: (M, K) x (K, N) -> (M, N)
    flops = 2 * M * K * N
    # Weights (K, N) + Input (M, K) + Output (M, N)
    bytes_xfer = (K*N + M*K + M*N) * dtype_bytes
    return flops / bytes_xfer

metrics = {}
for p_name, p_bytes in precisions.items():
    metrics[p_name] = {}
    
    # --- Vision Encoder ---
    # MLP FC1: (B*S_V) x V_D -> V_FFN
    metrics[p_name]['Vis_MLP'] = calc_oi_linear(B*S_V, V_D, V_FFN, p_bytes)
    
    # Patch Embed (Conv2d -> Im2Col -> GEMM)
    # In: 3 channels, Out: 768, Kernel: 16x16
    # Equivalent GEMM: (N_Patches, 3*16*16) x (3*16*16, 768)
    # M=1024 patches, K=3*16*16=768, N=768
    # Note: M=1024 is hard-coded here and does not match S_V = 256 above.
    metrics[p_name]['Vis_PatchEmbed'] = calc_oi_linear(1024, 3*16*16, 768, p_bytes)
    
    # --- Text Encoder ---
    # MLP Gate/Up: (B*S) x T_D -> T_FFN
    metrics[p_name]['Txt_MLP'] = calc_oi_linear(B*S, T_D, T_FFN, p_bytes)
    
    # Attention Q Proj: (B*S) x T_D -> T_Q_D
    metrics[p_name]['Txt_Attn_Q'] = calc_oi_linear(B*S, T_D, T_Q_D, p_bytes)
    
    # Attention K Proj: (B*S) x T_D -> T_K_D (Smaller!)
    metrics[p_name]['Txt_Attn_K'] = calc_oi_linear(B*S, T_D, T_K_D, p_bytes)
    
    # --- Connector ---
    # The weight is [960, 12288], so the input dim is 12288.
    # With a vision output dim of 768, 12288 / 768 = 16, which suggests that
    # 16 vision tokens are concatenated into each connector input.
    # M is then either 1 (a single large vector) or S_V / 16.
    # We assume M=1 for now.
    metrics[p_name]['Connector'] = calc_oi_linear(1, 12288, 960, p_bytes)

print("--- Operational Intensity (FLOPs/Byte) ---")
for p_name, vals in metrics.items():
    print(f"{p_name}:")
    for k, v in vals.items():
        print(f"  {k}: {v:.2f}")
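
Given an OI, the roofline bound is `min(peak, bandwidth * OI)`, and comparing the two terms tells whether a kernel is memory-bound or compute-bound. The sketch below restates the OI formula so that it runs standalone; `oi_linear`, `attainable`, and the `KERNELS` table are illustrative names, and the shapes are the text K projection and the M=1 connector from this section.

```python
def oi_linear(M, K, N, dtype_bytes):
    """Ops per byte for an (M, K) x (K, N) GEMM, each operand moved once."""
    return (2 * M * K * N) / ((K * N + M * K + M * N) * dtype_bytes)

def attainable(oi, peak, bandwidth):
    """Return (ops/s bound, limiting resource) under the roofline model."""
    mem_bound = bandwidth * oi
    if mem_bound < peak:
        return mem_bound, "memory"
    return peak, "compute"

BW_REAL = 300e9
PEAKS = {"FP32": (5.41e12, 4), "INT8": (18.6e12, 1), "INT4": (37.2e12, 0.5)}
KERNELS = {
    "Txt_Attn_K": (50, 960, 320),    # (B*S, T_D, T_K_D)
    "Connector": (1, 12288, 960),    # M=1 assumption from the cell above
}

for k_name, (M, K, N) in KERNELS.items():
    for p_name, (peak, nbytes) in PEAKS.items():
        oi = oi_linear(M, K, N, nbytes)
        perf, limit = attainable(oi, peak, BW_REAL)
        print(f"{k_name} {p_name}: OI={oi:.2f}, "
              f"bound={perf / 1e12:.2f} Tops/s ({limit}-bound)")
```

Returning the limiting resource alongside the bound makes it easy to tabulate which kernels would benefit from quantization (memory-bound) and which need more compute.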


In [None]:
def plot_roofline_base(ax, title):
    # OI range in Ops/Byte; extends to 1e4 because INT4 kernels can exceed 1e3
    x = np.logspace(-1, 4, 200)
    
    # Ceilings
    ceilings = [
        ('FP32/BF16', P_FP32, 'k-'),
        ('INT8', P_INT8, 'b-'),
        ('INT4', P_INT4, 'g-')
    ]
    
    # Memory wall (label derived from BW_REAL so it stays in sync)
    y_mem = BW_REAL * x
    ax.loglog(x, y_mem, 'r--', label=f'Memory Wall ({BW_REAL / 1e9:.0f} GB/s)')
    
    for name, peak, style in ceilings:
        y = np.minimum(peak, y_mem)
        ax.loglog(x, y, style, label=f'{name} Peak')
    
    ax.set_xlabel('Operational Intensity (Ops/Byte)')
    ax.set_ylabel('Performance (Ops/s)')
    ax.set_title(title)
    ax.grid(True, which="both", ls="-", alpha=0.5)

def plot_kernels(ax, kernel_names, metrics_dict):
    markers = {'FP32': 'o', 'BF16': 's', 'INT8': '^', 'INT4': 'D'}
    colors = {'FP32': 'black', 'BF16': 'orange', 'INT8': 'blue', 'INT4': 'green'}
    
    for p_name, vals in metrics_dict.items():
        # Select the compute ceiling for this precision
        peak = {'FP32': P_FP32, 'BF16': P_BF16, 'INT8': P_INT8, 'INT4': P_INT4}[p_name]
        
        for k_name in kernel_names:
            if k_name not in vals: continue
            oi = vals[k_name]
            perf = min(peak, BW_REAL * oi)
            
            ax.plot(oi, perf, marker=markers[p_name], color=colors[p_name], markersize=10)
            if p_name == 'INT4': # Label only once
                ax.text(oi, perf*1.3, k_name, fontsize=8, ha='center', rotation=45)

def plot_vision_roofline():
    fig, ax = plt.subplots(figsize=(10, 7))
    plot_roofline_base(ax, 'Vision Encoder Roofline')
    plot_kernels(ax, ['Vis_MLP', 'Vis_PatchEmbed', 'Connector'], metrics)
    ax.legend()
    plt.tight_layout()
    plt.show()

def plot_text_roofline():
    fig, ax = plt.subplots(figsize=(10, 7))
    plot_roofline_base(ax, 'Text Encoder Roofline')
    plot_kernels(ax, ['Txt_MLP', 'Txt_Attn_Q', 'Txt_Attn_K'], metrics)
    ax.legend()
    plt.tight_layout()
    plt.show()

print("Plotting Vision Encoder...")
plot_vision_roofline()

print("Plotting Text Encoder...")
plot_text_roofline()
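
One way to read these plots is to ask how many rows M (tokens times batch) a linear layer needs before it crosses the ridge point. The sketch below finds that value by a simple search; it restates the OI formula from Section 4, and `min_rows_for_compute_bound` is a hypothetical helper rather than part of the original analysis.

```python
def oi_linear(M, K, N, dtype_bytes):
    """Ops per byte for an (M, K) x (K, N) GEMM, each operand moved once."""
    return (2 * M * K * N) / ((K * N + M * K + M * N) * dtype_bytes)

def min_rows_for_compute_bound(K, N, dtype_bytes, peak, bandwidth,
                               max_rows=1 << 20):
    """Smallest M with OI >= peak / bandwidth, or None if never reached."""
    ridge = peak / bandwidth
    # OI grows with M toward this limit, where the weight traffic is
    # fully amortized; if the limit is below the ridge, no M suffices.
    oi_limit = (2 * K * N) / ((K + N) * dtype_bytes)
    if oi_limit <= ridge:
        return None
    for M in range(1, max_rows + 1):
        if oi_linear(M, K, N, dtype_bytes) >= ridge:
            return M
    return None

# Text MLP Gate/Up projection (960 -> 2560) in FP32 at 300 GB/s
m_fp32 = min_rows_for_compute_bound(960, 2560, 4, 5.41e12, 300e9)
print(f"Txt_MLP FP32 becomes compute-bound at M >= {m_fp32}")
```

Because OI increases monotonically with M, the first M that reaches the ridge is the threshold; batching or longer sequences move a layer to the right on the roofline.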
