# Vision Encoder Roofline Analysis

This notebook provides a detailed breakdown of the compute and memory requirements for the **Vision Encoder** of the smolVLA model on the Alveo U280 FPGA.

**Objective**: Analyze individual kernels to identify bottlenecks and guide acceleration strategies.

## 1. Hardware Specifications (Alveo U280)
*   **Frequency**: 300 MHz
*   **Peak Bandwidth**: 460 GB/s (Theoretical), 300 GB/s (Realistic)
*   **Peak Compute**:
    *   FP32: 5.41 TFLOP/s
    *   INT8: 18.6 TOPS
    *   INT4: 37.2 TOPS
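
The ridge point of each precision, the operational intensity at which a kernel stops being bandwidth-bound, follows from these figures as peak compute divided by bandwidth. A minimal sketch, assuming the realistic 300 GB/s figure is the relevant bandwidth (the names `peaks` and `ridge` are illustrative and not part of the notebook):

```python
# Ridge point = peak compute / memory bandwidth (ops per byte).
# Assumption: the "realistic" 300 GB/s bandwidth is used, not the theoretical 460 GB/s.
bw_real = 300e9  # bytes/s
peaks = {'FP32': 5.41e12, 'INT8': 18.6e12, 'INT4': 37.2e12}  # ops/s

ridge = {name: peak / bw_real for name, peak in peaks.items()}
for name, value in ridge.items():
    print(f"{name}: ridge point = {value:.1f} ops/byte")
```

Under these assumptions, kernels with an OI below roughly 18 (FP32), 62 (INT8), or 124 (INT4) ops/byte are limited by memory bandwidth rather than compute.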

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('seaborn-v0_8-paper')
plt.rcParams.update({'font.size': 12, 'figure.dpi': 150})
%matplotlib inline

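# Hardware specs (SI units)
FREQ = 300e6      # Clock frequency, Hz
BW_REAL = 300e9   # Realistic memory bandwidth, bytes/s
P_FP32 = 5.41e12  # Peak FP32 throughput, FLOP/s
P_INT8 = 18.6e12  # Peak INT8 throughput, ops/s
P_INT4 = 37.2e12  # Peak INT4 throughput, ops/s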


## 2. Model Dimensions (Vision Encoder)
Derived from `model_shape.txt`.

*   **Input**: 16x16-pixel patches (3 channels)
*   **Hidden Dim ($D$)**: 768
*   **FFN Dim**: 3072
*   **Layers**: 12
*   **Sequence Length ($S$)**: 1024 (32x32 grid of patches)
*   **Batch Size ($B$)**: 1
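
These dimensions imply a few derived quantities that the later sections rely on. The sketch below assumes a square 512x512 input image (inferred from the 32x32 grid of 16x16 patches, not stated in this notebook) and counts only the GEMM weights of each layer, ignoring biases, normalization parameters, and position embeddings:

```python
patch = 16   # patch side in pixels
grid = 32    # patches per image side
image_side = patch * grid   # assumed square input: 512 pixels
seq_len = grid * grid       # 1024 patches

hidden, ffn, layers = 768, 3072, 12
# Q, K, V, and output projections (4 * D * D) plus the two MLP matrices (2 * D * FFN)
weights_per_layer = 4 * hidden * hidden + 2 * hidden * ffn
total_weights = layers * weights_per_layer

print(image_side, seq_len)
print(f"{weights_per_layer:,} weights per layer, {total_weights:,} across {layers} layers")
```

At one byte per weight (INT8), the encoder layers alone would therefore hold roughly 85 MB of GEMM weights.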

In [None]:
B = 1
S = 1024 # Num Patches (32x32 grid)
D = 768
FFN = 3072

# Precision (Bytes)
precisions = {'FP32': 4, 'BF16': 2, 'INT8': 1, 'INT4': 0.5}


## 3. Kernel Analysis

We calculate FLOPs, Memory Transfer (Bytes), and Operational Intensity (OI) for each kernel type.

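**Formulas** (for a GEMM of an $M \times K$ input with a $K \times N$ weight matrix, each operand moved once):
*   $\text{FLOPs} = 2 \times M \times K \times N$
*   $\text{Bytes} = (M \times K + K \times N + M \times N) \times \text{dtype\_size}$
*   $\text{OI} = \text{FLOPs} / \text{Bytes}$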
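
As a worked instance of these formulas, the sketch below evaluates the Q projection ($M=1024$, $K=N=768$) at each precision. The helper name `gemm_oi` is hypothetical and only serves this example. Because the FLOP count is independent of the data type, OI scales inversely with the element size:

```python
def gemm_oi(m, k, n, dtype_bytes):
    """Return (flops, bytes, oi) for an M x K by K x N GEMM, each operand moved once."""
    flops = 2 * m * k * n
    total_bytes = (m * k + k * n + m * n) * dtype_bytes
    return flops, total_bytes, flops / total_bytes

# Q projection: M = S = 1024, K = N = D = 768
for name, size in {'FP32': 4, 'BF16': 2, 'INT8': 1, 'INT4': 0.5}.items():
    flops, total_bytes, oi = gemm_oi(1024, 768, 768, size)
    print(f"{name}: {flops:.3e} FLOPs, {total_bytes:.3e} bytes, OI = {oi:.1f}")
```

At INT8 this gives 1,207,959,552 FLOPs over 2,162,688 bytes, an OI of about 558.5 ops/byte.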

In [None]:
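def analyze_kernel(name, M, K, N, p_bytes):
    """Return FLOPs, bytes moved, and OI for an (M x K) @ (K x N) GEMM.

    Assumes each operand (input, weights, output) crosses the memory
    interface exactly once, with p_bytes bytes per element.
    """
    flops = 2 * M * K * N  # One multiply and one add per MAC
    mem_weights = K * N * p_bytes       # Weights (K x N)
    mem_io = (M * K + M * N) * p_bytes  # Input (M x K) + output (M x N)
    total_bytes = mem_weights + mem_io
    oi = flops / total_bytes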
    return {
        'Kernel': name,
        'M': M, 'K': K, 'N': N,
        'FLOPs': flops,
        'Bytes': total_bytes,
        'OI': oi,
        'Weight_MB': mem_weights / 1e6
    }

data = []
p_name = 'INT8' # Baseline for detailed table
p_bytes = precisions[p_name]

# 1. Patch Embedding (Conv2d 3->768, 16x16)
# Equivalent GEMM: M=Patches(1024), K=3*16*16(768), N=768
data.append(analyze_kernel('PatchEmbed', S, 3*16*16, D, p_bytes))

# 2. Attention Q, K, V Projections
# M=S(1024), K=D(768), N=D(768)
data.append(analyze_kernel('Attn_Q', S, D, D, p_bytes))
data.append(analyze_kernel('Attn_K', S, D, D, p_bytes))
data.append(analyze_kernel('Attn_V', S, D, D, p_bytes))

# 3. Attention Output Projection
data.append(analyze_kernel('Attn_Out', S, D, D, p_bytes))

# 4. MLP FC1 (Up)
# M=S(1024), K=D(768), N=FFN(3072)
data.append(analyze_kernel('MLP_FC1', S, D, FFN, p_bytes))

# 5. MLP FC2 (Down)
# M=S(1024), K=FFN(3072), N=D(768)
data.append(analyze_kernel('MLP_FC2', S, FFN, D, p_bytes))

# 6. Connector (projection to the text model dimension)
# K=12288, N=960. M=1 is an assumption: a single pooled vector is projected.
# If the connector runs over a token sequence instead, M is larger and OI rises.
# M=1 is the worst case for memory-boundedness, since weight traffic dominates.
data.append(analyze_kernel('Connector', 1, 12288, 960, p_bytes))

df = pd.DataFrame(data)
print(f"--- Kernel Metrics ({p_name}) ---")
display(df[['Kernel', 'FLOPs', 'Bytes', 'OI', 'Weight_MB']].round(2))


## 4. Roofline Plot

We plot these kernels on the Roofline model to show which are limited by memory bandwidth and which by peak compute.
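
For a single kernel, the roofline bound reduces to `min(peak, bandwidth * OI)`. The sketch below applies it at INT8 to two kernels from Section 3: the Connector under the M=1 assumption and MLP_FC1. The helper name `attainable` and the `kernels` dictionary are illustrative and not part of the notebook:

```python
def attainable(oi, peak, bandwidth):
    """Roofline bound in ops/s, plus which ceiling limits the kernel."""
    mem_limit = bandwidth * oi
    if mem_limit < peak:
        return mem_limit, 'memory-bound'
    return peak, 'compute-bound'

P_INT8, BW_REAL = 18.6e12, 300e9

# (FLOPs, bytes) at 1 byte per element:
# Connector (M=1, K=12288, N=960) and MLP_FC1 (M=1024, K=768, N=3072)
kernels = {
    'Connector': (2 * 1 * 12288 * 960, 1 * 12288 + 12288 * 960 + 1 * 960),
    'MLP_FC1': (2 * 1024 * 768 * 3072, 1024 * 768 + 768 * 3072 + 1024 * 3072),
}
results = {}
for name, (flops, nbytes) in kernels.items():
    perf, bound = attainable(flops / nbytes, P_INT8, BW_REAL)
    # Last entry: lower bound on runtime in seconds under the roofline model
    results[name] = (perf, bound, flops / perf)
    print(f"{name}: {perf / 1e12:.2f} TOPS, {bound}, >= {flops / perf * 1e6:.1f} us")
```

Under these assumptions the Connector sits on the bandwidth slope (OI of about 2 ops/byte), while MLP_FC1 (OI of 768 ops/byte) reaches the INT8 compute ceiling.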

In [None]:
def plot_roofline(df_metrics):
    fig, ax = plt.subplots(figsize=(12, 8))
    x = np.logspace(-2, 3, 100)
    
    # Ceilings
    ax.loglog(x, np.minimum(P_FP32, BW_REAL * x), 'k-', label='FP32 Peak')
    ax.loglog(x, np.minimum(P_INT8, BW_REAL * x), 'b-', label='INT8 Peak')
    ax.loglog(x, np.minimum(P_INT4, BW_REAL * x), 'g-', label='INT4 Peak')
    ax.loglog(x, BW_REAL * x, 'r--', label='Memory Wall')
    
    # Plot kernels (INT8 baseline). Kernels with identical GEMM shapes share
    # the same OI, so group them under one marker with a combined label.
    grouped = df_metrics.groupby('OI', sort=True)['Kernel'].agg(', '.join)
    for i, (oi, names) in enumerate(grouped.items()):
        # Attainable INT8 performance at this operational intensity
        perf = min(P_INT8, BW_REAL * oi)
        ax.plot(oi, perf, 'b^', markersize=12)

        # Alternate the label above and below the marker to reduce overlap
        offset = 1.3 if i % 2 == 0 else 0.7
        ax.text(oi, perf * offset, names, fontsize=9, ha='center',
                va='bottom' if offset > 1 else 'top')
        
    ax.set_xlabel('Operational Intensity (Ops/Byte)')
    ax.set_ylabel('Performance (Ops/s)')
    ax.set_title('Vision Encoder Roofline (INT8)')
    ax.grid(True, which="both", ls="-", alpha=0.5)
    ax.legend()
    plt.show()

plot_roofline(df)
