
⚡ HSPMN v3.0: Hybrid Sparse-Predictive Matter Network


Hey! Some Science Guy here. Welcome to HSPMN v3.0 - an LLM architecture built from the ground up for the NVIDIA Blackwell (RTX 5090) GPU. Instead of burning cycles on every single token, it borrows a trick from the mammalian brain: route the predictable stuff through a fast path, and save the heavy matrix math for the genuinely complex tokens.

"The brain does not use the full neocortex to process a simple 'hello'. I just apply this exact same rule to my models."


🧠 Architecture Overview: The Hybrid Approach

Instead of imposing a monolithic $O(N^2)$ computational penalty on every token equally, HSPMN bifurcates the processing stream based on inherent token complexity. Here is the operational flow:

```mermaid
graph TD
    A[Input Stream] -->|Token Embeddings| B{ALF-LB Router}

    B -->|Predictable & Routine 80%| C["Reflexive Stream<br/>(Linear Attn + SwiGLU O(N))"]
    B -->|Complex Anomaly 20%| D["Contextual Stream<br/>(Full SQSK Attention O(N²))"]

    C --> E{Merge & Add}
    D --> E

    E --> F[Output to Next Block]

    style A fill:#2d3436,stroke:#fff,color:#fff
    style B fill:#e17055,stroke:#fff,color:#fff
    style C fill:#74b9ff,stroke:#fff,color:#fff,font-weight:bold
    style D fill:#a29bfe,stroke:#fff,color:#fff,font-weight:bold
    style E fill:#00b894,stroke:#fff,color:#fff
    style F fill:#2d3436,stroke:#fff,color:#fff
```
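In plain PyTorch, the split can be sketched roughly as below. Everything here (the function name, the toy `heavy_fn`/`light_fn` callables, the scatter-based merge) is my illustrative sketch, not the repository's actual API:

```python
import torch

def hybrid_forward(x, router_scores, sparsity_k=0.2, heavy_fn=None, light_fn=None):
    """Toy sketch of the two-stream idea: the top-k tokens (by router score)
    take the expensive O(k^2) path, every token takes the cheap O(N) path,
    and the heavy results overwrite the cheap ones at the merge point."""
    B, N, D = x.shape
    k = max(1, int(N * sparsity_k))              # static k, so no graph breaks
    idx = router_scores.topk(k, dim=1).indices   # (B, k) positions of "hard" tokens
    out = light_fn(x)                            # Reflexive Stream over all tokens
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
    heavy_out = heavy_fn(torch.gather(x, 1, gather_idx))  # Contextual Stream on k tokens
    return out.scatter(1, gather_idx, heavy_out)          # merge back by position
```

In the real architecture the heavy path is full SQSK attention and the merge is more careful than a plain overwrite, but the shape of the computation is the same.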

🌍 Real-World Example: Scalable Cyber-Security Logging

Imagine processing hundreds of thousands of routine firewall logs per second. A standard transformer immediately allocates massive VRAM tensors attempting to map relations between routine [INFO] pings until it crashes with an Out-of-Memory (OOM) error.

HSPMN sidesteps this limit gracefully: background [INFO] logs are compressed linearly (near-zero marginal memory footprint). But the millisecond a rogue [SQL_INJECTION_ATTACK] is parsed, the router snaps the anomaly into the heavy Contextual Stream for deep, focused reasoning. The result: massive, sustained sequence parsing where routine data incurs a flat O(1) VRAM cost per token, dramatically expanding your effective context window without exploding memory.
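As a toy illustration of that routing decision (hypothetical anomaly scores and function name, not the shipped router), picking the heavy-path log lines reduces to a budgeted top-k:

```python
def route_log_tokens(scores, heavy_budget=0.2):
    """Send the top `heavy_budget` fraction of lines (by anomaly score) to the
    O(N^2) Contextual Stream; everything else takes the O(N) Reflexive Stream."""
    k = max(1, int(len(scores) * heavy_budget))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    heavy = set(order[:k])
    return ["contextual" if i in heavy else "reflexive" for i in range(len(scores))]

# One rogue [SQL_INJECTION_ATTACK] among four routine [INFO] pings
paths = route_log_tokens([0.01, 0.02, 0.97, 0.01, 0.03])
# → ['reflexive', 'reflexive', 'contextual', 'reflexive', 'reflexive']
```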


🎯 Key Technical Features (No Fluff)

  • 🏎️ Hybrid Execution: FlexAttention for training + custom Triton kernels for inference.
  • 📉 Hardware Sparsity: Custom Triton kernels built ground-up for Blackwell architectures.
  • 🧠 328k Context Window: Tested on the RTX 5090 using just 30.24 GB VRAM via True Sparse KV Cache.
  • ⚡ Silly Fast: 1.33M tokens/sec at BF16 precision.
  • 🎲 ALF-LB Routing: A bias-based routing method without that annoying gradient/Gumbel noise.
  • ⚖️ Dual Entropy Loss: Forces strict 0-or-1 token choices while keeping the hardware load even across batches.
  • 🚫 Zero Graph Breaks: Native static routing (torch.topk) so torch.compile(fullgraph=True) actually does its job.
  • 📦 CUDAGraphs Compatible: Sparsity targets stored as plain Python floats (no .item() sync!), captured neatly in precisely 2 partitions.

🚀 Performance Bracket (RTX 5090)

| Metric | Value | Notes |
|---|---|---|
| Throughput | 1,329,516 tok/s | Batch=64, Seq=4096, Dim=2048 |
| VRAM (throughput) | 12.28 GB | CUDAGraphs, 2 partitions |
| Max Context | 335,872 tokens | Batch=1, Dim=2048 (30.24 GB VRAM) |
| Latency | 197.17 ms avg | Full forward pass (P95: 197.81 ms) |
| Training Speed | ~980k tok/s | Real training speed using FlexAttention |
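The throughput and latency rows cross-check each other: one forward pass processes Batch × Seq tokens, so dividing by the average latency reproduces the headline number (pure arithmetic, using the table's own figures):

```python
batch, seq_len = 64, 4096
latency_s = 0.19717                        # 197.17 ms average full forward pass
tokens_per_pass = batch * seq_len          # 262,144 tokens per forward pass
throughput = tokens_per_pass / latency_s   # ~1.33M tok/s, matching the table
```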

📂 Repository Structure & Architecture

The codebase is strictly modularized into core architectural models, hardware-accelerated execution pipelines, and rigorous validation suites. This ensures a clean separation between the mathematical framework and its runtime components.

Here is the high-level topology of the repository:

```mermaid
graph LR
    Root["📁 HSPMN-v3"]

    subgraph Core ["🧠 Core Architecture"]
        A1["🤖 hspmn_v3_0.py<br/><small>Main Architecture</small>"]
        A2["🤗 hspmn_hf_wrapper.py<br/><small>HuggingFace Wrap</small>"]
        A3["⚙️ kernels_v3_0.py<br/><small>Triton Magic!</small>"]
    end

    subgraph Runners ["🏃‍♂️ Execution & Training"]
        B1["🏎️ benchmark_v3_0.py<br/><small>Go Fast</small>"]
        B2["🏋️ train_v3_0.py<br/><small>Get Smart</small>"]
        B3["🛠️ utils_v3_0.py<br/><small>Helper Logic</small>"]
    end

    subgraph Testing ["🧪 Validation & Tests"]
        C1["🧪 test_v3_0.py<br/><small>Unit Tests</small>"]
        C2["⚡ test_kernels_v3_0.py"]
        C3["🪡 needle_test.py<br/><small>Context Check</small>"]
        C4["✅ verify_models.py"]
    end

    subgraph Docs ["📚 Documentation & Config"]
        D1["📖 README.md<br/><small>You are here</small>"]
        D2["📜 HSPMN_v3_0.tex & .pdf<br/><small>Architecture Paper</small>"]
        D3["📦 requirements.txt"]
        D4["⚖️ LICENSE"]
    end

    Root --> Core
    Root --> Runners
    Root --> Testing
    Root --> Docs

    %% Colors optimized for dark mode (white/bright text, saturated dark backgrounds)
    style Root fill:#d63031,stroke:#fff,stroke-width:2px,color:#fff,font-weight:bold

    style A1 fill:#0984e3,stroke:#fff,color:#fff
    style A2 fill:#0984e3,stroke:#fff,color:#fff
    style A3 fill:#0984e3,stroke:#fff,color:#fff

    style B1 fill:#00b894,stroke:#fff,color:#fff
    style B2 fill:#00b894,stroke:#fff,color:#fff
    style B3 fill:#00b894,stroke:#fff,color:#fff

    style C1 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C2 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C3 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C4 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold

    style D1 fill:#6c5ce7,stroke:#fff,color:#fff
    style D2 fill:#6c5ce7,stroke:#fff,color:#fff
    style D3 fill:#6c5ce7,stroke:#fff,color:#fff
    style D4 fill:#6c5ce7,stroke:#fff,color:#fff

    classDef default font-family:sans-serif,font-size:14px;
    classDef title font-weight:bold,color:#fff;
```

πŸ› οΈ Blackwell (sm_120) Hackery & Fixes

Running things on bleeding-edge tech like the NVIDIA GB202 (RTX 5090) isn't without quirks. Here's what I fixed under the hood:

  • TF32 Math Errors: PyTorch defaults to TF32 matmuls, whose reduced mantissa precision broke the router's sigmoid gate. Forced full FP32 via torch.set_float32_matmul_precision('highest'). Boom. Sorted.
  • Quantization Noise Gate: Fast MXFP8 math was bleeding noise into the router. I added a < 0.05 hard floor to protect the routing logic.
  • SiLU NaN Errors: Deep padding regions fed into Blackwell SiLU kernels produced NaNs and crashed them. Fixed with a good old clamp plus nan_to_num.
  • TMA Stride Protection: Replaced raw tl.load with tl.make_block_ptr to stop massive L2 cache misses dead in their tracks.
  • CUDAGraphs .item() Fix: Gutted tensor.item() from the router forward path; sparsity targets now live in a plain Python float (_sparsity_float), so CUDAGraphs captures properly.
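The SiLU fix is essentially defensive clamping. A minimal sketch of the pattern (my reconstruction in plain PyTorch, not the actual Triton kernel):

```python
import torch
import torch.nn.functional as F

def safe_silu(x, bound=20.0):
    """SiLU with inputs clamped to a numerically safe range, and any residual
    NaN/Inf (e.g. from deep padding regions) replaced with zeros."""
    x = x.clamp(-bound, bound)   # keeps sigmoid(x) away from under/overflow
    return torch.nan_to_num(F.silu(x), nan=0.0, posinf=0.0, neginf=0.0)
```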

📦 Getting Started (Installation)

Prerequisites: NVIDIA Driver 570+, CUDA 12.8+, Python 3.10+, PyTorch 2.10+ (nightly)

Pro-tip for reproducible benchmarks (OS tuning):

```shell
# GPU: persistence mode + power limit
sudo nvidia-smi -pm 1 && sudo nvidia-smi -pl 500
# CPU: performance governor + boost (boost path is for acpi-cpufreq;
# intel_pstate systems use /sys/devices/system/cpu/intel_pstate/no_turbo)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Memory: reduce OS jitter
sudo sysctl vm.swappiness=10
```

Then grab the code:

```shell
# Clone the repository
git clone https://github.com/NetBr3ak/HSPMN.git
cd HSPMN

# Install dependencies
pip install -r requirements.txt
```

🎓 Quick Start Scripts

1. Let it Rip (Benchmarks)

Test how fast your rig really is:

```shell
python benchmark_v3_0.py --mode all
```

2. Standard Inference

For direct integration or testing the core block programmatically:

```python
import torch
from hspmn_v3_0 import HSPMNBlock
from utils_v3_0 import HSPMNConfig

# Initialize configuration
config = HSPMNConfig(dim=2048, num_heads=16, num_kv_heads=4, sparsity_k=0.2)
model = HSPMNBlock(config).cuda().bfloat16()

# Compile the model
model = torch.compile(model, mode="max-autotune", fullgraph=True)

# Process a dummy sequence
x = torch.randn(1, 4096, 2048).cuda().bfloat16()
output, aux_loss, kv_cache = model(x)
print(f"Output shape: {output.shape}")
```

3. Spin Up Training

```shell
python train_v3_0.py \
    --batch 32 \
    --seq_len 4096 \
    --dim 2048 \
    --steps 1000 \
    --grad_accum 4 \
    --wandb "hspmn-experiment-1"
```

4. Integrity Testing

Tear it down to see if it breaks:

```shell
python test_kernels_v3_0.py
python verify_models.py
```

Author: Some Science Guy (Szymon Jędryczko)
License: Proprietary / All Rights Reserved - Non-Commercial Use Only

Source-available for portfolio viewing only. Commercial use, unauthorized modification, reproduction, or distribution is strictly prohibited. But feel free to look around!
