fastnn

fastnn is a high-performance, production-grade neural network framework built from scratch in Rust with seamless Python bindings. It delivers hardware-accelerated CPU and GPU compute through a familiar PyTorch-like API, without the overhead of mainstream deep learning stacks.

Version: v0.4.0 — Multi-GPU Support with Distributed Data Parallel

License: MIT | Python: 3.12+ | Rust: stable


Overview

fastnn is designed for researchers and engineers who need both the ergonomics of Python and the raw performance of systems-level code. Its core is implemented in Rust and exposed to Python via PyO3, making it a fast, dependency-light alternative for training and inference workloads.

Core design goals:

  • Zero-compromise performance via hand-written SIMD kernels and GPU compute shaders
  • A clean, PyTorch-like Python API that should feel familiar to PyTorch users
  • Portable acceleration across x86-64 (AVX2/AVX512), ARM (NEON), and GPU (WebGPU/wgpu)
  • First-class autograd with full backward pass support

Features

  • Vectorized CPU Kernels — Hand-optimized SIMD kernels targeting AVX2, AVX512, and ARM NEON, with runtime dispatch to the best available instruction set. Includes Cephes-style fast approximations for transcendental functions (exp, log).
  • GPU Acceleration — Cross-platform GPU compute via wgpu (WebGPU). Vectorized vec4 shaders for elementwise ops and tiled matrix multiplication with shared memory.
  • Multi-threading — Automatic parallelism across CPU cores using rayon, with cache-aware chunking for memory-bound and compute-bound workloads.
  • Native Autograd — Built-in automatic differentiation engine with operation tracking, backward passes, and no_grad context support.
  • Multi-GPU Training — Distributed Data Parallel (DDP) with bucketed AllReduce gradient synchronization and dynamic load balancing across GPUs.
  • PyO3 Python Bindings — Train and evaluate models from Python with a PyTorch-like API. No Python performance penalty on the hot path.
  • Optimized Convolutions — im2col-based Conv2d with specialized kernels for 1×1, depthwise, and 3×3 convolutions at various stride/dilation configurations.
  • BLAS Integration — Optional OpenBLAS backend for matrix multiplication on large tensors.
  • Training Utilities — Datasets, DataLoaders, and Keras-style Callbacks (EarlyStopping, ModelCheckpoint, LearningRateScheduler, CSVLogger).
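
The im2col approach named in the Optimized Convolutions item can be sketched in plain Python. This is an illustrative single-channel toy under simplifying assumptions (no padding, no dilation, no batching), not fastnn's Rust kernel; the names `im2col` and `conv2d_single_channel` are hypothetical.

```python
def im2col(image, kh, kw, stride=1):
    """Unfold a 2D image (list of rows) into one flat patch per output position."""
    h, w = len(image), len(image[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = []
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            patch = [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            cols.append(patch)
    return cols, (out_h, out_w)


def conv2d_single_channel(image, kernel, stride=1):
    """Convolution (cross-correlation, as in most DL libraries) as patches x kernel."""
    kh, kw = len(kernel), len(kernel[0])
    cols, (out_h, out_w) = im2col(image, kh, kw, stride)
    flat_kernel = [v for row in kernel for v in row]
    # Each output value is a dot product, so the whole op is one matrix-vector product.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in cols]
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_single_channel(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

The point of the transformation is that the convolution becomes a dense matrix product, which can then reuse an optimized matmul path.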

GPU Performance

Speedups compare fastnn GPU kernels against the equivalent PyTorch CPU operations on medium-to-large tensors.

Operation  Tensor Size    GPU Speedup   Notes
MatMul     512×1024×512   152×          Tiled matmul with shared memory
GELU       1000×1000      14×           Vectorized tanh computation
Sigmoid    1000×1000      11×           Vectorized shader operations
Add        1000×1000      (not listed)  vec4 vectorized elementwise shader

Note: GPU acceleration shows the highest gains for medium-to-large tensors (≥ 100×100). Small tensor operations may be bound by kernel launch and data transfer overhead.
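
As a rough illustration of the tiling idea behind the MatMul row, here is a CPU-side Python sketch of blocked matrix multiplication. On a GPU, each tile of the inputs would be staged in workgroup shared memory and reused by many threads; here `tile` only controls loop blocking. The function name `matmul_tiled` is hypothetical and is not part of the fastnn API.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the operands tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Work on one (tile x tile) block of the output at a time,
                # so the same small slices of a and b are reused while hot.
                for i in range(i0, min(i0 + tile, m)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ik * b[kk][j]
    return c


a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
product = matmul_tiled(a, b)
print(product)  # [[58.0, 64.0], [139.0, 154.0]]
```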


Project Structure

fastnn/
├── Cargo.toml                  # Rust dependencies (PyO3, rayon, wgpu, ...)
├── pyproject.toml              # Python package configuration (maturin)
├── Makefile                    # Common dev tasks (install, build, test, bench)
├── src/
│   ├── lib.rs                  # Python module export & PyO3 bindings
│   ├── tensor.rs               # Core Tensor struct, shape, strides, dtype
│   ├── storage/                # Memory backend, device allocation (CPU/GPU)
│   ├── autograd/               # Automatic differentiation tape and backward graph
│   ├── dispatcher/             # Dynamic kernel dispatch (CPU vs GPU)
│   ├── kernels/
│   │   ├── cpu.rs              # SIMD kernels: AVX2, AVX512, NEON, scalar fallbacks
│   │   ├── blas.rs             # BLAS-accelerated matrix multiplication
│   │   └── gpu/                # WebGPU compute pipelines and WGSL shaders
│   ├── nn/                     # Neural network layers, activations, attention
│   ├── optim/                  # SGD, Adam, AdamW optimizers
│   ├── train/                  # Trainer, callbacks, metrics, loss functions
│   └── io/                     # Model serialization (safetensors), DLPack
├── fastnn/                     # Python package
│   ├── __init__.py             # Public API surface
│   ├── nn.py                   # Sequential, ModuleList
│   ├── parallel.py             # DataParallel / DDP
│   ├── models/                 # Pre-built models: MLP, Transformer
│   ├── data.py                 # Dataset, TensorDataset, DataLoader
│   └── callbacks.py            # Training callbacks
└── tests/
    ├── bench/                  # Benchmarks vs PyTorch (CPU & GPU)
    └── *.py                    # Unit and integration tests
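
The tree above notes that the core Tensor tracks shape and strides. As a minimal sketch of how those two fields usually relate for a contiguous row-major tensor (an assumption about the layout, not a description of `tensor.rs`), strides can be derived from the shape and used to turn a multi-dimensional index into a flat offset. Both helper names are hypothetical, and strides are counted in elements rather than bytes.

```python
def contiguous_strides(shape):
    """Row-major strides: the last dimension moves fastest."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def flat_offset(index, strides):
    """Position of a multi-dimensional index in the flat storage buffer."""
    return sum(i * s for i, s in zip(index, strides))


strides = contiguous_strides([2, 3, 4])
print(strides)                          # [12, 4, 1]
print(flat_offset((1, 2, 3), strides))  # 23
```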

Installation

Prerequisites

Tool    Version  Purpose
Rust    stable   Build the core library
Python  ≥ 3.12   Python bindings
uv      latest   Python dependency management

Install Rust via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Build & Install

git clone https://github.com/PetrouilFan/fastnn.git
cd fastnn

# Install and build (editable mode, with dev dependencies)
uv pip install -e ".[dev]"

# Or using the Makefile
make install

The [dev] extra installs testing dependencies including pytest, pytest-benchmark, and numpy.

Platform Notes

fastnn selects the best available CPU instruction set and GPU backend at runtime:

  • x86-64: AVX512 → AVX2 → scalar fallback
  • ARM64: NEON intrinsics
  • GPU: WebGPU via wgpu (Vulkan, Metal, DX12, or WebGPU backends)
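
The fallback order listed above can be pictured as a small lookup over detected CPU features. The sketch below is only a model of that idea in Python: the feature strings, the `KERNELS` table, and `select_kernel` are all hypothetical, and the three entries alias the same scalar function so the example stays runnable.

```python
def add_scalar(xs, ys):
    """Portable fallback: one element at a time."""
    return [x + y for x, y in zip(xs, ys)]


# In a real build these would be distinct SIMD implementations;
# here they all point at the scalar version (assumption for illustration).
KERNELS = {
    "avx512f": add_scalar,
    "avx2": add_scalar,
    "scalar": add_scalar,
}

# Preference order, widest vectors first.
PRIORITY = ["avx512f", "avx2", "scalar"]


def select_kernel(cpu_features):
    """Return (name, function) for the best kernel the CPU supports."""
    for name in PRIORITY:
        if name == "scalar" or name in cpu_features:
            return name, KERNELS[name]


chosen_name, add = select_kernel({"sse2", "avx2"})
print(chosen_name)            # avx2
print(add([1, 2], [10, 20]))  # [11, 22]
```

Doing this selection once at startup, rather than per call, keeps the check off the hot path.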

Quick Start

Basic Usage

import fastnn as fnn

# Create tensors
a = fnn.randn([1000, 1000])
b = fnn.randn([1000, 1000])
c = a @ b  # BLAS/SIMD-accelerated matmul

# Define a model
model = fnn.Sequential(
    fnn.Linear(128, 64),
    fnn.ReLU(),
    fnn.Linear(64, 10),
)

optimizer = fnn.Adam(model.parameters(), lr=0.001)
inputs  = fnn.randn([32, 128])
targets = fnn.randint(low=0, high=10, shape=[32])

# Training step
outputs = model(inputs)
loss    = fnn.cross_entropy_loss(outputs, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"Loss: {loss.item():.4f}")
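
The `loss.backward()` call above relies on reverse-mode automatic differentiation. As a minimal sketch of the general technique (a scalar-only toy, not fastnn's autograd engine), each operation records its inputs and a closure that pushes gradients back to them. The `Value` class is hypothetical.

```python
class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes self.grad into the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b
y.backward()
print(y.data, x.grad, w.grad, b.grad)  # 7.0 2.0 3.0 1.0
```

Gradients accumulate with `+=`, which is why a training loop clears them between steps, as `optimizer.zero_grad()` does in the example above.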

GPU Acceleration

import fastnn as fnn

# Switch to WebGPU
fnn.set_default_device("gpu:0")

a = fnn.randn([1000, 1000], device="gpu")
b = fnn.randn([1000, 1000], device="gpu")
c = a @ b  # GPU-accelerated matrix multiplication

Multi-GPU Training (DDP)

import fastnn as fnn

# Create identical model replicas for each GPU
model_gpu0 = fnn.models.MLP(input_dim=784, hidden_dims=[256], output_dim=10)
model_gpu1 = fnn.models.MLP(input_dim=784, hidden_dims=[256], output_dim=10)

# Wrap in DataParallel with optional weighted data splitting
dp_model = fnn.DataParallel(
    [model_gpu0, model_gpu1],
    device_ids=[0, 1],
    weights=[0.6, 0.4],  # Proportional to GPU memory/speed
)

optimizers = [
    fnn.Adam(dp_model.replicas[0].parameters(), lr=1e-3),
    fnn.Adam(dp_model.replicas[1].parameters(), lr=1e-3),
]

# `dataloader` is assumed to be an existing fnn.DataLoader yielding (inputs, targets) batches
for x_batch, y_batch in dataloader:
    loss = dp_model.forward_backward(x_batch, y_batch, fnn.cross_entropy_loss)
    dp_model.sync_gradients()         # Bucketed AllReduce
    for opt in optimizers:
        opt.step()
        opt.zero_grad()

    dp_model.adjust_weights_based_on_performance()  # Dynamic load balancing
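
Two ideas in the loop above, weighted batch splitting and bucketed gradient averaging, can be sketched without any GPU. This is an illustrative model under stated assumptions, not the code behind `fnn.DataParallel`: gradients are flat Python lists, "communication" is a list write, and both function names are hypothetical.

```python
def split_by_weights(n_samples, weights):
    """Shard sizes proportional to weights; the rounding remainder goes to the last shard."""
    total = sum(weights)
    sizes = [int(n_samples * w / total) for w in weights]
    sizes[-1] += n_samples - sum(sizes)
    return sizes


def allreduce_mean_bucketed(replica_grads, bucket_size=2):
    """Average gradients across replicas, one bucket of parameters at a time."""
    n_replicas = len(replica_grads)
    n_params = len(replica_grads[0])
    for start in range(0, n_params, bucket_size):
        end = min(start + bucket_size, n_params)
        # One reduction per bucket instead of one per parameter.
        bucket_mean = [
            sum(grads[i] for grads in replica_grads) / n_replicas
            for i in range(start, end)
        ]
        for grads in replica_grads:
            grads[start:end] = bucket_mean
    return replica_grads


sizes = split_by_weights(32, [0.6, 0.4])
print(sizes)  # [19, 13]

grads = [[1.0, 2.0, 3.0],   # replica 0
         [3.0, 4.0, 5.0]]   # replica 1
allreduce_mean_bucketed(grads)
print(grads)  # [[2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
```

The sketch uses a plain mean. When shards differ in size, as with unequal weights, a mean weighted by shard size may be the more appropriate reduction.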

Training with Callbacks

import fastnn as fnn

model     = fnn.models.MLP(input_dim=2, hidden_dims=[16, 16], output_dim=1, activation="relu")
optimizer = fnn.Adam(model.parameters(), lr=1e-2)

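# Toy data with illustrative shapes: 64 samples, 2 features, 1 target
X = fnn.randn([64, 2])
y = fnn.randn([64, 1])

ds     = fnn.TensorDataset(X, y)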
loader = fnn.DataLoader(ds, batch_size=4, shuffle=True)

callbacks = [
    fnn.EarlyStopping(monitor="loss", patience=10),
    fnn.ModelCheckpoint(dirpath="./checkpoints", monitor="loss", save_best_only=True),
    fnn.LearningRateScheduler(schedule="cosine", lr=1e-2, T_max=100),
]

model.train()
for epoch in range(100):
    total_loss = 0
    for batch_x, batch_y in loader:
        pred = model(batch_x)
        loss = fnn.mse_loss(pred, batch_y)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference without gradient tracking
with fnn.no_grad():
    preds = model(X)
    print(preds.numpy().round(2))
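
The exact semantics of `fnn.LearningRateScheduler` and `fnn.EarlyStopping` are not spelled out here, so the following is a sketch of the standard techniques their arguments suggest: cosine annealing over `T_max` steps and patience-based stopping on a monitored loss. It assumes that lower loss is better and that the rate decays to `min_lr` at step `t_max`; `cosine_lr` and `EarlyStopper` are hypothetical names.

```python
import math


def cosine_lr(step, base_lr, t_max, min_lr=0.0):
    """Cosine annealing from base_lr (step 0) down to min_lr (step t_max)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / t_max))


class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    lr = cosine_lr(epoch, base_lr=1e-2, t_max=100)
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 4
```

Note that the last loss (0.7) would have been an improvement; a small patience trades that risk for less wasted compute.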

Controlling Parallelism

# Set CPU thread count for parallel kernels
fnn.set_num_threads(8)

# Inspect memory and registered ops
print(fnn.allocator_stats())
print(fnn.list_registered_ops())
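
To make the effect of a thread count concrete, the sketch below splits a reduction into near-equal contiguous chunks, one per worker, and combines the partial results. It only models the partitioning: fastnn's kernels use rayon in Rust, whereas pure-Python threads do not run CPU-bound work in parallel. The helpers `chunk_ranges` and `parallel_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, n_chunks):
    """Split range(n) into up to n_chunks contiguous, near-equal (start, end) pairs."""
    base, extra = divmod(n, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def parallel_sum(values, n_threads=4):
    """Sum each chunk on a worker, then combine the partial sums."""
    ranges = chunk_ranges(len(values), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda r: sum(values[r[0]:r[1]]), ranges)
        return sum(partials)


print(chunk_ranges(10, 4))            # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(parallel_sum(list(range(10))))  # 45
```

Contiguous chunks keep each worker on its own region of memory, which is the property a cache-aware scheme builds on.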

API Reference

Tensor Creation

Function                        Description
fnn.tensor(data, shape)         Create tensor from a Python list
fnn.zeros(shape)                Tensor of zeros
fnn.ones(shape)                 Tensor of ones
fnn.full(shape, value)          Tensor filled with a value
fnn.eye(n)                      Identity matrix
fnn.arange(end)                 Integer range [0, end)
fnn.linspace(start, end, n)     Linearly spaced values
fnn.randn(shape)                Random normal (Gaussian)
fnn.rand(shape)                 Random uniform [0, 1)
fnn.randint(low, high, shape)   Random integers in [low, high)

Neural Network Modules

Module                                                   Description
fnn.Linear(in, out, bias=True)                           Fully connected layer
fnn.Conv2d(cin, cout, kernel, stride, padding)           2D convolution
fnn.LayerNorm(shape)                                     Layer normalization
fnn.BatchNorm1d(features)                                Batch normalization
fnn.Dropout(p)                                           Dropout regularization
fnn.Embedding(num, dim)                                  Learned word embeddings
fnn.ReLU / fnn.GELU / fnn.Sigmoid / fnn.Tanh / fnn.SiLU  Activation layers
fnn.Sequential(*layers)                                  Sequential layer container
fnn.ModuleList(modules)                                  Indexable module list

Optimizers

Optimizer                                                     Description
fnn.SGD(params, lr, momentum=0, weight_decay=0)               Stochastic Gradient Descent
fnn.Adam(params, lr, betas=(0.9, 0.999), eps=1e-8)            Adam
fnn.AdamW(params, lr, betas=(0.9, 0.999), weight_decay=0.01)  AdamW (decoupled weight decay)
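
The difference between Adam with an L2 penalty and AdamW is easiest to see on a single scalar parameter. The sketch below follows the standard published update rules, not fastnn's Rust implementation; `adam_step` and its `decoupled` flag are hypothetical.

```python
import math


def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar parameter p with gradient g at step t (1-based)."""
    if decoupled:
        # AdamW: shrink the parameter directly, outside the adaptive scaling.
        p = p * (1 - lr * weight_decay)
    else:
        # Adam + L2: the penalty is folded into the gradient and gets rescaled.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p_l2, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, t=1, weight_decay=0.01)
p_w, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, t=1, weight_decay=0.01, decoupled=True)
print(round(p_l2, 6))  # 0.999
print(round(p_w, 6))   # 0.99899
```

On this first step the L2 term is normalized away by the adaptive scaling, while the decoupled decay still shrinks the parameter by `lr * weight_decay`.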

Loss Functions

Function                                Description
fnn.mse_loss(pred, target)              Mean squared error
fnn.cross_entropy_loss(logits, target)  Cross-entropy loss
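
For reference, both losses can be written out in a few lines of plain Python. The sketch assumes mean reduction over the batch and integer class indices as targets for cross-entropy (consistent with the `fnn.randint` targets in the Quick Start); whether fastnn uses the same conventions is an assumption. These stand-alone functions reuse the names for readability only and are not the fastnn functions.

```python
import math


def mse_loss(pred, target):
    """Mean of squared differences over a flat list of predictions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood; logits is a list of rows, targets are class indices."""
    total = 0.0
    for row, t in zip(logits, targets):
        # Subtract the max before exponentiating for numerical stability.
        mx = max(row)
        log_sum_exp = mx + math.log(sum(math.exp(z - mx) for z in row))
        total += log_sum_exp - row[t]
    return total / len(logits)


print(mse_loss([1.0, 2.0], [1.0, 4.0]))                # 2.0
print(round(cross_entropy_loss([[0.0, 0.0]], [0]), 4))  # 0.6931
```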

Testing & Benchmarking

# Run unit and integration tests
pytest

# Run benchmarks only
pytest --benchmark-only

# CPU benchmark suite (fastnn vs PyTorch)
python tests/bench/fastnn.py

# GPU vs CPU comparison (quick)
python tests/bench/bench_gpu_simple.py

# Full GPU benchmark suite
python tests/bench/bench_gpu.py

Building from Source

# Development build (faster compile, unoptimized)
maturin develop

# Release build (full optimizations: LTO, codegen-units=1, opt-level=3)
maturin build --release

# Or via Makefile
make build

Cargo feature flags:

Feature      Description
simd         Enable SIMD kernels (AVX2, AVX512, NEON)
parallel     Enable Rayon multi-threaded parallelism
simd-avx512  Enable AVX-512 kernels (requires AVX512-capable CPU)
openblas     Link against OpenBLAS for large matmul
prefetch     Enable software prefetching in matmul kernels

License

fastnn is licensed under the MIT License.
Copyright © 2026 Petros Fanioudakis
