Add CUDA attention kernels, gradient norms, and CI improvements #69

Merged
codeaddict-119 merged 51 commits into codeaddict-master from master
Jun 8, 2026

@Eamon2009
Owner

No description provided.

codeaddict-119 and others added 22 commits May 25, 2026 10:47
## Summary
## Causal Multi-Head Attention Forward Pass (CUDA)
This PR implements the CUDA forward pass for causal multi-head attention
(`attention_forward`). It includes the core GPU kernel, custom block-level
reduction primitives, and tensor validation helpers.

## Core Attention Kernel: `attention_forward_kernel`
- Computes scaled dot-product attention on an interleaved QKV input
tensor structured as [Batch, Time, 3 * Channels].
- Causal masking: enforces the autoregressive constraint by preventing
a token at time step $t$ from attending to any future time step $t_2 > t$.
- Implements parallelized `block_max` and `block_sum` device functions.
- Uses cooperative warp shuffles (`warp_max`, `warp_sum`) and shared
memory to compute a numerically stable online softmax normalization.
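
As a CPU-side reference for what the kernel computes, the following is a minimal pure-Python sketch, not the CUDA kernel from this PR. It uses a plain two-pass softmax (subtract the row maximum, then normalize) where the kernel uses block reductions and an online softmax. The function name `causal_attention`, the `[q | k | v]` split inside each row, and the contiguous per-head slices are assumptions made for illustration.

```python
import math


def causal_attention(qkv, n_head):
    """Reference causal multi-head attention for a single sequence.

    qkv: list of T rows, each with 3*C floats laid out as [q | k | v]
         (assumed layout). Returns a list of T rows of C floats.
    """
    T = len(qkv)
    C = len(qkv[0]) // 3
    hs = C // n_head                      # head size
    scale = 1.0 / math.sqrt(hs)
    out = [[0.0] * C for _ in range(T)]
    for h in range(n_head):
        lo = h * hs                       # first channel of this head
        for t in range(T):
            q = qkv[t][lo:lo + hs]
            # Only positions t2 <= t are visited, which is the causal mask.
            scores = []
            for t2 in range(t + 1):
                k = qkv[t2][C + lo:C + lo + hs]
                scores.append(scale * sum(a * b for a, b in zip(q, k)))
            row_max = max(scores)         # plays the role of block_max
            exps = [math.exp(s - row_max) for s in scores]
            denom = sum(exps)             # plays the role of block_sum
            for t2, e in enumerate(exps):
                v = qkv[t2][2 * C + lo:2 * C + lo + hs]
                w = e / denom
                for i in range(hs):
                    out[t][lo + i] += w * v[i]
    return out
```

Because the first token can only attend to itself, its output row equals its own value vector, which makes a convenient sanity check for any implementation of the mask.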

#52 
#11 
#12 
#14 
#29
# Pull Request Engineering Summary

## Core LLM Pipeline Modernization & Architectural Overhaul

> **Executive Summary:** This pull request aggregates a sequence of
engineering upgrades that move the standalone modeling stack to a
decoder-only autoregressive Transformer engine. The updates span the
front-end web UI components, SIMD-accelerated CPU tensor kernels, and
multi-GPU training with experiment telemetry.

---

## 1. Pull Request Core Metadata

| Metadata Field | Description |
| :--- | :--- |
| **PR Target Branch / Title** | `refactor/core-engine` $\rightarrow$ `main` \| Upgrade Core LLM Infrastructure to Decoder-Only Pipeline & Analytics |
| **Primary Changes** | Architecture migration (Decoder-Only), Telemetry implementation (WandB), UI Overhaul (Inline Styles), Native Optimization (AVX/SSE) |
| **Impact Scope** | Core Neural Network Engine, Cluster Training Primitives, Cross-Platform Frontend Subsystems, Vector Math Backends |
| **Telemetry & Tokenization** | Weights & Biases Runtime Tracking Engine Integration; `tiktoken` (`o200k_base` Byte-Pair Encoding) Backend Migration |
| **Hardware Optimization** | Unaligned 256-bit Vector Intrinsics (`__AVX__`) and 128-bit Lane Vectors (`__SSE__`) with fallback Scalar Arrays |

---

## 2. Core Neural Network & Architectural Shifts

The engineering modifications consolidate multiple independent core
layers (`Embedding`, `LayerNorm`, `Linear`) into a unified,
production-grade autoregressive decoder-only Transformer configuration
matching state-of-the-art LLM architectures:

* **Decoder-Only Refactor:** Phased out the legacy sequence-to-sequence
(seq2seq) architecture in favor of a fully causal autoregressive
structure. Causal masking is applied during the forward pass so that no
position can attend to future tokens.
* **Token & Absolute Position Embeddings:** The core `Embedding` layer
maps flat input token sequences into continuous 3D hidden tensors of
shape $[B, T, D]$. A separate absolute positional embedding path
(`forward_pos`) produces the position vectors for the $T$ positions of
the context window.
* **Numerical Loss & Optimization Stability:** The `cross_entropy`
routine subtracts the maximum logit before exponentiating (max-value
normalization), which protects the log-softmax against overflow and
underflow. The stateful `AdamW` optimizer registers pointers to the
parameter buffers and updates the raw weight vectors in place, without
copying them into intermediate structures.
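
The max-value normalization mentioned for `cross_entropy` can be illustrated with a short sketch. This is a generic log-sum-exp formulation in Python, not the project's C++ code; the function signature is an assumption.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # max-value normalization: largest exponent becomes 0
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]
```

With logits such as `[1000.0, 1001.0, 1002.0]`, a naive `math.exp(1000.0)` raises `OverflowError`, while the shifted form only ever exponentiates values that are zero or negative.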

---

## 3. Low-Level Core Optimizations (C++ Tensor Kernel)

To eliminate memory-bound bottlenecks inside native execution calls,
element-wise arithmetic passes over raw vector structures (`add`,
`add_inplace`) have been decoupled into specialized architecture paths
compiled conditionally using preprocessor macro definitions:

* **256-Bit AVX Intrinsics:** Uses explicit unaligned load loops
(`_mm256_loadu_ps`) and vector additions (`_mm256_add_ps`) to process
eight single-precision floats per instruction.
* **128-Bit SSE Fallback:** Provides explicit 128-bit vector loops
(`_mm_loadu_ps`, `_mm_add_ps`) that process four floats per
instruction on older CPUs without AVX.
* **Flat Serialized Memory Layouts:** All layer components
(`Linear`, `LayerNorm`, `Embedding`) read and write their parameters as
flat binary blocks through `reinterpret_cast<char*>`, so checkpoints are
saved and loaded without per-field serialization metadata.
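
The loop structure behind the AVX and SSE paths (full vector-width blocks followed by a scalar tail) can be modeled as follows. Python has no SIMD intrinsics, so this sketch only mirrors the control flow; `add_blocked` and `LANES` are hypothetical names, and each block iteration stands in for a load, load, add, store sequence.

```python
LANES = 8  # floats per 256-bit register (256 bits / 32 bits per float)


def add_blocked(a, b, lanes=LANES):
    """Element-wise add processed in `lanes`-wide blocks plus a scalar tail."""
    n = len(a)
    out = [0.0] * n
    full = n - n % lanes  # number of elements covered by whole blocks
    for i in range(0, full, lanes):
        # One iteration models one vector load/add/store on `lanes` floats.
        out[i:i + lanes] = [x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
    for i in range(full, n):
        # Scalar fallback for the leftover elements.
        out[i] = a[i] + b[i]
    return out
```

Calling it with `lanes=4` models the SSE width; the tail loop is what keeps lengths that are not a multiple of the vector width correct.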

---

## 4. Distributed Orchestration & Cluster Telemetry

The Python cluster-orchestration codebase has been fundamentally
upgraded to support large-scale high-performance training profiles
across distributed multi-node hardware targets:

* **Multi-GPU DDP Architecture:** Integrates NCCL-backed
`DistributedDataParallel` orchestration, utilizing automated
execution-rank filtering, master process controls, and specialized
cluster seed off-setting logic to ensure deterministic replication
bounds.
* **Mixed-Precision Execution (AMP):** Deploys runtime context
auto-casting (`torch.amp.autocast`) toggling between pure `bfloat16` and
gradient-scaled `float16` layouts to prevent numerical underflow while
preserving maximum compute efficiency on Tensor Cores.
* **Sub-word Tokenization Backends:** Replaces the slow legacy text
split-parsers with byte-pair encoding (`tiktoken` with the `o200k_base`
encoding), improving token density per context window and reducing
vocabulary padding overhead.
* **WandB Experiment Telemetry:** Hooks up centralized Weights & Biases
telemetry tracking loops, automating real-time convergence parsing,
structural loss diagnostics, and hardware parameter health tracking
updates.
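
The rank filtering and seed offsetting described for the DDP setup can be sketched without any distributed runtime. The helper below is hypothetical; in a real `torchrun` launch the rank would come from the `RANK` environment variable, and the base seed value is an assumption.

```python
import random


def setup_rank(base_seed, rank):
    """Return (is_master, rng) for one process in a data-parallel job."""
    is_master = rank == 0                  # only rank 0 logs and checkpoints
    rng = random.Random(base_seed + rank)  # per-rank offset, reproducible
    return is_master, rng
```

Offsetting by rank gives every process a different but repeatable random stream, so reruns with the same base seed sample the same data on each rank.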

---

## 5. Frontend Framework Refactor (React Web Component Tree)

The web application dashboard migrates from the legacy utility-first
global Tailwind configuration to explicit, typed inline styles
(`React.CSSProperties`) combined with React mouse and focus event
handlers to manage frequently changing interface states:

* **Component Modularity Overhauls:** The structural view layers
(`AppLayout` shell, `Sidebar`, `Topbar`, `SessionItem`, `StatsPanel`,
`SettingsPanel`, and `ModelBadge`) have been completely rewritten to
rely on atomic design tokens and explicit flexbox layout boundaries.
* **Dynamic Event Interactivity:** Replaces utility hover classes with
micro-interactions driven by event handlers (`onMouseEnter`,
`onMouseLeave`, `onFocusCapture`, `onBlurCapture`) that control
component border glows, state transitions, and translucent background
overlays.
* **Layout & Responsive Edge-Case Safety:** Enforces rigid multi-device
rendering bounds using concrete visual rules (`flexShrink: 0`,
`minWidth: 0`, `wordBreak: 'break-all'`, and explicit multi-word text
ellipsis clamping) to ensure a bulletproof user interface across desktop
and mobile screens.
* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
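
A per-step log line with `loss`, `ms`, and `tok/s` could be produced roughly as below. The exact format used by the training loop is not shown in this PR, so the layout, the function name, and the `seq_len` parameter are assumptions.

```python
def format_step(step, loss, elapsed_s, batch_size, seq_len):
    """Format one training-step log line from the measured step time."""
    tokens = batch_size * seq_len       # tokens processed in this step
    ms = elapsed_s * 1000.0
    tok_per_s = tokens / elapsed_s      # throughput for this step
    return f"step {step:5d} | loss {loss:.4f} | {ms:.1f} ms | {tok_per_s:.0f} tok/s"
```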

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
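
Stochastic rounding from float32 to bfloat16 can be sketched on the raw bit pattern: add 16 random bits below the kept mantissa, then truncate. This is a generic illustration, not the project's kernel; the helper names are hypothetical, and infinities and NaNs are not handled.

```python
import random
import struct


def f32_bits(x):
    """Bit pattern of x as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """float32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def stochastic_round_bf16(x, rng):
    """Round x to a bfloat16-representable value, unbiased in expectation."""
    bits = f32_bits(x)
    noise = rng.getrandbits(16)                # uniform in [0, 65535]
    rounded = (bits + noise) & 0xFFFF0000      # carry decides up vs. down
    return bits_f32(rounded)
```

The probability of rounding away from zero equals the discarded fraction, so the mean over many roundings matches the original value, which is why small optimizer updates are not systematically lost.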

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>
* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>
- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
…ker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.
Added macOS binary build and release steps to CI workflow.
Removed dependency on build-macos-x64 for the release job.
@Eamon2009 Eamon2009 requested a review from codeaddict-119 June 3, 2026 05:29
@Eamon2009 Eamon2009 self-assigned this Jun 3, 2026
@Eamon2009 Eamon2009 added the cuda label Jun 3, 2026
@Eamon2009
Owner Author

/run-checks

@github-actions

github-actions Bot commented Jun 3, 2026

✅ All checks passed!

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>
…fseekCheck` with explicit crash details and project-specific troubleshooting hints. - Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.
Changed the project title to include 'llm.cpp' for clarity.
Removed image from README and adjusted formatting.
@codeaddict-119 codeaddict-119 merged commit e4850cc into codeaddict-master Jun 8, 2026
4 checks passed
Eamon2009 added a commit that referenced this pull request Jun 8, 2026
* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
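
The idea of pairing a non-throwing `Status` with `checked_mul` for allocation sizes can be modeled in Python as below. The field names, the return shape, and the 64-bit `SIZE_MAX` are assumptions; only the dtype names and their byte widths follow the commit message and the standard sizes of those types.

```python
from dataclasses import dataclass

SIZE_MAX = 2 ** 64 - 1  # assumption: 64-bit size_t

DTYPE_SIZE = {"F32": 4, "F16": 2, "BF16": 2, "I32": 4, "U8": 1}  # bytes


@dataclass
class Status:
    ok: bool
    message: str = ""


def checked_mul(a, b):
    """Return (Status, product); the product is meaningful only when ok."""
    if a < 0 or b < 0:
        return Status(False, "negative operand"), 0
    if b != 0 and a > SIZE_MAX // b:
        return Status(False, "size overflow"), 0
    return Status(True), a * b


def allocation_bytes(num_elements, dtype):
    """Byte count for a buffer, reported through Status instead of raising."""
    return checked_mul(num_elements, DTYPE_SIZE[dtype])
```

Checking `a > SIZE_MAX // b` before multiplying is the usual way to detect overflow without performing the overflowing multiplication itself.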

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
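
For reference, the two GeLU variants and their derivatives can be written as scalar functions. This is a sketch of the standard formulas (the erf form and the tanh approximation), not the CUDA entrypoints; the Python signatures are assumptions that only mirror the declared names.

```python
import math
from enum import Enum


class GeluMode(Enum):
    EXACT = "exact"
    APPROXIMATE = "approximate"


def gelu_forward(x, mode=GeluMode.APPROXIMATE):
    if mode is GeluMode.EXACT:
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_backward(x, grad_out, mode=GeluMode.APPROXIMATE):
    """Gradient of the loss with respect to x, given the upstream gradient."""
    if mode is GeluMode.EXACT:
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return grad_out * (cdf + x * pdf)
    k = math.sqrt(2.0 / math.pi)
    t = math.tanh(k * (x + 0.044715 * x ** 3))
    d_inner = k * (1.0 + 3.0 * 0.044715 * x * x)
    return grad_out * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * d_inner)
```

A finite-difference comparison against `gelu_forward` is a cheap way to validate either backward variant before porting it to a kernel.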

* feat(cuda): add gradient norm calculation and clipping interfaces
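
The commit only declares the interfaces, so the following is a generic sketch of global-norm gradient clipping rather than the project's API. Function names and the `eps` value are assumptions; the scaling rule follows the common convention of multiplying by `max_norm / (norm + eps)`, capped at 1.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient tensors, each given as a flat list."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale gradients in place so their global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + eps))
    for tensor in grads:
        for i in range(len(tensor)):
            tensor[i] *= scale
    return norm  # the pre-clipping norm, useful for logging
```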

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs:report [run_20260530_165216](~791 tok/s)  (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.
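
The OS-dependent binary selection might look like the sketch below. It is not the script added by this commit: `resolve_binaries` and the `.venv` directory name are hypothetical, while the `Scripts/python.exe` versus `bin/python` layout and the `npm.cmd` shim on Windows are standard conventions.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def resolve_binaries(system=None, venv_dir=".venv"):
    """Pick the npm and virtualenv python executables for the host OS."""
    system = system or platform.system()
    if system == "Windows":
        python = PureWindowsPath(venv_dir) / "Scripts" / "python.exe"
        return {"npm": "npm.cmd", "python": str(python)}
    python = PurePosixPath(venv_dir) / "bin" / "python"
    return {"npm": "npm", "python": str(python)}
```

Passing `system` explicitly keeps the function testable on any host, while the default falls back to `platform.system()`.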

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the AdamW fused CUDA kernel including linear interpolation
optimizations (`lerp`), multi-slice batching support via 2D grids, and
`init_from_master` utility functions for low-precision parameter handling.
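
The `lerp` formulation of the AdamW moment updates can be shown on a single scalar parameter. This is a sketch of the standard AdamW rule with decoupled weight decay, not the fused kernel; the function signature and default hyperparameters are assumptions.

```python
import math


def lerp(start, end, weight):
    """Linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)


def adamw_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter; returns (p, m, v)."""
    m = lerp(g, m, beta1)        # equals beta1 * m + (1 - beta1) * g
    v = lerp(g * g, v, beta2)    # equals beta2 * v + (1 - beta2) * g^2
    m_hat = m / (1.0 - beta1 ** step)   # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v
```

Writing each moment update as one interpolation replaces two multiplies and an add with a single fused multiply-add pattern, which is the usual motivation for the `lerp` form.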

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation
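
Model FLOPs utilization (MFU) is commonly estimated from the approximation that training costs about 6 FLOPs per parameter per token (forward plus backward, ignoring attention). The sketch below applies that rule; the function name is hypothetical, and the peak throughput must come from the GPU's specification, which this PR does not state.

```python
def estimate_mfu(num_params, tokens_per_second, peak_flops):
    """Fraction of peak FLOP/s achieved, using the 6 * N FLOPs/token rule."""
    flops_per_token = 6 * num_params
    achieved = flops_per_token * tokens_per_second
    return achieved / peak_flops
```

As a usage example with the reported run (6.68M parameters at about 791 tok/s) and a purely hypothetical peak of 1 TFLOP/s, `estimate_mfu(6.68e6, 791, 1e12)` gives roughly 0.032.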

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Eamon2009 added a commit that referenced this pull request Jun 8, 2026
* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.

* Update README.md with new banner for Quadtrix.cpp

* docs:report [run_20260530_165216](~791 tok/s) (#60)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#62)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

---------

Co-authored-by: Max <eamon5174@gmail.com>

* chore: clang-format configuration file based on LLVM (#63)

Co-authored-by: Eamon <eamon112009@gmail.com>

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
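
One way these pieces might fit together, as a host-only sketch. The names `DType`, `dtype_size`, `Status`, and `checked_mul` come from the commit message; their fields, signatures, and the error string below are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

enum class DType { F32, F16, BF16, I32, U8 };

// Bytes per element for each supported dtype.
constexpr std::size_t dtype_size(DType t) {
    switch (t) {
        case DType::F32:  return 4;
        case DType::F16:  return 2;
        case DType::BF16: return 2;
        case DType::I32:  return 4;
        case DType::U8:   return 1;
    }
    return 0;  // not reached for valid enum values
}

// Assumed layout: a flag plus a human-readable message, no exceptions thrown.
struct Status {
    bool ok;
    std::string message;
};

// Assumed signature: writes a * b to *out only when the product fits in size_t.
inline Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    assert(out != nullptr);
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return {false, "size computation overflows size_t"};
    }
    *out = a * b;
    return {true, ""};
}
```

A caller would compute `element_count * dtype_size(dtype)` through `checked_mul` before allocating, and propagate the `Status` instead of throwing.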

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
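
For reference, the two variants can be sketched on the host as below. This is a scalar CPU illustration, not the CUDA kernel; the `_ref` function names and signatures are hypothetical, and the tanh form is assumed to be what `Approximate` refers to.

```cpp
#include <cassert>
#include <cmath>

enum class GeluMode { Exact, Approximate };

// Exact:       0.5 * x * (1 + erf(x / sqrt(2)))
// Approximate: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_forward_ref(float x, GeluMode mode = GeluMode::Approximate) {
    assert(std::isfinite(x));  // debug-only guard against NaN/Inf inputs
    if (mode == GeluMode::Exact) {
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
    }
    const float k = 0.79788456f;  // sqrt(2 / pi)
    const float inner = k * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}

// Returns dout * d(gelu)/dx for the selected variant.
inline float gelu_backward_ref(float x, float dout, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        const float cdf = 0.5f * (1.0f + std::erf(x * 0.70710678f));
        const float pdf = 0.39894228f * std::exp(-0.5f * x * x);  // 1 / sqrt(2 * pi)
        return dout * (cdf + x * pdf);
    }
    const float k = 0.79788456f;
    const float inner = k * (x + 0.044715f * x * x * x);
    const float t = std::tanh(inner);
    const float d_inner = k * (1.0f + 3.0f * 0.044715f * x * x);
    return dout * (0.5f * (1.0f + t) + 0.5f * x * (1.0f - t * t) * d_inner);
}
```

The two variants differ only slightly for moderate inputs, which is why the cheaper tanh form is a common default.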

* feat(cuda): add gradient norm calculation and clipping interfaces
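
A CPU sketch of global-norm clipping, assuming the interface computes one L2 norm over all gradients and rescales them when it exceeds a threshold. The function names and the flat `std::vector<float>` layout are hypothetical; the CUDA interface itself is not shown in this log.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L2 norm over all gradient values, accumulated in double for stability.
inline float global_grad_norm(const std::vector<float>& grads) {
    double sum_sq = 0.0;
    for (float g : grads) sum_sq += static_cast<double>(g) * static_cast<double>(g);
    return static_cast<float>(std::sqrt(sum_sq));
}

// Scales grads in place so their norm is at most max_norm.
// Returns the norm measured before clipping, which is what is usually logged.
inline float clip_grad_norm(std::vector<float>& grads, float max_norm) {
    assert(max_norm > 0.0f);
    const float norm = global_grad_norm(grads);
    if (norm > max_norm) {
        const float scale = max_norm / norm;
        for (float& g : grads) g *= scale;
    }
    return norm;
}
```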

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.
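
A scalar host-side sketch of the two ideas named above: moment updates written as a `lerp`, and stochastic rounding of an FP32 value to BF16 bits. All names here (`lerp_ref`, `AdamWState`, `adamw_step`, `stochastic_round_bf16`) are hypothetical, and the rounding rule is one common formulation, assumed rather than taken from the kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// weight * end + (1 - weight) * start, expressed with two fused multiply-adds.
inline float lerp_ref(float start, float end, float weight) {
    return std::fma(weight, end, std::fma(-weight, start, start));
}

struct AdamWState {
    float m = 0.0f;  // first-moment estimate
    float v = 0.0f;  // second-moment estimate
};

// One AdamW update for a single parameter; t is the 1-based step count.
inline float adamw_step(float param, float grad, AdamWState& s, int t, float lr,
                        float beta1, float beta2, float eps, float weight_decay) {
    assert(t >= 1);
    s.m = lerp_ref(grad, s.m, beta1);         // beta1 * m + (1 - beta1) * grad
    s.v = lerp_ref(grad * grad, s.v, beta2);  // beta2 * v + (1 - beta2) * grad^2
    const float m_hat = s.m / (1.0f - std::pow(beta1, static_cast<float>(t)));
    const float v_hat = s.v / (1.0f - std::pow(beta2, static_cast<float>(t)));
    return param - lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * param);
}

// Rounds an FP32 value to BF16 bits. The 16 discarded mantissa bits are compared
// with a random 16-bit threshold, so the magnitude rounds up with probability
// about low / 65536. Values near the FP32 maximum can round to infinity; a
// production kernel may handle that edge case differently.
inline std::uint16_t stochastic_round_bf16(float x, std::uint32_t random) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::uint32_t threshold = random & 0xFFFFu;
    const std::uint32_t low = bits & 0xFFFFu;
    std::uint32_t high = bits >> 16;
    if (low > threshold) high += 1;
    return static_cast<std::uint16_t>(high);
}
```

Values exactly representable in BF16 have `low == 0` and therefore never change, so the rounding only perturbs parameters that would otherwise lose small updates to truncation.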

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation
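
A minimal sketch of how an MFU (model FLOPs utilization) estimate is commonly computed, assuming the widely used approximation of about 6 FLOPs per parameter per trained token (forward plus backward, ignoring the attention term). The function names are hypothetical, and the peak throughput must come from the GPU specification for the device in use.

```cpp
#include <cassert>

// Approximate training cost per token: ~6 * N FLOPs for N parameters.
inline double train_flops_per_token(double num_params) {
    return 6.0 * num_params;
}

// Fraction of the device's peak throughput actually achieved.
inline double estimate_mfu(double num_params, double tokens_per_sec,
                           double peak_flops_per_sec) {
    assert(peak_flops_per_sec > 0.0);
    return train_flops_per_token(num_params) * tokens_per_sec / peak_flops_per_sec;
}
```

Under this approximation, the run reported earlier in this log (6.68M parameters at ~791 tok/s) corresponds to roughly 3.17e10 achieved FLOP/s; dividing by the peak figure of the device used gives its MFU.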

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

* Refactor core architecture and optimize CUDA features (#75)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactor(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for tail elements that do not fill a full vector and for platforms lacking SIMD extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon Sippy <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------

Co-authored-by: Max <eamon5174@gmail.com>

* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ci): consolidate manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactor(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for tail elements that do not fill a full vector and for platforms lacking SIMD extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for Quadtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

 Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------

Co-authored-by: Max <eamon5174@gmail.com>

* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv

Co-Authored-By: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-Authored-By: Eamon Sippy <eamon112009@gmail.com>

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add  I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Eamon <eamon112009@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Eamon2009 added a commit that referenced this pull request Jun 8, 2026
…#76)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
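
The vectorized path with a scalar fallback described above could look roughly like the following C++ sketch. It assumes plain `float` buffers; `add_f32` and `scale_f32` are hypothetical free functions rather than the project's `Tensor` methods, and only the `__AVX__` branch is shown (an `__SSE__` branch would follow the same pattern with 4-wide registers).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n).
inline void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration using unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar path: handles the tail, or every element when AVX is unavailable.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}

// x[i] *= s for i in [0, n).
inline void scale_f32(float* x, float s, std::size_t n) {
    assert(n == 0 || x != nullptr);
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 vs = _mm256_set1_ps(s);  // broadcast the scalar to all 8 lanes
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
    }
#endif
    for (; i < n; ++i) x[i] *= s;
}
```

Because the scalar loop starts at whatever index the vector loop stopped at, the same function is correct whether or not the compiler defines `__AVX__`.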

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for quadtrix.cpp

* docs: report [run_20260530_165216] (~791 tok/s) (#60)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#62)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



---------



* chore: clang-format configuration file based on LLVM (#63)



* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.
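
The type metadata and overflow-safe size helpers listed above can be sketched in plain C++ as follows. The names `DType`, `dtype_size`, `Status`, and `checked_mul` come from the commit message, but the signatures, field names, and error text here are assumptions, not the project's actual header.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

enum class DType { F32, F16, BF16, I32, U8 };

// Size in bytes of one element of the given type.
constexpr std::size_t dtype_size(DType t) {
    return (t == DType::F32 || t == DType::I32) ? 4
         : (t == DType::F16 || t == DType::BF16) ? 2
         : 1;
}

// Non-throwing result type (assumed layout).
struct Status {
    bool ok;
    std::string message;
};

// Writes a * b to *out, reporting an error instead of wrapping on overflow.
inline Status checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    assert(out != nullptr);
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return Status{false, "size overflow in checked_mul"};
    }
    *out = a * b;
    return Status{true, ""};
}
```

A caller would typically compute an allocation size as `checked_mul(num_elements, dtype_size(dtype), &bytes)` and only allocate when the returned `ok` flag is true.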

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.
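
For reference, the two GeLU variants can be written on the host as below. This sketch assumes that `Exact` means the erf-based definition and `Approximate` means the widely used tanh approximation; the commit message does not spell that out, and `gelu_reference` and `gelu_forward_reference` are hypothetical helpers, not the declared `gelu_forward` kernel entrypoint.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

enum class GeluMode { Exact, Approximate };

// Scalar GeLU; the default mode mirrors the default named in the commit message.
inline float gelu_reference(float x, GeluMode mode = GeluMode::Approximate) {
    if (mode == GeluMode::Exact) {
        // 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
    }
    // 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
    const float k = 0.79788456f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Element-wise forward pass over a buffer, useful as a CPU check for a GPU kernel.
inline void gelu_forward_reference(const float* in, float* out, std::size_t n,
                                   GeluMode mode = GeluMode::Approximate) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) out[i] = gelu_reference(in[i], mode);
}
```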

* feat(cuda): add gradient norm calculation and clipping interfaces
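
A gradient norm and clipping interface usually amounts to computing the global L2 norm and rescaling when it exceeds a threshold. The host-side C++ sketch below shows that idea under the assumption of a single flat gradient buffer; `global_grad_norm` and `clip_grad_norm` are hypothetical names, and the project's CUDA interface may differ.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L2 norm over all gradient elements, accumulated in double for accuracy.
inline float global_grad_norm(const std::vector<float>& grads) {
    double sum_sq = 0.0;
    for (float g : grads) sum_sq += static_cast<double>(g) * g;
    return static_cast<float>(std::sqrt(sum_sq));
}

// Rescales grads in place so their norm is at most max_norm.
// Returns the norm measured before clipping, which is handy for logging.
inline float clip_grad_norm(std::vector<float>& grads, float max_norm) {
    assert(max_norm > 0.0f);
    const float norm = global_grad_norm(grads);
    if (norm > max_norm) {
        const float scale = max_norm / norm;
        for (float& g : grads) g *= scale;
    }
    return norm;
}
```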

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

* Refactor core architecture and optimize CUDA features (#75)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation optimizations (`lerp`), multi-slice batching support via 2D grids, and `init_from_master` utility functions for low-precision parameter handling.

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat : add PyTorch-compatible Mersenne Twister random utilities

* README : Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.

- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------






* Add CUDA kernels, optimize CI, and update documentation (#74)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

* CUDA header declarations for (LayerNorm) forward and backward  (#66)

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

---------



* Add CUDA attention kernels, gradient norms, and CI improvements (#69)

* exp(#58)

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* feat(ci): optimize workflow pipeline and update docker configurations

* refactor(ci): optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* refactor : optimize workflow pipeline and update docker configurations

* Added MIT LICENSE to this project Quadtrix.cpp

* Refactor Dockerfile to use ARG for CUDA version

* Refactor Dockerfile for backend dependencies

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* Delete .devops/Dockerfile.frontend

* Delete .devops/Dockerfile.dev.frontend

* refactor : Dockerfile.backend optimize workflow pipeline

* refactor : Dockerfile.backend optimize workflow pipeline

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication

* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes

* refactor : message bubble layout to use inline styles

* refactor(ui): complete inline-style migration and update auto-scroll implementation

* refactor(ui): complete inline-style migration for MessageAvatar component

* refactor(ui): rewrite EmptyState component using pure inline styles

* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE

- Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
- Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
- Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
- Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.

* refactor(main): redesign training loop to log per-step and sample during evaluation

- Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
- Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
- Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
- Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.

* feat: implement GPT training loop with multi-GPU and memory optimizations

- Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
- Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent out-of-memory (OOM) crashes.

* Update README.md with new banner for quadtrix.cpp

---------



* ci: add manual PR checks workflow with slash command support

* feat(cuda): add attention forward backward kernel declarations (#64)

* docs: report [run_20260530_165216] (~791 tok/s)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900

* docs: report [run_20260530_165216] (~791 tok/s) (#61)

Includes metrics for generalization gap, throughput (~791 tok/s), and gradient norms.
Parameters: 6.68M | lr: 1e-3 | batch: 16 | steps: 6000 - Achieved best validation loss of 4.1319 at step 3900



* feat(cuda): add attention forward and backward kernel declarations

Introduces the header declarations for `attention_forward` and
`attention_backward` operations inside the `quadtrix::cuda` namespace.
Configured with support for custom CUDA streams and head partitioning.

---------



* feat(cuda): add checkpoint metadata struct and stub functions

* feat(cuda): introduce core type definitions and error handling utilities

- Defines `DType` and `DeviceKind` enums supporting standard types (F32, F16, BF16, I32, U8).
- Implements `dtype_name` and `dtype_size` metadata helper functions.
- Adds an explicit `Status` struct for non-throwing error propagation alongside `checked_mul` for safe allocation size computation.
- Introduces `check_cuda` and `abort_on_cuda` error macros and handling mechanisms, exposed via the `QUADTRIX_CUDA_CHECK` macro.

* feat(cuda): add TokenBatchView struct and DataLoader stub class

* feat(cuda): add GeLU activation forward and backward declarations

- Introduces the `GeluMode` enum to toggle between `Exact` and `Approximate` mathematical variants.
- Declares the `gelu_forward` and `gelu_backward` kernel entrypoints.
- Configures both signatures with optional stream execution and a default mode of `GeluMode::Approximate`.

* feat(cuda): add gradient norm calculation and clipping interfaces

* feat(cuda): add LayerNorm forward and backward kernel declarations

* refactor(ci): organize workflow into push-triggered QA and manual docker builds

Updated CI workflow to restrict branches for push events and improved input descriptions for image selection and push options.

* Fix formatting and update CI workflow steps

* Enhance CI with macOS binary build and release

Added macOS binary build and release steps to CI workflow.

* feat(docker): add Dockerfile for frontend application

* feat(docker): add Dockerfile for frontend application

* refactor(ci): remove release job from GitHub actions

* ci: add unified release and docker build workflow

* ci: add unified release and docker build workflow

* Refactor macOS build workflow for arm64 architecture

* Update release workflow to remove macOS x64 build

Removed dependency on build-macos-x64 for the release job.

* perf: update execution time benchmarks in csv




* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* ci(docker): refactor image build workflow and add frontend job

* Remove frontend job from Docker Images workflow

* Update release workflow to remove s390x and add notes

Removed s390x build configurations and added a step to write detailed release notes.

* feat: add local orchestration script for frontend and backend servers

Introduces a central Python execution script to concurrently manage and
orchestrate the development environment for both the frontend and backend.
- Detects system OS to invoke correct `npm` and `python` (virtualenv) binary variants.
- Verifies existence of the local PyTorch `.pt` model checkpoint before starting.
- Configures environment variables dynamically for Uvicorn (FastAPI) and Vite.
- Handles cross-origin setups (CORS) linking ports interactively.
- Gracefully handles process termination (`Ctrl+C`) by forwarding termination signals.
- Automatically launches the frontend application in the system web browser.

* chore(deps): bump actions/github-script from 7 to 9 (#71)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* feat(cuda): introduce log_message utility and LogLevel enum

* feat(cuda): add cuBLAS handle wrapper and matmul operations

* feat(cuda): implement core Tensor, TensorShape, and TensorView abstractions

* refactor: untie embedding and lm_head weights and to quadtrix

* feat(cuda): add NCCL communicator wrapper and all-reduce primitives

* Update README.md with workflow badges

Added badges for release, package, and CI workflows.

* kernels: add AdamW optimization kernel with stochastic rounding

Introduces the fused AdamW CUDA kernel, including linear interpolation
optimizations (`lerp`), multi-slice batching support via 2D grids, and
`init_from_master` utility functions for low-precision parameter handling.
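As a scalar CPU reference for the two ideas named in this commit, the sketch below writes the AdamW moment updates as `lerp` calls and rounds a float32 value to bfloat16 precision stochastically. It is an illustration under assumptions, not the CUDA kernel itself: the function names and default hyperparameters are chosen for the example, and the rounding helper assumes finite inputs.

```python
import math
import random
import struct


def lerp(start, end, weight):
    """Linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)


def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (step starts at 1)."""
    # lerp(grad, m, beta1) equals beta1 * m + (1 - beta1) * grad.
    m = lerp(grad, m, beta1)
    v = lerp(grad * grad, v, beta2)
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Decoupled weight decay is applied alongside the Adam update.
    param -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


def stochastic_round_bf16(x, rng):
    """Round a finite float32 value to bfloat16 precision stochastically.

    Adding a random 16-bit integer to the float32 bit pattern and then
    clearing the low 16 bits rounds away from zero with probability
    proportional to the discarded fraction.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + rng.getrandbits(16)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Stochastic rounding keeps small updates from being lost systematically when parameters are stored in low precision, because the rounded value equals the full-precision value in expectation.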

* cudnn: implement cached SDPA forward graph using cuDNN frontend

* feat(cuda): implement Packed128 memory vectorization utilities

* feat: add distributed sharded DataLoader for binary token files
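One common way to shard a flat token file across processes is to give each rank its own starting offset and advance all ranks by a shared stride. The sketch below assumes that scheme; the function name and the "one extra token for shifted targets" convention are assumptions for illustration and may differ from the DataLoader added in this commit.

```python
def batch_offsets(num_tokens, batch_size, seq_len, rank, world_size):
    """Return the token offsets at which one rank reads its batches."""
    per_rank = batch_size * seq_len      # tokens consumed by one rank per step
    stride = per_rank * world_size       # tokens consumed by all ranks per step
    pos = rank * per_rank                # each rank starts at its own slice
    offsets = []
    # One extra token is needed so targets can be the inputs shifted by one.
    while pos + per_rank + 1 <= num_tokens:
        offsets.append(pos)
        pos += stride
    return offsets
```

With this layout no two ranks read the same input window within a step, and a real loader would usually also truncate every rank to the same number of steps so collective operations stay in sync.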

* feat(multi-gpu): add foundational utilities for ZeRO sharding

* feat(utils): add I/O and memory error-checking wrappers

* feat: add PyTorch-compatible Mersenne Twister random utilities

* README: Enhance README with header and workflow badges

Updated README to include a header and badges for release, package, and CI workflows.

* utils: add checked file I/O and socket-close wrappers

- Add `fopenCheck`, `freadCheck`, `fwriteCheck`, `fcloseCheck`, and `fseekCheck` with explicit crash details and project-specific troubleshooting hints.
- Add cross-platform socket closure wrappers (`scloseCheck` and `closesocketCheck`) for Linux and Windows.

* mfu: add GPU specifications database and utilities for MFU estimation
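For context, model FLOPs utilization (MFU) is the ratio of achieved training FLOP/s to the hardware peak. The sketch below uses the common approximation of about 6 FLOPs per parameter per token and ignores attention-specific terms; the function name and the single-entry lookup table are hypothetical and do not reflect the database added in this commit.

```python
# Peak dense BF16 throughput in FLOP/s; a hypothetical one-entry table.
PEAK_FLOPS = {
    "A100": 312e12,
}


def estimate_mfu(num_params, tokens_per_step, step_time_s, gpu="A100", num_gpus=1):
    """Estimate MFU as achieved FLOP/s divided by aggregate peak FLOP/s."""
    # Forward plus backward costs roughly 6 FLOPs per parameter per token.
    flops_per_step = 6.0 * num_params * tokens_per_step
    achieved = flops_per_step / step_time_s
    return achieved / (PEAK_FLOPS[gpu] * num_gpus)
```

For example, a 124M-parameter model processing 524,288 tokens per step in 2.5 seconds on one A100 comes out at roughly 50% MFU under this approximation.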

* Modify project title in README.md

Changed the project title to include 'llm.cpp' for clarity.

* Update README to remove image and clean up content

Removed image from README and adjusted formatting.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Max <eamon5174@gmail.com>
Co-authored-by: codeenthusiasm23 <273188204+codeenthusiasm23@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>