
Spectral Compact Training (SCT)

Train 70B-class neural networks on a Steam Deck.

SCT stores every weight matrix as W = U diag(s) V^T and never builds the dense matrix. Exact gradients flow through the small spectral factors via standard backpropagation. After each optimizer step, U and V are retracted to the Stiefel manifold via QR decomposition. That's the entire method.

Dense 70B + Adam:  1,245 GB
SCT 70B + Adam:      7.2 GB
Compression:          172x
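
The headline figures follow from simple arithmetic. A back-of-envelope check, assuming 16 bytes per trainable parameter (fp32 weight, gradient, and the two Adam moment buffers) and using the 77.8B dense-equivalent and 452M spectral parameter counts from the Results section below:

```python
# Assumption: fp32 weight + gradient + Adam m + Adam v = 16 bytes/param.
BYTES_PER_PARAM = 4 * 4

dense_params = 77.8e9    # dense-equivalent parameter count
sct_params   = 452e6     # spectral parameters actually stored

dense_gb = dense_params * BYTES_PER_PARAM / 1e9
sct_gb   = sct_params   * BYTES_PER_PARAM / 1e9

print(f"Dense 70B + Adam: {dense_gb:,.0f} GB")        # ~1,245 GB
print(f"SCT 70B + Adam:   {sct_gb:.1f} GB")           # ~7.2 GB
print(f"Compression:      {dense_gb / sct_gb:.0f}x")  # ~172x
```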

Patent Pending — Irish Short-Term Patent Application PTIE20260000000219, filed March 27, 2026.


Results

70B Architecture on Consumer Hardware

Hardware               Peak Memory   Forward   Backward   Total Step
Apple M4 Pro (48 GB)   7,938 MB      -         2.15s      -
Steam Deck (16 GB)     7,235 MB      0.43s     0.92s      6.28s

80 layers, d=8192, ffn=28672, SwiGLU activation. LLaMA-3-70B proportions. 452M spectral parameters representing 77.8B dense equivalent. Orthonormality error after retraction: 1.30e-06.

MLP Proof (from-scratch training)

Task                       Dense+AdamW   SFT+DFA   SCT (ours)
Sine regression (loss)     0.000002      0.0768    0.0000680
XOR classification (acc)   100%          85.5%     100%

SCT recovers dense-level training quality: perfect XOR accuracy and a sine-regression loss 1,129x lower than DFA's.

Compression Scales with Model Size

Model          Layer          SCT (k=32)   Compression
SmolLM2-135M   576 x 1536     1.1 MB       13x
SmolLM2-1.7B   2048 x 8192    5.2 MB       51x
LLaMA-7B       4096 x 11008   7.7 MB       93x
Qwen3.5-27B    4096 x 17408   11.0 MB      104x
LLaMA-70B      8192 x 28672   18.9 MB      199x
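
The per-layer numbers follow directly from the factor shapes: a dense m x n layer stores mn parameters, while SCT stores k(m + n + 1) -- U is m x k, V is n x k, and s has k entries. A small check, again assuming 16 bytes per fp32 trainable parameter:

```python
def sct_compression(m, n, k=32, bytes_per_param=16):
    """Storage (MB) and compression ratio for one m x n layer at rank k."""
    dense = m * n              # full weight matrix
    sct = k * (m + n + 1)      # U (m x k) + V (n x k) + s (k,)
    return sct * bytes_per_param / 1e6, dense / sct

for name, m, n in [("SmolLM2-135M", 576, 1536),
                   ("LLaMA-70B", 8192, 28672)]:
    mb, ratio = sct_compression(m, n)
    print(f"{name}: {mb:.1f} MB, {ratio:.0f}x")
```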

How It Works

Forward Pass

h  = x @ U       # [batch, k]     project into spectral basis
hs = h * s       # [batch, k]     scale by singular values
y  = hs @ V.T    # [batch, out]   reconstruct in output space

Three small matmuls. Cost: O(bk(m+n)) instead of O(bmn). Never builds the m x n matrix.
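
A self-contained, runnable version of the forward pass above (shape check only; the variable names here are illustrative, not the library's API):

```python
import torch

m, n, k, batch = 4096, 11008, 32, 8
U = torch.randn(m, k)   # left factor (orthonormal columns after retraction)
s = torch.randn(k)      # singular values
V = torch.randn(n, k)   # right factor

x = torch.randn(batch, m)
h  = x @ U          # [batch, k]    project into spectral basis
hs = h * s          # [batch, k]    scale by singular values
y  = hs @ V.T       # [batch, n]    reconstruct in output space

assert y.shape == (batch, n)   # the m x n matrix is never materialized
```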

Backward Pass

PyTorch autograd computes dL/dU, dL/ds, dL/dV exactly through the same three operations. The gradients have shapes (m x k), (k,), and (n x k); no m x n gradient ever exists.
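
This is easy to verify: mark the three factors as leaves requiring gradients and inspect the gradient shapes after a backward pass (illustrative sketch, not the library's code):

```python
import torch

m, n, k = 64, 96, 8
U = torch.randn(m, k, requires_grad=True)
s = torch.randn(k, requires_grad=True)
V = torch.randn(n, k, requires_grad=True)

x = torch.randn(4, m)
y = ((x @ U) * s) @ V.T   # same three ops as the forward pass
y.sum().backward()        # exact gradients via standard autograd

assert U.grad.shape == (m, k)   # dL/dU
assert s.grad.shape == (k,)     # dL/ds
assert V.grad.shape == (n, k)   # dL/dV -- no m x n tensor anywhere
```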

Stiefel Retraction

After Adam updates U and V, they're no longer orthonormal. QR retraction fixes this:

Q, R = torch.linalg.qr(U_updated)
U = Q * torch.sign(torch.diag(R))  # sign correction for stability

Cost: O(mk^2) per layer. This is what makes SCT a training method, not just compression.
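
A self-contained sketch of the retraction step, including the orthonormality error it repairs. The `retract` and `ortho_error` helpers here are assumptions for illustration, not necessarily the repository's `retract_all`:

```python
import torch

def retract(M):
    """QR retraction of M's columns back onto the Stiefel manifold."""
    Q, R = torch.linalg.qr(M)
    return Q * torch.sign(torch.diag(R))  # sign fix for stability

def ortho_error(M):
    """Max deviation of M.T @ M from the identity."""
    k = M.shape[1]
    return (M.T @ M - torch.eye(k)).abs().max().item()

U = torch.linalg.qr(torch.randn(512, 32))[0]   # start orthonormal
U = U + 1e-3 * torch.randn_like(U)             # simulate an Adam update
print(f"before retraction: {ortho_error(U):.1e}")  # ~1e-3 drift
U = retract(U)
print(f"after retraction:  {ortho_error(U):.1e}")  # back near machine precision
```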


Quick Start

70B Architecture Test (fits on any machine with 8+ GB RAM)

pip install torch transformers
python examples/sct_steamdeck.py

Fine-tuning SmolLM2

pip install torch transformers datasets
python examples/macbook_m4pro/sct_smollm2.py --energy 0.95 --steps 400

Core Implementation

The entire method is one class:

from spectral_compact_training import SpectralLinear, retract_all

# Use like nn.Linear
layer = SpectralLinear(in_features=4096, out_features=11008, rank=32)
y = layer(x)

# After optimizer.step(), retract to Stiefel manifold
optimizer.step()
retract_all(model)
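
For orientation, a minimal sketch of what such a layer can look like internally. This is an illustration of the technique under the assumptions described above, not the repository's actual `spectral_layer.py`:

```python
import torch
import torch.nn as nn

class SpectralLinearSketch(nn.Module):
    """Illustrative rank-k spectral layer: y = ((x @ U) * s) @ V.T."""
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        # Orthonormal initialization for U and V; ones for s.
        self.U = nn.Parameter(torch.linalg.qr(torch.randn(in_features, rank))[0])
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(out_features, rank))[0])
        self.s = nn.Parameter(torch.ones(rank))

    def forward(self, x):
        return ((x @ self.U) * self.s) @ self.V.T

    @torch.no_grad()
    def retract(self):
        # QR retraction of U and V after each optimizer step.
        for name in ("U", "V"):
            Q, R = torch.linalg.qr(getattr(self, name))
            getattr(self, name).copy_(Q * torch.sign(torch.diag(R)))

layer = SpectralLinearSketch(4096, 11008, rank=32)
y = layer(torch.randn(2, 4096))
assert y.shape == (2, 11008)
```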

Repository Structure

sct/
  spectral_compact_training/       Core library
    __init__.py
    spectral_layer.py              SpectralLinear implementation
  examples/
    sct_steamdeck.py               70B architecture validation
    macbook_m4pro/
      sct_70b_flex.py              70B on M4 Pro (MPS backend)
      sct_smollm2.py               SmolLM2 fine-tuning
      sct_vs_dense.py              Head-to-head Dense vs SCT
  proof/
    SteamDeck-Demo.mp4             Video: Steam Deck running 70B
    SteamDeck-Konsole.mp4          Video: terminal output
    SteamDeck-Konsole-Output.txt   Raw console log (v2)
    sct_smollm2_results.json       SmolLM2 fine-tuning results
    sct_vs_dense_results.json      Dense vs SCT comparison
    patent_pending.webp            Filing confirmation
  docs/
    SCT_Patent_Application.pdf     Patent specification
    SCT_Whitepaper.pdf             Technical whitepaper
    paper.tex                      arXiv preprint source

Important Notes

What SCT is: A training method that stores and updates weights exclusively in spectral form with exact gradients and Stiefel manifold constraints. The 70B results are architectural validation: a full training step (forward, backward, optimizer, retraction) fits in 7.2 GB.

What SCT is not: A finished 70B model. Training a model to completion requires compute time proportional to the dataset size, which SCT does not change. SCT changes how much memory you need to do that training.

Scaling: SCT compression improves with model size. Models below ~360M parameters (hidden dim < 1024) don't benefit meaningfully at practical ranks. The sweet spot is 1.7B+ where rank 32 gives 50x+ compression.


Citation

@misc{kohlberger2026sct,
  title={Spectral Compact Training: Memory-Efficient Neural Network Training
         via Truncated SVD Factorization with Stiefel Manifold Retraction},
  author={Kohlberger, Bj{\"o}rn Roman},
  year={2026},
  note={Irish Patent Application PTIE20260000000219}
}

License

Apache 2.0

Author

Bjorn Roman Kohlberger -- EctoSpace
