Nott is a modern C++ deep-learning framework that layers a strongly typed API over LibTorch. It prioritizes reproducibility, predictable latency, and explicit control over kernels, memory, and optimizer state.
- First-class graph authoring. Layers and higher-order blocks can be connected as a DAG, letting you express anything from small CNNs to large transformer stacks without wrestling with manual tensor plumbing.
- Consistent systems model. Data loaders, augmentations, optimizers, regularizers, and metrics share the same descriptor-driven style so you can mix-and-match building blocks safely.
- Native performance. Nott keeps you close to the metal through LibTorch while still providing ergonomic abstractions. Benchmarks at the end of this document detail the runtime overhead compared to pure LibTorch.
- Not just LibTorch wrappers. The API covers extensive layer catalogs, attention descriptors, data loaders, and training utilities rather than exposing only a handful of LibTorch layers.
```cpp
#include <Nott>

int main() {
    Nott::Model model("demo");
    model.use_cuda(torch::cuda::is_available());

    model.add(Nott::Layer::FC({784, 256, true}, Nott::Activation::GeLU));
    model.add(Nott::Layer::Dropout({0.1}));
    model.add(Nott::Layer::FC({256, 10, true}, Nott::Activation::Softmax));

    model.set_optimizer(
        Nott::Optimizer::AdamW({.learning_rate = 1e-3}),
        Nott::LrScheduler::CosineAnnealing({.T_max = 50})
    );
    model.set_loss(Nott::Loss::MSE({}));

    auto [train_images, train_labels, test_images, test_labels] =
        Nott::Data::Load::MNIST("./datasets", 1.f, 1.f, true);

    model.train(train_images, train_labels, {.epoch = 10, .batch_size = 64});
    model.evaluate(test_images, test_labels, Nott::Evaluation::Classification, {
        Nott::Metric::Classification::Precision,
        Nott::Metric::Classification::F1,
        Nott::Metric::Classification::Informedness});

    return 0;
}
```
The model holds a directed acyclic graph of computational blocks and layers. You can
construct it incrementally with .add():

```cpp
model.add(Nott::Layer::FC({258, 10, /*bias*/true},
          Nott::Activation::GeLU, Nott::Initialization::HeNormal));
```

or with blocks:

```cpp
model.add(Nott::Block::Sequential({ /*vector field*/
    Nott::Layer::Conv2d({3, 64, {3, 3}, {1, 1}, {1, 1}, {1, 1}, 1, false},
                        Nott::Activation::Identity, Nott::Initialization::HeNormal),
    Nott::Layer::BatchNorm2d({64, 1e-5, 0.1, true, true},
                             Nott::Activation::SiLU),
    Nott::Layer::MaxPool2d({{2, 2}, {2, 2}})
}));
```

The framework ships with a rich catalog of layers and blocks (see Docs/Layers and Docs/Blocks). Every item passed to .add() is linked linearly by default; to rewire the graph, use .links() (see Docs/Links). The multi-head attention descriptors that power the transformer blocks are documented in Docs/Attention.
Optimizer and scheduler choices are set once per model by default.
The example below pairs AdamW with cosine annealing warm restarts.
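The schedule below is expressed in optimizer steps through steps_per_epoch and epochs, which are plain user-side variables rather than Nott symbols. A minimal sketch of how they might be derived, assuming one optimizer step per mini-batch (the dataset size and batch size here are illustrative):

```cpp
#include <cstddef>

// Illustrative values only, not part of the Nott API: with one optimizer step
// per mini-batch, steps_per_epoch is simply the number of batches per epoch.
constexpr std::size_t num_train_samples = 50'000;   // e.g. CIFAR-10 training split
constexpr std::size_t batch_size        = 128;
constexpr std::size_t epochs            = 120;
constexpr std::size_t steps_per_epoch   =
    (num_train_samples + batch_size - 1) / batch_size;   // ceil(N / batch_size)
```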
```cpp
model.set_optimizer(
Nott::Optimizer::AdamW({.learning_rate = 1e-4, .weight_decay = 5e-4}),
Nott::LrScheduler::CosineAnnealing({
.T_max = steps_per_epoch * epochs,
.eta_min = 3e-7,
.warmup_steps = 5 * steps_per_epoch,
.warmup_start_factor = 0.1
})
);
```

Losses and regularization follow the same pattern:

```cpp
model.set_loss(Nott::Loss::CrossEntropy({.label_smoothing = 0.02f}));
model.set_regularization({ /*vector field*/
Nott::Regularization::SWAG({
.coefficient = 1e-3,
.variance_epsilon = 1e-6,
.start_step = static_cast<size_t>(0.85 * steps_per_epoch * epochs),
.accumulation_stride = static_cast<size_t>(steps_per_epoch),
.max_snapshots = 20,
})
});
```

To see the complete list of optimizers, losses, and regularizations and their parameters, check Docs/Optimizer, Docs/Loss, and Docs/Regularization.
It is also possible to apply multiple configurations to different parts of the network; see Docs/Local.
The Nott::Data::Load namespace includes ready-made loaders for popular datasets
such as MNIST, CIFAR-10, ETTH, PTBXL. Data manipulations (augmentation, shuffling, and splitting) are
exposed through Nott::Data::Manipulation utilities, while consistency checks live
under Nott::Data::Check.
```cpp
// Load CIFAR-10 and hold out a fraction (0.1) of the test split for validation.
auto [train_images, train_labels, test_images, test_labels] =
    Nott::Data::Load::CIFAR10(dataset_root, 1.f, 1.f, true);
auto [validation_images, validation_labels] =
    Nott::Data::Manipulation::Fraction(test_images, test_labels, 0.1f);

// Apply Cutout augmentation to the training set.
std::tie(train_images, train_labels) = Nott::Data::Manipulation::Cutout(
    train_images, train_labels, {-1, -1}, {12, 12}, -1, 1.f, true, false);
```

More information inside Docs/Data.
Training is initiated with model.train, which accepts tensors and a
Nott::TrainOptions struct describing epochs, batch size, graph mode, validation
splits, AMP, and other runtime settings.
```cpp
model.train(train_images, train_labels,
            {.epoch = 120, .batch_size = 128, .test = {x_val, y_val}});
```

More information in Docs/Train.
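For instance, the hold-out split produced by Nott::Data::Manipulation::Fraction above can be supplied as the in-training validation set through the .test field (only fields already shown in this document appear in this sketch; the AMP and graph-mode switches are covered in Docs/Train):

```cpp
// Reuse the validation tensors created earlier with Fraction as the
// in-training test/validation split.
model.train(train_images, train_labels,
            {.epoch      = 120,
             .batch_size = 128,
             .test       = {validation_images, validation_labels}});
```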
Post-training evaluation is performed with model.evaluate, which accepts the test
split, a task type, and a list of metrics. The evaluation API streams batches and
accumulates metrics such as accuracy, precision, recall, calibration errors, and
more.
```cpp
model.evaluate(test_images, test_labels, Nott::Evaluation::Classification, { /*vector field*/
    Nott::Metric::Classification::Precision,
    Nott::Metric::Classification::Recall,
    Nott::Metric::Classification::F1,
    Nott::Metric::Classification::TruePositiveRate,
    Nott::Metric::Classification::LogLoss,
}, {.batch_size = 64});
```

More details inside Docs/Evaluation.
To persist your network, use model.save() and model.load():

```cpp
model.save("PATH");
model.load("PATH/_Network_Name_");  // the saved folder is named after the model
```

model.save generates a folder named after the model containing architecture.json (graph, layer dimensions and parameters, optimizer metadata) and parameters.binary (learnable weights). Because model.load reads the JSON specification, you do not need to recreate the graph via model.add. Details live in Save & Load.
Results below represent warm runs filtered with a Tukey 0.98 fence on the MNIST workload
(60k samples, 28×28 | epochs = 100, batch = 64).
Two configurations are reported:
- Mixed I/O: async pinned memory enabled only in Nott::Train().
- Unified I/O: async pinned memory enabled in all runners (Nott prebuilt, Nott custom, LibTorch).
| Runner | Steps (filtered) | Mean (ms) | Std | CV | P10 | P50 | P90 | P98 | Mode | Throughput (steps/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Nott — Prebuilt Train() | 76 916 | 1.20268 | 0.00157 | 0.00131 | 1.20049 | 1.20302 | 1.20451 | 1.20537 | 1.20398 | 831.47 |
| Nott — Custom Train() | 91 027 | 1.33688 | 0.18792 | 0.14057 | 1.17145 | 1.23006 | 1.65251 | 1.72896 | 1.19031 | 748.01 |
| LibTorch Raw | 90 837 | 1.27572 | 0.18145 | 0.14224 | 1.12161 | 1.16910 | 1.59117 | 1.66006 | 1.13251 | 783.87 |
- CV (coefficient of variation) = Std / Mean. Lower = less jitter.
| Comparison | Value |
|---|---|
| Nott Prebuilt vs LibTorch Overhead | -5.73% |
| Nott Prebuilt vs Nott Custom Overhead | -10.04% |
| Nott Custom vs LibTorch Overhead | +4.79% |
In this configuration, Nott’s prebuilt Train() benefits from async pinned memory while the other runners do not, so this setup is favorable to the prebuilt runner and mainly illustrates the impact of I/O configuration.
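Both comparison tables are consistent with a simple relative difference in mean step latency; as a reading aid (an interpretation of the tables, not a description of how the benchmark harness computes the figures):

$$
\text{Overhead}(A \text{ vs } B) = \frac{\bar{t}_A - \bar{t}_B}{\bar{t}_B} \times 100\%,
\qquad \text{e.g. } \frac{1.20268 - 1.27572}{1.27572} \times 100\% \approx -5.73\%.
$$

Throughput likewise tracks the reciprocal of the mean step time, e.g. 1000 / 1.20268 ms ≈ 831.5 steps/s for the prebuilt runner in the mixed I/O configuration.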
| Runner | Steps (filtered) | Mean (ms) | Std | CV | P10 | P50 | P90 | P98 | Mode | Throughput (steps/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Nott — Prebuilt Train() | 71 288 | 1.06486 | 0.00184 | 0.00172 | 1.06275 | 1.06475 | 1.06702 | 1.06889 | 1.06298 | 939.09 |
| Nott — Custom Train() | 75 622 | 1.06443 | 0.01764 | 0.01657 | 1.04117 | 1.06435 | 1.08850 | 1.10319 | 1.06208 | 939.47 |
| LibTorch Raw | 80 820 | 1.02841 | 0.00512 | 0.00498 | 1.02150 | 1.02813 | 1.03556 | 1.03934 | 1.02704 | 972.37 |
- CV (coefficient of variation) = Std / Mean. Lower = less jitter.
| Comparison | Value |
|---|---|
| Nott Prebuilt vs LibTorch Overhead | +3.54% |
| Nott Prebuilt vs Nott Custom Overhead | +0.04% |
| Nott Custom vs LibTorch Overhead | +3.50% |
With identical pinned-memory settings, Nott’s prebuilt Train() stays within a few percent of raw LibTorch in mean latency and throughput, while keeping jitter extremely low.
Note on variability. These numbers are relative, not absolute. Modern hardware is noisy: clocks, power limits, thermals and OS scheduling all drift, so repeated runs with identical settings produce slightly different latency distributions. The robust takeaway across both tables is that Nott’s wrapper adds at most a few percent overhead compared to raw LibTorch, and can even be faster under certain I/O configurations, while preserving very low jitter.
Source: test/speedtest.cpp
Nott’s backend modules follow the algorithms as described in the modern deep-learning literature. For each mechanism we cite a canonical paper or textbook (not always the first historical appearance) and link both to the reference and to the implementation file.
- Gaussian Error Linear Unit (GeLU). Dan Hendrycks, Kevin Gimpel. “Gaussian Error Linear Units (GELUs).” arXiv:1606.08415. (Module: Nott::Activation::GeLU).
- Gated Linear Unit (GLU). Yann N. Dauphin et al. “Language Modeling with Gated Convolutional Networks.” arXiv:1612.08083. (Module: Nott::Activation::GLU).
- Mish. Diganta Misra. “Mish: A Self Regularized Non-Monotonic Neural Activation Function.” arXiv:1908.08681. (Module: Nott::Activation::Mish).
- SiLU / Swish. Prajit Ramachandran, Barret Zoph, Quoc V. Le. “Searching for Activation Functions.” arXiv:1710.05941. (Modules: Nott::Activation::SiLU, Nott::Activation::Swish).
- SwiGLU. Aakanksha Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways.” arXiv:2204.02311. (Module: Nott::Activation::SwiGLU).
- Classic Transformer. Ashish Vaswani et al. “Attention Is All You Need.” arXiv:1706.03762. (Module: Nott::Block::Transformer::Classic).
- BERT. Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805. (Module: Nott::Block::Transformer::BERT).
- Transformer++. Hanxiao Liu et al. “Transformer++: Improving Parallelism, Efficiency and Performance of Transformer Models.” arXiv:2003.04974. (Module: Nott::Block::Transformer::PlusPlus).
- Longformer-XL. Iz Beltagy et al. “Longformer: The Long-Document Transformer.” arXiv:2004.05150 and Zihang Dai et al. “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” arXiv:1901.02860. (Module: Nott::Block::Transformer::LongformerXL).
- Vision Transformer. Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv:2010.11929. (Module: Nott::Block::Transformer::Vision).
- Perceiver. Andrew Jaegle et al. “Perceiver: General Perception with Iterative Attention.” arXiv:2103.03206. (Module: Nott::Block::Transformer::Perceiver).
- Mamba. Albert Gu et al. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752. (Module: Nott::Block::Transformer::Mamba).
- Energy-Based Transformer (EBT). Mikael Haziza et al. “Energy-Based Transformers.” arXiv:2507.02092. (Module: Nott::Block::Transformer::EBT).
- Atlas. Theodore Sumers et al. “Atlas: Learning to Optimally Memorize the Context at Test Time.” arXiv:2505.23735. (Module: Nott::Block::Transformer::Atlas).
- Titan. Zhifan Liu et al. “Titan: Scaling Language Model Training with Real-Time, Low-Latency Adaptation.” arXiv:2501.00663. (Module: Nott::Block::Transformer::Titan).
- Dropout. Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” arXiv:1207.0580. (Module: Nott::Layer::Dropout).
- Batch Normalization. Sergey Ioffe, Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167. (Module: Nott::Layer::BatchNorm).
- Instance Normalization. Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky. “Instance Normalization: The Missing Ingredient for Fast Stylization.” arXiv:1607.08022. (Module: Nott::Layer::InstanceNorm).
- Convolutional Layers. Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” Biological Cybernetics 1980 (early convolution + pooling-like architecture); Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 1998 (modern gradient-trained CNNs with conv + pooling + FC); Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS 2012 (large-scale deep CNN on GPUs). (Modules: Nott::Layer::Conv2d and variants).
- Fully Connected / Perceptron Layers. Frank Rosenblatt. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Rosenblatt 1958 (single-layer perceptron); David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. “Learning Representations by Back-Propagating Errors.” Nature 1986 (multilayer perceptrons with backpropagation). (Module: Nott::Layer::FC).
- Pooling Layers. Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” 1980 (early “subsampling” / pooling); Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 1998 (max/average pooling in modern CNNs). (Modules: Nott::Layer::MaxPool2d, Nott::Layer::AvgPool2d).
- Recurrent Layers. Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science 1990 (Elman RNN); Sepp Hochreiter, Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 1997; Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” arXiv:1406.1078. (Modules: Nott::Layer::RNN, Nott::Layer::LSTM, Nott::Layer::GRU).
- Positional Encoding. Ashish Vaswani et al. “Attention Is All You Need.” arXiv:1706.03762. (Module: Nott::Layer::PositionalEncoding).
- Structured State Spaces (S4). Albert Gu et al. “Efficiently Modeling Long Sequences with Structured State Spaces.” arXiv:2111.00396. (Module: Nott::Layer::S4).
- Patch (Un)Embedding. Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv:2010.11929. (Modules: Nott::Layer::PatchUnembed, Nott::Layer::Resizing).
- Flatten and Reduce. Generic tensor reshaping and spatial reduction operations used throughout early CNN architectures (e.g., LeCun et al. 1998). (Modules: Nott::Layer::Flatten, Nott::Layer::Reduce).
For classical statistical losses (CE, MSE, MAE, logistic), the references below are standard ML expositions, not the original 19th–20th century statistical papers.
- Cross-Entropy / Negative Log Likelihood. Rooted in information theory and relative entropy: Claude E. Shannon. “A Mathematical Theory of Communication.” Bell System Technical Journal 1948; Solomon Kullback, Richard A. Leibler. “On Information and Sufficiency.” Annals of Mathematical Statistics 1951. For a modern ML treatment we follow: David J. C. MacKay. “Information Theory, Inference, and Learning Algorithms.” 2003 Text. (Modules: Nott::Loss::CE, Nott::Loss::NLL).
- Binary Cross-Entropy / Logistic Loss. Originating from logistic models and Bernoulli log-likelihood: Joseph Berkson. “Application of the Logistic Function to Bio-Assay.” Journal of the American Statistical Association 1944 (introduces and justifies the logistic / logit model for bio-assay). For a standard applied treatment we follow: David W. Hosmer, Stanley Lemeshow. “Applied Logistic Regression.” 2000 Text. (Module: Nott::Loss::BCE).
- Categorical Cross-Entropy. Multinomial / categorical negative log-likelihood in the sense of classical likelihood theory: Ronald A. Fisher. “On the Mathematical Foundations of Theoretical Statistics.” Phil. Trans. of the Royal Society A 1922. For the modern softmax cross-entropy formulation in ML we follow: Christopher M. Bishop. “Pattern Recognition and Machine Learning.” 2006 Text. (Module: Nott::Loss::CCE).
- Mean Squared Error / Mean Absolute Error. Rooted in classical least-squares and least-absolute-deviations: Adrien-Marie Legendre. “Nouvelles méthodes pour la détermination des orbites des comètes.” 1805; Carl Friedrich Gauss. “Theoria motus corporum coelestium in sectionibus conicis solem ambientium.” 1809 (formalizing least squares / squared-error minimization). For a modern statistical learning treatment we follow: Vladimir Vapnik. “The Nature of Statistical Learning Theory.” 1995 Text. (Modules: Nott::Loss::MSE, Nott::Loss::MAE).
- Smooth L1 (Huber) Loss. Peter J. Huber. “Robust Estimation of a Location Parameter.” Annals of Mathematical Statistics 1964 (original Huber loss); Ross Girshick. “Fast R-CNN.” arXiv:1504.08083 (popular Smooth L1 implementation for bounding-box regression). (Module: Nott::Loss::SmoothL1).
- Dice Loss. Fausto Milletari et al. “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation.” arXiv:1606.04797. (Module: Nott::Loss::Dice).
- Tversky Loss. Seyed Sadegh Mohseni Salehi et al. “Tversky loss function for image segmentation using 3D fully convolutional deep networks.” arXiv:1706.05721. (Based on the Tversky index from Amos Tversky, “Features of Similarity,” Psychological Review 1977.) (Module: Nott::Loss::Tversky).
- Lovász-Softmax. Maxim Berman, Amal Rannen Triki, Matthew B. Blaschko. “The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks.” arXiv:1705.08790. (Module: Nott::Loss::LovaszSoftmax).
- Cosine Embedding Loss. Raia Hadsell, Sumit Chopra, Yann LeCun. “Dimensionality Reduction by Learning an Invariant Mapping.” CVPR 2006. (Module: Nott::Loss::CosineEmbedding).
- Margin Ranking Loss. Thorsten Joachims. “Optimizing Search Engines Using Clickthrough Data.” KDD 2002 (introducing large-margin ranking). (Module: Nott::Loss::MarginRanking).
- Kullback–Leibler Divergence. Solomon Kullback, Richard A. Leibler. “On Information and Sufficiency.” Annals of Mathematical Statistics 1951. (Module: Nott::Loss::KL).
- Cosine Annealing with Warm Restarts. Ilya Loshchilov, Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts.” arXiv:1608.03983. (Module: Nott::LrScheduler::CosineAnnealing).
- Exponential Decay. Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller. “Efficient BackProp.” Tricks of the Trade 2012. (Module: Nott::LrScheduler::Exponential).
- Adafactor. Noam Shazeer, Mitchell Stern. “Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.” arXiv:1804.04235. (Module: Nott::Optimizer::Adafactor).
- LAMB. Yang You et al. “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.” arXiv:1904.00962. (Module: Nott::Optimizer::LAMB).
- Lion. Qianqian Gu et al. “Symbolic Discovery of Optimization Algorithms.” arXiv:2302.06675. (Module: Nott::Optimizer::Lion).
- Adam. Diederik P. Kingma, Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980. (Module: Nott::Optimizer::Adam).
- AdamW. Ilya Loshchilov, Frank Hutter. “Decoupled Weight Decay Regularization.” arXiv:1711.05101. (Module: Nott::Optimizer::AdamW).
- AdaGrad. John Duchi, Elad Hazan, Yoram Singer. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” JMLR 2011. (Module: Nott::Optimizer::AdaGrad).
- RMSProp. Geoffrey Hinton. “Neural Networks for Machine Learning” (Coursera lecture 6.5), 2012. Lecture Notes. (Module: Nott::Optimizer::RMSprop).
- Stochastic Gradient Descent with Momentum. Boris T. Polyak. “Some Methods of Speeding up the Convergence of Iteration Methods.” USSR Comput. Math. and Math. Phys. 1964. (Module: Nott::Optimizer::SGD).
- Muon, AdaMuon, MuonManifold. Pavol Bielik et al. “Muon: Muon Momentum + Adaptive Manifold Optimization.” arXiv:2502.16982; “AdaMuon: Adaptive Momentum for Muon Optimizer.” arXiv:2507.11005; Thinking Machines “Modular Manifolds.” Modular Manifolds. (Module: Nott::Optimizer::Muon).
- Sophia. Shuchen Zhang et al. “Sophia: A Scalable Stochastic Second-Order Optimizer for Language Models.” arXiv:2305.14342. (Module: Nott::Optimizer::Sophia).
- Spectral Normalization. Takeru Miyato et al. “Spectral Normalization for Generative Adversarial Networks.” arXiv:1802.05957. (Module: Nott::Regularization::SpectralNorm).
- Stochastic Weight Averaging (SWA). Pavel Izmailov et al. “Averaging Weights Leads to Wider Optima in Deep Learning.” arXiv:1803.05407. (Module: Nott::Regularization::SWA).
- SWAG. Wesley J. Maddox et al. “SWAG: A Simple Baseline for Bayesian Uncertainty in Deep Learning.” arXiv:1902.02476. (Module: Nott::Regularization::SWAG).
- TRADES. Hongyang Zhang et al. “Theoretically Principled Trade-off between Robustness and Accuracy.” arXiv:1901.08573. (Module: Nott::Regularization::TRADES).
- Virtual Adversarial Training (VAT). Takeru Miyato et al. “Virtual Adversarial Training: A Regularization Method for Supervised and Semi-supervised Learning.” arXiv:1704.03976. (Module: Nott::Regularization::VAT).
- Elastic Net. Hui Zou, Trevor Hastie. “Regularization and Variable Selection via the Elastic Net.” JRSS B 2005. (Module: Nott::Regularization::ElasticNet).
- L1 / L2 Penalties. Andrew Y. Ng. “Feature selection, L1 vs. L2 regularization, and rotational invariance.” ICML 2004. (Modules: Nott::Regularization::L1, Nott::Regularization::L2).
- Group Lasso. Ming Yuan, Yi Lin. “Model selection and estimation in regression with grouped variables.” JRSS B 2006. (Module: Nott::Regularization::GroupLasso).
- Max-Norm Constraints. George E. Dahl et al. “Improving Deep Neural Networks for LVCSR using Maxout and Dropout.” ICASSP 2013 (popularized max-norm constraints). (Module: Nott::Regularization::MaxNorm).
- Orthogonality Regularization. Ankit Bansal, Daniel Chen, David Jacobs. “Can We Gain More from Orthogonality Regularizations in Training Deep CNNs?” NeurIPS 2018. (Module: Nott::Regularization::Orthogonality).
- Nuclear Norm. Nathan Srebro, Jason Rennie, Tommi Jaakkola. “Maximum-Margin Matrix Factorization.” NeurIPS 2004. (Module: Nott::Regularization::NuclearNorm).
- Jacobian Regularization. Patrice Simard et al. “Best Practices for Convolutional Neural Networks applied to Visual Document Analysis.” ICDAR 2003. (Module: Nott::Regularization::Jacobian).
- Decorrelation (DeCov). Michael Cogswell et al. “Reducing Overfitting in Deep Networks by Decorrelating Representations.” arXiv:1511.06068. (Module: Nott::Regularization::Decov).
- Fisher and Sharpness-aware FGE. Timur Garipov et al. “Loss Surfaces, Mode Connectivity, and Fast Geometric Ensembling.” arXiv:1802.10026; Christian Liebel, Eva Müller. “Sharpness-Aware Training for Fast Geometric Ensembling.” arXiv:2303.00595. (Modules: Nott::Regularization::FGE, Nott::Regularization::SFGE).
- Elastic Weight Consolidation. James Kirkpatrick et al. “Overcoming catastrophic forgetting in neural networks.” PNAS 2017. (Module: Nott::Regularization::EWC).
- Memory Aware Synapses. Rahaf Aljundi et al. “Memory Aware Synapses: Learning what (not) to forget.” ECCV 2018. (Module: Nott::Regularization::MAS).
- Synaptic Intelligence. Friedemann Zenke, Ben Poole, Surya Ganguli. “Continual Learning Through Synaptic Intelligence.” arXiv:1703.04200. (Module: Nott::Regularization::SI).
- L0 Hard Concrete Gates. Christos Louizos, Max Welling, Diederik P. Kingma. “Learning Sparse Neural Networks through L0 Regularization.” arXiv:1712.01312. (Module: Nott::Regularization::L0HardConcrete).
- Kullback–Leibler Sparsity. Andrew Ng. “Sparse Autoencoder.” CS294A Lecture Notes 2011. (Module: Nott::Regularization::KLSparsity).
- R1 / R2 Gradient Penalties. Lars Mescheder, Sebastian Nowozin, Andreas Geiger. “Which Training Methods for GANs do actually Converge?” arXiv:1801.04406. (Modules: Nott::Regularization::R1, Nott::Regularization::R2).
- WGAN-GP. Ishaan Gulrajani et al. “Improved Training of Wasserstein GANs.” arXiv:1704.00028. (Module: Nott::Regularization::WGANGP).