
Nott

Nott is a modern C++ deep-learning framework that layers a strongly typed API over LibTorch. It prioritizes reproducibility, predictable latency, and explicit control over kernels, memory, and optimizer state.

Why Nott?

  • First-class graph authoring. Layers and higher-order blocks can be connected as a DAG, letting you express anything from small CNNs to large transformer stacks without wrestling with manual tensor plumbing.
  • Consistent systems model. Data loaders, augmentations, optimizers, regularizers, and metrics share the same descriptor-driven style so you can mix-and-match building blocks safely.
  • Native performance. Nott keeps you close to the metal through LibTorch while still providing ergonomic abstractions. Benchmarks at the end of this document detail the runtime overhead compared to pure LibTorch.
  • Not just LibTorch wrappers. The API covers extensive layer catalogs, attention descriptors, data loaders, and training utilities rather than exposing only a handful of LibTorch layers.

Quick Start

#include <Nott>

int main() {
    Nott::Model model("demo");
    model.use_cuda(torch::cuda::is_available());

    model.add(Nott::Layer::FC({784, 256, true}, Nott::Activation::GeLU));
    model.add(Nott::Layer::Dropout({0.1}));
    model.add(Nott::Layer::FC({256, 10, true}, Nott::Activation::Softmax));

    model.set_optimizer(
        Nott::Optimizer::AdamW({.learning_rate = 1e-3}),
        Nott::LrScheduler::CosineAnnealing({.T_max = 50})
    );

    model.set_loss(Nott::Loss::MSE({}));

    auto [train_images, train_labels, test_images, test_labels] = Nott::Data::Load::MNIST("./datasets", 1.f, 1.f, true);
        
    model.train(train_images, train_labels, {.epoch = 10, .batch_size = 64});

    model.evaluate(test_images, test_labels, Nott::Evaluation::Classification, {
        Nott::Metric::Classification::Precision,
        Nott::Metric::Classification::F1,
        Nott::Metric::Classification::Informedness});
    return 0;
}

Adding Layers and Blocks

The model holds a directed acyclic graph of computational blocks and layers. You can construct it incrementally with .add().

model.add(Nott::Layer::FC({258, 10, /*bias*/true}, Nott::Activation::GeLU, Nott::Initialization::HeNormal));

or blocks:

model.add(Nott::Block::Sequential({ /*vector field*/
    Nott::Layer::Conv2d({3, 64, {3, 3}, {1, 1}, {1, 1}, {1, 1}, 1, false},
        Nott::Activation::Identity, Nott::Initialization::HeNormal),
    Nott::Layer::BatchNorm2d({64, 1e-5, 0.1, true, true},
        Nott::Activation::SiLU),
    Nott::Layer::MaxPool2d({{2, 2}, {2, 2}})
}));

The framework ships with a rich catalog of layers (see Docs/Layers and Docs/Blocks). Items added via .add() are linked linearly by default; to rewire the network, use .links() (see Docs/Links). Multi-head attention descriptors that power the transformer blocks are documented in Docs/Attention.
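
The exact .links() signature is defined in Docs/Links; the snippet below is only a hypothetical sketch, assuming layers are addressed by the order in which they were added and that .links() accepts an explicit edge list replacing the default linear chain.

// Hypothetical sketch only — the real argument format may differ from the .links() API in Docs/Links.
model.links({
    {0, 1},   // layer 0 -> layer 1
    {0, 2},   // layer 0 -> layer 2 (skip connection around layer 1)
    {1, 2}    // layer 1 -> layer 2
});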

Configuring Optimization

Optimizer and scheduler choices are set once per model by default.

The example below pairs AdamW with cosine annealing and a warmup phase.

model.set_optimizer(
    Nott::Optimizer::AdamW({.learning_rate = 1e-4, .weight_decay = 5e-4}),
    Nott::LrScheduler::CosineAnnealing({
        .T_max = steps_per_epoch * epochs,
        .eta_min = 3e-7,
        .warmup_steps = 5 * steps_per_epoch,
        .warmup_start_factor = 0.1
    })
);

Losses and regularization follow the same pattern:

model.set_loss(Nott::Loss::CrossEntropy({.label_smoothing = 0.02f}));
model.set_regularization({ /*vector field*/
    Nott::Regularization::SWAG({
        .coefficient = 1e-3,
        .variance_epsilon = 1e-6,
        .start_step = static_cast<size_t>(0.85 * steps_per_epoch * epochs),
        .accumulation_stride = static_cast<size_t>(steps_per_epoch),
        .max_snapshots = 20,
    })
});

For the complete list of optimizers, losses, and regularizations together with their parameters, see Docs/Optimizer, Docs/Loss, and Docs/Regularization.

It is also possible to apply multiple configurations across different parts of the network; see Docs/Local.

Working with Data

The Nott::Data::Load namespace includes ready-made loaders for popular datasets such as MNIST, CIFAR-10, ETTH, PTBXL. Data manipulations (augmentation, shuffling, and splitting) are exposed through Nott::Data::Manipulation utilities, while consistency checks live under Nott::Data::Check.

auto [train_images, train_labels, test_images, test_labels] = Nott::Data::Load::CIFAR10(dataset_root, 1.f, 1.f, true);

auto [validation_images, validation_labels] = Nott::Data::Manipulation::Fraction(test_images, test_labels, 0.1f);

std::tie(train_images, train_labels) = Nott::Data::Manipulation::Cutout(train_images, train_labels, {-1, -1}, {12, 12}, -1, 1.f, true, false);

More information inside Docs/Data

Training

Training is initiated with model.train, which accepts tensors and a Nott::TrainOptions struct describing epochs, batch size, graph mode, validation splits, AMP, and other runtime settings.

model.train(train_images, train_labels, {.epoch=120, .batch_size=128, .test={x_val,y_val}});
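
The x_val / y_val pair above can be produced with the Fraction utility from the data section; for instance, reusing the MNIST loader from the Quick Start:

auto [train_images, train_labels, test_images, test_labels] = Nott::Data::Load::MNIST("./datasets", 1.f, 1.f, true);
// Hold out 10% of the test split as a validation set monitored during training.
auto [x_val, y_val] = Nott::Data::Manipulation::Fraction(test_images, test_labels, 0.1f);
model.train(train_images, train_labels, {.epoch = 120, .batch_size = 128, .test = {x_val, y_val}});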

More information in Docs/Train

Evaluation and Metrics

Post-training evaluation is performed with model.evaluate, which accepts the test split, a task type, and a list of metrics. The evaluation API streams batches and accumulates metrics such as accuracy, precision, recall, calibration errors, and more.

model.evaluate(test_images, test_labels, Nott::Evaluation::Classification, { /*vector field*/
    Nott::Metric::Classification::Precision,
    Nott::Metric::Classification::Recall,
    Nott::Metric::Classification::F1,
    Nott::Metric::Classification::TruePositiveRate,
    Nott::Metric::Classification::LogLoss,
}, {.batch_size = 64});

More details inside Docs/Evaluation

Saving and Loading

Use model.save() and model.load() to persist and restore a network. model.save() creates a folder named after the network (_Network_Name_) and writes architecture.json, which records the layer graph, dimensions, parameters, and optimizer configuration, alongside parameters.binary, which stores the learnable parameters of each layer.

NB: because model.load() reads architecture.json, you do not need to re-declare the network via model.add().

model.save("PATH");
model.load("PATH"+"/_Network_Name_");

Details live in Save & Load.

Latency Benchmarks

Results below represent warm runs filtered with a Tukey 0.98 fence on the MNIST workload
(60k samples, 28×28 | epochs = 100, batch = 64).

Two configurations are reported:

  1. Mixed I/O: async pinned memory enabled only in Nott::Train().
  2. Unified I/O: async pinned memory enabled in all runners (Nott prebuilt, Nott custom, LibTorch).

1) Mixed I/O — Async pinned memory only in Nott::Train()

Runner | Steps (filtered) | Mean (ms) | Std | CV | P10 | P50 | P90 | P98 | Mode | Throughput (steps/s)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Nott — Prebuilt Train() | 76 916 | 1.20268 | 0.00157 | 0.00131 | 1.20049 | 1.20302 | 1.20451 | 1.20537 | 1.20398 | 831.47
Nott — Custom Train() | 91 027 | 1.33688 | 0.18792 | 0.14057 | 1.17145 | 1.23006 | 1.65251 | 1.72896 | 1.19031 | 748.01
LibTorch Raw | 90 837 | 1.27572 | 0.18145 | 0.14224 | 1.12161 | 1.16910 | 1.59117 | 1.66006 | 1.13251 | 783.87
  • CV (coefficient of variation) = Std / Mean. Lower = less jitter.

Overhead (relative to mean latency, positive = slower than reference)

Comparison | Value
--- | ---
Nott Prebuilt vs LibTorch Overhead | -5.73%
Nott Prebuilt vs Nott Custom Overhead | -10.04%
Nott Custom vs LibTorch Overhead | +4.79%
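
The overhead figures are plain relative differences of the mean latencies, i.e. (mean_A − mean_B) / mean_B; a minimal sketch reproducing the Mixed I/O values:

#include <cstdio>

int main() {
    // Mean step latencies (ms) copied from the Mixed I/O table above.
    const double prebuilt = 1.20268, custom_train = 1.33688, libtorch = 1.27572;
    auto overhead = [](double a, double b) { return 100.0 * (a - b) / b; };
    std::printf("Prebuilt vs LibTorch: %+.2f%%\n", overhead(prebuilt, libtorch));     // -5.73%
    std::printf("Prebuilt vs Custom:   %+.2f%%\n", overhead(prebuilt, custom_train)); // -10.04%
    std::printf("Custom   vs LibTorch: %+.2f%%\n", overhead(custom_train, libtorch)); // +4.79%
    return 0;
}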

In this configuration, Nott’s prebuilt Train() benefits from async pinned memory while the other runners do not, so this setup is favorable to the prebuilt runner and mainly illustrates the impact of I/O configuration.


2) Unified I/O — Async pinned memory in all runners

Runner | Steps (filtered) | Mean (ms) | Std | CV | P10 | P50 | P90 | P98 | Mode | Throughput (steps/s)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Nott — Prebuilt Train() | 71 288 | 1.06486 | 0.00184 | 0.00172 | 1.06275 | 1.06475 | 1.06702 | 1.06889 | 1.06298 | 939.09
Nott — Custom Train() | 75 622 | 1.06443 | 0.01764 | 0.01657 | 1.04117 | 1.06435 | 1.08850 | 1.10319 | 1.06208 | 939.47
LibTorch Raw | 80 820 | 1.02841 | 0.00512 | 0.00498 | 1.02150 | 1.02813 | 1.03556 | 1.03934 | 1.02704 | 972.37
  • CV (coefficient of variation) = Std / Mean. Lower = less jitter.

Overhead (relative to mean latency, positive = slower than reference)

Comparison | Value
--- | ---
Nott Prebuilt vs LibTorch Overhead | +3.54%
Nott Prebuilt vs Nott Custom Overhead | +0.04%
Nott Custom vs LibTorch Overhead | +3.50%

With identical pinned-memory settings, Nott’s prebuilt Train() stays within a few percent of raw LibTorch in mean latency and throughput, while keeping jitter extremely low.


Note on variability. These numbers are relative, not absolute. Modern hardware is noisy: clocks, power limits, thermals and OS scheduling all drift, so repeated runs with identical settings produce slightly different latency distributions. The robust takeaway across both tables is that Nott’s wrapper adds at most a few percent overhead compared to raw LibTorch, and can even be faster under certain I/O configurations, while preserving very low jitter.

Source: test/speedtest.cpp

Research References

Nott’s backend modules follow the algorithms as described in the modern deep-learning literature. For each mechanism we cite a canonical paper or textbook (not always the first historical appearance) and link both to the reference and to the implementation file.

Activations

Transformers

Layers

  • Dropout. Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” arXiv:1207.0580. (Module: Nott::Layer::Dropout).
  • Batch Normalization. Sergey Ioffe, Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167. (Module: Nott::Layer::BatchNorm).
  • Instance Normalization. Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky. “Instance Normalization: The Missing Ingredient for Fast Stylization.” arXiv:1607.08022. (Module: Nott::Layer::InstanceNorm).
  • Convolutional Layers. Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” Biological Cybernetics 1980 (early convolution + pooling-like architecture); Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 1998 (modern gradient-trained CNNs with conv + pooling + FC); Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS 2012 (large-scale deep CNN on GPUs). (Modules: Nott::Layer::Conv2d and variants).
  • Fully Connected / Perceptron Layers. Frank Rosenblatt. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Rosenblatt 1958 (single-layer perceptron); David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. “Learning Representations by Back-Propagating Errors.” Nature 1986 (multilayer perceptrons with backpropagation). (Module: Nott::Layer::FC).
  • Pooling Layers. Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” 1980 (early “subsampling” / pooling); Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 1998 (max/average pooling in modern CNNs). (Modules: Nott::Layer::MaxPool2d, Nott::Layer::AvgPool2d).
  • Recurrent Layers. Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science 1990 (Elman RNN); Sepp Hochreiter, Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 1997; Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” arXiv:1406.1078. (Modules: Nott::Layer::RNN, Nott::Layer::LSTM, Nott::Layer::GRU).
  • Positional Encoding. Ashish Vaswani et al. “Attention Is All You Need.” arXiv:1706.03762. (Module: Nott::Layer::PositionalEncoding).
  • Structured State Spaces (S4). Albert Gu et al. “Efficiently Modeling Long Sequences with Structured State Spaces.” arXiv:2111.00396. (Module: Nott::Layer::S4).
  • Patch (Un)Embedding. Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv:2010.11929. (Modules: Nott::Layer::PatchUnembed, Nott::Layer::Resizing).
  • Flatten and Reduce. Generic tensor reshaping and spatial reduction operations used throughout early CNN architectures (e.g., LeCun et al. 1998). (Modules: Nott::Layer::Flatten, Nott::Layer::Reduce).

Losses

For classical statistical losses (CE, MSE, MAE, logistic), the references below are standard ML expositions, not the original 19th–20th century statistical papers.

  • Cross-Entropy / Negative Log Likelihood. Rooted in information theory and relative entropy: Claude E. Shannon. “A Mathematical Theory of Communication.” Bell System Technical Journal 1948; Solomon Kullback, Richard A. Leibler. “On Information and Sufficiency.” Annals of Mathematical Statistics 1951. For a modern ML treatment we follow: David J. C. MacKay. “Information Theory, Inference, and Learning Algorithms.” 2003 Text. (Modules: Nott::Loss::CE, Nott::Loss::NLL).

  • Binary Cross-Entropy / Logistic Loss. Originating from logistic models and Bernoulli log-likelihood: Joseph Berkson. “Application of the Logistic Function to Bio-Assay.” Journal of the American Statistical Association 1944 (introduces and justifies the logistic / logit model for bio-assay). For a standard applied treatment we follow: David W. Hosmer, Stanley Lemeshow. “Applied Logistic Regression.” 2000 Text. (Module: Nott::Loss::BCE).

  • Categorical Cross-Entropy. Multinomial / categorical negative log-likelihood in the sense of classical likelihood theory: Ronald A. Fisher. “On the Mathematical Foundations of Theoretical Statistics.” Phil. Trans. of the Royal Society A 1922. For the modern softmax cross-entropy formulation in ML we follow: Christopher M. Bishop. “Pattern Recognition and Machine Learning.” 2006 Text. (Module: Nott::Loss::CCE).

  • Mean Squared Error / Mean Absolute Error. Rooted in classical least-squares and least-absolute-deviations: Adrien-Marie Legendre. “Nouvelles méthodes pour la détermination des orbites des comètes.” 1805; Carl Friedrich Gauss. “Theoria motus corporum coelestium in sectionibus conicis solem ambientium.” 1809 (formalizing least squares / squared-error minimization). For a modern statistical learning treatment we follow: Vladimir Vapnik. “The Nature of Statistical Learning Theory.” 1995 Text. (Modules: Nott::Loss::MSE, Nott::Loss::MAE).

  • Smooth L1 (Huber) Loss. Peter J. Huber. “Robust Estimation of a Location Parameter.” Annals of Mathematical Statistics 1964 (original Huber loss); Ross Girshick. “Fast R-CNN.” arXiv:1504.08083 (popular Smooth L1 implementation for bounding-box regression). (Module: Nott::Loss::SmoothL1).

  • Dice Loss. Fausto Milletari et al. “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation.” arXiv:1606.04797. (Module: Nott::Loss::Dice).

  • Tversky Loss. Seyed Sadegh Mohseni Salehi et al. “Tversky loss function for image segmentation using 3D fully convolutional deep networks.” arXiv:1706.05721. (Based on the Tversky index from Amos Tversky, “Features of Similarity,” Psychological Review 1977.) (Module: Nott::Loss::Tversky).

  • Lovász-Softmax. Maxim Berman, Amal Rannen Triki, Matthew B. Blaschko. “The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks.” arXiv:1705.08790. (Module: Nott::Loss::LovaszSoftmax).

  • Cosine Embedding Loss. Raia Hadsell, Sumit Chopra, Yann LeCun. “Dimensionality Reduction by Learning an Invariant Mapping.” CVPR 2006. (Module: Nott::Loss::CosineEmbedding).

  • Margin Ranking Loss. Thorsten Joachims. “Optimizing Search Engines Using Clickthrough Data.” KDD 2002 (introducing large-margin ranking). (Module: Nott::Loss::MarginRanking).

  • Kullback–Leibler Divergence. Solomon Kullback, Richard A. Leibler. “On Information and Sufficiency.” Annals of Mathematical Statistics 1951. (Module: Nott::Loss::KL).

Learning Rate Schedulers

Optimizers

Regularization
