I started this project to get a better understanding of dynamic computation graphs in PyTorch, a framework I use and love. Inspired by the original DyNet paper, this repository implements a lightweight, general-purpose deep learning library from scratch. The ultimate goal is to have a well-maintained C++ library that can be used for end-to-end training.
gladius's architecture consists of the following components:

- `Parameter`: real-valued vectors and matrices representing weight matrices and bias vectors.
- `LookupParameters`: sets of `Parameter` vectors.
- `Model`: a collection of `Parameter` and `LookupParameter` objects. The model keeps track of the parameters and their gradients.
- `Trainer`: implements an online update rule, such as SGD or Adam. The trainer holds a pointer to the model object and, therefore, to the parameters it contains.
- `Expression`: the main data type manipulated in a gladius program. Each expression represents a subcomputation in the computation graph. For instance, a `Parameter` object can be added to the computation graph, resulting in an expression `W` or `b`.
- Operations: functions that act on expressions and return other expressions. Crucially, they are not objects. gladius defines many different operations, including addition, multiplication, softmax, tanh, etc.
- Builder classes: these define interfaces for building different networks. In our case, we are mostly interested in implementing the transformer network, but it should not be hard to add, say, a recurrent neural network builder. Builders work on top of expressions and operations and provide easy-to-use libraries. More discussion on builders below.
- `ComputationGraph`: expressions are part of an implicit computation graph object, internally represented as a directed acyclic graph (DAG). gladius currently assumes that only one computation graph exists at a time. From the user's perspective, a new computation graph is created for each training example.
Before cloning the repository, make sure you have the following installed on your machine:
- cmake (version >= 3.18)
- cmake-format
- clang-format
When cloning this repository, remember to also grab the submodules, since we use the following submodule dependencies: cereal, googletest, and benchmark. Clone the repository as follows:
$ git clone https://github.com/BlaiseMuhirwa/gladius.git --recurse-submodules
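If you already cloned without `--recurse-submodules`, the submodules can be fetched afterwards with standard git:

$ git submodule update --init --recursive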
Then build the library with the build script. To also build all unit and integration tests, pass the optional `tests` argument:

$ ./build.sh tests

If you do not provide the `tests` argument, `cmake` will only build the gladius static library.
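Assuming the googletest targets are registered with CTest and the script writes to a `build/` directory (both assumptions, since the build configuration is not shown here), the tests can then typically be run with:

$ cd build && ctest --output-on-failure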
Note: We currently only support Macs with the x86-64 architecture. Support for more architectures will be added progressively.
- Full implementation of the Transformer architecture (scaled dot-product attention, etc.). This will be fun.
- Fast computations with Eigen.
- Support for data-parallel training.
- SGD and Adam.
- Parallel implementations with OpenMP.
- No more explicit Jacobian computations. In practice nobody materializes the full Jacobian, but I figured it would be mathematically helpful for my understanding to derive every single gradient update that way. This turns out to be very inefficient: the Jacobian for fully-connected layers is very sparse, so we gain a lot by never computing it.
To get a sense of the sparsity of the Jacobian, consider the following case. Suppose we have a weight matrix $W \in \mathbb{R}^{m \times n}$ and an input vector $x \in \mathbb{R}^{n}$. Let $y = Wx$, so that $y_i = \sum_{j=1}^{n} W_{ij} x_j$, and flatten $W$ so that the Jacobian $\partial y / \partial W$ has shape $m \times (m \cdot n)$, with entries $\partial y_i / \partial W_{kj} = x_j$ if $k = i$ and $0$ otherwise. Notice that only $m \cdot n$ of its $m^2 n$ entries are nonzero, since each output $y_i$ depends only on row $i$ of $W$. Contracting this mostly-zero Jacobian with the upstream gradient wastes time and memory, whereas the gradient of a scalar loss can be written directly as the outer product $\partial L / \partial W = (\partial L / \partial y)\, x^{\top}$.
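As a standalone illustration, the sketch below uses plain Eigen (listed on the roadmap above), not any gladius API. It compares the naive route of materializing the full Jacobian of $y = Wx$ against the direct outer-product route; both yield the same $\partial L / \partial W$, but the direct route never builds the $m \times (m \cdot n)$ intermediate.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
  // y = W x with W in R^{m x n}.
  const int m = 3, n = 4;
  Eigen::MatrixXd W = Eigen::MatrixXd::Random(m, n);
  Eigen::VectorXd x = Eigen::VectorXd::Random(n);
  Eigen::VectorXd y = W * x;

  // Pretend dL/dy arrives from upstream backpropagation.
  Eigen::VectorXd dLdy = Eigen::VectorXd::Random(m);

  // Naive route: build the full Jacobian dy/dW of shape (m, m*n).
  // Only m*n of its m*m*n entries are nonzero: dy_i/dW_{ij} = x_j.
  Eigen::MatrixXd J = Eigen::MatrixXd::Zero(m, m * n);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      J(i, i * n + j) = x(j);  // row-major flattening of W

  // dL/dW via the Jacobian: J^T * dL/dy, then reshape back to (m, n).
  Eigen::VectorXd flat = J.transpose() * dLdy;
  Eigen::MatrixXd dLdW_naive =
      Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                               Eigen::RowMajor>>(flat.data(), m, n);

  // Direct route: exploit the sparsity. The same gradient is just an
  // outer product, with no (m x m*n) intermediate matrix.
  Eigen::MatrixXd dLdW = dLdy * x.transpose();

  // Both routes agree; the printed difference is ~0.
  std::cout << (dLdW - dLdW_naive).norm() << std::endl;
  return 0;
}
```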