I started this project to get a better understanding of dynamic computation graphs in PyTorch, a framework I use and love. Inspired by the original DyNet paper, this repository implements a lightweight, general-purpose deep learning library from scratch. The ultimate goal is to have a well-maintained C++ library that can be used for end-to-end training.
gladius's architecture consists of the following components:

- `Parameter`: real-valued vectors and matrices representing weight matrices and bias vectors.
- `LookupParameters`: sets of `Parameter` vectors.
- `Model`: a collection of `Parameter` and `LookupParameter` objects. The model keeps track of the parameters and their gradients.
- `Trainer`: implements an online update rule, such as SGD or Adam. The trainer holds a pointer to the model object and, therefore, to the parameters it contains.
- `Expression`: the main data type manipulated in a gladius program. Each expression represents a subcomputation in the computation graph. For instance, a `Parameter` object can be added to the computation graph, resulting in an expression `W` or `b`.
- Operations: functions that act on expressions and return other expressions. Crucially, they are not objects. gladius defines many different operations, including addition, multiplication, softmax, tanh, etc.
- Builder classes: these define interfaces for building different networks. In our case, we are mostly interested in implementing the transformer network, but it should not be hard to add, say, a recurrent neural network builder. Builders work on top of expressions and operations and provide easy-to-use libraries. More discussion on builders below.
- `ComputationGraph`: expressions are part of an implicit computation graph object, internally represented as a directed acyclic graph (DAG). gladius currently assumes that only one computation graph exists at a time. From the user's perspective, a new computation graph is created for each training example.
Before cloning the repository, make sure you have the following installed on your machine:
- cmake (version >= 3.18)
- cmake-format
- clang-format
When cloning this repository, remember to also grab the submodules, since we use the following submodule dependencies: cereal, googletest, and benchmark. Clone the repository as follows:
$ git clone https://github.com/BlaiseMuhirwa/gladius.git --recurse-submodules
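If you already cloned without `--recurse-submodules`, the submodules can be fetched afterwards with standard git:

$ git submodule update --init --recursive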
Then build the library with the build script. To also build all unit and integration tests, pass the optional `tests` argument:

$ ./build.sh tests

If you do not provide the `tests` argument, `cmake` will only build the gladius static library.
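Assuming the googletest targets are registered with CTest and the script writes to a `build/` directory (both assumptions, since the build configuration is not shown here), the tests can then typically be run with:

$ cd build && ctest --output-on-failure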
Note: We currently only support Macs with the x86-64 architecture. Support for more architectures will be added progressively.
- Full implementation of the Transformer architecture (scaled dot-product attention, etc.). This will be fun.
- Fast computations with Eigen.
- Support for data-parallel training.
- SGD and Adam.
- Parallel implementations with OpenMP.
- No more explicit Jacobian computations. In practice nobody materializes the full Jacobian, but I figured it would be mathematically helpful for my understanding to derive every single gradient update that way. This turns out to be very inefficient: the Jacobian for fully-connected layers is very sparse, so we gain a lot by never computing it.
To get a sense of the sparsity of the Jacobian, consider the following case. Suppose we have a weight matrix $W \in \mathbb{R}^{m \times n}$ and an input vector $x \in \mathbb{R}^{n}$. Let $y = Wx$, so that $y_i = \sum_{j=1}^{n} W_{ij} x_j$, and flatten $W$ so that the Jacobian $\partial y / \partial W$ has shape $m \times (m \cdot n)$, with entries $\partial y_i / \partial W_{kj} = x_j$ if $k = i$ and $0$ otherwise. Notice that only $m \cdot n$ of its $m^2 n$ entries are nonzero, since each output $y_i$ depends only on row $i$ of $W$. Contracting this mostly-zero Jacobian with the upstream gradient wastes time and memory, whereas the gradient of a scalar loss can be written directly as the outer product $\partial L / \partial W = (\partial L / \partial y)\, x^{\top}$.
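As a standalone illustration, the sketch below uses plain Eigen (listed on the roadmap above), not any gladius API. It compares the naive route of materializing the full Jacobian of $y = Wx$ against the direct outer-product route; both yield the same $\partial L / \partial W$, but the direct route never builds the $m \times (m \cdot n)$ intermediate.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
  // y = W x with W in R^{m x n}.
  const int m = 3, n = 4;
  Eigen::MatrixXd W = Eigen::MatrixXd::Random(m, n);
  Eigen::VectorXd x = Eigen::VectorXd::Random(n);
  Eigen::VectorXd y = W * x;

  // Pretend dL/dy arrives from upstream backpropagation.
  Eigen::VectorXd dLdy = Eigen::VectorXd::Random(m);

  // Naive route: build the full Jacobian dy/dW of shape (m, m*n).
  // Only m*n of its m*m*n entries are nonzero: dy_i/dW_{ij} = x_j.
  Eigen::MatrixXd J = Eigen::MatrixXd::Zero(m, m * n);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      J(i, i * n + j) = x(j);  // row-major flattening of W

  // dL/dW via the Jacobian: J^T * dL/dy, then reshape back to (m, n).
  Eigen::VectorXd flat = J.transpose() * dLdy;
  Eigen::MatrixXd dLdW_naive =
      Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                               Eigen::RowMajor>>(flat.data(), m, n);

  // Direct route: exploit the sparsity. The same gradient is just an
  // outer product, with no (m x m*n) intermediate matrix.
  Eigen::MatrixXd dLdW = dLdy * x.transpose();

  // Both routes agree; the printed difference is ~0.
  std::cout << (dLdW - dLdW_naive).norm() << std::endl;
  return 0;
}
```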