Note: `fl::Tensor` is a work in progress. Potential API breakage is possible as development continues. We'll do our best to adapt existing code as the API evolves.
`fl::Tensor` is an effort to create a portable tensor frontend that facilitates integrating custom tensor computation stacks for research at the library, compiler, and hardware level. It provides:
- An unopinionated API with which tensor libraries can be implemented or wrapped;
- A small implementation surface with few required operators;
- Compute-level functions that help define implementation requirements;
- Tests to ensure equivalent expected behavior;
- Benchmarks with Flashlight to test end-to-end performance with a variety of state-of-the-art neural networks.
In general, the framework eschews opinionation and provides the minimum viable surface on top of which to integrate a new tensor backend, requiring the user to define only a few dozen operations. `fl::Tensor` is not an IR: the user defines everything beneath the base interface.
Tensor stack research can be challenging: large-scale research frameworks are often opinionated, provide rigid APIs or IRs on top of which compilers must be developed, and require significant changes to modify internals. `fl::Tensor` prioritizes the following to facilitate development:
- Small API footprint: a small, simple interface to add tensor ops.
- User-defined internals: implement things under a standard, high-level API; no pre-specified IRs.
- Compute-model flexibility: users can integrate JITs, eager-evaluation libraries, and others, with few requirements.
At a high level, implementing an `fl::Tensor` backend is simple and requires subclassing two interfaces:
- `TensorAdapterBase`, which defines operations for member functions on an `fl::Tensor`;
- `TensorBackend`, which defines global tensor operations.
Tensor backend implementations should be placed in `tensor/backend/[backend name]`; for example, the ArrayFire backend implementation resides in `tensor/backend/af`. `ArrayFireTensor.h` and `ArrayFireBackend.h` provide good starting-point examples of implementations.
In this example, we'll integrate a new `FooBaz` backend.
First, we'll add derived implementations of the `TensorAdapterBase` and `TensorBackend` interfaces. We've created a new directory, `flashlight/fl/tensor/backend/foobaz`, and added a `CMakeLists.txt` that links dependencies and compiles the files containing our implementations:
```cmake
cmake_minimum_required(VERSION 3.10)

find_package(FooBazBackendDep REQUIRED) # if our backend depends on something external
target_link_libraries(flashlight PUBLIC FooBazBackendDep::FooBazBackendDep)

target_sources(
  flashlight
  PRIVATE
  ${CMAKE_CURRENT_LIST_DIR}/FooBazTensor.cpp
  ${CMAKE_CURRENT_LIST_DIR}/FooBazBackend.cpp
)

# If a backend requires a compute runtime, e.g. CUDA, this can be set
# globally to ensure CUDA interoperability with other FL components:
fl_set_backend_state(ENABLE CUDA)
```
and we'll gate building our backend in `flashlight/fl/tensor/CMakeLists.txt`:
```cmake
# ...
# Add a CMake option to build the backend. Passing -DFL_USE_FOOBAZ=ON when
# running cmake will build the backend.
option(FL_USE_FOOBAZ "Build FooBaz tensor backend" OFF)
if (FL_USE_FOOBAZ)
  include(${CMAKE_CURRENT_LIST_DIR}/backend/foobaz/CMakeLists.txt)
endif()
# ...
# Add a compile definition for compile-time use
target_compile_definitions(
  flashlight
  PUBLIC
  FL_USE_FOOBAZ=$<BOOL:${FL_USE_FOOBAZ}> # added
)
# ...
```
Now, we'll add `FooBazTensor.h`, also in `flashlight/fl/tensor/backend/foobaz`, which defines the class `FooBazTensor`, derived from `TensorAdapterBase`, as follows:
```cpp
#include "flashlight/fl/tensor/TensorAdapter.h"
#include "flashlight/fl/tensor/Shape.h"
#include "flashlight/fl/tensor/Index.h"

class FooBazTensor : public fl::TensorAdapterBase {
  // We can store state and metadata associated with our tensor here.

 public:
  // Provide a constructor to create a FooBazTensor using some existing data/a buffer:
  FooBazTensor(
      const Shape& shape,
      fl::dtype type,
      void* ptr,
      Location memoryLocation);
  // ...

  // Add the backend type to the TensorBackendType enum in TensorBase.h, then implement:
  TensorBackendType backendType() const override;

  // Override other interface functions. For example:
  const Shape& shape() override; // returns a Flashlight shape
  void device(void** out) override; // returns a pointer to device memory, if applicable
  Tensor index(const std::vector<Index>& indices) override; // indexing operator
  // ...
};
```
We write our implementations in `FooBazTensor.cpp` accordingly (omitted here). If our `FooBazTensor` doesn't support some operations, we can have them throw exceptions, or alternatively include, depend on, and delegate to implementations in another existing `fl::Tensor` backend. See the tensor adapter interface documentation for function-level expected behavior.
Now, we'll create our backend implementation, `FooBazBackend.h`, also in `flashlight/fl/tensor/backend/foobaz`, where we'll define most tensor operations and can store global state related to our backend:
```cpp
#include "flashlight/fl/tensor/TensorBackend.h"

class FooBazBackend : public fl::TensorBackend {
  // We can store global state for our backend (dataflow graphs,
  // memory management, compute streams, op status, etc.) here.

 public:
  // Performs pre-flight tasks when initializing the backend
  FooBazBackend();
  // ...

  // Ops on tensors. For example:
  Tensor transpose(const Tensor& tensor, const Shape& dims /* = {} */) override;
  Tensor exp(const Tensor& tensor) override;
  Tensor matmul(const Tensor& lhs, const Tensor& rhs) override;
  // ...
};
```
Implementations of this interface live in `FooBazBackend.cpp` (or are distributed across other compilation units) and are omitted here. As with `FooBazTensor`, if our backend doesn't support some operations, we can have those operations throw exceptions, or alternatively delegate to another existing backend. See the backend interface documentation for function-level expected behavior.
We can test an `fl::Tensor` implementation against requirements for expected behavior by running the tests in `flashlight/fl/test/tensor`. These include:
- `TensorBaseTest`, which tests most interface and compute operations;
- `IndexTest`, which tests tensor indexing and some in-place operations;
- `ComputeTest`, which tests async computation and data availability, as applicable.
Below is a more formal definition of the implementation requirements for a tensor backend.
In addition to deriving from the `TensorAdapterBase` and `TensorBackend` interfaces, implementers are required to:
1. Provide memory interoperability as defined in the interfaces. This includes constructors to create tensors from arbitrary buffers either on the host or on accelerator devices, as well as:
   - `void fl::TensorAdapter::host(void** out)`, which provides a host-side buffer for an underlying tensor.
   - `void fl::TensorAdapter::device(void** out)`, which provides a device-side buffer pointing to an underlying tensor. This buffer may not be up-to-date; non-blocking relative synchronization should be used as needed, i.e. through `Tensor::stream().relativeSync(Stream&)`.
   - To avoid needing to copy device-side buffers, the API also provides `void unlock()`, which signals that a pointer to device memory can be freed if its corresponding tensor(s) are destroyed.
   - Only memory from buffers returned via `Tensor::host` is guaranteed to be up-to-date. Memory returned from `Tensor::device` may still require computation to be up-to-date; relative synchronization should be used in these cases.
2. Implement compute synchronization primitives as defined by the interfaces. These include:
   - `void fl::TensorBackend::sync()`: blocks the calling thread until all computation on all streams associated with active devices is complete.
   - `void fl::TensorBackend::eval(Tensor&)`: launches any kernels that need to be executed to make the tensor's updated value available.
   - These can be no-ops if the backend's computation model so dictates.
3. Pass all tests as provided in `flashlight/fl/test/tensor` for implemented operators.
Backends implemented with `fl::Tensor` can have both lazy and eager evaluation semantics. The `sync` and `eval` requirements above imply support for lazy evaluation, but one or both can be no-ops as long as the implementation requirements are met.
There are no additional implementation requirements around memory usage beyond the constructor and the `host` and `device` functions above. Only tensors on which these memory-access functions are called need to have explicit buffers.
To adjust the default backend with which Flashlight tensors are created, change the types and preprocessor values in `TensorAdapter.cpp` accordingly.
- Note: Flashlight's default tensor backend continues to be ArrayFire, and newly-created `fl::Tensor`s will call into ArrayFire by default.
Some operations fall outside of the minimal tensor interface that `fl::Tensor` provides but are still important in machine learning computation. For these operations, Flashlight defines various `TensorExtension`s, which allow these operations to be implemented and connected to a `TensorBackend` for seamless dispatch.
Tensor extensions derive from `TensorExtension<T>` and can define their own APIs for backends to implement operations. Extensions are registered to a given backend using the `FL_REGISTER_TENSOR_EXTENSION(...)` macro, which takes a tensor extension type and a tensor backend type. Only one implementation of a given tensor extension can be registered to a given backend.
For example, to register an extension implementation (in this case registering a `CudnnAutogradExtension` to the ArrayFire implementation of `TensorBackend`), we first check that the backends are available and selected to be built by the user, before registering the extension:
```cpp
#if FL_USE_CUDNN
#if FL_USE_ARRAYFIRE && FL_ARRAYFIRE_USE_CUDA
FL_REGISTER_TENSOR_EXTENSION(
    CudnnAutogradExtension,
    TensorBackendType::ArrayFire);
#endif // FL_USE_ARRAYFIRE && FL_ARRAYFIRE_USE_CUDA
#endif // FL_USE_CUDNN
```
Flashlight's `AutogradExtension` provides an API for implementing operations that frequently have autograd-aware implementations. These include:
- Convolution (`conv2d`)
- Pooling (`pool2d`)
- Batch normalization (`batchnorm`)
- Recurrent neural networks (`rnn`)
Each of the above interface functions has one or more corresponding backward functions (e.g. `pool2dBackward`) with which to compute gradients with respect to its forward parameters.
For documentation on the behavior and input parameters of functions on `AutogradExtension`, see the documentation for `AutogradOps.h`, which dispatches to the registered `AutogradExtension` implementations for those operations' input tensors' backends.