
TensileCreateLibrary


Definitions

Benchmark Run: Given a benchmark config (see Tensile/Configs/ for examples), generate a set of candidate solutions to solve a problem defined by parameterized properties. Run the generated kernels with the specified sizes/ranges and record their performance. The best performing kernels for the given benchmark are selected and written out as Library Logic.

Library Logic: One or more .yaml files containing benchmarking output results (usually one per gfx architecture, per benchmark problem). They consist of kernel meta-data, mappings to the problems each kernel can solve, and performance data at particular sizes.

Solution: A solution is essentially a parameterized representation of a kernel intended to solve a specific problem. In the case of a GEMM-type problem, a solution encapsulates a set of fixed parameters that are applied to a generalized GEMM algorithm.

Predicate: A predicate is a test that either affirms or negates a condition.
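For intuition, a predicate can be modeled as a boolean function over some object, with combinators like And used to build compound conditions. The sketch below uses assumed names and is not Tensile's actual predicate class:

#include <functional>
#include <vector>

template <typename Object>
using Predicate = std::function<bool(Object const&)>;

// Compound predicate: true only when every sub-predicate holds,
// mirroring the "type: And" nodes seen in Library Logic files.
template <typename Object>
Predicate<Object> And(std::vector<Predicate<Object>> preds)
{
    return [preds = std::move(preds)](Object const& obj) {
        for(auto const& p : preds)
            if(!p(obj))
                return false;
        return true;
    };
}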

Rationale

Every problem that can be solved with a generalized algorithm can be customized for the application at hand. In this case, we are solving problems on AMDGPUs, which are highly configurable and can accomplish the same work in many ways. We have many choices of workload distribution, hardware resource usage, vector parallelization, and loop unrolling, just to name a few. How do we know which choices guarantee correct results with the best performance?

One way to deal with this is parameterized benchmarks. We choose parameters and problem sizes that we are interested in, and try different combinations of these values to gain insight into their effects on correctness and performance. Once a combination is validated for correctness and shows a relative performance advantage, it is considered a valid 'solution' to the problem.

More on problem nomenclature here.

More on kernel parameters here.

In a GEMM context, each solution generates a unique kernel whose code implements the GEMM algorithm, specialized by fixed parameters. For example, a solution with parameter ThreadTile: [4, 8] will generate a kernel that solves the GEMM problem in a slightly different way than a kernel generated from a solution with ThreadTile: [8, 16]. Both achieve the same numerical results, but one may perform better than the other in certain circumstances.

One of the main goals of Tensile is to gather the fastest kernels possible to solve any variation of a given problem, based on insights gained from benchmarking. Users of the Tensile API therefore don't need to be domain experts in AMDGPUs to get the best performing solutions for the problems they are trying to solve.

Introduction

TensileCreateLibrary serves an important purpose in Tensile: generating and organizing libraries of kernels to run on AMDGPU architectures. Given a set of previously defined solutions to a problem, each solution's metadata is used to generate concrete GPU code (a kernel) in either assembly or C++ source. The kernels themselves are basic operation building blocks that form the foundation of solving more complex problems and are common to many different applications. For example, kernels that implement Generalized Matrix-Matrix Multiplication (GEMM) could ultimately be used in more complex machine learning or image processing techniques.

Of course we can define any solution meta-data that we want, but our aim is to generate the most efficient, highest performing kernels possible. The benchmarking process helps us find optimal kernels by testing many varieties of parameterized solutions. Components from TensileCreateLibrary are used to generate the initial set of test solutions from the parameterized meta-data. After testing, the most optimal solutions are selected and stored as configuration files. These configuration files are passed back to TensileCreateLibrary to finally become the Master Solution Library.

Benchmarking process

Components from TensileCreateLibrary are used in a few different steps of the benchmarking process. Let's review:

  1. Benchmark config files (.yaml) are created by hand. The problem type and its properties are defined, then interesting solution properties are parameterized over applicable ranges of values. Example: see Tensile/Configs
  2. The Tensile benchmark script is called with a benchmark config as input, e.g. Tensile/bin/Tensile path/to/config.yaml output/dir/
  3. The benchmark config is read, and several 'potential' solutions are generated to test their performance in solving the problem.
  4. Kernels are generated from the solutions and linked into a test solution library.
  5. A client application is generated to load the test solution library with the Tensile API and run performance benchmarks.
  6. Performance results are written to file and analyzed to find the optimal 'winning' kernels for the given problem sizes.
  7. Winning kernel meta-data (solutions) is written to Logic Files (.yaml) along with performance data.
  8. The process is repeated for all benchmark configs.
  9. Optimal solutions are then read from all Logic File outputs.
  10. Kernel libraries are generated from the solutions, and their meta-data is combined into a master solution library (TensileLibrary.dat).

More specific details about the benchmark configs and protocol can be found here.

The final kernel libraries and master solution library can then be loaded by a client using the Tensile API, which maps user-defined problems to the specific kernels optimally suited to get the job done.

Where does TensileCreateLibrary come in? TensileCreateLibrary contains components suitable for processing solution metadata to generate kernel sources, assembling/compiling them, and linking them into code object libraries. It is also suitable for combining and managing solution meta-data and their problem mappings.

TensileCreateLibrary is mainly used to aggregate a given set of solutions into a master solution library and to generate and bundle associated kernels into code object libraries. The outputs are specifically the master solution library (TensileLibrary.dat) and the code object libraries (.co/.hsaco).
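For illustration, an output directory from a TensileCreateLibrary run might contain files like the following (names are illustrative and vary with options and Tensile version):

output/dir/
  TensileLibrary.dat              <-- master solution library (msgpack format)
  TensileLibrary_gfx900.co        <-- linked ASM kernels for gfx900
  Kernels.so-000-gfx900.hsaco     <-- compiled source kernels for gfx900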

Master Solution Library

Now that we have an aggregate of solutions and their kernels, how are solutions selected and kernels executed at run time? To see how this works, we need to know a little more about the structure of the Master Solution Library. It is inefficient to check EVERY solution for suitability, so solutions are organized in a hierarchical structure that drastically reduces search time. Note that this structure is subject to change over time, but as of the time of writing it fits the description below.

Library Hierarchy

Libraries are used to implement hierarchical searches. At each level of the hierarchy, the predicates must be satisfied before moving on to the next level.
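Conceptually, the search is a tree walk in which a failed predicate prunes an entire subtree. The following is a minimal sketch with assumed names, not Tensile's actual classes:

#include <functional>
#include <memory>
#include <vector>

struct Problem { /* fields describing the problem */ };
struct Solution { int index = -1; };

struct Library
{
    std::function<bool(Problem const&)> predicate; // empty means "always true"
    std::vector<std::shared_ptr<Library>>  children;  // inner levels
    std::vector<std::shared_ptr<Solution>> solutions; // leaf candidates

    std::shared_ptr<Solution> findBest(Problem const& p) const
    {
        if(predicate && !predicate(p))
            return nullptr;                    // prune the whole subtree
        for(auto const& child : children)
            if(auto s = child->findBest(p))
                return s;                      // first matching subtree wins
        // A real leaf ranks candidates by benchmarked speed at the nearest
        // size (see "Size and speed" below); this sketch takes the first.
        return solutions.empty() ? nullptr : solutions.front();
    }
};

Because each level's predicate gates all of its descendants, most solutions are rejected wholesale without ever being examined individually.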

Hardware layer

This top level of the hierarchy requires the fewest comparisons, as there is a limited number of supported hardware architectures. At run time, we can only run kernels built for the hardware installed on the host machine. These are classified by 'gfx' architecture values. See here for supported architectures.

Problem Map

The problem map layer uses coarse problem operation classifications. This includes operations like GEMM TT or GEMM TN, which are encoded as identifiers similar to Contraction_l_Alik_Bjlk_Cijk_Dijk (the suffixes give the index assignment of each tensor: A carries indices l,i,k; B carries j,l,k; C and D carry i,j,k; and l is the bound, or summation, index). There are a few more comparisons at this level.

Problem

The problem library layer is targeted around more specific problem properties. These include input types and other properties like high precision accumulate. For example, each GEMM kernel can only target specific input and output types and memory mappings; otherwise it would need to change its implementation drastically.

Size and speed

Next, problem sizes are matched based on minimum Euclidean distance. Benchmarking is not done for every imaginable size, so we must match the closest benchmarked size. Solutions must also pass a final predicate that compares finer details of the problem description (example: CDStridesEqual: true AND KernelLanguageCompatible=ASM, ...). If the predicate fails, the solution is not included in the final selection.

At this point we may have a small pool of kernels that can correctly solve the problem and that have performance data for a problem of similar size. Based on the benchmark data, the kernel with the highest speed is selected to ultimately solve the problem.
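The size match can be sketched as a nearest-neighbor search over the benchmarked sizes. The structure below is assumed, mirroring the Matching table shown in the YAML example in the next section:

#include <array>
#include <limits>
#include <vector>

struct TableRow
{
    std::array<double, 3> key;  // benchmarked {FreeSizeA, FreeSizeB, BoundSize}
    double speed;               // benchmarked speed at that size
    int    solutionIndex;       // index into the master solution list
};

int nearestSolution(std::vector<TableRow> const& table,
                    std::array<double, 3> const& query)
{
    double bestDist = std::numeric_limits<double>::max();
    int    bestIdx  = -1;
    for(auto const& row : table)
    {
        double d2 = 0.0;        // squared distance preserves the ordering
        for(size_t i = 0; i < query.size(); i++)
            d2 += (row.key[i] - query[i]) * (row.key[i] - query[i]);
        if(d2 < bestDist)
        {
            bestDist = d2;
            bestIdx  = row.solutionIndex;
        }
    }
    return bestIdx;  // the final problem predicate is still applied afterwards
}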

IRL

Let's see what a REAL example of a TensileLibrary.yaml file looks like:

---
library:
  rows:
  - library:
      map:
        Contraction_l_Alik_Bjlk_Cijk_Dijk:            <-- Problem Operation 
          rows:
          - library:
              distance: Euclidean                     <-- Distance measure by Euclidean method
              properties:
              - {index: 0, type: FreeSizeA}           <-- Size properties used to measure distance
              - {index: 0, type: FreeSizeB}    
              - {index: 0, type: BoundSize}
              table:
              - key: [1, 1, 1]                        <-- Benchmarked size
                speed: 0.00013586956220247663         <-- Benchmarked speed
                value: {index: 53, type: Single}      <-- Solution is index 53!
              
               ... thousands more benchmark key-pairs

              type: Matching                          <-- Matching table ^^^
            predicate: 
              type: And
              value:
              - type: TypesEqual                      
                value: [Half, Half, Float, Float]        
              - {type: HighPrecisionAccumulate, value: true}
          type: Problem                               <-- Problem Layer
      property: {type: OperationIdentifier}           <-- Problem Map Layer
      type: ProblemMap
    predicate:
      type: AMDGPU                                    <-- Hardware Layer
      value: {type: Processor, value: gfx900}
    type: Hardware
solutions:                                            <-- Begins the master solution list vvv
- debugKernel: false                                      
  hardwarePredicate: {type: TruePred}
  ideals: {}
  index: 0                                            <-- Solution Index
  info: {1LDSBuffer: '0', AggressivePerfMode: '1', ... lots more
  }
  name: Cijk_Alik_Bjlk_HSBH_MT16x16x8_SE_AF0EM2_AMAS3_ASEM2_EPS1_GRVW2_ISA900_K1_KLA_LRVW2_PGR1_PLR1_TT2_2_VW2_WG8_8_1_WGM1
  problemPredicate:                                       <-- Predicate: types must match + HPA + others
    type: And
    value:
    - {index: -1, type: BoundSizeMultiple, value: 2}
    - {index: 0, type: FreeSizeAMultiple, value: 2}
    - {type: OperationIdentifierEqual, value: Contraction_l_Alik_Bjlk_Cijk_Dijk}
    - type: TypesEqual
      value: [Half, Half, Float, Float]
    - {type: HighPrecisionAccumulate, value: true}
    ...
  problemType: {aType: Half, bType: Half, cType: Float, dType: Float, highPrecisionAccumulate: true,  <-- Problem that kernel can solve
    operationIdentifier: Contraction_l_Alik_Bjlk_Cijk_Dijk, useBeta: true, useInitialStridesAB: false,
    useInitialStridesCD: false}
  sizeMapping:                                             <-- Begin kernel properties meta-data 
    depthU: 8
    globalAccumulation: false
    globalSplitU: 1
    macroTile: [16, 16, 1]
  
... Many more properties

- debugKernel: false                                       <-- Next solution, and so on.
  hardwarePredicate: {type: TruePred}
  ideals: {}
  index: 1



... Many more, hundreds of solutions


...EOF

Due to the .yaml indentation and structure, it is a bit difficult to understand the hierarchy just from looking at the file itself; this is why the real example wasn't covered until now. This file is, however, parsed and organized in memory to implement the hierarchical structure previously discussed.

Code Object Libraries

Kernels are compiled into shared object libraries, identified by their file extensions .co/.hsaco. The only difference between them is that .co libraries contain ASM kernels and .hsaco libraries contain SOURCE kernels. Same ABI, same format.

ASM Kernels

Solutions that have the KernelLanguage: Assembly property use the KernelWriterAssembly object to generate the actual kernel code. This object has the very complicated task of translating all of the solution properties, or meta-data, into a sequence of code modules that are eventually rendered into a string of targeted assembly code and saved to file. The ISA property determines the target graphics architecture of the assembly, which is especially important for choosing the correct assembly instruction set. For example, KernelWriterAssembly may dynamically test the assembler for v_mfma instruction support on ISA (9,0,6) and will find alternatives if unsupported.

ASM kernels are assembled into .o object files and finally linked into .co files. The file names are obtained from the Solution's Name property.

Source Kernels

Solutions that have the KernelLanguage: Source property use the KernelWriterSource object to generate the actual kernel code. This object has the similarly complicated task of translating all of the solution properties, or meta-data, into a sequence of code modules that are eventually rendered into a string of C++ code and saved to file as .h and .cpp sources. The ISA property for source kernels is (0,0,0), which means that source kernels are compiled for all architecture targets.

Source kernels are compiled and assembled into .o files, and the code objects are extracted into per-architecture .hsaco files. The file names are obtained from the Solution's Name property and decorated with the architecture target.

Final Code Object Libraries

Depending on TensileCreateLibrary's 'MergeFiles' state, the code object libraries may be linked together into one monolithic library per architecture. This only affects the number of library files in the final result of TensileCreateLibrary. Again, the .co and .hsaco files have essentially the same format; the extensions simply distinguish ASM kernels from SOURCE kernels.

Usage:

$>Tensile/bin/TensileCreateLibrary <Options...> <LogicPath> <OutputPath> <RuntimeLanguage>

[-h] Display usage help

[--cxx-compiler {hcc,hipcc}] Compiler override (default hipcc).

[--code-object-version {V2,V3}] Code object version override (default V3).

[--architecture {all,gfx000,gfx803,gfx900,gfx906,gfx908}] Architecture override (default all).

[--merge-files] / [--no-merge-files] If merge, all kernels are merged to one code object file per architecture (default on).

[--short-file-names] / [--no-short-file-names] Windows-only option for shortening output file names. Currently disabled.

[--library-print-debug] / [--no-library-print-debug] Solutions will print enqueue info when enqueueing a kernel (default off).

[--no-enumerate] Disable enumeration of host graphics architecture.

[--package-library]

[--no-legacy-components] Don't generate solution source code for old client.

[--embed-library EMBEDLIBRARY] Embed (new) library files into static variables. Specify the name of the library.

[--embed-library-key EMBEDLIBRARYKEY] Prefix embedded symbols with EMBEDLIBRARYKEY.

[--new-client-only] Only build libraries for the new client.

[--version VERSION] Embed version into library files.

[--generate-manifest-and-exit] Only create a manifest file of the expected outputs in the output directory, then exit.

[--library-format {yaml,msgpack}] Choose format of output library (default msgpack). Respective file extensions {.yaml,.dat}

LogicPath Path to LibraryLogic .yaml files.

OutputPath Output directory.

{OCL,HIP,HSA} Chosen runtime language.
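For example, a typical invocation using the options documented above might look like this (paths are illustrative):

$>Tensile/bin/TensileCreateLibrary --merge-files --architecture gfx906 --library-format msgpack path/to/logic/files/ output/dir/ HIP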

Tensile API

So we put Library Logic files into TensileCreateLibrary and it spits out a Master Solution Library and Code Object Libraries. Now who uses them?

'Clients' can invoke the Tensile API to run kernels on their data and solve the problems they are interested in. As previously mentioned, clients (Tensile API users) don't need expert domain knowledge of AMDGPUs to run the best possible kernels. All they need is domain knowledge of the problem they are trying to solve, and Tensile takes care of the rest.

The Tensile API provides a C++ entry point to load TensileCreateLibrary outputs, describe problems, and invoke kernels. It also supports both HIP and OpenCL backends to suit the user's workflow.

Loading Master Solution Library

The Tensile API conveniently has a function to read the master solution library into memory:

 #include "Tensile/Tensile.hpp"

 ...

 auto library = LoadLibraryFile<ContractionProblem>(envDir.file("TensileLibrary").native());
 ASSERT_NE(library, nullptr);

Loading Code Object Files

The Tensile API conveniently has wrapper classes to read code object files using either the HIP or OpenCL backend under the hood:

 #include "Tensile/Hip/HipSolutionAdapter.h"
 auto adapter = std::make_shared<hip::SolutionAdapter>(debug, "TENSILE_TEST_LIBRARY");

 // Alternatively, in OpenCL
 // #include "Tensile/ocl/OclSolutionAdapter.h"
 // auto adapter = std::make_shared<ocl::SolutionAdapter>(debug, "TENSILE_TEST_LIBRARY");

 ASSERT_NE(adapter, nullptr);

 for(auto file : envDir.glob("*.co"))
 {
     adapter->loadCodeObjectFile(file.native());
 }

 for(auto file : envDir.glob("*.hsaco"))
 {
     try
     {
         adapter->loadCodeObjectFile(file.native());
     }
     catch(std::logic_error& exc)
     {
         ...
     }
 }

Initialize Hardware

Enumerate the current hardware architecture with HIP or OpenCL.

#include "hip/HipHardware.hpp"
auto hardware = hip::GetCurrentDevice();

// Alternatively, in OpenCL
// #include "ocl/OclHardware.hpp"
// auto hardware = ocl::GetCurrentDevice();

ASSERT_NE(hardware, nullptr);

Defining the Problem

A domain expert in the problem of interest can define its properties using the Tensile API:

#include "Tensile/ContractionProblem.hpp"

auto problem = ContractionProblem::GEMM(...);
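For example, a square 512x512x512 NN GEMM might be described as shown below. The parameter values are illustrative, and the exact GEMM(...) overloads should be checked against Tensile/ContractionProblem.hpp:

auto problem = ContractionProblem::GEMM(false, false,   // transA, transB
                                        512, 512, 512,  // m, n, k
                                        512, 512, 512,  // lda, ldb, ldc
                                        1.0,            // beta
                                        false,          // colMajor
                                        1);             // batchCount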

Initialize Resources and Data

The domain expert would need to allocate and initialize their host and device memory.

using TypedInputs = TypedContractionInputs<AType, BType, CType, DType, AlphaType, BetaType>;

TypedInputs hostData;
TypedInputs deviceInputs;

// Initialize host data and init values
hostData.a = new AType[aSize];
for(size_t i = 0; i < aSize; i++)
{
    hostData.a[i] = ...;
}

// Initialize device allocations and data (note: allocation sizes are in bytes)
HIP_CHECK_EXC(hipMalloc(&deviceInputs.a, aSize * sizeof(AType)));

// Alternatively, in OpenCL
// cl::Buffer devA = {cl::Context::getDefault(), CL_MEM_READ_WRITE, aSize * sizeof(AType)};
// deviceInputs.a = static_cast<AType*>(devA());

...

// Copy data to GPU, if necessary
HIP_CHECK_EXC(
        hipMemcpy(deviceInputs.a, hostData.a, aSize * sizeof(AType), hipMemcpyHostToDevice));

// Alternatively, in OpenCL
// cl::CommandQueue::getDefault().enqueueWriteBuffer(devA, CL_TRUE, 0, aSize * sizeof(AType), hostData.a);

...

Solve the Problem

The domain expert would then use the objects already obtained from the API to solve the problem and launch the kernels. library->findBestSolution evaluates the Master Solution Library predicate tree to find the optimal solution for the problem. The solution then provides the kernel meta-data as KernelInvocation objects, each of which represents a single kernel launch. The adapter then finds the kernel code inside the loaded HIP module via hipModuleGetFunction(). Finally, the kernel is invoked by the adapter via hipExtModuleLaunchKernel().

// Use the master solution library to find the best solution to the problem
auto solution = library->findBestSolution(problem, *hardware);

// Build the input parameters to pass to the kernel
std::vector<KernelInvocation> result = solution->solve(problem, deviceInputs, *hardware);

// Launch the kernels
adapter->launchKernels(result);

// Copy the data back to host from GPU
HIP_CHECK_EXC(
        hipMemcpy(hostData.a, deviceInputs.a, aSize * sizeof(AType), hipMemcpyDeviceToHost));

// Alternatively, in OpenCL
// cl::CommandQueue::getDefault().enqueueReadBuffer(devA, CL_TRUE, 0, aSize, hostData.a);

...
// Release memory
delete[] hostData.a;
hipFree(deviceInputs.a);

// Alternatively, in OpenCL the C++ RAII destructors will clean up
...    

Conclusion

This page has, in a nutshell, described the context surrounding TensileCreateLibrary: its role in benchmarking, generating kernel libraries and solution libraries, and how they are consumed by a client application using the Tensile API. Code snippets in this document should be treated as pseudo code only, as the API may change or evolve; they nevertheless serve as an overview of potential usage. Specific examples of this process can be seen in more complex projects such as HostLibraryTests/RunGEMMKernel_test.cpp, among others.