
[Compiler] Native code generation #5440

Open

binarman opened this issue Dec 21, 2020 · 15 comments

binarman commented Dec 21, 2020

Let's investigate code generation techniques in the neural network world.

Some NNs have subgraphs with lots of relatively small operators, which introduce a lot of runtime overhead.
There are also cases where some operators can be simplified and fused, which can lead to performance improvements.

Some work in this direction has already been done in other projects, for example:

Probably we can make use of these technologies to improve inference performance.

@binarman binarman self-assigned this Dec 21, 2020
@binarman (Contributor Author)

To get some understanding of the optimization possibilities, I've experimented a bit with a small subgraph from lpcnet.
The model has a GRU layer that is transformed into a set of small operators.
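
For reference, the standard GRU cell update (the exact lpcnet formulation may differ slightly) shows why a single GRU node lowers into several matrix-vector products plus a chain of small elementwise operators:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Each matrix-vector product becomes a FullyConnected-like op, and each sigmoid, tanh, add and elementwise multiply becomes a separate small operator in the lowered graph.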

The subgraph from the original network that I targeted:
[image: subgraph]

I rewrote this subgraph using primitive TensorFlow functions and evaluated its performance with ONERT.
Then I reimplemented this graph in C++ in three variants:

  1. naive implementation (just for reference)
  2. naive implementation, but with kernels from ONERT (I assume no loop fusion was applied by the compiler, though I did not inspect it carefully)
  3. optimized implementation: manually fused and unrolled loops (see the sketch after this list)
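
For illustration only, here is a minimal sketch (not the code from gru_experiment.zip) of the difference between variants 1-2 and variant 3 on a made-up Add -> Tanh -> Mul chain: separate per-operator loops make one pass over memory per operator, while the fused version does a single pass with the body unrolled by 4.

```cpp
#include <cmath>
#include <cstddef>

// Variants 1/2 style: each operator is its own loop (kernel),
// writing an intermediate buffer between operators.
void chain_naive(const float* a, const float* b, float* tmp, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] + b[i];        // Add
  for (std::size_t i = 0; i < n; ++i) tmp[i] = std::tanh(tmp[i]);  // Tanh
  for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] * b[i];      // Mul
}

// Variant 3 style: one fused loop, no intermediate buffer, unrolled by 4.
void chain_fused(const float* a, const float* b, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    out[i + 0] = std::tanh(a[i + 0] + b[i + 0]) * b[i + 0];
    out[i + 1] = std::tanh(a[i + 1] + b[i + 1]) * b[i + 1];
    out[i + 2] = std::tanh(a[i + 2] + b[i + 2]) * b[i + 2];
    out[i + 3] = std::tanh(a[i + 3] + b[i + 3]) * b[i + 3];
  }
  for (; i < n; ++i)  // remainder
    out[i] = std::tanh(a[i] + b[i]) * b[i];
}
```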

Performance results are as follows (100,000 runs, 100 warm-up runs):

  1. ONERT: ~3.05 microseconds (I used a modified nnpackage_run to get rid of infrastructure overhead and get better precision)
  2. C++ with ONERT kernels: ~2.15 microseconds
  3. C++ optimized: ~1.7 microseconds

The optimized model has three clusters of fused layers. It is possible to make only two clusters, but there are not enough registers to hold all the needed data, so the three-cluster implementation performed better:
[image: schema_fuse2]

I think these results look promising, though this is just a small part of the network. Bigger layers, like FC, probably cannot offer such an improvement.

Related files:
gru_experiment.zip


binarman commented Dec 25, 2020

I tried to generate code using the Tiramisu infrastructure:
tiramisu_experiemnts.zip
Due to its technical limitations, I broke the graph into separate loops:
[image: subgraph]

The generated code took ~2.5 microseconds (less than ONERT at ~3 microseconds, but significantly longer than the handmade version at ~1.7 microseconds).

To understand why Tiramisu is slower, I wrote a new version of the manual computation. (I also noticed an error in the previous version and fixed it; it did not affect performance though.)
The source code for this version is in manual_tiramisu_edition.cpp in gru_experiment.zip.

It turned out that this version is also faster than the previous one, achieving ~1.62 microseconds.

Next I want to:

  1. Compare the Tiramisu-generated code and the manual implementation to understand why the performance differs
  2. Try Halide IR instead of Tiramisu (Tiramisu uses Halide itself, so I want to skip one layer of abstraction)
  3. Experiment with matrix multiplication (if we can speed up this operation, we can probably speed up a lot of networks)

@chunseoklee (Contributor)

@binarman FYI, here is a survey of NN compilers: arXiv:2002.03794

@binarman (Contributor Author)

@chunseoklee
Thank you!

I've already tried TVM. Its performance on x86 is the best according to this article, but for some reason it was poor on arm64 (I tried several convolution networks and arithmetic networks, like in the experiments above).
Maybe I missed something and should redo the investigation; I did not use the auto-tuning features.

@binarman (Contributor Author)

I've tried to recreate the optimized function using Halide and got a ~1.72 microsecond result.
I think the additional time is introduced by safety checks that Halide performs and I do not, so we can assume Halide's performance is essentially the same as the manual code.
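
One way to test that hypothesis (my assumption, not something from halide_arithmetics.zip) would be to rebuild the same pipeline with Halide's NoAsserts/NoBoundsQuery target features, which strip those runtime checks; roughly:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // Stand-in elementwise pipeline; the real one is in halide_arithmetics.zip.
  ImageParam a(Float(32), 1, "a"), b(Float(32), 1, "b");
  Var i("i");
  Func f("f");
  f(i) = tanh(a(i) + b(i)) * b(i);
  f.vectorize(i, 4, TailStrategy::GuardWithIf);

  // Cross-compile target (assumed): Android arm64, with runtime checks removed.
  Target t("arm-64-android");
  t.set_feature(Target::NoAsserts);
  t.set_feature(Target::NoBoundsQuery);
  f.compile_to_static_library("f_noasserts", {a, b}, "f_noasserts", t);
  return 0;
}
```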

Related files:
halide_arithmetics.zip

Next I want to compare the ONERT and Halide BLAS implementations: https://github.com/halide/Halide/tree/master/apps/linear_algebra/src

I think I'll take the matrix sizes from Inception, mobilenet and lpcnet as a "target".


binarman commented Jan 19, 2021

I tested FC layers of different sizes with a simple Halide schedule:

for output_dim:  <- unroll and vectorize this loop 4 times
  result[output_dim] = bias[output_dim];
  for input_dim:
    result[output_dim] += weights[output_dim][input_dim] * input[input_dim];
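
In Halide terms, my sketch of that schedule (illustrative, not necessarily the exact generator used for the numbers below) would be roughly:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // result[o] = bias[o] + sum_i weights[o][i] * input[i]
  ImageParam weights(Float(32), 2, "weights");  // innermost dim first: (input_dim, output_dim)
  ImageParam input(Float(32), 1, "input");
  ImageParam bias(Float(32), 1, "bias");

  Var o("o");
  RDom r(0, input.dim(0).extent(), "r");  // reduction over input_dim

  Func fc("fc");
  fc(o) = bias(o);                    // init with bias
  fc(o) += weights(r, o) * input(r);  // accumulate over input_dim

  // "unroll and vectorize the output loop 4 times":
  // vectorize(o, 4) splits o and vectorizes the inner lanes,
  // unroll(o, 4) then unrolls the remaining outer loop by 4.
  fc.vectorize(o, 4).unroll(o, 4);
  fc.update(0).vectorize(o, 4).unroll(o, 4);

  // AOT-compile; the target string is an assumption (Galaxy S20 / arm64 Android).
  fc.compile_to_static_library("fc", {weights, input, bias}, "fc", Target("arm-64-android"));
  return 0;
}
```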

All timings are in microseconds, single thread:

| weight size | onert | halide |
| ----------- | ----- | ------ |
| 1001x1024   | 254   | 254    |
| 1001x1280   | 325   | 325    |
| 1001x2048   | 530   | 540    |
| 1152x384    | 96    | 67     |
| 1152x512    | 130   | 92     |
| 48x16       | 0.5   | 0.34   |
| 48x512      | 5.4   | 3.67   |
| 4x4000      | 3.8   | 2.43   |

Small FC layers gain a lot of speedup, while large ones do not; I suppose this is because the algorithm runs into memory bandwidth limits or something similar. For now I want to focus on single-threaded computation, since that is probably what we need for low-end devices.

Next I want to prototype compilation in the ONE compiler and investigate whether this approach can speed up lpcnet and other recurrent networks.

@glistening (Contributor)

> I tested FC layers of different sizes with a simple Halide schedule:

@binarman Could you please let me know what target device you tested on?

@binarman (Contributor Author)

@glistening
Sorry, I forgot to mention it.
All measurements were done on a Galaxy S20 (Exynos) on the fastest core.

Note: I tweaked ONERT to run the model 1000 times for every time measurement, because the default scheme works in the millisecond range.


chunseoklee commented Feb 4, 2021

> I tested FC layers of different sizes with a simple Halide schedule: […]

@binarman
Q1) Was this experiment tested with an f32 model?

Q2) What is the batch size (of the FC)?


binarman commented Feb 5, 2021

@chunseoklee

  1. Yes, these were f32 models.
  2. Just 1; I tested operations from the relevant models, like lpcnet.

During the experiments I tried to outperform existing kernels and found that it is relatively simple if the input data is small or has some uncommon properties (like one dimension being large while the other is small).
I did not test quantized operators, but I think the results will be similar.

By the way, I do not remember if I mentioned it somewhere, so here is my current idea (in POC #5836):

  • The compiled model is actually an ordinary circle/tflite graph, but with some parts replaced by custom operators.
  • Implementations of these operators are emitted as static or dynamic libraries by the codegen.
  • The execution engine is either luci-interpreter (in the case of the embedded edition it should be linked with the generated static libraries) or ONERT (I think its main use case is a dynamic library distributed with the nn package).
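
To make that concrete, here is a purely hypothetical sketch (invented names, not the actual interface from POC #5836) of what a codegen-emitted library could export:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical generated kernel: one exported C symbol per compiled (fused)
// subgraph that replaces a custom-operator node in the circle/tflite graph.
// The runtime (luci-interpreter via static linking, or ONERT via dlopen/dlsym
// on the dynamic library shipped inside the nn package) would resolve the
// symbol by the custom operator's name and call it with the bound tensors.
extern "C" void compiled_subgraph_0(const float* const* inputs,
                                    float* const* outputs,
                                    std::size_t n) {
  // Stand-in body: a fused elementwise chain like in the GRU experiment above.
  const float* a = inputs[0];
  const float* b = inputs[1];
  float* out = outputs[0];
  for (std::size_t i = 0; i < n; ++i)
    out[i] = std::tanh(a[i] + b[i]) * b[i];
}
```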


wateret commented Feb 8, 2021

@binarman Have you tried building Halide or Tiramisu for Linux (Odroid XU4 Ubuntu)? And could you share the code you used to test the FC layer?


binarman commented Feb 8, 2021

@wateret
Halide and Tiramisu (with a little hacking) can emit code for the ARM architecture. I tested it on a Galaxy S20 with Exynos.
The actual generator I used: https://github.com/binarman/Halide/blob/arm_experiments/apps/linear_algebra/src/blas_l2_generators.cpp#L271 (as you can see it is pretty simple; the actual kernel is described by 3 lines, 291-293)

To build, you should build Halide first (you can use the <repo>/build/makeall.sh script).
Then use the <repo>/apps/makeall.sh script to build the generators and the benchmark.
The built benchmark app is located in <repo>/apps/build/linear_algebra/benchmarks.
Check the contents of the mentioned scripts; you need to change some paths (for example, for LLVM).


wateret commented Feb 8, 2021

@binarman Thank you for sharing!


wateret commented Feb 8, 2021

> Tiramisu (with a little hacking)

Does it mean directly using this function fct->gen_halide_obj(obj_filename, os, arch, bits) rather than tiramisu::codegen in your tiramisu_experiment.zip?

@binarman (Contributor Author)

@wateret
If I recall correctly, the tiramisu::codegen method does not provide cross-build functionality.
I wanted to run the code on an Android phone (and did not have any arm64 board at the time), so I used the available cross-compile method, PC -> ARM.
