
[Compiler] Native code generation #5440

Open

binarman opened this issue Dec 21, 2020 · 15 comments

binarman commented Dec 21, 2020

Let's investigate code generation techniques in the neural network world.

Some NNs have subgraphs with lots of relatively small operators, which introduce a lot of runtime overhead.
There are also cases where some operators can be simplified and fused, which can lead to performance improvements.

Some work in this direction has already been done in other projects, for example:

Probably we can make use of these technologies to improve inference performance.

@binarman binarman self-assigned this Dec 21, 2020
@binarman (Contributor Author)

To get some understanding of the optimization possibilities, I've experimented a bit with a small subgraph from lpcnet.
The model has a GRU layer that is transformed into a set of small operators.
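
For reference, the standard GRU cell update (the exact lpcnet formulation may differ slightly) shows why a single GRU node lowers into several matrix-vector products plus a chain of small elementwise operators:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Each matrix-vector product becomes a FullyConnected-like op, and each sigmoid, tanh, add and elementwise multiply becomes a separate small operator in the lowered graph.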

The subgraph from the original network that I targeted:
[image: subgraph]

I rewrote this subgraph using primitive TensorFlow functions and evaluated its performance with ONERT.
Then I reimplemented this graph in C++ in three variants:

  1. naive implementation (just for reference)
  2. naive implementation, but with kernels from ONERT (I assume no loop fusion was applied by the compiler, though I did not inspect it carefully)
  3. optimized implementation: manually fused and unrolled loops (see the sketch after this list)
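
For illustration only, here is a minimal sketch (not the code from gru_experiment.zip) of the difference between variants 1-2 and variant 3 on a made-up Add -> Tanh -> Mul chain: separate per-operator loops make one pass over memory per operator, while the fused version does a single pass with the body unrolled by 4.

```cpp
#include <cmath>
#include <cstddef>

// Variants 1/2 style: each operator is its own loop (kernel),
// writing an intermediate buffer between operators.
void chain_naive(const float* a, const float* b, float* tmp, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] + b[i];        // Add
  for (std::size_t i = 0; i < n; ++i) tmp[i] = std::tanh(tmp[i]);  // Tanh
  for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] * b[i];      // Mul
}

// Variant 3 style: one fused loop, no intermediate buffer, unrolled by 4.
void chain_fused(const float* a, const float* b, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    out[i + 0] = std::tanh(a[i + 0] + b[i + 0]) * b[i + 0];
    out[i + 1] = std::tanh(a[i + 1] + b[i + 1]) * b[i + 1];
    out[i + 2] = std::tanh(a[i + 2] + b[i + 2]) * b[i + 2];
    out[i + 3] = std::tanh(a[i + 3] + b[i + 3]) * b[i + 3];
  }
  for (; i < n; ++i)  // remainder
    out[i] = std::tanh(a[i] + b[i]) * b[i];
}
```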

Performance results are as follows (100,000 runs, 100 warm-up runs):

  1. ONERT: ~3.05 microseconds (I used a modified nnpackage_run to get rid of infrastructure overhead and get better precision)
  2. C++ with ONERT kernels: ~2.15 microseconds
  3. C++ optimized: ~1.7 microseconds

The optimized model has three clusters of fused layers. It is possible to make only two clusters, but there are not enough registers to hold all the needed data, so the three-cluster implementation performed better:
[image: schema_fuse2]

I think these results look promising, though this is just a small part of the network. Bigger layers, like FC, probably cannot offer such an improvement.

Related files:
gru_experiment.zip


binarman commented Dec 25, 2020

I tried to generate code using the Tiramisu infrastructure:
tiramisu_experiemnts.zip
Due to its technical limitations, I broke the graph into separate loops:
[image: subgraph]

The generated code took ~2.5 microseconds (less than ONERT at ~3 microseconds, but significantly longer than the handmade version at ~1.7 microseconds).

To understand why Tiramisu is slower, I wrote a new version of the manual computation. (I also noticed an error in the previous version and fixed it; it did not affect performance though.)
The source code for this version is in manual_tiramisu_edition.cpp in gru_experiment.zip.

It turned out that this version is also faster than the previous one, achieving ~1.62 microseconds.

Next I want to:

  1. Compare the Tiramisu-generated code and the manual implementation to understand why the performance differs
  2. Try Halide IR instead of Tiramisu (Tiramisu uses Halide itself, so I want to skip one layer of abstraction)
  3. Experiment with matrix multiplication (if we can speed up this operation, we can probably speed up a lot of networks)

@chunseoklee (Contributor)

@binarman FYI, here is a survey of NN compilers: arXiv:2002.03794

@binarman (Contributor Author)

@chunseoklee
Thank you!

I've already tried TVM. Its performance on x86 is the best according to this article, but for some reason it was poor on arm64 (I tried several convolution networks and arithmetic networks, like in the experiments above).
Maybe I missed something and should redo the investigation; I did not use the auto-tuning features.

@binarman (Contributor Author)

I've tried to recreate the optimized function using Halide and got a ~1.72 microsecond result.
I think the additional time is introduced by safety checks that Halide performs and I do not, so we can assume Halide's performance is essentially the same as the manual code.
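
One way to test that hypothesis (my assumption, not something from halide_arithmetics.zip) would be to rebuild the same pipeline with Halide's NoAsserts/NoBoundsQuery target features, which strip those runtime checks; roughly:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // Stand-in elementwise pipeline; the real one is in halide_arithmetics.zip.
  ImageParam a(Float(32), 1, "a"), b(Float(32), 1, "b");
  Var i("i");
  Func f("f");
  f(i) = tanh(a(i) + b(i)) * b(i);
  f.vectorize(i, 4, TailStrategy::GuardWithIf);

  // Cross-compile target (assumed): Android arm64, with runtime checks removed.
  Target t("arm-64-android");
  t.set_feature(Target::NoAsserts);
  t.set_feature(Target::NoBoundsQuery);
  f.compile_to_static_library("f_noasserts", {a, b}, "f_noasserts", t);
  return 0;
}
```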

Related files:
halide_arithmetics.zip

Next I want to compare the ONERT and Halide BLAS implementations: https://github.com/halide/Halide/tree/master/apps/linear_algebra/src

I think I'll take the matrix sizes from Inception, mobilenet and lpcnet as a "target".


binarman commented Jan 19, 2021

I tested FC layers of different sizes with a simple Halide schedule:

for output_dim:  <- unroll and vectorize this loop 4 times
  result[output_dim] = bias[output_dim];
  for input_dim:
    result[output_dim] += weights[output_dim][input_dim] * input[input_dim];
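
In Halide terms, my sketch of that schedule (illustrative, not necessarily the exact generator used for the numbers below) would be roughly:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // result[o] = bias[o] + sum_i weights[o][i] * input[i]
  ImageParam weights(Float(32), 2, "weights");  // innermost dim first: (input_dim, output_dim)
  ImageParam input(Float(32), 1, "input");
  ImageParam bias(Float(32), 1, "bias");

  Var o("o");
  RDom r(0, input.dim(0).extent(), "r");  // reduction over input_dim

  Func fc("fc");
  fc(o) = bias(o);                    // init with bias
  fc(o) += weights(r, o) * input(r);  // accumulate over input_dim

  // "unroll and vectorize the output loop 4 times":
  // vectorize(o, 4) splits o and vectorizes the inner lanes,
  // unroll(o, 4) then unrolls the remaining outer loop by 4.
  fc.vectorize(o, 4).unroll(o, 4);
  fc.update(0).vectorize(o, 4).unroll(o, 4);

  // AOT-compile; the target string is an assumption (Galaxy S20 / arm64 Android).
  fc.compile_to_static_library("fc", {weights, input, bias}, "fc", Target("arm-64-android"));
  return 0;
}
```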

All timings are in microseconds, single thread:

| weight size | onert | halide |
| ----------- | ----- | ------ |
| 1001x1024   | 254   | 254    |
| 1001x1280   | 325   | 325    |
| 1001x2048   | 530   | 540    |
| 1152x384    | 96    | 67     |
| 1152x512    | 130   | 92     |
| 48x16       | 0.5   | 0.34   |
| 48x512      | 5.4   | 3.67   |
| 4x4000      | 3.8   | 2.43   |

Small FC layers gain a lot of speedup, while large ones do not; I suppose this is because the algorithm runs into memory bandwidth limits or something similar. For now I want to focus on single-threaded computation, since that is probably what we need for low-end devices.

Next I want to prototype compilation in the ONE compiler and investigate whether this approach can speed up lpcnet and other recurrent networks.

@glistening (Contributor)

> I tested FC layers of different sizes with a simple Halide schedule:

@binarman Could you please let me know what target device you tested on?

@binarman (Contributor Author)

@glistening
Sorry, I forgot to mention it.
All measurements were done on a Galaxy S20 (Exynos) on the fastest core.

Note: I tweaked ONERT to run the model 1000 times for every time measurement, because the default scheme works in the millisecond range.


chunseoklee commented Feb 4, 2021

> I tested FC layers of different sizes with a simple Halide schedule: […]

@binarman
Q1) Was this experiment tested with an f32 model?

Q2) What is the batch size (of the FC)?


binarman commented Feb 5, 2021

@chunseoklee

  1. Yes, these were f32 models.
  2. Just 1; I tested operations from the relevant models, like lpcnet.

During the experiments I tried to outperform existing kernels and found that it is relatively simple if the input data is small or has some uncommon properties (like one dimension being large while the other is small).
I did not test quantized operators, but I think the results will be similar.

By the way, I do not remember if I mentioned it somewhere, so here is my current idea (in POC #5836):

  • The compiled model is actually an ordinary circle/tflite graph, but with some parts replaced by custom operators.
  • Implementations of these operators are emitted as static or dynamic libraries by the codegen.
  • The execution engine is either luci-interpreter (in the case of the embedded edition it should be linked with the generated static libraries) or ONERT (I think its main use case is a dynamic library distributed with the nn package).
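
To make that concrete, here is a purely hypothetical sketch (invented names, not the actual interface from POC #5836) of what a codegen-emitted library could export:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical generated kernel: one exported C symbol per compiled (fused)
// subgraph that replaces a custom-operator node in the circle/tflite graph.
// The runtime (luci-interpreter via static linking, or ONERT via dlopen/dlsym
// on the dynamic library shipped inside the nn package) would resolve the
// symbol by the custom operator's name and call it with the bound tensors.
extern "C" void compiled_subgraph_0(const float* const* inputs,
                                    float* const* outputs,
                                    std::size_t n) {
  // Stand-in body: a fused elementwise chain like in the GRU experiment above.
  const float* a = inputs[0];
  const float* b = inputs[1];
  float* out = outputs[0];
  for (std::size_t i = 0; i < n; ++i)
    out[i] = std::tanh(a[i] + b[i]) * b[i];
}
```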


wateret commented Feb 8, 2021

@binarman Have you tried building Halide or Tiramisu for Linux (Odroid XU4 Ubuntu)? And could you share the code you used to test the FC layer?


binarman commented Feb 8, 2021

@wateret
Halide and Tiramisu (with a little hacking) can emit code for the ARM architecture. I tested it on a Galaxy S20 with Exynos.
The actual generator I used: https://github.com/binarman/Halide/blob/arm_experiments/apps/linear_algebra/src/blas_l2_generators.cpp#L271 (as you can see it is pretty simple; the actual kernel is described by 3 lines, 291-293)

To build, you should build Halide first (you can use the <repo>/build/makeall.sh script).
Then use the <repo>/apps/makeall.sh script to build the generators and the benchmark.
The built benchmark app is located in <repo>/apps/build/linear_algebra/benchmarks.
Check the contents of the mentioned scripts; you need to change some paths (for example, for LLVM).


wateret commented Feb 8, 2021

@binarman Thank you for sharing!


wateret commented Feb 8, 2021

> Tiramisu (with a little hacking)

Does it mean directly using this function fct->gen_halide_obj(obj_filename, os, arch, bits) rather than tiramisu::codegen in your tiramisu_experiment.zip?

@binarman (Contributor Author)

@wateret
If I recall correctly, the tiramisu::codegen method does not provide cross-build functionality.
I wanted to run the code on an Android phone (and did not have any arm64 board at the time), so I used the available cross-compile method, PC -> ARM.
