[Compiler] Native code generation #5440
To get some understanding of the optimization possibilities, I've experimented a bit with a small subgraph from lpcnet. (Figure: subgraph from the original network I targeted.) I rewrote this subgraph using primitive TensorFlow functions and evaluated its performance with ONERT.
Performance results are as follows (100,000 runs, 100 warm-up runs):
The optimized model has three clusters of fused layers. It is possible to make only two clusters, but there are not enough registers to hold all the needed data, so the three-cluster implementation performed better (a sketch of what such a fused cluster looks like is included below). I think these results look promising, though this is just a small part of the network; bigger layers, like FC, probably cannot offer such an improvement. Related files:
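(To illustrate what a "cluster of fused layers" means here — this is only my own minimal hand-written sketch, not one of the files referenced above. The operations and sizes are assumptions: a chain of small elementwise ops such as Mul, Add and Tanh collapsed into a single loop so that intermediates stay in registers, which is also why register pressure limits how many ops fit in one cluster.)

```cpp
#include <cmath>
#include <cstddef>

// Unfused version: each small op is a separate pass over memory,
// paying loop/dispatch overhead and writing intermediates to RAM.
void unfused(const float* a, const float* b, const float* c,
             float* tmp1, float* tmp2, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) tmp1[i] = a[i] * b[i];        // Mul
  for (std::size_t i = 0; i < n; ++i) tmp2[i] = tmp1[i] + c[i];     // Add
  for (std::size_t i = 0; i < n; ++i) out[i]  = std::tanh(tmp2[i]); // Tanh
}

// Fused "cluster": one loop, intermediates live in registers.
// With more ops per cluster, register pressure grows, which is why
// splitting into several clusters can end up faster.
void fused(const float* a, const float* b, const float* c,
           float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    float t = a[i] * b[i] + c[i];
    out[i] = std::tanh(t);
  }
}
```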
I tried to generate code using the Tiramisu infrastructure. The generated code took ~2.5 microseconds (this is less than onert's 3 microseconds, but significantly longer than the handmade version's 1.7 microseconds). To understand why Tiramisu is slower, I wrote a new version of the manual computation. (I also noticed an error in the previous version and fixed it; it did not affect performance though.) It turned out that this new version is also faster than the previous one and achieves 1.62 microseconds. Next I want to:
@binarman FYI, here is a survey of NN compilers: arXiv:2002.03794
@chunseoklee I've already tried TVM. Its performance on x86 is the best according to this article, but for some reason it was poor on arm64 (I tried several convolutional networks and arithmetic networks, like in the experiments above).
I've tried to recreate the optimized function using Halide and got a ~1.72 microsecond result. Related files:
Next I want to compare onert and the Halide BLAS implementation: https://github.com/halide/Halide/tree/master/apps/linear_algebra/src
I think I'll take the matrix sizes from Inception, MobileNet and lpcnet as a "target".
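For reference, a fused elementwise cluster like the one above can be expressed in Halide roughly as follows. This is a minimal sketch with assumed 1-D shapes, an assumed Mul/Add/Tanh chain and a trivial schedule, not the code actually used in the experiment:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // Assumed inputs: three 1-D float tensors of equal length (illustrative only).
  ImageParam a(Float(32), 1, "a"), b(Float(32), 1, "b"), c(Float(32), 1, "c");

  Var i("i");
  Func mul("mul"), addc("addc"), act("act");
  mul(i)  = a(i) * b(i);
  addc(i) = mul(i) + c(i);
  act(i)  = tanh(addc(i));

  // Trivial schedule: mul/addc stay inlined into act (Halide's default),
  // so the whole cluster becomes one vectorized loop with no intermediate buffers.
  act.vectorize(i, 8);

  // Ahead-of-time compile to a static library + header that runtime code can link against.
  act.compile_to_static_library("fused_cluster", {a, b, c}, "fused_cluster");
  return 0;
}
```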
I tested FC layers of different sizes with a simple Halide schedule (a sketch of such a schedule is included at the end of this comment):
(All timings are in microseconds, single thread.)
Small FC layers gain a lot of speedup, large ones do not; I suppose this is because the algorithm runs into memory limits or something similar. For now I want to focus on single-threaded computation, since that is probably what we need for low-end devices. Next I want to prototype compilation in the ONE compiler and investigate whether this approach can speed up lpcnet and other recurrent networks.
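For concreteness, here is a minimal sketch of the kind of "simple Halide schedule" for an FC layer mentioned at the top of this comment. The shapes, data layout and scheduling choices are assumptions for illustration, not the exact code that produced the numbers:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
  // FC layer as a matrix-vector product: out(j) = bias(j) + sum_k W(k, j) * in(k).
  // Sizes come from the buffers at run time; nothing here is lpcnet-specific.
  ImageParam weights(Float(32), 2, "weights");  // indexed as (k, j)
  ImageParam input(Float(32), 1, "input");      // indexed as (k)
  ImageParam bias(Float(32), 1, "bias");        // indexed as (j)

  Var j("j");
  RDom k(0, input.dim(0).extent(), "k");

  Func fc("fc");
  fc(j) = bias(j);                    // initialize with bias
  fc(j) += weights(k, j) * input(k);  // reduce over the input dimension

  // "Simple" single-threaded schedule: just vectorize across output neurons,
  // no tiling, no parallelism.
  fc.vectorize(j, 8);
  fc.update().vectorize(j, 8);

  fc.compile_to_static_library("fc_layer", {weights, input, bias}, "fc_layer");
  return 0;
}
```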
@binarman Could you please let me know which target device you tested on?
@glistening Note: I tweaked onert to run the model 1000 times for every time measurement, because the default scheme works in the millisecond range.
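A generic sketch of this repeat-and-average measurement (run_model is a placeholder for a single inference call, not a real onert API):

```cpp
#include <chrono>

// The per-inference time here is ~1-3 microseconds, far below the granularity
// of a millisecond-oriented benchmark harness, so each measurement runs the
// model many times and divides the elapsed time by the repeat count.
template <typename F>
double microseconds_per_run(F run_model, int repeats = 1000) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) run_model();  // run_model(): one inference
  auto end = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::micro> total = end - start;
  return total.count() / repeats;
}

// Usage: double us = microseconds_per_run([] { /* one inference call */ });
```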
@binarman Q2) What is the batch size (of the FC)?
During the experiments I tried to outperform existing kernels and found that it is relatively simple if the input data is small or has some uncommon properties (like one dimension being large while the other is small). By the way, I do not remember if I mentioned it somewhere, so here is my current idea (in POC #5836):
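(The POC #5836 idea itself is not reproduced here. Purely to illustrate the point above about uncommon shapes, below is my own sketch of a shape-specialized kernel: when the compiler knows at compile time that one dimension is small and fixed, it can emit a kernel whose accumulators fit in registers and whose inner loop fully unrolls, something a generic kernel cannot assume. Names and layout are illustrative.)

```cpp
#include <cstddef>

// Shape-specialized FC kernel: a "wide but short" layer with a small,
// compile-time-known number of outputs. The accumulators live in registers
// and the inner loop over outputs fully unrolls, which is how a specialized
// kernel can beat a generic BLAS/NN kernel on such uncommon shapes.
template <int NumOutputs>
void fc_specialized(const float* weights,  // [NumOutputs][in_size], row-major
                    const float* input,    // [in_size]
                    const float* bias,     // [NumOutputs]
                    float* output,
                    std::size_t in_size) {
  float acc[NumOutputs];
  for (int j = 0; j < NumOutputs; ++j) acc[j] = bias[j];
  for (std::size_t k = 0; k < in_size; ++k) {
    const float x = input[k];
    for (int j = 0; j < NumOutputs; ++j)   // fully unrollable: NumOutputs is a constant
      acc[j] += weights[j * in_size + k] * x;
  }
  for (int j = 0; j < NumOutputs; ++j) output[j] = acc[j];
}
```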
@binarman Have you tried building Halide or Tiramisu for Linux (Odroid XU4, Ubuntu)? And could you share the code you used to test the FC layer?
@wateret To build it, you should build Halide first (you can use
@binarman Thank you for sharing!
Does it mean directly using this function?
@wateret
Let's investigate code generation techniques in the neural network world.
Some NNs have subgraphs with lots of relatively small operators, which introduce a lot of runtime overhead.
There are also cases where operators can be simplified and fused, which can lead to performance improvements.
Some work in this direction has already been done in other projects. For example:
We can probably make use of these technologies to improve inference performance.