
SwiftFormer meets Android #14

Open
escorciav opened this issue Jan 9, 2024 · 11 comments

@escorciav

As mentioned in #13, I forked the project to bring SwiftFormer to Android (on Qualcomm hardware).

As of today, the performance of a single block is not encouraging: just under 2.2 msec, measured on the S23 Ultra (S8G2) with QNN 2.16. Details here.

@escorciav
Author

Update: the results were so discouraging that I had to benchmark SwiftFormer_L1 (as in the paper?). The results on the S23 Ultra (S8G2) with QNN 2.16 are worse than on the iPhone, but perhaps decent: under 2.7 msec.

@Amshaker
Owner

Amshaker commented Jan 9, 2024

Thank you for the update.

Could you kindly benchmark MobileViT (or MobileViT2×1) in addition to EfficientFormer_L1? I understand that we may not achieve exactly the same performance on the S23 Ultra as observed on the iPhone 14 Pro Max due to differences in hardware.

Please note that EfficientFormer_L1 has demonstrated speed comparable to SwiftFormer_L1 on the iPhone 14 Pro Max. If EfficientFormer_L1 also runs at around 2.63 msec on the S23 Ultra, it suggests that the ANE of the iPhone 14 Pro Max is simply faster than the GPU or NPU of the S23 Ultra. If, instead, EfficientFormer_L1 significantly outperforms SwiftFormer_L1, it may indicate that the activations, normalization, and certain layers of SwiftFormer_L1 are not well suited to the S23 Ultra, meaning SwiftFormer would require additional optimization to reach its best performance on this hardware.
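
If it helps, the baselines could be pulled from timm and pushed through the same ONNX export path used for SwiftFormer; a rough sketch (the timm model names and export options are assumptions about what the installed timm version registers, not a prescribed recipe):

```python
# baseline_export.py -- sketch for exporting candidate baselines to ONNX so they
# can go through the same Android/QNN pipeline. Model names and export options
# are illustrative; adjust them to whatever your timm version actually provides.
import timm
import torch

BASELINES = ["efficientformer_l1", "mobilevit_s", "mobilevitv2_100"]

for name in BASELINES:
    model = timm.create_model(name, pretrained=False).eval()
    dummy = torch.randn(1, 3, 224, 224)  # ImageNet-style input
    torch.onnx.export(
        model,
        dummy,
        f"{name}.onnx",
        input_names=["input"],
        output_names=["logits"],
        opset_version=13,
    )
    print(f"Exported {name}.onnx")
```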

I would appreciate your thoughts on this proposed plan.

Thank you.

@escorciav
Author

escorciav commented Jan 9, 2024

Agreed. SwiftFormer_L1 (PyTorch implementation) + QNN 2.16 (+ my way of porting) may be leaving room for optimization.

😉 I will leave it to someone else as:

  1. I'm kinda happy with the runtime,
  2. I'm not interested in the architectures mentioned above atm 😆
  3. Qualcomm does not pay my bills 🙃 (for optimizing 3rd-party models on their hardware)

Perhaps add/edit your message with the relevant links for those architectures 😊

@Amshaker
Owner

Amshaker commented Jan 9, 2024

I can do that soon and will update you 😄

I would be grateful if you could provide details on the steps or requirements involved in measuring the inference time on the S23 Ultra. For iOS, Apple has introduced a valuable feature in its IDE (Xcode 14) that allows measuring prediction time, load time, and compilation time. Could you please share this information, or update the forked repository with these Android-specific details? I am following your repo and have already checked the export file.

@escorciav
Author

escorciav commented Jan 9, 2024

There are multiple ways to port an ML model to Android 😊. Feel free to rename the issue accordingly; I titled it that way for marketing reasons 😉

My approach is specific to Qualcomm hardware using QNN.

  1. My fork has the script used to export the model to ONNX (a minimal sketch is included after this list).
  2. Then, it's just the QNN pipeline:
    1. conversion to cpp
    2. model library generation
    3. (optional, yet recommended for fast inference & speeding up trials) context (aka npu/dsp/gpu) library generation
  3. profiling & execution of the binaries from step 2
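
A minimal sketch of step 1, assuming the SwiftFormer_L1 factory can be imported roughly as in the main repo's layout; the file names, input shape, and export options are illustrative rather than exactly what my fork uses:

```python
# export_onnx.py -- illustrative ONNX export for SwiftFormer_L1.
# Assumptions: the import path below approximates the repo layout, the input is a
# 1x3x224x224 ImageNet-style tensor, and the opset is one the QNN converter accepts.
import torch
from models.swiftformer import SwiftFormer_L1

model = SwiftFormer_L1(pretrained=False)
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "swiftformer_l1.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=13,
    do_constant_folding=True,
)
print("Exported swiftformer_l1.onnx")
```

From there, the ONNX file goes through the QNN tooling (steps 2-3).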

I'm preparing a tutorial for other folks in my org. I will share the slides later in Q1/Q2.

@escorciav
Author

escorciav commented Jan 9, 2024

Attaching the latency results:

  • The JSON file with _basic corresponds to the most reliable results.
  • The model was run over 100 times.
  • I believe I used the fast (less energy-efficient) mode of the S23 Ultra HTP (i.e., the DSP/NPU without the Qualcomm marketing name).

The JSON files were generated with an internal/private tool; however, the QNN docs provide all the info needed to parse the binary profiling output from step 3. The TXT file was generated by a tiny wrapper that digests the JSON.

report_ops.txt
model.iters-100.qnn.int8.json
model.iters-100.qnn.int8_basic.json
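
For anyone writing a similar wrapper, here is a minimal sketch; the JSON layout assumed below (an "iterations" list with a per-run latency in microseconds) is a hypothetical placeholder, since the real schema comes from the internal tool mentioned above:

```python
# summarize_latency.py -- toy summary of per-iteration latencies from a JSON report.
# The keys "iterations" and "total_us" are assumptions, not the real schema.
import json
import statistics
import sys


def summarize(path: str) -> None:
    with open(path) as f:
        report = json.load(f)

    # One latency value (microseconds) per inference iteration (assumed layout).
    totals_us = [it["total_us"] for it in report["iterations"]]
    print(
        f"{path}: n={len(totals_us)} "
        f"mean={statistics.mean(totals_us) / 1000:.3f} ms "
        f"median={statistics.median(totals_us) / 1000:.3f} ms "
        f"min={min(totals_us) / 1000:.3f} ms"
    )


if __name__ == "__main__":
    for json_path in sys.argv[1:]:
        summarize(json_path)
```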

@escorciav
Author

escorciav commented Jan 9, 2024

(Perhaps) good news: the latency of the block I'm interested in improving got a 1.27× speed-up by using QNN >= 2.17.

With enough ⭐s on my fork, I may be persuaded to benchmark SwiftFormer L1 😊 🤣

@Amshaker
Owner

Amshaker commented Jan 9, 2024

That's great! 🚀
You have one star now, come on! 🤣

If you benchmark the SwiftFormer models (let's say L1), we can do a pull request, and I will add you as a contributor to the main repo with a special shoutout in the acknowledgments 👀. Isn't that a good deal? 🤣

@escorciav
Author

Pushed the latency performance of SwiftFormer_L1 with QNN 2.17 & 2.18. The improvement is as much as 1.16×.

we can do a pull request and I will add you as a contributor to the main repo with a special shoutout in the acknowledgments 👀. Isn't it a good deal?

Done with 80% of my duties. Awaiting instructions for the remaining 20% & collecting the brownie points mentioned earlier 🍪

@Amshaker
Owner

You have my word on it 💯. Here we go!

Please create a pull request against the README of the main repo with the following change: create a new sub-section under "Latency Measurement" named "SwiftFormer meets Android" (I liked the name). In this section, you can add the two tables (SwiftFormer Encoder & SwiftFormer-L1) with the latency measurements across the QNN variants (feel free to add the scripts as well). Then I will check & merge the pull request, and you will automatically be added as a contributor! 🚀 Following this, I'll update the acknowledgments, earning you a well-deserved second brownie 🍪

escorciav pushed a commit to escorciav/SwiftFormer that referenced this issue Jan 12, 2024
Community-driven contributions: SwiftFormer meets Android. Qualcomm S8G2
DSP/HTP hardware, via Qualcomm tooling (QNN). Details in Amshaker#14. Work done
by @3scorciav. Refer to his fork for details.
@escorciav
Author

Thanks for merging 🥰. Let's keep the issue open for 6-12 months in case someone else is interested in improving runtime performance or exploring other porting avenues for Android 😉
