[Mobile] Subgraphs duplicate initializers in RAM during execution #20709

Open
niedev opened this issue May 17, 2024 · 6 comments
Labels
api:Java issues related to the Java API platform:mobile issues related to ONNX Runtime mobile; typically submitted using template

Comments


niedev commented May 17, 2024

Describe the issue

I created a model composed of an If node that, based on a Boolean input, executes one of two subgraphs. Both subgraphs use the same weight matrix (about 1 GB; it feeds a Gather node in one subgraph and a Transpose node in the other), which is stored in the initializers of the parent graph. The matrix is not duplicated on disk (the whole model file is about 1 GB), but when running on Android the model consumes 2 GB of RAM instead of 1 GB, most likely because the matrix shared by the two subgraphs is duplicated. Is this a bug or expected behavior? And if it is expected, is there a way to avoid it?

This is the structure of the model:

[model graph screenshot]

To reproduce

Here is the code used for loading the session in Android:

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

onnxEnv = OrtEnvironment.getEnvironment();
OrtSession.SessionOptions embedAndLmHeadOptions = new OrtSession.SessionOptions();
// Disable memory-pattern planning and the CPU arena allocator to keep ORT's own memory overhead low.
embedAndLmHeadOptions.setMemoryPatternOptimization(false);
embedAndLmHeadOptions.setCPUArenaAllocator(false);
embedAndLmHeadSession = onnxEnv.createSession(embedAndLmHeadPath, embedAndLmHeadOptions);
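
For completeness, a rough sketch of how such an If-based model would then be run from the Java API, passing the Boolean condition together with the other inputs. The input names "use_lm_head" and "input_ids" are illustrative placeholders, not taken from the actual model; the real names can be listed with embedAndLmHeadSession.getInputNames():

import ai.onnxruntime.OnnxTensor;
import java.nio.LongBuffer;
import java.util.Map;

// "use_lm_head" selects which If branch runs; "input_ids" is the token index fed to Gather.
// Both names are hypothetical placeholders for this sketch.
try (OnnxTensor condition = OnnxTensor.createTensor(onnxEnv, new boolean[]{true});
     OnnxTensor ids = OnnxTensor.createTensor(onnxEnv, LongBuffer.wrap(new long[]{3L}), new long[]{1, 1});
     OrtSession.Result result = embedAndLmHeadSession.run(
             Map.of("use_lm_head", condition, "input_ids", ids))) {
    Object output = result.get(0).getValue(); // embed or lm_head output, depending on the condition
}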

The model is saved here: https://github.com/niedev/testModel/releases/download/testModel/nllb_embed_and_lm_head1.onnx

Urgency

Not so urgent

Platform

Android

OS Version

14

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

1.17.3

ONNX Runtime API

Java/Kotlin

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

niedev added the platform:mobile label on May 17, 2024
github-actions bot added the api:Java label on May 17, 2024

niedev commented May 19, 2024

I found the problem. When onnxruntime applies the basic optimizations to the lm_head subgraph, it transposes the weight matrix (called model.shared.weight) and saves the result as another initializer, eliminating the Transpose node (if I disable the basic optimizations, the transpose is simply performed on model.shared.weight during execution, duplicating it anyway, so in addition to consuming 1 GB more it also slows down execution). So I applied the transpose-of-a-product identity to the lm_head: I perform the Transpose on the other matrix that is multiplied with model.shared.weight (pre_logits) and on the final result, and invert the order of the two MatMul inputs. This yields an equivalent MatMul without having to transpose model.shared.weight (which is much larger than the two matrices that are now transposed). In this way I managed to reduce RAM consumption by 1 GB, but the problem is that the execution time increases.
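
For reference, the rewrite relies on the transpose-of-a-product identity; with $x$ = pre_logits ($1 \times 1024$) and $W$ = model.shared.weight ($256000 \times 1024$), the shapes being those discussed later in this thread:

$$
x\,W^{\top} \;=\; \left(W\,x^{\top}\right)^{\top}
$$

so the lm_head MatMul can be computed as $W x^{\top}$ followed by a transpose of the small $256000 \times 1$ result, without ever materializing $W^{\top}$.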

This is the updated model: https://github.com/niedev/testModel/releases/download/testModel_2.0/nllb_embed_and_lm_head_if3.onnx

I used the onnx profiler on the new model (without optimizations) to understand which node causes the performance decrease. The two added Transposes execute practically instantly; the node that takes longer than before is the lm_head's MatMul (it goes from 36 ms to 50 ms), but I can't understand why, given that the multiplication is practically equivalent.

This is the profiling result of the old model: https://github.com/niedev/testModel/blob/main/embed_and_lmhead_log_2024-05-19_old.json

This is the profiling result of the new model: https://github.com/niedev/testModel/blob/main/embed_and_lmhead_log_2024-05-19_new.json
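
(For reference, profiling traces like the ones above can be produced directly from the Java API; a sketch, assuming the enableProfiling/endProfiling methods documented for the 1.17 Java bindings, with an illustrative output path:)

OrtSession.SessionOptions options = new OrtSession.SessionOptions();
// Events are written to a chrome://tracing-compatible JSON file with this prefix.
options.enableProfiling("/data/local/tmp/embed_and_lmhead_profile");
OrtSession session = onnxEnv.createSession(embedAndLmHeadPath, options);
// ... session.run(...) as usual ...
String profileFile = session.endProfiling(); // path of the generated trace file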

The model I'm working on is the result of extracting the components that perform the embed and lm_head of NLLB, so if you could solve the MatMul problem (if it is solvable) and integrate this modification into onnxruntime's basic optimization process, it would lead to a significant reduction in RAM consumption for NLLB (and also for other Transformers that share the embed and lm_head matrix).

skottmckay (Contributor) commented:

Try using the XNNPACK execution provider. The MLAS kernels on arm64 focus on quantized data, so for 32-bit floats the XNNPACK kernels might have optimizations that address the performance drop.
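
For reference, a minimal sketch of enabling XNNPACK from the Java API, assuming the addXnnpack(Map) method available in recent onnxruntime-android releases (verify against the javadoc for your version; embedAndLmHeadPath is the variable from the snippet above):

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;
import java.util.Collections;

OrtEnvironment env = OrtEnvironment.getEnvironment();
OrtSession.SessionOptions options = new OrtSession.SessionOptions();
// "intra_op_num_threads" is the XNNPACK provider option controlling its own thread pool.
options.addXnnpack(Collections.singletonMap("intra_op_num_threads", "4"));
OrtSession session = env.createSession(embedAndLmHeadPath, options);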


niedev commented Jun 5, 2024

Hi, the final model I will implement will be quantized, so I also ran tests with these quantized components (u8/u8, with both weights and activations asymmetric) on arm64, and the problem is the same (less RAM consumption, but about 35% more execution time).

skottmckay (Contributor) commented:

The activation_size value is the sum of the input tensor sizes. In the new model that's 1024x larger.

Old:

[profiling screenshot: old model MatMul entry]

New:

[profiling screenshot: new model MatMul entry]


niedev commented Jun 6, 2024

Ok, but I think this is just a problem with the log: in the old model the second MatMul input (which has dimensions 1024x256000) is not shown in the log, but the MatMul must have two inputs (as the old model graph also shows):

[screenshot of the old model's MatMul node in the graph]

skottmckay (Contributor) commented:

Ah ok. Definitely unexpected that a MatMul of {1, 1K} x {1K, 256K} is significantly better than {256K, 1K} x {1K, 1}. @yufenglee any ideas as to why that would be?

Did you try the XNNPACK EP just for another data point?

More of a workaround, but could you change the initializer to be in the transposed ordering that the original MatMul used, and instead update the usage of model.shared.weight in the other subgraph to adjust for that? That may be a way to avoid the constant folding duplicating the initializer, which would address the original problem.
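
Concretely (my reading of the suggestion, with the same notation as above): if the stored initializer becomes $W^{\top}$ ($1024 \times 256000$), the lm_head MatMul can consume it directly, and the embed subgraph's row lookup becomes a column lookup, since

$$
\mathrm{Gather}(W, i, \text{axis}=0) \;=\; \mathrm{Transpose}\!\left(\mathrm{Gather}(W^{\top}, i, \text{axis}=1)\right).
$$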
