
8.6.0 diffusion demo txt2img not working #2784

Open
Vozf opened this issue Mar 17, 2023 · 13 comments
Labels
Demo: Diffusion (Issues regarding demoDiffusion), triaged (Issue has been triaged by maintainers)

Comments


Vozf commented Mar 17, 2023

Description

I am trying to run the demo following the instructions,
and on the step
python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v
I encounter the error

[E] ModelImporter.cpp:726: While parsing node number 7 [LayerNormalization -> "/text_model/encoder/layers.0/layer_norm1/LayerNormalization_output_0"]:                    
[E] ModelImporter.cpp:727: --- Begin node ---                                                                                                                             
[E] ModelImporter.cpp:728: input: "/text_model/embeddings/Add_output_0"                                                                                                   
    input: "text_model.encoder.layers.0.layer_norm1.weight"                                                                                                               
    input: "text_model.encoder.layers.0.layer_norm1.bias"                                                                                                                 
    output: "/text_model/encoder/layers.0/layer_norm1/LayerNormalization_output_0"                                                                                        
    name: "/text_model/encoder/layers.0/layer_norm1/LayerNormalization"                                                                                                   
    op_type: "LayerNormalization"                                                                                                                                         
    attribute {                                                                                                                                                           
      name: "axis"                                                                                                                                                        
      i: -1                                                                                                                                                               
      type: INT                                                                                                                                                           
    }                                                                                                                                                                     
    attribute {                                                                                                                                                           
      name: "epsilon"                                                                                                                                                     
      f: 1e-05                                                                                                                                                            
      type: FLOAT                                                                                                                                                         
    }                                                                                                                                                                     
[E] ModelImporter.cpp:729: --- End node ---                                                                                                                               
[E] ModelImporter.cpp:732: ERROR: builtin_op_importers.cpp:5428 In function importFallbackPluginImporter:                                                                 
    [8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"                                                             
[E] In node 7 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"     
[!] Could not parse ONNX correctly

Environment

TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

rajeevsrao (Collaborator) commented:

@Vozf did you also upgrade to TensorRT 8.6.0?
python3 -c 'import tensorrt as trt;print(trt.__version__)' should give you 8.6.0.
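The version check above matters because plain string comparison of dotted versions can mislead. A minimal sketch of a numeric comparison (pure Python, no TensorRT import needed; the helper names are illustrative, not from the demo code):

```python
# Hypothetical helper: compare dotted version strings numerically, so that
# "8.6.0" > "8.5.3". (Lexicographic string comparison happens to work for
# these two, but would fail for e.g. "8.10.0" vs "8.6.0".)
def version_tuple(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed: str, required: str = "8.6.0") -> bool:
    return version_tuple(installed) >= version_tuple(required)

# 8.5.3 (the version Vozf had installed) is below the 8.6.0 the demo expects.
print(meets_minimum("8.5.3"))  # False
print(meets_minimum("8.6.0"))  # True
```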

@rajeevsrao rajeevsrao added triaged Issue has been triaged by maintainers Demo: Diffusion Issues regarding demoDiffusion labels Mar 18, 2023
Vozf (Author) commented Mar 18, 2023

Yeah, you were right: I had 8.5.3. It seems this "Optional" step in the instructions isn't so optional.
Unfortunately, I've upgraded and now I get the following error:

[I] Saving engine to engine/clip.plan
Building TensorRT engine for onnx/unet.opt.onnx: engine/unet.plan                                                                                                         
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1733934759
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1733934759
[W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped                                                                                    
[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 64, 64), opt=(2, 4, 64, 64), max=(32, 4, 64, 64)).add('encoder_hidden_states', min=(2, 77, 1024), opt=(2, 77, 1024), max=(32, 77, 1024)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 6934.31 MiB, TACTIC_DRAM: 11170.44 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[E] 10: Could not find any implementation for node {ForeignNode[/down_blocks.0/attentions.0/norm/Constant_1_output_0 + (Unnamed Layer* 1216) [Shuffle].../down_blocks.0/resnets.1/conv1/Cast]}.
[E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/down_blocks.0/attentions.0/norm/Constant_1_output_0 + (Unnamed Layer* 1216) [Shuffle].../down_blocks.0/resnets.1/conv1/Cast]}.)
[!] Invalid Engine. Please ensure the engine was built correctly                                                                                                          
Traceback (most recent call last):                                                                                                                                        
  File "demo_txt2img.py", line 76, in <module>                                                                                                                            
    demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,                                                                                                     
  File "/workspace/projects/pixomatic/TensorRT/demo/Diffusion/stable_diffusion_pipeline.py", line 290, in loadEngines                                                     
    engine.build(onnx_opt_path,                                                                                                                                           
  File "/workspace/projects/pixomatic/TensorRT/demo/Diffusion/utilities.py", line 206, in build                                                                           
    engine = engine_from_network(                                                                                                                                         
  File "<string>", line 3, in engine_from_network                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in __call__                                                                   
    return self.call_impl(*args, **kwargs)                                                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 530, in call_impl                                                                  
    return engine_from_bytes(super().call_impl)                                                                                                                           
  File "<string>", line 3, in engine_from_bytes                                                                                                                           
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in __call__                                                                   
    return self.call_impl(*args, **kwargs)                                                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 554, in call_impl                                                                  
    buffer, owns_buffer = util.invoke_if_callable(self._serialized_engine)                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/util/util.py", line 661, in invoke_if_callable                                                                  
    ret = func(*args, **kwargs)                                                                                                                                           
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 488, in call_impl                                                                  
    G_LOGGER.critical("Invalid Engine. Please ensure the engine was built correctly")                                                                                     
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/logger/logger.py", line 597, in critical                                                                        
    raise PolygraphyException(message) from None                                                                                                                          
polygraphy.exception.exception.PolygraphyException: Invalid Engine. Please ensure the engine was built correctly                                                          


aredden commented Mar 22, 2023

I had a similar error-

[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 32, 32), opt=(2, 4, 80, 64), max=(2, 4, 192, 192)).add('encoder_hidden_states', min=(2, 77, 768), opt=(2, 77, 768), max=(2, 77, 768)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:
    Flags                  | [FP16, REFIT]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 19704.19 MiB, TACTIC_DRAM: 24217.31 MiB]
    Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[E] 10: Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.
[E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.)
[!] Invalid Engine. Please ensure the engine was built correctly

I also noticed that the flash attention plugins seem to have been removed. Also, with this version, since no extra plugins are being added, instead of the unet getting down to "UNet: final .. 1082 nodes, 2037 tensors, 3 inputs, 1 outputs" it gets to "4016 nodes, 6732 tensors, 3 inputs, 1 outputs". Is this a result of trying to make it work for the other versions of stable diffusion? Sacrificing performance for flexibility?

rajeevsrao (Collaborator) commented Mar 22, 2023

@aredden @Vozf please share the python commands you used.

@aredden It looks like in your case REFIT is enabled?

Is this a result of trying to get it functional for the other versions of stable diffusion? Sacrificing performance for flexibility?

The increase in nodes is expected if we don't use plugins; however, they will be fused back into fMHA ops by the TensorRT optimizer. Plugins, as you note, are also not very flexible and support fewer SD versions and GPU targets than the TensorRT out-of-box solution.


aredden commented Mar 22, 2023

I was trying refit to see what it would be like; maybe that was incorrect usage? Also, my environment was exactly the one described in the stable-diffusion demo README.md via the docker container, installing requirements to a T. @rajeevsrao One thing I noticed was that inside the container the tensorrt version is 8.5.3, whereas the optional tensorrt version I installed is 8.6.0. Maybe that caused some issue? GPU is a 4090, with cuda 12.1 outside the container.


aredden commented Mar 22, 2023

As for the command, I was using this script. I added some arguments to modify the max latent dimensions for the unet and to have the pytorch model pulled from a local custom diffusers checkpoint path, something I had been doing with the previous version of tensorrt.

#!/bin/sh
CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY python3 demo_txt2img.py \
    --negative-prompt "((Horribly blurred)), very ugly, (jpeg artifacts, blurry, gross), messy, warped, split, bad anatomy, malformed body, malformed, warped, fake, 3d, drawn, hideous, disgusting" \
    --denoising-steps 50 \
    --scheduler DPM \
    --width 512 \
    --height 640 \
    --engine-dir engine \
    --onnx-dir onnx \
    --force-onnx-export \
    --force-engine-build \
    --force-onnx-optimize \
    --build-preview-features \
    --build-enable-refit \
    --build-all-tactics \
    --build-static-batch \
    --build-dynamic-shape \
    --max-size 1536 \
    --model-path ./oranjipiratejaydos \
    -v \
    "(Stunningly beautiful detailed) lush futuristic (eutopian paradise cyberpunk cityscape) landscape, intricate, elegant, mountains and very high waterfall background, volumetric lighting"
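The 'sample' profile in the build log above (opt=(2, 4, 80, 64)) lines up with this script's image size, assuming Stable Diffusion's usual 8x VAE downsampling and 4 latent channels. A small sketch of that relationship (illustrative only, not the demo's own code):

```python
# Sketch: map image-space dimensions to UNet latent-space dimensions,
# assuming Stable Diffusion's 8x VAE downsampling and 4 latent channels.
VAE_FACTOR = 8
LATENT_CHANNELS = 4

def latent_shape(batch: int, height: int, width: int) -> tuple:
    return (batch, LATENT_CHANNELS, height // VAE_FACTOR, width // VAE_FACTOR)

# --height 640 --width 512, with the batch doubled to 2 for classifier-free
# guidance, matches the opt profile seen in the log.
print(latent_shape(2, 640, 512))  # (2, 4, 80, 64)
# --max-size 1536 corresponds to the 192 latent edge in the max profile.
print(1536 // VAE_FACTOR)  # 192
```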

edit: Interesting, the error seems to have gone away after I built a container from source with tensorrt 8.6 and cuda 12 using the basic demo script. It might be that I had some code errors, or something about larger dynamic shapes doesn't work very well. The error could also be related to having two different tensorrt binary versions in the previous container (8.5.3, then updated to 8.6 via pip). Not sure.

edit 2: It compiles, but the output images are all black.

edit 3: The black images were the result of a faulty tensorrt-compiled clip model for some reason 🤔. I didn't change any code whatsoever, so I'm not sure why that would happen, but the speed of inference is about 1/3 of what it was with 8.5.3.

edit 4: Alright, I built from source and that helped shave off quite a bit of time, from ~1600 ms per 50 unet passes to about 1100 ms per 50. That is still considerably slower than 8.5.3, which gets closer to ~580 ms per 50.
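For scale, the timings quoted above work out to the following per-pass costs and slowdown factors (simple arithmetic on the reported numbers; the labels are mine):

```python
# Reported times for 50 UNet passes, in milliseconds (from the comment above).
timings_ms = {"8.5.3": 580, "8.6.0 (pip)": 1600, "8.6.0 (source build)": 1100}

baseline = timings_ms["8.5.3"]
for version, total in timings_ms.items():
    per_pass = total / 50          # cost of a single UNet pass
    slowdown = total / baseline    # relative to the 8.5.3 baseline
    print(f"{version}: {per_pass:.1f} ms/pass, {slowdown:.2f}x vs 8.5.3")
```

So the pip-installed 8.6.0 is roughly 2.8x slower than 8.5.3, and the source build narrows that to roughly 1.9x.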

Vozf (Author) commented Mar 22, 2023

I'm following the diffusion readme step by step.
The error occurs on the step:
python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v

chavinlo commented:

> I had a similar error-
>
> [I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 32, 32), opt=(2, 4, 80, 64), max=(2, 4, 192, 192)).add('encoder_hidden_states', min=(2, 77, 768), opt=(2, 77, 768), max=(2, 77, 768)).add('timestep', min=[1], opt=[1], max=[1])]
> [I] Building engine with configuration:
>     Flags                  | [FP16, REFIT]
>     Engine Capability      | EngineCapability.DEFAULT
>     Memory Pools           | [WORKSPACE: 19704.19 MiB, TACTIC_DRAM: 24217.31 MiB]
>     Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
>     Profiling Verbosity    | ProfilingVerbosity.DETAILED
>     Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
> [E] 10: Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.
> [E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.)
> [!] Invalid Engine. Please ensure the engine was built correctly
>
> I also noticed that it seems like the flash attention plugins were removed? also- with this version, since it seems like no extra plugins are being added, instead of the unet getting down to UNet: final .. 1082 nodes, 2037 tensors, 3 inputs, 1 outputs it gets to: 4016 nodes, 6732 tensors, 3 inputs, 1 outputs. Is this a result of trying to get it functional for the other versions of stable diffusion? Sacrificing performance for flexibility?

Had the same issue: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization

Tried everything: verifying torch, reinstalling dependencies, compiling the plugins. Nothing worked except adding the --build-preview-features flag.

There was a warning that mentioned that enabling this would prevent issues... so I guess that's it.

chavinlo commented:

> Had the same issue, Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization
>
> Tried everything, verifying torch, reinstalling dependencies, compiling the plugins, nothing worked except adding the --build-preview-features flag.
>
> There was some warning that mentioned that enabling this would prevent issues... so I guess thats it....

[image attachment: txt2img-fp16-a_beautifu-1-4725]
Can confirm it works, although yes, this is wayyyy slower than before

chavinlo commented:

> The increase in nodes is expected if we don't use plugins, however they will be fused back into fMHA Ops by the TensorRT optimizer. Plugins are you note are also not very flexible and support fewer SD versions and GPU target than the TensorRT out-of-box solution.

@rajeevsrao is there a way to accelerate it in the current state? By "GPU target" do you mean compiling the plugins for specific architectures? Would that help?

Vozf (Author) commented Mar 29, 2023

I've managed to run the demo until it was somehow killed at the unet stage, although I had to manually install torch 1.13, because torch 2.0 was installed by default since torch isn't in the requirements file. torch==1.13 must be added to requirements; torch 2.0 results in an error from the start.

chavinlo commented:

Man, I can't even replicate what I did yesterday.
Damn, y'all really broke it this time.


skirsten commented Apr 14, 2023

I am also getting the

Could not find any implementation for node {ForeignNode[down_blocks.0.attentions.0.transformer_blocks.0.norm1.weight + (Unnamed Layer* 1363) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]} [profile 1].

with this config on the normal text2img:

[I] Building engine with configuration:
    Flags                  | [FP16, REFIT]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 20480.00 MiB, TACTIC_DRAM: 24259.69 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Optimization Profiles  | 2 profile(s)
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]

I was using the version_compatible build before without refit and everything was fine 😞.
It's a shame that I cannot build an engine with refit AND version_compatible to use with the lean runtime.
I also tested the normal build and that also works.

So it has to do with refit, which does not make any sense...
Now I am stuck having to ship gigabytes of unused dependencies, and the build is failing for no apparent reason...

5 participants