Describe the bug

I'm experimenting with attention masking in Stable Diffusion (so that padding tokens aren't considered in cross-attention), and I found that UNet2DConditionModel doesn't work when given an attention_mask.

diffusers/src/diffusers/models/attention_processor.py
Line 740 in 8ead643

For the attn1 blocks (self-attention), the target sequence length differs from the current length (the target is 4096, but a typical CLIP output is only 77 tokens). The padding routine pads by appending target_length zeros to the end of the last dimension, which results in a sequence length of 4096 + 77 rather than the desired 4096. I think it should pad by target_length - current_length instead.
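Here is a minimal sketch of the change I have in mind, assuming the padding is done with torch.nn.functional.pad on the last dimension (the variable names are illustrative, not the exact code at the permalink above):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a 77-token CLIP mask being prepared for a
# 64x64 latent grid, i.e. 4096 self-attention tokens.
attention_mask = torch.ones(1, 77)
current_length = attention_mask.shape[-1]  # 77
target_length = 4096

# Padding by target_length (the current behaviour) yields 77 + 4096 = 4173 entries.
# Padding by the difference yields the desired 4096 entries:
attention_mask = F.pad(attention_mask, (0, target_length - current_length), value=0.0)
assert attention_mask.shape[-1] == target_length
```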
encoder_attention_mask works fine - it's passed to the attn2 block and no padding ends up being necessary.
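For reference, this is roughly how I'm passing it (same pipeline and te_mask names as in the reproduction below; shown only as a usage sketch):

```python
# Cross-attention masking via encoder_attention_mask: the mask already has
# the 77-token text length, so no padding is needed and attn2 behaves as expected.
out = pipeline.unet(
    latent_input,
    timestep,
    encoder_hidden_states=text_encoder_output,
    encoder_attention_mask=te_mask,
).sample
```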
It seems this would additionally fail if current_length were greater than target_length, since you can't pad by a negative amount, but I don't know whether that's a practical concern.
(I know that particular masking isn't even semantically valid, but that's orthogonal to this issue!)
Reproduction
```python
# given a Stable Diffusion pipeline
# given te_mask = tokenizer_output.attention_mask
pipeline.unet(latent_input, timestep, text_encoder_output, attention_mask=te_mask).sample
```
Additionally, are there any examples of attention_mask being used with UNet2DConditionModel? Since it's a UNet, the blocks have descending spatial sizes, but only a single attention mask is accepted. It seems like you would need to accept a mask per block level (or to downsample the mask to match the block size at each depth). I've hacked up my local UNet2DConditionModel to do just that, and it seems to work, but I'd like to understand the intended usage of attention_mask as it stands.
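For what it's worth, here is a rough sketch of the kind of per-depth downsampling I mean, assuming the self-attention mask is defined over the flattened 64x64 latent grid (everything here is illustrative and not part of the diffusers API):

```python
import torch
import torch.nn.functional as F

def downsample_spatial_mask(mask_2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Downsample a (batch, H, W) spatial mask by a factor of 2**depth and
    flatten it so its length matches the token count of a UNet block at that
    depth. Purely illustrative."""
    scale = 2 ** depth
    # F.interpolate expects a channel dimension, so add one temporarily.
    small = F.interpolate(mask_2d.unsqueeze(1).float(), scale_factor=1 / scale, mode="nearest")
    return small.flatten(start_dim=1)  # (batch, tokens at this depth)

# A mask over a 64x64 latent: 4096 tokens at depth 0, 1024 at depth 1, 256 at depth 2.
mask = torch.ones(1, 64, 64)
print([downsample_spatial_mask(mask, d).shape[-1] for d in range(3)])  # [4096, 1024, 256]
```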
Logs
System Info
NVIDIA GeForce RTX 4060 Ti, 16380 MiB
Who can help?
No response