Potentially missing positional encodings in SpatialTransformer #522

ivallesp · 2022-12-05T12:15:15Z

Hey folks, first of all, thanks for all the effort put into building this amazing open source community. Here are my two cents: I may be mistaken, but the spatial transformers appear to use attention without positional encodings. Is that correct? Attention on its own has no mechanism to know the original order of the pixels. Could that be hurting performance?
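
To make the concern concrete, here is a minimal sketch (not code from the repository) showing that a plain self-attention layer without positional encodings is permutation equivariant: shuffling the input tokens only shuffles the outputs in the same way. The `self_attention` helper and the random weights `wq`, `wk`, `wv` are illustrative assumptions. It mirrors the `c ** -0.5` scaling used in the listings below but omits normalization, multiple heads, and the output projection.

```python
import torch

torch.manual_seed(0)


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over tokens x of shape (n, c), no positional encoding."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # Scaled dot-product scores between every pair of tokens: (n, n)
    weights = torch.softmax(q @ k.T * x.shape[-1] ** -0.5, dim=-1)
    return weights @ v


n_tokens, channels = 16, 8  # e.g. a flattened 4x4 feature map
x = torch.randn(n_tokens, channels, dtype=torch.float64)
wq, wk, wv = (torch.randn(channels, channels, dtype=torch.float64) for _ in range(3))

perm = torch.randperm(n_tokens)  # shuffle the "pixels"
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)

# Shuffling the input rows shuffles the output rows in exactly the same way.
is_equivariant = torch.allclose(out[perm], out_shuffled)
```

If `is_equivariant` is true, the layer by itself cannot tell where a pixel came from. This checks only the attention layer in isolation; it says nothing about the other layers around it.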

SpatialSelfAttention:

stable-diffusion/ldm/modules/attention.py

Lines 99 to 149 in ce05de2

    
    class SpatialSelfAttention(nn.Module):
        def __init__(self, in_channels):
            super().__init__()
            self.in_channels = in_channels

            self.norm = Normalize(in_channels)
            self.q = torch.nn.Conv2d(in_channels,
                                     in_channels,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)
            self.k = torch.nn.Conv2d(in_channels,
                                     in_channels,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)
            self.v = torch.nn.Conv2d(in_channels,
                                     in_channels,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)
            self.proj_out = torch.nn.Conv2d(in_channels,
                                            in_channels,
                                            kernel_size=1,
                                            stride=1,
                                            padding=0)

        def forward(self, x):
            h_ = x
            h_ = self.norm(h_)
            q = self.q(h_)
            k = self.k(h_)
            v = self.v(h_)

            # compute attention
            b,c,h,w = q.shape
            q = rearrange(q, 'b c h w -> b (h w) c')
            k = rearrange(k, 'b c h w -> b c (h w)')
            w_ = torch.einsum('bij,bjk->bik', q, k)

            w_ = w_ * (int(c)**(-0.5))
            w_ = torch.nn.functional.softmax(w_, dim=2)

            # attend to values
            v = rearrange(v, 'b c h w -> b c (h w)')
            w_ = rearrange(w_, 'b i j -> b j i')
            h_ = torch.einsum('bij,bjk->bik', v, w_)
            h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h)
            h_ = self.proj_out(h_)

            return x+h_

SpatialTransformer:

stable-diffusion/ldm/modules/attention.py

Lines 218 to 261 in ce05de2

    
    class SpatialTransformer(nn.Module):
        """
        Transformer block for image-like data.
        First, project the input (aka embedding)
        and reshape to b, t, d.
        Then apply standard transformer action.
        Finally, reshape to image
        """
        def __init__(self, in_channels, n_heads, d_head,
                     depth=1, dropout=0., context_dim=None):
            super().__init__()
            self.in_channels = in_channels
            inner_dim = n_heads * d_head
            self.norm = Normalize(in_channels)

            self.proj_in = nn.Conv2d(in_channels,
                                     inner_dim,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)

            self.transformer_blocks = nn.ModuleList(
                [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim)
                    for d in range(depth)]
            )

            self.proj_out = zero_module(nn.Conv2d(inner_dim,
                                                  in_channels,
                                                  kernel_size=1,
                                                  stride=1,
                                                  padding=0))

        def forward(self, x, context=None):
            # note: if no context is given, cross-attention defaults to self-attention
            b, c, h, w = x.shape
            x_in = x
            x = self.norm(x)
            x = self.proj_in(x)
            x = rearrange(x, 'b c h w -> b (h w) c')
            for block in self.transformer_blocks:
                x = block(x, context=context)
            x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)
            x = self.proj_out(x)
            return x + x_in
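
For illustration only, one way to inject position at this point would be to add a fixed 2D sine/cosine encoding to the tokens right after the `'b c h w -> b (h w) c'` flattening. The sketch below is an assumption about what such a change could look like, not something the repository does. `sincos_position_2d` is a hypothetical helper, and whether it would help or hurt a pretrained model is untested here.

```python
import math

import torch


def sincos_position_2d(h, w, dim):
    """Hypothetical helper: fixed 2D sine/cosine encoding of shape (h*w, dim).

    Rows follow the row-major '(h w)' flattening order; dim must be divisible by 4.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    n_freqs = dim // 4
    # Geometrically spaced frequencies, as in the usual sinusoidal encoding.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(n_freqs) / n_freqs)
    ys = torch.arange(h).repeat_interleave(w)  # row index of each token
    xs = torch.arange(w).repeat(h)             # column index of each token
    y = ys.reshape(-1, 1) * freqs              # (h*w, dim/4)
    x = xs.reshape(-1, 1) * freqs              # (h*w, dim/4)
    return torch.cat([y.sin(), y.cos(), x.sin(), x.cos()], dim=1)


b, c, h, w = 2, 32, 4, 6
feat = torch.randn(b, c, h, w)
tokens = feat.flatten(2).transpose(1, 2)  # same layout as 'b c h w -> b (h w) c'
pos = sincos_position_2d(h, w, c)         # (h*w, c), broadcast over the batch
tokens_with_pos = tokens + pos
```

Because the encoding is computed from `h` and `w` at call time, this variant does not tie the module to a single resolution, which a learned per-position embedding would.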

BasicTransformerBlock:

stable-diffusion/ldm/modules/attention.py

Lines 196 to 215 in ce05de2

    
    class BasicTransformerBlock(nn.Module):
        def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True):
            super().__init__()
            self.attn1 = CrossAttention(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout)  # is a self-attention
            self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
            self.attn2 = CrossAttention(query_dim=dim, context_dim=context_dim,
                                        heads=n_heads, dim_head=d_head, dropout=dropout)  # is self-attn if context is none
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.norm3 = nn.LayerNorm(dim)
            self.checkpoint = checkpoint

        def forward(self, x, context=None):
            return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)

        def _forward(self, x, context=None):
            x = self.attn1(self.norm1(x)) + x
            x = self.attn2(self.norm2(x), context=context) + x
            x = self.ff(self.norm3(x)) + x
            return x

explainingai-code mentioned this issue Mar 5, 2024

In Diffusion Models: Why does not use 'Positional Encoding' in self-attention layers? explainingai-code/DDPM-Pytorch#4

Open
