Why does cosformer not work on an XL-based transformer architecture? #10

lwaekfjlk · 2022-07-15T13:42:30Z

When I implement cosformer in the MultiHeadAttention module of Transformer-XL and run it without the extra long-range memory, ReLU performs worse than ELU. I think this is because the attention and feed-forward blocks differ, since XL-like transformers use different layer norm and residual connections. Why is this ReLU(Q)ReLU(K).T softmax replacement not robust across different transformer architectures?
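For reference, here is a minimal sketch of the two variants being compared. It rests on several assumptions: "eLU" is taken to mean the elu(x) + 1 feature map, the attention is single-head and non-causal, and the cos-based re-weighting and the linear-complexity reordering are left out for brevity. The names `kernel_attention` and `elu_plus_one` are illustrative and do not come from the cosformer or Transformer-XL code.

```python
import numpy as np

def relu(x):
    # ReLU feature map: non-negative, but can be exactly zero.
    return np.maximum(x, 0.0)

def elu_plus_one(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention(q, k, v, feature_map, eps=1e-6):
    # Replace softmax(Q K^T) with row-normalized phi(Q) phi(K)^T.
    q_f = feature_map(q)                      # (n, d)
    k_f = feature_map(k)                      # (m, d)
    scores = q_f @ k_f.T                      # (n, m), all entries >= 0
    denom = scores.sum(axis=-1, keepdims=True) + eps
    return (scores / denom) @ v               # (n, d_v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 3))

out_relu = kernel_attention(q, k, v, relu)
out_elu = kernel_attention(q, k, v, elu_plus_one)
```

One difference visible in the sketch: with ReLU, a row of scores can collapse toward zero when the inputs to the feature map are mostly negative, while elu(x) + 1 always keeps every score positive. Whether that interacts with the layer norm placement in Transformer-XL is only a guess on my part.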
