
# Analysis of Self-Attention Generative Adversarial Networks

[Self-Attention Generative Adversarial Networks](https://arxiv.org/abs/1805.08318?fbclid=IwAR1a9tz5IiQRuJ0CuWth7XV4POKUhZFvAhtMBFGRFYHZID6h0U7dcsbF8r8)





# Idea in brief 

- As of Q2 2018, the most commonly used GAN architectures, which are based on Convolutions, are good at producing **realistic texture** but poor at generating **realistic geometrical structure**

- This is probably a limitation of the Convolution itself, which works only on local information

- Attention-based mechanisms have been introduced to learn long-range dependencies and so far have shown promising results

- The proposed approach essentially consists of introducing an Attention Mechanism into a typical GAN so as to increase its capability to learn long-range dependencies and eventually produce realistic geometric structure



# Some Details 





## Key Points  

- CNNs are at the core of GANs applied to images (no surprise)
- A Multi-class Dataset is a highly heterogeneous dataset with many different classes, each with its own particular appearance in terms of both **texture** and **geometry**
  - Example: ImageNet 

- Texture intensive object class : highly recognizable by texture, loose geometric structure 
  - Example: ocean, sky, ... 
- Geometric intensive object class : highly recognizable by its specific geometry 
  - Example: animals 

- Current GANs have shown the capability to learn only a specific subset of classes: the ones that depend less on Geometric Structure

- Convolution is a **local operation** 
- Self Attention is a **non local operation** 
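
The local versus non-local contrast in the last two points can be illustrated with a toy 1D signal. This is a minimal sketch, not the paper's model: `conv1d_same`, `self_attention_1d`, the kernel and the signal values are all hypothetical and chosen only for illustration. It changes a single far-away input value and checks which operation lets that change reach the first output position.

```python
import math

def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding: each output sees only len(kernel) neighbours."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def self_attention_1d(x):
    """Scalar dot-product self-attention: each output is a weighted sum over ALL positions."""
    out = []
    for xi in x:
        scores = [xi * xj for xj in x]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * xj for e, xj in zip(exps, x)))
    return out

# Hypothetical signal; only the LAST position is perturbed.
signal = [1.0, 0.5, -0.5, 0.2, 0.1, 0.3, -0.2, 0.4]
perturbed = signal[:-1] + [3.0]
kernel = [0.25, 0.5, 0.25]

# Does the FIRST output position notice a change at the last input position?
conv_changed = conv1d_same(signal, kernel)[0] != conv1d_same(perturbed, kernel)[0]
attn_changed = self_attention_1d(signal)[0] != self_attention_1d(perturbed)[0]
print(conv_changed, attn_changed)
```

With a kernel of size 3 the first convolution output depends only on the first two inputs, so the perturbation is invisible to it, while the attention output at the same position shifts because its weights span the whole signal.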




## Key Issues 

- Q: Can Convolution learn non-local dependencies? 
  - Not directly: by construction, a single convolution relies on a local receptive field 
  
- Q: How can a CNN achieve global semantics? 
  - This is achieved by a Deep Structure: **hierarchical feature learning** leads to global semantic estimation 
  
- Q: How does this connect to texture and geometric structure? 
  - Texture generation is much less sensitive to long-range dependencies than geometric structure 
  - Convolution + Hierarchical Structure can do well for texture 
  - Geometric structure relies on long-range dependencies 
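
To make the "Deep Structure" answer concrete, the sketch below counts how many stacked convolution layers are needed before one output unit can see the whole image. It uses the standard receptive field recurrence for identical layers; the function names, the 3x3 kernel and the 128 pixel image size are assumptions made for illustration, not values taken from these notes.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (input pixels along one axis) of a stack of identical conv layers."""
    rf = 1    # a single input pixel before any layer
    jump = 1  # spacing, in input pixels, between neighbouring units of the current layer
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

def layers_to_cover(image_size, kernel_size=3, stride=1):
    """Smallest number of layers whose receptive field spans image_size pixels."""
    layers = 0
    while receptive_field(layers, kernel_size, stride) < image_size:
        layers += 1
    return layers

# Assumed image side of 128 pixels, for illustration only.
print(layers_to_cover(128))             # stride 1
print(layers_to_cover(128, stride=2))   # stride 2
```

Without striding the receptive field grows by only two pixels per 3x3 layer, so relating two distant regions requires a very deep stack; striding shortens the path but still routes long-range information through many intermediate layers.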



## Goal in more detail 

- Build a model able to capture relationships between **widely separated spatial regions** in a computationally efficient way 
- This is implemented by adding a self-attention mechanism to a CNN-based GAN 



## Problem 

- Current (Q2 2018) GANs applied to Multi-class Dataset (e.g. ImageNet) show a major limitation : 
  - good at generating texture 
  - poor at generating realistic geometric structure 

Example 

- From [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/abs/1701.00160)

![Ian Goodfellow Tutorial GAN generated dog](https://cdn-images-1.medium.com/max/800/1*wPRcBE66_sj_AppB4tQ3lw.png)

- Texture looks realistic 
- Geometric Structure is still far from good (even if not totally random)







## Possible Explanations 

- Convolution works really well to learn short range dependencies (by construction it relies on a receptive field which is local)
- Geometric Structure should result from long-range dependencies, which cannot be learned effectively with a CNN alone 




## Proposed Solution 

- Provide the GAN with a mechanism complementary to Convolution to learn long-range and multi-level dependencies: use **Self Attention**, which is an Attention-based mechanism 



# More Details 



## GAN - More details 

### Overview 

- GAN as the current standard architecture for the image generation task 

- Hard to train 
  - Unstable 
  - Highly sensitive to the choice of hyperparameters 

### Recent Techniques 

- Spectral Normalization seems to work pretty well at stabilizing training 
  - Ref: [Spectral Normalization for Generative Adversarial Networks](https://arxiv.org/abs/1802.05957)
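
The core idea of Spectral Normalization is to divide a weight matrix by its largest singular value, which can be estimated by power iteration. The following is a minimal, dependency-free sketch of that idea and not the reference implementation: the helper names, the iteration count and the example matrix are assumptions chosen for illustration.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v, eps=1e-12):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]

def spectral_norm(weight, num_iters=50, seed=0):
    """Estimate the largest singular value of `weight` by power iteration."""
    rng = random.Random(seed)
    u = normalize([rng.gauss(0.0, 1.0) for _ in weight])  # one entry per row
    weight_t = transpose(weight)
    for _ in range(num_iters):
        v = normalize(matvec(weight_t, u))  # right singular vector estimate
        u = normalize(matvec(weight, v))    # left singular vector estimate
    # sigma is approximated by u^T W v
    return sum(a * b for a, b in zip(u, matvec(weight, v)))

def spectral_normalize(weight, num_iters=50):
    """Rescale `weight` so that its largest singular value is approximately 1."""
    sigma = spectral_norm(weight, num_iters)
    return [[w / sigma for w in row] for row in weight]

# Hypothetical 2x2 weight whose singular values are 3 and 1.
weight = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(weight)
normalized = spectral_normalize(weight)
print(sigma, spectral_norm(normalized))
```

Many iterations are used here only so the estimate converges in a single call; training implementations typically run very few iterations per step and keep `u` between steps.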




## Attention Mechanism - More Details 

### Overview 

- Very effective at learning long-range dependencies 
- Special focus on Self Attention, also called Intra Attention
- Self Attention is a **non local operation**, in contrast to Convolution, which is local 




# Architecture 

![Self Attention Mechanism1](https://i.paste.pics/3ddd18a26085af9f6e8cb1fd0e8a855a.png)

- The module maps the Input Feature Map $X$ to the Self Attention Feature Map $O$ 
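
The mapping from $X$ to $O$ can be sketched in plain Python. This follows the formulation in the paper as I understand it: two projections `f` and `g` produce the attention scores, a third projection `h` produces the values, and the result is added back to the input through a learned scalar `gamma` that starts at 0. The names `w_f`, `w_g`, `w_h`, `gamma` and all numeric values below are assumptions for illustration; in a real model the projections are learned 1x1 convolutions and the feature map is flattened from H x W positions to N.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_f, w_g, w_h, gamma):
    """x: N positions x C channels. w_f, w_g: C x C_bar. w_h: C x C."""
    f = matmul(x, w_f)  # N x C_bar
    g = matmul(x, w_g)  # N x C_bar
    h = matmul(x, w_h)  # N x C
    # scores[j][i] = f(x_i) . g(x_j): how much position j attends to position i
    scores = matmul(g, transpose(f))              # N x N
    attention = [softmax(row) for row in scores]  # each row sums to 1
    o = matmul(attention, h)                      # N x C self-attention feature map
    # Residual connection scaled by gamma: y = gamma * o + x
    y = [[gamma * ov + xv for ov, xv in zip(o_row, x_row)]
         for o_row, x_row in zip(o, x)]
    return y, o, attention

# Hypothetical 2x2 feature map with 2 channels, flattened to N = 4 positions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w_f = [[1.0], [0.5]]            # C x C_bar, with C_bar = 1
w_g = [[0.5], [-1.0]]
w_h = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the example readable

y_init, o, attention = self_attention(x, w_f, w_g, w_h, gamma=0.0)
y_attn, _, _ = self_attention(x, w_f, w_g, w_h, gamma=1.0)
print(attention[0])
print(y_init == x, y_attn == x)
```

The attention matrix is N x N, so every output position can draw on every input position in a single step. With `gamma = 0` the block is an identity mapping, which lets the network rely on local convolutional cues first and bring in non-local evidence gradually as `gamma` is learned.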

