# **Introduction to VAE**
Over the last decade, deep learning (DL) has taken the field of AI by storm. Using neural networks (NNs), we can now solve a host of problems, e.g.:
* Object detection: Feed the network an image and it will identify the locations of important objects in that image. ***Additional information: we give an input image and, after the network processes it, we know what objects are present and where they are located in the image.***
* Language translation: Feed a NN an English sentence and it'll spit out the equivalent in French. ***Additional information: we give an English input sentence and, after the NN processes it, we know how to say it in another language.***
* Audio classification: Feed a NN a sound wave and it will determine what produced that sound. ***Additional information: after processing, we know what made that sound (for example, which animal).***

You can see these problems are quite different: they have completely different input and output variables. However, they all have one thing in common: in every case, the neural network processes the input sample and spits out a result that gives us some additional information about the input. ***There is a category of networks that is a bit different. They don't merely provide additional information about an input sample; they also try to create, or GENERATE, a sample (image, audio or text) themselves. This class of NNs is called GENERATIVE MODELS***.

We are going to go through a particular type of generative model: ***Variational Autoencoders (VAEs)***.
## **Intuition on VAE**
Generative models are also just NNs themselves. Normal NN models usually take some sample as input (raw data such as an image, text or audio). ***Generative models, on the other hand, produce a sample as an output***; this flip is what makes them so interesting.

You can train a model to learn what dogs look like by feeding it hundreds of dog images. Then, during test time, we can just ask the model for an image and it will spit out a dog image. Every time we ask our model to generate a dog, it'll generate a different one, so you can create an unlimited gallery of your favorite animal.

### **What does this generative model black box look like?**
Let's take a look at the VAE. VAEs are a type of generative model based on another type of architecture called ***Autoencoders (AEs)***.

These AEs consist of two parts, an encoder and a decoder:
* The encoder (conv) takes an input sample and converts its information into some vector (the latent vector).
* The decoder (deconv) takes that vector and expands it to reconstruct the input sample.
    * **What is the point of trying to generate an output that is the same as the input?**: There is no point. When using AEs we don't tend to care about the output itself, but rather about the ***vector constructed in the middle***. The latent vector is important because it is a representation of the input image, and it's in a form that the computer understands.
    * **What is so great about the latent vector?**: This vector itself has limited use, but we can feed it to more complex architectures to solve some really cool problems.
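
To make the encoder/decoder split concrete, here is a minimal sketch of the data flow. It is not a real autoencoder: the weights are hypothetical and fixed by hand (a real AE learns them during training), and plain matrix multiplication stands in for the conv/deconv layers mentioned above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical hand-picked weights; a trained autoencoder would learn these.
ENCODER_WEIGHTS = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]  # 4 input values -> 2 latent values
DECODER_WEIGHTS = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]  # 2 latent values -> 4 output values


def encode(sample):
    """Compress the input sample into a (shorter) latent vector."""
    return matvec(ENCODER_WEIGHTS, sample)


def decode(latent):
    """Expand a latent vector back into something shaped like the input."""
    return matvec(DECODER_WEIGHTS, latent)


sample = [0.2, 0.4, 0.9, 0.7]      # stand-in for a tiny 4-pixel image
latent = encode(sample)            # the "vector constructed in the middle"
reconstructed = decode(latent)     # an approximation of the original sample

print("latent vector:", latent)
print("reconstruction:", reconstructed)
```

The latent vector is shorter than the input, so the reconstruction is only approximate; that compression is what forces the latent vector to be a meaningful summary of the input.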

Something we cannot do with AEs is generate data. Following the AE architecture:
* **During training time**, we feed in the input images and make the model learn the encoder and decoder parameters required to reconstruct each image.
* **During testing time**, we only need the decoder, because this is the part that generates the image. To do this we need to input some vector (the latent vector), but we have no idea about the nature of this vector. We need some method to determine this ***hidden vector***. The **idea behind determining this vector is *Sampling from a Distribution***.
    * **Distribution**: Think of it as a pool of numbers. Consider the case where we want to build a generative model that generates different animals; to accomplish this, our generative model needs to learn to create a pool for cats, a pool for dogs, and another pool for giraffes. The dog pool is not actually a pool of images; instead it consists of vector representations of these images (which only the computer understands). A distribution is like a pool of vectors.
    * **Sampling**: *Close your eyes, reach into a pool and pick one vector*. If you know where the pool is, then you can go to the pool and randomly pick a vector. So when we say "I sample from the distribution of dog images", it is equivalent to saying that we picked a random vector from the dog pool (there could be many pools).
    * The problem with general AEs is that we as human beings don't really know where these pools are:
        * Each of these pools is learned by the model during training time, so when we feed it hundreds of images of animals, our model will find patterns linking similar dogs, cats and giraffes.
        * These pools (distributions) are learned internally by the autoencoder, but there is no way for humans to know where they are in order to use them for generating images.
        * During test time we are basically sampling from a random distribution. This is equivalent to blindfolding ourselves and picking a value from a huge box that contains valid vectors only in very specific locations and garbage vectors everywhere else. There is a very high chance that we'll pick an irrelevant garbage vector, from which we get an irrelevant garbage output.
            * We cannot generate dog images with an autoencoder because we don't know how to assign values to the vector during the generation phase; that is clearly a problem. What if we did know where to pick these vectors from? That would solve the problem, and VAEs do just that.
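
The "blindfolded pick from a huge box" picture can be simulated with a toy example. Everything below is an assumption made for illustration: a 2-D latent space, three circular pools with made-up centers and radius, and a box of made-up size. It only shows how rarely a blind pick lands in a pool compared with sampling when the pool location is known.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical 2-D latent space: three small "pools" inside a huge box.
POOL_CENTERS = {"dog": (40.0, -10.0), "cat": (-65.0, 30.0), "giraffe": (5.0, 80.0)}
POOL_RADIUS = 2.0
BOX_HALF_WIDTH = 100.0


def pool_of(vector):
    """Return the name of the pool containing the vector, or None for garbage."""
    for name, (cx, cy) in POOL_CENTERS.items():
        if (vector[0] - cx) ** 2 + (vector[1] - cy) ** 2 <= POOL_RADIUS ** 2:
            return name
    return None


def sample_blindfolded():
    """Pick a vector anywhere in the box, without knowing where the pools are."""
    return (
        rng.uniform(-BOX_HALF_WIDTH, BOX_HALF_WIDTH),
        rng.uniform(-BOX_HALF_WIDTH, BOX_HALF_WIDTH),
    )


def sample_near(name):
    """Pick a vector close to a pool whose location is known."""
    cx, cy = POOL_CENTERS[name]
    return (rng.gauss(cx, POOL_RADIUS / 4), rng.gauss(cy, POOL_RADIUS / 4))


n = 10_000
blind_hits = sum(pool_of(sample_blindfolded()) is not None for _ in range(n))
known_hits = sum(pool_of(sample_near("dog")) == "dog" for _ in range(n))

print(f"blindfolded picks that landed in any pool: {blind_hits} of {n}")
print(f"picks near the known dog pool that landed in it: {known_hits} of {n}")
```

The pools cover only a tiny fraction of the box, so almost every blind pick is a garbage vector, while picks made with knowledge of the pool location are almost always valid.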

Using VAEs, we first define a region to which we want to constrain this universe (that is, we constrain the region from which we want to pick the vectors), and within this region the goal of the VAE is to find the pools: this is done during the training phase.

During the testing phase, all we need to do now to generate an image is randomly sample a vector from this known region and then pass this vector to the ***VAE generator part (decoder)***. This will generate an image.

A neat property of the region mentioned before is that it is ***continuous***. We can slightly alter some values in the vector and still get valid-looking images.
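
The generation step and the continuity property can be sketched as follows. This is only an illustration under stated assumptions: `toy_decoder` is a hypothetical stand-in for a trained VAE decoder (its weights are made up), and the known region is assumed to be a standard Normal distribution over a 2-D latent space.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical stand-in for a trained decoder: 2-D latent -> 4 "pixels" in (0, 1).
TOY_WEIGHTS = [(1.0, -0.5), (0.3, 0.8), (-0.7, 0.2), (0.5, 0.5)]


def toy_decoder(z):
    """Map a latent vector to pixel intensities with a smooth (sigmoid) function."""
    return [1.0 / (1.0 + math.exp(-(w0 * z[0] + w1 * z[1]))) for w0, w1 in TOY_WEIGHTS]


def sample_latent(dim=2):
    """Draw a latent vector from the known region, assumed here to be N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Generation: sample a vector from the known region and decode it.
z = sample_latent()
image = toy_decoder(z)

# Continuity: nudge the latent vector a little and decode again.
z_nudged = [value + 0.05 for value in z]
image_nudged = toy_decoder(z_nudged)
max_pixel_change = max(abs(a - b) for a, b in zip(image, image_nudged))

print("generated image:", [round(p, 3) for p in image])
print("largest pixel change after the nudge:", round(max_pixel_change, 4))
```

Because the decoder is a smooth function of the latent vector, a small change in the vector produces only a small change in the output, which is why neighboring vectors in the region decode to similar images.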
## **Comparison with General Autoencoders**
* Why does each exist?
    * **General AE**: 
        * It learns a hidden representation of the input (that ***"vector"***). AEs cannot generate new data.
    * **VAE**: 
        * It also learns a hidden representation, but this representation is additionally used to generate new data.
* What are they optimizing?
    * **General AE**: 
        * It learns to transform an input into some vector by minimizing the reconstruction loss.
        * During training, an AE makes sure that what is thrown into it is also spit out: it tries to minimize the difference between the original and the reconstructed images, hence it seeks to minimize the reconstruction loss.
    * **VAE**: 
        * It generates images by minimizing the sum of the reconstruction loss (the same one we defined for AEs) and a latent loss. The latent loss ensures that all the pools learned by the network lie within the same region and close to each other: we assume the pools follow a Normal (Gaussian) distribution.
        * During testing, latent vectors are sampled from this Gaussian region: the pools learned for the individual inputs together form a mixture of Gaussians, which the latent loss pulls toward a single known Normal distribution.
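
The two-term objective can be written down in a few lines. The notes above do not fix the exact formulas, so this sketch makes two common assumptions: mean squared error for the reconstruction loss, and the KL divergence between the encoder's Gaussian (diagonal covariance) and a standard Normal for the latent loss. The function names are hypothetical.

```python
import math


def reconstruction_loss(original, reconstructed):
    """Mean squared error between the input and the reconstructed output."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def latent_loss(mean, log_variance):
    """KL divergence from N(mean, diag(exp(log_variance))) to N(0, I).

    Closed form: 0.5 * sum(mean^2 + variance - log(variance) - 1).
    It is zero exactly when mean = 0 and variance = 1.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_variance)
    )


def vae_loss(original, reconstructed, mean, log_variance):
    """Sum of the reconstruction loss and the latent loss."""
    return reconstruction_loss(original, reconstructed) + latent_loss(mean, log_variance)


# Made-up numbers for one sample: input, its reconstruction, and the
# mean / log-variance the encoder produced for its latent vector.
original = [0.2, 0.9]
reconstructed = [0.25, 0.8]
mean = [0.5, -0.5]
log_variance = [0.0, 0.0]

total = vae_loss(original, reconstructed, mean, log_variance)
print("reconstruction loss:", reconstruction_loss(original, reconstructed))
print("latent loss:", latent_loss(mean, log_variance))
print("total VAE loss:", total)
```

The latent term grows as a pool drifts away from the origin or its spread departs from 1, which is how it keeps all pools inside the same known region.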
## **Comparison with Generative Adversarial Networks (GANs)**  
* How does each learn to generate data?
    * **VAE**:
        * It has two losses to optimize. 
            * **Reconstruction loss**: makes sure that what goes into the network is also spit out, with as little difference as possible.
            * **Latent loss**: makes sure the latent vector takes only a specific set of values, so that we know which region to sample this vector from.

        By optimizing these two losses, our VAE learns to generate images. Remember that a VAE, like any AE, consists of an encoder and a decoder.
    * **GANs**:
        * They also have two components: a ***generator*** and a ***discriminator***.
            * **Generator**: responsible for generating images.
            * **Discriminator**: determines whether a given image is real or fake (that is, whether it was created by the generator).
        
        The generator and the discriminator play a ***minimax game*** in which each tries to outperform the other. 
        
        The generator tries to generate an image that fools the discriminator, making it think that the image is real. 
        
        The discriminator tries to correctly distinguish between real and fake images, catching the generator out. 
        
        If one of them messes up, then its parameters are slightly tweaked to improve performance. 
        
        Over thousands of images ***during training***, the generator and discriminator networks improve each other until the generator becomes proficient at generating animal images, and the discriminator becomes proficient at telling real images from the fake images produced by the generator.

        ***During testing*** we can just use the generator to spit out the images that we need.

* How stable is training? 
    * **GANs**: training a GAN involves finding something called a ***Nash equilibrium***, that is, a point in the game between the generator and the discriminator where neither can improve by changing its own strategy alone, so the game is safe to terminate. However, there is no concrete algorithm yet to actually determine this end-of-game point. 
    * **VAE**: Offers a closed-form objective: there is a nice little formula (the loss) that we can minimize directly and monitor to decide when to end the training phase.

* How good are the generated images?
    * **VAEs** work very well in theory, but they tend to generate blurry images (attributed to the fact that VAEs optimize two factors during the training phase: the reconstruction loss and the latent loss). 
        * Remember that the ***reconstruction loss*** makes sure that the output is as close to the input as possible, while the ***latent loss*** makes sure that the latent vector can only take a fixed range of values. 
        * These two factors often counter each other: there is a trade-off, and the middle ground usually leads to blurry generated images.
    * **GAN** training is more empirical and is tuned by trial and error; in practice, GANs often just work. 
        * You can write down the losses theoretically, but most of the intuition comes from the fact that we had the results before the actual theory. 
        * For simple spatial data like images, GANs produce really high-quality results (sharper images than those generated by VAEs).
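
As an example of writing the GAN losses down, here is a minimal sketch of the minimax game described in this section. The notes do not specify the loss functions, so this assumes binary cross-entropy on the discriminator's outputs (probabilities that an image is real) and the commonly used non-saturating form of the generator loss. The scores are made-up numbers, not the output of a trained network.

```python
import math


def binary_cross_entropy(prediction, target):
    """Cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(real_scores, fake_scores):
    """The discriminator wants real images scored near 1 and fakes near 0."""
    real_term = sum(binary_cross_entropy(s, 1.0) for s in real_scores) / len(real_scores)
    fake_term = sum(binary_cross_entropy(s, 0.0) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """The generator wants its fakes scored near 1 (non-saturating form)."""
    return sum(binary_cross_entropy(s, 1.0) for s in fake_scores) / len(fake_scores)


real_scores = [0.9, 0.95]       # discriminator outputs on real images
caught_fakes = [0.1, 0.05]      # fakes the discriminator sees through
convincing_fakes = [0.8, 0.9]   # fakes that fool the discriminator

sharp_d = discriminator_loss(real_scores, caught_fakes)
fooled_d = discriminator_loss(real_scores, convincing_fakes)

print("discriminator loss when it catches the fakes:", round(sharp_d, 3))
print("discriminator loss when it is fooled:", round(fooled_d, 3))
print("generator loss when caught:", round(generator_loss(caught_fakes), 3))
print("generator loss when convincing:", round(generator_loss(convincing_fakes), 3))
```

The same fake scores push the two losses in opposite directions: what lowers the generator's loss raises the discriminator's, which is the tug-of-war that makes GAN training harder to stabilize than minimizing a single VAE objective.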
