# 5. Comparative Analysis and Reporting

First, the experiments must be read in context: the dataset used is relatively small. Consequently, high-quality results from complex architectures are not expected; the main goal is to gain insight into how VAE and GAN models work for image generation and how they are trained.

## VAE models results

In terms of reconstruction, most faces are regenerated fairly faithfully, though in some cases the reconstruction shows a face with glasses or of a different gender than the input. However, the details surrounding the face, such as hair or pose, are heavily blurred and do not match the rest of the image well.

Similarly, for images generated by randomly selecting points in the latent space, the faces are of decent quality, but the rest of the image is blurred and shows some artifacts. This is likely because pixel distributions are consistent across faces but highly variable for the other elements of the image.
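
As a rough illustration of what "randomly selecting points in the latent space" means, the sketch below draws latent vectors from a standard normal prior and passes them through a decoder. The `toy_decoder` function, the latent dimensionality, and the image shape are hypothetical stand-ins chosen for the example; they are not the trained models or settings from these experiments.

```python
import numpy as np

def sample_latents(num_samples, latent_dim, seed=0):
    """Draw latent vectors from the standard normal prior N(0, I)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_samples, latent_dim))

def toy_decoder(z, image_shape=(8, 8, 3), seed=1):
    """Hypothetical stand-in for a trained decoder (assumption, not the real model).

    A fixed random linear map followed by a sigmoid, so outputs lie in (0, 1)
    like normalized pixel intensities.
    """
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((z.shape[1], int(np.prod(image_shape))))
    logits = z @ weights / np.sqrt(z.shape[1])
    return (1.0 / (1.0 + np.exp(-logits))).reshape((-1, *image_shape))

z = sample_latents(num_samples=4, latent_dim=16)
images = toy_decoder(z)
print(images.shape)  # (4, 8, 8, 3)
```

With a trained VAE, the decoder network would take the place of `toy_decoder`; the sampling step stays the same.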

Comparing the experiments, both runs show acceptable performance and clear, similar convergence. However, because the first run produces smaller outputs, its results are qualitatively worse than those of the second model. Additionally, the FID metric may not fully represent generation quality. Rescaling images to 75x75 so that they can be passed through the InceptionV3 layers used for FID reduces the differences between distributions; the effect is most noticeable for the smaller images, which are smoothed when upscaled. This issue persists with GANs, making it difficult to compare models based on this metric.
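
For reference, the Fréchet distance underlying FID can be sketched as follows. This is a minimal NumPy version that assumes the Inception features have already been extracted; the random arrays below are placeholders for those features, and the function name `frechet_distance` is chosen for the example rather than taken from the experiments' code.

```python
import numpy as np

def frechet_distance(features_a, features_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    """
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.cov(features_a, rowvar=False)
    cov_b = np.cov(features_b, rowvar=False)
    # Tr((S_a S_b)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S_a S_b, which are real and non-negative for covariance matrices
    # (up to round-off, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Placeholder features (assumption): real pipelines would use Inception activations.
rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 8))
same = real.copy()       # identical statistics: distance close to 0
shifted = real + 1.0     # same covariance, mean shifted by 1 in each of 8 dims

d_same = frechet_distance(real, same)
d_shifted = frechet_distance(real, shifted)
print(d_same, d_shifted)  # approximately 0 and approximately 8
```

Because the distance depends only on the mean and covariance of the features, any preprocessing that changes those statistics, such as smoothing during rescaling, shifts the score independently of the generator's quality.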

In the third experiment, we increase the latent space dimensionality, which gives the VAE more flexibility to capture the complexity and diversity of the facial characteristics present in the dataset. Consequently, the generated faces exhibit higher fidelity and diversity, leading to a lower Fréchet Inception Distance (FID) score.

Restricting the comparison to images of the same size shows that the metric does reflect image quality to some extent. There is a point beyond which the models cannot improve, owing to their limited capacity and the available data. Performance across the models is considered roughly even, apart from the superior level of detail of the second and third models, whose greater network and latent space complexity compensates for the increased resolution.

Lastly, a grayscale version of the second model was tested but discarded because its results were nearly identical to those of the color version, which was kept for simplicity.


## GAN models results

In general, we consider the images generated by some of the trained networks quite satisfactory, particularly those from run 2, despite the simplicity of the architectures, which mainly comprise convolutions and some regularization techniques. We confirm that GANs can indeed generate faces from random points in the latent space, though there is room for improvement. We also observe significant diversity in the generated faces, which span various genders, hairstyles, skin tones, poses, and other facial characteristics, without any evident mode dropping. Although results are consistent overall, in some instances faces blend into the background or appear overly blurry. Some artifacts are also observed, likely attributable to dropout during training.

Training two networks effectively through adversarial learning demands meticulous hyperparameter tuning to maintain a delicate balance between the critic and the generator, so that neither overpowers the other. An overly strong generator led to high generator accuracy but poor-quality generated images. Furthermore, the accuracies of both the critic and the generator oscillated continuously between high and low values, impeding substantial improvement. Improving the critic loss appeared comparatively easier than improving the generator loss, possibly because of the critic's more frequent updates.
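
To make the idea of more frequent critic updates concrete, the following sketch builds the order in which the two networks would be updated when the critic takes several steps per generator step. The ratio of 2 and the name `critic_steps_per_generator_step` are illustrative assumptions; the actual ratio used in these runs is not stated here.

```python
def update_schedule(num_steps, critic_steps_per_generator_step=2):
    """Return the order of network updates for an alternating training loop.

    Each outer step performs several critic updates followed by one generator
    update. The ratio is a hypothetical hyperparameter for illustration.
    """
    schedule = []
    for _ in range(num_steps):
        schedule.extend(["critic"] * critic_steps_per_generator_step)
        schedule.append("generator")
    return schedule

schedule = update_schedule(num_steps=3, critic_steps_per_generator_step=2)
print(schedule)
# ['critic', 'critic', 'generator', 'critic', 'critic', 'generator', 'critic', 'critic', 'generator']
```

Under such a schedule the critic receives twice as many gradient updates as the generator, which is one plausible reason its loss would improve more readily.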

Regarding the FID metric, we find it suboptimal for comparing different approaches, especially across images of varying resolutions, because the upsampling required to fit the input shape of the Inception module can distort the results through the different smoothing it applies.

Concerning specific runs, run 2, which adds skip connections to the baseline architecture, was the most successful. The skip connections allowed the network to train for more epochs than the baseline (45 instead of 30) without overfitting. However, attempts to improve results by increasing network complexity (run 3) led to significant divergence and poor outcomes. Similarly, efforts to mitigate divergence by reducing the learning rate, adjusting the batch size, or increasing beta2 (run 4) did not yield better results. Applying the more complex network to black-and-white images further degraded the quality of the results. Finally, using dilated convolutions destabilized training even more, with very strong oscillations between high and low accuracy values for both the critic and the generator. The likely cause is the low resolution of the training images, for which the added complexity is excessive and results in overfitting.


## Comparison VAEs vs GANs

Several distinctions have been observed between Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs):

1. **Image Quality**: GANs are renowned for their ability to generate high-quality, realistic images with complex details. This is primarily attributed to their adversarial training mechanism, in which the generator is continually challenged by the discriminator to produce images that are indistinguishable from real ones. As a result, GANs produce sharper images with more detail, while VAEs generate notably blurry images. This blurriness arises from the reconstruction process, in which VAEs attempt to match the input data distribution and sacrifice some fidelity in doing so. Consequently, they do not capture the nuanced details needed to generate highly realistic images as GANs do.
2. **Artifacts**: Despite their ability to generate high-quality images, GANs are susceptible to producing artifacts, that is, unintended visual anomalies or distortions in the generated images. These are possibly attributable to dropout during training and detract from the overall realism.
3. **Image Diversity**: When generating images from random points in the latent space, GANs produce a wide array of outputs, spanning different poses, expressions, backgrounds, and other attributes. This is particularly striking when compared to VAEs, where the inherent blurriness often makes images appear more homogeneous and less distinct from one another.
4. **FID Metric**: VAEs typically yield better Fréchet Inception Distance (FID) values, although this metric might not be very representative for comparing these models. Despite their architectural similarities, VAEs leverage features from input images to generate outputs, while GANs generate images directly from random noise. Moreover, comparisons between generations of different resolutions are distorted by the upsampling required to fit the input shape of the Inception module used for computing FID.
5. **Training Dynamics**: VAEs typically converge more easily and smoothly during training than GANs. VAEs often follow a typical loss curve, with fast initial improvement followed by flattening, whereas GAN training involves frequent oscillations and even divergence. This instability stems from the dual-network structure of GANs: the objective being optimized alternates between the critic and the generator, which makes training more volatile and each network reliant on the other. Consequently, GANs are prone to overfitting and divergence, while VAEs tend to exhibit more stable training.
6. **Training Speed**: GANs, due to their increased architectural complexity, generally train more slowly than VAEs.
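
To illustrate the single objective behind the smoother VAE training described above, here is a minimal sketch of a commonly used VAE loss: a pixel-wise reconstruction term plus the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The squared-error reconstruction term is an assumption made for the example; the exact loss used in these experiments is not specified in this section. A pixel-wise term of this kind is also a commonly cited reason for blurry VAE outputs.

```python
import numpy as np

def vae_loss(x, x_reconstructed, mu, log_var):
    """Mean VAE loss over a batch, returned as (total, reconstruction, kl).

    x, x_reconstructed: arrays of shape (batch, num_pixels)
    mu, log_var: posterior parameters of shape (batch, latent_dim)
    Reconstruction uses squared error (assumption for this sketch).
    """
    recon = np.sum((x - x_reconstructed) ** 2, axis=1)
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + kl)), float(np.mean(recon)), float(np.mean(kl))

x = np.zeros((2, 6))
# Perfect reconstruction with a posterior equal to the prior: every term is zero.
total_zero, _, _ = vae_loss(x, x, mu=np.zeros((2, 4)), log_var=np.zeros((2, 4)))
# Posterior mean of 1 in each of 4 latent dims: KL = 0.5 * 4 = 2 per sample.
total_kl, recon_kl, kl_only = vae_loss(x, x, mu=np.ones((2, 4)), log_var=np.zeros((2, 4)))
print(total_zero, total_kl)  # 0.0 2.0
```

Both terms are minimized jointly by one optimizer, in contrast to the alternating critic and generator objectives of a GAN.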
