Enc_chain

What is this all about

The aim of this project is to empirically investigate the effect that variational autoencoders (VAEs), used as generative models, have on the data distribution. In particular, we focus on the capability of VAEs to preserve data variability.

This is achieved by building a chain of VAEs, each one trained on a dataset generated by the previous one. The first autoencoder of the chain is trained on an "ad hoc" dataset made of grids in which pixels (actually square clusters of nine pixels) are turned on according to a given distribution (unimodal or bimodal; usually the former).
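
To make the setup concrete, here is a minimal sketch of how such a grid dataset could be generated. This is illustrative only: the repo's gen_dataset.py is the real generator, and the grid size, sample count, Gaussian parameters and one-cluster-per-grid choice below are all assumptions.

```python
# Illustrative sketch only: the repo's gen_dataset.py is the real generator,
# and the grid size, sample count and Gaussian parameters here are assumptions.
import gzip

import numpy as np

rng = np.random.default_rng(42)
GRID = 28         # assumed image side
N_SAMPLES = 1000  # assumed dataset size

def make_grid(mean=GRID // 2, std=3.0):
    """One grid with a 3x3 cluster of pixels turned on at a position
    drawn from a clipped unimodal Gaussian (the 'single-peaked' case)."""
    img = np.zeros((GRID, GRID), dtype=np.uint8)
    r, c = np.clip(rng.normal(mean, std, size=2).round().astype(int), 1, GRID - 2)
    img[r - 1:r + 2, c - 1:c + 2] = 255  # the "pixel" is really a 3x3 square
    return img

dataset = np.stack([make_grid() for _ in range(N_SAMPLES)])
with gzip.open("original_dset-ubyte.gz", "wb") as f:  # name used in the pipeline below
    f.write(dataset.tobytes())
```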

The chain structure is used to amplify any systematic effect introduced by the VAEs and to distinguish systematic effects from random ones.

Some possible modifications to the standard VAE architecture are then designed (and implemented) to try to mitigate the loss of data variability.

The outcome is interesting (in my opinion) because this is a very simple and particular instance of a much broader topic: AI models trained on AI-generated data.

What you will find in this repository

The repo is structured in the following way:

  • This README file
  • src/: directory containing all source code used for the project; contains:
    • chain_lib.py: library containing all classes and functions used in the scripts
    • gen_dataset.py: script that can be used to generate a synthetic dataset with pixels turned on according to some chosen single-peaked (unimodal) distribution
    • gen_dataset_bimodal.py: script that can be used to generate a synthetic dataset with pixels turned on according to some chosen double-peaked (bimodal) distribution
    • make_chain.py: script that can be used to run a (variable length) chain of VAEs starting from a target dataset
    • comp_distribution.py: script that can be used to compute the distribution of the turned-on pixels in a given dataset (a minimal sketch of this computation follows the list)
    • plot_variability.py: script that can be used to compute and plot the variability of the sequence of datasets produced by a chain
    • plot_difference.py: script that can be used to compute and plot the difference between the train and generated datasets' distributions for every model of a chain
    • show_grids.ipynb: notebook that can be used to show a certain number of random images from a given dataset (used to check that the autoencoders are working as expected)
  • data/: directory containing data gathered with different parameters and the run.sh scripts used to obtain them (the zipped data are removed; only their graphical representations are left); each subdirectory contains a README.md file with a description of the parameters used
  • Enc_chain-Presentiation.pdf: slide presentation of the project
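
As an aside, the core computation in comp_distribution.py can be pictured roughly as follows. This is a hedged sketch (the real script's CLI, file format and binarisation threshold may differ), reusing the assumed grid size from the sketch above.

```python
# Rough sketch of the kind of computation comp_distribution.py performs
# (its actual CLI and file format may differ): the spatial distribution
# of turned-on pixels aggregated over a whole dataset.
import gzip

import matplotlib.pyplot as plt
import numpy as np

GRID = 28  # assumed image side, as in the sketch above

with gzip.open("original_dset-ubyte.gz", "rb") as f:
    data = np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, GRID, GRID)

pixel_freq = (data > 127).mean(axis=0)  # fraction of images in which each pixel is on

plt.imshow(pixel_freq, cmap="viridis")
plt.colorbar(label="fraction of images with pixel on")
plt.savefig("original_dist.png")
```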

General pipeline

All data gathered for this project were obtained with the following general procedure (a hypothetical end-to-end driver is sketched after the list):

  1. Create an empty directory inside data/ to store the initial, intermediate and final datasets, together with the related distributions
  2. Inside the directory created at point 1., create an initial dataset called original_dset-ubyte.gz using the gen_dataset.py script (set the desired parameters by using the script's command line arguments)
  3. Run the chain using make_chain.py, setting the desired command line arguments (notice that the arguments passed to this script must be consistent with the ones used to generate the dataset at step 2., and that the path to the directory used to store the datasets must be passed from the command line as well)
  4. For all datasets produced by the chain (saved in the directory created at point 1.), compute the related distribution of turned-on pixels using comp_distribution.py, remembering to pass as command line arguments both the path to the dataset and the path where you want the related distribution's plot to be stored.
  5. Use plot_variability.py and plot_difference.py to produce, respectively, the plot of the variability over all datasets of the chain and that of the differences between the train and generated datasets of each model in the chain.
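
For orientation, the whole procedure could be driven end to end with something like the script below. Every flag name here is an illustrative guess, not a documented argument of the repo's scripts; check each script's --help for the real interface.

```python
# Hypothetical end-to-end driver for the steps above. All flag names are
# guesses for illustration; they are NOT the scripts' documented arguments.
import subprocess
from pathlib import Path

run_dir = Path("data/my_run")            # step 1: directory for all datasets
run_dir.mkdir(parents=True, exist_ok=True)

def run(*args):
    subprocess.run(["python", *args], check=True)

run("src/gen_dataset.py", "--out", str(run_dir / "original_dset-ubyte.gz"))   # step 2
run("src/make_chain.py", "--data-dir", str(run_dir), "--chain-length", "10")  # step 3
for dset in sorted(run_dir.glob("*.gz")):                                     # step 4
    run("src/comp_distribution.py", "--dset", str(dset),
        "--out", str(dset.with_suffix(".png")))
run("src/plot_variability.py", "--data-dir", str(run_dir))                    # step 5
run("src/plot_difference.py", "--data-dir", str(run_dir))
```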

Reproducing results

To reproduce the data used in this project you can:

  1. Navigate to data/
  2. Navigate to the data folder corresponding to the data you want to reproduce
  3. Run bash run.sh

The Bash script will produce an initial dataset called original_dset.gz (and the corresponding pixel distribution original_dist.png), the intermediate datasets called dset_$i.gz (and the corresponding pixel distributions dist_$i.png), where $i is the index of the autoencoder in the chain used to generate the dataset, and the final dataset called final_dset.gz (and the corresponding pixel distribution final_dist.png).

The script will also produce the plot of the variability of all datasets produced by the chain and that of the differences between each model's input (i.e. train) dataset and output (i.e. generated) dataset, called respectively variability.png and difference.png.
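
Since this README does not pin down the variability measure, the sketch below uses one plausible proxy (the spread of turned-on pixel coordinates) purely for illustration; the file names follow the run.sh outputs described above, and the chain length and grid size are assumptions.

```python
# Illustrative only: the variability measure is not specified in this README,
# so this sketch uses the standard deviation of turned-on pixel coordinates
# as one plausible proxy. Chain length and grid size are assumptions.
import gzip

import matplotlib.pyplot as plt
import numpy as np

GRID = 28
CHAIN_LENGTH = 10  # assumed number of models in the chain

def load(path):
    with gzip.open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, GRID, GRID)

def variability(dataset):
    _, rows, cols = np.nonzero(dataset > 127)  # coordinates of all on pixels
    return rows.std() + cols.std()             # total positional spread

files = (["original_dset.gz"]
         + [f"dset_{i}.gz" for i in range(1, CHAIN_LENGTH)]
         + ["final_dset.gz"])
scores = [variability(load(p)) for p in files]

plt.plot(scores, marker="o")
plt.xlabel("position in chain")
plt.ylabel("variability (proxy)")
plt.savefig("variability.png")
```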

Finally, a file called report.out will be produced, containing the output of the call to make_chain.py, in particular the elapsed time of each step of the chain.

References

  1. This is the tutorial I used to build VAEs
  2. This article explains in more detail how VAEs work
