Enc_chain

What is this all about

The aim of this project is to empirically investigate the effect that variational autoencoders (VAEs), used as generative models, have on the data distribution. In particular, we focus on the capability of VAEs to preserve data variability.

This is achieved by building a chain of VAEs, each one trained on a dataset generated by the previous one. The first autoencoder of the chain is trained on an "ad hoc" dataset made of grids in which pixels (actually square clusters of nine pixels) are turned on according to a given distribution (unimodal or bimodal; usually the former).
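
To make the setup concrete, here is a minimal sketch of how such a grid dataset could be generated. This is illustrative only: the repo's gen_dataset.py is the real generator, and the grid size, sample count, Gaussian parameters and one-cluster-per-grid choice below are all assumptions.

```python
# Illustrative sketch only: the repo's gen_dataset.py is the real generator,
# and the grid size, sample count and Gaussian parameters here are assumptions.
import gzip

import numpy as np

rng = np.random.default_rng(42)
GRID = 28         # assumed image side
N_SAMPLES = 1000  # assumed dataset size

def make_grid(mean=GRID // 2, std=3.0):
    """One grid with a 3x3 cluster of pixels turned on at a position
    drawn from a clipped unimodal Gaussian (the 'single-peaked' case)."""
    img = np.zeros((GRID, GRID), dtype=np.uint8)
    r, c = np.clip(rng.normal(mean, std, size=2).round().astype(int), 1, GRID - 2)
    img[r - 1:r + 2, c - 1:c + 2] = 255  # the "pixel" is really a 3x3 square
    return img

dataset = np.stack([make_grid() for _ in range(N_SAMPLES)])
with gzip.open("original_dset-ubyte.gz", "wb") as f:  # name used in the pipeline below
    f.write(dataset.tobytes())
```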

The chain structure is used to amplify any systematic effect introduced by the VAEs and to distinguish systematic effects from random ones.

Some possible modifications to the standard VAE architecture are then designed (and implemented) to try to mitigate the loss of data variability.

The outcome is interesting (in my opinion) because this is a very simple and particular instance of a much broader topic: AI models trained on AI-generated data.

What you will find in this repository

The repo is structured in the following way:

  • This README file
  • src/: directory containing all source code used for the project; contains:
    • chain_lib.py: library containing all classes and functions used in the scripts
    • gen_dataset.py: script that can be used to generate a synthetic dataset with pixels turned on according to some chosen single-peaked (unimodal) distribution
    • gen_dataset_bimodal.py: script that can be used to generate a synthetic dataset with pixels turned on according to some chosen double-peaked (bimodal) distribution
    • make_chain.py: script that can be used to run a (variable length) chain of VAEs starting from a target dataset
    • comp_distribution.py: script that can be used to compute the distribution of the turned-on pixels in a given dataset (a minimal sketch of this computation follows the list)
    • plot_variability.py: script that can be used to compute and plot the variability of the sequence of datasets produced by a chain
    • plot_difference.py: script that can be used to compute and plot the difference between the train and generated datasets' distributions for every model of a chain
    • show_grids.ipynb: notebook that can be used to show a certain number of random images from a given dataset (used to check that the autoencoders are working as expected)
  • data/: directory containing data gathered with different parameters and the run.sh scripts used to obtain them (the zipped data are removed; only their graphical representations are left); each subdirectory contains a README.md file with a description of the parameters used
  • Enc_chain-Presentiation.pdf: slide presentation of the project
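
As an aside, the core computation in comp_distribution.py can be pictured roughly as follows. This is a hedged sketch (the real script's CLI, file format and binarisation threshold may differ), reusing the assumed grid size from the sketch above.

```python
# Rough sketch of the kind of computation comp_distribution.py performs
# (its actual CLI and file format may differ): the spatial distribution
# of turned-on pixels aggregated over a whole dataset.
import gzip

import matplotlib.pyplot as plt
import numpy as np

GRID = 28  # assumed image side, as in the sketch above

with gzip.open("original_dset-ubyte.gz", "rb") as f:
    data = np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, GRID, GRID)

pixel_freq = (data > 127).mean(axis=0)  # fraction of images in which each pixel is on

plt.imshow(pixel_freq, cmap="viridis")
plt.colorbar(label="fraction of images with pixel on")
plt.savefig("original_dist.png")
```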

General pipeline

All data gathered for this project were obtained with the following general procedure (a hypothetical end-to-end driver is sketched after the list):

  1. Create an empty directory inside data/ to store the initial, intermediate and final datasets, together with the related distributions
  2. Inside the directory created at point 1., create an initial dataset called original_dset-ubyte.gz using the gen_dataset.py script (set the desired parameters by using the script's command line arguments)
  3. Run the chain using make_chain.py, setting the desired command line arguments (notice that the arguments passed to this script must be consistent with the ones used to generate the dataset at step 2., and that the path to the directory used to store the datasets must be passed from the command line as well)
  4. For all datasets produced by the chain (saved in the directory created at point 1.), compute the related distribution of turned-on pixels using comp_distribution.py, remembering to pass as command line arguments both the path to the dataset and the path where you want the related distribution's plot to be stored.
  5. Use plot_variability.py and plot_difference.py to produce, respectively, the plot of the variability over all datasets of the chain and that of the differences between the train and generated datasets of each model in the chain.
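
For orientation, the whole procedure could be driven end to end with something like the script below. Every flag name here is an illustrative guess, not a documented argument of the repo's scripts; check each script's --help for the real interface.

```python
# Hypothetical end-to-end driver for the steps above. All flag names are
# guesses for illustration; they are NOT the scripts' documented arguments.
import subprocess
from pathlib import Path

run_dir = Path("data/my_run")            # step 1: directory for all datasets
run_dir.mkdir(parents=True, exist_ok=True)

def run(*args):
    subprocess.run(["python", *args], check=True)

run("src/gen_dataset.py", "--out", str(run_dir / "original_dset-ubyte.gz"))   # step 2
run("src/make_chain.py", "--data-dir", str(run_dir), "--chain-length", "10")  # step 3
for dset in sorted(run_dir.glob("*.gz")):                                     # step 4
    run("src/comp_distribution.py", "--dset", str(dset),
        "--out", str(dset.with_suffix(".png")))
run("src/plot_variability.py", "--data-dir", str(run_dir))                    # step 5
run("src/plot_difference.py", "--data-dir", str(run_dir))
```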

Reproducing results

To reproduce the data used in this project you can:

  1. Navigate to data/
  2. Navigate to the data folder corresponding to the data you want to reproduce
  3. Run bash run.sh

The Bash script will produce an initial dataset called original_dset.gz (and the corresponding pixel distribution original_dist.png), the intermediate datasets called dset_$i.gz (and the corresponding pixel distributions dist_$i.png), where $i is the index of the autoencoder in the chain used to generate the dataset, and the final dataset called final_dset.gz (and the corresponding pixel distribution final_dist.png).

The script will also produce the plot of the variability of all datasets produced by the chain and that of the differences between each model's input (i.e. train) dataset and output (i.e. generated) dataset, called respectively variability.png and difference.png.
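
Since this README does not pin down the variability measure, the sketch below uses one plausible proxy (the spread of turned-on pixel coordinates) purely for illustration; the file names follow the run.sh outputs described above, and the chain length and grid size are assumptions.

```python
# Illustrative only: the variability measure is not specified in this README,
# so this sketch uses the standard deviation of turned-on pixel coordinates
# as one plausible proxy. Chain length and grid size are assumptions.
import gzip

import matplotlib.pyplot as plt
import numpy as np

GRID = 28
CHAIN_LENGTH = 10  # assumed number of models in the chain

def load(path):
    with gzip.open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, GRID, GRID)

def variability(dataset):
    _, rows, cols = np.nonzero(dataset > 127)  # coordinates of all on pixels
    return rows.std() + cols.std()             # total positional spread

files = (["original_dset.gz"]
         + [f"dset_{i}.gz" for i in range(1, CHAIN_LENGTH)]
         + ["final_dset.gz"])
scores = [variability(load(p)) for p in files]

plt.plot(scores, marker="o")
plt.xlabel("position in chain")
plt.ylabel("variability (proxy)")
plt.savefig("variability.png")
```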

Finally, a file called report.out will be produced, containing the output of the call to make_chain.py, in particular the elapsed time of each step of the chain.

References

  1. This is the tutorial I used to build VAEs
  2. This article explains in more detail how VAEs work
