LS4GAN/toytools

toytools -- A collection of functions to handle toyzero dataset

toytools provides a number of handy functions, pytorch Datasets and scripts to work with data produced by the toyzero package.

Installation

toytools is intended for developers, so the best way to install it is to run

python setup.py develop

Requirements

The toytools package has two mandatory dependencies: numpy and pandas. To use the pytorch Datasets, you also need pytorch installed. Some scripts additionally depend on tqdm to display progress bars.

Overview

The contents of the toytools package can be logically separated into several categories:

  • PyTorch Datasets
  • Generic Functions
  • Scripts

Below you will find a brief overview of each.

PyTorch Datasets

Logically, there are two classes of toyzero datasets:

  • Version 0 datasets: SimpleToyzeroDataset, PreSimpleToyzeroDataset, PreUnalignedToyzeroDataset, PreCroppedToyzeroDataset. These datasets return pairs of images, where the first element is from domain "a" ("fake" domain) and the second is from domain "b" ("real" domain).

    These datasets implement their own random sampling that is not easy to use in a multiprocessing setting.

  • Version 1 datasets: PreSimpleToyzeroDatasetV1, PreCroppedToyzeroDatasetV1. These datasets are similar to their v0 counterparts, but return a single image, either from domain "a" or domain "b", depending on how they were instantiated.

    Version 1 datasets support only deterministic sampling and therefore require a separate random sampler. But, on the upside, there are no surprises when one uses these datasets in the multiprocessing setting.
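The split between a deterministic dataset and a separate random sampler can be illustrated with a minimal, framework-free sketch. The class and function names below are hypothetical and are not part of toytools; with pytorch, the sampler role is usually filled by `torch.utils.data.RandomSampler` or by `DataLoader(..., shuffle=True)`.

```python
import random


class DeterministicImageList:
    """Hypothetical stand-in for a v1-style dataset: index in, one item out."""

    def __init__(self, images):
        self._images = list(images)

    def __len__(self):
        return len(self._images)

    def __getitem__(self, index):
        # No randomness here: the same index always yields the same item.
        return self._images[index]


def random_index_order(n_items, seed):
    """Return a shuffled list of indices; the sampler owns all randomness."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    return order


dataset = DeterministicImageList(["img0", "img1", "img2", "img3"])
order = random_index_order(len(dataset), seed=0)
epoch = [dataset[i] for i in order]
```

Because the dataset itself holds no random state, worker processes that each receive a copy of it cannot end up drawing the same "random" samples by accident.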

SimpleToyzeroDataset

This is a very simple dataset object that loads toyzero images on demand. It supports cropping images, filtering images by wire plane, and automatically splitting the dataset into training and validation parts.

If image cropping is required, this dataset tries its best to select only regions of the image that contain signal. It retries random cropping up to 100 times to find signal regions of the image. However, some toyzero images are completely empty, so this approach fails for such images.
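The retry logic described above can be sketched roughly as follows. This is an illustration only, not the toytools implementation: the function name is hypothetical, the background is assumed to be zero, and "signal" is assumed to mean the number of nonzero pixels in the crop.

```python
import numpy as np


def try_random_crop(image, crop_shape, min_signal, rng, max_retries=100):
    """Return (row, col) of a crop with enough nonzero pixels, or None."""
    crop_h, crop_w = crop_shape
    max_row = image.shape[0] - crop_h
    max_col = image.shape[1] - crop_w
    if max_row < 0 or max_col < 0:
        return None  # the image is smaller than the requested crop

    for _ in range(max_retries):
        # Upper bound of rng.integers is exclusive, hence the +1.
        row = int(rng.integers(0, max_row + 1))
        col = int(rng.integers(0, max_col + 1))
        crop = image[row:row + crop_h, col:col + crop_w]
        if np.count_nonzero(crop) >= min_signal:
            return (row, col)

    # Every attempt failed, e.g. because the image is completely empty.
    return None


rng = np.random.default_rng(0)
empty = np.zeros((64, 64))
busy = np.ones((64, 64))
```

For the completely empty image, all 100 attempts fail and the function returns None, which is the failure case mentioned above.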

PreSimpleToyzeroDataset

This is another simple dataset object that loads toyzero images according to a preprocessed list obtained with the scripts/preprocess script. Using the preprocessed list significantly speeds up image loading, since there is no need to perform an expensive search for signal regions in the image. Preprocessing also removes completely empty images from the dataset.

Similar to SimpleToyzeroDataset, PreSimpleToyzeroDataset automatically separates the preprocessed dataset into training and validation parts.

PreUnalignedToyzeroDataset

This dataset behaves like PreSimpleToyzeroDataset, except that it returns unpaired images during the training phase. During the testing and validation phases, the returned images are paired as in PreSimpleToyzeroDataset.

PreCroppedToyzeroDataset

This dataset loads cropped images created by the precrop script. It supports returning either paired or unpaired images depending on the constructor arguments.

NOTE

All datasets described in this section are implemented in a pytorch-independent way, which allows them to be reused with other frameworks.

You can use get_toyzero_dataset to select a pytorch-independent dataset by name. The get_toyzero_dataset_torch function selects a pytorch-dependent dataset instead.

Generic Functions

toytools comes with a variety of functions to handle toyzero data:

  • toytools.collect -- contains functions to search for images in the toyzero image directory, parse image names, perform filtering based on image APA or wire plane, and load the images themselves.

  • toytools.transform -- contains functions to determine the background value of an image, determine whether a given image is empty, crop images, and search for regions of an image that contain signal.

  • toytools.plot -- a few functions to simplify plotting of the toyzero images.
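To make the background and emptiness checks more concrete, here is a small sketch of one way such helpers could work. It does not reproduce the toytools.transform code: both function names are hypothetical, and taking the most frequent pixel value as the background is an assumption made for illustration.

```python
import numpy as np


def guess_background(image):
    """Assumption: the background is the most frequent pixel value."""
    values, counts = np.unique(image, return_counts=True)
    return values[np.argmax(counts)]


def looks_empty(image, background):
    """Treat an image as empty when every pixel equals the background."""
    return bool(np.all(image == background))


image = np.full((8, 8), 5.0)
image[2, 3] = 9.0  # a single pixel of signal
bkg = guess_background(image)
```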

Scripts

You can find four scripts in the scripts directory:

  1. preprocess
  2. view_dataset
  3. train_test_split
  4. precrop

preprocess

This script can be used to preprocess images of the toyzero dataset. Preprocessing includes determining the background value of each image and selecting randomly cropped regions of the image that contain signal. Filtering images by APA and wire plane is also supported.

The results of preprocessing are saved in a csv file, suitable to be used with the PreSimpleToyzeroDataset dataset.

Preprocessing is a rather slow procedure, so the preprocess script relies heavily on process-based parallelism.

view_dataset

This script can be used to plot images generated by various datasets (e.g. SimpleToyzeroDataset or PreSimpleToyzeroDataset). The images can be either viewed in an interactive mode (default), or saved to files in a batch mode (when the --plotdir argument is specified).

train_test_split

This script can be used to split the preprocessed list of image crops (created with the preprocess script) into training and test parts.

precrop

This script extracts the cropped regions found with the preprocess script and saves them as separate images, performing a training/validation split in the process. The extracted cropped regions also have their background value subtracted. They can be loaded with the PreCroppedToyzeroDataset dataset.

Extracting cropped regions significantly speeds up subsequent data loading.

Usage Examples

Generate Image Crops

To generate image crops of shape 512x512 for the U plane you can use the following command

python scripts/preprocess --plane U -n 4 --min-signal 500 -s 512x512 \
   /path/to/data LABEL

Here, -n 4 controls the number of crops to extract from a single image. The parameter --min-signal 500 sets the minimum number of nonzero pixels required to keep a cropped image. If the number of nonzero pixels after cropping is less than 500, the cropped image is discarded and the cropping process is retried. /path/to/data is the directory where your toyzero dataset is located. Finally, LABEL is a label that will be given to the cropped dataset (this label is embedded into the generated dataset name).

Once the preprocess script finishes, it creates a file LABEL-U-512x512.csv that contains a list of cropped regions.

Splitting Generated Image Crops into Train/Test Parts

To split the generated image crops into training and test parts, we can run the train_test_split script:

python scripts/train_test_split \
   --test-size 4000 /path/to/data/LABEL-U-512x512.csv

It will split the preprocessed image crops /path/to/data/LABEL-U-512x512.csv into two parts:

  • Training samples /path/to/data/LABEL-U-512x512-train.csv
  • Test samples /path/to/data/LABEL-U-512x512-test.csv
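The naming convention and the fixed-size hold-out can be sketched as below. This is a rough illustration under stated assumptions, not the script's actual code: the helper names are hypothetical, --test-size is assumed to be an absolute number of samples (as the value 4000 suggests), and the real script may select test rows differently.

```python
import random
from pathlib import Path


def split_paths(csv_path):
    """Build the '-train' and '-test' file names next to the input file."""
    path = Path(csv_path)
    train = path.with_name(f"{path.stem}-train{path.suffix}")
    test = path.with_name(f"{path.stem}-test{path.suffix}")
    return train, test


def split_rows(rows, test_size, seed=0):
    """Shuffle row indices and hold out exactly test_size of them."""
    if not 0 <= test_size <= len(rows):
        raise ValueError("test_size must be between 0 and len(rows)")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx = sorted(indices[:test_size])
    train_idx = sorted(indices[test_size:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


train_path, test_path = split_paths("/path/to/data/LABEL-U-512x512.csv")
train_rows, test_rows = split_rows(list(range(10)), test_size=4)
```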

Viewing Image Crops

The generated image crops can be viewed with the help of the view_dataset script:

python scripts/view_dataset --dataset toyzero-presimple \
    --data_args '{ "fname": "LABEL-U-512x512.csv" }' -i :100 \
    /path/to/data --plotdir /path/to/plot_dir

This script will load the toyzero image crops from the file LABEL-U-512x512.csv, plot 100 cropped samples (-i :100), and save the plots under /path/to/plot_dir.

Extracting Image Crops

Loading entire event images from the disk is a very slow process and could easily make the training CPU bound. To speed up image loading, one can add one more preprocessing step and extract image crops with the help of the precrop script. For example, running

python scripts/precrop /path/to/data/LABEL-U-512x512.csv /output/directory

will extract image crops according to the precomputed file LABEL-U-512x512.csv and save them as .npz files under /output/directory.

The resulting crops can later be loaded with the help of the PreCroppedToyzeroDataset dataset.
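As a rough picture of what extracting a crop into an .npz file involves, consider the sketch below. It is not the precrop implementation: the function names, the region layout (row, col, height, width), and the "crop" key inside the archive are all assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_crop(path, image, region, background):
    """Cut region = (row, col, height, width) and store it background-subtracted."""
    row, col, height, width = region
    crop = image[row:row + height, col:col + width] - background
    np.savez(path, crop=crop)  # "crop" is a hypothetical key name


def load_crop(path):
    """Read the stored crop back from the .npz archive."""
    with np.load(path) as archive:
        return archive["crop"]


image = np.full((32, 32), 5.0)
image[10, 10] = 12.0  # a single pixel of signal above the background

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = Path(tmpdir) / "sample_0.npz"
    save_crop(out_path, image, region=(8, 8, 16, 16), background=5.0)
    restored = load_crop(out_path)
```

Loading such a small array is much cheaper than reading a full event image and slicing it each time, which is the motivation for the extra extraction step.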
