toytools provides a number of handy functions, PyTorch Datasets, and scripts
to work with data produced by the toyzero package.
toytools is intended for developers, so the best way to install it is to run:
python setup.py develop
The toytools package has two mandatory dependencies: numpy and pandas.
If you would like to use the PyTorch Datasets, you will also need pytorch
installed. Some scripts additionally depend on tqdm to display progress bars.
The contents of the toytools package can be logically separated into several
categories:
- PyTorch Datasets
- Generic Functions
- Scripts
Below you will find a brief overview of each.
Logically, there are two classes of toyzero datasets:
- Version 0 datasets: SimpleToyzeroDataset, PreSimpleToyzeroDataset, PreUnalignedToyzeroDataset, PreCroppedToyzeroDataset. These datasets return pairs of images, where the first element is from domain "a" ("fake" domain) and the second is from domain "b" ("real" domain). These datasets implement their own random sampling, which is not easy to use in a multiprocessing setting.
- Version 1 datasets: PreSimpleToyzeroDatasetV1, PreCroppedToyzeroDatasetV1. These datasets are similar to their v0 counterparts, but return a single image, either from domain "a" or domain "b", depending on how they were instantiated. Version 1 datasets support only deterministic sampling and therefore require a separate random sampler. On the upside, there are no surprises when these datasets are used in a multiprocessing setting.
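The division of labor described for Version 1 datasets can be illustrated with a small, framework-free sketch. The names DeterministicDataset and random_indices below are hypothetical stand-ins and are not part of toytools; with PyTorch, one would typically hand a sampler such as torch.utils.data.RandomSampler to a DataLoader instead.

```python
import random


class DeterministicDataset:
    """Hypothetical stand-in for a Version 1 dataset: index in, item out."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # No randomness here: the same index always returns the same item.
        return self._items[index]


def random_indices(n, seed):
    """Hypothetical sampler: a seeded random permutation of range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices


dataset = DeterministicDataset(["img0", "img1", "img2", "img3"])

# Randomness lives entirely in the sampler, which makes the order
# easy to seed and to reproduce.
order = random_indices(len(dataset), seed=0)
samples = [dataset[i] for i in order]
```

Because the dataset itself holds no random state, the same seed always yields the same sequence of samples.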
SimpleToyzeroDataset is a very simple dataset object that loads toyzero
images on demand. It supports cropping images, filtering images based on the
wire plane, and automatically separating the dataset into training and
validation parts.
If image cropping is required, this dataset tries its best to select only
regions of the image that contain signal. It retries random cropping up to 100
times to find a signal region of the image. However, some toyzero images are
completely empty, so this approach fails for such images.
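The retry strategy described above can be sketched roughly as follows. This is an illustration only, not the toytools implementation: the function name try_random_crop, its parameters, and the "at least min_signal non-background pixels" acceptance rule are assumptions; only the limit of 100 retries comes from the description.

```python
import numpy as np


def try_random_crop(image, crop_shape, min_signal, rng, bkg_value=0, retries=100):
    """Return a random crop with at least `min_signal` signal pixels, or None."""
    height, width = crop_shape

    for _ in range(retries):
        # Pick the top-left corner so that the crop fits inside the image.
        y0 = rng.integers(0, image.shape[0] - height + 1)
        x0 = rng.integers(0, image.shape[1] - width + 1)
        crop = image[y0:y0 + height, x0:x0 + width]

        # Count pixels that differ from the background value.
        if np.count_nonzero(crop != bkg_value) >= min_signal:
            return crop

    # Every attempt failed, e.g. because the image is completely empty.
    return None


rng = np.random.default_rng(0)

empty = np.zeros((64, 64))
with_signal = np.zeros((64, 64))
with_signal[16:48, 16:48] = 1.0

crop = try_random_crop(with_signal, (32, 32), min_signal=1, rng=rng)
no_crop = try_random_crop(empty, (32, 32), min_signal=1, rng=rng)
```

For the empty image all 100 attempts fail and the sketch returns None, which mirrors the failure mode mentioned for completely empty toyzero images.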
PreSimpleToyzeroDataset is another simple dataset object. It loads toyzero
images according to a preprocessed list, obtained with the scripts/preprocess
script.
Using the preprocessed list of images significantly speeds up image loading,
since there is no need to perform an expensive search for signal regions
in the image. Preprocessing also removes completely empty images from the
dataset.
Similar to SimpleToyzeroDataset, PreSimpleToyzeroDataset automatically
separates the preprocessed dataset into training and validation parts.
PreUnalignedToyzeroDataset behaves similarly to PreSimpleToyzeroDataset,
except that it returns unpaired images during the training phase. During the
testing and validation phases, the returned images are paired as in
PreSimpleToyzeroDataset.
PreCroppedToyzeroDataset loads cropped images created by the precrop script.
It supports returning either paired or unpaired images, depending on the
constructor arguments.
All datasets described in this section are implemented in a
pytorch-independent way. This allows the datasets to be reused with other
frameworks. You can use get_toyzero_dataset to select a pytorch-independent
dataset based on its name. The get_toyzero_dataset_torch function will select
a pytorch-dependent dataset.
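Selecting a dataset by name is commonly done with a small registry. The sketch below shows that general pattern only; it does not reproduce get_toyzero_dataset, and the class names, registry keys, and the select_dataset signature are all hypothetical.

```python
class PairedDataset:
    """Hypothetical placeholder for a dataset class."""

    def __init__(self, path, **data_args):
        self.path = path
        self.data_args = data_args


class SingleDomainDataset(PairedDataset):
    """Another hypothetical placeholder."""


# Hypothetical name -> class registry; the real dataset names may differ.
DATASETS = {
    "paired": PairedDataset,
    "single-domain": SingleDomainDataset,
}


def select_dataset(name, path, **data_args):
    """Construct the dataset registered under `name`."""
    if name not in DATASETS:
        raise ValueError(f"Unknown dataset name: {name!r}")
    return DATASETS[name](path, **data_args)


dataset = select_dataset("paired", "/path/to/data", fname="LABEL-U-512x512.csv")
```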
toytools comes with a variety of functions to handle toyzero data:
- toytools.collect -- contains functions to search for images in the toyzero image directory, parse image names, perform filtering based on image APA or wire plane, and load the images themselves.
- toytools.transform -- contains functions to determine the background value of an image, determine whether a given image is empty, crop images, and search for regions of an image that contain signal.
- toytools.plot -- a few functions to simplify plotting of the toyzero images.
You can find four scripts in the scripts directory:
- preprocess
- view_dataset
- train_test_split
- precrop
The preprocess script can be used to preprocess images of the toyzero dataset. Preprocessing includes determining the background value of the image and selecting randomly cropped regions of the image that contain signal. Filtering images based on APA and wire plane is also supported.
The results of preprocessing are saved in a csv file, suitable for use
with the PreSimpleToyzeroDataset dataset.
Preprocessing is a rather slow procedure, so the preprocess script relies on
a high degree of process-based parallelism.
The view_dataset script can be used to plot images generated by various
datasets (e.g. SimpleToyzeroDataset or PreSimpleToyzeroDataset). The images
can be either viewed in an interactive mode (default), or saved to files in a
batch mode (when the --plotdir argument is specified).
The train_test_split script can be used to split the preprocessed list of
image crops (created with the preprocess script) into training and test parts.
The precrop script extracts the cropped regions found by the preprocess
script and saves them as separate images, performing a training/validation
split in the process. The extracted cropped regions have their background
value subtracted as well. They can be loaded with the PreCroppedToyzeroDataset
dataset.
Extracting cropped regions significantly speeds up subsequent data loading.
To generate image crops of shape 512x512 for the U plane, you can use the following command:
python scripts/preprocess --plane U -n 4 --min-signal 500 -s 512x512 \
/path/to/data LABEL
Here, -n 4 controls the number of crops to extract from a single image.
The parameter --min-signal 500 sets the minimum number of nonzero pixels
required to keep a cropped image. If the number of nonzero pixels after
cropping is less than 500, the cropped image is discarded and the cropping
process is retried. /path/to/data is the directory where your toyzero dataset
is located. Finally, LABEL is a label that will be given to the dataset of
crops (this label will be embedded into the generated dataset name).
After the preprocess script finishes its run, it will create a file
LABEL-U-512x512.csv that contains a list of cropped regions.
To split the generated image crops into training and validation parts, we can
run the train_test_split script as follows:
python scripts/train_test_split \
--test-size 4000 /path/to/data/LABEL-U-512x512.csv
It will split the preprocessed image crops /path/to/data/LABEL-U-512x512.csv
into two parts:
- Training samples: /path/to/data/LABEL-U-512x512-train.csv
- Test samples: /path/to/data/LABEL-U-512x512-test.csv
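The output file naming and a fixed-size split can be sketched as follows. This is only an illustration: split_paths and split_rows are hypothetical helpers, and shuffling the rows before the split is an assumption rather than the documented behavior of train_test_split.

```python
import random
from pathlib import Path


def split_paths(path):
    """Derive the -train/-test file names from the input csv path."""
    path = Path(path)
    train = path.with_name(f"{path.stem}-train{path.suffix}")
    test = path.with_name(f"{path.stem}-test{path.suffix}")
    return train, test


def split_rows(rows, test_size, seed=0):
    """Shuffle rows (an assumption) and reserve `test_size` of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[test_size:], rows[:test_size]


train_path, test_path = split_paths("/path/to/data/LABEL-U-512x512.csv")

# Ten stand-in rows, four of which go to the test part.
train_rows, test_rows = split_rows(range(10), test_size=4)
```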
The generated image crops can be viewed with the help of the view_dataset
script:
python scripts/view_dataset --dataset toyzero-presimple \
--data_args '{ "fname": "LABEL-U-512x512.csv" }' -i :100 \
/path/to/data --plotdir /path/to/plot_dir
This script will load the toyzero image crops from the file
LABEL-U-512x512.csv, plot 100 cropped samples (-i :100), and save the
plots under /path/to/plot_dir.
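The two structured arguments in the command above can be read as a JSON object and a slice-like index range. The sketch below shows one way to interpret them. It assumes that -i follows Python slice notation, which matches the "100 samples" reading of :100 but is not confirmed here, and parse_index is a hypothetical helper, not a toytools function.

```python
import json


def parse_index(text):
    """Turn 'start:stop' text (either side optional) into a slice."""
    start, stop = text.split(":")
    return slice(int(start) if start else None, int(stop) if stop else None)


# The --data_args value is a JSON object of dataset constructor arguments.
data_args = json.loads('{ "fname": "LABEL-U-512x512.csv" }')
index = parse_index(":100")

# Applying the slice to 250 stand-in sample indices keeps the first 100.
selected = list(range(250))[index]
```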
Loading entire event images from disk is a very slow process and can
easily make the training CPU bound. To speed up image loading, one can add one
more preprocessing step and extract image crops with the help of the precrop
script. For example, running
python scripts/precrop /path/to/data/LABEL-U-512x512.csv /output/directory
will extract image crops according to the precomputed file
LABEL-U-512x512.csv and save them as .npz files under /output/directory.
The resulting crops can later be loaded with the help of the
PreCroppedToyzeroDataset dataset.
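To inspect an individual .npz file outside of the dataset classes, numpy can open it directly. The sketch below writes a stand-in archive first so that it is self-contained. The key name "image" is hypothetical, since the array names that precrop stores are not listed here; archive.files reports whatever names a real file contains.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "crop.npz"

    # Write a stand-in archive; the key name "image" is hypothetical.
    np.savez(path, image=np.ones((4, 4), dtype=np.float32))

    with np.load(path) as archive:
        # Names of the arrays stored in the archive.
        keys = list(archive.files)
        # Read every array into memory before the archive is closed.
        arrays = {key: archive[key] for key in keys}
```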