Add wrapper scripts to pipe training tensors directly to Tensor2Bin #55

ftostevin-ont · 2021-09-23T14:40:21Z

This MR adds a wrapper script, CreateTrainingTensor, that calls CreateTensor{Pileup|FullAlignment} and pipes the output to Tensor2Bin, in the same way that CallVarBam does with CallVariants. UnifyRepresentation similarly calls CreateTensorFullAlignment and handles the piped output directly.
This avoids writing the uncompressed tensors to disk and reading them back, and allows tensor extraction and compression to run in parallel.
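
For readers unfamiliar with the pattern, here is a minimal sketch of chaining two subprocesses through a pipe with Python's `subprocess` module. It is not the code in this MR: `run_pipeline` is a hypothetical helper, and the two commands are small stand-ins for a tensor producer and a consumer, not the real CreateTensor or Tensor2Bin command lines.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer` with no intermediate file and return the consumer's stdout."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close the parent's copy of the pipe so the producer receives SIGPIPE
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError(
            "pipeline failed: producer=%s consumer=%s"
            % (producer.returncode, consumer.returncode)
        )
    return output


# Stand-in commands (assumptions for illustration only): the producer prints
# one record per line, the consumer sums them as they arrive.
producer_cmd = [sys.executable, "-c", "for i in range(5): print(i)"]
consumer_cmd = [
    sys.executable,
    "-c",
    "import sys; print(sum(int(line) for line in sys.stdin))",
]
result = run_pipeline(producer_cmd, consumer_cmd).decode().strip()
```

Both processes run concurrently, so the consumer can start work as soon as the producer emits its first records.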

I also add a second script MergeBin that simply merges the individual chunk binaries into one, without changing their contents. This is mainly to limit the number of binary files that need to be passed to training.
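
As a rough illustration of merging without changing contents, the sketch below concatenates per-chunk datasets in order. The in-memory layout (a dict of named lists per chunk), the dataset names, and the `merge_chunks` helper are all assumptions made for this example; the actual binary format and the MergeBin implementation are not described here.

```python
def merge_chunks(chunks):
    """Concatenate per-chunk datasets in order, leaving every row unchanged."""
    merged = {}
    expected_names = None
    for chunk in chunks:
        names = set(chunk)
        if expected_names is None:
            expected_names = names
        elif names != expected_names:
            raise ValueError("chunks do not share the same dataset names")
        for name, rows in chunk.items():
            merged.setdefault(name, []).extend(rows)
    return merged


# Hypothetical chunk contents, for illustration only.
chunk_a = {"position": ["chr1:100", "chr1:200"], "label": [0, 1]}
chunk_b = {"position": ["chr1:300"], "label": [1]}
merged = merge_chunks([chunk_a, chunk_b])
```

The merged result holds the same rows as the inputs, just in a single container, which is what keeps the number of files passed to training small.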

One functional change introduced by this is that non-variant site subsampling is now done at a variable rate (though still targeting a constant variant:non-variant ratio) determined per chunk of shuffle_bin_size sites, rather than at a global rate determined over all tensor details files. I have not seen this significantly affect the resulting output tensors, though in theory the number of non-variant sites included will be slightly more variable.
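
To make the per-chunk behaviour concrete, here is a minimal sketch, not the code in this MR. It assumes each site is a `(position, is_variant)` tuple and that `target_ratio` is the desired number of non-variant sites per variant site; both names, and the helper functions, are hypothetical. The keep probability is recomputed for every chunk, and each non-variant site is kept probabilistically.

```python
import random


def subsample_chunk(sites, target_ratio, rng):
    """Keep all variants; keep each non-variant with a rate computed for this chunk."""
    n_variant = sum(1 for _, is_variant in sites if is_variant)
    n_non_variant = len(sites) - n_variant
    if n_non_variant == 0:
        return list(sites)
    # Rate that yields about target_ratio non-variants per variant in this chunk.
    rate = min(1.0, target_ratio * n_variant / n_non_variant)
    return [s for s in sites if s[1] or rng.random() < rate]


def subsample_by_chunk(sites, chunk_size, target_ratio, seed=0):
    """Apply subsample_chunk independently to consecutive chunks of chunk_size sites."""
    rng = random.Random(seed)
    kept = []
    for start in range(0, len(sites), chunk_size):
        chunk = sites[start:start + chunk_size]
        kept.extend(subsample_chunk(chunk, target_ratio, rng))
    return kept


# Toy input: every tenth site is a variant.
sites = [(i, i % 10 == 0) for i in range(100)]
kept = subsample_by_chunk(sites, chunk_size=20, target_ratio=1.0)
```

Because the rate depends on the variant count within each chunk, the number of non-variant sites kept varies from chunk to chunk, whereas a single global rate would be fixed across the whole input.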

This should be backwards-compatible with the previous functionality of Tensor2Bin, though I have not extensively tested this.

aquaskyline · 2021-09-24T02:01:19Z

#53 and #55 testing in progress.

ftostevin-ont added 12 commits September 23, 2021 11:33

combine training tensor generation and compression in a single wrapper without intermediate output

c5ddf49

implement running tensor creation within UnifyRepresentation and processing piped output

f0242f8

turn off alarm when first subprocess finishes, to prevent crashes if the second subprocess takes a while to close

e02fc43

MergeBin script

7c032b0

expose add_no_phasing option in CreateTrainingTensor, set min_af only in full_alignment mode

adc572c

tweaks to tensor generation to properly apply maximum variant ratio if duplicate positions are allowed

2ce8b9b

re-add 'phasing_info_in_bam' option to CreateTrainingTensor

0f07619

make non-variant subsampling probabilistic again

6b8383f

update training docs

29f1f27

reduce threads used for tensor creation in training examples

71ce20b

update representation unification example documentation

7bf2b56

fix tensor file names in training examples

2b838d5

zhengzhenxian merged commit cc313f7 into HKU-BAL:main Sep 29, 2021

ftostevin-ont deleted the training_features_pipes branch April 29, 2022 12:19
