WDL-based workflows for BAM-to-pBAM-to-BAM conversions. For additional information on the protocol and file formats, see http://privaseq3.gersteinlab.org/docs/.
The workflows can be run using caper. Install caper by following its installation instructions.
In the `wdl` directory you will find workflows for all the supported formats. Alongside each WDL file, an input JSON template is provided.
As an example, assume you have a BAM file that originates from a bulk RNA sequencing experiment and has been aligned to the human GRCh38 reference, and you want to make a privacy-aware BAM (pBAM) from it. The workflow you need is located in `wdl/genome/make_pbam_genome.wdl`, and the input template in `wdl/genome/genome_pbam_input_template.json`.
Fill in the locations of your BAM and reference files in the template. In addition to your local machine, acceptable file storages are `https://`, `gs://`, and `s3://`. All workflows accept the reference files (reference genome, reference transcriptome, and annotation) either as plain files or as files compressed with gzip. To use a gzipped file, add the `_gz` suffix to the input parameter name (see the second example below).
```json
{
    "genome.bam": "<your bam location here>",
    "genome.reference_fasta": "<GRCh38.fasta location here>",
    "genome.cpu": 1,
    "genome.memory_gb": 2,
    "genome.disk": "local-disk 20 SSD"
}
```

If you want to provide the reference genome as a compressed file, `genome.reference_fasta` needs to be changed to `genome.reference_fasta_gz`:
```json
{
    "genome.bam": "<your bam location here>",
    "genome.reference_fasta_gz": "<GRCh38.fasta.gz location here>",
    "genome.cpu": 1,
    "genome.memory_gb": 2,
    "genome.disk": "local-disk 20 SSD"
}
```

Save the input file containing the locations of your input files.
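If you prefer, the input JSON can also be generated programmatically rather than by editing the template by hand. A minimal sketch (all file paths below are placeholders, not real locations):

```python
import json

# Build the workflow input dict; the paths are placeholders to replace
# with your actual BAM and reference locations.
inputs = {
    "genome.bam": "/data/sample.bam",                 # placeholder path
    "genome.reference_fasta": "/refs/GRCh38.fasta",   # placeholder path
    "genome.cpu": 1,
    "genome.memory_gb": 2,
    "genome.disk": "local-disk 20 SSD",
}

with open("genome_pbam_input.json", "w") as fh:
    json.dump(inputs, fh, indent=4)
```

This produces a file with the same shape as the template above, ready to pass to caper.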
Memory and disk requirements depend on the size of the input. A good starting point for disk is 5x the size of your BAM file, and 16 GB of memory should be sufficient for most BAM files. Most of the processes are currently single-process; parallelized versions will be available in the future.
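The sizing rule of thumb above can be captured in a small helper. This is a hypothetical convenience function, not part of the workflows; it just applies the 5x-BAM-size guideline and formats the result as a disk string:

```python
import math

def recommended_disk(bam_size_gb, multiplier=5):
    """Return a disk string sized at `multiplier` x the BAM size,
    rounded up to whole GB. Hypothetical helper based on the 5x rule."""
    size = math.ceil(bam_size_gb * multiplier)
    return f"local-disk {size} SSD"

print(recommended_disk(4))  # a 4 GB BAM -> "local-disk 20 SSD"
```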
Run the workflow:
```bash
caper run -i <your_input.json> wdl/genome/make_pbam_genome.wdl -m metadata.json --docker
```

If you are using Singularity, use the `--singularity` option instead of `--docker`.
After the run finishes, `metadata.json`, containing detailed information about the run, is written. In the `outputs` section of `metadata.json` you will find the location of the pBAM file.
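If you want to pull the output locations out of `metadata.json` without opening it by hand, the `outputs` section can be read with a few lines of Python. The demo below writes a minimal fake metadata file first; the output key name shown is illustrative, so check your own `metadata.json` for the exact keys:

```python
import json

def workflow_outputs(metadata_path):
    """Return the 'outputs' dict from a caper/Cromwell metadata.json."""
    with open(metadata_path) as fh:
        return json.load(fh).get("outputs", {})

# Demo with a minimal fake metadata file; the key name is illustrative.
with open("metadata_demo.json", "w") as fh:
    json.dump({"outputs": {"genome.pbam": "/runs/out/sample.pbam"}}, fh)

for name, location in workflow_outputs("metadata_demo.json").items():
    print(name, "->", location)
```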
Assuming you do not intend anyone to be able to restore the information contained in the BAM file, you are done. If you need to be able to reverse the transformation, you will need to create a diff file corresponding to your BAM.
Note that if your BAM file was aligned to the human transcriptome, you can use the workflow located in `wdl/transcriptome/make_pbam_transcriptome.wdl`. For single-cell RNA-Seq data that was aligned to the human reference genome using STAR, you can still use the workflow located in `wdl/genome/make_pbam_genome.wdl`.
The diff workflow and its corresponding input template are located in the `wdl/diff` directory. As above, fill in the location of the BAM file in the template:
```json
{
    "diff.bam": "<your bam location here>",
    "diff.cpu": 1,
    "diff.memory_gb": 2,
    "diff.disk": "local-disk 20 SSD"
}
```
Memory and disk requirements, as well as running the workflow, are the same as above.
To restore a regular BAM from the pBAM and diff files, use the workflow and input template located in `wdl/pbam2bam`. The process is very similar to the previous steps. Fill in the input files in the template:
```json
{
    "pbam2bam.pbam": "<your pbam location here>",
    "pbam2bam.diff": "<corresponding diff location here>",
    "pbam2bam.run_type": "genome",
    "pbam2bam.reference_fasta": "<reference fasta location here>",
    "pbam2bam.cpu": 1,
    "pbam2bam.memory_gb": 2,
    "pbam2bam.disk": "local-disk 20 SSD"
}
```

If you want to use a gzip-compressed reference file, use the following:
```json
{
    "pbam2bam.pbam": "<your pbam location here>",
    "pbam2bam.diff": "<corresponding diff location here>",
    "pbam2bam.run_type": "genome",
    "pbam2bam.reference_fasta_gz": "<reference fasta.gz location here>",
    "pbam2bam.cpu": 1,
    "pbam2bam.memory_gb": 2,
    "pbam2bam.disk": "local-disk 20 SSD"
}
```

Running the workflow and locating the outputs is exactly the same as before.
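Since the template accepts either a plain or a gzipped reference, a quick sanity check before launching is to confirm the input names exactly one of `pbam2bam.reference_fasta` or `pbam2bam.reference_fasta_gz`. A hypothetical validator, not part of the workflows:

```python
def check_reference_keys(inputs):
    """Return True if exactly one reference key is present in the
    input dict (hypothetical pre-flight check)."""
    keys = ("pbam2bam.reference_fasta", "pbam2bam.reference_fasta_gz")
    return sum(k in inputs for k in keys) == 1

# Exactly one reference key -> valid
print(check_reference_keys({"pbam2bam.reference_fasta": "GRCh38.fasta"}))  # True
```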
We recommend running the WDL workflows, but it is also possible to run the scripts directly, although some setup is necessary (it cannot be guaranteed that, even after following these instructions, your environment will exactly match the docker image we use to run the WDL):
- Install Python 3.8.5.
- Install the following Python packages: `numpy==1.19.2`, `biopython==1.78`, `pandas==1.1.3`.
- Get Picard 2.23.8 from https://github.com/broadinstitute/picard/releases/download/2.23.8/picard.jar, and add the `.jar` file to `PATH`.
- Install samtools 1.11.
- Install a Java runtime, for example `openjdk 11.0.8`.
- Add the scripts from the `10xscell`, `diff`, `genome`, `pbam2bam`, and `transcriptome` directories to `PATH`.
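When running the scripts directly, it can help to confirm that the installed Python packages match the pins above. A small sketch using the standard library (`importlib.metadata`, available in Python 3.8+); the pin list mirrors this section:

```python
# Verify installed package versions against the pins listed above.
from importlib import metadata

PINS = {"numpy": "1.19.2", "biopython": "1.78", "pandas": "1.1.3"}

def mismatches(pins):
    """Return {name: (pinned, installed-or-None)} for every package whose
    installed version differs from the pin, or which is not installed."""
    bad = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            bad[name] = (pinned, installed)
    return bad

print(mismatches(PINS))  # empty dict if the environment matches the pins
```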
Additional READMEs for running the bash scripts are provided in each folder.