### Reverting BAM files to FASTQ

It has been documented that reverting BAM files to FASTQ does not unambiguously regenerate the original FASTQ files. According to this [report](https://www.enancio.com/genomicdataintegrity.html), there are four main reasons:
- Read names: names are changed, so the checksum will diverge.
- Hard clipping: some local alignments can be omitted from the BAM files.
- Unmapped reads: reads that cannot be mapped to the genome may be omitted from the BAM file.
- Read ordering: after the mapping phase, reads are sorted according to their alignment position on the genome. When converting back to FASTQ, the reads therefore follow the order of the BAM file, not the original order.
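The effect of read ordering on checksums can be illustrated with a minimal sketch. The two toy FASTQ files below are hypothetical and contain the same two reads in a different order; `md5sum` is assumed to be available. Comparing sorted, one-line-per-record versions is one possible way to make the comparison order-independent, not necessarily the check used later in this notebook.

```shell
# Toy data (hypothetical reads): the same two records written in two different orders.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/original.fastq"
printf '@r2\nTTGA\n+\nIIII\n@r1\nACGT\n+\nIIII\n' > "$tmp/reverted.fastq"

# Checksums of the raw files differ because the bytes are in a different order.
raw_original=$(md5sum < "$tmp/original.fastq")
raw_reverted=$(md5sum < "$tmp/reverted.fastq")

# Put each 4-line record on one tab-separated line, sort the records, then checksum.
canonical() { paste - - - - < "$1" | LC_ALL=C sort | md5sum; }
canon_original=$(canonical "$tmp/original.fastq")
canon_reverted=$(canonical "$tmp/reverted.fastq")

echo "raw:       $raw_original vs $raw_reverted"
echo "canonical: $canon_original vs $canon_reverted"
rm -r "$tmp"
```

The raw checksums differ while the canonical ones match, so a plain checksum mismatch on its own does not show that any read content was lost.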

The [following paper](https://www.ncbi.nlm.nih.gov/pubmed/27153582) addressed the question of pipeline reproducibility: it was not possible to obtain the same variant calls from the same FASTQ files when the raw sequences were reshuffled. The pipeline and code can be found [here](https://zenodo.org/record/32611#.XXd7CS2B3UI).

Hence, we will test whether FASTQ files are no longer identical to the originals once they have been aligned and then reverted from BAM files.

BAM files are generated as described in [this notebook](FASTQ_to_BAM.ipynb).

Several tools can convert BAM files to FASTQ:
- [Picard tools](http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq)
- [Samtools](http://www.htslib.org/doc/samtools.html), already described in [Hubert's code](Hubert_Pausch_initial_code.ipynb)
- [BEDtools](https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html)

We will start with the Samtools approach, as it was already tested in [Hubert's code](Hubert_Pausch_initial_code.ipynb).

#### Samtools approach

In [None]:
module load samtools/1.6

mkdir /cluster/work/pausch/audald/data/yeast/fastq2
mkdir /cluster/work/pausch/audald/data/yeast/fastq2/tmp_scratch

bsub -o bam_fastq_samtools_output -R "rusage[mem=2500,scratch=4000]" -J "BAM_fastqs" "samtools collate -uOn 64 /cluster/work/pausch/audald/data/yeast/fastq_fastp/SRR10079472.bam  /cluster/work/pausch/audald/data/yeast/fastq2/tmp_scratch/tmp | samtools fastq -N -1 /cluster/work/pausch/audald/data/yeast/SRR10079472_reverted_1.fastq.gz -2 /cluster/work/pausch/audald/data/yeast/SRR10079472_reverted_2.fastq.gz -" 
#This submits a job requesting 2500 MB of memory and 4000 MB of scratch space, as set in rusage. No -n option is given, so the number of cores is left at the scheduler default.
#Samtools "collate" shuffles the BAM file and groups reads together by name. -uOn 64 means that the output is uncompressed (-u), written to stdout (-O), and that 64 temporary files are used (-n 64). Temporary files are written to the folder created above.
#Samtools fastq, after the pipe, converts the collated BAM stream to compressed FASTQ files (it needs mates grouped together, hence collate). -N always appends /1 and /2 to the read names, and the reads are written to the files given by -1 and -2. The trailing dash (-) tells samtools to read its input from standard input.

Details and flags for samtools can be seen in the [documentation page](http://www.htslib.org/doc/samtools.html)
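Because `-N` appends `/1` and `/2` to the read names, a name-by-name comparison with the input FASTQ files may first need the suffix removed. The sketch below shows one way to do this with `awk` on a hypothetical toy record; the read name is made up, and it is an assumption that the input FASTQ names carry no such suffix.

```shell
# Toy record named as `samtools fastq -N` would name it (hypothetical read name).
reverted=$(printf '@SRR0.1/1\nACGT\n+\nII/1\n')

# Strip a trailing /1 or /2 from header lines only (line 1 of every 4-line record).
# Quality strings may legitimately contain '/', so they are left untouched.
normalized=$(printf '%s\n' "$reverted" | awk 'NR % 4 == 1 { sub(/\/[12]$/, "") } { print }')

printf '%s\n' "$normalized"
```

Restricting the substitution to header lines matters here: the toy quality string ends in `/1` on purpose and must come out unchanged.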

Interestingly, when running only the collate step, with no samtools fastq in the pipe, the resulting BAM is almost 10 times larger than the input BAM. This is consistent with the `-u` flag, which writes uncompressed output.

However, when both steps are chained, the resulting FASTQ files are slightly larger than the original ones, and also larger than the ones obtained from the fastp step (the ones used for alignment and for generating the BAM).
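The size of a gzip-compressed FASTQ file depends on the compression settings and on read order, so a size difference alone does not show that the read content differs. Counting reads is a more direct check. The sketch below uses a hypothetical toy file and assumes the usual four lines per record (no wrapped sequences); `count_reads` is an illustrative helper, not part of any tool used here.

```shell
# Build a tiny compressed FASTQ file with two hypothetical reads.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/toy.fastq.gz"

# A FASTQ record spans four lines, so reads = lines / 4.
count_reads() { gzip -dc "$1" | awk 'END { print NR / 4 }'; }

n_reads=$(count_reads "$tmp/toy.fastq.gz")
echo "reads: $n_reads"
rm -r "$tmp"
```

Running the same helper on the original, fastp-processed and reverted files would show whether the size differences reflect a different number of reads.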

#### Picard tools approach

The main documentation can be found [here](http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq)

In [None]:
module load picard/2.9.2

bsub -o bam_fastq_Picard_output -R "rusage[mem=2000]" -J "BAM_fastq" "picard SamToFastq I=/cluster/work/pausch/audald/data/yeast/fastq_fastp/SRR10079472.bam FASTQ=/cluster/work/pausch/audald/data/yeast/Picardtools/SRR10079472_reverted_1.fastq.gz SECOND_END_FASTQ=/cluster/work/pausch/audald/data/yeast/Picardtools/SRR10079472_reverted_2.fastq.gz UNPAIRED_FASTQ=/cluster/work/pausch/audald/data/yeast/Picardtools/SRR10079472_reverted_0.fastq.gz"

This run appears to succeed (this time a third FASTQ file is generated through `UNPAIRED_FASTQ`, which Picard uses for unpaired reads), but the FASTQ files are about one third of the expected size.
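One way to investigate smaller-than-expected output is to list which read names are present in the input FASTQ but absent from the reverted one. The sketch below does this with `comm` on hypothetical toy files. The file names and reads are made up, and `names` is an illustrative helper that assumes four-line records.

```shell
# Toy data (hypothetical): the input has three reads, the reverted file only one.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$tmp/input.fastq"
printf '@r3\nGGCC\n+\nIIII\n' > "$tmp/reverted.fastq"

# Extract read names: first field of each header line, without the leading '@', sorted.
names() { awk 'NR % 4 == 1 { print substr($1, 2) }' "$1" | LC_ALL=C sort; }
names "$tmp/input.fastq" > "$tmp/input.names"
names "$tmp/reverted.fastq" > "$tmp/reverted.names"

# comm -23 keeps the lines that occur only in the first (sorted) file.
missing=$(LC_ALL=C comm -23 "$tmp/input.names" "$tmp/reverted.names")
echo "$missing"
rm -r "$tmp"
```

For real data the FASTQ files would be decompressed first, for example with `gzip -dc`, and any `/1` or `/2` suffixes would need to be handled consistently on both sides.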

#### BEDtools approach

In [None]:
module load samtools/1.6
module load bedtools2/2.27.1

mkdir /cluster/work/pausch/audald/data/yeast/BEDtools

bsub -o bam_sorting_output -R "rusage[mem=2000]" -J "BAM_sorting" "samtools sort -n /cluster/work/pausch/audald/data/yeast/fastq_fastp/SRR10079472.bam -o /cluster/work/pausch/audald/data/yeast/BEDtools/SRR10079472_qsort.bam"
#Several temporary files are created while running this. It could be improved by generating a folder for those - see http://www.htslib.org/doc/samtools.html
bsub -o bam_fastq_output -R "rusage[mem=2000]" -J "BAM_fastq" "bedtools bamtofastq -i /cluster/work/pausch/audald/data/yeast/BEDtools/SRR10079472_qsort.bam -fq /cluster/work/pausch/audald/data/yeast/BEDtools/SRR10079472_reverted_1.fastq -fq2 /cluster/work/pausch/audald/data/yeast/BEDtools/SRR10079472_reverted_2.fastq"
#Here the output is uncompressed fastq files. It could be improved by adding a compression step at the end.

The name-sorting step generates a BAM file about twice as large as the input BAM.
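The compression step mentioned in the cell comment could be as simple as running `gzip` on each output file. The sketch below uses a hypothetical toy file and checks that the content survives the round trip. The file name is made up for illustration and `md5sum` is assumed to be available.

```shell
# Toy uncompressed FASTQ standing in for a bedtools output file (hypothetical name).
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/toy_reverted_1.fastq"
before=$(md5sum < "$tmp/toy_reverted_1.fastq")

# gzip replaces the file with toy_reverted_1.fastq.gz.
gzip "$tmp/toy_reverted_1.fastq"

# Decompressing to stdout should reproduce the original bytes.
after=$(gzip -dc "$tmp/toy_reverted_1.fastq.gz" | md5sum)
echo "before: $before"
echo "after:  $after"
rm -r "$tmp"
```

Compressing the BEDtools output would also let the `*fastq.gz` pattern used in the size listing match these files.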

#### Overall comparison of file sizes

The three pipelines generate BAM and FASTQ files of different sizes, which is the most evident difference. Before moving on to the actual quality control, the sizes are listed here:

In [1]:
echo Sizes of original fastq files:
ls -lrth /cluster/work/pausch/audald/data/yeast/original_raw_data/*fastq.gz
echo ---------------------------------------------------------------------------------------------------
echo Sizes of processed fastq files - after fastp:
ls -lrth /cluster/work/pausch/audald/data/yeast/fastq_fastp/*fastq.gz
echo ---------------------------------------------------------------------------------------------------
echo Sizes of reverted fastq files - samtools From BAM:
ls -lrth /cluster/work/pausch/audald/data/yeast/Samtools/*fastq.gz
echo ---------------------------------------------------------------------------------------------------
echo Sizes of reverted fastq files - Picard tools From BAM:
ls -lrth /cluster/work/pausch/audald/data/yeast/Picardtools/*fastq.gz
echo ---------------------------------------------------------------------------------------------------
echo Sizes of reverted fastq files - BEDtools from BAM:
ls -lrth /cluster/work/pausch/audald/data/yeast/BEDtools/*fastq.gz
echo ---------------------------------------------------------------------------------------------------
echo Size of BAM file - BWA mem from fastq:
ls -lrth /cluster/work/pausch/audald/data/yeast/fastq_fastp/output.bam
echo ---------------------------------------------------------------------------------------------------
echo Size of BAM file - sorted by sambamba from unsorted BAM:
ls -lrth /cluster/work/pausch/audald/data/yeast/fastq_fastp/SRR10079472.bam
echo ---------------------------------------------------------------------------------------------------
echo Size of BAM file - sorted by samtools from BAM file after sambamba:
ls -lrth /cluster/work/pausch/audald/data/yeast/BEDtools/SRR10079472_qsort.bam

Sizes of original fastq files:
ls: cannot access '/cluster/work/pausch/audald/data/yeast/original_raw_data/*fastq.gz': No such file or directory
---------------------------------------------------------------------------------------------------
Sizes of processed fastq files - after fastp:
ls: cannot access '/cluster/work/pausch/audald/data/yeast/fastq_fastp/*fastq.gz': No such file or directory
---------------------------------------------------------------------------------------------------
Sizes of reverted fastq files - samtools From BAM:
ls: cannot access '/cluster/work/pausch/audald/data/yeast/Samtools/*fastq.gz': No such file or directory
---------------------------------------------------------------------------------------------------
Sizes of reverted fastq files - Picard tools From BAM:
ls: cannot access '/cluster/work/pausch/audald/data/yeast/Picardtools/*fastq.gz': No such file or directory
----------------------------------------------------------------------------------
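The output above shows `ls` failing because the glob patterns matched no files at the time the cell was run. A small guard can make such a listing report this explicitly. The sketch below assumes bash or sh without `nullglob`; `list_sizes` is an illustrative helper, not part of the original notebook.

```shell
# Print sizes for the given files, or a clear note when none of them exist.
list_sizes() {
    label=$1; shift
    echo "$label"
    found=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            ls -lh "$f"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "  (no matching files)"
}

# Demo on an empty temporary directory: the unmatched pattern is passed through literally.
tmp=$(mktemp -d)
out=$(list_sizes "Sizes of reverted fastq files:" "$tmp"/*fastq.gz)
echo "$out"
rm -r "$tmp"
```

Each `echo` and `ls` pair in the cell could then be replaced by one call such as `list_sizes "Sizes of original fastq files:" /cluster/work/pausch/audald/data/yeast/original_raw_data/*fastq.gz`.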
