## Sorting, merging and generation of stats: all can be done with Sambamba

[Sambamba](https://lomereiter.github.io/sambamba/docs/sambamba-view.html) is a tool for handling the newly created BAM files. Four steps are performed with it at this point:

* Sorting the BAM files. The read-group-specific BAM files are not sorted, and sorting is required before they can be merged and processed further.
* Merging the sorted read-group-specific BAM files into sample-specific BAM files.
* Generating index files (BAI); this is part of both the sorting and the merging steps.
* Generating stats so that the details of the BAM files can be checked at a glance.

#### Sorting read-group-specific BAM files

The BAM files generated so far are read-group-specific. The more read groups a sample has, the smaller each BAM file is relative to the whole sample. This is one reason why sorting is done on the read-group-specific BAM files: smaller files are better, as multiple jobs can run at the same time. However, the main reason for sorting at this point is that sorted BAM files are required to merge the read-group-specific BAM files into sample BAM files.

An index file (BAI) is also created for each BAM file as part of this process.

Sambamba with sorting options can be used in bash as follows:

In [None]:
/cluster/work/pausch/group_bin/sambamba_v0.6.6 sort -t 6 --out {output.sorted_BAM} {input.BAM}
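When the sort step is driven from Python rather than typed by hand, the same command can be assembled as an argument list. The following is only a minimal sketch: the helper names `sort_command` and `expected_index` and the example file names are hypothetical, while the binary path, the `-t 6` thread count and the `.bam.bai` index naming are taken from the commands in this notebook. Nothing is executed; the command is only printed.

```python
import shlex

# Path to the Sambamba binary, as used in the command above.
SAMBAMBA = "/cluster/work/pausch/group_bin/sambamba_v0.6.6"


def sort_command(input_bam, sorted_bam, threads=6):
    """Return the argument list for `sambamba sort` (hypothetical helper)."""
    return [SAMBAMBA, "sort", "-t", str(threads), "--out", sorted_bam, input_bam]


def expected_index(bam_path):
    """Return the BAI path expected next to a BAM file (assumes <file>.bam.bai)."""
    return bam_path + ".bai"


# Example file names are placeholders, not files from the workflow.
cmd = sort_command("sample_rg1.bam", "sample_rg1.sorted.bam")
print(shlex.join(cmd))
print(expected_index("sample_rg1.sorted.bam"))
```

The list could be passed to `subprocess.run(cmd, check=True)` if the command should actually be executed outside Snakemake.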

#### Merging read-group-specific BAM files into sample-specific BAM files

As mentioned within the documentation, splitting the FASTQ files into read-group-specific FASTQ files is a workaround that allows the alignment and sorting of the BAM files. However, we are interested in keeping all the read groups of each sample together and eventually obtaining one BAM file per sample. Therefore, we now need to merge all the read-group-specific BAM files of each sample into a single BAM file.

This can be performed as shown here (this step also generates sample-specific BAI files):

In [None]:
/cluster/work/pausch/group_bin/sambamba_v0.6.6 merge -t 6 {output} {input}

As can be seen in the command, the input files are given at the very end: the command takes the first file name as the output and merges all the remaining files into it. The important point is that at least 2 input files are required.

Although the majority of the FASTQ files contained several read groups and were properly split, some of the FASTQ files contained only one read group. For these samples, only one BAM file is present at this point.

This needs to be taken into account in Snakemake, so that merging only takes place when there is more than one read group. The conditional we designed is the following:

In [None]:
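run:
        if len(input) == 1:
            # Only one read group: nothing to merge, so move the BAM and its index.
            shell("mv {input} {output} \n" + "mv {input}.bai {output}.bai")
        else:
            # Two or more read groups: merge them into the sample-specific BAM.
            shell(SAMBAMBA + " merge -t 6 {output} {input}")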
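The branching above can also be written as a plain Python function, which makes it easy to check outside Snakemake. This is a sketch under stated assumptions: `merge_plan` is a hypothetical helper that is not part of the Snakefile, the file names are placeholders, and the function only returns the commands instead of running them.

```python
# Path to the Sambamba binary, as used in the commands above.
SAMBAMBA = "/cluster/work/pausch/group_bin/sambamba_v0.6.6"


def merge_plan(inputs, output, threads=6):
    """Return the commands (as argument lists) needed to produce `output`.

    Hypothetical helper mirroring the Snakemake `run` block: one input is
    moved together with its index, several inputs are merged with Sambamba.
    """
    if not inputs:
        raise ValueError("at least one read-group BAM file is required")
    if len(inputs) == 1:
        single = inputs[0]
        return [
            ["mv", single, output],
            ["mv", single + ".bai", output + ".bai"],
        ]
    # Output first, then all input files, as `sambamba merge` expects.
    return [[SAMBAMBA, "merge", "-t", str(threads), output, *inputs]]


# Placeholder file names for illustration only.
print(merge_plan(["rg1.bam"], "sample.bam"))
print(merge_plan(["rg1.bam", "rg2.bam"], "sample.bam"))
```

Returning the plan rather than executing it keeps the decision logic separate from the shell calls, so the single-read-group case can be verified without touching any files.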

#### Stats at a glance

Once the sample-specific BAM and BAI files are obtained, we might want to take a look at the details. Sambamba is also a good tool for generating a report that includes these stats.

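This command can be run as follows (`-t` and `--nthreads` are the short and long forms of the same option, so one of them would suffice):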

In [None]:
/cluster/work/pausch/group_bin/sambamba_v0.6.6 flagstat -t 3 --nthreads 10 {input.BAM} > {output.Stats}

And the resulting report looks like:

In [1]:
cat /cluster/work/pausch/temp_scratch/audald/best_assembly/sorted_alignment/Angus/BSWCHEF120127770289/BSWCHEF120127770289.stats

936027873 + 0 in total (QC-passed reads + QC-failed reads)
4211599 + 0 secondary
0 + 0 supplementary
66950199 + 0 duplicates
931920065 + 0 mapped (99.56%:N/A)
931816274 + 0 paired in sequencing
465908137 + 0 read1
465908137 + 0 read2
909222300 + 0 properly paired (97.58%:N/A)
927239830 + 0 with itself and mate mapped
468636 + 0 singletons (0.05%:N/A)
16171804 + 0 with mate mapped to a different chr
8047353 + 0 with mate mapped to a different chr (mapQ>=5)
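A report in this format can be turned into numbers with a few lines of Python, for example to compute the duplicate fraction. This is a minimal sketch and not part of the workflow: `parse_flagstat` is a hypothetical helper, and it assumes every line follows the `<passed> + <failed> <label>` layout shown above, optionally ending in a percentage pair such as `(99.56%:N/A)`.

```python
import re

# "<QC-passed> + <QC-failed> <label>" as in the report above.
LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")
# Trailing percentage pair such as "(99.56%:N/A)", which is dropped from the label.
PERCENT_SUFFIX = re.compile(r"\s*\((?:[\d.]+%|N/A):(?:[\d.]+%|N/A)\)$")


def parse_flagstat(text):
    """Map each label to a (QC-passed, QC-failed) tuple of read counts."""
    stats = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip blank or unexpected lines
        passed, failed, label = match.groups()
        stats[PERCENT_SUFFIX.sub("", label)] = (int(passed), int(failed))
    return stats


# Three lines copied from the report above.
report = """936027873 + 0 in total (QC-passed reads + QC-failed reads)
66950199 + 0 duplicates
931920065 + 0 mapped (99.56%:N/A)
"""
stats = parse_flagstat(report)
total = stats["in total (QC-passed reads + QC-failed reads)"][0]
duplicates = stats["duplicates"][0]
print(f"duplicate fraction: {duplicates / total:.4f}")
```

Labels that end in a non-percentage parenthesis, such as `(mapQ>=5)`, are kept whole so that the two "different chr" lines remain distinct keys.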


Two different stats reports are run for each sample within this workflow:
- The first one straight after the merging, as documented in this notebook
- The second one is the very last step and is run once [Picard tools](Picard_tools.ipynb) are applied

#### Sambamba in Snakemake

Overall, the parameters and the actual execution within Snakemake can be seen in the following files: [Snakefile](Snakemake/Snakefile.py), [Configuration details](Snakemake/config.yaml) and [Cluster details](Snakemake/cluster.json).