## Picard tools

Once the sample-specific BAM files have been created (along with the corresponding BAI and stats files), only two steps remain, one major and one minor: marking the duplicated reads and building the final index file.

#### Mark Duplicates

Marking the duplicated reads within the BAM files is necessary so that the following steps (the variant calling workflow, GATK + Beagle) can accurately call the variants of each sample against the reference. This can be done with [Picard tools](https://broadinstitute.github.io/picard/).

As seen in the [alignment step](Alignment.ipynb), duplicates were already marked with [Samblaster](https://github.com/GregoryFaust/samblaster). Why do we need to mark duplicates again? There are two main reasons to run the MarkDuplicates function of Picard tools:

- Duplicates are marked according to quality. That is, whenever a read is found more than once in the BAM file, the copy with the highest quality is kept as the primary (unmarked) read and the others are flagged as duplicates. This differs from Samblaster, which keeps as primary the first copy it encounters.
- Picard is read-group aware. Samblaster was applied to the read-group-specific files, whereas we are now dealing with the sample-specific files, which contain several read groups within a single BAM file. Picard is able to interpret reads according to their read group.
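
The difference between the two selection strategies in the first point can be illustrated with a toy sketch. It is not how either tool is implemented: the reads, positions and quality sums below are invented, and duplicates are grouped by a single position only, whereas the real tools use more elaborate keys (for example both ends of a pair and their orientation).

```python
from collections import defaultdict

# Toy reads: (name, position, sum_of_base_qualities).
# All values are invented for illustration only.
reads = [
    ("r1", 100, 250),
    ("r2", 100, 310),
    ("r3", 100, 290),
    ("r4", 200, 180),
]


def group_by_position(reads):
    """Group reads that share the same (simplified) duplicate key."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[1]].append(read)
    return groups


def keep_first_seen(reads):
    """Order-based choice: the first read seen at each position is kept."""
    return {pos: grp[0][0] for pos, grp in group_by_position(reads).items()}


def keep_best_quality(reads):
    """Quality-based choice: the read with the highest quality sum is kept."""
    return {
        pos: max(grp, key=lambda r: r[2])[0]
        for pos, grp in group_by_position(reads).items()
    }


first = keep_first_seen(reads)
best = keep_best_quality(reads)
print(first)  # {100: 'r1', 200: 'r4'}
print(best)   # {100: 'r2', 200: 'r4'}
```

At position 100 the order-based strategy keeps `r1` simply because it comes first, while the quality-based strategy keeps `r2`, the copy with the highest quality sum.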

This is how duplicates are marked within the workflow:

In [None]:
rule mark_duplicates:
    input:
        "input.bam"
    output:
        bam="output.bam",
        metrics="output.metrics.txt"
    # The adjacent string literals below are concatenated into one shell command;
    # "&&" runs each step only if the previous one succeeded.
    shell:
        "module load picard/2.18.17 && "
        "module load jdk && "
        "java -jar /cluster/apps/gcc-4.8.5/picard-2.18.17-jiviugfnvhk4qlzbcugx6bpzixgb7znb/bin/picard.jar MarkDuplicates I={input} O={output.bam} M={output.metrics}"

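The metrics file written by MarkDuplicates is a tab-separated text report, so the duplication rate can be inspected with a few lines of Python. The following is a minimal sketch that assumes the usual Picard layout (a `## METRICS CLASS` line, followed by a header row and one row per library, ended by a blank line). The helper name `parse_picard_metrics` and the example content are hypothetical, and the example shows only a subset of the columns a real report contains.

```python
def parse_picard_metrics(text):
    """Return the rows of the first metrics table as a list of dicts."""
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            # The line after the class marker holds the column names.
            header = lines[i + 1].split("\t")
            for row in lines[i + 2:]:
                if not row.strip():
                    break  # a blank line ends the table
                rows.append(dict(zip(header, row.split("\t"))))
            break
    return rows


# Invented example content, reduced to a few columns for readability.
example = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# MarkDuplicates I=input.bam O=output.bam M=output.metrics.txt",
    "",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION",
    "lib1\t1000\t50\t0.05",
    "",
    "## HISTOGRAM\tjava.lang.Double",
])

rows = parse_picard_metrics(example)
print(rows[0]["LIBRARY"], rows[0]["PERCENT_DUPLICATION"])
```

For a real run, the text would be read from the file given as `M=` in the rule above, for instance with `open("output.metrics.txt").read()`.
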
#### Build Index

Once the de-duplicated BAM file has been obtained (along with the metrics.txt file), only one step remains before we have the final result: an updated index (BAI).

The BuildBamIndex function of Picard tools is run on the resulting file as follows:

In [None]:
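rule build_index:
    input:
        "input.bam"
    output:
        "output.bam.bai"
    # The adjacent string literals below are concatenated into one shell command;
    # "&&" runs each step only if the previous one succeeded.
    shell:
        "module load picard/2.18.17 && "
        "module load jdk && "
        "java -jar /cluster/apps/gcc-4.8.5/picard-2.18.17-jiviugfnvhk4qlzbcugx6bpzixgb7znb/bin/picard.jar BuildBamIndex I={input} O={output}"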

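Because the index has to match the final BAM file, it can be useful to check that it exists and is not older than the BAM it describes. The helper below is a hypothetical sketch that is not part of the workflow: it assumes the index sits next to the BAM with a `.bai` suffix appended, as in the rule above, and it compares modification times only.

```python
import os
import tempfile


def index_is_current(bam_path, index_suffix=".bai"):
    """Return True if the index exists and is at least as new as the BAM."""
    index_path = bam_path + index_suffix
    if not os.path.exists(index_path):
        return False
    return os.path.getmtime(index_path) >= os.path.getmtime(bam_path)


# Demonstration with empty placeholder files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    bam = os.path.join(tmp, "output.bam")
    open(bam, "wb").close()
    missing = index_is_current(bam)       # no index written yet

    open(bam + ".bai", "wb").close()
    os.utime(bam, (1000, 1000))
    os.utime(bam + ".bai", (2000, 2000))  # index newer than the BAM
    fresh = index_is_current(bam)

    os.utime(bam + ".bai", (500, 500))    # index older than the BAM
    stale = index_is_current(bam)

print(missing, fresh, stale)  # False True False
```

A timestamp comparison is only a heuristic: it detects a forgotten or outdated index, not a corrupted one.
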
We have now completed the workflow that prepares sorted and de-duplicated sample-specific BAM files from the raw data. The output can be used straight away for further analyses such as variant calling and analysis.

The concrete parameters for Picard tools within Snakemake can be found in the [Snakefile](Snakemake/Snakefile.py), the [Configuration details](Snakemake/config.yaml) and the [Cluster details](Snakemake/cluster.json).