## Alignment against different references

#### Read group preparation

The read group needs to be provided to the alignment step in a very specific format:

rg="@RG\tID:flowcell.lane\tCN:center\tLB:sample\tPL:illumina\tPU:read_group\tSM:sample"

e.g.

rg_BSWDEUM000916561366_C2FH5ACXX_5="@RG\tID:C2FH5ACXX.5\tCN:TUM\tLB:BSWDEUM000916561366\tPL:illumina\tPU:C2FH5ACXX:5\tSM:BSWDEUM000916561366"
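
As an illustration, the template above can be filled in by a small helper. This is only a sketch: the function name `read_group` and its `center` default (taken from the `CN:TUM` example) are assumptions and are not part of the pipeline itself. The separators are kept as a literal backslash followed by `t`, as in the strings above, since `bwa mem -R` converts them to tabs itself.

```python
def read_group(sample: str, flowcell: str, lane: int, center: str = "TUM") -> str:
    """Return the @RG line in the format shown above (hypothetical helper).

    Fields are joined with a literal backslash + 't', not a real tab,
    matching the strings passed to bwa mem -R in this notebook.
    """
    fields = [
        "@RG",
        f"ID:{flowcell}.{lane}",
        f"CN:{center}",
        f"LB:{sample}",
        "PL:illumina",
        f"PU:{flowcell}:{lane}",
        f"SM:{sample}",
    ]
    return "\\t".join(fields)


rg = read_group("BSWDEUM000916561366", "C2FH5ACXX", 5)
print(rg)
```

Calling it with the sample, flowcell and lane of the example reproduces the string shown above.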

The actual alignment of the raw data (FASTQ files already split by read group) consists of three steps:

* [bwa alignment](http://bio-bwa.sourceforge.net). The FASTQ files are aligned to the reference genome, producing SAM output.
* [samblaster](https://github.com/GregoryFaust/samblaster). Duplicate reads are marked in the SAM output.
* [samtools](http://samtools.sourceforge.net). The SAM output is converted to a BAM file.
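
The three steps can be chained into a single shell pipeline. The sketch below only builds the command string and does not run anything; the function name `alignment_pipeline` and its parameters are assumptions made for illustration, while the flags are the ones used in the commands of this notebook.

```python
import shlex


def alignment_pipeline(rg, ref, r1, r2, bam, samblaster="samblaster", threads=12):
    """Compose bwa mem | samblaster | samtools view as one shell command.

    Hypothetical helper: it returns the command string, it does not execute it.
    """
    align = ["bwa", "mem", "-M", "-t", str(threads), "-R", rg, ref, r1, r2]
    mark_duplicates = [samblaster, "-M"]
    to_bam = ["samtools", "view", "-Sb", "-"]
    # Quote every argument so the read group string survives the shell intact.
    stages = [
        " ".join(shlex.quote(arg) for arg in stage)
        for stage in (align, mark_duplicates, to_bam)
    ]
    return " | ".join(stages) + " > " + shlex.quote(bam)


cmd = alignment_pipeline(
    "@RG\\tID:C2FH5ACXX.5", "ref.fa", "s_R1.fq.gz", "s_R2.fq.gz", "s.bam"
)
print(cmd)
```

Building the stages as argument lists and quoting them with `shlex.quote` avoids hand-written nested quotes around the read group.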

#### Reference genomes

FASTQ files are necessary for the alignment step, and they must be split by read group, with each read group supplied in the format described above. Alignment also requires a reference genome. This pipeline uses the UCD and Angus references. These and other references used by the group can be found at /cluster/work/pausch/inputs/ref/BTA.

#### Alignment as a concatenation of the three steps

Once a specific read group is known (and prepared as shown above), the following commands ran successfully in the original exploration:

In [None]:
module load bwa/0.7.17
module load samtools/1.6
bsub -n 1 -W 8:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:C2FH5ACXX.5\tCN:TUM\tLB:BSWDEUM000916561366\tPL:illumina\tPU:C2FH5ACXX:5\tSM:BSWDEUM000916561366' /cluster/work/pausch/inputs/ref/BTA/UCD1.2/ARS-UCD1.2_Btau5.0.1Y.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWDEUM000916561366/BSWDEUM000916561366_C2FH5ACXX_5_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWDEUM000916561366/BSWDEUM000916561366_C2FH5ACXX_5_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWDEUM000916561366_C2FH5ACXX_5.bam"

Taking sample **BSWAUTM000336344707** as an example, the BAM files for its 3 read groups, each aligned to the 2 references, are generated manually (6 jobs in total).

In [None]:
module load bwa/0.7.17
module load samtools/1.6
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.1\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:1\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/UCD1.2/ARS-UCD1.2_Btau5.0.1Y.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_UCD.bam"
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.1\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:1\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/Angus/GCA_003369685.2_UOA_Angus_1_ARSUCD1.2X.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_Angus.bam"
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.3\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:3\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/UCD1.2/ARS-UCD1.2_Btau5.0.1Y.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_UCD.bam"
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.3\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:3\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/Angus/GCA_003369685.2_UOA_Angus_1_ARSUCD1.2X.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_Angus.bam"
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.4\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:4\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/UCD1.2/ARS-UCD1.2_Btau5.0.1Y.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_UCD.bam"
bsub -n 1 -W 48:00 -R "rusage[mem=25000,scratch=1000]" -J "bam_creation" -env "all" "bwa mem -M -t 12 -R '@RG\tID:H5KCGDSXX.4\tCN:TUM\tLB:BSWAUTM000336344707\tPL:illumina\tPU:H5KCGDSXX:4\tSM:BSWAUTM000336344707' /cluster/work/pausch/inputs/ref/BTA/Angus/GCA_003369685.2_UOA_Angus_1_ARSUCD1.2X.fa /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_R2.fq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > /cluster/work/pausch/temp_scratch/audald/alignment/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_Angus.bam"

However, this manual approach is not feasible for all the files. I am analysing hundreds of files, each split into 1-8 subfiles that need to be aligned to two different reference genomes, so the number of outputs is very large. As an illustration, the flowcells and lanes present in the FASTQ files are listed here:

In [2]:
cd /cluster/work/pausch/temp_scratch/audald
ls -lrth split_fastq/* > names.txt
awk -F " " '{print $9}' names.txt | grep 'fq' > fq_names.txt
echo 'These are all the flowcells:'
awk -F "_" '{print $2}' fq_names.txt | sort | uniq
echo 'These are all the lanes:'
awk -F "_" '{print $3}' fq_names.txt | sort | uniq

These are all the flowcells:
C192UACXX
C1989ACXX
C19JAACXX
C19JJACXX
C19RVACXX
C1VFKACXX
C1VR2ACXX
C2EWAACXX
C2FH5ACXX
C5PBFACXX
C62E0ANXX
C87MVANXX
CAJTNANXX
CBCYNANXX
CBD0JANXX
CBL3FANXX
D1C61ACXX
D1C72ACXX
D1DMRACXX
D1DUKACXX
D1H9RACXX
D1NLFACXX
D1YA9ACXX
D20MTACXX
H2VYKBBXX
H2Y2TBCXY
H3WNWDSXX
H3WT2DSXX
H3WYNBBXX
H3YFWBBXX
H3YGCBBXX
H52LGDSXX
H52NJBBXX
H5GHKDRXX
H5KCGDSXX
H5NVJDSXX
H5VHWDSXX
H72JVDSXX
H7G5HDSXX
H7HFFDSXX
HGKLWDSXX
HHJHTBBXX
HHJKCBBXX
HHJMKBBXX
HHLCCDSXX
HHV75DSXX
HJ5MVBBXX
HJ5V5BBXX
HJ7VHBBXX
HLGTNBBXX
HM5LGDSXX
HTJLNBBXX
HW3CKBBXX
These are all the lanes:
1
2
3
4
5
6
7
8
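
The same information can be recovered per file rather than as two separate lists. The sketch below assumes, as the `awk -F "_"` calls above do, that file names follow `<sample>_<flowcell>_<lane>_<R1|R2>.fq.gz` and that sample names contain no underscore; `parse_fastq_name` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import PurePosixPath


def parse_fastq_name(path):
    """Split '<sample>_<flowcell>_<lane>_<R1|R2>.fq.gz' into its parts.

    Assumes no underscore inside the sample name (hypothetical helper).
    """
    name = PurePosixPath(path).name
    if not name.endswith(".fq.gz"):
        raise ValueError(f"not a split FASTQ file: {path}")
    sample, flowcell, lane, mate = name[: -len(".fq.gz")].split("_")
    return sample, flowcell, int(lane), mate


names = [
    "split_fastq/BSWDEUM000916561366/BSWDEUM000916561366_C2FH5ACXX_5_R1.fq.gz",
    "split_fastq/BSWDEUM000916561366/BSWDEUM000916561366_C2FH5ACXX_5_R2.fq.gz",
    "split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz",
]
# One entry per read group: R1 and R2 of the same lane collapse into one tuple.
read_groups = sorted({parse_fastq_name(n)[:3] for n in names})
print(read_groups)
```

Each tuple holds exactly the three values needed to fill in the read group string for one alignment job.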


#### Programmatic execution of alignment steps

Snakemake is the natural choice, not only to automate the alignment but also to chain it to the previous (and following) steps. This workflow manager requires concrete inputs and outputs in order to complete each rule. In this specific case, given the variety of read groups, the eventual read-group-specific BAM files are not known *a priori*.

The code that handles this uncertainty and connects all the steps can be seen in the [Snakefile](Snakemake/Snakefile.py), which relies on checkpoints and user-defined functions.
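
To make the idea concrete, the sketch below shows in plain Python the kind of lookup that has to happen once the split FASTQ files exist: list the R1 files of a sample and derive the BAM names they imply. It is not the code from the Snakefile; `expected_bams`, `REFERENCES` and the directory layout in the demo are assumptions modelled on the paths used above.

```python
import tempfile
from pathlib import Path

# Reference labels as they appear in the BAM names above (assumed fixed here).
REFERENCES = ("UCD", "Angus")


def expected_bams(split_dir, align_dir, sample):
    """List the BAM files implied by the R1 files found for one sample."""
    targets = []
    pattern = f"{sample}_*_R1.fq.gz"
    for r1 in sorted(Path(split_dir, sample).glob(pattern)):
        _, flowcell, lane, _ = r1.name[: -len(".fq.gz")].split("_")
        for ref in REFERENCES:
            bam = f"{sample}_{flowcell}_{lane}_{ref}.bam"
            targets.append(str(Path(align_dir, sample, bam)))
    return targets


# Demo on empty placeholder files that mimic the three read groups shown above.
with tempfile.TemporaryDirectory() as tmp:
    sample = "BSWAUTM000336344707"
    fq_dir = Path(tmp, "split_fastq", sample)
    fq_dir.mkdir(parents=True)
    for lane in (1, 3, 4):
        for mate in ("R1", "R2"):
            (fq_dir / f"{sample}_H5KCGDSXX_{lane}_{mate}.fq.gz").touch()
    demo = expected_bams(Path(tmp, "split_fastq"), "alignment", sample)

print(len(demo))
```

In a workflow, a function of this kind would only be evaluated after the splitting step has finished, which is the role the checkpoint plays.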

In a nutshell, these are the parameters and this is how alignment is run within Snakemake ([Snakefile](Snakemake/Snakefile.py), [Configuration details](Snakemake/config.yaml) and [Cluster details](Snakemake/cluster.json)):

In [None]:
    params:
        rg="@RG\\tID:{flowcell}.{lane}\\tCN:TUM\\tLB:{sample}\\tPL:illumina\\tPU:{flowcell}:{lane}\\tSM:{sample}",
        bwa_mem = "-M -t 12 -R"
    shell:
        "module load bwa/0.7.17 \n" +
        "module load samtools/1.6 \n" +
        "bwa mem {params.bwa_mem} '{params.rg}' {input.ref} {input.R1} {input.R2} | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > {output.BAM}"

This results in 2 BAM files per read group of each sample (one for each reference genome).

P.S. Because it was not obvious how to generate the concrete read groups automatically and pass them to bwa within Snakemake, I posted this question on [Stack Overflow](https://stackoverflow.com/questions/58747002/including-unforeseen-file-names-as-wildcards-in-snakemake). The final code does use checkpoints and user-defined functions, but it differs slightly from the answer given there.