### Alignment - wrapper

To evaluate and compare two assemblies, we need to align the existing data to both references and draw conclusions from the results. The first part of the analysis processes the raw data (FASTQ files) and aligns them to both assemblies. The resulting BAM files are used in the next part of the process: variant calling.

Two sources of complexity have emerged over the last weeks while setting up the pipeline:
* Several tools and steps are involved - an individual notebook has been created for each one.
* Snakemake is used as the workflow manager - this generates robust and reproducible workflows.

Hence, I am using this notebook as a wrapper for the different steps.

The IDs of interest were obtained [here](../Data/IDs/OBV_BSW_raw_IDs_python.ipynb) in the exact format expected by the Snakefile: a Python list.

The pipeline used as the reference for the alignment was developed by Hubert Pausch and is taken from [here](Original_pipeline/Hubert_Pausch_initial_code.ipynb).

All the steps are summarised below, with the relevant links to the detailed notebooks.

The Snakemake details and the complete pipeline can be found below the description of the tools and steps, as well as in the Snakemake files ([Snakefile](Snakemake/Snakefile.py), [Configuration details](Snakemake/config.yaml) and [Cluster details](Snakemake/cluster.json)).

#### Fastp - QC of raw data

Having confirmed that raw data exist for the IDs of interest, and regardless of whether a previous sanity check has already been performed on them (this is the case for some samples), we run [fastp](https://github.com/OpenGene/fastp) on the raw data.

This step produces the filtered raw data (FASTQ files) plus two reports: one JSON and one HTML.
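As a minimal sketch of what such a call might look like for one paired-end sample, the helper below assembles a fastp command line. The file names, the output prefix and the helper name `fastp_command` are hypothetical; the parameters actually used are documented in the linked notebook.

```python
import shlex


def fastp_command(r1, r2, out_prefix, threads=4):
    """Build a paired-end fastp command as an argument list."""
    return [
        "fastp",
        "--in1", r1,                              # raw forward reads
        "--in2", r2,                              # raw reverse reads
        "--out1", f"{out_prefix}_R1.fastq.gz",    # filtered forward reads
        "--out2", f"{out_prefix}_R2.fastq.gz",    # filtered reverse reads
        "--json", f"{out_prefix}.json",           # JSON report
        "--html", f"{out_prefix}.html",           # HTML report
        "--thread", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = fastp_command("SAMPLE_A_R1.fastq.gz", "SAMPLE_A_R2.fastq.gz", "fastp/SAMPLE_A")
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to pass to `subprocess.run` or to join into a Snakemake `shell` string.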

Commands, parameters and some of the observations of the QC can be seen [here](Fastp.ipynb).


#### gdc-fastq-splitter - Splitting of raw data into read-groups

The FASTQ files produced by fastp contain reads from different read groups. It is important to properly tag/sort the reads into read groups when generating the aligned (BAM) files, so that batch effects and artifacts can be treated appropriately. Some of the tools we will be using need the reads properly sorted into read groups; the best way to tag the read groups is by generating read-group-specific FASTQ files.

[gdc-fastq-splitter](https://github.com/kmhernan/gdc-fastq-splitter) is the tool we use for splitting the FASTQ files with different read groups into read-group-specific FASTQ files.
All details of the installation - it was not straightforward - can be found in the ["How_I_did_stuff" notebook](https://github.com/Audald/ETH_Jupyter/blob/master/How_I_did_stuff.ipynb).

In the [parallel notebook](ReadGroup.ipynb) the following items are described:
- What read groups (RG) are
- What these RGs look like in FASTQ and BAM files
- Which parameters are used for splitting the files
- Confirmation that splitting FASTQ files into multiple read-group-specific FASTQ files works
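To make the link between FASTQ and BAM read groups concrete, here is a small sketch that derives read-group fields from an Illumina (Casava 1.8+) FASTQ header and formats them as a SAM `@RG` line. Using `flowcell.lane` as the read-group ID is a common convention that is assumed here, not necessarily the one used in the notebook; the header and sample name are made up.

```python
def read_group_from_header(header, sample):
    """Derive read-group fields from an Illumina (Casava 1.8+) FASTQ header."""
    # Header layout: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    name = header.lstrip("@").split()[0]
    instrument, run, flowcell, lane = name.split(":")[:4]
    return {
        "ID": f"{flowcell}.{lane}",  # assumed convention: one read group per flowcell lane
        "PU": f"{flowcell}.{lane}",  # platform unit
        "SM": sample,                # sample name
        "PL": "ILLUMINA",            # sequencing platform
    }


def rg_line(fields):
    """Format the fields as a tab-separated SAM @RG header line."""
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields.items())


# Hypothetical header and sample ID, for illustration only.
rg = read_group_from_header("@A00123:45:HXXXXXXXX:2:1101:1000:2000 1:N:0:ACGT", "SAMPLE_A")
print(rg_line(rg))
```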


#### BWA, samblaster and BAM tools - Alignment against different reference genomes

The read-group-specific FASTQ files can now be aligned to our two reference genomes (UCD and Angus), producing read-group-specific BAM files for each reference. Three different tools are used for this, as described in the [alignment notebook](Alignment.ipynb).

Some difficulties were encountered during this process; they are also explained in the notebook above:

* How to provide the software with the read groups programmatically
* Parameters used for the alignment
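One possible way to pass the read group programmatically is through the `-R` option of `bwa mem`, which accepts a complete `@RG` line with `\t` written literally. The sketch below builds such a command and pipes it into samblaster; the reference and FASTQ file names are hypothetical, and the conversion of the SAM stream to BAM is deliberately left out.

```python
import shlex


def bwa_samblaster_pipeline(reference, r1, r2, rg_fields, threads=4):
    """Return a shell pipeline string: bwa mem with a read group, piped to samblaster."""
    # bwa mem expects a literal backslash-t between fields and converts it to a TAB itself.
    rg = "@RG\\t" + "\\t".join(f"{key}:{value}" for key, value in rg_fields.items())
    bwa = ["bwa", "mem", "-t", str(threads), "-R", rg, reference, r1, r2]
    # shlex.join quotes the read-group string so the shell leaves the backslashes alone.
    # samblaster reads SAM from stdin and writes SAM to stdout by default.
    return shlex.join(bwa) + " | samblaster"


# Hypothetical inputs, for illustration only.
pipeline = bwa_samblaster_pipeline(
    "UCD.fa", "rg1_R1.fastq.gz", "rg1_R2.fastq.gz", {"ID": "FC.1", "SM": "S1"}
)
print(pipeline)
```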


#### Sambamba - Sorting, merging and extracting stats from the BAM files

Although the BAM files are the desired output of our workflow, they need to be processed further. [Sambamba](https://lomereiter.github.io/sambamba/docs/sambamba-view.html) is a very useful tool for handling the newly created BAM files. The following steps are necessary and are described in the [relevant notebook](Sorting_Merging_Stats.ipynb):

* Sorting the BAM files. The read-group-specific BAM files are not sorted; sorting is necessary for merging and for further processing.
* Merging the sorted read-group-specific BAM files into sample-specific BAM files.
* Generating index files (BAI); this is part of both the sorting and the merging steps.
* Generating stats so that we can inspect the details of the BAM files at a glance.

The correct chaining of these steps, as well as the optimised parameters/variables, can be seen in the [same notebook](Sorting_Merging_Stats.ipynb).
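The chaining of the steps listed above can be sketched as a list of sambamba commands for one sample. The file naming and the helper `sambamba_steps` are assumptions made for illustration, and the sketch assumes at least two read-group BAM files per sample for the merge.

```python
def sambamba_steps(sample, rg_bams, threads=4):
    """Return the sambamba commands that sort, merge and summarise one sample."""
    def sorted_name(bam):
        stem = bam[: -len(".bam")] if bam.endswith(".bam") else bam
        return f"{stem}.sorted.bam"

    sorted_bams = [sorted_name(bam) for bam in rg_bams]
    # 1. Sort each read-group-specific BAM (sambamba writes the index alongside).
    steps = [
        ["sambamba", "sort", "-t", str(threads), "-o", out, bam]
        for bam, out in zip(rg_bams, sorted_bams)
    ]
    # 2. Merge the sorted BAMs into one sample-specific BAM: output first, then inputs.
    merged = f"{sample}.bam"
    steps.append(["sambamba", "merge", "-t", str(threads), merged, *sorted_bams])
    # 3. Summary statistics, printed by sambamba to standard output.
    steps.append(["sambamba", "flagstat", "-t", str(threads), merged])
    return steps


# Hypothetical file names, for illustration only.
for step in sambamba_steps("SAMPLE_A", ["SAMPLE_A_rg1.bam", "SAMPLE_A_rg2.bam"]):
    print(" ".join(step))
```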

#### Picard tools - Marking duplicates and building index

Once the sample-specific BAM files have been created (along with the relevant BAI and stats), one major step and one minor step remain to complete the workflow:
- Marking the duplicated reads in the sample-specific BAM files
- Generating an updated index (BAI) for the resulting BAM files

The [Picard tools notebook](Picard_tools.ipynb) defines the details of both actions.
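As a rough sketch, the two actions map onto the Picard tools `MarkDuplicates` and `BuildBamIndex`. The jar location, the output naming and the helper `picard_steps` are hypothetical; the options actually used are in the notebook.

```python
def picard_steps(sample_bam, picard_jar="picard.jar"):
    """Return the Picard commands that mark duplicates and index the result."""
    stem = sample_bam[: -len(".bam")] if sample_bam.endswith(".bam") else sample_bam
    dedup_bam = f"{stem}.dedup.bam"
    metrics = f"{stem}.dedup.metrics.txt"
    mark_duplicates = [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={sample_bam}",  # sorted, sample-specific BAM
        f"O={dedup_bam}",   # BAM with duplicates flagged
        f"M={metrics}",     # duplication metrics report
    ]
    build_index = ["java", "-jar", picard_jar, "BuildBamIndex", f"I={dedup_bam}"]
    return [mark_duplicates, build_index]


# Hypothetical file name, for illustration only.
for step in picard_steps("SAMPLE_A.bam"):
    print(" ".join(step))
```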

We have now completed the workflow that prepares sorted and deduplicated sample-specific BAM files from the raw data. The output can be used straight away for further analyses such as variant calling (GATK + Beagle) - *this can eventually be linked to a wrapper notebook defining variant calling and subsequent analysis*.

#### All together in Snakemake

The different pieces of the workflow are now ready. The actual chaining and integration of these steps is done within Snakemake.

The [following notebook](Snakemake.ipynb) explains the necessary files, the requirements and the instructions for submitting the jobs.

Additionally, potential improvements and lessons learnt are presented there for the sake of clarification and as development ideas.
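To give an idea of how the list of sample IDs drives the workflow, the sketch below enumerates the final files that a target rule would request, one per sample and reference. Inside a Snakefile the same list is usually built with Snakemake's `expand()`; plain Python is used here so that the snippet runs on its own. The sample IDs and the path pattern are hypothetical.

```python
from itertools import product

REFERENCES = ["UCD", "Angus"]        # the two assemblies being compared
SAMPLES = ["SAMPLE_A", "SAMPLE_B"]   # hypothetical IDs; the real list comes from the IDs notebook


def final_targets(samples, references, pattern="{ref}/dedup/{sample}.bam"):
    """List one final BAM path per (reference, sample) combination."""
    return [
        pattern.format(ref=ref, sample=sample)
        for ref, sample in product(references, samples)
    ]


for target in final_targets(SAMPLES, REFERENCES):
    print(target)
```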

#### Validation and check points

Validating the results of a workflow (especially when the number of steps and the level of complexity are high) is crucial in order to trust the output and to move forward with the resulting files confidently. That is why so much effort has been devoted to writing scripts and using software as check points. The different approaches used can be seen [in this notebook](Validation_checkpoints.ipynb) and aim at tackling the following concerns:

* Make sure that all pipeline steps have been completed
* Check that all samples have been successfully processed
* Find out which samples had already gone through QC before the pipeline started
* Confirm that the number of reads is preserved from the raw data to the aligned and sorted BAM files
* Keep an eye on the storage footprint
* Describe alternative methods for validating the results
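Two of these checks can be sketched in a few lines: listing the expected output files that do not exist, and comparing read counts per sample. The path pattern is hypothetical and the read counts are assumed to have been collected beforehand (for the BAM side, counting primary records only, since `bwa mem` can emit supplementary alignments); the actual scripts are in the notebook.

```python
from pathlib import Path


def missing_outputs(samples, references, pattern="{ref}/dedup/{sample}.bam", root="."):
    """Return the expected output paths that do not exist under `root`."""
    expected = [
        Path(root) / pattern.format(ref=ref, sample=sample)
        for ref in references
        for sample in samples
    ]
    return [str(path) for path in expected if not path.exists()]


def read_count_mismatches(fastq_reads, bam_reads):
    """Return {sample: (fastq_count, bam_count)} for samples whose counts differ."""
    return {
        sample: (count, bam_reads.get(sample))
        for sample, count in fastq_reads.items()
        if bam_reads.get(sample) != count  # a sample missing from bam_reads also counts
    }


# Made-up counts, for illustration only.
print(read_count_mismatches({"SAMPLE_A": 1000, "SAMPLE_B": 2000},
                            {"SAMPLE_A": 1000, "SAMPLE_B": 1990}))
```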

#### Optimising resources

One last activity remains before the workflow can be considered complete: optimising the computing resources as much as possible.

Given that hundreds or thousands of jobs are to be submitted, choosing the best combination of resources is necessary if we want the jobs to finish successfully (enough resources) without waiting too long in the queue (no excessive resources). The balance found is described in [this notebook](Optimising_resources.ipynb) and is set in the Cluster JSON file.
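For reference, a Snakemake cluster configuration file typically holds a `__default__` entry plus per-rule overrides. The sketch below shows how the effective resources of a rule follow from such a file; the rule name and all values are invented for illustration and are not the ones in `cluster.json`.

```python
import json

# Invented values and rule name; the real settings live in Snakemake/cluster.json.
CLUSTER_JSON = """
{
  "__default__": {"mem_mb": 4000, "threads": 1, "time": "04:00"},
  "alignment":   {"mem_mb": 16000, "threads": 8}
}
"""


def resources_for(rule, cluster):
    """Merge the defaults with the rule-specific overrides, if any."""
    merged = dict(cluster.get("__default__", {}))
    merged.update(cluster.get(rule, {}))
    return merged


cluster = json.loads(CLUSTER_JSON)
print(resources_for("alignment", cluster))
```

Keeping generous values only on the rules that need them lets the many small jobs request little and leave the queue quickly.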