## Coverage - wrapper

Sequencing coverage is a good starting point for comparing the two assemblies, UCD and Angus.
Sorted and de-duplicated BAM files were obtained for both reference genomes, as described in the [Processing notebook](../Alignment/Processing_pipeline.ipynb).

Although both BSW and OBV samples were prepared and aligned, we decided to move forward with the [BSW breed](../Data/IDs/OBV_BSW_raw_IDs_python.ipynb) only.

The BAM files are the input for the [get_coverage scripts](get_coverage_mosdepth_original_code.ipynb), which calculate the coverage of each BAM file with mosdepth ([paper](https://www.ncbi.nlm.nih.gov/pubmed/29096012) - [GitHub](https://github.com/brentp/mosdepth)) and generate average reports across all samples. The bash scripts were provided by Hubert Pausch; their goals and structure are explained in [the notebook](get_coverage_mosdepth_original_code.ipynb).

However, since I am building a two-step coverage pipeline with Snakemake, the bash scripts need some tweaks to run within a workflow.

#### 1. get_coverage

[mosdepth](https://github.com/brentp/mosdepth) is run on the de-duplicated BAM files to extract the coverage at every position for each sample. This step generates a sample.output file with that information.

[This notebook](get_coverage.ipynb) explains:

* How the script is adapted for Snakemake integration
* What the output looks like
* Whether the results are correct
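
As a rough illustration of this step, the following sketch builds a mosdepth call for one BAM file from Python. It is not the get_coverage script itself: the helper names `mosdepth_command` and `run_mosdepth`, the example sample name, the `coverage` output directory and the thread count are all assumptions made for the example. Only the general mosdepth usage (`mosdepth [options] <prefix> <BAM-or-CRAM>`) and its `--threads` option are taken from the tool.

```python
import shutil
import subprocess
from pathlib import Path


def mosdepth_command(bam, out_dir, threads=4):
    """Build the mosdepth call for one de-duplicated BAM file (hypothetical helper)."""
    bam = Path(bam)
    # Output prefix, e.g. coverage/SAMPLE1 for SAMPLE1.bam
    prefix = Path(out_dir) / bam.stem
    # mosdepth usage: mosdepth [options] <prefix> <BAM-or-CRAM>
    return ["mosdepth", "--threads", str(threads), str(prefix), str(bam)]


def run_mosdepth(bam, out_dir, threads=4):
    """Run mosdepth if it is available; fail loudly on a non-zero exit code."""
    if shutil.which("mosdepth") is None:
        raise RuntimeError("mosdepth not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(mosdepth_command(bam, out_dir, threads), check=True)


if __name__ == "__main__":
    # Only print the command; SAMPLE1.bam is a made-up file name.
    print(" ".join(mosdepth_command("dedup_alignment/SAMPLE1.bam", "coverage")))
```

Keeping the command construction separate from the execution makes it easy to check the command line before submitting hundreds of jobs.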

#### 2. average_coverage

To compare the coverage of both assemblies, a single summary per assembly is more convenient than comparing 243 output files against another 243.

The [following notebook](average_coverage.ipynb) describes the Python code used to gather the relevant information from each output file into a single per-assembly summary file. The notebook discusses:

* How the script is translated to Python and adapted for Snakemake integration
* What the output looks like and which details can be discussed
* Whether the results are correct and unaffected by the translation to Python
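
The idea of collapsing per-sample outputs into one assembly summary can be sketched as follows. This is an illustration under assumptions, not the average_coverage script: it assumes each per-sample file follows the tab-separated mosdepth summary layout (columns `chrom`, `length`, `bases`, `mean`, `min`, `max`, with a final `total` row), which may differ from the sample.output format used here. The function names are hypothetical.

```python
import csv
import io


def sample_mean_coverage(summary_text):
    """Length-weighted mean coverage over all chromosomes of one sample."""
    total_bases = 0
    total_length = 0
    for row in csv.DictReader(io.StringIO(summary_text), delimiter="\t"):
        if row["chrom"].startswith("total"):
            continue  # skip the aggregate row so it is not counted twice
        total_bases += int(row["bases"])
        total_length += int(row["length"])
    return total_bases / total_length


def assembly_summary(summaries):
    """Return per-sample mean coverage and the average across samples.

    `summaries` maps a sample ID to the text of its summary file.
    """
    per_sample = {sid: sample_mean_coverage(txt) for sid, txt in summaries.items()}
    overall = sum(per_sample.values()) / len(per_sample)
    return per_sample, overall
```

Weighting by chromosome length avoids giving short contigs the same influence as whole chromosomes, which matters when the two assemblies differ in the number of unplaced scaffolds.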


### Snakemake

The two-step coverage pipeline, together with the relevant configuration and cluster files, can be found here:

* [Snakefile](Snakemake/Snakefile.py) 
* [Config file](Snakemake/config.yaml)
* [Cluster file](Snakemake/cluster.json)

The following files are required for Snakemake to run correctly:

* Sorted and de-duplicated BAM files (/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/)
* Information file from which the BAM ID to Interbull ID mapping can be retrieved (/cluster/work/pausch/inputs/ref/BTA/individual_information.csv)
* get_coverage script (/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/get_coverage.sh)
* average_coverage script (/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/average_coverage.py)
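
Before submitting jobs, it can help to confirm that these inputs are actually in place. The sketch below is one possible pre-flight check; the helper `missing_inputs` is hypothetical and not part of the pipeline, and the paths are simply the ones listed above.

```python
from pathlib import Path

# Paths copied from the list of required inputs above.
REQUIRED_INPUTS = [
    "/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/",
    "/cluster/work/pausch/inputs/ref/BTA/individual_information.csv",
    "/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/get_coverage.sh",
    "/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/average_coverage.py",
]


def missing_inputs(paths):
    """Return the subset of paths that do not exist (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    missing = missing_inputs(REQUIRED_INPUTS)
    if missing:
        print("Missing inputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All required inputs are present.")
```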

Once the Snakemake files are prepared and the additional requirements are met, jobs can be submitted as follows:

In [None]:
module load python_gpu/3.6.4
snakemake --snakefile Snakefile --jobs 999 --cluster-config cluster.json --cluster "bsub -J {cluster.jobname} -n {cluster.ncore} -W {cluster.jobtime} -o {cluster.logi} -R \"rusage[mem={cluster.memo}]\"" 
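
The `{cluster.…}` placeholders in the `--cluster` string are filled from the cluster file for each job. The sketch below imitates that lookup in plain Python to show how the final bsub command comes together. It assumes the usual layout of a Snakemake cluster file (a `__default__` entry plus optional per-rule overrides). The values and the rule name `get_coverage` are hypothetical, not the contents of the real cluster.json; only the key names (`jobname`, `ncore`, `jobtime`, `logi`, `memo`) come from the command above.

```python
from types import SimpleNamespace

# Hypothetical cluster.json content: keys match the placeholders in the
# bsub command, values are illustrative only.
CLUSTER_CONFIG = {
    "__default__": {
        "jobname": "coverage",
        "ncore": 1,
        "jobtime": "04:00",
        "memo": 4000,
        "logi": "log/default.log",
    },
    "get_coverage": {"ncore": 4, "memo": 8000},
}

SUBMIT_TEMPLATE = (
    "bsub -J {cluster.jobname} -n {cluster.ncore} -W {cluster.jobtime} "
    '-o {cluster.logi} -R "rusage[mem={cluster.memo}]"'
)


def submit_command(rule, config=CLUSTER_CONFIG, template=SUBMIT_TEMPLATE):
    """Merge defaults with rule-specific settings and fill the template."""
    settings = dict(config.get("__default__", {}))
    settings.update(config.get(rule, {}))  # rule entries override defaults
    return template.format(cluster=SimpleNamespace(**settings))


if __name__ == "__main__":
    print(submit_command("get_coverage"))
    print(submit_command("average_coverage"))  # no override: defaults only
```

Printing the resolved command for each rule is a quick way to spot a missing key in the cluster file before any job reaches the scheduler.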