This notebook describes:

- How the original script has been adapted for Snakemake
- What the structure of the output file is
- Which checks are necessary to deem the output correct

#### Script adapted to Snakemake

The script that retrieves the coverage of each sample (get_coverage.sh) has been adapted for Snakemake and can be found here:

In [1]:
ls -lrth /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/get_coverage.sh

-rwxrwx--- 1 avillas hest-hpc-tg 1007 Feb 13 15:37 /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/get_coverage.sh


It looks like this:

In [None]:
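cd $TMPDIR

export LD_LIBRARY_PATH=/cluster/work/pausch/group_bin/htslib/

# Arguments: $1 = BAM file, $2 = output coverage file, $3 = mosdepth output prefix,
# $4 = gzipped BED of depth intervals (chrom, start, end, depth) read back after each mosdepth run
for chr in {1..32}
do
# Loop indices 30, 31 and 32 stand for chromosomes X, Y and MT
if [[ $chr = "30" ]]; then
    chr=X
fi
if [[ $chr = "31" ]]; then
    chr=Y
fi
if [[ $chr = "32" ]]; then
    chr=MT
fi

# Restrict mosdepth to the current chromosome
/cluster/work/pausch/group_bin/mosdepth -c $chr $3 $1

# Mean coverage = sum(interval length * depth) / end coordinate of the last interval
if [[ $chr != "X" ]]; then
    coverage_sum=$(zcat $4 | awk '{ total += ($3-$2)*$4 } END {print total}')
    length=$(zcat $4 | tail -n 1 | awk '{print $3}')
    coverage=$(printf %.2f $(echo $coverage_sum/$length | bc -l))
    echo $1 $chr $length $coverage >> $2
fi

# X is split at position 133300518 into two rows, X and PAR, each with a fixed length.
# Intervals are assigned by their start; one starting exactly at the boundary is in neither row.
if [[ $chr = "X" ]]; then
    coverage_sum=$(zcat $4 | awk '$2<133300518' | awk '{ total += ($3-$2)*$4 } END {print total}')
    length=133300518
    coverage=$(printf %.2f $(echo $coverage_sum/$length | bc -l))
    echo $1 $chr $length $coverage >> $2

    coverage_sum=$(zcat $4 | awk '$2>133300518' | awk '{ total += ($3-$2)*$4 } END {print total}')
    length=5708626
    coverage=$(printf %.2f $(echo $coverage_sum/$length | bc -l))
    echo $1 PAR $length $coverage >> $2
fi
done

# Remove the mosdepth output files
rm $3*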

This script has to be run on 486 files: 243 selected BAMs (see the [ID retrieval notebook](../Data/IDs/OBV_BSW_raw_IDs_python.ipynb)) x 2 assemblies (UCD and Angus). In addition, each job writes its own atomic log file and requests cluster resources tuned for the task. To submit all jobs programmatically under these conditions, the following Snakemake files were created:

* [Snakefile](Snakemake/Snakefile.py) 
* [Config file](Snakemake/config.yaml)
* [Cluster file](Snakemake/cluster.json)
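To make the 243 x 2 job layout concrete, the following is a minimal sketch of how the input BAM and output coverage file of each job could be enumerated. It is not the Snakefile itself. The sample IDs `SAMPLE_B` and `SAMPLE_C` are hypothetical placeholders, and the `UCD` directory name is an assumption modelled on the `Angus` paths that appear elsewhere in this notebook.

```shell
# Hypothetical sample IDs; the real list holds 243 entries
samples="RM724 SAMPLE_B SAMPLE_C"
assemblies="UCD Angus"
# Roots taken from the Angus paths in this notebook; the UCD layout is assumed to match
bam_root=/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment
cov_root=/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage

list_jobs() {
    # One line per job: input BAM, then the coverage file it should produce
    for assembly in $assemblies; do
        for sample in $samples; do
            echo "$bam_root/$assembly/$sample/$sample.bam $cov_root/$assembly/$sample.coverage"
        done
    done
}

# Number of jobs = samples x assemblies
n_jobs=$(list_jobs | wc -l | tr -d ' ')
list_jobs
echo "jobs: $n_jobs"
```

With three samples and two assemblies this lists six jobs; with the full list of 243 samples the same loop would yield the 486 jobs mentioned above.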

#### Structure of the output file

As described above, a coverage file is generated per sample and assembly. To fully understand what *mosdepth* does, we can take a peek at the resulting data structure:

In [1]:
echo 'Content of the coverage file for sample RM724 (for both assemblies) looks like this:'
echo
head -33 /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/*/RM724.coverage

Content of the coverage file for sample RM724 (for both assemblies) looks like this:

==> /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/Angus/RM724.coverage <==
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 1 157005132 9.65
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 2 134168182 9.61
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 3 121042431 9.49
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 4 119712010 9.58
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 5 119436673 9.44
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 6 117143213 9.75
/cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/RM724/RM724.bam 7 110277844 9.33
/cluster/work/pausch/temp_scratch/audald/best_ass

The structure of the file is simple. Each row holds four space-separated columns:

- Column 1, path of the BAM file analysed
- Column 2, ID of the chromosome (the X chromosome is reported in two rows, X and PAR)
- Column 3, length of the chromosome
- Column 4, average coverage of the chromosome
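As a worked example of how column 4 is derived, the sketch below applies the same formula as get_coverage.sh (sum of interval length times depth, divided by the end coordinate of the last interval) to a tiny BED file. The intervals and depths are invented for illustration, and the final division uses awk instead of the `bc` call of the original script, purely to keep the example free of extra dependencies.

```shell
# Toy depth intervals: chrom, start, end, depth (invented values)
bed=$(mktemp)
printf '1\t0\t10\t5\n1\t10\t30\t8\n1\t30\t40\t0\n' > "$bed"

# Sum of (interval length x depth) over all intervals: 10*5 + 20*8 + 10*0
coverage_sum=$(awk '{ total += ($3 - $2) * $4 } END { print total }' "$bed")
# Chromosome length = end coordinate of the last interval
length=$(tail -n 1 "$bed" | awk '{ print $3 }')
# Mean coverage rounded to two decimals
coverage=$(awk -v s="$coverage_sum" -v l="$length" 'BEGIN { printf "%.2f", s / l }')

echo "$coverage_sum $length $coverage"
rm -f "$bed"
```

This prints `210 40 5.25`: 210 covered bases over a 40 bp toy chromosome give a mean coverage of 5.25.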

#### Checking that all steps were successfully completed

The Snakemake pipeline has been designed to generate atomic log files; that is, one log file is created per sample (243) and assembly (2). Hence, 486 jobs are generated and 486 successful log files are expected. Successful jobs can be tracked by grepping "Successfully completed." across the log files. The following loop moves all correct logs to a new folder and counts them.

In [6]:
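root_folder="/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/log_folder"
mydir=success_logs
cd "$root_folder"
# -p avoids an error if the folder already exists from a previous run
mkdir -p "$mydir"
for file in "$root_folder"/mosdepth*
do
    # A log counts as correct only if it contains the success message
    if grep -q 'Successfully completed.' "$file"
    then
        mv "$file" "$mydir"
    else
        echo "$file" contains errors
    fi
done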
echo 'Expected samples = 243 UCD + 243 Angus = 486'
echo 'Samples processed successfully:'
ls $mydir/*.log | wc -l
echo 'Should the previous numbers be different, samples processed with errors can be found in:' $root_folder
echo ''

Expected samples = 243 UCD + 243 Angus = 486
Samples processed successfully:
486
Should the previous numbers be different, samples processed with errors can be found in: /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/log_folder
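Because the loop above moves files, running it a second time changes what it finds. As a complement, here is a minimal sketch of a non-destructive variant that only counts successful logs and names the failing ones. The log file names and the failure line are invented for the example; only the "Successfully completed." marker comes from the real logs.

```shell
# Toy log folder with hypothetical file names and contents
log_dir=$(mktemp -d)
echo 'Successfully completed.' > "$log_dir/mosdepth_UCD_A.log"
echo 'Successfully completed.' > "$log_dir/mosdepth_Angus_A.log"
echo 'job failed (invented line)' > "$log_dir/mosdepth_UCD_B.log"

ok=0
failed=""
for file in "$log_dir"/mosdepth*.log; do
    # Same success criterion as above, but the files stay where they are
    if grep -q 'Successfully completed.' "$file"; then
        ok=$((ok + 1))
    else
        failed="$failed $(basename "$file")"
    fi
done

echo "successful: $ok"
echo "failed:$failed"
rm -r "$log_dir"
```

On the toy folder this reports two successful logs and lists `mosdepth_UCD_B.log` as failed; it can be rerun any number of times with the same result.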



All jobs seem to have completed successfully. Nevertheless, a closer look reveals the following warning in some of the logs: **chromosome MT not found**. Although it does not halt the process, it is worth understanding. Further inspection shows that the warning is only present for the BAM files aligned to the Angus assembly. To verify this, the following script was written:

In [10]:
root_folder="/cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/log_folder/success_logs"
cd $root_folder
echo 'Angus samples reporting the warning:'
grep 'chromosome MT not found' *log | grep 'Angus' | wc -l
echo 'UCD samples reporting the warning:'
grep 'chromosome MT not found' *log | grep 'UCD' | wc -l

243
0
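One caveat about the counts above: `grep ... | wc -l` counts matching lines, so it equals the number of affected samples only if each log contains the warning once. A hedged sketch of a per-file count with `grep -l` follows. The log names and contents are invented, and the assumption that the assembly name is part of the log file name is mine.

```shell
# Toy logs with hypothetical names; one Angus log repeats the warning
log_dir=$(mktemp -d)
printf 'chromosome MT not found\nchromosome MT not found\n' > "$log_dir/mosdepth_Angus_A.log"
printf 'chromosome MT not found\n' > "$log_dir/mosdepth_Angus_B.log"
printf 'Successfully completed.\n' > "$log_dir/mosdepth_UCD_A.log"

count_files_with_warning() {
    # $1 = assembly name, assumed to be part of the log file name.
    # grep -l prints each matching file once; "|| true" keeps a no-match
    # exit status from aborting scripts that run under set -e or pipefail.
    { grep -l 'chromosome MT not found' "$log_dir"/*"$1"*.log 2>/dev/null || true; } | wc -l | tr -d ' '
}

angus_files=$(count_files_with_warning Angus)
ucd_files=$(count_files_with_warning UCD)
# Line-based count, as in the cell above
angus_lines=$(grep 'chromosome MT not found' "$log_dir"/*.log | grep -c 'Angus')

echo "Angus files: $angus_files, Angus lines: $angus_lines, UCD files: $ucd_files"
rm -r "$log_dir"
```

On the toy logs the file count is 2 while the line count is 3. In the real run both approaches should agree as long as each log reports the warning exactly once.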


Therefore, we can conclude that *mosdepth* has run successfully and that the coverage files obtained are reliable. However, the absence of the MT chromosome in the Angus assembly must be kept in mind for the [second part of the pipeline](average_coverage.ipynb).