This notebook documents:

- How the original script has been adapted for Snakemake
- What the structure of the output file is
- Whether the output file is correct despite the code change
- Which checks are necessary to deem the output correct

#### Script adapted to Snakemake

The first idea, given that the [original code](get_coverage_mosdepth_original_code.ipynb) is written in bash, was to adapt it slightly for Snakemake.
However, after some attempts, translating the script to Python (which integrates more naturally with Snakemake) proved easier. The resulting script, average_coverage.py, summarises the coverage per assembly and can be found here:

In [2]:
%%bash
ls -lrth /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/average_coverage.py

-rwxrwx--- 1 avillas hest-hpc-tg 849 Feb 29 20:29 /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/average_coverage.py


And looks like this:

In [None]:
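import pandas as pd

# Output table: one tab-separated row per BAM file.
sumcov_file = open(snakemake.output[0], "w+")
sumcov_file.write("bam_id\tITB_id\tautosome\tx\ty\tPAR\tMT\n")

# Sample sheet mapping each bam_id to its Interbull-ID.
info_file = pd.read_table(snakemake.input.info)
info_file.index = info_file['bam_id']

coverage_files = snakemake.input.file_list
for file in coverage_files:
    # The file name is expected at index 9 of the "/"-separated path;
    # the sample (BAM) ID is everything before its first ".".
    segments = file.split("/")
    seg_file = segments[9].split(".")
    ITB_id = info_file.loc[info_file['bam_id'] == seg_file[0], 'Interbull-ID'].iloc[0]

    # Space-separated table without header; column index 3 holds the coverage value used here.
    covy = pd.read_table(file, header=None, sep=' ')

    # Mean over rows 0-27 (28 rows). Row 28 is not used anywhere in this script;
    # if the table lists 29 autosomes before X, the slice should be 0:29 and the divisor 29.
    AUTOS = str(round(sum(covy.iloc[0:28, 3] / 28), 2))
    X = str(covy.iloc[29, 3])
    PAR = str(covy.iloc[30, 3])
    Y = str(covy.iloc[31, 3])
    MT = str(covy.iloc[32, 3])

    sumcov_file.write(seg_file[0] + ".bam\t" + ITB_id + "\t" + AUTOS + "\t" + X + "\t" + Y + "\t" + PAR + "\t" + MT + "\n")
sumcov_file.close()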

This Python code is a literal translation of the bash code. A brief explanation of the bash code can be found in the [original script](get_coverage_mosdepth_original_code.ipynb).
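
The row layout the script relies on can be made explicit in a small function. This is a minimal sketch, not the script itself: `summarise_rows` and `n_autosomes` are hypothetical names, the order autosomes, X, PAR, Y, MT is inferred from the `iloc` indices used in the script, and the default of 29 autosomes is an assumption based on X being read from row 29. The input values are synthetic.

```python
from statistics import mean


def summarise_rows(means, n_autosomes=29):
    """Summarise per-row mean coverages ordered as autosomes, X, PAR, Y, MT."""
    if len(means) < n_autosomes + 4:
        raise ValueError("expected autosomes followed by X, PAR, Y and MT rows")
    # Unweighted mean of the autosome rows, rounded to two decimals.
    autosome = round(mean(means[:n_autosomes]), 2)
    # The four rows after the autosomes, in the order assumed from the script.
    x, par, y, mt = means[n_autosomes:n_autosomes + 4]
    return {"autosome": autosome, "x": x, "y": y, "PAR": par, "MT": mt}


# Synthetic values, not real data: 29 autosomes at 10x, then X, PAR, Y, MT.
example = summarise_rows([10.0] * 29 + [5.0, 6.0, 7.0, 8.0])
print(example)
```

Making the number of autosomes a parameter keeps the slice boundary and the divisor in one place, so they cannot drift apart.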

This script is run over [486 files](get_coverage.ipynb). As before, the pipeline writes log files and requests optimised cluster resources. To submit the jobs programmatically under these conditions, the following Snakemake files were created:

* [Snakefile](Snakemake/Snakefile.py) 
* [Config file](Snakemake/config.yaml)
* [Cluster file](Snakemake/cluster.json)

#### Structure of the output file

Two coverage files are created, one per assembly, each summarising the coverage of all samples for that assembly. Coverage is reported separately for autosomes and sex chromosomes:

In [3]:
%%bash

echo 'Content of the average coverage file for both assemblies look like this:'
echo
head -10 /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/*/summary_coverage.txt

Content of the average coverage file for both assemblies look like this:

==> /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/Angus/summary_coverage.txt <==
bam_id	ITB_id	autosome	x	y	PAR	MT
BSWCHEM120102541330.bam	BSWCHEM120102541330	7.86	3.59	16.97	4.21	16.97
BSWCHEM110203063244.bam	BSWCHEM110203063244	9.61	4.75	19.52	4.62	19.52
BSWCHEM120105187122.bam	BSWCHEM120105187122	10.79	4.94	22.5	5.76	22.5
BSWCHEF120071978076.bam	BSWCHEF120071978076	11.8	10.98	5.59	6.27	5.59
BSWCHEM120127953255.bam	BSWCHEM120127953255	7.59	3.5	16.0	4.0	16.0
BSWCHEM120110162602.bam	BSWCHEM120110162602	7.86	3.55	16.06	4.28	16.06
BSWCHEM120033040506.bam	BSWCHEM120033040506	10.05	4.86	20.9	5.12	20.9
BSWCHEM120033987405.bam	BSWCHEM120033987405	10.23	4.85	22.01	5.55	22.01
BSWCHEF120050280619.bam	BSWCHEF120050280619	15.27	14.85	4.29	7.35	4.29

==> /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/UCD/summary_coverage.txt <==
bam_id	ITB_id	autosome	x	y	PAR	MT
BSWCHEM120102541330.ba

The structure of the file is simple. It consists of:

- Column 1, BAM ID
- Column 2, Interbull ID (in some cases different from the BAM ID)
- Column 3, coverage of the autosomes
- Column 4, coverage of chromosome X
- Column 5, coverage of chromosome Y
- Column 6, coverage of the pseudoautosomal region (PAR)
- Column 7, coverage of the mitochondrial genome (MT)
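
The column layout can be read back with the standard library. This is a minimal sketch, assuming the file is tab-separated with the header shown in the output; `read_summary` is a hypothetical helper, and the two data rows are copied from the Angus output in this notebook.

```python
import csv
import io

# Header plus two rows copied from the Angus summary shown in this notebook.
SAMPLE = (
    "bam_id\tITB_id\tautosome\tx\ty\tPAR\tMT\n"
    "BSWCHEM120102541330.bam\tBSWCHEM120102541330\t7.86\t3.59\t16.97\t4.21\t16.97\n"
    "BSWCHEF120071978076.bam\tBSWCHEF120071978076\t11.8\t10.98\t5.59\t6.27\t5.59\n"
)

NUMERIC = ("autosome", "x", "y", "PAR", "MT")


def read_summary(handle):
    """Parse a summary_coverage table into a list of dicts with float coverages."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for column in NUMERIC:
            record[column] = float(record[column])
        rows.append(record)
    return rows


rows = read_summary(io.StringIO(SAMPLE))
for row in rows:
    print(row["bam_id"], row["autosome"])
```

For the real file, an open file handle can be passed instead of the `io.StringIO` object.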

Although we are only interested in the coverage of the autosomes, two results clearly stand out:

- In the Angus assembly, the reported mitochondrial coverage is exactly the same as the coverage of the Y chromosome. This is directly related to what was documented in the [get_coverage script](get_coverage.ipynb), where the mitochondrial chromosome could not be found because the Angus assembly does not include it. Therefore, a modified version of the Python script (without the MT column) could be prepared, as suggested in the [original code](get_coverage_mosdepth_original_code.ipynb).
- While the MT coverage seems valid for the UCD assembly, some of the values are strikingly high (*e.g.* 1738.15 for BSWCHEF120071978076). The group has already seen this in the past, and it would be interesting to explore it further.
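
Both observations can be turned into a simple screen over the summary rows. This is a sketch under stated assumptions: `flag_mt` is a hypothetical helper, and `high_factor` (MT coverage more than ten times the autosomal coverage) is an arbitrary illustrative threshold, not one used by the group. The three rows are copied from outputs shown in this notebook.

```python
def flag_mt(rows, high_factor=10.0):
    """Return {bam_id: [reasons]} for rows whose MT value looks suspicious."""
    flags = {}
    for row in rows:
        reasons = []
        # Identical Y and MT values suggest the MT row was missing from the input.
        if row["MT"] == row["y"]:
            reasons.append("MT equals Y")
        # high_factor is an arbitrary, illustrative cut-off.
        if row["MT"] > high_factor * row["autosome"]:
            reasons.append("MT much higher than autosomal coverage")
        if reasons:
            flags[row["bam_id"]] = reasons
    return flags


# Values copied from the outputs shown in this notebook.
rows_to_check = [
    {"bam_id": "BSWCHEM120102541330.bam", "autosome": 7.86, "x": 3.59, "y": 16.97, "PAR": 4.21, "MT": 16.97},
    {"bam_id": "BSWCHEF120043744524.bam", "autosome": 15.23, "x": 15.64, "y": 0.07, "PAR": 13.5, "MT": 706.1},
    {"bam_id": "RM1908.bam", "autosome": 20.78, "x": 10.75, "y": 14.04, "PAR": 18.75, "MT": 15.69},
]
flags = flag_mt(rows_to_check)
for bam_id, reasons in flags.items():
    print(bam_id, "; ".join(reasons))
```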

#### Confirming that code changes do not affect the result

As mentioned above, the original bash code has been translated to Python. I would like to check whether running both scripts (bash and Python) gives exactly the same result. A separate summary_coverage file is created with each script. These are the coverages for some of the samples; in each pair of lines, the first comes from the bash output and the second from the Python output:

In [6]:
%%bash
cd /cluster/work/pausch/temp_scratch/audald/analysis_test/coverage/coverage_files
samples="BSWCHEM120125290253 BSWCHEF120043744524 NDA032 RM1908"
for sample in $samples 
do
    grep $sample bash_summary_coverage.txt
    grep $sample python_summary_coverage.txt
done

BSWCHEM120125290253 BSWCHEM120125290253 10.3803 5.19 7.62 10.36 4.18
BSWCHEM120125290253.bam	BSWCHEM120125290253	10.37	5.19	7.62	10.36	4.18
BSWCHEF120043744524 BSWCHEF120043744524 15.2224 15.64 0.07 13.50 706.10
BSWCHEF120043744524.bam	BSWCHEF120043744524	15.23	15.64	0.07	13.5	706.1
NDA032 BSWCHEF120030437958 14.1834 14.41 0.07 12.66 155.23
NDA032.bam	BSWCHEF120030437958	14.19	14.41	0.07	12.66	155.23
RM1908 BSWCHEM110064090335 20.7697 10.75 14.04 18.75 15.69
RM1908.bam	BSWCHEM110064090335	20.78	10.75	14.04	18.75	15.69
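
The same comparison can be made column by column instead of by eye. This is a minimal sketch, assuming both files list the five coverage columns in the same order; `parse_line` and `differing_columns` are hypothetical helpers, and the tolerance of 0.005 is chosen because it is the largest difference that rounding to two decimals alone could produce. The two lines are copied from the output above.

```python
COLUMNS = ("autosome", "x", "y", "PAR", "MT")


def parse_line(line):
    """Split a summary line on any whitespace and return (sample, values)."""
    fields = line.split()
    # The Python output appends ".bam" to the sample name; strip it for matching.
    sample = fields[0][:-4] if fields[0].endswith(".bam") else fields[0]
    values = dict(zip(COLUMNS, (float(v) for v in fields[2:7])))
    return sample, values


def differing_columns(bash_line, python_line, tol=0.005):
    """Return the columns whose values differ by more than tol."""
    bash_sample, bash_values = parse_line(bash_line)
    python_sample, python_values = parse_line(python_line)
    if bash_sample != python_sample:
        raise ValueError("lines describe different samples")
    return [c for c in COLUMNS if abs(bash_values[c] - python_values[c]) > tol]


# One bash/Python pair copied from the output above.
bash_line = "BSWCHEF120043744524 BSWCHEF120043744524 15.2224 15.64 0.07 13.50 706.10"
python_line = "BSWCHEF120043744524.bam\tBSWCHEF120043744524\t15.23\t15.64\t0.07\t13.5\t706.1"
diff = differing_columns(bash_line, python_line)
print(diff)
```

Comparing parsed numbers rather than raw text ignores harmless formatting differences such as `13.50` versus `13.5`.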


#### Checking that all steps were successfully completed

The Snakemake pipeline has been designed to generate atomic log files; that is, one log file is created per summary, i.e. per assembly (2). Hence, 2 successful log files are expected. The simplest way to track successful jobs is to grep for "Successfully completed." across the log files:

In [7]:
%%bash

cd /cluster/work/pausch/temp_scratch/audald/variant_analyses/coverage/log_folder
grep Successfully summary_cov_*

summary_cov_Angus.log:Successfully completed.
summary_cov_UCD.log:Successfully completed.
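
The same check can be scripted so that logs lacking the marker are reported as well, which `grep` silently omits. This is a sketch only: `check_logs` is a hypothetical helper, the file pattern is taken from the `grep` call above, and the demonstration runs on throwaway files with made-up contents rather than on the real log folder.

```python
import tempfile
from pathlib import Path

MARKER = "Successfully completed."


def check_logs(log_dir, pattern="summary_cov_*.log"):
    """Return (completed, incomplete) lists of log file names matching pattern."""
    completed, incomplete = [], []
    for log in sorted(Path(log_dir).glob(pattern)):
        target = completed if MARKER in log.read_text() else incomplete
        target.append(log.name)
    return completed, incomplete


# Demonstration on temporary files; the log contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    for assembly in ("Angus", "UCD"):
        Path(tmp, f"summary_cov_{assembly}.log").write_text(f"...\n{MARKER}\n")
    completed, incomplete = check_logs(tmp)

print("completed:", completed)
print("incomplete:", incomplete)
```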


Therefore, we can conclude that:

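* Translation from bash to Python leaves the x, y, PAR and MT values unchanged (apart from trailing zeros); the autosomal mean differs by about 0.01 between the two outputs (*e.g.* 10.3803 vs 10.37), which is more than rounding to two decimals explains and should be checked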
* The structure of the summary files is as expected, except for the two aspects discussed above
* All jobs completed successfully