## Mapping stats

This notebook shows how to generate the mapping-quality results used in the publication. The output of the `sam_flags` rule of the [Snakefile](../Pipeline/Snakefile) is loaded as an R data frame. The sample names (column 1) are not printed for privacy reasons. A `Quality` column, the fraction of reads that pass the filter (`Filtered_reads / All_reads`), is added as follows:

In [2]:
library(dplyr)
library(tidyverse)
setwd("/cluster/work/pausch/temp_scratch/audald/best_assembly/mapping_stats")
# Load the output of the sam_flags rule in the Snakefile
MQ_SAM_flag_reads = read.csv(file = "MQ_sam_flag_reads.txt", sep = "\t", header = TRUE)
# Quality is the fraction of reads that pass the filter
MQ_SAM_flag_reads$Quality = MQ_SAM_flag_reads$Filtered_reads / MQ_SAM_flag_reads$All_reads
# The first column (sample names) is not printed
head(MQ_SAM_flag_reads[,-1])
tail(MQ_SAM_flag_reads[,-1])

Reference,Chromosome,All_reads,Filtered_reads,Quality
<fct>,<int>,<int>,<int>,<dbl>
UCD,28,11405474,9717510,0.852004
Angus,29,10064451,8152210,0.8100005
Angus,21,12724687,12029051,0.9453318
UCD,29,15587380,13591290,0.8719419
UCD,16,18185516,15699027,0.8632709
Angus,28,7797907,7033641,0.9019909


,Reference,Chromosome,All_reads,Filtered_reads,Quality
,<fct>,<int>,<int>,<int>,<dbl>
9333,UCD,5,9535971,7993537,0.838251
9334,UCD,20,5740962,4802633,0.8365554
9335,UCD,8,9029371,7553995,0.8366026
9336,Angus,16,6470006,5418612,0.8374972
9337,UCD,2,10837977,9094054,0.8390915
9338,Angus,1,12291454,10308205,0.8386481
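The `Quality` column can be reproduced on a small data frame without access to the cluster files. The following is a minimal sketch on made-up read counts; the column names mirror those in the notebook, while the values and the sample labels `s1` and `s2` are hypothetical.

```r
# Toy data: hypothetical read counts, not taken from the real samples
toy = data.frame(
  Sample         = c("s1", "s1", "s2", "s2"),
  Reference      = c("UCD", "Angus", "UCD", "Angus"),
  Chromosome     = c(1L, 1L, 1L, 1L),
  All_reads      = c(1000L, 1000L, 2000L, 2000L),
  Filtered_reads = c(900L, 850L, 1500L, 1600L)
)

# Quality is the fraction of reads that pass the filter
toy$Quality = toy$Filtered_reads / toy$All_reads

# By construction the fraction must lie between 0 and 1
stopifnot(all(toy$Quality >= 0 & toy$Quality <= 1))

# Drop the sample column before printing, as in the notebook
toy[, -1]
```

The `stopifnot()` check is an optional sanity test: a value above 1 would indicate more filtered reads than total reads, which would point to a problem in the input table.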


* Per-chromosome values are aggregated per sample and assembly. The samples with the most and the fewest `All_reads` are shown:

In [3]:
# Sum all the columns (.) per sample and assembly
agg_MQ_SAM_reads = aggregate(.~Sample+Reference, MQ_SAM_flag_reads, sum)
# The summed Quality is divided by the number of chromosomes (29) to obtain the mean quality
agg_MQ_SAM_reads$Quality = agg_MQ_SAM_reads$Quality / 29
# The summed Chromosome column is meaningless, so it is dropped
agg_MQ_SAM_reads = subset(agg_MQ_SAM_reads, select = -c(Chromosome) )
# Check that the number of rows is the expected one (161 samples x 2 assemblies)
nrow(agg_MQ_SAM_reads)
# Samples with the most reads
head((agg_MQ_SAM_reads[order(-agg_MQ_SAM_reads$All_reads),])[,-1])
# Samples with the fewest reads
head((agg_MQ_SAM_reads[order(agg_MQ_SAM_reads$All_reads),])[,-1])

,Reference,All_reads,Filtered_reads,Quality
,<fct>,<dbl>,<dbl>,<dbl>
232,UCD,1144927880,1068933835,0.9332857
71,Angus,1137593430,1059946752,0.931184
168,UCD,813074688,733332300,0.9014605
7,Angus,806565799,726333513,0.9000089
190,UCD,724587908,640490320,0.8835181
29,Angus,718670553,634291070,0.882016


,Reference,All_reads,Filtered_reads,Quality
,<fct>,<dbl>,<dbl>,<dbl>
153,Angus,159903216,153190884,0.9581228
314,UCD,161177066,154664557,0.9593647
149,Angus,165330400,154991566,0.9376307
310,UCD,166380393,156307640,0.9393427
101,Angus,166729993,151643580,0.9090628
262,UCD,167901768,152975564,0.9105925
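Summing `Quality` and dividing by the number of chromosomes gives an unweighted mean of the per-chromosome fractions, which is not necessarily the same as the fraction of all reads that pass the filter. The sketch below illustrates the difference on made-up numbers; the columns `Mean_quality` and `Pooled_quality` and the helper variable `n_chr` are introduced here for illustration only and are not part of the notebook.

```r
# Toy data: one hypothetical sample with two chromosomes of unequal depth
toy = data.frame(
  Sample         = c("s1", "s1"),
  Reference      = c("UCD", "UCD"),
  Chromosome     = c(1L, 2L),
  All_reads      = c(1000, 3000),
  Filtered_reads = c(900, 2100)
)
toy$Quality = toy$Filtered_reads / toy$All_reads

# Same aggregation as in the notebook: sum every column per sample and assembly
agg = aggregate(. ~ Sample + Reference, toy, sum)

# Unweighted mean of the per-chromosome fractions (the notebook's approach)
n_chr = length(unique(toy$Chromosome))
agg$Mean_quality = agg$Quality / n_chr

# Read-weighted alternative: fraction of all reads that pass the filter
agg$Pooled_quality = agg$Filtered_reads / agg$All_reads

agg[, c("Sample", "Reference", "Mean_quality", "Pooled_quality")]
```

Dividing by a fixed chromosome count assumes that every sample has exactly one row per chromosome and assembly; counting the distinct chromosomes, as `n_chr` does here, is one way to avoid hard-coding that number.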


* Summaries of mapping quality per sample and assembly:

In [4]:
print('Summary of mapping quality for UCD assembly:')
summary(agg_MQ_SAM_reads[agg_MQ_SAM_reads$Reference == "UCD",]$Quality)
print('Standard deviation of the mapping quality for UCD assembly:')
sd(agg_MQ_SAM_reads[agg_MQ_SAM_reads$Reference == "UCD",]$Quality)
print('Summary of mapping quality for Angus assembly:')
summary(agg_MQ_SAM_reads[agg_MQ_SAM_reads$Reference == "Angus",]$Quality)
print('Standard deviation of the mapping quality for Angus assembly:')
sd(agg_MQ_SAM_reads[agg_MQ_SAM_reads$Reference == "Angus",]$Quality)

[1] "Summary of mapping quality for UCD assembly:"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7837  0.8478  0.8919  0.8928  0.9393  0.9722 

[1] "Standard deviation of the mapping quality for UCD assembly:"


[1] "Summary of mapping quality for Angus assembly:"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7831  0.8461  0.8905  0.8917  0.9376  0.9717 

[1] "Standard deviation of the mapping quality for Angus assembly:"
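The per-assembly summaries above could also be computed in one call per statistic with base R's `tapply()`. This is only a sketch on made-up quality values; the names `quality_summary` and `quality_sd` are hypothetical and do not appear in the notebook.

```r
# Toy data: hypothetical per-sample mean qualities for the two assemblies
toy = data.frame(
  Reference = c("UCD", "UCD", "UCD", "Angus", "Angus", "Angus"),
  Quality   = c(0.80, 0.90, 0.94, 0.79, 0.89, 0.93)
)

# One summary() per assembly
quality_summary = tapply(toy$Quality, toy$Reference, summary)

# One standard deviation per assembly
quality_sd = tapply(toy$Quality, toy$Reference, sd)

quality_summary
quality_sd
```

Grouping by `Reference` avoids repeating the subsetting expression for each assembly and extends unchanged if further assemblies are added.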


* Mapping quality is averaged per chromosome and assembly. A `Diff` column is created with the difference, in percentage points, between corresponding chromosomes across assemblies (UCD minus Angus). The data frame is sorted by this difference in decreasing order:

In [5]:
# Group the original data frame by assembly and chromosome and compute the mean quality
MQ_chr_mapping = MQ_SAM_flag_reads %>% group_by(Reference, Chromosome) %>% summarise(Quality = mean(Quality), .groups = "drop")

# Reshape to wide format: one row per chromosome and one column per assembly
MQ_tidy_chr_mapping = MQ_chr_mapping %>% pivot_wider(names_from = Reference, values_from = Quality) %>% select(Chromosome, UCD, Angus)

# New column with the difference, in percentage points, between UCD and Angus mapping quality
MQ_tidy_chr_mapping$Diff = (MQ_tidy_chr_mapping$UCD - MQ_tidy_chr_mapping$Angus) * 100

# Sort the data frame by the difference of quality means, largest first
MQ_tidy_chr_mapping[order(-MQ_tidy_chr_mapping$Diff),]
print("Mean of mapping quality differences between assemblies:")
mean(MQ_tidy_chr_mapping$Diff)
print("Standard deviation of mapping quality differences between assemblies:")
sd(MQ_tidy_chr_mapping$Diff)

Chromosome,UCD,Angus,Diff
<int>,<dbl>,<dbl>,<dbl>
20,0.8920437,0.87166,2.03836888
6,0.8930665,0.8797884,1.327810628
14,0.895155,0.8846364,1.051858124
28,0.8952468,0.8852364,1.001048451
22,0.8993481,0.895074,0.427408657
12,0.8933443,0.889188,0.41563132
26,0.8927504,0.8890655,0.368488161
15,0.894387,0.8914946,0.289241123
18,0.8937579,0.8916432,0.211469747
3,0.8945121,0.8925317,0.19803311


[1] "Mean of mapping quality differences between assemblies:"


[1] "Standard deviation of mapping quality differences between assemblies:"
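As a cross-check of the per-chromosome difference, the same table can be built with base R only. The following sketch uses made-up quality values for two chromosomes; the objects `chr_means` and `wide` are hypothetical names, and `tapply()` is swapped in for the `group_by()`/`pivot_wider()` steps of the notebook.

```r
# Toy data: hypothetical per-sample qualities for two chromosomes
toy = data.frame(
  Reference  = c("UCD", "UCD", "Angus", "Angus", "UCD", "UCD", "Angus", "Angus"),
  Chromosome = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Quality    = c(0.90, 0.92, 0.88, 0.90, 0.85, 0.87, 0.86, 0.88)
)

# Mean quality per chromosome (rows) and assembly (columns)
chr_means = tapply(toy$Quality, list(toy$Chromosome, toy$Reference), mean)

# Wide table with one row per chromosome
wide = data.frame(
  Chromosome = as.integer(rownames(chr_means)),
  UCD        = chr_means[, "UCD"],
  Angus      = chr_means[, "Angus"]
)

# Difference in percentage points: positive when UCD has the higher mean
wide$Diff = (wide$UCD - wide$Angus) * 100

# Largest difference first
wide[order(-wide$Diff), ]
```

Multiplying the difference of two fractions by 100 yields percentage points, not a relative percent change, which is why `Diff` can be compared directly across chromosomes.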


* Summaries of mapping quality per chromosome and assembly:

In [6]:
print('Summary of mapping quality for UCD assembly:')
summary(MQ_tidy_chr_mapping$UCD)
print('Standard deviation of the mapping quality for UCD assembly:')
sd(MQ_tidy_chr_mapping$UCD)
print('Summary of mapping quality for Angus assembly:')
summary(MQ_tidy_chr_mapping$Angus)
print('Standard deviation of the mapping quality for Angus assembly:')
sd(MQ_tidy_chr_mapping$Angus)

[1] "Summary of mapping quality for UCD assembly:"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8834  0.8920  0.8934  0.8928  0.8947  0.8993 

[1] "Standard deviation of the mapping quality for UCD assembly:"


[1] "Summary of mapping quality for Angus assembly:"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8717  0.8902  0.8932  0.8917  0.8951  0.8981 

[1] "Standard deviation of the mapping quality for Angus assembly:"


* Finally, a global summary of the total reads (`All_reads`) and high-quality reads (`Filtered_reads`) per sample, summed over all chromosomes and pooled across both assemblies:

In [7]:
summary(agg_MQ_SAM_reads$All_reads)
summary(agg_MQ_SAM_reads$Filtered_reads)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.599e+08 2.134e+08 2.564e+08 2.939e+08 3.095e+08 1.145e+09 

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.516e+08 1.809e+08 2.371e+08 2.623e+08 2.831e+08 1.069e+09 