# Validation and check points

As summarised in the ["Processing pipeline notebook"](Processing_pipeline.ipynb), a series of steps are needed in order to align the raw data for the two assemblies. Given the multiplicity of steps, the number of potential errors also increase. We definitely need a system in place to check whether all the files are being properly generated. The strategy will be many-folded:

* Make sure that all pipeline steps have been completed
* Check that all samples have been successfully processed
* Find out what of the samples have already had applied a QC (by FASTP)
* Confirm that the number of reads is kept from the raw data to the aligned and sorted BAM files
* Reduce as much as possible the storage footprint
* Alternative methods for validating the results

#### Snakemake as tool for tracking and checking results

Snakemake is a very complete workflow manager. It does not only concatenates the different steps, automatise great amount of input files and optimise the resources used for every step, but it also generates specific logs for each and every job. Therefore, providing that logs are duly generated, these are very helpful to track errors in concrete steps. The logs are generated in a specific folder (/cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder), where each rule/step will have a separate folder and each job (one job per sample or on job per read-group, depending on the rule) will have a unique log.
The name of the logs and the directories are set in advance at the [cluster config file](Snakemake/cluster.json).

Snakemake is also very interesting as you can define whether intermediate files (e.g. QC'ed FASTQ files or unsorted BAM) should be deleted as soon as the following file (e.g. split read-group-specific FASTQ files or de-duplicated and sorted BAM file) is created. This is defined in the very [Snakefile](Snakemake/Snakefile.py).

### Have all rules been successfully completed for all samples?

This is the first question to answer when the workflow has been completed. Hopefully all the steps will be successfully finished. If this is not the case, no problem, we can run the following lines of code so we really know where the problems arise and what are the concrete errors:

In [1]:
%%bash

mydir=success_logs
root_folder="/cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder"
folders="fastp split_fastq alignment sambamba_sort sambamba_merge sambamba_flagstat_1 mark_duplicates build_index sambamba_flagstat_2"
for folder in $folders
do
    cd $root_folder/$folder
    #mkdir $mydir ##Uncomment this line if this is the first time and the success_logs folder does not exist
    mv $mydir/*.log . ##Uncomment this line if the code is run for new samples. All files in success_logs will be moved back to the right folder
    echo '------- STEP ' $folder '-------'
    echo 'Number of samples processed:'
    ls *.log | wc -l
    for file in *log
    do
        if grep -q 'Successfully completed.' $file
        then 
            mv $file $mydir
        else
            echo $file contains errors
        fi
    done 
    echo 'Processed successfully:'
    ls $mydir/*.log | wc -l
    echo 'Should the previous numbers be different, samples processed with errors can be found in:' $root_folder/$folder
    echo ''
done

------- STEP  fastp -------
Number of samples processed:
369
Processed successfully:
369
Should the previous numbers be different, samples processed with errors can be found in: /cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder/fastp

------- STEP  split_fastq -------
Number of samples processed:
369
Processed successfully:
369
Should the previous numbers be different, samples processed with errors can be found in: /cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder/split_fastq

------- STEP  alignment -------
Number of samples processed:
2702
Processed successfully:
2702
Should the previous numbers be different, samples processed with errors can be found in: /cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder/alignment

------- STEP  sambamba_sort -------
Number of samples processed:
2702
Processed successfully:
2702
Should the previous numbers be different, samples processed with errors can be found in: /cluster/work/pausch/temp_scratch/aud

The output should be self explanatory but for the sake of clarification:
* The logs for each rule are parsed (as many logs as sample/read_group per rule)
* The logs contain 'Successfully completed.' when the job finished properly
* These successful logs are moved to a new folder "success_logs"
* The remaining logs (if any) are counted and deemed as troublesome - specific names are provided
* The successful logs (in the success folder) are counted
* The number of the total of logs and the successful logs is provided, so discrepancies can be spotted. Should these discrepancies exist, the folder path is provided so the troublesome logs can be checked. 

Ideally, the number of "Processed successfully" samples from the build_index & sambamba_flagstat_2 rules should be n times the number of samples from fastp (where n is the number of alignments - 2 in this project). If this is the case, all fine, good news. If this is not the case, we can check retrospectively the previous steps to see where the error happened. Once checked the cause and fixed, the error logs should be deleted and Snakemake be run again - it will automatically resume the process where it stopped.

Some of the errors found during the validations where in rules "alignment", "sambamba_sort" and "mark_duplicates":

* Alignment: 6 samples (only some of the read-groups, specially when aligning to Angus) retrieved the following error: "job killed after reaching LSF memory usage limit.". This is fixed by increasing the number of resources: from 12 cores and 3000 of memory to 12 cores and 4500 of memory. This is nevertheless not changed in [Cluster config file](Snakemake/cluster.json) as I prefer to fix the 6 samples individually rather than giving too many resources to the other hundreds of samples. The cluster is always giving priority to less memory-consuming jobs.
* Sambamba-sort fails for several samples with the following error message: "sambamba-sort: Unable to write to stream". According to this [webpage](https://github.com/biod/sambamba/issues/218) the error is the number of temporary files created. The more samples processing altogether the more chances to see these problems. Just by running the pipeline again, this problem vanished.
* Picard tools: There is the following error message for one sample when running Picard tools: "Suppressed: java.io.IOException: No space left on device". Having seen the experience of Sambamba-sort failure, I just try to run again the pipeline and it just works.

We might be interested in getting the sample ID out of the log files, so we easily spot the affected sample (e.g. we want to run the whole pipeline again for the troublesome sample). The following commands are used for extracting the sample ID from the validation steps described above. It is important to notice that *awk* is being used differently depending on the name of the log; it can be adapted to other rules:

In [None]:
%%bash
cd /cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder
echo 'Failing samples:'
grep 'exited' alignment/*log | awk -F '_' '{print $3}' | sort -u
echo 'Failing samples:'
grep 'exited' sambamba_sort/*log | awk -F '_' '{print $4}' | sort -u
echo 'Failing samples:'
grep 'exited' mark_duplicates/*log | ls *log | awk -F '_' '{print $4}' | awk -F '.' '{print $1}' | sort -u

### Is the number of samples expected?

At this point we need to compare the number of processed samples and the number of expected samples.

According to the previous script output, 369 samples have been processed (369 samples at the beginning and 369 x n at the end of the pipeline, where n is the number of reference genomes).

According to the [ID criteria selection notebook](../Data/IDs/OBV_BSW_raw_IDs_R.ipynb), 371 animals meet the selection criteria from the metada file (i.e. are OBV or BSW breeds).

When looking at the [Selected samples notebook](../Data/IDs/OBV_BSW_raw_IDs_python.ipynb), 1 of the animals appears not to have raw data: **RM763**. Therefore, 370 samples remain.

However, one out of the 370 being processed is not accepted by the pipeline. This is sample **SRR6067094**, whose raw data appear not to have read groups well defined. Consequently, FASTQ files cannot be split and the process is stopped. It is possible that the trouble is due by the fact that raw data were downloaded from NCBI. Hubert acknowledges that this is possible and raises that he had to use dummy readgroups to process such files in the past.

In a nutshell: from 371 samples meeting the metadata criteria, 2 have been excluded (RM763 and SRR6067094) and hence 369 samples is the final and justified number.

### Are there any samples that were already applied a QC - FASTP?

Raw data are taken as FASTQ files as input for the processing workflow. Some of the samples are generated by us, used for other projects, downloaded from public repositories, etc. It is therefore difficult to trace back the origin and know the quality of such files. This is partially why we apply a QC (by using FASTP) to all the samples. At this point, though, we would like to know what of the samples had been already applied a QC before and what of the samples are still unfiltered by the time this pipeline is run.

FASTP reports [see here](Fastp.ipynb) are very useful to know the number of reads/bases filtered out due to low quality during the FASTP QC.

The following code helps finding out what of the samples have been already applied a QC:

In [16]:
%%bash

module load jq/1.5

path="/cluster/work/pausch/temp_scratch/audald/best_assembly"
cd $path
mkdir QC
touch QC/first_QC.txt
touch QC/second_QC.txt
for sample in `ls fastp | xargs -n 1 basename | awk -F " " '{print $1}'`
do
    cat fastp/$sample/${sample}_fastp.json | jq '.summary.before_filtering.total_reads' > QC/${sample}_bf_reads.txt
    cat fastp/$sample/${sample}_fastp.json | jq '.summary.after_filtering.total_reads' > QC/${sample}_af_reads.txt
    bf_reads=$(cat QC/${sample}_bf_reads.txt)
    af_reads=$(cat QC/${sample}_af_reads.txt)
    if [ $bf_reads == $af_reads ]; then
        echo $sample >> QC/second_QC.txt
    else
        echo $sample >> QC/first_QC.txt
    fi
done
echo --------------------------------------------------------------------------------------------------------
echo Samples with previous QC:
cat QC/second_QC.txt | wc -l
cat QC/second_QC.txt
echo --------------------------------------------------------------------------------------------------------
echo Samples under first QC:
cat QC/first_QC.txt | wc -l
cat QC/first_QC.txt
rm -r QC

--------------------------------------------------------------------------------------------------------
Samples with previous QC:
62
BSWCHEM110009030235
BSWCHEM110011106478
BSWCHEM110023102796
BSWCHEM110060137584
BSWCHEM110069121003
BSWCHEM110097174286
BSWCHEM110111072321
BSWCHEM110131060582
BSWCHEM110153046663
BSWCHEM110177093315
BSWCHEM110211065827
BSWCHEM110218087570
BSWCHEM110222168241
BSWCHEM110254129111
BSWCHEM110265057175
BSWCHEM110280223340
BSWCHEM110293038924
BSWCHEM110294048847
BSWCHEM110349149437
BSWCHEM110468065700
BSWCHEM110501069856
BSWCHEM110711018798
BSWCHEM110716018779
BSWCHEM110716027719
BSWCHEM110967032609
BSWCHEM111030024286
BSWCHEM111453050169
BSWCHEM120004573248
BSWCHEM120020086135
BSWCHEM120026503698
BSWCHEM120033898589
BSWCHEM120047653969
NDA032
NDA059
RM067
RM1003
RM1023
RM1118
RM1179
RM720
RM721
RM723
RM724
RM730
RM805
RM806
RM808
RM855
RM884
RM887
RM937
RM938
RM945
RM946
RM947
RM948
RM951
RM963
RM966
RM996
RM997
RM998
----------------------------------------

The above code is looping over all the samples a comparison between the reads of raw data before the filtering and after the filter.
When the number of reads before and after the filtering is the same (*i.e.* filtering does not affect) we assume that a previous QC had been already applied.

### Is the number of reads kept across the pipeline?

Once the Quality Control (QC - FASTP) has been applied to the raw data, there are no other steps where reads are filtered out from the files. Therefore, we would expect as many reads in the QC'ed FASTQ files as in the de-duplicated and sorted BAM files. The following lines of code are used for this check:

In [2]:
%%bash
module load jq/1.5
samples="BSWDEUF000948260022 BSWCHEM120074028822"
path="/cluster/work/pausch/temp_scratch/audald/best_assembly"
cd $path
mkdir QC
touch QC/matching_reads.txt
touch QC/discrepant_reads.txt
for sample in $samples
do
    echo '---------- Sample' $sample '----------'
    echo 'Reads after FASTP filtering:'
    cat fastp/$sample/${sample}_fastp.json | jq '.summary.after_filtering.total_reads' > QC/${sample}_fastp.txt
    QC_fastp=$(cat QC/${sample}_fastp.txt)
    echo $QC_fastp
    echo 'Reads after FASTQ splitting:'
    cat split_fastq/$sample/*report.json | jq '.metadata.record_count' | paste -sd+ | bc > QC/${sample}_split.txt
    QC_split=$(cat QC/${sample}_split.txt)
    echo $QC_split
    echo 'Reads after alignment and deduplication step - UCD alignment:'
    cat dedup_alignment/UCD/$sample/$sample.stats | grep 'QC-passed reads' | awk '{print $1}' > QC/${sample}_bam_pass_UCD.txt
    QC_pass_UCD=$(cat QC/${sample}_bam_pass_UCD.txt)
    cat dedup_alignment/UCD/$sample/$sample.stats | grep 'secondary' | awk '{print $1}' > QC/${sample}_bam_sec_UCD.txt
    QC_sec_UCD=$(cat QC/${sample}_bam_sec_UCD.txt)
    reads_UCD=$((QC_pass_UCD-QC_sec_UCD))
    echo $reads_UCD > QC/${sample}_bam_reads_UCD.txt
    echo $reads_UCD
    echo 'Reads after alignment and deduplication step - Angus alignment:'
    cat dedup_alignment/Angus/$sample/$sample.stats | grep 'QC-passed reads' | awk '{print $1}' > QC/${sample}_bam_pass_Angus.txt
    QC_pass_Angus=$(cat QC/${sample}_bam_pass_Angus.txt)
    cat dedup_alignment/Angus/$sample/$sample.stats | grep 'secondary' | awk '{print $1}' > QC/${sample}_bam_sec_Angus.txt
    QC_sec_Angus=$(cat QC/${sample}_bam_sec_Angus.txt)
    reads_Angus=$((QC_pass_Angus-QC_sec_Angus))
    echo $reads_Angus > QC/${sample}_bam_reads_Angus.txt
    echo $reads_Angus
    md5_reads_fastp=$(md5sum QC/${sample}_fastp.txt | awk -F ' ' '{print $1}')
    md5_reads_split=$(md5sum QC/${sample}_split.txt | awk -F ' ' '{print $1}')
    md5_reads_UCD=$(md5sum QC/${sample}_bam_reads_UCD.txt | awk -F ' ' '{print $1}')
    md5_reads_Angus=$(md5sum QC/${sample}_bam_reads_Angus.txt | awk -F ' ' '{print $1}')
    if [ $md5_reads_fastp == $md5_reads_split ] && [ $md5_reads_split == $md5_reads_UCD ] && [ $md5_reads_UCD == $md5_reads_Angus ]; then
        echo $sample >> QC/matching_reads.txt
    else
        echo $sample >> QC/discrepant_reads.txt
    fi
    rm QC/${sample}*
done
echo --------------------------------------------------------------------------------------------------------
echo Samples with expected reads:
cat QC/matching_reads.txt | wc -l
cat QC/matching_reads.txt
echo --------------------------------------------------------------------------------------------------------
echo Samples with discrepant reads:
cat QC/discrepant_reads.txt | wc -l
cat QC/discrepant_reads.txt
rm -r QC

---------- Sample BSWDEUF000948260022 ----------
Reads after FASTP filtering:
218362236
Reads after FASTQ splitting:
218362236
Reads after alignment and deduplication step - UCD alignment:
218362236
Reads after alignment and deduplication step - Angus alignment:
218362236
---------- Sample BSWCHEM120074028822 ----------
Reads after FASTP filtering:
264501772
Reads after FASTQ splitting:
264501772
Reads after alignment and deduplication step - UCD alignment:
264501772
Reads after alignment and deduplication step - Angus alignment:
264501772
--------------------------------------------------------------------------------------------------------
Samples with expected reads:
2
BSWDEUF000948260022
BSWCHEM120074028822
--------------------------------------------------------------------------------------------------------
Samples with discrepant reads:
0


The above script is a test for two samples (BSWDEUF000948260022 and BSWCHEM120074028822) to check the different read groups during the pipeline.

First of all, a QC folder is created for the generation of temporary files including the number of reads for each sample in the following steps:
* After FASTP, by parsing and extracting the number of reads from the JSON of the QC'ed fastq files
* After splitting the files, by parsing and summing up the number of reads from the JSON of the different read-group-specific fastq files
* At the end of the pipeline, after the deduplication step. QC-passed reads and secondary reads are extracted from the summary files generated by Picard tools, for both assemblies. The total number of reads is the substraction of the secondary reads minus the QC-passed reads.

The reads of the different steps are listed and displayed for visual checking. However, in order to programmatically compare and pinpoint the discrepant samples, the following strategy is proposed:
* The four intermediate files (reads after FASTP, reads after splitting, reads after Picard tools for UCD and reads after Picard tools for Angus) are compared by checking the md5 sums
* Should the md5 values for the four files match, the sample name is included in a text file with successful samples.
* Should the md5 values not match, the sample name is included in a text file with discrepant samples

Having seen the good performance of the script, the check is run for all the samples - loop for all samples existing at the beginning of the pipeline - note that this step can be run only when all rules have been successfully completed for all samples (see script above):

In [3]:
%%bash

module load jq/1.5
path="/cluster/work/pausch/temp_scratch/audald/best_assembly"
cd $path
mkdir QC
touch QC/matching_reads.txt
touch QC/discrepant_reads.txt
for sample in `ls fastp/ | xargs -n 1 basename | awk -F " " '{print $1}'`
do
    cat fastp/$sample/${sample}_fastp.json | jq '.summary.after_filtering.total_reads' > QC/${sample}_fastp.txt
    QC_fastp=$(cat QC/${sample}_fastp.txt)
    cat split_fastq/$sample/*report.json | jq '.metadata.record_count' | paste -sd+ | bc > QC/${sample}_split.txt
    QC_split=$(cat QC/${sample}_split.txt)
    cat dedup_alignment/UCD/$sample/$sample.stats | grep 'QC-passed reads' | awk '{print $1}' > QC/${sample}_bam_pass_UCD.txt
    QC_pass_UCD=$(cat QC/${sample}_bam_pass_UCD.txt)
    cat dedup_alignment/UCD/$sample/$sample.stats | grep 'secondary' | awk '{print $1}' > QC/${sample}_bam_sec_UCD.txt
    QC_sec_UCD=$(cat QC/${sample}_bam_sec_UCD.txt)
    reads_UCD=$((QC_pass_UCD-QC_sec_UCD))
    echo $reads_UCD > QC/${sample}_bam_reads_UCD.txt
    cat dedup_alignment/Angus/$sample/$sample.stats | grep 'QC-passed reads' | awk '{print $1}' > QC/${sample}_bam_pass_Angus.txt
    QC_pass_Angus=$(cat QC/${sample}_bam_pass_Angus.txt)
    cat dedup_alignment/Angus/$sample/$sample.stats | grep 'secondary' | awk '{print $1}' > QC/${sample}_bam_sec_Angus.txt
    QC_sec_Angus=$(cat QC/${sample}_bam_sec_Angus.txt)
    reads_Angus=$((QC_pass_Angus-QC_sec_Angus))
    echo $reads_Angus > QC/${sample}_bam_reads_Angus.txt
    md5_reads_fastp=$(md5sum QC/${sample}_fastp.txt | awk -F ' ' '{print $1}')
    md5_reads_split=$(md5sum QC/${sample}_split.txt | awk -F ' ' '{print $1}')
    md5_reads_UCD=$(md5sum QC/${sample}_bam_reads_UCD.txt | awk -F ' ' '{print $1}')
    md5_reads_Angus=$(md5sum QC/${sample}_bam_reads_Angus.txt | awk -F ' ' '{print $1}')
    if [ $md5_reads_fastp == $md5_reads_split ] && [ $md5_reads_split == $md5_reads_UCD ] && [ $md5_reads_UCD == $md5_reads_Angus ]; then
        echo $sample >> QC/matching_reads.txt
    else
        echo $sample >> QC/discrepant_reads.txt
    fi
    rm QC/${sample}*
done
echo --------------------------------------------------------------------------------------------------------
echo Total number of samples:
ls fastp/ | wc -l
echo --------------------------------------------------------------------------------------------------------
echo Samples with expected reads:
cat QC/matching_reads.txt | wc -l
echo --------------------------------------------------------------------------------------------------------
echo Samples with discrepant reads:
cat QC/discrepant_reads.txt | wc -l
cat QC/discrepant_reads.txt
rm -r QC

--------------------------------------------------------------------------------------------------------
Total number of samples:
369
--------------------------------------------------------------------------------------------------------
Samples with expected reads:
369
--------------------------------------------------------------------------------------------------------
Samples with discrepant reads:
0


### Is my storage footprint too high?

One of the main advantages of deleting the intermediate files as the following ones are created (part of the Snakemake process) is that not all files are stored, so the footprint is not that high. However, some of the intermediate files (namely splitting fastq files) are kept after running the pipeline.

If all the pipeline steps are fine and all the samples have the proper number of reads, my advice would be to reduce the footprint by removing the read-group-specific fastq files providing that the above sections were successful.

In [None]:
%%bash

cd /cluster/work/pausch/temp_scratch/audald/best_assembly/split_fastq
rm */*fq.gz

Let us check the size of the resulting files:

In [4]:
%%bash

cd /cluster/work/pausch/temp_scratch/
du -hs audald/
du -hs audald/best_assembly
du -hs audald/best_assembly/dedup_alignment

15T	audald/
14T	audald/best_assembly
14T	audald/best_assembly/dedup_alignment


### Alternative, advanced and secondary checks

Other validation approaches were created. However, these are deemed as secondary or not rellevant, as the previous steps are enough for checking. Validation steps can fall into the following sections:

**Looking into the read groups and the file sizevwith in-house scripts**

The following complex scripts were created in order to visually check that the number of read groups before and after splitting the fastq files are the same. Additionally, we are approximately checking the sizes, although this is not very precise.

I am keeping the scripts here as credit for the time invested generating them but these are not actually necessary and actually may not be possible to run as intermediate files are no longer there.

The first script (read_groups.sh) is generated so we can retrieve the read groups for each original fastq file:

In [None]:
%%bash

samples="RM1179 BSWCHEM110445039823 BSWCHEM110471084194 BSWCHEM120102474324 BSWCHEM110294048847 BSWCHEF120118254972 BSWCHEM120032196501 BSWCHEM110098045882 RM1905 RM1478 BSWCHEM120131096320 BSWCHEM110280223340 BSWCHEM110246097145 BSWCHEM110211065827 BSWCHEF120114550399 BSWDEUM000910241203 BSWDEUM000947736431 BSWCHEM120020086135 BSWDEUM000937046641 RM1901 RM1897 BSWCHEF120023224572 BSWCHEM110153046663 BSWDEUM000910075535 BSWCHEM120059624513 BSWCHEM120062497968 BSWDEUM000910195137 BSWCHEM110200059950 RM1906 BSWCHEF120043744524 BSWDEUM000916243713 RM720 BSWUSAM000000189797 BSWDEUM000945581509 BSWCHEM110967032609 BSWDEUM000936791480 BSWDEUM000945510719 BSWCHEM120035700514 BSWCHEM110011106478 BSWCHEM120088525126 BSWCHEM120074415035 RM998 BSWCHEM120089597825 BSWDEUM000944554741 BSWDEUM000951444644 BSWCHEM120054755137 BSWCHEF120066378591 RM1293 BSWCHEM120134766497 BSWCHEF120136261792 RM723 BSWCHEM110177093315 BSWUSAM000000187361 BSWCHEM110294042272 BSWCHEM110262082439 BSWCHEM110293038924 RM997 BSWCHEM120053485530 BSWCHEF120071978076 BSWCHEM120111313935 BSWCHEM110967005528 BSWCHEM120040689392 RM947 BSWCHEM110480021128 BSWCHEM110173016820 BSWCHEM111030024286 BSWDEUM000916363895 BSWCHEM120054800752 RM2009 BSWDEUM000916604780 BSWCHEM110276043280 RM884 BSWCHEM110023102796 BSWCHEM110919005798 BSWCHEM120079379790 BSWCHEM110482036847 BSWCHEM110716018779 RM963 RM1908 BSWCHEM120089597801 BSWDEUM000932230236 BSWCHEM110098071188 BSWCHEM110716027719 BSWCHEM110269048407 BSWCHEM110040047292 BSWCHEM110023148411 BSWDEUM000933943664 BSWCHEM110706016501 BSWCHEM120090508056 BSWCHEF120071057962 BSWCHEM120097723193 NDA032 BSWCHEF120127770289 BSWCHEM110556008244 RM1512 RM887 BSWCHEM120052893046 BSWCHEM110785019936 BSWCHEM120103533136 BSWUSAM000000184087 BSWCHEM120060662634 BSWCHEM120085769301 BSWCHEM120102474546 BSWCHEM120134769801 BSWCHEM110060137584 BSWCHEF120127240430 BSWCHEM110097189518 BSWAUTM000336344707 BSWCHEM110468065700 BSWCHEM120033898589 BSWDEUM000923095366 BSWCHEM120027093693 BSWCHEM110008038300 RM1902 BSWCHEM110265057175 BSWCHEM111512030200 BSWCHEM110057253457 BSWCHEM110111072321 BSWCHEM110098063770 BSWCHEM111453045080 RM1899 BSWDEUM000943355630 BSWCHEM110043090653 RM945 RM1118 BSWDEUM000923957139 BSWCHEM110374046077 BSWCHEM120032300663 RM1898 RM1904 BSWDEUM000923732528 BSWCHEM110069121003 RM805 BSWCHEF120124686545 BSWCHEM120031901830 BSWCHEF120130256848 BSWCHEM120100207702 BSWCHEF120075378834 BSWCHEF120144682558 BSWCHEF120108462332 RM1907 BSWCHEMRM2549 BSWCHEM120066618543 BSWCHEM110203063244 BSWCHEM120134731440 BSWDEUM000916561366 BSWCHEM120101096138 BSWCHEM120105020658 BSWCHEM110106053793 BSWCHEM110501069856 RM721 RM806 BSWCHEM120052037815 BSWCHEM120070555216 BSWCHEF120121047356 BSWCHEF120114550405 BSWUSAM000000186276 RM730 BSWCHEM111453050169 BSWCHEM110518030504 BSWDEUM000942296230 BSWCHEM120026503698 BSWUSAM000000184138 RM948 RM808 BSWDEUM000923943405 RM1894 RM1108 RM937 BSWDEUM000940438790 BSWDEUF09492825451 BSWCHEM120004573248 BSWCHEM110131060582 BSWCHEM120094785965 RM1896 BSWDEUM000910136233 BSWDEUM000910204734 BSWCHEM120098716606 BSWCHEF120145581935 BSWDEUM000943861279 BSWDEUM000924027269 BSWCHEM120047653969 BSWCHEM120033040506 BSWCHEM120062337035 BSWCHEF120050280619 BSWDEUM000912484731 RM1900 BSWDEUM000942118016 BSWDEUM000946065717 RM1513 BSWAUTM000077152128 BSWCHEF120069106030 BSWCHEM120070757016 BSWCHEM120057745616 BSWCHEM120099980013 BSWCHEM120098367211 BSWCHEM120033527809 BSWCHEF120107227826 BSWUSAM000000189182 BSWCHEM110009030235 RM1903 RM724 BSWDEUM000923832197 RM966 RM938 NDA059 BSWDEUM000943861630 SRR6067094 BSWCHEM120038685214 BSWCHEF120118511723 BSWDEUM000910179749 BSWCHEM111452065584 BSWDEUF000948260022 BSWCHEF120110871894 BSWCHEM110711018798 RM855 BSWCHEM110097174286 BSWCHEF120082542037 RM1003 RM951 RM067 BSWDEUM000951442383 BSWDEUM000935830301 BSWDEUM000808024689 BSWCHEF120104805454 BSWDEUM000948243974 BSWDEUM000538927305 BSWUSAM000000195618 RM946 BSWCHEM120100176039 RM996 BSWCHEM110502100268 BSWDEUM000931161073 BSWAUTM000217825118 BSWCHEM120066947247 BSWCHEM120000414675 BSWCHEM120126513481 BSWCHEM110222168241 BSWCHEM110349149437 BSWDEUF09539252639 BSWCHEM120112699038 RM854 BSWCHEM120064046362 BSWCHEM120078454467 BSWCHEM110218087570 RM1023 BSWDEUM000813034326 BSWCHEM120060730418 BSWCHEF120108040714 BSWDEUM000937108856 BSWCHEM110254129111 BSWCHEM120104250872 BSWCHEM110121201483 BSWCHEM120053474381 BSWCHEF120080761164"
#list of all the samples is provided so bash can process them as a variable
for sample in $samples #looping all the elements within samples
do
    zcat /cluster/work/pausch/inputs/fastq/BTA/${sample}_R1.fastq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u > /cluster/work/pausch/temp_scratch/audald/read_groups/${sample}_readgroups.txt #retrieving read group
    echo 'File created for' $sample #Completing the loop
done

Once this is completed we end up with one text file per sample from where we can retrieve the read groups. Ideally, there should be as many read-group-specific fastq files as read groups. Actually, we are expecting twice as many fastq files and BAM files as read groups:
* R1 and R2 fastq files per read group
* UCD and ANGUS BAM files per read group

If this is correct, we can also take a look at the size; whether it seems to make sense. We would be expecting a similiar amount of Gb before splitting the fastq and after the split (summing up the splitted fastq). We would also be expecting a bit more than 2x for read-group-specific BAM files).

The second script (python) includes loops that permit the easy visualisation of number of files, size of the files and lists for files correctly processed and wrongly processed:

In [3]:
import os
import fnmatch

samples = "BSWAUTM000077152128",  "BSWAUTM000217825118",  "BSWCHEF120043744524",  "BSWCHEF120050280619",  "BSWCHEF120066378591",  "BSWCHEF120069106030",  "BSWCHEF120071978076",  "BSWCHEF120075378834",  "BSWCHEF120082542037",  "BSWCHEF120107227826",  "BSWCHEF120108040714",  "BSWCHEF120110871894",  "BSWCHEF120114550399",  "BSWCHEF120114550405",  "BSWCHEF120124686545",  "BSWCHEF120127240430",  "BSWCHEF120130256848",  "BSWCHEF120136261792",  "BSWCHEF120144682558",  "BSWCHEM110008038300",  "BSWCHEM110011106478",  "BSWCHEM110023148411",  "BSWCHEM110043090653",  "BSWCHEM110057253457",  "BSWCHEM110060137584",  "BSWCHEM110069121003",  "BSWCHEM110097174286",  "BSWCHEM110097189518",  "BSWCHEM110098045882",  "BSWCHEM110098071188",  "BSWCHEM110106053793",  "BSWCHEM110111072321",  "BSWCHEM110121201483",  "BSWCHEM110131060582",  "BSWCHEM110153046663",  "BSWCHEM110173016820",  "BSWCHEM110177093315",  "BSWCHEM110200059950",  "BSWCHEM110203063244",  "BSWCHEM110211065827",  "BSWCHEM110218087570",  "BSWCHEM110246097145",  "BSWCHEM110262082439",  "BSWCHEM110265057175",  "BSWCHEM110269048407",  "BSWCHEM110276043280",  "BSWCHEM110294048847",  "BSWCHEM110349149437",  "BSWCHEM110374046077",  "BSWCHEM110471084194",  "BSWCHEM110480021128",  "BSWCHEM110482036847",  "BSWCHEM110501069856",  "BSWCHEM110556008244",  "BSWCHEM110706016501",  "BSWCHEM110711018798",  "BSWCHEM110716018779",  "BSWCHEM110716027719",  "BSWCHEM110785019936",  "BSWCHEM110967005528",  "BSWCHEM110967032609",  "BSWCHEM111030024286",  "BSWCHEM111452065584",  "BSWCHEM111453045080",  "BSWCHEM111512030200",  "BSWCHEM120000414675",  "BSWCHEM120004573248",  "BSWCHEM120020086135",  "BSWCHEM120026503698",  "BSWCHEM120031901830",  "BSWCHEM120032196501",  "BSWCHEM120032300663",  "BSWCHEM120033040506",  "BSWCHEM120038685214",  "BSWCHEM120040689392",  "BSWCHEM120047653969",  "BSWCHEM120052037815",  "BSWCHEM120052893046",  "BSWCHEM120053474381",  "BSWCHEM120053485530",  "BSWCHEM120054755137",  "BSWCHEM120054800752",  "BSWCHEM120059624513",  "BSWCHEM120060730418",  "BSWCHEM120062337035",  "BSWCHEM120062497968",  "BSWCHEM120064046362",  "BSWCHEM120066618543",  "BSWCHEM120066947247",  "BSWCHEM120070555216",  "BSWCHEM120070757016",  "BSWCHEM120074415035",  "BSWCHEM120078454467",  "BSWCHEM120079379790",  "BSWCHEM120088525126",  "BSWCHEM120089597801",  "BSWCHEM120089597825",  "BSWCHEM120094785965",  "BSWCHEM120098716606",  "BSWCHEM120100176039",  "BSWCHEM120101096138",  "BSWCHEM120102474324",  "BSWCHEM120102474546",  "BSWCHEM120103533136",  "BSWCHEM120105020658",  "BSWCHEM120111313935",  "BSWCHEM120112699038",  "BSWCHEM120126513481",  "BSWCHEM120134766497",  "BSWCHEM120134769801",  "BSWDEUM000910075535",  "BSWDEUM000910195137",  "BSWDEUM000940438790",  "BSWDEUM000942296230",  "BSWDEUM000943355630",  "BSWDEUM000943861630",  "BSWDEUM000945510719",  "BSWDEUM000945581509",  "BSWUSAM000000184087",  "BSWUSAM000000184138",  "BSWUSAM000000187361",  "BSWUSAM000000189182",  "BSWUSAM000000189797",  "NDA032",  "RM067",  "RM1003",  "RM1118",  "RM1179",  "RM1293",  "RM1478",  "RM1512",  "RM1898",  "RM1899",  "RM1900",  "RM1901",  "RM1905",  "RM1906",  "RM2009",  "RM720",  "RM721",  "RM723",  "RM730",  "RM806",  "RM808",  "RM854",  "RM855",  "RM884",  "RM887",  "RM937",  "RM938",  "RM945",  "RM947",  "RM948",  "RM951",  "RM963",  "RM996",  "RM997",  "RM998", "BSWAUTM000336344707",  "BSWCHEF120023224572",  "BSWCHEF120071057962",  "BSWCHEF120080761164",  "BSWCHEF120104805454",  "BSWCHEF120108462332",  "BSWCHEF120118254972",  "BSWCHEF120118511723",  "BSWCHEF120121047356",  "BSWCHEF120127770289",  "BSWCHEF120145581935",  "BSWCHEM110009030235",  "BSWCHEM110023102796",  "BSWCHEM110040047292",  "BSWCHEM110098063770",  "BSWCHEM110222168241",  "BSWCHEM110254129111",  "BSWCHEM110280223340",  "BSWCHEM110293038924",  "BSWCHEM110294042272",  "BSWCHEM110445039823",  "BSWCHEM110468065700",  "BSWCHEM110502100268",  "BSWCHEM110518030504",  "BSWCHEM110919005798",  "BSWCHEM111453050169",  "BSWCHEM120027093693",  "BSWCHEM120033527809",  "BSWCHEM120033898589",  "BSWCHEM120035700514",  "BSWCHEM120057745616",  "BSWCHEM120060662634",  "BSWCHEM120085769301",  "BSWCHEM120090508056",  "BSWCHEM120097723193",  "BSWCHEM120098367211",  "BSWCHEM120099980013",  "BSWCHEM120100207702",  "BSWCHEM120104250872",  "BSWCHEM120131096320",  "BSWCHEM120134731440",  "BSWCHEMRM2549",  "BSWDEUF09492825451",  "BSWDEUF09539252639",  "BSWDEUM000538927305",  "BSWDEUM000943861279",  "BSWDEUM000944554741",  "BSWDEUM000947736431",  "BSWDEUM000948243974",  "BSWDEUM000951442383",  "BSWDEUM000951444644",  "BSWUSAM000000186276",  "BSWUSAM000000195618",  "NDA059",  "RM1023",  "RM1108",  "RM1513",  "RM1894",  "RM1896",  "RM1897",  "RM1902",  "RM1903",  "RM1904",  "RM1907",  "RM1908",  "RM724",  "RM805",  "RM946",  "RM966"
matching_samples= list()
non_matching_samples= list()
for sample in samples:
    print(sample)
    read_group_file = "/cluster/work/pausch/temp_scratch/audald/read_groups/"+sample+"_readgroups.txt"
    read_group_file = open(read_group_file, 'r')
    read_groups = read_group_file.readlines()
    for read_group in read_groups:
        read_group = read_group.strip()
        #print(read_group)
    print("Number of read groups: ")
    print(len(read_groups))
    all_bam_files=os.listdir("/cluster/work/pausch/temp_scratch/audald/sorted_alignment/"+sample)
    bam_files = list()
    size_bam_files = list()
    for file in all_bam_files:
        if fnmatch.fnmatch(file, '*.bam'):
            bam_files.append(file)
            bam_file_size = os.path.getsize("/cluster/work/pausch/temp_scratch/audald/alignment/"+sample+"/"+file)
            bam_file_size = round(bam_file_size/1024/1024/1024,1)
            size_bam_files.append(bam_file_size)
    #print(sorted(bam_files))
    print("Number of bam files: ")
    print(len(bam_files))
    #print(size_bam_files)
    print("Total size of bam files in Gb:")
    print(round(sum(size_bam_files),1))
    print("Do the number of files match?")
    if (len(read_groups)*2) == len(bam_files):
        print("All good!")
        matching_samples.append(sample)
    else:
        print("Nope, it does not match!")
        non_matching_samples.append(sample)
    print("")
print("--------------------------")
print("Matching samples: ")
print(len(matching_samples))
print(matching_samples)
print("Non matching samples: ") 
print(len(non_matching_samples))
print(non_matching_samples)

BSWAUTM000077152128
Number of read groups: 
3
Number of bam files: 
6
Total size of bam files in Gb:
84.6
Do the number of files match?
All good!

BSWAUTM000217825118
Number of read groups: 
3
Number of bam files: 
6
Total size of bam files in Gb:
122.1
Do the number of files match?
All good!

BSWCHEF120043744524
Number of read groups: 
13
Number of bam files: 
26
Total size of bam files in Gb:
79.7
Do the number of files match?
All good!

BSWCHEF120050280619
Number of read groups: 
5
Number of bam files: 
10
Total size of bam files in Gb:
77.0
Do the number of files match?
All good!

BSWCHEF120066378591
Number of read groups: 
4
Number of bam files: 
8
Total size of bam files in Gb:
77.9
Do the number of files match?
All good!

BSWCHEF120069106030
Number of read groups: 
4
Number of bam files: 
8
Total size of bam files in Gb:
78.5
Do the number of files match?
All good!

BSWCHEF120071978076
Number of read groups: 
2
Number of bam files: 
4
Total size of bam files in Gb:
62.0
Do the n

Number of bam files: 
32
Total size of bam files in Gb:
61.7
Do the number of files match?
All good!

RM067
Number of read groups: 
6
Number of bam files: 
12
Total size of bam files in Gb:
57.6
Do the number of files match?
All good!

RM1003
Number of read groups: 
8
Number of bam files: 
16
Total size of bam files in Gb:
52.8
Do the number of files match?
All good!

RM1118
Number of read groups: 
4
Number of bam files: 
8
Total size of bam files in Gb:
59.9
Do the number of files match?
All good!

RM1179
Number of read groups: 
1
Number of bam files: 
2
Total size of bam files in Gb:
70.1
Do the number of files match?
All good!

RM1293
Number of read groups: 
4
Number of bam files: 
8
Total size of bam files in Gb:
89.2
Do the number of files match?
All good!

RM1478
Number of read groups: 
8
Number of bam files: 
16
Total size of bam files in Gb:
86.2
Do the number of files match?
All good!

RM1512
Number of read groups: 
8
Number of bam files: 
16
Total size of bam files in Gb:
100

Number of read groups: 
6
Number of bam files: 
12
Total size of bam files in Gb:
49.2
Do the number of files match?
All good!

BSWCHEM120027093693
Number of read groups: 
5
Number of bam files: 
10
Total size of bam files in Gb:
83.5
Do the number of files match?
All good!

BSWCHEM120033527809
Number of read groups: 
5
Number of bam files: 
10
Total size of bam files in Gb:
57.5
Do the number of files match?
All good!

BSWCHEM120033898589
Number of read groups: 
6
Number of bam files: 
12
Total size of bam files in Gb:
38.5
Do the number of files match?
All good!

BSWCHEM120035700514
Number of read groups: 
3
Number of bam files: 
6
Total size of bam files in Gb:
44.7
Do the number of files match?
All good!

BSWCHEM120057745616
Number of read groups: 
3
Number of bam files: 
6
Total size of bam files in Gb:
131.6
Do the number of files match?
All good!

BSWCHEM120060662634
Number of read groups: 
2
Number of bam files: 
4
Total size of bam files in Gb:
45.0
Do the number of files matc

A list of samples is initially provided. For each sample:

- The text files generated in the previous step are parsed and the number of read groups is provided
- Fastq files found in the after splitting folder are listed and the number is provided. 2x number of read groups (R1 and R2 is expected)
- Total size of read-group-specific Fastq files is calculated and displayed
- BAM files found in the after alignment folder are listed and the number is provided. 2x number of read groups (UCD and Angus is expected)
- Total size of read-group-specific BAM files is calculated and displayed
- The number of read groups and the number of BAM files is compared. The answer is "all good" when the number of BAM files is 2x the number of read groups from the text file.
- Matching and non-matching samples are appended in different lists, which are displayed at the end.

Read groups can be also directly retrieved from the BAM files, so these can be compared to the starting point / expected ones. Therefore, a total comparison of read groups would be: read groups from the text file *vs* number of read-group-specific FASTQ files *vs* number of read groups within the final BAM file.

How to retrieve read groups from BAM files and compare to the read-group-specific FASTQ file:

In [36]:
%%bash
module load samtools/1.6
path="/cluster/work/pausch/temp_scratch/audald/best_assembly"
samples="RM1513 BSWCHEM111512030200"
aligned="dedup_alignment"
split_fastq="split_fastq"
for sample in $samples
do
    cd $path
    echo '---------- Sample' $sample '----------'
    cd $split_fastq
    echo 'Read groups once FASTQ are split:'
    echo 'R1:'
    ls $sample/${sample}*R1.fq.gz | wc -l
    ls $sample/${sample}*R1.fq.gz | awk -F "_" '{print $2"."$3}'
    echo 'R2:'
    ls $sample/${sample}*R2.fq.gz | wc -l
    ls $sample/${sample}*R2.fq.gz | awk -F "_" '{print $2"."$3}'
    cd ../$aligned
    echo 'Read groups for UCD alignment:' 
    samtools view -H UCD/$sample/$sample.bam | grep '@RG' | grep -v @PG | wc -l
    samtools view -H UCD/$sample/$sample.bam | grep '@RG' | grep -v @PG | awk -F "\t" '{print $2}' | awk -F ":" '{print $2}'| sort -u #discarding the @PG header
    echo 'Read groups for Angus alignment:'
    samtools view -H UCD/$sample/$sample.bam | grep '@RG' | grep -v @PG | wc -l
    samtools view -H UCD/$sample/$sample.bam | grep '@RG' | grep -v @PG | awk -F "\t" '{print $2}' | awk -F ":" '{print $2}'| sort -u #discarding the @PG header
done

---------- Sample RM1513 ----------
Read groups once FASTQ are split:
R1:
8
H3WNWDSXX.1
H3WNWDSXX.2
H3WNWDSXX.3
H3WNWDSXX.4
H52LGDSXX.1
H52LGDSXX.2
H52LGDSXX.3
H52LGDSXX.4
R2:
8
H3WNWDSXX.1
H3WNWDSXX.2
H3WNWDSXX.3
H3WNWDSXX.4
H52LGDSXX.1
H52LGDSXX.2
H52LGDSXX.3
H52LGDSXX.4
Read groups for UCD alignment:
8
H3WNWDSXX.1
H3WNWDSXX.2
H3WNWDSXX.3
H3WNWDSXX.4
H52LGDSXX.1
H52LGDSXX.2
H52LGDSXX.3
H52LGDSXX.4
Read groups for Angus alignment:
8
H3WNWDSXX.1
H3WNWDSXX.2
H3WNWDSXX.3
H3WNWDSXX.4
H52LGDSXX.1
H52LGDSXX.2
H52LGDSXX.3
H52LGDSXX.4
---------- Sample BSWCHEM111512030200 ----------
Read groups once FASTQ are split:
R1:
1
D1C72ACXX.4
R2:
1
D1C72ACXX.4
Read groups for UCD alignment:
1
D1C72ACXX.4
Read groups for Angus alignment:
1
D1C72ACXX.4


**Softwares and tools**

- Picard tools can be used for validating the BAM files as indicated [here](http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile)

In [5]:
%%bash

cd /cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment
module load picard/2.9.2
echo 'Summary of BSWCHEM120070555216_H7HFFDSXX_1_UCD.bam via Picard tools'
picard ValidateSamFile  I=UCD/BSWCHEM120070555216/BSWCHEM120070555216.bam MODE=SUMMARY
echo 'Summary of BSWCHEM120070555216_H7HFFDSXX_1_Angus.bam via Picard tools'
picard ValidateSamFile  I=Angus/BSWCHEM120070555216/BSWCHEM120070555216.bam MODE=SUMMARY

Summary of BSWCHEM120070555216_H7HFFDSXX_1_UCD.bam via Picard tools
No errors found
Summary of BSWCHEM120070555216_H7HFFDSXX_1_Angus.bam via Picard tools
No errors found


[Fri Feb 07 10:32:23 CET 2020] Executing as avillas@lo-a2-073 on Linux 3.10.0-862.14.4.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15; Picard version: 2.9.2-SNAPSHOT
INFO	2020-02-07 10:33:04	SamFileValidator	Validated Read    10,000,000 records.  Elapsed time: 00:00:41s.  Time for last 10,000,000:   41s.  Last read position: 1:114,066,540
INFO	2020-02-07 10:33:41	SamFileValidator	Validated Read    20,000,000 records.  Elapsed time: 00:01:17s.  Time for last 10,000,000:   36s.  Last read position: 2:70,528,039
INFO	2020-02-07 10:34:17	SamFileValidator	Validated Read    30,000,000 records.  Elapsed time: 00:01:53s.  Time for last 10,000,000:   35s.  Last read position: 3:48,169,658
INFO	2020-02-07 10:34:53	SamFileValidator	Validated Read    40,000,000 records.  Elapsed time: 00:02:29s.  Time for last 10,000,000:   36s.  Last read position: 4:42,438,752
INFO	2020-02-07 10:35:30	SamFileValidator	Validated Read    50,000,000 records.  Elapsed time: 00:03:06s.  Time for la

- Samtools may be another option for validation.