## Read Groups (RG)

This is an exploratory notebook to find out what Read Groups (RG) are, how many RG there are in our FASTQ files of interest, and how a separate, read-group-specific FASTQ file can be generated for each RG.

Some of the links I found useful:
* https://software.broadinstitute.org/gatk/documentation/article.php?id=6472
* https://gatkforums.broadinstitute.org/gatk/discussion/3060/how-should-i-pre-process-data-from-multiplexed-sequencing-and-multi-library-designs
* https://drive.google.com/drive/folders/0BwTg3aXzGxEDR2JKTUIxVGxCT1k

In [3]:
cd /cluster/work/pausch/temp_scratch/audald/best_assembly/fastp/BSWDEUM000945581509 #This folder contains the FASTQ files after fastp
ls -lrth

total 24G
-rw-rw---- 1 avillas hest-hpc-tg  13G Jan 23 16:33 BSWDEUM000945581509_R2.fastq.gz
-rw-rw---- 1 avillas hest-hpc-tg  12G Jan 23 16:33 BSWDEUM000945581509_R1.fastq.gz
-rw-rw---- 1 avillas hest-hpc-tg 140K Jan 23 16:33 BSWDEUM000945581509_fastp.json
-rw-rw---- 1 avillas hest-hpc-tg 487K Jan 23 16:33 BSWDEUM000945581509_fastp.html


Let us see what a couple of reads look like in a FASTQ file:

In [4]:
cd /cluster/work/pausch/temp_scratch/audald/best_assembly/fastp/BSWDEUM000945581509
zcat BSWDEUM000945581509_R1.fastq.gz | head -16
zcat BSWDEUM000945581509_R2.fastq.gz | head -16

@A00460:55:H5KCGDSXX:1:1101:1344:1094 1:N:0:TTATAACC+GATATCGA
CGCCCTGGGGGTGGGAAGAGGATGGGTGAGAGAGGGCGTTGGATCCACTGCAGCACCAGCCTCCCACCAACCCAGACAGCCCATTCAGGGCCTTGGCGCTTGCTGCTCCCCTCCCCAGCCACCCACCTGGCTCGCCCCTCAATCCAGCCAG
+
FFF:::,F::F,:F:F:,F,:FFFF,:,FF:::,:FF,F,FF,F:F:F,,FFFFFFFF,FFFFFFFFF,FFFFF:FFFFFFFFFFFF::FFF,,F,FFF,:,F,:F,:FFFFFF,FF,:F,FFFFFFF:FF,,FFF,FFFF,,FFF:FFF,
@A00460:55:H5KCGDSXX:1:1101:1705:1094 1:N:0:TTATAACC+GATATCGA
GAGAAGAAAGAGCCTCAGCTCAATTCGGTGACTTGCCCACCATCATGTAGGAGATGAACGGCCCCATAGGAGTCTTCTGTTCTGATGCAGAAATATATTATTGCTGCTTTTCCTGCCTTAATTTTTGTCTGTTTCTTTAAGGAGCACTTCC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF,FFF:FFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFF
@A00460:55:H5KCGDSXX:1:1101:3803:1094 1:N:0:TTATAACC+GATATCGA
GTGTCTTTTGTGGTGCTTCTTACATGAGAAATCTATGTCTAACCCAAAATTACTAAGATTTTATCCGAATTTTTGTTAGGAATTTTACACTTTTAGCTCATACATATGTATCTGGCATAAGGTAAAGAACTAGATATTTTCCCCTGCATAT
+
:,:F:FF:,F,:FF,FF,,F,F:F:,:,FFF,,F,F,FF:FF,F:F,,

FASTQ files are organised in records of 4 lines each. The first line of a record is the sequence identifier, with fields separated by ":". Taking "@J00121:120:HJ5V5BBXX:3:2228:29589:19812/1" as an example, these are the fields:
* Instrument: J00121
* Run number: 120
* Flowcell ID: HJ5V5BBXX
* Lane number: 3
* Tile: 2228
* X position: 29589
* Y position: 19812
* Read: 1

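These fields are described in the Illumina documentation: https://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

The reads shown above use the newer header style, in which the read number is not appended as "/1" but follows a space (the leading "1" in "1:N:0:TTATAACC+GATATCGA"); the first seven fields are laid out in the same way.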

A **read group can be considered the Flowcell ID + Lane number** (third and fourth field).
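
As a minimal sketch of this definition, the third and fourth ":"-separated fields can be pulled out of a single header line with `cut`. The header is one of the reads printed above; the variable names are only illustrative.

```shell
# One header line copied from the R1 output shown earlier.
header='@A00460:55:H5KCGDSXX:1:1101:1344:1094 1:N:0:TTATAACC+GATATCGA'

# Field 3 is the flowcell ID, field 4 is the lane number.
flowcell=$(printf '%s\n' "$header" | cut -d':' -f3)
lane=$(printf '%s\n' "$header" | cut -d':' -f4)

# Read group as used in this notebook: flowcell ID + lane number.
read_group="${flowcell}:${lane}"
printf 'Flowcell ID: %s\nLane number: %s\nRead group: %s\n' "$flowcell" "$lane" "$read_group"
```

For this header the read group is `H5KCGDSXX:1`.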

I am taking the bash command [Hubert uses in his original pipeline](Original_pipeline/Hubert_Pausch_initial_code.ipynb) to extract the different read groups from a FASTQ file. The resulting read groups are deduplicated and printed in sorted order.

In [1]:
cd /cluster/work/pausch/temp_scratch/audald/best_assembly/fastp/BSWDEUM000916561366
zcat BSWDEUM000916561366_R1.fastq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u
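#Using ":" as separator, the first awk prints flowcell ID and lane number (fields 3 and 4) for the odd-numbered lines, i.e. the header line and the "+" line of each record.
#The second awk keeps only the odd-numbered lines of that output, so in the end only the header line of every 4-line record contributes a flowcell:lane pair.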

C2FH5ACXX:1
C2FH5ACXX:2
C2FH5ACXX:3
C2FH5ACXX:4
C2FH5ACXX:5
C2FH5ACXX:6
C2FH5ACXX:7
C2FH5ACXX:8
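
The same information can probably be obtained in a single awk pass by selecting every fourth line directly. The sketch below is an alternative written for illustration, not the command from Hubert's pipeline; it also counts the reads per read group. The tiny FASTQ file is synthetic (hypothetical reads) and stands in for the output of `zcat sample_R1.fastq.gz`.

```shell
# Synthetic FASTQ with three hypothetical reads: two on lane 1, one on lane 3.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@A00460:55:H5KCGDSXX:1:1101:1344:1094 1:N:0:TTATAACC+GATATCGA
ACGT
+
FFFF
@A00460:55:H5KCGDSXX:1:1101:1705:1094 1:N:0:TTATAACC+GATATCGA
TTGA
+
FF:F
@A00460:55:H5KCGDSXX:3:1101:3803:1094 1:N:0:TTATAACC+GATATCGA
GGCA
+
F,FF
EOF

# NR%4==1 selects only the header line of each 4-line record.
# sort | uniq -c counts the reads per flowcell:lane pair; the last awk
# reorders the columns to "read_group count".
rg_counts=$(awk -F':' 'NR%4==1 {print $3":"$4}' "$fastq" | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$rg_counts"
rm -f "$fastq"
```

Selecting by line position rather than by a leading "@" matters because a quality line may itself start with "@".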


Also, as exploration, I would like to see what the read groups look like in a BAM file. Taking a BAM file from the group's work, I use samtools to display the RG.

In [2]:
cd /cluster/work/pausch/temp_scratch/audald/best_assembly/dedup_alignment/Angus/BSWCHEM120038685214
module load samtools/1.6
samtools view -H BSWCHEM120038685214.bam | grep '@RG'

@RG	ID:HM5LGDSXX.3	CN:TUM	LB:BSWCHEM120038685214	PL:illumina	PU:HM5LGDSXX:3	SM:BSWCHEM120038685214
@RG	ID:HM5LGDSXX.2	CN:TUM	LB:BSWCHEM120038685214	PL:illumina	PU:HM5LGDSXX:2	SM:BSWCHEM120038685214
@PG	ID:bwa	PN:bwa	CL:bwa mem -M -t 12 -R @RG\tID:HM5LGDSXX.2\tCN:TUM\tLB:BSWCHEM120038685214\tPL:illumina\tPU:HM5LGDSXX:2\tSM:BSWCHEM120038685214 /cluster/work/pausch/temp_scratch/audald/best_assembly/ref/Angus_ref.fa /cluster/work/pausch/temp_scratch/audald/best_assembly/split_fastq/BSWCHEM120038685214/BSWCHEM120038685214_HM5LGDSXX_2_R1.fq.gz /cluster/work/pausch/temp_scratch/audald/best_assembly/split_fastq/BSWCHEM120038685214/BSWCHEM120038685214_HM5LGDSXX_2_R2.fq.gz	VN:0.7.17-r1188
@PG	ID:bwa.1	PN:bwa	CL:bwa mem -M -t 12 -R @RG\tID:HM5LGDSXX.3\tCN:TUM\tLB:BSWCHEM120038685214\tPL:illumina\tPU:HM5LGDSXX:3\tSM:BSWCHEM120038685214 /cluster/work/pausch/temp_scratch/audald/best_assembly/ref/Angus_ref.fa /cluster/work/pausch/temp_scratch/audald/best_assembly/split_fastq/BSWCHEM120038685214/BSWCH

The fields are described here: https://samtools.github.io/hts-specs/SAMv1.pdf

@RG opens a read group header line, where:
* ID: Read group identifier. Each @RG line must have an ID that is unique among all read groups in the header section. The value of ID is used in the RG tags of alignment records.
* CN: Name of sequencing center producing the read
* LB: Library
* PL: Platform/technology used to produce the reads. Valid values: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO.
* PU: Platform unit. Unique identifier; here the flowcell ID and lane number (e.g. HM5LGDSXX:3)
* SM: Sample

@PG opens a header line describing a program that was used. In this BAM there is one @PG line per @RG line, because bwa was run once per read group. Where:
* ID: Program record identifier. Each @PG line must have a unique ID
* PN: Program name
* CL: Command line. UTF-8 encoding may be used
* VN: Program version
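
To make the link between the header fields and the `-R` argument visible in the @PG lines above, here is a minimal sketch that assembles such a read group string from its parts. The variable names are hypothetical, and the values are copied from the header shown above.

```shell
# Hypothetical inputs; in practice they could be derived from the FASTQ file name.
sample='BSWCHEM120038685214'
flowcell='HM5LGDSXX'
lane='3'
center='TUM'

# "\\t" in the printf format yields a literal backslash followed by "t",
# which is the form seen in the CL field of the @PG lines.
rg=$(printf '@RG\\tID:%s.%s\\tCN:%s\\tLB:%s\\tPL:illumina\\tPU:%s:%s\\tSM:%s' \
    "$flowcell" "$lane" "$center" "$sample" "$flowcell" "$lane" "$sample")

# printf '%s' is used instead of echo so the backslashes are printed unchanged.
printf '%s\n' "$rg"
```

ID uses a dot and PU a colon between flowcell and lane, following the header shown above.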

Now that it is clear what read groups are and how they can be identified, each sample's FASTQ files need to be split into separate FASTQ files, one per read group. Given that between 1 and 8 read groups are detected per FASTQ file, each file will be split into between 1 and 8 files, and the exact number is not known in advance.

### GDC-FASTQ-SPLITTER

After discussing with Zih-Hua, we decided to use [gdc-fastq-splitter](https://github.com/kmhernan/gdc-fastq-splitter) for this task. All the instructions on how to install it on the cluster (it has not been easy) are documented in the ["How_I_did_stuff"](https://github.com/Audald/ETH_Jupyter/blob/master/How_I_did_stuff.ipynb) notebook.

These are the notes Zih-Hua shared with me:

In [None]:
input='path/to/input_folder/'
output='path/to/output_folder/'
split='/cluster/home/fangzi/gdc-fastq-splitter/venv/bin/gdc-fastq-splitter'
sample_id='SAMPLE_ID' #Placeholder: ID of the sample to split

bsub -n 1 -W 6:00 -R "rusage[mem=3000,scratch=1000]" -J "fastq_split" -env "all" "unset PYTHONPATH;${split} -o ${output}${sample_id}_ ${input}${sample_id}_R1.fastq.gz ${input}${sample_id}_R2.fastq.gz"

After doing some testing, this is the optimised command that is used in Snakemake ([Snakefile](Snakemake/Snakefile.py), [Configuration details](Snakemake/config.yaml) and [Cluster details](Snakemake/cluster.json)):

In [None]:
module load python_gpu/3.7.4
bsub -n 6 -W 23:59 -R "rusage[mem=6000,scratch=1000]" -J "fastq_split" -env "all" "/cluster/work/pausch/audald/software/gdc-fastq-splitter/venv/bin/gdc-fastq-splitter -o /cluster/work/pausch/temp_scratch/audald/best_assembly/split_fastq/BSWDEUM000916561366_ /cluster/work/pausch/temp_scratch/audald/best_assembly/fastp/BSWDEUM000916561366/BSWDEUM000916561366_R1.fastq.gz /cluster/work/pausch/temp_scratch/audald/best_assembly/fastp/BSWDEUM000916561366/BSWDEUM000916561366_R2.fastq.gz"
#The BSWDEUM000916561366 FASTQ files are used as a test and the results are copied into the split_fastq folder.
#Setting the output prefix is important because it determines the final names of the FASTQ files. In this case, the names are BSWDEUM000916561366_readgroup_R{1,2}.fq.gz (e.g. BSWDEUM000916561366_C2FH5ACXX_7_R1.fq.gz)
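
Given that naming scheme, the expected output file names can be predicted from the read group list. The following is only a sketch: the two read groups are a hypothetical subset of the eight found earlier, and the name pattern is the one described in the comment above.

```shell
# Hypothetical read-group list, in the flowcell:lane form printed earlier.
sample='BSWDEUM000916561366'
read_groups='C2FH5ACXX:1
C2FH5ACXX:7'

# Turn each flowcell:lane pair into the two expected file names (R1 and R2).
expected=$(printf '%s\n' "$read_groups" | awk -F':' -v s="$sample" '{
    for (r = 1; r <= 2; r++) print s "_" $1 "_" $2 "_R" r ".fq.gz"
}')
printf '%s\n' "$expected"
```

With n read groups this lists 2n names, which can then be compared against the contents of the output folder.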

We will now focus on a specific example to demonstrate that this step works properly. That is: there are as many resulting files as read groups in the original files. Also, the name of each resulting file and the read group it contains should match:

In [3]:
cd /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707
ls -lrth
echo 'Number of read groups in original BSWAUTM000336344707 FASTQ file:'
zcat /cluster/work/pausch/temp_scratch/audald/fastp/BSWAUTM000336344707/BSWAUTM000336344707_R1.fastq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u
echo 'Number of read groups in file BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz:'
zcat /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u
echo 'Number of read groups in file BSWAUTM000336344707_H5KCGDSXX_3_R1.fq.gz:'
zcat /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_3_R1.fq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u
echo 'Number of read groups in file BSWAUTM000336344707_H5KCGDSXX_4_R1.fq.gz:'
zcat /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707/BSWAUTM000336344707_H5KCGDSXX_4_R1.fq.gz | awk -F":" 'NR%2==1 {print $3,$4}' | awk 'NR%2==1 {print $1":"$2}' | sort -u

total 16G
-rw-rw---- 1 avillas hest-hpc-tg  287 Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_4_R1.report.json
-rw-rw---- 1 avillas hest-hpc-tg 2.7G Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_4_R1.fq.gz
-rw-rw---- 1 avillas hest-hpc-tg  287 Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_3_R1.report.json
-rw-rw---- 1 avillas hest-hpc-tg 2.6G Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_3_R1.fq.gz
-rw-rw---- 1 avillas hest-hpc-tg  287 Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_1_R1.report.json
-rw-rw---- 1 avillas hest-hpc-tg 2.5G Nov  7 12:13 BSWAUTM000336344707_H5KCGDSXX_1_R1.fq.gz
-rw-rw---- 1 avillas hest-hpc-tg  287 Nov  7 12:16 BSWAUTM000336344707_H5KCGDSXX_4_R2.report.json
-rw-rw---- 1 avillas hest-hpc-tg 2.8G Nov  7 12:16 BSWAUTM000336344707_H5KCGDSXX_4_R2.fq.gz
-rw-rw---- 1 avillas hest-hpc-tg  287 Nov  7 12:16 BSWAUTM000336344707_H5KCGDSXX_3_R2.report.json
-rw-rw---- 1 avillas hest-hpc-tg 2.8G Nov  7 12:16 BSWAUTM000336344707_H5KCGDSXX_3_R2.fq.gz
-rw-rw---- 1 avillas hest-hpc-tg  287 No
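
The per-file checks above could be generalised into a loop that derives the expected read group from each file name and compares it with the read groups found in the file. This is a minimal sketch on two tiny, uncompressed, hypothetical files; it assumes the name pattern `<sample>_<flowcell>_<lane>_R1.fq`, and for the real data each file would be read through `zcat`.

```shell
workdir=$(mktemp -d)

# Two hypothetical split files with one read each (index sequence shortened).
printf '@A00460:55:H5KCGDSXX:1:1101:1344:1094 1:N:0:TT+GA\nACGT\n+\nFFFF\n' \
    > "$workdir/SAMPLE_H5KCGDSXX_1_R1.fq"
printf '@A00460:55:H5KCGDSXX:3:1101:1705:1094 1:N:0:TT+GA\nTTGA\n+\nFFFF\n' \
    > "$workdir/SAMPLE_H5KCGDSXX_3_R1.fq"

status='OK'
for f in "$workdir"/*_R1.fq; do
    # The last two "_"-separated parts of the name are flowcell and lane.
    name=$(basename "$f" _R1.fq)
    expected=$(printf '%s\n' "$name" | awk -F'_' '{print $(NF-1)":"$NF}')
    # All distinct read groups present in the headers of this file.
    found=$(awk -F':' 'NR%4==1 {print $3":"$4}' "$f" | sort -u)
    if [ "$found" = "$expected" ]; then
        echo "$name: only $expected found"
    else
        echo "$name: expected $expected, found: $found"
        status='MISMATCH'
    fi
done
echo "Overall: $status"
rm -rf "$workdir"
```

A file that contains more than one read group also fails the comparison, because `found` then spans several lines.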

Although the number of read groups in the original file is the same as the number of files created, and although the read groups in the headers of the read-group-specific FASTQ files match the expected names, I would also like to check that the number of reads in the original FASTQ file is exactly the same as the sum of reads in the generated FASTQ files.

Let us focus again on **BSWAUTM000336344707** to see whether everything is correct (jq is used here; see ["How_I_did_stuff"](https://github.com/Audald/ETH_Jupyter/blob/master/How_I_did_stuff.ipynb) for the installation and usage details):

In [14]:
module load jq/1.5
cd /cluster/work/pausch/temp_scratch/audald/split_fastq/BSWAUTM000336344707
echo 'Number of reads in the original file (including R1 and R2):'
cat /cluster/work/pausch/temp_scratch/audald/fastp/BSWAUTM000336344707/BSWAUTM000336344707_fastp.json | jq '.filtering_result.passed_filter_reads'
echo 'Number of reads in the first read group:'
cat BSWAUTM000336344707_H5KCGDSXX_1_R1.report.json | jq '.metadata.record_count' > BSWAUTM000336344707_H5KCGDSXX_1_reads.txt
cat BSWAUTM000336344707_H5KCGDSXX_1_reads.txt
echo 'Number of reads in the second read group:'
cat BSWAUTM000336344707_H5KCGDSXX_3_R1.report.json | jq '.metadata.record_count' > BSWAUTM000336344707_H5KCGDSXX_3_reads.txt
cat BSWAUTM000336344707_H5KCGDSXX_3_reads.txt
echo 'Number of reads in the third read group:'
cat BSWAUTM000336344707_H5KCGDSXX_4_R1.report.json | jq '.metadata.record_count' > BSWAUTM000336344707_H5KCGDSXX_4_reads.txt
cat BSWAUTM000336344707_H5KCGDSXX_4_reads.txt
echo 'Number of reads when summing all read groups (and taking into account R1 and R2)'
paste BSWAUTM000336344707_H5KCGDSXX_1_reads.txt BSWAUTM000336344707_H5KCGDSXX_3_reads.txt BSWAUTM000336344707_H5KCGDSXX_4_reads.txt | awk '{print ($1 + $2 +$3)*2}'
rm BSWAUTM000336344707_H5KCGDSXX_*_reads.txt #Remove only the temporary count files created above

Number of reads in the original file (including R1 and R2):
236389974
Number of reads in the first read group:
37108710
Number of reads in the second read group:
40487310
Number of reads in the third read group:
40487310
Number of reads when summing all read groups (and taking into account R1 and R2)
236389974
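
As a cross-check that does not depend on the JSON reports, the reads can also be counted directly from the FASTQ files, since every record spans 4 lines. The sketch below is a toy example under stated assumptions: the reads are hypothetical, the files are uncompressed, and the small awk splitter only stands in for gdc-fastq-splitter so that the example is self-contained.

```shell
workdir=$(mktemp -d)

# Hypothetical original file: three reads, two on lane 1 and one on lane 2.
for spec in 1:1 1:2 2:3; do
    printf '@M01:1:FLOWCELL:%s:1:%s:1 1:N:0:AA\nACGT\n+\nFFFF\n' "${spec%%:*}" "${spec##*:}"
done > "$workdir/orig_R1.fq"

# Toy stand-in for the splitter: route each 4-line record by flowcell and lane.
awk -F':' -v d="$workdir" '
    NR%4==1 {out = d "/split_" $3 "_" $4 "_R1.fq"}
    {print > out}
' "$workdir/orig_R1.fq"

# A FASTQ record is 4 lines, so reads = lines / 4 (NR accumulates across files).
count_reads() { awk 'END {print NR/4}' "$@"; }

total_orig=$(count_reads "$workdir/orig_R1.fq")
total_split=$(count_reads "$workdir"/split_*_R1.fq)
echo "original: $total_orig, split files combined: $total_split"
[ "$total_orig" -eq "$total_split" ] && result='match' || result='differ'
echo "$result"
rm -rf "$workdir"
```

Because `count_reads` reads standard input when called without arguments, a compressed file could be checked with `zcat sample_R1.fastq.gz | count_reads`.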
