# <span style="color:green">Formation South Green 2022</span> - Structural Variants Detection by using short and long reads 

# __DAY 2 : SNP calling__

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)


***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[I - Preparing data](#data)

[II - SNP CALLING USING `GATK`](#gatk) 
   * [Mark duplicates with `GATK`](#mk)
   * [Indexing reference with `samtools` and `GATK`](#indexref)
   * [SNP Calling for a Clone](#HC)
   * [The error produced here is expected! RG is important in the mapping step!](#BUG)🐞 
   * [ => Calling all samples in one raw VCF with correct BAM files](#HCOK)
   * [MarkDuplicates `GATK` for all clones](#md2)
   * [HaplotypeCaller `GATK` for all clones](#HC2)
   * [CombineGVCFs `GATK`](#combine)
   * [Compute the Genotypes](#geno)
   * [Count the number of variants with `bcftools stats`](#bcftools)
 

***

# <span style="color:#006E7F">__I - Preparing data__ <a class="anchor" id="data"></span>  

### <span style="color: #4CACBC;"> First, go to the working directory</span>  


In [1]:
# go to work directory
cd /home/jovyan/work/
ls

MAPPING-ILL  myFirstJupyterBook.ipynb  training_SV_teaching
MAPPING-ONT  SV_DATA


# <span style="color:#006E7F">__II - SNP CALLING USING `GATK`__ <a class="anchor" id="gatk"></span>  

###  <span style="color:#006E7F">We are going to use only one clone to check that everything works before running all the samples!</span>  

In [23]:
## declare variables
i=10
REF_DIR="/home/jovyan/work/SV_DATA/REF/"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"
BAM="/home/jovyan/work/MAPPING-ILL/dirClone${i}/Clone$i.SORTED.bam"

In [20]:
# go to SR mapping results
cd /home/jovyan/work/MAPPING-ILL/dirClone${i}

In [24]:
ls /home/jovyan/work/MAPPING-ILL/dirClone${i}

Clone10.bam         Clone10.SORTED.bam.bai     duplicates.10.metrics
Clone10.flagstat    Clone10.SORTED.MD.bam
Clone10.SORTED.bam  Clone10.SORTED.MD.bam.bai


## <span style="color:#006E7F">Mark duplicates <a class="anchor" id="mk"></span>  

In [25]:
echo -e "\nMarkDuplicates in Clone$i";
gatk MarkDuplicates -I $BAM -M duplicates.$i.metrics -O Clone$i.SORTED.MD.bam;
samtools index Clone$i.SORTED.MD.bam;


MarkDuplicates in Clone10
Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar MarkDuplicates -I /home/jovyan/work/MAPPING-ILL/dirClone10/Clone10.SORTED.bam -M duplicates.10.metrics -O Clone10.SORTED.MD.bam
17:26:31.985 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jun 20 17:26:32 UTC 2022] MarkDuplicates --INPUT /home/jovyan/work/MAPPING-ILL/dirClone10/Clone10.SORTED.bam --OUTPUT Clone10.SORTED.MD.bam --METRICS_FILE duplicates.10.metrics --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBE

In [26]:
# check MD bam
ls -l

total 80088
-rw-r--r-- 1 jovyan users 40018974 Jun 20 15:00 Clone10.bam
-rw-r--r-- 1 jovyan users      381 Jun 20 15:00 Clone10.flagstat
-rw-r--r-- 1 jovyan users 18213243 Jun 20 15:00 Clone10.SORTED.bam
-rw-r--r-- 1 jovyan users     2856 Jun 20 14:56 Clone10.SORTED.bam.bai
-rw-r--r-- 1 jovyan users 23756223 Jun 20 17:26 Clone10.SORTED.MD.bam
-rw-r--r-- 1 jovyan users     2856 Jun 20 17:26 Clone10.SORTED.MD.bam.bai
-rw-r--r-- 1 jovyan users     3465 Jun 20 17:26 duplicates.10.metrics


## <span style="color:#006E7F">Indexing reference with `samtools` and `GATK` <a class="anchor" id="indexref"></span>  

In [27]:
cd $REF_DIR
samtools faidx $REF
gatk CreateSequenceDictionary -R $REF

Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar CreateSequenceDictionary -R /home/jovyan/work/SV_DATA/REF/reference.fasta
INFO	2022-06-20 17:26:44	CreateSequenceDictionary	Output dictionary will be written in /home/jovyan/work/SV_DATA/REF/reference.dict
17:26:44.861 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jun 20 17:26:44 UTC 2022] CreateSequenceDictionary --REFERENCE /home/jovyan/work/SV_DATA/REF/reference.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RA
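
Both index files are needed by `HaplotypeCaller` later on, so it can be worth checking that they were written. The following is a minimal sketch, assuming the naming seen in this training (`reference.fasta.fai` from `samtools faidx`, and `reference.dict` as reported in the log above). The function name `check_ref_indexes` is hypothetical and not part of `samtools` or `GATK`.

```bash
# Hypothetical helper: report whether the reference index files exist.
# Assumes <ref>.fai (samtools faidx) and <ref without extension>.dict (GATK).
check_ref_indexes() {
    ref_fasta=$1
    rc=0
    for idx_file in "${ref_fasta}.fai" "${ref_fasta%.*}.dict"; do
        if [ -s "$idx_file" ]; then
            echo "OK       $idx_file"
        else
            echo "MISSING  $idx_file"
            rc=1
        fi
    done
    return "$rc"
}

# Usage in this training:
#   check_ref_indexes "$REF"
```

The function returns a non-zero status when a file is missing or empty, so it can be used as a guard before the calling step.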


## <span style="color:#006E7F">SNP Calling for one Clone <a class="anchor" id="HC"></span>  
    
Before launching SNP calling on all the clones, we are going to test the command on a single sample/clone.

In [None]:
# go to the BAM directory
cd /home/jovyan/work/MAPPING-ILL/dirClone${i}

# switch to the duplicate-marked (MD) BAM file
BAM="/home/jovyan/work/MAPPING-ILL/dirClone${i}/Clone$i.SORTED.MD.bam"

# launch GATK HaplotypeCaller
echo -e "\nCalling Clone$i";
gatk --java-options "-Xmx12g" HaplotypeCaller --native-pair-hmm-threads 4 -I Clone$i.SORTED.MD.bam -O Clone$i.g.vcf -R $REF -ERC GVCF;

In [29]:
head Clone$i.g.vcf

head: cannot open 'Clone10.g.vcf' for reading: No such file or directory



### <span style="color:#006E7F">The error produced here is expected! The read group (RG) must be set at the mapping step! <a class="anchor" id="BUG"></span>  🐞 

### Yes, the mapping has to be rerun for all samples! BUT ...

Don't worry: BAM files mapped with the correct `-R "@RG\tID:Clone${i}\tSM:Clone${i}"` option in the `bwa-mem2` command are available for download.
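
To see whether a BAM file carries a read group before calling, its header can be inspected. The sketch below reads a SAM/BAM header as text and lists the sample names (`SM`) found in `@RG` lines. The function name `list_rg_samples` is hypothetical, and the example assumes the header is piped in from `samtools view -H`.

```bash
# Hypothetical helper: read a SAM/BAM header on stdin and list the sample
# names (SM) declared in @RG lines; fails if no @RG line carries an SM tag.
list_rg_samples() {
    awk -F'\t' '
        $1 == "@RG" {
            for (k = 2; k <= NF; k++)
                if ($k ~ /^SM:/) { print substr($k, 4); found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage with a real file (requires samtools):
#   samtools view -H Clone10.SORTED.bam | list_rg_samples
```

An empty result with a non-zero exit status indicates that no header line declares a sample name, which is the situation that made the calling step above fail.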

In [30]:
cd /home/jovyan/work/
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/sv-training/BAM_ILL.tar.gz
tar zxvf BAM_ILL.tar.gz
BAM_ILL="/home/jovyan/work/BAM_ILL"
rm BAM_ILL.tar.gz
ls $BAM_ILL

--2022-06-20 17:30:20--  https://itrop.ird.fr/sv-training/BAM_ILL.tar.gz
Resolving itrop.ird.fr (itrop.ird.fr)... 91.203.35.184
Connecting to itrop.ird.fr (itrop.ird.fr)|91.203.35.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 396929293 (379M) [application/x-gzip]
Saving to: ‘BAM_ILL.tar.gz’


2022-06-20 17:30:27 (60.7 MB/s) - ‘BAM_ILL.tar.gz’ saved [396929293/396929293]

FINISHED --2022-06-20 17:30:27--
Total wall clock time: 7.6s
Downloaded: 1 files, 379M in 6.2s (60.7 MB/s)
BAM_ILL/
BAM_ILL/Clone10.SORTED.bam
BAM_ILL/Clone11.SORTED.bam
BAM_ILL/Clone12.SORTED.bam
BAM_ILL/Clone13.SORTED.bam
BAM_ILL/Clone14.SORTED.bam
BAM_ILL/Clone15.SORTED.bam
BAM_ILL/Clone16.SORTED.bam
BAM_ILL/Clone17.SORTED.bam
BAM_ILL/Clone18.SORTED.bam
BAM_ILL/Clone19.SORTED.bam
BAM_ILL/Clone1.SORTED.bam
BAM_ILL/Clone20.SORTED.bam
BAM_ILL/Clone

## <span style="color:#006E7F"> => Calling all samples in one raw VCF with correct BAM files<a class="anchor" id="HCOK"></span> 

In [31]:
BAM_ILL="/home/jovyan/work/BAM_ILL"
ls $BAM_ILL

Clone10.SORTED.bam  Clone15.SORTED.bam  Clone1.SORTED.bam   Clone5.SORTED.bam
Clone11.SORTED.bam  Clone16.SORTED.bam  Clone20.SORTED.bam  Clone6.SORTED.bam
Clone12.SORTED.bam  Clone17.SORTED.bam  Clone2.SORTED.bam   Clone7.SORTED.bam
Clone13.SORTED.bam  Clone18.SORTED.bam  Clone3.SORTED.bam   Clone8.SORTED.bam
Clone14.SORTED.bam  Clone19.SORTED.bam  Clone4.SORTED.bam   Clone9.SORTED.bam


In [32]:
# create a working directory for the SNP calling results
mkdir -p /home/jovyan/work/VCF
VCF_DIR="/home/jovyan/work/VCF"

## <span style="color:#006E7F"> MarkDuplicates `GATK` for all clones <a class="anchor" id="md2"></span>  

In [33]:
cd $BAM_ILL
for i in {1..5}
    do
        samtools index Clone$i.SORTED.bam;
        echo -e "\nMarkDuplicates in Clone$i";
        gatk MarkDuplicates -I Clone$i.SORTED.bam -M duplicates.$i.metrics -O Clone$i.SORTED.MD.bam;
        samtools index Clone$i.SORTED.MD.bam;
    done


MarkDuplicates in Clone1
Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar MarkDuplicates -I Clone1.SORTED.bam -M duplicates.1.metrics -O Clone1.SORTED.MD.bam
17:30:42.508 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Jun 20 17:30:42 UTC 2022] MarkDuplicates --INPUT Clone1.SORTED.bam --OUTPUT Clone1.SORTED.MD.bam --METRICS_FILE duplicates.1.metrics --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --
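
The loop above hard-codes the range `{1..5}`. To cover every BAM present in the folder instead, one option is to derive the clone numbers from the file names. This is a minimal sketch, assuming files are named `CloneN.SORTED.bam` as in `$BAM_ILL`; the function name `list_clone_ids` is hypothetical.

```bash
# Hypothetical helper: list the clone numbers for which DIR contains a
# CloneN.SORTED.bam file, in numeric order.
list_clone_ids() {
    bam_dir=$1
    for bam_file in "$bam_dir"/Clone*.SORTED.bam; do
        [ -e "$bam_file" ] || continue          # no match: the pattern stays literal
        clone_name=${bam_file##*/}              # strip the directory
        clone_name=${clone_name#Clone}          # strip the "Clone" prefix
        echo "${clone_name%.SORTED.bam}"        # strip the suffix
    done | sort -n
}

# Usage in this training (the loop body is the one from the cell above):
#   for i in $(list_clone_ids "$BAM_ILL"); do ... ; done
```

Files such as `Clone1.SORTED.MD.bam` do not end in `.SORTED.bam`, so rerunning the loop after MarkDuplicates does not pick them up.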

## <span style="color:#006E7F">HaplotypeCaller `GATK`  for all clones <a class="anchor" id="HC2"></span>  


In [34]:
cd $BAM_ILL
for i in {1..5}
    do
        # launch GATK HaplotypeCaller
        echo -e "\n\n>>>>>>>>>>>> Calling Clone$i";
        gatk --java-options "-Xmx20g" HaplotypeCaller --native-pair-hmm-threads 4 -I Clone$i.SORTED.MD.bam -O $VCF_DIR/Clone$i.g.vcf -R $REF -ERC GVCF
    done


Calling Clone10
Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx12g -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar HaplotypeCaller --native-pair-hmm-threads 4 -I Clone10.SORTED.MD.bam -O /home/jovyan/work/VCF/Clone10.g.vcf -R /home/jovyan/work/SV_DATA/REF/reference.fasta -ERC GVCF
17:35:08.547 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:35:08.735 INFO  HaplotypeCaller - ------------------------------------------------------------
17:35:08.736 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.6.1
17:35:08.736 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/


## <span style="color:#006E7F">CombineGVCFs `GATK`<a class="anchor" id="combine"></span>  

In [36]:
BAM_ILL="/home/jovyan/work/BAM_ILL"
VCF_DIR="/home/jovyan/work/VCF"
REF="/home/jovyan/work/SV_DATA/REF/reference.fasta"

# change of work directory
cd $VCF_DIR
# Loop to build the list of --variant options
OPTION=""
for i in {1..5}
do
    OPTION="${OPTION} --variant Clone${i}.g.vcf"
done
echo $OPTION
# GATK
gatk CombineGVCFs -R $REF $OPTION -O rawSNP.vcf

--variant Clone10.g.vcf --variant Clone11.g.vcf --variant Clone12.g.vcf --variant Clone13.g.vcf --variant Clone14.g.vcf --variant Clone15.g.vcf --variant Clone16.g.vcf --variant Clone17.g.vcf --variant Clone18.g.vcf --variant Clone19.g.vcf --variant Clone20.g.vcf
Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar CombineGVCFs -R /home/jovyan/work/SV_DATA/REF/reference.fasta --variant Clone10.g.vcf --variant Clone11.g.vcf --variant Clone12.g.vcf --variant Clone13.g.vcf --variant Clone14.g.vcf --variant Clone15.g.vcf --variant Clone16.g.vcf --variant Clone17.g.vcf --variant Clone18.g.vcf --variant Clone19.g.vcf --variant Clone20.g.vcf -O rawSNP.vcf
20:54:09.101 INFO  NativeLibraryLoader - Loading libgkl_compression.so from
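
`CombineGVCFs` expects every file passed with `--variant` to exist, so it can help to check the per-clone gVCFs first. The sketch below assumes the `CloneN.g.vcf` naming used in `$VCF_DIR`; the function name `missing_gvcfs` is hypothetical.

```bash
# Hypothetical helper: print the per-clone gVCFs that are missing or empty
# in directory DIR for clones FIRST..LAST; returns non-zero if any is missing.
missing_gvcfs() {
    gvcf_dir=$1
    n=$2
    last_n=$3
    rc=0
    while [ "$n" -le "$last_n" ]; do
        gvcf_file="$gvcf_dir/Clone${n}.g.vcf"
        if [ ! -s "$gvcf_file" ]; then
            echo "$gvcf_file"
            rc=1
        fi
        n=$((n + 1))
    done
    return "$rc"
}

# Usage in this training:
#   missing_gvcfs "$VCF_DIR" 1 5
```

If the function prints nothing and returns 0, all gVCFs in the requested range are present and non-empty.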

### <span style="color:#006E7F">Have a look at the CombineGVCFs output<a class="anchor" id="combine"></span> 

In [37]:
head -n 1000 rawSNP.vcf | tail

Reference	962	.	A	<NON_REF>	.	.	.	GT:DP:GQ:MIN_DP:PL	./.:46:99:44:0,117,1755	./.:29:81:29:0,81,1215	./.:37:99:37:0,99,1485	./.:37:99:35:0,99,1485	./.:40:99:38:0,105,1575	./.:37:99:37:0,99,1485	./.:23:60:22:0,60,847	./.:23:69:23:0,69,934	./.:41:99:41:0,111,1665	./.:20:0:16:0,0,499	./.:47:99:47:0,120,1800
Reference	963	.	T	C,<NON_REF>	.	.	DP=368;ExcessHet=0.00;RAW_MQandDP=172800,48	GT:AD:DP:GQ:MIN_DP:PGT:PID:PL:PS:SB	./.:.:46:99:44:.:.:0,117,1755,117,1755,1755	./.:.:28:75:27:.:.:0,75,1125,75,1125,1125	./.:.:37:99:37:.:.:0,99,1485,99,1485,1485	./.:.:37:99:35:.:.:0,99,1485,99,1485,1485	./.:.:40:99:38:.:.:0,105,1575,105,1575,1575	./.:.:37:96:37:.:.:0,96,1440,96,1440,1440	./.:.:23:60:22:.:.:0,60,847,60,847,847	./.:.:23:69:23:.:.:0,69,934,69,934,934	./.:.:41:99:41:.:.:0,111,1665,111,1665,1665	./.:.:20:0:16:.:.:0,0,499,0,499,499	.|.:0,48,0:48:99:.:0|1:888_G_A:2160,144,0,2160,144,2160:888:0,0,22,26
Reference	964	.	A	<NON_REF>	.	.	.	GT:DP:GQ:MIN_DP:PL	./.:46:99:44:0,117,1755	./.:28:75:27:0,75,11
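
In the excerpt above, position 962 has only `<NON_REF>` in the ALT column (a reference-confidence record), while position 963 lists `C,<NON_REF>` (a real alternative allele was observed in at least one clone). A small sketch to count both kinds of records is shown here; the function name `count_gvcf_records` is hypothetical, and the script only looks at column 5 of each record.

```bash
# Hypothetical helper: read a gVCF on stdin and count reference-only records
# (ALT is exactly <NON_REF>) versus records carrying a real ALT allele.
count_gvcf_records() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        $5 == "<NON_REF>" { ref++; next }      # reference-confidence record
        { var++ }                              # at least one real ALT allele
        END { printf "reference_only=%d variant_sites=%d\n", ref, var }
    '
}

# Usage in this training:
#   count_gvcf_records < rawSNP.vcf
```

This count is only a quick look at the combined gVCF: genotypes are not assigned until the `GenotypeGVCFs` step.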

## <span style="color:#006E7F">Compute the Genotypes<a class="anchor" id="geno"></span> 


In [38]:
gatk --java-options "-Xmx16g" GenotypeGVCFs -R $REF -V rawSNP.vcf -O output.vcf

Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar GenotypeGVCFs -R /home/jovyan/work/SV_DATA/REF/reference.fasta -V rawSNP.vcf -O output.vcf
20:56:00.864 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
20:56:01.021 INFO  GenotypeGVCFs - ------------------------------------------------------------
20:56:01.021 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.6.1
20:56:01.021 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
20:56:01.029 INFO  GenotypeGVCFs - Initializing engine
20:56:01.362 INFO  FeatureManager - Using cod

### <span style="color:#006E7F">Have a look at the VCF created by GenotypeGVCFs<a class="anchor" id="combine"></span> 

In [40]:
tail output.vcf

Reference	999958	.	C	G,T	1078.36	.	AC=4,2;AF=0.200,0.100;AN=20;DP=57;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.7962;MLEAC=5,2;MLEAF=0.250,0.100;MQ=60.00;QD=32.64;SOR=6.518	GT:AD:DP:GQ:PGT:PID:PL:PS	./.:0,0,0:0:0:.:.:0,0,0,0,0,0	2|2:0,0,5:5:15:1|1:999945_T_C:225,225,225,15,15,0:999945	0/0:5,0,0:5:12:.:.:0,12,180,12,180,180	1|1:0,7,0:7:21:1|1:999927_T_A:315,21,0,315,21,315:999927	0/0:7,0,0:7:15:.:.:0,15,225,15,225,225	1|1:0,13,0:13:39:1|1:999840_A_T:585,39,0,585,39,585:999840	0/0:3,0,0:3:9:.:.:0,9,130,9,130,130	0/0:4,0,0:4:6:.:.:0,6,90,6,90,90	0/0:4,0,0:4:12:.:.:0,12,165,12,165,165	0/0:3,0,0:3:9:.:.:0,9,110,9,110,110	0/0:6,0,0:6:9:.:.:0,9,135,9,135,135
Reference	999961	.	C	T	249.95	.	AC=2;AF=0.100;AN=20;DP=45;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.5266;MLEAC=2;MLEAF=0.100;MQ=60.00;QD=32.72;SOR=3.912	GT:AD:DP:GQ:PGT:PID:PL:PS	./.:0,0:0:0:.:.:0,0,0	0/0:5,0:5:9:.:.:0,9,135	0/0:4,0:4:9:.:.:0,9,135	0/0:4,0:4:12:.:.:0,12,175	0/0:4,0:4:12:.:.:0,12,169	0/0:10,0:10:21:.:.:0,21,315	0/0:2,0: