# <span style="color:green">South Green Training 2022</span>

# Structural Variant Detection Using Short and Long Reads

DAY 1 => MAPPING

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

June 2022

# 1. Creating the folder, downloading data and so on

## __Simulated clones__

Before starting, please download the data created specifically for this practical training. The data are available from the I-Trop server.

Each participant will analyse one clone, and the results will be recorded in this shared file.

To generate Clone data, a 1Mb contig was extracted from chromosome 1 of rice.

20 levels of variation were generated and long reads were simulated for each.

We introduced different kinds of variation (SNPs, indels, indels plus translocations) as well as some contamination.

In [None]:
# make a directory for the data
cd ~/work
mkdir -p SG-SV-2022/DATA
cd SG-SV-2022/DATA

In [None]:
# download your compressed CloneX 
for i in {2..10}
    do
        echo -e "\nDownloading Clone$i data\n"
        CLONE="Clone$i"
        wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/${CLONE}.tar.gz
        #decompress file
        tar zxvf ${CLONE}.tar.gz
        rm ${CLONE}.tar.gz
    done

In [None]:
# check data 
ls -l Clone*
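
Beyond listing the directories, it can help to confirm that each clone contains the files the later cells rely on. The sketch below assumes every archive unpacks into the layout used by the variable declarations in this notebook (`reference.fasta`, `ONT/CloneX.fastq.gz`, `ILL/CloneX_R1.fastq.gz`, `ILL/CloneX_R2.fastq.gz`). `check_clone_files` is a hypothetical helper name, not part of the training material.

```shell
# Sketch: report any expected input file that is missing or empty for one clone.
# check_clone_files is a hypothetical helper; the file layout is an assumption
# taken from the REF / ONT / ILL_R1 / ILL_R2 variables used in this notebook.
check_clone_files() {
    local data_dir="$1" clone="$2" missing=0 f
    for f in \
        "${data_dir}/${clone}/reference.fasta" \
        "${data_dir}/${clone}/ONT/${clone}.fastq.gz" \
        "${data_dir}/${clone}/ILL/${clone}_R1.fastq.gz" \
        "${data_dir}/${clone}/ILL/${clone}_R2.fastq.gz"
    do
        if [ ! -s "$f" ]; then          # -s: file exists and is not empty
            echo "MISSING: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Check every Clone* directory found in the current folder
for d in Clone*/; do
    [ -d "$d" ] || continue             # no Clone* directory: the glob stays literal
    check_clone_files "$PWD" "${d%/}" && echo "${d%/}: all files present"
done
```

Run from `~/work/SG-SV-2022/DATA`, this prints one line per complete clone and one `MISSING:` line per absent file.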

-----------------------
# 2. MAPPING PRACTICE

Read congruency is an important measure in determining assembly accuracy.

Clusters of read pairs or single long reads that align incorrectly are strong indicators of mis-assembly.

Read mapping is usually the first step before SNP or variant calling.

### Make a folder for your results

In [1]:
mkdir -p ~/work/SG-SV-2022/RESULTS/MAPPING-ILL
cd ~/work/SG-SV-2022/RESULTS/MAPPING-ILL

### Declare important variables

In [2]:
i=10
REF="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/reference.fasta"
ONT="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ONT/Clone${i}.fastq.gz"
ILL_R1="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ILL/Clone${i}_R1.fastq.gz"
ILL_R2="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ILL/Clone${i}_R2.fastq.gz"

In [14]:
echo "Clone${i} $REF" 
echo $ILL_R1 $ILL_R2

Clone10 /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
/home/jovyan/work/SG-SV-2022/DATA/Clone10/ILL/Clone10_R1.fastq.gz /home/jovyan/work/SG-SV-2022/DATA/Clone10/ILL/Clone10_R2.fastq.gz


## 2.1 Mapping short reads vs a reference

In this practice, we are going to map short reads against a reference.

To find out how well the reads align back to the reference, we use bwa-mem2 for the mapping and samtools to compute basic alignment statistics.

In this exercise, we will use the reference.fasta assembly as well as the Illumina reads from your favorite clone.

## Reference indexing

Before mapping, we need to index the reference file. Check the bwa-mem2 index command line.

In [None]:
echo -e "\nIndexing reference $REF\n"
bwa-mem2 index $REF
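
Indexing only needs to happen once per reference, so when a cell is re-run it can be skipped. The sketch below treats the presence of the `.0123` file (the binary sequence file that bwa-mem2 reports loading in its log) as a sign that the index already exists. That test is a simplifying assumption, since it does not verify the other index files, and `index_if_missing` is a hypothetical helper name.

```shell
# Sketch: index a reference only when no index seems to be present.
# index_if_missing is a hypothetical helper; checking only the .0123 file
# is an assumption and does not validate the complete index.
index_if_missing() {
    local ref="$1"
    if [ -s "${ref}.0123" ]; then
        echo "Index found for ${ref}, skipping"
    else
        echo "Indexing ${ref}"
        bwa-mem2 index "$ref"
    fi
}

# Usage:
#   index_if_missing "$REF"
```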

## Let's map now!

In [7]:
cd ~/work/SG-SV-2022/RESULTS/MAPPING-ILL
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i


Creation directory for Clone10

Clone10


In [None]:
echo -e "\nMapping Clone$i\n"
bwa-mem2 mem -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" $REF $ILL_R1 $ILL_R2 > Clone$i.sam
samtools view -@4 -bS -f 0x02 Clone$i.sam | samtools sort -@4 -o Clone$i.SORTED.bam -
samtools index Clone$i.SORTED.bam 
#rm Clone$i.sam


Mapping Clone10

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
prefix: /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT: 309.2567 MB
------------------------------------------
No.
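
The `-f 0x02` option in the `samtools view` call above keeps only alignments whose SAM flag has the "properly paired" bit set, so unmapped and discordant pairs do not reach the sorted BAM. The sketch below decodes a few flag bits with plain bash arithmetic to show what that filter tests. `describe_flag` is a hypothetical helper written for illustration, and it covers only four of the flag bits.

```shell
# Sketch: decode a few SAM flag bits with bash arithmetic.
# describe_flag is a hypothetical helper; only four bits are decoded here.
describe_flag() {
    local flag="$1"
    (( flag & 0x1 )) && echo "paired"
    (( flag & 0x2 )) && echo "proper_pair"      # the bit tested by -f 0x02
    (( flag & 0x4 )) && echo "unmapped"
    (( flag & 0x8 )) && echo "mate_unmapped"
    return 0
}

echo "flag 99:"; describe_flag 99   # 99 = 64 + 32 + 2 + 1, kept by -f 0x02
echo "flag 73:"; describe_flag 73   # 73 = 64 + 8 + 1, dropped by -f 0x02
```

Because discordant pairs can be informative for structural variants, it may be worth keeping the unfiltered SAM file until the analysis is finished.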

In [18]:
# check the mapping output, the SAM-to-BAM conversion and the index
ls -l 

total 139636
-rw-r--r-- 1 jovyan users 120117860 Jun 15 10:09 Clone4.sam
-rw-r--r-- 1 jovyan users  22861715 Jun 15 10:09 Clone4.SORTED.bam
-rw-r--r-- 1 jovyan users      3096 Jun 15 10:09 Clone4.SORTED.bam.bai


## Using a loop for mapping, with a single folder per sample

In [17]:
for i in {2..4}
    do
        cd ~/work/SG-SV-2022/RESULTS/MAPPING-ILL
        echo -e "\nCreation directory for Clone$i\n"
        echo Clone$i
        mkdir -p dirClone$i
        cd dirClone$i

        echo -e "\nDeclare Clone$i data";
        REF="/home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta" # index is already done for clone10
        ONT="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ONT/Clone${i}.fastq.gz"
        ILL_R1="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ILL/Clone${i}_R1.fastq.gz"
        ILL_R2="/home/jovyan/work/SG-SV-2022/DATA/Clone${i}/ILL/Clone${i}_R2.fastq.gz"
        
        echo -e "\nMapping Clone$i\n"
        bwa-mem2 mem -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" $REF $ILL_R1 $ILL_R2 > Clone$i.sam
        samtools view -@4 -bS -f 0x02 Clone$i.sam | samtools sort -@4 -o Clone$i.SORTED.bam -
        samtools index Clone$i.SORTED.bam 
    done


Creation directory for Clone2

Clone2

Declare Clone2 data

Mapping Clone2

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
prefix: /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/work/SG-SV-2022/DATA/Clone10/reference.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT:

What percentage of the ONT and Illumina reads align to the reference for your clone?
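
One possible way to answer this for the Illumina data is `samtools flagstat`, which reports the number and percentage of mapped reads. The sketch below assumes the usual flagstat text layout, in which the overall line reads `N + 0 mapped (P% : N/A)`, and extracts `P`. `mapped_pct` is a hypothetical helper name. The unfiltered SAM file is used on purpose: the sorted BAM contains only properly paired reads, so its mapping rate would not be informative.

```shell
# Sketch: extract the overall mapping percentage from `samtools flagstat` text.
# mapped_pct is a hypothetical helper; it assumes the usual flagstat line
#   "<n> + <m> mapped (<pct>% : <pct2>)"
# and ignores the separate "primary mapped" line of recent samtools versions.
mapped_pct() {
    awk -F'[(%]' '/ mapped \(/ && !/primary/ { print $2; exit }'
}

# Usage (inside dirClone$i, on the unfiltered SAM file):
#   samtools flagstat Clone$i.sam | mapped_pct
```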

In [None]:

# mark duplicates, index the result, then call variants in GVCF mode
echo -e "\nCalling Clone$i"
gatk MarkDuplicates -I Clone$i.SORTED.bam -M duplicates.$i.metrics -O Clone$i.SORTED.MD.bam
samtools index Clone$i.SORTED.MD.bam
gatk --java-options "-Xmx4g" HaplotypeCaller --native-pair-hmm-threads 4 -I Clone$i.SORTED.MD.bam -O Clone$i.g.vcf -R ../referenceCorrect.fasta -ERC GVCF


## 2.2 Mapping Long reads vs a Reference
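
This notebook does not name a long-read mapper, so the sketch below is an assumption: it uses minimap2 with its Nanopore preset, a common choice for ONT reads, and pipes the alignments straight into `samtools sort`. `map_ont` and the output file name are hypothetical.

```shell
# Sketch: map ONT reads and write a sorted, indexed BAM.
# map_ont is a hypothetical helper; the use of minimap2 is an assumption,
# since the notebook does not specify a long-read mapper.
map_ont() {
    local ref="$1" reads="$2" out_bam="$3"
    if ! command -v minimap2 >/dev/null 2>&1; then
        echo "minimap2 not found in PATH" >&2
        return 1
    fi
    # -a: SAM output, -x map-ont: preset for Nanopore reads, -t: threads
    minimap2 -a -x map-ont -t 4 "$ref" "$reads" \
        | samtools sort -@4 -o "$out_bam" - \
        && samtools index "$out_bam"
}

# Usage:
#   map_ont "$REF" "$ONT" "Clone$i.ONT.SORTED.bam"
```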

## Calling all samples on one raw VCF

In [None]:
cd /ifb/data/mydatalocal/MappingAndSNP
# Loop to build up the list of --variant options
OPTION=""
for i in {1..20}
do
    OPTION="${OPTION} --variant dirClone${i}/Clone${i}.g.vcf"
done
echo $OPTION
gatk CombineGVCFs -R referenceCorrect.fasta $OPTION -O rawSNP.vcf
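
The string-concatenation loop above works as long as no path contains spaces. A bash array is a slightly more robust way to collect the same repeated `--variant` arguments. This is a sketch of that alternative, not a change to the method, and the array name `VARIANTS` is arbitrary.

```shell
# Sketch: collect the repeated --variant arguments in a bash array,
# one array element per command-line word.
VARIANTS=()
for i in {1..20}
do
    VARIANTS+=(--variant "dirClone${i}/Clone${i}.g.vcf")
done
echo "${#VARIANTS[@]} arguments, first pair: ${VARIANTS[0]} ${VARIANTS[1]}"

# Usage:
#   gatk CombineGVCFs -R referenceCorrect.fasta "${VARIANTS[@]}" -O rawSNP.vcf
```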

# Have a look at it

In [None]:
head -n 1000 rawSNP.vcf | tail

## Compute the Genotypes *:D I know I am a bad person...*

In [None]:
gatk --java-options "-Xmx4g" GenotypeGVCFs -R referenceCorrect.fasta -V rawSNP.vcf -O output.vcf

## Compute the SNP density along the chromosomes

In [None]:
echo -e "Reference\t1000000\n" > /ifb/data/mydatalocal/MappingAndSNP/genome.txt
bedtools genomecov -bga -split -i /ifb/data/mydatalocal/MappingAndSNP/output.vcf -g /ifb/data/mydatalocal/MappingAndSNP/genome.txt > /ifb/data/mydatalocal/MappingAndSNP/density.csv

In [None]:
head /ifb/data/mydatalocal/MappingAndSNP/density.csv
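
As a cross-check on the density file, variant records can also be counted per fixed-size window directly from the VCF with awk. The sketch below is an alternative illustration rather than a replacement for the bedtools command. `snp_density` is a hypothetical helper, the 10 kb window is an arbitrary choice, every non-header record is counted (not only SNPs), and windows without any record are not reported.

```shell
# Sketch: count VCF records per fixed-size window.
# snp_density is a hypothetical helper. It reads an uncompressed VCF on stdin;
# $1 is the window size in bp (default 10000, an arbitrary choice).
# Output columns: chromosome, window start (0-based), window end, record count.
snp_density() {
    local win="${1:-10000}"
    awk -v w="$win" 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                              # skip header lines
        {
            start = int(($2 - 1) / w) * w          # 0-based window start
            key = $1 OFS start
            if (!(key in count)) order[++n] = key  # remember first-seen order
            count[key]++
        }
        END {
            for (k = 1; k <= n; k++) {
                split(order[k], p, OFS)
                print p[1], p[2], p[2] + w, count[order[k]]
            }
        }'
}

# Usage:
#   snp_density 10000 < output.vcf > density_windows.tsv
```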