# Formation South Green 2022

# Structural Variants Detection by using short and long reads

DAY 1 => MAPPING

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

June 2022

# 1. Creating the folder, downloading data and so on

## __Simulated clones__

Before starting, please download the data created for this practical training. The data are available from the I-Trop server.

Each participant will analyse one clone; results will be recorded in this shared file.

To generate Clone data, a 1Mb contig was extracted from chromosome 1 of rice.

20 levels of variation were generated and long reads were simulated for each.

We have introduced different variations (SNP, indel, indel+translocations) and also some contaminations.

In [None]:
CLONE=Clone10

In [None]:
cd ~
mkdir -p SG-SV-2022/DATA
cd SG-SV-2022/DATA
# download your compressed CloneX 
wget --no-check-certificate -rm -nH --cut-dirs=1 --reject="index.html*" https://itrop.ird.fr/ont-training/${CLONE}.tar.gz

In [None]:
#decompress file
tar zxvf ${CLONE}.tar.gz

In [None]:
# check data 
ls -l ${CLONE}
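
Beyond listing the folder, it can help to confirm that a read file decompresses cleanly and holds a plausible number of reads. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_fastq_reads` is a hypothetical helper name, and the example path assumes the folder layout used in the variable declarations of this practical.

```bash
# Hypothetical helper: count reads in a gzip-compressed FASTQ file,
# assuming four lines per record (no wrapped sequence lines).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example call, only run if the file is present (the path is an assumption).
FQ="${HOME}/SG-SV-2022/DATA/${CLONE:-Clone10}/ONT/${CLONE:-Clone10}.fastq.gz"
if [ -f "$FQ" ]; then
    count_fastq_reads "$FQ"
fi
```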

-----------------------
# 2. MAPPING PRACTICE

Read congruency is an important measure in determining assembly accuracy.

Clusters of read pairs or single long reads that align incorrectly are strong indicators of mis-assembly.

Read mapping is usually the first step before SNP or variant calling.

### 2.1 Make a folder for your results

In [None]:
mkdir -p ~/SG-SV-2022/RESULTS/MAPPING
cd ~/SG-SV-2022/RESULTS/MAPPING

### 2.2 Declare important variables

In [None]:
i=10
CLONE="Clone${i}"
REF="/home/jovyan/SG-SV-2022/DATA/${CLONE}/reference.fasta"
ONT="/home/jovyan/SG-SV-2022/DATA/${CLONE}/ONT/${CLONE}.fastq.gz"
ILL_R1="/home/jovyan/SG-SV-2022/DATA/${CLONE}/ILL/${CLONE}_R1.fastq.gz"
ILL_R2="/home/jovyan/SG-SV-2022/DATA/${CLONE}/ILL/${CLONE}_R2.fastq.gz"

In [None]:
echo $CLONE $REF
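
A mistyped path in one of these variables only shows up later as a mapping error, so it may be worth checking them right away. This is a minimal sketch; `check_inputs` is a hypothetical helper, not part of any tool used here.

```bash
# Hypothetical helper: report every argument that is not a readable file.
# Returns 0 when all files are readable, 1 otherwise.
check_inputs() {
    local status=0 f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "Missing input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage with the variables declared above; a message is printed
# instead of aborting the notebook cell.
check_inputs "${REF:-}" "${ONT:-}" "${ILL_R1:-}" "${ILL_R2:-}" \
    || echo "Fix the paths before mapping"
```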

## 2.1 Mapping short reads vs a reference

In this practice, we are going to map short reads against a reference.

To find out how well the reads align back to the reference, we use bwa-mem2 for the mapping and samtools to assess the basic alignment statistics.

In this exercise, we will use the reference.fasta assembly as well as the Illumina reads from your favorite clone.

## Reference indexing

Before mapping, we need to index the reference file. Check the bwa-mem2 index command line.

In [55]:
echo -e "\nIndexing reference $REF\n"
bwa-mem2 index $REF


Indexing reference file10

/home/jovyan/SG-SV-2022/DATA/Clone10/reference.fasta
[bwa_index] Pack FASTA... 0.01 sec
init ticks = 164039913
ref seq len = 2040002
binary seq ticks = 59986303
build index ticks = 598560799
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 5, CP_MASK = 31
sizeof CP_OCC = 64
max_occ_ind = 63750
ref_seq_len = 2040002
count = 0, 576483, 1020001, 1463519, 2040002
BWT[1932441] = 4
CP_SHIFT = 6, CP_MASK = 63
sizeof CP_OCC = 64
max_occ_ind = 31875
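
Once indexing has finished, the index files should sit next to the FASTA. The sketch below checks for them; `index_is_complete` is a hypothetical helper, and the suffix list is an assumption based on bwa-mem2 2.x (the mapping log in this practical mentions the `.0123` file), so adjust it if your version writes different files.

```bash
# Hypothetical helper: check that the index files bwa-mem2 is expected to
# write next to the FASTA are present and non-empty.
# The suffix list is an assumption; adapt it to your bwa-mem2 version.
index_is_complete() {
    local ref="$1" ext
    for ext in 0123 amb ann bwt.2bit.64 pac; do
        [ -s "${ref}.${ext}" ] || { echo "Missing ${ref}.${ext}" >&2; return 1; }
    done
}

if index_is_complete "${REF:-reference.fasta}"; then
    echo "Index complete"
else
    echo "Index incomplete: rerun bwa-mem2 index"
fi
```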


## Let's map now!

In [57]:
echo -e "\nCreation directory for Clone$i\n"
echo Clone$i
mkdir -p dirClone$i
cd dirClone$i


Creation directory for Clone10

Clone10


In [58]:
echo -e "\nMapping Clone$i\n"
bwa-mem2 mem -t 4 $REF $ILL_R1 $ILL_R2 > Clone$i.sam
samtools view -@4 -b -f 0x02 Clone$i.sam | samtools sort -@4 -o Clone$i.SORTED.bam
samtools index Clone$i.SORTED.bam


Mapping Clone10

-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: /home/jovyan/SG-SV-2022/DATA/Clone10/reference.fasta
Entering FMI_search
reference seq len = 2040003
count
0,	1
1,	576484
2,	1020002
3,	1463520
4,	2040003

Reading other elements of the index from files /home/jovyan/SG-SV-2022/DATA/Clone10/reference.fasta
prefix: /home/jovyan/SG-SV-2022/DATA/Clone10/reference.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /home/jovyan/SG-SV-2022/DATA/Clone10/reference.fasta.0123
Reference genome size: 2040002 bp
Done readng reference genome !!

[0000] 1: Calling process()

Threads used (compute): 4
Info: projected #read in a task: 264910
------------------------------------------
Memory pre-allocation for chaining: 557.3706 MB
Memory pre-allocation for BSW: 958.4681 MB
Memory pre-allocation for BWT: 309.2567 MB
------------------------------------------
No. of pipeline threads
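
The `-f 0x02` option of `samtools view` used above keeps only alignments whose SAM FLAG has bit 0x2 set, which the SAM specification defines as "each segment properly aligned according to the aligner". A small sketch of that bit test in plain bash arithmetic follows; `is_proper_pair` is a hypothetical helper, and the FLAG values are common examples rather than values taken from this dataset.

```bash
# Return 0 if the SAM FLAG value has the "proper pair" bit (0x2) set.
is_proper_pair() {
    (( ($1 & 0x2) != 0 ))
}

# 99 and 147 are typical FLAGs of a properly paired read and its mate;
# 65 is paired but not properly paired; 4 is an unmapped read.
for flag in 99 147 65 4; do
    if is_proper_pair "$flag"; then
        echo "FLAG $flag: kept by -f 0x02"
    else
        echo "FLAG $flag: removed by -f 0x02"
    fi
done
```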

## Using a loop for mapping, with a single folder per sample

In [None]:
for i in {1..20}
    do
        cd ~/SG-SV-2022/RESULTS/MAPPING
        echo "mkdir Clone$i for mapping";
        mkdir -p dirClone$i; 
        cd dirClone$i; 
        
        echo "Declare Clone$i data";
        CLONE="Clone$i"
        # adjust this path if your reference sits in the clone folder, as in section 2.2
        REF="/home/jovyan/SG-SV-2022/DATA/reference.fasta"
        ILL_R1="/home/jovyan/SG-SV-2022/DATA/${CLONE}/ILL/${CLONE}_R1.fastq.gz"
        ILL_R2="/home/jovyan/SG-SV-2022/DATA/${CLONE}/ILL/${CLONE}_R2.fastq.gz"
        
        echo -e "\nMapping Clone$i\n"
        # map R1 and R2 as a pair; -R adds a read group to each alignment
        bwa-mem2 mem -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" $REF $ILL_R1 $ILL_R2 > Clone$i.sam
        # keep properly paired reads (-f 0x02), then sort and index
        samtools view -@4 -b -f 0x02 Clone$i.sam | samtools sort -@4 -o Clone$i.SORTED.bam
        samtools index Clone$i.SORTED.bam 
    done

In [None]:
for i in {1..20}
    do
        cd /ifb/data/mydatalocal/MappingAndSNP
        echo Clone$i;
        mkdir dirClone$i; 
        cd dirClone$i; 
        echo "Cloning Clone$i data";
        ln -s /ifb/data/mydatalocal/shortReads/Clone${i}_pairedFq1.fq /ifb/data/mydatalocal/MappingAndSNP/dirClone$i/Clone${i}_1.fastq;
        ln -s /ifb/data/mydatalocal/shortReads/Clone${i}_pairedFq2.fq /ifb/data/mydatalocal/MappingAndSNP/dirClone$i/Clone${i}_2.fastq;
        echo -e "\nMapping Clone$i\n";
        bwa mem -t 4 -R "@RG\tID:Clone${i}\tSM:Clone${i}" ../referenceCorrect.fasta Clone${i}_1.fastq Clone${i}_2.fastq > Clone$i.sam;
        samtools view -@4 -b -f 0x02 Clone$i.sam | samtools sort -@4 -o Clone$i.SORTED.bam;
        samtools index Clone$i.SORTED.bam;
        echo -e "\nCalling Clone$i";
        gatk MarkDuplicates -I Clone$i.SORTED.bam -M duplicates.$i.metrics -O Clone$i.SORTED.MD.bam;
        samtools index Clone$i.SORTED.MD.bam;
        gatk --java-options "-Xmx4g" HaplotypeCaller --native-pair-hmm-threads 4 -I Clone$i.SORTED.MD.bam -O Clone$i.g.vcf -R ../referenceCorrect.fasta -ERC GVCF;
    done

In [None]:
What is the percentage of aligned ONT and Illumina reads in your clone assembly?
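
One way to answer this is `samtools flagstat`, which reports the number and percentage of mapped reads. The sketch below is a suggestion rather than the official solution; `mapped_percent` is a hypothetical helper and the SAM path is an assumption based on the folders created earlier. Note that the `SORTED.bam` files were filtered with `-f 0x02`, so the statistics are more informative on the unfiltered SAM file.

```bash
# Hypothetical helper: read `samtools flagstat` text on stdin and print the
# percentage from the "mapped" line (the line whose 4th field is "mapped").
mapped_percent() {
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5; exit }'
}

# Run on the unfiltered SAM: the SORTED.bam only keeps properly paired
# reads, so its mapped percentage would not reflect the whole read set.
SAM="${HOME}/SG-SV-2022/RESULTS/MAPPING/dirClone10/Clone10.sam"   # assumed path
if command -v samtools >/dev/null 2>&1 && [ -f "$SAM" ]; then
    samtools flagstat "$SAM" | mapped_percent
fi
```

The same helper can be applied to a long-read alignment file once you have produced one.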

## 2.2 Mapping Long reads vs a Reference

## Calling all samples on one raw VCF

In [None]:
cd /ifb/data/mydatalocal/MappingAndSNP
#Loop to inflate the --variant option
OPTION=""
for i in {1..20}
do
    OPTION="${OPTION} --variant dirClone${i}/Clone${i}.g.vcf"
done
echo $OPTION
gatk CombineGVCFs -R referenceCorrect.fasta $OPTION -O rawSNP.vcf
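
The string concatenation above works because the paths contain no spaces. As an alternative sketch, the same argument list can be built in a bash array, which keeps each path as a single word; `build_variant_args` and `VARIANT_ARGS` are hypothetical names introduced for this illustration.

```bash
# Build the --variant arguments in a bash array (one element per word),
# so that paths containing spaces would still be passed correctly.
build_variant_args() {
    local n="$1" i
    VARIANT_ARGS=()
    for (( i = 1; i <= n; i++ )); do
        VARIANT_ARGS+=( --variant "dirClone${i}/Clone${i}.g.vcf" )
    done
}

build_variant_args 20
echo "${#VARIANT_ARGS[@]} arguments, first: ${VARIANT_ARGS[0]} ${VARIANT_ARGS[1]}"

# Only call GATK when it is installed and the reference is present.
if command -v gatk >/dev/null 2>&1 && [ -f referenceCorrect.fasta ]; then
    gatk CombineGVCFs -R referenceCorrect.fasta "${VARIANT_ARGS[@]}" -O rawSNP.vcf
fi
```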

## Have a look at it

In [None]:
head -n 1000 rawSNP.vcf | tail

## Compute the Genotypes *:D I know I am a bad person...*

In [None]:
gatk --java-options "-Xmx4g" GenotypeGVCFs -R referenceCorrect.fasta -V rawSNP.vcf -O output.vcf

## Compute the SNP density along the chromosomes

In [None]:
echo -e "Reference\t1000000" > /ifb/data/mydatalocal/MappingAndSNP/genome.txt
bedtools genomecov -bga -split -i /ifb/data/mydatalocal/MappingAndSNP/output.vcf -g /ifb/data/mydatalocal/MappingAndSNP/genome.txt > /ifb/data/mydatalocal/MappingAndSNP/density.csv

In [None]:
head /ifb/data/mydatalocal/MappingAndSNP/density.csv
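
As a complement to the bedtools output, variant records can also be counted per fixed-size window with awk alone. This is a minimal sketch under stated assumptions: `vcf_window_counts` is a hypothetical helper, the 10 kb window size is arbitrary, and every non-header VCF line is counted (SNPs and indels alike).

```bash
# Hypothetical helper: count VCF records per fixed-size window.
# Usage: vcf_window_counts <vcf> [window_size]
# Output columns: chromosome, window start (0-based), window end, count.
vcf_window_counts() {
    local vcf="$1" win="${2:-10000}"
    awk -v w="$win" 'BEGIN { FS = OFS = "\t" }
        !/^#/ {
            s = int(($2 - 1) / w) * w        # window start for this position
            n[$1 OFS s OFS (s + w)]++
        }
        END { for (k in n) print k, n[k] }' "$vcf" | sort -k1,1 -k2,2n
}

VCF="/ifb/data/mydatalocal/MappingAndSNP/output.vcf"
if [ -f "$VCF" ]; then
    vcf_window_counts "$VCF" 10000 | head
fi
```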