
# <span style="color:#006E7F">__Introduction to Oxford Nanopore Data Analysis__ <a class="anchor"></a></span>  

Created by C. Tranchant-Dubreuil (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD) - May 2022  SouthGreen training 

Adapted by J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) - November 2022

Readapted using TransmittingScience teaching by C. Tranchant-Dubreuil (DIADE-IRD) and F. Sabot (DIADE-IRD) - October 2023
    
# <span style="color:#006E7F"> TP5. STRUCTURAL VARIANTS DETECTION USING LONG READS </span>

## __How to identify large variants using whole-genome assemblies?__ 



### <span style="color:#006E7F">__0 - Preparing working environment__ <a class="anchor" id="work"></a></span>  
### <span style="color: #4CACBC;"> Create the working directory</span>  

In [None]:
mkdir -p /home/jovyan/work/RESULTS/SV
cd /home/jovyan/work/RESULTS/SV
ls -l

### <span style="color: #4CACBC;"> Use assemblies from AGGREGATED directory</span>  

In [None]:
ls ~/work/RESULTS/AGGREGATED/

#### Count the number of sequences in each FASTA file of the `AGGREGATED` directory

In [None]:
grep ">"  ~/work/RESULTS/AGGREGATED/*FLY*MEDAKA*fasta -c
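When several files match the pattern, `grep -c` prints one `file:count` line per file. The sketch below reproduces this on two toy FASTA files; the file names and sequences are made up for illustration and are not the training assemblies. Anchoring the pattern with `^` restricts the count to header lines.

```shell
# Toy data in a temporary directory (hypothetical file names)
demo_dir=$(mktemp -d)
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$demo_dir/sampleA_FLYE_MEDAKA.fasta"
printf '>contig_1\nACGTACGT\n' > "$demo_dir/sampleB_FLYE_MEDAKA.fasta"

# One "file:count" line per FASTA file
counts=$(grep -c "^>" "$demo_dir"/*FLY*MEDAKA*fasta)
echo "$counts"
```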

## <span style="color:#006E7F">I. SV calling using nucmer, syri <a class="anchor" id="SVsyri"></a></span>  

SyRI can __ONLY__ be used if the two genomes have the same sequence names and the same number of contigs. For this example, we will use two assemblies with similar characteristics.
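One way to check this requirement before running the alignment is to compare the sorted sequence names of the two FASTA files. The sketch below does so on toy files; the file names, the sequences and the helper function `seq_names` are hypothetical and only serve the illustration.

```shell
# Toy assemblies in a temporary directory (hypothetical content)
demo_dir=$(mktemp -d)
printf '>Chr1 len=8\nACGTACGT\n>Chr2\nGGCC\n' > "$demo_dir/ref.fasta"
printf '>Chr1\nACGTACGA\n>Chr2\nGGCA\n' > "$demo_dir/qry.fasta"

# Sequence name = first word of each header line, without the leading ">"
seq_names() { grep "^>" "$1" | sed 's/^>//; s/[[:space:]].*//' | sort; }

seq_names "$demo_dir/ref.fasta" > "$demo_dir/ref.names"
seq_names "$demo_dir/qry.fasta" > "$demo_dir/qry.names"

# Identical name lists also imply the same number of sequences
if diff -q "$demo_dir/ref.names" "$demo_dir/qry.names" > /dev/null; then
    names_status="identical"
else
    names_status="different"
fi
echo "Sequence names are $names_status"
```

If the lists differ, `diff "$demo_dir/ref.names" "$demo_dir/qry.names"` shows which names are missing on either side.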

We will use :
* nucmer to align these two genomes. [nucmer manual](http://mummer.sourceforge.net/manual/)
* syri to detect SVs from nucmer alignment. [syri manual](https://schneebergerlab.github.io/syri/)
 

### <span style="color: #4CACBC;"> Initialization of the two genome variables</span>  


#### Download assemblies

In [None]:
cd /home/jovyan/work/RESULTS/
wget https://itrop.ird.fr/sv-training/Assemblies.tar.gz
tar xvf Assemblies.tar.gz && rm Assemblies.tar.gz

In [None]:
more /home/jovyan/work/RESULTS/Assemblies/assembly-stats.txt

#### Calculate the size of the genomes in the directory `Assemblies`  - seqtk comp 

`seqtk comp` provides information on each sequence in the FASTA file:
* column 1 contains the sequence name 
* column 2 corresponds to the sequence length in bp

In [None]:
seqtk comp ~/work/RESULTS/Assemblies/A8_assembly.fasta

In [None]:
seqtk comp ~/work/RESULTS/Assemblies/5417_assembly.fasta

##### Compute the total size of each genome (A8 and 5417) using `seqtk comp | awk`

In [None]:
seqtk comp ~/work/RESULTS/Assemblies/A8_assembly.fasta | awk '{bp += $2} END {print "Total bp:", bp}'

In [None]:
seqtk comp ~/work/RESULTS/Assemblies/5417_assembly.fasta | awk '{bp += $2} END {print "Total bp:", bp}'
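As a cross-check that does not depend on `seqtk`, the same two values (length per sequence and total size) can be obtained with plain `awk`. This is a minimal sketch on a toy FASTA file whose name and content are made up; it assumes Unix line endings and no blank characters inside the sequence lines.

```shell
# Toy FASTA in a temporary directory (hypothetical content)
demo_dir=$(mktemp -d)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$demo_dir/toy.fasta"

# Length of each sequence: name <TAB> length, like the first two columns of seqtk comp
per_seq=$(awk '/^>/ {if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next}
               {len += length($0)}
               END {if (name != "") print name "\t" len}' "$demo_dir/toy.fasta")
echo "$per_seq"

# Total size: sum of the lengths of all non-header lines
total_bp=$(awk '!/^>/ {bp += length($0)} END {print bp + 0}' "$demo_dir/toy.fasta")
echo "Total bp: $total_bp"
```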

In [None]:
reference_assembly="/home/jovyan/work/RESULTS/Assemblies/A8_assembly.fasta"
query_assembly="/home/jovyan/work/RESULTS/Assemblies/5417_assembly.fasta"

### <span style="color: #4CACBC;"> Create the working directory for SYRI analysis</span>  

In [None]:
mkdir -p ~/work/RESULTS/SV/SV_CALLING_SYRI/
cd ~/work/RESULTS/SV/SV_CALLING_SYRI/

In [None]:
pwd

### <span style="color: #4CACBC;"> Aligning genomes using `Nucmer` <a class="anchor" id="nucmer"></a></span>  

[Nucmer manual](http://mummer.sourceforge.net/manual/)

Some useful parameters:
<code>
--maxmatch      Compute all maximal matches regardless of their uniqueness
-b|breaklen     Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (default 200)
-c|mincluster   Sets the minimum length of a cluster of matches (default 65)
-l|minmatch     Set the minimum length of a single match (default 20)
</code>


In [None]:
nucmer --maxmatch $reference_assembly $query_assembly

In [None]:
ls -lrt

In [None]:
pwd

#### Check if new files have been generated by nucmer and display the first lines of the file `.delta`

In [None]:
head out.delta

#### Filtering nucmer results

We are going to remove short and low-identity alignments. Some useful parameters:
<code>
-i float	Set the minimum alignment identity [0, 100], (default 0)
-l int		Set the minimum alignment length (default 0)
-m            Many-to-many alignment allowing for rearrangements (union of -r and -q alignments)
-q            Maps each position of each query to its best hit in the reference, allowing for reference overlaps
-r            Maps each position of each reference to its best hit in the query, allowing for query overlaps
</code>

Keep only alignments with at least 90% identity and a length of at least 100 bp:


In [None]:
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta

#### Check that the new `delta` file has been filtered

In [None]:
ls -lrt

In [None]:
wc -l *.delta
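Line counts include the indel-offset lines of each alignment, so they only give a rough idea of what the filter removed. A more direct measure is the number of alignment records. The sketch below assumes the usual delta layout (two header lines, `>` lines for each sequence pair, then one 7-field header line per alignment followed by single-number offset lines) and runs on toy files written for the illustration; the helper `count_alignments` is hypothetical.

```shell
demo_dir=$(mktemp -d)

# Toy unfiltered delta file: three alignments, the last one only 40 bp long
cat > "$demo_dir/out.delta" <<'EOF'
/path/to/ref.fasta /path/to/qry.fasta
NUCMER
>Chr1 Chr1 1000 1000
1 500 1 500 5 5 0
0
600 900 600 899 3 3 0
150
0
>Chr2 Chr2 800 800
1 40 1 40 9 9 0
0
EOF

# Toy filtered delta file: the short alignment has been dropped
cat > "$demo_dir/out.filtered.delta" <<'EOF'
/path/to/ref.fasta /path/to/qry.fasta
NUCMER
>Chr1 Chr1 1000 1000
1 500 1 500 5 5 0
0
600 900 600 899 3 3 0
150
0
EOF

# An alignment header is a line with exactly 7 fields after the two file header lines
count_alignments() { awk 'NR > 2 && !/^>/ && NF == 7 {n++} END {print n + 0}' "$1"; }

n_raw=$(count_alignments "$demo_dir/out.delta")
n_filtered=$(count_alignments "$demo_dir/out.filtered.delta")
echo "Alignments before filtering: $n_raw"
echo "Alignments after filtering:  $n_filtered"
```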

#### Converting the file `delta` into a tabular file using `show-coords`

[show-coords manual](http://mummer.sourceforge.net/manual/#coords)

Some useful parameters:
<code>
-c	Include percent coverage columns in the output
-d	Include the alignment direction/reading frame in the output (default for promer)
-H	Omit the output header
-I float	Set minimum percent identity to display
-l	Include sequence length columns in the output
-L int	Set minimum alignment length to display
-q	Sort output lines by query
-r	Sort output lines by reference
-T	Switch output to tab-delimited format
</code>



In [None]:
show-coords -Trd out.filtered.delta > out.filtered.withHeader.coords

In [None]:
show-coords -THrd out.filtered.delta > out.filtered.coords

In [None]:
ls -lrt

In [None]:
head *.coords
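The header-less tabular file is easy to summarise with `awk`. The sketch below assumes the 11-column layout produced by `show-coords -THrd` for nucmer alignments (reference start and end, query start and end, the two alignment lengths, percent identity, the two direction columns, then reference and query names) and counts, for each reference sequence, the number of alignments and the aligned reference length. The toy rows are made up for the illustration.

```shell
demo_dir=$(mktemp -d)

# Toy coords file (tab-separated, no header)
printf '1\t500\t1\t500\t500\t500\t99.00\t1\t1\tChr1\tChr1\n'      >  "$demo_dir/toy.coords"
printf '600\t900\t600\t899\t301\t300\t97.00\t1\t1\tChr1\tChr1\n'  >> "$demo_dir/toy.coords"
printf '1\t800\t800\t1\t800\t800\t100.00\t1\t-1\tChr2\tChr2\n'    >> "$demo_dir/toy.coords"

# Column 10 = reference name, column 5 = alignment length on the reference
summary=$(awk 'BEGIN {FS = OFS = "\t"}
               {n[$10]++; bp[$10] += $5}
               END {for (s in n) print s, n[s], bp[s]}' "$demo_dir/toy.coords" | sort)
echo "$summary"
```

Overlapping alignments are counted twice in the length column, so the value is an upper bound on the reference sequence actually covered.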

### <span style="color: #4CACBC;"> SV calling using `SYRI` <a class="anchor" id="siri"></a></span>  


In [None]:
syri -c out.filtered.coords -d out.filtered.delta -r $reference_assembly -q $query_assembly 

In [None]:
ls -lt

In [None]:
cat syri.summary

In [None]:
head syri.vcf

In [None]:
tail syri.out
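To get an overview that can be compared with `syri.summary`, the annotation types of `syri.out` can be tallied. This sketch assumes the 12-column tab-separated layout of `syri.out`, with the annotation type in column 11, and runs on a toy file whose rows are made up for the illustration.

```shell
demo_dir=$(mktemp -d)

# Toy syri.out: ref chr/start/end, ref seq, query seq, query chr/start/end, ID, parent, type, copy status
printf 'Chr1\t1\t500\t-\t-\tChr1\t1\t500\tSYN1\t-\tSYN\t-\n'       >  "$demo_dir/syri.out"
printf 'Chr1\t120\t120\tA\tG\tChr1\t120\t120\tSNP1\tSYN1\tSNP\t-\n' >> "$demo_dir/syri.out"
printf 'Chr1\t300\t300\tC\tT\tChr1\t300\t300\tSNP2\tSYN1\tSNP\t-\n' >> "$demo_dir/syri.out"
printf 'Chr1\t600\t900\t-\t-\tChr1\t900\t600\tINV1\t-\tINV\t-\n'    >> "$demo_dir/syri.out"

# Count the rows of each annotation type (column 11)
type_counts=$(awk -F'\t' '{n[$11]++} END {for (t in n) print t "\t" n[t]}' "$demo_dir/syri.out" | sort)
echo "$type_counts"
```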

### <span style="color: #4CACBC;"> Extracting all SNPs from the SyRI output <a class="anchor" id="siri"></a></span>  

In [None]:
# Keep the SNP rows of syri.out and write four tab-separated columns:
# reference chromosome, start, end + 1, end
awk 'BEGIN {FS = OFS = "\t"} /SNP/ {print $1, $2, $3 + 1, $3}' syri.out > SNPs.bed
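If the SNP file is meant for tools that expect standard BED intervals (0-based start, end not included), a variant of the extraction could shift the start by one and keep the two alleles as extra columns. This is only a sketch under the assumption that `syri.out` stores 1-based reference coordinates in columns 2 and 3, the reference and query alleles in columns 4 and 5, and the annotation type in column 11; the toy rows are made up.

```shell
demo_dir=$(mktemp -d)

# Toy syri.out with one syntenic region and two SNPs
printf 'Chr1\t1\t500\t-\t-\tChr1\t1\t500\tSYN1\t-\tSYN\t-\n'       >  "$demo_dir/syri.out"
printf 'Chr1\t120\t120\tA\tG\tChr1\t120\t120\tSNP1\tSYN1\tSNP\t-\n' >> "$demo_dir/syri.out"
printf 'Chr1\t300\t300\tC\tT\tChr1\t300\t300\tSNP2\tSYN1\tSNP\t-\n' >> "$demo_dir/syri.out"

# chromosome, 0-based start, end, reference allele, query allele
awk 'BEGIN {FS = OFS = "\t"} $11 == "SNP" {print $1, $2 - 1, $3, $4, $5}' \
    "$demo_dir/syri.out" > "$demo_dir/SNPs.alleles.bed"

snp_bed=$(cat "$demo_dir/SNPs.alleles.bed")
echo "$snp_bed"
```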


In [None]:
echo -e "${reference_assembly}\tref" > plotsr_pos.txt
echo -e "${query_assembly}\tquery" >> plotsr_pos.txt

In [None]:
head plotsr_pos.txt
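The genomes file used here has one line per assembly, with the FASTA path and a display name separated by a tab. A stray space or a missing tab is hard to spot with `head`, so a quick field check can help. The sketch below builds such a file with `printf` and counts malformed lines; the paths are hypothetical, and the check for exactly two fields only fits this two-column layout.

```shell
demo_dir=$(mktemp -d)
ref_fa="$demo_dir/ref.fasta"   # hypothetical paths
qry_fa="$demo_dir/qry.fasta"
genomes_file="$demo_dir/plotsr_pos.txt"

# printf expands \t in its format string in every POSIX shell
printf '%s\t%s\n' "$ref_fa" "ref"   >  "$genomes_file"
printf '%s\t%s\n' "$qry_fa" "query" >> "$genomes_file"

# Count lines that do not have exactly two non-empty tab-separated fields
bad_lines=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" {c++} END {print c + 0}' "$genomes_file")
echo "Malformed lines: $bad_lines"
```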

In [None]:
plotsr --sr syri.out --genomes plotsr_pos.txt -s 500 -o plotsr.pdf -H 8 -W 5


## <span style="color:#006E7F">II - Using Assemblytics <a class="anchor" id="SVsyri"></a></span>  

* Download the `.delta` file generated by nucmer.
* Upload this file to the Assemblytics website: [http://assemblytics.com/](http://assemblytics.com/)

---------------------

## <span style="color:#006E7F">III - Using D-GENIES <a class="anchor" id="SVsyri"></a></span>  

* Download the two genomes (FASTA files).
* Upload these files to the D-GENIES website: [http://dgenies.toulouse.inra.fr/](http://dgenies.toulouse.inra.fr/)
