#### <span style="color:grey"> __Formation South Green 2022 - Structural Variants Detection by using short and long reads__ </span>

# <span style="color:#006E7F">  <center> __DAY 4 : How to identify large variants using whole-genome assemblies ?__ </center> </span>

Created by C. Tranchant (DIADE-IRD), J. Orjuela (DIADE-IRD), F. Sabot (DIADE-IRD) and A. Dereeper (PHIM-IRD)

***

# <span style="color: #006E7F">Table of contents</span>
<a class="anchor" id="home"></a>

[Preparing working environment](#env)

[I - using nucmer & syri](#SVsyri)

* [Run nucmer](#nucmer)
* [Run syri](#syri)

[II - using assemblytics](#SVassemblytics)

[III - using d-genies](#SVdgenies)



***



## <span style="color:#006E7F">__Preparing working environment__ <a class="anchor" id="env"></a></span>  

### <span style="color: #4CACBC;"> Create the working directory `~/work/day4-WGA-SV`</span>  

### <span style="color: #4CACBC;"> Download the two genomes that we will compare </span>  
* Download the archive from the [itrop website](https://itrop.ird.fr/)
* Decompress it
* Remove the archive
* Check that the directory `Assemblies` exists
* List the content of the directory `Assemblies`

### <span style="color: #4CACBC;"> Count the number of sequences of the fasta files in the directory `Assemblies`  </span>  
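
One possible way to do this is to count the header lines, since every FASTA record starts with `>`. The sketch below builds a tiny throwaway FASTA file (hypothetical content, not the course data) so that it runs anywhere; on the real data, the same `grep` would be pointed at the files in `Assemblies`.

```shell
# Toy FASTA file with two records (hypothetical data, for illustration only)
tmpdir=$(mktemp -d)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$tmpdir/toy.fasta"

# Each record starts with ">", so counting those lines counts the sequences
nb_seq=$(grep -c '^>' "$tmpdir/toy.fasta")
echo "toy.fasta contains $nb_seq sequences"
```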

### <span style="color: #4CACBC;"> Calculate the size of the two genomes in the directory `Assemblies` using `seqtk comp` </span>  

`seqtk comp` reports information about each sequence of the fasta file, in particular:
* column 1 contains the name of the sequence
* column 2 contains the length of the sequence in bp

In [None]:
seqtk comp PUT_YOUR_FASTA

##### Compute the size of the genome 5417 using `seqtk comp | awk`

In [None]:
seqtk comp PUT_YOUR_FASTA | awk '{ PUT_YOUR_CODE }'
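
As a hint, the `awk` part only needs to add up column 2 and print the total once all lines have been read. The sketch below does not call `seqtk`; it feeds `awk` a few hand-written lines that mimic the first columns of `seqtk comp` output (name, length, then base counts). The values are made up for illustration.

```shell
# Simulated `seqtk comp` output: name, length, then A/C/G/T counts (toy values)
comp_output=$(printf 'chr1\t12\t3\t3\t3\t3\nchr2\t4\t0\t2\t2\t0\n')

# Sum column 2 (sequence length) over all lines and print the total at the end
genome_size=$(printf '%s\n' "$comp_output" | awk '{ total += $2 } END { print total }')
echo "Genome size: $genome_size bp"
```

With the real data, the same `awk` program would simply be placed after `seqtk comp PUT_YOUR_FASTA |`.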

##### Compute the size of the two genomes using `seqtk comp | awk` in a `for loop`

In [None]:
for file in ...;
    do
        echo -e "\n>>>> Reading file: $file"
        seqtk comp ...
    done
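
To cross-check the sizes reported by `seqtk comp`, the same loop structure can be used with `awk` alone, by adding up the length of every non-header line. This is only a sketch on two toy files (hypothetical names and content) and is not the method used in the course.

```shell
# Two toy FASTA files standing in for the two assemblies (hypothetical data)
tmpdir=$(mktemp -d)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$tmpdir/A.fasta"
printf '>chr1\nACGTAC\n' > "$tmpdir/B.fasta"

sizes=""
for file in "$tmpdir"/*.fasta; do
    # Skip header lines and add up the length of every sequence line
    size=$(awk '!/^>/ { total += length($0) } END { print total }' "$file")
    echo "$(basename "$file"): $size bp"
    sizes="$sizes$size "
done
```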

## <span style="color:#006E7F">__I - Using nucmer & syri__ <a class="anchor" id="SVsyri"></a></span>  

We will use :
* nucmer to align the two genomes. [nucmer manual](http://mummer.sourceforge.net/manual/)
* syri to detect SVs from nucmer alignment. [syri manual](https://schneebergerlab.github.io/syri/)

### <span style="color: #4CACBC;"> Initialize the variables pointing to the two genomes</span>  


In [None]:
dir_genome="/home/jovyan/work/day4-WGA-SV/"
reference_assembly=$dir_genome"Assemblies/A8_assembly.fasta"
query_assembly=$dir_genome"Assemblies/5417_assembly.fasta"

### <span style="color: #4CACBC;"> Create the working directory SYRI within the directory day4-WGA-SV</span>  

### <span style="color: #4CACBC;"> Aligning genomes using `Nucmer` <a class="anchor" id="nucmer"></a></span>  

__Alignment of the two genomes with `nucmer`__

[Nucmer manual](http://mummer.sourceforge.net/manual/)

Some interesting parameters :
<code>
--maxmatch      Compute all maximal matches regardless of their uniqueness
-b|breaklen     Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (default 200)
-c|mincluster   Sets the minimum length of a cluster of matches (default 65)
-l|minmatch     Set the minimum length of a single match (default 20)
</code>

In [None]:
nucmer --maxmatch PUT_YOUR_REF PUT_YOUR_QUERY

__Check if new files have been generated by nucmer and display the first lines of the file `.delta`__

__Filtering nucmer results with `delta-filter`__

We are going to remove small and lower quality alignments. Some interesting parameters :
<code>
-i float	Set the minimum alignment identity [0, 100], (default 0)
-l int		Set the minimum alignment length (default 0)
-m            Many-to-many alignment allowing for rearrangements (union of -r and -q alignments)
-q            Maps each position of each query to its best hit in the reference, allowing for reference overlaps
-r            Maps each position of each reference to its best hit in the query, allowing for query overlaps
</code>



In [None]:
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta

#### Check that the new `delta` file has been filtered
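
One way to check is to compare the number of alignments before and after filtering. In a delta file, each alignment starts with a line of seven numbers (start and end in the reference, start and end in the query, then three error counts), so counting lines with exactly seven fields gives the number of alignments. The sketch below runs on two small hand-written delta files (hypothetical content); on the real data the same `awk` command would be applied to `out.delta` and `out.filtered.delta`.

```shell
tmpdir=$(mktemp -d)
# Toy delta files (hypothetical content): 2 alignments before filtering, 1 after
printf '%s\n' '/path/ref.fasta /path/qry.fasta' 'NUCMER' '>chr1 chr1 1000 1000' \
    '1 500 1 500 5 5 0' '0' '600 650 600 650 2 2 0' '0' > "$tmpdir/out.delta"
printf '%s\n' '/path/ref.fasta /path/qry.fasta' 'NUCMER' '>chr1 chr1 1000 1000' \
    '1 500 1 500 5 5 0' '0' > "$tmpdir/out.filtered.delta"

# Alignment header lines are the only lines with exactly 7 fields
nb_before=$(awk 'NF == 7 { n++ } END { print n + 0 }' "$tmpdir/out.delta")
nb_after=$(awk 'NF == 7 { n++ } END { print n + 0 }' "$tmpdir/out.filtered.delta")
echo "Alignments before filtering: $nb_before, after filtering: $nb_after"
```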

__Converting the file `delta` into a tabular file using `show-coords`__

[show-coords manual](http://mummer.sourceforge.net/manual/#coords)

Some interesting parameters :
<code>
-c	Include percent coverage columns in the output
-d	Include the alignment direction/reading frame in the output (default for promer)
-H	Omit the output header
-I float	Set minimum percent identity to display
-l	Include sequence length columns in the output
-L int	Set minimum alignment length to display
-q	Sort output lines by query
-r	Sort output lines by reference
-T	Switch output to tab-delimited format
</code>



In [None]:
show-coords -Trd out.filtered.delta > out.filtered.withHeader.coords

In [None]:
show-coords -THrd out.filtered.delta > out.filtered.coords

In [None]:
ls -lrt

In [None]:
head *.coords
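
The tabular file can be summarised quickly with `awk`. The sketch below assumes the 11-column layout expected from `show-coords -THrd` on nucmer alignments (S1, E1, S2, E2, LEN1, LEN2, %IDY, reference direction, query direction, reference name, query name); this layout is an assumption that should be checked against the header of `out.filtered.withHeader.coords`. The two input lines are made up for illustration.

```shell
tmpdir=$(mktemp -d)
# Toy coords file (hypothetical values): one forward and one reverse alignment
printf '1\t500\t1\t500\t500\t500\t99.00\t1\t1\tchr1\tchr1\n' > "$tmpdir/toy.coords"
printf '600\t900\t1200\t900\t301\t301\t95.00\t1\t-1\tchr1\tchr1\n' >> "$tmpdir/toy.coords"

# Count alignments whose query direction (column 9) is reverse
nb_reverse=$(awk -F'\t' '$9 == -1 { n++ } END { print n + 0 }' "$tmpdir/toy.coords")
# Total length of the aligned reference segments (column 5)
aligned_ref=$(awk -F'\t' '{ total += $5 } END { print total + 0 }' "$tmpdir/toy.coords")
echo "Reverse alignments: $nb_reverse ; aligned reference bases: $aligned_ref"
```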

### <span style="color: #4CACBC;"> SV calling using syri <a class="anchor" id="syri"></a></span>  

__Activate the syri environment with conda__

In [None]:
conda activate syri_env

In [None]:
syri -c PUT_YOUR_COORDS_FILE -d PUT_YOUR_DELTA_FILE_FILTERED -r PUT_YOUR_REF -q PUT_YOUR_QUERY

In [None]:
ls -lt

__Output file format__

SyRI outputs results in TSV format and VCF file format.

| Annotation | Meaning | Annotation | Meaning |
|---|---|---|---|
| SYN | Syntenic region | SYNAL | Alignment in syntenic region |
| INV | Inverted region | INVAL | Alignment in inverted region |
| TRANS | Translocated region | TRANSAL | Alignment in translocated region |
| INVTR | Inverted translocated region | INVTRAL | Alignment in inverted translocated region |
| DUP | Duplicated region | DUPAL | Alignment in duplicated region |
| INVDP | Inverted duplicated region | INVDPAL | Alignment in inverted duplicated region |
| NOTAL | Un-aligned region | SNP | Single nucleotide polymorphism |
| CPG | Copy gain in query | CPL | Copy loss in query |
| HDR | Highly diverged regions | TDM | Tandem repeat |
| INS | Insertion in query | DEL | Deletion in query |

In [None]:
cat syri.summary

__TSV format specifications__

| Column number | Value | Type |
|---|---|---|
| 1 | chromosome ID in reference | string |
| 2 | reference start position (1-based, includes start position) | int |
| 3 | reference end position (1-based, includes end position) | int |
| 4 | sequence in reference (only for SNPs and indels) | string |
| 5 | sequence in query (only for SNPs and indels) | string |
| 6 | chromosome ID in query | string |
| 7 | query start position (1-based, includes start position) | int |
| 8 | query end position (1-based, includes end position) | int |
| 9 | unique ID (annotation type + number) | string |
| 10 | parent ID (annotation type + number) | string |
| 11 | annotation type | string |
| 12 | copy status (for duplications) | string |

In [None]:
head syri.vcf

In [None]:
head syri.out
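
Since the annotation type is stored in column 11 of the TSV output, a quick way to get an overview is to count how many lines carry each annotation. The sketch below runs on three hand-written lines that follow the 12-column layout (hypothetical values); on the real data the same pipeline would read `syri.out`.

```shell
tmpdir=$(mktemp -d)
# Toy file following the 12-column TSV layout (hypothetical values)
printf 'chr1\t1\t500\t-\t-\tchr1\t1\t500\tSYN1\t-\tSYN\t-\n' > "$tmpdir/toy_syri.out"
printf 'chr1\t600\t900\t-\t-\tchr1\t900\t1200\tINV1\t-\tINV\t-\n' >> "$tmpdir/toy_syri.out"
printf 'chr1\t1000\t1500\t-\t-\tchr1\t1300\t1800\tSYN2\t-\tSYN\t-\n' >> "$tmpdir/toy_syri.out"

# Extract column 11, then count the occurrences of each annotation type
annotation_counts=$(cut -f11 "$tmpdir/toy_syri.out" | sort | uniq -c | awk '{ print $2 "=" $1 }')
echo "$annotation_counts"
```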

__Create the file `plotsr_pos.txt` required by the following commands__

In [None]:
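# One line per genome: path to the fasta file, a tab, then a short name.
# The first command uses ">" so that re-running the cell does not duplicate lines.
echo -e "${reference_assembly}\tref" > plotsr_pos.txt
echo -e "${query_assembly}\tquery" >> plotsr_pos.txt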

In [None]:
head plotsr_pos.txt

__Generate the plot using `plotsr`__

[plotsr manual](https://github.com/schneebergerlab/plotsr)

Some interesting parameters :

<code>
  -s S                  minimum size of a SR to be plotted (default: 10000)
  -H H                  height of the plot (default: None)
  -W W                  width of the plot (default: None)
  -S S                  Space for homologous chromosome (0.1-0.75). Adjust this to make more space for annotation
                        markers/texts and tracks. (default: 0.7)
</code>

In [None]:
pwd
plotsr --sr YOUR_SYRI_OUT_FILE --genomes PUT_YOUR_PLOTSR_POS_FILE -s 500 -o plotsr.pdf -H 8 -W 5

### => Download the file `plotsr.pdf`

--------------

## <span style="color:#006E7F">__II - using assemblytics__ <a class="anchor" id="SVassemblytics"></a></span>  

* Download the `.delta` file generated by nucmer.
* Upload this file to the assemblytics website : [http://assemblytics.com/](http://assemblytics.com/)

---------------------

## <span style="color:#006E7F">__III - using d-genies__ <a class="anchor" id="SVdgenies"></a></span>  
* Download the two genomes (fasta files)
* Upload these files to the d-genies website : [http://dgenies.toulouse.inra.fr/](http://dgenies.toulouse.inra.fr/)
