GitHub - Gabaldonlab/pyScaf: Genome assembly scaffolding using information from paired-end/mate-pair libraries, long reads, and synteny to closely related species.

pyScaf

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (Scaffolding based on long reads)
synteny to the genome of some related species (Reference-based scaffolding)

Scaffolding modes

NGS-based scaffolding

This is under development... Stay tuned.

Scaffolding based on long reads

In this mode, pyScaf aligns long reads onto the contigs, identifies the reads that connect two or more contigs, and joins adjacent contigs.

Long reads are aligned locally onto contigs. During filtering, pyScaf:

ignores matches not satisfying cut-offs (--identity and --overlap)
ignores suboptimal matches (only the best match of each query to the reference is kept)
removes overlapping matches on the reference.
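The linking step (finding reads that connect two or more contigs) can be illustrated with a minimal sketch. This is not pyScaf's code: the function name `contig_links` and the input layout `(read_id, read_start, contig_id)` are assumptions made for the example, and contig orientation is ignored for brevity.

```python
from collections import defaultdict

def contig_links(hits):
    """Count reads supporting each pair of neighbouring contigs.

    hits: iterable of (read_id, read_start, contig_id) tuples, one per
    retained alignment (hypothetical layout, not pyScaf's internal format).
    """
    by_read = defaultdict(list)
    for read_id, read_start, contig_id in hits:
        by_read[read_id].append((read_start, contig_id))

    links = defaultdict(int)
    for placements in by_read.values():
        placements.sort()  # order contigs by their position along the read
        contigs = [contig for _, contig in placements]
        # consecutive, distinct contigs on one read are treated as neighbours
        for left, right in zip(contigs, contigs[1:]):
            if left != right:
                links[tuple(sorted((left, right)))] += 1
    return dict(links)

hits = [
    ("read1", 0, "ctgA"), ("read1", 5200, "ctgB"),
    ("read2", 100, "ctgB"), ("read2", 7000, "ctgC"),
    ("read3", 0, "ctgA"), ("read3", 4800, "ctgB"),
    ("read4", 0, "ctgC"),  # aligns to a single contig, so it links nothing
]
print(contig_links(hits))
```

Pairs supported by more reads are the stronger candidates for joining; a real scaffolder would also track strand and the distance between the two alignments on the read.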

Note that this is an experimental implementation.

Reference-based scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate distances between adjacent contigs.

Contigs are aligned locally onto reference chromosomes. During filtering, pyScaf:

ignores matches not satisfying cut-offs (--identity and --overlap)
ignores suboptimal matches (only the best match of each query to the reference is kept)
removes overlapping matches on the reference.
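The three filters can be sketched as below. This is an illustration under stated assumptions, not pyScaf's implementation: the `Hit` record and `filter_hits` function are hypothetical, "overlap" is taken to mean the aligned fraction of the query, "best" is taken to mean highest alignment score, and overlapping matches are resolved greedily in favour of the higher-scoring one. The default cut-offs mirror the documented `--identity` and `--overlap` defaults.

```python
from collections import namedtuple

# Hypothetical record for one local alignment of a contig (query)
# onto a reference chromosome; not pyScaf's internal format.
Hit = namedtuple("Hit", "query qlen ref rstart rend aligned identity score")

def filter_hits(hits, min_identity=0.33, min_overlap=0.66):
    """Apply the three filters in order; return hits sorted by reference position."""
    # 1. drop matches below the identity / overlap cut-offs
    passed = [h for h in hits
              if h.identity >= min_identity and h.aligned / h.qlen >= min_overlap]
    # 2. keep only the best-scoring match of each query
    best = {}
    for h in passed:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    # 3. remove matches overlapping an already kept match on the same reference
    kept = []
    for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
        clash = any(h.ref == k.ref and h.rstart < k.rend and k.rstart < h.rend
                    for k in kept)
        if not clash:
            kept.append(h)
    return sorted(kept, key=lambda h: (h.ref, h.rstart))

hits = [
    Hit("ctg1", 1000, "chr1", 0, 950, 950, 0.98, 900),
    Hit("ctg1", 1000, "chr2", 500, 900, 400, 0.99, 390),     # overlap 0.40: dropped
    Hit("ctg2", 2000, "chr1", 1200, 3100, 1900, 0.95, 1800),
    Hit("ctg3", 800, "chr1", 900, 1650, 750, 0.20, 150),     # identity 0.20: dropped
    Hit("ctg4", 500, "chr1", 3000, 3480, 480, 0.97, 460),    # overlaps ctg2: dropped
]
for h in filter_hits(hits):
    print(h.query, h.ref, h.rstart, h.rend)
```

With this toy input, only `ctg1` and `ctg2` survive, both placed on `chr1`.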

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes: all chromosomes were reconstructed correctly in every CANPA run and in nearly every ARATH run (Figures in dropbox, CANPA table, ARATH table). Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which break synteny.
pyScaf works very well if the divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Usage

Dependencies

Parameters

When a reference genome is given, the program generates pairwise genome alignment plots (dotplots) by default.

General options:

`-h, --help`	show this help message and exit
`-f FASTA, --fasta FASTA`
	assembly FASTA file
`-o OUTPUT, --output OUTPUT`
	output stream [scaffolds.fa]
`-t THREADS, --threads THREADS`
	max no. of threads to run [4]
`--log LOG`	output log to [stderr]
`--dotplot`	generate dotplot as [png]
`--version`	show program's version number and exit

Reference-based scaffolding options:

`-r REF, --ref REF, --reference REF`
	reference FastA file
`--identity IDENTITY`
	min. identity [0.33]
`--overlap OVERLAP`
	min. overlap [0.66]
`-g MAXGAP, --maxgap MAXGAP`
	max. distance between adjacent contigs [0.01 * assembly_size]
`--norearrangements`
	high identity mode (rearrangements not allowed)
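To make the `--maxgap` default concrete, the sketch below computes `0.01 * assembly_size` and estimates gaps between contigs ordered along one reference chromosome. The function names, the placement tuples and the decision to merely flag gaps larger than `maxgap` are assumptions made for illustration; how pyScaf itself treats such gaps is not stated here.

```python
def default_maxgap(contig_lengths, fraction=0.01):
    """Default max. distance between adjacent contigs: 0.01 * assembly size."""
    return round(fraction * sum(contig_lengths))

def adjacent_gaps(placements, maxgap):
    """Estimate gaps between neighbouring contigs on one reference chromosome.

    placements: list of (contig, ref_start, ref_end) tuples (hypothetical layout).
    Returns (left, right, gap, within_maxgap) for each adjacent pair.
    """
    ordered = sorted(placements, key=lambda p: p[1])  # order by reference start
    gaps = []
    for (left, _, left_end), (right, right_start, _) in zip(ordered, ordered[1:]):
        gap = max(0, right_start - left_end)  # touching contigs get a gap of 0
        gaps.append((left, right, gap, gap <= maxgap))
    return gaps

lengths = {"ctg1": 40000, "ctg2": 35000, "ctg3": 25000}
maxgap = default_maxgap(lengths.values())  # 1% of a 100 kb assembly
placements = [("ctg2", 40500, 75500), ("ctg1", 0, 40000), ("ctg3", 80000, 105000)]
for row in adjacent_gaps(placements, maxgap):
    print(row)
```

In this toy case the 500 bp gap between `ctg1` and `ctg2` falls within the limit, while the 4500 bp gap between `ctg2` and `ctg3` exceeds it.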

Long read-based scaffolding options (EXPERIMENTAL!):

`-n LONGREADS, --longreads LONGREADS`
	FastQ/FastA file(s) with PacBio/ONT reads

NGS-based scaffolding options (!NOT IMPLEMENTED!):

`-i FASTQ, --fastq FASTQ`
	FASTQ PE/MP files
`-j JOINS, --joins JOINS`
	min pairs to join contigs [5]
`-a LINKRATIO, --linkratio LINKRATIO`
	max link ratio between two best contig pairs [0.7]
`-l LOAD, --load LOAD`
	align subset of reads [0.2]
`-q MAPQ, --mapq MAPQ`
	min mapping quality [10]

Test run

To perform reference-based scaffolding, provide the assembled contigs and the reference genome in FastA format. Dotplots of the runs below can be found in docs. If you wish to skip dotplot generation (e.g. there is no X11 on your system), pass the --dotplot '' parameter.

# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa

# scaffold reduced contigs using global mode (no rearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements

# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa

# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa

# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png

# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log

Proof of concept

pyScaf is under heavy development right now. Nevertheless, both the reference-based mode and the long-read mode are functional and produce meaningful assemblies. pyScaf has been implemented in Redundans.

For more info, have a look at the workbook.
