A repo of perl scripts used to improve genome assemblies (tuned for Illumina Synthetic Long Reads), born from the work on assembling allotetraploid species Trifolium repens (White Clover).

Lanilen/SemHelpers

SemHelpers

A collection of Perl scripts for improving genome assemblies (tuned for Illumina Synthetic Long Reads), born from work on assembling the allotetraploid species Trifolium repens (white clover).

correct_small_gaps_v2.pl: This script takes a FASTA file (the reference sequence) and an mpileup file, and corrects indels in the reference according to the consensus of the alignment. It is tuned to work with Illumina Synthetic Long Reads.
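The consensus idea can be illustrated with a minimal Python sketch. This is not the script's actual Perl code: the function names and the `min_fraction` threshold are hypothetical, and the only thing assumed from outside this README is the standard samtools mpileup layout (depth in column 4, read bases in column 5, indels encoded as `+<len><seq>` or `-<len><seq>`).

```python
import re
from collections import Counter


def count_indels(read_bases):
    """Count indel events in the read-bases column of one mpileup line."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # '^' marks a read start and is followed by one mapping-quality
            # character, which may itself be '+' or '-', so skip both.
            i += 2
        elif ch in "+-":
            length_match = re.match(r"\d+", read_bases[i + 1:])
            length = int(length_match.group())
            start = i + 1 + len(length_match.group())
            seq = read_bases[start:start + length].upper()
            counts[ch + seq] += 1
            i = start + length
        else:
            i += 1
    return counts


def consensus_indel(mpileup_line, min_fraction=0.5):
    """Return the best-supported indel (e.g. '+AT') if it exceeds
    min_fraction of the read depth, else None.
    min_fraction is a hypothetical parameter for this sketch."""
    fields = mpileup_line.rstrip("\n").split("\t")
    depth = int(fields[3])
    if depth == 0:
        return None
    counts = count_indels(fields[4])
    if not counts:
        return None
    indel, support = counts.most_common(1)[0]
    return indel if support / depth > min_fraction else None
```

For a position where three of four reads carry the same two-base insertion, `consensus_indel` would report `+AT`; a real correction step would then apply that change to the FASTA sequence.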

merge_and_replace.pl: Alternatively, this script takes a MAF alignment (such as one produced with LAST or LASTZ) and replaces the reference genome with the aligned reads based on similarity (user-defined).
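As a rough illustration of a similarity-gated replacement, the Python sketch below computes percent identity over two gapped alignment rows and keeps the read sequence only when it clears a threshold. The identity definition, the function names, and the `min_identity` default are all assumptions made for this example; the README does not say how the script measures similarity.

```python
def percent_identity(ref_text, read_text):
    """Percent of alignment columns where both rows carry the same base.
    Columns with a gap in either row count as mismatches (an assumption)."""
    if len(ref_text) != len(read_text):
        raise ValueError("alignment rows must have the same length")
    if not ref_text:
        return 0.0
    matches = sum(
        1
        for r, q in zip(ref_text, read_text)
        if r != "-" and r.upper() == q.upper()
    )
    return 100.0 * matches / len(ref_text)


def choose_sequence(ref_text, read_text, min_identity=95.0):
    """Return the ungapped read if it is similar enough, else the ungapped
    reference. min_identity is a hypothetical user-defined threshold."""
    if percent_identity(ref_text, read_text) >= min_identity:
        return read_text.replace("-", "")
    return ref_text.replace("-", "")
```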

maf_masking_by_feature.pl: This script takes a MAF alignment and one or more GFF files, and masks the alignment where those features are located. This is useful, for example, for aligning whole genomes taking advantage of synteny and then masking exons or genes to estimate mutation rates.
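The coordinate bookkeeping behind this kind of masking can be sketched in Python as follows. It is an illustration only, under stated assumptions: MAF starts are 0-based, GFF feature coordinates are 1-based and inclusive, the reference row is on the forward strand, and masked bases are replaced with `N`. The function names and the choice of mask character are hypothetical, not taken from the script.

```python
def masked_columns(ref_start, ref_text, features):
    """Return the alignment columns whose reference base lies in a feature.

    ref_start: 0-based start of the reference row in the MAF block.
    ref_text:  gapped reference row ('-' for gaps).
    features:  list of GFF-style (start, end) pairs, 1-based inclusive.
    """
    cols = set()
    pos = ref_start  # 0-based position of the next reference base
    for col, ch in enumerate(ref_text):
        if ch == "-":
            continue  # gaps do not consume a reference position
        if any(start - 1 <= pos < end for start, end in features):
            cols.add(col)
        pos += 1
    return cols


def mask_row(text, cols, mask_char="N"):
    """Replace bases in the given columns with mask_char, leaving gaps as-is."""
    return "".join(
        mask_char if i in cols and ch != "-" else ch
        for i, ch in enumerate(text)
    )
```

Computing the columns once from the reference row and applying them to every row keeps the alignment block rectangular, so downstream tools can still read it column by column.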

sort_scaffolds_by_LD.pl, place_scaffolds_list_simplified.pl: These two scripts are simply aids for creating chained scaffolds/pseudomolecules based on positions relative to a closely related reference sequence. The lists can be created via LD mapping, simple Megablast, or a combination of both.

score_selfies.pl: This script takes a MAF file generated by lastal (https://github.com/mcfrith/last-genome-alignments) from aligning a genome against itself, looks for high-scoring segment pairs (HSPs) in which a sequence aligns to itself, and splits them into three types:
self-matches: As the name indicates, the number of hits in which the sequence matches itself at exactly the same position on both ends.
tandem-matches: HSPs where the start of one of the sequences is within 25 bp of the end of the other sequence in the alignment. This usually indicates a bubble that the assembler has placed as a tandem repeat (which may or may NOT be true!).
far-matches: HSPs that land outside the 25 bp window.
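The three categories above can be expressed as a small Python sketch. It is a reading of the description, not the script's code: the function name is hypothetical, both intervals are assumed to be on the same sequence with comparable coordinates, and "within 25 bp" is treated as inclusive.

```python
def classify_hsp(start_a, end_a, start_b, end_b, window=25):
    """Classify one self-alignment HSP as 'self', 'tandem', or 'far'.

    (start_a, end_a) and (start_b, end_b) are the two aligned intervals
    on the same sequence. window=25 follows the description above;
    treating the boundary as inclusive is an assumption.
    """
    if start_a == start_b and end_a == end_b:
        return "self"
    if abs(start_b - end_a) <= window or abs(start_a - end_b) <= window:
        return "tandem"
    return "far"
```

Checking the self-match case first matters: a very short sequence aligned to itself would otherwise satisfy the 25 bp test and be counted as tandem.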
