DenisaConstantin/SubSequencesExtractor

SubSequencesExtractor

The SubSequencesExtractor.sh script is an auxiliary script for the digital scaffolding strategy, dScaff (https://github.com/DL-UB/dScaff). It first generates a file containing the coordinates of interest based on the input values. This file has two columns: one for the start coordinates and one for the stop coordinates. The final stop coordinate is replaced by the length of the target sequence. SeqKit (https://github.com/shenwei356/seqkit) is then used to extract subsequences from the sequence of interest at these coordinates. The header of each extracted subsequence is modified to include the extraction coordinates and the source file. The subsequences are then concatenated into a single output file, ranked_queries.fasta. In addition, the script generates a tab-delimited file containing the coordinates of the genes (or subsequences in this case) required for running dScaff.
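The extraction and header-tagging step described above might look roughly like the following sketch. It is not taken from SubSequencesExtractor.sh: the function name extract_windows, the coordinate file name coords.tsv, and the "start:stop source-file" suffix appended to each header are assumptions made for illustration, and the script's real header format may differ. The only SeqKit feature relied on is the region flag of seqkit subseq (-r start:end).

```shell
# Hypothetical helper, not part of SubSequencesExtractor.sh.
# Usage: extract_windows COORDS FASTA
#   COORDS: two-column file (start, stop), one pair per line
#   FASTA : FASTA file holding the sequence of interest
extract_windows() (
  coords=$1
  fasta=$2
  while read -r start stop; do
    # Extract the region, then append the coordinates and the source
    # file to the header line (assumed header layout).
    seqkit subseq -r "${start}:${stop}" "$fasta" |
      awk -v tag="${start}:${stop} ${fasta}" \
          '/^>/ { print $0 " " tag; next } { print }'
  done < "$coords"
)
```

Because all records go to standard output, something like extract_windows coords.tsv 2R.fasta > ranked_queries.fasta would collect them into a single file, mirroring the concatenation step.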

Dependencies

This script requires FASTA files containing the chromosomes or sequences of interest. Each file must be named with a ".fasta" extension, such as "2R.fasta". For instance, if you are working with the Drosophila suzukii genome downloaded from NCBI (NCBI RefSeq assembly GCF_037355615.1), you can generate a FASTA file for each chromosome using the following command:

awk -F " |," '/^>/ {s=$7".fa"}; {print > s}' GCF_037355615.1_Dsuz_RU_1.0_genomic.fna

The awk command must be adapted to the header structure of your file. The resulting files will be in multi-line format. To convert them to single-line format with seqtk, use:

for i in *.fa; do seqtk seq "$i" > "${i%.fa}.fasta"; done

Before running the SubSequencesExtractor.sh script, ensure that SeqKit is installed on your system.
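Since the awk command above depends on the header layout, here is one possible adaptation for files whose headers do not match the NCBI example. It is only a sketch: the function name split_by_id is hypothetical, and it assumes that the first whitespace-delimited token of each header (without the leading ">") is a unique identifier that is safe to use as a file name.

```shell
# Hypothetical helper: write each FASTA record to <first header token>.fa
# in the current directory. Assumes identifiers are unique.
# Usage: split_by_id genome.fna
split_by_id() {
  awk '
    /^>/ {
      if (out != "") close(out)      # avoid keeping many files open
      out = substr($1, 2) ".fa"      # drop ">" and add the extension
    }
    out != "" { print > out }        # skip any text before the first header
  ' "$1"
}
```

The resulting ".fa" files can then be converted to single-line ".fasta" files in the same way as before.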

Running

Download the script, make sure that it is executable, and run:

./SubSequencesExtractor.sh 4000 10000

Here, 4000 is the length of the subsequences and 10000 is the distance between consecutive subsequences. For example, the first subsequence will span coordinates 1 to 4000, the next will span 14000 to 18000, and so on. The final stop coordinate is always set to the length of the sequence, so the last subsequence is not necessarily 4000 nucleotides long; it may be longer if the total length of the sequence is not an exact multiple of 4000.
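The coordinate rule in this example can be sketched as follows. This is a guess at the rule that reproduces the numbers quoted above (1 to 4000, then 14000 to 18000), not the script's actual code: the function name gen_coords and the explicit sequence-length argument are hypothetical, and the real script may compute the windows differently.

```shell
# Hypothetical helper, not part of SubSequencesExtractor.sh.
# Usage: gen_coords SEQ_LEN SUB_LEN DIST
# Prints tab-separated start/stop pairs; the final stop is replaced
# by the sequence length.
gen_coords() (
  seq_len=$1
  sub_len=$2
  dist=$3
  step=$((sub_len + dist))    # offset between consecutive window origins
  offset=0
  while [ "$offset" -lt "$seq_len" ]; do
    # The first window starts at 1; later windows start at the offset itself.
    if [ "$offset" -eq 0 ]; then start=1; else start=$offset; fi
    stop=$((offset + sub_len))
    next=$((offset + step))
    # Last window: replace the stop with the sequence length.
    if [ "$next" -ge "$seq_len" ]; then stop=$seq_len; fi
    printf '%s\t%s\n' "$start" "$stop"
    offset=$next
  done
)

# Example with an assumed 30000-nucleotide sequence.
gen_coords 30000 4000 10000
```

Under these assumptions, the example call prints three pairs: 1 and 4000, 14000 and 18000, then 28000 and 30000. With a 40000-nucleotide sequence the last pair would instead be 28000 and 40000, which illustrates how the final subsequence can end up longer than 4000 nucleotides.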
