Skip to content

Giving a contig name and positions (start and end), extractSeq can return the corresponding sequence from your fasta.

License

Notifications You must be signed in to change notification settings

Sanrrone/extractSeq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 

Repository files navigation

extractSeq

extracSeq is a bash script (awk embedded), that giving a contig name and positions (start and end), return the corresponding sequence from your fasta.

Usage

extractSeq need five (and intuitive), parameters (order is important):

  • The contig name where are the coordinates that you want.
  • Start position of your gene.
  • End position of your gene.
  • Fasta file to dive.
  • Output name (optional, by default is "output").
  • Strand where are your gene (default: +).

Example:

bash extractSeq.bash contig0001 500 2000 + myoutput Ecoli.fasta

This means that we dive in file Ecoli.fasta, searching the contig "contig0001" and then extract the sequence on the position 500 to 2000.

Real example: In the example folder there is a fasta (Ecoli.fasta), where the contigs names are like:

  • NODE_1_length_350779_cov_17.5196
  • NODE_2_length_328984_cov_17.2607

So, applying the usage will be:

bash extractSeq.bash NODE_1_length_350779_cov_17.5196 500 1400 - testgene example/Ecoli.fasta

This will give a testgene.fasta with the reverse complementary sequence.

Output

Following the same example, the output is a fasta (testgene.fasta), with the positions in the header indicating if are complement or not:

>NODE_1_length_350779_cov_17.5196 Ecoli.fasta c(500:1400)
TCTTTCGGTGTATTGTCCAGCTCCAGCGGCAGAGCCACGTTGCCAAAAACAGTACGCGAAG
AGAGCAGGTTAAAATGCTGGAAAATCATACCAATCTGGCGGCGAGCTTTGGTCAACTCGGA
TTCTGACAGCGTGGTCAGTTCCTGTCCATCGACCAGCACGCTACCCTCGGTTGGGCGCTCC
AGCAGGTTTACACAACGTATAAGCGTACTCTTACCCGCGCCTGAGGCACCGATAACGCCAT
AAATTTGCCCAGCTGGCACATGCAGGCTGACGTTGTTCAACGCCTGGATGGTGCGGGTGCC
CTGGTGGAACACTTTGGTGATATTCGAAAGTTTTATCATTGATTATTTATTATCGTCATTA
AGTTAGTCGTGGCATTTCGAATGCCTGAAACGGGCAACGCCGTCAATGAAATGGATGTTAA
GGCATCCAGACGTCTAAATCAATCAGGTTTATGCGAAGAGCACTTTCTTGCAGGTCGAAAC
ATGCGATACTAGCGTCACATGCCTTATTAAGGAGCTATAAAAGGTGGCGAAGAGCGTACCC
GCAATTTTTCTTGACCGTGATGGCACCATTAATGTCGATCACGGCTATGTCCATGAGATCG
ATAACTTTGAATTTATCGACGGTGTTATTGACGCCATGCGCGAGCTAAAAAAAATGGGCTT
CGCGCTGGTGGTAGTCACCAACCAGTCTGGCATTGCTCGCGGTAAATTTACTGAAGCACAG
TTTGAAACGCTGACCGAGTGGATGGACTGGTCGCTGGCGGACCGCGATGTCGATCTGGATG
GTATCTATTATTGCCCGCATCATCCGCAGGGTAGTGTTGAAGAGTTTCGCCAGGTCTGCGA
CTGCCGCAAACCACATCCGGGGATGCTTTTGT

WARNINGS

  • Be careful with coordinates, if your contig have 100 nucleotides and you specify the coord 50,101. the script only write to 100.
  • Make sure the coordinates begin from lower value to upper e.g (1 10, 400 425, etc).
  • IUPAC ambiguity codes of the two possible nucleotides are converted as following: R↔Y, K↔M, S and W unchanged.
  • Ambiguity codes of the three possible nucleotides are converted as following: B↔V, D↔H.
  • Upper and lower case is preserved and can be used to mark regions of interest.

External useful tools

check for these tools to extract some useful information from your data:

  • multiGenomicContext: Check the genomic context of several genomes or sequence just providing the GBK files.

  • fetchMyLineage: Return the complete lineage of your organism just providing the genus and species names.

  • extractSeq: Extract and size defined sequence from and specific contig, from and specific genome.

  • plotMyGBK: Plot your GBK in a circular graph with COG categories.

  • pasteTaxID: fetch the taxonomic IDs to your fastas.

  • GGisy: Plot synteny of two sequence (you can use two genomes), and see the identity of the matched regions.

  • getS2: obtain the order parameter to each residue of your simulation.

About

Giving a contig name and positions (start and end), extractSeq can return the corresponding sequence from your fasta.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages