sequence_formats

BjornFJohansson edited this page Jan 28, 2024 · 1 revision

FASTA format

The FASTA format is the simplest text format for biological sequences that still allows some metadata to accompany the sequence.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

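The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column, followed by a name that cannot contain spaces. Any text after the first space is treated as a free-text description, as in the second example below: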

>myDNAsequence
agctactactgagtcatcgtgtatgcgtatgatcatctatgcgtagtcgtacgtatctattcgatcgt


>myprotein 255 amino acids
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK

A FASTA file can represent either a nucleotide or protein sequence.
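The rules above are simple enough to sketch in a few lines of Python. The following is a minimal illustration, not the parser used by any particular library: the names `FastaRecord` and `parse_fasta` are hypothetical, and the sketch assumes that blank lines between records can be ignored and that everything after the first whitespace on the header line is a description.

```python
from typing import Iterator, NamedTuple


class FastaRecord(NamedTuple):
    name: str         # text between ">" and the first whitespace
    description: str  # rest of the header line, possibly empty
    sequence: str     # all sequence lines joined together


def _make_record(header: str, chunks: list) -> FastaRecord:
    # Split the header into the name and the optional free-text description.
    parts = header.split(maxsplit=1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(name, description, "".join(chunks))


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield one FastaRecord for each ">" header line found in text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


example = """\
>myDNAsequence
agctactactgagtcatcgtgtatgcgtatgatcatctatgcgtagtcgtacgtatctattcgatcgt

>myprotein 255 amino acids
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
"""

records = list(parse_fasta(example))
for record in records:
    print(record.name, repr(record.description), len(record.sequence))
```

The sketch does not check which letters appear in the sequence, so the same function handles both nucleotide and protein records.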

Sequences are expected to be represented in the standard IUPAC amino acid or nucleic acid codes, with these exceptions:

Lower-case letters are accepted and treated the same as upper-case.

A single hyphen or dash can be used to represent a gap of indeterminate length.

In amino acid sequences, U and * are acceptable letters (see below).

Any numerical digits in the sequence should either be removed or replaced by appropriate letter codes (e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).

The nucleic acid codes are:

    A --> adenosine           M --> A C (amino)
    C --> cytidine            S --> G C (strong)
    G --> guanosine           W --> A T (weak)
    T --> thymidine           B --> G T C
    U --> uridine             D --> G A T
    R --> G A (purine)        H --> A C T
    Y --> T C (pyrimidine)    V --> G C A
    K --> G T (keto)          N --> A G C T (any)
                              -  gap of indeterminate length
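One practical use of the ambiguity codes is searching a sequence for a motif written with them. The sketch below translates such a motif into a regular expression. The names `IUPAC_DNA` and `iupac_to_regex` are hypothetical, and two assumptions are made for simplicity: U is treated as equivalent to T, and the gap character is not supported.

```python
import re

# Each code from the table above, mapped to the bases it can stand for.
# Assumption: U is treated like T so that RNA-style motifs match DNA text.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(motif: str) -> re.Pattern:
    """Compile a motif written in IUPAC codes into a case-insensitive regex."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for codes not in the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts), re.IGNORECASE)


# Illustrative motif: G, A, any base, T, C.
site = iupac_to_regex("GANTC")
match = site.search("agctactactgagtcatcgtgtatgcg")
print(match.group(), match.start())
```

Compiling with `re.IGNORECASE` reflects the rule that lower-case letters are treated the same as upper-case.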

The accepted amino acid codes are:

A ALA alanine                         P PRO proline
B ASX aspartate or asparagine         Q GLN glutamine
C CYS cysteine                        R ARG arginine
D ASP aspartate                       S SER serine
E GLU glutamate                       T THR threonine
F PHE phenylalanine                   U     selenocysteine
G GLY glycine                         V VAL valine
H HIS histidine                       W TRP tryptophan
I ILE isoleucine                      Y TYR tyrosine
K LYS lysine                          Z GLX glutamate or glutamine
L LEU leucine                         X     any
M MET methionine                      *     translation stop
N ASN asparagine                      -     gap of indeterminate length
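The two code tables, together with the rules about case and digits, can be turned into a small validity check. This is only a sketch under stated assumptions: the names `NUCLEIC_CODES`, `PROTEIN_CODES`, `clean_sequence` and `invalid_codes` are hypothetical, and digits and whitespace are removed rather than replaced by N or X.

```python
import string

# Alphabets taken from the two tables above, including the gap character.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVN-")
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def clean_sequence(raw: str) -> str:
    """Remove digits and whitespace, then convert to upper-case."""
    drop = string.digits + string.whitespace
    return raw.translate(str.maketrans("", "", drop)).upper()


def invalid_codes(raw: str, alphabet: set) -> set:
    """Return the characters in raw that are not part of alphabet."""
    return set(clean_sequence(raw)) - alphabet


# A numbered sequence line is reduced to the bare letters.
print(clean_sequence("  1 gaattccggc ctgctgccgg"))

# J is not in the amino acid table, so it is reported.
print(invalid_codes("MNSERJ", PROTEIN_CODES))

# An empty set means every character is an accepted code.
print(invalid_codes("agct-nnry", NUCLEIC_CODES))
```

Because many letters belong to both alphabets, a check like this can reject a sequence but cannot by itself decide whether a valid sequence is nucleotide or protein.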

GenBank format

GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the sequence section is marked by a line containing only "//".

NCBI provides a sample record at https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Example:

LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2  GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 200)
  AUTHORS   Okano,M., Xie,S. and Li,E.
  TITLE     Cloning and characterization of a family of novel mammalian DNA
            (cytosine-5) methyltransferases
  JOURNAL   Nat. Genet. 19 (3), 219-220 (1998)
   PUBMED   9662389
REFERENCE   2  (bases 1 to 200)
  AUTHORS   Xie,S., Okano,M. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-MAY-1998) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
REFERENCE   3  (bases 1 to 200)
  AUTHORS   Okano,M., Chijiwa,T., Sasaki,H. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-NOV-1999) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
  REMARK    Sequence update by submitter
COMMENT     On Nov 18, 1999 this sequence version replaced gi:3327977.
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
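The three markers described above ("LOCUS", "ORIGIN" and "//") are enough to separate a record into its two sections. The following sketch shows one way to do this for a single record. The function name `split_genbank` is hypothetical, the example record is an abridged copy of the one above, and no attempt is made to interpret the annotation lines.

```python
def split_genbank(record: str):
    """Return (annotation_lines, sequence) for one GenBank flat-file record."""
    annotation = []
    sequence_parts = []
    section = None  # None before LOCUS, then "annotation", then "sequence"
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself holds no sequence data
        elif line.strip() == "//":
            break  # end of the record
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # Skip the leading position number, keep the blocks of letters.
            sequence_parts.extend(line.split()[1:])
    return annotation, "".join(sequence_parts)


example = """\
LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
FEATURES             Location/Qualifiers
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
"""

annotation, sequence = split_genbank(example)
print(annotation[0].split()[1], len(sequence))
```

Comparing the length of the extracted sequence with the length stated on the LOCUS line is a quick consistency check for a record.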