# Bio.SeqIO

File Formats
------------
When specifying the file format, use lowercase strings.  The same format names are also used in Bio.AlignIO and include the following:

- abi     - Applied Biosystems' sequencing trace format
- abi-trim - Same as "abi" but with quality trimming using Mott's algorithm
- ace     - Reads the contig sequences from an ACE assembly file.
- cif-atom - Uses Bio.PDB.MMCIFParser to determine the (partial) protein
          sequence as it appears in the structure based on the atomic coordinates.
- cif-seqres - Reads a macromolecular Crystallographic Information File
          (mmCIF) file to determine the complete protein sequence as defined by the
          _pdbx_poly_seq_scheme records.
- **clustal** - A format for multiple sequence alignments, starting with a "CLUSTAL"
                header and aligned sequences in fixed-width blocks.
- **embl**    - The EMBL flat file format. Uses Bio.GenBank internally.
- **fasta**   - The generic sequence file format where each record starts with
          an identifier line starting with a ">" character, followed by
          lines of sequence.
- fasta-2line - Stricter interpretation of the FASTA format using exactly
          two lines per record (no line wrapping).
- **fastq**   - A "FASTA like" format used by Sanger which also stores PHRED
          sequence quality values (with an ASCII offset of 33).
- fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS
- fastq-solexa - Original Solexa/Illumina variant of the FASTQ format which
          encodes Solexa quality scores (not PHRED quality scores) with an
          ASCII offset of 64.
- fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format
          which encodes PHRED quality scores with an ASCII offset of 64
          (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina
          will produce FASTQ files using the standard Sanger encoding.
- gck     - Gene Construction Kit's format.
- **genbank** - The GenBank or GenPept flat file format.
- gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities
- gfa1    - Graphical Fragment Assembly versions 1.x. Only segment lines
          are parsed and all linkage information is ignored.
- gfa2    - Graphical Fragment Assembly version 2.0. Only segment lines are
          parsed and all linkage information is ignored.
- ig      - The IntelliGenetics file format, apparently the same as the
          MASE alignment format.
- imgt    - An EMBL like format from IMGT where the feature tables are more
          indented to allow for longer feature types.
- nib     - UCSC's nib file format for nucleotide sequences, which uses one
          nibble (4 bits) to represent each nucleotide, and stores two nucleotides in
          one byte.
- pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the
          complete protein sequence as it appears in the header (no dependencies).
- pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as
          it appears in the structure based on the atom coordinate section of the
          file (requires NumPy for Bio.PDB).
- phd     - Output from PHRED, used by PHRAP and CONSED for input.
- pir     - A "FASTA like" format introduced by the National Biomedical
          Research Foundation (NBRF) for the Protein Information Resource
          (PIR) database, now part of UniProt.
- seqxml  - SeqXML, simple XML format described in Schmitt et al (2011).
- sff     - Standard Flowgram Format (SFF), typical output from Roche 454.
- sff-trim - Standard Flowgram Format (SFF) with given trimming applied.
- snapgene - SnapGene's native format.
- swiss   - Plain text Swiss-Prot aka UniProt format.
- tab     - Simple two column tab separated sequence files, where each
          line holds a record's identifier and sequence. For example,
          this is used by Agilent's eArray software when saving
          microarray probes in a minimal tab delimited text file.
- qual    - A "FASTA like" format holding PHRED quality values from
          sequencing DNA, but no actual sequences (usually provided
          in separate FASTA files).
- uniprot-xml - The UniProt XML format (replacement for the SwissProt plain
          text format which we call "swiss")
- xdna        - DNA Strider's and SerialCloner's native format.
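
The three FASTQ variants listed above differ only in how the quality line is encoded. The following is a minimal standard-library sketch of that difference, not Biopython's own implementation: `decode_quality` and `solexa_to_phred` are hypothetical helper names, and the quality strings are made-up examples.

```python
import math


def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into integer scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in quality_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# "fastq" / "fastq-sanger": PHRED scores with an ASCII offset of 33
sanger_scores = decode_quality("II5+#", offset=33)

# "fastq-illumina": PHRED scores with an ASCII offset of 64
illumina_scores = decode_quality("hhTJB", offset=64)

# "fastq-solexa": offset 64 as well, but the values are Solexa scores,
# so they need a further conversion before being compared with PHRED scores
solexa_scores = decode_quality("hhTJB", offset=64)
phred_from_solexa = [round(solexa_to_phred(q), 2) for q in solexa_scores]

print(sanger_scores)
print(illumina_scores)
print(phred_from_solexa)
```

The two quality strings were chosen so that they decode to the same scores under their respective offsets, which is why passing the wrong format name to the parser silently shifts every score by 31.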

In [23]:
from Bio import SeqIO


# FASTA

In [25]:
# Reading a FASTA file
for record in SeqIO.parse("data/sequence.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Description: {record.description}")


ID: PA918075.1
Sequence: CCACCACTTAAACGTGGATGTACTTGCTTTGAAACTAAAGAAGTAAGTGCTTCCATGTTTTGGTGATGG
Description: PA918075.1 JP 2019517471-A/4: A Composition and Method of Using miR-302 Precursors as Anti-Cancer Drugs for Treating Human Lung Cancers
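
In the output above, the description repeats the identifier: for FASTA, the ID is the first whitespace-delimited word of the header line, while the description is the whole header line without the leading ">". The sketch below illustrates that split using only the standard library. It is a simplified illustration, not Biopython's parser, and `parse_fasta`, `_make_record`, and the sample text are hypothetical.

```python
def _make_record(header, chunks):
    """Build an (identifier, description, sequence) tuple from one FASTA record."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    # The description keeps the full header line, identifier included.
    return identifier, header, "".join(chunks)


def parse_fasta(lines):
    """Yield (identifier, description, sequence) tuples from an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence lines may be wrapped; they are joined later.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


# Made-up two-record example with a wrapped sequence
sample = ">seq1 hypothetical example record\nACGT\nTTGA\n>seq2\nGGCC\n"
records = list(parse_fasta(sample.splitlines()))

for identifier, description, sequence in records:
    print(f"ID: {identifier}")
    print(f"Description: {description}")
    print(f"Sequence: {sequence}")
```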


# FASTQ

In [27]:
# Reading a FASTQ file
for record in SeqIO.parse("data/sequence.fastq", "fastq"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Quality Scores: {record.letter_annotations['phred_quality']}")


ID: K00271:89:HHWWNBBXX:2:1101:23277:1068
Sequence: NATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAATCTCAGACAACAAATCACAGAGTTAAGTCAGTTTACCGCACAAACTNACCAAGTCGGCGAAACAGAAGGTGGCGAC
Quality Scores: [2, 32, 32, 32, 37, 37, 41, 41, 37, 41, 41, 37, 41, 41, 41, 41, 37, 12, 37, 37, 37, 41, 41, 37, 41, 41, 41, 41, 37, 41, 41, 41, 41, 37, 41, 41, 37, 32, 37, 41, 37, 12, 37, 37, 41, 37, 41, 37, 41, 41, 41, 41, 22, 32, 27, 37, 37, 41, 41, 41, 37, 32, 41, 37, 27, 27, 12, 41, 41, 41, 37, 41, 41, 37, 12, 12, 12, 12, 37, 32, 12, 12, 22, 12, 32, 12, 22, 12, 12, 12, 12, 12, 12, 22, 32, 12, 22, 12, 12, 12, 12, 22, 37, 41, 12, 12, 12, 12, 22, 12, 12, 12, 22, 12, 22, 32, 12, 32, 37, 12, 2, 32, 27, 12, 12, 12, 22, 12, 12, 12, 22, 32, 12, 37, 8, 12, 22, 32, 32, 27, 12, 12, 22, 22, 12, 8, 12, 12, 8, 12]
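
A PHRED score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 2 (as on the leading "N") means the call is more likely wrong than right, while 41 is roughly one error in 12,500 calls. The sketch below applies that formula to the first eight scores printed above. The helper name `phred_to_error_probability` and the cutoff of 20 are illustrative assumptions, not part of Biopython.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality score into an estimated error probability."""
    return 10 ** (-q / 10)


# First eight quality scores from the record printed above
scores = [2, 32, 32, 32, 37, 37, 41, 41]

error_probabilities = [phred_to_error_probability(q) for q in scores]
mean_error = sum(error_probabilities) / len(error_probabilities)

# Positions whose score falls below an (arbitrary, illustrative) cutoff of 20
low_quality_positions = [i for i, q in enumerate(scores) if q < 20]

print(f"Mean error probability: {mean_error:.4f}")
print(f"Low-quality positions: {low_quality_positions}")
```

Averaging the error probabilities rather than the raw scores tends to give a more faithful picture of read quality, because PHRED scores are logarithmic.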


# GenBank

In [8]:
# Reading a GenBank file
for record in SeqIO.parse("data/sequence.gb", "genbank"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Annotations: {record.annotations}")
    for feature in record.features:
        print(f"Feature: {feature}")


ID: LT934502.1
Sequence: ATGTCGCTCACGGTCGTCAGCATGGCGTGCGTTGGGTTCTTCTTGCTGCAGGGGGCCTGGCCACTCATGGGTGGTCAGGACAAACCCTTCCTGTCTGCCCGGCCCAGCACTGTGGTGCCTCGAGGAGGACACGTGGCTCTTCAGTGTCACTATCGTCGTGGGTTTAACAATTTCATGCTGTACAAAGAAGACAGAAGCCACGTTCCCATCTTCCACGGCAGAATATTCCAGGAGAGCTTCATCATGGGCCCTGTGACCCCAGCACATGCAGGGACCTACAGATGTCGGGGTTCACGCCCACACTCCCTCACTGGGTGGTCGGCACCCAGCAACCCCGTGGTGATCATGGTCACAGGAAACCACAGAAAACCTTCCCTCCTGGCCCACCCAGGGCCCCTGCTGAAATCAGGAGAGACAGTCATCCTGCAATGTTGGTCAGATGTCATGTTTGAGCACTTCTTTCTGCACAGAGAGGGGATCTCTGAGGACCCCTCACGCCTCGTTGGACAGATCCATGATGGGGTCTCCAAGGCCAACTTCTCCATCGGTCCCTTGATGCCTGTCCTTGCAGGAACCTACAGATGTTATGGTTCTGTTCCTCACTCCCCCTATCAGTTGTCAGCTCCCAGTGACCCCCTGGACATCGTGATCACAGGTCTATATGAGAAACCTTCTCTCTCAGCCCAGCCGGGCCCCACGGTTCAGGCAGGAGAGAACGTGACCTTGTCCTGTAGCTCCTGGAGCTCCTATGACATCTACCATCTGTCCAGGGAAGGGGAGGCCCATGAACGTAGGCTCCGTGCAGTGCCCAAGGTCAACAGAACATTCCAGGCAGACTTTCCTCTGGGCCCTGCCACCCACGGAGGGACCTACAGATGCTTCGGCTCTTTCCGTGCCCTGCCCTGCGTGTGGTCAAACTCAAGTGACCCACTGCTTGTTTCTGTCACAGGAAACCCTTCAAGTAGTTGGCCTTCA

# EMBL

In [10]:
# Reading an EMBL file
for record in SeqIO.parse("data/sequence.embl", "embl"):
    print(f"ID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence: {record.seq}")
    print(f"Annotations: {record.annotations}")
    for feature in record.features:
        print(f"Feature: {feature.type}, Location: {feature.location}, Qualifiers: {feature.qualifiers}")


ID: PA918075.1
Description: JP 2019517471-A/4: A Composition and Method of Using miR-302 Precursors as Anti-Cancer Drugs for Treating Human Lung Cancers.
Sequence: CCACCACTTAAACGTGGATGTACTTGCTTTGAAACTAAAGAAGTAAGTGCTTCCATGTTTTGGTGATGG
Annotations: {'accessions': ['PA918075'], 'sequence_version': 1, 'topology': 'linear', 'molecule_type': 'RNA', 'data_file_division': 'XXX', 'keywords': ['JP 2019517471-A/4'], 'organism': 'Homo sapiens (human)', 'taxonomy': ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo'], 'references': [Reference(title='A Composition and Method of Using miR-302 Precursors as Anti-Cancer Drugs for Treating Human Lung Cancers', ...)]}
Feature: source, Location: [0:69](+), Qualifiers: {'organism': ['Homo sapiens'], 'mol_type': ['unassigned RNA'], 'db_xref': ['taxon:9606']}


# CLUSTAL

In [29]:
from Bio import SeqIO

# Reading a Clustal alignment file
for record in SeqIO.parse("data/sequence_alignment.aln", "clustal"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")


ID: NR_029653.1
Sequence: CCACCACTTAAACGTGGTTGTACTTGCTTTAGACCTAAGAAAGTAAGTGCTTCCATGTTTTGGTGATGG
ID: NR_049443.1
Sequence: CCACCACTTAAACGTGGATGTACTTGCTTTGAAACTCAAAAAGTAAGTGCTTCCATGTTTTAGTGATGG
ID: PA918075.1
Sequence: CCACCACTTAAACGTGGATGTACTTGCTTTGAAACTAAAGAAGTAAGTGCTTCCATGTTTTGGTGATGG
ID: NR_126584.1
Sequence: CCACCACTTAAACGTGGATGTACTTGCTTTGAAACTAAAAAAGTAAGTGCTTCCATGTTTTGGTGATGG


In [31]:
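# Reading a single-record GenBank file with SeqIO.read, which returns one
# SeqRecord and raises ValueError unless the file holds exactly one record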
record = SeqIO.read("data/sequence.gb", "genbank")

print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Annotations: {record.annotations}")
for feature in record.features:
    print(f"Feature: {feature}")


ID: LT934502.1
Sequence: ATGTCGCTCACGGTCGTCAGCATGGCGTGCGTTGGGTTCTTCTTGCTGCAGGGGGCCTGGCCACTCATGGGTGGTCAGGACAAACCCTTCCTGTCTGCCCGGCCCAGCACTGTGGTGCCTCGAGGAGGACACGTGGCTCTTCAGTGTCACTATCGTCGTGGGTTTAACAATTTCATGCTGTACAAAGAAGACAGAAGCCACGTTCCCATCTTCCACGGCAGAATATTCCAGGAGAGCTTCATCATGGGCCCTGTGACCCCAGCACATGCAGGGACCTACAGATGTCGGGGTTCACGCCCACACTCCCTCACTGGGTGGTCGGCACCCAGCAACCCCGTGGTGATCATGGTCACAGGAAACCACAGAAAACCTTCCCTCCTGGCCCACCCAGGGCCCCTGCTGAAATCAGGAGAGACAGTCATCCTGCAATGTTGGTCAGATGTCATGTTTGAGCACTTCTTTCTGCACAGAGAGGGGATCTCTGAGGACCCCTCACGCCTCGTTGGACAGATCCATGATGGGGTCTCCAAGGCCAACTTCTCCATCGGTCCCTTGATGCCTGTCCTTGCAGGAACCTACAGATGTTATGGTTCTGTTCCTCACTCCCCCTATCAGTTGTCAGCTCCCAGTGACCCCCTGGACATCGTGATCACAGGTCTATATGAGAAACCTTCTCTCTCAGCCCAGCCGGGCCCCACGGTTCAGGCAGGAGAGAACGTGACCTTGTCCTGTAGCTCCTGGAGCTCCTATGACATCTACCATCTGTCCAGGGAAGGGGAGGCCCATGAACGTAGGCTCCGTGCAGTGCCCAAGGTCAACAGAACATTCCAGGCAGACTTTCCTCTGGGCCCTGCCACCCACGGAGGGACCTACAGATGCTTCGGCTCTTTCCGTGCCCTGCCCTGCGTGTGGTCAAACTCAAGTGACCCACTGCTTGTTTCTGTCACAGGAAACCCTTCAAGTAGTTGGCCTTCA