# What information is held in the `.gbff` genome files?

In [5]:
from Bio import SeqIO

In [46]:
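seq = SeqIO.parse('../data/refseq/archaea/GCF_021655615.1_ASM2165561v1_genomic.gbff', 'genbank')
# Count the records in the file. SeqIO.parse returns a one-pass iterator,
# so `seq` is exhausted afterwards, but `record` stays bound to the last record.
count = 0
for record in seq:
    count += 1
count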

1

In [48]:
dir(seq)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 'parse',
 'records',
 'should_close_stream',
 'stream']

In [22]:
[item for item in dir(record) if not item.startswith('__')]

['_per_letter_annotations',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'annotations',
 'dbxrefs',
 'description',
 'features',
 'format',
 'id',
 'letter_annotations',
 'lower',
 'name',
 'reverse_complement',
 'seq',
 'translate',
 'upper']

In [28]:
print(record.id, record.name)

NZ_AP025245.1 NZ_AP025245


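This does not look like an NCBI taxid, which is purely numeric. It looks like the accession of the sequence record (`record.name`), with a version suffix in `record.id`. So what are records, given that this file contains only one? Presumably each record is one sequence in the assembly (a chromosome, plasmid, or contig), and this genome consists of a single sequence.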

In [49]:
record.annotations

{'molecule_type': 'DNA',
 'topology': 'circular',
 'data_file_division': 'CON',
 'date': '02-AUG-2022',
 'accessions': ['NZ_AP025245'],
 'sequence_version': 1,
 'keywords': ['RefSeq'],
 'source': 'Acidianus sp. HS-5',
 'organism': 'Acidianus sp. HS-5',
 'taxonomy': ['Archaea',
  'Crenarchaeota',
  'Thermoprotei',
  'Sulfolobales',
  'Sulfolobaceae',
  'Acidianus'],
 'references': [Reference(title='Complete Genome Sequence of Acidianus sp. Strain HS-5, Isolated from the Unzen Hot Spring in Japan', ...),
  Reference(title='Direct Submission', ...)],
 'comment': 'REFSEQ INFORMATION: The reference sequence is identical to\nAP025245.1.\nAnnotated by DFAST https://dfast.ddbj.nig.ac.jp/\nThe annotation was added by the NCBI Prokaryotic Genome Annotation\nPipeline (PGAP). Information about PGAP can be found here:\nhttps://www.ncbi.nlm.nih.gov/genome/annotation_prok/\nCOMPLETENESS: full length.',
 'structured_comment': OrderedDict([('Genome-Assembly-Data',
               OrderedDict([('Assembly

In [25]:
record.annotations.keys()

dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'keywords', 'source', 'organism', 'taxonomy', 'references', 'comment', 'structured_comment', 'contig'])

The annotations contain information about the sequence type and taxonomy, but I do not see a taxid.

In [17]:
record.seq

Seq('AAAGTGGTAATATTTTCTCTCATTATAGAAACAAAGCTAATAAAGAAAACAGGA...ACT')

Full sequence is easily accessible.

In [18]:
record.dbxrefs

['BioProject:PRJNA224116',
 'BioSample:SAMD00413987',
 'Sequence Read Archive:DRR325693, DRR325694',
 'Assembly:GCF_021655615.1']

Cross-references to other NCBI databases are also given.
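
Each `dbxrefs` entry appears to follow a `Database:identifier(s)` pattern, so it can be split into a lookup table. This is a minimal sketch under that assumption; `parse_dbxrefs` is a hypothetical helper, not part of Biopython, and the input values are copied from the output above.

```python
def parse_dbxrefs(dbxrefs):
    """Turn ['Database:ID1, ID2', ...] into {'Database': ['ID1', 'ID2'], ...}."""
    parsed = {}
    for entry in dbxrefs:
        # Split on the first colon only: the part before it names the database.
        database, _, identifiers = entry.partition(':')
        ids = [i.strip() for i in identifiers.split(',') if i.strip()]
        parsed.setdefault(database.strip(), []).extend(ids)
    return parsed


# Values copied from the record.dbxrefs output above.
dbxrefs = [
    'BioProject:PRJNA224116',
    'BioSample:SAMD00413987',
    'Sequence Read Archive:DRR325693, DRR325694',
    'Assembly:GCF_021655615.1',
]
xrefs = parse_dbxrefs(dbxrefs)
print(xrefs['Assembly'])
```

In the notebook, `parse_dbxrefs(record.dbxrefs)` would give the same table for the parsed record.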

In [19]:
record.description

'Acidianus sp. HS-5 chromosome, complete genome'

Now look at features, which I believe are annotated subsequences of the record.

In [27]:
feature = record.features[0]

In [32]:
dir(feature)

['__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_flip',
 '_get_location_operator',
 '_get_ref',
 '_get_ref_db',
 '_get_strand',
 '_set_location_operator',
 '_set_ref',
 '_set_ref_db',
 '_set_strand',
 '_shift',
 'extract',
 'id',
 'location',
 'location_operator',
 'qualifiers',
 'ref',
 'ref_db',
 'strand',
 'translate',
 'type']

In [34]:
feature.id

'<unknown id>'

That is not helpful.

In [37]:
feature.location

FeatureLocation(ExactPosition(0), ExactPosition(2584028), strand=1)

In [38]:
feature.type

'source'

What types are present?

In [60]:
# Collect the distinct feature types in this record.
{f.type for f in record.features}

{'CDS',
 'gene',
 'ncRNA',
 'rRNA',
 'regulatory',
 'repeat_region',
 'source',
 'tRNA'}
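
Beyond which types exist, it may be useful to know how many features of each type there are. The sketch below uses `collections.Counter`; `count_feature_types` is a hypothetical helper, and the demo objects are stand-ins with made-up counts so the block runs without the genome file.

```python
from collections import Counter
from types import SimpleNamespace


def count_feature_types(features):
    """Count how many features of each type are present."""
    return Counter(f.type for f in features)


# Stand-in objects with made-up counts, for illustration only.
# The function only needs each feature to have a `.type` attribute.
demo_features = [
    SimpleNamespace(type=t) for t in ['source', 'gene', 'CDS', 'gene', 'tRNA']
]
type_counts = count_feature_types(demo_features)
print(type_counts.most_common())
```

In the notebook, `count_feature_types(record.features)` would give the per-type counts for this genome.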

In Lena's implementation, I believe CDS indicated a protein.

In [40]:
feature.qualifiers

OrderedDict([('organism', ['Acidianus sp. HS-5']),
             ('mol_type', ['genomic DNA']),
             ('strain', ['HS-5']),
             ('db_xref', ['taxon:2886040']),
             ('country', ['Japan:Nagasaki, Unzen']),
             ('lat_lon', ['32.7398 N 130.2652 E'])])

The taxid is in there, as the `taxon:` entry under `db_xref`!
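
Pulling the number out of that qualifier could look like the sketch below. `get_taxid` is a hypothetical helper name; it assumes the taxid always appears as a `taxon:<digits>` entry in `db_xref`, which is what this one record shows. A plain dict stands in for `feature.qualifiers`, with values copied from the output above.

```python
def get_taxid(qualifiers):
    """Return the NCBI taxid as an int from 'source' qualifiers, or None if absent."""
    for xref in qualifiers.get('db_xref', []):
        database, _, identifier = xref.partition(':')
        if database == 'taxon' and identifier.isdigit():
            return int(identifier)
    return None


# Qualifiers copied from the 'source' feature shown above.
source_qualifiers = {
    'organism': ['Acidianus sp. HS-5'],
    'mol_type': ['genomic DNA'],
    'strain': ['HS-5'],
    'db_xref': ['taxon:2886040'],
    'country': ['Japan:Nagasaki, Unzen'],
    'lat_lon': ['32.7398 N 130.2652 E'],
}
taxid = get_taxid(source_qualifiers)
print(taxid)
```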

In [72]:
for f in record.features:
    if f.type == 'CDS':
        print(f.qualifiers)
        break

OrderedDict([('locus_tag', ['HS5_RS00005']), ('old_locus_tag', ['HS5_00010']), ('inference', ['COORDINATES: similar to AA sequence:RefSeq:WP_013776240.1']), ('note', ['Derived by automated computational analysis using gene prediction method: Protein Homology.']), ('codon_start', ['1']), ('transl_table', ['11']), ('product', ['DsrE family protein']), ('protein_id', ['WP_236752042.1']), ('db_xref', ['GeneID:70680509']), ('translation', ['MKIVLSVDSEEKIPMSITAATHLSEMPEAEEVEVVYLNGGIAAVTQRNRIAPLLKNEKVKVVACGTSMEARNITKEELAPGVIYVPASLKEIIKRIQEGYIYLVL'])])


The taxid does not appear in the CDS qualifiers; it is only in the feature of type "source".

In [41]:
feature.ref

In [44]:
feature.ref_db

In [45]:
feature.strand

1

## Getting the protein sequence from the records

In [64]:
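busted1 = 0  # CDS features with no 'translation' qualifier
busted2 = 0  # of those, how many also fail to translate from the DNA
for f in record.features:
    if f.type == 'CDS':# and 'product' in f.qualifiers:
        try:
            trans = f.qualifiers['translation']
        except KeyError:
            busted1 += 1
            try:
                # fall back to translating the feature from the genome sequence
                trans = f.translate(record.seq)
            except Exception:  # Biopython could not translate this CDS
                busted2 += 1
                continue
        # note: indexing 'translation' here would raise KeyError if the fallback above ever succeeded
        print(f.type, f.qualifiers['product'], f.qualifiers['translation'], trans)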

CDS ['DsrE family protein'] ['MKIVLSVDSEEKIPMSITAATHLSEMPEAEEVEVVYLNGGIAAVTQRNRIAPLLKNEKVKVVACGTSMEARNITKEELAPGVIYVPASLKEIIKRIQEGYIYLVL'] ['MKIVLSVDSEEKIPMSITAATHLSEMPEAEEVEVVYLNGGIAAVTQRNRIAPLLKNEKVKVVACGTSMEARNITKEELAPGVIYVPASLKEIIKRIQEGYIYLVL']
CDS ['PaREP1 family protein'] ['MEEKLTSPKIDRKAYVRARIIETLDDIDVATNMWIAGRSRNSAGKIFSAVKALLSALVTKNLDKLSNEWYVKRGYNAPTHSLKGISIDLSKLGYAQVENIADKAFLLHDYQYNGFDPDFPKYKKKEEVLHDILIVSNFILNNIKEWFKDEWDSDLDKIYEITLSEVKKLK'] ['MEEKLTSPKIDRKAYVRARIIETLDDIDVATNMWIAGRSRNSAGKIFSAVKALLSALVTKNLDKLSNEWYVKRGYNAPTHSLKGISIDLSKLGYAQVENIADKAFLLHDYQYNGFDPDFPKYKKKEEVLHDILIVSNFILNNIKEWFKDEWDSDLDKIYEITLSEVKKLK']
CDS ['hypothetical protein'] ['MKVVKYYVDVKVELKEGKNADIEKVKLEGLLKRFTKKKEDKEEPKLTVNDHTIELSGIRFRERIGFRNFVDELAEKYEAKCTPLNEANGKITVQCESDKAKISFEANVTRFKRPKKEGEEKEGESKKEEEAKTSGSSQ'] ['MKVVKYYVDVKVELKEGKNADIEKVKLEGLLKRFTKKKEDKEEPKLTVNDHTIELSGIRFRERIGFRNFVDELAEKYEAKCTPLNEANGKITVQCESDKAKISFEANVTRFKRPKKEGEEKEGESKKEEEAKTSGSSQ']
CDS ['hypothetical protein'] ['MILGVLGSDKLVTTTNAVLTEIFTGL

In [65]:
print(busted1, busted2)

45 45


Every CDS whose qualifiers lack a translation (45 here) also fails to translate from the DNA, so we can rely on the qualifier and skip the features that do not have one.

In [68]:
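for f in record.features:
    if f.type == 'CDS':# and 'product' in f.qualifiers:
        try:
            trans = f.qualifiers['translation']  # a list holding one string
        except KeyError:
            continue
        trans2 = f.translate(record.seq)  # a Seq object
        # note: a list never equals a Seq, so this test is always False
        if trans == trans2:
            pass
        else:
            # trans[0] is the whole protein; trans2[0] is only the first residue
            print(trans[0], trans2[0])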

MKIVLSVDSEEKIPMSITAATHLSEMPEAEEVEVVYLNGGIAAVTQRNRIAPLLKNEKVKVVACGTSMEARNITKEELAPGVIYVPASLKEIIKRIQEGYIYLVL M
MEEKLTSPKIDRKAYVRARIIETLDDIDVATNMWIAGRSRNSAGKIFSAVKALLSALVTKNLDKLSNEWYVKRGYNAPTHSLKGISIDLSKLGYAQVENIADKAFLLHDYQYNGFDPDFPKYKKKEEVLHDILIVSNFILNNIKEWFKDEWDSDLDKIYEITLSEVKKLK M
MKVVKYYVDVKVELKEGKNADIEKVKLEGLLKRFTKKKEDKEEPKLTVNDHTIELSGIRFRERIGFRNFVDELAEKYEAKCTPLNEANGKITVQCESDKAKISFEANVTRFKRPKKEGEEKEGESKKEEEAKTSGSSQ M
MILGVLGSDKLVTTTNAVLTEIFTGLRDVEKIVILSEEKSKRDYGGLKDVIKILGIDAEIEEVELGRGLKSWRNKLQSINLDVADITPGRKYMAYSVIAYSKAKNVRYVYIAEEREGYRIFGYIPFNEVKVYNMRDGEEINFDPPKTAKGLPKENKLSVIATPALINIYSLLGKVTIENSFRESTPEEITKPTDDNEELCLLRSGFLRFKEEEEIKKETGSFFIADTNSYIYIGPRLKYLTYSKEYGYRLLASRSIYNELQYHTNSTQKDEKLYRFYMGMESYRKSHTPPLTEENNRFGDMPLIEESKRLKSELPEKLVLVTKDVGVSNTAKSKGISTILLRNEIKGEGNIGEYLNCVKYFTETSIKINEETVATIPKIREYEESVKIKTTKEELNYPYLLSVTENFLKS M
MCLGLILDIIFFIIDIIIPIWNSYNSGKISAYRKGLGKLLYTLGGFLPMSYVLSLIIAIVLGILGYISVSTAVFILSFSDLVFGLEIIVWGVIATYLSAMSTARGGGWKAGIVTAYNAFATIFDAWAYISSFFSNLRDARKAIDSSDFSVIDVIIIFVAALGVGF

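This output is misleading rather than evidence that the computed translation is just the start codon. `trans` is a one-element list, so `trans[0]` is the whole protein string, while `trans2` is a sequence object, so `trans2[0]` is only its first residue (`M`). A list also never compares equal to a sequence, so every CDS falls into the `else` branch. A fair check would compare `trans[0]` with `str(trans2)`. Either way, the qualifier is the simpler source, since it is already present for every CDS that can be translated.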
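
The type mismatch behind that output can be reproduced without the genome file. In this sketch a plain string stands in for the `Seq` returned by `f.translate(record.seq)`, and the protein is a shortened prefix of the first translation above; `same_protein` is a hypothetical helper showing one way to compare like with like.

```python
def same_protein(qualifier_value, computed):
    """Compare a 'translation' qualifier (a one-element list) with a computed translation."""
    # Unwrap the list, convert the computed sequence to a plain string,
    # and ignore a trailing stop symbol in case one is present.
    return qualifier_value[0] == str(computed).rstrip('*')


# Shortened stand-ins for illustration: the first 12 residues of the DsrE protein.
qualifier_value = ['MKIVLSVDSEEK']  # shape of f.qualifiers['translation']
computed = 'MKIVLSVDSEEK'           # plain-string stand-in for the Seq object

print(qualifier_value == computed)               # list vs. sequence: never equal
print(computed[0])                               # indexing a sequence gives one residue
print(same_protein(qualifier_value, computed))   # comparing like with like
```

In the notebook, calling `same_protein(f.qualifiers['translation'], f.translate(record.seq))` inside the loop would show whether the two sources really disagree for any CDS.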

## Proposed extraction
For each record:
- find the feature of type "source" and get the taxid from its `db_xref` qualifier
- record the taxonomy, organism, and references from the annotations, along with the taxid, the original file name, and the record id

For each feature in the record of type CDS that has a translation:
- record the translation and any name, e.g. from the 'product' qualifier
- link it to the taxid
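
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `extract_record` and the output field names are hypothetical, the function relies only on the `.id`, `.annotations`, `.features`, `.type`, and `.qualifiers` attributes seen in this notebook, and the demo record is a stand-in built from values shown above (with a shortened translation and one made-up CDS that lacks a translation).

```python
from types import SimpleNamespace


def extract_record(record, filename):
    """Collect record-level metadata and CDS protein translations from one record."""
    # First pass: find the taxid in the 'source' feature.
    taxid = None
    for feature in record.features:
        if feature.type == 'source':
            for xref in feature.qualifiers.get('db_xref', []):
                if xref.startswith('taxon:'):
                    taxid = int(xref.split(':', 1)[1])
            break

    metadata = {
        'record_id': record.id,
        'filename': filename,
        'taxid': taxid,
        'organism': record.annotations.get('organism'),
        'taxonomy': record.annotations.get('taxonomy', []),
        'references': record.annotations.get('references', []),
    }

    # Second pass: keep only CDS features that carry a translation.
    proteins = []
    for feature in record.features:
        if feature.type == 'CDS' and 'translation' in feature.qualifiers:
            proteins.append({
                'taxid': taxid,
                'product': feature.qualifiers.get('product', [None])[0],
                'translation': feature.qualifiers['translation'][0],
            })
    return metadata, proteins


# Stand-in record for illustration; in the notebook, pass the parsed `record`.
demo_record = SimpleNamespace(
    id='NZ_AP025245.1',
    annotations={'organism': 'Acidianus sp. HS-5',
                 'taxonomy': ['Archaea', 'Crenarchaeota', 'Thermoprotei',
                              'Sulfolobales', 'Sulfolobaceae', 'Acidianus']},
    features=[
        SimpleNamespace(type='source', qualifiers={'db_xref': ['taxon:2886040']}),
        SimpleNamespace(type='CDS', qualifiers={'product': ['DsrE family protein'],
                                                'translation': ['MKIVLSVDSEEK']}),
        # Made-up CDS without a translation, to show that it is skipped.
        SimpleNamespace(type='CDS', qualifiers={'product': ['hypothetical protein']}),
    ],
)
metadata, proteins = extract_record(
    demo_record, 'GCF_021655615.1_ASM2165561v1_genomic.gbff')
print(metadata['taxid'], len(proteins))
```

Looking up the taxid before collecting proteins avoids depending on the "source" feature coming first in `record.features`, although it did in this file.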