# `GenomeScraper` example

This notebook describes how to use the `GenomeScraper` functor.

`GenomeScraper` only accepts `SeqRecord` objects from the `biopython` library. The functions in `genome` are meant to retrieve `SeqRecord` objects from queries. In this example, however, we use a preloaded genome: _E. coli_ (strain K-12), accession `U00096.3`, stored in `tests/sequences`.

In [4]:
FILE = '../tests/sequences/U00096.gb'

In [10]:
from Bio import SeqIO
from GenomeScraper import GenomeScraper
import pprint
pp = pprint.pprint

## Instantiating `SeqRecord`

In [9]:
with open(FILE, 'r') as file:
    genome = SeqIO.read(file, 'gb')

`GenomeScraper` is agnostic to the file type or method used to retrieve the data, as long as the `SeqRecord` has more than one entry in `features`. If it does not, a `ValueError` is raised.
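
The feature-count requirement can also be checked before a record is handed to `GenomeScraper`. The following is a minimal sketch, not the library's own validation: `require_features` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for `SeqRecord` objects so that the snippet runs without a GenBank file.

```python
from types import SimpleNamespace


def require_features(record, minimum=2):
    """Raise ValueError unless `record.features` has at least `minimum` entries.

    Hypothetical helper mirroring the requirement described above
    (more than one feature); it is not part of GenomeScraper itself.
    """
    count = len(record.features)
    if count < minimum:
        raise ValueError(f"expected at least {minimum} features, found {count}")
    return record


# Stand-ins for SeqRecord: only the `features` attribute matters here.
annotated = SimpleNamespace(features=["source", "gene", "CDS"])
bare = SimpleNamespace(features=["source"])

require_features(annotated)  # passes silently

try:
    require_features(bare)
except ValueError as err:
    message = str(err)
    print(message)
```

With a real record, the same call would be `require_features(genome)` right after `SeqIO.read`.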

## Scraping Genes from Genome

Once the `SeqRecord` is instantiated, it is passed to `GenomeScraper`, which parses and extracts data on all genes.

In [12]:
genes = GenomeScraper(genome)

## Analyzing a Gene

Once all genes have been parsed from `genome`, they are accessible by name.

For example, the _mak_ gene can be retrieved as follows:

In [16]:
mak = genes['mak']
pp(mak.data)

{'codon_start': '1',
 'db_xref': ['UniProtKB/Swiss-Prot:P23917',
             'ASAP:ABE-0001372',
             'ECOCYC:EG11288'],
 'end': 411052,
 'gene': 'mak',
 'gene_synonym': ['ECK0389', 'yajF'],
 'locus_tag': 'b0394',
 'product': 'fructokinase',
 'protein_id': 'AAC73497.2',
 'start': 410143,
 'strand': 1,
 'transl_table': '11',
 'translation': 'MRIGIDLGGTKTEVIALGDAGEQLYRHRLPTPRDDYRQTIETIATLVDMAEQATGQRGTVGMGIPGSISPYTGVVKNANSTWLNGQPFDKDLSARLQREVRLANDANCLAVSEAVDGAAAGAQTVFAVIIGTGCGAGVAFNGRAHIGGNGTAGEWGHNPLPWMDEDELRYREEVPCYCGKQGCIETFISGTGFAMDYRRLSGHALKGSEIIRLVEESDPVAELALRRYELRLAKSLAHVVNILDPDVIVLGGGMSNVDRLYQTVGQLIKQFVFGGECETPVRKAKHGDSSGVRGAAWLWPQE',
 'type': 'CDS'}
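
The `start` and `end` fields give the position of the gene on the genome. As a small worked example, the snippet below derives the length of the _mak_ coding sequence from them. It assumes the coordinates form a half-open interval `[start, end)`, as Biopython feature locations do; whether `GenomeScraper` preserves that convention is an assumption, not something the output above confirms.

```python
# Subset of the mak record, copied from the output above.
mak_data = {'start': 410143, 'end': 411052, 'strand': 1}

# Assumption: half-open interval [start, end), so the length is end - start.
length_nt = mak_data['end'] - mak_data['start']

# A coding sequence should span a whole number of codons.
codons, remainder = divmod(length_nt, 3)

print(length_nt, codons, remainder)
```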


Gene data is accessible via a dict-like interface. For example, to find the UniProt ID to pass to `OntologyScraper`:

In [17]:
mak['db_xref']

['UniProtKB/Swiss-Prot:P23917', 'ASAP:ABE-0001372', 'ECOCYC:EG11288']
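
Each `db_xref` entry has the form `database:identifier`, so the UniProt accession can be separated from its prefix with plain string handling. The following is a minimal sketch, assuming the `UniProtKB/` prefix seen in the output above. `uniprot_accession` is a hypothetical helper, not part of `GenomeScraper`, and the exact form of identifier that `OntologyScraper` expects is not specified in this notebook.

```python
def uniprot_accession(db_xrefs, prefix="UniProtKB/"):
    """Return the first UniProt accession in a list of db_xref strings, or None."""
    for xref in db_xrefs:
        # Split "database:identifier" at the first colon.
        database, _, identifier = xref.partition(":")
        if database.startswith(prefix) and identifier:
            return identifier
    return None


# The list copied from the output above.
xrefs = ['UniProtKB/Swiss-Prot:P23917', 'ASAP:ABE-0001372', 'ECOCYC:EG11288']
accession = uniprot_accession(xrefs)
print(accession)
```

With the scraped gene, the equivalent call would be `uniprot_accession(mak['db_xref'])`.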