# Bitome Knowledgebase: Example Usage

This notebook demonstrates how to use the Bitome knowledgebase.

Requires Python 3.7 and the following third-party packages (all can be installed via `pip install <package>`):
- biopython
- CAI
- matplotlib
- numpy
- pandas
- scipy
- seaborn
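
Before running the notebook, it can help to confirm that each package above is importable. The sketch below is one way to do that with the standard library; the mapping from pip names to module names is an assumption (in particular `Bio` for biopython and `CAI` for the CAI package), and `missing_packages` is a hypothetical helper, not part of Bitome.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
# (assumed mapping; biopython is imported as `Bio`)
REQUIRED = {
    'biopython': 'Bio',
    'CAI': 'CAI',
    'matplotlib': 'matplotlib',
    'numpy': 'numpy',
    'pandas': 'pandas',
    'scipy': 'scipy',
    'seaborn': 'seaborn',
}


def missing_packages(required):
    """Return the pip names whose module cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


# an empty list means every package was found
print(missing_packages(REQUIRED))
```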

## Import the Knowledgebase Class

The following code imports the base class, `Bitome`, which loads and holds the knowledgebase, along with `Path` from the
standard-library `pathlib` module, which keeps file paths portable across operating systems:

In [2]:
from pathlib import Path
import sys
sys.path.append('../bitome-kb/')
from bitome.core import Bitome

## Load the Knowledgebase

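Now we can use the methods of the `Bitome` class to load the data stored in the `data` directory. This may
take a couple of minutes to run to completion. Some warnings about sequence mismatches may appear, but these can be
safely ignored. If loading stops with an `ImportError` about the optional dependency `xlrd`, as in the output below,
install `xlrd` with pip or conda and rerun the cell.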

In [4]:
# instantiate an object of class Bitome and point it to the GenBank record for E. coli K-12 MG1655
bitome = Bitome(Path('../bitome-kb/data', 'NC_000913.3.gb'))

# load all data into the knowledgebase class, specifying that RegulonDB information should be included.
# (default is False to ensure compatibility with GenBank records for other organisms, as RegulonDB is specific to K-12)
bitome.load_data(regulon_db=True)

  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  ...
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'Selenocysteine found in GEM-PRO sequence for {locus_tag}; Genbank translation audit skipped')
  warn(f'No GEM-PRO file found for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')
  warn(f'GEM-PRO sequence and coded sequence are not the same for {locus_tag}')


ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
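
Since the warnings above can run to many lines, one option is to capture them and print a tally instead. The sketch below uses the standard `warnings` module; `run_and_tally_warnings` and `fake_loader` are hypothetical names introduced for illustration, and the locus tags are made up.

```python
import warnings
from collections import Counter


def run_and_tally_warnings(func, *args, key=None, **kwargs):
    """Call func, capture the warnings it emits, and count them by key(warning)."""
    if key is None:
        key = lambda w: str(w.message)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # record every warning, including repeats
        result = func(*args, **kwargs)
    return result, Counter(key(w) for w in caught)


def fake_loader():
    """Hypothetical stand-in for a slow loader that warns once per locus tag."""
    warnings.warn('No GEM-PRO file found for b0001')
    warnings.warn('No GEM-PRO file found for b0002')
    warnings.warn('GEM-PRO sequence and coded sequence are not the same for b0003')
    return 'loaded'


# group messages by the text before the locus tag
result, tally = run_and_tally_warnings(
    fake_loader, key=lambda w: str(w.message).rsplit(' for ', 1)[0]
)
for message, count in tally.items():
    print(f'{count} x {message}')
```

With the real knowledgebase, the call would presumably look like `run_and_tally_warnings(bitome.load_data, regulon_db=True)`.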

## Save and Reload the Knowledgebase

For convenience, the fully loaded Bitome knowledgebase can be dumped to a `pickle` file for easy reloading
if the underlying data has not changed. Once the following cell has been executed, the resulting `bitome.pkl` file
can be loaded instead of running the more time-consuming `load_data` method.

In [8]:
# dump the loaded Bitome object to a file called 'bitome.pkl'
# both the storage directory and the file name of the pickle dump can be changed via keyword argument;
# here, dir_name writes the file to '../bitome-kb/'
bitome.save(dir_name='../bitome-kb/')

# now, let's initiate a new Bitome knowledgebase from that file
bitome_from_file = Bitome.init_from_file(Path('../bitome-kb/', 'bitome.pkl'))

`bitome` and `bitome_from_file` are the same (different objects, but the same underlying data). Let's use `bitome`
moving forward for our examples, but the `init_from_file` route for loading the knowledgebase is preferred (again, as
long as no data has been added or modified).
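
The "load from the pickle if it exists, otherwise build and save" pattern can be sketched generically as follows. This uses only the standard library and a toy dictionary in place of a real knowledgebase; `load_or_build` and `build_toy_kb` are hypothetical helpers, not Bitome methods.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(pickle_path, build):
    """Return the object pickled at pickle_path; build and save it first if the file is missing."""
    pickle_path = Path(pickle_path)
    if pickle_path.exists():
        with pickle_path.open('rb') as handle:
            return pickle.load(handle)
    obj = build()
    with pickle_path.open('wb') as handle:
        pickle.dump(obj, handle)
    return obj


def build_toy_kb():
    """Stand-in for an expensive load step (toy data only)."""
    return {'genes': ['geneA', 'geneB'], 'genbank_id': 'TOY_0001'}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir, 'toy_kb.pkl')
    first = load_or_build(cache, build_toy_kb)   # builds, then writes the pickle
    second = load_or_build(cache, build_toy_kb)  # reads the pickle back

same_data = first == second    # equal contents
same_object = first is second  # but two distinct objects
print(same_data, same_object)
```

As with `bitome` and `bitome_from_file`, the two results hold the same data but are separate objects.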

## Exploring the Bitome 


The Bitome knowledgebase is a collection of heavily linked objects. These objects may be accessed via attributes of the `bitome` object we've instantiated above. An example of such "link-hopping" is given below:

### Simple Example of Link-Hopping

In [4]:
# pull out a gene from the master list of genes in the knowledgebase
# NOTE: index 2000 was chosen arbitrarily; the gene at that index turns out to be dsrA.
# A convenience function for pulling out a gene object by name still needs to be added.
dsrA = bitome.genes[2000]

# print out some features of this gene
print(f'Gene name: {dsrA.name}')
print(f'Gene location: {dsrA.location}')
print(f'Gene sequence: {dsrA.sequence}')
print(f'Type of dsrA object: {type(dsrA)}')

Gene name: dsrA
Gene location: [2025226:2025313](-)
Gene sequence: AACACATCAGATTTCCTGGTGTAACGAATTTTTTAAGTGCTTCTTGCTTAAGCAAGTTTCATCCCGACCCCCTCAGGGTCGGGATTT
Type of dsrA object: <class 'bitome.features.Gene'>


So we've pulled out the dsrA gene. It is represented by an object of type Gene. The Gene object (and other types of features within the Bitome, as we'll see in a moment) has some useful attributes such as its absolute location, its sequence, its name, and links to other related objects.

Note that dsrA is on the reverse strand, but the `sequence` attribute is given in the CODING direction (that is, it is the reverse complement of the reference sequence at that location).

Now, let's say we're interested in where the TSS (or multiple TSSs) for this gene is located. Let's first access any transcription units associated with this gene:

In [5]:
dsrA.transcription_units

[<bitome.features.TranscriptionUnit at 0x1219b4bd0>]

So this particular gene is involved in just one transcription unit; let's pull out some information on it:

In [6]:
dsrA_tu = dsrA.transcription_units[0]

print(f'TU name: {dsrA_tu.name}')
print(f'TU location: {dsrA_tu.location}')
print(f'TU object type: {type(dsrA_tu)}')

TU name: dsrA
TU location: [2025226:2025313](-)
TU object type: <class 'bitome.features.TranscriptionUnit'>


So we're dealing with a single-gene TU on the reverse strand. Let's see what operon it belongs to:

In [7]:
dsrA_tu.operon.name

'dsrA'

We could have answered the question of which operon(s) the gene belongs to with the following one-liner:

In [8]:
[tu.operon.name for tu in dsrA.transcription_units]

['dsrA']

What about promoters? Each TranscriptionUnit object has a promoter attribute, which in turn links to things like TF binding sites, attenuators, and more:

NOTE: not all TUs have an annotated promoter; in those cases, the `tu.promoter` attribute will be `None`.

The Promoter object houses the TSS information:

In [9]:
dsrA_prom = dsrA_tu.promoter

print(f'Promoter name: {dsrA_prom.name}')
print(f'Promoter location: {dsrA_prom.location}')
print(f'Promoter object type: {type(dsrA_prom)}')
print(f'TSS: {dsrA_prom.tss}')

Promoter name: dsrAp
Promoter location: [2025292:2025373](-)
Promoter object type: <class 'bitome.features.Promoter'>
TSS: 2025313


Let's see how we would locate all TSS for a given gene in a one-liner:

In [10]:
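# skip TUs without an annotated promoter (tu.promoter is None for those)
[tu.promoter.tss for tu in dsrA.transcription_units if tu.promoter is not None]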

[2025313]
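
The same hop can be applied across many genes at once. The sketch below uses toy dataclasses that mimic only the attributes seen above (`name`, `transcription_units`, `promoter`, `tss`); the `Toy*` classes and `tss_by_gene` are hypothetical and are not the classes in `bitome.features`. TUs whose promoter is `None` are skipped, per the note above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyPromoter:
    name: str
    tss: int


@dataclass
class ToyTU:
    name: str
    promoter: Optional[ToyPromoter] = None  # None when no promoter is annotated


@dataclass
class ToyGene:
    name: str
    transcription_units: List[ToyTU] = field(default_factory=list)


def tss_by_gene(genes):
    """Map each gene name to the TSS positions of its TUs, skipping TUs without a promoter."""
    return {
        gene.name: [tu.promoter.tss for tu in gene.transcription_units if tu.promoter is not None]
        for gene in genes
    }


toy_genes = [
    ToyGene('geneX', [ToyTU('tuX1', ToyPromoter('tuX1p', 1200)), ToyTU('tuX2')]),
    ToyGene('geneY', [ToyTU('tuY1', ToyPromoter('tuY1p', 3400)),
                      ToyTU('tuY2', ToyPromoter('tuY2p', 3650))]),
]
tss_map = tss_by_gene(toy_genes)
print(tss_map)
```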

### Bitome Table of Contents

The Bitome contains master lists of many different types of objects, all of which can be located on the provided reference sequence. The cell below prints the size of each master list:

In [11]:
print(f'Genes: {len(bitome.genes)}')
print(f'Proteins: {len(bitome.proteins)}')
print(f'Mobile Elements: {len(bitome.mobile_elements)}')
print(f'Repeat Regions: {len(bitome.repeat_regions)}')
print(f'Operons: {len(bitome.operons)}')
print(f'Transcription Units: {len(bitome.transcription_units)}')
print(f'Promoters: {len(bitome.promoters)}')
print(f'Terminators: {len(bitome.terminators)}')
print(f'Attenuators: {len(bitome.attenuators)}')
print(f'Shine-Dalgarnos: {len(bitome.shine_dalgarnos)}')
print(f'Riboswitches: {len(bitome.riboswitches)}')
print(f'Transcription Factors: {len(bitome.transcription_factors)}')
print(f'TF binding sites: {len(bitome.tf_binding_sites)}')
print(f'Regulons: {len(bitome.regulons)}')
print(f'iModulons: {len(bitome.i_modulons)}')

Genes: 4497
Proteins: 4140
Mobile Elements: 49
Repeat Regions: 355
Operons: 2619
Transcription Units: 3560
Promoters: 8631
Terminators: 512
Attenuators: 1466
Shine-Dalgarnos: 179
Riboswitches: 51
Transcription Factors: 224
TF binding sites: 3235
Regulons: 493
iModulons: 61
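
The repeated `print` calls above could also be written as a loop over attribute names. This is a minimal sketch using `getattr` on a toy namespace; `count_features` is a hypothetical helper, and with the real object one would pass `bitome` and attribute names such as `'genes'` or `'promoters'`.

```python
from types import SimpleNamespace


def count_features(kb, attribute_names):
    """Map each attribute name to the length of the list stored under it on kb."""
    return {name: len(getattr(kb, name)) for name in attribute_names}


# toy stand-in for a loaded knowledgebase
toy_kb = SimpleNamespace(genes=['g1', 'g2', 'g3'], operons=['o1'], promoters=[])

counts = count_features(toy_kb, ['genes', 'operons', 'promoters'])
for name, n in counts.items():
    print(f'{name}: {n}')
```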


The Bitome also retains the full sequence, along with the GenBank record and ID:

In [12]:
print(f'GenBank ID: {bitome.genbank_id}')
print(f'Full Sequence (first 200 bps): {bitome.sequence[:200]}')
print(f'GenBank description: {bitome.description}')

GenBank ID: NC_000913.3
Full Sequence (first 200 bps): AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCAT
GenBank description: Escherichia coli str. K-12 substr. MG1655, complete genome


## Utilities

There is no dedicated function for finding a feature with a given name yet, but a list comprehension over the relevant master list does the job:

In [13]:
glpR = [gene for gene in bitome.genes if gene.name == 'glpR'][0]
print(glpR.name)
print(glpR.location)

glpR
[3559847:3560605](-)
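
The lookup above raises an `IndexError` when no feature matches. A small helper that returns `None` instead might look like the sketch below; `find_by_name` is a hypothetical function, not part of Bitome, and it assumes only that each feature has a `name` attribute.

```python
from types import SimpleNamespace


def find_by_name(features, name):
    """Return the first feature whose .name equals name, or None if nothing matches."""
    return next((feature for feature in features if feature.name == name), None)


# toy stand-ins for Gene objects
toy_genes = [SimpleNamespace(name='geneA'), SimpleNamespace(name='geneB')]

hit = find_by_name(toy_genes, 'geneB')
miss = find_by_name(toy_genes, 'geneZ')
print(hit.name, miss)
```

With the real knowledgebase, the call would be `find_by_name(bitome.genes, 'glpR')`.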


### Sequence Location Slicing

This example shows how to extract an arbitrary genomic locus from the main sequence. Let's say we want to pull out the sequence of the 5' UTR for glpR (from above). Note that glpR is on the (-), or reverse, strand, so the "right" end of its location is the 5' end.

In terms of Biopython's `FeatureLocation`, left = "start" and right = "end", REGARDLESS of strand. It's annoying and confusing...

I can write some better utility functions for this sort of thing if desired.

In [14]:
from Bio.SeqFeature import FeatureLocation

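# the left end of the 5' UTR is the START of the gene, which is the "end" of its location on the reverse strand;
# the right end is the TSS of the promoter (just taking the first TU arbitrarily)
# NOTE: no strand is passed here, so extract() returns the forward-strand sequence;
# passing strand=-1 to FeatureLocation would return the reverse complement instead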
FeatureLocation(glpR.location.end.position, glpR.transcription_units[0].promoter.tss).extract(bitome.sequence)

Seq('TTATAAATCCCTGGAATTATTTTCGTTTTCGCGCATTGAGCGAATCAACAAAAG...AGT', IUPACAmbiguousDNA())
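
To make the left/right versus 5'/3' distinction concrete, here is a plain-Python sketch of strand-aware slicing on a toy sequence. It is not Biopython's implementation. It assumes gene bounds are 0-based and half-open while `tss` is 1-based, which is what the dsrA output earlier suggests but is not confirmed here; `extract` and `five_prime_utr_bounds` are hypothetical helpers.

```python
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def extract(sequence, left, right, strand):
    """Slice sequence[left:right]; for strand -1, reverse-complement so the result reads 5' to 3'."""
    segment = sequence[left:right]
    if strand == -1:
        return segment.translate(_COMPLEMENT)[::-1]
    return segment


def five_prime_utr_bounds(gene_left, gene_right, strand, tss):
    """Return (left, right) slice bounds of the 5' UTR.

    Assumes 0-based half-open gene bounds and a 1-based TSS position.
    """
    if strand == 1:
        return tss - 1, gene_left  # TSS lies to the left of the gene
    return gene_right, tss         # TSS lies to the right of the gene


toy_sequence = 'AACCGTTAGCAT'

# reverse-strand gene occupying [2:5], transcribed from a TSS at 1-based position 8
left, right = five_prime_utr_bounds(2, 5, -1, 8)
utr_forward_view = extract(toy_sequence, left, right, 1)   # as it appears on the reference
utr_transcript = extract(toy_sequence, left, right, -1)    # as it reads on the transcript
print(left, right, utr_forward_view, utr_transcript)
```

The forward view corresponds to what the cell above returns for glpR; the transcript view is what passing a strand of -1 would give.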

There are a lot of possibilities for accessing data from the Bitome knowledgebase, so please ask! Also, don't be afraid to dive into the source code (in the `bitome` directory); it is very heavily commented and documented.