<h1> DBAdapter usage </h1>

This tutorial illustrates the use of epytope to map gene names and retrieve database accessions and genetic or transcript sequences from a database source like BioMart. epytope can connect to a variety of DB sources both online and offline.
Here, we will cover the use of epytope's MartsAdapter as an example of online access and EnsemblAdapter as an example of offline access.

<h2> Chapter 1: The basics </h2>
<br/>
We start by importing the required packages.

In [1]:
from epytope.IO.MartsAdapter import MartsAdapter
from epytope.IO.EnsemblAdapter import EnsemblDB
from epytope.IO.ADBAdapter import EIdentifierTypes
from epytope.IO.ADBAdapter import EAdapterFields
from epytope.Core import Transcript

To begin, we will connect to a BioMart:

When initializing the MartsAdapter, you can specify the URL of the BioMart of your choice by supplying the **`biomart`** argument. If you do not choose a specific BioMart, it defaults to <a href="http://biomart.org">http://biomart.org</a>. Here, however, we will use <a href="http://grch37.ensembl.org">http://grch37.ensembl.org</a>. Please refer to the documentation of your BioMart to find the correct URL.

In [2]:
mart_adapter = MartsAdapter(biomart="http://grch37.ensembl.org")

Now we can start using the BioMart. For a comprehensive list of the methods the adapter implements, refer to the <a href="http://epytope.readthedocs.org/en/latest/epytope.IO.html#module-epytope.IO.MartsAdapter">documentation</a>.

You can fetch many different kinds of sequences with the adapter. We will start with a transcript sequence of the glucagon gene. You have to provide an identifier that the BioMart knows and that identifies the <i>transcript</i>; in our <a href="http://www.ensembl.org/">Ensembl</a> case, it has the form "ENST...".


In [3]:
transcript = mart_adapter.get_transcript_sequence('ENST00000375497', type=EIdentifierTypes.ENSEMBL)
print(type(transcript))

epytope_transcript = Transcript(transcript, 'glucagon', 'ENST00000375497')
epytope_transcript

<class 'str'>


TRANSCRIPT: ENST00000375497
	VARIANTS:
	SEQUENCE: ATGAAAAGCATTTACTTTGTGGCTGGATTATTTGTAATGCTGGTACAAGGCAGCTGGCAACGTTCCCTTCAAGACACAGAGGAGAAATCCAGATCATTCTCAGCTTCCCAGGCAGACCCACTCAGTGATCCTGATCAGATGAACGAGGACAAGCGCCATTCACAGGGCACATTCACCAGTGACTACAGCAAGTATCTGGACTCCAGGCGTGCCCAAGATTTTGTGCAGTGGTTGATGAATACCAAGAGGAACAGGAATAACATTGCCAAACGTCACGATGAATTTGAGAGACATGCTGAAGGGACCTTTACCAGTGATGTAAGTTCTTATTTGGAAGGCCAAGCTGCCAAGGAATTCATTGCTTGGCTGGTGAAAGGCCGAGGAAGGCGAGATTTCCCAGAAGAGGTCGCCATTGTTGAAGAACTTGGCCGCAGACATGCTGATGGTTCTTTCTCTGATGAGATGAACACCATTCTTGATAATCTTGCCGCCAGGGACTTTATAAACTGGTTGATTCAGACCAAAATCACTGACAGGAAATAA (mRNA)

The adapter returns a plain string. You can use this string to construct your <a href="http://epytope.readthedocs.org/en/latest/epytope.Core.html#module-epytope.Core.Transcript">transcript object</a>.

The **`type`** argument designates which type of identifier you are passing. The identifier types your adapter supports are listed in the <a href="http://epytope.readthedocs.org/en/latest/epytope.IO.html">documentation</a>.
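Because the adapter hands back a plain string, it can be worth sanity-checking it before building a transcript object. The following is a minimal sketch, assuming you expect a coding sequence (the glucagon sequence shown above starts with `ATG` and ends with `TAA`). The helper `looks_like_cds` is hypothetical and not part of epytope.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_cds(seq: str) -> bool:
    """Return True if seq has the basic shape of a coding sequence.

    Hypothetical helper for illustration only: it checks the alphabet,
    the reading frame, the start codon and the stop codon.
    """
    seq = seq.upper()
    return (
        len(seq) >= 6                    # at least a start and a stop codon
        and len(seq) % 3 == 0            # whole number of codons
        and set(seq) <= set("ACGT")      # unambiguous DNA letters only
        and seq.startswith("ATG")        # start codon
        and seq[-3:] in STOP_CODONS      # stop codon
    )


# Toy sequences, not taken from any database.
print(looks_like_cds("ATGAAATAA"))
print(looks_like_cds("ATGAAATA"))
```

A check like this only catches obviously malformed input, such as an empty string returned for an unknown identifier. It says nothing about whether the sequence is biologically correct.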


The MartsAdapter also provides a method that returns more information about your transcript of interest than just the sequence. It is called **`get_transcript_information`** and takes the same input as **`get_transcript_sequence`**.

In [4]:
transcript_info = mart_adapter.get_transcript_information('ENST00000375497', type=EIdentifierTypes.ENSEMBL)
transcript_info

{2: 'ATGAAAAGCATTTACTTTGTGGCTGGATTATTTGTAATGCTGGTACAAGGCAGCTGGCAACGTTCCCTTCAAGACACAGAGGAGAAATCCAGATCATTCTCAGCTTCCCAGGCAGACCCACTCAGTGATCCTGATCAGATGAACGAGGACAAGCGCCATTCACAGGGCACATTCACCAGTGACTACAGCAAGTATCTGGACTCCAGGCGTGCCCAAGATTTTGTGCAGTGGTTGATGAATACCAAGAGGAACAGGAATAACATTGCCAAACGTCACGATGAATTTGAGAGACATGCTGAAGGGACCTTTACCAGTGATGTAAGTTCTTATTTGGAAGGCCAAGCTGCCAAGGAATTCATTGCTTGGCTGGTGAAAGGCCGAGGAAGGCGAGATTTCCCAGAAGAGGTCGCCATTGTTGAAGAACTTGGCCGCAGACATGCTGATGGTTCTTTCTCTGATGAGATGAACACCATTCTTGATAATCTTGCCGCCAGGGACTTTATAAACTGGTTGATTCAGACCAAAATCACTGACAGGAAATAA',
 0: '',
 1: '-'}

It returns a dictionary whose keys are defined by the Enum type **`EAdapterFields`**, coding for *GENE=0, STRAND=1, SEQ=2, TRANSID=3, PROTID=4*. That way, you can access the information in a more readable way. For example, you can look up the strand of the transcript like this:

In [5]:
transcript_info[EAdapterFields.STRAND]

'-'

which will tell you that the transcript comes from the reverse strand.
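The integer keys in the printed dictionary and the enum-based lookup refer to the same entries. The sketch below illustrates this with a stand-in enum, `Field`, that mirrors the values listed above. It assumes the real `EAdapterFields` behaves like an `IntEnum`, which matches the output shown here but is not confirmed by this tutorial. The record `info` is a shortened toy copy of the dictionary above.

```python
from enum import IntEnum


class Field(IntEnum):
    """Hypothetical stand-in mirroring the values listed for EAdapterFields."""
    GENE = 0
    STRAND = 1
    SEQ = 2
    TRANSID = 3
    PROTID = 4


# Toy record shaped like the dictionary shown above (sequence shortened).
info = {2: "ATGAAATAA", 0: "", 1: "-"}

# IntEnum members hash and compare like their integer value,
# so they can be used directly as keys into an int-keyed dict.
strand = info[Field.STRAND]

# Fields that are absent from the record can be read safely with .get().
protein_id = info.get(Field.PROTID)

# Relabel the integer keys with their field names for easier reading.
named = {Field(key).name: value for key, value in info.items()}

print(strand)
print(protein_id)
print(named)
```

Using the enum member instead of a bare number keeps the code readable and avoids mistakes if you misremember which integer stands for which field.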

If you are dealing with genomic positions or regions and want to find out whether a particular one is in a gene's coding region, you can use the **`get_gene_by_position`** function. It returns the gene's name:

In [6]:
mart_adapter.get_gene_by_position(17, 7565101, 7565101)

'TP53'
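Conceptually, a position lookup like this is an interval-overlap query: given a chromosome and a start and end coordinate, find a gene whose region overlaps that range. The sketch below shows the idea on an in-memory toy table. It does not reproduce how MartsAdapter queries BioMart. The gene names, the coordinates and the helper `gene_at` are all made up, and returning `None` when nothing overlaps is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class GeneRegion(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


# Made-up regions for illustration; not real annotation data.
REGIONS = [
    GeneRegion("GENE_A", "1", 100, 500),
    GeneRegion("GENE_B", "1", 800, 1200),
]


def gene_at(chrom, start, end, regions=REGIONS) -> Optional[str]:
    """Return the name of the first region overlapping [start, end], else None."""
    for region in regions:
        same_chrom = region.chrom == str(chrom)
        # Two closed intervals overlap if each starts before the other ends.
        overlaps = region.start <= end and start <= region.end
        if same_chrom and overlaps:
            return region.name
    return None


print(gene_at(1, 150, 150))   # a single position inside GENE_A
print(gene_at(1, 600, 700))   # a range between the two toy genes
```

Passing the same value as start and end, as in the call above, queries a single position.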

<h2> Chapter 2: Connecting to offline databases</h2>
<br/>
epytope also supports reading from offline databases, such as the FASTA and DAT files you can download from Ensembl, UniProt, or RefSeq.
To connect, you will have to initialize the corresponding adapter and feed it the location of your database file.

As an example, we will use the EnsemblAdapter. You can get the official sequence resources from Ensembl <a href="http://www.ensembl.org/info/data/ftp/index.html">here</a>. For this tutorial, however, we will use small test excerpts of the Ensembl CDS and protein sequence (FASTA) files.

In [7]:
ed = EnsemblDB()
ed.read_seqs("data/Homo_sapiens.GRCh38.cds.test_stub.fa")
ed.read_seqs("data/Homo_sapiens.GRCh38.pep.test_stub.fa")
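If you are unsure what such a FASTA file contains, it can help to inspect it yourself before loading it into an adapter. The following is a minimal sketch of a FASTA reader that maps the first word of each header line to its sequence. It is not how `EnsemblDB.read_seqs` is implemented. The helper `read_fasta` and the two toy records are hypothetical.

```python
import io


def read_fasta(handle):
    """Return {first word of each header: sequence} for a FASTA file handle."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: sequences may span several lines.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical two-record file; the IDs and sequences are made up.
toy_file = io.StringIO(
    ">TOY0001 cds example\n"
    "ATGGAG\n"
    "AGATAA\n"
    ">TOY0002 cds example\n"
    "ATGCCCTGA\n"
)
records = read_fasta(toy_file)
print(records)
```

For a real file, replace the `io.StringIO` object with `open(path)` in a `with` block.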

Because EnsemblDB implements the ADBAdapter interface, you can do essentially the same as in <a href="DBAdapterUsage.ipynb#-Chapter-1:-The-basics-">Chapter 1</a>, just offline. For example:

In [8]:
ed.get_transcript_information('ENST00000395310', type=EIdentifierTypes.ENSEMBL)

{2: 'ATGAAGTTAAAGGAAGTAGATCGTACAGCCATGCAGGCATGGAGCCCTGCCCAGAATCACCCCATTTACCTAGCAACAGGAACATCTGCTCAGCAATTGGATGCAACATTTAGTACGAATGCTTCCCTTGAGATATTTGAATTAGACCTCTCTGATCCATCCTTGGATATGAAATCTTGTGCCACATTCTCCTCTTCTCACAGGTACCACAAGTTGATTTGGGGGCCTTATAAAATGGATTCCAAAGGAGATGTCTCTGGAGTTCTGATTGCAGGTGGTGAAAATGGAAATATTATTCTCTATGATCCTTCTAAAATTATAGCTGGAGACAAGGAAGTTGTGATTGCCCAGAATGACAAGCATACTGGCCCAGTGAGAGCCTTGGATGTGAACATTTTCCAGACTAATCTGGTAGCTTCTGGTGCTAATGAATCTGAAATCTACATATGGGATCTAAATAATTTTGCAACCCCAATGACACCAGGAGCCAAAACACAGCCGCCAGAAGATATCAGCTGCATTGCATGGAACAGACAAGTTCAGCATATTTTAGCATCAGCCAGTCCCAGTGGCCGGGCCACTGTATGGGATCTTAGAAAAAATGAGCCAATCATCAAAGTCAGTGACCATAGTAACAGAATGCATTGTTCTGGGTTGGCATGGCATCCTGATGTTGCTACTCAGATGGTCCTTGCCTCCGAGGATGACCGGTTACCAGTGATCCAGATGTGGGATCTTCGATTTGCTTCCTCTCCACTTCGTGTCCTGGAAAACCATGCCAGGGGGATTTTGGCAATTGCTTGGAGCATGGCAGATCCTGAATTGTTACTGAGCTGTGGAAAAGATGCTAAGATTCTCTGCTCCAATCCAAACACAGGAGAGGTGTTATATGAACTTCCCACCAACACACAGTGGTGCTTCGATATTCAGTGGTGTCCCCGAAATCCTGCTGTCTTATCAGCTGCTTCGTTTGATGGGCGTATCAGTGTTTATTC

But you can also conduct quick exact searches to find sequence occurrences in the database.
**`search`** will yield the first entry that contains the given sequence, ...

In [9]:
ed.search("GAGAGA")

{'GAGAGA': 'ENST00000348405'}

... whereas **`search_all`** will yield all entries.

In [10]:
ed.search_all("GAGAGA")

{'GAGAGA': 'ENST00000348405,ENST00000513858,ENST00000395310,ENST00000443462,ENST00000509142,ENST00000505472,ENST00000500777,ENST00000508502,ENST00000355196,ENST00000264405,ENST00000505984,ENST00000508479,ENST00000507828,ENST00000512664,ENST00000510167,ENST00000311785,ENST00000448323'}
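The difference between the two calls can be illustrated on a small in-memory collection. The sketch below mimics the result shapes shown above (a dictionary from the query to one ID, or to a comma-separated list of IDs) using plain substring matching. It is a toy stand-in, not EnsemblDB's implementation. The function names, the toy IDs and the empty-dictionary result for a query without hits are assumptions of this sketch.

```python
def search_first(db, query):
    """Return {query: id} for the first sequence containing query, else {}."""
    for seq_id, seq in db.items():
        if query in seq:
            return {query: seq_id}
    return {}


def search_every(db, query):
    """Return {query: 'id1,id2,...'} for all sequences containing query, else {}."""
    hits = [seq_id for seq_id, seq in db.items() if query in seq]
    return {query: ",".join(hits)} if hits else {}


# Made-up records; dicts preserve insertion order in Python 3.7+.
toy_db = {
    "TOY0001": "ATGGAGAGATAA",
    "TOY0002": "ATGCCCTGA",
    "TOY0003": "ATGGAGAGAGAGTGA",
}

first_hit = search_first(toy_db, "GAGAGA")
all_hits = search_every(toy_db, "GAGAGA")
print(first_hit)
print(all_hits)
```

Because the IDs come back as one comma-separated string, calling `.split(",")` on the value gives you a list to iterate over.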