# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 4: BioPython - Entrez E-utilities

1. NCBI - Entrez Databases
2. E-Utils
    - `Entrez.esearch()`
    - `Entrez.esummary()`
    - `Entrez.efetch()`
    - `Entrez.epost()`
    - `Entrez.einfo()`

#### Requirements

- Python 2.7 or 3.5
- `Bio` (BioPython) module
- Miscellaneous Files
    - `./images/ncbi_ids.jpg`

In [1]:
from __future__ import print_function, division

## NCBI - Entrez Databases

- Global Query Cross‐Database Search System
    - Allows a metasearch across the NCBI health science repositories
    - The National Center for Biotechnology Information (NCBI) has maintained GenBank since 1992
    - [http://www.ncbi.nlm.nih.gov/gquery/](http://www.ncbi.nlm.nih.gov/gquery/)
- E-utilities
    - Supported by NCBI to provide a stable interface to the Entrez query and database system
    - Queries are submitted via web URLs and XML-formatted data is returned
    - The `Entrez` module from BioPython provides a programming interface to E-utils
        - Make no more than 3 queries per second (enforced by BioPython)
        - Queries should be accompanied by your email address
        - For large/regular queries consider downloading and accessing a local copy of the database
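
To make the "queries are submitted via web URLs" point concrete, here is a minimal sketch (written for Python 3) of what such a URL looks like. The helper name `build_eutils_url` is made up for illustration; in practice the `Entrez` module builds these URLs for you.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service (see the NCBI documentation)
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(tool_name, **params):
    """Return the URL for one E-utility call (hypothetical helper)."""
    # sorted() only makes the parameter order predictable
    query = urlencode(sorted(params.items()))
    return "{}{}.fcgi?{}".format(EUTILS_BASE, tool_name, query)


url = build_eutils_url("esearch", db="nuccore", term="sonic",
                       email="Your.Name.Here@example.org")
print(url)
```

Pasting the printed URL into a browser should show the same XML that `Entrez.esearch()` receives and parses.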

## E-Utils

[http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)

### `Entrez.esearch()`

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch](http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch)

The `Entrez.esearch()` function searches a specific NCBI database for entries that match a search term. It returns a handle to XML results; once parsed with `Entrez.read()`, the results include a list of unique identifiers (UIDs) under the `IdList` key. The type of UID depends on the database searched. By default, only the first 20 UIDs are returned (use the `retmax` parameter to change this).

<img src="./images/ncbi_ids.jpg" align="left"/>

In [2]:
from Bio import Entrez

In [3]:
help(Entrez.esearch)

Help on function esearch in module Bio.Entrez:

esearch(db, term, **keywds)
    Run an Entrez search and return a handle to the results.
    
    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
    
    Return a handle to the results which are always in XML format.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", idtype="acc")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> int(record["Count"]) >= 2
    True
    >>> "EF590893.1" in record["IdList"]
    True
    >>> "EF590892.1" in

In [4]:
## Provide your email address
email = "mooneymi@ohsu.edu"
Entrez.email = email

## Submit a query
handle = Entrez.esearch(db="nuccore", term="sonic")

## Entrez.read() parses XML results
## A dictionary is returned
record = Entrez.read(handle)
record.keys()

[u'Count',
 u'RetMax',
 u'IdList',
 u'TranslationStack',
 u'TranslationSet',
 u'RetStart',
 u'QueryTranslation']

In [5]:
ids = record["IdList"]
ids

['1465623769', '1463294413', '1128611434', '999808998', '998429153', '922304372', '821595496', '821324931', '651914274', '341865554', '312434022', '263190679', '229892344', '224809486', '190343019', '169790772', '167736374', '167736372', '124248523', '57524605']

In [6]:
record["Count"]

'2115'

In [7]:
handle.close()
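
The search above matched 2115 records but returned only the first 20 UIDs. One way to collect the rest is to repeat the search with different `retstart` offsets. The sketch below only computes those offsets; `page_starts` is a made-up helper name and the page size of 500 is an arbitrary choice.

```python
def page_starts(total_count, page_size):
    """Return the retstart offsets needed to cover total_count records."""
    return list(range(0, total_count, page_size))


# record["Count"] is returned as a string, so convert it first
total = int("2115")
starts = page_starts(total, 500)
print(starts)  # [0, 500, 1000, 1500, 2000]

# Each offset would then be passed to a separate search, for example:
# handle = Entrez.esearch(db="nuccore", term="sonic", retstart=start, retmax=500)
```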

### `Entrez.esummary()`

The `Entrez.esummary()` function provides a document summary for a specified UID. The provided summary is useful for initial filtering of the UID list returned by `Entrez.esearch()`.

#### UIDs Matter!

When searching multiple databases, make sure to use the appropriate UID for the given database. 

For example, <b>Gene ID != GI number</b> (although both are integers).

In [8]:
handle = Entrez.esummary(db="nuccore", id=ids[0])
summary = Entrez.read(handle)
summary

[DictElement({'Status': 'live', 'Comment': '  ', 'Caption': 'XM_011165255', 'AccessionVersion': 'XM_011165255.2', 'Title': 'PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA', 'CreateDate': '2015/01/27', 'Extra': 'gi|1465623769|ref|XM_011165255.2|[1465623769]', 'TaxId': IntegerElement(13686, attributes={}), 'ReplacedBy': '', u'Item': [], 'Length': IntegerElement(3027, attributes={}), 'Flags': IntegerElement(512, attributes={}), 'UpdateDate': '2018/08/30', u'Id': '1465623769', 'Gi': IntegerElement(1465623769, attributes={})}, attributes={})]

In [9]:
for k,v in summary[0].items():
    print(k+":", v)

Status: live
Comment:   
Caption: XM_011165255
AccessionVersion: XM_011165255.2
Title: PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA
CreateDate: 2015/01/27
Extra: gi|1465623769|ref|XM_011165255.2|[1465623769]
TaxId: 13686
ReplacedBy: 
Item: []
Length: 3027
Flags: 512
UpdateDate: 2018/08/30
Id: 1465623769
Gi: 1465623769


In [10]:
handle.close()
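
As a rough illustration of filtering on summary fields, the sketch below checks the `Length` and `Title` values of one summary. The function name `keep_summary` and its thresholds are assumptions made for this example; the dictionary copies three fields from the summary printed above.

```python
def keep_summary(summary, min_length=0, skip_predicted=False):
    """Decide whether one ESummary record passes simple filters."""
    if int(summary["Length"]) < min_length:
        return False
    if skip_predicted and summary["Title"].startswith("PREDICTED"):
        return False
    return True


# A plain-dict copy of three fields from the summary printed above
example = {
    "Id": "1465623769",
    "Title": "PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA",
    "Length": 3027,
}
print(keep_summary(example, min_length=1000))      # True
print(keep_summary(example, skip_predicted=True))  # False
```

The same function could be applied to each element of the list returned by `Entrez.read()` to decide which UIDs are worth fetching in full.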

### `Entrez.efetch()`

The `Entrez.efetch()` function retrieves entire records in a specified format. In addition to the `db` and `id` parameters, you can specify the retrieval type (`rettype`) and the retrieval mode (`retmode`).

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_)

In [11]:
## Use Entrez.efetch() to get a fasta record
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_record = handle.read()
print(fasta_record)

>XM_011165255.2 PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA
TTTGCCATAGGGTGGGGGCAGGAAGAGGGGTACTCCTGGACCTGGTCAGGCCAAAGAAACAACGCAGGAG
TACCCACCGAAACCCATCCTGCATTTCCCACCACCGGCTCGTCCGGAAAACGTGCTAGCTTCCGGACGTC
GTCCGAGTTCAGTCAAGTGAGTCGACTCGCACGTACCTCCTCTCCGACGAAAACGATTGAGCAGAGGGTT
CCGCTCCGGAAAGGCGGGAGTGTTGGCGCGTGCCGACGTGAGAATTTAAAGAGCCGCCCGGTACCGCTAG
AGGAGAGAATCGGAGTCGATCGGAGTCGAAGAAACCCGGCTTTGAATGATTCTCCAATCGAATCCGACGA
CGCTTAGGATGTAGAAAGATCCGCCCTCGTAGACTCTTCGACTTGGTACCATATCTGTCTGATAATATCC
CGGTGACACATGTCGGCTCCTTTCGGTATTCACGGAAAAGAAAAACGGAATCTTTTTCCAATTCCTCGGG
AGGAGGTTGTAGCTTCGCGTCAATTCGTCACGATTGATCGGTCGCTTGTTTTCCGTTCGAGTTCTCCCAG
TGAACCATCATCGTCATCATCATCATCGTCATCGGTTTTGGTCAGTCGCGCGTGGAGTTTATCAACGCAG
CTACGGGGATCACATCGGTGATAAGAGTGCGTGCACGGGGGGGTGTGTGTGATGCCCTGGTTATGGTTAA
GCGTGAGTGAGGTCAGTGGCAAGAGGCGGAGGAGAATAATTGGTAGCAAAACGTCAAGCGGATTGTCACC
CTCATCGTCGTTGCCGCCGTCGTCGACGTCGTCGTCGTCGTCGTCGTCACTGTCGTTCTGCCAGCCATTC
TGGAGACGACGAGAGCAGTGACCGCGCAAGGTCGACCTCGAGGCAGAGGCACGCGA

In [12]:
handle.close()

In [13]:
## Use Entrez.efetch() to get the same fasta record as XML
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="xml")

## Use Entrez.read() to parse XML output
fasta_record = Entrez.read(handle)
fasta_record[0].keys()

[u'TSeq_accver',
 u'TSeq_sequence',
 u'TSeq_length',
 u'TSeq_taxid',
 u'TSeq_orgname',
 u'TSeq_seqtype',
 u'TSeq_defline']

In [14]:
fasta_record[0]['TSeq_taxid']

'13686'

In [15]:
fasta_record[0]['TSeq_sequence']

'TTTGCCATAGGGTGGGGGCAGGAAGAGGGGTACTCCTGGACCTGGTCAGGCCAAAGAAACAACGCAGGAGTACCCACCGAAACCCATCCTGCATTTCCCACCACCGGCTCGTCCGGAAAACGTGCTAGCTTCCGGACGTCGTCCGAGTTCAGTCAAGTGAGTCGACTCGCACGTACCTCCTCTCCGACGAAAACGATTGAGCAGAGGGTTCCGCTCCGGAAAGGCGGGAGTGTTGGCGCGTGCCGACGTGAGAATTTAAAGAGCCGCCCGGTACCGCTAGAGGAGAGAATCGGAGTCGATCGGAGTCGAAGAAACCCGGCTTTGAATGATTCTCCAATCGAATCCGACGACGCTTAGGATGTAGAAAGATCCGCCCTCGTAGACTCTTCGACTTGGTACCATATCTGTCTGATAATATCCCGGTGACACATGTCGGCTCCTTTCGGTATTCACGGAAAAGAAAAACGGAATCTTTTTCCAATTCCTCGGGAGGAGGTTGTAGCTTCGCGTCAATTCGTCACGATTGATCGGTCGCTTGTTTTCCGTTCGAGTTCTCCCAGTGAACCATCATCGTCATCATCATCATCGTCATCGGTTTTGGTCAGTCGCGCGTGGAGTTTATCAACGCAGCTACGGGGATCACATCGGTGATAAGAGTGCGTGCACGGGGGGGTGTGTGTGATGCCCTGGTTATGGTTAAGCGTGAGTGAGGTCAGTGGCAAGAGGCGGAGGAGAATAATTGGTAGCAAAACGTCAAGCGGATTGTCACCCTCATCGTCGTTGCCGCCGTCGTCGACGTCGTCGTCGTCGTCGTCGTCACTGTCGTTCTGCCAGCCATTCTGGAGACGACGAGAGCAGTGACCGCGCAAGGTCGACCTCGAGGCAGAGGCACGCGATTATGGTGCACCATCATCATCGTCGCCGCGGCCTCCAGGCGACCCACCACTGTGCCATTGGGAGTCGCGTAGGCACATGTACCTCGCTCTTCCAGGTTTTCCT

In [16]:
handle.close()
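
The cells above close each handle in a separate step. A possible alternative, sketched here for Python 3, wraps the fetch in a small function that always closes the handle. Both `fetch_text` and `OfflineEntrez` are made-up names: the stand-in class returns placeholder text (not real sequence data) so the sketch runs without a network connection.

```python
import io
from contextlib import closing


def fetch_text(entrez, db, uid, rettype):
    """Fetch one record as text and always close the handle."""
    handle = entrez.efetch(db=db, id=uid, rettype=rettype, retmode="text")
    with closing(handle):
        return handle.read()


class OfflineEntrez(object):
    """Hypothetical stand-in for Bio.Entrez so the sketch runs offline."""

    def efetch(self, **kwargs):
        # Placeholder text only; a real call returns the actual record
        return io.StringIO(">{id} placeholder record\nACGT\n".format(**kwargs))


text = fetch_text(OfflineEntrez(), "nuccore", "1465623769", "fasta")
print(text)

# With Biopython and a network connection:
# from Bio import Entrez
# Entrez.email = "Your.Name.Here@example.org"
# text = fetch_text(Entrez, "nuccore", "1465623769", "fasta")
```

Passing the module in as an argument is a design choice made here so the function can be tried offline; calling `Entrez.efetch()` directly inside the function would work just as well.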

#### Downloading Records in Bulk

Multiple IDs can be supplied to `Entrez.efetch()` as a comma-separated list.

In [17]:
print(','.join(ids[0:3]))

1465623769,1463294413,1128611434


In [18]:
## Use Entrez.efetch() to get multiple fasta records
handle = Entrez.efetch(db="nuccore", id=','.join(ids[0:3]), rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_records = handle.read()
print(fasta_records)

>XM_011165255.2 PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA
TTTGCCATAGGGTGGGGGCAGGAAGAGGGGTACTCCTGGACCTGGTCAGGCCAAAGAAACAACGCAGGAG
TACCCACCGAAACCCATCCTGCATTTCCCACCACCGGCTCGTCCGGAAAACGTGCTAGCTTCCGGACGTC
GTCCGAGTTCAGTCAAGTGAGTCGACTCGCACGTACCTCCTCTCCGACGAAAACGATTGAGCAGAGGGTT
CCGCTCCGGAAAGGCGGGAGTGTTGGCGCGTGCCGACGTGAGAATTTAAAGAGCCGCCCGGTACCGCTAG
AGGAGAGAATCGGAGTCGATCGGAGTCGAAGAAACCCGGCTTTGAATGATTCTCCAATCGAATCCGACGA
CGCTTAGGATGTAGAAAGATCCGCCCTCGTAGACTCTTCGACTTGGTACCATATCTGTCTGATAATATCC
CGGTGACACATGTCGGCTCCTTTCGGTATTCACGGAAAAGAAAAACGGAATCTTTTTCCAATTCCTCGGG
AGGAGGTTGTAGCTTCGCGTCAATTCGTCACGATTGATCGGTCGCTTGTTTTCCGTTCGAGTTCTCCCAG
TGAACCATCATCGTCATCATCATCATCGTCATCGGTTTTGGTCAGTCGCGCGTGGAGTTTATCAACGCAG
CTACGGGGATCACATCGGTGATAAGAGTGCGTGCACGGGGGGGTGTGTGTGATGCCCTGGTTATGGTTAA
GCGTGAGTGAGGTCAGTGGCAAGAGGCGGAGGAGAATAATTGGTAGCAAAACGTCAAGCGGATTGTCACC
CTCATCGTCGTTGCCGCCGTCGTCGACGTCGTCGTCGTCGTCGTCGTCACTGTCGTTCTGCCAGCCATTC
TGGAGACGACGAGAGCAGTGACCGCGCAAGGTCGACCTCGAGGCAGAGGCACGCGA

In [19]:
handle.close()
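
When the ID list is long, one simple approach is to split it into smaller comma-separated batches and fetch each batch in turn. The sketch below shows only the splitting step; `id_batches` is a made-up helper name, the batch size of 2 is chosen just to keep the output short, and the IDs are the first five from the search above.

```python
def id_batches(id_list, batch_size):
    """Split a list of UIDs into comma-separated strings of at most batch_size IDs."""
    return [",".join(id_list[i:i + batch_size])
            for i in range(0, len(id_list), batch_size)]


example_ids = ['1465623769', '1463294413', '1128611434', '999808998', '998429153']
batches = id_batches(example_ids, 2)
print(batches)
# ['1465623769,1463294413', '1128611434,999808998', '998429153']

# Each string could then be passed as the id argument, for example:
# handle = Entrez.efetch(db="nuccore", id=batches[0], rettype="fasta", retmode="text")
```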

### `Entrez.epost()`

Alternatively, use the `Entrez.epost()` function to cache a large number of IDs (too many IDs can make the URL-based requests fail). This function uploads the ID list to the NCBI servers and returns a `WebEnv` value and a `QueryKey` value that can be supplied to `Entrez.efetch()` to retrieve the query results.

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_)

In [20]:
## Use Entrez.epost() to cache multiple IDs
handle = Entrez.epost(db="nuccore", id=','.join(ids[0:3]))
epost_results = Entrez.read(handle)
web_env = epost_results['WebEnv']
query_key = epost_results['QueryKey']
handle.close()

## Use the WebEnv and QueryKey values to retrieve
## the query results with Entrez.efetch()
handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text", webenv=web_env, query_key=query_key)
fasta_records = handle.read()
print(fasta_records)

>XM_011165255.2 PREDICTED: Solenopsis invicta sonic hedgehog protein A (LOC105198519), mRNA
TTTGCCATAGGGTGGGGGCAGGAAGAGGGGTACTCCTGGACCTGGTCAGGCCAAAGAAACAACGCAGGAG
TACCCACCGAAACCCATCCTGCATTTCCCACCACCGGCTCGTCCGGAAAACGTGCTAGCTTCCGGACGTC
GTCCGAGTTCAGTCAAGTGAGTCGACTCGCACGTACCTCCTCTCCGACGAAAACGATTGAGCAGAGGGTT
CCGCTCCGGAAAGGCGGGAGTGTTGGCGCGTGCCGACGTGAGAATTTAAAGAGCCGCCCGGTACCGCTAG
AGGAGAGAATCGGAGTCGATCGGAGTCGAAGAAACCCGGCTTTGAATGATTCTCCAATCGAATCCGACGA
CGCTTAGGATGTAGAAAGATCCGCCCTCGTAGACTCTTCGACTTGGTACCATATCTGTCTGATAATATCC
CGGTGACACATGTCGGCTCCTTTCGGTATTCACGGAAAAGAAAAACGGAATCTTTTTCCAATTCCTCGGG
AGGAGGTTGTAGCTTCGCGTCAATTCGTCACGATTGATCGGTCGCTTGTTTTCCGTTCGAGTTCTCCCAG
TGAACCATCATCGTCATCATCATCATCGTCATCGGTTTTGGTCAGTCGCGCGTGGAGTTTATCAACGCAG
CTACGGGGATCACATCGGTGATAAGAGTGCGTGCACGGGGGGGTGTGTGTGATGCCCTGGTTATGGTTAA
GCGTGAGTGAGGTCAGTGGCAAGAGGCGGAGGAGAATAATTGGTAGCAAAACGTCAAGCGGATTGTCACC
CTCATCGTCGTTGCCGCCGTCGTCGACGTCGTCGTCGTCGTCGTCGTCACTGTCGTTCTGCCAGCCATTC
TGGAGACGACGAGAGCAGTGACCGCGCAAGGTCGACCTCGAGGCAGAGGCACGCGA

### `Entrez.einfo()`

The `Entrez.einfo()` function can be used to retrieve information about the structure of Entrez databases.

In [21]:
## To list available databases
handle = Entrez.einfo()
result = handle.read()
print(result)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>

	<DbName>pubmed</DbName>
	<DbName>protein</DbName>
	<DbName>nuccore</DbName>
	<DbName>ipg</DbName>
	<DbName>nucleotide</DbName>
	<DbName>nucgss</DbName>
	<DbName>nucest</DbName>
	<DbName>structure</DbName>
	<DbName>sparcle</DbName>
	<DbName>genome</DbName>
	<DbName>annotinfo</DbName>
	<DbName>assembly</DbName>
	<DbName>bioproject</DbName>
	<DbName>biosample</DbName>
	<DbName>blastdbinfo</DbName>
	<DbName>books</DbName>
	<DbName>cdd</DbName>
	<DbName>clinvar</DbName>
	<DbName>clone</DbName>
	<DbName>gap</DbName>
	<DbName>gapplus</DbName>
	<DbName>grasp</DbName>
	<DbName>dbvar</DbName>
	<DbName>gene</DbName>
	<DbName>gds</DbName>
	<DbName>geoprofiles</DbName>
	<DbName>homologene</DbName>
	<DbName>medgen</DbName>
	<DbName>mesh</DbName>
	<DbName>ncbisearch</DbName>
	<DbName>nlmcatalog</DbName>
	<DbName

In [22]:
handle.close()

In [23]:
## Or you can parse the XML
handle = Entrez.einfo()
result = Entrez.read(handle)
print(result.keys())
result['DbList']

[u'DbList']


['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']

In [24]:
handle.close()



To retrieve information about the searchable fields of a single database, pass the database name to `Entrez.einfo()`.

In [25]:
## To get info about a specific database
handle = Entrez.einfo(db="nuccore")
result = Entrez.read(handle)
for field in result['DbInfo']['FieldList']:
    print("%(Name)s: %(Description)s" % field)

ALL: All terms from all searchable fields
UID: Unique number assigned to each sequence
FILT: Limits the records
WORD: Free text associated with record
TITL: Words in definition line
KYWD: Nonstandardized terms provided by submitter
AUTH: Author(s) of publication
JOUR: Journal abbreviation of publication
VOL: Volume number of publication
ISS: Issue number of publication
PAGE: Page number(s) of publication
ORGN: Scientific and common names of organism, and all higher levels of taxonomy
ACCN: Accession number of sequence
PACC: Does not include retired secondary accessions
GENE: Name of gene associated with sequence
PROT: Name of protein associated with sequence
ECNO: EC number for enzyme or CAS registry number
PDAT: Date sequence added to GenBank
MDAT: Date of last update
SUBS: CAS chemical name or MEDLINE Substance Name
PROP: Classification by source qualifiers and molecule type
SQID: String identifier for sequence
GPRJ: BioProject
SLEN: Length of sequence
FKEY: Feature annotated on sequ

In [26]:
result.keys()

[u'DbInfo']

In [27]:
result['DbInfo'].keys()

[u'Count',
 u'LastUpdate',
 u'MenuName',
 u'Description',
 u'LinkList',
 u'DbBuild',
 u'FieldList',
 u'DbName']

In [28]:
handle.close()
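
The field names listed above (for example `ORGN` and `GENE`) can be used as tags in a search term, as in the `opuntia[ORGN]` term from the `esearch` help text. The sketch below assembles such a term from (field, value) pairs; `build_term` is a made-up helper name, and joining with `AND` is just one possible choice.

```python
def build_term(*field_values):
    """Join (field, value) pairs into an Entrez search term."""
    return " AND ".join("{}[{}]".format(value, field)
                        for field, value in field_values)


term = build_term(("ORGN", "opuntia"), ("GENE", "accD"))
print(term)  # opuntia[ORGN] AND accD[GENE]

# The term could then be passed to a search, for example:
# handle = Entrez.esearch(db="nuccore", term=term)
```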

## In-Class Exercises

In [29]:
## Exercise 1.
## Use the BioPython Entrez module to retrieve fasta records
## for 3 RefSeq mRNA sequences for the TP53 gene.
## Use the following search term: 
## "TP53[Gene] AND Homo sapiens[Organism] AND mRNA[Filter] AND Refseq[Filter]"
##
## Remember to provide your email address
##


In [30]:
## Exercise 2.
## Parse the 3 fasta records and save each sequence in
## a separate fasta file.
##


## References

- Python for Bioinformatics, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/Entrez](http://en.wikipedia.org/wiki/Entrez)
- [http://www.ncbi.nlm.nih.gov/books/NBK1058/](http://www.ncbi.nlm.nih.gov/books/NBK1058/)
- [http://www.ncbi.nlm.nih.gov/books/NBK25499/](http://www.ncbi.nlm.nih.gov/books/NBK25499/)
- [http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [http://biopython.org/DIST/docs/api/](http://biopython.org/DIST/docs/api/)

#### Last Updated: 5-Sep-2018