# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 4: BioPython - Entrez E-utilities

1. NCBI - Entrez Databases
2. E-Utils
    - `Entrez.esearch()`
    - `Entrez.esummary()`
    - `Entrez.efetch()`
    - `Entrez.epost()`
    - `Entrez.einfo()`

#### Requirements

- Python 2.7
- `Bio` (BioPython) module
- Miscellaneous Files
    - `./images/ncbi_ids.jpg`

## NCBI - Entrez Databases

- Global Query Cross‐Database Search System
    - Allows metasearch of NCBI health science repository
    - The National Center for Biotechnology Information (NCBI) has maintained GenBank since 1992
    - [http://www.ncbi.nlm.nih.gov/gquery/](http://www.ncbi.nlm.nih.gov/gquery/)
- E-utilities
    - Supported by NCBI to provide a stable interface to the Entrez query and database system
    - Queries are submitted via web URLs and XML-formatted data is returned
    - The `Entrez` module from BioPython provides a programming interface to E-utils
        - Make no more than 3 queries per second (enforced by BioPython)
        - Queries should be accompanied by your email address
        - For large/regular queries consider downloading and accessing a local copy of the database
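
To make the "queries are submitted via web URLs" point concrete, here is a minimal sketch that builds the kind of URL an ESearch request uses. It relies only on the standard library and sends nothing over the network. The base URL and the `db`, `term`, `retmax`, `tool` and `email` parameter names follow the NCBI E-utilities documentation; `build_esearch_url`, the `tool` value and the email address are placeholders invented for this illustration and are not part of BioPython.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(db, term, email, retmax=20, tool="bmi565_demo"):
    """Return an ESearch URL (hypothetical helper; no request is sent)."""
    params = [
        ("db", db),
        ("term", term),
        ("retmax", retmax),
        ("tool", tool),    # placeholder tool name
        ("email", email),  # NCBI asks for a contact address
    ]
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

url = build_esearch_url("nuccore", "sonic", "Your.Name.Here@example.org")
print(url)
```

In practice `Entrez.esearch()` assembles and submits a URL like this for you, so the helper is only useful for seeing what goes over the wire.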

## E-Utils

[http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)

### `Entrez.esearch()`

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch](http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch)

The `Entrez.esearch()` function allows you to search a specific NCBI database for entries that match a specified search term. The function returns a handle to XML results; once parsed with `Entrez.read()`, the results include a list of unique identifiers (UIDs) under the `IdList` key. The type of UID depends on the database searched. By default, only the first 20 UIDs are returned (use the `retmax` parameter to change this).

<img src="./images/ncbi_ids.jpg" align="left"/>

In [1]:
from Bio import Entrez

In [2]:
help(Entrez.esearch)

Help on function esearch in module Bio.Entrez:

esearch(db, term, **keywds)
    ESearch runs an Entrez search and returns a handle to the results.
    
    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
    
    Return a handle to the results which are always in XML format.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> record["Count"] >= 2
    True
    >>> "156535671" in record["IdList"]
    True
    >>> "156535673" in record["Id

In [3]:
## Provide your email address
email = "mooneymi@ohsu.edu"
Entrez.email = email

## Submit a query
handle = Entrez.esearch(db="nuccore", term="sonic")

## Entrez.read() parses XML results
## A dictionary is returned
record = Entrez.read(handle)
record.keys()

[u'Count',
 u'RetMax',
 u'IdList',
 u'TranslationStack',
 u'TranslationSet',
 u'RetStart',
 u'QueryTranslation']

In [4]:
ids = record["IdList"]
ids

['1069367975', '1068953341', '1068410221', '1067363796', '537544658', '523425758', '523425713', '523425712', '523425711', '343403755', '343403754', '300797327', '300797306', '300795276', '237874193', '214010161', '190343019', '90403596', '214829617', '164565443']

In [5]:
record["Count"]

'1654'

In [6]:
handle.close()
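
The query above matched 1654 records but returned only 20 IDs. One way to collect the rest is to page through the results with the `retstart` and `retmax` parameters of ESearch. The sketch below is an assumption-laden illustration rather than a recommended recipe: `batch_starts` and `search_all_ids` are hypothetical helper names, the batch size of 500 is arbitrary, and `Entrez.email` is assumed to have been set already. Note that `record["Count"]` is a string and must be converted with `int()` before it can be used in arithmetic.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, batch_size))

def search_all_ids(db, term, batch_size=500):
    """Collect every UID for a query, one ESearch call per batch.

    Requires network access and assumes Entrez.email is already set.
    """
    from Bio import Entrez  # imported here so batch_starts() works offline

    # First call: only used to learn how many records match
    handle = Entrez.esearch(db=db, term=term)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in batch_starts(total, batch_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=start, retmax=batch_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids

# Offsets needed for the 1654 matches reported above
print(batch_starts(1654, 500))
```

Calling `search_all_ids("nuccore", "sonic")` would then issue one request per offset; BioPython's built-in rate limiting keeps the loop within NCBI's request limit.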

### `Entrez.esummary()`

The `Entrez.esummary()` function provides a document summary for a specified UID. The provided summary is useful for initial filtering of the UID list returned by `Entrez.esearch()`.

#### UIDs Matter!

When searching multiple databases, make sure to use the appropriate UID for the given database. 

For example, <b>Gene ID != GI number</b> (although both are integers).

In [7]:
handle = Entrez.esummary(db="nuccore", id=ids[0])
summary = Entrez.read(handle)
summary

[{'Status': 'live', 'Comment': '  ', 'Caption': 'XM_018269025', 'Title': 'PREDICTED: Xenopus laevis sonic hedgehog protein-like (LOC108719826), mRNA', 'CreateDate': '2016/09/20', 'Extra': 'gi|1069367975|ref|XM_018269025.1|[1069367975]', 'TaxId': 8355, 'ReplacedBy': '', u'Item': [], 'Length': 2160, 'Flags': 512, 'UpdateDate': '2016/09/20', u'Id': '1069367975', 'Gi': 1069367975}]

In [8]:
for k,v in summary[0].items():
    print k+":", v

Status: live
Comment:   
Caption: XM_018269025
Title: PREDICTED: Xenopus laevis sonic hedgehog protein-like (LOC108719826), mRNA
CreateDate: 2016/09/20
Extra: gi|1069367975|ref|XM_018269025.1|[1069367975]
TaxId: 8355
ReplacedBy: 
Item: []
Length: 2160
Flags: 512
UpdateDate: 2016/09/20
Id: 1069367975
Gi: 1069367975


In [9]:
handle.close()
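
As a sketch of the "initial filtering" mentioned above, the function below keeps only the IDs whose summaries pass a few simple checks. The sample dictionary is an abridged copy of the summary printed above; `filter_summaries` and its `taxid` and `min_length` cutoffs are hypothetical and chosen only for illustration.

```python
# One document summary, abridged from the esummary output shown above
summaries = [
    {"Caption": "XM_018269025",
     "Title": "PREDICTED: Xenopus laevis sonic hedgehog protein-like "
              "(LOC108719826), mRNA",
     "TaxId": 8355,
     "Length": 2160,
     "Status": "live",
     "Id": "1069367975"},
]

def filter_summaries(summaries, taxid=None, min_length=0):
    """Return the IDs of live records matching the taxon and length cutoffs."""
    keep = []
    for s in summaries:
        if s["Status"] != "live":
            continue  # skip records that are no longer current
        if taxid is not None and int(s["TaxId"]) != taxid:
            continue  # wrong organism
        if int(s["Length"]) < min_length:
            continue  # sequence too short
        keep.append(s["Id"])
    return keep

print(filter_summaries(summaries, taxid=8355, min_length=1000))
```

The same function can be applied directly to the list returned by `Entrez.read()` on an `Entrez.esummary()` handle, since each element behaves like a dictionary.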

### `Entrez.efetch()`

The `Entrez.efetch()` function retrieves entire records in a specified format. In addition to the `db` and `id` parameters, you can specify the retrieval type (`rettype`) and the retrieval mode (`retmode`).

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_)

In [10]:
## Use Entrez.efetch() to get a fasta record
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_record = handle.read()
print fasta_record

>gi|1069367975|ref|XM_018269025.1| PREDICTED: Xenopus laevis sonic hedgehog protein-like (LOC108719826), mRNA
CCCAGTCTCACAGGCGGCGGCAGCAGCTCCCTCCTATTGCTTCTCTGCTTCTTCTCATTTGTCCCTAATT
ACTGTCTCCCATGTGTTCTGTGAGTGGGGAGCAGCACCCTGGACTTTGTGCCCCTGTCTTGTCTGGGGTC
GGTGGATAGTGGGGTCGGCGGGACGGACAAGTGATCCAGCAGGACAGACACATCCTCTGAGCCTTTCATG
TCATTGGCTTCGCTCGGACGAGATGTCGGTTGCGACTGGAATTCTGCTGTTGGGCTTCACCTGTTCCCTG
CTGATCCCCCCCGGGTTGTCCTGTGGACCTGGCAGAGGAATTGGCAAGAGGAGACACCCCAAGAAACTCA
CCCCTCTCGTCTACAAACAGTTTATCCCCAACGTGGCGGAGAAGACCCTGGGGGCCAGCGGCAGATACGA
AGGAAAGATCGCGAGCAACTCGGATCGCTTTAAAGAATTGACCCCCAATTATAACCCAGATATTGTATTT
AAAGACGAGGAGAACACGGGGGCGGACCGGCTCATGACTCAGAGATGTAAAGACAAACTGAACGCACTCG
CAATCTCCGTGATGAACCAGTGGCCGGGGGTGAAGCTGCGGGTGACGGAGGGGTGGGATGAGGACGGGCA
CCACTTGGAGGAGTCGTTACATTATGAGGGGAGGGCAGTGGACATCACCACGTCGGACCGGGACCGCAGT
AAATACGGAATGTTGGCCCGACTGGCGGTGGAGGCCGGGTTCGACTGGGTCTATTACGAGTCCAAAGCTC
ATATTCACTGTTCGGTCAAAGCAGAGAACTCAGTGGCGGCCAAGTCTGGTGGTTGTTTCCCTGCTGGTGC
CGAGGTGATGGTGGAACTTGGTGGCACCAAAGCGGTGA

In [11]:
handle.close()

In [12]:
## Use Entrez.efetch() to get the same fasta record in XML format
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="xml")

## Use Entrez.read() to parse XML output
fasta_record = Entrez.read(handle)
fasta_record[0].keys()

[u'TSeq_accver',
 u'TSeq_sequence',
 u'TSeq_length',
 u'TSeq_taxid',
 u'TSeq_orgname',
 u'TSeq_gi',
 u'TSeq_seqtype',
 u'TSeq_defline']

In [13]:
fasta_record[0]['TSeq_taxid']

'8355'

In [14]:
handle.close()

#### Downloading Records in Bulk

Multiple IDs can be supplied to `Entrez.efetch()` as a comma separated list. 

In [15]:
print ','.join(ids[0:3])

1069367975,1068953341,1068410221


In [16]:
## Use Entrez.efetch() to get multiple fasta records in one request
handle = Entrez.efetch(db="nuccore", id='1069367975,1068953341,1068410221', rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_records = handle.read()
print fasta_records

>gi|1069367975|ref|XM_018269025.1| PREDICTED: Xenopus laevis sonic hedgehog protein-like (LOC108719826), mRNA
CCCAGTCTCACAGGCGGCGGCAGCAGCTCCCTCCTATTGCTTCTCTGCTTCTTCTCATTTGTCCCTAATT
ACTGTCTCCCATGTGTTCTGTGAGTGGGGAGCAGCACCCTGGACTTTGTGCCCCTGTCTTGTCTGGGGTC
GGTGGATAGTGGGGTCGGCGGGACGGACAAGTGATCCAGCAGGACAGACACATCCTCTGAGCCTTTCATG
TCATTGGCTTCGCTCGGACGAGATGTCGGTTGCGACTGGAATTCTGCTGTTGGGCTTCACCTGTTCCCTG
CTGATCCCCCCCGGGTTGTCCTGTGGACCTGGCAGAGGAATTGGCAAGAGGAGACACCCCAAGAAACTCA
CCCCTCTCGTCTACAAACAGTTTATCCCCAACGTGGCGGAGAAGACCCTGGGGGCCAGCGGCAGATACGA
AGGAAAGATCGCGAGCAACTCGGATCGCTTTAAAGAATTGACCCCCAATTATAACCCAGATATTGTATTT
AAAGACGAGGAGAACACGGGGGCGGACCGGCTCATGACTCAGAGATGTAAAGACAAACTGAACGCACTCG
CAATCTCCGTGATGAACCAGTGGCCGGGGGTGAAGCTGCGGGTGACGGAGGGGTGGGATGAGGACGGGCA
CCACTTGGAGGAGTCGTTACATTATGAGGGGAGGGCAGTGGACATCACCACGTCGGACCGGGACCGCAGT
AAATACGGAATGTTGGCCCGACTGGCGGTGGAGGCCGGGTTCGACTGGGTCTATTACGAGTCCAAAGCTC
ATATTCACTGTTCGGTCAAAGCAGAGAACTCAGTGGCGGCCAAGTCTGGTGGTTGTTTCCCTGCTGGTGC
CGAGGTGATGGTGGAACTTGGTGGCACCAAAGCGGTGA

In [17]:
handle.close()
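
When the ID list is long, it can help to split it into smaller comma-separated strings and fetch one batch at a time. The following is a minimal sketch of that idea; `id_batches` is a hypothetical helper and the batch size of 2 is chosen only to keep the output short. The IDs are the first five returned by the search above.

```python
def id_batches(ids, batch_size):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])

ids = ["1069367975", "1068953341", "1068410221", "1067363796", "537544658"]
for id_string in id_batches(ids, 2):
    print(id_string)
```

Each string produced by the generator can be passed as the `id` argument of `Entrez.efetch()`.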

### `Entrez.epost()`

Alternatively, use the `Entrez.epost()` function to cache a large number of IDs (too many IDs can make the URL-based requests fail). This function uploads the ID list to the NCBI servers and returns a `WebEnv` value and a `QueryKey` value that can be supplied to `Entrez.efetch()` to retrieve the query results.

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_)

In [18]:
## Use Entrez.epost() to cache multiple IDs
handle = Entrez.epost(db="nuccore", id=','.join(ids[0:3]))
epost_results = Entrez.read(handle)
web_env = epost_results['WebEnv']
query_key = epost_results['QueryKey']
handle.close()

## Use the WebEnv and QueryKey values to retrieve
## the query results with Entrez.efetch()
handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text", webenv=web_env, query_key=query_key)
fasta_records = handle.read()
print fasta_records

>gi|1069367975|ref|XM_018269025.1| PREDICTED: Xenopus laevis sonic hedgehog protein-like (LOC108719826), mRNA
CCCAGTCTCACAGGCGGCGGCAGCAGCTCCCTCCTATTGCTTCTCTGCTTCTTCTCATTTGTCCCTAATT
ACTGTCTCCCATGTGTTCTGTGAGTGGGGAGCAGCACCCTGGACTTTGTGCCCCTGTCTTGTCTGGGGTC
GGTGGATAGTGGGGTCGGCGGGACGGACAAGTGATCCAGCAGGACAGACACATCCTCTGAGCCTTTCATG
TCATTGGCTTCGCTCGGACGAGATGTCGGTTGCGACTGGAATTCTGCTGTTGGGCTTCACCTGTTCCCTG
CTGATCCCCCCCGGGTTGTCCTGTGGACCTGGCAGAGGAATTGGCAAGAGGAGACACCCCAAGAAACTCA
CCCCTCTCGTCTACAAACAGTTTATCCCCAACGTGGCGGAGAAGACCCTGGGGGCCAGCGGCAGATACGA
AGGAAAGATCGCGAGCAACTCGGATCGCTTTAAAGAATTGACCCCCAATTATAACCCAGATATTGTATTT
AAAGACGAGGAGAACACGGGGGCGGACCGGCTCATGACTCAGAGATGTAAAGACAAACTGAACGCACTCG
CAATCTCCGTGATGAACCAGTGGCCGGGGGTGAAGCTGCGGGTGACGGAGGGGTGGGATGAGGACGGGCA
CCACTTGGAGGAGTCGTTACATTATGAGGGGAGGGCAGTGGACATCACCACGTCGGACCGGGACCGCAGT
AAATACGGAATGTTGGCCCGACTGGCGGTGGAGGCCGGGTTCGACTGGGTCTATTACGAGTCCAAAGCTC
ATATTCACTGTTCGGTCAAAGCAGAGAACTCAGTGGCGGCCAAGTCTGGTGGTTGTTTCCCTGCTGGTGC
CGAGGTGATGGTGGAACTTGGTGGCACCAAAGCGGTGA

### `Entrez.einfo()`

The `Entrez.einfo()` function can be used to retrieve information about the structure of Entrez databases.

In [19]:
## To list available databases
handle = Entrez.einfo()
result = handle.read()
print result

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>

	<DbName>pubmed</DbName>
	<DbName>protein</DbName>
	<DbName>nuccore</DbName>
	<DbName>nucleotide</DbName>
	<DbName>nucgss</DbName>
	<DbName>nucest</DbName>
	<DbName>structure</DbName>
	<DbName>genome</DbName>
	<DbName>annotinfo</DbName>
	<DbName>assembly</DbName>
	<DbName>bioproject</DbName>
	<DbName>biosample</DbName>
	<DbName>blastdbinfo</DbName>
	<DbName>books</DbName>
	<DbName>cdd</DbName>
	<DbName>clinvar</DbName>
	<DbName>clone</DbName>
	<DbName>gap</DbName>
	<DbName>gapplus</DbName>
	<DbName>grasp</DbName>
	<DbName>dbvar</DbName>
	<DbName>gene</DbName>
	<DbName>gds</DbName>
	<DbName>geoprofiles</DbName>
	<DbName>homologene</DbName>
	<DbName>medgen</DbName>
	<DbName>mesh</DbName>
	<DbName>ncbisearch</DbName>
	<DbName>nlmcatalog</DbName>
	<DbName>omim</DbName>
	<DbName>orgtrack</DbName>
	<DbNa

In [20]:
handle.close()
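
To show where the database names sit in the raw XML, the sketch below parses an abbreviated copy of the output above with the standard library's `xml.etree.ElementTree`. The sample is trimmed to three entries and omits the XML declaration and DOCTYPE line; it is meant only to expose the document structure, and `Entrez.read()` remains the more convenient route.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the einfo output shown above
SAMPLE_XML = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""

root = ET.fromstring(SAMPLE_XML)

# Each database name is the text of a <DbName> element inside <DbList>
db_names = [element.text for element in root.findall("./DbList/DbName")]
print(db_names)
```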

In [21]:
## Or you can parse the XML
handle = Entrez.einfo()
result = Entrez.read(handle)
print result.keys()
result['DbList']

[u'DbList']


['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'unigene', 'gencoll', 'gtr']

In [22]:
handle.close()



By specifying the database name when calling `Entrez.einfo()`, you can retrieve information about that database's searchable fields.

In [23]:
## To get info about a specific database
handle = Entrez.einfo(db="nuccore")
result = Entrez.read(handle)
for field in result['DbInfo']['FieldList']:
    print "%(Name)s: %(Description)s" % field

ALL: All terms from all searchable fields
UID: Unique number assigned to each sequence
FILT: Limits the records
WORD: Free text associated with record
TITL: Words in definition line
KYWD: Nonstandardized terms provided by submitter
AUTH: Author(s) of publication
JOUR: Journal abbreviation of publication
VOL: Volume number of publication
ISS: Issue number of publication
PAGE: Page number(s) of publication
ORGN: Scientific and common names of organism, and all higher levels of taxonomy
ACCN: Accession number of sequence
PACC: Does not include retired secondary accessions
GENE: Name of gene associated with sequence
PROT: Name of protein associated with sequence
ECNO: EC number for enzyme or CAS registry number
PDAT: Date sequence added to GenBank
MDAT: Date of last update
SUBS: CAS chemical name or MEDLINE Substance Name
PROP: Classification by source qualifiers and molecule type
SQID: String identifier for sequence
GPRJ: BioProject
SLEN: Length of sequence
FKEY: Feature annotated on sequ

In [24]:
handle.close()
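
The field names listed above can be used as tags in a search term, in the same way as `opuntia[ORGN]` in the `Entrez.esearch()` help text. As a small sketch, the hypothetical helper `build_term` below joins (value, field) pairs into a single query string; the particular values are chosen only for illustration.

```python
def build_term(clauses):
    """Join (value, field) pairs into an Entrez query string with AND."""
    return " AND ".join("%s[%s]" % (value, field) for value, field in clauses)

# TITL and ORGN are two of the nuccore fields listed above
term = build_term([("sonic", "TITL"), ("Xenopus laevis", "ORGN")])
print(term)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch()`.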

## In-Class Exercises

In [None]:
## Exercise 1.
## Use the BioPython Entrez module to retrieve fasta records
## for 3 RefSeq mRNA sequences for the TP53 gene.
## Use the following search term: 
## "TP53[Gene] AND Homo sapiens[Organism] AND mRNA[Filter] AND Refseq[Filter]"
##
## Remember to provide your email address
##


In [None]:
## Exercise 2.
## Parse the 3 fasta records and save each sequence in
## a separate fasta file.
##


## References

- Python for Bioinformatics, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/Entrez](http://en.wikipedia.org/wiki/Entrez)
- [http://www.ncbi.nlm.nih.gov/books/NBK1058/](http://www.ncbi.nlm.nih.gov/books/NBK1058/)
- [http://www.ncbi.nlm.nih.gov/books/NBK25499/](http://www.ncbi.nlm.nih.gov/books/NBK25499/)
- [http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [http://biopython.org/DIST/docs/api/](http://biopython.org/DIST/docs/api/)

#### Last Updated: 21-Sep-2016