# Accessing NCBI databases with Biopython

We will look at how to access databases at the National Center for Biotechnology Information (NCBI). We will discuss not only GenBank but also other NCBI databases. Many people refer (wrongly) to the whole set of NCBI databases as GenBank, but NCBI hosts the nucleotide database and many others, for example PubMed.

## Check available databases at NCBI
Biopython provides an interface to Entrez, the data retrieval system made available by NCBI. Entrez can also be used through a web browser: https://www.ncbi.nlm.nih.gov/search/


**TIPS:**
- specify an email address with your query
- avoid large numbers of requests (100 or more) during peak times (between 9.00 a.m. and 5.00 p.m. American Eastern Time on weekdays)
- do not post more than three queries per second (Biopython will take care of this for you)

Following these guidelines is not only good citizenship: you risk getting blocked if you overuse NCBI's servers. This is also a good reason to give a real email address, because NCBI may try to contact you.

In [1]:
from Bio import Entrez, SeqIO

In [2]:
Entrez.email = ""  # replace the empty string with your own email address
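
Because NCBI uses this address to contact you, it should not be left empty. A minimal sketch of one way to avoid hard-coding it in a shared notebook, assuming the address is stored in an environment variable. The variable name `NCBI_EMAIL` and the helper `contact_email` are hypothetical and not part of Biopython:

```python
import os


def contact_email(env=None, var="NCBI_EMAIL"):
    """Return the email address stored in the environment variable `var`.

    `NCBI_EMAIL` is a made-up variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    email = env.get(var, "").strip()
    if "@" not in email:
        raise ValueError(f"set {var} to a real email address before querying NCBI")
    return email


# Usage in the notebook (uncomment once the variable is set):
# Entrez.email = contact_email()
```

The check for `@` is deliberately crude; it only catches an unset or obviously wrong value.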

**EInfo:** obtain a list of all database names accessible through Entrez

In [3]:
# This gives you the list of available databases
handle = Entrez.einfo()
rec = Entrez.read(handle)
handle.close()
print(rec.keys())

dict_keys(['DbList'])


In [4]:
rec['DbList']

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
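
EInfo can also describe a single database when called with a `db` argument. The parsed result then holds a `DbInfo` entry whose `FieldList` names the search fields that can be used in query terms. A small sketch, with `search_fields` as a hypothetical helper and a hand-made dictionary standing in for the server's reply so that the snippet runs offline:

```python
def search_fields(db_record):
    """Return (Name, FullName) pairs from a parsed EInfo record for one database."""
    return [(f["Name"], f["FullName"]) for f in db_record["DbInfo"]["FieldList"]]


# Hand-made stand-in with the same shape as Entrez.read(Entrez.einfo(db=...));
# the two entries are illustrative, not a real server reply.
example = {
    "DbInfo": {
        "DbName": "nucleotide",
        "FieldList": [
            {"Name": "GENE", "FullName": "Gene Name"},
            {"Name": "ORGN", "FullName": "Organism"},
        ],
    }
}

for name, full_name in search_fields(example):
    print(name, full_name)

# Against the live service (needs network access and Entrez.email set):
# handle = Entrez.einfo(db="nucleotide")
# fields = search_fields(Entrez.read(handle))
# handle.close()
```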

We will now try to find the **chloroquine resistance transporter (CRT)** gene (KM288867) of **Plasmodium falciparum** (the parasite that causes the deadliest form of malaria) in the nucleotide database:

**ESearch:** Searching the Entrez databases

Note that a standard search returns at most **20** record references. If the query matches more, set **retmax** to the number of records you want.

In [5]:
handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax="40")
rec_list = Entrez.read(handle)
handle.close()
rec_list['Count']

'1374'

In [6]:
len(rec_list['IdList'])

40

In [7]:
rec_list['IdList']

['2049839867', '2049839865', '2049839863', '2049839861', '2049839859', '2049839857', '2049839855', '2049839853', '2049839851', '2049839849', '2049839847', '2049839845', '2049839843', '2049839841', '2049839839', '2049839837', '2049839835', '2049839833', '2049839831', '2049839829', '2049839827', '2049839825', '2049839823', '2049839821', '2049839819', '2049839817', '2049839815', '2049839813', '2049839811', '2049839809', '2049839807', '2049839805', '2049839803', '2049839801', '2049839799', '2049839797', '2049839795', '2049839793', '2049839791', '2049839789']
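
Only 40 of the 1374 matching IDs were returned. ESearch also accepts a `retstart` offset, so a larger result set can be collected page by page. The sketch below only computes the offsets. `page_starts` is a hypothetical helper, the page size of 500 is an arbitrary choice, and `term` and `ids` in the commented usage are placeholder names:

```python
def page_starts(total, page_size):
    """Return the retstart offsets needed to cover `total` hits in pages of `page_size`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total, page_size))


starts = page_starts(1374, 500)
print(starts)  # [0, 500, 1000]

# Each offset would then be passed to one ESearch call, for example:
# for start in starts:
#     handle = Entrez.esearch(db="nucleotide", term=term, retstart=start, retmax=500)
#     ids.extend(Entrez.read(handle)["IdList"])
#     handle.close()
```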

We now have the IDs of the first 40 matching records, but we still need to retrieve the records themselves.

**EFetch:** Downloading full records from Entrez


Requesting a specific file format from Entrez with Bio.Entrez.efetch() requires the optional **rettype** and/or **retmode** arguments. The valid combinations for each database are described on the pages linked from the NCBI EFetch documentation: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

- rettype: return type (gb == GenBank)
- retmax: total number of records from the input set to be retrieved, up to a maximum of 10,000

In [8]:
id_list = rec_list['IdList']
handle = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb') # genbank format, we need to parse it with SeqIO module

In [9]:
recs = list(SeqIO.parse(handle, 'gb'))
handle.close()

Note that we have converted an iterator (the result of SeqIO.parse) to a list. The advantage of doing this is that we can use the result as many times as we want (for example, iterate over it several times) without repeating the query on the server.
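
The difference is easy to see with a plain Python generator standing in for `SeqIO.parse`. The generator `fake_parse` is made up for this illustration and involves no Biopython code:

```python
def fake_parse():
    """Yield three record names one at a time, like a parser reading a handle."""
    for name in ("rec1", "rec2", "rec3"):
        yield name


iterator = fake_parse()
first_pass = [name for name in iterator]
second_pass = [name for name in iterator]  # the iterator is already exhausted

as_list = list(fake_parse())
first_list_pass = [name for name in as_list]
second_list_pass = [name for name in as_list]  # a list can be walked again

print(len(first_pass), len(second_pass))            # 3 0
print(len(first_list_pass), len(second_list_pass))  # 3 3
```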

In [10]:
recs

[SeqRecord(seq=Seq('ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAG...TAA'), id='MW275076.1', name='MW275076', description='Plasmodium falciparum isolate OM-17-323 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAG...TAA'), id='MW275075.1', name='MW275075', description='Plasmodium falciparum isolate OM-17-322 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAG...TAA'), id='MW275074.1', name='MW275074', description='Plasmodium falciparum isolate OM-17-321 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAG...TAA'), id='MW275073.1', name='MW275073', description='Plasmodium falciparum isolate OM-17-320 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('

However, be careful with this technique: you will retrieve a large number of complete records, and some of them will contain fairly large sequences. You risk downloading a lot of data, which would be a strain both on your side and on NCBI's servers.
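
One way to keep individual downloads small is to fetch the IDs in batches rather than in a single request. A minimal sketch, where `batches` is a hypothetical helper, the batch size of 10 is arbitrary, and the IDs are made up:

```python
def batches(ids, size):
    """Split a list of IDs into consecutive sub-lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


example_ids = [str(n) for n in range(25)]  # made-up IDs, for illustration only
for batch in batches(example_ids, 10):
    print(len(batch))  # prints 10, then 10, then 5

# Each batch would then go to its own EFetch call, for example:
# handle = Entrez.efetch(db="nucleotide", id=batch, rettype="gb")
# recs.extend(SeqIO.parse(handle, "gb"))
# handle.close()
```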

In [11]:
for rec in recs:
    if rec.name == 'KM288867':  # look for the CRT record among the 40 we fetched
        break
# If no record matches, the loop ends without a break and rec is simply the last record
print(rec.name)
print(rec.description)

MW275037
Plasmodium falciparum isolate OM-17-284 chloroquine resistance transporter (crt) gene, partial cds
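
The record printed here is MW275037, not KM288867: no record in the list matched, so the loop ran to the end and `rec` was left pointing at the last record. A sketch of a lookup that makes the no-match case explicit, using `SimpleNamespace` objects as stand-ins for `SeqRecord` (the helper `find_by_name` is hypothetical):

```python
from types import SimpleNamespace


def find_by_name(records, name):
    """Return the first record whose .name equals `name`, or None if there is none."""
    return next((rec for rec in records if rec.name == name), None)


# Stand-ins exposing only the .name attribute used by the lookup.
records = [SimpleNamespace(name="MW275076"), SimpleNamespace(name="MW275037")]

hit = find_by_name(records, "MW275037")
miss = find_by_name(records, "KM288867")
print(hit.name)  # MW275037
print(miss)      # None
```

Returning `None` forces the caller to handle a missing record instead of silently continuing with the wrong one.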


In [12]:
str(rec.seq)

'ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAGCGTTATAGAGAATTAGATAATTTAGTACAAGAAGGAAGTAAGTATCCAAAAATGGAAATATGGAATGATATAAATGAATAGATAAATCAACCTATTGGATATATATATATATATATATATATATATATATATATATATGTATACCCATATGTATTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCTTGTCGACCTTAACAGATGGCTCACGTTTAGGTGGAGGTTCTTGTCTTGGTAAATGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATTTATTTAAGTGTATGTGTAATGAATACAATTTTTGCTAAAAGAACTTTAAACAAAATTGGTAACTATAGTTTTGTAACATCCGAAACTTACAACTTTATTTGTATGATTATGTTCTTTATTGTTTATTCCTTATTTGGAAATAAAAAGGGAAATTCAAAAGTAAGATAAATCAATATATTAAAATGATGGATTTATAAGAGAATCTATTCCACCTACCAATATAAAACATTACACATATATATATATATATATATATATATATATATATATATATATATATGTATGTATGTTGATTAATTTGTTTATATATTTATATTTATTTCTTATGACCTTTTTAGGAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCCTGTTCAGTCATTTTGGCCTTCATAGGTCTTACAAGAACTACTGGAAATATCCAATCATTTGTTCTTCAATTAAGTATTCCTATTAATATGTTCTTCTGCTTTTTAATATTAAGATATAGGTAAGTATACTATTTTAAATTACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATAAAATATATATATATATTTATATATATTTATTTATATATTTAT