# Accessing NCBI Databases with Biopython

We will look at how to access databases at the National Center for Biotechnology Information (NCBI).

## Check Available Databases at NCBI

Biopython provides an interface to Entrez, the data retrieval system made available by NCBI. Entrez can also be used through a web browser: https://www.ncbi.nlm.nih.gov/search/

### Tips:

- Specify an email address with your query
- Avoid large numbers of requests (100 or more)
- Do not post more than three queries per second (Biopython will take care of this for you)

This is not only good citizenship: you risk getting blocked if you overuse NCBI's servers. That is also a good reason to give a real email address, because NCBI may try to contact you.
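
To make the three-queries-per-second tip concrete, here is a minimal sketch of a client-side throttle. Biopython already spaces out Entrez requests for you, so you do not need this in practice. `Throttle` is a hypothetical name used only for illustration and is not part of Biopython. The clock and sleep functions are parameters so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap of 1 / max_per_second seconds between calls.

    Hypothetical helper for illustration; not part of Biopython.
    """

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum gap, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` before each request would keep a loop of queries under the limit; the first call never sleeps.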

In [1]:
from Bio import Entrez, SeqIO

In [None]:
Entrez.email = ""  # fill in your own email address before running any query

## `einfo`: Obtain a list of all database names accessible through `Entrez`

In [3]:
# This gives you the list of available databases
handle = Entrez.einfo()
rec = Entrez.read(handle)
handle.close()
print(rec.keys())

dict_keys(['DbList'])


In [5]:
for db in rec['DbList']:
    print(db)

pubmed
protein
nuccore
ipg
nucleotide
structure
genome
annotinfo
assembly
bioproject
biosample
blastdbinfo
books
cdd
clinvar
gap
gapplus
grasp
dbvar
gene
gds
geoprofiles
medgen
mesh
nlmcatalog
omim
orgtrack
pmc
proteinclusters
pcassay
protfam
pccompound
pcsubstance
seqannot
snp
sra
taxonomy
biocollections
gtr
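
`einfo` can also describe a single database when it is given a `db` argument, which is a convenient way to discover the field names usable in search terms. The sketch below assumes the parsed result has the shape `record["DbInfo"]["FieldList"]`, with `Name` and `FullName` keys on each field. The function names are hypothetical, and the sample record is hand-written so the example runs without network access.

```python
def field_summary(einfo_record):
    """Return (name, full name) pairs from a parsed einfo(db=...) record."""
    fields = einfo_record["DbInfo"]["FieldList"]
    return [(field["Name"], field["FullName"]) for field in fields]


def fetch_field_summary(db="nucleotide"):
    """Ask NCBI for one database's search fields (needs network access)."""
    from Bio import Entrez  # imported here so field_summary stays usable offline

    handle = Entrez.einfo(db=db)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return field_summary(record)


# Offline illustration: a hand-written record of the assumed shape, not real NCBI output.
sample = {"DbInfo": {"FieldList": [
    {"Name": "GENE", "FullName": "Gene Name"},
    {"Name": "ORGN", "FullName": "Organism"},
]}}
for name, full_name in field_summary(sample):
    print(name, full_name)
```

With `Entrez.email` set, `fetch_field_summary('nucleotide')` would return the live list for the nucleotide database.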


We will try to find the **chloroquine resistance transporter (CRT)** gene (KM288867) in **Plasmodium falciparum** (the parasite that causes the deadliest form of malaria) in the nucleotide database.

## `esearch`: Searching the Entrez database

Note that by default a search returns at most 20 record references. If there are more matches, we can raise the limit by setting `retmax` to the desired number of records.

In [7]:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=40)
rec_list = Entrez.read(handle)
handle.close()
rec_list['Count']

'3867'

In [8]:
len(rec_list['IdList'])

40

In [10]:
rec_list['IdList']

['3071712854', '3071712852', '3071712850', '3071712848', '3071712846', '3071712844', '3071712842', '3071712840', '3071712838', '3071712836', '3071712834', '3071712832', '3071712830', '3071712828', '3071712826', '3071712824', '3071712822', '3071712820', '3071712818', '3071712816', '3071712814', '3071712812', '3071712810', '3071712808', '3071712806', '3071712804', '3071712802', '3071712800', '3071712798', '3071712796', '3071712794', '3071712792', '3071712790', '3071712788', '3071712786', '3071712784', '3071712782', '3071712780', '3071712778', '3071712776']

We now have the IDs of the first 40 of the 3867 matching records. Next we need to retrieve the records themselves.
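
The search above returned only 40 IDs because of `retmax`. One way to collect more is to page through the results with `retstart`, as in this sketch. The helper names and the default page size are assumptions made for illustration. Only the offset calculation runs here; the network function is defined but not called.

```python
def page_starts(total, page_size):
    """Return the retstart offsets needed to cover `total` hits."""
    return list(range(0, total, page_size))


def search_all_ids(term, db="nucleotide", page_size=500):
    """Collect every matching ID page by page (needs network access)."""
    from Bio import Entrez

    # First request: ask only for the hit count, which arrives as a string.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    try:
        total = int(Entrez.read(handle)["Count"])
    finally:
        handle.close()

    ids = []
    for start in page_starts(total, page_size):
        handle = Entrez.esearch(db=db, term=term, retstart=start, retmax=page_size)
        try:
            ids.extend(Entrez.read(handle)["IdList"])
        finally:
            handle.close()
    return ids


print(page_starts(3867, 1000))  # → [0, 1000, 2000, 3000]
```

Keep the tips at the top in mind: each page is a separate request to NCBI's servers.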

## `efetch`: Downloading full records from `Entrez`

Requesting a specific file format from Entrez with `Bio.Entrez.efetch()` requires the optional `rettype` and/or `retmode` arguments. The valid combinations for each database are described on the pages linked from the NCBI EFetch documentation: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

In [None]:
id_list = rec_list['IdList']
handle = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')   # GenBank format; we parse it with the SeqIO module
records_iter = SeqIO.parse(handle, 'gb')   # lazy iterator: records are read from the handle as we loop

In [None]:
for record in records_iter:
    print(record)

ID: PX439247.1
Name: PX439247
Description: Plasmodium falciparum isolate KSM-60 chroquine resistance transporter (crt) gene, partial cds
Number of features: 4
/molecule_type=DNA
/topology=linear
/data_file_division=INV
/date=11-OCT-2025
/accessions=['PX439247']
/sequence_version=1
/keywords=['']
/source=Plasmodium falciparum (malaria parasite P. falciparum)
/organism=Plasmodium falciparum
/taxonomy=['Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida', 'Haemosporida', 'Plasmodiidae', 'Plasmodium', 'Plasmodium (Laverania)']
/references=[Reference(title='Regional variation in sulfadoxine-pyrimethamine resistance genotypes and haplotypes of Plasmodium falciparum dihydrofolate reductase and dihydropteroate synthase genes in Western Kenya', ...), Reference(title='Direct Submission', ...)]
/structured_comment=defaultdict(<class 'dict'>, {'Assembly-Data': {'Sequencing Technology': 'Sanger dideoxy sequencing'}})
Seq('ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAG...AGA')
ID: 

In [15]:
handle.close()      # Always remember to clean up
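
As an illustration of how `rettype` and `retmode` select the download format, the sketch below pairs two `rettype` values with the matching `SeqIO` format names. The `FORMATS` table and both function names are hypothetical; check the EFetch documentation linked above for the combinations valid for your database. Only the argument builder runs here, and the network function is defined but not called.

```python
# Assumed pairing of efetch rettype values with SeqIO format names.
FORMATS = {
    "genbank": {"rettype": "gb", "seqio": "genbank"},
    "fasta": {"rettype": "fasta", "seqio": "fasta"},
}


def efetch_arguments(ids, fmt="genbank", db="nucleotide"):
    """Build keyword arguments for Entrez.efetch for a plain-text download."""
    return {
        "db": db,
        "id": ",".join(ids),  # efetch accepts a comma-separated ID string
        "rettype": FORMATS[fmt]["rettype"],
        "retmode": "text",
    }


def fetch_records(ids, fmt="genbank"):
    """Download and parse records into a list (needs network access)."""
    from Bio import Entrez, SeqIO

    handle = Entrez.efetch(**efetch_arguments(ids, fmt))
    try:
        # Materialise the lazy iterator before the handle is closed.
        return list(SeqIO.parse(handle, FORMATS[fmt]["seqio"]))
    finally:
        handle.close()


print(efetch_arguments(["3071712854", "3071712852"], fmt="fasta"))
# → {'db': 'nucleotide', 'id': '3071712854,3071712852', 'rettype': 'fasta', 'retmode': 'text'}
```

Wrapping the parse in `try`/`finally` closes the handle even if parsing fails, which avoids the separate clean-up cell.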

In [17]:
id_list = rec_list['IdList']
handle = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')   # GenBank format; we parse it with the SeqIO module
records_iter = SeqIO.parse(handle, 'gb')
for rec in records_iter:
    if rec.name == 'KM288867':  # look for the CRT record among the 40 records we fetched
        break
# If no record matched, the loop ran to completion and rec is simply the last record fetched
print(rec.name)
print(rec.description)
handle.close()

PX439208
Plasmodium falciparum isolate KSM-21 chroquine resistance transporter (crt) gene, partial cds
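
The name printed above is PX439208, not KM288867: the record we wanted was not among the 40 fetched, so the loop finished and `rec` was left holding the last record. A lookup that returns `None` on a miss makes this explicit. The sketch below also shows one possible way to request the record directly by accession. Both function names are hypothetical, and the stand-in objects replace real `SeqRecord`s so the lookup runs offline.

```python
from types import SimpleNamespace


def find_by_name(records, name):
    """Return the first record whose .name equals `name`, else None."""
    for record in records:
        if record.name == name:
            return record
    return None


def fetch_by_accession(accession, db="nucleotide"):
    """Fetch a single GenBank record by accession (needs network access)."""
    from Bio import Entrez, SeqIO

    handle = Entrez.efetch(db=db, id=accession, rettype="gb", retmode="text")
    try:
        return SeqIO.read(handle, "genbank")  # expects exactly one record
    finally:
        handle.close()


# Offline illustration with stand-in objects instead of real SeqRecords.
fetched = [SimpleNamespace(name="PX439247"), SimpleNamespace(name="PX439208")]
print(find_by_name(fetched, "KM288867"))       # → None
print(find_by_name(fetched, "PX439208").name)  # → PX439208
```

With `Entrez.email` set, `fetch_by_accession('KM288867')` would be the direct route to the record named in the text.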


In [18]:
str(rec.seq)

'ATGAAATTCGCAAGTAAAAAAAATAATCAAAAAAATTCAAGCAAAAATGACGAGCGTTATAGAGAATTAGATAATTTAGTACAAGAAGGAAGTAAGTATCCAAAAATGGAAATATGGAATGATATAAATGAATAGATAAATCAACCTATTGGATATATATATATATATATATATATATATATATATATATATGTATACCCATATGTATTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCTTGTCGACCTTAACAGATGGCTCACGTTTAGGTGGAGGTTCTTGTCTTGGTAAATGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATTTATTTAAGTGTATGTGTAATGAATAAAATTTTTGCTAAAAGAACTTTAAACAAAATTGGTAACTATAGTTTTGTAACATCCGAAACTCACAACTTTATTTGTATGATTATGTTCTTTATTGTTTATTCCTTATTTGGAAATAAAAAGGGAAATTCAAAAGTAAGA'