# Batch download PubMed abstracts using the NCBI E-utilities and Python

Here, I will show you how the NCBI E-utilities can be used to search for and download PubMed abstracts. I use Python together with the E-utilities to download all the abstracts matching a given search term and parse the information in each abstract into a data science-friendly format.

## Introduction

Depending on what you search for in PubMed, you could be presented with thousands of abstracts that contain the keyword you used for your query (for example, try searching "cancer"). Finding the information you’re looking for can get a bit tedious when you have to manually click through each page of search results.

Using the NCBI E-utilities (Entrez Programming Utilities, https://www.ncbi.nlm.nih.gov/books/NBK25499/), you can retrieve and download abstracts associated with a PubMed search without having to sift through the user interface. Even better, this tool doesn’t require any software: it’s completely URL-based. You craft "search" and "fetch" commands as URLs and open them in your browser window to access the abstracts.

We can automate the download process by programming a script in Python to construct the URLs, execute the "search" and "fetch" commands, and parse each part of the abstract (Authors, Journal, Date of publication, etc.) into a data file for downstream analysis. Text from each abstract can be analyzed to quickly extract numerical data or quantitative results.

Below, I will give a brief tutorial about how the tools work, and the code needed to automate the process using Python.


## How the NCBI E-utilities work

The two main E-util functions you will use are `esearch` and `efetch`.

First, `esearch` runs a keyword search command on the PubMed database and retrieves IDs for each of the abstracts corresponding to the search. The actual information associated with the abstracts does not show up, only the IDs. You’re also given a `query key` and `web environment ID`. 

Then, you input the `query key` and `web environment ID` into an `efetch` call, which will “fetch” all the abstracts for that specific `esearch` query.

Let’s say I want to search PubMed for "P2RY8".

### Step 1. Craft your esearch URL

Here is the URL required to execute a PubMed esearch for P2RY8:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=P2RY8&retmax=50&usehistory=y

This was crafted by putting the following parameters together:
* `http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?` is the backbone of the esearch function.
* `db=pubmed` specifies that we will be searching the pubmed database. 
* `term=P2RY8` specifies what we will be searching pubmed for. Change this field to whatever you want to search for.
* `retmax=50` specifies how many abstracts I want to return using the search.
* `usehistory=y` will provide you with a QueryKey and WebEnv id that will let you fetch abstracts from this search.
* The `&` signs separate the individual parameters. The first parameter (`db=pubmed`) follows the `?` directly; every parameter after it is preceded by an `&`.
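
The same URL can also be assembled programmatically. The sketch below is one possible way to do it with `urllib.parse.urlencode` from the Python standard library. The helper name `build_esearch_url` is made up for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Base address of the esearch utility (same one used in the URL above)
ESEARCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=50):
    """Hypothetical helper: join the esearch parameters with '?' and '&'."""
    params = {
        "db": "pubmed",      # database to search
        "term": term,        # search keyword
        "retmax": retmax,    # number of IDs to list in the result
        "usehistory": "y",   # ask for a QueryKey and WebEnv
    }
    # urlencode also escapes spaces and other special characters in the term
    return ESEARCH_BASE + "?" + urlencode(params)


print(build_esearch_url("P2RY8"))
```

A side benefit of `urlencode` is that multi-word search terms are escaped automatically, which plain string concatenation does not do.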

Copying and pasting the full URL into my web browser results in a webpage that looks like this (XML output):

![esearch_web_result.png](images/esearch_web_result.png)

### Step 2. Craft the efetch URL

The next step is to execute an efetch command by constructing a new URL. Using the `WebEnv` and `QueryKey` information given in the above esearch result, I will type the following efetch command into my browser:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=NCID_1_5757184_130.14.18.48_9001_1579844644_2015135327_0MetA0_S_MegaStore&retstart=0&retmax=50&retmode=text&rettype=abstract

Note: If you’re trying this right now, your esearch will have given you a different `WebEnv` value. Make sure to use YOUR `WebEnv` value in the efetch URL for it to work!

Here is an explanation for each aspect of the link I constructed above:
* `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?` is the backbone for an efetch command. Notice that the only difference between this and an esearch command is the part after “/eutils/”.
* `db=pubmed` specifies the database, again.
* `query_key=1` specifies the number that was given in the `QueryKey` field of the esearch result.
* `WebEnv=NCID_1_5757184_130.14.18.48_9001_1579844644_2015135327_0MetA0_S_MegaStore` specifies the ID that was given in the `WebEnv` field of the esearch result.
* `retstart=0` and `retmax=50` specify which record to start from and how many abstracts to return.
* `retmode=text` specifies that I want the abstracts returned as plain text.
* `rettype=abstract` specifies that I want abstracts shown, as opposed to other types of info that can be given from a PubMed search.
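
As a rough sketch, the efetch URL can be built the same way from the two values taken out of the esearch result. The helper name `build_efetch_url` and the placeholder `YOUR_WEBENV_VALUE` are assumptions for illustration; a real `WebEnv` has to come from your own esearch call.

```python
from urllib.parse import urlencode

# Base address of the efetch utility
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(query_key, webenv, retstart=0, retmax=50):
    """Hypothetical helper: build an efetch URL for a stored esearch result."""
    params = {
        "db": "pubmed",
        "query_key": query_key,   # QueryKey from the esearch result
        "WebEnv": webenv,         # WebEnv from the esearch result
        "retstart": retstart,     # index of the first record to return
        "retmax": retmax,         # number of records to return
        "retmode": "text",        # plain-text output
        "rettype": "abstract",    # return abstracts
    }
    return EFETCH_BASE + "?" + urlencode(params)


# Placeholder value only; substitute the WebEnv from your own search
print(build_efetch_url(1, "YOUR_WEBENV_VALUE"))
```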

After inputting this link, you should observe the following output as a plaintext webpage:


![efetch_web_result.png](images/efetch_web_result.png)

Now, I can simply ctrl-F to sift through every abstract on this page (up to 50, since `retmax=50`). You can apply this simple two-step process whenever you’re tasked with searching through absurd amounts of PubMed results.

Below, I show you how to perform this process in Python.

## Using the E-utilities in Python

The packages we need include `csv` to write to a csv file, `re` to extract information from `esearch` results with regular expressions, `urllib.request` to open and read URLs, and `time` to sleep for a couple of seconds between requests so we don't get blocked.

In [1]:
import csv
import re
import urllib.request  # importing the submodule makes urllib.request.urlopen available
from time import sleep

Now, we need to specify our parameters for the `esearch` and `efetch` calls. We can store each of the settings in their own individual variables in order to make it easy to customize the calls in the future.

In [2]:
query = 'intelligence'

# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'

# esearch specific settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
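# note: despite this setting, esearch returns XML (see the output below); the regexes used later rely on that
search_rettype = '&rettype=json'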

We can construct the search url by simply combining together each of the variables into a long string:

In [3]:
search_url = base_url+search_eutil+db+search_term+search_usehistory+search_rettype
print(search_url)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=intelligence&usehistory=y&rettype=json


Now that we have the full `esearch` url constructed, we can open the url using `urllib.request.urlopen()`:

In [4]:
f = urllib.request.urlopen(search_url)
search_data = f.read().decode('utf-8')

In [5]:
search_data

'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult><Count>323027</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>MCID_657c84d9fe81845c6315a812</WebEnv><IdList>\n<Id>38099452</Id>\n<Id>38099400</Id>\n<Id>38099383</Id>\n<Id>38099340</Id>\n<Id>38099205</Id>\n<Id>38099199</Id>\n<Id>38099176</Id>\n<Id>38099140</Id>\n<Id>38099136</Id>\n<Id>38099089</Id>\n<Id>38099088</Id>\n<Id>38099081</Id>\n<Id>38099049</Id>\n<Id>38099026</Id>\n<Id>38099022</Id>\n<Id>38098952</Id>\n<Id>38098921</Id>\n<Id>38098857</Id>\n<Id>38098822</Id>\n<Id>38098788</Id>\n</IdList><TranslationSet><Translation>     <From>intelligence</From>     <To>"intelligence"[MeSH Terms] OR "intelligence"[All Fields] OR "intelligences"[All Fields] OR "intelligent"[All Fields] OR "intelligently"[All Fields] OR "intelligibilities"[All Fields] OR "intelligibility"[All Fiel

Above, you can compare the raw text output of the esearch result with the image of what it looks like in the browser. You can see that the same syntax is used for the `WebEnv` and `QueryKey` sections. We will also need the total number of abstracts corresponding to the query, if we want to be able to retrieve all the abstracts. We can extract these items from the output using the regexes below:

In [6]:
# obtain total abstract count (raw strings keep \d and \S from being read as string escapes)
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])

# obtain webenv and querykey settings for efetch command
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]

In [7]:
total_abstract_count

323027

We observe that there are 323,027 total abstracts that match the keyword "intelligence".

In [8]:
fetch_webenv

'&WebEnv=MCID_657c84d9fe81845c6315a812'

In [9]:
fetch_querykey

'&query_key=1'
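
The regexes above work on the raw text. Since the esearch result is XML, an alternative is to parse it with `xml.etree.ElementTree` from the standard library. The following is only a sketch under that assumption: `parse_esearch` is a hypothetical helper, and `sample_search_data` is a shortened copy of the output shown earlier so that the example runs without a network call.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the esearch output above
sample_search_data = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    "<eSearchResult><Count>323027</Count><RetMax>20</RetMax>"
    "<RetStart>0</RetStart><QueryKey>1</QueryKey>"
    "<WebEnv>MCID_657c84d9fe81845c6315a812</WebEnv>"
    "<IdList>\n<Id>38099452</Id>\n<Id>38099400</Id>\n</IdList>"
    "</eSearchResult>"
)


def parse_esearch(xml_text):
    """Hypothetical helper: return count, query key, WebEnv and the ID list."""
    # Parse from bytes so the encoding declaration in the header is honoured
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
        "ids": [node.text for node in root.findall("IdList/Id")],
    }


result = parse_esearch(sample_search_data)
print(result["count"], result["query_key"], result["webenv"])
```

The regex approach in this tutorial is shorter; a parser is mainly useful if you also want other fields, such as the list of IDs.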

Now that we have the `WebEnv` and `QueryKey`, we can run an `efetch` command to obtain the abstracts. In order to do so, we must assign values to each of the parameters in a similar manner as we did above for esearch.

In [10]:
# other efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 20
retstart = 0
fetch_retstart = "&retstart=" + str(retstart)
fetch_retmax = "&retmax=" + str(retmax)
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"

The fully constructed efetch command using the above parameters, which should fetch the first 20 of the 323,027 total abstracts, is below:

In [11]:
fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
print(fetch_url)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_657c84d9fe81845c6315a812&retstart=0&retmax=20&retmode=text&rettype=abstract


Now, we can open the url the same way we did for esearch:

In [12]:
f = urllib.request.urlopen (fetch_url)
fetch_data = f.read().decode('utf-8')

In [13]:
fetch_data[1:3000]  # preview the start of the response (the slice begins at index 1, so the leading "1" is dropped)

'. Chem Soc Rev. 2023 Dec 15. doi: 10.1039/d3cs00714f. Online ahead of print.\n\nElectrocatalysis of nitrogen pollution: transforming nitrogen waste into \nhigh-value chemicals.\n\nWu Q(1), Zhu F(2), Wallace G(1), Yao X(2), Chen J(1).\n\nAuthor information:\n(1)Intelligent Polymer Research Institute, Australian Institute for Innovative \nMaterials, Innovation Campus, University of Wollongong, Squires Way, North \nWollongong, NSW 2500, Australia. junc@uow.edu.au.\n(2)School of Advanced Energy, Shenzhen Campus, Sun Yat-Sen University, Shenzhen, \nGuangdong 518107, P. R. China. yaoxd3@mail.sysu.edu.cn.\n\nOn 16 June 2023, the United Nations Environment Programme highlighted the \nseverity of nitrogen pollution faced by humans and called for joint action for \nsustainable nitrogen use. Excess nitrogenous waste (NW: NO, NO2, NO2-, NO3-, \netc.) mainly arises from the use of synthetic fertilisers, wastewater discharge, \nand fossil fuel combustion. Although the amount of NW produced can be m

Examining the text output, we see that individual abstracts are separated by 3 new lines (```\n\n\n```). We can therefore use split() to generate a list in which each item is a separate abstract.

In [14]:
# splits the data into individual abstracts
abstracts = fetch_data.split("\n\n\n")
len(abstracts)

20

Because we had set `retmax = 20`, we obtained 20 abstracts. Let's take a closer look at an individual abstract:

In [15]:
# print out the first abstract
abstracts[0]

'1. Chem Soc Rev. 2023 Dec 15. doi: 10.1039/d3cs00714f. Online ahead of print.\n\nElectrocatalysis of nitrogen pollution: transforming nitrogen waste into \nhigh-value chemicals.\n\nWu Q(1), Zhu F(2), Wallace G(1), Yao X(2), Chen J(1).\n\nAuthor information:\n(1)Intelligent Polymer Research Institute, Australian Institute for Innovative \nMaterials, Innovation Campus, University of Wollongong, Squires Way, North \nWollongong, NSW 2500, Australia. junc@uow.edu.au.\n(2)School of Advanced Energy, Shenzhen Campus, Sun Yat-Sen University, Shenzhen, \nGuangdong 518107, P. R. China. yaoxd3@mail.sysu.edu.cn.\n\nOn 16 June 2023, the United Nations Environment Programme highlighted the \nseverity of nitrogen pollution faced by humans and called for joint action for \nsustainable nitrogen use. Excess nitrogenous waste (NW: NO, NO2, NO2-, NO3-, \netc.) mainly arises from the use of synthetic fertilisers, wastewater discharge, \nand fossil fuel combustion. Although the amount of NW produced can be 

We observe that the sections of the abstract are separated by 2 new lines in a row, denoted by `\n\n`. We can again use `split()` to further categorize each section of the abstract.

In [16]:
split_abstract = abstracts[1].split("\n\n")
split_abstract

['2. Nanoscale. 2023 Dec 15. doi: 10.1039/d3nr04060g. Online ahead of print.',
 'A new triboelectric nanogenerator based on a multi-material stacking structure \nachieves efficient power conversion from discrete mechanical movement.',
 'Luo J(1), Su Y(1)(2), Liu A(1), Dai G(1), Zhang X(1), Su X(1), Shao Y(1), Li \nZ(1), Zhao X(1)(3), Zhao K(1)(4).',
 'Author information:\n(1)School of Marine Engineering Equipment, Zhejiang Ocean University, Zhoushan \n316022, China. syx@zjou.edu.cn.\n(2)School of Electrical Engineering, Southwest Jiaotong University, Chengdu \n611756, China.\n(3)Ocean College, Zhejiang University, Zhoushan 316021, China.\n(4)Laboratory of Polymers and Composites, Ningbo Institute of Materials \nTechnology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China.',
 'Due to its invaluable potential in discrete mechanical energy collection, TENG \n(triboelectric nanogenerator) is considered to satisfy the power requirements of \nintelligent electronic devices a

In [17]:
len(split_abstract)

6

We see that the abstract has been split into 6 different items of information, corresponding to the journal name and date, title, authors, etc. Knowing this, we can construct a data frame in which each row represents an abstract, and each column represents a section of the abstract (journal, date, authors, etc.).
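
To make the row-and-column idea concrete, here is a minimal sketch that labels the sections of one record by position. The column names are the ones used for the CSV header later in this tutorial; the helper `abstract_to_row` and the sample record are made up for illustration. It assumes the sections always appear in the same order, which does not hold for records that are missing a section in the middle.

```python
# Column names, in the order the sections appear in a complete record
COLUMNS = ["Journal", "Title", "Authors", "Author_Information",
           "Abstract", "DOI", "Misc"]


def abstract_to_row(abstract_text):
    """Hypothetical helper: split one record and label its sections."""
    sections = abstract_text.split("\n\n")
    # Pad short records with "" so every column gets a value;
    # zip then pairs names and sections and ignores any surplus sections
    padded = sections + [""] * (len(COLUMNS) - len(sections))
    return dict(zip(COLUMNS, padded))


# Made-up record with only three sections, for illustration
sample = "1. Some Journal. 2023 Dec 15.\n\nA made-up title.\n\nDoe J(1), Roe R(2)."
row = abstract_to_row(sample)
print(row["Journal"])
print(row["Title"])
```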

## Writing a loop to fetch all abstracts

There are 323,027 total abstracts matching the keyword search "intelligence". Above, we have only processed the first 20. In order to obtain more of them, we can construct a loop that calls `efetch` while incrementing `retstart` by `retmax` each iteration (500 in the script below), until all the abstracts have been downloaded and added to the list `all_abstracts`.

In [25]:
import csv
import re
import urllib.request  # importing the submodule makes urllib.request.urlopen available
from time import sleep

query = "intelligence"

# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'

# esearch settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
search_rettype = '&rettype=json'

# call the esearch command for the query and read the web result
search_url = base_url+search_eutil+db+search_term+search_usehistory+search_rettype
print("this is the esearch command:\n" + search_url + "\n")
f = urllib.request.urlopen (search_url)
search_data = f.read().decode('utf-8')

# extract the total abstract count
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])

# efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 500
retstart = 0
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"

# obtain webenv and querykey settings from the esearch results
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]

# call efetch commands using a loop until all abstracts are obtained
run = True
all_abstracts = list()
loop_counter = 1

while run:
    print("this is efetch run number " + str(loop_counter))
    loop_counter += 1
    fetch_retstart = "&retstart=" + str(retstart)
    fetch_retmax = "&retmax=" + str(retmax)
    # create the efetch url
    fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
    print(fetch_url)
    # open the efetch url
    f = urllib.request.urlopen (fetch_url)
    fetch_data = f.read().decode('utf-8')
    # split the data into individual abstracts
    abstracts = fetch_data.split("\n\n\n")
    # append to the list all_abstracts
    all_abstracts = all_abstracts+abstracts
    print("a total of " + str(len(all_abstracts)) + " abstracts have been downloaded.\n")
    # wait 2 seconds so we don't get blocked
    sleep(2)
    # update retstart to download the next chunk of abstracts
    retstart = retstart + retmax
    # stop once retstart has reached the total, so no request starts past the last record
    if retstart >= total_abstract_count:
        run = False
    
    

this is the esearch command:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=intelligence&usehistory=y&rettype=json

this is efetch run number 1
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_657c8717129b4b17846fd004&retstart=0&retmax=500&retmode=text&rettype=abstract
a total of 492 abstracts have been downloaded.

this is efetch run number 2
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_657c8717129b4b17846fd004&retstart=500&retmax=500&retmode=text&rettype=abstract
a total of 984 abstracts have been downloaded.

this is efetch run number 3
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_657c8717129b4b17846fd004&retstart=1000&retmax=500&retmode=text&rettype=abstract
a total of 1476 abstracts have been downloaded.

this is efetch run number 4
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_6

HTTPError: HTTP Error 400: Bad Request

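The script above prints the esearch command, followed by each efetch command it issues. In the output shown here, each call added 492 items to the list `all_abstracts`, and the run ended with `HTTP Error 400: Bad Request` instead of finishing. The output is truncated, so the exact call that failed is not visible. One likely explanation is that the E-utilities only serve the first 10,000 PubMed records of a search (the NCBI documentation states this limit for `esearch`), which is far fewer than the 323,027 matches for "intelligence".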
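
One way to keep a failed request from aborting the whole download is to catch `urllib.error.HTTPError`. The sketch below is an assumption about how that could look, not part of the original script: `fetch_text` is a hypothetical helper, and `failing_opener` is a stand-in for `urlopen` so the example runs without contacting NCBI.

```python
import io
import urllib.error
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen):
    """Hypothetical helper: return the response body, or None on an HTTP error."""
    try:
        with opener(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        print("request failed with HTTP " + str(err.code))
        return None


def failing_opener(url):
    # Stand-in for urlopen that always fails, used only for this demo
    raise urllib.error.HTTPError(url, 400, "Bad Request", None, io.BytesIO())


print(fetch_text("http://example.invalid/", opener=failing_opener))
```

Inside the download loop, a `None` result could be used as the signal to stop and keep whatever has been collected so far.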

You might be wondering why we needed a loop in the first place. Couldn't we have set `retmax` to the total count and run `efetch` only once? For a small search, such as the 56 abstracts matching P2RY8, yes. However, this script requests at most 500 abstracts per call, so for keyword searches with over 500 abstracts we set `retmax` to 500 and loop until all the abstracts are downloaded.
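
The paging arithmetic can be checked in isolation. This small sketch lists the `retstart` value of every call needed for a given total; `batch_starts` is a hypothetical helper name, and no requests are made.

```python
def batch_starts(total_count, retmax):
    """Hypothetical helper: list the retstart value of every efetch call."""
    # range stops before total_count, so no call starts past the last record
    return list(range(0, total_count, retmax))


print(batch_starts(56, 20))     # three calls cover 56 abstracts
print(batch_starts(1000, 500))  # two calls; retstart=1000 is never requested
```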

In [26]:
len(all_abstracts)

9839

Next, we will split each abstract from `all_abstracts` into the categories: 'Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', and 'Misc'. After splitting each abstract, we will write the information to a csv file for downstream analysis.

In [28]:
with open("abstracts.csv", "wt",encoding='utf-8', newline='') as abstracts_file:
    abstract_writer = csv.writer(abstracts_file)
    abstract_writer.writerow(['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc'])
    #For each abstract, split into categories and write it to the csv file
    for abstract in all_abstracts:
        #To obtain categories, split every double newline.
        split_abstract = abstract.split("\n\n")
        abstract_writer.writerow(split_abstract)

Examining the resulting csv file, we notice that some abstracts are missing pieces of information. Some lack the author information, and some lack the abstract text entirely. Since we are only interested in abstracts with complete information, we should write the incomplete ones to a separate file. To do this, we can add a simple if/else clause and write to two files. In the code below, incomplete abstracts (those with 5 or fewer sections) are written to a file called `partial_abstracts.csv`.

In [32]:
with open("abstracts.csv", "wt",encoding='utf-8', newline='') as abstracts_file, open ("partial_abstracts.csv", "wt",encoding='utf-8', newline='') as partial_abstracts:
    # csv writer for full abstracts
    abstract_writer = csv.writer(abstracts_file)
    abstract_writer.writerow(['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc'])
    # csv writer for partial abstracts
    partial_abstract_writer = csv.writer(partial_abstracts)
    #For each abstract, split into categories and write it to the csv file
    for abstract in all_abstracts:
        #To obtain categories, split every double newline.
        split_abstract = abstract.split("\n\n")
        if len(split_abstract) > 5:
            abstract_writer.writerow(split_abstract)
        else:
            partial_abstract_writer.writerow(split_abstract)
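
For downstream analysis, the csv file can be read back with `csv.DictReader`, which turns each row into a dictionary keyed by the header names. The sketch below uses a made-up record and a temporary file (both assumptions for illustration) so that it runs on its own and leaves `abstracts.csv` untouched.

```python
import csv
import os
import tempfile

COLUMNS = ["Journal", "Title", "Authors", "Author_Information",
           "Abstract", "DOI", "Misc"]

# Made-up record standing in for one row of abstracts.csv
sample_row = ["1. Some Journal. 2023.", "A made-up title \nsplit over two lines.",
              "Doe J(1).", "Author information:\n(1)Nowhere University.",
              "Made-up abstract text.", "DOI: 10.0000/example", "PMID: 0"]

# Write the sample to a temporary file
path = os.path.join(tempfile.mkdtemp(), "abstracts_sample.csv")
with open(path, "wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerow(sample_row)

# Read it back; fields containing newlines are quoted on write and restored on read
with open(path, "rt", encoding="utf-8", newline="") as handle:
    records = list(csv.DictReader(handle))

print(len(records))
print(records[0]["Authors"])
```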

## Conclusion

Here, I have shown you how to use the NCBI E-utilities `esearch` and `efetch` to download abstracts from PubMed, as well as how to write a Python script to batch download all abstracts corresponding to a keyword search.

The file `pubmed_extractor.py` in this repository contains the Python script that we wrote above, and also lets the user enter their own keyword search. Running the script by typing `python pubmed_extractor.py` into your terminal should prompt you for a keyword to download PubMed abstracts for. The script will then write your abstracts to `abstracts.csv`.

### References
Sayers E. The E-utilities In-Depth: Parameters, Syntax and More. 2009 May 29. In: Entrez Programming Utilities Help. Bethesda (MD): National Center for Biotechnology Information (US); 2010-. Available from: http://www.ncbi.nlm.nih.gov/books/NBK25499/