# PubMed API Tutorial

Write a program that creates a database of PubMed articles matching a specific search string:
- Define the structure of your database
- Exploit the ENTREZ API to fetch the target information
- Process the ENTREZ output into your own data structure

In [1]:
import requests
import json

In [2]:
# what we would like to search on pubmed
search_term = 'ai'

# good practice when dealing with user input:
# remove any whitespace before and after the string
search_term = search_term.strip()

# a URL cannot contain raw spaces,
# so we replace them with +
search_term = search_term.replace(' ', '+')
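
Replacing spaces by hand covers the simple case, but other characters that are special in a URL (such as `&`) would still break the query. A minimal sketch using the standard library's `urllib.parse.quote_plus`, which also turns spaces into `+`. The helper name `encode_term` is hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

def encode_term(raw_term):
    """Strip surrounding whitespace and percent-encode the term for use in a URL."""
    return quote_plus(raw_term.strip())

print(encode_term('  machine learning '))     # machine+learning
print(encode_term('heart attack & stroke'))   # heart+attack+%26+stroke
```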

## ESEARCH

The part of the API that returns a list of UIDs matching a text query. <br>
It requires some parameters to be passed with the URL:
- **db**: the database to search, in our case pubmed
- **term**: the actual text query (user input)
- **retmode**: the output format (for example xml or json)
- **retmax**: how many UIDs to return (max 100,000 for one API call)
- **retstart**: the index of the first UID to return (important if we want to retrieve more than 100,000 UIDs)

Documentation: https://www.nlm.nih.gov/dataguide/eutilities/utilities.html#esearch
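
The parameters listed above can also be assembled with `urllib.parse.urlencode`, which escapes the values and inserts the `&` separators. This is a minimal sketch: the helper names `build_esearch_url` and `page_offsets` are hypothetical, and `page_offsets` only illustrates how successive `retstart` values could be generated to walk through a large result set.

```python
from urllib.parse import urlencode

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_esearch_url(term, retstart=0, retmax=20):
    """Assemble an ESEARCH URL from the parameters described above."""
    params = {
        'db': 'pubmed',
        'term': term,
        'retmode': 'json',
        'retstart': retstart,
        'retmax': retmax,
    }
    return ESEARCH_URL + '?' + urlencode(params)

def page_offsets(total, page_size):
    """Return the retstart values needed to cover `total` results page by page."""
    return list(range(0, total, page_size))

print(build_esearch_url('ai'))
print(page_offsets(45, 20))   # [0, 20, 40]
```

Each offset would be passed as `retstart` in a separate request, with `retmax` set to the page size.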

In [3]:
# create the url for the api with the parameters we want
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
db = '?db=pubmed'
term = '&term=' + search_term
retmode = '&retmode=json' # JSON is easier to parse than XML
retstart = '&retstart=0' # optional: 0 is the default
retmax = '&retmax=20' # only retrieve 20 papers for this example

In [4]:
# db must come right after base_url because it carries the '?'; the order of the other parameters does not matter
url = base_url + db + term + retmode + retstart + retmax
print(url)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=ai&retmode=json&retstart=0&retmax=20


In [5]:
# we request the content of the page (what we see if we open the link)
# the result is a bytes object (note the b prefix in the output)
site = requests.get(url).content
site

b'{"header":{"type":"esearch","version":"0.3"},"esearchresult":{"count":"1031968","retmax":"20","retstart":"0","idlist":["36198377","36198246","36198062","36197985","36197836","36197712","36197618","36197550","36197309","36196897","36196766","36196596","36196366","36196333","36196212","36196087","36195992","36195890","36195867","36195859"],"translationset":[{"from":"ai","to":"\\"antagonists and inhibitors\\"[Subheading] OR (\\"antagonists\\"[All Fields] AND \\"inhibitors\\"[All Fields]) OR \\"antagonists and inhibitors\\"[All Fields] OR \\"ai\\"[All Fields]"}],"translationstack":[{"term":"\\"antagonists and inhibitors\\"[Subheading]","field":"Subheading","count":"617323","explode":"Y"},{"term":"\\"antagonists\\"[All Fields]","field":"All Fields","count":"894319","explode":"N"},{"term":"\\"inhibitors\\"[All Fields]","field":"All Fields","count":"1331633","explode":"N"},"AND","GROUP","OR",{"term":"\\"antagonists and inhibitors\\"[All Fields]","field":"All Fields","count":"617443","explod

In [6]:
# decode the bytes into a string
site = site.decode()
site

'{"header":{"type":"esearch","version":"0.3"},"esearchresult":{"count":"1031968","retmax":"20","retstart":"0","idlist":["36198377","36198246","36198062","36197985","36197836","36197712","36197618","36197550","36197309","36196897","36196766","36196596","36196366","36196333","36196212","36196087","36195992","36195890","36195867","36195859"],"translationset":[{"from":"ai","to":"\\"antagonists and inhibitors\\"[Subheading] OR (\\"antagonists\\"[All Fields] AND \\"inhibitors\\"[All Fields]) OR \\"antagonists and inhibitors\\"[All Fields] OR \\"ai\\"[All Fields]"}],"translationstack":[{"term":"\\"antagonists and inhibitors\\"[Subheading]","field":"Subheading","count":"617323","explode":"Y"},{"term":"\\"antagonists\\"[All Fields]","field":"All Fields","count":"894319","explode":"N"},{"term":"\\"inhibitors\\"[All Fields]","field":"All Fields","count":"1331633","explode":"N"},"AND","GROUP","OR",{"term":"\\"antagonists and inhibitors\\"[All Fields]","field":"All Fields","count":"617443","explode

In [7]:
# parse the string into a Python dictionary
# the string is already formatted as JSON
json_site = json.loads(site)
json_site

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '1031968',
  'retmax': '20',
  'retstart': '0',
  'idlist': ['36198377',
   '36198246',
   '36198062',
   '36197985',
   '36197836',
   '36197712',
   '36197618',
   '36197550',
   '36197309',
   '36196897',
   '36196766',
   '36196596',
   '36196366',
   '36196333',
   '36196212',
   '36196087',
   '36195992',
   '36195890',
   '36195867',
   '36195859'],
  'translationset': [{'from': 'ai',
    'to': '"antagonists and inhibitors"[Subheading] OR ("antagonists"[All Fields] AND "inhibitors"[All Fields]) OR "antagonists and inhibitors"[All Fields] OR "ai"[All Fields]'}],
  'translationstack': [{'term': '"antagonists and inhibitors"[Subheading]',
    'field': 'Subheading',
    'count': '617323',
    'explode': 'Y'},
   {'term': '"antagonists"[All Fields]',
    'field': 'All Fields',
    'count': '894319',
    'explode': 'N'},
   {'term': '"inhibitors"[All Fields]',
    'field': 'All Fields',
    'count': '1331633

In [8]:
# we only want the idlist
UIDs = json_site['esearchresult']['idlist']
UIDs

['36198377',
 '36198246',
 '36198062',
 '36197985',
 '36197836',
 '36197712',
 '36197618',
 '36197550',
 '36197309',
 '36196897',
 '36196766',
 '36196596',
 '36196366',
 '36196333',
 '36196212',
 '36196087',
 '36195992',
 '36195890',
 '36195867',
 '36195859']
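
The decode, parse and extract steps above can be folded into one helper. A minimal sketch, assuming the response has the `esearchresult` shape shown above. The function name `extract_uids` is hypothetical, and the sample is a hand-shortened copy of the real response.

```python
import json

def extract_uids(raw_response):
    """Parse an ESEARCH JSON response (bytes or str) and return (total_count, uid_list)."""
    payload = json.loads(raw_response)   # json.loads accepts bytes as well as str
    result = payload['esearchresult']
    return int(result['count']), result['idlist']   # count arrives as a string

# hand-shortened sample with the same structure as the response shown above
sample = (b'{"header":{"type":"esearch","version":"0.3"},'
          b'"esearchresult":{"count":"1031968","retmax":"2","retstart":"0",'
          b'"idlist":["36198377","36198246"]}}')

total, uids = extract_uids(sample)
print(total)   # 1031968
print(uids)    # ['36198377', '36198246']
```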

So now we have retrieved the UIDs from a search. <br>
For each UID we want to get some information: title, authors, DOI ...

## EFETCH

The part of the API that, given a UID, returns formatted data records. <br>
It requires some parameters:
- **db**: the database (in our case pubmed)
- **id**: the UID whose data we want to retrieve
- **retmode**/**rettype**: determine the format in which the results are returned

Documentation: https://www.nlm.nih.gov/dataguide/eutilities/utilities.html#efetch
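
EFETCH also accepts several UIDs in one call, passed as a comma-separated list in `id`. A minimal sketch of building such a URL. The helper name `build_efetch_url` is hypothetical, and `safe=','` is used only to keep the commas readable in the printed URL.

```python
from urllib.parse import urlencode

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def build_efetch_url(uids, rettype='medline'):
    """Assemble an EFETCH URL for one or more UIDs given as strings."""
    params = {'db': 'pubmed', 'id': ','.join(uids), 'rettype': rettype}
    return EFETCH_URL + '?' + urlencode(params, safe=',')

print(build_efetch_url(['36198377', '36198246']))
```

The response then contains several records, which would have to be separated before applying per-record processing.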

In [9]:
# create the url for the api with the parameters we want
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
db = '?db=pubmed'
uid = '&id=' + str(36135925) # just a random one; in practice we would use the ones retrieved above
rettype = '&rettype=medline' # MEDLINE is a plain-text TAG - CONTENT format, easy to parse line by line

In [10]:
url = base_url + db + uid + rettype
print(url)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=36135925&rettype=medline


Opening the link shows that the data is structured as follows: <br>
**TAG - CONTENT** <br>
All possible tags are defined here: https://www.nlm.nih.gov/bsd/mms/medlineelements.html

In [11]:
# we retrieve the data as we did above with ESEARCH
paper_data = requests.get(url).content
paper_data = paper_data.decode()

In [12]:
# the data is in plain text
# we need to process it into a more digestible structure
paper_data

"\nPMID- 36135925\nOWN - NLM\nSTAT- Publisher\nLR  - 20220922\nIS  - 2050-084X (Electronic)\nIS  - 2050-084X (Linking)\nVI  - 11\nDP  - 2022 Sep 22\nTI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.\nLID - 10.7554/eLife.71809 [doi]\nLID - e71809 [pii]\nAB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology\n      with numerous potential applications ranging from gene therapy to population\n      control. Some proposed applications involve the integration of CRISPR/Cas9\n      endonucleases into an organism's genome, which raises questions about potentially\n      harmful effects to the transgenic individuals. One example for which this is\n      particularly relevant are CRISPR-based gene drives conceived for the genetic\n      alteration of entire populations. The performance of such drives can strongly\n      depend on fitness costs experienced by drive carriers, yet relatively little is\n      known about the magnitud

In [13]:
# we see that each TAG - CONTENT is on a new line
# so we split the text into a list
# "TAG_1 - CONTENT_1 \n TAG_2 - CONTENT_2"   ------>  [TAG_1 - CONTENT_1 , TAG_2 - CONTENT_2]

paper_data = paper_data.split('\n')
paper_data

['',
 'PMID- 36135925',
 'OWN - NLM',
 'STAT- Publisher',
 'LR  - 20220922',
 'IS  - 2050-084X (Electronic)',
 'IS  - 2050-084X (Linking)',
 'VI  - 11',
 'DP  - 2022 Sep 22',
 'TI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.',
 'LID - 10.7554/eLife.71809 [doi]',
 'LID - e71809 [pii]',
 'AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology',
 '      with numerous potential applications ranging from gene therapy to population',
 '      control. Some proposed applications involve the integration of CRISPR/Cas9',
 "      endonucleases into an organism's genome, which raises questions about potentially",
 '      harmful effects to the transgenic individuals. One example for which this is',
 '      particularly relevant are CRISPR-based gene drives conceived for the genetic',
 '      alteration of entire populations. The performance of such drives can strongly',
 '      depend on fitness costs experienced by drive carriers

In [14]:
# some contents are long (like the abstract) and wrap onto several lines,
# so the continuation lines do not follow the 'rule' TAG - CONTENT
# how can we solve it? check the first 4 characters of each line:
# if they are not blank ----> the line starts with a TAG
# if they are blank ----> the line continues the previous TAG

ls = []
for elem in paper_data:
    if elem == '':
        continue
    
    elif elem[0:4] != '    ': # not blank: a new TAG starts here
        elem = elem.strip()
        elem = elem.replace('\n', '')
        ls.append(elem)
    
    else:                    # blank: continuation line
        elem = elem.strip()
        elem = elem.replace('\n', '')
        ls[-1] = ls[-1] + ' ' + elem   # append to the content of the previous tag
ls

['PMID- 36135925',
 'OWN - NLM',
 'STAT- Publisher',
 'LR  - 20220922',
 'IS  - 2050-084X (Electronic)',
 'IS  - 2050-084X (Linking)',
 'VI  - 11',
 'DP  - 2022 Sep 22',
 'TI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.',
 'LID - 10.7554/eLife.71809 [doi]',
 'LID - e71809 [pii]',
 "AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology with numerous potential applications ranging from gene therapy to population control. Some proposed applications involve the integration of CRISPR/Cas9 endonucleases into an organism's genome, which raises questions about potentially harmful effects to the transgenic individuals. One example for which this is particularly relevant are CRISPR-based gene drives conceived for the genetic alteration of entire populations. The performance of such drives can strongly depend on fitness costs experienced by drive carriers, yet relatively little is known about the magnitude and causes of these co

In [15]:
# now we want to achieve this:
# [TAG - CONTENT] ----> [TAG , CONTENT]
# so we can access the tag and the content independently

ls_2 = [] 
for elem in ls:
    key = elem[0:4].strip()
    value = elem[5:].strip()
        
    ls_2.append([key, value])
ls_2

[['PMID', '36135925'],
 ['OWN', 'NLM'],
 ['STAT', 'Publisher'],
 ['LR', '20220922'],
 ['IS', '2050-084X (Electronic)'],
 ['IS', '2050-084X (Linking)'],
 ['VI', '11'],
 ['DP', '2022 Sep 22'],
 ['TI',
  'Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.'],
 ['LID', '10.7554/eLife.71809 [doi]'],
 ['LID', 'e71809 [pii]'],
 ['AB',
  "CRISPR/Cas9 provides a highly efficient and flexible genome editing technology with numerous potential applications ranging from gene therapy to population control. Some proposed applications involve the integration of CRISPR/Cas9 endonucleases into an organism's genome, which raises questions about potentially harmful effects to the transgenic individuals. One example for which this is particularly relevant are CRISPR-based gene drives conceived for the genetic alteration of entire populations. The performance of such drives can strongly depend on fitness costs experienced by drive carriers, yet relatively little is known about th

In [16]:
# so the data structure we have now is something like this
#    0    |     1
# ---------------------
#  tag1   |   content1
#  tag2   |   content2
#  tag3   |   content3
#  tag4   |   content4


# we find the content by iterating for each row
# and we check if the tag for the row is the one we are looking for
# if it is then we look at the content in the same row

for row in ls_2:
    if row[0] == 'TI':
        print('Title: ' + row[1] + '\n')
        
    elif row[0] == 'DP':
        print('Publication date: ' + row[1] + '\n')
        
    elif row[0] == 'JT':
        print('Journal Title: ' + row[1] + '\n')

Publication date: 2022 Sep 22

Title: Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.

Journal Title: eLife
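
Tags such as `IS` and `LID` appear more than once in a record, so an alternative to the list of `[TAG, CONTENT]` pairs is a dictionary mapping each tag to the list of its values. A minimal sketch under that assumption. The function name `medline_to_dict` is hypothetical, and the sample reuses a few lines of the record shown above.

```python
from collections import defaultdict

def medline_to_dict(text):
    """Map each tag to the list of its values, joining continuation lines."""
    record = defaultdict(list)
    last_tag = None
    for line in text.split('\n'):
        if not line.strip():
            continue                      # skip blank lines
        if line[0:4].strip():             # the first 4 characters hold a tag
            last_tag = line[0:4].strip()
            record[last_tag].append(line[5:].strip())
        elif last_tag is not None:        # continuation of the previous value
            record[last_tag][-1] += ' ' + line.strip()
    return dict(record)

sample = (
    'PMID- 36135925\n'
    'IS  - 2050-084X (Electronic)\n'
    'IS  - 2050-084X (Linking)\n'
    'AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology\n'
    '      with numerous potential applications ranging from gene therapy to population\n'
)

record = medline_to_dict(sample)
print(record['PMID'])   # ['36135925']
print(record['IS'])     # ['2050-084X (Electronic)', '2050-084X (Linking)']
```

A lookup such as `record.get('TI', [None])[0]` then replaces the loop over rows.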



# Database
We now need to repeat the procedure for all the UIDs we collected and save the information in a database.

In [17]:
# to make everything easier we create some functions

In [18]:
def get_data(UID, session):
    # fetch the MEDLINE record of a single UID, reusing the given requests session
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=medline&id='
    url = url + UID
    data = session.get(url).content
    return data.decode()

In [19]:
def process_data(data):
    ls = []
    data = data.split('\n')
    for elem in data:
        if elem == '':
            continue
        elif elem[0:4] != '    ': 
            elem = elem.strip()
            elem = elem.replace('\n', '')
            ls.append(elem)
        else:                    
            elem = elem.strip()
            elem = elem.replace('\n', '')
            ls[-1] = ls[-1] + ' ' + elem   # keep a space between the joined lines
    
    
    ls_2 = [] 
    for elem in ls:
        key = elem[0:4].strip()
        value = elem[5:].strip()
        
        ls_2.append([key, value])
    
    return ls_2


In [20]:
def get_title(processed_data):
    for row in processed_data:
        if row[0] == 'TI':
            return row[1]

def get_pubblication_date(processed_data):
    for row in processed_data:
        if row[0] == 'DP':
            return row[1]
        
def get_journal_title(processed_data):
    for row in processed_data:
        if row[0] == 'JT':
            return row[1]
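
The three getters differ only in the tag they look for, so they could be expressed through one generic helper. A minimal sketch: the name `get_field` and its `default` parameter are hypothetical additions, and the sample rows are taken from the processed record shown earlier.

```python
def get_field(processed_data, tag, default=None):
    """Return the content of the first row whose tag matches, or `default` if none does."""
    for row_tag, content in processed_data:
        if row_tag == tag:
            return content
    return default

rows = [
    ['PMID', '36135925'],
    ['DP', '2022 Sep 22'],
    ['TI', 'Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.'],
]

print(get_field(rows, 'DP'))              # 2022 Sep 22
print(get_field(rows, 'JT', 'unknown'))   # unknown
```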

In [21]:
UIDs

['36198377',
 '36198246',
 '36198062',
 '36197985',
 '36197836',
 '36197712',
 '36197618',
 '36197550',
 '36197309',
 '36196897',
 '36196766',
 '36196596',
 '36196366',
 '36196333',
 '36196212',
 '36196087',
 '36195992',
 '36195890',
 '36195867',
 '36195859']

In [22]:
import time

titles = []
dates = []
journals = []

session = requests.Session() # reuses the connection, which is faster for repeated requests to the same host

for i, UID in enumerate(UIDs, start=1):
    time.sleep(0.4) # without an API key, the API limits us to 3 calls per second
    data = get_data(UID, session)
    processed_data = process_data(data)
    
    titles.append(get_title(processed_data))
    dates.append(get_pubblication_date(processed_data))
    journals.append(get_journal_title(processed_data))
    
    print('Paper ' + str(i) + ': OK')

Paper 1: OK
Paper 2: OK
Paper 3: OK
Paper 4: OK
Paper 5: OK
Paper 6: OK
Paper 7: OK
Paper 8: OK
Paper 9: OK
Paper 10: OK
Paper 11: OK
Paper 12: OK
Paper 13: OK
Paper 14: OK
Paper 15: OK
Paper 16: OK
Paper 17: OK
Paper 18: OK
Paper 19: OK
Paper 20: OK
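
A fixed `time.sleep(0.4)` always waits, even when the previous request already took longer than that. One possible refinement is a small throttle that sleeps only for the time still missing. This is a minimal sketch: the `Throttle` class is hypothetical, and its `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)   # sleep only for the missing time
                now = self.clock()
        self.last_call = now

throttle = Throttle(min_interval=0.4)
for uid in ['36198377', '36198246', '36198062']:
    throttle.wait()   # returns immediately on the first call
    # the request for `uid` would go here
    print('Ready to request ' + uid)
```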


In [23]:
# there are different ways to create a table (DataFrame) with pandas
# my favourite is using a dictionary
# this is the basic structure

# dict = {
#   column_name : list_of_values,
#   column_name_2 : list_of_values_2
# }

In [24]:
dic = {
    'UID' : UIDs,
    'title' : titles,
    'Publication Date' : dates,
    'Journal Title' : journals
}

In [25]:
import pandas as pd

df = pd.DataFrame(dic)

In [26]:
df

| | UID | title | Publication Date | Journal Title |
|---|---|---|---|---|
| 0 | 36198377 | Hypouricemic and nephroprotective effects of p... | 2022 Oct 2 | Journal of ethnopharmacology |
| 1 | 36198246 | Persistent immune-related adverse events after... | 2022 Oct 2 | European journal of cancer (Oxford, England : ... |
| 2 | 36198062 | Reconsideration of the Benefits of Pharmacolog... | 2022 Oct 3 | The Journal of clinical psychiatry |
| 3 | 36197985 | Plasma iron controls neutrophil production and... | 2022 Oct 7 | Science advances |
| 4 | 36197836 | Risk Assessment of COVID-19 Cases in Emergency... | 2022 Sep 12 | JMIR formative research |
| 5 | 36197712 | Artificial Intelligence Applications in Health... | 2022 Oct 5 | Journal of medical Internet research |
| 6 | 36197618 | An overview of remote monitoring methods in bi... | 2022 Oct 5 | Environmental science and pollution research i... |
| 7 | 36197550 | Knockdown of RAD51AP1 suppressed cell prolifer... | 2022 Oct 5 | Discover. Oncology |
| 8 | 36197309 | A multicenter, randomized, open-label, 2-arm p... | 2022 Oct 5 | Expert opinion on biological therapy |
| 9 | 36196897 | Associations of Childhood Maltreatment and Gen... | 2022 Oct 5 | Journal of the American Heart Association |
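
A DataFrame lives only in memory. To keep the collected records between sessions, one option is the standard library's `sqlite3` module. This is a minimal sketch, not part of the original notebook: the table name `papers`, its column names and the helper `save_papers` are all hypothetical, and the two sample rows are copied from the table above (titles truncated as displayed).

```python
import sqlite3

def save_papers(connection, rows):
    """Create the `papers` table if needed and insert or update the given rows."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS papers ('
        'uid TEXT PRIMARY KEY, title TEXT, publication_date TEXT, journal_title TEXT)'
    )
    # INSERT OR REPLACE keeps one row per UID when the same paper is saved twice
    connection.executemany('INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?)', rows)
    connection.commit()

connection = sqlite3.connect(':memory:')   # pass a file path instead to persist the data
rows = [
    ('36198377', 'Hypouricemic and nephroprotective effects of p...',
     '2022 Oct 2', 'Journal of ethnopharmacology'),
    ('36197985', 'Plasma iron controls neutrophil production and...',
     '2022 Oct 7', 'Science advances'),
]
save_papers(connection, rows)

for row in connection.execute('SELECT uid, journal_title FROM papers ORDER BY uid'):
    print(row)
```

With the lists built earlier, the rows could be produced with `zip(UIDs, titles, dates, journals)`.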
