# PubMed API Tutorial

Write a program that builds a database of PubMed articles matching a specific search string:
- Define the structure of your database
- Exploit the ENTREZ API to fetch the target information
- Process the ENTREZ output into your own data structure
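
One possible way to define the structure is a small record type with one field per piece of information to keep. The sketch below is only an assumption about what the database might hold: the `Article` class and its field names are hypothetical and are not part of the ENTREZ API.

```python
from dataclasses import dataclass, asdict


@dataclass
class Article:
    """One row of the (hypothetical) article database."""
    pmid: str
    title: str = ''
    pub_date: str = ''
    abstract: str = ''


# fields can be filled in as they become available
example = Article(pmid='36135925', pub_date='2022 Sep 22')

# a plain dict is convenient when writing rows to a file later
row = asdict(example)
print(row)
```

Keeping every field as a string mirrors what the API returns; conversion to dates or numbers can happen later if it is needed.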

In [1]:
import requests
import json

In [2]:
# the text we want to search for on PubMed
inp = 'ai'

# good practice when dealing with user input:
# remove any whitespace before and after the string
inp = inp.strip()

# spaces are not valid in a URL, so we replace them with '+'
inp = inp.replace(' ', '+')
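
Replacing spaces by hand covers the simplest case. As a hedged alternative, the standard library's `urllib.parse.quote_plus` also escapes other characters that are special in a URL. The query string below is only an illustrative example and does not come from this tutorial.

```python
from urllib.parse import quote_plus

# example input with surrounding spaces and a special character
raw_query = '  machine learning & cancer '

# strip first, then encode: spaces become '+', '&' becomes '%26'
encoded = quote_plus(raw_query.strip())
print(encoded)  # machine+learning+%26+cancer
```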

## ESEARCH

The part of the API that returns a list of UIDs matching a text query.
It requires some parameters:
- **db**: the database to search, in our case pubmed
- **term**: the actual text query (user input)
- **retmode**: the output format (xml, html, json ...)
- **retmax**: how many UIDs to return (max 100,000)
- **retstart**: the index of the first UID to return (important if we want to retrieve more than 100,000 UIDs)
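
The interplay of **retstart** and **retmax** can be sketched as a paging loop: each request asks for one page of UIDs, and retstart moves forward by the page size. The helper name `page_offsets` is an assumption made for illustration; the total would come from the `count` field of the ESEARCH response.

```python
def page_offsets(total, page_size):
    """Yield (retstart, retmax) pairs that together cover `total` results."""
    for start in range(0, total, page_size):
        # the last page may be shorter than page_size
        yield start, min(page_size, total - start)


pages = list(page_offsets(total=45, page_size=20))
print(pages)  # [(0, 20), (20, 20), (40, 5)]
```

Each pair can then be plugged into one ESEARCH request.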

In [3]:
# build the API URL from the parameters we want
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
db = '?db=pubmed'
term = '&term='+inp
retmode = '&retmode=json' # JSON is easier to parse than XML
retstart = '&retstart=0' # 0 is the default, shown only for clarity
retmax = '&retmax=20' # only retrieve 20 for this example

In [4]:
url = base_url + db + term + retmode + retstart + retmax

In [5]:
print(url)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=ai&retmode=json&retstart=0&retmax=20
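
As an alternative to concatenating strings, the same URL can be assembled from a dictionary with `urllib.parse.urlencode`, which also takes care of escaping. This is a minimal sketch; the function name `build_esearch_url` and the constant `ESEARCH_URL` are assumptions made for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(term, retstart=0, retmax=20):
    """Return an ESEARCH URL for PubMed with JSON output."""
    params = {
        'db': 'pubmed',
        'term': term,
        'retmode': 'json',
        'retstart': retstart,
        'retmax': retmax,
    }
    # urlencode joins the pairs with '&' and escapes the values
    return ESEARCH_URL + '?' + urlencode(params)


url = build_esearch_url('ai')
print(url)
```

The `requests` library can do the same encoding itself when the dictionary is passed as `requests.get(ESEARCH_URL, params=params)`.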


In [6]:
# request the content of the page
# the result is raw bytes
site = requests.get(url).content

# decode the bytes into a string
site = site.decode()
site

'{"header":{"type":"esearch","version":"0.3"},"esearchresult":{"count":"1031043","retmax":"20","retstart":"0","idlist":["36150897","36150786","36150780","36150617","36150548","36150430","36150415","36150254","36149899","36149896","36149852","36149773","36149730","36149727","36149583","36149512","36149398","36149279","36148991","36148916"],"translationset":[{"from":"ai","to":"\\"antagonists and inhibitors\\"[Subheading] OR (\\"antagonists\\"[All Fields] AND \\"inhibitors\\"[All Fields]) OR \\"antagonists and inhibitors\\"[All Fields] OR \\"ai\\"[All Fields]"}],"translationstack":[{"term":"\\"antagonists and inhibitors\\"[Subheading]","field":"Subheading","count":"617280","explode":"Y"},{"term":"\\"antagonists\\"[All Fields]","field":"All Fields","count":"894020","explode":"N"},{"term":"\\"inhibitors\\"[All Fields]","field":"All Fields","count":"1329963","explode":"N"},"AND","GROUP","OR",{"term":"\\"antagonists and inhibitors\\"[All Fields]","field":"All Fields","count":"617400","explode

In [7]:
# parse the JSON string into a Python dictionary
json_site = json.loads(site)
json_site

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '1031043',
  'retmax': '20',
  'retstart': '0',
  'idlist': ['36150897',
   '36150786',
   '36150780',
   '36150617',
   '36150548',
   '36150430',
   '36150415',
   '36150254',
   '36149899',
   '36149896',
   '36149852',
   '36149773',
   '36149730',
   '36149727',
   '36149583',
   '36149512',
   '36149398',
   '36149279',
   '36148991',
   '36148916'],
  'translationset': [{'from': 'ai',
    'to': '"antagonists and inhibitors"[Subheading] OR ("antagonists"[All Fields] AND "inhibitors"[All Fields]) OR "antagonists and inhibitors"[All Fields] OR "ai"[All Fields]'}],
  'translationstack': [{'term': '"antagonists and inhibitors"[Subheading]',
    'field': 'Subheading',
    'count': '617280',
    'explode': 'Y'},
   {'term': '"antagonists"[All Fields]',
    'field': 'All Fields',
    'count': '894020',
    'explode': 'N'},
   {'term': '"inhibitors"[All Fields]',
    'field': 'All Fields',
    'count': '1329963

In [8]:
# we only want the idlist
json_site['esearchresult']['idlist']

['36150897',
 '36150786',
 '36150780',
 '36150617',
 '36150548',
 '36150430',
 '36150415',
 '36150254',
 '36149899',
 '36149896',
 '36149852',
 '36149773',
 '36149730',
 '36149727',
 '36149583',
 '36149512',
 '36149398',
 '36149279',
 '36148991',
 '36148916']
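
The extraction step can be wrapped in a small helper that returns both the total number of matches and the UIDs of the current page. This is a sketch under assumptions: `extract_ids` is a hypothetical name, and the sample string is a shortened version of the response shown above, not a live API answer.

```python
import json


def extract_ids(payload_text):
    """Return (total_count, id_list) from an ESEARCH JSON response."""
    result = json.loads(payload_text)['esearchresult']
    # 'count' arrives as a string, so convert it before doing arithmetic
    return int(result['count']), result['idlist']


# shortened sample of the ESEARCH output
sample = (
    '{"esearchresult": {"count": "1031043", "retmax": "2", '
    '"retstart": "0", "idlist": ["36150897", "36150786"]}}'
)

total, ids = extract_ids(sample)
print(total, ids)
```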

## EFETCH

The part of the API that, given a UID, returns formatted data records.
It requires some parameters:
- **db**: the database (in our case pubmed)
- **id**: the UID whose data we want to retrieve
- **rettype**: the type of record to return (here medline)
- **retmode**: the format of the output data (for example xml or text)

In [9]:
# build the API URL from the parameters we want
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
db = '?db=pubmed'
uid = '&id=' + str(36135925) # just an example; in practice use the UIDs retrieved above
rettype = '&rettype=medline' # MEDLINE tagged text is easy to parse line by line

In [10]:
url = base_url + db + uid + rettype
print(url)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=36135925&rettype=medline
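
EFETCH also accepts several UIDs in one request when they are passed as a comma-separated list, which avoids sending one request per article. The sketch below builds such a URL from UIDs returned by ESEARCH; `build_efetch_url` and `EFETCH_URL` are names assumed for illustration.

```python
from urllib.parse import urlencode

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def build_efetch_url(uids, rettype='medline'):
    """Return an EFETCH URL for one or more PubMed UIDs."""
    params = {
        'db': 'pubmed',
        'id': ','.join(uids),  # comma-separated list of UIDs
        'rettype': rettype,
    }
    # safe=',' keeps the commas readable instead of escaping them as %2C
    return EFETCH_URL + '?' + urlencode(params, safe=',')


url = build_efetch_url(['36150897', '36150786'])
print(url)
```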


In [11]:
paper_data = requests.get(url).content
paper_data = paper_data.decode()

In [12]:
# plain text: we need some processing to make it useful
paper_data

"\nPMID- 36135925\nOWN - NLM\nSTAT- Publisher\nLR  - 20220922\nIS  - 2050-084X (Electronic)\nIS  - 2050-084X (Linking)\nVI  - 11\nDP  - 2022 Sep 22\nTI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.\nLID - 10.7554/eLife.71809 [doi]\nLID - e71809 [pii]\nAB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology\n      with numerous potential applications ranging from gene therapy to population\n      control. Some proposed applications involve the integration of CRISPR/Cas9\n      endonucleases into an organism's genome, which raises questions about potentially\n      harmful effects to the transgenic individuals. One example for which this is\n      particularly relevant are CRISPR-based gene drives conceived for the genetic\n      alteration of entire populations. The performance of such drives can strongly\n      depend on fitness costs experienced by drive carriers, yet relatively little is\n      known about the magnitud

In [13]:
paper_data = paper_data.split('\n')
paper_data

['',
 'PMID- 36135925',
 'OWN - NLM',
 'STAT- Publisher',
 'LR  - 20220922',
 'IS  - 2050-084X (Electronic)',
 'IS  - 2050-084X (Linking)',
 'VI  - 11',
 'DP  - 2022 Sep 22',
 'TI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.',
 'LID - 10.7554/eLife.71809 [doi]',
 'LID - e71809 [pii]',
 'AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology',
 '      with numerous potential applications ranging from gene therapy to population',
 '      control. Some proposed applications involve the integration of CRISPR/Cas9',
 "      endonucleases into an organism's genome, which raises questions about potentially",
 '      harmful effects to the transgenic individuals. One example for which this is',
 '      particularly relevant are CRISPR-based gene drives conceived for the genetic',
 '      alteration of entire populations. The performance of such drives can strongly',
 '      depend on fitness costs experienced by drive carriers

In [14]:
# merge indented continuation lines into the field they belong to
ls = []
for elem in paper_data:
    if elem == '':
        continue
    elif elem[0:4] != '    ':
        # a new field: its tag sits in the first four characters
        ls.append(elem.strip())
    else:
        # continuation line: join with a space so words do not run together
        ls[-1] = ls[-1] + ' ' + elem.strip()
ls

['PMID- 36135925',
 'OWN - NLM',
 'STAT- Publisher',
 'LR  - 20220922',
 'IS  - 2050-084X (Electronic)',
 'IS  - 2050-084X (Linking)',
 'VI  - 11',
 'DP  - 2022 Sep 22',
 'TI  - Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.',
 'LID - 10.7554/eLife.71809 [doi]',
 'LID - e71809 [pii]',
 "AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology with numerous potential applications ranging from gene therapy to population control. Some proposed applications involve the integration of CRISPR/Cas9 endonucleases into an organism's genome, which raises questions about potentially harmful effects to the transgenic individuals. One example for which this is particularly relevant are CRISPR-based gene drives conceived for the genetic alteration of entire populations. The performance of such drives can strongly depend on fitness costs experienced by drive carriers, yet relatively little is known about the magnitude and causes of these costs. Her

In [15]:
# split each line into [tag, content]
# the tag occupies the first four characters, the content follows the '-'
ls_2 = []
for elem in ls:
    key = elem[0:4].strip()
    value = elem[5:].strip()

    ls_2.append([key, value])
ls_2

[['PMID', '36135925'],
 ['OWN', 'NLM'],
 ['STAT', 'Publisher'],
 ['LR', '20220922'],
 ['IS', '2050-084X (Electronic)'],
 ['IS', '2050-084X (Linking)'],
 ['VI', '11'],
 ['DP', '2022 Sep 22'],
 ['TI',
  'Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.'],
 ['LID', '10.7554/eLife.71809 [doi]'],
 ['LID', 'e71809 [pii]'],
 ['AB',
  "CRISPR/Cas9 provides a highly efficient and flexible genome editing technology with numerous potential applications ranging from gene therapy to population control. Some proposed applications involve the integration of CRISPR/Cas9 endonucleases into an organism's genome, which raises questions about potentially harmful effects to the transgenic individuals. One example for which this is particularly relevant are CRISPR-based gene drives conceived for the genetic alteration of entire populations. The performance of such drives can strongly depend on fitness costs experienced by drive carriers, yet relatively little is known about the magnit

In [16]:
# the first element is the tag
# the second element is the content
# the tag definitions are listed at https://www.nlm.nih.gov/bsd/mms/medlineelements.html

for elem in ls_2:
    if elem[0] == 'TI':
        print('Title: ' + elem[1])
    elif elem[0] == 'DP':
        print('Publication date: ' + elem[1])

Publication date: 2022 Sep 22
Title: Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.
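
Some tags, such as IS and LID, appear more than once in a record, so a dictionary that maps each tag to a list of values allows direct lookups without losing the repeats. The sketch below parses the raw MEDLINE text in one pass; `parse_medline` is a hypothetical helper, and the sample is a shortened excerpt of the record shown above (only the first two lines of the abstract).

```python
from collections import defaultdict


def parse_medline(text):
    """Parse one MEDLINE record into {tag: [value, ...]}."""
    fields = defaultdict(list)
    last_tag = None
    for line in text.split('\n'):
        if not line.strip():
            continue
        if line.startswith('    ') and last_tag is not None:
            # continuation line: append to the previous value with a space
            fields[last_tag][-1] += ' ' + line.strip()
        else:
            # the tag is in the first four characters, the value after '- '
            last_tag = line[0:4].strip()
            fields[last_tag].append(line[5:].strip())
    return dict(fields)


sample = (
    'PMID- 36135925\n'
    'IS  - 2050-084X (Electronic)\n'
    'IS  - 2050-084X (Linking)\n'
    'DP  - 2022 Sep 22\n'
    'AB  - CRISPR/Cas9 provides a highly efficient and flexible genome editing technology\n'
    '      with numerous potential applications ranging from gene therapy to population\n'
)

record = parse_medline(sample)
print(record['DP'][0])
print(record['IS'])
```

Every value is a list, even for tags that occur once, so the calling code does not need to distinguish the two cases.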


In [17]:
# once you have the data you can store it in a CSV file with pandas
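
A minimal sketch of this last step follows. It uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies, and it writes to an in-memory buffer instead of a real file. The column names are assumptions chosen for illustration; the row reuses the values printed above.

```python
import csv
import io

# one dict per article; the keys become the CSV columns
rows = [
    {
        'pmid': '36135925',
        'pub_date': '2022 Sep 22',
        'title': 'Fitness effects of CRISPR endonucleases in Drosophila melanogaster populations.',
    },
]

# StringIO stands in for a file opened with open(path, 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['pmid', 'pub_date', 'title'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

With pandas installed, `pandas.DataFrame(rows).to_csv(path, index=False)` would write the same table to a file.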