<a href="https://colab.research.google.com/github/atjoelpark/eutilities/blob/main/demo/PubMedExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Obtaining Keys from the US National Library of Medicine
# Link: https://www.ncbi.nlm.nih.gov/home/develop/api/

In [None]:
# The two APIs of interest for PubMed Articles
# 1. Entrez Programming Utilities
# 2. PubMed Central (PMC) APIs

# The website documents the scope and uses of each API.

In [None]:
# Use of the APIs is free, but please do not abuse the services.
# 1. Do not run API requests concurrently.
# 2. Include parameters to help identify what services you require.
# 2a. 'tool' should be the name of your application, as a string value with no internal spaces.
# 2b. 'email' should be the email address of the maintainer of the tool
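
The guidelines above can be sketched as a small URL builder that always attaches `tool` and `email`. This is only an illustration: `build_eutils_url` is a hypothetical helper, the tool name and email address are placeholders, and the `api_key` parameter is only sent when a key has been set.

```python
from urllib.parse import urlencode

# Base URL taken from the requests used elsewhere in this notebook.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(endpoint, tool, email, api_key="", **query):
    """Return an E-utilities URL that identifies the caller (hypothetical helper)."""
    if " " in tool:
        raise ValueError("'tool' must not contain internal spaces")
    params = dict(query)
    params["tool"] = tool
    params["email"] = email
    if api_key:  # only send the key when one has been set
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/{endpoint}.fcgi?{urlencode(params)}"


url = build_eutils_url(
    "esearch",
    tool="pubmed_extraction_demo",       # placeholder application name
    email="Your.Name.Here@example.org",  # placeholder maintainer address
    db="pubmed",
    term="science[journal] AND breast cancer AND 2008[pdat]",
)
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the hand-written URLs used later in this notebook.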

# Setting up Google Colab

In [None]:
# Setting up keys
API_KEY = ""
GOOGLE_DRIVE_URL = ""
GOOGLE_FOLDER_TO_SAVE = "Research"

In [None]:
# Mounting Google Drive to Google Colab
from google.colab import drive
drive.mount(f'/content/drive/{GOOGLE_DRIVE_URL}')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# Loading libraries (dependencies)
import numpy as np 
import pandas as pd 
import re 
import requests

# Main

In [None]:
# Using the requests module
# We pass the URL, including the desired query parameters, to the function "get"
# In this case, we search for breast cancer articles published in the journal Science in 2008.
response = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]')

In [None]:
# Exploring response
print(response)

<Response [200]>


A 200 response indicates that the request succeeded and the server returned the data. Let us explore the response.

To see which attributes and methods a response offers, see this reference: [W3 Schools Python Requests Module](https://www.w3schools.com/python/ref_requests_response.asp).

In [None]:
print(response.text)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id>
<Id>18927361</Id>
<Id>18787170</Id>
<Id>18487186</Id>
<Id>18239126</Id>
<Id>18239125</Id>
</IdList><TranslationSet><Translation>     <From>science[journal]</From>     <To>"Science"[Journal] OR "Science (1979)"[Journal]</To>    </Translation><Translation>     <From>breast cancer</From>     <To>"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"Science"[Journal]</Term>    <Field>Journal</Field>    <Count>179506</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>"Science (1979)"[Jou

Note that `response.text` is the attribute that lets us view the XML response. Generally, if you only intend to find a list of PMIDs that match your PubMed query, it may be easier to use the PubMed website. Return to the Google Slides for further context.

Note that because the response is in XML, we will need to import an additional module from the Python standard library, `xml.etree.ElementTree`, to parse the XML into something usable.

**Resource**: 

* [Python XML with ElementTree: Beginner's Guide in DataCamp](https://www.datacamp.com/community/tutorials/python-xml-elementtree)

* [Python Docs XML ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html)


In [None]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response.text)
tree = ET.ElementTree(root)

In [None]:
# Finding the root XML tag
print(root.tag)

eSearchResult


In [None]:
# Iterate over the XML children tags
for child in root:
  print(child.tag)

Count
RetMax
RetStart
IdList
TranslationSet
TranslationStack
QueryTranslation


In [None]:
# We see that IdList is the XML tag of interest.
# root.iter('IdList') yields each IdList element in the tree (here, just one), not its children
for child in root.iter('IdList'):
  print(child.tag)

IdList


In [None]:
# Iterate over IdList to see what PMIDs exist.
for id in root.iter('Id'):
  print(id.text)

19008416
18927361
18787170
18487186
18239126
18239125


Following the documentation further, we could define a function that returns the PMIDs for a query. That said, PMIDs can also be obtained from the PubMed website.
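
As a minimal sketch of such a function, the parsing steps above can be collected as follows. The name `extract_pmids` is hypothetical, and the sample XML is a shortened stand-in for the ESearch response shown earlier, so no request is sent.

```python
import xml.etree.ElementTree as ET


def extract_pmids(esearch_xml):
    """Return the PMIDs found under <IdList> in an ESearch XML string."""
    root = ET.fromstring(esearch_xml)
    id_list = root.find("IdList")
    if id_list is None:  # no IdList element in the response
        return []
    return [element.text for element in id_list.findall("Id")]


# Shortened sample in the same shape as the response printed above
sample_xml = """<eSearchResult><Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id>
<Id>18927361</Id>
</IdList></eSearchResult>"""

pmids = extract_pmids(sample_xml)
print(pmids)  # prints ['19008416', '18927361']
```

With a live response, `extract_pmids(response.text)` would be the equivalent call.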

More importantly, we need to be able to extract PubMed metadata. Let's access PubMed ID 19008416 using the ESummary version 2.0 output.

Note that multiple IDs can be entered at once like this:

`id=19008416,18927361,18787170,18487186,18239126,18239125`
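
For longer lists, the comma-separated `id` value can be built in batches. The sketch below assumes a hypothetical helper `batch_pmids`; the default batch size of 200 is an arbitrary assumption, not a documented limit.

```python
def batch_pmids(pmids, batch_size=200):
    """Split PMIDs into comma-joined strings of at most batch_size IDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = [str(pmid) for pmid in pmids]
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


batches = batch_pmids(
    [19008416, 18927361, 18787170, 18487186, 18239126, 18239125],
    batch_size=4,  # small value chosen only to show the split
)
print(batches)
# prints ['19008416,18927361,18787170,18487186', '18239126,18239125']
```

Each string in `batches` can then be used as the `id` parameter of one request, sent one after another rather than concurrently.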

In [None]:
# https://www.ncbi.nlm.nih.gov/books/NBK25500/
response = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=19008416&version=2.0')

In [None]:
# Response OK?
print(response)

<Response [200]>


In [None]:
# The return was successful.
# Let's look at the content
print(response.text)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary pubmed 20160808//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20160808/esummary_pubmed.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build210822-1333.1</DbBuild>

<DocumentSummary uid="19008416">
	<PubDate>2008 Dec 12</PubDate>
	<EPubDate>2008 Nov 13</EPubDate>
	<Source>Science</Source>
	<Authors>
		<Author>
			<Name>Varambally S</Name>
			<AuthType>Author</AuthType>
			<ClusterID></ClusterID>
		</Author>
		<Author>
			<Name>Cao Q</Name>
			<AuthType>Author</AuthType>
			<ClusterID></ClusterID>
		</Author>
		<Author>
			<Name>Mani RS</Name>
			<AuthType>Author</AuthType>
			<ClusterID></ClusterID>
		</Author>
		<Author>
			<Name>Shankar S</Name>
			<AuthType>Author</AuthType>
			<ClusterID></ClusterID>
		</Author>
		<Author>
			<Name>Wang X</Name>
			<AuthType>Author</AuthType>
			<ClusterID></ClusterID>
		</Author>
		<Author>
			<Name>Ateeq B</Name>
			<AuthType>Author</

In [None]:
# We can extract metadata of interest.
# For instance, let us retrieve the publication date
root = ET.fromstring(response.text)

In [None]:
for date in root.iter("PubDate"):
  print(date.text)

2008 Dec 12


Information can be extracted this way, but we need a more scalable approach.
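
To illustrate why the manual route becomes tedious, here is a rough sketch that pulls several fields per document with `ElementTree`. The function name `summarize_documents` and the choice of fields are assumptions, and the sample XML is a trimmed version of the ESummary output shown above.

```python
import xml.etree.ElementTree as ET


def summarize_documents(esummary_xml, fields=("PubDate", "Source")):
    """Return one dict per <DocumentSummary> with the requested fields."""
    root = ET.fromstring(esummary_xml)
    rows = []
    for doc in root.iter("DocumentSummary"):
        row = {"uid": doc.get("uid")}
        for field in fields:
            # findtext returns the default when the tag is absent
            row[field] = doc.findtext(field, default="")
        row["Authors"] = [
            author.findtext("Name", default="") for author in doc.iter("Author")
        ]
        rows.append(row)
    return rows


# Trimmed sample in the same shape as the ESummary response above
sample_xml = """<eSummaryResult>
<DocumentSummarySet status="OK">
<DocumentSummary uid="19008416">
    <PubDate>2008 Dec 12</PubDate>
    <Source>Science</Source>
    <Authors>
        <Author><Name>Varambally S</Name></Author>
        <Author><Name>Cao Q</Name></Author>
    </Authors>
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>"""

rows = summarize_documents(sample_xml)
print(rows)
```

Every additional field or nested structure needs its own handling here, which is the work a dedicated package takes off our hands.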

Python packages already exist that can expedite this process. (This is where the API key comes in handy.)

[Bio.Entrez (Biopython)](https://biopython.org/docs/1.75/api/Bio.Entrez.html)

We will be using `Bio.Entrez.esummary(**keywds)` to retrieve document summaries on PMIDs of interest.

In [None]:
# Installing Bio.Entrez (the Bio package pulls in Biopython, which provides it)
!pip install Bio

Collecting Bio
  Downloading bio-0.7.4-py3-none-any.whl (80 kB)
Collecting biopython>=1.79
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
Installing collected packages: biopython, Bio
Successfully installed Bio-0.7.4 biopython-1.79


In [None]:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esummary(db="pubmed", id="19008416")
record = Entrez.read(handle)
handle.close()

In [None]:
# record now contains the esummary for PMID 19008416
# Note that the id parameter will accept a list of PMIDs
# Entrez.read returns a list.
# Given that we have only supplied one PMID, the list is going to be length 1
# Note that Python uses zero-indexing, so the first record will start at 0
print(record[0])

{'Item': [], 'Id': '19008416', 'PubDate': '2008 Dec 12', 'EPubDate': '2008 Nov 13', 'Source': 'Science', 'AuthorList': ['Varambally S', 'Cao Q', 'Mani RS', 'Shankar S', 'Wang X', 'Ateeq B', 'Laxman B', 'Cao X', 'Jing X', 'Ramnarayanan K', 'Brenner JC', 'Yu J', 'Kim JH', 'Han B', 'Tan P', 'Kumar-Sinha C', 'Lonigro RJ', 'Palanisamy N', 'Maher CA', 'Chinnaiyan AM'], 'LastAuthor': 'Chinnaiyan AM', 'Title': 'Genomic loss of microRNA-101 leads to overexpression of histone methyltransferase EZH2 in cancer.', 'Volume': '322', 'Issue': '5908', 'Pages': '1695-9', 'LangList': ['English'], 'NlmUniqueID': '0404511', 'ISSN': '0036-8075', 'ESSN': '1095-9203', 'PubTypeList': ['Journal Article'], 'RecordStatus': 'PubMed - indexed for MEDLINE', 'PubStatus': 'ppublish+epublish', 'ArticleIds': {'pubmed': ['19008416'], 'medline': [], 'pii': '1165395', 'doi': '10.1126/science.1165395', 'pmc': 'PMC2684823', 'mid': 'NIHMS104414', 'rid': '19008416', 'eid': '19008416', 'pmcid': 'pmc-id: PMC2684823;manuscript-id

In [None]:
# Each entry in the list behaves like a Python dictionary
# Let's see if we can extract the PubDate
print(record[0]['PubDate'])

2008 Dec 12


With this in mind, let us wrap these steps in a function. The function should accept a list of PMIDs and return a Pandas DataFrame.

In [None]:
def returnPMIDdata(list_of_pmids):
  # Convert to string data type first
  list_of_pmids = list(map(str,list_of_pmids))

  # Join the list of PMIDS into one string object
  str_pmids = ','.join(list_of_pmids)

  # Calling Entrez.esummary to send request to API
  handle = Entrez.esummary(db="pubmed", id=str_pmids)
  record = Entrez.read(handle)
  
  # Closing the connection
  handle.close()

  # Identifying what metadata we would like
  columns = ['PubDate', 'Source', 'Title', 'Volume', 'Issue', 'DOI', 'FullJournalName']
  
  # Converting the data into a dataframe
  df = pd.DataFrame(columns=columns)

  # Adding data into dataframe
  for count, value in enumerate(record):
    df.loc[count,'PubDate'] = value['PubDate']
    df.loc[count,'Source'] = value['Source']
    df.loc[count,'Title'] = value['Title']
    df.loc[count,'Volume'] = value['Volume']
    df.loc[count,'Issue'] = value['Issue']
    df.loc[count,'DOI'] = value['DOI']
    df.loc[count,'FullJournalName'] = value['FullJournalName']

  # Return dataframe
  return df
    

In [None]:
list_of_pmids = [19008416, 18927361, 18787170, 18487186, 18239126, 18239125]

**Explaining what is going on under the hood in the returnPMIDdata function**

In [None]:
list_pmids = list(map(str,list_of_pmids))
print(list_pmids)

['19008416', '18927361', '18787170', '18487186', '18239126', '18239125']


In [None]:
# Joining the pmids into one string object
str_pmids = ','.join(list_pmids)
print(str_pmids)

19008416,18927361,18787170,18487186,18239126,18239125


In [None]:
# Calling Entrez.esummary to send request to API
handle = Entrez.esummary(db="pubmed", id=str_pmids)
record = Entrez.read(handle)

# Closing the connection
handle.close()

print(record)

[{'Item': [], 'Id': '19008416', 'PubDate': '2008 Dec 12', 'EPubDate': '2008 Nov 13', 'Source': 'Science', 'AuthorList': ['Varambally S', 'Cao Q', 'Mani RS', 'Shankar S', 'Wang X', 'Ateeq B', 'Laxman B', 'Cao X', 'Jing X', 'Ramnarayanan K', 'Brenner JC', 'Yu J', 'Kim JH', 'Han B', 'Tan P', 'Kumar-Sinha C', 'Lonigro RJ', 'Palanisamy N', 'Maher CA', 'Chinnaiyan AM'], 'LastAuthor': 'Chinnaiyan AM', 'Title': 'Genomic loss of microRNA-101 leads to overexpression of histone methyltransferase EZH2 in cancer.', 'Volume': '322', 'Issue': '5908', 'Pages': '1695-9', 'LangList': ['English'], 'NlmUniqueID': '0404511', 'ISSN': '0036-8075', 'ESSN': '1095-9203', 'PubTypeList': ['Journal Article'], 'RecordStatus': 'PubMed - indexed for MEDLINE', 'PubStatus': 'ppublish+epublish', 'ArticleIds': {'pubmed': ['19008416'], 'medline': [], 'pii': '1165395', 'doi': '10.1126/science.1165395', 'pmc': 'PMC2684823', 'mid': 'NIHMS104414', 'rid': '19008416', 'eid': '19008416', 'pmcid': 'pmc-id: PMC2684823;manuscript-i

In [None]:
# Number of items
print(len(record))

6


In [None]:
# What data fields exist within this response?
record[0].keys()

dict_keys(['Item', 'Id', 'PubDate', 'EPubDate', 'Source', 'AuthorList', 'LastAuthor', 'Title', 'Volume', 'Issue', 'Pages', 'LangList', 'NlmUniqueID', 'ISSN', 'ESSN', 'PubTypeList', 'RecordStatus', 'PubStatus', 'ArticleIds', 'DOI', 'History', 'References', 'HasAbstract', 'PmcRefCount', 'FullJournalName', 'ELocationID', 'SO'])

In [None]:
# Let's say we are interested in 'PubDate', 'Source', 'Title', 'Volume', 'Issue', 'DOI', 'FullJournalName'
columns = ['PubDate', 'Source', 'Title', 'Volume', 'Issue', 'DOI', 'FullJournalName']

In [None]:
# Let's examine the first item in the list
for column in columns:
  print(record[0][column])

2008 Dec 12
Science
Genomic loss of microRNA-101 leads to overexpression of histone methyltransferase EZH2 in cancer.
322
5908
10.1126/science.1165395
Science (New York, N.Y.)


You can see that this is then converted into a dataframe via the function.

In [None]:
sample_df = returnPMIDdata(list_of_pmids)
sample_df

Unnamed: 0,PubDate,Source,Title,Volume,Issue,DOI,FullJournalName
0,2008 Dec 12,Science,Genomic loss of microRNA-101 leads to overexpr...,322,5908,10.1126/science.1165395,"Science (New York, N.Y.)"
1,2008 Oct 17,Science,Genetics. DNA test for breast cancer risk draw...,322,5900,10.1126/science.322.5900.357,"Science (New York, N.Y.)"
2,2008 Sep 12,Science,FBXW7 targets mTOR for degradation and coopera...,321,5895,10.1126/science.1162981,"Science (New York, N.Y.)"
3,2008 May 16,Science,Design logic of a cannabinoid receptor signali...,320,5878,10.1126/science.1152662,"Science (New York, N.Y.)"
4,2008 Feb 1,Science,Cancer proliferation gene discovery through fu...,319,5863,10.1126/science.1149200,"Science (New York, N.Y.)"
5,2008 Feb 1,Science,Profiling essential genes in human mammary cel...,319,5863,10.1126/science.1149185,"Science (New York, N.Y.)"


# Accessing E-utilities via Unix
**Advanced Topic**

E-utilities can also be accessed from the Unix command line through Entrez Direct. If you are comfortable with Unix and command-line interfaces, this is an alternative way to pull in data.

This setup takes advantage of the underlying Google Compute Engine virtual machine that powers Google Colab. Google Colab is a Jupyter-like notebook layered on top of a virtual machine. Google Colab Pro users can access the terminal by clicking the icon at the bottom left of the screen.

Regardless of whether you have Google Colab Pro, the function below, `e_utilities_install()`, will enable this Colab to access the Unix-based E-utilities API.

In [None]:
# Installing E-utilities Entrez Direct
def e_utilities_install():
  """
  Installs e_utilities
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile
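
Before running the installer again, it may help to check whether the tools are already present. The sketch below assumes the install location `$HOME/edirect` used throughout this notebook; `edirect_installed` is a hypothetical helper.

```python
import os


def edirect_installed(edirect_dir=None):
    """Return True if efetch and xtract are executable in edirect_dir."""
    if edirect_dir is None:
        # Assumed default: the $HOME/edirect folder used in this notebook
        edirect_dir = os.path.join(os.path.expanduser("~"), "edirect")
    return all(
        os.access(os.path.join(edirect_dir, tool), os.X_OK)
        for tool in ("efetch", "xtract")
    )


print(edirect_installed())
```

If this prints `False`, calling `e_utilities_install()` is the next step.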

In [None]:

# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  This depends on E-utilities (Entrez Direct). The input parameter is a list of PMIDs.
  The output is a Pandas DataFrame with the following columns:

  1. PMID
  2. PubMed Article Title
  3. DateCompleted (Year_Month_Day)
  4. DateRevised (Year_Month_Day)
  5. Journal Title
  6. Publication Date (Year_Month_Day)
  7. Abstract
  8. Author FirstName_LastName_Affiliation (Note that the three values are
  separated by "_". If an author has affiliations with multiple institutions, the
  institutions are separated by the character "/".)

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame
  @raise: Nothing is propagated; an error for a PMID is caught and printed
  """
  columns = ["PMID", "PubMed_Article_Title", "Date_Completed_Year", 
             "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year", 
             "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
             "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
             "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]
  df = pd.DataFrame(columns=columns)

  for id, i in enumerate(pmid):
    try:
      _temp = f'''$HOME/edirect/efetch -db pubmed -id {i} -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab "/" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation'''

      _result = !{_temp}
      _temp = _result[0].split("|")

      for _count, _value in enumerate(_temp):
        df.loc[id,columns[_count]] = _value
        
    except Exception as e:
      print(f"Error Raised when Querying Unix EDirect for PMID: {i}")
      print(e)

  # Prior to returning df
  # If any cells hold an empty string, convert them to "NULL"
  df = df.replace("", "NULL")

  return df

The function above takes a list of PMIDs and returns a Pandas DataFrame.

If you are interested in what runs under the hood: a Unix query is executed for each PMID, and the returned data is then loaded into the DataFrame. An example query appears below.
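
The step that turns one line of query output into a row can be sketched on its own. This is an illustration rather than the function's actual code path: `parse_xtract_line` is a hypothetical helper, and the sample line is cut down to the first twelve fields of the output shown in this notebook.

```python
COLUMNS = ["PMID", "PubMed_Article_Title", "Date_Completed_Year",
           "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year",
           "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
           "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
           "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]


def parse_xtract_line(line, columns=COLUMNS, missing="NULL"):
    """Map one "|"-delimited output line onto the column names."""
    fields = line.rstrip("\n").split("|")
    if len(fields) > len(columns):
        raise ValueError(f"expected at most {len(columns)} fields, got {len(fields)}")
    # Replace empty fields and pad short lines with the placeholder
    fields = [field if field else missing for field in fields]
    fields += [missing] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


# Sample line truncated after the publication day (no abstract or authors)
sample_line = ("19008416|Genomic loss of microRNA-101 leads to overexpression of "
               "histone methyltransferase EZH2 in cancer.|2009|01|05|2018|11|13|"
               "Science (New York, N.Y.)|2008|Dec|12")

row = parse_xtract_line(sample_line)
print(row["PMID"], row["Publication_Date_Month"], row["Abstract"])
# prints 19008416 Dec NULL
```

Raising an explicit error for surplus fields makes a stray `|` in the data easier to spot than an index error inside the loop.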

## Main

In [None]:
# Install E-utilities (Unix)
# When prompted here, please type `y` to let the installer update your PATH
e_utilities_install()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   665  100   665    0     0   2131      0 --:--:-- --:--:-- --:--:--  2131

Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "export PATH=\${PATH}:/root/edirect" >> $HOME/.bashrc

or manually edit the PATH variable assignment in your .bashrc file.

Would you like to do that automatically now? [y/N]
y
OK, done.


In [None]:
# Example of a query of PMID 19008416
!$HOME/edirect/efetch -db pubmed -id 19008416 -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab "/" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation

19008416|Genomic loss of microRNA-101 leads to overexpression of histone methyltransferase EZH2 in cancer.|2009|01|05|2018|11|13|Science (New York, N.Y.)|2008|Dec|12|Enhancer of zeste homolog 2 (EZH2) is a mammalian histone methyltransferase that contributes to the epigenetic silencing of target genes and regulates the survival and metastasis of cancer cells. EZH2 is overexpressed in aggressive solid tumors by mechanisms that remain unclear. Here we show that the expression and function of EZH2 in cancer cell lines are inhibited by microRNA-101 (miR-101). Analysis of human prostate tumors revealed that miR-101 expression decreases during cancer progression, paralleling an increase in EZH2 expression. One or both of the two genomic loci encoding miR-101 were somatically lost in 37.5% of clinically localized prostate cancer cells (6 of 16) and 66.7% of metastatic disease cells (22 of 33). We propose that the genomic loss of miR-101 in cancer leads to overexpression of EZH2 and concomitan

In [None]:
# Reuse the list of PMIDs from the previous example
# This calls the function pull_pmid_metadata,
# which organizes the results into a dataframe
df = pull_pmid_metadata(list_of_pmids)

In [None]:
# Examining the dataframe
df

Unnamed: 0,PMID,PubMed_Article_Title,Date_Completed_Year,Date_Completed_Month,Date_Completed_Day,Date_Revised_Year,Date_Revised_Month,Date_Revised_Day,Journal_Title,Publication_Date_Year,Publication_Date_Month,Publication_Date_Day,Abstract,AuthorFirstName_AuthorLastName_Affiliation
0,19008416,Genomic loss of microRNA-101 leads to overexpr...,2009,1,5,2018,11,13,"Science (New York, N.Y.)",2008,Dec,12,Enhancer of zeste homolog 2 (EZH2) is a mammal...,Sooryanarayana_Varambally_Michigan Center for ...
1,18927361,Genetics. DNA test for breast cancer risk draw...,2008,11,3,2009,11,19,"Science (New York, N.Y.)",2008,Oct,17,,Jennifer_Couzin
2,18787170,FBXW7 targets mTOR for degradation and coopera...,2008,9,25,2018,11,13,"Science (New York, N.Y.)",2008,Sep,12,The enzyme mTOR (mammalian target of rapamycin...,"Jian-Hua_Mao_Cancer Research Institute, Univer..."
3,18487186,Design logic of a cannabinoid receptor signali...,2008,5,27,2018,11,13,"Science (New York, N.Y.)",2008,May,16,Cannabinoid receptor 1 (CB1R) regulates neuron...,Kenneth D_Bromberg_Department of Pharmacology ...
4,18239126,Cancer proliferation gene discovery through fu...,2008,2,12,2019,1,9,"Science (New York, N.Y.)",2008,Feb,1,Retroviral short hairpin RNA (shRNA)-mediated ...,Michael R_Schlabach_Howard Hughes Medical Inst...
5,18239125,Profiling essential genes in human mammary cel...,2008,2,12,2018,11,13,"Science (New York, N.Y.)",2008,Feb,1,By virtue of their accumulated genetic alterat...,Jose M_Silva_Watson School of Biological Scien...


In [None]:
# Saving the df to Google Drive
df.to_csv(f"/content/drive/My Drive/{GOOGLE_FOLDER_TO_SAVE}/PMID_MetaData.csv", index=False)