<a href="https://colab.research.google.com/github/atjoelpark/eutilities/blob/main/demo/PubMedExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Obtaining Keys from the US National Library of Medicine
# Link: https://www.ncbi.nlm.nih.gov/home/develop/api/

In [2]:
# The two APIs of interest for PubMed Articles
# 1. Entrez Programming Utilities
# 2. PubMed Central (PMC) APIs

# The website documents the scope and intended use of each API.

In [3]:
# Use of the APIs is free, but please do not abuse the services.
# 1. Do not run API requests concurrently.
# 2. Include parameters that identify your application to NCBI.
# 2a. 'tool' should be the name of your application, as a string value with no internal spaces.
# 2b. 'email' should be the email address of the maintainer of the tool
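
The guidelines above can be sketched in code. The example below is a minimal illustration, not an official client: `build_esearch_url` and `fetch_sequentially` are hypothetical helper names, `my_pubmed_tool` and the email address are placeholders, and the pause between requests is an assumed value that should be tuned to NCBI's current rate limits. It uses only the standard library so it runs without a network call.

```python
import time
from urllib.parse import urlencode

# Base URL of the ESearch endpoint (the same endpoint queried later in this notebook).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, tool, email, api_key=""):
    """Hypothetical helper: build an ESearch URL that identifies the caller."""
    if " " in tool:
        raise ValueError("'tool' must not contain internal spaces")
    params = {"db": "pubmed", "term": term, "tool": tool, "email": email}
    if api_key:
        # Only send the key when one has been configured.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def fetch_sequentially(urls, fetch, pause_seconds=0.5):
    """Hypothetical helper: call fetch(url) one URL at a time, never concurrently."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Assumed pause between consecutive requests.
            time.sleep(pause_seconds)
        results.append(fetch(url))
    return results


url = build_esearch_url(
    term="science[journal] AND breast cancer AND 2008[pdat]",
    tool="my_pubmed_tool",            # placeholder application name
    email="maintainer@example.com",   # placeholder maintainer address
)
print(url)
```

In this notebook, `fetch` could be `requests.get`, so that `fetch_sequentially([url], requests.get)` returns a list of response objects.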

# Setting up Google Colab

In [5]:
# Setting up keys
API_KEY = ""
GOOGLE_DRIVE_URL = ""

In [None]:
# Mounting Google Drive to Google Colab
from google.colab import drive
drive.mount(f'/content/drive/{GOOGLE_DRIVE_URL}')

In [6]:
# Loading libraries (dependencies)
import numpy as np 
import pandas as pd 
import re 
import requests

# Main

In [8]:
# Using the requests module
# We pass the E-utilities URL, including the desired query parameters, to requests.get().
# Here we search for breast cancer articles published in the journal Science in 2008.
response = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]')

In [10]:
# Exploring response
print(response)

<Response [200]>


A status code of 200 indicates that the request succeeded and the server returned the data. Let us explore the response.

To see which attributes and methods the response object offers, see the [W3 Schools Python Requests Module](https://www.w3schools.com/python/ref_requests_response.asp) reference.

In [16]:
print(response.text)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id>
<Id>18927361</Id>
<Id>18787170</Id>
<Id>18487186</Id>
<Id>18239126</Id>
<Id>18239125</Id>
</IdList><TranslationSet><Translation>     <From>science[journal]</From>     <To>"Science"[Journal] OR "Science (1979)"[Journal]</To>    </Translation><Translation>     <From>breast cancer</From>     <To>"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"Science"[Journal]</Term>    <Field>Journal</Field>    <Count>179506</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>"Science (1979)"[Jou

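Note that `response.text` is the attribute that returns the body of the response as a string. Here it is the XML shown above, which lists the six matching PMIDs in `<Id>` elements.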
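
Because the response body is XML, the PMIDs can be pulled out with Python's standard `xml.etree.ElementTree` module. The sketch below is one possible approach and is not part of the original notebook: `parse_esearch_ids` is a hypothetical helper, and `sample_xml` is an abbreviated copy of the output above (the `TranslationSet` and `TranslationStack` elements are omitted) so that the example runs without a network call.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the ESearch output shown above (translation details omitted).
sample_xml = """<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>6</Count><RetMax>6</RetMax><RetStart>0</RetStart><IdList>
<Id>19008416</Id>
<Id>18927361</Id>
<Id>18787170</Id>
<Id>18487186</Id>
<Id>18239126</Id>
<Id>18239125</Id>
</IdList></eSearchResult>"""


def parse_esearch_ids(xml_text):
    """Hypothetical helper: return (total hit count, list of PMIDs) from ESearch XML."""
    root = ET.fromstring(xml_text)
    # <Count> directly under the root is the total number of matching records.
    count = int(root.findtext("Count"))
    # Each <Id> under <IdList> holds one PMID as text.
    pmids = [id_element.text for id_element in root.findall("./IdList/Id")]
    return count, pmids


count, pmids = parse_esearch_ids(sample_xml)
print(count)
print(pmids)
```

With a live request, the same helper could be called as `parse_esearch_ids(response.text)`.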

# Accessing E-utilities via Unix
**Advanced Topic**

E-utilities can also be used from the Unix command line through Entrez Direct (EDirect). If you are comfortable with Unix and command-line interfaces, this is an alternative way to pull in data.

This setup takes advantage of the underlying Google Compute Engine virtual machine that powers Google Colab. Google Colab is a Jupyter-like notebook layered on top of a virtual machine. Google Colab Pro users can access the terminal by clicking the icon at the bottom left of the screen.

Whether or not you have Google Colab Pro, the function `e_utilities_install()` below enables this Colab to use the Unix-based E-utilities tools.

In [11]:
# Installing E-utilities Entrez Direct
def e_utilities_install():
  """
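  Installs Entrez Direct (EDirect), the command-line interface to E-utilities.
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  # Download the EDirect install script
  !curl -L https://www.ncbi.nlm.nih.gov/books/NBK179288/bin/install-edirect.sh > install-edirect.sh
  # Run the installer
  !bash install-edirect.sh -y
  # Add the EDirect directory to PATH for future login shells
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile
  # Remove the downloaded install script
  !rm install-edirect.sh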

In [18]:

# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  This depends on E-utilities (Entrez Direct). The input parameter is a list of PMIDs.
  The output is a Pandas DataFrame with the following columns:

  1. PMID
  2. PubMed Article Title
  3. DateCompleted (Year_Month_Day)
  4. DateRevised (Year_Month_Day)
  5. Journal Title
  6. Publication Date (Year_Month_Day)
  7. Abstract
  8. Author FirstName_LastName_Affiliation (Note that the three values are
  separated by "_". If an author has affiliations to multiple institutions, the
  institutions are separated by the character "/".)

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame
  @note: Exceptions raised while querying a PMID are caught and printed, and
  processing continues with the next PMID.
  """
  columns = ["PMID", "PubMed_Article_Title", "Date_Completed_Year", 
             "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year", 
             "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
             "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
             "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]
  df = pd.DataFrame(columns=columns)

  for _row, _pmid in enumerate(pmid):
    try:
      # Build the EDirect pipeline: efetch downloads the record as XML,
      # xtract flattens the selected elements into one "|"-delimited line
      _temp = f'''$HOME/edirect/efetch -db pubmed -id {_pmid} -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab "/" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation'''

      # Run the pipeline in the shell and split the first output line into fields
      _result = !{_temp}
      _temp = _result[0].split("|")

      # Write each field into the column at the same position
      for _count, _value in enumerate(_temp):
        df.loc[_row,columns[_count]] = _value

    except Exception as e:
      print(f"Error Raised when Querying Unix EDirect for PMID: {_pmid}")
      print(e)

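  # Prior to returning df
  # If any cells hold an empty (or whitespace-only) string, convert them to "NULL"
  # The anchored pattern matches only fully empty cells; an unanchored empty
  # pattern would match between every character of every string
  df = df.replace(r'^\s*$', "NULL", regex=True)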

  return df
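
To make the column mapping inside the loop concrete, the sketch below shows, in plain Python, how one `|`-delimited line of output can be paired with the column names. It is an illustration only: `xtract_line_to_row` is a hypothetical helper, the shortened column list is an assumption made for brevity, and the sample line is invented placeholder text rather than real EDirect output.

```python
def xtract_line_to_row(line, columns, delimiter="|"):
    """Hypothetical helper: pair each delimited field with its column name."""
    fields = line.split(delimiter)
    if len(fields) > len(columns):
        raise ValueError(
            f"expected at most {len(columns)} fields, got {len(fields)}"
        )
    # Start with "NULL" everywhere so missing trailing fields stay "NULL".
    row = {column: "NULL" for column in columns}
    for column, field in zip(columns, fields):
        # Keep non-empty fields; empty or whitespace-only fields stay "NULL".
        if field.strip():
            row[column] = field
    return row


# Shortened, illustrative column list and an invented placeholder line.
example_columns = ["PMID", "PubMed_Article_Title", "Journal_Title"]
example_line = "00000000|Placeholder title|Placeholder journal"

row = xtract_line_to_row(example_line, example_columns)
print(row)
```

Raising on surplus fields is a design choice of this sketch: it surfaces records whose text contains the delimiter instead of silently shifting values into the wrong columns.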
