<a href="https://colab.research.google.com/github/atjoelpark/ml-disparities-mit/blob/master/pull_preprocessing/LCP_Pull_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pull and Extraction of LCP PMIDs

This notebook documents and provides code for scraping the PubMed IDs (PMIDs) listed on the LCP publications page and pulling the metadata for each PMID.

Reference: https://lcp.mit.edu/publications

In [None]:
# Customization
# Enter the path in your Google Drive (under /content/drive/) where the output files should be saved
# Users only need to modify this cell and then run all the cells in order
google_drive_url = ""
folder_to_save_in_google_drive = "/content/drive/My Drive/Research/Pubmed/MIT_publications"

# Libraries and Mounting Google Drive

In [1]:
# Importing libraries
import numpy as np 
import pandas as pd 
import re 
import requests

In [None]:
# Mounting Google Drive if using Google Drive
from google.colab import drive
drive.mount(f'/content/drive/{google_drive_url}')

Mounted at /content/drive/


## Defining Functions

In [4]:
# Installing E-utilities Entrez Direct
def e_utilities_install():
  """
  Installs e_utilities
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile

In [None]:
# Setting development environment for Selenium
def setup_dev_environment():
  """
  Installs chromium, driver and selenium
  Sets options to be headless
  Starts a headless Chrome session and prepares Selenium for use
  Returns: webdriver
  """

  # install chromium, its driver, and selenium
  !apt update
  !apt install chromium-chromedriver
  !pip install selenium
  !pip install webdriver_manager
  # set options to be headless, ..
  from selenium import webdriver
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  # start the headless browser, then report success and return the driver
  driver = webdriver.Chrome(options=options)
  print("Chromium, Driver and Selenium successfully started....")
  return driver

In [None]:
# Defining Functions
def pull_pmid(year: int, wd) -> list:
  """ 
  Takes a year as an argument and reads the PMIDs listed for that year on https://lcp.mit.edu/publications
  Please note that scraping is a fragile process. Any change to the underlying HTML tags or attributes can break it.
  This method was constructed so that extracting PMIDs would be an automated, reproducible process.
  Note that manually inspecting the webpages and writing down the PMIDs is likely a faster and more reliable process!
  Regardless, please use this method with caution, as the results may become unreliable as the page changes over time.

  Dependencies: Selenium, chromium-chromedriver, webdriver_manager, re
  @param year: This int contains the year to scrape from in https://lcp.mit.edu/publications
  @param wd: Passes in the webdriver for Selenium
  @return: a list of PMIDs (None if a TypeError was caught)
  @raise: a TypeError is caught and printed rather than re-raised
  """
  try:
    # Uses Selenium to load the publications page for the given year
    URL = f'https://lcp.mit.edu/publications#P_{year}'
    wd.get(URL)
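    # Click on the appropriate year tag
    # find_element("link text", ...) replaces find_element_by_link_text, which was removed in Selenium 4.3
    year_link = wd.find_element("link text", str(year))
    year_link.click()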

    # Though this will decrease this method's generalizability, it appears that the CSS selector differs from year to year
    # For now, will hard code this in from year 2003 to 2020 (inclusive)
    year_switcher = {
        2003: '.bib2xhtml a:nth-child(2)~ a+ a , dd:nth-child(4) cite+ a',
        2004: '.bib2xhtml:nth-child(5) a:nth-child(4)',
        2005: 'dd:nth-child(6) cite~ a+ a , .bib2xhtml a~ a+ a',
        2006: '.bib2xhtml a:nth-child(2)~ a+ a , .bib2xhtml:nth-child(5) dd:nth-child(6) a+ a',
        2007: 'dd:nth-child(6) cite~ a+ a , dd:nth-child(2) cite~ a+ a',
        2008: '.bib2xhtml:nth-child(5) dd:nth-child(4) a+ a , .bib2xhtml:nth-child(5) dd:nth-child(2) a+ a , dd:nth-child(8) a+ a , dd:nth-child(14) a+ a , #content a:nth-child(5) , .bib2xhtml a:nth-child(2) , .bib2xhtml:nth-child(5) dd:nth-child(10) a+ a',
        2009: '.bib2xhtml a+ a:nth-child(3) , dd:nth-child(6) a+ a , #content a:nth-child(5) , dd:nth-child(2) a+ a',
        2010: 'dd:nth-child(16) a+ a , .bib2xhtml a:nth-child(2)~ a+ a , #content a:nth-child(5)',
        2011: '.bib2xhtml a:nth-child(2)~ a+ a , .bib2xhtml:nth-child(5) dd:nth-child(6) a , #content a:nth-child(5) , dd:nth-child(2) a~ cite+ a',
        2012: 'dd:nth-child(16) a+ a , #content a:nth-child(5) , dd:nth-child(2) a+ a',
        2013: 'dd:nth-child(24) a+ a , dd:nth-child(22) a+ a , dd:nth-child(20) a , dd:nth-child(16) a+ a , dd:nth-child(14) a+ a , dd:nth-child(10) a+ a , dd:nth-child(12) a+ a , dd:nth-child(6) cite:nth-child(2)~ a+ a , #content a:nth-child(5)',
        2014: 'dd:nth-child(28) a+ a , dd:nth-child(20) a+ a , .bib2xhtml a:nth-child(2)~ a+ a , .bib2xhtml:nth-child(5) dd:nth-child(6) a+ a , dd:nth-child(4) a+ a , #content a:nth-child(5)',
        2015: 'dd:nth-child(4) a+ a , dd:nth-child(40) a+ a , dd:nth-child(38) a+ a , dd:nth-child(36) a+ a , dd:nth-child(32) a+ a , .bib2xhtml a:nth-child(2)~ a+ a , dd:nth-child(26) cite+ a , dd:nth-child(22) a+ a , dd:nth-child(14) a+ a , dd:nth-child(12) a+ a , .bib2xhtml:nth-child(5) dd:nth-child(6) a+ a , #content a:nth-child(5) , dd:nth-child(2) a+ a',
        2016: 'dd:nth-child(6) a+ a , dd:nth-child(8) cite~ a+ a , dd:nth-child(54) a+ a , dd:nth-child(52) a+ a , dd:nth-child(50) a+ a , dd:nth-child(44) a+ a , dd:nth-child(46) a+ a , dd:nth-child(40) a~ a+ a , dd:nth-child(38) a+ a , dd:nth-child(34) a+ a , dd:nth-child(32) cite+ a , dd:nth-child(28) a+ a , dd:nth-child(20) cite~ a+ a , dd:nth-child(16) cite~ a+ a , dd:nth-child(4) cite~ a+ a , #content a:nth-child(5)',
        2017: 'dd:nth-child(48) a+ a , dd:nth-child(46) a+ a , dd:nth-child(44) a+ a , dd:nth-child(40) a+ a , dd:nth-child(38) a+ a , dd:nth-child(36) a+ a , dd:nth-child(34) a+ a , dd:nth-child(32) a+ a , dd:nth-child(30) a+ a , dd:nth-child(26) a+ a , dd:nth-child(24) cite+ a , dd:nth-child(22) a+ a , dd:nth-child(20) a+ a , dd:nth-child(18) a+ a , dd:nth-child(16) a+ a , dd:nth-child(12) a+ a , dd:nth-child(10) a+ a , dd:nth-child(4) a+ a , #content a:nth-child(5) , dd:nth-child(2) a+ a',
        2018: '.bib2xhtml:nth-child(5) a+ a',
        2019: '.bib2xhtml a~ a+ a , cite~ a+ a',
        2020: '.bib2xhtml a+ a'
    }

    # Python switch statement depending on year
    def switch(year):
      return year_switcher.get(year)

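    # Gathers the links that contain PMIDs
    # find_elements("css selector", ...) replaces find_elements_by_css_selector, which was removed in Selenium 4.3
    _links = wd.find_elements("css selector", switch(year))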

    # Initializes an empty list and appends every "(PMID: ...)" match found in the link text
    _pmid_list = []
    for i in _links:
      tmp_search = re.findall(r'\(PMID:.*\)', i.text)
      if tmp_search:
        _pmid_list.append(tmp_search)

    # Flattens the list
    _pmid_list = [item for sublist in _pmid_list for item in sublist]

    # Extracts out only integers and removes text and special characters. 
    # Returns the list
    _pmid_list = [int(re.findall(r'\d+', i)[0]) for i in _pmid_list]
    return _pmid_list

  except TypeError as e:
    print("Error raised while pulling PMID...")
    print(e)
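
The text-processing half of `pull_pmid` can be checked without a browser. Below is a minimal sketch, assuming the link texts look like `(PMID: 12345678)`. The helper name `extract_pmids` and the sample strings are hypothetical, and the pattern is a stricter variant of the one used above (it captures only the digits), not the notebook's own code.

```python
import re

def extract_pmids(link_texts):
    """Return the integer PMIDs found in strings such as '(PMID: 12345678)'."""
    pmids = []
    for text in link_texts:
        # Capture only the digits that follow 'PMID:' inside the parentheses
        for digits in re.findall(r'\(PMID:\s*(\d+)\)', text):
            pmids.append(int(digits))
    return pmids

# Hypothetical link texts, for illustration only
sample = ["(PMID: 12345678)", "Full text", "(PMID:23456789)"]
print(extract_pmids(sample))
```

Keeping this step in a separate function makes it possible to test the regular expression on its own when the page layout changes.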

In [None]:
# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  This depends on E-utilities. The input parameter is a list of PMIDs.
  The output is a Pandas DataFrame with the following columns:

  1. PMID
  2. PubMed Article Title
  3. DateCompleted (Year_Month_Day)
  4. DateRevised (Year_Month_Day)
  5. Journal Title
  6. Publication Date (Year_Month_Day)
  7. Abstract
  8. Author FirstName_LastName_Affiliation (Note that the three values are
  separated by "_". If an author has affiliations with multiple institutions, the
  institutions are separated by the character "/".)

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame
  @raise: exceptions for individual PMIDs are caught and printed rather than re-raised
  """
  columns = ["PMID", "PubMed_Article_Title", "Date_Completed_Year", 
             "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year", 
             "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
             "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
             "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]
  df = pd.DataFrame(columns=columns)

  for id, i in enumerate(pmid):
    try:
      _temp = f'''$HOME/edirect/efetch -db pubmed -id {i} -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab "/" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation'''

      _result = !{_temp}
      _temp = _result[0].split("|")

      for _count, _value in enumerate(_temp):
        df.loc[id,columns[_count]] = _value
        
    except Exception as e:
      print(f"Error Raised when Querying Unix EDirect for PMID: {i}")
      print(e)

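  # Prior to returning df
  # If any cells hold empty strings, convert them to NULL
  # (the pattern ^$ matches only empty strings; an empty pattern would match inside every string)
  df = df.replace(r'^$', "NULL", regex=True)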

  return df
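
The row-building step inside `pull_pmid_metadata` can be sketched separately from the EDirect call. This is a minimal sketch, assuming the command output arrives as a list of lines whose first line is `|`-delimited, as in the function above. The helper name `parse_xtract_line` and the sample line are hypothetical. Returning `None` for empty output is one possible way to handle a PMID that yields no record, and is not what the notebook itself does.

```python
COLUMNS = [
    "PMID", "PubMed_Article_Title", "Date_Completed_Year",
    "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year",
    "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
    "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
    "Abstract", "AuthorFirstName_AuthorLastName_Affiliation",
]

def parse_xtract_line(result_lines, columns=COLUMNS, missing="NULL"):
    """Map the first line of '|'-delimited output onto the column names.

    Returns None when the command produced no output at all.
    """
    if not result_lines:
        return None
    fields = result_lines[0].split("|")
    # Pad short rows with the placeholder and drop any fields beyond the last column
    padding = [missing] * max(0, len(columns) - len(fields))
    fields = fields[:len(columns)] + padding
    # Empty strings are also replaced by the placeholder
    return {col: (val if val else missing) for col, val in zip(columns, fields)}

# Made-up output line, for illustration only
row = parse_xtract_line(["111|An example title|2020|01|02"])
print(row["PMID"], row["Abstract"])
```

Checking for empty output before indexing avoids relying on the `list index out of range` exception to signal a missing record.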

# Main

In [5]:
%%time
# Install E-utilities. When prompted "Would you like to do that automatically now?", please enter 'y'.
e_utilities_install()


Entrez Direct has been successfully downloaded and installed.

-e In order to complete the configuration process, please execute the following:

  echo "source ~/.bash_profile" >> $HOME/.bashrc

or manually edit the PATH variable assignment in your .bash_profile file.

Would you like to do that automatically now? [y/N]
y
OK, done.

To activate EDirect for this terminal session, please execute the following:

export PATH=${PATH}:$HOME/edirect

CPU times: user 71.2 ms, sys: 23.6 ms, total: 94.7 ms
Wall time: 6.87 s


In [None]:
%%time 
# Sets up Web Driver
wd = setup_dev_environment()

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:12 https://cloud.r-project.org/bin/linux/ubuntu bi

In [None]:
%%time
# Creating a dictionary with keys = years, values = list of PMIDs corresponding to the year
MIT_publications = {}
for year in range(2003, 2021, 1):
  MIT_publications[year] = pull_pmid(year, wd)

CPU times: user 7.18 s, sys: 306 ms, total: 7.49 s
Wall time: 1min 12s
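
Because `pull_pmid` prints the error and implicitly returns `None` when it catches a `TypeError`, it may be worth checking the dictionary before moving on to the metadata step. The following is a minimal sketch of such a check. The helper name `summarize_pulls` and the sample dictionary are hypothetical.

```python
def summarize_pulls(publications):
    """Return ({year: number of PMIDs}, [years whose pull returned None])."""
    counts, failed = {}, []
    for year, pmids in sorted(publications.items()):
        if pmids is None:
            # The scrape for this year did not return a list
            failed.append(year)
        else:
            counts[year] = len(pmids)
    return counts, failed

# Made-up dictionary in the same shape as MIT_publications, for illustration only
example = {2003: [11111111, 22222222], 2004: None, 2005: []}
counts, failed = summarize_pulls(example)
print(counts)
print(failed)
```

Years listed in `failed` could then be re-scraped or filled in by hand before calling `pull_pmid_metadata`.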


In [None]:
%%time
# Looping through the dictionary keys and building a pandas DataFrame for each year
# Saving each DataFrame as a CSV file in the Google Drive folder set in the Customization cell
MIT_publications_df = {}
for year in range(2003, 2021, 1):
  MIT_publications_df[year] = pull_pmid_metadata(MIT_publications[year])
  MIT_publications_df[year].to_csv(f"{folder_to_save_in_google_drive}/MIT_publications_{year}.csv", index=False)

Error Raised when Querying Unix EDirect for PMID: 28114047
list index out of range
Error Raised when Querying Unix EDirect for PMID: 28328711
list index out of range
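
Writing the CSV files fails if the target folder does not exist yet. Below is a minimal sketch of a helper that creates the folder and builds the per-year file name. The helper name `csv_path_for_year` is hypothetical, and a temporary directory stands in for the Google Drive folder.

```python
import tempfile
from pathlib import Path

def csv_path_for_year(folder, year, prefix="MIT_publications"):
    """Create `folder` if needed and return the CSV path for `year`."""
    out_dir = Path(folder)
    # parents=True also creates any missing intermediate directories
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{prefix}_{year}.csv"

# Demonstration in a temporary directory instead of Google Drive
with tempfile.TemporaryDirectory() as tmp:
    path = csv_path_for_year(Path(tmp) / "Pubmed" / "MIT_publications", 2003)
    print(path.name)
```

In the loop above, the returned path could be passed directly to `to_csv`, for example `df.to_csv(csv_path_for_year(folder_to_save_in_google_drive, year), index=False)`.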
