<a href="https://colab.research.google.com/github/atjoelpark/ml-disparities-mit/blob/master/pull_preprocessing/LCP_Pull_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pull and Extraction of LCP PMIDs

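This notebook documents and implements a pipeline that scrapes PubMed IDs (PMIDs) from the MIT Laboratory for Computational Physiology (LCP) publications page and pulls metadata for each PMID.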

Reference: https://lcp.mit.edu/publications

In [119]:
# Customization
# Enter the path within your Google Drive (the part after /content/drive/) where your files should be saved
# Users only need to modify this value and then run all the cells in order
google_drive_url = ""

# Libraries and Mounting Google Drive

In [120]:
# Importing libraries
import numpy as np 
import pandas as pd 
import re 
import requests
from bs4 import BeautifulSoup

In [3]:
# Mounting Google Drive if using Google Drive
# Drive is mounted at /content/drive; google_drive_url names a folder inside the mount, not the mount point itself
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive/
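
The customization cell asks for a path relative to `/content/drive/`. A minimal sketch of how that value could be turned into a full output path is shown here; the helper name `build_output_path` and the example folder and file names are hypothetical and do not appear elsewhere in this notebook.

```python
from pathlib import PurePosixPath

def build_output_path(drive_subpath: str, filename: str) -> str:
    """Join the Colab mount point, a Drive sub-folder, and a file name."""
    base = PurePosixPath("/content/drive")
    # Strip stray slashes so "/MyDrive/lcp/" and "MyDrive/lcp" behave the same
    return str(base / drive_subpath.strip("/") / filename)

# Hypothetical folder and file name, for illustration only
print(build_output_path("MyDrive/lcp", "pmid_2020.csv"))
```

An empty `drive_subpath` simply yields a file directly under `/content/drive`.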


## Defining Functions

In [121]:
# Installing E-utilities Entrez Direct
def e_utilities_install():
  """
  Installs NCBI Entrez Direct (EDirect), the command-line interface to E-utilities
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !curl -L https://www.ncbi.nlm.nih.gov/books/NBK179288/bin/install-edirect.sh > install-edirect.sh
  !bash install-edirect.sh -y
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile
  !rm install-edirect.sh

In [122]:
# Setting development environment for Selenium
def setup_dev_environment():
  """
  Installs chromium, its driver, and selenium
  Sets options to be headless
  Starts a headless Chrome session ready for use with Selenium
  Returns: webdriver
  """

  # install chromium, its driver, and selenium
  !apt update
  !apt install chromium-chromedriver
  !pip install selenium
  !pip install webdriver_manager
  # set options to be headless, ..
  from selenium import webdriver
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  # start the browser first so the message is printed only on success
  driver = webdriver.Chrome(options=options)
  print("Chromium, Driver and Selenium successfully started....")
  return driver

In [123]:
# Defining Functions
def pull_pmid(year: int, wb) -> list:
  """ 
  Takes a year as an argument and reads the PMIDs listed for that year on https://lcp.mit.edu/publications

  Dependencies: Selenium, chromium-chromedriver, webdriver_manager, re
  @param year: The publication year to scrape from https://lcp.mit.edu/publications
  @param wb: The Selenium webdriver returned by setup_dev_environment
  @return: a list of PMIDs as ints (None if a TypeError was caught and printed)
  """
  try:
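    # Uses Selenium to search by CSS
    # find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which newer Selenium 4 releases removed
    from selenium.webdriver.common.by import By
    URL = f'https://lcp.mit.edu/publications#P_{year}'
    # Use the webdriver passed in as wb rather than relying on a global
    wb.get(URL)
    _links = wb.find_elements(By.CSS_SELECTOR, '.bib2xhtml a+ a')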

    # Initializes an empty list and appends every "(PMID: ...)" match found in the link text
    _pmid_list = []
    for i in _links:
      tmp_search = re.findall(r'\(PMID:.*\)', i.text)
      if tmp_search:
        _pmid_list.append(tmp_search)

    # Flattens the list
    _pmid_list = [item for sublist in _pmid_list for item in sublist]

    # Extracts out only integers and removes text and special characters. 
    # Returns the list
    _pmid_list = [int(re.findall(r'\d+', i)[0]) for i in _pmid_list]
    return _pmid_list

  except TypeError as e:
    print("Error raised while pulling PMID...")
    print(e)
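
The extraction step in `pull_pmid` can be tried without a browser. The sketch below applies the same idea to plain strings, using a capture group so the match and the integer conversion happen in one pass. The helper name `extract_pmids` and the sample link texts are assumptions for illustration; the exact text of the links on the LCP page is not confirmed here.

```python
import re

def extract_pmids(link_texts):
    """Return the integer PMID from each string containing '(PMID: ...)'."""
    pmids = []
    for text in link_texts:
        # Capture only the digits that follow "PMID:" inside the parentheses
        match = re.search(r'\(PMID:\s*(\d+)\)', text)
        if match:
            pmids.append(int(match.group(1)))
    return pmids

# Hypothetical link texts, for illustration only
sample = ["(PMID: 32612144)", "PDF", "(PMID:31948262)"]
print(extract_pmids(sample))  # [32612144, 31948262]
```

Strings without a PMID, such as `"PDF"`, are skipped, which mirrors the `if tmp_search:` check in the function above.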

In [230]:
# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  This is dependent on E-utilities. The input parameter is a list of PMIDs.
  The output is a Pandas DataFrame with one row per PMID and the following fields
  (each date is stored in three separate columns: Year, Month, and Day):

  1. PMID
  2. PubMed Article Title
  3. DateCompleted (Year, Month, Day)
  4. DateRevised (Year, Month, Day)
  5. Journal Title
  6. Publication Date (Year, Month, Day)
  7. Abstract
  8. Author FirstName_LastName_Affiliation (Note that the three values are
  separated by "_". If an author has affiliations to multiple institutions, the
  institutions are separated by the character "/".)

  Fields missing from the PubMed record are filled with "NULL".

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame
  Errors raised while querying EDirect are caught and printed rather than re-raised.
  """
  columns = ["PMID", "PubMed_Article_Title", "Date_Completed_Year", 
             "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year", 
             "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
             "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
             "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]
  df = pd.DataFrame(columns=columns)
  # _row is the DataFrame row index; _pmid is the PMID being queried
  # (renamed from id/i to avoid shadowing the built-in id)
  for _row, _pmid in enumerate(pmid):
    try:
      _temp = f'''$HOME/edirect/efetch -db pubmed -id {_pmid} -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab "/" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation'''

      # Run the shell pipeline and split the first output line into fields
      _result = !{_temp}
      _temp = _result[0].split("|")

      for _count, _value in enumerate(_temp):
        df.loc[_row,columns[_count]] = _value

    except Exception as e:
      print("Error Raised when Querying Unix EDirect...")
      print(e)

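  # Prior to returning df
  # If any cells are empty or whitespace-only, convert to NULL
  # (anchored pattern: an unanchored empty regex would also match inside non-empty strings)
  df = df.replace(r'^\s*$', "NULL", regex=True)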
  return df
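
The row-building step in `pull_pmid_metadata` splits one `|`-delimited line from `xtract` and assigns each field to a column. A minimal standalone sketch of that step follows; the helper `parse_xtract_line` and the sample line are hypothetical, and the sketch assumes no field (the abstract, for instance) itself contains a `|` character.

```python
COLUMNS = ["PMID", "PubMed_Article_Title", "Date_Completed_Year",
           "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year",
           "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
           "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
           "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]

def parse_xtract_line(line, columns=COLUMNS, sep="|", missing="NULL"):
    """Map one delimited output line onto the column names."""
    fields = line.split(sep)
    # Pad short rows so every column receives a value
    fields += [missing] * (len(columns) - len(fields))
    row = {}
    # zip stops at the last column, so surplus fields are ignored
    for name, value in zip(columns, fields):
        row[name] = value if value.strip() else missing
    return row

# Hypothetical line in the shape the xtract command is expected to emit
line = ("32577533|Title|NULL|NULL|NULL|2021|07|04|NPJ digital medicine"
        "|2020|NULL|NULL||Liam G_McCoy_Faculty of Medicine")
row = parse_xtract_line(line)
print(row["PMID"], row["Abstract"])  # 32577533 NULL
```

Collecting such dictionaries in a list and passing the list to `pd.DataFrame` once would avoid growing the DataFrame cell by cell, though the function above works as written for short PMID lists.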

# Main

In [125]:
%%time
# Install E-utilities. When prompted "Would you like to do that automatically now?", enter 'y'.
e_utilities_install()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   665  100   665    0     0   1722      0 --:--:-- --:--:-- --:--:--  1718

Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "source ~/.bash_profile" >> $HOME/.bashrc

or manually edit the PATH variable assignment in your .bash_profile file.

Would you like to do that automatically now? [y/N]
y
OK, done.
CPU times: user 118 ms, sys: 37.6 ms, total: 156 ms
Wall time: 18.1 s


In [126]:
%%time 
# Sets up Web Driver
wd = setup_dev_environment()

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
[33m0% [Connecting to archive.ubuntu.com (91.189.88.152)] [Connecting to security.u[0m[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)[0m                                                                               Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)[0m                                                                               Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com (91.189.88.152)[0m                                                                               Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
[3

In [111]:
%%time
pmid_list_2020 = pull_pmid(2020, wd)

CPU times: user 653 ms, sys: 21.1 ms, total: 674 ms
Wall time: 9.18 s


In [231]:
%%time 
pull_pmid_metadata(pmid_list_2020)

CPU times: user 202 ms, sys: 88.1 ms, total: 290 ms
Wall time: 9.53 s


Unnamed: 0,PMID,PubMed_Article_Title,Date_Completed_Year,Date_Completed_Month,Date_Completed_Day,Date_Revised_Year,Date_Revised_Month,Date_Revised_Day,Journal_Title,Publication_Date_Year,Publication_Date_Month,Publication_Date_Day,Abstract,AuthorFirstName_AuthorLastName_Affiliation
0,32612144,Real-world characterization of blood glucose c...,2020.0,12.0,4.0,2021,7,1,Scientific reports,2020,7.0,1.0,The heterogeneity of critical illness complica...,"Lawrence_Baker_RAND Corporation, Santa Monica,..."
1,31948262,Temporal Trends in Critical Care Outcomes in U...,2020.0,6.0,22.0,2021,3,15,American journal of respiratory and critical c...,2020,3.0,15.0,Rationale: Whether critical care improvements ...,"John_Danziger_Department of Medicine, Beth Isr..."
2,32126097,Predicting Intensive Care Unit admission among...,2020.0,6.0,11.0,2020,6,25,PloS one,2020,,,The risk stratification of patients in the eme...,"Marta_Fernandes_IDMEC, Instituto Superior Técn..."
3,32240233,Risk of mortality and cardiopulmonary arrest i...,2020.0,7.0,16.0,2020,7,16,PloS one,2020,,,Emergency department triage is the first point...,"Marta_Fernandes_IDMEC, Instituto Superior Técn..."
4,32248145,Urban Intelligence for Pandemic Response: View...,2020.0,4.0,16.0,2020,12,18,JMIR public health and surveillance,2020,4.0,14.0,Previous epidemic management research proves t...,Yuan_Lai_Department of Urban Studies and Plann...
5,32577533,What do medical students actually need to know...,,,,2021,7,4,NPJ digital medicine,2020,,,With emerging innovations in artificial intell...,"Liam G_McCoy_Faculty of Medicine, University o..."
6,32449686,COVID-19: Putting the General Data Protection ...,2020.0,6.0,3.0,2020,12,18,JMIR public health and surveillance,2020,5.0,29.0,The coronavirus disease (COVID-19) pandemic is...,Stuart_McLennan_Institute of History and Ethic...
7,32577534,"""Yes, but will it work for my patients?"" Drivi...",,,,2021,7,4,NPJ digital medicine,2020,,,Benchmark datasets have a powerful normative i...,Trishan_Panch_Division of Health Policy and Ma...
8,32432708,Assessment of Proficiency of N95 Mask Donning ...,2020.0,10.0,26.0,2020,12,18,JAMA network open,2020,5.0,1.0,,"Wesley_Yeung_University Medicine Cluster, Nati..."
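
The last column packs every author into one string: authors are separated by `/` and each author's first name, last name, and affiliation by `_`, following the `xtract` options used in `pull_pmid_metadata`. A sketch of how that string could be unpacked is given below. The helper `split_authors` and the sample value are hypothetical, and an affiliation that itself contains `/` would be split incorrectly by this approach.

```python
def split_authors(cell, author_sep="/", field_sep="_", missing="NULL"):
    """Split the packed author column into one dict per author."""
    authors = []
    for entry in cell.split(author_sep):
        # maxsplit=2 keeps any further "_" inside the affiliation text
        parts = entry.split(field_sep, 2)
        # Pad entries that lack a last name or an affiliation
        parts += [missing] * (3 - len(parts))
        first, last, affiliation = parts
        authors.append({"first_name": first,
                        "last_name": last,
                        "affiliation": affiliation})
    return authors

# Hypothetical cell value, for illustration only
cell = "Liam G_McCoy_Faculty of Medicine/Jane_Doe_Example University"
for author in split_authors(cell):
    print(author["last_name"], "-", author["affiliation"])
```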


In [218]:
!$HOME/edirect/efetch -db pubmed -id 32577533 -format xml

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE PubmedArticleSet>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
      <PMID Version="1">32577533</PMID>
      <DateRevised>
        <Year>2021</Year>
        <Month>07</Month>
        <Day>04</Day>
      </DateRevised>
      <Article PubModel="Electronic-eCollection">
        <Journal>
          <ISSN IssnType="Electronic">2398-6352</ISSN>
          <JournalIssue CitedMedium="Internet">
            <Volume>3</Volume>
            <PubDate>
              <Year>2020</Year>
            </PubDate>
          </JournalIssue>
          <Title>NPJ digital medicine</Title>
          <ISOAbbreviation>NPJ Digit Med</ISOAbbreviation>
        </Journal>
        <ArticleTitle>What do medical students actually need to know about artificial intelligence?</ArticleTitle>
        <Pagination>
          <MedlinePgn>86</MedlinePgn>
        </Pagination>
        <ELocationID EIdTy