<a href="https://colab.research.google.com/github/atjoelpark/ml-disparities-mit/blob/master/pull_processing/LCP_Pull_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pull and Extraction of LCP PMIDs

This notebook documents and provides code for pulling metadata for a list of PubMed IDs (PMIDs).

Reference: https://lcp.mit.edu/publications

In [80]:
# Customization
# Enter the path within your Google Drive (the part after /content/drive/) where you would like your files to be saved
# Users only need to modify this value, then run all the cells in order
google_drive_url = ""

# Libraries and Mounting Google Drive

In [2]:
# Importing libraries
import numpy as np 
import pandas as pd 
import re 
import requests
from bs4 import BeautifulSoup

In [79]:
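# Mounting Google Drive
# drive.mount expects the mount point itself, not a sub-folder;
# files can then be saved under f'/content/drive/{google_drive_url}'
from google.colab import drive
drive.mount('/content/drive')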

Mounted at /content/drive/


## Defining Functions

In [106]:
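# Installing E-utilities Entrez Direct
import os

def e_utilities_install():
  """
  Installs Entrez Direct (E-utilities) and makes it visible to later shell commands
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !curl -L https://www.ncbi.nlm.nih.gov/books/NBK179288/bin/install-edirect.sh > install-edirect.sh
  !bash install-edirect.sh -y
  !rm install-edirect.sh
  # Each `!` command runs in a fresh shell that does not read ~/.bash_profile,
  # so add the install directory to this process's PATH instead
  os.environ["PATH"] += os.pathsep + os.path.expanduser("~/edirect")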

In [49]:
# Setting development environment for Selenium
def setup_dev_environment():
  """
  Installs chromium, its driver, and selenium
  Sets options to be headless
  Starts a headless Chrome session ready for use
  Returns: webdriver
  """

  # install chromium, its driver, and selenium
  !apt update
  !apt install chromium-chromedriver
  !pip install selenium
  !pip install webdriver_manager
  # set options to be headless, ..
  from selenium import webdriver
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  # start the headless browser; report success only after it has been created
  wd = webdriver.Chrome(options=options)
  print("Chromium, Driver and Selenium successfully started....")
  return wd

In [60]:
# Defining Functions
def pull_pmid(year: int, wd: "webdriver.Chrome") -> list:
  """
  Takes a year as an argument and reads the PMIDs listed for that year on https://lcp.mit.edu/publications

  Dependencies: Selenium, chromium-chromedriver, webdriver_manager, re
  @param year: the year to scrape from https://lcp.mit.edu/publications
  @param wd: the Selenium webdriver returned by setup_dev_environment
  @return: a list of PMIDs as integers
  @raise TypeError: caught and printed; the function then returns None
  """
  # Imported here because Selenium is only installed once setup_dev_environment has run
  from selenium.webdriver.common.by import By

  try:
    # Uses Selenium to search by CSS
    URL = f'https://lcp.mit.edu/publications#P_{year}'
    wd.get(URL)
    # find_elements_by_css_selector is no longer available in recent Selenium 4 releases
    _links = wd.find_elements(By.CSS_SELECTOR, '.bib2xhtml a+ a')

    # Collects every "(PMID: ...)" match found in the link text
    _pmid_list = []
    for i in _links:
      tmp_search = re.findall(r'\(PMID:.*\)', i.text)
      if tmp_search:
        _pmid_list.append(tmp_search)

    # Flattens the list of lists
    _pmid_list = [item for sublist in _pmid_list for item in sublist]

    # Keeps only the digits, dropping text and special characters,
    # and returns the list
    _pmid_list = [int(re.findall(r'\d+', i)[0]) for i in _pmid_list]
    return _pmid_list

  except TypeError as e:
    print("Error raised while pulling PMID...")
    print(e)
In [93]:
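# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  Pulls the author names for each PMID
  This is dependent on E-utilities (efetch and xtract must be on the PATH)

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame with one row per PMID (columns: pmid, authors)
  """
  rows = []
  for i in pmid:
    result_ = !efetch -db pubmed -id {i} -format xml | xtract -pattern PubmedArticle -block Author \
      -sep " " -tab "| " -element ForeName,LastName
    print(result_)
    # result_ is a list of output lines; join them into one string per PMID
    rows.append({"pmid": i, "authors": " ".join(result_)})
  return pd.DataFrame(rows)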
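
Author names can also be retrieved without the command-line tools by calling the NCBI E-utilities `efetch` endpoint over HTTP. The sketch below is an alternative to the `efetch | xtract` pipeline, not the notebook's method. The helper names `build_efetch_url`, `parse_authors`, and `fetch_authors` are hypothetical, and the sample XML is a hand-written fragment with placeholder values that assumes the usual PubMed layout (`PubmedArticle` / `MedlineCitation` / `Article` / `AuthorList` / `Author`).

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(pmids):
    """Build an efetch URL requesting PubMed XML for a list of PMIDs."""
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "xml",
    })
    return f"{EFETCH_URL}?{query}"

def parse_authors(xml_data):
    """Map each PMID in a PubMed XML document to a list of 'ForeName LastName' strings."""
    root = ET.fromstring(xml_data)
    authors_by_pmid = {}
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        names = []
        for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
            fore = author.findtext("ForeName", default="")
            last = author.findtext("LastName", default="")
            full = f"{fore} {last}".strip()
            if full:
                names.append(full)
        authors_by_pmid[pmid] = names
    return authors_by_pmid

def fetch_authors(pmids):
    """Download the XML for the given PMIDs and parse it (requires network access)."""
    with urllib.request.urlopen(build_efetch_url(pmids)) as response:
        return parse_authors(response.read())

# Hand-written fragment with placeholder values, used to check the parser offline
SAMPLE_XML = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>1</PMID><Article><AuthorList>
<Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
<Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
</AuthorList></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""
print(parse_authors(SAMPLE_XML))
```

Under these assumptions, `fetch_authors(pmid_list_2020)` would request all nine records in a single call, and the resulting dictionary can be passed to pandas if a DataFrame is needed.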

# Main

In [95]:
%%time
# Install E-utilities
e_utilities_install()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   665  100   665    0     0   3292      0 --:--:-- --:--:-- --:--:--  3292

Entrez Direct has been successfully downloaded and installed.

CPU times: user 71.9 ms, sys: 32.8 ms, total: 105 ms
Wall time: 6.01 s


In [58]:
%%time 
# Sets up Web Driver
wd = setup_dev_environment()

[33m0% [Working][0m            Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
[33m0% [Connecting to archive.ubuntu.com (91.189.88.152)] [1 InRelease 14.2 kB/88.7[0m[33m0% [Connecting to archive.ubuntu.com (91.189.88.152)] [Waiting for headers] [Wa[0m[33m0% [1 InRelease gpgv 88.7 kB] [Waiting for headers] [Waiting for headers] [Wait[0m                                                                               Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:8 https://developer.download.nvidia.com/co

In [82]:
%%time
pmid_list_2020 = pull_pmid(2020, wd)

CPU times: user 720 ms, sys: 50.1 ms, total: 770 ms
Wall time: 9.83 s


In [90]:
pmid_list_2020

[32612144,
 31948262,
 32126097,
 32240233,
 32248145,
 32577533,
 32449686,
 32577534,
 32432708]

In [111]:
!efetch -db pubmed -id 32612144 -format xml | xtract -pattern PubmedArticle -block Author -sep " " -tab "| " -element ForeName,LastName

/bin/bash: efetch: command not found
/bin/bash: xtract: command not found
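
The `command not found` messages suggest that the Entrez Direct directory is not on the `PATH` inherited by `!` commands; a line appended to `~/.bash_profile` is not read by those non-login shells. The sketch below shows one way to check and repair this from Python, assuming the installer placed the tools in `~/edirect`. The helper name `ensure_on_path` is hypothetical.

```python
import os
import shutil

def ensure_on_path(directory):
    """Append a directory to this process's PATH if it is not already listed."""
    directory = os.path.expanduser(directory)
    # Drop empty entries so a blank PATH does not produce a leading separator
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
        os.environ["PATH"] = os.pathsep.join(parts)
    return directory

edirect_dir = ensure_on_path("~/edirect")
# shutil.which returns None when the command cannot be found on PATH
print("efetch found at:", shutil.which("efetch"))
```

Because child processes inherit `os.environ`, later `!efetch` and `!xtract` calls in the same session should see the updated `PATH`.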


In [96]:
%%time 
pull_pmid_metadata(pmid_list_2020)

['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
['/bin/bash: xtract: command not found', '/bin/bash: efetch: command not found']
['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
['/bin/bash: xtract: command not found', '/bin/bash: efetch: command not found']
['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
['/bin/bash: xtract: command not found', '/bin/bash: efetch: command not found']
['/bin/bash: efetch: command not found', '/bin/bash: xtract: command not found']
CPU times: user 158 ms, sys: 83.1 ms, total: 242 ms
Wall time: 1.29 s


# Ignore Below

In [None]:
# Instantiating a list of LCP publications
lcp_list_2020 = [32612144, 31948262, 32126097, 
            32240233, 32248145, 32577533,
            32449686, 32577534, 32432708]

lcp_list_2019 = [30270722, 31205091, 31428687,
                 31815192, 31539837, 30395555,
                 31831740, 30922394, 30077427,
                 29402151, 31595350, 31254239]

lcp_list_2018 = [30270722, 29602439, 29194147,
                 29806057, 29303796, 29750814,
                 29500006, 30291378, 30591356,
                 29500020, 30204154, 31984317,
                 30458029, 29742215, 30130997,
                 29252930, 30646358, 30411186,
                 29808824]

In [None]:
lcp_list_conf_pres_2020 = ["doi:10.1097/01.ccm.0000647968.73423.76"]

lcp_list_conf_pres_2019 = ["doi:10.1053/j.ajkd.2019.03.199",
                           "doi:10.1097/01.ccm.0000550825.30295.dd",
                           "doi:10.1097/01.ccm.0000552309.57308.ff"]

lcp_list_conf_pres_2018 = ["doi:10.1109/EMBC.2018.8512859",
                           "doi:10.1109/URTC.2017.8284212",
                           "doi:10.1109/EMBC.2018.8513325",
                           "doi:10.22489/CinC.2018.049",
                           "doi:10.1109/ICHI.2018.00024"]                      
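
DOI strings are sometimes written with a `doi:` label and sometimes as a full `https://doi.org/` URL. The sketch below is a minimal normalizer that reduces either form to the bare DOI, in case these lists are later used for lookups. The function name `normalize_doi` is hypothetical.

```python
import re

def normalize_doi(raw):
    """Strip a leading 'doi:' label and any doi.org URL prefix, returning the bare DOI."""
    doi = raw.strip()
    doi = re.sub(r'^doi:\s*', '', doi, flags=re.IGNORECASE)
    doi = re.sub(r'^https?://(dx\.)?doi\.org/', '', doi, flags=re.IGNORECASE)
    return doi

print(normalize_doi("doi:https://doi.org/10.1053/j.ajkd.2019.03.199"))
print(normalize_doi("doi:10.1109/EMBC.2018.8512859"))
```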

## Edge Cases for Discussion:

1. Dauvin A, Donado C, Bachtiger P, Huang KC, Sauer CM, Ramazzotti D, Bonvini M, Celi LA, Douglas MJ. Machine learning can accurately predict pre-admission baseline hemoglobin and creatinine in intensive care patients, bringing context to abnormal admission lab values. Presented at the European Society of Intensive Care Medicine 32nd Annual Congress, Berlin (https://www.esicm.org/events/32nd-annual-congress-berlin/), Sept. 2019.

2. Johnson AEW, Pollard TJ, Naumann T. Generalizability of predictive models for intensive care unit patients. In ML4H: Machine Learning for Health, Dec. 2018. Workshop at NeurIPS 2018.

3. Lehman EP, Krishnan RG, Zhao X, Mark RG, Lehman LwH. Representation learning approaches to detect false arrhythmia alarms from ECG dynamics. In Doshi-Velez F, Fackler J, Jung K, Kale D, Ranganath R, Wallace B, Wiens J, editors, Proceedings of the 3rd Machine Learning for Healthcare Conference, volume 85, 571–586, Palo Alto, California, 17–18 Aug 2018. PMLR.