<a href="https://colab.research.google.com/github/atjoelpark/ml-disparities-mit/blob/master/pull_preprocessing/LCP_Pull_Extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pull and Extraction of LCP PMIDs

This notebook documents and implements the retrieval of metadata for a list of PubMed IDs (PMIDs) scraped from the MIT Laboratory for Computational Physiology (LCP) publications page.

Reference: https://lcp.mit.edu/publications

In [1]:
# Customization
# Enter the path inside your Google Drive (the part after /content/drive/) where your files should be saved
# This is the only line users need to modify; after that, run all the cells in order
google_drive_url = ""
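
As a rough illustration of how this setting is meant to be combined with the Drive mount point, the sketch below joins the two into a save directory. The helper name `build_save_dir` and the example value `MyDrive/lcp_pull` are hypothetical and do not appear elsewhere in this notebook.

```python
import posixpath

# Mount point named in the comment above
DRIVE_MOUNT_POINT = "/content/drive"

def build_save_dir(drive_subpath: str) -> str:
    """Hypothetical helper: joins the mount point and the user-supplied sub-path."""
    # Strip stray slashes so "/MyDrive/x/" and "MyDrive/x" give the same result
    cleaned = drive_subpath.strip("/")
    if not cleaned:
        return DRIVE_MOUNT_POINT
    return posixpath.join(DRIVE_MOUNT_POINT, cleaned)

# "MyDrive/lcp_pull" is an assumed example value, not a path used by the notebook
print(build_save_dir("MyDrive/lcp_pull"))
```

Leaving `google_drive_url` empty would, under this sketch, save directly under the mount point.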

# Libraries and Mounting Google Drive

In [2]:
# Importing libraries
import numpy as np 
import pandas as pd 
import re 
import requests
from bs4 import BeautifulSoup

In [3]:
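# Mounting Google Drive if using Google Drive
# The mount point is always /content/drive/; google_drive_url is a save path inside the mounted Drive, not part of the mount point
from google.colab import drive
drive.mount('/content/drive/')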

Mounted at /content/drive/


## Defining Functions

In [14]:
# Installing Entrez Direct (the E-utilities command-line tools)
def e_utilities_install():
  """
  Downloads and runs the NCBI Entrez Direct installer, which places the tools in $HOME/edirect
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !curl -L https://www.ncbi.nlm.nih.gov/books/NBK179288/bin/install-edirect.sh > install-edirect.sh
  !bash install-edirect.sh -y
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile
  !rm install-edirect.sh
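
Because later cells call the tools by their full path under `$HOME/edirect`, it can help to confirm that a tool is actually there before using it. The following is a minimal sketch of such a check; `edirect_tool` is a hypothetical helper and not part of Entrez Direct or of this notebook.

```python
import os
from typing import Optional

def edirect_tool(name: str, home: Optional[str] = None) -> str:
    """Hypothetical helper: returns the full path of an Entrez Direct tool.

    Raises FileNotFoundError if the tool is not present in <home>/edirect.
    """
    # Default to the current user's home directory, mirroring $HOME/edirect
    home = home or os.path.expanduser("~")
    path = os.path.join(home, "edirect", name)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{name} not found in {os.path.dirname(path)}; run e_utilities_install() first"
        )
    return path
```

For example, `edirect_tool("efetch")` would return the path to `efetch` once the install cell has completed, and raise otherwise.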

In [5]:
# Setting up the environment for Selenium
def setup_dev_environment():
  """
  Installs Chromium, its driver, and Selenium
  Sets options to be headless
  Starts a Chrome session; no page is loaded until the caller uses .get()
  Returns: webdriver
  """

  # Install Chromium, its driver, and Selenium
  !apt update
  !apt install chromium-chromedriver
  !pip install selenium
  !pip install webdriver_manager
  # Run headless; --no-sandbox and --disable-dev-shm-usage are commonly needed when Chrome runs inside a container
  from selenium import webdriver
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  # Start the driver first so the message below is only printed on success
  driver = webdriver.Chrome(options=options)
  print("Chromium, Driver and Selenium successfully started....")
  return driver

In [6]:
# Scraping PMIDs from the LCP publications page
def pull_pmid(year: int, wd) -> list:
  """
  Reads the PMIDs listed for a given year on https://lcp.mit.edu/publications

  Dependencies: Selenium, chromium-chromedriver, webdriver_manager, re
  @param year: the publication year to scrape from https://lcp.mit.edu/publications
  @param wd: the Selenium webdriver returned by setup_dev_environment
  @return: a list of PMIDs as ints; None if a TypeError was caught and printed
  """
  # Imported here because Selenium is only installed once setup_dev_environment has run
  from selenium.webdriver.common.by import By

  try:
    # Uses Selenium to select the publication links by CSS
    URL = f'https://lcp.mit.edu/publications#P_{year}'
    wd.get(URL)
    # find_elements_by_css_selector was removed in Selenium 4.3; find_elements(By.CSS_SELECTOR, ...) replaces it
    _links = wd.find_elements(By.CSS_SELECTOR, '.bib2xhtml a+ a')

    # Collects every "(PMID: ...)" match found in the link text
    _pmid_list = []
    for i in _links:
      tmp_search = re.findall(r'\(PMID:.*\)', i.text)
      if tmp_search:
        _pmid_list.append(tmp_search)

    # Flattens the list of lists
    _pmid_list = [item for sublist in _pmid_list for item in sublist]

    # Keeps only the digits of each match, converts them to int,
    # and returns the list
    _pmid_list = [int(re.findall(r'\d+', i)[0]) for i in _pmid_list]
    return _pmid_list

  except TypeError as e:
    print("Error raised while pulling PMID...")
    print(e)
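
The two-step extraction used in `pull_pmid` (match the `(PMID: ...)` text, then keep only the digits) can be tried without a browser. The sketch below applies the same regular expressions to plain strings; `extract_pmids` is a hypothetical helper, and the sample link texts and numbers are made up rather than taken from the LCP page.

```python
import re

def extract_pmids(link_texts):
    """Applies the same two regular expressions as pull_pmid to plain strings."""
    matches = []
    for text in link_texts:
        # Step 1: keep only strings that contain a "(PMID: ...)" group
        matches.extend(re.findall(r'\(PMID:.*\)', text))
    # Step 2: take the first run of digits in each match and convert it to int
    return [int(re.findall(r'\d+', m)[0]) for m in matches]

# Placeholder link texts; these are not real LCP PMIDs
sample = ["(PMID: 12345678)", "PDF", "(PMID:87654321)"]
print(extract_pmids(sample))  # → [12345678, 87654321]
```

Links whose text has no `(PMID: ...)` group, such as the `"PDF"` entry, are simply skipped.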

In [53]:
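# Pulling author metadata for each PMID
def pull_pmid_metadata(pmid: list) -> None:
  """
  Prints the author names for each PMID, fetched with Entrez Direct (efetch piped into xtract)
  This is dependent on E-utilities being installed in $HOME/edirect
  The results are only printed for now; nothing is returned yet

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  """
  for i in pmid:
    # efetch downloads the PubMed record as XML; xtract prints "ForeName LastName" for each author, separated by "| "
    _temp = f'$HOME/edirect/efetch -db pubmed -id {i} -format xml | $HOME/edirect/xtract -pattern PubmedArticle -block Author \
      -sep " " -tab "| " -element ForeName,LastName'
    _result = !{_temp}
    print(_result)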
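
One possible next step is to turn the `| `-separated author lines into tabular records. The sketch below shows one way to do that under stated assumptions: `parse_authors` is a hypothetical helper, the record layout (`pmid`, `position`, `author`) is an illustrative choice, and the PMID `0` is a placeholder. The author line is copied from the output shown further down in this notebook.

```python
def parse_authors(pmid, xtract_lines):
    """Hypothetical helper: one record per author from '| '-separated xtract output."""
    records = []
    for line in xtract_lines:
        # Each line holds the authors of one article, separated by "|"
        for position, name in enumerate(line.split("|"), start=1):
            name = name.strip()
            if name:
                records.append({"pmid": pmid, "position": position, "author": name})
    return records

# 0 is a placeholder PMID; the author line comes from this notebook's output
rows = parse_authors(0, ["Yuan Lai| Wesley Yeung| Leo Anthony Celi"])
for row in rows:
    print(row)
```

Passing the resulting list of dictionaries to `pd.DataFrame(rows)` would give a DataFrame with one row per author.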

# Main

In [8]:
%%time
# Install E-utilities
e_utilities_install()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   665  100   665    0     0   3064      0 --:--:-- --:--:-- --:--:--  3050

Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "export PATH=\${PATH}:/root/edirect" >> $HOME/.bashrc

or manually edit the PATH variable assignment in your .bashrc file.

Would you like to do that automatically now? [y/N]
N
Holding off, then.
CPU times: user 180 ms, sys: 53.3 ms, total: 233 ms
Wall time: 23.9 s


In [9]:
%%time 
# Sets up Web Driver
wd = setup_dev_environment()

[33m0% [Working][0m            Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
[33m0% [Connecting to archive.ubuntu.com (91.189.88.142)] [1 InRelease 14.2 kB/88.7[0m[33m0% [Waiting for headers] [Waiting for headers] [Waiting for headers] [Waiting f[0m                                                                               Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:9

In [10]:
%%time
pmid_list_2020 = pull_pmid(2020, wd)

CPU times: user 689 ms, sys: 24.7 ms, total: 713 ms
Wall time: 9.49 s


In [54]:
%%time 
pull_pmid_metadata(pmid_list_2020)

['Lawrence Baker| Jason H Maley| Aldo Arévalo| Francis DeMichele| Roselyn Mateo-Collado| Stan Finkelstein| Leo Anthony Celi']
['John Danziger| Miguel Ángel Armengol de la Hoz| Wenyuan Li| Matthieu Komorowski| Rodrigo Octávio Deliberato| Barret N M Rush| Kenneth J Mukamal| Leo Celi| Omar Badawi']
['Marta Fernandes| Rúben Mendes| Susana M Vieira| Francisca Leite| Carlos Palos| Alistair Johnson| Stan Finkelstein| Steven Horng| Leo Anthony Celi']
['Marta Fernandes| Rúben Mendes| Susana M Vieira| Francisca Leite| Carlos Palos| Alistair Johnson| Stan Finkelstein| Steven Horng| Leo Anthony Celi']
['Yuan Lai| Wesley Yeung| Leo Anthony Celi']
['Liam G McCoy| Sujay Nagaraj| Felipe Morgado| Vinyas Harish| Sunit Das| Leo Anthony Celi']
['Stuart McLennan| Leo Anthony Celi| Alena Buyx']
['Trishan Panch| Tom J Pollard| Heather Mattie| Emily Lindemer| Pearse A Keane| Leo Anthony Celi']
['Wesley Yeung| Kennedy Ng| J M Nigel Fong| Judy Sng| Bee Choo Tai| Sin Eng Chia']
CPU times: user 189 ms, sys: 88.7 