# Chargemaster Scraper
This notebook will go over running the various code snippets which allow the web scraper to scrape chargemaster (CDM) files from WSHA associated hospitals.

This was originally a Google colaboratory notebook and designed to run utilizing a Google Drive as mounted storage.

This has since been modified to be able to be run locally. Some examples may not be optimized as a result.

## Install Relevant Packages
Suggest utilizing a virtual environment. This code snippet assumes the working directory is the same as the one the notebook is located in. 

The "requirements.txt" is located one level above the notebook.

In [1]:
!pip3 install -r ../requirements.txt # Assumes this is being ru

Collecting bs4==0.0.1
  Using cached bs4-0.0.1-py3-none-any.whl
Collecting cffi==1.15.0
  Using cached cffi-1.15.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (446 kB)
Collecting charset-normalizer==2.0.8
  Using cached charset_normalizer-2.0.8-py3-none-any.whl (39 kB)
Collecting cryptography==36.0.0
  Using cached cryptography-36.0.0-cp36-abi3-manylinux_2_24_x86_64.whl (3.6 MB)
Collecting h11==0.12.0
  Using cached h11-0.12.0-py3-none-any.whl (54 kB)
Collecting idna==3.3
  Using cached idna-3.3-py3-none-any.whl (61 kB)
Collecting outcome==1.1.0
  Using cached outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting pycparser==2.21
  Using cached pycparser-2.21-py2.py3-none-any.whl (118 kB)
Collecting selenium==4.1.0
  Using cached selenium-4.1.0-py3-none-any.whl (958 kB)
Collecting sniffio==1.2.0
  Using cached sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting sortedcontainers==2.4.0
  Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting soupsieve==2.

  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.4
    Uninstalling pandas-1.3.4:
      Successfully uninstalled pandas-1.3.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.[0m
Successfully installed Flask-2.0.2 Jinja2-3.0.3 Werkzeug-2.0.2 brotli-1.0.9 bs4-0.0.1 cffi-1.15.0 charset-normalizer-2.0.8 cryptography-36.0.0 dash-2.0.0 dash-core-components-2.0.0 dash-html-components-2.0.0 dash-table-5.0.0 flask-compress-1.10.1 gunicorn-20.1.0 h11-0.12.0 idna-3.3 itsdangerous-2.0.1 numpy-1.21.2 outcome-1.1.0 pandas-1.2.1 plotly-5.3.1 pycparser-2.21 selenium-4.1.0 sniffio-1.2.0 sortedcontainers-2.4.0 soupsieve-2.3.1 tenacity-8.0.1 trio-0.19.0 trio-websocket-0.9.2 wsproto-1.0.0


### Set up Selenium for use

Requires we have a chromedriver executable. A "setup_selenium.sh" has been provided which downloads chromedriver executable for **linux** systems. If you're running this on another, please download the version 95 chromedriver for your corresponding operating system [here](https://chromedriver.chromium.org/downloads).

One has been included as part of the repository. It is located at chargemaster/web_scraper/chromedriver.

In [8]:
%cd ../chargemaster/web_scraper/
!./setup_selenium.sh
%cd ../../examples

/home/jihk/Desktop/ChargeMaster_Data_Aggregator/chargemaster/web_scraper
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9747k  100 9747k    0     0  2736k      0  0:00:03  0:00:03 --:--:-- 2736k
Archive:  chromedriver_linux64.zip
  inflating: chromedriver            
/home/jihk/Desktop/ChargeMaster_Data_Aggregator/examples


### Import Packages

In [9]:
import pandas as pd
import os, json, requests
from time import time
import numpy as np
from bs4 import BeautifulSoup
from requests.exceptions import InvalidURL, SSLError, ConnectionError
from urllib3.exceptions import NewConnectionError, LocationParseError
from selenium import webdriver
import urllib
from urllib.request import urlretrieve
from IPython.display import clear_output 
MAX_RUN_TIME=480 # 480seconds = 8 minutes max runtime per hospital
#MAX_RUN_TIME=120
visited_urls=[]

# Loading in the JSON file from the Google Drive

In [12]:
URLS_PATH = "../data/hospital_urls.json"
hospital_urls = json.load(open(URLS_PATH))
print(hospital_urls)

{'Arbor Health': {'wsha_url': 'https://www.wsha.org/members/morton-general-hospital', 'hospital_url': 'http://www.mortongeneral.org', 'county': 'Lewis', 'nbeds': 25, 'congressional_district': '3', 'legislative_district': '20', 'scraped_cdm': True}, 'Astria Sunnyside Hospital': {'wsha_url': 'https://www.wsha.org/members/sunnyside-community-hospital-clinics', 'hospital_url': 'https://www.astria.health/locations/astria-sunnyside-hospital', 'county': 'Yakima', 'nbeds': 25, 'congressional_district': '4', 'legislative_district': '15', 'scraped_cdm': True}, 'Astria Toppenish Hospital': {'wsha_url': 'https://www.wsha.org/members/toppenish-community-hospital', 'hospital_url': 'https://www.astria.health/locations/astria-toppenish-hospital', 'county': 'Yakima', 'nbeds': 48, 'congressional_district': '4', 'legislative_district': '15', 'scraped_cdm': True}, 'Cascade Behavioral Health': {'wsha_url': 'https://www.wsha.org/members/cascade-behavioral-health', 'hospital_url': 'https://www.cascadebh.com/

# Beginning Web Crawling

Code for the web crawler/scraper itself

## "Blacklist" websites
Many hospitals have external links to linkedin, news articles about them, facebook, etc. and obviously we don't want the scraper to go there. So we defined a blacklist of domains that the scraper should avoid. Has been changed to be a json file in the python equivalent.

In [15]:
blacklist = ['facebook',
             'yahoo',
             'gmail',
             'google', 
             'linkedin',
             'javascript',
             'javascript;',
             'javascript:;',
             'youtube',
             'twitter',
             'tiktok',
             'cdc',
             'mailto',
             'nih.gov',
             'coronavirus.gov',
             'usa.gov',
             'youtube',
             'instagram',
             'doh',
             'aboutus',
             'about-us',
             'news',
             'employment',
             'mailto',
             'covid19', 
             'covid-19',
             'covid_19',
             '.gov',
             'dhs',
             'tel:+',
             '#content',
             '#main',
             'about',
             'granthealth'
             'forgot', 
             'password',
             'goo.gl',
             'tel:',
             'apple.com',
             'microsoft',
             'mozilla',
             'contactus',
             'contactUs',
             'ContactUs',
             'tumblr',
             'whatsapp',
             'youtu',
             'vimeo',
             'search',
             'wa',
             'millcreek',
             'gift',
             'fchn',
             'cellnetix',
             'merchant',
             'cuisine',
             'office',
             'flickr'
             ]
"""
Adds to the list if any of the elements of "blacklist" are contained within url
so if the list is non-empty, it contains a blacklist site. Returns if it contains
a blacklist site or not. 
"""
def is_blacklist(url):
  blist = [site for site in blacklist if (site in url)]
  is_blacklisted = len(blist) > 0 
  return is_blacklisted

## Selenium

Need to utilize Selenium to interact with the webpages. BeautifulSoup to crawl while using Seleniuim to interact with JS elements.

In [14]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

## Requests & Iterating Through Hospitals

In [16]:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
HEADERS = {'User-Agent': USER_AGENT}
"""
Didn't foresee that exception handling for just making the HTTP request would be
this difficult to track down. Broke apart the web crawler into two parts, one
for making the request and handling the HTTP status codes and one to parse
the actual HTML returned. 
"""
def get_request(hospital_name=None, url=None):
  if not hospital_name or not url: return
  req = None
  try:
    req = requests.get(url, headers=HEADERS)
  except (InvalidURL, ConnectionError):
    print(f"Invalid URL: {url}, Hospital: {hospital_name}")
    try:
      req = requests.get(url, verify=False, headers=HEADERS)
    except: return
  except (SSLError):
    req = requests.get(url, verify=False, headers=HEADERS)
    """
    Not doing 400s and 500s. Fiddling with headers still won't let me access some sites. 
    400s are usually invalid urls that are outdated. 
    Majority of the urls are still valid and return 200. 
    """
  if req.status_code >= 400: return None 
  return req
  #print(f"Response Code: {req.status_code}, hospital url: {url}")

# Code to Check if link is downloadable
To see if a given URL can be downloaded (E.g. in the case that it's a excel or csv file).
Taken from https://www.codementor.io/@aviaryan/downloading-files-from-urls-in-python-77q3bs0un

In [17]:
def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    try:
      h = requests.head(url, allow_redirects=True)
    except:
      return False
    header = h.headers
    content_type = header.get('content-type')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

# Selenium Handler
Will allow us to click for any dynamically allocated content. E.g. a download onclick

Will grab all web elements with "onclick" by using XPATH. Iterates through these given elements and clicks them.

In [27]:
from urllib3.exceptions import MaxRetryError
from selenium.common.exceptions import ElementNotInteractableException, InvalidArgumentException

def selenium_handle(url, time_diff):
  print(f"Selenium handling: {url}")
  if time_diff > MAX_RUN_TIME: return
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument('--headless')
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument('--disable-dev-shm-usage')
  wd = webdriver.Chrome('../chargemaster/web_scraper/chromedriver', options=chrome_options)
  try:
    wd.get(url)
  except (MaxRetryError, InvalidArgumentException): 
    wd.quit()
    return

  try:
    elems = wd.find_elements(By.XPATH, "//a[@onclick]")
    for elem in elems:
      try:
        print(f"Found {str(elem)} onclick element")
        elem.click()
      except ElementNotInteractableException: continue
  except: 
    wd.quit()
    return
  wd.quit()
  

## Code to check if any hospital data has been downloaded
Will check the working directory for any new files and move them to the subdirectory given. Moving files across file systems must be an atomic operation meaning that it is the only one run for the given process at a time. 

The code for this was taken from https://alexwlchan.net/2019/03/atomic-cross-filesystem-moves-in-python/.

In [19]:
import errno
import os
import shutil
import uuid

# https://alexwlchan.net/2019/03/atomic-cross-filesystem-moves-in-python/
def safe_move(src, dst):
    """Rename a file from ``src`` to ``dst``.

    *   Moves must be atomic.  ``shutil.move()`` is not atomic.
        Note that multiple threads may try to write to the cache at once,
        so atomicity is required to ensure the serving on one thread doesn't
        pick up a partially saved image from another thread.

    *   Moves must work across filesystems.  Often temp directories and the
        cache directories live on different filesystems.  ``os.rename()`` can
        throw errors if run across filesystems.

    So we try ``os.rename()``, but if we detect a cross-filesystem copy, we
    switch to ``shutil.move()`` with some wrappers to make it atomic.
    """
    try:
        os.rename(src, dst)
    except OSError as err:

        if err.errno == errno.EXDEV:
            # Generate a unique ID, and copy `<src>` to the target directory
            # with a temporary name `<dst>.<ID>.tmp`.  Because we're copying
            # across a filesystem boundary, this initial copy may not be
            # atomic.  We intersperse a random UUID so if different processes
            # are copying into `<dst>`, they don't overlap in their tmp copies.
            copy_id = uuid.uuid4()
            tmp_dst = "%s.%s.tmp" % (dst, copy_id)
            shutil.copyfile(src, tmp_dst)

            # Then do an atomic rename onto the new name, and clean up the
            # source image.
            os.rename(tmp_dst, dst)
            os.unlink(src)
        else:
            raise

IGNORE_LIST = ['.config', 'drive', '.ipynb_checkpoints'] # Google Colab or Jupyter Notebook related files. Unrelated to scraped data
"""
ERROR: Cross file system transfers cannot be done with os.rename, os.replace
Try using OS methods but if it doesn't work, use shuttil.move
Needs to be atomic
"""
def check_and_move_files(SUBDIR_PATH=None):
  if not SUBDIR_PATH: return
  filenames = os.listdir(".")
  for filename in filenames:
    if filename in IGNORE_LIST: continue
    safe_move(f"./{filename}", f"{SUBDIR_PATH}/{filename}")

## Code to create a subdirectory for each hospital

Each hospital will have its own subdirectory created to hold all files scraped. This will help us in organizing files since the scraper will most likely pick up other files along the way.

In [20]:
SUBDIR_PATH = '../data/scraped_files/'

"""
Just replace all whitespace in the name with a underscore (_) for file naming conventions.
The names are taken from the WSHA page. We can assume they do not have any
mistakes such as having whitespace in front of the name or behind.

Will return the path of the subdirectory.
"""
def create_subdir(hospital_name):
  hospital_name = hospital_name.strip()
  subdir_name = hospital_name.replace(" ", "_")
  FULL_PATH = f"{SUBDIR_PATH}/{subdir_name}"
  if not os.path.isdir(FULL_PATH): 
    os.mkdir(FULL_PATH)
  return FULL_PATH


In [None]:
"""

WIP: IT DOES NOT WORK PROPERLY YET 


High Level Plan:
1. Find all 'a' tags
2. From the 'a' tags, get 'href's 
3. Comb through all hrefs for resources ending in 'csv' or 'xlsx' 
4. Else, the rest of the links are recursively called into web_crawl(same_hospital_name, new_link)

Cases to look out for:
1. Staying within the domain of the hospital
1.1 This could be being redirected to their Facebook page instead. So...
We want to check that the new URL we're visiting is a resource within the hospital's
webpages.
1.2 Can check for things like matching URLs. Evergreenhealth.org/patientportal is 
valid within Evergreenhealth but... Linkedin.com/Evergreenhealth isn't. 


Going to implement an x-levels deep search. From the main website, given 'x' levels,
It'll search x subdomains deep. E.g. If x = 10, and it goes to let's say
hopsitalwebsite.com/patients --> x = 9 now.  From here, it'll also search
every url and each of those will go down 8 levels. It'll be a branching effect. 
"""
visited_urls = []
def web_crawl(hospital_name=None, hospital_url=None, url=None, level=1, starttime=-1):
  if level == 0: return # Reached the extent of our search.
  #if hospital_url not in url: return # Stay within the hospital domain. See note above
  if url in visited_urls: return # Prevent circular crawling.
  print(f"Name: {hospital_name}, URL being parsed: {url}, {level} levels deep.")
  time_diff = time() - starttime
  if time_diff > MAX_RUN_TIME: 
    print("TIME RUN OUT")
    return

  try: # malformed url sometimes
    page = get_request(hospital_name, url)
  except requests.exceptions.InvalidSchema:
    return
  if not page: return
  visited_urls.append(url)
  soup = BeautifulSoup(page.content)
  soup_html = str(soup)
  downloadble_file = True
  content_type = str(page.headers['content-type']).lower()
  print(content_type)
  if 'text' in content_type or 'html' in content_type:
    downloadable_file = False
  #downloadable_file
  # SELENIUM HANDLER
  if time() - starttime < MAX_RUN_TIME:
    if 'onclick=' in soup_html or 'onclick =' in soup_html:
      selenium_handle(url, time_diff)
    
  for a_href in soup.findAll('a'):
    if time_diff > MAX_RUN_TIME: break
    url = a_href.get('href')
    if not url: continue
    full_url = url
    if 'http' not in url and 'https' not in url: 
      full_url = hospital_url + a_href.get('href')
    if downloadble_file:
      filename = full_url.split("/")[-1]
      try:
        urlretrieve(full_url, f"./{filename}") # download with unique 
      except: continue
    if not is_blacklist(full_url):
      if time() - starttime < MAX_RUN_TIME:
        web_crawl(hospital_name=hospital_name, hospital_url=hospital_url, url=full_url, level=level-1, starttime=starttime)
      else: return
    else:
      continue
    # TODO: If it ends in xlsx or csv, download, else recursively call


## Code for the Web Crawler + Scraper

In [22]:
import re
def get_domain(url):
  pattern = '(http[s]?:\/\/([w]{3}\.)?[a-zA-Z1-9]*\.(org|com|net)).*'
  match = re.match(pattern, url)
  if match: return match.group(1)
  return match

get_domain('https://cascademedical.org/patient-resources/billing') # Testing

'https://cascademedical.org'

In [23]:
from urllib.parse import urljoin

APX_KEYWORD = 'apps.para'

def is_within_time(starttime): return time() - starttime < MAX_RUN_TIME

def is_downloadable_link(page):
  content_type = str(page.headers['content-type']).lower()
  if 'text' in content_type or 'html' in content_type: 
    return False
  return True

def format_url(url, href):
  full_url = ""
  if 'http' not in href: # These are strict redirects. 
    if href.startswith("."): return None
    elif url.endswith("/"): full_url = url + href
    else: full_url = f"{url}/{href}"
    full_url = full_url.replace("//", "/")
    if 'https:/' in full_url: full_url = full_url.replace("https:/", "https://")
    else: full_url = full_url.replace("http:/", "http://")
  else: 
    full_url = href
  return full_url

def url_format(url, href):
  full_url = ""
  if 'http' not in href:
    if 'para' in href: print(href)
    if not full_url.endswith("/"): full_url = url + "/" 
    full_url = urljoin(full_url, href)
  else: full_url = href
  return full_url
  

def get_domain(url):
  pattern = '(http[s]?:\/\/([w]{3}\.)?[a-zA-Z1-9]*\.(org|com|net)).*'
  if type(url) != str: return None
  match = re.match(pattern, url)
  if match: return match.group(1)
  return match

def check_download(full_url, page):
  if is_downloadable_link(page):
    filename = full_url.split("/")[-1]
    print(f"{filename} is downloadable")
    try:
      urlretrieve(full_url, f"./{filename}") # download with unique 
    except: pass

def check_selenium(soup, url, starttime):
  soup_html = str(soup)
  if 'onclick' in soup_html:
    selenium_handle(url, time() - starttime)

def check_for_csv_xlsx(url, href):
  full_url = ""
  if 'http' in href: full_url = href
  else: full_url = url_format(url, href)
  if full_url.endswith("csv") or full_url.endswith("xlsx"):
    try:
      filename = href.split("/")[-1]
      urlretrieve(alternative_link, f"./{filename}")
    except:
      pass

def check_for_csv_xlsx_files(hospital_name, hrefs, url, starttime, levels):
  full_url = ""
  for href in hrefs:
    if not href: continue
    if href.endswith("csv") or href.endswith("xlsx"):
      if 'http' not in href: # These are strict redirects. 
        if href.startswith("."): return None
        elif url.endswith("/"): full_url = url + href
        else: full_url = f"{url}/{href}"
      else: 
        full_url = href
      print(f"Found file: {href}")
      domain_link = get_domain(f"{hospital_urls[hospital_name]['hospital_url']}")
      alternative_link=f"{domain_link}/{href}"
      try:
        filename = href.split("/")[-1]
        urlretrieve(alternative_link, f"./{filename}")
      except:
        pass
      try:
        filename = href.split("/")[-1]
        urlretrieve(href, f"./{filename}")
      except:
        pass

      crawl_and_scrape(hospital_name, alternative_link, starttime, 1) 
      crawl_and_scrape(hospital_name, full_url, starttime, 1) # terminate there.
      
def check_apx(hospital_name, url, starttime):
  if 'apps.para' in url: 
      crawl_and_scrape(hospital_name=hospital_name, url=url, starttime=starttime, levels=3) #refresh levels

def crawl_and_scrape(hospital_name, url, starttime, levels):
  if levels == 0: return
  if not is_within_time(starttime): return
  if url in visited_urls: return
  if is_blacklist(url): return

  print(f"Crawling & Scraping for {hospital_name}, on current URL: {url}, {levels} levels deep.")
  try: # malformed url sometimes
    page = get_request(hospital_name, url)
  except requests.exceptions.InvalidSchema: return
  if not page: return
  visited_urls.append(url) # Add to visited webpage list
  soup = BeautifulSoup(page.content)
  check_download(url, page)
  # Check Selenium
  check_selenium(soup, url, starttime)
  hrefs = [a.get('href') for a in soup.findAll('a')]
  for a_href in soup.findAll('a'):
    if not is_within_time(starttime): break
    href = a_href.get('href')
    if href in visited_hrefs: continue
    visited_hrefs.append(href)
    if not href: continue # No href links so 'url' will be NoneType
    if href.endswith(".pdf"): continue
    full_url = url_format(url, href)
    check_for_csv_xlsx(url, href)
    """
    domain = get_domain(full_url)
    if domain: 
      alternate_url = format_url(domain, href)
      crawl_and_scrape(hospital_name=hospital_name, url=alternate_url, starttime=starttime, levels=levels-1)
    """
    check_apx(hospital_name, full_url, starttime)
    
    if not full_url: continue
  # If website is on our blacklist, skip
    if is_blacklist(full_url): continue
    crawl_and_scrape(hospital_name=hospital_name, url=full_url, starttime=starttime, levels=levels-1)
    #check_for_csv_xlsx_files(domain, hospital_name, href, url, starttime, levels)
    
    
    # Check to see if we can download the file


# Scrape One Hospital
As an example, we'll go through just one hospital.

In [36]:
hospital_name = 'Arbor Health'
hospital_data = hospital_urls[hospital_name]
hospital_url = hospital_data['hospital_url']
# Create a subdirectory for the hospital if one doesn't already exist
# All scraped files for this hospital will go here.
subdir_path = create_subdir(hospital_name)

# Keep track of where the Scraper has already been
visited_urls, visited_hrefs = [], []

# Initiate Scraping
# time() = current time, 8 = how many levels deep for recursive calls should it go
crawl_and_scrape(hospital_name, hospital_url, time(), 8)

# Any generated files will go to this new subdirectory
check_and_move_files(subdir_path)

Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org, 8 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/provider-directory/, 7 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/locations/, 6 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/forms/contact-us/, 5 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/, 4 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/announcements/, 3 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/forms/request-for-public-records/, 2 levels deep.
Crawling & Scraping for Arbor Health, on current URL: http://www.mortongeneral.org/patients-visitors/billing-pricing/, 1 levels deep.
Crawling & Scraping for Arbor Health, on current URL: https://apps.para-hc

  wd = webdriver.Chrome('../chargemaster/web_scraper/chromedriver', options=chrome_options)


Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="a78e56ba-5551-41e8-8129-51fd10f2479b")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="34af23d2-f246-4aad-ba25-5efa7ecfa84b")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="1b01765e-d49f-48cf-8bff-e1cb4bd1b3e6")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="c2b02c83-1a43-4ac0-86d2-ea30d6cfd52b")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="fd968a76-bb55-4e77-8df4-a3e7df41a254")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="793fbc0a-17fa-4da8-8fc2-7777ebddb17b")> onclick element
Found <sel

Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="94f327a3-3664-4690-981b-b399ab3f198c")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="890f0dfc-cc24-43f1-9c7e-d3018467d05c")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="a1debf37-30ec-46dd-a07c-29dd87da680c")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="1e921a92-a30b-44d3-b9bd-cabcc6b83a5b")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="bba8d85a-a220-486f-85ac-f22998d30fc7")> onclick element
Found <selenium.webdriver.remote.webelement.WebElement (session="db5f5b539791b7f59a101fe02738b4bb", element="55868657-210e-4879-b07f-86f9912d4668")> onclick element
Crawling &

KeyboardInterrupt: 

Manually ended the notebook early because it found the CDM very quickly for Arbor Health so it never got to move the files but...

In [43]:
!ls ./

'91-1033860_Arbor Health Morton Hospital _standardcharges.csv.crdownload'
'91-1033860_Arbor Health Morton Hospital _standardchargesDRG.csv'
'91-1033860_Arbor Health Morton Hospital _standardchargesSS.csv.crdownload'
 Chargemaster_Scraper.ipynb


We can see that it was able to download the CDM files for Arbor into the current directory. Will manually be removing these for subsequent use.