# Web Scraping to Build a Database of Hospital Safety Grades

hospitalsafetygrade.org lets us search for hospitals by:
 - City
 - State
 - ZIP code
 - Hospital Name

For ZIP code searches, we can ask for all entries within 5, 10, 50, 100, or 200 miles. The results are sorted by distance.

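There are approximately 44,000 five-digit ZIP codes in use in the U.S. With the area of the U.S. at 3,618,780 sq miles, that makes the average area per ZIP code about 82.25 sq miles, which corresponds to a circle with a radius of roughly 5 miles. The 50-mile option used in the requests below covers a far larger area (about 7,850 sq miles), so it comfortably covers an average ZIP code at the cost of heavy overlap between neighboring searches.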
It should be noted that ZIP codes are based on population, not geography. This estimate could produce highly inaccurate results for some ZIP codes, but we have to start somewhere, and this number can easily be modified later.
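
The arithmetic behind this estimate can be checked with a few lines of Python. The sketch below recomputes the average area per ZIP code from the two figures quoted above, converts it to the radius of a circle of equal area, and lists how much area each radius option covers. The variable names are illustrative.

```python
import math

US_AREA_SQ_MI = 3_618_780                  # figure quoted in the text
ZIP_CODE_COUNT = 44_000                    # approximate count quoted in the text
RADIUS_OPTIONS_MI = [5, 10, 50, 100, 200]  # radii offered by the search form

avg_area = US_AREA_SQ_MI / ZIP_CODE_COUNT

# Radius of a circle whose area equals the average area per ZIP code.
equivalent_radius = math.sqrt(avg_area / math.pi)

print(f"average area per ZIP code: {avg_area:.3f} sq mi")
print(f"equivalent circle radius:  {equivalent_radius:.1f} mi")

for r in RADIUS_OPTIONS_MI:
    circle_area = math.pi * r ** 2
    print(f"radius {r:>3} mi covers {circle_area:>10,.0f} sq mi "
          f"({circle_area / avg_area:,.1f}x the average)")
```

Because ZIP code sizes vary widely, these numbers are only a starting point for choosing a radius.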

To begin, we attempt to get the HTML of a single page, using the `requests` library in Python:

In [None]:
import requests

URL = "https://www.hospitalsafetygrade.org/search?findBy=zip&zip_code=90095&radius=50&city=&state_prov=&hospital="
page = requests.get(URL)

print(page.text)
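
The query string in the URL above can also be assembled from its parts, which makes it easier to vary the ZIP code, radius, or state later. This is a minimal sketch using only the standard library. `build_search_url` is a hypothetical helper, and the parameter names are copied from the URLs that appear in this notebook rather than from any documented API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.hospitalsafetygrade.org/search"

def build_search_url(find_by, zip_code="", radius=None, city="", state="", hospital=""):
    """Assemble a search URL (hypothetical helper; parameter names taken from the notebook's URLs)."""
    params = {"findBy": find_by, "zip_code": zip_code}
    if radius is not None:
        # The radius parameter only appears in the ZIP code searches in this notebook.
        params["radius"] = radius
    params.update({"city": city, "state_prov": state, "hospital": hospital})
    return BASE_URL + "?" + urlencode(params)

zip_url = build_search_url("zip", zip_code="90095", radius=50)
state_url = build_search_url("state", state="CA")
print(zip_url)
print(state_url)
```

Passing the result to `requests.get` behaves the same as using the hard-coded string.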

The entire HTML output produced by the above is too large to search for information with ad hoc parsing or regular expressions.

Attempting to isolate the portion we need:

In [None]:
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="SearchResults")
print(results)

The output above indicates that the search returned no results. Visiting the website shows that this is not the case: the HTML returned by the request does not match the HTML shown in the browser. This is characteristic of a web page that first loads an incomplete page and then updates itself with the actual information. This can be handled by reproducing the logic used to update the page (see: https://stackoverflow.com/questions/59727663/why-request-get-returning-wrong-page-content), but that process seems too complicated.

The `Selenium` package can obtain HTML from dynamic websites because it drives a real browser, so we use that instead:

In [16]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Retrieving the page we want to scrape:

In [18]:
driver.get('https://www.hospitalsafetygrade.org/search?findBy=zip&zip_code=90024&radius=50&city=&state_prov=&hospital=')

Attempting to get the HTML of the search results by matching their class names from their divs:

In [23]:
# fifty_closest = driver.find_elements_by_xpath('//td[@class="itemWrapper leapfrogSearchResult"]') <- DEPRECATED
fifty_closest = driver.find_elements(by = By.CLASS_NAME, value = 'itemWrapper leapfrogSearchResult')
print(fifty_closest)

[]


The above produced an empty list. To check whether the retrieved HTML is accurate, we print the full HTML of the page:

In [None]:
print(driver.page_source)

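The HTML matches what we expect. On further experimenting, the initial search appears to have failed because `By.CLASS_NAME` expects a single class name, whereas `itemWrapper leapfrogSearchResult` is two classes separated by a space; nesting is not the issue. A CSS selector such as `div.itemWrapper.leapfrogSearchResult` (with `By.CSS_SELECTOR`) is one way to match elements that carry both classes.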

A better way to isolate elements is to use their XPath, which is analogous to a path in a file system. It can be obtained for any element on a page with the following steps:
 - Right click on the element
 - Select "Inspect"
 - In the panel of HTML that appears, click on the 3 dots to the left of the highlighted line, hover over "Copy" and select "Copy full XPath"
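
The file-path analogy can be tried without a browser. The sketch below uses Python's `xml.etree.ElementTree`, which supports a limited subset of XPath, on a small hand-written snippet that loosely imitates the results page. The snippet and its `data-name` values are assumptions for illustration, not the real page, and Selenium's XPath support is more complete than what is shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the results page (an assumption, not the real HTML).
SNIPPET = """
<html>
  <body>
    <div id="results">
      <div class="itemWrapper leapfrogSearchResult" data-name="Hospital A" data-grade="a" />
      <div class="itemWrapper leapfrogSearchResult" data-name="Hospital B" data-grade="c" />
      <div class="footer" />
    </div>
  </body>
</html>
"""

root = ET.fromstring(SNIPPET.strip())  # root is the <html> element

# Position-based path, read like a file path: html -> body -> div -> div.
by_position = root.findall("./body/div/div")

# Attribute predicate: any <div> whose class attribute is exactly this string.
by_class = root.findall(".//div[@class='itemWrapper leapfrogSearchResult']")
names = [el.get("data-name") for el in by_class]

print(len(by_position))
print(names)
```

The position-based path also picks up the unrelated footer `div`, while the attribute predicate returns only the result entries. A copied full XPath is position-based, so it depends on the page layout staying the same.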
 

In [28]:
test_list = driver.find_elements(by = By.XPATH, value = "/html/body/div[1]/div/section[3]/div[3]/div")
print(len(test_list))

100


In [29]:
print(test_list[1].text)

Cedars-Sinai Medical Center
8700 Beverly Boulevard
Los Angeles, CA 90048-1865
View the full Score
This Hospital's Grade
SPRING 2022


In [40]:
print(test_list[1].get_attribute('outerHTML'))

<div class="itemWrapper leapfrogSearchResult" data-lat="34.075153" data-lon="-118.3802766" data-distance="3.1484841626973" data-grade="c" data-slug="cedars-sinai-medical-center" data-name="Cedars-Sinai Medical Center" style="">
        <div class="detailWrapper">
            <div class="name">
                <a data-details-link="" href="/h/cedars-sinai-medical-center">Cedars-Sinai Medical Center</a>
            </div>
            <div class="address">
                8700 Beverly Boulevard<br>
                Los Angeles, CA 90048-1865
            </div>
            <div class="readmore">
                <a data-details-link="" href="/h/cedars-sinai-medical-center">View the full Score</a>
            </div>
        </div>
        <div class="gradeWrapper grade-c">
            <div class="title">
                This Hospital's Grade
            </div>
            <div class="grade">
                 <a data-details-link="" href="/h/cedars-sinai-medical-center"><img src="/media/image/

We have now retrieved the information we need and simply have to process it and add it to a DataFrame.
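
As one browser-free way to process an entry, the `data-*` attributes in the opening tag can be read with the standard library's `html.parser`. This is only a sketch: `DataAttributeParser` is a hypothetical class, and the tag is copied from the `outerHTML` output above.

```python
from html.parser import HTMLParser

class DataAttributeParser(HTMLParser):
    """Collect the data-* attributes of the first tag encountered."""

    def __init__(self):
        super().__init__()
        self.data = None

    def handle_starttag(self, tag, attrs):
        if self.data is None:  # only the first (outermost) tag is of interest
            # Strip the "data-" prefix so keys read "lat", "lon", "grade", ...
            self.data = {key[5:]: value for key, value in attrs if key.startswith("data-")}

# Opening tag copied from the outerHTML output above.
opening_tag = ('<div class="itemWrapper leapfrogSearchResult" data-lat="34.075153" '
               'data-lon="-118.3802766" data-distance="3.1484841626973" data-grade="c" '
               'data-slug="cedars-sinai-medical-center" '
               'data-name="Cedars-Sinai Medical Center" style="">')

parser = DataAttributeParser()
parser.feed(opening_tag)
print(parser.data)
```

With a live Selenium element, calling `get_attribute('data-lat')` on the element should give the same value without any parsing.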

The same search can also be run by state instead of ZIP code, for example for California:

https://www.hospitalsafetygrade.org/search?findBy=state&zip_code=&city=&state_prov=CA&hospital=

In [None]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.hospitalsafetygrade.org/search?findBy=state&zip_code=&city=&state_prov=CA&hospital=')
state_list = driver.find_elements(by = By.XPATH, value = "/html/body/div[1]/div/section[3]/div[3]/div")
print(len(state_list))

In [54]:
print(state_list[1].get_attribute('outerHTML'))


<div class="itemWrapper leapfrogSearchResult" data-lat="34.1503596" data-lon="-118.2300566" data-distance="7810.2623225566" data-grade="a" data-slug="adventist-health-glendale" data-name="Adventist Health Glendale" style="">
        <div class="detailWrapper">
            <div class="name">
                <a data-details-link="" href="/h/adventist-health-glendale">Adventist Health Glendale</a>
            </div>
            <div class="address">
                1509 Wilson Terrace<br>
                Glendale, CA 91206-4098
            </div>
            <div class="readmore">
                <a data-details-link="" href="/h/adventist-health-glendale">View the full Score</a>
            </div>
        </div>
        <div class="gradeWrapper grade-a">
            <div class="title">
                This Hospital's Grade
            </div>
            <div class="grade">
                 <a data-details-link="" href="/h/adventist-health-glendale"><img src="/media/image/hss-grade-a-2016.

In [56]:
print(state_list[1].text)

Adventist Health Glendale
1509 Wilson Terrace
Glendale, CA 91206-4098
View the full Score
This Hospital's Grade
SPRING 2022


In [61]:
import re

In [79]:
for hospital in state_list:
    # The opening tag carries the data-* attributes; keep only that part of the HTML.
    opening_tag = hospital.get_attribute('outerHTML').partition(">")[0]
    # The visible text has one field per line: name, street, "city, state ZIP", ...
    lines = hospital.text.split("\n")
    hospital_list = [lines[0], lines[1] + ", " + lines[2]]  # name, full address
    # Pull each value out of its data-* attribute with a capture group.
    for attribute in ('data-lat', 'data-lon', 'data-distance', 'data-grade'):
        hospital_list.append(re.search(attribute + '="(.*?)"', opening_tag).group(1))
    print(hospital_list)


['Adventist Health - Bakersfield', '2615 Chester Avenue, Bakersfield, CA 93301-2006', '35.3833475', '-119.0205429', '7827.7829684579', 'a']
['Adventist Health Glendale', '1509 Wilson Terrace, Glendale, CA 91206-4098', '34.1503596', '-118.2300566', '7810.2623225566', 'a']
['Adventist Health Hanford', '115 Mall Drive, Hanford, CA 93230-3513', '36.3237872', '-119.6664466', '7841.8158676335', 'a']
['Adventist Health Reedley', '372 W. Cypress Avenue, Reedley, CA 93654-2199', '36.6080787', '-119.4517641', '7824.2385162694', 'a']
['Adventist Health Selma', '1141 Rose Avenue, Selma, CA 93662', '36.5676997', '-119.5991582', '7832.87803033', 'a']
['Adventist Health St. Helena', '10 Woodland Road, St. Helena, CA 94574-9554', '38.5425501', '-122.4748268', '7933.7326975683', 'a']
['Adventist Health Ukiah Valley', '275 Hospital Drive, Ukiah, CA 95482-4531', '39.1531894', '-123.2030161', '7954.2195129498', 'a']
['Alhambra Hospital Medical Center', '100 S. Raymond Avenue, Alhambra, CA 91801-3199', '34

In [None]:
# Took ~15 minutes to run

states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

final_list = []



for state in states:
    # A fresh browser session is started for every state.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get('https://www.hospitalsafetygrade.org/search?findBy=state&zip_code=&city=&state_prov=' + state + '&hospital=')
    # driver.execute_script('''window.open("https://www.hospitalsafetygrade.org/search?findBy=state&zip_code=&city=&state_prov=''' + state + '''&hospital=","_blank");''')
    state_list = driver.find_elements(by = By.XPATH, value = "/html/body/div[1]/div/section[3]/div[3]/div")
    print(len(state_list))
    for hospital in state_list:
        # The opening tag carries the data-* attributes; keep only that part of the HTML.
        opening_tag = hospital.get_attribute('outerHTML').partition(">")[0]
        # The visible text has one field per line: name, street, "city, state ZIP", ...
        lines = hospital.text.split("\n")
        hospital_list = [lines[0], lines[1] + ", " + lines[2]]  # name, full address
        # Pull each value out of its data-* attribute with a capture group.
        for attribute in ('data-lat', 'data-lon', 'data-distance', 'data-grade'):
            hospital_list.append(re.search(attribute + '="(.*?)"', opening_tag).group(1))
        final_list.append(hospital_list)
        print(hospital_list)
    print(state + " done!")
    # quit() ends the whole session, including the chromedriver process; close() only closes the window.
    driver.quit()

In [120]:
final_list[0]

['Alaska Regional Hospital',
 '2801 DEBARR ROAD, Anchorage, AK 99508-2997',
 '61.210404',
 '-149.8286335',
 '7918.0198498343',
 'a']
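
Every field in these rows is still a string. One possible way to give the rows a type-checked shape before building a DataFrame is sketched below. `HospitalRecord` and `from_row` are hypothetical names, and the example row is copied from the `final_list[0]` output above.

```python
from dataclasses import dataclass

@dataclass
class HospitalRecord:
    name: str
    address: str
    latitude: float
    longitude: float
    distance: float
    grade: str

    @classmethod
    def from_row(cls, row):
        """Build a record from a six-item list of strings like those in final_list."""
        name, address, lat, lon, distance, grade = row
        # Convert the numeric fields and normalize the grade to upper case.
        return cls(name, address, float(lat), float(lon), float(distance), grade.upper())

# Row copied from the final_list[0] output above.
row = ['Alaska Regional Hospital',
       '2801 DEBARR ROAD, Anchorage, AK 99508-2997',
       '61.210404', '-149.8286335', '7918.0198498343', 'a']

record = HospitalRecord.from_row(row)
print(record)
```

Unpacking into six names also raises an error if a row has the wrong number of fields, which surfaces malformed entries early.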

In [None]:
# print(final_list)
df = pd.DataFrame(final_list, columns = ['Name', 'Address', 'Latitude', 'Longitude', 'Distance', 'SafetyGrade'])
print(df)

In [123]:
# Saving df as a CSV
df.to_csv('HospitalSafetyGrade_Raw.csv')

In [127]:
df.head(5)

Unnamed: 0,Name,Address,Latitude,Longitude,Distance,SafetyGrade
0,Alaska Regional Hospital,"2801 DEBARR ROAD, Anchorage, AK 99508-2997",61.210404,-149.8286335,7918.0198498343,a
1,Mat-Su Regional Medical Center,"2500 South Woodworth Loop, Palmer, AK 99645",61.562971,-149.2578179,7887.2712026392,b
2,Bartlett Regional Hospital,"3260 Hospital Drive, Juneau, AK 99801-7808",58.328916,-134.4650653,7708.9709345946,c
3,Central Peninsula General Hospital,"250 Hospital Place, Soldotna, AK 99669-6999",60.4934413,-151.0780122,7982.4571687651,c
4,Fairbanks Memorial Hospital,"1650 Cowles Street, Fairbanks, AK 99701-5998",64.8311569,-147.7399472,7674.3740755271,c


Making ZIP code a separate column:

In [136]:
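def zip_from_tokens(tokens):
    # The last space-separated token is the ZIP (or ZIP+4) code; keep its first five digits.
    return tokens[-1][:5]

df["ZIP"] = df.Address.str.split(" ")
df["ZIP"] = df["ZIP"].apply(zip_from_tokens)
df.head(5)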

Unnamed: 0,Name,Address,Latitude,Longitude,Distance,SafetyGrade,ZIP
0,Alaska Regional Hospital,"2801 DEBARR ROAD, Anchorage, AK 99508-2997",61.210404,-149.8286335,7918.0198498343,a,99508
1,Mat-Su Regional Medical Center,"2500 South Woodworth Loop, Palmer, AK 99645",61.562971,-149.2578179,7887.2712026392,b,99645
2,Bartlett Regional Hospital,"3260 Hospital Drive, Juneau, AK 99801-7808",58.328916,-134.4650653,7708.9709345946,c,99801
3,Central Peninsula General Hospital,"250 Hospital Place, Soldotna, AK 99669-6999",60.4934413,-151.0780122,7982.4571687651,c,99669
4,Fairbanks Memorial Hospital,"1650 Cowles Street, Fairbanks, AK 99701-5998",64.8311569,-147.7399472,7674.3740755271,c,99701
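
Taking the last space-separated token works for the addresses shown, but it returns arbitrary text if an address does not end in a ZIP code. A slightly stricter alternative, sketched here with a hypothetical `extract_zip` helper, uses a regular expression and returns `None` when no ZIP code is found. The sample addresses are copied from the output above.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix, at the end of the address.
ZIP_PATTERN = re.compile(r"(\d{5})(?:-\d{4})?\s*$")

def extract_zip(address):
    """Return the five-digit ZIP code at the end of an address, or None if absent."""
    match = ZIP_PATTERN.search(address)
    return match.group(1) if match else None

print(extract_zip("2801 DEBARR ROAD, Anchorage, AK 99508-2997"))
print(extract_zip("2500 South Woodworth Loop, Palmer, AK 99645"))
print(extract_zip("Address unavailable"))
```

With pandas, the same function could be applied to the `Address` column through `apply`.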


In [137]:
df["SafetyGrade"] = df.SafetyGrade.str.upper()
df.head(5)

Unnamed: 0,Name,Address,Latitude,Longitude,Distance,SafetyGrade,ZIP
0,Alaska Regional Hospital,"2801 DEBARR ROAD, Anchorage, AK 99508-2997",61.210404,-149.8286335,7918.0198498343,A,99508
1,Mat-Su Regional Medical Center,"2500 South Woodworth Loop, Palmer, AK 99645",61.562971,-149.2578179,7887.2712026392,B,99645
2,Bartlett Regional Hospital,"3260 Hospital Drive, Juneau, AK 99801-7808",58.328916,-134.4650653,7708.9709345946,C,99801
3,Central Peninsula General Hospital,"250 Hospital Place, Soldotna, AK 99669-6999",60.4934413,-151.0780122,7982.4571687651,C,99669
4,Fairbanks Memorial Hospital,"1650 Cowles Street, Fairbanks, AK 99701-5998",64.8311569,-147.7399472,7674.3740755271,C,99701


In [139]:
# Reorder the columns by name rather than by position.
cols = ['Name', 'SafetyGrade', 'ZIP', 'Latitude', 'Longitude', 'Distance', 'Address']
df2 = df[cols]
df2.head(5)

Unnamed: 0,Name,SafetyGrade,ZIP,Latitude,Longitude,Distance,Address
0,Alaska Regional Hospital,A,99508,61.210404,-149.8286335,7918.0198498343,"2801 DEBARR ROAD, Anchorage, AK 99508-2997"
1,Mat-Su Regional Medical Center,B,99645,61.562971,-149.2578179,7887.2712026392,"2500 South Woodworth Loop, Palmer, AK 99645"
2,Bartlett Regional Hospital,C,99801,58.328916,-134.4650653,7708.9709345946,"3260 Hospital Drive, Juneau, AK 99801-7808"
3,Central Peninsula General Hospital,C,99669,60.4934413,-151.0780122,7982.4571687651,"250 Hospital Place, Soldotna, AK 99669-6999"
4,Fairbanks Memorial Hospital,C,99701,64.8311569,-147.7399472,7674.3740755271,"1650 Cowles Street, Fairbanks, AK 99701-5998"


In [144]:
print(df.shape)
print(df2.shape)

df2.to_csv('HospitalSafetyGrade.csv')

(2652, 7)
(2652, 7)


In [107]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))




[WDM] - Current google-chrome version is 102.0.5005
[WDM] - Get LATEST chromedriver version for 102.0.5005 google-chrome
[WDM] - Driver [C:\Users\email\.wdm\drivers\chromedriver\win32\102.0.5005.61\chromedriver.exe] found in cache


In [110]:

driver.execute_script('''window.open("http://bing''' + "." +  '''com","_blank");''')


In [98]:
driver.close()