# Process: Scrape EPA Web Page for Features

Organization: Esri

Author: Alberto Nieto (anieto@esri.com)

Date: 8/12/2018

Process Overview:

1. Parse regions HTML to get the list of individual site URLs

2. Iterate: For each site:

    - Parse the regions HTML for the site attributes (Name, City, State, Contact, Updated, Field Activity, Response Type, Response Authority, Incident Category)
    - Parse the site HTML for needed attributes (Site ID, Latitude, Longitude, Photo)

## Set needed modules

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import arcgis

## Set helper functions

In [2]:
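def get_list_of_epasite_urls(region_html):
    """Return the site profile URLs found in a saved regions HTML file, plus their count."""
    profile_prefix = "https://response.epa.gov/site/site_profile.aspx"
    # Use a context manager so the file handle is closed after parsing
    with open(region_html) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tablerecords = soup.find_all('a', class_="NormalColoredFont")
    site_url_params_list = []
    for record in tablerecords:
        # Anchors without an href are skipped; keep only site profile links
        href = record.get('href')
        if href and href.startswith(profile_prefix):
            site_url_params_list.append(href)

    return site_url_params_list, len(site_url_params_list)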
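
The prefix test in `get_list_of_epasite_urls` can be tried on plain strings without a saved HTML file. The following is a minimal sketch, not part of the original notebook: `is_site_profile_url` and `site_id_from_url` are hypothetical helper names, the `example.com` link is made up, and reading the `site_id` query parameter assumes every profile link carries one.

```python
from urllib.parse import urlparse, parse_qs

PROFILE_PREFIX = "https://response.epa.gov/site/site_profile.aspx"

def is_site_profile_url(href):
    """Hypothetical helper: True when href points at a site profile page."""
    return bool(href) and href.startswith(PROFILE_PREFIX)

def site_id_from_url(url):
    """Hypothetical helper: return the site_id query value, or None if absent."""
    values = parse_qs(urlparse(url).query).get("site_id")
    return values[0] if values else None

hrefs = [
    "https://response.epa.gov/site/site_profile.aspx?site_id=11657",
    "https://example.com/other.aspx",  # made-up non-profile link
    None,  # stands in for an anchor without an href
]
profile_urls = [h for h in hrefs if is_site_profile_url(h)]
site_ids = [site_id_from_url(u) for u in profile_urls]
print(profile_urls, site_ids)
```

Extracting the ID this way could also help de-duplicate URLs before any pages are requested.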

In [3]:
def get_attributes_from_epasite_url(site_url):
    page = requests.get(site_url)
    site_soup = BeautifulSoup(page.content, 'html.parser')
    # Get Site ID
    id_div = site_soup.find("input", {"id": "ctl00_cp1_hid_Site_ID"})
    site_id = id_div.attrs['value']
    # Get Site Region
    region_div = site_soup.find("input", {"id": "ctl00_cp1_hid_Region"})
    region = region_div.attrs['value']
    # Get Site Name
    name_div = site_soup.find("input", {"id": "ctl00_FooterMenu1_hidSite_Name"})
    site_name = name_div.attrs['value']
    # Get Site Latitude
    lat_div = site_soup.find("span", {"id": "ctl00_cp1_lblLatitude"})
    lat = lat_div.text.split(": ")[-1]
    # Get Site Longitude
    lon_div = site_soup.find("span", {"id": "ctl00_cp1_lblLongitude"})
    lon = lon_div.text.split(": ")[-1]
    # Get Site Abstract
    abstract_div = site_soup.find("span", {"id": "ctl00_cp1_lblSite_Abstract"})
    abstract = abstract_div.text
    # Get Site Address
    address_div = site_soup.find("span", {"id": "ctl00_cp1_lblSiteAddress"})
    address = address_div.text
    # Get Site Contact
    sitecontact_div = site_soup.find("span", {"id": "ctl00_cp1_lblOSC_Name"})
    sitecontact = sitecontact_div.text
    # Get Site Contact E-Mail
    sitecontactemail_div = site_soup.find("a", {"id": "ctl00_cp1_lnkOSC_Email"})
    sitecontactemail = sitecontactemail_div.text
    # Get site photo link
    siteimg_path = "https://response.epa.gov" + site_soup.find("img", {"id": "ctl00_cp1_imgSitePhoto"}).attrs['src']
    
    # Set site dictionary
    site_dict = {}
    site_dict["Site ID"] = site_id
    site_dict["Site Name"] = site_name
    site_dict["Address"] = address
    site_dict["Region"] = region
    site_dict["POC"] = sitecontact
    site_dict["POC E-mail"] = sitecontactemail
    site_dict["Abstract"] = abstract
    site_dict["Photo URL"] = siteimg_path
    site_dict["Latitude"] = lat
    site_dict["Longitude"] = lon
    
    return site_dict
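
`get_attributes_from_epasite_url` keeps latitude and longitude as the text that follows `": "` in each label, so a later `float()` call fails on blank or non-numeric values. A minimal sketch of a more tolerant parser follows. `parse_coordinate` is a hypothetical helper, and the sample label strings are assumed rather than copied from the EPA pages.

```python
def parse_coordinate(label_text):
    """Hypothetical helper: return the number after the last ': ', or None."""
    raw = label_text.split(": ")[-1].strip()
    try:
        return float(raw)
    except ValueError:
        # Blank or non-numeric text cannot be used as a coordinate
        return None

print(parse_coordinate("Latitude: 42.9573850"))
print(parse_coordinate("Longitude: -85.6728980"))
print(parse_coordinate("Latitude: "))
```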

# 1. Parse regions HTML to get the list of individual site URLs

## Set target GIS

In [4]:
gis = arcgis.gis.GIS("https://science.maps.arcgis.com", "ANieto_science")

Enter password: ········


## Set data schema

In [5]:
site_columns = ["Site ID", "Site Name", "Address", "Region", "POC", "POC E-mail", "Abstract", "Photo URL", "Latitude", "Longitude"]

## Set reference to the regions HTML file

In [25]:
region_html = r"Regions - EPA OSC Response.html"

In [26]:
site_urls_list, sites_count = get_list_of_epasite_urls(region_html)
site_urls_list

['https://response.epa.gov/site/site_profile.aspx?site_id=11657',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12842',
 'https://response.epa.gov/site/site_profile.aspx?site_id=11006',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12580',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12868',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12875',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12107',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12410',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12521',
 'https://response.epa.gov/site/site_profile.aspx?site_id=9627',
 'https://response.epa.gov/site/site_profile.aspx?site_id=3806',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12337',
 'https://response.epa.gov/site/site_profile.aspx?site_id=11094',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12208',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12635',
 'https://re

In [27]:
sites_count

1285

In [30]:
dicts_list = []
added_count = 0
not_added_count = 0
for ix, site_url in enumerate(site_urls_list):
    print("Processing site {0} of {1}".format(ix + 1, sites_count))
    try:
        attrb_dict = get_attributes_from_epasite_url(site_url)
        dicts_list.append(attrb_dict)
        added_count += 1
    except AttributeError:
        # A page missing any expected element makes find() return None
        print("Could not process.")
        not_added_count += 1

Processing site 1 of 1285
Processing site 2 of 1285
Processing site 3 of 1285
Processing site 4 of 1285
Processing site 5 of 1285
Processing site 6 of 1285
Processing site 7 of 1285
Processing site 8 of 1285
Processing site 9 of 1285
Processing site 10 of 1285
Processing site 11 of 1285
Processing site 12 of 1285
Processing site 13 of 1285
Processing site 14 of 1285
Processing site 15 of 1285
Processing site 16 of 1285
Processing site 17 of 1285
Processing site 18 of 1285
Could not process.
Processing site 19 of 1285
Processing site 20 of 1285
Processing site 21 of 1285
Processing site 22 of 1285
Processing site 23 of 1285
Processing site 24 of 1285
Processing site 25 of 1285
Processing site 26 of 1285
Processing site 27 of 1285
Processing site 28 of 1285
Processing site 29 of 1285
Processing site 30 of 1285
Processing site 31 of 1285
Processing site 32 of 1285
Processing site 33 of 1285
Processing site 34 of 1285
Processing site 35 of 1285
Processing site 36 of 1285
Processing site 37

Processing site 286 of 1285
Processing site 287 of 1285
Processing site 288 of 1285
Processing site 289 of 1285
Processing site 290 of 1285
Processing site 291 of 1285
Processing site 292 of 1285
Processing site 293 of 1285
Processing site 294 of 1285
Processing site 295 of 1285
Processing site 296 of 1285
Processing site 297 of 1285
Processing site 298 of 1285
Processing site 299 of 1285
Processing site 300 of 1285
Processing site 301 of 1285
Processing site 302 of 1285
Processing site 303 of 1285
Processing site 304 of 1285
Processing site 305 of 1285
Processing site 306 of 1285
Processing site 307 of 1285
Processing site 308 of 1285
Processing site 309 of 1285
Processing site 310 of 1285
Processing site 311 of 1285
Processing site 312 of 1285
Processing site 313 of 1285
Processing site 314 of 1285
Processing site 315 of 1285
Processing site 316 of 1285
Processing site 317 of 1285
Processing site 318 of 1285
Processing site 319 of 1285
Processing site 320 of 1285
Processing site 321 

Processing site 572 of 1285
Processing site 573 of 1285
Processing site 574 of 1285
Processing site 575 of 1285
Processing site 576 of 1285
Processing site 577 of 1285
Processing site 578 of 1285
Processing site 579 of 1285
Processing site 580 of 1285
Processing site 581 of 1285
Processing site 582 of 1285
Processing site 583 of 1285
Processing site 584 of 1285
Processing site 585 of 1285
Processing site 586 of 1285
Processing site 587 of 1285
Processing site 588 of 1285
Processing site 589 of 1285
Processing site 590 of 1285
Processing site 591 of 1285
Processing site 592 of 1285
Processing site 593 of 1285
Processing site 594 of 1285
Processing site 595 of 1285
Processing site 596 of 1285
Processing site 597 of 1285
Processing site 598 of 1285
Processing site 599 of 1285
Processing site 600 of 1285
Processing site 601 of 1285
Processing site 602 of 1285
Processing site 603 of 1285
Processing site 604 of 1285
Processing site 605 of 1285
Processing site 606 of 1285
Processing site 607 

Processing site 855 of 1285
Processing site 856 of 1285
Processing site 857 of 1285
Processing site 858 of 1285
Processing site 859 of 1285
Processing site 860 of 1285
Processing site 861 of 1285
Processing site 862 of 1285
Processing site 863 of 1285
Processing site 864 of 1285
Processing site 865 of 1285
Processing site 866 of 1285
Processing site 867 of 1285
Processing site 868 of 1285
Processing site 869 of 1285
Processing site 870 of 1285
Processing site 871 of 1285
Processing site 872 of 1285
Processing site 873 of 1285
Processing site 874 of 1285
Processing site 875 of 1285
Processing site 876 of 1285
Processing site 877 of 1285
Processing site 878 of 1285
Could not process.
Processing site 879 of 1285
Processing site 880 of 1285
Processing site 881 of 1285
Could not process.
Processing site 882 of 1285
Processing site 883 of 1285
Processing site 884 of 1285
Processing site 885 of 1285
Processing site 886 of 1285
Processing site 887 of 1285
Processing site 888 of 1285
Could not 

Could not process.
Processing site 1113 of 1285
Processing site 1114 of 1285
Processing site 1115 of 1285
Could not process.
Processing site 1116 of 1285
Processing site 1117 of 1285
Processing site 1118 of 1285
Processing site 1119 of 1285
Processing site 1120 of 1285
Processing site 1121 of 1285
Processing site 1122 of 1285
Processing site 1123 of 1285
Processing site 1124 of 1285
Processing site 1125 of 1285
Processing site 1126 of 1285
Processing site 1127 of 1285
Could not process.
Processing site 1128 of 1285
Could not process.
Processing site 1129 of 1285
Processing site 1130 of 1285
Could not process.
Processing site 1131 of 1285
Processing site 1132 of 1285
Could not process.
Processing site 1133 of 1285
Processing site 1134 of 1285
Processing site 1135 of 1285
Processing site 1136 of 1285
Processing site 1137 of 1285
Processing site 1138 of 1285
Processing site 1139 of 1285
Processing site 1140 of 1285
Processing site 1141 of 1285
Processing site 1142 of 1285
Could not proces
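
The loop above counts failures but does not record which URLs failed. A minimal sketch of a variant that keeps them for a later retry or inspection follows. `scrape_all`, its `fetch` parameter, and `fake_fetch` are hypothetical; `fake_fetch` stands in for `get_attributes_from_epasite_url` so the example runs offline.

```python
def scrape_all(urls, fetch):
    """Hypothetical helper: call fetch(url) per URL, collecting rows and failures."""
    rows, failed_urls = [], []
    for url in urls:
        try:
            rows.append(fetch(url))
        except AttributeError:
            # Same failure mode as the loop above: an expected element was missing
            failed_urls.append(url)
    return rows, failed_urls

def fake_fetch(url):
    """Stand-in fetcher: fails the way a page with a missing element would."""
    if url.endswith("bad"):
        missing = None
        return missing.text  # raises AttributeError, like a find() that returned None
    return {"Site ID": url.rsplit("=", 1)[-1]}

rows, failed_urls = scrape_all(
    ["x?site_id=1", "x?site_id=bad", "x?site_id=3"], fake_fetch
)
print(len(rows), failed_urls)
```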

In [32]:
print("Summary: ")
print("Processed Sites: {0}".format(added_count))
print("Not Processed Sites: {0}".format(not_added_count))

Summary: 
Processed Sites: 1159
Not Processed Sites: 126
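
The two counts should add up to the number of URLs found earlier (1285). A quick check of that, and of the resulting failure rate, using only the figures printed above:

```python
sites_count = 1285
added_count = 1159
not_added_count = 126

# Every URL is either processed or not, so the counts must sum to the total
assert added_count + not_added_count == sites_count

failure_rate = not_added_count / sites_count
print("Failure rate: {0:.1%}".format(failure_rate))
```

Roughly one site in ten could not be processed, so it may be worth checking which page elements those sites lack.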


In [33]:
# Convert to dataframe
sites_df = pd.DataFrame(dicts_list, columns=site_columns)
sites_df

|  | Site ID | Site Name | Address | Region | POC | POC E-mail | Abstract | Photo URL | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11657 | Trex Property Grand Rapids | 312 Ellsworth Ave SWGrand Rapids, MI 49503 | 5 | Jeff Lippert | lippert.jeffrey@epa.gov | EPA became involved at the site on June 9, 201... | https://response.epa.gov/sites/11657/files/312... | 42.9573850 | -85.6728980 |
| 1 | 12842 | Crest Rubber Ravenna | 645 S. Chestnut StRavenna, OH 44266 | 5 | Eric Pohl | pohl.eric@epa.gov | Bridgestone Americas Tire Operations, is condu... | https://response.epa.gov/images/epaosc/default... | 41.1506872 | -81.2434801 |
| 2 | 11006 | Otsego Township Dam Area | 557 Lincoln Road (M-89)Otsego Township, MI 49078 | 5 | Paul Ruesch | ruesch.paul@epa.gov | This website contains documents, photos and re... | https://response.epa.gov/sites/11006/files/fac... | 42.4601694 | -85.7199333 |
| 3 | 12580 | Fluorescent Recycling | 7260 Neville AveCleveland, OH 44122 | 5 | Eric Pohl | pohl.eric@epa.gov | U.S. Environmental Protection Agency has begun... | https://response.epa.gov/sites/12580/files/IMG... | 41.4645458 | -81.7368574 |
| 4 | 12868 | Continuing Challenge TRIPR | DoubleTree by Hilton2001 Point West WaySacrame... | 5 | Paul Ruesch | ruesch.paul@epa.gov | This site is to help facilitate the planning o... | https://response.epa.gov/sites/12868/files/pan... | 38.6007461 | -121.4329739 |
| 5 | 12875 | Intrastate Distributors ER | 20021 Exeter StHighland Park, MI 48203 | 5 | Brian Kelly | kelly.brian@epa.gov | Intrastate Distributors reported that they suf... | https://response.epa.gov/images/epaosc/default... | 42.4407260 | -83.1062270 |
| 6 | 12107 | Lunkenheimer Foundry Site | 1519 Tremont StreetCincinnati, OH 45268 | 5 | Steven Renninger | renninger.steven@epa.gov | The Lunkenheimer Foundry Site is a vacant foun... | https://response.epa.gov/sites/12107/files/IMG... | 39.1270300 | -84.5454300 |
| 7 | 12410 | Fremont Vapor Intrusion ER | Justice StreetFremont, OH 43420 | 5 | Elizabeth Nightingale / Andrew Kocher | nightingale.elizabeth@epa.gov | On 9/22/17 at approximately 12:40 central time... | https://response.epa.gov/sites/12410/files/01_... | 41.3493130 | -83.1147170 |
| 8 | 12521 | Ensign Products Co. | 3528 East 76th StreetCleveland, OH 44105 | 5 | Jason Cashmere | cashmere.jason@epa.gov | Ensign Products Co. is a former rustproofing a... | https://response.epa.gov/sites/12521/files/Ens... | 41.4624860 | -81.6349740 |
| 9 | 9627 | AA Oil | 2340 S Tibbs AvenueIndianapolis, IN 46241 | 5 | Shelly Lam | Lam.Shelly@epa.gov | The AA Oil Site (Site) is located at 2340 S Ti... | https://response.epa.gov/sites/9627/files/IMG_... | 39.7321720 | -86.2174560 |
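
Passing `columns=site_columns` to `pd.DataFrame` fixes the column order regardless of the key order in each dictionary. If a record lacked one of the listed keys, pandas would fill that cell with `NaN` rather than raise. A minimal sketch with two made-up records and a shortened, illustrative column list:

```python
import pandas as pd

columns = ["Site ID", "Site Name", "Latitude"]  # shortened, illustrative schema
records = [
    {"Latitude": "42.0", "Site Name": "Example A", "Site ID": "1"},
    {"Site ID": "2", "Site Name": "Example B"},  # no Latitude key
]
df = pd.DataFrame(records, columns=columns)

print(list(df.columns))                # column order follows `columns`
print(df["Latitude"].isna().tolist())  # missing key shows up as NaN
```

In this notebook the helper either returns every key or raises, so missing cells should not occur, but the explicit column list still guards the order.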


In [34]:
# Build one WGS 84 (wkid 4326) point per row from the longitude/latitude text
geoms = [
    arcgis.geometry.Point({"x": float(lon), "y": float(lat), "spatialReference": {"wkid": 4326}})
    for lon, lat in zip(sites_df['Longitude'], sites_df['Latitude'])
]
geoms

[{'spatialReference': {'wkid': 4326}, 'x': -85.672898, 'y': 42.957385},
 {'spatialReference': {'wkid': 4326}, 'x': -81.2434801, 'y': 41.1506872},
 {'spatialReference': {'wkid': 4326}, 'x': -85.7199333, 'y': 42.4601694},
 {'spatialReference': {'wkid': 4326}, 'x': -81.7368574, 'y': 41.4645458},
 {'spatialReference': {'wkid': 4326}, 'x': -121.4329739, 'y': 38.6007461},
 {'spatialReference': {'wkid': 4326}, 'x': -83.106227, 'y': 42.440726},
 {'spatialReference': {'wkid': 4326}, 'x': -84.54543, 'y': 39.12703},
 {'spatialReference': {'wkid': 4326}, 'x': -83.114717, 'y': 41.349313},
 {'spatialReference': {'wkid': 4326}, 'x': -81.634974, 'y': 41.462486},
 {'spatialReference': {'wkid': 4326}, 'x': -86.217456, 'y': 39.732172},
 {'spatialReference': {'wkid': 4326}, 'x': -84.4569, 'y': 39.1616},
 {'spatialReference': {'wkid': 4326}, 'x': -87.9781412, 'y': 43.0029058},
 {'spatialReference': {'wkid': 4326}, 'x': -86.204475, 'y': 39.74848},
 {'spatialReference': {'wkid': 4326}, 'x': -84.527698, 'y': 
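
The `float()` conversion above raises `ValueError` on an empty coordinate, and nothing checks that the values fall within valid WGS 84 ranges. A minimal sketch of a guard follows. It builds plain dictionaries shaped like the output above rather than `arcgis.geometry.Point` objects, and `make_point_dict` is a hypothetical helper.

```python
def make_point_dict(lon_text, lat_text, wkid=4326):
    """Hypothetical helper: return a point dict, or None for unusable coordinates."""
    try:
        x, y = float(lon_text), float(lat_text)
    except (TypeError, ValueError):
        return None
    # Longitude must lie in [-180, 180] and latitude in [-90, 90] for WGS 84
    if not (-180.0 <= x <= 180.0 and -90.0 <= y <= 90.0):
        return None
    return {"x": x, "y": y, "spatialReference": {"wkid": wkid}}

print(make_point_dict("-85.6728980", "42.9573850"))
print(make_point_dict("", "42.9573850"))        # blank longitude
print(make_point_dict("42.9573850", "-185.0"))  # latitude out of range
```

Rows for which the helper returns `None` could be dropped or reviewed before the spatial dataframe is built.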

In [35]:
# Convert to spatial dataframe
sites_sdf = arcgis.features.SpatialDataFrame(data=sites_df, geometry=geoms)
sites_sdf

|  | Site ID | Site Name | Address | Region | POC | POC E-mail | Abstract | Photo URL | Latitude | Longitude | SHAPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11657 | Trex Property Grand Rapids | 312 Ellsworth Ave SWGrand Rapids, MI 49503 | 5 | Jeff Lippert | lippert.jeffrey@epa.gov | EPA became involved at the site on June 9, 201... | https://response.epa.gov/sites/11657/files/312... | 42.9573850 | -85.6728980 | {'x': -85.672898, 'y': 42.957385, 'spatialRefe... |
| 1 | 12842 | Crest Rubber Ravenna | 645 S. Chestnut StRavenna, OH 44266 | 5 | Eric Pohl | pohl.eric@epa.gov | Bridgestone Americas Tire Operations, is condu... | https://response.epa.gov/images/epaosc/default... | 41.1506872 | -81.2434801 | {'x': -81.2434801, 'y': 41.1506872, 'spatialRe... |
| 2 | 11006 | Otsego Township Dam Area | 557 Lincoln Road (M-89)Otsego Township, MI 49078 | 5 | Paul Ruesch | ruesch.paul@epa.gov | This website contains documents, photos and re... | https://response.epa.gov/sites/11006/files/fac... | 42.4601694 | -85.7199333 | {'x': -85.7199333, 'y': 42.4601694, 'spatialRe... |
| 3 | 12580 | Fluorescent Recycling | 7260 Neville AveCleveland, OH 44122 | 5 | Eric Pohl | pohl.eric@epa.gov | U.S. Environmental Protection Agency has begun... | https://response.epa.gov/sites/12580/files/IMG... | 41.4645458 | -81.7368574 | {'x': -81.7368574, 'y': 41.4645458, 'spatialRe... |
| 4 | 12868 | Continuing Challenge TRIPR | DoubleTree by Hilton2001 Point West WaySacrame... | 5 | Paul Ruesch | ruesch.paul@epa.gov | This site is to help facilitate the planning o... | https://response.epa.gov/sites/12868/files/pan... | 38.6007461 | -121.4329739 | {'x': -121.4329739, 'y': 38.6007461, 'spatialR... |
| 5 | 12875 | Intrastate Distributors ER | 20021 Exeter StHighland Park, MI 48203 | 5 | Brian Kelly | kelly.brian@epa.gov | Intrastate Distributors reported that they suf... | https://response.epa.gov/images/epaosc/default... | 42.4407260 | -83.1062270 | {'x': -83.106227, 'y': 42.440726, 'spatialRefe... |
| 6 | 12107 | Lunkenheimer Foundry Site | 1519 Tremont StreetCincinnati, OH 45268 | 5 | Steven Renninger | renninger.steven@epa.gov | The Lunkenheimer Foundry Site is a vacant foun... | https://response.epa.gov/sites/12107/files/IMG... | 39.1270300 | -84.5454300 | {'x': -84.54543, 'y': 39.12703, 'spatialRefere... |
| 7 | 12410 | Fremont Vapor Intrusion ER | Justice StreetFremont, OH 43420 | 5 | Elizabeth Nightingale / Andrew Kocher | nightingale.elizabeth@epa.gov | On 9/22/17 at approximately 12:40 central time... | https://response.epa.gov/sites/12410/files/01_... | 41.3493130 | -83.1147170 | {'x': -83.114717, 'y': 41.349313, 'spatialRefe... |
| 8 | 12521 | Ensign Products Co. | 3528 East 76th StreetCleveland, OH 44105 | 5 | Jason Cashmere | cashmere.jason@epa.gov | Ensign Products Co. is a former rustproofing a... | https://response.epa.gov/sites/12521/files/Ens... | 41.4624860 | -81.6349740 | {'x': -81.634974, 'y': 41.462486, 'spatialRefe... |
| 9 | 9627 | AA Oil | 2340 S Tibbs AvenueIndianapolis, IN 46241 | 5 | Shelly Lam | Lam.Shelly@epa.gov | The AA Oil Site (Site) is located at 2340 S Ti... | https://response.epa.gov/sites/9627/files/IMG_... | 39.7321720 | -86.2174560 | {'x': -86.217456, 'y': 39.732172, 'spatialRefe... |


## Publish the Dataframe to the GIS

In [36]:
fc = gis.content.import_data(sites_sdf, title="EPA Region 5 Response Sites", tags="EPA, Region 5, Response Sites")
fc