# Process: Scrape EPA Web Page for Features

Organization: Esri

Author: Alberto Nieto (anieto@esri.com)

Date: 8/12/2018

Process Overview:

1. Parse regions HTML to get the list of individual site URLs

2. Iterate: For each site:

    - Parse the regions HTML for the site attributes (Name, City, State, Contact, Updated, Field Activity, Response Type, Response Authority, Incident Category 
    - Parse the site HTML for needed attributes (Site ID, Latitude, Longitude, Photo)

## Set needed modules

In [104]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import arcgis

## Set helper functions

In [54]:
def get_list_of_epasite_urls(region_html):
    f = open(region_html)
    soup = BeautifulSoup(f, 'html.parser')
    tablerecords = list(soup.find_all('a', class_="NormalColoredFont"))
    site_url_params_list = []
    for record in tablerecords:
        try:
            if record['href'] and record['href'][:47] == "https://response.epa.gov/site/site_profile.aspx":
                site_url_params_list.append(record['href'])
        except KeyError:
            pass

    return site_url_params_list, len(site_url_params_list)

In [112]:
def get_attributes_from_epasite_url(site_url):
    page = requests.get(site_url)
    site_soup = BeautifulSoup(page.content, 'html.parser')
    # Get Site ID
    id_div = site_soup.find("input", {"id": "ctl00_cp1_hid_Site_ID"})
    site_id = id_div.attrs['value']
    # Get Site Region
    region_div = site_soup.find("input", {"id": "ctl00_cp1_hid_Region"})
    region = region_div.attrs['value']
    # Get Site Name
    name_div = site_soup.find("input", {"id": "ctl00_FooterMenu1_hidSite_Name"})
    site_name = name_div.attrs['value']
    # Get Site Latitude
    lat_div = site_soup.find("span", {"id": "ctl00_cp1_lblLatitude"})
    lat = lat_div.text.split(": ")[-1]
    # Get Site Longitude
    lon_div = site_soup.find("span", {"id": "ctl00_cp1_lblLongitude"})
    lon = lon_div.text.split(": ")[-1]
    # Get Site Abstract
    abstract_div = site_soup.find("span", {"id": "ctl00_cp1_lblSite_Abstract"})
    abstract = abstract_div.text
    # Get Site Address
    address_div = site_soup.find("span", {"id": "ctl00_cp1_lblSiteAddress"})
    address = address_div.text
    # Get Site Contact
    sitecontact_div = site_soup.find("span", {"id": "ctl00_cp1_lblOSC_Name"})
    sitecontact = sitecontact_div.text
    # Get Site Contact E-Mail
    sitecontactemail_div = site_soup.find("a", {"id": "ctl00_cp1_lnkOSC_Email"})
    sitecontactemail = sitecontactemail_div.text
    # Get site photo link
    siteimg_path = "https://response.epa.gov" + site_soup.find("img", {"id": "ctl00_cp1_imgSitePhoto"}).attrs['src']
    
    # Set site dictionary
    site_dict = {}
    site_dict["Site ID"] = site_id
    site_dict["site Name"] = site_name
    site_dict["Address"] = address
    site_dict["Region"] = region
    site_dict["POC"] = sitecontact
    site_dict["POC E-mail"] = sitecontactemail
    site_dict["Abstract"] = abstract
    site_dict["Photo URL"] = siteimg_path
    site_dict["Latitude"] = lat
    site_dict["Longitude"] = lon
    
    return site_dict

In [113]:
get_attributes_from_epasite_url(site_url_list[0])

{'Site ID': '11657',
 'site Name': 'Trex Property Grand Rapids',
 'Address': '312 Ellsworth Ave SWGrand Rapids, MI 49503',
 'Region': '5',
 'POC': 'Jeff Lippert',
 'POC E-mail': 'lippert.jeffrey@epa.gov',
 'Abstract': 'EPA became involved at the site on June 9, 2016, after the Michigan Department of Health and Human Services requested assistance to assess a potential trichloroethylene (TCE) vapor intrusion issue.  The site is the location of the former Detrex Company who owned and operated a solvent recycler at the site.  EPA and MDEQ collected indoor air samples and found levels of TCE in the indoor air approximately 200 times above the health screening level. Kent County Health Department evacuated the commercial building 312 Ellsworth Ave SW due to unsafe levels of TCE found in the indoor air.\r EPA ordered the PRP to conduct the cleanup at the site and immediately install a vapor mitigation system to allow re-occupancy of the building. Installation of a vapor mitigation system bega

In [114]:
get_attributes_from_epasite_url(site_url_list[3])

{'Site ID': '12580',
 'site Name': 'Fluorescent Recycling',
 'Address': '7260 Neville AveCleveland, OH 44122',
 'Region': '5',
 'POC': 'Eric Pohl',
 'POC E-mail': 'pohl.eric@epa.gov',
 'Abstract': 'U.S. Environmental Protection Agency has begun cleanup of 2 to 3 million spent fluorescent lamps, 250 drums of PCB-containing lighting ballasts and other electronic equipment stored in a warehouse at Fluorescent Recycling Inc., 7260 Neville Ave., Cleveland, Ohio.\r At the request of the Ohio Environmental Protection Agency, U.S. EPA will address the hazardous contamination accumulated throughout the entire warehouse. On Feb. 13, EPA was notified that a fire at the warehouse was affecting the waste. U.S. EPA provided technical assistance to the Cleveland Fire Department to determine if mercury vapor was present. U.S. EPA has confirmed that contamination on site includes mercury vapors and PCBs.\r The owner has now provided U.S. EPA with access to the warehouse for a Superfund time-critical re

# 1. Parse regions HTML to get the list of individual site URLs

## Set target GIS

In [106]:
gis = arcgis.gis.GIS("https://science.maps.arcgis.com", "ANieto_science")

Enter password: ········


## Set data schema

In [108]:
site_columns = ["Site ID", "Site Name", "Address", "Region", "POC", "POC E-mail", "Abstract", "Photo URL", "Latitude", "Longitude"]

## Set reference to html and base url

In [96]:
# html = r"https://response.epa.gov/site/region_list.aspx?region=5"
region_html = r"Regions - EPA OSC Response.html"

In [97]:
# site_soup

In [102]:
site_urls_list, sites_count = get_list_of_epa_site_urls(region_html)
site_urls_list

['https://response.epa.gov/site/site_profile.aspx?site_id=11657',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12842',
 'https://response.epa.gov/site/site_profile.aspx?site_id=11006',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12580',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12868',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12875',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12107',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12410',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12521',
 'https://response.epa.gov/site/site_profile.aspx?site_id=9627',
 'https://response.epa.gov/site/site_profile.aspx?site_id=3806',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12337',
 'https://response.epa.gov/site/site_profile.aspx?site_id=11094',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12208',
 'https://response.epa.gov/site/site_profile.aspx?site_id=12635',
 'https://re

In [103]:
sites_count

1285

In [139]:
dicts_list = []
for ix, site_url in enumerate(site_urls_list):
    print("Processing site {0} of {1}".format(str(ix+1), sites_count))
    try:
        attrb_dict = get_attributes_from_epasite_url(site_url)
        dicts_list.append(attrb_dict)
        print("Site processed.")
        if ix == 9:
            break
    except AttributeError:
        print("Could not process.")

Processing site 1 of 1285
Site processed.
Processing site 2 of 1285
Site processed.
Processing site 3 of 1285
Site processed.
Processing site 4 of 1285
Site processed.
Processing site 5 of 1285
Site processed.
Processing site 6 of 1285
Site processed.
Processing site 7 of 1285
Site processed.
Processing site 8 of 1285
Site processed.
Processing site 9 of 1285
Site processed.
Processing site 10 of 1285
Site processed.


In [146]:
# Convert to spatial dataframe
sites_spdf = arcgis.features.SpatialDataFrame(dicts_list, columns=site_columns)
sites_spdf

Unnamed: 0,Site ID,Site Name,Address,Region,POC,POC E-mail,Abstract,Photo URL,Latitude,Longitude
0,11657,,"312 Ellsworth Ave SWGrand Rapids, MI 49503",5,Jeff Lippert,lippert.jeffrey@epa.gov,"EPA became involved at the site on June 9, 201...",https://response.epa.gov/sites/11657/files/312...,42.957385,-85.672898
1,12842,,"645 S. Chestnut StRavenna, OH 44266",5,Eric Pohl,pohl.eric@epa.gov,"Bridgestone Americas Tire Operations, is condu...",https://response.epa.gov/images/epaosc/default...,41.1506872,-81.2434801
2,11006,,"557 Lincoln Road (M-89)Otsego Township, MI 49078",5,Paul Ruesch,ruesch.paul@epa.gov,"This website contains documents, photos and re...",https://response.epa.gov/sites/11006/files/fac...,42.4601694,-85.7199333
3,12580,,"7260 Neville AveCleveland, OH 44122",5,Eric Pohl,pohl.eric@epa.gov,U.S. Environmental Protection Agency has begun...,https://response.epa.gov/sites/12580/files/IMG...,41.4645458,-81.7368574
4,12868,,DoubleTree by Hilton2001 Point West WaySacrame...,5,Paul Ruesch,ruesch.paul@epa.gov,This site is to help facilitate the planning o...,https://response.epa.gov/sites/12868/files/pan...,38.6007461,-121.4329739
5,12875,,"20021 Exeter StHighland Park, MI 48203",5,Brian Kelly,kelly.brian@epa.gov,Intrastate Distributors reported that they suf...,https://response.epa.gov/images/epaosc/default...,42.440726,-83.106227
6,12107,,"1519 Tremont StreetCincinnati, OH 45268",5,Steven Renninger,renninger.steven@epa.gov,The Lunkenheimer Foundry Site is a vacant foun...,https://response.epa.gov/sites/12107/files/IMG...,39.12703,-84.54543
7,12410,,"Justice StreetFremont, OH 43420",5,Elizabeth Nightingale / Andrew Kocher,nightingale.elizabeth@epa.gov,On 9/22/17 at approximately 12:40 central time...,https://response.epa.gov/sites/12410/files/01_...,41.349313,-83.114717
8,12521,,"3528 East 76th StreetCleveland, OH 44105",5,Jason Cashmere,cashmere.jason@epa.gov,Ensign Products Co. is a former rustproofing a...,https://response.epa.gov/sites/12521/files/Ens...,41.462486,-81.634974
9,9627,,"2340 S Tibbs AvenueIndianapolis, IN 46241",5,Shelly Lam,Lam.Shelly@epa.gov,The AA Oil Site (Site) is located at 2340 S Ti...,https://response.epa.gov/sites/9627/files/IMG_...,39.732172,-86.217456


In [151]:
sites_spdf['SHAPE'] = lambda x sites_spdf['Latitude'] 

KeyError: 'SHAPE'

In [147]:
sites_spdf.loc[0]

Site ID                                                   11657
Site Name                                                   NaN
Address              312 Ellsworth Ave SWGrand Rapids, MI 49503
Region                                                        5
POC                                                Jeff Lippert
POC E-mail                              lippert.jeffrey@epa.gov
Abstract      EPA became involved at the site on June 9, 201...
Photo URL     https://response.epa.gov/sites/11657/files/312...
Latitude                                             42.9573850
Longitude                                           -85.6728980
Name: 0, dtype: object

## Publish the Dataframe to the GIS

In [148]:
fc = gis.content.import_data(sites_spdf, address_fields="Address")

AttributeError: Geometry Column Not Present: SHAPE