# Download Oil and Gas Well Data to CSV Files

This notebook downloads the oil and gas well databases from all ~34 oil and gas producing states and saves each state's records to a CSV. A separate notebook processes the downloaded CSVs and loads them into a database.

Most states make this data available via an ArcGIS REST API. For these states, we just use the ArcGIS Python API to scrape and grab all the records. 
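
ArcGIS layers cap how many records a single query may return (the layer's `maxRecordCount`), so the download code in this notebook first asks for every object ID and then requests the records in batches. A minimal sketch of just the batching step, with no network access; `batch_object_ids` is a hypothetical helper name, not part of the ArcGIS Python API:

```python
def batch_object_ids(object_ids, batch_size):
    """Split object IDs into comma-joined strings of at most batch_size IDs each.

    Each returned string is in the form an ArcGIS REST query expects for
    its objectIds parameter (for example "1,2,3").
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        ",".join(str(oid) for oid in object_ids[i:i + batch_size])
        for i in range(0, len(object_ids), batch_size)
    ]


# Seven IDs with a (made-up) server limit of three records per query
batches = batch_object_ids(list(range(1, 8)), batch_size=3)
print(batches)  # ['1,2,3', '4,5,6', '7']
```

Each string would then be passed as the `object_ids` argument of one query, as `download_layer` does further down.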

## ArcGIS States
* Alabama
* Arizona
* California
* Florida
* Illinois
* Kansas
* Louisiana
* Michigan
* Nevada
* New Mexico
* New York
* Pennsylvania
* Texas
* Virginia
* Washington
* West Virginia
* Wyoming - URL Changes Periodically

## Manual Download and Processing States
For some states, it's easiest right now just to download the data and manually convert it into a CSV.
* Idaho
* Maryland: Source comes from Maryland directly (not available online) or FrackTracker, then transformed (see code below)
* Missouri: Download from https://dnr.mo.gov/geology/geosrv/ogc/ogc-permits/
* Oregon: Download XLSX from http://www.oregongeology.org/mlrr/oilgas-report.htm
* South Dakota: Download shapefile from http://denr.sd.gov/des/og/ogmaps.aspx
* Tennessee: Download from (http://environment-online.state.tn.us:8080/pls/enf_reports/f?p=9034:34300:0::NO:::)
* Utah: Download shapefile from https://gis.utah.gov/data/energy/oil-gas/
* Montana (RDBMS):
  1. Open http://www.bogc.dnrc.mt.gov/WebApps/DataMiner/Wells/WellSurfaceLongLat.aspx
  2. Filter by County - Begins With - %
  3. Click the "Excel" button
  4. Open in Excel and save as CSV. Voilà, 44k records!

Additionally:
* North Dakota: straightforward shapefile download and conversion, which I've scripted below
* Colorado: straightforward shapefile download and conversion, which I've scripted below

## Complicated Scraping
Finally, some states just make it plain difficult to get their data. 
* Indiana - arcgis to get APIs plus manual scraping to get well details
* Oklahoma (Osage Tribe Reservation): http://oag.osagetribe.org/osageonline/ (possibly included in NOAG data)
* Ohio -- No dates from ArcGIS. Get well list from ArcGIS and scrape dates
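
For Indiana and Ohio, the per-well scraping further down is spread across worker threads that push their results onto a shared queue. A minimal sketch of that pattern, assuming a caller-supplied `get_detail` function (hypothetical here; the real cells fetch and parse a web page per well):

```python
import queue
from threading import Thread


def scrape_in_threads(records, get_detail, workers=20):
    """Run get_detail(record) for every record, using up to `workers` threads.

    Results come back in no particular order. An exception inside a worker
    ends that worker only; it is not re-raised in the caller.
    """
    if not records:
        return []
    # Ceiling division: every record lands in a chunk and at most
    # `workers` chunks are created.
    chunk = -(-len(records) // workers)
    results = queue.Queue()

    def work(rows):
        for row in rows:
            results.put(get_detail(row))

    threads = [
        Thread(target=work, args=(records[i:i + chunk],))
        for i in range(0, len(records), chunk)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    details = []
    while not results.empty():
        details.append(results.get())
    return details


# Stand-in for a real scrape: square each ID instead of fetching a page
sample = [{"id": i} for i in range(7)]
scraped = scrape_in_threads(sample, lambda r: {"id": r["id"], "sq": r["id"] ** 2}, workers=3)
print(len(scraped), "records scraped")
```

Ceiling division keeps the chunk size at one or more, so the helper also works when there are fewer records than workers.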

## National Oil and Gas Gateway
The National Oil and Gas Gateway (http://www.noggateway.org/reports) has:
* Alabama - 18,881
* **Arkansas** - 52,490
* Colorado - 115,976
* **Kentucky** - 145,745
* **Mississippi**  - 34,586
* **Nebraska** - 22,253
* New York - 41,787
* **Oklahoma** - 533,003
* Utah - 32,415
* West Virginia - 114,874

    
## States With and Without Oil and Gas Production

Official list of states with currently producing wells: [https://www.eia.gov/petroleum/wells/](https://www.eia.gov/petroleum/wells/)

_States without Oil or Gas Wells_
* Connecticut
* Delaware
* District of Columbia
* Georgia
* Iowa
* Maine
* Massachusetts
* Minnesota
* New Hampshire
* New Jersey
* North Carolina
* Rhode Island
* South Carolina
* Vermont
* Wisconsin
* Hawaii

## To add someday:
* Offshore oil and gas wells
* Any additional Indian reservation permitting agencies

For our scraping, we use *Selenium WebDriver*. To make this portable across platforms, we'll run a headless WebDriver server within a handy little Docker container:

```bash
docker run -d -p 4444:4444 --shm-size 2g -v "$PWD/downloads":/var/tmp selenium/standalone-firefox:3.9.1-actinium
```
For development purposes, it's easiest to run WebDriver on a local browser, then point the code to the Docker container.
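
Because the container writes its downloads into the mounted `downloads` directory, the notebook can tell when a download has finished by polling that directory. Firefox keeps a `.part` file next to each in-progress download, so a sketch of such a wait could look like this; `wait_for_downloads` is a hypothetical helper and the timeout values are assumptions:

```python
import os
import time


def wait_for_downloads(directory, expected_files, timeout=600, poll_seconds=1.0):
    """Block until `directory` holds at least `expected_files` finished files.

    A download counts as finished once no ".part" file remains. Returns the
    sorted names of the finished files, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        partial = [n for n in names if n.endswith(".part")]
        finished = sorted(n for n in names if not n.endswith(".part"))
        if not partial and len(finished) >= expected_files:
            return finished
        if time.monotonic() >= deadline:
            raise TimeoutError("downloads did not finish within %s seconds" % timeout)
        time.sleep(poll_seconds)
```

In the Arkansas cell below, for example, this could be called as `wait_for_downloads('downloads/', opt_idx)` after each click on the export button.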

In [None]:
# module to capture source data and write to a CSV
import os, errno, array, csv, json, math, random, re, time
import urllib.request

import pandas as pd
import numpy as np

from datetime import datetime
import zipfile
import psycopg2

state_abbrev = { 'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 
'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 
'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 
'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT', 
'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY', 
'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 
'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 
'Vermont': 'VT', 'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY', }

def test_directory(filename):
    path = os.path.dirname(filename)
    try:
        os.makedirs(path)
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise

def write_to_csv(state, records, overwrite=False):
    """Write a list of record dicts to csvs/<state>-data.csv."""
    #filename = 'csvs/' + state.name.lower() + '-' + datetime.strftime(datetime.today(), "%Y-%m-%d") + '.csv'
    filename = 'csvs/' + state.lower() + '-' + 'data' + '.csv'
    test_directory(filename)  # make sure the csvs/ directory exists

    if os.path.isfile(filename) and not overwrite:
        raise IOError('File already exists. Specify overwrite = True in function parameters to overwrite.')
    if len(records) == 0:
        raise IndexError('State object has no data!')

    # in Python 3, csv needs a text-mode file opened with newline=''
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        for row in records:
            writer.writerow(row)
        print ('Wrote', str(len(records)), 'rows to', filename)
    return True

    
def strip_and_encode_dict(d):
    # Python 3 strings are already Unicode, so only whitespace stripping is needed
    new_dict = dict()
    for key, value in d.items():
        if isinstance(value, dict):
            value = strip_and_encode_dict(value)
        elif isinstance(value, list):
            value = strip_and_encode_list(value)
        else:
            key = key.strip()
            value = value.strip() if isinstance(value, str) else value
        new_dict[key] = value
    return new_dict

def strip_and_encode_list(l):
    new_list = list()
    for item in l:
        if isinstance(item, dict):
            new_item = strip_and_encode_dict(item)
        elif isinstance(item, list):
            new_item = strip_and_encode_list(item)
        else:
            new_item = item.strip() if isinstance(item, str) else item
        new_list.append(new_item)
    return new_list

def st_time(func):
    """
        st decorator to calculate the total time of a func
    """

    def st_func(*args, **keyArgs):
        t1 = time.time()
        r = func(*args, **keyArgs)
        t2 = time.time()
        print ("Function=%s, Time=%s" % (func.__name__, t2 - t1))
        return r

    return st_func

In [None]:
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.features import FeatureSet

def download_layer(layer_url, geom = False):
    feature_layer = FeatureLayer(layer_url)

    batch_size = feature_layer.properties.maxRecordCount
    feature_ids = feature_layer.query(where="1=1", return_ids_only=True)['objectIds']

    records = list()
    for i in range(0, len(feature_ids), batch_size):
        query_ids = [str(j) for j in feature_ids[i:i+batch_size]]
        if geom:
            result = feature_layer.query(where='1=1', object_ids=','.join(query_ids),
                                         returnGeometry= True, outFields = '*', outSR='4326')
            for record in result.features:
                records.append({**record.attributes, **record.geometry})    
        else:
            result = feature_layer.query(where='1=1', object_ids=','.join(query_ids))        
            for record in result.features:
                records.append(record.attributes)
            
    return records

In [123]:
# grab and scrape ArcGIS layers from states that offer ArcGIS Rest API Access
gis_states = {
    'Alabama': 'https://map.ogb.state.al.us/arcgis/rest/services/OGB/map/MapServer/15/',
    'Arizona': 'http://services.azgs.az.gov/arcgis/rest/services/aasggeothermal/AZWellHeaders/MapServer/0',
    'California': 'http://spatialservices.conservation.ca.gov/arcgis/rest/services/DOMS/Wells/MapServer/0',
    'Florida': 'https://ca.dep.state.fl.us/arcgis/rest/services/OpenData/OIL_WELLS/MapServer/0/',
    'Illinois': 'http://maps.isgs.illinois.edu/arcgis/rest/services/ILOIL/Wells/MapServer/2',
    'Kansas': 'http://services.kgs.ku.edu/arcgis8/rest/services/wwc5/wwc5_general/MapServer/6',
    'Louisiana': 'http://sonris-www.dnr.state.la.us/arcgis/rest/services/MapSvc/OC/MapServer/0',
    'Michigan': {
        'Oil Wells': 'http://gisp.mcgi.state.mi.us/arcgis/rest/services/DEQ/OilandGas/MapServer/7',
        'Natural Gas Wells': 'http://gisp.mcgi.state.mi.us/arcgis/rest/services/DEQ/OilandGas/MapServer/8',
        'Gas Condensate Wells': 'http://gisp.mcgi.state.mi.us/arcgis/rest/services/DEQ/OilandGas/MapServer/9',
        # note: other layers available
    },
    'Nevada': {
        'Wells through 2006': 'https://gisweb.unr.edu/nbmg/rest/services/MineralsAndEnergy/OilAndGas/MapServer/0',
        'Wells through 2013': 'https://gisweb.unr.edu/nbmg/rest/services/MineralsAndEnergy/OilAndGas/MapServer/1'
    },
    'New York': 'http://www.dec.ny.gov/arcgis/rest/services/mines_and_wells/MapServer/1/',
    'Pennsylvania': 'http://www.depgis.state.pa.us/arcgis/rest/services/OilGas/Utica_Wells/MapServer/0',
    'Texas': 'http://wwwgisp.rrc.texas.gov/arcgis/rest/services/rrc_public/RRC_Public_Viewer_Srvs/MapServer/1/',
    'Virginia': {
        'Active Wells': 'https://dmme.virginia.gov/gis/rest/services/DGO/DGO_wells/MapServer/0',
        'Plugged Wells': 'https://dmme.virginia.gov/gis/rest/services/DGO/DGO_wells/MapServer/5'
    },
    'Washington': 'https://gis.dnr.wa.gov/site1/rest/services/Public_Geology/WADNR_PUBLIC_WGS_ERPL/MapServer/1/',
    'West Virginia': 'https://tagis.dep.wv.gov/arcgis/rest/services/app_services/oog2/MapServer/7', # note- multiple layers avail
    'Wyoming': 'http://ims.wsgs.wyo.gov/arcgis/rest/services/OilGas/OilGas_Map_WOGCCDownloads/MapServer/0/'
}
# Kentucky doesn't work right now:
#    'Kentucky': 'http://kgs.uky.edu/arcgis/rest/services/KYOilGas/KYOilGasWells_SZ/MapServer/4',

#     'Wyoming': {
#         'Orphan': 'http://wogccms.state.wy.us/arcgis/rest/services/WOGCC/UnitMap/MapServer/0',
#         'Conventional': 'http://wogccms.state.wy.us/arcgis/rest/services/WOGCC/UnitMap/MapServer/2',
#         'CoalBed': 'http://wogccms.state.wy.us/arcgis/rest/services/WOGCC/UnitMap/MapServer/3',
#         'Horizontal': 'http://wogccms.state.wy.us/arcgis/rest/services/WOGCC/UnitMap/MapServer/9'
#     },


In [None]:
gis_states = {
    'Wyoming': 'http://ims.wsgs.wyo.gov/arcgis/rest/services/OilGas/OilGas_Map_WOGCCDownloads/MapServer/0/'
}

In [None]:
for state, gis in gis_states.items():
    print("Downloading %s." % state,)
    file_name = state_abbrev[state].lower() + "-data.csv"
    # a single layer URL (str) or several named layers (dict)?
    if isinstance(gis, str):
        wells = download_layer(gis)
    elif isinstance(gis, dict):
        wells = list()
        for layer_url in gis.values():
            wells.extend(download_layer(layer_url))
    df = pd.DataFrame(wells)
    df.to_csv('csvs/' + file_name)
    print("Done. %s wells recorded." % len(wells))

In [None]:
# A SEPARATE FN FOR
# THE STATES WHERE WE NEED TO EXPLICITLY REQUEST
# THE GEOMETRY
# geom_gis_states = {
#     'New Mexico': 'https://gis.emnrd.state.nm.us/public/rest/services/OCDPUB/NM_Well_Locations/MapServer/0/',

# }

# for state, gis in geom_gis_states.items():
#     print("Downloading %s." % state,)
#     file_name = state_abbrev[state].lower() + "-data.csv"
#     # string or dict?
#     if type(gis) is str:
#         wells = download_layer(gis, geom = True)
    
#     if type(gis) is dict:
#         wells = list()
#         for layer_url in gis.values():
#             wells.append(download_layer(layer_url, geom = True))
#         wells = [well for layer in wells for well in layer]
#     df = pd.DataFrame(wells)
#     df.to_csv('csvs/' + file_name)
#     print("Done. %s wells recorded." % len(wells))

# it's complicated
Indiana - arcgis plus manual scraping

Mississippi: RBDMS

Montana: OpenGIS http://bogc.dnrc.mt.gov/WebApps/DataMiner/MontanaMap.aspx

Nebraska: OpenGIS

Oklahoma: RBDMS https://apps.occeweb.com/RBDMSWeb_OK/OCCOGOnline.aspx

# Alaska

AK used to have an ArcGIS server that was easy to scrape. Now ... not so much.

Steps:
1. Go to [http://aogweb.state.ak.us/DataMiner3/Forms/WellList.aspx](http://aogweb.state.ak.us/DataMiner3/Forms/WellList.aspx), click on wells, then "Export All" as CSV

In [None]:
# Arkansas
# probably don't actually do this. Use the NOAG data.
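state = 'AR'

import os, time
from selenium import webdriver
from selenium.webdriver import FirefoxProfile
from selenium.webdriver.support.ui import Select

profile = FirefoxProfile()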
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/var/tmp/')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv,application/vnd.ms-excel')
profile.set_preference('browser.helperApps.alwaysAsk.force', 'false')

driver = webdriver.Remote(command_executor='http://127.0.0.1:4444/wd/hub',
                          desired_capabilities=webdriver.DesiredCapabilities.FIREFOX,
                          browser_profile=profile)

driver.implicitly_wait(600) # 10 minutes

driver.get('http://www.aogc2.state.ar.us/welldata/Wells/Default.aspx')
assert "Production & Well Data" in driver.title
criteria_select = Select(driver.find_element_by_id("cpMainContent_ddlCriteria"))
criteria_select.select_by_visible_text('Well Type')
time.sleep(3)
well_type_select = Select(driver.find_element_by_id('cpMainContent_ddlListItem'))
options = [str(opt.get_attribute("value")) for opt in well_type_select.options]

for opt_idx in range(1, len(options)):
    driver.get('http://www.aogc2.state.ar.us/welldata/Wells/Default.aspx')
    assert "Production & Well Data" in driver.title   
    criteria_select = Select(driver.find_element_by_id("cpMainContent_ddlCriteria"))
    criteria_select.select_by_visible_text('Well Type')
    time.sleep(3)
    well_type_select = Select(driver.find_element_by_id('cpMainContent_ddlListItem'))
    options = [str(opt.get_attribute("value")) for opt in well_type_select.options]
    well_type_select.select_by_value(options[opt_idx])
    driver.find_element_by_id('cpMainContent_btnSubmit').click()
    time.sleep(3)
    driver.find_element_by_id('cpMainContent_btnExcel').click()
    while (len(os.listdir('downloads/')) < opt_idx
        and len([f for f in os.listdir('downloads/') if f[-4:]=='part']) > 0):
        time.sleep(15) # if current download in process, wait...
driver.quit()

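# count the finished downloads (ignore Firefox's in-progress .part files)
num_downloaded = len([f for f in os.listdir('downloads/') if not f.endswith('.part')])
if num_downloaded == len(options) - 1:
    print ("successfully downloaded", num_downloaded, "files")
else:
    print ('something went wrong!')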

# next, modify to clear all crap from the downloads directory
# then, grab all of the .xls files (ack!) and merge into one csv
# for now, manually combine into one CSV (complicating matters: what gets downloaded is actually an html table)

In [None]:
# Colorado
state = 'CO'
# download shapefile from: http://cogcc.state.co.us/data2.html#/downloads
# Well Spots (APIs)(10 Mb) - metadata
# Active and plugged wells - including active and expired well permits
src = 'http://cogcc.state.co.us/documents/data/downloads/gis/WELLS_SHP.ZIP'
res = 'downloads/co/shapefile.zip'
test_directory(res)  # make sure downloads/co/ exists
urllib.request.urlretrieve(src, res)

with zipfile.ZipFile(res) as archive:
    archive.extractall('downloads/co')

# Convert shapefile to CSV
command = "ogr2ogr -f CSV -mapFieldType Date=String csvs/co-data.csv downloads/co/Wells.shp"
!$command

In [None]:
# Idaho - automated = no!
state = 'ID'
import requests
from lxml import html

# historic pre-1988 data d/l from historic shapefile
# there are only 28 or so active wells. Can be manually retrieved from
# http://welldata.ogcc.idaho.gov/DataMining.html?EntityType=Well&EntityKeyName=PKey&EntityKeyValue=1367116&DETAILSONLY=True
# as of 2017, Idaho has at least one producing well. Some 200 have been drilled in the state's history

# for modern wells, use http://welldata.ogcc.idaho.gov/ to d/l list of modern wells
# and grab IDs from source code
# then scrape the lat/lon info and join to the manually d/l list of wells using below

well_ids = [1367116,1367111,1367112,1367103,1367072,1367070,1367078,1367102,1367104,1367073,1367071,1367100,1367106,1367075,1367076,1367074,1367077,1367110,1367098,1367099,1367101,1367097,1367107,1367109,1367108,1367113,1367114,1367115]

wells = []
for well_id in well_ids:  # avoid shadowing the built-in id()
    url = 'http://welldata.ogcc.idaho.gov/ED.aspx?KeyName=PKey&KeyValue=%s&KeyType=Integer&DetailXML=WellDetails.xml' % well_id
    page = requests.get(url)
    tree = html.fromstring(page.content)
    api = tree.xpath('//*[@id="ED"]/table/tr[1]/td[3]/text()')
    lat = tree.xpath('//*[@id="EDI0"]/table/tr[2]/td[6]/text()')
    lon = tree.xpath('//*[@id="EDI0"]/table/tr[2]/td[8]/text()')
    well = {'API': api[0], 'lat': lat[0], 'lon': lon[0]}
    wells.append(well)
    
df = pd.read_csv('downloads/reportdata.csv')
df1 = df.merge(pd.DataFrame(wells), on='API')
df1.to_csv('csvs/id-data-current.csv')

# then, manually and painfully cobble together the historic data with the current

## Indiana
Indiana is a bit complicated and unique. The records are downloaded in two steps. 
1. First, two ArcGIS layers are scraped to get the IGS unique identifiers. 
2. Then, the details for each well are looked up individually, for example [well data](https://igws.indiana.edu/pdms/wellEvents.cfm?igsID=100020)
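
Looking up tens of thousands of wells one page at a time will hit the occasional dropped connection, which is why the scraping cell below retries each request with exponentially growing waits. A minimal sketch of that retry loop with the HTTP call factored out; `fetch_with_backoff` and its parameters are hypothetical names chosen for illustration:

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=12, base_delay=1.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(url), retrying on the given exceptions.

    The wait doubles after every failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With `requests`, this would be called as `fetch_with_backoff(requests.get, url, retry_on=(requests.exceptions.ConnectionError,))`, since that exception is not a subclass of the built-in `ConnectionError`.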

In [None]:
# Indiana - New


import queue
from threading import Thread
import requests
from lxml import html

# I really should just modify the original function to accept kwargs
def download_in_layer(layer_url):
    feature_layer = FeatureLayer(layer_url)

    batch_size = feature_layer.properties.maxRecordCount
    feature_ids = feature_layer.query(where="1=1", return_ids_only=True)['objectIds']

    records = list()
    for i in range(0, len(feature_ids), batch_size):
        query_ids = [str(j) for j in feature_ids[i:i+batch_size]]
        result = feature_layer.query(where='1=1', object_ids=','.join(query_ids),
                                     returnGeometry= True, outFields = '*', outSR='4326' )
        for record in result.features:
            records.append({**record.attributes, **record.geometry})    
    return records

layers = {
    'Oil': 'http://gis.indiana.edu/arcgis/rest/services/PDMS/Basic_PDMS/MapServer/1',
    'Gas': 'http://gis.indiana.edu/arcgis/rest/services/PDMS/Basic_PDMS/MapServer/2'
}

wells = list()
for layer_url in layers.values():
    wells.append(download_in_layer(layer_url))

records = [well for layer in wells for well in layer]

In [None]:
# Indiana - New

import queue
import time
from threading import Thread
import requests
from lxml import html

# get records...


# Scrape PDMS to get dates
def get_details(q, rows):
    for row in rows:
        igs_id = row['IGS_ID']
        url = 'http://igs.indiana.edu/pdms/wellEvents.cfm?igsID=%s' % str(igs_id)
        for attempt in range(12):
            try:
                page = requests.get(url)
            except requests.exceptions.ConnectionError:
                # print the wait (in seconds) on one line, then back off
                print(2**attempt, end=' ')
                time.sleep(2**attempt)
            else:
                break
        else:
            print ('Failed after 12 retries at', url)
            raise requests.exceptions.ConnectionError                
        tree = html.fromstring(page.content)
        permits = tree.xpath('//div[@pdmshelp="permit_number"]/table/tr/td/text()')
        statuses = tree.xpath('//div[@pdmshelp="Status"]/table/tr/td/text()')
        dates = tree.xpath('//div[@pdmshelp="completion_date"]/table/tr/td/text()')

        if len(permits) > 0:
            permit = permits[0].strip()
            date = dates[0].strip()
            first_status = statuses[0].strip()
            last_status = statuses[len(permits) - 1].strip()
        #    operator = operators[idx].strip() if operators[idx] else None
            row = {
                "IGS_ID": igs_id,
                "PermitNo": permit,
                "SpudDate": date,
                "Type": first_status,
                "Status": last_status if first_status != last_status else "Active",
            }
            q.put(row)

threads = []
q = queue.Queue()
workers = 20
increment = max(1, -(-len(records) // workers))  # ceiling division; never 0
for idx in range(0, len(records), increment):
    t = Thread(target=get_details, args=(q, records[idx:idx+increment]))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

details = list()
while not q.empty():
    details.append(q.get())

print ('finished scraping', str(len(details)), 'records')


In [None]:
# merge and write Indiana records to csv
wells_df = pd.DataFrame(records)
details_df = pd.DataFrame(details)

df = wells_df.merge(details_df, how = "left", on = "IGS_ID")
df.to_csv('csvs/in-data.csv')

## Maryland
MD has ~40 gas wells. This data isn't available online; you have to call Maryland to get it (or get it from FrackTracker). It then requires a bit of transformation from Maryland's projection system to WGS84. Finally, look up the dates and join them manually to the CSV.

__Format Transformation__

```bash
gdalsrsinfo -o proj4 MD_active_gas_wells.prj


ogr2ogr -t_srs EPSG:4326 -s_srs '+proj=lcc +lat_1=38.3 +lat_2=39.45 +lat_0=37.66666666666666 +lon_0=-77 +x_0=400000 +y_0=0 +datum=NAD83 +units=m +no_defs' -f "CSV" MD_active_gas_wells.csv MD_active_gas_wells.shp -lco GEOMETRY=AS_XY 


ogr2ogr -t_srs EPSG:4326 -s_srs '+proj=lcc +lat_1=38.3 +lat_2=39.45 +lat_0=37.66666666666666 +lon_0=-77 +x_0=400000 +y_0=0 +datum=NAD83 +units=m +no_defs' -f "CSV" MD_historical_gas_wells.csv Md_historical_gas_wells.shp -lco GEOMETRY=AS_XY 

```

## New Mexico

Download shapefiles from ftp://164.64.106.6/Public/OCD/OCD%20GIS%20Data/Shape%20Files/ and convert to CSV. I did this manually, but it could easily be scripted.
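
One way the conversion might be scripted is to reuse the `ogr2ogr` call from the Colorado and North Dakota cells. This is only a sketch: the helper names are hypothetical, the New Mexico paths in the example are placeholders rather than the real file names, and it assumes `ogr2ogr` is on the `PATH`.

```python
import subprocess


def ogr2ogr_csv_command(shapefile, csv_path):
    """Build the ogr2ogr argument list that converts a shapefile to CSV.

    Date fields are mapped to strings, as in the other shapefile cells.
    """
    return ["ogr2ogr", "-f", "CSV", "-mapFieldType", "Date=String",
            csv_path, shapefile]


def shapefile_to_csv(shapefile, csv_path):
    """Run the conversion; raises CalledProcessError if ogr2ogr fails."""
    subprocess.run(ogr2ogr_csv_command(shapefile, csv_path), check=True)


# Placeholder paths; nothing is executed here, the command is only printed
command = ogr2ogr_csv_command("downloads/nm/wells.shp", "csvs/nm-data.csv")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.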

In [None]:
# North Dakota

src = 'https://www.dmr.nd.gov/output/ShapeFiles/Wells.zip'
res = 'downloads/nd/shapefile.zip'
test_directory(res)
urllib.request.urlretrieve(src, res)

with zipfile.ZipFile(res) as archive:
    archive.extractall('downloads/nd')

# Convert shapefile to CSV
command = "ogr2ogr -f CSV -mapFieldType Date=String csvs/nd-data.csv downloads/nd/Wells.shp"
!$command

## Ohio

In [None]:
import queue
import time
from threading import Thread
import requests
from lxml import html

# get records...
# download ohio's 232k records, then scrape dates, merge, and save.
wells = download_layer('https://gis2.ohiodnr.gov/arcgis/rest/services/DOG_Services/Oilgas_Wells_10_JS_TEST/MapServer/0')

# Scrape to get dates
def get_details(q, rows):
    for row in rows:
        api = row['API_WELLNO_LINK']
        url = 'https://gis.ohiodnr.gov/MapViewer/WellSummaryCard.asp?api={}'.format(api)
        for attempt in range(12):
            try:
                page = requests.get(url)
            except requests.exceptions.ConnectionError:
                # print the wait (in seconds) on one line, then back off
                print(2**attempt, end=' ')
                time.sleep(2**attempt)
            else:
                break
        else:
            print ('Failed after 12 retries at', url)
            raise requests.exceptions.ConnectionError                
        tree = html.fromstring(page.content)

        issued = tree.xpath('/html/body/div/div[3]/table/tr[1]/td[4]/text()')
        commenced = tree.xpath('/html/body/div/div[3]/table/tr[2]/td[8]/text()')
        completed = tree.xpath('/html/body/div/div[3]/table/tr[3]/td[6]/text()')
        issued = issued[0].strip() if len(issued) >= 1 else None
        commenced = commenced[0].strip() if len(commenced) >= 1 else None        
        completed = completed[0].strip() if len(completed) >= 1 else None
        res = {
            "api": api,
            "issued": issued,
            "commenced": commenced,
            "completed": completed
        }
        q.put(res)
        
records = wells
threads = []
q = queue.Queue()
workers = 20
increment = max(1, -(-len(records) // workers))  # ceiling division; never 0
for idx in range(0, len(records), increment):
    t = Thread(target=get_details, args=(q, records[idx:idx+increment]))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

details = list()
while not q.empty():
    details.append(q.get())

print ('finished scraping', str(len(details)), 'records')

# merge and write Ohio records to csv
wells_df = pd.DataFrame(wells)
details_df = pd.DataFrame(details)

df = wells_df.merge(details_df, how = "left", left_on = "API_WELLNO_LINK", right_on = "api")
df.to_csv('csvs/oh-data.csv')

In [None]:
# Tennessee
state = 'TN'



import requests
# for whatever reason, urllib2 gets stuck in an endless redirect. Since it's a CSV file, we just
# use requests instead

src = 'http://environment-online.state.tn.us:8080/pls/enf_reports/f?p=9034:34300:22741039565748:CSV::::'
r = requests.get(src)
with open('csvs/tn-data.csv', 'wb') as f:
    f.write(r.content)
print('csvs/tn-data.csv', 'downloaded')


## Texas

Download well list using ArcGIS server, then scrape to download the permit details.
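
The permit query refuses date ranges that return too many records, so the permit download below walks through the calendar in fixed-size windows. A minimal sketch of generating those windows; `date_windows` is a hypothetical helper, and unlike the cell below it clips the last window to the end date:

```python
from datetime import date, timedelta


def date_windows(start, end, days=45):
    """Return non-overlapping (from, to) date strings covering start..end.

    Each window spans `days` days after its start date, and the next window
    begins the day after the previous one ends. Dates use MM/DD/YYYY.
    """
    windows = []
    step = timedelta(days=days)
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        window_end = min(current + step, end)
        windows.append((current.strftime("%m/%d/%Y"),
                        window_end.strftime("%m/%d/%Y")))
        current = window_end + one_day
    return windows


windows = date_windows(date(2013, 8, 1), date(2013, 11, 1))
print(windows)
# [('08/01/2013', '09/15/2013'), ('09/16/2013', '10/31/2013'), ('11/01/2013', '11/01/2013')]
```

If a window still comes back as too large, shrinking `days` and regenerating the windows is the simplest fix.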

In [95]:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import os, errno, time
from contextlib import contextmanager
from selenium.webdriver.support.expected_conditions import staleness_of

In [124]:
test_directory('downloads/tx/')  # trailing slash so that downloads/tx itself is created
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/var/tmp/')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv,application/vnd.ms-excel')
profile.set_preference('browser.helperApps.alwaysAsk.force', 'false')

from datetime import datetime, timedelta  # keeps `datetime` bound to the class, as in the first cell
start_date = datetime(2013,8,1) # no records prior to 1975....
end_date = datetime.today()
step = timedelta(days = 45)
day = timedelta(days = 1)
# generate (start, end) tuples of 45-day date ranges from start_date to present
dates = []
while start_date <= end_date:
    dates.append((start_date.strftime("%m/%d/%Y"), (start_date + step).strftime("%m/%d/%Y")))
    start_date += step + day
    
well_types = ['Oil or Gas Well', 'Gas Well', 'Oil Well']

def download(start, end, well_type):
    driver.get('http://webapps2.rrc.state.tx.us/EWA/drillingPermitsQueryAction.do')
    assert "Drilling Permit" in driver.title

    well_type_select = Select(driver.find_element_by_id("wellTypeCodeHndlr:1017"))
    well_type_select.select_by_visible_text(well_type)
    submitted_from_text = driver.find_element_by_id("submittedDtFromHndlr:1026")
    submitted_from_text.send_keys(start)
    submitted_to_text = driver.find_element_by_id("submittedDtToHndlr:1027")
    submitted_to_text.send_keys(end)

    # submitted_to_text.submit() # any form element will do...
    
    with wait_for_page_load(driver):
        driver.find_element_by_xpath('//input[@value="Submit"]').click()
    
    if "exceeds the maximum records allowed" in driver.page_source:
        print("date range too large")
        return False
    elif "No results found" in driver.page_source:
        return True
    
    wait = WebDriverWait(driver, 10)    
    element = wait.until(EC.presence_of_element_located((By.XPATH, '//input[@value="Download"]')))
    element.click()
    return True

@contextmanager
def wait_for_page_load(driver, timeout=30):
    old_page = driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(driver, timeout).until(
        staleness_of(old_page)
    )

In [125]:
# !docker run -d -p 4444:4444 --shm-size 2g -v "$PWD/downloads/tx":/var/tmp selenium/standalone-firefox:3.9.1-actinium
driver = webdriver.Remote(command_executor='http://127.0.0.1:4444/wd/hub',
                          desired_capabilities=webdriver.DesiredCapabilities.FIREFOX,
                          browser_profile=profile)

# for testing purposes, use local driver
# driver = webdriver.Firefox(profile)

try:
    for date in dates:
        for well_type in well_types:
            start, end = date # unpack tuple
            result = download(start, end, well_type)
            if not result:
                break
finally:
    driver.quit()
    
# I could make this a lot faster by spinning up a couple dozen webdriver instances, or, more easily
# by trying to download more records at a go

In [126]:
# combine all downloaded CSVs into a single CSV and extract approved date
import os, csv, re

download_dir = 'downloads/tx'
files = os.listdir(download_dir)

combined_results = []

for file_name in files:
    # read each downloaded file in turn (not just the first one)
    path = os.path.join(download_dir, file_name)
    with open(path, 'r') as f:
        next(f); next(f); next(f); next(f) # skip first 4 lines
        reader = csv.DictReader(f)
        results = [row for row in reader]
        
    for row in results:
        row['Approved Date'] = re.split('Approved', row['Status Date'])[-1].strip()
        combined_results.append(row)

with open('downloads/tx-permits.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames = combined_results[0].keys())
    writer.writeheader()
    writer.writerows(combined_results)

In [None]:
import pandas as pd
permits = pd.read_csv('downloads/tx-permits.csv', low_memory = False)
permits.drop_duplicates(inplace = True)
wells = pd.read_csv("csvs/tx-data.csv", low_memory = False)

wells.set_index('API', inplace = True)
permits.set_index('API NO.', inplace = True)

df = wells.merge(permits, how = "left", left_on = "API", right_on = "API NO.")
df.to_csv('csvs/tx-data-complete.csv')