API docs for the WikiSRATMicroService
===

**The SRAT code is all in SQL and is run from this notebook.**  
The code is in the GitHub repository. The result is a table of NHDplus catchments.

**Instances where the load reduction is greater than the load generated by an NHDplus catchment:**
- After the weekly check-in call, we decided that the result can be negative, so leave it as the raw value.
- Alternatively, set a threshold so that loads cannot be reduced beyond a set proportion of the current load.
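
The second option could look like the minimal sketch below. The function name `cap_reduction` and the `max_fraction` parameter are hypothetical and are not part of the SRAT code; the notebook itself keeps the raw (possibly negative) value.

```python
def cap_reduction(current_load, reduction, max_fraction=1.0):
    """Limit a load reduction to a fraction of the catchment's current load.

    current_load and reduction must share the same units (e.g. kg/y).
    max_fraction is the hypothetical threshold: 1.0 means the reduction
    can at most cancel the whole load, 0.5 at most half of it.
    """
    if not 0.0 <= max_fraction <= 1.0:
        raise ValueError("max_fraction must be between 0 and 1")
    allowed = max_fraction * current_load
    return min(reduction, allowed)


# Illustrative numbers only: a 15 kg/y reduction applied to a 10 kg/y load
current = 10.0
capped = cap_reduction(current, 15.0, max_fraction=0.5)
remaining = current - capped  # stays non-negative once the cap is applied
print(capped, remaining)
```

With the cap in place the remaining load can no longer go below zero, which is the trade-off against keeping the raw value.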

In [63]:
import requests
import pandas as pd
import geopandas as gpd 
from requests.auth import HTTPBasicAuth
import json
import os
import psycopg2
from sqlalchemy import create_engine 
from pathlib import Path
import pyarrow
from StringParser import StringParser
from DatabaseAdapter import DatabaseAdapter
from DatabaseFormatter import DatabaseFormatter
from DatabaseMakeTable import DatabaseMakeTable

# Run WikiSRATMicroService API

## Open Connection to WikiSRATMicroService Database

This connection is used to fetch the information needed to format requests to the WikiSRATMicroService API.

In [64]:
# GET THE DATABASE CONFIG INFORMATION FROM A CONFIG FILE. THE FILE IS LISTED IN .gitignore, SO IT MUST BE SENT TO YOU SEPARATELY

with open('db_config.json') as f:
    config_file = json.load(f)
PG_CONFIG = config_file['PG_CONFIG']

# NOTE: the trailing commas make the first four values one-element tuples,
# which is why later cells index them with [0]. _port is a plain value.
_host = PG_CONFIG['host'],
_database = PG_CONFIG['database'],
_user = PG_CONFIG['user'],
_password = PG_CONFIG['password'],
_port = PG_CONFIG['port']
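
Because `db_config.json` is not in the repository, it helps to know its expected layout. The sketch below infers that layout from the keys read above and checks that none are missing. The helper name `load_pg_config` is hypothetical and all values are placeholders, not real credentials.

```python
import json
import os
import tempfile

# Keys that the cells in this notebook read from PG_CONFIG
REQUIRED_KEYS = ("host", "database", "user", "password", "port")


def load_pg_config(path):
    """Load the PG_CONFIG section and fail early if a key is missing."""
    with open(path) as f:
        pg_config = json.load(f)["PG_CONFIG"]
    missing = [key for key in REQUIRED_KEYS if key not in pg_config]
    if missing:
        raise KeyError("PG_CONFIG is missing keys: {}".format(missing))
    return pg_config


# Placeholder file written to a temporary directory for demonstration
example = {"PG_CONFIG": {"host": "localhost", "database": "example_db",
                         "user": "example_user", "password": "example_password",
                         "port": 5432}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "db_config.json")
    with open(path, "w") as f:
        json.dump(example, f)
    cfg = load_pg_config(path)

print(sorted(cfg))
```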

In [65]:
# Create connection to database

_PG_Connection = psycopg2.connect(
        host=PG_CONFIG['host'],
        database=PG_CONFIG['database'],
        user=PG_CONFIG['user'],
        password=PG_CONFIG['password'],
        port=PG_CONFIG['port'])

## Fetch data for formatting requests to the WikiSRATMicroService API


In [66]:
%%time

# GET THE MODELED LOADS FROM THE DRWI DATABASE, DERIVED FROM MMW MODEL RUNS

_PG_Connection.set_isolation_level(0)
_cur = _PG_Connection.cursor()
_cur.execute("select * from databmpapi.drb_loads_raw order by huc12;")  
# _cur.execute("select * from databmpapi.drb_loads_raw where huc12 in ('020402030902', '020402030901');")  

_dbdata = _cur.fetchall()
print(len(_dbdata))

484
Wall time: 324 ms


In [67]:
_dbdata?

## Run WikiSRATMicroService API


In [68]:
# CREATE A COUPLE HELPER FUNCTIONS TO RUN THE MICROSERVICE
def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': err.args[0] if err else json.dumps(res),
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*'
        },
    }

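def lambda_handler(event, context):
    # NOTE: _flag is a global ('base' or 'restoration') that must be set before this function is called
    try: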
        data = StringParser.parse(event['body'])
        db = DatabaseAdapter(_database[0], _user[0], _host[0], _port, _password[0], _flag)
        input_array = DatabaseAdapter.python_to_array(data)
        return respond(None, db.run_model(input_array))
    except AttributeError as e:
        return respond(e)

In [69]:
%%time

# FOR ALL DRWI HUC12s, FEED THROUGH THE MICROSERVICE TO GET SUB-BASIN ATTENUATION
# The database adapter routine flag can be either 'base' or 'restoration', depending on whether you want the
# restoration projects removed from the attenuation routine ('base') or included ('restoration').
# Restoration projects come from what was entered through FieldDoc.

_flag = 'base'


# RUN THE HUC12s THROUGH THE MICROSERVICE
_body = DatabaseFormatter.parse(_dbdata)
# _body = '[{"huc12": "020402010101", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, "tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}, {"huc12": "020402010102", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, "tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}, {"huc12": "020402010103", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, 
"tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}]'

_r = dict(lambda_handler({"body": _body},None))

Wall time: 5.06 s
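
The commented-out `_body` in the cell above shows the request shape the microservice expects: a JSON list with one record per HUC12. As a sketch, a small test body of that shape can be generated instead of typed by hand. The helper name `make_test_body` is hypothetical; the field names and the placeholder value of 10 are copied from that comment.

```python
import json

# Source suffixes taken from the commented-out example body
TP_TN_SOURCES = ["hp", "crop", "wooded", "open", "barren", "ldm", "mdm", "hdm",
                 "wetland", "farman", "tiledrain", "streambank",
                 "subsurface", "pointsource", "septics"]
# In that example, sediment (tss) has no farman, subsurface, pointsource or septics entries
TSS_SOURCES = ["hp", "crop", "wooded", "open", "barren", "ldm", "mdm", "hdm",
               "wetland", "tiledrain", "streambank"]


def make_test_body(huc12s, value=10):
    """Build a JSON request body with the same load value for every source."""
    records = []
    for huc12 in huc12s:
        record = {"huc12": huc12}
        for prefix, sources in (("tpload", TP_TN_SOURCES),
                                ("tnload", TP_TN_SOURCES),
                                ("tssload", TSS_SOURCES)):
            for source in sources:
                record["{}_{}".format(prefix, source)] = value
        records.append(record)
    return json.dumps(records)


test_body = make_test_body(["020402010101", "020402010102"])
print(len(json.loads(test_body)))
```

The resulting string could then be passed as `lambda_handler({"body": test_body}, None)`, in the same way as `_body` above.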


In [70]:
# Explore the API response
_r?

In [71]:
# Extract the NHD Loads from the response
_nhdloads = dict(json.loads(_r['body']))['huc12s']
_nhdloads?

In [72]:
# Explore selection of data for a HUC12
print(dict(json.loads(_r['body']))['huc12s']['020402010101']['catchments'])

{'4481881': {'comid': 4481881, 'tploadrate_total': 8.67678906666698, 'tploadate_conc': 0.00480867068316327, 'tnloadrate_total': 205.478173246534, 'tnloadrate_conc': 0.188308808608071, 'tssloadrate_total': 12430.9823731758, 'tssloadrate_conc': 10.6329585165398}, '4481681': {'comid': 4481681, 'tploadrate_total': 45.9269178229512, 'tploadate_conc': 0.0139356826495012, 'tnloadrate_total': 1098.12798966065, 'tnloadrate_conc': 0.333206840298741, 'tssloadrate_total': 128512.512354698, 'tssloadrate_conc': 38.9947698116636}, '4481279': {'comid': 4481279, 'tploadrate_total': 9.3369485692889, 'tploadate_conc': 0.0870941956538323, 'tnloadrate_total': 260.499630512306, 'tnloadrate_conc': 2.34191601922763, 'tssloadrate_total': 4135.72100063273, 'tssloadrate_conc': 49.6417992751781}, '4481935': {'comid': 4481935, 'tploadrate_total': 47.8207522003034, 'tploadate_conc': 0.107891339771179, 'tnloadrate_total': 1122.82652578667, 'tnloadrate_conc': 2.53327797292518, 'tssloadrate_total': 29233.9090668937, '
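
The response is nested as `huc12s` → HUC12 → `catchments` → COMID. Note that `respond()` puts the plain error message in `body` when a call fails, so `json.loads` would raise on a failed response. The sketch below checks the status first and then flattens the catchments into one row each. The function name `catchment_rows` is hypothetical, and the stand-in response uses rounded values from the output above.

```python
import json


def catchment_rows(response):
    """Flatten a respond()-style response into one dict per catchment."""
    if response["statusCode"] != "200":
        # On failure the body holds the error message, not JSON
        raise RuntimeError("Microservice call failed: {}".format(response["body"]))
    huc12s = json.loads(response["body"])["huc12s"]
    rows = []
    for huc12, huc12_data in huc12s.items():
        for values in huc12_data["catchments"].values():
            row = {"huc12": huc12}
            row.update(values)
            rows.append(row)
    return rows


# Stand-in response with two catchments (values rounded for brevity)
example_response = {
    "statusCode": "200",
    "body": json.dumps({"huc12s": {"020402010101": {"catchments": {
        "4481881": {"comid": 4481881, "tploadrate_total": 8.677, "tploadate_conc": 0.0048},
        "4481681": {"comid": 4481681, "tploadrate_total": 45.927, "tploadate_conc": 0.0139},
    }}}}),
}

rows = catchment_rows(example_response)
print(len(rows))
```

The length of `rows` is the same total that the progress counter `t` holds in the loading cells below.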

# Load Results into Database (SKIP if DONE)

### If you have done this for the latest data, skip this block.

Loading results into the database is a time-consuming process (~20-30 minutes), so it only needs to be run once for every data update. 

In [73]:
# GET THE TOTAL NUMBER OF ROWS TO PRINT THE % COMPLETED LATER ON

t = sum(len(huc12['catchments']) for huc12 in _nhdloads.values())
        
t

19496

## Save "Base" Model Results (i.e. no BMPs) to Database

In [74]:
# LOAD THE RESULTS INTO A DATABASE FOR REVIEW, CONSULT MSC94@DREXEL.EDU FOR MORE INFORMATION (MAY REQUIRE PERMISSION)
# CREATE THE TABLE TO CACHE THE API OUTPUT
# This uses an imported function to create the table. This is necessary to get the COMID geometries

# SET THE TABLE NAME AND CREATE TABLE
tablename_base = 'base_run'
new = DatabaseMakeTable(_database[0], _user[0], _host[0], _port, _password[0], tablename_base)
new.make_table()


Table Created


In [75]:
%%time

# LOADING RESULTS INTO THE DB CAN TAKE ~10 MINUTES

c = 0
prog_update = 0.1
print('0%', end='--->')
_PG_Connection.set_isolation_level(0)  # autocommit: each insert is committed immediately
_cur = _PG_Connection.cursor()
for huc12s, huc12 in _nhdloads.items():
    for comid, vals in huc12['catchments'].items():
        update_arr = [int(vals['comid']),
                      vals['tploadrate_total'],
                      vals['tploadate_conc'],
                      vals['tnloadrate_total'],
                      vals['tnloadrate_conc'],
                      vals['tssloadrate_total'],
                      vals['tssloadrate_conc']]
        # Replace missing values with the -9999 sentinel
        update_arr = [x if x is not None else -9999 for x in update_arr]
        # The table name is formatted into the statement; the values are passed as query parameters
        _cur.execute("insert into wikiwtershedoutputs.{} values (%s,%s,%s,%s,%s,%s,%s);".format(tablename_base),
                     update_arr)
        c += 1
        if c == int(t * prog_update - 1):
            print('{}%'.format(int(prog_update*100)), end='--->')
            prog_update = round(prog_update+0.1,1)
print('done')

0%--->10%--->20%--->30%--->40%--->50%--->60%--->70%--->80%--->90%--->100%--->done
Wall time: 7min 25s
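
The loop above issues one `INSERT` per catchment. Collecting the rows first and sending them in a batch is one possible way to shorten the load. The sketch below uses the standard library's `sqlite3` as a stand-in so that it runs without a PostgreSQL server. With psycopg2 the placeholder is `%s` rather than `?`, and `cursor.executemany()` or `psycopg2.extras.execute_values()` would play the same role. The table name `demo_run`, its column names, and the sample values are hypothetical.

```python
import sqlite3

MISSING = -9999  # same sentinel the notebook uses for None values
FIELDS = ("comid", "tploadrate_total", "tploadate_conc", "tnloadrate_total",
          "tnloadrate_conc", "tssloadrate_total", "tssloadrate_conc")


def to_rows(nhdloads):
    """Turn the nested huc12 -> catchments dict into a list of row tuples."""
    rows = []
    for huc12_data in nhdloads.values():
        for vals in huc12_data["catchments"].values():
            rows.append(tuple(MISSING if vals.get(f) is None else vals[f] for f in FIELDS))
    return rows


# Stand-in for _nhdloads; the second catchment has a missing value
nhdloads = {"020402010101": {"catchments": {
    "4481881": {"comid": 4481881, "tploadrate_total": 8.68, "tploadate_conc": 0.0048,
                "tnloadrate_total": 205.48, "tnloadrate_conc": 0.188,
                "tssloadrate_total": 12430.98, "tssloadrate_conc": 10.63},
    "4481681": {"comid": 4481681, "tploadrate_total": 45.93, "tploadate_conc": 0.0139,
                "tnloadrate_total": 1098.13, "tnloadrate_conc": 0.333,
                "tssloadrate_total": 128512.51, "tssloadrate_conc": None},
}}}

conn = sqlite3.connect(":memory:")
conn.execute("create table demo_run (comid integer, tp_load real, tp_conc real, "
             "tn_load real, tn_conc real, tss_load real, tss_conc real)")
rows = to_rows(nhdloads)
conn.executemany("insert into demo_run values (?,?,?,?,?,?,?)", rows)  # one batched call
conn.commit()
count = conn.execute("select count(*) from demo_run").fetchone()[0]
print(count)
```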


## Repeat for Restoration Results

Reruns the WikiSRATMicroService API with restoration BMPs and saves response to the database.

In [76]:
# NOW RUN THE ATTENUATION WITH THE BMPs ADDED IN
_flag = 'restoration'
_body = DatabaseFormatter.parse(_dbdata)
_r = dict(lambda_handler({"body": _body},None))
print('done')

done


In [77]:
# GET THE TOTAL NUMBER OF ROWS TO PRINT THE % COMPLETED LATER ON
_nhdloads = dict(json.loads(_r['body']))['huc12s']
t = sum(len(huc12['catchments']) for huc12 in _nhdloads.values())

In [78]:
# LOAD THE RESULTS INTO A DATABASE FOR REVIEW, CONSULT MSC94@DREXEL.EDU FOR MORE INFORMATION (MAY REQUIRE PERMISSION)
# CREATE THE TABLE TO CACHE THE API OUTPUT
tablename_rest = 'restoration_run'
new = DatabaseMakeTable(_database[0], _user[0], _host[0], _port, _password[0], tablename_rest)
new.make_table()

Table Created


In [79]:
%%time

# LOADING RESULTS INTO THE DB CAN TAKE ~10 MINUTES

c = 0
prog_update = 0.1
print('0%', end='--->')
_PG_Connection.set_isolation_level(0)  # autocommit: each insert is committed immediately
_cur = _PG_Connection.cursor()
for huc12s, huc12 in _nhdloads.items():
    for comid, vals in huc12['catchments'].items():
        update_arr = [int(vals['comid']),
                      vals['tploadrate_total'],
                      vals['tploadate_conc'],
                      vals['tnloadrate_total'],
                      vals['tnloadrate_conc'],
                      vals['tssloadrate_total'],
                      vals['tssloadrate_conc']]
        # Replace missing values with the -9999 sentinel
        update_arr = [x if x is not None else -9999 for x in update_arr]
        # The table name is formatted into the statement; the values are passed as query parameters
        _cur.execute("insert into wikiwtershedoutputs.{} values (%s,%s,%s,%s,%s,%s,%s);".format(tablename_rest),
                     update_arr)
        c += 1
        if c == int(t * prog_update - 1):
            print('{}%'.format(int(prog_update*100)), end='--->')
            prog_update = round(prog_update+0.1,1)

print('done')

0%--->10%--->20%--->30%--->40%--->50%--->60%--->70%--->80%--->90%--->100%--->done
Wall time: 7min 20s


# ABOVE CODE (Load Results) ONLY NEEDS TO BE RUN ONCE

Use the JupyterLab outline to collapse the code that only needs to be run once. Further edits to the data frames built from the API output can be done here.

# Read Results in Pandas and Save to Local Storage

In [80]:
# PUT HERE ANY TABLE NAMES CREATED ABOVE, IN CASE THE KERNEL WAS RESTARTED AND YOU DON'T WANT TO RUN THE ABOVE AGAIN
tablename_rest = 'restoration_run'
tablename_base = 'base_run'

## Read from Database into Pandas with `psycopg2`
Custom function written by Mike, using the psycopg2 library.

In [81]:
%%time

# LOAD THE DATABASE TABLES INTO PANDAS
# geom = NHDplus streamline
# catchment = NHDplus catchment

def postgresql_to_dataframe(conn, select_query, column_names):
    """
    Transform a SELECT query into a pandas dataframe
    """
    _cur = conn.cursor()
    try:
        _cur.execute(select_query)
    except Exception as error:  # includes psycopg2.DatabaseError
        print("Error: %s" % error)
        _cur.close()
        # NOTE: on failure the caller receives 1 instead of a dataframe
        return 1
    
    # We get a list of tuples, one per row
    rows = _cur.fetchall()
    _cur.close()
    
    # Turn it into a pandas dataframe
    df = pd.DataFrame(rows, columns=column_names)
    return df

colnames = ('comid','tploadrate_total','tploadate_conc','tnloadrate_total','tnloadate_conc','tssloadrate_total',
            'tssloadate_conc','catchment_hectares','watershed_hectares','tploadrate_total_ws','tnloadrate_total_ws',
            'tssloadrate_total_ws','maflowv','geom','geom_catchment', 'cluster', 'fa_name', 'sub_focusarea',
            'nord','nordstop', 'huc12')

cluster_names = ('Brandywine and Christina','Kirkwood - Cohansey Aquifer','Middle Schuylkill','New Jersey Highlands',
                 'Poconos and Kittatinny','Schuylkill Highlands','Upper Lehigh','Upstream Suburban Philadelphia')


base_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df = postgresql_to_dataframe(_PG_Connection, base_model_select, colnames)
rest_df = postgresql_to_dataframe(_PG_Connection, rest_model_select, colnames)

print('done')

done
Wall time: 55 s


In [82]:
# Note that most columns are read as string objects, and the
# geometry datatypes are not compatible with GeoPandas and common Python plotting functions.
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
comid                   19496 non-null int64
tploadrate_total        19496 non-null object
tploadate_conc          19496 non-null object
tnloadrate_total        19496 non-null object
tnloadate_conc          19496 non-null object
tssloadrate_total       19496 non-null object
tssloadate_conc         19496 non-null object
catchment_hectares      19496 non-null object
watershed_hectares      19496 non-null object
tploadrate_total_ws     19496 non-null object
tnloadrate_total_ws     19496 non-null object
tssloadrate_total_ws    19496 non-null object
maflowv                 19496 non-null object
geom                    19266 non-null object
geom_catchment          19496 non-null object
cluster                 17358 non-null object
fa_name                 186 non-null object
sub_focusarea           186 non-null float64
nord                    18870 non-null float64
nordstop            

In [83]:
base_df.head()

Unnamed: 0,comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,...,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
0,2612952,0.6351,0.02,11.5479,0.2536,1173.0077,8.8315,6.3851,981.0,0.1216,...,53.7069,6.676,01050000E06A7F00000100000001020000C007000000B3...,01060000206A7F0000010000000103000000010000001B...,drb,,,74295.0,74299.0,20401010101
1,2612782,51.2284,0.0188,758.0229,0.2414,22417.1069,7.4794,524.1138,906.39,0.1147,...,45.6521,6.191,01050000E06A7F00000100000001020000C015000000D8...,01060000206A7F000001000000010300000001000000DF...,drb,,,74297.0,74299.0,20401010101
2,2612920,797.8818,0.0373,8339.1431,0.4056,366215.4767,17.1041,3191.3462,4174.83,0.217,...,99.5046,27.179,01050000E06A7F00000100000001020000C063000000E1...,01060000206A7F000001000000010300000001000000C1...,drb,,,74294.0,74299.0,20401010101
3,2613460,0.7829,0.0304,33.2667,0.4398,61.4898,11.8981,1.9784,65727.99,0.1878,...,73.5153,454.467,01050000E06A7F00000100000001020000C00500000072...,01060000206A7F00000200000001030000000100000005...,drb,,,74264.0,74339.0,20401010202
4,2612780,33.5463,0.0195,361.5146,0.2099,14488.9726,8.4141,302.6154,302.85,0.1109,...,47.8421,1.927,01050000E06A7F00000100000001020000C00C00000087...,01060000206A7F000001000000010300000001000000D5...,drb,,,74299.0,74299.0,20401010101
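
The long hexadecimal strings in `geom` and `geom_catchment` are how PostGIS geometries arrive when they are read as plain text (hex-encoded EWKB), which is why pandas treats them as objects. As an illustration only, the sketch below decodes just the 9-byte header of the two prefixes seen in the table above. The function name `describe_ewkb_header` is hypothetical; parsing the full geometry is better left to GeoPandas, as in the next section.

```python
import struct

# EWKB flag bits stored in the geometry type field
EWKB_Z = 0x80000000
EWKB_M = 0x40000000
EWKB_SRID = 0x20000000
GEOMETRY_TYPES = {1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
                  5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection"}


def describe_ewkb_header(hex_string):
    """Decode byte order, geometry type, Z/M flags and SRID from hex EWKB."""
    raw = bytes.fromhex(hex_string[:18])       # 1 + 4 + 4 bytes
    byte_order = "<" if raw[0] == 1 else ">"   # 1 = little endian
    (type_code,) = struct.unpack(byte_order + "I", raw[1:5])
    info = {"type": GEOMETRY_TYPES.get(type_code & 0xFF, "unknown"),
            "has_z": bool(type_code & EWKB_Z),
            "has_m": bool(type_code & EWKB_M),
            "srid": None}
    if type_code & EWKB_SRID:
        (info["srid"],) = struct.unpack(byte_order + "I", raw[5:9])
    return info


# Prefixes copied from the geom and geom_catchment columns above
line = describe_ewkb_header("01050000E06A7F0000")
poly = describe_ewkb_header("01060000206A7F0000")
print(line)
print(poly)
```

Both prefixes carry SRID 32618 (WGS 84 / UTM zone 18N), so the streamlines and the catchments share one projected coordinate system.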


## Read from Database using SQLAlchemy and GeoPandas
An alternative way to connect, using SQLAlchemy, written by Sarah.  
Benefits include:
- Doesn't require knowledge of column names
- Correctly auto-parses datatypes
- Auto-converts the column named 'geom' from a string into a geometry datatype compatible with GeoPandas and most plotting libraries.

In [84]:
%%time
# Connect to database with sqlalchemy
from sqlalchemy import create_engine  

db_connection_url = "postgresql://{}:{}@{}:{}/{}".format(_user[0], _password[0], _host[0], _port, _database[0])
con = create_engine(db_connection_url)  

Wall time: 1.03 ms
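
One caveat with building the connection URL by string formatting: a password containing characters such as `@` or `/` would break the URL unless it is percent-encoded first. A minimal sketch, assuming such characters may occur; the helper name `build_postgres_url` and the credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Build a PostgreSQL URL with the user and password percent-encoded."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database)


# Placeholder credentials; the password deliberately contains '@' and '/'
url = build_postgres_url("example_user", "p@ss/word", "localhost", 5432, "example_db")
print(url)
```

In this notebook the call would take `_user[0]`, `_password[0]`, `_host[0]`, `_port` and `_database[0]`, and the result would be passed to `create_engine()`.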


In [85]:
%%time
# import data with geopandas compatible geometry
import geopandas as gpd 

base_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df = gpd.read_postgis(base_model_select, con)
rest_df = gpd.read_postgis(rest_model_select, con)

Wall time: 37.5 s


In [86]:
base_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
comid                   19496 non-null int64
tploadrate_total        19496 non-null float64
tploadate_conc          19496 non-null float64
tnloadrate_total        19496 non-null float64
tnloadate_conc          19496 non-null float64
tssloadrate_total       19496 non-null float64
tssloadate_conc         19496 non-null float64
catchment_hectares      19496 non-null float64
watershed_hectares      19496 non-null float64
tploadrate_total_ws     19496 non-null float64
tnloadrate_total_ws     19496 non-null float64
tssloadrate_total_ws    19496 non-null float64
maflowv                 19496 non-null float64
geom                    19266 non-null geometry
geom_catchment          19496 non-null object
cluster                 17358 non-null object
fa_name                 186 non-null object
sub_focusarea           186 non-null float64
nord                    18870 non-null float6

In [87]:
rest_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
comid                   19496 non-null int64
tploadrate_total        19496 non-null float64
tploadate_conc          19496 non-null float64
tnloadrate_total        19496 non-null float64
tnloadate_conc          19496 non-null float64
tssloadrate_total       19496 non-null float64
tssloadate_conc         19496 non-null float64
catchment_hectares      19496 non-null float64
watershed_hectares      19496 non-null float64
tploadrate_total_ws     19496 non-null float64
tnloadrate_total_ws     19496 non-null float64
tssloadrate_total_ws    19496 non-null float64
maflowv                 19496 non-null float64
geom                    19266 non-null geometry
geom_catchment          19496 non-null object
cluster                 17358 non-null object
fa_name                 186 non-null object
sub_focusarea           186 non-null float64
nord                    18870 non-null float6

In [88]:
base_df.head()

Unnamed: 0,comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,...,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
0,2612952,0.6351,0.02,11.5479,0.2536,1173.0077,8.8315,6.3851,981.0,0.1216,...,53.7069,6.676,MULTILINESTRING Z ((531509.616 4697096.509 0.0...,01060000206A7F0000010000000103000000010000001B...,drb,,,74295.0,74299.0,20401010101
1,2612782,51.2284,0.0188,758.0229,0.2414,22417.1069,7.4794,524.1138,906.39,0.1147,...,45.6521,6.191,MULTILINESTRING Z ((531592.400 4699047.630 0.0...,01060000206A7F000001000000010300000001000000DF...,drb,,,74297.0,74299.0,20401010101
2,2612920,797.8818,0.0373,8339.1431,0.4056,366215.4767,17.1041,3191.3462,4174.83,0.217,...,99.5046,27.179,MULTILINESTRING Z ((531474.494 4696977.042 0.0...,01060000206A7F000001000000010300000001000000C1...,drb,,,74294.0,74299.0,20401010101
3,2613460,0.7829,0.0304,33.2667,0.4398,61.4898,11.8981,1.9784,65727.99,0.1878,...,73.5153,454.467,MULTILINESTRING Z ((499180.397 4669543.565 0.0...,01060000206A7F00000200000001030000000100000005...,drb,,,74264.0,74339.0,20401010202
4,2612780,33.5463,0.0195,361.5146,0.2099,14488.9726,8.4141,302.6154,302.85,0.1109,...,47.8421,1.927,MULTILINESTRING Z ((532569.849 4700306.676 0.0...,01060000206A7F000001000000010300000001000000D5...,drb,,,74299.0,74299.0,20401010101


In [89]:
# Set index to comid
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
base_df.set_index('comid', inplace=True)
rest_df.set_index('comid', inplace=True)

In [90]:
base_df.index

Int64Index([2612952, 2612782, 2612920, 2613460, 2612780, 2612792, 2612956,
            2612794, 2612948, 2612950,
            ...
            9437003, 9436867, 9437007, 9437021, 9436995, 9437027, 9437011,
            9436999, 8409259, 8409235],
           dtype='int64', name='comid', length=19496)

In [91]:
rest_df.index

Int64Index([ 2612952,  2612782,  2612920,  2613460,  2612780,  2612792,
             2612956,  2612794,  2612948,  2612950,
            ...
             9437009, 26814149,  9437007,  9437021,  9436995,  9437027,
             9437011,  9436999,  8409259,  8409235],
           dtype='int64', name='comid', length=19496)

In [92]:
base_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 2612952 to 8409235
Data columns (total 20 columns):
tploadrate_total        19496 non-null float64
tploadate_conc          19496 non-null float64
tnloadrate_total        19496 non-null float64
tnloadate_conc          19496 non-null float64
tssloadrate_total       19496 non-null float64
tssloadate_conc         19496 non-null float64
catchment_hectares      19496 non-null float64
watershed_hectares      19496 non-null float64
tploadrate_total_ws     19496 non-null float64
tnloadrate_total_ws     19496 non-null float64
tssloadrate_total_ws    19496 non-null float64
maflowv                 19496 non-null float64
geom                    19266 non-null geometry
geom_catchment          19496 non-null object
cluster                 17358 non-null object
fa_name                 186 non-null object
sub_focusarea           186 non-null float64
nord                    18870 non-null float64
nordstop                18844 non-n

### Read Again to Set Catchment Geometry
This reruns the `gpd.read_postgis()` function with the optional argument `geom_col="geom_catchment"`, as a workaround for converting both the flowlines and the catchment boundaries to geometry datatypes.

In [93]:
%%time
# get catchment geometry
base_catch = 'SELECT comid, geom_catchment FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_catch = 'SELECT comid, geom_catchment FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df_catch = gpd.read_postgis(base_catch, con, geom_col="geom_catchment")
rest_df_catch = gpd.read_postgis(rest_catch, con, geom_col="geom_catchment")

Wall time: 49 s


In [94]:
base_df_catch.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 2 columns):
comid             19496 non-null int64
geom_catchment    19496 non-null geometry
dtypes: geometry(1), int64(1)
memory usage: 304.8 KB


In [95]:
# Set index to comid
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
base_df_catch.set_index('comid', inplace=True)
rest_df_catch.set_index('comid', inplace=True)

### Replace `geom_catchment` Object with Geometry DType

In [96]:
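# Replace the string column with the parsed catchment geometries; pandas aligns the rows on the shared comid index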
base_df['geom_catchment'] = base_df_catch['geom_catchment']
rest_df['geom_catchment'] = rest_df_catch['geom_catchment']

In [97]:
base_df.head(3)

comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,tnloadrate_total_ws,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
2612952,0.6351,0.02,11.5479,0.2536,1173.0077,8.8315,6.3851,981.0,0.1216,1.5422,53.7069,6.676,MULTILINESTRING Z ((531509.616 4697096.509 0.0...,"MULTIPOLYGON (((531066.887 4697023.252, 531073...",drb,,,74295.0,74299.0,20401010101
2612782,51.2284,0.0188,758.0229,0.2414,22417.1069,7.4794,524.1138,906.39,0.1147,1.4734,45.6521,6.191,MULTILINESTRING Z ((531592.400 4699047.630 0.0...,"MULTIPOLYGON (((531741.244 4696996.221, 531747...",drb,,,74297.0,74299.0,20401010101
2612920,797.8818,0.0373,8339.1431,0.4056,366215.4767,17.1041,3191.3462,4174.83,0.217,2.3596,99.5046,27.179,MULTILINESTRING Z ((531474.494 4696977.042 0.0...,"MULTIPOLYGON (((526550.250 4691030.315, 526556...",drb,,,74294.0,74299.0,20401010101


In [98]:
rest_df.head(3)

comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,tnloadrate_total_ws,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
2612952,0.6351,0.02,11.5479,0.2536,1173.0077,8.8315,6.3851,981.0,0.1216,1.5422,53.7069,6.676,MULTILINESTRING Z ((531509.616 4697096.509 0.0...,"MULTIPOLYGON (((531066.887 4697023.252, 531073...",drb,,,74295.0,74299.0,20401010101
2612782,51.2284,0.0188,758.0229,0.2414,22417.1069,7.4794,524.1138,906.39,0.1147,1.4734,45.6521,6.191,MULTILINESTRING Z ((531592.400 4699047.630 0.0...,"MULTIPOLYGON (((531741.244 4696996.221, 531747...",drb,,,74297.0,74299.0,20401010101
2612920,797.8818,0.0373,8339.1431,0.4056,366215.4767,17.1041,3191.3462,4174.83,0.217,2.3596,99.5046,27.179,MULTILINESTRING Z ((531474.494 4696977.042 0.0...,"MULTIPOLYGON (((526550.250 4691030.315, 526556...",drb,,,74294.0,74299.0,20401010101


### Sort Indices

Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

In [99]:
base_df.sort_index(inplace=True)
rest_df.sort_index(inplace=True)

In [100]:
base_df.index

Int64Index([  1748535,   1748537,   1748539,   1748541,   1748543,   1748545,
              1748547,   1748549,   1748551,   1748553,
            ...
            932040361, 932040362, 932040363, 932040364, 932040365, 932040366,
            932040367, 932040368, 932040369, 932040370],
           dtype='int64', name='comid', length=19496)

In [101]:
rest_df.index

Int64Index([  1748535,   1748537,   1748539,   1748541,   1748543,   1748545,
              1748547,   1748549,   1748551,   1748553,
            ...
            932040361, 932040362, 932040363, 932040364, 932040365, 932040366,
            932040367, 932040368, 932040369, 932040370],
           dtype='int64', name='comid', length=19496)

### Correct Names

**Note on `tploadrate_total`, etc.**  
The microservice output is labeled "tploadrate_total"; however, Mike suspected this is a mistake and that the value is actually the local NHDplus catchment annual load (kg/y). We confirmed this by comparing to Model My Watershed subbasin modeling output for specific COMIDs.
- The stream concentrations take into account the upstream watershed load, but this value is not returned. For convenience, Mike calculated it when exporting the microservice results to PG and saved it as `tploadrate_total_ws`, etc.
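
The relationship between the watershed load and the stream concentration can be checked with simple arithmetic. The sketch below rests on two assumptions that are not stated in this notebook: that the `_ws` columns are areal rates in kg/ha/y, and that `maflowv` is the NHDplus mean annual flow in cubic feet per second. The function name is hypothetical, and the input values are those shown for COMID 1748535 in the sorted tables in this notebook.

```python
CFS_TO_M3_PER_S = 0.028316846592    # 1 cubic foot = 0.3048**3 cubic metres
SECONDS_PER_YEAR = 365 * 24 * 3600
LITERS_PER_M3 = 1000.0
MG_PER_KG = 1.0e6


def watershed_concentration_mg_per_l(loadrate_ws_kg_ha_y, watershed_ha, flow_cfs):
    """Annual watershed load divided by annual flow volume, in mg/L."""
    load_mg_per_y = loadrate_ws_kg_ha_y * watershed_ha * MG_PER_KG
    flow_l_per_y = flow_cfs * CFS_TO_M3_PER_S * SECONDS_PER_YEAR * LITERS_PER_M3
    return load_mg_per_y / flow_l_per_y


# COMID 1748535: TP watershed rate, watershed area and mean annual flow
tp_conc = watershed_concentration_mg_per_l(0.1357, 6501.69, 43.699)

# Local catchment load (kg/y) divided by local catchment area (ha)
local_rate = 881.4618 / 6496.7052

print(round(tp_conc, 4), round(local_rate, 4))
```

For this row the concentration rounds to 0.0226 mg/L, which matches the stored TP concentration. The local rate rounds to 0.1357 kg/ha/y, which matches the stored watershed rate because this catchment's area is almost equal to its watershed area. Both results are consistent with the local value being a load in kg/y.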

This code block assigns more accurate, shorter names to selected columns.

Docs for `DataFrame.rename()`:
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html

In [102]:
base_df.rename(columns={'tploadrate_total':'tp_load',
                        'tploadate_conc':'tp_conc',
                        'tnloadrate_total':'tn_load',
                        'tnloadate_conc':'tn_conc',
                        'tssloadrate_total':'tss_load',
                        'tssloadate_conc':'tss_conc',
                        'tploadrate_total_ws':'tp_loadrate_ws',
                        'tnloadrate_total_ws':'tn_loadrate_ws',
                        'tssloadrate_total_ws':'tss_loadrate_ws',
                       },
               inplace=True,
              )
base_df.head(3)

comid,tp_load,tp_conc,tn_load,tn_conc,tss_load,tss_conc,catchment_hectares,watershed_hectares,tp_loadrate_ws,tn_loadrate_ws,tss_loadrate_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
1748535,881.4618,0.0226,10322.4299,0.2643,390433.1419,9.9983,6496.7052,6501.69,0.1357,1.5874,60.0509,43.699,MULTILINESTRING Z ((539681.467 4689416.166 0.0...,"MULTIPOLYGON (((535287.346 4678015.900, 535140...",drb,,,74914.0,74914.0,20401020302
1748537,296.6355,0.0297,3165.6081,0.3166,88090.7401,8.8103,1663.1712,1664.46,0.1784,1.9019,52.9245,11.189,MULTILINESTRING Z ((532825.185 4684387.302 0.0...,"MULTIPOLYGON (((532639.560 4678753.867, 532610...",drb,,,74913.0,74913.0,20401020302
1748539,350.9217,0.035,2816.4257,0.2808,117212.516,11.6874,1639.4128,1640.7,0.2139,1.7164,71.4407,11.223,MULTILINESTRING Z ((524096.535 4677200.269 0.0...,"MULTIPOLYGON (((525043.074 4672557.985, 525013...",drb,,,74921.0,74921.0,20401020305


In [103]:
rest_df.rename(columns={'tploadrate_total':'tp_load',
                        'tploadate_conc':'tp_conc',
                        'tnloadrate_total':'tn_load',
                        'tnloadate_conc':'tn_conc',
                        'tssloadrate_total':'tss_load',
                        'tssloadate_conc':'tss_conc',
                        'tploadrate_total_ws':'tp_loadrate_ws',
                        'tnloadrate_total_ws':'tn_loadrate_ws',
                        'tssloadrate_total_ws':'tss_loadrate_ws',
                       },
               inplace=True,
              )
rest_df.head(3)

comid,tp_load,tp_conc,tn_load,tn_conc,tss_load,tss_conc,catchment_hectares,watershed_hectares,tp_loadrate_ws,tn_loadrate_ws,tss_loadrate_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
1748535,881.4618,0.0226,10322.4299,0.2643,390433.1419,9.9983,6496.7052,6501.69,0.1357,1.5874,60.0509,43.699,MULTILINESTRING Z ((539681.467 4689416.166 0.0...,"MULTIPOLYGON (((535287.346 4678015.900, 535140...",drb,,,74914.0,74914.0,20401020302
1748537,296.6355,0.0297,3165.6081,0.3166,88090.7401,8.8103,1663.1712,1664.46,0.1784,1.9019,52.9245,11.189,MULTILINESTRING Z ((532825.185 4684387.302 0.0...,"MULTIPOLYGON (((532639.560 4678753.867, 532610...",drb,,,74913.0,74913.0,20401020302
1748539,350.9217,0.035,2816.4257,0.2808,117212.516,11.6874,1639.4128,1640.7,0.2139,1.7164,71.4407,11.223,MULTILINESTRING Z ((524096.535 4677200.269 0.0...,"MULTIPOLYGON (((525043.074 4672557.985, 525013...",drb,,,74921.0,74921.0,20401020305


### Add Units

Loads are in kg/y.  
Concentrations are in mg/L.

Options for the future:
- https://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe
- https://pandas.pydata.org/docs/reference/api/pandas.Series.attrs.html
- https://stackoverflow.com/questions/39419178/how-can-i-manage-units-in-pandas-data
- https://github.com/hgrecco/pint
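
As a minimal sketch of the `attrs` option linked above, units could be recorded in a dictionary keyed by column name. The `units` key and the `unit_of` helper are assumptions made for illustration, not an existing convention in this notebook, and the small frame is a stand-in for `base_df`. The pandas documentation flags `attrs` as experimental, so it may not survive every operation.

```python
import pandas as pd

# Stand-in for base_df, using two rows of values from the preview in this notebook.
df = pd.DataFrame(
    {"tp_load": [881.4618, 296.6355], "tp_conc": [0.0226, 0.0297]},
    index=pd.Index([1748535, 1748537], name="comid"),
)

# Assumed convention: a "units" dict keyed by column name, stored in DataFrame.attrs.
df.attrs["units"] = {"tp_load": "kg/y", "tp_conc": "mg/L"}


def unit_of(frame, column):
    """Return the recorded unit for a column, or None if none was recorded."""
    return frame.attrs.get("units", {}).get(column)


print(unit_of(df, "tp_load"))
print(unit_of(df, "tp_conc"))
```

Keeping the units next to the data this way avoids renaming columns, at the cost of relying on an experimental attribute.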

In [105]:
#base_df.attrs

In [106]:
#base_df.tp_load.attrs

## Get the Protection Projects "Avoided Loads"

Returns the results for the protection projects uploaded into FieldDoc. Protection projects were exploded from multipolygons into single polygons, then intersected with NHDplus catchments (COMID).

The avoided load is taken as the developed load minus the forested load, assuming 100% development to Medium Intensity (NLCD class 23). To keep the table simple, only the average forested loading rate is provided (the average of NLCD codes 41, 42, and 43); however, each land cover class's loading rate is used explicitly in the "avoided" load calculation.
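
The subtraction described above can be sketched on two TSS rows copied from the `prot_df` preview in this notebook. The stored `parcel_tssload_lbyr_avoided` values are consistent with developed minus forested. The kg/yr column name is hypothetical, added only to show a conversion to the kg/y units used for the modeled loads.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms

# Two TSS rows copied from the prot_df preview (values in lb/yr).
parcels = pd.DataFrame({
    "comid": [4185445, 4185461],
    "parcel_devtssload_lbyr": [175779.645, 159240.637],
    "parcel_foretssload_lbyr": [9187.118, 8322.707],
})

# Avoided load = developed load - forested load.
parcels["parcel_tssload_lbyr_avoided"] = (
    parcels["parcel_devtssload_lbyr"] - parcels["parcel_foretssload_lbyr"]
)

# Hypothetical extra column: the same avoided load expressed in kg/yr.
parcels["parcel_tssload_kgyr_avoided"] = (
    parcels["parcel_tssload_lbyr_avoided"] * LB_TO_KG
)

print(parcels[["comid", "parcel_tssload_lbyr_avoided", "parcel_tssload_kgyr_avoided"]])
```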

Primary Key
- practice_i = FieldDoc (FD) primary key ID
- rn = Row number of the polygon within the multipolygon submitted to FD as a project
- comid = NHDplus COMID intersecting with the protection polygon
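
Assuming the three columns above together form a composite key, a quick uniqueness check could look like the sketch below. The `duplicated_keys` helper and the toy rows are illustrative only; the same call could be applied to `prot_df`.

```python
import pandas as pd

KEY_COLUMNS = ["practice_i", "rn", "comid"]

# Toy rows shaped like the protection table: one practice, one polygon, three catchments.
rows = pd.DataFrame({
    "practice_i": [5289, 5289, 5289],
    "rn": [1, 1, 1],
    "comid": [4185445, 4185461, 4185483],
})


def duplicated_keys(frame, key_columns=KEY_COLUMNS):
    """Return every row whose key combination occurs more than once."""
    return frame[frame.duplicated(subset=key_columns, keep=False)]


print(len(duplicated_keys(rows)))
```

An empty result means each (practice_i, rn, comid) combination appears exactly once.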

In [107]:
%%time

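# Build the connection URL straight from PG_CONFIG. The _host, _user, etc. variables
# defined earlier are one-element tuples because of their trailing commas.
db_connection_url = "postgresql://{user}:{password}@{host}:{port}/{database}".format(**PG_CONFIG)
con = create_engine(db_connection_url)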

tablename_prot = 'protection_lbsavoided_fd'

prot_select = 'SELECT * FROM datapolassess.{}'.format(tablename_prot)

prot_df = gpd.read_postgis(prot_select, con)

Wall time: 1.06 s
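
A password containing characters such as `@` or `/` would break a URL assembled with plain string formatting. One hedged option is to percent-encode the credentials with the standard library's `urllib.parse.quote_plus` before building the URL. The `build_pg_url` helper and the example values below are hypothetical, not the project's real configuration.

```python
from urllib.parse import quote_plus


def build_pg_url(cfg):
    """Build a postgresql:// URL from a dict with user, password, host, port, database."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(str(cfg["user"])),
        quote_plus(str(cfg["password"])),  # escapes '@', '/', ':' and similar characters
        cfg["host"],
        cfg["port"],
        cfg["database"],
    )


# Hypothetical configuration, shaped like PG_CONFIG.
example_cfg = {
    "user": "analyst",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "drwi",
}

print(build_pg_url(example_cfg))
```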


In [108]:
prot_df.head()

Unnamed: 0,practice_i,practice_n,rn,comid,practice_t,practice_d,project_na,project_st,creator_na,program_id,...,parcel_devtssload_lbyr,parcel_foretssload_lbyr,parcel_tssload_lbyr_avoided,parcel_devtnload_lbyr,parcel_foretnload_lbyr,parcel_tnload_lbyr_avoided,parcel_devtpload_lbyr,parcel_foretpload_lbyr,parcel_tpload_lbyr_avoided,geom
0,5289.0,Bear Creek,1,4185445,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,175779.645,9187.118,166592.527,205.6,17.768,187.832,76.994,4.23,72.764,"MULTIPOLYGON (((-75.79484 41.17775, -75.79468 ..."
1,5289.0,Bear Creek,1,4185461,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,159240.637,8322.707,150917.93,186.255,16.096,170.159,69.75,3.832,65.918,"MULTIPOLYGON (((-75.78786 41.18276, -75.78756 ..."
2,5289.0,Bear Creek,1,4185483,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,235688.406,12318.247,223370.159,275.672,23.823,251.849,103.235,5.672,97.563,"MULTIPOLYGON (((-75.81161 41.17841, -75.81176 ..."
3,5289.0,Bear Creek,1,4185485,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,48109.916,2514.463,45595.453,56.272,4.863,51.409,21.073,1.158,19.915,"MULTIPOLYGON (((-75.80381 41.17354, -75.80405 ..."
4,5289.0,Bear Creek,1,4185505,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,40026.507,2083.105,37943.402,34.93,3.776,31.154,16.804,0.944,15.86,"MULTIPOLYGON (((-75.79805 41.17363, -75.80381 ..."


In [109]:
prot_df.practice_i = prot_df.practice_i.astype('int64')

In [110]:
prot_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 269 entries, 0 to 268
Data columns (total 33 columns):
practice_i                      269 non-null int64
practice_n                      269 non-null object
rn                              269 non-null int64
comid                           269 non-null int64
practice_t                      269 non-null object
practice_d                      44 non-null object
project_na                      269 non-null object
project_st                      269 non-null object
creator_na                      269 non-null object
program_id                      269 non-null float64
program_na                      269 non-null object
created                         269 non-null object
modified                        269 non-null object
practice_url                    269 non-null object
project_url                     269 non-null object
huc12                           269 non-null object
area_acres                      269 non-null float64

## Get the Restoration Projects intersected by COMID

In [111]:
%%time
tablename_rest_projects = 'restoration_lbsreduced'
rest_proj_select = 'SELECT * FROM datapolassess.{}'.format(tablename_rest_projects)
rest_proj_df = gpd.read_postgis(rest_proj_select, con)

Wall time: 2.67 s


In [112]:
rest_proj_df.head()

Unnamed: 0,comid,practice_id,site_id,practice_type,tn_reduced_lbs,tp_reduced_lbs,tss_reduced_lbs,practice_name,practice_description,project_name,project_status,creator_name,program_id,program_name,created_on,modified_on,practice_url,project_url,geom
0,2583213,5396,2864,Forest Buffer,2.8553,0.6454,331.0068,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1,Delaware River Restoration Fund,2019-01-24,2021-07-23,https://www.fielddoc.org/practices/5396,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522215.487 4552852.203, 522202..."
1,2583213,5399,2865,Forest Buffer,14.13,3.44,1438.77,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1,Delaware River Restoration Fund,2019-01-24,2021-08-10,https://www.fielddoc.org/practices/5399,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522234.173 4553442.738, 522208..."
2,2583231,5396,2864,Forest Buffer,33.6847,7.6146,3904.9832,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1,Delaware River Restoration Fund,2019-01-24,2021-07-23,https://www.fielddoc.org/practices/5396,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522215.487 4552852.203, 522202..."
3,2583231,5435,2896,Forest Buffer,1.91,0.4,222.04,Forest Buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Michelle DiBlasio,1,Delaware River Restoration Fund,2019-01-25,2021-08-10,https://www.fielddoc.org/practices/5435,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522983.826 4552844.364, 522966..."
4,2583231,52787,12863,Forest Buffer,16.45,3.46,1917.7,sussex county fairgrounds buffer,buffer planting with NJ Youth Corps,EZG #44530 Synergistic Conservation Strategies...,complete,john parke,1,Delaware River Restoration Fund,2021-08-30,2021-08-30,https://www.fielddoc.org/practices/52787,https://www.fielddoc.org/projects/6368,"MULTIPOLYGON (((523736.468 4553845.556, 523801..."


In [113]:
rest_proj_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 755 entries, 0 to 754
Data columns (total 19 columns):
comid                   755 non-null int64
practice_id             755 non-null int64
site_id                 755 non-null int64
practice_type           755 non-null object
tn_reduced_lbs          755 non-null float64
tp_reduced_lbs          755 non-null float64
tss_reduced_lbs         755 non-null float64
practice_name           755 non-null object
practice_description    485 non-null object
project_name            755 non-null object
project_status          755 non-null object
creator_name            755 non-null object
program_id              755 non-null int64
program_name            755 non-null object
created_on              755 non-null object
modified_on             755 non-null object
practice_url            755 non-null object
project_url             755 non-null object
geom                    755 non-null geometry
dtypes: float64(3), geometry(1), int64(4), object(

## Get the Point Sources by NPDES ID and COMID

In [114]:
%%time
tablename_pointsource = 'ms_pointsource_drb_12_13_18'
pointsource_select = 'SELECT * FROM wikiwtershed.{}'.format(tablename_pointsource)
point_source_df = gpd.read_postgis(pointsource_select, con)

Wall time: 258 ms


In [115]:
point_source_df.head()

Unnamed: 0,ogc_fid,geom,npdes_id,city,state,latitude,longitude,huc12,avg_n_conc,lbsn_yr,mgd,avgpconc,lbsp_yr,kgn_yr,kgp_yr,facilityname,comid
0,1,POINT (414070.808 4469871.626),PA0033995,BERN TWP,PA,40.375,-76.012222,20402030409,0.191,84.591269,0.16305,0.191,1325.042759,38.369923,601.028795,COUNTY OF BERKS WWTP,4783187
1,1,POINT (450085.743 4495200.414),PA0051811,SOUTH WHITEHALL1TWP,PA,40.606111,-75.59,20401060703,35.0,32.011082,0.0003,35.0,2.733729,14.519971,1.239998,LEHIGH COUNTY AUTH,4187751
2,1,POINT (443086.168 4439655.703),1594403,WEST VINCENT TWP,PA,40.105277,-75.667777,20402031003,,,0.0,,,,,MATTHEWS MEADOWS STP,4781791
3,1,POINT (444600.842 4439582.877),1592417,WEST VINCENT TWP,PA,40.104722,-75.65,20402031003,0.0,0.0,0.0,0.0,,0.0,,SAINT STEPHEN'S GREENE STP,4782621
4,1,POINT (499574.949 4511288.150),NJ0065196,Township of Washington,NJ,40.752548,-75.005035,20401050401,7.175,155.557987,0.000677,7.175,5.599735,70.559859,2.539995,390 RT 57,2588253


In [116]:
point_source_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 17 columns):
ogc_fid         812 non-null int64
geom            812 non-null geometry
npdes_id        812 non-null object
city            789 non-null object
state           812 non-null object
latitude        812 non-null float64
longitude       812 non-null float64
huc12           812 non-null object
avg_n_conc      811 non-null float64
lbsn_yr         811 non-null float64
mgd             812 non-null float64
avgpconc        811 non-null float64
lbsp_yr         809 non-null float64
kgn_yr          811 non-null float64
kgp_yr          809 non-null float64
facilityname    812 non-null object
comid           812 non-null int64
dtypes: float64(9), geometry(1), int64(2), object(5)
memory usage: 108.0+ KB


## Save to Parquet Files
The code below saves the data to local Parquet files so that the visualization script does not need to query the database on every run.

Apache Parquet is a columnar binary format that preserves column dtypes and is widely used for storing dataframes.
- https://pandas.pydata.org/docs/user_guide/io.html#io-parquet
- https://anaconda.org/TomAugspurger/pandas-performance/notebook
- https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html

NOTE: The current version of GeoPandas only supports the `pyarrow` engine. For future implementations, look for integration with the `fastparquet` engine/library, which pandas already supports and which is more tightly integrated with numba and dask. See https://fastparquet.readthedocs.io
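
The idea of reading from a local file instead of the database can be sketched as a small cache helper. `load_or_query` is a hypothetical function, not part of this repository. The reader and writer are parameters so that a GeoDataFrame can use `gpd.read_parquet`, and the defaults assume a Parquet engine such as `pyarrow` is installed.

```python
from pathlib import Path

import pandas as pd


def load_or_query(cache_path, query_fn,
                  read_fn=pd.read_parquet,
                  write_fn=lambda frame, path: frame.to_parquet(path)):
    """Return the cached frame if cache_path exists; otherwise run query_fn and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the database entirely.
        return read_fn(cache_path)
    frame = query_fn()
    # Create the target folder on first use so the write does not fail.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    write_fn(frame, cache_path)
    return frame
```

With the names used in this notebook, a call might look like `load_or_query(project_folder / data_folder / 'point_source_df.parquet', lambda: gpd.read_postgis(pointsource_select, con), read_fn=gpd.read_parquet)`. Deleting the file forces a fresh query on the next run.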

In [117]:
# Find the current working directory (the repository root when run from a clone)
from pathlib import Path
project_folder = Path.cwd()

In [118]:
# Relative data path; works for anyone running from a clone of the GitHub repository
data_folder = Path('data/')
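
Writing to `project_folder / data_folder` would typically fail if the `data` folder does not exist yet, for example in a fresh clone where the folder is not tracked. A minimal sketch of creating it first follows; `ensure_data_folder` is a hypothetical helper, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path


def ensure_data_folder(project_folder, data_folder=Path("data")):
    """Create project_folder/data_folder if needed and return the full path."""
    target = Path(project_folder) / data_folder
    target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return target


# Demonstration in a temporary directory instead of the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_data_folder(tmp)
    print(out_dir.is_dir())
    print((out_dir / "base_df.parquet").name)
```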

In [119]:
%%time

# requires pyarrow
import pyarrow

base_df.to_parquet(project_folder / data_folder /'base_df.parquet')
rest_df.to_parquet(project_folder / data_folder /'rest_df.parquet')


This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.


Wall time: 5.23 s


In [120]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
prot_df.to_parquet(project_folder / data_folder /'prot_df.parquet')

Wall time: 36 ms




In [121]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
rest_proj_df.to_parquet(project_folder / data_folder /'rest_proj_df.parquet')

Wall time: 61 ms




In [122]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
point_source_df.to_parquet(project_folder / data_folder /'point_source_df.parquet')

Wall time: 31 ms




In [123]:
# %%time
# # Alternate write using gzip compression, for better but slower compression

# base_df.to_parquet(project_folder / data_folder /'base_df2.parquet',
#                     compression='gzip', # default is compression with `snappy`
#                    )