API docs for the WikiSRATMicroService
===

**The SRAT code is written entirely in SQL and is run from this notebook.**  
The code lives in the GitHub repository. The result is a table of NHDplus catchments. 

**Instances where the load reduction is greater than the load generated by an NHDplus catchment:**
- After the weekly check-in call, we decided that the resulting load can be negative, so the raw value is left as is.
- Alternatively, a threshold could be set so that loads cannot be reduced beyond a given proportion of the current load.
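
The threshold option in the second bullet could be sketched as follows. This is only an illustration, not part of the SRAT code: the helper name `cap_reduction` and the `max_fraction` parameter are hypothetical, and the numbers are made up.

```python
def cap_reduction(current_load, reduction, max_fraction=1.0):
    """Limit a load reduction to a fraction of the current load.

    Hypothetical helper: `max_fraction` is an assumed tuning parameter,
    not a value taken from the SRAT code.
    """
    limit = current_load * max_fraction
    return min(reduction, limit)


# Uncapped, the net load goes negative (the behavior currently kept as the raw value).
raw_net = 100.0 - 120.0

# Capped at 50% of the current load, the net load cannot drop below half the original.
capped_net = 100.0 - cap_reduction(100.0, 120.0, max_fraction=0.5)
```

With `max_fraction=1.0` the net load can reach zero but never go negative.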

In [1]:
import requests
import pandas as pd
import geopandas as gpd 
from requests.auth import HTTPBasicAuth
import json
import os
import psycopg2
from sqlalchemy import create_engine 
from pathlib import Path
import pyarrow
from StringParser import StringParser
from DatabaseAdapter import DatabaseAdapter
from DatabaseFormatter import DatabaseFormatter
from DatabaseMakeTable import DatabaseMakeTable

# Run WikiSRATMicroService API

## Open Connection to WikiSRATMicroService Database

Used to fetch information that assists in formatting requests to the WikiSRATMicroService API.

In [3]:
# GET THE DATABASE CONFIG INFORMATION USING A CONFIG FILE. 
# THE FILE IS LISTED IN .GITIGNORE, SO IT MUST BE REQUESTED VIA EMAIL.

with open('db_config.json') as f:
    config_file = json.load(f)
PG_CONFIG = config_file['PG_CONFIG']

# NOTE: the trailing commas make these one-element tuples,
# which is why later cells index them with [0] (except _port).
_host = PG_CONFIG['host'],
_database = PG_CONFIG['database'],
_user = PG_CONFIG['user'],
_password = PG_CONFIG['password'],
_port = PG_CONFIG['port']

In [4]:
# Create connection to database using `psycopg2`

_PG_Connection = psycopg2.connect(
        host=PG_CONFIG['host'],
        database=PG_CONFIG['database'],
        user=PG_CONFIG['user'],
        password=PG_CONFIG['password'],
        port=PG_CONFIG['port'])
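
Since the config file is not in the repository, it may help to validate its shape before connecting. The sketch below assumes only the five keys used in the cell above; the helper name `load_pg_config` and the placeholder credentials are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("host", "database", "user", "password", "port")


def load_pg_config(path):
    """Read the PG_CONFIG block and check that every expected key is present."""
    with open(path) as f:
        pg_config = json.load(f)["PG_CONFIG"]
    missing = [k for k in REQUIRED_KEYS if k not in pg_config]
    if missing:
        raise KeyError("config file is missing keys: {}".format(missing))
    return pg_config


# Demonstrate with a throwaway file holding placeholder (not real) credentials.
example = {"PG_CONFIG": {"host": "localhost", "database": "example_db",
                         "user": "example_user", "password": "example_pw",
                         "port": 5432}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)
pg_config = load_pg_config(tmp.name)
os.remove(tmp.name)
```

Because the keys match the keyword arguments used above, the resulting dict could be passed as `psycopg2.connect(**pg_config)`.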

## Fetch data for formatting requests to the WikiSRATMicroService API


In [5]:
%%time

# GET THE MODELED LOADS FROM THE DRWI DATABASE, DERIVED FROM MMW MODEL RUNS

_PG_Connection.set_isolation_level(0)
_cur = _PG_Connection.cursor()
_cur.execute("select * from databmpapi.drb_loads_raw order by huc12;")  
# _cur.execute("select * from databmpapi.drb_loads_raw where huc12 in ('020402030902', '020402030901');")  

_dbdata = _cur.fetchall()
print(len(_dbdata))

484
CPU times: user 17.3 ms, sys: 10.7 ms, total: 28 ms
Wall time: 547 ms


In [6]:
_dbdata?

Type:        list
String form: [(1, '020401010101', Decimal('83.25'), Decimal('131587.70'), Decimal('144086.00'), Decimal('11916 <...> 00421087362510914, 0.0325741241610815, 0.124997522620959, 0.149069147756057, 0.0349152439841768)]
Length:      484
Docstring:  
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.


## Run WikiSRATMicroService API


In [7]:
# CREATE A COUPLE HELPER FUNCTIONS TO RUN THE MICROSERVICE
def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': err.args[0] if err else json.dumps(res),
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*'
        },
    }

def lambda_handler(event, context):
    try:
        data = StringParser.parse(event['body'])
        db = DatabaseAdapter(_database[0], _user[0], _host[0], _port, _password[0], _flag)
        input_array = DatabaseAdapter.python_to_array(data)
        return respond(None, db.run_model(input_array))
    except AttributeError as e:
        return respond(e)
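
Because `respond` reports failures through `statusCode` rather than by raising, callers may want to check the status before decoding the body. A minimal sketch, assuming the response shape produced by `respond` above; the helper `unpack_response` and both sample responses are hypothetical.

```python
import json


def unpack_response(response):
    """Return the decoded body, or raise if the handler reported an error."""
    if response["statusCode"] != "200":
        raise RuntimeError("Microservice error: {}".format(response["body"]))
    return json.loads(response["body"])


# Hypothetical responses shaped like the output of `respond`
ok = {"statusCode": "200", "body": json.dumps({"huc12s": {}}), "headers": {}}
bad = {"statusCode": "400", "body": "something went wrong", "headers": {}}

payload = unpack_response(ok)
try:
    unpack_response(bad)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```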

In [8]:
%%time

# FOR ALL DRWI HUC12s, FEED THROUGH THE MICROSERVICE TO GET SUB-BASIN ATTENUATION
# The database adapter routine flag can be either 'base' or 'restoration'. Use 'base' to run the
# attenuation routine without restoration projects (no BMPs) and 'restoration' to include them.
# Restoration projects are those entered through FieldDoc.

_flag = 'base'

# RUN THE HUC12s THROUGH THE MICROSERVICE
_body = DatabaseFormatter.parse(_dbdata)
# _body = '[{"huc12": "020402010101", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, "tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}, {"huc12": "020402010102", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, "tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}, {"huc12": "020402010103", "tpload_hp": 10, "tpload_crop": 10, "tpload_wooded": 10, "tpload_open": 10, "tpload_barren": 10, "tpload_ldm": 10, "tpload_mdm": 10, "tpload_hdm": 10, "tpload_wetland": 10, 
# "tpload_farman": 10, "tpload_tiledrain": 10, "tpload_streambank": 10, "tpload_subsurface": 10, "tpload_pointsource": 10, "tpload_septics": 10, "tnload_hp": 10, "tnload_crop": 10, "tnload_wooded": 10, "tnload_open": 10, "tnload_barren": 10, "tnload_ldm": 10, "tnload_mdm": 10, "tnload_hdm": 10, "tnload_wetland": 10, "tnload_farman": 10, "tnload_tiledrain": 10, "tnload_streambank": 10, "tnload_subsurface": 10, "tnload_pointsource": 10, "tnload_septics": 10, "tssload_hp": 10, "tssload_crop": 10, "tssload_wooded": 10, "tssload_open": 10, "tssload_barren": 10, "tssload_ldm": 10, "tssload_mdm": 10, "tssload_hdm": 10, "tssload_wetland": 10, "tssload_tiledrain": 10, "tssload_streambank": 10}]'

_r = dict(lambda_handler({"body": _body},None))

CPU times: user 384 ms, sys: 52.7 ms, total: 436 ms
Wall time: 6.01 s


In [9]:
# Explore the API response
_r?

Type:        dict
String form: {'statusCode': '200', 'body': '{"huc12s": {"020401010101": {"huc12": "020401010101", "tpload_hp": <...> 15247}}}}}', 'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'}}
Length:      3
Docstring:  
dict() -> new empty dictionary
dict(mapping) -> new dictionary initialized from a mapping object's
    (key, value) pairs
dict(iterable) -> new dictionary initialized as if via:
    d = {}
    for k, v in iterable:
        d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs
    in the keyword argument list.  For example:  dict(one=1, two=2)


In [10]:
# Extract the NHD Loads from the response
_nhdloads = dict(json.loads(_r['body']))['huc12s']
_nhdloads?

Type:        dict
String form: {'020401010101': {'huc12': '020401010101', 'tpload_hp': 602.07665754054, 'tpload_crop': 360.72733 <...> 0436867554431356, 'tssloadrate_total': 25055.0654476434, 'tssloadrate_conc': 41.5994419315247}}}}
Length:      484
Docstring:  
dict() -> new empty dictionary
dict(mapping) -> new dictionary initialized from a mapping object's
    (key, value) pairs
dict(iterable) -> new dictionary initialized as if via:
    d = {}
    for k, v in iterable:
        d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs
    in the keyword argument list.  For example:  dict(one=1, two=2)


In [11]:
# Explore selection of data for a HUC12
print(dict(json.loads(_r['body']))['huc12s']['020402010101']['catchments'])

{'4481881': {'comid': 4481881, 'tploadrate_total': 8.67678906666698, 'tploadate_conc': 0.00480867068316327, 'tnloadrate_total': 205.478173246534, 'tnloadrate_conc': 0.188308808608071, 'tssloadrate_total': 12430.9823731758, 'tssloadrate_conc': 10.6329585165398}, '4481681': {'comid': 4481681, 'tploadrate_total': 45.9269178229512, 'tploadate_conc': 0.0139356826495012, 'tnloadrate_total': 1098.12798966065, 'tnloadrate_conc': 0.333206840298741, 'tssloadrate_total': 128512.512354698, 'tssloadrate_conc': 38.9947698116636}, '4481279': {'comid': 4481279, 'tploadrate_total': 9.3369485692889, 'tploadate_conc': 0.0870941956538323, 'tnloadrate_total': 260.499630512306, 'tnloadrate_conc': 2.34191601922763, 'tssloadrate_total': 4135.72100063273, 'tssloadrate_conc': 49.6417992751781}, '4481935': {'comid': 4481935, 'tploadrate_total': 47.8207522003034, 'tploadate_conc': 0.107891339771179, 'tnloadrate_total': 1122.82652578667, 'tnloadrate_conc': 2.53327797292518, 'tssloadrate_total': 29233.9090668937, '
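
The nested `huc12s` → `catchments` structure shown above can be flattened into one record per catchment, which is often easier to inspect. The sketch below assumes only that nesting; the helper `flatten_catchments` and the sample values are hypothetical.

```python
def flatten_catchments(nhdloads):
    """Yield one flat dict per catchment, tagged with its parent HUC12."""
    for huc12, huc12_result in nhdloads.items():
        for catchment in huc12_result["catchments"].values():
            row = dict(catchment)
            row["huc12"] = huc12
            yield row


# Tiny hypothetical response with made-up values, shaped like `_nhdloads`
sample = {
    "020402010101": {"catchments": {
        "1": {"comid": 1, "tploadrate_total": 8.5},
        "2": {"comid": 2, "tploadrate_total": None},
    }},
    "020402010102": {"catchments": {
        "3": {"comid": 3, "tploadrate_total": 1.25},
    }},
}
rows = list(flatten_catchments(sample))
total_rows = len(rows)  # number of catchments across all HUC12s
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` for a tabular view.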

# Load Results into Database (SKIP if DONE)

### If you have done this for the latest data, skip this block.

Loading results into the database is a time-consuming process (~20-30 minutes), so it only needs to be run once for each data update. 

In [9]:
# GET THE TOTAL NUMBER OF ROWS TO PRINT THE % COMPLETED LATER ON

t = 0
for huc12s, huc12 in _nhdloads.items():
    for comid in _nhdloads[huc12s]['catchments']:
        t += 1
        
t

19496

## Save "Base" Model Results (i.e. no BMPs) to Database

In [5]:
# LOAD THE RESULTS INTO A DATABASE FOR REVIEW, CONSULT MSC94@DREXEL.EDU FOR MORE INFORMATION (MAY REQUIRE PERMISSION)
# CREATE THE TABLE TO CACHE THE API OUTPUT
# This uses an imported function to create the table. This is necessary to get the COMID geometries

# SET THE TABLE NAME AND CREATE TABLE
tablename_base = 'base_run'
new = DatabaseMakeTable(_database[0], _user[0], _host[0], _port, _password[0], tablename_base)
new.make_table()


Table Created


In [None]:
%%time

# LOADING RESULTS INTO THE DB CAN TAKE ~10 MINUTES

c = 0
prog_update = 0.1
print('0%', end='--->')
for huc12s, huc12 in _nhdloads.items():
    for comid in _nhdloads[huc12s]['catchments']:
        update_arr = [int(_nhdloads[huc12s]['catchments'][comid]['comid']),
                      _nhdloads[huc12s]['catchments'][comid]['tploadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tploadate_conc'],
                      _nhdloads[huc12s]['catchments'][comid]['tnloadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tnloadrate_conc'],
                      _nhdloads[huc12s]['catchments'][comid]['tssloadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tssloadrate_conc']]
        update_arr = [x if x is not None else -9999 for x in update_arr]
        _PG_Connection.set_isolation_level(0)
        _cur = _PG_Connection.cursor()
        _cur.execute("insert into wikiwtershedoutputs.{} values ({},{},{},{},{},{},{})"
                     ";".format(tablename_base, update_arr[0],update_arr[1],update_arr[2],update_arr[3],update_arr[4],update_arr[5],update_arr[6]))
        c += 1
        if c == int(t * prog_update - 1):
            print('{}%'.format(int(prog_update*100)), end='--->')
            prog_update = round(prog_update+0.1,1)
print('done')

0%--->
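
The row-by-row insert above issues one statement per catchment. A batched alternative is sketched below using `psycopg2.extras.execute_values` and `psycopg2.sql` for identifier quoting. This is an untested assumption about what would suit this database, not the project's method: the helper names `build_rows` and `insert_rows` are hypothetical, and only `build_rows` is exercised here.

```python
NODATA = -9999
FIELDS = ("comid", "tploadrate_total", "tploadate_conc", "tnloadrate_total",
          "tnloadrate_conc", "tssloadrate_total", "tssloadrate_conc")


def build_rows(nhdloads):
    """Build one tuple per catchment, replacing missing values with NODATA."""
    rows = []
    for huc12_result in nhdloads.values():
        for catchment in huc12_result["catchments"].values():
            values = [catchment[f] for f in FIELDS]
            values[0] = int(values[0])
            rows.append(tuple(NODATA if v is None else v for v in values))
    return rows


def insert_rows(conn, table, rows, schema="wikiwtershedoutputs"):
    """Insert all rows in batches; identifiers are quoted, values are bound."""
    # Imported here so that build_rows can be used without a database driver.
    from psycopg2 import sql
    from psycopg2.extras import execute_values

    query = sql.SQL("insert into {}.{} values %s").format(
        sql.Identifier(schema), sql.Identifier(table))
    with conn.cursor() as cur:
        execute_values(cur, query.as_string(conn), rows)
    conn.commit()


# Hypothetical single-catchment response with made-up values
sample = {"020402010101": {"catchments": {"1": {
    "comid": "1", "tploadrate_total": 8.5, "tploadate_conc": None,
    "tnloadrate_total": 2.0, "tnloadrate_conc": 0.5,
    "tssloadrate_total": 10.0, "tssloadrate_conc": None}}}}
rows = build_rows(sample)
```

A call such as `insert_rows(_PG_Connection, tablename_base, build_rows(_nhdloads))` would then replace the loop, at the cost of losing the progress printout.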

## Repeat for Restoration Results

Reruns the WikiSRATMicroService API with restoration BMPs and saves response to the database.

In [7]:
# NOW RUN THE ATTENUATION WITH THE BMPs ADDED IN
_flag = 'restoration'

_r = dict(lambda_handler({"body": _body},None))
print('done')

done


In [8]:
# GET THE TOTAL NUMBER OF ROWS TO PRINT THE % COMPLETED LATER ON
_nhdloads = dict(json.loads(_r['body']))['huc12s']
t = 0
for huc12s, huc12 in _nhdloads.items():
    for comid in _nhdloads[huc12s]['catchments']:
        t += 1

In [9]:
# LOAD THE RESULTS INTO A DATABASE FOR REVIEW, CONSULT MSC94@DREXEL.EDU FOR MORE INFORMATION (MAY REQUIRE PERMISSION)
# CREATE THE TABLE TO CACHE THE API OUTPUT
tablename_rest = 'restoration_run'
new = DatabaseMakeTable(_database[0], _user[0], _host[0], _port, _password[0], tablename_rest)
new.make_table()

Table Created


In [10]:
%%time

# LOADING RESULTS INTO THE DB CAN TAKE ~10 MINUTES

c = 0
prog_update = 0.1
print('0%', end='--->')
for huc12s, huc12 in _nhdloads.items():
    for comid in _nhdloads[huc12s]['catchments']:
        update_arr = [int(_nhdloads[huc12s]['catchments'][comid]['comid']),
                      _nhdloads[huc12s]['catchments'][comid]['tploadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tploadate_conc'],
                      _nhdloads[huc12s]['catchments'][comid]['tnloadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tnloadrate_conc'],
                      _nhdloads[huc12s]['catchments'][comid]['tssloadrate_total'],
                      _nhdloads[huc12s]['catchments'][comid]['tssloadrate_conc']]
        update_arr = [x if x is not None else -9999 for x in update_arr]
        _PG_Connection.set_isolation_level(0)
        _cur = _PG_Connection.cursor()
        _cur.execute("insert into wikiwtershedoutputs.{} values ({},{},{},{},{},{},{})"
                     ";".format(tablename_rest, update_arr[0],update_arr[1],update_arr[2],update_arr[3],update_arr[4],update_arr[5],update_arr[6]))
        c += 1
        if c == int(t * prog_update - 1):
            print('{}%'.format(int(prog_update*100)), end='--->')
            prog_update = round(prog_update+0.1,1)

print('done')

0%--->10%--->20%--->30%--->40%--->50%--->60%--->70%--->80%--->90%--->100%--->done


# ABOVE CODE (Load Results) ONLY NEEDS TO BE RUN ONCE

In JupyterLab, use the outline to collapse the code that only needs to be run once. Further edits to the data frames for the API output can be done here.

# Read Results in Pandas and Save to Local Storage

In [12]:
# PUT HERE ANY TABLE NAMES CREATED ABOVE, IN CASE THE KERNEL WAS RESTARTED AND YOU DON'T WANT TO RUN THE ABOVE AGAIN
tablename_rest = 'restoration_run'
tablename_base = 'base_run'

## Read from Database into Pandas with `psycopg2`
This uses a custom function written by Mike, built on the [psycopg2](https://www.psycopg.org) library.

In [14]:
%%time

# LOAD THE DATABASE TABLES INTO PANDAS
# geom = NHDplus streamline
# catchment = NHDplus catchment

def postgresql_to_dataframe(conn, select_query, column_names):
    """
    Transform a SELECT query into a pandas DataFrame
    """
    _cur = conn.cursor()
    try:
        _cur.execute(select_query)
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        _cur.close()
        return 1
    
    # We get a list of tuples
    tuples = _cur.fetchall()
    _cur.close()
    
    # Turn it into a pandas dataframe
    df = pd.DataFrame(tuples, columns=column_names)
    return df

colnames = ('comid','tploadrate_total','tploadate_conc','tnloadrate_total','tnloadate_conc','tssloadrate_total',
            'tssloadate_conc','catchment_hectares','watershed_hectares','tploadrate_total_ws','tnloadrate_total_ws',
            'tssloadrate_total_ws','maflowv','geom','geom_catchment', 'cluster', 'fa_name', 'sub_focusarea',
            'nord','nordstop', 'huc12')

cluster_names = ('Brandywine and Christina','Kirkwood - Cohansey Aquifer','Middle Schuylkill','New Jersey Highlands',
                 'Poconos and Kittatinny','Schuylkill Highlands','Upper Lehigh','Upstream Suburban Philadelphia')


base_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df = postgresql_to_dataframe(_PG_Connection, base_model_select, colnames)
rest_df = postgresql_to_dataframe(_PG_Connection, rest_model_select, colnames)

print('done')

done
CPU times: user 664 ms, sys: 1.06 s, total: 1.72 s
Wall time: 28.2 s


In [15]:
# Note that most columns are read as string objects, and the
# geometry datatypes are not compatible with GeoPandas and common Python plotting functions.
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   comid                 19496 non-null  int64  
 1   tploadrate_total      19496 non-null  object 
 2   tploadate_conc        19496 non-null  object 
 3   tnloadrate_total      19496 non-null  object 
 4   tnloadate_conc        19496 non-null  object 
 5   tssloadrate_total     19496 non-null  object 
 6   tssloadate_conc       19496 non-null  object 
 7   catchment_hectares    19496 non-null  object 
 8   watershed_hectares    19496 non-null  object 
 9   tploadrate_total_ws   19496 non-null  object 
 10  tnloadrate_total_ws   19496 non-null  object 
 11  tssloadrate_total_ws  19496 non-null  object 
 12  maflowv               19496 non-null  object 
 13  geom                  19266 non-null  object 
 14  geom_catchment        19496 non-null  object 
 15  cluster            
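
If the `psycopg2` route is kept, the `object` columns can be converted to floats after loading. The sketch below assumes the numeric columns arrive as `decimal.Decimal` values (as the `_dbdata` output earlier suggests for numeric database columns); the small frame is illustrative only.

```python
from decimal import Decimal

import pandas as pd

# Illustrative frame with Decimal values stored in object columns
df = pd.DataFrame({
    "comid": [1, 2],
    "tploadrate_total": [Decimal("59.6058"), Decimal("54.4192")],
    "maflowv": [Decimal("38.689"), Decimal("3.836")],
})

numeric_cols = ["tploadrate_total", "maflowv"]
df = df.astype({col: float for col in numeric_cols})
dtypes = df.dtypes.astype(str).to_dict()
```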

In [16]:
base_df.head(3)

,comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,...,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
0,9483444,59.6058,0.0172,2106.3224,1.1446,182057.2388,23.0532,949.6402,8043.03,0.0739,...,99.094,38.689,01050000E06A7F00000100000001020000C01F0000005F...,01060000206A7F000001000000010300000001000000A1...,Kirkwood - Cohansey Aquifer,,,112251.0,112277.0,20402060506
1,9483428,54.4192,0.0159,2620.0121,0.7643,64651.6999,18.8605,797.1225,797.76,0.0683,...,81.0415,3.836,01050000E06A7F00000100000001020000C0270000003F...,01060000206A7F0000010000000103000000010000003E...,Kirkwood - Cohansey Aquifer,,,112277.0,112277.0,20402060506
2,9486456,0.2163,-9999.0,3.0331,-9999.0,1638.9457,-9999.0,17.536,98077.23,-10.4769,...,-10.4769,0.115,01050000E06A7F00000100000001020000C008000000A7...,01060000206A7F0000010000000103000000010000002F...,Kirkwood - Cohansey Aquifer,,,112180.0,112184.0,20402060507


## Read from Database using SQLAlchemy and GeoPandas
An alternative way to connect, using SQLAlchemy, written by Sarah.  
Benefits include:
- Doesn't require knowledge of column names
- Correctly auto-parses datatypes
- Auto-converts the column named 'geom' from a string into a geometry datatype compatible with GeoPandas and most plotting libraries.

In [17]:
%%time
# Connect to database with sqlalchemy
from sqlalchemy import create_engine  

db_connection_url = "postgresql://{}:{}@{}:{}/{}".format(_user[0], _password[0], _host[0], _port, _database[0])
con = create_engine(db_connection_url)  

CPU times: user 13.9 ms, sys: 3.35 ms, total: 17.3 ms
Wall time: 22.4 ms
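
A connection URL built with plain string formatting breaks if the password contains characters such as `@`, `:` or `/`. One common remedy is to percent-encode the credentials with the standard library first. The helper `build_postgres_url` and the credentials below are hypothetical.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database)


# Placeholder credentials containing characters that need escaping
url = build_postgres_url("example_user", "p@ss:word/1", "localhost", 5432, "example_db")
```

The resulting string can be passed to `create_engine` in place of `db_connection_url`.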


In [18]:
%%time
# import data with geopandas compatible geometry
import geopandas as gpd 

base_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_model_select = 'SELECT * FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df = gpd.read_postgis(base_model_select, con)
rest_df = gpd.read_postgis(rest_model_select, con)

CPU times: user 2.62 s, sys: 1.52 s, total: 4.14 s
Wall time: 45 s


In [19]:
base_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   comid                 19496 non-null  int64   
 1   tploadrate_total      19496 non-null  float64 
 2   tploadate_conc        19496 non-null  float64 
 3   tnloadrate_total      19496 non-null  float64 
 4   tnloadate_conc        19496 non-null  float64 
 5   tssloadrate_total     19496 non-null  float64 
 6   tssloadate_conc       19496 non-null  float64 
 7   catchment_hectares    19496 non-null  float64 
 8   watershed_hectares    19496 non-null  float64 
 9   tploadrate_total_ws   19496 non-null  float64 
 10  tnloadrate_total_ws   19496 non-null  float64 
 11  tssloadrate_total_ws  19496 non-null  float64 
 12  maflowv               19496 non-null  float64 
 13  geom                  19266 non-null  geometry
 14  geom_catchment        19496 non-null  object  

In [20]:
rest_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   comid                 19496 non-null  int64   
 1   tploadrate_total      19496 non-null  float64 
 2   tploadate_conc        19496 non-null  float64 
 3   tnloadrate_total      19496 non-null  float64 
 4   tnloadate_conc        19496 non-null  float64 
 5   tssloadrate_total     19496 non-null  float64 
 6   tssloadate_conc       19496 non-null  float64 
 7   catchment_hectares    19496 non-null  float64 
 8   watershed_hectares    19496 non-null  float64 
 9   tploadrate_total_ws   19496 non-null  float64 
 10  tnloadrate_total_ws   19496 non-null  float64 
 11  tssloadrate_total_ws  19496 non-null  float64 
 12  maflowv               19496 non-null  float64 
 13  geom                  19266 non-null  geometry
 14  geom_catchment        19496 non-null  object  

In [21]:
base_df.head(3)

,comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,...,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
0,9483444,59.6058,0.0172,2106.3224,1.1446,182057.2388,23.0532,949.6402,8043.03,0.0739,...,99.094,38.689,MULTILINESTRING Z ((504833.558 4358195.759 0.0...,01060000206A7F000001000000010300000001000000A1...,Kirkwood - Cohansey Aquifer,,,112251.0,112277.0,20402060506
1,9483428,54.4192,0.0159,2620.0121,0.7643,64651.6999,18.8605,797.1225,797.76,0.0683,...,81.0415,3.836,MULTILINESTRING Z ((505062.199 4363690.254 0.0...,01060000206A7F0000010000000103000000010000003E...,Kirkwood - Cohansey Aquifer,,,112277.0,112277.0,20402060506
2,9486456,0.2163,-9999.0,3.0331,-9999.0,1638.9457,-9999.0,17.536,98077.23,-10.4769,...,-10.4769,0.115,MULTILINESTRING Z ((498393.111 4341809.157 0.0...,01060000206A7F0000010000000103000000010000002F...,Kirkwood - Cohansey Aquifer,,,112180.0,112184.0,20402060507


### Set Index to COMID
For easier search and error-free merging.  
Doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html

In [22]:
# Set index to COMID
base_df.set_index('comid', inplace=True)
rest_df.set_index('comid', inplace=True)

### Sort Index
For performance.  
Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

In [44]:
# Sort Index
base_df.sort_index(inplace=True)
rest_df.sort_index(inplace=True)
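
A quick way to confirm that the index is sorted is `Index.is_monotonic_increasing`. The sketch below uses a three-row illustrative frame rather than the real tables.

```python
import pandas as pd

# Illustrative frame with an unsorted comid index
df = pd.DataFrame({"maflowv": [38.689, 3.836, 0.115]},
                  index=pd.Index([9483444, 9483428, 9486456], name="comid"))

was_sorted = df.index.is_monotonic_increasing
df = df.sort_index()
is_sorted = df.index.is_monotonic_increasing

# Label-based range slicing (inclusive on both ends) over the sorted index
subset = df.loc[9483428:9483444]
```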

In [37]:
base_df.index

Int64Index([  1748535,   1748537,   1748539,   1748541,   1748543,   1748545,
              1748547,   1748549,   1748551,   1748553,
            ...
            932040361, 932040362, 932040363, 932040364, 932040365, 932040366,
            932040367, 932040368, 932040369, 932040370],
           dtype='int64', name='comid', length=19496)

In [38]:
rest_df.index

Int64Index([  1748535,   1748537,   1748539,   1748541,   1748543,   1748545,
              1748547,   1748549,   1748551,   1748553,
            ...
            932040361, 932040362, 932040363, 932040364, 932040365, 932040366,
            932040367, 932040368, 932040369, 932040370],
           dtype='int64', name='comid', length=19496)

### Convert Nord & NordStop to Integers
For improved performance, memory use, and compression.  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html

NOTE: this conversion does not work with a plain integer dtype because the columns contain NaN values.
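
One possible workaround for the NaN limitation is pandas' nullable integer dtype `Int64` (capital "I"), which stores missing values as `<NA>`. This is a sketch on illustrative data. Whether downstream tools used with these tables handle nullable integers has not been checked here.

```python
import numpy as np
import pandas as pd

# Illustrative frame: whole-number floats with missing values
df = pd.DataFrame({"nord": [112251.0, np.nan, 112180.0],
                   "nordstop": [112277.0, 112277.0, np.nan]})

for col in ["nord", "nordstop"]:
    df[col] = df[col].astype("Int64")  # nullable integer dtype

nord_dtype = str(df["nord"].dtype)
n_missing = int(df["nord"].isna().sum())
```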

### Convert String Objects to Categoricals
For improved performance, memory use, and compression.  
See https://anaconda.org/TomAugspurger/pandas-performance/notebook

Doc: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
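
The memory benefit can be checked directly with `Series.memory_usage(deep=True)`. The sketch below repeats three of the cluster names many times to mimic a low-cardinality column. The row count is arbitrary.

```python
import pandas as pd

clusters = ["Middle Schuylkill", "Upper Lehigh", "Schuylkill Highlands"]

# 3000 rows but only three distinct values
as_object = pd.Series(clusters * 1000, dtype=object)
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
```

A categorical stores each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.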

In [None]:
%%time
base_df.huc12.count()

In [58]:
base_df.cluster = base_df.cluster.astype('category')
base_df.fa_name = base_df.fa_name.astype('category')
base_df.huc12   = base_df.huc12.astype('category')

rest_df.cluster = rest_df.cluster.astype('category')
rest_df.fa_name = rest_df.fa_name.astype('category')
rest_df.huc12   = rest_df.huc12.astype('category')

In [59]:
base_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   tploadrate_total      19496 non-null  float64 
 1   tploadate_conc        19496 non-null  float64 
 2   tnloadrate_total      19496 non-null  float64 
 3   tnloadate_conc        19496 non-null  float64 
 4   tssloadrate_total     19496 non-null  float64 
 5   tssloadate_conc       19496 non-null  float64 
 6   catchment_hectares    19496 non-null  float64 
 7   watershed_hectares    19496 non-null  float64 
 8   tploadrate_total_ws   19496 non-null  float64 
 9   tnloadrate_total_ws   19496 non-null  float64 
 10  tssloadrate_total_ws  19496 non-null  float64 
 11  maflowv               19496 non-null  float64 
 12  geom                  19266 non-null  geometry
 13  geom_catchment        19496 non-null  object  
 14  cluster               17358 non-null

In [61]:
%%time
base_df.huc12.count()

CPU times: user 186 µs, sys: 31 µs, total: 217 µs
Wall time: 214 µs


19496

## Convert to Reach-Specific GeoDataFrame

A GeoDataFrame has only one active geometry column, and the auto-import defaulted to the reach polyline (`geom`). So create new DataFrames for the reaches and drop the catchment-specific columns.

In [48]:
base_df_reach = base_df.drop(['tploadrate_total',
                              'tnloadrate_total',
                              'tssloadrate_total',
                              'tploadrate_total_ws',
                              'tnloadrate_total_ws',
                              'tssloadrate_total_ws',
                              'geom_catchment',
                             ], axis=1)
rest_df_reach = rest_df.drop(['tploadrate_total',
                              'tnloadrate_total',
                              'tssloadrate_total',
                              'tploadrate_total_ws',
                              'tnloadrate_total_ws',
                              'tssloadrate_total_ws',
                              'geom_catchment',
                             ], axis=1)

In [49]:
base_df_reach.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tploadate_conc      19496 non-null  float64 
 1   tnloadate_conc      19496 non-null  float64 
 2   tssloadate_conc     19496 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19266 non-null  geometry
 7   cluster             17358 non-null  object  
 8   fa_name             186 non-null    object  
 9   sub_focusarea       186 non-null    float64 
 10  nord                18870 non-null  float64 
 11  nordstop            18844 non-null  float64 
 12  huc12               19496 non-null  object  
dtypes: float64(9), geometry(1), object(3)
memory usage: 2.1+ MB


In [50]:
rest_df_reach

comid,tploadate_conc,tnloadate_conc,tssloadate_conc,catchment_hectares,watershed_hectares,maflowv,geom,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
1748535,0.0226,0.2643,9.9983,6496.7052,6501.69,43.699,MULTILINESTRING Z ((539681.467 4689416.166 0.0...,drb,,,74914.0,74914.0,020401020302
1748537,0.0297,0.3166,8.8103,1663.1712,1664.46,11.189,MULTILINESTRING Z ((532825.185 4684387.302 0.0...,drb,,,74913.0,74913.0,020401020302
1748539,0.0350,0.2808,11.6874,1639.4128,1640.70,11.223,MULTILINESTRING Z ((524096.535 4677200.269 0.0...,drb,,,74921.0,74921.0,020401020305
1748541,0.0245,0.2703,10.2217,3013.8348,12912.30,86.528,MULTILINESTRING Z ((533110.692 4677277.923 0.0...,drb,,,74911.0,74915.0,020401020302
1748543,0.0307,0.2661,10.8301,1151.0990,5232.87,35.389,MULTILINESTRING Z ((526672.244 4673109.451 0.0...,drb,,,74920.0,74922.0,020401020305
...,...,...,...,...,...,...,...,...,...,...,...,...,...
932040366,-9999.0000,-9999.0000,-9999.0000,2124.7248,2720941.47,17802.923,,drb,,,65070.0,76964.0,020402060103
932040367,-9999.0000,-9999.0000,-9999.0000,788.7859,2717821.26,17788.281,,drb,,,65079.0,76964.0,020402060103
932040368,-9999.0000,-9999.0000,-9999.0000,265.0275,2716120.08,17780.448,,drb,,,65080.0,76960.0,020402060103
932040369,-9999.0000,-9999.0000,-9999.0000,1106.5294,2889095.67,18624.999,,drb,,,64232.0,76965.0,020402040000


### Read Again to Set Catchment Geometry
This reruns the `gpd.read_postgis()` function with the optional argument `geom_col="geom_catchment"`, as a workaround for converting both the flowlines and the catchment boundaries to geometry datatypes.

In [32]:
%%time
# get catchment geometry
base_catch = 'SELECT comid, geom_catchment FROM wikiwtershedoutputs.{}'.format(tablename_base)
rest_catch = 'SELECT comid, geom_catchment FROM wikiwtershedoutputs.{}'.format(tablename_rest)

base_df_catch = gpd.read_postgis(base_catch, con, geom_col="geom_catchment")
rest_df_catch = gpd.read_postgis(rest_catch, con, geom_col="geom_catchment")

CPU times: user 4.19 s, sys: 1.02 s, total: 5.2 s
Wall time: 31 s


In [33]:
base_df_catch.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 19496 entries, 0 to 19495
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   comid           19496 non-null  int64   
 1   geom_catchment  19496 non-null  geometry
dtypes: geometry(1), int64(1)
memory usage: 304.8 KB


In [34]:
# Set index to comid
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
base_df_catch.set_index('comid', inplace=True)
rest_df_catch.set_index('comid', inplace=True)

In [35]:
base_df_catch.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 9483444 to 9455335
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   geom_catchment  19496 non-null  geometry
dtypes: geometry(1)
memory usage: 304.6 KB


### Replace `geom_catchment` Object with Geometry DType

In [28]:
# merge
base_df['geom_catchment'] = base_df_catch['geom_catchment']
rest_df['geom_catchment'] = rest_df_catch['geom_catchment']
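
The column assignment above works even though the two frames are in different row orders, because pandas aligns on the index (`comid`) rather than on row position. A minimal sketch on illustrative data:

```python
import pandas as pd

left = pd.DataFrame({"maflowv": [0.851, 4.095, 46.551]},
                    index=pd.Index([9480578, 9480748, 9481982], name="comid"))

# Same COMIDs in a different row order, with made-up labels
right = pd.DataFrame({"label": ["c", "a", "b"]},
                     index=pd.Index([9481982, 9480578, 9480748], name="comid"))

left["label"] = right["label"]  # aligned on comid, not on row position
labels = left["label"].tolist()
```

Any COMID missing from the right-hand frame would receive a missing value rather than shifting the other rows.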

In [32]:
base_df.head(3)

comid,tploadrate_total,tploadate_conc,tnloadrate_total,tnloadate_conc,tssloadrate_total,tssloadate_conc,catchment_hectares,watershed_hectares,tploadrate_total_ws,tnloadrate_total_ws,tssloadrate_total_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
9480578,137.9295,0.1814,529.1511,0.6958,137638.1583,180.9927,194.6102,194.76,0.7083,2.7168,706.7063,0.851,MULTILINESTRING Z ((464640.526 4392120.603 0.0...,"MULTIPOLYGON (((464174.332 4390560.985, 464180...",Kirkwood - Cohansey Aquifer,,,64056.0,64056.0,20402060102
9480748,1.1351,0.1795,35.3758,4.4193,1567.6093,120.4098,2.8777,840.51,0.7815,19.2403,524.2289,4.095,MULTILINESTRING Z ((475214.244 4387090.507 0.0...,"MULTIPOLYGON (((475252.372 4387049.060, 475163...",Kirkwood - Cohansey Aquifer,,,64097.0,64099.0,20402060102
9481982,145.5457,0.1069,571.2896,3.5045,117610.9142,103.1548,199.286,8919.54,0.4986,16.3441,481.0876,46.551,MULTILINESTRING Z ((466629.746 4389974.935 0.0...,"MULTIPOLYGON (((465685.660 4389709.497, 465712...",Kirkwood - Cohansey Aquifer,,,64065.0,64131.0,20402060102


In [43]:
rest_df.head(3)

Unnamed: 0_level_0,tp_load,tp_conc,tn_load,tn_conc,tss_load,tss_conc,catchment_hectares,watershed_hectares,tp_loadrate_ws,tn_loadrate_ws,tss_loadrate_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
comid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
2612952,0.6351,0.02,11.5479,0.2536,1173.0077,8.8315,6.3851,981.0,0.1216,1.5422,53.7069,6.676,MULTILINESTRING Z ((531509.616 4697096.509 0.0...,"MULTIPOLYGON (((531066.887 4697023.252, 531073...",drb,,,74295.0,74299.0,20401010101
2612782,51.2284,0.0188,758.0229,0.2414,22417.1069,7.4794,524.1138,906.39,0.1147,1.4734,45.6521,6.191,MULTILINESTRING Z ((531592.400 4699047.630 0.0...,"MULTIPOLYGON (((531741.244 4696996.221, 531747...",drb,,,74297.0,74299.0,20401010101
2612920,797.8818,0.0373,8339.1431,0.4056,366215.4767,17.1041,3191.3462,4174.83,0.217,2.3596,99.5046,27.179,MULTILINESTRING Z ((531474.494 4696977.042 0.0...,"MULTIPOLYGON (((526550.250 4691030.315, 526556...",drb,,,74294.0,74299.0,20401010101


### Correct Names

**Note on `tploadrate_total`, etc.**  
The microservice output labels this column "tploadrate_total"; however, Mike suspected this is a mistake and that it is actually the local NHDplus catchment annual load (kg/ha). We confirmed this by comparing to Model My Watershed subbasin modeling output for specific COMIDs.
- The stream concentrations take into account the upstream watershed load, but this value is not returned. For convenience, Mike calculated it when exporting the microservice results to PG and saved it as `tploadrate_total_ws`, etc.

This code block assigns more accurate, shorter names to selected columns.

Docs for `DataFrame.rename()`:
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
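
One behaviour of `DataFrame.rename()` worth keeping in mind: by default, mapping keys that do not match any column are silently ignored, so running the rename on a frame whose columns were already renamed does nothing. Passing `errors="raise"` turns that into an explicit error. The sketch below uses a toy frame with made-up values.

```python
import pandas as pd

# Toy frame whose columns already carry the short names (hypothetical values).
df = pd.DataFrame({"tp_load": [1.0], "tp_conc": [0.02]})
mapping = {"tploadrate_total": "tp_load", "tploadate_conc": "tp_conc"}

# Default behaviour: keys absent from the columns are ignored, so this is a no-op.
unchanged = df.rename(columns=mapping)

# With errors="raise", the missing source columns are reported instead.
try:
    df.rename(columns=mapping, errors="raise")
    raised = False
except KeyError as exc:
    raised = True
    print("rename failed:", exc)
```

Using `errors="raise"` during development can help catch misspelled source column names early.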

In [47]:
base_df.rename(columns={'tploadrate_total':'tp_load',
                        'tploadate_conc':'tp_conc',
                        'tnloadrate_total':'tn_load',
                        'tnloadate_conc':'tn_conc',
                        'tssloadrate_total':'tss_load',
                        'tssloadate_conc':'tss_conc',
                        'tploadrate_total_ws':'tp_loadrate_ws',
                        'tnloadrate_total_ws':'tn_loadrate_ws',
                        'tssloadrate_total_ws':'tss_loadrate_ws',
                       },
               inplace=True,
              )
base_df.head(3)

Unnamed: 0_level_0,tp_load,tp_conc,tn_load,tn_conc,tss_load,tss_conc,catchment_hectares,watershed_hectares,tp_loadrate_ws,tn_loadrate_ws,tss_loadrate_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
comid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1748535,881.4618,0.0226,10322.4299,0.2643,390433.1419,9.9983,6496.7052,6501.69,0.1357,1.5874,60.0509,43.699,MULTILINESTRING Z ((539681.467 4689416.166 0.0...,"MULTIPOLYGON (((535287.346 4678015.900, 535140...",drb,,,74914.0,74914.0,20401020302
1748537,296.6355,0.0297,3165.6081,0.3166,88090.7401,8.8103,1663.1712,1664.46,0.1784,1.9019,52.9245,11.189,MULTILINESTRING Z ((532825.185 4684387.302 0.0...,"MULTIPOLYGON (((532639.560 4678753.867, 532610...",drb,,,74913.0,74913.0,20401020302
1748539,350.9217,0.035,2816.4257,0.2808,117212.516,11.6874,1639.4128,1640.7,0.2139,1.7164,71.4407,11.223,MULTILINESTRING Z ((524096.535 4677200.269 0.0...,"MULTIPOLYGON (((525043.074 4672557.985, 525013...",drb,,,74921.0,74921.0,20401020305


In [48]:
rest_df.rename(columns={'tploadrate_total':'tp_load',
                        'tploadate_conc':'tp_conc',
                        'tnloadrate_total':'tn_load',
                        'tnloadate_conc':'tn_conc',
                        'tssloadrate_total':'tss_load',
                        'tssloadate_conc':'tss_conc',
                        'tploadrate_total_ws':'tp_loadrate_ws',
                        'tnloadrate_total_ws':'tn_loadrate_ws',
                        'tssloadrate_total_ws':'tss_loadrate_ws',
                       },
               inplace=True,
              )
rest_df.head(3)

Unnamed: 0_level_0,tp_load,tp_conc,tn_load,tn_conc,tss_load,tss_conc,catchment_hectares,watershed_hectares,tp_loadrate_ws,tn_loadrate_ws,tss_loadrate_ws,maflowv,geom,geom_catchment,cluster,fa_name,sub_focusarea,nord,nordstop,huc12
comid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1748535,881.4618,0.0226,10322.4299,0.2643,390433.1419,9.9983,6496.7052,6501.69,0.1357,1.5874,60.0509,43.699,MULTILINESTRING Z ((539681.467 4689416.166 0.0...,"MULTIPOLYGON (((535287.346 4678015.900, 535140...",drb,,,74914.0,74914.0,20401020302
1748537,296.6355,0.0297,3165.6081,0.3166,88090.7401,8.8103,1663.1712,1664.46,0.1784,1.9019,52.9245,11.189,MULTILINESTRING Z ((532825.185 4684387.302 0.0...,"MULTIPOLYGON (((532639.560 4678753.867, 532610...",drb,,,74913.0,74913.0,20401020302
1748539,350.9217,0.035,2816.4257,0.2808,117212.516,11.6874,1639.4128,1640.7,0.2139,1.7164,71.4407,11.223,MULTILINESTRING Z ((524096.535 4677200.269 0.0...,"MULTIPOLYGON (((525043.074 4672557.985, 525013...",drb,,,74921.0,74921.0,20401020305


### Add Units

Loads are in kg/y.  
Concentrations are in mg/L.
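
Because the FieldDoc and point-source tables elsewhere in this notebook report loads in pounds (columns ending in `_lbs`, `_lbyr`, or starting with `lbs`), comparing them with the kg/y loads here needs a unit conversion. A minimal sketch follows; the helper name `lb_to_kg` is an assumption for illustration, and the check value comes from the first row of the point-source table printed later in this notebook (`lbsn_yr` = 84.591269, `kgn_yr` = 38.369923).

```python
LB_TO_KG = 0.45359237  # definition of the avoirdupois pound in kilograms


def lb_to_kg(pounds):
    """Convert a mass (or annual load) from pounds to kilograms."""
    return pounds * LB_TO_KG


# Nitrogen load for the first point-source row: lbsn_yr -> kg/y.
kgn = lb_to_kg(84.591269)
print(round(kgn, 3))
```

The stored `kgn_yr` value agrees to within rounding, which suggests the table was built with a slightly rounded conversion factor.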

Options for the future:
- https://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe
- https://pandas.pydata.org/docs/reference/api/pandas.Series.attrs.html
- https://stackoverflow.com/questions/39419178/how-can-i-manage-units-in-pandas-data
- https://github.com/hgrecco/pint
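
As a rough illustration of the `attrs` option listed above, units could be recorded in a dictionary on the dataframe itself. The `"units"` key and the `unit_of` helper are hypothetical conventions, not part of pandas; `attrs` is marked experimental in the pandas documentation, and whether it survives a given operation or file round-trip depends on the pandas version.

```python
import pandas as pd

# Toy frame with two of the renamed columns (values copied from a table above).
df = pd.DataFrame({"tp_load": [881.4618], "tp_conc": [0.0226]})

# Hypothetical convention: a "units" dict keyed by column name.
df.attrs["units"] = {"tp_load": "kg/y", "tp_conc": "mg/L"}


def unit_of(frame, column):
    """Return the recorded unit for a column, or None if none was recorded."""
    return frame.attrs.get("units", {}).get(column)


print(unit_of(df, "tp_load"))
print(unit_of(df, "maflowv"))
```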

In [88]:
base_df.attrs

{}

In [89]:
base_df.tp_load.attrs

{}

## Get the Protection Projects "Avoided Loads"

Returns the results for the protection projects uploaded into FieldDoc. Protection projects were exploded from multi-polygon to polygon, then intersected with NHDplus COMIDs.

The avoided load is taken as the developed load minus the forested load. This assumes 100% development to Medium Intensity (NLCD class 23). To keep the table simple, only the average forested loading rate is provided (the average of NLCD codes 41, 42, 43); however, each land cover class's loading rate is explicitly used in the "avoided" load calculation.
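
The subtraction described above can be checked against the first row of the `prot_df.head()` output shown further down (columns `parcel_dev*load_lbyr`, `parcel_fore*load_lbyr`, and `parcel_*load_lbyr_avoided`). The function name `avoided_load` is an assumption used only for this sketch.

```python
def avoided_load(developed_lb_per_yr, forested_lb_per_yr):
    """Avoided load = load if developed minus load if kept forested (lb/yr)."""
    return developed_lb_per_yr - forested_lb_per_yr


# Values as printed for practice_i 5289, comid 4185445.
tss_avoided = avoided_load(175779.645, 9187.118)
tn_avoided = avoided_load(205.6, 17.768)
tp_avoided = avoided_load(76.994, 4.23)

print(round(tss_avoided, 3), round(tn_avoided, 3), round(tp_avoided, 3))
```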

Primary Key
- practice_i = FieldDoc (FD) primary key ID
- rn = Row number of the polygon within the multipolygon submitted to FD as a project
- comid = NHDplus COMID intersecting with the protection polygon

In [5]:
%%time

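# `_user`, `_password`, `_host`, and `_database` are one-element tuples
# (the config cell assigns them with trailing commas), hence the `[0]`; `_port` is a plain value.
db_connection_url = "postgresql://{}:{}@{}:{}/{}".format(_user[0], _password[0], _host[0], _port, _database[0])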
con = create_engine(db_connection_url)  

tablename_prot = 'protection_lbsavoided_fd'

prot_select = 'SELECT * FROM datapolassess.{}'.format(tablename_prot)

prot_df = gpd.read_postgis(prot_select, con)

Wall time: 1.02 s
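
If the database password ever contains characters such as `@` or `/`, a connection URL built by plain string formatting can be parsed incorrectly. A possible safeguard is to percent-encode the user name and password first. The sketch below uses only the standard library; the function name `build_pg_url` and all credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, port, database):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Made-up credentials, chosen to include characters that need escaping.
url = build_pg_url("drwi_user", "p@ss/word", "db.example.org", 5432, "drwi")
print(url)
```

The resulting string can be passed to `create_engine()` in the same way as the URL built above.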


In [6]:
prot_df.head()

Unnamed: 0,practice_i,practice_n,rn,comid,practice_t,practice_d,project_na,project_st,creator_na,program_id,...,parcel_devtssload_lbyr,parcel_foretssload_lbyr,parcel_tssload_lbyr_avoided,parcel_devtnload_lbyr,parcel_foretnload_lbyr,parcel_tnload_lbyr_avoided,parcel_devtpload_lbyr,parcel_foretpload_lbyr,parcel_tpload_lbyr_avoided,geom
0,5289.0,Bear Creek,1,4185445,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,175779.645,9187.118,166592.527,205.6,17.768,187.832,76.994,4.23,72.764,"POLYGON ((-75.79484 41.17775, -75.79468 41.177..."
1,5289.0,Bear Creek,1,4185461,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,159240.637,8322.707,150917.93,186.255,16.096,170.159,69.75,3.832,65.918,"POLYGON ((-75.78786 41.18276, -75.78756 41.182..."
2,5289.0,Bear Creek,1,4185483,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,235688.406,12318.247,223370.159,275.672,23.823,251.849,103.235,5.672,97.563,"POLYGON ((-75.81161 41.17841, -75.81176 41.181..."
3,5289.0,Bear Creek,1,4185485,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,48109.916,2514.463,45595.453,56.272,4.863,51.409,21.073,1.158,19.915,"POLYGON ((-75.80381 41.17354, -75.80405 41.173..."
4,5289.0,Bear Creek,1,4185505,Fee acquisition,Acquisition,Bear Creek - Crystal Lake,complete,Dawn Gorham,5.0,...,40026.507,2083.105,37943.402,34.93,3.776,31.154,16.804,0.944,15.86,"POLYGON ((-75.79805 41.17363, -75.80381 41.173..."


In [7]:
prot_df.practice_i = prot_df.practice_i.astype('int64')

In [8]:
prot_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 269 entries, 0 to 268
Data columns (total 33 columns):
practice_i                      269 non-null int64
practice_n                      269 non-null object
rn                              269 non-null int64
comid                           269 non-null int64
practice_t                      269 non-null object
practice_d                      44 non-null object
project_na                      269 non-null object
project_st                      269 non-null object
creator_na                      269 non-null object
program_id                      269 non-null float64
program_na                      269 non-null object
created                         269 non-null object
modified                        269 non-null object
practice_url                    269 non-null object
project_url                     269 non-null object
huc12                           269 non-null object
area_acres                      269 non-null float64

## Get the Restoration Projects intersected by COMID

In [40]:
%%time
tablename_rest_projects = 'restoration_lbsreduced'
rest_proj_select = 'SELECT * FROM datapolassess.{}'.format(tablename_rest_projects)
rest_proj_df = gpd.read_postgis(rest_proj_select, con)

Wall time: 1.79 s


In [41]:
rest_proj_df.head()

Unnamed: 0,comid,practice_i,site_id,practice_t,tn_reduced_lbs,tp_reduced_lbs,tss_reduced_lbs,practice_n,practice_d,project_na,project_st,creator_na,program_id,program_na,created,modified,practice_url,project_url,geom
0,2583213,5396.0,2864.0,Forest Buffer,2.8559,0.6456,331.0807,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1.0,Delaware River Restoration Fund,2019-01-24T20:01:44.943344,2021-07-23T14:43:03.668156,https://www.fielddoc.org/practices/5396,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((521926.600 4553120.204, 521932..."
1,2583213,5399.0,2865.0,Forest Buffer,14.13,3.44,1438.77,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1.0,Delaware River Restoration Fund,2019-01-24T20:09:22.592579,2021-08-10T16:07:21.248677,https://www.fielddoc.org/practices/5399,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522305.921 4553660.720, 522278..."
2,2583231,5396.0,2864.0,Forest Buffer,33.6841,7.6144,3904.9093,Forest buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1.0,Delaware River Restoration Fund,2019-01-24T20:01:44.943344,2021-07-23T14:43:03.668156,https://www.fielddoc.org/practices/5396,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522395.577 4552892.342, 522393..."
3,2583231,5435.0,2896.0,Forest Buffer,1.91,0.4,222.04,Forest Buffer,,EZG #53363 Restoring Paulins Kill Floodplain F...,active,Michelle DiBlasio,1.0,Delaware River Restoration Fund,2019-01-25T18:21:34.588826,2021-08-10T16:06:30.564327,https://www.fielddoc.org/practices/5435,https://www.fielddoc.org/projects/2737,"MULTIPOLYGON (((522983.863 4552844.403, 522966..."
4,2583337,5416.0,2853.0,Forest Buffer,162.3818,28.1236,12496.2614,Forest buffer,,EZG #61049 Restoring Paulins Kill Floodplain F...,active,Kristine Rogers,1.0,Delaware River Restoration Fund,2019-01-24T21:19:21.966215,2021-03-18T13:38:27.015039,https://www.fielddoc.org/practices/5416,https://www.fielddoc.org/projects/2591,"MULTIPOLYGON (((521893.017 4545756.661, 522185..."


In [42]:
rest_proj_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 417 entries, 0 to 416
Data columns (total 19 columns):
comid              417 non-null int64
practice_i         417 non-null float64
site_id            406 non-null float64
practice_t         417 non-null object
tn_reduced_lbs     417 non-null float64
tp_reduced_lbs     417 non-null float64
tss_reduced_lbs    417 non-null float64
practice_n         417 non-null object
practice_d         214 non-null object
project_na         417 non-null object
project_st         417 non-null object
creator_na         417 non-null object
program_id         417 non-null float64
program_na         417 non-null object
created            417 non-null object
modified           417 non-null object
practice_url       417 non-null object
project_url        417 non-null object
geom               417 non-null geometry
dtypes: float64(6), geometry(1), int64(1), object(11)
memory usage: 62.0+ KB


## Get the Point Sources by NPDES ID and COMID

In [37]:
%%time
tablename_pointsource = 'ms_pointsource_drb_12_13_18'
pointsource_select = 'SELECT * FROM wikiwtershed.{}'.format(tablename_pointsource)
point_source_df = gpd.read_postgis(pointsource_select, con)

Wall time: 359 ms


In [38]:
point_source_df.head()

Unnamed: 0,ogc_fid,geom,npdes_id,city,state,latitude,longitude,huc12,avg_n_conc,lbsn_yr,mgd,avgpconc,lbsp_yr,kgn_yr,kgp_yr,facilityname,comid
0,1,POINT (414070.808 4469871.626),PA0033995,BERN TWP,PA,40.375,-76.012222,20402030409,0.191,84.591269,0.16305,0.191,1325.042759,38.369923,601.028795,COUNTY OF BERKS WWTP,4783187
1,1,POINT (450085.743 4495200.414),PA0051811,SOUTH WHITEHALL1TWP,PA,40.606111,-75.59,20401060703,35.0,32.011082,0.0003,35.0,2.733729,14.519971,1.239998,LEHIGH COUNTY AUTH,4187751
2,1,POINT (443086.168 4439655.703),1594403,WEST VINCENT TWP,PA,40.105277,-75.667777,20402031003,,,0.0,,,,,MATTHEWS MEADOWS STP,4781791
3,1,POINT (444600.842 4439582.877),1592417,WEST VINCENT TWP,PA,40.104722,-75.65,20402031003,0.0,0.0,0.0,0.0,,0.0,,SAINT STEPHEN'S GREENE STP,4782621
4,1,POINT (499574.949 4511288.150),NJ0065196,Township of Washington,NJ,40.752548,-75.005035,20401050401,7.175,155.557987,0.000677,7.175,5.599735,70.559859,2.539995,390 RT 57,2588253


In [39]:
point_source_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 812 entries, 0 to 811
Data columns (total 17 columns):
ogc_fid         812 non-null int64
geom            812 non-null geometry
npdes_id        812 non-null object
city            789 non-null object
state           812 non-null object
latitude        812 non-null float64
longitude       812 non-null float64
huc12           812 non-null object
avg_n_conc      811 non-null float64
lbsn_yr         811 non-null float64
mgd             812 non-null float64
avgpconc        811 non-null float64
lbsp_yr         809 non-null float64
kgn_yr          811 non-null float64
kgp_yr          809 non-null float64
facilityname    812 non-null object
comid           812 non-null int64
dtypes: float64(9), geometry(1), int64(2), object(5)
memory usage: 108.0+ KB


## Save to Parquet Files
The code below saves the data to local Parquet files so that the visualization script does not need to query the database on every run.  

Apache Parquet is a compressed, columnar binary format that is widely used for storing dataframes.
- https://pandas.pydata.org/docs/user_guide/io.html#io-parquet
- https://anaconda.org/TomAugspurger/pandas-performance/notebook
- https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html

NOTE: The current version of GeoPandas only supports the `pyarrow` engine. In future implementations, look for integration with `fastparquet` engine/library, which is implemented in Pandas and more tightly integrated with numba and dask. See https://fastparquet.readthedocs.io

In [80]:
# Find current working directory
from pathlib import Path
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/WikiSRATMicroService')

In [82]:
# Set paths relative to the working directory, so this works for anyone running from a clone of the GitHub repository
project_folder = Path.cwd()
data_folder    = Path('data/')
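
Writing a file into a folder that does not exist yet will generally fail, so it may be worth creating the `data/` folder before saving. The sketch below shows the `pathlib` pattern; a temporary directory stands in for the project folder so the example has no side effects, which is an assumption made only for illustration.

```python
from pathlib import Path
import tempfile

# Temporary directory as a stand-in for the project folder (Path.cwd() in the notebook).
project_folder = Path(tempfile.mkdtemp())
data_folder = Path("data/")

# Join the parts with `/` and create the folder if it is missing.
out_dir = project_folder / data_folder
out_dir.mkdir(parents=True, exist_ok=True)

# Full path that a later to_parquet() call would write to.
target = out_dir / "base_df.parquet"
print(target.name, out_dir.is_dir())
```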

In [83]:
%%time

# requires pyarrow
import pyarrow

base_df.to_parquet(project_folder / data_folder /'base_df.parquet')
rest_df.to_parquet(project_folder / data_folder /'rest_df.parquet')
prot_df.to_parquet(project_folder / data_folder /'prot_df.parquet')


This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.



CPU times: user 3.38 s, sys: 108 ms, total: 3.48 s
Wall time: 4.08 s




In [13]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
prot_df.to_parquet(project_folder / data_folder /'prot_df.parquet')

Wall time: 135 ms




In [43]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
rest_proj_df.to_parquet(project_folder / data_folder /'rest_proj_df.parquet')

Wall time: 95 ms




In [None]:
%%time
project_folder = Path.cwd()
data_folder    = Path('data/')
point_source_df.to_parquet(project_folder / data_folder /'point_source_df.parquet')

In [98]:
%%time
# Alternate write using gzip compression, for better but slower compression

base_df.to_parquet(project_folder / data_folder /'base_df2.parquet',
                    compression='gzip', # default is compression with `snappy`
                   )



CPU times: user 3.43 s, sys: 34.5 ms, total: 3.47 s
Wall time: 3.48 s
