# Intro

The purpose of this notebook is to query the SAWGraph knowledge graph for landfills and Department of Defense (DoD) sites that are near NHD surface water flowlines and to find all surface water flowlines downstream of them. There are two maps at the end comparing the inclusion/exclusion of coastal flowlines.

This is a demonstration of competency question 2b from Use Case 1: Testing.

- What surface water bodies are downstream from landfills or DoD sites?

Note: This could also be done using NHD water bodies or a mixture of flowlines and NHD water bodies.

# Setup

Here we set up SPARQLWrapper to work with our endpoint and create our query.

## Install & Import Statements

Install: The SPARQLWrapper library provides tools for querying SPARQL endpoints. The sparql_dataframe library can be used with SPARQLWrapper to convert JSON results from a SPARQL query directly to a Pandas dataframe.

Import: See the inline comments for a brief rationale for each library.

In [None]:
%%capture
!pip install mapclassify --upgrade --quiet
!pip install SPARQLWrapper --upgrade --quiet
!pip install sparql_dataframe --upgrade --quiet

In [None]:
from branca.element import Figure                                  # For controlling the size of the final map
import folium                                                      # For maps
import geopandas as gpd                                            # For geospatial dataframes
import pandas as pd                                                # For dataframes
from shapely import wkt                                            # For working with WKT coordinates in a GeoDataFrame
from SPARQLWrapper import SPARQLWrapper, JSON, GET, POST, DIGEST   # For querying SPARQL endpoints
import sparql_dataframe                                            # For converting SPARQL query results to Pandas dataframes

## Variable Initialization

A SPARQLWrapper is created for each of the three SAWGraph repositories used in this notebook: Hydrology, FIO, and S2L13_AdminRegions.

In [None]:
%%capture
pd.options.display.width = 240

endpointHydrology = 'https://gdb.acg.maine.edu:7200/repositories/Hydrology'
sparqlHydrology = SPARQLWrapper(endpointHydrology)
sparqlHydrology.setHTTPAuth(DIGEST)
sparqlHydrology.setCredentials('sawgraph-endpoint', 'skailab')
sparqlHydrology.setMethod(GET)
sparqlHydrology.setReturnFormat(JSON)

endpointFIO = 'https://gdb.acg.maine.edu:7200/repositories/FIO'
sparqlFIO = SPARQLWrapper(endpointFIO)
sparqlFIO.setHTTPAuth(DIGEST)
sparqlFIO.setCredentials('sawgraph-endpoint', 'skailab')
sparqlFIO.setMethod(GET)
sparqlFIO.setReturnFormat(JSON)

endpointS2L13AdminRegions = 'https://gdb.acg.maine.edu:7200/repositories/S2L13_AdminRegions'
sparqlS2L13AdminRegions = SPARQLWrapper(endpointS2L13AdminRegions)
sparqlS2L13AdminRegions.setHTTPAuth(DIGEST)
sparqlS2L13AdminRegions.setCredentials('sawgraph-endpoint', 'skailab')
sparqlS2L13AdminRegions.setMethod(GET)
sparqlS2L13AdminRegions.setReturnFormat(JSON)
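
The three endpoint URLs above share one base address and differ only in the repository name. As a minimal sketch, they could be derived from a list of names; `BASE_URL`, `REPOSITORIES`, and `endpoints` are illustrative names that do not appear elsewhere in this notebook.

```python
# Base address shared by the three repositories used in this notebook
BASE_URL = "https://gdb.acg.maine.edu:7200/repositories"
REPOSITORIES = ["Hydrology", "FIO", "S2L13_AdminRegions"]

# Map each repository name to its full endpoint URL
endpoints = {name: f"{BASE_URL}/{name}" for name in REPOSITORIES}

for name, url in endpoints.items():
    print(name, url)
```

Keeping the URLs in one dictionary would make it easier to add or rename a repository later without repeating the base address.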

## Primary Query

This query directly accesses data in the Hydrology repository of the SAWGraph Knowledge Graph. It uses federation to access additional data in the FIO and S2L13_AdminRegions repositories.

The first block looks for any surface water flowlines within S2 cells that contain landfills or DoD sites. It also looks for all flowlines downstream of those initial flowlines. If available, the name associated with the flowline is retrieved. Finally, all flowlines labeled "Coastline" are excluded from the results. NOTE: These will be removed from the flowline data in the future.

The second block finds facilities that are either landfills or DoD sites (using NAICS codes) and the S2 cell they are within. It also pulls label information if it is available.

The query is executed and returned as a dataframe.

In [None]:
%%time
query = """
PREFIX fio: <http://sawgraph.spatialai.org/v1/fio#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX hyf: <https://www.opengis.net/def/schema/hy_features/hyf/>
PREFIX kwg-ont: <http://stko-kwg.geog.ucsb.edu/lod/ontology/>
PREFIX naics: <http://sawgraph.spatialai.org/v1/fio/naics#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX saw_water: <http://sawgraph.spatialai.org/v1/saw_water#>
PREFIX schema: <https://schema.org/>
PREFIX us_frs: <http://sawgraph.spatialai.org/v1/us-frs#>

SELECT * WHERE {
    ?fl rdf:type hyf:HY_FlowPath ;
    	kwg-ont:sfCrosses ?fac_s2 ;
    	hyf:downstreamWaterbodyTC ?fl_ds ;
        saw_water:hasFTYPE ?fl_type .
    OPTIONAL { ?fl schema:name ?fl_name . }
    OPTIONAL { ?fl_ds schema:name ?fl_ds_name . }
    ?fl_ds saw_water:hasFTYPE ?fl_ds_type .
    FILTER ( ?fl_type != "Coastline" )
    FILTER ( ?fl_ds_type != "Coastline" )

    SERVICE <repository:FIO> {
        SELECT * WHERE {
            ?fac rdf:type fio:Facility ;
                 fio:ofIndustry ?code ;
                 kwg-ont:sfWithin ?fac_s2  .
            OPTIONAL { ?fac rdfs:label ?faclabel . }
            OPTIONAL { ?code rdfs:label ?ind . }
            FILTER (?code IN (naics:NAICS-Industry-Code-562212, naics:NAICS-Industry-Code-92811, naics:NAICS-Industry-Code-928110))
        }
    }
}
"""
df = sparql_dataframe.get(endpointHydrology, query)

CPU times: user 28.5 ms, sys: 1.86 ms, total: 30.4 ms
Wall time: 395 ms
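
The facility filter in the query lists its NAICS code IRIs by hand. If the set of industries needs to change, the clause could be generated from a list of codes instead. The sketch below assumes the `naics:` prefix is declared in the query, as it is above; `naics_filter` is a hypothetical helper name.

```python
def naics_filter(codes, var="?code"):
    """Build a SPARQL FILTER restricting `var` to the given NAICS codes."""
    iris = ", ".join(f"naics:NAICS-Industry-Code-{c}" for c in codes)
    return f"FILTER ({var} IN ({iris}))"


# The three codes used in the primary query
codes = ["562212", "92811", "928110"]
filter_clause = naics_filter(codes)
print(filter_clause)
```

The resulting string can be spliced into the query text in place of the hand-written `FILTER` line.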


# Geometry Queries

The following function builds a query that retrieves the WKT geometry of each instance in a list of knowledge graph IRIs.

In [None]:
def geo_query(data):
    # Wrap each IRI in angle brackets and join them into a comma-separated list
    instances = ', '.join('<' + d + '>' for d in data)
    query = """
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>

    SELECT * WHERE {
        ?x geo:hasGeometry/geo:asWKT ?wkt .
        FILTER (?x IN (""" + instances + """))
    }
    """
    return query
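
SPARQL 1.1 also offers a `VALUES` block as another way to restrict a variable to a fixed set of IRIs. The sketch below is an alternative formulation, not the one this notebook uses, and it has not been compared for performance on the SAWGraph endpoint. `geo_query_values` and the `example.org` IRIs are hypothetical.

```python
def geo_query_values(iris):
    """Build a geometry query that binds ?x through a VALUES block."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return f"""
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>

    SELECT * WHERE {{
        VALUES ?x {{ {values} }}
        ?x geo:hasGeometry/geo:asWKT ?wkt .
    }}
    """


# Placeholder IRIs used only to show the generated query text
example_iris = [
    "http://example.org/flowline/1",
    "http://example.org/flowline/2",
]
print(geo_query_values(example_iris))
```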

Create lists of flowlines, facilities, and S2 cells.

In [None]:
# Collect the unique IRIs of each type, preserving order of first appearance
flowlines = df['fl'].unique().tolist()
ds_flowlines = df['fl_ds'].unique().tolist()
facilities = df['fac'].unique().tolist()
s2cells = df['fac_s2'].unique().tolist()

Query for the geometries and add the results to dataframes. Because there are many downstream flowlines, their geometries cannot be requested comfortably in a single query, so the queries are carried out for 100 instances at a time. The following function splits the list of instances into batches as needed and combines the results into one dataframe.

In [None]:
def retrieve_geometries(instances, repo, batch_size=100):
    # Query in batches so the FILTER ... IN list stays a manageable size
    frames = []
    for start in range(0, len(instances), batch_size):
        query = geo_query(instances[start:start + batch_size])
        frames.append(sparql_dataframe.get(repo, query))
    # Nothing to look up: return an empty frame with the expected columns
    if not frames:
        return pd.DataFrame(columns=['x', 'wkt'])
    # Combine the per-batch results and drop any repeated rows
    return pd.concat(frames, ignore_index=True).drop_duplicates()
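
To make the batching concrete, the sketch below splits a list into slices of at most 100 items and reports the size of each slice. The `batches` helper and the `example.org` IRIs are hypothetical and serve only to illustrate the slicing; no query is sent.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IRIs standing in for the downstream flowlines
iris = [f"http://example.org/flowline/{n}" for n in range(250)]
batch_sizes = [len(b) for b in batches(iris)]
print(batch_sizes)  # [100, 100, 50]
```

Stepping through the list with `range(0, len(items), size)` never produces an empty final slice, even when the list length is an exact multiple of the batch size.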

In [None]:
%%time
df_fl_geo = retrieve_geometries(flowlines, endpointHydrology)
df_dsfl_geo = retrieve_geometries(ds_flowlines, endpointHydrology)
df_fac_geo = retrieve_geometries(facilities, endpointFIO)
df_s2_geo = retrieve_geometries(s2cells, endpointS2L13AdminRegions)

CPU times: user 163 ms, sys: 15.3 ms, total: 178 ms
Wall time: 1.63 s


We have to add the geometries to the primary query's dataframe of results. This is done using the following function, which is called four times in this case for four different geometry types (flowlines, downstream flowlines, facilities, and S2 cells).

In [None]:
def new_column(df_in, df_wkt, col_item, col_wkt):
    df_out = df_in.copy()
    df_out[col_wkt] = df_out[col_item]
    for row in df_wkt.itertuples():
        df_out.loc[df_out[col_wkt] == row.x, col_wkt] = row.wkt
    return df_out
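
The same IRI-to-WKT lookup can also be written as a dictionary mapping, which avoids the row-by-row loop. The sketch below uses toy data with made-up `ex:` identifiers and simple point geometries; it is an illustration, not a drop-in replacement. One difference to be aware of: with `Series.map`, an IRI that has no geometry becomes `NaN`, whereas the function above leaves the IRI in place.

```python
import pandas as pd

# Toy stand-ins for the query results and the geometry lookup table
df_results = pd.DataFrame({"fl": ["ex:a", "ex:b", "ex:a"]})
df_geometry = pd.DataFrame({
    "x": ["ex:a", "ex:b"],
    "wkt": ["POINT (0 0)", "POINT (1 1)"],
})

# Build an IRI -> WKT lookup, then map each IRI to its geometry
lookup = dict(zip(df_geometry["x"], df_geometry["wkt"]))
df_results["fl_wkt"] = df_results["fl"].map(lookup)
print(df_results)
```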

In [None]:
df = new_column(df, df_fl_geo, 'fl', 'fl_wkt')
df = new_column(df, df_dsfl_geo, 'fl_ds', 'fl_ds_wkt')
df = new_column(df, df_fac_geo, 'fac', 'fac_wkt')
df = new_column(df, df_s2_geo, 'fac_s2', 'fac_s2_wkt')

# Visualizing on a map

## Partitioning the Query Results

The data is split into four separate themed dataframes, one for flowlines, one for downstream flowlines, one for facilities, and one for the S2 cells.

The columns with WKT coordinates in each dataframe are converted to Shapely geometry objects prior to converting the dataframes to GeoDataFrames.

In [None]:
df_fl = df[['fl', 'fl_name', 'fl_wkt']].copy()
df_fl.drop_duplicates(inplace=True)
df_fl['fl_wkt'] = df_fl['fl_wkt'].apply(wkt.loads)

df_dsfl = df[['fl_ds', 'fl_ds_name', 'fl_ds_wkt']].copy()
df_dsfl.drop_duplicates(inplace=True)
df_dsfl['fl_ds_wkt'] = df_dsfl['fl_ds_wkt'].apply(wkt.loads)

df_fac = df[['fac', 'faclabel', 'fac_wkt']].copy()
df_fac.drop_duplicates(inplace=True)
df_fac['fac_wkt'] = df_fac['fac_wkt'].apply(wkt.loads)

df_s2 = df[['fac', 'fac_s2', 'faclabel', 'fac_s2_wkt']].copy()
df_s2.drop_duplicates(inplace=True)
df_s2['fac_s2_wkt'] = df_s2['fac_s2_wkt'].apply(wkt.loads)

## Create GeoPandas dataframes for mapping

Convert the above dataframes to GeoDataFrames, setting the WKT columns as the geometry columns and setting the CRS to WGS 84.

In [None]:
%%capture
gdf_fl = gpd.GeoDataFrame(df_fl, geometry='fl_wkt')
gdf_fl.set_crs(epsg=4326, inplace=True, allow_override=True)

gdf_dsfl = gpd.GeoDataFrame(df_dsfl, geometry='fl_ds_wkt')
gdf_dsfl.set_crs(epsg=4326, inplace=True, allow_override=True)

gdf_fac = gpd.GeoDataFrame(df_fac, geometry='fac_wkt')
gdf_fac.set_crs(epsg=4326, inplace=True, allow_override=True)

gdf_s2 = gpd.GeoDataFrame(df_s2, geometry='fac_s2_wkt')
gdf_s2.set_crs(epsg=4326, inplace=True, allow_override=True)

## Create Map

Each GeoDataFrame is a layer in the final map.

In [None]:
%%capture
fl_color = 'blue'
fac_color = 'red'
s2_color = 'black'
boundweight = 5

map = gdf_s2.explore(color=s2_color,
                     style_kwds=dict(weight=boundweight),
                     tooltip=True,
                     name='Facility S2 Cells',
                     show=True)
gdf_fl.explore(m=map,
               color=fl_color,
               style_kwds=dict(weight=boundweight),
               tooltip=True,
               highlight=False,
               name='Near Flowlines',
               show=True)
gdf_dsfl.explore(m=map,
                 color=fl_color,
                 style_kwds=dict(weight=boundweight),
                 tooltip=True,
                 highlight=False,
                 name='DS Flowlines',
                 show=True)
gdf_fac.explore(m=map,
                color=fac_color,
                style_kwds=dict(weight=boundweight),
                tooltip=True,
                name='Facilities',
                show=True)


# folium.TileLayer("stamenterrain", show=False).add_to(map)
# folium.TileLayer("MapQuest Open Aerial", show=False).add_to(map)
folium.LayerControl(collapsed=False).add_to(map)

## Show Map

The map is created inside a Figure box to control its size.

In [None]:
map.save('SAWGraph_UC1_CQ2b_map.html')

fig = Figure(width=800, height=600)
fig.add_child(map)