### Build topology
This script reads in the individual least-cost routes linking each biogas source to the nearest pipeline and merges them into a topologically correct network. It does so by splitting each route at every point where another route joins it.

In [1]:
#Import packages
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Point, LineString
from shapely.ops import split

In [2]:
#Read in routes feature class as shapefile
gdfRoutes = gpd.read_file('../data/processed/Routes.shp')

In [3]:
#Columns to drop to keep things tidy
drop_cols = ['Facility N', 'Address', 'City', 'County Nam',
             'Zip', 'Latitude', 'Longitude', 'Regulated', 
             'Allowable']

#### Derive geodataframes of the route start points and end points
Start points are used to link each route's biogas potential to the final output and end points are used to split existing route features.

In [4]:
#Copy routes geodataframe and update geometry to start points
gdfStart = gdfRoutes.copy(deep=True)
gdfStart['geometry'] = gdfRoutes['geometry'].apply(lambda x: Point(x.coords[0]))
gdfStart.drop(columns=drop_cols,axis=1,inplace=True)

In [5]:
#Copy routes geodataframe and update geometry to end points
gdfEnd = gdfRoutes.copy(deep=True)
gdfEnd['geometry'] = gdfRoutes['geometry'].apply(lambda x: Point(x.coords[-1]))
gdfEnd.drop(columns=drop_cols,axis=1,inplace=True)

#### Split route features where new routes enter them
1. Combine all endpoint point features into a single multipoint feature
2. Split the LineString geometries with this multipoint feature, resulting in Geometry Collection features stored in the geodataframe's geometry series
3. Iterate through each feature in the above result, splitting its geometry collection back into individual LineString features, and adding each to a growing list.
4. Reconstruct a new geodataframe of all the route segments from the split list, adding a new edge ID attribute
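
The steps above can be tried on a toy case before running them on the full dataset. The sketch below assumes two hypothetical routes (`trunk` and `branch`, neither of which is in the real data) where the branch ends partway along the trunk. It uses the same `shapely.ops.split` call as the notebook, with a MultiPoint of route end points as the splitter.

```python
from shapely.geometry import LineString, MultiPoint
from shapely.ops import split

# Two hypothetical routes: the branch ends on the trunk at (5, 0)
trunk = LineString([(0, 0), (10, 0)])
branch = LineString([(5, 5), (5, 0)])

# Steps 1-2: combine all route end points, then split every route with them
ends = MultiPoint([trunk.coords[-1], branch.coords[-1]])
collections = [split(route, ends) for route in (trunk, branch)]

# Step 3: flatten the geometry collections into individual LineStrings
segments = [seg for coll in collections for seg in coll.geoms]

for seg in segments:
    print(list(seg.coords))
```

The trunk comes back as two pieces that meet at the branch's end point, while the branch is returned whole because no end point falls in its interior.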

In [6]:
#Combine endpoints into a single multipoint object
ends = gdfEnd.geometry.unary_union

In [7]:
#Create a geoseries of split routes (geometry collections) - takes a bit of time
theSplits = gdfRoutes.geometry.apply(lambda x: split(x,ends))

In [8]:
#Create lists to fill
links = [] #List of each route's original route ID
geom = []  #List of the LineString objects extracted from each feature's geometry collection

In [9]:
#Iterate and add items to the list
for index, row in gdfRoutes.iterrows():
    #Iterate through split segments in the geometry collection
    for line in theSplits[index].geoms:
        #Add items to the list
        links.append(str(row['index']))
        geom.append(line)

In [10]:
#Construct an output geodataframe from the links and geom lists created above
gdfSegments = gpd.GeoDataFrame(pd.DataFrame({'route_id':links}),
                               geometry = geom, crs = gdfRoutes.crs)

#Add the index as a unique segment ID  
gdfSegments['edge_ID'] = gdfSegments.index.astype(str)

With the segments created, we now need to assign attributes to each edge. These consist of its upstream node ID, its downstream node ID, and the amount of biogas introduced at its upstream node.

This process is a bit tricky and is done by:
* Creating a feature class of each segment's starting vertex, tagged with the segment's `edge_ID`. Conceptually these are the "downstream nodes" (`gdfFirstPoints` in the code), since each point is labeled with the edge that falls **downstream** of it.
* Creating a second feature class of each segment's ending vertex, tagged with the segment's `edge_ID`. Conceptually these are the "upstream nodes" (`gdfLastPoints` in the code), since each point is labeled with the edge that falls **upstream** of it.
* Spatially joining the two point sets, resulting in a dataset of vertices (`gdfNodes`) where each vertex carries the IDs of its upstream and downstream edges.

##### Create geodataframes from "upstream" and "downstream" nodes
Here the "downstream" nodes are actually the first point in each segment, and the "upstream" nodes are the last. This seems backwards, but these are both intermediate datasets used to determine "from-to" pairs, done by spatially joining them. When joined, the "from" node carries the attribute of the upstream segment and the "to" node carries the downstream one...

In [11]:
#Construct a gdf of segment start points;
#  the 'downstream_id' added here is the ID of the segment into which the point flows, i.e. its downstream segment ID
gdfFirstPoints = gdfSegments.copy(deep=True)
gdfFirstPoints['geometry'] = gdfFirstPoints['geometry'].apply(lambda x: Point(x.coords[0]))
gdfFirstPoints['downstream_id'] = gdfFirstPoints.index.astype(str)

In [12]:
#Construct a gdf of segment end points;
#  the 'upstream_id' added here is the ID of the segment flowing into the point, i.e. its upstream segment ID
gdfLastPoints = gdfSegments.copy(deep=True)
gdfLastPoints['geometry'] = gdfLastPoints['geometry'].apply(lambda x: Point(x.coords[-1]))
gdfLastPoints['upstream_id'] = gdfLastPoints.index.astype(str)

##### Spatially join the downstream and upstream points
Spatially joining the two datasets results in a single point feature class in which each point carries the edge IDs of its upstream and downstream segments.
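
To make the join logic concrete, here is a minimal sketch that uses no GIS libraries. It assumes three hypothetical segments and substitutes exact coordinate equality for the spatial `intersects` test that `gpd.sjoin` performs. The names `segments`, `first_points`, and `nodes` are illustrative only.

```python
# Hypothetical segments: edge ID -> (first vertex, last vertex)
segments = {
    "0": ((0, 0), (5, 0)),
    "1": ((5, 5), (5, 0)),
    "2": ((5, 0), (10, 0)),
}

# "First points" lookup: start coordinate -> the edge that begins there
first_points = {start: edge_id for edge_id, (start, _) in segments.items()}

# Left join of every segment's last point onto the first points
nodes = {}
for upstream_id, (_, end) in segments.items():
    # None when no segment starts at this end point
    nodes[upstream_id] = first_points.get(end)

print(nodes)
```

Segments "0" and "1" both end where segment "2" begins, so both receive "2" as their downstream edge. Segment "2" has no match, because nothing starts at its end point.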

In [13]:
#Spatially join the above geodataframes and remove indices
gdfNodes = gpd.sjoin(left_df=gdfLastPoints, right_df=gdfFirstPoints, how='left')
gdfNodes.drop(columns=['route_id_left','edge_ID_left','index_right','route_id_right','edge_ID_right'],
              axis=1,inplace=True)
gdfNodes.head(1)

Unnamed: 0,geometry,upstream_id,downstream_id
0,POINT (1582309.906 -310281.512),0,1


In [14]:
#Show info on the resulting dataset
gdfNodes.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2570 entries, 0 to 2569
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   geometry       2570 non-null   geometry
 1   upstream_id    2570 non-null   object  
 2   downstream_id  2161 non-null   object  
dtypes: geometry(1), object(2)
memory usage: 80.3+ KB


The above reveals a number of records missing `downstream_id` values (409 of 2,570). These are the terminal segments, i.e. the ones connecting routes to existing NG pipeline infrastructure.
> This is because we joined the "first" points of each segment to the "last" ones (the left join above), which leaves `downstream_id` as NaN for every segment end point that didn't intersect another segment's start point; these are the terminal nodes in each route grouping (i.e. where the pipes connect to the existing pipeline network).

We relabel these terminal nodes with the upstream segment ID followed by a "T". 

In [15]:
#Update null values with upstream IDs, appended with a "T"
gdfNodes['downstream_id'] = gdfNodes['downstream_id'].fillna(gdfNodes['upstream_id'] + "T")
#Show a random sample of the result
gdfNodes.sample(10)

Unnamed: 0,geometry,upstream_id,downstream_id
2046,POINT (1589875.930 -272955.793),2046,676
2323,POINT (1470837.151 -180145.898),2323,2323T
2390,POINT (1447634.677 -333483.986),2390,1934
2234,POINT (1644351.304 -283043.825),2234,1512
964,POINT (1593406.741 -277999.809),964,625
1340,POINT (1619131.223 -318856.339),1340,1341
1593,POINT (1619131.223 -272451.392),1593,203
1755,POINT (1636785.279 -177119.488),1755,1755T
1547,POINT (1665031.769 -297671.472),1547,1286
225,POINT (1630732.460 -318856.339),225,51


In [16]:
#Write the nodes to a file
gdfNodes.to_file('../scratch/nodes.shp')

#### Transfer node information to route segment features.
The nodes geodataframe above includes points occurring at the end vertex of each pipeline segment (again, because we joined the `gdfFirstPoints` features *to* the `gdfLastPoints` features, keeping all the `gdfLastPoints` features). Each of these points knows the edge_IDs of the segments immediately upstream and downstream of it, or that it is a terminal node.

What we want in the next step is to link this information to each route segment feature such that each feature knows its node ID (taken from its `upstream_id`) and the node ID immediately downstream of it. This will allow us to construct a graph from all the segments.

We also need to link each segment with the amount of biogas potential it introduces into the system. 

##### Joining node attribute data to each segment feature
The first step is done via an attribute join: the `downstream_id` attribute in the `gdfNodes` dataframe is merged into the segment geodataframe by matching the nodes' `upstream_id` to the segments' `edge_ID`.

In [17]:
#Join the upstream and downstream IDs to the segments features
gdfSegments_ids = gdfSegments.merge(gdfNodes[['upstream_id','downstream_id']],
                                    left_on='edge_ID', right_on='upstream_id',how='left')
#Drop 'upstream_id' as it's redundant with edge_ID
gdfSegments_ids.drop('upstream_id',axis=1,inplace=True)
#Show the table
gdfSegments_ids.sample(10)

Unnamed: 0,route_id,geometry,edge_ID,downstream_id
2224,1910,"LINESTRING (1598450.757 -281026.219, 1598450.7...",2224,1987
622,1036,"LINESTRING (1544984.187 -254797.335, 1545488.5...",622,622T
1561,1139,"LINESTRING (1675119.802 -272451.392, 1674615.4...",1561,720
2496,86,"LINESTRING (1585336.316 -341050.010, 1584831.9...",2496,1466
353,1891,"LINESTRING (1611565.199 -330961.978, 1612069.6...",353,69
1243,60,"LINESTRING (1592397.938 -345085.223, 1592397.9...",1243,1168
478,1491,"LINESTRING (1681677.022 -213436.404, 1682181.4...",478,479
779,479,"LINESTRING (1638298.484 -298680.275, 1637794.0...",779,735
1699,2247,"LINESTRING (1618626.822 -257823.745, 1618122.4...",1699,1700
2392,4,"LINESTRING (1435529.039 -301202.283, 1435529.0...",2392,222


Now each segment feature knows its node/edge ID and the node ID of the segment immediately downstream!

##### Attaching biogas potential data back to each segment
Next, we need to join the Biogas Potential linked with each segment. The biogas potential is stored in the `gdfStart` geodataframe constructed by taking the first point in each original biogas route. Here, the `Biogas P_1` attribute is what we want. 

In [18]:
gdfStart.head()

Unnamed: 0,index,Total Wast,Biogas Pot,Biogas P_1,geometry
0,179,248472.480342,6957229.0,245692300.0,POINT (1582814.308 -311794.717)
1,2106,226652.386212,6346267.0,224116300.0,POINT (1737161.199 -158961.030)
2,2112,192948.55104,5402559.0,190789600.0,POINT (1623166.436 -225037.641)
3,345,186392.775915,5218998.0,184307200.0,POINT (1618626.822 -297167.070)
4,1300,158494.371443,4437842.0,156720900.0,POINT (1627706.051 -108520.870)


As this dataframe has no attribute that would let us join the data to our segment features, we'll need to use a spatial join to link the biogas potential to our segments.

To do this, we'll spatially join the `gdfStart` features (which contain biogas potential information) to the `gdfFirstPoints` features (which contain the node/edge ID information):

In [19]:
gdfBiogasLookup = gpd.sjoin(left_df=gdfFirstPoints[['geometry','edge_ID']],#Join only the geom and edge_ID cols
                            right_df=gdfStart[['geometry','Total Wast','Biogas P_1']],  #Join only the geom and biogas cols
                            how='inner')
gdfBiogasLookup.head()

Unnamed: 0,geometry,edge_ID,index_right,Total Wast,Biogas P_1
0,POINT (1582814.308 -311794.717),0,0,248472.480342,245692300.0
12,POINT (1737161.199 -158961.030),12,1,226652.386212,224116300.0
13,POINT (1623166.436 -225037.641),13,2,192948.55104,190789600.0
16,POINT (1618626.822 -297167.070),16,3,186392.775915,184307200.0
21,POINT (1627706.051 -108520.870),21,4,158494.371443,156720900.0


This gives us a table that we can now merge to our segments dataframe. Not all segments will have biogas data, so we need to set null values to zero (which requires fixing the datatype).
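
The datatype issue can be seen in isolation with a short sketch. It assumes a hypothetical three-row column shaped like the merged `Biogas P_1` values, and shows that a pandas float column containing NaN cannot be cast to an integer type until the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical merged column: one source segment, then two segments without biogas
potential = pd.Series([245692300.0, np.nan, np.nan], name="Biogas P_1")

# Casting directly fails because NaN has no integer representation
try:
    potential.astype(np.int64)
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling the gaps first makes the integer cast well defined
filled = potential.fillna(0).astype(np.int64)
print(cast_failed, filled.tolist())
```

Filling before casting keeps the two operations in a safe order, so no placeholder integers are ever written for the missing rows.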

In [20]:
gdfSegments.head()

Unnamed: 0,route_id,geometry,edge_ID
0,179,"LINESTRING (1582814.308 -311794.717, 1582814.3...",0
1,179,"LINESTRING (1582309.906 -310281.512, 1582309.9...",1
2,179,"LINESTRING (1582309.906 -308768.307, 1582309.9...",2
3,179,"LINESTRING (1582309.906 -308263.906, 1582309.9...",3
4,179,"LINESTRING (1582309.906 -307255.102, 1582814.3...",4


In [23]:
#Merge the biogas potential to the segment features using edge_ID as the common field
gdfSegments_biogas = gdfSegments_ids.merge(gdfBiogasLookup[['edge_ID','Total Wast','Biogas P_1']],
                                           on='edge_ID',how='left')

#Fix waste and biogas columns: fill NaN with zero before casting, since NaN cannot be stored as an integer
gdfSegments_biogas['BG_potential'] = gdfSegments_biogas['Biogas P_1'].fillna(0).astype(np.int64)
gdfSegments_biogas['Total Wast'] = gdfSegments_biogas['Total Wast'].fillna(0)

gdfSegments_biogas.head()

Unnamed: 0,route_id,geometry,edge_ID,downstream_id,Total Wast,Biogas P_1,BG_potential
0,179,"LINESTRING (1582814.308 -311794.717, 1582814.3...",0,1,248472.480342,245692300.0,245692262
1,179,"LINESTRING (1582309.906 -310281.512, 1582309.9...",1,2,0.0,,0
2,179,"LINESTRING (1582309.906 -308768.307, 1582309.9...",2,3,0.0,,0
3,179,"LINESTRING (1582309.906 -308263.906, 1582309.9...",3,4,0.0,,0
4,179,"LINESTRING (1582309.906 -307255.102, 1582814.3...",4,5,0.0,,0


In [None]:
#TODO: add segment attributes: Biogas site | Junction | Terminal
#gdfSegments['Node']  #placeholder: this column has not been created yet

In [27]:
#Write out shapefile
gdfSegments_biogas[['edge_ID','downstream_id','route_id','Total Wast',
                    'BG_potential','geometry']].to_file('../data/processed/BasePipeline.shp')

In [25]:
#Write out edge list
gdfSegments_biogas[['edge_ID','downstream_id','Total Wast','Biogas P_1']].to_csv('../data/processed/BaseEdgeList.csv',index=False)
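
As an illustration of how the edge list might be consumed downstream, the sketch below accumulates biogas potential along each chain of `downstream_id` links using only the standard library. It is not part of the original workflow. The three-row sample is hypothetical, the variable names are illustrative, and the network is assumed to contain no cycles.

```python
import csv
import io

# Hypothetical edge list in the same column layout as BaseEdgeList.csv
sample = io.StringIO(
    "edge_ID,downstream_id,Total Wast,Biogas P_1\n"
    "0,2,100.0,1000.0\n"
    "1,2,50.0,500.0\n"
    "2,2T,0.0,\n"
)

downstream = {}
potential = {}
for row in csv.DictReader(sample):
    downstream[row["edge_ID"]] = row["downstream_id"]
    # Empty cells are NaN values as written by to_csv; treat them as zero
    potential[row["edge_ID"]] = float(row["Biogas P_1"] or 0)

# Push each segment's potential along its downstream chain
flow = {edge: 0.0 for edge in downstream}
for edge, amount in potential.items():
    current = edge
    while current in downstream:  # stops at terminal IDs such as "2T"
        flow[current] += amount
        current = downstream[current]

print(flow)
```

Here segment "2" ends up carrying the combined potential of segments "0" and "1", which is the quantity needed when sizing a shared pipe.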