## Assigning attribute values to Washington tax lot parcels

### Summary
In this notebook we add descriptive attributes to Washington tax lot polygons for use in Landmapper. These attributes are displayed on the first page of the Landmapper map package.
* **ID** - fieldname: *id*, source: create, type: double
* **Acres** - fieldname: *acres*, source: create, type: double
* **Elevation range** - fieldnames: *min_ft*, *max_ft*, source: DEM, type: double
* **Legal Description** - fieldname: *legalDesc*, source: PLSS, type: text
* **County** - fieldname: *county*, source: parcels, type: text
* **Forest Fire District** - fieldname: *odf_fpd*, source: not available for Washington (filled with "NA")
* **Structure Fire District** - fieldname: *agency*, source: not available for Washington (filled with "NA")
* **Land use** - fieldname: *landuse*, source: parcels, type: text
* **Watershed Name** - fieldname: *name*, source: USGS WBD
* **Watershed (HUC)** - fieldname: *huc12*, source: USGS WBD
* **Coordinates** - fieldnames: lat, lon
* **Elevation Range** - fieldnames: min, max
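
The acreage and coordinate attributes listed above are not computed in this notebook. A minimal sketch of one way to derive them follows, assuming the tax lots are in EPSG:2927 (the CRS used by the spatial join later in this notebook, with units in feet). The single rectangular lot and the `SQFT_PER_ACRE` constant name are illustrative, not part of the source data.

```python
import geopandas as gpd
from shapely.geometry import box

SQFT_PER_ACRE = 43560  # one acre is 43,560 square feet

# Hypothetical lot: a 435.6 ft x 100 ft rectangle, i.e. exactly one acre
x0, y0 = 2236595.797, 579455.407
lots = gpd.GeoDataFrame(
    {"id": [700000]},
    geometry=[box(x0, y0, x0 + 435.6, y0 + 100.0)],
    crs="EPSG:2927",  # assumption: projected CRS in feet
)

# Area is computed in the projected CRS, then converted to acres
lots["acres"] = lots.geometry.area / SQFT_PER_ACRE

# Centroids are computed in the projected CRS and only then
# reprojected to geographic coordinates (longitude/latitude)
centroids = lots.geometry.centroid.to_crs("EPSG:4326")
lots["lon"] = centroids.x
lots["lat"] = centroids.y
```

Computing the centroid before reprojecting avoids measuring geometry in degrees.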

**Sources**
* Parcels - https://geo.wa.gov/datasets/wa-geoservices::current-parcels/about
* Landuse Codes - https://depts.washington.edu/wagis/projects/parcels/producers/qaqc/summary.php?org=416&nid=63
* Legal Description - https://gis.blm.gov/arcgis/rest/services/Cadastral/BLM_Natl_PLSS_CadNSDI/MapServer
* Watersheds - https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer

In [1]:
%load_ext autotime
import os

import pandas as pd
import geopandas as gpd
import numpy as np
import dask_geopandas
import dask.dataframe
from dask.distributed import Client, LocalCluster
from rasterstats import zonal_stats
import rasterio 

In [48]:
# PROJECT PATHS
# also stored on knowsys at Landmapper_2020/Data
TAXLOTS = "../data/Parcels_2023_small_ele.shp"
LANDUSE_CODES = "../data/WA_source/Landuse_Code_Lookup.csv"
WATERSHED = "../data/WA_source/NHD_H_Washington_State_Shape/Shape/WBDHU12.shp"
PLSS = "../data/WA_source/WA_PLSS/WA_Public_Land_Survey_Sections.shp"
TAXLOTS_LARGE = "../data/WA_taxlot_attributes_091123.shp"

time: 788 µs


### Load and preprocess tax lots

We still need to implement zonal statistics of DEM data in this notebook to assign MIN and MAX values to each tax lot. This is currently done outside of this process, so the MIN/MAX values are already in the TAXLOTS file.

In [3]:
# read in parcels 
WA = gpd.read_file(TAXLOTS)
# grab crs
crs = WA.crs

time: 8min 4s


In [4]:
WA.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2639885 entries, 0 to 2639884
Data columns (total 14 columns):
 #   Column      Dtype   
---  ------      -----   
 0   OBJECTID    int64   
 1   COUNTY_NM   object  
 2   LANDUSE_CD  int64   
 3   VALUE_LAND  int64   
 4   VALUE_BLDG  int64   
 5   acres       float64 
 6   id          int64   
 7   OID_1       int64   
 8   OBJECTID_1  int64   
 9   COUNT       float64 
 10  AREA        float64 
 11  MIN         float64 
 12  MAX         float64 
 13  geometry    geometry
dtypes: float64(5), geometry(1), int64(7), object(1)
memory usage: 282.0+ MB
time: 84 ms


In [5]:
#drop unneeded fields
WA.drop(['VALUE_LAND', 'VALUE_BLDG', 'acres', 'OID_1', 'OBJECTID_1', 'COUNT', 'AREA'], axis=1, inplace=True)

time: 247 ms


Land use codes must be converted to text descriptions using the land use code lookup table.

In [6]:
#read in matrix - path at top
codes = pd.read_csv(LANDUSE_CODES)
# join based on LANDUSE_CD
WA = pd.merge(WA, codes, on="LANDUSE_CD", how='left')

time: 1.09 s


In [10]:
WA.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2663441 entries, 0 to 2663440
Data columns (total 8 columns):
 #   Column      Dtype   
---  ------      -----   
 0   OBJECTID    int64   
 1   COUNTY_NM   object  
 2   LANDUSE_CD  int64   
 3   id          int64   
 4   MIN         float64 
 5   MAX         float64 
 6   geometry    geometry
 7   landuse     object  
dtypes: float64(2), geometry(1), int64(3), object(2)
memory usage: 182.9+ MB
time: 6.12 ms
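
The row count rises from 2,639,885 before the merge to 2,663,441 after it, which suggests that some `LANDUSE_CD` values appear more than once in the lookup table. A left merge against repeated keys duplicates the matching tax lots. The sketch below shows the effect on toy data and one possible guard; the land use descriptions are invented, and keeping the first description per code is an assumption about what is wanted.

```python
import pandas as pd

# Toy tax lots and a lookup table in which code 11 is listed twice
lots = pd.DataFrame({"id": [1, 2, 3], "LANDUSE_CD": [11, 83, 11]})
codes = pd.DataFrame(
    {
        "LANDUSE_CD": [11, 11, 83],
        "landuse": ["Single family", "Single family", "Agriculture"],
    }
)

# Naive left merge: each lot with code 11 is matched twice
naive = lots.merge(codes, on="LANDUSE_CD", how="left")

# Keep one row per code, then let pandas verify the key is unique
codes_unique = codes.drop_duplicates(subset="LANDUSE_CD", keep="first")
checked = lots.merge(
    codes_unique, on="LANDUSE_CD", how="left", validate="many_to_one"
)
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the lookup still has duplicate keys, so the row count cannot grow silently.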


### Join with attributes

In [24]:
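def special_join(df, join_df):
    """
    Intersects tax lots with an attribute layer and keeps, for each
    tax lot, the attribute polygon with the largest overlap.
    
    Parameters
    ----------
    df : geodataframe
        attribute features (e.g. watersheds, PLSS sections);
        reprojected to EPSG:2927 before the overlay
    join_df : geodataframe
        tax lot features with an 'id' column; not reprojected here,
        so they should already be in EPSG:2927
        
    Returns
    -------
    out_df : geodataframe
        one row per tax lot 'id', carrying the attributes of the
        largest intersecting feature
    """
    out_df = df.to_crs(2927)
    out_df = gpd.overlay(join_df, out_df, how='intersection')
    # a tax lot may intersect several features, so choose the largest piece
    out_df['area'] = out_df.geometry.area
    # sort ascending by area so the largest piece comes last
    out_df.sort_values(by='area', inplace=True)
    # drop duplicates, keep largest/last
    out_df.drop_duplicates(subset='id', keep='last', inplace=True)
    out_df.drop(columns=['area'], inplace=True)
    return out_df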

time: 1.17 ms
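
To illustrate the largest-overlap rule used by `special_join`, here is a small self-contained sketch with one toy tax lot that straddles two zones. The geometries and the `name` values are made up; only the overlay, sort, and de-duplication steps mirror the function above.

```python
import geopandas as gpd
from shapely.geometry import box

# One hypothetical 10 x 10 lot
lots = gpd.GeoDataFrame({"id": [1]}, geometry=[box(0, 0, 10, 10)], crs="EPSG:2927")

# Two hypothetical zones: "west" covers 30% of the lot, "east" covers 70%
zones = gpd.GeoDataFrame(
    {"name": ["west", "east"]},
    geometry=[box(-5, 0, 3, 10), box(3, 0, 20, 10)],
    crs="EPSG:2927",
)

# Cut the lot along zone boundaries, one row per lot/zone piece
pieces = gpd.overlay(lots, zones, how="intersection")
pieces["area"] = pieces.geometry.area

# Sort ascending and keep the last (largest) piece for each lot id
largest = (
    pieces.sort_values("area")
    .drop_duplicates(subset="id", keep="last")
    .drop(columns="area")
)
```

Here the lot is assigned to `east`. Lots that intersect no zone at all produce no pieces and therefore do not appear in the result.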


In [12]:
join = WA[['id', 'geometry']]

time: 197 ms


Watersheds are assigned at the subwatershed (HUC12) level, using the `name` and `huc12` fields.
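
A HUC12 code is hierarchical: each additional pair of digits identifies a smaller unit nested in the previous one. The sketch below splits a code into its six levels. The helper name `split_huc12` is hypothetical, and the example code is taken from the sample output later in this notebook.

```python
# Number of leading digits that identify each level of the hierarchy
HUC_LEVELS = {
    "region": 2,
    "subregion": 4,
    "basin": 6,
    "subbasin": 8,
    "watershed": 10,
    "subwatershed": 12,
}


def split_huc12(huc12):
    """Return the code prefix for every level of a 12-digit HUC."""
    code = str(huc12)
    if len(code) != 12 or not code.isdigit():
        raise ValueError(f"expected a 12-digit HUC code, got {code!r}")
    return {level: code[:digits] for level, digits in HUC_LEVELS.items()}


parts = split_huc12("170601080706")
```

Keeping `huc12` as text rather than a number preserves any leading zeros and makes this kind of prefix slicing straightforward.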

In [26]:
# read in Watershed (HUC) polygons
gdf = gpd.read_file(WATERSHED)
water = gdf[['name', 'huc12', 'geometry']]

time: 5.4 s


In [28]:
# spatial join 
WA_huc = special_join(water, join)
huc_out = pd.DataFrame(WA_huc[['id', 'name', 'huc12']])
huc_out.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2639885 entries, 1632289 to 2093596
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   id      int64 
 1   name    object
 2   huc12   object
dtypes: int64(1), object(2)
memory usage: 80.6+ MB
time: 18min 5s


The legal description is pulled from PLSS data: township, range, and section.
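
If the township, range, and section are needed as separate values, the label can be split with a regular expression. This is only a sketch: the pattern is inferred from the sample labels shown later in this notebook (for example `T16-0N R37-0E S12`), the helper name `parse_legal_label` is hypothetical, and the `-0` parts are kept as-is because their meaning is not documented here.

```python
import re

# Pattern inferred from sample labels; other label formats may exist
LABEL_PATTERN = re.compile(r"^T(?P<township>\S+) R(?P<range>\S+) S(?P<section>\d+)$")


def parse_legal_label(label):
    """Split a PLSS label into township, range, and section, or return None."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        return None
    return {
        "township": match.group("township"),
        "range": match.group("range"),
        "section": int(match.group("section")),
    }


parsed = parse_legal_label("T16-0N R37-0E S12")
unparsed = parse_legal_label("not a label")
```

Returning `None` for labels that do not fit the pattern makes it easy to count how many records need a closer look.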

In [42]:
# read in PLSS dataset
plss = gpd.read_file(PLSS)
plss = plss[['LEGAL_DE_4', 'geometry']]

time: 36.3 s


In [36]:
plss.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 70719 entries, 0 to 70718
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   LEGAL_DE_4  70719 non-null  object  
 1   geometry    70719 non-null  geometry
dtypes: geometry(1), object(1)
memory usage: 1.1+ MB
time: 21.1 ms


In [38]:
# spatial join 
WA_plss = special_join(plss, join)
plss_out = pd.DataFrame(WA_plss[['id', 'LEGAL_DE_4']])
plss_out.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2616519 entries, 1463412 to 2674936
Data columns (total 2 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   id          int64 
 1   LEGAL_DE_4  object
dtypes: int64(1), object(1)
memory usage: 59.9+ MB
time: 1min 54s
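
The PLSS join returns 2,616,519 rows, compared with 2,639,885 for the watershed join, so 23,366 tax lots did not intersect any PLSS section and will have no legal description after the merge. One way to list those lots is a left merge with `indicator=True`, sketched below on toy data. The ids and labels are copied from the sample output in this notebook, and the gap is invented.

```python
import pandas as pd

# Toy inputs: lot 700001 has no PLSS match
lots = pd.DataFrame({"id": [700000, 700001, 700002]})
plss_out = pd.DataFrame(
    {
        "id": [700000, 700002],
        "LEGAL_DE_4": ["T16-0N R37-0E S12", "T18-0N R34-0E S25"],
    }
)

# The _merge column records whether each lot found a match
merged = lots.merge(plss_out, on="id", how="left", indicator=True)
missing_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
```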


Combine and export 

In [41]:
# merge dataframes
export = WA.merge(huc_out, on='id', how='left')
export = export.merge(plss_out, on='id', how='left')
export.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2663441 entries, 0 to 2663440
Data columns (total 11 columns):
 #   Column      Dtype   
---  ------      -----   
 0   OBJECTID    int64   
 1   COUNTY_NM   object  
 2   LANDUSE_CD  int64   
 3   id          int64   
 4   MIN         float64 
 5   MAX         float64 
 6   geometry    geometry
 7   landuse     object  
 8   name        object  
 9   huc12       object  
 10  LEGAL_DE_4  object  
dtypes: float64(2), geometry(1), int64(3), object(5)
memory usage: 243.8+ MB
time: 3.65 s


In [54]:
export_sub = export[['id', 'landuse', 'huc12', 'name', 'LEGAL_DE_4', 'MIN', 'MAX', 'OBJECTID', 'COUNTY_NM', 'geometry']]
# insert missing fields
# forest fire district - N/A in WA
export_sub.insert(1,'odf_fpd',"NA")
# structure fire district - N/A in WA
export_sub.insert(2,'agency',"NA")
# source of taxlots
export_sub.insert(9,'source',"DNR")
# not sure what this column is
export_sub.insert(11,'map_id',"NA")
export_sub.rename(columns={'LEGAL_DE_4': 'legal_label', 'OBJECTID': 'map_taxlot', 'COUNTY_NM':'county'}, inplace=True)
# convert to string
export_sub['map_taxlot'] = export_sub['map_taxlot'].apply(str)
export_sub.info()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  export_sub.rename(columns={'LEGAL_DE_4': 'legal_label', 'OBJECTID': 'map_taxlot', 'COUNTY_NM':'county'}, inplace=True)


<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2663441 entries, 0 to 2663440
Data columns (total 14 columns):
 #   Column       Dtype   
---  ------       -----   
 0   id           int64   
 1   odf_fpd      object  
 2   agency       object  
 3   landuse      object  
 4   huc12        object  
 5   name         object  
 6   legal_label  object  
 7   MIN          float64 
 8   MAX          float64 
 9   source       object  
 10  map_taxlot   object  
 11  map_id       object  
 12  county       object  
 13  geometry     geometry
dtypes: float64(2), geometry(1), int64(1), object(10)
memory usage: 304.8+ MB
time: 835 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)
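
The `SettingWithCopyWarning` messages above appear because `export_sub` is a column selection of `export` and is then modified in place. The result here is still correct, but the warning can be avoided by taking an explicit copy and reassigning instead of using `inplace=True`. A minimal sketch on a toy frame (the `extra` column is made up):

```python
import pandas as pd

export = pd.DataFrame(
    {
        "id": [700000, 700001],
        "OBJECTID": [1, 4],
        "COUNTY_NM": ["Adams", "Adams"],
        "extra": [0, 0],
    }
)

# .copy() makes export_sub an independent frame, not a slice of export
export_sub = export[["id", "OBJECTID", "COUNTY_NM"]].copy()
export_sub.insert(1, "odf_fpd", "NA")

# Reassign the renamed frame rather than renaming in place
export_sub = export_sub.rename(
    columns={"OBJECTID": "map_taxlot", "COUNTY_NM": "county"}
)

# astype(str) converts the whole column in one step
export_sub["map_taxlot"] = export_sub["map_taxlot"].astype(str)
```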


In [55]:
export_sub.head(3)

Unnamed: 0,id,odf_fpd,agency,landuse,huc12,name,legal_label,MIN,MAX,source,map_taxlot,map_id,county,geometry
0,700000,,,,170601080706,Twelvemile Lake,T16-0N R37-0E S12,453.778168,460.582153,DNR,1,,Adams,"POLYGON ((2236595.797 579455.407, 2236387.087 ..."
1,700001,,,,170200151005,Saddle Gap,T15-0N R28-0E S09,217.998596,219.096863,DNR,4,,Adams,"POLYGON ((1934757.561 540565.256, 1934756.807 ..."
2,700002,,,,170200150806,Lower Paha Coulee,T18-0N R34-0E S25,475.931732,478.895721,DNR,9,,Adams,"POLYGON ((2141334.650 622587.000, 2141582.875 ..."


time: 13.5 ms


In [142]:
#EXPORT = '../data/OR_Attributes.csv'
#export_sub.to_csv(EXPORT, encoding='utf-8', index=False)

time: 1min 42s


In [146]:
EXPORT = '../data/WA_Attributes.shp'
export_sub.to_file(EXPORT)

time: 8min 10s
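
One caveat with the shapefile export: the format limits attribute field names to 10 characters, so longer column names are typically truncated when the file is written. The sketch below checks the column names from the `export_sub.info()` output above. The constant name `SHAPEFILE_FIELD_LIMIT` is illustrative.

```python
SHAPEFILE_FIELD_LIMIT = 10  # maximum field name length in a shapefile's .dbf table

# Attribute columns of export_sub (the geometry column is stored separately)
columns = [
    "id", "odf_fpd", "agency", "landuse", "huc12", "name", "legal_label",
    "MIN", "MAX", "source", "map_taxlot", "map_id", "county",
]

# Names that would not survive the export unchanged
too_long = [name for name in columns if len(name) > SHAPEFILE_FIELD_LIMIT]
```

Here only `legal_label` (11 characters) exceeds the limit, so renaming it to a shorter name before export, or writing to a format without this limit, would keep the field name predictable.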


In [None]:
# #set up client with 32 cores 
# client = Client(
#     LocalCluster(
#         n_workers = 32,
#         processes=True,
#         threads_per_worker=5
#     )
# )

# #create dask dataframe
# OR_dask = dask_geopandas.from_geopandas(OR_county, npartitions=160)
# OR_dask.info()
# test_join = dask_geopandas.sjoin(OR_dask, water, predicate='within')
# r = test_join.compute()