## This notebook shows an example of how to transform data from 2010 census tracts to 2020 tracts using `sjoin_nearest`
* This is a proposed approach for handling raw input data that is on the census tract scale but uses an older (e.g., 2010) boundary set
* E.g., CalEnviroScreen data

**Note**: Clear the output from this notebook before merging any updates, as the displayed data creates a massive output (on the order of 200 MB) that exceeds GitHub's per-file storage limit.

In [None]:
import pandas as pd
import os
import sys
import math
import numpy as np
import geopandas as gpd

# suppress pandas performance warnings, which are purely informational
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.file_helpers import pull_csv_from_directory, upload_csv_aws, filter_counties
from scripts.utils.write_metadata import append_metadata

In [None]:
# pull .xlsx from aws
enviroscreen_excel = 's3://ca-climate-index/1_pull_data/society_economy/vulnerable_populations/ca_enviro_screen/calenviroscreen.xlsx'
enviroscreen_data = pd.read_excel(enviroscreen_excel)
enviroscreen_data = enviroscreen_data[['Census Tract',
                                       'California County',
                                       'Total Population',
                                       'Asthma',
                                       'Low Birth Weight',
                                       'Cardiovascular Disease',
                                       'Education',
                                       'Linguistic Isolation',
                                       'Poverty',
                                       'Unemployment',
                                       'Housing Burden',
                                       'Imp. Water Bodies'
                                       ]]
enviroscreen_data = enviroscreen_data.rename(columns={'Census Tract':"GEOID"})

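# read in the older CA census TIGER tract file (2017 vintage, which uses the 2010 tract boundaries)
old_census_path = "s3://ca-climate-index/0_map_data/tl_2017_06_tract/"
ca_old = gpd.read_file(old_census_path)
# convert GEOID to numeric so it matches the numeric tract IDs in the CalEnviroScreen data
ca_old['GEOID'] = pd.to_numeric(ca_old.GEOID)
ca_old = ca_old[["GEOID","geometry"]]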

enviroscreen_data = pd.merge(ca_old,enviroscreen_data,on="GEOID")
enviroscreen_data = gpd.GeoDataFrame(enviroscreen_data, geometry="geometry")

In [None]:
# read in the newer CA census TIGER tract file (2021 vintage, 2020 tract boundaries)
census_shp_dir = "s3://ca-climate-index/0_map_data/2021_tiger_census_tract/2021_ca_tract/"

ca_boundaries = gpd.read_file(census_shp_dir)
# prefix every non-geometry column with "USCB_" so we don't have any duplicates in the final geodatabase
# (build the mapping directly so old and new names cannot become misaligned)
ca_boundaries = ca_boundaries.rename(
    columns={column: "USCB_" + column for column in ca_boundaries.columns if column != "geometry"}
)
# drop unnecessary columns
ca_boundaries = ca_boundaries[["geometry","USCB_GEOID"]]
ca_boundaries

### A key step: we need to reproject from lat-lon coordinates to projected x-y coordinates, otherwise the distance calculations between old and new census tracts will be extremely inaccurate.
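
As a rough illustration of why distances measured in degrees are unreliable, the sketch below compares the ground length of the same 0.1 degree step in longitude near the southern and northern ends of California. It is a minimal sketch and not part of the original workflow: the `haversine_m` helper, the spherical-Earth radius, and the sample coordinates are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (hypothetical helper)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# The same 0.1 degree step in longitude at two illustrative California latitudes
south = haversine_m(-117.0, 32.5, -116.9, 32.5)
north = haversine_m(-122.0, 42.0, -121.9, 42.0)

print(f"0.1 deg of longitude at 32.5N: {south:,.0f} m")
print(f"0.1 deg of longitude at 42.0N: {north:,.0f} m")
```

The two printed lengths differ by roughly a kilometre, so a distance threshold expressed in degrees would mean different things in different parts of the state. Reprojecting to a CRS with metre units avoids that ambiguity.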

In [None]:
# EPSG:3857 is a projected CRS with units of metres
enviroscreen_data = enviroscreen_data.to_crs(crs=3857)
ca_boundaries = ca_boundaries.to_crs(crs=3857)

### Now the code calculates the distances between old and new tracts.
A distance of 0 means that the new tract intersects an old one (its boundaries lie within, overlap, or touch the older tract's). The join returns many more geometries than there are tracts because every old tract that is equidistant to a new tract is matched to it. We only search within a maximum distance of 5 km. This results in dropping one new tract (Farallon Islands; GEOID 06075980401), since the closest of the 2010 tracts is 25 km away from it. Otherwise all distances are 0, since old and new boundaries are nested within each other in some way:
1. Two or more old ones are combined to make a new one.
2. An old one is split into two or more.
3. A new one is made by merging portions of two or more old ones.
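
The behaviour described above can be reproduced on a toy example. The sketch below is illustrative only: the square "tracts", their IDs, and the `Asthma` values are made up, and only the `sjoin_nearest` arguments mirror the ones used in this notebook.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical old tracts: two 1 km squares that share an edge
old = gpd.GeoDataFrame(
    {"GEOID": [1, 2], "Asthma": [10.0, 30.0]},
    geometry=[box(0, 0, 1000, 1000), box(1000, 0, 2000, 1000)],
    crs="EPSG:3857",
)

# Hypothetical new tracts: "A" is identical to old tract 1,
# "B" sits about 48 km away from the nearest old tract
new = gpd.GeoDataFrame(
    {"USCB_GEOID": ["A", "B"]},
    geometry=[box(0, 0, 1000, 1000), box(50000, 0, 51000, 1000)],
    crs="EPSG:3857",
)

joined = gpd.sjoin_nearest(
    new, old,
    how="inner", distance_col="distances",
    max_distance=5000,
)

# "A" matches both old tracts at distance 0 (old tract 2 touches it along an edge);
# "B" has no old tract within 5 km, so the inner join drops it
print(joined[["USCB_GEOID", "GEOID", "Asthma", "distances"]])
print(joined.groupby("USCB_GEOID")["Asthma"].mean())
```

In this toy case tract "A" has not changed at all, yet it picks up its touching neighbour as a second match, which is why the join produces more rows than tracts.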

In [None]:
joined_df = gpd.sjoin_nearest(
    ca_boundaries, enviroscreen_data, 
    how="inner", distance_col="distances", 
    max_distance=5000
)
joined_df

In [None]:
data_vars = ['Asthma',
            'Low Birth Weight', 
            'Cardiovascular Disease', 
            'Education', 
            'Linguistic Isolation',
            'Poverty',
            'Unemployment', 
            'Housing Burden', 
            'Imp. Water Bodies' # probably not a use case here though
            ]
# now take the average of the old tracts that were matched to each new tract
joined_avg_df = joined_df.groupby(['USCB_GEOID','geometry'])[data_vars].mean().reset_index()
# pass the CRS explicitly so it is not lost in the groupby round trip
calenviroscreen_new_tracts = gpd.GeoDataFrame(joined_avg_df, geometry='geometry', crs=joined_df.crs)
calenviroscreen_new_tracts

### Explore the data

In [None]:
calenviroscreen_new_tracts.explore(column="Asthma")

### Explore the original data for reference

In [None]:
enviroscreen_data.explore(column="Asthma")

### One problem: the join ends up averaging even when a census tract has not changed.
* The proposed solution is to infill with the original data in these cases
* Perform a check between the new and old data, and infill with the original data where appropriate
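
One possible shape for that check is sketched below with toy data. It is a sketch under a loud assumption, not a finished method: it treats a tract as unchanged whenever its GEOID appears in both vintages, which is not guaranteed and would need to be verified (for example by also comparing geometries). The `infill_unchanged` helper and all values are hypothetical.

```python
import pandas as pd

def infill_unchanged(averaged, original, data_vars):
    """Replace averaged values with original ones where the GEOID exists in both tables.

    ASSUMPTION: a GEOID present in both vintages means the tract boundary is unchanged.
    """
    out = averaged.copy()
    # USCB_GEOID is a string with a leading zero; the original GEOID is numeric
    out["GEOID"] = pd.to_numeric(out["USCB_GEOID"])
    merged = out.merge(
        original[["GEOID", *data_vars]],
        on="GEOID", how="left", suffixes=("", "_orig"),
    )
    for var in data_vars:
        # prefer the original value; fall back to the average where there is
        # no match (or where the original value itself is missing)
        merged[var] = merged[f"{var}_orig"].combine_first(merged[var])
    return merged.drop(columns=["GEOID"] + [f"{var}_orig" for var in data_vars])

# Toy stand-ins for enviroscreen_data and calenviroscreen_new_tracts (made-up values)
original = pd.DataFrame({"GEOID": [6001400100, 6001400200], "Asthma": [10.0, 30.0]})
averaged = pd.DataFrame({
    "USCB_GEOID": ["06001400100", "06001400201", "06001400202"],
    "Asthma": [20.0, 30.0, 30.0],
})

infilled = infill_unchanged(averaged, original, ["Asthma"])
print(infilled)
```

Here the first tract keeps its GEOID, so its averaged value is replaced by the original one, while the two tracts with new GEOIDs keep their averaged values.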