# Cal-CRAI Index: Built Environment Domain

**Order of operations**: 
1) Metric handling \
   1a - Retrieve data \
   1b - Min-max standardization \
   1c - Set resilience orientation (metrics where a larger value indicates greater vulnerability are inverted, so that high values always indicate resilience)

2) Calculate indicators \
   2a - Min-max standardization \
   2b - Finalize domain score
   
3) Visualize, save, and export domain score dataframe

In [3]:
import pandas as pd
import os
import sys
import warnings

# suppress pandas PerformanceWarning messages (informational only)
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.file_helpers import pull_csv_from_directory, upload_csv_aws, delete_items
from scripts.utils.write_metadata import append_metadata
from scripts.utils.cal_crai_plotting import plot_domain_score, plot_region_domain
from scripts.utils.cal_crai_calculations import (handle_outliers, min_max_standardize, process_domain_csv_files, 
                                        compute_averaged_indicators, compute_summed_indicators, indicator_dicts, 
                                        add_census_tracts, domain_summary_stats)

## Step 1: Metric level
### 1a) Retrieve metric files and process

In [None]:
# set-up
bucket_name = 'ca-climate-index'
aws_dir = '3_fair_data/index_data/'

pull_csv_from_directory(bucket_name, aws_dir, output_folder='aws_csvs', search_zipped=False, print_name=False)

Process and merge Built Environment metric files together

In [None]:
# domain-specific
domain_prefix = 'built_'
input_folder = r'aws_csvs'
output_folder = domain_prefix + "folder"
meta_csv = r'../utils/calcrai_metrics.csv'

merged_output_file = f'concatenate_{domain_prefix}metrics.csv'
metric_vulnerable_resilient_dict = process_domain_csv_files(domain_prefix, input_folder, output_folder, meta_csv, merged_output_file)

Take a look at the resulting dictionary: we will use it later to re-orient the metrics flagged as 'vulnerable'.

In [None]:
metric_vulnerable_resilient_dict

Now, take a look at the single merged csv file

In [None]:
# read-in and view processed data
processed_built_df = pd.read_csv(merged_output_file)
processed_built_df

### 1b) Min-max standardization
Metrics are min-max standardized on a 0.01 to 0.99 scale.

In [None]:
# standardizing our df
columns_to_process = [col for col in processed_built_df.columns if col != 'GEOID']
min_max_metrics = min_max_standardize(processed_built_df, columns_to_process)
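The internals of `min_max_standardize` are not shown in this notebook. As a rough illustration only, the sketch below rescales each column to the 0.01 to 0.99 range mentioned above. The function name `min_max_sketch`, the `lower`/`upper` parameters, the handling of constant columns, and the toy data are all assumptions; the `_min_max_standardized` suffix is inferred from column names used later in this notebook.

```python
import pandas as pd

def min_max_sketch(df, columns, lower=0.01, upper=0.99):
    """Hypothetical stand-in for min_max_standardize; not the project's code."""
    out = df.copy()
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        span = col_max - col_min
        if span == 0:
            # Assumption: a constant column is placed at the midpoint of the range
            scaled = pd.Series(0.5, index=out.index)
        else:
            # Classic min-max scaling to [0, 1]
            scaled = (out[col] - col_min) / span
        # Rescale [0, 1] to [lower, upper] and store under a suffixed name
        out[f"{col}_min_max_standardized"] = lower + scaled * (upper - lower)
    return out

# Toy data, not real Cal-CRAI values
toy = pd.DataFrame({"GEOID": [1, 2, 3], "metric_a": [10.0, 20.0, 30.0]})
result = min_max_sketch(toy, ["metric_a"])
```

Keeping the original column alongside the standardized one mirrors the next step in the notebook, which selects only the columns whose names contain `standardized`.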

Keep only the GEOID and standardized columns

In [None]:
words = ['GEOID','standardized']
selected_columns = []
for word in words:
    selected_columns.extend(min_max_metrics.columns[min_max_metrics.columns.str.contains(word)].tolist())
min_max_standardized_built_metrics_df = min_max_metrics[selected_columns]
min_max_standardized_built_metrics_df.head()

### 1c) Set resilience orientation
* High values indicate resiliency
* Low values indicate vulnerability

Some metrics indicate a community's vulnerability rather than its resilience. For example, 'Number of Violent Crimes per 10,000 Population' represents a community's vulnerability to violent crime: the higher the number, the more vulnerable the community. We identify these 'vulnerable' metrics with our `metric_vulnerable_resilient_dict` dictionary and subtract their standardized values from 1, so that high values indicate resilience for every metric.

In [None]:
metric_vulnerable_resilient_dict

In [None]:
# Access the vulnerable column names from the dictionary
vulnerable_columns = metric_vulnerable_resilient_dict['vulnerable']

# Identify columns in the DataFrame that contain any of the vulnerable column names as substrings
vulnerable_columns_in_df = [col for col in min_max_standardized_built_metrics_df.columns 
                           if any(vulnerable_col in col for vulnerable_col in vulnerable_columns)]

# Create a new DataFrame with the adjusted vulnerable columns
adjusted_vulnerable_df = min_max_standardized_built_metrics_df.copy()

# Subtract the standardized vulnerable columns from one and store the result in the new DataFrame
adjusted_vulnerable_df.loc[:, vulnerable_columns_in_df] = (
    1 - adjusted_vulnerable_df.loc[:, vulnerable_columns_in_df]
)

# View results
adjusted_vulnerable_df.head()
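To make the inversion concrete, here is a small worked example on toy data. The column names, the `vulnerable` list, and the values are hypothetical and are not taken from the Cal-CRAI metric files.

```python
import pandas as pd

# Toy standardized metrics (made-up values)
toy = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "metric_x_min_max_standardized": [0.2, 0.9],  # hypothetical 'vulnerable' metric
    "metric_y_min_max_standardized": [0.3, 0.6],  # hypothetical 'resilient' metric
})
vulnerable = ["metric_x"]

# Select columns whose name contains any vulnerable metric name
flip_cols = [c for c in toy.columns if any(v in c for v in vulnerable)]

# Invert only those columns so that high values mean resilience
oriented = toy.copy()
oriented[flip_cols] = 1 - oriented[flip_cols]
```

Because the match is by substring, a name such as `metric_x` would also select a column called `metric_x2_min_max_standardized`; checking `flip_cols` before inverting is a cheap safeguard.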

## Step 2: Calculate Indicators
Loop through the dataframe columns and average the metrics that belong to each indicator, using the metric-to-indicator dictionary.

In [None]:
domain_prefix[:-1]

In [None]:
averaged_indicators_built_environment = compute_averaged_indicators(
    adjusted_vulnerable_df, 
    indicator_dicts(domain_prefix[:-1]), print_summary=True
)

# show resulting dataframe to highlight the indicator values
averaged_indicators_built_environment
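The grouping logic inside `compute_averaged_indicators` is not visible here, so the following is only a sketch of the general idea: map each indicator to the metrics it contains, then take the row-wise mean. The `indicator_map` dictionary, the keyword matching, and all column names and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical indicator -> metric keyword mapping
indicator_map = {
    "transportation": ["road", "rail"],
    "utilities": ["power"],
}

# Toy oriented, standardized metrics (made-up values)
toy = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "road_metric_min_max_standardized": [0.2, 0.4],
    "rail_metric_min_max_standardized": [0.6, 0.8],
    "power_metric_min_max_standardized": [0.5, 0.1],
})

indicators = pd.DataFrame({"GEOID": toy["GEOID"]})
for indicator, keywords in indicator_map.items():
    # Collect every metric column that belongs to this indicator
    cols = [c for c in toy.columns if any(k in c for k in keywords)]
    # Row-wise mean; pandas skips NaN values by default
    indicators[indicator] = toy[cols].mean(axis=1)
```

Since `mean(axis=1)` ignores missing values by default, a tract missing one metric is averaged over the metrics it does have, which may or may not match the behavior of the project helper.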

Save Indicator dataframe as a csv

In [14]:
# set-up file for export
indicator_filename = '{}domain_averaged_indicators.csv'.format(domain_prefix)
averaged_indicators_built_environment.to_csv(indicator_filename, index=False)

Sum the indicator columns together to calculate the domain score
* essentially summing all columns except for 'GEOID'

In [None]:
columns_to_sum = [col for col in averaged_indicators_built_environment.columns if col != 'GEOID']
summed_indicators_built_environment = compute_summed_indicators(
    df=averaged_indicators_built_environment, 
    columns_to_sum=columns_to_sum,
    domain_prefix=domain_prefix
)
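As a rough picture of what the summing step amounts to, the sketch below adds the indicator columns row by row. The output column name is inferred from the `summed_indicators_built_domain_min_max_standardized` column used later in this notebook; the indicator names and values are hypothetical, and the real `compute_summed_indicators` may differ.

```python
import pandas as pd

domain_prefix = "built_"

# Toy indicator table (made-up values, one missing entry)
toy = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "indicator_a": [0.4, 0.6],
    "indicator_b": [0.5, None],
})

cols = [c for c in toy.columns if c != "GEOID"]
summed = toy[["GEOID"]].copy()
# Row-wise sum; pandas treats NaN as 0 here because skipna defaults to True
summed[f"summed_indicators_{domain_prefix}domain"] = toy[cols].sum(axis=1)
```

Note that a tract with a missing indicator gets a lower sum than it otherwise would, so it is worth checking for missing values before interpreting the domain score.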

### 2a) Min-max standardize the summed columns
Indicators are also min-max standardized. We'll follow the same procedure again. 

In [None]:
columns_to_process = [col for col in summed_indicators_built_environment.columns if col != 'GEOID']
min_max_domain = min_max_standardize(summed_indicators_built_environment, columns_to_process)
min_max_domain

### 2b) Finalize domain score
* Isolate to census tract and summed standardized columns
* Rename tract to GEOID for merging
* Rename domain score column
* Add a zero at the beginning of the GEOID to match the census tract IDs it will be merged with

In [17]:
built_environment_domain_score = min_max_domain[['GEOID', 'summed_indicators_built_domain_min_max_standardized']].copy()

# GEOID handling
built_environment_domain_score['GEOID'] = built_environment_domain_score['GEOID'].apply(lambda x: '0' + str(x))
built_environment_domain_score['GEOID'] = built_environment_domain_score['GEOID'].astype(str).apply(lambda x: x.rstrip('0').rstrip('.') if '.' in x else x)

# rename for clarity
built_environment_domain_score = built_environment_domain_score.rename(columns={'summed_indicators_built_domain_min_max_standardized':'built_environment_domain_score'})
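California tract GEOIDs are 11 characters long (2-digit state code `06`, 3-digit county code, 6-digit tract code), and the leading zero is lost when the column is read as a number. The sketch below shows an alternative way to restore it with `str.zfill`; the helper name `to_tract_geoid` and the sample values are hypothetical.

```python
import pandas as pd

# Sample GEOIDs as they might appear after a numeric read (int, float, or str)
raw = pd.Series([6001400100, 6001400200.0, "6037101110"])

def to_tract_geoid(value):
    """Hypothetical helper: normalize a tract GEOID to an 11-character string."""
    text = str(value)
    # Drop a trailing ".0" left behind by float parsing
    if text.endswith(".0"):
        text = text[:-2]
    # Left-pad with zeros to the 11-character tract GEOID length
    return text.zfill(11)

geoids = raw.map(to_tract_geoid)
```

Unlike unconditionally prepending `'0'`, padding to a fixed width is safe to run more than once: a GEOID that already has 11 characters is left unchanged.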

## Step 3: Visualize, save, and export domain score

Let's look at some summary statistics for this domain:

In [None]:
domain_summary_stats(built_environment_domain_score, 'built_environment_domain_score')

Now let's visualize the entire domain!

In [None]:
plot_domain_score(built_environment_domain_score,
                  column_to_plot='built_environment_domain_score',
                  domain=domain_prefix)

### We can also visualize specific areas!
Here we'll use the specialized plotting function `plot_region_domain`, which can handle county identifiers (their corresponding FIPS codes) and pre-defined CA regions (listed below). For more information on the function, type `help(plot_region_domain)` to display additional context for all arguments.

Pre-defined regions include: `bay_area`, `central_region`, `inland_deserts`, `north_central`, `northern`, `south_coast`, `slr_coast`. 

CA County FIPS Code Look-Up Table
|County: Code|County: Code|County: Code|County: Code|
|-----|----|-----|-----|
|Alameda: 001|Lassen: 035|San Benito: 069|Tehama: 103|
|Alpine: 003|Los Angeles: 037|San Bernardino: 071|Trinity: 105|
|Amador: 005|Madera: 039|San Diego: 073|Tulare: 107|
|Butte: 007|Marin: 041|San Francisco: 075|Tuolumne: 109|
|Calaveras: 009|Mariposa: 043|San Joaquin: 077|Ventura: 111|
|Colusa: 011|Mendocino: 045|San Luis Obispo: 079|Yolo: 113|
|Contra Costa: 013|Merced: 047|San Mateo: 081|Yuba: 115|
|Del Norte: 015|Modoc: 049|Santa Barbara: 083| |
|El Dorado: 017|Mono: 051|Santa Clara: 085| |
|Fresno: 019|Monterey: 053|Santa Cruz: 087| |
|Glenn: 021|Napa: 055|Shasta: 089| |
|Humboldt: 023|Nevada: 057|Sierra: 091| |
|Imperial: 025|Orange: 059|Siskiyou: 093| |
|Inyo: 027|Placer: 061|Solano: 095| |
|Kern: 029|Plumas: 063|Sonoma: 097| |
|Kings: 031|Riverside: 065|Stanislaus: 099| |
|Lake: 033|Sacramento: 067|Sutter: 101| |

You can plot a domain's score by region, by specific county or counties, or for the entirety of CA with labels. Below are a few examples of each of these plotting scenarios.

`help(plot_region_domain)`
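The county codes in the look-up table correspond to characters 3 through 5 of a tract GEOID (after the 2-digit state code). If you want to inspect the scores for particular counties before plotting, a minimal sketch looks like this; the scores are made up, and this is not necessarily how `plot_region_domain` selects tracts internally.

```python
import pandas as pd

# Toy domain scores (made-up values) keyed by 11-character tract GEOID
scores = pd.DataFrame({
    "GEOID": ["06003010000", "06037101110", "06001400100"],
    "built_environment_domain_score": [0.42, 0.57, 0.63],
})

# County FIPS codes of interest: Alpine (003) and Alameda (001)
counties = ["003", "001"]

# Characters at positions 2-4 (0-based) hold the county FIPS code
county_code = scores["GEOID"].str[2:5]
subset = scores[county_code.isin(counties)]
```

This relies on the GEOID column already being a zero-padded string, as prepared in the previous step.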

In [None]:
plot_region_domain(built_environment_domain_score,
                  column_to_plot='built_environment_domain_score',
                  domain=domain_prefix,
                  domain_label_map={domain_prefix: 'Built Environment'},
                  region='slr_coast',
                  savefig=False, 
                  font_color='black')

In [None]:
plot_region_domain(built_environment_domain_score,
                  column_to_plot='built_environment_domain_score',
                  domain=domain_prefix,
                  domain_label_map={domain_prefix: 'Built Environment'}, 
                  region='central_region', 
                  savefig=False, 
                  font_color='black')

In [None]:
list_of_counties = ['003']
plot_region_domain(built_environment_domain_score,
                  column_to_plot='built_environment_domain_score',
                  domain=domain_prefix,
                  domain_label_map={domain_prefix: 'Built Environment'}, 
                  counties_to_plot=list_of_counties,
                  savefig=False,
                  font_color='black')

In [None]:
plot_region_domain(built_environment_domain_score,
                  column_to_plot='built_environment_domain_score',
                  domain=domain_prefix,
                  domain_label_map={domain_prefix: 'Built Environment'}, 
                  plot_all=True, 
                  savefig=False, 
                  font_color='black')

## Export the final domain csv file

In [24]:
# set-up file for export
domain_filename = '{}environment_domain_score.csv'.format(domain_prefix)
built_environment_domain_score.to_csv(domain_filename, index=False)

## Upload the indicator and domain score csv files to AWS

In [None]:
# upload to aws bucket
bucket_name = 'ca-climate-index'
directory = '3_fair_data/index_data'

files_upload = indicator_filename, domain_filename

for file in files_upload:
    upload_csv_aws([file], bucket_name, directory)

## Delete desired csv files
* all that were generated from this notebook by default

In [None]:
folders_to_delete = ["aws_csvs", "built_folder"]
csv_files_to_delete = ["concatenate_built_metrics.csv", "built_environment_domain_score.csv",
                       "built_domain_averaged_indicators.csv"]

delete_items(folders_to_delete, csv_files_to_delete)