## This notebook creates the following metrics within the Society & Economy domain sourced from CalEnviroScreen:
* Age-adjusted emergency department visits for asthma per 10,000 people
* Age-adjusted emergency department visits for myocardial infarction per 10,000 people
* % of live, singleton births < 5.5 pounds (non-twin, including premature)
* % of population 25 and older with less than a high school education
* % of households where all members 14 and older have some difficulty speaking English
* % of population living below 2x federal poverty level
* % of population > 16 years old unemployed and eligible for the workforce
* % of households which are low-income and housing-burdened
* % of population with at-risk drinking water

In [2]:
import pandas as pd
import os
import sys
import math

# suppress pandas purely educational warnings
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.file_helpers import pull_csv_from_directory, upload_csv_aws, filter_counties
from scripts.utils.write_metadata import append_metadata

In [2]:
# pull .xlsx from aws
enviroscreen_excel = 's3://ca-climate-index/1_pull_data/society_economy/vulnerable_populations/ca_enviro_screen/calenviroscreen.xlsx'
enviroscreen_data = pd.read_excel(enviroscreen_excel)

In [None]:
enviroscreen_data

In [None]:
enviroscreen_data.columns

In [None]:
metric_enviroscreen_data = enviroscreen_data[['Census Tract', 
                                              'California County', 
                                              'Total Population', 
                                              'Drinking Water',
                                              'Drinking Water Pctl',
                                              'Groundwater Threats', 
                                              'Groundwater Threats Pctl', 
                                              'Asthma',
                                              'Asthma Pctl',
                                              'Low Birth Weight', 
                                              'Cardiovascular Disease', 
                                              'Education', 
                                              'Linguistic Isolation',
                                              'Poverty',
                                              'Unemployment', 
                                              'Housing Burden'
                                              ]]

In [None]:
metric_enviroscreen_data 

## Pulling in 2021 census population data. It can only be used (if desired) for the metrics that are not already percentages from the 2019 data

In [None]:
est_pop = "s3://ca-climate-index/0_map_data/cri_acs_demographic_estimated_population.csv"
ca_est_pop = pd.read_csv(est_pop)
ca_est_pop = ca_est_pop[['Census_Tract', 'est_total_pop']]
ca_est_pop = ca_est_pop.rename(columns={'est_total_pop': 'Total Population 2021'})
ca_est_pop = ca_est_pop.rename(columns={'Census_Tract': 'Census Tract'})

In [None]:
# Adding 2021 population column to our enviroscreen data merged based on census tract
merged_df = pd.merge(metric_enviroscreen_data, ca_est_pop[['Census Tract', 'Total Population 2021']], on='Census Tract', how='left')

# Move our merged 2021 pop column towards the front
column_to_move = 'Total Population 2021'
col = merged_df.pop(column_to_move)
merged_df.insert(3, column_to_move, col)
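The left merge above keeps every CalEnviroScreen tract, so any tract missing from the 2021 population table ends up with an empty `Total Population 2021`. A minimal sketch of how that could be checked, using toy stand-in tables (the tract IDs and populations are illustrative) and the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the two tables; tract IDs and populations are illustrative
enviroscreen = pd.DataFrame({
    "Census Tract": [6019001100, 6019000700, 6111001206],
    "Total Population": [2780, 3664, 778],
})
acs_2021 = pd.DataFrame({
    "Census Tract": [6019001100, 6077000700],
    "Total Population 2021": [3166, 5284],
})

# indicator=True adds a `_merge` column recording where each row was found;
# validate="one_to_one" raises if a tract ID is duplicated on either side
checked = pd.merge(
    enviroscreen,
    acs_2021,
    on="Census Tract",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Tracts present in the left table but absent from the 2021 population table
unmatched = checked.loc[checked["_merge"] == "left_only", "Census Tract"].tolist()
print(f"{len(unmatched)} of {len(checked)} tracts have no 2021 population")
```

Whether unmatched tracts should be dropped, filled, or left empty is a project decision; the sketch only makes them visible.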


In [None]:
merged_df = merged_df.rename(columns={'Total Population': 'Total Population 2019'})

merged_df

In [None]:
merged_df.columns

### Function Call
The function below creates a new DataFrame for each metric listed below. Some metrics are already percentages in the 2019 data, so those columns are renamed and retained as Cal-CRAI metrics. Each DataFrame is saved as a CSV named after its metric column.

Metrics that are already percentages in the 2019 data:
* % of live, singleton births < 5.5 pounds (non-twin, including premature)
* % of population 25 and older with less than a high school education
* % of households where all members 14 and older have some difficulty speaking English
* % of population living below 2x federal poverty level
* % of population > 16 years old unemployed and eligible for the workforce
* % of households which are low-income and housing-burdened

The function can also calculate a per-10,000-people value for metrics that have a 'sum of' column rather than a precomputed percentage.

Metrics calculated per 10,000 people have columns for both the 2019 and 2021 populations:
* Age-adjusted emergency department visits for asthma per 10,000 people
* Age-adjusted emergency department visits for myocardial infarction per 10,000 people

The asthma and cardiovascular values can be calculated with both the 2019 and 2021 populations, as the CalEnviroScreen values are 'Age-adjusted rate of emergency department visits for asthma/cardiovascular disease'.
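As a worked example of the per-10,000 arithmetic the function applies, here is a minimal sketch using three rows copied from the notebook's Asthma output. The helper name `per_10000` and the shortened output column names are hypothetical and only for illustration:

```python
import pandas as pd

# Three rows copied from the notebook's Asthma output; the last has no 2021 population
df = pd.DataFrame({
    "Census Tract": [6019001100, 6077000700, 6019000700],
    "Total Population 2019": [2780, 4680, 3664],
    "Total Population 2021": [3166.0, 5284.0, float("nan")],
    "Asthma": [129.54, 105.88, 139.45],
})


def per_10000(values: pd.Series, population: pd.Series) -> pd.Series:
    """Scale values by population to a per-10,000-people figure (hypothetical helper)."""
    return values / population * 10000


# One output column per population year, mirroring the function's 2019/2021 pair
for year in (2019, 2021):
    df[f"Asthma_per_10000_people_{year}"] = per_10000(
        df["Asthma"], df[f"Total Population {year}"]
    )

print(df.round(6))
```

A missing population propagates as a missing result, which is why tracts without a 2021 population show an empty 2021 column in the outputs.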

In [5]:
# @append_metadata
def calenviroscreen_metric_calc(columns_to_process, calculate_per_10000=False, varname=""):

    '''
    Calculates the following metrics sourced from CalEnviroScreen:
    * % of live, singleton births < 5.5 pounds (non-twin, including premature)
    * % of population 25 and older with less than a high school education
    * % of households where all members 14 and older have some difficulty speaking English
    * % of population living below 2x federal poverty level
    * % of population > 16 years old unemployed and eligible for the workforce
    * % of households which are low-income and housing-burdened
    * Age-adjusted emergency department visits for asthma per 10,000 people
    * Age-adjusted emergency department visits for myocardial infarction per 10,000 people
  
    Methods
    --------
    Relevant data columns were isolated and renamed to align with Cal-CRAI metrics.
    2021 American Community Survey population data was added and merged into the
    data so metrics could be calculated with updated population (where applicable).
    Metrics with % calculations were largely untouched as CalEnviroScreen data had
    those metrics calculated for 2019.
    Metrics with emergency department visits had their values adjusted to reflect
    number of visits per 10,000 people per tract with 2019 and 2021 population data.

    Parameters
    ------------
    columns_to_process: list
        list of columns that contain desired metric data
    calculate_per_10000: boolean
        if true, adds columns with calculations for # of visits per 10,000 people
        if false, retains the column but renames to 2019
    varname: string
        Final metric name.

    Script
    ------
    cal_enviroscreen_metrics.ipynb

    Note
    ------
    This function assumes users have configured the AWS CLI such that their access key / 
    secret key pair are stored in ~/.aws/credentials. 
    See https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html for guidance.
    '''

    # pull .xlsx from aws
    enviroscreen_excel = 's3://ca-climate-index/1_pull_data/society_economy/vulnerable_populations/ca_enviro_screen/calenviroscreen.xlsx'
    enviroscreen_data = pd.read_excel(enviroscreen_excel)
    
    print('Data transformation: isolating columns relevant to Cal-CRAI metrics.')
    metric_enviroscreen_data = enviroscreen_data[['Census Tract', 
                                              'California County', 
                                              'Total Population', 
                                              'Drinking Water',
                                              'Drinking Water Pctl',
                                              'Groundwater Threats', 
                                              'Groundwater Threats Pctl', 
                                              'Asthma',
                                              'Asthma Pctl',
                                              'Low Birth Weight', 
                                              'Cardiovascular Disease', 
                                              'Education', 
                                              'Linguistic Isolation',
                                              'Poverty',
                                              'Unemployment', 
                                              'Housing Burden'
                                              ]]
    
    est_pop = "s3://ca-climate-index/0_map_data/cri_acs_demographic_estimated_population.csv"
    ca_est_pop = pd.read_csv(est_pop)
    ca_est_pop = ca_est_pop[['Census_Tract', 'est_total_pop']]
    ca_est_pop = ca_est_pop.rename(columns={'est_total_pop': 'Total Population 2021'})
    ca_est_pop = ca_est_pop.rename(columns={'Census_Tract': 'Census Tract'})

    print('Data transformation: adding 2021 ACS population data, merging based on census tract.')
    # Adding 2021 population column to our enviroscreen data merged based on census tract
    merged_df = pd.merge(metric_enviroscreen_data, ca_est_pop[['Census Tract', 'Total Population 2021']], on='Census Tract', how='left')

    # Move our merged 2021 pop column towards the front
    column_to_move = 'Total Population 2021'
    col = merged_df.pop(column_to_move)
    merged_df.insert(3, column_to_move, col)
    merged_df = merged_df.rename(columns={'Total Population': 'Total Population 2019'})

    # List to store generated CSV file names
    csv_file_names = []
    
    print('Data transformation: renaming columns to reflect calculation year.')
    print('Data transformation: adding calculation columns for metrics with emergency department visits.')

    # Loop through columns to process
    for column in columns_to_process:
        # Create new DataFrame
        new_df = merged_df[['Census Tract', 'California County', 'Total Population 2019']].copy()
        
        # Build the output column names from the source column name
        base_name = column.replace(' ', '_')
        if calculate_per_10000:
            new_column_name = base_name + '_related_ED_visits_2019'
        else:
            new_column_name = base_name + '_percent_2019'
        new_column_name_per_10000_people_2019 = new_column_name.replace('_2019', '_per_10000_people_2019')
        new_column_name_per_10000_people_2021 = new_column_name.replace('_2019', '_per_10000_people_2021')
        
        # Add new column with the calculated name
        if not calculate_per_10000:
            new_df[new_column_name] = merged_df[column]
        else:
            # The 2021 population is only needed for the per-10,000 calculations
            new_df['Total Population 2021'] = merged_df['Total Population 2021']
            new_df[column] = merged_df[column]
            new_df[new_column_name_per_10000_people_2019] = (merged_df[column] / merged_df['Total Population 2019']) * 10000
            new_df[new_column_name_per_10000_people_2021] = (merged_df[column] / merged_df['Total Population 2021']) * 10000
        
        if not calculate_per_10000:
            # Define CSV file name based on the new column name
            csv_filename = new_column_name + '_metric.csv'
        else:
            # Define CSV file name based on the new column name
            csv_filename = new_column_name + '_percentage_metric.csv'

        # Save the DataFrame to CSV
        new_df.to_csv(csv_filename, index=False)
        
        print(f"Saved DataFrame to: {csv_filename}")

        # Append CSV filename to the list
        csv_file_names.append(csv_filename)
        # Output or further process new DataFrame
        display(new_df)

        bucket_name = 'ca-climate-index'
        directory = '3_fair_data/index_data'
        upload_csv_aws([csv_filename], bucket_name, directory)
        print('')

### Calling the function for both metric calc types
Note: we are currently having a hard time with the `append_metadata` wrapper, so the decorator is commented out above.

In [6]:
# Columns to loop through that are already percentages (no per-10,000 calculation needed)
columns_to_process_no_10000 = [
    'Low Birth Weight',
    'Education',
    'Linguistic Isolation',
    'Poverty',
    'Unemployment',
    'Housing Burden'
]
varnames = ['society_calenviroscreen_birth_rate',
            'society_calenviroscreen_low_education',
            'society_calenviroscreen_nonenglish_speakers',
            'society_calenviroscreen_below_poverty_level',
            'society_calenviroscreen_unemployment',
            'society_calenviroscreen_housing_burdened']

# Calculate metrics without the per-10,000 calculation
calenviroscreen_metric_calc(columns_to_process_no_10000, calculate_per_10000=False, varname='')


varnames = ['society_calenviroscreen_emergency_dept_visits',
            'society_calenviroscreen_emergency_dept_myocardial_visits']

# Columns to loop through that need per-10,000 values calculated
columns_to_process_per_10000 = [
    'Asthma',
    'Cardiovascular Disease'
]
# Calculate per-10,000 values
calenviroscreen_metric_calc(columns_to_process_per_10000, calculate_per_10000=True, varname='')


Data transformation: isolating columns relevant to Cal-CRAI metrics.
Data transformation: adding 2021 ACS population data, merging based on census tract.
Data transformation: renaming columns to reflect calculation year.
Data transformation: adding calculation columns for metrics with emergency department visits.
Saved DataFrame to: Low_Birth_Weight_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Low_Birth_Weight_percent_2019
0,6019001100,Fresno,2780,7.80
1,6077000700,San Joaquin,4680,6.88
2,6037204920,Los Angeles,2751,7.11
3,6019000700,Fresno,3664,10.65
4,6019000200,Fresno,2689,10.25
...,...,...,...,...
8030,6107004000,Tulare,582,
8031,6109985202,Tuolumne,2509,
8032,6111001206,Ventura,778,
8033,6111003012,Ventura,675,


Low_Birth_Weight_percent_2019_metric.csv uploaded to AWS

Saved DataFrame to: Education_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Education_percent_2019
0,6019001100,Fresno,2780,44.5
1,6077000700,San Joaquin,4680,46.4
2,6037204920,Los Angeles,2751,52.2
3,6019000700,Fresno,3664,41.4
4,6019000200,Fresno,2689,43.6
...,...,...,...,...
8030,6107004000,Tulare,582,43.6
8031,6109985202,Tuolumne,2509,34.1
8032,6111001206,Ventura,778,
8033,6111003012,Ventura,675,


Education_percent_2019_metric.csv uploaded to AWS

Saved DataFrame to: Linguistic_Isolation_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Linguistic_Isolation_percent_2019
0,6019001100,Fresno,2780,16.0
1,6077000700,San Joaquin,4680,29.7
2,6037204920,Los Angeles,2751,17.1
3,6019000700,Fresno,3664,15.7
4,6019000200,Fresno,2689,20.0
...,...,...,...,...
8030,6107004000,Tulare,582,
8031,6109985202,Tuolumne,2509,
8032,6111001206,Ventura,778,
8033,6111003012,Ventura,675,51.9


Linguistic_Isolation_percent_2019_metric.csv uploaded to AWS

Saved DataFrame to: Poverty_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Poverty_percent_2019
0,6019001100,Fresno,2780,76.0
1,6077000700,San Joaquin,4680,73.2
2,6037204920,Los Angeles,2751,62.6
3,6019000700,Fresno,3664,65.7
4,6019000200,Fresno,2689,72.7
...,...,...,...,...
8030,6107004000,Tulare,582,79.6
8031,6109985202,Tuolumne,2509,
8032,6111001206,Ventura,778,17.1
8033,6111003012,Ventura,675,96.7


Poverty_percent_2019_metric.csv uploaded to AWS

Saved DataFrame to: Unemployment_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Unemployment_percent_2019
0,6019001100,Fresno,2780,12.8
1,6077000700,San Joaquin,4680,19.8
2,6037204920,Los Angeles,2751,6.4
3,6019000700,Fresno,3664,15.7
4,6019000200,Fresno,2689,13.7
...,...,...,...,...
8030,6107004000,Tulare,582,
8031,6109985202,Tuolumne,2509,
8032,6111001206,Ventura,778,
8033,6111003012,Ventura,675,


Unemployment_percent_2019_metric.csv uploaded to AWS

Saved DataFrame to: Housing_Burden_percent_2019_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Housing_Burden_percent_2019
0,6019001100,Fresno,2780,30.3
1,6077000700,San Joaquin,4680,31.2
2,6037204920,Los Angeles,2751,20.3
3,6019000700,Fresno,3664,35.4
4,6019000200,Fresno,2689,32.7
...,...,...,...,...
8030,6107004000,Tulare,582,
8031,6109985202,Tuolumne,2509,
8032,6111001206,Ventura,778,24.4
8033,6111003012,Ventura,675,


Housing_Burden_percent_2019_metric.csv uploaded to AWS

Data transformation: isolating columns relevant to Cal-CRAI metrics.
Data transformation: adding 2021 ACS population data, merging based on census tract.
Data transformation: renaming columns to reflect calculation year.
Data transformation: adding calculation columns for metrics with emergency department visits.
Saved DataFrame to: Asthma_related_ED_visits_2019_percentage_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Total Population 2021,Asthma,Asthma_related_ED_visits_per_10000_people_2019,Asthma_related_ED_visits_per_10000_people_2021
0,6019001100,Fresno,2780,3166.0,129.54,465.971223,409.159823
1,6077000700,San Joaquin,4680,5284.0,105.88,226.239316,200.378501
2,6037204920,Los Angeles,2751,2623.0,76.10,276.626681,290.125810
3,6019000700,Fresno,3664,,139.45,380.594978,
4,6019000200,Fresno,2689,2861.0,139.08,517.218297,486.123733
...,...,...,...,...,...,...,...
8030,6107004000,Tulare,582,561.0,61.64,1059.106529,1098.752228
8031,6109985202,Tuolumne,2509,2038.0,68.79,274.172977,337.536801
8032,6111001206,Ventura,778,,48.36,621.593830,
8033,6111003012,Ventura,675,,45.35,671.851852,


Asthma_related_ED_visits_2019_percentage_metric.csv uploaded to AWS

Saved DataFrame to: Cardiovascular_Disease_related_ED_visits_2019_percentage_metric.csv


Unnamed: 0,Census Tract,California County,Total Population 2019,Total Population 2021,Cardiovascular Disease,Cardiovascular_Disease_related_ED_visits_per_10000_people_2019,Cardiovascular_Disease_related_ED_visits_per_10000_people_2021
0,6019001100,Fresno,2780,3166.0,21.47,77.230216,67.814277
1,6077000700,San Joaquin,4680,5284.0,20.26,43.290598,38.342165
2,6037204920,Los Angeles,2751,2623.0,20.87,75.863322,79.565383
3,6019000700,Fresno,3664,,22.68,61.899563,
4,6019000200,Fresno,2689,2861.0,22.64,84.194868,79.133170
...,...,...,...,...,...,...,...
8030,6107004000,Tulare,582,561.0,21.22,364.604811,378.253119
8031,6109985202,Tuolumne,2509,2038.0,22.89,91.231566,112.315996
8032,6111001206,Ventura,778,,8.77,112.724936,
8033,6111003012,Ventura,675,,12.25,181.481481,


Cardiovascular_Disease_related_ED_visits_2019_percentage_metric.csv uploaded to AWS
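The `varnames` lists defined in the cell above are never passed to the function (both calls use `varname=''`). A minimal sketch of one way to pair each source column with its metric name, assuming the intent is one name per column in list order. The helper `pair_columns_with_varnames` is hypothetical, and the notebook does not show how `append_metadata` consumes `varname`, so this only covers the pairing step:

```python
# Column and metric names copied from the cell above
columns_to_process_per_10000 = ["Asthma", "Cardiovascular Disease"]
varnames = [
    "society_calenviroscreen_emergency_dept_visits",
    "society_calenviroscreen_emergency_dept_myocardial_visits",
]


def pair_columns_with_varnames(columns, names):
    """Map each source column to its metric name (hypothetical helper)."""
    if len(columns) != len(names):
        raise ValueError("columns and varnames must have the same length")
    return dict(zip(columns, names))


column_to_varname = pair_columns_with_varnames(columns_to_process_per_10000, varnames)
for column, varname in column_to_varname.items():
    print(f"{column} -> {varname}")
```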



Drinking water is a bit more challenging. The CalEnviroScreen report states that the index ranges from 0 to 1161 and is calculated as follows:

1. Drinking water system boundaries and townships were downloaded and cleaned.
2. Average concentrations for each contaminant were calculated and associated with each water system and township.
3. The systems’ and townships’ average water contaminant concentrations were re-allocated from the associated boundaries to census tracts. The census tracts were then ranked to obtain a percentile score for each contaminant and tract.
4. A census tract contaminant index was calculated as the sum of the percentiles for all contaminants.
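Steps 3 and 4 above (rank tracts per contaminant, then sum the percentiles) can be sketched in pandas. This is an illustration under stated assumptions, not the report's actual procedure: the contaminant names and concentrations are made up, and pandas' `rank(pct=True)` convention may differ from how CalEnviroScreen handles ties, zeros, or missing values.

```python
import pandas as pd

# Hypothetical average concentrations per census tract;
# contaminant names and values are illustrative only
concentrations = pd.DataFrame(
    {
        "arsenic": [1.0, 4.0, 2.0, 8.0],
        "nitrate": [10.0, 5.0, 20.0, 40.0],
    },
    index=pd.Index([1001, 1002, 1003, 1004], name="Census Tract"),
)

# Step 3: rank tracts within each contaminant and express the rank as a percentile (0-100)
percentiles = concentrations.rank(pct=True) * 100

# Step 4: the tract's index is the sum of its percentiles across contaminants
contaminant_index = percentiles.sum(axis=1).rename("Drinking Water Index")
print(contaminant_index)
```

Because the index is a sum of percentiles, its upper bound grows with the number of contaminants included, which is consistent with the reported range extending well beyond 100.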

California water systems have a high rate of compliance with drinking water standards. In
2017, systems serving an estimated 1.6 percent of the state’s population were in
violation of one or more federal drinking water standards (SWRCB, 2018). The drinking
water contaminant index in CalEnviroScreen 4.0 is not a measure of compliance with
these or California’s state standards. The drinking water contaminant index is a
combination of contaminant data that takes into account the relative concentrations of
different contaminants and whether multiple contaminants are present. The indicator
does not indicate whether water is safe to drink.

There are also a few other potentially useful metrics (all have percentile columns/data):

IMPAIRED WATERBODIES
Contamination of California streams, rivers, lakes, and coastal waters by pollutants can
compromise the use of the water body for drinking, swimming, fishing, aquatic life
protection, and other beneficial uses. When this occurs, such water bodies are considered
“impaired.” Information on impairments to these water bodies can help determine the
extent of environmental degradation within an area.

GROUNDWATER THREATS
Many activities can pose threats to groundwater quality. These include the storage and
disposal of hazardous materials on land and in underground storage tanks at various
types of commercial, industrial, and military sites. Thousands of storage tanks in California
have leaked petroleum or other hazardous substances, degrading soil and groundwater.
Storage tanks are of particular concern when they can affect drinking water supplies. In
addition, the land surrounding these sites may be taken out of service due to perceived
cleanup costs or concerns about liability. Dairy farms and concentrated animal-feeding
operations, which produce large quantities of animal manure pose a threat to
groundwater. Other activities that pose threats to groundwater quality include produced
water ponds, which are generated as a result of oil and gas development. The most
complete sets of information related to sites that may impact groundwater and require
cleanup are maintained by the State Water Resources Control Board. 


So, do we want to use the drinking water index and devise our own threshold for at-risk water, or use the percentiles and treat tracts beyond a chosen cutoff as at risk? Or should we use one of the other two water-related metrics included in the dataset?
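To make the percentile option concrete, here is a minimal sketch of how a cutoff could be turned into a "% of population with at-risk drinking water" value. Everything specific in it is an assumption: the cutoff of 75 is hypothetical, the rows are made up, aggregating to county level is only one possible choice, and it assumes a higher `Drinking Water Pctl` means a worse score.

```python
import pandas as pd

# Illustrative rows; counties, populations, and percentiles are made up for the sketch
tracts = pd.DataFrame({
    "California County": ["Fresno", "Fresno", "Ventura", "Ventura"],
    "Total Population": [2000, 3000, 1000, 4000],
    "Drinking Water Pctl": [92.0, 40.0, 80.0, 10.0],
})

PCTL_CUTOFF = 75  # hypothetical threshold, not taken from CalEnviroScreen

# Flag tracts at or above the cutoff, then count the population living in them
tracts["at_risk"] = tracts["Drinking Water Pctl"] >= PCTL_CUTOFF
tracts["at_risk_pop"] = tracts["Total Population"].where(tracts["at_risk"], 0)

# Share of each county's population living in flagged tracts
by_county = tracts.groupby("California County")[["at_risk_pop", "Total Population"]].sum()
by_county["percent_at_risk"] = by_county["at_risk_pop"] / by_county["Total Population"] * 100
print(by_county)
```

Note that a tract with a missing percentile compares as not at risk here, so missing values would need an explicit decision before this approach is used.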