## California Climate Investment Projects Crosswalk - Indicator & Climate Risk Mitigation Columns
This notebook analyses CCI-funded programs and projects by linking each CCI project to an indicator and a climate risk mitigation category, as outlined by ERA and CARB, using a keyword search function.

At present, the CCI data comprises 133,696 funded projects between 2015 and 2023. 

## Step One: Indicator Columns
The detected indicators are:
* Vulnerable populations
* Social Services
* Economic Health
* Emergency Response
* Personal preparedness
* Community preparedness
* Natural resources conservation
* Ecosystem type, condition, conservation
* Agricultural productivity conservation
* Transportation infrastructure
* Communication infrastructure
* Utilities infrastructure
* Housing vacancy and quality
* Wildfire exposure
* Wildfire loss
* Inland flooding exposure
* Inland flooding loss
* Extreme heat exposure
* Extreme heat loss
* Drought exposure
* Drought loss
* Sea level rise exposure
* Sea level rise loss

Analysis Steps: \
The CCI data is scanned for common metric keywords associated with the defined indicators. A dictionary maps each indicator to its keywords, and an indicator is assigned automatically whenever one of its keywords is found in any of the following columns of the CCI funded programs dataset:
* category
* sector
* project descriptions
* project type
* program description
* sub program name
* other project benefits description
* voucher description

Counters are added to report the number of rows in which each indicator was detected, as well as the number of rows in which a keyword was found in each column.
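
The scan described above can be sketched on a single toy record. This is a minimal illustration, not the notebook's function: `toy_dict`, `toy_row`, and `scan_row` are hypothetical names, and the keyword lists are a small made-up subset.

```python
import re

# Toy keyword dictionary (hypothetical subset, for illustration only)
toy_dict = {
    'Transportation infrastructure': ['road', 'transportation'],
    'Drought exposure': ['drought'],
}

# Toy project record holding two of the scanned columns
toy_row = {
    'CATEGORY': 'Water Efficiency',
    'Program Description': 'Supports clean transportation and drought resilience.',
}

def scan_row(row, keyword_dict, columns):
    """Return the indicators, keywords, and columns detected in one record."""
    indicators, keywords, hit_columns = set(), set(), set()
    for column in columns:
        text = str(row.get(column, '')).lower()
        for indicator, words in keyword_dict.items():
            for word in words:
                # Whole-word, case-insensitive match
                if re.search(r'\b' + re.escape(word.lower()) + r'\b', text):
                    indicators.add(indicator)
                    keywords.add(word)
                    hit_columns.add(column)
    # Sort so the output order is stable from run to run
    return sorted(indicators), sorted(keywords), sorted(hit_columns)

result = scan_row(toy_row, toy_dict, ['CATEGORY', 'Program Description'])
print(result)
```

One record can carry several indicators, which is why each of the three results is a collection rather than a single value.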

In [1]:
# Import useful libraries
import os
import boto3
import pandas as pd
import itertools
import re

### Pull the CCI data from Feb 14th, 2024

In [2]:
# Initialize the S3 client
s3_client = boto3.client('s3')

# Bucket name and file paths
bucket_name = 'ca-climate-index'
directory = '0_map_data/crosswalk_data/CCI_Projects_Project_Category_Update_02142024.xlsm'

print('Pulling file')
s3_client.download_file(bucket_name, directory, 'CCI_Projects_Project_Category_Update_02142024.xlsm')
print('File pulled')

Pulling file
File pulled


In [3]:
crosswalk_data = pd.read_excel('CCI_Projects_Project_Category_Update_02142024.xlsm')

#### Display all columns

In [4]:
print('Number of columns:', len(crosswalk_data.columns.tolist()))
display(crosswalk_data.columns.tolist())

Number of columns: 130


['Project IDNumber',
 'Reporting Cycle Name',
 'Agency Name',
 'Program Name',
 'Program Description',
 'Sub Program Name',
 'Record Type',
 'Project Name',
 'Project Type',
 'Project Description',
 'SECTOR',
 'CATEGORY',
 'ACTION',
 'Census Tract',
 'Address',
 'Lat Long',
 'Senate\nDistrict',
 'Assembly\nDistrict',
 'County',
 'Total Project Cost',
 'Total Program GGRFFunding',
 'Project Life Years',
 'Total Project GHGReductions',
 'Annual Project GHGReductions',
 'Project Count',
 'Fiscal Year Funding Project',
 'Is Benefit Disadvantaged Communities',
 'Disadvantaged Community Criteria',
 'Disadvantaged Community Need',
 'Disadvantaged Community Census Tracts',
 'Total GGRFDisadvantaged Community Funding',
 'Disadvantaged Community Benefits Description',
 'Funding Benefiting Disadvantaged Communities',
 'Estimated Num Vehicles In Service',
 'Funding Within Disadvantage Communities',
 'Other Project Benefits Description',
 'VMTReductions',
 'Number Of Housing Units',
 'Number Of Aff

#### Select the columns to be scanned by the function below

In [5]:
relevant_columns = [
    'CATEGORY',
    'SECTOR',
    'Project Description',
    'Project Type',
    'Program Description',
    'Sub Program Name',
    'Other Project Benefits Description',
    'Voucher Description'  
]

#### Create a metric-indicator dictionary to scan through data based on dictionary values
* first draft

In [6]:
metric_to_indicator_dict = {
    'Vulnerable populations': ['asthma', 'heart disease', 'myocardial infarction', 'low birth weight', 
                              'less than a high school education', 'linguistic isolation', 'poverty', 
                              'unemployment', 'housing burden', 'at-risk drinking water', 'homelessness', 
                              'without health insurance', 'no health insurance', 'ambulatory disability', 
                              'cognitive disability', 'financial assistance', 'over 65', 'under 5', 
                              'violent crime', 'no ac', 'no air conditioning', 'lack air conditioning', 
                              'outdoor employment', 'low food accessibility', 'no food accessibility'],
    
    'Social Services': ['healthcare', 'mental healthcare', 'substance abuse', 'blood bank', 'organ bank', 
                        'hospitals', 'personal care', 'construction', 'rebuild', 'rebuilding', 'maintenance'],
    
    'Economic Health': ['income', 'gini index', 'economic diversity', 'economy', 'economic health'],
    
    'Emergency Response': ['emergency response', 'firefighters', 'fireman', 'nurse', 'nurses', 
                           'law enforcement', 'police', 'fire stations', 'emergency medical care', 
                           'emergency services'],
    
    'Personal preparedness': ['emergency preparation', 'flood insurance', 'homeowners insurance'],
    
    'Community preparedness': ['disaster funding', 'disaster mitigation', 'mitigation funding', 'mitigation', 
                               'wildfire risk', 'flood risk'],
    
    'Natural resources conservation': ['land management', 'watershed', 'water quality'],
    
    'Ecosystem type condition conservation': ['ecosystem type', 'biodiversity', 'soil quality', 
                                                'soil cover', 'air quality', 'impervious', 
                                                'habitat conservation', 'habitat preservation', 
                                                'conservation'],
    
    'Agricultural productivity conservation': ['crop conservation', 'crop condition', 'agricultural productivity', 
                                               'agricultural conservation', 'crop soil', 'crop soil moisture'],
    
    'Transportation infrastructure': ['highway', 'road', 'roads', 'highways', 'freeways', 'freeway', 
                                      'freight rail network', 'train', 'trains', 'bridge', 'bridges', 
                                      'traffic', 'airport', 'airports', 'transportation'],
    
    'Communication infrastructure': ['communication', 'broadband internet', 'radio', 'cell service', 
                                     'cell phone service', 'microwave towers', 'paging', 'television', 
                                     'tv', 'land mobile'],
    
    'Utilities infrastructure': ['utilities', 'energy transmission', 'power lines', 'power line', 
                                  'energy production', 'power plant', 'power plants', 
                                  'public safety power shutoff', 'wastewater treatment'],
    
    'Housing vacancy and quality': ['housing', 'housing vacancy', 'housing quality', 'housing age', 
                                    'housing structures', 'housing structure', 'home', 'house', 'shelter'],
    
    'Wildfire exposure': ['red flag', 'wildfire exposure', 'vulnerable to wildfire', 'exposure to wildfire'],

    'Wildfire loss' : ['wildfire fatalities', 'wildfire loss', 'wildfire damage', 'loss to wildfire', 'acres burned'],

    'Inland flooding exposure' : ['flood warning', 'floodplain area', 'inland flooding', 'extreme precipitation', 'surface runoff'],

    'Inland flooding loss' : ['flood claim', 'flood cost', 'flood loss', 'flood crop damage'],

    'Extreme heat exposure' : ['heat warnings', 'extreme heat', 'warm nights', 'heat exposure'],

    'Extreme heat loss' : ['heat related illness', 'heat illness', 'crop loss from heat', 'chill hours'],

    'Drought exposure': ['drought exposure', 'historical drought', 'drought', 'water reduction'],

    'Drought loss': ['drought loss', 'crop loss from drought'],

    'Sea level rise exposure': ['vulnerable coastline', 'sea level rise exposure'],

    'Sea level rise loss': ['wetland change', 'loss to sea level rise']
}
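
Since the dictionary is a hand-written first draft, a quick audit can help catch repeated or overlapping keywords before scanning. The sketch below is an assumption about how such a check might look; `audit_keyword_dict` and the `example` dictionary are hypothetical and not part of the notebook.

```python
from collections import Counter

def audit_keyword_dict(keyword_dict):
    """Return (keywords repeated within one list, keywords shared across indicators)."""
    duplicates = {}
    owners = {}
    for indicator, words in keyword_dict.items():
        counts = Counter(w.lower() for w in words)
        repeated = sorted(w for w, n in counts.items() if n > 1)
        if repeated:
            duplicates[indicator] = repeated
        for w in counts:
            owners.setdefault(w, []).append(indicator)
    # A keyword owned by more than one indicator flags every one of them
    shared = {w: inds for w, inds in owners.items() if len(inds) > 1}
    return duplicates, shared

# Made-up dictionary used only to exercise the audit
example = {
    'Inland flooding loss': ['flood claim', 'flood cost', 'flood cost'],
    'Drought exposure': ['drought', 'water reduction'],
    'Drought loss': ['drought'],
}
duplicates, shared = audit_keyword_dict(example)
print(duplicates)
print(shared)
```

A repeated keyword is harmless to the scan, but a shared keyword means one match assigns several indicators at once, which may or may not be intended.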

#### The metric indicator column function:
* scans the columns listed in 'relevant_columns' for the keyword values in our metric_to_indicator_dict dictionary
    * columns are scanned in list order, starting with 'CATEGORY' and finishing with 'Voucher Description'
    * every column is scanned, but each keyword, indicator, and column is recorded only once per row
    * multiple indicators can be found per row
* the function prints the length of the dataset used, how many rows had no indicator detected, and how many rows were flagged for each indicator
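
The function relies on `\b` word boundaries, so a keyword only matches as a whole word. The short sketch below shows the effect; `has_keyword` is a hypothetical helper that mirrors the pattern used in the function, and the example strings are made up.

```python
import re

def has_keyword(keyword, text):
    """Case-insensitive whole-word search, mirroring the pattern used in the scan."""
    pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
    return re.search(pattern, str(text).lower()) is not None

# A plain substring test would wrongly find 'road' inside 'broadband'
substring_hit = 'road' in 'expands broadband access'
boundary_hit = has_keyword('road', 'Expands broadband access')

# Whole words match regardless of case or trailing punctuation
exact_hit = has_keyword('roads', 'Repaves rural ROADS.')

# Plurals are different words, so 'home' does not match 'homes'
plural_hit = has_keyword('home', 'Weatherizes homes')

print(substring_hit, boundary_hit, exact_hit, plural_hit)
```

This is why the dictionary lists both singular and plural forms such as 'road' and 'roads': a form that is not listed is not detected.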

In [7]:
def metric_indicator_column(df, keyword_dict, relevant_columns, output_csv=None):
    # Initialize new columns to store the detected indicators, the matched keywords, and the columns where they were found
    df['Indicator'] = ''
    df['Detected_Metric_Keyword'] = ''
    df['Columns_Detected'] = ''  # New column to store the columns where the keyword was detected

    # Initialize a counter for each keyword
    keyword_counter = {keyword: 0 for keyword in keyword_dict}

    # Initialize a counter for detected columns
    detected_columns_counter = {column: 0 for column in relevant_columns}

    # Iterate through each row
    for index, row in df.iterrows():
        keywords_found = set()  # To store unique keywords found in each row
        detected_values = set()  # To store unique detected values for each row
        detected_columns = set()  # To store unique columns where the keyword was detected
        
        # Iterate through each relevant column
        for column in relevant_columns:
            if column in row:
                detected_keys = [key for key in keyword_dict.keys() if any(re.search(r'\b' + re.escape(val.lower()) + r'\b', str(row[column]).lower()) for val in keyword_dict[key])]
                for detected_key in detected_keys:
                    # Check if any value of the detected key is present in the column (case-insensitive)
                    detected_values.update([val for val in keyword_dict[detected_key] if re.search(r'\b' + re.escape(val.lower()) + r'\b', str(row[column]).lower())])
                    if detected_values:
                        keywords_found.add(detected_key)
                        detected_columns.add(column)

        # Update the 'Indicator' column with detected keywords
        df.at[index, 'Indicator'] = ', '.join(keywords_found)
        # Update the 'Detected_Metric_Keyword' column with detected values
        df.at[index, 'Detected_Metric_Keyword'] = ', '.join(detected_values)
        # Update the 'Columns_Detected' column with detected columns
        columns_detected_str = ', '.join(detected_columns)
        df.at[index, 'Columns_Detected'] = columns_detected_str

    number_without_indicator = df[df['Indicator'] == '']

    print(f'Length of dataset: {len(df)}')
    print('')
    print(f'Number of rows without an indicator entry: {len(number_without_indicator)}')
    print('')
    # Print detected column counts
    print("Detected Column Counts:")
    for index, row in df.iterrows():
        detected_columns = row['Columns_Detected'].split(', ')
        for column in detected_columns:
            if column:
                detected_columns_counter[column] += 1

    for column, count in detected_columns_counter.items():
        print(f"{column}: {count}")
    print('')

    # Count keywords from the 'Indicator' column after populating it
    for index, row in df.iterrows():
        indicators = row['Indicator'].split(', ')
        for indicator in indicators:
            if indicator:  # Check if indicator is not empty
                keyword_counter[indicator] += 1

    # Print keyword counts
    print("Keyword Counts:")
    for keyword, count in keyword_counter.items():
        print(f"{keyword}: {count}")
    print('')

    # Check length of 'Indicator' entries containing 'Transportation infrastructure'
    transportation_indicator_count = len(df[df['Indicator'].str.contains('Transportation infrastructure')])

    print(f"FOR TESTING/FACT CHECKING - Number of 'Indicator' entries containing 'Transportation infrastructure': {transportation_indicator_count}")
    
    # Save DataFrame as CSV if output_csv is provided
    if output_csv:
        df.to_csv(output_csv, index=False)
        print(f"DataFrame saved as {output_csv}")
        print('')

## Select 1,000 random rows from the dataset to run the function on

In [8]:
sample_data = crosswalk_data.sample(1000)

### Testing function on our sample of the dataset
* all relevant columns are displayed afterwards for analysis
* the function includes a separate count of 'Transportation infrastructure' entries to fact check the counters
* there can be multiple indicators within the indicator column
* there can be multiple columns detected in the columns detected column
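
Because one cell can hold several comma-separated indicators, the printed counters can also be cross-checked directly in pandas. The sketch below assumes the same ', ' separator the function uses; the `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'Indicator' column produced by the function
toy = pd.DataFrame({
    'Indicator': [
        'Transportation infrastructure, Economic Health',
        'Transportation infrastructure',
        '',  # row with no indicator detected
    ]
})

counts = (
    toy['Indicator']
    .str.split(', ')            # one list of indicators per row
    .explode()                  # one indicator per row
    .loc[lambda s: s != '']     # drop rows where nothing was detected
    .value_counts()
)
print(counts)
```

If these counts disagree with the function's printed counters, the separator or the indicator names are the first things to check.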

In [9]:
metric_indicator_column(sample_data, metric_to_indicator_dict, relevant_columns) #, 'cci_project_indicators.csv')
pd.set_option('display.max_colwidth', None)
data_preview = sample_data[['CATEGORY',
                            'SECTOR',
                            'Project Description',
                            'Project Type',
                            'Program Description',
                            'Sub Program Name',
                            'Other Project Benefits Description',
                            'Voucher Description',
                            'Detected_Metric_Keyword', 
                            'Columns_Detected', 
                            'Indicator', 
                            'Project Count']]

data_preview_filtered = data_preview[data_preview['Indicator'] != '']
data_preview_filtered.head(1)

Length of dataset: 1000

Number of rows without an indicator entry: 30

Detected Column Counts:
CATEGORY: 57
SECTOR: 59
Project Description: 163
Project Type: 24
Program Description: 931
Sub Program Name: 30
Other Project Benefits Description: 73
Voucher Description: 0

Keyword Counts:
Vulnerable populations: 0
Social Services: 14
Economic Health: 149
Emergency Response: 3
Personal preparedness: 0
Community preparedness: 2
Natural resources conservation: 6
Ecosystem type condition conservation: 74
Agricultural productivity conservation: 1
Transportation infrastructure: 892
Communication infrastructure: 0
Utilities infrastructure: 0
Housing vacancy and quality: 47
Wildfire exposure: 0
Wildfire loss: 0
Inland flooding exposure: 0
Inland flooding loss: 0
Extreme heat exposure: 0
Extreme heat loss: 0
Drought exposure: 7
Drought loss: 0
Sea level rise exposure: 0
Sea level rise loss: 0

FOR TESTING/FACT CHECKING - Number of 'Indicator' entries containing 'Transportation infrastructure': 892

Unnamed: 0,CATEGORY,SECTOR,Project Description,Project Type,Program Description,Sub Program Name,Other Project Benefits Description,Voucher Description,Detected_Metric_Keyword,Columns_Detected,Indicator,Project Count
33599,Light-Duty Vehicles,"Zero-Emission Vehicles, Equipment, and Infrastructure","CVRP promotes clean vehicle adoption in California by offering rebates from $1,000 to $7,502 for the purchase or lease of new, eligible zero-emission vehicles, including electric, plug-in hybrid electric and fuel cell vehicles.",,"Provides mobile source incentives to reduce GHG emissions, criteria pollutants, and air toxics through the development of advanced technology and clean transportation. The program is comprised of sub-programs that provide a variety of disadvantaged community benefits.\n\nCARB also provides incentives to help households replace an uncertified wood stove, wood insert, or fireplace used as a primary source of heat with a cleaner burning and more efficient device.",Clean Vehicle Rebate Project,"CVRP promotes clean vehicle adoption in California by offering rebates from $1,000 to $7,500 for the purchase or lease of new, eligible zero-emission vehicles, including electric, plug-in hybrid and fuel cell vehicles.",,transportation,Program Description,Transportation infrastructure,2.0


## Step Two: Add the climate mitigation column to this dataset
For the purposes of this project, the term 'climate risk' includes the following: 
* Extreme heat
* Inland flooding
* Sea level rise
* Wildfire
* Drought

Analysis Steps: \
This process closely mirrors how the indicator column was created above. The CCI data is scanned for common keywords associated with the defined climate risks. A dictionary maps each climate risk to its keywords, and a climate risk is assigned automatically whenever one of its keywords is found in the same relevant columns used for the indicator column:
* category
* sector
* project descriptions
* project type
* program description
* sub program name
* other project benefits description
* voucher description

Counters are included below as well.

### Climate risk mitigation dictionary

In [10]:
climate_risk_dict = {
    'wildfire mitigation': ['wildfire', 'prescribed fire', 'fire prevention', 'controlled burn', 'controlled burning', 
                            'prescribed burn', 'prescribed burning', 'firefighting', 'reforest', 'reforestation', 'vegetation management', 
                            'roadside brushing', 'fuel break', 'fuel reduction', 'ignition', 'crown', 'fuel load', 'Fire and Forest Management'],
    
    'sea level rise mitigation': ['sea level rise', 'slr', 'seawall', 'seawalls', 'shoreline', 'wetland', 'mangrove', 'coastal','Restoration of riparian', 'sea-level rise'],
    
    'extreme heat mitigation': ['extreme heat', 'shade', 'shading', 'cooling center', 'cooling centers', 'heat-resistant', 
                                'heat resistant', 'heat reducing', 'heat-reducing', 'energy savings', 'urban forestry'],
    
    'drought mitigation': ['drought', 'irrigation', 'soil moisture', 'rainwater harvest', 'rainwater harvesting', 'water storage', 
                           'water allocation', 'water management', 'soil health', 'soil management', 'organic matter', 'water efficiency'],
    
    'inland flooding mitigation': ['flooding', 'runoff', 'inland flood', 'inland flooding', 'floodplain', 'flood proof', 'floodproofing', 
                                   'elevated flood', 'flood barrier', 'flood barriers', 'drainage', 'riparian', 'stormwater']
} 

## Function to create the climate mitigation column

This function is extremely similar to the indicator function

* the resulting sample df from the metric_indicator_column function is brought into this function so the final result is a CCI dataset with climate risk mitigation AND indicator columns
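
Since both functions share the same scanning logic, one possible refactor is to compile a single pattern per category up front instead of building a regex for every keyword on every row. This is a sketch of an alternative technique, not the notebook's method; `compile_patterns`, `toy_risks`, and the sample text are hypothetical.

```python
import re

def compile_patterns(keyword_dict):
    """Build one case-insensitive, whole-word alternation pattern per category."""
    patterns = {}
    for category, words in keyword_dict.items():
        # Longest keywords first so the longest alternative wins on overlap
        ordered = sorted(set(words), key=lambda w: (-len(w), w))
        alternation = '|'.join(re.escape(w) for w in ordered)
        patterns[category] = re.compile(r'\b(?:' + alternation + r')\b', re.IGNORECASE)
    return patterns

# Made-up subset of a climate risk dictionary
toy_risks = {
    'drought mitigation': ['drought', 'water efficiency'],
    'inland flooding mitigation': ['stormwater', 'runoff'],
}
patterns = compile_patterns(toy_risks)

text = 'Water Efficiency upgrades that also reduce stormwater runoff'
found = {
    category: sorted({m.group(0).lower() for m in pattern.finditer(text)})
    for category, pattern in patterns.items()
    if pattern.search(text)
}
print(found)
```

One behavioural difference to keep in mind: `finditer` returns non-overlapping matches, so when one keyword contains another (for example 'housing vacancy' and 'housing') only the longer one is reported at that position, whereas the per-keyword loop records both.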

In [11]:
def climate_mitigation_column(df, keyword_dict, relevant_columns, output_csv=None):
    # Initialize new columns to store the detected climate risk mitigation categories, the matched keywords, and the columns where they were found
    df['Climate_Risk_Mitigation'] = ''
    df['Detected_Climate_Risk_Mitigation_Keyword'] = ''
    df['Columns_Detected_Climate_Risk'] = ''  # New column to store the columns where the keyword was detected

    # Initialize a counter for each keyword
    keyword_counter = {keyword: 0 for keyword in keyword_dict}

    # Initialize a counter for detected columns
    detected_columns_counter = {column: 0 for column in relevant_columns}

    # Iterate through each row
    for index, row in df.iterrows():
        keywords_found = set()  # To store unique keywords found in each row
        detected_values = set()  # To store unique detected values for each row
        detected_columns = set()  # To store unique columns where the keyword was detected
        
        # Iterate through each relevant column
        for column in relevant_columns:
            if column in row:
                detected_keys = [key for key in keyword_dict.keys() if any(re.search(r'\b' + re.escape(val.lower()) + r'\b', str(row[column]).lower()) for val in keyword_dict[key])]
                for detected_key in detected_keys:
                    # Check if any value of the detected key is present in the column (case-insensitive)
                    detected_values.update([val for val in keyword_dict[detected_key] if re.search(r'\b' + re.escape(val.lower()) + r'\b', str(row[column]).lower())])
                    if detected_values:
                        keywords_found.add(detected_key)
                        detected_columns.add(column)

        # Update the 'Climate_Risk_Mitigation' column with detected keywords
        df.at[index, 'Climate_Risk_Mitigation'] = ', '.join(keywords_found)
        # Update the 'Detected_Climate_Risk_Mitigation_Keyword' column with detected values
        df.at[index, 'Detected_Climate_Risk_Mitigation_Keyword'] = ', '.join(detected_values)
        # Update the 'Columns_Detected_Climate_Risk' column with detected columns
        columns_detected_str = ', '.join(detected_columns)
        df.at[index, 'Columns_Detected_Climate_Risk'] = columns_detected_str

    number_without_climate_risk = df[df['Climate_Risk_Mitigation'] == '']

    print(f'Length of dataset: {len(df)}')
    print('')
    print(f'Number of rows without an climate risk entry: {len(number_without_climate_risk)}')
    print('')
    # Print detected column counts
    print("Detected Column Counts:")
    for index, row in df.iterrows():
        detected_columns = row['Columns_Detected_Climate_Risk'].split(', ')
        for column in detected_columns:
            if column:
                detected_columns_counter[column] += 1

    for column, count in detected_columns_counter.items():
        print(f"{column}: {count}")
    print('')

    # Count keywords from the 'Climate_Risk_Mitigation' column after populating it
    for index, row in df.iterrows():
        climate_risk = row['Climate_Risk_Mitigation'].split(', ')
        for climate in climate_risk:
            if climate:  # Check if climate risk is not empty
                keyword_counter[climate] += 1

    # Print keyword counts
    print("Keyword Counts:")
    for keyword, count in keyword_counter.items():
        print(f"{keyword}: {count}")
    print('')

    # Check length of 'Climate_Risk_Mitigation' entries containing 'wildfire mitigation'
    wildfire_count = len(df[df['Climate_Risk_Mitigation'].str.contains('wildfire mitigation')])

    print(f"TESTING/FACT CHECKING: Number of 'Indicator' entries containing 'wildfire mitigation': {wildfire_count}")
    
    # Save DataFrame as CSV if output_csv is provided
    if output_csv:
        df.to_csv(output_csv, index=False)
        print(f"DataFrame saved as {output_csv}")
        print('')

### Calling the function, adding the relevant columns (including indicator columns)

* also includes a print statement to see how many wildfire mitigations are in the dataset to fact check the counter

In [12]:
climate_mitigation_column(sample_data, climate_risk_dict, relevant_columns) #, 'cci_project_indicators.csv')
pd.set_option('display.max_colwidth', None)
data_preview = sample_data[['CATEGORY',
                            'SECTOR',
                            'Project Description',
                            'Project Type',
                            'Program Description',
                            'Sub Program Name',
                            'Other Project Benefits Description',
                            'Voucher Description',
                            'Detected_Metric_Keyword', 
                            'Columns_Detected', 
                            'Indicator', 
                            'Climate_Risk_Mitigation',
                            'Detected_Climate_Risk_Mitigation_Keyword',
                            'Columns_Detected_Climate_Risk',
                            'Project Count']]

data_preview_filtered = data_preview[data_preview['Climate_Risk_Mitigation'] != '']
data_preview_filtered.head(1)

Length of dataset: 1000

Number of rows without an climate risk entry: 894

Detected Column Counts:
CATEGORY: 46
SECTOR: 11
Project Description: 71
Project Type: 9
Program Description: 24
Sub Program Name: 8
Other Project Benefits Description: 56
Voucher Description: 0

Keyword Counts:
wildfire mitigation: 11
sea level rise mitigation: 2
extreme heat mitigation: 53
drought mitigation: 50
inland flooding mitigation: 5

TESTING/FACT CHECKING: Number of 'Indicator' entries containing 'wildfire mitigation': 11


Unnamed: 0,CATEGORY,SECTOR,Project Description,Project Type,Program Description,Sub Program Name,Other Project Benefits Description,Voucher Description,Detected_Metric_Keyword,Columns_Detected,Indicator,Climate_Risk_Mitigation,Detected_Climate_Risk_Mitigation_Keyword,Columns_Detected_Climate_Risk,Project Count
60481,Water Efficiency,Water and Energy Efficiency,Sacramento region disadvantage communities' (DACs) need to replace high-water-use and high-energy-use fixtures with WaterSense labeled efficient fixtures through direct installation and fixture distribution to lower income households.,Residential Faucet; Residential Showerheads; Commercial Faucets; Commercial Showerheads; Commercial Toilets; Commercial Urinals,"Provides grants to implement efficiencies that reduce GHG emissions. The program is comprised of two components, the Water-Energy Grant program and the Turbines program.",Water-Energy Grant Program,Reduced Utility Costs,,income,Project Description,Economic Health,drought mitigation,water efficiency,CATEGORY,5.0


#### Drop the columns used only for analysis so that just the indicator and climate mitigation columns are added, save as a CSV, and upload to AWS

In [13]:
final_sample_data = sample_data.drop(columns=['Detected_Metric_Keyword',
                                               'Columns_Detected',
                                                'Columns_Detected_Climate_Risk',
                                                'Detected_Climate_Risk_Mitigation_Keyword'])
output_csv = 'SAMPLE_cci_project_indicators_and_climate_risk.csv'

final_sample_data.to_csv(output_csv, index=False)
print(f'Dataframe saved as {output_csv}')
print('')
# Initialize the S3 client
s3_client = boto3.client('s3')

# Bucket name and file paths
bucket_name = 'ca-climate-index'
directory = f'0_map_data/crosswalk_data/{output_csv}'
# Upload the CSV file to S3
print(f'Uploading {output_csv} to AWS')
with open(output_csv, 'rb') as file:
    s3_client.upload_fileobj(file, bucket_name, directory)
    print(f'Upload complete! File is in {directory}')

Dataframe saved as cci_project_indicators_and_climate_risk.csv

Uploading cci_project_indicators_and_climate_risk.csv to AWS
Upload complete! File is in 0_map_data/crosswalk_data/cci_project_indicators_and_climate_risk.csv
