## California Climate Investment Projects Crosswalk
This notebook analyses CCI-funded programs and projects by connecting each CCI project with a primary associated climate risk, as defined by the California Climate Risk and Adaptation Index (Cal-CRAI). For the purposes of this project, the term 'climate risk' includes the following: 
* Extreme heat
* Inland flooding
* Sea level rise
* Wildfire
* Drought

A sixth, non-climate-risk category is provided for *greenhouse gas (GHG) mitigation*, as many CCI projects are funded to broadly reduce GHG emissions. 

Analysis Steps:
- CCI data is scanned for common keywords associated with the defined climate risks via a dictionary to automatically assign a climate risk based on any keyword found in the project description, program description, sector, category, or action.
- When a project is assigned more than one risk:
   - If it is a defined climate risk and greenhouse gas mitigation, the project is assigned to the climate risk.
   - If it is more than one climate risk (excluding greenhouse gas mitigation), keywords are assessed in other data columns (SECTOR, CATEGORY, ACTION) to identify if there is a primary risk. If not, the project is manually assessed and assigned a primary risk classification.  

At present, the CCI data comprises 133,696 funded projects between 2015 and 2023. 
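
The analysis steps above can be illustrated with a minimal sketch on a single made-up description. The shortened keyword lists, the helper names `match_risks` and `primary_risk`, and the `'needs manual review'` placeholder are assumptions for illustration only. The notebook itself checks the SECTOR, CATEGORY, and ACTION columns before falling back to manual review.

```python
# Hypothetical, shortened keyword lists; the full dictionary is defined later in the notebook
toy_keywords = {
    'wildfire mitigation': ['wildfire', 'fuel reduction'],
    'drought mitigation': ['drought', 'irrigation'],
    'greenhouse gas mitigation': ['ghg', 'emission'],
}

def match_risks(text, keyword_dict):
    """Return the sorted risk labels whose keywords appear in text (case-insensitive)."""
    lowered = text.lower()
    return sorted(
        risk for risk, words in keyword_dict.items()
        if any(word.lower() in lowered for word in words)
    )

def primary_risk(risks):
    """A single climate risk takes precedence over greenhouse gas mitigation."""
    climate_only = [r for r in risks if r != 'greenhouse gas mitigation']
    if len(climate_only) == 1:
        return climate_only[0]
    if not climate_only:
        # Either only GHG mitigation was found, or nothing was found at all
        return risks[0] if risks else None
    # More than one climate risk remains; placeholder for the later review steps
    return 'needs manual review'

example = 'Fuel reduction project that also lowers GHG emissions'
example_risks = match_risks(example, toy_keywords)
example_primary = primary_risk(example_risks)
print(example_risks, '->', example_primary)
```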

In [1]:
# Import useful libraries
import os
import boto3
import pandas as pd

In [None]:
# Initialize the S3 client
s3_client = boto3.client('s3')

# Bucket name and file paths
bucket_name = 'ca-climate-index'
directory = '0_map_data/crosswalk_data/CCI_Projects_Project_Category_Update_02142024.xlsm'

print('Pulling file')
s3_client.download_file(bucket_name, directory, 'CCI_Projects_Project_Category_Update_02142024.xlsm')
print('File pulled')

In [3]:
crosswalk_data = pd.read_excel('CCI_Projects_Project_Category_Update_02142024.xlsm')

#### How many rows within the original dataset?

In [None]:
pd.set_option('display.max_columns', None)
print('Number of rows within dataset:', len(crosswalk_data))
#display(crosswalk_data)

#### Display all columns

In [None]:
print('Number of columns:', len(crosswalk_data.columns.tolist()))
display(crosswalk_data.columns.tolist())

#### Selecting columns relevant to initial analysis

In [6]:
relevant_columns = [
    'Program Name',
    'Program Description',
    'Sub Program Name',
    'Project Type',
    'Project Description',
    'SECTOR',
    'CATEGORY',
    'ACTION',
    'Census Tract',
    'Total Project GHGReductions',
    'Project Count'
]

In [7]:
data_of_interest = crosswalk_data[relevant_columns]

In [None]:
# Set display options to show all columns and rows
# pd.set_option('display.max_columns', None)  # To display all columns
# pd.set_option('display.max_rows', None)     # To display all rows

# Now display data_of_interest
display(data_of_interest)

#### Create a climate risk dictionary to scan through data based on dictionary values

In [9]:
climate_risk_dict = {
    'wildfire mitigation': ['wildfire', 'prescribed fire', 'fire prevention', 'controlled burn', 'controlled_burning', 
                            'prescribed burn', 'prescribed burning', 'firefighting', 'reforest', 'reforestation', 'vegetation management', 
                            'roadside brushing', 'fuel break', 'fuel reduction', 'ignition', 'crown', 'fuel load', 'Fire and Forest Management'],
    
    'sea level rise mitigation': ['sea level rise', 'slr', 'seawall', 'seawalls', 'shoreline', 'wetland', 'mangrove', 'coastal','Restoration of riparian', 'sea-level rise'],
    
    'extreme heat mitigation': ['extreme heat', 'shade', 'shading', 'cooling center', 'cooling centers', 'heat-resistant', 
                                'heat resistant', 'heat reducing', 'heat-reducing', 'energy savings', 'urban forestry'],
    
    'drought mitigation': ['drought', 'irrigation', 'soil moisture', 'rainwater harvest', 'rainwater harvesting', 'water storage', 
                           'water allocation', 'water management', 'soil health', 'soil management', 'organic matter', 'water efficiency'],
    
    'inland flooding mitigation': ['flooding', 'runoff', 'inland flood', 'inland flooding', 'floodplain', 'flood proof', 'floodproofing', 
                                   'elevated flood', 'flood barrier', 'flood barriers', 'drainage', 'riparian', 'stormwater'],
    
    'greenhouse gas mitigation': ['ghg', 'GHG', 'greenhouse gas', 'emission', 'emissions', 'carbon sequestration', 'electrification', 
                                  'carbon capture', 'solar power', 'renewable energy', 'wind energy', 'hydroelectricity', 'geothermal energy', 
                                  'biomass energy', 'Energy-efficiency', 'carbon sequestering', 'low-carbon', 'clean vehicles']
} 
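
Because matching is done by substring, a keyword in one category can also fire inside a longer keyword from another category. For example, 'riparian' under inland flooding is contained in 'Restoration of riparian' under sea level rise. The sketch below is one possible way to list such overlaps. The helper name `find_cross_category_overlaps` and the two-category toy dictionary are assumptions for illustration, not part of the original analysis.

```python
def find_cross_category_overlaps(keyword_dict):
    """Return (keyword, category, containing keyword, other category) tuples where a
    keyword from one category is a case-insensitive substring of a keyword in another."""
    overlaps = []
    for cat_a, words_a in keyword_dict.items():
        for cat_b, words_b in keyword_dict.items():
            if cat_a == cat_b:
                continue
            for a in words_a:
                for b in words_b:
                    if a.lower() in b.lower():
                        overlaps.append((a, cat_a, b, cat_b))
    return overlaps

# Toy subset of the dictionary above
toy = {
    'sea level rise mitigation': ['coastal', 'Restoration of riparian'],
    'inland flooding mitigation': ['riparian', 'stormwater'],
}
overlaps = find_cross_category_overlaps(toy)
for item in overlaps:
    print(item)
```

Any description that triggers the longer keyword will also trigger the shorter one, so the row ends up with both risks and lands in the multi-risk review later on.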

#### How many total and unique entries are in each column? This will help decide which column to start with

In [None]:
def count_entries(dataframe):
    total_entries = dataframe.count()
    unique_entries = dataframe.nunique()
    return total_entries, unique_entries

total_entries, unique_entries = count_entries(data_of_interest)
print("Total entries per column:")
print(total_entries)
print("\nUnique entries per column:")
print(unique_entries)

#### Loop through 'Project Description' first, as it has a large amount of variation across entries and makes the most practical sense for filtering on climate keywords

In [11]:
def add_climate_risk_column(df, keyword_dict, output_csv=None):
    # Initialize new columns to store climate risk mitigation labels, detected keywords, and repeat counts
    df['Climate_Risk_Mitigation'] = ''
    df['Detected_Climate_Risk_Mitigation_Keyword'] = ''
    df['Repeat_Project_Description_Count'] = 0

    # Initialize a counter for each keyword
    keyword_counter = {keyword: 0 for keyword in keyword_dict}

    # Create a dictionary to store the repeat count for each unique project description
    description_counts = {}

    # Create a dictionary to store the unique count for each keyword
    unique_keyword_counts = {keyword: set() for keyword in keyword_dict}

    # Iterate through each unique description
    unique_descriptions = df['Project Description'].unique()
    total_unique_descriptions = len(unique_descriptions)

    for description in unique_descriptions:
        # Find all rows with this description
        description_rows = df[df['Project Description'] == description]
        repeat_count = len(description_rows)
        # Update the repeat count for this description
        description_counts[description] = repeat_count

        # Iterate through each row with this description
        for index, row in description_rows.iterrows():
            keywords_found = set()  # To store unique keywords found in each row
            detected_values = []    # To store the detected values for each row
            # Iterate through each keyword in the dictionary
            for keyword, values in keyword_dict.items():
                # Check if any value of the keyword is present in the description (case-insensitive)
                detected = [val for val in values if val.lower() in description.lower()]
                if detected:
                    keywords_found.add(keyword)
                    keyword_counter[keyword] += 1
                    detected_values.extend(detected)
                    # Add the description to unique count for this keyword
                    unique_keyword_counts[keyword].add(description)

            # If no keywords are found in Project Description, search Program Description
            if not keywords_found:
                program_description = row['Program Description']
                if isinstance(program_description, str):  # Check if it's a string
                    for keyword, values in keyword_dict.items():
                        detected = [val for val in values if val.lower() in program_description.lower()]
                        if detected:
                            keywords_found.add(keyword)
                            keyword_counter[keyword] += 1
                            detected_values.extend(detected)

            # Update the 'Climate_Risk_Mitigation' column with the unique risk labels found
            # (sorted so the label order is reproducible between runs)
            df.at[index, 'Climate_Risk_Mitigation'] = ', '.join(sorted(keywords_found))
            # Update the 'Detected_Climate_Risk_Mitigation_Keyword' column with detected values
            df.at[index, 'Detected_Climate_Risk_Mitigation_Keyword'] = ', '.join(detected_values)
            # Update the 'Repeat_Project_Description_Count' column with the repeat count for this description
            df.at[index, 'Repeat_Project_Description_Count'] = repeat_count

    # Print keyword counts
    print("Keyword Counts:")
    for keyword, count in keyword_counter.items():
        print(f"{keyword}: {count}")
    print('')
    # Print total unique descriptions count
    print(f"Total Unique Project Descriptions: {total_unique_descriptions}")
    print('')

    # Save DataFrame as CSV if output_csv is provided
    if output_csv:
        df.to_csv(output_csv, index=False)
        print(f"DataFrame saved as {output_csv}")
        print('')
        # Initialize the S3 client
        s3_client = boto3.client('s3')

        # Bucket name and file paths
        bucket_name = 'ca-climate-index'
        directory = f'0_map_data/crosswalk_data/{output_csv}'
        # Upload the CSV file to S3
        print(f'Uploading {output_csv} to AWS')
        with open(output_csv, 'rb') as file:
            s3_client.upload_fileobj(file, bucket_name, directory)
            print(f'Upload complete! File is in {directory}')


#### Testing the function on the whole dataset. The function will:
- loop through each 'Project Description', look for words/phrases from our climate risk dictionary, and add each match to the keyword counter
- fall back to 'Program Description' when no keyword is found in 'Project Description'
- print the total keyword counts
- count the number of unique 'Project Description' entries
    * rows that have identical project descriptions are counted as a single unique project description
    * this helps reduce a bit of noise from some projects that have thousands of identical entries
- make three new columns: 'Climate_Risk_Mitigation', 'Detected_Climate_Risk_Mitigation_Keyword', and 'Repeat_Project_Description_Count', to add more context and help improve the dictionary keywords
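
The function above loops row by row. As a rough alternative sketch, the repeat counts and labels can be computed once per unique description and mapped back onto the rows. The toy DataFrame, the shortened keyword lists, and the helper `match_risks` are assumptions for illustration, and the 'Program Description' fallback is left out to keep the example short.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Project Description': [
        'Fuel break construction',
        'Fuel break construction',
        'Drip irrigation upgrade',
        None,
    ]
})

# Hypothetical, shortened keyword lists (already lowercase)
toy_keywords = {
    'wildfire mitigation': ['fuel break'],
    'drought mitigation': ['irrigation'],
}

def match_risks(text):
    """Return a comma-separated string of risk labels found in text."""
    lowered = text.lower()
    return ', '.join(sorted(
        risk for risk, words in toy_keywords.items()
        if any(word in lowered for word in words)
    ))

# Count how often each description occurs; rows with a missing description get 0
counts = toy_df['Project Description'].value_counts()
toy_df['Repeat_Project_Description_Count'] = (
    toy_df['Project Description'].map(counts).fillna(0).astype(int)
)

# Match each unique description once, then map the result back to every row
unique_matches = {
    d: match_risks(d) for d in toy_df['Project Description'].dropna().unique()
}
toy_df['Climate_Risk_Mitigation'] = (
    toy_df['Project Description'].map(unique_matches).fillna('')
)
print(toy_df)
```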

#### The cell below runs the function but also adds a few things:
- makes a data preview, selecting only the relevant columns that were created and that help interpret the Project Description screening results
- orders the data in descending order of Repeat Project Description Count to show Project Descriptions with multiple entries first (to make sure the dictionary is assigning large groups of entries the correct climate risk)

In [None]:
add_climate_risk_column(crosswalk_data, climate_risk_dict) #, 'climate_risk_attributed_crosswalk_data.csv')
pd.set_option('display.max_colwidth', None)
data_preview = crosswalk_data[['Project Description', 'Program Description', 'Repeat_Project_Description_Count', 'Detected_Climate_Risk_Mitigation_Keyword', 'Climate_Risk_Mitigation']]

# Filter the DataFrame to show only rows with entries in the 'Climate_Risk_Mitigation' column
data_preview_filtered = data_preview[data_preview['Climate_Risk_Mitigation'] != '']

# Sort the DataFrame based on 'Repeat_Project_Description_Count' in descending order
data_preview_filtered_sorted = data_preview_filtered.sort_values(by='Repeat_Project_Description_Count', ascending=False)

# Drop duplicates based on both 'Repeat_Project_Description_Count' and 'Project Description' to keep only one row per unique combination
data_preview_filtered_unique = data_preview_filtered_sorted.drop_duplicates(subset=['Repeat_Project_Description_Count', 'Project Description'])

#display(data_preview_filtered_unique)
display(data_preview_filtered_unique[:50])

#### Adding in tests to understand where more than one risk is assigned
- If two risks are assigned and one is GHG, assign the project to the associated climate risk (e.g., "greenhouse gas mitigation, sea level rise mitigation" should end up as "sea level rise mitigation")
   - 654 instances
- If two or more climate risks are assigned, manual intervention is needed to identify the final climate risk
   - Strip out all instances of "greenhouse gas mitigation" to reduce the number of manual interventions
   - Identify the "main" or "priority" risk denoted in the project description
In [None]:
multi_risk = crosswalk_data.loc[(crosswalk_data['Climate_Risk_Mitigation'].str.count(',') == 1)]
print('Number of rows with exactly two climate risk mitigation entries:', len(multi_risk))

#### Eliminating 'greenhouse gas mitigation' entries when other climate risks present

In [None]:
# Create a copy of the DataFrame to avoid modifying the original data
crosswalk_data_copy = crosswalk_data.copy()

# Select rows with exactly two entries, one of which is 'greenhouse gas mitigation'
multi_risk = crosswalk_data_copy.loc[(crosswalk_data_copy['Climate_Risk_Mitigation'].str.count(',') == 1) & 
                                (crosswalk_data_copy['Climate_Risk_Mitigation'].str.contains('greenhouse gas mitigation'))]

# Replace 'greenhouse gas mitigation' with an empty string in the 'Climate_Risk_Mitigation' column
crosswalk_data_copy.loc[multi_risk.index, 'Climate_Risk_Mitigation'] = multi_risk['Climate_Risk_Mitigation'].str.replace('greenhouse gas mitigation', '')

# Remove any remaining commas
crosswalk_data_copy.loc[multi_risk.index, 'Climate_Risk_Mitigation'] = crosswalk_data_copy.loc[multi_risk.index, 'Climate_Risk_Mitigation'].str.replace(',', '')

# Clean-up view for easier access
data_preview = multi_risk[['Project Description', 'Program Description', 'Repeat_Project_Description_Count', 'Detected_Climate_Risk_Mitigation_Keyword', 'Climate_Risk_Mitigation']]

# Display the updated DataFrame
#print(crosswalk_data)

print('Number of rows with two climate risk mitigations, one being greenhouse gas mitigation:', len(data_preview))
data_preview.head(5)

#### Look through other entries with multiple detected climate risk mitigations (not selected for greenhouse gas mitigation)

In [None]:
# How many rows with multiple climate risk mitigations
multi_risk = crosswalk_data_copy.loc[crosswalk_data_copy['Climate_Risk_Mitigation'].str.count(',') >= 1]
data_preview = multi_risk[['Project Description', 'Program Description', 'CATEGORY', 'SECTOR', 'Repeat_Project_Description_Count', 'Detected_Climate_Risk_Mitigation_Keyword', 'Climate_Risk_Mitigation']]

print('Number of rows with multiple climate risk entries, greenhouse gas mitigation not included:',len(data_preview))

* further filter by running keyword dictionary filter across the 'CATEGORY' and 'SECTOR' columns
* attribute just the newly found climate mitigation to the climate risk mitigation column
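
A minimal sketch of this secondary filter on toy data follows. The toy rows and the shortened keyword lists are assumptions for illustration. Passing `na=False` to `str.contains` treats missing CATEGORY or SECTOR values as non-matches, and combining the keyword mask with a multi-risk mask on the same DataFrame keeps the two boolean indexes aligned.

```python
import re
import pandas as pd

toy_df = pd.DataFrame({
    'CATEGORY': ['Fire and Forest Management', None, 'Water efficiency'],
    'SECTOR': ['Natural Resources', 'Transportation', None],
    'Climate_Risk_Mitigation': [
        'wildfire mitigation, drought mitigation',
        'greenhouse gas mitigation',
        'drought mitigation, inland flooding mitigation',
    ],
})

# Hypothetical, shortened keyword lists
toy_keywords = {
    'wildfire mitigation': ['Fire and Forest Management'],
    'drought mitigation': ['water efficiency'],
}

# Only rows that still carry more than one label are candidates for reassignment
is_multi = toy_df['Climate_Risk_Mitigation'].str.count(',') >= 1

for mitigation_type, keywords in toy_keywords.items():
    # Escape each keyword so it is matched literally inside the regex alternation
    pattern = '|'.join(re.escape(k) for k in keywords)
    hit = (
        toy_df['CATEGORY'].str.contains(pattern, case=False, na=False)
        | toy_df['SECTOR'].str.contains(pattern, case=False, na=False)
    )
    toy_df.loc[is_multi & hit, 'Climate_Risk_Mitigation'] = mitigation_type

print(toy_df['Climate_Risk_Mitigation'].tolist())
```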

In [None]:
# Iterate over the climate risk dictionary to filter rows and update the DataFrame
for mitigation_type, keywords in climate_risk_dict.items():
    # Build one regex alternation from the keywords for this mitigation type
    pattern = '|'.join(keywords)

    # Create a boolean mask for rows whose CATEGORY or SECTOR contains any keyword;
    # na=False treats missing values as non-matches
    mask = crosswalk_data_copy['CATEGORY'].str.contains(pattern, case=False, na=False) | \
           crosswalk_data_copy['SECTOR'].str.contains(pattern, case=False, na=False)
    
    # Restrict the mask to the multi-risk rows so the boolean index aligns with multi_risk
    filtered_rows = multi_risk[mask.loc[multi_risk.index]]
    
    # Update the 'Climate_Risk_Mitigation' column for the filtered rows
    crosswalk_data_copy.loc[filtered_rows.index, 'Climate_Risk_Mitigation'] = mitigation_type

In [None]:
# Identify how many rows still have more than one risk assigned
multi_risk = crosswalk_data_copy.loc[(crosswalk_data_copy['Climate_Risk_Mitigation'].str.count(',') >= 1)]
data_preview = multi_risk[['Project Description', 'CATEGORY', 'SECTOR', 'Repeat_Project_Description_Count', 'Detected_Climate_Risk_Mitigation_Keyword', 'Climate_Risk_Mitigation']]
print(len(data_preview))
pd.set_option('display.max_rows', None)  
#data_preview

#### Manually go through the remaining entries that have multiple climate risk mitigation entries and didn't get further filtered by the other-column subsetting
* get their row number, read the project description, and attribute the row to one of the climate risks
* descriptions that seemingly address 2+ climate risk mitigations somewhat equally are given both risk mitigations
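
Hand-maintained row lists are easy to get wrong, so a small audit before applying them may help. The sketch below flags row labels that are absent from the DataFrame index and row labels that are listed more than once. The helper name `audit_manual_rows`, the toy mapping, and the toy index are assumptions for illustration (99999 is a made-up label standing in for a mistyped row).

```python
from collections import Counter

def audit_manual_rows(mapping, valid_index):
    """Return (row labels missing from the index, row labels listed more than once)."""
    listed = Counter(row for rows in mapping.values() for row in rows)
    missing = sorted(row for row in listed if row not in valid_index)
    repeated = sorted(row for row, n in listed.items() if n > 1)
    return missing, repeated

# Toy mapping in the same shape as the manual assignment lists
toy_mapping = {
    'inland flooding mitigation': [60114, 75775],
    'drought mitigation': [75775, 99999],
    'extreme heat mitigation': [75775],
}
toy_index = {60114, 75775}  # stands in for the DataFrame index

missing, repeated = audit_manual_rows(toy_mapping, toy_index)
print('Missing from index:', missing)
print('Listed more than once:', repeated)
```

Rows reported as listed more than once are the ones intended to carry several risks, so they are worth checking against the final labels.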

In [None]:
cleaned_crosswalk_data = crosswalk_data_copy.copy()

# Define the rows to update based on the specified criteria
sea_level_rise_rows = [15693, 62317, 74956, 89774, 75856, 116130, 116131, 116132, 116133, 112994]
inland_flooding_rows = [60114, 60160, 74973, 75775, 89750, 89847, 89918, 91016,
                        128903, 60253, 60265, 60292, 89679, 89716, 89732]
drought_rows = [41034, 75775, 89918, 113037, 119459, 75188, 75107, 75814, 89828, 89846, 109976, 
                110600, 89819, 89798, 60110, 60113, 89677, 89680, 89715, 89793, 89794, 89881,
                89985, 99696, 99708, 110954, 112998]
wildfire_rows = [89455, 90049, 110288, 110291, 110294, 110297, 110298, 110303, 110305, 110333,
                110337, 110339, 110347, 110361, 110368, 110372, 110447, 110466, 111867, 111874,
                116515, 116520, 116541, 116543, 116548, 116582, 116583, 119503, 119516, 119518,
                119554, 124581, 124582, 127999, 128030, 128063, 128083, 128144, 128262, 128280,
                110373, 119509, 62321, 75163, 75165, 75166, 75167, 75168, 75169, 75170, 75171, 
                75172, 75173, 75174, 89705, 89894]
extreme_heat_rows = [75775, 124736]
greenhouse_gas_rows = [60117, 89961, 113029, 113036, 110373, 113030, 113026, 60109]

# Create a dictionary mapping mitigation types to their corresponding rows
mitigation_mapping = {
    'sea level rise mitigation': sea_level_rise_rows,
    'inland flooding mitigation': inland_flooding_rows,
    'drought mitigation': drought_rows,
    'wildfire mitigation': wildfire_rows,
    'extreme heat mitigation': extreme_heat_rows,
    'greenhouse gas mitigation': greenhouse_gas_rows,
}

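# Collect every mitigation type listed for each row, so that rows appearing in more
# than one list keep all of their assigned risks instead of only the last one applied
row_assignments = {}
for mitigation_type, rows_to_update in mitigation_mapping.items():
    for row_index in rows_to_update:
        row_assignments.setdefault(row_index, []).append(mitigation_type)

# Update the 'Climate_Risk_Mitigation' column for each row
for row_index, mitigation_types in row_assignments.items():
    cleaned_crosswalk_data.loc[row_index, 'Climate_Risk_Mitigation'] = ', '.join(mitigation_types)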

multi_risk = cleaned_crosswalk_data.loc[(cleaned_crosswalk_data['Climate_Risk_Mitigation'].str.count(',') >= 1)]
# How many rows that have multiple climate risk mitigation entries
print('Number of rows with multiple climate risk mitigations:', len(multi_risk))
#multi_risk

#### Count entries for each climate risk
- remove blank space preceding some entries
- add 'NA' to blank entries

In [None]:
# Remove leading and trailing spaces from the entries in the 'Climate_Risk_Mitigation' column
cleaned_crosswalk_data['Climate_Risk_Mitigation'] = cleaned_crosswalk_data['Climate_Risk_Mitigation'].str.strip()

# Replace empty entries with 'NA' (assigned back rather than using a chained inplace call)
cleaned_crosswalk_data['Climate_Risk_Mitigation'] = cleaned_crosswalk_data['Climate_Risk_Mitigation'].replace('', 'NA')

# Split any multi-risk entries and flatten the column into a single Series of risk labels
all_keywords = cleaned_crosswalk_data['Climate_Risk_Mitigation'].str.split(', ').explode().dropna()

# Count the occurrences of each keyword
keyword_counts = all_keywords.value_counts()

# Display the counts
print(keyword_counts)

#### Get rid of columns used for analysis, create csv, and upload to AWS

In [None]:
# Drop the specified columns from cleaned_crosswalk_data to create a new DataFrame
final_crosswalk_data = cleaned_crosswalk_data.drop(columns=['Detected_Climate_Risk_Mitigation_Keyword', 'Repeat_Project_Description_Count'])

output_csv = 'cci_projects_climate_risk_crosswalk.csv'

final_crosswalk_data.to_csv(output_csv, index=False)
print(f"DataFrame saved as {output_csv}")
print('')
# Initialize the S3 client
s3_client = boto3.client('s3')

# Bucket name and file paths
bucket_name = 'ca-climate-index'
directory = f'0_map_data/crosswalk_data/{output_csv}'
# Upload the CSV file to S3
print(f'Uploading {output_csv} to AWS')
with open(output_csv, 'rb') as file:
    s3_client.upload_fileobj(file, bucket_name, directory)
    print(f'Upload complete! File is in {directory}') 