## Cal-CRAI Metric Calculation for: Built Environment / Housing Vacancy & Quality
This notebook calculates four metrics, all sourced from the American Community Survey (ACS):
- Housing vacancy: number of vacant homes per tract (under ownership)
- Housing age: median age of residential housing per tract
- Housing quality: % of homes lacking complete kitchen or plumbing facilities
- Housing structures: % of housing units that are mobile homes

In [1]:
import os
import sys
import pandas as pd
import io
import numpy as np

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.write_metadata import (
    append_metadata
)
from scripts.utils.file_helpers import (
    pull_csv_from_directory, upload_csv_aws
)

In [2]:
bucket_name = 'ca-climate-index'
aws_dir = '1_pull_data/built_environment/housing/acs/'

pull_csv_from_directory(bucket_name, aws_dir, search_zipped=True)

Saved DataFrame as 'ACSST5Y2022.S2504-Data.csv'
Saved DataFrame as 'ACSST5Y2022.S2504-Column-Metadata.csv'
Saved DataFrame as 'ACSDT5Y2022.B25004-Data.csv'
Saved DataFrame as 'ACSDT5Y2022.B25004-Column-Metadata.csv'


## Metric 1: Housing Vacancy
Based on the ACS documentation, we use the 'estimated total' number of vacant housing units. All vacancy variables imply ownership, with the possible exception of 'other vacant'. According to the documentation, the 'other vacant' variable includes:
- personal/family reasons
- needs repairs
- foreclosure
- being repaired
- storage
- extended absence
- legal proceedings
- preparing to rent/sell
- possibly abandoned/to be demolished
- specific use housing
- other write in/don't know

Except for 'possibly abandoned/to be demolished', all entries within 'other vacant' are likely under ownership.

In [None]:
housing_vacancy = pd.read_csv('ACSDT5Y2022.B25004-Data.csv')
housing_vacancy.head(5)

The GEO_ID values are quite long, so we add a new 'Census_Tract' column in the shorter census tract format used in other sources.

In [None]:
housing_vacancy['Census_Tract'] = housing_vacancy['GEO_ID'].str[10:]
housing_vacancy[:2]
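As a small illustration of what the slice keeps, the sketch below applies the same `.str[10:]` to two made-up GEO_ID values (they are hypothetical, not taken from the data). It assumes the tract-level GEO_ID layout of a '1400000US' prefix followed by the 11-digit tract FIPS code, so the slice also drops the leading zero of California's state code '06'.

```python
import pandas as pd

# Hypothetical GEO_ID values in the ACS tract-level layout
geo_ids = pd.Series(['1400000US06001400100', '1400000US06037920336'])

# Characters 0-8 are the '1400000US' prefix; character 9 is the leading
# zero of the state FIPS code '06', so slicing from 10 drops both
census_tract = geo_ids.str[10:]
print(census_tract.tolist())
```

If another source stores tract IDs with the leading zero, the two columns would need the same padding before merging.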


In [None]:
# Isolating columns relevant to our data metric
cri_housing_vacancy_df = housing_vacancy[['GEO_ID', 'Census_Tract', 'B25004_001E']]
# Dropping the first row, which holds column descriptions rather than data
cri_housing_vacancy_df = cri_housing_vacancy_df.iloc[1:]
# Rename the total vacant housing units column from its identifier to our metric name
cri_housing_vacancy_df = cri_housing_vacancy_df.rename(columns={'B25004_001E': 'estimated_total_vacant_housing_units'})
cri_housing_vacancy_df

# Saving metric df to .csv file
cri_housing_vacancy_df.to_csv('built_metric_housing_vacancy.csv')
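Because the descriptive row is read as data, every column in the dataframe ends up stored as text. One possible alternative, sketched here on a hypothetical in-memory extract rather than the real file, is to skip that row at read time with `skiprows=[1]` so pandas can infer numeric types directly.

```python
import io
import pandas as pd

# Hypothetical extract mimicking the two-header-row layout of the ACS download
raw = io.StringIO(
    'GEO_ID,NAME,B25004_001E\n'
    'Geography,Geographic Area Name,Estimate!!Total:\n'
    '1400000US06001400100,Census Tract 4001,120\n'
    '1400000US06001400200,Census Tract 4002,35\n'
)

# skiprows=[1] skips the descriptive row (0-indexed file line 1) while parsing,
# leaving line 0 as the header
vacancy = pd.read_csv(raw, skiprows=[1])
print(vacancy['B25004_001E'].dtype)
```

With this approach the `.iloc[1:]` step would no longer be needed, though non-numeric placeholder values in the real data could still force a column to text.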

In [None]:
bucket_name = 'ca-climate-index'
file_name = 'built_metric_housing_vacancy.csv'
directory = '3_fair_data/index_data'

upload_csv_aws(file_name, bucket_name, directory)

## Metrics 2-4: Housing age, quality, and structure

In [None]:
housing_age_quality_structure = pd.read_csv('ACSST5Y2022.S2504-Data.csv')
housing_age_quality_structure['Census_Tract'] = housing_age_quality_structure['GEO_ID'].str[10:]

# Dropping the first row, which holds column descriptions rather than data
housing_age_quality_structure = housing_age_quality_structure.iloc[1:]

# Renaming columns from dictionary code to definition
housing_age_quality_structure = housing_age_quality_structure.rename(columns={
    'S2504_C01_001E': 'est_occupied_housing_units',
    'S2504_C02_025E': 'percent_with_plumbing',
    'S2504_C02_026E': 'percent_with_kitchen_facilities',
    'S2504_C02_008E': 'percent_mobile_homes',
    'S2504_C01_009E': 'est_houses_year_structure_built_2020_or_later',
    'S2504_C01_010E': 'est_houses_year_structure_built_2010_2019',
    'S2504_C01_011E': 'est_houses_year_structure_built_2000_2009',
    'S2504_C01_012E': 'est_houses_year_structure_built_1980_1999',
    'S2504_C01_013E': 'est_houses_year_structure_built_1960_1979',
    'S2504_C01_014E': 'est_houses_year_structure_built_1940_1959',
    'S2504_C01_015E': 'est_houses_year_structure_built_before_1939'
})


In [None]:
# Isolating columns relevant to our metric calculations
# .copy() makes this an independent dataframe, so adding columns later
# does not raise a SettingWithCopyWarning
cri_metric_data_columns = housing_age_quality_structure[[
    'GEO_ID',
    'Census_Tract',
    'est_occupied_housing_units',
    'percent_with_plumbing',
    'percent_with_kitchen_facilities',
    'percent_mobile_homes',
    'est_houses_year_structure_built_2020_or_later',
    'est_houses_year_structure_built_2010_2019',
    'est_houses_year_structure_built_2000_2009',
    'est_houses_year_structure_built_1980_1999',
    'est_houses_year_structure_built_1960_1979',
    'est_houses_year_structure_built_1940_1959',
    'est_houses_year_structure_built_before_1939'
]].copy()

In [None]:
display(cri_metric_data_columns)

## Metric 3: Calculating percentage without plumbing/kitchen facilities
* plumbing and kitchen facilities are reported as separate percentages, so it is unclear how best to calculate our single metric (% without plumbing or kitchen facilities)
* the two percentages can overlap, so summing them could double count houses
* could use the higher of the two percentages
* could split into two metrics (still involves potential overlap)
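To make the overlap concern concrete, here is a minimal sketch on hypothetical percentages (not real tract values). Without unit-level data the true share of homes lacking either facility cannot be recovered, but it can be bracketed: the higher of the two percentages is a lower bound, and their sum, capped at 100, is an upper bound. The two `percent_lacking_either_*` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical tract-level percentages of homes lacking each facility
quality = pd.DataFrame({
    'percent_without_plumbing': [1.5, 0.0, 4.0],
    'percent_without_kitchen_facilities': [2.0, 3.0, 4.0],
})
cols = ['percent_without_plumbing', 'percent_without_kitchen_facilities']

# Lower bound: full overlap, the smaller group sits entirely inside the larger
quality['percent_lacking_either_min'] = quality[cols].max(axis=1)

# Upper bound: no overlap, no home lacks both facilities (capped at 100%)
quality['percent_lacking_either_max'] = quality[cols].sum(axis=1).clip(upper=100.0)

print(quality)
```

Using the higher of the two percentages therefore never overstates the metric, while summing never understates it.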

In [None]:
# Convert 'percent_with_plumbing' column to numeric
cri_metric_data_columns['percent_with_plumbing'] = pd.to_numeric(cri_metric_data_columns['percent_with_plumbing'], errors='coerce')

# Subtract 'percent_with_plumbing' from 100 to get 'percent_without_plumbing'
cri_metric_data_columns['percent_without_plumbing'] = 100.0 - cri_metric_data_columns['percent_with_plumbing']

# Convert 'percent_with_kitchen_facilities' column to numeric
cri_metric_data_columns['percent_with_kitchen_facilities'] = pd.to_numeric(cri_metric_data_columns['percent_with_kitchen_facilities'], errors='coerce')

# Subtract 'percent_with_kitchen_facilities' from 100 to get 'percent_without_kitchen_facilities'
cri_metric_data_columns['percent_without_kitchen_facilities'] = 100.0 - cri_metric_data_columns['percent_with_kitchen_facilities']
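Since `errors='coerce'` silently turns any non-numeric entry into NaN, it can be worth counting how many tracts lose a value. The sketch below uses hypothetical raw strings and assumes '-' as the placeholder for an estimate that could not be computed; the placeholders in the actual file may differ.

```python
import pandas as pd

# Hypothetical raw values; '-' stands in for an estimate that could not be computed
percent_with_plumbing = pd.Series(['99.2', '100.0', '-'])

# Non-numeric entries become NaN instead of raising an error
numeric = pd.to_numeric(percent_with_plumbing, errors='coerce')
percent_without_plumbing = 100.0 - numeric

# Count how many tracts have no usable value after coercion
n_missing = int(percent_without_plumbing.isna().sum())
print(n_missing)
```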


In [None]:
cri_metric_data_columns

## Metric 2: Calculating median housing age per tract

Since the data are reported in year-built ranges, the median can be given as the year range that contains the median housing unit,
or we can select a year representing the middle of each range and report that year for the median range.

Both are calculated below; let me know which is preferred, or whether another method would be better.
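The two approaches described above can be traced by hand on a single hypothetical tract. The counts below are invented for illustration; the range labels and representative years mirror the ones used in this notebook.

```python
import numpy as np

# Year ranges ordered newest to oldest, each with a representative year
bin_labels = ['2020_or_later', '2010_2019', '2000_2009', '1980_1999',
              '1960_1979', '1940_1959', 'before_1939']
bin_years = [2020, 2015, 2005, 1990, 1970, 1950, 1939]

# Hypothetical counts of occupied housing units per range for one tract
counts = np.array([10, 40, 50, 200, 300, 150, 50])

cumulative = np.cumsum(counts)   # [10, 50, 100, 300, 600, 750, 800]
half = cumulative[-1] / 2        # 400.0

# Index of the first range whose cumulative count reaches half the total
median_index = int(np.searchsorted(cumulative, half))

median_year_range = bin_labels[median_index]
median_year = bin_years[median_index]
print(median_year_range, median_year)
```

Both outputs come from the same median range, so the single-year version is a relabeling of the range version rather than an independent estimate.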

In [None]:
# Year-built columns ordered newest to oldest, mapped to (year range label, representative year)
year_bins = {
    'est_houses_year_structure_built_2020_or_later': ('2020_or_later', '2020'),
    'est_houses_year_structure_built_2010_2019': ('2010_2019', '2015'),
    'est_houses_year_structure_built_2000_2009': ('2000_2009', '2005'),
    'est_houses_year_structure_built_1980_1999': ('1980_1999', '1990'),
    'est_houses_year_structure_built_1960_1979': ('1960_1979', '1970'),
    'est_houses_year_structure_built_1940_1959': ('1940_1959', '1950'),
    'est_houses_year_structure_built_before_1939': ('before_1939', '1939')
}

# Initialize empty lists to store the median year range and representative year
median_construction_years = []
median_year = []

# Iterate over each row
for index, row in cri_metric_data_columns.iterrows():
    # Housing unit counts per year range (stored as strings in the raw data)
    counts = [int(row[column]) for column in year_bins]

    # Calculate the cumulative sum across the year ranges, newest to oldest
    cumulative_counts = np.cumsum(counts)
    total_houses = sum(counts)

    if total_houses == 0:
        # No occupied housing units in this tract, so the median is undefined
        median_construction_years.append(np.nan)
        median_year.append(np.nan)
        continue

    # The median falls in the first range whose cumulative count reaches half the total
    median_year_index = np.searchsorted(cumulative_counts, total_houses / 2)
    year_range, single_year = list(year_bins.values())[median_year_index]
    median_construction_years.append(year_range)
    median_year.append(single_year)

# Add the median year range and representative year to the dataframe
cri_metric_data_columns.loc[:, 'median_year_range'] = median_construction_years
cri_metric_data_columns.loc[:, 'median_year'] = median_year

cri_metric_data_columns


In [None]:
# Selecting relevant columns for the CRI housing age metric
cri_housing_age_df = cri_metric_data_columns[[
                                            'GEO_ID', 
                                            'Census_Tract', 
                                            'median_year',
                                            'median_year_range'
                                            ]]
# Saving metric df to .csv file
cri_housing_age_df.to_csv('built_housing_median_age.csv')
cri_housing_age_df[:2]


In [None]:
# Selecting relevant columns for the CRI housing quality metric
cri_housing_quality = cri_metric_data_columns[[
                                            'GEO_ID', 
                                            'Census_Tract', 
                                            'percent_without_plumbing',
                                            'percent_without_kitchen_facilities'
                                            ]]
# Saving metric df to .csv file
cri_housing_quality.to_csv('built_housing_quality.csv')
cri_housing_quality[:2]

## Metric 4: Housing structures
The data is already represented as a percentage, so no additional modification is required. 

In [None]:
# Selecting relevant columns for the CRI housing structure metric
cri_housing_mobile_homes = cri_metric_data_columns[[
                                            'GEO_ID', 
                                            'Census_Tract', 
                                            'percent_mobile_homes'
                                            ]]
# Saving metric df to .csv file
cri_housing_mobile_homes.to_csv('built_housing_mobile_homes.csv')
cri_housing_mobile_homes[:2]

## Uploading to AWS

In [None]:
bucket_name = 'ca-climate-index'
file_names = [
    'built_housing_quality.csv',
    'built_housing_mobile_homes.csv',
    'built_housing_median_age.csv'
]
directory = '3_fair_data/index_data'

# Upload each metric file to the same bucket directory
for name in file_names:
    upload_csv_aws(name, bucket_name, directory)