## Cal-CRAI Metric Calculation
Domain: Built Environment \
Indicator: Housing Vacancy & Quality

This notebook calculates 4 metrics, all sourced from the American Community Survey:

* Metric 1: Housing vacancy: Number of vacant homes per tract (under ownership)
* Metric 2: Housing quality: Percentage of homes lacking complete kitchen or plumbing facilities
* Metric 3: Housing age: Percentage of homes built before 1980
* Metric 4: Housing structures: Percentage of mobile residential structures

In [1]:
import os
import sys
import pandas as pd
import io
import numpy as np

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.write_metadata import (
    append_metadata
)
from scripts.utils.file_helpers import (
    pull_csv_from_directory, upload_csv_aws
) 

In [None]:
bucket_name = 'ca-climate-index'
aws_dir = '1_pull_data/built_environment/housing/acs/'

pull_csv_from_directory(bucket_name, aws_dir, search_zipped=True)

## Metric 1: Housing Vacancy
Based on the ACS documentation, we use the 'estimated total' number of vacant housing units. Every vacancy category implies ownership, with the possible exception of 'other vacant'. According to the documentation, the 'other vacant' variable includes:
- personal/family reasons
- needs repairs
- foreclosure
- being repaired
- storage
- extended absence
- legal proceedings
- preparing to rent/sell
- possibly abandoned/to be demolished
- specific use housing
- other write in/don't know

Barring 'abandoned/to be demolished', all other entries within 'other vacant' are likely under ownership, so the estimated total is used as is.

In [None]:
housing_vacancy = pd.read_csv('ACSDT5Y2022.B25004-Data.csv')
housing_vacancy.head(5)

The GEO_ID column is quite long, so we create a new column holding the census tract in the format used by other sources.

In [None]:
housing_vacancy['Census_Tract'] = housing_vacancy['GEO_ID'].str[10:]
housing_vacancy[:2]
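
For reference, a tract-level ACS `GEO_ID` is the prefix `1400000US` followed by the 11-digit tract GEOID (2-digit state, 3-digit county, 6-digit tract). Slicing at position 10 therefore also drops the leading zero of California's state code `06`. The sketch below illustrates this on a made-up sample value; the `sample` frame and the `geoid_11` column are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample row; real values come from the ACS download
sample = pd.DataFrame({'GEO_ID': ['1400000US06001400100']})

# Same slice as the notebook: skips '1400000US' plus the leading '0' of the state code
sample['Census_Tract'] = sample['GEO_ID'].str[10:]

# Alternative slice that keeps the full zero-padded 11-digit GEOID
sample['geoid_11'] = sample['GEO_ID'].str[9:]

print(sample.loc[0, 'Census_Tract'])  # 6001400100
print(sample.loc[0, 'geoid_11'])      # 06001400100
```

Which form to keep depends on how tract identifiers are stored in the datasets this metric is later joined with.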


In [None]:
# Isolating the columns relevant to our data metric
cri_housing_vacancy_df = housing_vacancy[['Census_Tract', 'B25004_001E']]
# Dropping the first row, which holds descriptive labels for the column codes
cri_housing_vacancy_df = cri_housing_vacancy_df.iloc[1:]
# Rename the total vacant housing units column from its identifier to our metric name
cri_housing_vacancy_df = cri_housing_vacancy_df.rename(columns={'B25004_001E': 'estimated_total_vacant_housing_units'})
print(len(cri_housing_vacancy_df))
cri_housing_vacancy_df.head()
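
As a possible alternative to dropping the label row with `iloc[1:]`, the row can be skipped at read time, which also lets pandas infer numeric dtypes for the estimate columns. This is only a sketch on an in-memory excerpt that mimics the ACS layout; the tract names and counts are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the ACS layout:
# line 0 = column codes, line 1 = human-readable labels, then data rows
raw = io.StringIO(
    'GEO_ID,NAME,B25004_001E\n'
    'Geography,Geographic Area Name,Estimate!!Total:\n'
    '1400000US06001400100,Sample tract A,52\n'
    '1400000US06001400200,Sample tract B,17\n'
)

# skiprows=[1] skips the label row, so the estimate column is parsed as integers
toy = pd.read_csv(raw, skiprows=[1])

print(len(toy))
print(toy['B25004_001E'].dtype)
```

With this approach the later `pd.to_numeric` conversions would mainly serve to coerce non-numeric placeholder values to NaN.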

In [6]:
# Saving metric df to .csv file
cri_housing_vacancy_df.to_csv('built_housing_vacancy_metric.csv')

## Metrics 2-4: Housing age, quality, and structure

In [None]:
housing_age_quality_structure = pd.read_csv('ACSST5Y2022.S2504-Data.csv')
housing_age_quality_structure['Census_Tract'] = housing_age_quality_structure['GEO_ID'].str[10:]

# Dropping the first row, which contains descriptive labels for the column codes
housing_age_quality_structure = housing_age_quality_structure.iloc[1:]

# Renaming columns from dictionary code to definition
housing_age_quality_structure = housing_age_quality_structure.rename(columns={
    'S2504_C01_001E': 'est_occupied_housing_units',
    'S2504_C02_025E': 'percent_with_plumbing',
    'S2504_C02_026E': 'percent_with_kitchen_facilities',
    'S2504_C02_008E': 'percent_mobile_homes',
    'S2504_C01_009E': 'est_houses_year_structure_built_2020_or_later',
    'S2504_C01_010E': 'est_houses_year_structure_built_2010_2019',
    'S2504_C01_011E': 'est_houses_year_structure_built_2000_2009',
    'S2504_C01_012E': 'est_houses_year_structure_built_1980_1999',
    'S2504_C01_013E': 'est_houses_year_structure_built_1960_1979',
    'S2504_C01_014E': 'est_houses_year_structure_built_1940_1959',
    'S2504_C01_015E': 'est_houses_year_structure_built_before_1939',
})

In [8]:
# Isolating relevant columns to our metric calculations
cri_metric_data_columns = housing_age_quality_structure[['GEO_ID', 'Census_Tract', 
                                                         'est_occupied_housing_units',
                                                         'percent_with_plumbing',
                                                         'percent_with_kitchen_facilities',
                                                         'percent_mobile_homes',
                                                        'est_houses_year_structure_built_2020_or_later',
                                                        'est_houses_year_structure_built_2010_2019',
                                                        'est_houses_year_structure_built_2000_2009',
                                                        'est_houses_year_structure_built_1980_1999',
                                                        'est_houses_year_structure_built_1960_1979',
                                                        'est_houses_year_structure_built_1940_1959',
                                                        'est_houses_year_structure_built_before_1939']]

In [None]:
display(cri_metric_data_columns)

## Metric 2: Calculating percentage without plumbing/kitchen facilities
* Open question: how should the single metric (% without plumbing or kitchen facilities) be calculated?
* The two percentages can overlap, so summing them may double count homes
* One option is to use the higher of the two percentages
* Another option is to split into two metrics (which still involves potential overlap)
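
The overlap concern can be made concrete. If a% of units lack plumbing and b% lack kitchen facilities, the share lacking either one lies between max(a, b) (the smaller group is fully contained in the larger) and min(100, a + b) (no unit lacks both). The helper `lacking_either_bounds` below is hypothetical and only illustrates these bounds; it is not part of the notebook.

```python
def lacking_either_bounds(pct_no_plumbing, pct_no_kitchen):
    """Return (lower, upper) bounds on the percent of units lacking either facility."""
    # Lower bound: the smaller group is entirely contained in the larger one
    lower = max(pct_no_plumbing, pct_no_kitchen)
    # Upper bound: the two groups do not overlap at all, capped at 100%
    upper = min(100.0, pct_no_plumbing + pct_no_kitchen)
    return lower, upper

# Illustrative values, not taken from the ACS data
low, high = lacking_either_bounds(2.0, 3.5)
print(low, high)  # 3.5 5.5
```

Summing the two percentages corresponds to the upper bound (without the cap at 100), while taking the higher of the two corresponds to the lower bound.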

In [None]:
# Convert 'percent_with_plumbing' column to numeric
cri_metric_data_columns.loc[:,'percent_with_plumbing'] = pd.to_numeric(cri_metric_data_columns['percent_with_plumbing'], errors='coerce')

# Subtract 'percent_with_plumbing' from 100 to get 'percent_without_plumbing'
cri_metric_data_columns.loc[:,'percent_without_plumbing'] = 100.0 - cri_metric_data_columns['percent_with_plumbing']

# Convert 'percent_with_kitchen_facilities' column to numeric
cri_metric_data_columns.loc[:,'percent_with_kitchen_facilities'] = pd.to_numeric(cri_metric_data_columns['percent_with_kitchen_facilities'], errors='coerce')

# Subtract 'percent_with_kitchen_facilities' from 100 to get 'percent_without_kitchen_facilities'
cri_metric_data_columns.loc[:,'percent_without_kitchen_facilities'] = 100.0 - cri_metric_data_columns['percent_with_kitchen_facilities']

cri_metric_data_columns.loc[:,'percent_without_kitchen_facilities_or_plumbing'] = cri_metric_data_columns['percent_without_kitchen_facilities'] + cri_metric_data_columns['percent_without_plumbing']

cri_metric_data_columns = cri_metric_data_columns.rename(columns={'Census_Tract':'census_tract'})


In [None]:
# Selecting relevant columns for the CRI housing quality metric
cri_housing_quality = cri_metric_data_columns[[
                                            'census_tract', 
                                            'percent_without_plumbing',
                                            'percent_without_kitchen_facilities',
                                            'percent_without_kitchen_facilities_or_plumbing'
                                            ]]
cri_housing_quality

In [12]:
# Saving metric df to .csv file
cri_housing_quality.to_csv('built_housing_quality_metric.csv')

## Calculating Metric 3: % of homes built before 1980

The earliest housing category is 'before 1939', so the temporal range of the dataset is roughly 80 years.
We calculate the percent built before 1980, which splits that range in half; the metric indicates vulnerability in homes from the older half of the dataset.
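
The percentage can be sketched on toy data as the sum of the three pre-1980 counts divided by occupied units. The frame below uses invented counts and shortened, hypothetical column names; replacing zero denominators with NaN is an assumption added here to handle tracts with no occupied units, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for two tracts; the second has no occupied units
toy = pd.DataFrame({
    'census_tract': ['A', 'B'],
    'est_occupied_housing_units': [200, 0],
    'built_before_1939': [20, 0],
    'built_1940_1959': [30, 0],
    'built_1960_1979': [50, 0],
})

older = ['built_before_1939', 'built_1940_1959', 'built_1960_1979']
toy['num_before_1980'] = toy[older].sum(axis=1)

# Replace zero denominators with NaN so tracts with no occupied units yield NaN
denominator = toy['est_occupied_housing_units'].replace(0, np.nan)
toy['percent_housing_before_1980'] = toy['num_before_1980'] / denominator * 100

print(toy[['census_tract', 'percent_housing_before_1980']])
```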

In [None]:
cri_metric_data_columns

In [14]:
# make sure all columns barring census tract are numeric
exclude_column = 'census_tract'
cri_housing_age_df = cri_metric_data_columns.apply(lambda x: pd.to_numeric(x, errors='coerce') if x.name != exclude_column else x)

In [None]:
# summing all pre-1980 columns and isolating the new sum column, census tract, and est. total units
cri_housing_age_df['num_before_1980'] = cri_housing_age_df['est_houses_year_structure_built_1940_1959']+ cri_housing_age_df['est_houses_year_structure_built_1960_1979'] + cri_housing_age_df['est_houses_year_structure_built_before_1939']

cri_metric_age = cri_housing_age_df[['census_tract', 'est_occupied_housing_units', 'num_before_1980']]

cri_metric_age

In [None]:
# new column holding our metric: percent of housing built before 1980
# (copying first so the assignment does not write into a slice of cri_housing_age_df)
cri_metric_age_metric = cri_metric_age.copy()
cri_metric_age_metric['percent_housing_before_1980'] = (cri_metric_age_metric['num_before_1980'] / cri_metric_age_metric['est_occupied_housing_units']) * 100

cri_metric_age_metric

In [17]:
cri_metric_age_metric.to_csv('built_housing_before_1980_metric.csv')

## Metric 4: Housing structures - mobile homes
The data is already represented as a percentage, so no additional modification is required. 

In [None]:
# Selecting relevant columns for the CRI mobile homes metric
cri_housing_mobile_homes = cri_metric_data_columns[[
                                            'census_tract', 
                                            'percent_mobile_homes'
                                            ]]
# Saving metric df to .csv file
cri_housing_mobile_homes.to_csv('built_housing_mobile_homes_metric.csv')
print(len(cri_housing_mobile_homes))
cri_housing_mobile_homes[:2]

## Uploading metric csv's to AWS

In [23]:
@append_metadata
def housing_vacancy_upload(input_csv, export=False, varname=''):
    '''
    This function uploads prepared housing metrics, all sourced from the American Community Survey
    at: https://data.census.gov/ (tables B25004 and S2504)

    Metrics include:
    - Housing vacancy: # of vacant homes per tract (under ownership)
    - Housing quality: % of homes lacking complete kitchen or plumbing facilities
    - Housing age: % of homes built before 1980
    - Housing structures: % of mobile residential structures

    Methods
    -------
    Relevant columns were isolated and renamed.
    Additional columns were created by calculating desired metric with existing columns.
    
    Parameters
    ----------
    input_csv: string
        csv housing data 
    export: True/False boolean
        False = will just generate metadata file(s)
        True = will upload resulting df containing CAL CRAI housing metrics to AWS

    Script
    ------
    built_housing_vacancy_quality.ipynb

    Note:
    This function assumes users have configured the AWS CLI such that their access key / secret key pair are 
    stored in ~/.aws/credentials.
    See https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html for guidance.
    '''
    print('Data transformation: columns renamed and isolated.')
    print('Data transformation: additional columns created using existing columns to calculate desired metric.')

    bucket_name = 'ca-climate-index'
    directory = '3_fair_data/index_data'
    export_filename = [input_csv]
    
    if export:
        upload_csv_aws(export_filename, bucket_name, directory)
        print(f'{export_filename} uploaded to AWS.')
    else:
        print(f'{export_filename} not uploaded to AWS (export=False).')

    #if os.path.exists(input_csv):
    #    os.remove(input_csv)

In [24]:
input_csv = ['built_housing_vacancy_metric.csv',
            'built_housing_quality_metric.csv', 
            'built_housing_before_1980_metric.csv',
            'built_housing_mobile_homes_metric.csv'
 
            ]

varnames = ['built_acs_housing_vacancy',
            'built_acs_housing_quality',
            'built_acs_housing_age',
            'built_acs_mobile_homes'          
            ]

bucket_name = 'ca-climate-index'
directory = '3_fair_data/index_data'

for csv, var in zip(input_csv, varnames):
    housing_vacancy_upload(csv, export=True, varname=var)