## Cal-CRAI Metric Calculation for: Built Environment / Housing Vacancy & Quality
This notebook calculates four metrics, all sourced from the American Community Survey (ACS):
- Housing vacancy: # of vacant homes per tract (under ownership)
- Housing age: median age of residential housing per tract
- Housing quality: % of homes lacking complete kitchen or plumbing facilities
- Housing structures: % of mobile residential structures

In [1]:
import os
import sys
import pandas as pd
import io
import numpy as np

sys.path.append(os.path.expanduser('../../'))
from scripts.utils.write_metadata import (
    append_metadata
)
from scripts.utils.file_helpers import (
    pull_csv_from_directory, upload_csv_aws
) 

In [2]:
bucket_name = 'ca-climate-index'
aws_dir = '1_pull_data/built_environment/housing/acs/'

pull_csv_from_directory(bucket_name, aws_dir, search_zipped=True)


Saved DataFrame as 'ACSST5Y2022.S2504-Data.csv'
Saved DataFrame as 'ACSST5Y2022.S2504-Column-Metadata.csv'
Saved DataFrame as 'ACSDT5Y2022.B25004-Data.csv'
Saved DataFrame as 'ACSDT5Y2022.B25004-Column-Metadata.csv'


## Metric 1: Housing Vacancy
Per the ACS documentation, this metric uses the 'estimated total' number of vacant housing units. Every vacancy category implies ownership except possibly 'other vacant', which the documentation describes as including:
- personal/family reasons
- needs repairs
- foreclosure
- being repaired
- storage
- extended absence
- legal proceedings
- preparing to rent/sell
- possibly abandoned/to be demolished
- specific use housing
- other write in/don't know

Barring 'possibly abandoned/to be demolished', every entry within 'other vacant' is likely under ownership, so the estimated total is used as-is.
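
As a quick sanity check on using the total, the sketch below confirms that `B25004_001E` equals the sum of the seven vacancy categories. The values are copied by hand from the first four tracts in the table output of the next cell; the variable names (`vacancy`, `category_cols`, `other_vacant_share`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Estimates copied from the first four tracts of ACSDT5Y2022.B25004-Data.csv
vacancy = pd.DataFrame({
    'B25004_001E': [119, 37, 213, 215],   # total vacant units
    'B25004_002E': [0, 0, 86, 55],        # for rent
    'B25004_003E': [0, 0, 0, 0],          # rented, not occupied
    'B25004_004E': [9, 0, 0, 0],          # for sale only
    'B25004_005E': [0, 0, 0, 0],          # sold, not occupied
    'B25004_006E': [55, 4, 47, 0],        # seasonal, recreational, or occasional use
    'B25004_007E': [0, 0, 0, 0],          # for migrant workers
    'B25004_008E': [55, 33, 80, 160],     # other vacant
})

# The seven category columns, B25004_002E through B25004_008E
category_cols = [f'B25004_00{i}E' for i in range(2, 9)]
category_sum = vacancy[category_cols].sum(axis=1)

# True when the total equals the sum of its categories for every tract
total_matches = (category_sum == vacancy['B25004_001E']).all()

# Share of each tract's vacancies that fall under 'other vacant'
other_vacant_share = vacancy['B25004_008E'] / vacancy['B25004_001E']
print(total_matches)
print(other_vacant_share.round(2).tolist())
```

If the 'possibly abandoned' portion ever needed to be excluded, it could not be separated out from this table alone, since 'other vacant' is reported as a single count.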

In [3]:
housing_vacancy = pd.read_csv('ACSDT5Y2022.B25004-Data.csv')
housing_vacancy.head(5)

Unnamed: 0,GEO_ID,NAME,B25004_001E,B25004_001M,B25004_002E,B25004_002M,B25004_003E,B25004_003M,B25004_004E,B25004_004M,B25004_005E,B25004_005M,B25004_006E,B25004_006M,B25004_007E,B25004_007M,B25004_008E,B25004_008M,Unnamed: 18
0,Geography,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!For rent,Margin of Error!!Total:!!For rent,"Estimate!!Total:!!Rented, not occupied","Margin of Error!!Total:!!Rented, not occupied",Estimate!!Total:!!For sale only,Margin of Error!!Total:!!For sale only,"Estimate!!Total:!!Sold, not occupied","Margin of Error!!Total:!!Sold, not occupied","Estimate!!Total:!!For seasonal, recreational, ...","Margin of Error!!Total:!!For seasonal, recreat...",Estimate!!Total:!!For migrant workers,Margin of Error!!Total:!!For migrant workers,Estimate!!Total:!!Other vacant,Margin of Error!!Total:!!Other vacant,
1,1400000US06001400100,Census Tract 4001; Alameda County; California,119,86,0,13,0,13,9,14,0,13,55,62,0,13,55,63,
2,1400000US06001400200,Census Tract 4002; Alameda County; California,37,21,0,13,0,13,0,13,0,13,4,8,0,13,33,23,
3,1400000US06001400300,Census Tract 4003; Alameda County; California,213,144,86,92,0,19,0,19,0,19,47,74,0,19,80,93,
4,1400000US06001400400,Census Tract 4004; Alameda County; California,215,90,55,59,0,13,0,13,0,13,0,13,0,13,160,82,


The GEO_ID column is quite long, so we add a new column holding the census tract in the format used by other sources.
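
A minimal sketch of the tract extraction on a small in-memory sample that mimics the two-row header of the ACS export. The `skiprows=[1]` option, the `dtype` mapping, and the `tract_geoid_11` column are assumptions added for illustration; the notebook itself reads the file with default options and drops the description row afterwards.

```python
import io
import pandas as pd

# Two header rows as in the ACS export: column codes, then descriptions
sample_csv = io.StringIO(
    "GEO_ID,NAME,B25004_001E\n"
    "Geography,Geographic Area Name,Estimate!!Total:\n"
    "1400000US06001400100,Census Tract 4001; Alameda County; California,119\n"
    "1400000US06001400200,Census Tract 4002; Alameda County; California,37\n"
)

# skiprows=[1] skips the description row at read time, so estimates parse as numbers
sample = pd.read_csv(sample_csv, skiprows=[1], dtype={'GEO_ID': str})

# Same slice as the notebook: removes the 10-character prefix '1400000US0'
sample['Census_Tract'] = sample['GEO_ID'].str[10:]

# Alternative slice that keeps the state FIPS leading zero (11-character tract GEOID)
sample['tract_geoid_11'] = sample['GEO_ID'].str[9:]

print(sample[['Census_Tract', 'tract_geoid_11', 'B25004_001E']])
```

Keeping the tract as a string avoids it being reinterpreted as a number (and shown as `6001400100.0`) when the column also contains missing values.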

In [4]:
housing_vacancy['Census_Tract'] = housing_vacancy['GEO_ID'].str[10:]
housing_vacancy[:2]


Unnamed: 0,GEO_ID,NAME,B25004_001E,B25004_001M,B25004_002E,B25004_002M,B25004_003E,B25004_003M,B25004_004E,B25004_004M,B25004_005E,B25004_005M,B25004_006E,B25004_006M,B25004_007E,B25004_007M,B25004_008E,B25004_008M,Unnamed: 18,Census_Tract
0,Geography,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!For rent,Margin of Error!!Total:!!For rent,"Estimate!!Total:!!Rented, not occupied","Margin of Error!!Total:!!Rented, not occupied",Estimate!!Total:!!For sale only,Margin of Error!!Total:!!For sale only,"Estimate!!Total:!!Sold, not occupied","Margin of Error!!Total:!!Sold, not occupied","Estimate!!Total:!!For seasonal, recreational, ...","Margin of Error!!Total:!!For seasonal, recreat...",Estimate!!Total:!!For migrant workers,Margin of Error!!Total:!!For migrant workers,Estimate!!Total:!!Other vacant,Margin of Error!!Total:!!Other vacant,,
1,1400000US06001400100,Census Tract 4001; Alameda County; California,119,86,0,13,0,13,9,14,0,13,55,62,0,13,55,63,,6001400100.0


In [5]:
# Isolating the columns relevant to our data metric
cri_housing_vacancy_df = housing_vacancy[['Census_Tract', 'B25004_001E']]
# Dropping the first row, which holds column descriptions rather than data
cri_housing_vacancy_df = cri_housing_vacancy_df.iloc[1:]
# Rename the total vacant housing units column from its identifier to our metric name
cri_housing_vacancy_df = cri_housing_vacancy_df.rename(columns={'B25004_001E': 'estimated_total_vacant_housing_units'})
print(cri_housing_vacancy_df.head())

# Saving metric df to .csv file
cri_housing_vacancy_df.to_csv('built_metric_housing_vacancy_metric.csv')

  Census_Tract estimated_total_vacant_housing_units
1   6001400100                                  119
2   6001400200                                   37
3   6001400300                                  213
4   6001400400                                  215
5   6001400500                                  141


## Metric 2-4: Housing age, quality, and structure

In [6]:
housing_age_quality_structure = pd.read_csv('ACSST5Y2022.S2504-Data.csv')
housing_age_quality_structure['Census_Tract'] = housing_age_quality_structure['GEO_ID'].str[10:]

# Dropping the first row, which holds column descriptions rather than data
housing_age_quality_structure = housing_age_quality_structure.iloc[1:]

# Renaming columns from dictionary code to definition
housing_age_quality_structure = housing_age_quality_structure.rename(columns={
    'S2504_C01_001E': 'est_occupied_housing_units',
    'S2504_C02_025E': 'percent_with_plumbing',
    'S2504_C02_026E': 'percent_with_kitchen_facilities',
    'S2504_C02_008E': 'percent_mobile_homes',
    'S2504_C01_009E': 'est_houses_year_structure_built_2020_or_later',
    'S2504_C01_010E': 'est_houses_year_structure_built_2010_2019',
    'S2504_C01_011E': 'est_houses_year_structure_built_2000_2009',
    'S2504_C01_012E': 'est_houses_year_structure_built_1980_1999',
    'S2504_C01_013E': 'est_houses_year_structure_built_1960_1979',
    'S2504_C01_014E': 'est_houses_year_structure_built_1940_1959',
    'S2504_C01_015E': 'est_houses_year_structure_built_before_1939',
})



In [7]:
# Isolating relevant columns to our metric calculations.
# .copy() makes this an independent DataFrame, so the column assignments in
# later cells do not raise SettingWithCopyWarning.
cri_metric_data_columns = housing_age_quality_structure[[
    'GEO_ID', 'Census_Tract',
    'est_occupied_housing_units',
    'percent_with_plumbing',
    'percent_with_kitchen_facilities',
    'percent_mobile_homes',
    'est_houses_year_structure_built_2020_or_later',
    'est_houses_year_structure_built_2010_2019',
    'est_houses_year_structure_built_2000_2009',
    'est_houses_year_structure_built_1980_1999',
    'est_houses_year_structure_built_1960_1979',
    'est_houses_year_structure_built_1940_1959',
    'est_houses_year_structure_built_before_1939'
]].copy()

In [8]:
display(cri_metric_data_columns)

Unnamed: 0,GEO_ID,Census_Tract,est_occupied_housing_units,percent_with_plumbing,percent_with_kitchen_facilities,percent_mobile_homes,est_houses_year_structure_built_2020_or_later,est_houses_year_structure_built_2010_2019,est_houses_year_structure_built_2000_2009,est_houses_year_structure_built_1980_1999,est_houses_year_structure_built_1960_1979,est_houses_year_structure_built_1940_1959,est_houses_year_structure_built_before_1939
1,1400000US06001400100,6001400100,1377,100.0,100.0,2.0,28,36,159,947,42,100,65
2,1400000US06001400200,6001400200,876,100.0,99.4,0.0,0,47,9,26,53,124,617
3,1400000US06001400300,6001400300,2638,100.0,100.0,0.0,0,179,11,266,482,406,1294
4,1400000US06001400400,6001400400,1760,100.0,99.4,0.0,0,32,13,93,246,226,1150
5,1400000US06001400500,6001400500,1679,100.0,100.0,0.0,0,19,17,134,242,273,994
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9125,1400000US06115040902,6115040902,482,100.0,97.3,0.0,0,218,108,85,58,13,0
9126,1400000US06115041001,6115041001,1489,93.2,95.3,14.8,0,10,184,643,510,88,54
9127,1400000US06115041002,6115041002,1449,100.0,100.0,6.0,0,70,291,608,302,147,31
9128,1400000US06115041101,6115041101,1034,99.5,99.0,28.8,0,69,111,397,363,71,23


## Metric 3: Calculating percentage without plumbing/kitchen facilities
* the source reports plumbing and kitchen facilities as separate percentages, which leaves open how to calculate our single metric (% without plumbing or kitchen facilities)
* these percentages could overlap, so summing could double count houses
* could use the higher of the two percents
* could split into two metrics (still involves potential overlap)
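
To make the overlap concern concrete, here is a small sketch of the range the true "lacking either facility" share can take. If every home lacking one facility also lacks the other, the combined share equals the larger of the two percentages; if no home lacks both, it equals their sum. The sample values are illustrative, and the `lower_bound` / `upper_bound` names are not part of the notebook.

```python
import pandas as pd

# Illustrative per-tract shares of homes lacking each facility
quality = pd.DataFrame({
    'percent_without_plumbing': [0.0, 6.8, 0.5],
    'percent_without_kitchen_facilities': [0.6, 4.7, 1.0],
})
cols = ['percent_without_plumbing', 'percent_without_kitchen_facilities']

# Full overlap: the combined share is just the larger of the two
lower_bound = quality[cols].max(axis=1)

# No overlap: the combined share is the sum, capped at 100%
upper_bound = quality[cols].sum(axis=1).clip(upper=100.0)

print(pd.DataFrame({'lower_bound': lower_bound, 'upper_bound': upper_bound}))
```

Without unit-level data the exact value cannot be recovered from these two percentages, so any single number reported here is a choice between these bounds.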

In [9]:
# Convert 'percent_with_plumbing' column to numeric
cri_metric_data_columns.loc[:,'percent_with_plumbing'] = pd.to_numeric(cri_metric_data_columns['percent_with_plumbing'], errors='coerce')

# Subtract 'percent_with_plumbing' from 100 to get 'percent_without_plumbing'
cri_metric_data_columns.loc[:,'percent_without_plumbing'] = 100.0 - cri_metric_data_columns['percent_with_plumbing']

# Convert 'percent_with_kitchen_facilities' column to numeric
cri_metric_data_columns.loc[:,'percent_with_kitchen_facilities'] = pd.to_numeric(cri_metric_data_columns['percent_with_kitchen_facilities'], errors='coerce')

# Subtract 'percent_with_kitchen_facilities' from 100 to get 'percent_without_kitchen_facilities'
cri_metric_data_columns.loc[:,'percent_without_kitchen_facilities'] = 100.0 - cri_metric_data_columns['percent_with_kitchen_facilities']

# Sum the two shares; homes lacking both are counted twice, so this is an upper bound
cri_metric_data_columns.loc[:,'percent_without_kitchen_facilities_or_plumbing'] = cri_metric_data_columns['percent_without_kitchen_facilities'] + cri_metric_data_columns['percent_without_plumbing']



In [10]:
cri_metric_data_columns

Unnamed: 0,GEO_ID,Census_Tract,est_occupied_housing_units,percent_with_plumbing,percent_with_kitchen_facilities,percent_mobile_homes,est_houses_year_structure_built_2020_or_later,est_houses_year_structure_built_2010_2019,est_houses_year_structure_built_2000_2009,est_houses_year_structure_built_1980_1999,est_houses_year_structure_built_1960_1979,est_houses_year_structure_built_1940_1959,est_houses_year_structure_built_before_1939,percent_without_plumbing,percent_without_kitchen_facilities,percent_without_kitchen_facilities_or_plumbing
1,1400000US06001400100,6001400100,1377,100.0,100.0,2.0,28,36,159,947,42,100,65,0.0,0.0,0.0
2,1400000US06001400200,6001400200,876,100.0,99.4,0.0,0,47,9,26,53,124,617,0.0,0.6,0.6
3,1400000US06001400300,6001400300,2638,100.0,100.0,0.0,0,179,11,266,482,406,1294,0.0,0.0,0.0
4,1400000US06001400400,6001400400,1760,100.0,99.4,0.0,0,32,13,93,246,226,1150,0.0,0.6,0.6
5,1400000US06001400500,6001400500,1679,100.0,100.0,0.0,0,19,17,134,242,273,994,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9125,1400000US06115040902,6115040902,482,100.0,97.3,0.0,0,218,108,85,58,13,0,0.0,2.7,2.7
9126,1400000US06115041001,6115041001,1489,93.2,95.3,14.8,0,10,184,643,510,88,54,6.8,4.7,11.5
9127,1400000US06115041002,6115041002,1449,100.0,100.0,6.0,0,70,291,608,302,147,31,0.0,0.0,0.0
9128,1400000US06115041101,6115041101,1034,99.5,99.0,28.8,0,69,111,397,363,71,23,0.5,1.0,1.5


## Calculating Metric 2: median construction year of housing units per tract

Since the data is binned into year ranges, the median can be reported as the median year-range group,
or we can pick a single year to represent each group (roughly its midpoint) and report the median from there.

I did both; let me know which is preferred, or if another method would be better.
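
The binned-median idea can be sketched without a row loop as follows. The first two rows reuse counts from the displayed table; the third row, the `bin_years` array, and the choice to return NaN for a tract with no housing units are assumptions for illustration, not the notebook's behavior.

```python
import numpy as np
import pandas as pd

# Representative year for each bin, ordered newest to oldest
bin_years = np.array([2020, 2015, 2005, 1990, 1970, 1950, 1939])

counts = pd.DataFrame(
    [
        [28, 36, 159, 947, 42, 100, 65],   # tract 6001400100
        [0, 47, 9, 26, 53, 124, 617],      # tract 6001400200
        [0, 0, 0, 0, 0, 0, 0],             # hypothetical tract with no housing units
    ],
    index=['6001400100', '6001400200', 'empty_tract'],
)

cumulative = counts.cumsum(axis=1).to_numpy()
totals = counts.sum(axis=1)
half_total = totals.to_numpy() / 2

# Index of the first bin whose cumulative count reaches half of the tract total
median_bin = (cumulative >= half_total[:, None]).argmax(axis=1)

median_year = pd.Series(bin_years[median_bin], index=counts.index, dtype='float')

# A tract with zero units would otherwise land in the first (newest) bin
median_year[totals == 0] = np.nan
print(median_year)
```

Storing the result as a number rather than a string also makes later scaling or ranking of the metric simpler.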

In [11]:
# Year-built bins, ordered newest to oldest.
# Each entry is (range label used in the column name, single representative year).
year_bins = [
    ('2020_or_later', '2020'),
    ('2010_2019', '2015'),
    ('2000_2009', '2005'),
    ('1980_1999', '1990'),
    ('1960_1979', '1970'),
    ('1940_1959', '1950'),
    ('before_1939', '1939'),
]

# Lists to store the median year range and median single year per tract
median_construction_years = []
median_year = []

# Iterate over each row (one row per tract)
for index, row in cri_metric_data_columns.iterrows():
    # Housing unit counts per bin, newest to oldest
    counts = [int(row[f'est_houses_year_structure_built_{label}']) for label, _ in year_bins]

    # The median bin is the first one whose cumulative count reaches half the total.
    # Note: a tract with zero units falls into the first (newest) bin.
    cumulative_counts = np.cumsum(counts)
    total_houses = sum(counts)
    median_index = np.searchsorted(cumulative_counts, total_houses / 2)

    # Both outputs come from the same bin; only the label differs
    range_label, single_year = year_bins[median_index]
    median_construction_years.append(range_label)
    median_year.append(single_year)

# Add the median construction years to the dataframe
cri_metric_data_columns.loc[:, 'median_year_range'] = median_construction_years
cri_metric_data_columns.loc[:, 'median_year'] = median_year

cri_metric_data_columns




Unnamed: 0,GEO_ID,Census_Tract,est_occupied_housing_units,percent_with_plumbing,percent_with_kitchen_facilities,percent_mobile_homes,est_houses_year_structure_built_2020_or_later,est_houses_year_structure_built_2010_2019,est_houses_year_structure_built_2000_2009,est_houses_year_structure_built_1980_1999,est_houses_year_structure_built_1960_1979,est_houses_year_structure_built_1940_1959,est_houses_year_structure_built_before_1939,percent_without_plumbing,percent_without_kitchen_facilities,percent_without_kitchen_facilities_or_plumbing,median_year_range,median_year
1,1400000US06001400100,6001400100,1377,100.0,100.0,2.0,28,36,159,947,42,100,65,0.0,0.0,0.0,1980_1999,1990
2,1400000US06001400200,6001400200,876,100.0,99.4,0.0,0,47,9,26,53,124,617,0.0,0.6,0.6,before_1939,1939
3,1400000US06001400300,6001400300,2638,100.0,100.0,0.0,0,179,11,266,482,406,1294,0.0,0.0,0.0,1940_1959,1950
4,1400000US06001400400,6001400400,1760,100.0,99.4,0.0,0,32,13,93,246,226,1150,0.0,0.6,0.6,before_1939,1939
5,1400000US06001400500,6001400500,1679,100.0,100.0,0.0,0,19,17,134,242,273,994,0.0,0.0,0.0,before_1939,1939
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9125,1400000US06115040902,6115040902,482,100.0,97.3,0.0,0,218,108,85,58,13,0,0.0,2.7,2.7,2000_2009,2005
9126,1400000US06115041001,6115041001,1489,93.2,95.3,14.8,0,10,184,643,510,88,54,6.8,4.7,11.5,1980_1999,1990
9127,1400000US06115041002,6115041002,1449,100.0,100.0,6.0,0,70,291,608,302,147,31,0.0,0.0,0.0,1980_1999,1990
9128,1400000US06115041101,6115041101,1034,99.5,99.0,28.8,0,69,111,397,363,71,23,0.5,1.0,1.5,1980_1999,1990


### Planning to use median year for final indicator calculation but including median year range as context given the inconsistent date ranges

In [12]:
# Selecting relevant columns for CRI housing age, quality, & structure metrics (separate csv's)
cri_housing_age_df = cri_metric_data_columns[[
                                            'Census_Tract', 
                                            'median_year',
                                            ]]
# Saving metric df to .csv file
cri_housing_age_df.to_csv('built_housing_median_age_metric.csv')
cri_housing_age_df[:2]


Unnamed: 0,Census_Tract,median_year
1,6001400100,1990
2,6001400200,1939


In [13]:
cri_housing_age_df.median_year.unique()

array(['1990', '1939', '1950', '1970', '2005', '2020', '2015'],
      dtype=object)

In [16]:
# Selecting relevant columns for CRI housing age, quality, & structure metrics (separate csv's)
cri_housing_quality = cri_metric_data_columns[[
                                            'Census_Tract', 
                                            'percent_without_plumbing',
                                            'percent_without_kitchen_facilities',
                                            'percent_without_kitchen_facilities_or_plumbing'
                                            ]]
# Saving metric df to .csv file
cri_housing_quality.to_csv('built_housing_quality_metric.csv')
cri_housing_quality[:2]

Unnamed: 0,Census_Tract,percent_without_plumbing,percent_without_kitchen_facilities,percent_without_kitchen_facilities_or_plumbing
1,6001400100,0.0,0.0,0.0
2,6001400200,0.0,0.6,0.6


## Metric 4: Housing structures
The data is already represented as a percentage, so no additional modification is required. 
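
Although no calculation is needed, the column may still arrive as strings because the description row sat under the header when the file was read. The sketch below shows one way to coerce and range-check it, assuming non-numeric placeholders such as '-' can appear in the export; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values as strings, including a '-' placeholder for a missing estimate
raw = pd.Series(['2.0', '0.0', '-', '14.8'], name='percent_mobile_homes')

# Non-numeric entries become NaN instead of raising an error
percent_mobile_homes = pd.to_numeric(raw, errors='coerce')

# All remaining values should be valid percentages
in_range = percent_mobile_homes.dropna().between(0, 100).all()
print(percent_mobile_homes.tolist(), in_range)
```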

In [20]:
# Selecting relevant columns for CRI housing age, quality, & structure metrics (separate csv's)
cri_housing_mobile_homes = cri_metric_data_columns[[
                                            'Census_Tract', 
                                            'percent_mobile_homes'
                                            ]]
# Saving metric df to .csv file
cri_housing_mobile_homes.to_csv('built_housing_mobile_homes_metric.csv')
cri_housing_mobile_homes[:2]

Unnamed: 0,Census_Tract,percent_mobile_homes
1,6001400100,2.0
2,6001400200,0.0


## Uploading metric csv's to AWS

In [27]:
@append_metadata
def housing_vacancy_upload(input_csv, export=False, varname=''):
    '''
    This function uploads prepared housing metrics, all sourced from the American Community Survey
    at: https://data.census.gov/ (tables B25004 and S2504)

    Metrics include:
    - Housing vacancy: # of vacant homes per tract (under ownership)
    - Housing age: median age of residential housing per tract
    - Housing quality: % of homes lacking complete kitchen or plumbing facilities
    - Housing structures: % of mobile residential structures

    Methods
    -------
    Relevant columns were isolated and renamed.
    Additional columns were created by calculating desired metric with existing columns.
    
    Parameters
    ----------
    input_csv: string
        csv housing data 
    export: bool
        False = will just generate metadata file(s)
        True = will upload resulting df containing CAL CRAI housing metrics to AWS

    Script
    ------
    built_housing_vacancy_quality.ipynb

    Note:
    This function assumes users have configured the AWS CLI such that their access key / secret key pair are 
    stored in ~/.aws/credentials.
    See https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html for guidance.
    '''
    print('Data transformation: columns renamed and isolated.')
    print('Data transformation: additional columns created using existing columns to calculate desired metric.')

    bucket_name = 'ca-climate-index'
    directory = '3_fair_data/index_data'
    export_filename = [input_csv]
    
    if export:
        upload_csv_aws(export_filename, bucket_name, directory)
        print(f'{export_filename} uploaded to AWS.')
    else:
        print(f'{export_filename} not uploaded to AWS (export=False).')

    #if os.path.exists(input_csv):
    #    os.remove(input_csv)

In [28]:
input_csv = ['built_housing_quality_metric.csv', 
             'built_housing_mobile_homes_metric.csv', 
             'built_housing_median_age_metric.csv',
             'built_metric_housing_vacancy_metric.csv'
            ]

varnames = [
            'built_acs_housing_quality',
            'built_acs_mobile_homes',
            'built_acs_housing_age',
            'built_acs_housing_vacancy'
            ]

bucket_name = 'ca-climate-index'
directory = '3_fair_data/index_data'

for csv, var in zip(input_csv, varnames):
    housing_vacancy_upload(csv, export=True, varname=var)