# <a id='toc1_'></a>[Data Prep and Cleanup](#toc0_)

Our project examines the potential loss of homeowners insurance coverage due to climate disasters and evaluates how different sectors of society are affected by this process. To do so, this notebook compiles several datasets produced by different organizations, focusing on data related to insurance, population, housing, and climate disasters, to generate a base dataset for our analysis. Given the availability of data, we focus on California and use its residential ZIP codes as our observations.

In this notebook, we will:
- load datasets 
- perform basic clean-up tasks
- add basic new features
- standardize feature names
- generate base dataset for the project's next steps

**Table of contents**<a id='toc0_'></a>    
- [Data Prep and Cleanup](#toc1_)    
  - [Census Data](#toc1_1_)    
    - [Demography](#toc1_1_1_)    
    - [Housing (2021)](#toc1_1_2_)    
  - [Insurance Data](#toc1_2_)    
    - [Renewals](#toc1_2_1_)    
    - [Premiums, Claims, and Losses](#toc1_2_2_)    
    - [FAIR Plan (2022)](#toc1_2_3_)    
  - [Zillow Data (Housing Value Index)](#toc1_3_)    
  - [Disaster Data](#toc1_4_)    
  - [Consolidating data](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [168]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

In [169]:
# base folders
RAW_DATA_DIR = Path('../raw_data/')
CLEAN_DATA_DIR = Path('../clean_data/')

In [170]:
# years to slice the data
start_year = 2018
end_year = 2021

## <a id='toc1_1_'></a>[Census Data](#toc0_)


ACS5?

TODO: add description here

### <a id='toc1_1_1_'></a>[Demography](#toc0_)

TODO: add description

In [171]:
# Median Income
median_income = pd.read_csv(RAW_DATA_DIR / 'median_incomes_flat.csv', index_col=0)

median_income.columns = ['ZIP Code', 'Avg Median Income', '% Avg Change Median Income', '% Change Median Income']

median_income['ZIP Code'] = median_income['ZIP Code'].astype(str)
median_income = median_income[['ZIP Code', 'Avg Median Income', '% Change Median Income']]

median_income.set_index('ZIP Code', inplace=True)
median_income

ZIP Code,Avg Median Income,% Change Median Income
90001,48115.285714,0.703617
90002,43639.714286,0.651706
90003,44032.857143,0.592610
90004,53401.000000,0.345076
90005,42199.000000,0.625181
...,...,...
96145,94609.857143,0.702736
96146,88706.857143,0.561384
96148,80547.857143,0.167422
96150,65445.428571,0.578176


In [172]:
# Race 
# Note: we use the percentage of "white-only" population as a proxy for race

white_pop = pd.read_csv(RAW_DATA_DIR / 'percent_white_population_flat.csv', index_col=0)

white_pop.columns = ['ZIP Code', 'Avg % White-only Pop', '% Avg Change % of White-only Pop', '% Change White-only Pop']

white_pop['ZIP Code'] = white_pop['ZIP Code'].astype(str)
white_pop = white_pop[['ZIP Code', 'Avg % White-only Pop', '% Change White-only Pop']]

white_pop.set_index('ZIP Code', inplace=True)
white_pop.sample(5)

ZIP Code,Avg % White-only Pop,% Change White-only Pop
93747,56.714286,-21.7
93915,49.357143,-28.4
92614,56.085714,-15.7
96097,82.742857,-5.7
94541,37.785714,-10.6


### <a id='toc1_1_2_'></a>[Housing (2021)](#toc0_)

*** IMPORTANT NOTE: Median Home Value is capped at $2,000,001 (censored data)
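
Because a capped value is only a lower bound, it can help to flag the affected ZIP codes before modeling. The following is a minimal sketch on invented values; the cap is taken from the note above, and the `Home Value Censored` column name and placeholder ZIP codes are hypothetical.

```python
import pandas as pd

# Cap stated in the note above; re-check it against the Census documentation
# for the data vintage in use.
TOP_CODE = 2_000_001

# Hypothetical toy values, not real data
home_value_demo = pd.DataFrame(
    {"Median Home Value - Census ($)": [425_200.0, 2_000_001.0, 1_148_400.0]},
    index=pd.Index(["00001", "00002", "00003"], name="ZIP Code"),
)

# True where the reported median sits at (or above) the cap, i.e. is only a lower bound
home_value_demo["Home Value Censored"] = (
    home_value_demo["Median Home Value - Census ($)"] >= TOP_CODE
)

n_censored = int(home_value_demo["Home Value Censored"].sum())
print(f"{n_censored} of {len(home_value_demo)} ZIP codes are top-coded")
```

Keeping the flag as a separate column leaves the choice open between dropping the censored rows and modeling the cap explicitly.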

In [173]:
# number of housing units
housing_units = pd.read_csv(RAW_DATA_DIR / 'Housing Units in Census Zip Code Tabulation Areas of California (2021).csv')
housing_units = housing_units[['Entity properties name', 'Variable observation value']]
housing_units.columns=['ZIP Code', 'Housing Units']
housing_units.set_index('ZIP Code', inplace=True)

# median gross rent
gross_rent = pd.read_csv(RAW_DATA_DIR / 'Median Gross Rent of Housing Unit_ With Cash Rent in Census Zip Code Tabulation Areas of California (2021).csv')
gross_rent = gross_rent[['Entity properties name', 'Variable observation value']]
gross_rent.columns=['ZIP Code', 'Median Gross Rent ($)']
gross_rent.set_index('ZIP Code', inplace=True)

# median ownership costs
# Note: This dataset contains the median cost of housing units without mortgage
owner_cost = pd.read_csv(RAW_DATA_DIR / 'Median Cost of Housing Unit (Selected Monthly Owner Costs)_ Without Mortgage in Census Zip Code Tabulation Areas of California (2021).csv')
owner_cost = owner_cost[['Entity properties name', 'Variable observation value']]
owner_cost.columns=['ZIP Code', 'Median Owner Cost ($)']
owner_cost.set_index('ZIP Code', inplace=True)

# median home value
home_value = pd.read_csv(RAW_DATA_DIR / 'Median Home Value of Housing Unit_ Occupied Housing Unit, Owner Occupied in Census Zip Code Tabulation Areas of California (2021).csv')
home_value = home_value[['Entity properties name', 'Variable observation value']]
home_value.columns=['ZIP Code', 'Median Home Value - Census ($)']
home_value.set_index('ZIP Code', inplace=True)

housing = pd.concat([housing_units, gross_rent, owner_cost, home_value], axis=1).dropna()
housing.index = housing.index.astype(str)


In [174]:
housing.sample(3)

ZIP Code,Housing Units,Median Gross Rent ($),Median Owner Cost ($),Median Home Value - Census ($)
93446,19168,1595.0,615.0,502500.0
93622,2865,688.0,412.0,185900.0
96093,2023,929.0,407.0,298300.0


## <a id='toc1_2_'></a>[Insurance Data](#toc0_)

California requires insurers with written premiums above $10 million to submit a biennial report to the Insurance Commissioner with their residential property experience data for the previous two years. The data is processed by the Department of Insurance, and the aggregates are published at the zipcode level, including information about the number of policies, renewals, premiums, and losses. In this project, we use the following datasets:

- [New, renewed, and non-renewed insurance policies, 2015-2021](https://www.insurance.ca.gov/01-consumers/200-wrr/upload/Residential-Insurance-Policy-Analysis-by-County-2015-to-2021-2.pdf) 
- [Earned premiums, claims, and losses in residential units, 2018-2023](https://www.insurance.ca.gov/01-consumers/200-wrr/WildfireRiskInfoRpt.cfm)

Besides "regular" insurance data, we also include information about California's FAIR Plan. The California FAIR Plan provides basic insurance coverage for high-risk properties when traditional insurance companies will not. It has recently expanded to offer higher coverage limits of $3 million for residential policyholders and $20 million for commercial policies per location, serving as a safety net for properties that can't obtain coverage in the standard insurance market. The available data comes from the following datasets:
- [Residential Structures Insured under a FAIR Plan Policy, 2022](https://www.insurance.ca.gov/01-consumers/200-wrr/upload/Number-of-Residential-Dwelling-Units-Insured-in-2022-FAIR-Plan-vs-Voluntary.pdf)
- [Residential policies in force under the program, 2020-2024](https://www.cfpnet.com/wp-content/uploads/2024/11/CFP5yearPIFGrowthbyzipcodethrough09302024(Residential%20line)20241112v001.pdf)
- [Residential exposure covered by the program, 2020-2024](https://www.cfpnet.com/wp-content/uploads/2024/11/CFP5yearTIVGrowthbyzipcodethrough09302024(Residentialline)20241112v001.pdf)

*** IMPORTANT NOTE: the best available data covers only 2022 and will be used as the target for our model ***
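
As a rough illustration, and assuming the target is the FAIR Plan share of insured residential units, the share could be computed as below. The first two rows reuse the counts for ZIP codes 90001 and 90002 from the 2022 file loaded later in this notebook; the third row is a hypothetical edge case with no insured units.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "ZIP Code": ["90001", "90002", "99999"],  # "99999" is a hypothetical placeholder
    "Voluntary Market Units": [6913, 6534, 0],
    "FAIR Plan Units": [2104, 1330, 0],
})

total = demo["Voluntary Market Units"] + demo["FAIR Plan Units"]

# Replace a zero total with NaN so the share is missing rather than 0/0
demo["% FAIR Plan Units"] = demo["FAIR Plan Units"] / total.replace(0, np.nan)

print(demo)
```

The share is stored as a fraction between 0 and 1, matching the convention used in the FAIR Plan cells of this notebook.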

### <a id='toc1_2_1_'></a>[Renewals](#toc0_)

In [175]:
# loading and performing initial processing on the renewals data
renewals = pd.read_excel(RAW_DATA_DIR / 'Residential-Property-Voluntary-Market-New-Renew-NonRenew-by-ZIP-2015-2021.xlsx', dtype={'ZIP Code': str})

# removing zipcodes not associated with a county
renewals = renewals[renewals['County'].notna()]

# keeping only columns of interest and renaming
cols = ["ZIP Code", "Year", "New", "Renewed", "Insured-Initiated Nonrenewed", "Insurer-Initiated Nonrenewed"]
renewals = renewals[cols].copy()

renaming = {
    'New' : 'New Policies',
    'Renewed': 'Renewed Policies',
    'Insured-Initiated Nonrenewed': 'Nonrenewed Policies (by Owner)',
    'Insurer-Initiated Nonrenewed': 'Nonrenewed Policies (by Company)',
}

renewals.rename(columns=renaming, inplace=True)

renewals.sample(3)

Unnamed: 0,ZIP Code,Year,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company)
8885,95728,2018,92,905,70,20
11312,90058,2020,14,104,5,3
12419,93926,2020,167,1488,138,23


The original dataset provides raw counts of policy renewals and includes multiple years. Based on this information, we can calculate some extra features, including the relative importance of each count (i.e., percentages) and their change over time.
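
The two kinds of derived features can be sketched on invented counts as follows. The ZIP codes and numbers are hypothetical; the point is the pooled share and the start-to-end relative change, including the case where growth from a zero base is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for two ZIP codes in the start and end years
toy = pd.DataFrame({
    "ZIP Code": ["00001", "00001", "00002", "00002"],
    "Year": [2018, 2021, 2018, 2021],
    "Renewed Policies": [900, 880, 50, 40],
    "Nonrenewed Policies": [100, 120, 0, 10],
})
toy["Expiring Policies"] = toy["Renewed Policies"] + toy["Nonrenewed Policies"]

value_cols = ["Nonrenewed Policies", "Expiring Policies"]

# Share of expiring policies that were not renewed, pooled over both years
totals = toy.groupby("ZIP Code")[value_cols].sum()
share = totals["Nonrenewed Policies"] / totals["Expiring Policies"]

# Relative change from the first to the last year within each ZIP code
change = (
    toy.sort_values(["ZIP Code", "Year"])
       .set_index("ZIP Code")
       .groupby("ZIP Code")[value_cols]
       .pct_change()
       .dropna(how="all")                   # the first year of each ZIP has no previous value
       .replace([np.inf, -np.inf], np.nan)  # growth from a zero base is undefined
)

print(share)
print(change)
```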

In [176]:
# number of non-renewed policies and expiring policies (i.e., contracts up for renewal)
renewals['Nonrenewed Policies'] = renewals['Nonrenewed Policies (by Owner)'] + renewals['Nonrenewed Policies (by Company)']
renewals['Expiring Policies'] = renewals['Nonrenewed Policies'] + renewals['Renewed Policies']

In [177]:
# filtering years of interest
cond1 = renewals['Year'] >= start_year
cond2 = renewals['Year'] <= end_year

renewals_filtered = renewals.loc[cond1 & cond2].copy()
renewals_filtered = renewals_filtered.groupby('ZIP Code', as_index=False).sum()
renewals_filtered.drop(columns='Year', inplace=True)

renewals_filtered.sample(3)

Unnamed: 0,ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies
1930,95742,3874,15887,2445,258,2703,18590
532,92230,402,1883,264,161,425,2308
1332,94539,6363,50691,5470,571,6041,56732


In [178]:
# share of expiring policies that were not renewed (stored as a fraction, not 0-100)
renewals_filtered['% Nonrenewed Policies'] = renewals_filtered['Nonrenewed Policies'] / renewals_filtered['Expiring Policies']

# share of expiring policies not renewed at the initiative of the owner or the company
renewals_filtered['% Nonrenewed Policies (by Owner)'] = renewals_filtered['Nonrenewed Policies (by Owner)'] / renewals_filtered['Expiring Policies']
renewals_filtered['% Nonrenewed Policies (by Company)'] = renewals_filtered['Nonrenewed Policies (by Company)'] / renewals_filtered['Expiring Policies']

renewals_filtered.sample(3)

Unnamed: 0,ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,% Nonrenewed Policies,% Nonrenewed Policies (by Owner),% Nonrenewed Policies (by Company)
1270,94132,1743,17985,1496,236,1732,19717,0.087843,0.075874,0.011969
801,92822,0,1,0,0,0,1,0.0,0.0,0.0
512,92173,1001,9178,829,294,1123,10301,0.109019,0.080478,0.028541


In [179]:
# # TODO: move this one to feature engineering section

# # ratio of new policies to non-renewed policies
# renewals_filtered['ratio_new_to_nonrenewed'] = renewals_filtered['new_policies'] / (renewals_filtered['owner_nonrenewed'] + renewals_filtered['company_nonrenewed'])

In [180]:
# calculating change over time based on the start and end years
cond1 = renewals['Year'] == start_year
cond2 = renewals['Year'] == end_year

renewals_change = renewals[cond1 | cond2].copy().sort_values(['ZIP Code', 'Year']).set_index('ZIP Code')

renewals_change = renewals_change.groupby(['ZIP Code']).pct_change().dropna().copy().drop(columns='Year')
renewals_change.replace([np.inf, -np.inf], np.nan, inplace=True)

renewals_change.columns = ['% Change - ' + col for col in renewals_change.columns]
renewals_change.reset_index(inplace=True)
renewals_change.sample(3)

Unnamed: 0,ZIP Code,% Change - New Policies,% Change - Renewed Policies,% Change - Nonrenewed Policies (by Owner),% Change - Nonrenewed Policies (by Company),% Change - Nonrenewed Policies,% Change - Expiring Policies
911,94030,0.105566,-0.010411,0.240909,0.317073,0.247401,0.010751
682,93243,0.121951,-0.017341,0.8,1.1,0.885714,0.065617
1688,96120,0.0,0.0625,0.636364,2.125,0.926829,0.152672


In [181]:
# merging datasets 
renewals_change['ZIP Code'] = renewals_change['ZIP Code'].astype(str)
renewals_filtered['ZIP Code'] = renewals_filtered['ZIP Code'].astype(str)

renewals_reworked = pd.merge(renewals_filtered, renewals_change, on='ZIP Code')
renewals_reworked.set_index('ZIP Code', inplace=True)
renewals_reworked.sample(3)

ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,% Nonrenewed Policies,% Nonrenewed Policies (by Owner),% Nonrenewed Policies (by Company),% Change - New Policies,% Change - Renewed Policies,% Change - Nonrenewed Policies (by Owner),% Change - Nonrenewed Policies (by Company),% Change - Nonrenewed Policies,% Change - Expiring Policies
92252,2726,13223,1991,525,2516,15739,0.159858,0.126501,0.033357,0.514652,0.035934,0.676845,0.275862,0.585462,0.110226
95562,543,3914,395,143,538,4452,0.120845,0.088724,0.03212,0.03937,-0.001018,0.082353,-0.542373,-0.173611,-0.023091
92655,839,4927,625,168,793,5720,0.138636,0.109266,0.029371,0.193717,0.018852,0.251748,0.2,0.239362,0.048295


### <a id='toc1_2_2_'></a>[Premiums, Claims, and Losses](#toc0_)

This dataset covers certain types of residential policies (Dwelling Fire, Homeowners, and Earthquake policies) and includes information about total earned premiums, as well as the number of claims and total losses paid by insurance companies for fire- and smoke-related incidents. As with the renewal data, we compute aggregate values and percentage changes for the timespan of interest.

Note: The column "Total Exposure" seems to contain problematic data. The value for total exposure (the amount covered by the insurers) is much smaller than the premiums they received in a particular zipcode, which doesn't make sense. Therefore, we exclude this column.
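
One way to make this kind of check explicit is to compare exposure to earned premium row by row. This is a sketch on invented values; the column names mirror the source file, but the numbers and the threshold of 1 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the values are invented
check = pd.DataFrame({
    "Zipcode": [1, 2, 3],
    "Earned Premium": [250_000, 10_000_000, 700_000],
    "Total Exposure": [95_000_000, 1_200, 310_000_000],
})

# Insured value should normally be many times the premium collected on it
check["Exposure / Premium"] = check["Total Exposure"] / check["Earned Premium"]

# A ratio below 1 means the insurer collected more than it covered, which is implausible
suspicious = check[check["Exposure / Premium"] < 1]

print(f"{len(suspicious)} of {len(check)} rows have exposure below earned premium")
```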

In [182]:
premiums = pd.read_excel(RAW_DATA_DIR / 'Residential-Property-Coverage-Amounts-Wildfire-Risk-and-Losses.xlsx', sheet_name='Cleaned', header=3)

# removing "Grand total" and "County" rows from the dataset 
premiums = premiums[premiums.Zipcode.apply(type) == int]

# getting columns for claims (n.) and losses ($)
cols_claim = []
cols_losses = []

for col in premiums.columns[6:]:
    if col.endswith('Claims'):
        cols_claim.append(col)
    else:
        cols_losses.append(col)

# calculating total numbers of claims and values in losses
premiums['Claims (Fire and Smoke)'] = premiums[cols_claim].sum(axis=1)
premiums['Losses (Fire and Smoke) ($)'] = premiums[cols_losses].sum(axis=1)

# keeping only the columns of interest
cols = ['Zipcode', 'Year', 'Earned Premium', 'Claims (Fire and Smoke)', 'Losses (Fire and Smoke) ($)']

# filtering and renaming columns
premiums = premiums[cols].rename(columns={'Zipcode': 'ZIP Code', 'Earned Premium': 'Earned Premium ($)'})
premiums['ZIP Code'] = premiums['ZIP Code'].astype(str)
premiums.sample(3)

Unnamed: 0,ZIP Code,Year,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($)
37504,93726,2021,248193,1,3774.0
43593,92373,2021,10857237,10,1951842.0
45355,92028,2021,680620,2,22007.0


In [183]:
# calculating the aggregate values for the timespan
cond1 = premiums['Year'] >= start_year
cond2 = premiums['Year'] <= end_year

premiums_aggs = premiums[cond1 & cond2].groupby(['ZIP Code']).sum().reset_index()
premiums_aggs = premiums_aggs.drop(columns=['Year'])

premiums_aggs.sample(3)

Unnamed: 0,ZIP Code,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($)
996,92926,20,0,0.0
458,91944,12579,0,0.0
1311,93663,422,0,0.0


In [184]:
# filtering years and getting growth
cond1 = premiums['Year'] == start_year
cond2 = premiums['Year'] == end_year

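# creating pivot table with start and end years
# note: pivot_table defaults to aggfunc='mean', so multiple rows per ZIP Code and year are averaged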
premiums_pivot = pd.pivot_table(premiums[cond1 | cond2], index='ZIP Code', columns='Year').dropna()
premiums_pivot.columns = [f'{str(s[1])}_{s[0]}' for s in premiums_pivot.columns]
premiums_pivot = premiums_pivot.reset_index()

# calculating growth
premiums_pivot['% Change - Earned Premiums'] = premiums_pivot[['2018_Earned Premium ($)', '2021_Earned Premium ($)']].pct_change(axis=1).iloc[:, 1]
premiums_pivot['% Change - Claims (Fire and Smoke)'] = premiums_pivot[['2018_Claims (Fire and Smoke)', '2021_Claims (Fire and Smoke)']].pct_change(axis=1).iloc[:, 1]
premiums_pivot['% Change - Losses (Fire and Smoke)'] = premiums_pivot[['2018_Losses (Fire and Smoke) ($)', '2021_Losses (Fire and Smoke) ($)']].pct_change(axis=1).iloc[:, 1]

premiums_pivot.replace([np.inf, -np.inf], np.nan, inplace=True)
premiums_pivot.sample(3)

Unnamed: 0,ZIP Code,2018_Claims (Fire and Smoke),2021_Claims (Fire and Smoke),2018_Earned Premium ($),2021_Earned Premium ($),2018_Losses (Fire and Smoke) ($),2021_Losses (Fire and Smoke) ($),% Change - Earned Premiums,% Change - Claims (Fire and Smoke),% Change - Losses (Fire and Smoke)
1118,93906,2.571429,3.714286,1179368.0,1505969.0,216488.857143,239262.857143,0.27693,0.444444,0.105197
1775,95658,1.285714,1.857143,405807.9,573793.6,23706.714286,643199.428571,0.413954,0.444444,26.13153
1241,94519,1.714286,0.714286,653436.7,857109.7,98937.142857,4723.285714,0.311695,-0.583333,-0.95226


In [185]:
df1 = premiums_pivot[['ZIP Code', '% Change - Earned Premiums', '% Change - Claims (Fire and Smoke)', '% Change - Losses (Fire and Smoke)']].copy()
df2 = premiums_aggs.copy()

premiums_reworked = pd.merge(df2, df1, on='ZIP Code')
premiums_reworked.set_index('ZIP Code', inplace=True)

premiums_reworked.sample(3)

ZIP Code,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($),% Change - Earned Premiums,% Change - Claims (Fire and Smoke),% Change - Losses (Fire and Smoke)
92623,1259,0,0.0,-0.427621,,
91793,6929,0,0.0,-0.074182,,
91030,29031764,34,3318227.0,0.34158,-0.714286,-0.977853


In [186]:
# TODO: move this one to feature engineering section

# # ratio between losses and premium
# premiums_reworked['ratio_losses_to_premium'] = premiums_reworked['fire_smoke_losses'] / premiums_reworked['earned_premium']

### <a id='toc1_2_3_'></a>[FAIR Plan (2022)](#toc0_)

In [187]:
fair22 = pd.read_excel(RAW_DATA_DIR / 'full_residential_units_insured_2022.xlsx')

cols = ['ZIP Code', "Voluntary Market Units", "FAIR Plan Units"]
fair22 = fair22[cols]

# calculate percentages
fair22['Total Res Units'] = fair22['Voluntary Market Units'] + fair22['FAIR Plan Units']
fair22['% Market Units'] = fair22['Voluntary Market Units'] / fair22['Total Res Units']
fair22['% FAIR Plan Units'] = fair22['FAIR Plan Units'] / fair22['Total Res Units']

fair22.head(2)

Unnamed: 0,ZIP Code,Voluntary Market Units,FAIR Plan Units,Total Res Units,% Market Units,% FAIR Plan Units
0,90001,6913,2104,9017,0.766663,0.233337
1,90002,6534,1330,7864,0.830875,0.169125


Besides the 2022 dataset, California also published general information about total exposure covered by FAIR Plan policies, which we're incorporating below. It's potentially a secondary target variable.

In [188]:
columns_exp = ['ZIP Code', 
               'growth_exp_23_24', 'exposure_24',
               'growth_exp_22_23', 'exposure_23',
               'growth_exp_21_22', 'Total Exposure ($)',
               'growth_exp_20_21', 'exposure_21',
               'exposure_20']

fair_exp = pd.read_excel(RAW_DATA_DIR / 'CFP5yearTIVGrowthbyzipcodethrough09302024(Residentialline)20241112v001_unlocked.xlsx', names=columns_exp)

# removing rows that don't contain actual data (totals, etc.)
from pandas.api.types import is_integer
fair_exp = fair_exp[fair_exp['ZIP Code'].apply(is_integer)].copy()

# cleaning up exposure data
def clean_exposure(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan
    
fair_exp['Total Exposure ($)'] = fair_exp['Total Exposure ($)'].map(clean_exposure)
fair_exp.drop(index=[1670], inplace=True)  # removing duplicated data

fair_exp.sample(3)

Unnamed: 0,ZIP Code,growth_exp_23_24,exposure_24,growth_exp_22_23,exposure_23,growth_exp_21_22,Total Exposure ($),growth_exp_20_21,exposure_21,exposure_20
482,90304,0.13,114411657,0.101,101241332,0.019,91976917.0,0.023,90269952,88205111
61,95669,0.442,215038564,0.597,149086211,0.229,93327722.0,0.785,75934360,42533207
911,92844,3.14,16070903,2.275,3881562,2.574,1185378.0,0.052,331659,315170


In [189]:
# merging exposure column
fair = pd.merge(fair22, fair_exp[['ZIP Code', 'Total Exposure ($)']], on='ZIP Code')
fair['ZIP Code'] = fair['ZIP Code'].astype(str)
fair.set_index('ZIP Code', inplace=True)

In [190]:
fair.sample(3)

ZIP Code,Voluntary Market Units,FAIR Plan Units,Total Res Units,% Market Units,% FAIR Plan Units,Total Exposure ($)
93033,10973,23,10996,0.997908,0.002092,15689632.0
95636,200,121,321,0.623053,0.376947,65656163.0
90601,6934,264,7198,0.963323,0.036677,160340744.0


## <a id='toc1_3_'></a>[Zillow Data (Housing Value Index)](#toc0_)

The Zillow Home Value Index (ZHVI) represents the “typical” home value for a region. It’s calculated as a weighted average of the middle third of homes in a given region; it therefore reflects the typical value for homes in the 35th to 65th percentile range.

The base dataset can be found here (https://www.zillow.com/research/data/) with the Data Type: "ZHVI All Homes (SFR, Condo/Co-Op) Time Series, Smoothed, Seasonally Adjusted($)" and "Zip Code" for "Geography." Additional information about ZHVI is available [here](https://www.zillow.com/research/zhvi-user-guide/).

In [191]:
zillow = pd.read_csv(RAW_DATA_DIR / "Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month_adjusted.csv", date_format='%Y-%m-%d', parse_dates=[2,3,4,5,6,])

# Filter rows where the State column is 'CA'
zillow = zillow[zillow['State'] == 'CA']

# Filter the house value for Dec 31st of each year
non_date_cols = ['RegionName', 'State']
date_cols = pd.date_range(start='2015-12-31', periods=10, freq='YE').strftime('%m/%d/%Y').tolist()
columns_to_keep = non_date_cols + date_cols

# Keep only those columns (skip missing ones to avoid KeyError)
existing_cols = [col for col in columns_to_keep if col in zillow.columns]
zillow = zillow[existing_cols]

# Rename columns to have a consistent format
zillow.columns = non_date_cols + zillow[date_cols].columns.str.slice(-4).tolist()

# Dropping the years we aren't interested in
cols_to_drop = ['State', '2015', '2016', '2017', '2022', '2023', '2024']
zillow.drop(columns=cols_to_drop, inplace=True)

# Renaming columns for clarity
zillow.rename(columns={'RegionName': 'ZIP Code',
                          '2018': 'Zillow Home Value 2018 ($)',
                          '2019': 'Zillow Home Value 2019 ($)',
                          '2020': 'Zillow Home Value 2020 ($)',
                          '2021': 'Zillow Home Value 2021 ($)'}, inplace=True)
                          
zillow['ZIP Code'] = zillow['ZIP Code'].astype(str)
zillow.dropna(inplace=True)
zillow.set_index('ZIP Code', inplace=True)

zillow.head()

ZIP Code,Zillow Home Value 2018 ($),Zillow Home Value 2019 ($),Zillow Home Value 2020 ($),Zillow Home Value 2021 ($)
90011,443282.5808,459312.416,506920.126,560746.111
90650,513295.6656,526093.8573,581498.8551,660390.8548
91331,504148.1256,516199.8496,576276.3692,643635.0577
90044,474820.6675,498004.119,549195.7913,618134.0289
92336,452715.2279,465611.3794,508069.8122,623341.2874


In [192]:
# calculating mean and change over time
zillow['Zillow Mean Home Value ($)'] = zillow[['Zillow Home Value 2018 ($)', 'Zillow Home Value 2019 ($)', 'Zillow Home Value 2020 ($)', 'Zillow Home Value 2021 ($)']].mean(axis=1)
zillow['% Change - Zillow Home Value'] = zillow[['Zillow Home Value 2018 ($)', 'Zillow Home Value 2021 ($)']].pct_change(axis=1).iloc[:, 1]


zillow.sample(3)


ZIP Code,Zillow Home Value 2018 ($),Zillow Home Value 2019 ($),Zillow Home Value 2020 ($),Zillow Home Value 2021 ($),Zillow Mean Home Value ($),% Change - Zillow Home Value
96124,263357.4,271834.3,298154.2,352829.2,296543.8,0.339736
94018,1179327.0,1167470.0,1255408.0,1417533.0,1254935.0,0.201985
91355,588457.1,583390.2,632556.7,744152.0,637139.0,0.264581


## <a id='toc1_4_'></a>[Disaster Data](#toc0_)

Tiana TODO: can you add a brief description about this dataset?


Leo: I'm starting from [cleaned_climate_disasters.csv](../data/cleaned_climate_disasters.csv). Tiana --> do you have the code before getting to this point.

*** IMPORTANT NOTE: Disaster data is published at the county level. Therefore, we assume that every zipcode in a given county was affected by the disaster.
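
The county-to-zipcode assumption amounts to a join between county-level records and a ZIP-to-county crosswalk. The sketch below uses invented counties and ZIP codes; the `county_disasters` and `zip_county` frames are hypothetical and are not the files used to build `cleaned_climate_disasters.csv`.

```python
import pandas as pd

# Hypothetical county-level disaster records
county_disasters = pd.DataFrame({
    "COUNTY": ["County A", "County A", "County B"],
    "DATE": [2019, 2020, 2020],
    "DISASTER": ["Fire", "Storm", "Fire"],
})

# Hypothetical ZIP-to-county crosswalk (one county per ZIP code)
zip_county = pd.DataFrame({
    "ZIP": ["00001", "00002", "00003"],
    "COUNTY": ["County A", "County A", "County B"],
})

# Every ZIP code inherits all disasters recorded for its county
zip_disasters = zip_county.merge(county_disasters, on="COUNTY", how="left")
zip_disasters = zip_disasters[["ZIP", "DATE", "DISASTER"]]

print(zip_disasters)
```

A ZIP code that spans more than one county would need an explicit rule (for example, assigning it to the county holding most of its area), which this sketch does not cover.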

In [193]:
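# DATA_DIR is not defined earlier in this notebook; '../data/' is assumed from the link above
DATA_DIR = Path('../data/')
disasters = pd.read_csv(DATA_DIR / 'cleaned_climate_disasters.csv')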
disasters.sample(3)

Unnamed: 0,ZIP,DATE,DISASTER
42617,96058,1992,Storm
21067,92554,2015,Fire
14526,95444,2019,Fire


In [194]:
range1 = list(range(2022-1, 2022))
range3 = list(range(2022-3, 2022))
range5 = list(range(2022-5, 2022))
range10 = list(range(2022-10, 2022))

In [195]:
all_disasters = pd.pivot_table(disasters, index='ZIP', columns='DATE', aggfunc='count', fill_value=0)
all_disasters.columns = [x[1] for x in all_disasters.columns]

all_disasters['All Disasters 1y'] = all_disasters[range1].sum(axis=1)
all_disasters['All Disasters 3y'] = all_disasters[range3].sum(axis=1)
all_disasters['All Disasters 5y'] = all_disasters[range5].sum(axis=1)
all_disasters['All Disasters 10y'] = all_disasters[range10].sum(axis=1)

all_disasters.iloc[:, -4:].sample(5)

ZIP,All Disasters 1y,All Disasters 3y,All Disasters 5y,All Disasters 10y
92371,1,7,9,12
95111,0,2,2,3
92273,0,2,2,4
94960,0,2,2,3
94564,2,4,5,6


In [196]:
fire_disasters = disasters[disasters['DISASTER'] == 'Fire'].copy()
fire_disasters = pd.pivot_table(fire_disasters, index='ZIP', columns='DATE', aggfunc='count', fill_value=0)

fire_disasters.columns = [x[1] for x in fire_disasters.columns]

fire_disasters['Fire Disasters 1y'] = fire_disasters[range1].sum(axis=1)
fire_disasters['Fire Disasters 3y'] = fire_disasters[range3].sum(axis=1)
fire_disasters['Fire Disasters 5y'] = fire_disasters[range5].sum(axis=1)
fire_disasters['Fire Disasters 10y'] = fire_disasters[range10].sum(axis=1)

fire_disasters.iloc[:, -4:].sample(5)

ZIP,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y
92506,0,3,5,8
94229,0,2,2,3
92052,0,3,5,7
94128,0,2,2,3
93650,0,3,3,4


In [197]:
climate_disasters = pd.concat([all_disasters.iloc[:, -4:], fire_disasters.iloc[:, -4:]], axis=1)
climate_disasters.index.name = 'ZIP Code'
climate_disasters.index = climate_disasters.index.astype(str)

climate_disasters.sample(3)

ZIP Code,All Disasters 1y,All Disasters 3y,All Disasters 5y,All Disasters 10y,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y
90610,1,7,11,17,0,5,8,11
95076,2,4,5,7,0,2,2,3
90024,0,2,2,3,0,2,2,3
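
The lookback counts above can also be expressed as a small helper that filters by year instead of selecting year columns, which avoids a `KeyError` when a year has no events. This is an alternative sketch on invented events, not the approach used in the cells above; `count_in_window` is a hypothetical helper name.

```python
import pandas as pd

def count_in_window(events: pd.DataFrame, end_year: int, n_years: int) -> pd.Series:
    """Count events per ZIP in the n_years before end_year (end_year itself excluded)."""
    in_window = events["DATE"].between(end_year - n_years, end_year - 1)
    return events[in_window].groupby("ZIP").size()

# Hypothetical events with the same columns as the disasters table
events = pd.DataFrame({
    "ZIP": ["00001", "00001", "00001", "00002"],
    "DATE": [2012, 2019, 2021, 2015],
    "DISASTER": ["Storm", "Fire", "Fire", "Fire"],
})
zips = sorted(events["ZIP"].unique())

# One column per lookback window; ZIP codes without events in a window get 0
windows = pd.DataFrame({
    f"All Disasters {n}y": count_in_window(events, 2022, n).reindex(zips, fill_value=0)
    for n in (1, 3, 5, 10)
})

print(windows)
```

Passing a subset such as `events[events["DISASTER"] == "Fire"]` would give the fire-only counts with the same helper.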


## <a id='toc1_5_'></a>[Consolidating data](#toc0_)

In [222]:
dfs = [housing, renewals_reworked, premiums_reworked, fair, zillow, climate_disasters, white_pop, median_income]
concat = pd.concat(dfs, axis=1)
concat.sample(3)

ZIP Code,Housing Units,Median Gross Rent ($),Median Owner Cost ($),Median Home Value - Census ($),New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,...,All Disasters 5y,All Disasters 10y,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y,Avg % White-only Pop,% Change White-only Pop,Avg Median Income,% Change Median Income
90048,13758.0,2162.0,1239.0,1441200.0,2001.0,12852.0,1634.0,362.0,1996.0,14848.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,97685.714286,0.220514
90846,,,,,,,,,,,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,,
93011,,,,,2.0,8.0,3.0,1.0,4.0,12.0,...,6.0,8.0,0.0,2.0,4.0,5.0,72.714286,-21.3,,


In [225]:
concat.to_csv(CLEAN_DATA_DIR / 'full_dataset.csv')
concat.dropna().to_csv(CLEAN_DATA_DIR / 'cleaned_data.csv')
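
Since `pd.concat(..., axis=1)` keeps the union of ZIP codes and `dropna()` then keeps only complete rows, it may be worth checking how much each source contributes before relying on the cleaned file. A minimal sketch on two invented frames (the ZIP codes and values are hypothetical):

```python
import pandas as pd

# Hypothetical frames indexed by ZIP code
a = pd.DataFrame(
    {"Housing Units": [100, 200, 300]},
    index=pd.Index(["00001", "00002", "00003"], name="ZIP Code"),
)
b = pd.DataFrame(
    {"New Policies": [10, 20]},
    index=pd.Index(["00001", "00004"], name="ZIP Code"),
)

# axis=1 concat aligns on the index and keeps the union of ZIP codes (outer join)
combined = pd.concat([a, b], axis=1)

# Non-missing values per column show which source limits the complete-case sample
coverage = combined.notna().sum()
complete_rows = len(combined.dropna())

print(coverage)
print(f"{complete_rows} of {len(combined)} ZIP codes have data in every column")
```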

In [224]:
concat.dropna()

ZIP Code,Housing Units,Median Gross Rent ($),Median Owner Cost ($),Median Home Value - Census ($),New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,...,All Disasters 5y,All Disasters 10y,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y,Avg % White-only Pop,% Change White-only Pop,Avg Median Income,% Change Median Income
90001,14010.0,1262.0,427.0,425200.0,3162.0,22127.0,1780.0,1073.0,2853.0,24980.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,48115.285714,0.703617
90002,13577.0,1287.0,465.0,411500.0,3448.0,21908.0,2091.0,1022.0,3113.0,25021.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,43639.714286,0.651706
90003,18501.0,1293.0,555.0,430800.0,4525.0,26933.0,2667.0,1304.0,3971.0,30904.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,44032.857143,0.592610
90004,25100.0,1530.0,829.0,1148400.0,2182.0,15683.0,1704.0,434.0,2138.0,17821.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,53401.000000,0.345076
90005,18128.0,1413.0,1123.0,964700.0,521.0,3357.0,384.0,125.0,509.0,3866.0,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,42199.000000,0.625181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96143,2843.0,1315.0,715.0,542100.0,768.0,4378.0,475.0,315.0,790.0,5168.0,...,2.0,3.0,0.0,2.0,2.0,3.0,79.071429,-10.2,52730.714286,0.383941
96145,4840.0,1473.0,876.0,785700.0,1555.0,8787.0,917.0,692.0,1609.0,10396.0,...,2.0,3.0,0.0,2.0,2.0,3.0,79.071429,-10.2,94609.857143,0.702736
96146,2014.0,1522.0,920.0,1203900.0,623.0,3749.0,364.0,457.0,821.0,4570.0,...,2.0,3.0,0.0,2.0,2.0,3.0,79.071429,-10.2,88706.857143,0.561384
96148,887.0,954.0,646.0,681100.0,476.0,2226.0,260.0,134.0,394.0,2620.0,...,2.0,3.0,0.0,2.0,2.0,3.0,79.071429,-10.2,80547.857143,0.167422
