# <a id='toc1_'></a>[Data Prep and Cleanup](#toc0_)

Our project examines the potential loss of homeowner insurance coverage due to climate disasters and evaluates how different sectors of society are affected by this process. To do so, this notebook compiles several datasets produced by different organizations, focusing on data related to insurance, population, housing, and climate disasters, to generate a base dataset for our analysis. Given the availability of data, we focus on California and use its residential ZIP codes as our observations.

In this notebook, we will:
- load datasets 
- perform basic clean-up tasks
- add basic new features
- standardize feature names
- generate base dataset for the project's next steps

**Table of contents**<a id='toc0_'></a>    
- [Data Prep and Cleanup](#toc1_)    
  - [Census Data](#toc1_1_)    
    - [TODO: Demographic](#toc1_1_1_)    
    - [Housing (2021)](#toc1_1_2_)    
  - [Insurance Data](#toc1_2_)    
    - [Renewals](#toc1_2_1_)    
    - [Premiums, Claims, and Losses](#toc1_2_2_)    
    - [FAIR Plan (2022)](#toc1_2_3_)    
  - [Zillow Data (Housing Value Index)](#toc1_3_)    
  - [TODO: Disaster Data](#toc1_4_)    
- [OLD CODE](#toc2_)    
  - [FAIR Plan 2 (2020-2024)](#toc2_1_)    
  - [FEMA Projected Premium Increases (2021, 2025)](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [35]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

In [36]:
# base folders
RAW_DATA_DIR = Path('../raw_data/')
CLEAN_DATA_DIR = Path('../clean_data/')

In [37]:
# years to slice the data
start_year = 2018
end_year = 2021

## <a id='toc1_1_'></a>[Census Data](#toc0_)


ACS5?

TODO: add description here

### <a id='toc1_1_1_'></a>[Demography](#toc0_)

TODO: add description

In [38]:
# Median Income
median_income = pd.read_csv(RAW_DATA_DIR / 'median_incomes_flat.csv', index_col=0)

median_income.columns = ['ZIP Code', 'Avg Median Income', '% Avg Change Median Income', '% Change Median Income']

median_income['ZIP Code'] = median_income['ZIP Code'].astype(str)
median_income = median_income[['ZIP Code', 'Avg Median Income', '% Change Median Income']]

median_income.set_index('ZIP Code', inplace=True)
median_income

ZIP Code,Avg Median Income,% Change Median Income
90001,48115.285714,0.703617
90002,43639.714286,0.651706
90003,44032.857143,0.592610
90004,53401.000000,0.345076
90005,42199.000000,0.625181
...,...,...
96145,94609.857143,0.702736
96146,88706.857143,0.561384
96148,80547.857143,0.167422
96150,65445.428571,0.578176


In [39]:
# Race 
# Note: we use the percentage of "white-only" population as a proxy for race

white_pop = pd.read_csv(RAW_DATA_DIR / 'percent_white_population_flat.csv', index_col=0)

white_pop.columns = ['ZIP Code', 'Avg % White-only Pop', '% Avg Change % of White-only Pop', '% Change White-only Pop']

white_pop['ZIP Code'] = white_pop['ZIP Code'].astype(str)
white_pop = white_pop[['ZIP Code', 'Avg % White-only Pop', '% Change White-only Pop']]

white_pop.set_index('ZIP Code', inplace=True)
white_pop.sample(5)

ZIP Code,Avg % White-only Pop,% Change White-only Pop
90023,45.857143,-16.4
95465,71.442857,-10.8
92259,52.728571,-34.7
95215,49.414286,-18.8
95009,40.714286,-12.3


### <a id='toc1_1_2_'></a>[Housing (2021)](#toc0_)

*** IMPORTANT NOTE: Median Home Value is capped at $2,000,001 (censored data)
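
Because the value is top-coded, it may help to flag the affected ZIP codes before any modeling. The snippet below is a minimal sketch on made-up data. The `Home Value Censored` column, the `CENSUS_HOME_VALUE_CAP` constant, and the ZIP code `00000` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Assumption: the cap is the top-coded value mentioned in the note above.
CENSUS_HOME_VALUE_CAP = 2_000_001

# Toy data standing in for the real `housing` frame (the last row is made up).
toy_housing = pd.DataFrame(
    {"Median Home Value - Census ($)": [319_400.0, 927_100.0, 2_000_001.0]},
    index=pd.Index(["95832", "94930", "00000"], name="ZIP Code"),
)

# A value at the cap only tells us the true median is at least that high.
toy_housing["Home Value Censored"] = (
    toy_housing["Median Home Value - Census ($)"] >= CENSUS_HOME_VALUE_CAP
)

n_censored = int(toy_housing["Home Value Censored"].sum())
print(f"Censored ZIP codes: {n_censored}")
```

A flag like this keeps the censored rows in the dataset while letting later steps decide whether to drop them or model them separately.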

In [40]:
# number of housing units
housing_units = pd.read_csv(RAW_DATA_DIR / 'Housing Units in Census Zip Code Tabulation Areas of California (2021).csv')
housing_units = housing_units[['Entity properties name', 'Variable observation value']]
housing_units.columns=['ZIP Code', 'Housing Units']
housing_units.set_index('ZIP Code', inplace=True)

# median gross rent
gross_rent = pd.read_csv(RAW_DATA_DIR / 'Median Gross Rent of Housing Unit_ With Cash Rent in Census Zip Code Tabulation Areas of California (2021).csv')
gross_rent = gross_rent[['Entity properties name', 'Variable observation value']]
gross_rent.columns=['ZIP Code', 'Median Gross Rent ($)']
gross_rent.set_index('ZIP Code', inplace=True)

# median ownership costs
# Note: This dataset contains the median cost of housing units without mortgage
owner_cost = pd.read_csv(RAW_DATA_DIR / 'Median Cost of Housing Unit (Selected Monthly Owner Costs)_ Without Mortgage in Census Zip Code Tabulation Areas of California (2021).csv')
owner_cost = owner_cost[['Entity properties name', 'Variable observation value']]
owner_cost.columns=['ZIP Code', 'Median Owner Cost ($)']
owner_cost.set_index('ZIP Code', inplace=True)

# median home value
home_value = pd.read_csv(RAW_DATA_DIR / 'Median Home Value of Housing Unit_ Occupied Housing Unit, Owner Occupied in Census Zip Code Tabulation Areas of California (2021).csv')
home_value = home_value[['Entity properties name', 'Variable observation value']]
home_value.columns=['ZIP Code', 'Median Home Value - Census ($)']
home_value.set_index('ZIP Code', inplace=True)

housing = pd.concat([housing_units, gross_rent, owner_cost, home_value], axis=1).dropna()
housing.index = housing.index.astype(str)


In [41]:
housing.sample(3)

ZIP Code,Housing Units,Median Gross Rent ($),Median Owner Cost ($),Median Home Value - Census ($)
90290,2427,3501.0,886.0,1184900.0
94930,4080,2106.0,914.0,927100.0
95832,3264,1586.0,488.0,319400.0


## <a id='toc1_2_'></a>[Insurance Data](#toc0_)

California requires insurers with written premiums above $10 million to submit a biennial report to the Insurance Commissioner with their residential property experience data for the previous two years. The Department of Insurance processes the data and publishes aggregates at the ZIP code level, including the number of policies, renewals, premiums, and losses. In this project, we use the following datasets:

- [New, renewed, and non-renewed insurance policies, 2015-2021](https://www.insurance.ca.gov/01-consumers/200-wrr/upload/Residential-Insurance-Policy-Analysis-by-County-2015-to-2021-2.pdf) 
- [Earned premiums, claims, and losses in residential units, 2018-2023](https://www.insurance.ca.gov/01-consumers/200-wrr/WildfireRiskInfoRpt.cfm)

Besides "regular" insurance data, we also include information about California's FAIR Plan. The California FAIR Plan provides basic insurance coverage for high-risk properties when traditional insurance companies will not. It has recently expanded to offer higher coverage limits of $3 million for residential policyholders and $20 million for commercial policies per location, serving as a safety net for properties that can't obtain coverage in the standard insurance market. The available data comes from the following datasets:
- [Residential Structures Insured under a FAIR Plan Policy, 2022](https://www.insurance.ca.gov/01-consumers/200-wrr/upload/Number-of-Residential-Dwelling-Units-Insured-in-2022-FAIR-Plan-vs-Voluntary.pdf)
- [Residential policies in force under the program, 2020-2024](https://www.cfpnet.com/wp-content/uploads/2024/11/CFP5yearPIFGrowthbyzipcodethrough09302024(Residential%20line)20241112v001.pdf)
- [Residential exposure covered by the program, 2020-2024](https://www.cfpnet.com/wp-content/uploads/2024/11/CFP5yearTIVGrowthbyzipcodethrough09302024(Residentialline)20241112v001.pdf)

*** IMPORTANT NOTE: the best available data covers only 2022 and will be used as the target for our model ***

### <a id='toc1_2_1_'></a>[Renewals](#toc0_)

In [42]:
# loading and performing initial processing on the renewals data
renewals = pd.read_excel(RAW_DATA_DIR / 'Residential-Property-Voluntary-Market-New-Renew-NonRenew-by-ZIP-2015-2021.xlsx', dtype={'ZIP Code': str})

# removing zipcodes not associated with a county
renewals = renewals[renewals['County'].notnull()]

# keeping only columns of interest and renaming
cols = ["ZIP Code", "Year", "New", "Renewed", "Insured-Initiated Nonrenewed", "Insurer-Initiated Nonrenewed"]
renewals = renewals[cols].copy()

renaming = {
    'New' : 'New Policies',
    'Renewed': 'Renewed Policies',
    'Insured-Initiated Nonrenewed': 'Nonrenewed Policies (by Owner)',
    'Insurer-Initiated Nonrenewed': 'Nonrenewed Policies (by Company)',
}

renewals.rename(columns=renaming, inplace=True)

renewals.sample(3)

Unnamed: 0,ZIP Code,Year,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company)
1295,93628,2015,9,169,12,3
993,92850,2015,0,1,0,0
8823,95654,2018,0,0,0,0


The original dataset provides raw counts of policy renewals and includes multiple years. Based on this information, we can calculate some extra features, including the relative importance of each count (i.e., percentages) and their change over time.
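
To make the two derived quantities concrete, here is a small sketch on made-up counts. It computes the pooled share of non-renewed policies and the relative change between the first and last year, treating zero denominators as missing. The ZIP codes and numbers are hypothetical, and the cells below remain the actual implementation.

```python
import numpy as np
import pandas as pd

# Toy counts (made up) for three hypothetical ZIP codes and the two endpoint years.
toy = pd.DataFrame({
    "ZIP Code": ["00001", "00001", "00002", "00002", "00003", "00003"],
    "Year": [2018, 2021, 2018, 2021, 2018, 2021],
    "Renewed Policies": [90, 85, 0, 0, 0, 0],
    "Nonrenewed Policies": [10, 15, 0, 5, 0, 0],
})
toy["Expiring Policies"] = toy["Renewed Policies"] + toy["Nonrenewed Policies"]

# Share of expiring policies that were not renewed, pooled over all years.
totals = toy.groupby("ZIP Code")[["Nonrenewed Policies", "Expiring Policies"]].sum()
# Treat a zero denominator as missing so the result is NaN rather than inf.
share = totals["Nonrenewed Policies"] / totals["Expiring Policies"].replace(0, np.nan)

# Relative change between the first and last year: (end - start) / start.
wide = toy.pivot(index="ZIP Code", columns="Year", values="Nonrenewed Policies")
change = (wide[2021] - wide[2018]) / wide[2018]
change = change.replace([np.inf, -np.inf], np.nan)  # growth from zero is undefined

print(share)
print(change)
```

Note that the shares are stored as fractions between 0 and 1, even though the column names in this notebook use a `%` prefix.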

In [43]:
# number of non-renewed policies and expiring policies (i.e., contracts up for renewal)
renewals['Nonrenewed Policies'] = renewals['Nonrenewed Policies (by Owner)'] + renewals['Nonrenewed Policies (by Company)']
renewals['Expiring Policies'] = renewals['Nonrenewed Policies'] + renewals['Renewed Policies']

In [44]:
# filtering years of interest
cond1 = renewals['Year'] >= start_year
cond2 = renewals['Year'] <= end_year

renewals_filtered = renewals.loc[cond1 & cond2].copy()
renewals_filtered = renewals_filtered.groupby('ZIP Code', as_index=False).sum()
renewals_filtered.drop(columns='Year', inplace=True)

renewals_filtered.sample(3)

Unnamed: 0,ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies
802,92823,839,5162,585,174,759,5921
2101,96061,100,557,55,33,88,645
1535,95115,0,4,0,0,0,4


In [45]:
# non-renewed policies as a share of expiring policies (stored as a fraction, 0-1)
renewals_filtered['% Nonrenewed Policies'] = renewals_filtered['Nonrenewed Policies'] / renewals_filtered['Expiring Policies']

# share of expiring policies not renewed at the initiative of the owner or the company
renewals_filtered['% Nonrenewed Policies (by Owner)'] = renewals_filtered['Nonrenewed Policies (by Owner)'] / renewals_filtered['Expiring Policies']
renewals_filtered['% Nonrenewed Policies (by Company)'] = renewals_filtered['Nonrenewed Policies (by Company)'] / renewals_filtered['Expiring Policies']

renewals_filtered.sample(3)

Unnamed: 0,ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,% Nonrenewed Policies,% Nonrenewed Policies (by Owner),% Nonrenewed Policies (by Company)
304,91602,1338,9368,1086,212,1298,10666,0.121695,0.101819,0.019876
452,92082,3605,18068,2110,1521,3631,21699,0.167335,0.09724,0.070095
1413,94706,1437,16864,1229,186,1415,18279,0.077411,0.067236,0.010176


In [46]:
# # TODO: move this one to feature engineering section

# # ratio of new policies to non-renewed policies
# renewals_filtered['ratio_new_to_nonrenewed'] = renewals_filtered['new_policies'] / (renewals_filtered['owner_nonrenewed'] + renewals_filtered['company_nonrenewed'])

In [47]:
# calculating change over time based on the start and end years
cond1 = renewals['Year'] == start_year
cond2 = renewals['Year'] == end_year

renewals_change = renewals[cond1 | cond2].copy().sort_values(['ZIP Code', 'Year']).set_index('ZIP Code')

renewals_change = renewals_change.groupby(['ZIP Code']).pct_change().dropna().copy().drop(columns='Year')
renewals_change.replace([np.inf, -np.inf], np.nan, inplace=True)

renewals_change.columns = ['% Change - ' + col for col in renewals_change.columns]
renewals_change.reset_index(inplace=True)
renewals_change.sample(3)

Unnamed: 0,ZIP Code,% Change - New Policies,% Change - Renewed Policies,% Change - Nonrenewed Policies (by Owner),% Change - Nonrenewed Policies (by Company),% Change - Nonrenewed Policies,% Change - Expiring Policies
537,92596,0.442039,0.141955,0.263359,0.636872,0.317848,0.169354
38,90042,0.213,0.030972,0.243207,0.098901,0.214597,0.048351
1104,94970,0.313725,-0.009804,-0.019608,2.111111,0.3,0.022807


In [48]:
# merging datasets 
renewals_change['ZIP Code'] = renewals_change['ZIP Code'].astype(str)
renewals_filtered['ZIP Code'] = renewals_filtered['ZIP Code'].astype(str)

renewals_reworked = pd.merge(renewals_filtered, renewals_change, on='ZIP Code')
renewals_reworked.set_index('ZIP Code', inplace=True)
renewals_reworked.sample(3)

ZIP Code,New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,% Nonrenewed Policies,% Nonrenewed Policies (by Owner),% Nonrenewed Policies (by Company),% Change - New Policies,% Change - Renewed Policies,% Change - Nonrenewed Policies (by Owner),% Change - Nonrenewed Policies (by Company),% Change - Nonrenewed Policies,% Change - Expiring Policies
92061,440,1944,322,178,500,2444,0.204583,0.131751,0.072831,0.428571,-0.052941,0.350649,2.416667,0.629213,0.048414
91941,4125,35453,3600,712,4312,39765,0.108437,0.090532,0.017905,0.092357,-0.026254,0.182266,0.445205,0.222338,-0.002312
92253,11516,58296,9578,1453,11031,69327,0.159115,0.138157,0.020959,0.221793,0.011146,0.256132,0.326154,0.26544,0.047957


### <a id='toc1_2_2_'></a>[Premiums, Claims, and Losses](#toc0_)

This dataset covers certain types of residential policies (Dwelling Fire policies, Homeowners policies, Earthquake policies) and includes information about total earned premiums as well as the number of claims and total losses paid by insurance companies for fire- and smoke-related incidents. Similar to the renewal data, we will compute the aggregate values and percentage changes for the timespan of interest.

Note: The column "Total Exposure" seems to contain problematic data. The value for total exposure--the amount covered by the insurers--is much smaller than premiums they received in a particular zipcode, which doesn't make sense. Therefore, we're removing the data.
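
The kind of sanity check behind this decision can be sketched as follows. The numbers are made up and the ZIP codes are hypothetical. Only the column names `Earned Premium` and `Total Exposure` come from the dataset described above.

```python
import pandas as pd

# Toy rows (made-up values): insured exposure should normally dwarf earned premium.
toy_premiums = pd.DataFrame({
    "Zipcode": [11111, 22222, 33333],
    "Earned Premium": [359_650, 1_194_128, 145_362],
    "Total Exposure": [120_000_000, 50_000, 9_000],
})

# Flag rows where the insured amount is smaller than the premium collected.
suspicious = toy_premiums["Total Exposure"] < toy_premiums["Earned Premium"]
share_suspicious = suspicious.mean()

print(f"Share of suspicious rows: {share_suspicious:.1%}")
```

If a large share of real rows fails a check like this, dropping the column is a safer choice than trying to repair it.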

In [49]:
premiums = pd.read_excel(RAW_DATA_DIR / 'Residential-Property-Coverage-Amounts-Wildfire-Risk-and-Losses.xlsx', sheet_name='Cleaned', header=3)

# removing "Grand total" and "County" rows from the dataset 
premiums = premiums[premiums.Zipcode.apply(type) == int]

# getting columns for claims (n.) and losses ($)
cols_claim = []
cols_losses = []

for col in premiums.columns[6:]:
    if col[-6:] == 'Claims':
        cols_claim.append(col)
    else:
        cols_losses.append(col)

# calculating total numbers of claims and values in losses
premiums['Claims (Fire and Smoke)'] = premiums[cols_claim].sum(axis=1)
premiums['Losses (Fire and Smoke) ($)'] = premiums[cols_losses].sum(axis=1)

# keeping only the columns of interest
cols = ['Zipcode', 'Year', 'Earned Premium', 'Claims (Fire and Smoke)', 'Losses (Fire and Smoke) ($)']

# filtering and renaming columns
premiums = premiums[cols].rename(columns={'Zipcode': 'ZIP Code', 'Earned Premium': 'Earned Premium ($)'})
premiums['ZIP Code'] = premiums['ZIP Code'].astype(str)
premiums.sample(3)

Unnamed: 0,ZIP Code,Year,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($)
39153,90222,2021,359650,3,148857.0
30411,93240,2020,1194128,3,77710.0
40860,95458,2021,145362,0,0.0


In [50]:
# calculating the aggregate values for the timespan
cond1 = premiums['Year'] >= start_year
cond2 = premiums['Year'] <= end_year

premiums_aggs = premiums[cond1 & cond2].groupby(['ZIP Code']).sum().reset_index()
premiums_aggs = premiums_aggs.drop(columns=['Year'])

premiums_aggs.sample(3)

Unnamed: 0,ZIP Code,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($)
1820,95212,28152983,63,4037077.0
2315,95974,308306,0,0.0
1428,94085,11649411,18,985065.0


In [51]:
# filtering years and getting growth
cond1 = premiums['Year'] == start_year
cond2 = premiums['Year'] == end_year

# creating pivot table with start and end years
premiums_pivot = pd.pivot_table(premiums[cond1 | cond2], index='ZIP Code', columns='Year').dropna()
premiums_pivot.columns = [f'{str(s[1])}_{s[0]}' for s in premiums_pivot.columns]
premiums_pivot = premiums_pivot.reset_index()

# calculating growth
premiums_pivot['% Change - Earned Premiums'] = premiums_pivot[['2018_Earned Premium ($)', '2021_Earned Premium ($)']].pct_change(axis=1).iloc[:, 1]
premiums_pivot['% Change - Claims (Fire and Smoke)'] = premiums_pivot[['2018_Claims (Fire and Smoke)', '2021_Claims (Fire and Smoke)']].pct_change(axis=1).iloc[:, 1]
premiums_pivot['% Change - Losses (Fire and Smoke)'] = premiums_pivot[['2018_Losses (Fire and Smoke) ($)', '2021_Losses (Fire and Smoke) ($)']].pct_change(axis=1).iloc[:, 1]

premiums_pivot.replace([np.inf, -np.inf], np.nan, inplace=True)
premiums_pivot.sample(3)

Unnamed: 0,ZIP Code,2018_Claims (Fire and Smoke),2021_Claims (Fire and Smoke),2018_Earned Premium ($),2021_Earned Premium ($),2018_Losses (Fire and Smoke) ($),2021_Losses (Fire and Smoke) ($),% Change - Earned Premiums,% Change - Claims (Fire and Smoke),% Change - Losses (Fire and Smoke)
1786,95669,0.857143,0.142857,102566.9,197950.3,68248.714286,1628.857,0.929963,-0.833333,-0.976134
1910,95945,2.428571,13.714286,1207216.0,2211778.0,261584.714286,2929872.0,0.832132,4.647059,10.20047
19,90020,1.166667,0.285714,660993.3,747966.6,6678.333333,2100.714,0.13158,-0.755102,-0.685443


In [52]:
df1 = premiums_pivot[['ZIP Code', '% Change - Earned Premiums', '% Change - Claims (Fire and Smoke)', '% Change - Losses (Fire and Smoke)']].copy()
df2 = premiums_aggs.copy()

premiums_reworked = pd.merge(df2, df1, on='ZIP Code')
premiums_reworked.set_index('ZIP Code', inplace=True)

premiums_reworked.sample(3)

ZIP Code,Earned Premium ($),Claims (Fire and Smoke),Losses (Fire and Smoke) ($),% Change - Earned Premiums,% Change - Claims (Fire and Smoke),% Change - Losses (Fire and Smoke)
91906,4553552,10,662467.0,0.408847,-1.0,-1.0
93011,15810,0,0.0,1.079944,,
96133,233782,1,250.0,0.406558,,


In [53]:
# TODO: move this one to feature engineering section

# # ratio between losses and premium
# premiums_reworked['ratio_losses_to_premium'] = premiums_reworked['fire_smoke_losses'] / premiums_reworked['earned_premium']

### <a id='toc1_2_3_'></a>[FAIR Plan (2022)](#toc0_)

In [88]:
fair22 = pd.read_excel(RAW_DATA_DIR / 'full_residential_units_insured_2022.xlsx')

cols = ['ZIP Code', "Voluntary Market Units", "FAIR Plan Units"]
fair22 = fair22[cols]

# calculate percentages
fair22['Total Res Units'] = fair22['Voluntary Market Units'] + fair22['FAIR Plan Units']
fair22['% Market Units'] = fair22['Voluntary Market Units'] / fair22['Total Res Units']
fair22['% FAIR Plan Units'] = fair22['FAIR Plan Units'] / fair22['Total Res Units']

fair22.head(2)

Unnamed: 0,ZIP Code,Voluntary Market Units,FAIR Plan Units,Total Res Units,% Market Units,% FAIR Plan Units
0,90001,6913,2104,9017,0.766663,0.233337
1,90002,6534,1330,7864,0.830875,0.169125


Besides the 2022 dataset, California also published general information about total exposure covered by FAIR Plan policies, which we're incorporating below. It's potentially a secondary target variable.
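
When combining the two FAIR Plan sources, duplicated ZIP codes in either table would silently multiply rows. The sketch below, on made-up values, shows one way to guard against that with `drop_duplicates` and the `validate` argument of `DataFrame.merge`. It is an illustration under these assumptions, not the cleaning step used in the cells that follow.

```python
import pandas as pd

# Toy tables: unit counts for two ZIP codes and an exposure table with one duplicate.
# The exposure amounts are made up.
fair22_toy = pd.DataFrame({
    "ZIP Code": [90001, 90002],
    "FAIR Plan Units": [2104, 1330],
})
exposure_toy = pd.DataFrame({
    "ZIP Code": [90001, 90002, 90002],
    "Total Exposure ($)": [1.0e6, 2.0e6, 2.0e6],
})

# Keep the first record per ZIP code instead of dropping a hard-coded row index.
exposure_dedup = exposure_toy.drop_duplicates(subset="ZIP Code", keep="first")

# validate="one_to_one" raises MergeError if either side still has duplicate keys.
merged = fair22_toy.merge(exposure_dedup, on="ZIP Code", how="inner", validate="one_to_one")
print(merged)
```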

In [145]:
columns_exp = ['ZIP Code', 
               'growth_exp_23_24', 'exposure_24',
               'growth_exp_22_23', 'exposure_23',
               'growth_exp_21_22', 'Total Exposure ($)',
               'growth_exp_20_21', 'exposure_21',
               'exposure_20']

fair_exp = pd.read_excel(RAW_DATA_DIR / 'CFP5yearTIVGrowthbyzipcodethrough09302024(Residentialline)20241112v001_unlocked.xlsx', names=columns_exp)

# removing rows that don't contain actual data (totals, etc.)
from pandas.api.types import is_integer
fair_exp = fair_exp[fair_exp['ZIP Code'].apply(is_integer)].copy()

# cleaning up exposure data: values that cannot be parsed as numbers become NaN
def clean_exposure(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan
    
fair_exp['Total Exposure ($)'] = fair_exp['Total Exposure ($)'].map(clean_exposure)
fair_exp.drop(index=[1670], inplace=True)  # removing duplicated data

fair_exp.sample(3)

Unnamed: 0,ZIP Code,growth_exp_23_24,exposure_24,growth_exp_22_23,exposure_23,growth_exp_21_22,Total Exposure ($),growth_exp_20_21,exposure_21,exposure_20
163,95684,0.175,537990319,0.29,457852399,0.052,354914482.0,0.422,337276242,237229348
299,93238,0.701,126870934,0.896,74576474,0.417,39342221.0,0.701,27755891,16316702
411,90040,0.135,62555524,0.039,55113081,0.063,53028728.0,0.054,49869166,47313153


In [146]:
# merging exposure column
fair = pd.merge(fair22, fair_exp[['ZIP Code', 'Total Exposure ($)']], on='ZIP Code')
fair['ZIP Code'] = fair['ZIP Code'].astype(str)
fair.set_index('ZIP Code', inplace=True)

In [147]:
fair.sample(3)

ZIP Code,Voluntary Market Units,FAIR Plan Units,Total Res Units,% Market Units,% FAIR Plan Units,Total Exposure ($)
91605,7114,298,7412,0.959795,0.040205,125624573.0
95123,14175,7,14182,0.999506,0.000494,4007959.0
95358,6507,10,6517,0.998466,0.001534,3475629.0


## <a id='toc1_3_'></a>[Zillow Data (Housing Value Index)](#toc0_)

Zillow Housing Value Index (ZHVI), overall, represents the “typical” home value for a region. It’s calculated as a weighted average of the middle third of homes in a given region--therefore, it reflects the typical value for homes in the 35th to 65th percentile range.

The base dataset can be found here (https://www.zillow.com/research/data/) with the Data Type: "ZHVI All Homes (SFR, Condo/Co-Op) Time Series, Smoothed, Seasonally Adjusted($)" and "Zip Code" for "Geography." Additional information about ZHVI is available [here](https://www.zillow.com/research/zhvi-user-guide/).

In [58]:
zillow = pd.read_csv(RAW_DATA_DIR / "Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month_adjusted.csv", date_format='%Y-%m-%d', parse_dates=[2,3,4,5,6,])

# Filter rows where the STATE column is 'CA'
zillow = zillow[zillow['State'] == 'CA']

# Filter the house value for Dec 31st of each year
non_date_cols = ['RegionName', 'State']
date_cols = pd.date_range(start='2015-12-31', periods=10, freq='YE').strftime('%m/%d/%Y').tolist()
columns_to_keep = non_date_cols + date_cols

# Keep only those columns (skip missing ones to avoid KeyError)
existing_cols = [col for col in columns_to_keep if col in zillow.columns]
zillow = zillow[existing_cols]

# Rename columns to have a consistent format
zillow.columns = non_date_cols + zillow[date_cols].columns.str.slice(-4).tolist()

# Dropping columns we aren't interested in
cols_to_drop = ['State', '2015', '2016', '2017', '2022', '2023', '2024']
zillow.drop(columns=cols_to_drop, inplace=True)

# Renaming columns for clarity
zillow.rename(columns={'RegionName': 'ZIP Code',
                          '2018': 'Zillow Home Value 2018 ($)',
                          '2019': 'Zillow Home Value 2019 ($)',
                          '2020': 'Zillow Home Value 2020 ($)',
                          '2021': 'Zillow Home Value 2021 ($)'}, inplace=True)
                          
zillow['ZIP Code'] = zillow['ZIP Code'].astype(str)
zillow.dropna(inplace=True)
zillow.set_index('ZIP Code', inplace=True)

zillow.head()

ZIP Code,Zillow Home Value 2018 ($),Zillow Home Value 2019 ($),Zillow Home Value 2020 ($),Zillow Home Value 2021 ($)
90011,443282.5808,459312.416,506920.126,560746.111
90650,513295.6656,526093.8573,581498.8551,660390.8548
91331,504148.1256,516199.8496,576276.3692,643635.0577
90044,474820.6675,498004.119,549195.7913,618134.0289
92336,452715.2279,465611.3794,508069.8122,623341.2874


In [59]:
# calculating mean and change over time
zillow['Zillow Mean Home Value ($)'] = zillow[['Zillow Home Value 2018 ($)', 'Zillow Home Value 2019 ($)', 'Zillow Home Value 2020 ($)', 'Zillow Home Value 2021 ($)']].mean(axis=1)
zillow['% Change - Zillow Home Value'] = zillow[['Zillow Home Value 2018 ($)', 'Zillow Home Value 2021 ($)']].pct_change(axis=1).iloc[:, 1]


zillow.sample(3)


ZIP Code,Zillow Home Value 2018 ($),Zillow Home Value 2019 ($),Zillow Home Value 2020 ($),Zillow Home Value 2021 ($),Zillow Mean Home Value ($),% Change - Zillow Home Value
95605,306879.7,318819.8,352920.8,400438.1,344764.6,0.30487
93616,256489.8,277089.3,290331.3,329071.0,288245.3,0.282979
94305,2680949.0,2509053.0,2615213.0,2951873.0,2689272.0,0.101055


## <a id='toc1_4_'></a>[Disaster Data](#toc0_)

Tiana TODO: can you add a brief description about this dataset?


Leo: I'm starting from [cleaned_climate_disasters.csv](../data/cleaned_climate_disasters.csv). Tiana --> do you have the code that produces this file?

*** IMPORTANT NOTE: Disaster data is published at the county level. Therefore, we assume that every ZIP code in a given county was affected by the disaster.
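
The county-to-ZIP assumption amounts to a join between county-level events and a ZIP-to-county crosswalk. The sketch below uses made-up counties, ZIP codes, and events. The table names `county_disasters` and `zip_to_county` are hypothetical and are not the inputs used to build `cleaned_climate_disasters.csv`.

```python
import pandas as pd

# Hypothetical county-level events.
county_disasters = pd.DataFrame({
    "County": ["Alpha", "Beta"],
    "DATE": [2020, 2021],
    "DISASTER": ["Fire", "Storm"],
})

# Hypothetical crosswalk listing the county of each ZIP code.
zip_to_county = pd.DataFrame({
    "ZIP": ["00001", "00002", "00003"],
    "County": ["Alpha", "Alpha", "Beta"],
})

# Every ZIP code inherits all events recorded for its county.
zip_disasters = zip_to_county.merge(county_disasters, on="County", how="inner")
zip_disasters = zip_disasters[["ZIP", "DATE", "DISASTER"]]
print(zip_disasters)
```

One consequence of this design is that ZIP codes in the same county get identical disaster counts, so the feature varies only across counties.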

In [60]:
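# NOTE: DATA_DIR is not defined in the cells above; it must point to the folder that holds this file
disasters = pd.read_csv(DATA_DIR / 'cleaned_climate_disasters.csv')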
disasters.sample(3)

Unnamed: 0,ZIP,DATE,DISASTER
38403,93648,1995,Storm
43043,91723,1992,Other
37940,92586,1995,Storm


In [61]:
# trailing 1-, 3-, 5-, and 10-year windows ending in 2021 (2022 itself is excluded)
range1 = list(range(2022-1, 2022))
range3 = list(range(2022-3, 2022))
range5 = list(range(2022-5, 2022))
range10 = list(range(2022-10, 2022))

In [62]:
all_disasters = pd.pivot_table(disasters, index='ZIP', columns='DATE', aggfunc='count', fill_value=0)
all_disasters.columns = [x[1] for x in all_disasters.columns]

all_disasters['All Disasters 1y'] = all_disasters[range1].sum(axis=1)
all_disasters['All Disasters 3y'] = all_disasters[range3].sum(axis=1)
all_disasters['All Disasters 5y'] = all_disasters[range5].sum(axis=1)
all_disasters['All Disasters 10y'] = all_disasters[range10].sum(axis=1)

all_disasters.iloc[:, -4:].sample(5)

ZIP,All Disasters 1y,All Disasters 3y,All Disasters 5y,All Disasters 10y
94552,0,2,2,3
93427,2,5,9,11
96063,2,5,6,8
91962,0,2,2,3
95490,3,7,10,15


In [63]:
fire_disasters = disasters[disasters['DISASTER'] == 'Fire'].copy()
fire_disasters = pd.pivot_table(fire_disasters, index='ZIP', columns='DATE', aggfunc='count', fill_value=0)

fire_disasters.columns = [x[1] for x in fire_disasters.columns]

fire_disasters['Fire Disasters 1y'] = fire_disasters[range1].sum(axis=1)
fire_disasters['Fire Disasters 3y'] = fire_disasters[range3].sum(axis=1)
fire_disasters['Fire Disasters 5y'] = fire_disasters[range5].sum(axis=1)
fire_disasters['Fire Disasters 10y'] = fire_disasters[range10].sum(axis=1)

fire_disasters.iloc[:, -4:].sample(5)

ZIP,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y
93524,1,3,3,5
95042,0,2,2,3
94167,0,2,2,3
95834,0,2,2,3
92286,0,2,2,3


In [64]:
climate_disasters = pd.concat([all_disasters.iloc[:, -4:], fire_disasters.iloc[:, -4:]], axis=1)
climate_disasters.index.name = 'ZIP Code'
climate_disasters.index = climate_disasters.index.astype(str)

climate_disasters.sample(3)

ZIP Code,All Disasters 1y,All Disasters 3y,All Disasters 5y,All Disasters 10y,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y
91612,0,2,2,3,0,2,2,3
92501,0,4,7,11,0,3,5,8
95320,1,3,3,4,0,2,2,3


## Consolidating data

In [None]:
dfs = [housing, renewals_reworked, premiums_reworked, fair, zillow, climate_disasters, white_pop, median_income]

In [157]:
concat = pd.concat(dfs, axis=1)
concat.sample(10)

ZIP Code,Housing Units,Median Gross Rent ($),Median Owner Cost ($),Median Home Value - Census ($),New Policies,Renewed Policies,Nonrenewed Policies (by Owner),Nonrenewed Policies (by Company),Nonrenewed Policies,Expiring Policies,...,All Disasters 5y,All Disasters 10y,Fire Disasters 1y,Fire Disasters 3y,Fire Disasters 5y,Fire Disasters 10y,Avg % White-only Pop,% Change White-only Pop,Avg Median Income,% Change Median Income
95543,,,,,133.0,971.0,71.0,50.0,121.0,1092.0,...,5.0,7.0,0.0,3.0,3.0,4.0,72.271429,-9.7,39120.285714,-0.094353
90607,,,,,,,,,,,...,11.0,17.0,0.0,5.0,8.0,11.0,45.857143,-16.4,,
91921,,,,,,,,,,,...,2.0,3.0,0.0,2.0,2.0,3.0,64.514286,-17.8,,
95666,3822.0,875.0,586.0,288400.0,2317.0,9178.0,1141.0,1090.0,2231.0,11409.0,...,9.0,12.0,1.0,3.0,3.0,6.0,83.642857,-9.2,66873.714286,0.528561
91716,,,,,,,,,,,...,2.0,3.0,0.0,2.0,2.0,3.0,45.857143,-16.4,,
95843,15229.0,1779.0,584.0,361900.0,6905.0,42386.0,5687.0,860.0,6547.0,48933.0,...,5.0,6.0,0.0,2.0,2.0,3.0,53.957143,-11.5,83578.142857,0.333779
91762,19099.0,1490.0,590.0,437700.0,7095.0,39239.0,4321.0,1008.0,5329.0,44568.0,...,9.0,12.0,0.0,3.0,3.0,5.0,53.542857,-22.8,64950.857143,0.426976
93218,196.0,1167.0,338.0,134500.0,87.0,480.0,66.0,41.0,107.0,587.0,...,8.0,9.0,2.0,4.0,5.0,6.0,64.1,-36.3,43070.0,2.291426
92187,,,,,,,,,,,...,2.0,3.0,0.0,2.0,2.0,3.0,64.514286,-17.8,,
93067,,,,,194.0,1296.0,129.0,32.0,161.0,1457.0,...,9.0,11.0,1.0,3.0,6.0,7.0,68.928571,-20.4,,
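
The many missing values in the sample above are expected: `pd.concat(..., axis=1)` defaults to an outer join, so the result keeps every ZIP code that appears in at least one source. The sketch below, on made-up frames with hypothetical ZIP codes, contrasts the outer and inner joins and shows one way to measure per-column coverage.

```python
import pandas as pd

# Toy frames (made-up values) indexed by hypothetical ZIP codes.
a = pd.DataFrame({"Housing Units": [100, 200]}, index=["00001", "00002"])
b = pd.DataFrame({"New Policies": [5, 7]}, index=["00002", "00003"])

outer = pd.concat([a, b], axis=1)                # default join="outer": union of indexes
inner = pd.concat([a, b], axis=1, join="inner")  # only ZIP codes present in both frames

# Fraction of non-missing values per column in the outer result.
coverage = outer.notna().mean()

print(len(outer), len(inner))
print(coverage)
```

A coverage table like this can help decide whether to keep the outer join and impute later, or to restrict the base dataset to ZIP codes present in all sources.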


In [166]:
for i in concat.sort_index().index:
    print(i)

90001
90002
90003
90004
90005
...

# <a id='toc2_'></a>[OLD CODE](#toc0_)

## <a id='toc2_1_'></a>[FAIR Plan 2 (2020-2024)](#toc0_)

This dataset contains FAIR Plan information for multiple years (2020-24) as well as information about the total exposure. However, it doesn't include data about the total market policies like the previous dataset.

Also, there are data for residential and commercial policies, but this notebook only deals with residential ones.

In [None]:
# Policies
columns_pol = ['ZIP Code', 
               'growth_pol_23_24', 'policies_24',
               'growth_pol_22_23', 'policies_23',
               'growth_pol_21_22', 'policies_22',
               'growth_pol_20_21', 'policies_21',
               'policies_20']
fair2_pol = pd.read_excel(RAW_DATA_DIR / 'CFP5yearPIFGrowthbyzipcodethrough09302024(Residential+line)20241112v001_unlocked.xlsx', names=columns_pol)

# Exposure
columns_exp = ['ZIP Code', 
               'growth_exp_23_24', 'exposure_24',
               'growth_exp_22_23', 'exposure_23',
               'growth_exp_21_22', 'exposure_22',
               'growth_exp_20_21', 'exposure_21',
               'exposure_20']
fair2_exp = pd.read_excel(RAW_DATA_DIR / 'CFP5yearTIVGrowthbyzipcodethrough09302024(Residentialline)20241112v001_unlocked.xlsx', names=columns_exp)

In [81]:
# removing rows that don't contain actual data (totals, etc.)
from pandas.api.types import is_integer, is_number

fair2_pol = fair2_pol[fair2_pol['ZIP Code'].apply(is_integer)].copy()
fair2_exp = fair2_exp[fair2_exp['ZIP Code'].apply(is_integer)].copy()

fair2_pol.shape, fair2_exp.shape

((1647, 10), (1647, 10))
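
To make the filter above concrete, here is a minimal sketch on a toy frame. The rows and values are made up for illustration; only `is_integer` and the column names come from the notebook. It shows that cells holding text or missing values fail the `is_integer` check and are dropped.

```python
import pandas as pd
from pandas.api.types import is_integer

# Toy stand-in for the raw sheet (hypothetical rows): two real ZIP rows
# plus the kind of residue a spreadsheet export tends to carry.
raw = pd.DataFrame({
    "ZIP Code": [94501, 94502, "Total", None, "ZIP Code"],
    "policies_24": [104, 7, 111, None, "Policies"],
})

# Keep only rows whose ZIP Code cell is an actual integer scalar.
mask = raw["ZIP Code"].apply(is_integer)
clean = raw[mask].copy()

print(clean["ZIP Code"].tolist())
```

Note that this check only works while the column holds raw Python or NumPy integers; ZIP codes read in as strings would all be filtered out.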

In [82]:
fair2_exp.head(3)

Unnamed: 0,ZIP Code,growth_exp_23_24,exposure_24,growth_exp_22_23,exposure_23,growth_exp_21_22,exposure_22,growth_exp_20_21,exposure_21,exposure_20
2,94501,0.676,98431342,0.179,58719416,0.137,49797731,0.091,43791971,40143917
3,94502,13.274,6880050,0.025,481983,1.85,470279,0.0,165000,165000
4,94536,1.995,40642190,0.852,13571624,0.307,7327808,0.264,5605823,4435256


In [83]:
fair2_pol.head(3)

Unnamed: 0,ZIP Code,growth_pol_23_24,policies_24,growth_pol_22_23,policies_23,growth_pol_21_22,policies_22,growth_pol_20_21,policies_21,policies_20
2,94501,0.333,104,0.04,78,-0.063,75,0.0,80,80
3,94502,2.5,7,0.0,2,1.0,2,0.0,1,1
4,94536,2.105,59,0.727,19,0.222,11,-0.1,9,10


In [84]:
# merging datasets
fair2 = pd.merge(fair2_pol, fair2_exp, on='ZIP Code')
fair2.head()

Unnamed: 0,ZIP Code,growth_pol_23_24,policies_24,growth_pol_22_23,policies_23,growth_pol_21_22,policies_22,growth_pol_20_21,policies_21,policies_20,growth_exp_23_24,exposure_24,growth_exp_22_23,exposure_23,growth_exp_21_22,exposure_22,growth_exp_20_21,exposure_21,exposure_20
0,94501,0.333,104,0.04,78,-0.063,75,0.0,80,80,0.676,98431342,0.179,58719416,0.137,49797731,0.091,43791971,40143917
1,94502,2.5,7,0.0,2,1.0,2,0.0,1,1,13.274,6880050,0.025,481983,1.85,470279,0.0,165000,165000
2,94536,2.105,59,0.727,19,0.222,11,-0.1,9,10,1.995,40642190,0.852,13571624,0.307,7327808,0.264,5605823,4435256
3,94538,1.4,24,0.667,10,0.2,6,0.25,5,4,1.445,15574256,1.647,6370385,0.771,2406677,0.352,1358996,1004964
4,94539,2.471,59,1.125,17,1.667,8,-0.25,3,4,1.934,79814473,0.983,27207162,2.696,13722261,-0.094,3712311,4096084
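
The two inputs have 1647 rows each, while `fair2.info()` reports 1649 rows after the merge. One possible explanation, which the source does not confirm, is a ZIP code that appears more than once in both tables. The sketch below uses made-up toy frames to show how duplicated keys inflate an inner merge, and how the `validate` argument of `pd.merge` can flag the problem.

```python
import pandas as pd

# Toy frames (hypothetical): ZIP 94502 is duplicated in both inputs.
pol = pd.DataFrame({"ZIP Code": [94501, 94502, 94502],
                    "policies_24": [104, 7, 7]})
exp = pd.DataFrame({"ZIP Code": [94501, 94502, 94502],
                    "exposure_24": [98431342, 6880050, 6880050]})

# Each duplicated key is matched pairwise: 1 row + (2 x 2) rows = 5 rows.
merged = pd.merge(pol, exp, on="ZIP Code")

# Count repeated keys before merging.
dup_pol = int(pol["ZIP Code"].duplicated().sum())

# validate="one_to_one" raises MergeError when keys are not unique.
try:
    pd.merge(pol, exp, on="ZIP Code", validate="one_to_one")
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False

print(len(merged), dup_pol, unique_keys)
```

If duplicates turn out to be the cause, dropping them with `drop_duplicates(subset='ZIP Code')` before merging would be one option, depending on which row should be kept.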


In [85]:
fair2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1649 entries, 0 to 1648
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ZIP Code          1649 non-null   object
 1   growth_pol_23_24  1649 non-null   object
 2   policies_24       1649 non-null   object
 3   growth_pol_22_23  1649 non-null   object
 4   policies_23       1649 non-null   object
 5   growth_pol_21_22  1649 non-null   object
 6   policies_22       1649 non-null   object
 7   growth_pol_20_21  1649 non-null   object
 8   policies_21       1649 non-null   object
 9   policies_20       1649 non-null   object
 10  growth_exp_23_24  1649 non-null   object
 11  exposure_24       1649 non-null   object
 12  growth_exp_22_23  1649 non-null   object
 13  exposure_23       1649 non-null   object
 14  growth_exp_21_22  1649 non-null   object
 15  exposure_22       1649 non-null   object
 16  growth_exp_20_21  1649 non-null   object
 17  exposure_21   

In [86]:
# replace non-numeric cells (e.g. placeholder strings) with 0
# note: NaN counts as a number for `is_number`, so it is left unchanged
def clean_non_ints(val):
    return val if is_number(val) else 0

fair2 = fair2.map(clean_non_ints)
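
One caveat with the cleaning step: `is_number` returns `True` for `NaN`, so missing values pass through untouched. The following is a minimal sketch of a stricter variant. The helper name `to_number_or_zero` and the toy values are assumptions for illustration, not part of the original notebook. It uses `Series.map` column by column, which also works on pandas versions that predate `DataFrame.map`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_number

def to_number_or_zero(val):
    """Hypothetical helper: keep non-missing numbers, map everything else to 0."""
    return val if is_number(val) and not pd.isna(val) else 0

# Toy frame with placeholder strings and a missing value (made-up data).
toy = pd.DataFrame({
    "growth_pol_23_24": [0.333, "N/A", np.nan],
    "policies_24": [104, "-", 7],
})

# Apply the helper element-wise, then force a numeric dtype.
cleaned = toy.apply(lambda col: col.map(to_number_or_zero)).astype(float)

print(cleaned)
```

Whether a missing growth rate should really become 0 is a modelling choice; keeping `NaN` may be safer if zeros would be read as "no growth".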


In [None]:
fair2.to_csv(CLEAN_DATA_DIR / 'fair_plan.csv')

## <a id='toc2_2_'></a>[FEMA Projected Premium Increases (2021, 2025)](#toc0_)

FEMA created a methodology (Risk Rating 2.0) to project monthly premium changes and publishes ZIP-code-level data based on it. Each column covers a $10 increment and holds the number of policies projected to change by that amount. The data is available both for all policies and for single-family homes only.

https://www.fema.gov/flood-insurance/risk-rating/profiles

In [None]:
fema = pd.read_excel(RAW_DATA_DIR / 'fema_risk-rating-zip-breakdown-california_2021.xlsx', header=3, sheet_name='SFH Zip Count')

# drop the State column and the last column
# (the Grand Total row is still present here; it is filtered out in a later cell)
fema.drop(columns='State', inplace=True)
fema = fema[fema.columns[:-1]]

fema.tail()

Unnamed: 0,Zip Code,< -$100,$-100 to $-90,$-90 to $-80,$-80 to $-70,$-70 to $-60,$-60 to $-50,$-50 to $-40,$-40 to $-30,$-30 to $-20,...,$10 to $20,$20 to $30,$30 to $40,$40 to $50,$50 to $60,$60 to $70,$70 to $80,$80 to $90,$90 to $100,> $100
1456,96161,6.0,1.0,,,2.0,,,1.0,,...,7.0,,,,,,,,,
1457,CA Total of ZIPs w/ <5 Policies,43.0,4.0,4.0,5.0,4.0,7.0,4.0,5.0,5.0,...,29.0,7.0,3.0,4.0,,,,,,
1458,CA Unknown ZIP,,,,,,,1.0,,1.0,...,,1.0,1.0,,,,,,,
1459,00052 <5 Policies,,,,,,,,,,...,,,,,,,,,,
1460,,7464.0,814.0,934.0,1277.0,1519.0,1835.0,1977.0,2094.0,2372.0,...,10326.0,3132.0,611.0,109.0,46.0,16.0,1.0,7.0,3.0,8.0


In [104]:
# removing totals, zipcodes with fewer than 5 policies, and other non-zipcode-level rows
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
fema = fema[fema['Zip Code'].str.len() == 5].copy()

# replace NaN with 0 and force int (instead of float)
fema[fema.columns[1:]] = fema[fema.columns[1:]].fillna(0).apply(pd.to_numeric).astype(int)

fema

Unnamed: 0,Zip Code,< -$100,$-100 to $-90,$-90 to $-80,$-80 to $-70,$-70 to $-60,$-60 to $-50,$-50 to $-40,$-40 to $-30,$-30 to $-20,...,$10 to $20,$20 to $30,$30 to $40,$40 to $50,$50 to $60,$60 to $70,$70 to $80,$80 to $90,$90 to $100,> $100
2,90004,1,0,0,0,2,0,0,0,1,...,2,0,0,0,0,0,0,0,0,0
3,90005,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
4,90006,0,0,0,0,0,0,0,0,0,...,7,0,0,0,0,0,0,0,0,0
5,90007,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,90008,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1452,96145,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
1453,96146,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1454,96148,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1455,96150,10,2,0,1,1,0,1,1,1,...,8,7,0,0,0,0,0,0,0,0
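
The length-based filter above keeps any 5-character label. As a hedged alternative, the sketch below compares it with a stricter digits-only match. The labels are taken from the `tail()` output, except `"ABCDE"`, which is a made-up value added to show the difference.

```python
import numpy as np
import pandas as pd

# Labels as they appear in the 'Zip Code' column, plus one hypothetical
# 5-character non-ZIP value ("ABCDE") and a missing label (total row).
labels = pd.Series([
    "96161",
    "CA Total of ZIPs w/ <5 Policies",
    "CA Unknown ZIP",
    "00052 <5 Policies",
    np.nan,
    "ABCDE",
])

# Filter used in the notebook: any string of length 5 passes.
by_len = labels[labels.str.len() == 5]

# Stricter variant: exactly five digits; missing labels count as no match.
by_digits = labels[labels.str.fullmatch(r"\d{5}", na=False)]

print(by_len.tolist())
print(by_digits.tolist())
```

For this dataset both filters may well give the same result; the regex version simply guards against unexpected 5-character labels.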


The original data is binned in very narrow buckets ($10 steps), which makes the information hard to digest. I'll collapse them into only 6 buckets. The bucket labels remain monthly amounts; multiply them by 12 to get the annual change.

In [105]:
# creating the bins
bin50_100_minus = fema.columns[2:7]
bin0_50_minus = fema.columns[7:12]
bin0_50 = fema.columns[12:17]
bin50_100 = fema.columns[17:22]


new_df = {'ZIP Code': fema['Zip Code'],
          '< -$100': fema['< -$100'],
          '-$100 to -$50': fema[bin50_100_minus].sum(axis=1),
          '-$50 to -$0': fema[bin0_50_minus].sum(axis=1),
          '$0 to $50': fema[bin0_50].sum(axis=1),
          '$50 to $100': fema[bin50_100].sum(axis=1),
          '> $100': fema['> $100']
}

projs = pd.DataFrame(new_df)
projs.sample(5)

Unnamed: 0,ZIP Code,< -$100,-$100 to -$50,-$50 to -$0,$0 to $50,$50 to $100,> $100
1310,95825,6,6,371,434,0,0
612,93221,3,2,5,20,0,0
110,90621,0,0,0,29,0,0
1115,95370,2,0,1,31,0,0
1377,96001,11,9,17,195,0,0
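
As a sanity check on the bucketing idea, here is a small sketch on a toy table with six narrow bins instead of the full FEMA layout. The labels and counts are hypothetical. It collapses the bins by position, verifies that no policies are lost, and converts the monthly bounds to annual ones.

```python
import pandas as pd

# Toy table with six narrow monthly bins (made-up labels and counts).
fine = pd.DataFrame({
    "Zip Code": ["90004", "90005"],
    "$-30 to $-20": [1, 0],
    "$-20 to $-10": [2, 0],
    "$-10 to $0": [3, 1],
    "$0 to $10": [4, 2],
    "$10 to $20": [5, 0],
    "$20 to $30": [6, 3],
})
counts = fine.set_index("Zip Code")

# Collapse narrow bins into wide ones by column position.
groups = {
    "-$30 to $0": counts.columns[0:3],
    "$0 to $30": counts.columns[3:6],
}
coarse = pd.DataFrame({label: counts[cols].sum(axis=1)
                       for label, cols in groups.items()})

# Collapsing must preserve the number of policies per ZIP code.
totals_match = bool((coarse.sum(axis=1) == counts.sum(axis=1)).all())

# Monthly bounds times 12 give the annual change represented by each bucket.
monthly_bounds = {"-$30 to $0": (-30, 0), "$0 to $30": (0, 30)}
annual_bounds = {label: (low * 12, high * 12)
                 for label, (low, high) in monthly_bounds.items()}

print(coarse)
print(annual_bounds)
```

The same totals check can be run on the real table to confirm that the positional slices cover every narrow bin exactly once.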


In [106]:
projs['n_decrease'] = projs[projs.columns[1:4]].sum(axis=1)
projs['n_increase'] = projs[projs.columns[4:7]].sum(axis=1)
# ratio of decreases to increases (inf or NaN when a ZIP code has no increases)
projs['ratio_dec_to_inc'] = projs['n_decrease'] / projs['n_increase']
projs.sample(3)

Unnamed: 0,ZIP Code,< -$100,-$100 to -$50,-$50 to -$0,$0 to $50,$50 to $100,> $100,n_decrease,n_increase,ratio_dec_to_inc
1204,95565,1,0,2,3,0,0,3,3,1.0
752,93657,13,23,24,111,0,0,60,111,0.540541
97,90405,0,0,0,55,0,0,0,55,0.0
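
The ratio in the last column is the number of decreases divided by the number of increases, so it is undefined for ZIP codes with no projected increases. The sketch below reuses the three sampled rows plus one made-up row with zero increases. It shows one way to keep the ratio finite and a bounded alternative, `share_increase`, which is an assumption of this example and not a column from the notebook.

```python
import numpy as np
import pandas as pd

# First three rows match the sample above; the last row is hypothetical.
toy = pd.DataFrame({
    "n_decrease": [3, 60, 0, 5],
    "n_increase": [3, 111, 55, 0],
})

# Replace zero denominators with NaN so the ratio is NaN instead of inf.
ratio = toy["n_decrease"] / toy["n_increase"].replace(0, np.nan)

# Bounded alternative: share of policies projected to increase (0 to 1).
share_increase = toy["n_increase"] / (toy["n_decrease"] + toy["n_increase"])

print(ratio.round(6).tolist())
print(share_increase.round(3).tolist())
```

A share is often easier to compare across ZIP codes than a ratio, since it stays between 0 and 1, though it is still undefined when a ZIP code has no policies at all.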


In [107]:
projs.to_csv(CLEAN_DATA_DIR / 'premium_change2021.csv')