In this notebook I calculate network metrics, such as centrality, to use as predictors of the flow between nodes. I also add county population.

## Load the data

I first load the data and split it into the period before the Kincaid fire and the period during it.

In [63]:
import pandas as pd
import numpy as np

kincaid = pd.read_csv('../data/california_jul_nov_2019.csv', engine='pyarrow', date_format='%Y-%m-%d')
kincaid['date'] = pd.to_datetime(kincaid.date)

# split at 2019-10-23, the first day of the fire period
pre_fire = kincaid.query('date < "2019-10-23"')
during_fire = kincaid.query('date >= "2019-10-23"')

# data on wildfires. filter for when the kincaid wildfire was active
wildfires = pd.read_csv('../data/ca_daily_fire_2000_03252022.csv', engine='pyarrow')
wildfires['acq_date'] = pd.to_datetime(wildfires.acq_date)
wildfires = wildfires.query('acq_date >= "2019-10-23" & acq_date <= "2019-11-06"')

I group by origin-destination pairs, and sum up the total flows between them. I then divide by the number of days to get the average daily traffic between these nodes over the time period.

In [2]:
n_days = (pre_fire.date.max() - pre_fire.date.min()).days
pre_fire = pre_fire.groupby(['geoid_o', 'geoid_d']).agg({'visitor_flows':'sum', 'pop_flows':'sum'})
pre_fire = pre_fire.multiply(1/n_days)
pre_fire.reset_index(inplace=True)

# add the latitudes and longitudes of each origin and destination
origin_coords = kincaid.drop_duplicates(['geoid_o'])[['geoid_o', 'lat_o', 'lng_o']]
dest_coords = kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']]
pre_fire = (
    pre_fire.merge(origin_coords, how='left', on='geoid_o')
            .merge(dest_coords, how='left', on='geoid_d')
)

pre_fire.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d
1130,6041,6021,1.769231,45.296703,38.055183,-122.748859,39.598405,-122.392207
1435,6051,6069,0.197802,2.604396,37.93899,-118.886709,36.605706,-121.074996
2606,6095,6031,8.593407,141.527473,38.266919,-121.940007,36.075361,-119.815502
1153,6041,6067,67.373626,1677.098901,38.055183,-122.748859,38.449315,-121.344243
1642,6059,6027,82.153846,1346.516484,33.676913,-117.776146,36.511127,-117.41079


We now have the average daily flows between California counties for July through October 2019, before the Kincaid fire.
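
As a small illustration of the averaging step, the sketch below applies the same group, sum, and divide pattern to a toy table (the `toy` frame and its values are made up). It also shows that `(max - min).days` counts the gaps between the first and last date, which is one less than the number of distinct dates; which divisor is appropriate depends on how the daily average should be defined.

```python
import pandas as pd

# Toy data (made-up values): two origin-destination pairs observed on three dates.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-01', '2019-07-02', '2019-07-03'] * 2),
    'geoid_o': [1, 1, 1, 2, 2, 2],
    'geoid_d': [2, 2, 2, 1, 1, 1],
    'pop_flows': [10.0, 20.0, 30.0, 3.0, 3.0, 3.0],
})

span_days = (toy.date.max() - toy.date.min()).days  # gaps between first and last date
observed_days = toy.date.nunique()                  # distinct dates in the data

# Total flow per origin-destination pair, then two candidate daily averages.
totals = toy.groupby(['geoid_o', 'geoid_d'])['pop_flows'].sum()
avg_by_span = (totals / span_days).reset_index()
avg_by_observed = (totals / observed_days).reset_index()
```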

In [None]:
import plotly.express as px
import plotly.graph_objects as go

# alternative visualization that's not as cool
# fig = go.Figure(go.Scattergeo(
#     lon= wildfires[wildfires.confidence == 100].longitude,
#     lat=wildfires[wildfires.confidence == 100].latitude
# ))
# fig.update_geos(
#     visible=False, resolution=110, scope="usa",
#     showcountries=True, countrycolor="Black",
#     showsubunits=True, subunitcolor="Blue"
# )
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()

fig = px.density_mapbox(wildfires, lat='latitude', lon='longitude', z='confidence', radius=2,
                        center=dict(lat=38.7, lon=-122.73), zoom=7,
                        mapbox_style="stamen-terrain",
                        color_continuous_scale='turbo')
fig.show()

So the wildfire was confined to this specific region. Next, I keep only the origin-destination pairs whose coordinates both lie within a fixed distance of the fire's center.

In [3]:
from geopy.distance import geodesic
import pandas as pd

# Define the center point
center = (38.70369, -122.7669)

# Define the maximum distance (in kilometers) from the center point
max_distance = 150  # kilometers

# Calculate the distance between each observation and the center point
distances_o = pre_fire.apply(lambda row: geodesic(center, (row['lat_o'], row['lng_o'])).kilometers, axis=1)
distances_d = pre_fire.apply(lambda row: geodesic(center, [row['lat_d'], row['lng_d']]).kilometers, axis=1)

# Filter the DataFrame for observations within the maximum distance
pre_fire_filtered = pre_fire[(distances_o <= max_distance) & (distances_d <= max_distance)].reset_index(drop=True)
pre_fire_filtered['pop_flows'] = pd.to_numeric(pre_fire_filtered['pop_flows'], errors='coerce')
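
Calling `geodesic` row by row is fine for a table of this size but gets slow on larger ones. A vectorized great-circle (haversine) distance is a common approximation. The sketch below is not what the cell above uses: it assumes a spherical Earth with a mean radius of 6371 km, and the name `haversine_km` is my own. Its results should differ from the geodesic distance by well under 1%, which matters little for a 150 km cutoff.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Distance from the fire's center point to two county coordinates taken from
# the tables above (geoid 6097 and geoid 6059).
center_lat, center_lng = 38.70369, -122.7669
lats = np.array([38.525293, 33.676913])
lngs = np.array([-122.922545, -117.776146])
dist = haversine_km(center_lat, center_lng, lats, lngs)
within = dist <= 150  # boolean mask, usable like the geodesic filter above
```

On the real data this could be applied as `haversine_km(center[0], center[1], pre_fire['lat_o'], pre_fire['lng_o'])`, which returns one distance per row without a Python-level loop.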

We can calculate eigenvector centralities for counties.

In [4]:
import igraph as ig
graph = ig.Graph.TupleList(pre_fire_filtered[['geoid_o','geoid_d','pop_flows']].itertuples(index=False), directed=True, edge_attrs='pop_flows')

eigen_centralities = graph.eigenvector_centrality(weights='pop_flows')
centrality_df = pd.concat([pd.Series(graph.vs['name'],name='geoid'), pd.Series(eigen_centralities,name='eigen_centrality')], axis=1)
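
To make the metric concrete: eigenvector centrality gives a node a high score when it is connected to other high-scoring nodes. The sketch below computes it by power iteration on a toy weighted, directed matrix. It assumes, as I understand igraph's convention for directed graphs, that a node's score is built from the flows it receives, and it rescales so the top node scores 1. igraph itself relies on ARPACK rather than this loop, so treat this as a description of the quantity, not of igraph's internals. The function name and the toy matrix are made up.

```python
import numpy as np

def eigenvector_centrality(weights, n_iter=1000, tol=1e-10):
    """Power iteration on a weighted, directed adjacency matrix.

    weights[i, j] is the flow from node i to node j, so a node scores
    highly when it receives large flows from other high-scoring nodes.
    """
    w = np.asarray(weights, dtype=float)
    x = np.ones(w.shape[0]) / w.shape[0]
    for _ in range(n_iter):
        x_new = w.T @ x              # in-flows weighted by the source's score
        x_new = x_new / x_new.max()  # rescale so the top node scores 1
        if np.abs(x_new - x).max() < tol:
            break
        x = x_new
    return x_new

# Toy flows between three nodes, including within-node flows on the diagonal.
toy_flows = np.array([
    [5.0, 1.0, 0.0],
    [2.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
])
scores = eigenvector_centrality(toy_flows)
```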


## Getting population data
Let's add population to the dataset. I use the county population estimates published by the U.S. Census Bureau [here](https://www.census.gov/data/tables/time-series/demo/popest/2020s-counties-total.html).

In [5]:
county_pop = (
    pd.read_excel('../data/california_county_populations.xlsx', header=3)
    .rename({'Unnamed: 0': 'county', 2020: 'pop'}, axis=1)
    .dropna()[['county', 'pop']]
    .iloc[1:, :]  # skip the first row so that only county rows remain
    .reset_index(drop=True)
)
# drop the leading character and everything after the first comma
county_pop['county'] = county_pop.county.str[1:].str.split(',').str[0]
county_pop.sample(5)

Unnamed: 0,county,pop
16,Lake County,68195.0
41,Santa Barbara County,448244.0
12,Imperial County,179595.0
20,Marin County,261282.0
19,Madera County,156343.0


First, we need to get the county names of our flows dataset.

In [11]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import json
from geopy.point import Point

geolocator = Nominatim(user_agent="myapplication")

def lat_long_to_location(lat_longs):
    """Reverse-geocode an iterable of (lat, lng) pairs into county names."""
    # wait at least one second between requests to the geocoding service
    geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
    counties = []
    for lat_long in lat_longs:
        _loc = geocode(Point(lat_long))
        if _loc is None:
            counties.append(None)
        else:
            # .get avoids a KeyError when the address has no county field
            counties.append(_loc.raw['address'].get('county'))
    return counties

def create_converter_dict(series, converter, batch=False):
    """Map the first element of each tuple in `series` to its converted value."""
    unique_keys = series.str[0]
    if batch:
        # the converter receives all remaining tuple elements at once
        unique_vals = converter(series.str[1:])
    else:
        unique_vals = unique_keys.apply(converter)
    return dict(zip(unique_keys, unique_vals))

In [None]:
# this is a list of all the lat-long coordinates for the data
cali_county_coordinates = pd.Series(
    pd.concat(
        [
            kincaid.drop_duplicates(['geoid_o'])[['geoid_o','lat_o', 'lng_o']]. \
                rename({'geoid_o':'geoid','lat_o':'lat','lng_o':'lng'}, axis=1),
            kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']]. \
                rename({'geoid_d':'geoid','lat_d':'lat', 'lng_d':'lng'}, axis=1)
                ], axis=0).drop_duplicates().reset_index(drop=True).itertuples(name=None, index=False))

# look up the county of each coordinate pair (slow: at most one request per second)
cali_counties_lat_long_dict = create_converter_dict(cali_county_coordinates, lat_long_to_location, batch=True)

with open('../data/cali_counties_lat_long_dict.json', 'w') as fp:
    json.dump(cali_counties_lat_long_dict, fp)

In [12]:
with open('../data/cali_counties_lat_long_dict.json') as json_file:
    cali_counties_lat_long_dict = json.load(json_file)

In [8]:
pre_fire_filtered[['county_o','county_d']] = pre_fire_filtered[['geoid_o','geoid_d']].apply(lambda x: x.astype(str).map(cali_counties_lat_long_dict))
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d
245,6115,6033,4.098901,64.087912,39.269008,-121.351263,39.099645,-122.753184,Yuba County,Lake County
63,6013,6115,23.142857,396.758242,37.923419,-121.951212,39.269008,-121.351263,Contra Costa County,Yuba County
198,6097,6041,1270.802198,26354.340659,38.525293,-122.922545,38.055183,-122.748859,Sonoma County,Marin County
211,6101,6013,29.21978,476.208791,39.034523,-121.694842,37.923419,-121.951212,Sutter County,Contra Costa County
23,6007,6045,50.43956,987.0,39.666929,-121.600672,39.433624,-123.431554,Butte County,Mendocino County


Now that we've named our counties, we can attach the population estimates for the origin and destination counties.

In [9]:
pre_fire_filtered = (
    pre_fire_filtered
    .merge(county_pop, left_on='county_o', right_on='county', how='left')
    .drop('county', axis=1)
    .merge(county_pop, left_on='county_d', right_on='county', how='left', suffixes=('_o', '_d'))
    .drop('county', axis=1)
)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d
101,6041,6033,10.912088,259.758242,38.055183,-122.748859,39.099645,-122.753184,Marin County,Lake County,261282.0,68195.0
113,6045,6007,13.912088,297.934066,39.433624,-123.431554,39.666929,-121.600672,Mendocino County,Butte County,91361.0,210135.0
204,6097,6097,46404.263736,953758.21978,38.525293,-122.922545,38.525293,-122.922545,Sonoma County,Sonoma County,488281.0,488281.0
203,6097,6095,263.285714,5430.527473,38.525293,-122.922545,38.266919,-121.940007,Sonoma County,Solano County,488281.0,452698.0
242,6115,6011,31.307692,484.582418,39.269008,-121.351263,39.177573,-122.237028,Yuba County,Colusa County,81994.0,21858.0
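
Because both merges are left joins on county names, a name spelled differently in the two sources would silently produce a missing population. A minimal check, sketched here on toy frames (`flows` and `pops` are illustrative names, not objects from this notebook, with populations borrowed from the tables above), is to merge with `indicator=True` and list the keys that found no match:

```python
import pandas as pd

# Toy frames; 'Yolo' is deliberately missing the 'County' suffix.
flows = pd.DataFrame({'county_o': ['Lake County', 'Marin County', 'Yolo']})
pops = pd.DataFrame({
    'county': ['Lake County', 'Marin County', 'Yolo County'],
    'pop': [68195.0, 261282.0, 216291.0],
})

checked = flows.merge(pops, left_on='county_o', right_on='county',
                      how='left', indicator=True)

# Rows flagged 'left_only' had no matching county in the population table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'county_o'].unique().tolist()
```

An empty `unmatched` list would mean every county name found a population row.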


And we can also add the eigenvector centralities.

In [10]:
pre_fire_filtered = (
    pre_fire_filtered
    .merge(centrality_df, how='left', left_on='geoid_o', right_on='geoid')
    .drop('geoid', axis=1)
    .merge(centrality_df, how='left', left_on='geoid_d', right_on='geoid', suffixes=('_o', '_d'))
    .drop('geoid', axis=1)
)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d,eigen_centrality_o,eigen_centrality_d
243,6115,6013,22.626374,348.010989,39.269008,-121.351263,37.923419,-121.951212,Yuba County,Contra Costa County,81994.0,1165986.0,0.000553,0.124888
12,6001,6097,409.153846,7720.769231,37.650547,-121.917885,38.525293,-122.922545,Alameda County,Sonoma County,1680380.0,488281.0,1.0,0.005418
161,6081,6007,25.945055,496.0,37.43621,-122.355663,39.666929,-121.600672,San Mateo County,Butte County,762511.0,210135.0,0.058452,0.001142
68,6021,6021,2164.340659,40949.087912,39.598405,-122.392207,39.598405,-122.392207,Glenn County,Glenn County,28896.0,28896.0,0.000133,0.000133
142,6055,6113,38.417582,820.648352,38.506474,-122.330533,38.686646,-121.90162,Napa County,Yolo County,137485.0,216291.0,0.002045,0.00695


And we can add the Social Connectedness Index between counties. There are cool visualizations [here](https://www.nytimes.com/interactive/2018/09/19/upshot/facebook-county-friendships.html), and I get the data from [here](https://data.humdata.org/dataset/social-connectedness-index).

In [11]:
import pandas as pd
import gc

connectedness = pd.read_csv('../data/county_county.tsv', engine='pyarrow', sep='\t')
pre_fire_filtered = pre_fire_filtered.merge(connectedness, how='left', left_on=['geoid_o', 'geoid_d'], right_on=['user_loc','fr_loc']).drop(['user_loc','fr_loc'], axis=1)

del connectedness; gc.collect()
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d,eigen_centrality_o,eigen_centrality_d,scaled_sci
222,6101,6113,168.340659,2771.274725,39.034523,-121.694842,38.686646,-121.90162,Sutter County,Yolo County,99464.0,216291.0,0.000616,0.00695,89305
158,6067,6113,3492.802198,53656.395604,38.449315,-121.344243,38.686646,-121.90162,Sacramento County,Yolo County,1586465.0,216291.0,0.25575,0.00695,126306
56,6013,6055,328.989011,5599.758242,37.923419,-121.951212,38.506474,-122.330533,Contra Costa County,Napa County,1165986.0,137485.0,0.124888,0.002045,48687
17,6007,6007,23161.604396,436752.241758,39.666929,-121.600672,39.666929,-121.600672,Butte County,Butte County,210135.0,210135.0,0.001142,0.001142,2591079
238,6113,6113,20025.989011,346133.362637,38.686646,-121.90162,38.686646,-121.90162,Yolo County,Yolo County,216291.0,216291.0,0.00695,0.00695,1374296


In [14]:
pre_fire_filtered.county_o.unique()

array(['Alameda County', 'Butte County', 'Colusa County',
       'Contra Costa County', 'Glenn County', 'Lake County',
       'Marin County', 'Mendocino County', 'Napa County',
       'Sacramento County', 'San Mateo County', 'Solano County',
       'Sonoma County', 'Sutter County', 'Yolo County', 'Yuba County'],
      dtype=object)

I then write this to a CSV file so that I can fit a GRM to it in the model section.

In [38]:
pre_fire_filtered.to_csv('../data/clean/pre_kincaid.csv', index=False)