In this notebook I calculate network metrics, such as centrality, to be used as predictors of the flow between nodes. I also add population.

## Load the data

I first load the data and split it into two periods: before the Kincaid fire and during it.

In [16]:
import pandas as pd
import numpy as np

kincaid = pd.read_csv('../data/california_jul_nov_2019.csv', engine='pyarrow', date_format='%Y-%m-%d')
kincaid['date'] = pd.to_datetime(kincaid.date)

# 2019-10-23 is the cutoff between the pre-fire and during-fire periods
pre_fire = kincaid.query('date < "2019-10-23"')
during_fire = kincaid.query('date >= "2019-10-23"')

# Data on wildfires, filtered to the window when the Kincaid fire was active
wildfires = pd.read_csv('../data/ca_daily_fire_2000_03252022.csv', engine='pyarrow')
wildfires['acq_date'] = pd.to_datetime(wildfires.acq_date)
wildfires = wildfires.query('acq_date >= "2019-10-23" & acq_date <= "2019-11-06"')
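
The split above hinges on a single cutoff date. Here is a minimal sketch of the same pattern on made-up data, assuming a `date` column that already holds datetimes. The toy frame and the `cutoff` variable are illustrative only and are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for the flows table (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-21", "2019-10-22", "2019-10-23", "2019-10-24"]),
    "pop_flows": [10, 20, 30, 40],
})

# An explicit Timestamp makes the cutoff unambiguous.
cutoff = pd.Timestamp("2019-10-23")

# `@cutoff` refers to the local Python variable inside the query string.
toy_pre = toy.query("date < @cutoff")
toy_during = toy.query("date >= @cutoff")

print(len(toy_pre), len(toy_during))
```

Every row lands in exactly one of the two frames, because the comparisons `<` and `>=` are complementary.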

I group by origin-destination pairs, and sum up the total flows between them. I then divide by the number of days to get the average daily traffic between these nodes over the time period.

In [17]:
n_days = (pre_fire.date.max() - pre_fire.date.min()).days
pre_fire = pre_fire.groupby(['geoid_o', 'geoid_d']).agg({'visitor_flows':'sum', 'pop_flows':'sum'})
pre_fire = pre_fire.multiply(1/n_days)
pre_fire.reset_index(inplace=True)

# add the latitudes and longitudes of each origin and destination
origins = kincaid.drop_duplicates(['geoid_o'])[['geoid_o', 'lat_o', 'lng_o']]
destinations = kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']]
pre_fire = pre_fire.merge(origins, how='left', on='geoid_o').merge(destinations, how='left', on='geoid_d')

pre_fire.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d
1040,6037,6069,16.945055,309.230769,34.198001,-118.261017,36.605706,-121.074996
218,6009,6009,3256.637363,58651.461538,38.204607,-120.554132,38.204607,-120.554132
644,6023,6063,1.692308,34.0,40.705009,-123.915819,40.004682,-120.838598
314,6011,6089,4.989011,107.538462,39.177573,-122.237028,40.763767,-122.040522
2353,6085,6083,213.527473,4085.813187,37.232484,-121.696279,34.538344,-120.030786
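
The averaging step depends on how `n_days` is counted. The sketch below uses made-up flows and assumes one record per day, which may not match the source data. It shows that `max - min` counts the gaps between the first and last date, which is one fewer than the number of calendar days covered. The toy values and the name `n_days_inclusive` are illustrative only.

```python
import pandas as pd

# Made-up daily flows for two origin-destination pairs.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-01", "2019-07-02", "2019-07-03", "2019-07-01"]),
    "geoid_o": [6097, 6097, 6097, 6055],
    "geoid_d": [6055, 6055, 6055, 6097],
    "pop_flows": [100.0, 200.0, 300.0, 90.0],
})

# Same convention as the notebook: difference between last and first date.
span = (toy["date"].max() - toy["date"].min()).days
# Alternative convention: count both endpoints.
n_days_inclusive = span + 1

totals = toy.groupby(["geoid_o", "geoid_d"], as_index=False)["pop_flows"].sum()
totals["avg_by_span"] = totals["pop_flows"] / span
totals["avg_by_calendar_days"] = totals["pop_flows"] / n_days_inclusive
print(totals)
```

Either convention rescales every pair by the same constant, so the relative sizes of the flows are unchanged. Only the absolute level of the averages differs.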


We now have the average daily flows between California counties for the pre-fire period, from July to October 2019.

In [18]:
import plotly.express as px
import plotly.graph_objects as go

# alternative visualization that's not as cool
# fig = go.Figure(go.Scattergeo(
#     lon= wildfires[wildfires.confidence == 100].longitude,
#     lat=wildfires[wildfires.confidence == 100].latitude
# ))
# fig.update_geos(
#     visible=False, resolution=110, scope="usa",
#     showcountries=True, countrycolor="Black",
#     showsubunits=True, subunitcolor="Blue"
# )
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()

fig = px.density_mapbox(wildfires, lat='latitude', lon='longitude', z='confidence', radius=2,
                        center=dict(lat=38.7, lon=-122.73), zoom=7,
                        mapbox_style="stamen-terrain",
                        color_continuous_scale='turbo')
fig.show()

So the wildfire was concentrated in this specific region. Next, I keep only the flows whose origin and destination coordinates both lie within a fixed distance of the fire's center.

In [24]:
from geopy.distance import geodesic
import pandas as pd

# Define the center point
center = (38.70369, -122.7669)

# Define the maximum distance (in kilometers) from the center point
max_distance = 150  # 150 kilometers

# Calculate the distance between each observation and the center point
distances_o = pre_fire.apply(lambda row: geodesic(center, (row['lat_o'], row['lng_o'])).kilometers, axis=1)
distances_d = pre_fire.apply(lambda row: geodesic(center, (row['lat_d'], row['lng_d'])).kilometers, axis=1)

# Filter the DataFrame for observations within the maximum distance
pre_fire_filtered = pre_fire[(distances_o <= max_distance) & (distances_d <= max_distance)].reset_index(drop=True)
pre_fire_filtered['pop_flows'] = pd.to_numeric(pre_fire_filtered['pop_flows'], errors='coerce')
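
The row-wise `apply` above calls `geodesic` once per row. As a rough cross-check, the same radius filter can be sketched with the haversine formula, which is a different technique: it treats the Earth as a sphere, whereas `geodesic` uses an ellipsoid, so the two can differ by a fraction of a percent. The helper `haversine_km` and the constant `EARTH_RADIUS_KM` are my own names, not part of geopy. The two test points are the centroids of geoid 6097 and geoid 6037 from the sample rows above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Same center point as in the cell above.
center_lat, center_lon = 38.70369, -122.7669

# Centroids of geoid 6097 and geoid 6037, copied from the sample output.
lats = np.array([38.525293, 34.198001])
lons = np.array([-122.922545, -118.261017])

dist = haversine_km(center_lat, center_lon, lats, lons)
within = dist <= 150  # boolean mask, one entry per point
print(within)
```

Because the function works on whole arrays, it can be applied to the `lat_o`/`lng_o` and `lat_d`/`lng_d` columns directly instead of row by row.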

We can now calculate eigenvector centralities for the counties, treating the filtered flows as a directed graph weighted by `pop_flows`.

In [29]:
import igraph as ig
graph = ig.Graph.TupleList(pre_fire_filtered[['geoid_o','geoid_d','pop_flows']].itertuples(index=False), directed=True, edge_attrs='pop_flows')

eigen_centralities = graph.eigenvector_centrality(weights='pop_flows')
centrality_df = pd.concat([pd.Series(graph.vs['name'],name='geoid'), pd.Series(eigen_centralities,name='eigen_centrality')], axis=1)


Weighted directed graph in eigenvector centrality at src/centrality/eigenvector.c:303
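
For intuition about what the centrality call computes, here is a minimal power-iteration sketch in NumPy. It assumes the convention that, in a directed graph, a node's score is proportional to the weighted sum of the scores of the nodes sending flow into it, with the scores rescaled so that the largest is 1. As far as I understand this matches igraph's behavior, but I have not verified it against the data here. The function below is a hypothetical helper, not igraph's implementation, and the flow matrix is made up.

```python
import numpy as np

def eigenvector_centrality(weights, n_iter=1000, tol=1e-12):
    """Power iteration for in-edge eigenvector centrality.

    weights[i, j] is the flow from node i to node j. Assumes non-negative
    weights and that every node receives some flow.
    """
    w = np.asarray(weights, dtype=float)
    x = np.ones(w.shape[0])
    for _ in range(n_iter):
        x_new = w.T @ x                  # sum incoming flow times sender score
        x_new = x_new / x_new.max()      # rescale so the largest score is 1
        if np.abs(x_new - x).max() < tol:
            return x_new
        x = x_new
    return x

# Made-up flows between three nodes, including within-node flows on the diagonal.
flows = np.array([
    [50.0, 10.0, 1.0],
    [20.0, 30.0, 2.0],
    [ 5.0,  4.0, 3.0],
])
scores = eigenvector_centrality(flows)
print(scores.round(4))
```

The large diagonal entries mimic the within-county flows in the data, which are much larger than the flows between counties.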



## Getting population data
Let's add population to the dataset. I obtained county population estimates from the Census Bureau [here](https://www.census.gov/data/tables/time-series/demo/popest/2020s-counties-total.html).

In [30]:
county_pop = pd.read_excel('../data/california_county_populations.xlsx', header=3).rename({'Unnamed: 0':'county',2020:'pop'}, axis=1).dropna()[['county','pop']].iloc[1:, :].reset_index(drop=True)
# keep only the county name: drop the first character and everything after the comma
county_pop['county'] = county_pop.county.str[1:].str.split(',').str[0]
county_pop.sample(5)

Unnamed: 0,county,pop
50,Sutter County,99464.0
2,Amador County,40532.0
44,Shasta County,181984.0
8,El Dorado County,191220.0
32,Riverside County,2422847.0


First, we need to get the county names of our flows dataset.

In [31]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import json
from geopy.point import Point

# Nominatim requires a user agent that identifies the application
geolocator = Nominatim(user_agent="myapplication")

def lat_long_to_location(lat_longs):
    """Reverse-geocode (lat, lng) pairs to county names, at most one request per second."""
    geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
    counties = []
    for lat_long in lat_longs:
        _loc = geocode(Point(lat_long))
        if _loc is None:
            counties.append(None)
        else:
            # not every address has a 'county' field, so fall back to None
            counties.append(_loc.raw['address'].get('county'))
    return counties

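def create_converter_dict(series, converter, batch=False):
    """Map the first element of each tuple in `series` to a converted value."""
    unique_keys = series.str[0]
    if batch:
        # convert the remaining tuple elements (here: lat, lng) in a single call
        unique_vals = converter(series.str[1:])
    else:
        unique_vals = unique_keys.apply(converter)
    mapper_dict = dict(zip(unique_keys, unique_vals))
    return mapper_dict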

In [None]:
# this is a list of all the (geoid, lat, lng) coordinates in the data
origin_coords = kincaid.drop_duplicates(['geoid_o'])[['geoid_o', 'lat_o', 'lng_o']] \
    .rename({'geoid_o': 'geoid', 'lat_o': 'lat', 'lng_o': 'lng'}, axis=1)
destination_coords = kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']] \
    .rename({'geoid_d': 'geoid', 'lat_d': 'lat', 'lng_d': 'lng'}, axis=1)
cali_county_coordinates = pd.Series(
    pd.concat([origin_coords, destination_coords], axis=0)
    .drop_duplicates()
    .reset_index(drop=True)
    .itertuples(name=None, index=False)
)

# look up the county for each of these coordinates
cali_counties_lat_long_dict = create_converter_dict(cali_county_coordinates, lat_long_to_location, batch=True)

with open('../data/cali_counties_lat_long_dict.json', 'w') as fp:
    json.dump(cali_counties_lat_long_dict, fp)

In [33]:
with open('../data/cali_counties_lat_long_dict.json') as json_file:
    cali_counties_lat_long_dict = json.load(json_file)

In [34]:
pre_fire_filtered[['county_o','county_d']] = pre_fire_filtered[['geoid_o','geoid_d']].apply(lambda x: x.astype(str).map(cali_counties_lat_long_dict))
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d
123,6045,6095,16.912088,376.21978,39.433624,-123.431554,38.266919,-121.940007,Mendocino County,Solano County
167,6081,6045,25.43956,492.494505,37.43621,-122.355663,39.433624,-123.431554,San Mateo County,Mendocino County
22,6007,6041,12.373626,233.362637,39.666929,-121.600672,38.055183,-122.748859,Butte County,Marin County
45,6011,6101,151.131868,3220.615385,39.177573,-122.237028,39.034523,-121.694842,Colusa County,Sutter County
180,6095,6021,7.802198,126.78022,38.266919,-121.940007,39.598405,-122.392207,Solano County,Glenn County
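
The `astype(str)` in the mapping step matters because JSON object keys are always strings, so the integer geoids used when the dictionary was built come back as strings after the round trip through the file. A minimal sketch with a two-county lookup (the names are taken from the sample rows in this notebook; the `lookup` variable is illustrative):

```python
import json

# Small lookup keyed by integer geoid, as it would be before saving.
lookup = {6097: "Sonoma County", 6055: "Napa County"}

# Serialize and parse again, as happens when writing to and reading from disk.
round_tripped = json.loads(json.dumps(lookup))

print(list(round_tripped))            # the keys are now strings
print(round_tripped.get(6097))        # the integer key no longer matches
print(round_tripped.get(str(6097)))   # the string key does
```

Without the cast, `map` would find no matching keys and fill both county columns with missing values.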


Now that we've named our counties, we can attach the population estimates.

In [35]:
pre_fire_filtered = pre_fire_filtered.merge(county_pop, left_on='county_o', right_on='county', how='left').drop('county',axis=1).merge(county_pop, left_on='county_d', right_on='county', how='left', suffixes=('_o','_d')).drop('county',axis=1)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d
229,6113,6033,7.967033,139.208791,38.686646,-121.90162,39.099645,-122.753184,Yolo County,Lake County,216291.0,68195.0
8,6001,6055,188.505495,3541.505495,37.650547,-121.917885,38.506474,-122.330533,Alameda County,Napa County,1680380.0,137485.0
204,6097,6097,46404.263736,953758.21978,38.525293,-122.922545,38.525293,-122.922545,Sonoma County,Sonoma County,488281.0,488281.0
99,6041,6013,278.802198,6811.989011,38.055183,-122.748859,37.923419,-121.951212,Marin County,Contra Costa County,261282.0,1165986.0
126,6045,6113,10.307692,230.384615,39.433624,-123.431554,38.686646,-121.90162,Mendocino County,Yolo County,91361.0,216291.0
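
A left merge on county names silently leaves `NaN` populations wherever the two sources spell a name differently. One way to check for this is the `indicator` option of `merge`. The sketch below uses small made-up frames: the population figures are copied from the sample rows above, and the mismatched spelling "Napa" is invented to trigger the problem.

```python
import pandas as pd

flows = pd.DataFrame({"county_o": ["Sonoma County", "Napa County", "Lake County"]})

# "Napa" deliberately lacks the " County" suffix, so it will not match.
pops = pd.DataFrame({
    "county": ["Sonoma County", "Napa", "Lake County"],
    "pop": [488281.0, 137485.0, 68195.0],
})

# indicator=True adds a `_merge` column that records where each row was found.
checked = flows.merge(pops, left_on="county_o", right_on="county",
                      how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "county_o"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every county in the flows table found a population estimate.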


And we can also add the eigenvector centralities.

In [36]:
pre_fire_filtered = pre_fire_filtered.merge(centrality_df, how='left',left_on='geoid_o',right_on='geoid').drop('geoid',axis=1).merge(centrality_df, how='left',left_on='geoid_d',right_on='geoid', suffixes=('_o','_d')).drop('geoid',axis=1)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d,eigen_centrality_o,eigen_centrality_d
122,6045,6081,21.549451,473.263736,39.433624,-123.431554,37.43621,-122.355663,Mendocino County,San Mateo County,91361.0,762511.0,0.000578,0.058452
255,6115,6115,6550.89011,101031.516484,39.269008,-121.351263,39.269008,-121.351263,Yuba County,Yuba County,81994.0,81994.0,0.000553,0.000553
74,6021,6081,5.769231,108.857143,39.598405,-122.392207,37.43621,-122.355663,Glenn County,San Mateo County,28896.0,762511.0,0.000133,0.058452
35,6011,6013,3.626374,75.978022,39.177573,-122.237028,37.923419,-121.951212,Colusa County,Contra Costa County,21858.0,1165986.0,0.000172,0.124888
63,6013,6115,23.142857,396.758242,37.923419,-121.951212,39.269008,-121.351263,Contra Costa County,Yuba County,1165986.0,81994.0,0.124888,0.000553


We can also add the Social Connectedness Index between counties. There are cool visualizations [here](https://www.nytimes.com/interactive/2018/09/19/upshot/facebook-county-friendships.html), and I get the data from [here](https://data.humdata.org/dataset/social-connectedness-index).

In [37]:
import pandas as pd
import gc

connectedness = pd.read_csv('../data/county_county.tsv', engine='pyarrow', sep='\t')
pre_fire_filtered = pre_fire_filtered.merge(connectedness, how='left', left_on=['geoid_o', 'geoid_d'], right_on=['user_loc','fr_loc']).drop(['user_loc','fr_loc'], axis=1)

del connectedness; gc.collect()

2712

I then write this to a CSV so that I can fit a GRM to it in the model section.

In [38]:
pre_fire_filtered.to_csv('../data/clean/pre_kincaid.csv', index=False)