In this notebook I calculate network metrics, such as centrality, to use as predictors of the flow between nodes. I also add county population.

## Load the data

I first load the data and split it into the period before the Kincaid fire and the period during it.

In [383]:
import pandas as pd
import numpy as np

kincaid = pd.read_csv('../data/california_jul_nov_2019.csv', engine='pyarrow', date_format='%Y-%m-%d')
kincaid['date'] = pd.to_datetime(kincaid.date)

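# split at the first day of the fire; explicit date strings are easier to read than bare integers
pre_fire = kincaid.query('date < "2019-10-23"')
during_fire = kincaid.query('date >= "2019-10-23"')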

# data on wildfires. filter for when the kincaid wildfire was active
wildfires = pd.read_csv('../data/ca_daily_fire_2000_03252022.csv', engine='pyarrow')
wildfires['acq_date'] = pd.to_datetime(wildfires.acq_date)
wildfires = wildfires.query('acq_date >= "2019-10-23" and acq_date <= "2019-11-06"')

I group by origin-destination pairs, and sum up the total flows between them. I then divide by the number of days to get the average daily traffic between these nodes over the time period.

In [384]:
n_days = (pre_fire.date.max() - pre_fire.date.min()).days
pre_fire = pre_fire.groupby(['geoid_o', 'geoid_d']).agg({'visitor_flows':'sum', 'pop_flows':'sum'})
pre_fire = pre_fire.multiply(1/n_days)
pre_fire.reset_index(inplace=True)

# add the latitudes and longitudes of the origin and destination counties
origin_coords = kincaid.drop_duplicates(['geoid_o'])[['geoid_o','lat_o','lng_o']]
dest_coords = kincaid.drop_duplicates(['geoid_d'])[['geoid_d','lat_d','lng_d']]
pre_fire = pre_fire.merge(origin_coords, how='left', on='geoid_o').merge(dest_coords, how='left', on='geoid_d')

pre_fire.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d
1808,6065,6017,44.021978,703.252747,33.74368,-115.993849,38.778738,-120.524647
3186,6115,6059,8.428571,131.648352,39.269008,-121.351263,33.676913,-117.776146
1863,6067,6013,809.923077,12400.637363,38.449315,-121.344243,37.923419,-121.951212
1026,6037,6041,134.406593,2460.120879,34.198001,-118.261017,38.055183,-122.748859
955,6035,6013,4.065934,78.648352,40.67359,-120.594333,37.923419,-121.951212


We now have the average daily flows between California counties from July to October, before the Kincaid fire.
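
The averaging step divides the summed flows by the difference between the last and first dates, which is one less than the number of calendar days in the window. Here is a minimal sketch of the two conventions, assuming one observation per calendar day. The helper names and the dates are hypothetical and are not read from the data.

```python
from datetime import date


def span_days(first: date, last: date, inclusive: bool = True) -> int:
    """Days between two dates, optionally counting both endpoints."""
    gap = (last - first).days
    return gap + 1 if inclusive else gap


def average_daily_flow(total_flow: float, first: date, last: date) -> float:
    """Average flow per day, assuming one observation per calendar day."""
    return total_flow / span_days(first, last, inclusive=True)


# Hypothetical ten-day window, for illustration only
first_day, last_day = date(2019, 7, 1), date(2019, 7, 10)
print(span_days(first_day, last_day, inclusive=False))  # 9
print(span_days(first_day, last_day))                   # 10
print(average_daily_flow(1000.0, first_day, last_day))  # 100.0
```

Over a window of roughly three months the two conventions differ by about one percent, so the choice matters little for ranking flows but is worth stating when reporting absolute daily averages.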

In [None]:
import plotly.express as px
import plotly.graph_objects as go

# alternative visualization that's not as cool
# fig = go.Figure(go.Scattergeo(
#     lon= wildfires[wildfires.confidence == 100].longitude,
#     lat=wildfires[wildfires.confidence == 100].latitude
# ))
# fig.update_geos(
#     visible=False, resolution=110, scope="usa",
#     showcountries=True, countrycolor="Black",
#     showsubunits=True, subunitcolor="Blue"
# )
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()

fig = px.density_mapbox(wildfires, lat='latitude', lon='longitude', z='confidence', radius=2,
                        center=dict(lat=38.7, lon=-122.73), zoom=7,
                        mapbox_style="stamen-terrain",
                        color_continuous_scale='turbo')
fig.show()

The wildfire was concentrated in this region. Let's keep only the flows whose origin county lies within a fixed distance of its center.

In [385]:
from geopy.distance import geodesic

# Center point (latitude, longitude) of the search radius
center = (38.70369, -122.7669)

# Maximum distance (in kilometers) from the center point
max_distance = 50

# Calculate the distance between each observation and the center point
distances = pre_fire.apply(lambda row: geodesic(center, (row['lat_o'], row['lng_o'])).kilometers, axis=1)

# Filter the DataFrame for observations within the maximum distance
pre_fire_filtered = pre_fire[distances <= max_distance].reset_index(drop=True)
pre_fire_filtered['pop_flows'] = pd.to_numeric(pre_fire_filtered['pop_flows'], errors='coerce')
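
The cell above measures ellipsoidal distance with `geopy`'s `geodesic`, one row at a time. As a rough cross-check, a great-circle (haversine) distance can be computed with the standard library alone. This is a swapped-in approximation, not the method used in this notebook: it assumes a spherical Earth, so results can differ from `geodesic` by a fraction of a percent. `haversine_km`, `EARTH_RADIUS_KM`, and the two-county dictionary are illustrative names; the coordinates are copied from the tables in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical-Earth assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


center = (38.70369, -122.7669)
max_distance = 50  # kilometers

# One centroid per county geoid, so each distance is computed once per county
centroids = {
    6097: (38.525293, -122.922545),  # Sonoma County
    6037: (34.198001, -118.261017),  # Los Angeles County
}
nearby = {
    geoid
    for geoid, (lat, lng) in centroids.items()
    if haversine_km(center[0], center[1], lat, lng) <= max_distance
}
print(nearby)
```

Computing the distance once per unique county and then filtering rows by membership in `nearby` avoids repeating the same calculation for every origin-destination pair.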

In [381]:
import igraph as ig
graph = ig.Graph.TupleList(pre_fire_filtered[['geoid_o','geoid_d','pop_flows']].itertuples(index=False), directed=True, edge_attrs='pop_flows')

eigen_centralities = graph.eigenvector_centrality(weights='pop_flows')
centrality_df = pd.concat([pd.Series(graph.vs['name'],name='geoid'), pd.Series(eigen_centralities,name='eigen_centrality')], axis=1)


Weighted directed graph in eigenvector centrality at src/centrality/eigenvector.c:303



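We now have an eigenvector centrality for each county in the filtered graph, weighted by `pop_flows`. The message above is a warning that igraph emits for weighted directed graphs; the centralities are still returned.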
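
To make the metric concrete, here is a minimal power-iteration sketch of eigenvector centrality on a tiny weighted directed graph. It is an illustration under stated assumptions, not igraph's implementation (igraph relies on a numerical eigensolver). It assumes the convention that a node's score is proportional to the weighted sum of the scores of the nodes sending flow to it, and it rescales so the largest score is 1. The function name and the 3-node `flows` matrix are hypothetical.

```python
def eigenvector_centrality(weights, n_iter=1000, tol=1e-10):
    """Power iteration on a weighted directed graph.

    weights[i][j] is the flow from node i to node j.
    Scores are rescaled so that the largest one equals 1.
    """
    n = len(weights)
    x = [1.0] * n
    for _ in range(n_iter):
        # New score of node j: weighted sum of the scores of nodes pointing to j
        new = [sum(weights[i][j] * x[i] for i in range(n)) for j in range(n)]
        top = max(new)
        new = [v / top for v in new]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x


# Hypothetical flows between three nodes (row = origin, column = destination)
flows = [
    [0.0, 5.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
scores = eigenvector_centrality(flows)
print([round(s, 3) for s in scores])
```

In this toy graph the third node receives little flow from the other two, so it ends up with the lowest score. Power iteration only converges reliably when the graph is strongly connected, which is one reason sparse directed flow graphs can produce very small or degenerate centralities.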

## Getting population data
Let's add population to the dataset. The county population estimates come from the [U.S. Census Bureau](https://www.census.gov/data/tables/time-series/demo/popest/2020s-counties-total.html).

In [400]:
county_pop = pd.read_excel('../data/california_county_populations.xlsx', header=3).rename({'Unnamed: 0':'county',2020:'pop'}, axis=1)
county_pop = county_pop.dropna()[['county','pop']].iloc[1:, :].reset_index(drop=True)
# drop the leading character of each name and keep only the part before the comma
county_pop['county'] = county_pop.county.str[1:].str.split(',').str[0]
county_pop.sample(5)

Unnamed: 0,county,pop
54,Tuolumne County,55374.0
6,Contra Costa County,1165986.0
5,Colusa County,21858.0
4,Calaveras County,45346.0
20,Marin County,261282.0
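
The name cleanup above drops the first character of each name and keeps the text before the comma. Here is a small sketch of the same idea on plain strings, assuming the raw names look like `.Marin County, California` (the leading dot is inferred from the slicing in the code, not checked against the file). `clean_county_name` is a hypothetical helper that strips leading dots instead of always dropping one character, so a name without a dot is left intact.

```python
def clean_county_name(raw_name: str) -> str:
    """Turn '.Marin County, California' into 'Marin County'."""
    # Remove any leading dots, then keep the part before the first comma
    return raw_name.lstrip(".").split(",")[0].strip()


print(clean_county_name(".Marin County, California"))  # Marin County
print(clean_county_name("Marin County, California"))   # Marin County
```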


First, we need county names for the geoids in our flows dataset.

In [359]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from geopy.point import Point

geolocator = Nominatim(user_agent="myapplication")

def lat_long_to_location(lat_longs):
    """Reverse-geocode (lat, lng) pairs and return a county name for each (None if not found)."""
    # Nominatim's usage policy allows at most one request per second
    geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
    counties = []
    for lat_long in lat_longs:
        _loc = geocode(Point(lat_long))
        if _loc is None:
            counties.append(None)
        else:
            # .get avoids a KeyError when the address has no 'county' entry
            counties.append(_loc.raw['address'].get('county'))
    return counties

def create_converter_dict(series, converter, batch=False):
    """Build a dict keyed by the first element of each tuple in `series`."""
    unique_keys = series.str[0]
    if batch:
        # convert all remaining tuple elements in a single call
        unique_vals = converter(series.str[1:])
    else:
        # convert each key individually
        unique_vals = unique_keys.apply(converter)
    mapper_dict = dict(zip(unique_keys, unique_vals))
    return mapper_dict
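
Reverse-geocoding responses are not guaranteed to contain a `county` entry in their address. Here is a hedged sketch of a lookup that falls back to other address fields. It assumes a Nominatim-style `raw` dictionary with an `address` sub-dictionary. `county_from_address` and the choice of fallback keys are hypothetical and would need checking against real responses.

```python
def county_from_address(raw, fallback_keys=("county", "city", "state_district")):
    """Return the first available name from a Nominatim-style `raw` dict, else None."""
    if raw is None:
        return None
    address = raw.get("address", {})
    for key in fallback_keys:
        if key in address:
            return address[key]
    return None


# Hand-written stand-ins for geocoder responses (not real API output)
print(county_from_address({"address": {"county": "Sonoma County"}}))  # Sonoma County
print(county_from_address({"address": {"city": "San Francisco"}}))    # San Francisco
print(county_from_address(None))                                      # None
```

A fallback value such as a city name will not necessarily match the naming in the population table, so any rows resolved this way should be reviewed before merging.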

In [361]:
# unique (geoid, lat, lng) tuples for every origin and destination county in the data
cali_county_coordinates = pd.Series(
    pd.concat(
        [
            kincaid.drop_duplicates(['geoid_o'])[['geoid_o','lat_o', 'lng_o']]. \
                rename({'geoid_o':'geoid','lat_o':'lat','lng_o':'lng'}, axis=1),
            kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']]. \
                rename({'geoid_d':'geoid','lat_d':'lat', 'lng_d':'lng'}, axis=1)
                ], axis=0).drop_duplicates().reset_index(drop=True).itertuples(name=None, index=False))

# look up the county name for each coordinate pair, keyed by geoid
cali_counties_lat_long_dict = create_converter_dict(cali_county_coordinates, lat_long_to_location, batch=True)

In [401]:
pre_fire_filtered[['county_o','county_d']] = pre_fire_filtered[['geoid_o','geoid_d']].apply(lambda x: x.map(cali_counties_lat_long_dict))
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d
5,6033,6011,16.0,330.703297,39.099645,-122.753184,39.177573,-122.237028,Lake County,Colusa County
109,6055,6107,4.846154,101.483516,38.506474,-122.330533,36.220157,-118.800473,Napa County,Tulare County
92,6055,6071,9.824176,213.648352,38.506474,-122.330533,34.841438,-116.178461,Napa County,San Bernardino County
52,6033,6107,1.934066,39.164835,39.099645,-122.753184,36.220157,-118.800473,Lake County,Tulare County
124,6097,6021,5.725275,119.857143,38.525293,-122.922545,39.598405,-122.392207,Sonoma County,Glenn County


Now that the counties are named, we can merge in the population estimates for both the origin and the destination.

In [409]:
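# merge origin populations, then destination populations; the suffixes separate the two 'pop' columns
pre_fire_filtered = pre_fire_filtered.merge(county_pop, left_on='county_o', right_on='county', how='left').drop('county',axis=1)
pre_fire_filtered = pre_fire_filtered.merge(county_pop, left_on='county_d', right_on='county', how='left', suffixes=('_o','_d')).drop('county',axis=1)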
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d
98,6055,6085,122.153846,2465.659341,38.506474,-122.330533,37.232484,-121.696279,Napa County,Santa Clara County,137485.0,1931026.0
133,6097,6039,5.527473,116.296703,38.525293,-122.922545,37.218031,-119.762648,Sonoma County,Madera County,488281.0,156343.0
108,6055,6105,0.67033,15.945055,38.506474,-122.330533,40.650694,-123.112628,Napa County,Trinity County,137485.0,16089.0
36,6033,6073,8.054945,165.175824,39.099645,-122.753184,33.028203,-116.770207,Lake County,San Diego County,68195.0,3296045.0
81,6055,6049,0.142857,3.263736,38.506474,-122.330533,41.589834,-120.725029,Napa County,Modoc County,137485.0,8663.0
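
Because both merges are left joins on county names, any name that is spelled differently in the two sources silently produces a missing population. Here is a minimal sketch of a check for such names, using plain Python sets. The helper name and the example lists are hypothetical and are not taken from the data.

```python
def unmatched_keys(left_keys, right_keys):
    """Keys on the left side of a left join that have no partner on the right."""
    return sorted(set(left_keys) - set(right_keys), key=str)


# Hypothetical name lists, for illustration only
flow_counties = ["Sonoma County", "Napa County", "Lake County", "San Francisco"]
census_counties = ["Sonoma County", "Napa County", "Lake County", "San Francisco County"]

print(unmatched_keys(flow_counties, census_counties))  # ['San Francisco']
```

In pandas, passing `indicator=True` to `merge` gives similar information per row through the `_merge` column.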


We can also add the eigenvector centralities of the origin and destination counties.

In [414]:
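# merge origin centralities, then destination centralities; the suffixes separate the two 'eigen_centrality' columns
pre_fire_filtered = pre_fire_filtered.merge(centrality_df, how='left', left_on='geoid_o', right_on='geoid').drop('geoid',axis=1)
pre_fire_filtered = pre_fire_filtered.merge(centrality_df, how='left', left_on='geoid_d', right_on='geoid', suffixes=('_o','_d')).drop('geoid',axis=1)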
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d,eigen_centrality_o,eigen_centrality_d
156,6097,6087,53.791209,1141.098901,38.525293,-122.922545,37.021606,-122.009787,Sonoma County,Santa Cruz County,488281.0,270462.0,1.0,0.0012
119,6097,6011,8.307692,173.076923,38.525293,-122.922545,39.177573,-122.237028,Sonoma County,Colusa County,488281.0,21858.0,1.0,0.000184
126,6097,6025,1.362637,28.769231,38.525293,-122.922545,33.039506,-115.36532,Sonoma County,Imperial County,488281.0,179595.0,1.0,3e-05
9,6033,6019,3.373626,69.483516,39.099645,-122.753184,36.758179,-119.64932,Lake County,Fresno County,68195.0,1009503.0,0.002952,0.00078
166,6097,6107,13.868132,287.0,38.525293,-122.922545,36.220157,-118.800473,Sonoma County,Tulare County,488281.0,473891.0,1.0,0.000302
