In this notebook I calculate network metrics such as centrality, to be used as predictors of the flow between nodes. I also add county population.

## Load the data

I first load the data and split it into two periods: before the Kincaid fire and during it.

In [7]:
import pandas as pd
import numpy as np

kincaid = pd.read_csv('../data/california_jul_nov_2019.csv', engine='pyarrow', date_format='%Y-%m-%d')
kincaid['date'] = pd.to_datetime(kincaid.date)

pre_fire = kincaid.query('date < "2019-10-23"')
during_fire = kincaid.query('date >= "2019-10-23"')

# data on wildfires, filtered to the days when the Kincaid fire was active
wildfires = pd.read_csv('../data/ca_daily_fire_2000_03252022.csv', engine='pyarrow')
wildfires['acq_date'] = pd.to_datetime(wildfires.acq_date)
wildfires = wildfires.query('acq_date >= "2019-10-23" & acq_date <= "2019-11-06"')

I group by origin-destination pairs, and sum up the total flows between them. I then divide by the number of days to get the average daily traffic between these nodes over the time period.

In [8]:
# days elapsed between the first and last pre-fire dates
n_days = (pre_fire.date.max() - pre_fire.date.min()).days
pre_fire = pre_fire.groupby(['geoid_o', 'geoid_d']).agg({'visitor_flows':'sum', 'pop_flows':'sum'})
pre_fire = pre_fire.multiply(1/n_days)
pre_fire.reset_index(inplace=True)

# add the latitudes and longitudes
pre_fire = pre_fire.merge(kincaid.drop_duplicates(['geoid_o'])[['geoid_o','lat_o','lng_o']] ,how='left', on=['geoid_o']).merge(kincaid.drop_duplicates(['geoid_d'])[['geoid_d','lat_d','lng_d']], how='left', on='geoid_d')

pre_fire.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d
1403,6051,6005,0.065934,0.78022,37.93899,-118.886709,38.44639,-120.651116
1959,6069,6093,1.131868,18.967033,36.605706,-121.074996,41.592687,-122.540368
3187,6115,6061,618.21978,9608.208791,39.269008,-121.351263,39.063427,-120.717652
2834,6103,6031,1.054945,17.263736,40.125736,-122.233885,36.075361,-119.815502
883,6031,6099,34.252747,615.824176,36.075361,-119.815502,37.559137,-120.997683
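
To make the aggregation step concrete, here is a minimal sketch on toy data. The GEOIDs, dates, and flow values are made up for illustration; only the column names follow the dataset used above.

```python
import pandas as pd

# Toy data (hypothetical values): two origin-destination pairs over three dates.
toy = pd.DataFrame({
    "geoid_o": [6001, 6001, 6001, 6013],
    "geoid_d": [6013, 6013, 6013, 6001],
    "date": pd.to_datetime(["2019-07-01", "2019-07-02", "2019-07-03", "2019-07-02"]),
    "pop_flows": [10.0, 20.0, 30.0, 8.0],
})

# Days elapsed between the first and last date, computed as in the cell above.
n_days_toy = (toy.date.max() - toy.date.min()).days

# Sum the flows per origin-destination pair, then divide by the day count.
avg_daily = (
    toy.groupby(["geoid_o", "geoid_d"], as_index=False)["pop_flows"].sum()
       .assign(pop_flows=lambda d: d["pop_flows"] / n_days_toy)
)
print(avg_daily)
```

Here the (6001, 6013) pair averages 30.0 and the reverse pair 4.0. Note that the divisor is 2 even though three calendar dates appear, because subtracting the first date from the last counts intervals rather than days; adding 1 would give the inclusive count.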


We now have the average daily flows between California counties from July to October 2019, before the Kincaid fire.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

# alternative visualization that's not as cool
# fig = go.Figure(go.Scattergeo(
#     lon= wildfires[wildfires.confidence == 100].longitude,
#     lat=wildfires[wildfires.confidence == 100].latitude
# ))
# fig.update_geos(
#     visible=False, resolution=110, scope="usa",
#     showcountries=True, countrycolor="Black",
#     showsubunits=True, subunitcolor="Blue"
# )
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
# fig.show()

fig = px.density_mapbox(wildfires, lat='latitude', lon='longitude', z='confidence', radius=2,
                        center=dict(lat=38.7, lon=-122.73), zoom=7,
                        mapbox_style="stamen-terrain",
                        color_continuous_scale='turbo')
fig.show()

The fire was concentrated in the region shown above. Next, I keep only the flows whose origin county lies within a fixed distance of the fire's center.

In [10]:
from geopy.distance import geodesic
import pandas as pd

# Define the center point
center = (38.70369, -122.7669)

# Define the maximum distance (in kilometers) from the center point
max_distance = 50  # i.e. 50 kilometers

# Calculate the distance between each observation and the center point
distances = pre_fire.apply(lambda row: geodesic(center, (row['lat_o'], row['lng_o'])).kilometers, axis=1)

# Filter the DataFrame for observations within the maximum distance
pre_fire_filtered = pre_fire[distances <= max_distance].reset_index(drop=True)
pre_fire_filtered['pop_flows'] = pd.to_numeric(pre_fire_filtered['pop_flows'], errors='coerce')
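
As an aside, the row-by-row `geodesic` call can be slow on larger tables. Below is a rough vectorized alternative using the haversine formula. It is a sketch, not the method used above: `haversine_km` is a hypothetical helper, it assumes a spherical Earth of radius 6371 km, and its results differ slightly from geopy's ellipsoidal distances. The centroids are copied from the sample rows shown in this notebook.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km. Inputs are in degrees and may be arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Same center point as in the cell above
center_lat, center_lon = 38.70369, -122.7669

# County centroids taken from the sample outputs in this notebook
centroids = pd.DataFrame({
    "county": ["Sonoma", "Lake", "Napa", "Imperial"],
    "lat": [38.525293, 39.099645, 38.506474, 33.039506],
    "lng": [-122.922545, -122.753184, -122.330533, -115.36532],
})

# One vectorized call instead of a Python-level loop over rows
centroids["dist_km"] = haversine_km(center_lat, center_lon, centroids["lat"], centroids["lng"])
within = centroids[centroids["dist_km"] <= 50]
print(within)
```

For counties close to the 50 km cutoff, the spherical approximation could in principle flip the result, so the geodesic version remains the safer choice when precision matters.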

We can calculate eigenvector centralities for the counties, weighting each edge by its average daily population flow.

In [12]:
import igraph as ig
graph = ig.Graph.TupleList(pre_fire_filtered[['geoid_o','geoid_d','pop_flows']].itertuples(index=False), directed=True, edge_attrs='pop_flows')

eigen_centralities = graph.eigenvector_centrality(weights='pop_flows')
centrality_df = pd.concat([pd.Series(graph.vs['name'],name='geoid'), pd.Series(eigen_centralities,name='eigen_centrality')], axis=1)


Weighted directed graph in eigenvector centrality at src/centrality/eigenvector.c:303
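
For intuition about what this score measures, here is a minimal power-iteration sketch on a made-up 3-node flow matrix. It illustrates the general technique only and is not igraph's implementation. It assumes that a node's score depends on the flows it receives, and it rescales so the largest score is 1; the function name and the matrix values are hypothetical.

```python
import numpy as np

def eigenvector_centrality(weights, n_iter=1000, tol=1e-12):
    """Power iteration on the transposed weight matrix.

    weights[i, j] is the flow from node i to node j, so a node scores
    highly when high-scoring nodes send large flows to it.
    """
    x = np.ones(weights.shape[0])
    for _ in range(n_iter):
        x_new = weights.T @ x
        x_new = x_new / x_new.max()  # rescale so the largest score is 1
        converged = np.abs(x_new - x).max() < tol
        x = x_new
        if converged:
            break
    return x

# Hypothetical flows between three nodes (row = origin, column = destination)
W = np.array([
    [0.0, 5.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 3.0, 0.0],
])
scores = eigenvector_centrality(W)
print(scores)
```

Power iteration converges here because every node can reach every other node and the graph has cycles of different lengths. On graphs without those properties, the simple loop above may oscillate or assign zero scores.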



## Getting population data
Let's add population to the dataset. I use the Census Bureau's county population estimates, available [here](https://www.census.gov/data/tables/time-series/demo/popest/2020s-counties-total.html).

In [13]:
county_pop = pd.read_excel('../data/california_county_populations.xlsx', header=3).rename({'Unnamed: 0':'county',2020:'pop'}, axis=1).dropna()[['county','pop']].iloc[1:, :].reset_index(drop=True)
county_pop['county'] = county_pop.county.str[1:].str.split(',').str[0]
county_pop.sample(5)

Unnamed: 0,county,pop
18,Los Angeles County,9992236.0
6,Contra Costa County,1165986.0
55,Ventura County,843310.0
35,San Bernardino County,2183239.0
36,San Diego County,3296045.0
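
The string cleaning in the cell above can be checked on a couple of labels. This sketch assumes the raw spreadsheet labels look like `.Alameda County, California`, with a leading dot and a trailing state name; that format is an assumption inferred from the slicing and splitting in the code, not something shown in this notebook.

```python
import pandas as pd

# Hypothetical raw labels in the assumed spreadsheet format
raw = pd.Series([".Alameda County, California", ".Los Angeles County, California"])

# Drop the leading "." and keep the text before the first comma
cleaned = raw.str[1:].str.split(",").str[0]
print(cleaned.tolist())
```

If some labels lacked the leading dot, `str[1:]` would cut off a real character, so `raw.str.lstrip(".")` may be a safer choice in that case.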


First, we need to get the county names of our flows dataset.

In [14]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import json
from geopy.point import Point

def lat_long_to_location(lat_longs):
    # reverse-geocode each (lat, lng) pair, waiting at least 1 second between requests
    geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)
    counties = []
    for lat_long in lat_longs:
        _loc = geocode(Point(lat_long))
        if _loc is None:
            counties.append(None)
        else:
            # .get avoids a KeyError when the result has no 'county' field
            counties.append(_loc.raw.get('address', {}).get('county'))
    return counties

geolocator = Nominatim(user_agent="myapplication")

def create_converter_dict(series, converter, batch=False):
    # each element of `series` is a tuple: (key, value_1, value_2, ...)
    unique_keys = series.str[0]
    if batch:
        # pass all the value tuples to the converter in one call
        unique_vals = converter(series.str[1:])
    else:
        unique_vals = unique_keys.apply(converter)
    mapper_dict = dict(zip(unique_keys, unique_vals))
    return mapper_dict

In [15]:
# this is a list of all the lat-long coordinates for the data
cali_county_coordinates = pd.Series(
    pd.concat(
        [
            kincaid.drop_duplicates(['geoid_o'])[['geoid_o','lat_o', 'lng_o']]. \
                rename({'geoid_o':'geoid','lat_o':'lat','lng_o':'lng'}, axis=1),
            kincaid.drop_duplicates(['geoid_d'])[['geoid_d', 'lat_d', 'lng_d']]. \
                rename({'geoid_d':'geoid','lat_d':'lat', 'lng_d':'lng'}, axis=1)
                ], axis=0).drop_duplicates().reset_index(drop=True).itertuples(name=None, index=False))

# look up the county for each coordinate pair (rate-limited to one request per second, so this is slow)
cali_counties_lat_long_dict = create_converter_dict(cali_county_coordinates, lat_long_to_location, batch=True)

with open('../data/cali_counties_lat_long_dict.json', 'w') as fp:
    json.dump(cali_counties_lat_long_dict, fp)

In [17]:
with open('../data/cali_counties_lat_long_dict.json') as json_file:
    cali_counties_lat_long_dict = json.load(json_file)
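
One detail to keep in mind when caching the lookup as JSON: object keys in JSON are always strings, so integer GEOID keys come back as strings after a save and reload. That is why the mapping step that follows casts the GEOIDs with `astype(str)`. A minimal sketch with a hypothetical two-entry mapping:

```python
import json

# Hypothetical two-entry mapping with integer GEOID keys
mapping = {6055: "Napa County", 6097: "Sonoma County"}

# Round trip through JSON, as done with the file on disk above
restored = json.loads(json.dumps(mapping))
print(restored)

# Optional: convert the keys back to integers after loading
restored_int = {int(k): v for k, v in restored.items()}
```

Converting the keys back with `int` would be an alternative to casting the GEOID columns to strings before mapping.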

In [21]:
pre_fire_filtered[['county_o','county_d']] = pre_fire_filtered[['geoid_o','geoid_d']].apply(lambda x: x.astype(str).map(cali_counties_lat_long_dict))
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d
28,6033,6057,4.032967,82.241758,39.099645,-122.753184,39.30137,-120.768752,Lake County,Nevada County
78,6055,6043,1.857143,41.032967,38.506474,-122.330533,37.581522,-119.905517,Napa County,Mariposa County
82,6055,6051,1.549451,35.758242,38.506474,-122.330533,37.93899,-118.886709,Napa County,Mono County
118,6097,6009,8.879121,186.681319,38.525293,-122.922545,38.204607,-120.554132,Sonoma County,Calaveras County
70,6055,6027,1.010989,21.032967,38.506474,-122.330533,36.511127,-117.41079,Napa County,Inyo County


Now that we've named our counties, we can attach population estimates for the origin and destination counties.

In [22]:
pre_fire_filtered = pre_fire_filtered.merge(county_pop, left_on='county_o', right_on='county', how='left').drop('county',axis=1).merge(county_pop, left_on='county_d', right_on='county', how='left', suffixes=('_o','_d')).drop('county',axis=1)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d
68,6055,6023,5.461538,117.186813,38.506474,-122.330533,40.705009,-123.915819,Napa County,Humboldt County,137485.0,136261.0
91,6055,6069,0.78022,16.714286,38.506474,-122.330533,36.605706,-121.074996,Napa County,San Benito County,137485.0,64521.0
77,6055,6041,132.32967,2823.120879,38.506474,-122.330533,38.055183,-122.748859,Napa County,Marin County,137485.0,261282.0
90,6055,6067,96.043956,2035.494505,38.506474,-122.330533,38.449315,-121.344243,Napa County,Sacramento County,137485.0,1586465.0
40,6033,6083,2.098901,42.252747,39.099645,-122.753184,34.538344,-120.030786,Lake County,Santa Barbara County,68195.0,448244.0
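
Because this merge matches on county names, a label that differs between the two sources would silently produce missing populations. One way to check, sketched here on toy data, is the `indicator=True` option of `merge`. The "San Francisco" label is a made-up example of a name that fails to match; the two population values are copied from the output above.

```python
import pandas as pd

# Toy flows table; "San Francisco" is a hypothetical label without the "County" suffix
flows = pd.DataFrame({"county_o": ["Napa County", "Lake County", "San Francisco"]})
pop = pd.DataFrame({"county": ["Napa County", "Lake County"], "pop": [137485.0, 68195.0]})

# indicator=True adds a "_merge" column recording where each row found a match
checked = flows.merge(pop, left_on="county_o", right_on="county", how="left", indicator=True)

# Rows marked "left_only" had no matching county name in the population table
unmatched = checked.loc[checked["_merge"] == "left_only", "county_o"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that every county name found a population estimate.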


We can also add the eigenvector centralities of the origin and destination counties.

In [23]:
pre_fire_filtered = pre_fire_filtered.merge(centrality_df, how='left',left_on='geoid_o',right_on='geoid').drop('geoid',axis=1).merge(centrality_df, how='left',left_on='geoid_d',right_on='geoid', suffixes=('_o','_d')).drop('geoid',axis=1)
pre_fire_filtered.sample(5)

Unnamed: 0,geoid_o,geoid_d,visitor_flows,pop_flows,lat_o,lng_o,lat_d,lng_d,county_o,county_d,pop_o,pop_d,eigen_centrality_o,eigen_centrality_d
108,6055,6105,0.67033,15.945055,38.506474,-122.330533,40.650694,-123.112628,Napa County,Trinity County,137485.0,16089.0,0.013733,5.8e-05
47,6033,6097,331.516484,6739.626374,39.099645,-122.753184,38.525293,-122.922545,Lake County,Sonoma County,68195.0,488281.0,0.002952,1.0
69,6055,6025,0.549451,12.010989,38.506474,-122.330533,33.039506,-115.36532,Napa County,Imperial County,137485.0,179595.0,0.013733,3e-05
11,6033,6023,11.923077,242.208791,39.099645,-122.753184,40.705009,-123.915819,Lake County,Humboldt County,68195.0,136261.0,0.002952,0.000991
17,6033,6035,1.32967,26.659341,39.099645,-122.753184,40.67359,-120.594333,Lake County,Lassen County,68195.0,32317.0,0.002952,7.8e-05


Finally, we can add the social connectedness index between counties. There are some cool visualizations of it [here](https://www.nytimes.com/interactive/2018/09/19/upshot/facebook-county-friendships.html), and the data comes from [here](https://data.humdata.org/dataset/social-connectedness-index).

In [27]:
import pandas as pd
import gc

connectedness = pd.read_csv('../data/county_county.tsv', engine='pyarrow', sep='\t')
pre_fire_filtered = pre_fire_filtered.merge(connectedness, how='left', left_on=['geoid_o', 'geoid_d'], right_on=['user_loc','fr_loc']).drop(['user_loc','fr_loc'], axis=1)

del connectedness; gc.collect()

0