In [34]:
import pandas as pd
import math

fire_locations = pd.read_csv('../data/Wildland_Fire_Incident_Locations.csv')

fire_locations.head()

Unnamed: 0,IncidentName,UniqueFireIdentifier,FireDiscoveryDateTime,InitialResponseAcres,CreatedOnDateTime_dt,GACC,ContainmentDateTime,ControlDateTime,FireOutDateTime,DiscoveryAcres,FinalAcres,IncidentSize,InitialLatitude,InitialLongitude,Latitude,Longitude
0,Grand Pager,2014-AKFAS-411093,1530/08/16,0.1,2014-05-21 04:09:45+00:00,AKCC,2014/05/12 05:17:21+00,2014/05/12 05:17:35+00,2014/05/13 01:53:46+00,0.1,0.1,0.1,64.80265,-147.744683,64.802702,-147.745026
1,Glenn Alps,2014-AKMSS-401077,2014/05/09 03:22:55+00,0.1,2014-05-21 04:10:05+00:00,AKCC,2014/05/09 07:09:00+00,2014/05/09 07:09:31+00,2014/05/10 20:27:38+00,0.1,0.7,0.7,61.102967,-149.662833,61.103002,-149.663023
2,Johnson Lake,2014-AKKKS-403067,2014/05/05 00:18:25+00,0.1,2014-05-21 04:10:06+00:00,AKCC,2014/05/05 00:28:35+00,2014/05/05 00:28:00+00,2014/05/05 00:33:26+00,0.1,0.1,0.1,60.295033,-151.262467,60.295001,-151.262022
3,Mile 9 Talkeetna Spur,2014-AKMSS-401137,2014/05/19 21:45:02+00,0.1,2014-05-21 04:10:07+00:00,AKCC,2014/05/19 22:02:00+00,2014/05/19 22:02:31+00,2014/05/25 20:02:25+00,0.1,0.1,0.1,62.422633,-150.081217,62.422602,-150.081024
4,Jim Lake,2014-AKMSS-401131,2014/05/19 00:01:00+00,0.1,2014-05-21 04:10:07+00:00,AKCC,2014/05/19 16:14:45+00,2014/05/19 16:14:49+00,2014/05/29 23:42:41+00,0.1,1.0,1.0,61.543767,-148.879433,61.543802,-148.879023


Let's see which attributes are poorly tracked, i.e. which ones are null for a large share of instances.

In [35]:
total_instances = fire_locations.shape[0]
print("Total instances: " + str(total_instances))
missing_values = fire_locations.isnull().sum()
print("Missing values for each attribute:")
for attribute, count in missing_values.items():
    print(f"{attribute}: {count} ({int(count/total_instances * 10000)/100}%)")

Total instances: 207301
Missing values for each attribute:
IncidentName: 40 (0.01%)
UniqueFireIdentifier: 0 (0.0%)
FireDiscoveryDateTime: 0 (0.0%)
InitialResponseAcres: 133197 (64.25%)
CreatedOnDateTime_dt: 0 (0.0%)
GACC: 58 (0.02%)
ContainmentDateTime: 81565 (39.34%)
ControlDateTime: 92348 (44.54%)
FireOutDateTime: 81482 (39.3%)
DiscoveryAcres: 59460 (28.68%)
FinalAcres: 195128 (94.12%)
IncidentSize: 65217 (31.46%)
InitialLatitude: 59051 (28.48%)
InitialLongitude: 59051 (28.48%)
Latitude: 0 (0.0%)
Longitude: 0 (0.0%)
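
As a possible alternative to the loop above, the same summary can be built without iterating, since the mean of a boolean mask is the fraction of missing values. This is only a sketch: `toy` is a made-up stand-in for `fire_locations`, and `round(2)` rounds where the loop above truncates, so the last digit can differ.

```python
import pandas as pd

# Hypothetical toy frame standing in for fire_locations (values are made up).
toy = pd.DataFrame({
    "IncidentName": ["A", None, "C", "D"],
    "FinalAcres": [None, None, None, 1.0],
    "Latitude": [64.8, 61.1, 60.3, 62.4],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the missing fraction per column.
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(2),
}).sort_values("percent", ascending=False)

print(missing_summary)
```

Sorting by percent puts the least-tracked attributes at the top, which makes columns like FinalAcres stand out immediately.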


Remove instances that are missing FireOutDateTime or IncidentSize, because we need both to calculate the hours burned and the suppression result.

In [36]:
# Drop rows missing either attribute in a single pass
fire_locations.dropna(subset=["FireOutDateTime", "IncidentSize"], inplace=True)

print("Total instances: " + str(fire_locations.shape[0]))

Total instances: 119225



Remove the fires with UniqueFireIdentifier _2014-IDNCF-000609_ and _2014-AKFAS-411093_ because they have dates in the year 1530, which cannot be right for 2014 incidents (see the FireDiscoveryDateTime of 1530/08/16 in row 0 of the preview above).

Then, find the difference between each fire's discovery date and time and its FireOut date and time to get the total time burned. Display the first few rows to verify the calculation. Finally, convert the time burned to hours so that we're working with a consistent time unit.

In [37]:
fire_locations.drop(fire_locations[fire_locations['UniqueFireIdentifier'] == '2014-AKFAS-411093'].index, inplace=True)
fire_locations.drop(fire_locations[fire_locations['UniqueFireIdentifier'] == '2014-IDNCF-000609'].index, inplace=True)

fire_locations['FireDiscoveryDateTime'] = pd.to_datetime(fire_locations['FireDiscoveryDateTime'], format='%Y/%m/%d %H:%M:%S+00')
fire_locations['FireOutDateTime'] = pd.to_datetime(fire_locations['FireOutDateTime'], format='%Y/%m/%d %H:%M:%S+00')
fire_locations['TimeBurned'] = fire_locations['FireOutDateTime'] - fire_locations['FireDiscoveryDateTime']

print(fire_locations[['FireDiscoveryDateTime', 'FireOutDateTime', 'TimeBurned']].head())

fire_locations['HoursBurned'] = fire_locations['TimeBurned'].dt.total_seconds() / (60*60)

  FireDiscoveryDateTime     FireOutDateTime       TimeBurned
1   2014-05-09 03:22:55 2014-05-10 20:27:38  1 days 17:04:43
2   2014-05-05 00:18:25 2014-05-05 00:33:26  0 days 00:15:01
3   2014-05-19 21:45:02 2014-05-25 20:02:25  5 days 22:17:23
4   2014-05-19 00:01:00 2014-05-29 23:42:41 10 days 23:41:41
5   2014-05-17 01:37:14 2014-05-20 00:51:20  2 days 23:14:06
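
Hardcoding the two identifiers works for this file, but one possible way to catch malformed dates in general is to let `pd.to_datetime` coerce them to `NaT` and drop those rows. The sketch below assumes a made-up frame `toy` with hypothetical identifiers. Its first row mimics the 1530 entry and the other two reuse timestamps from the preview above.

```python
import pandas as pd

# Hypothetical toy rows; the identifiers are invented for illustration.
toy = pd.DataFrame({
    "UniqueFireIdentifier": ["toy-001", "toy-002", "toy-003"],
    "FireDiscoveryDateTime": ["1530/08/16", "2014/05/09 03:22:55+00", "2014/05/05 00:18:25+00"],
    "FireOutDateTime": ["2014/05/13 01:53:46+00", "2014/05/10 20:27:38+00", "2014/05/05 00:33:26+00"],
})

fmt = "%Y/%m/%d %H:%M:%S+00"
for col in ["FireDiscoveryDateTime", "FireOutDateTime"]:
    # errors="coerce" turns values that do not match the format into NaT
    toy[col] = pd.to_datetime(toy[col], format=fmt, errors="coerce")

# Rows whose discovery date failed to parse
bad_ids = toy.loc[toy["FireDiscoveryDateTime"].isna(), "UniqueFireIdentifier"].tolist()
print(bad_ids)

# Keep only rows where both timestamps parsed, then compute hours burned
clean = toy.dropna(subset=["FireDiscoveryDateTime", "FireOutDateTime"]).copy()
clean["HoursBurned"] = (
    clean["FireOutDateTime"] - clean["FireDiscoveryDateTime"]
).dt.total_seconds() / 3600
print(clean[["UniqueFireIdentifier", "HoursBurned"]])
```

The trade-off is that coercion silently discards anything unparseable, so it is worth printing the affected identifiers, as done here, before dropping them.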


Now, let's get a sense of how the data is distributed by looking at the range and average of the two attributes that will feed into the suppression result: hours burned and incident size.

In [38]:
summary = fire_locations[['HoursBurned', 'IncidentSize']].describe()

time_burned_range = summary.loc['max', 'HoursBurned'] - summary.loc['min', 'HoursBurned']
acres_burned_range = summary.loc['max', 'IncidentSize'] - summary.loc['min', 'IncidentSize']
average_time_burned = summary.loc['mean', 'HoursBurned']
average_acres_burned = summary.loc['mean', 'IncidentSize']

print("HoursBurned Range:", time_burned_range)
print("HoursBurned Average:", average_time_burned)
print("IncidentSize Range:", acres_burned_range)
print("IncidentSize Average:", average_acres_burned)

HoursBurned Range: 28090.043611111112
HoursBurned Average: 295.28396181478774
IncidentSize Range: 589368.0
IncidentSize Average: 435.8770563314125


Quite the range there, and we know from Kole's graphs that, especially for acres burned, we have a ton of very small fires and not many large ones. With such a dramatic right (or positive) skew, we won't be able to fully normalize the data.

![Acres Burned Distribution](./acres-distribution-chart.svg)
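
To illustrate the skew and what a log transform does to it, here is a small sketch on invented numbers. `toy_acres` is not drawn from the dataset, it only imitates the shape described above (many tiny fires, one very large one).

```python
import numpy as np
import pandas as pd

# Made-up acreages: many tiny fires and one very large one (not real data).
toy_acres = pd.Series([0.1] * 50 + [1.0] * 30 + [10.0] * 15 + [500.0] * 4 + [100000.0])

# Sample skewness before and after a log(x + 1) transform
raw_skew = toy_acres.skew()
log_skew = np.log1p(toy_acres).skew()

print("mean:", toy_acres.mean(), "median:", toy_acres.median())
print("skew before log:", raw_skew)
print("skew after log1p:", log_skew)
```

A mean far above the median is a quick sign of right skew. On this toy series the log transform reduces the skew but does not remove it.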

We'll use a log to normalize as much as we can. We'll also shift the data by one (x + 1) before taking the log, because the log of a value below 1 is negative and we don't want negative values from fires that burned for less than an acre or less than an hour.

SuppressionResult is calculated as a weighted average of normalized time burned and normalized acreage burned. Acreage burned is given twice as much weight as time burned because we think it is a better indicator of how well a fire was suppressed. This average is then subtracted from 1 and converted to a percent, so a higher score means a smaller, shorter fire.

SuppressionResult = $\left(1 - \dfrac{\dfrac{\log(x + 1)}{\log(\text{MaxHoursBurned} + 1)} + \dfrac{2\log(y + 1)}{\log(\text{MaxAcresBurned} + 1)}}{3}\right) \times 100$

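where $x = \text{HoursBurned}$ and $y = \text{AcresBurned}$ (the IncidentSize column) for a given fire. In the code that follows, each denominator uses the attribute's range (max minus min) rather than its max, and non-positive values are normalized to 0.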
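
As a worked example of the formula, the sketch below scores a single fire. The two denominators are the "log time" and "log acres" values this notebook prints in the next cell, and the inputs are the hours and acres of the 1,906-acre fire that appears in the final output. The function name `suppression_result` is hypothetical.

```python
import math

# Denominators as printed by the notebook ("log time" and "log acres").
LOG_TIME_RANGE = 10.243206071815102
LOG_ACRES_RANGE = 13.286807752042323

def suppression_result(hours_burned, acres_burned):
    """Score one fire, mirroring the x > 0 guard used in the notebook code."""
    norm_time = math.log(hours_burned + 1) / LOG_TIME_RANGE if hours_burned > 0 else 0
    norm_acres = math.log(acres_burned + 1) / LOG_ACRES_RANGE if acres_burned > 0 else 0
    # Acreage counts twice, so the weighted sum is divided by 3
    return (1 - (norm_time + 2 * norm_acres) / 3) * 100

score = suppression_result(1104.806944, 1906.0)
print(round(score, 2))
```

A fire with zero hours and zero acres scores 100, and the score falls toward 0 as a fire approaches the largest and longest-burning values in the data.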

In [39]:
log_time_range = math.log(time_burned_range+1)
log_acres_range = math.log(acres_burned_range+1)
print("log time: " + str(log_time_range) + " log acres: " + str(log_acres_range))

fire_locations["NormalizedTime"] = fire_locations["HoursBurned"].apply(lambda x: math.log(x+1)/log_time_range if x > 0 else 0)
fire_locations["NormalizedAcreage"] = fire_locations["IncidentSize"].apply(lambda x: math.log(x+1)/log_acres_range if x > 0 else 0)
fire_locations['SupressionResult'] = (1 - (fire_locations['NormalizedTime'] + (2 * fire_locations['NormalizedAcreage']))/3) * 100

log time: 10.243206071815102 log acres: 13.286807752042323
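
The row-by-row `apply` calls above could also be written as vectorized operations, which tends to be faster on a frame of this size. A minimal sketch, assuming a made-up series `toy` in place of the HoursBurned column: `clip(lower=0)` reproduces the `x > 0` guard, since `log1p(0)` is 0. One difference to keep in mind is that a NaN stays NaN here, whereas the lambda would map it to 0.

```python
import math
import numpy as np
import pandas as pd

# Made-up hours, including a negative and a zero value to exercise the guard.
toy = pd.Series([-2.0, 0.0, 0.25, 41.08, 1104.8])
log_range = math.log(toy.max() - toy.min() + 1)

# Row-by-row version, as in the cell above
applied = toy.apply(lambda x: math.log(x + 1) / log_range if x > 0 else 0)

# Vectorized version: clip negatives to 0, then take log(1 + x) in one call
vectorized = np.log1p(toy.clip(lower=0)) / log_range

print(np.allclose(applied, vectorized))
```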


Finally, let's look at how well our suppression result holds up for both the bigger fires (over 500 acres) and the rest of the data (mostly much smaller fires).

In [40]:
filtered_fire_locations = fire_locations[fire_locations['IncidentSize'] > 500]
print(filtered_fire_locations[['FireDiscoveryDateTime', 'HoursBurned', 'IncidentSize', 'NormalizedTime', 'NormalizedAcreage', 'SupressionResult']].head())

print(fire_locations[['FireDiscoveryDateTime', 'HoursBurned', 'IncidentSize', 'NormalizedTime', 'NormalizedAcreage', 'SupressionResult']].head())

    FireDiscoveryDateTime  HoursBurned  IncidentSize  NormalizedTime  \
11    2014-05-19 22:46:36  1104.806944        1906.0        0.684193   
17    2014-05-20 00:03:50  4866.938889      196610.0        0.828884   
132   2014-05-11 21:02:00  5274.466667        5484.0        0.836732   
143   2014-04-19 23:30:00  2089.500000       73622.0        0.746364   
145   2014-05-17 21:00:00  1420.000000        1482.0        0.708676   

     NormalizedAcreage  SupressionResult  
11            0.568480         39.294887  
17            0.917375         11.212226  
132           0.647994         28.909317  
143           0.843447         18.891433  
145           0.549554         39.740507  
  FireDiscoveryDateTime  HoursBurned  IncidentSize  NormalizedTime  \
1   2014-05-09 03:22:55    41.078611           0.7        0.365075   
2   2014-05-05 00:18:25     0.250278           0.1        0.021806   
3   2014-05-19 21:45:02   142.289722           0.1        0.484699   
4   2014-05-19 00:01:00   263