In [1]:
import pandas as pd
import geopandas as gpd
from datetime import datetime

Looking at our fire data, we can see that many of the start and end dates are missing, which is a problem when we want to use them as features for predicting fires.

Let's look at the data again and see how much information is missing:


In [2]:
fire_df = pd.read_csv('Data/Fire_data_1990to2022.csv')



In [3]:
fire_df.shape

(35665, 23)

In [4]:
fire_df.isna().sum()

Unnamed: 0.1        0
Unnamed: 0          0
YEAR                0
NFIREID             0
BASRC               0
FIREMAPS            0
FIREMAPM            0
FIRECAUS            0
BURNCLAS            0
SDATE           23311
EDATE           23311
AFSDATE          8336
AFEDATE         22014
CAPDATE         20698
POLY_HA             0
ADJ_HA              0
ADJ_FLAG            0
AGENCY              0
BT_GID              0
VERSION             0
COMMENTS         7321
geometry            0
COMMENT         34844
dtype: int64

|Name|Description|
|-----|---------|
|YEAR|is the fire year.|
|NFIREID|is a uniquely assigned ID for each fire event over a spatial region and for a specific year. It is the common ID used to link a fire event crossing a provincial, territorial, or national park boundary for a specific year.|
|BASRC| |
|FIREMAPS|describes the data source or platform used to identify the burn.|
|FIREMAPM|describes the method used to delineate the burn polygon.|
|FIRECAUS|describes the ignition source of the fire recorded by the agency.|
|BURNCLAS|is the class type of the burns incurred by the fire.|
|SDATE|is the date of the first detected hotspot within the spatial extent of the fire event. Null if no hotspots were detected.|
|EDATE|is the date of the last detected hotspot within the spatial extent of the fire event. Null if no hotspots were detected.|
|AFSDATE|is the fire start date reported by the agency. It could also represent the recorded date or detection date of the fire by the agency.|
|AFEDATE|is the fire end date reported by the agency. Where different dates are recorded by agencies for cross-border fires, NBAC uses the last date.|
|CAPDATE|is the acquisition date of the source data. Examples include the date of GPS acquisition, air photo acquisition, and satellite image acquisition. Null if not provided.|
|POLY_HA|is the total area calculated in hectares using the Canada Albers Equal Area Conic projection.|
|ADJ_HA|is an adjusted area burned, calculated in hectares. Area burned adjustment models are described in https://doi.org/10.1088/1748-9326/abfb2c.|
|ADJ_FLAG| |
|AGENCY|is the location (Province, Territory, or Parks Canada) where the fire perimeter is mapped.|
|BT_GID|is a global identifier (GID) that concatenates the fire year and NFIREID. This identifier is useful for selecting unique fire records merged by year.|

In [5]:
ratio_SDATE = fire_df['SDATE'].isna().sum()/fire_df.shape[0]
ratio_SDATE

0.6536099817748493
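
The same ratio can be computed for every column at once, which makes it easier to compare the date columns side by side. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are invented for illustration and are not taken from the fire data); it relies on the fact that the mean of a boolean `isna()` mask is the fraction of missing values.

```python
import pandas as pd

# Toy stand-in for fire_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "SDATE": ["2005-06-08", None, None, "2005-07-05"],
    "EDATE": ["2005-06-20", None, None, None],
    "AFSDATE": ["2005-06-07", "2005-06-01", None, "2005-07-04"],
})

# isna() gives a boolean mask; its column-wise mean is the fraction missing.
missing_ratio = toy_df.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to the real data, this would be `fire_df.isna().mean()`, and the `SDATE` entry should match `ratio_SDATE` above.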


We can see that about 65% of the SDATE values are missing. That is too much missing data to interpolate reliably, so I will pivot from predicting the start and end dates to building a model that predicts whether a fire is more likely in a given month, depending on the weather. To do that, I will have to:
- Create a new month column and fill it in from the available SDATE values
- Drop SDATE, EDATE, AFSDATE, AFEDATE, and CAPDATE, as we won't be using any other dates moving forward
- Drop the comment columns (COMMENTS and COMMENT), as they are not needed moving forward
- Create a separate DataFrame containing only the rows that have an SDATE
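
The steps above could be sketched roughly as follows. This is only an illustration on a small made-up DataFrame: `toy_df`, its values, and the names `MONTH`, `cleaned_df`, and `dated_df` are assumptions, not names from the notebook. Because SDATE is dropped in the second step, the sketch uses a non-missing `MONTH` as the marker for rows that originally had an SDATE.

```python
import pandas as pd

# Toy stand-in for fire_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "YEAR": [2005, 2005, 2006],
    "SDATE": ["2005-06-08", None, "2006-07-16"],
    "EDATE": ["2005-06-20", None, "2006-08-01"],
    "AFSDATE": ["2005-06-07", "2005-06-01", None],
    "AFEDATE": [None, None, None],
    "CAPDATE": [None, "2005-09-01", None],
    "COMMENTS": ["a", None, None],
    "COMMENT": [None, None, None],
})

# Step 1: derive the month from SDATE (NaN where SDATE is missing).
toy_df["MONTH"] = pd.to_datetime(toy_df["SDATE"]).dt.month

# Steps 2 and 3: drop the date columns and the comment columns.
date_cols = ["SDATE", "EDATE", "AFSDATE", "AFEDATE", "CAPDATE"]
comment_cols = ["COMMENTS", "COMMENT"]
cleaned_df = toy_df.drop(columns=date_cols + comment_cols)

# Step 4: keep only the rows that had an SDATE, i.e. a non-missing MONTH.
dated_df = cleaned_df.dropna(subset=["MONTH"]).copy()
dated_df["MONTH"] = dated_df["MONTH"].astype(int)

print(dated_df)
```

Keeping `cleaned_df` and `dated_df` separate leaves the full set of fires available for anything that does not depend on the month.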

In [6]:
# Parse SDATE as datetimes (vectorized); missing values become NaT.
pd.to_datetime(fire_df['SDATE'])

0       2005-06-08
1       2005-06-01
2       2005-07-05
3       2005-06-01
4       2005-07-16
           ...    
35660          NaT
35661          NaT
35662          NaT
35663          NaT
35664          NaT
Name: SDATE, Length: 35665, dtype: datetime64[ns]