# Wildfire Prediction Project
## Notebook 2) Feature Engineering 

In this notebook, I will finish cleaning the data for model 0 and develop additional features for model 1.

During the EDA, we completed some basic cleaning: we converted the float date feature to datetime and extracted the month, day, and day of week. This notebook continues with the following steps:

* Imputing the containment date (only needed to calculate the fire duration)
* Calculating the fire duration (in days)
* Dropping features with null values

### Import Libraries

In [7]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [8]:
df = pd.read_csv("../../data/fires_FIPS_EDA.csv", index_col = 0)

In [9]:
df.head()

Unnamed: 0,OBJECTID,FOD_ID,FPA_ID,SOURCE_SYSTEM_TYPE,SOURCE_SYSTEM,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,SOURCE_REPORTING_UNIT,SOURCE_REPORTING_UNIT_NAME,...,COUNTY,FIPS_CODE,FIPS_NAME,Shape,DISC_GREG_DATE,CONT_GREG_DATE,DISC_MONTH,DISC_DAY,DISC_DAYOFWEEK,LABEL
1,2,2,FS-1418827,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,61,61,Placer,b'\x00\x01\xad\x10\x00\x00T\xb6\xeej\xe2\x19^\...,2004-05-12,2004-05-12,5,12,2,1
2,3,3,FS-1418835,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,17,17,El Dorado,b'\x00\x01\xad\x10\x00\x00\xd0\xa5\xa0W\x13/^\...,2004-05-31,2004-05-31,5,31,0,2
3,4,4,FS-1418845,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,3,3,Alpine,b'\x00\x01\xad\x10\x00\x00\x94\xac\xa3\rt\xfa]...,2004-06-28,2004-07-03,6,28,0,1
4,5,5,FS-1418847,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,3,3,Alpine,b'\x00\x01\xad\x10\x00\x00@\xe3\xaa.\xb7\xfb]\...,2004-06-28,2004-07-03,6,28,0,1
5,6,6,FS-1418849,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,5,5,Amador,b'\x00\x01\xad\x10\x00\x00\xf0<~\x90\xa1\x06^\...,2004-06-30,2004-07-01,6,30,2,1


In [10]:
df.shape

(84893, 44)

### Fire Duration Feature
Let's relate fire size class to burn length so we can impute the missing containment dates.

In [11]:
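# dropna() returns a new DataFrame, so later column assignments do not
# write to a slice of df (avoids SettingWithCopyWarning)
duration = df[['DISCOVERY_DOY', 'CONT_DOY', 'FIRE_SIZE_CLASS']].dropna()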

In [12]:
duration.dtypes

DISCOVERY_DOY        int64
CONT_DOY           float64
FIRE_SIZE_CLASS     object
dtype: object

In [13]:
duration['Fire_duration'] = duration['CONT_DOY'] - duration['DISCOVERY_DOY']

In [14]:
duration.head()

Unnamed: 0,DISCOVERY_DOY,CONT_DOY,FIRE_SIZE_CLASS,Fire_duration
1,133,133.0,A,0.0
2,152,152.0,A,0.0
3,180,185.0,A,5.0
4,180,185.0,A,5.0
5,182,183.0,A,1.0


In [15]:
duration.describe()

Unnamed: 0,DISCOVERY_DOY,CONT_DOY,Fire_duration
count,71958.0,71958.0,71958.0
mean,207.283513,209.292157,2.008644
std,53.753622,54.332838,9.791609
min,1.0,1.0,-364.0
25%,183.0,184.0,0.0
50%,212.0,213.0,0.0
75%,237.0,240.0,1.0
max,366.0,366.0,335.0


The minimum of -364 shows that some durations are negative. It looks like these are fires that were discovered in one year and contained in the next, so the day-of-year counter wrapped around.

In [16]:
duration[duration['Fire_duration'] < 0]

Unnamed: 0,DISCOVERY_DOY,CONT_DOY,FIRE_SIZE_CLASS,Fire_duration
7937,217,7.0,G,-210.0
214230,283,180.0,A,-103.0
224562,365,1.0,A,-364.0
257222,305,5.0,A,-300.0
260645,195,41.0,A,-154.0
260646,195,41.0,A,-154.0
260647,196,41.0,A,-155.0


Only seven rows have this issue, so let's simply add 365 to the negative durations.

In [17]:
duration.loc[duration['Fire_duration'] < 0, 'Fire_duration'] = duration['Fire_duration'] + 365

In [18]:
# check that no negative durations remain
duration[duration['Fire_duration'] < 0]

Unnamed: 0,DISCOVERY_DOY,CONT_DOY,FIRE_SIZE_CLASS,Fire_duration
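
The wrap-around correction can be reproduced on a few made-up rows. This is only a sketch: the `toy` frame and its values are invented for illustration, and adding a flat 365 assumes a fire crosses at most one year boundary and ignores leap years (where the offset would be 366).

```python
import pandas as pd

# Hypothetical rows: same-day fire, New Year's Eve fire, late-autumn fire
toy = pd.DataFrame({
    "DISCOVERY_DOY": [133, 365, 305],
    "CONT_DOY": [133.0, 1.0, 5.0],
})

# Raw day-of-year difference; negative when containment falls in the next year
raw = toy["CONT_DOY"] - toy["DISCOVERY_DOY"]

# Keep non-negative values, shift negative ones by one (non-leap) year
toy["Fire_duration"] = raw.where(raw >= 0, raw + 365)

print(toy)
```

Computing the duration from the full Gregorian dates instead of the day-of-year columns would avoid the wrap-around entirely, at the cost of requiring both dates to be present.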


In [19]:
# mean fire duration by grouped size of fire
fireclass_mean = duration.groupby('FIRE_SIZE_CLASS')['Fire_duration'].mean()
fireclass_mean

FIRE_SIZE_CLASS
A     1.181335
B     2.128508
C     5.833396
D    10.256560
E    12.814898
F    22.422330
G    37.924198
Name: Fire_duration, dtype: float64

In [20]:
def fireMeanLength(x):
    '''
    Return the mean fire duration, as a Timedelta, for a given fire size class.
    x is a string from 'A' to 'G'.
    '''
    # Look up by label: integer positional indexing on a label-indexed Series
    # (fireclass_mean[position]) is deprecated in recent pandas versions.
    return pd.to_timedelta(fireclass_mean.loc[x], unit='D')

In [21]:
df['FIRE_MEAN_DURATION'] = df['FIRE_SIZE_CLASS'].apply(fireMeanLength)
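
As a possible alternative to the row-by-row `apply`, the class-to-duration lookup can be vectorized with `Series.map`. The sketch below uses a made-up `class_mean_days` Series in place of `fireclass_mean` and a tiny hypothetical `toy` frame; the numbers are not taken from the dataset.

```python
import pandas as pd

# Stand-in for fireclass_mean: mean duration in days, indexed by size class
# (values are invented for illustration)
class_mean_days = pd.Series({"A": 1.0, "B": 2.0, "C": 6.0}, name="Fire_duration")

toy = pd.DataFrame({"FIRE_SIZE_CLASS": ["A", "C", "B", "A"]})

# map() looks each class label up in the Series index;
# to_timedelta() then converts the whole column of day counts at once
toy["FIRE_MEAN_DURATION"] = pd.to_timedelta(
    toy["FIRE_SIZE_CLASS"].map(class_mean_days), unit="D"
)

print(toy)
```

A class label that is missing from the lookup Series would map to `NaN` and end up as `NaT`, so it is worth checking for nulls afterwards if the classes are not guaranteed to match.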

In [22]:
df['CONT_GREG_DATE'] = pd.to_datetime(df['CONT_GREG_DATE'])
df['DISC_GREG_DATE'] = pd.to_datetime(df['DISC_GREG_DATE'])

In [23]:
df['CONT_GREG_DATE'].dtype, df['DISC_GREG_DATE'].dtype

(dtype('<M8[ns]'), dtype('<M8[ns]'))

We will use the mean fire duration for each size class to impute the missing containment dates.

In [24]:
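# Fill missing CONT_GREG_DATE values with discovery date + mean duration of the size class.
# Assign the result back: calling fillna(inplace=True) on a selected column
# may operate on a temporary copy and leave df unchanged.
df['CONT_GREG_DATE'] = df['CONT_GREG_DATE'].fillna(df['DISC_GREG_DATE'] + df['FIRE_MEAN_DURATION'])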

# Calculating the burn length of fire
df['FIRE_DURATION'] = df['CONT_GREG_DATE'] - df['DISC_GREG_DATE']

In [25]:
# Convert FIRE_DURATION from timedelta to a whole number of days (int)
df['FIRE_DURATION'] = df['FIRE_DURATION'].dt.days
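
The imputation and duration steps can be traced on two made-up rows, one with a known containment date and one without. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "DISC_GREG_DATE": pd.to_datetime(["2004-06-28", "2004-07-10"]),
    "CONT_GREG_DATE": pd.to_datetime(["2004-07-03", None]),  # second date missing
    "FIRE_MEAN_DURATION": pd.to_timedelta([1.5, 5.75], unit="D"),
})

# Estimated containment date for every row; only used where the real one is missing
estimated = toy["DISC_GREG_DATE"] + toy["FIRE_MEAN_DURATION"]
toy["CONT_GREG_DATE"] = toy["CONT_GREG_DATE"].fillna(estimated)

# .dt.days keeps only the whole-day component of each timedelta
toy["FIRE_DURATION"] = (toy["CONT_GREG_DATE"] - toy["DISC_GREG_DATE"]).dt.days

print(toy[["DISC_GREG_DATE", "CONT_GREG_DATE", "FIRE_DURATION"]])
```

Because `.dt.days` drops the fractional part, an imputed duration of 5.75 days is stored as 5. If rounding to the nearest day is preferred, the mean durations could be rounded before they are added to the discovery date.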

In [26]:
# we can now drop the helper mean-duration column
df.drop('FIRE_MEAN_DURATION', axis = 1, inplace = True)

In [27]:
df.shape

(84893, 45)

### Dropping features with null values

In [28]:
df.isna().sum()

OBJECTID                          0
FOD_ID                            0
FPA_ID                            0
SOURCE_SYSTEM_TYPE                0
SOURCE_SYSTEM                     0
NWCG_REPORTING_AGENCY             0
NWCG_REPORTING_UNIT_ID            0
NWCG_REPORTING_UNIT_NAME          0
SOURCE_REPORTING_UNIT             0
SOURCE_REPORTING_UNIT_NAME        0
LOCAL_FIRE_REPORT_ID          39744
LOCAL_INCIDENT_ID             29393
FIRE_CODE                     40081
FIRE_NAME                      9865
ICS_209_INCIDENT_NUMBER       82680
ICS_209_NAME                  82680
MTBS_ID                       84054
MTBS_FIRE_NAME                84054
COMPLEX_NAME                  83639
FIRE_YEAR                         0
DISCOVERY_DATE                    0
DISCOVERY_DOY                     0
DISCOVERY_TIME                23618
STAT_CAUSE_CODE                   0
CONT_DATE                     12935
CONT_DOY                      12935
CONT_TIME                     29977
FIRE_SIZE                   

In [29]:
df.dropna(axis = 1, inplace=True)
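
Since `dropna(axis=1)` removes every column that contains at least one null, it can help to list the affected columns before dropping them. A minimal sketch on a hypothetical `toy` frame (column names borrowed from the dataset, values invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "FIRE_YEAR": [2004, 2005],
    "FIRE_NAME": ["POWER", None],   # one missing value
    "FIRE_SIZE": [0.1, 0.25],
})

# Columns with at least one null are the ones dropna(axis=1) will remove
null_counts = toy.isna().sum()
cols_to_drop = null_counts[null_counts > 0].index.tolist()
print("Dropping:", cols_to_drop)

# Non-inplace variant, so the original frame stays available for comparison
cleaned = toy.dropna(axis=1)
print(cleaned.columns.tolist())
```

If a column with only a few missing values should be kept, `dropna` also accepts a `thresh` argument that sets the minimum number of non-null values a column needs to survive.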

In [30]:
df.columns

Index(['OBJECTID', 'FOD_ID', 'FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM',
       'NWCG_REPORTING_AGENCY', 'NWCG_REPORTING_UNIT_ID',
       'NWCG_REPORTING_UNIT_NAME', 'SOURCE_REPORTING_UNIT',
       'SOURCE_REPORTING_UNIT_NAME', 'FIRE_YEAR', 'DISCOVERY_DATE',
       'DISCOVERY_DOY', 'STAT_CAUSE_CODE', 'FIRE_SIZE', 'FIRE_SIZE_CLASS',
       'LATITUDE', 'LONGITUDE', 'OWNER_CODE', 'OWNER_DESCR', 'STATE', 'COUNTY',
       'FIPS_CODE', 'FIPS_NAME', 'Shape', 'DISC_GREG_DATE', 'CONT_GREG_DATE',
       'DISC_MONTH', 'DISC_DAY', 'DISC_DAYOFWEEK', 'LABEL', 'FIRE_DURATION'],
      dtype='object')

#### Save cleaned dataset to file

In [35]:
df.to_csv('../../data/fires_clean_FIPS.csv') 
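
CSV files do not store dtypes, so the datetime columns will come back as strings unless they are parsed again on load. The round trip below is a sketch that uses an in-memory buffer and a made-up one-row frame instead of the real file.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "DISC_GREG_DATE": pd.to_datetime(["2004-05-12"]),
    "CONT_GREG_DATE": pd.to_datetime(["2004-05-12"]),
    "FIRE_DURATION": [0],
})

# Write to an in-memory buffer (stand-in for the CSV file on disk)
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype
reloaded = pd.read_csv(
    buffer, index_col=0, parse_dates=["DISC_GREG_DATE", "CONT_GREG_DATE"]
)
print(reloaded.dtypes)
```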