Importing necessary libraries.

In [14]:
import numpy as np
import pandas as pd
import sqlite3
from datetime import date, datetime

Reading the dataset from the .sqlite file downloaded from https://www.kaggle.com/rtatman/188-million-us-wildfires

In [None]:
eng = sqlite3.connect('FPA_FOD_20170508.sqlite')
query = 'SELECT * FROM Fires'

In [None]:
df = pd.read_sql(query, eng)
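Since most columns get dropped later anyway, the selection could also be made in the SQL query itself. A minimal sketch, assuming an in-memory database with a toy Fires table: the rows and the reduced column set are made up for illustration, only the column names come from the dataset.

```python
import sqlite3
import pandas as pd

# Toy stand-in for FPA_FOD_20170508.sqlite (hypothetical rows, illustration only)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Fires (OBJECTID INTEGER, FIRE_YEAR INTEGER, '
             'FIRE_NAME TEXT, FIRE_SIZE REAL, STATE TEXT)')
conn.executemany('INSERT INTO Fires VALUES (?, ?, ?, ?, ?)',
                 [(1, 2005, 'A', 0.1, 'CA'), (2, 2006, 'B', 2.5, 'TX')])

# Name the wanted columns instead of SELECT *, so unneeded ones never reach pandas
query = 'SELECT OBJECTID, FIRE_YEAR, FIRE_SIZE, STATE FROM Fires'
sample = pd.read_sql(query, conn)
conn.close()

print(sample.columns.tolist())
```

With the real file this would reduce memory use during loading, at the cost of having to list the columns up front.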

Saving to .csv for easier access.

In [None]:
df.to_csv('wildfires.csv')

Checking the columns to decide which ones to drop. Column descriptions can be found here: https://www.kaggle.com/rtatman/188-million-us-wildfires

In [None]:
df = pd.read_csv('wildfires.csv', low_memory=False, index_col=0)

In [None]:
df.columns

Dropping unneeded columns. The following attributes are not related to the location, date or size of the wildfire and therefore are not of interest for my analysis.

In [None]:
columns_to_drop = ['FOD_ID', 'FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM', 'NWCG_REPORTING_AGENCY', 'NWCG_REPORTING_UNIT_ID', 
                   'NWCG_REPORTING_UNIT_NAME', 'SOURCE_REPORTING_UNIT', 'SOURCE_REPORTING_UNIT_NAME',
                   'LOCAL_FIRE_REPORT_ID', 'LOCAL_INCIDENT_ID', 'FIRE_CODE', 'FIRE_NAME', 'ICS_209_INCIDENT_NUMBER',
                   'ICS_209_NAME', 'MTBS_ID', 'MTBS_FIRE_NAME', 'COMPLEX_NAME', 'FIPS_CODE', 'FIPS_NAME', 'OWNER_CODE',
                   'OWNER_DESCR', 'COUNTY', 'Shape']

In [None]:
df.drop(columns_to_drop, axis=1, inplace=True)

In [None]:
df.columns

Renaming and lowercasing columns for easier reading. 

In [None]:
df.columns = df.columns.str.lower()

In [None]:
df.rename(columns={'objectid': 'id', 'fire_year': 'year', 'discovery_date': 'disc_date', 'discovery_doy': 'disc_doy',
                   'discovery_time': 'disc_time', 'stat_cause_code': 'cause_code', 'stat_cause_descr': 'cause',
                   'fire_size': 'size', 'fire_size_class': 'size_class', 'latitude': 'lat', 'longitude': 'lon'},
                   inplace=True)

Checking NaNs.

In [None]:
df.isna().sum()

Almost half of the observations are missing the discovery time, the control time and the control date. With this much missing data I will not base any of my analysis on the time of day at which a wildfire was discovered or controlled, so I can simply drop the disc_time, cont_date, cont_doy and cont_time columns.

In [None]:
df.drop(['disc_time', 'cont_date', 'cont_doy', 'cont_time'], axis=1, inplace=True)
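The share of missing values per column can make the "almost half" judgement explicit. A small sketch on toy data, where the values and the 0.4 threshold are assumptions for illustration and not figures from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: disc_time is missing in half of the rows, size is complete
toy = pd.DataFrame({'disc_time': [1300.0, np.nan, np.nan, 845.0],
                    'size': [0.1, 2.5, 1.0, 0.3]})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean()

# Hypothetical cut-off: flag columns with more than 40% missing values
threshold = 0.4
sparse_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_cols)
```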

Checking data types.

In [16]:
df.dtypes

id              int64
year            int64
disc_date      object
disc_doy        int64
cause_code    float64
cause          object
size          float64
size_class     object
lat           float64
lon           float64
state          object
dtype: object

The disc_date column can be converted to datetime format. Checking current format.

In [None]:
df.disc_date.head(1)

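The column holds Julian dates, a running count of days. Subtracting the Julian date of the Unix epoch (1970-01-01) gives the number of days since the epoch, which pandas can convert directly.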

In [None]:
epoch = pd.to_datetime(0, unit='s').to_julian_date()

In [None]:
df.disc_date = pd.to_datetime(df.disc_date - epoch, unit='D')

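In [None]:
if 'cont_date' in df.columns:  # cont_date was dropped above, so this only runs if that drop is skipped
    df['cont_date'] = pd.to_datetime(df['cont_date'] - epoch, unit='D')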
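To check the Julian date conversion on a single value, here is a small worked sketch. The input 2453403.5 is an illustrative Julian date chosen for the example, not a value quoted from the dataframe.

```python
import pandas as pd

# Julian date of the Unix epoch (1970-01-01 00:00), which is 2440587.5
epoch = pd.to_datetime(0, unit='s').to_julian_date()

# Illustrative Julian date: 12816 days after the Unix epoch
julian = pd.Series([2453403.5])

# Days since the epoch can be converted directly with unit='D'
converted = pd.to_datetime(julian - epoch, unit='D')

print(epoch)
print(converted.iloc[0])
```

Here 2453403.5 - 2440587.5 = 12816 days, which lands on 2005-02-02.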

Cleaning is complete. The dataframe now looks as follows.

In [None]:
df.head()

Saving to new .csv for easier access.

In [None]:
df.to_csv('wildfires_clean.csv')

During the analysis I found that the state column contains 52 distinct values rather than 50.

In [7]:
df = pd.read_csv('wildfires_clean.csv', index_col=0)

In [8]:
print(sorted(df.state.unique()))

['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']


The two extra entries are DC (District of Columbia), which is a federal district, and PR (Puerto Rico), which is an unincorporated territory. I will remove all rows for both from the dataframe and save it again.

In [None]:
df = df[df.state != 'DC']

In [None]:
df = df[df.state != 'PR']

In [None]:
df.state.unique()
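The two filters can also be combined into one step with isin. A minimal sketch on a toy dataframe, with rows invented for illustration:

```python
import pandas as pd

# Toy frame with two real states and the two non-state codes
toy = pd.DataFrame({'state': ['CA', 'DC', 'TX', 'PR', 'CA']})

# Keep only rows whose state code is not in the exclusion list
non_states = ['DC', 'PR']
filtered = toy[~toy['state'].isin(non_states)]

print(sorted(filtered['state'].unique()))
```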

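Adding a square kilometers column by converting acres (1 acre ≈ 0.00404686 km²).

In [None]:
# Despite its name, size_sqmt holds square kilometers, not square meters
df['size_sqmt'] = df['size'] * 0.00404686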
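The conversion factor follows from the definition of the international acre, 4046.8564224 m², divided by the 1,000,000 m² in a square kilometer. A short sketch with toy sizes; the column name size_sqkm is hypothetical and used only here:

```python
import pandas as pd

SQ_M_PER_ACRE = 4046.8564224                # international acre in square meters
SQ_KM_PER_ACRE = SQ_M_PER_ACRE / 1_000_000  # about 0.00404686

# Toy fire sizes in acres (illustration only)
toy = pd.DataFrame({'size': [1.0, 100.0, 5000.0]})
toy['size_sqkm'] = toy['size'] * SQ_KM_PER_ACRE

print(toy)
```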

Changing disc_date back to datetime format, since it is read from the .csv as plain strings.

In [17]:
df.disc_date = pd.to_datetime(df.disc_date)
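As an alternative, the date column could be parsed while reading the .csv. A minimal sketch, assuming a tiny in-memory file in place of wildfires_clean.csv (the two rows are made up):

```python
import io
import pandas as pd

# Stand-in for wildfires_clean.csv with an unnamed index column (toy rows)
csv_text = ',id,disc_date\n0,1,2005-02-02\n1,2,2004-05-12\n'

# parse_dates converts the listed columns to datetime64 during loading
reloaded = pd.read_csv(io.StringIO(csv_text), index_col=0,
                       parse_dates=['disc_date'])

print(reloaded.dtypes)
```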

Adding season column for analysis.

In [10]:
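def get_season(now):
    # Relies on the globals Y and seasons, which are defined in the next cell
    if isinstance(now, datetime):
        now = now.date()
    # Map every date onto the leap year Y so that only month and day matter
    now = now.replace(year=Y)
    return next(season for season, (start, end) in seasons
                if start <= now <= end)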

In [15]:
Y = 2000
seasons = [('winter', (date(Y,  1,  1),  date(Y,  3, 20))),
           ('spring', (date(Y,  3, 21),  date(Y,  6, 20))),
           ('summer', (date(Y,  6, 21),  date(Y,  9, 22))),
           ('autumn', (date(Y,  9, 23),  date(Y, 12, 20))),
           ('winter', (date(Y, 12, 21),  date(Y, 12, 31)))]

In [None]:
df['season'] = df.disc_date.apply(get_season)
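Row-by-row apply can be slow on a table of this size, so a vectorized variant may be worth considering. The sketch below is an alternative and not the function used above. The helper name season_from_dates is hypothetical, and the season boundaries are copied from the seasons list.

```python
import numpy as np
import pandas as pd

def season_from_dates(dates: pd.Series) -> pd.Series:
    """Label each date with a season using month*100 + day comparisons."""
    md = (dates.dt.month * 100 + dates.dt.day).to_numpy()
    conditions = [
        (md >= 321) & (md <= 620),    # 21 March to 20 June
        (md >= 621) & (md <= 922),    # 21 June to 22 September
        (md >= 923) & (md <= 1220),   # 23 September to 20 December
    ]
    labels = ['spring', 'summer', 'autumn']
    # Everything else (21 December to 20 March) is winter
    return pd.Series(np.select(conditions, labels, default='winter'),
                     index=dates.index)

# Toy dates around the season boundaries (illustration only)
dates = pd.Series(pd.to_datetime(['2005-02-02', '2012-03-20', '2012-03-21',
                                  '2010-06-21', '2001-09-23', '2015-12-21']))
result = season_from_dates(dates).tolist()
print(result)
```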

Collapsing winter, autumn and spring into rest_of_year.

In [None]:
# Keep 'summer' as is and label every other season 'rest_of_year'
df['season_div'] = np.where(df['season'] == 'summer', 'summer', 'rest_of_year')

Saving the updated .csv.

In [None]:
df.to_csv('wildfires_clean.csv')