Import Statements

In [1]:
import pandas as pd
import sqlite3
import numpy as np

Connect to the data. The data is in a SQLite database file

In [2]:
cursor = sqlite3.connect('./Data/FPA_FOD_20210617.sqlite')

Define a query to pull in all the records from the Database file

In [3]:
query = "SELECT f.* from Fires f"
all_data = pd.read_sql_query(query, cursor)
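Selecting f.* loads every column into memory before anything is dropped. As a rough sketch, the same read_sql_query call can name only the columns of interest. The in-memory table and its rows below are made up for illustration and only mimic the shape of the Fires table.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the Fires table, built in memory for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Fires (FOD_ID INTEGER, FIRE_NAME TEXT, "
    "FIRE_SIZE_CLASS TEXT, FIPS_CODE TEXT)"
)
conn.executemany(
    "INSERT INTO Fires VALUES (?, ?, ?, ?)",
    [(1, "DEMO_1", "A", "00001"), (2, "DEMO_2", "G", "00002"), (3, "DEMO_3", "B", None)],
)
conn.commit()

# Name the columns explicitly instead of pulling f.*
query = "SELECT f.FOD_ID, f.FIRE_SIZE_CLASS, f.FIPS_CODE FROM Fires f"
demo = pd.read_sql_query(query, conn)
conn.close()

print(demo)
```

Whether this helps depends on how many of the columns end up being used; for a quick look at the full schema, f.* is still convenient.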

Take a look at a sample record in the dataset. Note that iloc[1] returns the second row (FOD_ID 2), since positions are zero-based

In [4]:
all_data.iloc[1]

FOD_ID                                                  2
FPA_ID                                         FS-1418827
SOURCE_SYSTEM_TYPE                                    FED
SOURCE_SYSTEM                                 FS-FIRESTAT
NWCG_REPORTING_AGENCY                                  FS
NWCG_REPORTING_UNIT_ID                            USCAENF
NWCG_REPORTING_UNIT_NAME         Eldorado National Forest
SOURCE_REPORTING_UNIT                                 503
SOURCE_REPORTING_UNIT_NAME       Eldorado National Forest
LOCAL_FIRE_REPORT_ID                                   13
LOCAL_INCIDENT_ID                                      13
FIRE_CODE                                            AAC0
FIRE_NAME                                          PIGEON
ICS_209_PLUS_INCIDENT_JOIN_ID                        None
ICS_209_PLUS_COMPLEX_JOIN_ID                         None
MTBS_ID                                              None
MTBS_FIRE_NAME                                       None
COMPLEX_NAME  

I only need a handful of these columns, so I subset to the ones I require

In [5]:
columns = ['DISCOVERY_DATE','FIRE_SIZE_CLASS', 'LATITUDE', 'LONGITUDE', 
           'FIPS_CODE']
subset_data = all_data[columns]

Let's check data types to make sure nothing funky happened

In [6]:
subset_data.dtypes

DISCOVERY_DATE      object
FIRE_SIZE_CLASS     object
LATITUDE           float64
LONGITUDE          float64
FIPS_CODE           object
dtype: object

In [7]:
subset_data.loc[:, 'DISCOVERY_DATE'] = pd.to_datetime(subset_data.loc[:, 'DISCOVERY_DATE'].copy())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
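The warning above appears because subset_data was created by selecting columns from all_data, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it, assuming the extra memory for an independent copy is acceptable, is to call .copy() when subsetting. The small frame here is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for all_data
all_data = pd.DataFrame({
    "DISCOVERY_DATE": ["2004-05-12", "2005-02-02"],
    "FIRE_SIZE_CLASS": ["A", "G"],
    "STATE": ["CA", "CA"],
})

columns = ["DISCOVERY_DATE", "FIRE_SIZE_CLASS"]

# .copy() makes the subset an independent frame, so later
# assignments are unambiguous and do not trigger the warning
subset = all_data[columns].copy()
subset["DISCOVERY_DATE"] = pd.to_datetime(subset["DISCOVERY_DATE"])

print(subset.dtypes)
```

With the copy in place, the later in-place drop, dropna, and column assignment on the subset should also run without the warning.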


In [8]:
subset_data['DISCOVERY_DATE'].describe(datetime_is_numeric=True)

count                          2166753
mean     2005-10-09 12:01:09.283393536
min                1992-01-01 00:00:00
25%                1999-08-22 00:00:00
50%                2006-03-04 00:00:00
75%                2011-10-20 00:00:00
max                2018-12-31 00:00:00
Name: DISCOVERY_DATE, dtype: object

Let's get some descriptions of these columns to see what we are working with

Let's start with the categorical columns

In [9]:
subset_data[['FIRE_SIZE_CLASS']].value_counts()

FIRE_SIZE_CLASS
B                  1047772
A                   810694
C                   246247
D                    32261
E                    16227
F                     9097
G                     4455
dtype: int64

In [10]:
subset_data[['FIPS_CODE']].describe()

        FIPS_CODE
count     1509518
unique       2925
top          6065
freq        14989


Looking at the missing data

In [11]:
subset_data.isna().sum()

DISCOVERY_DATE          0
FIRE_SIZE_CLASS         0
LATITUDE                0
LONGITUDE               0
FIPS_CODE          657235
dtype: int64

Impute missing FIPS_CODE values using the FCC API, which can provide a FIPS code for a given lat/long. Lat/long values are available for all wildfires.

FCC API Details: https://geo.fcc.gov/api/census/#!/block/get_block_find

In [12]:
# Imputation with API Happens Here
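The imputation cell is left as a placeholder, so the following is only a sketch of how it might look, not the code that was actually run. It assumes the block lookup is reachable at https://geo.fcc.gov/api/census/block/find with latitude, longitude, and format parameters, and that the JSON response carries the county code under County -> FIPS; both should be checked against the FCC documentation linked above. The helper names (fetch_json, lookup_fips, impute_large_fires) are hypothetical, and the demo uses a stub instead of a real request.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

# Assumed endpoint, see the FCC API details for the authoritative path
FCC_URL = "https://geo.fcc.gov/api/census/block/find"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (makes a real network call)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def lookup_fips(lat, lon, fetch=fetch_json):
    """Return the county FIPS for a point, or None if it is missing.

    Assumes the response nests the code under County -> FIPS.
    """
    params = urllib.parse.urlencode(
        {"latitude": lat, "longitude": lon, "format": "json"}
    )
    payload = fetch(f"{FCC_URL}?{params}")
    county = payload.get("County") or {}
    return county.get("FIPS")


def impute_large_fires(df, fetch=fetch_json):
    """Fill missing FIPS_CODE for size class G fires only; returns a copy."""
    out = df.copy()
    mask = out["FIPS_CODE"].isna() & (out["FIRE_SIZE_CLASS"] == "G")
    if mask.any():
        codes = [
            lookup_fips(lat, lon, fetch)
            for lat, lon in zip(out.loc[mask, "LATITUDE"], out.loc[mask, "LONGITUDE"])
        ]
        out.loc[mask, "FIPS_CODE"] = codes
    return out


def stub_fetch(url):
    """Stand-in for the network call so the demo runs offline."""
    return {"County": {"FIPS": "99999"}}


# Made-up rows: only the first one is a G fire with a missing code
demo = pd.DataFrame({
    "FIRE_SIZE_CLASS": ["G", "A", "G"],
    "LATITUDE": [40.0, 41.0, 34.0],
    "LONGITUDE": [-121.0, -122.0, -118.0],
    "FIPS_CODE": [None, None, "06037"],
})
filled = impute_large_fires(demo, fetch=stub_fetch)
print(filled)
```

Restricting the mask to size class G keeps the number of requests small, which matches the decision not to call the service for every missing row.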

After using the lat/longs, we can drop those columns, as they are no longer needed

In [13]:
subset_data.drop(['LATITUDE', 'LONGITUDE'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


At this point, drop the remaining nulls. I won't call the FCC service for all of the nulls, since 600K is too many requests. I only want to fill in the size class G fires, because a handful of them lack state/county info and I want to keep as many of the large fires as possible

In [14]:
subset_data.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  subset_data.dropna(inplace=True)


Create the Discovery Month Column

In [15]:
subset_data['DISCOVERY_MONTH'] = subset_data['DISCOVERY_DATE'].to_numpy().astype('datetime64[M]')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  subset_data['DISCOVERY_MONTH'] = subset_data['DISCOVERY_DATE'].to_numpy().astype('datetime64[M]')
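The cell above truncates each discovery date to the first day of its month by casting to NumPy's month-resolution datetime type. As an alternative sketch that stays within pandas, the same truncation can be done through monthly periods. The dates below are made up for illustration.

```python
import pandas as pd

# Made-up discovery dates
dates = pd.Series(pd.to_datetime(["2004-05-12", "2004-05-31", "2018-12-31"]))

# Convert to monthly periods, then back to timestamps at the period start
months = dates.dt.to_period("M").dt.to_timestamp()

print(months)
```

Both approaches map every date in a month to the same month-start value, which is what the later groupby on DISCOVERY_MONTH relies on.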


Clean data looks like this

In [16]:
subset_data.iloc[1]

DISCOVERY_DATE     2004-05-12 00:00:00
FIRE_SIZE_CLASS                      A
FIPS_CODE                        06061
DISCOVERY_MONTH    2004-05-01 00:00:00
Name: 1, dtype: object

Formatting as a matrix

In [17]:
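# Size classes run from A (smallest) to G (largest), so the string max is the largest class that month
final_data = subset_data.groupby(by=['DISCOVERY_MONTH', 'FIPS_CODE']).agg({'FIRE_SIZE_CLASS': 'max'}).reset_index()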

In [18]:
# FIPS 06037 is Los Angeles County California
example = final_data[final_data['FIPS_CODE'] == '06037']
example = example.pivot(index='DISCOVERY_MONTH', columns='FIPS_CODE', values='FIRE_SIZE_CLASS')
example[example.columns] = np.where(example == 'G', 1, 0)

In [19]:
complete_index = pd.date_range(start=subset_data['DISCOVERY_MONTH'].min(), end=subset_data['DISCOVERY_MONTH'].max(),
                              freq='MS')

In [20]:
example = example.reindex(index=complete_index, fill_value=0)
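The groupby, pivot, and reindex steps above can be traced on a tiny frame. This is a sketch on made-up rows, not the real data: it keeps the largest size class per county and month, turns one county into a 0/1 indicator for class G, and fills months with no recorded fires with 0.

```python
import pandas as pd

# Made-up fires: county 06037 has a G fire in January and a C fire in March
toy = pd.DataFrame({
    "DISCOVERY_MONTH": pd.to_datetime(
        ["2004-01-01", "2004-01-01", "2004-03-01", "2004-03-01"]
    ),
    "FIPS_CODE": ["06037", "06037", "06037", "06061"],
    "FIRE_SIZE_CLASS": ["A", "G", "C", "B"],
})

# Largest size class per (month, county); G sorts last alphabetically
monthly_max = (
    toy.groupby(["DISCOVERY_MONTH", "FIPS_CODE"])
    .agg({"FIRE_SIZE_CLASS": "max"})
    .reset_index()
)

# One county, months as rows, then a 1/0 flag for a class G month
county = monthly_max[monthly_max["FIPS_CODE"] == "06037"]
wide = county.pivot(index="DISCOVERY_MONTH", columns="FIPS_CODE", values="FIRE_SIZE_CLASS")
indicator = (wide == "G").astype(int)

# Months with no fires at all are missing from the pivot, so add them as 0
full_index = pd.date_range(start="2004-01-01", end="2004-04-01", freq="MS")
indicator = indicator.reindex(index=full_index, fill_value=0)

print(indicator)
```

The reindex step matters because a month without any recorded fire would otherwise be absent from the series rather than counted as a month without a class G fire.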

In [21]:
example.describe()

FIPS_CODE     06037
count         324.0
mean       0.037037
std        0.189145
min             0.0
25%             0.0
50%             0.0
75%             0.0
max             1.0
