# **Wildfire** **Analysis**

As many as 90 percent of wildland fires in the United States are caused by people, according to the U.S. Department of the Interior. Human-caused fires result from unattended campfires, debris burning, downed power lines, negligently discarded cigarettes, and intentional acts of arson. The remaining 10 percent are started by lightning or lava.

According to Verisk’s 2019 Wildfire Risk Analysis, 4.5 million U.S. homes were identified as being at high or extreme risk of wildfire, with more than 2 million in California alone. (Source: [Insurance Information Institute](https://www.iii.org/fact-statistic/facts-statistics-wildfires))

**EDA Thoughts:**


1.   Top three results of wildfires
2.   Narrowing down to specific states:
        *   State with the highest wildfire damage
        *   Causes of fires in that state
3.   Are we seeing spikes and dips over the years? If so, why?
4.   Which main features can we extract and correlate to help with future predictions?
5.   Aside from the yearly analysis, which months in each year have the highest number of fires? What could be the cause? What is the weather like around that time of year?
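Questions 1 and 2 can be sketched on a few toy rows before running them on the full dataset. The sketch below assumes that "damage" means total `FIRE_SIZE` (acres) and that the ranking of interest is by cause; both readings are assumptions, and the rows are made up.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
fires = pd.DataFrame({
    "STATE": ["CA", "CA", "TX", "AK", "TX", "CA"],
    "STAT_CAUSE_DESCR": ["Lightning", "Debris Burning", "Arson",
                         "Lightning", "Debris Burning", "Lightning"],
    "FIRE_SIZE": [0.1, 5.0, 12.0, 900.0, 3.0, 40.0],
})

# Three most frequent causes across all states.
top_causes = fires["STAT_CAUSE_DESCR"].value_counts().head(3)

# State with the largest total burned area (assumed proxy for damage).
worst_state = fires.groupby("STATE")["FIRE_SIZE"].sum().idxmax()

# Causes recorded in that state, most frequent first.
causes_in_worst = (
    fires.loc[fires["STATE"] == worst_state, "STAT_CAUSE_DESCR"].value_counts()
)

print(top_causes)
print(worst_state)
print(causes_in_worst)
```

Ranking by fire count instead of acres would only require swapping `sum()` for `size()`.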









In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## **READ DATA IN**

In [9]:
df=pd.read_csv('gdrive/MyDrive/Big_Data_Princess/Fires_1.csv')

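When a large CSV has columns that mix numbers and text, pandas may infer dtypes inconsistently. A minimal sketch of declaring dtypes up front, assuming `COUNTY` is such a column (the notebook does not say which columns are affected). The in-memory CSV stands in for `Fires_1.csv`, and its third row is made up.

```python
import io
import pandas as pd

# Stand-in for Fires_1.csv: a tiny in-memory CSV with a few of its columns.
# The third row is invented to show a text value in COUNTY.
csv_text = """FPA_ID,FIRE_YEAR,STATE,COUNTY,DISCOVERY_DATE,FIRE_SIZE
FS-1418826,2005,CA,63,2005-02-02 00:00:00,0.1
FS-1418827,2004,CA,61,2004-05-12 00:00:00,0.25
FS-0000001,2004,TX,Harris,2004-06-01 00:00:00,3.0
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"COUNTY": str, "STATE": str},  # read these columns as text
    parse_dates=["DISCOVERY_DATE"],       # parse the date column while loading
)

print(sample.dtypes)
```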


In [10]:
df.head()

Unnamed: 0,FPA_ID,FIRE_YEAR,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,COUNTY,FIRE_NAME,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,DISCOVERY_TIME,CONT_DATE,CONT_TIME,OWNER_CODE,OWNER_DESCR
0,FS-1418826,2005,9.0,Miscellaneous,40.036944,-121.005833,CA,63,FOUNTAIN,2005-02-02 00:00:00,0.1,A,1300.0,2005-02-02 00:00:00,1730.0,5.0,USFS
1,FS-1418827,2004,1.0,Lightning,38.933056,-120.404444,CA,61,PIGEON,2004-05-12 00:00:00,0.25,A,845.0,2004-05-12 00:00:00,1530.0,5.0,USFS
2,FS-1418835,2004,5.0,Debris Burning,38.984167,-120.735556,CA,17,SLACK,2004-05-31 00:00:00,0.1,A,1921.0,2004-05-31 00:00:00,2024.0,13.0,STATE OR PRIVATE
3,FS-1418845,2004,1.0,Lightning,38.559167,-119.913333,CA,3,DEER,2004-06-28 00:00:00,0.1,A,1600.0,2004-07-03 00:00:00,1400.0,5.0,USFS
4,FS-1418847,2004,1.0,Lightning,38.559167,-119.933056,CA,3,STEVENOT,2004-06-28 00:00:00,0.1,A,1600.0,2004-07-03 00:00:00,1200.0,5.0,USFS


In [11]:
df.columns

Index(['FPA_ID', 'FIRE_YEAR', 'STAT_CAUSE_CODE', 'STAT_CAUSE_DESCR',
       'LATITUDE', 'LONGITUDE', 'STATE', 'COUNTY', 'FIRE_NAME',
       'DISCOVERY_DATE', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'DISCOVERY_TIME',
       'CONT_DATE', 'CONT_TIME', 'OWNER_CODE', 'OWNER_DESCR'],
      dtype='object')

## **Exploratory** **Analysis**

In [None]:
# Visualize the relationships between wildfires and their causes

In [13]:
fire_by_state_size = df.groupby('STATE')['FIRE_SIZE'].mean().reset_index()


In [14]:
fire_by_state_size.head()

Unnamed: 0,STATE,FIRE_SIZE
0,AK,2509.779198
1,AL,13.82823
2,AR,16.072761
3,AZ,77.901837
4,CA,67.242725
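A mean alone can be dominated by a few very large fires, so it may help to view the count and median next to it. A minimal sketch on made-up rows; the output column names `fire_count`, `mean_size`, and `median_size` are illustrative.

```python
import pandas as pd

# Made-up rows; one very large fire dominates the mean in each state.
fires = pd.DataFrame({
    "STATE": ["AK", "AK", "CA", "CA", "CA"],
    "FIRE_SIZE": [5000.0, 0.5, 0.1, 0.25, 200.0],
})

# Count, mean, and median fire size per state in a single pass.
state_summary = (
    fires.groupby("STATE")["FIRE_SIZE"]
    .agg(fire_count="count", mean_size="mean", median_size="median")
    .sort_values("mean_size", ascending=False)
    .reset_index()
)

print(state_summary)
```

In this toy data the CA mean is far above its median, which is the kind of skew worth checking for in the real data.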


In [30]:
# get the count of fire cases per state
from collections import Counter 
Counter(df['STATE'].values)

Counter({'AK': 12843,
         'AL': 66570,
         'AR': 31663,
         'AZ': 71586,
         'CA': 189550,
         'CO': 34157,
         'CT': 4976,
         'DC': 66,
         'DE': 171,
         'FL': 90261,
         'GA': 168867,
         'HI': 9895,
         'IA': 4134,
         'ID': 36698,
         'IL': 2327,
         'IN': 2098,
         'KS': 7673,
         'KY': 27089,
         'LA': 30013,
         'MA': 2626,
         'MD': 3622,
         'ME': 13150,
         'MI': 10502,
         'MN': 44769,
         'MO': 17953,
         'MS': 79230,
         'MT': 40767,
         'NC': 111277,
         'ND': 15201,
         'NE': 7973,
         'NH': 2452,
         'NJ': 25949,
         'NM': 37478,
         'NV': 16956,
         'NY': 80870,
         'OH': 3479,
         'OK': 43239,
         'OR': 61088,
         'PA': 8712,
         'PR': 22081,
         'RI': 480,
         'SC': 81315,
         'SD': 30963,
         'TN': 31154,
         'TX': 142021,
         'UT': 30725,
   

In [32]:
import altair as alt
# Total acres burned per year
size_by_year = df.groupby('FIRE_YEAR')['FIRE_SIZE'].sum().reset_index()
size_by_year.rename(columns={'FIRE_SIZE': 'Acres_Affected'}, inplace=True)

alt.Chart(size_by_year).mark_line().encode(
    alt.X('FIRE_YEAR:N', title=None),
    alt.Y('Acres_Affected'),
).properties(
    title='Acres affected each year',
    width=600,
    height=300,
).configure_axis(
    labelFontSize=14,
    titleFontSize=14
)

Notes: The acres affected dip in some years and then rise again.
- What explains the dips in 2008, 2010, and 2014?
- What happened in those years that could account for them?
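One way to locate dips numerically rather than by eye is the year-over-year percentage change. A minimal sketch with made-up yearly totals standing in for `size_by_year`; the 25% threshold and the `yoy_pct` column name are arbitrary choices.

```python
import pandas as pd

# Made-up yearly totals standing in for size_by_year.
size_by_year = pd.DataFrame({
    "FIRE_YEAR": [2006, 2007, 2008, 2009, 2010],
    "Acres_Affected": [100.0, 120.0, 60.0, 90.0, 45.0],
})

# Order chronologically, then compute the change relative to the prior year.
yearly = size_by_year.sort_values("FIRE_YEAR").reset_index(drop=True)
yearly["yoy_pct"] = yearly["Acres_Affected"].pct_change() * 100

# Years in which acreage fell by more than 25% (threshold is arbitrary).
dips = yearly.loc[yearly["yoy_pct"] < -25, "FIRE_YEAR"].tolist()

print(yearly)
print(dips)
```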


In [74]:
# Extract the month in which each fire was discovered from DISCOVERY_DATE,
# so fires can be compared by month as well as by year.
df['DISCOVERY_DATE'] = pd.to_datetime(df['DISCOVERY_DATE'], errors='coerce')  # unparseable dates become NaT

# Create month-year and month columns
df['Month_Year'] = df.DISCOVERY_DATE.dt.strftime('%b %Y')
df['Month'] = df.DISCOVERY_DATE.dt.strftime('%b')

In [75]:
df.head()

Unnamed: 0,FPA_ID,FIRE_YEAR,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,COUNTY,FIRE_NAME,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,DISCOVERY_TIME,CONT_DATE,CONT_TIME,OWNER_CODE,OWNER_DESCR,Month_Year,Month,Year
0,FS-1418826,2005,9.0,Miscellaneous,40.036944,-121.005833,CA,63,FOUNTAIN,2005-02-02,0.1,A,1300.0,2005-02-02 00:00:00,1730.0,5.0,USFS,Feb 2005,Feb,2005
1,FS-1418827,2004,1.0,Lightning,38.933056,-120.404444,CA,61,PIGEON,2004-05-12,0.25,A,845.0,2004-05-12 00:00:00,1530.0,5.0,USFS,May 2004,May,2004
2,FS-1418835,2004,5.0,Debris Burning,38.984167,-120.735556,CA,17,SLACK,2004-05-31,0.1,A,1921.0,2004-05-31 00:00:00,2024.0,13.0,STATE OR PRIVATE,May 2004,May,2004
3,FS-1418845,2004,1.0,Lightning,38.559167,-119.913333,CA,3,DEER,2004-06-28,0.1,A,1600.0,2004-07-03 00:00:00,1400.0,5.0,USFS,Jun 2004,Jun,2004
4,FS-1418847,2004,1.0,Lightning,38.559167,-119.933056,CA,3,STEVENOT,2004-06-28,0.1,A,1600.0,2004-07-03 00:00:00,1200.0,5.0,USFS,Jun 2004,Jun,2004


In [76]:
# Which month has the most fires?
# Keep the columns needed to group by month and year for each state
sub_with_Month = df[['Month', 'FIRE_YEAR', 'STATE', 'COUNTY', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'Month_Year']]

In [77]:
sub_with_Month.head()

Unnamed: 0,Month,FIRE_YEAR,STATE,COUNTY,FIRE_SIZE,FIRE_SIZE_CLASS,Month_Year
0,Feb,2005,CA,63,0.1,A,Feb 2005
1,May,2004,CA,61,0.25,A,May 2004
2,May,2004,CA,17,0.1,A,May 2004
3,Jun,2004,CA,3,0.1,A,Jun 2004
4,Jun,2004,CA,3,0.1,A,Jun 2004


**Monthly Trends**

 - Get the fire count per month for each year
 - Get the fire size for each month and year
 - Get the monthly trends for each year across all states
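The three steps above can be sketched as a month-by-year table. The example uses a numeric month (`MonthNum`, a hypothetical column name) so that rows stay in calendar order; the first four dates come from the `df.head()` output, and the remaining rows and sizes are made up.

```python
import pandas as pd

fires = pd.DataFrame({
    "DISCOVERY_DATE": pd.to_datetime([
        "2004-05-12", "2004-05-31", "2004-06-28",
        "2005-02-02", "2005-06-10", "2005-06-11",
    ]),
    "FIRE_SIZE": [0.25, 0.1, 0.1, 0.1, 2.0, 3.0],
})
fires["Year"] = fires["DISCOVERY_DATE"].dt.year
fires["MonthNum"] = fires["DISCOVERY_DATE"].dt.month  # 1-12 keeps calendar order

grouped = fires.groupby(["MonthNum", "Year"])

# Rows are months, columns are years.
fire_counts = grouped.size().unstack(fill_value=0)
acres = grouped["FIRE_SIZE"].sum().unstack(fill_value=0)

# Month with the most fires in each year.
busiest_month = fire_counts.idxmax()

print(fire_counts)
print(acres)
print(busiest_month)
```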

In [78]:
# Count the occurrences of each FIRE_SIZE_CLASS per state
occur_size = sub_with_Month.groupby(['FIRE_SIZE_CLASS', 'STATE']).size()

In [79]:
occur_size

FIRE_SIZE_CLASS  STATE
A                AK        6622
                 AL        8625
                 AR         924
                 AZ       42694
                 CA       98309
                          ...  
G                VA           6
                 WA         152
                 WI           1
                 WV           1
                 WY          96
Length: 335, dtype: int64
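The long, two-level output is easier to scan as a wide table with one row per state and one column per size class. A minimal sketch on made-up rows; `class_by_state` and `class_share` are illustrative names.

```python
import pandas as pd

# Made-up rows with the same two grouping columns.
sub = pd.DataFrame({
    "STATE": ["AK", "AK", "CA", "CA", "CA", "WY"],
    "FIRE_SIZE_CLASS": ["A", "G", "A", "A", "B", "G"],
})

occur_size = sub.groupby(["FIRE_SIZE_CLASS", "STATE"]).size()

# Pivot the size class into columns; missing combinations become 0.
class_by_state = occur_size.unstack("FIRE_SIZE_CLASS", fill_value=0)

# Share of each size class within a state (each row sums to 1).
class_share = class_by_state.div(class_by_state.sum(axis=1), axis=0)

print(class_by_state)
print(class_share)
```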

In [84]:
# Count the fires in each Month_Year (note: sorting these string labels is alphabetical, not chronological)
monthly_data = sub_with_Month.groupby(['Month_Year']).size().reset_index(name='Month_Count').sort_values('Month_Year')

In [85]:
monthly_data.head()

Unnamed: 0,Month_Year,Month_Count
0,Apr 1992,7810
1,Apr 1993,6926
2,Apr 1994,9561
3,Apr 1995,10089
4,Apr 1996,10035
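The output above is ordered alphabetically (every April first) because `Month_Year` holds strings. For a time-series line plot, the rows probably need to be in chronological order. A minimal sketch, parsing the label back into a date; `Month_Start` is a hypothetical helper column, and the January and March counts are made up.

```python
import pandas as pd

# The two April rows come from the output above; the other two are made up.
monthly_data = pd.DataFrame({
    "Month_Year": ["Apr 1992", "Apr 1993", "Jan 1993", "Mar 1992"],
    "Month_Count": [7810, 6926, 100, 200],
})

# Parse "Apr 1992" into a real timestamp (first day of that month).
monthly_data["Month_Start"] = pd.to_datetime(
    monthly_data["Month_Year"], format="%b %Y"
)

# Sort on the timestamp rather than on the string label.
chronological = monthly_data.sort_values("Month_Start").reset_index(drop=True)

print(chronological)
```

Plotting `Month_Start` on the x-axis instead of `Month_Year` would also give the chart a true date axis.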


In [86]:
import plotly.express as px
month_fig = px.line(monthly_data, x='Month_Year', y='Month_Count')
month_fig.show()

**Day of the Week**


1.  Should we look at which day of the week each fire occurred?
2.  Was it around a holiday? A weekend? A weather event such as a storm?
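The day-of-week and weekend questions can be answered from `DISCOVERY_DATE` alone; holidays and weather would need outside data that this notebook does not load. A minimal sketch, where `Day_Name` and `Is_Weekend` are hypothetical column names. The first four dates come from the `df.head()` output, and the last is made up.

```python
import pandas as pd

fires = pd.DataFrame({
    "DISCOVERY_DATE": pd.to_datetime([
        "2004-05-12", "2004-05-31", "2004-06-28", "2005-02-02", "2004-07-04",
    ]),
})

# Name of the weekday on which each fire was discovered.
fires["Day_Name"] = fires["DISCOVERY_DATE"].dt.day_name()

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
fires["Is_Weekend"] = fires["DISCOVERY_DATE"].dt.dayofweek >= 5

fires_per_day = fires["Day_Name"].value_counts()

print(fires)
print(fires_per_day)
```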



## **Model Building**

### Correlation

We need to see how the other features correlate with our target variable.
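A minimal sketch of ranking numeric features by their correlation with a target. The notebook does not name the target, so `FIRE_SIZE` is used here purely as an assumption, and the rows are made up.

```python
import pandas as pd

# Made-up rows using numeric columns from the dataset.
fires = pd.DataFrame({
    "FIRE_YEAR": [2004, 2004, 2005, 2006, 2007, 2008],
    "LATITUDE": [38.9, 38.5, 40.0, 61.2, 64.8, 33.1],
    "DISCOVERY_TIME": [845.0, 1600.0, 1300.0, 1500.0, 1200.0, 900.0],
    "FIRE_SIZE": [0.25, 0.1, 0.1, 900.0, 2500.0, 5.0],
})

target = "FIRE_SIZE"  # assumed target; replace with the real one

# Pearson correlation of every feature with the target,
# ordered by absolute strength.
correlations = (
    fires.corr()[target]
    .drop(target)
    .sort_values(key=abs, ascending=False)
)

print(correlations)
```

Pearson correlation only captures linear relationships between numeric columns; categorical columns such as `STATE` would need encoding first.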

#### Feature Engineering

Here we will extract the variables that correlate with fire.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesClassifier 

In [47]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
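A minimal sketch of what the encoder does, assuming `STATE` is one of the columns to encode (the notebook does not say which). `LabelEncoder` assigns integer codes in sorted order of the class labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

states = pd.Series(["CA", "AK", "CA", "TX"], name="STATE")

le = LabelEncoder()
state_codes = le.fit_transform(states)        # integer code per row
classes = le.classes_.tolist()                # sorted labels; index = code
decoded = le.inverse_transform(state_codes)   # back to the original labels

print(state_codes)
print(classes)
```

scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` may be a better fit.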

In [48]:
# xgboost for classification
from numpy import asarray
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [49]:
models = []
models.append(('LR',LogisticRegression()))
models.append(('KNN',KNeighborsClassifier()))
models.append(('DTR',DecisionTreeRegressor()))
models.append(('RFC',RandomForestClassifier()))

In [50]:
models_LR = LogisticRegression()
models_KNN = KNeighborsClassifier(n_neighbors=3)
models_DTR = DecisionTreeRegressor()
models_RFC = RandomForestClassifier()
models_ETC = ExtraTreesClassifier()
xgb_model = XGBClassifier()

In [None]:
# Encode the variable for 


In [None]:
from sklearn.feature_selection import SelectFromModel

# fit/transform model for each of the models
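One possible shape for this step is sketched below: feature selection with `SelectFromModel`, followed by each classifier, scored with cross-validation. It uses synthetic data from `make_classification` because no target or feature matrix is defined yet. It also swaps `DecisionTreeClassifier` in for the `DecisionTreeRegressor` listed earlier, so that every model is a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (not yet defined) features and target.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("KNN", KNeighborsClassifier(n_neighbors=3)),
    ("DTC", DecisionTreeClassifier(random_state=0)),
    ("RFC", RandomForestClassifier(n_estimators=50, random_state=0)),
]

scores = {}
for name, model in models:
    pipeline = Pipeline([
        # Keep the features an extra-trees model rates above mean importance.
        ("select", SelectFromModel(
            ExtraTreesClassifier(n_estimators=50, random_state=0))),
        ("clf", model),
    ])
    fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    scores[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean_acc, std_acc) in scores.items():
    print(f"{name}: {mean_acc:.3f} (+/- {std_acc:.3f})")
```

Putting the selector inside the pipeline means the features are chosen on each training fold only, which avoids leaking information from the validation fold.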

#### Testing the model
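A minimal sketch of a hold-out evaluation using the metrics imported earlier (`classification_report`, `confusion_matrix`). The data is synthetic and the random forest is an arbitrary choice; neither is the notebook's final model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)          # rows: true, columns: predicted
report = classification_report(y_test, y_pred)

print(cm)
print(report)
```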

REFERENCES