# Wildfires in the US
### Exploratory data analysis

_Francesco Pudda, Sept. 2022_

*Context*

This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. It is the third update of a publication originally generated to support the national Fire Program Analysis (FPA) system. The wildfire records were acquired from the reporting systems of federal, state, and local fire organizations.

Source: https://www.kaggle.com/datasets/rtatman/188-million-us-wildfires

#### Usage: run each cell one by one with Shift+Enter or Ctrl+Enter

## Data preparation

In [None]:
from utils.data_utils import load_dataset, preprocess_dataset

In [None]:
df = load_dataset()

In [None]:
df = preprocess_dataset(df)

## Q1 Have wildfires become more or less frequent over time?

To answer this question, it is useful to show the trend over time by counting the fires within fixed time windows. Plotting the regression line on top of these counts makes the trend immediately visible.
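The counting and regression steps can be sketched on toy data as follows. This is only an illustration, not the implementation of `plot_q1`: the `year` column name and the one-year window are assumptions, and the real preprocessed dataset may be organized differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed dataset (hypothetical "year" column).
toy = pd.DataFrame({"year": [1992, 1992, 1993, 1993, 1993, 1994, 1994, 1994, 1994]})

# Count the fires in each one-year window.
fires_per_year = toy.groupby("year").size()

# Fit a least-squares line through (years since the first year, count).
# Shifting the years to start at zero keeps the fit well conditioned.
years = fires_per_year.index.to_numpy()
counts = fires_per_year.to_numpy()
slope, intercept = np.polyfit(years - years[0], counts, deg=1)

print(f"Average change: {slope:.1f} fires per year")
```

A positive slope means the yearly count is rising on average; a negative slope means it is falling.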

In [None]:
from utils.plotting_utils import plot_q1

In [None]:
plot_q1(df)

As can be seen, the number of fires has unfortunately been increasing over time, growing by an average of around 340 fires per year within the 1992-2015 time window.

## Q2 What counties are the most and least fire-prone?

To answer this question it would be sufficient to group by county, count the number of fires per group and show the top and bottom entries. However, it makes much more sense to also display where these counties are on the US map, while highlighting which states are the most fire-prone.
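The group-and-count step can be sketched on toy data as follows. The `state` and `county` column names are assumptions and do not necessarily match the preprocessed dataset. Grouping by state as well as county avoids merging counties that share a name across states.

```python
import pandas as pd

# Toy stand-in for the wildfire records (hypothetical column names).
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "GA"],
    "county": ["Riverside", "Riverside", "Kern", "Harris", "Harris", "Ware"],
})

# Count the fires per (state, county) pair, most fire-prone first.
fires_per_county = (
    toy.groupby(["state", "county"]).size().sort_values(ascending=False)
)

top_n = fires_per_county.head(2)                          # most fire-prone
single_fire = fires_per_county[fires_per_county == 1]     # least fire-prone

print(top_n)
print(single_fire)
```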

In [None]:
from utils.geopandas_plotting_utils import MapPlottingHandler
from utils.data_utils import load_shapefiles
from folium.map import Icon

In [None]:
dfs = load_shapefiles()
if dfs:
    us_states_df, us_counties_df = dfs[0], dfs[1]

In [None]:
icon = Icon(color='red')
q2_handler = MapPlottingHandler(df, us_states_df, us_counties_df)

In [None]:
q2_handler.activate_q2()

#### NOTE: Every time the sliders above are changed and a success message is printed, the cells below should be run again to update the maps

Below is the map of the N counties with the highest number of wildfires.

In [None]:
q2_handler.topn_counties_layer.explore(m=q2_handler.states_layer, marker_type="marker", marker_kwds={'icon': icon})

Interestingly, the counties with the most wildfires are not necessarily located in the hardest-hit states.

Below is the map of the counties with only one wildfire.

In [None]:
q2_handler.bottom_counties_layer.explore(m=q2_handler.states_layer, marker_type="marker", marker_kwds={'icon': icon})

The same can be said of the counties with only one fire.

## Q3 Given the size, location and date, can you predict the cause of a wildfire?

To answer this question, an ML model was trained on the requested features.

Model description:
<ul>
  <li>Model type: LightGBM classifier</li>
  <li>Number of estimators: 400</li>
  <li>Number of leaves: 40</li>
</ul> 
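The training code in this notebook passes `class_weight='balanced'` to the classifier. As a rough illustration of what that option does, the sketch below computes the weights with the formula given in the LightGBM and scikit-learn documentation, `n_samples / (n_classes * count_per_class)`. The toy labels are made up and are not the real cause codes.

```python
import numpy as np

# Toy labels with an imbalanced class distribution (6, 3 and 1 samples).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

# "Balanced" weights are inversely proportional to class frequency:
# n_samples / (n_classes * count_per_class).
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

Rare causes receive larger weights, so that misclassifying them costs more during training than misclassifying a frequent cause.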

In [None]:
# Code to reproduce training

# from sklearn.model_selection import train_test_split
# from lightgbm import LGBMClassifier
# import pickle as pkl

# df = df[df['cause'] != 13].reset_index(drop=True)
# X_train, X_test, y_train, y_test = train_test_split(df[['size','lat','lon','date']], df['cause'], test_size=0.2, random_state=1)
# model = LGBMClassifier(boosting_type='goss', class_weight='balanced', n_estimators=400, num_leaves=40)
# model = model.fit(X_train, y_train)

# with open('lgbm_model.pkl', mode='wb') as f:
#     pkl.dump(model, f)

In [None]:
from utils.data_utils import load_ml_model
from utils.plotting_utils import plot_q3

In [None]:
ml_model = load_ml_model()

In [None]:
plot_q3(ml_model)