# Forest Fires Analysis
Author: Simon Thornewill von Essen

Date: 2019-05-24

Forest fires endanger human and animal lives and can cause expensive damage to homes and buildings. Efficient firefighting is therefore important for minimising costs. Being able to predict forest fires can shorten reaction times, which makes fires easier to handle. This is especially relevant because global warming is increasing the frequency and severity of forest fires.[[1]](https://www.c2es.org/content/wildfires-and-climate-change/)

Some preliminary work has been done by Cortez *et al.*, who collected data using relatively cheap sensors and built a supervised learning model to regress on the total area burned by each fire. They used an SVM, achieving a best MAD value of $12.71 \pm 0.01$, with smaller fires predicted more accurately than larger ones.
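To make the metric concrete, here is a minimal sketch of MAD, assuming it stands for the mean absolute deviation between observed and predicted burned areas. The function name `mean_absolute_deviation` and the example numbers are my own and do not come from the paper.

```python
import numpy as np


def mean_absolute_deviation(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up burned areas in hectares, for illustration only
observed = [0.0, 1.5, 10.0, 0.4]
predicted = [0.5, 1.0, 6.0, 0.4]

mad = mean_absolute_deviation(observed, predicted)
print(mad)
```

A lower MAD means the predictions are, on average, closer to the observed areas, in the same units as the target (hectares).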

I want to explore this dataset and see whether I can create a better model using XGBoost. The idea comes from machine learning work my colleagues are doing on a variable with a similar distribution.

## Importing Packages, Data and Comprehension

Before doing any serious analysis, the data will be imported and the features in the dataset will be briefly discussed.

In [1]:
# Import the holy trinity, long may they live
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

# Import the rest...

# Acquire a sense of taste
# Matplotlib 3.6 renamed this style from "seaborn" to "seaborn-v0_8"
plt.style.use("seaborn-v0_8")

In [2]:
# Import dataset
df = pd.read_csv("../dat/forestfires.csv")

df.head()

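|   | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|-------|-----|------|-----|----|-----|------|----|------|------|------|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |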


In [3]:
# Identify shape of dataset
df.shape

(517, 13)

In this dataset we have 517 observations, 12 explanatory variables (features) and one target variable, the area.

The variables are as follows:

1. X - x-axis spatial coordinate within Montesinho Park
2. Y - y-axis spatial coordinate within Montesinho Park
3. month - month of the year
4. day - day of the week
5. FFMC - Fine Fuel Moisture Code index of FWI system
6. DMC - Duff Moisture Code index of FWI system 
7. DC - Drought Code index of the FWI system
8. ISI - Initial Spread Index of the FWI system
9. temp - temperature in $^\circ C$
10. RH - relative humidity in $\%$
11. wind - wind speed in $km/h$
12. rain - rainfall in $mm/m^2$

The target variable, fire area, is measured in $ha$.

### The Canadian Forest Fire Weather Index (FWI) System

This dataset makes use of the Canadian Forest Fire Weather Index (FWI) system.[[2]](http://cwfis.cfs.nrcan.gc.ca/background/summary/fwi) Its indices are calculated from the temperature, wind, humidity and rain features of the dataset, as seen in the image below.

![FWI System](./fwi_structure.gif)

This means that although we have 12 features, we probably shouldn't use all of them at once, since the FWI indices are derived from the weather features and therefore carry overlapping information. We should investigate which features are better predictors of our target variable. For example, is it better to use plain wind speed or the Initial Spread Index? Is it better to use a combination of temperature, humidity and rain, or simply the Duff Moisture Code?
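One way to get a first feel for this overlap is to correlate the FWI indices with the raw weather features. The sketch below only uses the five rows shown in `df.head()` so that it is self-contained. Five rows are far too few to draw any conclusion, so the same idea would need to be run on the full `df`. The choice of Spearman correlation is my own assumption, not something taken from the paper.

```python
import pandas as pd

# The five rows from df.head(); swap in the full DataFrame for real use
sample = pd.DataFrame({
    "FFMC": [86.2, 90.6, 90.6, 91.7, 89.3],
    "DMC": [26.2, 35.4, 43.7, 33.3, 51.3],
    "DC": [94.3, 669.1, 686.9, 77.5, 102.2],
    "ISI": [5.1, 6.7, 6.7, 9.0, 9.6],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "RH": [51, 33, 33, 97, 99],
    "wind": [6.7, 0.9, 1.3, 4.0, 1.8],
    "rain": [0.0, 0.0, 0.0, 0.2, 0.0],
})

weather = ["temp", "RH", "wind", "rain"]
fwi = ["FFMC", "DMC", "DC", "ISI"]

# Rank-based correlation between every pair of columns
corr = sample.corr(method="spearman")

# Keep only the block relating FWI indices (rows) to weather features (columns)
cross = corr.loc[fwi, weather]
print(cross.round(2))
```

Strong correlations in this block on the full dataset would suggest that an index and the weather features it is built from should not both be fed to the model.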

### Plan of Attack and Initial Questions

First, I would like to at least skim the paper associated with this dataset to understand the approach of Cortez *et al.* It would also be a good idea to implement their solution and achieve similar results before writing my own algorithm.

Before doing any modelling, however, I would also like to check the data for cleanliness and tidiness, and then do some EDA using visualisations.
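A minimal sketch of what such a cleanliness check could look like is given below. The helper `cleanliness_report` is a hypothetical name of my own, and the small frame is built from a few columns of the `df.head()` output so that the example is self-contained.

```python
import pandas as pd


def cleanliness_report(frame):
    """Collect basic data-quality counts for a DataFrame."""
    return {
        "rows": len(frame),
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric_columns": list(
            frame.select_dtypes(exclude="number").columns
        ),
    }


# A few columns from df.head(); pass the full DataFrame in practice
sample = pd.DataFrame({
    "month": ["mar", "oct", "oct", "mar", "mar"],
    "day": ["fri", "tue", "sat", "fri", "sun"],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "area": [0.0, 0.0, 0.0, 0.0, 0.0],
})

report = cleanliness_report(sample)
print(report)
```

The non-numeric columns it reports (`month` and `day` here) are the ones that will need encoding before any model can use them.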

Finally, one of the major drawbacks of this dataset is the lack of features. Unfortunately, this means that there is a limited scope for feature engineering, which is typically responsible for the greatest increases in scores for supervised learning. However, I might be able to use the X, Y variables to create some kind of categorical variable to be used for the regression.
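As a rough sketch of the X, Y idea, the two grid coordinates can be combined into a single categorical label per grid cell. The column name `zone` and the label format are hypothetical choices of mine, and the coordinates are the five pairs shown in `df.head()`.

```python
import pandas as pd

# X, Y pairs from df.head(); use df[["X", "Y"]] on the full data
coords = pd.DataFrame({"X": [7, 7, 7, 8, 8], "Y": [5, 4, 4, 6, 6]})

# Hypothetical feature: one label per grid cell, e.g. "x7_y5"
coords["zone"] = (
    "x" + coords["X"].astype(str) + "_y" + coords["Y"].astype(str)
).astype("category")

# How many observations fall into each grid cell
zone_counts = coords["zone"].value_counts()
print(zone_counts)
```

Whether this helps is an open question: cells with very few fires would yield sparse categories, so coarser groupings of neighbouring cells might be worth trying as well.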

Leading Questions:
1. What variable has the highest correlation with burn area?
2. Can I build a model that is better than what is described in this paper?
3. Can I use the X, Y coordinates of this dataset to engineer a new variable?

### Notes After Reading Paper

The notes can be found in the `research` directory of this project. The gist is to use `1-of-C` encoding for the nominal categorical features and to standardise the rest of the features to zero mean and unit standard deviation, i.e. $N(0, 1)$. I should then fit the SVM using grid search and 10-fold cross-validation. I didn't understand anything more specific than that, so this will be my reference model for this analysis.
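The recipe above could be sketched with scikit-learn roughly as follows. This is my own reading of the approach, not the paper's implementation: the data are synthetic stand-ins, the hyperparameter grid is made up, and scoring by mean absolute error is assumed to correspond to the MAD metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the real dataset), just to make this runnable
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "month": rng.choice(["mar", "aug", "sep", "oct"], size=n),
    "day": rng.choice(["fri", "sat", "sun", "tue"], size=n),
    "temp": rng.uniform(5, 30, size=n),
    "RH": rng.uniform(20, 100, size=n),
    "wind": rng.uniform(0, 9, size=n),
    "rain": rng.uniform(0, 1, size=n),
})
target = rng.exponential(scale=5.0, size=n)

nominal = ["month", "day"]
numeric = ["temp", "RH", "wind", "rain"]

# One-hot encode the nominal features, standardise the numeric ones
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal),
    ("scale", StandardScaler(), numeric),
])

model = Pipeline([("prep", preprocess), ("svr", SVR(kernel="rbf"))])

# Hypothetical grid; the paper's actual search space is not reproduced here
param_grid = {"svr__C": [1.0, 10.0], "svr__gamma": [0.01, 0.1]}

search = GridSearchCV(
    model,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(toy, target)

# scikit-learn maximises scores, so flip the sign to get an error
best_mad = -search.best_score_
print(search.best_params_, best_mad)
```

Putting the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, which keeps information from the validation fold out of the fit.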

There was a little feature engineering beyond the basic cleaning that went into the dataset I downloaded, but I didn't quite understand what they were doing, so I won't attempt much beyond what I can do on my own.

This means that my model may not be as performant as the one found in the paper.

## EDA: Univariate Analysis

Text.

## EDA: Bivariate Analysis

Text.

## EDA: Multivariate Analysis

Text.

## EDA: Summary

Text.

## Data Cleaning and Feature Engineering

Text.

## ML Models: SVM

Text.

## ML Models: XGBoost

Text.

## ML Models: Discussion
Text.

## Conclusion
Text.