# Fire Damage Estimation

### Table of Contents<a id="toc"></a>

* [Introduction](#intro)
    * [First Steps](#setup)
* [Initial Exploration](#init_explore)
    * [Column Definitions](#col_deff)
* [Deep Exploration](#deep_explore)
    * [Summary](#summary)
* [Data Cleaning](#data_cleaning)
    * [Distributions](#distributions)
    * [Outliers](#outliers)
    * [Correlations](#correlations)
* [Model Building](#building)
* [Model Metrics and Evaluation](#model_eval)
* [Decision Tree Flow Diagram](#diagram)
* [Final Evaluation](#final_eval)
* [Conclusion](#conclusion)


[Back to Table of Contents](#toc)

### Introduction<a id="intro"></a>

In this project, we work with a dataset containing information about forest fires in northeast Portugal. We are tasked with building a model that predicts the potential amount of damage future forest fires may cause. Our goal is to use various techniques to optimize the model and achieve the highest predictive accuracy possible.

### First Steps<a id="setup"></a>

As always, the first steps of any project are to import our libraries and load our data.

In [1]:
# NOTE: Make sure to clean library selection at the end and remove any that did not get used

# importing libraries
import pandas as pd
import numpy as np

# imputation options: SimpleImputer or KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import SplineTransformer

In [2]:
fires = pd.read_csv("fires.csv")

### Initial Exploration<a id="init_explore"></a>

With our project now prepared and ready, we can move into the initial exploration stage. This step provides us with an idea of the data we’re working with and helps us start our list of items we’ll need to clean up later.

In [3]:
fires

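|  | Unnamed: 0 | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | NaN | 51.0 | 6.7 | 0.0 | 0.00 |
| 1 | 2 | 7 | 4 | oct | tue | 90.6 | NaN | 669.1 | 6.7 | 18.0 | 33.0 | 0.9 | 0.0 | 0.00 |
| 2 | 3 | 7 | 4 | oct | sat | 90.6 | 43.7 | NaN | 6.7 | 14.6 | 33.0 | 1.3 | 0.0 | 0.00 |
| 3 | 4 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97.0 | 4.0 | 0.2 | 0.00 |
| 4 | 5 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99.0 | NaN | 0.0 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 512 | 513 | 4 | 3 | aug | sun | 81.6 | 56.7 | 665.6 | 1.9 | 27.8 | 32.0 | 2.7 | 0.0 | 6.44 |
| 513 | 514 | 2 | 4 | aug | sun | 81.6 | 56.7 | 665.6 | 1.9 | 21.9 | 71.0 | 5.8 | 0.0 | 54.29 |
| 514 | 515 | 7 | 4 | aug | sun | 81.6 | 56.7 | 665.6 | 1.9 | 21.2 | 70.0 | 6.7 | 0.0 | 11.16 |
| 515 | 516 | 1 | 4 | aug | sat | 94.4 | 146.0 | 614.7 | 11.3 | 25.6 | 42.0 | 4.0 | 0.0 | 0.00 |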


**`fires` – Observations:**

* Some column names appear to be shorthand for longer terms, so we need to confirm their full meanings.
* There are **517** rows and **14** columns.
* We can see two categorical columns, `month` and `day`.
* We also notice several empty (`NaN`) values.
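
The last two observations can be checked in code rather than by eye. The sketch below is only an illustration: it rebuilds the first five displayed rows by hand (blank cells entered as `NaN`, the `Unnamed: 0` counter left out) so that it runs without `fires.csv`. The name `sample` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# First five rows as displayed above; blank cells are entered as NaN.
sample = pd.DataFrame(
    {
        "X": [7, 7, 7, 8, 8],
        "Y": [5, 4, 4, 6, 6],
        "month": ["mar", "oct", "oct", "mar", "mar"],
        "day": ["fri", "tue", "sat", "fri", "sun"],
        "FFMC": [86.2, 90.6, 90.6, 91.7, 89.3],
        "DMC": [26.2, np.nan, 43.7, 33.3, 51.3],
        "DC": [94.3, 669.1, np.nan, 77.5, 102.2],
        "ISI": [5.1, 6.7, 6.7, 9.0, 9.6],
        "temp": [np.nan, 18.0, 14.6, 8.3, 11.4],
        "RH": [51.0, 33.0, 33.0, 97.0, 99.0],
        "wind": [6.7, 0.9, 1.3, 4.0, np.nan],
        "rain": [0.0, 0.0, 0.0, 0.2, 0.0],
        "area": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
)

# Non-numeric columns are the categorical candidates.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column, keeping only columns that have any.
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

print(categorical_cols)
print(missing_counts)
```

On the full dataset, the same two calls would be applied to `fires` instead of `sample`.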

In [4]:
fires.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  517 non-null    int64  
 1   X           517 non-null    int64  
 2   Y           517 non-null    int64  
 3   month       517 non-null    object 
 4   day         517 non-null    object 
 5   FFMC        469 non-null    float64
 6   DMC         496 non-null    float64
 7   DC          474 non-null    float64
 8   ISI         515 non-null    float64
 9   temp        496 non-null    float64
 10  RH          487 non-null    float64
 11  wind        482 non-null    float64
 12  rain        485 non-null    float64
 13  area        517 non-null    float64
dtypes: float64(9), int64(3), object(2)
memory usage: 56.7+ KB


**`fires.info()` – Observations:**

* There are **2** object columns, **3** integer columns, and **9** float columns.
* As we suspected, there are many missing values: eight of the nine float columns (every one except `area`) contain nulls.
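
To put numbers on the missing values, the non-null counts from the `fires.info()` output can be subtracted from the 517 total rows. This is a small sketch of that arithmetic. The counts are copied from the output above, and `summary` is an illustrative name.

```python
import pandas as pd

TOTAL_ROWS = 517  # from "RangeIndex: 517 entries"

# Non-null counts copied from the fires.info() output above
# (only the columns that have fewer than 517 non-null values).
non_null = pd.Series(
    {
        "FFMC": 469, "DMC": 496, "DC": 474, "ISI": 515,
        "temp": 496, "RH": 487, "wind": 482, "rain": 485,
    }
)

missing = TOTAL_ROWS - non_null
summary = pd.DataFrame(
    {
        "missing": missing,
        "pct_missing": (missing / TOTAL_ROWS * 100).round(1),
    }
).sort_values("missing", ascending=False)

print(summary)
```

`FFMC` has the most gaps (48 rows, roughly 9%) and `ISI` the fewest (2 rows). With the data loaded, `fires.isna().sum()` should report the same counts directly.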

In [5]:
fires.describe()

|  | Unnamed: 0 | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.0 | 517.0 | 517.0 | 469.0 | 496.0 | 474.0 | 515.0 | 496.0 | 487.0 | 482.0 | 485.0 | 517.0 |
| mean | 259.0 | 4.669246 | 4.299807 | 90.580384 | 111.195363 | 550.673418 | 9.018835 | 18.884677 | 44.38193 | 4.021784 | 0.023093 | 12.847292 |
| std | 149.389312 | 2.313778 | 1.2299 | 5.698137 | 64.00845 | 246.061309 | 4.56489 | 5.748318 | 16.180372 | 1.79446 | 0.305532 | 63.655818 |
| min | 1.0 | 1.0 | 2.0 | 18.7 | 1.1 | 7.9 | 0.0 | 2.2 | 15.0 | 0.4 | 0.0 | 0.0 |
| 25% | 130.0 | 3.0 | 4.0 | 90.2 | 70.8 | 441.2 | 6.45 | 15.475 | 33.0 | 2.7 | 0.0 | 0.0 |
| 50% | 259.0 | 4.0 | 4.0 | 91.6 | 108.3 | 664.5 | 8.4 | 19.3 | 42.0 | 4.0 | 0.0 | 0.52 |
| 75% | 388.0 | 7.0 | 5.0 | 92.8 | 141.575 | 713.9 | 10.75 | 22.725 | 53.5 | 4.9 | 0.0 | 6.57 |
| max | 517.0 | 9.0 | 9.0 | 96.2 | 291.3 | 860.6 | 56.1 | 33.3 | 100.0 | 9.4 | 6.4 | 1090.84 |


**`fires.describe()` – Observations:**

* There do not appear to be any binary columns.
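
The summary statistics only hint at whether a column is binary, so a direct check is to count distinct values per column. The helper below is a hypothetical sketch, not part of the notebook. It is demonstrated on a made-up frame that includes a deliberately binary `flag` column.

```python
import pandas as pd


def binary_columns(df: pd.DataFrame) -> list:
    """Return the columns that hold exactly two distinct non-missing values."""
    distinct = df.nunique(dropna=True)
    return distinct[distinct == 2].index.tolist()


# Made-up demo data; "flag" is a hypothetical column added for illustration.
demo = pd.DataFrame(
    {
        "rain": [0.0, 0.0, 0.2, 6.4],
        "flag": [0, 1, 1, 0],
        "month": ["mar", "oct", "aug", "aug"],
    }
)

print(binary_columns(demo))
```

Calling `binary_columns(fires)` on the loaded data would be expected to return an empty list if the observation above holds.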

In [6]:
fires.columns

Index(['Unnamed: 0', 'X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'DC', 'ISI',
       'temp', 'RH', 'wind', 'rain', 'area'],
      dtype='object')

**Dataset Column Definitions<a id="col_deff"></a> (from the [Official Page](https://archive.ics.uci.edu/dataset/162/forest+fires)):**

   1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
   2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
   3. month - month of the year: 'jan' to 'dec' 
   4. day - day of the week: 'mon' to 'sun'
   5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
   6. DMC - DMC index from the FWI system: 1.1 to 291.3 
   7. DC - DC index from the FWI system: 7.9 to 860.6 
   8. ISI - ISI index from the FWI system: 0.0 to 56.10
   9. temp - temperature in Celsius degrees: 2.2 to 33.30
   10. RH - relative humidity in %: 15.0 to 100
   11. wind - wind speed in km/h: 0.40 to 9.40 
   12. rain - outside rain in mm/m2 : 0.0 to 6.4 
   13. area - the burned area of the forest (in ha)
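
The documented ranges also work as a sanity check on the data. The sketch below counts values that fall outside them. The function name and the demo frame are hypothetical, and the `40.0` temperature is a made-up value included so that one check fails.

```python
import numpy as np
import pandas as pd

# (min, max) ranges taken from the column definitions above.
DOCUMENTED_RANGES = {
    "X": (1, 9),
    "Y": (2, 9),
    "FFMC": (18.7, 96.2),
    "DMC": (1.1, 291.3),
    "DC": (7.9, 860.6),
    "ISI": (0.0, 56.1),
    "temp": (2.2, 33.3),
    "RH": (15.0, 100.0),
    "wind": (0.4, 9.4),
    "rain": (0.0, 6.4),
}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count non-missing values outside the documented range, per column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            values = df[col].dropna()
            # between() is inclusive on both ends by default.
            counts[col] = int((~values.between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Made-up demo rows; 40.0 is deliberately outside the documented temp range.
demo = pd.DataFrame(
    {
        "temp": [8.3, np.nan, 40.0],
        "RH": [97.0, 33.0, 51.0],
        "wind": [4.0, 0.9, np.nan],
    }
)

result = out_of_range_counts(demo, DOCUMENTED_RANGES)
print(result)
```

Missing values are skipped rather than flagged, so this check stays separate from the missing-value counts.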

In [7]:
model = LinearRegression()
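
One possible way to use this model as a baseline is sketched below. It is not the notebook's actual approach: the data is synthetic, and the median imputation, 5-fold split, and MSE scoring are all assumptions made for illustration. The classes themselves are the ones imported at the top, plus `make_pipeline` from `sklearn.pipeline`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for numeric predictors and a target (not the fires data).
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# Blank out roughly 10% of the entries to mimic the missing values seen earlier.
X[rng.random(X.shape) < 0.10] = np.nan

# Imputing inside the pipeline means each CV fold fits the imputer on its own
# training split, which avoids leaking information from the held-out fold.
baseline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_squared_error")

print("mean CV MSE:", -scores.mean())
```

scikit-learn reports this metric as a negative number so that higher is always better, which is why the sign is flipped before printing.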