In [8]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
set.seed(3456) 

### Introduction:
Fire behaviour is usually assessed through fuel moisture and weather conditions. The Forest Fire Weather Index (FWI) system consists of six components: three fuel moisture codes and three fire behaviour indices, each calculated from daily weather observations. Our goal is to predict forest fires in Algeria from climate characteristics and to compare our predictions with the FWI. We work with the Algerian Forest Fires dataset, which contains 244 daily observations from two Algerian regions. There are 11 attributes and one output attribute, which classifies each observation as either fire (138 observations) or not fire (106 observations).
Here is what the column headings mean:
* day/month/year: indicate the day/month/year that the observation was taken, respectively.
* Temperature: Maximum Temperature on that day (Celsius)
* RH: Relative Humidity (%)
* Ws: Wind Speed (km/h)
* Rain: Total rainfall that day (mm)
* FFMC: Fine Fuel Moisture Code Index from FWI system
* DMC: Duff Moisture Code Index from FWI system
* DC: Drought Code Index from FWI system
* ISI: Initial Spread Index from FWI system
* BUI: Buildup Index from FWI system
* FWI: Fire Weather Index
* Classes: either "not fire" or "fire"


### Preliminary Exploratory Data Analysis:
1) Reading the dataset from the web into R

In [12]:
forest_fire_data_raw <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv", skip = 1)
head(forest_fire_data_raw) # Preview the first 6 rows of the dataset.

,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
2,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
3,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
4,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
5,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire
6,6,6,2012,31,67,14,0.0,82.6,5.8,22.2,3.1,7.0,2.5,fire
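
Every column in the preview is read as `<chr>`, which usually means that some rows hold text rather than measurements. As a minimal sketch of how such rows could be located, the example below uses a small made-up data frame (`toy_raw` and its contents are hypothetical, not the real file) and flags the rows whose `day` value cannot be parsed as an integer.

```r
# Hypothetical stand-in for a raw table in which two non-data rows
# (some header text and a repeated column header) sit between observations.
toy_raw <- data.frame(
  day     = c("01", "02", "header text", "day", "01", "02"),
  month   = c("06", "06", "", "month", "06", "06"),
  Classes = c("not fire", "fire", "", "Classes", "fire", "not fire"),
  stringsAsFactors = FALSE
)

# as.integer() returns NA (with a warning) for text it cannot parse,
# so NA marks the rows that are not observations.
day_as_int <- suppressWarnings(as.integer(toy_raw$day))
non_data_rows <- which(is.na(day_as_int))

non_data_rows
```

Running the same check on `forest_fire_data_raw$day` would be one way to confirm which row numbers to exclude before slicing the two regions apart.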


2) Cleaning and Wrangling the Data into a Tidy Format

* The data is not yet tidy because the file stacks observations from two regions in one table. Rows 1 to 122 are observations for the Bejaia Region, while rows 125 to 246 are for the Sidi-Bel Abbes Region (rows 123 and 124 contain column headers). To tidy this, we add an extra column called "region". After this, the data is tidy: each row is a single observation, each column is a single variable, and each cell holds a single value.
* We also noticed that the 14/07/2012 observation in the Sidi-Bel Abbes Region (row 166 of the forest_fire_data_tidy data frame) has a typo in the original dataset: the delimiter between the ISI and BUI columns is missing. To fix this, we removed that row and added a correctly delimited one with the same values.
* Additionally, to make the data more usable, we converted the Classes column from character to factor and the Temperature:FWI columns from character to numeric.

In [13]:
forest_fire_data_collect <- collect(forest_fire_data_raw) # read.csv() already returns a local data frame, so collect() leaves it unchanged.

forest_fire_data_bejaia <- forest_fire_data_collect %>% # Separate out the Bejaia Region and add a "region" column.
    slice(1:122) %>%
    mutate(region = "Bejaia")

forest_fire_data_sidibel <- forest_fire_data_collect %>% # Separate out the Sidi-Bel Abbes Region and add a "region" column.
    slice(125:246) %>%
    mutate(region = "Sidi-Bel_Abbes")

forest_fire_data_tidy <- rbind(forest_fire_data_bejaia, forest_fire_data_sidibel) # Combine the two tables produced above.

forest_fire_fix <- forest_fire_data_tidy %>% # Fix the broken row for the 14/07/2012 observation.
    slice(-166) %>% # Drop the malformed row, then append a correctly delimited copy with the same values.
    add_row(day = "14", month = "07", year = "2012",
            Temperature = "37", RH = "37", Ws = "18", Rain = "0.2",
            FFMC = "88.9", DMC = "12.9", DC = "14.6", ISI = "9", BUI = "12.5", FWI = "10.4",
            Classes = "fire", region = "Sidi-Bel_Abbes")

forest_fire_data_mutated <- forest_fire_fix %>% # Convert the measurement columns (Temperature through FWI) from character to numeric.
    mutate(across(Temperature:FWI, as.numeric))

forest_fire_categories <- forest_fire_data_mutated %>% # Strip stray whitespace from the Classes labels so that only "fire" and "not fire" remain, then convert Classes to a factor.
    mutate(Classes = ifelse(trimws(Classes) == "fire", "fire", "not fire")) %>%
    mutate(Classes = as.factor(Classes))

forest_fire_categories

day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,region
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<chr>
01,06,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,Bejaia
02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,Bejaia
03,06,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,Bejaia
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
29,09,2012,24,54,18,0.1,79.7,4.3,15.2,1.7,5.1,0.7,not fire,Sidi-Bel_Abbes
30,09,2012,24,64,15,0.2,67.3,3.8,16.5,1.2,4.8,0.5,not fire,Sidi-Bel_Abbes
14,07,2012,37,37,18,0.2,88.9,12.9,14.6,9.0,12.5,10.4,fire,Sidi-Bel_Abbes
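
Before moving on, it can be worth confirming that the type conversions introduced no missing values and that the class labels collapsed to exactly two levels. The sketch below shows one way to do this on a made-up three-row table (`toy_clean` and its values are hypothetical); the same two checks could be applied to `forest_fire_categories`.

```r
library(dplyr)

# Hypothetical miniature of the table before conversion; values are made up.
toy_clean <- tibble(
  Temperature = c("29", "26", "37"),
  Rain        = c("0.0", "13.1", "0.2"),
  Classes     = c("not fire   ", "not fire", "fire ")
) %>%
  mutate(across(Temperature:Rain, as.numeric),
         Classes = as.factor(trimws(Classes)))

# Count missing values per numeric column; a non-zero count would point to
# a cell that as.numeric() could not parse.
na_counts <- toy_clean %>%
  summarize(across(where(is.numeric), ~ sum(is.na(.x))))

na_counts
levels(toy_clean$Classes)
```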


### Extracting Training Data
* 75% of the data is used for training and the remaining 25% is held out for testing. The split is stratified by Classes.

In [14]:
forest_split <- initial_split(forest_fire_categories, prop = 0.75, strata = Classes) # Splitting the data with 75% for training and 25% for testing.
forest_fire_training <- training(forest_split)   
forest_fire_testing <- testing(forest_split)
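
Stratifying on Classes is meant to keep the share of "fire" and "not fire" observations similar in both partitions. As a rough illustration, the sketch below repeats the split on made-up data (`toy_fires` is hypothetical: it only copies the class counts quoted in the introduction, and its predictor is random noise) and compares the class proportions. It assumes the rsample package, which is loaded as part of tidymodels.

```r
library(rsample)

set.seed(3456)

# Hypothetical data with 138 "fire" and 106 "not fire" rows;
# the Temperature values are random and carry no meaning.
toy_fires <- data.frame(
  Temperature = rnorm(244, mean = 32, sd = 4),
  Classes = factor(rep(c("fire", "not fire"), times = c(138, 106)))
)

toy_split <- initial_split(toy_fires, prop = 0.75, strata = Classes)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of each class in the full data and in each partition.
prop.table(table(toy_fires$Classes))
prop.table(table(toy_train$Classes))
prop.table(table(toy_test$Classes))
```

If the three tables show similar proportions, the stratification worked as intended; a large gap would suggest checking the `strata` argument.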

3) Creating a Summary Table

* We wanted to create a summary table showing the number of training observations in each of the "fire" and "not fire" classes, together with the per-class means of the predictor variables we plan to use (Temperature, RH, Ws and Rain).

In [7]:
forest_fire_means <- forest_fire_training %>% # Count the observations in each class and average the planned predictors within each class.
    group_by(Classes) %>%
    summarize(n = n(),
              mean_temp = mean(Temperature),
              mean_RH = mean(RH),
              mean_Ws = mean(Ws),
              mean_rain = mean(Rain))

forest_fire_means


### Expected Outcomes and Significance:

We expect to find that days with reported fires have little to no rain, lower relative humidity, higher temperatures and higher FWI values. Accurate predictions would allow us to better prepare and equip firefighters for forest fires, and could lead to quicker evacuations and faster extinguishing of fires. This work might also raise new questions, such as whether the magnitude or scale of a forest fire can be predicted from weather conditions. Answering that question could help predict in advance which resources are needed to fight a fire and estimate the area that would require evacuation.
