# FIT5149 S1 2020 Assessment 1: Bushfire Analysis using Meteorological Data


Student information
- Family Name: Muralitharan
- Given Name: Keerthana
- Student ID: 30159474
- Student email: kmur0015@student.monash.edu

Programming Language: R 3.3 in Jupyter Notebook

R Libraries used:
- dplyr
- kernlab
- caret
- e1071
- ggplot2
- MASS
- leaps
- randomForest
- corrplot
- tidyverse

In [None]:
#install.packages('kernlab')
#install.packages('e1071')
#install.packages('tidyverse')

In [None]:
library(dplyr)
require(kernlab)
require(caret)
library(e1071)
library(ggplot2)
library(MASS)
library(leaps)
library(randomForest)
library(corrplot)
library(tidyverse)

## Table of Contents

* [Introduction](#sec_1)
* [Data Exploration](#sec_2)
* [Model Development](#sec_3)
* [Model Comparison](#sec_4)
* [Variable Identification and Explanation](#sec_5)
* [Conclusion](#sec_6)
* [References](#sec_7)

## 1. Introduction <a class="anchor" id="sec_1"></a>

This assessment involves two tasks: a **prediction** task and a **description** task.

**Prediction Task**

- In this task, the burned area caused by bushfires in Portugal between 2000 and 2003 is predicted from meteorological data. Three models are developed, a **Linear Regression model, a Support Vector Regression model and a Random Forest Regressor model**, and the best of them is chosen to predict the burned area.

**Description Task**

- In this task, the features best suited to the model are identified and their significance and importance are discussed.


#### Understanding the data.

In [None]:
# reading the forestfires.csv input file and displaying the dimensions of the file
forestdata <-  read.csv('forestfires.csv', header = TRUE, sep = ",", stringsAsFactors=FALSE)
bushdata<-forestdata

In [None]:
# Display the dimensions
cat("The forest fire dataset has", dim(forestdata)[1], "records, each with", dim(forestdata)[2],"attributes.")

In [None]:
cat("The attribute names of the forest fire dataset are \n")
# Display the column(attribute) names 
colnames(forestdata)
cat("The structure of the forest fire data is \n\n")
# Display the structure
str(forestdata)


##### OBSERVATION :


<li>month and day are <b>categorical</b> attributes</li>
<li>X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain and area are <b>numerical</b> attributes</li>

In [None]:
cat("\nBelow is the small portion of how the dataset appears:")
# Display the first few records
head(forestdata)

In [None]:
cat("\nBelow is the summary of each attribute:")

summary(forestdata)

## 2. Data Exploration<a class="anchor" id="sec_2"></a>

In [None]:
# Check to see if any data missing.
sum(is.na(forestdata))

# Check to see how many cases have an area of 0
cat("\nNumber of bush fire cases where burned area is zero:")
length(which(forestdata$area==0))

### Exploration of Categorical Variables 


In [None]:
#Counting the number of fires in each month
forestfires_month <- forestdata %>% group_by(month) %>% summarize(fires=n())
#Counting the number of fires on each day of the week
forestfires_day <- forestdata %>% group_by(day) %>% summarize(fires=n())

#the months are factorised and assigned the values
forestfires_month <- forestfires_month %>%
  mutate(month = factor(month, levels = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")))
#the days are factorised and assigned the values
forestfires_day <- forestfires_day %>%
  mutate(day = factor(day, levels = c("mon", "tue", "wed", "thu", "fri", "sat", "sun")))

#monthwise and daywise plot is plotted
fires_by_month <- ggplot(data=forestfires_month) + aes(x=month,y=fires) + geom_bar(stat="identity",color='blue',fill='skyblue') + labs(title = "Fires by Month", x = "Month", y ="Number of fires") 
fires_by_day <- ggplot(data=forestfires_day) + aes(x=day,y=fires) + geom_bar(stat="identity",color='blue',fill='skyblue') + labs(title = "Fires by Day", x = "Day", y ="Number of fires")

#month plot versus total number of bushfires
fires_by_month
#day plot versus total number of bushfires
fires_by_day


In [None]:
#Analysing the discrete vars
##Creating a new data set where area is less than 300
catforest <- forestdata %>% filter((area < 300))
index <- unlist(lapply(catforest, is.numeric))
forest <- cbind(catforest[ , c(!index)],catforest$area)
names(forest) <- c('Month','Day','area')
print(head(forest))
mat <- table(forest$Month,forest$Day)
options(repr.plot.res = 100,repr.plot.width=18, repr.plot.height=10)
mosaicplot(mat,main = "Distribution of Days in Month", col=c(1,2,3,4,5,6,7))

##### Observation :

- The highest numbers of bushfires occur in **August and September**.

- By day of the week, **Friday, Saturday, Sunday and Monday** have the most bushfires.

- Across the week, all days have more than 60 bushfires, so the counts are at a broadly similar level.

- The mosaic plot of month against day shows that **January, November and May** have very few bushfires, and that most other months have fewer than 50 bushfires.

In [None]:
# explore the relationship between burn area and park location

burn_coord = forestdata %>% group_by(X, Y) %>% summarize(area_mean = mean(area))
ggplot(burn_coord, aes(x = factor(X), y = factor(Y),
       fill = area_mean)) + geom_tile() + scale_fill_gradient2()

#### Observation:

Based on the above heat map, the coordinates (X, Y) = (8, 8) have a higher mean burned area than the other locations.

### Exploration of Numerical Variables

In [None]:
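# Convert month and day string variables into numeric codes (categorical to numerical conversion)
# as.factor() orders its levels alphabetically, so the codes follow alphabetical order (e.g. apr = 1), not calendar order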
forestdata$month <- as.numeric(as.factor(forestdata$month))
forestdata$day <- as.numeric(as.factor(forestdata$day))
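
Because the numeric codes produced above are reused later as column names (`month.1`, `day.5`, ...), it helps to see how `as.factor` assigns them. The sketch below uses a small made-up vector of month abbreviations, not the real dataset, and compares the default alphabetical coding with a coding based on explicit calendar levels. The variable names are hypothetical.

```r
# Toy vector of month abbreviations (hypothetical sample, not the real data)
months <- c("jan", "feb", "mar", "aug", "sep", "dec")

# Default coding: factor levels are sorted alphabetically
alphabetical_codes <- as.numeric(as.factor(months))

# Explicit calendar levels give codes that match the month number
calendar_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                     "jul", "aug", "sep", "oct", "nov", "dec")
calendar_codes <- as.numeric(factor(months, levels = calendar_levels))

print(data.frame(month = months, alphabetical_codes, calendar_codes))
```

With the alphabetical coding, a code such as 1 does not mean January, so any later interpretation of `month.*` or `day.*` columns has to go through the alphabetical order of the labels.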

In [None]:
par(mfrow=c(1,1))
#plot a correlation plot for the variables
M <- cor(forestdata)
corrplot(M, method="color", outline = TRUE,type="lower",order = "hclust",
         number.cex = 1.0,addCoef.col = "black",
         tl.col="black", tl.srt=45, diag=FALSE,tl.cex = 1,mar=c(0,0,3,0),
         title="Correlation plot between Predictor and Outcome variables")

#### Observation :
Based on the correlation plot, we examined the correlations between all 13 variables and can see several important correlations that we may want to account for in our model.
- Positive correlations between ISI, temp, DMC and DC
- Positive correlation between X and Y
- Negative correlation between RH and temp

Also, the outcome variable area is not strongly correlated with any of the 12 predictor variables.

- DC and DMC: high positive correlation of 0.68
- temp and DMC: high positive correlation of 0.5
- RH and temp: high negative correlation of -0.53

**When selecting the features, if (DC, DMC), (temp, DMC) or (RH, temp) appear together, one variable of the pair is dropped to avoid multicollinearity.**
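
The rule above can also be applied programmatically. The following is a minimal sketch, assuming a cut-off of 0.5 on the absolute correlation (our choice for illustration) and using made-up data in which `RH` is constructed to be negatively correlated with `temp`. It lists each highly correlated pair once.

```r
# Toy data frame standing in for the numeric predictors (values are made up)
set.seed(1)
n <- 200
temp <- rnorm(n, mean = 19, sd = 6)
toy <- data.frame(
  temp = temp,
  RH   = 60 - 1.5 * temp + rnorm(n, sd = 5),  # built to correlate negatively with temp
  wind = rnorm(n, mean = 4, sd = 2)           # generated independently of the others
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the chosen threshold
threshold <- 0.5  # assumed cut-off, chosen for illustration
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

For every pair that is listed, one of the two variables would be a candidate for removal before fitting a linear model.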

In [None]:
#Plotting boxplots for all 13 variables
par(mfrow = c(3,5)) # 3 x 5 grid
for (i in 1:(length(forestdata))) {
        boxplot(forestdata[,i], main = names(forestdata[i]), col = 'maroon') 
}

#### Observation 

From the above boxplots we can observe outliers in the following variables:

<li>area
<li>FFMC
<li>ISI
<li>rain

However, these outliers are not erroneous values, so we have not removed them at this stage.

In [None]:
# Density plots are created for the 12 input variables to understand their skewness
par(mfrow=c(2,6),mar=c(3.50, 1, 2.5, 2.5))
for (variables in 1:(dim(forestdata)[2]-1)){
  thisvar = forestdata[,variables]
  d <- density(thisvar)
  plot(d, main = names(forestdata[variables]),xlab="")
  polygon(d, col="pink", border="red")
}
# Add the overall title once, after all panels are drawn
title("Density plots for all 12 Model Variables", line = -27, outer = TRUE)

#### Observation 

The above matrix of density plots shows that <b>rain and ISI</b> are right skewed and <b>FFMC</b> is left skewed, while <b>temp, wind, RH, X, DMC and day</b> look approximately normal (Gaussian).

In [None]:
# Density plot for the target variable - area
d <- density(forestdata$area)
plot(d)
polygon(d, col="skyblue", border="blue")

#### Observation

Based on the above density plot, area is right skewed with a large number of zeros. It also has some outliers with area above 300, which can be removed.

#### Removal of outliers

In [None]:
#display the area value in decreasing order to find the outliers
cat("Area value in decreasing order\n ")
sort(forestdata$area, decreasing = TRUE)[1:10]

There are two particularly large area values above 300: 1090.84 and 746.28. These two outliers are removed from the dataset.

In [None]:
#removing the outliers
forestdata <- forestdata %>% filter((area < 300))

### Variable Transformations

Based on the boxplots and density plots, we can use a **reflected log transform for FFMC**, a **log transform for rain** and a **log transform for area**, since area is highly concentrated towards zero and asymmetrical.

The density plot shows that the burned area is highly skewed with many zero values, so a log transformation is applied to reduce the skewness. We transform the target variable 'area' and the input variable 'rain' by taking their logarithm (after adding 1 so that zero values stay defined), and FFMC by a reflected logarithm:
$$\text{Log-area} = \log_{10}(\text{area}+1)$$
$$\text{Log-rain} = \log_{10}(\text{rain}+1)$$
$$\text{Reflected-log-FFMC} = \log_{10}(\max(\text{FFMC}+1)-\text{FFMC})$$
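
As a small illustration of these formulas, the sketch below applies them to made-up values (not the real data) and also shows how a value on the log scale can be mapped back to the original units with 10^x - 1. The vector names are hypothetical.

```r
# Toy values (made up) with several zeros and a long right tail, like burned area
area <- c(0, 0, 0.5, 2, 10, 99)
log_area <- log10(area + 1)   # +1 keeps zero areas defined: log10(1) = 0

# Back-transform values on the log scale to the original units
area_back <- 10^log_area - 1

# Reflected log for a left-skewed variable such as FFMC (toy values)
ffmc <- c(60, 85, 90, 92, 96)
ref_log_ffmc <- log10(max(ffmc + 1) - ffmc)  # the largest value maps to log10(1) = 0

print(round(log_area, 3))
print(round(ref_log_ffmc, 3))
```

Note that the reflection reverses the ordering: large FFMC values become small transformed values, which matters when interpreting the sign of a coefficient on the transformed variable.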


#### Reflected log transformation for FFMC variable

In [None]:
#https://www.datanovia.com/en/lessons/transform-data-to-normal-distribution-in-r/

#reflected log transformation for FFMC variable
forestdata$ref_log_FFMC <- log10(max(forestdata$FFMC+1)-forestdata$FFMC)
forestdata <- subset(forestdata, select = -c(FFMC))
#Plotting the histogram after transformation
ggplot(forestdata, aes(x = ref_log_FFMC)) + geom_histogram(color='black',fill='skyblue')

#### Log transformation for rain variable

In [None]:
#Log transformation of the rain variable
forestdata$log_rain <- log10(forestdata$rain+1)
forestdata <- subset(forestdata, select = -c(rain))

#Plotting graph after transformation
ggplot(forestdata, aes(x = log_rain)) + geom_histogram(color='black',fill='skyblue')



#### Observation
- After the reflected log transformation, FFMC is much closer to a normal distribution, whereas the rain variable changes very little because only a few records (around 5) have non-zero values and the rest are zero.

#### One-hot encoding for Month and Day categorical variables

In [None]:
#One-hot encoding is performed on the month variable
for(unique_value in unique(forestdata$month)){
    forestdata[paste("month", unique_value, sep = ".")] <- ifelse(forestdata$month == unique_value, 1, 0)
}
#After the encoding, the month variable is removed from the dataset
forestdata <- subset(forestdata, select = -c(month))

#One-hot encoding is performed on the day variable
for(unique_value in unique(forestdata$day)){
    forestdata[paste("day", unique_value, sep = ".")] <- ifelse(forestdata$day == unique_value, 1, 0)
}

#After the encoded values are added to the data, the day variable is removed
forestdata <- subset(forestdata, select = -c(day))
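
The loops above build one 0/1 column per category by hand. As an alternative sketch, assuming the categorical column is stored as a factor, base R's `model.matrix` can produce the same kind of indicator columns. The toy data frame and its values are made up for illustration.

```r
# Toy data frame (made-up rows) with a categorical day column
toy <- data.frame(day  = c("mon", "fri", "sun", "fri"),
                  temp = c(18.2, 22.5, 8.3, 25.1),
                  stringsAsFactors = FALSE)
toy$day <- factor(toy$day, levels = c("mon", "fri", "sun"))

# "~ day - 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ day - 1, data = toy)
encoded <- cbind(toy["temp"], dummies)
print(encoded)
```

When all indicator columns of a variable are used together with an intercept in a linear model, they are perfectly collinear, so one level is normally left out as the reference.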

#### Log transformation for target variable- AREA

In [None]:
#log transformation for area variable
forestdata$log_area <- log10(forestdata$area+1)
forestdata <- subset(forestdata, select = -c(area))
#plot the area variable after transformation
ggplot(forestdata, aes(x = log_area)) + geom_histogram(color='black',fill='skyblue')

Since the log transformation is applied to the area variable, we can see an approximately Gaussian distribution once the zeros are excluded.

## 3. Model Development<a class="anchor" id="sec_3"></a>

In [None]:
##Sample the dataset.
set.seed(90)
#main dataset is split in 80:20 into train and test datasets by setting a index
row.number <- sample(1:nrow(forestdata), 0.8*nrow(forestdata))
#Split to train and test dataset
bushtrain = forestdata[row.number,]
bushtest = forestdata[-row.number,]

#display the rows of the datasets.
cat("Number of rows in dataset\t:",nrow(forestdata))
cat("\nNumber of rows in train dataset\t:",nrow(bushtrain))
cat("\nNumber of rows in test dataset\t:",nrow(bushtest))



## Model 1. Linear Regression

Based on the exploration alone, we could not decide which variables should be used to build the models. We therefore perform subset selection for the linear regression using **stepwise** regression with **sequential replacement** and **nvmax=9**. Once the features are selected, the linear model is built.

In [None]:
#Using stepwise regression to find the significant variables affecting the area
models <- regsubsets(log_area~., data = bushtrain, nvmax = 9,method = "seqrep")

#display the summary of the stepwise regression
summary(models)

#### Features selected for the linear regression are 
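DMC, temp, wind, months 12, 1, 4, 3 and 9, and days 1 and 5. Because `as.factor` assigned the month and day codes in alphabetical order, these codes correspond to sep, apr, feb, dec and may, and to fri and thu.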

#### Building the linear regression with selected features

The linear model is built for all the above selected variables against the log_area for the train dataset.

In [None]:
#Building the linear regression model with features obtained from feature selection
linear_model <-lm(log_area ~ DMC + temp + wind +month.12 +month.1+month.4+month.3+month.9+day.1+day.5, data = bushtrain )

In [None]:
#summary of the model
summary(linear_model)

#### Observation: 

All the selected variables except month.9 (may under the alphabetical coding) and day.1 (fri) are important.

## Model 2. Random Forest Regressor

The second model is a random forest regressor, built on the dataset to predict log_area.

In [None]:
#Random forest model with all the variables
random_forest<- randomForest(log_area ~ .,  data = bushtrain, ntree=500)
#summary of the model
random_forest

###### Finding the important variables used by the random forest to improve the performance.

In [None]:
#selecting the variables which are most important.
importance(random_forest)[order(-importance(random_forest)),]

##### Observation
**temp, RH, DMC, DC and wind** are the five most important variables.

In [None]:
#Plotting the variable importance
varImpPlot(random_forest,pch=18,col="blue",cex=1.0)

#### Observation:

Based on the above plot, **temp, RH, DMC, DC, FFMC (as ref_log_FFMC), wind and ISI** are the most important variables. To avoid multicollinearity we can ignore **RH and DC**, as temp is more important than RH and DMC is more important than DC.

#### Tuning the random forest regressor
Before fitting the final random forest, we determine the best model by selecting **mtry (the number of variables randomly sampled as candidates at each split), the number of trees (ntree) and the maximum number of terminal nodes (maxnodes)** for the random forest regressor.

##### Selecting the mtry value

In [None]:
#https://www.guru99.com/r-random-forest-tutorial.html
#setting the k-fold cross validation
trControl <- trainControl(method='cv',number=10,search="grid")
# to find the best mtry
rf_default <- train(log_area~.,
    data = bushtrain,
    method = "rf",
    trControl = trControl)
# Print the results
print(rf_default)

##### Selecting the  best maxnode 

In [None]:
#https://www.guru99.com/r-random-forest-tutorial.html
#search the best maxnode
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = 2)
for (maxnodes in c(5: 20)) {
    #train the model to find the maximum node
    rf_maxnode <- train(log_area~.,
        data = bushtrain,
        method = "rf",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 500)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
#display the summary
summary(results_mtry)

For **maxnodes = 14**, the mean errors are comparatively low.

##### selecting the best ntree

In [None]:
#https://www.guru99.com/r-random-forest-tutorial.html
#search best ntrees
store_maxtrees <- list()

#train the model to find the best number of trees 
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
    rf_maxtrees <- train(log_area~.,
        data = bushtrain,
        method = "rf",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = 24,  # note: this differs from the maxnodes = 14 selected in the previous step
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
#display the summary
summary(results_tree)

For **ntree = 250**, the mean errors are lowest.

Based on the above results, the best values are:

- **mtry** = 2
- **maxnodes** = 14
- **ntree** = 250

Using these values, we can train the model on the features selected by their importance.

In [None]:
# Based on the variable importance and tuned parameters Random forest is built
random_forest_rf<-randomForest(log_area~temp+DMC+ref_log_FFMC+wind+ISI, data=bushtrain,ntree=250,mtry=2,maxnodes=14)

#summary of random forest
random_forest_rf

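After tuning the random forest regressor, the **mean of squared residuals** decreased from **0.3832 to 0.354** and the **percentage of variance explained** improved from **-11.16 to -2.7**. The percentage of variance explained is still negative, which means that the out-of-bag predictions are still no better than predicting the mean of log_area for every record.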
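
To see why this percentage can be negative, it helps to write the quantity out. As far as we understand the `randomForest` output, it is a pseudo R-squared of the form 100 × (1 - MSE / Var(y)) computed on out-of-bag predictions. The sketch below reproduces that formula on made-up numbers; the function name `pct_var_explained` and all values are hypothetical, and the exact variance convention used by the package is an assumption here.

```r
# Toy observed values and two sets of predictions (all made up)
y         <- c(0.0, 0.3, 0.9, 1.4, 0.2, 0.0, 1.1, 0.6)
pred_good <- c(0.1, 0.4, 0.8, 1.2, 0.3, 0.1, 1.0, 0.5)
pred_bad  <- c(1.2, 0.1, 0.0, 0.2, 1.0, 0.9, 0.1, 1.3)

# Pseudo R-squared in percent: 100 * (1 - MSE / variance of y)
pct_var_explained <- function(obs, pred) {
  mse   <- mean((obs - pred)^2)
  var_y <- mean((obs - mean(obs))^2)  # variance with divisor n (assumed convention)
  100 * (1 - mse / var_y)
}

print(pct_var_explained(y, pred_good))                   # positive: better than the mean
print(pct_var_explained(y, rep(mean(y), length(y))))     # predicting the mean for every row
print(pct_var_explained(y, pred_bad))                    # negative: worse than the mean
```

A value below zero therefore signals that the squared errors are larger than the spread of the target itself.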

## Model 3. Support Vector Regressor

The third model is Support vector regressor which is built in order to predict the log_area.

In [None]:
#build a svr model -radial kernel with all the variables
svr_model =svm(log_area ~ .,data=bushtrain)

#summary of the model
svr_model

#### Feature selection based on Recursive feature elimination method

In [None]:
#https://datasciencebeginners.com/2018/11/26/functions-and-packages-for-feature-selection-in-r/
set.seed(86)

 
# Setting the cross validation parameters
ctrl_param <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE,
                   returnResamp = "all")
 
#using RFE choosing the top variables
rfe_lm_profile <- rfe(bushtrain[,-30], bushtrain[,30],
                 sizes = c(2,30),
                 rfeControl = ctrl_param,
                 newdata = bushtest[,-30])

#display the summary
rfe_lm_profile

The top variables are **DMC, DC, temp, month.3 and month.2**; under the alphabetical coding produced by `as.factor`, month.3 is dec and month.2 is aug. Since DMC and DC are highly correlated, we keep only **DMC, temp, month.3 and month.2** to avoid collinearity.

#### Tuning the model to find the best cost and gamma value


In [None]:
#https://www.kdnuggets.com/2017/03/building-regression-models-support-vector-regression.html
options(warn=-1)

#Tuning SVR model by varying values of maximum allowable error (epsilon) and cost parameter
#Tune the SVM model
OptModelsvm=tune(svm,log_area~., data=bushtrain,ranges=list(epsilon=seq(0,5,1), cost=seq(0.1,2,0.1)))

#Print optimum value of parameters
print(OptModelsvm)

#Plot the performance of SVM Regression model
plot(OptModelsvm)


The above R code tunes the SVR model by varying the maximum allowable error (epsilon) and the cost parameter. The tuning function evaluates 120 models (20 × 6), i.e. one for every combination of the cost parameter (0.1 to 2 in steps of 0.1) and the maximum allowable error (0 to 5 in steps of 1). The best model is the one with the lowest MSE; in the performance plot, the lower the MSE, the darker the region. In the run reported here, the best model had a cost of 0.4 with epsilon at 0.1, the `svm` default, and these values are used for the SVR below.
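
The size of the tuning grid can be checked directly from the two `seq` calls. The sketch below only counts the parameter combinations and does not fit any model; the object names are hypothetical.

```r
# Same ranges as in the tuning call above
epsilon_grid <- seq(0, 5, 1)      # 6 values
cost_grid    <- seq(0.1, 2, 0.1)  # 20 values

# Every combination of the two parameters is one candidate model
grid <- expand.grid(epsilon = epsilon_grid, cost = cost_grid)
print(nrow(grid))
```

Assuming `tune` is left at its default of 10-fold cross-validation, each of these combinations is fitted ten times, which explains why this cell takes a while to run.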

In [None]:
#finding the best tuned model 
BstModel=OptModelsvm$best.model
BstModel

In [None]:
#re-model with the features selected and tuned parameters
svr_model_new =svm(log_area ~ DMC+temp+month.3+month.2,data=bushtrain,cost=0.4,epsilon=0.1)
#display the summary
svr_model_new

Based on the two SVR models above, we conclude that the model with all the variables is better than the model with the selected features. Note that the change in gamma from **0.0344** to **0.25** follows from the `svm` default of 1/(number of predictors), 1/29 for the full model and 1/4 for the reduced one, so it reflects the number of features rather than a tuned value.

## 4. Model Comparison<a class="anchor" id="sec_4"></a>

#### Predicting the burnt area using 3 models built.

##### 1. Linear Regression

In [None]:
#Predict the area using linear model
linear_predict<-predict(linear_model, newdata = bushtest)

In [None]:
#display the metrics
cat("Metrics of Linear regression\n")
cat("RMSE\t:",RMSE(linear_predict, bushtest$log_area))
cat("\nR-squared:",R2(linear_predict, bushtest$log_area))
cat("\nMAE\t:",MAE(linear_predict, bushtest$log_area))
cat("\nVariance:",var(linear_predict,bushtest$log_area))

##### 2. Random Forest Regressor

In [None]:
#predicting the burned area using random forest
random_forest_predict <- predict(random_forest_rf,newdata=bushtest)

In [None]:
#display the metrics
cat("Metrics of Random Forest\n")
cat("RMSE\t:",RMSE(random_forest_predict,bushtest$log_area))
cat("\nR-squared:",R2(random_forest_predict,bushtest$log_area))
cat("\nMAE\t:",MAE(random_forest_predict,bushtest$log_area))
cat("\nvariance:",var(random_forest_predict,bushtest$log_area))

##### 3. Support Vector Regressor

In [None]:
#predicting the burned area using the svr
svr_predict<- predict(svr_model_new,newdata=bushtest)

In [None]:
#display the metrics
cat("Metrics of Support Vector regressor\n")
cat("RMSE\t:",RMSE(svr_predict,bushtest$log_area))
cat("\nR-squared:",R2(svr_predict,bushtest$log_area))
cat("\nMAE\t:",MAE(svr_predict,bushtest$log_area))
cat("\nVariance:",var(svr_predict,bushtest$log_area))

#### Functions to calculate the RMSE,R^2,MSE,MAE and variance for the models.

In [None]:
#function to calculate RMSE
rmse<- function(pred){
    RMSE=round(RMSE(pred, bushtest$log_area),5)
    return(RMSE)
}

#function to calculate MSE
mse<- function(pred){
    MSE=round(sum((pred - bushtest$log_area)^2)/length(pred),5)
    return(MSE)
}

#function to calculate MAE
mae<- function(pred){
    MAE=round(MAE(pred, bushtest$log_area),5)
    return(MAE)
}

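#function to calculate the "variance" metric
#note: var(x, y) with two arguments returns the covariance between predictions and observed values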
vari<- function(pred){
    var=round(var(pred, bushtest$log_area),5)
    return(var)
}

#function to calculate R-squared
r2<- function(pred){
    R2=round(R2(pred, bushtest$log_area),5)
    return(R2)
}

##### Storing the model name and metrics in a dataframe

In [None]:
#https://www.c-sharpcorner.com/article/r-data-frame-operations-adding-rows-removing-rows-and-merging-two-data-frame/
Model <-c("Linear Regression","Random Forest","Support Vector Regression") # model name
RMSE <-c(rmse(linear_predict),rmse(random_forest_predict),rmse(svr_predict)) # RMSE
R.squared <-c(r2(linear_predict),r2(random_forest_predict),r2(svr_predict)) # R-squared
MSE <- c(mse(linear_predict),mse(random_forest_predict),mse(svr_predict)) # MSE
MAE <- c(mae(linear_predict),mae(random_forest_predict),mae(svr_predict)) #MAE
variance <- c(vari(linear_predict),vari(random_forest_predict),vari(svr_predict)) # Variance

In [None]:
# data frame combining the models and their metrics
Metric <- data.frame(Model,RMSE,R.squared,MSE,MAE,variance)
# display the metric data frame
Metric

##### Displaying the metrics for the models in the form of barplots

##### 1. RMSE
Root Mean Square Error (RMSE) quantifies the error between the predicted values and the known (observed) values of the test set. The smaller the RMSE, the more closely the predicted and observed values match.

In [None]:
#Plotting a bargraph for RMSE
barplot(height=Metric$RMSE, names=Metric$Model, 
        col='SkyBlue',
        xlab="Model", 
        ylab="values", 
        main="RMSE", 
        ylim=c(0.0,0.65)
        )

##### 2. MSE

MSE is the mean of the squared errors and is the loss function used for least squares regression. It is the sum, over all the data points, of the squared difference between the predicted and actual target values, divided by the number of data points.

In [None]:
#plotting a bar graph for MSE
barplot(height=Metric$MSE, names=Metric$Model, 
        col='SkyBlue',
        xlab="Model", 
        ylab="values", 
        main="MSE", 
        ylim=c(0.0,0.40)
        )

##### 3. MAE

MAE is the average absolute difference between the predicted and the observed values. In a scatter plot of predicted against observed values, it is the average absolute vertical (or horizontal) distance between each data point and the Y=X line.
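
The three error metrics described so far can be written out in a few lines of base R. The sketch below uses made-up observed and predicted values on the log10(area + 1) scale, so the numbers are for illustration only and the variable names are hypothetical.

```r
# Toy observed and predicted values on the log10(area + 1) scale (made up)
observed  <- c(0.0, 0.5, 1.0, 1.5, 2.0)
predicted <- c(0.5, 0.5, 0.5, 1.0, 1.0)

errors <- predicted - observed

mse_value  <- mean(errors^2)     # mean squared error
rmse_value <- sqrt(mse_value)    # root of the MSE, in the same units as the target
mae_value  <- mean(abs(errors))  # mean absolute error

print(c(MSE = mse_value, RMSE = rmse_value, MAE = mae_value))
```

Because squaring gives more weight to large errors, the RMSE is never smaller than the MAE, and a large gap between the two points to a few badly predicted records.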

In [None]:
#plotting a bargraph for MAE
barplot(height=Metric$MAE, names=Metric$Model, 
        col='SkyBlue',
        xlab="Model", 
        ylab="values", 
        main="MAE", 
        ylim=c(0.0,0.50)
        )

##### 4. Variance
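Variance is the type of error that arises from a model's sensitivity to small fluctuations in the training dataset. A model with high variance fits the noise in the training dataset, which is commonly referred to as overfitting. Note that the quantity plotted here, `var(pred, bushtest$log_area)`, is the covariance between the predicted and observed values.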
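
Since the metric cells call `var` with two arguments, it is worth checking what that form returns. The sketch below uses made-up predictions and observations (hypothetical names and values) and compares the two-argument call with `cov`.

```r
# Toy predictions and observations (made-up values)
pred <- c(0.4, 0.6, 0.5, 0.9, 0.7)
obs  <- c(0.0, 1.2, 0.3, 1.5, 0.8)

# With two arguments, var() returns the covariance of the two vectors
two_arg <- var(pred, obs)
print(all.equal(two_arg, cov(pred, obs)))  # TRUE

# The spread of the predictions alone is the one-argument form
print(var(pred))
```

The two-argument value therefore measures how predictions and observations move together, not the variance of either one.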

In [None]:
#Plotting a barplot for variance
barplot(height=Metric$variance, names=Metric$Model, 
        col='SkyBlue',
        xlab="Model", 
        ylab="values", 
        main="Variance", 
        ylim=c(0.0,0.10)
        )

##### 5 . R-squared

R-squared (R2) is a measure which represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. 
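
There is more than one way to compute R-squared, and the choice matters when the values are as small as they are here. If we read the caret documentation correctly, `R2` defaults to the squared correlation between predictions and observations (`formula = "corr"`) and also offers a "traditional" form based on 1 - SSE/SST. The sketch below computes both in base R on made-up values; the variable names are hypothetical.

```r
# Toy observed values and predictions (made up)
observed  <- c(0.0, 0.5, 1.0, 1.5, 2.0)
predicted <- c(0.9, 1.0, 1.1, 1.2, 1.3)  # right ordering, but far too flat

# Squared correlation: ignores the scale and offset of the predictions
r2_corr <- cor(observed, predicted)^2

# 1 - SSE/SST: penalises predictions that are biased or badly scaled
r2_trad <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)

print(c(corr = r2_corr, traditional = r2_trad))
```

In this toy case the squared correlation is perfect while the traditional form is much lower, so the two definitions should not be compared with each other across models.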

In [None]:
#plotting the R-squared values in a bargraph
barplot(height=Metric$R.squared, names=Metric$Model, 
        col='SkyBlue',
        xlab="Model", 
        ylab="values", 
        main="R-squared", 
        ylim=c(0.0,0.05)
        )

#### Results:

- When comparing the **RMSE** for the three models, **Random Forest** has the lowest error.
- When comparing the **MSE** for the three models, **Random Forest** has the lowest error.
- When comparing the **MAE** for the three models, **Support Vector Regression** has the lowest error.
- When comparing the **Variance** for the three models, we have variance in **Random forest**
- When comparing the **R-squared** for the three models, **Support Vector Regression** has the highest value.

Based on the above metrics, **Random Forest** performs better than the other models.

## 5. Variable Identification and Explanation <a class="anchor" id="sec_5"></a>

Since our exploratory data analysis did not point to a clear set of variables, we performed feature selection to build better models. Some of our models improved considerably after selecting certain features.

### 1. Linear model variable identification 

For the linear regression we used **stepwise** selection in **sequential replacement (seqrep)** mode to get the best features, and we selected **DMC, temp, wind, month and day**.

-> lm(log_area ~ DMC + temp + wind +month.12 +month.1+month.4+month.3+month.9+day.1+day.5, data = bushtrain )

### 2. Random forest variable identification

For the random forest we used the **variable importance** (`importance` and `varImpPlot`) to select the features for the model, which resulted in the variables **temp, DMC, RH, DC, FFMC, wind and ISI**. We removed DC and RH to avoid collinearity issues.

-> randomForest(log_area~temp+DMC+ref_log_FFMC+wind+ISI, data=bushtrain,ntree=250,mtry=2,maxnodes=14)

### 3. Support Vector Regressor variable identification

For the support vector regressor, we used **Recursive Feature Elimination** and obtained five top variables, **DMC, DC, temp, month.3 and month.2**. As discussed earlier, we removed DC to avoid collinearity issues.

-> svr_model_new =svm(log_area ~ DMC+temp+month.3+month.2,data=bushtrain,cost=0.4,epsilon=0.1)

However, the above SVR model did not improve on the full model, so the full model with all the variables is considered the better of the two.

## 6. Conclusion <a class="anchor" id="sec_6"></a>

With these three models built, **Random Forest** is the best model, as it had comparatively low errors and the best fit with the features selected based on **variable importance**.

For the given bushfire dataset, none of the three models is a good fit, as the regression measures calculated are not as good as expected for any of them. This issue could be addressed by:
- Having more data in the dataset that could improve the quality of prediction
- Exploring more complex algorithms such as Neural Networks

## 7. References <a class="anchor" id="sec_7"></a>

**Exploratory Data analysis**
- https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
- https://rstudio-pubs-static.s3.amazonaws.com/547492_3001dd3441ab47d7989f5fe529a95868.html
- https://stats.stackexchange.com/questions/56678/analyzing-reflected-and-transformed-variables
- https://www.datanovia.com/en/lessons/transform-data-to-normal-distribution-in-r/

**Data Transformation**
- https://www.datanovia.com/en/lessons/transform-data-to-normal-distribution-in-r/
- https://www.analytics-link.com/post/2017/08/25/how-to-r-one-hot-encoding

**Feature Selection**
- https://www.youtube.com/watch?v=-2DlAMYioqY
- https://stackoverflow.com/questions/51999898/how-do-i-generate-a-decision-tree-plot-and-a-variable-importance-plot-in-random
- https://dataaspirant.com/2018/01/15/feature-selection-techniques-r/
- http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
- https://datasciencebeginners.com/2018/11/26/functions-and-packages-for-feature-selection-in-r/

**Model Development & Metric Calculations**
- http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
- https://towardsdatascience.com/random-forest-in-r-f66adf80ec9
- https://topepo.github.io/caret/recursive-feature-elimination.html
- https://uc-r.github.io/random_forests
- http://ugrad.stat.ubc.ca/R/library/randomForest/html/randomForest.html
- https://www.guru99.com/r-random-forest-tutorial.html
- https://topepo.github.io/caret/variable-importance.html
- https://www.listendata.com/2014/11/random-forest-with-r.html
- https://www.researchgate.net/post/what_is_the_acceptable_r-squared_value
- https://www.svm-tutorial.com/2014/10/support-vector-regression-r/
- https://datasciencebeginners.com/2018/11/26/functions-and-packages-for-feature-selection-in-r/
- https://www.kdnuggets.com/2017/03/building-regression-models-support-vector-regression.html
- http://www.columbia.edu/~yh2693/ForestFire.html
- https://www.c-sharpcorner.com/article/r-data-frame-operations-adding-rows-removing-rows-and-merging-two-data-frame/
- https://www.investopedia.com/terms/r/r-squared.asp
- https://datascience.stackexchange.com/questions/37345/what-is-the-meaning-of-term-variance-in-machine-learning-model
- https://en.wikipedia.org/wiki/Mean_absolute_error
- https://www.oreilly.com/library/view/machine-learning-with/9781785889936/669125cc-ce5c-4507-a28e-065ebfda8f86.xhtml
- https://gisgeography.com/root-mean-square-error-rmse-gis/
