**Abstract**: *On average, Algeria experiences 1,636 forest fires every year, which burn 35,024 ha of forest land annually. Scientists have also determined that forest fires explain 90% of Algerian forest land degradation. The majority of these fires (80%) are of unknown origin, making them difficult to anticipate. Our goal was to create a model that accurately predicts the chance of a forest fire occurring based on climate data. A major stakeholder for this project is the Algerian government. Reducing forest fires has been a major governmental initiative, especially since devastating fires in early 2022, and our model could help inform more efficient wildfire management programs. Additional stakeholders include the Algerian people and numerous wildlife organizations and researchers. We developed a logistic regression model that uses interactions between climate and forest fire index predictors and achieves 100% recall, meaning it does not fail to predict a forest fire when one occurs. We recommend that stakeholders optimize data collection for the predictors used in our model and employ a similar model to inform wildfire preparation efforts and processes.*

## Background / Motivation

As mentioned in the abstract, forest fires are a major problem in Algeria. In addition to destroying valuable forest land, forest fires are responsible for numerous deaths. In 2022, wildfires killed 44 people and displaced 500 families (source: https://reliefweb.int/disaster/fr-2022-000297-dza). Predicting when these wildfires will occur is integral to preventing deaths and the loss of forest land.

## Problem statement 

Our goal was to create a model that uses climate data to predict the likelihood of a forest fire occurring. This is a classification problem, and our main goal was prediction rather than inference. One important note is that we decided that failing to predict a fire when one occurred (a false negative) was significantly worse than predicting a fire when there was none (a false positive). This informed our model-building philosophy.

## Data sources

The data source used in this project is the Algerian Forest Fires Dataset Data Set from the UCI Machine Learning Repository. It is a multivariate data set with 244 instances, each recording whether or not a fire occurred on a specific day between June 2012 and September 2012 in one of two regions of northern Algeria: the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest. It has 12 variables in total, including 6 indexes from the Fire Weather Index (FWI) system, a unitless system widely used to measure general fire intensity potential. The FWI indexes and weather variables are listed below with their minimum and maximum values in the data:

- Fine Fuel Moisture Code (FFMC): 28.6 to 92.5
- Duff Moisture Code (DMC): 1.1 to 65.9
- Drought Code (DC): 7 to 220.4
- Initial Spread Index (ISI): 0 to 18.5
- Buildup Index (BUI): 1.1 to 68
- Fire Weather Index (FWI): 0 to 31.1
- Temp (at noon) in Celsius: 22 to 42
- Relative Humidity (RH) in %: 21 to 90 
- Wind speed (Ws) in km/h: 6 to 29 
- Rain (total day) in mm: 0 to 16.8

In our analysis, we found that day and month were almost useless as predictors, and thus excluded them from our model. Additionally, we had to take the collinearity in our data into consideration. In the FWI system, many of the indexes are derived from the same underlying variables. The FFMC is derived from Temp, RH, Wind, and Rain. The ISI is derived from FFMC and Wind. The DMC is based on Temp, RH, and Rain, and the DC is derived from Temp and Rain only. The BUI is in turn derived from DC and DMC. Finally, the FWI is based on ISI and BUI. Because these variables stack on top of each other, we accounted for this collinearity when creating our model.
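
The derivation chain described above can be written down as a small dependency map, which makes it easy to see why the indexes are collinear. The sketch below is only an illustration: the dictionary restates the relationships listed in the paragraph, `Ws` is used for wind speed as in the variable list, and the helper name `base_inputs` is hypothetical.

```python
# Direct inputs of each FWI-system index, as described in the paragraph above.
FWI_INPUTS = {
    "FFMC": ["Temp", "RH", "Ws", "Rain"],
    "DMC": ["Temp", "RH", "Rain"],
    "DC": ["Temp", "Rain"],
    "ISI": ["FFMC", "Ws"],
    "BUI": ["DMC", "DC"],
    "FWI": ["ISI", "BUI"],
}


def base_inputs(index):
    """Return the set of raw weather variables an index ultimately depends on."""
    found = set()
    for parent in FWI_INPUTS[index]:
        if parent in FWI_INPUTS:   # parent is itself a derived index
            found |= base_inputs(parent)
        else:                      # parent is a raw weather measurement
            found.add(parent)
    return found


for name in FWI_INPUTS:
    print(name, sorted(base_inputs(name)))
```

Every index traces back to some subset of the same four weather measurements, so including several indexes together with the raw weather variables feeds overlapping information into the model.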


## Stakeholders

We hope that this model will be able to support the Algerian government, the Algerian people, and local wildlife and climate organizations and researchers.

Being able to accurately predict forest fires will allow the Algerian government to allocate funds and staff to the places and groups that need them most. The model will also allow the Algerian people to live a bit more peacefully, since it reduces the stress of a fire going unanticipated. Because the model is tuned to avoid missed fires, these stakeholders can have more confidence that when the model does not classify a set of index readings as a fire, there most likely is no fire.

As climate change becomes an increasingly important topic, local and even international wildlife and climate groups will also benefit from this model. As this classification model sees wider use, these organizations will be able to learn about trends within this region and, more broadly, within the continent.

## Data quality check / cleaning / preparation 

The code below shows the distribution of values of each variable in tabular form, before any data cleaning occurred. From the data quality check, we found that there were only two rows with missing values. One of these was simply a row used for labelling by the publisher of the data, to show that rows 1-122 are from Bejaia and rows 123 onwards are from Sidi Bel-abbes. We dropped that row and created a new column that labels the regions numerically, with Region 1 as Bejaia and Region 2 as Sidi Bel-abbes. The other row was genuinely missing data, but since it was only one row and the rest of the data was consistent and clean, we decided to drop it. Afterwards, we had to clean some of the "no fire" and "fire" labels. Many of them had odd, unnecessary spacing, so we stripped the spaces and then encoded "no fire" and "fire" as the dummy values 0 and 1 respectively. Then, the columns needed to be converted to workable, numerical data types. Finally, we created the train and test sets using a 0.7 train and 0.3 test split, as it seemed to give the best performance.

```python
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

# header=1: the column names are read from the second line of the file
data = pd.read_csv('Algerian_forest_fires_dataset_UPDATE.csv', header=1)
df = data.copy()

# Summary statistics for every column, plus a row of missing-value counts
summary = df.describe(include='all')
summary.loc['missing_values'] = df.isnull().sum()

# Number of distinct values and the five most frequent values of each text column
cat_dist = pd.DataFrame(columns=['unique_values', 'top_values'])
for column in df.columns:
    if df[column].dtype == object:  # np.object was removed in NumPy 1.24
        unique_values = df[column].nunique()
        top_values = ", ".join(list(df[column].value_counts().index[:5]))
        cat_dist.loc[column] = [unique_values, top_values]

summary
```

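| | day | month | year | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI | Classes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 246 | 245 | 245 | 245 | 245 | 245 | 245 | 245.0 | 245.0 | 245 | 245.0 | 245 | 245.0 | 244 |
| unique | 33 | 5 | 2 | 20 | 63 | 19 | 40 | 174.0 | 167.0 | 199 | 107.0 | 175 | 128.0 | 9 |
| top | 1 | 7 | 2012 | 35 | 64 | 14 | 0 | 88.9 | 7.9 | 8 | 1.1 | 3 | 0.4 | fire |
| freq | 8 | 62 | 244 | 29 | 10 | 43 | 133 | 8.0 | 5.0 | 5 | 8.0 | 5 | 12.0 | 131 |
| missing_values | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1 | 1.0 | 1 | 1.0 | 2 |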


```python
cat_dist
```

| | unique_values | top_values |
|---|---|---|
| day | 33 | 01, 02, 30, 29, 28 |
| month | 5 | 07, 08, 06, 09, month |
| year | 2 | 2012, year |
| Temperature | 20 | 35, 31, 34, 33, 30 |
| RH | 63 | 64, 55, 58, 54, 78 |
| Ws | 19 | 14, 15, 13, 17, 16 |
| Rain | 40 | 0, 0.1, 0.2, 0.3, 0.4 |
| FFMC | 174 | 88.9, 89.4, 89.1, 85.4, 89.3 |
| DMC | 167 | 7.9, 12.5, 1.9, 3.4, 4.6 |
| DC | 199 | 8, 7.6, 7.8, 8.4, 7.5 |
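
The cleaning steps described in this section can be sketched as follows. This is a minimal illustration rather than our notebook code: it uses plain Python so it runs without the dataset, the rows in `raw_rows` are made up, and the helper names `clean` and `train_test_split_rows`, the two example columns, and the random seed are all assumptions.

```python
import random

# Hypothetical raw rows for illustration only (not taken from the dataset).
raw_rows = [
    {"Temperature": "29", "RH": "57", "Classes": "no fire   "},
    {"Temperature": "35", "RH": "40", "Classes": "fire "},
    {"Temperature": "", "RH": "", "Classes": ""},   # row with missing values
    {"Temperature": "33", "RH": "45", "Classes": " fire"},
    {"Temperature": "26", "RH": "80", "Classes": "no fire"},
]

LABELS = {"no fire": 0, "fire": 1}  # label spellings as quoted in the text above


def clean(rows):
    """Drop incomplete rows, strip label whitespace, and convert text to numbers."""
    cleaned = []
    for row in rows:
        if any(value.strip() == "" for value in row.values()):
            continue  # skip rows with any missing field
        cleaned.append({
            "Temperature": float(row["Temperature"]),
            "RH": float(row["RH"]),
            "Classes": LABELS[row["Classes"].strip()],
        })
    return cleaned


def train_test_split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split off test_fraction of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = clean(raw_rows)
train, test = train_test_split_rows(rows)
print(len(rows), len(train), len(test))
```

In pandas, the same steps correspond roughly to dropping rows with missing values, applying `str.strip()` to the label column, mapping the labels to 0 and 1, and converting the remaining columns to numeric types.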


## Exploratory data analysis (Mel)


## Approach

We used logistic regression because this is a classification problem. We chose to optimize for recall rather than precision or accuracy because of the context of our problem. In our opinion, it is better for our model to cause a false alarm (i.e. predict a fire when none occurs) than to fail to predict a fire when one actually occurs. In the latter case, people would be completely blindsided by the fire and the damage done would be much worse. When optimizing our model using forward and backward stepwise selection, we picked the model with the lowest AIC rather than the lowest BIC. AIC penalizes additional predictors less heavily than BIC, and we wanted our model to be less conservative for the reasons mentioned above.
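
For reference, the standard definitions of the two criteria (not taken from our code) are given below, where $k$ is the number of estimated parameters, $n$ is the number of training observations, and $\hat{L}$ is the maximized likelihood of the model:

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
$$

Each extra parameter costs 2 under AIC and $\ln n$ under BIC. Since $\ln n > 2$ whenever $n \geq 8$, BIC applies the heavier penalty on any data set of realistic size and tends to select smaller models, which is why choosing by AIC is the less conservative option.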

There is nothing unorthodox or new in our approach. One major problem we anticipated was collinearity between our predictors. As mentioned above, many of our predictors (especially the FWI indexes) are derived from the same "base" climate variables (temperature, rain, humidity, etc.). As such, we anticipated and experienced severe collinearity between our predictors. Our initial thought was to perform a VIF test (seen below) and select variables whose VIF was below 5. This ended up changing for reasons that will be explained in the "Developing the model" section.
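
As a reminder of what the VIF threshold means, the usual textbook definition of the variance inflation factor for predictor $j$ is shown below, where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all of the other predictors:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Under this definition, a cutoff of 5 corresponds to $R_j^2 = 0.8$, since $1 / (1 - 0.8) = 5$. In other words, a predictor fails the test when the other predictors explain more than 80% of its variance.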

Some solutions to this problem already exist on GitHub, but they mostly use different model-building methods than the ones we learned in class. As most of them use a combination of different methods, they ended up being slightly better than our model overall. However, our model is the only one we found that can hit 100% recall consistently.

## Developing the model

Our base model initially did not include all of the predictors. As mentioned earlier, many of our variables were strongly correlated, and the logit model, which by default uses an optimization method that relies on the Hessian matrix, would throw an error when all of them were included. As such, our base model consisted of the following variables: Temperature, RH, Ws, Rain, and FFMC. Each of these variables had a VIF of less than 5. The model was quite accurate on the training data and fairly accurate on the test data. You can see the confusion matrix in the code document labeled as **Figure 1**.

Eventually, we were able to get all of the variables to work in a basic logit model by changing the optimization method from the default Hessian-based method to the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. This required nothing more than adding `method='bfgs'` when fitting the logit model. The results were not significantly different from those of the VIF-tested model.

What was most concerning was the 5 false negatives. Even though the precision and accuracy were both fairly high, our goal was to maximize recall at whatever cost. First, we checked to see if any transformations were necessary. We binned each of the variables and tried to identify any potential non-linear trends. There didn't seem to be any, and when we tested quadratic and cubic terms, they hurt the model significantly. The plots of the binned variables against the percentage of fires can be seen in the code notebook.
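
To make the trade-off between these metrics concrete, the sketch below computes recall, precision, and accuracy from confusion-matrix counts. Only the 5 false negatives come from our results; the other counts are hypothetical placeholders, and the function name `classification_metrics` is our own choice for this illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of actual fires that were predicted
    precision = tp / (tp + fp)       # share of predicted fires that were real
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# Hypothetical counts: only fn=5 is taken from the text above.
recall, precision, accuracy = classification_metrics(tp=34, fp=2, fn=5, tn=32)
print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
```

With counts like these, precision and accuracy can both look healthy while recall lags behind, because every missed fire counts directly against recall.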

Our main technique was performing forward stepwise selection to find the best possible subsets of predictors and interactions. First, we performed a basic forward stepwise selection. As mentioned above, we chose the model with the lowest AIC because we wanted the model to be more aggressive. That returned a subset of six predictors, which can be seen in the code notebook. It excelled on the training data (100% recall) and did very well on the test data (97.4% recall). Still, we felt like we could do better.
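
The general shape of forward stepwise selection can be sketched as below. This is a simplified illustration rather than our notebook code: `forward_stepwise` accepts any scoring function to minimize, and `toy_aic` with its made-up log-likelihood gains stands in for fitting a real logistic regression and reading off its AIC.

```python
def forward_stepwise(candidates, score):
    """Greedy forward selection.

    candidates: list of term names.
    score: callable taking a list of terms and returning a criterion to minimize.
    Returns the best subset found across all sizes, and its score.
    """
    selected = []
    remaining = list(candidates)
    best_subset, best_score = [], score([])
    while remaining:
        # Try adding each remaining term and keep the one with the lowest score.
        trial_scores = [(score(selected + [term]), term) for term in remaining]
        step_score, step_term = min(trial_scores)
        selected.append(step_term)
        remaining.remove(step_term)
        if step_score < best_score:
            best_subset, best_score = list(selected), step_score
    return best_subset, best_score


# Made-up log-likelihood gains per term, for illustration only.
TOY_GAIN = {"FFMC": 30.0, "RH": 4.0, "Ws": 0.5, "Rain": 2.5}


def toy_aic(terms):
    """AIC-style score: 2 * (number of terms) - 2 * (toy log-likelihood)."""
    log_likelihood = -100.0 + sum(TOY_GAIN[t] for t in terms)
    return 2 * len(terms) - 2 * log_likelihood


subset, aic = forward_stepwise(list(TOY_GAIN), toy_aic)
print(subset, aic)
```

In this toy setup, a term is only worth adding when it raises the log-likelihood by more than 1, which is the cost AIC charges per parameter, so `Ws` is left out of the selected subset.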

Next, we performed forward stepwise selection to find the best possible subset of all potential degree-two interaction terms. That returned a model with 35 predictors, which can also be seen in the code notebook. This model also reached 100% recall on the training data but did slightly worse on the test data (92.3% recall). We felt that this was indicative of overfitting.

Our third and most successful approach was performing forward stepwise selection on all possible interactions derived from the subset of predictors returned by our basic forward stepwise selection (the six predictors mentioned two paragraphs ago). This returned a model with 13 predictors. The best subset plots can be seen in the code document labeled as **Figure 2**.
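
The candidate pool for this third approach can be enumerated as in the sketch below. It assumes, for illustration, that the six predictors are the six variables that appear in the final model equation, and it follows that equation's notation in writing an interaction as `A*B`.

```python
from itertools import combinations

# Assumption: the six predictors are the variables appearing in the final model equation.
predictors = ["FFMC", "FWI", "RH", "BUI", "Ws", "Rain"]

main_effects = list(predictors)
# Every unordered pair of distinct predictors gives one degree-two interaction.
interactions = [f"{a}*{b}" for a, b in combinations(predictors, 2)]
candidates = main_effects + interactions

print(len(main_effects), len(interactions), len(candidates))
```

Six predictors give 15 pairwise interactions, so stepwise selection searches 21 candidate terms here. That is a much smaller pool than the one built from every predictor in the data set, which may help explain why this approach overfit less.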

This model performed very well on the training data and pretty well on the test data. The confusion matrix can be seen in the code document labeled as **Figure 3**.

At this point, we felt our work was sufficient. Our model had hit 100% recall, which was the major goal. Even though accuracy and precision were worse than in the base model, we had optimized the most important metric. We did end up testing a few more ideas, including degree-three, degree-four, and degree-five interactions, but those ended up being significantly worse. The results of the degree-four interaction stepwise selection can be seen in the code notebook. We believe we achieved our goal.

Our final model equation is: <br>
`Classes = FFMC + FWI*RH + BUI*Ws + FWI*Rain + RH + FFMC*RH + FFMC*FWI + FWI + FFMC*Ws + BUI + RH*Ws + FFMC*BUI + RH*BUI`

## Limitations of the model with regard to inference / prediction (Abenezer)


## Conclusions and Recommendations to stakeholder(s)

It is somewhat difficult to outline policy-based recommendations for stakeholders, because the purpose of this model was simply to serve as a tool for predicting wildfires. One key takeaway is that whether or not a fire occurs depends on combinations of climate variables rather than on any single variable. In our model, we used many interaction terms between different climate variables. Additionally, we found that temperature was not as significant as rain (or rather the lack thereof) and relative humidity in determining whether or not there will be a fire. Finally, time variables (such as day, month, and year) are not very important for predicting fires. When we tested adding them to the model, they significantly reduced our recall, and thus we left them out. We recommend that stakeholders optimize the collection of climate variables to ensure that predictive models such as ours can be as accurate as possible.

## GitHub and individual contribution {-}

Link: https://github.com/ArushIyer12/303-2Project

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Abby Burt</td>
    <td>Data cleaning and EDA</td>
    <td>I cleaned and processed all of the data to prepare it for the model building, EDA, and analysis. I also visualized the distributions of all the variables in the starting set and simple variable interactions.</td>
    <td>11</td>
  </tr>
  <tr>
    <td>Mel Megala</td>
    <td>EDA</td>
    <td>Insert</td>
    <td>Insert</td>
  </tr>
    <tr>
    <td>Arush Iyer</td>
    <td>Model building and variable selection</td>
    <td>Performed variable selection on predictors to address multicollinearity and overfitting. Created and tested model interactions and developed the final model.</td>
    <td>12</td>    
  </tr>
    <tr>
    <td>Abenezer Bekele</td>
    <td>Model analysis</td>
    <td>Insert</td>
    <td>Insert</td>    
  </tr>
</table>
</body>
</html>


