# DSCI 521: Data Analysis and Interpretation <br> Final Report

# Team
*   **Austin Eversole** (ae588@drexel.edu): Computer Science student with interest in 3D Rendering and Modeling as well as climate studies. Background in defense industry and computer graphics.

*   **Greg Savage** (gs824@drexel.edu): Background in social science and criminal justice research. Interested in data analysis and statistical analysis for use in the behavioral health field.

*   **Robert Thompson** (rt598@drexel.edu): Software Engineering student with a software engineering background in the defense industry. Interested in using machine learning algorithms to identify and solve complex optimization problems.

# Project Overview and Introduction

The purpose of this project was to analyze California wildfire data to learn about the frequency and duration of wildfires and the damage they cause to structures and human life. Wildfires in California have been on the rise, especially in recent years, and have left destruction and devastation in their wake. Analyzing this wildfire data can give insight into patterns that may exist and persist in the future.

Exploratory data analysis was conducted to understand the dataset being utilized. This provided a basic understanding of several factors associated with wildfires including timing, size, and the locations of wildfires in California. Later, several machine learning models were created in order to predict if a fire would be classified as major or not as well as predict the location of future fires. The results of that analysis are described below.

## Who Might Be Interested?

Because the data is provided by the California state government, local, state, and federal agencies could be interested in viewing and understanding our data analysis. Scientists and data scientists could also find this data and report useful, given the increasing frequency and severity of fires that have plagued not only California but North America as a whole.

# Data Set

This report uses the California wildfire data set developed by the California state government and hosted on Kaggle. The data set covers wildfire data from 2013 to 2020 for a total of 7 years of incidents. It contains a combination of structured and numeric data across 40 columns (as seen below), with 1636 samples in total. In the later parts of this report, we discuss how feature variables and targets were chosen to classify major fire incidents and predict the location of a fire incident.

| Field      | Description |
| :----------- | :----------- |
| AcresBurned      | Acres of land affected by wildfires      |
| Active | Is the fire active or contained? |
| AdminUnit | Agency where fire started |
| AirTankers | Number of air tanker resources assigned |
| ArchiveYear | Year the data was archived |
| CalFireIncident | Is the incident treated as a CalFire incident? |
| CanonicalUrl | Where on the fire.ca.gov website to find incident information |
| ConditionStatement | Observations and notes about the incident |
| ControlStatement | Movement controls around the fire such as excavations or road closures |
| Counties | County where fire started |
| CountyIds | ID number of the county where the fire started |
| CrewsInvolved | The number of fire crews involved |
| Dozers | The number of bulldozer resources assigned |
| Engines | The number of engine resources assigned |
| Extinguished | Extinguished date |
| Fatalities | Fatality count |
| FuelType | Type of material that burned |
| Helicopters | Number of helicopter resources assigned |
| Injuries | Count of injured personnel |
| Latitude | Latitude of Wildfire incident |
| Location | Description of the location |
| Longitude | Longitude of Wildfire incident |
| MajorIncident | Is it considered a major incident or not? |
| Name | Name of the wildfire |
| PercentContained | What percent of the fire is contained? |
| PersonnelInvolved | Number of CalFire personnel involved |
| Public | Is the fire a public or private residence fire? |
| SearchDescription | Description of fire incident |
| SearchKeywords | Key words used to map back to a given fire incident |
| Started | Fire start date |
| StructuresDamaged | Count of structures damaged |
| StructuresDestroyed | Count of structures destroyed |
| StructuresEvacuated | Count of structures evacuated |
| StructuresThreatened | Count of structures threatened |
| UniqueID | Incident unique alphanumeric id |
| Updated | Last update date |
| WaterTenders | The number of water tender resources assigned |

# Exploratory Data Analysis


## Major Fires vs. Total Fires Per Year

![major_fires_and_total_fires.png](../images/major_fires_and_total_fires.png)

Across the span of seven years, California was most plagued by fires, including major fires, in 2017 and 2018. The differential in totals across the span of seven years can be visualized in the bar plot above as well as in table format below:

||2013|2014|2015|2016|2017|2018|2019|
|:----|:----|:----|:----|:----|:----|:----|:----|
|Major Fire Incidents|44|40|52|49|93|74|24|
|Total Fire Incidents|158|115|149|182|437|315|191|
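
The counts in this table can also be read as proportions. The sketch below is illustrative only: it copies the counts from the table and computes the share of each year's incidents that were major. The variable names are our own and are not taken from the project code.

```python
# Counts copied from the table above.
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019]
major = [44, 40, 52, 49, 93, 74, 24]
total = [158, 115, 149, 182, 437, 315, 191]

# Share of each year's incidents that were classified as major.
major_share = {y: m / t for y, m, t in zip(years, major, total)}

for year, share in major_share.items():
    print(f"{year}: {share:.1%} of incidents were major")

# Years with the highest raw counts.
peak_total_year = years[total.index(max(total))]
peak_major_year = years[major.index(max(major))]
```

Read this way, 2017 has the most incidents of either kind, but its share of major incidents (93 of 437, about 21%) is lower than in 2014 (40 of 115, about 35%).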

## Months with the Most Fires

![number_of_fires_per_month.png](../images/number_of_fires_per_month.png)

Now that we have examined the annual fire incidents in the data set and determined that 2017 and 2018 contained the largest number of major and total fire incidents, we can examine which months were most plagued by fires.

In the bar plot above, we can see a large disparity in the data: most fires occur during the summer months. The table below shows the incident counts for each of the summer months:

||June|July|August|
|:----|:----|:----|:----|
|Total Fire Incidents|313|394|265|

One of the articles referenced below states, "California is dry for most of the year. Precipitation only comes during the winter months. This is typically followed by a dry and hot summer." This aligns with the data presented above, which shows that most fires in California occur in the summer months.

## Number of Fires and Acres Burned
![number_of_fires_and_acres_burned.png](../images/number_of_fires_and_acres_burned.png)

There is only a weak relationship between the number of fires per year and the number of acres burned. In 2013 and 2015, the two measures diverge only slightly. The biggest outliers in the plot above are 2017 and 2018: 2017 had significantly more acres burned relative to the number of fires that occurred, while 2018 had significantly more fires relative to the acres burned. Overall, the number of fires and the acres burned are only slightly correlated.

## Counties Most Affected By Fires
![top_counties_with_fires.png](../images/top_counties_with_fires.png)

Over the span of seven years, Riverside had the most reported fire incidents with 138. This is the highest number of incidents across the fifty-eight counties in California. The county with the next highest number of reported fires is San Diego with 85.


## Counties With Most Acres Burned
![top_counties_with_acres_burned.png](../images/top_counties_with_acres_burned.png)

When comparing the total number of fires and total acres burned in California counties, there is a small correlation between the two. If you look at the plot of the top fifteen counties, two counties were common between the two data frames: Shasta and Siskiyou. A further analysis is below:

*   **Riverside** county had the highest number of fires reported but was not in the top fifteen counties for acres burned.
*   **Lake** county had the greatest number of acres burned but was #11 on the reported number of fires.
*   **Mendocino** county, which had the second highest number of acres burned, was not in the top ten counties for reported fires.

# Machine Learning


## Data Pre-Processing

Our pre-processing began during the exploratory data analysis phase. We first replaced all the NaN (Not a Number) values in the data set with 0. Replacing them with 0 makes sense in this data set because the variables are quantifiable. For example, the *StructuresDestroyed* column has numerical values but also has NaN values. We can reasonably assume that the NaN values correspond to 0 in this case.
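
A minimal sketch of this step with pandas, assuming a DataFrame named `fires` (a hypothetical name) with made-up values. It limits the fill to numeric columns, which is an assumption on our part; filling every column, as described above, would be `fires.fillna(0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the incident table; the values are made up.
fires = pd.DataFrame({
    "Name": ["Fire A", "Fire B", "Fire C"],
    "StructuresDestroyed": [12.0, np.nan, 3.0],
    "Fatalities": [np.nan, np.nan, 1.0],
})

# Treat a missing count as zero, but only in numeric columns (assumption).
numeric_columns = fires.select_dtypes(include="number").columns
fires[numeric_columns] = fires[numeric_columns].fillna(0)

print(fires)
```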

Next, we parsed the UTC dates in the dataset and decomposed them into years and months. This enabled us to create more human-readable columns (YearStarted, MonthStarted, FireDuration, and FireDurationDays) to visualize the effects of fires in California.

We then removed entries where the value for FireDurationDays was negative (invalid entries) as well as entries where YearStarted was earlier than 2000 (outliers). We are not interested in fires that predate 2000 because they are not relevant to our model predictions.
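
The date handling and filtering could look roughly like the sketch below. The `Started` and `Extinguished` column names come from the field table; the ISO 8601 timestamp format, the made-up rows, and storing MonthStarted as a month number are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: two valid fires, one with a pre-2000 start, one that "ends" before it starts.
fires = pd.DataFrame({
    "Started": ["2017-10-09T00:45:00Z", "2018-07-27T21:03:00Z",
                "1969-12-31T16:00:00Z", "2019-06-01T10:00:00Z"],
    "Extinguished": ["2017-10-31T09:30:00Z", "2018-09-19T19:30:00Z",
                     "2018-01-09T13:46:00Z", "2019-05-30T10:00:00Z"],
})

# Parse the timestamps as timezone-aware UTC datetimes.
started = pd.to_datetime(fires["Started"], utc=True)
extinguished = pd.to_datetime(fires["Extinguished"], utc=True)

# Derive the human-readable columns named in the report.
fires["YearStarted"] = started.dt.year
fires["MonthStarted"] = started.dt.month
fires["FireDuration"] = extinguished - started
fires["FireDurationDays"] = fires["FireDuration"].dt.days

# Drop negative durations (invalid) and start years before 2000 (outliers).
valid = (fires["FireDurationDays"] >= 0) & (fires["YearStarted"] >= 2000)
fires = fires[valid].reset_index(drop=True)

print(fires[["YearStarted", "MonthStarted", "FireDurationDays"]])
```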

For each machine learning problem we were attempting to solve, we hand-selected feature columns based on the combination of what the data was telling us and what features we believed were most useful in each problem. Our target (or label) was also selected based on what problem we were attempting to solve.

Once the features were selected, we split the data into 75% training and 25% testing data. We then used scikit-learn's StandardScaler to standardize the training and test data. With our features and target selected, the next step was to choose which machine learning models to use for our problems.
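
A minimal sketch of the split and scaling, using synthetic data in place of the wildfire features. The `random_state`, the stratified split, and fitting the scaler on the training portion only are assumptions; the report does not say how these were configured.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 75% training, 25% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Learn the mean and standard deviation from the training data,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing step.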

## Models

For both of our machine learning problems, we chose models that can perform binary and multi-class classification. We decided on Logistic Regression, Decision Tree, and Random Forest Classifiers.

The Logistic Regression Classifier was selected because it is well suited to binary classification but can also be used for multi-class classification. Based on the features, it uses the logistic function to estimate the probability that a given sample belongs to a certain class.
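
For reference, the logistic function mentioned above can be written as follows for the binary case. The symbols are our notation and are not taken from the project code: $\mathbf{x}$ is the feature vector, $\mathbf{w}$ the learned weights, and $b$ the intercept.

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

A sample is typically assigned to class 1 when this probability exceeds 0.5.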

The Decision Tree Classifier was chosen for its incremental approach and the ability to visualize the tree as it is broken down.

The Random Forest Classifier was chosen because it is essentially a *forest* of N decision trees. Each tree is built on a bootstrapped sample of the data, and the final label is chosen based on the results of all the decision trees.
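
The idea of combining the trees' results can be illustrated with a plain majority vote. This is a simplified sketch with made-up votes and a hypothetical helper name; scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions from five trees for a single incident.
votes = ["Riverside", "San Diego", "Riverside", "Shasta", "Riverside"]
winner = majority_vote(votes)

print(winner)
```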

Decision trees are very prone to overfitting, and a random forest can still overfit even though it combines many trees. So although we will see near-perfect results later in the report, there is a possibility that our trees and our forest have been overfitted.

The table below shows the hyperparameters that were changed from their default values in each of the models. These hyperparameter changes are the same across both classification problems:

|Model|Hyperparameter|
|:----|:----|
|Logistic Regression|max_iter=1000|
|Decision Tree|criterion='entropy'|
|Random Forest|criterion='entropy'|
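
In scikit-learn, the three models with these hyperparameters could be constructed as in the sketch below. Only the hyperparameters listed in the table are set; the dictionary and its keys are our own naming.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Every other hyperparameter is left at its scikit-learn default.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy"),
    "Random Forest": RandomForestClassifier(criterion="entropy"),
}

for name, model in models.items():
    print(name, model)
```

Because the trees and the forest are randomized, passing a fixed `random_state` would be needed to reproduce exact scores; whether the project did so is not stated.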


## Classifying a Major Fire Incident

### Feature and Target Selection

When it came to predicting a major fire incident from the data set, we needed to first look at the columns and determine what variable should be our target. We selected the **MajorIncident** column to be our target and we changed its values from True/False to 1/0 for our models.

After selecting our target variable, we now had to decide which of the 39 remaining columns would be used to determine if a fire incident was major or not. We decided on the following columns as our features:

| Field      | Description |
| :----------- | :----------- |
| AcresBurned      | Acres of land affected by wildfires      |
| Fatalities | Fatality count |
| Injuries | Count of injured personnel |
| StructuresDamaged | Count of structures damaged |
| StructuresDestroyed | Count of structures destroyed |
| StructuresEvacuated | Count of structures evacuated |
| StructuresThreatened | Count of structures threatened |

We selected these features because they were deemed to be the most closely aligned with overall damage, both to property and to human life. With our features and target chosen, our data split into training and test sets, and our data standardized, it was time to pass the data to our models for prediction!
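
A short sketch of the feature and target selection for this problem, assuming a DataFrame named `fires` (a hypothetical name) with made-up rows. The column names are the ones listed in the table above.

```python
import pandas as pd

feature_columns = [
    "AcresBurned", "Fatalities", "Injuries", "StructuresDamaged",
    "StructuresDestroyed", "StructuresEvacuated", "StructuresThreatened",
]

# Made-up incidents for illustration only.
fires = pd.DataFrame({
    "AcresBurned": [1200.0, 35.0, 48000.0],
    "Fatalities": [0, 0, 2],
    "Injuries": [1, 0, 5],
    "StructuresDamaged": [3, 0, 40],
    "StructuresDestroyed": [10, 0, 300],
    "StructuresEvacuated": [0, 0, 0],
    "StructuresThreatened": [50, 0, 1000],
    "MajorIncident": [True, False, True],
})

X = fires[feature_columns]
# Convert the True/False target into 1/0.
y = fires["MajorIncident"].astype(int)

print(X.shape, y.tolist())
```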

### Train Data Evaluation

The table below shows each model's performance when evaluated on the training data. The Decision Tree and Random Forest Classifiers scored highest with 93% accuracy, followed by the Logistic Regression Classifier with 83% accuracy.

|Model|Target|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|:----|
|Logistic Regression|Not a Major Incident (0)|0.83|1.0|0.81|0.89|
|Logistic Regression|Major Incident (1)|0.83|0.38|0.99|0.55|
|Decision Tree|Not a Major Incident (0)|0.93|1.0|0.91|0.95|
|Decision Tree|Major Incident (1)|0.93|0.75|0.99|0.85|
|Random Forest|Not a Major Incident (0)|0.93|1.0|0.91|0.95|
|Random Forest|Major Incident (1)|0.93|0.75|0.98|0.85|
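
The F1-Score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for two "Major Incident (1)" rows of the table; the helper function name is our own.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the "Major Incident (1)" rows above.
lr_f1 = f1_from(precision=0.99, recall=0.38)  # Logistic Regression
dt_f1 = f1_from(precision=0.99, recall=0.75)  # Decision Tree

print(round(lr_f1, 2), round(dt_f1, 2))
```

Both values round to the figures in the table (0.55 and 0.85). Other rows can differ in the last digit because the tabulated precision and recall are themselves rounded.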


### Test Data Evaluation

The table below shows each model's performance when evaluated on the testing data. The Logistic Regression Classifier scored highest with 90% accuracy, while the Decision Tree and Random Forest Classifiers both reached 88% accuracy. Its recall for major incidents was only 0.22, however, so the Logistic Regression Classifier missed most major incidents despite its higher accuracy.

|Model|Target|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|:----|
|Logistic Regression|Not a Major Incident (0)|0.9|1.0|0.89|0.94|
|Logistic Regression|Major Incident (1)|0.9|0.22|1.0|0.35|
|Decision Tree|Not a Major Incident (0)|0.88|0.93|0.93|0.93|
|Decision Tree|Major Incident (1)|0.88|0.51|0.54|0.53|
|Random Forest|Not a Major Incident (0)|0.88|0.94|0.93|0.93|
|Random Forest|Major Incident (1)|0.88|0.53|0.56|0.55|
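
The gap between accuracy and per-class recall in this table is typical of imbalanced targets. The toy example below uses made-up labels, not the project's predictions, to show how a classifier can score well on accuracy while missing most of the positive class.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels: six non-major incidents (0) and four major incidents (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# A cautious classifier that flags only one of the four major incidents.
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
# Each returned array holds one value per class, in the order given by `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision[1]:.2f} recall={recall[1]:.2f} f1={f1[1]:.2f}")
```

Here every flagged incident is correct, so precision for class 1 is perfect, yet three of the four major incidents are missed. The Logistic Regression row for "Major Incident (1)" shows the same pattern.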

## Predicting a Fire's Location

### Feature and Target Selection

When it came to predicting the location of a fire incident from the data set, we needed to first look at the columns and determine what variable should be our target. We could either be extremely precise and use the latitude and longitude values to see where clusters of fires occur, or look at fire incidents by county. We decided on the latter, using the **Counties** column as our target.

After selecting our target variable we now had to decide which of the 39 remaining columns would be used to predict the location of a fire incident. We decided on the following columns as our features:

| Field      | Description |
| :----------- | :----------- |
| AcresBurned      | Acres of land affected by wildfires      |
| AirTankers | Number of air tanker resources assigned |
| CalFireIncident | Is the incident treated as a CalFire incident? |
| CrewsInvolved | The number of fire crews involved |
| Dozers | The number of bulldozer resources assigned |
| Engines | The number of engine resources assigned |
| Fatalities | Fatality count |
| Helicopters | Number of helicopter resources assigned |
| Injuries | Count of injured personnel |
| Latitude | Latitude of Wildfire incident |
| Longitude | Longitude of Wildfire incident |
| MajorIncident | Is it considered a major incident or not? |
| PersonnelInvolved | Number of CalFire personnel involved |
| Public | Is the fire a public or private residence fire? |
| StructuresDamaged | Count of structures damaged |
| StructuresDestroyed | Count of structures destroyed |
| StructuresEvacuated | Count of structures evacuated |
| StructuresThreatened | Count of structures threatened |
| WaterTenders | The number of water tender resources assigned |

With our features and target chosen, our data split into training and test sets, and our data standardized, it was time to pass the data to our models for prediction!

When we first used all of the data, we realized that we had not spent as much time examining it as we should have. Earlier models and predictions produced results that pointed to outliers in our data, which led us to go back, analyze the data again, and pre-process it further. Below are the steps we took to set up our data for accurate prediction:
- Convert our boolean feature variables into 1s and 0s
- Research and compile a list of all counties in California and remove any incident from our data set that did not occur in one of them. This was an important step because we noticed entries from Oregon, Washington, and Mexico listed in the **Counties** column.
- Keep only counties that had more than 50 fire incidents

After all further pre-processing, this left us with 578 total samples of data that would be used to predict the location of a fire.
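
The two county filters described above can be sketched as follows. The function name, the truncated county list, and the made-up incidents are hypothetical; only the `Counties` column name and the threshold of 50 come from the report.

```python
import pandas as pd

def filter_counties(fires, valid_counties, min_incidents=50):
    """Keep incidents in known counties that have more than min_incidents rows."""
    in_state = fires[fires["Counties"].isin(valid_counties)]
    counts = in_state["Counties"].value_counts()
    frequent = counts[counts > min_incidents].index
    return in_state[in_state["Counties"].isin(frequent)].reset_index(drop=True)

# Hypothetical, truncated county list and made-up incidents for illustration.
california_counties = {"Riverside", "San Diego", "Shasta"}
fires = pd.DataFrame({
    "Counties": ["Riverside", "Riverside", "Riverside",
                 "San Diego", "San Diego", "Shasta", "State of Oregon"],
})

# A threshold of 1 stands in for the report's threshold of 50 on this tiny sample.
kept = filter_counties(fires, california_counties, min_incidents=1)
print(kept["Counties"].value_counts())
```

On this sample the out-of-state row and the single Shasta incident are dropped, leaving only the counties with enough incidents to learn from.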

### Train Data Evaluation

The table below shows each model's performance when evaluated on the training data. The Decision Tree and Random Forest Classifiers scored highest with 98% accuracy, while the Logistic Regression Classifier reached only 38% accuracy.

|Model|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|
|Logistic Regression|0.38|0.32|0.56|0.3|
|Decision Tree|0.98|0.98|0.99|0.98|
|Random Forest|0.98|0.98|0.99|0.98|
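
The table reports a single recall, precision, and F1 value per model even though the target has several classes, and the report does not state how the per-class scores were combined. One common choice is macro averaging, the unweighted mean over classes. The sketch below assumes macro averaging and uses made-up labels.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up county labels for six incidents; "B" is misclassified once as "C".
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging: compute the metric per class, then take the unweighted mean.
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.2f} precision={macro_precision:.2f} "
      f"recall={macro_recall:.2f} f1={macro_f1:.2f}")
```

With macro averaging every county counts equally regardless of how many incidents it has, which matters when a few counties dominate the data.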


### Test Data Evaluation

The table below shows each model's performance when evaluated on the testing data. The Random Forest Classifier scored highest with an almost perfect 99% accuracy. The Decision Tree Classifier reached 95% accuracy and the Logistic Regression Classifier 39% accuracy.

|Model|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|
|Logistic Regression|0.39|0.32|0.56|0.3|
|Decision Tree|0.95|0.98|0.99|0.98|
|Random Forest|0.99|0.98|0.99|0.98|

# Conclusion

When classifying a major fire incident, the Logistic Regression Classifier performed the best with 90% accuracy on the testing data, while the Decision Tree and Random Forest Classifiers achieved 88% accuracy. The Logistic Regression Classifier most likely performed best because the binary classification problem we attempted to solve is close to a linear problem, which is the kind of problem Logistic Regression handles well.

For predicting the location of a fire incident, the Logistic Regression Classifier performed extremely poorly with 39% accuracy on the testing data. The Decision Tree reached 95% accuracy and the Random Forest a near-perfect 99% accuracy. The Decision Tree and Random Forest Classifiers are better suited to this multi-class classification problem, and that is extremely apparent in the results we see above.

Overall, across both problems the Decision Tree and Random Forest classifiers both predicted at greater than 80% accuracy and as a team, we are extremely happy with these results.



# Future Work

This work could be applied to a larger investigation of where fire resources are currently located in California and if they could be more effectively allocated to where fires have been and may be located in the future.

We must remember that the dataset limits our classification and prediction to data from 2013-2020. Wildfires have increased in frequency in California over the last 50 years, and because we are training on only a 7-year window, the usefulness of our outcomes may be limited. One way to improve this in the future is to seek out more recent wildfire data from the CalFire open data portal (CalFire is the source of the Kaggle dataset), although further pre-processing would be required to match the format of this dataset.

It is important to note that this dataset is confined to California. We must keep this limitation in mind if we attempt to draw generalized conclusions from our predictive analysis. Taking a broader bird's eye view of the entire North American continent could yield a better overall analysis and aid in future classification.

Another thought for future work would be to look at wildfires across North America. In 2023, the continent has experienced wildfires in California, Canada, New Jersey, and more.

# References

**California Wildfire Data (2013 - 2020):**

https://www.kaggle.com/datasets/ananthu017/california-wildfire-incidents-20132020

**California Government Fires**

https://www.fire.ca.gov/

**California State Facts (2016)**

https://sgf.senate.ca.gov/sites/sgf.senate.ca.gov/files/county_facts_2016.pdf

**Why Does California Have So Many Wildfires?**

https://a-z-animals.com/blog/why-does-california-have-so-many-wildfires/

**California Counties**

https://www.ndangira.net/list-of-california-counties/