# DSCI 521: Data Analysis and Interpretation <br> Final Report

# Team
*   **Austin Eversole** (ae588@drexel.edu): Computer Science student with interest in 3D Rendering and Modeling as well as climate studies. Background in defense industry and computer graphics.

*   **Greg Savage** (gs824@drexel.edu): Background in social science and criminal justice research. Interested in data analysis and statistical analysis for use in the behavioral health field.

*   **Robert Thompson** (rt598@drexel.edu): Software Engineering student with a software engineering background in the defense industry. Interested in solving optimization problems by leveraging machine learning algorithms to identify and solve complex problems.

# Project Overview and Introduction

The purpose of this project is to analyze California wildfire data to learn about the frequency and duration of fires and the damage they can cause to structures and human life. Wildfires in California have been on the rise, especially in recent years, and have left destruction and devastation in their wake. Analyzing this wildfire data gives us insight into patterns that may exist and persist in the future.

Our goal is to perform exploratory data analysis and then identify the machine learning model that most accurately classifies whether a fire incident is major or not.

## Data Set

### Column Descriptions

| Field      | Description |
| :----------- | :----------- |
| AcresBurned      | Acres of land affected by wildfires      |
| Active | Whether the fire is active or contained |
| AdminUnit | Agency where fire started |
| AirTankers | Number of air tanker resources assigned |
| ArchiveYear | Year the data was archived |
| CalFireIncident | Is the incident treated as a CalFire incident? |
| CanonicalUrl | Where on the fire.ca.gov website to find incident information |
| ConditionStatement | Observations and notes about the incident |
| ControlStatement | Movement controls around the fire such as evacuations or road closures |
| Counties | County where fire started |
| CountyIds | ID number of the county where the fire started |
| CrewsInvolved | The number of fire crews involved |
| Dozers | The number of bulldozer resources assigned |
| Engines | The number of engine resources assigned |
| Extinguished | Extinguished date |
| Fatalities | Fatality count |
| FuelType | Type of material that burned |
| Helicopters | Number of helicopter resources assigned |
| Injuries | Count of injured personnel |
| Latitude | Latitude of Wildfire incident |
| Location | Description of the location |
| Longitude | Longitude of Wildfire incident |
| MajorIncident | Is it considered a major incident or not? |
| Name | Name of the wildfire |
| PercentContained | What percent of the fire is contained? |
| PersonnelInvolved | Number of CalFire personnel involved |
| Public | Is the fire a public or private residence fire? |
| SearchDescription | Description of fire incident |
| SearchKeywords | Key words used to map back to a given fire incident |
| Started | Fire start date |
| StructuresDamaged | Count of structures damaged |
| StructuresDestroyed | Count of structures destroyed |
| StructuresEvacuated | Count of structures evacuated |
| StructuresThreatened | Count of structures threatened |
| UniqueID | Incident unique alphanumeric id |
| Updated | Last update date |
| WaterTenders | The number of water tender resources assigned |

## Application / Investigation

We intend to perform data analysis on the California Wildfire Data (2013-2020) Kaggle dataset. We begin by applying exploratory data analysis techniques using summarization and association approaches. This analysis supports our goal of classifying major fires and/or predicting how many acres will be burned each year by helping us evaluate the integrity of the dataset and by guiding our data preprocessing. Based on the results of the exploratory data analysis, we can then use a supervised machine learning model for major fire classification and burned acres prediction.

# Exploratory Data Analysis

## Major Fires vs. Total Fires Per Year

![major_fires_and_total_fires.png](../images/major_fires_and_total_fires.png)

California saw the most fires, including major fires, in 2017 and 2018. The totals are listed below:
* 2017:
  * 93 Major Fires
  * 437 Total Fires
* 2018:
  * 74 Major Fires
  * 315 Total Fires
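
Yearly totals like those above come from a grouped count. The following is a minimal sketch, assuming each incident is a record carrying the `ArchiveYear` and `MajorIncident` fields from the column table. The sample rows and the helper name `fires_per_year` are made up for illustration and are not taken from the dataset or from our notebook.

```python
from collections import Counter

# Hypothetical sample rows; real rows would be read from the Kaggle CSV.
incidents = [
    {"ArchiveYear": 2017, "MajorIncident": True},
    {"ArchiveYear": 2017, "MajorIncident": False},
    {"ArchiveYear": 2017, "MajorIncident": False},
    {"ArchiveYear": 2018, "MajorIncident": True},
    {"ArchiveYear": 2018, "MajorIncident": False},
]


def fires_per_year(rows):
    """Return (total, major) Counters keyed by ArchiveYear."""
    total = Counter(row["ArchiveYear"] for row in rows)
    major = Counter(row["ArchiveYear"] for row in rows if row["MajorIncident"])
    return total, major


total, major = fires_per_year(incidents)
for year in sorted(total):
    print(f"{year}: {major[year]} major of {total[year]} total")
```

Keeping the two counters separate makes it easy to plot major and total fires side by side for each year.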

## Months with the Most Fires

![number_of_fires_per_month.png](../images/number_of_fires_per_month.png)

Having established that 2017 and 2018 had the greatest number of major and total fires, we can see in the plot above that June, July, and August contain the greatest number of fires:
* June: 313
* July: 394
* August: 265

An article listed in the References states, "California is dry for most of the year. Precipitation only comes during the winter months. This is typically followed by a dry and hot summer." This is consistent with the data above, which shows that most fires in California occur in the summer months.
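
Monthly counts of this kind can be derived from the `Started` field. The sketch below assumes `Started` holds ISO-8601 strings with a trailing `Z`; that format, the sample timestamps, and the helper name `fires_per_month` are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps; the "%Y-%m-%dT%H:%M:%SZ" format is an assumption.
started = [
    "2017-07-04T13:05:00Z",
    "2017-07-19T09:30:00Z",
    "2018-06-23T17:45:00Z",
    "2018-12-01T08:00:00Z",
]


def fires_per_month(timestamps, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Count incidents by calendar month number (1 = January)."""
    return Counter(datetime.strptime(ts, fmt).month for ts in timestamps)


counts = fires_per_month(started)
for month, n in sorted(counts.items()):
    print(f"month {month:02d}: {n} fires")
```

Counting by month number rather than month name keeps the result independent of the system locale and sorts naturally for plotting.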

## Number of Fires and Acres Burned
![number_of_fires_and_acres_burned.png](../images/number_of_fires_and_acres_burned.png)

The number of fires per year and the number of acres burned are only loosely related. In 2013 and 2015, the two measures diverge slightly. The largest outliers in the plot above are 2017 and 2018: 2017 had significantly more acres burned relative to the number of fires that occurred, while 2018 had significantly more fires relative to its acres burned. Overall, the plot suggests at most a slight correlation between the number of fires and acres burned.
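
One way to put a number on this relationship is the Pearson correlation coefficient between the yearly fire counts and the yearly acres burned. The sketch below is illustrative only: the two series are hypothetical placeholders, not values from the dataset, and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical yearly totals, one entry per year.
fires_per_year = [10, 20, 30, 40]
acres_per_year = [100, 400, 200, 500]

r = pearson(fires_per_year, acres_per_year)
print(f"Pearson r = {r:.2f}")
```

With only one data point per year, a coefficient computed this way rests on very few observations and should be read as a rough indication rather than strong evidence.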

## Counties Most Affected By Fires
![top_counties_with_fires.png](../images/top_counties_with_fires.png)

Over the span of seven years, Riverside had the most reported fire incidents with 138. This is the highest number of incidents across the fifty-eight counties in California. The county with the next highest number of reported fires is San Diego with 85.


## Counties With Most Acres Burned
![top_counties_with_acres_burned.png](../images/top_counties_with_acres_burned.png)

When comparing the total number of fires and the total acres burned in California counties, there is only a small correlation between the two. Of the top fifteen counties in each plot, only two appear in both: Shasta and Siskiyou. Further observations are below:

*   **Riverside** county had the highest number of fires reported but was not in the top fifteen counties for acres burned.
*   **Lake** county had the greatest number of acres burned but was #11 on the reported number of fires.
*   **Mendocino** county, which had the second highest number of acres burned, was not in the top ten counties for reported fires.
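
The comparison above amounts to ranking counties by two different totals and intersecting the top of each ranking. A minimal sketch follows; the county labels and numbers are hypothetical placeholders, and `top_n` is a helper written for this example.

```python
def top_n(totals, n):
    """Return the n keys with the largest values, highest first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical per-county totals (placeholder names and values).
fire_counts = {"County A": 12, "County B": 9, "County C": 7, "County D": 3}
acres_burned = {"County A": 500, "County B": 40000, "County C": 150, "County D": 90000}

top_by_count = top_n(fire_counts, 2)
top_by_acres = top_n(acres_burned, 2)

# Counties that appear near the top of both rankings.
shared = set(top_by_count) & set(top_by_acres)
print("Top by fires:", top_by_count)
print("Top by acres:", top_by_acres)
print("In both:", sorted(shared))
```

A small intersection, as in the county plots, indicates that frequent fires and large burned areas do not necessarily occur in the same places.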

# Machine Learning - Classifying a Major Fire Incident

## Data Pre-Processing

**TODO: Explain what we had to do with:**
- **Filtering of the data**
- **Feature selection**
- **Standardizing the data**

## Models

**TODO: Talk about which models we chose and why**

## Evaluation

### Train Data

**TODO - Summary of the results**


|Model|Target|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|:----|
|Logistic Regression|Not a Major Incident (0)|0.83|1.0|0.81|0.89|
|Logistic Regression|Major Incident (1)|0.83|0.38|0.99|0.55|
|Decision Tree|Not a Major Incident (0)|0.93|1.0|0.91|0.95|
|Decision Tree|Major Incident (1)|0.93|0.75|0.99|0.85|
|Random Forest|Not a Major Incident (0)|0.93|1.0|0.91|0.95|
|Random Forest|Major Incident (1)|0.93|0.75|0.98|0.85|
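
As a sanity check on the table above, the F1-score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below recomputes it from the rounded precision and recall reported for the Major Incident (1) rows. The helper name `f1_score` is ours, and because the inputs are already rounded to two decimals, small differences from the reported values are possible in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) for the Major Incident (1) training rows.
train_major = {
    "Logistic Regression": (0.38, 0.99, 0.55),
    "Decision Tree": (0.75, 0.99, 0.85),
    "Random Forest": (0.75, 0.98, 0.85),
}

for model, (recall, precision, reported) in train_major.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```

For these three rows the recomputed values agree with the reported F1-scores after rounding to two decimals.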


### Test Data

**TODO - Summary of the results**


|Model|Target|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|:----|
|Logistic Regression|Not a Major Incident (0)|0.9|1.0|0.89|0.94|
|Logistic Regression|Major Incident (1)|0.9|0.22|1.0|0.35|
|Decision Tree|Not a Major Incident (0)|0.88|0.93|0.93|0.93|
|Decision Tree|Major Incident (1)|0.88|0.51|0.54|0.53|
|Random Forest|Not a Major Incident (0)|0.88|0.94|0.93|0.93|
|Random Forest|Major Incident (1)|0.88|0.53|0.56|0.55|
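
The Logistic Regression rows above pair high accuracy with low recall on the Major Incident class, a pattern that can arise when one class is much rarer than the other. The sketch below illustrates this with a hypothetical confusion matrix; the counts are invented for illustration and are not the counts behind the table, and `binary_metrics` is a helper written for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall for the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 10 major incidents out of 100, of which only 2 are
# detected, with no false alarms.
metrics = binary_metrics(tp=2, fp=0, fn=8, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this hypothetical case the classifier misses most major incidents yet is still correct on the large majority of all incidents, which is why recall and F1 on the Major Incident class are more informative than accuracy alone.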

# Machine Learning - Predicting a Fire's Location

## Data Pre-Processing

**TODO: Explain what we had to do with:**
- **Filtering of the data**
- **Feature selection**
- **Standardizing the data**

## Models

**TODO: Talk about which models we chose and why**


## Evaluation

### Train Data

**TODO - Summary of the results**

|Model|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|
|Logistic Regression|0.38|0.32|0.56|0.3|
|Decision Tree|0.98|0.98|0.99|0.98|
|Random Forest|0.98|0.98|0.99|0.98|


### Test Data

**TODO - Summary of the results**

|Model|Accuracy|Recall|Precision|F1-Score|
|:----|:----|:----|:----|:----|
|Logistic Regression|0.39|0.32|0.56|0.3|
|Decision Tree|0.95|0.98|0.99|0.98|
|Random Forest|0.99|0.98|0.99|0.98|

# Audience

## Who Might Be Interested?

Because the data is provided by the California government, local, state, and federal governments would likely be interested in viewing and understanding our analysis. Scientists and data scientists may also find this data and report useful, given the number of fires that have plagued not only California but North America as a whole.

As with most data analysis, there is the option of using machine learning algorithms to make decisions, predict outcomes, or classify problems. Our team does not plan to predict when or whether a fire may occur, but rather to determine the severity of a fire. In phase two of the project, we will analyze and choose features that help us classify whether a fire incident is a major incident or not.

## How Will the Analysis Be Disseminated?

Once features are selected and a classification algorithm is chosen, the team will run various models on the data while tuning their hyperparameters to achieve the highest possible accuracy on the data provided. When completed, the analysis could be shared publicly with the California government and on platforms such as Kaggle and Medium. The team hopes to choose features that will give future data scientists the ability to accurately predict the type of an active fire incident.

# Conclusion

**TODO - Overall summary of EDA and models**

# Future Work

**TODO - Talk about how we could improve by using weather data to predict a fire**

**TODO - consolidate limitation with future work**

## Limitations

Because of the dataset we are using, our classification and prediction are limited to data from 2013-2020. Wildfires have increased in frequency in California over the last 50 years, and because we train on only a seven-year window, the usefulness of our outcomes may be limited. This could be improved by seeking out more recent wildfire data from the open data portal of CalFire (the source of the Kaggle dataset), although further preprocessing would be required to match the format of this dataset. Furthermore, the dataset is confined to California. We must keep this limitation in mind if we attempt to draw generalized conclusions from our predictive analysis. Taking a broader view of the entire North American continent could yield a better overall analysis and aid in future classification.


# References

**California Wildfire Data (2013-2020)**

https://www.kaggle.com/datasets/ananthu017/california-wildfire-incidents-20132020

**California Government Fires**

https://www.fire.ca.gov/

**California State Facts (2016)**

https://sgf.senate.ca.gov/sites/sgf.senate.ca.gov/files/county_facts_2016.pdf

**Why Does California Have So Many Wildfires?**

https://a-z-animals.com/blog/why-does-california-have-so-many-wildfires/