# Predicting California Wildfires 2013-2018

## Overview

This project aims to build a classification model that predicts whether a wildfire will start in a given California county during a given week. It combines wildfire incident data from Kaggle with data on fuel sources (ground cover data), topography (elevation data), and weather (data from CIMIS and Weather Underground). Predicting the weekly probability of wildfires can give early warning to government groups like CalFire and help them fight the spread of wildfires.

## Business Problem

As of early December 2020, more than 9,500 wildfires have burned more than 4 million acres of California land. California has about 100 million acres, so nearly 4% of the state's land has burned in wildfires in 2020. This makes 2020 the most destructive wildfire year recorded in California. The government group CalFire currently has a Red Flag and Fire Watch Warning system, with warnings issued 24-72 hours ahead of forecasted weather events that may result in wildfire behavior. The model in this project would predict the probability of a wildfire occurring in a California county during a specific week, based on the three main factors that affect wildfire behavior: weather, topography, and fuels. CalFire could use this model alongside its warning system to better allocate funds and personnel to the counties in greater danger of wildfire incidents, and thus better control the spread of wildfires that do start.

## Data Understanding

The data set used in the modeling is gathered from multiple sources. The original wildfire data comes from a Kaggle data set with over 1,600 incidents between 2013 and 2020 whose containment was managed by the government group CalFire. From this data set, the `acres_burned`, `county`, and `Start` date columns were used in the final data set. Using the `Start` and `county` columns, we regrouped the data by week and by county, and `acres_burned` gave our target variable: whether or not there was an incident in that county and week. To help predict the probability of wildfire, we used the features that affect wildfire behavior: fuel, topography, and weather.
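The weekly regrouping described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `weekly_fire_target`, the Monday-based week labels, and the toy values are all assumptions made for illustration. Only the column names `county`, `Start`, and `acres_burned` come from the data set.

```python
import pandas as pd

# Toy incident records. Column names follow the Kaggle data set as described
# above; the values are invented purely for illustration.
incidents = pd.DataFrame({
    "county": ["Butte", "Butte", "Napa"],
    "Start": ["2018-11-08", "2018-11-10", "2017-10-09"],
    "acres_burned": [1200.0, 20.0, 350.0],
})


def weekly_fire_target(df):
    """Collapse incidents to one row per county and week with a binary flag.

    Hypothetical helper: the week is labeled by the Monday that starts it,
    which is an assumed convention.
    """
    out = df.copy()
    out["Start"] = pd.to_datetime(out["Start"])
    # Map each start date to the Monday that begins its Monday-Sunday week.
    out["date"] = out["Start"].dt.to_period("W").dt.start_time
    # Sum the acres burned per county and week.
    weekly = out.groupby(["county", "date"], as_index=False)["acres_burned"].sum()
    # Any recorded incident in that county and week counts as a fire.
    weekly["fire_started"] = 1
    return weekly[["county", "date", "fire_started"]]


target = weekly_fire_target(incidents)
```

County-weeks with no incident do not appear in `target` at all; they only become zeros once this table is left-merged onto a complete county-by-week grid.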

The Fuel data was gathered from the California Land Use and Ownership Portal by the University of California. For the years 2013-2018, each county has one Summary Land Use Statistics csv file per year. There are 26 different types of land cover, such as agricultural use, types of forest, or even barren land. All these files were combined into one dataset that has the number of acres and the percentage of county land for each type of land cover.

The Topography data uses the minimum and maximum elevations of each county, taken from the Anyplace America topographic maps.

The Weather data was gathered from the California Irrigation Management Information System (CIMIS) by the California Department of Water Resources and from the Weather Underground website. Daily measurements were collected for the following features:

- `Avg Air Temp (F)`
- `Max Air Temp (F)`
- `Min Air Temp (F)`
- `Max Rel Hum (%)`
- `Avg Rel Hum (%)`
- `Min Rel Hum (%)`
- `Dew Point (F)`
- `Avg Wind Speed (mph)`
- `Precip (in)`

For each of these features, the previous week's and previous month's averages were computed and added to the final data set.
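One way to build such trailing averages from daily measurements is sketched below. The helper name `add_trailing_means`, the output column suffixes, the 7-day and 30-day window lengths, and the toy temperatures are assumptions for illustration; the project may define "previous week" and "previous month" differently.

```python
import pandas as pd

# Toy daily weather for a single county; the values are made up.
days = pd.date_range("2018-01-01", periods=42, freq="D")
daily = pd.DataFrame({
    "county": "Butte",
    "date": days,
    "Avg Air Temp (F)": [50.0 + (i % 7) for i in range(42)],
})


def add_trailing_means(df, col, week=7, month=30):
    """Add previous-week and previous-month means of `col` per county.

    Hypothetical helper: window lengths of 7 and 30 days are assumptions.
    """
    out = df.sort_values(["county", "date"]).copy()
    grouped = out.groupby("county")[col]
    # shift(1) drops the current day, so each mean only uses earlier days.
    out[col + " prev_week"] = grouped.transform(
        lambda s: s.shift(1).rolling(week).mean()
    )
    out[col + " prev_month"] = grouped.transform(
        lambda s: s.shift(1).rolling(month).mean()
    )
    return out


features = add_trailing_means(daily, "Avg Air Temp (F)")
```

Shifting before rolling keeps the current day's measurement out of its own feature, which avoids leaking same-day weather into a prediction for that day. The first rows of each county are `NaN` until a full window of earlier days is available.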

## Data Prep and Cleaning

To build the final data set for our modeling, we needed to restructure and combine the different data sets into one dataframe.

We first created the final dataframe, with one row per county per week for the years 2013-2018.

In [None]:
final_data = county_week()
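The body of the custom `county_week` function is not shown here. As a rough sketch of what a county-by-week skeleton might look like, the hypothetical `county_week_grid` below builds one row per county and week. The function name, its parameters, the Monday week anchor, and the two-county list are assumptions, not the project's implementation.

```python
import itertools

import pandas as pd


def county_week_grid(counties, start="2013-01-01", end="2018-12-31"):
    """Build one row per (county, week) pair between `start` and `end`.

    Hypothetical stand-in for the project's county_week(); weeks are
    labeled by their Monday, which is an assumed convention.
    """
    weeks = pd.date_range(start, end, freq="W-MON")
    rows = list(itertools.product(counties, weeks))
    grid = pd.DataFrame(rows, columns=["county", "date"])
    # A year column makes it possible to join yearly tables later on.
    grid["year"] = grid["date"].dt.year
    return grid


grid = county_week_grid(["Butte", "Napa"])
```

Starting from a complete grid means every county and week is represented, so weeks without a fire survive the later left merges instead of silently dropping out.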

We then added the fire incidents to the final dataframe.

In [None]:
import pandas as pd

# Import the fire incidents as a pandas dataframe
fire_incidents = pd.read_csv('data/California_Fire_Incidents.csv')
# Using custom function, split dataframe into counties and find weekly fire incidents
county_fire_df = county_fire(fire_incidents)
# Using custom function, create new dataframe with binary target variable fire_started
fire_started_df = fire_started(county_fire_df)
# Merge the target variable dataframe with final dataframe
final_data = pd.merge(final_data, fire_started_df,
                      how='left',
                      on=['county', 'date'])

For the fuel data, all the different county land use statistics need to be restructured and put into one dataframe. For ease, the max elevations and min elevations from topography will be merged here as well.

In [None]:
# Find all the filenames for the ground cover data
files = filenames()
# Import the csv files into pandas dataframes
gc_df_list = import_ground_cover(files)
# Restructure dataframes and merge into one dataframe
gc_df = ground_cover_data(gc_df_list, files)
# Check the first five rows
gc_df.head()

The fuel dataframe is merged with the final dataframe.

In [None]:
final_data = pd.merge(final_data, gc_df,
                      how='left',
                      on=['county', 'year'])

We then check the merged dataframe for null values.

In [None]:
final_data.isna().sum()

The null values in the fuel columns occur where a county has none of that particular ground cover; for example, not all counties grow the same agricultural produce. The null values in the fire started column occur where there were no fires in that county and week. This means the null values can be replaced with 0.

In [None]:
final_data = final_data.fillna(0)
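To see why the left merge produces these nulls, here is a small illustrative sketch on toy data. The frames `grid` and `fires` and their values are invented. Filling a single named column rather than the whole dataframe is an optional variation, not what the cell above does.

```python
import pandas as pd

# Toy county-week skeleton and a toy incident table; values are invented.
grid = pd.DataFrame({
    "county": ["Butte", "Butte", "Napa"],
    "date": pd.to_datetime(["2018-11-05", "2018-11-12", "2018-11-05"]),
})
fires = pd.DataFrame({
    "county": ["Butte"],
    "date": pd.to_datetime(["2018-11-05"]),
    "fire_started": [1],
})

merged = grid.merge(fires, how="left", on=["county", "date"])

# County-weeks with no matching incident come back as NaN after a left merge.
n_missing = int(merged["fire_started"].isna().sum())

# Filling only the named column leaves unexpected gaps in other columns visible.
merged["fire_started"] = merged["fire_started"].fillna(0).astype(int)
```

Here `n_missing` counts the county-weeks without a fire before filling, which is a cheap sanity check that the nulls come from absent incidents rather than from a failed join.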

For the weather data, the specific features need to be gathered from the daily weather data from CIMIS and Weather Underground.

In [None]:
from glob import glob

# Find all CIMIS weather files
cimis_files = glob('data/weather_cimis/*.csv')
# Sort files into alphabetical order
cimis_files.sort()
# Import csv files into pandas dataframes
cimis_data = import_cimis(cimis_files)
