# DATA 512 - Project Part 1: Common Analysis

## Wildfires Analysis - Data Preprocessing

### Tanushree Yandra, University of Washington, Seattle

More and more frequently, summers in the western US have been characterized by wildfires with smoke billowing across multiple western states. There are many proposed causes for this: climate change, US Forestry policy, and growing awareness, to name a few. Regardless of the cause, the impact of wildland fires is widespread. There is a growing body of work pointing to the negative impacts of smoke on health, tourism, property, and other aspects of society. This project analyzes wildfire impacts on the city of Twin Falls, Idaho in the US. The end goal is to inform policy makers, city managers, city councils, and other civic institutions, so that they can decide whether and how to plan for mitigating future impacts from wildfires.

Wildland fires within 1250 miles of Twin Falls, Idaho are analyzed for the period 1963-2020 (roughly the last 60 years). A smoke estimate is then created to quantify the wildfire smoke impact, which is later modeled to make predictions for the next 30 years (until 2049).

This section of the analysis preprocesses the [Wildland Fires Data]() generated from the [Wildfire Analysis - Data Retrieval notebook](https://github.com/TanushreeYandra/data-512-projectpart1/blob/main/Analysis/Wildfires_Analysis_Data_Retrieval.ipynb). We analyze the columns present in the dataset and make relevant assumptions to remove any unnecessary columns.

### Step 1: Preliminaries

First, we start by importing required modules and packages.

In [1]:
# Import the modules used in this notebook (json and warnings are standard library, pandas is third-party)
import pandas as pd
import json
import warnings

In [2]:
# Suppress the warning statements
warnings.filterwarnings("ignore")

### Step 2: Analyzing the Wildland Fires Dataset

Next, we load the JSON data into a JSON object. The JSON object has multiple 'features', where each 'feature' represents one wildfire incident. Each feature is a nested dictionary with two primary keys: 'attributes' and 'geometry'. The 'geometry' entry was used in the previous data retrieval step to access the coordinates of the fire rings and compute the geodetic distance from Twin Falls, Idaho. For this step of the analysis, we focus on the 'attributes' entry of the wildfires data. 'attributes' is itself a dictionary with multiple key-value pairs. Some of its keys are, for example, OBJECTID and Assigned_Fire_Type.

In [3]:
# Open the JSON file in a JSON object
with open('final_wildfire_data.json', 'r') as json_file:
    final_wildfire_data = json.load(json_file)

In [4]:
# Convert the JSON object to a dataframe
# Only the nested 'attributes' dictionary of each feature is used
wf_df = pd.DataFrame([data['attributes'] for data in final_wildfire_data])
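To make the structure described above concrete, here is a minimal sketch that builds a dataframe from two made-up features. The list `toy_features`, its attribute values, and its ring coordinates are hypothetical stand-ins and are heavily trimmed compared with the real file.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for two entries of the retrieved JSON.
# The real features carry many more attributes and full ring coordinates.
toy_features = [
    {"attributes": {"USGS_Assigned_ID": 1, "Fire_Year": 1963, "GIS_Acres": 120.5},
     "geometry": {"rings": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]}},
    {"attributes": {"USGS_Assigned_ID": 2, "Fire_Year": 1964, "GIS_Acres": 8.0},
     "geometry": {"rings": [[[2.0, 2.0], [3.0, 2.0], [3.0, 3.0], [2.0, 2.0]]]}},
]

# Keep only the 'attributes' dictionary of each feature; 'geometry' is ignored
toy_df = pd.DataFrame([feature["attributes"] for feature in toy_features])
print(toy_df)
```

Each key of 'attributes' becomes a column and each feature becomes one row, which is why the geometry never appears in the resulting dataframe.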

In [5]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,OBJECTID,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,Fire_Polygon_Tier,Fire_Attribute_Tiers,GIS_Acres,GIS_Hectares,Source_Datasets,Listed_Fire_Types,...,Wildfire_Notice,Prescribed_Burn_Notice,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Distance
0,14299,14299,Wildfire,1963,1,"1 (1), 3 (3)",40992.458271,16589.059302,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (1), Likely Wildfire (3)",...,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.385355,,No,73550.428118,165890600.0,160.94989
1,14300,14300,Wildfire,1963,1,"1 (1), 3 (3)",25757.090203,10423.524591,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (2), Likely Wildfire (2)",...,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.364815,,No,59920.576713,104235200.0,187.994096
2,14301,14301,Wildfire,1963,1,"1 (5), 3 (15), 5 (1)",45527.210986,18424.208617,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (6), Likely Wildfire (15)",...,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.320927,,No,84936.82781,184242100.0,160.517096
3,14302,14302,Wildfire,1963,1,"1 (1), 3 (3), 5 (1)",10395.010334,4206.711433,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (2), Likely Wildfire (3)",...,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.428936,,No,35105.903602,42067110.0,80.011735
4,14303,14303,Wildfire,1963,1,"1 (1), 3 (3)",9983.605738,4040.2219,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (1), Likely Wildfire (3)",...,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.703178,,No,26870.456126,40402220.0,144.89905


In [6]:
# Look at the shape of the dataframe
wf_df.shape

(84319, 31)

Thus, our dataset has 84319 wildfire incidents with 31 columns before preprocessing. Now we look at the columns present in the data to identify those that will be useful for further modeling and analysis.

In [7]:
# Look at all the column names of the dataframe
wf_df.columns

Index(['OBJECTID', 'USGS_Assigned_ID', 'Assigned_Fire_Type', 'Fire_Year',
       'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Acres',
       'GIS_Hectares', 'Source_Datasets', 'Listed_Fire_Types',
       'Listed_Fire_Names', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
       'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes',
       'Listed_Fire_Cause_Class', 'Listed_Rx_Reported_Acres',
       'Listed_Map_Digitize_Methods', 'Listed_Notes', 'Processing_Notes',
       'Wildfire_Notice', 'Prescribed_Burn_Notice', 'Wildfire_and_Rx_Flag',
       'Overlap_Within_1_or_2_Flag', 'Circleness_Scale', 'Circle_Flag',
       'Exclude_From_Summary_Rasters', 'Shape_Length', 'Shape_Area',
       'Distance'],
      dtype='object')

Looking at the column names above, several columns can be identified that hold complex string values which will not add any value to our analysis and modeling. Such columns are dropped in the next step.
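One way to support this kind of column review is a per-column summary of data type, missing fraction, and number of distinct values. The sketch below is not part of the original notebook; the helper `summarize_columns` and the toy values are hypothetical.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing fraction and distinct count for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_fraction": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })

# Hypothetical toy data with one free-text column that is mostly empty
toy_df = pd.DataFrame({
    "Fire_Year": [1963, 1963, 1964],
    "GIS_Acres": [40992.5, 25757.1, 12.0],
    "Listed_Notes": [None, "free text", None],
})

summary = summarize_columns(toy_df)
print(summary)
```

Applied to `wf_df`, a table like this would highlight text-heavy or mostly empty columns, although the final decision to drop a column still rests on the reasoning given for each one.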

### Step 3: Dropping Columns that are not Useful for the Analysis

In this step, several columns that are not useful for the analysis are identified and dropped. The reason for dropping each of these columns is given below:

**OBJECTID**: It is a unique identification for the fire polygon and its attributes. The dataset also has another column named 'USGS_Assigned_ID' which is also a unique identification that provides further consistency. Thus, the OBJECTID column is dropped.

**Fire_Polygon_Tier**: This refers to the tier from which the fire polygon is generated. One or more polygons within the tier can be combined to create the fire polygon. Although this feature is numerical, it is not expected to add any value to the creation of the smoke estimate or its modeling. Thus, it is dropped.

**Fire_Attribute_Tiers**: The dataset being used is created by combining 40 different data sources. This feature has a list of Polygon Tiers consolidated from all the data sources for each fire. This is irrelevant to the analysis at hand, and is hence dropped.

**GIS_Hectares**: This encapsulates the hectares of the fire polygon calculated by using the Calculate Geometry tool in ArcGIS Pro. Since there is another column representing the same value in the units of acres, this column is dropped.

**Source_Datasets**: This column contains all the original source datasets that contributed to either the polygon or the attributes. This is irrelevant for our analysis.

**Listed_Fire_Types**: This includes each fire type listed in the fires from the merged dataset, where the number of features that contributed to a specific fire type is given in parentheses after the fire type. Since we have kept the 'Assigned_Fire_Type' column in our dataset for now, this column is not needed.

**Listed_Fire_Codes**: This includes each fire code listed in the fires from the merged dataset. Any feature that holds a 'list' of values from the merged dataset is ignored.

**Listed_Fire_IDs**: This includes each fire ID listed in the fires from the merged dataset. Since it is a 'list', it is dropped.

**Listed_Fire_IRWIN_IDs**: This includes each fire IRWIN ID listed in the fires from the merged dataset. This is dropped since it is a 'list'.

**Listed_Fire_Dates**: This includes each fire date listed in the fires from the merged dataset. Since we are considering wildfires on a yearly basis, the fire dates are not important for our analysis.

**Listed_Fire_Causes**: This includes each fire cause listed in the fires from the merged dataset. It is a 'list' and is hence dropped.

**Listed_Fire_Cause_Class**: This includes each fire cause class listed in the fires from the merged dataset. While fire cause may seem important for analysis, it cannot be quantified in any manner. Thus, it is dropped.

**Listed_Rx_Reported_Acres**: This contains each prescribed fire's reported acres listed in the fires from the merged dataset. For the area of the fire, we are relying on the column 'GIS_Acres' and hence, this column is dropped.

**Listed_Map_Digitize_Methods**: This includes each fire digitization method listed in the fires from the merged dataset. This does not add any value to our analysis and is thus dropped.

**Listed_Notes**: This contains additional notes associated with each fire from the merged dataset. Notes are irrelevant to our study.

**Processing_Notes**: This indicates that the attribute data was altered during processing and a new attribute was added. It also explains the rationale for the change. These notes are not relevant to us.

**Wildfire_Notice**: It is a notice present in every field that indicates the quality of the wildfire data in the dataset. This is not needed.

**Prescribed_Burn_Notice**: It is a notice present in every field that indicates the quality of the prescribed burn data in the dataset which is not required.
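Two of the reasons above rest on redundancy: 'USGS_Assigned_ID' already identifies each fire, and 'GIS_Hectares' repeats 'GIS_Acres' in different units. The sketch below shows how both assumptions could be checked before dropping anything. It uses the first two rows of the preview shown earlier; the constant name `ACRES_TO_HECTARES` is an illustrative choice, and the check is not part of the original notebook.

```python
import pandas as pd

ACRES_TO_HECTARES = 0.40468564224  # 1 international acre expressed in hectares

# Values copied from the first two rows of the dataframe preview
toy_df = pd.DataFrame({
    "OBJECTID": [14299, 14300],
    "USGS_Assigned_ID": [14299, 14300],
    "GIS_Acres": [40992.458271, 25757.090203],
    "GIS_Hectares": [16589.059302, 10423.524591],
})

# The remaining ID column should identify every row on its own
ids_are_unique = toy_df["USGS_Assigned_ID"].is_unique

# Largest relative difference between converted acres and reported hectares
max_relative_gap = (
    (toy_df["GIS_Acres"] * ACRES_TO_HECTARES - toy_df["GIS_Hectares"]).abs()
    / toy_df["GIS_Hectares"]
).max()

print(ids_are_unique, max_relative_gap)
```

If the same checks hold on the full dataframe, dropping 'OBJECTID' and 'GIS_Hectares' loses no information.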

In [8]:
# Drop all irrelevant columns
wf_df = wf_df.drop(['OBJECTID', 'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Hectares',
                    'Source_Datasets', 'Listed_Fire_Types', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
                    'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes', 'Listed_Fire_Cause_Class',
                    'Listed_Rx_Reported_Acres', 'Listed_Map_Digitize_Methods', 'Listed_Notes', 'Processing_Notes',
                    'Wildfire_Notice', 'Prescribed_Burn_Notice'], axis=1)

In [9]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Distance
0,14299,Wildfire,1963,40992.458271,RATTLESNAKE (4),,,0.385355,,No,73550.428118,165890600.0,160.94989
1,14300,Wildfire,1963,25757.090203,"McChord Butte (2), No Fire Name Provided (1), ...",,,0.364815,,No,59920.576713,104235200.0,187.994096
2,14301,Wildfire,1963,45527.210986,"WILLOW CREEK (16), EAST CRANE CREEK (4), Crane...",,,0.320927,,No,84936.82781,184242100.0,160.517096
3,14302,Wildfire,1963,10395.010334,"SOUTH CANYON CREEK (4), No Fire Name Provided (1)",,,0.428936,,No,35105.903602,42067110.0,80.011735
4,14303,Wildfire,1963,9983.605738,WEBB CREEK (4),,,0.703178,,No,26870.456126,40402220.0,144.89905


In [10]:
# Look at the dataframe's shape
wf_df.shape

(84319, 13)

The preview above shows the dataframe after dropping these columns. The number of columns has come down from the original 31 to 13.

### Step 4: Looking at the 'Assigned_Fire_Type' and the 'Wildfire_and_Rx_Flag' columns

The 'Assigned_Fire_Type' column takes one of five values: Wildfire, Likely Wildfire, Unknown - Likely Wildfire, Prescribed Fire, and Unknown - Likely Prescribed Fire. The key difference between wildfires and prescribed fires is intent. A prescribed fire is a planned fire intentionally ignited by park managers to meet management objectives. A wildfire, on the other hand, is an unplanned fire caused by lightning or other natural causes, by accidental (or arson-caused) human ignitions, or by an escaped prescribed fire. While prescribed fires are intentional and usually under control, they still contribute to air pollution. For this analysis, it is assumed that prescribed fires and wildfires produce the same amount of pollution for a given burned area.

In [11]:
# Unique values of the column 'Assigned_Fire_Type'
wf_df['Assigned_Fire_Type'].unique()

array(['Wildfire', 'Unknown - Likely Wildfire', 'Prescribed Fire',
       'Likely Wildfire', 'Unknown - Likely Prescribed Fire'],
      dtype=object)
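Even when every fire type is treated alike, it can help to know how the rows split between wildfire-like and prescribed-like types. The sketch below groups the five values into two broad classes with a simple substring test. The class labels and the toy series are hypothetical, and this grouping is not used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical toy column containing each of the five fire types once
toy_types = pd.Series([
    "Wildfire",
    "Likely Wildfire",
    "Unknown - Likely Wildfire",
    "Prescribed Fire",
    "Unknown - Likely Prescribed Fire",
], name="Assigned_Fire_Type")

# Any type mentioning "Prescribed" is grouped as prescribed, the rest as wildfire
is_prescribed = toy_types.str.contains("Prescribed", regex=False)
counts = is_prescribed.map({True: "prescribed", False: "wildfire"}).value_counts()
print(counts)
```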

The 'Wildfire_and_Rx_Flag' column is a text flag field indicating that the attributes from the various data sources flagged a fire as both a wildfire and a prescribed fire. This could indicate an error in assigning the fire type, a misassignment of the fire type, or that there were actually two fires that occurred in this area in the same year, one a wildfire and one a prescribed burn. 

Since we are treating both wildfires and prescribed fires in the same manner, this field can be ignored.

In [12]:
# Drop 'Wildfire_and_Rx_Flag' column
wf_df = wf_df.drop(['Wildfire_and_Rx_Flag'], axis=1)

In [13]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Distance
0,14299,Wildfire,1963,40992.458271,RATTLESNAKE (4),,0.385355,,No,73550.428118,165890600.0,160.94989
1,14300,Wildfire,1963,25757.090203,"McChord Butte (2), No Fire Name Provided (1), ...",,0.364815,,No,59920.576713,104235200.0,187.994096
2,14301,Wildfire,1963,45527.210986,"WILLOW CREEK (16), EAST CRANE CREEK (4), Crane...",,0.320927,,No,84936.82781,184242100.0,160.517096
3,14302,Wildfire,1963,10395.010334,"SOUTH CANYON CREEK (4), No Fire Name Provided (1)",,0.428936,,No,35105.903602,42067110.0,80.011735
4,14303,Wildfire,1963,9983.605738,WEBB CREEK (4),,0.703178,,No,26870.456126,40402220.0,144.89905


In [14]:
# Look at the dataframe's shape
wf_df.shape

(84319, 12)

The preview above shows the dataframe after dropping this column. The number of columns has come down from 13 to 12.

### Step 5: Analyzing the Column 'Overlap_Within_1_or_2_Flag'

In the wildfires dataset, fire polygons with near 100% overlap in consecutive years could be the same fire in different datasets with a year value that is correct in one and incorrect in another. This can occur particularly with older fires. There is no way to identify the actual year or which one is correct, if one is in fact incorrect. 

Therefore, the 'Overlap_Within_1_or_2_Flag' column is present to flag areas that burned with more than 10% overlap of the current fire within 1 or 2 years of the current burn. Each fire that meets these criteria is included in this attribute, along with the percentage and acres of overlap, the year the overlapping fire occurred, and the overlapping fire's USGS_Assigned_ID.

While the overlap flag may or may not be correct, it is assumed that another row pertaining to the same fire exists for those fires that are flagged. Flagged fires are thus removed, since another fire with more than 10% overlap already exists in the dataset.

In [15]:
# Keep only the rows where 'Overlap_Within_1_or_2_Flag' is missing (no overlap flagged)
wf_df = wf_df[wf_df['Overlap_Within_1_or_2_Flag'].isna()]

In [16]:
# Now that the rows are filtered, drop the column
wf_df = wf_df.drop(['Overlap_Within_1_or_2_Flag'], axis=1)
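The effect of this filter can be seen on a small example. The sketch below uses a hypothetical three-row frame in which one fire carries an overlap note, keeps only the unflagged rows, and counts how many were removed.

```python
import pandas as pd

# Hypothetical toy data: the second fire is flagged as overlapping another fire
toy_df = pd.DataFrame({
    "USGS_Assigned_ID": [1, 2, 3],
    "Overlap_Within_1_or_2_Flag": [None, "hypothetical overlap note", None],
})

rows_before = len(toy_df)

# Keep rows with no overlap flag, then drop the now-empty flag column
kept = toy_df[toy_df["Overlap_Within_1_or_2_Flag"].isna()]
kept = kept.drop(columns=["Overlap_Within_1_or_2_Flag"])

rows_removed = rows_before - len(kept)
print(kept)
print(f"Removed {rows_removed} flagged row(s)")
```

Tracking the removed count in this way makes it easy to confirm that the filter drops only the flagged rows.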

In [17]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Distance
0,14299,Wildfire,1963,40992.458271,RATTLESNAKE (4),0.385355,,No,73550.428118,165890600.0,160.94989
1,14300,Wildfire,1963,25757.090203,"McChord Butte (2), No Fire Name Provided (1), ...",0.364815,,No,59920.576713,104235200.0,187.994096
2,14301,Wildfire,1963,45527.210986,"WILLOW CREEK (16), EAST CRANE CREEK (4), Crane...",0.320927,,No,84936.82781,184242100.0,160.517096
3,14302,Wildfire,1963,10395.010334,"SOUTH CANYON CREEK (4), No Fire Name Provided (1)",0.428936,,No,35105.903602,42067110.0,80.011735
4,14303,Wildfire,1963,9983.605738,WEBB CREEK (4),0.703178,,No,26870.456126,40402220.0,144.89905


In [18]:
# Look at the dataframe's shape
wf_df.shape

(75119, 11)

Our dataframe now has 11 columns, and the number of rows has come down from 84319 to 75119 (9200 flagged rows removed). The preview above shows the dataframe after these changes.

### Step 6: Dropping Rows that are Near Perfect Circles and have High Acreage

Some of the fires in the wildfires dataset appear as near perfect circles. This could be from lightning strikes being counted as small fires, or from other small fires having a point buffered to the acreage of the fire size because no true polygon was created. A circle-ness index is thus calculated using the following equation in Field Calculator:

*4 x pi x (Shape_Area/(Shape_Length x Shape_Length))*

As values of the circle-ness index approach 1, the shape becomes more circular. A 'Circle_Flag' column is thus present to flag any shapes with a value greater than or equal to 0.98.
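The behaviour of the index can be checked with a short calculation. The sketch below implements the formula above as a hypothetical helper named `circleness` and evaluates it for a perfect circle, for a square, and for the rounded 'Shape_Area' and 'Shape_Length' values shown for the first row of the preview.

```python
import math

def circleness(shape_area: float, shape_length: float) -> float:
    """Circle-ness index: 4 * pi * area / perimeter^2."""
    return 4 * math.pi * shape_area / shape_length ** 2

# A perfect circle of radius 10: area = pi * r^2, perimeter = 2 * pi * r
radius = 10.0
circle_value = circleness(math.pi * radius ** 2, 2 * math.pi * radius)

# A square of side 10: area = s^2, perimeter = 4 * s, index = pi / 4
side = 10.0
square_value = circleness(side ** 2, 4 * side)

# Rounded Shape_Area and Shape_Length from the first row of the preview
first_row_value = circleness(165890600.0, 73550.428118)

print(circle_value, square_value, first_row_value)
```

The circle gives a value of 1 and the square gives pi/4 (about 0.785), while the first row comes out close to the 0.385355 stored in 'Circleness_Scale', which is consistent with that column holding this index.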

Circular fire polygons are highly unlikely to represent the actual area burned. When fire size is less than 1 acre, the risk of misassigning the burned area is minimal given the fire size. For circular polygons greater than 1 acre, the risk of misassigning a burned area is too high, and hence these are not included in the analysis. The column 'Exclude_From_Summary_Rasters' holds the flag 'Yes' for fires that are circular and greater than 1 acre, and 'No' for non-circular fires and circular fires smaller than 1 acre. Thus, the fires flagged 'Yes' are removed.

In [19]:
# Remove circular fires greater than 1 acre - indicated as 'Yes' in 'Exclude_From_Summary_Rasters'
wf_df = wf_df[wf_df['Exclude_From_Summary_Rasters']=='No']
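The exclusion rule described above can also be reproduced directly from the numeric columns. The following is only a sketch under the stated thresholds (circle-ness of at least 0.98 and more than 1 acre); the dataset's own flag may have been computed with slightly different boundary handling, and the constant names and toy values are hypothetical.

```python
import pandas as pd

CIRCLE_THRESHOLD = 0.98      # shapes at or above this index count as circular
MIN_ACRES_TO_EXCLUDE = 1.0   # circular shapes larger than this are excluded

# Hypothetical toy data: an irregular large fire, a tiny circle, a large circle
toy_df = pd.DataFrame({
    "Circleness_Scale": [0.385355, 0.99, 0.995],
    "GIS_Acres": [40992.458271, 0.5, 250.0],
})

is_circular = toy_df["Circleness_Scale"] >= CIRCLE_THRESHOLD
exclude = is_circular & (toy_df["GIS_Acres"] > MIN_ACRES_TO_EXCLUDE)

# Keep everything that is not a large circular polygon
kept = toy_df[~exclude]
print(kept)
```

Only the large circular polygon is removed; the tiny circle stays because its possible area error is negligible.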

Now we drop the columns 'Circle_Flag' and 'Exclude_From_Summary_Rasters' since they are no longer relevant to our analysis.

In [20]:
# Drop columns 'Circle_Flag' and 'Exclude_From_Summary_Rasters'
wf_df = wf_df.drop(['Circle_Flag', 'Exclude_From_Summary_Rasters'], axis=1)

In [21]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Circleness_Scale,Shape_Length,Shape_Area,Distance
0,14299,Wildfire,1963,40992.458271,RATTLESNAKE (4),0.385355,73550.428118,165890600.0,160.94989
1,14300,Wildfire,1963,25757.090203,"McChord Butte (2), No Fire Name Provided (1), ...",0.364815,59920.576713,104235200.0,187.994096
2,14301,Wildfire,1963,45527.210986,"WILLOW CREEK (16), EAST CRANE CREEK (4), Crane...",0.320927,84936.82781,184242100.0,160.517096
3,14302,Wildfire,1963,10395.010334,"SOUTH CANYON CREEK (4), No Fire Name Provided (1)",0.428936,35105.903602,42067110.0,80.011735
4,14303,Wildfire,1963,9983.605738,WEBB CREEK (4),0.703178,26870.456126,40402220.0,144.89905


In [22]:
# Look at the dataframe's shape
wf_df.shape

(72608, 9)

Our dataframe now has 9 columns, and the number of rows has come down further from 75119 to 72608 (2511 circular fires removed). The preview above shows the dataframe after these changes.

### Step 7: Renaming the Fire Names and Exporting the Final Dataset

The 'Listed_Fire_Names' column includes each fire name listed in the fires from the merged dataset that intersect a particular fire polygon in space and year. The number of features that contributed the specific fire name is given in parentheses after the fire name. We rename each fire to the first name occurring in its list of fire names. Since the list is ordered by the number of occurrences, this assigns the fire name that was most common across the 40 data sources.

In [23]:
# Rename the fires to the first instance occurring in the 'Listed_Fire_Names' column
wf_df['Listed_Fire_Names'] = wf_df['Listed_Fire_Names'].str.split(',').str[0]
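Splitting on the comma keeps the feature count, so a name still reads like "RATTLESNAKE (4)". If the bare name and the count are wanted separately, a regular expression can pull them apart. The sketch below is an optional extra step that the notebook itself does not perform; it assumes every first entry ends in a count in parentheses and that fire names contain no commas.

```python
import pandas as pd

# Example values in the style of the 'Listed_Fire_Names' column
names = pd.Series([
    "RATTLESNAKE (4)",
    "McChord Butte (2), No Fire Name Provided (1)",
    "WILLOW CREEK (16), EAST CRANE CREEK (4)",
])

# Take the first (most frequent) entry of each list
first_entry = names.str.split(",").str[0].str.strip()

# Separate the name from the trailing "(count)"
parts = first_entry.str.extract(r"^(?P<name>.*?)\s*\((?P<count>\d+)\)$")
parts["count"] = parts["count"].astype(int)
print(parts)
```

Rows that do not match the pattern would come back as missing values, so the integer conversion should be guarded on real data.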

In [24]:
# Look at the top of the dataframe
wf_df.head()

Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Circleness_Scale,Shape_Length,Shape_Area,Distance
0,14299,Wildfire,1963,40992.458271,RATTLESNAKE (4),0.385355,73550.428118,165890600.0,160.94989
1,14300,Wildfire,1963,25757.090203,McChord Butte (2),0.364815,59920.576713,104235200.0,187.994096
2,14301,Wildfire,1963,45527.210986,WILLOW CREEK (16),0.320927,84936.82781,184242100.0,160.517096
3,14302,Wildfire,1963,10395.010334,SOUTH CANYON CREEK (4),0.428936,35105.903602,42067110.0,80.011735
4,14303,Wildfire,1963,9983.605738,WEBB CREEK (4),0.703178,26870.456126,40402220.0,144.89905


In [25]:
# Look at the number of rows of the final preprocessed dataframe
wf_df.shape[0]

72608

Thus, our final processed wildfires data contains 72608 wildfire instances with 9 features. The features GIS_Acres, Circleness_Scale, Shape_Length, Shape_Area, and Distance were retained since they are numerical values that might be handy for creating the Smoke Estimate and its further modeling.

This final dataset can now be exported to a CSV file.

In [26]:
# Export the final processed dataframe to a CSV file
wf_df.to_csv('Wildfire_Data_Processed.csv', index=False)
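A quick round trip can confirm that an export like this reloads with the same shape and columns. The sketch below writes a small hypothetical frame to an in-memory buffer instead of a file, so it can be run without touching 'Wildfire_Data_Processed.csv'.

```python
import io

import pandas as pd

# Hypothetical toy frame with a few of the retained columns
toy_df = pd.DataFrame({
    "USGS_Assigned_ID": [14299, 14300],
    "Fire_Year": [1963, 1963],
    "GIS_Acres": [40992.458271, 25757.090203],
})

# Write to an in-memory buffer without the index, then read it back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

same_shape = reloaded.shape == toy_df.shape
same_columns = list(reloaded.columns) == list(toy_df.columns)
print(same_shape, same_columns)
```

Passing `index=False` matters here: writing the index and reading the file back without `index_col` would add an extra 'Unnamed: 0' column.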