This notebook marks the beginning of a comprehensive study aimed at understanding the impact of wildfires on air quality in [Madison](https://docs.google.com/spreadsheets/d/1pHLA9XzXoy9nJTaiNkgThGPQjVEa0tfeH203I6FA238/edit?gid=0#gid=0&range=E55), located in [Dane County](https://en.wikipedia.org/wiki/Dane_County,_Wisconsin), [Wisconsin](https://docs.google.com/spreadsheets/d/1pHLA9XzXoy9nJTaiNkgThGPQjVEa0tfeH203I6FA238/edit?gid=0#gid=0&range=F55), with a particular focus on estimating smoke levels resulting from these fires. As wildfires become more frequent and intense due to climate change, the effects of smoke on public health and air quality are becoming increasingly concerning, even in regions far from the actual fires. The primary objective of this notebook is to lay the groundwork for building robust smoke estimators by processing and refining wildfire data from a national dataset. By transforming and analyzing this data, we aim to develop a clear understanding of how wildfire smoke travels and affects air quality in Madison.

To achieve this, we will systematically filter and refine the dataset according to specific criteria, ensuring its relevance to our analysis.

We will begin by working with the "USGS_Wildland_Fire_Combined_Dataset.json" file, obtained from the USGS ["Combined Wildland Fire Datasets for the United States and Certain Territories, 1800s-Present"](https://www.sciencebase.gov/catalog/item/61aa537dd34eb622f699df81) dataset. The data will be filtered based on two criteria:

1. Geographic Boundary: We will keep only wildfires whose perimeter comes within a 650-mile radius of Madison, Dane County, Wisconsin.

2. Temporal Range: We will consider only wildfires from the last 60 years, spanning from 1964 to 2024.

These 2 constraints set the scope of our data and analysis.
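The two constraints can be expressed as a single predicate. The following is a minimal sketch, not the notebook's actual filtering code: `in_scope` is a hypothetical helper, the constant names mirror those defined later in the notebook, and both year bounds are assumed to be inclusive.

```python
# Scope constants mirroring the two constraints stated above.
MILE_BOUNDARY = 650   # maximum distance from Madison, in miles
STARTYEAR = 1964
ENDYEAR = 2024


def in_scope(fire_year, distance_miles):
    """Return True when a fire satisfies both scope constraints.

    `in_scope` is a hypothetical helper written for illustration.
    Missing values are treated as out of scope.
    """
    if fire_year is None or distance_miles is None:
        return False
    return STARTYEAR <= fire_year <= ENDYEAR and distance_miles <= MILE_BOUNDARY


print(in_scope(1990, 412.5))   # inside both limits
print(in_scope(1950, 100.0))   # fire year is too early
print(in_scope(2000, 900.0))   # fire is too far away
```

Treating missing values as out of scope is a design choice made for this sketch; the notebook itself does not state how such records should be handled.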


### License

#### Code Attribution

Snippets of the code in this notebook were adapted from the "wildfire_geo_proximity_example" notebook, created by Professor McDonald for DATA 512, a course in the UW MS Data Science degree program. The original code is available under the [**Creative Commons CC-BY license**](https://creativecommons.org/licenses/by/4.0/).

#### Step 1: Data Retrieval

In the following sections, we will begin by importing the necessary Python libraries. Two additional packages must be installed first: pyproj, which handles geodesic coordinate conversions and distance calculations, and geojson, which is used for working with GeoJSON data.
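Before running the import cell, it can help to confirm that the third-party packages are present. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper, and the package list is taken from the imports used in this notebook.

```python
import importlib.util

# Third-party packages imported by this notebook.
REQUIRED_PACKAGES = ["pyproj", "geojson", "pandas"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Install before continuing: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```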

In [2]:
#
#    IMPORTS
# 

#    Import some standard python modules
import os, json, time
#
#    The module pyproj is a standard module that can be installed using pip or your other favorite
#    installation tool. This module provides tools to convert between different geodesic coordinate systems
#    and for calculating distances between points (coordinates) in a specific geodesic system.
#
from pyproj import Transformer, Geod
#    
#
#    The geojson module reads GeoJSON-style files. It has its own quirks.
#    It is used below to load the wildfire dataset.
#    
import geojson

import pandas as pd
#

We will define global constants that set the scope of the project as defined earlier. These will be used to filter the data. The latitude and longitude in decimal degrees are obtained from Wikipedia - [Madison](https://en.wikipedia.org/wiki/Madison,_Wisconsin).
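Coordinates are often published in degrees, minutes, and seconds, while the notebook needs signed decimal degrees. The sketch below shows the conversion; `dms_to_decimal` is a hypothetical helper, and the degree, minute, and second values are back-computed from the decimals used in this notebook, so they are an assumption rather than a quotation from the source page.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South and West are returned as negative values, which is the usual
    convention for latitude/longitude in decimal degrees.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# Assumed DMS values, back-computed from the notebook's decimal coordinates.
madison_lat = round(dms_to_decimal(43, 4, 29, "N"), 6)
madison_lon = round(dms_to_decimal(89, 23, 3, "W"), 6)
print(madison_lat, madison_lon)
```

Note that the notebook stores the pair in latitude, longitude order, whereas the pyproj distance call used later takes longitude first.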

In [3]:
#
#    CONSTANTS
#
#    The input file path and a dictionary holding the location of the city under study.
#
DATA_FILENME  = 'input_files/USGS_Wildland_Fire_Combined_Dataset.json'
CITY_LOCATIONS = {
    'madison' :     {'city'   : 'Madison',
                       'latlon' : [43.074722,-89.384167] }
}

MILE_BOUNDARY = 650
STARTYEAR = 1964
ENDYEAR = 2024

We use the GeoJSON module ([documentation](https://pypi.org/project/geojson/), [GitHub repo](https://github.com/jazzband/geojson)) to load the data file. This module converts the geographic structures in the file into Python objects that we can work with.

In [4]:
#
#    Open the file and load it with the geojson loader; the context manager closes the file
#
print(f"Attempting to open '{DATA_FILENME}'")
with open(DATA_FILENME, "r") as geojson_file:
    print(f"Using GeoJSON module to load data file '{DATA_FILENME}'")
    gj_data = geojson.load(geojson_file)
#
#    Print the keys from the object
#
gj_keys = list(gj_data.keys())
print("The loaded JSON dictionary has the following keys:")
print(gj_keys)
print()
#
#    For all GeoJSON type things, the most important part of the file is the 'features'. 
#    In the case of the wildfire dataset, each feature is a polygon (ring) of points that defines the boundary of a fire
#
print(f"Found {len(gj_data['features'])} features in the variable 'gj_data' ")
print(f"\n Found {len(gj_data['fields'])} attributes in the variable 'gj_data' ")
#

Attempting to open 'input_files/USGS_Wildland_Fire_Combined_Dataset.json'
Using GeoJSON module to load data file 'input_files/USGS_Wildland_Fire_Combined_Dataset.json'
The loaded JSON dictionary has the following keys:
['displayFieldName', 'fieldAliases', 'geometryType', 'spatialReference', 'fields', 'features']

Found 135061 features in the variable 'gj_data' 

 Found 30 attributes in the variable 'gj_data' 


The loaded JSON file has 135,061 records and 30 attributes. It has six keys.

- **displayFieldName:** An empty string which otherwise denotes the name of the dataset.

- **fieldAliases:** A dictionary that holds the human-readable name of each of the 30 fields.

- **geometryType:** Specifies the shape type of the spatial data.

- **spatialReference:** Defines the coordinate system used for the geodata.

- **fields:** A list of dicts identifying the name, type, and alias of each of the 30 attributes.

- **features:** The list of all observations stored in JSON format. Each observation is saved as a dictionary with keys for the attributes (fields) and geometry.
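To make the relationship between these keys concrete, here is a small sketch built on a hand-written stand-in for the real file. The values in `sample` are illustrative only, and `undeclared_attributes` is a hypothetical helper that checks whether every attribute used by a feature is declared in `fields`.

```python
# A tiny hand-written stand-in for the real file; values are illustrative only.
sample = {
    "displayFieldName": "",
    "fieldAliases": {"OBJECTID": "OBJECTID", "Fire_Year": "Fire_Year"},
    "geometryType": "esriGeometryPolygon",
    "spatialReference": {"wkid": 102008, "latestWkid": 102008},
    "fields": [
        {"name": "OBJECTID", "type": "esriFieldTypeOID", "alias": "OBJECTID"},
        {"name": "Fire_Year", "type": "esriFieldTypeInteger", "alias": "Fire_Year"},
    ],
    "features": [
        {
            "attributes": {"OBJECTID": 1, "Fire_Year": 1860},
            "geometry": {"rings": [[[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]]},
        }
    ],
}


def undeclared_attributes(data):
    """Return attribute names used by features but absent from data['fields']."""
    declared = {field["name"] for field in data["fields"]}
    extra = set()
    for feature in data["features"]:
        extra.update(set(feature["attributes"]) - declared)
    return sorted(extra)


field_names = [field["name"] for field in sample["fields"]]
print("Declared fields:", field_names)
print("Undeclared attributes:", undeclared_attributes(sample))
```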

The following code section displays the first record in the file to better understand the data structure, the attribute names of the features, the rings data, and so on.

#### Step 2: Data Processing

In [5]:
#
#    Get the first item in the list of features
#
SLOT = 0
gj_feature = gj_data['features'][SLOT]
#
#    Print everything in this dictionary (i.e., gj_feature) - it's long
#
print(f"The wildfire feature from slot '{SLOT}' of the loaded gj_data['features']")
print(json.dumps(gj_feature, indent=4))


The wildfire feature from slot '0' of the loaded gj_data['features']
{
    "attributes": {
        "OBJECTID": 1,
        "USGS_Assigned_ID": 1,
        "Assigned_Fire_Type": "Wildfire",
        "Fire_Year": 1860,
        "Fire_Polygon_Tier": 1,
        "Fire_Attribute_Tiers": "1 (1)",
        "GIS_Acres": 3940.20708940724,
        "GIS_Hectares": 1594.5452365353703,
        "Source_Datasets": "Comb_National_NIFC_Interagency_Fire_Perimeter_History (1)",
        "Listed_Fire_Types": "Wildfire (1)",
        "Listed_Fire_Names": "Big Quilcene River (1)",
        "Listed_Fire_Codes": "No code provided (1)",
        "Listed_Fire_IDs": "",
        "Listed_Fire_IRWIN_IDs": "",
        "Listed_Fire_Dates": "Listed Other Fire Date(s): 2006-11-02 - NIFC DATE_CUR field (1)",
        "Listed_Fire_Causes": "",
        "Listed_Fire_Cause_Class": "Undetermined (1)",
        "Listed_Rx_Reported_Acres": null,
        "Listed_Map_Digitize_Methods": "Other (1)",
        "Listed_Notes": "",
        "Proce

We now extract the largest ring of the first sample feature displayed above. The largest shape (ring) is supposed to be item zero in the list of 'rings'. This information is useful because we will use the largest ring to calculate the distance in the following sections.
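Because the text only says the largest ring is "supposed to be" first, it can be worth checking. The sketch below compares rings by planar area using the shoelace formula; `ring_area` and `index_of_largest_ring` are hypothetical helpers, and the toy rings are made up. Planar area is a reasonable proxy here on the assumption that the rings are in projected x,y coordinates, as they are in this dataset.

```python
def ring_area(ring):
    """Absolute planar area of a ring of [x, y] points via the shoelace formula."""
    total = 0.0
    # Pair each point with the next one, wrapping around to close the ring.
    for current, following in zip(ring, ring[1:] + ring[:1]):
        total += current[0] * following[1] - following[0] * current[1]
    return abs(total) / 2.0


def index_of_largest_ring(rings):
    """Return the position of the ring with the greatest planar area."""
    return max(range(len(rings)), key=lambda i: ring_area(rings[i]))


# Made-up example: a 4x4 outer boundary and a 1x1 inner ring.
outer = [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]
inner = [[1, 1], [2, 1], [2, 2], [1, 2], [1, 1]]

print(ring_area(outer), ring_area(inner))
print(index_of_largest_ring([outer, inner]))
```

If the returned index is not zero for some feature, the assumption about ring ordering does not hold for that feature.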

In [6]:
#
#    Every feature has a 'geometry' which specifies geo coordinates that make up each geographic thing
#    In the case of the wildfire data, most wildfires are bounded shapes, circles, squares, etc. This is
#    represented by shapes called 'rings' in GeoJSON.
# 
# Get the geometry for the feature we pulled from the feature_list
gj_geometry = gj_feature['geometry']
# The largest shape (ring) is supposed to be item zero in the list of 'rings'
gj_bigest_ring = gj_geometry['rings'][0]

print(f"The largest ring of gj_feature['features'][{SLOT}]['rings'] consists of {len(gj_bigest_ring)} points.")

The largest ring of gj_feature['features'][0]['rings'] consists of 768 points.


One of the constraints in doing geodetic computations is that most of the time we need to have our points (the coordinates for places) in the same geographic coordinate system. There are tons and tons of coordinate systems. We can find descriptions of many of them at [EPSG.io.](https://www.google.com/url?q=https%3A%2F%2Fepsg.io) So we will have to identify the coordinate system used for the rings in our dataset. As mentioned earlier, two of the six keys hold information about the shape and the coordinate system, so we will check them.



In [8]:
print(f"geometryType : {gj_data['geometryType']}")
print(f"spatialReference : {gj_data['spatialReference']}")

geometryType : esriGeometryPolygon
spatialReference : {'wkid': 102008, 'latestWkid': 102008}


The above output indicates that the geometry of our wildfire data are generic polygons and that they are expressed in a coordinate system with the well-known ID (WKID) 102008. This coordinate system is also known as [ESRI:102008](https://www.google.com/url?q=https%3A%2F%2Fepsg.io%2F102008)

But the most common system is 'WGS84', a representation of the earth, that also relies on a well known coordinate system that is sometimes called 'decimal degrees' (DD). That decimal degrees system has an official name (or WKID) of [EPSG:4326.](https://www.google.com/url?q=https%3A%2F%2Fepsg.io%2F4326)

The function below takes the ring of a fire feature (the largest boundary of the fire, usually the first in the list of rings) and converts all of the points in that ring from the ESRI:102008 coordinate system to EPSG:4326 coordinates. Since we have already defined the Madison, WI coordinates in EPSG:4326, this makes it easier to calculate the distance.

In [9]:
#
#    Transform feature geometry data
#
#    The function takes one parameter, a list of ESRI:102008 coordinates that will be transformed to EPSG:4326
#    The function returns a list of coordinates in EPSG:4326
def convert_ring_to_epsg4326(ring_data=None):
    converted_ring = list()
    #
    # We use a pyproj transformer that converts from ESRI:102008 to EPSG:4326 to transform the list of coordinates
    to_epsg4326 = Transformer.from_crs("ESRI:102008","EPSG:4326")
    # We'll run through the list transforming each ESRI:102008 x,y coordinate into a decimal degree lat,lon
    for coord in ring_data:
        lat,lon = to_epsg4326.transform(coord[0],coord[1])
        new_coord = lat,lon
        converted_ring.append(new_coord)
    return converted_ring

Next we will have to find out how far a fire is from Madison. Because fires have irregular shapes, the distance can change based on how we measure it. To keep things simple and accurate, I will use the shortest distance by measuring from the closest point on the fire's edge. This method gives a good estimate of the minimum distance to the fire, which is helpful for assessing the smoke.

The function below finds the point on the perimeter with the shortest distance to the city (Madison) and returns the distance as well as the latitude and longitude of that perimeter point.


In [11]:
#    
#    The function takes two parameters
#        A place - a coordinate point (list or tuple with two items, (lat,lon)) in decimal degrees EPSG:4326
#        Ring_data - a list of decimal degree coordinates for the fire boundary
#
#    The function returns a list containing the shortest distance to the perimeter and the point where that is
#
def shortest_distance_from_place_to_fire_perimeter(place=None,ring_data=None):
    # convert the ring data to the right coordinate system
    ring = convert_ring_to_epsg4326(ring_data)   
    # create a epsg4326 compliant object - which is what the WGS84 ellipsoid is
    geodcalc = Geod(ellps='WGS84')
    closest_point = list()
    # run through each point in the converted ring data
    for point in ring:
        # calculate the distance
        d = geodcalc.inv(place[1],place[0],point[1],point[0])
        # convert the distance to miles
        distance_in_miles = d[2]*0.00062137
        # if it's closer to the city than the point we have, save it
        if not closest_point:
            closest_point.append(distance_in_miles)
            closest_point.append(point)
        elif closest_point and closest_point[0]>distance_in_miles:
            closest_point = list()
            closest_point.append(distance_in_miles)
            closest_point.append(point)
    return closest_point
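The "closest point on the perimeter" idea can be illustrated without pyproj. The sketch below swaps in the haversine formula on a spherical Earth, which is only an approximation of the WGS84 ellipsoid calculation used above; `haversine_miles` and `closest_perimeter_point` are hypothetical helpers, the Earth radius is an assumed mean value, and the toy perimeter is made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def closest_perimeter_point(place, ring_latlon):
    """Return (distance_in_miles, (lat, lon)) for the ring point nearest to `place`."""
    candidates = (
        (haversine_miles(place[0], place[1], lat, lon), (lat, lon))
        for lat, lon in ring_latlon
    )
    return min(candidates, key=lambda pair: pair[0])


madison = (43.074722, -89.384167)
# A made-up perimeter in (lat, lon) order, for illustration only.
toy_ring = [(44.0, -89.384167), (45.0, -90.0), (46.0, -88.0)]
distance, point = closest_perimeter_point(madison, toy_ring)
print(f"{distance:.1f} miles to {point}")
```

Using `min` with a key replaces the manual bookkeeping of the current closest point. Like the function above, this only inspects perimeter vertices, so a long straight edge passing near the city could be closer than any of its endpoints.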


To calculate the distance for each fire, we will have to first extract all the features and store them in a variable for repeated use in the following sections of the code. 
The code below loads the features from the loaded GeoJSON object 'gj_data' into a list, tracking the number of features loaded and printing progress every 10,000 features.

In [12]:
MAX_FEATURE_LOAD = 10000
feature_list = list()
feature_count = 0
# Loop through each feature in the GeoJSON FeatureCollection
for feature in gj_data['features']:
    feature_list.append(feature)
    feature_count += 1
    
    # Print progress every 10,000 features
    if (feature_count % 10000) == 0:
        print(f"Loaded {feature_count} features")
    

# Print the final count of loaded features
print(f"Loaded a total of {feature_count} features")

# Verify the feature list length matches the loaded count
print(f"Variable 'feature_list' contains {len(feature_list)} features")

Loaded 10000 features
Loaded 20000 features
Loaded 30000 features
Loaded 40000 features
Loaded 50000 features
Loaded 60000 features
Loaded 70000 features
Loaded 80000 features
Loaded 90000 features
Loaded 100000 features
Loaded 110000 features
Loaded 120000 features
Loaded 130000 features
Loaded a total of 135061 features
Variable 'feature_list' contains 135061 features


We will use this feature list to iterate through every record and compute the shortest distance to Madison, using the largest ring available for that record. For simplicity, I will store only the required fields, i.e., OBJECTID and shortest distance, and write them to a CSV file, distance.csv. Some features have an irregular shape for which the ring lookup raises a KeyError; such features are skipped during distance computation.
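Skipping features without a usable ring can also be done by checking for the ring before computing anything. The following is a sketch under the assumption that the geometry layout is `feature['geometry']['rings']`, as used elsewhere in this notebook; `first_ring` is a hypothetical helper and the three features are made up.

```python
def first_ring(feature):
    """Return the first ring of a feature, or None when no usable ring exists.

    `first_ring` is a hypothetical helper; it assumes the
    feature['geometry']['rings'] layout used in this notebook.
    """
    geometry = feature.get("geometry") or {}
    rings = geometry.get("rings") or []
    return rings[0] if rings else None


# Made-up features: one regular, one without 'rings', one without geometry.
features = [
    {"attributes": {"OBJECTID": 1}, "geometry": {"rings": [[[0, 0], [0, 1], [1, 1], [0, 0]]]}},
    {"attributes": {"OBJECTID": 2}, "geometry": {}},
    {"attributes": {"OBJECTID": 3}},
]

skipped = [f["attributes"]["OBJECTID"] for f in features if first_ring(f) is None]
print(f"Skipped OBJECTIDs: {skipped}")
```

Collecting the skipped IDs in a list, rather than only printing them, makes it easy to report afterwards how many features were left out of the distance file.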

In [12]:
# Initialize empty lists to store wildfire IDs and their respective shortest distances.
fire_ids = []
shortest_dist_from_edge = []

# Initialize a counter to keep track of the number of features processed.
features_processed = 0

# Loop through each feature in the provided feature_list
for feature in feature_list:
    try:
        # Extract the ring data representing the wildfire geometry.
        ring_data = feature['geometry']['rings'][0]
                                                    
        # Calculate the shortest distance from the city location to the wildfire perimeter using a function.
        distance = shortest_distance_from_place_to_fire_perimeter(CITY_LOCATIONS['madison']['latlon'], ring_data)
        
        # Append the wildfire ID
        fire_ids.append(feature['attributes']['OBJECTID'])
        shortest_dist_from_edge.append(round(distance[0], 2))
    except KeyError:
        # Handle cases where the wildfire feature has an irregular shape by skipping it.
        print(f"{feature['attributes']['OBJECTID']} has irregular shape cannot compute distance")

    features_processed += 1
    if features_processed % 10000 == 0:
        print("Processed {0} features".format(features_processed))
        # Save intermediate results to a CSV file every 10000 features processed.
        dist_df = pd.DataFrame({'OBJECTID': fire_ids, 'shortest_dist': shortest_dist_from_edge})
        dist_df.to_csv('intermediary_files/distance.csv', index=False)

# Build the final DataFrame after processing all features; it is written to CSV in the next step
distance_df = pd.DataFrame({'OBJECTID': fire_ids, 'shortest_dist': shortest_dist_from_edge})

Processed 10000 features
Processed 20000 features
Processed 30000 features
Processed 40000 features
Processed 50000 features
Processed 60000 features
Processed 70000 features
Processed 80000 features
Processed 90000 features
Processed 100000 features
109605 has irregular shape cannot compute distance
Processed 110000 features
110224 has irregular shape cannot compute distance
110639 has irregular shape cannot compute distance
111431 has irregular shape cannot compute distance
111776 has irregular shape cannot compute distance
111897 has irregular shape cannot compute distance
112410 has irregular shape cannot compute distance
112415 has irregular shape cannot compute distance
113411 has irregular shape cannot compute distance
113665 has irregular shape cannot compute distance
113738 has irregular shape cannot compute distance
113766 has irregular shape cannot compute distance
113805 has irregular shape cannot compute distance
114309 has irregular shape cannot compute distance
114322 ha

#### Step 3: File Generation

We will now save the distance information in a csv file for later use.

In [None]:
distance_df.to_csv('intermediary_files/distance.csv', index=False)

The next step is to read and store the fields from the feature list that are relevant to our analysis. To do so, we will convert our feature list into a DataFrame and filter for the years 1964 to 2024. We have two relevant date fields:
- "Fire_Year" - The calendar year designated as the fire year for the focal fire boundary during dataset creation, representing when dataset producers identified the fire event as having occurred.
- "Listed_Fire_Dates" - Each fire date listed from the merged dataset for fires intersecting this polygon in space and year, with the count of contributing features indicated in parentheses after each date. These dates could be the fire start date, end date, carbon-dated dates, assumed fire dates, fire documented date, dataset modified date, etc.

The Fire_Year attribute is more reliable than Listed_Fire_Dates for identifying fires within a given year. Fire_Year is assigned directly by the dataset producers as the main fire year, giving one consistent year per fire event. Listed_Fire_Dates, by contrast, includes every individual date from the merged datasets that overlap in the same area and year. This provides more detail but also adds variation and possible duplication, especially when multiple sources are involved. Fire_Year therefore makes it simpler to find fires in a specific timeframe, with no need to interpret several dates. Hence we will use Fire_Year for filtering the data, assuming all fires occurred within the months we have defined, since no reliable column lets us determine the month.
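The first feature printed earlier illustrates the point. The sketch below pulls the years out of its Listed_Fire_Dates string and compares them with its Fire_Year; `listed_years` is a hypothetical helper, and it assumes dates appear in `YYYY-MM-DD` form, as they do in that one record.

```python
import re

# Values copied from the first feature printed earlier in this notebook.
fire_year = 1860
listed_fire_dates = "Listed Other Fire Date(s): 2006-11-02 - NIFC DATE_CUR field (1)"


def listed_years(text):
    """Return the year of every YYYY-MM-DD date found in `text`."""
    return [int(year) for year in re.findall(r"\b(\d{4})-\d{2}-\d{2}\b", text)]


years = listed_years(listed_fire_dates)
print(f"Fire_Year={fire_year}, years found in Listed_Fire_Dates={years}")
```

For this record the only listed date comes from a dataset-maintenance field and falls in 2006, while Fire_Year is 1860, so filtering on the listed dates would place the fire in the wrong period.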

Additionally, I load only the columns that I determined to be relevant to our analysis.

In [23]:
# Initialize a list to store the data for the DataFrame
wildfire_data = []

# Loop through each feature in the GeoJSON FeatureCollection
for feature in feature_list:
    # Extract required attributes
    attributes = feature['attributes']
    
    # Create a dictionary with only the desired keys
    filtered_data = {
        'OBJECTID': attributes.get('OBJECTID'),
        'Assigned_Fire_Type': attributes.get('Assigned_Fire_Type'),
        'Fire_Year': attributes.get('Fire_Year'),
        'GIS_Acres': attributes.get('GIS_Acres'),
        'GIS_Hectares': attributes.get('GIS_Hectares'),
        'Listed_Fire_Names': attributes.get('Listed_Fire_Names'),
        'Overlap_Within_1_or_2_Flag': attributes.get('Overlap_Within_1_or_2_Flag'),
        'Shape_Length': attributes.get('Shape_Length'),
        'Shape_Area': attributes.get('Shape_Area'),
        'Circleness_Scale': attributes.get('Circleness_Scale'),
    }
    
    # Append the filtered data to the list
    wildfire_data.append(filtered_data)

# Create a DataFrame from the filtered data
wildfire_data_df = pd.DataFrame(wildfire_data)
wildfire_data_df = wildfire_data_df.loc[wildfire_data_df['Fire_Year'].between(STARTYEAR, ENDYEAR)]

This filtered DataFrame is now merged with the distances DataFrame to include the distance column for every fire feature. We will save the result to a CSV file - [all_wildfire_1964_2024.csv](https://github.com/ManasaSRonur/data-512-project/blob/main/intermediary_files/all_wildfire_1964_2024.csv), which can be used later to analyze the data and create visualizations.

In [24]:
# Merge the two DataFrames on 'OBJECTID'
merged_df = pd.merge(wildfire_data_df, distance_df, on='OBJECTID', how='inner')
merged_df.to_csv('intermediary_files/all_wildfire_1964_2024.csv', index = False)
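An inner merge keeps only the OBJECTIDs present in both inputs, so any fire whose distance could not be computed drops out of the merged file. The plain-Python sketch below mimics that behaviour on made-up rows; `inner_join` is a hypothetical helper, not a pandas function, and it assumes each key appears at most once on the right-hand side.

```python
# Made-up attribute rows, one per fire.
attributes = [
    {"OBJECTID": 1, "Fire_Year": 1970},
    {"OBJECTID": 2, "Fire_Year": 1985},
    {"OBJECTID": 3, "Fire_Year": 2001},
]
# OBJECTID 2 is absent here, as it would be if its distance could not be computed.
distances = [
    {"OBJECTID": 1, "shortest_dist": 512.30},
    {"OBJECTID": 3, "shortest_dist": 1204.75},
]


def inner_join(left, right, key):
    """Combine rows from `left` and `right` that share the same `key` value."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


merged = inner_join(attributes, distances, "OBJECTID")
print([row["OBJECTID"] for row in merged])
```

Comparing the row counts before and after the merge is a quick way to see how many fires were dropped for lack of a distance.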

We will now filter the all-fires dataset to include only those fires that are within 650 miles of Madison. This gives us a final dataset that can be used for analysis and prediction. It is stored as a CSV file for easy access - [madison_wildfire_1964_2024.csv](https://github.com/ManasaSRonur/data-512-project/blob/main/intermediary_files/madison_wildfire_1964_2024.csv).

In [25]:
merged_df = merged_df.loc[merged_df['shortest_dist']<= MILE_BOUNDARY]
print(f"{len(merged_df)} fires occurred within {MILE_BOUNDARY} miles of Madison, Wisconsin from {STARTYEAR} to {ENDYEAR}")
merged_df.to_csv('intermediary_files/madison_wildfire_1964_2024.csv', index = False)


19627 fires occurred within 650 miles of Madison, Wisconsin from 1964 to 2024


So we have 19,627 fires that occurred within 650 miles of Madison from 1964 to 2024. This CSV will serve as the base for our smoke estimates and contains the columns that I determined to be relevant to our analysis.