In [None]:
## Phase I Project Proposal

### Predicting Wildfire Risk Using Meteorological and Spatiotemporal Data

#### Name: Jason Balayev, DS 3000

### Introduction

Wildfires pose a major threat worldwide. Early prediction of wildfire risk is important for planning evacuations and other safety measures. I'm interested in determining whether specific combinations of weather conditions, temperature patterns, and spatial characteristics can predict wildfire occurrences. I'm also interested in how the length of the temporal window of meteorological data before ignition affects prediction results.

### Data Collection

I plan to collect data using NOAA's Climate Data Online (CDO) API, which provides weather and climate data from weather stations across the United States. This API will let me retrieve historical weather data, including temperature, humidity, and other meteorological variables that are crucial for wildfire prediction. I plan to combine this weather data with wildfire occurrence data from NASA's Fire Information for Resource Management System (FIRMS) API to create a combined dataset in which weather conditions can be matched to fire occurrences.
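
The combination step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the three small tables are made-up placeholder records (not real NOAA or FIRMS data), the column names `station`, `date`, `latitude`, and `longitude` are placeholders that real API output would need to be renamed to, and the 25 km radius is an arbitrary choice. The idea is to mark a station-day as a fire day if any fire detection fell within a given distance of the station on that date.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def label_station_days(weather, stations, fires, radius_km=25.0):
    """Add a 0/1 `fire` column: 1 if any detection was within radius_km of the station that day."""
    merged = weather.merge(stations, on="station", how="left")
    # pair every station-day with every detection on the same date
    pairs = merged.merge(fires, on="date", how="left", suffixes=("", "_fire"))
    dist = haversine_km(pairs["latitude"], pairs["longitude"],
                        pairs["latitude_fire"], pairs["longitude_fire"])
    pairs["near"] = dist <= radius_km  # days with no detection give NaN, which compares as False
    flag = pairs.groupby(["station", "date"])["near"].any().astype(int).rename("fire")
    return weather.merge(flag.reset_index(), on=["station", "date"], how="left")

# Placeholder records for illustration only (hypothetical station and coordinates)
dates = pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03"])
weather = pd.DataFrame({"station": ["S1"] * 3, "date": dates, "TMAX": [22.0, 24.5, 23.1]})
stations = pd.DataFrame({"station": ["S1"], "latitude": [38.0], "longitude": [-119.5]})
fires = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-02"]),
    "latitude": [38.1, 36.0],
    "longitude": [-119.5, -118.0],
})

labeled = label_station_days(weather, stations, fires)
print(labeled)
```

Pairing every station-day with every same-day detection is simple but grows quickly with data size; a spatial index or a coarse grid would likely be needed for a full dataset.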


The code below needs access to the NOAA API; a token can be obtained by creating an account on NCDC's website. I used my own API key, so you will have to replace the string with your own. **I have not yet incorporated NASA's Fire Information for Resource Management System (FIRMS) API in this proposal.**
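
One way to avoid placing the token in the notebook itself is to read it from an environment variable. The sketch below assumes a hypothetical variable name, `NOAA_CDO_TOKEN`, and two hypothetical helper functions; only the `token` header name comes from the code in this proposal.

```python
import os

def load_noaa_token(env_var: str = "NOAA_CDO_TOKEN") -> str:
    """Return the API token stored in an environment variable (the name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your NOAA API token.")
    return token

def build_headers(token: str) -> dict:
    """Build the request headers; the token is sent in a header named 'token'."""
    return {"token": token}
```

With this in place, `headers = build_headers(load_noaa_token())` could replace the hard-coded string, so the notebook can be shared without exposing a key.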

In [15]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# NOAA Climate Data Online api endpoint
url = "https://www.ncdc.noaa.gov/cdo-web/api/v2/"

# API token from NOAA
api_key = "YOUR_NOAA_API_TOKEN" # replace with your own token

# set up headers with API token
headers = {
    'token': api_key
}

# weather stations in CA using Federal Information Processing Standards (FIPS) code system
stations_url = url + "stations"
params = {
    'locationid': 'FIPS:06', # california code (FIPS)
    'datasetid': 'GHCND',
    'datatypeid': 'TMAX,TMIN,PRCP,AWND',
    'startdate': '2020-01-01',
    'enddate': '2023-12-31',
    'limit': 50 # number of stations to return (API default is 25, maximum is 1000)
}

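# request the station list
response = requests.get(stations_url, headers=headers, params=params, timeout=30)
response.raise_for_status() # stop early on an HTTP error (e.g. an invalid token)
stations_data = response.json()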

print(f"{len(stations_data.get('results', []))} weather stations")
print("CA station data:")
if 'results' in stations_data:
    for station in stations_data['results'][:3]:
        print(f"- {station['name']}: {station['id']}")

# sample weather data from the first station returned (not checked for completeness or location)
if 'results' in stations_data and len(stations_data['results']) > 0:
    station_id = stations_data['results'][0]['id']

    # daily weather observations for July-September 2023
    data_url = url + "data"
    weather_params = {
        'datasetid': 'GHCND',
        'stationid': station_id,
        'startdate': '2023-07-01',
        'enddate': '2023-09-30',
        'datatypeid': 'TMAX,TMIN,PRCP,AWND',
        'limit': 1000
    }

    weather_response = requests.get(data_url, headers=headers, params=weather_params)
    weather_data = weather_response.json()

    # convert to DataFrame
    if 'results' in weather_data:
        df = pd.DataFrame(weather_data['results'])
        display(df.head())

50 weather stations
CA station data:
- LEAVITT LAKE, CA US: GHCND:USS0019L38S
- SUMMIT MEADOW, CA US: GHCND:USS0019L42S
- SQUAW VALLEY G.C., CA US: GHCND:USS0020K30S


|   | date | datatype | station | attributes | value |
|---|---|---|---|---|---|
| 0 | 2023-07-01T00:00:00 | AWND | GHCND:USS0019L38S | ,,T, | 10 |
| 1 | 2023-07-01T00:00:00 | PRCP | GHCND:USS0019L38S | ,,T, | 0 |
| 2 | 2023-07-01T00:00:00 | TMAX | GHCND:USS0019L38S | ,,T, | 220 |
| 3 | 2023-07-01T00:00:00 | TMIN | GHCND:USS0019L38S | ,,T, | 105 |
| 4 | 2023-07-02T00:00:00 | AWND | GHCND:USS0019L38S | ,,T, | 12 |


### Plans

The code above simply takes the first station returned, without checking whether it has complete data or is in a relevant location. I still need to clean the data and convert the raw values into standard units (for example, the TMAX value of 220 in the sample output appears to be in tenths of a degree Celsius). I also need to set up the NASA Fire Information for Resource Management System (FIRMS) API endpoint.
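
A minimal sketch of the cleaning step, using the five sample rows shown in the output above (typed in by hand). It assumes the values are stored in tenths of the physical unit, which is the GHCN-Daily convention for these four variables; this should be confirmed against the API's `units` setting before use. The function names `tidy_weather` and `coverage` are hypothetical.

```python
import pandas as pd

# The five sample rows from the output above, typed in by hand for illustration
raw = pd.DataFrame({
    "date": ["2023-07-01T00:00:00"] * 4 + ["2023-07-02T00:00:00"],
    "datatype": ["AWND", "PRCP", "TMAX", "TMIN", "AWND"],
    "station": ["GHCND:USS0019L38S"] * 5,
    "value": [10, 0, 220, 105, 12],
})

# Assumption: every variable is reported in tenths (deg C, mm, or m/s)
DIVISOR = {"TMAX": 10, "TMIN": 10, "PRCP": 10, "AWND": 10}

def tidy_weather(df):
    """Convert units and reshape from one row per reading to one row per station-day."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["value"] = out["value"] / out["datatype"].map(DIVISOR)
    wide = out.pivot_table(index=["station", "date"], columns="datatype",
                           values="value", aggfunc="first")
    wide.columns.name = None
    return wide.reset_index()

def coverage(wide, cols=("TMAX", "TMIN", "PRCP", "AWND")):
    """Fraction of station-days with a non-missing value, per station and variable."""
    present = [c for c in cols if c in wide.columns]
    return wide.groupby("station")[present].agg(lambda s: s.notna().mean())

wide = tidy_weather(raw)
print(wide)
print(coverage(wide))
```

A coverage table like this is one possible way to choose stations with complete records instead of taking the first result.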

For my ML model, I plan to use tree-based ensemble methods such as Random Forest and Gradient Boosting. Because these models do not capture temporal dependencies on their own, I will supply them with lagged and rolling-window features built from the weather data. I will also use feature importance to identify which meteorological factors are most predictive of wildfire risk, and measure how prediction accuracy changes with the length of the temporal window used for forecasting.
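
The window-length comparison described above could be sketched as follows. Everything here is an assumption for illustration: the daily weather is random synthetic data, the fire labels come from a made-up rule, and the feature names and window lengths (3, 7, and 14 days) are arbitrary. A tree ensemble treats each row independently, so in this sketch the temporal information enters through rolling statistics computed over the days before each date.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic daily weather for one station (random numbers, NOT real observations)
rng = np.random.default_rng(0)
n_days = 600
daily = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=n_days, freq="D"),
    "TMAX": rng.normal(25, 6, n_days),
    "PRCP": rng.exponential(1.0, n_days),
    "AWND": rng.gamma(2.0, 1.5, n_days),
})

# Made-up labeling rule: a fire is more likely after a hot, dry week
hot_dry = daily["TMAX"].rolling(7).mean() - 3 * daily["PRCP"].rolling(7).mean()
prob = 1 / (1 + np.exp(-(hot_dry - hot_dry.median()) / 2))
daily["fire"] = (rng.random(n_days) < prob.fillna(0)).astype(int)

def window_features(df, window):
    """Rolling statistics over the `window` days strictly before each date."""
    past = df[["TMAX", "PRCP", "AWND"]].shift(1)  # exclude same-day values
    return pd.DataFrame({
        f"TMAX_mean_{window}d": past["TMAX"].rolling(window).mean(),
        f"TMAX_max_{window}d": past["TMAX"].rolling(window).max(),
        f"PRCP_sum_{window}d": past["PRCP"].rolling(window).sum(),
        f"AWND_mean_{window}d": past["AWND"].rolling(window).mean(),
    })

def evaluate_window(df, window, split=0.75):
    """Fit on the earlier part of the series, score on the later part."""
    X = window_features(df, window)
    y = df["fire"]
    valid = X.notna().all(axis=1)  # drop the warm-up rows without a full window
    X, y = X[valid], y[valid]
    cut = int(len(X) * split)  # chronological split, no shuffling
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X.iloc[:cut], y.iloc[:cut])
    scores = model.predict_proba(X.iloc[cut:])[:, 1]
    auc = roc_auc_score(y.iloc[cut:], scores)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return auc, importances

results = {w: evaluate_window(daily, w)[0] for w in (3, 7, 14)}
print(results)
```

The split is chronological so that the model is never scored on days that precede its training data. ROC AUC is used here only as one possible metric; with real data, where fire days are rare, a precision-recall based score may be more informative.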