***
# Project Proposal
***
*COMP5360 Introduction to Data Science, Spring 2023*

#### Project Title: Incendio

>Hannah Van Hollebeke	vanhollebeke.hannah@gmail.com          u0697848 <br>
>Jenine Rogel           	u0468294@umail.utah.edu	               u0468294 <br>
>Isabelle Cook	        u1316961@utah.edu	                   u1316961 <br>

**GitHub Repository:**

https://github.com/Bonampak1/Incendio

## Background and Motivation
***
In 2020, multiple large forest fires ravaged the western coast of the United States. Because of this, smoke spread over the west and the sky turned gray. This prompted an interest in the effects of forest fires. Furthermore, a study conducted by Li et al. showed that these fires adversely affected the air quality during that year. The study focused on the fires' effect on the pollutant PM2.5, which is one of the more harmful air pollutants produced by wildfire smoke, and found that "the West Coast wildfires contributed to 23% of surface PM2.5 pollution nationwide" (Li et al.). This led us to look at the correlations between forest fires and air quality. However, we will not only be looking at correlations for the PM2.5 pollutant but also for other pollutants produced by fires, such as carbon monoxide, and the daily air quality index (AQI). Below is a chart that explains the meaning of the daily AQI.

![AQI%20index.png](attachment:AQI%20index.png)
https://www.airnow.gov/aqi/aqi-basics/

Another thing that led us to choose this project is the potential negative impact of bad air quality on public health. In a study by Anenberg et al. on the relationship between air pollution and asthma emergency room visits, an estimated 4-9% of asthma emergency room visits globally in 2015 could be attributed to the air pollutant PM2.5. While PM2.5 is not the only contributor to severe asthma, this share still accounts for about 5-10 million visits. This gives an example of how people can be negatively affected by air pollution. Moreover, because wildfires are one of the contributors to air pollutants, it is important to study the correlations between the two to potentially prevent the negative consequences of increased air pollutants.

Therefore, with this information, we chose to focus on correlations between air quality and forest fires to explore this aspect of wildfires.

## Project Objectives
***
We want to use forest fire data to predict air quality in the western United States as it relates to public health. We aim to answer the following questions:
- Do forest fires significantly correlate with different measures of air quality?
- How do fires affect different pollutants?
- Are areas of the western US more prone to wildfires? Are the same areas most affected by poor air quality?
- What temporal trends have occurred over the past two decades? Can we use these to predict the impact of pollutants in the future?

By answering these questions, we expect to identify the ways that wildfires are connected to the air quality and showcase a few examples of how these correlations can be utilized. Moreover, understanding these research objectives can help policy makers and healthcare providers make faster and more informed decisions regarding public health. Furthermore, with the results from our research, wildfire hotspots could be identified, and fire safety measures could be improved in those areas. Lastly, air quality predictions could be improved with the knowledge of how air quality is affected by wildfires.


## Ethical Considerations
***
The following groups have been identified as stakeholders whom the project results may impact. Stakeholders of high importance are those that are directly affected by the results of the project, such as homeowners and insurance providers in the affected area. The different committees, agencies, and government officials have medium importance because they would be affected indirectly through related policies. However, they have a high influence on all other groups. Finally, the project members are stakeholders with low importance because they are only doing their job to share their findings.
![stakeholder-2.png](attachment:stakeholder-2.png)



## Data Source and Acquisition
***
#### Fire Data
NASA provides satellite imaging data for forest fires, which can be observed in near real time. NASA maintains a data archive covering the past 20 years. General information can be found here: https://www.earthdata.nasa.gov/learn/toolkits/disasters-toolkit/wildfires-toolkit.

Data from two imaging instruments are available per country and year. Our analysis will use data from the United States over a time interval of 20 years. The appropriate CSV files will be downloaded directly from the webpage: https://firms.modaps.eosdis.nasa.gov/active_fire/.

The file contains geographical coordinates, date and time of imaging, brightness measurements, and some data source and quality attributes.

In [1]:
# View a sample csv from MODIS instrument for the US in 2021
import pandas as pd
file = 'example_fire_data.csv'
sample_fire = pd.read_csv(file)
sample_fire.head()

FileNotFoundError: [Errno 2] No such file or directory: 'example_fire_data.csv'

#### Air Quality Data
The Environmental Protection Agency (EPA) provides air quality data for various pollutant types. Daily data is available for direct download (https://www.epa.gov/outdoor-air-quality-data/download-daily-data). However, each CSV file requires selecting a specific pollutant, state, and year. This may be tedious given the geographic scope (~13 US states), multiple pollutants, and 20 years of data. The website provides an API, which may be worth using to query the specific data we are interested in more efficiently: https://aqs.epa.gov/aqsweb/documents/data_api.html.
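
To illustrate why an API may be less tedious than manual downloads, the sketch below builds one request URL per pollutant, state, and year combination. It makes no network calls. The endpoint path and query parameter names (`email`, `key`, `param`, `bdate`, `edate`, `state`) are our reading of the API documentation linked above and should be verified there before use; the credentials are placeholders, and splitting requests by calendar year is simply a choice made here to keep each response small. The parameter code 88101 (PM2.5) and state code 49 (Utah) match the sample output shown in this section.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed endpoint for daily summaries by state; confirm against the API docs.
BASE_URL = "https://aqs.epa.gov/data/api/dailyData/byState"


def build_requests(email, key, params, states, years):
    """Return one request URL per (pollutant, state, year) combination."""
    urls = []
    for param, state, year in product(params, states, years):
        query = {
            "email": email,           # account email (placeholder below)
            "key": key,               # API key (placeholder below)
            "param": param,           # AQS parameter code, e.g. 88101 for PM2.5
            "bdate": f"{year}0101",   # first day of the year, YYYYMMDD
            "edate": f"{year}1231",   # last day of the year, YYYYMMDD
            "state": state,           # two-digit state code, e.g. 49 for Utah
        }
        urls.append(f"{BASE_URL}?{urlencode(query)}")
    return urls


urls = build_requests(
    email="user@example.com",  # hypothetical credentials
    key="YOUR_KEY",
    params=["88101"],
    states=["49"],
    years=range(2021, 2023),
)
print(len(urls))
print(urls[0])
```

Extending `params`, `states`, and `years` to the full project scope produces the complete request list in one call, which could then be fetched in a loop and concatenated into a single table.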

The data includes geographical coordinates, date, pollutant measurements, and various identifying attributes. 

In [6]:
# View a sample csv for PM2.5 measurements in Utah in 2021
file = 'example_air_quality_data.csv'
sample_airQ = pd.read_csv(file)
sample_airQ.head()

Unnamed: 0,Date,Source,Site ID,POC,Daily Mean PM2.5 Concentration,UNITS,DAILY_AQI_VALUE,Site Name,DAILY_OBS_COUNT,PERCENT_COMPLETE,AQS_PARAMETER_CODE,AQS_PARAMETER_DESC,CBSA_CODE,CBSA_NAME,STATE_CODE,STATE,COUNTY_CODE,COUNTY,SITE_LATITUDE,SITE_LONGITUDE
0,01/01/2021,AQS,490050007,1,20.5,ug/m3 LC,69,,1,100.0,88101,PM2.5 - Local Conditions,30860.0,"Logan, UT-ID",49,Utah,5,Cache,41.842649,-111.852199
1,01/02/2021,AQS,490050007,1,14.6,ug/m3 LC,56,,1,100.0,88101,PM2.5 - Local Conditions,30860.0,"Logan, UT-ID",49,Utah,5,Cache,41.842649,-111.852199
2,01/03/2021,AQS,490050007,1,15.5,ug/m3 LC,58,,1,100.0,88101,PM2.5 - Local Conditions,30860.0,"Logan, UT-ID",49,Utah,5,Cache,41.842649,-111.852199
3,01/04/2021,AQS,490050007,1,11.6,ug/m3 LC,48,,1,100.0,88101,PM2.5 - Local Conditions,30860.0,"Logan, UT-ID",49,Utah,5,Cache,41.842649,-111.852199
4,01/05/2021,AQS,490050007,1,3.3,ug/m3 LC,14,,1,100.0,88101,PM2.5 - Local Conditions,30860.0,"Logan, UT-ID",49,Utah,5,Cache,41.842649,-111.852199


All raw data files, cleaned data files, code, deliverable documents, and visualization outputs will be maintained in our GitHub repository.

## Data Processing
***
Our datasets may need substantial cleanup, which will be implemented in Python via Jupyter Notebook.

### Data Cleaning
- Merge multiple datasets 
    - Join fire and air quality data by date
    - Combine air quality data for various pollutants
- Check missing or outlier values
- Remove attributes that don't add value to our project objectives
- Assign numerical values for categorical data 
    - Convert day and night assignments into binary integer values
- Unit conversions 
    - Convert fire brightness from Kelvin to Celsius
- Convert and parse date columns
- Filter rows for geographical location specific to western US
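
The cleaning steps above could look roughly like the following sketch. It runs on small stand-in tables rather than the real files. The fire column names (`latitude`, `longitude`, `brightness`, `acq_date`, `daynight`) and the bounding box used for the western US are assumptions for illustration; the air quality column names follow the sample output shown earlier.

```python
import pandas as pd

# Stand-in tables; the fire column names are assumed, not confirmed.
fire = pd.DataFrame({
    "latitude": [40.1, 35.0, 44.9],
    "longitude": [-111.9, -80.0, -120.5],
    "brightness": [320.5, 310.0, 335.2],   # Kelvin
    "acq_date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "daynight": ["D", "N", "N"],
})
air = pd.DataFrame({
    "Date": ["01/01/2021", "01/02/2021"],
    "DAILY_AQI_VALUE": [69, 56],
})

# Convert and parse date columns into a shared datetime key.
fire["date"] = pd.to_datetime(fire["acq_date"], format="%Y-%m-%d")
air["date"] = pd.to_datetime(air["Date"], format="%m/%d/%Y")

# Convert day and night assignments into binary integer values.
fire["is_day"] = (fire["daynight"] == "D").astype(int)

# Convert fire brightness from Kelvin to Celsius.
fire["brightness_c"] = fire["brightness"] - 273.15

# Filter rows to a rough western-US bounding box (limits are assumptions).
in_west = (
    fire["latitude"].between(31.0, 49.0)
    & fire["longitude"].between(-125.0, -102.0)
)
fire_west = fire[in_west]

# Aggregate fires per day, then join fire and air quality data by date.
daily_fire = (
    fire_west.groupby("date")
    .agg(fire_count=("brightness_c", "size"),
         mean_brightness_c=("brightness_c", "mean"))
    .reset_index()
)
merged = air.merge(daily_fire, on="date", how="left")
print(merged)
```

Joining on date alone follows the plan as listed; a spatial key (for example state or a coordinate grid cell) could be added to the join if date-only matching turns out to be too coarse.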

### Data Quantities
**FIRE**
- Acquisition date
- Latitude and longitude
- Brightness (thermal measurement)
- Fire radiative power
- Time of day (binary day or night)
- Data quality attributes (e.g., Confidence and resolution)

**Air Quality**
- Date
- Latitude and longitude
- Mean pollutant concentration for multiple pollutants
- Daily AQI value
- Data quality attributes

### Data Processing Implementation
**Python, Jupyter Notebook**
- Pandas for data cleaning and manipulation
- SciPy for statistical analyses and scikit-learn for machine learning implementation
- Matplotlib, seaborn, or Altair for flat visualizations for the final report
- PySimpleGUI for an interactive visualization app for the video presentation


## Exploratory Analysis
***
**Data summary**
<br>
A data summary will help us see where the data is lacking and document exactly what was cleaned.
- Missing values
- Duplicates
- Data types
- Size and shape of the data
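
A minimal sketch of such a summary with pandas is shown below. The helper name `summarize` and the three-row sample table are hypothetical; the column names are taken from the air quality sample output.

```python
import pandas as pd


def summarize(df):
    """Collect shape, duplicate count, missing counts, and dtypes for a table."""
    return {
        "shape": df.shape,
        "n_duplicates": int(df.duplicated().sum()),
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy table: one fully duplicated row and one missing site name.
sample = pd.DataFrame({
    "Date": ["01/01/2021", "01/02/2021", "01/02/2021"],
    "DAILY_AQI_VALUE": [69, 56, 56],
    "Site Name": [None, "Site A", "Site A"],
})

report = summarize(sample)
for name, value in report.items():
    print(name, value)
```

Running the same helper before and after cleaning gives a simple record of what changed.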

**Characterizing the data**
<br>
By using scatter matrices and a correlation matrix, we will be able to see which variables are highly correlated. We will then use regression plots to explore those highly correlated variables and identify potential features to use in our final analysis of the data.
- Distribution plots and relationships via a scatter matrix plot
    - Plot variables for the fire and air quality datasets separately and once combined
    - Check the shape of the main data quantities such as brightness and geographical coordinates to identify skewness and outliers
    - Check for linear or non-linear relationships
    - Use visualization to identify patterns to investigate further
- Correlation matrix
    - Quantify feature relationships
- Regression analysis 
    - Create simple and multiple regression models to identify important features
    - Understand complex interactions among variables
- Heatmap with location coordinates for fire and air data
    - Identify problematic areas to generate hypotheses
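
As a rough illustration of the correlation matrix and single-variable regression steps, the sketch below uses pandas and SciPy on synthetic data. The variable names (`fire_count`, `mean_frp`, `aqi`) and the relationship built into the data are invented purely so the code has something to run on; the numbers it prints are not project results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic daily data with a built-in positive relationship (illustration only).
rng = np.random.default_rng(0)
n = 200
fire_count = rng.poisson(5, n)
mean_frp = fire_count * 10 + rng.normal(0, 5, n)
aqi = 40 + 3 * fire_count + rng.normal(0, 4, n)
df = pd.DataFrame({"fire_count": fire_count, "mean_frp": mean_frp, "aqi": aqi})

# Correlation matrix: pairwise Pearson correlation between all columns.
corr = df.corr()
print(corr.round(2))

# Simple linear regression of AQI on fire count, with slope and p-value.
fit = stats.linregress(df["fire_count"], df["aqi"])
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.2f}, p={fit.pvalue:.3g}")
```

The p-value from `linregress` speaks to the question of whether a correlation is statistically significant, while the correlation matrix helps shortlist which pairs are worth a regression plot.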

## Analysis Approach and Methodology
***
**Model Selection**
<br>
We would like to create an application that predicts the location of a wildfire from specific attributes and predicts the air quality in the surrounding areas, using the methods below. Building a predictive app will let us evaluate how strong the correlations are between fires and their locations, as well as between fires and air quality. In other words, it will potentially tell us whether fires are worth using as a predictive factor for air quality.
- Regression with cross-validation
- Classification with cross-validation
    - We will evaluate different methods (e.g., K-nearest neighbors, possibly preceded by principal component analysis for dimensionality reduction)
- Decision tree for location prediction
- GUI visualization
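
To make the idea of regression with cross-validation concrete, here is a small sketch written in plain NumPy so the mechanics are visible; a library implementation could replace it. The function name `kfold_rmse` and the synthetic features are hypothetical. Each fold is held out once, an ordinary least-squares model is fit on the remaining folds, and the held-out error is averaged.

```python
import numpy as np


def kfold_rmse(X, y, k=5, seed=0):
    """Mean out-of-fold RMSE of an ordinary least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Prepend a column of ones so the model learns an intercept.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ coef
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))


# Synthetic features standing in for fire attributes (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 50 + 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 100)

rmse = kfold_rmse(X, y, k=5)
print(f"5-fold RMSE: {rmse:.2f}")
```

Comparing this held-out error across candidate feature sets is one way to judge whether fire attributes add predictive value for air quality.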

**Visualization**
<br>
We would like to create a visual that shows side-by-side geographical plots of fires and the air quality index over time using the techniques below. By looking at the results, we will be able to see if and how the AQI was affected by fires over time.
- Geographical plots 
- Interactive Altair features
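
One possible data preparation step behind the side-by-side plots is to bin both datasets onto the same monthly coordinate grid, so that each panel can be drawn from matching cells. The sketch below covers only that preparation and leaves the plotting itself to the chosen library. The helper name `grid_by_month`, the one-degree cell size, and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd


def grid_by_month(df, value_col=None, cell=1.0):
    """Bin points into lat/lon cells per month; count rows or average a column."""
    out = df.copy()
    out["lat_bin"] = np.floor(out["lat"] / cell) * cell
    out["lon_bin"] = np.floor(out["lon"] / cell) * cell
    out["month"] = out["date"].dt.strftime("%Y-%m")
    keys = ["month", "lat_bin", "lon_bin"]
    if value_col is None:
        return out.groupby(keys).size().reset_index(name="count")
    return out.groupby(keys)[value_col].mean().reset_index(name="mean_value")


# Toy fire detections and air quality readings (illustration only).
fire = pd.DataFrame({
    "lat": [40.2, 40.7, 41.3],
    "lon": [-111.9, -111.2, -111.8],
    "date": pd.to_datetime(["2021-08-01", "2021-08-15", "2021-09-02"]),
})
air = pd.DataFrame({
    "lat": [40.5, 41.8],
    "lon": [-111.5, -111.9],
    "date": pd.to_datetime(["2021-08-03", "2021-09-10"]),
    "aqi": [80, 60],
})

fire_grid = grid_by_month(fire)                  # fire detections per cell
air_grid = grid_by_month(air, value_col="aqi")   # mean AQI per cell
panel = fire_grid.merge(air_grid, on=["month", "lat_bin", "lon_bin"], how="outer")
print(panel)
```

Each row of `panel` then holds the fire count and mean AQI for one cell and month, which is a convenient shape for two linked map layers with a time slider.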

## Project Schedule
***
### Project Proposal Due March 17th
>**March 12th:** Group meeting to discuss project objectives and begin populating the proposal template. Divide up follow-up tasks.
> - [x] Isabelle: polish background and motivation <br>
> - [x] Jenine: polish ethical considerations <br>
> - [x] Hannah: polish data section <br>
> - [x] Everyone: review data processing, exploration, and analysis sections <br>

>**March 16th:** Group meeting to review and finalize proposal document in Jupyter Notebook, set up GitHub repository
> - [x] Everyone: review and make final edits <br>

### 1st Project Milestone Submission Due April 4th
>**March 21st:** Group meeting to finalize dataset. Establish specific data cleaning and exploratory analysis tasks. Divide amongst team.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**March 28th:** Group meeting to review data cleaning and exploratory work. Identify further exploratory tasks and divide amongst team.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 3rd:** Group meeting to review and discuss completed work. Establish a more detailed analysis and visualization plan. Assign tasks. Compile document for milestone submission. 
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

### Final Submission Due April 21st
>**April 11th:** Group meeting to review initial analysis steps and visualization. Establish action items and divide tasks amongst team.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 18th:** Group meeting to review final analysis and documentation. Record video presentation. Identify any final action items.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 20th:** Group meeting to review final submission and make any final edits. Compile finalized documents.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

### References

Anenberg, S. C., Henze, D. K., Tinney, V., Kinney, P. L., Raich, W., Fann, N., Malley, C. S., Roman, H., Lamsal, L., Duncan, B., Martin, R. V., van Donkelaar, A., Brauer, M., Doherty, R., Jonson, J. E., Davila, Y., Sudo, K., & Kuylenstierna, J. C. I. (2018). Estimates of the global burden of ambient PM2.5, ozone, and NO2 on asthma incidence and emergency room visits. Environmental Health Perspectives, 126(10), 107004. https://doi.org/10.1289/ehp3766
<br>
Li, Y., Tong, D., Ma, S., Zhang, X., Kondragunta, S., Li, F., & Saylor, R. (2021). Dominance of wildfires impact on air quality exceedances during the 2020 record‐breaking wildfire season in the United States. Geophysical Research Letters, 48(21). https://doi.org/10.1029/2021gl094908