***
# Project Plan
***
*COMP5360 Introduction to Data Science, Spring 2023*

#### Project Title: Incendio

>Hannah Van Hollebeke	vanhollebeke.hannah@gmail.com          u0697848 <br>
>Jenine Rogel           	u0468294@umail.utah.edu	               u0468294 <br>
>Isabelle Cook	        u1316961@utah.edu	                   u1316961 <br>

**GitHub Repository:**

https://github.com/Bonampak1/Incendio

## Data Processing
***
Our datasets may need substantial cleanup, which will be implemented in Python via Jupyter Notebook.

### Data Cleaning
- Merge multiple datasets 
    - Join fire and air quality data by date
    - Combine air quality data for various pollutants
- Check missing or outlier values
- Remove attributes that don't add value to our project objectives
- Assign numerical values for categorical data 
    - Convert day and night assignments into binary integer values
- Unit conversions 
    - Convert fire brightness from Kelvin to Celsius
- Convert and parse date columns
- Filter rows for geographical location specific to western US
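
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's final code: the column names (`acq_date`, `brightness`, `daynight`, `aqi`), the `"D"`/`"N"` day/night encoding, and the bounding box used for the western US are all assumptions and would need to be matched to the actual datasets.

```python
import pandas as pd

KELVIN_OFFSET = 273.15

# Rough bounding box for the western US (an assumption for illustration only).
WEST_BOUNDS = {"lat_min": 31.0, "lat_max": 49.0, "lon_min": -125.0, "lon_max": -102.0}


def clean_fire(fire: pd.DataFrame) -> pd.DataFrame:
    """Parse dates, convert units, encode day/night, and filter by location."""
    out = fire.copy()
    out["date"] = pd.to_datetime(out["acq_date"])             # parse date column
    out["brightness_c"] = out["brightness"] - KELVIN_OFFSET   # Kelvin -> Celsius
    out["is_day"] = (out["daynight"] == "D").astype(int)      # day = 1, night = 0
    in_west = (
        out["latitude"].between(WEST_BOUNDS["lat_min"], WEST_BOUNDS["lat_max"])
        & out["longitude"].between(WEST_BOUNDS["lon_min"], WEST_BOUNDS["lon_max"])
    )
    return out.loc[in_west].drop(columns=["acq_date", "daynight"]).reset_index(drop=True)


def merge_by_date(fire: pd.DataFrame, air: pd.DataFrame) -> pd.DataFrame:
    """Aggregate fires per day, then left-join onto daily air quality rows."""
    daily_fire = fire.groupby("date", as_index=False).agg(
        fire_count=("brightness_c", "size"),
        mean_brightness_c=("brightness_c", "mean"),
    )
    air = air.assign(date=pd.to_datetime(air["date"]))
    return air.merge(daily_fire, on="date", how="left")


# Tiny made-up example data (not real measurements).
fire_raw = pd.DataFrame({
    "acq_date": ["2020-09-01", "2020-09-01", "2020-09-02"],
    "latitude": [40.5, 35.0, 41.0],
    "longitude": [-111.9, -80.0, -112.0],  # the second row falls outside the box
    "brightness": [330.15, 320.0, 340.15],
    "daynight": ["D", "N", "N"],
})
air_raw = pd.DataFrame({
    "date": ["2020-09-01", "2020-09-02", "2020-09-03"],
    "aqi": [80, 120, 60],
})

fire_clean = clean_fire(fire_raw)
merged = merge_by_date(fire_clean, air_raw)
print(merged)
```

A left join keeps every air quality day, so days without a detected fire show up with missing fire values instead of being dropped. Whether those should be filled with zero is a decision for the missing-value check.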

### Data Quantities
**FIRE**
- Acquisition date
- Latitude and longitude
- Brightness (thermal measurement)
- Fire radiative power
- Time of day (binary day or night)
- Data quality attributes (e.g., Confidence and resolution)

**Air Quality**
- Date
- Latitude and longitude
- Mean pollutant concentration for multiple pollutants
- Daily AQI value
- Data quality attributes

### Data Processing Implementation
**Python, Jupyter Notebook**
- Pandas for data cleaning and manipulation
- SciPy for statistical analyses and scikit-learn for machine learning implementation
- Matplotlib, seaborn, or Altair for flat visualizations for the final report
- PySimpleGUI for an interactive visualization app for the video presentation


## Exploratory Analysis
***
**Data summary**
<br>
A data summary will help us see where the data is lacking and what specifically needs to be cleaned.
- Missing values
- Duplicates
- Data types
- Size and shape of the data
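
One way to collect the four summary items above in a single pass is sketched below. The helper name `summarize` and the sample columns are hypothetical, chosen only for illustration.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return shape, data types, missing-value counts, and duplicate-row count."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }


# Made-up sample with one fully duplicated row and one missing AQI value.
sample = pd.DataFrame({
    "date": ["2020-09-01", "2020-09-01", "2020-09-02", "2020-09-02"],
    "aqi": [80.0, 80.0, None, 120.0],
})

summary = summarize(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Running the same helper before and after cleaning would give a simple record of what the cleaning changed.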

**Characterizing the data**
<br>
Using scatter matrices and a correlation matrix, we will be able to see which variables are highly correlated. We will then use regression plots to explore those highly correlated variables and identify potential features to use in our final analysis.
- Distribution plots and relationships via a scatter matrix plot
    - Plot variables for the fire and air quality datasets separately and once combined
    - Check the shape of the main data quantities such as brightness and geographical coordinates to identify skewness and outliers
    - Check for linear or non-linear relationships
    - Use visualization to identify patterns to investigate further
- Correlation matrix
    - Quantify feature relationships
- Regression analysis 
    - Create simple and multiple regression models to identify important features
    - Understand complex interactions among variables
- Heatmap with location coordinates for fire and air data
    - Identify problematic areas to generate hypotheses
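
As a rough sketch of the correlation and regression steps above, the snippet below builds a Pearson correlation matrix with pandas and fits a simple linear regression with `scipy.stats.linregress`. The data is synthetic and generated only for illustration; the column names (`frp`, `brightness_c`, `aqi`) and the 0.7 correlation threshold are assumptions, not project results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumed relationships, not real measurements).
rng = np.random.default_rng(0)
n = 200
frp = rng.gamma(shape=2.0, scale=10.0, size=n)
df = pd.DataFrame({
    "frp": frp,
    "brightness_c": 40 + 0.8 * frp + rng.normal(0, 2.0, size=n),
    "aqi": 50 + 1.5 * frp + rng.normal(0, 10.0, size=n),
})

# Correlation matrix: quantify pairwise linear relationships.
corr = df.corr()

# List variable pairs whose absolute correlation meets the chosen threshold.
threshold = 0.7
cols = list(corr.columns)
strong_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

# Simple (single-predictor) regression for one candidate pair.
fit = stats.linregress(df["frp"], df["aqi"])

print(corr.round(2))
print(strong_pairs)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r^2={fit.rvalue ** 2:.2f}")
```

With Matplotlib installed, `pd.plotting.scatter_matrix(df)` would draw the corresponding scatter matrix for the same DataFrame.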

## Analysis Approach and Methodology
***
**Model Selection**
<br>
We would like to create an application that predicts the location of a wildfire from specific attributes and predicts the air quality in the surrounding areas, using the methods below. Building a predictive app will let us evaluate how strong the relationships are between fire attributes and fire locations, as well as between fires and air quality. In other words, it may tell us whether fires are worth using as a predictor of air quality.
- Regression with cross-validation
- Classification with cross-validation
    - We will evaluate different methods (e.g., K-nearest neighbors, optionally after dimensionality reduction with principal component analysis)
- Decision tree for location prediction
- GUI visualization
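
The cross-validated regression and classification steps listed above might look roughly like this with scikit-learn. Everything here is an assumption for illustration: scikit-learn itself is not named in the plan, the data is synthetic, and the classification target (AQI above 100) is a hypothetical choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and targets (not real measurements).
rng = np.random.default_rng(42)
n = 300
frp = rng.gamma(2.0, 10.0, size=n)
brightness_c = 40 + 0.8 * frp + rng.normal(0, 2.0, size=n)
X = np.column_stack([frp, brightness_c])
aqi = 50 + 1.5 * frp + rng.normal(0, 10.0, size=n)
above_100 = (aqi > 100).astype(int)  # hypothetical binary target

# Regression with 5-fold cross-validation, scored by R^2.
reg_scores = cross_val_score(LinearRegression(), X, aqi, cv=5, scoring="r2")

# Classification with 5-fold cross-validation, scored by accuracy.
# Features are standardized inside the pipeline so each fold is scaled
# using its own training split only.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf_scores = cross_val_score(knn, X, above_100, cv=5, scoring="accuracy")

print(f"Regression mean R^2: {reg_scores.mean():.2f}")
print(f"KNN mean accuracy:   {clf_scores.mean():.2f}")
```

Comparing the cross-validated scores against a trivial baseline (for example, always predicting the most common class) would show whether fire attributes add real predictive value.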

**Visualization**
<br>
We would like to create a visual that shows side-by-side geographical plots of fires and the air quality index (AQI) over time, using the techniques below. The results will show whether and how the AQI was affected by fires over time.
- Geographical plots 
- Interactive Altair features

## Project Schedule
***
### Project Proposal Due March 17th
>**March 12th:** Group meeting to discuss project objectives and begin populating the proposal template. Divide up follow up tasks.
> - [x] Isabelle: polish background and motivation <br>
> - [x] Jenine: polish ethical considerations <br>
> - [x] Hannah: polish data section <br>
> - [x] Everyone: review data processing, exploration, and analysis sections <br>

>**March 16th:** Group meeting to review and finalize proposal document in Jupyter Notebook, set up GitHub repository
> - [x] Everyone: review and make final edits <br>

### 1st Project Milestone Submission Due April 4th
>**March 21st:** Group meeting to finalize dataset. Establish specific data cleaning and exploratory analysis tasks. Divide amongst team.
> - [ ] Isabelle: Download all data files and upload to GitHub by Wednesday 3/21 <br>
> - [ ] Jenine: Write code to load all data files in Jupyter Notebook and get the notebook on GitHub by Friday 3/24 <br>
> - [ ] Hannah: Merge data and parse date columns, filter for states of interest by 3/26 <br>
> - [ ] Isabelle: Remove columns not needed, assign numerical values to categorical values by 3/27 <br>
> - [ ] Jenine: Check missing values, duplicate records, outliers by 3/28 <br>
> - [ ] Everyone: General exploratory tasks 
<br>

>**March 28th:** Group meeting to review data cleaning and exploratory work. Identify further exploratory tasks and divide amongst team.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 3rd:** Group meeting to review and discuss completed work. Establish a more detailed analysis and visualization plan. Assign tasks. Compile document for milestone submission. 
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

### Final Submission Due April 21st
>**April 11th:** Group meeting to review initial analysis steps and visualization. Establish action items and divide tasks amongst team.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 18th:** Group meeting to review final analysis and documentation. Record video presentation. Identify any final action items.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>

>**April 20th:** Group meeting to review final submission and make any final edits. Compile finalized documents.
> - [ ] Isabelle: <br>
> - [ ] Jenine: <br>
> - [ ] Hannah: <br>
> - [ ] Everyone: <br>
