# **Models for Time Series Analysis to Forecast Wildfires in the USA**
## **Time Series Report**
### **Data Period**: 1992-2015
### **Prepared By**: **Group B**
#### Responsible: F.Salinas
## **Abstract:**

Wildfire frequency in the USA is strongly seasonal: wildfires are far more frequent in summer than in winter. Temperature, humidity, wind velocity and O₂ concentration are key variables that could explain the variance in wildfire frequency between years. A wide variety of models has already been applied to different wildfire data sets. We therefore chose three types of time series models with very different setups and levels of complexity: ARIMA, linear regression, and a Long Short-Term Memory (LSTM) neural network. Because the LSTM model is much more complex than the ARIMA and linear regression models, we expected it to have the lowest error index values. Our results support the LSTM model as the one with the best ability to generalize. Moreover, within each model type, the monthly data always gave lower error index values than the daily data. The LSTM model with monthly data as input was therefore used to forecast the year 2015, using the data from 1992 to 2014 for training. The resulting forecast for 2015 was arguably good.


## **Aim:**
*Use the daily and monthly registered fire events in the USA from 1992 to 2015 to fit ARIMA, linear regression and neural network models. The model with the lowest cross-validation error is then selected for a forecasting test.*

 
## References 

### Course Material
**Brockhaus, S. (2024).** *Angewandte Zeitreihenanalyse*. Hochschule München, Fakultät 07. [https://moodle.hm.edu](https://moodle.hm.edu) (restricted access)

**Hyndman, R.J., & Athanasopoulos, G. (2021).** *Forecasting: Principles and Practice* (3rd ed.). [https://otexts.com/fpp3/](https://otexts.com/fpp3/)



# Table of Contents

| **Section**                           | **Subsections**                                                                                                   |
|---------------------------------------|------------------------------------------------------------------------------------------------------------------|
| **1. Data Description**               | [1.1 Daily Data](#1-1-daily-data) <br> [1.1.1 Daily Fire Frequency](#1-1-1-daily-fire-frequency) <br> [1.1.2 Distribution of Fire Counts (Daily, not transformed)](#1-1-2-distribution-daily-not-transformed) <br> [1.1.3 Distribution of Fire Counts (Daily, Log Transformed)](#1-1-3-distribution-daily-log-transformed) <br> [1.1.4 Time Series Decomposition of Log-Transformed Wildfire Counts](#1-1-4-time-series-decomposition-daily) <br> [1.2 Monthly Data](#1-2-monthly-data) <br> [1.2.1 Monthly Fire Frequencies by Year](#1-2-1-monthly-fire-frequencies) <br> [1.2.2 Time Series Decomposition of Log-Transformed Monthly Fire Counts](#1-2-2-time-series-decomposition-monthly) <br> [1.3 Atmospheric Variables Time Series Description](#1-3-atmospheric-variables) |
| **2. ARIMA Model**                    | [2.1 Daily](#2-1-daily) <br> [2.1.1 Automatic ARIMA Model Selection](#2-1-1-arima-automatic-selection) <br> [2.1.2 Stationarity Test](#2-1-2-stationarity-test) <br> [2.1.3 Cross-Validation Error Values](#2-1-3-cross-validation-error-values) <br> [2.2 Monthly](#2-2-monthly) <br> [2.2.1 Automatic ARIMA Model Selection](#2-2-1-arima-automatic-selection) <br> [2.2.2 Stationarity Test](#2-2-2-stationarity-test) <br> [2.2.3 Cross-Validation Error Values](#2-2-3-cross-validation-error-values) |
| **3. Regression Model**               | [3.1 Daily](#3-1-daily) <br> [3.1.1 Model Description](#3-1-1-model-description) <br> [3.1.2 Cross-Validation Error Values](#3-1-2-cross-validation-error-values) <br> [3.2 Monthly](#3-2-monthly) <br> [3.2.1 Model Description](#3-2-1-model-description) <br> [3.2.2 Cross-Validation Error Values](#3-2-2-cross-validation-error-values) |
| **4. LSTM Model**                     | [4.1 Daily](#4-1-daily) <br> [4.1.1 Model Description](#4-1-1-model-description) <br> [4.1.2 Cross-Validation Error Values](#4-1-2-cross-validation-error-values) <br> [4.2 Monthly](#4-2-monthly) <br> [4.2.1 Model Description](#4-2-1-model-description) <br> [4.2.2 Cross-Validation Error Values](#4-2-2-cross-validation-error-values) |
| **5. Model Comparison and Selection** | [5.1 Comparison Summary](#5-1-comparison-summary) <br> [5.2 Observed vs Predicted](#5-2-observed-vs-predicted) |
| **6. Forecast for 2015**              | [6.1 Forecasting Methodology](#6-1-forecasting-methodology) <br> [6.2 Observed vs Forecasted](#6-2-observed-vs-forecasted) |
| **7. Conclusions**                    | [7.1 Key Insights](#7-1-key-insights) <br> [7.2 Recommendations](#7-2-recommendations) |
| **8. Data Generation**                | [8.1 Daily Data Methodology](#8-1-daily-data-methodology) <br> [8.2 Atmospheric Variables](#8-2-atmospheric-variables) <br> [8.3 Output Files](#8-3-output-files) <br> [8.4 Data Integration](#8-4-data-integration) <br> [8.5 Daily Averages](#8-5-daily-averages) <br> [8.6 Monthly Data](#8-6-monthly-data) |


# **1. Data Description**
<span id="data-description"></span>
The data was obtained from [Kaggle: 1.88 Million US Wildfires](https://www.kaggle.com/datasets/rtatman/188-million-us-wildfires/data). This dataset contains rows representing the discovery date (YY.MM.DD) of wildfires, coordinate variables, among others. The most relevant variables in this wildfire time series are the date, latitude, and longitude. The discovery date is used to group rows by specific dates regardless of the coordinates, creating the variable **Daily Fire Counts**. Latitude and longitude combined with the discovery date are used to select the nearest value of environmental variables from other databases. These include temperature, wind speed, precipitation, and for monthly data, atmospheric gases CO₂ and O₂ (sources in Section 8.2). This allows the consolidation of relevant variables into a single csv file.

Two main dataframes are created:
1. Daily data (1992–2015), containing the count of fires and estimated daily means of environmental variables.
2. Monthly data, derived by grouping daily data by month and year, including monthly averages of environmental variables. Additionally, monthly gas data was joined.
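
The grouping described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline; the function names and the toy dates are hypothetical.

```python
from collections import Counter
from datetime import date


def daily_fire_counts(discovery_dates):
    """Count fire records per discovery date, ignoring coordinates."""
    return Counter(discovery_dates)


def monthly_fire_counts(daily_counts):
    """Sum the daily counts into (year, month) totals."""
    monthly = Counter()
    for day, count in daily_counts.items():
        monthly[(day.year, day.month)] += count
    return monthly


# Toy input: one entry per registered fire (hypothetical dates).
records = [date(1992, 1, 1), date(1992, 1, 1), date(1992, 1, 15), date(1992, 2, 3)]
daily = daily_fire_counts(records)
monthly = monthly_fire_counts(daily)
```

Days without any record do not appear in the counter, so a complete daily series would still need the missing dates filled in.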

## **1.1 Daily Data**
<span id="daily-data"></span>

### **1.1.1 Daily Fire Frequency**
<span id="daily-fire-frequency"></span>
The plot shows that the fire count in the USA exhibits notable seasonality. Peaks in wildfire counts occur during warmer months (March to September, marked in yellow). A cyclic component might exist but is unclear in the selected period.

![Daily Fire Frequency](Abbildungen/daily_fire_frequency.png)  
*Figure 1: Daily frequency of wildfires in the USA (1992–2015). Peaks correspond to warmer months.*

### **1.1.2 Distribution of Fire Counts (Daily, not transformed)**
<span id="distribution-daily-not-transformed"></span>
The distribution of wildfire daily totals is right-skewed due to many days with zero wildfires. To address this, the wildfire counts were transformed.

![Daily Fire Counts Distribution](Abbildungen/not_transformed_fire_counts_distribution.png)  
*Figure 2: Distribution of daily wildfire counts before transformation.*

### **1.1.3 Distribution of Fire Counts (Daily, Log Transformed)**
<span id="distribution-daily-log-transformed"></span>
The wildfire counts were transformed using the natural logarithm. This transformation improved the distribution by addressing the skewness and abrupt left tail. All models in this project use log-transformed wildfire counts as the dependent variable.
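
A minimal sketch of the transformation and its inverse is shown below. The `offset` parameter is an assumption added for illustration: the natural logarithm is undefined for a count of zero, so a series containing zero-count days needs some handling such as a small offset, and the source does not state which approach was used.

```python
import math


def log_counts(counts, offset=0.0):
    """Return ln(count + offset) for each count.

    offset=0.0 is the plain natural logarithm described in the text.
    A positive offset (e.g. 1.0) is an assumption for series with zero counts.
    """
    return [math.log(c + offset) for c in counts]


daily_counts = [3, 40, 250, 1200]  # hypothetical daily totals
log_values = log_counts(daily_counts)

# Predictions made on the log scale are mapped back with the exponential.
back_transformed = [math.exp(v) for v in log_values]
```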

![Log Daily Fire Counts Distribution](Abbildungen/log_fire_counts_distribution.png)  
*Figure 3: Distribution of log-transformed daily wildfire counts.*

### **1.1.4 Daily Time Series Decomposition of Log-Transformed Wildfire Counts**
<span id="time-series-decomposition-daily"></span>
The decomposition of daily wildfire data highlights seasonal components. Cyclic components cannot be clearly identified from the data.

![Log Daily Fire Counts Time Series Decomposition](Abbildungen/daily_fire_count_stl_log_decomposition.png)  
*Figure 4: Decomposition of daily log-transformed wildfire counts into trend, seasonal, and residual components.*

## **1.2 Monthly Data**
<span id="monthly-data"></span>
This dataset is derived from the daily dataset joined with monthly gas data (see the first paragraph of section 1). The strong linear trends in gas variables [Figure 7](#figure-7) E and F were removed using a detrending transformation to isolate seasonal variance.
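
One common way to remove a linear trend is to fit a straight line by least squares and keep the residuals. The sketch below assumes this approach and equally spaced monthly observations; the source does not specify the exact detrending method, and the function name is hypothetical.

```python
def detrend_linear(values):
    """Subtract the least-squares straight line from an equally spaced series."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in enumerate(values)]


# Toy series: linear trend plus an alternating "seasonal" pattern.
gas_values = [2.0 * t + s for t, s in zip(range(8), [1.0, -1.0] * 4)]
detrended = detrend_linear(gas_values)
```

After detrending, the values fluctuate around zero, which leaves the seasonal variation as the dominant signal.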

### **1.2.1 Monthly Fire Frequencies by Year**
<span id="monthly-fire-frequencies"></span>
This plot shows the number of wildfires by month for each year, revealing a bimodal seasonal pattern. The decline between May and June creates the bimodal shape in the data.

![Monthly Fire Frequency by Year](Abbildungen/monthly_fire_frequency_by_year.png)  
*Figure 5: Monthly wildfire frequencies by year, showing bimodal seasonal patterns.*

### **1.2.2 Monthly Time Series Decomposition of Log-Transformed Wildfire Counts**
<span id="time-series-decomposition-monthly"></span>
Decomposition of the monthly wildfire data (log-transformed) enhances the visibility of trends and seasonal patterns.

![Log Monthly Fire Counts Time Series Decomposition](Abbildungen/monthly_fire_count_stl_log_decomposition.png)  
*Figure 6: Decomposition of log-transformed monthly wildfire counts into trend, seasonal, and residual components.*

## **1.3 Atmospheric Variables Time Series Description**
<span id="atmospheric-variables"></span>
The environmental variables included in the datasets exhibit strong seasonal components ([Figure 7](#figure-7)). Because wildfire frequency is seasonal, we expect the variables that are relevant to wildfire frequency to show a seasonal pattern as well. Wildfire frequency is greater in warmer, drier months: the peaks of fire frequency ([Figure 7, red line](#figure-7)) rise and fall synchronously with the minimum and maximum temperatures (Figure 7 A and B, blue lines). Variables such as wind speed, precipitation and the gases appear less synchronized with fire frequency (Figure 7 C, D, E and F). The atmospheric gases (CO₂ and O₂) show strong trends (Figure 7 E and F); to address this, the gas values were detrended.

![Atmospheric Variables Time Series](Abbildungen/Environmental_variables_time_series.png)  
*Figure 7: Time series of atmospheric variables contrasted with wildfire frequency, including temperature, wind speed, and precipitation, with trends removed for gases (CO₂, O₂).* The monthly data is presented because the seasonal pattern is easier to see. A) Mean monthly minimum temperature in °C. B) Mean monthly maximum temperature in °C. C) Mean monthly wind speed in m/s. D) Mean monthly precipitation in mm. E) One measurement per month of O₂ concentration in per-meg units. F) One measurement per month of CO₂ concentration in per-meg units.
<span id="figure-7"></span>


# **2. ARIMA Model**
<span id="arima-model"></span>
ARIMA models combine autoregressive (AR) models, moving average (MA) models and integration (differencing). They are written as ARIMA(p,d,q)(P,D,Q)[m], where "m" is the length of the seasonal cycle.

The autoregressive component is a multiple regression on lagged values of the wildfire frequency counts. The number of lags included is given by "p" and "P" for the non-seasonal and seasonal autoregressive parts, respectively.

The moving average component is a multiple regression that uses past errors as predictors of the wildfire frequency counts. The number of lagged errors included is given by "q" and "Q" for the non-seasonal and seasonal moving average parts, respectively.

Integration refers to differencing the data, a transformation used to make series with complex time patterns stationary. The order of differencing is given by "d" and "D" for the non-seasonal and seasonal parts, respectively.
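
As an illustration of what the orders "d" and "D" mean, the sketch below applies non-seasonal and seasonal differencing to toy series. The helper `difference` and the numbers are hypothetical and only serve to show the operation.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


y = [10, 12, 15, 19, 24]
d1 = difference(y)    # d = 1: first differences
d2 = difference(d1)   # d = 2: differences of the first differences

# Toy monthly series: a repeating 12-month pattern that shifts up by 0.5 each year.
monthly = [t % 12 + 0.5 * (t // 12) for t in range(36)]
seasonal = difference(monthly, lag=12)  # D = 1 with m = 12
```

Seasonal differencing removes the repeating 12-month pattern and leaves only the year-to-year change, here a constant 0.5.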


Stationarity is required for reliable forecasts. When the residuals (the differences between the observed values and the values fitted by the model) are independent of time, the fitted model has captured the systematic structure of the data, including its seasonal variance. The error of a forecast made with such a model is then explained by chance alone, and the predictors in the model are sufficient to produce predictions with low error.

The error between predicted and observed values was measured with three indexes: mean squared error (MSE), mean absolute error (MAE) and root mean squared error (RMSE). The MSE is the average of the squared differences between predicted and observed values. The MAE is the average of the absolute differences between predicted and observed values. The RMSE is the square root of the MSE and has the advantage of expressing the error on the same scale as the input (log(wildfire counts)). Because of this property we focus on the RMSE and report the other indexes as a complement ([see figure 10](#figure-10)). Five-fold cross-validation was used to obtain the value of each error index. Because each fold yields one value per index, the mean across folds was used as the expected value.
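
The three indexes and the averaging across folds can be sketched as follows. The function names and the toy fold values are hypothetical; the project's own notebooks may compute these with a library instead.

```python
import math


def mse(observed, predicted):
    """Mean of the squared differences."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def mae(observed, predicted):
    """Mean of the absolute differences."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Square root of the MSE, on the same scale as the input."""
    return math.sqrt(mse(observed, predicted))


def mean_over_folds(folds, metric):
    """Average a metric over (observed, predicted) pairs, one pair per fold."""
    scores = [metric(obs, pred) for obs, pred in folds]
    return sum(scores) / len(scores)


# Toy example: two folds of log fire counts (hypothetical values).
folds = [([8.0, 8.5, 9.0], [8.2, 8.4, 8.7]), ([7.5, 8.1], [7.9, 8.0])]
cv_rmse = mean_over_folds(folds, rmse)
```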

Stationarity can be assessed visually using different representations of the residuals produced after fitting the model ([see figure 8 as an example](#figure-8)). Plot A shows the residuals as a time series, which ideally should resemble white noise (a purely random process). Plot B is the histogram of the residuals, which should resemble a normal distribution centered on 0. Plot C shows the autocorrelation between the residuals and their own past values (lags). Ideally the residuals are not autocorrelated: all correlation values, whether positive or negative, should fall inside the light blue area of the plot. None of the models tested here reached this ideal. For example, [figure 8C](#figure-8) shows high autocorrelation at the first lag (the value of the day before, correlation = 1) and zero autocorrelation at the other 50 lags. This is not ideal, because it means that the model's predictions are based almost entirely on the number of wildfires of the previous day. This type of model fits the training data closely (it over-fits), but when tested with cross-validation the error index values showed low forecast potential ([see figure 10](#figure-10)). Plot D shows the partial autocorrelation of the residuals. It is similar to the autocorrelation plot, but the correlation at each lag is computed after removing the effect of the shorter lags. This is useful to check whether an autoregressive term could improve the stationarity of the residuals of the current model. In [figure 8D](#figure-8) the order of the autoregressive model was 3. The orders of the ARIMA models for daily and monthly data were obtained with an auto-ARIMA function, which takes the log fire counts as input and tests different combinations of the order parameters (p,d,q)(P,D,Q)[m].
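
To make the autocorrelation plot concrete, the sketch below computes the sample autocorrelation of a residual series together with an approximate 95% band. The constant band ±1.96/√n is a common approximation and an assumption here; plotting libraries may draw a slightly different band. The function names are hypothetical.

```python
import math


def autocorrelation(residuals, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def white_noise_band(n, z=1.96):
    """Approximate 95% band for the autocorrelation of white noise."""
    return z / math.sqrt(n)


# Toy residuals that alternate in sign: strongly autocorrelated by construction.
residuals = [1.0, -1.0] * 50
acf = autocorrelation(residuals, max_lag=5)
band = white_noise_band(len(residuals))
outside = [k for k in range(1, len(acf)) if abs(acf[k]) > band]
```

Note that the value at lag 0 is always exactly 1, because a series is perfectly correlated with itself; only lags 1 and higher carry information about remaining structure.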

## **2.1 Daily**
<span id="arima-daily"></span>


### **2.1.1 Automatic ARIMA model selection with `auto_arima` in Python**
<span id="arima-automatic-selection"></span>
The selected ARIMA model by lowest AIC is: ARIMA(3,1,5)(0,0,0)[0].

### Equation:

Δyₜ = φ₁Δyₜ₋₁ + φ₂Δyₜ₋₂ + φ₃Δyₜ₋₃ − θ₁ϵₜ₋₁ − θ₂ϵₜ₋₂ − θ₃ϵₜ₋₃ − θ₄ϵₜ₋₄ − θ₅ϵₜ₋₅ + ϵₜ

### Legend:
- **Δyₜ**: The differenced value of \(y\) at time \(t\)
- **φ₁, φ₂, φ₃**: Autoregressive (AR) coefficients
- **Δyₜ₋₁, Δyₜ₋₂, Δyₜ₋₃**: Differenced values of \(y\) at lag 1, 2, and 3
- **θ₁, θ₂, θ₃, θ₄, θ₅**: Moving Average (MA) coefficients
- **ϵₜ₋₁, ϵₜ₋₂, ϵₜ₋₃, ϵₜ₋₄, ϵₜ₋₅**: Residual errors (white noise) at lag 1, 2, 3, 4, and 5
- **ϵₜ**: Residual error (white noise) at time \(t\)
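
The same model can be written compactly with the backshift operator \(B\), defined by \(B\,y_t = y_{t-1}\). This is a standard reformulation of the equation above, added here as a reading aid, and it uses the same sign convention for the MA terms:

$$
\left(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3\right)(1 - B)\,y_t
= \left(1 - \theta_1 B - \theta_2 B^2 - \theta_3 B^3 - \theta_4 B^4 - \theta_5 B^5\right)\epsilon_t,
\qquad \Delta y_t = (1 - B)\,y_t = y_t - y_{t-1}
$$

A prediction on the original (log) scale is recovered from the differenced prediction as \(y_t = y_{t-1} + \Delta y_t\).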


Only the log-transformed wildfire frequency was used here; no environmental variable enters this model as an explanatory variable. The model is therefore based entirely on the autocorrelation of the series with itself. Because the residual diagnostics were explained in section 2, they are described only briefly here. The residual analysis does not support stationarity. The residual time series looks like white noise with a mean close to 0 and a regular variation pattern (A). The residuals appear normally distributed (B) but show high autocorrelation (value of 1) at the first lag ([figure 8 C and D](#figure-8)). This autocorrelation is strong enough to consider the model unsuitable for forecasting, because it estimates the log wildfire count largely from the log wildfire count of the day before. This can also be seen in the summary of the ARIMA model parameters and tests ([figure 9](#figure-9)), where the AR(1) and MA(1) coefficients have the highest values. The autocorrelation was also tested with the Ljung-Box test, whose null hypothesis states that the residuals are not autocorrelated. The reported p-value is 0.07 ([figure 9](#figure-9)). Strictly, a p-value above 0.05 does not reject the null hypothesis at the 5% level, so this test gives only weak evidence (significant at the 10% level) for the autocorrelation seen in the plots. We also tested for stationarity with the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) and ADF (Augmented Dickey-Fuller) tests, and both indicate stationarity of the same data (please see Jupyter Notebook 8 for details). A possible reason is that these tests look for unit roots (stochastic trends) as the sign of non-stationarity, which could make them less sensitive to autocorrelation when the residuals otherwise look stationary (normally distributed with mean 0).
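
The decision rule of the Ljung-Box test can be illustrated with a small sketch for a single lag. It assumes the lag-1 version of the test, for which the Q statistic follows a chi-square distribution with one degree of freedom under the null hypothesis; the function name and the toy series are hypothetical.

```python
import math


def ljung_box_lag1(residuals):
    """Return (Q, p_value) of the Ljung-Box test using lag 1 only."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    r1 = sum(centered[t] * centered[t - 1] for t in range(1, n)) / denom
    q = n * (n + 2) * r1 ** 2 / (n - 1)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(q / 2))
    return q, p_value


# Strong lag-1 dependence: large Q, tiny p-value, null hypothesis rejected.
q_dep, p_dep = ljung_box_lag1([1.0, -1.0] * 50)

# Almost no lag-1 dependence: small Q, large p-value, null hypothesis kept.
q_ind, p_ind = ljung_box_lag1([1.0, 1.0, -1.0, -1.0] * 25)
```

A small p-value (below 0.05) is evidence of autocorrelation, while a large p-value means the test found no significant autocorrelation at that lag.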


### **2.1.2 Stationarity Test for ARIMA(3,1,5)(0,0,0)[0]**
<span id="figure-8"></span>  
![ARIMA (3,1,5) Residuals](Abbildungen/daily_arima_residual_analysis.png)  
**Figure 8:** Residual analysis for the ARIMA(3,1,5)(0,0,0)[0] model. Plot A shows the residuals over time, Plot B shows the histogram of residuals, Plot C displays the autocorrelation plot, and Plot D shows the partial autocorrelation plot.  

<span id="figure-9"></span>  
![ARIMA (3,1,5) Coefficients](Abbildungen/ARIMA_daily_Summary.png)  
**Figure 9:** Summary of coefficients and statistical tests for the ARIMA(3,1,5)(0,0,0)[0] model, including estimated AR and MA parameters and residual diagnostics.

### **2.1.3 Cross-Validation Error Values**
<span id="figure-10"></span>  
![Daily ARIMA Prediction Error Indexes](Abbildungen/arima_daily_averaged_scores_table.png)  
**Figure 10:** Cross-validation error indices (MSE, MAE, and RMSE) for the ARIMA(3,1,5)(0,0,0)[0] model, averaged across 5 folds.



## **2.2 Monthly**
### **2.2.1 Automatic ARIMA model selection with `auto_arima` in Python**
The ARIMA(3,0,2)(0,0,0)[0] model with intercept was selected by the `auto_arima` function for its low AIC value.

### Equation:

yₜ = μ + φ₁yₜ₋₁ + φ₂yₜ₋₂ + φ₃yₜ₋₃ − θ₁ϵₜ₋₁ − θ₂ϵₜ₋₂ + ϵₜ

### Legend:
- **yₜ**: The value of \(y\) at time \(t\) (no differencing is applied, because d = 0)
- **μ**: Intercept (constant) of the model
- **φ₁, φ₂, φ₃**: Autoregressive (AR) coefficients
- **yₜ₋₁, yₜ₋₂, yₜ₋₃**: Values of \(y\) at lag 1, 2, and 3
- **θ₁, θ₂**: Moving Average (MA) coefficients
- **ϵₜ₋₁, ϵₜ₋₂**: Residual errors (white noise) at lag 1 and 2
- **ϵₜ**: Residual error (white noise) at time \(t\)



The summary of the model parameters and autocorrelation tests ([figure 10](#figure-10)) indicates a more evenly distributed weight across the AR(1-3) and MA(1-2) lags. Moreover, the Ljung-Box value in the summary is far below 0.05, which we read as no autocorrelation between the residual lags. This reading holds if the value is the Q statistic; a p-value that small would instead reject the null hypothesis of no autocorrelation. Taken together, these results indicate stationarity.

<span id="figure-10"></span>  
![ARIMA (3,0,2) Residuals](Abbildungen/ARIMA_monthly_Summary.png)  
**Figure 10:** Summary of the ARIMA(3,0,2)(0,0,0)[0] model with intercept, including parameters, residual diagnostics, and autocorrelation tests.


### **2.2.2 Stationarity Test for ARIMA(3,0,2)(0,0,0)[0] intercept**
The residual analysis also supports stationarity of the residuals ([figure 11](#figure-11)): the residual time series (plot A) looks like white noise, and most residuals are normally distributed around a mean of 0 (plot B). However, there is some autocorrelation every 12 months (C and D), indicating that the model still does not fully capture the seasonal component.

The ADF and KPSS unit root tests also indicated that the residuals of this model are stationary, so the model could be used to forecast (please see Jupyter Notebook 9 for details). The RMSE of 0.57 on the log scale corresponds to a multiplicative error factor of e^0.57 ≈ 1.77 on the wildfire frequency. This means that for an observed count of 100 wildfires, the estimate typically falls between about 56 and 177 wildfires (100/1.77 and 100 × 1.77).
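
The conversion from an RMSE on the log scale to a range of counts can be sketched as below. The function name is hypothetical; the two RMSE values are the ones reported in this document for the monthly ARIMA and monthly LSTM models.

```python
import math


def multiplicative_interval(observed_count, log_rmse):
    """Return (low, high) counts implied by an RMSE measured on the log scale."""
    factor = math.exp(log_rmse)
    return observed_count / factor, observed_count * factor


# Monthly ARIMA, RMSE = 0.57: roughly 56.6 to 176.8 for 100 observed fires.
arima_low, arima_high = multiplicative_interval(100, 0.57)

# Monthly LSTM, RMSE = 0.2937: roughly 74.5 to 134.1 for 100 observed fires.
lstm_low, lstm_high = multiplicative_interval(100, 0.2937)
```

Because the error is additive on the log scale, it is multiplicative on the count scale, which is why the interval is not symmetric around 100.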


<span id="figure-11"></span>  
![ARIMA (3,0,2) Residuals](Abbildungen/monthly_arima_residual_analysis.png)
**Figure 11:** Residual analysis for the ARIMA(3,0,2)(0,0,0)[0] model with intercept. Plot A shows the residuals over time, Plot B shows the histogram of residuals, Plot C displays the autocorrelation plot, and Plot D shows the partial autocorrelation plot. The analysis highlights some remaining autocorrelation at a seasonal lag of 12 months.

### **2.2.3 Cross-Validation Error Values**
The error indexes of the 5-fold cross-validation are lower than those of the ARIMA model with daily data, but still high.
<span id="figure-12"></span>  
![Monthly ARIMA Prediction Error Indexes](Abbildungen/arima_monthly_averaged_scores_table.png)
**Figure 12:** Cross-validation error indices (MSE, MAE, and RMSE) for the ARIMA(3,0,2)(0,0,0)[0] model with intercept, averaged across 5 folds.

# **3. Regression Model**

## **3.1 Daily**

### **3.1.1 Model with Time Lags and Environmental Variables as Predictors**
The regression model for the daily wildfire counts (log-transformed) was built with lag and environmental variables. 

### Equation:

Log_Fire_Counts = β₀ + β₁⋅Time_Trend + β₂⋅Lag_1 + β₃⋅Lag_2 + β₄⋅Precipitation + β₅⋅Wind_Speed + β₆⋅MaxTemperature + β₇⋅MinTemperature + ϵ

### Legend:
- **Log_Fire_Counts**: Log-transformed wildfire frequency (dependent variable)
- **β₀**: Intercept of the regression model
- **β₁**: Coefficient for the time trend
- **Time_Trend**: Temporal trend variable
- **β₂, β₃**: Coefficients for lag variables
- **Lag_1, Lag_2**: Wildfire counts from 1 and 2 time steps prior
- **β₄, β₅, β₆, β₇**: Coefficients for environmental variables
- **Precipitation**: Amount of rainfall
- **Wind_Speed**: Wind speed
- **MaxTemperature**: Maximum temperature
- **MinTemperature**: Minimum temperature
- **ϵ**: Residual error (white noise)
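
The predictor table behind this equation can be sketched as follows. The sketch assumes that the lag variables are lagged values of the log fire counts and that the time trend is the row index; the function name and the toy values are hypothetical.

```python
def build_regression_rows(log_fire_counts, covariates, n_lags=2):
    """Build design rows [1, time_trend, lag_1..lag_n, *covariates] and targets."""
    rows, targets = [], []
    for t in range(n_lags, len(log_fire_counts)):
        lags = [log_fire_counts[t - k] for k in range(1, n_lags + 1)]
        rows.append([1.0, float(t)] + lags + list(covariates[t]))
        targets.append(log_fire_counts[t])
    return rows, targets


# Toy data: 5 days of log counts and (precipitation, wind, max temp, min temp).
log_fire_counts = [5.0, 5.2, 5.1, 5.6, 5.4]
covariates = [
    (0.0, 3.1, 25.0, 12.0),
    (1.2, 2.8, 24.0, 11.5),
    (0.0, 4.0, 27.0, 13.0),
    (0.0, 4.5, 29.0, 14.0),
    (3.4, 2.0, 22.0, 10.0),
]
X, y = build_regression_rows(log_fire_counts, covariates)
```

The first two days are dropped because their lag values are not available, so five observations yield three usable rows.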


The model summary ([figure 13](#figure-13)) shows an R-squared value of 0.464, indicating that the model explains approximately 46.4% of the total variance. However, this is primarily due to the high autocorrelation of the lag 1 and lag 2 variables, which have the highest coefficient values. The environmental variables have low coefficient values, suggesting they do not significantly contribute to explaining the variance in this linear model.  

A model using only environmental data was also tested to isolate the effect of lag variables. The environmental variables still showed low explanatory power (Data not shown, but it can be configured in Jupyter Notebook 11).  

The residual analysis ([figure 14](#figure-14)) shows the weekly aggregated residuals (aggregated for better visualization), which resemble white noise (A) and are approximately normally distributed (B). There is high autocorrelation (C and E), as expected given the lag variables. This indicates that the model cannot fully capture the complexity of the data.

<span id="figure-13"></span>  
![Daily TSLM Model Summary](Abbildungen/TSLM_model_summary.png)  
**Figure 13:** Summary of the daily regression model with time lags and environmental variables as predictors. The high R-squared value is primarily driven by lag variables.

  

<span id="figure-14"></span>  
![Residuals daily TSRM](Abbildungen/aggregated_tslm_residual_analysis.png)  
**Figure 14:** Residual analysis for the daily regression model. Plot A shows the residual time series, Plot B shows the histogram of residuals, Plot C displays the autocorrelation, and Plot E shows the partial autocorrelation.

---

### **3.1.2 Cross-Validation Error Values**
The regression model for the daily data shows poor predictive power, as indicated by the high error index values obtained from 5-fold cross-validation ([figure 15](#figure-15)).

<span id="figure-15"></span>  
![Daily TSLM Error Indexes](Abbildungen/TSLM_averaged_scores_table.png)  
**Figure 15:** Cross-validation error indices (MSE, MAE, and RMSE) for the daily regression model. High error values indicate poor prediction accuracy.

---

## **3.2 Monthly**

### **3.2.1 Model with Just Environmental Variables as Predictors**
In the monthly regression model, only the time trend and environmental variables were included.

### Equation:

Log_Fire_Counts = β₀ + β₁⋅Time_Trend + β₂⋅Precipitation + β₃⋅Inferred_Wind_Speed + β₄⋅MaxTemperature + β₅⋅Detrended_CO₂ + β₆⋅Detrended_O₂ + ϵ

### Legend:
- **Log_Fire_Counts**: Log-transformed wildfire frequency (dependent variable)
- **β₀**: Intercept of the regression model
- **β₁**: Coefficient for the time trend
- **Time_Trend**: Temporal trend variable
- **β₂**: Coefficient for precipitation
- **Precipitation**: Amount of rainfall
- **β₃**: Coefficient for inferred wind speed
- **Inferred_Wind_Speed**: Estimated or modeled wind speed
- **β₄**: Coefficient for maximum temperature
- **MaxTemperature**: Maximum temperature
- **β₅**: Coefficient for detrended CO₂ levels
- **Detrended_CO₂**: CO₂ levels after removing long-term trends
- **β₆**: Coefficient for detrended O₂ levels
- **Detrended_O₂**: O₂ levels after removing long-term trends
- **ϵ**: Residual error (white noise)





 The model explains almost half of the variance (R-squared = 0.494) using environmental variables. The environmental variables are ordered by decreasing coefficient weight: Precipitation, Wind Speed, Maximum Temperature, CO2, and O2.  

<span id="figure-16"></span>  
![Monthly TSLM Model Summary](Abbildungen/TSLM_monthly_model_summary.png)  
**Figure 16:** Summary of the monthly regression model using only environmental variables. Precipitation is the most influential predictor, followed by Wind Speed, Maximum Temperature, CO2, and O2.

The analysis of the residuals shows that they resemble white noise (A), are nearly normally distributed (B), and exhibit autocorrelation at 11 and 12 months (C and D).  

<span id="figure-17"></span>  
![Residuals monthly TSRM](Abbildungen/TSRM_monthly_residual_analysis.png)  
**Figure 17:** Residual analysis for the monthly regression model. Plot A shows the residual time series, Plot B shows the histogram of residuals, Plot C displays the autocorrelation, and Plot D shows the partial autocorrelation.

---

### **3.2.2 Cross-Validation Error Values**
Error indices for the monthly model also show high error values, as obtained through 5-fold cross-validation.  

<span id="figure-18"></span>  
![Monthly TSLM Error Summary](Abbildungen/TSLM_Monthly_averaged_scores_table.png)  
**Figure 18:** Cross-validation error indices (MSE, MAE, and RMSE) for the monthly regression model.


# **4. Long Short-Term Memory, Recurrent Neural Network Model (LSTM)**

LSTM (Long Short-Term Memory) is a type of neural network that can use a time series as input. It remembers patterns over time, which helps it make predictions based on past values. The architecture used here consisted of two LSTM layers (100 and 80 neurons) with two dense layers (50 units), and the final layer outputs a single prediction.

## **4.1 Daily**

### **4.1.1 LSTM Model with Lag and Environmental Variables as Predictors**
For the LSTM model using daily data, more variables were included:

- 20 lags
- Environmental variables
- Geospatial data (latitude, longitude)
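
An LSTM layer typically expects its input as sequences of shape (samples, time steps, features). The sketch below shows one way to cut a feature table into windows of 20 past steps. It is an assumption about how the 20 lags could be arranged, not the project's actual preprocessing, and the function name is hypothetical.

```python
def make_windows(features, targets, n_lags=20):
    """Pair each target with the n_lags feature rows that precede it."""
    X, y = [], []
    for end in range(n_lags, len(features)):
        X.append(features[end - n_lags:end])  # rows end-n_lags .. end-1
        y.append(targets[end])                # value to predict at step `end`
    return X, y


# Toy data: 25 time steps with 2 features each (hypothetical values).
features = [[float(t), 0.1 * t] for t in range(25)]
targets = [float(t) for t in range(25)]
X, y = make_windows(features, targets, n_lags=20)
```

Each window contains only past rows, so the target at a given step is never part of its own input.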

The residuals do not resemble white noise due to a change in variance at the end of the series (A). The residuals appear normally distributed (B) but show high autocorrelation (C and D) ([Figure 19](#figure-19)).

<span id="figure-19"></span>  
![Daily LSTM Model](Abbildungen/lstm_residual_analysis_weekly.png)  
**Figure 19:** Residual analysis for the daily LSTM model. Plot A shows the residuals over time, Plot B shows the histogram of residuals, Plot C displays the autocorrelation, and Plot D shows the partial autocorrelation. 

---

### **4.1.2 Cross-Validation Error Values**
The prediction error indices show a considerable error reduction compared to the previous models ([Figure 20](#figure-20)). This indicates better generalization, making it a potential forecasting model. However, the large number of variables makes forecasting more challenging, especially because the environmental variables must themselves be estimated.

<span id="figure-20"></span>  
![Daily LSTM Model Error Summary](Abbildungen/NNR_daily_cross_validation_scores.png)  
**Figure 20:** Cross-validation error indices for the daily LSTM model.

---

## **4.2 Monthly**

### **4.2.1 LSTM Model with Lag and Environmental Variables as Predictors**
Due to the lower amount of data in the monthly format, the residuals only weakly resemble white noise over time (A) and a normal distribution (B), and show low autocorrelation ([Figure 21](#figure-21)). Some seasonal components remain in the residuals (C and D). However, the pattern suggests that more data could improve stationarity.

<span id="figure-21"></span>  
![Monthly LSTM Model](Abbildungen/lstm_monthly_residual_analysis.png)  
**Figure 21:** Residual analysis for the monthly LSTM model. Plot A shows the residuals over time, Plot B shows the histogram of residuals, Plot C displays the autocorrelation, and Plot D shows the partial autocorrelation.

---

### **4.2.2 Cross-Validation Error Values**
This model has the lowest prediction error among the models, as indicated by the error indices ([Figure 22](#figure-22)). The RMSE is 0.2937 on the log scale, a multiplicative factor of e^0.2937 ≈ 1.34, meaning that for 100 observed wildfires the predicted values would be around 75-134.

<span id="figure-22"></span>  
![Monthly LSTM Error Summary](Abbildungen/NNR_monthly_cross_validation_scores.png)  
**Figure 22:** Cross-validation error indices for the monthly LSTM model.


# **5. Model Comparison and Selection**

Due to its lower prediction error values, the neural network model with monthly data was selected to forecast ([Figure 23](#figure-23)). The differences between the test and predicted values are shown in [Figure 24](#figure-24). The model predicts best when the observed value lies in the range of 7.5-9.5 log fire counts. Outside this interval the predictions underestimate the observed values. Therefore, the model does not capture an important part of the variance.


<span id="figure-23"></span>  
![Summary of error indexes](Abbildungen/tabelle_lstm_hervorgehoben.png)  
**Figure 23:** Summary of error indices for model comparison, for each model and time data.

<span id="figure-24"></span>  
![Monthly LSTM Test Prediction](Abbildungen/lstm_monthly_test_vs_predictions.png)  
**Figure 24:** Observed versus predicted wildfire counts for the monthly LSTM model.



# **6. Forecast of Year 2015 with LSTM Model of Monthly Data (Best Cross-Validation Values)**

The forecast for the year 2015 was made using the remaining years (1992-2014) as training data. To do so, the model was built again from scratch (not loaded from a saved model) so that the year to be forecast was excluded from training. The model was constructed as described above, with the same lags and the corresponding environmental values (LSTM with monthly data). The trained model was then used to forecast the year 2015 using the environmental values from 2015.
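
Holding out the forecast year can be sketched as a simple split by year. The record layout (dictionaries with `year` and `month` keys) and the function name are hypothetical; the point is only that no 2015 row reaches the training set.

```python
def split_by_year(rows, forecast_year):
    """Return (train, test): rows before the forecast year, and rows in it."""
    train = [row for row in rows if row["year"] < forecast_year]
    test = [row for row in rows if row["year"] == forecast_year]
    return train, test


# Toy monthly table covering 1992-2015 (values omitted for brevity).
rows = [
    {"year": year, "month": month}
    for year in range(1992, 2016)
    for month in range(1, 13)
]
train, test = split_by_year(rows, forecast_year=2015)
```

With 24 years of monthly data this leaves 276 rows for training and 12 rows for the forecast comparison.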


The observed wildfire counts (log) lie mostly inside the interval where the model performed best (7.5-9.5) ([Figure 25](#figure-25)). Therefore, the predicted values are arguably close to the observed ones. The December estimate has the largest error, which is expected because the observed value lies outside the interval (< 7.5 log fire counts).
 
 
<span id="figure-25"></span>  
![Forecast](Abbildungen/forecast_vs_actual_2015.png)  
**Figure 25:** Forecast versus actual wildfire counts for the year 2015 using the monthly LSTM model.







# **7. Conclusions**

The Long Short-Term Memory (LSTM) neural network model had the lowest prediction error and forecast the year 2015 with arguably low error. Nevertheless, several improvements are possible:
- Add more relevant available variables, such as land use and holidays, and weight by states with extreme values.
- All models here can be tuned further. For example, the ARIMA models could be differenced to reach stationarity, and the neural network models could be trained for more epochs.
- Models that use external explanatory variables also need a prediction of these variables in order to forecast. This is a problem because input values that carry a considerable probability of error pass that error on to the forecast, where it accumulates. Accordingly, the relatively good fit of the 2015 forecast is plausible, because the environmental values came from true measurements and not from predictions. Nevertheless, it is still useful to recognize the variance caused by factors outside the explanatory variables used in this model.
- Try a Transformer neural network model, which could achieve better results.


# **8. Data Generation**

## **8.1 Daily Data Methodology**
The original data is the file "FPA_FOD_20170508.sqlite", which contains multiple interconnected tables on fire events registered in the USA between 1992 and 2015.

From this file, only the table containing information on fire events, the date of registration, and geospatial information was extracted. This data was used to create the wildfires in the USA dataset (1992–2015) as a `.csv` file: **"fires_data.csv"**. Each row of this file represents a registered fire in the US, with descriptive variables for the fire event:

Columns (Descriptive Variables): 
- `OBJECTID`, `FIRE_YEAR`, `DISCOVERY_DATE`, `CONT_DATE`, `LATITUDE`, `LONGITUDE`, `FIRE_SIZE`, `FIRE_SIZE_CLASS`, `STAT_CAUSE_DESCR`, `FIRE_NAME`

The relevant columns for this project are:
- `FIRE_YEAR`, `DISCOVERY_DATE`, `LATITUDE`, `LONGITUDE`

---

## **8.2 Atmospheric Variables**
The atmospheric variables used in the daily analysis are: **Wind, Temperature, and Precipitation**. The data was extracted for each day, limited by latitude and longitude (USA), and daily date (`yyyy-mm-dd`). For each atmospheric variable, a sub-variable measure was obtained from the **National Oceanic and Atmospheric Administration (NOAA)**:

### Files containing atmospheric variable values within the USA geospatial limits (24°–49° latitude, 235°–294° longitude) and days between 1992 and 2015:
- The dates in the geospatial data were converted from Julian dates to the standard date format (`yyyy-mm-dd`).
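
A minimal sketch of such a date conversion, assuming that "Julian dates" means Julian Day Numbers (fractional days, as in a value like 2453403.5). The helper name `julian_day_to_date` and the example values are illustrative and not taken from the project code:

```python
import pandas as pd

# Julian Day Number of the Unix epoch (1970-01-01 00:00 UTC).
UNIX_EPOCH_JD = 2440587.5

def julian_day_to_date(jd: pd.Series) -> pd.Series:
    """Convert Julian Day Numbers to calendar timestamps."""
    # Days since the Unix epoch can be parsed directly with unit="D".
    return pd.to_datetime(jd - UNIX_EPOCH_JD, unit="D")

dates = julian_day_to_date(pd.Series([2448622.5, 2453403.5]))
print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['1992-01-01', '2005-02-02']
```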

### Generated Files (name of variable linked to download source):
- [Wind Speed](https://www.ncei.noaa.gov/access/monitoring/wind/overview): `Data_Waldbrand_Zeitreihen\Wind\wind_speed_1992_2015.csv` (m/s)
- [Maximum Temperature](https://downloads.psl.noaa.gov/Datasets/cpc_global_temp/): Code optimized to filter geospatially within the next processing step (°C)
- [Minimum Temperature](https://downloads.psl.noaa.gov/Datasets/cpc_global_temp/): Same as Maximum Temperature (°C)
- [Precipitation](https://downloads.psl.noaa.gov/Datasets/cpc_global_precip/): `Data_Waldbrand_Zeitreihen\Prazipitation\final_output_precipitation_data.csv` (mm)

The final output is a concatenation of subfiles in the **Bulk folder**. Producing the output file in subsections saved processing time. The concatenated output is not hosted on GitLab because of the large size of the `.nc` and `.csv` files in the Bulk folder.

For each atmospheric sub-variable, based on daily date, latitude, and longitude, values were assigned from the `.csv` files to the `fires_data.csv` file (containing fire registration data). Assignments were made by matching date and geospatial variables using the nearest valid index value ([source](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.sel.html)).
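
The nearest-grid-point assignment can be illustrated with a NumPy-only sketch of the same idea that xarray's `sel(..., method="nearest")` implements. The 0.5° grid spacing, the synthetic `field`, and the helper names are assumptions made for illustration. The sketch also assumes that fire longitudes are given in the -180° to 180° convention and must be mapped to the 0°-360° convention of the grid limits stated above:

```python
import numpy as np

# Hypothetical 0.5-degree grid covering the bounding box stated above.
grid_lat = np.arange(24.0, 49.5, 0.5)
grid_lon = np.arange(235.0, 294.5, 0.5)  # 0-360 degree convention

def nearest_index(grid: np.ndarray, value: float) -> int:
    """Index of the grid coordinate closest to value."""
    return int(np.abs(grid - value).argmin())

def lookup(field: np.ndarray, lat: float, lon: float) -> float:
    """Value of field[lat, lon] at the grid cell nearest to a fire location."""
    lon360 = lon % 360  # map -180..180 longitudes to 0..360
    return float(field[nearest_index(grid_lat, lat), nearest_index(grid_lon, lon360)])

# Synthetic field for one day: each cell stores its own grid longitude,
# which makes the result easy to check by eye.
field = np.tile(grid_lon, (grid_lat.size, 1))

print(lookup(field, lat=40.04, lon=-121.01))  # 239.0
```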

---

## **8.3 Output Files**

- **Wind Speed**: `Data_Waldbrand_Zeitreihen\Wind\fire_wind_speed_inferred.csv` (m/s)
- **Maximum Temperature**: `Data_Waldbrand_Zeitreihen\Temp\final_output_Max_temperature_data.csv` (°C)
- **Minimum Temperature**: `Data_Waldbrand_Zeitreihen\Temp\final_output_Min_temperature_data.csv` (°C)
- **Precipitation**: `Data_Waldbrand_Zeitreihen\Prazipitation\final_output_precipitation_data.csv` (mm)

---

## **8.4 Merged Daily Data with External Variables**
All output files for atmospheric variables were joined by date and geospatial information into:
**"final_combined_environment_fire_data.csv"**.

A validation step was performed to:
- Identify rows in `fires_data.csv` potentially missing from `final_combined_environment_fire_data.csv`.

This step ensured that each row has at least one match in date and geospatial data. Note that some rows may contain `NA` values for atmospheric variables. The resulting file is:
**"final_combined_environment_fire_data_cleaned.csv"**.
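
A validation of this kind could look like the sketch below. It uses small synthetic tables and assumes that rows are matched on `DISCOVERY_DATE`, `LATITUDE`, and `LONGITUDE`; the actual keys used in the project are not documented here.

```python
import pandas as pd

KEYS = ["DISCOVERY_DATE", "LATITUDE", "LONGITUDE"]  # assumed matching keys

# Synthetic stand-ins for fires_data.csv and the combined file.
fires = pd.DataFrame({
    "DISCOVERY_DATE": ["2005-02-02", "2005-02-02", "2005-02-03"],
    "LATITUDE": [40.04, 38.93, 38.98],
    "LONGITUDE": [-121.01, -120.40, -120.74],
})
combined = fires.iloc[[0, 2]].assign(MaxTemperature=[12.5, None])

# A left join with indicator=True flags fire rows without any match.
check = fires.merge(combined[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True)
missing = check[check["_merge"] == "left_only"]

# Matched rows can still lack atmospheric values.
n_na = int(combined["MaxTemperature"].isna().sum())

print(len(missing), n_na)  # 1 1
```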

---

## **8.5 Daily Averages File**
The output file **"daily_averages.csv"** contains the average values of numerical columns from **"final_combined_environment_fire_data_cleaned.csv"**:

Each row represents the daily mean for `LATITUDE`, `LONGITUDE`, `FIRE_SIZE`, `Precipitation`, `Inferred_Wind_Speed`, `MaxTemperature`, `MinTemperature` for all wildfires between 1992-01-01 and 2015-12-31.

- The daily wildfire count was computed as the number of rows per date. This output is saved as **"daily_fire_counts.csv"**.

Finally, **"daily_averages.csv"** and **"daily_fire_counts.csv"** were merged by date and saved as **"merged_with_daily_fire_counts.csv"**.
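
The aggregation and merge described in this section can be sketched with pandas as follows. The data is synthetic, only two of the numerical columns are shown, and the count column name `Daily Fire Counts` is a hypothetical choice:

```python
import pandas as pd

# Synthetic stand-in for final_combined_environment_fire_data_cleaned.csv.
fires = pd.DataFrame({
    "DISCOVERY_DATE": pd.to_datetime(["1992-01-01", "1992-01-01", "1992-01-02"]),
    "FIRE_SIZE": [0.5, 1.5, 3.0],
    "MaxTemperature": [10.0, 14.0, 9.0],
})

# Daily mean of every numerical column.
daily_averages = fires.groupby("DISCOVERY_DATE").mean(numeric_only=True).reset_index()

# Daily wildfire count: number of rows (registered fires) per date.
daily_fire_counts = (
    fires.groupby("DISCOVERY_DATE").size().rename("Daily Fire Counts").reset_index()
)

# Merge both tables by date.
merged = daily_averages.merge(daily_fire_counts, on="DISCOVERY_DATE")
print(merged)
```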

---

## **8.6 Monthly Data**
Using **"final_combined_environment_fire_data_cleaned.csv"**, the mean value for each month was calculated to obtain the expected monthly values for Temperature, Wind Speed, and Precipitation. The output file is **"monthly_averages.csv"**.
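
As a hedged illustration, the monthly means could be computed by grouping on a year-month period. The data below is synthetic and only two variables are shown:

```python
import pandas as pd

# Synthetic stand-in for final_combined_environment_fire_data_cleaned.csv.
cleaned = pd.DataFrame({
    "DISCOVERY_DATE": pd.to_datetime(["1992-01-03", "1992-01-20", "1992-02-11"]),
    "MaxTemperature": [10.0, 14.0, 9.0],
    "Precipitation": [0.0, 2.0, 1.0],
})

# Reduce each date to its year-month period, then average within each month.
cleaned["YearMonth"] = cleaned["DISCOVERY_DATE"].dt.to_period("M")
monthly_averages = (
    cleaned.groupby("YearMonth")[["MaxTemperature", "Precipitation"]].mean().reset_index()
)
print(monthly_averages)
```

Because the mean is taken over fire records, months with many fires contribute more individual observations to their own average; each monthly value describes the conditions at fire locations rather than a countrywide mean.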


### Additional Atmospheric Variables:
- **O₂ and CO₂ values** were added, sourced from "La Jolla Pier" in California, as they align with monthly fire counts. Data gaps were filled by interpolation. The detrended values of the gases (generated by differencing) were also added as new columns to check whether they can explain some of the variance. These variables were not used in the daily data models because only monthly data was found as a free source.
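
Gap filling and detrending by differencing can be sketched as follows. This example assumes linear interpolation and a first difference, which the text does not specify, and the values are synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic monthly gas series with two gaps.
idx = pd.date_range("1992-01-01", periods=6, freq="MS")
o2 = pd.Series([-100.0, np.nan, -104.0, -106.0, np.nan, -110.0], index=idx, name="O2_Value")

# Fill gaps by linear interpolation between neighbouring months (assumed method).
o2_filled = o2.interpolate(method="linear")

# Detrend by first differencing: value of month t minus value of month t-1.
detrended_o2 = o2_filled.diff().rename("Detrended_O2")

print(pd.concat([o2_filled, detrended_o2], axis=1))
```

The first month has no predecessor, so its differenced value is `NaN` and has to be dropped or handled before model fitting.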

The complete monthly data is saved as:
**"merged_monthly_with_o2_co2_fire_counts.csv"**.

and has the following columns as variables: 
 `YearMonth`, `MinTemperature`, `MaxTemperature`, `Inferred_Wind_Speed`, `Precipitation`, `O2_Value`, `O2_seasonally`, `CO2_Value`, `CO2_seasonally`, `DISCOVERY_DATE`, `Monthly Fire Counts`, `Detrended_CO2`, `Detrended_O2`, and `Month` 

**Source for CO₂ and O₂ data**: [La Jolla Pier](https://scrippso2.ucsd.edu/)
