# 1 Report on Predicting Boston-area Precipitation

## 1.1 Summary

### Index of files

1.1 Summary

#### Data manipulation

2.1 Formatting Raw Data

2.2 Supplementing station data

2.3 Selecting stations of interest

#### Analysis

3.1 Descriptive statistics

3.2 Time Series Analyses: STL & ARIMA

3.3 Gradient-boosted model: LightGBM


In [6]:
import pandas as pd 

descriptive_statistics = pd.read_csv(r'C:/Users/evamb/OneDrive/Documents/Github/MAPrecipData/Data_Products/station_interest_desc_stats.csv')


### Introduction

As climate change alters the frequency and intensity of rainfall events, predicting precipitation patterns will become even more important for preparing for and mitigating these changes. Whether we are preparing for floods or droughts, observing long-term patterns (the climate) allows us to anticipate when and where we should expect precipitation or clear skies (the weather). Sufficiently long records of weather station data allow us to examine patterns, model cyclical behavior, and test predictions for specific climate measurements.

I aim to explore monthly precipitation patterns around the Boston region using quantitative and graphical means. Here, I prepare precipitation datasets from three Boston-area weather stations (section 2), calculate descriptive statistics, and produce predictive precipitation models for each station (section 3).

![Map_stations_selected_for_modeling.png](attachment:Map_stations_selected_for_modeling.png)

### Methods Overview

#### Data sources
The 'Precipitation Database Data 2019' dataset is available from the __[Massachusetts Department of Conservation & Recreation's Office of Water Resources](https://www.mass.gov/info-details/precipitation-data)__. This dataset is comprised of observed monthly precipitation totals (in inches) for weather stations across Massachusetts. Here, snowfall is melted into equivalent inches of water and added to rainfall. GPS coordinates for stations are available upon request from the MA DCR.

#### Dataset preparation
I separated the 'Precipitation Database Data 2019' dataset into three tables of precipitation, station, and basin data. Then, I supplemented the station data table with record completeness data (calculated for each station) and GPS coordinates (if available). I selected all stations with 30-year records between 1989 and 2019, and I exported records from three stations in Lynn (LYN614), Cohasset (COH726), and Weymouth (WEY738) (see 'Stations selected for precipitation modeling', above).
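As a minimal sketch of the record-completeness step, the snippet below summarizes each station's record from a long-format table. The column names (`station`, `date`, `precip_in`), the station IDs, and the definition of completeness (months reported divided by months spanned by the record) are assumptions for illustration; the calculation actually used is in section 2.2.

```python
import pandas as pd

# Hypothetical long-format precipitation table: one row per station-month.
# Station IDs and column names are illustrative, not taken from the DCR dataset.
records = pd.DataFrame({
    "station": ["AAA001"] * 4 + ["BBB002"] * 4,
    "date": pd.to_datetime(["1989-01-01", "1989-02-01", "1989-03-01", "1989-04-01"] * 2),
    "precip_in": [3.1, 2.4, None, 4.0, 1.2, None, None, 5.5],
})


def record_completeness(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each station's record: first/last month, months reported, completeness."""
    reported = df.dropna(subset=["precip_in"])
    summary = reported.groupby("station")["date"].agg(
        first_month="min", last_month="max", months_reported="count"
    )
    # Months spanned by the record, inclusive of both endpoints.
    span = (
        (summary["last_month"].dt.year - summary["first_month"].dt.year) * 12
        + (summary["last_month"].dt.month - summary["first_month"].dt.month)
        + 1
    )
    # Assumed definition: share of months in the span that have a reported value.
    summary["completeness"] = summary["months_reported"] / span
    return summary


summary = record_completeness(records)
print(summary)
```

Stations could then be filtered on `first_month`, `last_month`, and `completeness` to keep only those with sufficiently long and complete records.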

#### Descriptive Statistics
I calculated descriptive statistics for each station's record, including mean, median, mode, standard deviation, and quartiles. I also calculated skew and kurtosis and performed a Shapiro-Wilk test of normality to assess the distribution of monthly rainfall independent of time. If the p-value of the test is less than 0.05, the distribution is significantly different from a normal (Gaussian) distribution, which has implications for statistical methods that assume normality. Additionally, I performed a non-parametric Mann-Whitney U test for each pair of stations to assess whether their monthly precipitation distributions differ significantly.
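The tests described above can be sketched with `scipy.stats`. The two samples below are synthetic, right-skewed draws standing in for two stations' monthly totals; they are not the report's data, and the helper name `describe_distribution` is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed "monthly precipitation" samples (inches); not the report's data.
station_a = rng.gamma(shape=4.0, scale=1.0, size=300)
station_b = rng.gamma(shape=4.0, scale=1.0, size=300) + 0.5


def describe_distribution(values):
    """Return skew, excess kurtosis, and the Shapiro-Wilk statistic and p-value."""
    w_stat, p_value = stats.shapiro(values)
    return {
        "skew": stats.skew(values),
        # Fisher definition: 0 for a normal distribution.
        "excess_kurtosis": stats.kurtosis(values),
        "shapiro_w": w_stat,
        "shapiro_p": p_value,
    }


summary_a = describe_distribution(station_a)

# Two-sided Mann-Whitney U test: rank-based, so it does not assume normality.
u_stat, u_p = stats.mannwhitneyu(station_a, station_b, alternative="two-sided")

print(summary_a)
print(f"U = {u_stat:.1f}, p = {u_p:.4f}")
```

A Shapiro-Wilk p-value below 0.05 is what motivates the rank-based comparison here rather than a t-test.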

#### SARIMA Models
I created autocorrelation function plots and partial autocorrelation function plots for each station. I analyzed these plots to parameterize a set of SARIMA (seasonal auto-regressive integrated moving average) models for each station. The package I used allows the user to run a Seasonal-Trend decomposition using LOESS (STL) that feeds directly into an ARIMA model, allowing for some automatic parameterization (Seabold and Perktold 2010). I tested multiple combinations of model parameters for each station based on autocorrelation and partial autocorrelation values (see table 2. SARIMA model parameters). I compared AIC (Akaike's information criterion) values for the models at each station to select the best-performing parameters and plotted those models. I also plotted the predicted and actual precipitation. After min-max normalizing the data, I repeated the SARIMA modeling process.

#### LightGBM 
I chose to implement a gradient-boosted model using LightGBM; benchmarks in the LightGBM documentation report that it is faster and often more accurate than XGBoost (https://lightgbm.readthedocs.io/en/stable/Experiments.html). I trained three models on data from the first 26 years of each station's record (1990-2016). Then, I applied the models to predict monthly rainfall for the four-year period from 2016-2020 (the last four years represented in each dataset) and the following four-year period from 2020-2023. I plotted these models and the predicted and actual precipitation. I repeated the LightGBM modeling process after min-max normalizing the data.
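The chronological train/test split and min-max normalization described above can be sketched with pandas alone. The data are synthetic, and the calendar features, the cutoff date, and the choice to scale with training-period extremes only are assumptions for illustration; the features and scaling used in the actual models are documented in section 3.3.

```python
import numpy as np
import pandas as pd

# Synthetic monthly totals for 1990-2019 (360 months); not the report's data.
rng = np.random.default_rng(2)
index = pd.date_range("1990-01-01", "2019-12-01", freq="MS")
frame = pd.DataFrame({"precip_in": rng.gamma(4.0, 1.0, len(index))}, index=index)

# Calendar features (assumed; chosen only to illustrate the idea).
frame["month"] = frame.index.month
frame["year"] = frame.index.year

# Chronological split: no shuffling, so training never sees later months.
split_date = pd.Timestamp("2016-01-01")  # assumed cutoff
train = frame.loc[frame.index < split_date]
test = frame.loc[frame.index >= split_date]

# Min-max normalization using training-period extremes only,
# so no information from the test period leaks into the scaling.
lo, hi = train["precip_in"].min(), train["precip_in"].max()
train_scaled = (train["precip_in"] - lo) / (hi - lo)
test_scaled = (test["precip_in"] - lo) / (hi - lo)

X_train, y_train = train[["month", "year"]], train_scaled
X_test, y_test = test[["month", "year"]], test_scaled
print(len(train), len(test))
```

The resulting feature and target frames could then be passed to a gradient-boosted regressor such as LightGBM's scikit-learn-style `LGBMRegressor`. Test-period values can fall slightly outside [0, 1] under this scaling, which is expected.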


### Results

#### Descriptive Statistics

The distribution of rainfall at each station is not normal; table 1 shows that the Shapiro-Wilk test p-values calculated for each station are much less than our chosen level of significance (alpha = 0.05). After normalizing by proportion of total values, the Weymouth precipitation distribution has a shorter right tail than the others and exhibits less compression, confirming what its lower skew and kurtosis values indicate (see 'Density distribution of monthly precipitation values' and table 1, 'Distribution of normalized monthly rainfall values'). Mann-Whitney U tests indicate that monthly precipitation differs significantly between Cohasset and Lynn (u=63190.0, p-value=0.004) and between Lynn and Weymouth (u=45339.5, p-value=0.016), but not between Cohasset and Weymouth (u=49502.5, p-value=0.729).

In [2]:
print("Table 1. Descriptive statistics for select weather stations")
descriptive_statistics


Table 1. Descriptive statistics for select weather stations


| Unnamed: 0 | Statistics | COH726 | LYN614 | WEY738 |
| :-: | :-: | :-: | :-: | :-: |
| 0 | Months Reported | 327.0 | 342.0 | 298.0 |
| 1 | Mean Precipitation | 4.155 | 3.835 | 4.077 |
| 2 | Standard Deviation | 2.075 | 2.362 | 2.011 |
| 3 | Minimum | 0.11 | 0.17 | 0.63 |
| 4 | 25% Quartile | 2.735 | 2.062 | 2.507 |
| 5 | 50% Quartile | 3.9 | 3.35 | 3.805 |
| 6 | 75% Quartile | 5.2 | 4.8 | 5.27 |
| 7 | Maximum | 13.85 | 15.5 | 11.27 |
| 8 | Median | 3.9 | 3.35 | 3.805 |
| 9 | Mode | 2.99 | 3.16 | 2.18 |


![Density-station-precipitation.png](attachment:Density-station-precipitation.png)

#### SARIMA Models

The best model for each station was the one with the lowest Akaike’s Information Criterion (AIC) value (see tables 2a and 2b).

Table 2a: Models with lowest AIC for non-normalized data

| Model | p | d | q | AIC value |
| :-: | :-: | :-: | :-: | :-: |
| COH726-7 | 3 | 0 | 3 | -359.062 |
| LYN614-5 | 3 | 0 | 3 | -327.033 |
| WEY738-49 | 12 | 0 | 0 | -239.199 |

Table 2b: Models with lowest AIC for normalized data 

| Model | p | d | q | AIC value |
| :-: | :-: | :-: | :-: | :-: |
| COH726-14 | 13 | 0 | 9 | 1520.805 |
| LYN614-5 | 3 | 0 | 3 | 1625.451 |
| WEY738-48 | 6 | 0 | 12 | 1453.258 |

Plotting the results for each station shows that the predictions for the training period generally appear to track the normalized data. However, the models show considerably less variation in the forecasted predictions; the forecasts appear predominantly trend-driven with greatly reduced seasonality.

![SARIMA_predictions_norm.png](attachment:SARIMA_predictions_norm.png)

Models based on the normalized dataset ('normalized models') perform slightly better than models based on non-normalized data in two ways. First, the models for stations COH726 and LYN614 have higher r-values when comparing training-period predictions with normalized data (r=0.278 and r=0.154, respectively); WEY738 shows a very small decrease when modeling normalized data (r=0.298) compared to non-normalized data (r=0.300). Second, the normalized models' regression lines have slopes closer to 1. Neither the normalized nor the non-normalized models reflect low- and high-precipitation months very well, so a higher slope would indicate better predictions of those low and high values.
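To make the two comparison metrics concrete, here is a small sketch that computes Pearson's r and the least-squares slope of predicted on actual values. The function name `fit_quality` and the toy arrays are hypothetical, and regressing predicted on actual (rather than the reverse) is an assumption about how the regression lines were drawn.

```python
import numpy as np


def fit_quality(actual, predicted):
    """Return Pearson r and the least-squares slope of predicted regressed on actual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(actual, predicted)[0, 1]
    slope, _intercept = np.polyfit(actual, predicted, deg=1)
    return r, slope


# Toy example: predictions shrunk halfway toward the mean (3.0).
actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = 3.0 + 0.5 * (actual - 3.0)

r, slope = fit_quality(actual, predicted)
print(f"r = {r:.3f}, slope = {slope:.3f}")
```

In this toy case the correlation is perfect but the slope is only 0.5, which is why the slope is reported alongside r: a slope well below 1 means the model compresses low and high months toward the mean even when it ranks them correctly.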


![SARIMA_predictions_vs_norm.png](attachment:SARIMA_predictions_vs_norm.png)

The forecasted mean for all three models mostly overlays the real data, although the real data vary considerably more than the data forecasted for the same time period. For each station, several months of real data exceed the upper 95% confidence limit of the forecast. In contrast, almost no months of real data fall below the lower 95% confidence limit. (Note that the lower 95% confidence limit for all three models drops below 0" of precipitation, indicating that no rainfall is predicted.) This could indicate that the models are not handling the extent of the variation in the data well and may be biased towards better predicting low and average rainfall months. Alternatively, this could indicate that anomalously large rainfall events are driven by non-seasonal or indirectly related factors.
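One way to quantify this asymmetry is to count how many observed months fall above or below the forecast interval. The sketch below uses toy numbers and a hypothetical helper, `interval_exceedances`; it is not the calculation behind the figures in this report.

```python
import numpy as np


def interval_exceedances(actual, lower, upper):
    """Count observations above/below a forecast interval and the share inside it."""
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    above = int(np.sum(actual > upper))
    below = int(np.sum(actual < lower))
    coverage = 1.0 - (above + below) / actual.size
    return above, below, coverage


# Toy monthly totals (inches) against a constant interval with a negative lower bound.
actual = np.array([0.5, 3.0, 4.0, 9.0, 12.0])
lower = np.full(5, -0.5)
upper = np.full(5, 8.0)

above, below, coverage = interval_exceedances(actual, lower, upper)
print(above, below, coverage)
```

Because precipitation cannot be negative, a lower limit below zero can never be exceeded from below, so all misses necessarily land above the upper limit.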

The model forecasts for future dates at all three stations show so little variation compared to the real data that it is concerning. Plotting actual versus predicted monthly precipitation for each station (independent of date) shows that the models exhibit a strong pull towards the mean: the predicted values are narrowly concentrated around the mean compared to the spread of actual values.

It’s likely that month is not a strong indicator for rainfall amount because the climate is highly variable year-to-year and seaside towns are prone to high-intensity rainfall events during storms like Nor’easters or large snowfalls. I anticipate that adding other data to the models (e.g., temperature, wind direction, or humidity) would greatly increase a model’s predictive power in modeling precipitation.



In [4]:
#### Gradient-boosted model (LightGBM) predictions

In [5]:
#### Comparison with 2020 data

### References

Seabold, Skipper, and Josef Perktold. “Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.