# Predicting Solar Irradiance with LSTMs

[Notebook 1: EDA and Cleaning](./1_EDA%20and%20Cleaning.ipynb)

[Notebook 2: Modeling and Predictions](./2_Modeling%20and%20Predictions.ipynb)

[Notebook 3: Technical Report](./3_Technical_Report.ipynb)

## Background
Unlike most energy sources, sunlight cannot be switched on and off, or stored in barrels or reservoirs. This makes solar energy risky for producers, users, and investors. Underpredicting solar output costs projects early users and investment, while overpredicting it means unreliable supply for users and lost income for producers. De-risking solar is vital for achieving mainstream viability.

## Data
The data consists of solar irradiance measurements from [Loyola Marymount University](https://midcdmz.nrel.gov/apps/go2url.pl?site=LMU_), recorded from April 6, 2010 to May 5, 2016 with a rotating shadowband radiometer (RSR). The dataset was found via the National Renewable Energy Laboratory's (NREL) list of Measurement and Instrumentation Data Centers (MIDC).

### Metrics
The metric used for predictions was Direct Normal Irradiance (DNI), in W/m². This is the amount of irradiance on a surface perpendicular to the sun's rays. For reference, a surface near the equator receives about 1 kW/m² on a clear day.

### Limitations
LMU's SOLRMAP dataset website advises caution when using data from December 2014 through July 9, 2015. During this period the RSR was not taking a full charge, leading to some gaps in the data.

## Data cleaning
The raw data had the following issues:
- No datetime index
- Some negative values (such as `-99999`) for features that should never be negative
- Some outliers
- Missing data
- Unneeded columns

To get the data into usable form, I took the following steps:
- Wrote a custom function to convert existing time features to a datetime object
- Set negative values to 0
- Removed and imputed outliers
- Dropped dates with missing data
- Dropped unneeded columns
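
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`date`, `hhmm`, `dni`, `unused`) and the sample values are hypothetical, and outlier handling is left out because the exact method is not described here.

```python
import pandas as pd

def to_datetime_index(df, date_col="date", time_col="hhmm"):
    """Combine a date column and an integer HHMM column into a DatetimeIndex."""
    hhmm = df[time_col].astype(int)
    stamps = (
        pd.to_datetime(df[date_col])
        + pd.to_timedelta(hhmm // 100, unit="h")
        + pd.to_timedelta(hhmm % 100, unit="min")
    )
    indexed = df.set_index(pd.DatetimeIndex(stamps, name="datetime"))
    return indexed.drop(columns=[date_col, time_col])

# Hypothetical raw rows: one sentinel value and one missing reading.
raw = pd.DataFrame({
    "date": ["2010-04-06", "2010-04-06", "2010-04-07", "2010-04-07"],
    "hhmm": [1200, 1201, 1200, 1201],
    "dni": [650.0, -99999.0, None, 655.0],
    "unused": [1, 2, 3, 4],
})

clean = to_datetime_index(raw)
clean = clean.drop(columns=["unused"])  # drop unneeded columns
clean = clean.clip(lower=0)             # set negative values to 0

# Drop every row belonging to a calendar date that has any missing value.
days = clean.index.normalize()
bad_days = days[clean.isna().any(axis=1).to_numpy()].unique()
clean = clean[~days.isin(bad_days)]
```

In this toy input, April 7 contains a missing reading, so the whole date is removed and only the two April 6 rows remain.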

## Exploratory Data Analysis

### Predictors over time
This plot shows the irradiance metrics (DNI, DHI, and GHI) over time. Clearly there is a seasonal effect as values peak in the middle of the year (summer), and decline in the winter.
![pic](./assets/predictors.png)

### Predictors over day of year (average)
Averaging the predictors by day of year supports the seasonal pattern.
![pic](./assets/predictor_doy.png)

### Predictors over time of day (average)
Predictors over time of day are unsurprising. The stepping effect is due to the original time format in HHMM: read as a number, the time jumps from 1259 to 1300 at each hour boundary.
![pic](./assets/predictor_time.png)
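
Averages like the ones plotted in the two sections above can be computed with a `groupby` on the datetime index. The sketch below uses a synthetic hourly series as a stand-in for the real data; the column name `dni` and the shape of the synthetic signal are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# One year of synthetic hourly readings (2011 is not a leap year).
idx = pd.date_range("2011-01-01", "2011-12-31 23:00", freq="60min")
hour = idx.hour.to_numpy()
doy = idx.dayofyear.to_numpy()

# Stand-in signal: zero at night, peaking at midday and in midsummer.
daylight = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None)
season = 1 + 0.3 * np.sin((doy - 80) / 365 * 2 * np.pi)
df = pd.DataFrame({"dni": 800 * daylight * season}, index=idx)

# Mean value for each day of the year and for each hour of the day.
by_day_of_year = df.groupby(df.index.dayofyear)["dni"].mean()
by_time_of_day = df.groupby(df.index.hour)["dni"].mean()
```

Calling `by_day_of_year.plot()` or `by_time_of_day.plot()` then gives curves of the same kind as the figures.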

### Correlation coefficients of all features
The correlation matrix shows some multicollinearity between variables, as well as many weak correlations.
![pic](./assets/all_cc.png)

### Correlation coefficients of predictors
A closer look at correlation of irradiance metrics.
![pic](./assets/predictor_cc.png)
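
A correlation matrix of this kind, and a quick scan for strongly correlated pairs, can be sketched as below. The data is synthetic, the column names are placeholders, and the 0.8 cutoff is an arbitrary choice for the example rather than a threshold used in the notebooks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dni = rng.uniform(0, 900, size=500)
df = pd.DataFrame({
    "dni": dni,
    "ghi": 0.8 * dni + rng.normal(0, 20, size=500),  # built to track DNI closely
    "humidity": rng.uniform(20, 90, size=500),       # built to be unrelated
})

corr = df.corr()  # Pearson correlation by default

# List feature pairs whose absolute correlation exceeds the cutoff.
cutoff = 0.8
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > cutoff
]
```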

## Feature Engineering
No new features were engineered, but time elements (hour of day, datetime index) were derived from the original time features.

## Modeling

### Resampling
Before modeling, the data was resampled to 15-minute increments (using the mean) to save run time. Lag amounts are therefore counted in 15-minute steps (96 steps = 24 hours, 672 steps = 1 week).
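
As a rough sketch of the resampling and lagging described above, assuming 1-minute input readings and a hypothetical `dni` column (neither is confirmed by this README):

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic 1-minute readings.
idx = pd.date_range("2015-01-01", periods=14 * 24 * 60, freq="1min")
df = pd.DataFrame({"dni": np.arange(len(idx), dtype=float)}, index=idx)

# Average each 15-minute window.
df15 = df.resample("15min").mean()

# Lags are counted in 15-minute steps.
STEPS_PER_DAY = 24 * 60 // 15        # 96
STEPS_PER_WEEK = 7 * STEPS_PER_DAY   # 672
df15["dni_lag_1d"] = df15["dni"].shift(STEPS_PER_DAY)
df15["dni_lag_1w"] = df15["dni"].shift(STEPS_PER_WEEK)

# Rows without a full week of history have NaN lags and are dropped.
lagged = df15.dropna()
```

Because the one-week lag needs 672 earlier steps, the first week of any resampled series is lost after `dropna()`.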

### LSTM (long short-term memory) RNN (recurrent neural network) in Keras
Predictions were made using an LSTM model. Data was lagged by periods of 1 day and 1 week. The predictor features used were day of year, time of day, hour, lagged DNI, air temperature, and humidity.

### Train test split
Data was split at the year 2012. This resulted in about a 2:1 train-test split.
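
For time series, a split by date keeps the test period strictly after the training period, which avoids leaking future values into training. A minimal sketch follows; the date range is synthetic, and both the exact boundary and the choice to train on the earlier side are assumptions, since this README does not spell them out.

```python
import numpy as np
import pandas as pd

def split_at(df, boundary):
    """Chronological split: rows before `boundary` are train, the rest test."""
    boundary = pd.Timestamp(boundary)
    return df[df.index < boundary], df[df.index >= boundary]

# Three years of synthetic daily rows, split at the start of the last year.
idx = pd.date_range("2010-01-01", "2012-12-31", freq="D")
df = pd.DataFrame({"dni": np.arange(len(idx), dtype=float)}, index=idx)

train, test = split_at(df, "2012-01-01")
ratio = len(train) / len(test)  # roughly 2:1 for this synthetic range
```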

### Hyperparameters
Hyperparameters used for the LSTM were:

- LSTM cells = number of hours being predicted

- epochs = 10

- batch_size = 12

- dropout = 0.3
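
One way these settings could map onto a Keras model is sketched below. Only the four hyperparameter values come from the list above. The single `Dense(1)` output, the MSE loss, the Adam optimizer, and applying dropout through the LSTM layer's `dropout` argument are all assumptions, as is the helper name `build_model`.

```python
# Values taken from the hyperparameter list; everything else is illustrative.
HPARAMS = {"epochs": 10, "batch_size": 12, "dropout": 0.3}

def build_model(hours_ahead, timesteps, n_features):
    """Build a small LSTM regressor (hypothetical helper).

    TensorFlow is imported lazily so the settings can be inspected without it.
    """
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        # One LSTM unit per hour being predicted.
        keras.layers.LSTM(hours_ahead, dropout=HPARAMS["dropout"]),
        keras.layers.Dense(1),  # assumed: a single DNI value per sample
    ])
    model.compile(loss="mse", optimizer="adam")  # assumed loss and optimizer
    return model

# Usage, with X shaped (samples, timesteps, n_features):
# model = build_model(hours_ahead=24, timesteps=1, n_features=6)
# model.fit(X, y, epochs=HPARAMS["epochs"], batch_size=HPARAMS["batch_size"])
```

Keras LSTM layers expect three-dimensional input of shape (samples, timesteps, features), so a two-dimensional feature table has to be reshaped before fitting.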

### Additional steps
After fitting, each model and its results were saved for later use.

## Model evaluation
Models were scored on RMSE and R² score.

Predictions were inverse scaled to return them to the original scale before scoring.
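
The sketch below shows the order of operations with plain NumPy: undo the scaling first, then compute both scores in the original units. Min-max scaling to [0, 1] is an assumption here (the scaler is not named in this README), and the numbers are made up.

```python
import numpy as np

def inverse_min_max(scaled, data_min, data_max):
    """Undo min-max scaling to [0, 1] (assumed scaler)."""
    return scaled * (data_max - data_min) + data_min

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up actual values (W/m²) and scaled model outputs.
y_true = np.array([0.0, 200.0, 400.0, 800.0])
pred_scaled = np.array([0.0, 0.25, 0.5, 0.75])

y_pred = inverse_min_max(pred_scaled, data_min=0.0, data_max=800.0)
scores = {"rmse": rmse(y_true, y_pred), "r2": r2(y_true, y_pred)}
```

Scoring after the inverse transform keeps RMSE in W/m², which makes it directly comparable to the DNI values in the plots.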

## Predictions and results

Below are example results from models trained with the hyperparameters above. The plots show only the last 300 hours of data (about 12.5 days).

### 24-hour predictions vs. actual values

![pic](./assets/res_96.png)

### 168-hour predictions vs. actual values

![pic](./assets/res_672.png)

## Interpretation

According to the evaluation metrics, this model explains only about 30-50% of the variance in solar output. That is not good enough for industrial or financial applications, but it is useful for understanding the challenges of predicting solar irradiance. The model could potentially be improved with additional data such as satellite imaging.

**Dataset Citation:**
Andreas, A.; Wilcox, S. (2012). Solar Resource & Meteorological Assessment Project (SOLRMAP): Rotating Shadowband Radiometer (RSR); Los Angeles, California (Data); NREL Report No. DA-5500-56502. http://dx.doi.org/10.5439/1052230