## Forecasting Intro

The goal of this notebook is to study the correlation between the PV generation time-series and other weather-related time-series. You will also explore autocorrelation in the PV generation time-series.

### Loading the data

In [3]:
# Import the necessary packages
%matplotlib inline
import pandas as pd
import os
import matplotlib.pyplot as plt
import datetime as dt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

###  Load prosumer data ###
file_P = os.path.join(os.getcwd(),'ProsumerHourly_withUTC.csv')
df_pro = pd.read_csv(file_P)
df_pro["TimeUTC"] = pd.to_datetime(df_pro["TimeUTC"])
df_pro = df_pro.loc[df_pro["TimeUTC"].dt.year.isin([2022,2023])]
df_pro = df_pro.reset_index(drop=True)
# Rename the columns so that they match the weather data, then keep only what is needed
df_pro.rename(columns={'Consumption': 'Load', 'TimeUTC': 'HourUTC'}, inplace=True)
df_pro = df_pro[["HourUTC","Load","PV"]]
    
###  Load weather data ###
file_P = os.path.join(os.getcwd(),'WeatherData.csv')
df_weather = pd.read_csv(file_P)
df_weather["HourUTC"] = pd.to_datetime(df_weather["HourUTC"])
df_weather = df_weather.loc[df_weather["HourUTC"].dt.year.isin([2022,2023])]
df_weather = df_weather.drop(df_weather.index[-1])

### Merge the dataframes ###
df_merged = pd.merge(df_pro, df_weather, on='HourUTC')
df_merged = df_merged.drop(['index'], axis=1)

df_merged.rename(columns={'snow_soiling_rooftop': 'snow', 
                          'clearsky_dhi': 'cs_dhi', 
                          'clearsky_dni': 'cs_dni',
                          'clearsky_ghi': 'cs_ghi', 
                          'clearsky_gti': 'cs_gti',
                          'cloud_opacity': 'opacity',
                          'dewpoint_temp': 'dew_t'}, inplace=True)
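
`pd.merge` performs an inner join by default, so any hour that is missing from one of the two files is dropped without warning. The following is a minimal sketch of how one might check for this, using two tiny made-up frames. The values and the names `left`, `right`, `check` and `unmatched` are illustrative and do not come from the dataset.

```python
import pandas as pd

# Two tiny synthetic frames standing in for df_pro and df_weather (made-up values)
stamps = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:00",
                         "2022-01-01 02:00", "2022-01-01 03:00"])
left = pd.DataFrame({"HourUTC": stamps, "PV": [0.0, 0.1, 0.4, 0.2]})
right = pd.DataFrame({"HourUTC": stamps[1:], "gti": [50.0, 300.0, 120.0]})

# An outer merge with indicator=True records which frame each row came from
check = pd.merge(left, right, on="HourUTC", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["HourUTC", "_merge"]]
print(unmatched)

# The default inner merge keeps only timestamps present in both frames;
# validate="one_to_one" raises an error if a timestamp is duplicated
merged = pd.merge(left, right, on="HourUTC", validate="one_to_one")
print(len(left), len(right), len(merged))
```

If `unmatched` is empty for the real data, the inner merge loses nothing.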

## Pearson correlation coefficients

Now you can calculate the Pearson correlation coefficients (PCCs) and see which predictors are most strongly correlated with PV generation.

* Several predictors have similarly high PCCs
* Consider a multiple linear regression model
* Which predictor(s) would you use in your model?

In [None]:
Corr = df_merged.drop(columns=["HourUTC"]).corr(method='pearson')
Corr = Corr.round(2)
# Set the display width (in characters) so that the table wraps less when printed
pd.set_option('display.width', 150)
print(Corr)
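
With many predictors the full correlation matrix is hard to scan. One option is to keep only the `PV` column and sort it by absolute value. The sketch below uses synthetic data: the frame `df_demo`, the column `air_temp` and all numbers are assumptions made for illustration. The same two lines should work on `Corr` or `df_merged`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_merged (made-up values, illustrative column names)
rng = np.random.default_rng(0)
n = 500
gti = rng.uniform(0, 1000, n)
df_demo = pd.DataFrame({
    "PV": 0.004 * gti + rng.normal(0, 0.2, n),   # PV driven mostly by gti
    "gti": gti,
    "ghi": 0.9 * gti + rng.normal(0, 20, n),     # nearly a rescaled copy of gti
    "air_temp": rng.normal(10, 8, n),            # unrelated to PV in this toy setup
})

# Keep the PCCs with PV only, then rank predictors by absolute correlation
corr_with_pv = df_demo.corr(method="pearson")["PV"].drop("PV")
ranking = corr_with_pv.abs().sort_values(ascending=False)
print(ranking.round(2))
```

Sorting by absolute value matters because a strongly negative PCC is as informative for a linear model as a strongly positive one.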

## Plotting PV generation vs various predictors

The code below filters a few days from the dataset and plots PV generation next to a potential predictor.

* Do you see similarities in the time-series?
* It is a good idea to normalize the values before comparing them (the code already does this by dividing each series by its range over the full dataset)
* Try a few predictors by changing the `predictor` variable in the code below

In [None]:
# Choose predictor
predictor = "gti"

# Filter time-series
t_s = pd.Timestamp(dt.datetime(2023, 7, 8, 0, 0, 0))
t_e = pd.Timestamp(dt.datetime(2023, 7, 16, 23, 0, 0))
df_week = df_merged.loc[(df_merged["HourUTC"]>=t_s) & (df_merged["HourUTC"]<=t_e)]

# Plot
plt.figure(figsize=(10, 6))
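# Scale each series by its range over the full dataset (pandas max/min skip NaN values)
pv_range = df_merged["PV"].max() - df_merged["PV"].min()
pred_range = df_merged[predictor].max() - df_merged[predictor].min()
plt.plot(df_week["HourUTC"], df_week["PV"] / pv_range, label="PV")
plt.plot(df_week["HourUTC"], df_week[predictor] / pred_range, label=predictor)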
plt.xticks(pd.date_range(start=t_s, end=t_e, freq='2D'), rotation=45)
plt.xlabel("Time")
plt.ylabel("Normalized values (-)")
plt.legend()  # Show legend
plt.show()
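
The plot above divides each series by its range, which maps values to [0, 1] only when the minimum is zero. For predictors that can be negative (a temperature, for example), full min-max scaling also subtracts the minimum. The sketch below shows one way to do this. The helper `min_max_scale` and the example numbers are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def min_max_scale(series, reference=None):
    """Scale to [0, 1] using the min and max of `reference` (default: the series itself)."""
    ref = series if reference is None else reference
    span = ref.max() - ref.min()
    if span == 0:
        raise ValueError("a constant series cannot be min-max scaled")
    return (series - ref.min()) / span

# Made-up values: scale a short window using the range of the full series
full = pd.Series([-5.0, 0.0, 10.0, 15.0])
window = full.iloc[1:3]
scaled = min_max_scale(window, reference=full)
print(scaled.tolist())  # [0.25, 0.75]
```

Passing the full dataset as `reference` mirrors the plot above, where the weekly window is scaled by the range of `df_merged`.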

## Visualizing correlation

* You can use scatterplots to visualize the "similarity" between PV generation and other predictors.
* Below you see the scatterplot of PV generation and gti
* Check how the scatterplot looks for different predictors

In [None]:
plt.figure(figsize=(4, 4))
plt.scatter(df_merged["PV"], df_merged["gti"], s = 4)
plt.xlabel("PV generation")
plt.ylabel("Co-variate")
plt.show()

### Collinearity

* You may have noticed that a few predictors all have a high PCC
* Does every predictor carry **new** information? Or is it largely the same information?
* An extreme example of redundant information is a predictor which is simply another predictor multiplied by a constant number
* This effect is called collinearity; you can read more in the book "An Introduction to Statistical Learning", Section 3.3.3, item 6 (Collinearity)
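
A common way to quantify collinearity is the variance inflation factor (VIF), defined for predictor $j$ as $1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes it on synthetic data, using the fact that the VIFs are the diagonal entries of the inverse correlation matrix. The frame `X`, the column `air_temp` and all numbers are assumptions made for illustration. The `variance_inflation_factor` function imported at the top of this notebook should give the same values, provided the design matrix passed to it includes a constant column.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (made-up values): ghi is almost a rescaled copy of gti
rng = np.random.default_rng(1)
n = 1000
gti = rng.uniform(0, 1000, n)
X = pd.DataFrame({
    "gti": gti,
    "ghi": 0.9 * gti + rng.normal(0, 20, n),
    "air_temp": rng.normal(10, 8, n),
})

# VIF_j = 1 / (1 - R_j^2); this equals the j-th diagonal entry
# of the inverse of the predictors' correlation matrix
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif.round(1))
```

A VIF close to 1 means the predictor is nearly independent of the others. As a rule of thumb, values above 5 or 10 are usually taken as a sign of problematic collinearity.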

## Correlation with lagged values

* Apart from correlation of PV generation with weather features, you may find similarities between PV generation at different time steps
* For example, between the current production $PV_t$ and that of one hour before $PV_{t-1}$ or two hours before $PV_{t-2}$
* This is called **autocorrelation** because there is correlation between values in the **same** time-series

The code below takes the PV generation series, adds lagged copies of it as separate columns, and calculates the correlation between them.

In [None]:
# Work on a copy that contains only the PV column
df = df_merged[["PV"]].copy()

num_lags = 4  # Adjust this value as needed

# Create lagged columns
for i in range(1, num_lags + 1):
    df[f'lag{i}'] = df['PV'].shift(i)

# Add lagged values 1 and 2 days before
df['lag24'] = df['PV'].shift(24)
df['lag48'] = df['PV'].shift(48)

# Drop rows with NaN values (due to shifting)
df = df.dropna()

Corr = df.corr(method='pearson')
Corr = Corr.round(2)
print(Corr)

# Scatterplot of PV generation against its lag-1 value
plt.figure(figsize=(2, 2))
plt.scatter(df["PV"], df["lag1"], s = 4)
plt.xlabel("PV generation")
plt.ylabel("Lag1 of PV generation")
plt.show()
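
The same autocorrelations can be computed without building lagged columns, using the `autocorr` method of a pandas Series. The sketch below applies it to a synthetic hourly signal with a daily cycle. The series `pv_demo` and its shape are made up for illustration; on the real data one would call `df_merged["PV"].autocorr(lag=...)` instead.

```python
import numpy as np
import pandas as pd

# Synthetic hourly "PV-like" signal: zero at night, a sine-shaped bump between
# 06:00 and 18:00, scaled by a random factor each hour (made-up values)
rng = np.random.default_rng(2)
hours = np.arange(24 * 30)
daily_shape = np.clip(np.sin(2 * np.pi * (hours % 24 - 6) / 24), 0, None)
pv_demo = pd.Series(daily_shape * rng.uniform(0.5, 1.0, hours.size))

# Pearson correlation between the series and a copy of itself shifted by each lag
lags = [1, 2, 12, 24, 48]
acf = pd.Series({lag: pv_demo.autocorr(lag=lag) for lag in lags})
print(acf.round(2))
```

In this toy signal the correlation is negative at a lag of 12 hours (day against night) and high again at 24 and 48 hours, which is the kind of pattern that motivates the `lag24` and `lag48` columns above.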

## Discussion

Based on the analysis you carried out, discuss the following with your colleagues:

* How would you use all this information to forecast PV generation?
* What predictors would you use?
* Do you think all predictors are useful, or do some carry redundant information (due to collinearity)?
* What is the fundamental difference between using lagged values and using external predictors (like irradiance, cloud opacity, etc.)?