# Solar Energy Forecasting

## Introduction

This notebook performs exploratory data analysis on the historical weather and solar generation data that will later be used to train a forecasting model.

## Historical Data (Radiation, Air temperature, Wind speed)
Weather data for the location Freiburg im Breisgau.  
The year 2020 is used because it was the most recent year available on the website.  
Source: https://joint-research-centre.ec.europa.eu/photovoltaic-geographical-information-system-pvgis/pvgis-tools/hourly-radiation_en

## Historical Data (Solar energy production)
Energy data for the region TransnetBW (Baden-Württemberg).  
Using data from the year 2020.  
Source: https://www.smard.de (direct download link: https://www.smard.de/en/downloadcenter/download-market-data/?downloadAttributes=%7B%22selectedCategory%22:1,%22selectedSubCategory%22:1,%22selectedRegion%22:%22TransnetBW%22,%22selectedFileType%22:%22CSV%22,%22from%22:1577833200000,%22to%22:1609541999999%7D)

## Weather forecast data (Radiation, Air temperature, Wind speed)
Weather forecast source: https://www.weatherapi.com/my/ (14 days are available for free).  
After making a prediction, we need to wait until the actual generation data is published on smard.de, download it again, and then compare the prediction with the actual values.
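
The comparison step could be sketched as follows. This is a minimal illustration, not the notebook's evaluation code: the function names `mae` and `rmse` and the sample values are hypothetical, and using MAE and RMSE as error measures is an assumption.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences (in MWh)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences (in MWh)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical hourly values: generation later downloaded from smard.de vs. our forecast
actual_mwh = [0.0, 120.5, 980.0, 1500.25]
predicted_mwh = [0.0, 100.5, 1000.0, 1450.25]

print("MAE :", mae(actual_mwh, predicted_mwh))
print("RMSE:", rmse(actual_mwh, predicted_mwh))
```

Both measures are in the same unit as the target (MWh). RMSE penalizes large individual misses more strongly than MAE, so reporting both gives a rough sense of whether errors are spread evenly or dominated by a few hours.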



In [66]:
import pandas as pd
import pickle

# use historical radiation data + generated energy data to later on train the model

# Load historical data
with open("../data/raw data/historical_solar_data.pkl", "rb") as file:
    X = pickle.load(file)
#print("X\n", X)

y = pd.read_csv("../data/raw data/Actual_generation_2020.csv", delimiter=";", dtype=str)
#print("y\n", y)

In [67]:
y = y[["Start date", "Photovoltaics [MWh] Calculated resolutions"]]
y = y.iloc[:-24,:] # remove last 24 rows (1 day), in order to have the full year 2020
# 8784 rows = 8760h + 24h (because 2020 was a leap year)
y = y.rename(columns={"Start date": "time", "Photovoltaics [MWh] Calculated resolutions": "energy_generated"})
y["time"] = pd.to_datetime(y["time"], format="%b %d, %Y %I:%M %p")
y["energy_generated"] = pd.to_numeric(y["energy_generated"].str.replace(",", ""), errors="coerce")
print("Y\n", y)

# Shift the "time" column of X back by 10 minutes so that it lands on the full hour, matching the timestamps of y
X["time"] = pd.to_datetime(X["time"], format="%Y%m%d:%H%M")
X["time"] = X["time"] - pd.Timedelta(minutes=10)
print("X\n", X)

# Earlier issue: very large values (XXXXX.XX) appeared in the energy_generated column because
# entries in the original CSV were wrongly interpreted (as dates) on load.
# Resolved by reading the CSV with dtype=str in read_csv and converting to numeric explicitly.

# Shifting X by a fixed 10 minutes is a simplification rather than a robust alignment; it is only a first step.

# Series.max() skips any NaN values that errors="coerce" may have produced
print(y["energy_generated"].max())


Y
                     time  energy_generated
0    2020-01-01 00:00:00               0.0
1    2020-01-01 01:00:00               0.0
2    2020-01-01 02:00:00               0.0
3    2020-01-01 03:00:00               0.0
4    2020-01-01 04:00:00               0.0
...                  ...               ...
8779 2020-12-31 19:00:00               0.0
8780 2020-12-31 20:00:00               0.0
8781 2020-12-31 21:00:00               0.0
8782 2020-12-31 22:00:00               0.0
8783 2020-12-31 23:00:00               0.0

[8784 rows x 2 columns]
X
                     time  G(i)   T2m  WS10m
0    2020-01-01 00:00:00   0.0 -1.09   2.28
1    2020-01-01 01:00:00   0.0 -1.10   2.07
2    2020-01-01 02:00:00   0.0 -1.15   2.00
3    2020-01-01 03:00:00   0.0 -1.37   2.00
4    2020-01-01 04:00:00   0.0 -1.52   2.07
...                  ...   ...   ...    ...
8779 2020-12-31 19:00:00   0.0 -0.78   2.07
8780 2020-12-31 20:00:00   0.0 -0.93   1.86
8781 2020-12-31 21:00:00   0.0 -1.06   1.66
8782 2020-12-
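
The cell above notes that shifting by a fixed 10 minutes is only a first step. One alternative, sketched here under assumptions, is to floor the timestamps to the full hour and then join the two tables on `time`. The frames `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and the join itself is an assumed later step that this notebook does not show.

```python
import pandas as pd

# Hypothetical toy frames that mimic the layout of X and y in this notebook
X_demo = pd.DataFrame({
    "time": pd.to_datetime(
        ["20200101:0010", "20200101:0110", "20200101:0210"],
        format="%Y%m%d:%H%M",
    ),
    "G(i)": [0.0, 0.0, 12.5],
})
y_demo = pd.DataFrame({
    "time": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]
    ),
    "energy_generated": [0.0, 0.0, 3.25],
})

# Floor to the full hour instead of subtracting a hard-coded offset
X_demo["time"] = X_demo["time"].dt.floor("h")

# validate="one_to_one" raises if either side has duplicate timestamps
merged = X_demo.merge(y_demo, on="time", how="inner", validate="one_to_one")
print(merged)
```

Flooring does not depend on the exact minute offset in the source, and the `validate` argument makes the merge fail loudly if a timestamp occurs more than once on either side.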

In [68]:
print(X.dtypes)
print("X describe:", X.describe())

time     datetime64[ns]
G(i)            float64
T2m             float64
WS10m           float64
dtype: object
X describe:                       time         G(i)          T2m        WS10m
count                 8784  8784.000000  8784.000000  8784.000000
mean   2020-07-01 23:30:00   145.073357     9.428526     2.023866
min    2020-01-01 00:00:00     0.000000   -10.250000     0.000000
25%    2020-04-01 11:45:00     0.000000     3.360000     1.310000
50%    2020-07-01 23:30:00     0.000000     8.940000     1.720000
75%    2020-10-01 11:15:00   200.257500    15.130000     2.480000
max    2020-12-31 23:00:00   956.010000    31.790000     9.030000
std                    NaN   232.219707     7.550996     1.161367


In [69]:
print(y.dtypes)
print(max(y["energy_generated"]))
print("Y describe:", y.describe())

time                datetime64[ns]
energy_generated           float64
dtype: object
4452.75
Y describe:                                 time  energy_generated
count                           8784       8784.000000
mean   2020-07-02 00:04:25.573770496        709.986766
min              2020-01-01 00:00:00          0.000000
25%              2020-04-01 12:45:00          0.000000
50%              2020-07-02 00:30:00         15.500000
75%              2020-10-01 12:15:00       1077.312500
max              2020-12-31 23:00:00       4452.750000
std                              NaN       1105.954943
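
The two `describe()` outputs differ in their `time` statistics: the mean is 2020-07-01 23:30:00 for X but 2020-07-02 00:04:25 for y, and the quartiles are one hour apart. This suggests that the timestamps in y do not form a regular hourly grid. One possible cause, which is an assumption and not confirmed by this notebook, is that one source uses local time with daylight-saving transitions while the other uses UTC. A minimal sketch for checking this follows; the helper `check_hourly_grid` and the sample data are hypothetical.

```python
import pandas as pd

def check_hourly_grid(times: pd.Series) -> dict:
    """Report duplicated and missing timestamps relative to a regular hourly grid."""
    expected = pd.date_range(times.min(), times.max(), freq="h")
    duplicated = times[times.duplicated()].tolist()
    missing = expected.difference(pd.DatetimeIndex(times.drop_duplicates())).tolist()
    return {"duplicated": duplicated, "missing": missing}

# Hypothetical sample with a one-hour gap, as a spring clock change would produce
demo = pd.Series(pd.to_datetime([
    "2020-03-29 00:00",
    "2020-03-29 01:00",
    "2020-03-29 03:00",
    "2020-03-29 04:00",
]))

report = check_hourly_grid(demo)
print(report)
```

Running `check_hourly_grid(y["time"])` on the real data would show whether any hours are missing or repeated. If they are, the two series need to be brought to a common time reference before they are compared row by row.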
