# DAZLS model

DAZLS is the energy-splitting method in openSTEF. It splits a forecast into three components: wind (onshore), solar and other.

It trains a single splitting model that can be used for every prediction job. As input it uses data from multiple substations with known components, and it outputs the predicted solar and wind power for unknown target substations.

The model consists of two steps, which are applied in sequence:
1. Domain model (any data-driven model can be used);
2. Adaptation model (any data-driven model can be used).
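
The two-step structure can be sketched as follows. This is only an illustration on synthetic data, not the openSTEF implementation: the choice of `KNeighborsRegressor` for both steps, the split into `x_domain` and `x_adapt` features, and the helper `predict_split` are all assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data: 3 "domain" features and 2 "adaptation" features (assumed split).
x_domain = rng.random((200, 3))
x_adapt = rng.random((200, 2))
# Two synthetic targets standing in for the wind and solar components.
y = np.column_stack([x_domain[:, 0] * x_adapt[:, 0],
                     x_domain[:, 1] * x_adapt[:, 1]])

# Step 1: the domain model maps the domain features to a first estimate.
domain_model = KNeighborsRegressor(n_neighbors=5).fit(x_domain, y)
first_estimate = domain_model.predict(x_domain)

# Step 2: the adaptation model refines that estimate using the extra features.
adaptation_model = KNeighborsRegressor(n_neighbors=5).fit(
    np.hstack([first_estimate, x_adapt]), y
)

def predict_split(new_domain, new_adapt):
    """Run both steps in sequence and return an (n_samples, 2) array."""
    estimate = domain_model.predict(new_domain)
    return adaptation_model.predict(np.hstack([estimate, new_adapt]))

split = predict_split(x_domain[:5], x_adapt[:5])
print(split.shape)
```

Because the second model only sees the output of the first plus its own features, either step can be swapped for another regressor without changing the overall flow.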

#### For reference, see: [dazls.rst](https://github.com/OpenSTEF/openstef/tree/main/docs/dazls.rst)

## This notebook contains:

1. Preprocessing of the data which will be used for the model;

We read the raw data from the "combined_data" folder, preprocess it, and save the result, including the added metadata, in the "prep_data" folder. The DAZLS model is then run on the files in "prep_data".

2. Training DAZLS and generating a prediction for one out-of-sample CSV file;


3. Load and store the model.

The resulting dazls_stored.sav file is used in openSTEF for the component forecast ([create_component_forecast](https://github.com/OpenSTEF/openstef/blob/main/openstef/pipeline/create_component_forecast.py)).

In [None]:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import glob
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
import random 
from sklearn.utils import shuffle
import joblib

from openstef.model.regressors.dazls import Dazls

pd.options.plotting.backend = 'plotly'

#Seed and preparation
random.seed(999)
np.random.seed(999)

# Containers for the station data frames and the station names
combined_data=[]
station_name=[]

## Preprocess the data
Add metadata to each raw file and save the result in the "prep_data" folder.
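
The cleaning and metadata steps can be tried on a small frame first. The sketch below uses made-up column names and values; it only mirrors the pattern of replacing infinities, forward-filling, and broadcasting a per-column statistic to a constant column.

```python
import numpy as np
import pandas as pd

# Toy station data; column names and values are invented for illustration.
toy = pd.DataFrame(
    {"load": [1.0, np.inf, 3.0, np.nan], "radiation": [0.0, 2.0, np.nan, 4.0]},
    index=pd.date_range("2023-01-01", periods=4, freq="15min", name="datetime"),
)

# Treat infinities as missing, then carry the last valid value forward.
clean = toy.replace([np.inf, -np.inf], np.nan).ffill()

# Broadcast each column's variance to a constant metadata column.
variance = clean.var()
for i, name in enumerate(["load", "radiation"]):
    clean[f"var{i}"] = variance[name]

print(clean)
```

Forward-filling leaves a gap only when the first row of a column is missing, so it is worth checking `clean.isna().sum()` on real data.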

In [None]:
# Read, create metadata and save in prep_data folder
os.makedirs('prep_data', exist_ok=True)  # to_csv does not create missing folders
for file_name in glob.glob('combined_data/*.csv'):
    
    #Read and fill missing values
    x = pd.read_csv(file_name, low_memory=False,parse_dates=["datetime"],index_col=0)
    x["datetime"]=pd.to_datetime(x["datetime"])
    x=x.set_index('datetime')
    x.replace([np.inf, -np.inf], np.nan, inplace=True)
    x=x.ffill()
    # x=x.interpolate(method='ffill')

    ## Get variance metadata ####
    var=x.iloc[:,:3].var()
    for i in range(3):
        x.loc[:, 'var'+str(i)] = var.iloc[i]
    ### end get variance ####

    ## Get sem metadata ####
    sem=x.iloc[:,:3].sem()
    for i in range(3):
        x.loc[:, 'sem'+str(i)] = sem.iloc[i]
    ### end get sem ####


    ## Get min-max capacity physical metadata ####
    mini=x.iloc[:,3:5].min()
    maxi=x.iloc[:,3:5].max()
    for i in range(2):
        x.loc[:, 'min'+str(i)] = mini.iloc[i]
    for i in range(2):
        x.loc[:, 'max'+str(i)] = maxi.iloc[i]    
    ### end get min-max ####
    
    combined_data.append(x)
    sn=os.path.basename(file_name)
    station_name.append(sn[:len(sn)-4])
    x.to_csv("prep_data/"+sn, index=True)

In [None]:
combined_data = []
station_name = []

# Read prepared data
for file_name in glob.glob('prep_data/*.csv'):
    x = pd.read_csv(file_name, low_memory=False, parse_dates=["datetime"])
    x["datetime"] = pd.to_datetime(x["datetime"])
    x = x.set_index('datetime')
    x.columns = [col.lower() for col in x.columns]
    combined_data.append(x)
    sn = os.path.basename(file_name)
    station_name.append(sn[:len(sn) - 4])


# Split the data into train and test sets (the first substation is used for testing)
training_data = pd.concat(combined_data[1:])
test_data = combined_data[0]
target_columns =['total_solar_part', 'total_wind_part']
feature_columns = [x for x in test_data.columns if x not in target_columns]
print('Testing station:',station_name[0])
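
The split above holds out one station and trains on the rest. The same idea can be written as a small helper so that any station can take the test role. The frames, names and the function `leave_one_out` below are hypothetical stand-ins, not part of openSTEF.

```python
import pandas as pd

# Hypothetical stand-ins for combined_data and station_name.
frames = [
    pd.DataFrame({"total_load": [i, i + 1], "total_wind_part": [0.1 * i, 0.2 * i]})
    for i in range(3)
]
names = ["station_a", "station_b", "station_c"]

def leave_one_out(frames, names, test_index):
    """Return (train, test, test_name) with one station held out for testing."""
    train = pd.concat([f for j, f in enumerate(frames) if j != test_index])
    return train, frames[test_index], names[test_index]

for k in range(len(frames)):
    train, test, held_out = leave_one_out(frames, names, k)
    print(held_out, len(train), len(test))
```

Note that `glob.glob` returns files in arbitrary order, so wrapping it in `sorted(...)` is one way to make the choice of the "first" station reproducible.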

## Initialize and fit the DAZLS model

In [None]:
model = Dazls()
# Fit model
model.fit(training_data.loc[:,feature_columns], training_data.loc[:,target_columns])

## Generate a prediction
In this section, a prediction is generated using the DAZLS model.

In [None]:
# get predicted y
y = model.predict(test_data.loc[:,feature_columns])

In [None]:
# Examine the results, which are stored in the variable y. The other variables are added to these results by copying the test_data.
result = test_data.loc[:,target_columns].copy()
result['DAZLS_wind_split'] = y[:,0]
result['DAZLS_solar_split'] = y[:,1]
result.iloc[60:]

## Evaluate the results
Examine the results using the built-in metrics and visualisation.
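
For readers unfamiliar with the two metrics, the sketch below computes RMSE and R2 for a single series with NumPy. It is not the openSTEF implementation, and how the built-in score aggregates over the two target columns is not covered here; the helper names `rmse` and `r2` are chosen for this example.

```python
import numpy as np

def rmse(truth, prediction):
    """Root-mean-square error between two one-dimensional series."""
    truth = np.asarray(truth, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.sqrt(np.mean((truth - prediction) ** 2)))

def r2(truth, prediction):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    truth = np.asarray(truth, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    ss_res = np.sum((truth - prediction) ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

truth = [1.0, 2.0, 3.0, 4.0]
prediction = [1.1, 1.9, 3.2, 3.8]
rmse_value = rmse(truth, prediction)
r2_value = r2(truth, prediction)
print(rmse_value, r2_value)
```

RMSE is expressed in the units of the target, while R2 is unitless and equals 1 for a perfect prediction.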

In [None]:
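# Print prediction performance.
# y holds wind in column 0 and solar in column 1 (as used for the result table),
# so the true values are passed in the same column order.
RMSE, R2 = model.score(test_data.loc[:, ['total_wind_part', 'total_solar_part']], y)

print(f"The root-mean-square error (RMSE) is {RMSE} and the R2 score is {R2}")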

In [None]:
# Visualize the results by plotting the results of the DAZLS model.

fig_result=result.plot()
fig_result.show()

In [None]:
# Compare the wind split results to the actual value
fig_compare_result_wind=pd.concat([result['DAZLS_wind_split'], test_data['total_wind_part']], axis=1).plot()
fig_compare_result_wind.show()

## Store the model

In [None]:
# Store the dazls file
filename = 'dazls_stored.sav'
joblib.dump(model, filename)
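
Loading the stored file back is the mirror image of the dump above. The round trip is sketched here with a plain dictionary as a stand-in for the fitted model and a temporary folder, so it runs without the training data.

```python
import os
import tempfile

import joblib

# A plain dictionary stands in for the fitted model in this sketch.
stand_in_model = {"name": "dazls", "n_targets": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dazls_stored.sav")
    joblib.dump(stand_in_model, path)  # store
    restored = joblib.load(path)       # load

print(restored == stand_in_model)
```

With the real model, `joblib.load('dazls_stored.sav')` returns an object whose `predict` can be called directly. Since joblib files are pickle-based, only load files from a trusted source.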