# Hyperparameter search in non-sequential pipelines

This notebook shows how to run a hyperparameter search over a pyWATTS pipeline. We first introduce the use case and then set up the search.



## Use-Case 1: Simple Getting Started

## Use-Case 2: More Advanced

In this notebook, we consider the following simple forecasting scenario: we aim to forecast the day-ahead electricity price. Because the electricity price depends on the electrical demand, we create a pipeline that forecasts the electricity demand and uses this forecast as an input for the electricity price forecast. Calendar features serve as additional information.
The pipeline uses the following transformers, and we search for the best hyperparameters of each:
* CalendarExtraction
  * List of features
* Scaler for the electricity price
* Scaler for the electricity demand
* Forecaster for the electricity price
* Forecaster for the electricity demand


In [1]:
# Import the third-party modules used by the pipeline
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# Import the pyWATTS pipeline, computation modes, and callbacks
from pywatts.callbacks import LinePlotCallback
from pywatts_pipeline.core.util.computation_mode import ComputationMode
from pywatts_pipeline.core.pipeline import Pipeline
# Import the pyWATTS modules and summaries used in the pipeline
from pywatts.modules import CalendarExtraction, CalendarFeature, SKLearnWrapper
from pywatts.summaries import RMSE
from pywatts.modules.preprocessing.select import Select


In [2]:
# Create a pipeline
pipeline = Pipeline(path="../results")

# Extract dummy calendar features, using holidays from Germany
# NOTE: CalendarExtraction can't return multiple features.
calendar = CalendarExtraction(continent="Europe",
                              country="Germany",
                              features=[CalendarFeature.month, CalendarFeature.weekday,
                                        CalendarFeature.weekend],
                              name="calendar"
                              )(x=pipeline["load_power_statistics"])

# Scale the data using a standard SKLearn scaler
power_scaler = SKLearnWrapper(module=StandardScaler(), name="scaler")
scale_power_statistics = power_scaler(x=pipeline["load_power_statistics"])

# Create lagged time series to later be used as regressors
lag_features = Select(start=-2, stop=0, step=1, name="lag_features")(x=scale_power_statistics)

target_multiple_output = Select(start=0, stop=24, step=1, name="sampled_data")(x=scale_power_statistics)

# Create a linear regression that uses the lagged values and calendar features to predict the target
# NOTE: SKLearnWrapper collects all keyword inputs itself and fits them against the target.
#       Alternatively, a dedicated join/collect class could be implemented.
regressor_power_statistics = SKLearnWrapper(
    module=LinearRegression(fit_intercept=True)
)(
    features=lag_features,
    calendar=calendar,
    target=target_multiple_output,
)

# Rescale the predictions back to the original scale
inverse_power_scale = power_scaler(
    x=regressor_power_statistics, computation_mode=ComputationMode.Transform,
    method="inverse_transform", callbacks=[LinePlotCallback("rescale")]
)

# Calculate the root mean squared error (RMSE) between the rescaled predictions and the target values
# and save it as a CSV file
rmse = RMSE(name="rmse")(y_hat=inverse_power_scale, y=target_multiple_output)

pipeline.set_score("rmse")  # TODO: specify whether higher or lower is better



In [3]:
#pipeline.draw()

# Background information about hyperparameter search with scikit-learn/sktime

In [4]:
pipeline.steps

{'load_power_statistics': <pywatts_pipeline.core.steps.start_step.StartStep at 0x1925a6ab430>,
 'calendar': <pywatts_pipeline.core.steps.step.Step at 0x1925a6ab700>,
 'scaler': <pywatts_pipeline.core.steps.step.Step at 0x1925a6c5cd0>,
 'lag_features': <pywatts_pipeline.core.steps.step.Step at 0x1925a708970>,
 'sampled_data': <pywatts_pipeline.core.steps.step.Step at 0x1925a708340>,
 'LinearRegression': <pywatts_pipeline.core.steps.step.Step at 0x1925a708100>,
 'scaler_1': <pywatts_pipeline.core.steps.step.Step at 0x1925a7089a0>,
 'rmse': <pywatts_pipeline.core.steps.summary_step.SummaryStep at 0x1925a708d90>}
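
The step names listed above are the prefixes used by the keys of a search grid: scikit-learn style search tools address nested parameters as `<component>__<parameter>`. The following is a minimal sketch of that routing in plain Python, not the pyWATTS implementation. The `Step` class and `set_nested_params` helper are hypothetical stand-ins, and the module names are plain strings used only for illustration.

```python
class Step:
    """Hypothetical stand-in for a pipeline step with settable attributes."""

    def __init__(self, **params):
        self.__dict__.update(params)


def set_nested_params(steps, **params):
    """Route keys of the form '<step>__<param>' to the matching step."""
    for key, value in params.items():
        step_name, sep, param_name = key.partition("__")
        if not sep:
            raise ValueError(f"Expected '<step>__<param>', got {key!r}")
        if step_name not in steps:
            raise KeyError(f"Unknown step {step_name!r}")
        setattr(steps[step_name], param_name, value)
    return steps


# Illustrative steps; the values are placeholders, not real estimator objects
steps = {
    "scaler": Step(module="StandardScaler"),
    "lag_features": Step(start=-2, stop=0, step=1),
}

set_nested_params(steps, scaler__module="MinMaxScaler", lag_features__start=-24)
print(steps["scaler"].module, steps["lag_features"].start)
```

Only the addressed attributes change; everything else on a step keeps its previous value.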

In [5]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

# Define the hyperparameter search space; keys follow the "<step name>__<parameter>" pattern
params = {
    "LinearRegression__module": [LinearRegression(), MLPRegressor()],
    "scaler__module": [MinMaxScaler(), StandardScaler()],
    "calendar__features": [
        # [CalendarFeature.weekend],
        [CalendarFeature.month_cos, CalendarFeature.month_sine, CalendarFeature.weekend],
        [CalendarFeature.hour_cos, CalendarFeature.hour_sine, CalendarFeature.weekend],
    ],
    "lag_features__start": [-24, -1],
}
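
To get a feeling for the size of this search space, the sketch below expands a grid of the same shape into its individual candidates using only the standard library. It is an illustration of how an exhaustive grid search enumerates combinations, not the code that scikit-learn runs. The `expand_grid` helper is hypothetical, and the estimators and calendar features are replaced by plain strings.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed values as a list of dicts."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Same shape as the grid above, with string placeholders instead of objects
grid = {
    "LinearRegression__module": ["LinearRegression", "MLPRegressor"],
    "scaler__module": ["MinMaxScaler", "StandardScaler"],
    "calendar__features": [
        ("month_cos", "month_sine", "weekend"),
        ("hour_cos", "hour_sine", "weekend"),
    ],
    "lag_features__start": [-24, -1],
}

candidates = expand_grid(grid)
print(len(candidates))  # 2 * 2 * 2 * 2 combinations
```

Each additional parameter with two values doubles the number of candidates, so the grid is best kept small when every candidate requires retraining the whole pipeline.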

In [6]:
from sklearn.model_selection import GridSearchCV

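# Load the example data and parse the "time" column as a datetime index
# (infer_datetime_format is deprecated since pandas 2.0 and is omitted here)
data = pd.read_csv("../data/getting_started_data.csv",
                   index_col="time",
                   parse_dates=["time"],
                   sep=",")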


# Use the first 6000 rows for training and the remaining rows for testing
train = data.iloc[:6000, :]
test = data.iloc[6000:, :]


result, summary = pipeline.train(data=train)

#pipeline.test(data=test)

FileNotFoundError: [Errno 2] No such file or directory: '../data/getting_started_data.csv'

In [None]:
pipeline.score(test)


In [None]:
pipeline.get_params(deep=True)

In [None]:
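# Rebuild a pipeline from its parameters; a separate name avoids shadowing the `test` data
rebuilt = Pipeline(**pipeline.get_params(deep=True))
id(rebuilt.steps["scaler"].module), id(rebuilt.steps["scaler_1"].module)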

In [None]:
pipeline.get_params(deep=True)["steps"][1][0], pipeline.get_params(deep=True)["steps"][5][0]

In [None]:
id(pipeline.get_params()["steps"][1][0]), id(pipeline.get_params()["steps"][5][0]), id(pipeline.steps["scaler"].module),id(pipeline.steps["scaler_1"].module)

In [None]:
pipeline.get_params(deep=True)

In [None]:
Pipeline(**pipeline.get_params()).get_params(deep=True)

In [None]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(test_size=168*4)
pipeline_cv = GridSearchCV(pipeline, param_grid=params, cv=tscv)
pipeline_cv.fit(data)
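
The splitter above keeps the temporal order: every fold trains on an initial segment of the data and tests on the block that follows it. The sketch below reproduces this expanding-window scheme in plain Python so the fold boundaries can be inspected. It is intended to mimic the behaviour of `TimeSeriesSplit` with its default of five splits and no gap, and is not the library code. The helper name and the sample count of 8760 (one year of hourly values) are assumptions. With hourly data, `168*4` corresponds to four weeks.

```python
def expanding_window_splits(n_samples, n_splits=5, test_size=168 * 4):
    """Yield (train, test) index ranges with a growing training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for test_start in range(first_test_start, n_samples, test_size):
        train = range(0, test_start)
        test = range(test_start, test_start + test_size)
        yield train, test


# Hypothetical data length: one year of hourly observations
splits = list(expanding_window_splits(n_samples=8760))

for train, test in splits:
    print(f"train [0, {train.stop})  test [{test.start}, {test.stop})")
```

Because the test blocks are taken from the end of the series, the earliest fold has the shortest training window, and no fold ever trains on data that lies after its test block.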

In [None]:
pipeline_cv.best_params_

In [None]:
pipeline_cv.best_score_
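
`GridSearchCV` treats larger scores as better when it ranks candidates. Since the pipeline's score is set to an RMSE, where lower is better, it is worth checking which convention `pipeline.score` follows before interpreting `best_score_`. The sketch below shows how the selected candidate changes with the convention. The `pick_best` helper and the RMSE values are hypothetical and are not results from this notebook.

```python
def pick_best(mean_scores, greater_is_better):
    """Return the index of the best candidate under the stated convention."""
    sign = 1.0 if greater_is_better else -1.0
    return max(range(len(mean_scores)), key=lambda i: sign * mean_scores[i])


# Hypothetical mean RMSE values for three candidates
mean_rmse = [0.42, 0.31, 0.57]

# Treating the raw RMSE as "higher is better" selects the worst candidate
best_if_maximised = pick_best(mean_rmse, greater_is_better=True)

# Negating the error (or declaring "lower is better") selects the intended one
best_if_minimised = pick_best(mean_rmse, greater_is_better=False)

print(best_if_maximised, best_if_minimised)
```

A common way to reconcile the two conventions is to return the negated error as the score, so that maximising the score minimises the error.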

In [None]:
pd.DataFrame(pipeline_cv.cv_results_)