# Workshop part 1 | Learn how to train a model
In this first part of the workshop, you prepare everything needed to train a model and then run the actual training.

The learning points are:
- How a prediction job works, and what the most important parameters mean; 
- What data is required;
- Experience with the train model pipeline;
- How the model gets automatically stored and loaded;
- How to get info on the trained model.

In [1]:
# Import all required packages
import openstef
from openstef.data_classes.prediction_job import PredictionJobDataClass
from openstef.pipeline.train_model import train_model_pipeline

from IPython.display import IFrame
import pandas as pd 

# Set plotly as the default pandas plotting backend
pd.options.plotting.backend = 'plotly'



2024-02-19 09:01:56 [info     ] Proloaf not available, setting constructor to None


## Define the prediction job

OpenSTEF uses prediction jobs to define the properties of training and prediction. 

Exercise: define your own prediction job using the following parameters: 
- latitude = 53.0, longitude = 5.7
    - This is used to calculate the derived solar features (direct normal irradiance and the global tilted irradiance).*
    - Also used to retrieve weather data in openstef-dbc (database connector)
- horizon minutes = 0.25
    - The horizon of the desired forecast: how far into the future we want to predict. The value 0.25 corresponds to a quarter of an hour, so at the moment of prediction you predict 15 minutes into the future. For example, a prediction made at 13:00 is a prediction for 13:15.
- quantile: 10, 30, 50, 70 and 90 percent
    - This provides a confidence interval within OpenSTEF, based on the standard deviation.
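
The horizon example can be checked with a few lines of standard-library Python. This is only an illustration of the arithmetic; the variable names are made up for this sketch and are not part of OpenSTEF.

```python
from datetime import datetime, timedelta

# Hypothetical names for illustration only; not part of the OpenSTEF API.
prediction_moment = datetime(2023, 1, 1, 13, 0)  # the forecast is made at 13:00
horizon = timedelta(minutes=15)                  # 15 minutes ahead (0.25 hours)

# The timestamp the forecast applies to.
target_time = prediction_moment + horizon
print(target_time.strftime("%H:%M"))  # 13:15
```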

Hint: look at the documentation [here](https://openstef.github.io/openstef/openstef.data_classes.html#module-openstef.data_classes.prediction_job)

*The code that calculates the direct normal irradiance and the global tilted irradiance can be found in ``weather_features.py``, [here](https://github.com/OpenSTEF/openstef/blob/main/openstef/feature_engineering/weather_features.py).
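
To get a feel for how quantiles relate to a standard deviation, the sketch below assumes normally distributed forecast errors and uses made-up numbers for the point forecast and the standard deviation. It illustrates the idea only and should not be read as OpenSTEF's actual implementation.

```python
from statistics import NormalDist

# Assumed values for illustration: a point forecast of 10 MW with a
# standard deviation of 2 MW on the forecast error.
point_forecast_mw = 10.0
stdev_mw = 2.0
quantiles = [0.10, 0.30, 0.50, 0.70, 0.90]

# Assumption: forecast errors follow a normal distribution.
error_distribution = NormalDist(mu=point_forecast_mw, sigma=stdev_mw)

# inv_cdf returns the value below which the given fraction of outcomes falls.
bands = {q: error_distribution.inv_cdf(q) for q in quantiles}

for q, value in bands.items():
    print(f"quantile {q:.2f}: {value:.2f} MW")
```

Under these assumptions the 0.50 quantile coincides with the point forecast, and the 0.10 and 0.90 quantiles lie symmetrically around it.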

In [2]:
# Define the properties of training and prediction. We call this a 'prediction job'.
pj = dict(
    id=287,
    model='xgb',
    quantiles=[0.10, 0.30, 0.50, 0.70, 0.90],
    forecast_type="demand",
    lat=53.0,
    lon=5.7,
    horizon_minutes=0.25,
    resolution_minutes=15,
    name="workshop_exercise_1",
    save_train_forecasts=True,
)

# Convert the dictionary into the prediction job data class used by the pipelines.
pj = PredictionJobDataClass(**pj)

## Prepare and analyse the input data
OpenSTEF expects the input data in a specific format: a pandas dataframe.

Exercise: look at the table and plots below and try to answer the following questions: 
- What type of features do you see in the input data? 
- How much time is there between two data points? 
- Look at the plots for radiation and windspeed. Do you see any patterns? 
    - Hint: do you see something happening to the load when there is a peak in either radiation or wind speed? Can you explain why? 
    - Note: in these plots we zoomed in on a random week, for visibility purposes. 

Hint: you can zoom in on the plots to see more details.

In [3]:
input_data=pd.read_csv("../data/input_data_sun_heavy.csv", index_col=0, parse_dates=True)

In [4]:
pd.options.display.max_columns = None
display(input_data.head())

Unnamed: 0,load,clearSky_dlf,clearSky_ulf,clouds,humidity,mxlD,pressure,radiation,snowDepth,temp,winddeg,windspeed,windspeed_100m,APX,E1A_AMI_A,E1A_AMI_I,E1A_AZI_A,E1A_AZI_I,E1B_AMI_A,E1B_AMI_I,E1B_AZI_A,E1B_AZI_I,E1C_AMI_A,E1C_AMI_I,E1C_AZI_A,E1C_AZI_I,E2A_AMI_A,E2A_AMI_I,E2A_AZI_A,E2A_AZI_I,E2B_AMI_A,E2B_AMI_I,E2B_AZI_A,E2B_AZI_I,E3A_A,E3A_I,E3B_A,E3B_I,E3C_A,E3C_I,E3D_A,E3D_I,E4A_A,E4A_I
2023-01-01 00:00:00+00:00,,16.623535,16.470352,100.0,0.902977,1311.880981,100386.375,0.0,0.0,8.342224,261.046631,17.27199,25.281712,-1.46,4.1e-05,0.0,3.2e-05,0.0,7.9e-05,0.0,6.7e-05,0.0,6.6e-05,0.0,5.9e-05,0.0,3.5e-05,3e-07,2.5e-05,3e-07,6.6e-05,9.5e-07,5.4e-05,9.5e-07,5.6e-05,9.5e-07,5.6e-05,9.5e-07,5.6e-05,9.5e-07,5.6e-05,9.5e-07,7.9e-05,9.5e-07
2023-01-01 00:15:00+00:00,2.796667,16.922281,16.016937,100.0,0.89773,1306.568481,100393.054688,0.0,0.0,8.446587,262.409225,17.544581,25.427095,-1.46,4.1e-05,0.0,3e-05,0.0,7.9e-05,0.0,6.3e-05,0.0,6.6e-05,0.0,5.7e-05,0.0,3.5e-05,2.7e-07,2.5e-05,2.7e-07,6.6e-05,8.4e-07,5.5e-05,8.4e-07,5.8e-05,8.4e-07,5.8e-05,8.4e-07,5.8e-05,8.4e-07,5.8e-05,8.4e-07,7.9e-05,8.4e-07
2023-01-01 00:30:00+00:00,2.753333,17.221026,15.563522,100.0,0.892483,1301.255981,100399.734375,0.0,0.0,8.550949,263.77182,17.817173,25.572478,-1.46,4e-05,7e-08,3e-05,7e-08,7.7e-05,3.2e-07,6.2e-05,3.2e-07,6.2e-05,0.0,5.7e-05,0.0,3.5e-05,2.3e-07,2.5e-05,2.3e-07,6.6e-05,7.4e-07,5.5e-05,7.4e-07,5.8e-05,7.4e-07,5.8e-05,7.4e-07,5.8e-05,7.4e-07,5.8e-05,7.4e-07,7.9e-05,7.4e-07
2023-01-01 00:45:00+00:00,2.643333,17.519772,15.110107,100.0,0.887236,1295.943481,100406.414062,1.38824e-11,0.0,8.655312,265.134415,18.089765,25.717862,-1.46,3.8e-05,1.1e-07,2.8e-05,1.1e-07,7.4e-05,4.7e-07,5.8e-05,4.7e-07,6.1e-05,0.0,5.6e-05,0.0,3.5e-05,2.5e-07,2.5e-05,2.5e-07,6.5e-05,8.1e-07,5.4e-05,8.1e-07,5.7e-05,8.1e-07,5.7e-05,8.1e-07,5.7e-05,8.1e-07,5.7e-05,8.1e-07,7.9e-05,8.1e-07
2023-01-01 01:00:00+00:00,2.506667,17.818518,14.656693,100.0,0.881989,1290.630981,100413.09375,2.776479e-11,0.0,8.759674,266.497009,18.362356,25.863245,-1.52,3.7e-05,4e-08,2.8e-05,4e-08,7.2e-05,1.8e-07,5.7e-05,1.8e-07,5.9e-05,0.0,5.2e-05,0.0,3.4e-05,3.4e-07,2.5e-05,3.4e-07,6.4e-05,1.07e-06,5.3e-05,1.07e-06,5.8e-05,1.07e-06,5.8e-05,1.07e-06,5.8e-05,1.07e-06,5.8e-05,1.07e-06,7.9e-05,1.07e-06


In [5]:
# Split the input data into a training part and a testing part.
# The model should only be trained on the training part.
train_data = input_data.iloc[:-192, :]  # Everything except the final 192 rows, for training.
test_data = input_data.iloc[-192:, :]   # The final 192 rows, for testing.
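
The row counts are easier to interpret once converted to time. Assuming one row per 15 minutes, in line with `resolution_minutes=15` in the prediction job, the arithmetic can be sketched as follows; the variable names are only for this illustration.

```python
# Assumes one row per 15 minutes, matching resolution_minutes=15 in the prediction job.
resolution_minutes = 15
rows_per_day = 24 * 60 // resolution_minutes  # 96 rows per day

test_rows = 192                               # rows held back for testing
test_days = test_rows / rows_per_day          # 2.0 days

plot_rows = 729 - 57                          # rows selected by .iloc[57:729] in the plots
plot_days = plot_rows / rows_per_day          # 7.0 days, i.e. one week

print(rows_per_day, test_days, plot_days)
```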

In [6]:
fig_load=input_data["load"].iloc[57:729].plot()
fig_load.update_layout(
    xaxis_title='Timestamp',
    yaxis_title="Load [MW]"
)
fig_load.show()

In [7]:
fig_radiation=input_data["radiation"].iloc[57:729].plot()
fig_radiation.update_layout(
    xaxis_title='Timestamp',
    yaxis_title="Radiation"
)
fig_radiation.show()

In [8]:
fig_windspeed=input_data["windspeed"].iloc[57:729].plot()
fig_windspeed.update_layout(
    xaxis_title='Timestamp',
    yaxis_title="Windspeed"
)
fig_windspeed.show()

## Training the model
After defining the prediction job and preparing the input data, the model can be trained. 

Exercise: 
- Using the prediction job and input data prepared above, train a model using the OpenSTEF pipelines. 
- How much time did it take to train the model?

Hint: find the correct pipeline in the list provided on the OpenSTEF [website](https://openstef.github.io/openstef/user_guides.html).
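
For the timing question, one simple option is to measure wall-clock time around the training call. The sketch below uses a hypothetical stand-in function, `slow_step`, in place of the real pipeline call so that it runs on its own.

```python
import time


def slow_step():
    """Hypothetical stand-in for the training call; replace with the real one."""
    time.sleep(0.1)
    return "done"


# perf_counter is a monotonic clock intended for measuring short durations.
start = time.perf_counter()
result = slow_step()
elapsed_seconds = time.perf_counter() - start

print(f"Finished in {elapsed_seconds:.2f} s")
```

In a notebook, putting the `%%time` cell magic on the first line of the training cell gives similar information without extra code.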

In [9]:
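# Train the model. The pipeline returns its own train/validation/test split,
# which overwrites the train_data and test_data variables defined above.
train_data, validation_data, test_data = train_model_pipeline(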
    pj,
    train_data,
    check_old_model_age=False, 
    mlflow_tracking_uri="./mlflow_trained_models",
    artifact_folder="./mlflow_artifacts",
)

2024-02-19 09:02:00 [debug    ] MLflow tracking uri at init= ./mlflow_trained_models




2024-02-19 09:02:00 [info     ] Found 22 values of constant load (repeated values), converted to NaN value. cleansing_step=repeated_values frac_values=0.0006312950156388993 num_values=22 pj_id=287
2024-02-19 09:02:00 [info     ] Removed 22 NaN values          num_removed_values=22


2024/02/19 09:02:22 INFO mlflow.tracking.fluent: Experiment with name '287' does not exist. Creating a new experiment.


2024-02-19 09:02:23 [info     ] No previous model found in MLflow experiment_name=287
2024-02-19 09:02:30 [info     ] Model saved with MLflow        experiment_name=287
2024-02-19 09:02:35 [info     ] Logged figures to MLflow.     
2024-02-19 09:02:35 [info     ] Writing reports to ./mlflow_artifacts\287


## Analyse the trained model
Now that the model has been trained, you can inspect the results. 

Exercise: answer the following questions: 
- Are all of the features in the feature importance plot in the input data? Why?
    - What are the most important features? 
- Which time horizon is more accurate? 
    - Hint: zoom in on the same day for both the Predictor0.25 and Predictor47.0 and examine them next to each other.
- Where is my trained model? 

The first two plots are the 'predictor in action' plots for the two time horizons (0.25 means fifteen minutes ahead, 47.0 means 47 hours ahead). These plots show three data sets: train, validation and test. Each of them has an '_actual' and a '_predict' series: the measured value and the value predicted by OpenSTEF. For example, 'train_predict' is OpenSTEF's prediction on the train data.

The last plot shows the feature importance: all of your input features (radiation, windspeed, lagged load, etc.) and how much they influence the forecast. A relatively large block means the feature is relatively important for the forecast, so large changes in the value of this feature result in a large difference in the forecast.
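
To make "relatively large" concrete, block sizes can be read as shares of a total. The sketch below uses invented importance scores (they are not results from this model) and converts them to percentages.

```python
# Invented importance scores for illustration; not output from the trained model.
importance = {
    "lagged_load": 6.0,
    "radiation": 3.0,
    "windspeed": 1.0,
}

total = sum(importance.values())

# Share of each feature in the total, as a percentage.
shares = {name: 100 * score / total for name, score in importance.items()}

# Print from most to least important.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0f}%")
```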

In [10]:
# Inspect local files.
display(IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400))
display(IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400))
display(IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400))


# Visual Studio Code can have difficulty displaying HTML files. If you are working in VS Code and cannot inspect the plots,
# uncomment the code below to open the plots in your browser.
# import webbrowser
# webbrowser.open('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']))
# webbrowser.open('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']))
# webbrowser.open('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']))
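
Opening a relative path with `webbrowser.open` is not guaranteed to work on every platform. A possible variant of the snippet above builds an absolute `file://` URL with `pathlib` first. The helper `report_uri` is hypothetical, and the folder layout is assumed to match the `artifact_folder` used during training.

```python
from pathlib import Path


def report_uri(pj_id, filename, artifact_folder="./mlflow_artifacts"):
    """Hypothetical helper: absolute file:// URL of a report in the artifact folder."""
    return (Path(artifact_folder) / str(pj_id) / filename).resolve().as_uri()


uri = report_uri(287, "Predictor0.25.html")
print(uri)

# To open it in the default browser:
# import webbrowser
# webbrowser.open(uri)
```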