<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Energy Consumption Forecasting using Vantage and scikit-learn</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial'>For energy trading companies, forecasting electricity consumption is a key driver of a successful business. Accurate demand forecasts prevent losses (from overselling energy to the market) as well as lost profits (from underestimating demand). In addition, the energy market regulator can apply fees, or even disqualify a trading company for certain periods, when forecasts are frequently inaccurate. This is why increasing accuracy by even 0.1% can significantly improve the profitability of an energy trading company.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo we show how the full lifecycle of a consumption forecast can be implemented using Vantage technologies, specifically the combination of Bring Your Own Model (BYOM), Vantage Analytics Library (VAL) and the teradataml Python client library. This demo consists of four parts (details on the Teradata "Analytics 1-2-3" strategy can be found <a href = 'https://assets.teradata.com/resourceCenter/downloads/WhitePapers/Analytics-123-Enabling-Enterprise-AI-at-Scale-MD006623.pdf'>here</a>):</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Data discovery using client python libraries</li>
    <li>Feature Prep and Transformation using Vantage Analytic Library</li>
    <li>Model training using the scikit-learn LinearRegression algorithm</li>
    <li>Scoring the model in Vantage and analyzing the results</li>
    </ol>

<p style = 'font-size:16px;font-family:Arial'>The dataset used in this demo represents electricity consumption in Norway from 1 January 2016 to 31 August 2019. Each row reflects consumption for one hour. Apart from electricity consumption, the dataset also includes additional data: weather from multiple sources, daylight information and the labor calendar. All data were collected from open data sources.</p>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Utilize Vantage to Operationalize the Machine Learning Process</b>

<p style = 'font-size:16px;font-family:Arial'>Open-source tools and techniques provide a rich ecosystem for data scientists and analysts to gain new insights into their data.  However, obtaining these insights is a manual, error-prone, and time-consuming process.  Most machine learning tools and platforms seek to make model training more efficient and ignore the more significant challenges with:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>Data Discovery and Statistical Analysis</li>
    <li>Data Preparation and Feature Engineering</li>
    <li>Model Deployment and Evaluation At Operational Scale</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial'>Traditional approaches require the developer to move data <b>from</b> the sources <b>to</b> the analytics.  Even "integrated" analytic systems like Apache Spark, which provide parallel processing for analyzing data, do not optimize data loading, in terms of either locality or the quantity of data that must be moved.</p>



<p style = 'font-size:16px;font-family:Arial'>Teradata Vantage reverses this model and provides the ability to PUSH processing down to the individual processing nodes where the data resides.  This allows for unprecedented scale of analytical processing, reduced data movement and egress costs, and drastically improved performance.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Installing some dependencies</b></p>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install sklearn2pmml
!pip install jdk4py
!pip install teradataml

<p style = 'font-size:16px;font-family:Arial'>
    <i><b>*BEFORE proceeding, please RESTART the kernel to bring new software into Jupyter.</b></i>
</p>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import getpass
import sys
import os

from teradataml import *
from teradataml.analytics.valib import *
import teradataml.analytics.Transformations as tdtf

import pandas as pd
import numpy as np
from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml

import warnings
warnings.filterwarnings('ignore')

# Modify the following to match the specific client environment settings
configure.val_install_location = 'val'
configure.byom_install_location = 'mldb'
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA_HOME)
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA)[:-5]

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell.</p>

In [None]:
%run -i ../startup.ipynb

<p style = 'font-size:16px;font-family:Arial'>The command below creates a context for the Vantage connection.</p>

In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Run each of the following steps by pressing Shift + Enter.</p>

In [None]:
%sql SET query_band='DEMO=Consumption_Forecasting_BYOM.ipynb;' UPDATE FOR SESSION;

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage.  You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield somewhat faster execution but requires available storage space.  There are two statements in the following cell, and one is commented out.  You can switch modes by changing which statement is commented out.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('demo_energy_cloud');"        # Takes 10 seconds
# %run -i ../run_procedure.py "call get_data('demo_energy_local');"        # Takes 30 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used. </p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Data Discovery using teradataml</b>

<table style = 'width:100%;table-layout:fixed;'>
<tr>
    <td style = 'vertical-align:top' width = '50%'>
        <p style = 'font-size:16px;font-family:Arial'>Users can access large volumes of data by connecting remotely using the teradataml client connection library.  Python methods are translated to SQL and run remotely on the Vantage system.  Only the minimal amount of data required is copied to the client, allowing users to interact with data sets of any size and scale.</p>
        <ol style = 'font-size:16px;font-family:Arial'>
            <li>Create a "Virtual DataFrame" that points to the data set in Vantage</li>
            <br>
            <li>Use Pandas syntax to investigate the data</li>
        </ol>
    </td>
    <td><img src = 'images/connect_and_discover.png' width = '400'></td>
</tr>
</table>

In [None]:
tdf = DataFrame(in_schema("DEMO_Energy", "consumption"))
print(tdf.shape)
tdf.head(5)

<p style = 'font-size:16px;font-family:Arial'>The dataset above shows hourly consumption of energy. There are multiple columns that are potential factors affecting energy consumption such as: is_dark, is_holiday, etc.</p> 

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Feature Prep and Transformation with Vantage Analytic Library</b>

<table style = 'width:100%;table-layout:fixed;'>
<tr>
    <td style = 'vertical-align:top' width = '50%'>
        <p style = 'font-size:16px;font-family:Arial'>The Vantage Analytic Library is a suite of powerful functions that allows for whole-data-set descriptive analysis, data transformation, hypothesis testing, and analytic algorithms at extreme scale.  As with all Vantage capabilities, these functions run in parallel, at the source of the data.</p>
        <ol style = 'font-size:16px;font-family:Arial'>
            <li>Create Feature Transformation objects</li>
            <br>
            <li>Define the columns to be retained in the analytic data set</li>
            <br>
            <li>Push the transformations to the data in Vantage</li>
            <br>
            <li>Inspect the results</li>
        </ol>
    </td>
    <td><img src = 'images/VAL_transformation.png' width = '400'></td>
</tr>
</table>

In [None]:
# One-hot encode the numeric weekday (1-7) into columns named after each day
weekday_mapping = {1:'monday', 2:'tuesday', 3:'wednesday', 4:'thursday', 5:'friday', 6:'saturday', 7:'sunday'}
weekday_t = tdtf.OneHotEncoder(values = weekday_mapping, columns = 'weekday')

# One-hot encode the hour of day (0-23)
hour_t = tdtf.OneHotEncoder(values = [x for x in range(0,24)],  columns = 'h')

# Rescale the temperature column
rs = tdtf.MinMaxScalar(columns = 'nasa_temp')

# Columns to carry through to the output unchanged
rt = Retain(columns = ['consumption',  
                       'cap_air_temperature', 'cap_cloud_area_fraction', 'cap_precipitation_amount', 
                       'is_dark', 'is_light', 'is_from_light_to_dark', 'is_from_dark_to_light', 
                       'is_holiday', 'is_pre_holiday'])

<p style = 'font-size:16px;font-family:Arial'>The step above created four transformation objects: weekday_t, hour_t, rs and rt. They one-hot encode the numeric weekday and hour columns, scale nasa_temp with MinMaxScalar, and retain the listed columns unchanged.</p>
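<p style = 'font-size:16px;font-family:Arial'>To make the two transformations concrete, here is a minimal plain-Python sketch of one-hot encoding and min-max scaling. It is an illustration only and not how VAL implements them: the helper names <code>one_hot</code> and <code>min_max_scale</code> and the temperature values are made up for this example, and the 0-to-1 target range is an assumption.</p>

```python
# Plain-Python illustration of the two transformations; not the VAL implementation.

def one_hot(value, categories):
    """Return one 0/1 indicator per category label (hypothetical helper)."""
    return {label: int(value == key) for key, label in categories.items()}

def min_max_scale(values, lower=0.0, upper=1.0):
    """Linearly rescale values so the minimum maps to lower and the maximum to upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to the lower bound.
        return [lower for _ in values]
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

weekday_mapping = {1: 'monday', 2: 'tuesday', 3: 'wednesday', 4: 'thursday',
                   5: 'friday', 6: 'saturday', 7: 'sunday'}

encoded = one_hot(3, weekday_mapping)      # weekday 3 sets only the 'wednesday' flag
temps = [-10.0, 0.0, 5.0, 10.0]            # made-up temperatures
scaled = min_max_scale(temps)

print(encoded)
print(scaled)
```

<p style = 'font-size:16px;font-family:Arial'>In the notebook itself the same logic runs inside Vantage, so the full table never has to be copied to the client.</p>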

In [None]:
t_output = valib.Transform(data = tdf,
                           one_hot_encode = [weekday_t, hour_t], 
                           rescale = [rs], 
                           index_columns = 'TD_TIMECODE',
                           retain = [rt])

In [None]:
t_output.result

<p style = 'font-size:16px;font-family:Arial'>Scroll to the right and observe that we now have columns named <b>monday-sunday</b> and <b>0_h - 23_h</b>. Also, nasa_temp has been scaled.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Model Training</b>

<table style = 'width:100%;table-layout:fixed;'>
<tr>
    <td style = 'vertical-align:top' width = '50%'>
        <p style = 'font-size:16px;font-family:Arial'>With Vantage Bring Your Own Model, users can take advantage of the rich ecosystem of machine learning, data preparation, and advanced analytics libraries available in the open-source and commercial space.  This demonstration illustrates how to utilize simple client-side training pipelines.</p>
        <ol style = 'font-size:16px;font-family:Arial'>
            <li>Create Train and Test data sets in Vantage</li>
            <br>
            <li>Copy the training data to the client</li>
            <br>
            <li>Prepare data and train the model</li>
            <br>
            <li>Load the model into Vantage</li>
        </ol>
    </td>
    <td><img src = 'images/BYOM_model_training.png' width = '400'></td>
</tr>
</table>

In [None]:
copy_to_sql(t_output.result.iloc[int(t_output.result.shape[0])-168:],
            table_name = 'energy_consumption_variables_rescaled_test',
            if_exists = 'replace')

copy_to_sql(t_output.result.iloc[:int(t_output.result.shape[0])-168],
            table_name = 'energy_consumption_variables_rescaled_train',
            if_exists = 'replace')

<p style = 'font-size:16px;font-family:Arial'>The step above creates the training and testing datasets. The last 168 hours (7 days) are used for testing and the remaining data is used for training.</p>
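<p style = 'font-size:16px;font-family:Arial'>The holdout logic can be sketched in plain Python as follows. This is an illustration on synthetic rows, not the notebook's code: the helper <code>split_last_hours</code> and the generated timestamps are assumptions. The sketch sorts by timestamp first, because holding out the "last" hours is only meaningful on time-ordered rows.</p>

```python
from datetime import datetime, timedelta

TEST_HOURS = 168  # 7 days of hourly rows

def split_last_hours(rows, test_hours=TEST_HOURS):
    """Sort (timestamp, value) rows by time and hold out the final test_hours rows."""
    ordered = sorted(rows, key=lambda row: row[0])
    return ordered[:-test_hours], ordered[-test_hours:]

# Synthetic hourly rows covering 10 days; the values are placeholders.
start = datetime(2019, 8, 1)
rows = [(start + timedelta(hours=i), float(i)) for i in range(240)]

train, test = split_last_hours(rows)
print(len(train), len(test))
print(train[-1][0], test[0][0])
```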

In [None]:
df = pd.read_sql('SELECT * FROM energy_consumption_variables_rescaled_train order by TD_TIMECODE;', eng)

<p style = 'font-size:16px;font-family:Arial'>We calculate the average consumption for the last day of the training period. We will use this number to normalize the target variable.</p>

In [None]:
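# Average consumption over the last 24 hourly rows (the final training day).
# Selecting the column first avoids averaging non-numeric columns such as TD_TIMECODE.
normalize_value = int(df['consumption'].tail(24).mean())
normalize_value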

In [None]:
train_x = df.drop(['TD_TIMECODE', 'consumption'], axis = 1).astype(float)
feature_names = list(train_x.columns)
train_y = df['consumption'] - normalize_value
train_x.shape

<p style = 'font-size:16px;font-family:Arial'>The TD_TIMECODE and consumption columns have been dropped from the feature set: TD_TIMECODE is a row identifier rather than a predictor, and consumption is the target. The target variable consumption has been normalized by subtracting the normalize_value that we calculated in the previous step.</p>
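<p style = 'font-size:16px;font-family:Arial'>The normalization is a constant shift, so it has to be undone after scoring. The sketch below shows the round trip on synthetic numbers; the helper names <code>last_day_mean</code>, <code>center</code> and <code>uncenter</code> are hypothetical and the consumption values are made up.</p>

```python
def last_day_mean(consumption, hours=24):
    """Integer mean of the final `hours` values, used as a constant offset."""
    tail = consumption[-hours:]
    return int(sum(tail) / len(tail))

def center(values, offset):
    """Subtract the offset from every value (applied to the training target)."""
    return [v - offset for v in values]

def uncenter(values, offset):
    """Add the offset back (applied to predictions after scoring)."""
    return [v + offset for v in values]

# Synthetic hourly consumption for two days.
consumption = [15000.0 + 10.0 * h for h in range(48)]

offset = last_day_mean(consumption)
centered = center(consumption, offset)
restored = uncenter(centered, offset)

print(offset)
print(centered[-1], restored[-1])
```

<p style = 'font-size:16px;font-family:Arial'>Because the model is trained on the shifted target, its raw predictions are also shifted, and the same offset must be added back before comparing them with actual consumption.</p>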

In [None]:
pipeline_obj = PMMLPipeline([('lr', LinearRegression())])

pipeline_obj.fit(train_x,train_y)
sklearn2pmml(pipeline_obj, "energy_consumption_LR.pmml", with_repr = True)

<p style = 'font-size:16px;font-family:Arial'>The step above creates a PMMLPipeline that contains a LinearRegression object and trains it using the "fit" method. The last line stores the trained model locally in a PMML file.</p>
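<p style = 'font-size:16px;font-family:Arial'>A PMML file is XML, and a linear regression is stored in it as an intercept plus one coefficient per feature. The sketch below reads those values from a small handwritten fragment using only the Python standard library. The fragment is an assumption made for illustration: it is trimmed, it is not a complete PMML document, and the file written by sklearn2pmml contains additional elements and may be organized differently. The coefficient values are invented.</p>

```python
import xml.etree.ElementTree as ET

# Handwritten, trimmed fragment for illustration; the numbers are invented.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="-120.5">
      <NumericPredictor name="is_holiday" coefficient="-850.0"/>
      <NumericPredictor name="nasa_temp" coefficient="-4200.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]

def read_linear_model(pmml_text):
    """Return (intercept, {feature: coefficient}) from the first RegressionTable."""
    root = ET.fromstring(pmml_text)
    for elem in root.iter():
        if local_name(elem.tag) == 'RegressionTable':
            intercept = float(elem.get('intercept', 0.0))
            coefs = {child.get('name'): float(child.get('coefficient'))
                     for child in elem
                     if local_name(child.tag) == 'NumericPredictor'}
            return intercept, coefs
    raise ValueError('no RegressionTable element found')

def predict(intercept, coefs, row):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(c * row[name] for name, c in coefs.items())

intercept, coefs = read_linear_model(SAMPLE_PMML)
y = predict(intercept, coefs, {'is_holiday': 1.0, 'nasa_temp': 0.5})
print(intercept, coefs)
print(y)
```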

In [None]:
# Load the PMML file into Vantage

model_id = 'energy_consumption_lr2'
model_file = 'energy_consumption_LR.pmml'
table_name = 'energy_models'

try:
    res = save_byom(model_id = model_id, model_file = model_file, table_name = table_name)

except Exception as e:
    # If a model with this model_id already exists, delete it and save the new one
    if 'TDML_2200' in str(e.args):
        res = delete_byom(model_id = model_id, table_name = table_name)
        res = save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
    else:
        raise

# Obtain a pointer to the model
model_tdf = DataFrame.from_query(f"SELECT * FROM {table_name} WHERE model_id = '{model_id}'")

In [None]:
list_byom(table_name)

<p style = 'font-size:16px;font-family:Arial'>In the steps above, the PMML model is stored in a table named "energy_models". If a model with the same model_id already exists, we delete it and save the latest model again using the save_byom method.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Score and Evaluate the Model</b>

<table style = 'width:100%;table-layout:fixed;'>
<tr>
    <td style = 'vertical-align:top' width = '50%'>
        <p style = 'font-size:16px;font-family:Arial'>The final step in this process is to test the trained model.  The PMMLPredict function will take the stored pipeline object (including any data preparation and mapping tasks) and execute it against the data on the Vantage nodes.  Note that many models can be stored in the model table, along with versioning, last-scored timestamps, or any other management data that supports operational management of the process.</p>
        <ol style = 'font-size:16px;font-family:Arial'>
            <li>Create a pointer to the model in Vantage</li>
            <br>
            <li>Execute the Scoring function using the model against the testing data</li>
            <br>
            <li>Copy the results of the test to the client - only needs to be a subset of rows if desired</li>
            <br>
            <li>Visualize the results</li>
        </ol>
    </td>
    <td><img src = 'images/Score_and_Evaluate.png' width = '400'></td>
</tr>
</table>

In [None]:
tdf_test = DataFrame('energy_consumption_variables_rescaled_test')
# Run the PMMLPredict function in Vantage
result = PMMLPredict(
            modeldata = model_tdf,
            newdata = tdf_test,
            accumulate = ['TD_TIMECODE','consumption']
            )

<p style = 'font-size:16px;font-family:Arial'>In the step above, we use the PMMLPredict function from the teradataml library to score the model in the database. Note that neither the model nor the data has been moved outside the Vantage system.</p>

In [None]:
df_prediction = result.result.to_pandas()
df_prediction['prediction'] = df_prediction['prediction'].astype(float) + normalize_value
df_prediction['TD_TIMECODE'] = pd.to_datetime(df_prediction['TD_TIMECODE'])
df_prediction = df_prediction.set_index('TD_TIMECODE')

In [None]:
df_prediction.plot();

<p style = 'font-size:16px;font-family:Arial'>The graph above shows the actual consumption (in blue) and the predicted consumption (in orange) for one week.</p>
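<p style = 'font-size:16px;font-family:Arial'>A plot gives a visual impression of the fit; a numeric error measure makes it easier to compare models. The sketch below computes the mean absolute error (MAE) and the mean absolute percentage error (MAPE) in plain Python. It is an optional illustration and not part of the original workflow: the function names are hypothetical and the sample numbers are made up. In the notebook, the inputs would be the consumption and prediction columns of df_prediction.</p>

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error as a percentage of the actual value.

    Assumes no actual value is zero.
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up values for illustration.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(round(mae, 2), round(mape, 2))
```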

<p style = 'font-size:16px;font-family:Arial'>This demonstration has illustrated a simplified - but complete - overview of how a typical machine learning workflow can be improved using Vantage in conjunction with open-source tools and techniques.  This combination allows users to leverage the innovation of open-source with the operational scale, power, and stability of Vantage.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('demo_energy');"        # Takes 5 seconds

In [None]:
remove_context()

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>