<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Energy Consumption Forecasting using Dataiku
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this business use case, we leverage the power of Dataiku and Teradata Vantage to enhance our machine learning capabilities and enable scalable model scoring. Our goal is to efficiently utilize the strengths of both platforms to streamline our data analysis and decision-making processes.
<br>
<img src="images/logo.jpg" alt="Dataiku X Teradata">
<br>
Dataiku serves as a comprehensive data science platform that empowers us to read data from Teradata Vantage, a powerful analytical database. By leveraging Dataiku's seamless integration with Vantage, we can easily extract and analyze large volumes of data stored within the database.
<br>
<br>
Within Dataiku, we harness its rich set of features and functionalities to build and fine-tune multiple machine learning models. With its user-friendly interface and wide array of machine learning algorithms, we can develop models that are tailored to our specific business requirements. Dataiku enables us to handle data preprocessing, feature engineering, model training, and evaluation, providing a complete end-to-end data science workflow.
<br>
<br>
Once the models are trained and validated within Dataiku, we can seamlessly bring them back to Teradata Vantage. Here, we leverage Vantage's advanced functionality known as BYOM (Bring Your Own Model), which allows us to score our machine learning models directly within the Vantage environment. This capability empowers us to perform model scoring at scale, leveraging the high-performance and parallel processing capabilities of Vantage.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset used in this demo represents electricity consumption in Norway from the 1st of January 2016 to the 31st of August 2019. Each row in the dataset reflects consumption for one hour. In addition to electricity consumption, the dataset includes weather data from multiple sources, daylight information, and the labour calendar. All data was collected from open data sources.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>But what if I don't have Dataiku?</b> Don't worry. We will run the steps that come before Dataiku is used, show you screenshots of what the Dataiku user would be doing, and provide the completed models for you to import into Vantage for scoring.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import os
import getpass
import sys
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION

from teradataml import *
from teradatasqlalchemy.types import *

display.max_rows = 5

# Modify the following to match the specific client environment settings
configure.val_install_location = 'val'
configure.byom_install_location = 'mldb'
os.environ['PATH'] = os.pathsep.join([os.environ['PATH'], str(JAVA_HOME), str(JAVA)[:-5]])
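
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The last line of the cell above appends the JDK home directory and the directory that holds the <code>java</code> executable to <code>PATH</code>. It obtains that directory by trimming the last five characters of the executable path, which assumes the path ends in <code>/java</code>. As a minimal sketch of the same idea, the hypothetical helper below (not part of the notebook) takes the parent directory instead. The paths shown are made up for illustration.</p>

```python
import os
from pathlib import Path


def extend_path(current_path, java_home, java_executable):
    """Append the JDK home and the directory holding the java binary to a PATH string."""
    # Path.parent gives the containing directory without relying on the file name length.
    java_bin_dir = Path(java_executable).parent
    return os.pathsep.join([current_path, str(java_home), str(java_bin_dir)])


# Hypothetical locations, for illustration only.
example = extend_path("/usr/bin", "/opt/jdk", "/opt/jdk/bin/java")
print(example)
```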

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Energy_Consumption_Forecasting_Dataiku.ipynb;' UPDATE FOR SESSION;''')

<img src="images/td-dataiku.png" alt="Teradata and Dataiku Integration">

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The flow of this notebook is as shown above.</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
  <li>We get the data from the cloud into Vantage.</li>
  <li>We process the data and build the model in Dataiku.</li>
  <li>We export the model back to Vantage as a PMML file.</li>
  <li>We score the model inside Vantage without having to move the data.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_Energy_cloud');"        # Takes 1 minute
%run -i ../run_procedure.py "call get_data('DEMO_Energy_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Dataiku</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now that the data is in Vantage, let's see how to create a Dataiku flow like the one shown below.</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Dataiku Flow:</b></p>
<img src="images/flow.jpeg" alt="Dataiku Flow">

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.1 Create Connection</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a new Connection: Go to the 3x3 icon on the top right corner of the screen, then click Administration > Connections > + New Connection and choose Teradata from the dropdown.
<br>
<br>
Configure the following essential parameters to establish a connection to Teradata:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Host: You can get the hostname from the ClearScape dashboard</li>
    <li>Default Database: demo_user</li>
    <li>User: Your username is demo_user</li>
    <li>Password: Your password</li>
    <li>Advanced JDBC properties: CHARSET -> UTF8, TMODE -> TERA</li>
    <li>Tick the checkbox for Autocommit mode in Advanced</li>
    <li>Click on CREATE</li>
</ol>
<img src = './images/connection.png'>
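<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Dataiku assembles the connection from the fields above, so you do not need to write a URL yourself. For orientation only, the sketch below shows how the same settings would look in Teradata JDBC URL form, where parameters follow the host as comma-separated <code>NAME=value</code> pairs. The helper name and the hostname are hypothetical. Use the hostname from your own ClearScape dashboard.</p>

```python
def teradata_jdbc_url(host, database, **properties):
    """Build a Teradata JDBC URL of the form jdbc:teradata://host/NAME=value,NAME=value."""
    params = {"DATABASE": database, **properties}
    param_string = ",".join(f"{name}={value}" for name, value in params.items())
    return f"jdbc:teradata://{host}/{param_string}"


# "my-env.example.com" is a placeholder, not a real ClearScape host.
url = teradata_jdbc_url("my-env.example.com", "demo_user", CHARSET="UTF8", TMODE="TERA")
print(url)
```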

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.2 Import dataset</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a new project and click on <b>+ IMPORT YOUR FIRST DATASET</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Select your connection from the dropdown</li>
    <li>Enter Table as <b>consumption</b></li>
    <li>Enter Schema as <b>DEMO_Energy_db</b></li>
    <li>Click on CREATE</li>
</ol>
<img src = './images/data.png'>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.3 Data Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a new Python recipe by clicking on the <b>Python</b> icon in the right sidebar as shown below.</p>
<img src = 'images/fig1.png'>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Enter output dataset name</li>
    <li>Select your connection from the dropdown</li>
    <li>Click on CREATE DATASET</li>
    <li>Click on CREATE RECIPE</li>
</ol>
<img src = './images/fig2.png'>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You can write your data processing and preparation code in the recipe created above. Sample code is given below. Use your own credentials when creating the context.
<br>
<br>
Click the Run button after creating the recipe.</p>

Code: [transformation.py](./transformation.py)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We'll now use a Dataiku recipe to split the data into train and test datasets. Create a new recipe by clicking on the <b>Split</b> icon in the right sidebar as shown below.</p>
<img src = './images/fig3.png'>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create two output datasets (train_df and test_df)</li>
    <li>Select your connection from the dropdown</li>
    <li>Click on CREATE DATASET</li>
    <li>Click on CREATE RECIPE</li>
    <img src = 'images/fig4.png'>
    <li>Click on <b>Dispatch percentiles of sorted data</b>. This is because the data is time series data and we do not want to split it randomly.</li>
    <img src = './images/fig5.png'>
    <li>Select <b>TD_TIMECODE</b> column from the dropdown. Enter desired ratio to split the data.</li>
    <img src = './images/fig6.png'>
</ol>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note:</b> The tables will be saved as {Project_Name}_train_df and {Project_Name}_test_df. In this demo they are stored as DATAIKUBYOM_train_df and DATAIKUBYOM_test_df.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Click the Run button after creating the recipe.</p>
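<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the idea behind <b>Dispatch percentiles of sorted data</b> concrete, here is a minimal sketch of a chronological split in plain Python. It is not Dataiku's implementation. The function name, the 80/20 ratio, and the sample rows are assumptions chosen for illustration. The rows are sorted by <b>TD_TIMECODE</b>, and the earliest share goes to training while the rest goes to testing, so the test set never contains observations older than the training set.</p>

```python
from datetime import datetime, timedelta


def split_by_sorted_percentile(rows, key, train_fraction=0.8):
    """Sort rows by `key` and send the first `train_fraction` of them to the train set."""
    ordered = sorted(rows, key=lambda row: row[key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten made-up hourly observations, deliberately supplied in reverse order.
start = datetime(2019, 8, 1)
rows = [
    {"TD_TIMECODE": start + timedelta(hours=h), "consumption": 100 + h}
    for h in reversed(range(10))
]

train_rows, test_rows = split_by_sorted_percentile(rows, "TD_TIMECODE", train_fraction=0.8)
print(len(train_rows), len(test_rows))
```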

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.4 Model Training</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create a new Python recipe by clicking on the <b>Python</b> icon in the right sidebar as shown below. We'll use train_df for model training.</p>
<img src = './images/fig7.png'>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You can write your model training and exporting code in the recipe created above. Sample code is given below. Use your own credentials when creating the context.
<br>
<br>
Click the Run button after creating the recipe.</p>

Code: [model_training.py](./model_training.py) 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The previous step stores the trained models in a table called <b>dataiku_models</b> in the ClearScape environment. In the next section, we will use these trained models to perform scoring within Vantage, eliminating the need to move data out of Vantage.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Alternatively, you can import the Dataiku project directly into the Dataiku platform. This folder contains a zip file named <b>DATAIKUBYOM.zip</b>, which you can use to set up the project in your existing Dataiku environment.</p>

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note</b>: If you do not have Dataiku or did not perform the above steps, the following cell will do the required setup to run the remaining notebook.</i></p>
</div>

In [None]:
copy_to_sql(df = pd.read_csv('./train_df.csv'), table_name = 'DATAIKUBYOM_train_df', if_exists = 'replace')
copy_to_sql(df = pd.read_csv('./test_df.csv'), table_name = 'DATAIKUBYOM_test_df', if_exists = 'replace')

# Load the PMML file into Vantage

model_ids = ['lr', 'rf']
model_files = ['energy_consumption_LR.pmml', 'energy_consumption_RF.pmml']
table_name = 'dataiku_models'

for model_id, model_file in zip(model_ids, model_files):
    try:
        save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
    except Exception as e:
        # If the model already exists (error TDML_2200), delete it and save it again
        if 'TDML_2200' in str(e.args):
            delete_byom(model_id = model_id, table_name = table_name)
            save_byom(model_id = model_id, model_file = model_file, table_name = table_name)
        else:
            raise ValueError(f"Unable to save the model '{model_id}' in '{table_name}' due to the following error: {e}") from e

# Show the dataiku_models table
list_byom(table_name)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Model Scoring and Evaluation</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The final step in this process is to test the trained model.  The PMMLPredict function will take the stored pipeline object (including any data preparation and mapping tasks) and execute it against the data on the Vantage Nodes.  Note that we can keep many models in the model table, with versioning, last scored timestamp, or any other management data to allow for the operational management of the process.</p>
        <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li>Create a pointer to the model in Vantage</li>
            <li>Execute the Scoring function using the model against the testing data</li>
            <li>Visualize the results</li>
        </ol>

In [None]:
normalize_value = int(DataFrame('DATAIKUBYOM_train_df').sort('TD_TIMECODE').tail(24)[['consumption']].mean().get_values()[0][0])
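
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cell above sorts the training data by <b>TD_TIMECODE</b>, takes the last 24 hourly rows, averages their <b>consumption</b>, and truncates the result to an integer. The scoring cells later add this value to each prediction. The sketch below reproduces only that arithmetic in plain Python on made-up rows. The function name and the sample data are assumptions, and the sketch does not show how the models were trained.</p>

```python
from datetime import datetime, timedelta


def last_hours_mean(rows, hours=24):
    """Mean consumption of the most recent `hours` rows, truncated to an integer."""
    ordered = sorted(rows, key=lambda row: row["TD_TIMECODE"])
    tail = ordered[-hours:]
    return int(sum(row["consumption"] for row in tail) / len(tail))


# Forty-eight made-up hourly rows with consumption rising by one unit per hour.
start = datetime(2019, 1, 1)
rows = [
    {"TD_TIMECODE": start + timedelta(hours=h), "consumption": 100 + h}
    for h in range(48)
]

offset = last_hours_mean(rows)
# A raw model output of 5.0 would be shifted by the offset, as in the scoring cells.
shifted_prediction = 5.0 + offset
print(offset, shifted_prediction)
```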

In [None]:
# Obtain a pointer to the model
table_name = 'dataiku_models'
model_id = 'lr'
model_lr = retrieve_byom(model_id, table_name = table_name)
df_test = DataFrame('DATAIKUBYOM_test_df')
df_test = df_test.assign(TD_TIMECODE = df_test.TD_TIMECODE.cast(type_=TIMESTAMP))

result_lr = PMMLPredict(
            modeldata = model_lr,
            newdata = df_test,
            accumulate = ['TD_TIMECODE','consumption'],
            ).result

result_lr = result_lr.assign(prediction=result_lr.prediction.cast(type_=FLOAT()))
result_lr = result_lr.assign(prediction=result_lr.prediction + normalize_value)

In [None]:
result_lr

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above step, we use the PMMLPredict function from the teradataml library to score the model in the database. PMMLPredict scores the PMML model directly on the data in the Vantage system, without moving the data or the model outside the system. This can help to improve the efficiency and security of the scoring process.</p>

In [None]:
# Obtain a pointer to the model
table_name = 'dataiku_models'
model_id = 'rf'
model_rf = retrieve_byom(model_id, table_name = table_name)
df_test = DataFrame('DATAIKUBYOM_test_df')
df_test = df_test.assign(TD_TIMECODE = df_test.TD_TIMECODE.cast(type_=TIMESTAMP))

result_rf = PMMLPredict(
            modeldata = model_rf,
            newdata = df_test,
            accumulate = ['TD_TIMECODE','consumption'],
            ).result

result_rf = result_rf.assign(prediction=result_rf.prediction.cast(type_=FLOAT()))
result_rf = result_rf.assign(prediction=result_rf.prediction + normalize_value)

In [None]:
result_rf

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Visualize the results</b>

In [None]:
# Calculate the root mean squared error (RMSE) for each model
rms_lr = RegressionEvaluator(data=result_lr, 
                             observation_column="consumption",
                             prediction_column="prediction",
                             metrics=['RMSE']).result.get_values()[0][0]

rms_rf = RegressionEvaluator(data=result_rf, 
                             observation_column="consumption",
                             prediction_column="prediction",
                             metrics=['RMSE']).result.get_values()[0][0]

In [None]:
figure = Figure(width=1400, height=600,  heading="Energy Consumption Prediction")

df_test.plot(
                x=df_test.TD_TIMECODE,
                y=[df_test.consumption, result_lr.prediction, result_rf.prediction],
                figure=figure,
                xtick_format='YYYY-MM-DD',
                xlabel='TD_TIMECODE',
                ylabel='Energy Consumption',
                legend=['Actual Consumption', f'Linear Regression (RMSE = {rms_lr:.2f})', f'Random Forest (RMSE = {rms_rf:.2f})'],
                legend_style='upper right',
                grid_linestyle='--',
                grid_linewidth=0.5,
                linestyle=['-', '--', '--']
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The legend of the above graph shows the Root Mean Squared Error (RMSE) for both the Linear Regression and Random Forest models. The lower the RMSE, the better the model's performance. As we can see, Random Forest outperforms Linear Regression in predicting energy demand, as it has a lower RMSE. Therefore, Random Forest is more suitable for proactively predicting energy demand in our use case.</p>
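<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to see what the metric measures, the following is a minimal sketch of the root mean squared error computed by hand. It is an illustration of the formula, not the in-database RegressionEvaluator implementation. The function name and the numbers are made up for the example.</p>

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: model B is off by 1 everywhere, model A by 2 or 3.
actual = [100, 110, 120, 130]
model_a = [102, 108, 123, 127]
model_b = [101, 109, 121, 129]

rmse_a = rmse(actual, model_a)
rmse_b = rmse(actual, model_b)
print(f"{rmse_a:.2f} {rmse_b:.2f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is expressed in the same units as the consumption values.</p>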

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration has illustrated a simplified - but complete - overview of how a typical machine learning workflow can be improved using Vantage in conjunction with 3rd-party tools and techniques.  This combination allows users to leverage 3rd-party innovation with Vantage's operational scale, power, and stability.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>5. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up work tables to prevent errors the next time the notebook is run.</p>

In [None]:
tables = ['DATAIKUBYOM_train_df', 'DATAIKUBYOM_test_df', 'dataiku_models']

# Loop through the list of tables and drop each one
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        # Ignore tables that do not exist or cannot be dropped
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Energy');"        # Takes 5 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>