<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Script Table Operator - Execute Custom Python Scripts in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this functional demo, we will see how to run a custom external Python script in Vantage using the <b>Script Table Operator (STO)</b>. The STO executes R and Python scripts from the command line of the operating system underlying the Advanced SQL Engine, in the following sequence:
</p>

<ol style = 'font-size:16px;font-family:Arial'>
  <li>The language script is installed on the Advanced SQL Engine of the target Teradata Vantage system via a call to
an External Stored Procedure (XSP)</li>
      <li>The script is invoked by executing a SQL query that calls the STO</li>
  <li>Each Advanced SQL Engine AMP provides its own portion of input table data, if any, to the script. The script
reads its input from the standard input STDIN</li>
  <li>Each Advanced SQL Engine AMP runs a different instance of the same script. Hence, the script execution is an
operation that scales through system architecture as the same script is run concurrently on all AMPs</li>
  <li>The script executes its code and sends its results to the standard output STDOUT. Each Advanced SQL Engine
AMP individually picks up the corresponding script instance results, and returns them to the STO</li>
</ol>
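
<p style = 'font-size:16px;font-family:Arial'>The sequence above can be imitated locally to build intuition. The following is a minimal sketch, not how Vantage is implemented: the names <code>example_script</code> and <code>simulate_sto</code> are hypothetical, the "AMPs" are plain Python groups keyed by a partition value, and the instances run one after another rather than concurrently. It only illustrates the contract that every instance of the same script reads its own slice of rows from an input stream and writes its results to an output stream.</p>

```python
import io
from collections import defaultdict


def example_script(stdin, stdout):
    """Stand-in for a user script: read tab-separated rows, write one summary row."""
    rows = [line.rstrip("\n").split("\t") for line in stdin if line.strip()]
    if not rows:
        return  # an instance that receives no input produces no output
    partition_id = rows[0][0]
    total = sum(float(row[1]) for row in rows)
    stdout.write(f"{partition_id}\t{len(rows)}\t{total}\n")


def simulate_sto(table, script):
    """Group rows by partition key, run one script instance per group, gather the output."""
    partitions = defaultdict(list)
    for partition_id, value in table:
        partitions[partition_id].append(f"{partition_id}\t{value}\n")

    results = []
    for lines in partitions.values():
        # Each instance gets its own private input and output streams
        stdin, stdout = io.StringIO("".join(lines)), io.StringIO()
        script(stdin, stdout)
        results.extend(stdout.getvalue().splitlines())
    return sorted(results)


# Hypothetical two-column table: (partition key, numeric value)
table = [(1, 2.0), (2, 5.0), (1, 3.0), (2, 1.5), (3, 4.0)]
sto_output = simulate_sto(table, example_script)
print(sto_output)
```

<p style = 'font-size:16px;font-family:Arial'>Because each instance only sees its own rows, any statistic or model computed inside the script is local to that slice of the data.</p>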

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths.</p>

In [None]:
import getpass
import warnings
from collections import OrderedDict

import pandas as pd
import datetime
from teradataml import (configure,display,
                        create_context,remove_context,
                        execute_sql,
                        DataFrame,in_schema,
                        Script,
                        BIGINT,BYTEINT,FLOAT)


display.max_rows = 5
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. Enter the password, press the Enter key, and then use the down arrow to go to the next cell. Run each cell with the Shift + Enter keys.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Getting_Started_STO_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield somewhat faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_HyperModel_cloud');"
 # Takes about 20 seconds
#%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_HyperModel_local');"
 # Takes about 50 secs

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>2. The Dataset</b></p>
<p style = 'font-size:16px;font-family:Arial'>Let us take a look at the sample dataset we are using.</p>

In [None]:
dataset = DataFrame(in_schema('DEMO_HyperModel', 'Dataset'))
dataset

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Create the Python Script</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will create a Python script named script_example.py that uses OneClassSVM from sklearn to identify outliers in the dataset.</p>

In [None]:
%%writefile script_example.py
#!/usr/bin/python
"""
This script reads tab-separated values from standard input to form a pandas DataFrame,
which it then preprocesses by specifying correct data types for each column. It uses
OneClassSVM from sklearn to identify outliers in the dataset. For each record, it prints the
Partition_ID, ID, anomaly status (inlier=1, outlier=-1), decision function value, and anomaly score.

The script expects the input data to be in a specific format, with columns specified for
float, integer, and categorical data types. The data is expected to be streamed and ends
either when a blank line is encountered or EOF is reached.

Columns:
- Partition_ID, ID (integers)
- X1 to X9, Y1 (floats)
- flag, Y2 (categorical)
- FOLD (string label such as 'train'; read but not converted or used)

Usage:
- The script is intended to be used in a pipeline where it reads from stdin.
- Example: cat data.tsv | ./script_example.py
"""

import sys
import pandas as pd
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Define column names for the DataFrame to be created
column_names     = ['Partition_ID', 'ID', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'flag', 'Y1', 'Y2', 'FOLD']

# Specify which columns are to be treated as floating point numbers
float_columns    = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'Y1']

# Specify which columns are integer
integer_columns  = ['Partition_ID', 'ID']

# Specify which columns are categorical
category_columns = ['flag', 'Y2']

# Define the delimiter to be used for splitting input lines
DELIMITER        = '\t'

def reconstruct_pandas_dataframe():
    """
    Reads standard input line by line to construct a pandas DataFrame with
    appropriate data types specified for each column.

    Returns:
        df (DataFrame): A DataFrame with data read from standard input,
                        with columns typed as float, int, and category as appropriate.
    """
    data_Tbl = []
    while True:
        try:
            line = input()
            if line == '':
                break
            data_Tbl.append([x.replace(" ","") for x in line.split(DELIMITER)])
        except EOFError:
            break

    df = pd.DataFrame(data_Tbl, columns=column_names)
    
    for c in float_columns:
        df[c] = df[c].astype('float')
    for c in integer_columns:
        df[c] = df[c].astype('int')
    for c in category_columns:
        df[c] = df[c].astype('category')

    return df

df = reconstruct_pandas_dataframe()

if df.shape[0] == 0:
    # No rows reached this script instance: exit cleanly without producing output.
    # Passing a string to sys.exit() would set a non-zero exit status.
    sys.exit(0)

# Extract the columns specified as float_columns from the DataFrame
data_subset = df[float_columns]

# Initialize a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data using the scaler object; this standardizes the data
data_scaled = scaler.fit_transform(data_subset)

# Initialize OneClassSVM with specified parameters
ocsvm = OneClassSVM(nu=0.05, kernel='rbf', gamma='auto')

# Train the OneClassSVM on the scaled data
ocsvm.fit(data_scaled)

# Use the trained model to predict anomalies; -1 for outliers and 1 for inliers
df['anomaly']           = ocsvm.predict(data_scaled)

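# Signed distance to the separating hyperplane; negative for outliers, positive for inliers
df['decision_function'] = ocsvm.decision_function(data_scaled)

# Raw (unshifted) scoring function of the samples
df['anomaly_score']     = ocsvm.score_samples(data_scaled)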

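# Print outputs: Partition_ID, ID, anomaly status, decision function, and anomaly score.
# Joining on DELIMITER avoids the extra spaces that print() inserts between its arguments;
# itertuples() keeps each column's own dtype, so the integer IDs are not printed as floats.
output_columns = ['Partition_ID', 'ID', 'anomaly', 'decision_function', 'anomaly_score']
for row in df[output_columns].itertuples(index=False):
    print(DELIMITER.join(str(value) for value in row))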
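
<p style = 'font-size:16px;font-family:Arial'>The input and output contract described in the script's docstring can be exercised locally without a database. The sketch below is an illustration under simplifying assumptions: it uses only the standard library, a hypothetical three-column layout instead of the fifteen columns above, and the made-up helper names <code>read_rows</code> and <code>format_row</code>. It shows rows being read until a blank line or end of input, spaces being stripped from each field, and results being written back as tab-separated text.</p>

```python
import io

DELIMITER = "\t"


def read_rows(stream):
    """Collect delimiter-split rows until a blank line or the end of the input."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "":
            break
        rows.append([field.replace(" ", "") for field in line.split(DELIMITER)])
    return rows


def format_row(values):
    """Render one output record as a single delimiter-separated line."""
    return DELIMITER.join(str(value) for value in values)


# Hypothetical input: partition id, record id, one numeric feature.
# The row after the blank line is never read.
sample_input = io.StringIO("1\t 10\t0.5\n1\t11\t 1.5\n\n1\t12\t9.9\n")
rows = read_rows(sample_input)

# Convert each field to its intended type before any computation
typed_rows = [(int(p), int(i), float(x)) for p, i, x in rows]
output_lines = [format_row(row) for row in typed_rows]
print(output_lines)
```

<p style = 'font-size:16px;font-family:Arial'>Feeding the script from an in-memory stream like this is a convenient way to check parsing and output formatting before installing the file in Vantage.</p>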


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>4. Execute the script in Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>The cells below perform the following steps:</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Set SEARCHUIFDBPATH to demo_user</li>
    <li>Install the external Python file, script_example.py, on Vantage</li>
    <li>If the file is already installed, remove it and install it again. This ensures we always have the latest script in Vantage.</li>
</ol>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>4.1 Set SEARCHUIFDBPATH</b></p>


In [None]:
database = 'demo_user'
execute_sql(f"SET SESSION SEARCHUIFDBPATH = {database};")
execute_sql(f'DATABASE "{database}";')

<p style = 'font-size:16px;font-family:Arial'><i>* The above code returns no output as we are only setting the Database value.</i></p>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>4.2 Install script</b></p>
<p style = 'font-size:16px;font-family:Arial'>In the cells below, we first create a Script object with all the required parameter values and then install the Python script in Vantage. If the file already exists, we replace it.</p>

In [None]:
sto = Script(
    data                  = dataset[dataset.FOLD == 'train'],
    script_name           = 'script_example.py',
    files_local_path      = '.',
    script_command        = f"tdpython3 ./{database}/script_example.py",
    data_order_column     = "ID",
    data_partition_column = 'Partition_ID',
    is_local_order        = False,
    nulls_first           = False,
    sort_ascending        = False,
    charset               = 'latin',
    returns               = OrderedDict(
        [
            ("Partition_ID" , BIGINT()),
            ("ID"           , BIGINT()),
            ("ANOMALY"      , BYTEINT()),
            ("DECISION_FUNCTION" , FLOAT()),
            ("ANOMALY_SCORE"     , FLOAT())
        ]
    )
)
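
<p style = 'font-size:16px;font-family:Arial'>The <code>returns</code> argument lists the output columns in the same order in which the script prints them, so the mapping is positional. The sketch below is only an illustration of that idea and does not reflect how teradataml or Vantage parse script output: <code>output_schema</code> and <code>parse_output_line</code> are hypothetical names, and plain Python converters stand in for the SQL types.</p>

```python
from collections import OrderedDict

# Hypothetical Python-side converters mirroring the SQL types given in `returns`
output_schema = OrderedDict(
    [
        ("Partition_ID", int),
        ("ID", int),
        ("ANOMALY", int),
        ("DECISION_FUNCTION", float),
        ("ANOMALY_SCORE", float),
    ]
)


def parse_output_line(line, schema, delimiter="\t"):
    """Map one delimiter-separated output line onto the schema by position."""
    fields = line.split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    return OrderedDict(
        (name, convert(field.strip()))
        for (name, convert), field in zip(schema.items(), fields)
    )


# A made-up output line in the column order the script prints
record = parse_output_line("3\t42\t-1\t-0.125\t7.5", output_schema)
print(record)
```

<p style = 'font-size:16px;font-family:Arial'>If the script printed its columns in a different order, or printed fewer of them, the values would land in the wrong columns or fail to convert, which is why the order of <code>returns</code> must match the script's output.</p>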

In [None]:
try:
    sto.remove_file('script_example')
except Exception:
    # Nothing to remove on the first run
    print('The file does not exist. No need to remove it.')
sto.install_file(
    file_identifier = 'script_example',
    file_name       = 'script_example.py',
    is_binary       = False
)

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>4.3 Script Execution</b></p>
<p style = 'font-size:16px;font-family:Arial'>In the cell below, we execute the script and record the total execution time.</p>

In [None]:
%%time
import time

tic = time.time()
print(sto.execute_script())
toc = time.time()
print('computation time :', toc - tic, 'seconds')
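
<p style = 'font-size:16px;font-family:Arial'>The <code>%%time</code> cell magic is only available inside IPython or Jupyter. When the same measurement is needed in a plain Python program, one possible approach is a small context manager built on <code>time.perf_counter()</code>. In the sketch below, <code>timed</code> is a hypothetical helper, and <code>time.sleep</code> stands in for the call being measured.</p>

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the enclosed block raises an exception
        results[label] = time.perf_counter() - start


timings = {}
with timed("script_execution", timings):
    time.sleep(0.05)  # placeholder for the real workload

print(f"computation time: {timings['script_execution']:.3f} seconds")
```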

<p style = 'font-size:18px;font-family:Arial'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this demo, we have seen how to install and execute an external Python script in Vantage.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_HyperModel');" 
#Takes 40 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
<div style="float:left;margin-top:14px">ClearScape Analytics™</div>
<div style="float:right;">
<div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
</div>
</div>
</footer>