## Overview of STOs

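Script Table Operators (STOs) run user-supplied scripts, such as Python code, inside the database, in parallel across the AMPs. teradataml exposes them through two DataFrame methods, `map_row` and `map_partition`.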

#### map_row

`map_row` processes records row by row, in parallel across all AMPs in Teradata, which makes it well suited to feature transformation. It automatically chunks each partition behind the scenes, so you can process very large partitions without holding the entire partition in memory and without the risk of running out of memory.

If you are familiar with `map` in functional programming, this is more or less the same thing: it takes one input (a row) and returns one output (a row).
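
As a rough illustration of these one-row-in, one-row-out semantics, the plain-Python sketch below applies a function to each row with the built-in `map`. The rows and the `add_bonus` function are made up for illustration, and nothing here runs in Vantage.

```python
# Hypothetical rows, represented as plain dicts for illustration only.
rows = [
    {"id": 1, "gpa": 3.0},
    {"id": 2, "gpa": 2.5},
]

def add_bonus(row):
    """Return a new row with 0.5 added to the gpa."""
    return {**row, "gpa": row["gpa"] + 0.5}

# map applies the function to each row independently: N rows in, N rows out.
mapped = list(map(add_bonus, rows))
```

Each output row depends only on its own input row, which is what allows the work to be spread across AMPs.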


#### map_partition

`map_partition` processes data partition by partition, in parallel across all AMPs in Teradata. This is well suited to partitioned modelling, where we want to train an individual model on each partition of the data.

This is similar to `flat map` in functional programming in that it takes N rows (where N is the number of rows in the partition) and returns 0 or more rows. That is, it can return a different number of rows than it takes, for example a single row holding the model trained on that partition.
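
The plain-Python sketch below illustrates these flat-map semantics: a function receives all values of one partition and returns a list of zero or more output rows, and the lists are concatenated. The data and the `group_by_partition` and `summarise` helpers are hypothetical and only mimic the idea locally.

```python
from collections import defaultdict

# Hypothetical rows: (partition_id, value) pairs, for illustration only.
rows = [(0, 2.0), (1, 3.0), (0, 4.0), (1, 5.0), (2, 1.0)]

def group_by_partition(rows):
    """Group values by their partition id, preserving first-seen order."""
    partitions = defaultdict(list)
    for partition_id, value in rows:
        partitions[partition_id].append(value)
    return partitions

def summarise(partition_id, values):
    """Take all N values of a partition and return a list of output rows.

    Partitions with fewer than two values return no rows at all, which
    shows that the output row count need not match the input row count.
    """
    if len(values) < 2:
        return []
    return [(partition_id, sum(values) / len(values))]

# Flat map: apply the function per partition and concatenate the results.
results = [
    out_row
    for partition_id, values in group_by_partition(rows).items()
    for out_row in summarise(partition_id, values)
]
```

Five input rows in three partitions produce two output rows here, one per partition that had enough data.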


#### Notes & Recommendations

Things to be aware of when using `map_row`/`map_partition` with teradataml:

- Python versions must match between client and in-vantage
- The version of dill on both client and in-vantage must match
- Whatever libraries you use must have matching versions between client and in-vantage
- You must have the correct permissions to both install and delete files in Teradata
- Know the `ScriptMemLimit` setting, which determines how much memory each AMP has available for scripts. The default is 32MB!


The Python version and library versions available to you depend on the STO version and the Teradata version you have installed. The latest service pack brings Python 3.7, many new libraries (including support for ONNX libraries), and upgrades to existing libraries. The move to Python 3.7 and the accompanying library upgrades are important to highlight because models and other objects saved as pickle files depend on both the Python version and the library versions used to create them. Any models saved this way can break or stop working when Teradata Vantage is upgraded. With that in mind, our recommendations for using teradataml are:

- `map_row` is recommended for advanced feature engineering that you cannot do with SQL, VAL, etc.
- `map_partition` is recommended for partitioned processing such as partitioned model training, evaluation, and scoring
- Save partition models as ONNX or PMML
- PMML models can be executed via IVSM / PMMLPredict
- ONNX models can be executed (once the library is available) either in an STO or via ONNXPredict


#### Checking Versions

```python
from aoa.sto.util import collect_sto_versions, check_sto_version

collect_sto_versions()
check_sto_version()
```
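
To make the version requirements concrete, the sketch below compares two version reports by hand. It assumes each report is a dictionary with a `python_version` string and a `packages` mapping; the `find_mismatches` helper and the sample values are hypothetical and independent of whatever `check_sto_version` does internally.

```python
def find_mismatches(client, server):
    """Return {name: (client_version, server_version)} for every difference.

    Only the leading version number of the Python version string is
    compared, and only packages present on the client are checked.
    """
    mismatches = {}
    client_py = client["python_version"].split()[0]
    server_py = server["python_version"].split()[0]
    if client_py != server_py:
        mismatches["python"] = (client_py, server_py)
    for name, version in client["packages"].items():
        server_version = server["packages"].get(name)  # None if not installed
        if server_version != version:
            mismatches[name] = (version, server_version)
    return mismatches

# Made-up reports for illustration only.
client = {
    "python_version": "3.6.7 (default)",
    "packages": {"dill": "0.3.1", "numpy": "1.16.0"},
}
server = {
    "python_version": "3.6.7 (default, Nov 21 2019)",
    "packages": {"dill": "0.2.9", "numpy": "1.16.0"},
}

mismatches = find_mismatches(client, server)
```

With these sample values only `dill` is reported, since the Python version and `numpy` agree on both sides.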


#### Logging

To enable logging for all queries, configure it at the SQLAlchemy level:

```python
import logging
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
```

If you want to log just the MLE queries, you can set:

```python
from teradataml.options.display import display
display.print_sqlmr_query = True
```

```python
from teradataml import DataFrame, create_context, get_connection, load_example_data
import getpass

create_context(host="3.238.151.85", username="AOA_DEMO", password=getpass.getpass("password"))
```

```
Engine(teradatasql://AOA_DEMO:***@3.238.151.85)
```

```python
load_example_data("dataframe", "admissions_train")
```



```python
df = DataFrame('admissions_train')
df.head()
```

```
id,masters,gpa,stats,programming,admitted
3,no,3.7,Novice,Beginner,1
5,no,3.44,Novice,Novice,0
6,yes,3.5,Beginner,Advanced,1
7,yes,2.33,Novice,Novice,1
9,no,3.82,Advanced,Advanced,1
10,no,3.71,Advanced,Advanced,1
8,no,3.6,Beginner,Advanced,1
4,yes,3.5,Beginner,Novice,1
2,yes,3.76,Beginner,Beginner,0
1,yes,3.95,Beginner,Beginner,0
```


### Modify Column with map_row 

A simple example that shows how to modify an existing column in a dataset.


```python
# IPython help syntax: displays the documentation for map_row
df.map_row?
```

```python
def increase_gpa(row, p):
    row['gpa'] = row['gpa'] + row['gpa'] * p
    return row

df = DataFrame('admissions_train')
df = df.map_row(lambda row: increase_gpa(row, 0.2))
df.head()
```

```
id,masters,gpa,stats,programming,admitted
3,no,4.44,Novice,Beginner,1
5,no,4.128,Novice,Novice,0
6,yes,4.2,Beginner,Advanced,1
7,yes,2.7960000000000003,Novice,Novice,1
9,no,4.584,Advanced,Advanced,1
10,no,4.452,Advanced,Advanced,1
8,no,4.32,Beginner,Advanced,1
4,yes,4.2,Beginner,Novice,1
2,yes,4.512,Beginner,Beginner,0
1,yes,4.74,Beginner,Beginner,0
```
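
Because the row function is ordinary Python, it can be sanity-checked locally before it is sent to Vantage. The sketch below assumes a plain dict is an acceptable stand-in for the row object, since the function only uses item access; the sample row is copied from the output above.

```python
import math

def increase_gpa(row, p):
    row['gpa'] = row['gpa'] + row['gpa'] * p
    return row

# A dict stands in for the row here (an assumption for local testing).
sample = {"id": 7, "masters": "yes", "gpa": 2.33}

# Pass a copy so the original sample is left untouched.
result = increase_gpa(dict(sample), 0.2)

# Binary floating point is why values such as 2.7960000000000003 appear,
# so compare with a tolerance rather than with ==.
assert math.isclose(result["gpa"], 2.796)
```

Note that `increase_gpa` modifies the row it is given, which is why the sketch passes a copy.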


### New column derived from others via map_row 

A simple example that shows how to return a new column for a dataset.


```python
from teradatasqlalchemy.types import INTEGER, VARCHAR, BIGINT, DECIMAL, CLOB
from collections import OrderedDict
import numpy as np

def increase_gpa(row):
    new_gpa = row['gpa'] + row['gpa'] * 0.2
    return np.array([row['id'], new_gpa])

df = DataFrame('admissions_train')
df = df.map_row(lambda row: increase_gpa(row),
                returns=OrderedDict([("id", INTEGER()),
                                     ("new_gpa", DECIMAL())]))
df.head()
```

```
id,new_gpa
3,4.44
5,4.128
6,4.2
7,2.796
9,4.584
10,4.452
8,4.32
4,4.2
2,4.512
1,4.74
```


### Append new column derived from others via map_row 

A simple example that shows how to append a new column to a dataset.


```python
def increase_gpa(row):
    row["new_gpa"] = row['gpa'] + row['gpa'] * 0.2
    return row

df = DataFrame('admissions_train')

returns = OrderedDict(zip(df.columns, [col.type for col in df._metaexpr.c]))
returns["new_gpa"] = DECIMAL()

df = df.map_row(lambda row: increase_gpa(row), returns=returns)
df.head()
```

```
id,masters,gpa,stats,programming,admitted,new_gpa
3,no,3.7,Novice,Beginner,1,4.44
5,no,3.44,Novice,Novice,0,4.128
6,yes,3.5,Beginner,Advanced,1,4.2
7,yes,2.33,Novice,Novice,1,2.796
9,no,3.82,Advanced,Advanced,1,4.584
10,no,3.71,Advanced,Advanced,1,4.452
8,no,3.6,Beginner,Advanced,1,4.32
4,yes,3.5,Beginner,Novice,1,4.2
2,yes,3.76,Beginner,Beginner,0,4.512
1,yes,3.95,Beginner,Beginner,0,4.74
```


The result of `map_row` can also feed further in-database processing. The unexecuted cell below passes it to IVSM scoring. The model identifiers are specific to one environment, and `PatientId` is not a column of the `map_row` output here, so both would need to be adapted before running it.

```python
def increase_gpa(row):
    new_gpa = row['gpa'] + row['gpa'] * 0.2
    return np.array([row['id'], new_gpa])

df = DataFrame(query="SELECT * FROM admissions_train")
df = df.map_row(lambda row: increase_gpa(row),
                returns=OrderedDict([("id", INTEGER()),
                                     ("new_gpa", DECIMAL())]))

df = DataFrame(query=f"""
SELECT * FROM IVSM.IVSM_SCORE (
    on (SELECT * FROM {df._table_name}) AS DataTable
    on (SELECT model_id, model FROM aoa_ivsm_models WHERE model_version = 'a8dca689-d932-483d-b109-8b5e2d965975') AS ModelTable DIMENSION
    using
        ModelID('03c9a01f-bd46-4e7c-9a60-4282039094e6')
        ColumnsToPreserve('PatientId')
        ModelType('PMML')
) sc;
""")
```


### Train a model on a partition of the data using map_partition

A simple example that shows how to train a model per partition of the data.


```python
from sklearn.linear_model import LogisticRegression
import base64
import dill

def train_partition_model(partition):
    # read all of the rows into memory (we can also process in chunks)
    rows = partition.read()

    # return if partition has no data
    if rows is None or len(rows) == 0:
        return None

    X = rows[["masters", "gpa", "stats", "programming"]]
    Y = rows[["admitted"]]

    # only the numeric gpa column is used as a feature in this example
    clf = LogisticRegression()
    clf = clf.fit(X[["gpa"]], Y)

    partition_id = rows.partition_id.iloc[0]

    # we have to convert the model to base64 to store in a CLOB column (can't use BLOB with STOs)
    artefact = base64.b64encode(dill.dumps(clf))

    # here we return 1 row per partition - basically, the trained model for that partition
    return np.array([[partition_id, 'my_model_version', artefact]])


# let's add a synthetic partition_id column
df = DataFrame(query="SELECT MOD(id, 2) as partition_id, T.* FROM admissions_train T")

df = df.map_partition(lambda partition: train_partition_model(partition),
                      data_partition_column="partition_id",
                      returns=OrderedDict(
                                    [('partition_id', VARCHAR(255)),
                                     ('model_version', VARCHAR(255)),
                                     ('model_artefact', CLOB())]))
df.head()
```

partition_id,model_version,model_artefact
1,my_model_version,gANjc2tsZWFybi5saW5lYXJfbW9kZWwubG9naXN0aWMKTG9naXN0aWNSZWdyZXNzaW9uCnEAKYFxAX1xAihYBwAAAHBlbmFsdHlxA1gCAAAAbDJxBFgEAAAAZHVhbHEFiVgDAAAAdG9scQZHPxo24uscQy1YAQAAAENxB0c/8AAAAAAAAFgNAAAAZml0X2ludGVyY2VwdHEIiFgRAAAAaW50ZXJjZXB0X3NjYWxpbmdxCUsBWAwAAABjbGFzc193ZWlnaHRxCk5YDAAAAHJhbmRvbV9zdGF0ZXELTlgGAAAAc29sdmVycQxYBAAAAHdhcm5xDVgIAAAAbWF4X2l0ZXJxDktkWAsAAABtdWx0aV9jbGFzc3EPaA1YBwAAAHZlcmJvc2VxEEsAWAoAAAB3YXJtX3N0YXJ0cRGJWAYAAABuX2pvYnNxEk5YCAAAAGNsYXNzZXNfcRNjZGlsbC5fZGlsbApfZ2V0X2F0dHIKcRRjZGlsbC5fZGlsbApfaW1wb3J0X21vZHVsZQpxFVgcAAAAbnVtcHkuY29yZS5fbXVsdGlhcnJheV91bWF0aHEWhXEXUnEYWAwAAABfcmVjb25zdHJ1Y3RxGYZxGlJxG2NudW1weQpuZGFycmF5CnEcSwCFcR1DAWJxHodxH1JxIChLAUsChXEhY251bXB5CmR0eXBlCnEiWAIAAABpOHEjSwBLAYdxJFJxJShLA1gBAAAAPHEmTk5OSv////9K/////0sAdHEnYolDEAAAAAAAAAAAAQAAAAAAAABxKHRxKWJYBQAAAGNvZWZfcSpoG2gcSwCFcStoHodxLFJxLShLAUsBSwGGcS5oIlgCAAAAZjhxL0sASwGHcTBScTEoSwNoJk5OTkr/////Sv////9LAHRxMmKJQwi17wWkh/zMP3EzdHE0YlgKAAAAaW50ZXJjZXB0X3E1aBtoHEsAhXE2aB6HcTdScTgoSwFLAYVxOWgxiUMIFGlKdPZroD9xOnRxO2JYBwAAAG5faXRlcl9xPGgbaBxLAIVxPWgeh3E+UnE/KEsBSwGFcUBoIlgCAAAAaTRxQUsASwGHcUJScUMoSwNoJk5OTkr/////Sv////9LAHRxRGKJQwQDAAAAcUV0cUZiWBAAAABfc2tsZWFybl92ZXJzaW9ucUdYBgAAADAuMjAuM3FIdWIu
0,my_model_version,gANjc2tsZWFybi5saW5lYXJfbW9kZWwubG9naXN0aWMKTG9naXN0aWNSZWdyZXNzaW9uCnEAKYFxAX1xAihYBwAAAHBlbmFsdHlxA1gCAAAAbDJxBFgEAAAAZHVhbHEFiVgDAAAAdG9scQZHPxo24uscQy1YAQAAAENxB0c/8AAAAAAAAFgNAAAAZml0X2ludGVyY2VwdHEIiFgRAAAAaW50ZXJjZXB0X3NjYWxpbmdxCUsBWAwAAABjbGFzc193ZWlnaHRxCk5YDAAAAHJhbmRvbV9zdGF0ZXELTlgGAAAAc29sdmVycQxYBAAAAHdhcm5xDVgIAAAAbWF4X2l0ZXJxDktkWAsAAABtdWx0aV9jbGFzc3EPaA1YBwAAAHZlcmJvc2VxEEsAWAoAAAB3YXJtX3N0YXJ0cRGJWAYAAABuX2pvYnNxEk5YCAAAAGNsYXNzZXNfcRNjZGlsbC5fZGlsbApfZ2V0X2F0dHIKcRRjZGlsbC5fZGlsbApfaW1wb3J0X21vZHVsZQpxFVgcAAAAbnVtcHkuY29yZS5fbXVsdGlhcnJheV91bWF0aHEWhXEXUnEYWAwAAABfcmVjb25zdHJ1Y3RxGYZxGlJxG2NudW1weQpuZGFycmF5CnEcSwCFcR1DAWJxHodxH1JxIChLAUsChXEhY251bXB5CmR0eXBlCnEiWAIAAABpOHEjSwBLAYdxJFJxJShLA1gBAAAAPHEmTk5OSv////9K/////0sAdHEnYolDEAAAAAAAAAAAAQAAAAAAAABxKHRxKWJYBQAAAGNvZWZfcSpoG2gcSwCFcStoHodxLFJxLShLAUsBSwGGcS5oIlgCAAAAZjhxL0sASwGHcTBScTEoSwNoJk5OTkr/////Sv////9LAHRxMmKJQwgkg+OhIdikP3EzdHE0YlgKAAAAaW50ZXJjZXB0X3E1aBtoHEsAhXE2aB6HcTdScTgoSwFLAYVxOWgxiUMIhQVO86Hbyz9xOnRxO2JYBwAAAG5faXRlcl9xPGgbaBxLAIVxPWgeh3E+UnE/KEsBSwGFcUBoIlgCAAAAaTRxQUsASwGHcUJScUMoSwNoJk5OTkr/////Sv////9LAHRxRGKJQwQEAAAAcUV0cUZiWBAAAABfc2tsZWFybl92ZXJzaW9ucUdYBgAAADAuMjAuM3FIdWIu
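
To use a stored model later, the artefact has to be decoded in reverse order: base64-decode first, then deserialise. The sketch below shows the round trip with the standard-library `pickle` module and a plain dict standing in for the trained model, so that it runs without scikit-learn. `dill` offers `dumps` and `loads` functions with the same names, but swapping it in is an assumption about your setup.

```python
import base64
import pickle

# Stand-in for a trained model (hypothetical; a real model object would go here).
model = {"coef": [0.23], "intercept": -0.5}

# Encode: serialise to bytes, then base64 so it can be stored as text.
artefact = base64.b64encode(pickle.dumps(model))

# A text column hands the value back as str, so simulate that here.
stored = artefact.decode("ascii")

# Decode: reverse the two steps.
restored = pickle.loads(base64.b64decode(stored.encode("ascii")))
```

As noted earlier, the restored object is only reliable when the Python and library versions match those used when the model was saved.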


```python
df = DataFrame(query="""
SELECT partition_id, count(*) as c FROM (
    SELECT MOD(id, 2) as partition_id, T.* FROM admissions_train T) s
    GROUP BY partition_id
""")
df.head()
```

```
partition_id,c
1,20
0,20
```
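
The even split follows from `MOD(id, 2)`. Assuming the 40 rows carry the ids 1 to 40 (an assumption; only the counts are shown above), the same counts can be reproduced in plain Python:

```python
from collections import Counter

# Assumed ids 1..40; the real table may number its rows differently.
ids = range(1, 41)

# Same expression as MOD(id, 2) in the SQL above.
counts = Counter(i % 2 for i in ids)
```

Odd ids land in partition 1 and even ids in partition 0, giving 20 rows each under this assumption.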


### How to validate library versions 

```python
from aoa.sto.util import collect_sto_versions, check_sto_version

collect_sto_versions()
```

{'python_version': '3.6.7 (default, Nov 21 2019, 00:48:33)  [GCC 4.8.5]',
 'packages': {'cycler': '0.10.0',
  'PySAL': '1.14.4.post2',
  'pyflux': '0.4.15',
  'webencodings': '0.5.1',
  'ipykernel': '5.1.0',
  'entrypoints': '0.2.3',
  'Keras-Preprocessing': '1.0.5',
  'plotly': '3.7.1',
  'execnet': '1.5.0',
  'more-itertools': '4.3.0',
  'tornado': '5.1.1',
  'backcall': '0.1.0',
  'slackclient': '1.3.0',
  'coverage': '4.5.2',
  'jedi': '0.13.1',
  'statsmodels': '0.9.0',
  'requests': '2.20.1',
  'Bottleneck': '1.2.1',
  'murmurhash': '1.0.1',
  'scikit-learn': '0.20.3',
  'apipkg': '1.5',
  'jupyter-client': '5.2.3',
  'psutil': '5.4.8',
  'elasticsearch': '6.3.1',
  'pycparser': '2.19',
  'mpmath': '1.0.0',
  'prompt-toolkit': '2.0.7',
  'pluggy': '0.9.0',
  'nltk': '3.4',
  'wheel': '0.32.3',
  'numexpr': '2.6.8',
  'yt': '3.5.1',
  'Pygments': '2.2.0',
  'jdcal': '1.4',
  'pexpect': '4.6.0',
  'ruamel.yaml': '0.15.42',
  'packaging': '18.0',
  'PyYAML': '3.13',
  'plac': '0.9.6