# AMDP Generator based on HANA PAL via Python

In this notebook we demonstrate how to work with the SAP HANA Predictive Analysis Library (PAL) through the Python API, using a development build of the [**hana_ml**](https://pypi.org/project/hana-ml/) package, and how to generate ABAP Managed Database Procedure (AMDP) artifacts from a trained model.

## Demo

### 1. Install requirements

As with any other machine learning library in the Python ecosystem, we first need to install the **hana_ml** package (here a development build) before we can import what we need from it.
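
Before running the rest of the notebook it can help to confirm that the package is actually installed. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is ours and not part of **hana_ml**, and the distribution name `hana-ml` is taken from the PyPI link in the introduction.

```python
from importlib import metadata


def installed_version(distribution="hana-ml"):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("hana-ml is not installed; install it first (for example with pip)")
else:
    print(f"hana-ml {version} is installed")
```

The check only reports what is installed; it does not tell you whether the build is the development build this notebook was written against.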

### 2. Import requirements

Let's import the libraries we need for this use case: a machine learning algorithm (`UnifiedClassification`), the dataframe module for data access and manipulation, and the AMDP artifact generator and deployer. The configuration is read with `configparser` in the next step.

In [1]:
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
import hana_ml.dataframe as dataframe

from hana_ml.artifacts.generators import AMDPGenerator
from hana_ml.artifacts.deployers import AMDPDeployer

### 3. Create connection context

In the following code block we load our credentials from a configuration file on disk so that they do not leak into this notebook or the underlying git repository:

In [2]:
from .. import DataSets, Settings

In [3]:
try:
    import configparser
except ImportError:
    import ConfigParser as configparser
Settings.settings = configparser.ConfigParser()
Settings.settings.read("../../config/e2edata.ini")
url = Settings.settings.get("hana", "url")
port = Settings.settings.get("hana", "port")
user = Settings.settings.get("hana", "user")
passwd = Settings.settings.get("hana", "passwd")
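
The lookups above imply an INI file with a `[hana]` section and the four keys `url`, `port`, `user` and `passwd`. The sketch below parses such a file from a string to show the expected layout. All values are placeholders, the actual contents of `e2edata.ini` are not shown in this notebook, and the helper `read_hana_settings` is hypothetical.

```python
import configparser

# Placeholder contents: the section and key names match the lookups above,
# the values are made up for illustration.
EXAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = DEMO_USER
passwd = change-me
"""


def read_hana_settings(ini_text):
    """Parse the [hana] section of an INI document into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "url": parser.get("hana", "url"),
        "port": parser.getint("hana", "port"),  # getint converts the string to an int
        "user": parser.get("hana", "user"),
        "passwd": parser.get("hana", "passwd"),
    }


example_settings = read_hana_settings(EXAMPLE_INI)
print(example_settings["url"], example_settings["port"], example_settings["user"])
```

Using `getint` for the port avoids the separate `int(port)` conversion when the connection is created.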

Now it's time to create a connection context for our HANA system. This allows us to access the required data, as well as the PAL procedures we need to call in order to train our model.

In [23]:
connection_context = dataframe.ConnectionContext(
    url, int(port), user, passwd)


In [24]:
connection_context.hana_version()

'4.50.000.00.1645273862 (master)'

In [25]:
connection_context.get_current_schema()

'PAL_TEST'

SQL tracing, which records the executed statements so that the model can be reused in the deployment, is enabled in step 5 right before training.

### 4. Prepare the data

This block is part of the utilities for this demo: it checks whether the dataset is already in the system and creates it if necessary, then returns the full dataset together with its training/validation and test partitions. In a real production use case this step would be unnecessary because the data would already be in the system.

In [26]:
diabetes_full, diabetes_train_valid, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)

Table PIMA_INDIANS_DIABETES_TBL exists.


INFO:hana_ml.ml_base:Executing SQL: DROP TABLE "#PAL_PARTITION_DATA_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8"
INFO:hana_ml.ml_base:Executing SQL: CREATE LOCAL TEMPORARY COLUMN TABLE "#PAL_PARTITION_DATA_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8" AS (SELECT "ID", "PREGNANCIES", "GLUCOSE", "SKINTHICKNESS", "INSULIN", "BMI", "AGE", "CLASS" FROM (SELECT "ID", "PREGNANCIES", "GLUCOSE", "SKINTHICKNESS", "INSULIN", "BMI", "AGE", "CLASS" FROM (SELECT * FROM "PAL_TEST"."PIMA_INDIANS_DIABETES_TBL") AS "DT_152") AS "DT_153")
INFO:hana_ml.ml_base:Executing SQL: DO
BEGIN
DECLARE param_name VARCHAR(5000) ARRAY;
DECLARE int_value INTEGER ARRAY;
DECLARE double_value DOUBLE ARRAY;
DECLARE string_value VARCHAR(5000) ARRAY;
param_name[1] := N'RANDOM_SEED';
int_value[1] := 1234;
double_value[1] := NULL;
string_value[1] := NULL;
param_name[2] := N'PARTITION_METHOD';
int_value[2] := 0;
double_value[2] := NULL;
string_value[2] := NULL;
param_name[3] := N'TRAINING_PERCENT';
int_value[3] := NULL;
double_value[3

In [27]:
diabetes_train_valid = diabetes_train_valid.save("DIABETES_TRAIN", force=True)
diabetes_test = diabetes_test.save("DIABETES_TEST", force=True)

INFO:hana_ml.ml_base:Executing SQL: DROP TABLE "DIABETES_TRAIN";
INFO:hana_ml.ml_base:Executing SQL: CREATE COLUMN TABLE "DIABETES_TRAIN" AS (SELECT a.* FROM #PAL_PARTITION_DATA_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8 a inner join #PAL_PARTITION_RESULT_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8 b        on a."ID" = b."ID" where b."PARTITION_TYPE" = 1)
INFO:hana_ml.ml_base:Executing SQL: DROP TABLE "DIABETES_TEST";
INFO:hana_ml.ml_base:Executing SQL: CREATE COLUMN TABLE "DIABETES_TEST" AS (SELECT a.* FROM #PAL_PARTITION_DATA_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8 a inner join #PAL_PARTITION_RESULT_TBL_D227A17F_B5AB_11EC_9E0F_F47B099F40D8 b        on a."ID" = b."ID" where b."PARTITION_TYPE" = 3)
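
The `DO BEGIN ... END` blocks in the log output show how each setting is passed to PAL: as a row with a parameter name plus an integer, a double and a string slot, where the unused slots are NULL. The sketch below reproduces that layout in plain Python for a few rows that are fully visible in the logs of this section and the next. It is an illustration only and not how **hana_ml** builds the table internally; the helper `param_row` is hypothetical.

```python
def param_row(name, int_value=None, double_value=None, string_value=None):
    """Build one (name, int, double, string) row with exactly one slot filled."""
    slots = (int_value, double_value, string_value)
    if sum(slot is not None for slot in slots) != 1:
        raise ValueError("exactly one of the three value slots must be set")
    return (name, int_value, double_value, string_value)


param_rows = [
    # Rows visible in the partitioning log of this section
    param_row("RANDOM_SEED", int_value=1234),
    param_row("PARTITION_METHOD", int_value=0),
    # Rows visible in the training log of the next section
    param_row("FUNCTION", string_value="RDT"),
    param_row("SPLIT_THRESHOLD", double_value=0.0),
]

for row in param_rows:
    print(row)
```

Python's `None` plays the role of the SQL `NULL` seen in the log.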


### 5. Data Science Loop

In this section the real work of a data scientist happens. They manipulate the data, preprocess columns, choose a model and try different combinations of hyperparameters.

Since we just want to demonstrate the deployment, let's keep this short and just use a basic Random Decision Tree Classifier.

In [28]:
# Record the SQL statements that hana_ml executes for later reuse in the deployment
connection_context.sql_tracer.enable_sql_trace(True)
connection_context.sql_tracer.enable_trace_history(True)

# Train a small random decision trees model through the unified interface
rfc_params = dict(n_estimators=5, split_threshold=0, max_depth=10)
rfc = UnifiedClassification(func="RandomDecisionTree", **rfc_params)
rfc.fit(diabetes_train_valid,
        key='ID',
        label='CLASS',
        categorical_variable=['CLASS'],
        partition_method='stratified',
        stratified_column='CLASS')

# Fetch the confusion matrix as a pandas DataFrame, then score the test data
cm = rfc.confusion_matrix_.collect()
rfc.predict(diabetes_test.deselect("CLASS"), key="ID")

INFO:hana_ml.ml_base:Executing SQL: DO
BEGIN
DECLARE param_name VARCHAR(5000) ARRAY;
DECLARE int_value INTEGER ARRAY;
DECLARE double_value DOUBLE ARRAY;
DECLARE string_value VARCHAR(5000) ARRAY;
param_name[1] := N'FUNCTION';
int_value[1] := NULL;
double_value[1] := NULL;
string_value[1] := N'RDT';
param_name[2] := N'KEY';
int_value[2] := 1;
double_value[2] := NULL;
string_value[2] := NULL;
param_name[3] := N'PARTITION_METHOD';
int_value[3] := 2;
double_value[3] := NULL;
string_value[3] := NULL;
param_name[4] := N'PARTITION_STRATIFIED_VARIABLE';
int_value[4] := NULL;
double_value[4] := NULL;
string_value[4] := N'CLASS';
param_name[5] := N'N_ESTIMATORS';
int_value[5] := 5;
double_value[5] := NULL;
string_value[5] := NULL;
param_name[6] := N'SPLIT_THRESHOLD';
int_value[6] := NULL;
double_value[6] := 0;
string_value[6] := NULL;
param_name[7] := N'MAX_DEPTH';
int_value[7] := 10;
double_value[7] := NULL;
string_value[7] := NULL;
param_name[8] := N'CATEGORICAL_VARIABLE';
int_value[8] := NULL;

<hana_ml.dataframe.DataFrame at 0x1d8a30ad220>

We can also view the confusion matrix and accuracy:

In [29]:
print(cm)
print(float(cm['COUNT'][cm['ACTUAL_CLASS'] == cm['PREDICTED_CLASS']].sum()) / cm['COUNT'].sum())

  ACTUAL_CLASS PREDICTED_CLASS  COUNT
0            0               0     39
1            0               1     10
2            1               0     17
3            1               1     11
0.6493506493506493
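
Accuracy alone hides how the errors are distributed between the two classes. As a worked example, the following sketch recomputes the accuracy from the four rows printed above and adds precision and recall for class 1. Treating class 1 as the positive class is our assumption, and the rows are copied by hand so that the sketch runs without a database connection.

```python
# (actual_class, predicted_class, count) rows copied from the output above
confusion_rows = [(0, 0, 39), (0, 1, 10), (1, 0, 17), (1, 1, 11)]

total = sum(count for _, _, count in confusion_rows)
correct = sum(count for actual, predicted, count in confusion_rows if actual == predicted)
accuracy = correct / total

# Assumption: class 1 is the positive class
true_pos = sum(c for a, p, c in confusion_rows if a == 1 and p == 1)
false_pos = sum(c for a, p, c in confusion_rows if a == 0 and p == 1)
false_neg = sum(c for a, p, c in confusion_rows if a == 1 and p == 0)

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

# prints: accuracy=0.6494 precision=0.5238 recall=0.3929
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The low recall shows that the model misses most of the actual class 1 cases, which the accuracy figure alone does not reveal.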


### 6. Generate the ABAP Managed Database Procedures (AMDP) artifact

At this point in the workflow, our data scientist has iterated on the model many times and found a satisfactory solution. They now decide that it's time to deploy it to an ABAP system so that an application developer can easily work with it.

We start the process by creating some `.abap` files on our local machine based on the work that was done previously. These files contain the SQL logic that the data scientist produced by interacting with the **hana_ml** package, wrapped in AMDPs. You can also inspect the code manually at this point and make adaptations where you see fit.

In [38]:
generator = AMDPGenerator(project_name="AMDP_DEMO", version="1", connection_context=connection_context, outputdir="out")
generator.generate(training_dataset="DIABETES_TRAIN", apply_dataset="DIABETES_TEST")

The generated AMDP class has the following characteristics:

- It is based on a classification scenario template
- It takes training and prediction datasets
- Its PAL parameters are configurable
- The training method includes
   - data preprocessing
   - partitioning
   - model training
   - scoring
   - quality metrics and performance chart calculations
- The predict method includes
   - data preprocessing
   - prediction call
   - combining result set with reason codes

### 7. Generate the AMDP artifact directly from the unified classification object

In [40]:
rfc.create_amdp_class(amdp_name="DIABETES_AMDP", training_dataset="DIABETES_TRAIN", apply_dataset="DIABETES_TEST").build_amdp_class()

In [42]:
print(rfc.amdp_template)

CLASS DIABETES_AMDP DEFINITION
  PUBLIC
  FINAL
  CREATE PUBLIC.

  PUBLIC SECTION.
    INTERFACES if_hemi_model_management.
    INTERFACES if_hemi_procedure.
    INTERFACES if_amdp_marker_hdb.

    TYPES:
      BEGIN OF ty_train_input,
        id TYPE int4,
        pregnancies TYPE int4,
        glucose TYPE int4,
        skinthickness TYPE int4,
        insulin TYPE F,
        bmi TYPE F,
        age TYPE int4,
        class TYPE int4,
      END OF ty_train_input,
      tt_training_data TYPE STANDARD TABLE OF ty_train_input WITH DEFAULT KEY,
      tt_predict_data  TYPE STANDARD TABLE OF ty_train_input WITH DEFAULT KEY,
      
      BEGIN OF ty_predict_result,
        id TYPE int4,
        score TYPE int4,
        confidence TYPE f,
        reason_code_feature_1 TYPE shemi_reason_code_feature_name,
        reason_code_percentage_1 TYPE shemi_reason_code_feature_pct,
        reason_code_feature_2 TYPE shemi_reason_code_feature_name,
        reason_code_percentage_2 TYPE shemi_reason_co

In [41]:
rfc.write_amdp_file(version=1, outdir='out')
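
To inspect the generated code by hand, it helps to see which files ended up in the output directory. The sketch below lists them with the standard library only. It assumes the artifacts are written below `out` with an `.abap` extension, as described for the generator; the helper `list_abap_files` is hypothetical and not part of **hana_ml**.

```python
from pathlib import Path


def list_abap_files(outdir="out"):
    """Return all .abap files below outdir, sorted by path (empty if outdir is missing)."""
    root = Path(outdir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.abap"))


for path in list_abap_files("out"):
    print(path, path.stat().st_size, "bytes")
```

If nothing is printed, check the `outdir` and `outputdir` arguments used in the cells above.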