### 1. Importing Requirements

Let's import the libraries we need for our use case: a machine learning algorithm (`UnifiedClassification`), the `hana_ml` dataframe module for data access and manipulation, and the HANA artifact generator.

In [1]:
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
import hana_ml.dataframe as dataframe

from hana_ml.artifacts.generators import hana



### 2. Create connection context

The following cells load our credentials from a configuration file on disk, so that they do not leak into this notebook or the underlying git repository:

In [2]:
from hana_ml.algorithms.pal.utility import DataSets, Settings

In [3]:
try:
    import configparser
except ImportError:
    import ConfigParser as configparser
Settings.settings = configparser.ConfigParser()
Settings.settings.read("../../config/e2edata.ini")
url = Settings.settings.get("hana", "url")
port = Settings.settings.get("hana", "port")
user = Settings.settings.get("hana", "user")
passwd = Settings.settings.get("hana", "passwd")
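
For reference, the cell above expects an INI file with a `[hana]` section that holds the four keys it reads. The following is a minimal sketch with placeholder values (host, port, and credentials are made up for illustration, and the `example_*` names are not part of the notebook). It parses the text from a string so that it runs without a HANA system, and it uses `getint` as one possible alternative to converting the port with `int(...)` later on.

```python
import configparser

# Placeholder content mirroring the keys read above; all values are hypothetical.
EXAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = DEMO_USER
passwd = not-a-real-password
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

example_url = example_config.get("hana", "url")
example_port = example_config.getint("hana", "port")  # parsed as an int
example_user = example_config.get("hana", "user")
example_passwd = example_config.get("hana", "passwd")

print(example_url, example_port, example_user)
```

Keeping the real file outside the repository (or listing it in `.gitignore`) is what prevents the credentials from being committed.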

Now it's time to create a connection context for our HANA system. This gives us access to the required data, as well as to the PAL procedures we need to call in order to train our model.

In [4]:
connection_context = dataframe.ConnectionContext(
    url, int(port), user, passwd)


In [5]:
connection_context.hana_version()

'4.50.000.00.1663300048 (master)'

In [6]:
connection_context.get_current_schema()

'PAL_TEST'

We also enable SQL tracing (see the cell after the data preparation step below) for later reuse of the model in the deployment.

### 4. Prepare the data

This block is part of the utilities for this demo: it makes sure the dataset is in the system and creates it if necessary. In a real production use case this would be unnecessary, since the data would already be in the system.

In [7]:

diabetes_full, diabetes_train_valid, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)
diabetes_train_valid = diabetes_train_valid.save("diabetes_train_valid")
diabetes_test = diabetes_test.save("diabetes_test")
Settings.set_log_level()

Table PIMA_INDIANS_DIABETES_TBL exists.


In [8]:
connection_context.sql_tracer.enable_sql_trace(True)
connection_context.sql_tracer.enable_trace_history(True)

### 5. Data Science Loop

In this section the real work of a data scientist happens. They manipulate the data, preprocess columns, choose a model and try different combinations of hyperparameters.

Since we just want to demonstrate the deployment, let's keep this short and just use a basic Random Decision Tree classifier.

In [9]:

rfc_params = dict(n_estimators=5, split_threshold=0, max_depth=10)
rfc = UnifiedClassification(func="RandomDecisionTree", **rfc_params)
rfc.fit(diabetes_train_valid, 
        key='ID', 
        label='CLASS', 
        categorical_variable=['CLASS'],
        partition_method='stratified',
        stratified_column='CLASS',)
cm = rfc.confusion_matrix_.collect()
rfc.predict(diabetes_test.drop(cols=['CLASS']), key="ID")

INFO:hana_ml.ml_base:Executing SQL: DO
BEGIN
DECLARE param_name VARCHAR(5000) ARRAY;
DECLARE int_value INTEGER ARRAY;
DECLARE double_value DOUBLE ARRAY;
DECLARE string_value VARCHAR(5000) ARRAY;
param_name[1] := N'FUNCTION';
int_value[1] := NULL;
double_value[1] := NULL;
string_value[1] := N'RDT';
param_name[2] := N'KEY';
int_value[2] := 1;
double_value[2] := NULL;
string_value[2] := NULL;
param_name[3] := N'N_ESTIMATORS';
int_value[3] := 5;
double_value[3] := NULL;
string_value[3] := NULL;
param_name[4] := N'SPLIT_THRESHOLD';
int_value[4] := NULL;
double_value[4] := 0;
string_value[4] := NULL;
param_name[5] := N'MAX_DEPTH';
int_value[5] := 10;
double_value[5] := NULL;
string_value[5] := NULL;
param_name[6] := N'PARTITION_METHOD';
int_value[6] := 2;
double_value[6] := NULL;
string_value[6] := NULL;
param_name[7] := N'PARTITION_STRATIFIED_VARIABLE';
int_value[7] := NULL;
double_value[7] := NULL;
string_value[7] := N'CLASS';
param_name[8] := N'HANDLE_MISSING_VALUE';
int_value[8] := 0;
do

<hana_ml.dataframe.DataFrame at 0x1d1e3bef400>
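
The `fit` call above asks for a stratified partition on the `CLASS` column. As a rough illustration of the general idea only (this is plain Python, not PAL's implementation, and the function name, the 80/20 ratio, and the toy data are all assumptions), a stratified split divides each class separately so that both parts keep roughly the original class proportions:

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows so each class contributes about train_fraction to training."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    rng = random.Random(seed)
    train, valid = [], []
    for members in by_class.values():
        rng.shuffle(members)  # randomize within the class only
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid


# Toy data: 70 rows of class 0 and 30 rows of class 1, as (ID, CLASS) pairs.
toy_rows = [(i, 0) for i in range(70)] + [(i, 1) for i in range(70, 100)]
train, valid = stratified_split(toy_rows, label_of=lambda row: row[1])

print(len(train), len(valid))
```

With a purely random split, a small class could end up under-represented in one of the parts; splitting per class avoids that.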

We can also view the confusion matrix and accuracy:

In [None]:
print(cm)
print(float(cm['COUNT'][cm['ACTUAL_CLASS'] == cm['PREDICTED_CLASS']].sum()) / cm['COUNT'].sum())
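
The accuracy expression above is compact, so here is the same arithmetic spelled out in plain Python. This is a sketch with made-up counts, not output from the model: each tuple stands for one confusion-matrix row with the `ACTUAL_CLASS`, `PREDICTED_CLASS` and `COUNT` columns, and accuracy is the sum of the counts where the two classes agree, divided by the sum of all counts.

```python
# Made-up confusion-matrix rows: (actual class, predicted class, count).
cm_rows = [
    (0, 0, 50),
    (0, 1, 10),
    (1, 0, 8),
    (1, 1, 32),
]

# Correct predictions are the rows where actual and predicted classes match.
correct = sum(count for actual, predicted, count in cm_rows if actual == predicted)
total = sum(count for _, _, count in cm_rows)
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy:.2f}")
# prints "82 of 100 correct, accuracy = 0.82"
```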

### 6. Generate HDI artifact


In [13]:
hg = hana.HanaGenerator(project_name="test", version='1', grant_service='', connection_context=connection_context, outputdir=".\\test_out")

In [None]:
hg.config.config

In [None]:
hg.generate_artifacts()
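
To see what was written to the output directory, a small generic helper can list every file below a folder. The sketch below is not part of `hana_ml`; the helper name `list_files` and the demo file names are made up, and it runs against a temporary directory so that it works without a HANA system.

```python
import os
import tempfile


def list_files(root):
    """Return the paths of all files under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


# Demonstration on a throwaway directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "src"))
    for rel in ("src/example.txt", "readme.txt"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as handle:
            handle.write("placeholder")
    demo_listing = list_files(tmp)

print(demo_listing)
```

In the notebook, calling `list_files` on the `outputdir` passed to `HanaGenerator` would list whatever the generator produced; the exact layout depends on the `hana_ml` version in use.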