# SAP HANA Cloud - Auto ML Hands On

### Documentation
- See the [SAP HANA Python Client API for Machine Learning Algorithms](https://pypi.org/project/hana-ml/) project on **PyPI** for more information.

- For more information on **PAL**, see the [SAP HANA Predictive Analysis Library (PAL)](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/sap-hana-cloud-sap-hana-database-predictive-analysis-library-pal?locale=en-US) information page.

- See [Python Machine Learning Client for SAP HANA](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2023_2_QRC/en-US/hana_ml.html) to discover the **hana-ml** libraries.


### SAP HANA ML Library

You will be using the SAP HANA Python Client API for Machine Learning Algorithms (**hana-ml**).

### Install hana-ml libraries



In [None]:
!pip install --upgrade hana_ml --break-system-packages
!pip install --upgrade matplotlib --break-system-packages

# Restart Kernel!

### Connect to your SAP HANA Cloud tenant

Define connection details and connect to SAP HANA Cloud tenant.

> **Important!** Use the username and password supplied in the registration e-mail. For **HANA_TENANT_HOST**, use the hostname supplied in the **AutoML Introduction** table of the hands-on guide.

In [1]:
import hana_ml.dataframe as dataframe

hana_address = 'HANA_TENANT_HOST'
hana_port = 443
hana_user = 'USERNAME'
hana_password = 'PASSWORD'
hana_encrypt = True #for HANA Cloud

# Establish connection
conn = dataframe.ConnectionContext(address = hana_address,
                                   port = hana_port, 
                                   user = hana_user, 
                                   password = hana_password, 
                                   encrypt = hana_encrypt,
                                   sslValidateCertificate = 'false')
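
Hard-coding credentials in a notebook makes them easy to leak. The following is a minimal sketch, not part of the hands-on guide, that reads the connection settings from environment variables instead. The variable names `HANA_ADDRESS`, `HANA_PORT`, `HANA_USER`, and `HANA_PASSWORD` and the helper `load_hana_settings` are hypothetical; the returned keys mirror the keyword arguments used in the cell above.

```python
import os


def load_hana_settings(env=None):
    """Build ConnectionContext keyword arguments from environment variables.

    The variable names (HANA_ADDRESS, HANA_PORT, HANA_USER, HANA_PASSWORD)
    are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    required = ("HANA_ADDRESS", "HANA_USER", "HANA_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["HANA_ADDRESS"],
        "port": int(env.get("HANA_PORT", "443")),  # 443 as in the cell above
        "user": env["HANA_USER"],
        "password": env["HANA_PASSWORD"],
        "encrypt": True,  # for HANA Cloud
        "sslValidateCertificate": "false",
    }


# Usage in the notebook (requires hana_ml and a reachable tenant):
# conn = dataframe.ConnectionContext(**load_hana_settings())
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.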

### Create data frame from remote HANA table

The data already exists in your schema in the table **GX_TRANSACTIONS**. Create a data frame with the `table()` method (or the `sql()` method) of the connection and get the row count.

> **Important!** Make sure you have successfully created the **GX_TRANSACTIONS** table as instructed in the ***Getting started*** section.

In [None]:
# Create data frame
df_remote = conn.table("GX_TRANSACTIONS")

# Count records in data frame
df_remote.count()

### Inspect data frame data types

Pre-conversion inspection

In [None]:
# Check the column data types in SAP HANA
df_remote.dtypes()

### Convert the following variables accordingly

In [4]:
# Cast the target variable FRAUD to a string type
df_remote = df_remote.cast('FRAUD', 'NVARCHAR(20)')

# Cast the numeric predictors to DOUBLE
df_remote = df_remote.cast('AMOUNT', 'DOUBLE')
df_remote = df_remote.cast('OLD_BALANCE_ORIGIN', 'DOUBLE')
df_remote = df_remote.cast('NEW_BALANCE_ORIGIN', 'DOUBLE')
df_remote = df_remote.cast('OLD_BALANCE_DEST', 'DOUBLE')
df_remote = df_remote.cast('NEW_BALANCE_DEST', 'DOUBLE')

### Post-conversion: review a short description of the data

> **Note:** The target variable is called **FRAUD**. In addition, there are eight predictors capturing different information about a transaction.

In [None]:
# Check the column data types after the conversion
df_remote.dtypes()

In [None]:
#describe the data in SAP HANA
df_remote.describe().collect()

### Split the data into a training and testing set

In [None]:
%%time
# Create training and testing sets (50/50 split, no validation set)
from hana_ml.algorithms.pal import partition
df_remote_train, df_remote_test, df_remote_val = partition.train_test_val_split(data = df_remote, 
                                                                                   training_percentage = 0.5, 
                                                                                   testing_percentage = 0.5,
                                                                                   validation_percentage = 0)

### Check the size of the training and testing datasets

In [None]:
# Check the size of the training and testing sets
print('Size of training subset: ' + str(df_remote_train.count()))
print('Size of test subset: ' + str(df_remote_test.count()))

### Import the following dependencies for the Automatic Classification.


In [11]:
from hana_ml import dataframe
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import DataSets, Settings
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.algorithms.pal.auto_ml import AutomaticClassification, AutomaticRegression
from hana_ml.visualizers.automl_progress import PipelineProgressStatusMonitor
from hana_ml.visualizers.automl_report import BestPipelineReport
from hana_ml.visualizers.unified_report import UnifiedReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import time
import json
import uuid

### Manage the workload in SAP HANA Cloud tenant by creating workload classes

Workload classes help control resource utilization in SAP HANA Cloud. See [Managing Workload with Workload Classes](https://help.sap.com/docs/SAP_HANA_PLATFORM/6b94445c94ae495c83a19646e7c3fd56/5066181717df4110931271d1efd84cbc.html) for more information.

> **Note:** Ignore the error if the workload class PAL_AUTOML_WORKLOAD already exists.

In [None]:
conn.execute_sql('''
CREATE WORKLOAD CLASS "PAL_AUTOML_WORKLOAD" SET 'PRIORITY' = '3', 'STATEMENT MEMORY LIMIT' = '3' , 'STATEMENT THREAD LIMIT' = '20'
''')
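
Instead of ignoring the error by hand, the statement can be wrapped in a small helper. This is a sketch under assumptions: the helper name `create_workload_class` is hypothetical, and matching on the text "already exist" is a guess at the wording of the error message, so adjust it to what your tenant actually returns. Any other error is re-raised.

```python
def create_workload_class(execute_sql, statement):
    """Run a CREATE WORKLOAD CLASS statement, tolerating 'already exists'.

    execute_sql is any callable that takes a SQL string, for example
    conn.execute_sql. Returns True if the class was created and False if
    it was already there. The 'already exist' text match is an assumption
    about the error message.
    """
    try:
        execute_sql(statement)
        return True
    except Exception as exc:
        if "already exist" in str(exc).lower():
            return False
        raise  # anything else is a real problem and should surface


# Usage in the notebook (requires the conn object created earlier):
# create_workload_class(conn.execute_sql, '''
# CREATE WORKLOAD CLASS "PAL_AUTOML_WORKLOAD" SET 'PRIORITY' = '3',
#   'STATEMENT MEMORY LIMIT' = '3', 'STATEMENT THREAD LIMIT' = '20'
# ''')
```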

### Set the maximum runtime for individual pipeline evaluations with the parameter max_eval_time_mins 

The AutoML approach automatically executes data processing, model fitting, comparison and optimization.

First, create an AutoML classifier object `auto_c` in the following cell. It is helpful to review and set the respective AutoML configuration parameters.

The defined scenario runs two generations of pipeline optimization. The total number of pipelines evaluated equals population_size + generations × offspring_size, which in this case amounts to 5 + 2 × 5 = 15 pipelines.
With elite_number, you specify how many of the best pipelines you want to compare. Setting random_seed=1234 helps to get reproducible AutoML runs.
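
The pipeline count stated above can be checked with a few lines of plain Python. The helper name `total_pipelines` is only illustrative; the formula is the one given in the text.

```python
def total_pipelines(population_size, generations, offspring_size):
    """Number of evaluated pipelines: initial population plus offspring per generation."""
    return population_size + generations * offspring_size


# Values used in this hands-on scenario
print(total_pipelines(population_size=5, generations=2, offspring_size=5))  # 15
```

Doubling `generations` to 4 would raise the count to 25, which is a quick way to estimate how a parameter change affects runtime.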


> **Important!** Replace `YOUR_USER_NAME` in the .format() call with the username supplied in the **registration e-mail**.

In [None]:
import uuid
scenario_id = "{}_AutoMLc_{}".format("YOUR_USER_NAME", uuid.uuid1())
print(scenario_id)

# Set the initial AutoML scenario parameters
auto_c = AutomaticClassification(generations=2, 
                                 population_size=5,
                                 offspring_size=5, 
                                 elite_number=5,
                                 random_seed=1234,
                                 progress_indicator_id=scenario_id)
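
The cell above does not set `max_eval_time_mins`, the parameter that limits the runtime of an individual pipeline evaluation. A sketch of how the arguments could be collected with such a limit follows; the value of 10 minutes is purely illustrative and not taken from the hands-on guide.

```python
# Keyword arguments for AutomaticClassification. max_eval_time_mins=10 is an
# illustrative value, not a recommendation from the hands-on guide.
automl_params = {
    "generations": 2,
    "population_size": 5,
    "offspring_size": 5,
    "elite_number": 5,
    "random_seed": 1234,
    "max_eval_time_mins": 10,
}
print(automl_params)

# In the notebook (requires hana_ml and the scenario_id defined above):
# auto_c = AutomaticClassification(progress_indicator_id=scenario_id,
#                                  **automl_params)
```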

### Reinitialize and display the AutoML operators and their parameters.

>**Note:** A default set of AutoML classification operators and parameters is provided as the global config-dict, which can be adjusted to the needs of the targeted AutoML scenario. Use methods like **update_config_dict, delete_config_dict, display_config_dict** to update the scenario definition.

In [None]:
# Reinitialize the AutoML operators and their parameters
auto_c.reset_config_dict(conn)
auto_c.display_config_dict()

### Choose SMOTETomek as the resampling method

Adjust some of the settings to narrow the search space. Because the data is imbalanced, keep SMOTETomek as the only resampling method.
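
To see why imbalance matters, consider a model that always predicts the majority class. The sketch below uses synthetic labels (990 non-fraud, 10 fraud; these numbers are made up and not taken from GX_TRANSACTIONS) and a hypothetical helper `majority_baseline`. It illustrates the general problem that resampling addresses, not how SMOTETomek works internally.

```python
from collections import Counter


def majority_baseline(labels, positive="1"):
    """Accuracy and positive-class recall of always predicting the majority class."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    accuracy = counts[majority] / len(labels)
    # Every sample gets the majority label, so positives are found
    # only if the positive class happens to be the majority.
    recall = 1.0 if majority == positive else 0.0
    return majority, accuracy, recall


# Synthetic, heavily imbalanced labels (illustrative only)
labels = ["0"] * 990 + ["1"] * 10
majority, accuracy, recall = majority_baseline(labels)
print(f"majority={majority} accuracy={accuracy:.2f} recall={recall:.2f}")
```

The baseline reaches 99% accuracy while detecting no fraud at all, which is why a resampler is kept in the search space.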

In [None]:
# Modify the AutoML Classification Scenario

# Drop all resamplers except SMOTETomek
auto_c.delete_config_dict("SAMPLING")
auto_c.delete_config_dict("SMOTE")
auto_c.delete_config_dict("TomekLinks")

auto_c.display_config_dict(category="Resampler")

### Exclude the Transformer methods

Exclude all Transformer methods. Of the classifiers, keep only the Hybrid Gradient Boosting Tree (HGBT_Classifier) and Multi-Class Logistic Regression (M_LOGR_Classifier).

In [None]:
# Drop all Transformer operators
auto_c.delete_config_dict(category="Transformer")

# Drop all classifiers except HGBT_Classifier and M_LOGR_Classifier
auto_c.delete_config_dict("DT_Classifier")
auto_c.delete_config_dict("SVM_Classifier")
auto_c.delete_config_dict("NB_Classifier")
auto_c.delete_config_dict("MLP_Classifier")
auto_c.delete_config_dict("RDT_Classifier")

auto_c.display_config_dict(category="Classifier")

### Set some parameters for the optimization of the algorithms.

In [None]:
# Change / update Classifier parameter values and ranges
auto_c.update_config_dict("M_LOGR_Classifier", "ENET_LAMBDA", [0.001, 0.01, 0.1])
auto_c.display_config_dict("M_LOGR_Classifier")

auto_c.update_config_dict("HGBT_Classifier", "ETA", [1e-2, 1e-1, 0.5])
auto_c.update_config_dict("HGBT_Classifier", "MAX_DEPTH", {'range': [1, 1, 11]})
auto_c.update_config_dict("HGBT_Classifier", "NODE_SIZE", {'range': [1, 1, 21]})
auto_c.display_config_dict("HGBT_Classifier")


### Review the complete AutoML configuration for the classification.

In [None]:
# Review complete AutoML Classification configuration
auto_c.display_config_dict()

### Fit the Auto ML scenario on the training data

Fit the Auto ML scenario on the training data. This may take a couple of minutes. If it takes too long, exclude SMOTETomek from the Resampler category of the configuration.

Inspect the pipeline progress through the execution logs.

In [None]:
%%time
# enable_workload_class
auto_c.enable_workload_class(workload_class_name="PAL_AUTOML_WORKLOAD")

# invoke a PipelineProgressStatusMonitor
progress_status_monitor = PipelineProgressStatusMonitor(connection_context= conn, 
                                                        automatic_obj=auto_c)

progress_status_monitor.start()

# training
try:
    auto_c.fit(data=df_remote_train, key='TRANSACTION_ID', label = "FRAUD")
except Exception as e:
    raise e

### Evaluate the best model on the testing data

In [None]:
pipeline = auto_c.model_[1].collect().iat[0, 1]
res_ev = auto_c.evaluate(df_remote_test, pipeline=pipeline)
print(res_ev.collect())

### Create predictions with your machine learning model

In [None]:
res = auto_c.predict(df_remote_test.deselect("FRAUD"), key = 'TRANSACTION_ID')
print(res.collect())

### Save the best model in SAP HANA

> **Important!** Set `MODEL_SCHEMA` to the username provided in the registration e-mail.

In [22]:
from hana_ml.model_storage import ModelStorage
MODEL_SCHEMA = 'TAC004119U01' # HANA schema in which models are to be saved
model_storage = ModelStorage(connection_context=conn, schema=MODEL_SCHEMA)

### Save the model through the following command.


In [23]:
auto_c.name = 'AutoML Classification' 
auto_c.version = 1
model_storage.save_model(model=auto_c)
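
A saved model is only useful if it can be loaded again. The sketch below wraps the `load_model` call of the storage object in a hypothetical helper `reload_model`, so the call shape is visible without a live connection. It assumes the storage object behaves like `hana_ml.model_storage.ModelStorage` and that the name and version match the ones set above.

```python
def reload_model(storage, name, version):
    """Load a saved model back from model storage.

    storage is expected to behave like hana_ml.model_storage.ModelStorage,
    that is, to offer load_model(name=..., version=...).
    """
    return storage.load_model(name=name, version=version)


# In the notebook (requires the model_storage object created above):
# auto_c_reloaded = reload_model(model_storage, "AutoML Classification", 1)
# print(model_storage.list_models())
```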