# hana-ml Tutorial - Auto ML

**Author: TI HDA DB HANA Core CN**

In this tutorial, we show how to use AutoML (AutomaticClassification and AutomaticRegression) in hana-ml to train classification and regression models on public datasets.

## Import the Necessary Libraries and Functions

In [None]:
from hana_ml import dataframe
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import DataSets, Settings
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.algorithms.pal.auto_ml import AutomaticClassification, AutomaticRegression
from hana_ml.visualizers.automl_progress import PipelineProgressStatusMonitor
from hana_ml.visualizers.automl_report import BestPipelineReport
from hana_ml.visualizers.unified_report import UnifiedReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import time
import json
import uuid

## Create a connection to a SAP HANA instance

First, you need to create a connection to a SAP HANA instance. We keep the connection parameters in a config file, config/e2edata.ini.

In your case, replace url, port, user, and pwd in the following cell with your own HANA instance information to set up the connection.

In [None]:
# Please replace url, port, user, pwd with your HANA instance information
connection_context = ConnectionContext(url, port, user, pwd)
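
The config file mentioned above is not shown in this notebook, so the following is only a minimal sketch of how such parameters could be read with Python's standard configparser module. The section name `hana`, the key names, and the sample values are assumptions made for illustration; the actual layout of config/e2edata.ini may differ.

```python
import configparser

# Hypothetical file contents; the real config/e2edata.ini layout may differ.
SAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = ML_USER
pwd = change-me
"""


def read_connection_params(ini_text, section="hana"):
    """Return (url, port, user, pwd) parsed from INI text.

    The section and key names are assumptions, not a documented format.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["pwd"]


url, port, user, pwd = read_connection_params(SAMPLE_INI)
print(url, port, user)
```

The four values returned here are the same four arguments that ConnectionContext takes in the cell above. In practice you would read the text from a file on disk instead of an inline string.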

## AutomaticClassification

The diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

1. **PREGNANCIES**: Number of times pregnant
2. **GLUCOSE**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. **BLOODPRESSURE**: Diastolic blood pressure (mm Hg)
4. **SKINTHICKNESS**: Triceps skin fold thickness (mm)
5. **INSULIN**: 2-Hour serum insulin (mu U/ml)
6. **BMI**: Body mass index (weight in kg/(height in m)^2)
7. **PEDIGREE**: Diabetes pedigree function
8. **AGE**: Age (years)
9. **CLASS**: Class variable (0 or 1), **target variable**.

In hana-ml, the DataSets class provides several public datasets. You can use load_diabetes_data to load the diabetes dataset.

#### Load the dataset

In [None]:
# Load the data
diabetes_dataset, _, _, _ = DataSets.load_diabetes_data(connection_context)

# number of rows and number of columns
print("Shape of diabetes dataset: {}".format(diabetes_dataset.shape))

# columns
print(diabetes_dataset.columns)

# cast the label column to VARCHAR so it is treated as a categorical target
diabetes_dataset = diabetes_dataset.cast('CLASS', 'VARCHAR')

# types of each column
print(diabetes_dataset.dtypes())

#### Dataset report

In [None]:
# Generate a Dataset Report
UnifiedReport(diabetes_dataset).build().display()

#### Split the dataset

In [None]:
# Split the dataset into a training and a test dataset
df_diabetes_train, df_diabetes_test, _ = train_test_val_split(data=diabetes_dataset, 
                                                              random_seed=2,
                                                              training_percentage=0.8,
                                                              testing_percentage=0.2,
                                                              validation_percentage=0,
                                                              id_column='ID',
                                                              partition_method='stratified',
                                                              stratified_column='CLASS')
print("Number of training samples: {}".format(df_diabetes_train.count()))
print("Number of test samples: {}".format(df_diabetes_test.count()))

# delete label column in the test dataset
df_diabetes_test = df_diabetes_test.deselect('CLASS')
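
The split above uses partition_method='stratified' with stratified_column='CLASS', which keeps the class proportions roughly equal in each partition. The pure-Python sketch below illustrates that idea on toy data. It is not the PAL implementation, and the function name `stratified_split` and the toy rows are made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=2):
    """Split rows into train/test while preserving per-class proportions."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take the same fraction from every class.
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data: 100 rows of (id, label), 35 of them positive.
rows = [(i, 1 if i < 35 else 0) for i in range(100)]
train, test = stratified_split(rows, label_of=lambda r: r[1])

print(len(train), len(test))
print(sum(r[1] for r in train) / len(train))
print(sum(r[1] for r in test) / len(test))
```

Both partitions end up with the same share of positive rows as the full toy dataset, which is the property that matters when the classes are imbalanced.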

In [None]:
# Look at the first three rows of data
print(df_diabetes_train.head(3).collect())
print(df_diabetes_test.head(3).collect())

#### Invoke AutomaticClassification

When you invoke AutomaticClassification, use enable_workload_class() to manage the workload in your SAP HANA instance. More details can be found in the SAP Help Portal:

https://help.sap.com/viewer/afa922439b204e9caf22c78b6b69e4f2/2.10.0.0/en-US/4499964b5ace432a80c572cc434240ab.html

In this example, we have configured a Workload Class in the SAP HANA database called "PAL_AUTOML_WORKLOAD".

In [None]:
# AutomaticClassification init 
progress_id = "automl_{}".format(uuid.uuid1())
auto_c = AutomaticClassification(generations=2, 
                                 population_size=5,
                                 offspring_size=5, 
                                 progress_indicator_id=progress_id,
                                 early_stop=1,
                                 max_eval_time_mins=1,
                                 random_seed=1234,
                                 scorings={"F1_SCORE_1": 1.0},
                                 elite_number=3)

# enable_workload_class
auto_c.enable_workload_class(workload_class_name="PAL_AUTOML_WORKLOAD")

# invoke a PipelineProgressStatusMonitor
progress_status_monitor = PipelineProgressStatusMonitor(connection_context=connection_context, 
                                                        automatic_obj=auto_c)

progress_status_monitor.start()

# training
try:
    auto_c.fit(data=df_diabetes_train, key="ID")
except Exception as e:
    raise e
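
The scorings argument above, {"F1_SCORE_1": 1.0}, appears to ask the optimizer to rank pipelines by the F1 score of class "1" with a weight of 1.0. As a reminder of what that metric measures, here is a small pure-Python sketch. The helper name `f1_for_class` and the toy labels are illustrative only and say nothing about how PAL computes the score internally.

```python
def f1_for_class(y_true, y_pred, positive="1"):
    """Compute the F1 score of one class from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No correct positive predictions: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels as strings, matching the VARCHAR cast of the CLASS column.
y_true = ["1", "0", "1", "1", "0", "0"]
y_pred = ["1", "0", "0", "1", "1", "0"]

score = f1_for_class(y_true, y_pred)
print(round(score, 4))
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so the F1 score is 2/3 as well.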


#### Best pipeline plot

In [None]:
BestPipelineReport(auto_c).generate_notebook_iframe()

#### Make prediction

In [None]:
res = auto_c.predict(df_diabetes_test, key="ID")
print(res.collect())

#### Use the existing pipeline to fit and predict

In [None]:
# The best pipeline after training
auto_c.best_pipeline_.collect().iat[0, 1]
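
The value returned above is a JSON string, so it can be inspected with the standard json module. The sketch below assumes a nested layout in which each operator maps to "args" and "inputs", and the "data" input is either another operator or the raw data marker. The example string, its operator names, and its parameter names are hypothetical and chosen only for illustration. Real pipelines may have other inputs that this simple walk does not follow.

```python
import json

# Hypothetical pipeline string for illustration; not output from a real run.
EXAMPLE_PIPELINE = json.dumps({
    "HGBT_Classifier": {
        "args": {"ITER_NUM": 100, "MAX_DEPTH": 6},
        "inputs": {
            "data": {
                "SCALE": {
                    "args": {"SCALING_METHOD": "MinMax"},
                    "inputs": {"data": "ROWDATA"},
                }
            }
        },
    }
})


def list_operators(pipeline_json):
    """Return operator names ordered from the first transform to the estimator.

    Follows only the "data" input of each operator (an assumption).
    """
    node = json.loads(pipeline_json)
    names = []
    while isinstance(node, dict):
        # Each node is assumed to hold exactly one operator.
        name, spec = next(iter(node.items()))
        names.append(name)
        node = spec.get("inputs", {}).get("data")
    return list(reversed(names))


operators = list_operators(EXAMPLE_PIPELINE)
print(operators)
```

Listing the operators this way gives a quick textual summary of the pipeline without rendering the report.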

In [None]:
pipeline = auto_c.best_pipeline_.collect().iat[0, 1]

auto_c.fit(df_diabetes_train, pipeline=pipeline, key="ID")

res = auto_c.predict(df_diabetes_test, key="ID")
print(res.collect())


## AutomaticRegression

In [None]:
# Load Dataset
bike_dataset = DataSets.load_bike_data(connection_context)

# number of rows and number of columns
print("Shape of dataset: {}".format(bike_dataset.shape))

# columns
print(bike_dataset.columns)

# types of each column
print(bike_dataset.dtypes())

# print the first 3 rows of dataset
print(bike_dataset.head(3).collect())

#### Dataset report

In [None]:
# Generate a Dataset Report
UnifiedReport(bike_dataset).build().display()

#### Split the dataset

In [None]:
# Add an ID column for AutomaticRegression
bike_dataset = bike_dataset.add_id('ID', ref_col='days_since_2011')

# Move the label column 'cnt' to the end, because the last column is the label
cols = bike_dataset.columns
cols.remove('cnt')
bike_data = bike_dataset[cols + ['cnt']]

# Split the dataset into training and test datasets by ID
bike_train = bike_data.filter('ID <= 600')
bike_test = bike_data.filter('ID > 600')
print(bike_train.head(3).collect())
print(bike_test.head(3).collect())

#### Invoke AutomaticRegression

In [None]:
# AutomaticRegression init 
progress_id = "automl_reg_{}".format(uuid.uuid1())
auto_r = AutomaticRegression(generations=2,
                             population_size=5,
                             offspring_size=5,                             
                             progress_indicator_id=progress_id)

# enable_workload_class
auto_r.enable_workload_class(workload_class_name="PAL_AUTOML_WORKLOAD")

# invoke a PipelineProgressStatusMonitor
progress_status_monitor = PipelineProgressStatusMonitor(connection_context=connection_context, 
                                                        automatic_obj=auto_r)

progress_status_monitor.start()
try:
    auto_r.fit(bike_train, key="ID")
except Exception as e:
    raise e


#### Best pipeline plot

In [None]:
BestPipelineReport(auto_r).generate_notebook_iframe()

#### Make prediction

In [None]:
res = auto_r.predict(bike_test.deselect('cnt'), key="ID")
print(res.collect())
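
The notebook prints the predictions but does not score them. Because bike_test still contains the true cnt values, one possible follow-up is to compare them with the predictions using MAE and RMSE. The sketch below works on plain Python lists with made-up numbers. It makes no assumption about the column names of the prediction result, so you would first need to collect both series yourself.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, RMSE) for two equally long sequences of numbers."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse


# Made-up values standing in for true and predicted rental counts.
actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]

mae, rmse = regression_errors(actual, predicted)
print(mae, round(rmse, 3))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of how far the predictions are from the true counts.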

#### Use the existing pipeline to fit and predict

In [None]:
print(auto_r.best_pipeline_.collect().iat[0, 1])

In [None]:
pipeline = auto_r.best_pipeline_.collect().iat[0, 1]

auto_r.fit(bike_train, pipeline=pipeline, key="ID")

res = auto_r.predict(bike_test.deselect('cnt'), key="ID")
print(res.collect())

## Close the connection

In [None]:
connection_context.close()

## Thank you!