# hana-ml Tutorial - Classification

**Author: TI HDA DB HANA Core CN**

In this tutorial, we show how to use hana-ml functions to preprocess data and train a classification model on the public Diabetes dataset. We also demonstrate model storage, dataset and model reports, and model explanations.

## Import necessary libraries and functions

In [None]:
from hana_ml import dataframe
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import DataSets, Settings
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
from hana_ml.algorithms.pal.model_selection import GridSearchCV
from hana_ml.model_storage import ModelStorage
from IPython.display import HTML  # IPython.core.display is deprecated for this import
from hana_ml.visualizers.shap import ShapleyExplainer
from hana_ml.visualizers.unified_report import UnifiedReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import time
import json
%matplotlib inline

## Create a connection to a SAP HANA instance

First, you need to create a connection to a SAP HANA instance. In our environment, the connection parameters are kept in a config file, config/e2edata.ini.

In your case, replace url, port, user, and pwd in the following cell with the details of your own HANA instance.

In [None]:
# Please replace url, port, user, pwd with your HANA instance information
connection_context = ConnectionContext(url, port, user, pwd)
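The config-file approach mentioned above can be sketched with the standard-library `configparser`. The section name `hana` and the key names `url`, `port`, `user`, and `passwd` are assumptions for illustration; the actual layout of config/e2edata.ini is not shown in this tutorial.

```python
import configparser

def read_hana_config(text, section="hana"):
    """Parse INI text and return (url, port, user, pwd)."""
    # interpolation=None so that '%' characters in passwords are kept literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    cfg = parser[section]
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["passwd"]

# Hypothetical file contents; replace with your own instance details.
sample_ini = """
[hana]
url = my-hana-host.example.com
port = 30015
user = ML_USER
passwd = secret
"""

url, port, user, pwd = read_hana_config(sample_ini)
print(url, port, user)
```

With a real file, `parser.read(path)` would take the place of `read_string`, and the four values can then be passed to `ConnectionContext(url, port, user, pwd)`.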

## Load the dataset

The Diabetes dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes. The columns are:

1. **PREGNANCIES**: Number of times pregnant
2. **GLUCOSE**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. **BLOODPRESSURE**: Diastolic blood pressure (mm Hg)
4. **SKINTHICKNESS**: Triceps skin fold thickness (mm)
5. **INSULIN**: 2-Hour serum insulin (mu U/ml)
6. **BMI**: Body mass index (weight in kg/(height in m)^2)
7. **PEDIGREE**: Diabetes pedigree function
8. **AGE**: Age (years)
9. **CLASS**: Class variable (0 or 1), the **target variable**.

hana-ml provides a class called DataSets that contains several public datasets. You can use load_diabetes_data to load the diabetes dataset.

**Load the data**

In [None]:
diabetes_dataset, _, _, _ = DataSets.load_diabetes_data(connection_context)
# number of rows and number of columns
print("Shape of diabetes dataset: {}".format(diabetes_dataset.shape))
# columns
print(diabetes_dataset.columns)
# types of each column
print(diabetes_dataset.dtypes())

**Generate a Dataset Report**

In [None]:
UnifiedReport(diabetes_dataset).build().display()

**Split the dataset**

In [None]:
df_diabetes_train, df_diabetes_test, _ = train_test_val_split(data=diabetes_dataset, 
                                                              random_seed=2,
                                                              training_percentage=0.8,
                                                              testing_percentage=0.2,
                                                              validation_percentage=0,
                                                              id_column='ID',
                                                              partition_method='stratified',
                                                              stratified_column='CLASS')

print("Number of training samples: {}".format(df_diabetes_train.count()))
print("Number of test samples: {}".format(df_diabetes_test.count()))
df_diabetes_test = df_diabetes_test.deselect('CLASS')
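To illustrate what a stratified partition does, here is a minimal pure-Python sketch. It is not the PAL implementation behind train_test_val_split; the function name `stratified_split` and the toy rows are hypothetical. The idea is that each class is split separately, so the class proportions are the same in the training and test sets.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_fraction=0.8, seed=2):
    """Split rows into train/test while keeping per-class proportions."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)  # randomize order within the class
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data (not the diabetes dataset): 70 rows of class 0, 30 rows of class 1.
rows = [{"ID": i, "CLASS": 0 if i < 70 else 1} for i in range(100)]
train, test = stratified_split(rows, "CLASS")
print(len(train), len(test))
```

In this toy example both splits keep the 70/30 class ratio, which is the property that partition_method='stratified' with stratified_column='CLASS' is meant to provide.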

**Look at the first three rows of data**

In [None]:
print(df_diabetes_train.head(3).collect())
print(df_diabetes_test.head(3).collect())

## Model training with CV

UnifiedClassification offers a variety of classification algorithms; here we select HybridGradientBoostingTree for training.
The available options are:

- 'DecisionTree'
- 'HybridGradientBoostingTree'
- 'LogisticRegression'
- 'MLP'
- 'NaiveBayes'
- 'RandomDecisionTree'
- 'SVM'

In [None]:
uc_hgbt = UnifiedClassification(func='HybridGradientBoostingTree')

gscv = GridSearchCV(estimator=uc_hgbt, 
                    param_grid={'learning_rate': [0.001, 0.01, 0.1],
                                'n_estimators': [5, 10, 20, 50],
                                'split_threshold': [0.1, 0.5, 1]},
                    train_control=dict(fold_num=3,
                                       resampling_method='cv',
                                       random_state=1,
                                       ref_metric=['auc']),
                    scoring='error_rate')

gscv.fit(data=df_diabetes_train, 
         key='ID',
         label='CLASS',
         partition_method='stratified',
         partition_random_state=1,
         stratified_column='CLASS',
         build_report=False)
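To get a feel for the size of the search above, the following sketch enumerates the same parameter grid with `itertools.product`. It only counts candidates and does not call hana-ml; how the search is actually scheduled inside SAP HANA is not covered here.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call above
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [5, 10, 20, 50],
    "split_threshold": [0.1, 0.5, 1],
}
fold_num = 3

names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))             # number of candidate settings
print(len(combinations) * fold_num)  # candidate settings times folds
```

The grid has 3 x 4 x 3 = 36 candidate settings, and with 3-fold cross-validation each candidate is evaluated on 3 folds, so widening any list increases the cost multiplicatively.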

**Look at the model**

In [None]:
# Model table
print(gscv.estimator.model_[0].head(5).collect())
# Statistics
print(gscv.estimator.model_[1].collect())

**Generate a model report**

In [None]:
UnifiedReport(gscv.estimator).build().display()

**Save the model**

In [None]:
model_storage = ModelStorage(connection_context=connection_context)
# Caution: clean_up() deletes all models previously saved in this model storage
model_storage.clean_up()

# Saves the model for the first time
uc_hgbt.name = 'HGBT model'  # The model name is mandatory
uc_hgbt.version = 1
model_storage.save_model(model=uc_hgbt)

# Lists models
model_storage.list_models()

## Model prediction

In [None]:
# Predict and compute feature attributions (tree SHAP) for model explanation
features = df_diabetes_test.columns
features.remove('ID')
pred_res = gscv.predict(data=df_diabetes_test, 
                        attribution_method='tree-shap',
                        key='ID', 
                        features=features)

pred_res.head(10).collect()

In [None]:
# Look at the details of the first test instance
rc = pred_res.select("ID", "SCORE", "REASON_CODE").head(1).collect()
HTML(rc.to_html())
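The REASON_CODE column holds a JSON string, so it can be inspected with the `json` module imported earlier. The sketch below assumes each entry is an object with `attr` (feature name), `val` (contribution), and `pct` (percentage) keys; check this against your own output. The sample string and its values are made up for illustration, and `top_contributions` is a hypothetical helper.

```python
import json

def top_contributions(reason_code, k=3):
    """Return the k entries with the largest absolute contribution."""
    entries = json.loads(reason_code)
    return sorted(entries, key=lambda e: abs(e["val"]), reverse=True)[:k]

# Hypothetical REASON_CODE string; the values are not real model output.
sample = ('[{"attr": "AGE", "val": 0.07, "pct": 10.0}, '
          '{"attr": "GLUCOSE", "val": 0.42, "pct": 60.0}, '
          '{"attr": "BMI", "val": -0.21, "pct": 30.0}]')

for entry in top_contributions(sample, k=2):
    print(entry["attr"], entry["val"])
```

On real data, and if the format matches the assumption above, the helper could be applied to the collected row, for example `top_contributions(rc["REASON_CODE"].iloc[0])`.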

## Model Explainability

In [None]:
shapley_explainer = ShapleyExplainer(feature_data=df_diabetes_test.select(features), 
                                     reason_code_data=pred_res.select('REASON_CODE'))
shapley_explainer.summary_plot()

## Close the connection

In [None]:
connection_context.close()

## Thank you!