# hana-ml Tutorial - Regression

**Author: TI HDA DB HANA Core CN**

In this tutorial, we show how to use hana-ml functions to preprocess data and train a regression model on a public bike dataset. We also demonstrate model storage, dataset and model reports, and model explanations.

## Import necessary libraries and functions

In [None]:
from hana_ml import dataframe
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import DataSets, Settings
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.algorithms.pal.unified_regression import UnifiedRegression
from hana_ml.algorithms.pal.model_selection import GridSearchCV
from hana_ml.model_storage import ModelStorage
from IPython.display import HTML
from hana_ml.visualizers.shap import ShapleyExplainer
from hana_ml.visualizers.unified_report import UnifiedReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import time
import json
%matplotlib inline

## Create a connection to a SAP HANA instance

First, create a connection to a SAP HANA instance. In the following cell, we use a config file, config/e2edata.ini, to control the connection parameters.

In your case, replace url, port, user, and pwd below with the connection details of your own HANA instance.

In [None]:
# Please replace url, port, user, pwd with your HANA instance information
connection_context = ConnectionContext(url, port, user, pwd)
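
The connection parameters can also be kept out of the notebook and read from an INI file. The following is a minimal sketch using the standard library's `configparser`. The section name `hana`, the key names, and the sample values are assumptions for illustration and do not reflect the actual layout of config/e2edata.ini. The text is parsed from a string so that the sketch runs without a file on disk.

```python
import configparser

# Hypothetical INI content; the section and key names are assumptions,
# not the actual layout of config/e2edata.ini.
SAMPLE_INI = """
[hana]
url = my-hana-host.example.com
port = 30015
user = ML_USER
pwd = change-me
"""

def read_connection_params(ini_text, section="hana"):
    """Return (url, port, user, pwd) parsed from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    # getint converts the port to an integer and fails early on bad input
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["pwd"]

url, port, user, pwd = read_connection_params(SAMPLE_INI)
# These values could then be passed to ConnectionContext(url, port, user, pwd).
```

With a real file, `parser.read(path)` would take the place of `parser.read_string(ini_text)`.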

## Load the dataset

hana-ml provides a class called DataSets, which contains several public datasets. You can use load_bike_data() to load the bike dataset.

**Load the data**

In [None]:
# load the dataset
df_bike = DataSets.load_bike_data(connection_context)

# Add an ID column, generated from 'days_since_2011', to use as the key
df_bike = df_bike.add_id('ID', ref_col='days_since_2011')
df_bike = df_bike.cast('yr', new_type="NVARCHAR")

# Move the label column 'cnt' to the last position
cols = df_bike.columns
cols.remove('cnt')
df_bike = df_bike[cols + ['cnt']]

# number of rows and number of columns
print("Shape of bike dataset: {}".format(df_bike.shape))
# columns
print(df_bike.columns)
# types of each column
print(df_bike.dtypes())


**Generate a Dataset Report**

In [None]:
UnifiedReport(df_bike).build().display()

**Split the dataset**

In [None]:
df_bike_train, df_bike_test, _ = train_test_val_split(data=df_bike, 
                                                      random_seed=2,
                                                      training_percentage=0.75,
                                                      testing_percentage=0.25,
                                                      validation_percentage=0)

print("Number of training samples: {}".format(df_bike_train.count()))
print("Number of test samples: {}".format(df_bike_test.count()))
df_bike_test = df_bike_test.deselect('cnt')
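
To illustrate what a seeded 75/25 split does, here is a minimal pure-Python sketch that shuffles a list of IDs reproducibly and cuts it at the training percentage. It is only an illustration of the idea; the function `split_ids` is hypothetical and does not reproduce how `train_test_val_split` partitions rows inside SAP HANA.

```python
import random

def split_ids(ids, training_percentage=0.75, random_seed=2):
    """Shuffle IDs reproducibly and cut them into train and test lists."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the global random state untouched
    random.Random(random_seed).shuffle(shuffled)
    n_train = round(len(shuffled) * training_percentage)
    return shuffled[:n_train], shuffled[n_train:]

train_ids, test_ids = split_ids(range(1, 101))
print(len(train_ids), len(test_ids))
```

Because the seed is fixed, calling `split_ids` again with the same arguments returns the same partition, which is what makes the experiment repeatable.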

**Look at the first three rows of data**

In [None]:
print(df_bike_train.head(3).collect())
print(df_bike_test.head(3).collect())

## Model training with CV

UnifiedRegression offers a variety of regression algorithms; we select HybridGradientBoostingTree for training.
The available options are:

- 'DecisionTree'
- 'HybridGradientBoostingTree'
- 'LinearRegression'
- 'RandomDecisionTree'
- 'MLP'
- 'SVM'
- 'GLM'
- 'GeometricRegression'
- 'PolynomialRegression'
- 'ExponentialRegression'
- 'LogarithmicRegression'

In [None]:
ur_hgbt = UnifiedRegression(func='HybridGradientBoostingTree')

gscv = GridSearchCV(estimator=ur_hgbt, 
                    param_grid={'learning_rate': [0.001, 0.01, 0.1],
                                'n_estimators': [5, 10, 20, 50],
                                'split_threshold': [0.1, 0.5, 1]},
                    train_control=dict(fold_num=3,
                                       resampling_method='cv',
                                       random_state=1,
                                       ref_metric=['rmse']),
                    scoring='rmse')

gscv.fit(data=df_bike_train, 
         key= 'ID',
         label='cnt',
         build_report=False)
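
To get a feel for the size of this search, the sketch below enumerates the parameter grid with `itertools.product`. It assumes an exhaustive search in which every candidate is evaluated once per fold; it only counts combinations and does not train anything.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [5, 10, 20, 50],
    "split_threshold": [0.1, 0.5, 1],
}
fold_num = 3

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)    # 3 * 4 * 3 combinations
n_fits = n_candidates * fold_num  # one evaluation per candidate per fold
print(n_candidates, n_fits)
```

The count grows multiplicatively with each added parameter, so it is worth checking before extending the grid.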

**Look at the model**

In [None]:
# Model table
print(gscv.estimator.model_[0].head(5).collect())
# Statistics
print(gscv.estimator.model_[1].collect())
#752.918, 727.203, 654.585
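
The grid search is scored with `rmse` (root mean squared error). As a reminder of what this metric measures, here is a minimal sketch of the standard formula in plain Python. The rental counts are invented for illustration and are not taken from the bike dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy daily rental counts, invented for illustration.
actual = [1000, 2000, 3000]
predicted = [1100, 1900, 3200]
error = rmse(actual, predicted)
print(round(error, 3))
```

RMSE is expressed in the same unit as the label, so a lower value means the predicted counts are closer to the observed ones.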

**Generate a model report**

In [None]:
UnifiedReport(gscv.estimator).build().display()

**Save the model**

In [None]:
model_storage = ModelStorage(connection_context=connection_context)
model_storage.clean_up()

# Saves the model for the first time
ur_hgbt.name = 'HGBT model'  # The model name is mandatory
ur_hgbt.version = 1
model_storage.save_model(model=ur_hgbt)

# Lists models
model_storage.list_models()

## Model prediction

In [None]:
# Predict on the test set and request model explanations (reason codes)
features = df_bike_test.columns
features.remove('ID')
pred_res = gscv.predict(data=df_bike_test, 
                        attribution_method='tree-shap',
                        key='ID', 
                        features=features)

pred_res.head(10).collect()

In [None]:
# Look at the details of the first test instance
rc = pred_res.head(1).select("ID", "SCORE", "REASON").head(1).collect()
HTML(rc.to_html())
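
The `REASON` column carries the per-feature contributions for each prediction. The sketch below shows how such a value could be parsed and ranked once it has been collected. It assumes that each value is a JSON array of objects with `attr`, `val`, and `pct` keys; that layout and the sample numbers are assumptions for illustration, so inspect the real output before relying on it.

```python
import json

# Invented reason-code string; the key names (attr, val, pct) are assumed.
sample_reason = json.dumps([
    {"attr": "temp", "val": 820.5, "pct": 41.0},
    {"attr": "hum", "val": -310.2, "pct": 15.5},
    {"attr": "windspeed", "val": -95.0, "pct": 4.7},
])

def top_contributions(reason_json, n=2):
    """Return the n (feature, contribution) pairs with the largest magnitude."""
    items = json.loads(reason_json)
    ranked = sorted(items, key=lambda item: abs(item["val"]), reverse=True)
    return [(item["attr"], item["val"]) for item in ranked[:n]]

top = top_contributions(sample_reason)
print(top)
```

Ranking by absolute value keeps strong negative contributions visible alongside positive ones.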

## Model Explainability

In [None]:
shapley_explainer = ShapleyExplainer(feature_data=df_bike_test.select(features), 
                                     reason_code_data=pred_res.select('REASON'))
shapley_explainer.summary_plot()
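
The summary plot is built from Shapley-style contributions. To show what these contributions mean, the sketch below computes exact Shapley values for a toy model by brute force over all feature orderings. This is not the tree-SHAP algorithm selected by `attribution_method='tree-shap'`; the toy model, the baseline, and the function names are invented for illustration. It demonstrates the additivity property: the contributions sum to the prediction minus the baseline prediction.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values, averaging marginal contributions over all
    feature orderings. Absent features take their baseline value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            # Switch one feature from baseline to its actual value
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Toy model with an interaction term; invented for illustration.
def toy_model(x):
    return 100 * x["temp"] - 50 * x["hum"] + 20 * x["temp"] * x["windspeed"]

instance = {"temp": 1.0, "hum": 0.5, "windspeed": 2.0}
baseline = {"temp": 0.0, "hum": 0.0, "windspeed": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The interaction between `temp` and `windspeed` is shared equally between the two features. The brute-force approach is only practical for a handful of features, because the number of orderings grows factorially.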

## Close the connection

In [None]:
connection_context.close()

## Thank you!