# Experiencing HANA ML Unified Report

## UnifiedClassification

## Pima Indians Diabetes Dataset

The original data comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset is intended for predicting, from certain diagnostic measurements, whether or not a patient has diabetes. In particular, all patients in the dataset are females of Pima Indian heritage above the age of 20. The dataset is from Kaggle and is used here for tutorial purposes only.

The dataset contains the following diagnostic <b>attributes</b>:<br>
$\rhd$ "PREGNANCIES" - Number of times pregnant,<br>
$\rhd$ "GLUCOSE" - Plasma glucose concentration at 2 hours in an oral glucose tolerance test,<br>
$\rhd$ "BLOODPRESSURE" -  Diastolic blood pressure (mm Hg),<br>
$\rhd$ "SKINTHICKNESS" -  Triceps skin fold thickness (mm),<br>
$\rhd$ "INSULIN" - 2-Hour serum insulin (mu U/ml),<br>
$\rhd$ "BMI" - Body mass index $(\text{weight in kg})/(\text{height in m})^2$,<br>
$\rhd$ "PEDIGREE" - Diabetes pedigree function,<br>
$\rhd$ "AGE" -  Age (years),<br>
$\rhd$ "CLASS" - Class variable (0 or 1); 268 of the 768 records are 1 (diabetes), the others are 0 (non-diabetes).
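
The class counts quoted above imply a moderately imbalanced target. As a rough sketch using only those two numbers (the variable names are illustrative and not part of hana_ml), the share of positive cases and the error rate of a trivial majority-class predictor can be worked out as follows:

```python
# Class counts as stated in the dataset description above.
total_rows = 768
positive_rows = 268                          # CLASS = 1 (diabetes)
negative_rows = total_rows - positive_rows   # CLASS = 0 (non-diabetes)

# Fraction of records labelled as diabetes.
positive_share = positive_rows / total_rows

# A model that always predicts the majority class (0) is wrong
# exactly on the positive records, so its error rate equals that share.
majority_baseline_error = positive_share

print(f"negatives: {negative_rows}, positive share: {positive_share:.3f}")
```

Any classifier trained later in this notebook should reach an error rate below this baseline to be considered useful.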



In [None]:
import hana_ml
from hana_ml import dataframe
from hana_ml.algorithms.pal import metrics
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
from hana_ml.algorithms.pal.unified_regression import UnifiedRegression

## Load Data

The data is loaded into 3 tables - full set, training-validation set, and test set as follows:

<li> PIMA_INDIANS_DIABETES_TBL</li>
<li> PIMA_INDIANS_DIABETES_TRAIN_VALID_TBL</li>
<li> PIMA_INDIANS_DIABETES_TEST_TBL</li>

To do that, a connection is created and passed to the loader.

A config file, <b>config/e2edata.ini</b>, controls the connection parameters and whether or not to reload the data from scratch. If the data is already loaded, there is no need to load it again. If the config parameter reload_data is true, the tables for the test, training, and validation sets are (re-)created and the data is inserted into them. A sample section is shown below.

#########################<br>
[hana]<br>
url=host.sjc.sap.corp<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
#########################<br>
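
To make the layout of the sample section concrete, here is a minimal sketch that parses it with Python's standard `configparser`. This is only an illustration of the file format, not a description of how `Settings.load_config` is implemented; the helper name `read_hana_section` is hypothetical, and the values are the placeholders from the sample above.

```python
import configparser

# Sample contents mirroring the [hana] section shown above (placeholder values).
SAMPLE_INI = """
[hana]
url=host.sjc.sap.corp
user=username
passwd=userpassword
port=3xx15
"""

def read_hana_section(ini_text):
    """Return (url, port, user, passwd) from the [hana] section of ini_text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["hana"]
    # The port stays a string here because the sample value is a placeholder.
    return section["url"], section["port"], section["user"], section["passwd"]

url, port, user, passwd = read_hana_section(SAMPLE_INI)
print(url, port, user)
```

With a real configuration file the port would be a number, and the four values would be passed on to `dataframe.ConnectionContext`.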

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
import plotting_utils
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")

connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, diabetes_train, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)

Let us look at the number of rows in each dataset:

In [None]:
print('Number of rows in training set: {}'.format(diabetes_train.count()))
print('Number of rows in testing set: {}'.format(diabetes_test.count()))

Let us look at columns of the dataset:

In [None]:
print(diabetes_train.columns)

Let us also look at some rows of the dataset (in this example, the top 6):

In [None]:
diabetes_train.head(6).collect()

We can also check the data type of all columns:

In [None]:
diabetes_train.dtypes()

The dataset has a 'CLASS' column; let us check how many classes it contains:

In [None]:
diabetes_train.distinct('CLASS').collect()

Two classes are available, confirming that this is a binary classification problem.

##  Model Creation & Model Selection
The lines below show the ease with which classification can be done.

Set up the label column, use the default feature set, and run a grid search over the hyperparameters of a HybridGradientBoostingTree classifier:

In [None]:
from hana_ml.algorithms.pal.model_selection import GridSearchCV
from hana_ml.algorithms.pal.model_selection import RandomSearchCV
hgc2 = UnifiedClassification('HybridGradientBoostingTree')

gscv = GridSearchCV(estimator=hgc2, 
                    param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
                                'n_estimators': [4, 6, 8, 10],
                                'split_threshold': [0.1, 0.4, 0.7, 1]},
                    train_control=dict(fold_num=5,
                                       resampling_method='cv',
                                       random_state=1,
                                       ref_metric=['auc']),
                    scoring='error_rate')
gscv.fit(data=diabetes_train, key= 'ID',
         label='CLASS',
         partition_method='stratified',
         partition_random_state=1,
         stratified_column='CLASS',
         build_report=True)
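
To get a feel for the size of this search, the grid can be enumerated in plain Python. The sketch below assumes that an exhaustive grid search evaluates every parameter combination once per fold; how the work is actually scheduled inside HANA is not covered here, and the names `candidates` and `total_fits` are illustrative only.

```python
from itertools import product

# Same grid and fold count as passed to GridSearchCV above.
param_grid = {
    "learning_rate": [0.1, 0.4, 0.7, 1],
    "n_estimators": [4, 6, 8, 10],
    "split_threshold": [0.1, 0.4, 0.7, 1],
}
fold_num = 5

# Cartesian product of all value lists gives one dict per candidate setting.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

# Under the stated assumption, each candidate is trained once per fold.
total_fits = len(candidates) * fold_num

print(len(candidates), total_fits)
```

Adding a fourth parameter with four values would multiply both numbers by four, which is worth keeping in mind before widening the grid.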

In [None]:
from hana_ml.visualizers.unified_report import UnifiedReport

## Dataset Report

In [None]:
UnifiedReport(diabetes_train).build().display()

## Classification Report

In [None]:
UnifiedReport(gscv.estimator).display()

## Regression Report

In [None]:
dt_params = dict(model_format = 'pmml',
                 allow_missing_dependent = True,
                 percentage = 1,
                 use_surrogate = True,
                 split_threshold = 1e-5,
                 min_records_of_parent = 2,
                 min_records_of_leaf = 1,
                 thread_ratio = 0.5,
                 evaluation_metric='rmse')
udtr = UnifiedRegression(func = 'DecisionTree', **dt_params)
#udtr.fit(data = self.data_dt, partition_method = 'random',
#         partition_random_state=2, output_partition_result = True)

gscv = GridSearchCV(estimator=udtr, 
                    param_grid={'split_threshold': [0.1, 0.4, 0.7, 1]},
                    train_control=dict(fold_num=5,
                                       resampling_method='cv',
                                       random_state=1),
                    scoring='rmse')
gscv.fit(data=diabetes_train, key= 'ID',
         label='CLASS',
         partition_method='random',
         partition_random_state=1,
         build_report=True)
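
The regression run above reuses the 0/1 'CLASS' column as a numeric target and scores candidates by RMSE. As a small sketch of what that metric measures on such a target, the following computes RMSE in plain Python; the labels and predictions are made-up illustrative values, not output from the model, and the function is a generic textbook formula rather than the PAL implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical 0/1 labels and real-valued predictions, for illustration only.
y_true = [0, 1, 1, 0]
y_pred = [0.0, 1.0, 0.5, 0.5]

score = rmse(y_true, y_pred)
print(round(score, 4))
```

A prediction of 0.5 for every row would yield an RMSE of 0.5 on a 0/1 target, which gives a simple reference point when reading the regression report.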

In [None]:
UnifiedReport(gscv.estimator).display()

In [None]:
connection_context.close()