# Unified Classification Example with Random Forest and Model Report

An example of Unified Classification with Random Forest using the Diabetes dataset.



# Pima Indians Diabetes Dataset

The original data comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset was collected to predict, from certain diagnostic measurements, whether or not a patient has diabetes. All patients in the dataset are females of Pima Indian heritage above the age of 20. The dataset is from Kaggle and is used here for tutorial purposes only.

The dataset contains the following diagnostic <b>attributes</b>:<br>
$\rhd$ "PREGNANCIES" - Number of times pregnant,<br>
$\rhd$ "GLUCOSE" - Plasma glucose concentration at 2 hours in an oral glucose tolerance test,<br>
$\rhd$ "BLOODPRESSURE" -  Diastolic blood pressure (mm Hg),<br>
$\rhd$ "SKINTHICKNESS" -  Triceps skin fold thickness (mm),<br>
$\rhd$ "INSULIN" - 2-Hour serum insulin (mu U/ml),<br>
$\rhd$ "BMI" - Body mass index $(\text{weight in kg})/(\text{height in m})^2$,<br>
$\rhd$ "PEDIGREE" - Diabetes pedigree function,<br>
$\rhd$ "AGE" -  Age (years),<br>
$\rhd$ "CLASS" - Class variable (0 or 1); 268 of 768 are 1 (diabetes), the others are 0 (non-diabetes).
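As a quick worked example of the class balance stated above, the sketch below derives the share of positive cases and the accuracy of a trivial classifier that always predicts the majority class. It uses only the two counts given in the attribute list; the variable names are illustrative.

```python
# Counts taken from the dataset description: 268 of 768 rows have CLASS = 1.
positives, total = 268, 768

positive_rate = positives / total
# A model that always predicts the majority class (0) reaches this accuracy,
# which is a useful floor when judging the classifier trained later.
majority_baseline = max(positive_rate, 1 - positive_rate)

print(round(positive_rate, 3), round(majority_baseline, 3))
```

Because the classes are imbalanced, accuracy alone can be misleading, which is one reason the later sections also look at AUC, recall, and precision.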

Import the required modules:

In [None]:
import hana_ml
from hana_ml import dataframe
from hana_ml.algorithms.pal import metrics
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification, json2tab_for_reason_code
import pandas as pd

# Load Data

The data is loaded into 3 tables - full set, training-validation set, and test set as follows:

<li> PIMA_INDIANS_DIABETES_TBL</li>
<li> PIMA_INDIANS_DIABETES_TRAIN_VALID_TBL</li>
<li> PIMA_INDIANS_DIABETES_TEST_TBL</li>

To do that, a connection is created and passed to the loader.

The config file <b>config/e2edata.ini</b> controls the connection parameters and whether the data is reloaded from scratch. If the data is already loaded, there is no need to load it again. If the config parameter reload_data is true, the tables for the test, training, and validation sets are (re-)created and the data is inserted into them. A sample section is shown below.

#########################<br>
[hana]<br>
url=host.sjc.sap.corp<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
#########################<br>
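To make the file layout concrete, the sketch below parses the sample section with Python's standard `configparser`. It only illustrates the format; how `Settings.load_config` reads the file internally is not shown in this notebook, and the helper name `read_hana_section` is hypothetical.

```python
import configparser

# The sample section from above, kept with its placeholder values.
SAMPLE_INI = """
[hana]
url=host.sjc.sap.corp
user=username
passwd=userpassword
port=3xx15
"""

def read_hana_section(text):
    """Hypothetical helper: return (url, port, user, passwd) from the [hana] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["hana"]
    # The port stays a string because the sample value ("3xx15") is a placeholder.
    return section["url"], section["port"], section["user"], section["passwd"]

url, port, user, passwd = read_hana_section(SAMPLE_INI)
print(url, port, user)
```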

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
import plotting_utils
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, diabetes_train, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)

# Simple Exploration

Let us look at the number of rows in each dataset:

In [None]:
print('Number of rows in training set: {}'.format(diabetes_train.count()))
print('Number of rows in testing set: {}'.format(diabetes_test.count()))

Let us look at columns of the dataset:

In [None]:
print(diabetes_train.columns)

Let us also look at some (in this example, the first 3) rows of the dataset:

In [None]:
diabetes_train.head(3).collect()

Check the data type of all columns:

In [None]:
diabetes_train.dtypes()

We have a 'CLASS' column in the dataset; let's check how many classes it contains:

In [None]:
diabetes_train.distinct('CLASS').collect()

Two classes are present, confirming that this is a binary classification problem.

# Model Training
Invoke the unified classification to train the model using random forest: 

In [None]:
rdt_params = dict(random_state=2,
                  split_threshold=1e-7,
                  min_samples_leaf=1,
                  n_estimators=10,
                  max_depth=55)
uc_rdt = UnifiedClassification(func = 'RandomForest', **rdt_params)

uc_rdt.fit(data=diabetes_train,
           key= 'ID', 
           label='CLASS',
           partition_method='stratified',
           stratified_column='CLASS', 
           partition_random_state=2,
           training_percent=0.7, ntiles=2)
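The `partition_method='stratified'` and `training_percent=0.7` arguments ask for a training/validation split that preserves the class proportions. The sketch below shows the general idea in plain Python on toy data. It is not the library's implementation, and `stratified_split` is a hypothetical helper written for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, training_percent=0.7, seed=2):
    """Split rows into (train, valid) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, valid = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Cut each class separately so the split ratio holds per class.
        cut = round(len(members) * training_percent)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy data: (id, class) pairs, 30 positives and 70 negatives.
rows = [(i, 1 if i < 30 else 0) for i in range(100)]
train, valid = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(valid))
```

With a plain random split, a small validation set could end up with noticeably more or fewer positive cases than the training set; splitting per class avoids that.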

## Visualize the model
UnifiedClassification provides the method generate_notebook_iframe_report() to visualize the results; build_report() is called first to prepare the report.

In [None]:
uc_rdt.build_report()
uc_rdt.generate_notebook_iframe_report()

## Output
We can also inspect the results one by one:
### Output 1: variable importance
Indicates the importance of variables:

In [None]:
uc_rdt.importance_.collect().set_index('VARIABLE_NAME').sort_values(by=['IMPORTANCE'],ascending=False)

### Output 2: confusion matrix


In [None]:
uc_rdt.confusion_matrix_.collect()

### Output 3: statistics

In [None]:
uc_rdt.statistics_.collect()

Obtain the AUC value for drawing the ROC curve in the next step:

In [None]:
dtr_auc=uc_rdt.statistics_.filter("STAT_NAME='AUC'").cast('STAT_VALUE','DOUBLE').collect().at[0, 'STAT_VALUE']
dtr_auc

### Output 4: metrics and draw ROC curve

In [None]:
uc_rdt.metrics_.collect()

Draw the ROC curve based on the metrics_:

In [None]:
import matplotlib.pyplot as plt

tpr=uc_rdt.metrics_.filter("NAME='ROC_TPR'").select('Y').collect()
fpr=uc_rdt.metrics_.filter("NAME='ROC_FPR'").select('Y').collect()

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % dtr_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
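To show how the area reported in the legend relates to the plotted points, the sketch below builds ROC points from a handful of made-up scores and integrates them with the trapezoidal rule. It is a minimal illustration that assumes no tied scores; the helper names and the toy data are hypothetical, and the library may compute its AUC differently.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists by lowering the threshold one observation at a time.

    Assumes binary labels (0/1) and no tied scores.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

# Made-up labels and scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_auc(fpr, tpr)
print(round(auc, 4))
```

The result equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, which is the usual interpretation of AUC.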

# Prediction

Obtain the feature columns used for prediction:

In [None]:
features = diabetes_train.columns
features.remove('CLASS')
features.remove('ID')
print(features)

Invoke the prediction with diabetes_test:

In [None]:
pred_res = uc_rdt.predict(diabetes_test, key='ID', features=features)
pred_res.head(10).collect()

##### Global Interpretation using Shapley values
Now that we can calculate Shapley values for each feature of every observation, we can obtain a global interpretation by looking at them in combined form.
Let's see how we can do that:

In [None]:
from hana_ml.visualizers.model_debriefing import TreeModelDebriefing

In [None]:
shapley_explainer = TreeModelDebriefing.shapley_explainer(pred_res, diabetes_test, key='ID', label='CLASS')
shapley_explainer.summary_plot()
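For intuition about what a single Shapley value means, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every coalition of features. This brute-force approach is only feasible for a few features and is not the method used by the library; the toy value function and its numbers are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values for a coalition value function over `features`."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_value(coalition):
    """Hypothetical score: two additive effects plus one interaction term."""
    score = 0.0
    if "GLUCOSE" in coalition:
        score += 0.30
    if "BMI" in coalition:
        score += 0.10
    if "GLUCOSE" in coalition and "AGE" in coalition:
        score += 0.06
    return score

phi = shapley_values(toy_value, ["GLUCOSE", "BMI", "AGE"])
print({name: round(v, 4) for name, v in phi.items()})
```

The interaction term is shared equally between the two features involved, and the values sum to the difference between the full-coalition score and the empty-coalition score. A summary plot aggregates such per-observation values across the whole dataset.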

Expand the REASON_CODE to see the detail of each item:

In [None]:
json2tab_for_reason_code(pred_res).collect()
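Conceptually, expanding a reason code means turning one JSON string per prediction into one row per contributing attribute. The sketch below does this for a single made-up string with the standard `json` module. The field names `attr`, `val`, and `pct` are an assumption about the REASON_CODE layout, the values are invented, and `expand_reason_code` is a hypothetical helper rather than part of hana_ml.

```python
import json

def expand_reason_code(row_id, reason_code):
    """Hypothetical helper: flatten one JSON reason code into (id, attr, val, pct) rows."""
    return [(row_id, item.get("attr"), item.get("val"), item.get("pct"))
            for item in json.loads(reason_code)]

# Made-up reason code; the field names are assumed, not taken from this notebook.
sample = ('[{"attr": "GLUCOSE", "val": 0.21, "pct": 70.0}, '
          '{"attr": "BMI", "val": 0.09, "pct": 30.0}]')
rows = expand_reason_code(1, sample)
for row in rows:
    print(row)
```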

Compute the confusion matrix on the test set:

In [None]:
ts = diabetes_test.rename_columns({'ID': 'TID'}).cast('CLASS', 'NVARCHAR(256)')
jsql = '{}."{}"={}."{}"'.format(pred_res.quoted_name, 'ID', ts.quoted_name, 'TID')
results_df = pred_res.join(ts, jsql, how='inner')
cm_df, classification_report_df = metrics.confusion_matrix(results_df, key='ID', label_true='CLASS', label_pred='SCORE') 

In [None]:
import matplotlib.pyplot as plt
from hana_ml.visualizers.metrics import MetricsVisualizer
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1)
ax1 = mv1.plot_confusion_matrix(cm_df, normalize=False)

In [None]:
print("Recall, Precision and F_measures.")
classification_report_df.collect()
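For reference, the sketch below shows how recall, precision, and the F1 measure follow from the four cells of a binary confusion matrix. It runs on a few made-up labels in plain Python; `binary_report` is a hypothetical helper and is independent of `metrics.confusion_matrix`.

```python
def binary_report(y_true, y_pred, positive=1):
    """Count confusion-matrix cells and derive precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
report = binary_report(y_true, y_pred)
print(report)
```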

# Score

In [None]:
_,_,_,metrics_res = uc_rdt.score(data=diabetes_test, key='ID', label='CLASS')
metrics_res.collect()

In [None]:
metrics_res.distinct('NAME').collect()

Draw the cumulative lift curve:

In [None]:
import matplotlib.pyplot as plt
cumlift_x=metrics_res.filter("NAME='CUMLIFT'").select('X').collect()
cumlift_y=metrics_res.filter("NAME='CUMLIFT'").select('Y').collect()
plt.figure()
plt.plot(cumlift_x, cumlift_y, color='darkorange', lw=1)
plt.xlim([0.0, 1.0])
plt.ylim([0.8, 2.05])
plt.xlabel('Percentage')
plt.ylabel('Cumulative lift')
plt.title('model: Random forest')
plt.show()

Draw the cumulative gains curve:

In [None]:
import matplotlib.pyplot as plt
cumgains_x=metrics_res.filter("NAME='CUMGAINS'").select('X').collect()
cumgains_y=metrics_res.filter("NAME='CUMGAINS'").select('Y').collect()
plt.figure()
plt.plot(cumgains_x, cumgains_y, color='darkorange', lw=1)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Percentage')
plt.ylabel('Cumulative gains')
plt.title('model: Random forest')
plt.show()
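To clarify what the two curves measure, the sketch below computes cumulative gains and cumulative lift by hand for a few made-up scores. Observations are ranked by score, and for each top fraction the gain is the share of all positives captured, while the lift is that gain divided by the fraction. The helper name, the binning, and the data are illustrative assumptions; the library's exact computation may differ.

```python
def cumulative_gains_and_lift(y_true, scores, n_bins=4):
    """Return (fraction, gain, lift) for each cumulative bin of ranked observations."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    result = []
    for b in range(1, n_bins + 1):
        top = order[: len(order) * b // n_bins]
        fraction = len(top) / len(order)
        # Gain: share of all positive cases found in the top fraction.
        gain = sum(y_true[i] for i in top) / total_pos
        # Lift: how much better than random selection of the same size.
        result.append((fraction, gain, gain / fraction))
    return result

# Made-up labels and scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
curve = cumulative_gains_and_lift(y_true, scores)
for fraction, gain, lift in curve:
    print(fraction, round(gain, 3), round(lift, 3))
```

Both curves end at 1.0 when the whole dataset is selected, so the interesting part is how quickly the gain rises and how high the lift starts.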