What's New and Changed in version 2.8.210321
--------------------------------------------

Version 2.8.210321 supports **SAP HANA SPS05** and **SAP HANA Cloud**.

Enhancements:

    - Enhanced sql() to enable multiline execution.
    - Enhanced save() to add append option.
    - Enhanced diff() to enable negative input.
    - Enhanced model report functionality of UnifiedClassification with added model and data visualization.
    - Enhanced dataset_report module with an optimized report generation process and a better user experience.
    - Enhanced UnifiedClustering to support 'distance_level' in AgglomerateHierarchicalClustering and DBSCAN functions. Please refer to documentation for details.
    - Enhanced model storage to support unified report.

New functions:

    - Added generate_html_report() and generate_notebook_iframe_report() functions to UnifiedRegression to display the output, e.g. statistics and model.
    - APL Gradient Boosting: the **other_params** parameter is now supported.
    - APL all models: a new method, **get_model_info**, is created, allowing users to retrieve the summary and the performance metrics of a saved model.
    - APL all models: users can now specify the weight of explanatory variables via the **weight** parameter.
    - Added LSTM.
    - Added support for Text Mining functions on both the SAP HANA on-premise and cloud versions:
        - tf_analysis
        - text_classification
        - get_related_doc
        - get_related_term
        - get_relevant_doc
        - get_relevant_term
        - get_suggested_term
    - Added unified report.

New dependency:

    - Added new dependency 'htmlmin' for generating dataset and model report.

API change:

    - KMeans with two added parameters 'use_fast_library' and 'use_float'.
    - UnifiedRegression with one added parameter 'build_report'.
    - Added a parameter 'distance_level' in UnifiedClustering when 'func' is AgglomerateHierarchicalClustering or DBSCAN. Please refer to documentation for details.
    - Renamed 'batch_size' to 'chunk_size' in create_dataframe_from_pandas.
    - OnlineARIMA has two added parameters, 'random_state' and 'random_initialization', and its partial_fit() function supports two parameters, 'learning_rate' and 'epsilon', for updating the values in the input model.

Bug fixes:

    - Fixed OnlineARIMA model storage support.
    - Fixed inflexible default locations of selected columns of input data, e.g. key, features and endog.
    - Fixed accuracy_measure issue in AutoExponentialSmoothing.


## Multiline SQL execution

The sql() function of ConnectionContext now supports multiline SQL execution. It executes all statements in the script and returns a DataFrame based on the last query statement.

In [None]:
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import Settings, DataSets
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
# the connection
connection_context = ConnectionContext(url, port, user, pwd)
df = connection_context.sql(
"""
DO
BEGIN
outtab = SELECT 1 KEY, 2.2 ENDOG FROM DUMMY;
CREATE LOCAL TEMPORARY TABLE #AABB AS (SELECT * FROM :outtab);
END;

SELECT * FROM #AABB
"""
)
df.collect()
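
To make the idea concrete, here is a minimal sketch of how a script like the one above could be split into statements so that the last one can serve as the query. It is an illustration only, not the hana_ml implementation: the helper name `split_statements` is hypothetical, and the splitter assumes there are no semicolons inside string literals and no `CASE ... END` expressions.

```python
import re


def split_statements(script):
    """Split a SQL script on ';' but keep BEGIN ... END blocks in one piece."""
    statements, current, depth = [], [], 0
    for token in re.split(r"(;)", script):
        if token == ";":
            if depth > 0:
                # Still inside a BEGIN ... END block: the ';' belongs to it.
                current.append(token)
            else:
                statement = "".join(current).strip()
                if statement:
                    statements.append(statement)
                current = []
            continue
        # Track block nesting with whole-word matches (ENDOG is not END).
        for keyword in re.findall(r"\b(BEGIN|END)\b", token.upper()):
            depth += 1 if keyword == "BEGIN" else -1
        current.append(token)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


script = """
DO
BEGIN
outtab = SELECT 1 KEY, 2.2 ENDOG FROM DUMMY;
CREATE LOCAL TEMPORARY TABLE #AABB AS (SELECT * FROM :outtab);
END;

SELECT * FROM #AABB
"""
statements = split_statements(script)
setup, last_query = statements[:-1], statements[-1]
print(len(setup), "setup statement(s); query:", last_query)
```

Under these assumptions the anonymous block stays in one piece and the trailing SELECT is the statement a DataFrame would be built on.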

## LSTM
The data below is taken from the PAL example.

In [None]:
datalist = [
                (0 ,20.7),
                (1 ,17.9),
                (2 ,18.8),
                (3 ,14.6),
                (4 ,15.8),
                (5 ,15.8),
                (6 ,15.8),
                (7 ,17.4),
                (8 ,21.8),
                (9 ,20),
                (10,16.2),
                (11,13.3),
                (12,16.7),
                (13,21.5),
                (14,25),
                (15,20.7),
                (16,20.6),
                (17,24.8),
                (18,17.7),
                (19,15.5),
                (20,18.2),
                (21,12.1),
                (22,14.4),
                (23,16),
                (24,16.5),
                (25,18.7),
                (26,19.4),
                (27,17.2),
                (28,15.5),
                (29,15.1),
                (30,15.4),
                (31,15.3),
                (32,18.8),
                (33,21.9),
                (34,19.9),
                (35,16.6),
                (36,16.8),
                (37,14.6),
                (38,17.1),
                (39,25),
                (40,15),
                (41,13.7),
                (42,13.9),
                (43,18.3),
                (44,22),
                (45,22.1),
                (46,21.2),
                (47,18.4),
                (48,16.6),
                (49,16.1),
                (50,15.7),
                (51,16.6),
                (52,16.5),
                (53,14.4),
                (54,14.4),
                (55,18.5),
                (56,16.9),
                (57,17.5),
                (58,21.2),
                (59,17.8),
                (60,18.6),
                (61,17),
                (62,16),
                (63,13.3),
                (64,14.3),
                (65,11.4),
                (66,16.3),
                (67,16.1),
                (68,11.8),
                (69,12.2),
                (70,14.7),
                (71,11.8),
                (72,11.3),
                (73,10.6),
                (74,11.7),
                (75,14.2),
                (76,11.2),
                (77,16.9),
                (78,16.7),
                (79,8.1),
                (80,8),
                (81,8.8),
                (82,13.4),
                (83,10.9),
                (84,13.4),
                (85,11),
                (86,15),
                (87,15.7),
                (88,14.5),
                (89,15.8),
                (90,16.7),
                (91,16.8),
                (92,17.5),
                (93,17.1),
                (94,18.1),
                (95,16.6),
                (96,10),
                (97,14.9),
                (98,15.9),
                (99,13)]
datalist_predict = [
        (0,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4,14.5,12.6,13.6,11.2,11,12),
        (1,11.9,14.7,9.4,6.6,7.9,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1),
        (2,14.7,9.4,6.6,7.9,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13),
        (3,9.4,6.6,7.9,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4),
        (4,6.6,7.9,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3),
        (5,7.9,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9),
        (6,11,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12),
        (7,15.7,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7),
        (8,15.2,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6),
        (9,15.9,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3),
        (10,10.6,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7),
        (11,8.3,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2),
        (12,8.6,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5),
        (13,12.7,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9),
        (14,10.5,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5),
        (15,12,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4),
        (16,11.1,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4,14.5),
        (17,13,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4,14.5,12.6),
        (18,12.4,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4,14.5,12.6,13.6),
        (19,13.3,15.9,12,13.7,17.6,14.3,13.7,15.2,14.5,14.9,15.5,16.4,14.5,12.6,13.6,11.2)
        ]
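
Rows 1 to 19 of `datalist_predict` are overlapping windows: each holds an ID followed by 16 consecutive values, shifted by one step from the previous row, which matches the `time_dim=16` setting used for the model. A minimal sketch of producing rows in that shape from a flat series follows. `make_windows` is a hypothetical helper, and the short series and the width of 4 are made up to keep the output readable.

```python
def make_windows(series, width):
    """Return (id, v1, ..., v_width) tuples, one per sliding window."""
    return [
        (start, *series[start:start + width])
        for start in range(len(series) - width + 1)
    ]


# Illustrative values only; the prediction data above uses a width of 16.
series = [11.9, 14.7, 9.4, 6.6, 7.9, 11.0]
windows = make_windows(series, width=4)
for row in windows:
    print(row)
```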

In [None]:
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas
lstm_data = create_dataframe_from_pandas(connection_context=connection_context,
                                         pandas_df=pd.DataFrame(datalist, columns=["KEY", "VALUE"]),
                                         table_name="#LSTM_TRAIN",
                                         force=True)
lstm_predict = create_dataframe_from_pandas(connection_context=connection_context,
                                            pandas_df=pd.DataFrame(datalist_predict, columns=["ID",
                                                                                              "VAL1",
                                                                                              "VAL2",
                                                                                              "VAL3",
                                                                                              "VAL4",
                                                                                              "VAL5",
                                                                                              "VAL6",
                                                                                              "VAL7",
                                                                                              "VAL8",
                                                                                              "VAL9",
                                                                                              "VAL10",
                                                                                              "VAL11",
                                                                                              "VAL12",
                                                                                              "VAL13",
                                                                                              "VAL14",
                                                                                              "VAL15",
                                                                                              "VAL16" ]),
                                            table_name="#LSTM_PREDICT",
                                            force=True)

In [None]:
# Import the class directly so the instance name does not shadow the module.
from hana_ml.algorithms.pal.tsa.lstm import LSTM
lstm = LSTM(gru='lstm',
            bidirectional=False,
            time_dim=16,
            max_iter=1000,
            learning_rate=0.01,
            batch_size=32,
            hidden_dim=128,
            num_layers=1,
            interval=1,
            stateful=False,
            optimizer_type='Adam')
lstm.fit(lstm_data)
res = lstm.predict(lstm_predict)
res.head(2).collect()

## SHAPLEY Explainer in Unified Classification
Diabetes data.

In [None]:
from hana_ml.algorithms.pal.model_selection import GridSearchCV
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

full_set, diabetes_train, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)

uc_hgbdt = UnifiedClassification('HybridGradientBoostingTree')

gscv = GridSearchCV(estimator=uc_hgbdt, 
                    param_grid={'learning_rate': [0.1, 0.4, 0.7, 1],
                                'n_estimators': [4, 6, 8, 10],
                                'split_threshold': [0.1, 0.4, 0.7, 1]},
                    train_control=dict(fold_num=5,
                                       resampling_method='cv',
                                       random_state=1,
                                       ref_metric=['auc']),
                    scoring='error_rate')
gscv.fit(data=diabetes_train, key='ID',
         label='CLASS',
         partition_method='stratified',
         partition_random_state=1,
         stratified_column='CLASS',
         build_report=True)
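# Use every column except the key and the label as a feature,
# without modifying the list returned by diabetes_train.columns.
features = [col for col in diabetes_train.columns if col not in ('ID', 'CLASS')]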
pred_res = gscv.predict(diabetes_test, key='ID', features=features)
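
The `param_grid` above defines three hyperparameters with four candidate values each. A minimal sketch, using only the standard library, of how such a grid expands into individual candidate settings is shown below. The search itself is performed by GridSearchCV; this only illustrates the size of the search space.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.1, 0.4, 0.7, 1],
    'n_estimators': [4, 6, 8, 10],
    'split_threshold': [0.1, 0.4, 0.7, 1],
}

# One dict per combination of values, in the order of the grid's keys.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 4 * 4 * 4 = 64
print(candidates[0])
```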

In [None]:
from hana_ml.visualizers.model_debriefing import TreeModelDebriefing

shapley_explainer = TreeModelDebriefing.shapley_explainer(pred_res, diabetes_test, key='ID', label='CLASS')
shapley_explainer.summary_plot()
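
For readers unfamiliar with Shapley values, the following sketch computes them exactly for a small toy function by enumerating feature subsets. It is not how the PAL explainer or TreeModelDebriefing works internally; `shapley_values`, `toy_model`, the instance and the all-zero baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a subset take baseline values."""
    n = len(instance)

    def value(subset):
        mixed = [instance[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(x):
    # Two additive terms plus one interaction between features 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
print(sum(phi), toy_model(instance) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.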

## Unified Report (support model storage)

In [None]:
from hana_ml.model_storage import ModelStorage

model_storage = ModelStorage(connection_context=connection_context)
gscv.estimator.name = 'HGBT' 
gscv.estimator.version = 1
model_storage.save_model(model=gscv.estimator)


In [None]:
from hana_ml.visualizers.unified_report import UnifiedReport

mymodel = model_storage.load_model('HGBT', 1)

UnifiedReport(mymodel).build().display()

In [None]:
UnifiedReport(diabetes_test).build().display()

## Text Mining Functions

The examples below compare the SAP HANA Cloud version with the on-premise version.

The data is taken from the PAL example.

In [None]:
conn_onpremise = ConnectionContext(userkey="leiyiyao")
conn_cloud = ConnectionContext(userkey="raymondyao")

In [None]:
data = pd.DataFrame({"ID" : ['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6'],
                     "CONTENT" : ['term1 term2 term2 term3 term3 term3',
                                  'term2 term3 term3 term4 term4 term4',
                                  'term3 term4 term4 term5 term5 term5',
                                  'term3 term4 term4 term5 term5 term5 term5 term5 term5',
                                  'term4 term6',
                                  'term4 term6 term6 term6'],
                     "CATEGORY" : ['CATEGORY_1', 'CATEGORY_1', 'CATEGORY_2', 'CATEGORY_2', 'CATEGORY_3', 'CATEGORY_3']})
df_test1 = pd.DataFrame({"CONTENT":["term2 term2 term3 term3"]})
df_test2 = pd.DataFrame({"CONTENT":["term3"]})
df_test3 = pd.DataFrame({"CONTENT":["doc3"]})
df_test4 = pd.DataFrame({"CONTENT":["term3"]})

In [None]:
df_onpremise = create_dataframe_from_pandas(connection_context=conn_onpremise, pandas_df=data, table_name="TM_DEMO", force=True)
df_cloud = create_dataframe_from_pandas(connection_context=conn_cloud, pandas_df=data, table_name="TM_DEMO", force=True)

### TFIDF (cloud only)

In [None]:
from hana_ml.text.tm import tf_analysis

tfidf = tf_analysis(df_cloud)
tfidf[0].head(3).collect()
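
As a rough illustration of what a TF-IDF analysis computes, the sketch below derives term frequencies, document frequencies and a classic `tf * log(N / df)` weight for the demo documents in plain Python. The exact weighting and output tables of `tf_analysis` may differ; the formula here is the textbook variant, chosen as an assumption.

```python
import math
from collections import Counter

docs = {
    'doc1': 'term1 term2 term2 term3 term3 term3',
    'doc2': 'term2 term3 term3 term4 term4 term4',
    'doc3': 'term3 term4 term4 term5 term5 term5',
    'doc4': 'term3 term4 term4 term5 term5 term5 term5 term5 term5',
    'doc5': 'term4 term6',
    'doc6': 'term4 term6 term6 term6',
}

# Term frequency: how often each term occurs in each document.
term_freq = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for counts in term_freq.values() for term in counts)

n_docs = len(docs)
weights = {
    doc_id: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
    for doc_id, counts in term_freq.items()
}

print(dict(doc_freq))
print(weights['doc1'])
```

Terms that occur in many documents, such as term4, receive a lower weight per occurrence than rare terms such as term1.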

### Text Classification

#### via reference data

In [None]:
from hana_ml.text.tm import text_classification

res, stat = text_classification(df_cloud.select(df_cloud.columns[0], df_cloud.columns[1]), df_cloud)
res.head(1).collect()

In [None]:
res = text_classification(df_onpremise.select(df_onpremise.columns[0], df_onpremise.columns[1]), df_onpremise)
res.head(1).collect()

#### via calculated TFIDF (cloud only)

In [None]:
res, stat = text_classification(df_cloud.select(df_cloud.columns[0], df_cloud.columns[1]), tfidf)
res.head(1).collect()

In [None]:
from hana_ml.text.tm import get_related_doc, get_related_term, get_relevant_doc, get_relevant_term, get_suggested_term

df_test1_cloud = create_dataframe_from_pandas(connection_context=conn_cloud,
                                                        pandas_df=df_test1,
                                                        table_name="#TM_DATA1",
                                                        force=True)

df_test2_cloud = create_dataframe_from_pandas(connection_context=conn_cloud,
                                                        pandas_df=df_test2,
                                                        table_name="#TM_DATA2",
                                                        force=True)

df_test3_cloud = create_dataframe_from_pandas(connection_context=conn_cloud,
                                                        pandas_df=df_test3,
                                                        table_name="#TM_DATA3",
                                                        force=True)

df_test4_cloud = create_dataframe_from_pandas(connection_context=conn_cloud,
                                                        pandas_df=df_test4,
                                                        table_name="#TM_DATA4",
                                                        force=True)
df_test1_onpremise = create_dataframe_from_pandas(connection_context=conn_onpremise,
                                                    pandas_df=df_test1,
                                                    table_name="TM_DATA1",
                                                    force=True)

df_test2_onpremise = create_dataframe_from_pandas(connection_context=conn_onpremise,
                                                    pandas_df=df_test2,
                                                    table_name="TM_DATA2",
                                                    force=True)

df_test3_onpremise = create_dataframe_from_pandas(connection_context=conn_onpremise,
                                                    pandas_df=df_test3,
                                                    table_name="TM_DATA3",
                                                    force=True)

df_test4_onpremise = create_dataframe_from_pandas(connection_context=conn_onpremise,
                                                    pandas_df=df_test4,
                                                    table_name="TM_DATA4",
                                                    force=True)

### get related doc

In [None]:
get_related_doc(df_test1_cloud, tfidf).collect()

In [None]:
grd_onpremise = get_related_doc(df_test1_onpremise, df_onpremise)
print(grd_onpremise.select_statement)

In [None]:
grd_onpremise.collect()
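
To give an intuition for what "related documents" means, the sketch below ranks the demo documents against the query text of `df_test1` by cosine similarity of raw term counts. This is an assumption chosen for illustration; the scoring used by `get_related_doc` on either platform may differ, and `cosine` is a hypothetical helper.

```python
import math
from collections import Counter

docs = {
    'doc1': 'term1 term2 term2 term3 term3 term3',
    'doc2': 'term2 term3 term3 term4 term4 term4',
    'doc3': 'term3 term4 term4 term5 term5 term5',
    'doc4': 'term3 term4 term4 term5 term5 term5 term5 term5 term5',
    'doc5': 'term4 term6',
    'doc6': 'term4 term6 term6 term6',
}


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = Counter('term2 term2 term3 term3'.split())
ranked = sorted(
    ((doc_id, cosine(query, Counter(text.split()))) for doc_id, text in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```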

### get related term

In [None]:
get_related_term(df_test2_cloud, df_cloud).collect()

In [None]:
grt_onpremise = get_related_term(df_test2_onpremise, df_onpremise)
print(grt_onpremise.select_statement)

In [None]:
grt_onpremise.collect()

### get relevant doc

In [None]:
get_relevant_doc(df_test2_cloud, df_cloud).collect()

In [None]:
grvd_onpremise = get_relevant_doc(pred_data=df_test2_onpremise, ref_data=df_onpremise, top=4)
print(grvd_onpremise.select_statement)

In [None]:
grvd_onpremise.collect()

### get relevant term

In [None]:
get_relevant_term(df_test4_cloud, df_cloud).collect()

In [None]:
grvt_onpremise = get_relevant_term(df_test4_onpremise, df_onpremise)
print(grvt_onpremise.select_statement)

In [None]:
grvt_onpremise.collect()

### get suggested term

In [None]:
get_suggested_term(df_test4_cloud, df_cloud).collect()

In [None]:
gst_onpremise = get_suggested_term(df_test4_onpremise, df_onpremise)
print(gst_onpremise.select_statement)

In [None]:
gst_onpremise.collect()