What's New and Changed in version 2.11.211211
---------------------------------------------
New functions:

    - Added FeatureSelection.
    - Added BSTS.
    - Added Word Cloud.
    - Added hdbprocedure generation in pal_base and applied to all functions.
    - Added GARCH.
    - APL classification, regression, clustering: a new method, 'export_apply_code', generates code which can be used to apply a trained model outside APL.

Enhancements:

    - Enhanced Preprocessing with FeatureSelection.
    - Enhanced the model storage with fit parameters in JSON format.
    - Enhanced GARCH model with details.
    - Enhanced PCA categorical support.
    - Enhanced model storage with fit parameters info.
    - Enhanced UnifiedExponentialSmoothing with massive mode.
    - Enhanced AMDP generation as a function in unified_classification.
    - Enhanced ARIMA with an explainer in the predict function.
    - Enhanced additive_model_forecast with an explainer in the predict function.
    - Enhanced HybridGradientBoostingClassifier to continue training from a previously trained HybridGradientBoostingClassifier model.
    - Enhanced APL AutoTimeSeries with advanced predict outputs: the 'APL/ApplyExtraMode' parameter can be set in 'extra_applyout_settings'.
    - Enhanced the stored procedure information retrieval.

API changes:

    - Added 'background_size' in the init() and 'thread_ratio', 'top_k_attributions', 'trend_mod', 'trend_width', 'seasonal_width' in the predict() function of ARIMA() and AutoARIMA().
    - Added 'show_explainer', 'decompose_seasonality', 'decompose_holiday' in the predict() function of additive_model_forecast().
    - Added 'warm_start' in the fit() function of HybridGradientBoostingClassifier() and HybridGradientBoostingRegressor() for continuing training with an existing model.

Bug fixes:

    - Fixed an index creation bug in the on-premise text_classification API.

#### Feature Selection

In [None]:
import pandas as pd
from hana_ml import dataframe
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")

connection_context = dataframe.ConnectionContext(url, port, user, pwd)


In [None]:
df = dataframe.create_dataframe_from_pandas(connection_context,
                                            pandas_df=pd.read_csv("https://raw.githubusercontent.com/SAP-samples/hana-ml-samples/main/Python-API/pal/datasets/21QRC04_feature_selection.csv"),
                                            table_name="#PAL_FS_TBL",
                                            force=True)

- Statistical based FS methods:
    - 'anova': ANOVA.
    - 'chi-squared': Chi-squared.
    - 'gini-index': Gini Index.
    - 'fisher-score': Fisher Score.
- Information theoretical based FS methods:
    - 'information-gain': Information Gain.
    - 'MRMR': Minimum Redundancy Maximum Relevance.
    - 'JMI': Joint Mutual Information.
    - 'IWFS': Interaction Weight Based Feature Selection.
    - 'FCBF': Fast Correlation Based Filter.
- Similarity based FS methods:
    - 'laplacian-score': Laplacian Score.
    - 'SPEC': Spectral Feature Selection.
    - 'ReliefF': ReliefF.
- Sparse Learning Based FS method:
    - 'ADMM': ADMM.
- Wrapper method:
    - 'CSO': Competitive Swarm Optimizer.
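
To give a feel for what a statistical filter method in the list above computes, here is a minimal NumPy sketch of the textbook Fisher score. It is not PAL's implementation: the `fisher_score` helper, the synthetic data, and the variable names are assumptions made for illustration only.

```python
import numpy as np

def fisher_score(X, labels):
    """Textbook Fisher score per feature: between-class scatter / within-class scatter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for cls in np.unique(labels):
        X_cls = X[labels == cls]
        n_cls = X_cls.shape[0]
        # Weighted squared distance of the class mean from the overall mean
        between += n_cls * (X_cls.mean(axis=0) - overall_mean) ** 2
        # Weighted within-class variance
        within += n_cls * X_cls.var(axis=0)
    return between / within

# Hypothetical data: one feature that depends on the label, one pure-noise feature
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
informative = 3.0 * labels + rng.normal(size=labels.size)
noise = rng.normal(size=labels.size)
X = np.column_stack([informative, noise])

scores = fisher_score(X, labels)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
print(scores, ranking)
```

A filter method of this kind ranks features independently of any model, whereas a wrapper method such as 'CSO' searches over feature subsets.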

In [None]:
from hana_ml.algorithms.pal.preprocessing import FeatureSelection

fs = FeatureSelection(fs_method='CSO', seed=1)
fs_df = fs.fit_transform(df, label='Y')

In [None]:
fs_df.collect()

#### AMDP generator without sql_tracer

In [None]:
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

full_set, diabetes_train, diabetes_test, _ = DataSets.load_diabetes_data(connection_context)


In [None]:
rfc_params = dict(n_estimators=5, split_threshold=0, max_depth=10)
rfc = UnifiedClassification(func="RandomDecisionTree", **rfc_params)
rfc.fit(diabetes_train,
        key='ID',
        label='CLASS',
        categorical_variable=['CLASS'],
        partition_method='stratified',
        stratified_column='CLASS')
cm = rfc.confusion_matrix_.collect()
rfc.predict(diabetes_test.drop(cols=['CLASS']), key="ID")

In [None]:
rfc.create_amdp_class(amdp_name="DIABETES_AMDP").build_amdp_class()

In [None]:
rfc.write_amdp_file()

In [None]:
from hana_ml.model_storage import ModelStorage

ms = ModelStorage(connection_context)
ms.clean_up()
rfc.name = "RDT_AMDP"
ms.save_model(rfc)

In [None]:
ms.list_models()

In [None]:
rfc_load = ms.load_model("RDT_AMDP", 1)

In [None]:
rfc_load.create_amdp_class(amdp_name="DIABETES_AMDP").build_amdp_class()
print(rfc_load.amdp_template)

#### hdbprocedure

In [None]:
print(rfc.get_pal_function())

In [None]:
print(rfc.get_fit_parameters())

In [None]:
print(rfc.get_fit_output_table_names())

In [None]:
print(rfc.fit_hdbprocedure)

In [None]:
print(rfc.consume_fit_hdbprocedure("test1", in_tables=["a1"], out_tables=["b1", "b2"])['base'], "\n")
print(rfc.consume_fit_hdbprocedure("test1", in_tables=["a1"], out_tables=["b1", "b2"])['consume'])

In [None]:
print(rfc.predict_hdbprocedure)

In [None]:
print(rfc.consume_predict_hdbprocedure("test1", in_tables=["a1", "a2"], out_tables=["b1", "b2"])['base'], "\n")
print(rfc.consume_predict_hdbprocedure("test1", in_tables=["a1", "a2"], out_tables=["b1", "b2"])['consume'])

#### Time Series Explainer - ARIMA and Additive Model Forecast

Dataset:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.linalg import cholesky
from numpy.random import rand

num_samples = 600
S1 = 12
S2 = 100

np.random.seed(seed=2334)

x1 = norm.rvs(loc=0, scale=1, size=(1, num_samples))[0]
x2 = norm.rvs(loc=0, scale=1, size=(1, num_samples))[0]
x3 = norm.rvs(loc=0, scale=1, size=(1, num_samples))[0]
x4 = norm.rvs(loc=0, scale=1, size=(1, num_samples))[0]

std_m = np.array([
    [6.8, 0, 0, 0],
    [0, 1.4, 0, 0],
    [0, 0, 1.4, 0],
    [0, 0, 0, 2.9]
])

# specify desired correlation
corr_m = np.array([
    [1, .35, 0.33, 0.78],
    [.35, 1, 0.90, 0.28],
    [.33, 0.90, 1, 0.27],
    [.78, 0.28, 0.27, 1]
])

# calc desired covariance (vc matrix)
cov_m = np.dot(std_m, np.dot(corr_m, std_m))
L = cholesky(cov_m, lower=True)
corr_data = np.dot(L, [x1, x2, x3, x4]).T

beta = np.array([-3.49, 13, 13, 0.0056])
omega1 = 2*np.pi/S1
omega2 = 2*np.pi/S2
timestamp = np.arange(num_samples)
y1 = np.multiply(50*rand(num_samples), 20*rand(1)*np.cos(omega1*timestamp)) \
+ np.multiply(32*rand(num_samples), 30*rand(1)*np.cos(3*omega1*timestamp)) \
+ np.multiply(rand(num_samples), rand(1)*np.sin(omega2*timestamp)) 

y2 = np.multiply(rand(num_samples), timestamp)
y3 = corr_data.dot(beta.T)
y = y1 + y2 + y3

plt.plot(y)
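
The dataset cell above correlates the exogenous regressors by multiplying independent normal draws by the Cholesky factor of the desired covariance matrix. The following standalone sketch checks that idea on a smaller, made-up 2x2 example; the target correlation, standard deviations, and sample size here are assumptions for illustration and are not the values used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: two variables with correlation 0.8 and different scales
target_corr = np.array([[1.0, 0.8],
                        [0.8, 1.0]])
stds = np.diag([2.0, 0.5])

# Covariance = D * R * D, where D holds the standard deviations
cov = stds @ target_corr @ stds

# Lower-triangular factor L with L @ L.T == cov
L = np.linalg.cholesky(cov)

# Independent standard normal draws, one row per variable
z = rng.standard_normal((2, 20000))

# Mixing the rows with L induces the target covariance
samples = (L @ z).T

empirical_corr = np.corrcoef(samples, rowvar=False)
print(empirical_corr)
```

With enough samples the empirical correlation should land close to the target, which is a quick sanity check before feeding such data into a model.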


#### ARIMA explainer

In [None]:
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA

timestamp = list(range(len(y)))
raw = {'ID':timestamp, 'Y':y, 'X1':corr_data[:,0], 'X2':corr_data[:,1], 'X3':corr_data[:,2], 'X4':corr_data[:,3]}
rdata = pd.DataFrame(raw)
cutoff = int(rdata.shape[0]*0.9)

df_fit = dataframe.create_dataframe_from_pandas(connection_context, rdata.iloc[:cutoff,:], table_name='PAL_ARIMA_FIT_TBL', force=True)
df_predict = dataframe.create_dataframe_from_pandas(connection_context, rdata.iloc[cutoff:,:], table_name='PAL_ARIMA_PREDICT_TBL', force=True)

arima = AutoARIMA(background_size=-1)
arima.fit(df_fit, key='ID', endog='Y', exog=['X1', 'X2', 'X3', 'X4'])

res = arima.predict(df_predict, top_k_attributions=30, seasonal_width=0.035, trend_width=0.035, show_explainer=True)

print(res.head(5).collect())
print(arima.explainer_.head(5).collect())

#### Additive Model Forecast explainer

In [None]:
from hana_ml.algorithms.pal.tsa import additive_model_forecast

dates = pd.date_range('2018-01-01', '2019-08-23',freq='D')
data_additive = {'ID':dates, 'Y':y, 'X1':corr_data[:,0], 'X2':corr_data[:,1], 'X3':corr_data[:,2], 'X4':corr_data[:,3]}
data = pd.DataFrame(data_additive)
cutoff = int(data.shape[0]*0.9)
df_fit_additive = dataframe.create_dataframe_from_pandas(connection_context, data.iloc[:cutoff,:], table_name='PAL_ADDITIVE_FIT_TBL', force=True)
df_predict_additive = dataframe.create_dataframe_from_pandas(connection_context, data.iloc[cutoff:,:], table_name='PAL_ADDITIVE_PREDICT_TBL', force=True)

# Holiday table; a separate name avoids overwriting the earlier HANA DataFrame `df`
holiday_dic = {"Date":['2018-01-01','2018-01-04','2018-01-05','2019-06-25','2019-06-29'],
               "Name":['A', 'A', 'B', 'A', 'D']}
holiday_pdf = pd.DataFrame(holiday_dic)
df_holiday = dataframe.create_dataframe_from_pandas(connection_context, holiday_pdf, table_name='PAL_HOLIDAY_TBL', force=True)
df_holiday = df_holiday.cast('Date', 'TIMESTAMP')

amf = additive_model_forecast.AdditiveModelForecast(growth='linear',
                                                    regressor = ['{"NAME": "X1", "PRIOR_SCALE":4, "MODE": "additive" }',
                                                                 '{"NAME": "X2", "PRIOR_SCALE":4, "MODE": "multiplicative"}'],
                                                    seasonality=['{ "NAME": "yearly", "PERIOD":365.25, "FOURIER_ORDER":10 }',
                                                                 '{ "NAME": "weekly", "PERIOD":7, "FOURIER_ORDER":3 }',
                                                                 '{ "NAME": "daily", "PERIOD":1, "FOURIER_ORDER":4 }'])

amf.fit(df_fit_additive, key='ID', endog='Y', exog=['X1','X2','X3','X4'], holiday=df_holiday)
model_content = amf.model_.collect()['MODEL_CONTENT']

res = amf.predict(data=df_predict_additive, key= 'ID', show_explainer=True, decompose_seasonality=True, decompose_holiday=True)
print(amf.explainer_.head(5).collect())
print(amf.explainer_.head(15).collect()['SEASONAL'])
print(amf.explainer_.head(15).collect()['HOLIDAY'])
print(amf.explainer_.head(5).collect()['EXOGENOUS'][0])

#### BSTS

In [None]:
from hana_ml.algorithms.pal.tsa.bsts import BSTS

bt = BSTS(burn=0.6, expected_model_size=2,
          seasonal_period=12, niter=2000,
          seed=1)

bt.fit(df_fit, key='ID', endog='Y', exog=['X1', 'X2', 'X3', 'X4'])

fct_res = bt.predict(df_predict.deselect("Y"), key='ID')[0]

print(fct_res.head(3).collect())

In [None]:
df_fit.head(10).collect()

In [None]:
df_fit.select(["ID", "Y"]).collect().to_csv("./test.csv")

In [None]:
df_fit.save("GARCH_TEST")

#### GARCH

In [None]:
from hana_ml.algorithms.pal.tsa.garch import GARCH
gh = GARCH(p=1, q=1)
gh.fit(data=df_fit.set_index('ID'), endog='Y')
vari, stats = gh.predict(horizon=3)

print(vari.head(3).collect())
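
For readers unfamiliar with what a GARCH(1, 1) variance forecast over a horizon represents, here is a minimal sketch of the standard recursion in plain NumPy. It does not reproduce PAL's estimator or its output columns: the `garch11_variance_forecast` helper and all coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def garch11_variance_forecast(omega, alpha, beta, last_resid, last_var, horizon):
    """Multi-step conditional variance forecasts for a GARCH(1, 1) model."""
    forecasts = np.empty(horizon)
    # One step ahead uses the last observed residual and conditional variance
    forecasts[0] = omega + alpha * last_resid ** 2 + beta * last_var
    # Further steps replace the unknown squared residual by its expected value,
    # which is the previous variance forecast
    for h in range(1, horizon):
        forecasts[h] = omega + (alpha + beta) * forecasts[h - 1]
    return forecasts

# Hypothetical coefficients and last observed state
fc = garch11_variance_forecast(omega=0.1, alpha=0.1, beta=0.8,
                               last_resid=2.0, last_var=1.5, horizon=3)
print(fc)
```

When alpha + beta < 1, the forecasts decay toward the long-run variance omega / (1 - alpha - beta), which equals 1.0 for these assumed coefficients.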

#### Word Cloud

In [None]:
data = pd.DataFrame({"ID" : ['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6'],
                     "CONTENT" : ['term1 term2 term2 term3 term3 term3',
                                  'term2 term3 term3 term4 term4 term4',
                                  'term3 term4 term4 term5 term5 term5',
                                  'term3 term4 term4 term5 term5 term5 term5 term5 term5',
                                  'term4 term6',
                                  'term4 term6 term6 term6'],
                     "CATEGORY" : ['CATEGORY_1', 'CATEGORY_1', 'CATEGORY_2', 'CATEGORY_2', 'CATEGORY_3', 'CATEGORY_3']})
df_wc = dataframe.create_dataframe_from_pandas(connection_context=connection_context, pandas_df=data, table_name="#WC_DEMO", force=True)

In [None]:
from hana_ml.visualizers.word_cloud import WordCloud
wordcloud = WordCloud(background_color="white", max_words=2000,
                      max_font_size=100, random_state=42, width=1000,
                      height=860, margin=2).build(df_wc, content_column="CONTENT")


In [None]:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

#### Parameterised view

In [None]:
df_view = connection_context.sql("""
SELECT *
  FROM "DBM2_RFULL_TBL"
  WHERE JOB=:job and AGE=:age;
""")

In [None]:
df_view.save("TEST_VIEW2", table_type="VIEW",
             view_structure={"job": "VARCHAR(500)", "age": "INT"}, force=True)

In [None]:
new_df_view = connection_context.table("TEST_VIEW2", view_params=('entrepreneur', 37))

In [None]:
new_df_view.collect()