# DKOM 2019
We show three areas of innovation.  The purpose is to demonstrate how data scientists can explore and perform machine learning tasks on large data sets that are stored in HANA.  We show the benefit of pushing processing closer to where the data resides.  The benefits of using HANA this way are:
<li>Performance:  We see orders-of-magnitude performance gains, and the advantage grows as data sets get larger.</li>
<li>Security:  Since the data stays in HANA and the processing is done there, HANA's security measures continue to apply to the data.</li>
<br>
<br>
Data scientists work with Python and Python-like APIs that they are already comfortable with.
<br>
<br>
We will cover the following:
<li><b>Dataframes:</b> A reference to a relation in HANA.  No need for deep SQL knowledge.</li>
<li><b>HANA ML API:</b> Exploit HANA's ML capabilities using a scikit-learn-style Python interface.</li>
<li><b>Exploratory Data Analysis and Visualization:</b> Analyze large data sets without the performance penalty or running out of resources on the client.</li>

# Dataframes
The SAP HANA Python Client API for machine learning algorithms (Python Client API for ML) provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.

The Python Client API for ML consists of two main parts:

<li>A set of machine learning APIs for different algorithms.</li>
<li>The SAP HANA dataframe, which provides a set of methods for analyzing data in SAP HANA without bringing that data to the client.</li>

This library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA.
<br>
<br>
<img src="images/highlevel_overview2_new.png" title="Python API Overview" style="float:left;" width="300" height="50" />
<br>
A dataframe represents a table (or any SQL statement).  Most operations on a dataframe are designed to not bring data back from the database unless explicitly asked for.
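
To make the idea concrete, here is a minimal sketch of a dataframe that only holds SQL text and touches the database when `collect()` is called. It is not the hana_ml implementation: the `LazyFrame` class and the `customers` table are hypothetical, and SQLite stands in for HANA so the example runs anywhere.

```python
import sqlite3


class LazyFrame:
    """Toy stand-in for a database-backed dataframe (not the hana_ml class)."""

    def __init__(self, conn, select_statement):
        self.conn = conn
        self.select_statement = select_statement  # SQL text only; nothing runs yet

    def filter(self, condition):
        # Wrap the current statement in a subquery; still no database round trip.
        return LazyFrame(
            self.conn, f"SELECT * FROM ({self.select_statement}) WHERE {condition}"
        )

    def head(self, n):
        return LazyFrame(
            self.conn, f"SELECT * FROM ({self.select_statement}) LIMIT {int(n)}"
        )

    def collect(self):
        # The only method that executes SQL and moves rows to the client.
        return self.conn.execute(self.select_statement).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, 34), (2, 67), (3, 71), (4, 25)]
)

frame = LazyFrame(conn, "SELECT * FROM customers")
older = frame.filter("age > 60").head(1)
print(older.select_statement)  # the composed SQL, built without running a query
rows = older.collect()         # the query runs here
print(rows)
```

Each operation returns a new object wrapping a larger SQL statement, which is why chaining `filter` and `head` costs nothing until the result is requested.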

In [None]:
from hana_ml import dataframe
import logging

## Setup connection and data sets
Let us load some data into HANA tables.  The data is loaded into four tables (full set, test set, training set, and validation set): DBM2_RFULL_TBL, DBM2_RTEST_TBL, DBM2_RTRAINING_TBL, and DBM2_RVALIDATION_TBL.

The data relates to direct marketing campaigns of a Portuguese banking institution. More information regarding the data set is at https://archive.ics.uci.edu/ml/datasets/bank+marketing#. For tutorials use only.

To do that, a connection is created and passed to the loader.  The config file <b>config/e2edata.ini</b> controls the connection parameters.  Please edit it to point to your HANA instance.
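
As a rough illustration of what such a config file might hold, the sketch below parses connection settings with Python's standard `configparser`. The section name `hana` and the keys `url`, `port`, `user`, and `passwd` are assumptions made for this example; the actual layout of e2edata.ini may differ, so check the file shipped with the notebooks.

```python
import configparser

# Hypothetical file contents; the real e2edata.ini layout may differ.
SAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = ML_USER
passwd = change-me
"""


def load_connection_settings(ini_text, section="hana"):
    """Return (url, port, user, password) from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["passwd"]


url, port, user, pwd = load_connection_settings(SAMPLE_INI)
print(url, port, user)
```

Keeping credentials in a separate file means the notebook itself can be shared without exposing them.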

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, training_set, validation_set, test_set = DataSets.load_bank_data(connection_context, force=False, chunk_size=50000)

### Simple DataFrame
A dataframe is a reference to a relation.  This can be a table, a view, or any relation produced by a SQL statement.
<table align="left"><tr><td>
</td><td><img src="images/Dataframes_1.png" style="float:left;" width="600" height="400" /></td></tr></table>
<br>
<b>Let's take a look at a dataframe created using our training table.</b>
<br>

In [None]:
dataset = training_set
print(dataset.select_statement)

In [None]:
print(type(dataset))

### Bring data to client
#### Fetch 5 rows into client as a <b>Pandas Dataframe</b>

In [None]:
dataset.head(5).collect()

## SQL Operations
We now show simple SQL operations.  No extensive SQL knowledge is needed.

### Projection
<img src="images/Projection.png" style="float:left;" width="150" height="750" />

In [None]:
dsp = dataset.select("ID", "AGE", "JOB", ('"AGE"*2', "TWICE_AGE"))
dsp.head(5).collect()  # collect() brings the data to the client

### Filtering Data
<img src="images/Filter.png" style="float:left;" width="200" height="100" />

In [None]:
dataset.filter('AGE > 60').head(10).collect()

### Sorting
<img src="images/Sort.png" style="float:left;" width="200" height="100" />

In [None]:
dataset.filter('AGE>60').sort(['AGE']).head(2).collect()

### Grouping Data
<img src="images/Grouping.png" style="float:left;" width="300" height="200" />

In [None]:
dataset.agg([('count', 'AGE', 'COUNT_OF_AGE')], group_by='AGE').head(4).collect()

### Simple Joins
<img src="images/Join.png" style="float:left;" width="300" height="200" />

In [None]:
ds1 = dataset.select(["ID", "AGE"])
ds2 = dataset.select(["ID", "JOB"])
condition = '{}."ID"={}."ID"'.format(ds1.quoted_name, ds2.quoted_name)
dsj = ds1.join(ds2, condition)
dsj.select_statement

### Describing a dataframe
<img src="images/Describe.png" style="float:left;" width="300" height="200" />
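
The point of describing a dataframe in the database is that only a handful of summary numbers travel to the client. The sketch below shows that idea with plain SQL aggregates. It is an illustration under stated assumptions, not the statement hana_ml generates: the `describe_sql` helper and the `bank` table are hypothetical, and SQLite stands in for HANA.

```python
import sqlite3


def describe_sql(table, columns):
    """Build one aggregate query per column and combine them with UNION ALL."""
    parts = [
        f"SELECT '{col}' AS column_name, COUNT({col}) AS n, "
        f"MIN({col}) AS min_value, MAX({col}) AS max_value, AVG({col}) AS mean_value "
        f"FROM {table}"
        for col in columns
    ]
    return " UNION ALL ".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, duration INTEGER)")
conn.executemany("INSERT INTO bank VALUES (?, ?)", [(30, 100), (40, 300), (50, 200)])

# One row of statistics per column comes back, regardless of the table size.
summary = conn.execute(describe_sql("bank", ["age", "duration"])).fetchall()
for row in summary:
    print(row)
```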

In [None]:
dataset.describe().collect()

In [None]:
dataset.describe().select_statement

# ML API Wrapping Predictive Analytics Library

## Classification - Logistic Regression Example
### Bank dataset to determine if a customer would buy a CD
The data relates to direct marketing campaigns of a Portuguese banking institution.  The marketing campaigns were based on phone calls.  A number of features such as age, kind of job, marital status, education level, credit default, existence of housing loan, etc. were considered.  The classification goal is to predict whether the client will subscribe to a term deposit (yes/no).

More information regarding the data set is at https://archive.ics.uci.edu/ml/datasets/bank+marketing#. For tutorials use only.

<font color='blue'>__ The objective is to demonstrate the use of logistic regression and to tune hyperparameters enet_lambda and enet_alpha. __</font>
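
For intuition about the two hyperparameters, the sketch below computes an elastic-net penalty in the widely used glmnet-style form, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared). That this is exactly the form PAL uses is an assumption here and should be checked against the PAL documentation for your HANA version. The function name is hypothetical.

```python
def elastic_net_penalty(coefficients, enet_lambda, enet_alpha):
    """Penalty added to the loss: lambda scales it, alpha mixes L1 and L2."""
    l1 = sum(abs(c) for c in coefficients)
    l2_squared = sum(c * c for c in coefficients)
    return enet_lambda * (enet_alpha * l1 + 0.5 * (1.0 - enet_alpha) * l2_squared)


coefs = [1.0, -2.0]
print(elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=1.0))  # pure L1 (lasso)
print(elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=0.0))  # pure L2 (ridge)
print(elastic_net_penalty(coefs, enet_lambda=0.1, enet_alpha=0.5))  # an even mix
```

Under this parameterization, enet_lambda controls how strongly large coefficients are punished, while enet_alpha decides whether the punishment favors sparse coefficients (closer to 1) or small ones (closer to 0).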

### Attribute Information:

#### Input variables:
##### Bank client data:
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')

##### Related with the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone') 
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

##### Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

##### Social and economic context attributes:
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric) 
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)

#### Output variable (desired target):
21. y - has the client subscribed to a term deposit? (binary: 'yes','no')


### Load the data set and create data frames

In [None]:
from hana_ml import dataframe
from hana_ml.algorithms.pal import linear_model
from hana_ml.algorithms.pal import clustering
from hana_ml.algorithms.pal import trees
import numpy as np
import matplotlib.pyplot as plt
import logging
from IPython.display import Image, display

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, training_set, validation_set, test_set = DataSets.load_bank_data(connection_context, force=False, chunk_size=50000)

### Let us look at some rows

In [None]:
training_set.head(5).collect()

# Create Model and Tune Hyperparameters
Try different hyperparameter values and see which combination performs best.
The results are stored in a list called res, which can then be used to visualize the results.

_The variable "quick" is to run the tests for only a few values to avoid running the code below for a long time._
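
The search pattern used here is an exhaustive grid: every (alpha, lambda) pair is scored and the best one is kept. The sketch below shows only that pattern. The `toy_score` function is a made-up stand-in for "fit on the training set, score on the validation set", so its numbers mean nothing for the bank data.

```python
import itertools


def grid_search(alphas, lambdas, score_fn):
    """Score every (alpha, lambda) combination and return the best plus all results."""
    results = [
        (alpha, lam, score_fn(alpha, lam))
        for alpha, lam in itertools.product(alphas, lambdas)
    ]
    best = max(results, key=lambda r: r[2])
    return best, results


def toy_score(alpha, lam):
    # Hypothetical score surface with its peak at alpha=0.75, lambda=0.01.
    return 1.0 - (alpha - 0.75) ** 2 - (lam - 0.01) ** 2


best, results = grid_search([0.0, 0.25, 0.5, 0.75, 1.0], [0.01, 0.02], toy_score)
print("combinations tried:", len(results))
print("best (alpha, lambda, score):", best)
```

The cost grows with the product of the two grid sizes, which is why a "quick" switch that shrinks the grids is useful during a demo.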


In [None]:
features = ['AGE','JOB','MARITAL','EDUCATION','DBM_DEFAULT', 'HOUSING','LOAN','CONTACT','DBM_MONTH','DAY_OF_WEEK','DURATION','CAMPAIGN','PDAYS','PREVIOUS','POUTCOME','EMP_VAR_RATE','CONS_PRICE_IDX','CONS_CONF_IDX','EURIBOR3M','NREMPLOYED']
label = "LABEL"

In [None]:
quick = True
enet_lambdas = np.linspace(0.01,0.02, endpoint=False, num=1) if quick else np.append(np.linspace(0.01,0.02, endpoint=False, num=4), np.linspace(0.02,0.02, num=5))
enet_alphas = np.linspace(0, 1, num=4) if quick else np.linspace(0, 1, num=40)
res = []
for enet_alpha in enet_alphas:
    for enet_lambda in enet_lambdas:
        lr = linear_model.LogisticRegression(solver='Cyclical', tol=0.000001, max_iter=10000, 
                                             stat_inf=True,pmml_export='multi-row', enet_lambda=enet_lambda, enet_alpha=enet_alpha,
                                             class_map0='no', class_map1='yes')
        lr.fit(training_set, features=features, label=label)
        accuracy_val = lr.score(validation_set, 'ID', features, label)
        res.append((enet_alpha, enet_lambda, accuracy_val, lr.coef_))

## Graph the results
Plot the accuracy on the validation set against the hyperparameters.

This is only done if all the combinations are tried.

In [None]:
%matplotlib inline
if not quick:
    arry = np.asarray(res)
    fig = plt.figure(figsize=(10,10))
    plt.title("Validation accuracy for training set with different lambdas")
    ax = fig.add_subplot(111)
    most_accurate_lambda = arry[np.argmax(arry[:,2]),1]
    best_accuracy_arg = np.argmax(arry[:,2])
    for lamda in enet_lambdas:
        if lamda == most_accurate_lambda:
            ax.plot(arry[arry[:,1]==lamda][:,0], arry[arry[:,1]==lamda][:,2], label="%.3f" % round(lamda,3), linewidth=5, c='r')
        else:
            ax.plot(arry[arry[:,1]==lamda][:,0], arry[arry[:,1]==lamda][:,2], label="%.3f" % round(lamda,3))
    plt.legend(loc=1, title="Legend (Lambda)", fancybox=True, fontsize=12)
    ax.set_xlabel('Alpha', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.grid()
    plt.show()
    print("Best accuracy: %.4f" % (arry[best_accuracy_arg][2]))
    print("Value of alpha for maximum accuracy: %.3f\nValue of lambda for maximum accuracy: %.3f\n" % (arry[best_accuracy_arg][0], arry[best_accuracy_arg][1]))
else:
    display(Image('images/bank-data-hyperparameter-tuning.png', width=800, unconfined=True))
    print("Best accuracy: 0.9148")
    print("Value of alpha for maximum accuracy: 0.769")
    print("Value of lambda for maximum accuracy: 0.010")

# Predictions on test set
Let us do the predictions on the test set using these values of alpha and lambda.

In [None]:
alpha = 0.769
lamda = 0.01
lr = linear_model.LogisticRegression(solver='Cyclical', tol=0.000001, max_iter=10000, 
                                       stat_inf=True,pmml_export='multi-row', enet_lambda=lamda, enet_alpha=alpha,
                                       class_map0='no', class_map1='yes')
lr.fit(training_set, features=features, label=label)

## Look at the predictions

In [None]:
result_df = lr.predict(test_set, 'ID')
result_df.filter('"CLASS"=\'no\'').head(5).collect()

## What about the final score?
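
Assuming the score is plain classification accuracy (which the `accuracy_val` variable name used during tuning suggests), it is simply the fraction of rows whose predicted class matches the true label. A minimal sketch with made-up labels, using a hypothetical helper name:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one label is required")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up labels for illustration only.
predicted = ["no", "yes", "no", "no"]
actual = ["no", "yes", "yes", "no"]
print(accuracy(predicted, actual))
```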

In [None]:
lr.score(test_set, 'ID')

# KMeans Clustering Example

A data set that identifies different types of irises is used to demonstrate KMeans in SAP HANA.
## Iris Data Set
The data set used is from University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/iris). For tutorials use only. This data set contains attributes of iris plants.  There are three species of iris in the data set.
<table>
<tr><td>Iris Setosa</td><td><img src="images/Iris_setosa.jpg" title="Iris Setosa" style="float:left;" width="300" height="50" /></td>
<td>Iris Versicolor</td><td><img src="images/Iris_versicolor.jpg" title="Iris Versicolor" style="float:left;" width="300" height="50" /></td>
<td>Iris Virginica</td><td><img src="images/Iris_virginica.jpg" title="Iris Virginica" style="float:left;" width="300" height="50" /></td></tr>
</table>

The data contains the following attributes for various flowers:
<table align="left"><tr><td>
<li align="top">sepal length in cm</li>
<li align="left">sepal width in cm</li>
<li align="left">petal length in cm</li>
<li align="left">petal width in cm</li>
</td><td><img src="images/sepal_petal.jpg" style="float:left;" width="200" height="40" /></td></tr></table>

Although the flower is identified in the data set, we will cluster the data set into 3 clusters since we know there are three different flowers.  The hope is that each cluster will correspond to one of the flowers.

A different notebook will use a classification algorithm to predict the type of flower based on the sepal and petal dimensions.

### Load the data set and create data frames

In [None]:
from hana_ml import dataframe
from hana_ml.algorithms.pal import clustering
import numpy as np
import pandas as pd
import logging
import itertools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d, Axes3D

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, training_set, validation_set, test_set = DataSets.load_iris_data(connection_context, force=False, chunk_size=50000)

### Let's check how many SPECIES are in the data set.

In [None]:
full_set.distinct("SPECIES").collect()

# Create Model
The lines below show the ease with which clustering can be done.

Set up the features and labels for the model and create the model
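
For readers who want to see what a k-means run does, here is a small pure-Python sketch of the classic assign-then-update loop with min-max normalization. It is not PAL's implementation: the function names are hypothetical, the data points are made up, and the deterministic farthest-first choice of starting centers is a simplification chosen so the example is reproducible.

```python
def min_max_normalize(points):
    """Rescale every feature to [0, 1] so no single dimension dominates the distance."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, n_clusters):
    """Start from the first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, n_clusters, max_iter=100):
    centers = farthest_first_centers(points, n_clusters)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_assignment = [
            min(range(n_clusters), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Made-up 2D points forming three well-separated groups.
raw = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
       (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
       (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]
assignment, centers = kmeans(min_max_normalize(raw), n_clusters=3)
print(assignment)
```

Cluster numbers carry no meaning by themselves, which is why the notebook later compares CLUSTER_ID against SPECIES to see which cluster corresponds to which flower.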

In [None]:
features = ['SEPALLENGTHCM','SEPALWIDTHCM','PETALLENGTHCM','PETALWIDTHCM']
label = ['SPECIES']

In [None]:
kmeans = clustering.KMeans(thread_ratio=0.2, n_clusters=3, distance_level='euclidean', 
                           max_iter=100, tol=1.0E-6, category_weights=0.5, normalization='min_max')
predictions = kmeans.fit_predict(full_set, 'ID', features).collect()
predictions.head(5)

# Plot the data

In [None]:
def plot_kmeans_results(data_set, features, predictions):
    # Use this to estimate which flower each cluster_id represents.
    # Ideally each cluster would hold exactly 50 flowers of one species, so the plots reveal any misclustered points.
    class_colors = {0: 'r', 1: 'b', 2: 'k'}
    predictions_colors = [class_colors[p] for p in predictions['CLUSTER_ID'].values]

    red = plt.Line2D(range(1), range(1), c='w', marker='o', markerfacecolor='r', label='Iris-virginica', markersize=10, alpha=0.9)
    blue = plt.Line2D(range(1), range(1), c='w', marker='o', markerfacecolor='b', label='Iris-versicolor', markersize=10, alpha=0.9)
    black = plt.Line2D(range(1), range(1), c='w', marker='o', markerfacecolor='k', label='Iris-setosa', markersize=10, alpha=0.9)

    for x, y in itertools.combinations(features, 2):
        plt.figure(figsize=(10,5))
        plt.scatter(data_set[[x]].collect(), data_set[[y]].collect(), c=predictions_colors, alpha=0.6, s=70)
        plt.grid()
        plt.xlabel(x, fontsize=15)
        plt.ylabel(y, fontsize=15)
        plt.tick_params(labelsize=15)
        plt.legend(handles=[red, blue, black])
        plt.show()

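    %matplotlib inline
    # Renders figures inline as static images; an interactive backend (for example %matplotlib notebook) would be needed to rotate the 3D plots.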

    sizes=10
    for x, y, z in itertools.combinations(features, 3):
        fig = plt.figure(figsize=(8,5))

        ax = fig.add_subplot(111, projection='3d')
        ax.scatter3D(data_set[[x]].collect(), data_set[[y]].collect(), data_set[[z]].collect(), c=predictions_colors, s=70)
        plt.grid()

        ax.set_xlabel(x, labelpad=sizes, fontsize=sizes)
        ax.set_ylabel(y, labelpad=sizes, fontsize=sizes)
        ax.set_zlabel(z, labelpad=sizes, fontsize=sizes)
        ax.tick_params(labelsize=sizes)
        plt.legend(handles=[red, blue, black])
        plt.show()

In [None]:
print(pd.concat([predictions, full_set[['SPECIES']].collect()], axis=1).groupby(['SPECIES','CLUSTER_ID']).size())


In [None]:
%matplotlib inline
plot_kmeans_results(full_set, features, predictions)

# Exploratory Data Analysis and Visualization

## Titanic Data Set (~1K rows)
This dataset is from https://github.com/awesomedata/awesome-public-datasets/tree/master/Datasets For tutorials use only.

In [None]:
from hana_ml import dataframe
from hana_ml.algorithms.pal import trees
from hana_ml.visualizers.eda import EDAVisualizer
import pandas as pd
import matplotlib.pyplot as plt
import time

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, training_set, validation_set, test_set = DataSets.load_titanic_data(connection_context, force=True, chunk_size=50000)

In [None]:
# Use the full data set as a HANA dataframe and fill missing AGE values with 25.
data = full_set
data = data.fillna(25, ['AGE'])
data.head(5).collect()
data.dtypes()

### Histogram plot for AGE distribution
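A histogram needs only the bin counts, not the rows, so the binning can be done with a GROUP BY in the database. The sketch below illustrates that idea under stated assumptions: the `histogram_sql` helper and the `passengers` table are hypothetical, the ages are made up, SQLite stands in for HANA, and the SQL that EDAVisualizer actually generates may look different.

```python
import sqlite3


def histogram_sql(table, column, bins, low, high):
    """Build a query that returns (bin index, row count) pairs."""
    width = (high - low) / bins
    # CAST truncates to the bin index; the two-argument MIN folds the maximum
    # value into the last bin instead of opening an extra one.
    return (
        f"SELECT MIN(CAST(({column} - {low}) / {width} AS INTEGER), {bins - 1}) AS bin, "
        f"COUNT(*) AS n FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY bin ORDER BY bin"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (age INTEGER)")
conn.executemany(
    "INSERT INTO passengers VALUES (?)",
    [(a,) for a in (2, 15, 22, 25, 25, 38, 41, 59, 80)],
)

# Two small queries: one for the value range, one for the bin counts.
low, high = conn.execute("SELECT MIN(age), MAX(age) FROM passengers").fetchone()
counts = conn.execute(histogram_sql("passengers", "age", 4, low, high)).fetchall()
print(counts)
```

However many rows the table holds, the client receives at most one row per bin, which is where the time savings measured in this section come from.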

In [None]:
bins=15
f = plt.figure(figsize=(bins*1.5, bins*0.5))
ax1 = f.add_subplot(121)
ax2 = f.add_subplot(122)
start = time.time()
eda = EDAVisualizer(ax1)
ax1, dist_data1 = eda.distribution_plot(data, column="AGE", bins=bins, title="Distribution of AGE (All)")
eda = EDAVisualizer(ax2)
ax2, dist_data2 = eda.distribution_plot(data.filter('SURVIVED=1'), column="AGE", bins=bins, title="Distribution of AGE (Survived)")
end = time.time()
plt.show()
print("Time: {}s.  Time taken to do this by getting the data from the server was 0.86s".format(round(end-start, 2)))

### Pie plot for PCLASS (passenger class) distribution

In [None]:
f = plt.figure(figsize=(20,10))
ax1 = f.add_subplot(121)
ax2 = f.add_subplot(122)
start = time.time()
eda = EDAVisualizer(ax1)
ax1, pie_data = eda.pie_plot(data, column="PCLASS", title="Proportion of passengers in each class")
eda = EDAVisualizer(ax2)
ax2, pie_data = eda.pie_plot(data.filter('SURVIVED=1'), column="PCLASS", title="Proportion of passengers in each class who survived")
end = time.time()
plt.show()
print("Time: {}s.  Time taken to do this by getting the data from the server was 0.88s".format(round(end-start, 2)))

### Correlation plot - Look at all numeric columns

In [None]:
f = plt.figure(figsize=(10,10))
ax1 = f.add_subplot(111)
start = time.time()
eda = EDAVisualizer(ax1)
ax1, corr = eda.correlation_plot(data)
end = time.time()
plt.show()
print("Time: {}s.  Time taken to do this by getting the data from the server was 2s".format(round(end-start, 2)))

### Performance Comparison
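To read the comparison, it helps to turn the raw timings into speedup factors and to see the transformation used for the chart's y axis, which plots log10 of the time in milliseconds. The sketch below does both for a few of the measurements recorded in this section; the variable names are chosen for this example.

```python
import math

# (chart, rows, seconds when pulling data to the client, seconds with a HANA dataframe)
# Figures copied from the measurements recorded in this section.
timings = [
    ("Distribution", "1K", 0.86, 0.14),
    ("Distribution", "1M", 950, 0.33),
    ("Correlation", "1K", 2.0, 2.78),
    ("Correlation", "1M", 950, 6.28),
]

speedups = {}
for chart, size, client_side, in_database in timings:
    speedups[(chart, size)] = client_side / in_database
    # log10 of milliseconds is the quantity plotted on the y axis.
    log_ms = math.log10(in_database * 1000)
    print(f"{chart:12s} {size:>3s}  speedup {speedups[(chart, size)]:8.1f}x  log10(ms) {log_ms:.2f}")
```

The factors show that the gain grows with data size. At 1K rows the correlation plot is actually slightly slower with the dataframe, while at 1M rows the in-database approach is faster by two to three orders of magnitude.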

In [None]:
# Box plot time for the large data set is 1600s!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
col_names =  ['chart', 'dataSize', 'noDataframe', 'withDataframe']
comparison_df  = pd.DataFrame(columns = col_names)
comparison_df.loc[len(comparison_df)] = ['Distribution', '1K',  0.86,  0.14]
comparison_df.loc[len(comparison_df)] = ['Pie', '1K',  0.88,  0.09]
comparison_df.loc[len(comparison_df)] = ['Correlation', '1K',  2.0,   2.78]

comparison_df.loc[len(comparison_df)] = ['Distribution', '8K',  7.5,   0.18]
comparison_df.loc[len(comparison_df)] = ['Pie', '8K',  7.6,   0.26]
comparison_df.loc[len(comparison_df)] = ['Correlation', '8K',  9.2,   2.1]

comparison_df.loc[len(comparison_df)] = ['Distribution', '500K',  360,   0.29]
comparison_df.loc[len(comparison_df)] = ['Pie', '500K',  450,   0.23]
comparison_df.loc[len(comparison_df)] = ['Correlation', '500K',  400,   4.3]

comparison_df.loc[len(comparison_df)] = ['Distribution', '1M',  950,   0.33]
comparison_df.loc[len(comparison_df)] = ['Pie', '1M',  940,   0.22]
comparison_df.loc[len(comparison_df)] = ['Correlation', '1M',  950,   6.28]
comparison_df['noDataframe'] = np.log10(comparison_df['noDataframe']*1000)
comparison_df['withDataframe'] = np.log10(comparison_df['withDataframe']*1000)
#comparison_df[comparison_df['chart'] == 'Distribution']['noDataframe']

In [None]:
f = plt.figure(figsize=(15,10))
ax = f.add_subplot(111)
N = 4
width = 0.10
ind = np.arange(N) 
ax.bar(ind,           comparison_df[comparison_df['chart'] == 'Distribution']['noDataframe'], width, label='Distribution (No DF)')
ax.bar(ind +   width, comparison_df[comparison_df['chart'] == 'Distribution']['withDataframe'], width, label='Distribution (DF)')
gap = 0.05
ax.bar(ind + gap + 2*width, comparison_df[comparison_df['chart'] == 'Pie']['noDataframe'], width, label='Pie (No DF)')
ax.bar(ind + gap + 3*width, comparison_df[comparison_df['chart'] == 'Pie']['withDataframe'], width, label='Pie (DF)')

ax.bar(ind + 2*gap + 4*width, comparison_df[comparison_df['chart'] == 'Correlation']['noDataframe'], width, label='Correlation (No DF)')
ax.bar(ind + 2*gap + 5*width, comparison_df[comparison_df['chart'] == 'Correlation']['withDataframe'], width, label='Correlation (DF)')


#plt.ylabel('Scores')
#plt.title('Scores by group and gender')

#ax.xticks(ind + width*2, comparison_df['dataSize'].unique())
ax.set_xticks(ind + width*2)
ax.set_xticklabels(comparison_df['dataSize'].unique(), fontsize=20)
ax.set_xlabel("Data Size (# Rows)", fontsize=20)
ax.set_yticks(range(7))  # plotted values are log10(milliseconds): 0 -> 1 ms, 6 -> 1000 s
ax.set_yticklabels([0.001, 0.01, 0.1, 1, 10, 100, 1000], fontsize=20)
ax.set_ylabel("Time in seconds (Log Scale)", fontsize=20)
ax.legend(loc='best', fontsize=20)
plt.show()

# SUMMARY
## What we covered
<li><b>Dataframes:</b> A reference to a relation in HANA.  No need for deep SQL knowledge.</li>
<li><b>HANA ML API:</b> Exploit HANA's ML capabilities using a scikit-learn-style Python interface.</li>
<li><b>Exploratory Data Analysis and Visualization:</b> Analyze large data sets without the performance penalty or running out of resources on the client.</li>
<br>

## Main benefits
<li>Ease of Use: For the data scientists.</li>
<li>Performance:  Orders of magnitude performance gains.</li>
<li>Security:  Centralized security.</li>
<br>
<br>