# Dataframes
The SAP HANA Python Client API for machine learning algorithms (Python Client API for ML) provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.

The Python Client API for ML consists of two main parts:

<ul>
<li>A set of machine learning APIs for different algorithms.</li>
<li>The SAP HANA dataframe, which provides a set of methods for analyzing data in SAP HANA without bringing that data to the client.</li>
</ul>

This library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA.
<br>
<br>
<img src="images/highlevel_overview2_new.png" title="Python API Overview" style="float:left;" width="300" height="50" />
<br>
A dataframe represents a table or the result of any SQL SELECT statement. Most dataframe operations are designed to run inside the database and return no data to the client unless you explicitly ask for it (for example, with collect()).
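
To make the "nothing moves until you ask" idea concrete, here is a minimal toy sketch. It is not how hana_ml is implemented: `LazyFrame` is a hypothetical class, and Python's built-in `sqlite3` stands in for SAP HANA. Each operation only wraps the previous SELECT in a new one, and rows reach the client only when `collect()` is called.

```python
import sqlite3


class LazyFrame:
    """Toy stand-in for a database-backed dataframe (hypothetical, not hana_ml)."""

    def __init__(self, conn, select_statement):
        self.conn = conn
        self.select_statement = select_statement

    def drop_duplicates(self):
        # Only builds a new SQL string; nothing is executed here.
        sql = 'SELECT DISTINCT * FROM ({}) AS "DT"'.format(self.select_statement)
        return LazyFrame(self.conn, sql)

    def head(self, n):
        # Also lazy: the row limit becomes part of the SQL text.
        sql = 'SELECT * FROM ({}) AS "DT" LIMIT {}'.format(self.select_statement, int(n))
        return LazyFrame(self.conn, sql)

    def collect(self):
        # The only method that runs the query and moves rows to the client.
        return self.conn.execute(self.select_statement).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "T" VALUES (?, ?)', [(1, 30), (1, 30), (2, 45)])

frame = LazyFrame(conn, 'SELECT * FROM "T"').drop_duplicates()
print(frame.select_statement)
rows = frame.collect()
print(sorted(rows))
```

The design choice worth noting is that every method returns a new object holding a longer SQL string, so chains of operations stay cheap until the final `collect()`.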

In [2]:
from hana_ml import dataframe
import logging

## Set up the connection and data sets
Let us load some data into SAP HANA. The loader creates four tables: the full set (DBM2_RFULL_TBL), the test set (DBM2_RTEST_TBL), the training set (DBM2_RTRAINING_TBL), and the validation set (DBM2_RVALIDATION_TBL).

The data relates to direct marketing campaigns of a Portuguese banking institution. More information about the data set is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing#.

To do that, a connection is created and passed to the loader. The config file <b>config/e2edata.ini</b> controls the connection parameters. Edit it to point to your SAP HANA instance.

In [3]:
from data_load_utils import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_tbl, training_tbl, validation_tbl, test_tbl = DataSets.load_bank_data(connection_context)

Table DBM2_RFULL_TBL exists and data exists
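
The layout of <b>config/e2edata.ini</b> is not shown in this notebook. As a hedged illustration only, the sketch below assumes a single `[hana]` section with `url`, `port`, `user`, and `pwd` keys (the section name, the keys, and the sample values are all assumptions) and reads it with the standard-library `configparser`. The real `Settings.load_config` helper may work differently.

```python
import configparser

# Hypothetical file content; the real config/e2edata.ini may use other names.
SAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = ML_USER
pwd = change-me
"""


def load_config_from_string(text, section="hana"):
    """Return (url, port, user, pwd) from INI text; an illustrative helper only."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    # getint converts the port so it can be passed on as a number.
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["pwd"]


url, port, user, pwd = load_config_from_string(SAMPLE_INI)
print(url, port, user)
```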


### Simple DataFrame
<table align="left"><tr><td>
</td><td><img src="images/Dataframes_1.png" style="float:left;" width="600" height="400" /></td></tr></table>

In [4]:
dataset1 = connection_context.table(training_tbl)
# Alternatively, a dataframe can wrap any SELECT statement, e.g. via connection_context.sql(...)
print(dataset1.select_statement)

SELECT * FROM "DBM2_RTRAINING_TBL"


### Simple Operations
#### Drop duplicates

In [5]:
dataset2 = dataset1.drop_duplicates()
print(dataset2.select_statement)

SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1"


#### Remove a column

In [6]:
dataset3 = dataset2.drop(["LABEL"])
print(dataset3.select_statement)

SELECT "ID", "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1") AS "DT_2"


#### Replace null values with a specific value

In [7]:
dataset4 = dataset2.fillna(25, ["AGE"])
print(dataset4.select_statement)

SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1") dt
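
The generated statement above replaces nulls with `COALESCE` on the chosen column and passes every other column through unchanged. A minimal sketch of that rewriting follows, with `sqlite3` standing in for SAP HANA. `fillna_sql` is a hypothetical helper written for this example, not a hana_ml function, and it assumes a numeric fill value.

```python
import sqlite3


def fillna_sql(select_statement, columns, value, subset):
    """Build a SELECT that replaces NULLs in the `subset` columns with `value`."""
    parts = []
    for col in columns:
        if col in subset:
            # COALESCE returns its first non-NULL argument.
            parts.append('COALESCE("{0}", {1}) AS "{0}"'.format(col, value))
        else:
            parts.append('"{}"'.format(col))
    return "SELECT {} FROM ({}) dt".format(", ".join(parts), select_statement)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "T" VALUES (?, ?)', [(1, None), (2, 40)])

sql = fillna_sql('SELECT * FROM "T"', ["ID", "AGE"], 25, ["AGE"])
print(sql)
filled = conn.execute(sql + ' ORDER BY "ID"').fetchall()
print(filled)
```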


### Bring data to client
#### Fetch 5 rows into the client as a <b>pandas DataFrame</b>

In [8]:
dataset4.head(5).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,30505,45,blue-collar,married,basic.9y,no,yes,yes,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.354,5099,no
1,18199,30,unemployed,single,university.degree,no,yes,yes,cellular,jul,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228,no
2,30837,35,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.344,5099,no
3,31569,36,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.327,5099,no
4,5813,43,services,married,high.school,no,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no


In [9]:
pd1 = dataset4.head(5).collect()
print(type(pd1))

<class 'pandas.core.frame.DataFrame'>


### Projection
<img src="images/Projection.png" style="float:left;" width="150" height="750" />

In [10]:
dsp = dataset4.select("ID", "AGE", "JOB", ('"AGE"*2', "TWICE_AGE"))
dsp.head(5).collect()  # collect() brings data to the client

Unnamed: 0,ID,AGE,JOB,TWICE_AGE
0,8746,54,admin.,108
1,8041,36,housemaid,72
2,8214,38,technician,76
3,9441,42,blue-collar,84
4,8272,39,admin.,78


In [11]:
dsp.select_statement

'SELECT "ID", "AGE", "JOB", "AGE"*2 AS "TWICE_AGE" FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1") dt) AS "DT_4"'

### Filtering Data
<img src="images/Filter.png" style="float:left;" width="200" height="100" />

In [12]:
dataset4.filter('AGE > 60').head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,29830,69,retired,divorced,basic.4y,no,no,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099,no
1,36021,61,unknown,single,basic.4y,no,yes,no,cellular,may,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.266,5099,no
2,30030,64,retired,married,university.degree,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099,no
3,28514,61,retired,married,university.degree,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.423,5099,yes
4,28726,69,retired,married,unknown,no,no,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.41,5099,no
5,30134,79,retired,married,basic.9y,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
6,30391,71,retired,divorced,basic.4y,no,yes,no,telephone,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
7,35862,66,housemaid,married,high.school,no,yes,no,cellular,may,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.259,5099,no
8,30242,81,retired,married,professional.course,no,no,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
9,35962,61,retired,married,basic.9y,no,no,no,cellular,may,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.264,5099,no


In [13]:
dataset4.filter('AGE > 60').select_statement

'SELECT * FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1") dt) AS "DT_4" WHERE AGE > 60'
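
As the statement above shows, the filter string ends up verbatim in a WHERE clause around the previous SELECT. In SAP HANA an unquoted identifier such as AGE is folded to upper case, which is why it matches the "AGE" column here; a mixed-case column name would need double quotes inside the condition. The sketch below mimics the wrapping with `sqlite3` as a stand-in database; `filter_sql` is a hypothetical helper, not part of hana_ml.

```python
import sqlite3


def filter_sql(select_statement, condition):
    """Wrap a SELECT in an outer query whose WHERE clause is `condition`."""
    # The condition is inserted verbatim, so it must be trusted, valid SQL.
    return 'SELECT * FROM ({}) AS "DT" WHERE {}'.format(select_statement, condition)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "T" VALUES (?, ?)', [(1, 45), (2, 61), (3, 79)])

sql = filter_sql('SELECT * FROM "T"', "AGE > 60")
print(sql)
older = conn.execute(sql + ' ORDER BY "ID"').fetchall()
print(older)
```

Because the condition is pasted in as text, it should never be built from untrusted user input.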

### Sorting
<img src="images/Sort.png" style="float:left;" width="200" height="100" />

In [14]:
dataset4.filter('AGE>60').sort(['AGE']).head(2).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,41000,61,housemaid,married,basic.4y,no,yes,no,telephone,oct,...,2,999,2,failure,-1.1,94.601,-49.5,1.016,4963,no
1,40261,61,admin.,married,unknown,no,yes,yes,cellular,jul,...,3,999,1,failure,-1.7,94.215,-40.3,0.889,4991,no


### Simple Joins
<img src="images/Join.png" style="float:left;" width="300" height="200" />

In [15]:
condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)
dataset5 = dataset4.join(dataset2, condition)

In [16]:
dataset5.head(5).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,8746,54,admin.,married,high.school,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.866,5228,no
1,8041,36,housemaid,married,high.school,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
2,8214,38,technician,married,basic.9y,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.864,5228,no
3,9441,42,blue-collar,married,basic.9y,unknown,no,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228,no
4,8272,39,admin.,married,university.degree,no,no,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.864,5228,no


### Describing a dataframe
<img src="images/Describe.png" style="float:left;" width="300" height="200" />

In [17]:
dataset4.describe().collect()

Unnamed: 0,column,count,unique,nulls,mean,std,min,max,median,25_percent_cont,25_percent_disc,50_percent_cont,50_percent_disc,75_percent_cont,75_percent_disc
0,ID,20594,20594,0,20598.859911,11895.67744,1.0,41187.0,20569.0,10268.25,10268.0,20568.5,20568.0,30885.25,30886.0
1,AGE,20594,76,0,40.088375,10.38661,17.0,98.0,38.0,32.0,32.0,38.0,38.0,47.0,47.0
2,DURATION,20594,1326,0,260.616199,265.660606,0.0,4918.0,180.0,102.0,102.0,180.0,180.0,321.0,321.0
3,CAMPAIGN,20594,40,0,2.573662,2.76135,1.0,43.0,2.0,1.0,1.0,2.0,2.0,3.0,3.0
4,PDAYS,20594,25,0,961.629795,188.985754,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
5,PREVIOUS,20594,8,0,0.173691,0.493915,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,EMP_VAR_RATE,20594,10,0,0.082325,1.572978,-3.4,1.4,1.1,-1.8,-1.8,1.1,1.1,1.4,1.4
7,CONS_PRICE_IDX,20594,26,0,93.576548,0.580561,92.201,94.767,93.749,93.075,93.075,93.749,93.749,93.994,93.994
8,CONS_CONF_IDX,20594,26,0,-40.476988,4.635267,-50.8,-26.9,-41.8,-42.7,-42.7,-41.8,-41.8,-36.4,-36.4
9,EURIBOR3M,20594,302,0,3.620535,1.736291,0.634,5.045,4.857,1.344,1.344,4.857,4.857,4.961,4.961


In [18]:
dataset4.describe().select_statement

'SELECT * FROM (SELECT "SimpleStats".*, "Percentiles"."25_percent_cont", "Percentiles"."25_percent_disc", "Percentiles"."50_percent_cont", "Percentiles"."50_percent_disc", "Percentiles"."75_percent_cont", "Percentiles"."75_percent_disc" FROM (select \'ID\' as "column", COUNT("ID") as "count", COUNT(DISTINCT "ID") as "unique", SUM(CASE WHEN "ID" is NULL THEN 1 ELSE 0 END) as "nulls", AVG(TO_DOUBLE("ID")) as "mean", STDDEV("ID") as "std", MIN("ID") as "min", MAX("ID") as "max", MEDIAN("ID") as "median" FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_1") dt) AS "DT_4" UNION ALL select \'AGE\' as "column", COUNT("AGE") as "count", COUNT(DISTINCT "AGE") as "unique", SUM(CASE 
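
The (truncated) statement above computes one row of aggregates per column and combines the rows with UNION ALL. The sketch below reproduces only that part of the pattern, under simplifying assumptions: `sqlite3` stands in for SAP HANA, `describe_sql` is a hypothetical helper, and the standard deviation, median, and percentile columns are left out because SQLite has no built-in functions for them.

```python
import sqlite3


def describe_sql(table, columns):
    """Build one aggregate SELECT per column and join them with UNION ALL."""
    template = (
        "SELECT '{col}' AS \"column\", "
        'COUNT("{col}") AS "count", '
        'COUNT(DISTINCT "{col}") AS "unique", '
        'SUM(CASE WHEN "{col}" IS NULL THEN 1 ELSE 0 END) AS "nulls", '
        'AVG("{col}") AS "mean", MIN("{col}") AS "min", MAX("{col}") AS "max" '
        'FROM "{table}"'
    )
    return " UNION ALL ".join(template.format(col=c, table=table) for c in columns)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "T" VALUES (?, ?)', [(1, 30), (2, None), (3, 30)])

# Key the result rows by column name so the order of the UNION does not matter.
stats = {row[0]: row[1:] for row in conn.execute(describe_sql("T", ["ID", "AGE"]))}
print(stats["AGE"])
```

Note that COUNT over a column skips nulls, which is why the sketch, like the statement above, counts nulls separately with a CASE expression.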

### Saving a dataframe

In [19]:
dataset4.head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,30505,45,blue-collar,married,basic.9y,no,yes,yes,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.354,5099,no
1,18199,30,unemployed,single,university.degree,no,yes,yes,cellular,jul,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228,no
2,30837,35,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.344,5099,no
3,31569,36,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.327,5099,no
4,5813,43,services,married,high.school,no,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
5,25063,56,technician,married,professional.course,unknown,no,no,cellular,nov,...,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195,no
6,16119,44,blue-collar,married,basic.6y,unknown,yes,yes,cellular,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228,no
7,22831,47,blue-collar,married,university.degree,no,no,no,cellular,aug,...,3,999,0,nonexistent,1.4,93.444,-36.1,4.965,5228,no
8,30412,32,blue-collar,married,professional.course,no,yes,no,cellular,apr,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
9,1644,46,admin.,married,university.degree,no,yes,yes,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191,no


In [20]:
dataset4.count()

20594

In [21]:
dataset4.save("#MYTEST2")

<hana_ml.dataframe.DataFrame at 0x7f6fb4cc1fd0>

In [22]:
dataset8 = connection_context.table("#MYTEST2")

In [23]:
dataset8.head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,30505,45,blue-collar,married,basic.9y,no,yes,yes,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.354,5099,no
1,18199,30,unemployed,single,university.degree,no,yes,yes,cellular,jul,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228,no
2,30837,35,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.344,5099,no
3,31569,36,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.327,5099,no
4,5813,43,services,married,high.school,no,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
5,25063,56,technician,married,professional.course,unknown,no,no,cellular,nov,...,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195,no
6,16119,44,blue-collar,married,basic.6y,unknown,yes,yes,cellular,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228,no
7,22831,47,blue-collar,married,university.degree,no,no,no,cellular,aug,...,3,999,0,nonexistent,1.4,93.444,-36.1,4.965,5228,no
8,30412,32,blue-collar,married,professional.course,no,yes,no,cellular,apr,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
9,1644,46,admin.,married,university.degree,no,yes,yes,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191,no


In [24]:
dataset8.count()

20594
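
In SAP HANA, a table name that starts with # denotes a local temporary table, which is visible only to the current session. The round trip above (save, re-open by name, compare counts) can be imitated as follows. This is a rough analogy, assuming `sqlite3` as a stand-in and a `CREATE TEMP TABLE ... AS SELECT` statement; it makes no claim about the SQL that `save()` actually issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "SRC" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "SRC" VALUES (?, ?)', [(1, None), (2, 40), (3, 61)])

# The dataframe to save is just a SELECT statement.
select_statement = 'SELECT "ID", COALESCE("AGE", 25) AS "AGE" FROM "SRC"'

# Saving materializes that SELECT into a session-scoped temporary table.
conn.execute('CREATE TEMP TABLE "MYTEST2" AS ' + select_statement)

# Re-open the saved table by name and compare row counts with the source query.
saved_count = conn.execute('SELECT COUNT(*) FROM "MYTEST2"').fetchone()[0]
source_count = conn.execute(
    'SELECT COUNT(*) FROM ({}) AS "DT"'.format(select_statement)
).fetchone()[0]
print(saved_count, source_count)
```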

### Pivoting

In [25]:
dataset8.pivot_table(values='EMP_VAR_RATE', index='ID', columns='EDUCATION', aggfunc='avg').head(10).collect()

Unnamed: 0,ID,university.degree,basic.4y,professional.course,high.school,unknown,illiterate,basic.6y,basic.9y
0,1,,1.1,,,,,,
1,2,,,,1.1,,,,
2,5,,,,1.1,,,,
3,9,,,1.1,,,,,
4,12,,,,1.1,,,,
5,13,,,,1.1,,,,
6,14,,1.1,,,,,,
7,20,,,,,,,,1.1
8,24,,,,1.1,,,,
9,25,,,,1.1,,,,
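
One common way to express such a pivot in plain SQL is conditional aggregation: one `AVG(CASE WHEN ...)` column per distinct value of the pivot column. The sketch below shows that technique with `sqlite3` and a tiny made-up table. It is an illustration of the idea, not the SQL that `pivot_table` generates, and it assumes the category values contain no quote characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "EDUCATION" TEXT, "EMP_VAR_RATE" REAL)')
conn.executemany(
    'INSERT INTO "T" VALUES (?, ?, ?)',
    [(1, "basic.4y", 1.1), (2, "high.school", 1.5), (2, "high.school", 0.5)],
)

# Step 1: find the distinct values that will become column headers.
categories = [
    row[0] for row in conn.execute('SELECT DISTINCT "EDUCATION" FROM "T" ORDER BY 1')
]

# Step 2: one conditional AVG per category; groups without that category yield NULL.
cols = ", ".join(
    'AVG(CASE WHEN "EDUCATION" = \'{0}\' THEN "EMP_VAR_RATE" END) AS "{0}"'.format(c)
    for c in categories
)
sql = 'SELECT "ID", {} FROM "T" GROUP BY "ID" ORDER BY "ID"'.format(cols)
pivoted = conn.execute(sql).fetchall()
print(categories)
print(pivoted)
```

The many empty cells in the output above follow from the same logic: with ID as the index, each row has exactly one education value, so every other pivoted column is null.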


### Load Pandas DataFrame

In [27]:
dataframe.create_dataframe_from_pandas(connection_context, dataset8.head(10).collect(), 'MYTEST3', replace=True)

<hana_ml.dataframe.DataFrame at 0x7f6fb4d105f8>

In [28]:
connection_context.table("MYTEST3").head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,30505,45,blue-collar,married,basic.9y,no,yes,yes,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.354,5099,no
1,18199,30,unemployed,single,university.degree,no,yes,yes,cellular,jul,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228,no
2,30837,35,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.344,5099,no
3,31569,36,technician,married,professional.course,no,yes,no,cellular,may,...,1,999,1,failure,-1.8,92.893,-46.2,1.327,5099,no
4,5813,43,services,married,high.school,no,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
5,25063,56,technician,married,professional.course,unknown,no,no,cellular,nov,...,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195,no
6,16119,44,blue-collar,married,basic.6y,unknown,yes,yes,cellular,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228,no
7,22831,47,blue-collar,married,university.degree,no,no,no,cellular,aug,...,3,999,0,nonexistent,1.4,93.444,-36.1,4.965,5228,no
8,30412,32,blue-collar,married,professional.course,no,yes,no,cellular,apr,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
9,1644,46,admin.,married,university.degree,no,yes,yes,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191,no
