# Dataframes
The SAP HANA Python Client API for machine learning algorithms (Python Client API for ML) provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.

The Python Client API for ML consists of two main parts:

<ul>
<li>A set of machine learning APIs for different algorithms.</li>
<li>The SAP HANA dataframe, which provides a set of methods for analyzing data in SAP HANA without bringing that data to the client.</li>
</ul>

This library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA.
<br>
<br>
<img src="images/highlevel_overview2_new.png" title="Python API Overview" style="float:left;" width="300" height="50" />
<br>
A dataframe represents a table (or the result of any SQL statement). Most dataframe operations are designed not to bring data back from the database unless you explicitly request it.
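
To make the deferred-execution idea concrete, here is a minimal sketch. It is not the hana_ml implementation: `LazyFrame`, its methods, and the `PEOPLE` table are hypothetical names introduced for illustration, and Python's built-in `sqlite3` stands in for SAP HANA. Each operation only composes SQL text; rows are fetched when `collect()` is called.

```python
import sqlite3


class LazyFrame:
    """Hypothetical stand-in for a database-backed dataframe (not hana_ml)."""

    def __init__(self, connection, select_statement):
        self.connection = connection
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery; nothing is executed here.
        sql = "SELECT * FROM ({}) AS T WHERE {}".format(self.select_statement, condition)
        return LazyFrame(self.connection, sql)

    def head(self, n):
        # Limit the row count, again without touching the database.
        sql = "SELECT * FROM ({}) AS T LIMIT {}".format(self.select_statement, int(n))
        return LazyFrame(self.connection, sql)

    def collect(self):
        # Only this call sends SQL to the database and fetches rows.
        return self.connection.execute(self.select_statement).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (ID INTEGER, AGE INTEGER)")
conn.executemany("INSERT INTO PEOPLE VALUES (?, ?)", [(1, 25), (2, 61), (3, 70)])

frame = LazyFrame(conn, "SELECT * FROM PEOPLE")
older = frame.filter("AGE > 65").head(1)
print(older.select_statement)  # still just SQL text
rows = older.collect()         # data reaches the client only here
print(rows)
```

The cells in this notebook follow the same pattern: printing `select_statement` shows the SQL that has been composed, and `collect()` is the step that transfers data.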

In [None]:
from hana_ml import dataframe
import logging

## Setup connection and data sets
Let us load some data into SAP HANA. The data is loaded into four tables: the full set (DBM2_RFULL_TBL), the test set (DBM2_RTEST_TBL), the training set (DBM2_RTRAINING_TBL), and the validation set (DBM2_RVALIDATION_TBL).

The data relates to the direct marketing campaigns of a Portuguese banking institution. More information about the data set is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing#.

To load the data, a connection is created and passed to the loader. The config file <b>config/e2edata.ini</b> controls the connection parameters. Please edit it to point to your SAP HANA instance.
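
As a rough sketch of what such a config file might contain and how it could be read, the example below uses Python's standard `configparser`. The section name `hana`, the key names, and all the values are assumptions made for illustration; check your own copy of `e2edata.ini` for the names it actually uses. The helper `read_connection_settings` is hypothetical and is not part of hana_ml.

```python
import configparser

# Hypothetical file contents; section name, keys, and values are assumptions.
SAMPLE_INI = """
[hana]
url = hana.example.com
port = 30015
user = ML_USER
passwd = change-me
"""


def read_connection_settings(text, section="hana"):
    """Parse INI text and return (host, port, user, password)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return (
        parser.get(section, "url"),
        parser.getint(section, "port"),  # port is converted to an int
        parser.get(section, "user"),
        parser.get(section, "passwd"),
    )


settings = read_connection_settings(SAMPLE_INI)
print(settings[:3])  # avoid printing the password
```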

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_df, training_df, validation_df, test_df = DataSets.load_bank_data(connection_context)

### Simple DataFrame
<table align="left"><tr><td>
</td><td><img src="images/Dataframes_1.png" style="float:left;" width="600" height="400" /></td></tr></table>

In [None]:
dataset1 = training_df
# Alternatively, it could be any SELECT
print(dataset1.select_statement)

### Simple Operations
#### Drop duplicates

In [None]:
dataset2 = dataset1.drop_duplicates()
print(dataset2.select_statement)

#### Remove a column

In [None]:
dataset3 = dataset2.drop(["LABEL"])
print(dataset3.select_statement)

#### Replace null values with a specific value

In [None]:
dataset4 = dataset2.fillna(25, ["AGE"])
print(dataset4.select_statement)

In [None]:
import pandas as pd
dataset_null = dataframe.create_dataframe_from_pandas(connection_context=connection_context,
                                                      pandas_df=pd.DataFrame({"ID": [1,2,5],
                                                                              "ID2": [1,None,5],
                                                                              "V3": [2,3,4],
                                                                              "V4": [3,3,3],
                                                                              "V5": ['a', None, 'b']}),
                                                      table_name="#tt_null",
                                                      force=True)

In [None]:
dataset_null.collect()

In [None]:
dataset_null.fillna(0).collect()

In [None]:
dataset_null.fillna('').collect()

In [None]:
dataset_null.fillna('').fillna(0).collect()
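
One common way to express NULL substitution in SQL is `COALESCE`. The sketch below shows the idea with `sqlite3` as a stand-in database and a hypothetical helper, `fill_nulls_sql`. Whether hana_ml generates exactly this SQL is an assumption; print the `select_statement` of a `fillna` result to see the real statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TT_NULL (ID INTEGER, ID2 INTEGER, V5 TEXT)")
conn.executemany(
    "INSERT INTO TT_NULL VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, None, None), (5, 5, "b")],
)


def fill_nulls_sql(table, columns, fill_values):
    """Build a SELECT that replaces NULLs only in the columns named in fill_values."""
    parts = []
    params = []
    for col in columns:
        if col in fill_values:
            # COALESCE returns the first non-NULL argument.
            parts.append('COALESCE("{0}", ?) AS "{0}"'.format(col))
            params.append(fill_values[col])
        else:
            parts.append('"{}"'.format(col))
    sql = 'SELECT {} FROM "{}" ORDER BY "ID"'.format(", ".join(parts), table)
    return sql, params


sql, params = fill_nulls_sql("TT_NULL", ["ID", "ID2", "V5"], {"ID2": 0, "V5": ""})
filled = conn.execute(sql, params).fetchall()
print(sql)
print(filled)
```

Using a different fill value per column, as here, keeps each replacement compatible with the column's type.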

### Bring data to client
#### Fetch 5 rows into the client as a <b>pandas DataFrame</b>

In [None]:
dataset4.head(5).collect()

In [None]:
pd1 = dataset4.head(5).collect()
print(type(pd1))

### Projection
<img src="images/Projection.png" style="float:left;" width="150" height="750" />

In [None]:
dsp = dataset4.select("ID", "AGE", "JOB", ('"AGE"*2', "TWICE_AGE"))
dsp.head(5).collect()  # collect() brings data to the client

In [None]:
dsp.select_statement

### Filtering Data
<img src="images/Filter.png" style="float:left;" width="200" height="100" />

In [None]:
dataset4.filter('AGE > 60').head(10).collect()

In [None]:
dataset4.filter('AGE > 60').select_statement

### Sorting
<img src="images/Sort.png" style="float:left;" width="200" height="100" />

In [None]:
dataset4.filter('AGE>60').sort(['AGE']).head(2).collect()

### Simple Joins
<img src="images/Join.png" style="float:left;" width="300" height="200" />

In [None]:
condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)
dataset5 = dataset4.join(dataset2, condition)

In [None]:
dataset5.head(5).collect()

In [None]:
import pandas as pd
df1 = dataframe.create_dataframe_from_pandas(connection_context=connection_context,
                                             pandas_df=pd.DataFrame({"ID": [1,2,3],
                                                                     "ID2": [1,2,3],
                                                                     "V1": [2,3,4]}),
                                             table_name="#tt1",
                                             force=True)
df2 = dataframe.create_dataframe_from_pandas(connection_context=connection_context,
                                             pandas_df=pd.DataFrame({"ID": [1,2],
                                                                     "ID2": [1,2],
                                                                     "V2": [2,3]}),
                                             table_name="#tt2",
                                             force=True)
df3 = dataframe.create_dataframe_from_pandas(connection_context=connection_context,
                                             pandas_df=pd.DataFrame({"ID": [1,2,5],
                                                                     "ID2": [1,2,5],
                                                                     "V3": [2,3,4],
                                                                     "V4": [3,3,3],
                                                                     "V5": ['a','a','b']}),
                                             table_name="#tt3",
                                             force=True)


In [None]:
dfs = [df1.set_index("ID"), df2.set_index("ID"), df3.set_index("ID")]
print(dfs[0].join(dfs[1:]).collect())

In [None]:
dfs = [df1.set_index(["ID", "ID2"]), df2.set_index(["ID", "ID2"]), df3.set_index(["ID", "ID2"])]
print(dfs[0].join(dfs[1:]).collect())

In [None]:
print(dfs[0].union([dfs[0], dfs[0]]).collect())

### Cast

In [None]:
dataset4.cast({"AGE": "BIGINT", "JOB": "NVARCHAR(50)"}).get_table_structure()

### Sort by index

In [None]:
df1.sort_index().collect()

### Take min, max, sum, median, mean

In [None]:
df1.min()

In [None]:
df1.select("V1").min()

In [None]:
df1.sum()

### value_counts

In [None]:
df3.value_counts().collect()

In [None]:
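# Step through how per-column value counts can be assembled by hand.
subset = None
if subset is None:
    subset = df3.columns  # default to every column
count_df = []
id_df = []
for col in subset:
    # Values of each column, cast to a common string type so they can be unioned.
    id_df.append(df3.select(col).rename_columns({col: "VALUES"}).cast("VALUES", 'NVARCHAR(255)'))
    # Per-column counts, indexed by the column's values.
    count_df.append(df3.agg([("count", col, "NUM_{}".format(col))], group_by=col).set_index(col))
# All values across the selected columns, de-duplicated and used as the index.
idf = id_df[0].union(id_df[1:]).distinct().set_index("VALUES")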

In [None]:
idf.head(1).collect()

### Describing a dataframe
<img src="images/Describe.png" style="float:left;" width="300" height="200" />

In [None]:
dataset4.describe().collect()

### Saving a dataframe

In [None]:
dataset4.head(10).collect()

In [None]:
dataset4.count()

In [None]:
dataset4.save("#MYTEST2")

In [None]:
dataset8 = connection_context.table("#MYTEST2")

In [None]:
dataset8.head(10).collect()

In [None]:
dataset8.count()

### Pivoting

In [None]:
dataset8.pivot_table(values='EMP_VAR_RATE', index='ID', columns='EDUCATION', aggfunc='avg').head(10).collect()
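
For readers unfamiliar with pivoting, the sketch below shows one way an average pivot can be written in plain SQL, using conditional aggregation. It runs against `sqlite3` with a small made-up table, and the helper `pivot_avg_sql` is hypothetical. It illustrates the general technique only and makes no claim about the SQL that `pivot_table` generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER, EDUCATION TEXT, EMP_VAR_RATE REAL)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, "basic", 1.0), (1, "basic", 3.0), (1, "university", 2.0), (2, "university", -1.0)],
)


def pivot_avg_sql(table, values, index, columns, categories):
    """Build an AVG pivot with one output column per category value."""
    pieces = []
    for cat in categories:
        alias = str(cat).replace('"', '""')  # escape quotes in the column alias
        # Rows outside the category yield NULL, which AVG ignores.
        pieces.append(
            'AVG(CASE WHEN "{c}" = ? THEN "{v}" END) AS "{a}"'.format(c=columns, v=values, a=alias)
        )
    return 'SELECT "{i}", {p} FROM "{t}" GROUP BY "{i}" ORDER BY "{i}"'.format(
        i=index, p=", ".join(pieces), t=table
    )


# The distinct category values become the pivoted column names.
categories = [row[0] for row in conn.execute('SELECT DISTINCT "EDUCATION" FROM T ORDER BY 1')]
sql = pivot_avg_sql("T", "EMP_VAR_RATE", "ID", "EDUCATION", categories)
pivoted = conn.execute(sql, categories).fetchall()
print(categories)
print(pivoted)
```

An index value with no rows in a given category gets NULL (`None` on the client) in that column.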

### Load Pandas DataFrame

In [None]:
dataframe.create_dataframe_from_pandas(connection_context, dataset8.head(10).collect(), 'MYTEST3', replace=True)

In [None]:
connection_context.table("MYTEST3").head(10).collect()

### Split column

In [None]:
import pandas as pd
split_df = \
dataframe.create_dataframe_from_pandas(connection_context,
                                       pandas_df=pd.DataFrame({"ID": [1,2],
                                                               "COL": ['1,2,3', '3,4,4']}),
                                       table_name="#split_test",
                                       force=True)

In [None]:
new_df = split_df.split_column(column="COL", separator=",", new_column_names=["COL1", "COL2", "COL3"])
new_df.collect()

### Concat columns

In [None]:
new_df.concat_columns(columns=["COL1", "COL2", "COL3"], separator=",").collect()