## Check the setup and connect to the database

Check if your connection with HANA DB is still active.

In [None]:
%run 'exercise1-check_setup.ipynb'

## Use HANA DataFrame and Pandas DataFrame

Check which tables are available in schema **DB_1**. Table **TITANIC** contains all historical data, including the **SURVIVED** column. **TITANIC_TEST** contains fictional test data without the **SURVIVED** column. **TITANIC_TRUTH** contains the same data as **TITANIC_TEST**, plus the **SURVIVED** column.

In [None]:
myconn.get_tables(schema='DB_1')

A table with data already exists in your SAP HANA database, so you use [the `table()` method](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2023_3_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.ConnectionContext.table) to create a HANA DataFrame from the existing database table.

In [3]:
hdf_train=myconn.table('TITANIC', schema='DB_1')

You can always use [the `select_statement` property](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2023_3_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.DataFrame) to check the SQL SELECT statement that backs a HANA DataFrame. 

In [None]:
hdf_train.select_statement

A HANA DataFrame represents only a SQL SELECT statement and does not store data...

In [8]:
hdf_train_first10recs=hdf_train.head(10)

In [None]:
hdf_train_first10recs.select_statement

...until [the `collect()` method](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2023_3_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.DataFrame.collect) is executed, which returns the result as a Pandas DataFrame on the client side. The data comes from the data science competition website Kaggle (https://www.kaggle.com/competitions/titanic). More information on the meaning of the columns is available here: https://www.kaggle.com/competitions/titanic/data.

In [None]:
hdf_train_first10recs.collect()
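
To make the "statement only, no data" idea concrete, here is a minimal sketch of a lazily evaluated frame. It is a toy, not how `hana_ml` is implemented: the `LazyFrame` class, the `passengers` table, and its rows are all hypothetical, and an in-memory SQLite database stands in for SAP HANA so the cell runs without a connection.

```python
import sqlite3


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not hana_ml)."""

    def __init__(self, conn, select_statement):
        self.conn = conn
        # Only the SQL text is stored; no rows are fetched here.
        self.select_statement = select_statement

    def head(self, n):
        # Wrap the current statement in a new one; nothing is executed yet.
        return LazyFrame(
            self.conn,
            f"SELECT * FROM ({self.select_statement}) LIMIT {int(n)}",
        )

    def collect(self):
        # Only now is the SQL sent to the database and the rows returned.
        return self.conn.execute(self.select_statement).fetchall()


# Hypothetical demo data in an in-memory SQLite database (assumption: not the TITANIC table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (id INTEGER, pclass INTEGER)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(i, 1 + i % 3) for i in range(1, 21)],
)

frame = LazyFrame(conn, "SELECT * FROM passengers")
first10 = frame.head(10)          # builds a new statement, still no data
print(first10.select_statement)   # inspect the SQL behind the frame
rows = first10.collect()          # executes the SQL and fetches the rows
print(len(rows))
```

The design point is that `head()` returns another frame rather than data, so operations can be chained cheaply and the database does the work only once, when `collect()` is called.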

You use [HANA `DataFrame` methods](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2023_3_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.DataFrame) to query the data in the SAP HANA database.

In [None]:
print(hdf_train.value_counts(['PCLASS']).select_statement)

In [None]:
hdf_train.value_counts(['PCLASS']).collect()

In [None]:
print(hdf_train.value_counts(['PCLASS']).sort('NUM_PCLASS', desc=True).select_statement)

In [None]:
hdf_train.value_counts(['PCLASS']).sort('NUM_PCLASS', desc=True).collect()
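
Counting and sorting through HANA DataFrame methods means the aggregation runs in the database, and only the small result travels to the client. The sketch below shows the general shape of such a query using SQLite and a hypothetical `titanic_demo` table with made-up rows. The exact SQL that `hana_ml` generates may differ; print `select_statement` as in the cells above to see the real one.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SAP HANA (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic_demo (passenger_id INTEGER, pclass INTEGER)")

# Made-up rows, not taken from the real TITANIC table.
demo_rows = [(1, 3), (2, 1), (3, 3), (4, 2), (5, 3), (6, 1)]
conn.executemany("INSERT INTO titanic_demo VALUES (?, ?)", demo_rows)

# Count rows per class and sort by the count, all inside the database.
sql = """
SELECT pclass, COUNT(*) AS num_pclass
FROM titanic_demo
GROUP BY pclass
ORDER BY num_pclass DESC
"""
counts = conn.execute(sql).fetchall()
print(counts)
```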

You use [Pandas `DataFrame` and/or `Series`](https://pandas.pydata.org/docs/user_guide/10min.html#minutes-to-pandas) methods to query the data returned to the client by the `collect()` method.

In [None]:
hdf_train.value_counts(['PCLASS']).collect().sort_values('NUM_PCLASS')
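
Once the result is on the client, it is an ordinary Pandas DataFrame and no longer touches the database. The following sketch works on a small hand-built frame that only mimics the column names of the collected result; the counts are invented for illustration and are not the real TITANIC figures.

```python
import pandas as pd

# Hand-built frame mimicking the shape of the collected result (values are made up).
counts = pd.DataFrame({"PCLASS": [1, 2, 3], "NUM_PCLASS": [40, 25, 90]})

# sort_values sorts ascending by default, unlike the HANA sort(..., desc=True) call.
ascending = counts.sort_values("NUM_PCLASS")
descending = counts.sort_values("NUM_PCLASS", ascending=False)

# Further client-side work: pick the largest class and compute each class's share.
top_class = int(descending.iloc[0]["PCLASS"])
share = counts["NUM_PCLASS"] / counts["NUM_PCLASS"].sum()

print(ascending)
print(descending)
print(top_class)
```

Client-side sorting is convenient for small results like this one; for large tables it is usually better to sort and aggregate with HANA DataFrame methods first and call `collect()` last.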

🤓 **Let's discuss**:
1. HANA DataFrames
2. Pandas DataFrames/Series