## Load the user connection data and connect to the SAP HANA database instance

Before running the next cell, make sure [../0x00-setup/temp_user.ini](../0x00-setup/temp_user.ini) has been copied to [../0x00-setup/user.ini](../0x00-setup/user.ini) and that the user name and password in it are set appropriately.

In [None]:
from hana_ml.algorithms.pal.utility import Settings
myhost, myport, myuser, mypwd = Settings.load_config("../0x00-setup/user.ini")

In [None]:
from hana_ml import dataframe as hdf
myconn = hdf.ConnectionContext(
    address=myhost,
    port=myport,
    user=myuser,
    password=mypwd
)
print(f"Connected to SAP HANA db version {myconn.hana_version()} \nat {myhost}:{myport} as user {myuser}")

In [None]:
print(myconn.sql("SELECT NOW() FROM DUMMY").collect().CURRENT_TIMESTAMP[0])

# Tables from SAP HANA

In [None]:
hdf_train=myconn.table('TRAIN', schema='TITANIC')

In [None]:
hdf_train.get_table_structure()

# Categorical and Continuous Variables

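Continuous variables are measured numerically and can take any value within a range, such as `Age` or `Fare`.

Categorical variables, also known as discrete or qualitative variables, take one of a limited set of values (categories), such as `Gender` or `Embarked`.

Categorical variables can be further divided into nominal variables, whose categories have no inherent order (`Embarked`), and ordinal variables, whose categories can be ranked (`PClass`).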
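
To make the distinction concrete, here is a minimal sketch in plain Python showing which summary statistic is meaningful for each variable type. The values are a hypothetical mini-sample made up for illustration; they are not read from the `TITANIC.TRAIN` table.

```python
from collections import Counter
from statistics import mean, median_low

# Hypothetical values for illustration only (not taken from the database).
fares = [7.25, 71.28, 8.05, 53.10, 8.46]   # continuous
embarked = ["S", "C", "S", "S", "Q"]        # nominal: labels without order
pclass = [3, 1, 3, 1, 2]                    # ordinal: 1 < 2 < 3 is meaningful

# Continuous: arithmetic such as the mean is meaningful.
mean_fare = mean(fares)

# Nominal: only counting is meaningful, so report the most frequent label.
most_common_port = Counter(embarked).most_common(1)[0][0]

# Ordinal: ordering is meaningful, so a median category exists.
median_class = median_low(pclass)

print(mean_fare, most_common_port, median_class)
```

Computing a mean of `embarked` would make no sense, and a mean of `pclass` would treat the distance between classes as equal, which an ordinal scale does not guarantee.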



## Univariate Analysis

[Univariate](https://en.wikipedia.org/wiki/Univariate_(statistics)) describes data that consists of observations on only a single characteristic or attribute. Univariate analysis therefore examines one variable at a time.

In [None]:
label = ['Survived']
categorical_features = ['PClass','Gender','Embarked']

In [None]:
from hana_ml.algorithms.pal.stats import univariate_analysis

In [None]:
continuous, categorical = univariate_analysis(
    data=hdf_train,
    key='PassengerId',
    categorical_variable=categorical_features + label)

In [None]:
continuous.head(15).collect()

In [None]:
continuous.filter(condition="VARIABLE_NAME='Fare'").collect()

In [None]:
categorical.head(15).collect()

In [None]:
categorical.filter(condition="VARIABLE_NAME='Survived'").collect()
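
For intuition about what a univariate summary contains, the following sketch computes typical descriptive statistics in plain Python. It is not the PAL implementation, and the statistic names, the helper functions `summarize_continuous` and `summarize_categorical`, and the sample values are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize_continuous(values):
    """Return basic descriptive statistics, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "std": stdev(present),  # sample standard deviation
    }


def summarize_categorical(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Hypothetical values for illustration only.
ages = [22, 38, None, 35, 54]
survived = [0, 1, 1, 1, 0]

age_summary = summarize_continuous(ages)
survived_summary = summarize_categorical(survived)
print(age_summary)
print(survived_summary)
```

A continuous variable is summarized by location and spread, while a categorical variable is summarized by how often each category occurs.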

🤓 **Let's discuss**:
- Categorical vs Continuous Variables

## Categorical variables evaluation (bar + pie charts)

In [None]:
from hana_ml.visualizers.eda import EDAVisualizer
from matplotlib import pyplot as plt

In [None]:
ax, bar_data = EDAVisualizer().bar_plot(data=hdf_train, column='PClass', aggregation={'PClass':'count'})

In [None]:
ax, pie_data = EDAVisualizer().pie_plot(data=hdf_train, column='PClass')

## Continuous variables evaluation (histograms + boxplots)

In [None]:
numeric_features = ['Age', 'SibSp', 'ParCh', 'Fare']

In [None]:
from hana_ml.visualizers import eda

In [None]:
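# Drop rows only where the plotted columns are missing, not rows with nulls in unrelated columns
eda.hist(data=hdf_train.dropna(subset=['Age', 'SibSp']), columns=['Age', 'SibSp'], default_bins=20)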

In [None]:
_, bp = EDAVisualizer().box_plot(
    data=hdf_train.dropna(subset=['Age']), column='Age',
    legend=False, outliers=True
)
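
Box plots commonly flag outliers with Tukey's rule: points more than 1.5 times the interquartile range beyond the quartiles. The sketch below illustrates that rule in plain Python. Whether `box_plot` uses exactly this rule and this quartile method is an assumption; the helper `tukey_outliers` and the sample ages are hypothetical.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using Tukey's k*IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical ages for illustration only.
ages = [2, 19, 22, 24, 26, 28, 30, 35, 80]
outliers, fences = tukey_outliers(ages)
print(outliers, fences)
```

Different quartile definitions give slightly different fences, so a value near a fence may be flagged by one tool and not by another.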

### Combine both charts in one figure

In [None]:
fig = plt.figure()  # pass figsize=(10, 5) to change the figure size
print(fig.figbbox)  # bounding box of the figure in pixels

# Upper subplot: distribution of Age
ax1 = fig.add_subplot(211)
eda1 = EDAVisualizer(ax=ax1)
ax1, dist_data = eda1.distribution_plot(data=hdf_train.dropna(subset=['Age']), column="Age", bins=30)

# Lower subplot: box plot of Age
ax2 = fig.add_subplot(212)
eda2 = EDAVisualizer(ax=ax2)
ax2, box_data = eda2.box_plot(data=hdf_train.dropna(subset=['Age']), column='Age', outliers=True)


## Bivariate Analysis

[Bivariate analysis](https://en.wikipedia.org/wiki/Bivariate_analysis) involves the analysis of two variables for the purpose of determining the relationship between them.

### Box plot with `group by`

In [None]:
_ = EDAVisualizer().box_plot(
    data=hdf_train.dropna(subset=['Age']), column='Age',
    legend=False, outliers=False, groupby='Survived'
)
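
Conceptually, a grouped box plot splits the rows by the grouping column and summarizes the target column within each group. The sketch below shows that idea in plain Python for the median only; the helper `median_by_group` and the `(Survived, Age)` pairs are hypothetical and not taken from the database.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (Survived, Age) pairs for illustration only.
rows = [(0, 22), (1, 38), (1, 26), (0, 35), (0, None), (1, 4), (0, 54)]


def median_by_group(rows):
    """Group values by key, skip missing values, and return the median per group."""
    groups = defaultdict(list)
    for key, value in rows:
        if value is not None:  # same effect as dropna(subset=['Age'])
            groups[key].append(value)
    return {key: median(vals) for key, vals in groups.items()}


medians = median_by_group(rows)
print(medians)
```

Comparing the per-group medians (and, in the plot, the quartiles) is what turns a univariate summary into a bivariate one.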

## Multivariate Analysis (MVA)

[MVA](https://en.wikipedia.org/wiki/Multivariate_statistics#Multivariate_analysis) addresses situations that involve more than two variables, where the relationships among multiple measurements and their structure are important.

In [None]:
EDAVisualizer().correlation_plot(
    data=hdf_train, corr_cols=['PClass', 'Fare', 'Age'],
    #cmap='bwr'
);

#Note the choice of color palette: diverging colormaps (such as the commented-out `bwr`) work best for correlation plots
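
A correlation plot colors each cell by a pairwise correlation coefficient. Assuming the coefficient is Pearson's r (a common default, but an assumption here), the sketch below computes it in plain Python for one pair of columns. The helper `pearson` and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values for illustration only.
pclass = [1, 1, 2, 3, 3]
fare = [80.0, 60.0, 20.0, 8.0, 7.0]

r = pearson(pclass, fare)
print(round(r, 3))
```

Because r ranges from -1 through 0 to +1, a diverging colormap with a neutral midpoint makes the sign and the strength of each relationship readable at a glance.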