# hana-ml Tutorial - Clustering 

**Author: TI HDA DB HANA Core CN**

This tutorial shows how to use the clustering functions in hana-ml to prepare data and train a k-means model on the public Iris dataset.

## Import necessary libraries and functions

In [None]:
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal import clustering
from hana_ml.algorithms.pal.utility import DataSets, Settings
from hana_ml.visualizers.unified_report import UnifiedReport
from hana_ml.algorithms.pal.partition import train_test_val_split
import numpy as np
import pandas as pd

## Create a connection to a SAP HANA instance

First, you need to create a connection to a SAP HANA instance. In our setup, a config file, config/e2edata.ini, holds the connection parameters.

Before running the following cell, set url, port, user, and pwd to the values for your own HANA instance.

In [None]:
# Please replace url, port, user, pwd with your HANA instance information
connection_context = ConnectionContext(url, port, user, pwd)
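If you also keep the connection parameters in an INI file, the values can be read with Python's standard configparser module. The following is a minimal sketch, not the tutorial authors' actual loader: the function name load_hana_settings, the section name hana, and the key names url, port, user, pwd are assumptions chosen to match the variable names used above, and the real layout of config/e2edata.ini may differ.

```python
import configparser


def load_hana_settings(path, section="hana"):
    """Read connection settings from an INI file and return them as a dict.

    The section name and key names are hypothetical; adapt them to your file.
    """
    # interpolation=None keeps '%' characters in passwords from being
    # interpreted as configparser placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    # parser.read() silently skips missing files, so check its result.
    if not parser.read(path):
        raise FileNotFoundError("Config file not found: {}".format(path))
    if section not in parser:
        raise KeyError("Section [{}] not found in {}".format(section, path))
    cfg = parser[section]
    return {
        "url": cfg["url"],
        "port": cfg.getint("port"),  # the port is stored as text, convert it
        "user": cfg["user"],
        "pwd": cfg["pwd"],
    }


# Possible usage with the connection cell above (requires a reachable HANA instance):
# settings = load_hana_settings("config/e2edata.ini")
# connection_context = ConnectionContext(settings["url"], settings["port"],
#                                        settings["user"], settings["pwd"])
```

Keeping credentials in a file outside the notebook avoids committing passwords together with the tutorial code.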

## Load the dataset

The Iris data set used here comes from the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/iris, for tutorial use only). It contains measurements of iris flowers from three species.
Each record has the following attributes:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

Although each flower's species is recorded in the dataset, we ignore that label during training and group the data into 3 clusters, since we know there are three species. The hope is that each cluster will correspond to one species.

hana-ml provides a class called DataSets that contains several public datasets. You can use load_iris_data() to load the Iris dataset.

In [None]:
# load the dataset
iris_df, _, _, _ = DataSets.load_iris_data(connection_context)

# number of rows and number of columns
print("Shape of Iris dataset: {}".format(iris_df.shape))
# columns
print(iris_df.columns)
# types of each column
print(iris_df.dtypes())
# check how many SPECIES are in the data set
print(iris_df.distinct("SPECIES").collect())

**Generate a Dataset Report**

In [None]:
UnifiedReport(iris_df).build().display()

**Split the dataset**

In [None]:
df_iris_train, df_iris_test, _ = train_test_val_split(data=iris_df, 
                                                      random_seed=2,
                                                      training_percentage=0.8,
                                                      testing_percentage=0.2,
                                                      validation_percentage=0,
                                                      id_column='ID',
                                                      partition_method='stratified',
                                                      stratified_column='SPECIES')

print("Number of training samples: {}".format(df_iris_train.count()))
print("Number of test samples: {}".format(df_iris_test.count()))
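To make the idea behind partition_method='stratified' concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration only and not how PAL implements partitioning: the function stratified_split and its arguments are hypothetical, and the split runs on local Python data rather than inside SAP HANA.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, training_percentage=0.8, seed=2):
    """Split rows into train/test so each label keeps roughly the same share.

    Illustrative sketch only; PAL's partitioning may differ in its details.
    """
    rng = random.Random(seed)
    # Group the rows by their label (for Iris, the species).
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Take the same fraction from every group.
        cut = round(len(group) * training_percentage)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

With 50 rows per species and a training share of 0.8, every species contributes 40 rows to the training set and 10 to the test set, so the class proportions stay the same in both parts.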

## Model Training

In [None]:
features = ['SEPALLENGTHCM','SEPALWIDTHCM','PETALLENGTHCM','PETALWIDTHCM']
kmeans = clustering.KMeans(thread_ratio=0.2, 
                           n_clusters=3, 
                           distance_level='euclidean', 
                           max_iter=100, 
                           tol=1.0E-6, 
                           category_weights=0.5, 
                           normalization='min_max')
km = kmeans.fit(data=df_iris_train, key='ID', features=features)
print(km.labels_.collect())
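The parameters normalization='min_max', distance_level='euclidean', max_iter and tol correspond to steps of the classic k-means procedure. The sketch below shows those steps in plain Python to clarify what the parameters mean. It is not PAL's implementation: the function names are hypothetical, the initial centroids are passed in by hand, and the stopping rule (largest centroid shift at or below tol) is an assumption.

```python
import math


def min_max_scale(rows):
    """Rescale every column to [0, 1] using that column's min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, max_iter=100, tol=1.0e-6):
    """Lloyd-style k-means starting from the given centroids."""
    for _ in range(max_iter):
        labels = [nearest_centroid(p, centroids) for p in points]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the column-wise mean of its members.
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(list(old))
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # assumed stopping rule
            break
    labels = [nearest_centroid(p, centroids) for p in points]
    return labels, centroids
```

Min-max scaling matters here because Euclidean distance would otherwise be dominated by the features with the largest numeric range.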

## Prediction

In [None]:
res = km.predict(data=df_iris_test, key='ID', features=features)
print(res.collect())
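To check whether the clusters line up with the species, the predicted cluster of each test row can be cross-tabulated against its known species. The helpers below are a minimal sketch working on plain Python dictionaries. The names cluster_species_table and purity are hypothetical, and the commented usage assumes that the prediction result has the columns ID and CLUSTER_ID, which you should verify against the printed output above.

```python
from collections import Counter, defaultdict


def cluster_species_table(cluster_by_id, species_by_id):
    """Count how many rows of each species fall into each cluster."""
    table = defaultdict(Counter)
    for row_id, cluster in cluster_by_id.items():
        table[cluster][species_by_id[row_id]] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


def purity(table):
    """Fraction of rows that belong to the majority species of their cluster."""
    total = sum(sum(counts.values()) for counts in table.values())
    majority = sum(max(counts.values()) for counts in table.values())
    return majority / total


# Possible usage with the results above (column names are assumptions):
# pred = res.collect()
# truth = df_iris_test.select("ID", "SPECIES").collect()
# table = cluster_species_table(dict(zip(pred["ID"], pred["CLUSTER_ID"])),
#                               dict(zip(truth["ID"], truth["SPECIES"])))
# print(table, purity(table))
```

A purity close to 1 means that almost every cluster contains a single species. Cluster numbers themselves are arbitrary, so only the grouping matters, not which number a species receives.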

## Close the connection

In [None]:
connection_context.close()

## Thank you!