## End-to-End Scenario: Predict Segmentation of New Customers for a Supermarket

Author: TI HDA DB HANA Core CN

In this end-to-end scenario, we show how to predict the segment (cluster) of new customers for a supermarket. First, we use the K-means function to cluster the supermarket's existing customers. The clustering output then serves as the training data for the C4.5 Decision Tree function, which predicts the segment of each new customer.

### 1. Technology Background

K-means clustering is a method of cluster analysis that partitions N observations (records) into K clusters, where each observation belongs to the cluster with the nearest center. It is one of the most commonly used clustering algorithms.
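
To make the "nearest center" idea concrete, here is a minimal pure-Python sketch of the classic K-means iteration (Lloyd's algorithm). It is only an illustration and not the PAL implementation. The helper name `kmeans`, the rule for empty clusters, and the stopping test are assumptions. The sample points are the nine (AGE, INCOME) customers used in Step 1, and the first K points seed the centers.

```python
import math

def kmeans(points, k, max_iter=100, tol=1e-6):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        # Stop once no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers

points = [(20, 100000), (21, 101000), (22, 102000),
          (30, 200000), (31, 201000), (32, 202000),
          (40, 400000), (41, 401000), (42, 402000)]
labels, centers = kmeans(points, k=3)
print(labels)
print(centers)
```

On this data the three income bands end up in three separate clusters, even though all three initial centers come from the lowest band.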

Decision trees are powerful and popular tools for classification and prediction. Decision tree learning, used in statistics, data mining, and machine learning, uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value.
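
C4.5 ranks candidate splits by gain ratio, that is, information gain divided by the split information. The sketch below computes it for a binary threshold split on a numeric attribute. It is a simplified illustration and not the PAL implementation. The function names and the `level` labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info

income = [100000, 101000, 102000, 200000, 201000, 202000, 400000, 401000, 402000]
level = ['0', '0', '0', '1', '1', '1', '2', '2', '2']  # assumed labels, for illustration only

for threshold in (100500, 150000, 300000):
    print(threshold, round(gain_ratio(income, level, threshold), 3))
```

Thresholds that fall between two income bands score higher than a threshold that cuts through a band, which is why a tree trained on such data tends to split between the bands.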

### 2. Implementation Steps

**Set Up the Connection to SAP HANA**

First, you need to create a connection to an SAP HANA instance. In the following cell, we use a config file, config/e2edata.ini, to control the connection parameters.

In your case, update url, port, user, and pwd with your HANA instance information.

In [None]:
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import Settings

# replace url, port, user, and pwd with your HANA instance information
connection_context = ConnectionContext(url, port, user, pwd)

Connection status:

In [None]:
print(connection_context.connection.isconnected())
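
The connection cell expects `url`, `port`, `user`, and `pwd` to be defined already. One way to fill them from an INI file is sketched below with the standard-library `configparser`. The helper name `load_hana_config`, the section name `hana`, and the key names are assumptions, since the layout of config/e2edata.ini is not shown here.

```python
import configparser

def load_hana_config(path, section="hana"):
    """Read url, port, user, pwd from an INI file (the file layout is an assumption)."""
    # interpolation=None keeps '%' characters in passwords from being interpreted
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"config file not found: {path}")
    cfg = parser[section]
    return cfg["url"], cfg.getint("port"), cfg["user"], cfg["pwd"]
```

With a file that has a `[hana]` section containing `url`, `port`, `user`, and `pwd` entries, `url, port, user, pwd = load_hana_config("config/e2edata.ini")` would provide the four values passed to `ConnectionContext`.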

**Step 1: Invoke K-means**

Input the customer data and use the K-means function to partition the data set into K clusters. In this example, nine rows of data are used as the input. K equals 3, which means the customers are partitioned into three levels.

Generate the customer dataframe with ID, AGE, INCOME:

In [None]:
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas

data = {'ID':  [1, 2, 3, 4, 5, 6, 7, 8, 9],
        'AGE': [20, 21, 22, 30, 31, 32, 40, 41, 42],
        'INCOME': [100000, 101000, 102000, 200000, 201000, 202000, 400000, 401000, 402000]}
customer = pd.DataFrame(data, columns=['ID', 'AGE', 'INCOME'])
customer_df = create_dataframe_from_pandas(connection_context=connection_context, 
                                           pandas_df=customer, 
                                           table_name='CUSTOMER_TBL', 
                                           force=True, 
                                           replace=True)
customer_df.collect()

Call KMeans to cluster the customers:

In [None]:
from hana_ml.algorithms.pal.clustering import KMeans

kmeans = KMeans(n_clusters=3, 
                init='first_k', 
                max_iter=100,
                tol=1.0E-6, 
                distance_level='Euclidean',
                normalization='no')

kmeans.fit(data=customer_df, key='ID', features=['AGE', 'INCOME'])
print(kmeans.labels_.collect())
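
Because the call above uses Euclidean distance with `normalization='no'`, the raw INCOME values dominate the distance and AGE has almost no influence. The local sketch below shows this for two of the sample customers. The min-max rescaling and the helper `min_max` are used only for comparison and are not a statement about how PAL's normalization options work internally.

```python
import math

# (AGE, INCOME) for customers 1 and 4 from the sample data
a, b = (20, 100000), (30, 200000)

# Share of the squared Euclidean distance contributed by INCOME on raw values
age_gap, income_gap = a[0] - b[0], a[1] - b[1]
raw_income_share = income_gap**2 / (age_gap**2 + income_gap**2)

def min_max(value, lo, hi):
    """Rescale value to [0, 1] given the column's minimum and maximum."""
    return (value - lo) / (hi - lo)

# Column ranges taken from the nine sample rows
scaled_a = (min_max(a[0], 20, 42), min_max(a[1], 100000, 402000))
scaled_b = (min_max(b[0], 20, 42), min_max(b[1], 100000, 402000))
s_age, s_income = scaled_a[0] - scaled_b[0], scaled_a[1] - scaled_b[1]
scaled_income_share = s_income**2 / (s_age**2 + s_income**2)

print(f"raw distance: {math.dist(a, b):.1f}")
print(f"INCOME share of squared distance (raw):    {raw_income_share:.6f}")
print(f"INCOME share of squared distance (scaled): {scaled_income_share:.6f}")
```

For this data set the income bands are well separated, so clustering on raw values still yields sensible groups. On data where both attributes should matter, rescaling before clustering is worth considering.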

Join customer_df with the CLUSTER_ID column of kmeans.labels_ for the next step:

In [None]:
# rename ID to ID_R so that the join below does not produce two columns named ID
result = kmeans.labels_.select("ID", "CLUSTER_ID").rename_columns(names=["ID_R", "CLUSTER_ID"])

data_cluster = customer_df.join(other=result, condition="ID = ID_R")

# select the necessary columns for modeling in the next step
data_cluster = data_cluster.select(['AGE', 'INCOME', 'CLUSTER_ID'])

# rename the column 'CLUSTER_ID' to 'LEVEL'
data_cluster = data_cluster.rename_columns(names=['AGE', 'INCOME', 'LEVEL'])
print(data_cluster.collect())

**Step 2: Invoke C4.5 Decision Tree**


Use the output dataframe data_cluster from the previous step as the training data for the C4.5 Decision Tree. The C4.5 Decision Tree function generates a tree model that maps observations about an item to conclusions about the item's target value.

In [None]:
from hana_ml.algorithms.pal.trees import DecisionTreeClassifier

# convert the data type of column LEVEL to VARCHAR(10)
data_cluster = data_cluster.cast(cols='LEVEL', new_type='VARCHAR(10)')

dt = DecisionTreeClassifier(algorithm='c45',
                            percentage=1.0,
                            model_format='pmml')
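# data_cluster has no ID column; LEVEL is the label, AGE and INCOME are the features
dt.fit(data=data_cluster, features=['AGE', 'INCOME'], label='LEVEL')

# have a look at the resulting decision rules in dt.decision_rules_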
print(dt.decision_rules_.collect())

**Step 3: Prediction with Tree Model**

Use the tree model from Step 2 to assign each new customer to the level that he or she belongs to:

In [None]:
data = {'ID':  [10, 11, 12],
        'AGE': [20, 30, 40],
        'INCOME': [100003, 200003, 400003]}
new_data = pd.DataFrame(data, columns=['ID', 'AGE', 'INCOME'])
new_data_df = create_dataframe_from_pandas(connection_context=connection_context, 
                                           pandas_df=new_data, 
                                           table_name='NEW_CUSTOMER_TBL', 
                                           force=True, 
                                           replace=True)

result = dt.predict(data=new_data_df, key="ID")
print(result.collect())
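
As a rough local sanity check of what to expect from the prediction, the sketch below applies two hand-written income thresholds to the three new customers. The thresholds, the level names, and the function `predict_level` are hypothetical. The actual split points and labels come from the trained model and may differ.

```python
def predict_level(income, low_mid=151000, mid_high=301000):
    """Hypothetical threshold rules; not the rules learned by the trained model."""
    if income <= low_mid:
        return "low"
    if income <= mid_high:
        return "mid"
    return "high"

# ID -> INCOME for the three new customers
new_customers = {10: 100003, 11: 200003, 12: 400003}
predictions = {cid: predict_level(inc) for cid, inc in new_customers.items()}
print(predictions)
```

Each new customer sits next to one of the three income bands in the training data, so the model's predictions should place the three customers in three different levels.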

### 3. Drop Tables and Close the HANA Connection

In [None]:
connection_context.drop_table(table="CUSTOMER_TBL")
connection_context.drop_table(table="NEW_CUSTOMER_TBL")
connection_context.close()