<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Customer Segmentation with AutoCluster 
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Leverage native Vantage processing for efficient and highly scalable data preparation, model training, and evaluation workflows.</p>

<p style = 'font-size:16px;font-family:Arial'>AutoCluster (Automated Machine Learning for Clustering) automates the entire process of developing a clustering model. In a single automated run it performs feature exploration, feature engineering, data preparation, model training, and evaluation on the dataset. At the end it produces a leaderboard containing the different models along with their performance. Each model is also assigned a rank, which indicates the best performing model for the given data, followed by the other models.</p>

<img src = 'images/K-means_convergence.gif' style=" border: 2px solid #404040; border-radius: 10px;"/>

<p style = 'font-size:16px;font-family:Arial'>AutoCluster is a dedicated AutoML pipeline designed specifically for clustering tasks. It automates the process of building, training, and evaluating clustering models, streamlining the workflow for unsupervised learning use cases where the goal is to group data into clusters.</p> 
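
<p style = 'font-size:16px;font-family:Arial'>To make the idea of grouping data into clusters concrete, here is a minimal plain-Python sketch of K-means (Lloyd's algorithm), the procedure animated in the image above. It is an illustration only and is not AutoCluster's implementation. The function names <code>kmeans</code> and <code>squared_distance</code>, the toy points, and the choice of initial centroids are all assumptions made for this example.</p>

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Illustrative K-means on a list of numeric tuples (hypothetical helper)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of three points each (made up for this sketch).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Assumption: start from the first and last point so the run is deterministic.
centroids, labels = kmeans(points, initial_centroids=[points[0], points[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

<p style = 'font-size:16px;font-family:Arial'>The loop alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its members, and stops once the centroids no longer change.</p>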

<p style = 'font-size:18px;font-family:Arial'>Import the necessary libraries.</p>

In [None]:
# getpass prompts the user for the password so it is not stored in plain text in the notebook
import getpass
import pandas as pd

# import all teradataml functions and supporting libraries
from teradataml import *

display.max_rows = 5

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press Enter, then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Set up the session for execution of the notebook. Run each step with Shift + Enter.</p>

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Customer_Segmentation_AutoCluster_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Medical_local');" # Takes about 2 minutes
#%run -i ../run_procedure.py "call get_data('DEMO_Medical_cloud');"

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>Create a Teradata DataFrame (virtual DataFrame)</b></p>
<p style = 'font-size:16px;font-family:Arial'>For this demo, the data already resides in Object Storage, which we access via ReadNOS using the get_data procedure run above. Here we create a reference to the table and display a sample of its contents. The data could just as easily reside in permanent tables, another RDBMS, or another Vantage system.</p>

In [None]:
cluster_df = DataFrame(in_schema('DEMO_Medical', 'Mental_Illness_Data'))
cluster_df

<p style = 'font-size:16px;font-family:Arial'>Here, we will inspect the original data set, and perform various preparation tasks.</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Inspect the rows of the table</li>
    <li>Inspect the column metadata using <a href = 'https://docs.teradata.com/search/all?query=Python+ColumnSummary&content-lang=en-US'>ColumnSummary</a></li>
    <li>Split off a testing data set to be used in evaluation</li>
    </ol>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>2.1 View Column information</b></p>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://docs.teradata.com/search/all?query=Python+ColumnSummary&content-lang=en-US'>ColumnSummary</a> provides more details on column values and ranges. Note that the resulting DataFrame is a property of the function object. </p>

In [None]:
obj = ColumnSummary(data=cluster_df, target_columns=':')

# Note: The resulting DataFrame is accessed as a property of the function object.
obj.result

In [None]:
cluster_df.shape
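
<p style = 'font-size:16px;font-family:Arial'>As a rough picture of the kind of per-column statistics a column summary provides, the sketch below counts null, non-null, and zero values for a few in-memory rows. It is plain Python and not the Vantage function. The helper <code>column_summary</code>, the statistic names, and the columns <code>age</code>, <code>score</code>, and <code>region</code> are hypothetical and are not taken from the demo table.</p>

```python
def column_summary(rows):
    """Per-column counts for a list of dict rows (illustrative helper only)."""
    summary = {}
    for col in rows[0].keys():
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        # Only numeric values can be counted as zeros; bool is excluded on purpose.
        numeric = [v for v in non_null
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        null_count = len(values) - len(non_null)
        summary[col] = {
            "non_null_count": len(non_null),
            "null_count": null_count,
            "zero_count": sum(1 for v in numeric if v == 0),
            "null_percentage": 100.0 * null_count / len(values),
        }
    return summary


# Made-up rows; None stands in for a SQL NULL.
rows = [
    {"age": 34, "score": 0, "region": "north"},
    {"age": None, "score": 3, "region": "south"},
    {"age": 51, "score": 0, "region": None},
    {"age": 29, "score": 7, "region": "north"},
]

summary = column_summary(rows)
for column, stats in summary.items():
    print(column, stats)
```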

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>2.2 Create a data set for AutoCluster Function</b></p>

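<p style = 'font-size:16px;font-family:Arial'>Split the data into a training set (85% of the rows) and a test set (15% of the rows). The <code>sample</code> method adds a <code>sampleid</code> column that identifies which sample each row belongs to.</p>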

In [None]:
# Randomly split the rows 85% / 15%; sample() adds a 'sampleid' column (1 or 2)
cluster_df_sample = cluster_df.sample(frac = [0.85, 0.15])

In [None]:
# sampleid 1 is the 85% training sample, sampleid 2 is the 15% test sample
cluster_df_train = cluster_df_sample[cluster_df_sample['sampleid'] == 1].drop('sampleid', axis=1)
cluster_df_test = cluster_df_sample[cluster_df_sample['sampleid'] == 2].drop('sampleid', axis=1)

In [None]:
cluster_df_test.shape

In [None]:
cluster_df_train.shape

In [None]:
cluster_df_train.head()

In [None]:
cluster_df_test.head()
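
<p style = 'font-size:16px;font-family:Arial'>The split above can be pictured with a small plain-Python sketch: rows are shuffled, cut according to the requested fractions, and tagged with a <code>sampleid</code> that is then used to filter out each subset. The helper <code>sample_with_ids</code> and the toy rows are assumptions for illustration. This is not how Vantage implements sampling, and in-database sample sizes are not guaranteed to match these counts exactly.</p>

```python
import random


def sample_with_ids(rows, fractions, seed=0):
    """Shuffle rows and tag each with a 1-based sampleid (illustrative helper)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    tagged = []
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        size = round(fraction * len(rows))
        for row in shuffled[start:start + size]:
            tagged.append({**row, "sampleid": sample_id})
        start += size
    return tagged


# 100 made-up rows identified only by a hypothetical 'id' field.
rows = [{"id": i} for i in range(100)]
sampled = sample_with_ids(rows, [0.85, 0.15])

# Same filtering idea as in the cells above: keep one sampleid, drop the tag.
train = [{"id": r["id"]} for r in sampled if r["sampleid"] == 1]
test = [{"id": r["id"]} for r in sampled if r["sampleid"] == 2]
print(len(train), len(test))  # 85 15
```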

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 3. Running the AutoCluster Function  </b></p>

<p style = 'font-size:16px;font-family:Arial'>AutoCluster is a dedicated AutoML pipeline designed specifically for clustering tasks. It automates the process of building, training, and evaluating clustering models, streamlining the workflow for unsupervised learning use cases where the goal is to group data into clusters.</p>

In [None]:
cl = AutoCluster(verbose=2,
                 max_runtime_secs=300)

<p style = 'font-size:16px;font-family:Arial'><i>* Note: The command below takes approximately 15-20 minutes to run.</i></p>

In [None]:
cl.fit(cluster_df_train)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 4. Leaderboard  </b></p>
<p style = 'font-size:16px;font-family:Arial'>The leaderboard displays all the trained models and their metrics.</p>

In [None]:
cl.leaderboard()
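
<p style = 'font-size:16px;font-family:Arial'>Clustering models are typically ranked with internal metrics that need no ground-truth labels. One widely used example is the silhouette coefficient, sketched below in plain Python. Whether and how your leaderboard reports this particular metric is an assumption to check against its output. The function <code>silhouette_score</code> and the toy points are written for this illustration only.</p>

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(points):
        # a: mean distance from p to the other members of its own cluster.
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance from p to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in cluster_ids if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up data: two tight, well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # labels match the groups
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # labels cut across the groups
print(round(good, 3), round(mixed, 3))
```

<p style = 'font-size:16px;font-family:Arial'>Values close to 1 indicate compact, well-separated clusters, so the labeling that matches the two groups scores higher than the one that cuts across them.</p>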

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 5. Best Performing Model  </b></p>
<p style = 'font-size:16px;font-family:Arial'>The <code>leader</code> method displays the best performing model.</p>

In [None]:
cl.leader()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 6. Get Hyperparameters for Trained Models  </b></p>
<p style = 'font-size:16px;font-family:Arial'>The commands below display the hyperparameters of the model with the specified rank.</p>

In [None]:
cl.model_hyperparameters(rank=1)

In [None]:
cl.model_hyperparameters(rank=5)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 7. Generate Predictions  </b></p>
<p style = 'font-size:16px;font-family:Arial'>Generate predictions on the test data using the model with the selected rank.</p>

In [None]:
prediction = cl.predict(cluster_df_test, rank=1)

In [None]:
prediction.head()

In [None]:
prediction_2 = cl.predict(cluster_df_test, rank=5)
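
<p style = 'font-size:16px;font-family:Arial'>When two models assign clusters to the same rows, a cross-tabulation of their labels shows how far the two segmentations agree. The sketch below uses plain Python lists as stand-ins for the cluster labels. The helper <code>crosstab</code> and the label values are hypothetical, since the column names of the prediction output are not shown in this notebook.</p>

```python
from collections import Counter


def crosstab(labels_a, labels_b):
    """Count how often each pair (label from model A, label from model B) occurs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must describe the same rows")
    return Counter(zip(labels_a, labels_b))


# Made-up cluster labels for the same six rows from two different models.
labels_rank_1 = [0, 0, 1, 1, 2, 2]
labels_rank_5 = [1, 1, 0, 0, 0, 2]

table = crosstab(labels_rank_1, labels_rank_5)
for (a, b), count in sorted(table.items()):
    print(f"rank-1 cluster {a} / rank-5 cluster {b}: {count} rows")
```

<p style = 'font-size:16px;font-family:Arial'>Cluster numbers are arbitrary, so two models can agree perfectly while using different ids. What matters is whether each cluster of one model maps mostly onto a single cluster of the other.</p>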

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8. Clean up</b></p>

<p style = 'font-size:20px;font-family:Arial'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Medical');"  # takes about 5 seconds, optional if you want to use the data later

In [None]:
remove_context()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b> 9. Conclusion </b> </p>
<p style = 'font-size:16px;font-family:Arial'>In this notebook we have seen how to use the in-database AutoCluster function to automatically segment data.</p>

<p style = 'font-size:20px;font-family:Arial'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
        <li>Teradata AutoCluster Function Reference:
        <a href = 'https://docs.teradata.com/search/all?query=AutoCluster&content-lang=en-US'> docs</a></li>
  
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2026. All Rights Reserved
        </div>
    </div>
</footer>