<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Enterprise Feature Store - DatasetCatalog
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;'><b>Retail Computers Feature Store & DatasetCatalog Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>This notebook demonstrates the end-to-end workflow for building a feature store and dataset catalog for retail computer pricing analytics using teradataml. Key steps include:</p>
<ul>
  <li>Loading and transforming raw computer sales data to engineer features such as total, maximum, and count of prices by RAM size.</li>
  <li>Creating a centralized feature store <code>retail_computers_feature_store</code> within the <code>computers_pricing</code> data domain to manage and govern these features.</li>
  <li>Ingesting and versioning features using FeatureProcess, enabling traceability and reusability.</li>
  <li>Building a DatasetCatalog to assemble curated datasets from selected feature versions for downstream analytics.</li>
    <li>Applying the KMeans clustering algorithm to segment computers based on engineered features, supporting business use cases like pricing strategy, inventory management, and targeted marketing.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>The notebook provides a practical example of operationalizing feature engineering, dataset management, and machine learning in a collaborative, production-ready environment.</p>

<p style = 'font-size:18px;font-family:Arial;'><b>Disclaimer</b></p>

<p style = 'font-size:12px;font-family:Arial;'>
The sample code (“Sample Code”) provided is not covered by any Teradata agreements. Please be aware that Teradata has no control over the model responses to such sample code and such response may vary. The use of the model by Teradata is strictly for demonstration purposes and does not constitute any form of certification or endorsement. The sample code is provided “AS IS” and any express or implied warranties, including the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall Teradata be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) sustained by you or a third party, however caused and on any theory of liability, whether in contract, strict liability, or tort arising in any way out of the use of this sample code, even if advised of the possibility of such damage.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>1. Connect to Vantage, import Python packages, and explore the dataset</b></p>

In [None]:
!pip install teradataml==20.0.0.7 --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>Please execute the above pip install to get the required version of the library. Be sure to restart the kernel afterwards so that the installed library is loaded into memory. The simplest way to restart the kernel is to type zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import os
from teradataml import *
from getpass import getpass
import warnings
warnings.filterwarnings('ignore')

display.max_rows = 5

<hr style="height:2px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'> 1.1 Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>We will be prompted for the password. After entering it, we press the Enter key and then use the down arrow to move to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=EFS-DatasetCatalog.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Setup a Feature Store Repository</b></p>
<p style = 'font-size:18px;font-family:Arial;'><b>2.1 Create the FeatureStore</b></p>

In [None]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")

<p style = 'font-size:18px;font-family:Arial;'><b>2.2 Setup the FeatureStore</b></p>

In [None]:
fs.setup()

<p style = 'font-size:18px;font-family:Arial;'><b>2.3 Checking Availability</b></p>

In [None]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>3. Get Data for the Demo</b></p>
<p style = 'font-size:18px;font-family:Arial;'><b>3.1 Load the computers_train1 data</b></p>

In [None]:
load_example_data("kmeans", "computers_train1")
computers_train1 = DataFrame("computers_train1")
computers_train1

<p style = 'font-size:18px;font-family:Arial;'><b>3.2 Perform Data Transformation</b></p>
<p style = 'font-size:16px;font-family:Arial;'><b>Transformation Details:</b>
In this step, we filter the computers dataset to include only records where the price is less than 2000. Then, we group the filtered data by the <code>ram</code> column and compute three aggregated features for each RAM size:</p>
<ul>
  <li><code>total_price</code>: The sum of prices for all computers with the same RAM size.</li>
  <li><code>max_price</code>: The maximum price among computers with the same RAM size.</li>
  <li><code>count_price</code>: The total number of computers for each RAM size.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>These features are essential for understanding pricing trends and inventory distribution by RAM configuration.</p>

In [None]:
df1 = computers_train1[computers_train1['price'] < 2000]

df2 = df1.groupby('ram').assign(total_price=df1.price.sum(),
                                max_price=df1.price.max(),
                                count_price=df1.price.count())
df2
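
<p style = 'font-size:16px;font-family:Arial;'>To make the aggregation concrete, here is a minimal local sketch of the same filter-then-group logic in plain Python. The sample rows and the helper name <code>aggregate_by_ram</code> are hypothetical and are not part of teradataml; the actual computation in the cell above runs in Vantage.</p>

```python
from collections import defaultdict

# Hypothetical sample rows for illustration; not taken from computers_train1.
rows = [
    {"ram": 4, "price": 1499},
    {"ram": 4, "price": 1795},
    {"ram": 8, "price": 1995},
    {"ram": 8, "price": 2595},   # dropped by the price < 2000 filter
    {"ram": 16, "price": 1890},
]

def aggregate_by_ram(rows, price_limit=2000):
    """Filter rows by price, then compute sum, max, and count of price per RAM size."""
    grouped = defaultdict(list)
    for row in rows:
        if row["price"] < price_limit:
            grouped[row["ram"]].append(row["price"])
    return {
        ram: {
            "total_price": sum(prices),
            "max_price": max(prices),
            "count_price": len(prices),
        }
        for ram, prices in grouped.items()
    }

features = aggregate_by_ram(rows)
for ram, values in sorted(features.items()):
    print(ram, values)
```

<p style = 'font-size:16px;font-family:Arial;'>Each RAM size ends up with exactly one row of feature values, which is why <code>ram</code> can later serve as the entity for these features.</p>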

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>4. Store the data transformations</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We store the transformation as a view, so the transformation steps stay the same even when the underlying data changes.</p>

In [None]:
computers_train_df = df2.create_view(view_name="computers_train1_view")
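
<p style = 'font-size:16px;font-family:Arial;'>The idea that a view stores the transformation rather than the data can be illustrated locally with SQLite from the Python standard library. This is only an analogy under stated assumptions: the table name, view name, and sample values are hypothetical, and SQLite stands in for Vantage.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE computers (ram INTEGER, price INTEGER)")
conn.executemany("INSERT INTO computers VALUES (?, ?)",
                 [(4, 1499), (4, 1795), (8, 1995)])

# The view stores the query (the transformation), not its result.
conn.execute("""
    CREATE VIEW computers_view AS
    SELECT ram,
           SUM(price)   AS total_price,
           MAX(price)   AS max_price,
           COUNT(price) AS count_price
    FROM computers
    WHERE price < 2000
    GROUP BY ram
""")

query = "SELECT * FROM computers_view ORDER BY ram"
before = conn.execute(query).fetchall()

# New raw data arrives; the view definition itself is unchanged.
conn.execute("INSERT INTO computers VALUES (8, 1700)")
after = conn.execute(query).fetchall()

print(before)
print(after)
conn.close()
```

<p style = 'font-size:16px;font-family:Arial;'>The second query picks up the new row without any change to the view, which is the behavior this step relies on.</p>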

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>5. Ingest the features</b></p>
<ul>
  <li>Store the values of the <code>count_price</code>, <code>max_price</code>, and <code>total_price</code> features.</li>
  <li>Run the FeatureProcess.</li>
</ul>

<p style = 'font-size:18px;font-family:Arial;'><b>5.1 Create the FeatureProcess and run it</b></p>

In [None]:
fp = FeatureProcess(repo="retail_computers_feature_store",
                    data_domain="computers_pricing",
                    object=computers_train_df,
                    entity='ram',
                    features=['count_price', 'max_price', 'total_price'])
fp.run()

<p style = 'font-size:18px;font-family:Arial;'><b>5.2 See the mind_map for Feature Store</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We ingested three features—<code>count_price</code>, <code>max_price</code>, and <code>total_price</code>—from a single feature process. This demonstrates how multiple related features can be managed and tracked together within the feature store, maintaining their lineage to the originating process.</p>

In [None]:
fs.mind_map()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>6. Build the Dataset</b></p>
<p style = 'font-size:18px;font-family:Arial;'><b>6.1 Create DatasetCatalog object</b></p>

In [None]:
dc = DatasetCatalog(repo="retail_computers_feature_store",
                    data_domain="computers_pricing")
dc

<p style = 'font-size:18px;font-family:Arial;'><b>6.2 Build the dataset with ingested features</b></p>

In [None]:
dataset_name = "kmeans_dataset"

# Build dataset with features and their versions (process_id)
selected_features = {"count_price": fp.process_id,
                     "max_price": fp.process_id,
                     "total_price": fp.process_id}

In [None]:
data = dc.build_dataset(entity="ram", 
                        selected_features=selected_features, 
                        view_name=dataset_name,
                        description="Dataset for KMeans test")
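
<p style = 'font-size:16px;font-family:Arial;'>When every feature comes from the same process run, the <code>selected_features</code> mapping can also be generated from a list of feature names. The following is a small sketch; the helper name <code>build_selected_features</code> and the placeholder ID are hypothetical, and in this notebook the ID would be <code>fp.process_id</code>.</p>

```python
def build_selected_features(feature_names, process_id):
    """Map each feature name to the process ID (version) it should be read from."""
    return {name: process_id for name in feature_names}

# Placeholder value for illustration only; in this notebook it would be fp.process_id.
example_process_id = "example-process-id"

selected = build_selected_features(
    ["count_price", "max_price", "total_price"], example_process_id
)
print(selected)
```

<p style = 'font-size:16px;font-family:Arial;'>Keeping the mapping explicit per feature still allows mixing versions later, for example by overriding a single entry with a different process ID.</p>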

<p style = 'font-size:18px;font-family:Arial;'><b>6.3 See the mind_map for Feature Store</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We built the <code>kmeans_dataset</code> dataset from the three ingested features: <code>count_price</code>, <code>max_price</code>, and <code>total_price</code>. Viewing the mind map again after this step lets us check how the dataset relates to those features and to the feature process that produced them.</p>

In [None]:
fs.mind_map()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>7. Execute the KMeans analytic function on the built dataset</b></p>
<p style = 'font-size:16px;font-family:Arial;'><b>KMeans Clustering Details:</b>
In this step, we apply the KMeans clustering algorithm to the engineered dataset. Because the dataset holds one row per RAM size, the goal is to group the RAM sizes into three distinct clusters based on their aggregated features: <code>count_price</code>, <code>max_price</code>, and <code>total_price</code>. This unsupervised learning technique helps identify natural groupings in the data, such as common pricing tiers or inventory segments, which can be used for targeted marketing, inventory management, or further analytics.</p>

In [None]:
KMeans_out = KMeans(id_column="ram",
                    target_columns=['count_price', 'max_price', 'total_price'],
                    data=data,
                    num_clusters=3)

In [None]:
KMeans_out.result
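
<p style = 'font-size:16px;font-family:Arial;'>For intuition about what the clustering step does, here is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means iteration. It is not the in-database implementation behind <code>KMeans</code>; the sample points, the use of the first <code>k</code> points as initial centroids, and the function name are assumptions made for illustration.</p>

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical (count_price, max_price, total_price) rows, one per RAM size.
points = [
    (2.0, 1500.0, 2900.0),
    (40.0, 1900.0, 70000.0),
    (90.0, 1999.0, 160000.0),
    (3.0, 1600.0, 4500.0),
    (45.0, 1950.0, 78000.0),
    (85.0, 1990.0, 150000.0),
]

labels, centroids = kmeans(points, k=3)
print(labels)
print(centroids)
```

<p style = 'font-size:16px;font-family:Arial;'>In this sketch <code>total_price</code> is orders of magnitude larger than <code>count_price</code>, so it dominates the Euclidean distance; scaling the features beforehand is a common way to balance their influence.</p>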

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>8. Explore DataDomain</b></p>
<p style = 'font-size:18px;font-family:Arial;'><b>8.1 Explore DatasetCatalog properties</b></p>
<p style = 'font-size:16px;font-family:Arial;'>The <code>data_domain</code> property shows the domain associated with the dataset catalog.</p>

In [None]:
dc.data_domain

<p style = 'font-size:18px;font-family:Arial;'><b>8.2 Explore DatasetCatalog methods</b></p>
<p style = 'font-size:16px;font-family:Arial;'><b>List the datasets</b></p>

In [None]:
dc.list_datasets()

<p style = 'font-size:16px;font-family:Arial;'><b>List the entities</b></p>

In [None]:
dc.list_entities()

<p style = 'font-size:16px;font-family:Arial;'><b>List the features</b></p>

In [None]:
dc.list_features()

<p style = 'font-size:16px;font-family:Arial;'><b>Get the dataset</b></p>
<p style = 'font-size:16px;font-family:Arial;'>This step requires a couple of conversion steps to turn the dataset_id, which is held in a DataFrame index object, into a string that we can use in the next statements.
</p>


In [None]:
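# list_features() returns a teradataml DataFrame; convert it to pandas to reach the index.
dcdf = dc.list_features()
pdf = dcdf.to_pandas()
# Take the first index value (the dataset_id) as a plain string.
dataset_id_str = str(pdf.index[0])
# print(dataset_id_str)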

In [None]:
dc.get_dataset(f'{dataset_id_str}')

<p style = 'font-size:16px;font-family:Arial;'><b>Archive the dataset</b></p>

In [None]:
dc.archive_datasets(f'{dataset_id_str}')

In [None]:
dc.list_datasets()

<p style = 'font-size:16px;font-family:Arial;'><b>Delete the dataset</b></p>

In [None]:
dc.delete_datasets(f'{dataset_id_str}')

In [None]:
dc.list_datasets()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>9. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;'> <b>Work Tables and Views </b></p>

In [None]:
db_drop_view("computers_train1_view")

In [None]:
db_drop_table("computers_train1")

In [None]:
remove_context()

<p style = 'font-size:18px;font-family:Arial;'><b>9.1 Delete the Feature Store</b></p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<p style = 'font-size:16px;font-family:Arial;'><b>Note:</b> Deleting the feature store will also drop its database if all objects are removed.</p>

In [None]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")

In [None]:
fs.delete()

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>