# Enterprise Feature Store - DatasetCatalog

## Disclaimer
The sample code (“Sample Code”) provided is not covered by any Teradata agreements. Please be aware that Teradata has no control over the model responses to such sample code and such response may vary. The use of the model by Teradata is strictly for demonstration purposes and does not constitute any form of certification or endorsement. The sample code is provided “AS IS” and any express or implied warranties, including the implied warranties of merchantability and fitness for a particular purpose, are disclaimed. In no event shall Teradata be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) sustained by you or a third party, however caused and on any theory of liability, whether in contract, strict liability, or tort arising in any way out of the use of this sample code, even if advised of the possibility of such damage.

## Context
**Retail Computers Feature Store & DatasetCatalog Demo**

This notebook demonstrates the end-to-end workflow for building a feature store and dataset catalog for retail computer pricing analytics using TeradataML. Key steps include:
- Loading and transforming raw computer sales data to engineer features such as total, maximum, and count of prices by RAM size.
- Creating a centralized feature store `retail_computers_feature_store` within the `computers_pricing` data domain to manage and govern these features.
- Ingesting and versioning features using FeatureProcess, enabling traceability and reusability.
- Building a DatasetCatalog to assemble curated datasets from selected feature versions for downstream analytics.
- Applying the KMeans clustering algorithm to segment computers based on engineered features, supporting business use cases like pricing strategy, inventory management, and targeted marketing.

The notebook provides a practical example of operationalizing feature engineering, dataset management, and machine learning in a collaborative, production-ready environment.

## 1. Import the required libraries

In [3]:
import os
from teradataml import create_context, DataFrame, DataSource, Entity, Feature, FeatureGroup, FeatureStore, \
FeatureType, FeatureStatus, load_example_data, remove_context, FeatureProcess, FeatureCatalog, in_schema, \
DatasetCatalog, db_drop_table, db_drop_view, KMeans
from getpass import getpass

## 2. Connect to Vantage with Admin user 

Connecting to Vantage with an Admin user is required for initial setup tasks such as creating the feature store, configuring storage, and granting permissions to other users. These operations typically require elevated privileges.

In [113]:

context=create_context(config_file='admin_config_file.env')

## 3. Setup a Feature Store Repository

### 3.1. Create the FeatureStore

In [9]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")

Repo retail_computers_feature_store does not exist. Run FeatureStore.setup() to create the repo and setup FeatureStore.


### 3.2. Setup the FeatureStore

In [11]:
fs.setup()

True

### 3.3. Grant access to a user

**Note:** 
Granting read/write access to a user is necessary so they can create, modify, and manage features and metadata within the feature store. This ensures the specified user has the required permissions to work with the feature store objects. If needed, you can later revoke these rights using `fs.revoke.read_write(username)`.

In [15]:
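# Prompt for the username without echoing it to the screen
username = getpass(prompt='username: ')
# Grant read/write access on the feature store to that user
fs.grant.read_write(username)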

username:  ········


True

## 4. Connect to a Vantage system as the user that was granted permissions

### 4.1. Remove context with Admin user

In [16]:
remove_context()

True

### 4.2. Create context with non-admin user

In [18]:
context=create_context(config_file='non_admin_config_file.env')

### 4.3. Create the FeatureStore object as the non-admin user to ingest feature values

In [20]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")

FeatureStore is ready to use.


## 5. Get data for the demo

### 5.1. Load the computers_train1 data.

In [24]:
load_example_data("kmeans", "computers_train1")
computers_train1 = DataFrame("computers_train1")
computers_train1



id,price,speed,hd,ram,screen
2345,1795.0,33.0,420.0,8.0,15.0
3487,1999.0,33.0,420.0,8.0,15.0
3956,2095.0,100.0,214.0,4.0,14.0
469,2599.0,50.0,405.0,8.0,14.0
4690,3440.0,33.0,1000.0,24.0,15.0
3018,2469.0,33.0,340.0,4.0,15.0
938,2799.0,50.0,320.0,8.0,15.0
4894,1739.0,33.0,212.0,4.0,17.0
5832,1490.0,66.0,340.0,4.0,15.0
6036,1445.0,100.0,528.0,4.0,14.0


### 5.2. Perform Data Transformation

**Transformation Details:**
In this step, we filter the computers dataset to include only records where the price is less than 2000. Then, we group the filtered data by the 'ram' column and compute three aggregated features for each RAM size:
- `total_price`: The sum of prices for all computers with the same RAM size.
- `max_price`: The maximum price among computers with the same RAM size.
- `count_price`: The total number of computers for each RAM size.

These features are essential for understanding pricing trends and inventory distribution by RAM configuration.

In [27]:
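# Keep only the computers priced below 2000
df1 = computers_train1[computers_train1['price'] < 2000]

# Aggregate price per RAM size: sum, maximum, and row count
df2 = df1.groupby('ram').assign(total_price=df1.price.sum(),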
                                max_price=df1.price.max(),
                                count_price=df1.price.count())
df2



ram,count_price,max_price,total_price
4.0,1238,1999.0,2040832.0
16.0,3,1999.0,5745.0
2.0,267,1995.0,440293.0
8.0,587,1999.0,1065001.0
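
As a local cross-check of the filter-then-aggregate logic, the sketch below repeats the same steps in plain Python on only the ten preview rows printed in section 5.1. It does not use teradataml or connect to Vantage, and the helper name `aggregate_by_ram` is hypothetical. Because it sees ten rows rather than the full table, its numbers are expected to differ from the output above.

```python
from collections import defaultdict

# The ten preview rows printed in section 5.1, reduced to (id, price, ram).
preview_rows = [
    (2345, 1795.0, 8.0),
    (3487, 1999.0, 8.0),
    (3956, 2095.0, 4.0),
    (469, 2599.0, 8.0),
    (4690, 3440.0, 24.0),
    (3018, 2469.0, 4.0),
    (938, 2799.0, 8.0),
    (4894, 1739.0, 4.0),
    (5832, 1490.0, 4.0),
    (6036, 1445.0, 4.0),
]


def aggregate_by_ram(rows, price_limit=2000.0):
    """Filter on price, then compute count, max, and sum of price per RAM size."""
    prices_by_ram = defaultdict(list)
    for _id, price, ram in rows:
        if price < price_limit:
            prices_by_ram[ram].append(price)
    return {
        ram: {
            "count_price": len(prices),
            "max_price": max(prices),
            "total_price": sum(prices),
        }
        for ram, prices in prices_by_ram.items()
    }


features = aggregate_by_ram(preview_rows)
for ram in sorted(features):
    print(ram, features[ram])
```

Only RAM sizes 4.0 and 8.0 survive the filter in this small sample, since every preview row with other RAM sizes is priced at 2000 or more.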


## 6. Store the data transformations

We store the transformation as a view. Because the view holds the transformation logic rather than a copy of the data, the same transformation steps apply even when the underlying data changes.

In [30]:
computers_train_df = df2.create_view(view_name="computers_train1_view")
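
The effect of storing a transformation as a view can be illustrated locally. The sketch below uses SQLite from the Python standard library as a stand-in for Vantage, so it is an analogy and not the teradataml mechanism; the table name `computers` and view name `computers_view` are hypothetical. It shows that a view re-evaluates the same filter and aggregation whenever the base table changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE computers (id INTEGER, price REAL, ram REAL)")
conn.executemany(
    "INSERT INTO computers VALUES (?, ?, ?)",
    [(2345, 1795.0, 8.0), (3487, 1999.0, 8.0), (4894, 1739.0, 4.0)],
)

# The view stores the query text, not the query result.
conn.execute(
    """
    CREATE VIEW computers_view AS
    SELECT ram,
           COUNT(price) AS count_price,
           MAX(price)   AS max_price,
           SUM(price)   AS total_price
    FROM computers
    WHERE price < 2000
    GROUP BY ram
    """
)

query = "SELECT * FROM computers_view ORDER BY ram"
before = conn.execute(query).fetchall()

# New data arrives in the base table; the view definition is untouched.
conn.execute("INSERT INTO computers VALUES (5832, 1490.0, 4.0)")
after = conn.execute(query).fetchall()

print(before)
print(after)
conn.close()
```

The second query picks up the new row for RAM size 4.0 without the view being redefined.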

## 7. Ingest the features 
- Store the feature values of 'count_price', 'max_price', 'total_price' features.
- Run the FeatureProcess

### 7.1. Create the FeatureProcess and run it

In [34]:
fp = FeatureProcess(repo="retail_computers_feature_store",
                    data_domain="computers_pricing",
                    object=computers_train_df,
                    entity='ram',
                    features=['count_price', 'max_price', 'total_price'])
fp.run()

Process 'b09fc5e4-8a24-11f0-a993-b0dcef8381ea' started.
Process 'b09fc5e4-8a24-11f0-a993-b0dcef8381ea' completed.


True

### 7.2. See the mind_map for Feature Store

We ingested three features—`count_price`, `max_price`, and `total_price`—from a single feature process. This demonstrates how multiple related features can be managed and tracked together within the feature store, maintaining their lineage to the originating process.

In [38]:
fs.mind_map()

## 8. Build the Dataset

### 8.1. Create DatasetCatalog object

In [41]:
dc = DatasetCatalog(repo="retail_computers_feature_store",
                    data_domain="computers_pricing")
dc

DatasetCatalog(repo=retail_computers_feature_store, data_domain=computers_pricing)

### 8.2. Build the dataset with ingested features

In [43]:
dataset_name = "kmeans_dataset"

# Build dataset with features and their versions (process_id)
selected_features = {"count_price": fp.process_id,
                     "max_price": fp.process_id,
                     "total_price": fp.process_id}

In [44]:
data = dc.build_dataset(entity="ram", 
                        selected_features=selected_features, 
                        view_name=dataset_name,
                        description="Dataset for KMeans test")
data



ram,count_price,max_price,total_price
2.0,267,1995.0,440293.0
8.0,587,1999.0,1065001.0
4.0,1238,1999.0,2040832.0
16.0,3,1999.0,5745.0
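
The `selected_features` mapping pairs each feature name with the process id that produced the wanted version. The notebook does not describe how `build_dataset` works internally, so the following is only a conceptual sketch in plain Python: the in-memory `feature_versions` store and the `assemble_dataset` helper are hypothetical, and the values are copied from the outputs above.

```python
# Process id reported by fp.run() in section 7.1.
PROCESS_ID = "b09fc5e4-8a24-11f0-a993-b0dcef8381ea"

# Hypothetical stand-in for versioned feature values:
# (feature_name, process_id) -> {entity_value: feature_value}
feature_versions = {
    ("count_price", PROCESS_ID): {2.0: 267, 4.0: 1238, 8.0: 587, 16.0: 3},
    ("max_price", PROCESS_ID): {2.0: 1995.0, 4.0: 1999.0, 8.0: 1999.0, 16.0: 1999.0},
    ("total_price", PROCESS_ID): {2.0: 440293.0, 4.0: 2040832.0,
                                  8.0: 1065001.0, 16.0: 5745.0},
}


def assemble_dataset(selected_features, store, entity="ram"):
    """Pick the requested version of each feature and join them on the entity."""
    columns = {name: store[(name, pid)] for name, pid in selected_features.items()}
    # Keep only entity values that are present for every selected feature.
    shared = sorted(set.intersection(*(set(col) for col in columns.values())))
    return [
        {entity: value, **{name: col[value] for name, col in columns.items()}}
        for value in shared
    ]


selected_features = {
    "count_price": PROCESS_ID,
    "max_price": PROCESS_ID,
    "total_price": PROCESS_ID,
}
rows = assemble_dataset(selected_features, feature_versions)
for row in rows:
    print(row)
```

Keying the store by both feature name and process id is what lets a dataset pin a specific version of each feature.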


### 8.3. See the mind_map for Feature Store

We ingested three features (`count_price`, `max_price`, and `total_price`) from a single feature process and then built a dataset from them. This demonstrates how related features and the datasets derived from them can be managed and tracked together within the feature store, maintaining their lineage to the originating process.

In [47]:
fs.mind_map()

## 9. Execute the KMeans analytics function on the built dataset

**KMeans Clustering Details:**
In this step, we apply the KMeans clustering algorithm to the engineered dataset. Because the dataset holds one row per RAM size, the algorithm groups RAM sizes (rather than individual computers) into three clusters based on their aggregated features: `count_price`, `max_price`, and `total_price`. This unsupervised learning technique helps identify natural groupings in the data, such as common pricing tiers or inventory segments, which can be used for targeted marketing, inventory management, or further analytics.

In [50]:
KMeans_out = KMeans(id_column="ram",
                    target_columns=['count_price', 'max_price', 'total_price'],
                    data=data,
                    num_clusters=3)

In [51]:
KMeans_out.result



td_clusterid_kmeans,count_price,max_price,total_price,td_size_kmeans,td_withinss_kmeans,ram,td_modelinfo_kmeans
,,,,,,,Number of Iterations : 2
,,,,,,,Between_SS : 1.86304398202450E+12
,,,,,,,Method for InitialCentroids : Random
,,,,,,,Number of Clusters : 3
0.0,267.0,1995.0,440293.0,1.0,0.0,,
,,,,,,,Converged : True
,,,,,,,Total_WithinSS : 4.76123282181000E+11
2.0,3.0,1999.0,5745.0,1.0,0.0,,
1.0,912.5,1999.0,1552916.5,2.0,476123282181.0,,
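
The within-cluster sum of squares in the result can be checked by hand. The sketch below is plain Python, not a teradataml call. It assumes that cluster 1 contains RAM sizes 4.0 and 8.0; the result does not list cluster members, so this membership is inferred from the fact that the mean of those two rows equals the reported centroid.

```python
# Feature rows from the dataset built in section 8.2, keyed by RAM size:
# (count_price, max_price, total_price)
dataset = {
    2.0: (267, 1995.0, 440293.0),
    4.0: (1238, 1999.0, 2040832.0),
    8.0: (587, 1999.0, 1065001.0),
    16.0: (3, 1999.0, 5745.0),
}


def centroid(points):
    """Component-wise mean of the points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    center = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, center))


# Assumed membership of cluster 1 (inferred, see the note above).
cluster_1 = [dataset[4.0], dataset[8.0]]

print(centroid(cluster_1))   # (912.5, 1999.0, 1552916.5)
print(within_ss(cluster_1))  # 476123282181.0
```

Both values agree with the cluster 1 row of the result. The other two clusters each hold a single row, so their within-cluster sum of squares is 0 and the reported `Total_WithinSS` equals the value for cluster 1.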


## 10. Explore DatasetCatalog 

### 10.1. Explore DatasetCatalog properties

The `data_domain` property shows the domain associated with the dataset catalog.

In [55]:
dc.data_domain

'computers_pricing'

### 10.2. Explore DatasetCatalog methods

#### 10.2.1. List the datasets

In [58]:
dc.list_datasets()



id,data_domain,name,entity_name,description,valid_start,valid_end
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,kmeans_dataset,ram,Dataset for KMeans test,2025-09-05 06:52:01.560000+00:,9999-12-31 23:59:59.999999+00:


#### 10.2.2. List the entities

In [60]:
dc.list_entities()



id,data_domain,name,entity_name,description
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,kmeans_dataset,ram,Dataset for KMeans test


#### 10.2.3. List the features

In [62]:
dc.list_features()



dataset_id,data_domain,feature_id,feature_name,feature_view
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,1,count_price,kmeans_dataset
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,2,max_price,kmeans_dataset
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,3,total_price,kmeans_dataset


#### 10.2.4. Get the dataset

In [78]:
dc.get_dataset('b556cbb1-0676-4521-acc4-0fe9df9acf62')

Dataset(repo=retail_computers_feature_store, id=b556cbb1-0676-4521-acc4-0fe9df9acf62, data_domain=computers_pricing)

#### 10.2.5. Archive the dataset

In [80]:
dc.archive_datasets('b556cbb1-0676-4521-acc4-0fe9df9acf62')

Dataset id(s) 'b556cbb1-0676-4521-acc4-0fe9df9acf62' is/are archived from the dataset catalog.


True

In [81]:
dc.list_datasets()



id,data_domain,name,entity_name,description,valid_start,valid_end
b556cbb1-0676-4521-acc4-0fe9df9acf62,computers_pricing,kmeans_dataset,ram,Dataset for KMeans test,2025-09-05 06:52:01.560000+00:,2025-09-05 06:53:13.470000+00:
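
In the listing above, the archived dataset is still present, but its `valid_end` has changed from the open-ended `9999-12-31` value to the time of archiving. One way to read such a validity window is sketched below in plain Python. This is an interpretation, not documented DatasetCatalog behavior: the `is_active` helper is hypothetical, and the timestamps are written as naive datetimes because the time zone offset is truncated in the output.

```python
from datetime import datetime

# Open-ended validity marker, as shown before archiving.
OPEN_ENDED = datetime(9999, 12, 31, 23, 59, 59, 999999)


def is_active(record, as_of):
    """True if as_of falls inside the half-open window [valid_start, valid_end)."""
    return record["valid_start"] <= as_of < record["valid_end"]


record = {
    "name": "kmeans_dataset",
    "valid_start": datetime(2025, 9, 5, 6, 52, 1, 560000),
    "valid_end": OPEN_ENDED,
}

# Archiving closes the window instead of removing the row.
archived = dict(record, valid_end=datetime(2025, 9, 5, 6, 53, 13, 470000))

check_time = datetime(2025, 9, 5, 7, 0, 0)
print(is_active(record, check_time))    # True
print(is_active(archived, check_time))  # False
```

Under this reading, an archived dataset keeps its history and remains visible in the catalog, while deletion (next step) removes the row entirely.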


#### 10.2.6. Delete the dataset

In [88]:
dc.delete_datasets('b556cbb1-0676-4521-acc4-0fe9df9acf62')

Dataset id(s) 'b556cbb1-0676-4521-acc4-0fe9df9acf62' is/are deleted from the dataset catalog.


True

In [89]:
dc.list_datasets()



id,data_domain,name,entity_name,description,valid_start,valid_end


## 11. Cleanup

### 11.1. Drop View

In [92]:
db_drop_view("computers_train1_view")

True

### 11.2. Drop Table

In [94]:
db_drop_table("computers_train1")

True

### 11.3. Remove the Context

In [97]:
remove_context()

True

### 11.4. Delete the Feature Store

In [108]:
context=create_context(config_file='admin_config_file.env')

**Note:** `fs.delete()` removes the feature store and asks for confirmation first. It also drops the repo database if all objects in it are removed.

In [102]:
fs = FeatureStore(repo="retail_computers_feature_store", data_domain="computers_pricing")
fs.delete()

FeatureStore is ready to use.


The function removes Feature Store and drops the corresponding repo also. Are you sure you want to proceed? (Y/N):  y


True

In [110]:
remove_context()

True