<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Feature Store and Feature Engineering using tdfs4ds
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Successful AI/ML implementations face three main challenges:</p>
<li style = 'font-size:16px;font-family:Arial'><b>The Data Problem:</b> Quality data and feature engineering consume 80% of the implementation time. Even when different use cases share the same source data and features, organizations often handle data preparation separately.</li>
<li style = 'font-size:16px;font-family:Arial'><b>The Scale Problem:</b> Real-world use cases often require multiple models. In production, these models require fresh features engineered in the same way as during training. Ensuring the auditability of these features behind model decisions is crucial.</li>
<li style = 'font-size:16px;font-family:Arial'><b>The Deployment Problem:</b> Transitioning prototypes to production, especially operationalizing data prep pipelines, is often problematic.</li></p>

<p style = 'font-size:16px;font-family:Arial'>Addressing these challenges requires strategic planning, skilled talent, and integration with existing systems. Organizations with a history in Data Management recognize the benefits of reusable Data Products, making Enterprise Feature Stores a valuable investment.</p>

<p style = 'font-size:16px;font-family:Arial'>A Feature Store is a curated repository of pre-calculated features, simplifying the journey from data to actionable insights. An Enterprise Feature Store extends across domains/teams, incorporating a Governance Framework for predictable feature delivery. </p>
    
<p style = 'font-size:16px;font-family:Arial'><b>While most features are reusable, some need model-specific calculations before integration into a unified dataset.</b></p>
    
<p style = 'font-size:16px;font-family:Arial'>The key difference between a Feature Store (FS) and an Enterprise Feature Store (EFS) is scope: an EFS spans multiple domains and teams and adds a Governance Framework. The framework gives assurance that features are delivered under predictable SLAs, and it defines the operating model for how the EFS is used across teams and domains and how the feature lifecycle is managed. Although most features are reusable, a small share must still be calculated per model (e.g., scaled variables, principal components) and then combined with the pre-calculated features into a single data set (ADS). The figure below describes this coexistence of model-specific ADSs and the model-independent EFS.</p>

<img src='images/EFS.png'>

<p style = 'font-size:18px;font-family:Arial'><b>Business Values</b></p>

<li style = 'font-size:16px;font-family:Arial'>Rapid model creation and deployment through enterprise feature reuse.</li>
<li style = 'font-size:16px;font-family:Arial'>Flexible creation and usage of new features without extensive engineering support.</li>
<li style = 'font-size:16px;font-family:Arial'>Consistent definitions ensure accuracy and quick deployment.</li>
<li style = 'font-size:16px;font-family:Arial'>Collaboration and sharing of features among teams.</li>
<li style = 'font-size:16px;font-family:Arial'>Maintained feature lifecycle for compliance and auditability.</li>
</p>

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial'>There are several reasons why an EFS is a natural fit for Teradata Vantage:</p>
<li style = 'font-size:16px;font-family:Arial'>Utilizes Teradata Vantage with its powerful Analytical Library and SQL Engine.</li>
<li style = 'font-size:16px;font-family:Arial'>Primary Index enables efficient single-row access for online feature use.</li>
<li style = 'font-size:16px;font-family:Arial'>Single platform for both online and offline feature stores.</li>
<li style = 'font-size:16px;font-family:Arial'>Macros reduce parsing overhead from API access.</li>
<li style = 'font-size:16px;font-family:Arial'>R and Python code execution within SQL Engine.</li>
<li style = 'font-size:16px;font-family:Arial'>Bi-temporal querying capability.</li>
<li style = 'font-size:16px;font-family:Arial'>Scalable MPP power for feature computation.</li>
<li style = 'font-size:16px;font-family:Arial'>Industry-specific Logical Data Model as a feature source.</li>
<li style = 'font-size:16px;font-family:Arial'>Connectivity to Object Storage via NOS for feature data sourcing.</li>
<li style = 'font-size:16px;font-family:Arial'>Query Grid facilitates access to multiple data sources.</li>
<li style = 'font-size:16px;font-family:Arial'>Close-to-real-time feature delivery using Query Services and Teradata Intelligent Memory.</li>
<li style = 'font-size:16px;font-family:Arial'>Workload management prioritizes tasks effectively.</li></p>
<p style = 'font-size:16px;font-family:Arial'>The unique massively-parallel architecture of Teradata Vantage allows users to prepare data, train, evaluate, and deploy models at unprecedented scale.</p>


<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage, import Python packages, and explore the dataset</b></p>


<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install tdfs4ds --upgrade

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>Please execute the above pip install to get the latest version of the required library. Be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
#import libraries
import warnings
warnings.filterwarnings("ignore")

from teradataml import *
import pandas as pd
import matplotlib.pyplot as plt
import json
from getpass import getpass
display.max_rows=5

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_Feature_Engineering_using_Feature_Store.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We will create tables using the below call to the procedure to get the data for this demo.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_FeatureEngg_local');" 
# takes about 1 minute 30 seconds, estimated space: 4 MB

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Setup a Feature Store</b></p>

<p style = 'font-size:16px;font-family:Arial'>We can now set up the feature store using the <code>tdfs4ds</code> library.</p>

<p style = 'font-size:16px;font-family:Arial'>The first task is to deploy a feature store in the Data Lab. However, before proceeding with the deployment, we will perform a cleanup to ensure the environment is ready for a fresh start, especially if you wish to rerun the demo from scratch.
</p>
<p style = 'font-size:16px;font-family:Arial'>Once the cleanup is complete, we will focus on the actual deployment of the feature store using the TDFS4DS package.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>2.1 Clean up</b></p>

<p style = 'font-size:16px;font-family:Arial'>Before deploying a fresh feature store with TDFS4DS, we do some cleanup. This will drop the objects created by the demo in case you have already run it.</p>
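<p style = 'font-size:16px;font-family:Arial'>The cleanup cells in this section all follow one pattern: list the objects, filter them by name, and issue a <code>DROP</code> for each match. As a minimal sketch of that pattern, the helper below only builds the statements and does not execute anything. <code>build_drop_statements</code> and the example object names are hypothetical and are not part of <code>teradataml</code> or <code>tdfs4ds</code>.</p>

```python
# Hypothetical helper (not part of teradataml or tdfs4ds): build DROP statements
# for every object whose name matches a predicate, without executing them.
from typing import Callable, Iterable, List


def build_drop_statements(names: Iterable[str], kind: str,
                          match: Callable[[str], bool],
                          schema: str = "DEMO_USER") -> List[str]:
    """Return one DROP statement per matching object name."""
    if kind not in ("VIEW", "TABLE"):
        raise ValueError("kind must be 'VIEW' or 'TABLE'")
    return [f"DROP {kind} {schema}.{name}" for name in names if match(name)]


# Made-up object names, standing in for db_list_tables().TableName
example_names = ["FS_V_FEATURES", "FS_T_FEATURES", "FS_PROCESS_CATALOG", "transactions"]

view_drops = build_drop_statements(example_names, "VIEW", lambda n: n.startswith("FS_V"))
table_drops = build_drop_statements(example_names, "TABLE", lambda n: n.startswith("FS_T"))
print(view_drops)
print(table_drops)
```

<p style = 'font-size:16px;font-family:Arial'>Building the statements first makes it possible to review what would be dropped before passing each string to <code>execute_sql</code>.</p>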

In [None]:
list_of_tables = db_list_tables()
list_of_tables

<p style = 'font-size:16px;font-family:Arial'>Drop Views</p>

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_V')]

<p style = 'font-size:16px;font-family:Arial'>Drop Tables</p>

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_T')]

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_')]

In [None]:
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['FEAT_ENG_CAT','FEAT_ENG_CUST']]

In [None]:
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['temp','tdfs__fgjnojnsmdoignmosnig']]

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if '_CAT' in t or '_CUST' in t]

In [None]:
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['BUSINESS_FILTER','BUSINESS_DATE_SEQ','BUSINESS_DATE','HYBRID_BUSINESS_FILTER']]

In [None]:
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if 'HIDDEN' in t ]

In [None]:
list_of_tables = db_list_tables()
list_of_tables

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>2.2 Deploying a feature store in a datalab</b></p>

<p style = 'font-size:16px;font-family:Arial'>It’s time to deploy a feature store! The process is straightforward. We will start by importing the <code>tdfs4ds</code> package and checking its version. This step is informative as it provides details about the detected database and its working environment. Here are the steps we will follow:</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Import the Package:</strong> Import the <code>tdfs4ds</code> package.</li>
    <li><strong>Check the Version:</strong> Verify the package version and observe the output for insights into the database configuration.</li>
    <li><strong>Deploy the Feature Store:</strong> Use a single command to set up the feature store. Specify the database where it should be deployed. Note that the user must have the necessary permissions in the database to complete this operation.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial'>During the deployment, several tables will be created as part of the feature store:</p>
    <li style = 'font-size:16px;font-family:Arial'><strong>Feature Catalog:</strong> A table that organizes and manages the features.</li>
    <li style = 'font-size:16px;font-family:Arial'><strong>Process Catalog:</strong> A table that tracks feature engineering workflows and processes.</li>
    <li style = 'font-size:16px;font-family:Arial'><strong>Optional Tables:</strong> Additional tables for managing data distribution parameters, filter management, and other advanced capabilities (details on these will be explained in future notebooks).</li>
</p>

<p style = 'font-size:16px;font-family:Arial'>Once the setup is complete, you can inspect the definitions of the created tables. Notably, the feature store relies on temporal table capabilities, enabling native <strong>time travel</strong> functionality for features. This powerful feature will be discussed in detail in the notebook.</p>


In [None]:
import tdfs4ds
tdfs4ds.__version__

In [None]:
tdfs4ds.setup(database='DEMO_USER')

In [None]:
FS_Objects = [t for t in db_list_tables().sort_values('TableName').TableName.values if t not in list_of_tables.TableName.values]
FS_Objects

In [None]:
for t in FS_Objects:
    try:
        print(tdfs4ds.utils.lineage.get_ddl(view_name=t,schema_name='DEMO_USER', object_type='table'))
        print()
    except Exception:
        # Skip objects whose DDL cannot be retrieved with object_type='table'
        pass

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>3. Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we will focus on the feature engineering process using the dataset available in the <strong>DEMO_USER</strong> database. For this, we will use the <code>teradataml</code> package, which enables Data Scientists to leverage the power of the Vantage system using Python.</p>

<p style = 'font-size:16px;font-family:Arial'>We will demonstrate two feature engineering processes:</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>
        A feature computed for each <strong>CustomerID</strong>, describing specific attributes of the customer.
    </li>
    <li>
        A set of features computed for each <strong>Category</strong>, summarizing the transactions in that category.
    </li>
</ol>

<p style = 'font-size:16px;font-family:Arial'>The dataset we will use is called <code>transactions</code>, which is a table with the following five columns:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Transaction Date:</strong> The date when the transaction occurred.</li>
    <li><strong>Transaction Amount:</strong> The monetary value of the transaction.</li>
    <li><strong>Customer ID:</strong> The identifier for the customer making the transaction.</li>
    <li><strong>Merchant ID:</strong> The identifier for the merchant or retailer involved in the transaction.</li>
    <li><strong>Category:</strong> A tag classifying the transaction into a specific category.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>We will do feature engineering without leveraging a feature store. It demonstrates how a Data Scientist might manually implement a feature engineering process using Vantage and the <code>teradataml</code> package. In the subsequent steps, we will revisit these processes and show how to integrate them with the feature store for a more streamlined and efficient workflow.</p>
<p style = 'font-size:16px;font-family:Arial'>Here, we define <code>source_database</code> and create a teradataml DataFrame from the <code>transactions</code> table in Vantage.</p>

In [None]:
# pd_df = pd.read_csv('sample_dataset.csv')

In [None]:
# copy_to_sql(pd_df, table_name='transactions', if_exists='replace')

In [None]:
source_database = 'Demo_FeatureEngg' #---'DB_SOURCE'
df_transactions = DataFrame(in_schema(source_database,'transactions'))
df_transactions

<p style = 'font-size:16px;font-family:Arial'>In this feature engineering process, we will use the <code>teradataml</code> package to create Teradata DataFrames that implement the required computations. The two main feature engineering tasks are:</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>
        <strong>Statistics on Customer:</strong> For each <code>customer ID</code>, we will compute:
        <ul>
            <li>The sum of all transaction amounts.</li>
            <li>The average amount per transaction.</li>
            <li>The total number of transactions.</li>
            <li>The number of days since the last transaction.</li>
        </ul>
    </li>
    <li>
        <strong>Spending Category Distribution:</strong> For each transaction category, we will compute:
        <ul>
            <li>The sum of transaction amounts.</li>
            <li>The mean, standard deviation, minimum, maximum, median, and first and third quartiles of transaction amounts.</li>
        </ul>
    </li>
</ol>
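<p style = 'font-size:16px;font-family:Arial'>To make the customer statistics listed above concrete, here is a plain-Python sketch on a few made-up rows. It only illustrates what each statistic means and is not how <code>teradataml</code> computes them (in the notebook the work is pushed down to Vantage as SQL). The sample rows and the fixed <code>as_of</code> date are assumptions for illustration.</p>

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up rows: (CustomerID, Transaction_Amount, Date_transaction)
rows = [
    (1, 10.0, date(2024, 1, 5)),
    (1, 30.0, date(2024, 1, 20)),
    (2, 50.0, date(2024, 1, 10)),
]
as_of = date(2024, 1, 31)  # fixed reference date; the notebook uses CURRENT_DATE

amounts = defaultdict(list)  # CustomerID -> list of amounts
last_seen = {}               # CustomerID -> most recent transaction date
for customer, amount, day in rows:
    amounts[customer].append(amount)
    last_seen[customer] = max(day, last_seen.get(customer, day))

customer_features = {
    customer: {
        "sum_Transaction_Amount": sum(values),
        "mean_Transaction_Amount": mean(values),
        "count_Transaction_Amount": len(values),
        "nb_days_since_last_transactions": (as_of - last_seen[customer]).days,
    }
    for customer, values in amounts.items()
}
print(customer_features)
```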

<p style = 'font-size:16px;font-family:Arial'>These computations will result in two Teradata DataFrames:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><code>df_eng_feat_cust</code>: Features computed per customer.</li>
    <li><code>df_eng_feat_cat</code>: Features computed for spending category distribution.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Note that these DataFrames only implement the processing logic and do not generate data until explicitly stored or exported. When displaying the content of these DataFrames, only a sample of the results will be shown. To generate the actual data, you would need to either:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Store the results in another table within the database.</li>
    <li>Export the results to a <code>pandas</code> DataFrame, files, ...</li>
</ul>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.1 Statistics on customers</b></p>

In [None]:
from sqlalchemy import literal_column
df_eng_feat_cust = df_transactions.groupby('CustomerID').agg({'Transaction_Amount' : ['sum','mean','count'], 'Date_transaction':['max']})
df_eng_feat_cust

In [None]:
df_eng_feat_cust.tdtypes

In [None]:
df_eng_feat_cust = df_eng_feat_cust.assign(nb_days_since_last_transactions = literal_column('INTERVAL(PERIOD(max_Date_transaction, CURRENT_DATE)) DAY(4)',type_= INTEGER))

In [None]:
df_eng_feat_cust = df_eng_feat_cust[['CustomerID','sum_Transaction_Amount','mean_Transaction_Amount','count_Transaction_Amount','nb_days_since_last_transactions']]
df_eng_feat_cust

In [None]:
df_transactions.shape

In [None]:
df_eng_feat_cust.shape

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.2 Spending Category Distribution</b></p>

In [None]:
df_eng_feat_cat = df_transactions.groupby('Category').agg({'Transaction_Amount':['sum','mean','std','min','max','median']})
df_eng_feat_cat

In [None]:
df_eng_feat_cat = df_eng_feat_cat.join(
    df_transactions[['Category','Transaction_Amount']].groupby("Category").percentile(0.25),
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_1_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat = df_eng_feat_cat.join(
    df_transactions[['Category','Transaction_Amount']].groupby("Category").percentile(0.75),
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_3_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Register your feature engineering in the feature store</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we will demonstrate how the feature engineering tasks can be registered in a feature store. We will use the feature store deployed earlier in this notebook.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Steps for Feature Store Integration:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Convert DataFrames to Views:</strong> The feature engineering logic will be stored as database views. These views persist beyond the session and allow for the computation of results whenever they are queried.</li>
    <li><strong>Dynamic Updates:</strong> As the underlying tables are updated (via ingestion processes), querying the views again computes the updated results without any need to reprocess files.</li>
    <li><strong>Register Processes in the Feature Store:</strong> Once the views are created, they can be registered in the feature store. This involves:
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Uploading features by specifying which columns represent entities (e.g., <code>customer ID</code>) and which columns represent features (e.g., transaction sums, averages, counts, etc.).</li>
            <li>Adding metadata to describe the process, such as project details, authorship, and other relevant information.</li>
        </ul>
    </li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><b>Additional Concepts Introduced:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Data Domains:</strong> A namespace for features to prevent name collisions across projects or teams. This is a best practice to ensure clarity and manageability.</li>
    <li><strong>Entity-Feature Mapping:</strong> Identifying entity columns (e.g., <code>customer ID</code>) and feature columns (e.g., transaction statistics) in the output to provide clear structure to the data.</li>
</ul>
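<p style = 'font-size:16px;font-family:Arial'>The role of a data domain as a namespace can be sketched with a small in-memory catalog keyed by the pair (data domain, feature name). This is an illustration only: the real feature catalog is a table managed by <code>tdfs4ds</code>, and <code>register_feature</code>, the sequential IDs, and the <code>"Marketing"</code> domain are hypothetical.</p>

```python
# Hypothetical in-memory catalog; the real feature catalog is a table in Vantage.
from itertools import count

_next_id = count(1)
catalog = {}  # (data_domain, feature_name) -> feature_id


def register_feature(data_domain: str, feature_name: str) -> int:
    """Return the existing ID for this (domain, name) pair, or assign a new one."""
    key = (data_domain, feature_name)
    if key not in catalog:
        catalog[key] = next(_next_id)
    return catalog[key]


id_a = register_feature("Customer Transaction Analytics", "sum_Transaction_Amount")
id_b = register_feature("Marketing", "sum_Transaction_Amount")  # same name, other domain
id_c = register_feature("Customer Transaction Analytics", "sum_Transaction_Amount")
print(id_a, id_b, id_c)
```

<p style = 'font-size:16px;font-family:Arial'>The same feature name registered under two domains gets two distinct IDs, while registering it again in the same domain returns the existing ID.</p>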

<p style = 'font-size:16px;font-family:Arial'>Here, we will see how to seamlessly integrate feature engineering processes with the feature store and ensure their usability across various projects and teams. This is achieved with a simple command, <code>upload_features</code>, where you define the process, specify entity and feature columns, and optionally include metadata for further documentation and organization.</p>


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>4.1 Process to compute "Statistics on customers"</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first step is to convert the feature engineering DataFrames into feature engineering processes by creating views. For this, we will use the <code>crystallize_view</code> function.</p>

<p style = 'font-size:16px;font-family:Arial'>About the <code>crystallize_view</code> Function:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Purpose:</strong> Converts a DataFrame into a view that persists the feature engineering logic in the database.</li>
    <li><strong>Inputs:</strong>
        <ul>
            <li>The DataFrame to be crystallized.</li>
            <li>The name of the view to be created.</li>
            <li>The database where the view should be created.</li>
        </ul>
    </li>
    <li><strong>Output:</strong> A Teradata DataFrame connected directly to the created view, allowing for seamless interaction with the crystallized process.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>This function provides an efficient way to store feature engineering processes in the database, ensuring they are reusable and persist across sessions.</p>
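<p style = 'font-size:16px;font-family:Arial'>Why a view is a good container for feature engineering logic can be shown with a small, self-contained sketch. It uses Python's built-in <code>sqlite3</code> module as a stand-in for Vantage, which is an assumption made purely so the example runs anywhere; it does not reflect how <code>crystallize_view</code> is implemented. The view stores only the query, so new rows in the source table show up the next time the view is queried.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (CustomerID INTEGER, Transaction_Amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 10.0), (1, 30.0), (2, 50.0)])

# The view stores only the query text, not the results.
con.execute("""
    CREATE VIEW FEAT_ENG_CUST AS
    SELECT CustomerID,
           SUM(Transaction_Amount)   AS sum_Transaction_Amount,
           COUNT(Transaction_Amount) AS count_Transaction_Amount
    FROM transactions
    GROUP BY CustomerID
""")

query = "SELECT * FROM FEAT_ENG_CUST ORDER BY CustomerID"
before = con.execute(query).fetchall()

# New data arrives in the source table; the view reflects it on the next query.
con.execute("INSERT INTO transactions VALUES (2, 25.0)")
after = con.execute(query).fetchall()

print(before)
print(after)
con.close()
```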

In [None]:
from tdfs4ds.utils.lineage import crystallize_view
df_eng_feat_cust_proc = crystallize_view(df_eng_feat_cust, view_name='FEAT_ENG_CUST', schema_name='DEMO_USER')
df_eng_feat_cust_proc

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>4.2 Process to compute "Spending Category Distribution"</b></p>

In [None]:
df_eng_feat_cat_proc = crystallize_view(df_eng_feat_cat, view_name='FEAT_ENG_CAT', schema_name='DEMO_USER')
df_eng_feat_cat_proc

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Connecting Feature Engineering Processes in the Feature Store</b></p>

<p style = 'font-size:16px;font-family:Arial'>Once the feature engineering processes have been created as permanent views, it’s time to bridge the gap and connect them to the feature store. Here’s the step-by-step process:</p>

<p style = 'font-size:16px;font-family:Arial'><b>Connecting to the Feature Store</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Import the tdfs4ds Package:</strong> Use the <code>import</code> statement to load the <code>tdfs4ds</code> package.</li>
    <li><strong>Establish a Connection:</strong> Use the <code>connect</code> method to connect to the feature store.</li>
    <li><strong>Define a Data Domain:</strong> A data domain acts as a namespace to organize features and processes. By default, the database name is used, but it is recommended to choose a more descriptive name, such as the project, team, or use case, for better clarity and organization.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><b>Registering a Feature Engineering Process</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Use the <code>upload_features</code> function to register the process in the feature store. Specify:
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Which columns represent the <strong>entity</strong> (e.g., <code>customer ID</code>).</li>
            <li>Which columns represent the <strong>features</strong> (e.g., transaction statistics).</li>
        </ul>
    </li>
    <li>Upon execution, the function:
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Analyzes the process and registers the features and entities in the <strong>Feature Catalog</strong>.</li>
            <li>Registers the process in the <strong>Process Catalog</strong>, generating a unique <code>process ID</code>.</li>
            <li>Computes the results of the process and ingests them into the feature store, timestamping them with the current time to support time travel queries later.</li>
        </ul>
    </li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><b>Rerunning and Scheduling Processes</b></p>
<p style = 'font-size:16px;font-family:Arial'>The generated <code>process_id</code> allows you to rerun or schedule the refresh of a specific process without needing to reference the original code. This simplifies maintenance and ensures reproducibility.</p>
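<p style = 'font-size:16px;font-family:Arial'>The idea of rerunning a process from its ID alone can be sketched as a lookup from ID to view name. Everything below is a hypothetical stand-in: <code>register_process</code> and <code>refresh_query</code> are not <code>tdfs4ds</code> functions, and generating the ID with <code>uuid4</code> is an assumption made for illustration.</p>

```python
import uuid

process_catalog = {}  # process_id -> view name (hypothetical in-memory stand-in)


def register_process(view_name: str) -> str:
    """Store the view name under a newly generated process ID and return the ID."""
    process_id = str(uuid.uuid4())
    process_catalog[process_id] = view_name
    return process_id


def refresh_query(process_id: str) -> str:
    """Build the query a scheduler could run to recompute the process output."""
    return f"SELECT * FROM {process_catalog[process_id]}"


pid = register_process("DEMO_USER.FEAT_ENG_CUST")
print(pid, "->", refresh_query(pid))
```

<p style = 'font-size:16px;font-family:Arial'>Because the ID resolves to a stored view, a scheduler needs only the ID and never the Python code that originally defined the DataFrame.</p>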

<p style = 'font-size:16px;font-family:Arial'><b>Processing Additional Features</b></p>
<p style = 'font-size:16px;font-family:Arial'>The same steps can be followed for other feature engineering processes. Each process can be registered and ingested into the feature store using the same workflow, ensuring consistency and efficiency.</p>


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.1 Connecting to the feature store</b></p>

In [None]:
import tdfs4ds
tdfs4ds.connect(database='DEMO_USER')

In [None]:
# Let's define a data domain for this use case
tdfs4ds.DATA_DOMAIN = "Customer Transaction Analytics"

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.2 Registering the Statistics on customers process</b></p>

In [None]:
# let's define the entity and the features in the outputs of the process
entity   = 'CustomerID'
features = ['sum_Transaction_Amount','mean_Transaction_Amount','count_Transaction_Amount','nb_days_since_last_transactions']

In [None]:
from tdfs4ds import upload_features
tdfs4ds.DEBUG_MODE = False

In [None]:
upload_features(
    df_eng_feat_cust_proc, # <-- the teradata dataframe pointing to the process view
    entity_id     = entity,
    feature_names = features,
    metadata      = {'project' : 'customer transactions'} # <-- some informative metadata
)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.3 Registering the "Spending Category Distribution"</b></p>

In [None]:
# let's define the entity and the features in the outputs of the process
entity   = 'Category'
features = ['sum_Transaction_Amount','mean_Transaction_Amount','std_Transaction_Amount','min_Transaction_Amount','max_Transaction_Amount','median_Transaction_Amount','quartile_1_Transaction_Amount','quartile_3_Transaction_Amount']

In [None]:
from tdfs4ds import upload_features
tdfs4ds.DEBUG_MODE = False

In [None]:
upload_features(
    df_eng_feat_cat_proc, # <-- the teradata dataframe pointing to the process view
    entity_id     = entity,
    feature_names = features,
    metadata      = {'project' : 'customer transactions'} # <-- some informative metadata
)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Inspecting the feature store</b></p>

<p style = 'font-size:16px;font-family:Arial'>After the first upload, you can inspect the contents of the feature store. The feature store maintains two catalogs: a feature catalog and a process catalog.</p>

<p style = 'font-size:16px;font-family:Arial'>This organization ensures that all features and processes are well-documented, accessible, and manageable within the feature store.</p>


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.1 Feature catalog </b></p>

<p style = 'font-size:16px;font-family:Arial'><strong>Feature Catalog</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Provides a list of features with unique <strong>Feature IDs</strong> assigned to each feature name.</li>
    <li>Includes information about the location of each feature, specifying the feature table and the feature database where it is stored.</li>
    <li>Offers views connected to the feature tables, simplifying multiple and concurrent access to the feature store.</li>
    <li>Displays the <strong>entity name</strong> and the <strong>data domain</strong> the features belong to.</li>
    <li>Includes temporal fields such as <strong>valid start</strong> and <strong>valid end</strong>, enabling native support for time travel functionality.</li>
</ul>
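<p style = 'font-size:16px;font-family:Arial'>How <strong>valid start</strong> and <strong>valid end</strong> enable time travel can be illustrated with a small as-of lookup over validity intervals. This is a plain-Python sketch on made-up history rows, under the assumption that intervals are half-open (start included, end excluded); in Vantage the equivalent lookup is handled by the temporal table support rather than by user code.</p>

```python
from datetime import datetime

# Made-up history for one entity and one feature: (value, valid_start, valid_end).
# valid_end = datetime.max marks the row that is currently valid.
history = [
    (40.0, datetime(2024, 1, 1), datetime(2024, 2, 1)),
    (75.0, datetime(2024, 2, 1), datetime.max),
]


def value_as_of(rows, when):
    """Return the value whose interval [valid_start, valid_end) contains `when`."""
    for value, valid_start, valid_end in rows:
        if valid_start <= when < valid_end:
            return value
    return None  # the feature had no value yet at that time


print(value_as_of(history, datetime(2024, 1, 15)))
print(value_as_of(history, datetime(2024, 3, 1)))
print(value_as_of(history, datetime(2023, 12, 1)))
```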

In [None]:
tdfs4ds.feature_catalog()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.2 Process catalog</b></p>

<p style = 'font-size:16px;font-family:Arial'><strong>Process Catalog</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Lists the names of the views created to describe the feature engineering processes.</li>
    <li>Displays the <strong>Process IDs</strong> generated for each process.</li>
    <li>Indicates the features and entities associated with each process (e.g., <strong>Feature ID</strong> and <strong>Entity ID</strong>).</li>
    <li>Contains additional parameters related to the processes (details of these parameters will be explained in future notebooks).</li>
</ul>

In [None]:
tdfs4ds.process_catalog()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Building Datasets (snapshot) from the Feature Store</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now we focus on building datasets from the feature store. As a Data Scientist, we will construct a dataset using features previously ingested into the feature store.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Steps to Build a Dataset</strong></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Retrieve Entities:</strong> 
        <p>The first step is to list the available entities in the feature store. This is scoped to a specific <strong>data domain</strong> to avoid confusion between features with similar names but different meanings, as they may belong to different teams, departments, or use cases. Remember, the data domain serves as the namespace for features and entities.</p>
    </li>
    <li><strong>Select an Entity:</strong>
        <p>For this demonstration, we will select <code>customer ID</code> as the entity.</p>
    </li>
    <li><strong>Retrieve Features for the Entity:</strong> 
        <p>Next, we will query the feature catalog to list features available for the selected entity. Since features may be computed by different processes, we will also retrieve and specify the <strong>process ID</strong> associated with each feature.</p>
    </li>
    <li><strong>Define the Dataset:</strong> 
        <p>We will create a dataset by mapping each feature to its corresponding <code>process ID</code>. This will be done using a dictionary, which acts as a structured representation of the features and processes.</p>
    </li>
    <li><strong>Build the Dataset:</strong> 
        <p>Using the <code>build_dataset</code> function, we will generate a denormalized view of the feature store. The function takes the following inputs:</p>
            <ul style = 'font-size:16px;font-family:Arial'>
                <li>The entity columns to include in the dataset (e.g., <code>customer ID</code>).</li>
            <li>A dictionary where the keys are feature names, and the values are the corresponding feature versions (process IDs).</li>
            <li>The name of the view to be created for the dataset.</li>
            <li>The database where the view will be stored (optional).</li>
            <li>An optional comment to document the dataset view, useful for database administrators or IT personnel.</li>
        </ul>
    </li>
</ol>
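<p style = 'font-size:16px;font-family:Arial'>Steps 3 and 4 can be pictured with a small, database-free sketch. The catalog rows, the process IDs, and the helper <code>select_feature_versions</code> are all hypothetical and are not part of <code>tdfs4ds</code>; the sketch only shows the idea of mapping each feature to exactly one process ID and failing loudly when a feature has several versions and none was chosen.</p>

```python
# Hypothetical catalog rows: (feature_name, process_id). The IDs are made up.
catalog_rows = [
    ("sum_Transaction_Amount", "proc-A"),
    ("mean_Transaction_Amount", "proc-A"),
    ("count_Transaction_Amount", "proc-A"),
    ("count_Transaction_Amount", "proc-B"),  # same feature, second version
]


def select_feature_versions(rows, wanted, preferred=None):
    """Map each wanted feature to one process ID, failing loudly on ambiguity."""
    preferred = preferred or {}
    versions = {}
    for name, process_id in rows:
        versions.setdefault(name, []).append(process_id)

    selection = {}
    for name in wanted:
        candidates = versions.get(name, [])
        if not candidates:
            raise KeyError(f"feature not found: {name}")
        if name in preferred:
            if preferred[name] not in candidates:
                raise ValueError(f"{name} is not computed by {preferred[name]}")
            selection[name] = preferred[name]
        elif len(candidates) == 1:
            selection[name] = candidates[0]
        else:
            raise ValueError(f"{name} has several versions: {candidates}")
    return selection


feature_selection = select_feature_versions(
    catalog_rows,
    wanted=["sum_Transaction_Amount", "count_Transaction_Amount"],
    preferred={"count_Transaction_Amount": "proc-B"},
)
print(feature_selection)
```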

<p style = 'font-size:16px;font-family:Arial'><strong>Output of the <code>build_dataset</code> Function</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Creates a view in the database that represents the dataset.</li>
    <li>Returns a Teradata DataFrame connected directly to the created view.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>The created dataset is efficient and fast to query because it leverages the indexing capabilities of Vantage (hash indexing, data partitioning, statistics collection). This makes it suitable for feeding machine learning models at high speed or for adding another layer of feature engineering. For example, you can aggregate the dataset by <code>entity ID</code> and validate that each entity has a single value per feature, which is critical for consistency.</p>
            

<p style = 'font-size:16px;font-family:Arial'><strong>Dataset Patterns</strong></p>
<p style = 'font-size:16px;font-family:Arial'>The <code>tdfs4ds</code> package supports two patterns for datasets:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Snapshot:</strong> Retrieves the current value of features at the present time. This is the focus of this notebook.</li>
    <li><strong>Time Series:</strong> Allows retrieval of feature values over time, leveraging the temporal capabilities of Teradata Vantage. This pattern will be covered in a future demonstration.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Key Details</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>The dataset created is a <strong>view</strong>, meaning no new data is generated. It provides a current view of the features based on the valid timestamp, ensuring up-to-date results.</li>
    <li>If features are updated or refreshed through scheduled processes, querying the dataset will automatically retrieve the latest values without additional effort.</li>
    <li>The temporal capabilities of Teradata Vantage ensure seamless management of feature versions over time.</li>
</ul>
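<p style = 'font-size:16px;font-family:Arial'>To make the role of the valid timestamp concrete, here is a minimal, database-free sketch of how a snapshot can be derived from temporal rows. The row layout (value plus a validity period, with <code>None</code> as an open end) is an illustrative assumption, not the actual <code>tdfs4ds</code> data model.</p>

```python
from datetime import datetime

# Illustrative temporal rows: (entity_id, feature, value, valid_start, valid_end).
# valid_end=None stands for an open-ended validity period.
rows = [
    (1, "sum_Transaction_Amount", 100.0, datetime(2024, 1, 1), datetime(2024, 1, 5)),
    (1, "sum_Transaction_Amount", 150.0, datetime(2024, 1, 5), None),
    (2, "sum_Transaction_Amount", 40.0, datetime(2024, 1, 3), None),
]


def snapshot(rows, at):
    """Return {entity_id: {feature: value}} for rows whose validity contains `at`."""
    result = {}
    for entity_id, feature, value, start, end in rows:
        if start <= at and (end is None or at < end):
            result.setdefault(entity_id, {})[feature] = value
    return result


current = snapshot(rows, datetime(2024, 2, 1))  # latest values
past = snapshot(rows, datetime(2024, 1, 2))     # values as of an earlier date
print(current)
print(past)
```

<p style = 'font-size:16px;font-family:Arial'>Because the snapshot is computed at query time, refreshing the underlying rows changes what the same query returns, which mirrors the behavior of the dataset view described above.</p>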

<p style = 'font-size:16px;font-family:Arial'><strong>Consuming the Dataset</strong></p>
<p style = 'font-size:16px;font-family:Arial'>The dataset view can be named and consumed in various tools, such as this Python notebook, BI tools, or any application that supports database connections and SQL queries. This makes it highly flexible for integration with other workflows and applications.</p>


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.1 Get available entities</b></p>

In [None]:
from tdfs4ds.feature_store.feature_query_retrieval import get_list_entity
get_list_entity()

In [None]:
# we have two entities: CustomerID and Category that are described with features
entity = 'CustomerID'

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.2 Get available features for the selected entity</b></p>

In [None]:
from tdfs4ds.feature_store.feature_query_retrieval import get_available_features
get_available_features(entity_id=entity)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.3 Select the feature version</b></p>

<p style = 'font-size:16px;font-family:Arial'>Each feature may have more than one version, meaning it could be computed by different processes.</p>

In [None]:
from tdfs4ds.feature_store.feature_query_retrieval import get_feature_versions
tdfs4ds.DEBUG_MODE = False

In [None]:
feature_selection = get_feature_versions(
    entity_name = entity,
    features    = [
        'count_Transaction_Amount',
        'mean_Transaction_Amount',
        'nb_days_since_last_transactions',
        'sum_Transaction_Amount'
    ]
)
feature_selection

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.4 Build the dataset (view)</b></p>

In [None]:
from tdfs4ds import build_dataset,build_dataset_opt
tdfs4ds.DEBUG_MODE = False

In [None]:
# this builds a view on the feature store data model with the entity and the selected features
# (the features values are the up-to-date values of the features)
dataset = build_dataset(
    entity_id         = entity,
    selected_features = feature_selection,
    view_name         = 'DATASET_CUSTOMER',
    comment           = 'my dataset for curve clustering'
)
dataset

In [None]:
dataset.groupby('CustomerID').count()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8 Update the features and re-run the feature engineering processes</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this notebook, we will explore how to refresh features in the feature store without requiring access to the original code that defined the feature engineering processes or created datasets. This approach leverages the metadata stored in the feature store to simplify and streamline the refresh process.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Steps to Refresh Features</strong></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Connect to the Feature Store:</strong> Establish a connection to the feature store.</li>
    <li><strong>Inspect the Process Catalog:</strong> Review the <strong>Process Catalog</strong> to identify the processes associated with the features you want to refresh. 
        <ul>
            <li>The catalog includes details such as <code>process ID</code>, <code>entity ID</code>, <code>data domain</code>, and the features involved.</li>
            <li>Select the processes to refresh based on the updated source data.</li>
        </ul>
    </li>
    <li><strong>Rerun the Process:</strong> Use the <code>run</code> function from the <code>tdfs4ds</code> package. Provide the <code>process ID</code> as an argument to execute the process and refresh the feature content.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><strong>How It Works</strong></p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Source Table Updates:</strong> In real-world industrial scenarios, data ingestion processes continuously update source tables in a structured data model. These updates are seamlessly integrated into the feature store without requiring table schema changes.</li>
    <li><strong>Incremental Processing:</strong> Ideally, processes are engineered to handle incremental updates rather than reprocessing entire tables. This minimizes computation time and optimizes efficiency.</li>
    <li><strong>Temporal Features:</strong> The feature store uses temporal capabilities to manage feature validity:
        <ul>
            <li>If the feature value for an entity remains unchanged, the validity range of the feature is extended without adding new data.</li>
            <li>If the feature value changes, a new row is added to record the update, closing the previous validity period and starting a new one with the current timestamp.</li>
        </ul>
    </li>
</ul>
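<p style = 'font-size:16px;font-family:Arial'>The two temporal rules above can be sketched in a few lines of plain Python. This is only an illustration of the logic, assuming that the latest period is stored as open-ended so that an unchanged value needs no write at all; the feature store itself relies on the temporal tables of Vantage.</p>

```python
from datetime import datetime


def upsert_feature(history, value, now):
    """Apply one refresh to the history of a single entity/feature pair.

    `history` is a list of [value, valid_start, valid_end] periods; the last
    one is open-ended (valid_end is None).
    """
    if history and history[-1][0] == value:
        return history  # unchanged: the open period keeps covering `now`
    if history:
        history[-1][2] = now  # close the previous validity period
    history.append([value, now, None])  # open a new period at the refresh time
    return history


history = []
upsert_feature(history, 100.0, datetime(2024, 1, 1))
upsert_feature(history, 100.0, datetime(2024, 1, 2))  # same value: no new row
upsert_feature(history, 150.0, datetime(2024, 1, 3))  # changed value: new row
print(history)
```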

<p style = 'font-size:16px;font-family:Arial'><strong>Benefits of Temporal Features</strong></p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>Supports versioning of features over time, allowing reconstruction of changes as a time series or snapshots.</li>
    <li>Enables time travel queries to analyze features at different points in time.</li>
    <li>Optimizes storage by avoiding redundant data insertion for unchanged features.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Automating the Refresh Process</strong></p>

<p style = 'font-size:16px;font-family:Arial'>The <code>run</code> method can be scheduled using any orchestrator that supports a Python runtime. This allows for automated feature refreshes as part of operational workflows. Note that feature refresh operations typically occur in production environments rather than in a data lab. By following this approach, operational teams can ensure that features remain up-to-date and reliable, enabling seamless integration into downstream applications and analytics workflows.</p>
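<p style = 'font-size:16px;font-family:Arial'>As a rough sketch of what a scheduled job might look like, the helper below runs a list of processes and collects failures instead of stopping at the first one. The name <code>refresh_features</code> and the injected <code>run_process</code> callable are illustrative assumptions; in a real job the callable could wrap <code>tdfs4ds.run(process_id=...)</code>, while a stand-in is used here so the sketch runs without a database connection.</p>

```python
def refresh_features(process_ids, run_process):
    """Run each process once and report which succeeded or failed.

    `run_process` is any callable taking a process ID; in a real job it could
    be `lambda pid: tdfs4ds.run(process_id=pid)`.
    """
    succeeded, failed = [], {}
    for process_id in process_ids:
        try:
            run_process(process_id)
            succeeded.append(process_id)
        except Exception as exc:  # keep going so one failure does not block the rest
            failed[process_id] = str(exc)
    return succeeded, failed


# Stand-in runner used only for this illustration.
def fake_run(process_id):
    if process_id == "bad-id":
        raise RuntimeError("process not found")


succeeded, failed = refresh_features(["id-1", "bad-id", "id-2"], fake_run)
print(succeeded, failed)
```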

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>8.1 Connect to the Feature Store</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first step after connecting to the Vantage system is to establish a connection to the deployed feature store. This is done using the <code>connect</code> method together with the <code>DATA_DOMAIN</code> setting, which require the following information:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Database:</strong> Specify the database where the feature store is hosted.</li>
    <li><strong>Data Domain:</strong> Define the namespace to restrict the search for entities, features, and versions to the appropriate scope. This ensures clarity and prevents conflicts across different teams, projects, or use cases.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Establishing this connection is essential for working with the feature store and accessing its capabilities within the defined context.</p>

In [None]:
import tdfs4ds
tdfs4ds.connect(database='DEMO_USER')

In [None]:
# Let's define a data domain for this use case
tdfs4ds.DATA_DOMAIN = "Customer Transaction Analytics"

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>8.2 Get existing feature engineering processes (process catalog)</b></p>

<p style = 'font-size:16px;font-family:Arial'>You can retrieve the <code>process ID</code> by inspecting the <strong>Process Catalog</strong>. The Process Catalog provides detailed information about all implemented processes, including their associated <code>process ID</code>, entities, features, and other metadata.</p>

<p style = 'font-size:16px;font-family:Arial'>Refer to the catalog to identify the relevant <code>process ID</code> for the feature or process you wish to refresh.</p>

In [None]:
tdfs4ds.process_catalog()

In [None]:
list_process_views = tdfs4ds.process_catalog()[['VIEW_NAME']].to_pandas().VIEW_NAME.values
view_cat  = [c for c in list_process_views if 'CAT' in c][0]
view_cust = [c for c in list_process_views if 'CUST' in c][0]

In [None]:
process_id_cat  = tdfs4ds.process_store.process_query_administration.get_process_id(view_name=view_cat)
process_id_cust = tdfs4ds.process_store.process_query_administration.get_process_id(view_name=view_cust)

In [None]:
print('process_id_cat  : ',process_id_cat)
print('process_id_cust : ',process_id_cust)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>8.3 Execute existing Feature Engineering processes</b></p>

<p style = 'font-size:16px;font-family:Arial'>Once you have identified the relevant <code>process ID</code>, you can re-execute the process with the updated data and ingest the results into the feature store. This process leverages the temporal capabilities of Teradata Vantage.</p>

<p style = 'font-size:16px;font-family:Arial'>To re-execute a process, use the <code>run</code> function from the <code>tdfs4ds</code> package.</p>

In [None]:
# this will execute the corresponding processes and update the temporal tables of the feature store
tdfs4ds.run(process_id=process_id_cat)
tdfs4ds.run(process_id=process_id_cust)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>9 Time management in feature store</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have explored the basic usage of the feature store with <code>tdfs4ds</code>, this notebook addresses another common challenge. Customers starting feature store ingestion today may only capture a single version of the feature. However, users might desire a feature store that reflects data as if it had been refreshed over a historical period, such as the past two years, with updates occurring daily.</p>

<p style = 'font-size:16px;font-family:Arial'>This notebook demonstrates how to address this challenge by positioning the processes and the feature store at an earlier date and simulating feature ingestion over time. This capability is enabled by the temporal nature of the feature store and involves time management techniques.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Steps for Time Management</strong></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li><strong>Setting an Earlier Date:</strong> Operate the feature store at a specific earlier date, such as two years ago, and simulate daily updates.</li>
    <li><strong>Synchronizing Time:</strong>
        <ul>
            <li>Synchronize the <code>tdfs4ds</code> package with the Vantage platform.            </li>
            <li>Ensure the client session, processes, and database data are aligned to the specified date.</li>
        </ul>
    </li>
    <li><strong>Processing Data for Specific Dates:</strong> Write processes to handle only the data available at the selected date.</li>
    <li><strong>Simulate Ingestion:</strong> Simulate the ingestion and upload of features at the earlier date, ensuring the feature store reflects the data for that time.</li>

</ol>

<p style = 'font-size:16px;font-family:Arial'><strong>Introducing the Time Manager</strong></p>

<p style = 'font-size:16px;font-family:Arial'>The concept of a <strong>Time Manager</strong> is introduced to facilitate this process. The Time Manager is an object represented by a table and a view that allows you to select a specific date from a list. It enables synchronization between:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>The feature store in the client session.</li>
    <li>The processes to ensure they only operate on the data available for the selected date.</li>
</ul>
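<p style = 'font-size:16px;font-family:Arial'>The "table plus one-row view" idea can be illustrated with a small SQLite analog. This is not how <code>tdfs4ds</code> implements the Time Manager: the pointer table and the helper functions <code>update</code> and <code>display</code> below are assumptions made for the sketch, chosen to mirror the method names used later in this notebook.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table holding every available business date.
    CREATE TABLE BUSINESS_DATE_HIDDEN (TIME_ID INTEGER PRIMARY KEY, BUSINESS_DATE TEXT);
    -- One-row table holding the currently selected time step (sketch-only assumption).
    CREATE TABLE BUSINESS_DATE_POINTER (TIME_ID INTEGER);
    INSERT INTO BUSINESS_DATE_HIDDEN VALUES (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03');
    INSERT INTO BUSINESS_DATE_POINTER VALUES (1);
    -- View exposing only the selected date.
    CREATE VIEW BUSINESS_DATE AS
        SELECT h.BUSINESS_DATE
        FROM BUSINESS_DATE_HIDDEN h
        JOIN BUSINESS_DATE_POINTER p ON h.TIME_ID = p.TIME_ID;
""")


def update(time_id):
    """Move the cursor to another time step."""
    conn.execute("UPDATE BUSINESS_DATE_POINTER SET TIME_ID = ?", (time_id,))


def display():
    """Return the single date currently exposed by the view."""
    return conn.execute("SELECT BUSINESS_DATE FROM BUSINESS_DATE").fetchone()[0]


first = display()
update(3)
third = display()
print(first, third)
```

<p style = 'font-size:16px;font-family:Arial'>Any query that joins or filters on the view automatically follows the selected date, which is what lets database-side processes stay aligned with the client session.</p>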


<p style = 'font-size:16px;font-family:Arial'><strong>Key Outcomes</strong></p>

<p style = 'font-size:16px;font-family:Arial'>By using the Time Manager and aligning processes to specific dates:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Features are uploaded with their validity starting at the selected earlier date.</li>
    <li>The feature store provides a time-accurate representation of features as they would have been at that time.</li>
    <li>Users can simulate historical feature store operations and analyze features over time using temporal capabilities.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>This approach allows for a robust and flexible time management strategy in the feature store, enabling retrospective feature engineering and analysis.</p>


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>9.1 Simulating Daily Ingestion</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this notebook, we will compute the features on a daily basis in order to simulate a daily run.</p>
<p style = 'font-size:16px;font-family:Arial'>Each run therefore works on the data of a single date. To make this easier, we will use a TimeManager, which is made of:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>a table that contains the list of days;</li>
    <li>a view pointing to this table that outputs only one of its rows.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial'>Later, we will loop over the days in the table and synchronize with the tdfs4ds package.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Creation of the time manager</b></p>

In [None]:
# Let's define a data domain for this use case
tdfs4ds.DATA_DOMAIN = "Customer Transaction Analytics Time Management"

In [None]:
from tdfs4ds.utils.time_management import TimeManager
# this creates the BUSINESS_DATE view pointing to the BUSINESS_DATE_HIDDEN table
business_time = TimeManager(table_name = 'BUSINESS_DATE', schema_name = 'DEMO_USER')

<p style = 'font-size:16px;font-family:Arial'><b>Updating the list of dates in the time manager</b></p>

In [None]:
# here we will fill the BUSINESS_DATE_HIDDEN table with all the dates available in our source table (transactions)
source_data = DataFrame(in_schema(source_database, 'transactions'))
business_time.load_time_steps(source_data.groupby('Date_transaction').count(), 'Date_transaction')

<p style = 'font-size:16px;font-family:Arial'>The <code>display()</code> method shows the current time of the time manager object, which will later be used in the process definition. The <code>update()</code> method moves the cursor of the time manager to another date.</p>

In [None]:
# get the current time of the time manager, here a date
# it actually returns the result of a select on the BUSINESS_DATE view
# by default, it points to the oldest time in BUSINESS_DATE_HIDDEN
business_time.display()

In [None]:
# change the current time of the time manager to the 10th value of the BUSINESS_DATE_HIDDEN table
business_time.update(time_id=10)

In [None]:
business_time.display()

In [None]:
# Let's set it back to the first date:
business_time.update(time_id=1)

In [None]:
business_time.display()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>9.2 Synchronization with tdfs4ds</b></p>

<p style = 'font-size:16px;font-family:Arial'>The feature store package includes a parameter called <code>FEATURE_STORE_TIME</code>, which defines the internal date and time of the feature store for data ingestion. This parameter determines the <strong>valid time</strong> recorded in the temporal table during the ingestion process.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Default Behavior</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>By default, <code>FEATURE_STORE_TIME</code> is set to the system's current time when the package is imported.</li>
    <li>If no changes are made, data ingested into the feature store will use the system's current time as the valid time.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Using a Specific Date and Time</strong></p>
<p style = 'font-size:16px;font-family:Arial'>To set a specific earlier date and time for the feature store:</p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Retrieve the desired <code>timestamp</code> value, such as the current time from the <strong>Time Manager</strong>.</li>
    <li>Assign this value to the <code>FEATURE_STORE_TIME</code> parameter of the package.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><strong>Impact on Feature Uploads</strong></p>
<p style = 'font-size:16px;font-family:Arial'>When the <code>upload_features</code> function is used, the <code>FEATURE_STORE_TIME</code> parameter ensures that the specified timestamp is recorded as the valid time for the ingested feature values. This allows for precise control of temporal data in the feature store.</p>

<p style = 'font-size:16px;font-family:Arial'>This functionality is particularly useful when simulating historical data ingestion or aligning feature validity with specific historical contexts, enabling robust time travel and retrospective analysis.</p>


In [None]:
# retrieve the current time of the time manager in the format expected by tdfs4ds
business_time.get_date_in_the_past()

In [None]:
# synchronize the two
tdfs4ds.FEATURE_STORE_TIME = business_time.get_date_in_the_past()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>9.3 Use the time manager to filter out the source data</b></p>

<p style = 'font-size:16px;font-family:Arial'>The time from the <strong>Time Manager</strong> is also accessible within the Vantage system, allowing it to be incorporated into process definitions. This enables the processes to work with data that aligns with the selected time, such as filtering transactions based on the current transaction date.</p>

<p style = 'font-size:16px;font-family:Arial'>The <code>Time Manager</code> is used to define a filter that ensures only transactions matching the current date are processed. However, this part of the workflow cannot be fully automated due to the variability in use cases and process requirements.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Customization by the User</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>The logic for incorporating the <code>Time Manager</code> depends on the specific use case and process being implemented.</li>
    <li>Only the user fully understands the nature of their data and can determine how to position the filter to achieve the desired results.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Important Considerations</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Positioning the filter at the end of the process is generally not effective. Filters should typically be applied to the source data or at appropriate stages earlier in the process.</li>
    <li>This step is inherently a manual task, as it requires domain knowledge to ensure the filter is applied correctly and effectively.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Ultimately, the responsibility for positioning and defining these filters rests with the user to align the logic with their specific requirements and data characteristics.</p>
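<p style = 'font-size:16px;font-family:Arial'>A tiny, database-free example shows why the position of the filter matters. The transactions and the helper <code>sum_per_customer</code> are made up for illustration: once the data has been aggregated per customer, the date column is gone and the result can no longer be restricted to the business date.</p>

```python
from collections import defaultdict

# Illustrative transactions: (customer_id, amount, date).
transactions = [
    (1, 10.0, "2023-01-01"),
    (1, 20.0, "2023-01-02"),
    (2, 5.0, "2023-01-02"),
]


def sum_per_customer(rows):
    """Aggregate transaction amounts per customer, dropping the date."""
    totals = defaultdict(float)
    for customer_id, amount, _date in rows:
        totals[customer_id] += amount
    return dict(totals)


business_date = "2023-01-02"

# Filter the source first, then aggregate: only that day's transactions count.
filtered_first = sum_per_customer(t for t in transactions if t[2] == business_date)

# Aggregate first: every date is already mixed into the totals.
aggregated_first = sum_per_customer(transactions)

print(filtered_first)
print(aggregated_first)
```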


<p style = 'font-size:16px;font-family:Arial'>Here, we apply the time filter to the source data using an equality condition (<code>=</code>) on the transaction date. Our feature engineering process will then use the filtered source instead of the original one.</p>

In [None]:
input_source = DataFrame.from_query(
f"""
SELECT 
    CustomerID
,   Transaction_Amount
,   Date_transaction
,   Category
,   Merchant_Name
FROM {source_database}.transactions A
WHERE A.Date_transaction = (SEL BUSINESS_DATE FROM DEMO_USER.BUSINESS_DATE)
"""
)
input_source

In [None]:
input_source.shape

<p style = 'font-size:16px;font-family:Arial'>This feature engineering process covers two main tasks:</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>
        <strong>Statistics on Customer:</strong> For each <code>customer ID</code>, we will compute:
        <ul>
            <li>The sum of all transaction amounts.</li>
            <li>The average amount per transaction.</li>
            <li>The total number of transactions.</li>
            <li>The number of days since the last transaction.</li>
        </ul>
    </li>
    <li>
        <strong>Spending Category Distribution:</strong> For each transaction category, we will compute:
        <ul>
            <li>The sum of transaction amounts.</li>
            <li>The mean, standard deviation, minimum, maximum, and median of transaction amounts.</li>
            <li>The first and third quartiles of transaction amounts.</li>
        </ul>
    </li>
</ol>

<p style = 'font-size:16px;font-family:Arial'>These computations will result in two Teradata DataFrames:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><code>df_eng_feat_cust</code>: Features computed per customer.</li>
    <li><code>df_eng_feat_cat</code>: Features computed for spending category distribution.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Note that these DataFrames only implement the processing logic and do not generate data until explicitly stored or exported. When displaying the content of these DataFrames, only a sample of the results will be shown. To generate the actual data, you would need to either:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Store the results in another table within the database.</li>
    <li>Export the results to a <code>pandas</code> DataFrame, files, ...</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Note that we use <code>input_source</code>, the source filtered by the current time of the time manager, as the input DataFrame.</p>

<p style = 'font-size:16px;font-family:Arial'>Statistics on Customer</p>

In [None]:
from sqlalchemy import literal_column
df_eng_feat_cust = input_source.groupby('CustomerID').agg({'Transaction_Amount' : ['sum','mean','count'], 'Date_transaction':['max']})
df_eng_feat_cust = df_eng_feat_cust.assign(nb_days_since_last_transactions = literal_column('INTERVAL(PERIOD(max_Date_transaction, CURRENT_DATE)) DAY(4)',type_= INTEGER))
df_eng_feat_cust = df_eng_feat_cust[['CustomerID','sum_Transaction_Amount','mean_Transaction_Amount','count_Transaction_Amount','nb_days_since_last_transactions']]
df_eng_feat_cust

<p style = 'font-size:16px;font-family:Arial'>Spending Category Distribution</p>

In [None]:
df_eng_feat_cat = input_source.groupby('Category').agg({'Transaction_Amount':['sum','mean','std','min','max','median']})
df_eng_feat_cat

In [None]:
df_eng_feat_cat = df_eng_feat_cat.join(
    input_source[['Category','Transaction_Amount']].groupby("Category").percentile(0.25),
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_1_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat = df_eng_feat_cat.join(
    input_source[['Category','Transaction_Amount']].groupby("Category").percentile(0.75),  # third quartile
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_3_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat

<p style = 'font-size:16px;font-family:Arial'>Process to compute "Statistics on customers"</p>

In [None]:
from tdfs4ds.utils.lineage import crystallize_view
df_eng_feat_cust_proc = crystallize_view(df_eng_feat_cust, view_name='FEAT_ENG_CUST_DAILY', schema_name='DEMO_USER')
df_eng_feat_cust_proc

<p style = 'font-size:16px;font-family:Arial'>Process to compute "Spending Category Distribution"</p>

In [None]:
df_eng_feat_cat_proc = crystallize_view(df_eng_feat_cat, view_name='FEAT_ENG_CAT_DAILY', schema_name='DEMO_USER')

In [None]:
df_eng_feat_cat_proc

<p style = 'font-size:16px;font-family:Arial'>Registering the Statistics on customers process</p>

In [None]:
# let's define the entity and the features in the outputs of the process
entity   = 'CustomerID'
features = ['sum_Transaction_Amount','mean_Transaction_Amount','count_Transaction_Amount','nb_days_since_last_transactions']

In [None]:
from tdfs4ds import upload_features
upload_features(
    df_eng_feat_cust_proc, # <-- the teradata dataframe pointing to the process view
    entity_id     = entity,
    feature_names = features,
    metadata      = {'project' : 'customer transactions'}, # <-- some informative metadata
)

<p style = 'font-size:16px;font-family:Arial'>Registering the "Spending Category Distribution"</p>

In [None]:
# let's define the entity and the features in the outputs of the process
entity   = 'Category'
features = ['sum_Transaction_Amount','mean_Transaction_Amount','std_Transaction_Amount','min_Transaction_Amount','max_Transaction_Amount','median_Transaction_Amount','quartile_1_Transaction_Amount','quartile_3_Transaction_Amount']

In [None]:
from tdfs4ds import upload_features
upload_features(
    df_eng_feat_cat_proc, # <-- the teradata dataframe pointing to the process view
    entity_id     = entity,
    feature_names = features,
    metadata      = {'project' : 'customer transactions'} # <-- some informative metadata
)

<p style = 'font-size:16px;font-family:Arial'>Inspecting the feature store</p>

In [None]:
tdfs4ds.feature_catalog()

In [None]:
tdfs4ds.feature_catalog()[['DATA_DOMAIN','FEATURE_ID']].groupby('DATA_DOMAIN').count()

In [None]:
tdfs4ds.process_catalog()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>10 Roll out Feature Engineering Processes over time</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we will demonstrate how to update features iteratively from this earlier date to the current date. This is achieved by looping over all the dates available in the <strong>Time Manager</strong> and updating the features for each date.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Automating Feature Updates</strong></p>
<p style = 'font-size:16px;font-family:Arial'>Instead of manually managing the <code>FEATURE_STORE_TIME</code> of the package, a higher-level function called <code>roll_out</code> simplifies this process. The <code>roll_out</code> function automates the execution of processes over a range of dates specified in the Time Manager.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Key Features of <code>roll_out</code></strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Input:</strong> Takes a list of <code>process IDs</code> to execute.</li>
    <li><strong>Time Manager:</strong> Utilizes the dates from the specified <strong>Time Manager</strong> to trigger these processes sequentially.</li>
    <li><strong>Optional Parameters:</strong>
        <ul>
            <li><code>time_id_start:</code> Specifies the starting point in the Time Manager's list of dates (optional).</li>
            <li><code>time_id_end:</code> Specifies the ending point in the Time Manager's list of dates (optional).</li>
        </ul>
    </li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Example Usage</strong></p>
<p style = 'font-size:16px;font-family:Arial'>Here is an example call to the <code>roll_out</code> function:</p>

<pre><code>roll_out(
    process_list = [process_id_cat, process_id_cust],
    time_manager = business_time,
    time_id_start = 125, # optional
    time_id_end   = 130  # optional
)</code></pre>

<p style = 'font-size:16px;font-family:Arial'>This example runs the processes <code>process_id_cat</code> and <code>process_id_cust</code> sequentially over the specified time range, leveraging the dates in the <code>business_time</code> Time Manager.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Practical Considerations</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>In this notebook, the process will be restricted to a few iterations for demonstration purposes.</li>
    <li>For a full historical data load, you can run the <code>roll_out</code> function over the entire date range in the Time Manager.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Using <code>roll_out</code>, you can efficiently update features over time, ensuring the feature store reflects the desired historical data timeline.</p>
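<p style = 'font-size:16px;font-family:Arial'>Conceptually, a roll out can be thought of as the loop sketched below. This is not the implementation of <code>roll_out</code>: the function <code>manual_roll_out</code>, the <code>FakeTimeManager</code> stand-in, and the assumption that <code>time_id_end</code> is inclusive are all illustrative. The stand-in only mimics the <code>update</code> and <code>get_date_in_the_past</code> methods used earlier so that the sketch runs without a database.</p>

```python
class FakeTimeManager:
    """Stand-in exposing the two methods used earlier in this notebook."""

    def __init__(self, dates):
        self.dates = dates  # time_id 1 maps to dates[0]
        self.time_id = 1

    def update(self, time_id):
        self.time_id = time_id

    def get_date_in_the_past(self):
        return self.dates[self.time_id - 1]


def manual_roll_out(process_list, time_manager, time_id_start, time_id_end, run_process):
    """Replay each process at every time step; return the (date, process) pairs run."""
    executed = []
    for time_id in range(time_id_start, time_id_end + 1):
        time_manager.update(time_id=time_id)  # move the business date
        # In a real session this value would be assigned to tdfs4ds.FEATURE_STORE_TIME.
        store_time = time_manager.get_date_in_the_past()
        for process_id in process_list:
            run_process(process_id)  # e.g. tdfs4ds.run(process_id=process_id)
            executed.append((store_time, process_id))
    return executed


tm = FakeTimeManager(["2023-01-01 00:00:00", "2023-01-02 00:00:00", "2023-01-03 00:00:00"])
log = manual_roll_out(["cat", "cust"], tm, 2, 3, run_process=lambda pid: None)
print(log)
```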


<p style = 'font-size:16px;font-family:Arial'>So far, our data has been static. In practice, however, new data is ingested continuously.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>10.1 Roll out existing processes over time</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first step is to retrieve the <strong>Time Manager</strong>. If a Time Manager already exists, the constructor will connect to it. If it does not exist, the constructor will create a new one.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Verifying the Starting Point</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>To ensure the process starts from the first date in the Time Manager, use the <code>get_current_step</code> method.</li>
    <li>If <code>get_current_step</code> returns <code>1</code>, the process will start at the first date.</li>
    <li>If it does not return <code>1</code>, you can specify the starting <code>time_id</code> directly in the <code>roll_out</code> function.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Practical Usage</strong></p>
<p style = 'font-size:16px;font-family:Arial'>By checking and setting the starting point, you can ensure the process begins from the correct date, whether performing a full run or focusing on a specific range of dates.</p>
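<p style = 'font-size:16px;font-family:Arial'>The check described above can be pictured with a small stand-alone sketch. It does not call <code>tdfs4ds</code>: <code>ToyTimeManager</code> and <code>choose_start</code> are hypothetical names invented for illustration. The only assumptions carried over from the text are that steps are numbered from 1 and that a cursor points at the current step.</p>

```python
from datetime import date, timedelta


class ToyTimeManager:
    """Hypothetical in-memory stand-in for a time manager (not the tdfs4ds class)."""

    def __init__(self, dates):
        # Steps are numbered from 1, in chronological order.
        self.steps = {i: d for i, d in enumerate(sorted(dates), start=1)}
        self.current = 1

    def get_current_step(self):
        return self.current

    def update(self, time_id):
        if time_id not in self.steps:
            raise ValueError(f"unknown time_id: {time_id}")
        self.current = time_id


def choose_start(manager, wanted_start=1):
    """Return None when the cursor already sits on the wanted step,
    otherwise the time_id to pass explicitly as the starting point."""
    if manager.get_current_step() == wanted_start:
        return None
    return wanted_start


days = [date(2024, 1, 1) + timedelta(days=n) for n in range(5)]
tm = ToyTimeManager(days)
print(choose_start(tm))  # cursor is on step 1: nothing to override
tm.update(3)
print(choose_start(tm))  # cursor has moved: the start must be given explicitly
```

<p style = 'font-size:16px;font-family:Arial'>With the real package, the value returned by such a check would be the one passed as <code>time_id_start</code> to <code>roll_out</code>.</p>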


In [None]:
from tdfs4ds.utils.time_management import TimeManager
# this creates the BUSINESS_DATE view pointing to the BUSINESS_DATE_HIDDEN table
business_time = TimeManager(table_name = 'BUSINESS_DATE', schema_name = 'DEMO_USER')
business_time.get_current_step()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>10.2 Trigger the roll out</b></p>

<p style = 'font-size:16px;font-family:Arial'>Using <code>roll_out</code>, you can efficiently update features over time, ensuring the feature store reflects the desired historical data timeline.</p>

In [None]:
from tdfs4ds import roll_out
roll_out(
    process_list = [process_id_cat, process_id_cust],
    time_manager = business_time,
    time_id_start= 125,
    time_id_end  = 130
)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>11 Monitoring Process Execution with the Follow-Up Table</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <code>TDFS4DS</code> package provides a follow-up table to monitor the execution of processes in the feature store. This table contains detailed information about each process run, helping you track progress and troubleshoot issues.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Key Information in the Follow-Up Table </strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Run ID:</strong> Identifies each triggered run.</li>
    <li><strong>Process Type:</strong> Indicates the type of process (e.g., upload).</li>
    <li><strong>Process ID:</strong> The ID of the process being executed.</li>
    <li><strong>Start and Completion Times:</strong> Timestamps indicating when the process started and finished.</li>
    <li><strong>Valid Time:</strong> The time at which the data was inserted into the feature store.</li>
    <li><strong>Status:</strong> Tracks the current state of the process (e.g., running, failed, or completed).</li>
    <li><strong>Additional Metadata:</strong> Includes other relevant details to aid in monitoring and debugging.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Here, we will examine the follow-up table to review the processes executed in previous notebooks. This provides insight into the operational framework of <code>TDFS4DS</code> and ensures that processes are functioning as expected.</p>

<p style = 'font-size:16px;font-family:Arial'>The follow-up table is an essential tool for managing and monitoring feature store operations, enabling efficient tracking and resolution of any issues that may arise during process execution.</p>
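<p style = 'font-size:16px;font-family:Arial'>To make the kind of questions this table answers concrete, here is a minimal sketch on plain Python dictionaries. Only <code>START_DATETIME</code> is a column name taken from this notebook. The keys <code>RUN_ID</code>, <code>PROCESS_ID</code> and <code>STATUS</code>, and all the values, are assumptions made for illustration and may not match the real table.</p>

```python
from datetime import datetime

# Hypothetical follow-up rows; only START_DATETIME is a column name taken from
# this notebook, the other keys and all values are invented for illustration.
rows = [
    {"RUN_ID": "r1", "PROCESS_ID": "p-cat", "STATUS": "COMPLETED",
     "START_DATETIME": datetime(2024, 1, 1, 9, 0)},
    {"RUN_ID": "r2", "PROCESS_ID": "p-cust", "STATUS": "FAILED",
     "START_DATETIME": datetime(2024, 1, 1, 9, 5)},
    {"RUN_ID": "r3", "PROCESS_ID": "p-cust", "STATUS": "COMPLETED",
     "START_DATETIME": datetime(2024, 1, 1, 9, 30)},
]

# Most recent run first, mirroring sort('START_DATETIME', ascending=False).
latest_first = sorted(rows, key=lambda r: r["START_DATETIME"], reverse=True)

# Latest status per process: the first row seen for each process wins.
latest_status = {}
for row in latest_first:
    latest_status.setdefault(row["PROCESS_ID"], row["STATUS"])

# Runs that need attention.
failed_runs = [r["RUN_ID"] for r in latest_first if r["STATUS"] == "FAILED"]

print(latest_status)
print(failed_runs)
```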


In [None]:
tdfs4ds.process_store.process_followup.follow_up_report().sort('START_DATETIME', ascending=False)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>12 Processing Sequentially with Filter Manager</b></p>

<p style = 'font-size:16px;font-family:Arial'>This notebook introduces the <strong>Filter Manager</strong>, a tool for handling large data volumes during feature engineering and feature store processes. The Filter Manager addresses scenarios where processes handle data that may be too large for the system to process efficiently, potentially leading to spool issues or system constraints.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Key Concepts of the Filter Manager</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Partitioning Data:</strong> The Filter Manager allows you to partition your data into smaller subsets (partitions) and process each partition sequentially. This reduces the data volume processed at any given time.</li>
    <li><strong>Sequential Processing:</strong> While the overall process may take longer, partitioning ensures feasibility by dividing the workload into manageable portions.</li>
    <li><strong>Granularity Control:</strong> Define the granularity of your processing and iterate over partitions during feature engineering, process execution, and result ingestion into the feature store.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Benefits of the Filter Manager</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Progress Monitoring:</strong> Track the progression of your process, ensuring visibility into each partition’s execution.</li>
    <li><strong>Error Recovery:</strong> If a partition fails, you do not need to rerun completed partitions. The follow-up table can help identify which partitions have been successfully processed.</li>
    <li><strong>Operational Efficiency:</strong> Enables reliable and incremental execution of feature engineering processes in environments with large datasets.</li>
</ul>
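<p style = 'font-size:16px;font-family:Arial'>The error-recovery idea can be sketched in a few lines of plain Python. This is a conceptual illustration, not the way <code>tdfs4ds</code> is implemented: <code>run_partitions</code>, <code>total</code> and the toy data are hypothetical, and the <code>done</code> dictionary merely plays the role that the follow-up table plays in the text.</p>

```python
def run_partitions(partitions, work, done=None):
    """Process partitions one at a time and skip those already completed.

    `done` maps partition -> result from earlier runs.
    """
    done = dict(done or {})
    failed = []
    for part in partitions:
        if part in done:
            continue  # completed in an earlier run: do not recompute
        try:
            done[part] = work(part)
        except Exception:
            failed.append(part)  # record the failure and move on
    return done, failed


# Toy data: transaction amounts per category (invented values).
data = {"Food": [10.0, 20.0], "Travel": [100.0], "Health": [5.0, 7.0, 9.0]}
calls = []


def total(category, broken=("Travel",)):
    calls.append(category)
    if category in broken:
        raise RuntimeError("simulated spool failure")
    return sum(data[category])


# First pass: "Travel" fails, the other partitions complete.
done, failed = run_partitions(list(data), total)
print(done, failed)

# Second pass after the fix: only the failed partition is processed again.
calls.clear()
done, failed = run_partitions(list(data), lambda c: total(c, broken=()), done)
print(done, failed, calls)
```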

<p style = 'font-size:16px;font-family:Arial'><strong>Practical Impact</strong></p>
<p style = 'font-size:16px;font-family:Arial'>When attaching a Filter Manager to a process definition, it impacts the way the <code>upload_features</code> method runs by:
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Splitting the processing and ingestion tasks based on the defined partitions.</li>
    <li>Ensuring that results for each partition are ingested into the feature store sequentially and independently.</li>
</ul>
</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Usage in Operations</strong></p>
<p style = 'font-size:16px;font-family:Arial'>By using the Filter Manager, you can:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Optimize processing for systems with limited capacity.</li>
    <li>Prevent data loss by ensuring completed partitions are not rerun unnecessarily.</li>
    <li>Monitor and manage partition execution through the follow-up table.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Here, we will demonstrate how to define and use the Filter Manager in a process definition and explain its impact on feature store uploads and operations.</p>

<p style = 'font-size:16px;font-family:Arial'>We use a new DATA_DOMAIN for the sake of clarity.</p>

In [None]:
# Let's define a data domain for this use case
tdfs4ds.DATA_DOMAIN = "Customer Transaction Analytics Time and Category Management"

<p style = 'font-size:16px;font-family:Arial'>As before, we set up a Time Manager, here named <code>BUSINESS_DATE_SEQ</code>.</p>

In [None]:
from tdfs4ds.utils.time_management import TimeManager
# this creates the BUSINESS_DATE_SEQ view pointing to the BUSINESS_DATE_SEQ_HIDDEN table
business_time = TimeManager(table_name = 'BUSINESS_DATE_SEQ', schema_name = 'DEMO_USER')
# here we fill the BUSINESS_DATE_SEQ_HIDDEN table with all the dates available in our source table (transactions)
business_time.load_time_steps(source_data.groupby('Date_transaction').count(), 'Date_transaction')
# synchronize tdfs4ds with the time manager
tdfs4ds.FEATURE_STORE_TIME = business_time.get_date_in_the_past()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>12.1 Creation of the filter manager</b></p>

<p style = 'font-size:16px;font-family:Arial'>As with the Time Manager, TDFS4DS provides a Filter Manager that makes it easy to partition a process.</p>

In [None]:
from tdfs4ds.utils.filter_management import FilterManager
# this creates the BUSINESS_FILTER view pointing to the BUSINESS_FILTER_HIDDEN table
filtermanager = FilterManager(table_name = 'BUSINESS_FILTER',schema_name='DEMO_USER')

<p style = 'font-size:16px;font-family:Arial'>Here, we define the granularity of the process partitioning. In this example, we want to split the processing by category. Note that the partitioning must be compliant with the business logic, meaning it should not have an unexpected impact on the result of the computations.</p>
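<p style = 'font-size:16px;font-family:Arial'>What "compliant with the business logic" means can be shown with a toy example in plain Python (no <code>tdfs4ds</code> involved; the data is invented). A per-category statistic gives the same answer whether it is computed on the full data or one category at a time, whereas a statistic that spans categories does not.</p>

```python
from statistics import mean

# Toy transactions as (category, amount) pairs; values are invented.
transactions = [("Food", 10.0), ("Food", 30.0), ("Travel", 100.0)]
categories = sorted({c for c, _ in transactions})

# Per-category mean: each partition contains everything the feature needs.
full = {c: mean(a for k, a in transactions if k == c) for c in categories}
per_partition = {}
for c in categories:
    partition = [(k, a) for k, a in transactions if k == c]
    per_partition[c] = mean(a for _, a in partition)
print(full == per_partition)  # partitioning by category is safe here

# Overall mean: no single partition sees all rows, so computing it inside
# each partition yields a different value from the true overall mean.
overall = mean(a for _, a in transactions)
partial = [mean(a for k, a in transactions if k == c) for c in categories]
print(overall, partial)
```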

In [None]:
# here we will fill the BUSINESS_FILTER_HIDDEN table with all the Categories available in our source table (Category)
filtermanager.load_filter(source_data.groupby('Category').count()[['Category']])

<p style = 'font-size:16px;font-family:Arial'>As with the Time Manager, the Filter Manager exposes a <code>display</code> method to show the current value of the filter and an <code>update</code> method to position the cursor on a specific partition.</p>

In [None]:
# get the current value of the filter manager, here a category
# it actually returns the result of a SELECT on the BUSINESS_FILTER view
filtermanager.display()

In [None]:
# set the current filter of the filter manager to the 2nd value of the BUSINESS_FILTER_HIDDEN table
filtermanager.update(filter_id=2)
filtermanager.display()

In [None]:
# Let's set it back to the first filter:
filtermanager.update(filter_id=1)
filtermanager.display()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>12.2 Use both the time and filter managers to filter the source data</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <strong>Filter Manager</strong> is accessible within the Vantage system, allowing it to be incorporated into process definitions. It enables processes to work with data that is partitioned based on user-defined filters, ensuring efficient processing of large datasets by splitting the workload into manageable partitions.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Example Usage</strong></p>

<p style = 'font-size:16px;font-family:Arial'>In this example, the Filter Manager is used to define partitions, ensuring that each partition of the data is processed sequentially. This approach reduces the volume of data processed at any given time, making it feasible to execute resource-intensive processes on large datasets.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Customization by the User</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>The logic for incorporating the Filter Manager depends on the specific use case and process requirements.</li>
    <li>Only the user fully understands the nature of their data and can determine how to define and position the filter to achieve the desired results.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Important Considerations</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Positioning the filter at the end of the process is generally not effective. Filters should typically be applied at the source data or appropriate stages earlier in the process pipeline.</li>
    <li>Defining the filter requires domain knowledge to ensure it is applied correctly and effectively for the specific dataset and process.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Ultimately, the responsibility for defining and positioning the filters rests with the user, ensuring alignment with the logic of their processes and the characteristics of their data. The Filter Manager is a powerful tool to optimize processing, but its proper application depends on user expertise.</p>
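<p style = 'font-size:16px;font-family:Arial'>Why the position of the filter matters can be seen in a small plain-Python sketch (invented data, hypothetical helper <code>sum_by_category</code>, no <code>tdfs4ds</code> calls). Both variants return the same feature value, but filtering at the source reads only the rows of the selected partition, while filtering at the end aggregates everything first.</p>

```python
from collections import defaultdict

# Toy transactions as (category, amount) pairs; values are invented.
transactions = [("Food", 10.0), ("Travel", 100.0), ("Food", 20.0),
                ("Health", 5.0), ("Travel", 50.0)]


def sum_by_category(rows):
    """Aggregate amounts per category and report how many rows were read."""
    totals = defaultdict(float)
    for category, amount in rows:
        totals[category] += amount
    return dict(totals), len(rows)


current_filter = "Food"  # the partition currently selected

# Filter at the source: only the selected partition reaches the aggregation.
early_rows = [r for r in transactions if r[0] == current_filter]
early_result, early_read = sum_by_category(early_rows)

# Filter at the end: every row is aggregated, then most results are discarded.
all_totals, late_read = sum_by_category(transactions)
late_result = {k: v for k, v in all_totals.items() if k == current_filter}

print(early_result, early_read)  # same answer, fewer rows read
print(late_result, late_read)
```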


<p style = 'font-size:16px;font-family:Arial'><strong>Registering a Feature Engineering Process with a Filter Manager</strong></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Use the <code>upload_features</code> function to register the process in the feature store. Specify:
        <ul>
            <li>Which columns represent the <strong>entity</strong> (e.g., <code>customer ID</code>).</li>
            <li>Which columns represent the <strong>features</strong> (e.g., transaction statistics).</li>
            <li><strong>Attach a Filter Manager:</strong> Add the Filter Manager as an additional argument. This partitions the process into smaller subsets for sequential processing.</li>
        </ul>
    </li>
    <li>Upon execution, the function:
        <ul>
            <li>Analyzes the process and registers the features and entities in the <strong>Feature Catalog</strong>.</li>
            <li>Registers the process in the <strong>Process Catalog</strong>, generating a unique <code>process ID</code>.</li>
            <li>Computes the results of the process and ingests them into the feature store, timestamping them with the current time to support time travel queries later.</li>
        </ul>
    </li>
</ol>

<p style = 'font-size:16px;font-family:Arial'><strong>Executing Partitioned Processes</strong></p>
<p style = 'font-size:16px;font-family:Arial'>When the process is run, it will iterate over the partitions defined by the Filter Manager. This sequential processing reduces the volume of data handled at any given time, enabling the execution of large-scale processes on systems with limited resources.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Rerunning and Scheduling Processes</strong></p>
<p style = 'font-size:16px;font-family:Arial'>The generated <code>process_id</code> allows you to rerun or schedule the refresh of a specific process without needing to reference the original code. If the Filter Manager is attached, the process will respect the defined partitions during execution, ensuring consistency and efficiency.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Processing Additional Features</strong></p>
<p style = 'font-size:16px;font-family:Arial'>The same steps can be followed for other feature engineering processes. Each process can be registered and ingested into the feature store using the same workflow, ensuring consistency and efficiency. Adding a Filter Manager further enhances flexibility by supporting partitioned processing where necessary.</p>


In [None]:
input_source = DataFrame.from_query(
f"""
SELECT 
    A.CustomerID
,   A.Transaction_Amount
,   A.Date_transaction
,   A.Category
,   A.Merchant_Name
FROM {source_database}.transactions A
, {filtermanager.schema_name}.{filtermanager.view_name} F
, {business_time.schema_name}.{business_time.view_name} T
WHERE A.Date_transaction = T.BUSINESS_DATE
AND A.Category = F.Category
"""
)
input_source

In [None]:
df_eng_feat_cat = input_source.groupby('Category').agg({'Transaction_Amount':['sum','mean','std','min','max','median']})
df_eng_feat_cat

In [None]:
df_eng_feat_cat = df_eng_feat_cat.join(
    input_source[['Category','Transaction_Amount']].groupby("Category").percentile(0.25),
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_1_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat = df_eng_feat_cat.join(
    input_source[['Category','Transaction_Amount']].groupby("Category").percentile(0.75),  # third quartile = 75th percentile
    on = 'Category',
    how = 'inner',
    rprefix = 'r'
)[df_eng_feat_cat.columns + ['percentile_Transaction_Amount']]
df_eng_feat_cat = df_eng_feat_cat.assign(quartile_3_Transaction_Amount=df_eng_feat_cat.percentile_Transaction_Amount)
df_eng_feat_cat = df_eng_feat_cat[[c for c in df_eng_feat_cat.columns if c not in ['percentile_Transaction_Amount']]]
df_eng_feat_cat

In [None]:
from tdfs4ds.utils.lineage import crystallize_view
df_eng_feat_cat_proc = crystallize_view(df_eng_feat_cat, view_name='FEAT_ENG_CAT_DAILY_FILTERED', schema_name='DEMO_USER')

In [None]:
df_eng_feat_cat

<p style = 'font-size:16px;font-family:Arial'>Registering the "Spending Category Distribution"</p>

In [None]:
# let's define the entity and the features in the outputs of the process
entity   = 'Category'
features = ['sum_Transaction_Amount','mean_Transaction_Amount','std_Transaction_Amount','min_Transaction_Amount','max_Transaction_Amount','median_Transaction_Amount','quartile_1_Transaction_Amount','quartile_3_Transaction_Amount']

In [None]:
from tdfs4ds import upload_features
upload_features(
    df_eng_feat_cat_proc, # <-- the teradata dataframe pointing to the process view
    entity_id     = entity,
    feature_names = features,
    metadata      = {'project' : 'customer transactions'}, # <-- some informative metadata
    filtermanager = filtermanager #<-- now we have to tell the feature store there is a filter manager
)

<p style = 'font-size:16px;font-family:Arial'>After the first upload, we can inspect the contents of the feature store. The feature store maintains two catalogs: a feature catalog and a process catalog.</p>

<p style = 'font-size:16px;font-family:Arial'>This organization ensures that all features and processes are well-documented, accessible, and manageable within the feature store.</p>

In [None]:
tdfs4ds.feature_catalog().show_query()

In [None]:
tdfs4ds.feature_catalog()[['DATA_DOMAIN','FEATURE_ID']].groupby('DATA_DOMAIN').count()

In [None]:
tdfs4ds.process_catalog()

In [None]:
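# Optional housekeeping: remove an obsolete process by its ID.
# The ID below comes from the environment in which this notebook was written;
# replace it with a process ID taken from your own process catalog.
tdfs4ds.process_store.process_query_administration.remove_process('dc4fc42d-5388-4f88-b992-3700cb1e2ced')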

In [None]:
tdfs4ds.feature_catalog()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>12.3 Get existing feature engineering processes (process catalog)</b></p>

<p style = 'font-size:16px;font-family:Arial'>We can retrieve the <code>process ID</code> by inspecting the <strong>Process Catalog</strong>. The Process Catalog provides detailed information about all implemented processes, including their associated <code>process ID</code>, entities, features, and other metadata.</p>

<p style = 'font-size:16px;font-family:Arial'>Refer to the catalog to identify the relevant <code>process ID</code> for the feature or process you wish to refresh.</p>

In [None]:
# Let's define a data domain for this use case
tdfs4ds.DATA_DOMAIN = "Customer Transaction Analytics Time and Category Management"

In [None]:
tdfs4ds.process_catalog()

In [None]:
list_process_views = tdfs4ds.process_catalog()[['VIEW_NAME']].to_pandas().VIEW_NAME.values
view_cat  = [c for c in list_process_views if 'CAT_DAILY_FILTERED' in c][0]

In [None]:
process_id_cat  = tdfs4ds.process_store.process_query_administration.get_process_id(view_name=view_cat)

In [None]:
print('process_id_cat  : ',process_id_cat)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>13 Roll out existing processes over time</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first step is to retrieve the <strong>Time Manager</strong>. If a Time Manager already exists, the constructor will connect to it. If it does not exist, the constructor will create a new one.</p>

<p style = 'font-size:16px;font-family:Arial'><strong>Verifying the Starting Point</strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>To ensure the process starts from the first date in the Time Manager, use the <code>get_current_step</code> method.</li>
    <li>If <code>get_current_step</code> returns <code>1</code>, the process will start at the first date.</li>
    <li>If it does not return <code>1</code>, you can specify the starting <code>time_id</code> directly in the <code>roll_out</code> function.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><strong>Practical Usage</strong></p>
<p style = 'font-size:16px;font-family:Arial'>By checking and setting the starting point, you can ensure the process begins from the correct date, whether performing a full run or focusing on a specific range of dates.</p>


<p style = 'font-size:16px;font-family:Arial'>Get the time manager</p>

In [None]:
from tdfs4ds.utils.time_management import TimeManager
# this connects to the BUSINESS_DATE_SEQ view created earlier (it points to the BUSINESS_DATE_SEQ_HIDDEN table)
business_time = TimeManager(table_name = 'BUSINESS_DATE_SEQ', schema_name = 'DEMO_USER')
business_time.get_current_step()

<p style = 'font-size:18px;font-family:Arial'>Trigger the roll out</p>

<p style = 'font-size:16px;font-family:Arial'>Using <code>roll_out</code>, you can efficiently update features over time, ensuring the feature store reflects the desired historical data timeline.</p>

<p style = 'font-size:16px;font-family:Arial'>As we can see, the <code>roll_out</code> does not require any filter manager specification, since the filter manager is part of the process definition itself.</p>
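<p style = 'font-size:16px;font-family:Arial'>Conceptually, rolling out a filtered process amounts to a nested iteration over time steps and partitions. The sketch below is a mental model only: <code>toy_roll_out</code>, <code>run_process</code> and <code>filters_by_process</code> are hypothetical names, and the iteration order shown (time steps outermost) is an assumption, not a statement about how <code>roll_out</code> is implemented.</p>

```python
from itertools import product

# Hypothetical identifiers for illustration only.
time_ids = range(136, 139)   # three time steps
filter_ids = [1, 2]          # two partitions

executed = []


def run_process(process_id, time_id, filter_id):
    """Stand-in for one execution of a registered process."""
    executed.append((process_id, time_id, filter_id))


def toy_roll_out(process_list, time_ids, filters_by_process):
    # For each time step, run each process once per partition it declares.
    # A process without a filter is run once (filter_id is None).
    for time_id, process_id in product(time_ids, process_list):
        for filter_id in filters_by_process.get(process_id, [None]):
            run_process(process_id, time_id, filter_id)


# The caller lists only processes and time steps; the partitions come from
# the process definition, represented here by the filters_by_process mapping.
toy_roll_out(["p-cat"], time_ids, {"p-cat": filter_ids})
print(len(executed))  # 3 time steps x 2 partitions
print(executed[:3])
```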

In [None]:
from tdfs4ds import roll_out
roll_out(
    process_list = [process_id_cat],
    time_manager = business_time,
    time_id_start= 136,
    time_id_end  = 140
)

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>13.1 The Follow-Up Table</b></p>

<p style = 'font-size:16px;font-family:Arial'><strong>Key Information in the Follow-Up Table </strong></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><strong>Run ID:</strong> Identifies each triggered run.</li>
    <li><strong>Process Type:</strong> Indicates the type of process (e.g., upload).</li>
    <li><strong>Process ID:</strong> The ID of the process being executed.</li>
    <li><strong>Start and Completion Times:</strong> Timestamps indicating when the process started and finished.</li>
    <li><strong>Valid Time:</strong> The time at which the data was inserted into the feature store.</li>
    <li><strong>Status:</strong> Tracks the current state of the process (e.g., running, failed, or completed).</li>
    <li><strong>Additional Metadata:</strong> Includes other relevant details to aid in monitoring and debugging.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Note the <strong>APPLIED_FILTER</strong> column, which records the Filter Manager state under which each process was executed.</p>

In [None]:
tdfs4ds.process_store.process_followup.follow_up_report().sort('START_DATETIME', ascending=False)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>14 Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial'> <b>Work Tables </b></p>

In [None]:
list_of_tables = db_list_tables()
list_of_tables

<p style = 'font-size:16px;font-family:Arial'>Drop Views</p>

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_V')]

<p style = 'font-size:16px;font-family:Arial'>Drop Tables</p>

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_T')]

In [None]:
list_of_tables = db_list_tables()
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t.startswith('FS_')]

In [None]:
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['FEAT_ENG_CAT','FEAT_ENG_CUST']]

In [None]:
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['temp','tdfs__fgjnojnsmdoignmosnig']]

In [None]:
[execute_sql(f"DROP VIEW DEMO_USER.{t}") for t in list_of_tables.TableName if t in ['BUSINESS_FILTER','BUSINESS_DATE_SEQ','BUSINESS_DATE','HYBRID_BUSINESS_FILTER']]

In [None]:
[execute_sql(f"DROP TABLE DEMO_USER.{t}") for t in list_of_tables.TableName if 'HIDDEN' in t ]

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_FeatureEngg');" 
# Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>