<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Retail Product Hierarchy Clustering with In-Database K-means Clustering
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The retail industry faces challenges with their product hierarchies. Current hierarchies typically group products in a way that makes sense for procurement but often do not reflect how customers shop.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To align product hierarchies with customer shopping behavior, products should be grouped by similar sales and demand patterns, that is, by strong correlation between their sales time series. Existing hierarchies, however, are based on business rules rather than time series analysis. For example, all types of bread are placed in the same subgroup. This approach works well for some product families (perhaps up to 80%), but the remaining 20% are less straightforward and require further investigation. AI/ML models can be applied to uncover the diverse dynamics within a subgroup.</p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Value </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Aligning product hierarchies with customer shopping behavior using AI/ML models can offer several business benefits:</p>

<li style = 'font-size:16px;font-family:Arial'><b>Improved Inventory Management:</b> Understanding the true demand patterns of products allows for better inventory forecasting and management. This can reduce stockouts and overstock situations, leading to cost savings and more efficient operations.</li>

<li style = 'font-size:16px;font-family:Arial'><b>Optimized Product Placement:</b> In physical stores, aligning product hierarchies with customer behavior can improve product placement strategies. Products that are frequently bought together can be placed near each other, enhancing the shopping experience and potentially increasing sales.</li> 

<li style = 'font-size:16px;font-family:Arial'><b>Relevant Recommendations:</b> Improved product hierarchies can lead to better product recommendations. When customers see suggestions that are relevant to their interests and needs, they feel understood and valued.</li>

<li style = 'font-size:16px;font-family:Arial'><b>Efficient Shopping:</b> A well-organized store layout, whether online or in physical stores, saves customers time. When they can quickly find related products, it makes their shopping trip more efficient and pleasant.</li>

<li style = 'font-size:16px;font-family:Arial'><b>Enhanced Loyalty:</b> Satisfied customers are more likely to return. By consistently meeting their needs and expectations, retailers can build stronger customer loyalty.</li>

<p style = 'font-size:16px;font-family:Arial'>Business value is achieved through grouping products with similar sales/demand behaviors and deriving the correlation between the time series of the sales unit by product. 
</p>

  
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage? </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To build more effective ML and AI models, developers and data scientists must explore unconventional data sources, tools, and techniques to continuously improve their models’ accuracy, speed, and efficacy. However, this creativity often comes at a cost: integrating diverse analytics and data into the development pipeline typically increases complexity, fragility, and operational challenges.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage provides ClearScape Analytics functions which allow users to seamlessly combine a wide range of behavioral, text processing, statistical analysis, and advanced analytic functions with model training and deployment tools on the same platform.</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This allows for rapid development, testing, and validation of new techniques at scale in near-real time so new, more accurate models can easily be deployed to production.</p>

<p></p>    
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration shows how to enhance product hierarchies in the retail industry by leveraging time series analysis and clustering techniques. The objective is to identify clusters of products that exhibit similar sales/demand behaviors across the product hierarchy, which is often defined by product descriptions or business rules. The first part focuses on data exploration and visualization to understand product relationships and assess clustering strategies. The second part covers clustering by correlation, using a co-clustering algorithm and in-database correlation matrix computations to group products with similar sales patterns. The third part covers specific feature engineering, using functions like TD_NORMALIZE_OVERLAP and applying K-means clustering to refine product groupings based on the engineered features.
</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We start by importing the required libraries and setting environment variables and paths (if required).</p>

In [None]:
%%capture
%pip install --upgrade tdnpathviz

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above library needs to be upgraded for some of the functions used in this demonstration. Please be sure to restart the kernel after installing/upgrading the library. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import getpass
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
from teradataml import (
    create_context,
    execute_sql,
    DataFrame,
    in_schema,
    configure,
    display,
    remove_context)

# Silence noisy library warnings in the notebook output
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# Limit how many rows teradataml prints when a DataFrame is displayed
display.max_rows = 5

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_Retail_Product_Hierarchy_Clustering_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_ProductHierarchy_cloud');"
 # Takes about 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_ProductHierarchy_local');"
 # Takes about 70 seconds

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Analyze Raw Data.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage.</p>

In [None]:
dataset = DataFrame(in_schema("DEMO_ProductHierarchy","Retail_Product_Hierarchy"))
dataset

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Retail_Product_Hierarchy table has the division as its top hierarchy level, followed by the department, the section, the product code, and finally the subgroup code. The product ID can identify any product, such as tea, beans, salt, or paper. These products are sold by many stores, on the same or on different dates. The table also stores the number of units sold by each store on a particular date.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.1 - Explore Product Hierarchy (Plotly icicle)</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will try to visualize this product hierarchy to get a better idea of the hierarchy.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we import the <code>ProductHierarchySelector</code> class from the <code>tdnpathviz</code> package, which is used for hierarchical data exploration, and define the list of columns that represent the product hierarchy in the dataset. We then initialize the <code>ProductHierarchySelector</code> with the dataset and the defined product hierarchy.</p>


In [None]:
from tdnpathviz.hierarchy_exploration import ProductHierarchySelector
product_hierarchy = [
    'Division_Code',
    'Department',
    'Section',
    'Product_Code',
    'Subgroup_Code'
]

In [None]:
selector = ProductHierarchySelector(dataset, product_hierarchy)
selector.visualize_only()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We see that the values in the first row of records for the first 3 columns in the hierarchy are GQP, GQP_DDP and GQP_DDP_WDH (Division_Code, Department, Section).  The first bar does not display a color because it doesn't have a parent.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can interactively select specific parts of the product hierarchy. Selecting any blue box in the figure above shows the layers beneath it in the product hierarchy. The box we select determines which part of the hierarchy is used in the following steps.</p>

In [None]:
selector.select()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then retrieve the filtered dataset and plot the sales units.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data is filtered to the selected level of the hierarchy, and this filtered data is used in the rest of the analysis.</p>

In [None]:
dataset_filtered = selector.get_filtered_dataset()
print(dataset_filtered.shape)
dataset_filtered

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.2 In-database plotting</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will group the data by <code>Date_Calendar</code> and <code>Base_Product_ID</code> and calculate the sum of <code>Sales_Units</code> for each group, then visualize the result for each product. The objective is to prepare a time series dataset for further analysis.</p>

In [None]:
time_series = dataset_filtered[['Date_Calendar','Base_Product_ID','Sales_Units']]\
                .groupby(['Date_Calendar','Base_Product_ID']).sum()
time_series
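
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the aggregation concrete, the following is a minimal local sketch in plain Python of what the group-by computes: units are summed over stores for each (date, product) pair. The rows and the store identifiers are hypothetical; the real computation runs in-database through teradataml.</p>

```python
from collections import defaultdict

# Hypothetical rows: (Date_Calendar, Base_Product_ID, store, Sales_Units)
rows = [
    ("2023-05-22", "P1", "S1", 3),
    ("2023-05-22", "P1", "S2", 2),
    ("2023-05-22", "P2", "S1", 7),
    ("2023-05-23", "P1", "S1", 4),
]

# Sum the units over all stores for each (date, product) pair
totals = defaultdict(int)
for date, product, _store, units in rows:
    totals[(date, product)] += units

daily_sales = dict(totals)
print(daily_sales)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each key of <code>daily_sales</code> corresponds to one row of the aggregated time series, and its value corresponds to <code>sum_Sales_Units</code>.</p>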

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The plot below shows the sales for all products in the selected part of the hierarchy.</p>

In [None]:
from tdnpathviz.visualizations import plotcurves
plotcurves(
    time_series, 
    field='sum_Sales_Units', 
    row_axis = 'Date_Calendar', 
    series_id = 'Base_Product_ID', 
    row_axis_type = 'TIMECODE',
    legend='upper right')

In [None]:
time_series.to_sql(table_name = 'time_series',if_exists = 'replace', primary_index = 'Date_Calendar')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Compute Correlation between Time Series</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.1 Pivoting the time series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the pivot function to obtain the number of units sold for each product on every date. The date is kept as the row key, and each product ID becomes a column holding the sum of units sold.</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The time series data is pivoted to create a matrix where each column represents a different product's sales over time.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Missing values in the pivoted data are filled with the mean of the respective columns.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Common substrings are removed from feature names to simplify the dataset.</li>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>First, we exclude zero sales, which would artificially inflate the correlation.</p>

In [None]:
time_series0 = DataFrame('time_series')
time_series = time_series0[time_series0.sum_Sales_Units > 0].dropna() 
time_series
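
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To see why shared zero-sales days inflate the correlation, the sketch below compares the Pearson coefficient of two hypothetical products with and without the days on which both sold nothing. The data and the <code>pearson</code> helper are illustrative assumptions, not part of the demo dataset or of teradataml.</p>

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical sales on days when both products actually sold
a_open = [5, 7, 6, 8, 5, 7]
b_open = [9, 8, 10, 8, 9, 10]

# The same series padded with six days on which both products sold nothing
a_all = a_open + [0] * 6
b_all = b_open + [0] * 6

r_open = pearson(a_open, b_open)  # correlation on selling days only
r_all = pearson(a_all, b_all)     # correlation including shared zero days

print(f"without zeros: {r_open:.2f}, with zeros: {r_all:.2f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this toy data the two products are slightly anticorrelated on selling days, yet the shared zeros push the coefficient close to one, which is the effect the filter above is meant to avoid.</p>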

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will pivot <code>time_series</code> with <code>Base_Product_ID</code> as columns, summing <code>sum_Sales_Units</code>. Missing values in the pivoted DataFrame are filled with the mean of their respective columns.</p>

In [None]:
time_series_pivoted = time_series.pivot(
    columns=time_series.Base_Product_ID,
    aggfuncs=time_series.sum_Sales_Units.sum()
)

time_series_pivoted = time_series_pivoted.fillna(
    'MEAN', 
    time_series_pivoted.columns[1::]
)
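
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough local illustration of the two steps above (pivot, then mean imputation), the following plain-Python sketch works on a few hypothetical rows. It does not use teradataml; the variable names <code>long_rows</code> and <code>wide</code> are assumptions made for the example.</p>

```python
# Hypothetical long-format rows: (Date_Calendar, Base_Product_ID, sum_Sales_Units)
long_rows = [
    ("2023-05-22", "P1", 4.0),
    ("2023-05-22", "P2", 10.0),
    ("2023-05-23", "P1", 6.0),
    ("2023-05-24", "P1", 8.0),
    ("2023-05-24", "P2", 14.0),
]

dates = sorted({d for d, _, _ in long_rows})
products = sorted({p for _, p, _ in long_rows})

# Pivot: one row per date, one column per product, None where no sale was recorded
cells = {(d, p): v for d, p, v in long_rows}
wide = {d: {p: cells.get((d, p)) for p in products} for d in dates}

# Fill each missing cell with the mean of the observed values in its column
for p in products:
    observed = [wide[d][p] for d in dates if wide[d][p] is not None]
    column_mean = sum(observed) / len(observed)
    for d in dates:
        if wide[d][p] is None:
            wide[d][p] = column_mean

print(wide)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Mean imputation keeps every date in the matrix without shifting the column average, at the cost of slightly understating each product's variability.</p>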

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will then clean and rename the features in the pivoted DataFrame:</p>


<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>new_features</code> is created by removing common substrings from the feature names.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>new_vars</code> is an OrderedDict that starts with the <code>Date_Calendar</code> column.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>A loop iterates over <code>new_features</code> and the original columns to populate <code>new_vars</code> with the cleaned feature names.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>time_series_pivoted</code> is updated with the new feature names and the original columns are dropped.</li>

In [None]:
from tdnpathviz.utils import remove_common_substring_from_features
from collections import OrderedDict

# Remove common substring from feature names
new_features = remove_common_substring_from_features(
    features=time_series_pivoted.columns[1:]
)

# Create an ordered dictionary with the new features
new_vars = OrderedDict({'Date_Calendar': time_series_pivoted.Date_Calendar})

# Assign new feature names to corresponding old features
for new_f, old_f in zip(new_features, time_series_pivoted.columns[1:]):
    new_vars[new_f] = time_series_pivoted[old_f]

# Assign the new variables to the DataFrame, dropping old columns
time_series_pivoted = time_series_pivoted.assign(
    drop_columns=True, **new_vars
)

# Return or print the updated DataFrame
time_series_pivoted
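
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea behind the renaming can be sketched locally as follows. This is not the implementation of <code>remove_common_substring_from_features</code>, whose exact behavior is not shown here; it is a hypothetical helper, <code>strip_common_affixes</code>, that removes only a shared prefix and a shared suffix. The column names are made up for the example.</p>

```python
from os.path import commonprefix

def strip_common_affixes(names):
    """Remove the prefix and the suffix shared by all names (illustrative helper)."""
    prefix = commonprefix(names)
    stripped = [name[len(prefix):] for name in names]
    # The common suffix is the common prefix of the reversed strings
    suffix = commonprefix([s[::-1] for s in stripped])[::-1]
    if suffix:
        stripped = [s[:len(s) - len(suffix)] for s in stripped]
    return stripped

# Hypothetical pivoted column names
raw = ["sum_units_1001_total", "sum_units_1002_total", "sum_units_2050_total"]
clean = strip_common_affixes(raw)
print(clean)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Shorter names keep the axis labels of the heatmaps in the next section readable.</p>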


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.2 - Compute the correlation matrix </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this analysis, we will utilize a previously created time series table to explore correlations within the data. Our goal is to identify and group products into clusters based on their correlation, potentially forming subclusters of highly correlated products.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>By using correlation as a clustering criterion, we aim to classify products into different sublevels within the product hierarchy. This classification will be based on sales patterns rather than traditional business descriptions, such as grouping similar items like beans together.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Additionally, this method offers an opportunity to evaluate the quality of the current product hierarchy. We will assess whether the tightest subgroups within the hierarchy make sense from a sales pattern perspective, providing insights into customer behavior and product relationships.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will compute the correlation matrix for the selected features in the pivoted DataFrame and display it.</p>

In [None]:
from tdnpathviz.visualizations import (
    plot_correlation_heatmap,
    compute_correlation_matrix, 
    reorder_correlation_matrix)

configure.val_install_location = "VAL"

features = time_series_pivoted.columns[1::]
features[0:5]

In [None]:
corr = compute_correlation_matrix(time_series_pivoted[features])
corr
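
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, a correlation matrix of this kind holds the pairwise Pearson coefficient of every pair of columns. The sketch below computes one in plain Python on three hypothetical product columns; it assumes Pearson correlation and is not the in-database computation performed by <code>compute_correlation_matrix</code>.</p>

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical pivoted columns: one list of daily sales per product
columns = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],  # moves exactly with P1
    "P3": [4.0, 3.0, 2.0, 1.0],  # moves exactly against P1
}

names = list(columns)
# Nested dict: corr_matrix[a][b] is the correlation between products a and b
corr_matrix = {
    a: {b: pearson(columns[a], columns[b]) for b in names} for a in names
}

for a in names:
    print(a, [round(corr_matrix[a][b], 2) for b in names])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle carries information.</p>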

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.3 - Visualize the correlation matrix</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><code>plot_correlation_heatmap</code> is called with the correlation matrix corr to generate and display the heatmap.</p>

In [None]:
plot_correlation_heatmap(corr_matrix=corr) #, figsize=(40,32), fontsize=.5)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using a correlation heatmap, we can see the matrix with colors indicating the strength of the correlation. Dark blue represents anticorrelation, while red indicates a perfect correlation of one. In practice, perfect correlations are rare, but high correlations around 0.7 or 0.68 are quite good.
The heatmap reveals that the correlations are scattered without a clear pattern. Zooming in on a smaller section of the matrix can help us better understand these relationships.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will zoom the correlation heatmap on a subset of the correlation matrix (first 10 rows and columns) to generate and display a zoomed-in heatmap.</p>

In [None]:
# zoom-in
plot_correlation_heatmap(corr_matrix=corr.iloc[0:10,0:10]) #, figsize=(10,8), fontsize=1)

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.4 - Clustering by correlation (co-clustering)</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To identify clusters within our data, we can reorder the correlation matrix to place highly correlated products next to each other. This technique, known as co-clustering, helps reveal clusters within the data. By reordering the correlation matrix, we can test multiple methods to identify clusters. For instance, if we aim to create 10 clusters, we can run the clustering algorithm to classify products based on their correlations. The resulting reordered matrix will display blocks, each representing a cluster.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each block index corresponds to a cluster number. For example, cluster 0 might contain several products, cluster 1 might have only two products, and cluster 2 could include more products. This visualization helps us understand the grouping of products based on their correlation patterns.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will reorder the correlation matrix using co-clustering. <code>ordered_corr</code> and <code>blocks</code> are created by calling <code>reorder_correlation_matrix</code> with the correlation matrix, specifying the co-clustering method and 10 clusters. The resulting blocks are displayed.</p>

In [None]:
ordered_corr, blocks = reorder_correlation_matrix(corr,method='coclustering', n_clusters=10)
blocks
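
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The effect of reordering can be illustrated with a much simpler stand-in than co-clustering: a greedy threshold grouping, in which a feature joins the first block whose first member it correlates with strongly enough. This is explicitly not the algorithm used by <code>reorder_correlation_matrix</code>; the function <code>group_by_correlation</code>, the 0.7 threshold, and the toy matrix are all assumptions made for the example.</p>

```python
def group_by_correlation(names, corr, threshold=0.7):
    """Greedy grouping: join the first block whose seed correlates above threshold."""
    blocks = []  # each block is a list of feature names; block[0] is its seed
    for name in names:
        for block in blocks:
            if corr[block[0]][name] >= threshold:
                block.append(name)
                break
        else:
            # No existing block is close enough, so this feature seeds a new one
            blocks.append([name])
    return blocks

# Hypothetical correlation matrix for four products
names = ["A", "B", "C", "D"]
corr = {
    "A": {"A": 1.0, "B": 0.1, "C": 0.9, "D": 0.0},
    "B": {"A": 0.1, "B": 1.0, "C": 0.2, "D": 0.8},
    "C": {"A": 0.9, "B": 0.2, "C": 1.0, "D": 0.1},
    "D": {"A": 0.0, "B": 0.8, "C": 0.1, "D": 1.0},
}

blocks = group_by_correlation(names, corr)

# Reorder rows and columns so that members of a block sit next to each other
order = [name for block in blocks for name in block]
reordered = [[corr[r][c] for c in order] for r in order]

print(blocks)
for row in reordered:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After reordering, the high correlations gather in square blocks along the diagonal, which is the pattern to look for in the reordered heatmap.</p>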

<p style = 'font-size:18px;font-family:Arial;color:#00233C'>We will visualize the reordered correlation matrix.</p>

In [None]:
plot_correlation_heatmap(corr_matrix=ordered_corr)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will zoom the correlation matrix by setting the block_index to 2, selecting the indices of the features in the specified block and displaying the heatmap with the subset of the reordered correlation matrix corresponding to the selected block.</p>

In [None]:
# zoom-in
block_index = 2
indices = blocks[blocks.block_index == block_index].index.to_list()
plot_correlation_heatmap(corr_matrix=ordered_corr.iloc[indices,indices]) #, figsize=(10,8), fontsize=1)


This code cell zooms in on the original correlation matrix for a specific block:
- `indices` are created by selecting the feature indices of the specified block.
- `plot_correlation_heatmap` is called with the subset of the original correlation matrix corresponding to the selected block to generate and display a zoomed-in heatmap.

In [None]:
# zoom-in
indices = blocks[blocks.block_index == block_index].feature_index.to_list()
plot_correlation_heatmap(corr_matrix=corr.iloc[indices,indices]) #, figsize=(10,8), fontsize=1)

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.5 - Compute, Cluster and Visualize all in one</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will compute, reorder, and plot the correlation matrix in a single call. <code>ordered_corr</code> and <code>blocks</code> are created by calling <code>plot_correlation_heatmap</code> with the pivoted DataFrame, specifying the co-clustering method, 15 clusters, and requesting the results as output.</p>

In [None]:
ordered_corr, blocks = plot_correlation_heatmap(
    time_series_pivoted[features],
    method              = 'coclustering',
    n_clusters          = 15,
    output_results      = True
)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Time Series Clusters</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Cluster with large correlations</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will now identify the block with the highest correlation. We group the blocks by <code>block_index</code>, take the median of <code>average_corr</code> and the number of features in each block, and sort the result by the median <code>average_corr</code> in descending order.</p>

In [None]:
blocks.groupby('block_index').agg({'average_corr':'median', 'feature' : 'count'})\
        .sort_values('average_corr',ascending=False)
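
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The same ranking can be sketched in plain Python, which also shows how the top block could be picked programmatically instead of being read off the table. The rows below are hypothetical, and <code>best_block</code> is a name introduced only for this example.</p>

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows of the blocks table: (block_index, feature, average_corr)
block_rows = [
    (0, "P1", 0.2),
    (0, "P2", 0.4),
    (0, "P3", 0.3),
    (1, "P4", 0.8),
    (1, "P5", 0.6),
]

# Collect the average_corr values of each block
by_block = defaultdict(list)
for index, _feature, avg_corr in block_rows:
    by_block[index].append(avg_corr)

# One (block_index, median average_corr, feature count) tuple per block,
# sorted by the median in descending order
summary = sorted(
    ((index, median(values), len(values)) for index, values in by_block.items()),
    key=lambda row: row[1],
    reverse=True,
)

best_block = summary[0][0]
print(summary, best_block)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The feature count matters when reading the ranking: a block with only two products can show a high median simply because it contains a single strongly correlated pair.</p>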

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we can see that block_index 5 has the highest median correlation, so we will continue the analysis with this block. If your results differ, or if you wish to analyze another block, change the value of block_index in the cell below; that value is used in the subsequent steps.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 - Visualize a cluster </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will set block_index to 5, the cluster to be visualized in the subsequent steps. We filter the blocks DataFrame to the rows whose block_index column matches this value, extract the feature column from the filtered rows as a list, store it in the selected_products variable, and display it.</p>

In [None]:
block_index = 5 # the cluster we want to visualize
blocks[blocks['block_index'] == block_index]
selected_products = blocks[blocks['block_index'] == block_index].feature.to_list()
selected_products

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will filter the time_series DataFrame to include only the rows whose Base_Product_ID value is in the selected_products list, narrowing the time series data down to the selected products. We will then visualize the filtered data to see the sales of these products on different dates, which is used further in our analysis.</p>

In [None]:
from tdnpathviz.visualizations import plotcurves
time_series[time_series.Base_Product_ID.isin(selected_products)]
plotcurves(
    time_series0,
    select_id=selected_products, 
    field='sum_Sales_Units', 
    row_axis = 'Date_Calendar', 
    series_id = 'Base_Product_ID', 
    row_axis_type = 'TIMECODE',
    legend='upper right')

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 - Time Series Normalization</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the series, including the zero-sales days, more useful for modelling, we need to normalize them. We first calculate the mean and standard deviation of the non-zero sales for each product. We then normalize the sum_Sales_Units field in the time_series0 DataFrame: the mean and standard deviation of sum_Sales_Units are joined for each Base_Product_ID, and the sales data is standardized with them. The normalized DataFrame is then reduced to the original columns.</p>

In [None]:
# Grouping 'sum_Sales_Units' by 'Base_Product_ID' and aggregating mean and std
aggregated_sales = time_series[['Base_Product_ID', 'sum_Sales_Units']].groupby(
    'Base_Product_ID').agg({'sum_Sales_Units': ['mean', 'std']})

# Joining the aggregated statistics back to the original time_series
normalized_time_series = time_series0.join(
    other=aggregated_sales,
    on='Base_Product_ID',
    how='inner',
    rsuffix='r'
)

# Calculating normalized sales (z-score-like normalization).
# The result overwrites 'sum_Sales_Units' so that the normalized values
# are the ones kept by the column selection below.
normalized_time_series = normalized_time_series.assign(
    sum_Sales_Units=(
        normalized_time_series['sum_Sales_Units']
        - normalized_time_series['mean_sum_Sales_Units']
    ) / (1e-10 + normalized_time_series['std_sum_Sales_Units'])
)

# Reordering columns to match the original time_series
normalized_time_series = normalized_time_series[time_series.columns]

# Output the normalized DataFrame
normalized_time_series
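
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The normalization can be sketched locally for a single product as follows. As in the cell above, the mean and the sample standard deviation are taken from the non-zero sales only and then applied to every day, including zero-sales days; the small constant guards against division by zero. The <code>zscore</code> helper and the data are assumptions made for the example.</p>

```python
from statistics import mean, stdev

def zscore(values, eps=1e-10):
    """Standardize values using statistics computed on the non-zero entries only."""
    positive = [v for v in values if v > 0]
    mu = mean(positive)
    sigma = stdev(positive)  # sample standard deviation
    return [(v - mu) / (eps + sigma) for v in values]

# Hypothetical daily sales of one product, including a zero-sales day
sales = [0.0, 2.0, 4.0, 6.0]
normalized = zscore(sales)
print([round(v, 3) for v in normalized])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After this step, products with very different sales volumes share a common scale, so their curves can be compared on one plot.</p>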


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the plotcurves function to plot the normalized time series data. It visualizes the normalized sum_Sales_Units field over Date_Calendar for the selected products (selected_products).</p>

In [None]:
plotcurves(
    normalized_time_series,
    select_id= selected_products, 
    field='sum_Sales_Units', 
    row_axis = 'Date_Calendar', 
    series_id = 'Base_Product_ID', 
    row_axis_type = 'TIMECODE',
    legend='upper right')

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.4 - Plot pair plot</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After getting the normalized series we will check the correlation by plotting the values for different products using the pair plots. If we have something highly correlated, we should see points aligned along a diagonal. In this matrix plot, we can observe the sales units of one product versus the sales units of another product on the same day within the same group. The other data points show a good correlation, as indicated by the trend.</p>

In [None]:
from tdnpathviz.visualizations import pair_plot
pair_plot(time_series_pivoted[selected_products],width = 900, height = 900)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will export the time_series_pivoted DataFrame for the selected_products to a CSV file named examples.csv. This allows for external analysis or sharing of the data.</p>

In [None]:
time_series_pivoted[selected_products].to_csv('examples.csv')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will also take a look at the results after resampling or smoothing. By looking at the smoothed data, we can gain a better understanding, and it will likely affect the correlation. There are many factors to enhance this correlation analysis.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will redo the analysis with a smoothing window of 3 days. </p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will calculate and display the minimum value for each column in the time_series DataFrame. This provides a quick overview of the lowest values in the dataset.</p>

In [None]:
time_series.min()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will create a TDSeries object from the time_series DataFrame. TDSeries wraps a teradataml DataFrame as a series that can be passed as input to the Unbounded Array Framework time series functions. Here, we specify Base_Product_ID as the ID, Date_Calendar as the row index, and sum_Sales_Units as the payload field.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Resample() function transforms an irregular time series into a regular time series. It can also be used to alter the sampling interval for a time series. Here, we will resample the time series data to a daily frequency using linear interpolation, starting from a specified timestamp. The resampled DataFrame is adjusted to match the original column names.</p>

In [None]:
from teradataml import TDSeries, Resample, Smoothma

# Create TDSeries from the time_series DataFrame
data_series_df = TDSeries(
    data=time_series,
    id="Base_Product_ID",
    row_index_style="TIMECODE",
    row_index="Date_Calendar",
    payload_field="sum_Sales_Units",
    payload_content="REAL"
)

# Execute Resample for TDSeries with linear interpolation
resampled_timeseries = Resample(
    data=data_series_df,
    interpolate='LINEAR',
    timecode_start_value="TIMESTAMP '2023-05-22 00:00:00.000000'",
    timecode_duration="DAYS(1)"
).result

# Assign the 'Date_Calendar' field from the row index and reorder columns
resampled_timeseries = resampled_timeseries.assign(
    Date_Calendar=resampled_timeseries.ROW_I
)[time_series.columns]

# Output the resampled DataFrame
resampled_timeseries
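
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>What daily resampling with linear interpolation does to an irregular series can be sketched in plain Python. This is only an illustration of the idea on hypothetical observations; the <code>resample_daily</code> helper is an assumption and does not reproduce the options of the in-database Resample() function.</p>

```python
from datetime import date, timedelta

def resample_daily(points):
    """Turn sorted (date, value) points into a daily series by linear interpolation."""
    daily = []
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        span = (d1 - d0).days
        # Emit d0 and every missing day before d1 on the straight line from v0 to v1
        for k in range(span):
            daily.append((d0 + timedelta(days=k), v0 + (v1 - v0) * k / span))
    daily.append(points[-1])
    return daily

# Hypothetical irregular observations for one product
observations = [
    (date(2023, 5, 22), 10.0),
    (date(2023, 5, 25), 16.0),
    (date(2023, 5, 26), 12.0),
]

daily = resample_daily(observations)
for day, value in daily:
    print(day, value)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The two missing days between 22 May and 25 May are filled along the straight line joining the neighboring observations, giving one value per day.</p>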


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Smoothma() function applies a smoothing function to a time series which results in a series that highlights the time series mean. For non-stationary time series with non-constant means, the smoothing function is used to create a result series. When the result series is subtracted from the original series, it removes the non-stationary mean behavior.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we will create a TDSeries object from the resampled_timeseries DataFrame and then apply exponential moving average smoothing with a lambda value of 0.5 to the time series data. The smoothed DataFrame is adjusted to match the original column names.</p>

In [None]:
# Create a new TDSeries from the resampled timeseries DataFrame
data_series_df = TDSeries(
    data=resampled_timeseries,
    id="Base_Product_ID",
    row_index_style="TIMECODE",
    row_index="Date_Calendar",
    payload_field="sum_Sales_Units",
    payload_content="REAL"
)

# Apply exponential smoothing using Smoothma
smoothed_time_series = Smoothma(
    data=data_series_df,
    ma='EXPONENTIAL',
    lambda1=0.5
).result

# Assign the 'Date_Calendar' from the row index and reorder columns
smoothed_time_series = smoothed_time_series.assign(
    Date_Calendar=smoothed_time_series.ROW_I
)[time_series.columns]

# Output the smoothed DataFrame
smoothed_time_series
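
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of exponential smoothing, the sketch below uses one common formulation, s_t = lambda * x_t + (1 - lambda) * s_(t-1), seeded with the first observation. This is an assumption made for illustration: the exact convention and initialization used by Smoothma() may differ, and <code>exponential_smooth</code> is a hypothetical helper.</p>

```python
def exponential_smooth(values, lam=0.5):
    """Illustrative exponential moving average (not the Smoothma implementation).

    Each smoothed value mixes the current observation (weight lam)
    with the previous smoothed value (weight 1 - lam).
    """
    smoothed = []
    for x in values:
        if not smoothed:
            # Assumption: seed the recursion with the first observation
            smoothed.append(float(x))
        else:
            smoothed.append(lam * x + (1 - lam) * smoothed[-1])
    return smoothed

series = [10.0, 20.0, 10.0, 30.0]   # made-up daily sales
smooth = exponential_smooth(series, lam=0.5)
print(smooth)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With lambda = 0.5, each new observation and the accumulated history carry equal weight, so spikes are damped but still visible.</p>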


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will visualize the smoothed <code>sum_Sales_Units</code> field over <code>Date_Calendar</code> for the selected products (<code>selected_products</code>).</p>

In [None]:
plotcurves(
    smoothed_time_series,
    select_id= selected_products, 
    field='sum_Sales_Units', 
    row_axis = 'Date_Calendar', 
    series_id = 'Base_Product_ID', 
    row_axis_type = 'SEQUENCE',
    legend='upper right')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Data Preparation for Clustering</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 Identify NULL periods in Time Series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We calculate the duration between consecutive sales dates for each product and the results are stored in a DataFrame named df_consecutive_NULLs, sorted by the number of consecutive days without sales in descending order.</p>

In [None]:
time_series = DataFrame(in_schema("DEMO_ProductHierarchy","Retail_Product_Hierarchy"))
time_series = time_series[['Date_Calendar','Base_Product_ID','Sales_Units']]\
                .groupby(['Date_Calendar','Base_Product_ID']).sum()
time_series.materialize()

In [None]:
query = f"""
SELECT 
    Date_Calendar
    ,Base_Product_ID 
    ,CAST((INTERVAL(DURATION) DAY(4)) AS FLOAT) AS FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES
FROM (
    SELECT
          Date_Calendar
          ,Base_Product_ID
          ,sum_Sales_Units
          ,LEAD(Date_Calendar) OVER (PARTITION BY Base_Product_ID ORDER BY Date_Calendar) AS NEXT_DATE
          ,PERIOD(Date_Calendar, NEXT_DATE) AS DURATION
    FROM {time_series._table_name}
    QUALIFY NEXT_DATE IS NOT NULL
    ) A
"""

df_consecutive_NULLs = DataFrame.from_query(query).sort('FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES', ascending=False)
df_consecutive_NULLs
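
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The logic of the query above (pair each date with the next date for the same product, then measure the gap) can be sketched in plain Python. The function name and the sample dates are hypothetical and serve only to illustrate what the LEAD window function computes.</p>

```python
from datetime import date

def gaps_to_next_sale(dates_by_product):
    """Illustrative equivalent of LEAD(...) OVER (PARTITION BY product ORDER BY date).

    Returns (product, date, days_until_next_date) rows, largest gap first.
    The last date of each product has no successor and is skipped,
    mirroring the QUALIFY NEXT_DATE IS NOT NULL clause.
    """
    rows = []
    for product, dates in dates_by_product.items():
        ordered = sorted(dates)
        for current, nxt in zip(ordered, ordered[1:]):
            rows.append((product, current, (nxt - current).days))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Made-up sales dates for two products
sales_dates = {
    "A": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 10)],
    "B": [date(2023, 1, 1), date(2023, 1, 4)],
}
gaps = gaps_to_next_sale(sales_dates)
for row in gaps:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A gap of 1 means two consecutive days with data; anything larger marks a stretch with no rows for that product.</p>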

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We aggregate the data to compute various statistics for each product:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The data is grouped by <code>Base_Product_ID</code>.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Aggregated statistics (minimum, maximum, mean, standard deviation, count, and median) of <code>FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code> are calculated for each product.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The data is filtered to keep only products whose <code>max_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code> is greater than 1, that is, products with at least one gap in their time series.</li>

In [None]:
dataset = df_consecutive_NULLs.groupby('Base_Product_ID')\
            .agg({'FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES' : ['min','max','mean','std','count','median']})
dataset = dataset[dataset.max_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES>1]
dataset

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.2 - K-means</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We’ll focus on identifying null periods or long periods where we do not have any data for a product. If we have an abandoned product, we expect that after its end of life, there will be no data for this product. It’s important to recognize this because it can be misleading if we compute correlations or other statistics on it. We need to spot these periods quickly, as they can indicate data quality issues. For instance, there might be no sales at all for a product, possibly due to a broken supply chain that later resumed. This affects the reliability of our correlation analysis, as we expect a regular time series of sales, whether seasonal or not. In reality, we might encounter missing values or long periods of zero sales, which disrupt the product’s dynamics.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here’s what we aim to do:</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Identify null periods in the time series.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Identify zero sales periods in the time series.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We’ll use a SQL query to count and build periods of non-missing values. By using functions like LEAD and LAG, we can compute the duration of these zero sales periods.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For example, for a product starting at a specific date, we might find a 218-day period without any data. Similarly, another product might have a 197-day period without data. Multiple periods with no data can occur for different products, resulting in scattered or sparse time series.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We’ll analyze each base product to determine the minimum, maximum, standard deviation, count, and median of these missing values. This analysis will be very insightful. For instance, a product might have at least two periods without sales, which could be at the beginning and the end of its lifecycle. We can also calculate the mean of these periods.</p>


In [None]:
from teradatasqlalchemy.types import *
dataset = dataset.assign(
    count_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES=dataset['count_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES'].cast(FLOAT))

dataset.to_sql(table_name = 'kmeans_dataset', temporary = True, primary_index = 'Base_Product_ID')
dataset = DataFrame('kmeans_dataset')

dataset

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will define a list of feature columns to be used in the clustering analysis. The <code>features</code> list includes the columns <code>max_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code>, <code>mean_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code>, <code>std_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code>, and <code>count_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES</code>.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we run K-means for several values of k and plot the results to find the elbow. The <code>plot_elbow</code> function calls the in-database K-means once per candidate number of clusters: first with two clusters, then three, and so on up to ten.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The <code>plot_elbow</code> function uses the below parameters to determine the optimal number of clusters for K-means clustering:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>dataset</code> DataFrame is passed along with the list of feature columns.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>Base_Product_ID</code> column is used as the index.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The maximum number of clusters to consider is set to 10.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Data scaling is enabled.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Additional arguments for the K-means algorithm, such as the random seed, are specified.</li>

In [None]:
from tdnpathviz.visualizations import plot_elbow
features = ['max_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES','mean_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES',
            'std_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES','count_FOLLOWING_NB_DAYS_WITHOUT_SALES_VALUES']
plot_elbow(
    tddf             = dataset,
    features         = features,
    index_columns    = 'Base_Product_ID',
    nb_cluster_max   = 10,
    scaling          = True,
    tdml_KMeans_args = {'seed' : 42}
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The result is the well-known elbow plot, which suggests the optimal number of clusters for K-means: the point where the curve bends like an elbow and adding more clusters yields little further improvement. Two plots are produced, one for the distortion and one for the inertia. Sometimes the elbow is clearer on one plot than on the other.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, four clusters appear to be the best choice for this dataset.</p>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.3 - Cluster with the best k</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The KMeans() function groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid). The algorithm minimizes the objective function, that is, the sum of squared Euclidean distances between each data point and the center of its cluster, as follows:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Specify or randomly select k initial cluster centroids.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Assign each data point to the cluster that has the closest centroid.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Recalculate the positions of the k centroids.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Repeat steps 2 and 3 until the centroids no longer move.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The algorithm does not necessarily find the optimal configuration, as the result depends significantly on the initial randomly selected cluster centers. Users can run the function multiple times to reduce the effect of this limitation.</p>
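
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The four steps above can be sketched in a few lines of pure Python. This is a toy illustration of the iteration on made-up 2-D points, not the in-database KMeans implementation; the function name, the sample points, and the stopping rule (stop when assignments no longer change) are assumptions for the example.</p>

```python
import random

def kmeans(points, k, seed=42, max_iter=100):
    """Toy K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: randomly select k initial centroids among the points
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (squared Euclidean distance)
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Step 4: stop once assignments (and therefore centroids) are stable
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Two well-separated groups of made-up points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>On data this clearly separated the starting points hardly matter, but on real data different seeds can lead to different clusterings, which is why a fixed seed is passed in the cells below.</p>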

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will do K-means clustering on the dataset using the below parameters:
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>KMeans</code> class from <code>teradataml</code> is used to perform clustering.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>Base_Product_ID</code> column is used as the identifier.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The previously defined feature columns are used as the target columns.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The number of clusters is set to <code>k</code> <b>(4)</b>.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>A random seed is specified for reproducibility.</li>
</p>


In [None]:
k = 4

In [None]:
from teradataml import KMeans, KMeansPredict
KMeans_out = KMeans(
    id_column      = "Base_Product_ID",
    target_columns = features,
    data           = dataset,
    num_clusters   = k,
    seed           = 42
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will predict the cluster assignments for the dataset using the trained K-means model. The trained K-means model (<code>KMeans_out.result</code>) and the dataset are passed as arguments.</p>

In [None]:
KMeansPredict_out = KMeansPredict(
    object = KMeans_out.result,
    data   = dataset
)
KMeansPredict_out.result

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.4 - Visualize the clusters</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
We will merge the original dataset with the K-means prediction results and then visualize the clusters using <code>pair_plot</code>.</p>

In [None]:
res = dataset.join(KMeansPredict_out.result, on='Base_Product_ID',lprefix='l', rprefix='r')
res = res.assign(Base_Product_ID=res.l_Base_Product_ID)[['Base_Product_ID'] + features + ['td_clusterid_kmeans']]
res

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Visualize the clusters using pair_plot</p>


In [None]:
from tdnpathviz.visualizations import scatter_plot, pair_plot
pair_plot(res[features + ['td_clusterid_kmeans']], series_id='td_clusterid_kmeans', 
          width=2000, height = 2000,markersize=8)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7.  Identify ZERO SALES periods in Time Series</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function <code>TD_NORMALIZE_OVERLAP_MEET</code> prepares the time series by identifying periods with zero sales. It merges consecutive zero-sales periods that overlap or meet (touch each other) into a single period. This function is very powerful and has been used effectively in various scenarios, including in banking. The approach has two steps:</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Prepare the periods.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Use the TD_NORMALIZE_OVERLAP_MEET function.</li>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You can see the query and the results. For instance, a product might have a period of over nine months with zero sales. This feature is interesting because it allows us to combine and make statistics on these periods, checking if they occur at the beginning or end of the product’s lifecycle.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This helps generate consistent and complete time series before returning to the correlation analysis we discussed earlier. You can also resample or smooth your time series before computing correlations and use a restricted time window to exclude zero sales from your computation. This ensures that the correlations you compute make sense and are not disturbed by other factors like stock shortages.</p>
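
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To illustrate what "overlap or meet" normalization means, here is a small pure-Python sketch that merges half-open periods [start, end) whenever the next period starts on or before the end of the previous one. The function name is hypothetical, and the merged-period counter is this sketch's own bookkeeping; it is only loosely analogous to the <code>NrmCount</code> column returned by the Teradata function.</p>

```python
from datetime import date

def normalize_overlap_meet(periods):
    """Merge periods that overlap or meet (illustrative, not the Teradata function).

    periods: list of (start, end) tuples, interpreted as half-open [start, end).
    Returns a list of (start, end, merged_count) tuples.
    """
    merged = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous period: extend it
            last_start, last_end, count = merged[-1]
            merged[-1] = (last_start, max(last_end, end), count + 1)
        else:
            merged.append((start, end, 1))
    return merged

# Three consecutive zero-sales days followed by an isolated one (made-up dates)
zero_sales = [
    (date(2023, 1, 1), date(2023, 1, 2)),
    (date(2023, 1, 2), date(2023, 1, 3)),
    (date(2023, 1, 3), date(2023, 1, 4)),
    (date(2023, 1, 10), date(2023, 1, 11)),
]
collapsed = normalize_overlap_meet(zero_sales)
for period in collapsed:
    print(period)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three daily periods that touch each other collapse into one three-day period, while the isolated day stays on its own. This is the effect we rely on to turn day-by-day zero-sales rows into long zero-sales stretches.</p>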

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have created a function, <code>add_overlaps</code>, designed to normalize overlapping time periods in a Teradata DataFrame using Teradata's built-in normalization functions. The function accepts a DataFrame and several optional parameters that specify the normalization method, the column name of the unique identifier, and the start and end date columns. It constructs and executes a SQL query to process the periods and returns a new DataFrame with the normalized periods and additional details.
</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The function first attempts to retrieve the table name from the DataFrame.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It constructs a subquery to select the unique identifier and create a PERIOD type from the start and end dates.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>A main query is then built to compute overlaps using the specified normalization method.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The query is printed for debugging purposes.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Finally, the query is executed, and the result is returned as a new DataFrame.</li>

In [None]:
def add_overlaps(df, method='TD_NORMALIZE_OVERLAP', new_period_name='durations', ID_CTP='obligor_id', date_beg='Default_Phase_Start_Date', date_end='Default_Phase_End_Date'):
    """
    Adds overlaps in time periods for a given DataFrame using Teradata's normalization functions.
    
    This function leverages Teradata's `TD_NORMALIZE_OVERLAP` or `TD_NORMALIZE_OVERLAP_MEET` to process 
    period data, ensuring that any overlapping or meeting periods are normalized. This can be useful for 
    financial, project management, or any domain where time period data requires such normalization.

    Args:
        df (teradataml.DataFrame): The input DataFrame containing period data to normalize.
        method (str, optional): The Teradata function to use for normalization. Options are 
            'TD_NORMALIZE_OVERLAP' to handle overlapping periods and 'TD_NORMALIZE_OVERLAP_MEET' to handle 
            periods that either overlap or meet exactly. Defaults to 'TD_NORMALIZE_OVERLAP'.
        new_period_name (str, optional): The name for the new period column created after normalization.
            Defaults to 'durations'.
        ID_CTP (str, optional): The column name for the unique identifier within the DataFrame.
            Defaults to 'obligor_id'.
        date_beg (str, optional): The column name for the start date of the period. Defaults to 
            'Default_Phase_Start_Date'.
        date_end (str, optional): The column name for the end date of the period. Defaults to 
            'Default_Phase_End_Date'.
    
    Returns:
        teradataml.DataFrame: A new DataFrame containing the normalized periods, with additional details
        such as the normalized count and the end of each period.

    Notes:
        - `TD_NORMALIZE_OVERLAP` is used to process overlapping periods by merging them and providing 
          normalized counts.
        - `TD_NORMALIZE_OVERLAP_MEET` handles both overlapping and exactly meeting periods, treating them
          as continuous periods for normalization purposes.
        - The function assumes the input DataFrame is a Teradata DataFrame, utilizing Teradata SQL 
          capabilities for the operations.
    """
    
    try:
        # Attempt to get the table name from the DataFrame
        view_name = df._table_name
    except Exception:
        # Execute the node and set the table name if not already set
        df._DataFrame__execute_node_and_set_table_name(df._nodeid, df._metaexpr)
        view_name = df._table_name
    
    # Subquery to select the ID and create a PERIOD type from start and end dates
    subquery = f"""
                SELECT
                           {ID_CTP}
                       ,   PERIOD(
                                   CAST({date_beg} AS TIMESTAMP(6) WITH TIME ZONE)
                                ,  CAST({date_end} AS TIMESTAMP(6) WITH TIME ZONE)) as {new_period_name}
                    FROM {view_name}
                """
    
    # Main query to compute overlaps using the specified method
    query = f"""
                WITH subtbl({ID_CTP}, {new_period_name}) AS
                   ({subquery}
                    )
                SELECT 
                    A.{ID_CTP}
                ,   A.{new_period_name}
                ,   A.NrmCount
                ,   INTERVAL(A.{new_period_name}) MONTH AS NB_MONTH
                ,   BEGIN(A.{new_period_name}) AS BEGIN_PERIOD
                ,   END(A.{new_period_name}) AS END_PERIOD
                FROM
                (SELECT {ID_CTP}, {new_period_name}, NrmCount
                FROM TABLE (TD_SYSFNLIB.{method}(NEW VARIANT_TYPE(subtbl.{ID_CTP}),
                                                             subtbl.{new_period_name})
                RETURNS (id____ BIGINT, time_duration____ PERIOD(TIMESTAMP(6) WITH TIME ZONE), NrmCount INT)
                HASH BY {ID_CTP}     
                LOCAL ORDER BY {ID_CTP}, {new_period_name})     
                AS DT({ID_CTP}, {new_period_name}, NrmCount)) A
                """
    
    # Print the query for debugging purposes
    print(query)
    
    # Execute the query and return the result as a DataFrame
    return DataFrame.from_query(query)


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will construct a subquery to identify periods where actual sales are zero for each product. It uses the LEAD window function to find the next calendar date for each product and creates a PERIOD type for the duration between the current and next date.</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The subquery selects the base product number and the duration period.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It filters the results to include only those periods where actual sales are zero.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The query is executed, and the result is returned as a DataFrame.</li>

In [None]:
subquery = f"""
SELECT CAST(Base_Product_ID AS BIGINT) AS Base_Product_ID, DURATION
FROM (
    SELECT
      Date_Calendar
      ,Base_Product_ID
      ,sum_Sales_Units
      ,LEAD(Date_Calendar) OVER (PARTITION BY Base_Product_ID ORDER BY Date_Calendar) AS NEXT_DATE
      ,PERIOD(Date_Calendar, NEXT_DATE) AS DURATION
    FROM {time_series._table_name}
    QUALIFY NEXT_DATE IS NOT NULL
    ) A
    WHERE sum_Sales_Units = 0
"""
DataFrame.from_query(subquery)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will construct a subquery to identify periods where actual sales are zero for each product, similar to the previous cell, but it includes additional details like the calendar date and the next date.</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The subquery selects the base product number, calendar date, and next date.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It uses the LEAD window function to find the next calendar date for each product.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>It filters the results to include only those periods where actual sales are zero and the next date is not null.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The query is executed, and the result is returned as a DataFrame.</li>

In [None]:
subquery = f"""
SELECT CAST(A.Base_Product_ID AS BIGINT) AS Base_Product_ID, A.Date_Calendar, A.NEXT_DATE
FROM (
    SELECT
          Date_Calendar
        , Base_Product_ID
        , sum_Sales_Units
        , LEAD(Date_Calendar) OVER (PARTITION BY Base_Product_ID ORDER BY Date_Calendar) AS NEXT_DATE
    FROM {time_series._table_name}
    ) A
    WHERE sum_Sales_Units = 0 AND NEXT_DATE IS NOT NULL
"""
df = DataFrame.from_query(subquery)
df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the add_overlaps function to normalize periods of zero sales for each product. It specifies the method TD_NORMALIZE_OVERLAP_MEET to handle both overlapping and meeting periods.</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The add_overlaps function is called with the DataFrame from the previous cell and appropriate parameters.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The resulting DataFrame, df_zeros, contains the normalized periods.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>The DataFrame is then sorted by the normalized count in descending order.</li>

In [None]:
df_zeros = add_overlaps(
    df, 
    method='TD_NORMALIZE_OVERLAP_MEET',
    new_period_name='durations', 
    ID_CTP='Base_Product_ID', 
    date_beg = 'Date_Calendar', 
    date_end='NEXT_DATE')

df_zeros

df_zeros.sort('NrmCount', ascending=False)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8.  Normalize Time Series</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.1 Normalize the time series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will normalize the sales units by dividing by the total sales units for each product and week. The code below:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Adds new columns for the day of the week and a unique week identifier.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Aggregates sales units by product, date, day, and week.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Normalizes the sales units by dividing by the total sales units for each product and week.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Filters out weeks that have no sales or do not have sales data for all seven days.</li>

In [None]:
dataset = DataFrame(in_schema("DEMO_ProductHierarchy","Retail_Product_Hierarchy"))

In [None]:
dataset = dataset.assign(day_ = dataset.Date_Calendar.day_of_week(), 
                         week_ =  dataset.Base_Product_ID*100000 + dataset.Date_Calendar.week() 
                         + 100*dataset.Date_Calendar.year())

time_series = dataset[['Base_Product_ID','Date_Calendar','day_','week_','Sales_Units']]\
                .groupby(['Base_Product_ID','Date_Calendar','day_','week_']).sum()
time_series

In [None]:
normalized_time_series = time_series.join(
    other   = time_series[['Base_Product_ID','week_','sum_Sales_Units']].
    groupby(['Base_Product_ID', 'week_']).
    agg({'sum_Sales_Units':['sum','count']}),
    on      =['Base_Product_ID','week_'],
    how     ='inner',
    rprefix ='r')

normalized_time_series = normalized_time_series[(normalized_time_series.sum_sum_Sales_Units > 0) & 
                                                (normalized_time_series.count_sum_Sales_Units == 7)]

normalized_time_series = normalized_time_series.assign(
        normalized_Sales_Units = 
        normalized_time_series['sum_Sales_Units']/normalized_time_series['sum_sum_Sales_Units'])

normalized_time_series = normalized_time_series[time_series.columns+['normalized_Sales_Units']]

normalized_time_series.sort('normalized_Sales_Units', ascending=False)
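
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The normalization above can be summarized with a small pure-Python sketch: each day's units are divided by the weekly total for the same product, and weeks with no sales or fewer than seven days are dropped. The function name and the sample rows are hypothetical and only mirror the join-and-filter logic of the previous cell.</p>

```python
from collections import defaultdict

def normalize_by_week(rows):
    """Illustrative weekly normalization.

    rows: list of (product, week, day, units) tuples.
    Returns (product, week, day, share_of_week) for complete, non-empty weeks.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for product, week, day, units in rows:
        totals[(product, week)] += units
        counts[(product, week)] += 1
    return [
        (product, week, day, units / totals[(product, week)])
        for product, week, day, units in rows
        # Keep only weeks with positive sales and all seven days present
        if totals[(product, week)] > 0 and counts[(product, week)] == 7
    ]

# Week 1 is complete (7 days); week 2 has only 3 days and is discarded
rows = [("P", 1, d, float(d)) for d in range(1, 8)] \
     + [("P", 2, d, 5.0) for d in range(1, 4)]
shares = normalize_by_week(rows)
for row in shares:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After this step every retained week sums to 1, so weeks with very different sales volumes become comparable in shape.</p>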

In [None]:
plotcurves(
    normalized_time_series,
    field     = 'normalized_Sales_Units',
    row_axis  = 'day_', 
    series_id = 'week_',
    row_axis_type = 'SEQUENCE',
    legend=None)

In [None]:
plotcurves(
    normalized_time_series[normalized_time_series.Base_Product_ID == 6282552604],#8132739927],
    field     = 'normalized_Sales_Units',
    row_axis  = 'day_', 
    series_id = 'week_',
    row_axis_type = 'SEQUENCE',
    legend=None)

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.2 Filter Outliers using In-DB functions</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OutlierFilterFit() function calculates the lower percentile, upper percentile, count of rows, and median for all the "target_columns" provided by the user. These per-column metrics help the OutlierFilterTransform() function detect outliers in the input table. It also stores parameters from arguments into a FIT table used during transformation.</p>

In [None]:
from teradataml import OutlierFilterFit, OutlierFilterTransform
OutlierFilterFit_out = OutlierFilterFit(
                            data = normalized_time_series,
                            target_columns = "normalized_Sales_Units",
                            group_columns = 'day_')


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OutlierFilterTransform() function filters the outliers from the input teradataml DataFrame. It uses the result DataFrame from OutlierFilterFit() to get statistics such as the median, count of rows, lower percentile, and upper percentile for every column specified in the target columns argument, and filters the outliers in the input data accordingly.</p>

In [None]:
cleaned_normalized_time_series = OutlierFilterTransform(
                                            data=normalized_time_series,
                                            object=OutlierFilterFit_out.result,
                                            data_partition_column = 'day_',
                                            object_partition_column = 'day_').result

cleaned_normalized_time_series
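
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough mental model of percentile-based outlier filtering, the sketch below drops values outside the 5th to 95th percentile range of a single group. The bounds, the nearest-rank percentile definition, and the choice to delete (rather than replace) outliers are assumptions made for illustration; the actual behavior depends on the arguments passed to OutlierFilterFit(), and both helper functions are hypothetical.</p>

```python
import math

def nearest_rank(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p between 0 and 1)."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]

def filter_outliers(values, lower=0.05, upper=0.95):
    """Keep only values between the lower and upper percentile (inclusive)."""
    ordered = sorted(values)
    low_bound = nearest_rank(ordered, lower)
    high_bound = nearest_rank(ordered, upper)
    return [v for v in values if low_bound <= v <= high_bound]

# Made-up normalized shares for one day of the week:
# nineteen typical values plus one very low and one very high value
day_shares = [0.01] + [0.14] * 19 + [0.90]
kept = filter_outliers(day_shares)
print(len(day_shares), "->", len(kept))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the notebook the same idea is applied separately for each value of <code>day_</code>, so a share is judged against other observations for the same day of the week.</p>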

In [None]:
plotcurves(
    cleaned_normalized_time_series[cleaned_normalized_time_series.Base_Product_ID == 6282552604],  #8132739927],
    field     = 'normalized_Sales_Units',
    row_axis  = 'day_', 
    series_id = 'week_',
    row_axis_type = 'SEQUENCE',
    legend=None
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After filtering the outliers from the time series data, we count the number of days for which sales are available in each week and cleanse the data by removing the weeks whose day count is not equal to 7.</p>

In [None]:
cleaned_normalized_time_series.groupby(['Base_Product_ID','week_']).count()

In [None]:
super_cleaned_time_series = cleaned_normalized_time_series.join(
    other   = cleaned_normalized_time_series.groupby(['Base_Product_ID','week_']).count(),
    on      =['Base_Product_ID','week_'],
    how     ='inner',
    rprefix ='r')

super_cleaned_time_series = super_cleaned_time_series[super_cleaned_time_series.count_normalized_Sales_Units == 7]
super_cleaned_time_series = super_cleaned_time_series[cleaned_normalized_time_series.columns]

super_cleaned_time_series

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will pivot the data so that each base product ID and week ID becomes one row, with one column per day of the week holding the sum of the normalized sales units.</p>

In [None]:
pivoted_dataset = super_cleaned_time_series[['Base_Product_ID','week_','day_','normalized_Sales_Units']]\
                    .pivot(columns = super_cleaned_time_series.day_,
                           aggfuncs=super_cleaned_time_series.normalized_Sales_Units.sum())

pivoted_dataset
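
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The effect of the pivot can be pictured with a short pure-Python sketch that turns long rows (product, week, day, share) into one seven-element "week shape" per product and week. The function name is hypothetical, and the sketch assumes days are numbered 1 to 7, matching the <code>sum_normalized_sales_units_1</code> to <code>sum_normalized_sales_units_7</code> columns used later.</p>

```python
from collections import defaultdict

def pivot_week_shapes(rows):
    """Illustrative pivot: one list of 7 daily shares per (product, week).

    rows: list of (product, week, day, share) with day in 1..7 (assumption).
    """
    shapes = defaultdict(lambda: [0.0] * 7)
    for product, week, day, share in rows:
        # Sum shares per day, like the sum() aggregation used in the pivot
        shapes[(product, week)][day - 1] += share
    return dict(shapes)

# One complete, made-up week whose shares sum to 1
long_rows = [("P", 1, d, d / 28) for d in range(1, 8)]
week_shapes = pivot_week_shapes(long_rows)
print(week_shapes[("P", 1)])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each resulting row is a point in a seven-dimensional space, which is the representation K-means needs in order to group weeks with a similar shape.</p>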

In [None]:
pivoted_dataset.to_sql(table_name = 'pivoted_dataset', 
                       if_exists = 'replace', 
                       types={'sum_normalized_sales_units_'+str(i+1) : FLOAT() for i in range(7)})

pivoted_dataset = DataFrame('pivoted_dataset')
pivoted_dataset

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will visualize the normalized and pivoted data using elbow plots and find the appropriate number of clusters</p>

In [None]:
from tdnpathviz.visualizations import plot_elbow
plot_elbow(
    tddf             = pivoted_dataset,
    features         = pivoted_dataset.columns[2::],
    index_columns    = 'Base_Product_ID',
    nb_cluster_max   = 20,
    scaling          = True,
    tdml_KMeans_args = {'seed' : 42}
)

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.3 Clustering data based on the sales</b></p> 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will add a 'DummyId' column to the 'pivoted_dataset' DataFrame using the FillRowId function from the Teradata ML library. The FillRowId() function adds a column of unique row identifiers to the input DataFrame.</p>


In [None]:
from teradataml import FillRowId
res = FillRowId(data=pivoted_dataset,
                row_id_column='DummyId').result

res

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Based on the elbow plot, we set the number of clusters (k) to 7 and retrieve the column names from the third column to the second-to-last column of the res DataFrame. These are the seven daily columns that will serve as clustering features.</p>

In [None]:
k = 7
res.columns[2:-1]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We perform K-Means clustering on the res DataFrame with the KMeans function, using 'DummyId' as the ID column and the previously selected columns as target columns. The number of clusters is set to 7, with a seed of 42 for reproducibility.</p>

In [None]:
KMeans_out = KMeans(
    id_column      = 'DummyId',
    target_columns = res.columns[2:-1],
    data           = res,
    num_clusters   = k,
    seed           = 42
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The KMeansPredict function is used to predict cluster assignments for the res DataFrame based on the K-Means model stored in KMeans_out.</p>

In [None]:
KMeansPredict_out = KMeansPredict(
    object = KMeans_out.result,
    data   = res
)
KMeansPredict_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will filter the K-Means clustering result to include only rows where the cluster ID is greater than -1, indicating valid cluster assignments. The normalized sales units are then visualized for all these clusters to understand the weekly sales pattern of each cluster.</p>

In [None]:
import matplotlib.pyplot as plt

# Keep only rows of the K-Means output that carry a valid cluster ID
valid_clusters = KMeans_out.result[KMeans_out.result.td_clusterid_kmeans > -1]

# One column per day of the week; transpose so that each line in the plot is one cluster
day_columns = ['sum_normalized_sales_units_' + str(i + 1) for i in range(7)]
valid_clusters[day_columns].to_pandas().T.plot(figsize=(15, 8))
plt.xticks(rotation=60)
plt.title("Sales Units By Clusters of Products", fontdict={'fontsize': 16})


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The main focus is on the central shape of the week for each cluster. Typically, the centroids of these clusters are well-formed around a central point, providing a good representation of the weekly pattern.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The plot of the shape of the week shows that the typical weekly shapes in this dataset follow distinct patterns. For example, the blue line starts very low at the beginning of the week, likely representing Sunday, and then shows a decrease on Monday. In contrast, the red line shows an increase towards the end of the week.</p>



<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Thus, by aligning product hierarchies with customer shopping behavior, products are grouped based on similar sales and demand patterns, indicating a strong correlation in their sales time series. </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>These patterns should be interpreted by examining which products belong to each weekly shape. It’s also important to consider narrowing the range in terms of months or seasons, as different shapes may emerge. By tagging the different weekly shapes for various products, we can determine if these patterns remain consistent throughout the year or vary with the seasons. This analysis should be compared with the seasonality of the products to provide a comprehensive understanding.</p>
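
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One way to check whether weekly shapes stay consistent across seasons is sketched below. The product names, season names, and shape labels are hypothetical, and the sketch assumes that the shapes found in each season have already been mapped to a common set of labels.</p>

```python
# Hypothetical weekly-shape tags per product and season (illustrative values only).
shape_tags = {
    "product_a": {"winter": "weekend_peak", "summer": "weekend_peak"},
    "product_b": {"winter": "early_week_peak", "summer": "weekend_peak"},
    "product_c": {"winter": "early_week_peak", "summer": "early_week_peak"},
}

def split_by_consistency(tags):
    """Separate products whose shape is the same in every season from those whose shape varies."""
    stable, seasonal = [], []
    for product, by_season in sorted(tags.items()):
        if len(set(by_season.values())) == 1:
            stable.append(product)
        else:
            seasonal.append(product)
    return stable, seasonal

stable, seasonal = split_by_consistency(shape_tags)
print("Same shape all year:", stable)
print("Shape varies by season:", seasonal)
```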

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ProductHierarchy');" 
#Takes 45 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024 All Rights Reserved
        </div>
    </div>
</footer>