<header>
   <p  style='font-size:36px;font-family:Arial;color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Anomaly Detection in Robot Welding Process<br> Trusted AI
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>



<p style = 'font-size:18px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Detecting anomalies early reduces issues and delays in many industries, especially manufacturing. Past approaches include engineering rules and graph and deep learning methods, yet detecting every existing anomaly remains difficult. In addition, companies strive to minimize false positives, cope with the diversity of sensors and metrology issues, and deliver actionable insights at a business pace. Teradata Vantage and ClearScape Analytics address these needs: users can execute every step of anomaly detection, from data preparation and exploration to model training, evaluation, and adjustment. These analyses can improve the process and the accuracy of anomaly detection.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Spot Welding Quality Assessment</b></p>
<p style = 'font-size:16px;font-family:Arial'>Spot welding is a common technique used for welding car body panels, particularly in the assembly of smaller parts and components. Spot welding involves using a pair of copper electrodes to apply a series of short, high-current welding pulses to the metal, fusing the parts together at specific points or “spots”.</p>

<p style = 'font-size:16px;font-family:Arial'>The automotive industry is known for its high level of automation, and spot welding is one of the most automated processes, heavily reliant on robots to improve efficiency, reduce labor costs, and improve the consistency and quality of the finished product. Poor welding quality is rare, but even so, the consequences of poor quality may not be negligible in terms of rework costs and customer satisfaction, especially when quality issues are detected too late.</p>

<img  src="images/AnomalyWelding.png"/>

<p style = 'font-size:16px;font-family:Arial'>Spot welding is a resistance welding process that uses a large electrical current. There are many ways to assess the quality of a spot, such as tensile or ultrasonic testing of the weld strength, or analysis of the welding current measured and recorded during the welding process. In this demo, we focus on anomalies in the welding spot that show up in the welding current, and more specifically in the resistance, i.e. the voltage-to-current ratio, which affects the quality of the weld. The shape of the resistance curve depends on many factors, such as the nature of the materials, the geometry, and the quality of the electrodes.</p>


<p style = 'font-size:18px;font-family:Arial'><b>Business Values</b></p>
<li style = 'font-size:16px;font-family:Arial'>Improve accuracy in the production and manufacturing process.</li>
<li style = 'font-size:16px;font-family:Arial'>Reduce the number of false positive anomalies detected in a system.</li>
<li style = 'font-size:16px;font-family:Arial'>Decrease additional costs and time wasted due to undetected anomalies.</li>
<li style = 'font-size:16px;font-family:Arial'>Determine patterns and significant factors that lead to anomalies.</li>
<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial'>Many organizations fail to realize value from their ML and AI investments due to a lack of scale. It is estimated that for broad adoption across many industries, the number of models and model deployments needs to scale 100-1000x larger than their organizations currently support.</p>
<p style = 'font-size:16px;font-family:Arial'>The unique massively-parallel architecture of Teradata Vantage allows users to prepare data, train, evaluate, and deploy models at unprecedented scale.</p>
<p style = 'font-size:16px;font-family:Arial'>In this particular use case, the volume of machine sensor data was so great that millions of ML models were created to derive analytic features that ultimately deployed tens of thousands of models for real-time scoring. This extent of scale is only possible by combining the power of Vantage with native ClearScape Analytic functions.</p>



<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

<p style = 'font-size:16px;font-family:Arial'>Let us start by installing the necessary libraries.</p>

In [None]:
%%capture
!pip install lime
!pip install scikit-learn==1.1.3

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>After installing the above libraries, please restart the kernel. The simplest way is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import json
import getpass
import pandas as pd
import datetime
from teradataml import *

# import tdsense
# from tdsense.plot import plotcurves

import numpy as np # linear algebra
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
# from tdsense.clustering import hierarchy_dendrogram, hierarchy_clustering
# from tdnpathviz.visualizations import plotcurves
%matplotlib inline

from sklearn import datasets
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score, f1_score,confusion_matrix, roc_curve, ConfusionMatrixDisplay
import time
import warnings
import pytz
import lime

import os
from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION
# Set java path

os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA_HOME)
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA)[:-5]

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from collections import defaultdict
import plotly.offline as offline
offline.init_notebook_mode()


from teradataml.dataframe.sql_functions import case
from teradataml import db_drop_table
configure.byom_install_location = "mldb"

display.max_rows = 5
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=AnomalyDetection.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_AnomalyDetection_cloud');"
# Takes about 50 seconds
%run -i ../run_procedure.py "call get_data('DEMO_AnomalyDetection_local');"
# Takes about 2 minutes 30 seconds

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let us start by creating a teradataml DataFrame, a "Virtual DataFrame" that points directly to the dataset in Vantage.</p>



In [None]:
Sensor_Data = DataFrame(in_schema('DEMO_AnomalyDetection', 'Sensor_Data'))
Sensor_Data

In [None]:
Sensor_Data.shape

<p style = 'font-size:16px;font-family:Arial'>We get the above data from sensors. We focus on one plant (PLANT=1) and one robot (ROBOT_ID=41). The Partition_ID is the type of welding, ID is the WELDING_ID, X is time required for welding in ms and Y is the RESISTANCE. We create a view with the columns required to get data with proper column names.</p>

In [None]:
%%capture
query = f"""
REPLACE VIEW DEMO_AnomalyDetection.V_dataset_01 AS
SELECT
    1 AS PLANT
,   {41} AS ROBOT_ID
,   CAST(A.PARTITION_ID AS BIGINT) AS WELDING_TYPE
,   CAST((DATE '{str(datetime.datetime.now()).split(' ')[0]}'  + FLOOR((WELDING_ID-700*WELDING_TYPE)/100))  AS DATE FORMAT 'YYYY-MM-DD') AS WELDING_DAY
,   CAST(A.ID AS BIGINT) AS WELDING_ID
,   CAST(A.X AS INTEGER) AS TIME_MS
,   A.Y AS RESISTANCE
FROM DEMO_AnomalyDetection.Sensor_Data A
"""
execute_sql(query)

In [None]:
welding_dataset_new = DataFrame(in_schema('DEMO_AnomalyDetection', 'V_dataset_01'))
welding_dataset_new

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.1 - Some aggregations and visualization. </b></p>


<p style = 'font-size:16px;font-family:Arial'>We first aggregate the minimum, maximum, and count of the time samples for each welding, then look at the histogram of the welding durations.</p>
<p style = 'font-size:16px;font-family:Arial'>A histogram is a good way to assess a distribution. To cope with scale, it is recommended to compute the histogram bins in-database and leverage the Massively Parallel Architecture of Teradata Vantage. For that, we use the Histogram function of teradataml, which pushes the computations down to Vantage.</p>
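<p style = 'font-size:16px;font-family:Arial'>To make the binning idea concrete, here is a minimal local sketch of Scott's rule, which derives the bin width from the standard deviation and the number of observations. It is an illustration in plain Python and not the in-database implementation: the helper names <b>scott_bin_edges</b> and <b>bin_counts</b> and the sample values are hypothetical, and the Vantage function may round or place bins differently.</p>

```python
import math
import statistics


def scott_bin_edges(values):
    """Return equal-width bin edges whose width follows Scott's rule."""
    n = len(values)
    lo, hi = min(values), max(values)
    if n < 2 or lo == hi:
        return [lo, hi]  # degenerate case: a single bin
    sigma = statistics.stdev(values)       # sample standard deviation
    width = 3.49 * sigma * n ** (-1 / 3)   # Scott's rule bin width
    n_bins = max(1, math.ceil((hi - lo) / width))
    step = (hi - lo) / n_bins              # stretch the bins to cover [lo, hi]
    return [lo + i * step for i in range(n_bins + 1)]


def bin_counts(values, edges):
    """Count the values falling in each bin (the last bin is right-inclusive)."""
    step = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(int((v - edges[0]) / step), len(counts) - 1) if step else 0
        counts[idx] += 1
    return counts


# Hypothetical per-welding sample counts, for illustration only.
sample_counts = [480, 500, 505, 510, 495, 520, 600, 498, 502, 507]
demo_edges = scott_bin_edges(sample_counts)
print(list(zip(demo_edges[:-1], demo_edges[1:], bin_counts(sample_counts, demo_edges))))
```

<p style = 'font-size:16px;font-family:Arial'>On a real table this loop would require moving every row to the client, which is exactly what the in-database function avoids.</p>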

In [None]:
welding_duration_ms = welding_dataset_new. \
                        groupby(['PLANT','ROBOT_ID','WELDING_TYPE', 'WELDING_ID']). \
                        agg({'TIME_MS':['min','max','count']})
welding_duration_ms

In [None]:
from teradataml import Histogram
obj = Histogram(data=welding_duration_ms,
                    target_columns="count_TIME_MS",
                    method_type="Scott")
res = obj.result.sort('MinValue')
res

<p style = 'font-size:16px;font-family:Arial'>We have calculated the histogram values using teradataml functions. ClearScape Analytics integrates easily with third-party visualization tools such as Tableau and Power BI, and with Python modules such as plotly and seaborn. We can do all the calculations and pre-processing in Vantage and pass only the necessary information to the visualization tool. This makes the calculation faster and reduces the time spent moving data between tools. We transfer data for this and the subsequent visualizations only where necessary.</p>

In [None]:
res = obj.result.sort('MinValue').to_pandas()
res['duration_ms'] = [str(row['MinValue'])+'-'+str(row['MaxValue']) for i,row in res.iterrows()]
res.plot(x='duration_ms',y='CountOfValues',kind='bar', figsize=(15,10), legend=False,xlabel='Duration(ms)', ylabel='Welding Counts')

<p style = 'font-size:16px;font-family:Arial'>In the above histogram we can see the bins between the Min and the Max value of the durations and the welding counts.</p> 
<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.2 - More advanced processing using window functions and delta_t </b></p>
<p style = 'font-size:16px;font-family:Arial'>Resistance is an important parameter in resistance welding. The resistance should not vary too much. If there are any significant changes in resistance over time, it could indicate an issue with the weld quality. For example, an unusually high resistance could indicate poor contact between the parts being welded or a problem with the welding equipment.</p>

In [None]:
welding_dataset_new.loc[welding_dataset_new.WELDING_ID == 854]

In [None]:
from tdnpathviz.visualizations import plotcurves
plotcurves(welding_dataset_new.loc[welding_dataset_new.WELDING_ID == 854],field='RESISTANCE',row_axis='TIME_MS', series_id='WELDING_ID',select_id=None)

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the variation of the resistance of the welding with respect to time. We see that the most interesting part lies between 40 and 400ms from the start of the curve.</p>

<p style = 'font-size:16px;font-family:Arial'>Next, we smooth the resistance by applying a window function and taking the mean value over the window.</p>


In [None]:
# curve smoothing
window_for_smoothing = welding_dataset_new.RESISTANCE.window(
                            partition_columns   = "WELDING_ID",
                            order_columns       = 'TIME_MS',
                            window_start_point  = -15,
                            window_end_point    = 15
)
welding_dataset_smooth = welding_dataset_new.assign(RESISTANCE_SMOOTHED = window_for_smoothing.mean())

In [None]:
id_curve = 854
single_welding = welding_dataset_smooth[welding_dataset_smooth.WELDING_ID == id_curve].sort('TIME_MS')
single_welding

In [None]:
figure = Figure(width=1000, height=400, image_type="jpg",
                        heading="RESISTANCE and RESISTANCE SMOOTHED")
plot = single_welding.plot(x=single_welding.TIME_MS, y=[single_welding.RESISTANCE, single_welding.RESISTANCE_SMOOTHED],
                    style=['blue', 'red'],xlabel='time in ms', ylabel='resistance ',figure=figure)
plot.show()

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the variation of the resistance of the welding with respect to time and the smoothed resistance, as shown by the Red line, after applying the window function.</p>

<p style = 'font-size:16px;font-family:Arial'>The window function generates a Window object on a teradataml DataFrame column to run window aggregate functions.</p>
<p style = 'font-size:16px;font-family:Arial'>The function lets the user specify the window for different types of computations:</p>
<li style = 'font-size:16px;font-family:Arial'>Cumulative</li>
<li style = 'font-size:16px;font-family:Arial'>Group</li>
<li style = 'font-size:16px;font-family:Arial'>Moving</li>
<li style = 'font-size:16px;font-family:Arial'>Remaining</li>
<p style = 'font-size:16px;font-family:Arial'>By default, a window from Unbounded Preceding to Unbounded Following is used for the calculation.</p>
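<p style = 'font-size:16px;font-family:Arial'>As an illustration of the moving window used for smoothing above (15 rows before to 15 rows after the current row), the following plain-Python sketch computes the same kind of centered moving average on a list. The function name <b>centered_moving_average</b> and the sample values are hypothetical, and the sketch assumes the rows are already ordered by time within a single welding.</p>

```python
def centered_moving_average(values, before=15, after=15):
    """Mean over a window spanning `before` rows back and `after` rows ahead.

    Near the edges the window is truncated to the rows that exist, which is
    how a moving window aggregate behaves at the start and end of a partition.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - before)
        stop = min(len(values), i + after + 1)
        window = values[start:stop]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical resistance samples; a small window keeps the result easy to check.
demo_resistance = [1.0, 2.0, 3.0, 4.0]
print(centered_moving_average(demo_resistance, before=1, after=1))
# [1.5, 2.0, 3.0, 3.5]
```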

<p style = 'font-size:16px;font-family:Arial'>Smoothing the resistance curve with a window function helps to eliminate noise and makes the overall trend easier to see. Next, we approximate the derivative: the lead function returns the smoothed resistance at the next time step, and we subtract the current smoothed value from it. The derivative indicates how quickly the resistance is changing, which is a useful measure for detecting anomalies and predicting potential issues.</p>
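<p style = 'font-size:16px;font-family:Arial'>The lead-based difference can be sketched locally as follows. This is a plain-Python illustration with a hypothetical function name, <b>lead_difference</b>, and made-up values. The last row has no following value, so its difference is set to zero, mirroring the zero-if-null handling in the notebook code.</p>

```python
def lead_difference(values):
    """Difference between the next value (the 'lead') and the current one.

    The last element has no lead, so its difference is reported as 0.0.
    """
    diffs = []
    for i, current in enumerate(values):
        if i + 1 < len(values):
            diffs.append(values[i + 1] - current)
        else:
            diffs.append(0.0)
    return diffs


# Hypothetical smoothed resistance values ordered by time.
demo_smoothed = [1.0, 3.0, 6.0, 5.0]
print(lead_difference(demo_smoothed))
# [2.0, 3.0, -1.0, 0.0]
```

<p style = 'font-size:16px;font-family:Arial'>The result is a per-row difference. It equals a derivative per millisecond only if consecutive rows are one millisecond apart, which is an assumption about the sampling rate.</p>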


In [None]:
# let's compute the lead
window_for_lead = welding_dataset_smooth.RESISTANCE_SMOOTHED.window(
                            partition_columns   = "WELDING_ID",
                            order_columns       = 'TIME_MS')

In [None]:
welding_dataset_smooth = welding_dataset_smooth.assign(RESISTANCE_SMOOTHED_AFTER = window_for_lead.lead())
welding_dataset_smooth = welding_dataset_smooth.assign(DERIVATIVE = (welding_dataset_smooth.RESISTANCE_SMOOTHED_AFTER - welding_dataset_smooth.RESISTANCE_SMOOTHED).zeroifnull())
welding_dataset_smooth.sort(['WELDING_ID','TIME_MS'])

In [None]:
id_curve = 854
single_welding_subplot = welding_dataset_smooth[welding_dataset_smooth.WELDING_ID == id_curve].sort('TIME_MS')
single_welding_subplot

In [None]:
from teradataml import subplots
# fig, axes = subplots(grid = {(1, 1): (1, 1),(2, 1): (1, 2)})
# Plot the raw and smoothed resistance on the first axis.
fig, axes = subplots(nrows=2, ncols=1)
plot = single_welding_subplot.plot(x=single_welding_subplot.TIME_MS, 
                    y=[single_welding_subplot.RESISTANCE, single_welding_subplot.RESISTANCE_SMOOTHED],
                    legend=["RESISTANCE", "RESISTANCE SMOOTHED"],
                    figure=fig,
                    style=['blue', 'red'],xlabel='time in ms', ylabel='resistance ',               
                    ax=axes[0])

# Plot the derivative on the second axis.
plot = single_welding_subplot.plot(x=single_welding_subplot.TIME_MS, 
                    y=single_welding_subplot.DERIVATIVE,
                    legend=["DERIVATIVE"],
                    figure=fig,
                    style="red",xlabel='time in ms', ylabel='derivative ' ,              
                    ax=axes[1])
plot.show()

<p style = 'font-size:16px;font-family:Arial'>We see that the most interesting part lies between 40 and 400ms from the start of the curve, so we plot only that subset.</p>

<p style = 'font-size:16px;font-family:Arial'>It is hard to assess the diversity of curve shapes in this plot since many of them are superimposed. However, in the middle of the picture we see a sharp drop that looks unusual. We also suspect that there are shifts in time and height.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Feature Engineering</b></p>

In [None]:
welding_dataset_new.columns

<p style = 'font-size:16px;font-family:Arial'>We will create a feature table by using different functions on the Resistance column. Valid values for functions are: 'count', 'sum', 'min', 'max', 'mean', 'std', 'percentile', 'unique','median', 'var', 'skew', 'kurtosis'. </p>
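<p style = 'font-size:16px;font-family:Arial'>For intuition, the sketch below computes a few of these aggregates per welding in plain Python, keeping only samples recorded after 20 ms. It is a small local illustration, not a replacement for the in-database aggregation: the function name <b>curve_features</b>, the tuple layout, and the sample rows are hypothetical.</p>

```python
import statistics
from collections import defaultdict


def curve_features(rows, min_time_ms=20):
    """Aggregate resistance statistics per welding for samples after min_time_ms.

    `rows` is an iterable of (welding_id, time_ms, resistance) tuples.
    """
    grouped = defaultdict(list)
    for welding_id, time_ms, resistance in rows:
        if time_ms > min_time_ms:
            grouped[welding_id].append(resistance)
    result = {}
    for welding_id, values in grouped.items():
        result[welding_id] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two points.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return result


# Hypothetical rows: (WELDING_ID, TIME_MS, RESISTANCE).
demo_rows = [(854, 10, 9.0), (854, 30, 1.0), (854, 40, 2.0), (854, 50, 3.0), (855, 30, 4.0)]
demo_features = curve_features(demo_rows)
print(demo_features[854])
```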

In [None]:
features = welding_dataset_new.loc[welding_dataset_new.TIME_MS > 20,:]. \
        groupby(welding_dataset_new.columns[0:5]). \
        agg({
            'TIME_MS':['min','max'],
            'RESISTANCE':['count', 'sum', 'min', 'max', 'mean', 'std', 'percentile', 'unique','median', 'var','skew','kurtosis']
        })
features

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Anomaly Detection on Sensor Data</b></p>
    
<p style = 'font-size:16px;font-family:Arial'>Let's start by getting the feature columns from the feature table.</p>

In [None]:
feature_names = features.columns[7::]
feature_names

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.1 Clustering by curve shape</b></p>
<p style = 'font-size:16px;font-family:Arial'>To cluster time series by shapes, we will use the Dynamic Time Warping (DTW) distance that measures the similarity between two time series. This distance is well adapted to this kind of problem since it provides robustness to shifts in time and height.</p>
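<p style = 'font-size:16px;font-family:Arial'>To show what the DTW distance measures, here is a minimal sketch of the classic dynamic-programming formulation in plain Python, using the absolute difference as the local cost. It is the textbook quadratic algorithm for illustration, not the in-database function, and the name <b>dtw_distance</b> and the sample curves are hypothetical.</p>

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance."""
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Two hypothetical curves with the same shape, one delayed by a sample.
curve_a = [0, 1, 2, 1, 0]
curve_b = [0, 0, 1, 2, 1, 0]
print(dtw_distance(curve_a, curve_b))  # 0.0: the warping absorbs the delay
print(dtw_distance([0, 0], [1, 1]))    # 2.0: a constant offset is not absorbed
```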

<p style = 'font-size:16px;font-family:Arial'><b>Distance Matrix in-database Computations</b></p>

<p style = 'font-size:16px;font-family:Arial'>ClearScape Analytics offers an in-database dynamic time warping function, callable in SQL as TD_DTW, which measures the similarity of two time series. It uses the FastDTW algorithm, an approximation of Dynamic Time Warping (DTW) that is efficient in both time and space. The function computes, at scale, the DTW distances between one reference curve and a set of curves, a many-to-one approach. We want to compute the distance matrix of our subset, i.e. the DTW distance between each pair of curves. Because the DTW distance is symmetric, the distance matrix is symmetric too, so we only need to compute its triangular part. We wrapped this computation in the tdsense package, which calls the TD_DTW function and iterates over the matrix rows to compute and store the whole triangular distance matrix in a table.</p>
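<p style = 'font-size:16px;font-family:Arial'>Because only the triangular part is stored, a client that needs the full matrix has to mirror it. The sketch below shows one way to do that in plain Python. The function name <b>to_square_matrix</b>, the (id_1, id_2, distance) tuple layout, and the sample distances are assumptions made for illustration and may not match the layout of the stored table.</p>

```python
def to_square_matrix(pairs):
    """Expand triangular (id_1, id_2, distance) records into a symmetric matrix.

    Returns the sorted list of ids and a list-of-lists matrix with a zero diagonal.
    """
    ids = sorted({cid for id_1, id_2, _ in pairs for cid in (id_1, id_2)})
    position = {cid: k for k, cid in enumerate(ids)}
    matrix = [[0.0] * len(ids) for _ in ids]
    for id_1, id_2, distance in pairs:
        i, j = position[id_1], position[id_2]
        matrix[i][j] = distance
        matrix[j][i] = distance  # symmetry: d(a, b) == d(b, a)
    return ids, matrix


# Hypothetical triangular distances between three curves.
demo_pairs = [(854, 855, 1.5), (854, 856, 4.0), (855, 856, 2.5)]
demo_ids, demo_matrix = to_square_matrix(demo_pairs)
print(demo_ids)
print(demo_matrix)
```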

In [None]:
overview = welding_dataset_new.groupby('WELDING_DAY').count(distinct=True)
dates = list(overview.to_pandas().reset_index()['WELDING_DAY'].values.astype('str'))
dates

In [None]:
subset = welding_dataset_new[ \
                 (welding_dataset_new['PLANT'] == 1) & \
                 (welding_dataset_new['ROBOT_ID'] == 41) & \
                 (welding_dataset_new['WELDING_TYPE'].isin([8, 9])) & \
                 (welding_dataset_new['WELDING_DAY'].isin(dates)) \
                ]

In [None]:
subset_zoom = subset[(subset.TIME_MS < 400) & (subset.TIME_MS > 40)]
subset_zoom.shape

<p style = 'font-size:16px;font-family:Arial'>The subset of data we have taken contains 7 columns and 344,622 rows.</p>

<p style = 'font-size:16px;font-family:Arial'>Because this is a 2-CPU system, the computation below takes more than 2 hours for about 350k rows, so we have precomputed the matrix and stored it in a database table.</p>

<p style = 'font-size:16px;font-family:Arial'><i>Note: to recompute the matrix anyway, change the condition in the code below from <b>False</b> to <b>True</b>.</i></p>

In [None]:
if False:
    dtw_matrix = dtw_distance_matrix_computation2(subset_zoom,field='RESISTANCE',
                                     table_name=dtw_result_table,
                                     schema_name = Param['database'],
                                     row_axis='TIME_MS',
                                     series_id = 'WELDING_ID')
else:
    dtw_matrix = DataFrame(in_schema('DEMO_AnomalyDetection','DTW_Matrix'))

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.2 Hierarchical clustering with Scipy</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that the distance matrix is available, we can perform the clustering. Here, we use the open-source package SciPy and its cluster.hierarchy module, which is wrapped in the tdsense package for convenience.</p>

<p style = 'font-size:16px;font-family:Arial'>Hierarchical clustering is an alternative class of clustering algorithms that produce 1 to n clusters, where n is the number of observations in the data set. As you go down the hierarchy from 1 cluster (contains all the data) to n clusters (each observation is its own cluster), the clusters become more and more similar (almost always).</p>
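<p style = 'font-size:16px;font-family:Arial'>The bottom-up merging can be sketched in a few lines of plain Python. Note that this sketch deliberately uses average linkage on a precomputed distance matrix because it is short to write, whereas the notebook relies on SciPy with the Ward criterion. The function name <b>agglomerative_clusters</b> and the sample data are hypothetical, and the naive search is only suitable for tiny inputs.</p>

```python
def agglomerative_clusters(matrix, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    `matrix` is a full symmetric distance matrix; returns one label per row.
    """
    clusters = [[i] for i in range(len(matrix))]  # start: one cluster per point

    def average_distance(c1, c2):
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]

    labels = [0] * len(matrix)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 1-D positions forming two obvious groups.
demo_points = [0, 1, 2, 10, 11]
demo_distances = [[abs(p - q) for q in demo_points] for p in demo_points]
print(agglomerative_clusters(demo_distances, n_clusters=2))
# [0, 0, 0, 1, 1]
```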

In [None]:
dtw_matrix_loc = dtw_matrix.sort(columns=['WELDING_ID_2','WELDING_ID_1']).to_pandas(all_rows=True)
dtw_matrix_loc

In [None]:
from tdsense.clustering import hierarchy_dendrogram, hierarchy_clustering
linked, labelList = hierarchy_dendrogram(dtw_matrix_loc, cluster_distance = 'ward')

<p style = 'font-size:16px;font-family:Arial'>The dendrogram is useful for visualizing the structure of the hierarchical clustering and identifying the optimal number of clusters to use for further analysis. The optimal number of clusters can be determined by examining the dendrogram to identify a level at which the clusters start to merge more slowly or by using a threshold for the maximum distance between clusters.</p>

<p style = 'font-size:16px;font-family:Arial'>The dendrogram above shows how the hierarchical clustering algorithm merged the data points into clusters based on their pairwise distances, using the Ward linkage criterion. The dendrogram is a summary of the distance matrix. The X axis carries the WELDING_ID labels, which are not legible because there are too many curves to display. Looking at the dendrogram, we see about 6 clusters. Cutting the tree into 6 clusters gives the following result.</p>

In [None]:
cluster = hierarchy_clustering(linked, labelList, n_clusters=6)
cluster.head()

<p style = 'font-size:16px;font-family:Arial'>The above dendrogram is cut into only 6 clusters, with the colors representing the different clusters. Now, we plot the resistance curves for each cluster.</p>

In [None]:
fig, ax = plt.subplots(2,3,figsize=(20,10))
colors = cluster[['cluster','leaves_color_list']].copy().drop_duplicates()
for k in range(6):
    plt.subplot(2,3,k+1)
    img = plotcurves( subset_zoom,
                      field='RESISTANCE',
                      row_axis='TIME_MS',
                      series_id='WELDING_ID',
                      select_id=list(cluster[cluster.cluster ==k].CURVE_ID.values),
                      noplot=True)
    plt.imshow(img)
    plt.title('cluster : ' +str(k) + '\n' + str(cluster.groupby('cluster').count()['CURVE_ID'][k]) + ' obs.',fontdict = {'fontsize' : 10, 'color':colors.leaves_color_list.values[k]})
    plt.axis('off')

<p style = 'font-size:16px;font-family:Arial'>When we plot the curves per cluster, we spot the curves with a sharp drop (cluster 4). These are the curves we are interested in, i.e. the curves exhibiting the anomaly we are looking for. The other clusters look more or less similar. By monitoring the resistance over time and calculating its derivative, we can detect sudden changes or anomalies, which might indicate a problem with the welding process, such as a sudden drop in current or a sudden increase in resistance.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.3 Create the anomaly dataset</b></p>
<p style = 'font-size:16px;font-family:Arial'>Now we create a table containing the anomaly flag that will be the target of a supervised machine learning model or a relevant KPI to monitor in production dashboards.</p>



In [None]:
target = cluster.copy().drop('leaves_color_list',axis=1)
target = target[target.cluster.isin([1,2])]
target['WELDING_ID'] = target['CURVE_ID']
target['anomaly'] = 0
target.loc[target.cluster==2,'anomaly'] = 1
target.drop(['cluster','CURVE_ID'],axis=1, inplace=True)
target.groupby('anomaly').count().plot(y='WELDING_ID',kind='bar',figsize=(10,10))
copy_to_sql( target,
                  table_name = 'Anomaly_Target',
                  if_exists='replace',
                  primary_index='WELDING_ID')

In [None]:
anomalies = DataFrame('Anomaly_Target')
anomalies

<p style = 'font-size:16px;font-family:Arial'>The above anomaly data has the welding ID and the anomaly flag.</p>
<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.4 Build the analytical dataset </b></p>

<p style = 'font-size:16px;font-family:Arial'>We prepare the analytical dataset by joining the feature table with the anomaly table using the Welding ID so that we get the anomalies for the weldings.</p>

In [None]:
ADS = features[['WELDING_ID']+feature_names].join(other=anomalies, how='inner', on='WELDING_ID=WELDING_ID',rsuffix='r',lsuffix='l')
ADS = ADS.assign(WELDING_ID=ADS.WELDING_ID_l).drop(['WELDING_ID_l','WELDING_ID_r'],axis=1).select(['WELDING_ID']+feature_names+['anomaly'])
ADS

In [None]:
ADS.shape

<p style = 'font-size:16px;font-family:Arial'>The analytical dataset we created has 14 columns and 391 rows which will be used to build the model below.</p>

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>6. Build the model </b></p>
<p style = 'font-size:16px;font-family:Arial'>Datasets often contain columns with different units: one column can be in kilograms while another is in centimetres. If we feed these features to the model as is, a feature may influence the result more than the others simply because its values are larger, which does not mean it is a more important predictor. To give all features comparable weight, we need feature scaling.</p>
    
<p style = 'font-size:16px;font-family:Arial'>Here, we apply the Standard scale and transform functions which are ScaleFit and ScaleTransform functions in Vantage. ScaleFit() function outputs statistics to input to ScaleTransform() function, which scales specified input DataFrame columns.</p> 
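<p style = 'font-size:16px;font-family:Arial'>Standard scaling subtracts the column mean and divides by the column standard deviation. The plain-Python sketch below illustrates the arithmetic on one column. The name <b>standard_scale</b> and the sample values are hypothetical, and the use of the population standard deviation is an assumption: the in-database functions may use a different convention.</p>

```python
import statistics


def standard_scale(column):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]


# Hypothetical feature column.
demo_column = [10.0, 12.0, 14.0]
demo_scaled = standard_scale(demo_column)
print(demo_scaled)
```

<p style = 'font-size:16px;font-family:Arial'>The fit and transform steps are separate in the notebook for the same reason as in this sketch: the mean and standard deviation are learned once and can then be applied to new data.</p>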

In [None]:
from teradataml import ScaleFit , ScaleTransform
scaler = ScaleFit(
                    data=ADS,
                    target_columns=feature_names,
                    scale_method="STD",
                    global_scale=False)

In [None]:
ADS_scaled = ScaleTransform(data=ADS,
                         object=scaler.output,
                         accumulate="anomaly").result
ADS_scaled

In [None]:
df = ADS_scaled.to_pandas()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.1 Create a model file using the python libraries.</b></p>

<p style = 'font-size:16px;font-family:Arial'>The Vantage Bring Your Own Model (BYOM) package gives data scientists and analysts the ability to operationalize predictive models in Vantage. Predictive models trained in external tools with sample data can be used to score data stored in Vantage using the BYOM Predict. Create or convert your predictive model using a supported model interchange format (PMML, MOJO, ONNX, Dataiku, and DataRobot are currently available), store it in a Vantage table, and use the BYOM PMMLPredict, H2OPredict, ONNXPredict, DataikuPredict, or DataRobotPredict to score your data with the model.</p>

<p style = 'font-size:16px;font-family:Arial'>A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. One way to solve this problem is to oversample the examples in the minority class. The most widely used approach to synthesizing new examples is the Synthetic Minority Oversampling Technique, or SMOTE for short. SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and drawing a new sample at a point along that line.</p>
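<p style = 'font-size:16px;font-family:Arial'>The interpolation step can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the imbalanced-learn implementation: the function name <b>smote_like_samples</b>, its parameters, and the sample points are hypothetical.</p>

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points between minority points and near neighbours."""
    rng = random.Random(seed)

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
demo_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
demo_synthetic = smote_like_samples(demo_minority, n_new=5)
print(demo_synthetic)
```

<p style = 'font-size:16px;font-family:Arial'>Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.</p>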

<p style = 'font-size:16px;font-family:Arial'>Then we use the RandomForestClassifier to create the model. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and combines their predictions to improve the predictive accuracy and control over-fitting. Each decision tree is built from a randomly selected subset of the training set, and the forest collects the votes of the different trees to decide the final prediction.</p>

In [None]:
X_train = df[feature_names]
y_train = df['anomaly']

In [None]:
# Balance the training set using SMOTE
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)


# Create a random forest classifier
model = RandomForestClassifier(n_estimators=10,max_depth= 3, random_state=42)

# Create a PMML pipeline containing the model (SMOTE was already applied to the training data above)
pipeline = PMMLPipeline([ ('model', model)])


In [None]:
# Train the pipeline
start = time.time()
pipeline.fit(X_train, y_train)
end = time.time()
print('duration : ', end-start, 's')

In [None]:
# make predictions on the training set
y_train_pred = pipeline.predict(X_train)

# calculate and print the accuracy score
acc = accuracy_score(y_train, y_train_pred)
print("Accuracy: {:.2f}%".format(acc * 100))

# calculate and print precision, AUC and F1-score
prec = precision_score(y_train, y_train_pred)
print("Precision: {:.2f}%".format(prec * 100))

# calculate AUC, AUC requires probability for positive class
prob = pipeline.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, prob)
print("AUC: {:.2f}%".format(auc * 100))

f1 = f1_score(y_train, y_train_pred)
print("F1-Score: {:.2f}%".format(f1 * 100))

In [None]:
pmml_metrics=pd.DataFrame([{'Model':'PMML using BYOM','Accuracy':acc, 'Precision':prec, 'F1-Score':f1}])
pmml_metrics

In [None]:
sklearn2pmml(pipeline, "my_model.pmml", with_repr = True)

In [None]:
additional_columns = {"Description": type("RandomForestClassifier model"),
                              "UserId": type('demo_user'),
                              "ProductionReady": False,
                              "ModelAccuracy": float(acc),
                              "ModelPrecision": prec,
                              "ModelAUC": auc,
                              "Modelf1Score": f1,
                              "ModelSavedTime": str(datetime.datetime.now(tz=pytz.UTC)),
                              "ModelGeneratedTime": end-start,
                              "sklearnVersion": sklearn.__version__
                             }
for k in additional_columns.keys():
    print(type(additional_columns[k]))

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.2 Save the model file</b></p>

In [None]:
try:
    save_byom(model_id = 'model_anomaly1',
          model_file = 'my_model.pmml',
          table_name = 'BYOM_PMMLMODELS_REPOSITORY',
          additional_columns={"Description": "RandomForestClassifier model",
                              "UserId": 'demo_user',
                              "ProductionReady": False,
                              "ModelAccuracy": float(acc),
                              "ModelPrecision": float(prec),
                              "ModelAUC": float(auc),
                              "Modelf1Score": float(f1),
                              "ModelSavedTime": str(datetime.datetime.now(tz=pytz.UTC)),
                              "ModelGeneratedTime": float(end-start),
                              "sklearnVersion": sklearn.__version__
                             }
            )
except Exception as e: 
    # if our model exists, delete and rewrite 
    if str(e.args).find('TDML_2200') >= 1: 
        delete_byom(model_id = 'model_anomaly1', table_name = 'BYOM_PMMLMODELS_REPOSITORY') 
        save_byom(model_id = 'model_anomaly1',
              model_file = 'my_model.pmml',
              table_name = 'BYOM_PMMLMODELS_REPOSITORY',
              additional_columns={"Description": "RandomForestClassifier model",
                              "UserId": 'demo_user',
                              "ProductionReady": False,
                              "ModelAccuracy": float(acc),
                              "ModelPrecision": float(prec),
                              "ModelAUC": float(auc),
                              "Modelf1Score": float(f1),
                              "ModelSavedTime": str(datetime.datetime.now(tz=pytz.UTC)),
                              "ModelGeneratedTime": float(end-start),
                              "sklearnVersion": sklearn.__version__
                             }
            )
    else:
        raise ValueError(f"Unable to save the model due to the following error: {e}")



<p style = 'font-size:16px;font-family:Arial'>The model file is saved as my_model.pmml and can be found in the left navigation pane under /UseCases/Anomaly_Detection.</p>

<p style = 'font-size:16px;font-family:Arial'>We now prepare new data to score with this model. The new dataset is created by joining the features with the anomalies on WELDING_ID.</p>

In [None]:
newdata = features[['WELDING_ID']+feature_names].join(other=anomalies, how='inner', on='WELDING_ID=WELDING_ID',rsuffix='r',lsuffix='l')
newdata = newdata.assign(WELDING_ID=newdata.WELDING_ID_l).drop(['WELDING_ID_l','WELDING_ID_r'],axis=1).select(['WELDING_ID']+feature_names+['anomaly'])
newdata

<p style = 'font-size:16px;font-family:Arial'>We then scale the new data with the same ScaleFit output that we used earlier, so that it is transformed in exactly the same way as the training data.</p>

In [None]:
newdata_scaled = ScaleTransform(data=newdata,
                         object=scaler.output,
                                # DataFrame(in_schema('demo_user','scaler_anomaly')),
                         accumulate=["WELDING_ID","anomaly"]).result
newdata_scaled

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>6.3 Retrieve the model file and use it to predict</b></p>
<p style = 'font-size:16px;font-family:Arial'>We use the PMMLPredict function from the teradataml library to predict the anomalies.</p>
<p style = 'font-size:16px;font-family:Arial'>Predictive Model Markup Language (PMML) is an XML-based standard established by the Data Mining Group (DMG) for defining statistical and data-mining models. PMML models can be shared between PMML-compliant platforms and across organizations so that business analysts and developers are unified in designing, analyzing, and implementing PMML-based assets and services.</p>

In [None]:
from teradataml import PMMLPredict
modeldata_anomaly = retrieve_byom("model_anomaly1", table_name="BYOM_PMMLMODELS_REPOSITORY")
result=PMMLPredict(
                modeldata = modeldata_anomaly,
                newdata = newdata_scaled,
                accumulate = ['WELDING_ID'],
                model_output_fields=['probability(0)','probability(1)'],
                overwrite_cached_models = '*'
                )
pmml_predict=result.result
pmml_predict

In [None]:
pmml_predict_result = pmml_predict.join(other=newdata_scaled, how='inner', on='WELDING_ID=WELDING_ID',rsuffix='r',lsuffix='l')
pmml_predict_result = pmml_predict_result.assign(prob_0=pmml_predict_result['probability(0)'])
pmml_predict_result = pmml_predict_result.assign(prob_1=pmml_predict_result['probability(1)'])
pmml_predict_result = pmml_predict_result.assign(WELDING_ID=pmml_predict_result.WELDING_ID_l)
pmml_predict_result = pmml_predict_result.assign(prediction=case([(pmml_predict_result.prob_1>pmml_predict_result.prob_0, 1 )],else_ = 0))
pmml_predict_result = pmml_predict_result.select(['WELDING_ID']+['anomaly']+['prob_0']+['prob_1']+['prediction'])
pmml_predict_result

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>7. Random Forest using Teradata OpenSource ML functions</b></p>
 
<p style = 'font-size:16px;font-family:Arial'>We start by creating a subset of the curves, because the most interesting part lies between 40 and 400 ms from the start of the curve.</p>
        
        

In [None]:
DF_curves_zoom = welding_dataset_new[(welding_dataset_new.TIME_MS > 40) & (welding_dataset_new.TIME_MS < 400) ]
DF_curves_zoom

<p style = 'font-size:16px;font-family:Arial'>We create features from the resistance curve. First, a window function partitioned by WELDING_ID and ordered by TIME_MS computes the difference between the current and the previous resistance value. Then, aggregation functions applied per weld to the resistance and to this difference produce the features.</p>
        

In [None]:
DF_curves_zoom = DF_curves_zoom.assign(
    resistance_diff = DF_curves_zoom.RESISTANCE 
                        - DF_curves_zoom.RESISTANCE.window(
                                partition_columns=['WELDING_ID'],
                                order_columns=["TIME_MS"]
                            ).lag(1)
)
# DF_curves_zoom[DF_curves_zoom.WELDING_ID==138].sort("TIME_MS")

In [None]:
DF_features = DF_curves_zoom.groupby("WELDING_ID").agg({
    'RESISTANCE':['sum', 'min', 'max', 'mean', 'std', 'var','skew','kurtosis'],
    'resistance_diff':['min']
})
DF_features
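
<p style = 'font-size:16px;font-family:Arial'>To make the window-and-aggregate logic of the cells above concrete, here is a minimal plain-Python sketch on a handful of made-up rows. It runs locally rather than in-database, the values are hypothetical, and it computes only a subset of the aggregates. <code>statistics.stdev</code> is the sample standard deviation, which may or may not match the definition used by the in-database aggregate.</p>

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (WELDING_ID, TIME_MS, RESISTANCE) rows.
rows = [
    (1, 50, 10.0), (1, 60, 12.0), (1, 70, 11.0),
    (2, 50, 20.0), (2, 60, 18.0), (2, 70, 19.5),
]

# Partition by WELDING_ID and order by TIME_MS, as the window specification does.
curves = defaultdict(list)
for welding_id, time_ms, resistance in sorted(rows, key=lambda r: (r[0], r[1])):
    curves[welding_id].append(resistance)

features = {}
for welding_id, values in curves.items():
    # lag(1) difference: current value minus previous value.
    # The first row of each weld has no predecessor, so it yields no difference.
    diffs = [cur - prev for prev, cur in zip(values, values[1:])]
    features[welding_id] = {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": stdev(values),
        "min_resistance_diff": min(diffs),
    }

for welding_id, feats in features.items():
    print(welding_id, feats)
```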

In [None]:
feature_names = DF_features.columns[1:]
feature_names

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.1 Build the analytical dataset.</b></p>
<p style = 'font-size:16px;font-family:Arial'>We create the analytical dataset by joining the anomaly table created above with the feature dataset.</p>

In [None]:
DF_target = DataFrame('Anomaly_Target')

In [None]:
DF_ADS = DF_features[['WELDING_ID']+feature_names].join(
    other=DF_target, how='inner', on='WELDING_ID=WELDING_ID',rsuffix='r',lsuffix='l')

In [None]:
DF_ADS

In [None]:
DF_ADS = DF_ADS.assign(WELDING_ID=DF_ADS.WELDING_ID_l
                                  ).drop(['WELDING_ID_l','WELDING_ID_r'],axis=1
                                        ).select(['WELDING_ID']+feature_names+['anomaly']
                                                ).assign(anomaly_int = DF_ADS.anomaly.cast(INTEGER()))
DF_ADS

In [None]:
# Hold out 20% of the data for model validation.
DF_ADS=DF_ADS.drop('anomaly', axis=1)
# df_sample = DF_ADS.sample(frac=[0.75, 0.25], randomize=False, seed=20)
# df_sample

TrainTestSplit_out = TrainTestSplit(
                                    data = DF_ADS,
                                    id_column = "WELDING_ID",
                                    train_size = 0.80,
                                    test_size = 0.20,
                                    seed = 42
                                   )
df_sample = TrainTestSplit_out.result

In [None]:
df_sample

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.2 Train RandomForest Classifier</b></p>
<p style = 'font-size:16px;font-family:Arial'>The train dataset consists of the rows where TD_IsTrainRow = 1.</p>

In [None]:
# Create the train dataset by filtering on "TD_IsTrainRow" and drop that column, as it is not required for training the model.
data_train = df_sample[df_sample.TD_IsTrainRow == "1"].drop("TD_IsTrainRow", axis = 1)
data_train

<p style = 'font-size:16px;font-family:Arial'>The test dataset consists of the rows where TD_IsTrainRow = 0.</p>

In [None]:
# Create the validation dataset by filtering on "TD_IsTrainRow" and drop that column, as it is not required for validating the model.
data_val = df_sample[df_sample.TD_IsTrainRow == "0"].drop("TD_IsTrainRow", axis = 1)
data_val

In [None]:
from teradataml import td_sklearn as osml
X_train = data_train.drop(['anomaly_int','WELDING_ID'], axis = 1)
y_train = data_train.select(["anomaly_int"])
X_test = data_val.drop(['anomaly_int','WELDING_ID'], axis = 1)
y_test = data_val.select(["anomaly_int"])

In [None]:
RF_classifier = osml.RandomForestClassifier(n_estimators=10,max_leaf_nodes=2,max_features='auto',max_depth=2)
#,random_state=42
RF_classifier.fit(X_train, y_train)

In [None]:
RF_classifier.get_params()

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>7.3 Predict and Evaluate model</b></p>


In [None]:
#model predictions
predict_RF =RF_classifier.predict(X_test,y_test)
predict_RF

In [None]:
#accuracy of the model
accuracy_RF = RF_classifier.score(X_test, y_test)
accuracy_RF

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>8. Compare PMML and OpenSource ML model</b></p>
<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>8.1 Show AUC-ROC Curve</b></p>

<p style = 'font-size:16px;font-family:Arial'>The <a href = 'https://docs.teradata.com/search/all?query=TD_ROC&content-lang=en-US'>ROC</a> curve shows the performance of a binary classification model as its discrimination threshold varies. For a range of thresholds, the curve plots the true positive rate against false-positive rate.</p>

<p style = 'font-size:16px;font-family:Arial'>This function accepts a set of prediction-actual pairs as input and calculates the following values for a range of discrimination thresholds.</p>
    <ul style = 'font-size:16px;font-family:Arial'>
        <li style = 'font-size:16px;font-family:Arial'>True-positive rate (TPR)</li>
        <li style = 'font-size:16px;font-family:Arial'>False-positive rate (FPR)</li>
        <li style = 'font-size:16px;font-family:Arial'>The area under the ROC curve (AUC)</li>
        <li style = 'font-size:16px;font-family:Arial'>Gini coefficient</li>
        <li style = 'font-size:16px;font-family:Arial'>Other details are mentioned in the documentation</li>
    </ul>
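
<p style = 'font-size:16px;font-family:Arial'>As an illustration of these quantities, the following plain-Python sketch sweeps the threshold over a few made-up scores and computes TPR, FPR, and AUC with the trapezoidal rule. It is not the in-database ROC function; the helper names and data are hypothetical, and the Gini coefficient is derived with the usual relation Gini = 2 &times; AUC &minus; 1, which is assumed here.</p>

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over all scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical actual labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
gini = 2 * auc - 1
print("ROC points:", points)
print("AUC:", auc, "Gini:", gini)
```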

<p style = 'font-size:18px;font-family:Arial'><b>ROC for PMML</b></p>

In [None]:
from teradataml import ROC 
roc_pmml = ROC(data = pmml_predict_result, 
                    probability_column = "prob_1",
                    observation_column = "anomaly",
                    positive_class="1"
                    )

In [None]:
roc_data_pmml = roc_pmml.output_data.to_pandas().sort_values("fpr", ascending=True)
roc_data_pmml.tail(10)

In [None]:
auc_pmml = roc_pmml.result.to_pandas().iloc[0,0]
auc_pmml

<p style = 'font-size:18px;font-family:Arial'><b>ROC for teradataml OpenSource ML RandomForestClassifier</b></p>

In [None]:
roc_obj = ROC(data = predict_RF, 
                    probability_column = "randomforestclassifier_predict_1",
                    observation_column = "anomaly_int",
                    positive_class="1"
                    )

In [None]:
roc_data = roc_obj.output_data.to_pandas().sort_values("fpr", ascending=True)
roc_data.tail(10)

In [None]:
auc = roc_obj.result.to_pandas().iloc[0,0]
auc

<p style = 'font-size:18px;font-family:Arial'><b>Plot ROC Curves</b></p>

In [None]:
# Plot 1
plt.plot(roc_data_pmml['fpr'], roc_data_pmml['tpr'], color='orange', label='PMML ROC. AUC = {}'.format(str(auc_pmml)), drawstyle='steps') 
# Plot 2
plt.plot(roc_data['fpr'], roc_data['tpr'], color='green', label='RandomForest ROC. AUC = {}'.format(str(auc)),  drawstyle='steps') 
# Plot the diagonal dashed line
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--') 
# Set labels and title
plt.xlabel('False Positive Rate',fontsize=12) 
plt.ylabel('True Positive Rate',fontsize=12) 
plt.title('Receiver Operating Characteristic (ROC) Curve',fontsize=16) 
# Add legend
plt.legend(loc="lower right",fontsize=10) 
# Show the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial'>The closer the ROC curve is to the upper-left corner of the graph, the better the model separates the two classes, because in the upper-left corner the sensitivity (true positive rate) is 1 and the false positive rate is 0 (specificity = 1). The ideal ROC curve thus has an AUC of 1.0. As seen in the graph above, the AUC for both models is close to 1, which indicates that both models discriminate well between anomalous and normal welds. Note that AUC measures ranking quality across all thresholds rather than accuracy at a single threshold.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>8.2 Show Confusion Matrix</b></p>

<p style = 'font-size:16px;font-family:Arial'>A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem such as this one, it is a table with the 4 different combinations of predicted and actual values.</p>

<p style = 'font-size:16px;font-family:Arial'>Confusion matrices represent counts from predicted and actual values. The output “TN” stands for True Negative which shows the number of negative examples classified accurately. Similarly, “TP” stands for True Positive which indicates the number of positive examples classified accurately. The term “FP” shows False Positive value, i.e., the number of actual negative examples classified as positive; and “FN” means a False Negative value which is the number of actual positive examples classified as negative.</p>
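
<p style = 'font-size:16px;font-family:Arial'>The four counts can be computed directly from pairs of actual and predicted labels. The sketch below uses made-up labels and a hypothetical helper name, with 1 standing for an anomaly, and also derives precision and recall from the counts.</p>

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN and TP for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TN": sum(1 for t, p in pairs if t != positive and p != positive),
        "FP": sum(1 for t, p in pairs if t != positive and p == positive),
        "FN": sum(1 for t, p in pairs if t == positive and p != positive),
        "TP": sum(1 for t, p in pairs if t == positive and p == positive),
    }


# Hypothetical actual and predicted labels (1 = anomaly, 0 = no anomaly).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, "precision:", precision, "recall:", recall)
```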


In [None]:
# Calculate confusion matrix for PMML
DF_result=predict_RF.to_pandas().reset_index()
pmml_result=pmml_predict_result.to_pandas()
cm_pmml = confusion_matrix(pmml_result['anomaly'], pmml_result['prediction']) 
# Calculate confusion matrix for the OpenSource ML RandomForest model
cm_df = confusion_matrix(DF_result['anomaly_int'], DF_result['randomforestclassifier_predict_1']) 
# Create figure and axes objects
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8)) 
# Plot PMML confusion matrix
disp_pmml = ConfusionMatrixDisplay(confusion_matrix=cm_pmml, display_labels=['No Anomaly', 'Anomaly']) 
disp_pmml.plot(ax=ax1, cmap='Blues', colorbar=False) 
ax1.set_title('PMML Confusion Matrix') 
ax1.set_xlabel('Predicted Label') 
ax1.set_ylabel('True Label') 
ax1.set_xticks([0, 1]) 
ax1.set_yticks([0, 1]) 
ax1.set_xticklabels(['No Anomaly', 'Anomaly']) 
ax1.set_yticklabels(['No Anomaly', 'Anomaly'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm_pmml.shape[0]): 
    for j in range(cm_pmml.shape[1]): 
        ax1.text(j, i, f'{cm_pmml[i, j]}', ha='center', va='center', color='white' if cm_pmml[i, j] > cm_pmml.max() / 2 else 'black') 

# Plot RandomForest confusion matrix
disp_df = ConfusionMatrixDisplay(confusion_matrix=cm_df, display_labels=['No Anomaly', 'Anomaly']) 
disp_df.plot(ax=ax2, cmap='Blues', colorbar=False) 
ax2.set_title('RandomForest Confusion Matrix') 
ax2.set_xlabel('Predicted Label') 
ax2.set_ylabel('True Label') 
ax2.set_xticks([0, 1]) 
ax2.set_yticks([0, 1]) 
ax2.set_xticklabels(['No Anomaly', 'Anomaly']) 
ax2.set_yticklabels(['No Anomaly', 'Anomaly'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm_df.shape[0]): 
    for j in range(cm_df.shape[1]): 
        ax2.text(j, i, f'{cm_df[i, j]}', ha='center', va='center', color='white' if cm_df[i, j] > cm_df.max() / 2 else 'black') 

# Adjust layout and spacing
plt.tight_layout() 
# Show the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial'>The confusion matrix for this binary classification problem has the following 4 quadrants:</p>

<ul style = 'font-size:16px;font-family:Arial'>
<li style = 'font-size:16px;font-family:Arial'>True Positive (TP) refers to a sample belonging to the positive class being classified correctly.</li>
<li style = 'font-size:16px;font-family:Arial'>True Negative (TN) refers to a sample belonging to the negative class being classified correctly.</li>
<li style = 'font-size:16px;font-family:Arial'>False Positive (FP) refers to a sample belonging to the negative class but being classified wrongly as belonging to the positive class.</li>
<li style = 'font-size:16px;font-family:Arial'>False Negative (FN) refers to a sample belonging to the positive class but being classified wrongly as belonging to the negative class.</li>
</ul>




<p style = 'font-size:18px;font-family:Arial'><b> Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have seen an end-to-end exploration process for labelling anomalous time series using ClearScape Analytics on Teradata Vantage. Thanks to the in-database capabilities offered by Teradata Vantage with ClearScape Analytics, we were able to run this exploration with the smallest notebook instance. The unique massively-parallel architecture of Teradata Vantage allows users to prepare data, train, evaluate, and deploy models at unprecedented scale.</p>
<p style = 'font-size:16px;font-family:Arial'>In this particular use case, we have observed that, with a large volume of machine sensor data, millions of ML models were created to derive analytic features, which ultimately led to the deployment of tens of thousands of models for real-time scoring. This extent of scale is only possible by combining the power of Vantage with native ClearScape Analytics functions.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>9. Model Explainability</b></p>
<p style = 'font-size:18px;font-family:Arial'><b>Trusted AI</b></p>

<p style = 'font-size:16px;font-family:Arial'>Trusted AI is important for the in-database functions and data pipelines used in predictive AI/ML, and it provides significant benefits when applied. One way to enhance these benefits is Teradata VantageCloud, the only platform to offer the massively parallel processing (MPP) architecture that enables best-in-class vertical and horizontal scaling of models.</p>

<p style = 'font-size:16px;font-family:Arial'>LIME stands for Local Interpretable Model-agnostic Explanations. Surrogate models are interpretable models trained to approximate the predictions of an underlying black box machine learning model. Instead of training a global surrogate model, LIME trains local surrogate models, each of which explains an individual prediction.</p>

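<p style = 'font-size:16px;font-family:Arial'>Formally, the explanation is the interpretable model that minimizes a loss measuring how closely it matches the black box predictions in the neighborhood of the instance, while keeping the model complexity low. In practice, LIME only optimizes the loss part. The user has to determine the complexity, e.g., by selecting the maximum number of features that the linear regression model may use.</p>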

<p style = 'font-size:16px;font-family:Arial'>So, the recipe for training local surrogate models is as follows:</p>

<ul style = 'font-size:16px;font-family:Arial'>
<li style = 'font-size:16px;font-family:Arial'>Select your instance of interest for which you want to have an explanation of its black box prediction.</li>
<li style = 'font-size:16px;font-family:Arial'>Perturb your dataset and get the black box predictions for these new points.</li>
<li style = 'font-size:16px;font-family:Arial'>Weight the new samples according to their proximity to the instance of interest.</li>
<li style = 'font-size:16px;font-family:Arial'>Train a weighted, interpretable model on the dataset with the variations.</li>
<li style = 'font-size:16px;font-family:Arial'>Explain the prediction by interpreting the local model.</li>
</ul>
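
<p style = 'font-size:16px;font-family:Arial'>The recipe above can be sketched from scratch in plain Python. This is a simplified illustration, not the lime package's implementation. The black box function, the Gaussian perturbation, the exponential proximity kernel, and the kernel width are all assumptions chosen for the example, and the interpretable model is an ordinary weighted linear regression solved through the normal equations.</p>

```python
import math
import random


def black_box(x):
    """Hypothetical black box model: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 0.5 * x[1])))


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def explain(instance, predict, n_samples=500, kernel_width=0.75, seed=0):
    rng = random.Random(seed)
    # Step 2: perturb the instance and get the black box predictions.
    samples = [[v + rng.gauss(0.0, 1.0) for v in instance] for _ in range(n_samples)]
    preds = [predict(s) for s in samples]
    # Step 3: weight the samples by proximity to the instance (exponential kernel).
    weights = [
        math.exp(-math.dist(s, instance) ** 2 / kernel_width ** 2) for s in samples
    ]
    # Step 4: fit a weighted linear model (intercept plus one coefficient per feature).
    design = [[1.0] + s for s in samples]
    k = len(design[0])
    xtwx = [
        [sum(w * row[i] * row[j] for w, row in zip(weights, design)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * row[i] * y for w, row, y in zip(weights, design, preds))
        for i in range(k)
    ]
    coefs = solve(xtwx, xtwy)
    # Step 5: the coefficients of the local model are the explanation.
    return coefs[0], coefs[1:]


instance = [0.2, -0.1]  # Step 1: the instance of interest
intercept, coefficients = explain(instance, black_box)
print("local coefficients:", coefficients)
```

<p style = 'font-size:16px;font-family:Arial'>The sign of each coefficient shows whether increasing that feature pushes the local prediction towards or away from the positive class, and the magnitude shows how strongly.</p>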

<p style = 'font-size:16px;font-family:Arial'>Here we use the model created with the teradataml OpenSource ML functions to build the explainer and explain the model's predictions. The lime package provides the lime_tabular module, whose LimeTabularExplainer explains predictions on tabular data. We set the mode to classification and pass the training data, the names of the features that we selected in the training process, and the class names. The target outcome (Anomaly) can optionally be supplied through the training_labels argument.</p>


In [None]:
import lime.lime_tabular
# class_names are listed in class-index order: 0 = no anomaly, 1 = anomaly
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.get_values(), feature_names=X_train.columns,
                                    class_names=['NoAnomaly','Anomaly'], verbose=True, mode='classification')

<p style = 'font-size:16px;font-family:Arial'>We will choose 1 instance of the data and use it to explain the predictions.</p>
<p style = 'font-size:14px;font-family:Arial'><i><b>Note: Please replace the WELDING_ID with the ID of the weld for which you need an explanation.</b></i></p>


In [None]:
X_test_df = data_val
X_test_df.head(20)

In [None]:
df = X_test_df[X_test_df.WELDING_ID==120]
df = df.drop(columns=["WELDING_ID","anomaly_int"])
df

<p style = 'font-size:16px;font-family:Arial'>Next, we call the explainer using the selected instance and the model object created using the RandomForestClassifier.</p>

In [None]:
exp = explainer.explain_instance(df.get_values().flatten(), RF_classifier.modelObj.predict_proba, num_features=9)

<p style = 'font-size:16px;font-family:Arial'>We display the results using the show_in_notebook function of the explainer</p>

In [None]:
from IPython import display
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=ResourceWarning)
exp.show_in_notebook(show_table=True)

<p style = 'font-size:16px;font-family:Arial'>This gives a result as shown in the image above. There are three parts to the explanation:</p>

<ul style = 'font-size:16px;font-family:Arial'>
<li style = 'font-size:16px;font-family:Arial'>The left-most section displays the prediction probabilities.</li>
<li style = 'font-size:16px;font-family:Arial'>The middle section shows the features. For a binary classification task, they appear in two colors, orange and blue: attributes in orange support class 1 and those in blue support class 0. The floating-point numbers on the horizontal bars represent the relative importance of these features.</li>
<li style = 'font-size:16px;font-family:Arial'>The right-most section contains the actual values of the variables. The color-coding is consistent across sections.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>We will repeat the same steps for 1 more instance</p>

In [None]:
df = X_test_df[X_test_df.WELDING_ID==16]
df = df.drop(columns=["WELDING_ID","anomaly_int"])
df

<p style = 'font-size:16px;font-family:Arial'>Next, we call the explainer using the selected instance and the model object created using the RandomForestClassifier.</p>

In [None]:
exp = explainer.explain_instance(df.get_values().flatten(), RF_classifier.modelObj.predict_proba, num_features=9)

<p style = 'font-size:16px;font-family:Arial'>We display the results using the show_in_notebook function of the explainer</p>

In [None]:
from IPython import display
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=ResourceWarning)
exp.show_in_notebook(show_table=True)

<p style = 'font-size:16px;font-family:Arial'>Similar to the previous example, the above image shows three graphs that each show essential information about the anomaly.</p>

<p style = 'font-size:16px;font-family:Arial'>The left graph shows the prediction probabilities and the middle and right most show the features and their contribution towards the prediction.</p>
<p style = 'font-size:16px;font-family:Arial'>Thus, with the explainer functions, we obtain explanations, in terms of the individual feature values, of why a weld is or is not flagged as an anomaly.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>10. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:##00233C'><b>Work Tables</b></p>

In [None]:
tables = ['ADS_train_data', 'ADS_test_data','DF_train', 'DF_Predict', 'DF_Predict_test','additional_metrics_test']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # The table may not exist; ignore the error and continue
        pass

<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_AnomalyDetection');" 
#Takes 40 seconds

In [None]:
remove_context()

<p style = 'font-size:16px;font-family:Arial'>If you have updated the teradataml package, reinstall the package by uncommenting and running the below code cell.</p>

In [None]:
%%capture
# !pip install teradataml==17.20.0.6 --force-reinstall
!pip install scikit-learn==1.0.2 --force-reinstall
!pip install numpy==1.24.2 --force-reinstall

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>11. Exploring the Versatility of this Analytical Approach in Alternative Use Case Settings</b></p>
<p style = 'font-size:18px;font-family:Arial;color:##00233C'><b>How this analytic approach can be leveraged in other use case settings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:##00233C'>The analytical approach of leveraging clustering followed by classification for anomaly detection in short time series data is highly adaptable and can be broadly applied across various industries, especially in settings where operations or processes are characterized by short, continuous time series with a defined start and end and where ground truth labels are not initially available.</p>
<p style = 'font-size:16px;font-family:Arial;color:##00233C'>This method begins with unsupervised learning to explore and understand the data, identifying patterns, similarities, and potential outliers through techniques like Dynamic Time Warping (DTW). Such exploration is crucial in settings where anomalies are not predefined or where the data’s inherent complexity requires initial unsupervised insight to develop an understanding of what constitutes normal behavior versus an anomaly. Following the clustering phase, supervised classification models are trained on the newly identified labels to predict anomalies. This generic approach is particularly effective for short time series data, where each sequence represents a process or event whose normal operational parameters need to be defined through exploratory analysis before precise anomaly detection can occur.</p>
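<p style = 'font-size:16px;font-family:Arial'>To illustrate why Dynamic Time Warping suits this kind of exploration, here is a minimal plain-Python sketch of the classic DTW recurrence with an absolute-difference cost. The curves are made up and the function name is hypothetical; it is not the in-database implementation. A curve with the same shape but a small delay stays close to the reference, while a curve with a different shape does not.</p>

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed moves: up, left, or diagonal.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical short curves.
reference = [0, 1, 3, 4, 3, 1, 0]
shifted = [0, 0, 1, 3, 4, 3, 1]    # same shape, delayed by one step
abnormal = [0, 1, 1, 1, 1, 1, 0]   # flat curve that never reaches the peak

d_shift = dtw_distance(reference, shifted)
d_abnormal = dtw_distance(reference, abnormal)
print("reference vs shifted:", d_shift)
print("reference vs abnormal:", d_abnormal)
```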
<p style = 'font-size:18px;font-family:Arial;color:##00233C'><b>Potential Use Cases Across Industries:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:##00233C'><b> Telco & Utilities</b> <code>- Power Grid Load Monitoring:</code> Analyzing short time series of electricity load during peak usage times to identify anomalies that could indicate equipment failure, energy theft, or inefficiencies in power distribution. Each series could represent the load profile for a brief, high-demand period.</li>
<li style = 'font-size:16px;font-family:Arial;color:##00233C'><b>Healthcare</b> <code>- ECG or EEG Analysis:</code> Short segments of electrocardiogram (ECG) or electroencephalogram (EEG) readings can be analyzed to detect anomalies indicating cardiac arrhythmias or neurological issues, respectively. Each segment represents a complete heartbeat or a brief brain activity pattern.</li>
<li style = 'font-size:16px;font-family:Arial;color:##00233C'><b>Manufacturing</b> <code>- CNC Machine Operations:</code> Monitoring the torque and force profiles of a CNC (Computer Numerical Control) machine during a single machining operation. Anomalies could indicate tool wear, material inconsistency, or operational errors.</li>
<li style = 'font-size:16px;font-family:Arial;color:##00233C'><b>Travel & Transport</b> <code>- Aircraft Engine Test Runs:</code> Analyzing the time series data of engine parameters (e.g., temperature, pressure, vibration) during short test runs to identify deviations from normal operational profiles, suggesting maintenance or safety issues.</li>
<li style = 'font-size:16px;font-family:Arial;color:##00233C'><b>Hospitality & Entertainment</b> <code>- Theme Park Ride Operations:</code> Analyzing sensor data from individual rides, where each ride cycle produces a time series of mechanical or operational parameters. Anomalies in these series could indicate safety concerns or maintenance needs.</li></p>
<p style = 'font-size:18px;font-family:Arial;color:##00233C'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:##00233C'>In each of these scenarios, the focus is on analyzing the shape or behavior of a curve within a short time frame, similar to observing a spot welding curve. These curves are shaped by the specific activity taking place, whether it’s a machine at work, a health test running, financial trades happening, or people interacting with a service. The method begins by sorting these curves into groups based on their patterns, without needing to know ahead of time which ones are out of the ordinary. Then, it moves on to use a more detailed approach to pinpoint which curves don’t fit the expected pattern, labeling them as either normal or not normal. This way of doing things is great for quickly finding and addressing issues, and it also helps in getting a better grasp of how these processes work. This can lead to making things run more smoothly and keeping equipment in good shape before problems even start.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>Resources</b>
<p style = 'font-size:16px;font-family:Arial'>Let’s look at the elements we have available for reference for this notebook:</p>
<b style = 'font-size:18px;font-family:Arial'>Filters:</b> 
    <li style = 'font-size:16px;font-family:Arial'><b>Industry:</b> Manufacturing</li>
<li style = 'font-size:16px;font-family:Arial'><b>Functionality:</b> Machine Learning</li> 
<li style = 'font-size:16px;font-family:Arial'><b>Use Case:</b> Anomaly Detection</li></p>
<b style = 'font-size:18px;font-family:Arial'>Related Resources:</b>
<li style = 'font-size:16px;font-family:Arial'><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right</a> </li>
<li style = 'font-size:16px;font-family:Arial'><a href = 'https://www.teradata.com/Resources/Datasheets/Stay-Ahead-of-Rapid-Change-with-a-Dynamic-Supply-Chain?utm_campaign=i_coremedia-AMS&utm_source=google&utm_medium=paidsearch&utm_content=GS_CoreMedia_NA-US_BKW&utm_creative=Brand-Vantage&utm_term=teradata%20analytic%20platform&gclid=Cj0KCQjwnMWkBhDLARIsAHBOftrWZxDktHkKMsaWjMmNRnQ6Ys-bZBAUhXjWTo1Xa02fsci-IHWBV_waAppkEALw_wcB'>Stay Ahead of Continuous and Rapid Change with a Dynamic Supply Chain</a></li>
<li style = 'font-size:16px;font-family:Arial'><a href = 'https://www.teradata.com/Industries/Manufacturing'>Achieve industry 4.0 using advanced manufacturing analytics at scale</a></li>



<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
        <div style="float:right;">
            <div style="float:left; margin-top:14px">
                Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>