<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Predictive Maintenance using Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Y-Machine</b> is a manufacturing company that operates a large fleet of machines across multiple locations. They have been experiencing frequent machine breakdowns, which has been causing significant losses in production time and maintenance costs. To address this issue, <b>Y-Machine</b> is looking for a predictive maintenance solution that can help them identify potential machine failures before they occur, allowing them to proactively schedule maintenance and minimize downtime.</p>

<center><img src="./images/giphy.gif" alt="Machine GIF"/></center>
<p><a href="https://giphy.com/gifs/Ykga9Kp0xT4GswQAbh">via GIPHY</a></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To achieve the goal of predictive maintenance, Y-Machine will be leveraging the power of <b>Teradata Vantage</b>, an advanced analytics platform. With Teradata Vantage, we can deploy machine learning algorithms through the teradataml Python library, which enables us to identify and mitigate potential machine failures before they occur.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata Vantage provides us with the necessary capabilities to analyze the vast amounts of data generated by Y-Machine's machines, such as temperature, rotational speed, and torque. By processing this data and detecting anomalies or patterns, we can take proactive measures to address potential issues, preventing costly downtimes and ensuring the longevity of the machines.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With Teradata Vantage, we can help Y-Machine stay ahead of the curve, providing them with cutting-edge analytics capabilities to improve the reliability and efficiency of their machines.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard Libraries
import warnings

# Data Manipulation and Analysis Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Libraries
from sklearn.metrics import ConfusionMatrixDisplay, roc_auc_score, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Teradata Libraries
from teradataml import *
configure.val_install_location = 'val'

# Configuration
display.max_rows = 5

# Suppress Warnings
warnings.filterwarnings('ignore')

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Predictive_Maintenance_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_PredictiveMaintenance_cloud');"        # Takes about 1 minute
%run -i ../run_procedure.py "call get_data('DEMO_PredictiveMaintenance_local');"        # Takes about 2 minutes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>2. Read the data from Vantage as a teradataml DataFrame</b>

In [None]:
df = DataFrame(in_schema('DEMO_PredictiveMaintenance', 'Machine_Data'))
print(df.shape)
df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset above consists of ten columns, of which 'Target' and 'Failure_Type' are the dependent variables. The 'Target' column contains binary values, 1 (failure) and 0 (no failure), which makes it a binary classification target. The 'Failure_Type' column contains multiple types of failure, which makes it a multi-class classification target.
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Data Exploration</b>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Removing nulls and redundant columns</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next cell, we'll check for null values.</p>

In [None]:
df.info(null_counts = True)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above results, we see no null values in the dataset, since every column reports 10,000 non-null values.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next cell, we'll remove the Product_ID column as we already have a UID column as a unique identifier.</p>

In [None]:
# Drop column Product_ID
df = df.drop(columns=['Product_ID'])

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Checking target variable distribution</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next cell, we'll check the distribution of target variables, i.e., Target and Failure_Type.</p>

In [None]:
tdf = df.groupby('Target').assign(failure_count = df.Failure_Type.count()).sort('failure_count', ascending = False)
tdf.plot(
    x = tdf.Target,
    y = tdf.failure_count,
    kind = 'bar',
    title = 'Failure Distribution',
    xlabel = 'Failure?',
    ylabel = 'Count'
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The distribution here shows that the majority of the products have no failure, and a tiny number of products have some failure.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's check further w.r.t. Failure_Type</p>

In [None]:
# Count the occurrences of Failure_Type and create a Pandas DataFrame
plot_df = df.groupby('Failure_Type').assign(count = df.Failure_Type.count()).sort('count', ascending = False).to_pandas()

# Create a figure with a larger size
fig, ax = plt.subplots(figsize = (8, 6))

# Create a bar chart of the counts by Failure_Type
ax = plot_df.plot.bar(x = "Failure_Type", y = "count", rot = 45, colormap = 'summer', ax = ax)

# Add the count to the top of each bar
for i in ax.containers:
    ax.bar_label(i, label_type = 'edge', fontsize = 10)

# Set the plot title and axis labels
ax.set_title("Type of Failure Distribution")
ax.set_xlabel("Type of Failure")
ax.set_ylabel("Count")

# Add a grid to the plot
ax.grid(axis = 'y', linestyle = '--', alpha = 0.7)

# Add a legend to the plot
ax.legend(['Count'], loc = 'best', fontsize = 12)

# Display the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Broken down by failure type, the distribution again shows that the majority of the products have no failure, and each individual failure type accounts for only a tiny number of products.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>There are two target variables: 'Target' and 'Failure_Type'. Let's check that they are consistent with each other.</p>

In [None]:
df_failure = df[df['Target'] == 1]
df_failure.groupby('Failure_Type').assign(count = df.Failure_Type.count())

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note:</b> 9 values are classified as failure in the 'Target' variable but as No Failure in the 'Failure_Type' variable. Let's check the dataset:</p>

In [None]:
df_failure[df_failure['Failure_Type'] == 'No Failure']

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>It could go both ways, either failure or no failure. It makes sense to remove those instances since we do not know the real target here.</p>

In [None]:
index_possible_failure = list(df_failure[df_failure['Failure_Type'] == 'No Failure'].get_values()[:, 0])
df = df.drop(labels = index_possible_failure, axis = 'index')

In [None]:
df_failure = df[df['Target'] == 0]
df_failure.groupby('Failure_Type').assign(count = df.Failure_Type.count()).sort('count', ascending = False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note:</b> 18 instances are labeled 'Random Failures' by 'Failure_Type' but as no failure (0) by 'Target'. These 18 rows are, in fact, all of the 'Random Failures' instances in the dataset. Since we cannot tell whether they belong to the failure class, we'll inspect and then remove them. This leaves four types of failure, because 'Random Failures' is removed altogether.</p>
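<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Both consistency checks boil down to one rule: a row is trustworthy only when 'Target' is 1 exactly when 'Failure_Type' names a failure. The following is a minimal sketch of that rule in plain Python, so it runs without a Vantage connection. The sample rows and the helper name is_consistent are hypothetical and are not part of the notebook's teradataml code.</p>

```python
# Toy rows standing in for Machine_Data; the values are made up for illustration.
rows = [
    {"UID": 1, "Target": 0, "Failure_Type": "No Failure"},
    {"UID": 2, "Target": 1, "Failure_Type": "Power Failure"},
    {"UID": 3, "Target": 1, "Failure_Type": "No Failure"},       # contradictory
    {"UID": 4, "Target": 0, "Failure_Type": "Random Failures"},  # contradictory
    {"UID": 5, "Target": 1, "Failure_Type": "Tool Wear Failure"},
]


def is_consistent(row):
    """A row is consistent when Target == 1 exactly if a failure type is named."""
    names_failure = row["Failure_Type"] != "No Failure"
    return (row["Target"] == 1) == names_failure


# UIDs whose two labels disagree, and the rows that survive the filter.
ambiguous_uids = [r["UID"] for r in rows if not is_consistent(r)]
clean_rows = [r for r in rows if is_consistent(r)]

print(ambiguous_uids)
print(len(clean_rows))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The notebook applies the same idea in two passes, one per direction of disagreement, and drops the offending rows by UID.</p>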

In [None]:
df_failure[df_failure['Failure_Type'] == 'Random Failures']

In [None]:
index_possible_failure = list(df_failure[df_failure['Failure_Type'] == 'Random Failures'].get_values()[:, 0])
df = df.drop(labels = index_possible_failure, axis = 'index')

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Checking the correlation</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we'll check the distribution of target variables w.r.t features like torque, rotational speed, air temperature and process temperature.</p>

In [None]:
df1 = df.to_pandas()

# Set the figure size
fig, ax = plt.subplots(1, 2, figsize = (22, 8))

# Set the titles for each subplot
ax[0].set_title('Rot. Speed vs Torque wrt Failure Type (Including class no failure)')
ax[1].set_title('Rot. Speed vs Torque wrt Failure Type (Excluding class no failure)')

# Set the color palette for the plots
palette = ['#E9C0CB', '#39A692', '#976EBD', '#ACBF5C', '#DF8B4E']

# Plot the scatterplots
sns.scatterplot(data = df1, x = 'Rotational_speed', y = 'Torque', hue = 'Failure_Type', palette = palette, ax = ax[0])
sns.scatterplot(data = df1[df1['Target'] == 1], x = 'Rotational_speed', y = 'Torque', hue = 'Failure_Type',
                palette = palette[1:], ax = ax[1])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Some insights:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Power failures occur at both extremes of rotational speed and torque. They are the only failure type observed at rotational speeds above about 2500 rpm or at torques below about 15 Nm.</li>
    <li>Between torques of 16 Nm and 41 Nm, all failures are tool wear failures.</li>
    <li>Overstrain failures occur at torques of roughly 47 to 68 Nm and rotational speeds of approximately 1200 to 1500 rpm.</li>
    <li>The torque range for heat dissipation failures is smaller than for overstrain failures, and their rotational speed range is higher.</li>
</ul>

In [None]:
# Set the figure size
fig, ax = plt.subplots(1, 2, figsize = (22, 8))

# Set the titles for each subplot
ax[0].set_title('Process Temperature vs Air Temperature wrt Failure Type (Including class no failure)')
ax[1].set_title('Process Temperature vs Air Temperature wrt Failure Type (Excluding class no failure)')

# Set the color palette for the plots
palette = ['#E9C0CB', '#39A692', '#976EBD', '#ACBF5C', '#DF8B4E']

# Plot the scatterplots
sns.scatterplot(data = df1, x = 'Process_temperature', y = 'Air_temperature', hue = 'Failure_Type', palette = palette, ax = ax[0])
sns.scatterplot(data = df1[df1['Target'] == 1], x = 'Process_temperature', y = 'Air_temperature', hue = 'Failure_Type',
                palette = palette[1:], ax = ax[1])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Some insights:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Heat Dissipation Failure happens when Process Temperature and Air Temperature exceed 300 K.</li>
    <li>Other failures have no meaningful insights.</li>
</ul>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>4. Data Transformation</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll use LabelEncoder to convert the categorical columns to integers and scale the numerical columns using the ZScore function.
<br>
<br>
ZScore rescales continuous numeric data by standardizing it rather than mapping it to a fixed range, as a Rescaling transformation does. In a Z-Score transformation, each value in a numeric column is replaced by the number of standard deviations it lies from the column's mean, computed from the mean and standard deviation of the data in that column. The transformed column therefore has a mean of 0 and a standard deviation of 1, which puts features with different units on a comparable scale.</p>
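<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the two transformations concrete, here is a small sketch in plain Python. The sample values and the helper name z_scores are hypothetical, and the use of the population standard deviation is an assumption of this sketch; the in-database function may differ in such details.</p>

```python
from statistics import mean, pstdev

# Hypothetical sample values; the real columns live in Vantage.
machine_types = ["L", "M", "H", "L"]
torque = [30.0, 40.0, 50.0, 40.0]

# Label encoding: map each category to a fixed integer (same mapping as the notebook).
type_codes = {"L": 1, "M": 2, "H": 3}
encoded_types = [type_codes[t] for t in machine_types]


def z_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption of this sketch)
    return [(x - mu) / sigma for x in values]


scaled_torque = z_scores(torque)
print(encoded_types)
print(scaled_torque)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After scaling, the values are centered on 0 and expressed in units of standard deviations.</p>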

In [None]:
df.info()

In [None]:
# Define the label encoders
type_encoder = LabelEncoder(values = {"L": 1, "M": 2, "H": 3}, columns = "Type", datatype = 'integer')
failure_type_encoder = LabelEncoder(values = {
                            "No Failure": 1,
                            "Heat Dissipation Failure": 2,
                            "Power Failure": 3,
                            "Overstrain Failure": 4,
                            "Tool Wear Failure": 5
                            }, 
                    columns = ['Failure_Type'],
                    datatype = 'integer'
                  )

# Define the standard scaler
z_scaler = ZScore(columns = ['Air_temperature', 'Process_temperature',
                      'Rotational_speed', 'Torque', 'Tool_wear'],
            out_columns = ['Air_temperature', 'Process_temperature',
                      'Rotational_speed', 'Torque', 'Tool_wear'])

# Define the retain object
retain = Retain(columns = "Target")

In [None]:
obj = valib.Transform(data = df,
                      label_encode = [type_encoder, failure_type_encoder],
                      zscore = z_scaler,
                      retain = retain,
                      index_columns = 'UID')
df_trans = obj.result

In [None]:
df_trans.info()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As "Type" is a reserved keyword, we'll rename the column to "Machine_type".</p>

In [None]:
list_td_reserved_keywords('type')

In [None]:
df_trans = df_trans.assign(Machine_type = df_trans.Type)
df_trans = df_trans.drop(columns=['Type'])

In [None]:
df_trans

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>5. Train-Test Split</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll split the transformed dataset into training and testing datasets in the ratio 80:20, and we will save the datasets into Vantage.</p>
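<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea behind a seeded 80:20 split can be sketched in plain Python. This is an illustration only and not how TrainTestSplit works inside the database; the helper name split_ids and the sample UIDs are hypothetical.</p>

```python
import random


def split_ids(ids, train_size=0.80, seed=42):
    """Shuffle ids reproducibly and cut them into train and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # a fixed seed gives the same order every run
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


uids = range(1, 101)  # hypothetical UIDs 1..100
train_ids, test_ids = split_ids(uids)

print(len(train_ids), len(test_ids))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Fixing the seed keeps the split reproducible, so that training and evaluation results can be compared across runs.</p>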

In [None]:
TrainTestSplit_out = TrainTestSplit(
    data = df_trans,
    id_column = "UID",
    train_size = 0.80,
    test_size = 0.20,
    seed = 42
)

df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)

In [None]:
copy_to_sql(df_train, table_name = 'df_train', if_exists = 'replace')

copy_to_sql(df_test, table_name = 'df_test', if_exists = 'replace')

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>6. In Database Model Training (Binary Classification)</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll use the XGBOOST function to train an xgboost model using the 'Target' column as the target variable for binary classification. XGBoost's tree-based ensemble approach, regularization techniques, handling of missing values, scalability, and feature importance capabilities make it a powerful and effective choice for modeling tabular data, often leading to superior performance compared to other machine learning algorithms.
<br>
<br>
The XGBoost function, eXtreme Gradient Boosting, implements the gradient-boosted decision tree designed for speed and performance. It has recently been dominating applied machine learning.
<br>
<br>
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by existing models. The predicted residual is multiplied by a learning rate (the shrinkage_factor argument below) and then added to the previous prediction. Models are added sequentially until no further improvements can be made. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
</p>
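<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a simplified illustration for squared-error regression with one-split trees (stumps), not the in-database XGBoost implementation, which uses a classification loss and regularization. The data and the names fit_stump, boost and mse are hypothetical.</p>

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the residual y - p is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink the stump's output by the learning rate before adding it.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return base, stumps, predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical one-feature data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

base, stumps, fitted = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, fitted))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The second printed error is smaller than the first: each round removes part of the remaining residual, and the learning rate controls how large that part is.</p>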

In [None]:
XGBoost_model = XGBoost(
    data = df_train,
    input_columns = '3:8',
    response_column = 'Target',
    max_depth = 7,
    num_boosted_trees = 10,
    model_type = 'CLASSIFICATION',
    seed = 2,
    lambda1 = 100000.0,
    shrinkage_factor = 1.0,
    iter_num = 10,
    column_sampling = 1.0
)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>7. In Database Model Scoring</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
In the next step, we'll use the XGBoostPredict function to score the xgboost model trained in the previous step.</p>

In [None]:
out = XGBoost_model.predict(
    newdata = df_test,
    id_column = 'UID',
    accumulate = 'Target',
    model_type = 'CLASSIFICATION',
    object_order_column = ['task_index', 'tree_num', 'iter', 'class_num', 'tree_order'],
    output_responses = ['0', '1'],
    output_prob = True
)

out = out.result.assign(Prediction = out.result.Prediction.cast(type_ = BYTEINT))
out = out.assign(Prediction = out.Prediction.cast(type_ = VARCHAR(2)))
out = out.assign(Target = out.Target.cast(type_ = VARCHAR(2)))
out

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we'll use the ClassificationEvaluator function to evaluate the trained xgboost model on test data. This will let us know how well our model has performed on unseen data.</p>
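<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a reminder of what such an evaluation reports, the sketch below builds a confusion matrix and derives accuracy, precision and recall from it in plain Python. The labels are made up and the helper name confusion_matrix is hypothetical; the in-database function reports its own, larger set of metrics.</p>

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes, in the order of labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical labels: '0' = no failure, '1' = failure.
y_true = ['0', '0', '0', '0', '1', '1', '1', '0']
y_pred = ['0', '0', '1', '0', '1', '0', '1', '0']

cm = confusion_matrix(y_true, y_pred, labels=['0', '1'])
tn, fp = cm[0]  # actual no-failure row
fn, tp = cm[1]  # actual failure row

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted failures that were real
recall = tp / (tp + fn)     # share of real failures that were caught
print(cm, accuracy, precision, recall)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With few failures in the data, recall on the failure class is usually more informative than overall accuracy.</p>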

In [None]:
ClassificationEvaluator_obj = ClassificationEvaluator(
    data = out,
    observation_column = 'Target',
    prediction_column = 'Prediction',
    labels = ['0', '1']
)

In [None]:
ClassificationEvaluator_obj.output_data.head(10)

In [None]:
# Extract confusion matrix
cm = np.array(ClassificationEvaluator_obj.result.sort('Mapping').to_pandas()[['CLASS_1', 'CLASS_2']])

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['No Failure', 'Failure'])
fig, ax = plt.subplots(figsize = (8, 8))
disp.plot(ax = ax, cmap = 'Blues', colorbar = True)

# Add labels and annotations
plt.title('XGBoost Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.xticks(ticks = [0, 1], labels = ['No Failure', 'Failure'])
plt.yticks(ticks = [0, 1], labels = ['No Failure', 'Failure'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, f'{cm[i, j]}', ha = 'center', va = 'center', color = 'white' if cm[i, j] > cm.max()/2 else 'black')
        
print(f'''This means that out of all the actual no-failure cases ({cm[0][0] + cm[0][1]}),
{round(cm[0][0]/(cm[0][0] + cm[0][1])*100, 2)}% were correctly classified as no-failure, while
{round(cm[0][1]/(cm[0][0] + cm[0][1])*100, 2)}% were incorrectly classified as failure.
Similarly, out of all the actual failure cases ({cm[1][0] + cm[1][1]}),
{round(cm[1][1]/(cm[1][0] + cm[1][1])*100, 2)}% were correctly classified as failure, while
{round(cm[1][0]/(cm[1][0] + cm[1][1])*100, 2)}% were incorrectly classified as no-failure.''')

# Show the plot
plt.show()

In [None]:
from teradataml import ROC

roc_out = ROC(
    probability_column = '"Prob_1"',
    observation_column = "Target",
    positive_class = "1",
    data = out,
    num_thresholds=300
)

In [None]:
roc = roc_out.output_data
roc.plot(
    x = roc.fpr,
    y = roc.tpr,
    title = 'Receiver Operating Characteristic (ROC) Curve',
    xlabel = 'False Positive Rate',
    ylabel = 'True Positive Rate',
    color = 'green'
)

In [None]:
roc_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above metrics show that our model performs well on the binary classification test dataset.</p>
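<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For readers who want to see how a ROC curve and its AUC are derived, here is a minimal sketch in plain Python. It sweeps an evenly spaced grid of thresholds and integrates with the trapezoidal rule. The labels, scores and function names are hypothetical, and the in-database ROC function may compute its thresholds differently.</p>

```python
def roc_points(y_true, scores, num_thresholds=5):
    """Return (fpr, tpr) pairs for evenly spaced thresholds from 1 down to 0."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for k in range(num_thresholds, -1, -1):
        threshold = k / num_thresholds
        # A row is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the curve by the trapezoidal rule over consecutive points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = failure) and predicted failure probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores, num_thresholds=20)
roc_auc_value = auc_trapezoid(points)
print(points[0], points[-1])
print(roc_auc_value)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of failures above non-failures.</p>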

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>8. In Database Model Training (Multi-Class Classification)</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll use the XGBOOST function to train an xgboost model using Failure_Type as the target variable for multi-class classification.</p>

In [None]:
XGBoost_model = XGBoost(
    data = df_train,
    input_columns = '3:8',
    response_column = 'Failure_Type',
    max_depth = 7,
    num_boosted_trees = 10,
    model_type = 'CLASSIFICATION',
    seed = 2,
    lambda1 = 100000.0,
    shrinkage_factor = 0.9,
    iter_num = 10,
    column_sampling = 1.0
)

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>9. In Database Model Scoring</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the next step, we'll use the TD_XGBoostPredict function to score the xgboost model trained in the previous step.</p>

In [None]:
out = XGBoost_model.predict(
    newdata = df_test,
    id_column = 'UID',
    accumulate = 'Failure_Type',
    model_type = 'CLASSIFICATION',
    object_order_column = ['task_index', 'tree_num', 'iter', 'class_num', 'tree_order'],
    output_responses = ['1', '2', '3', '4', '5'],
    output_prob = True
)

out = out.result.assign(Prediction = out.result.Prediction.cast(type_ = BYTEINT))
out = out.assign(Prediction = out.Prediction.cast(type_ = VARCHAR(2)))
out = out.assign(Failure_Type = out.Failure_Type.cast(type_ = VARCHAR(2)))
out

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next, we'll use the ClassificationEvaluator function to evaluate the trained xgboost model on test data. This will let us know how well our model has performed on unseen data.</p>

In [None]:
ClassificationEvaluator_obj = ClassificationEvaluator(
    data = out,
    observation_column = 'Failure_Type',
    prediction_column = 'Prediction',
    labels = ['1', '2', '3', '4', '5']
)

In [None]:
ClassificationEvaluator_obj.output_data.head(10)

In [None]:
# Extract confusion matrix
cm = np.array(ClassificationEvaluator_obj.result.sort('Mapping').to_pandas()[['CLASS_1', 'CLASS_2', 'CLASS_3', 'CLASS_4', 'CLASS_5']])

# Plot confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

# Create figure and axes objects
fig, ax = plt.subplots(figsize=(8, 8))

# Plot confusion matrix
disp.plot(ax=ax, cmap='Blues', colorbar=False)

# Set title and axis labels
plt.title('XGBoost Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# Set x and y ticks with labels and rotation
plt.xticks(ticks=[0, 1, 2, 3, 4], labels=['No Failure', 'Heat Dissipation Failure', 'Power Failure', 'Overstrain Failure', 'Tool Wear Failure'], rotation=45)
plt.yticks(ticks=[0, 1, 2, 3, 4], labels=['No Failure', 'Heat Dissipation Failure', 'Power Failure', 'Overstrain Failure', 'Tool Wear Failure'])

# Add text to the plot to show the actual values of the confusion matrix
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, f'{cm[i, j]}', ha='center', va='center', color='white' if cm[i, j] > cm.max()/2 else 'black')

# Show the plot
plt.show()
print(f'''As an example, consider the power failure cases ({cm[2].sum()} in total):
{cm[2][2]} were correctly classified as power failure,
{cm[2][0]} were incorrectly classified as no failure,
{cm[2][1]} were incorrectly classified as heat dissipation failure,
{cm[2][3]} were incorrectly classified as overstrain failure,
{cm[2][4]} were incorrectly classified as tool wear failure
''')

In [None]:
xgb_result = out.to_pandas()
# Extract the relevant columns
y_true = xgb_result['Failure_Type'].values  # True labels (ground truth)
y_pred = xgb_result['Prediction'].values  # Predicted labels
y_probs = xgb_result[['Prob_1', 'Prob_2', 'Prob_3', 'Prob_4', 'Prob_5']].values  # Predicted probabilities for each class

# Binarize the true labels
y_true_binary = label_binarize(y_true, classes=np.unique(y_true))

# Compute ROC curve and ROC-AUC for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(y_probs.shape[1]):
    fpr[i], tpr[i], _ = roc_curve(y_true_binary[:, i], y_probs[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot ROC curve for each class
plt.figure()
colors = ['b', 'g', 'r', 'c', 'm']
for i in range(y_probs.shape[1]):
    plt.plot(fpr[i], tpr[i], color=colors[i], lw=2,
             label='Class {0} (AUC = {1:0.2f})'
             ''.format(i+1, roc_auc[i]))

plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC-AUC Curve for 5-Class Prediction')
plt.legend(loc="lower right")
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above metrics show that our model performs well on the multi-class classification test dataset.</p><hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In conclusion, the implementation of a predictive maintenance solution can greatly benefit Y-Machine by reducing machine downtime and maintenance costs, improving production efficiency, and increasing overall productivity. Proactive scheduling of maintenance based on real-time data and analytics can help prevent costly breakdowns and emergency repairs, leading to improved machine reliability.
    <br>
    <br>
Additionally, setting limits and alarms on key parameters can enable early detection of potential failures, allowing for timely maintenance interventions. The ability to predict the type of failure can also help reduce diagnosis time, further optimizing maintenance efforts. By leveraging predictive maintenance, Y-Machine can make data-driven decisions to improve their maintenance strategy, leading to tangible benefits to the company's bottom line, including increased operational efficiency, reduced costs, and improved overall performance.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>10. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Cleanup work tables to prevent errors next time.</p>

In [None]:
tables = ['df_train', 'df_test']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        # Ignore tables that do not exist or cannot be dropped
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_PredictiveMaintenance');"        # Takes 5 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Dataset:</b>

- `UID`: Unique identifier ranging from 1 to 10000
- `Product_ID`: Unique Product ID consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
- `Type`: A letter L, M, or H denoting the low (50% of all products), medium (30%), or high (20%) product quality variant
- `Air_temperature`: Generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
- `Process_temperature`: Generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K
- `Rotational_speed`: Calculated from a power of 2860 W, overlaid with a normally distributed noise
- `Torque`: Torque values are normally distributed around 40 Nm with σ = 10 Nm and no negative values
- `Tool_wear`: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process
- `Target`: If the machine failed or not (boolean)
- `Failure_Type`: Type of failure -
                            No Failure,
                            Heat Dissipation Failure,
                            Power Failure,
                            Overstrain Failure,
                            Tool Wear Failure

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>