# Tool Wear Detection: Model Training

In this notebook, you will predict CNC milling machine conditions by training a [Vertex AI AutoML tabular model](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview) using the public [Kaggle CNC Mill Tool Wear dataset](https://www.kaggle.com/datasets/shasun/tool-wear-detection-in-cnc-mill). 

This dataset was collected from machining experiments on 2" x 2" x 1.5" wax blocks in a CNC milling machine. It contains measurements from 4 motors (X, Y, and Z axes and the spindle) and program values from the CNC machine. The measurements include position, velocity, acceleration, and power consumption across the 4 motors.

For full details of the dataset, see the [Kaggle dataset page](https://www.kaggle.com/datasets/shasun/tool-wear-detection-in-cnc-mill).

## Installation

Install the following packages required to execute this notebook. Depending on your environment, there may be some warnings and errors that are safe to ignore. 

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME") and not os.getenv("VIRTUAL_ENV")

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install fsspec gcsfs $USER_FLAG -q
! pip3 install tensorflow-data-validation $USER_FLAG -q
! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q

## Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set your project ID

If you don't know your project ID, you may be able to retrieve it using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are regions supported for Vertex AI. It is recommended that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

print("Region:", REGION)

### Create a Cloud Storage bucket

In this notebook, you will be creating a Vertex AI dataset for training a Vertex AI tabular model. Before you can create a Vertex AI dataset, you will upload the processed data to a Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_URI = "gs://" + PROJECT_ID + "-ml"

print("BUCKET_URI:", BUCKET_URI)
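
Cloud Storage rejects bucket names that break its naming rules, so it can help to check the name before creating the bucket. The sketch below covers only a conservative subset of those rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper name `looks_like_valid_bucket_name` is hypothetical, and the check does not verify global uniqueness or reserved prefixes.

```python
import re

# Conservative subset of the Cloud Storage bucket naming rules:
# 3-63 characters, lowercase letters, digits, '-', '_' and '.',
# starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic naming checks above."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(looks_like_valid_bucket_name("my-project-ml"))       # True
print(looks_like_valid_bucket_name("[your-bucket-name]"))  # False
```

In this notebook, the check could be applied to `BUCKET_URI` with the `gs://` prefix removed.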

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

## Import libraries

In [None]:
import os

import google.cloud.aiplatform as vertex_ai
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

matplotlib.style.use("ggplot")
plt.figure(figsize=(40, 20))
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

## Initialize variables and Vertex AI client

In [None]:
# Local file path to dataset
LOCAL_DATA_PATH = "./data"
# GCS path to dataset
GCS_DATA_PATH = "tool_wear/"
# Label column name
LABEL_COL = "tool_condition"
# Split column name
SPLIT_COL = "ml_use"
# Vertex AI artifacts prefix
VERTEX_AI_PREFIX = "tool_wear"  # TODO: change me

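# Pass the region so Vertex AI resources are created in the same location as the bucket
vertex_ai.init(project=PROJECT_ID, location=REGION)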

## Download and extract dataset

Download the [Kaggle CNC Mill Tool Wear dataset](https://www.kaggle.com/datasets/shasun/tool-wear-detection-in-cnc-mill) from a public Cloud Storage bucket and extract the zip file locally.

In [None]:
! mkdir -p $LOCAL_DATA_PATH
! gsutil cp gs://gc-mde-demo-public/tool_wear_dataset.zip $LOCAL_DATA_PATH

In [None]:
from zipfile import ZipFile

with ZipFile(f"{LOCAL_DATA_PATH}/tool_wear_dataset.zip", "r") as zipObj:
    # Extract all the contents of zip file in current directory
    zipObj.extractall(LOCAL_DATA_PATH)

## Merge data files

You will merge the data files into a single dataframe.

As described on [Kaggle](https://www.kaggle.com/datasets/shasun/tool-wear-detection-in-cnc-mill), this dataset contains a `train.csv` file with general data about each experiment. The time series data collected for each experiment is stored in `experiment_[01-18].csv`.

In [None]:
df_train_csv = pd.read_csv(os.path.join(LOCAL_DATA_PATH, "train.csv"))

df_train_csv.head()

In [None]:
experiment_ids = list(df_train_csv["No"].unique())

li_df_experiments = []

for exp_id in experiment_ids:
    # Zero-pad the experiment number to width 2 (e.g. 1 -> "01")
    filename = f"experiment_{exp_id:0>2d}.csv"
    df = pd.read_csv(os.path.join(LOCAL_DATA_PATH, filename), index_col=None)
    # Tag each time-series row with its experiment id, then attach the experiment metadata
    df["No"] = exp_id
    df = df.merge(df_train_csv, how="left", on="No")
    print(f"experiment id: {exp_id} | shape: {df.shape}")

    li_df_experiments.append(df)

df_experiments = pd.concat(li_df_experiments, axis=0, ignore_index=True)

In [None]:
df_experiments.head()
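
To make the merge step concrete, here is a minimal sketch on made-up data. The column values are invented for illustration; only the `No` key, the left join, and the zero-padded filename pattern mirror the cells above.

```python
import pandas as pd

# Hypothetical experiment-level metadata (one row per experiment)
meta = pd.DataFrame({"No": [1, 2], "tool_condition": ["unworn", "worn"]})

# Hypothetical time-series readings for experiment 2 (one row per time step)
readings = pd.DataFrame({"X1_ActualPosition": [198.0, 198.1, 198.3]})
readings["No"] = 2

# The left join keeps every reading and copies the metadata onto each row
merged = readings.merge(meta, how="left", on="No")

print(f"experiment_{2:0>2d}.csv")  # experiment_02.csv
print(merged)
```

Each time-series row ends up carrying the label of its experiment, which is what lets the model be trained on individual readings.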

## Exploratory data analysis

In this section, you will explore and learn the characteristics of this dataset.

In [None]:
df_train_csv.describe(include="all")

In [None]:
df_experiments.describe()

In [None]:
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_dataframe(
    dataframe=df_experiments,
    stats_options=tfdv.StatsOptions(
        label_feature=LABEL_COL, sample_rate=1, num_top_values=50
    ),
)

In [None]:
tfdv.visualize_statistics(stats)

### Observations

Here are some observations noted from the EDA.

1. Some numerical columns (e.g. `Z1_CurrentFeedback`, `Z1_DCBusVoltage`) contain only zeros.
1. In other numerical columns (e.g. `Z1_CommandVelocity`, `Z1_CommandAcceleration`), more than 50% of the values are zeros.
1. Some command columns and their actual counterparts differ by more than 50 percentage points in their share of zeros. For example, `X1_CommandAcceleration` is 73.12% zeros while `X1_ActualAcceleration` is 13.66% zeros.
1. `S1_SystemInertia` has a constant value of `12` across all data points.
1. `passed_visual_inspection` has missing values.
1. The `worn` (13308 data points) and `unworn` (11978 data points) classes are relatively balanced.

Take some time to think about what other observations you can make.
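
The zero-heavy and constant columns can also be found programmatically rather than by reading the TFDV output. The following is a minimal sketch on toy data; the helper name `summarize_zero_columns` and the 0.5 threshold are assumptions made for illustration.

```python
import pandas as pd


def summarize_zero_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Report the share of zeros per numeric column and flag constant columns."""
    numeric = df.select_dtypes("number")
    summary = pd.DataFrame(
        {
            "zero_fraction": (numeric == 0).mean(),
            "is_constant": numeric.nunique(dropna=False) <= 1,
        }
    )
    # Keep only columns that are mostly zeros or never change
    mask = (summary["zero_fraction"] > threshold) | summary["is_constant"]
    return summary[mask].sort_values("zero_fraction", ascending=False)


# Toy data standing in for df_experiments
toy = pd.DataFrame(
    {
        "all_zero": [0.0, 0.0, 0.0, 0.0],
        "mostly_zero": [0.0, 0.0, 0.0, 1.5],
        "constant": [12, 12, 12, 12],
        "varied": [0.1, 0.2, 0.3, 0.4],
    }
)
print(summarize_zero_columns(toy))
```

Calling `summarize_zero_columns(df_experiments)` in the notebook should list the columns mentioned in the observations above.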

## Explore experiments across different outcomes

This dataset provides 2 different observed outcomes from the CNC milling operations:

1. `machining_finalized`: indicator for whether machining was completed without the workpiece moving out of the pneumatic vise
1. `passed_visual_inspection`: indicator for whether the workpiece passed visual inspection, only available for experiments where machining was completed

You will explore the distribution of experiments and data points across different observed outcomes.

In [None]:
df_train_csv_na = df_train_csv.fillna("no")
df_train_csv_na.groupby(
    ["tool_condition", "machining_finalized", "passed_visual_inspection"]
).No.apply(list)

In [None]:
df_experiments_na = df_experiments.fillna("no")
df_experiments_na[
    ["tool_condition", "machining_finalized", "passed_visual_inspection"]
].value_counts()

## Data visualizations

Data visualizations are helpful when analyzing data. Data visualizations make it easier to identify patterns, trends and outliers in datasets.

In this section, you will visualize different features in the CNC Mill dataset. 

### Visualize categorical variables

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(18, 12))
axs[1, 2].set_visible(False)

df_experiments["M1_CURRENT_PROGRAM_NUMBER"].value_counts().plot(
    kind="bar", title="M1_CURRENT_PROGRAM_NUMBER", ax=axs[0, 0]
)
df_experiments["M1_CURRENT_FEEDRATE"].value_counts().plot(
    kind="bar", title="M1_CURRENT_FEEDRATE", ax=axs[0, 1]
)
df_experiments["Machining_Process"].value_counts().plot(
    kind="bar", title="Machining_Process", ax=axs[0, 2]
)
df_experiments["feedrate"].value_counts().plot(
    kind="bar", title="feedrate", ax=axs[1, 0]
)
df_experiments["clamp_pressure"].value_counts().plot(
    kind="bar", title="clamp_pressure", ax=axs[1, 1]
)

#### Observations

1. `M1_CURRENT_PROGRAM_NUMBER` is imbalanced across its categories, with a high concentration of data points at `1.0`.
1. For `Machining_Process`, the values `end` and `Starting` have low counts.
1. `M1_CURRENT_FEEDRATE` and `feedrate` have similar distributions, except that `M1_CURRENT_FEEDRATE` contains an additional category (`50.0`).

### Visualize correlation

In [None]:
plt.figure(figsize=(40, 40))

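# Plot pairwise correlation for numerical columns only (newer pandas versions
# raise an error if `corr()` is given non-numeric columns)
numeric_corr = df_experiments.select_dtypes("number").corr()
sns.heatmap(numeric_corr, cbar=True, annot=True, cmap="Blues")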

#### Observations

1. There are many highly correlated features.
    - Across X / Y / Z / S
        - `*_ActualPosition` & `*_CommandPosition`
        - `*_ActualVelocity` & `*_CommandVelocity`
        - `*_DCBusVoltage` & `*_OutputCurrent` & `*_OutputVoltage` & `*_OutputPower`
    - `{X,Y,Z}_*Positions` have negative correlation with `S1_{ActualVelocity,CurrentFeedback,DCBusVoltage,OutputVoltage,OutputPower}`.
1. Some features (e.g. `Z1_CurrentFeedback`, `S1_SystemInertia`) show no correlation because they contain a constant value.
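
A large heatmap is hard to read, so it can help to list the strongly correlated pairs directly. The sketch below is one way to do that on toy data; the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and self-pairs are dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # The comparison also filters out NaN entries (masked cells and constant columns)
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data standing in for df_experiments
toy = pd.DataFrame(
    {
        "actual_position": [1.0, 2.0, 3.0, 4.0],
        "command_position": [1.1, 2.0, 3.1, 4.0],
        "noise": [0.3, -0.2, 0.1, -0.4],
        "constant": [12.0, 12.0, 12.0, 12.0],
    }
)
print(highly_correlated_pairs(toy))
```

Constant columns produce undefined correlations, so they never appear in the result, which matches the second observation above.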

### Visualize numerical variables

In [None]:
def plot_metrics(
    df: pd.DataFrame, experiment_ids: list, metric: str, color: str = "#00FF00"
):
    n_plots = len(experiment_ids)
    n_cols = 3
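    # Ceiling division so no empty extra row is added when n_plots divides evenly
    n_rows = -(-n_plots // n_cols)

    # squeeze=False keeps `axs` two-dimensional even when there is a single row
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(16, 14), squeeze=False)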

    for i in range(n_plots):
        exp_id = experiment_ids[i]
        df_metric = df[df["No"] == exp_id].reset_index()[metric]
        df_metric.plot(
            kind="line",
            title=f"{metric} in experiment {exp_id}",
            color=color,
            ax=axs[i // n_cols, i % n_cols],
        )

In [None]:
# Graph various metrics by label and experiments
agg = df_train_csv.groupby([LABEL_COL])["No"].apply(list)

worn_exp_ids = agg["worn"]
unworn_exp_ids = agg["unworn"]

print(agg)

### Actual Velocity

In [None]:
plot_metrics(
    df=df_experiments,
    experiment_ids=unworn_exp_ids,
    metric="Z1_ActualVelocity",
    color="green",
)

In [None]:
plot_metrics(
    df=df_experiments,
    experiment_ids=worn_exp_ids,
    metric="Z1_ActualVelocity",
    color="red",
)

### Actual Position

In [None]:
plot_metrics(
    df=df_experiments,
    experiment_ids=unworn_exp_ids,
    metric="Z1_ActualPosition",
    color="green",
)

In [None]:
plot_metrics(
    df=df_experiments,
    experiment_ids=worn_exp_ids,
    metric="Z1_ActualPosition",
    color="red",
)

#### Observations

1. It is not intuitive to distinguish worn from unworn CNC mills based on their telemetry (e.g. `*_ActualVelocity`, `*_ActualPosition`).

### EDA Summary

1. Some feature columns (e.g. `Z1_CurrentFeedback`, `Z1_CommandVelocity`) have a high percentage of zeros.
1. The `worn` (13308 data points) and `unworn` (11978 data points) classes are relatively balanced.
1. Some features are highly correlated (e.g. `*_ActualPosition` and `*_CommandPosition`), while others (e.g. `Z1_CurrentFeedback`, `S1_SystemInertia`) show no correlation.
1. It is not intuitive to distinguish worn from unworn CNC mills based on their telemetry (e.g. `*_ActualVelocity`, `*_ActualPosition`).


## Data preparation

Vertex AI AutoML is a no-code and low-code end-to-end model training pipeline that includes automatic data splitting, feature engineering, architecture search, model training, model ensembling, and model distillation. AutoML lets you create and train a model with minimal technical effort. You can use AutoML to quickly prototype models and explore new datasets before investing in development. You can further customize AutoML for classification and regression tasks using [Vertex AI Tabular Workflows](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/overview).

In contrast to training custom machine learning models, AutoML handles many [data preparation](https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular) tasks (e.g. one-hot encoding and feature scaling) for you. Hence, it is important to understand which data preparation steps remain your responsibility when using AutoML. See the [AutoML data transformations documentation](https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular) for the full list.

In this section, you will prepare the CNC Mill dataset following the [Vertex AI tabular data preparation best practices](https://cloud.google.com/vertex-ai/docs/datasets/bp-tabular).

### [Avoid Data Leakage](https://cloud.google.com/vertex-ai/docs/datasets/bp-tabular#target-leakage)

Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction. Target leakage can cause your model to show excellent evaluation metrics, but perform poorly on real data.

For this dataset, there are 2 leaking features: `machining_finalized` and `passed_visual_inspection`. These features are not known when the machining operation is in progress. 

In [None]:
COL_TO_EXCLUDE = ["machining_finalized", "passed_visual_inspection"]

df_experiments.drop(columns=COL_TO_EXCLUDE, inplace=True, errors="ignore")

### [Make sure your categorical features are accurate and clean](https://cloud.google.com/vertex-ai/docs/datasets/bp-tabular#make_sure_your_categorical_features_are_accurate_and_clean)

Data inconsistencies can cause categories to be incorrectly split. For example, if your data includes "Brown" and "brown", Vertex AI uses those values as separate categories, when you might have intended them to be the same. Misspellings can have a similar effect.

For the `Machining_Process` feature, there are some misspelled and inconsistent categories.

In [None]:
df_experiments["Machining_Process"].value_counts()

#### Clean Up Machining Process

Notice that the values `End` and `end` represent the same process but have different capitalization. This will cause [AutoML](https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular#categorical-transf) to treat them as different values. Hence, we will relabel the value `end` to `End`.

In addition, notice that the `Starting` value has a count of only 1. That is not enough data to provide insight into the `Starting` process, and it may confuse the model's predictions. Hence, we will group `Starting` and `Prep` together.

In [None]:
df_experiments.replace(
    {"Machining_Process": {"Starting": "Prep", "end": "End"}}, inplace=True
)

df_experiments["Machining_Process"].value_counts()
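
Inconsistencies like `End` versus `end` are easy to miss by eye in columns with many categories. As a sketch, a check like the following can surface labels that collide once whitespace and capitalization are normalized; the helper name `find_case_collisions` is hypothetical.

```python
from collections import defaultdict


def find_case_collisions(values):
    """Group distinct labels that become identical after trimming and lowercasing."""
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().lower()].add(value)
    # Only keep normalized keys that map to more than one raw spelling
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Toy labels standing in for the raw values of df_experiments["Machining_Process"]
labels = ["Prep", "End", "end", "Layer 1 Up", "End", "Starting"]
print(find_case_collisions(labels))  # {'end': ['End', 'end']}
```

An empty result after the cleanup step would indicate that no case-only duplicates remain.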

## Feature Engineering

[Feature engineering](https://developers.google.com/machine-learning/glossary#feature_engineering) is the process of determining which features might be useful in training a model, and then creating those features by transforming the raw data. 

Feature engineering is an important part of the data science lifecycle because having predictive features is key to high model performance. For the CNC mill dataset, you will create a set of new features by calculating the difference between the command and actual metrics. The hypothesis is that the difference between the command and actual metrics is indicative of tool conditions. 

In [None]:
axes = ["X1", "Y1", "Z1", "S1"]
metrics = ["Position", "Velocity", "Acceleration"]

for ax in axes:
    for metric in metrics:
        df_experiments[f"{ax}_{metric}Diff"] = abs(
            df_experiments[f"{ax}_Command{metric}"]
            - df_experiments[f"{ax}_Actual{metric}"]
        )
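
As a worked example of the feature computed above, the sketch below applies the same absolute-difference calculation to three made-up readings for a single axis. The values are invented for illustration.

```python
import pandas as pd

# Toy command/actual readings for a single axis (values are made up)
toy = pd.DataFrame(
    {
        "X1_CommandPosition": [198.0, 198.0, 196.0],
        "X1_ActualPosition": [198.0, 197.5, 196.25],
    }
)

# Absolute tracking error between what was commanded and what was measured
toy["X1_PositionDiff"] = (toy["X1_CommandPosition"] - toy["X1_ActualPosition"]).abs()

print(toy["X1_PositionDiff"].tolist())  # [0.0, 0.5, 0.25]
```

Taking the absolute value means the feature captures the size of the deviation regardless of whether the axis overshoots or undershoots the command.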

## Holdout data for simulation

To simulate CNC milling machine telemetry from the edge, you will create a flow in [Manufacturing Connect edge (MCe)](https://litmusdocs.mcoutput.com/1379100/Content/MCe-Installation&Configuration/c-edgs-overview.htm) that generates sample telemetry. To supply data for that flow, you will hold out the last 500 records from both experiment 2 (tool_condition: `unworn`) and experiment 13 (tool_condition: `worn`).

> Note: the MCe flow definition is included in this repository. You can directly import the flow definition without having to worry about the holdout data. However, it is important to **exclude** the holdout data from the training data so the AutoML model doesn't train on the holdout data.

In [None]:
df_experiment_2_holdout = df_experiments[df_experiments["No"] == 2].iloc[-500:]
df_experiment_13_holdout = df_experiments[df_experiments["No"] == 13].iloc[-500:]
indices_exclude = list(df_experiment_2_holdout.index) + list(
    df_experiment_13_holdout.index
)

df_experiments.drop(index=indices_exclude, inplace=True)

df_experiment_holdout = pd.concat(
    [df_experiment_2_holdout, df_experiment_13_holdout], axis=0, ignore_index=True
)

dataset_bucket_uri = f"{BUCKET_URI}/{GCS_DATA_PATH}{VERTEX_AI_PREFIX}_holdout.json"

df_experiment_holdout.drop(columns=["No"], errors="ignore", inplace=True)
df_experiment_holdout.drop(columns=["material"], errors="ignore", inplace=True)
df_experiment_holdout.to_json(dataset_bucket_uri, orient="records")
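
Because the holdout rows must never reach the training data, it is worth verifying that the two sets are disjoint. The sketch below reproduces the holdout pattern on a toy frame (holding out 2 rows per experiment instead of 500) and checks that no row appears in both sets.

```python
import pandas as pd

# Toy frame: two experiments with five rows each
toy = pd.DataFrame({"No": [2] * 5 + [13] * 5, "value": range(10)})

# Hold out the last two rows of each experiment (the notebook uses 500)
holdout = pd.concat([toy[toy["No"] == n].iloc[-2:] for n in (2, 13)])
train = toy.drop(index=holdout.index)

# The two sets must not share rows, and together they cover the original data
assert train.index.intersection(holdout.index).empty
assert len(train) + len(holdout) == len(toy)
print(holdout["value"].tolist())  # [3, 4, 8, 9]
```

The same two assertions can be run in the notebook if the index labels are captured before the holdout frames are concatenated with `ignore_index=True`.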

## [Using a manual split](https://cloud.google.com/vertex-ai/docs/datasets/bp-tabular#consider_using_a_manual_split)

Vertex AI selects the rows for the train and test dataset randomly (but deterministically). For imbalanced classes, you could end up with a small number of the minority class in your test dataset, or even none, which causes training to fail.

As seen in the [EDA](#Exploratory-data-analysis) section, the target classes are relatively balanced (11978 `unworn` and 13308 `worn` records). However, each experiment in the CNC mill dataset is an independent machining operation. Hence, rows from the same experiment should not appear in both the train and test datasets, as this may leak information about the machining operation. Another way to reason about this is to consider how the model will be used in production: will it predict on past machining operations that it was trained on, or on new machining operations?

> When splitting data, it is important to consider how your models will be used in production as that will give you insight on how to properly split train and test datasets, without leaking information about the target class.

One way to manually split data in AutoML is to [use the ml_use label](https://cloud.google.com/vertex-ai/docs/general/ml-use#ml-use). In this case, the following experiments are assigned to the test dataset, while all other experiments are used for training:

| Experiment | tool_condition | machining_finalized | passed_visual_inspection |
|------------|----------------|---------------------|--------------------------|
| 1          | unworn         | yes                 | yes                      |
| 4          | unworn         | no                  | no                       |
| 7          | worn           | no                  | no                       |
| 8          | worn           | no                  | no                       |
| 15         | worn           | yes                 | yes                      |

In [None]:
# Exp 1: unworn, finalized, pass
# Exp 4: unworn, not finalized, no pass
# Exp 7: worn, not finalized, no pass
# Exp 8: worn, not finalized, no pass
# Exp 15: worn, finalized, pass
test_experiment_ids = [1, 4, 7, 8, 15]

df_experiments[SPLIT_COL] = "UNASSIGNED"
df_experiments.loc[df_experiments["No"].isin(test_experiment_ids), SPLIT_COL] = "TEST"
df_experiments[SPLIT_COL].value_counts()
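
Since the point of the manual split is to keep each experiment inside a single split, a small guard can confirm that property before exporting. This is a minimal sketch on toy data; the helper name `assert_no_group_overlap` is hypothetical.

```python
import pandas as pd


def assert_no_group_overlap(df: pd.DataFrame, group_col: str, split_col: str) -> None:
    """Raise ValueError if any group has rows in more than one split."""
    splits_per_group = df.groupby(group_col)[split_col].nunique()
    leaking = splits_per_group[splits_per_group > 1]
    if not leaking.empty:
        raise ValueError(f"Groups in more than one split: {list(leaking.index)}")


# Toy data: experiment 1 is entirely TEST, experiment 2 entirely UNASSIGNED
toy = pd.DataFrame(
    {
        "No": [1, 1, 2, 2],
        "ml_use": ["TEST", "TEST", "UNASSIGNED", "UNASSIGNED"],
    }
)
assert_no_group_overlap(toy, group_col="No", split_col="ml_use")  # passes silently
```

In the notebook, the equivalent call would be `assert_no_group_overlap(df_experiments, "No", SPLIT_COL)`, run before the `No` column is dropped for export.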

## Export processed data to Cloud Storage bucket

In [None]:
# Drop experiment id column
df_experiments_out = df_experiments.drop(columns=["No"], errors="ignore")
dataset_bucket_uri = f"{BUCKET_URI}/{GCS_DATA_PATH}{VERTEX_AI_PREFIX}.csv"

df_experiments_out.to_csv(dataset_bucket_uri, index=False)

## [Create Vertex AI tabular dataset](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/create-dataset)

You will create a Vertex AI dataset for training an AutoML model using the processed data in the Cloud Storage bucket.

In [None]:
# Create a Vertex AI dataset
dataset = vertex_ai.TabularDataset.create(
    display_name=VERTEX_AI_PREFIX,
    gcs_source=dataset_bucket_uri,
)

print(dataset.resource_name)

## [Train an AutoML classification model](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/train-model#aiplatform_create_training_pipeline_tabular_classification_sample-python)

You will train an AutoML classification model using the Vertex AI dataset you created.

In [None]:
TRANSFORMATION = [
    {"numeric": {"column_name": "X1_ActualPosition"}},
    {"numeric": {"column_name": "X1_ActualVelocity"}},
    {"numeric": {"column_name": "X1_ActualAcceleration"}},
    {"numeric": {"column_name": "X1_CommandPosition"}},
    {"numeric": {"column_name": "X1_CommandVelocity"}},
    {"numeric": {"column_name": "X1_CommandAcceleration"}},
    {"numeric": {"column_name": "X1_CurrentFeedback"}},
    {"numeric": {"column_name": "X1_DCBusVoltage"}},
    {"numeric": {"column_name": "X1_OutputCurrent"}},
    {"numeric": {"column_name": "X1_OutputVoltage"}},
    {"numeric": {"column_name": "X1_OutputPower"}},
    {"numeric": {"column_name": "Y1_ActualPosition"}},
    {"numeric": {"column_name": "Y1_ActualVelocity"}},
    {"numeric": {"column_name": "Y1_ActualAcceleration"}},
    {"numeric": {"column_name": "Y1_CommandPosition"}},
    {"numeric": {"column_name": "Y1_CommandVelocity"}},
    {"numeric": {"column_name": "Y1_CommandAcceleration"}},
    {"numeric": {"column_name": "Y1_CurrentFeedback"}},
    {"numeric": {"column_name": "Y1_DCBusVoltage"}},
    {"numeric": {"column_name": "Y1_OutputCurrent"}},
    {"numeric": {"column_name": "Y1_OutputVoltage"}},
    {"numeric": {"column_name": "Y1_OutputPower"}},
    {"numeric": {"column_name": "Z1_ActualPosition"}},
    {"numeric": {"column_name": "Z1_ActualVelocity"}},
    {"numeric": {"column_name": "Z1_ActualAcceleration"}},
    {"numeric": {"column_name": "Z1_CommandPosition"}},
    {"numeric": {"column_name": "Z1_CommandVelocity"}},
    {"numeric": {"column_name": "Z1_CommandAcceleration"}},
    {"numeric": {"column_name": "Z1_CurrentFeedback"}},
    {"numeric": {"column_name": "Z1_DCBusVoltage"}},
    {"numeric": {"column_name": "Z1_OutputCurrent"}},
    {"numeric": {"column_name": "Z1_OutputVoltage"}},
    {"numeric": {"column_name": "S1_ActualPosition"}},
    {"numeric": {"column_name": "S1_ActualVelocity"}},
    {"numeric": {"column_name": "S1_ActualAcceleration"}},
    {"numeric": {"column_name": "S1_CommandPosition"}},
    {"numeric": {"column_name": "S1_CommandVelocity"}},
    {"numeric": {"column_name": "S1_CommandAcceleration"}},
    {"numeric": {"column_name": "S1_CurrentFeedback"}},
    {"numeric": {"column_name": "S1_DCBusVoltage"}},
    {"numeric": {"column_name": "S1_OutputCurrent"}},
    {"numeric": {"column_name": "S1_OutputVoltage"}},
    {"numeric": {"column_name": "S1_OutputPower"}},
    {"numeric": {"column_name": "S1_SystemInertia"}},
    {"categorical": {"column_name": "M1_CURRENT_PROGRAM_NUMBER"}},
    {"categorical": {"column_name": "M1_sequence_number"}},
    {"numeric": {"column_name": "M1_CURRENT_FEEDRATE"}},
    {"categorical": {"column_name": "Machining_Process"}},
    {"categorical": {"column_name": "material"}},
    {"numeric": {"column_name": "feedrate"}},
    {"numeric": {"column_name": "clamp_pressure"}},
    {"categorical": {"column_name": "tool_condition"}},
    {"numeric": {"column_name": "X1_PositionDiff"}},
    {"numeric": {"column_name": "X1_VelocityDiff"}},
    {"numeric": {"column_name": "X1_AccelerationDiff"}},
    {"numeric": {"column_name": "Y1_PositionDiff"}},
    {"numeric": {"column_name": "Y1_VelocityDiff"}},
    {"numeric": {"column_name": "Y1_AccelerationDiff"}},
    {"numeric": {"column_name": "Z1_PositionDiff"}},
    {"numeric": {"column_name": "Z1_VelocityDiff"}},
    {"numeric": {"column_name": "Z1_AccelerationDiff"}},
    {"numeric": {"column_name": "S1_PositionDiff"}},
    {"numeric": {"column_name": "S1_VelocityDiff"}},
    {"numeric": {"column_name": "S1_AccelerationDiff"}},
    {"categorical": {"column_name": "ml_use"}},
]
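
A hand-written list of this length is easy to get out of sync with the data. As an alternative sketch, the same structure can be generated from lists of column names; the helper name `build_column_transformations` and the shortened column lists below are assumptions made for illustration.

```python
def build_column_transformations(numeric_columns, categorical_columns):
    """Build a column_transformations list from two lists of column names."""
    transformations = [{"numeric": {"column_name": c}} for c in numeric_columns]
    transformations += [
        {"categorical": {"column_name": c}} for c in categorical_columns
    ]
    return transformations


# Derive the per-axis names instead of typing them out
axes = ["X1", "Y1"]
signals = ["ActualPosition", "CommandPosition"]
numeric = [f"{axis}_{signal}" for axis in axes for signal in signals]

example = build_column_transformations(numeric, ["Machining_Process"])
print(len(example))  # 5
print(example[0])    # {'numeric': {'column_name': 'X1_ActualPosition'}}
```

Each entry has the same `{"<type>": {"column_name": ...}}` shape as the entries in `TRANSFORMATION`, so the generated list could be passed in its place once the full column lists are supplied.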

In [None]:
# Create a Vertex AI AutoML job
dag = vertex_ai.AutoMLTabularTrainingJob(
    display_name=VERTEX_AI_PREFIX,
    optimization_prediction_type="classification",
    optimization_objective="maximize-au-roc",
    column_transformations=TRANSFORMATION,
)

model = dag.run(
    dataset=dataset,
    model_display_name=VERTEX_AI_PREFIX,
    predefined_split_column_name=SPLIT_COL,
    budget_milli_node_hours=1000,
    disable_early_stopping=False,
    target_column=LABEL_COL,
)
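
The training budget is expressed in milli node hours, where 1,000 milli node hours equal one node hour, so the value of `1000` above requests one node hour. A small conversion helper can make that explicit; the name `node_hours_to_milli` is hypothetical.

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to the milli node hours expected by the training job."""
    return int(round(node_hours * 1000))


print(node_hours_to_milli(1))    # 1000
print(node_hours_to_milli(2.5))  # 2500
```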

> Note: The Vertex AI AutoML tabular job may run for 2+ hours. 

## Validate model

You will validate the model by:

1. [Deploying the model to an endpoint for online prediction](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#aiplatform_create_endpoint_sample-python)
1. Using the endpoint to predict on one `unworn` record and one `worn` record

In [None]:
endpoint = model.deploy(
    machine_type="n1-standard-2",
)

In [None]:
def top_k_feature_attributions(explanations, k=5):
    feat_attr = list(explanations[0].attributions[0].feature_attributions.items())
    return sorted(feat_attr, key=lambda x: abs(x[1]), reverse=True)[:k]

In [None]:
unworn_dict = df_experiments[df_experiments["No"] == 17].iloc[0].astype("str").to_dict()

unworn_pred = endpoint.explain(instances=[unworn_dict])

print(unworn_pred.predictions)
print(top_k_feature_attributions(unworn_pred.explanations))

In [None]:
worn_dict = df_experiments[df_experiments["No"] == 18].iloc[0].astype("str").to_dict()

worn_pred = endpoint.explain(instances=[worn_dict])

print(worn_pred.predictions)
print(top_k_feature_attributions(worn_pred.explanations))

### Observations

1. The online predictions return [prediction](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#interpret_prediction_results) and [explanation](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#interpret_explanation_results) results.
1. [Prediction](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#interpret_prediction_results) results include the class labels and respective confidence scores. The confidence score communicates how strongly your model associates each class label with an item. The higher the number, the higher the model's confidence that the label should be applied to that item.
1. [Explanation](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-online-predictions#interpret_explanation_results) results include the [feature attribution](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-explanations) values for each feature. In this case, only the 5 features with the largest absolute attributions are shown.
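
To turn a prediction into a single label, you can pick the class with the highest confidence score. The sketch below assumes each prediction is a dictionary with parallel `classes` and `scores` lists; the response shown is hypothetical and the helper name `top_class` is not part of the SDK.

```python
def top_class(prediction: dict) -> tuple:
    """Return the (label, score) pair with the highest confidence score."""
    pairs = zip(prediction["classes"], prediction["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Hypothetical response shaped like a tabular classification prediction
example_prediction = {"classes": ["unworn", "worn"], "scores": [0.27, 0.73]}
print(top_class(example_prediction))  # ('worn', 0.73)
```

Under that assumption, `top_class(worn_pred.predictions[0])` would return the predicted label for the `worn` record above.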

## Clean Up

In [None]:
import glob

files_to_remove = glob.glob(os.path.join(LOCAL_DATA_PATH, "*.csv"))
files_to_remove += glob.glob(os.path.join(LOCAL_DATA_PATH, "*.txt"))
files_to_remove += glob.glob(os.path.join(LOCAL_DATA_PATH, "*.jpg"))

for f in files_to_remove:
    os.remove(f)

In [None]:
endpoint.undeploy_all()
endpoint.delete()