# Predictive Maintenance for Machines

This notebook shows how to predict whether a machine will fail using Spark ML's linear SVM classifier (`LinearSVC`).

#### **Steps**
Using Spark, this notebook:
1) Reads the table [AI4I 2020 Predictive Maintenance Dataset](https://doi.org/10.24432/C5HS5C) from the **public_datasets** dataset located in the [metastore](../gcp_services/README.md) (the notebook must be connected to the public metastore to use this specific dataset).
2) Processes the dataset to select features and trains (fits) the classification model to predict a target value.  
   **Features**: type (indexed), air temperature [K], process temperature [K], rotational speed [rpm], torque [Nm], tool wear [min]  
   **Target**: machine failure
3) Evaluates the model and plots the results.

#### **Details of the dataset**

- Real predictive maintenance datasets are generally difficult to obtain and, in particular, difficult to publish. This dataset is therefore synthetic and, to the best of its authors' knowledge, reflects real predictive maintenance data encountered in industry.
- There are no missing values.

### Setup

#### Identity and Access Management (IAM)

Make sure the service account running this notebook has the required permissions:

- **Run the notebook**
  - AI Platform Notebooks Service Agent
  - Notebooks Admin
  - Vertex AI Administrator
- **Read tables from Dataproc Metastore**
  - Dataproc Metastore Editor
  - Dataproc Metastore Metadata Editor
  - Dataproc Metastore Metadata User
  - Dataproc Metastore Service Agent
- **Read files from bucket**
  - Storage Object Viewer
- **Run Dataproc jobs**
  - Dataproc Service Agent
  - Dataproc Worker

In [None]:
# Import dependencies
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix


from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler
from pyspark.ml.classification import LinearSVC
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

from imblearn.over_sampling import SMOTE


In [None]:
from pyspark.sql import SparkSession


In [None]:
spark = SparkSession.builder \
    .appName("Linear SVM Predictive Maintenance") \
    .enableHiveSupport() \
    .getOrCreate()


In [None]:
raw_dataset = spark.read.table("public_datasets.ai4i_2020_predictive_maintenance")


### Exploratory Data Analysis

In [None]:
# Show the count of each class
class_counts = raw_dataset.groupBy('machine_failure').count()

# Calculate and display the class distribution
total_count = raw_dataset.count()
class_counts.withColumn('Percentage', (class_counts['count'] / total_count) * 100).show()


|machine_failure|count|Percentage|
|---------------|-----|----------|
|              0| 9661|     96.61|
|              1|  339|      3.39|
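
With a split this uneven, plain accuracy is a weak yardstick. As a minimal sketch using only the counts from the table above, the snippet below works out what a trivial classifier that always predicts "no failure" would score. The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the table above
counts = {0: 9661, 1: 339}
total = sum(counts.values())

# A trivial classifier that always predicts the majority class ("no failure")
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Recall on the failure class for that same classifier: it never predicts 1,
# so it finds 0 of the 339 failures
baseline_failure_recall = 0 / counts[1]

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline failure recall: {baseline_failure_recall:.4f}")
```

Any model evaluated later should be judged against this baseline, and on how well it recovers the failure class rather than on accuracy alone.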

### Process dataset to create features

In [None]:
# Drop identifier columns that are not relevant for prediction
filtered_dataset = raw_dataset.drop("udi", "product_id")
# Drop the individual failure-type flags; only the overall machine_failure label is needed
filtered_dataset = filtered_dataset.drop("twf", "hdf", "pwf", "osf", "rnf")

# Cast every column except the categorical "type" column to float
for column in filtered_dataset.columns:
    if column != "type":
        filtered_dataset = filtered_dataset.withColumn(column, filtered_dataset[column].cast("float"))


In [None]:
# Use StringIndexer to convert the categorical "type" column to a numeric index
type_indexer = StringIndexer(inputCol="type", outputCol="type_index")
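
By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0. The plain-Python sketch below imitates that ordering on a made-up sample of the `type` column. `frequency_index` is a hypothetical helper, the sample values are assumptions, and breaking ties alphabetically is an assumption based on the documented behavior of recent Spark versions.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically (an assumption; see the lead-in).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of the "type" column
sample_types = ["L", "M", "L", "H", "L", "M"]
mapping = frequency_index(sample_types)
print(mapping)
```

The resulting index is an arbitrary ordinal code. A linear model treats it as a numeric magnitude, which is worth keeping in mind when interpreting the `type_index` feature.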


In [None]:
# Assemble features into a single column
feature_cols = ["type_index", "air_temperature_k", "process_temperature_k", "rotational_speed_rpm", "torque_nm", "tool_wear_min"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")


In [None]:
# Standardize features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True)


In [None]:
# Create a pipeline that runs the preprocessing stages (indexing, assembling, scaling)
prep_pipeline = Pipeline(stages=[type_indexer, assembler, scaler])

processed_dataset = prep_pipeline.fit(filtered_dataset).transform(filtered_dataset)
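
`StandardScaler` with `withMean=True` and `withStd=True` centers each feature and divides it by its standard deviation. The sketch below reproduces that arithmetic in plain Python on hypothetical torque readings (the values are made up for illustration). It assumes the sample standard deviation (n - 1 denominator), which is what Spark's documentation describes. Note that the cell above fits the scaler on the full dataset before any train/test split; fitting it on the training split only is a common alternative when test statistics should not influence preprocessing.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale them to unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / std for v in values]


# Hypothetical torque readings in Nm
torque_nm = [42.8, 46.3, 49.4, 39.5, 40.0]
scaled = standardize(torque_nm)
print([round(v, 3) for v in scaled])
```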


In [None]:
dataset = processed_dataset.select("scaled_features", "machine_failure")


In [None]:
dataset.show(5)



|     scaled_features|machine_failure|
|--------------------|---------------|
|[0.74437552188095...|            0.0|
|[-0.7452693087793...|            0.0|
|[-0.7452693087793...|            0.0|
|[-0.7452693087793...|            0.0|
|[-0.7452693087793...|            0.0|


### Train/Fit the model

In [None]:
# Split the data into training and testing sets
train_data, test_data = dataset.randomSplit([0.8, 0.2], seed=24)

# Create a Linear Support Vector Machine (SVM) model
svm = LinearSVC(maxIter=100, regParam=0.01, labelCol="machine_failure", featuresCol="scaled_features")

pipeline = Pipeline(stages=[svm])

# Train the model
model = pipeline.fit(train_data)
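
`LinearSVC` fits a linear decision function by minimizing the hinge loss with L2 regularization, controlled here by `regParam`. The sketch below evaluates such an objective in plain Python for a hand-picked weight vector. The helper names, weights, and rows are all hypothetical, and the exact scaling of the penalty term inside Spark may differ from the `0.5 * reg_param` factor assumed here.

```python
def hinge_loss(weights, bias, features, label):
    """Hinge loss for one example; label is 0.0 or 1.0, as in machine_failure."""
    y = 2.0 * label - 1.0  # map {0, 1} to {-1, +1}
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return max(0.0, 1.0 - y * margin)


def objective(weights, bias, rows, reg_param):
    """Mean hinge loss plus an L2 penalty on the weights (assumed scaling)."""
    data_loss = sum(hinge_loss(weights, bias, f, l) for f, l in rows) / len(rows)
    l2_penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return data_loss + l2_penalty


# Hypothetical weights and two scaled examples: (features, label)
weights, bias = [0.5, -1.0], 0.1
rows = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
value = objective(weights, bias, rows, reg_param=0.01)
print(round(value, 5))
```

Examples that sit on the correct side of the margin contribute zero loss, so only points near or across the decision boundary shape the fitted model.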


### Evaluate the model

In [None]:
# Make predictions on the test data
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="machine_failure")
area_under_roc = evaluator.evaluate(predictions)

# Print the Area Under ROC
print("Area Under ROC: " + str(area_under_roc))
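
`BinaryClassificationEvaluator` computes the area under the ROC curve by default, and it reads the model's raw scores (the `rawPrediction` column) rather than the hard 0/1 predictions. One way to read the metric is as the probability that a randomly chosen failure row scores higher than a randomly chosen non-failure row. The sketch below computes that quantity by brute-force pair counting. `pairwise_auc` is a hypothetical helper and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and decision scores
example_labels = [0, 0, 1, 0, 1]
example_scores = [-1.2, -0.4, 0.3, 0.5, 1.1]
auc = pairwise_auc(example_labels, example_scores)
print(round(auc, 4))
```

Because the metric depends only on ranking, it can stay high even when the default decision threshold labels almost every row as "no failure".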


In [None]:
# Convert the PySpark DataFrame to a Pandas DataFrame for confusion matrix
predictions_pd = predictions.select("machine_failure", "prediction").toPandas()

# Compute the confusion matrix
confusion = confusion_matrix(predictions_pd["machine_failure"], predictions_pd["prediction"])

# Visualize the confusion matrix
def plot_confusion_matrix(cm, classes, normalize=False, title="Confusion Matrix", cmap=plt.cm.Blues):
    # Normalize before drawing so that the colors and the cell labels show the same values
    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Show two decimals for normalized values and plain integers for raw counts
    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()

class_names = ["No failure", "Failure"]
plot_confusion_matrix(confusion, classes=class_names, title="Confusion Matrix")

plt.show()
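
The confusion matrix also yields per-class metrics that are more informative than accuracy on imbalanced data. The sketch below derives precision, recall, and F1 for the **Failure** class from a 2x2 matrix laid out as `confusion_matrix` returns it (rows are true labels, columns are predicted labels). `failure_metrics` is a hypothetical helper and the counts are invented for illustration; they are not results from this notebook.

```python
def failure_metrics(cm):
    """Return (precision, recall, f1) for the positive class of a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: [[TN, FP], [FN, TP]]
example_cm = [[1900, 10], [50, 20]]
precision, recall, f1 = failure_metrics(example_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With the real matrix, the same helper could be called as `failure_metrics(confusion.tolist())`.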


### Handle imbalanced dataset

Since the dataset is heavily imbalanced (only 3.39% of the rows belong to the **Failure** class), we will over-sample this minority class in `machine_failure`.

We will use `SMOTE` for over-sampling.
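
SMOTE creates synthetic minority rows by interpolating between an existing minority row and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step in plain Python. `smote_like_sample` is a hypothetical helper, the two feature vectors are made up, and the neighbor is supplied by hand, whereas the real algorithm picks it among the k nearest neighbors (`k_neighbors=5` by default in imbalanced-learn).

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)
# Two hypothetical scaled feature vectors from the Failure class
failure_a = [0.2, 1.5]
failure_b = [0.6, 0.5]
synthetic = smote_like_sample(failure_a, failure_b, rng)
print(synthetic)
```

Every synthetic row lies between two real failure rows in feature space, so SMOTE adds density to the minority class without introducing values outside its observed range.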

In [None]:
# Convert the PySpark DataFrame to a Pandas DataFrame
pandas_df = dataset.select("scaled_features", "machine_failure").toPandas()

# Apply SMOTE
smote = SMOTE(sampling_strategy="auto", random_state=42)
# Convert each Spark vector to a NumPy array so that SMOTE can consume the features
features = pandas_df["scaled_features"].apply(lambda x: x.toArray()).values.tolist()
X_resampled, y_resampled = smote.fit_resample(features, pandas_df["machine_failure"])

# Create a new DataFrame from the resampled data
resampled_dataset = spark.createDataFrame(pd.DataFrame({"scaled_features": X_resampled, "machine_failure": y_resampled}))


In [None]:
# Show the count of each class
class_counts = resampled_dataset.groupBy('machine_failure').count()

# Calculate and display the class distribution
total_count = resampled_dataset.count()
class_counts.withColumn('Percentage', (class_counts['count'] / total_count) * 100).show()


|machine_failure|count|Percentage|
|---------------|-----|----------|
|            0.0| 9661|      50.0|
|            1.0| 9661|      50.0|


In [None]:
# Define a UDF to convert ArrayType to VectorUDT
array_to_vector_udf = udf(lambda arr: Vectors.dense(arr), VectorUDT())
resampled_dataset = resampled_dataset.withColumn("scaled_features", array_to_vector_udf("scaled_features"))


### Train/Fit the model on the resampled dataset

In [None]:
# Split the data into training and testing sets
train_data, test_data = resampled_dataset.randomSplit([0.8, 0.2], seed=24)

# Create a Linear Support Vector Machine (SVM) model
svm = LinearSVC(maxIter=100, regParam=0.01, labelCol="machine_failure", featuresCol="scaled_features")

pipeline = Pipeline(stages=[svm])

# Train the model
model = pipeline.fit(train_data)


### Evaluate the model on the resampled dataset

In [None]:
# Make predictions on the test data
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="machine_failure")
area_under_roc = evaluator.evaluate(predictions)

# Print the Area Under ROC
print("Area Under ROC: " + str(area_under_roc))
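
The resampled dataset is split after SMOTE has been applied, so the test split above contains synthetic rows and the score may look more favorable than it would on untouched data. A common alternative is to split first and over-sample only the training portion. The sketch below illustrates that ordering in plain Python. `split_then_oversample` is a hypothetical helper, the rows are made up, and it duplicates random minority rows as a simple stand-in for SMOTE.

```python
import random


def split_then_oversample(rows, test_fraction, seed):
    """Split rows of (features, label), then balance only the training part.

    Returns (balanced_train, untouched_test). Assumes label 1 is the minority.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority = [r for r in train if r[1] == 1]
    majority = [r for r in train if r[1] == 0]
    if not minority:
        raise ValueError("training split contains no minority rows")

    # Stand-in for SMOTE: duplicate random minority rows until the classes match
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test


# Hypothetical data: 30 failure rows and 70 non-failure rows
example_rows = [([float(i)], 1 if i < 30 else 0) for i in range(100)]
balanced_train, test_rows = split_then_oversample(example_rows, test_fraction=0.2, seed=24)
print(len(balanced_train), len(test_rows))
```

The test rows keep the original class distribution, so metrics computed on them reflect how the model would behave on real, imbalanced data.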


In [None]:
# Convert the PySpark DataFrame to a Pandas DataFrame for confusion matrix
predictions_pd = predictions.select("machine_failure", "prediction").toPandas()

# Compute the confusion matrix
confusion = confusion_matrix(predictions_pd["machine_failure"], predictions_pd["prediction"])

# Visualize the confusion matrix, reusing plot_confusion_matrix defined in the first evaluation cell

class_names = ["No failure", "Failure"]
plot_confusion_matrix(confusion, classes=class_names, title="Confusion Matrix")

plt.show()


In [None]:
# Stop the Spark session
spark.stop()
