## Machine Predictive Maintenance Classification

#### About Dataset
Machine Predictive Maintenance Classification Dataset
Since real predictive maintenance datasets are generally difficult to obtain and in particular difficult to publish, we present and provide a synthetic dataset that reflects real predictive maintenance encountered in the industry to the best of our knowledge.

The original dataset consists of 10,000 data points stored as rows with 14 features in columns. The copy loaded in this notebook has 10 columns (see the `data.info()` output below).

- UDI: unique identifier ranging from 1 to 10000
- Product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
- air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
- process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K
- rotational speed [rpm]: calculated from a power of 2860 W, overlaid with normally distributed noise
- torque [Nm]: torque values are normally distributed around 40 Nm with σ = 10 Nm and no negative values
- tool wear [min]: the quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process
- 'machine failure' label: indicates whether the machine has failed at this particular data point for any of the failure modes

Important: There are two targets. Do not make the mistake of using one of them as a feature, as it will lead to leakage.
- Target: Failure or Not
- Failure Type: Type of Failure
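
The leakage warning can be made concrete with a small sketch. The frame below is a hypothetical three-row stand-in that only borrows the dataset's column names; the point is that both label columns are removed from the feature matrix, whichever of the two is being predicted.

```python
import pandas as pd

# Hypothetical three-row frame using the dataset's column names (values are made up).
df = pd.DataFrame({
    "Torque [Nm]": [42.8, 46.3, 65.7],
    "Tool wear [min]": [0, 3, 210],
    "Target": [0, 0, 1],
    "Failure Type": ["No Failure", "No Failure", "Overstrain Failure"],
})

label_columns = ["Target", "Failure Type"]

# Drop BOTH labels from the features, regardless of which one is the prediction target.
X = df.drop(columns=label_columns)
y_binary = df["Target"]            # failure or not
y_multiclass = df["Failure Type"]  # type of failure

# Sanity check: no label column may remain among the features.
leaked = set(label_columns) & set(X.columns)
print("Leaked label columns:", leaked)
```

A check like the last two lines is cheap to keep next to any feature/target split.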
#### Acknowledgements
UCI: https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset

## Importing Libraries and Packages

In [54]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


## Data Exploration

In [55]:
# Load the dataset
data = pd.read_csv('predictive_maintenance.csv')
data.head()


| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |


In [56]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Target                   10000 non-null  int64  
 9   Failure Type             10000 non-null  object 
dtypes: float64(3), int64(4), object(3)
memory usage: 781.4+ KB


In [57]:
data.describe()

| | UDI | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target |
|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 300.00493 | 310.00556 | 1538.7761 | 39.98691 | 107.951 | 0.0339 |
| std | 2886.89568 | 2.000259 | 1.483734 | 179.284096 | 9.968934 | 63.654147 | 0.180981 |
| min | 1.0 | 295.3 | 305.7 | 1168.0 | 3.8 | 0.0 | 0.0 |
| 25% | 2500.75 | 298.3 | 308.8 | 1423.0 | 33.2 | 53.0 | 0.0 |
| 50% | 5000.5 | 300.1 | 310.1 | 1503.0 | 40.1 | 108.0 | 0.0 |
| 75% | 7500.25 | 301.5 | 311.1 | 1612.0 | 46.8 | 162.0 | 0.0 |
| max | 10000.0 | 304.5 | 313.8 | 2886.0 | 76.6 | 253.0 | 1.0 |


## General Observations
- Dataset Size: The dataset contains 10,000 entries, which is a substantial size for building a predictive model.

- Target Variable: The Target column indicates whether the machine failed (1) or not (0). Its mean of approximately 0.034 means that only about 3.4% of the data points are failures, so the dataset is highly imbalanced, with a much larger proportion of cases without failure.
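
To see why this imbalance matters for evaluation, here is a minimal sketch. It builds a hypothetical label vector with the same failure rate as reported by `data.describe()` (339 failures in 10,000 rows) and scores a trivial "model" that always predicts no failure.

```python
import numpy as np

# Hypothetical labels with the failure rate from data.describe():
# 339 failures in 10,000 rows (mean of Target = 0.0339).
y = np.array([1] * 339 + [0] * 9661)

failure_rate = y.mean()

# A trivial baseline that always predicts the majority class (no failure).
always_no_failure = np.zeros_like(y)
baseline_accuracy = (always_no_failure == y).mean()

# Recall on the failure class: fraction of true failures that were predicted as failures.
baseline_recall = (always_no_failure[y == 1] == 1).mean()

print(f"failure rate:      {failure_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

The baseline reaches high accuracy while detecting no failures at all, which is one reason a threshold-independent metric such as ROC AUC is a more informative choice than plain accuracy here.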

## Feature Observations
- UDI (Unique Identifier): Ranges from 1 to 10,000. It is a running row count, so it is evenly distributed by construction.

- Air Temperature [K]: Ranges from 295.3 to 304.5 K, with a mean of about 300 K. The relatively small standard deviation (2.00 K) indicates that the air temperatures do not vary widely.
- Process Temperature [K]: Ranges from 305.7 to 313.8 K, with a mean of approximately 310 K. The standard deviation of 1.48 K is also low, so process temperature shows the same limited variation as air temperature.
- Rotational Speed [rpm]: Varies more than the temperatures, ranging from 1168 to 2886 rpm, with a mean of 1538.78 rpm. The standard deviation (179.28 rpm) is higher relative to the mean, indicating more variability in this feature.
- Torque [Nm]: Ranges from 3.8 to 76.6 Nm, with an average of around 40 Nm. The standard deviation is about 10 Nm, indicating moderate variability.
- Tool Wear [min]: Ranges from 0 to 253 minutes, with an average of around 108 minutes. The standard deviation (63.65 min) shows that the data points cover a wide range of accumulated tool usage.

## Distribution of Each Feature

In [58]:
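# Count each class and name the columns explicitly so the plot does not
# depend on the column names that value_counts() happens to produce
target_counts = (data['Target'].value_counts()
                 .rename_axis('Target Class')
                 .reset_index(name='Frequency'))
fig = px.bar(target_counts, x='Target Class', y='Frequency',
             title='Distribution of Target Classes',
             text='Frequency')
fig.show()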

In [59]:
numerical_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
colors = px.colors.qualitative.Plotly  # Using a set of predefined colors

for i, col in enumerate(numerical_columns):
    fig = px.histogram(data, x=col, nbins=50, title=f'Distribution of {col}', color_discrete_sequence=[colors[i % len(colors)]])
    fig.show()

The histogram of Air Temperature [K] appears multi-modal: it has several peaks, which may indicate the presence of multiple sub-groups within the data.

The histogram of Process Temperature [K] appears to be roughly normally distributed.

The histogram of Rotational speed [rpm] is right-skewed: lower rotational speeds occur more often, and occurrences become rarer as speed increases.

The histogram of Torque [Nm] shows a bell-shaped distribution.

The histogram of Tool Wear [min] appears to be roughly uniform.

The dataset is clearly imbalanced: only 339 of the 10,000 data points are failures.
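
The shape descriptions above (bell-shaped, right-skewed, uniform) are read off the histograms by eye. A numerical check is possible with the sample skewness; the sketch below uses synthetic stand-in columns (assumptions, not the real data) to show how the statistic behaves for each shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three shapes discussed above (not the real columns).
samples = pd.DataFrame({
    "bell_shaped": rng.normal(loc=40, scale=10, size=10_000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.5, size=10_000),
    "uniform": rng.uniform(0, 250, size=10_000),
})

# Sample skewness: close to 0 for symmetric shapes, clearly positive for a long right tail.
skewness = samples.skew()
print(skewness.round(2))
```

On the real data, the same call would be `data[numerical_columns].skew()`. Note that skewness cannot tell a bell shape from a uniform one, since both are symmetric, so it complements the histograms rather than replacing them.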

## Bivariate Analysis

In [60]:
# Box plots of numerical columns, split by Target
for col in numerical_columns:
    # Compare each feature's distribution between the two Target classes
    fig = px.box(data, x='Target', y=col, color='Target',
                 title=f'Distribution of {col} by Target')
    # Customize the legend title
    fig.update_layout(legend_title_text='Target')
    fig.show()


## Multivariate Analysis

In [61]:
# Create the scatter matrix
abbreviated_labels = {
    'Air temperature [K]': 'Air Temp',
    'Process temperature [K]': 'Proc Temp',
    'Rotational speed [rpm]': 'RPM',
    'Torque [Nm]': 'Torque',
    'Tool wear [min]': 'Tool Wear'
}

# Create the scatter matrix with abbreviated labels
fig = px.scatter_matrix(
    data,
    dimensions=numerical_columns,
    color='Target',
    title='Pair Plot of Numerical Features Colored by Target',
    labels=abbreviated_labels
)

# Customize layout and axes to prevent label overlap
fig.update_layout(
    width=1200,
    height=800,
    font=dict(size=10),
    margin=dict(l=100, r=100, t=100, b=100)
)

# Rotate axis labels
fig.update_xaxes(tickangle=45)
fig.update_yaxes(tickangle=-45)

# Remove density plots along the diagonal
fig.update_traces(diagonal_visible=False)

# Show the plot
fig.show()

#### Air Temperature vs. Process Temperature:

These two features show a very strong positive correlation, as indicated by the nearly linear pattern in their scatter plot. The dots are tightly clustered along a line, suggesting a direct relationship between air and process temperatures.

#### RPM:

The RPM scatter plots against other features appear to be more spread out, indicating less of a linear relationship.
The RPM values for Target=1 (yellow points) are more scattered and seem to have a wider range compared to Target=0 (blue points).

#### Torque:

There is a clear inverse relationship between Torque and RPM, shown by the downward trend in their scatter plot. This is typical of machinery running at roughly constant power, where an increase in speed (RPM) comes with a decrease in torque.
As with RPM, Torque shows a distinct pattern with the target variable: yellow points (Target=1) are more concentrated at higher torque values.
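
The inverse trend follows from the standard relation between mechanical power, torque, and angular velocity, P = τ·ω with ω = 2π·rpm/60. The sketch below illustrates it; the function names and the 5000 W figure are illustrative assumptions and are not taken from the dataset.

```python
import math

def mechanical_power_w(torque_nm: float, speed_rpm: float) -> float:
    """Mechanical power in watts: P = torque * angular velocity."""
    angular_velocity = 2 * math.pi * speed_rpm / 60  # rad/s
    return torque_nm * angular_velocity

def torque_at_power_nm(power_w: float, speed_rpm: float) -> float:
    """Torque in Nm needed to deliver power_w at speed_rpm."""
    return power_w / (2 * math.pi * speed_rpm / 60)

# Holding power fixed at an arbitrary illustrative value, torque falls as speed rises.
illustrative_power_w = 5000.0
for rpm in (1200, 1500, 2000, 2800):
    torque = torque_at_power_nm(illustrative_power_w, rpm)
    print(f"{rpm:>5} rpm -> {torque:6.1f} Nm")
```

Because the product of torque and speed is a power, `data['Torque [Nm]'] * data['Rotational speed [rpm]']` (suitably scaled) could also serve as a derived feature.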

#### Tool Wear:

The scatter plots for Tool Wear do not show a strong pattern or trend with the other features. However, there's a slight concentration of yellow points at higher values of Tool Wear, which could indicate that greater tool wear might correlate with the occurrence of the event captured by Target=1.

#### Distribution of the Target Variable:

In all plots, yellow points represent instances where the target is 1, and these points are less frequent compared to the blue points (Target=0).
There is no clear separation between the yellow and blue points, suggesting that the target variable might not be easily predictable based on these features alone.

#### Data Density:

The density of the blue points is generally higher, reflecting the class imbalance with more instances of Target=0.
The yellow points are less dense and more spread out, which could indicate that the conditions leading to Target=1 are more varied.

## Checking for Missing Values

In [62]:
missing_values = data.isnull().sum()
missing_values



UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Target                     0
Failure Type               0
dtype: int64

#### There are no missing values in the dataset

## Outlier Detection

In [63]:
# Using boxplots to visualize outliers, with a consistent set of predefined colors
colors = px.colors.qualitative.Plotly

for i, col in enumerate(numerical_columns):
    fig = px.box(data, y=col, title=f'Boxplot of {col}', color_discrete_sequence=[colors[i % len(colors)]])
    fig.show()



#### Rotational Speed and Torque are the two features with visible outliers. However, we will not remove these outliers, because extreme values are normal during machine start-up and shut-down
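
The outliers in a box plot are the points beyond the whiskers, which conventionally extend 1.5 × IQR past the quartiles. A minimal sketch of that rule follows; the helper name and the sample values are hypothetical, and plotting libraries may compute quartiles slightly differently than `Series.quantile`.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical rotational speeds [rpm]; the last value mimics an extreme reading.
speeds = pd.Series([1408, 1423, 1433, 1498, 1503, 1551, 1612, 2886])
mask = iqr_outlier_mask(speeds)

print(speeds[mask].tolist())  # → [2886]
```

Applied per column, for example `iqr_outlier_mask(data['Torque [Nm]']).sum()`, this would give a count of flagged points without removing them.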

## Feature Engineering

In [64]:
data['Failure Type'].unique()

array(['No Failure', 'Power Failure', 'Tool Wear Failure',
       'Overstrain Failure', 'Random Failures',
       'Heat Dissipation Failure'], dtype=object)

In [65]:
# Drop 'UDI' and 'Product ID' columns
data.drop(['UDI', 'Product ID'], axis=1, inplace=True)
data.head()

| | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
|---|---|---|---|---|---|---|---|---|
| 0 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |


In [66]:
# One-hot encode the 'Type' column ('Failure Type' is a label and is left unchanged)
data = pd.get_dummies(data, columns=['Type'])

In [67]:
data.head()

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type | Type_H | Type_L | Type_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure | False | False | True |
| 1 | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure | False | True | False |
| 2 | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure | False | True | False |
| 3 | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure | False | True | False |
| 4 | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure | False | True | False |


## Data Preprocessing

In [69]:
# Split the data into features and target
X = data.drop(['Target','Failure Type'], axis=1)
y = data['Target']

In [70]:
data.head()

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type | Type_H | Type_L | Type_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure | False | False | True |
| 1 | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure | False | True | False |
| 2 | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure | False | True | False |
| 3 | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure | False | True | False |
| 4 | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure | False | True | False |


## Model Training & Evaluation

In [71]:
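# Scaling the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# `models`, `log_reg` and `random_forest` are used below but not defined in this
# export; the definitions and hyperparameters here are assumptions.
log_reg = LogisticRegression(max_iter=1000)
random_forest = RandomForestClassifier(random_state=42)
models = {'Logistic Regression': log_reg, 'SVM': SVC(probability=True), 'Random Forest': random_forest}

# Stratified K-Fold for handling class imbalances
cv = StratifiedKFold(n_splits=5)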

# Function to calculate the mean ROC curve
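def compute_mean_roc(model, X, y, cv):
    # Work on plain arrays so the fold indices are always positional
    X, y = np.asarray(X), np.asarray(y)
    mean_fpr = np.linspace(0, 1, 100)
    tprs = []

    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        y_scores = model.predict_proba(X[test_idx])[:, 1]
        fpr, tpr, _ = roc_curve(y[test_idx], y_scores)
        # Resample this fold's curve onto the common FPR grid
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        interp_tpr[0] = 0.0  # every ROC curve starts at (0, 0)
        tprs.append(interp_tpr)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0  # and ends at (1, 1)
    return mean_fpr, mean_tpr, auc(mean_fpr, mean_tpr)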

# Compute and store each model's mean ROC curve
roc_data = []
for name, model in models.items():
    mean_fpr, mean_tpr, roc_auc = compute_mean_roc(model, X_scaled, y, cv)
    for i in range(len(mean_fpr)):
        roc_data.append({'False Positive Rate': mean_fpr[i], 'True Positive Rate': mean_tpr[i], 'Model': name, 'AUC': roc_auc})

# Convert to DataFrame
roc_df = pd.DataFrame(roc_data)

# Plot using Plotly Express
fig = px.line(roc_df, x='False Positive Rate', y='True Positive Rate', color='Model', 
              title='Average ROC Curves Across Folds',
              labels={'AUC': 'Area Under Curve'},
              hover_data=['Model', 'AUC'])

fig.show()

In [72]:
# Function to calculate the mean AUC
def compute_mean_auc(model, X, y, cv):
    # Work on plain arrays so the fold indices are always positional
    X, y = np.asarray(X), np.asarray(y)
    aucs = []

    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        y_scores = model.predict_proba(X[test_idx])[:, 1]
        fpr, tpr, _ = roc_curve(y[test_idx], y_scores)
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)

    return np.mean(aucs)

# Compute and print the mean AUC for each model
for name, model in models.items():
    mean_auc = compute_mean_auc(model, X_scaled, y, cv)
    print(f"{name}: Average AUC = {mean_auc:.2f}")

Logistic Regression: Average AUC = 0.87
SVM: Average AUC = 0.86
Random Forest: Average AUC = 0.87


All three models perform similarly, with average AUCs between 0.86 and 0.87. Since the linear model does about as well as the non-linear ones, the remaining features appear to have a consistent and understandable relationship with the target variable, regardless of model complexity.
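
One caveat with the evaluation above is that the scaler is fit on all rows before cross-validation, so each held-out fold contributes to the scaling statistics. A common alternative is to wrap the scaler and the model in a `Pipeline` and let `cross_val_score` refit both per fold. The sketch below shows the pattern on synthetic, imbalanced stand-in data (an assumption for illustration), so its score is not comparable to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# The scaler is refit on the training part of every fold,
# so no statistics leak from the held-out part.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = cross_val_score(pipeline, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")

print(f"Average AUC = {fold_aucs.mean():.2f}")
```

With the notebook's data, the same pattern would take the unscaled `X` and `y` in place of the synthetic arrays.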

## Feature Importance Analysis

In [73]:
# Feature names
feature_names = X.columns

# Logistic Regression Feature Importance
log_reg_importance = pd.DataFrame({'Feature': feature_names, 'Importance': np.abs(log_reg.coef_[0])})
log_reg_importance = log_reg_importance.sort_values(by='Importance', ascending=False)
log_reg_importance

| | Feature | Importance |
|---|---|---|
| 3 | Torque [Nm] | 2.646244 |
| 0 | Air temperature [K] | 2.301442 |
| 2 | Rotational speed [rpm] | 1.907963 |
| 1 | Process temperature [K] | 1.839489 |
| 4 | Tool wear [min] | 0.653077 |
| 6 | Type_L | 0.062194 |
| 5 | Type_H | 0.042318 |
| 7 | Type_M | 0.038759 |


In [74]:
# Random Forest Feature Importance
rf_importance = pd.DataFrame({'Feature': feature_names, 'Importance': random_forest.feature_importances_})
rf_importance = rf_importance.sort_values(by='Importance', ascending=False)

# Print Feature Importance
rf_importance

| | Feature | Importance |
|---|---|---|
| 3 | Torque [Nm] | 0.299748 |
| 2 | Rotational speed [rpm] | 0.250412 |
| 4 | Tool wear [min] | 0.149855 |
| 0 | Air temperature [K] | 0.144177 |
| 1 | Process temperature [K] | 0.133339 |
| 6 | Type_L | 0.009049 |
| 7 | Type_M | 0.007485 |
| 5 | Type_H | 0.005935 |
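
Coefficient magnitudes and impurity-based importances are both computed on the training data and are measured on different scales, so the two rankings are not directly comparable. Permutation importance on held-out data is a model-agnostic cross-check. The sketch below is not part of the original analysis: it uses synthetic data in which, by construction, only the first three columns are informative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 3 informative features are columns 0-2,
# and columns 3-5 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on the held-out set and measure the drop in ROC AUC.
result = permutation_importance(
    forest, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

Running the same call on a fitted `random_forest` with a held-out split of the notebook's data would show whether Torque and Rotational Speed stay on top under this measure.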


## Torque and Rotational Speed are the most useful indicators for predicting Machine Failure

1. Torque [Nm]:

What It Is: 
Torque is a measure of the rotational force applied to an object. In many mechanical systems, especially those involving rotating parts like motors, engines, or drills, torque is a critical parameter.

Domain Significance: 
High or irregular torque can indicate stress or strain in mechanical systems. It may suggest issues like resistance in movement, mechanical wear, or the need for more power to maintain performance. Therefore, torque is often a strong indicator of the mechanical health and efficiency of the system.

Predictive Value: 
Variations in torque readings can be indicative of maintenance needs or impending failures. In predictive maintenance models, torque can be a significant predictor of equipment failure or performance degradation.

2. Rotational Speed [rpm]:

What It Is: 
Rotational speed, measured in revolutions per minute (rpm), indicates how fast a component is spinning. It’s a fundamental parameter for any rotating machinery.

Domain Significance: 
Abnormal rotational speeds, either too high or too low, can be symptomatic of issues in the machinery. High speeds may lead to excessive wear or heat generation, while low speeds might signal power issues or mechanical obstructions.

Predictive Value: 
Consistent monitoring of rotational speed can help in predicting maintenance needs. Sudden changes or trends away from normal operating speeds could be early indicators of mechanical faults or inefficiencies.

Combined Importance in Predictive Models:
Interrelationship: 
Torque and rotational speed are closely coupled: a change in one usually affects the other. For instance, an increase in torque may lead to a decrease in rotational speed and vice versa, depending on the system's design and current load.

Indicator of Mechanical Health:
Together, these parameters paint a comprehensive picture of a machine's operational health. A predictive maintenance model might, therefore, find these two features particularly useful in forecasting potential issues or failures.

Actionable Insights:
Detecting anomalies or trends in torque and rotational speed can prompt preventive maintenance actions, helping avoid costly downtimes and extend the lifespan of machinery.