
### **<font color='white gray'>Preventive Maintenance Recommendation System Integrated with IoT for Reducing Unplanned Downtime</font>**

# **Installing and Loading Packages**

In [7]:
# To update a package, run the command below in the terminal or command prompt:
# pip install -U package_name

# To install a specific version of a package, run the command below in the terminal or command prompt:
# pip install package_name==desired_version

# After installing or updating the package, restart the Jupyter notebook.

# Install the `watermark` package.
# This package is used to record the versions of other packages used in this notebook.
!pip install -q -U watermark

> **Note:** It may be necessary to remove the package named `imblearn` with the command:  
> `pip uninstall imblearn`  
>  
> [Official Package Documentation](https://pypi.org/project/imblearn/)  
>  
> The correct package to use is `imbalanced-learn`, but it is imported as `imblearn`!
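
To check which of the two distributions is actually installed before uninstalling anything, a minimal sketch using only the standard library could look like the one below. The helper name `installed_version` is hypothetical (it is not part of this notebook), and the sketch assumes Python 3.8 or later, where `importlib.metadata` is available.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# 'imbalanced-learn' is the distribution this notebook installs;
# 'imblearn' is the similarly named distribution the note warns about.
versions = {name: installed_version(name) for name in ("imbalanced-learn", "imblearn")}
print(versions)
```

A value of `None` means that distribution is not installed in the current environment.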

In [8]:
!pip install -q imbalanced-learn==0.12.3

In [9]:
!pip install -q lightgbm

In [10]:
# 1. Imports
import joblib
import sklearn
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [11]:
%reload_ext watermark
%watermark -a "Your_Name"

Author: Your_Name



## **Loading Data and Checking Class Proportion**

In [12]:
# 2. Load the dataset
df = pd.read_csv('dataset.csv')

In [13]:
# 3. Dataset Shape
df.shape

(10000, 6)

In [14]:
# 4. Dataset Sample
df.head()

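|   | vibration | temperature | pressure | humidity | working_hours | maintenance_required |
|---|-----------|-------------|----------|----------|---------------|----------------------|
| 0 | 0.250951 | 92.419225 | 100.311847 | 67.596275 | 7499 | 0 |
| 1 | 0.895355 | 69.132552 | 96.137413 | 70.454398 | 600 | 0 |
| 2 | 0.564789 | 66.456903 | 93.642299 | 31.822434 | 6919 | 0 |
| 3 | 0.853165 | 81.967579 | 101.924996 | 46.543886 | 4032 | 1 |
| 4 | 2.143944 | 60.097525 | 97.527537 | 50.129838 | 8036 | 0 |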


In [15]:
# 5. Dataset Information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   vibration             10000 non-null  float64
 1   temperature           10000 non-null  float64
 2   pressure              10000 non-null  float64
 3   humidity              10000 non-null  float64
 4   working_hours         10000 non-null  int64  
 5   maintenance_required  10000 non-null  int64  
dtypes: float64(4), int64(2)
memory usage: 468.9 KB


In [16]:
# 6. Class Proportion
print(df['maintenance_required'].value_counts())

maintenance_required
1    5517
0    4483
Name: count, dtype: int64


- **1**: Positive class (maintenance was required for the machine)  
- **0**: Negative class (maintenance was not required for the machine)
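
As a quick arithmetic check on how strong the imbalance is, the counts printed above can be turned into shares and a majority-to-minority ratio. This is only a sketch: the counts are copied by hand from the `value_counts()` output, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {1: 5517, 0: 4483}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Ratio of the larger class to the smaller class
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(round(imbalance_ratio, 2))
```

A ratio this close to 1 suggests the imbalance here is mild, which is worth keeping in mind when comparing the balancing options that follow.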

## **Data Preparation**

In [17]:
# 7. Separate explanatory variables (X) and target variable (y)
X = df.drop('maintenance_required', axis=1)
y = df['maintenance_required']

In [18]:
# 8. Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

> **Why do we apply class balancing only on training data?**

We apply class balancing exclusively to the training data to ensure the model learns effectively without being biased toward the majority class. Here are the main reasons:

1. **Generalization to Real-World Data**  
   The test set must reflect the original distribution of the data since it represents real-world scenarios. Balancing the test set would distort the model's evaluation results.

2. **Avoiding Model Bias**  
   Balancing the training data ensures the model does not become biased toward the majority class, but balancing the test set would create artificial conditions that don't match reality.

3. **Realistic Evaluation Metrics**  
   Evaluating the model on an unbalanced test set allows for a realistic measurement of metrics like precision, recall, and AUC-ROC, which are critical in practical applications.
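
The reasoning above can be sketched on synthetic data: split first, rebalance only the training portion, and leave the test portion with its original class proportions. The data and the names (`X_demo`, `y_demo`, and so on) are hypothetical stand-ins, and `stratify=y_demo` is an addition for the sake of a deterministic illustration; the split used in this notebook does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data: 1000 rows, 5 features, 55% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 550 + [0] * 450)

# 1) Split BEFORE any balancing so the test set keeps the original proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Oversample the minority class (0) in the training portion only
minority_mask = y_tr == 0
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(
    X_tr[minority_mask], y_tr[minority_mask],
    replace=True, n_samples=n_majority, random_state=42,
)
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority_mask], y_min_up])

train_counts = np.bincount(y_tr_bal)  # balanced
test_counts = np.bincount(y_te)       # untouched, still imbalanced
print("train:", train_counts, "test:", test_counts)
```

The training counts come out equal, while the test counts still mirror the 45/55 split of the synthetic data.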



## **Option 1 - Adjusting Class Weights in the Model (Without Resampling)**

> For more details, refer to the [official documentation of `RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [19]:
# 9. Standardize the data
scaler_v1 = StandardScaler()
X_train_scaled = scaler_v1.fit_transform(X_train)
X_test_scaled = scaler_v1.transform(X_test)

In [20]:
# 10. Instantiate the model with class weight adjustment
model_v1 = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced'
)
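
To make `class_weight='balanced'` less of a black box: the scikit-learn documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand, assuming the training-split class sizes reported elsewhere in this notebook (4393 rows of class 1 and 3607 rows of class 0). The variable names are illustrative.

```python
# Training-split class sizes as reported in this notebook (assumed here)
n_class0 = 3607
n_class1 = 4393

n_samples = n_class0 + n_class1
n_classes = 2

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {
    0: n_samples / (n_classes * n_class0),
    1: n_samples / (n_classes * n_class1),
}
print({label: round(w, 4) for label, w in weights.items()})

# Each class ends up with the same total weight
total_weight_0 = weights[0] * n_class0
total_weight_1 = weights[1] * n_class1
print(total_weight_0, total_weight_1)
```

The smaller class gets a weight above 1 and the larger class a weight below 1, so both contribute equally to training without adding or removing any rows.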

In [21]:
%%time
# 11. Train the model and measure execution time
model_v1.fit(X_train_scaled, y_train)

CPU times: user 2.43 s, sys: 12.8 ms, total: 2.44 s
Wall time: 2.65 s


In [22]:
# 12. Make predictions on the test set
y_pred = model_v1.predict(X_test_scaled)
y_pred_proba = model_v1.predict_proba(X_test_scaled)[:, 1]

In [23]:
# 13. Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 98.70%
AUC-ROC: 98.94%

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.99       876
           1       0.99      0.98      0.99      1124

    accuracy                           0.99      2000
   macro avg       0.99      0.99      0.99      2000
weighted avg       0.99      0.99      0.99      2000



In [24]:
# 14. Save the model and scaler to disk
model_file = 'model_v1.pkl'
scaler_file = 'scaler_v1.pkl'

joblib.dump(model_v1, model_file)
joblib.dump(scaler_v1, scaler_file)

print(f"Model saved at {model_file}")
print(f"Scaler saved at {scaler_file}")

Model saved at model_v1.pkl
Scaler saved at scaler_v1.pkl


## **Option 2 - Undersampling the Majority Class**

In [25]:
# 15. Class Proportion
df.maintenance_required.value_counts()

maintenance_required
1    5517
0    4483
Name: count, dtype: int64


In [26]:
# 16. Concatenate X_train and y_train to facilitate resampling
train_data = pd.concat([X_train, y_train], axis=1)

In [27]:
# 17. Separate the majority and minority classes from the training set
df_majority = train_data[train_data.maintenance_required == 1]
df_minority = train_data[train_data.maintenance_required == 0]

> **Remember:** Resampling is applied only to the training data!

In [28]:
# 18. Check the size of the majority class
len(df_majority)

4393

In [29]:
# 19. Check the size of the minority class
len(df_minority)

3607

In [30]:
# 20. Apply undersampling to the majority class in the training set
df_majority_undersampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),  # Match the size of the minority class
    random_state=42
)

In [31]:
# 21. Combine the minority class and the undersampled majority class
train_data_balanced = pd.concat([df_majority_undersampled, df_minority])

In [32]:
# 22. Split the balanced dataset into X_train and y_train
X_train_balanced = train_data_balanced.drop('maintenance_required', axis=1)
y_train_balanced = train_data_balanced['maintenance_required']

In [33]:
# 23. Check the class balance in the training set
print(y_train_balanced.value_counts())

maintenance_required
1    3607
0    3607
Name: count, dtype: int64


In [34]:
# 24. Standardize the data
scaler_v2 = StandardScaler()
X_train_scaled = scaler_v2.fit_transform(X_train_balanced)
X_test_scaled = scaler_v2.transform(X_test)

In [35]:
# 25. Instantiate the model
model_v2 = RandomForestClassifier(n_estimators=100, random_state=42)

In [36]:
%%time
# 26. Train the model and measure execution time
model_v2.fit(X_train_scaled, y_train_balanced)

CPU times: user 2.11 s, sys: 7.29 ms, total: 2.12 s
Wall time: 2.23 s


In [37]:
# 27. Evaluate the model on the test set
y_pred = model_v2.predict(X_test_scaled)
y_pred_proba = model_v2.predict_proba(X_test_scaled)[:, 1]

In [38]:
# 28. Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 98.20%
AUC-ROC: 98.28%

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       876
           1       0.99      0.98      0.98      1124

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000



In [39]:
# 29. Save the model and scaler to disk
model_file = 'model_v2.pkl'
scaler_file = 'scaler_v2.pkl'

joblib.dump(model_v2, model_file)
joblib.dump(scaler_v2, scaler_file)

print(f"Model saved at {model_file}")
print(f"Scaler saved at {scaler_file}")

Model saved at model_v2.pkl
Scaler saved at scaler_v2.pkl


## **Option 3 - Oversampling the Minority Class**

In [40]:
# 30. Concatenate X_train and y_train to facilitate resampling
train_data = pd.concat([X_train, y_train], axis=1)

In [41]:
# 31. Separate the majority and minority classes from the training set
df_majority = train_data[train_data.maintenance_required == 1]
df_minority = train_data[train_data.maintenance_required == 0]

> **Remember:** Resampling is applied only to the training data!

In [42]:
# 32. Check the size of the majority class
len(df_majority)

4393

In [43]:
# 33. Check the size of the minority class
len(df_minority)

3607

In [44]:
# 34. Oversampling the minority class in the training set
df_minority_oversampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),  # Match the size of the majority class
    random_state=42
)

In [45]:
# 35. Combine the majority class and the oversampled minority class
train_data_balanced = pd.concat([df_majority, df_minority_oversampled])

In [46]:
# 36. Split the balanced dataset into X_train and y_train
X_train_balanced = train_data_balanced.drop('maintenance_required', axis=1)
y_train_balanced = train_data_balanced['maintenance_required']

In [47]:
# 37. Check the class balance in the training set
print(y_train_balanced.value_counts())

maintenance_required
1    4393
0    4393
Name: count, dtype: int64


In [48]:
# 38. Standardize the data
scaler_v3 = StandardScaler()
X_train_scaled = scaler_v3.fit_transform(X_train_balanced)
X_test_scaled = scaler_v3.transform(X_test)

In [49]:
# 39. Create the model
model_v3 = RandomForestClassifier(n_estimators=100, random_state=42)

In [50]:
%%time
# 40. Train the model and measure execution time
model_v3.fit(X_train_scaled, y_train_balanced)

CPU times: user 2.6 s, sys: 15.7 ms, total: 2.62 s
Wall time: 5.29 s


In [51]:
# 41. Evaluate the model on the test set
y_pred = model_v3.predict(X_test_scaled)
y_pred_proba = model_v3.predict_proba(X_test_scaled)[:, 1]

In [52]:
# 42. Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 98.25%
AUC-ROC: 98.96%

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       876
           1       0.98      0.98      0.98      1124

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000



In [53]:
# 43. Save the model and scaler to disk
model_file = 'model_v3.pkl'
scaler_file = 'scaler_v3.pkl'

joblib.dump(model_v3, model_file)
joblib.dump(scaler_v3, scaler_file)

print(f"Model saved at {model_file}")
print(f"Scaler saved at {scaler_file}")

Model saved at model_v3.pkl
Scaler saved at scaler_v3.pkl


## **Option 4 - Automatic Balancing with SMOTE**

**SMOTE (Synthetic Minority Over-sampling Technique):**  
SMOTE generates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors. Because those neighbors are found by distance, unscaled variables with larger magnitudes (e.g., `working_hours` compared to `vibration`) can dominate the neighbor search and disproportionately influence the new samples. For that reason, the data is standardized before SMOTE is applied.
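
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_like_sample`, its parameters, and the random demo data are all hypothetical.

```python
import numpy as np


def smote_like_sample(X_minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbors."""
    rng = np.random.default_rng() if rng is None else rng

    # Pick a random minority sample as the starting point
    i = rng.integers(len(X_minority))

    # Euclidean distance from that sample to every minority sample
    distances = np.linalg.norm(X_minority - X_minority[i], axis=1)

    # Position 0 of the sorted order is the sample itself (distance 0), so skip it
    neighbors = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbors)

    # Interpolate: a random point on the segment between sample i and neighbor j
    lam = rng.random()
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])


rng = np.random.default_rng(42)
X_minority_demo = rng.normal(size=(20, 2))  # hypothetical minority-class points
synthetic_point = smote_like_sample(X_minority_demo, k=5, rng=rng)
print(synthetic_point)
```

Because the neighbor search relies on Euclidean distance, any feature with a much larger numeric range would decide which neighbors are chosen, which is the motivation for scaling first.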

In [54]:
# 44. Standardization
scaler_v4 = StandardScaler()
X_train_scaled = scaler_v4.fit_transform(X_train)
X_test_scaled = scaler_v4.transform(X_test)

In [55]:
# 45. Create the SMOTE instance
smote = SMOTE(random_state=42)

In [56]:
# 46. Train and apply SMOTE on the training set
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)



In [57]:
# 47. Check the size of the resampled training set
len(X_train_smote)

8786

In [58]:
# 48. Create the model
model_v4 = RandomForestClassifier(n_estimators=100, random_state=42)

In [59]:
%%time
# 49. Train the model and measure execution time
model_v4.fit(X_train_smote, y_train_smote)

CPU times: user 2.91 s, sys: 21.2 ms, total: 2.93 s
Wall time: 3.97 s


In [60]:
# 50. Evaluate the model on the test set
y_pred = model_v4.predict(X_test_scaled)
y_pred_proba = model_v4.predict_proba(X_test_scaled)[:, 1]

In [61]:
# 51. Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 98.85%
AUC-ROC: 99.03%

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.99       876
           1       0.99      0.98      0.99      1124

    accuracy                           0.99      2000
   macro avg       0.99      0.99      0.99      2000
weighted avg       0.99      0.99      0.99      2000



In [62]:
# 52. Save the model and scaler to disk
model_file = 'model_v4.pkl'
scaler_file = 'scaler_v4.pkl'

joblib.dump(model_v4, model_file)
joblib.dump(scaler_v4, scaler_file)

print(f"Model saved at {model_file}")
print(f"Scaler saved at {scaler_file}")

Model saved at model_v4.pkl
Scaler saved at scaler_v4.pkl


## **Option 5 - Automatic Balancing with SMOTE and Algorithm Change**

### **Attention! Observe the Order of Standardization and SMOTE Tasks**

The correct order between standardization and SMOTE depends on the nature of the operations each one performs:

- **SMOTE (Synthetic Minority Over-sampling Technique):**  
  SMOTE generates new samples for the minority class based on the distances between data points. This operation is sensitive to distances, and if the data is not scaled, variables with larger magnitudes (e.g., `working_hours` compared to `vibration`) may disproportionately influence the generation of new samples.

- **Standardization:**  
  Standardization scales the variables to have a mean of 0 and a standard deviation of 1, removing magnitude differences between variables. Since SMOTE relies on distances, it is crucial that the data is standardized before applying SMOTE to ensure that all variables have equal influence on the generation of new samples.

#### **Correct Order:**
You should apply **standardization first** and **SMOTE second.** This ensures that the distances between points, which SMOTE uses to generate new examples, are not distorted by variables with differing scales.
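
A small, hypothetical example shows how scale can change which point counts as the "nearest neighbor". The four readings below are made up for illustration, with columns `(vibration, working_hours)`, and the standardization uses the population standard deviation, as `StandardScaler` does.

```python
import numpy as np

# Hypothetical readings: columns are (vibration, working_hours)
readings = np.array([
    [0.25, 7499.0],  # reference point
    [0.30, 7199.0],  # similar vibration, 300 hours apart
    [2.14, 7450.0],  # very different vibration, 49 hours apart
    [1.20, 600.0],   # far away in working hours
])


def nearest_neighbor_of_first(points):
    """Return the row index of the point closest to row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1


# Standardize each column to mean 0 and standard deviation 1
standardized = (readings - readings.mean(axis=0)) / readings.std(axis=0)

raw_nn = nearest_neighbor_of_first(readings)
scaled_nn = nearest_neighbor_of_first(standardized)
print(raw_nn, scaled_nn)
```

On the raw values, the nearest neighbor is row 2, because the distance is driven almost entirely by `working_hours`. After standardization it is row 1, whose vibration is nearly identical to the reference point.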

### **Attention! SMOTE First and Standardization After**


In [63]:
%%time

# 53. Create the SMOTE instance
smote = SMOTE(random_state=42)

# 54. Train and apply SMOTE on the training set
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# 55. Standardization
scaler_v5 = StandardScaler()
X_train_scaled = scaler_v5.fit_transform(X_train_smote)
X_test_scaled = scaler_v5.transform(X_test)

# 56. Create the model
model_v5 = lgb.LGBMClassifier(random_state=42)

# 57. Train the model with the SMOTE-balanced training set
model_v5.fit(X_train_scaled, y_train_smote)

# 58. Evaluate the model on the test set
y_pred = model_v5.predict(X_test_scaled)
y_pred_proba = model_v5.predict_proba(X_test_scaled)[:, 1]

# 59. Model evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

[LightGBM] [Info] Number of positive: 4393, number of negative: 4393
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002648 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1275
[LightGBM] [Info] Number of data points in the train set: 8786, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000

Accuracy: 91.90%
AUC-ROC: 97.59%

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.97      0.91       876
           1       0.98      0.88      0.92      1124

    accuracy                           0.92      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.92      0.92      2000

CPU times: user 339 ms, sys: 12.8 ms, total: 352 ms
Wall time: 596 ms




### **Attention! Standardization First and SMOTE After**

In [64]:
%%time

# 60. Standardize the variables
scaler_v5 = StandardScaler()
X_train_scaled = scaler_v5.fit_transform(X_train)
X_test_scaled = scaler_v5.transform(X_test)

# 61. Create the SMOTE instance
smote = SMOTE(random_state=42)

# 62. Apply SMOTE to the training set to handle class imbalance
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

# 63. Create the LightGBM model
model_v5 = lgb.LGBMClassifier(random_state=42)

# 64. Train the LightGBM model
model_v5.fit(X_train_smote, y_train_smote)

# 65. Evaluate the model on the original test set
y_pred = model_v5.predict(X_test_scaled)
y_pred_proba = model_v5.predict_proba(X_test_scaled)[:, 1]

# 66. Model evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nAccuracy: {accuracy * 100:.2f}%")
print(f"AUC-ROC: {roc_auc * 100:.2f}%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))



[LightGBM] [Info] Number of positive: 4393, number of negative: 4393
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000836 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1275
[LightGBM] [Info] Number of data points in the train set: 8786, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000

Accuracy: 92.40%
AUC-ROC: 97.47%

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.98      0.92       876
           1       0.98      0.88      0.93      1124

    accuracy                           0.92      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.92      0.92      2000

CPU times: user 392 ms, sys: 6.31 ms, total: 398 ms
Wall time: 732 ms




In [65]:
# 67. Save the model and scaler to disk
model_file = 'model_v5.pkl'
scaler_file = 'scaler_v5.pkl'

joblib.dump(model_v5, model_file)
joblib.dump(scaler_v5, scaler_file)

print(f"Model saved at {model_file}")
print(f"Scaler saved at {scaler_file}")

Model saved at model_v5.pkl
Scaler saved at scaler_v5.pkl


## **Model Selection**
**Version 1 of the Model:**

- Accuracy: 98.70%
- AUC-ROC: 98.94%

**Version 2 of the Model:**

- Accuracy: 98.20%
- AUC-ROC: 98.28%

**Version 3 of the Model:**

- Accuracy: 98.25%
- AUC-ROC: 98.96%

**Version 4 of the Model:**

- Accuracy: 98.85%
- AUC-ROC: 99.03%

**Version 5 of the Model:**

- **SMOTE First, Standardization After!**
  - Accuracy: 91.90%
  - AUC-ROC: 97.59%

- **Standardization First, SMOTE After! (This is the ideal order)**
  - Accuracy: 92.40%
  - AUC-ROC: 97.47%

**Which model would you choose?**

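A **Machine Learning model** is the result of this equation: **algorithm + data**!  
Following that equation, the simplest model is the one that delivers good performance with minimal changes to the data and minimal changes to the algorithm. Version 4 scores marginally higher (98.85% vs. 98.70% accuracy), but it requires generating synthetic samples with SMOTE, whereas Version 1 only sets `class_weight='balanced'` and leaves the training data untouched. Based on this criterion, **Version 1** is the best option and will be our choice for deployment!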

Version 1 of the model showed the best balance between:

- **Generalization ability**
- **Performance**
- **Simplicity**
- **Interpretability**

Our challenge is always to find the model that balances these four elements! And that is almost an art!
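
One way to make this trade-off explicit is to keep every version whose accuracy is close to the best one and then pick the simplest among them. The sketch below copies the test-set metrics from the evaluation cells above. The `extra_steps` scores and the `tolerance` value are hypothetical assumptions introduced only for illustration, not measurements from this notebook.

```python
# Test-set metrics (percent) copied from the evaluation cells above.
# 'extra_steps' is a hypothetical simplicity score: how many changes to the data
# or the algorithm each version needs beyond the baseline Random Forest.
results = {
    "v1": {"accuracy": 98.70, "auc": 98.94, "extra_steps": 0},  # class weights only
    "v2": {"accuracy": 98.20, "auc": 98.28, "extra_steps": 1},  # undersampling
    "v3": {"accuracy": 98.25, "auc": 98.96, "extra_steps": 1},  # oversampling
    "v4": {"accuracy": 98.85, "auc": 99.03, "extra_steps": 1},  # SMOTE
    "v5": {"accuracy": 92.40, "auc": 97.47, "extra_steps": 2},  # SMOTE + LightGBM
}

# Hypothetical tolerance: accuracy points we accept losing in exchange for simplicity
tolerance = 0.5

best_accuracy = max(r["accuracy"] for r in results.values())
candidates = [
    name for name, r in results.items()
    if best_accuracy - r["accuracy"] <= tolerance
]

# Among the near-best versions, prefer the one with the fewest extra steps
chosen = min(candidates, key=lambda name: results[name]["extra_steps"])
print(candidates, chosen)
```

With these assumptions, Versions 1 and 4 are the near-best candidates and Version 1 is selected. A different tolerance or a different simplicity score could change the outcome.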

## **Testing the Model Deployment**

In [66]:
# 68. Function to recommend maintenance based on new IoT sensor data
def maintenance_recommendation(new_data):

    # Define the column names as per the scaler adjustment
    columns = ['vibration', 'temperature', 'pressure', 'humidity', 'working_hours']

    # Convert new data to DataFrame with correct column names
    new_data_df = pd.DataFrame([new_data], columns=columns)

    # Apply the scaler to the new data
    new_data_scaled = scaler_v1.transform(new_data_df)

    # Make the prediction (predict returns an array with one entry per input row)
    prediction = model_v1.predict(new_data_scaled)

    if prediction[0] == 1:
        return "Recommendation: Perform maintenance."
    else:
        return "Recommendation: No maintenance needed."

In [67]:
# 69. Example of new IoT sensor data
new_data_1 = [0.5, 80, 102, 45, 8000]
print(maintenance_recommendation(new_data_1))

Recommendation: Perform maintenance.


In [68]:
# 70. Example of new IoT sensor data
new_data_2 = [0.89, 92, 96, 70, 600]
print(maintenance_recommendation(new_data_2))

Recommendation: No maintenance needed.
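
The function above uses the `model_v1` and `scaler_v1` objects that are still in memory. A deployed app would typically load them from the files saved with `joblib.dump`. The sketch below shows that round trip on a tiny, hypothetical toy model written to a temporary directory, so it runs on its own. The names `model_toy`, `scaler_toy`, and the random data are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with 5 features, standing in for the sensor readings
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Fit a scaler and a small model, mirroring the scaler + model pair saved above
scaler_toy = StandardScaler().fit(X_toy)
model_toy = RandomForestClassifier(n_estimators=10, random_state=42)
model_toy.fit(scaler_toy.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_toy.pkl")
    scaler_path = os.path.join(tmp_dir, "scaler_toy.pkl")

    # Save both artifacts, then load them back as a separate app would
    joblib.dump(model_toy, model_path)
    joblib.dump(scaler_toy, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The loaded pair should reproduce the in-memory prediction
new_reading = rng.normal(size=(1, 5))
pred_original = model_toy.predict(scaler_toy.transform(new_reading))
pred_loaded = loaded_model.predict(loaded_scaler.transform(new_reading))
print(pred_original[0], pred_loaded[0])
```

The key point is that the scaler and the model must be saved and loaded as a pair, since the model expects inputs transformed by the same scaler it was trained with.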


**Let's now deploy via web app using Streamlit.**

In [69]:
%watermark -a "Your_Name"

Author: Your_Name



In [70]:
!pip show imbalanced-learn | grep Version

Version: 0.12.3


In [71]:
%watermark -v -m

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [72]:
%watermark --iversions

lightgbm: 4.5.0
joblib  : 1.4.2
imblearn: 0.12.3
numpy   : 1.26.4
pandas  : 2.2.2
sklearn : 1.6.0



# **The End**