# Building a Predictive Maintenance Model for a Delivery Company Using Classification Techniques

## Introduction
In this project, I develop a predictive maintenance model for a delivery company that predicts device failure from nine device attributes. The dataset is highly imbalanced, with roughly 124,000 records for functioning devices and only about 100 for failed ones. To address this, I use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples for the failed class so that the models train on balanced data. Several classification models are then evaluated to identify the most effective one.
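
To make the idea behind SMOTE concrete, here is a minimal sketch of its core step: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are made up for the example.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class (values are made up)
failed = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.5, 14.0)]
new_points = smote_like_samples(failed, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation, it always stays inside the region spanned by the real minority samples.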

### Import Necessary Libraries

In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import BernoulliNB


## Exploratory Data Analysis and Feature Engineering

In [2]:
df = pd.read_csv("failure.csv")

In [3]:
df.head()

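|   | date | device | failure | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 |
|---|------|--------|---------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| 0 | 2015-01-01 | S1F01085 | 0 | 215630672 | 56 | 0 | 52 | 6 | 407438 | 0 | 0 | 7 |
| 1 | 2015-01-01 | S1F0166B | 0 | 61370680 | 0 | 3 | 0 | 6 | 403174 | 0 | 0 | 0 |
| 2 | 2015-01-01 | S1F01E6Y | 0 | 173295968 | 0 | 0 | 0 | 12 | 237394 | 0 | 0 | 0 |
| 3 | 2015-01-01 | S1F01JE0 | 0 | 79694024 | 0 | 0 | 0 | 6 | 410186 | 0 | 0 | 0 |
| 4 | 2015-01-01 | S1F01R2B | 0 | 135970480 | 0 | 0 | 0 | 15 | 313173 | 0 | 0 | 3 |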


In [4]:
df.shape

(124494, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        124494 non-null  object
 1   device      124494 non-null  object
 2   failure     124494 non-null  int64 
 3   attribute1  124494 non-null  int64 
 4   attribute2  124494 non-null  int64 
 5   attribute3  124494 non-null  int64 
 6   attribute4  124494 non-null  int64 
 7   attribute5  124494 non-null  int64 
 8   attribute6  124494 non-null  int64 
 9   attribute7  124494 non-null  int64 
 10  attribute8  124494 non-null  int64 
 11  attribute9  124494 non-null  int64 
dtypes: int64(10), object(2)
memory usage: 11.4+ MB


In [6]:
devices = df["device"]

In [7]:
df = df.drop(["device", "date"], axis=1)

In [8]:
df.head()

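|   | failure | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 |
|---|---------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| 0 | 0 | 215630672 | 56 | 0 | 52 | 6 | 407438 | 0 | 0 | 7 |
| 1 | 0 | 61370680 | 0 | 3 | 0 | 6 | 403174 | 0 | 0 | 0 |
| 2 | 0 | 173295968 | 0 | 0 | 0 | 12 | 237394 | 0 | 0 | 0 |
| 3 | 0 | 79694024 | 0 | 0 | 0 | 6 | 410186 | 0 | 0 | 0 |
| 4 | 0 | 135970480 | 0 | 0 | 0 | 15 | 313173 | 0 | 0 | 3 |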


## Model Training and Evaluation
### Splitting the Data into Features (X) and Target (Y)

In [9]:
x = df.drop("failure", axis=1)
y = df[["failure"]]

In [10]:
x.head()

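|   | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 |
|---|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| 0 | 215630672 | 56 | 0 | 52 | 6 | 407438 | 0 | 0 | 7 |
| 1 | 61370680 | 0 | 3 | 0 | 6 | 403174 | 0 | 0 | 0 |
| 2 | 173295968 | 0 | 0 | 0 | 12 | 237394 | 0 | 0 | 0 |
| 3 | 79694024 | 0 | 0 | 0 | 6 | 410186 | 0 | 0 | 0 |
| 4 | 135970480 | 0 | 0 | 0 | 15 | 313173 | 0 | 0 | 3 |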


In [11]:
y.head()

|   | failure |
|---|---------|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


### Balancing the Classes Using SMOTE

In [12]:
smote = SMOTE(sampling_strategy={1: 120000}, random_state=42)
x_resampled, y_resampled = smote.fit_resample(x, y)

In [13]:
y_resampled.head()

|   | failure |
|---|---------|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [14]:
y_resampled.shape

(244388, 1)

In [15]:
# Count the rows labelled as failures after resampling
int((y_resampled["failure"] == 1).sum())

120000
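
The shapes printed so far are enough to work out how many failures the original data contained and how many rows SMOTE synthesised. The short calculation below is a sketch that assumes `failure` only takes the values 0 and 1 and that SMOTE leaves the majority class untouched, which is how over-sampling with a `sampling_strategy` dictionary behaves.

```python
# Numbers printed by the notebook cells above
rows_original = 124_494        # df.shape[0]
rows_resampled = 244_388       # y_resampled.shape[0]
failures_resampled = 120_000   # failure == 1 after SMOTE

# Over-sampling only adds minority rows, so the majority count is unchanged
non_failures = rows_resampled - failures_resampled
failures_original = rows_original - non_failures
synthetic_failures = failures_resampled - failures_original

print("functioning devices:", non_failures)
print("failures in the original data:", failures_original)
print("synthetic failures added by SMOTE:", synthetic_failures)
```

Under these assumptions, almost all failure rows in the resampled data are synthetic, which matters when interpreting the scores later on.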

### Combining Resampled Features and Labels

In [16]:
new_df = pd.concat([x_resampled, y_resampled], axis=1)

In [17]:
new_df.sample(10)

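|        | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 | failure |
|--------|------------|------------|------------|------------|------------|------------|------------|------------|------------|---------|
| 166078 | 83856476 | 0 | 0 | 35 | 8 | 269638 | 0 | 0 | 0 | 1 |
| 215102 | 197592323 | 0 | 0 | 11 | 13 | 330737 | 0 | 0 | 0 | 1 |
| 186555 | 67489741 | 0 | 0 | 76 | 10 | 257116 | 16 | 16 | 0 | 1 |
| 160106 | 238762849 | 0 | 0 | 0 | 9 | 261666 | 0 | 0 | 0 | 1 |
| 218296 | 232631641 | 7360 | 0 | 9 | 7 | 269501 | 1 | 1 | 0 | 1 |
| 115138 | 231478848 | 0 | 0 | 0 | 16 | 366612 | 0 | 0 | 0 | 0 |
| 186757 | 162641451 | 4658 | 0 | 764 | 10 | 250001 | 13 | 13 | 3 | 1 |
| 115821 | 186217296 | 0 | 0 | 0 | 16 | 59 | 0 | 0 | 0 | 0 |
| 66667 | 175677728 | 0 | 0 | 0 | 8 | 27 | 0 | 0 | 0 | 0 |
| 24515 | 142859776 | 0 | 0 | 0 | 11 | 214178 | 0 | 0 | 0 | 0 |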


In [18]:
new_df.shape

(244388, 10)

### Identifying Important Features Based on Correlation with Failure

In [19]:
new_df.corr()

|            | attribute1 | attribute2 | attribute3 | attribute4 | attribute5 | attribute6 | attribute7 | attribute8 | attribute9 | failure |
|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|---------|
| attribute1 | 1.0 | -0.085372 | 0.008334 | -0.031356 | 0.069016 | -0.075497 | 0.19475 | 0.19475 | 0.055941 | 0.034029 |
| attribute2 | -0.085372 | 1.0 | -0.013584 | 0.36732 | -0.017683 | -0.021902 | 0.022083 | 0.022083 | -0.023374 | 0.249558 |
| attribute3 | 0.008334 | -0.013584 | 1.0 | 0.00351 | -0.006738 | 0.010922 | -0.009357 | -0.009357 | 0.45245 | -0.020856 |
| attribute4 | -0.031356 | 0.36732 | 0.00351 | 1.0 | 0.002218 | -0.03503 | 0.053288 | 0.053288 | 0.01865 | 0.225862 |
| attribute5 | 0.069016 | -0.017683 | -0.006738 | 0.002218 | 1.0 | -0.006355 | -0.018345 | -0.018345 | -0.003031 | 0.034528 |
| attribute6 | -0.075497 | -0.021902 | 0.010922 | -0.03503 | -0.006355 | 1.0 | -0.114936 | -0.114936 | 0.027616 | -0.009296 |
| attribute7 | 0.19475 | 0.022083 | -0.009357 | 0.053288 | -0.018345 | -0.114936 | 1.0 | 1.0 | 0.215181 | 0.226885 |
| attribute8 | 0.19475 | 0.022083 | -0.009357 | 0.053288 | -0.018345 | -0.114936 | 1.0 | 1.0 | 0.215181 | 0.226885 |
| attribute9 | 0.055941 | -0.023374 | 0.45245 | 0.01865 | -0.003031 | 0.027616 | 0.215181 | 0.215181 | 1.0 | 0.034977 |
| failure    | 0.034029 | 0.249558 | -0.020856 | 0.225862 | 0.034528 | -0.009296 | 0.226885 | 0.226885 | 0.034977 | 1.0 |


### Choosing Important Features

In [20]:
new_df = new_df.drop(["attribute1", "attribute3", "attribute5", "attribute6", "attribute9"], axis=1)
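
The notebook does not state the rule used to pick the columns to drop. One rule that reproduces the same choice is to keep every attribute whose absolute correlation with `failure` is at least 0.2. The sketch below applies that rule to the values from the `failure` row of the correlation matrix; the `THRESHOLD` value is an assumption, not something taken from the original analysis.

```python
# Correlations with `failure`, copied from the correlation matrix above
corr_with_failure = {
    "attribute1": 0.034029,
    "attribute2": 0.249558,
    "attribute3": -0.020856,
    "attribute4": 0.225862,
    "attribute5": 0.034528,
    "attribute6": -0.009296,
    "attribute7": 0.226885,
    "attribute8": 0.226885,
    "attribute9": 0.034977,
}

THRESHOLD = 0.2  # assumed cut-off; the notebook does not state one

# Keep attributes whose absolute correlation reaches the threshold
selected = [name for name, r in corr_with_failure.items() if abs(r) >= THRESHOLD]
dropped = [name for name in corr_with_failure if name not in selected]

print("keep:", selected)
print("drop:", dropped)
```

The matrix also reports a correlation of 1.0 between `attribute7` and `attribute8`, so the two columns carry the same linear information; dropping one of them would probably make little difference to the models.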

### Splitting Dataframe into Features (X) and Target (Y)

In [21]:
x = new_df.drop("failure", axis=1)
y = new_df["failure"]  # 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit()

In [22]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=42)
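
Note that this split happens after SMOTE has been applied to the full dataset, so the 20% test portion also contains synthetic failure rows. A common alternative is to split first and oversample only the training portion, so the test set holds real rows only. The sketch below shows that order of operations with the standard library alone; `stratified_split` and `oversample_minority` are hypothetical helpers, the toy data is made up, and the oversampler is a simplified stand-in for SMOTE.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_size))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def oversample_minority(rows, labels, idx, minority=1, seed=42):
    """Stand-in for SMOTE: interpolate between random pairs of minority rows."""
    rng = random.Random(seed)
    minority_rows = [rows[i] for i in idx if labels[i] == minority]
    n_majority = sum(1 for i in idx if labels[i] != minority)
    new_rows = []
    while len(minority_rows) + len(new_rows) < n_majority:
        a, b = rng.sample(minority_rows, 2)
        gap = rng.random()
        new_rows.append(tuple(u + gap * (v - u) for u, v in zip(a, b)))
    return new_rows


# Hypothetical toy data: 40 functioning devices and 5 failed ones
rows = [(float(i), float(i % 7)) for i in range(45)]
labels = [0] * 40 + [1] * 5

# 1) Split the real data first
train_idx, test_idx = stratified_split(labels)

# 2) Oversample using the training rows only
synthetic = oversample_minority(rows, labels, train_idx)

x_train_bal = [rows[i] for i in train_idx] + synthetic
y_train_bal = [labels[i] for i in train_idx] + [1] * len(synthetic)
x_test_real = [rows[i] for i in test_idx]  # real rows only
y_test_real = [labels[i] for i in test_idx]

print(len(x_train_bal), sum(y_train_bal), len(x_test_real))
```

With the libraries already imported in this notebook, the same order would be `train_test_split` on the original data (optionally with `stratify=y`), followed by `smote.fit_resample(x_train, y_train)`; the `sampling_strategy` target would likely need adjusting for the smaller training set.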

### Decision Tree Model Training and Prediction

In [23]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
y_pred = dtc.predict(x_test)

### Testing the Model

In [24]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9512459593273047
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     24650
           1       0.97      0.93      0.95     24228

    accuracy                           0.95     48878
   macro avg       0.95      0.95      0.95     48878
weighted avg       0.95      0.95      0.95     48878
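
As a reminder of how the per-class numbers in a classification report are defined, the following sketch computes precision, recall, and F1 for the positive class from raw label lists. The helper `binary_report` and the toy labels are hypothetical and are not taken from the dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 real failures, one missed, one false alarm
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = binary_report(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Plugging the failure-class row of the decision tree report into the F1 formula gives 2 · 0.97 · 0.93 / (0.97 + 0.93) ≈ 0.95, which matches the printed value.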



### Logistic Regression Model Training and Prediction

In [25]:
log = LogisticRegression()
log.fit(x_train, y_train)
y_pred = log.predict(x_test)



### Testing the Model

In [26]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8272024223577069
              precision    recall  f1-score   support

           0       0.76      0.97      0.85     24650
           1       0.96      0.68      0.80     24228

    accuracy                           0.83     48878
   macro avg       0.86      0.83      0.82     48878
weighted avg       0.86      0.83      0.82     48878



### Random Forest Model Training and Prediction

In [27]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)



### Testing the Model

In [28]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9515528458611237
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     24650
           1       0.97      0.93      0.95     24228

    accuracy                           0.95     48878
   macro avg       0.95      0.95      0.95     48878
weighted avg       0.95      0.95      0.95     48878



### Bernoulli Naive Bayes Model Training and Prediction

In [29]:
ber = BernoulliNB()
ber.fit(x_train, y_train)
y_pred = ber.predict(x_test)



### Testing the Model

In [30]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8863701460779901
              precision    recall  f1-score   support

           0       0.83      0.98      0.90     24650
           1       0.97      0.80      0.87     24228

    accuracy                           0.89     48878
   macro avg       0.90      0.89      0.89     48878
weighted avg       0.90      0.89      0.89     48878
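
One detail worth keeping in mind when reading this result: `BernoulliNB` binarizes its inputs, and with the default `binarize=0.0` every value greater than zero becomes 1. The model therefore only sees whether each attribute is non-zero, not how large it is. The sketch below mimics that step on three rows from the `new_df.sample(10)` output, restricted to the four retained attributes; the `binarize_row` helper is written for illustration and is not part of scikit-learn.

```python
def binarize_row(row, threshold=0.0):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return tuple(1 if value > threshold else 0 for value in row)


# (attribute2, attribute4, attribute7, attribute8) for three sampled rows
sampled_rows = {
    166078: (0, 35, 0, 0),
    218296: (7360, 9, 1, 1),
    115138: (0, 0, 0, 0),
}

binarized = {idx: binarize_row(row) for idx, row in sampled_rows.items()}
for idx, row in binarized.items():
    print(idx, row)
```

After binarization, a reading of 7360 and a reading of 1 are indistinguishable, which may partly explain why this model trails the tree-based ones.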



## Results

The table below summarizes each model's performance on the 20% test split of the SMOTE-balanced dataset. The values are the macro averages from the classification reports above:

| Model                     | Precision | Recall | F1-Score |
|---------------------------|-----------|--------|----------|
| **Logistic Regression**   | 0.86      | 0.83   | 0.82     |
| **Random Forest**         | 0.95      | 0.95   | 0.95     |
| **Decision Tree**         | 0.95      | 0.95   | 0.95     |
| **Bernoulli Naive Bayes** | 0.90      | 0.89   | 0.89     |

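As observed, the **Random Forest** and **Decision Tree** models outperform the others in precision, recall, and F1-score for predicting device failures, and they are effectively tied (accuracy 0.9516 versus 0.9512). Because SMOTE was applied before the train/test split, the test set also contains synthetic failure samples, so performance on real, imbalanced data may be lower.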
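
The macro-averaged precision in each report's `macro avg` row is the unweighted mean of the two per-class precisions. The sketch below recomputes it from the rounded per-class values printed in the classification reports; because the inputs are already rounded to two decimals, the last digit can occasionally differ from what scikit-learn prints.

```python
def macro_average(per_class_values):
    """Unweighted mean across classes."""
    return sum(per_class_values) / len(per_class_values)


# Per-class precision (class 0, class 1) as printed in the reports above
precision_by_model = {
    "Decision Tree": (0.93, 0.97),
    "Logistic Regression": (0.76, 0.96),
    "Random Forest": (0.93, 0.97),
    "Bernoulli Naive Bayes": (0.83, 0.97),
}

macro_precision = {
    model: macro_average(values) for model, values in precision_by_model.items()
}
for model, value in macro_precision.items():
    print(f"{model}: {value:.2f}")
```

The same helper applies to recall and F1 by swapping in the corresponding per-class columns.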
