# Building a Predictive Maintenance Model for a Delivery Company Using Classification Techniques

## Introduction
In this project, I develop a predictive maintenance model for a delivery company that predicts device failure from nine device attributes. The dataset is highly imbalanced: 124,388 records describe functioning devices and only 106 describe failed ones. To address this, I use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples of failed devices so that both classes are equally represented during training. Several classification models are then evaluated to identify the most effective one.

### Import Necessary Libraries

In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import BernoulliNB


## Exploratory Data Analysis and Feature Engineering

In [2]:
df = pd.read_csv("failure.csv")

In [3]:
df.head()

Unnamed: 0,date,device,failure,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9
0,2015-01-01,S1F01085,0,215630672,56,0,52,6,407438,0,0,7
1,2015-01-01,S1F0166B,0,61370680,0,3,0,6,403174,0,0,0
2,2015-01-01,S1F01E6Y,0,173295968,0,0,0,12,237394,0,0,0
3,2015-01-01,S1F01JE0,0,79694024,0,0,0,6,410186,0,0,0
4,2015-01-01,S1F01R2B,0,135970480,0,0,0,15,313173,0,0,3


In [4]:
df.shape

(124494, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        124494 non-null  object
 1   device      124494 non-null  object
 2   failure     124494 non-null  int64 
 3   attribute1  124494 non-null  int64 
 4   attribute2  124494 non-null  int64 
 5   attribute3  124494 non-null  int64 
 6   attribute4  124494 non-null  int64 
 7   attribute5  124494 non-null  int64 
 8   attribute6  124494 non-null  int64 
 9   attribute7  124494 non-null  int64 
 10  attribute8  124494 non-null  int64 
 11  attribute9  124494 non-null  int64 
dtypes: int64(10), object(2)
memory usage: 11.4+ MB


In [6]:
devices = df["device"]

In [7]:
df = df.drop(["device", "date"], axis=1)

In [8]:
df.head()

Unnamed: 0,failure,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9
0,0,215630672,56,0,52,6,407438,0,0,7
1,0,61370680,0,3,0,6,403174,0,0,0
2,0,173295968,0,0,0,12,237394,0,0,0
3,0,79694024,0,0,0,6,410186,0,0,0
4,0,135970480,0,0,0,15,313173,0,0,3


In [9]:
# Count records per class: 0 = functioning, 1 = failed
print((df["failure"] == 0).sum())
print((df["failure"] == 1).sum())

124388
106


### Balancing Data Imbalance Using SMOTE

In [10]:
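# Separate the nine attributes from the target column
x = df.drop("failure", axis=1)
y = df[["failure"]]
# Oversample the failure class (1) until it matches the 124,388 non-failure records
smote = SMOTE(sampling_strategy={1: 124388}, random_state=42)
x_resampled, y_resampled = smote.fit_resample(x, y)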

In [11]:
y_resampled.head()

Unnamed: 0,failure
0,0
1,0
2,0
3,0
4,0


In [12]:
y_resampled.shape

(248776, 1)

In [13]:
y_resampled[y_resampled["failure"] == 1].value_counts().sum()

124388
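
To make concrete what a synthetic sample is, the sketch below illustrates the interpolation idea behind SMOTE using NumPy only: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The function name `smote_like_sample` and the toy data are hypothetical, and this is not imblearn's implementation, which handles neighbour search and sampling strategies more carefully.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=42):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        # Pick a random minority sample as the base point.
        base = minority[rng.integers(len(minority))]
        # Find its k nearest minority neighbours; index 0 is the base itself
        # (this simple version assumes there are no duplicate rows).
        dists = np.linalg.norm(minority - base, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(neighbours)]
        # Place the new point at a random position between base and neighbour.
        gap = rng.random()
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic

# Toy minority class with two features (hypothetical values).
failed = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_sample(failed, k=2, n_new=5)
print(new_rows.shape)
```

Because every synthetic row lies between two real failure rows, the 124,388 resampled failures are all derived from the 106 original ones.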

### Combining Resampled Features and Labels

In [14]:
new_df = pd.concat([x_resampled, y_resampled], axis=1)

In [15]:
new_df.sample(10)

Unnamed: 0,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9,failure
10268,49577744,0,0,0,25,239288,0,0,0,0
31397,161273752,0,0,0,10,35,0,0,0,0
54170,33870304,0,0,0,8,221529,0,0,0,0
121796,65120168,4816,0,0,9,339720,0,0,0,0
49543,229516712,0,0,0,3,223467,0,0,3,0
6359,62416760,0,0,0,11,192060,0,0,0,0
238975,88996082,10532,0,5,11,271920,0,0,2,1
154667,152373502,257,0,3,17,274108,46,46,0,1
156345,228371364,0,0,34,57,209559,7,7,1,1
60980,86999904,0,0,0,7,304971,0,0,0,0


In [16]:
new_df.shape

(248776, 10)

### Identifying Important Features Based on Correlation with Failure

In [17]:
new_df.corr()

Unnamed: 0,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9,failure
attribute1,1.0,-0.087833,0.008579,-0.030105,0.070927,-0.074834,0.194243,0.194243,0.058853,0.034093
attribute2,-0.087833,1.0,-0.013723,0.365223,-0.017119,-0.023879,0.020114,0.020114,-0.024307,0.247283
attribute3,0.008579,-0.013723,1.0,0.002965,-0.006611,0.010824,-0.00925,-0.00925,0.44641,-0.020971
attribute4,-0.030105,0.365223,0.002965,1.0,0.002414,-0.035395,0.05053,0.05053,0.01728,0.222554
attribute5,0.070927,-0.017119,-0.006611,0.002414,1.0,-0.005932,-0.017559,-0.017559,-0.003242,0.034978
attribute6,-0.074834,-0.023879,0.010824,-0.035395,-0.005932,1.0,-0.109083,-0.109083,0.029568,-0.008657
attribute7,0.194243,0.020114,-0.00925,0.05053,-0.017559,-0.109083,1.0,1.0,0.232653,0.224206
attribute8,0.194243,0.020114,-0.00925,0.05053,-0.017559,-0.109083,1.0,1.0,0.232653,0.224206
attribute9,0.058853,-0.024307,0.44641,0.01728,-0.003242,0.029568,0.232653,0.232653,1.0,0.037619
failure,0.034093,0.247283,-0.020971,0.222554,0.034978,-0.008657,0.224206,0.224206,0.037619,1.0
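
Reading the `failure` column of a full correlation matrix by eye gets tedious, so it can help to extract and rank that column directly. The sketch below does this on a small toy DataFrame; the column names and values are hypothetical stand-ins for `new_df`.

```python
import pandas as pd

# Toy stand-in for new_df (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "attribute_a": [0, 1, 0, 3, 4, 5],
    "attribute_b": [5, 1, 4, 2, 6, 3],
    "failure":     [0, 0, 0, 1, 1, 1],
})

# Absolute correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["failure"]
    .drop("failure")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Taking the absolute value keeps strongly negative correlations from being ranked last.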


### Choosing Important Features

In [18]:
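# Keep attribute2, attribute4, attribute7 and attribute8, whose absolute correlation
# with failure is at least 0.22; the dropped attributes are all below 0.04
new_df = new_df.drop(["attribute1", "attribute3", "attribute5", "attribute6", "attribute9"], axis=1)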
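
The correlation matrix also shows that `attribute7` and `attribute8` have a correlation of 1.0 with each other, so one of them carries no extra information. A minimal sketch for detecting such redundant columns is given below; the toy data, the column names, and the 0.99 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical values); "dup" is an exact copy of "a".
features = pd.DataFrame({
    "a":   [0, 8, 0, 16, 24],
    "dup": [0, 8, 0, 16, 24],
    "b":   [3, 1, 4, 1, 5],
})

corr = features.corr().abs()
# Keep only the upper triangle so each pair of columns is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Columns that are (almost) perfectly correlated with an earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.99).any()]
print(redundant)
```

Dropping the columns listed in `redundant` would leave one representative of each duplicated pair.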

## Model Training and Evaluation

### Splitting Dataframe into Features (X) and Target (Y)

In [19]:
x = new_df.drop("failure", axis=1)
y = new_df["failure"]  # 1-D Series, the shape scikit-learn estimators expect for y

In [20]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=42)

### Decision Tree Model Training and Prediction

In [21]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
y_pred = dtc.predict(x_test)

### Testing the Model

In [22]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9514430420451805
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     24752
           1       0.97      0.93      0.95     25004

    accuracy                           0.95     49756
   macro avg       0.95      0.95      0.95     49756
weighted avg       0.95      0.95      0.95     49756
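
For readers less familiar with the columns of `classification_report`, the sketch below derives precision, recall, and F1 for the positive class from a confusion matrix. The labels are hypothetical and serve only to illustrate the arithmetic.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, only to illustrate the arithmetic.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted failures, how many were real
recall = tp / (tp + fn)     # of the real failures, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp, precision, recall, f1)
```

In a maintenance setting, recall on class 1 is the share of actual failures the model flags, and precision is the share of flagged devices that actually fail.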



### Logistic Regression Model Training and Prediction

In [23]:
log = LogisticRegression()
log.fit(x_train, y_train)
y_pred = log.predict(x_test)


### Testing the Model

In [24]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8289050566765818
              precision    recall  f1-score   support

           0       0.75      0.97      0.85     24752
           1       0.96      0.69      0.80     25004

    accuracy                           0.83     49756
   macro avg       0.86      0.83      0.83     49756
weighted avg       0.86      0.83      0.83     49756



### Random Forest Model Training and Prediction

In [25]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)


### Testing the Model

In [26]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9518249055390304
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     24752
           1       0.97      0.93      0.95     25004

    accuracy                           0.95     49756
   macro avg       0.95      0.95      0.95     49756
weighted avg       0.95      0.95      0.95     49756



### Bernoulli Naive Bayes Model Training and Prediction

In [27]:
ber = BernoulliNB()
ber.fit(x_train, y_train)
y_pred = ber.predict(x_test)


### Testing the Model

In [28]:
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8846370287000562
              precision    recall  f1-score   support

           0       0.82      0.98      0.89     24752
           1       0.97      0.79      0.87     25004

    accuracy                           0.88     49756
   macro avg       0.90      0.89      0.88     49756
weighted avg       0.90      0.88      0.88     49756
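
`BernoulliNB` models binary features. With its default `binarize=0.0`, scikit-learn thresholds each input so that any value above zero becomes 1, which means that on count-like attributes the model only sees whether an attribute is non-zero. The sketch below checks this equivalence on hypothetical toy counts.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical count-like features and labels.
X = np.array([[0, 0], [0, 3], [56, 0], [0, 0], [12, 7], [4816, 1], [0, 0], [8, 0]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# Default behaviour: inputs are thresholded at binarize=0.0 internally.
default_model = BernoulliNB().fit(X, y)

# Equivalent explicit version: binarize by hand and disable the built-in step.
X_binary = (X > 0).astype(int)
manual_model = BernoulliNB(binarize=None).fit(X_binary, y)

same = np.array_equal(default_model.predict(X), manual_model.predict(X_binary))
print(same)
```

The magnitude of each attribute is therefore ignored by this model, which may partly explain why it trails the tree-based models here.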



## Results

The table below presents the macro-averaged precision, recall, and F1-score, together with the accuracy, of each classification model on the held-out 20% of the SMOTE-balanced dataset:

| Model                     | Precision | Recall | F1-Score | Accuracy |
|---------------------------|-----------|--------|----------|----------|
| **Logistic Regression**   | 0.86      | 0.83   | 0.83     | 0.83     |
| **Random Forest**         | 0.95      | 0.95   | 0.95     | 0.95     |
| **Decision Tree**         | 0.95      | 0.95   | 0.95     | 0.95     |
| **Bernoulli Naive Bayes** | 0.90      | 0.89   | 0.88     | 0.88     |

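As observed, the **Random Forest** and **Decision Tree** models outperform the others, each reaching 0.95 for precision, recall, and F1-score, compared with an F1-score of 0.88 for Bernoulli Naive Bayes and 0.83 for Logistic Regression. Because the train/test split was made after SMOTE resampling, the test set contains synthetic failure records, so these scores describe performance on the balanced data rather than on the original class distribution.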
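
A comparison table like the one in this section can also be produced programmatically, which avoids copying numbers by hand. The sketch below fits the four model types on toy data from `make_classification` and collects the macro-averaged metrics via `classification_report(..., output_dict=True)`; the data, `max_iter=1000`, and the fixed `random_state` values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Toy balanced data (hypothetical stand-in for the resampled device records).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bernoulli Naive Bayes": BernoulliNB(),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    report = classification_report(y_test, model.predict(X_test), output_dict=True)
    macro = report["macro avg"]
    rows.append({
        "Model": name,
        "Precision": round(macro["precision"], 2),
        "Recall": round(macro["recall"], 2),
        "F1-Score": round(macro["f1-score"], 2),
        "Accuracy": round(report["accuracy"], 2),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results)
```

Swapping `"macro avg"` for `"weighted avg"` would report the weighted averages instead.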
