## Model Development and Model Evaluation

+ Trained Logistic Regression as the baseline model.
+ Logistic Regression works well when the relationship between the features and the log-odds of the target is approximately linear.
+ Calculated the accuracy score of the baseline: 0.95.

In [1]:
## import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#### Read the Data

In [2]:
## read the data
pump_data = pd.read_csv('hypothetical_pump_failure_dataset.csv')

In [3]:
## convert the timestamp column to datetime and use it as the index
pump_data['timestamp'] = pd.to_datetime(pump_data['timestamp'])
pump_data.set_index('timestamp', inplace=True)

In [4]:
## select the features
features = pump_data.drop(columns = ['failure'])
## select the target
target = pump_data['failure']

In [5]:
## split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(700, 4)
(300, 4)
(700,)
(300,)
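Because failures are rare in this data (19 of the 300 test rows), a plain random split can leave the two sets with different failure rates. A minimal sketch of a stratified split follows, assuming a synthetic stand-in for the pump data: the array shapes and the count of 60 failures are illustrative assumptions, not values taken from the real file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pump data (assumption): 1000 rows, 4 features,
# 60 rows labelled as failures.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:60] = 1

# stratify=y keeps the failure ratio (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("train failure rate:", y_train.mean())
print("test failure rate:", y_test.mean())
```

In the notebook itself, the equivalent change would be passing `stratify=target` to the existing `train_test_split` call.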


In [6]:
## train the models
# Instantiate the five classifiers
logreg = LogisticRegression()
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(random_state=42)
svm = SVC(kernel='linear', random_state=42)
xgb = XGBClassifier(random_state=42)

logreg.fit(X_train, y_train)
tree.fit(X_train, y_train)
rf.fit(X_train, y_train)
svm.fit(X_train, y_train)
xgb.fit(X_train, y_train)

In [7]:
## Predictions and evaluation
y_pred_log_reg = logreg.predict(X_test)
y_pred_tree = tree.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_svm = svm.predict(X_test)
y_pred_xgb = xgb.predict(X_test)


print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Logistic Regression Confusion Matrix :",confusion_matrix(y_test, y_pred_log_reg))
print("Logistic Regression Classification Report :",classification_report(y_test, y_pred_log_reg))
print("********************************************************************************************")
print("Decision Tree Classifier Accuracy:", accuracy_score(y_test, y_pred_tree))
print("Decision Tree Classifier Confusion Matrix :",confusion_matrix(y_test, y_pred_tree))
print("Decision Tree Classifier Classification Report :",classification_report(y_test, y_pred_tree))
print("********************************************************************************************")
print("Random Forest Classifier Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest Classifier Confusion Matrix :",confusion_matrix(y_test, y_pred_rf))
print("Random Forest Classifier Classification Report :",classification_report(y_test, y_pred_rf))
print("********************************************************************************************")
print("Support Vector Machine Classifier Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Support Vector Machine Classifier Confusion Matrix :",confusion_matrix(y_test, y_pred_svm))
print("Support Vector Machine Classifier Classification Report :",classification_report(y_test, y_pred_svm))
print("********************************************************************************************")
print("XG Boost Classifier Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("XG Boost Classifier Confusion Matrix :",confusion_matrix(y_test, y_pred_xgb))
print("XG Boost Classifier Classification Report :",classification_report(y_test, y_pred_xgb))

Logistic Regression Accuracy: 0.95
Logistic Regression Confusion Matrix : [[281   0]
 [ 15   4]]
Logistic Regression Classification Report :               precision    recall  f1-score   support

           0       0.95      1.00      0.97       281
           1       1.00      0.21      0.35        19

    accuracy                           0.95       300
   macro avg       0.97      0.61      0.66       300
weighted avg       0.95      0.95      0.93       300

********************************************************************************************
Decision Tree Classifier Accuracy: 1.0
Decision Tree Classifier Confusion Matrix : [[281   0]
 [  0  19]]
Decision Tree Classifier Classification Report :               precision    recall  f1-score   support

           0       1.00      1.00      1.00       281
           1       1.00      1.00      1.00        19

    accuracy                           1.00       300
   macro avg       1.00      1.00      1.00       300
weighted avg       1.00      1.00      1.00       300



**Note :**

+ **Precision** = of all predicted failures, the fraction that are actual failures.
+ **Recall** = of all actual failures, the fraction that were correctly identified.
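As a worked check of these two definitions, the sketch below recomputes precision, recall, and F1 for class 1 by hand from the Logistic Regression confusion matrix printed above. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix for Logistic Regression, copied from the printed output.
# Rows are actual classes, columns are predicted classes.
cm = [[281, 0],
      [15, 4]]

tn, fp = cm[0]  # actual non-failures: predicted 0, predicted 1
fn, tp = cm[1]  # actual failures:     predicted 0, predicted 1

precision = tp / (tp + fp)  # share of predicted failures that are real
recall = tp / (tp + fn)     # share of real failures that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=1.00 recall=0.21 f1=0.35
```

These values match the class 1 row of the classification report: every predicted failure was real, but only 4 of the 19 actual failures were caught.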

#### Inference of Basic Models

+ Logistic Regression performs well in predicting non-failure cases (class 0), but struggles with predicting failures (class 1). The recall for class 1 is only 0.21 (4 of 19 failures detected), which means it misses most actual failures.

+ The Decision Tree model perfectly classifies both failure and non-failure cases on the test set. However, a perfect score could be a sign of overfitting, so the model may not generalize well to new data.

+ Random Forest provides near-perfect performance, with only one misclassification. It balances avoiding overfitting with maintaining high accuracy, making it a strong candidate.

+ SVM struggles significantly with predicting failures (class 1). It predicts only non-failure cases, making it unsuitable for this problem as configured.

+ XGBoost also performs exceptionally well, similar to Random Forest. Gradient-boosted trees often perform well on imbalanced datasets.
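One way to put these accuracy figures in context is to compare them with a trivial model that always predicts "no failure". The sketch below uses only the test-set counts from the printed output (281 non-failures, 19 failures); the always-predict-0 model is a hypothetical reference point, not something trained in this notebook.

```python
# Test-set class counts, taken from the "support" column of the reports.
n_non_failure = 281
n_failure = 19
n_total = n_non_failure + n_failure

# Hypothetical trivial model that always predicts class 0 ("no failure").
baseline_accuracy = n_non_failure / n_total
baseline_recall = 0 / n_failure  # it never detects a failure

# Logistic Regression on the same test set: 281 + 4 correct predictions.
logreg_accuracy = (281 + 4) / n_total

print(f"always-predict-0 accuracy:    {baseline_accuracy:.3f}")
print(f"logistic regression accuracy: {logreg_accuracy:.3f}")
```

The trivial model already reaches about 0.937 accuracy, so Logistic Regression's 0.95 is only a small improvement. This is why recall on class 1 is the more informative number here.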

#### Selecting the Model
Given the results, Random Forest or XGBoost would be the best models to move forward with. However, hyperparameter tuning and cross-validation are still needed to refine and validate the performance of these models.
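A minimal sketch of what that tuning step could look like for Random Forest, using `GridSearchCV` with stratified folds. The synthetic data, the parameter grid, and the choice of F1 as the scoring metric are all illustrative assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the pump data (assumption, not the real file).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.94, 0.06], random_state=42,
)

# Hypothetical search space; the values are placeholders, not recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Stratified folds keep the failure ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 on the failure class instead of plain accuracy
    cv=cv,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean F1:", round(search.best_score_, 3))
```

Scoring on F1 (or recall) rather than accuracy keeps the search focused on the rare failure class, which is where the baseline models differ most.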