## Model Development and Evaluation - XGBoost

+ Normalization with MinMaxScaler
+ Cross-validation
+ Stratified sampling
+ Regularization
+ Hyperparameter tuning
+ Evaluation using precision, recall, and F1 score

In [2]:
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from xgboost import XGBClassifier

# Read and prepare the data
pump_data = pd.read_csv('hypothetical_pump_failure_dataset.csv')
pump_data['timestamp'] = pd.to_datetime(pump_data['timestamp'])
pump_data.set_index('timestamp', inplace=True)

# Select features and target
features = pump_data.drop(columns=['failure'])
target = pump_data['failure']

# Initialize and fit MinMaxScaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)

# Save the scaler
joblib.dump(scaler, 'scaler_xgboost.pkl')

# Initialize XGBoost model with regularization
xgb_model = XGBClassifier(
    eval_metric='logloss',
    reg_alpha=0.1,  # L1 regularization term
    reg_lambda=1.0  # L2 regularization term
)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

# Define the stratified K-fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, 
                           scoring='f1', cv=skf, verbose=1, n_jobs=-1)

# Train and evaluate model using Stratified K-Folds cross-validation
precision_scores = []
recall_scores = []
f1_scores = []

for train_index, test_index in skf.split(features_scaled, target):
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    # Fit the grid search on the training data
    grid_search.fit(X_train, y_train)

    # Get the best model from grid search
    best_model = grid_search.best_estimator_

    # Make predictions on the test data
    y_pred = best_model.predict(X_test)

    # Calculate precision, recall, and F1-score
    precision_scores.append(precision_score(y_test, y_pred, zero_division=0))
    recall_scores.append(recall_score(y_test, y_pred, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred, zero_division=0))

# Calculate average scores across folds
avg_precision = sum(precision_scores) / len(precision_scores)
avg_recall = sum(recall_scores) / len(recall_scores)
avg_f1 = sum(f1_scores) / len(f1_scores)

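# Save the best model from the last outer fold (best_model is overwritten on every iteration)
joblib.dump(best_model, 'xgboost_regularized_tuned_model.pkl')

# Print the best hyperparameters (from the last outer fold's search) and the fold-averaged metrics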
print(f"Best Hyperparameters: {grid_search.best_params_}")
print("XGBoost with Hyperparameter Tuning:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall: {avg_recall:.4f}")
print(f"F1 Score: {avg_f1:.4f}")
print("-" * 30)

print("XGBoost model with regularization, hyperparameter tuning, and stratified cross-validation has been trained and saved.")

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best Hyperparameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
XGBoost with Hyperparameter Tuning:
Precision: 0.9267
Recall: 0.8778
F1 Score: 0.8953
------------------------------
XGBoost model with regularization, hyperparameter tuning, and stratified cross-validation has been trained and saved.
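
The "243 candidates, totalling 1215 fits" lines in the output follow directly from the size of `param_grid` and the number of folds. The sketch below reproduces that arithmetic with the standard library only; `inner_folds` and `outer_folds` are illustrative names, and the count excludes the final refit that `GridSearchCV` performs on each training fold.

```python
from math import prod

# Same grid as in the cell above
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
}

inner_folds = 5  # folds used inside GridSearchCV (cv=skf)
outer_folds = 5  # iterations of the manual skf.split loop

# Every combination of values is one candidate
candidates = prod(len(values) for values in param_grid.values())
fits_per_search = candidates * inner_folds
total_fits = fits_per_search * outer_folds

print(candidates, fits_per_search, total_fits)  # 243 1215 6075
```

Because the grid search runs once per outer fold, the "Fitting 5 folds" message appears five times and the whole cell trains 6075 models.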


**Selecting XGBoost as the Best Model**

After applying regularization and hyperparameter tuning, the XGBoost model's recall improved, while its precision and F1 score dropped slightly compared with its previous performance:

+ Previous XGBoost Performance:
    + Precision: 0.9550
    + Recall: 0.8578
    + F1 Score: 0.9027

+ Optimized XGBoost Performance:
    + Precision: 0.9267 (slightly lower)
    + Recall: 0.8778 (improved)
    + F1 Score: 0.8953 (slightly lower)


Precision and F1 score decreased slightly, but recall improved, meaning the model misses fewer actual failures (fewer false negatives) at the cost of somewhat lower precision. The optimized XGBoost model was chosen as the final model because regularization is expected to help it generalize and it maintains a strong balance between precision and recall, making it the most reliable choice for detecting pump failures in this dataset.
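
To make the size of the trade-off explicit, the following minimal sketch computes the change in each metric from the figures reported above. The dictionary names `previous` and `optimized` are only illustrative.

```python
# Metrics copied from the comparison above
previous = {'precision': 0.9550, 'recall': 0.8578, 'f1': 0.9027}
optimized = {'precision': 0.9267, 'recall': 0.8778, 'f1': 0.8953}

# Positive delta = the optimized model is better on that metric
deltas = {name: round(optimized[name] - previous[name], 4) for name in previous}

for name, delta in deltas.items():
    print(f"{name:<10} {previous[name]:.4f} -> {optimized[name]:.4f} ({delta:+.4f})")
```

Recall gains 0.0200 while precision loses 0.0283 and F1 loses 0.0074, so the choice rests on valuing missed failures more heavily than false alarms.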


### Tweak the model

In [6]:
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from xgboost import XGBClassifier

# Read and prepare the data
pump_data = pd.read_csv('hypothetical_pump_failure_dataset.csv')
pump_data['timestamp'] = pd.to_datetime(pump_data['timestamp'])
pump_data.set_index('timestamp', inplace=True)

# Select features and target
features = pump_data.drop(columns=['failure'])
target = pump_data['failure']

# Initialize and fit MinMaxScaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)

# Save the scaler
joblib.dump(scaler, 'scaler_tweaked_xgboost.pkl')

# Initialize XGBoost model with regularization
xgb_model = XGBClassifier(
    eval_metric='logloss',
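    reg_alpha=0.5,  # L1 regularization term (candidates: 0.1, 0.5, 1.0, 2.0); overridden by param_grid during the search
    reg_lambda=0.5  # L2 regularization term (candidates: 1.0, 0.5, 2.0, 5.0); overridden by param_grid during the search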
)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [7, 9, 11],  # Three deeper depths to explore more complex trees
    'learning_rate': [0.05, 0.1, 0.2],  # Moderate learning rates for finer tuning
    'n_estimators': [200, 300, 500],  # Focus on higher estimator values
    'subsample': [0.8, 0.9, 1.0],  # Reasonable range for subsampling
    'colsample_bytree': [0.8, 0.9, 1.0],  # Limited range for feature sampling
    'gamma': [0.2, 0.5],  # Only 2 values to control minimum loss reduction
    'min_child_weight': [5, 7],  # Focus on larger values to avoid overfitting
    'reg_alpha': [0.5, 1.0],  # Higher L1 regularization to encourage sparsity
    'reg_lambda': [2.0, 5.0]  # Higher L2 regularization to control model complexity
}

# Define the stratified K-fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, 
                           scoring='f1', cv=skf, verbose=1, n_jobs=-1)

# Train and evaluate model using Stratified K-Folds cross-validation
precision_scores = []
recall_scores = []
f1_scores = []

for train_index, test_index in skf.split(features_scaled, target):
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    # Fit the grid search on the training data
    grid_search.fit(X_train, y_train)

    # Get the best model from grid search
    best_model = grid_search.best_estimator_

    # Make predictions on the test data
    y_pred = best_model.predict(X_test)

    # Calculate precision, recall, and F1-score
    precision_scores.append(precision_score(y_test, y_pred, zero_division=0))
    recall_scores.append(recall_score(y_test, y_pred, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred, zero_division=0))

# Calculate average scores across folds
avg_precision = sum(precision_scores) / len(precision_scores)
avg_recall = sum(recall_scores) / len(recall_scores)
avg_f1 = sum(f1_scores) / len(f1_scores)

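# Save the best model from the last outer fold (best_model is overwritten on every iteration)
joblib.dump(best_model, 'tweaked_xgboost_regularized_tuned_model.pkl')

# Print the best hyperparameters (from the last outer fold's search) and the fold-averaged metrics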
print(f"Best Hyperparameters: {grid_search.best_params_}")
print("XGBoost with Hyperparameter Tuning:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall: {avg_recall:.4f}")
print(f"F1 Score: {avg_f1:.4f}")
print("-" * 30)

print("XGBoost model with regularization, hyperparameter tuning, and stratified cross-validation has been trained and saved.")

Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Best Hyperparameters: {'colsample_bytree': 1.0, 'gamma': 0.5, 'learning_rate': 0.2, 'max_depth': 7, 'min_child_weight': 5, 'n_estimators': 200, 'reg_alpha': 1.0, 'reg_lambda': 2.0, 'subsample': 1.0}
XGBoost with Hyperparameter Tuning:
Precision: 0.8596
Recall: 0.8378
F1 Score: 0.8264
------------------------------
XGBoost model with regularization, hyperparameter tuning, and stratified cross-validation has been trained and saved.
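
In the output above, the reported F1 score (0.8264) is lower than both the precision (0.8596) and the recall (0.8378). This can happen because each metric is averaged over the folds independently, so the mean of per-fold F1 scores is not the F1 of the mean precision and recall. The sketch below illustrates the effect with two hypothetical folds; the per-fold values are invented for illustration and are not the notebook's actual fold results.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for two folds
folds = [(1.0, 0.6), (0.6, 1.0)]

mean_precision = sum(p for p, _ in folds) / len(folds)
mean_recall = sum(r for _, r in folds) / len(folds)
mean_f1 = sum(f1(p, r) for p, r in folds) / len(folds)  # what the notebook reports
f1_of_means = f1(mean_precision, mean_recall)            # a different quantity

print(f"mean precision: {mean_precision:.2f}")  # 0.80
print(f"mean recall:    {mean_recall:.2f}")     # 0.80
print(f"mean F1:        {mean_f1:.2f}")         # 0.75
print(f"F1 of means:    {f1_of_means:.2f}")     # 0.80
```

When precision and recall are unbalanced within individual folds, the harmonic mean penalizes each fold separately, which pulls the averaged F1 below both averaged components.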


## Feature Engineering and Model Development

In [16]:
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from xgboost import XGBClassifier

In [17]:
# Read and prepare the data
pump_data = pd.read_csv('hypothetical_pump_failure_dataset.csv')
pump_data['timestamp'] = pd.to_datetime(pump_data['timestamp'])
pump_data.set_index('timestamp', inplace=True)
pump_data.head()

| timestamp | vibration_level | temperature_C | pressure_PSI | flow_rate_m3h | failure |
|---|---|---|---|---|---|
| 2024-01-01 00:00:00 | 0.549671 | 76.996777 | 93.248217 | 40.460962 | 0 |
| 2024-01-01 01:00:00 | 0.486174 | 74.623168 | 98.554813 | 45.698075 | 0 |
| 2024-01-01 02:00:00 | 0.564769 | 70.298152 | 92.075801 | 47.931972 | 0 |
| 2024-01-01 03:00:00 | 0.652303 | 66.765316 | 96.920385 | 59.438438 | 0 |
| 2024-01-01 04:00:00 | 0.476585 | 73.491117 | 81.063853 | 52.782766 | 0 |


In [18]:
# Feature ranges
vibration_range = (0.175873, 0.885273)
temperature_range = (55.298057, 85.965538)
pressure_range = (69.804878, 139.262377)
flow_rate_range = (35.352757, 66.215465)

# Function to categorize a feature
def categorize_feature(value, feature_range):
    low, high = feature_range
    interval = (high - low) / 3
    if value <= low + interval:
        return 'Low'
    elif value <= low + 2 * interval:
        return 'Medium'
    else:
        return 'High'


# Categorize each feature
pump_data['vibration_category'] = pump_data['vibration_level'].apply(lambda x: categorize_feature(x, vibration_range))
pump_data['temperature_category'] = pump_data['temperature_C'].apply(lambda x: categorize_feature(x, temperature_range))
pump_data['pressure_category'] = pump_data['pressure_PSI'].apply(lambda x: categorize_feature(x, pressure_range))
pump_data['flow_rate_category'] = pump_data['flow_rate_m3h'].apply(lambda x: categorize_feature(x, flow_rate_range))

pump_data.head()

| timestamp | vibration_level | temperature_C | pressure_PSI | flow_rate_m3h | failure | vibration_category | temperature_category | pressure_category | flow_rate_category |
|---|---|---|---|---|---|---|---|---|---|
| 2024-01-01 00:00:00 | 0.549671 | 76.996777 | 93.248217 | 40.460962 | 0 | Medium | High | Medium | Low |
| 2024-01-01 01:00:00 | 0.486174 | 74.623168 | 98.554813 | 45.698075 | 0 | Medium | Medium | Medium | Medium |
| 2024-01-01 02:00:00 | 0.564769 | 70.298152 | 92.075801 | 47.931972 | 0 | Medium | Medium | Low | Medium |
| 2024-01-01 03:00:00 | 0.652303 | 66.765316 | 96.920385 | 59.438438 | 0 | High | Medium | Medium | High |
| 2024-01-01 04:00:00 | 0.476585 | 73.491117 | 81.063853 | 52.782766 | 0 | Medium | Medium | Low | Medium |
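
`categorize_feature` splits each feature's range into three equal-width bins. The sketch below prints the resulting cut points and re-checks the first row of the table above; `feature_ranges`, `cut_points`, and `first_row` are helper names introduced here for illustration, while the ranges and row values are copied from the notebook.

```python
def categorize_feature(value, feature_range):
    """Assign 'Low', 'Medium', or 'High' using three equal-width bins."""
    low, high = feature_range
    interval = (high - low) / 3
    if value <= low + interval:
        return 'Low'
    elif value <= low + 2 * interval:
        return 'Medium'
    return 'High'

def cut_points(feature_range):
    """Return the two thresholds that separate Low/Medium and Medium/High."""
    low, high = feature_range
    interval = (high - low) / 3
    return low + interval, low + 2 * interval

feature_ranges = {
    'vibration_level': (0.175873, 0.885273),
    'temperature_C': (55.298057, 85.965538),
    'pressure_PSI': (69.804878, 139.262377),
    'flow_rate_m3h': (35.352757, 66.215465),
}

for name, feature_range in feature_ranges.items():
    first_cut, second_cut = cut_points(feature_range)
    print(f"{name}: Low <= {first_cut:.4f} < Medium <= {second_cut:.4f} < High")

# First row of pump_data.head()
first_row = {'vibration_level': 0.549671, 'temperature_C': 76.996777,
             'pressure_PSI': 93.248217, 'flow_rate_m3h': 40.460962}
first_row_categories = {name: categorize_feature(value, feature_ranges[name])
                        for name, value in first_row.items()}
print(first_row_categories)
```

Equal-width bins leave most observations in the middle bin when a feature is roughly bell-shaped, which matters when reading the per-category failure counts.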


In [23]:
# Apply One-Hot Encoding
pump_data_encoded = pd.get_dummies(pump_data, columns=['vibration_category', 'temperature_category', 'pressure_category', 'flow_rate_category'], drop_first=True)

# Display the resulting DataFrame
pump_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2024-01-01 00:00:00 to 2024-02-11 15:00:00
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   vibration_level              1000 non-null   float64
 1   temperature_C                1000 non-null   float64
 2   pressure_PSI                 1000 non-null   float64
 3   flow_rate_m3h                1000 non-null   float64
 4   failure                      1000 non-null   int64  
 5   vibration_category_Low       1000 non-null   bool   
 6   vibration_category_Medium    1000 non-null   bool   
 7   temperature_category_Low     1000 non-null   bool   
 8   temperature_category_Medium  1000 non-null   bool   
 9   pressure_category_Low        1000 non-null   bool   
 10  pressure_category_Medium     1000 non-null   bool   
 11  flow_rate_category_Low       1000 non-null   bool   
 12  flow_rate_category_Medium    1000 non-null   bool   

In [20]:
# Group by categories and calculate counts of failures (1) and non-failures (0)
vibration_failure_count = pump_data.groupby(['vibration_category', 'failure']).size().unstack(fill_value=0)
temperature_failure_count = pump_data.groupby(['temperature_category', 'failure']).size().unstack(fill_value=0)
pressure_failure_count = pump_data.groupby(['pressure_category', 'failure']).size().unstack(fill_value=0)
flow_rate_failure_count = pump_data.groupby(['flow_rate_category', 'failure']).size().unstack(fill_value=0)

# Display the counts
print("Vibration Failure Counts:\n", vibration_failure_count)
print("\nTemperature Failure Counts:\n", temperature_failure_count)
print("\nPressure Failure Counts:\n", pressure_failure_count)
print("\nFlow Rate Failure Counts:\n", flow_rate_failure_count)

Vibration Failure Counts:
 failure               0   1
vibration_category         
High                 72   4
Low                 171   6
Medium              708  39

Temperature Failure Counts:
 failure                 0   1
temperature_category         
High                  132  10
Low                   161   4
Medium                658  35

Pressure Failure Counts:
 failure              0   1
pressure_category         
High                31  20
Low                226   7
Medium             694  22

Flow Rate Failure Counts:
 failure               0   1
flow_rate_category         
High                127   2
Low                 179  32
Medium              645  15


**Inferences from Feature Categories:**

Vibration Level:

+ High: Few failures (4 out of 76, or 5.3%), a rate close to the overall failure rate of 4.9%, so high vibration is not strongly associated with failures.
+ Low: Failures are also low (6 out of 177, or 3.4%), suggesting low vibration is generally safe.
+ Medium: Most failures by count (39 out of 747) occur at medium vibration levels, but the rate (5.2%) is about the same as at high vibration; the count mainly reflects that most observations fall in this bin.

Temperature (°C):

+ High: A small proportion of failures (10 out of 142, or 7.0%), the highest rate of the three bins but only modestly above the overall 4.9%, so high temperature alone is a weak indicator.
+ Low: Very few failures (4 out of 165, or 2.4%), indicating that low temperatures are comparatively safe.
+ Medium: Most failures by count (35 out of 693) occur at medium temperatures, but the rate (5.1%) is close to the overall average.

Pressure (PSI):

+ High: A notable share of high-pressure observations fail (20 out of 51, or 39.2%), making high pressure a strong indicator of failure.
+ Low: Few failures (7 out of 233, or 3.0%), indicating lower pressures are relatively safe.
+ Medium: Some failures (22 out of 716, or 3.1%) occur at medium pressures, but the rate is much lower than at high pressure.

Flow Rate (m³/h):

+ High: Very few failures (2 out of 129, or 1.6%) suggest high flow rates are generally safe.
+ Low: The majority of failures (32 of the 49 in the dataset) occur at low flow rates, and the rate within this bin (32 out of 211, or 15.2%) is well above average, indicating a critical risk factor.
+ Medium: Fewer failures (15 out of 660, or 2.3%) suggest medium flow rates are stable.

**Overall Insights:**

Key Failure Indicators:

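+ High Pressure and Low Flow Rate are the most significant indicators of failure, with failure rates of 39.2% and 15.2% against an overall rate of 4.9%.
+ Medium Vibration and Temperature contain the most failures by count, but their failure rates (about 5%) are close to the overall rate, so on their own they add little signal.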
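
Because the bins differ greatly in size, raw failure counts can mislead; failure rates per category are easier to compare. The sketch below derives them from the counts printed in the cell output above. The `counts` dictionary is a hand-copied illustration, not a structure the notebook itself builds.

```python
# (non-failures, failures) per category, copied from the groupby output above
counts = {
    'vibration':   {'High': (72, 4),   'Low': (171, 6), 'Medium': (708, 39)},
    'temperature': {'High': (132, 10), 'Low': (161, 4), 'Medium': (658, 35)},
    'pressure':    {'High': (31, 20),  'Low': (226, 7), 'Medium': (694, 22)},
    'flow_rate':   {'High': (127, 2),  'Low': (179, 32), 'Medium': (645, 15)},
}

# Failure rate = failures / all observations in that category
rates = {
    feature: {category: failed / (ok + failed) for category, (ok, failed) in bins.items()}
    for feature, bins in counts.items()
}

# Overall rate, computed from any one feature (every row appears exactly once)
total_ok = sum(ok for ok, _ in counts['pressure'].values())
total_failed = sum(failed for _, failed in counts['pressure'].values())
overall_rate = total_failed / (total_ok + total_failed)

print(f"overall: {overall_rate:.1%}")
for feature, bins in rates.items():
    summary = ", ".join(f"{category} {rate:.1%}" for category, rate in bins.items())
    print(f"{feature}: {summary}")
```

High pressure (39.2%) and low flow rate (15.2%) stand out clearly against the overall rate of 4.9%, while every vibration and temperature bin stays within a few points of it.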

In [36]:
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from xgboost import XGBClassifier

# Read and prepare the data
pump_data = pd.read_csv('hypothetical_pump_failure_dataset.csv')
pump_data['timestamp'] = pd.to_datetime(pump_data['timestamp'])
pump_data.set_index('timestamp', inplace=True)

# Feature categorization
def categorize_feature(value, feature_range):
    low, high = feature_range
    interval = (high - low) / 3
    if value <= low + interval:
        return 'Low'
    elif value <= low + 2 * interval:
        return 'Medium'
    else:
        return 'High'

# Define the ranges for each feature
vibration_range = (0.175873, 0.885273)
temperature_range = (55.298057, 85.965538)
pressure_range = (69.804878, 139.262377)
flow_rate_range = (35.352757, 66.215465)

# Create categorized columns
pump_data['vibration_category'] = pump_data['vibration_level'].apply(lambda x: categorize_feature(x, vibration_range))
pump_data['temperature_category'] = pump_data['temperature_C'].apply(lambda x: categorize_feature(x, temperature_range))
pump_data['pressure_category'] = pump_data['pressure_PSI'].apply(lambda x: categorize_feature(x, pressure_range))
pump_data['flow_rate_category'] = pump_data['flow_rate_m3h'].apply(lambda x: categorize_feature(x, flow_rate_range))

# Apply One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(pump_data[['vibration_category', 'temperature_category', 'pressure_category', 'flow_rate_category']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(), index=pump_data.index)  # Set index to match

# Concatenate encoded features with the original numeric features along columns
features = pd.concat([pump_data[['vibration_level', 'temperature_C', 'pressure_PSI', 'flow_rate_m3h']], encoded_df], axis=1)
target = pump_data['failure']


# Initialize and fit MinMaxScaler
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)

# Save the scaler
joblib.dump(scaler, 'scaler_feature_engineered_xgboost.pkl')

# Initialize XGBoost model with regularization
xgb_model = XGBClassifier(
    eval_metric='logloss',
    reg_alpha=0.1,  # L1 regularization term (overridden by param_grid during the search)
    reg_lambda=1.0  # L2 regularization term (overridden by param_grid during the search)
)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [7, 9, 11],  # Three deeper depths to explore more complex trees
    'learning_rate': [0.05, 0.1, 0.2],  # Moderate learning rates for finer tuning
    'n_estimators': [200, 300, 500],  # Focus on higher estimator values
    'subsample': [0.8, 0.9, 1.0],  # Reasonable range for subsampling
    'colsample_bytree': [0.8, 0.9, 1.0],  # Limited range for feature sampling
    'gamma': [0.2, 0.5],  # Only 2 values to control minimum loss reduction
    'min_child_weight': [5, 7],  # Focus on larger values to avoid overfitting
    'reg_alpha': [0.5, 1.0],  # Higher L1 regularization to encourage sparsity
    'reg_lambda': [2.0, 5.0]  # Higher L2 regularization to control model complexity
}

# Define the stratified K-fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, 
                           scoring='f1', cv=skf, verbose=1, n_jobs=-1)

# Train and evaluate model using Stratified K-Folds cross-validation
precision_scores = []
recall_scores = []
f1_scores = []

for train_index, test_index in skf.split(features_scaled, target):
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    # Fit the grid search on the training data
    grid_search.fit(X_train, y_train)

    # Get the best model from grid search
    best_model = grid_search.best_estimator_

    # Make predictions on the test data
    y_pred = best_model.predict(X_test)

    # Calculate precision, recall, and F1-score
    precision_scores.append(precision_score(y_test, y_pred, zero_division=0))
    recall_scores.append(recall_score(y_test, y_pred, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred, zero_division=0))

# Calculate average scores across folds
avg_precision = sum(precision_scores) / len(precision_scores)
avg_recall = sum(recall_scores) / len(recall_scores)
avg_f1 = sum(f1_scores) / len(f1_scores)

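# Save the best model from the last outer fold (best_model is overwritten on every iteration)
joblib.dump(best_model, 'xgboost_feature_engineered_model.pkl')

# Print the best hyperparameters (from the last outer fold's search) and the fold-averaged metrics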
print(f"Best Hyperparameters: {grid_search.best_params_}")
print("XGBoost with Categorized Features and Hyperparameter Tuning:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall: {avg_recall:.4f}")
print(f"F1 Score: {avg_f1:.4f}")
print("-" * 30)

print("XGBoost model with categorized features has been trained and saved.")

Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
Best Hyperparameters: {'colsample_bytree': 1.0, 'gamma': 0.5, 'learning_rate': 0.2, 'max_depth': 7, 'min_child_weight': 5, 'n_estimators': 200, 'reg_alpha': 1.0, 'reg_lambda': 2.0, 'subsample': 1.0}
XGBoost with Categorized Features and Hyperparameter Tuning:
Precision: 0.7796
Recall: 0.8578
F1 Score: 0.8131
------------------------------
XGBoost model with categorized features has been trained and saved.
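
All three training cells fit `MinMaxScaler` on the full dataset before cross-validation, so each test fold influences the scaling applied to its training folds. For tree-based models the practical effect is likely small, but a `Pipeline` keeps preprocessing inside each fold. The sketch below shows the pattern under stated assumptions: the data comes from `make_classification` as a synthetic stand-in for the pump dataset, and `GradientBoostingClassifier` stands in for `XGBClassifier` so that only scikit-learn is required.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (assumption: not the real pump dataset)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# The scaler is a pipeline step, so it is re-fit on each training fold only
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),  # stand-in for XGBClassifier
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipeline, X, y, cv=skf,
                         scoring=["precision", "recall", "f1"])

# Average each metric across the five folds
for metric in ("precision", "recall", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.4f}")
```

If the pipeline is passed to `GridSearchCV`, the grid keys need the step prefix, for example `model__max_depth` instead of `max_depth`.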
