## Problem Statement

### Business Context

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).



## Objective
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost.
The nature of predictions made by the classification model will translate as follows:

- True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
- False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
- False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

## Data Description
- The data provided is a transformed version of original data which was collected using sensors.
- Train.csv - To be used for training and tuning of models.
- Test.csv - To be used only for testing the performance of the final best model.
- Both the datasets consist of 40 predictor variables and 1 target variable

## Importing necessary libraries

In [None]:
# Installing the libraries with the specified version.
# %pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 imbalanced-learn==0.10.1 xgboost==2.0.3 threadpoolctl==3.3.0 -q

**Note:** After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [None]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To help with data preprocessing
from scipy.stats import zscore

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier


from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, RobustScaler

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

## Loading the dataset

In [None]:
test = pd.read_csv("Test.csv")
train = pd.read_csv("Train.csv")

## Data Overview

- Observations
- Sanity checks

### Train Dataset

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.info()

- 39 columns and 1 target column
- Columns V1 and V2 contain 18 null values
- All columns are numeric values

In [None]:
train.describe().T

- Class imbalance present: only 5.6% of failures (Target = 1), this means failures are rare.
- Some features have extreme min max values indicating possible outliers.
- Potential Skewness log transformation or robust scaling possibly needed for highly skewed features.

In [None]:
# Checking for duplicates
train.duplicated().sum()

In [None]:
# Checking for missing values
round(train.isnull().sum() / train.isnull().count() * 100,2)

- Missing values consist of 0.09% of columns `V1` `V2`

### Test Dataset

In [None]:
test.head()

In [None]:
test.shape

In [None]:
test.info()

- There are a total of 5000 rows `V1` and `V2` contain 5 and 4 missing values

In [None]:
test.describe().T

- Class imbalance for target variable remains (5.6%) this is consistent across both datasets.
- Means and std for most columns are close to the training data.
- Outliers exist in both sets


In [None]:
# checking for duplicates
test.duplicated().sum()

In [None]:
# check for missing values
round(test.isnull().sum() / test.isnull().count() * 100,2)

- `V1` 0.1% missing values `V2` 0.12% missing values

## Exploratory Data Analysis (EDA)

### Plotting histograms and boxplots for all the variables

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

### Plotting all the features at one go

#### Training Dataset

In [None]:
for feature in train.columns:
    histogram_boxplot(train, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data

## Data Pre-processing

In [None]:
from scipy.stats import zscore

# Function to detect outliers using IQR
def detect_outliers_iqr(df, columns):
    outlier_dict = {}
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outlier_dict[col] = len(outliers)
    return outlier_dict

# Function to detect outliers using Z-score
def detect_outliers_zscore(df, columns, threshold=3):
    outlier_dict = {}
    for col in columns:
        z_scores = np.abs(zscore(df[col]))
        outliers = df[z_scores > threshold]
        outlier_dict[col] = len(outliers)
    return outlier_dict

# Select numerical columns (excluding target)
num_cols = [col for col in train.columns if col != "Target"]

# Detect outliers
iqr_outliers = detect_outliers_iqr(train, num_cols)
zscore_outliers = detect_outliers_zscore(train, num_cols)

# Display the top features with the most outliers
iqr_sorted = sorted(iqr_outliers.items(), key=lambda x: x[1], reverse=True)
zscore_sorted = sorted(zscore_outliers.items(), key=lambda x: x[1], reverse=True)

print("Top Features with Outliers (IQR Method):", iqr_sorted[:10])
print("Top Features with Outliers (Z-score Method):", zscore_sorted[:10])

##### Failures Relationship to outliers

- Feature with most outliers

In [None]:
# Select top feature with most outliers
top_feature = iqr_sorted[0][0]

plt.figure(figsize=(8, 5))
sns.boxplot(x=train["Target"], y=train[top_feature])
plt.title(f"Boxplot of {top_feature} by Failure")
plt.xlabel("Failure (0 = No, 1 = Yes)")
plt.ylabel(top_feature)
plt.show()

- Feature with second most outliers

In [None]:
top_feature = iqr_sorted[1][0]

plt.figure(figsize=(8, 5))
sns.boxplot(x=train["Target"], y=train[top_feature])
plt.title(f"Boxplot of {top_feature} by Failure")
plt.xlabel("Failure (0 = No, 1 = Yes)")
plt.ylabel(top_feature)
plt.show()

- Feature with third most outliers

In [None]:
top_feature = iqr_sorted[3][0]

plt.figure(figsize=(8, 5))
sns.boxplot(x=train["Target"], y=train[top_feature])
plt.title(f"Boxplot of {top_feature} by Failure")
plt.xlabel("Failure (0 = No, 1 = Yes)")
plt.ylabel(top_feature)
plt.show()

- It appears that failures do not have a relationship with outliers

In [None]:
scaler = RobustScaler()
train_scaled = train.copy()
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])

Since the outliers are not strongly correlated with failures, removing them is unnecessary. Instead, they should be normalized to prevent them from negatively impacting the model.

Why RobustScaler

- Preserves all data (no removals, no loss of useful information).
- Reduces outlier influence (scales using median & IQR instead of mean & standard deviation).
- Better for models like Logistic Regression, Decision Trees, and Gradient Boosting, which are sensitive to feature scale.

#### Test Dataset

- Apply the same RobustScaler transformation to the test dataset, but do not fit it again. Instead, use the scaler already fitted on the training data to ensure consistency. 

If you apply different transformations to the test data, the model will receive differently scaled inputs, leading to incorrect predictions.

- Since the test dataset follows a similar distribution to the training set, repeating the analysis is unnecessary. The same preprocessing steps must be applied to avoid data leakage.

In [None]:
# Apply the same scaler (DO NOT FIT AGAIN)
test[num_cols] = scaler.transform(test[num_cols])

### Missing value imputation

In [None]:
# show only missing values
test.isnull().sum()[test.isnull().sum() > 0]

In [None]:
# show only missing values
train.isnull().sum()[train.isnull().sum() > 0]

In [None]:
train.fillna(train.median(), inplace=True)
test.fillna(train.median(), inplace=True)  # Use training set median to avoid data leakage

### Dealing With class imbalance

In [None]:
# Recommended for boosting and tree-based models
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(train["Target"]), y=train["Target"])
weights_dict = {0: class_weights[0], 1: class_weights[1]}

In [None]:
print("Class Weights:", weights_dict) 

- Use class weight to give failures more importance in the model training.

### Feature Selection

In [None]:
# Identify numerical columns (excluding Target) Re define the num_cols (not necessary)
num_cols = [col for col in train.columns if col != "Target"]

In [None]:
# Compute Correlation Matrix
correlation_matrix = train[num_cols].corr()

In [None]:
# plot the correlation matrix
plt.figure(figsize=(20, 12))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", annot_kws={"size": 7})
plt.title("Correlation Matrix")

plt.show()

In [None]:
# Identify Highly Correlated Features (Threshold: 0.85)
correlation_threshold = 0.85
highly_correlated_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > correlation_threshold:
            colname = correlation_matrix.columns[i]
            highly_correlated_features.add(colname)

# Print Highly Correlated Features
print("Highly Correlated Features to Drop:", highly_correlated_features)

In [None]:
# Get feature importance using Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train[num_cols], train["Target"])

In [None]:
# Get feature importance scores
importances = pd.Series(rf.feature_importances_, index=num_cols).sort_values(ascending=False)

# Display feature importance as a bar plot
plt.figure(figsize=(10, 6))
importances.head(15).plot(kind="bar")
plt.title("Top 15 Feature Importances (Random Forest)")
plt.xlabel("Feature")
plt.ylabel("Importance Score")
plt.xticks(rotation=45)
plt.show()

# Return top 15 most important features
importances.head(15)

- Since `V15` and `V14` are highly correlated with other features, they are dropped to prevent redundancy

In [None]:
# Drop highly correlated features from train & test datasets
train.drop(columns=['V15', 'V14'], inplace=True)
test.drop(columns=['V15', 'V14'], inplace=True) 

### Prepare Data for Model Training

In [None]:
# Separate features and target
X_train = train.drop(columns=["Target"])
y_train = train["Target"]

X_test = test.drop(columns=["Target"])
y_test = test["Target"]

## Model Building

### Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

- True positives (TP) are failures correctly predicted by the model.
- False negatives (FN) are real failures in a generator where there is no detection by model.
- False positives (FP) are failure detections in a generator where there is no failure.

**Which metric to optimize?**

* We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
* We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
* We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

**Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.**

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf

### Defining scorer to be used for cross-validation and hyperparameter tuning

- We want to reduce false negatives and will try to maximize "Recall".
- To maximize Recall, we can use Recall as a **scorer** in cross-validation and hyperparameter tuning.

In [None]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

### Model Building with original data

Sample Decision Tree model building with original data

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("rf", RandomForestClassifier(random_state=1)))
models.append(("ada", AdaBoostClassifier(random_state=1)))
models.append(("xgb", XGBClassifier(random_state=1)))
models.append(("gb", GradientBoostingClassifier(random_state=1)))
models.append(("bag", BaggingClassifier(random_state=1)))
models.append(("logistic", LogisticRegression(random_state=1)))


results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_test, model.predict(X_test))
    print("{}: {}".format(name, scores))

- Xgboost is the best performing model followed by Decision Tree, all other models show significant drops in validation performance

### Model Building with Oversampled data


In [None]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

In [None]:
# Train Decision Tree on oversampled data
dtree_oversampled = DecisionTreeClassifier(random_state=1)
dtree_oversampled.fit(X_train_over, y_train_over)

# Evaluate Performance on Test Set
performance_df = model_performance_classification_sklearn(dtree_oversampled, X_test, y_test)
print("Decision Tree:" '\n',performance_df)

In [None]:
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=1, class_weight="balanced")
rf.fit(X_train_over, y_train_over)

rf_performance = model_performance_classification_sklearn(rf, X_test, y_test)
print('Random Forest\n',rf_performance)

In [None]:
# Train Random Forest using weights_dict
rf = RandomForestClassifier(n_estimators=100, random_state=1, class_weight=weights_dict)
rf.fit(X_train_over, y_train_over)

rf_performance = model_performance_classification_sklearn(rf, X_test, y_test)
print('Random Forest using weights_dict\n',rf_performance)

In [None]:
# Train XGBoost
xgb = XGBClassifier(n_estimators=100, random_state=1, scale_pos_weight=len(y_train_over) / sum(y_train_over))
xgb.fit(X_train_over, y_train_over)

xgb_performance = model_performance_classification_sklearn(xgb, X_test, y_test)
print('XGBoost\n',xgb_performance)

Model with the best recall is XGBoost

### Model Building with Undersampled data

In [None]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

In [None]:
# Train Decision Tree on undersampled data
dtree_undersampled = DecisionTreeClassifier(random_state=1)
dtree_undersampled.fit(X_train_un, y_train_un)

# Evaluate Performance on Test Set
performance_df_under = model_performance_classification_sklearn(dtree_undersampled, X_test, y_test)
print('\n' "Decision Tree:" '\n')
print(performance_df_under)

In [None]:
# Train Random Forest on undersampled data
rf_under = RandomForestClassifier(n_estimators=100, random_state=1, class_weight="balanced")
rf_under.fit(X_train_un, y_train_un)

rf_performance_under = model_performance_classification_sklearn(rf_under, X_test, y_test)
print('Random Forest\n',rf_performance_under)

In [None]:
# Train Random Forest on undersampled data
rf_under = RandomForestClassifier(n_estimators=100, random_state=1, class_weight=weights_dict)
rf_under.fit(X_train_un, y_train_un)

rf_performance_under = model_performance_classification_sklearn(rf_under, X_test, y_test)
print('Random Forest with weights_dict\n',rf_performance_under)

In [None]:
# Train XGBoost on undersampled data
xgb_under = XGBClassifier(n_estimators=100, random_state=1, scale_pos_weight=len(y_train_un) / sum(y_train_un))
xgb_under.fit(X_train_un, y_train_un)

xgb_perf_under = model_performance_classification_sklearn(xgb_under, X_test, y_test)
print('XGBoost\n',xgb_perf_under)

Model with the best recall Random Forest

- Overall Undersampled Random Forest (94.7%) has the maximum Recall

- Over Sampled Random Forest has a balance of high recall (90.4%) and higher precision meaning less false alarms.

Because false alarms have no cost and we are focusing on maximizing recall the model moving forward will be undersampled random forest

## HyperparameterTuning

### Sample Parameter Grids

**Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.**

- For Gradient Boosting:

param_grid = {
    "n_estimators": np.arange(100,150,25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample":[0.5,0.7],
    "max_features":[0.5,0.7]
}

- For Adaboost:

param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1),
    ]
}

- For Bagging Classifier:

param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}

- For Random Forest:

param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}

- For Decision Trees:

param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}

- For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

- For XGBoost:

param_grid={
    'n_estimators': [150, 200, 250],
    'scale_pos_weight': [5,10],
    'learning_rate': [0.1,0.2],
    'gamma': [0,3,5],
    'subsample': [0.8,0.9]
}

### Sample tuning method for Decision tree with original data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} \nwith CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

### Sample tuning method for Decision tree with oversampled data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} \nwith CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

### Sample tuning method for Decision tree with undersampled data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
              'min_samples_leaf': [1, 2, 5, 7],
              'max_leaf_nodes' : [5, 10,15],
              'min_impurity_decrease': [0.0001,0.001]}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} \nwith CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

### Model Tuning: Undersampled Random Forest

In [None]:
Model = RandomForestClassifier(random_state=1)

param_grid = {
    "n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [0.3, 0.6, 0.1,'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "class_weight": [None, 'balanced', weights_dict]
}

randomized_cv = RandomizedSearchCV(estimator=Model, 
                                   param_distributions=param_grid, 
                                   n_iter=10, 
                                   n_jobs=-1, 
                                   scoring=scorer, 
                                   cv=5, 
                                   random_state=1)



In [None]:
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} \nwith CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
best_rf_model = RandomForestClassifier(
    n_estimators=250,
    min_samples_leaf=2,
    max_samples=0.6,
    max_features=0.1,
    random_state=1,
    class_weight="balanced"
)

best_rf_model.fit(X_train_un, y_train_un)

# Evaluate the final tuned model
best_rf_performance = model_performance_classification_sklearn(best_rf_model, X_test, y_test)
print("Final Model Performance on Test Data:\n", best_rf_performance)

### Model Tuning: Oversampled Random Forest

In [None]:
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} \nwith CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
best_rf_model = RandomForestClassifier(
    n_estimators=250,
    min_samples_leaf=2,
    max_samples=0.6,
    max_features=0.1,
    random_state=1,
    class_weight="balanced"
)

best_rf_model.fit(X_train_over, y_train_over)

best_rf_performance = model_performance_classification_sklearn(best_rf_model, X_test, y_test)
print("Final Model Performance on Test Data:\n", best_rf_performance)

## Model performance comparison and choosing the final model

Since the cost of replacing a generator is the highest, followed by repair costs, and inspections are the least expensive, recall is the most important metric to minimize False Negatives (FN). The Undersampled Random Forest model is the best choice because it has the highest recall (90.8%), meaning it detects the most failures and helps prevent costly replacements, even if it results in more False Positives (FP) and additional inspections. While the Oversampled Random Forest has better precision and F1-score, the main goal is to catch as many failures as possible, since missing them leads to higher costs. Moving forward, the Undersampled RF model should be deployed, with possible threshold tuning to balance recall and false positives. 

### Test set final performance

The final test set performance for the Undersampled Random Forest model, which was selected based on cost considerations, is:

Metric	Final Score
- Accuracy	0.707 (70.7%)
- Recall (Most Important) 	0.908 (90.8%)
- Precision	0.151 (15.1%)
- F1-Score	0.259 (25.9%)

This means the model correctly identifies 90.8% of actual failures, minimizing False Negatives (FN) and preventing costly generator replacements. While precision (15.1%) is low, leading to more False Positives (FP) and extra inspections, the tradeoff is acceptable since inspections are far less expensive than missed failures.

## Pipelines to build the final model


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load Data
def load_data():
    train = pd.read_csv("Train.csv")
    test = pd.read_csv("Test.csv")
    return train, test

# Preprocess Data
def preprocess_data(train, test):
    numeric_features = train.select_dtypes(include=[np.number]).columns.tolist()
    categorical_features = train.select_dtypes(include=["object"]).columns.tolist()

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

    return preprocessor





In [None]:
# Train Model
def train_model(train, preprocessor):

    print("Columns inside function:", list(train.columns))  # Debugging step

    y = train["Target"]
    X = train.drop(columns=["Target"])  # Replace 'target' with actual target column
    

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=250, 
                                              min_samples_leaf=2,
                                              max_samples=0.6,
                                              max_features=0.1,
                                              random_state=1,
                                              class_weight="balanced"))
    ])

    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    acc = accuracy_score(y_val, y_pred)
    print(f"Validation Accuracy: {acc:.4f}")
    
    return model

In [3]:
train, test = load_data()

In [4]:
preprocessor = preprocess_data(train, test)

In [5]:
print("Before function call - columns in train:", list(train.columns))

Before function call - columns in train: ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'Target']


In [None]:
def train_model(train, preprocessor):

    print("Columns inside function:", list(train.columns))  # Debugging step

    y = train["Target"]
    X = train.drop(columns=["Target"])  # Replace 'target' with actual target column
    

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=250, 
                                              min_samples_leaf=2,
                                              max_samples=0.6,
                                              max_features=0.1,
                                              random_state=1,
                                              class_weight="balanced"))
    ])

    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    acc = accuracy_score(y_val, y_pred)
    print(f"Validation Accuracy: {acc:.4f}")
    
    return model

In [None]:
model = train_model(train, preprocessor)

In [None]:

# Run Pipeline
if __name__ == "__main__":
    train, test = load_data()
    preprocessor = preprocess_data(train, test)
    print(train.columns)
    model = train_model(train, preprocessor)

# Business Insights and Conclusions

***