Balance Classes (SMOTE or Class Weighting) to improve Class 1 recall.

Experiment with more features (top 15 instead of 10).

Try advanced models (e.g., XGBoost, LightGBM).

You can also train an ANN with these top features.

Selecting Top-K features using SelectKBest (f_classif)

Training a model (RandomForest)

In [1]:
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load data from Drive (data/00_Data.xlsx)
df = pd.read_excel('/content/drive/MyDrive/data/00_Data.xlsx')

# Print NA counts per column
# print(df.isna().sum())

# Remove rows with NA values
df = df.dropna()

y = df['DEFECTIVE']
X = df.drop('DEFECTIVE', axis=1)

## Apply SelectKBest with f_classif

In [14]:
# Apply SelectKBest with f_classif
def select_k_best_features(X, y, k_f):
  selector = SelectKBest(score_func=f_classif, k=k_f)
  X_selected = selector.fit_transform(X, y)

  # Prepare DataFrame with Feature Names and Scores
  feature_scores_df = pd.DataFrame({
      'Feature': X.columns,
      'Score': selector.scores_
  })

  # Sort features by score, highest first
  feature_scores_df = feature_scores_df.sort_values(by='Score', ascending=False)
  return X_selected, feature_scores_df
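
The helper above returns the transformed array and the score table, but not which columns were kept. A minimal sketch of how `selector.get_support()` can recover the selected column names. It uses synthetic data from `make_classification`, and the column names `f0` to `f7` are hypothetical stand-ins, not columns of the real dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real data (assumption: all features are numeric)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=3)
X_demo_selected = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected_names = X_demo.columns[selector.get_support()].tolist()
print(selected_names)
print(X_demo_selected.shape)
```

Keeping the names makes it easier to check that the same columns are used when the model is later applied to new data.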


## RandomForest Classifier model

In [10]:
# Train-Test Split
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

def rf_train_test_split(X, y, test_size=0.2, random_state=42):
  # Hold out a test set, using the function arguments instead of hardcoded values
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
  # RandomForest Classifier
  model = RandomForestClassifier(n_estimators=100, random_state=random_state)
  model.fit(X_train, y_train)

  # Prediction & Evaluation
  y_pred = model.predict(X_test)
  print("\nClassification Report RandomForest:")
  print(classification_report(y_test, y_pred))


## Check with RandomForest classifier

In [27]:
x_selected, _ = select_k_best_features(X, y, k_f=15)
rf_train_test_split(x_selected, y)


Classification Report RandomForest:
              precision    recall  f1-score   support

   DEFECTIVE       0.67      0.58      0.62       586
          NO       0.93      0.95      0.94      3466

    accuracy                           0.90      4052
   macro avg       0.80      0.76      0.78      4052
weighted avg       0.89      0.90      0.89      4052
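
The report above shows a recall of 0.58 for DEFECTIVE, the minority class. Class weighting, mentioned at the top as an alternative to SMOTE, can be sketched as follows. This is a hedged illustration on synthetic imbalanced data, not the notebook's dataset; the variable names are hypothetical, and whether weighting actually improves recall on the real data would need to be checked:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 15% positives); an assumption for illustration
X_cw, y_cw = make_classification(
    n_samples=2000, n_features=15, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cw, y_cw, test_size=0.2, stratify=y_cw, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency
weighted_rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
weighted_rf.fit(X_tr, y_tr)

minority_recall = recall_score(y_te, weighted_rf.predict(X_te), pos_label=1)
print(f"Minority-class recall: {minority_recall:.2f}")
```

Unlike oversampling, this leaves the training data unchanged and only adjusts how much each class counts during tree construction.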



## Augment the data with SMOTE
## Check RandomForest after SMOTE

In [30]:
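# SMOTE: oversample the minority class with synthetic samples
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
# Note: this resamples all features in X (not the top-15 selection) and runs
# before the train/test split inside rf_train_test_split, so the test set
# contains synthetic samples and the report below is likely optimistic.
X_smote, y_smote = smote.fit_resample(X, y)

print("\nClass distribution after SMOTE:")
print(pd.Series(y_smote).value_counts())

rf_train_test_split(X_smote, y_smote)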



Class distribution after SMOTE:
DEFECTIVE
NO           17371
DEFECTIVE    17371
Name: count, dtype: int64

Classification Report RandomForest:
              precision    recall  f1-score   support

   DEFECTIVE       0.94      0.95      0.94      3469
          NO       0.95      0.94      0.94      3480

    accuracy                           0.94      6949
   macro avg       0.94      0.94      0.94      6949
weighted avg       0.94      0.94      0.94      6949

