Now that we have analyzed the data and made some meaningful changes, we move on to preparing it specifically for machine learning. This file implements feature selection, encoding, class imbalance treatment, and training and evaluating multiple ML models.


✅ Baseline model
No resampling
Proper metrics
Show imbalance problem

✅ Weighted model
class_weight='balanced'

✅ SMOTE model
Compare recall / F1 / ROC-AUC

✅ Explain trade-offs
Precision vs recall
Business interpretation

In [11]:
# We have high class imbalance, as already mentioned. To show why balancing matters,
# we train one logistic regression on the unbalanced data and one with class weights,
# then compare recall and F1 score.

In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

In [None]:
data = pd.read_csv('data/data_updated.csv')

In [15]:
# First, train the model on the original, unbalanced data to see how much difference balancing makes.

# Features and target
X = data.drop(columns='Attrition')
y = data['Attrition']  

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train Logistic Regression
model = LogisticRegression(solver='lbfgs', random_state=0, max_iter=10000)  
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("F1 score:", f1_score(y_test, predictions))
print(classification_report(y_test, predictions))




Accuracy: 0.8809523809523809
F1 score: 0.5070422535211268
              precision    recall  f1-score   support

           0       0.92      0.95      0.93       255
           1       0.56      0.46      0.51        39

    accuracy                           0.88       294
   macro avg       0.74      0.70      0.72       294
weighted avg       0.87      0.88      0.88       294



The contrast between accuracy and F1 is clear: accuracy is 0.88, while the F1 score for the minority class is only 0.51. The model mostly predicts the majority class, so it catches fewer than half of the attrition cases (recall 0.46). Once the minority class is what we measure, performance drops.
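
The gap between the two numbers can be checked by hand. The sketch below rebuilds the confusion-matrix counts implied by the printed report and recomputes accuracy, F1, and the accuracy of always predicting the majority class. The counts are an inference from the rounded precision, recall, and support values, not an output of the notebook.

```python
# Counts inferred from the classification report above (assumption:
# reconstructed from support, precision and recall, not printed by the notebook).
tn, fp = 241, 14   # class 0: 255 employees who stayed
fn, tp = 21, 18    # class 1: 39 employees who left

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a "model" that always predicts class 0
majority_baseline = (tn + fp) / total

print(f"accuracy          : {accuracy:.4f}")
print(f"precision (1)     : {precision:.4f}")
print(f"recall (1)        : {recall:.4f}")
print(f"F1 (1)            : {f1:.4f}")
print(f"majority baseline : {majority_baseline:.4f}")
```

Always predicting "no attrition" would already reach about 0.867 accuracy on this test set, so an accuracy of 0.88 says very little about how well the leavers are detected.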

In [21]:
# Features and target

X = data.drop(columns='Attrition')
y = data['Attrition']  

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

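# Train Logistic Regression with manual class weights:
# mistakes on class 1 (attrition) count twice as much as mistakes on class 0
model2 = LogisticRegression(
    solver='lbfgs',
    random_state=0,
    max_iter=5000,
    class_weight={0:1, 1:2}
)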

model2.fit(X_train, y_train)
predictions = model2.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("F1 score:", f1_score(y_test, predictions))
print(classification_report(y_test, predictions))





Accuracy: 0.8469387755102041
F1 score: 0.47058823529411764
              precision    recall  f1-score   support

           0       0.92      0.90      0.91       255
           1       0.43      0.51      0.47        39

    accuracy                           0.85       294
   macro avg       0.68      0.71      0.69       294
weighted avg       0.86      0.85      0.85       294
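
For comparison with the manual {0: 1, 1: 2} weights, the 'balanced' heuristic in scikit-learn sets each class weight to n_samples / (n_classes * class_count). The sketch below applies that formula in plain Python. The training-set class counts are not printed in this notebook, so it uses the test-set supports (255 and 39) as a stand-in, which is an assumption.

```python
# Class counts taken from the test-set support column (assumption: the
# training split has a similar ratio; its counts are not shown here).
class_counts = {0: 255, 1: 39}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Formula behind class_weight='balanced': n_samples / (n_classes * count)
balanced = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

# How much more a minority-class mistake costs than a majority-class one
ratio = balanced[1] / balanced[0]

print(balanced)
print(f"minority/majority weight ratio: {ratio:.2f}")
```

Under this assumption the balanced ratio is roughly 6.5 to 1, noticeably stronger than the 2 to 1 used above, so it would likely push recall higher at a further cost in precision.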



In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


# Split data (same split as before; the random forest is trained on the unscaled features)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Random Forest
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=None,
    min_samples_split=2,
    max_features="sqrt",   # features per split
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)

# Train
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nReport:\n", classification_report(y_test, y_pred))


Accuracy: 0.8673469387755102

Report:
               precision    recall  f1-score   support

           0       0.88      0.99      0.93       255
           1       0.50      0.08      0.13        39

    accuracy                           0.87       294
   macro avg       0.69      0.53      0.53       294
weighted avg       0.83      0.87      0.82       294
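
To compare the three models on the class that matters, the sketch below tabulates precision, recall, and F1 for class 1 together with the number of missed leavers and false alarms. The confusion counts are reconstructed from the printed reports (an inference from rounded values, not notebook output), and the dictionary and variable names are illustrative.

```python
# (tp, fp, fn) for class 1, inferred from the three reports above;
# each test set has 39 positives and 255 negatives.
models = {
    "LogReg (unweighted)": (18, 14, 21),
    "LogReg (weights 1:2)": (20, 26, 19),
    "Random forest": (3, 3, 36),
}

rows = []
for name, (tp, fp, fn) in models.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    rows.append((name, precision, recall, f1, fn, fp))

header = f"{'model':<22}{'prec':>6}{'recall':>8}{'f1':>6}{'missed':>8}{'false alarms':>14}"
print(header)
for name, precision, recall, f1, fn, fp in rows:
    print(f"{name:<22}{precision:>6.2f}{recall:>8.2f}{f1:>6.2f}{fn:>8}{fp:>14}")
```

The random forest's accuracy of 0.867 equals what always predicting class 0 would score (255/294), yet it finds only 3 of the 39 leavers. The weighted logistic regression finds the most leavers (20) but raises 26 false alarms, which is the precision versus recall trade-off in concrete terms.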

