# **Dataset context**

The **IBM HR Analytics Employee Attrition & Performance** dataset provides a fictional snapshot of an organization's workforce aimed at helping HR teams and data analysts identify the factors that contribute to employee turnover and performance trends. It features both demographic and job-related variables, enabling the development of predictive models and strategic insights to improve employee retention.

# **Dataset content**
There are 1,470 observations and 35 variables, including:


*   **Age, Gender, MaritalStatus, Education, EducationField** — employee demographics

*   **Department, JobRole, BusinessTravel, OverTime** — job characteristics and work patterns

*   **DistanceFromHome, EnvironmentSatisfaction, JobSatisfaction, WorkLifeBalance, RelationshipSatisfaction** — job experience and well‑being

*   **MonthlyIncome, HourlyRate, PercentSalaryHike, StockOptionLevel** — compensation info

*   **TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, NumCompaniesWorked, TrainingTimesLastYear** — career progression & training metrics
*   **PerformanceRating, JobInvolvement** — performance indicators

*   **DailyRate, MonthlyRate, StandardHours, EmployeeCount, EmployeeNumber, Over18** — various administrative identifiers; some hold constant values and are often dropped during preprocessing

*   **Attrition** — the target variable: whether the employee left (“Yes”/“No”)

*(source: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)*


In [1]:
!pip install -r requirements.txt --quiet

In [None]:
# Standard library imports
import os
import pickle
import warnings
from typing import Dict, Any

# Third-party imports
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn preprocessing and model selection
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline as SklearnPipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Imbalanced-learn preprocessing
import imblearn
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

# Classifiers
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier
)
import xgboost
from xgboost import XGBClassifier
import catboost
from catboost import CatBoostClassifier

# Metrics and evaluation
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay
)
from sklearn.metrics import precision_recall_curve

# Model persistence
import joblib
from joblib import dump, load

# Suppress warnings
warnings.filterwarnings('ignore')

os.environ['OMP_NUM_THREADS']='1'

# **Functions**

In [None]:
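def crosstab_categorical(df: pd.DataFrame) -> None:
    """Print crosstabs of 'Attrition' vs. each categorical column of df."""
    # Derive the categorical columns from df itself instead of relying on a global variable
    for col in df.select_dtypes(include=['category']).columns:
        print(pd.crosstab(df['Attrition'], df[col]))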

In [None]:
def cap_upper_iqr(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """
    Caps only the upper outliers in specified columns using the IQR method.
    """
    df_capped = df.copy()
    for col in cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        upper_bound = Q3 + 1.5 * IQR
        df_capped[col] = df[col].clip(upper=upper_bound)  # No lower bound applied
    return df_capped

In [None]:
def get_feature_importance_df(
    pipeline: SklearnPipeline,
    preprocessor: ColumnTransformer
) -> pd.DataFrame:
    """
    Extracts feature importances from a model pipeline.
    """
    model = pipeline.named_steps['model']
    feature_names = []

    for name, transformer, columns in preprocessor.transformers_:
        if name != 'remainder':
            if hasattr(transformer, 'get_feature_names_out'):
                transformed_names = transformer.get_feature_names_out(columns)
            else:
                transformed_names = columns
            feature_names.extend(transformed_names)

    importances = model.feature_importances_

    return pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)

In [None]:
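def compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred):
    """
    Calculate key classification metrics for train and test data.
    Returns a dictionary with confusion matrix, accuracy, precision,
    recall, F1 score, ROC AUC, and average precision.

    Note: ROC AUC and average precision are computed here from hard class
    predictions, so they can differ from the curve-based values shown by
    plot_roc_curve and plot_precision_recall_curve, which use scores.
    """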
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    metrics = {
        'confusion_matrix': (tn, fp, fn, tp),
        'accuracy_train': accuracy_score(y_train, y_pred_tr),
        'accuracy_test': accuracy_score(y_test, y_pred),
        'precision_train': precision_score(y_train, y_pred_tr, average='binary'),
        'precision_test': precision_score(y_test, y_pred, average='binary'),
        'recall_train': recall_score(y_train, y_pred_tr),
        'recall_test': recall_score(y_test, y_pred),
        'f1_train': f1_score(y_train, y_pred_tr),
        'f1_test': f1_score(y_test, y_pred),
        'roc_auc_test': roc_auc_score(y_test, y_pred),
        'average_precision_test': average_precision_score(y_test, y_pred)
    }

    return metrics

In [None]:
def plot_roc_curve(model, X_test, y_test) -> None:
    """
    Display ROC curve for a trained classifier.
    """
    fig, ax = plt.subplots(figsize=(12, 8))
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax)
    plt.show()

In [None]:
def plot_precision_recall_curve(model, X_test, y_test) -> None:
    """
    Plot the Precision-Recall curve with average precision.
    """
    y_scores = model.predict_proba(X_test)[:, 1]
    prec, rec, _ = precision_recall_curve(y_test, y_scores)
    avg_precision = average_precision_score(y_test, y_scores)

    plt.figure(figsize=(12, 7))
    plt.plot(rec, prec, label=f'Average precision-recall score: {avg_precision:.2f}')
    plt.title('Precision-Recall Curve', size=20)
    plt.xlabel('Recall', size=14)
    plt.ylabel('Precision', size=14)
    plt.legend()
    plt.show()

In [None]:
def create_model_summary_row(model_name: str, metrics: Dict[str, Any]) -> Dict[str, Any]:
    """
    Create a summary dictionary for classification model performance.
    """
    tn, fp, fn, tp = metrics['confusion_matrix']

    return {
        'model': model_name,
        'tp': tp,
        'tn': tn,
        'fp': fp,
        'fn': fn,
        'correct': tp + tn,
        'incorrect': fp + fn,
        'accuracy_train': metrics['accuracy_train'],
        'accuracy_test': metrics['accuracy_test'],
        'precision_train': metrics['precision_train'],
        'precision_test': metrics['precision_test'],
        'recall_train': metrics['recall_train'],
        'recall_test': metrics['recall_test'],
        'f1_train': metrics['f1_train'],
        'f1_test': metrics['f1_test'],
        'roc_auc': metrics['roc_auc_test'],
        'avg_pre': round(metrics['average_precision_test'], 3)
    }

# **Load the data**

In [None]:
#df = pd.read_csv("/content/drive/MyDrive/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df = pd.read_csv("./data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# **EDA**

In [None]:
# Shape of data
df.shape

In [None]:
# First 5 rows of data
df.head()

In [None]:
# Last 5 rows of data
df.tail()

In [None]:
# Check duplicates - there are no duplicates
len(df[df.duplicated()])

In [None]:
# Columns names
df.columns

In [None]:
# Data types - there are categorical and numerical columns
df.dtypes

In [None]:
# Unique values in each column
pd.Series({c: df[c].unique() for c in df})

In [None]:
# Number of unique values per column - some columns have only a few unique values, others are almost unique per row
df.nunique()

In [None]:
# Check missing values - no missing values
sum(df.isna().sum())

In [None]:
# Data details
df.info()

In [None]:
# The dataset is imbalanced - 1233 employees stayed at the company vs 237 who left
df['Attrition'].value_counts()

In [None]:
# The greater the distance from home, the fewer employees the company has from that area
df['DistanceFromHome'].value_counts()

In [None]:
# Most employees work in the R&D department, the fewest in HR
df['Department'].value_counts()

In [None]:
# Drop columns that carry no meaningful information for further analysis
# EmployeeCount is the number of employees with a particular EmployeeNumber - it is always 1
# Every employee is Over18
# Every employee works 80 standard hours - column StandardHours
df.drop(columns=['EmployeeCount', 'Over18', 'EmployeeNumber', 'StandardHours'], inplace=True)

In [None]:
# Change the type for columns to categorical
for col in ['BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus']: 
    df[col] = df[col].astype('category')

In [None]:
# Change the type for columns to binary
df['Attrition'] = np.where(df['Attrition'] == 'Yes', 1, 0)
df['Gender'] = np.where(df['Gender'] == 'Female', 1, 0)
df['OverTime'] = np.where(df['OverTime'] == 'Yes', 1, 0)

In [None]:
# Check the details after change - less memory usage
df.info()

In [None]:
# Statistics for categorical data
df.describe(include='category').T

In [None]:
# Statistics for numerical data - the highest values are shown in the darkest orange
df_num_col = df.select_dtypes(exclude=['category']).columns
df_num = df[df_num_col]
desc_stats = df_num.describe()
plt.figure(figsize=(20, 10))
sns.heatmap(desc_stats, cmap='Oranges', annot=True, fmt=".2f", cbar=True)
plt.title("Descriptive Statistics Heatmap")
plt.tight_layout()
plt.show()

In [None]:
# Present cross tabs for all categorical columns separately
df_cat_col = df.select_dtypes(include=['category']).columns
crosstab_categorical(df)

In [None]:
# The percentage of employees who left and stayed
df['Attrition'] = df['Attrition'].astype(int)
print('Stay:', round(df['Attrition'].value_counts()[0] / len(df) * 100, 2), '% of the dataset')
print('Left:', round(df['Attrition'].value_counts()[1] / len(df) * 100, 2), '% of the dataset')

# **Visualization**

In [None]:
# Based on the bar chart below for the Attrition column, the dataset is not well balanced (84% stay vs 16% leave)
# Count rows directly instead of indexing a label-indexed Series by position
left = int((df['Attrition'] == 1).sum())
stay = int((df['Attrition'] == 0).sum())
data = [left, stay]
labels = ['Leave', 'Stay']
plt.figure(figsize=(8, 5))
plt.title('Employees who left vs employees who stayed')
bars = plt.bar(labels, data, color='green', edgecolor='black')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 1, str(yval), ha='center', va='bottom')
plt.show()

In [None]:
# Most employees are married, the fewest are divorced
marital_status = df['MaritalStatus'].value_counts()

plt.figure(figsize=(8, 6))
ax = sns.countplot(data=df, x='MaritalStatus')
plt.title('Proportion of Marital Statuses')
plt.xlabel('Marital Status')
plt.ylabel('Count')

total = len(df['MaritalStatus'])
for p in ax.patches:
    height = p.get_height()
    percentage = (height / total) * 100
    ax.annotate(f'{percentage:.2f}%', (p.get_x() + p.get_width() / 2, height),
                ha='center', va='bottom')
plt.show()

In [None]:
# Total working years distribution
sns.histplot(df['TotalWorkingYears'], kde=True, bins=10, color='lightgreen')
plt.title("Total working years distribution")
plt.xlabel('Total working years')
plt.ylabel('Count')
plt.show()

*   The distribution is not symmetric: it has a long right tail, so most employees have fewer years of experience, while a few have many.
*   The most common total working experience falls in the 8-10 year range, which suggests a workforce of mid-career professionals.
*   The count drops significantly after 15 years of experience. It is possible that fewer employees stay in the company long enough to reach a higher salary.
*   Very experienced employees are rare in the company.
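
The right skew described above can also be checked numerically. The following is a minimal sketch on a made-up sample (the `years` values are hypothetical and not taken from the dataset): a positive `Series.skew()` and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large
years = pd.Series([1, 2, 3, 4, 5, 6, 6, 8, 8, 9, 10, 10, 12, 25, 38])

skewness = years.skew()        # sample skewness; > 0 indicates a long right tail
mean_years = years.mean()
median_years = years.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_years:.2f}, median: {median_years:.2f}")
```

On the real data, the same calls could be applied to `df['TotalWorkingYears']`.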

In [None]:
# Plot numeric correlation with heatmap
corr = df.select_dtypes(exclude=['category']).corr()
plt.figure(figsize=(20, 20))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot = True, mask=mask, cmap = 'coolwarm', fmt = ".2f")

## **Remarks for correlation with Attrition column**

*   JobInvolvement has a weak negative correlation with attrition (-0.13).
*   About 3% of the variance in attrition can be explained by JobLevel.
*   MonthlyIncome has a negative correlation with attrition (-0.16).
*   About 6% of the variance in attrition can be explained by OverTime. This is the strongest correlation with attrition.
*   StockOptionLevel, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole and YearsWithCurrManager are also negatively correlated with attrition, at -0.14, -0.17, -0.13, -0.16 and -0.16 respectively.
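
The percentages in the remarks above appear to be squared Pearson correlations (r²), which measure the share of variance captured by a linear relationship. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: a binary target and one binary feature
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "OverTime":  [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
})

r = toy["OverTime"].corr(toy["Attrition"])  # Pearson correlation coefficient
r_squared = r ** 2                          # share of variance explained by a linear fit

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

Note that r² is always much smaller than |r| for weak correlations, so a correlation of about 0.25 corresponds to roughly 6% of explained variance.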

## **Remarks for other correlations**

*   There is a strong correlation (above 0.7) between the following pairs of columns: MonthlyIncome/JobLevel, TotalWorkingYears/JobLevel, TotalWorkingYears/MonthlyIncome, PerformanceRating/PercentSalaryHike, YearsInCurrentRole/YearsAtCompany, YearsWithCurrManager/YearsAtCompany, YearsWithCurrManager/YearsInCurrentRole.
*   One correlation is extremely high: MonthlyIncome and JobLevel. For this reason, the MonthlyIncome column will be eliminated from further analysis, which also makes the input form quicker to fill in.
*   The following columns are weakly correlated with attrition, so a model without them will also be considered:
    - NumCompaniesWorked, absolute correlation 0.04
    - Education, Gender, YearsSinceLastPromotion, absolute correlation 0.03
    - MonthlyRate, absolute correlation 0.02
    - HourlyRate, PercentSalaryHike, absolute correlation 0.01
    - PerformanceRating, correlation 0.00
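
Strongly correlated pairs like the ones listed above can be extracted from a correlation matrix rather than read off the heatmap. This is a minimal sketch on synthetic data (the `toy` frame and its column names are hypothetical); the 0.7 threshold mirrors the one used in the remarks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "b" is almost a copy of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
# NaN entries (masked cells) fail the comparison and are filtered out
high_pairs = pairs[pairs > 0.7]

print(high_pairs)
```

Applied to the notebook's `corr` matrix, the same steps would list the candidate columns for removal.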


In [None]:
# Distribution plots for all numerical features
excluded_col = 'Attrition'
filtered_columns = [col for col in df_num_col if col != excluded_col]

for c in filtered_columns:
    plt.figure(figsize=(6, 3))
    sns.histplot(data=df, x=c, hue=excluded_col, kde=True, stat='count', alpha=0.6, palette='Set2')
    plt.title(f"Distribution of {c} by {excluded_col}")
    plt.grid(True)
    plt.tight_layout()
    plt.show()



*  JobInvolvement - employees with medium job involvement are more likely to quit.
*  JobLevel - employees at lower job levels are more likely to leave, possibly due to lack of satisfaction, compensation or growth.
*  MonthlyIncome - employees with lower income are more prone to leave the company.
*  StockOptionLevel - having no stock options might contribute to employees leaving.
*  TotalWorkingYears - less experienced employees are more prone to leaving. They may be exploring career options or seeking better opportunities.
*  YearsInCurrentRole - short tenure in a role is a red flag. It can indicate role misfit or lack of progression.
*  YearsWithCurrManager - attrition is more likely for employees who have spent fewer years with their current manager.
*  YearsAtCompany - newer employees are at higher risk of attrition. The first few years are critical.
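
Histograms show counts, which are hard to compare when groups differ in size. A per-group attrition rate gives a more direct check of remarks like the ones above. A minimal sketch on made-up data (the `toy` frame is hypothetical), assuming Attrition is encoded as 1 = left and 0 = stayed, as in this notebook:

```python
import pandas as pd

# Hypothetical toy data; Attrition is encoded as 1 = left, 0 = stayed
toy = pd.DataFrame({
    "JobLevel":  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Attrition": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the attrition rate
attrition_rate = toy.groupby("JobLevel")["Attrition"].mean()
print(attrition_rate)
```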


# **Outliers detection**

In [None]:
# Split the data into numerical and categorical columns
num_attr = df.select_dtypes(include='number').columns
cat_attr = df.select_dtypes(include='category').columns

In [None]:
cat_attr

In [None]:
# Numerical columns to be capped on the upper side, based on their distributions: YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
columns = ['YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
df_capped = cap_upper_iqr(df, columns)

In [None]:
# The comparison for columns before and after capping
for col in columns:
    plt.figure(figsize=(12, 4))

    # Original
    plt.subplot(1, 2, 1)
    sns.boxplot(x=df[col], color='salmon')
    plt.title(f"{col} - Original")

    # Capped
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df_capped[col], color='lightblue')
    plt.title(f"{col} - IQR Capped")

    plt.tight_layout()
    plt.show()

## **Remarks for possible outliers**


*   Columns YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager might have outliers.
* Upper outliers are capped at Q3 + 1.5 × IQR, where IQR = Q3 - Q1 is computed from the first and third quartiles. Because only values above this bound are replaced, the outliers are reduced without removing rows and the dataset size remains constant.
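
The capping rule can be traced on a small example. This is a minimal sketch with made-up values (the `tenure` series is hypothetical); it follows the same steps as `cap_upper_iqr` for a single column.

```python
import pandas as pd

# Hypothetical tenure values with one extreme observation
tenure = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 40])

q1 = tenure.quantile(0.25)
q3 = tenure.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

# Values above the bound are replaced by the bound; no rows are dropped
capped = tenure.clip(upper=upper_bound)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper bound={upper_bound}")
print(capped.tolist())
```

Only the extreme value is changed; all other observations keep their original values.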


# **Feature Engineering**

In [None]:
# Dataset without MonthlyIncome and the other features weakly correlated with attrition
df_shrink = df.drop(columns=['MonthlyIncome', 'NumCompaniesWorked', 'Education', 'Gender', 'YearsSinceLastPromotion', 'MonthlyRate', 'HourlyRate', 'PercentSalaryHike', 'PerformanceRating'])
df_shrink.columns

In [None]:
# Verify categorical and numerical attributes content for original dataset
print('Numerical attributes:', num_attr)
print('Categorical attributes:', cat_attr)

In [None]:
# Split the shrink data into numerical and categorical columns
num_attr_shrink = df_shrink.select_dtypes(include='number').columns
cat_attr_shrink = df_shrink.select_dtypes(include='category').columns

In [None]:
# Verify categorical and numerical attributes content for shrink dataset
print('Numerical attributes:', num_attr_shrink)
print('Categorical attributes:', cat_attr_shrink)

In [None]:
# Column lists for the column transformers (the target column is excluded from the numeric features)
excluded_column = 'Attrition'
num_attr_except_one = [col for col in num_attr if col != excluded_column]
all_attr = list(num_attr) + list(cat_attr)

num_attr_shrink_except_one = [col for col in num_attr_shrink if col != excluded_column]
all_attr_shrink = list(num_attr_shrink) + list(cat_attr_shrink)

In [None]:
# Column transformer to preprocess numeric and categorical columns
# Note: `sparse_output` replaces the former `sparse` argument (scikit-learn 1.2 and later)
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_attr_except_one),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_attr)
])

preprocessor_shrink = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_attr_shrink_except_one),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_attr_shrink)
])

In [None]:
# Function to get pipeline with any model
def get_model_pipeline(model):
    return SklearnPipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])

# **Modelling for original dataset**

In [None]:
# Retrieving columns which will represent independent/dependent variables
X = df.drop(['Attrition'], axis = 1)
y = df['Attrition']  # 1-D target, as expected by scikit-learn estimators

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=True, stratify=y, random_state = 111)

## **Logistic Regression model**

In [None]:
# Logistic regression model
model_1 = get_model_pipeline(LogisticRegression())
model_1.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_1.predict(X_test)
y_pred_tr = model_1.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_1, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_1, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_names = []

# Extract feature names from each transformer in the ColumnTransformer
for name, transformer, columns in preprocessor.transformers_:
    if name != 'remainder':
        if hasattr(transformer, 'get_feature_names_out'):
            transformed_names = transformer.get_feature_names_out(columns)
        else:
            transformed_names = columns
        feature_names.extend(transformed_names)

m1 = model_1.named_steps['model']
importances = m1.coef_[0]

# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Coefficient')
plt.title('Feature Coefficients - Logistic Regression')
plt.tight_layout()
plt.show()


*  A positive coefficient increases the predicted odds of leaving the company. This is the case for: OverTime, BusinessTravel_Travel_Frequently, EducationField_Technical Degree, JobRole_Sales Representative, JobRole_Laboratory Technician, YearsAtCompany, MaritalStatus_Single, DistanceFromHome, NumCompaniesWorked, YearsSinceLastPromotion, JobRole_Human Resources, Department_Sales, EducationField_Human Resources, EducationField_Marketing, HourlyRate, MonthlyRate, BusinessTravel_Travel_Rarely, JobRole_Manager, PercentSalaryHike, Education, JobLevel and JobRole_Sales Executive.
*   OverTime, BusinessTravel_Travel_Frequently and EducationField_Technical Degree are the strongest predictors.
*   PerformanceRating and JobRole_Sales Executive are the weakest predictors.
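
A logistic regression coefficient can be turned into an odds ratio with `exp(coef)`: values above 1 raise the odds of the positive class, values below 1 lower them. The sketch below uses made-up data (`X` and `y` are hypothetical). In this notebook the numeric features are standardized, so one unit there corresponds to one standard deviation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single binary feature (think "works overtime")
X = np.array([[0], [0], [0], [0], [0], [0], [1], [1], [1], [1]])
y = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

coef = clf.coef_[0][0]
odds_ratio = np.exp(coef)  # multiplicative change in the odds for a one-unit increase

print(f"coefficient: {coef:.3f}, odds ratio: {odds_ratio:.3f}")
```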






In [None]:
# Gather some metrics for future comparison of models performance
metrics_1 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_1, X_test, y_test)

AUC is 0.87, which means that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 87% of the time.
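
The ranking interpretation of AUC can be verified directly: the share of (positive, negative) pairs in which the positive example gets the higher score matches `roc_auc_score`. A minimal sketch with made-up labels and scores (`y_true` and `y_score` are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.60, 0.55, 0.40, 0.80, 0.90])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half
diff = pos[:, None] - neg[None, :]
pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc = roc_auc_score(y_true, y_score)
print(f"pairwise: {pairwise:.4f}, roc_auc_score: {auc:.4f}")
```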

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_1, X_test, y_test)

In [None]:
# Row with model
row_1 = create_model_summary_row("Logistic Regression", metrics_1)

## **Random Forest model**

In [None]:
# Random Forest Classifier model
model_2 = get_model_pipeline(RandomForestClassifier())
model_2.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_2.predict(X_test)
y_pred_tr = model_2.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_2, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_2, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_2, preprocessor)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Feature Importances - Random Forest')
plt.tight_layout()
plt.show()


*   Tree-based feature importances are non-negative: they show how much each feature contributes to the model, not whether it raises or lowers the odds of leaving. All features have a non-zero importance here.
*   OverTime, MonthlyIncome and Age are the strongest predictors.
*  JobRole_Research Director and Department_Human Resources are the weakest predictors.
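
The difference between tree-based importances and logistic regression coefficients can be seen on synthetic data. In this minimal sketch (all data and names are hypothetical) the target is more likely when the first feature is low: the coefficient is negative, while the importance is still positive.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: the target is more likely when feature 0 is LOW;
# feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) < 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
logreg = LogisticRegression().fit(X, y)

importances = forest.feature_importances_  # non-negative, sums to 1
coefs = logreg.coef_[0]                    # signed

print("importances:", importances.round(3))
print("coefficients:", coefs.round(3))
```

To recover the direction of an effect for a tree model, a separate tool such as a partial dependence plot would be needed.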








In [None]:
# Gather some metrics for future comparison of models performance
metrics_2 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_2, X_test, y_test)

AUC is 0.83, which means that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 83% of the time.

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_2, X_test, y_test)

In [None]:
# Row with model
row_2 = create_model_summary_row("Random Forest", metrics_2)

## **Decision tree model**

In [None]:
# DecisionTree Classifier model
model_3 = get_model_pipeline(DecisionTreeClassifier())
model_3.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_3.predict(X_test)
y_pred_tr = model_3.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_3, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_3, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_3, preprocessor)
# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Feature Importances - Decision Tree')
plt.tight_layout()
plt.show()


*   Feature importances are non-negative and do not show the direction of an effect. The features with non-zero importance are: MonthlyIncome, OverTime, DistanceFromHome, NumCompaniesWorked, BusinessTravel_Travel_Frequently, TotalWorkingYears, StockOptionLevel, MaritalStatus_Single, Department_Sales, EnvironmentSatisfaction, DailyRate, EducationField_Marketing, JobInvolvement, YearsSinceLastPromotion, YearsAtCompany, Gender, HourlyRate, Department_Research & Development, PercentSalaryHike and TrainingTimesLastYear.
*   The remaining columns have zero importance.
*   MonthlyIncome is the strongest predictor.





In [None]:
# Gather some metrics for future comparison of models performance
metrics_3 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_3, X_test, y_test)

AUC is 0.57, which means that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed only about 57% of the time, barely better than random guessing (0.5).

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_3, X_test, y_test)

In [None]:
# Row with model
row_3 = create_model_summary_row("Decision Tree", metrics_3)

## **AdaBoost model**

In [None]:
#  AdaBoost Classifier model
model_4 = get_model_pipeline(AdaBoostClassifier())
model_4.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_4.predict(X_test)
y_pred_tr = model_4.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_4, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_4, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_4, preprocessor)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Feature Importances - AdaBoost')
plt.tight_layout()
plt.show()

* Feature importances are non-negative and do not show the direction of an effect. The features with non-zero importance are: OverTime, YearsWithCurrManager, MonthlyIncome, JobSatisfaction, YearsAtCompany, StockOptionLevel, HourlyRate, RelationshipSatisfaction, EnvironmentSatisfaction, Age, DistanceFromHome, JobInvolvement, Department_Research & Development, TrainingTimesLastYear, EducationField_Technical Degree, JobLevel, JobRole_Laboratory Technician, Department_Sales, BusinessTravel_Travel_Frequently, NumCompaniesWorked and BusinessTravel_Non-Travel.
* The remaining features have zero importance.
* OverTime and YearsWithCurrManager are the strongest predictors.

In [None]:
# Gather some metrics for future comparison of models performance
metrics_4 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_4, X_test, y_test)

AUC is 0.82, which means that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 82% of the time.

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_4, X_test, y_test)

In [None]:
# Row with model
row_4 = create_model_summary_row("Ada Boost", metrics_4)

## **Gradient Boosting model**

In [None]:
# Gradient Boosting model
model_5 = get_model_pipeline(GradientBoostingClassifier())
model_5.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_5.predict(X_test)
y_pred_tr = model_5.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_5, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_5, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_5, preprocessor)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Feature Importances - Gradient Boosting')
plt.tight_layout()
plt.show()

* Feature importances are non-negative and do not show the direction of an effect. Almost all features have a non-zero importance.
* PerformanceRating, EducationField_Life Sciences, BusinessTravel_Travel_Rarely, EducationField_Human Resources, JobRole_Manager, JobRole_Human Resources, JobRole_Healthcare Representative, EducationField_Other, JobRole_Manufacturing Director, JobRole_Research Scientist, JobRole_Research Director, MaritalStatus_Divorced and MaritalStatus_Married have zero importance.
* MonthlyIncome and OverTime are the strongest predictors.

In [None]:
# Gather some metrics for future comparison of models performance
metrics_5 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_5, X_test, y_test)

AUC is 0.80, which means that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 80% of the time.

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_5, X_test, y_test)

In [None]:
# Row with model
row_5 = create_model_summary_row("Gradient Boosting", metrics_5)

## **XGBoost model**

In [None]:
# XGB model
model_6 = get_model_pipeline(XGBClassifier())
model_6.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_6.predict(X_test)
y_pred_tr = model_6.predict(X_train)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_6, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_6, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_6, preprocessor)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Feature Importances - XGBoost')
plt.tight_layout()
plt.show()

* Feature importances are non-negative and do not show the direction of an effect. Almost all features have a non-zero importance.
* PerformanceRating, EducationField_Human Resources, Department_Human Resources, JobRole_Healthcare Representative, JobRole_Human Resources, JobRole_Manufacturing Director and JobRole_Research Director have zero importance.
* OverTime and JobRole_Sales Executive are the strongest predictors.

In [None]:
# Gather some metrics for future comparison of models performance
metrics_6 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_6, X_test, y_test)

The AUC is 0.79: a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 79% of the time.

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_6, X_test, y_test)

In [None]:
# Row with model
row_6 = create_model_summary_row("XGBoost", metrics_6)

## **CatBoost model**

In [None]:
# CatBoost classifier model
model_7 = get_model_pipeline(CatBoostClassifier())
model_7.fit(X_train, y_train)

# Model performance evaluation for test and train set
y_pred = model_7.predict(X_test)
y_pred_tr = model_7.predict(X_train)

In [None]:
# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(model_7, X_train, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(model_7, X_test, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
feature_importance_df = get_feature_importance_df(model_7, preprocessor)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances - Cat Boost')
plt.tight_layout()
plt.show()

* Feature importances of tree-based models are non-negative: they show how much a feature contributes to the model, not whether it raises or lowers the odds of attrition.
* EducationField_Human Resources has zero importance.
* OverTime, MonthlyIncome and Age are the strongest predictors.

In [None]:
# Gather some metrics for future comparison of models performance
metrics_7 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(model_7, X_test, y_test)

The AUC is 0.82: a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed about 82% of the time.

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(model_7, X_test, y_test)

In [None]:
# Row with model
row_7 = create_model_summary_row("CatBoost", metrics_7)

## **Model summary**

In [None]:
# Add all results to dataframe
results = pd.DataFrame(columns=['model', 'tp', 'tn', 'fp', 'fn', 'correct', 'incorrect',
                                  'accuracy_train', 'accuracy_test', 'precision_train', 'precision_test', 'recall_train', 'recall_test', 'f1_train', 'f1_test', 'roc_auc', 'avg_pre'])
new_rows = pd.DataFrame([row_1, row_2, row_3, row_4, row_5, row_6, row_7])
results = pd.concat([results, new_rows], ignore_index=True)
results

## **General overview of models**

* Logistic Regression performs the best overall with a strong balance between precision and recall, making it the most robust choice. It is well-suited for imbalanced classification with minimal overfitting.

* Random Forest is overfitting—perfect precision on train and test, but almost no true positives captured (low recall). Poor generalization and poor F1.

* Single tree is underperforming. Although interpretable, it lacks predictive power and likely overfits.

* AdaBoost is precise but fails to capture actual attrition cases (low recall). Performs similarly to Random Forest but worse than Logistic Regression.

* Gradient Boosting has decent recall and ROC AUC and beats AdaBoost, but it still significantly underperforms Logistic Regression. Tuning may improve it.

* XGBoost shows promise—balanced performance and better recall than other tree-based models. Still trails behind Logistic Regression in F1 and ROC AUC.

* CatBoost is stable and good at precision but lacks strong recall. Similar to AdaBoost and Gradient Boosting.
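
The overview above leans on precision, recall and F1. A minimal sketch of how the three relate, using made-up confusion-matrix counts (not results from this notebook), shows why a model with perfect precision but few detected leavers still ends up with a poor F1. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical model that flags only 5 of 50 leavers, all of them correctly
p, r, f1 = precision_recall_f1(tp=5, fp=0, fn=45)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it stays close to the smaller of the two values.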

# **To sum up, Logistic Regression is carried forward to the next stage of the analysis.**


# **Logistic regression with undersampling**

In [None]:
feature_names = []

# Extract feature names from each transformer in the ColumnTransformer
for name, transformer, columns in preprocessor.transformers_:
    if name != 'remainder':
        if hasattr(transformer, 'get_feature_names_out'):
            transformed_names = transformer.get_feature_names_out(columns)
        else:
            transformed_names = columns
        feature_names.extend(transformed_names)

m1 = model_1.named_steps['model']
importances = abs(m1.coef_[0])

# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(15, 12))
plt.barh(feature_importance_df['Feature'][::-1], feature_importance_df['Importance'][::-1])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances - Logistic regression')
plt.tight_layout()
plt.show()
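
The plot ranks features by the absolute value of their coefficients, which hides the direction of each effect. As a hedged illustration with invented coefficients (not the fitted values of `model_1`): in logistic regression, `exp(coef)` is the factor by which the odds of attrition are multiplied when a feature rises by one unit on the scale the model was fitted on (one standard deviation for standardized features). A positive coefficient therefore raises the odds and a negative one lowers them.

```python
import math

# Invented coefficients for illustration only
demo_coefs = {"OverTime": 0.8, "JobSatisfaction": -0.4, "PerformanceRating": 0.0}

# exp(coef) > 1 raises the odds of attrition, < 1 lowers them, == 1 leaves them unchanged
odds_ratios = {name: math.exp(coef) for name, coef in demo_coefs.items()}

# Sort by strength of the effect, regardless of its direction
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: abs(math.log(kv[1])), reverse=True):
    print(f"{name}: odds multiplied by {ratio:.2f}")
```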

In [None]:
# For further analysis, only the numerical features with the greatest feature importance are used.
# All categorical features are considered.
num_shrink = ['OverTime', 'YearsAtCompany', 'JobSatisfaction', 'EnvironmentSatisfaction', 'DistanceFromHome',
              'YearsWithCurrManager', 'YearsInCurrentRole', 'NumCompaniesWorked', 'TotalWorkingYears',
              'YearsSinceLastPromotion']
attr = list(num_shrink) + list(cat_attr)

In [None]:
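# All categorical features are kept, so the encoder uses cat_attr (the same list used to build attr).
# Note: scikit-learn >= 1.2 renames OneHotEncoder's `sparse` argument to `sparse_output`.
preprocessor_shrink = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_shrink),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse=False), cat_attr)
])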

In [None]:
X_train_top = X_train[attr]
X_test_top = X_test[attr]

In [None]:
undersample_pipeline = ImbPipeline(steps=[
    ("preprocessing", preprocessor_shrink),
    ("undersample", RandomUnderSampler(random_state=42)),
    ("model", LogisticRegression())
])
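
For intuition about the `undersample` step: random undersampling keeps every minority-class row and draws, without replacement, an equally sized random subset of the majority class, so the classifier is fitted on balanced data. The pure-Python sketch below mimics that idea on toy labels; `random_undersample` is a hypothetical helper and does not reproduce imbalanced-learn's implementation.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=42):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # sample() draws without replacement; the minority class is kept whole
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 8 employees who stayed (0) and 2 who left (1)
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2
_, balanced_labels = random_undersample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Within an `ImbPipeline`, the resampling step runs only during `fit`, so predictions on the test set are made on the untouched data.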

In [None]:
# Logistic regression model
undersample_pipeline.fit(X_train_top, y_train)

# Model performance evaluation for test and train set
y_pred = undersample_pipeline.predict(X_test_top)
y_pred_tr = undersample_pipeline.predict(X_train_top)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(undersample_pipeline, X_train_top, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(undersample_pipeline, X_test_top, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
# Gather some metrics for future comparison of models performance
metrics_8 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(undersample_pipeline, X_test_top, y_test)

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(undersample_pipeline, X_test_top, y_test)

In [None]:
# Row with model
row_8 = create_model_summary_row("Logistic regression with undersampling", metrics_8)

# **Logistic Regression with oversampling**

In [None]:
oversample_pipeline = ImbPipeline(steps=[
    ("preprocessing", preprocessor_shrink),
    ("oversample", RandomOverSampler(random_state=42)),
    ("model", LogisticRegression())
])

In [None]:
# Logistic regression model
oversample_pipeline.fit(X_train_top, y_train)

# Model performance evaluation for test and train set
y_pred = oversample_pipeline.predict(X_test_top)
y_pred_tr = oversample_pipeline.predict(X_train_top)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(oversample_pipeline, X_train_top, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(oversample_pipeline, X_test_top, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
# Gather some metrics for future comparison of models performance
metrics_9 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(oversample_pipeline, X_test_top, y_test)

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(oversample_pipeline, X_test_top, y_test)

In [None]:
# Row with model
row_9 = create_model_summary_row("Logistic regression with oversampling", metrics_9)

# **Logistic regression with SMOTE**

In [None]:
#!pip install scikit-learn==1.3.2 imbalanced-learn==0.11.0 numpy==1.23.5 pandas==1.5.3 --quiet --user

In [None]:
smote_pipeline = ImbPipeline(steps=[
    ("preprocessing", preprocessor_shrink),
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression())
])
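
SMOTE balances the classes by creating synthetic minority samples instead of duplicating existing ones: each new point lies on the line segment between a minority sample and one of its nearest minority-class neighbours. Below is a minimal sketch of that interpolation step, with the neighbour chosen by hand and invented feature values; `smote_point` is a hypothetical helper, not the imbalanced-learn API.

```python
import random

def smote_point(sample, neighbour, rng):
    """Interpolate a synthetic point between a minority sample and its neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(42)
leaver_a = [1.0, 2.0]  # invented, already-scaled feature vectors
leaver_b = [3.0, 6.0]
synthetic = smote_point(leaver_a, leaver_b, rng)
print(synthetic)
```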

In [None]:
#Logistic regression model
smote_pipeline.fit(X_train_top, y_train)

# Model performance evaluation for test and train set
y_pred = smote_pipeline.predict(X_test_top)
y_pred_tr = smote_pipeline.predict(X_train_top)

# Confusion matrix for train set
print("Confusion matrix for train set: ")
ConfusionMatrixDisplay.from_estimator(smote_pipeline, X_train_top, y_train)

In [None]:
# Confusion matrix for test set
print("Confusion matrix for test set: ")
ConfusionMatrixDisplay.from_estimator(smote_pipeline, X_test_top, y_test)

In [None]:
# Classification report for test set
print(classification_report(y_test, y_pred))

In [None]:
# Classification report for train set
print(classification_report(y_train, y_pred_tr))

In [None]:
# Gather some metrics for future comparison of models performance
metrics_10 = compute_classification_metrics(y_train, y_pred_tr, y_test, y_pred)

In [None]:
# ROC curve
plot_roc_curve(smote_pipeline, X_test_top, y_test)

In [None]:
# Precision-Recall curve
plot_precision_recall_curve(smote_pipeline, X_test_top, y_test)

In [None]:
# Row with model
row_10 = create_model_summary_row("Logistic regression with SMOTE", metrics_10)

## **Best model summary**

In [None]:
new_rows = pd.DataFrame([row_8, row_9, row_10])
results = pd.concat([results, new_rows], ignore_index=True)
results

In [None]:
# ROC curves for all logistic regression models
models = {
    "Logistic Regression": model_1,
    "Logistic regression with undersampling": undersample_pipeline,
    "Logistic regression with oversampling": oversample_pipeline,
    "Logistic regression with SMOTE": smote_pipeline
}
plt.figure(figsize=(12, 8))
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=plt.gca())
plt.title("ROC Curve Comparison")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

**Models summary**

* The baseline Logistic Regression model has the highest test accuracy, precision and F1 score, with good ROC AUC and average precision. Its drawback is the lowest recall: it misses more positive cases than the other models.
* The undersampling model has the highest recall but the lowest precision. It has a moderate F1 and the highest ROC AUC, and seems to be the best option for maximizing recall.
* The oversampling model gives results similar to the undersampling model, with slightly lower performance overall.
* The SMOTE model is a balanced option between undersampling and oversampling. It has slightly better precision and F1 than oversampling.

* **Undersampling model** is the best for attrition detection. It has the highest recall, correctly detecting almost 78% of attrition cases, and the best ROC AUC (0.77). Although it has lower precision, in attrition detection false positives are less harmful than false negatives.

## **Making a predictive system**

In [None]:
input_data = (0, 9, 4, 3, 2, 3, 7, 4, 12, 7, 'Travel_Rarely', 'Sales', 'Medical', 'Sales Executive', 'Single')

# Build a one-row DataFrame directly from the tuple so that numeric values keep
# their numeric dtype (np.asarray on a mixed tuple would turn every value into a string)
example = pd.DataFrame([input_data], columns=X_test_top.columns)

best_model = undersample_pipeline

prediction = best_model.predict(example)

if prediction[0] == 0:
    print('The person will stay.')
else:
    print('The person will leave.')

In [None]:
example.T

## **Saving the trained model**

In [None]:
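filename = './model/trained_model.sav'
os.makedirs(os.path.dirname(filename), exist_ok=True)  # create ./model if it is missing
with open(filename, 'wb') as f:  # the context manager closes the file
    pickle.dump(best_model, f)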

In [None]:
# Loading the saved model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
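
A small round-trip check can confirm that a pickled object loads back unchanged. The sketch below uses a plain dictionary as a stand-in for the fitted pipeline and a temporary directory instead of `./model`; both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

stand_in_model = {"name": "undersample_pipeline", "threshold": 0.5}  # placeholder object

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_model.sav")
    # Context managers close the file handles even if an error occurs
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

For a real pipeline, an analogous check would compare predictions of the original and the reloaded model on the same rows.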

In [None]:
input_data = (1, 9, 4, 3, 2, 3, 7, 4, 12, 7, 'Travel_Rarely', 'Sales', 'Medical', 'Sales Executive', 'Married')

# Build a one-row DataFrame directly from the tuple so that numeric values keep
# their numeric dtype (np.asarray on a mixed tuple would turn every value into a string)
example = pd.DataFrame([input_data], columns=X_test_top.columns)

prediction = loaded_model.predict(example)

if prediction[0] == 0:
    print('The person will stay.')
else:
    print('The person will leave.')

In [None]:
example.T