In [1]:
import pandas as pd

In [None]:
df = pd.read_csv("../data/Extended_Employee_Performance_and_Productivity_Data.csv")

In [3]:
df

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
0,1,IT,Male,55,Specialist,2022-01-19 08:03:05.556036,2,High School,5,6750.0,33,32,22,2,0,14,66,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18 08:03:05.556036,0,High School,5,7500.0,34,34,13,14,100,12,61,2,1.72,False
2,3,Finance,Male,55,Specialist,2015-10-26 08:03:05.556036,8,High School,3,5850.0,37,27,6,3,50,10,1,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22 08:03:05.556036,7,Bachelor,2,4800.0,52,10,28,12,100,10,0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23 08:03:05.556036,3,Bachelor,2,4800.0,38,11,29,13,100,15,9,1,1.25,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99996,Finance,Male,27,Technician,2022-12-07 08:03:05.556036,1,Bachelor,4,4900.0,55,46,5,3,75,16,48,2,1.28,False
99996,99997,IT,Female,36,Consultant,2018-07-24 08:03:05.556036,6,Master,5,8250.0,39,35,7,0,0,10,77,1,3.48,True
99997,99998,Operations,Male,53,Analyst,2015-11-24 08:03:05.556036,8,High School,2,4800.0,31,13,6,5,0,5,87,1,2.60,False
99998,99999,HR,Female,22,Consultant,2015-08-03 08:03:05.556036,9,High School,5,8250.0,35,43,10,1,75,2,31,1,3.10,False


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Employee_ID                  100000 non-null  int64  
 1   Department                   100000 non-null  object 
 2   Gender                       100000 non-null  object 
 3   Age                          100000 non-null  int64  
 4   Job_Title                    100000 non-null  object 
 5   Hire_Date                    100000 non-null  object 
 6   Years_At_Company             100000 non-null  int64  
 7   Education_Level              100000 non-null  object 
 8   Performance_Score            100000 non-null  int64  
 9   Monthly_Salary               100000 non-null  float64
 10  Work_Hours_Per_Week          100000 non-null  int64  
 11  Projects_Handled             100000 non-null  int64  
 12  Overtime_Hours               100000 non-null  int64  
 13  

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve


In [6]:
# Drop irrelevant columns (e.g., Employee_ID, Hire_Date)
df = df.drop(["Employee_ID", "Hire_Date"], axis=1)

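# Encode categorical and boolean columns as integer codes.
# Note: this selection also includes the boolean target, Resigned (False -> 0, True -> 1).
# LabelEncoder assigns codes in sorted order, which carries no meaning for
# nominal columns such as Department.
cat_cols = df.select_dtypes(include=["object", "bool"]).columns

for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

# Features & Target
X = df.drop("Resigned", axis=1)
y = df["Resigned"].astype(int)   # already 0/1 after label encoding; the cast keeps the dtype explicit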

In [7]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize all features (fit on the training split only, then apply to the test split)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
# Logistic Regression Model
log_reg = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:,1]

In [9]:
# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.50      0.65     17998
           1       0.10      0.49      0.17      2002

    accuracy                           0.50     20000
   macro avg       0.50      0.50      0.41     20000
weighted avg       0.82      0.50      0.60     20000

Confusion Matrix:
 [[9069 8929]
 [1016  986]]
ROC AUC Score: 0.4940032742010739


# 📊 Logistic Regression Evaluation: Employee Resignation Model

---

## 1. Classification Report

| Metric      | Class `0` (Not Resigned) | Class `1` (Resigned) | Interpretation |
|-------------|---------------------------|-----------------------|----------------|
| **Precision** | 0.90 | 0.10 | Both values match the class shares in the test set (about 90% stayers, 10% leavers), so the predictions carry essentially no information beyond the base rates. |
| **Recall**    | 0.50 | 0.49 | The model captures only about half of the stayers and half of the leavers. |
| **F1-Score**  | 0.65 | 0.17 | The harmonic mean of precision and recall; very weak for resignations. |
| **Support**   | 17,998 | 2,002 | Test set is **highly imbalanced**: about 90% non-resigned vs. 10% resigned. |

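**🔑 Takeaway**:  
The model is **not separating the two classes**: with `class_weight="balanced"`, it flags roughly half of the test set as resignations (9,915 of 20,000), regardless of the true label.  
Precision for resignation (`0.10`) means that when it predicts an employee will resign, it is wrong about 90% of the time, which is what the 10% base rate alone would give.  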

---

## 2. Confusion Matrix

- **True Negatives (TN = 9069):** Correctly predicted employees who stayed.  
- **False Positives (FP = 8929):** Incorrectly predicted resignations (false alarms).  
- **False Negatives (FN = 1016):** Missed resignations (employees left but predicted as staying).  
- **True Positives (TP = 986):** Correctly predicted resignations.  

⚠️ The **false positives are very high (8929)** → about half of the 17,998 employees who stayed are incorrectly flagged as resignations.  
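
The figures in the classification report can be reproduced by hand from the four confusion-matrix counts. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp, fn, tp = 9069, 8929, 1016, 986
total = tn + fp + fn + tp

# Class 1 (Resigned)
precision_resigned = tp / (tp + fp)   # share of predicted resignations that were real
recall_resigned = tp / (tp + fn)      # share of real resignations that were caught

# Class 0 (Not Resigned)
precision_stayed = tn / (tn + fn)
recall_stayed = tn / (tn + fp)

accuracy = (tp + tn) / total
base_rate = (tp + fn) / total         # share of employees who actually resigned
flagged_share = (tp + fp) / total     # share of employees the model flags as resigning

print(f"precision(1)={precision_resigned:.2f}  recall(1)={recall_resigned:.2f}")
print(f"precision(0)={precision_stayed:.2f}  recall(0)={recall_stayed:.2f}")
print(f"accuracy={accuracy:.2f}  base rate={base_rate:.2f}  flagged={flagged_share:.2f}")
```

Precision for class `1` comes out almost identical to the base rate, which is what a classifier with no discriminating power would produce.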

---

## 3. ROC AUC Score

- **ROC AUC = 0.494** (≈ 0.5) → The model performs **no better than random guessing** in distinguishing resigned vs. not resigned employees.  
- As a common rule of thumb, a useful model should reach an AUC of **≥ 0.7**.  
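
One way to read the 0.5 benchmark: ROC AUC equals the probability that a randomly chosen resigned employee receives a higher score than a randomly chosen non-resigned one. The sketch below is illustrative only. It uses synthetic labels with the same class counts as the test set and made-up scores, not the notebook's data, and `rank_auc` is a hypothetical helper written for this example (it assumes there are no tied scores).

```python
import random


def rank_auc(labels, scores):
    """AUC computed from the Mann-Whitney U statistic (assumes no tied scores)."""
    # Sort sample indices by score, then sum the 1-based ranks of the positives.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


rng = random.Random(0)

# Synthetic labels with the same class counts as the test set above.
labels = [1] * 2002 + [0] * 17998

# Scores unrelated to the labels: AUC should land close to 0.5.
random_scores = [rng.random() for _ in labels]
auc_random = rank_auc(labels, random_scores)

# Scores that rank every positive above every negative: AUC is 1.0.
perfect_scores = [y + 0.5 * rng.random() for y in labels]
auc_perfect = rank_auc(labels, perfect_scores)

print(f"AUC with random scores:  {auc_random:.3f}")
print(f"AUC with perfect scores: {auc_perfect:.3f}")
```

An observed AUC of 0.494 sits in the same range as the random-score case, which supports reading the model's scores as uninformative.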

---

## ✅ Overall Interpretation

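- The test set is **imbalanced** (90% stayers vs. 10% leavers), but imbalance alone does not explain the result: ROC AUC is threshold-independent and still sits at about 0.5.  
- With `class_weight="balanced"`, the model predicts "Resigned" for about half of all employees, so overall accuracy is only 0.50, well below the 0.90 that always predicting "Not Resigned" would achieve.  
- The **low precision and recall for resignations** make the model **unreliable for predicting employee turnover**.  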
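
For comparison, the sketch below fits the same kind of model (`LogisticRegression` with `class_weight="balanced"`) on purely synthetic data in which the features are random noise and about 10% of the labels are positive. The data, the feature count, and the variable names are assumptions made for illustration and have nothing to do with the employee dataset. The point is only to show what metrics look like when the features contain no signal at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, signal-free data: features are independent of the labels.
n_samples, n_features = 20_000, 10
X_noise = rng.normal(size=(n_samples, n_features))
y_noise = (rng.random(n_samples) < 0.10).astype(int)   # roughly 10% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=42, stratify=y_noise
)

# Same model settings as the notebook's logistic regression.
noise_model = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
noise_model.fit(X_tr, y_tr)

noise_auc = roc_auc_score(y_te, noise_model.predict_proba(X_te)[:, 1])
noise_flagged_share = noise_model.predict(X_te).mean()

print(f"ROC AUC on noise features: {noise_auc:.3f}")
print(f"Share of test rows flagged as positive: {noise_flagged_share:.2f}")
```

If the AUC from this noise-only setup is close to the 0.494 reported above, that is consistent with the employee features carrying little usable signal for `Resigned` in their current encoding, although it does not prove it.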

---


👉 The current model does not yet provide reliable insight into **employee resignation risk**; improvements are needed before deployment.
