# HR Analytics: Employee Attrition & Workforce Insights
## Dataset: IBM HR Analytics Employee Attrition & Performance (Kaggle)  

### 03 - Modeling
This attrition risk model helps HR teams identify employees who are more likely to resign. The goal is to understand what drives attrition and to generate an early-warning risk score for each employee.

In [1]:
# Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

In [2]:
# Load dataset

df = pd.read_csv("hr_cleaned.csv") 
df.head()

| Unnamed: 0 | age | attrition | business_travel | daily_rate | department | distance_from_home | education | education_field | employeecount | employee_number | ... | work_life_balance | years_at_company | years_in_current_role | years_since_last_promotion | years_with_curr_manager | tenure_group | age_group | overtime_flag | income_band | satisfaction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | Travel Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 6 | 4 | 0 | 5 | 3-6 | Mid | 1 | Low | 2.333333 |
| 1 | 49 | 0 | Travel Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 3 | 10 | 7 | 1 | 7 | 7-10 | Senior | 0 | Low | 2.666667 |
| 2 | 37 | 1 | Travel Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 3 | 0 | 0 | 0 | 0 | 0-2 | Mid | 1 | Low | 3.333333 |
| 3 | 33 | 0 | Travel Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 8 | 7 | 3 | 0 | 7-10 | Mid | 1 | Low | 3.333333 |
| 4 | 27 | 0 | Travel Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 3 | 2 | 2 | 2 | 2 | 0-2 | Young | 0 | Low | 2.0 |
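
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was written with its index and then read back without declaring it. The notebook does not show how `hr_cleaned.csv` was written, so this is an assumption. A minimal sketch on a hypothetical two-row CSV shows how `index_col=0` avoids the stray column:

```python
import io

import pandas as pd

# Hypothetical CSV text that mimics a file saved with its index column included
raw = ",age,attrition\n0,41,1\n1,49,0\n"

# Default read: the unnamed first column becomes a regular "Unnamed: 0" column
with_index_column = pd.read_csv(io.StringIO(raw))

# Declaring the first column as the index keeps it out of the feature columns
clean = pd.read_csv(io.StringIO(raw), index_col=0)

print(list(with_index_column.columns))
print(list(clean.columns))
```

Writing the cleaned file with `index=False` in the earlier notebook would have the same effect at the source.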


In [3]:
# Select Features and Target

features = [
    "overtime",
    "age",
    "monthly_income",
    "job_satisfaction",
    "years_at_company",
    "environment_satisfaction",
]

X = df[features].copy()  # explicit copy, so the assignment below does not modify a slice of df
y = df["attrition"]  # 1 = left, 0 = stayed

# Convert overtime "Yes"/"No" to 1/0 if it is still stored as text
if X["overtime"].dtype == "object":
    X["overtime"] = X["overtime"].map({"Yes": 1, "No": 0})


In [4]:
# Split data into train/test

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
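
Because attrition is usually a minority class, a plain random split can leave the train and test sets with slightly different attrition rates. One option is a stratified split. The sketch below uses a hypothetical target with 16% positives (not the real data) to show that `stratify` keeps the class ratio the same on both sides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 160 leavers out of 1000 (not the HR dataset)
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the share of each class in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.3f}")
print(f"Positive rate train:   {y_tr.mean():.3f}")
print(f"Positive rate test:    {y_te.mean():.3f}")
```

In this notebook the equivalent change would be passing `stratify=y` to `train_test_split`; the metrics reported below were produced without it.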

In [5]:
# Train Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


In [6]:
# Evaluate Model
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # probability of attrition

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("MODEL PERFORMANCE")
print("----------------------")
print(f"Accuracy: {accuracy:.3f}")
print(f"Recall: {recall:.3f}")
print(f"ROC-AUC: {roc_auc:.3f}")


MODEL PERFORMANCE
----------------------
Accuracy: 0.864
Recall: 0.115
ROC-AUC: 0.742
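
An accuracy of 0.864 alongside a recall of 0.115 suggests the model predicts "stayed" for almost everyone and misses most leavers, which is a common pattern with imbalanced classes. One possible remedy is `class_weight="balanced"`. The sketch below compares both settings on synthetic data from `make_classification`; the data, feature count, and class ratio are assumptions for illustration and the numbers it prints say nothing about the HR dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives); a stand-in, not the HR dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Same model twice: default weights vs. weights inversely proportional to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print(f"Recall (default weights):  {recall_plain:.3f}")
print(f"Recall (balanced weights): {recall_balanced:.3f}")
```

Balanced weights usually trade some accuracy and precision for recall, so the choice depends on how costly a missed leaver is compared with a false alarm.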


In [7]:
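# Extract Model Coefficients (Top Predictors)
# Note: coefficients are on each feature's raw scale, so their magnitudes
# are not directly comparable across features (e.g. income vs. a 0/1 flag).
coef = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_[0]
}).sort_values("Coefficient", ascending=False)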

print("\nTOP PREDICTORS")
print(coef)


TOP PREDICTORS
                    Feature  Coefficient
0                  overtime     1.571113
2            monthly_income    -0.000078
1                       age    -0.035848
4          years_at_company    -0.048576
3          job_satisfaction    -0.303342
5  environment_satisfaction    -0.319806
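
The `monthly_income` coefficient looks negligible mainly because income is measured in much larger units than the other features, not necessarily because it matters less. A common way to make coefficients comparable is to standardize the features first and rank by absolute value. The sketch below does this with a `StandardScaler` inside a pipeline; the two columns and the target are randomly generated stand-ins, so the values are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical stand-in features on very different scales
demo = pd.DataFrame({
    "overtime": rng.integers(0, 2, n),
    "monthly_income": rng.normal(6500, 4700, n).clip(1000, None),
})

# Synthetic target generated from an assumed logistic relationship
logit = 1.2 * demo["overtime"] - 0.0002 * (demo["monthly_income"] - 6500) - 1.5
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scaling inside the pipeline puts every coefficient on a per-standard-deviation basis
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, y_demo)

coef_std = pd.DataFrame({
    "Feature": demo.columns,
    "Coefficient": pipe.named_steps["logisticregression"].coef_[0],
})
coef_std["AbsCoefficient"] = coef_std["Coefficient"].abs()
print(coef_std.sort_values("AbsCoefficient", ascending=False))
```

Sorting by absolute value also keeps strong negative drivers, such as the satisfaction scores above, near the top of the list instead of at the bottom.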


In [8]:
# Create Risk Score Column (Predicted Probability)
# Note: this scores every employee, including the rows the model was trained on.

df["RiskScore"] = model.predict_proba(X)[:, 1]

# Optionally save the scored data to CSV
df.to_csv("employees_with_risk_score.csv", index=False)

print("\nRisk score added to dataframe.")


Risk score added to dataframe.
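
To use the score as an early warning, HR would typically turn it into a short watch list. The sketch below flags employees at or above a cut-off and sorts them by risk. The scores are hypothetical, and `RISK_THRESHOLD = 0.30` is an assumed value for illustration; a cut-off below the default 0.5 flags more employees, which raises recall at the cost of more false alarms:

```python
import pandas as pd

# Hypothetical scores; in the notebook these would come from df["RiskScore"]
scored = pd.DataFrame({
    "employee_number": [1, 2, 4, 5, 7],
    "RiskScore": [0.62, 0.08, 0.41, 0.15, 0.27],
})

# Assumed cut-off for illustration only; it should be tuned on held-out data
RISK_THRESHOLD = 0.30

# Flag employees at or above the cut-off and list them from highest to lowest risk
scored["HighRisk"] = scored["RiskScore"] >= RISK_THRESHOLD
watch_list = scored[scored["HighRisk"]].sort_values("RiskScore", ascending=False)

print(watch_list[["employee_number", "RiskScore"]])
```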
