# HR Attrition Prediction

This notebook analyzes HR data to understand why employees leave a company and builds a logistic regression model to predict employee attrition.

**Author:** Daeven Morgan

**Goal:** Use data to help HR identify drivers of turnover and employees at risk of leaving.

## 1. Import packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

pd.set_option('display.max_columns', None)

## 2. Load dataset
The dataset file should be named `HR_capstone_dataset.csv` and placed in the same folder as this notebook, in a `data/` subfolder, or in a `data/` folder one level up.


In [None]:
# Look in the notebook folder first, then in ../data/, then in ./data/
import os

file_paths = [
    'HR_capstone_dataset.csv',
    '../data/HR_capstone_dataset.csv',
    './data/HR_capstone_dataset.csv'
]

data_path = None
for path in file_paths:
    if os.path.exists(path):
        data_path = path
        break

if data_path is None:
    raise FileNotFoundError(
        'HR_capstone_dataset.csv not found. Looked in: ' + ', '.join(file_paths)
    )

df = pd.read_csv(data_path)
df.head()

## 3. Initial exploration

In [None]:
# Basic info
df.info()

In [None]:
# Descriptive statistics
df.describe(include='all')

In [None]:
# Check missing values
df.isna().sum()

In [None]:
# Check duplicates
df.duplicated().sum()

In [None]:
# Drop duplicates if any
df = df.drop_duplicates().reset_index(drop=True)
df.shape

## 4. Clean column names

In [None]:
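# rename() silently skips keys that are not present, so extra entries are harmless.
df = df.rename(columns={
    'Work_accident': 'work_accident',
    'time_spend_company': 'tenure',
    'Department': 'department',
    # Some copies of this dataset misspell this column; later cells expect
    # the corrected name 'average_monthly_hours'.
    'average_montly_hours': 'average_monthly_hours'
})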
df.columns

## 5. Exploratory data analysis (EDA)

In [None]:
# Count of employees who left vs stayed
df['left'].value_counts()

In [None]:
# Percentage of employees who left vs stayed
df['left'].value_counts(normalize=True) * 100

In [None]:
# Plot: left vs stayed
plt.figure(figsize=(4,3))
sns.countplot(x='left', data=df)
plt.title('Employees: Left vs Stayed')
plt.show()

In [None]:
# Plot: satisfaction by attrition
plt.figure(figsize=(6,4))
sns.kdeplot(data=df, x='satisfaction_level', hue='left', common_norm=False)
plt.title('Satisfaction Level by Attrition')
plt.show()

In [None]:
# Plot: monthly hours by attrition
plt.figure(figsize=(6,4))
sns.kdeplot(data=df, x='average_monthly_hours', hue='left', common_norm=False)
plt.title('Average Monthly Hours by Attrition')
plt.show()

In [None]:
# Plot: salary vs attrition
plt.figure(figsize=(6,4))
sns.countplot(x='salary', hue='left', data=df)
plt.title('Attrition by Salary Level')
plt.show()
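
The count plot above shows raw counts, which can be hard to compare when the salary groups differ in size. A minimal sketch of the same comparison as an attrition rate per group, using a small made-up frame (`demo` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up example rows; the real notebook would use df instead of demo.
demo = pd.DataFrame({
    'salary': ['low', 'low', 'low', 'low',
               'medium', 'medium', 'medium',
               'high', 'high', 'high'],
    'left':   [1, 1, 0, 0,
               1, 0, 0,
               0, 0, 0],
})

# 'left' is 0/1, so its mean within a group is the share who left.
rate_by_salary = demo.groupby('salary')['left'].mean().mul(100).round(1)
print(rate_by_salary)
```

On the real data, the same `groupby` call on `df` would give the percentage of leavers in each salary band.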

In [None]:
# Plot: tenure vs attrition
plt.figure(figsize=(6,4))
sns.boxplot(x='left', y='tenure', data=df)
plt.title('Tenure by Attrition')
plt.show()

## 6. Prepare data for modeling

In [None]:
# Separate features and target
X = df.drop(columns=['left'])
y = df['left']

# One-hot encode categorical variables
X = pd.get_dummies(X, columns=['department', 'salary'], drop_first=True)
X.head()

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape

In [None]:
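# Scale numeric columns (fit on the training set only, then apply to the test set)
numeric_cols = [
    'satisfaction_level', 'last_evaluation', 'number_project',
    'average_monthly_hours', 'tenure', 'work_accident',
    'promotion_last_5years'
]

# Work on explicit copies so the column assignments below
# cannot trigger a SettingWithCopyWarning.
X_train = X_train.copy()
X_test = X_test.copy()

scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])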

X_train.head()
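
An alternative to scaling the train and test frames by hand is to bundle preprocessing and the model in a scikit-learn `Pipeline`, so the scaler is always fitted on training data only. The sketch below is one possible arrangement, not the approach used in this notebook; it runs on synthetic data, and `demo`, `preprocess`, and `pipe` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the HR data (assumption: not the real dataset).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    'satisfaction_level': rng.uniform(0, 1, n),
    'tenure': rng.integers(2, 11, n),
    'salary': rng.choice(['low', 'medium', 'high'], n),
})
# Toy target: employees with low satisfaction are marked as having left.
left = (demo['satisfaction_level'] < 0.4).astype(int)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['satisfaction_level', 'tenure']),
    ('cat', OneHotEncoder(drop='first'), ['salary']),
])
pipe = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

Xtr, Xte, ytr, yte = train_test_split(
    demo, left, test_size=0.2, random_state=42, stratify=left
)
pipe.fit(Xtr, ytr)              # the scaler and encoder see only Xtr
test_accuracy = pipe.score(Xte, yte)
print('Test accuracy:', test_accuracy)
```

With this layout, calling `pipe.predict` on new rows applies the same scaling and encoding automatically.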

## 7. Build logistic regression model

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

## 8. Evaluate model performance

In [None]:
# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Classification report
print(classification_report(y_test, y_pred))

In [None]:
# ROC-AUC
auc = roc_auc_score(y_test, y_proba)
print('ROC-AUC:', auc)

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
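
For reference, scikit-learn lays out a binary confusion matrix with true classes in rows and predicted classes in columns, so `ravel()` yields TN, FP, FN, TP in that order. A small worked example with made-up counts (`cm_demo` is hypothetical, not this model's result) shows how precision and recall for the "left" class follow from it:

```python
import numpy as np

# Hypothetical counts: rows are the true class, columns the predicted class.
cm_demo = np.array([[90, 10],   # stayed: 90 predicted stay, 10 predicted leave
                    [15, 35]])  # left:   15 predicted stay, 35 predicted leave

tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # of those flagged as leaving, share who left
recall = tp / (tp + fn)     # of those who left, share the model flagged

print(f'precision={precision:.3f}, recall={recall:.3f}')
```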

In [None]:
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'LogReg (AUC = {auc:.2f})')
plt.plot([0,1], [0,1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

## 9. Interpretation and next steps

- Lower satisfaction and higher monthly hours are strongly associated with employees leaving.
- Employees who were not promoted in the last five years and those in lower salary bands are more likely to quit.
- HR can use this model to flag higher-risk employees and focus on improving satisfaction, workload, pay, and promotion opportunities.
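
One way to check statements like the ones above against a fitted model is to inspect the logistic regression coefficients: with standardized features, a negative coefficient lowers the odds of leaving and a positive one raises them, and `exp(coefficient)` is the odds ratio per one standard deviation. The sketch below uses synthetic data built to mimic the first bullet; `demo`, `clf`, and `coef_table` are illustrative names, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): low satisfaction and long hours raise attrition.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    'satisfaction_level': rng.uniform(0, 1, n),
    'average_monthly_hours': rng.normal(200, 50, n),
})
logit = (-3.0 * (demo['satisfaction_level'] - 0.5)
         + 0.01 * (demo['average_monthly_hours'] - 200))
prob_leave = 1 / (1 + np.exp(-logit.to_numpy()))
left = rng.binomial(1, prob_leave)

# Standardize so coefficients are comparable across features.
X_demo = StandardScaler().fit_transform(demo)
clf = LogisticRegression(max_iter=1000).fit(X_demo, left)

coef_table = pd.DataFrame({
    'feature': demo.columns,
    'coefficient': clf.coef_[0],
    'odds_ratio': np.exp(clf.coef_[0]),  # odds multiplier per 1 std dev
}).sort_values('coefficient')
print(coef_table)
```

In this notebook, the equivalent table could be built from `model.coef_[0]` and `X_train.columns`.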

Next steps could include trying other models (like random forests), tuning hyperparameters, and testing the model on new data.
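
As a rough sketch of what the first of those next steps might look like, the snippet below compares logistic regression with a random forest using 5-fold cross-validated ROC-AUC. It runs on data from `make_classification`, not the HR dataset, and the settings (`n_estimators=100`, `cv=5`) are illustrative defaults, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced binary problem standing in for the attrition data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42
)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean ROC-AUC across 5 folds for each candidate model.
cv_auc = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    for name, est in models.items()
}
for name, score in cv_auc.items():
    print(f'{name}: mean ROC-AUC = {score:.3f}')
```

Cross-validation gives a steadier comparison than a single train-test split, which matters when the two models score close to each other.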