# IBM HR Analysis

## Observations
- Dataset has 34 features plus 1 UID (EmployeeNumber)
- Seven ordinal categorical features are already integer-encoded:
  - Education (1 = 'Below College', 2 = 'College', 3 = 'Bachelor', 4 = 'Master', 5 = 'Doctor')
  - EnvironmentSatisfaction (1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High')
  - JobInvolvement (1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High')
  - JobSatisfaction (1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High')
  - PerformanceRating (1 = 'Low', 2 = 'Good', 3 = 'Excellent', 4 = 'Outstanding')
  - RelationshipSatisfaction (1 = 'Low', 2 = 'Medium', 3 = 'High', 4 = 'Very High')
  - WorkLifeBalance (1 = 'Bad', 2 = 'Good', 3 = 'Better', 4 = 'Best')
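
Because these codes are ordinal, it can help to attach the readable labels as ordered categoricals before plotting or tabulating. The following is a minimal sketch, not part of the original notebook: the helper name `add_label_columns`, the `ORDINAL_LABELS` dictionary, and the `<col>Label` column naming are illustrative assumptions, and only three of the seven mappings are spelled out.

```python
import pandas as pd

# Label maps copied from the list above; only three of the seven are shown here.
ORDINAL_LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "JobSatisfaction": {1: "Low", 2: "Medium", 3: "High", 4: "Very High"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def add_label_columns(df, label_maps=ORDINAL_LABELS):
    """Return a copy of df with an ordered categorical '<col>Label' column per mapped feature."""
    out = df.copy()
    for col, mapping in label_maps.items():
        if col not in out.columns:
            continue  # skip features that are absent from this frame
        labels = list(mapping.values())  # dict order follows ascending code order
        out[f"{col}Label"] = pd.Categorical(
            out[col].map(mapping), categories=labels, ordered=True
        )
    return out


# Toy rows, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame({"Education": [1, 3, 5], "WorkLifeBalance": [4, 2, 1]})
labelled = add_label_columns(toy)
print(labelled)
```

Keeping the original integer columns alongside the label columns means numeric summaries still work, while plots can use the labels in their natural order.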

## Objectives & Possible Questions

1. Which factors (e.g., Age, MonthlyIncome, JobRole, WorkLifeBalance, OverTime) are most associated with attrition?
2. Can we build a model to predict which employees are likely to leave within the next 6–12 months?
3. Do performance ratings correlate with attrition, and does that relationship vary by job role or tenure?
4. How does compensation (salary, stock options, percent salary hike) relate to voluntary attrition?
5. Are there observable churn patterns by department, job level, or job role?
6. Does overtime or business travel frequency increase attrition risk?
7. How does training time or number of training courses completed affect retention and performance?
8. Are there seasonal or hiring-cohort effects in attrition (e.g., hires in certain years more likely to leave)?
9. What employee segments (clusters) exist based on demographics, performance, and engagement metrics?
10. Can we identify early-warning indicators (changes in satisfaction, environment, or workload) before an employee leaves?
11. Does manager effectiveness (e.g., ManagerID patterns or team-level attrition rates) predict individual attrition?
12. How do promotion history and time-in-role affect both performance and attrition risk?
13. Are there pay inequities across gender/ethnicity that correlate with attrition or performance?
14. What is the survival curve (time-to-attrition) for different employee groups?
15. Can dimensionality reduction (PCA/UMAP) reveal structure or outliers in employee profiles?

## Import Python packages


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport

## Import data

In [17]:
ibm_df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

ibm_df.describe()

| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |
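
By default `describe()` summarises only numeric columns, which is why text columns such as `Attrition` do not appear in the output above. The sketch below shows one way to summarise the remaining columns as well; it runs on a small invented frame rather than on `ibm_df`, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame, invented for illustration; the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
})

# Select everything describe() skipped by default, then summarise it.
non_numeric = toy.select_dtypes(exclude="number")
summary = non_numeric.describe()  # rows: count, unique, top, freq
print(summary)

# Share of each class in the target column.
attrition_share = toy["Attrition"].value_counts(normalize=True)
print(attrition_share)
```

On the real data, the same two calls applied to `ibm_df` would show how balanced the `Attrition` classes are before any modelling starts.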


## First Look

I use ydata-profiling to take a quick first look at the dataset.

In [None]:
profile = ProfileReport(ibm_df, title="HR Attrition Profile", explorative=True)

# Save the profile report as an HTML file
profile.to_file("reports/hr_attrition.html", silent=False)



### Insights surfaced with ydata
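1. Two columns are constant and carry no information: "Over18" has the same value for every row (all employees are over 18), and "EmployeeCount" is always 1. The `describe()` output shows the same for "StandardHours" (always 80, with a standard deviation of 0).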
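
Constant columns like these can be detected and dropped programmatically. This is a minimal sketch on an invented frame; the helper name `constant_columns` is an assumption, not something from the notebook or from ydata-profiling.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold exactly one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


# Toy frame, invented for illustration; the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

to_drop = constant_columns(toy)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(trimmed.columns.tolist())
```

Applied to the real data, `constant_columns(ibm_df)` would be expected to return the columns named in the insight above.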

### Start Answering Questions

#### 1. Which factors (e.g., Age, MonthlyIncome, JobRole, WorkLifeBalance, OverTime) are most associated with attrition?
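
One possible starting point, sketched on invented rows rather than on `ibm_df`: compare the attrition rate across the levels of a categorical feature, and correlate numeric features with a 0/1 attrition flag (a Pearson correlation against a binary variable, also called point-biserial). The variable names are illustrative assumptions, and these measures show association only, not causation.

```python
import pandas as pd

# Toy rows, invented for illustration; the column names follow the dataset.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
    "Age": [25, 45, 28, 50, 38, 30],
    "MonthlyIncome": [2000, 9000, 2500, 12000, 6000, 3000],
})

# Binary target: 1 = left, 0 = stayed.
left = toy["Attrition"].eq("Yes").astype(int)

# Categorical feature: attrition rate within each level.
rate_by_overtime = left.groupby(toy["OverTime"]).mean()
print(rate_by_overtime)

# Numeric features: correlation with the 0/1 flag, strongest first.
numeric = toy.select_dtypes(include="number")
corr_with_attrition = numeric.corrwith(left).sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(corr_with_attrition)
```

The same pattern extends to other categorical features (for example `JobRole`) by swapping the grouping column.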