# IBM HR Analysis

## Objectives & Possible Questions

- Which factors (e.g., Age, MonthlyIncome, JobRole, WorkLifeBalance, OverTime) are most associated with attrition?
- Can we build a model to predict which employees are likely to leave within the next 6–12 months?
- Do performance ratings correlate with attrition, and does that relationship vary by job role or tenure?
- How does compensation (salary, stock options, percent salary hike) relate to voluntary attrition?
- Are there observable churn patterns by department, job level, or job role?
- Does overtime or business travel frequency increase attrition risk?
- How does training time or number of training courses completed affect retention and performance?
- Are there seasonal or hiring-cohort effects in attrition (e.g., hires in certain years more likely to leave)?
- What employee segments (clusters) exist based on demographics, performance, and engagement metrics?
- Can we identify early-warning indicators (changes in satisfaction, environment, or workload) before an employee leaves?
- Does manager effectiveness (e.g., ManagerID patterns or team-level attrition rates) predict individual attrition?
- How do promotion history and time-in-role affect both performance and attrition risk?
- Are there pay inequities across gender/ethnicity that correlate with attrition or performance?
- What is the survival curve (time-to-attrition) for different employee groups?
- Can dimensionality reduction (PCA/UMAP) reveal structure or outliers in employee profiles?
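
Several of the questions above (overtime, department, job role) come down to comparing the attrition rate across the levels of one categorical column. The following is a minimal sketch of that comparison, assuming `Attrition` holds the strings "Yes" and "No". The helper name `attrition_rate_by` and the toy rows are hypothetical and only there to make the example self-contained; they are not values from the real file.

```python
import pandas as pd


def attrition_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows with Attrition == 'Yes' within each level of `column`."""
    # Assumption: the target column is named "Attrition" and uses "Yes"/"No".
    left = df["Attrition"].eq("Yes")
    # The mean of a boolean Series is the fraction of True values per group.
    return left.groupby(df[column]).mean().sort_values(ascending=False)


# Toy rows for illustration only (not taken from the dataset).
toy_df = pd.DataFrame(
    {
        "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
        "OverTime": ["Yes", "Yes", "No", "No", "No", "No"],
    }
)

overtime_rates = attrition_rate_by(toy_df, "OverTime")
print(overtime_rates)
```

On the real data the same call would be `attrition_rate_by(ibm_df, "OverTime")`. A difference in rates is descriptive only and says nothing about causation.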

## Import Python packages


In [11]:
import pandas as pd
from ydata_profiling import ProfileReport

## Import data

In [12]:
ibm_df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
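
Before profiling, a few cheap checks can confirm that the file loaded as expected. This is a sketch rather than part of the original notebook: the helper `quick_checks` is a hypothetical name, the target column is assumed to be `Attrition`, and the toy frame below stands in for `ibm_df` so the example runs on its own.

```python
import pandas as pd


def quick_checks(df: pd.DataFrame, target: str = "Attrition") -> dict:
    """Collect basic load-time facts about a DataFrame."""
    return {
        "shape": df.shape,
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns with a single value carry no information for modelling.
        "constant_columns": [
            c for c in df.columns if df[c].nunique(dropna=False) <= 1
        ],
        # Class balance of the target, as fractions that sum to 1.
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for ibm_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "Attrition": ["Yes", "No", "No", "No"],
        "Age": [41, 49, 37, 33],
        "StandardHours": [80, 80, 80, 80],
    }
)

checks = quick_checks(toy_df)
print(checks)
```

Calling `quick_checks(ibm_df)` would give the same summary for the loaded file. The class balance matters for the prediction question, since a skewed target makes plain accuracy a misleading metric.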

## First Look

I am using ydata-profiling to take a quick look at the dataset.

In [16]:
profile = ProfileReport(ibm_df, title="HR Attrition Profile", explorative=True)

profile.to_file("data/hr_attrition.html")

Summarize dataset: 100%|██████████| 269/269 [00:13<00:00, 19.46it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.07s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00,  3.12s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 97.82it/s]
