<div style="text-align: center; color: black;">
    <h1>📊 Synthetic HR Analytics Dataset Generator</h1>
</div>

#### **This notebook creates a realistic employee lifecycle dataset for HR analytics and dashboard development. It simulates departments, job roles, employee demographics, salary progression, performance reviews, promotions, attrition events, and hiring sources — all generated using Python and the Faker library. Ideal for powering tools like Streamlit, Power BI, and Tableau.**

### 🔧 Step 1: Import Required Libraries
We begin by loading essential Python libraries used for data manipulation (`pandas`, `numpy`), synthetic data generation (`Faker`, `random`), date computations, and file management.

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# faker for generating realistic synthetic values (names, dates, etc.)
from faker import Faker
# random for sampling; datetime and timedelta for date arithmetic
import random
from datetime import datetime, timedelta
# os for creating the output directory
import os

### 🎲 Step 2: Initialize Faker and Seed Generators
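Faker will be used to create realistic names and dates. Seeding Python's `random`, NumPy, and Faker (which keeps its own random generator) makes the random draws repeatable from run to run. Dates are generated relative to the day the notebook runs, so the output is identical only when it is rerun on the same date.

In [2]:
# Ensures reproducibility of synthetic data across runs
fake = Faker()
# Faker does not use the global random module, so it needs its own seed
Faker.seed(42)
np.random.seed(42)
random.seed(42)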
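
Seeds alone do not pin down dates: later steps use relative dates such as `"-10y"` and compare against `datetime.now()`, both of which depend on the day the notebook runs. A minimal sketch of one way to make date draws fully repeatable, assuming a pinned reference date; `REFERENCE_DATE` and `draw_hire_date` are hypothetical names, not part of the notebook:

```python
import random
from datetime import date, timedelta

# Assumption: a fixed "as of" date chosen purely for illustration.
REFERENCE_DATE = date(2025, 1, 1)
EARLIEST = REFERENCE_DATE - timedelta(days=3650)  # roughly 10 years back
LATEST = REFERENCE_DATE - timedelta(days=30)

def draw_hire_date(rng):
    """Draw a hire date between EARLIEST and LATEST using the given generator."""
    span_days = (LATEST - EARLIEST).days
    return EARLIEST + timedelta(days=rng.randint(0, span_days))

# Two generators with the same seed yield the same dates, whatever today's date is.
rng_a = random.Random(42)
rng_b = random.Random(42)
first_run = [draw_hire_date(rng_a) for _ in range(3)]
second_run = [draw_hire_date(rng_b) for _ in range(3)]
print(first_run == second_run)  # True
```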

### 🗂️ Step 3: Set Up Output Directory
All generated datasets will be stored in a dedicated folder named `hr_synthetic_data`. This keeps the workspace clean and organized.

In [3]:
# Create folder to store generated HR datasets
output_dir = "hr_synthetic_data"
os.makedirs(output_dir, exist_ok=True)

### 🏢 Step 4: Create Department Reference Table
We define a fixed list of departments and assign unique IDs to each. This acts as a reference for linking job roles and employees to departments.

In [4]:
# Define list of departments used across the organization
departments = ["Human Resources", "Engineering", "Sales", "Marketing", "Finance", "Legal", "IT Support", "Operations"]
# Assign unique IDs to each department and build DataFrame
department_df = pd.DataFrame({
    "department_id": range(1, len(departments) + 1),
    "department_name": departments
})
# 💾 Save department table as CSV for reference during employee data generation
department_df.to_csv(f"{output_dir}/departments.csv", index=False)

### 👔 Step 5: Create Job Roles Table
This step defines a detailed list of job roles across all departments. Each role is mapped to its corresponding department using a `department_id`. This lookup table will serve as the backbone for assigning job roles to synthetic employees.

In [5]:
# Each tuple contains (Job Role Name, Associated Department ID)
job_roles = [
    # HR Department (ID: 1)
    ("HR Manager", 1), ("HR Coordinator", 1), ("Talent Acquisition Specialist", 1), 
    ("Learning & Development Manager", 1), ("HR Analyst", 1),
    # Engineering Department (ID: 2)
    ("Software Engineer/Developer", 2), ("Web Developer", 2), ("Data Scientist", 2),
    ("DevOps Engineer", 2), ("Data Engineer", 2), ("QA/Test Engineer", 2), ("Machine Learning Engineer", 2),
    # Sales Department (ID: 3)
    ("Sales Executive/Representative", 3), ("Account Manager", 3), ("Business Development Manager", 3),
    ("Sales Operations Analyst", 3), ("Regional Sales Manager", 3), ("Inside Sales Coordinator", 3),
    # Marketing Department (ID: 4)
    ("Marketing Manager", 4), ("Content Strategist/Copywriter", 4), ("Digital Marketing Specialist", 4),
    ("SEO/SEM Analyst", 4), ("Social Media Manager", 4), ("Brand Manager", 4), ("Marketing Analyst", 4),
    # Finance Department (ID: 5)
    ("Accountant", 5), ("Financial Analyst", 5), ("Controller", 5), ("Auditor", 5),
    ("Budget Analyst", 5), ("Payroll Manager", 5), ("Tax Specialist", 5),
    # Legal Department (ID: 6)
    ("Legal Counsel / Corporate Lawyer", 6), ("Paralegal", 6), ("Compliance Officer", 6),
    ("Contract Manager", 6), ("Intellectual Property Specialist", 6), 
    ("Legal Operations Manager", 6), ("Risk Analyst", 6),
    # IT Support Department (ID: 7)
    ("IT Support Specialist / Technician", 7), ("Help Desk Analyst", 7), 
    ("System Administrator", 7), ("Network Administrator", 7),
    ("IT Security Analyst", 7), ("Desktop Support Engineer", 7),
    # Operations Department (ID: 8)
    ("Operations Manager", 8), ("Operations/Logistics Coordinator", 8), 
    ("Supply Chain Analyst", 8), ("Warehouse Manager", 8),
    ("Business Operations Analyst", 8), ("Inventory Manager", 8)
]
# 🔄 Convert job role list to a DataFrame with role IDs and associated departments
job_role_df = pd.DataFrame({
    "job_role_id": range(1, len(job_roles) + 1),
    "job_role_name": [jr[0] for jr in job_roles],
    "department_id": [jr[1] for jr in job_roles]
})
# 💾 Save job roles with department linkage
job_role_df.to_csv(f"{output_dir}/job_roles.csv", index=False)
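
Since the job roles table refers to the departments table through `department_id`, a quick integrity check can catch a mistyped ID before employees are generated. The sketch below rebuilds two tiny stand-in tables (not the full lists above) and uses `merge` with `validate="many_to_one"` and `indicator=True`; the variable names are illustrative.

```python
import pandas as pd

# Tiny stand-ins for the department and job role tables.
departments = pd.DataFrame({
    "department_id": [1, 2],
    "department_name": ["Human Resources", "Engineering"],
})
job_roles = pd.DataFrame({
    "job_role_id": [1, 2, 3],
    "job_role_name": ["HR Manager", "Software Engineer/Developer", "Data Scientist"],
    "department_id": [1, 2, 2],
})

# validate="many_to_one" raises an error if department_id is duplicated in departments.
# indicator=True adds a "_merge" column that flags roles with no matching department.
joined = job_roles.merge(
    departments, on="department_id", how="left",
    validate="many_to_one", indicator=True,
)
orphan_roles = joined[joined["_merge"] == "left_only"]
roles_per_department = joined.groupby("department_name")["job_role_id"].count()

print(len(orphan_roles))
print(roles_per_department.to_dict())
```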

### 👥 Step 6: Generate Synthetic Employee Profiles
Using the `Faker` library and controlled randomization, we simulate 5,000 employees. Each record includes:
- Name, age (21 to 60), gender, marital status, and ethnicity
- Country
- Job role and the department that role belongs to
- Hire date (between 10 years and 30 days before the run date)
- Salary, drawn from a normal distribution with mean 70,000 and standard deviation 20,000

Salaries are clipped to the range 30,000 to 150,000 to keep them realistic, and the result is saved as `employees.csv`.

In [6]:
# Number of employee records to generate
num_of_employees = 5000
# Define sample values for categorical diversity
ethnicity_choices = ["White", "Black", "Asian", "Hispanic", "Native American", "Mixed", "Other"]
country_choices = ["United States", "United Kingdom", "India", "Canada", "Germany", "Australia", "Nigeria", "Brazil", "Japan", "Spain", "France", "Italy", "South Korea", "New Zealand", "Singapore"]
marital_status_choices = ["Single", "Married", "Divorced", "Separated", "Widowed", "Other", "Prefer not to say"]
employees = []
# Build the list of job role records once instead of on every loop iteration
job_role_records = job_role_df.to_dict("records")
for i in range(1, num_of_employees + 1):
    # 🎲 Randomize basic demographics
    gender = random.choice(["Male", "Female"])
    first_name = fake.first_name_male() if gender == "Male" else fake.first_name_female()
    last_name = fake.last_name()
    full_name = f"{first_name} {last_name}"
    # 👔 Randomly assign job role (linked to department)
    job_role = random.choice(job_role_records)
    # 📆 Generate realistic hire date and age
    age = random.randint(21, 60)
    hire_date = fake.date_between(start_date="-10y", end_date="-30d")   
    # 🧾 Build employee record dictionary
    employee = {
        "employee_id": i,
        "name": full_name,
        "age": age,
        "gender": gender,
        "marital_status": random.choice(marital_status_choices),
        "ethnicity": random.choice(ethnicity_choices),
        "country": random.choice(country_choices),
        "job_role_id": job_role["job_role_id"],
        "department_id": job_role["department_id"],
        "hire_date": hire_date,
        "salary": round(np.random.normal(loc=70000, scale=20000), 2)
    }  
    employees.append(employee)
# 📊 Convert to DataFrame and clip unrealistic salaries
employee_df = pd.DataFrame(employees)
employee_df["salary"] = employee_df["salary"].clip(lower=30000, upper=150000)
# 💾 Save employee dataset as CSV
employee_df.to_csv(f"{output_dir}/employees.csv", index=False)
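
Because `age` and `hire_date` are drawn independently in the loop above, a 21-year-old can end up with a hire date ten years in the past. A minimal sketch of one way to keep the two consistent, assuming a minimum hiring age of 21 (an assumption, not something the notebook states) and a hypothetical helper `draw_age_given_tenure`:

```python
import math
import random
from datetime import date

MIN_HIRING_AGE = 21   # assumption for illustration
MAX_AGE = 60          # same upper bound as the notebook's age range

def draw_age_given_tenure(rng, hire_date, as_of):
    """Draw a current age such that the employee was at least MIN_HIRING_AGE when hired."""
    # Years since hire, rounded up so the lower bound errs on the safe side.
    tenure_years = math.ceil((as_of - hire_date).days / 365)
    youngest_possible = min(MIN_HIRING_AGE + tenure_years, MAX_AGE)
    return rng.randint(youngest_possible, MAX_AGE)

rng = random.Random(42)
as_of = date(2025, 1, 1)
hire_date = date(2017, 1, 1)   # eight calendar years of tenure
ages = [draw_age_given_tenure(rng, hire_date, as_of) for _ in range(1000)]
print(min(ages) - 8 >= MIN_HIRING_AGE)  # True
```

Drawing the hire date first and the age second is the design choice here; the reverse order would work equally well with the bound applied to the hire date.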

### 💰 Step 7: Create Salary Progression Records
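To reflect career growth, each employee gets one to three salary records: the starting salary on the hire date, followed by adjustments at 365-day intervals. Raises are modeled as simple (non-compounding) growth of 3% of the starting salary per year. Each entry includes the employee ID, effective date, and adjusted salary.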

In [7]:
salary_history = []
for _, row in employee_df.iterrows():
    # Random number of salary records per employee (simulate raises)
    num_records = random.randint(1, 3)
    base_salary = row["salary"]
    hire_date = pd.to_datetime(row["hire_date"])
    for i in range(num_records):
        # Create an effective date spaced by 365-day intervals from the hire date
        effective_date = hire_date + timedelta(days=365 * i)
        # Skip salary changes that would take effect in the future
        if effective_date > datetime.now():
            break
        # Assume a 3% raise per year (simple linear growth on the base salary)
        salary = round(base_salary * (1 + 0.03 * i), 2)
        salary_history.append({
            "employee_id": row["employee_id"],
            "effective_date": effective_date.strftime("%Y-%m-%d"),
            "salary_amount": salary
        })
# 📄 Convert salary records to DataFrame and save
salary_df = pd.DataFrame(salary_history)
salary_df.to_csv(f"{output_dir}/salaries.csv", index=False)
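
The formula `base_salary * (1 + 0.03 * i)` applies simple growth: every raise is 3% of the starting salary. For comparison, the sketch below sets it next to a compounding alternative, `base_salary * 1.03 ** i`, which the notebook does not use; the function names are illustrative.

```python
def linear_salary(base_salary, years, rate=0.03):
    """Simple growth, as in the notebook: each raise is a fixed share of the base salary."""
    return round(base_salary * (1 + rate * years), 2)

def compound_salary(base_salary, years, rate=0.03):
    """Compounding alternative (not used in the notebook): each raise builds on the last."""
    return round(base_salary * (1 + rate) ** years, 2)

base = 70000.0
for years in range(3):
    print(years, linear_salary(base, years), compound_salary(base, years))
```

With a base of 70,000 the two agree for the first raise (72,100) and differ by 63 after two years (74,200 versus 74,263), so the choice matters little over the three-record horizon used here.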

### 📈 Step 8: Create Employee Performance Reviews
We simulate up to four performance reviews per employee, spaced at 365-day intervals after the hire date. Each review includes:
- Review date
- A score from 1 to 5
- Eligibility for bonuses

Reviews that would fall after the run date are excluded to maintain realism, so employees hired within the past year have no reviews.

In [8]:
performance_reviews = []
for _, row in employee_df.iterrows():
    num_reviews = random.randint(1, 4)
    hire_date = pd.to_datetime(row['hire_date'])
    for i in range(num_reviews):
        # Reviews occur annually post-hire
        review_date = hire_date + timedelta(days=365 * (i + 1))
        # Skip reviews set in the future
        if review_date > datetime.now():
            break
        performance_reviews.append({
            "employee_id": row['employee_id'],
            "review_date": review_date.strftime('%Y-%m-%d'),
            "performance_score": random.randint(1, 5),
            "bonus_eligible": random.choice([True, False])
        })
# 📄 Save performance review history
performance_reviews_df = pd.DataFrame(performance_reviews)
performance_reviews_df.to_csv(f"{output_dir}/performance_reviews.csv", index=False)
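
The review loop can be factored into a small function with an explicit cutoff, which makes its behavior easy to test. This is a sketch, with a hypothetical `review_dates` helper and an `as_of` parameter standing in for `datetime.now()`:

```python
from datetime import date, timedelta

def review_dates(hire_date, as_of, max_reviews):
    """Return review dates at 365-day intervals after hire, none later than as_of."""
    dates = []
    for i in range(max_reviews):
        review_date = hire_date + timedelta(days=365 * (i + 1))
        if review_date > as_of:
            break  # any later review would be even further in the future
        dates.append(review_date)
    return dates

schedule = review_dates(date(2023, 1, 1), as_of=date(2025, 6, 1), max_reviews=4)
print([d.isoformat() for d in schedule])  # ['2024-01-01', '2024-12-31']
```

Because the spacing is 365 days rather than a calendar year, the second review lands on 2024-12-31 (2024 is a leap year), one day short of the anniversary.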

### 🚪 Step 9: Generate Employee Attrition Events
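Each employee has a 25% chance of being selected for attrition. Candidates whose exit date would fall in the future are dropped, so the share of employees that ends up in `attrition.csv` is lower than 25%. Each attrition record includes:
- Exit date
- Reason (Resigned, Fired, Retired, or Layoff; Retired is only available to employees aged 55 or older, and Fired only to younger employees)
- Whether the exit was voluntary (true for Resigned and Retired)
- Satisfaction rating at time of departure (1 to 5)

Exit dates fall between one and eight years after the hire date and no later than the run date.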

In [9]:
attrition_records = []
for _, row in employee_df.iterrows():
    # Approx 25% employees experience attrition
    if random.random() < 0.25:
        hire_date = pd.to_datetime(row['hire_date'])
        exit_date = hire_date + timedelta(days=random.randint(365, 365 * 8))
        # Skip future exits
        if exit_date > datetime.now():
            continue
        # Determine reason based on age profile (retirement only for employees aged 55+)
        age = row['age']
        if age >= 55:
            reason = random.choice(["Resigned", "Retired", "Layoff"])
        else:
            reason = random.choice(["Resigned", "Fired", "Layoff"])
        attrition_records.append({
            "employee_id": row['employee_id'],
            "exit_date": exit_date.strftime('%Y-%m-%d'),
            "reason": reason,
            "voluntary_exit": reason in ["Resigned", "Retired"],
            "satisfaction_rating": random.randint(1, 5)
        })
# 📄 Save synthetic attrition data
attrition_df = pd.DataFrame(attrition_records)
attrition_df.to_csv(f"{output_dir}/attrition.csv", index=False)
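
Dropping future exit dates lowers the realized attrition rate below the 25% selection probability. The simulation below estimates the effect under simplifying assumptions: tenure is uniform between 30 and 3,650 days (an approximation of the `"-10y"` to `"-30d"` hire window), and the exit window is the same 365 to 2,920 days used above. `simulate_attrition_rate` is a hypothetical helper.

```python
import random

def simulate_attrition_rate(n_employees, seed=42):
    """Estimate the share of employees that end up with an attrition record."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(n_employees):
        tenure_days = rng.randint(30, 3650)          # days since hire, as of the run date
        if rng.random() < 0.25:                      # selected for attrition
            exit_offset = rng.randint(365, 365 * 8)  # days from hire to exit
            if exit_offset <= tenure_days:           # exit is not in the future
                kept += 1
    return kept / n_employees

rate = simulate_attrition_rate(100_000)
print(f"realized attrition rate: {rate:.3f}")
```

Under these assumptions the realized rate comes out at roughly 14% rather than 25%, because only exits that fit inside an employee's tenure survive the future-date filter.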

### 🎓 Step 10: Record Employee Promotions
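Each employee has a 15% chance of being selected for a promotion during their tenure. Candidates whose promotion date would fall in the future are dropped, so the final share is lower than 15%. Each record includes:
- Promotion date
- New job role assignment

Promotion dates fall between 365 and 2,000 days (roughly 1 to 5.5 years) after the hire date and never in the future. The new role is drawn at random from all job roles, so it can sit in a different department or match the employee's current role.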

In [10]:
promotion_records = []
for _, row in employee_df.iterrows():
    # Approx. 15% of employees get promoted
    if random.random() < 0.15:
        # Promotion occurs 1–5.5 years after hire
        promotion_date = pd.to_datetime(row['hire_date']) + timedelta(days=random.randint(365, 2000))
        # Exclude promotions that occur in the future
        if promotion_date > datetime.now():
            continue
        # Randomly assign a new job role ID from all roles (a simple stand-in for upward movement)
        new_role = random.choice(job_role_df['job_role_id'].tolist())
        promotion_records.append({
            'employee_id': row['employee_id'],
            'promotion_date': promotion_date.strftime('%Y-%m-%d'),
            'new_job_role_id': new_role
        })
# 📄 Convert to DataFrame and save promotion history
promotion_df = pd.DataFrame(promotion_records)
promotion_df.to_csv(f"{output_dir}/promotions.csv", index=False)
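
Since the new role is drawn from every job role, a promotion can land on the employee's current role or in another department. One possible refinement, sketched with a small stand-in role list and a hypothetical `pick_promotion_role` helper, restricts the draw to other roles in the same department:

```python
import random

# Stand-in (job_role_id, department_id) pairs; the notebook's table is larger.
ROLES = [(1, 1), (2, 1), (3, 1), (6, 2), (7, 2), (8, 2)]

def pick_promotion_role(rng, current_role_id, department_id, roles=ROLES):
    """Pick a different role within the same department, or None if there is no alternative."""
    candidates = [
        role_id for role_id, dept_id in roles
        if dept_id == department_id and role_id != current_role_id
    ]
    return rng.choice(candidates) if candidates else None

rng = random.Random(42)
new_role = pick_promotion_role(rng, current_role_id=6, department_id=2)
print(new_role)
```

This still does not model seniority; doing so would need something like a level column, which the job role table does not have.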

### 🔍 Step 11: Assign Hiring Sources
Each employee is associated with a hiring source (e.g. Referral, Job Board, Career Fair). For referrals, a referrer name is generated with Faker; it is a random name, not a link to an existing employee record. <br>
This information supports analysis of recruitment effectiveness and referral trends.

In [11]:
sources = ['Referral', 'Job Board', 'Career Fair', 'Recruitment Agency', 'Internal Transfer', 'Social Media']
hiring_sources = []
for _, row in employee_df.iterrows():
    # Randomly assign one source per employee
    source = random.choice(sources)
    hiring_sources.append({
        'employee_id': row['employee_id'],
        'hiring_source': source,
        # Only populate 'referred_by' if source is 'Referral'
        'referred_by': fake.name() if source == 'Referral' else None
    })
# 📄 Save hiring source details for analytical insights
hiring_sources_df = pd.DataFrame(hiring_sources)
hiring_sources_df.to_csv(f"{output_dir}/hiring_sources.csv", index=False)
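
In the cell above, `referred_by` is a free-standing Faker name. If referral networks matter for the analysis, the referrer could instead point at another employee. A sketch under that assumption, with hypothetical names `assign_hiring_source` and `referred_by_id`:

```python
import random

SOURCES = ["Referral", "Job Board", "Career Fair",
           "Recruitment Agency", "Internal Transfer", "Social Media"]

def assign_hiring_source(rng, employee_id, all_employee_ids):
    """Return a hiring-source record; referrals point at a different employee's ID."""
    source = rng.choice(SOURCES)
    referred_by_id = None
    if source == "Referral":
        others = [eid for eid in all_employee_ids if eid != employee_id]
        referred_by_id = rng.choice(others) if others else None
    return {
        "employee_id": employee_id,
        "hiring_source": source,
        "referred_by_id": referred_by_id,
    }

rng = random.Random(42)
employee_ids = list(range(1, 11))
records = [assign_hiring_source(rng, eid, employee_ids) for eid in employee_ids]
print(records[0])
```

The sketch does not check that the referrer was hired before the person they referred; that would require comparing hire dates as well.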

### ✅ Final Summary

In this notebook, we programmatically generated a synthetic HR dataset that simulates the main employee lifecycle events. Step by step, we created:

- 🏢 Department and Job Role mappings
- 👥 Employee profiles with demographics and employment metadata
- 💰 Salary history with annual progression
- 📈 Performance review cycles with ratings and bonus eligibility
- 🚪 Attrition patterns including exit reasons and satisfaction scores
- 🎓 Promotion records reflecting internal mobility
- 🔍 Hiring source insights for recruitment analysis

This dataset supports use cases such as:
- HR analytics dashboards (Streamlit, Power BI, Tableau)
- Machine learning on attrition prediction or performance modeling
- Data visualization and KPI storytelling
- Prototyping enterprise-grade applications with realistic personnel data

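Each step writes its own CSV, and the tables are linked by `employee_id`, `job_role_id`, or `department_id`. They can therefore be joined downstream, and individual steps can be adapted (for example with a different department list or different salary parameters) as a foundation for more advanced analytics projects.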
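
As a starting point for the dashboard use cases, the tables can be joined on their ID columns. The sketch below uses small inline stand-ins for `employees.csv`, `departments.csv`, and `attrition.csv` (in practice they would be loaded with `pd.read_csv`) and computes an attrition rate per department; the sample values are made up for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department_id": [1, 1, 2, 2],
})
departments = pd.DataFrame({
    "department_id": [1, 2],
    "department_name": ["Human Resources", "Engineering"],
})
attrition = pd.DataFrame({
    "employee_id": [2],
    "exit_date": ["2023-05-01"],
    "reason": ["Resigned"],
})

# Left joins keep every employee; those without an attrition row get a missing exit_date.
merged = (
    employees
    .merge(departments, on="department_id", how="left")
    .merge(attrition, on="employee_id", how="left")
)
merged["has_left"] = merged["exit_date"].notna()

attrition_rate = merged.groupby("department_name")["has_left"].mean()
print(attrition_rate.to_dict())
```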

---