# HR Employee Lifecycle Data Generator

This notebook generates a complete synthetic HR dataset for Microsoft Fabric demos.

**Outputs:**
- 8 CSV files in `data/raw/hr/`
- ~200 text reports in `data/raw/reports_txt/`
- Data dictionary (Markdown file)
- Data quality report

**Duration:** ~2-3 minutes

**⚠️ All data is FICTIONAL and contains no real personal information**

## 1. Setup and Configuration

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import uuid
import json
from pathlib import Path
import yaml
import re

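# Set random seeds for reproducibility.
# Note: this seeds `random` and NumPy's global generator only;
# uuid.uuid4() draws from the OS entropy source and is not affected.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)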

# Load configuration
config_path = Path("../config.yaml")
with open(config_path, 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)

# Extract configuration parameters
NUM_EMPLOYEES = config['volumes']['employees']
NUM_DEPARTMENTS = config['volumes']['departments']
NUM_POSITIONS = config['volumes']['positions']
AVG_EVENTS_PER_EMP = config['volumes']['lifecycle_events_per_employee_avg']
HR_CASES_PCT = config['volumes']['hr_cases_percentage']
TEXT_REPORTS_PCT = config['volumes']['text_reports_percentage']

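# Assumes the dates are quoted 'YYYY-MM-DD' strings in config.yaml.
# Unquoted YAML dates are loaded by yaml.safe_load as datetime.date objects,
# which would make datetime.strptime raise a TypeError.
START_DATE = datetime.strptime(config['date_ranges']['start_date'], '%Y-%m-%d')
END_DATE = datetime.strptime(config['date_ranges']['end_date'], '%Y-%m-%d')
HISTORICAL_YEARS = config['date_ranges']['historical_hire_window_years']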

# Define output paths
BASE_PATH = Path("..")
DATA_RAW_PATH = BASE_PATH / "data" / "raw"
HR_PATH = DATA_RAW_PATH / "hr"
REPORTS_PATH = DATA_RAW_PATH / "reports_txt"

# Create folder structure
HR_PATH.mkdir(parents=True, exist_ok=True)
REPORTS_PATH.mkdir(parents=True, exist_ok=True)

print("✅ Configuration loaded:")
print(f"   - Employees: {NUM_EMPLOYEES}")
print(f"   - Date range: {START_DATE.date()} to {END_DATE.date()}")
print(f"   - Output path: {HR_PATH}")
print(f"   - Reports path: {REPORTS_PATH}")
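
The setup cell reads its keys from `config.yaml` directly, so a missing key or an unquoted date only shows up as a `KeyError` or `TypeError` partway through. A minimal sketch of an optional up-front check follows. The helpers `coerce_datetime` and `validate_config` and the values in `example_config` are hypothetical and not part of the notebook; only the key names come from the setup cell above.

```python
from datetime import date, datetime


def coerce_datetime(value):
    """Return a datetime from a 'YYYY-MM-DD' string, a date, or a datetime."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    if isinstance(value, str):
        return datetime.strptime(value, "%Y-%m-%d")
    raise TypeError(f"Unsupported date value: {value!r}")


# Keys the setup cell reads, grouped by top-level section of config.yaml.
REQUIRED_KEYS = {
    "volumes": [
        "employees",
        "departments",
        "positions",
        "lifecycle_events_per_employee_avg",
        "hr_cases_percentage",
        "text_reports_percentage",
    ],
    "date_ranges": ["start_date", "end_date", "historical_hire_window_years"],
}


def validate_config(cfg):
    """Fail early with a readable message if the config is incomplete."""
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED_KEYS.items()
        for key in keys
        if key not in (cfg.get(section) or {})
    ]
    if missing:
        raise KeyError(f"config.yaml is missing: {', '.join(missing)}")

    start = coerce_datetime(cfg["date_ranges"]["start_date"])
    end = coerce_datetime(cfg["date_ranges"]["end_date"])
    if start >= end:
        raise ValueError(
            f"start_date {start.date()} must precede end_date {end.date()}"
        )
    return start, end


# Hypothetical values for illustration only; the real ones live in config.yaml.
example_config = {
    "volumes": {
        "employees": 100,
        "departments": 5,
        "positions": 20,
        "lifecycle_events_per_employee_avg": 3,
        "hr_cases_percentage": 10,
        "text_reports_percentage": 20,
    },
    "date_ranges": {
        "start_date": "2022-01-01",       # quoted YAML date -> str
        "end_date": date(2024, 12, 31),   # unquoted YAML date -> datetime.date
        "historical_hire_window_years": 5,
    },
}

example_start, example_end = validate_config(example_config)
print(example_start.date(), example_end.date())
```

In the notebook, `validate_config(config)` could be called right after `yaml.safe_load`, and its return values used for `START_DATE` and `END_DATE`, so that both quoted and unquoted dates are accepted.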

## 2. Helper Functions - Name & Contact Generation