# Dataset Creation Notebook

### CSIS4495-050: Applied Research Project

#### End-to-End Data Engineering Solution for HR Analytics

Group:
- Bruno do Nascimento Beserra
- Jay Clark Bermudez
- Matheus Filipe Figueiredo

Instructor: Dr. Bambang Sarif

<hr>

### Description:

This project simulates the evolution of a mid-sized company with 5,000 employees over a period of seven years. To build the initial workforce, we took a Kaggle dataset of employee records and drew a random sample of 5,000 rows to serve as the company’s starting workforce.

To showcase our data pipeline solution, we designed a simulation environment that captures key workforce dynamics over time. Throughout the seven-year period, employees may be promoted, receive raises, change departments, or leave the company. In parallel, the company periodically hires new employees, drawn from the remaining records of the main dataset, to keep the workforce evolving.

We defined a set of key metrics to guide the algorithm that generates these time-based snapshots. The final output is a folder hierarchy following the pattern `data/year/month/file.csv`.

Each file holds the company’s employee data at a specific point in time and serves as input to our analytics process.
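
As an illustration of the folder pattern, the minimal sketch below builds the path of a single snapshot file. The helper name `snapshot_path` and the base folder `"data"` are placeholders chosen for this example; the file name follows the `snapshot_YYYY_MM.csv` convention used in the last cell of this notebook.

```python
import os
from datetime import datetime

def snapshot_path(base_path: str, snapshot_date: datetime) -> str:
    """Return '<base>/<YYYY>/<MM>/snapshot_<YYYY>_<MM>.csv' for one snapshot."""
    month_dir = os.path.join(
        base_path, str(snapshot_date.year), f"{snapshot_date.month:02d}"
    )
    return os.path.join(month_dir, f"snapshot_{snapshot_date.strftime('%Y_%m')}.csv")

# "data" is a placeholder base folder, not the project's real location
example = snapshot_path("data", datetime(2025, 10, 1))
print(example)
```

Zero-padding the month keeps the folders in chronological order when they are sorted by name.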

<hr>

### Step by Step:

- Import Libraries and Dataset
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Dataset Creation (First Snapshot)
- Dataset Creation (Historical Snapshots)



## Import Libraries and Dataset

In [0]:
# Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil.relativedelta import relativedelta
import os

In [0]:
# Read Full Dataset
df = pd.read_csv("/content/drive/MyDrive/Estudos/Douglas College - Pós/Applied Research/HR_Data_MNC_Data Science Lovers.csv")

## Exploratory Data Analysis (EDA)

In [0]:
# Check Dataset Shape
df.shape

In [0]:
# Confirm the data was imported successfully
df.head()

In [0]:
# Check column dtypes and non-null counts
df.info()

In [0]:
# Check basic metrics
df.describe()

In [0]:
# Check Null Values
df.isna().sum()

In [0]:
# Check Duplicates (number of rows that exactly repeat an earlier row)
df.duplicated().sum()
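
One way to count duplicates explicitly is `DataFrame.duplicated`, which flags every row that repeats an earlier one. The sketch below uses a small invented table (not the project data) to show the difference between fully duplicated rows and repeated identifiers.

```python
import pandas as pd

# Invented toy data for illustration only
toy = pd.DataFrame({
    "Employee_ID": ["E1", "E2", "E2", "E3"],
    "Department": ["IT", "HR", "HR", "Sales"],
})

# Rows that are identical to an earlier row in every column
full_row_dupes = int(toy.duplicated().sum())

# Rows whose Employee_ID already appeared, regardless of other columns
id_dupes = int(toy.duplicated(subset=["Employee_ID"]).sum())

print(full_row_dupes, id_dupes)
```

Checking the identifier column separately matters here because the simulation later relies on `Employee_ID` to match employees across snapshots.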

## Data Cleaning

In [0]:
# Drop column "Unnamed: 0", as it adds no value to the dataset
df = df.drop(columns=["Unnamed: 0"])

In [0]:
# Rescale salaries (INR divided by 8) to better reflect Canadian salary levels
df["Annual_Salary"] = df["Salary_INR"]/8
df = df.drop(columns=["Salary_INR"])

In [0]:
# Fix column dtypes: parse hire dates and convert object columns to string
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
df[df.select_dtypes('object').columns] = df.select_dtypes('object').astype('string')

In [0]:
# Confirm dtypes
df.info()

In [0]:
# Confirm dataset Changes
df.head()

## Dataset Creation (First Snapshot)

In [0]:
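# Shuffle the full dataset with a fixed seed and keep the first 5,000 rows
# (.copy() makes the sample an independent DataFrame, so new columns can be added safely)
sampled_df = df.sample(frac=1.0, random_state=4495).head(5000).copy()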

In [0]:
# Confirm sample size
sampled_df.shape

In [0]:
# Check Slice Stats
sampled_df.describe()

In [0]:
# Check Job Titles
print(sampled_df.groupby('Job_Title').size().reset_index(name='count'))

In [0]:
# Check percentiles of years of experience (for reference)
percentiles = sampled_df['Experience_Years'].quantile([0.25, 0.5, 0.75])
print(percentiles)

# Fixed thresholds (in years of experience) used to assign job levels
p25 = 3
p50 = 8
p75 = 11

# Create Job_Level column
def assign_job_level(row):
    # Executive titles
    executive_titles = ["CTO", "CFO", "HR Director", "Operations Director", "Sales Director"]

    if row['Job_Title'] in executive_titles:
        return 'Executive'
    elif row['Experience_Years'] <= p25:
        return 'Specialist'
    elif row['Experience_Years'] <= p50:
        return 'Analyst'
    elif row['Experience_Years'] <= p75:
        return 'Manager'
    else:
        return 'Principal'

sampled_df['Job_Level'] = sampled_df.apply(assign_job_level, axis=1)
# Label the full dataset too, so new hires drawn from df later carry a Job_Level
df['Job_Level'] = df.apply(assign_job_level, axis=1)
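
The row-wise `apply` above can also be written as vectorized conditions, which tends to be faster on large frames. The sketch below is one possible equivalent using `numpy.select`; the function name `assign_job_levels` and the toy rows are assumptions made for this example, while the thresholds and executive titles are the ones defined in the cell above.

```python
import numpy as np
import pandas as pd

EXECUTIVE_TITLES = ["CTO", "CFO", "HR Director", "Operations Director", "Sales Director"]

def assign_job_levels(frame, p25=3, p50=8, p75=11):
    """Vectorized job-level assignment; conditions are checked in order."""
    years = frame["Experience_Years"]
    conditions = [
        frame["Job_Title"].isin(EXECUTIVE_TITLES).to_numpy(),
        (years <= p25).to_numpy(),
        (years <= p50).to_numpy(),
        (years <= p75).to_numpy(),
    ]
    choices = ["Executive", "Specialist", "Analyst", "Manager"]
    # The first matching condition wins; everything else is "Principal"
    levels = np.select(conditions, choices, default="Principal")
    return pd.Series(levels, index=frame.index)

# Invented toy rows for illustration only
toy = pd.DataFrame({
    "Job_Title": ["CTO", "Data Analyst", "Data Analyst", "Accountant", "Accountant"],
    "Experience_Years": [20, 2, 8, 11, 15],
})
toy["Job_Level"] = assign_job_levels(toy)
print(toy)
```

Because `numpy.select` takes the first condition that is true, the order of the list mirrors the `if`/`elif` chain: executive titles are checked before any experience threshold.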

In [0]:
# Check whether the count per job level makes sense
print(sampled_df["Job_Level"].value_counts())
pd.crosstab(sampled_df['Status'], sampled_df['Job_Level'])

In [0]:
# Save first snapshot
sampled_df.to_csv("/content/drive/MyDrive/Estudos/Douglas College - Pós/Applied Research/Project_Dataset_Official/initial_dataset.csv", index=False)

## Dataset Creation (Historical Snapshots)

In [0]:
# Historical Metrics Setup
np.random.seed(4495)
start_date = datetime(2025, 10, 1)
months = 84
snapshot_dates = [start_date + relativedelta(months=i) for i in range(months)]
departments = ["IT", "Marketing", "HR", "Operations", "Finance", "Sales", "R&D"]
leave_reasons = ["Resigned", "Terminated", "Retired"]
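promotion_milestones = [36, 96, 132]  # tenure in months
monthly_hiring_rate = 0.02  # share of tracked employees hired per hiring round (every 4 months in the loop below)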
raise_percentages = {"Analyst": 0.01, "Specialist": 0.02, "Manager": 0.02, "Principal": 0.03, "Executive": 0.03}
raise_milestone_percentages = {"Analyst": 0.07, "Manager": 0.1, "Principal": 0.11, "Executive": 0.25}
used_ids = set(sampled_df["Employee_ID"])
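
To make the schedule implied by these settings concrete, the sketch below lists the snapshot range and counts how often the periodic events fire, using the modulo rules from the simulation loop further down (departures when `i % 10 == 0`, transfers and hiring when `i % 4 == 0`). It uses only the standard library; `add_months` is a hypothetical stand-in for `relativedelta` that only handles first-of-month dates.

```python
from datetime import date

def add_months(start: date, offset: int) -> date:
    """Shift a first-of-month date forward by `offset` whole months."""
    total = start.year * 12 + (start.month - 1) + offset
    return date(total // 12, total % 12 + 1, 1)

start, months = date(2025, 10, 1), 84
snapshots = [add_months(start, k) for k in range(months)]

# The loop counter i starts at 1, matching enumerate(..., start=1)
departure_rounds = [i for i in range(1, months + 1) if i % 10 == 0]
transfer_and_hiring_rounds = [i for i in range(1, months + 1) if i % 4 == 0]

print(snapshots[0], snapshots[-1])
print(len(departure_rounds), len(transfer_and_hiring_rounds))
```

Under these rules the 84 snapshots run from October 2025 to September 2032, with 8 departure rounds and 21 transfer and hiring rounds.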

In [0]:
# Helper columns that track simulation state per employee
sampled_df["promotion_count"] = 0
sampled_df["month"] = 0
sampled_df["last_raise_year"] = 0
sampled_df["previous_job_level"] = sampled_df["Job_Level"]

# GDrive path
base_path = "/content/drive/MyDrive/Estudos/Douglas College - Pós/Applied Research/Project_Dataset_Official/historical_data"

In [0]:
sampled_df.head()
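
The simulation loop in the next cell measures tenure in whole calendar months and grants the annual raise in the month where the completed-years count increases. The sketch below isolates that calculation on a few invented hire dates; `months_in_company` is a hypothetical helper name.

```python
import pandas as pd

def months_in_company(hire_dates: pd.Series, snapshot: pd.Timestamp) -> pd.Series:
    """Whole calendar months between hire date and snapshot (day of month ignored)."""
    return (
        (snapshot.year - hire_dates.dt.year) * 12
        + (snapshot.month - hire_dates.dt.month)
    )

# Invented hire dates for illustration only
hire_dates = pd.Series(pd.to_datetime(["2022-10-15", "2024-11-01", "2025-10-01"]))
snapshot = pd.Timestamp("2025-10-01")

tenure = months_in_company(hire_dates, snapshot)

# A work anniversary falls on this snapshot when the completed-years count just ticked up
anniversary = (tenure // 12) > ((tenure - 1) // 12)

print(tenure.tolist())
print(anniversary.tolist())
```

Note that a tenure of 0 also passes this test, because `(0 - 1) // 12` evaluates to `-1` under floor division. If a raise in the hiring month is not intended, adding a `tenure > 0` condition would exclude that case.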

In [0]:
for i, snapshot_date in enumerate(snapshot_dates, start=1):
    current_emps = sampled_df[sampled_df["Hire_Date"] <= snapshot_date].copy()

    current_emps["time_in_company"] = (
        (snapshot_date.year - current_emps["Hire_Date"].dt.year) * 12 +
        (snapshot_date.month - current_emps["Hire_Date"].dt.month)
    )

    # Calculate years in company
    current_year_in_company = current_emps["time_in_company"] // 12

    # Employees Departure (every 10 months)
    if i % 10 == 0:
        active_mask = current_emps["Status"] == "Active"
        leave_mask = (np.random.rand(len(current_emps)) < 0.10) & active_mask
        current_emps.loc[leave_mask, "Status"] = np.random.choice(leave_reasons, size=leave_mask.sum())
        for idx in current_emps[leave_mask].index:
            emp_id = current_emps.loc[idx, "Employee_ID"]
            sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "Status"] = current_emps.loc[idx, "Status"]

    # Department Transfers (every 4 months)
    if i % 4 == 0:
        move_mask = (np.random.rand(len(current_emps)) < 0.05) & (current_emps["Status"] == "Active")
        current_emps.loc[move_mask, "Department"] = np.random.choice(departments, size=move_mask.sum())
        for idx in current_emps[move_mask].index:
            emp_id = current_emps.loc[idx, "Employee_ID"]
            sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "Department"] = current_emps.loc[idx, "Department"]

    # Annual Raises (in the month of each employee's work anniversary)
    if i > 1:
        prev_year_in_company = (current_emps["time_in_company"] - 1) // 12
        new_year_mask = (current_year_in_company > prev_year_in_company) & (current_emps["Status"] == "Active")

        for job_level, raise_pct in raise_percentages.items():
            level_mask = new_year_mask & (current_emps["Job_Level"] == job_level)
            if level_mask.any():
                current_emps.loc[level_mask, "Annual_Salary"] *= (1 + raise_pct)
                for idx in current_emps[level_mask].index:
                    emp_id = current_emps.loc[idx, "Employee_ID"]
                    sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "Annual_Salary"] = current_emps.loc[idx, "Annual_Salary"]
                    sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "last_raise_year"] = current_year_in_company.loc[idx]

    # Milestone Promotions (at 36, 96, 132 months)
    for milestone in promotion_milestones:
        promo_mask = (current_emps["time_in_company"] == milestone) & (current_emps["Status"] == "Active")
        if promo_mask.any():
            current_emps.loc[promo_mask, "promotion_count"] += 1
            for idx in current_emps[promo_mask].index:
                current_level = current_emps.loc[idx, "Job_Level"]

                if current_level == "Specialist":
                    new_level = "Analyst"
                elif current_level == "Analyst":
                    new_level = "Manager"
                elif current_level == "Manager":
                    new_level = "Principal"
                elif current_level == "Principal":
                    new_level = "Executive"
                else:
                    new_level = current_level
                # Update Job Level
                if new_level != current_level:
                    current_emps.loc[idx, "Job_Level"] = new_level
                    # Apply Milestone Raise
                    if new_level in raise_milestone_percentages:
                        raise_pct = raise_milestone_percentages[new_level]
                        current_emps.loc[idx, "Annual_Salary"] *= (1 + raise_pct)

                    emp_id = current_emps.loc[idx, "Employee_ID"]
                    sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "Job_Level"] = new_level
                    sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "Annual_Salary"] = current_emps.loc[idx, "Annual_Salary"]
                    sampled_df.loc[sampled_df["Employee_ID"] == emp_id, "promotion_count"] = current_emps.loc[idx, "promotion_count"]

    # New Hires (every 4 months)
    if i % 4 == 0:
        available_pool = df[~df["Employee_ID"].isin(used_ids)]
        n_hires = int(len(sampled_df) * monthly_hiring_rate)

        if len(available_pool) >= n_hires:
            new_hires = available_pool.sample(n=n_hires, random_state=4495 + i)
            new_hires = new_hires.copy()
            new_hires["promotion_count"] = 0
            new_hires["month"] = i
            new_hires["Status"] = "Active"
            new_hires["Hire_Date"] = snapshot_date
            new_hires["previous_job_level"] = new_hires["Job_Level"]
            new_hires["last_raise_year"] = 0

            sampled_df = pd.concat([sampled_df, new_hires], ignore_index=True)
            used_ids.update(new_hires["Employee_ID"])

    # Add Snapshot Metadata
    current_emps["month"] = i
    current_emps["snapshot_date"] = snapshot_date.strftime("%Y-%m-%d")
    current_emps["ingestion_timestamp"] = snapshot_date.strftime("%Y-%m-%d")

    # Create Directory Structure
    year_dir = os.path.join(base_path, str(snapshot_date.year))
    month_dir = os.path.join(year_dir, f"{snapshot_date.month:02d}")
    os.makedirs(month_dir, exist_ok=True)

    filename = os.path.join(month_dir, f"snapshot_{snapshot_date.strftime('%Y_%m')}.csv")
    current_emps.to_csv(filename, index=False)

    if i % 12 == 0:
        print(f"Saved snapshot {i}/{months}: Year {snapshot_date.year} ({len(current_emps)} employees)")

print(f"\n✅ Completed! All {months} snapshots saved to Google Drive at: {base_path}")
print(f"Total employees tracked: {len(used_ids)}")
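
The promotion ladder in the loop above is written as an `if`/`elif` chain. As a possible alternative, the sketch below expresses the same ladder as a lookup table and applies promotions and milestone raises in a vectorized way. The names `NEXT_LEVEL`, `MILESTONE_RAISE`, and `promote` are assumptions for this example, and the toy rows are invented. Unlike the loop, this sketch increments `promotion_count` only for employees whose level actually changes.

```python
import pandas as pd

# One-step promotion ladder; Executive has no next level
NEXT_LEVEL = {
    "Specialist": "Analyst",
    "Analyst": "Manager",
    "Manager": "Principal",
    "Principal": "Executive",
}
# Raise granted on reaching each level (same values as raise_milestone_percentages)
MILESTONE_RAISE = {"Analyst": 0.07, "Manager": 0.10, "Principal": 0.11, "Executive": 0.25}

def promote(frame: pd.DataFrame, mask: pd.Series) -> pd.DataFrame:
    """Return a copy with promotions and milestone raises applied where mask is True."""
    out = frame.copy()
    eligible = mask & out["Job_Level"].isin(list(NEXT_LEVEL))
    new_levels = out.loc[eligible, "Job_Level"].map(NEXT_LEVEL)
    raises = new_levels.map(MILESTONE_RAISE)
    out.loc[eligible, "Job_Level"] = new_levels
    out.loc[eligible, "Annual_Salary"] = out.loc[eligible, "Annual_Salary"] * (1 + raises)
    out.loc[eligible, "promotion_count"] = out.loc[eligible, "promotion_count"] + 1
    return out

# Invented toy rows for illustration only
toy = pd.DataFrame({
    "Employee_ID": ["E1", "E2", "E3"],
    "Job_Level": ["Specialist", "Executive", "Manager"],
    "Annual_Salary": [50000.0, 200000.0, 90000.0],
    "promotion_count": [0, 0, 1],
    "time_in_company": [36, 96, 40],
})
promoted = promote(toy, toy["time_in_company"].isin([36, 96, 132]))
print(promoted)
```

Keeping the ladder in a dictionary means a new level can be added in one place without touching the control flow.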