<center><b><p style="font-size:22px;">FAKE EMPLOYEES DATASET GENERATOR</p></b></center>
<center><b><p style="font-size:20px;">WITH A FOCUS ON EMPLOYEE ENGAGEMENT</p></b></center>

<div style="text-align: center">
<img src="https://images.unsplash.com/photo-1628069640591-7736fd4c87cf?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1121&q=80"
     alt="staff only;"
     float= "right;" 
     width=500/>

Photo by <a href="https://unsplash.com/@fredrikwandem?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Fredrik Solli Wandem</a> on <a href="https://unsplash.com/s/photos/employee?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></div>
  

<b><p style="font-size:16px;">ABOUT</p></b>

This Jupyter Notebook contains **code to generate a fake employee dataset** that can be used for **employee engagement analysis.** I created it because I couldn't find a suitable existing dataset to demonstrate a real-life example.

Using Faker's localizations, I generate employees for multiple offices in different countries, **simulating a real-world multinational company**. The organizational structure is one common among tech companies, in which most employees work in research & development.

---

<center><b><p style="font-size:18px;">SNAPLA – THE FAKE COMPANY</p></b></center>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/Pav-Ini/fake_employee_generator/main/images/snapla.png"
     alt="snapla_logo;"
     float= "right;">


<p>
<div style="text-align: left">

**Snapla is a fake tech company founded 15 years ago.** At the beginning of its existence, Snapla needed to grow fast. So, the company went on a large hiring spree and opened up new offices around Europe. 

Snapla headquarters are located in Prague, Czechia. The second largest office is located in Brno. Later on, Snapla opened up offices in Spain – Madrid and Barcelona. The smallest and youngest office can be found in Vienna, Austria.

Now that Snapla's turbulent startup times are ending, the **management wants to retain its employees and see how engaged they are.**
</p>
<br></br>

<b><p style="font-size:16px;">SNAPLA ORGANIZATION CHART</p></b>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/Pav-Ini/fake_employee_generator/main/images/snapla_org_chart.png"
     alt="snapla_org_chart;"
     float= "right;">
</p>

<div style="text-align: left">
<b><p style="font-size:16px;">ENGAGEMENT DATA – UWES</p></b>

Snapla's HR team uses the Utrecht Work Engagement Scale (UWES-9), a survey developed at Utrecht University to measure employee engagement. Each statement is rated on a scale from 0 to 6, with 0 being *Never* and 6 being *Always*. The survey covers three engagement facets: **Vigor, Dedication, and Absorption.** The facets are defined as follows:

- **Vigor** (VI) is characterized by high levels of energy and mental resilience while working, the willingness to invest effort in one’s work and persistence even in the face of difficulties. 
- **Dedication** (DE) refers to being strongly involved in one’s work and experiencing a sense of significance, enthusiasm, inspiration, pride, and challenge. 
- **Absorption** (AB) is characterized by being fully concentrated and happily engrossed in one’s work, whereby time passes quickly and one has difficulties with detaching oneself from work.



|                        Statement                         | Abbreviation | Engagement facet |
|:---------------------------------------------------------|:------------:|:----------------:|
| At my work, I am full of energy.                         |    VI1       |      Vigor       |
| At my job, I feel strong and vigorous.                   |    VI2       |      Vigor       |
| When I get up in the morning, I feel like going to work. |    VI3       |      Vigor       |
| I am enthusiastic about my job.                          |    DE1       |     Dedication   |
| My job inspires me.                                      |    DE2       |     Dedication   |
| I am proud of the work that I do.                        |    DE3       |     Dedication   |
| I feel happy when I am working intensely.                |    AB1       |     Absorption   |
| I am immersed in my work.                                |    AB2       |     Absorption   |
| I get carried away when I am working.                    |    AB3       |     Absorption   |
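
For analysis, the nine item responses are usually aggregated into facet scores. The sketch below assumes that each facet score is the mean of its three items and that the overall score is the mean of all nine items. The `responses` frame and the `*_score` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical responses of two employees on the 0-6 UWES-9 scale
responses = pd.DataFrame({
    "VI1": [6, 0], "VI2": [5, 1], "VI3": [4, 2],
    "DE1": [3, 3], "DE2": [3, 3], "DE3": [3, 3],
    "AB1": [0, 6], "AB2": [1, 5], "AB3": [2, 4],
})

# Map each facet to its three item columns
facets = {
    "vigor": ["VI1", "VI2", "VI3"],
    "dedication": ["DE1", "DE2", "DE3"],
    "absorption": ["AB1", "AB2", "AB3"],
}

# Facet score = mean of the facet's items (assumed scoring rule)
for facet, items in facets.items():
    responses[f"{facet}_score"] = responses[items].mean(axis=1)

# Overall engagement = mean of all nine items
all_items = [item for items in facets.values() for item in items]
responses["uwes_score"] = responses[all_items].mean(axis=1)

print(responses[["vigor_score", "dedication_score", "absorption_score", "uwes_score"]])
```

Both hypothetical employees end up with the same overall score even though their facet profiles are opposite, which is why the facets are worth inspecting separately.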

</p>

<br></br>

<b><p style="font-size:16px;">GENERATED DATA</p></b>

|        FIELD            | DATA TYPE |                         DESCRIPTION - EXAMPLES                                          |
| ----------------------- |:----------|-----------------------------------------------------------------------------------------|
| `employee_id`           | int       | Unique employee ID, numbered sequentially from a user-supplied starting value           |
| `employee_status`       | string    | Status is either *active* for active employees or *terminated* for terminated employees |
| `first_name`            | string    | Employee's first name                                                                   |
| `last_name`             | string    | Employee's last name                                                                    |
| `gender`                | string    | Employee's gender. *F* is female, *M* is male, and *O* is other/non-binary/etc.         |
| `hire_date`             | date      | Date the employee was hired                                                             |
| `probation_period_date` | date      | Date the employee passed their probation period                                         |
| `termination_date`      | date      | Date the *terminated* employee left the company                                         |
| `location`              | string    | One of the 5 office locations: *Prague, Brno, Madrid, Barcelona, Vienna*                |
| `functional_line`       | string    | First level of org. structure, e.g. *Research & Development*                            |
| `department`            | string    | Second level of org. structure, e.g. *Engineering*                                      |
| `team`                  | string    | Third level of org. structure, e.g. *Mobile*                                            |
| `performance`           | int       | Latest employee performance review. 1 is the lowest, 5 is the highest                   |
| `cr`                    | int       | Compa-ratio as a percentage: employee's current salary divided by the current market rate |
| `people_manager`        | bool      | Employee is a supervisor                                                                |
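
The compa-ratio described in the table can be worked through with a small calculation. The function name `compa_ratio_pct` and the salary figures are invented for illustration; the notebook itself draws the value at random from 80 to 149 rather than computing it from salaries.

```python
def compa_ratio_pct(salary: float, market_rate: float) -> int:
    """Return the compa-ratio as a whole-number percentage."""
    if market_rate <= 0:
        raise ValueError("market_rate must be positive")
    return round(salary / market_rate * 100)


# Hypothetical salaries compared with a market rate of 50,000
below = compa_ratio_pct(40_000, 50_000)   # paid below the market rate
at_market = compa_ratio_pct(50_000, 50_000)
above = compa_ratio_pct(59_000, 50_000)   # paid above the market rate

print(below, at_market, above)
```

A value of 100 means the salary matches the market rate, so values under 100 mark employees paid below it.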

----

In [1]:
# importing used packages
import pandas as pd
import numpy as np
from faker import Faker
import datetime as dt

In [2]:
# Organizational structure dictionary for the fake Snapla company
org_structure_dic = {
    "Research & Development" : {
        "Engineering" : {
            "Team" : ["Data", "Mobile", "Backend", "Frontend"]
        },
        "Product" : {
            "Team" : ["Prodcut Design", "Developer Tools & Infrastrucuture", "Business Affairs"]
        },
    },
    "Finance" : {
        "Finance" : {
            "Team" : ["Accounting", "Internal Audit", "Controlling"]
        },
    },
    "Marketing & Sales" : {
        "Marketing" : {
            "Team" : ["Brand & Creative", "Content Marketing"]
        },
        "Sales" : {
            "Team" : ["Business Development", "Ad Sales"]
        },
    },
    "Operations" : {
        "Human Resources" : {
            "Team" : ["Recruitment", "HR Generalists", "HR Business Partners"]
        },
        "Facilities" : {
            "Team" : ["Facilities", "End User Support"]
        },
    },
}


In [3]:
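# Function to automatically generate a pandas DataFrame with fake employees

def fake_employees(num, gender_1, office_location_list, location_prob_list, faker_localization, id_num):
    """Generate `num` fake employees for one country and one name gender.

    gender_1 is "F" or "M" and selects the Faker name generators;
    location_prob_list holds one probability per entry of office_location_list;
    faker_localization is a Faker locale such as "cs_CZ";
    id_num is the first employee_id, later IDs count up from it.
    """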

    # nested function to get organizational structure with probabilities of being in a given functional line
    def generate_org_structure(prob):
   
        random_fl = np.random.choice(list(org_structure_dic), p=prob) # probabilities of being in FL
        random_department = np.random.choice(list(org_structure_dic[random_fl]))
        random_team = np.random.choice(list(org_structure_dic[random_fl][random_department]["Team"]))
        
        return random_fl + "-" +random_department + "-" +random_team

    # nested function to get random weights that sum to a user-defined value,
    # used as probabilities (summing to 100%) when drawing UWES survey answers
    def random_weights(num, total_sum):
        nums = np.random.rand(num)
        return nums / np.sum(nums) * total_sum

    # initializing Faker with the given locale
    fake = Faker(faker_localization)

    # lists of values to randomly assign to workers
    employee_status = ["active", "terminated"]
    gender = [gender_1, "O"]
    performance = [1,2,3,4,5]
    cr = list(range(80,150))
    supervisor = [True, False]
    survey_respo = [0,1,2,3,4,5,6]  # UWES responses scale


    # picking the gender-specific Faker name generators
    if gender_1 == "F":
        first_name, last_name = fake.first_name_female, fake.last_name_female
    elif gender_1 == "M":
        first_name, last_name = fake.first_name_male, fake.last_name_male
    else:
        raise ValueError('gender_1 must be "F" or "M"')

    # UWES-9 item columns, in the order they appear in the dataset
    uwes_items = ["VI1", "VI2", "VI3", "AB1", "AB2", "AB3", "DE1", "DE2", "DE3"]

    # fake workers generator, shared by both genders
    fake_workers = [{
        "employee_id" : x + id_num,
        "employee_status" : np.random.choice(employee_status, p=[0.8, 0.2]),
        "first_name": first_name(),
        "last_name" : last_name(),
        "gender" : np.random.choice(gender, p=[0.95, 0.05]),
        "hire_date" : fake.date_between(start_date="-15y", end_date="-3y"),
        "org_structure" : generate_org_structure([0.5, 0.1, 0.2, 0.2]),
        "location" : np.random.choice(office_location_list, p=location_prob_list),
        "performance" : np.random.choice(performance, p=[0.06, 0.14, 0.5, 0.2 ,0.1]),
        "cr" : np.random.choice(cr),
        "people_manager" : np.random.choice(supervisor, p=[0.1,0.9]),
        # UWES survey responses, each drawn with its own random weights
        **{item : np.random.choice(survey_respo, p=random_weights(7,1)) for item in uwes_items},
    } for x in range(num)]

    # generating the fake workers DataFrame
    df = pd.DataFrame(fake_workers)

    # nested function to further adjust the fake workers dataset
    def adjust_workers_df(df):
        # splitting the organization structure
        df[["functional_line", "department", "team"]] = df["org_structure"].str.split("-", expand=True)
        df.drop("org_structure", axis=1, inplace=True)

        # converting hire date to datetime and adding a 3-month probation period date
        df["hire_date"] = pd.to_datetime(df["hire_date"])
        df["probation_period_date"] = df["hire_date"] + pd.DateOffset(months=3)
        df.insert(6, "probation_period_date", df.pop("probation_period_date"))

        # adding termination date for terminated employees
        termination_months = list(range(1, 37))
        df["termination_date"] = df.apply(lambda x: x["hire_date"] + pd.DateOffset(months=np.random.choice(termination_months)) 
            if x["employee_status"] =="terminated"
            else np.nan, axis=1)
        df.insert(7, "termination_date", df.pop("termination_date"))
        return df


    return adjust_workers_df(df)
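
The `random_weights` helper used for the survey answers can be tried on its own. This sketch copies it to module level and checks that the result is a valid probability vector; the fixed seed is an assumption added only to make the run repeatable.

```python
import numpy as np


def random_weights(num, total_sum):
    """Return `num` random non-negative weights that add up to `total_sum`."""
    nums = np.random.rand(num)
    return nums / np.sum(nums) * total_sum


np.random.seed(42)  # assumption: fixed seed, only for repeatability

# One weight per possible answer on the 0-6 scale
weights = random_weights(7, 1)

# Draw a single survey answer using those weights as probabilities
answer = np.random.choice(np.arange(7), p=weights)

print(weights.round(3), weights.sum())
print(answer)
```

Because new weights are drawn for every single answer, the responses averaged over many employees come out close to uniform across 0 to 6. Reusing one weight vector per employee or per team would be one way to get more structured engagement patterns.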

<b><p style="font-size:14px;">GENERATING FAKE EMPLOYEES</p></b>

In [4]:
# Faker localization packages
czech_localization = "cs_CZ"
spanish_localization = "es_ES"
austria_localization = "de_AT"

# Snapla subsidiaries
czech_offices = ["Prague", "Brno"]
spanish_offices = ["Madrid", "Barcelona"]
austria_offices = ["Vienna"]

# Office probabilities
czech_probability = [0.8, 0.2]
spanish_probability = [0.4, 0.6]
austria_probability = [1]

<p style="font-size:14px;">CZECHIA 🇨🇿</p>

In [5]:
# generating female employees
cz_f = fake_employees(760, "F", czech_offices, czech_probability, czech_localization, 10000)

# generating male employees
cz_m = fake_employees(740, "M", czech_offices, czech_probability, czech_localization, 10760)

<p style="font-size:14px;">SPAIN 🇪🇸</p>

In [6]:
# generating female employees
es_f = fake_employees(400, "F", spanish_offices, spanish_probability, spanish_localization, 20000)

# generating male employees
es_m = fake_employees(600, "M", spanish_offices, spanish_probability, spanish_localization, 20400)

<p style="font-size:14px;">AUSTRIA 🇦🇹</p>

In [7]:
# generating female employees
at_f = fake_employees(230, "F", austria_offices, austria_probability, austria_localization, 30000)

# generating male employees
at_m = fake_employees(270, "M", austria_offices, austria_probability, austria_localization, 30230)

<b><p style="font-size:14px;">SNAPLA EMPLOYEES 🐙</p></b>

In [8]:
snapla = pd.concat([cz_f, cz_m, es_f, es_m, at_f, at_m], ignore_index=True)
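
The starting IDs passed to `fake_employees` are typed by hand so that the batches of one country do not overlap (for example 10000 + 760 = 10760). A minimal sketch of deriving the second starting value and checking the combined frame for duplicates is shown here; `id_block` is a hypothetical stand-in that produces only the ID column.

```python
import pandas as pd


def id_block(start, count):
    """Hypothetical stand-in for fake_employees: returns only the ID column."""
    return pd.DataFrame({"employee_id": range(start, start + count)})


# Same counts and first starting ID as the Czech batches
cz_f = id_block(10000, 760)
cz_m = id_block(10000 + len(cz_f), 740)  # 10760, derived instead of typed

combined = pd.concat([cz_f, cz_m], ignore_index=True)

# Every ID should occur exactly once in the combined frame
ids_are_unique = combined["employee_id"].is_unique
print(len(combined), ids_are_unique)
```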

In [9]:
pd.set_option("display.max_columns", None)
snapla.sample(10)

Unnamed: 0,employee_id,employee_status,first_name,last_name,gender,hire_date,probation_period_date,termination_date,location,performance,cr,people_manager,VI1,VI2,VI3,AB1,AB2,AB3,DE1,DE2,DE3,functional_line,department,team
2799,30299,active,Michael,Mark,M,2009-08-09,2009-11-09,NaT,Vienna,3,118,False,0,5,6,0,6,4,2,1,2,Research & Development,Product,Developer Tools & Infrastrucuture
641,10641,active,Vladimíra,Dušková,F,2013-10-25,2014-01-25,NaT,Prague,1,148,True,6,2,5,4,6,3,4,0,2,Research & Development,Engineering,Mobile
2980,30480,active,Ilias,Wandl,M,2016-08-21,2016-11-21,NaT,Vienna,3,112,False,0,1,4,2,0,3,0,3,1,Marketing & Sales,Marketing,Brand & Creative
2087,20587,active,Emigdio,Mateos,M,2016-11-26,2017-02-26,NaT,Barcelona,1,117,False,4,2,1,3,6,0,2,0,5,Operations,Human Resources,HR Business Partners
82,10082,active,Šárka,Kopecká,F,2013-04-29,2013-07-29,NaT,Brno,2,101,False,4,1,2,2,5,5,3,1,1,Research & Development,Engineering,Backend
1229,11229,active,Šimon,Pokorný,M,2008-02-17,2008-05-17,NaT,Prague,3,114,False,0,6,0,4,5,5,3,5,1,Research & Development,Engineering,Mobile
2091,20591,terminated,Jacinto,Barba,M,2012-11-29,2013-02-28,2015-11-29,Barcelona,2,84,False,1,3,5,0,3,0,5,2,1,Marketing & Sales,Sales,Ad Sales
1100,11100,active,Marian,Sedláček,M,2016-01-21,2016-04-21,NaT,Prague,3,87,False,0,5,4,2,4,3,0,6,1,Operations,Human Resources,HR Business Partners
2890,30390,active,Florin,Hell,M,2016-07-09,2016-10-09,NaT,Vienna,3,139,True,4,6,5,4,6,1,1,5,2,Research & Development,Engineering,Backend
698,10698,terminated,Denisa,Krejčová,O,2017-09-05,2017-12-05,2018-01-05,Prague,4,103,False,6,0,4,1,5,0,2,1,0,Research & Development,Product,Business Affairs


In [10]:
# saving the generated file as .csv
snapla.to_csv("snapla_employees.csv")
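
When the saved file is loaded again, the date columns come back as plain text unless they are parsed explicitly. The sketch below round-trips a tiny stand-in frame through an in-memory buffer instead of a file; the two rows are invented, and writing with `index=False` is a choice made for this sketch.

```python
import io

import pandas as pd

# Tiny stand-in for the generated dataset
df = pd.DataFrame({
    "employee_id": [10000, 10001],
    "hire_date": pd.to_datetime(["2013-10-25", "2017-09-05"]),
    "termination_date": pd.to_datetime([None, "2018-01-05"]),
})

# Write to an in-memory buffer; index=False keeps the row index out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the listed columns back into datetime columns
restored = pd.read_csv(buffer, parse_dates=["hire_date", "termination_date"])
print(restored.dtypes)
```

Active employees have no termination date, so that column contains missing values (`NaT`) after loading.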

<b><p style="font-size:16px;">SOURCES 🔗 </p></b>

1. [Generating Fake Data With Python](https://towardsdatascience.com/generating-fake-data-with-python-c7a32c631b2a) – Towards Data Science article
2. [Faker](https://faker.readthedocs.io/)
3. [Free Logo Design](https://freelogodesign.org) – tool used to create the Snapla logo
4. [Compa-ratio](https://en.wikipedia.org/wiki/Compa-ratio) – Wikipedia article
5. [UWES Paper](https://www.wilmarschaufeli.nl/publications/Schaufeli/Test%20Manuals/Test_manual_UWES_English.pdf)