
Synthetic HR text requests from employees to employers, generated with GPT-J and seeded with open data about salaries, sick leave, etc., so that they contain realistic information.


HR request data set


The objective of this project is to generate synthetic tickets sent from employees to the HR department. The goal is not to create tickets that are indistinguishable from real ones, but to build a dataset that can be used for training machine learning models while respecting the GDPR. The tickets are created starting from real open data, which is generalized to preserve privacy.
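To make the generalization step concrete, here is a minimal sketch of what replacing exact values with coarser ranges can look like; the field names and bucket boundaries are hypothetical, not the project's actual code.

# Minimal sketch of the privacy generalization step (hypothetical field
# names and bucket boundaries, not the project's actual code).

def generalize_salary(salary: float) -> str:
    """Replace an exact salary with a coarse 10k range."""
    lower = int(salary // 10_000) * 10_000
    return f"{lower}-{lower + 10_000} EUR"

def generalize_record(record: dict) -> dict:
    """Generalize the sensitive fields of an open-data record."""
    return {
        "role": record["role"],  # non-identifying, kept as-is
        "salary": generalize_salary(record["salary"]),  # exact value -> range
        "sick_days": "more than 10" if record["sick_days"] > 10 else "10 or fewer",
    }

print(generalize_record({"role": "analyst", "salary": 43_250.0, "sick_days": 12}))
# -> {'role': 'analyst', 'salary': '40000-50000 EUR', 'sick_days': 'more than 10'}

Coarsening exact values into ranges keeps the generated tickets realistic while removing the precise figures that could tie a record back to an individual.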

The project also includes some example use cases for the generated tickets, in particular:

  • Ticket classification

[Figure: complete_schema]

Test data set (human)

A test data set of human-produced HR requests is available at

ticket_generation/data/survey_tickets

This data set was created by 29 individuals, based on the same prompts given to GPT-J. These tickets can be used as a test data set for downstream tasks; we used them for our use cases.

🏃 How to run

Python version used to run the experiments: 3.9 (it should also work with Python >= 3.7, but this is not guaranteed). It is advised to use a separate virtual environment for each task.

python -m venv venv

On Windows

venv/Scripts/activate

On Linux/MacOS

source venv/bin/activate

Install requirements and run ticket generation

python -m pip install -r requirements.txt

python run_ticket_generator.py

Install requirements and run ticket classification

python -m pip install -r ./use_cases/ticket_classification/requirements_classification.txt

python run_ticket_classification.py

Parameter configuration is set up with Hydra.

Types of tickets:

  • absence
  • salary
  • life_event
  • gender_pay_gap
  • info_accommodation
  • complaint
  • refund_travel
  • shift_change

To run the Ticket Generation with the default parameters

python run_ticket_generator.py

To run the experiments for only one type of ticket

python run_ticket_generator.py ticket_type=absence

To run the experiments for all types of ticket

python run_ticket_generator.py -m

To run the experiments for multiple types of ticket, but not all (e.g. absence and life_event)

python run_ticket_generator.py -m ticket_type=absence,life_event

To change a parameter, either edit it directly in the conf/ticket_generation folder or override it on the command line

Example:

python run_ticket_generator.py gpt.top_k=30
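For orientation, a Hydra entry point typically has the following shape. This is an illustrative sketch, not the project's actual run_ticket_generator.py, but it shows where the command-line overrides above end up.

# Illustrative Hydra entry point (a sketch, not the project's script).
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf/ticket_generation", config_name="config")
def main(cfg: DictConfig) -> None:
    # cfg reflects config.yaml plus any command-line overrides,
    # e.g. `gpt.top_k=30` or `ticket_type=absence`.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()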

🐳 Run on Docker

To run ticket generation on Docker:

docker build -t nextgen_det .
nvidia-docker run --name nextgen_det -v ${PWD}/gen_cache/:/app/gen_cache -v ${PWD}/ticket_generation/output/:/app/ticket_generation/output/ -it -e NVIDIA_VISIBLE_DEVICES=1,2 nextgen_det data_creation.number_of_data=100

To run ticket classification on Docker:

docker build -t nextgen_det_class -f use_cases/ticket_classification/Dockerfile_classification .
nvidia-docker run --name nextgen_det_class -v ${PWD}/gen_cache/:/app/gen_cache -v ${PWD}/ticket_generation/output/:/app/ticket_generation/output/ -v ${PWD}/use_cases/ticket_classification/output/:/app/use_cases/ticket_classification/output/ -it -e NVIDIA_VISIBLE_DEVICES=1,2 nextgen_det_class

When you run the experiments with Docker you can also configure all the parameters as described above; for example, to set the gpt.top_k parameter to 30 you would write

nvidia-docker run --name nextgen_det -v ${PWD}/gen_cache/:/app/gen_cache -v ${PWD}/ticket_generation/output/:/app/ticket_generation/output/ -it -e NVIDIA_VISIBLE_DEVICES=1,2 nextgen_det gpt.top_k=30
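Note that nvidia-docker is the legacy wrapper. On recent Docker versions with the NVIDIA Container Toolkit installed, an equivalent run should be possible with the --gpus flag (the invocation below is illustrative; adjust it to your setup):

docker run --gpus '"device=1,2"' --name nextgen_det -v ${PWD}/gen_cache/:/app/gen_cache -v ${PWD}/ticket_generation/output/:/app/ticket_generation/output/ -it nextgen_det gpt.top_k=30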

🛠️ Configuration

The following parameters can be changed in the conf/ticket_generation folder. This is not an exhaustive list; for the complete list, see the conf/ folder.

File: config.yaml

  • defaults:
    • ticket_type: the type of ticket to generate (must be one of {absence, salary, life_event})
  • data_creation:
    • number_of_data: number of tickets to create
    • data_path: path of the data folder where the datasets are stored
  • gpt: (see the sketch after this list)
    • top_k: only the k most likely next words are kept, and the probability mass is redistributed among these k words
    • top_p: sample from the smallest possible set of words whose cumulative probability exceeds p
    • repetition_penalty: the parameter for repetition penalty; 1.0 means no penalty
    • temperature: the value used to modulate the logits distribution
    • word_limit: maximum number of words produced by one GPT generation
  • gpu:
    • use_gpu: True if you want to use the GPU
    • device: device name of your GPU (e.g. "cuda:0", or "cuda" to use more than one GPU)
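To make these sampling knobs concrete, the sketch below shows how such parameters map onto a Hugging Face transformers generation call for GPT-J. It is illustrative only; the project's own generation code may differ.

# Illustrative mapping of the gpt.* parameters onto a Hugging Face
# transformers generation call for GPT-J (a sketch, not the project's code).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("Dear HR, I would like to request", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=30,                # gpt.top_k
    top_p=0.9,               # gpt.top_p
    temperature=0.8,         # gpt.temperature
    repetition_penalty=1.2,  # gpt.repetition_penalty
    max_new_tokens=100,      # roughly gpt.word_limit (tokens, not words)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))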

File: ticket_type/{absence,life_event,salary}.yaml

  • df_provider: (see the sketch after this list)
    • number_of_data: number of rows to sample from the original dataset (-1 means take all data)
    • shuffle: whether to shuffle the data before sampling
    • file_name: file name of the dataset
    • columns: column names in the CSV
  • text_generator:
    • file_name: file name of the template
    • category: category that will be written in the final ticket
    • sub_category: sub-category that will be written in the final ticket
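As a reference, the sampling behavior described by the df_provider options could be implemented roughly as follows, assuming pandas; this is a sketch, not the project's implementation.

# Illustrative sampling behavior for the df_provider options
# (a sketch assuming pandas, not the project's implementation).
import pandas as pd

def load_samples(file_name: str, number_of_data: int, shuffle: bool) -> pd.DataFrame:
    df = pd.read_csv(file_name)
    if shuffle:
        df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
    if number_of_data == -1:
        return df  # -1 means take all data
    return df.head(number_of_data)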

File: conf/ticket_classification

  • gpu:
    • use_gpu: True if you want to use the GPU
    • device: device name of your GPU (e.g. "cuda:0")
  • model_name: the BERT model to use for classification (see the sketch after this list)
  • ticket_dataset:
    • data_path: where the tickets are saved
    • template_path: where the templates are saved
    • sample_size: fraction of the available tickets used to train and test the model (e.g. 0.8 to use 80% of the tickets)
    • remove_first_part: True to classify without the initial part of the ticket (all the ticket metadata)
    • remove_template_sections: True to remove the parts of the tickets that belong to the original template
  • train_size / validation_size / test_size: proportions of the data used for the train, validation, and test splits
  • explanation:
    • execute: True to output not only the results but also the words that drove the model's classification decision
    • top_n: number of top words to print
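As a reference point for this use case, below is a minimal sketch of BERT-based ticket classification with Hugging Face transformers; the model name and label set are placeholders, and the project's pipeline adds its own dataset handling and explanation step.

# Illustrative BERT-based ticket classification (a sketch, not the
# project's pipeline; model name and labels are placeholders).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["absence", "salary", "life_event"]  # a subset of the ticket types
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

ticket = "Dear HR, I was sick last week and need to register my absence."
inputs = tokenizer(ticket, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# An untrained classification head gives arbitrary predictions; after
# fine-tuning on the generated tickets, argmax yields the predicted type.
print(labels[logits.argmax(dim=-1).item()])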

The results of the generation step are written to the folder ticket_generation/output. All the logs of the different runs are saved in the folder outputs/; if the runs were multirun (using the -m or --multirun flag), the logs are saved in the folder multirun/ instead.

📖 Datasets

List of datasets used in the project:

Support, Feedback, Contributing

This project is open to feature requests/suggestions, bug reports etc. via GitHub issues. Contribution and feedback are encouraged and always welcome. For more information about how to contribute, the project structure, as well as additional contribution information, see our Contribution Guidelines.

Code of Conduct

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.

Licensing

Copyright 2023 SAP SE or an SAP affiliate company and hr-request-data-set contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
