# Synthetic recruiting data

This notebook constructs a synthetic recruiting data set that we will use for exploring fairness interventions.

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

We suppose that a large company has historical records of the people who have applied to join it, and of whether each candidate was subsequently employed. We will use this data to train a model that predicts whether an individual should be employed. A discussion of whether this is appropriate, and of how to mitigate potential biases, is contained in the app.

We aim to generate the data so that each feature reflects certain unfair biases, as do the labels themselves. Bias in features such as the level of education attained reflects systemic bias, whereas bias in the labels reflects historical bias in the company's hiring practices.

We have settled on the following features as ones that might be relevant in an automated recruitment setting.

- Was the candidate referred for this position?
- Number of career years relevant for the job
- Whether the candidate went to a Russell Group university
- Did the candidate graduate with an honours degree?
- GCSE results
- A-levels
- Current income
- Sex
- Race
- Quality of written CV
- Years of volunteering experience
- Years of gaps in CV
- Level of IT skills
- Whether currently employed or not

We start by defining some high-level parameters that will control the data generation.

In [None]:
N = 10000  # number of data points to generate
P_SEX_MALE = 0.5
P_RACE_WHITE = 0.5

P_EMPLOYED_WHITE_MALE = 0.7
P_EMPLOYED_BLACK_MALE = 0.45
P_EMPLOYED_WHITE_FEMALE = 0.5
P_EMPLOYED_BLACK_FEMALE = 0.25
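
The sampling cells in this notebook draw from NumPy's global random state, and no seed is set here, so each run produces a different data set. A minimal sketch of how a seed could be fixed before sampling is shown below; the name `SEED` and its value are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

SEED = 0  # hypothetical value, not taken from the notebook

# Seed the global random state, then draw a few samples.
np.random.seed(SEED)
first = np.random.binomial(1, 0.5, 5)

# Re-seeding with the same value reproduces the same draws.
np.random.seed(SEED)
second = np.random.binomial(1, 0.5, 5)

same_draws = bool((first == second).all())
print(same_draws)
```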

## Sampling the data

We build the data up starting with demographic features. Remaining features are sampled conditional on the demographic features.

In [None]:
df = pd.DataFrame()

df["sex_male"] = np.random.binomial(1, P_SEX_MALE, N)
df["race_white"] = np.random.binomial(1, P_RACE_WHITE, N)
# we won't use age in the final data, we just use it
# to ensure other features like years of experience
# are generated consistently
df["age"] = np.floor(np.random.poisson(70, N) / 2)

We assume that, on average, individuals spend 40% of their working-age years (taken here to start at age 22) accumulating relevant experience, with small additional increments for white and male candidates. We sample years of experience from a Poisson distribution with this mean.

In [None]:
df["years_experience"] = np.random.poisson(
    0.4 * np.where(df.age >= 22, df.age - 22, 0)
    + df.race_white * 0.2
    + df.sex_male * 0.1
)
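
To make the Poisson mean concrete, the sketch below evaluates the same expression for two example candidates. The helper name `experience_rate` and the example ages are assumptions introduced for illustration only.

```python
import numpy as np


def experience_rate(age, race_white, sex_male):
    """Poisson mean for years_experience, mirroring the cell above."""
    # Working-age years are counted from age 22; younger ages give zero.
    working_years = np.where(age >= 22, age - 22, 0)
    return 0.4 * working_years + 0.2 * race_white + 0.1 * sex_male


# A 35-year-old white male: 0.4 * 13 + 0.2 + 0.1
rate_35_white_male = float(experience_rate(np.array(35), 1, 1))

# A 20-year-old black female: no working-age years and no increments
rate_20_black_female = float(experience_rate(np.array(20), 0, 0))

print(rate_35_white_male, rate_20_black_female)
```

A rate of zero means the sampled years of experience are always zero for that candidate.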

`referred` is a binary variable indicating whether the applicant was referred. We assume men are more likely to be referred than women, and white people are more likely to be referred than black people.

In [None]:
df["referred"] = np.random.binomial(
    1, 0.2 + 0.4 * df.sex_male + 0.3 * df.race_white
)
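
The referral probabilities implied by the expression above can be tabulated per demographic group. This is a small sketch; the helper name `referral_prob` is introduced here for illustration.

```python
from itertools import product


def referral_prob(sex_male, race_white):
    """Referral probability, mirroring the expression in the cell above."""
    return 0.2 + 0.4 * sex_male + 0.3 * race_white


# Keys are (sex_male, race_white); values are rounded for display.
referral_table = {
    (sex_male, race_white): round(referral_prob(sex_male, race_white), 2)
    for sex_male, race_white in product([0, 1], repeat=2)
}

for group, prob in referral_table.items():
    print(group, prob)
```

Under these coefficients the probability ranges from 0.2 for black women to 0.9 for white men.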

We model the number of GCSEs better than C grade as a binomial distribution with 10 trials. The increased probability of good grades for white students is intended to reflect systemic biases in access to education.

In [None]:
df["gcse"] = np.random.binomial(10, 0.6 + df.race_white * 0.15)
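
Since a binomial variable with 10 trials and success probability p has mean 10p, the expected number of good GCSE grades differs by race. The sketch below compares that expectation with a simulation; the seed and sample size are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration
n_trials = 10

# Success probability per value of race_white, as in the cell above.
p_good_grade = {0: 0.6, 1: 0.6 + 0.15}

# Analytical mean of a binomial: n * p.
expected = {k: n_trials * p for k, p in p_good_grade.items()}

# Empirical mean from a large simulated sample.
simulated = {
    k: rng.binomial(n_trials, p, 100_000).mean()
    for k, p in p_good_grade.items()
}

print(expected)
print(simulated)
```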

A-level results are mostly determined by GCSE results, with smaller adjustments for race and sex.

In [None]:
a_level_prob = (
    0.4  # baseline probability
    + df.gcse / 20  # adjusted for gcse results
    + df.race_white * 0.05  # adjusted for race
    - df.sex_male * 0.05  # adjusted for sex
)

df["a_level"] = np.random.binomial(4, a_level_prob)
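
Because `a_level_prob` is built by adding terms, it is worth checking that it stays a valid probability for every combination of inputs. The sketch below enumerates all GCSE counts and demographic combinations; the function form of `a_level_prob` is a restatement for illustration.

```python
def a_level_prob(gcse, race_white, sex_male):
    """Per-trial success probability, mirroring the cell above."""
    return 0.4 + gcse / 20 + 0.05 * race_white - 0.05 * sex_male


# gcse is a count out of 10, so it takes values 0..10.
probs = [
    a_level_prob(gcse, race_white, sex_male)
    for gcse in range(11)
    for race_white in (0, 1)
    for sex_male in (0, 1)
]

lowest, highest = min(probs), max(probs)
print(round(lowest, 2), round(highest, 2))
```

The lowest value occurs for a black male with no good GCSE grades and the highest for a white female with ten.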

We sample a binary variable indicating whether the individual went to a Russell Group university. This is influenced mainly by A-levels and GCSEs.

In [None]:
def russell_group_prob(row):
    if row.a_level == 4:
        return 0.8
    elif row.a_level == 3 and row.gcse >= 7:
        return 0.4
    return 0.1


df["russell_group"] = np.random.binomial(
    1, df.apply(russell_group_prob, axis=1)
)
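
Row-wise `apply` is easy to read but can be slow on large frames. As an alternative sketch (not the approach used in this notebook), the same piecewise probabilities can be computed with `np.select`. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each branch of the rule.
toy = pd.DataFrame({"a_level": [4, 3, 3, 2], "gcse": [5, 7, 6, 10]})


def russell_group_prob(row):
    if row.a_level == 4:
        return 0.8
    elif row.a_level == 3 and row.gcse >= 7:
        return 0.4
    return 0.1


row_wise = toy.apply(russell_group_prob, axis=1).to_numpy()

# np.select picks the value for the first condition that holds.
a_level = toy["a_level"].to_numpy()
gcse = toy["gcse"].to_numpy()
vectorised = np.select(
    [a_level == 4, (a_level == 3) & (gcse >= 7)],
    [0.8, 0.4],
    default=0.1,
)

print(np.allclose(row_wise, vectorised))
```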

Whether the candidate holds an honours degree depends on both A-levels and Russell Group attendance.

In [None]:
def honours_prob(row):
    if row.russell_group == 1:
        return 0.9
    return 0.2 + 0.15 * row.a_level


df["honours"] = np.random.binomial(1, df.apply(honours_prob, axis=1))

Years of voluntary experience.

In [None]:
df["years_volunteer"] = np.random.poisson(0.5, N)

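Current income is sampled from a normal distribution whose mean and standard deviation both grow with the square root of years of experience. The mean is also higher for Russell Group graduates and for white candidates.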

In [None]:
def salary_mean(row):
    return (
        15000
        + row.russell_group * 3000
        + row.race_white * 2000
        + np.sqrt(row.years_experience) * 5000
    )


def salary_std(row):
    return 1000 + np.sqrt(row.years_experience) * 2000


# floor divide then multiply by 250 to round down to a multiple of 250
df["income"] = (
    np.random.normal(
        df.apply(salary_mean, axis=1), df.apply(salary_std, axis=1),
    )
    // 250
    * 250
)
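
Floor division followed by multiplication maps each value to the largest multiple of 250 that does not exceed it, which is not the same as rounding to the nearest multiple. The sketch below contrasts the two on a few made-up incomes.

```python
import numpy as np

# Made-up incomes chosen to sit on either side of a rounding boundary.
incomes = np.array([20100.0, 20240.0, 20250.0, 20499.0])

# Behaviour of the expression used in the cell above: always rounds down.
rounded_down = incomes // 250 * 250

# Rounding to the nearest multiple of 250, shown only for comparison.
nearest = np.round(incomes / 250) * 250

print(rounded_down)
print(nearest)
```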

IT skills is a simple ordered categorical variable that depends on sex.

In [None]:
df["it_skills"] = np.random.binomial(3, 0.4 + 0.3 * df.sex_male)

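Years of gaps in CV. The expected number of gap years is proportional to years of experience, with a lower rate for male and white candidates.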

In [None]:
df["years_gaps"] = np.random.poisson(
    0.2
    * (1.0 - 0.5 * df.sex_male - 0.25 * df.race_white)
    * df.years_experience
)
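
The expression above multiplies years of experience by a group-dependent rate. A short sketch of that rate per group follows; the helper name `gap_rate` is introduced for illustration.

```python
from itertools import product


def gap_rate(sex_male, race_white):
    """Expected gap years per year of experience, as in the cell above."""
    return 0.2 * (1.0 - 0.5 * sex_male - 0.25 * race_white)


# Keys are (sex_male, race_white); values are rounded for display.
gap_rates = {
    (sex_male, race_white): round(gap_rate(sex_male, race_white), 3)
    for sex_male, race_white in product([0, 1], repeat=2)
}

for group, rate in gap_rates.items():
    print(group, rate)
```

With these coefficients a black woman accumulates gap years at four times the rate of a white man with the same experience.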

Quality of written CV

In [None]:
df["quality_cv"] = np.random.binomial(3, 0.6, N)

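Finally, we use a logistic model to compute the probability that each individual was employed, then sample a label with that probability. The coefficients on race and sex encode explicit discrimination, while several of the other features carry implicit discrimination because they were themselves sampled conditional on race and sex.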

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def employed_prob(row):
    return sigmoid(
        # implicit discrimination
        2 * row.referred
        + 1 * row.years_experience
        + 0.5 * row.gcse
        + 0.8 * row.a_level
        + 0.1 * row.russell_group
        + 0.1 * row.honours
        - 0.5 * row.years_gaps
        + 0.4 * row.quality_cv
        + 0.4 * row.it_skills
        # explicit discrimination
        + 0.8 * row.race_white
        + 0.5 * row.sex_male
        # offset
        - 15
    )


df["employed_yes"] = np.random.binomial(1, df.apply(employed_prob, axis=1))
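
To see how much the explicit terms alone move the outcome, the sketch below evaluates the same linear score for one made-up candidate twice, changing only `race_white` and `sex_male`. The candidate's feature values and the helper name `employed_logit` are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def employed_logit(c):
    """Linear score from the cell above, applied to a dict of features."""
    return (
        2 * c["referred"]
        + 1 * c["years_experience"]
        + 0.5 * c["gcse"]
        + 0.8 * c["a_level"]
        + 0.1 * c["russell_group"]
        + 0.1 * c["honours"]
        - 0.5 * c["years_gaps"]
        + 0.4 * c["quality_cv"]
        + 0.4 * c["it_skills"]
        + 0.8 * c["race_white"]
        + 0.5 * c["sex_male"]
        - 15
    )


# Hypothetical candidate; values chosen only to illustrate the arithmetic.
candidate = {
    "referred": 1, "years_experience": 5, "gcse": 7, "a_level": 3,
    "russell_group": 0, "honours": 1, "years_gaps": 0, "quality_cv": 2,
    "it_skills": 2, "race_white": 0, "sex_male": 0,
}
# Same candidate with only the demographic indicators changed.
counterpart = {**candidate, "race_white": 1, "sex_male": 1}

logit_a = employed_logit(candidate)
logit_b = employed_logit(counterpart)
p_a, p_b = sigmoid(logit_a), sigmoid(logit_b)
print(round(logit_a, 2), round(logit_b, 2))
print(round(p_a, 3), round(p_b, 3))
```

The two scores differ by exactly 0.8 + 0.5 = 1.3, the sum of the explicit coefficients, even though every other feature is identical.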

Drop age as it's no longer needed.

In [None]:
df = df.drop(columns="age")

The final data looks like this.

In [None]:
df.head()

## Train, val and test splits

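We split the data into train, validation and test sets in a 60/20/20 ratio: 20% is held out for testing, then 25% of the remaining 80% is held out for validation.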

In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)
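
The two-stage split can be checked on a toy frame of 100 rows, where the resulting sizes are easy to read off. The frame and the names `train_toy`, `val_toy` and `test_toy` are made up for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows so that sizes read directly as percentages.
toy = pd.DataFrame({"x": range(100)})

# Same two calls as in the cell above.
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)
train_toy, val_toy = train_test_split(train_toy, test_size=0.25, random_state=42)

sizes = (len(train_toy), len(val_toy), len(test_toy))
print(sizes)
```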

## Preprocessing

In [None]:
ss = StandardScaler()

# Numerical attributes
cts_features = [
    "a_level",
    "gcse",
    "years_experience",
    "years_volunteer",
    "income",
    "it_skills",
    "years_gaps",
    "quality_cv",
]

train_df_scaled = train_df.copy()
val_df_scaled = val_df.copy()
test_df_scaled = test_df.copy()

train_df_scaled[cts_features] = ss.fit_transform(train_df[cts_features])
val_df_scaled[cts_features] = ss.transform(val_df[cts_features])
test_df_scaled[cts_features] = ss.transform(test_df[cts_features])
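
The scaler is fitted on the training set only, so validation and test values are standardised with the training mean and standard deviation. The sketch below illustrates this on a tiny made-up array and compares the result with the manual formula; `StandardScaler` uses the population standard deviation, which matches NumPy's default `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[4.0]])

# Fit on train only, then reuse the fitted statistics on val.
scaler = StandardScaler().fit(train)
val_scaled = scaler.transform(val)

# Manual equivalent using the training mean and standard deviation.
manual = (val - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(val_scaled, manual))
```

Fitting on the training set alone keeps information about the validation and test distributions out of the preprocessing step.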

## Save data

In [None]:
artifacts_dir = Path("../../artifacts")

In [None]:
artifacts_dir = Path("../../../artifacts")

In [None]:
# temporary platform specific directory
data_dir = artifacts_dir / "data" / "recruiting"

The data we generated is committed to the repository for reproducibility. However, feel free to regenerate your own version of the data and compare results.

In [None]:
# train_df.to_csv(data_dir / "raw" / "train.csv", index=False)
# test_df.to_csv(data_dir / "raw" / "test.csv", index=False)
# val_df.to_csv(data_dir / "raw" / "val.csv", index=False)

# train_df_scaled.to_csv(data_dir / "processed" / "train.csv", index=False)
# val_df_scaled.to_csv(data_dir / "processed" / "val.csv", index=False)
# test_df_scaled.to_csv(data_dir / "processed" / "test.csv", index=False)
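
If you regenerate the data, note that `to_csv` does not create missing parent directories. A minimal sketch of preparing the `raw` and `processed` folders before writing is shown below; it uses a temporary directory as a stand-in for `data_dir`, and the toy frame is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A temporary directory stands in for artifacts_dir / "data" / "recruiting".
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data" / "recruiting"

    # Create both output folders; exist_ok avoids errors on reruns.
    for sub in ("raw", "processed"):
        (data_dir / sub).mkdir(parents=True, exist_ok=True)

    # Made-up frame written and read back to confirm the round trip.
    toy = pd.DataFrame({"employed_yes": [0, 1]})
    out_path = data_dir / "raw" / "train.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)
    round_trip_ok = reloaded.equals(toy)

print(round_trip_ok)
```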