# Generate samples from identity terms and job titles/occupations

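This notebook creates a synthetic dataset for measuring occupational bias in sentiment analysis systems. It pairs every identity term with every job title in a Danish template sentence of the form "<identity term> er <job title>." ("er" means "is").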

# imports
import pandas as pd

# read data
identities = pd.read_excel("identity_terms.xlsx")
occupations = pd.read_excel("occupations.xlsx")

# display data
print(len(identities), "identities and", len(occupations), "occupations.")
print("This should yield", len(identities)*len(occupations), "sentences.")
display(identities.head())
display(occupations.head())

# merge data: cartesian product of the two DataFrames (how="cross" requires pandas >= 1.2)
corpus = identities.merge(occupations, how="cross")

# create synthetic sentences of the form "<identity_term> er <job_title>."
corpus["sentence"] = corpus["identity_term"] + " er " + corpus["job_title"] + "."
# uppercase the first character; note that str.capitalize() also lowercases the rest of the string
corpus["sentence"] = corpus["sentence"].str.capitalize()
print("\nResult:")
print(len(corpus), "sentences")
display(corpus.head())

# save df
corpus.to_excel("gender_corpus.xlsx", index=False)

48 identities and 50 occupations.
This should yield 2400 sentences.


|   | identity_term | gender |
|---|---|---|
| 0 | androgynen | Q |
| 1 | denne dame | F |
| 2 | denne fætter | M |
| 3 | denne herre | M |
| 4 | dette interkønnede individ | Q |


|   | job_title | gender_distribution |
|---|---|---|
| 0 | bager | female-dominated |
| 1 | bibliotekar | female-dominated |
| 2 | optiker | female-dominated |
| 3 | boghandler | female-dominated |
| 4 | praktiserende læge | female-dominated |



Result:
2400 sentences


|   | identity_term | gender | job_title | gender_distribution | sentence |
|---|---|---|---|---|---|
| 0 | androgynen | Q | bager | female-dominated | Androgynen er bager. |
| 1 | androgynen | Q | bibliotekar | female-dominated | Androgynen er bibliotekar . |
| 2 | androgynen | Q | optiker | female-dominated | Androgynen er optiker. |
| 3 | androgynen | Q | boghandler | female-dominated | Androgynen er boghandler. |
| 4 | androgynen | Q | praktiserende læge | female-dominated | Androgynen er praktiserende læge. |
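
Row 1 of the result above reads "Androgynen er bibliotekar ." with a space before the period. A plausible cause is trailing whitespace in the corresponding `job_title` cell of the spreadsheet, but that is an assumption, since the raw file is not shown here. The sketch below is one possible way to flag and repair such sentences. The helper names `normalize_sentence` and `has_stray_whitespace` are hypothetical and not part of the notebook.

```python
import re


def normalize_sentence(sentence: str) -> str:
    """Collapse whitespace runs, trim the ends, and drop spaces before the final period."""
    collapsed = re.sub(r"\s+", " ", sentence).strip()
    return re.sub(r"\s+\.$", ".", collapsed)


def has_stray_whitespace(sentence: str) -> bool:
    """Return True if normalizing would change the sentence."""
    return sentence != normalize_sentence(sentence)


# Sentences copied from the displayed result rows.
samples = [
    "Androgynen er bager.",
    "Androgynen er bibliotekar .",
    "Androgynen er praktiserende læge.",
]

flagged = [s for s in samples if has_stray_whitespace(s)]
cleaned = [normalize_sentence(s) for s in samples]

print(flagged)  # sentences that need repair
print(cleaned)  # repaired sentences, in the original order
```

In the notebook, the same helper could be applied before saving with `corpus["sentence"] = corpus["sentence"].map(normalize_sentence)`. Stripping the `identity_term` and `job_title` columns before building the sentences would address the same problem at its source.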
