# Generate samples from identity terms and occupations

This notebook builds a synthetic dataset for measuring gender bias in sentiment analysis systems. Every identity term is combined with every occupation in a fixed sentence template, once in Danish and once in English.
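The template idea can be sketched on a tiny scale before running it on the full Excel files. This is a minimal illustration, not the notebook's own cell: the two toy DataFrames are hypothetical stand-ins for `identity_terms.xlsx` and `occupations.xlsx`, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical toy inputs; the real notebook reads these from Excel files.
identities = pd.DataFrame({
    "identity_term_EN": ["this lady", "this gentleman"],
    "gender": ["F", "M"],
})
occupations = pd.DataFrame({"job_title_EN": ["baker", "librarian", "optician"]})

# Cartesian product: every identity term paired with every occupation
# (how="cross" assumes pandas >= 1.2).
pairs = identities.merge(occupations, how="cross")

# Fill the English sentence template.
pairs["sentence_EN"] = (
    pairs["identity_term_EN"] + " is a(n) " + pairs["job_title_EN"] + "."
)

print(len(pairs), "sentences")
print(pairs["sentence_EN"].tolist())
```

Because only the identity term varies between otherwise identical sentences, any systematic difference in predicted sentiment can be attributed to that term.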

In [1]:
# imports
import pandas as pd

# read data
identities = pd.read_excel("identity_terms.xlsx")
occupations = pd.read_excel("occupations.xlsx")

# display data
print(len(identities), "identities and", len(occupations), "occupations.")
print("This should yield", len(identities)*len(occupations), "sentences.")
display(identities.head())
display(occupations.head())

# merge data: cartesian product of the two dfs (every identity paired with every occupation)
corpus = identities.merge(occupations, how="cross") # requires pandas >= 1.2

# create synthetic sentences (strip() removes leading and trailing whitespace)

# in Danish
corpus["identity_term_DA"] = corpus["identity_term_DA"].str.strip()
corpus["job_title_DA"] = corpus["job_title_DA"].str.strip()
corpus["sentence_DA"] = corpus["identity_term_DA"] + " er " + corpus["job_title_DA"] + "." # create sentence
# upper-case only the first character; str.capitalize() would also lower-case the rest of the sentence
corpus["sentence_DA"] = corpus["sentence_DA"].str[:1].str.upper() + corpus["sentence_DA"].str[1:]

# in English
corpus["identity_term_EN"] = corpus["identity_term_EN"].str.strip()
corpus["job_title_EN"] = corpus["job_title_EN"].str.strip()
corpus["sentence_EN"] = corpus["identity_term_EN"] + " is a(n) " + corpus["job_title_EN"] + "." # create sentence
# upper-case only the first character; str.capitalize() would also lower-case the rest of the sentence
corpus["sentence_EN"] = corpus["sentence_EN"].str[:1].str.upper() + corpus["sentence_EN"].str[1:]

print("\nResult:")
print(len(corpus), "sentences")
display(corpus.head())

# save df
corpus.to_excel("gender_corpus.xlsx", index=False)
print("Successfully saved corpus!")

48 identities and 50 occupations.
This should yield 2400 sentences.


Unnamed: 0,identity_term_DA,identity_term_EN,gender
0,androgynen,the androgynous person,Q
1,denne dame,this lady,F
2,denne fætter,this male cousin,M
3,denne herre,this gentleman,M
4,dette interkønnede individ,this intersex individual,Q


Unnamed: 0,job_title_DA,job_title_EN,gender_distribution
0,bager,baker,female-dominated
1,bibliotekar,librarian,female-dominated
2,optiker,optician,female-dominated
3,boghandler,bookseller,female-dominated
4,praktiserende læge,general practitioner,female-dominated



Result:
2400 sentences


Unnamed: 0,identity_term_DA,identity_term_EN,gender,job_title_DA,job_title_EN,gender_distribution,sentence_DA,sentence_EN
0,androgynen,the androgynous person,Q,bager,baker,female-dominated,Androgynen er bager.,The androgynous person is a(n) baker.
1,androgynen,the androgynous person,Q,bibliotekar,librarian,female-dominated,Androgynen er bibliotekar.,The androgynous person is a(n) librarian.
2,androgynen,the androgynous person,Q,optiker,optician,female-dominated,Androgynen er optiker.,The androgynous person is a(n) optician.
3,androgynen,the androgynous person,Q,boghandler,bookseller,female-dominated,Androgynen er boghandler.,The androgynous person is a(n) bookseller.
4,androgynen,the androgynous person,Q,praktiserende læge,general practitioner,female-dominated,Androgynen er praktiserende læge.,The androgynous person is a(n) general practit...


Successfully saved corpus!
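Before using the saved corpus, it may be worth checking that the cross product is complete. The sketch below is one possible check, not part of the original notebook: `check_corpus` is a hypothetical helper, and the toy DataFrame stands in for `gender_corpus.xlsx`, using the column names shown in the output above.

```python
import pandas as pd

def check_corpus(corpus, n_identities, n_occupations):
    """Return True if corpus looks like a complete identity x occupation cross product."""
    # Expected size: one row per (identity, occupation) pair.
    complete = len(corpus) == n_identities * n_occupations
    # No pair should occur twice.
    unique = not corpus.duplicated(subset=["identity_term_EN", "job_title_EN"]).any()
    # Every row should have both sentences filled in.
    no_missing = not corpus[["sentence_DA", "sentence_EN"]].isna().any().any()
    return bool(complete and unique and no_missing)

# Hypothetical toy corpus in place of the saved Excel file.
toy = pd.DataFrame({
    "identity_term_EN": ["this lady", "this lady", "this gentleman", "this gentleman"],
    "job_title_EN": ["baker", "librarian", "baker", "librarian"],
    "gender": ["F", "F", "M", "M"],
    "sentence_DA": [
        "Denne dame er bager.", "Denne dame er bibliotekar.",
        "Denne herre er bager.", "Denne herre er bibliotekar.",
    ],
    "sentence_EN": [
        "This lady is a(n) baker.", "This lady is a(n) librarian.",
        "This gentleman is a(n) baker.", "This gentleman is a(n) librarian.",
    ],
})

ok = check_corpus(toy, n_identities=2, n_occupations=2)
per_gender = toy.groupby("gender").size().to_dict()
print(ok, per_gender)
```

For the real data the same call would take the lengths of `identities` and `occupations`. The per-gender counts show how many sentences each gender label contributes, which matters when comparing average sentiment across groups.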
