# 1. Description of datasets and EDA

In [2]:
import pandas as pd

## 1.1 Corpus datasets for training

The "corpus" datasets are used to train word embedding models (i.e. by unsupervised learning), which produce a vector for each word. The similarity between two vectors then represents the semantic association of the corresponding words in the corpus. The corpora are "from the wild" and should include examples of societal bias.
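
As a rough illustration of "similarity of vectors", the sketch below computes cosine similarity, one common choice of similarity measure for word vectors. The choice of cosine similarity is an assumption here, and the three-dimensional vectors in `toy_vectors` are made up for illustration rather than taken from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only (not trained embeddings).
toy_vectors = {
    "he": [0.9, 0.1, 0.0],
    "she": [-0.9, 0.1, 0.0],
    "ambitious": [0.6, 0.7, 0.1],
}

sim_he = cosine_similarity(toy_vectors["ambitious"], toy_vectors["he"])
sim_she = cosine_similarity(toy_vectors["ambitious"], toy_vectors["she"])
print(f"ambitious~he: {sim_he:.3f}, ambitious~she: {sim_she:.3f}")
```

With real embeddings, the same comparison would be run on the vectors learnt from the corpora described in this section.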

### 1.1.1 [BUG dataset](https://github.com/SLAB-NLP/BUG/blob/main/data.tar.gz) from the [paper](https://arxiv.org/pdf/2109.03858.pdf)

In [3]:
with open("data/corpus/dataset_BUG.txt", "r") as f:
    print(f.readline())

"Patient number 2 was isolated with his wife that had been diagnosed with SARS-CoV-2 infection , with similar lesions that patient number 1 ."



### 1.1.2 [Doughman et al. dataset](https://github.com/jaddoughman/Gender-Bias-Datasets-Lexicons/blob/main/generic_pronouns/dataset.csv) from the [paper](https://arxiv.org/pdf/2201.08675.pdf): rename to `doughman_dataset.csv`

In [4]:
with open("data/corpus/dataset_doughman.txt", "r") as f:
    print(f.readline())

"( all these are being observed , recorded by a rhymist by his/her deep thoughts and observations)."



### 1.1.3 [Wikipedia biographies](https://rlebret.github.io/wikipedia-biography-dataset/) dataset, originally created for text generation

In [5]:
with open("data/corpus/dataset_wikibios.txt", "r") as f:
    print(f.readline())

leonard shenoff randle -lrb- born february 12 , 1949 -rrb- is a former major league baseball player .



## 1.2 Lexicon datasets for testing

The "lexicon" datasets are used to test the learnt vectors for particular words. We expect words that psychological/sociological studies associate with gender bias to lie close to gendered words in the learnt vector representations.

Note that in all cases, 1 represents male bias, -1 represents female bias, 0 represents neutral/no bias.
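
One way to connect the label convention to vector similarities is sketched below. It is only an assumption about how a score could be turned into a label: the function name `label_from_similarities` and the `margin` threshold are hypothetical and do not come from the datasets or papers.

```python
# Label convention used by the lexicons in this section.
MALE, FEMALE, NEUTRAL = 1, -1, 0


def label_from_similarities(sim_male, sim_female, margin=0.1):
    """Map similarities to male and female anchor words onto a lexicon label.

    `margin` is a hypothetical threshold: if the two similarities differ by
    no more than this amount, the word is treated as neutral.
    """
    diff = sim_male - sim_female
    if diff > margin:
        return MALE
    if diff < -margin:
        return FEMALE
    return NEUTRAL


# Illustrative similarity values, not measured results.
print(label_from_similarities(0.62, 0.15))
print(label_from_similarities(0.10, 0.70))
print(label_from_similarities(0.30, 0.35))
```

The margin would need tuning against the test lexicon; a margin of zero never returns the neutral label unless the two similarities are exactly equal.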

### 1.2.1 Test lexicon
The first lexicon is purely a test of accuracy: it checks whether unambiguously gendered words and unambiguously ungendered words are correctly classified.

In [6]:
pd.read_csv("data/lexicons/test_lexicon.csv").head()

Unnamed: 0,word,label
0,he,1
1,him,1
2,father,1
3,fatherly,1
4,male,1
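
The accuracy check described above can be sketched as follows. The inline CSV reproduces only the five rows printed above, and the `predictions` dictionary is a made-up stand-in for the output of a classifier, so the resulting number is illustrative and not a result.

```python
import csv
import io

# Inline copy of the five rows shown above; the empty first header field is
# what pandas displays as "Unnamed: 0".
lexicon_csv = """,word,label
0,he,1
1,him,1
2,father,1
3,fatherly,1
4,male,1
"""


def accuracy(lexicon, predictions):
    """Fraction of lexicon words whose predicted label equals the lexicon label."""
    correct = sum(
        1 for word, label in lexicon.items() if predictions.get(word) == label
    )
    return correct / len(lexicon)


reader = csv.DictReader(io.StringIO(lexicon_csv))
lexicon = {row["word"]: int(row["label"]) for row in reader}

# Hypothetical predictions, for illustration only.
predictions = {"he": 1, "him": 1, "father": 1, "fatherly": 0, "male": 1}

print(f"accuracy: {accuracy(lexicon, predictions):.2f}")
```

A word missing from `predictions` counts as misclassified here, which keeps the denominator fixed at the size of the lexicon.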


### 1.2.2 "Bias" lexicon
The second lexicon comes from the Gaucher et al. and Konnikov et al. papers and contains words from job adverts that are associated either with gender bias or with no gender bias.

In [7]:
pd.read_csv("data/lexicons/bias_lexicon.csv").head()

Unnamed: 0,word,label
0,active,1
1,adventurous,1
2,aggressive,1
3,ambitious,1
4,analytical,1
