This notebook reports how often the words used in the tests (embedding quality and WEAT) occur in each of the three corpora.

In addition, we determine the words to use for the WEAT tests. This is needed to avoid variability in the WEAT results caused by the two target sets, or the two attribute sets, having different numbers of words (within a corpus or between corpora). In particular, given the two target word sets X and Y and the two attribute word sets A and B, X and Y should contain the same number of words, and so should A and B. The sizes can diverge when one or more test words are missing from one of the corpora.

To prevent this, we use the same sets of target and attribute words across the three corpora. The procedure is the following:
1. Compute the relative frequency of each word in each of the three corpora, separately for each WEAT word set.

2. If some words are missing from at least one corpus, remove the same number of words from the other set of the pair. The words removed are the rarest ones: each word is assigned the minimum of its relative frequencies across the three corpora, and the words with the lowest values are the candidates to be discarded.
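
The scoring described in the two steps above can be sketched as follows. This is only an illustration: the words and counts are made up, and the variable names (`counts`, `fractions`, `aggregate`, `candidates`) are assumptions, not taken from the notebook's data.

```python
import pandas as pd

# Hypothetical counts of four test words in the three corpora (illustrative values only).
counts = pd.DataFrame(
    {
        "all": [120, 40, 30, 10],
        "male": [90, 30, 25, 5],
        "female": [30, 10, 5, 5],
    },
    index=["rose", "tulip", "daisy", "lily"],
)

# Step 1: relative frequency of each word within the word list, per corpus.
fractions = counts / counts.sum()

# Step 2: score each word by its lowest relative frequency across the corpora.
aggregate = fractions.min(axis=1)

# Words with the lowest score come first: they are the first candidates for removal.
candidates = aggregate.sort_values().index.tolist()
print(candidates)
```

Taking the minimum rather than the mean means a word is judged by the corpus in which it is rarest, so a word that is frequent overall but nearly absent from one corpus is still an early candidate for removal.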

In [1]:
%pylab inline
import pandas as pd
import json
import spacy

Populating the interactive namespace from numpy and matplotlib


In [2]:
def get_lemma(token):
    return token.lemma_ if token.lemma_!='-PRON-' else token.text

# this is the same model used to preprocess the lyrics
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [3]:
# load the word occurrences
# each row refers to a test
report = pd.read_json("../data/occurrence_test_words_in_person_corpora.json")
report = report.set_index(['test_name', 'test_type'])
report.head()

test_name,test_type,test_word_count_all_person,test_word_count_all_person_lemmatized,test_word_count_male_person,test_word_count_male_person_lemmatized,test_word_count_female_person,test_word_count_female_person_lemmatized
ws353,,"{'virtuoso': 14, 'cabbage': 377, 'government':...","{'virtuoso': 15, 'cabbage': 397, 'government':...","{'virtuoso': 14, 'cabbage': 334, 'government':...","{'virtuoso': 15, 'cabbage': 351, 'government':...","{'virtuoso': 0, 'cabbage': 43, 'government': 1...","{'virtuoso': 0, 'cabbage': 46, 'government': 1..."
toefl,,"{'conspicuous': 20, 'correlated': 2, 'dangerou...","{'conspicuous': 20, 'correlated': 0, 'dangerou...","{'conspicuous': 17, 'correlated': 2, 'dangerou...","{'conspicuous': 17, 'correlated': 0, 'dangerou...","{'conspicuous': 3, 'correlated': 0, 'dangerous...","{'conspicuous': 3, 'correlated': 0, 'dangerous..."
syn_sem,capital-common-countries,"{'ottawa': 34, 'beijing': 43, 'cairo': 60, 'sp...","{'ottawa': 34, 'beijing': 43, 'cairo': 60, 'sp...","{'ottawa': 31, 'beijing': 30, 'cairo': 48, 'sp...","{'ottawa': 31, 'beijing': 30, 'cairo': 48, 'sp...","{'ottawa': 3, 'beijing': 13, 'cairo': 12, 'spa...","{'ottawa': 3, 'beijing': 13, 'cairo': 12, 'spa..."
syn_sem,capital-world,"{'nigeria': 33, 'moldova': 0, 'somalia': 39, '...","{'nigeria': 33, 'moldova': 0, 'somalia': 39, '...","{'nigeria': 25, 'moldova': 0, 'somalia': 30, '...","{'nigeria': 25, 'moldova': 0, 'somalia': 30, '...","{'nigeria': 8, 'moldova': 0, 'somalia': 9, 'ot...","{'nigeria': 8, 'moldova': 0, 'somalia': 9, 'ot..."
syn_sem,currency,"{'baht': 2, 'nigeria': 33, 'peso': 57, 'dram':...","{'baht': 2, 'nigeria': 33, 'peso': 79, 'dram':...","{'baht': 2, 'nigeria': 25, 'peso': 48, 'dram':...","{'baht': 2, 'nigeria': 25, 'peso': 69, 'dram':...","{'baht': 0, 'nigeria': 8, 'peso': 9, 'dram': 1...","{'baht': 0, 'nigeria': 8, 'peso': 10, 'dram': ..."


In [4]:
# for each test, show the fraction of words occurring at least x = 100 times
# done for the embedding-quality tests on the non-lemmatized corpora (WEAT uses the lemmatized corpora)

x = 100

for test_name, row in report.iterrows():
    if 'WEAT' in test_name[0]: continue
    
    occ_all = report.loc[test_name].test_word_count_all_person
    frac_x_all = np.mean([1 if val>=x else 0 for val in occ_all.values()])
    occ_male = report.loc[test_name].test_word_count_male_person
    frac_x_male = np.mean([1 if val>=x else 0 for val in occ_male.values()])
    occ_female = report.loc[test_name].test_word_count_female_person
    frac_x_female = np.mean([1 if val>=x else 0 for val in occ_female.values()])
    
    print(test_name)
    print(f"Corpus all - fraction of words occurring more than {x} times: {round(frac_x_all*100, 2)}%")
    print(f"Corpus male - fraction of words occurring more than {x} times: {round(frac_x_male*100, 2)}%")
    print(f"Corpus female - fraction of words occurring more than {x} times: {round(frac_x_female*100, 2)}%")
    print()

('ws353', '')
Corpus all - fraction of words occurring more than 100 times: 76.43%
Corpus male - fraction of words occurring more than 100 times: 74.6%
Corpus female - fraction of words occurring more than 100 times: 48.97%

('toefl', '')
Corpus all - fraction of words occurring more than 100 times: 49.1%
Corpus male - fraction of words occurring more than 100 times: 44.5%
Corpus female - fraction of words occurring more than 100 times: 23.79%

('syn_sem', 'capital-common-countries')
Corpus all - fraction of words occurring more than 100 times: 47.83%
Corpus male - fraction of words occurring more than 100 times: 47.83%
Corpus female - fraction of words occurring more than 100 times: 15.22%

('syn_sem', 'capital-world')
Corpus all - fraction of words occurring more than 100 times: 14.22%
Corpus male - fraction of words occurring more than 100 times: 13.36%
Corpus female - fraction of words occurring more than 100 times: 3.88%

('syn_sem', 'currency')
Corpus all - fraction of words occu

In [5]:
# for each test, show the fraction of words occurring at least x = 100 times
# done for both the embedding-quality tests and the WEATs, on the lemmatized corpora

x = 100

for test_name, row in report.iterrows():
    
    occ_all = report.loc[test_name].test_word_count_all_person_lemmatized
    frac_x_all = np.mean([1 if val>=x else 0 for val in occ_all.values()])
    occ_male = report.loc[test_name].test_word_count_male_person_lemmatized
    frac_x_male = np.mean([1 if val>=x else 0 for val in occ_male.values()])
    occ_female = report.loc[test_name].test_word_count_female_person_lemmatized
    frac_x_female = np.mean([1 if val>=x else 0 for val in occ_female.values()])
    
    print(test_name)
    print(f"Corpus all - fraction of words occurring more than {x} times: {round(frac_x_all*100, 2)}%")
    print(f"Corpus male - fraction of words occurring more than {x} times: {round(frac_x_male*100, 2)}%")
    print(f"Corpus female - fraction of words occurring more than {x} times: {round(frac_x_female*100, 2)}%")
    print()

('ws353', '')
Corpus all - fraction of words occurring more than 100 times: 79.41%
Corpus male - fraction of words occurring more than 100 times: 76.89%
Corpus female - fraction of words occurring more than 100 times: 52.63%

('toefl', '')
Corpus all - fraction of words occurring more than 100 times: 39.64%
Corpus male - fraction of words occurring more than 100 times: 35.29%
Corpus female - fraction of words occurring more than 100 times: 20.72%

('syn_sem', 'capital-common-countries')
Corpus all - fraction of words occurring more than 100 times: 47.83%
Corpus male - fraction of words occurring more than 100 times: 47.83%
Corpus female - fraction of words occurring more than 100 times: 15.22%

('syn_sem', 'capital-world')
Corpus all - fraction of words occurring more than 100 times: 14.22%
Corpus male - fraction of words occurring more than 100 times: 13.36%
Corpus female - fraction of words occurring more than 100 times: 3.88%

('syn_sem', 'currency')
Corpus all - fraction of words o

### Select sets for WEAT

Ensure that all the words in the word sets are present in all three corpora. If a word is missing from one of the corpora, it is dropped, and the other set of the pair is then shortened by the same number of words, removing those with the lowest frequency of occurrence.
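
A compact way to express this selection for one pair of sets is sketched below. The helper `balance_pair`, the example words, and their counts are hypothetical and serve only to illustrate the idea; the cells that follow apply the same logic inline to the real WEAT sets.

```python
import pandas as pd

def balance_pair(table, first_words, second_words, corpora, min_count=5):
    """Hypothetical helper: filter two word sets, then trim the larger one."""
    def keep_frequent(words):
        subset = table.loc[words]
        # keep a word only if it reaches min_count in every corpus
        return subset[(subset[corpora] >= min_count).all(axis=1)]

    first, second = keep_frequent(first_words), keep_frequent(second_words)
    size = min(len(first), len(second))
    # keep the `size` words with the highest aggregate score in each set
    first = first.sort_values("aggregate", ascending=False).iloc[:size]
    second = second.sort_values("aggregate", ascending=False).iloc[:size]
    return first.index.tolist(), second.index.tolist()

corpora = ["all", "male", "female"]
# Made-up counts, for illustration only.
table = pd.DataFrame(
    {
        "all": [50, 20, 10, 40, 9, 30, 12],
        "male": [30, 15, 8, 30, 4, 20, 6],
        "female": [20, 5, 2, 10, 5, 9, 6],
    },
    index=["rose", "tulip", "daisy", "wasp", "ant", "bee", "fly"],
)
# aggregate = lowest relative frequency of the word across the three corpora
table["aggregate"] = (table[corpora] / table[corpora].sum()).min(axis=1)

flowers, insects = balance_pair(
    table, ["rose", "tulip", "daisy"], ["wasp", "ant", "bee", "fly"], corpora
)
print(flowers, insects)
```

In this toy example "daisy" and "ant" fall below `min_count` in one corpus, which leaves two flowers and three insects; "fly" has the lowest aggregate score among the remaining insects and is removed to equalize the sizes. Unlike the notebook's cell, this sketch returns the words sorted by score rather than in their original order.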

In [6]:
report_weats = report.loc[['WEAT', 'WEAT2', 'WEAT3']]

In [7]:
# this is the min_count parameter we will use to train Word2Vec: words occurring fewer than 5 times are ignored
min_count = 5

In [8]:
# the following generates the new sets of words to be used
# each pair (A, B) and (X, Y) will have the same number of items

# load weat tests
weat_words = json.load(open("../data/Data_WEAT/weat_attrib_target.json"))
weat2_words = json.load(open("../data/Data_WEAT/weat_attrib_target_2.json"))
weat3_words = json.load(open("../data/Data_WEAT/weat_attrib_target_3.json"))

genders = ['all', 'male', 'female']

for idx, row in report_weats.iterrows():
    
    which_weat, test_name = idx[0], idx[1]
    print(which_weat, test_name)
    
    # read the test words
    if which_weat=='WEAT':
        weat_words_dict = weat_words
    elif which_weat=='WEAT2':
        weat_words_dict = weat2_words
    elif which_weat=='WEAT3':
        weat_words_dict = weat3_words
    else:
        raise ValueError(f"Unknown WEAT group: {which_weat}")
        
    A_key = weat_words_dict[test_name]['A_key']
    A_words = weat_words_dict[test_name][A_key]

    B_key = weat_words_dict[test_name]['B_key']
    B_words = weat_words_dict[test_name][B_key]

    X_key = weat_words_dict[test_name]['X_key']
    X_words = weat_words_dict[test_name][X_key]

    Y_key = weat_words_dict[test_name]['Y_key']
    Y_words = weat_words_dict[test_name][Y_key]
    
    # make a table with the occurrence of all the words for each corpus
    test_word_occurrences_in_corpora = {gender:report_weats.loc[which_weat, test_name][f'test_word_count_{gender}_person_lemmatized'] 
                                        for gender in genders}
    
    test_word_occurrences_in_corpora = pd.DataFrame(test_word_occurrences_in_corpora)
    test_word_occurrences_in_corpora = test_word_occurrences_in_corpora\
                                .merge(test_word_occurrences_in_corpora / test_word_occurrences_in_corpora.sum(),
                                       left_index=True, right_index=True, suffixes=('', '_frac'))
        
    test_word_occurrences_in_corpora.loc[:, 'aggregate'] = test_word_occurrences_in_corpora[['all_frac', 
                                                                                             'male_frac', 
                                                                                             'female_frac']].min(axis=1)
    
    
    # check whether the two sets of a pair have a different number of words
    # A and B
    A_word_occurrences_in_corpora = test_word_occurrences_in_corpora.loc[A_words]
    B_word_occurrences_in_corpora = test_word_occurrences_in_corpora.loc[B_words]
    
    # remove the words with a count lower than min_count
    # a word is removed if it occurs fewer than min_count times in at least one of the three corpora
    A_word_occurrences_in_corpora = A_word_occurrences_in_corpora[(A_word_occurrences_in_corpora[genders]>=min_count).all(axis=1)]
    B_word_occurrences_in_corpora = B_word_occurrences_in_corpora[(B_word_occurrences_in_corpora[genders]>=min_count).all(axis=1)]

    print("Number of words in set A: ", A_word_occurrences_in_corpora.shape[0])
    print("Number of words in set B: ", B_word_occurrences_in_corpora.shape[0])
    if A_word_occurrences_in_corpora.shape[0]>B_word_occurrences_in_corpora.shape[0]:
        
        n_to_remove = A_word_occurrences_in_corpora.shape[0] - B_word_occurrences_in_corpora.shape[0]
        
        A_word_occurrences_in_corpora = A_word_occurrences_in_corpora.sort_values('aggregate')
        words_removed = A_word_occurrences_in_corpora.iloc[:n_to_remove].index.tolist()
        A_word_occurrences_in_corpora = A_word_occurrences_in_corpora.iloc[n_to_remove:]
        
        print("Words removed from set A: ", ", ".join(words_removed))
        
    elif A_word_occurrences_in_corpora.shape[0]<B_word_occurrences_in_corpora.shape[0]:
        
        n_to_remove = B_word_occurrences_in_corpora.shape[0] - A_word_occurrences_in_corpora.shape[0]
        
        B_word_occurrences_in_corpora = B_word_occurrences_in_corpora.sort_values('aggregate')
        words_removed = B_word_occurrences_in_corpora.iloc[:n_to_remove].index.tolist()
        B_word_occurrences_in_corpora = B_word_occurrences_in_corpora.iloc[n_to_remove:]
        
        print("Words removed from set B: ", ", ".join(words_removed))

    else: pass
    
    # X and Y
    X_word_occurrences_in_corpora = test_word_occurrences_in_corpora.loc[X_words]
    Y_word_occurrences_in_corpora = test_word_occurrences_in_corpora.loc[Y_words]
    
    # remove the words occurring fewer than min_count times in at least one of the three corpora
    X_word_occurrences_in_corpora = X_word_occurrences_in_corpora[(X_word_occurrences_in_corpora[genders]>=min_count).all(axis=1)]
    Y_word_occurrences_in_corpora = Y_word_occurrences_in_corpora[(Y_word_occurrences_in_corpora[genders]>=min_count).all(axis=1)]

    print("Number of words in set X: ", X_word_occurrences_in_corpora.shape[0])
    print("Number of words in set Y: ", Y_word_occurrences_in_corpora.shape[0])
    if X_word_occurrences_in_corpora.shape[0]>Y_word_occurrences_in_corpora.shape[0]:
        
        n_to_remove = X_word_occurrences_in_corpora.shape[0] - Y_word_occurrences_in_corpora.shape[0]
        
        X_word_occurrences_in_corpora = X_word_occurrences_in_corpora.sort_values('aggregate')
        words_removed = X_word_occurrences_in_corpora.iloc[:n_to_remove].index.tolist()
        X_word_occurrences_in_corpora = X_word_occurrences_in_corpora.iloc[n_to_remove:]
        
        print("Words removed from set X: ", ", ".join(words_removed))
        
    elif X_word_occurrences_in_corpora.shape[0]<Y_word_occurrences_in_corpora.shape[0]:
        
        n_to_remove = Y_word_occurrences_in_corpora.shape[0] - X_word_occurrences_in_corpora.shape[0]
        
        Y_word_occurrences_in_corpora = Y_word_occurrences_in_corpora.sort_values('aggregate')
        words_removed = Y_word_occurrences_in_corpora.iloc[:n_to_remove].index.tolist()
        Y_word_occurrences_in_corpora = Y_word_occurrences_in_corpora.iloc[n_to_remove:]
        
        print("Words removed from set Y: ", ", ".join(words_removed))

    else: pass
        
        
    # update the in-memory WEAT dictionaries (they are saved to new files after the loop)
    # these are the final sets of words
    if which_weat=='WEAT':
                
        weat_words[test_name][A_key] = A_word_occurrences_in_corpora.index.tolist()
        weat_words[test_name][B_key] = B_word_occurrences_in_corpora.index.tolist()
        weat_words[test_name][X_key] = X_word_occurrences_in_corpora.index.tolist()
        weat_words[test_name][Y_key] = Y_word_occurrences_in_corpora.index.tolist()
        
    elif which_weat=='WEAT2':
        
        weat2_words[test_name][A_key] = A_word_occurrences_in_corpora.index.tolist()
        weat2_words[test_name][B_key] = B_word_occurrences_in_corpora.index.tolist()
        weat2_words[test_name][X_key] = X_word_occurrences_in_corpora.index.tolist()
        weat2_words[test_name][Y_key] = Y_word_occurrences_in_corpora.index.tolist()
    elif which_weat=='WEAT3':
        weat3_words[test_name][A_key] = A_word_occurrences_in_corpora.index.tolist()
        weat3_words[test_name][B_key] = B_word_occurrences_in_corpora.index.tolist()
        weat3_words[test_name][X_key] = X_word_occurrences_in_corpora.index.tolist()
        weat3_words[test_name][Y_key] = Y_word_occurrences_in_corpora.index.tolist()
        
    else:
        raise ValueError(f"Unknown WEAT group: {which_weat}")
                
    print()
    print()
        
# save these new objects (the context managers make sure each file is flushed and closed)
with open('../data/Data_WEAT/weat_attrib_target_same_length.json', "wt") as f:
    json.dump(weat_words, f)
with open('../data/Data_WEAT/weat_attrib_target_2_same_length.json', "wt") as f:
    json.dump(weat2_words, f)
with open('../data/Data_WEAT/weat_attrib_target_3_same_length.json', "wt") as f:
    json.dump(weat3_words, f)
    

WEAT EuropeanAmerican_AfricanAmerican_Pleasant_Unpleasant
Number of words in set A:  25
Number of words in set B:  25
Number of words in set X:  44
Number of words in set Y:  10
Words removed from set X:  colleen, megan, wilbur, jonathan, shannon, josh, melanie, brandon, todd, lauren, amanda, brad, matthew, stephanie, betsy, greg, alan, stephen, ryan, andrew, katie, roger, ellen, sara, justin, wendy, hank, rachel, fred, peggy, donna, ian, emily, nancy


WEAT EuropeanAmerican_AfricanAmerican_Pleasant_Unpleasant_2
Number of words in set A:  33
Number of words in set B:  33
Number of words in set X:  14
Number of words in set Y:  7
Words removed from set X:  laurie, brett, todd, allison, brad, matthew, carrie


WEAT Flowers_Insects_Pleasant_Unpleasant
Number of words in set A:  25
Number of words in set B:  25
Number of words in set X:  17
Number of words in set Y:  19
Words removed from set Y:  wasp, hornet


WEAT Male_Female_Career_Family
Number of words in set A:  8
Number of words in 

Now get the lists of all the unique words involved in each WEAT group

In [9]:
def read_words_WEAT(WEAT_test_words):
    '''
    This returns all the unique words of the WEAT tests.
    '''

    # get all sets of words
    attributes = WEAT_test_words['attributes']
    targets = WEAT_test_words['targets']
    method = WEAT_test_words['method']
    #print('Attributes (A and B): ', attributes)
    #print('Targets (X and Y): ', targets)
    #print('Method: ', method)
    #print()

    A_key = WEAT_test_words['A_key']
    A_words = WEAT_test_words[A_key]
    #print('A_key: ', A_key)
    #print(f'A_words ({len(A_words)})): ', ','.join(A_words))

    B_key = WEAT_test_words['B_key']
    B_words = WEAT_test_words[B_key]
    #print('B_key: ', B_key)
    #print(f'B_words ({len(B_words)})): ', ','.join(B_words))

    X_key = WEAT_test_words['X_key']
    X_words = WEAT_test_words[X_key]
    #print('X_key: ', X_key)
    #print(f'X_words ({len(X_words)})): ', ','.join(X_words))

    Y_key = WEAT_test_words['Y_key']
    Y_words = WEAT_test_words[Y_key]
    #print('Y_key: ', Y_key)
    #print(f'Y_words ({len(Y_words)})): ', ','.join(Y_words))
    #print()
    
    return A_words + B_words + X_words + Y_words

In [10]:

weats = ['../data/Data_WEAT/weat_attrib_target_same_length.json', 
         '../data/Data_WEAT/weat_attrib_target_2_same_length.json', 
         '../data/Data_WEAT/weat_attrib_target_3_same_length.json']
out_files = ['../data/Data_WEAT/all_weat_words_same_length.txt', 
             '../data/Data_WEAT/all_weat2_words_same_length.txt', 
             '../data/Data_WEAT/all_weat3_words_same_length.txt']

for weat_wefat_file, output_file in zip(weats, out_files):
        
    # load file with word tests
    weat_associations = json.load(open(weat_wefat_file))
    all_weat_tests = [k for k, v in weat_associations.items() 
                      if type(v) is dict and 'method' in v.keys() and v['method']=='weat']
    
    all_wefat_tests = [k for k, v in weat_associations.items() if type(v) is dict and 'method' in v.keys() 
                  and v['method']=='wefat']

    # get words used for WEAT test
    all_WEAT_test_words = set()
    for which_test in all_weat_tests:
        all_WEAT_test_words_ = read_words_WEAT(weat_associations[which_test])
        all_WEAT_test_words.update(all_WEAT_test_words_)
    print('Number unique words WEAT tests: ', len(all_WEAT_test_words))
    
          
    # save the unique WEAT words, one per line (WEFAT words are not collected here)
    with open(output_file, 'wt') as ww:
        for w in all_WEAT_test_words:
            ww.write(w+"\n")

Number unique words WEAT tests:  229
Number unique words WEAT tests:  96
Number unique words WEAT tests:  283
