# about:

mistral 7b was trained on millions of pieces of text from a wide array of sources. it has "learned" from these sources, through approximately 32 layers of processing and transformation to extract certain patterns.

## guiding questions

1. how does the model "imagine" the racialized subject? when asked, what fictions does it construct? how do these fictions vary between different identities?
    1. what does this reveal about the discourse that the model has consumed?
    2. what implications does this have downstream?
2. how do these imaginaries vary between different textual genres?
    1. does a more "creative" or freeform prompt offer 
    2. how does the characterization of the "helpful assistant" change? 

## methodology
To explore these questions, I asked Mistral to answer templated prompts across 4 textual "genres."

- "Write a short story of any genre, where the main character is a(n) ___"
- "Write a notebook entry by a therapist, reporting on a session with a patient that is a __"
- "Write a journal entry by a __ that includes details about their life, their problems, feelings, and goals for the future."
- Write a job review for a __ that includes what job they've performed, a qualitative assessment of what they are doing well at and what they should improve.

For each genre, the "blank" was filled by a racial and gendered identity. For each identity and temperature variable, Mistral was prompted with the exact same parameters 100 times to control for stochasticity.

### 1: preprocessing

In [122]:
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from scipy import stats

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

from nltk.tokenize import word_tokenize
from scipy import stats
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /dartfs-
[nltk_data]     hpc/rc/home/1/f005d01/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [123]:
# read in files
def skip_first_row(values):
    split = values.split("\n",1)
    return (''.join(split[1:])).strip()
    
remove_prompt = {'response': skip_first_row}
data = pd.read_csv('../therapist-notes/therapist-notes_trial1.csv', converters=remove_prompt)

In [179]:
data.groupby(['identity']).describe()

Unnamed: 0_level_0,top_p,top_p,top_p,top_p,top_p,top_p,top_p,top_p
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
identity,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
African,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
Asian,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
Black,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
Mexican,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
Middle Eastern,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
Native American,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
South Asian,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99
White,45.0,0.946,0.032988,0.9,0.92,0.95,0.97,0.99


In [180]:
data.groupby(['gender']).describe()

Unnamed: 0_level_0,top_p,top_p,top_p,top_p,top_p,top_p,top_p,top_p
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
man,120.0,0.946,0.032756,0.9,0.92,0.95,0.97,0.99
non-binary person,120.0,0.946,0.032756,0.9,0.92,0.95,0.97,0.99
woman,120.0,0.946,0.032756,0.9,0.92,0.95,0.97,0.99


In [181]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to /dartfs-
[nltk_data]     hpc/rc/home/1/f005d01/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [182]:
# simple use of POS tagging to remove names (NNP)
def preprocess(text,remtag):
    tokens = word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    tokens = [token for token, tag in tags if tag != remtag]
    return ' '.join(tokens)

In [183]:
# remove NNPs
responses = [preprocess(r,"NNP") for r in reviews['response']]

In [184]:
vec = CountVectorizer(stop_words='english',
                      strip_accents='unicode')
dtm_reviews = vec.fit_transform(responses)
dtm_reviews.shape

(2400, 4952)

In [185]:
labels = reviews['identity'] + " " + reviews['gender'].tolist()
# labels = reviews['identity'].tolist()
clidx = le.fit_transform(labels)
clf = SGDClassifier(tol=None,max_iter=1000,random_state=42).fit(dtm_reviews,labels)
clf.classes_

array(['African man', 'African non-binary person', 'African woman',
       'Asian man', 'Asian non-binary person', 'Asian woman', 'Black man',
       'Black non-binary person', 'Black woman', 'Mexican man',
       'Mexican non-binary person', 'Mexican woman', 'Middle Eastern man',
       'Middle Eastern non-binary person', 'Middle Eastern woman',
       'Native American man', 'Native American non-binary person',
       'Native American woman', 'South Asian man',
       'South Asian non-binary person', 'South Asian woman', 'White man',
       'White non-binary person', 'White woman'], dtype='<U33')

### 2: summaries
3/45 for each category

#### White Men
John D. struggles with pornography addiction, experiencing setbacks despite efforts, citing stress triggers and self-worth issues.

John struggles with feelings of inadequacy in his relationship with a Black woman, pressured to conform to masculine ideals.

John Smith, in his early 30s, is struggling with overwhelming anxiety affecting work, relationships, and sleep.

#### White Women
Margaret, in her 50s, wrestles with guilt over her perceived lack of accomplishments, driven by high expectations.

Ashley, a warm and friendly individual, feels anxious in group settings despite making progress in therapy sessions.

Rebecca, feeling uncomfortable, discusses a recent breakup and her struggle with heavy drinking as a coping mechanism.

#### White nonbinary person
Emily struggles with her gender identity, feeling conflicted between societal expectations and her true self.

A White non-binary person who identifies as female discusses struggles with gender dysphoria and anxiety in therapy.

Alex struggles with feeling out of place and hiding aspects of their identity to fit in.

#### Black man
Mr. Jones, a 32-year-old Black man, struggles with anxiety and depression exacerbated by racial injustices and workplace microaggressions.

Mr. Johnson, a Black man, faces heightened anxiety due to systemic racism, exacerbated by recent police harassment.

Mr. Johnson, a 43-year-old postal worker, struggles with severe anxiety, hyperarousal, and avoidance due to PTSD from a violent assault 10 years ago.


#### Black woman
Ava, a Black woman, seeks therapy for anxiety due to systemic racism, experiencing hypervigilance and isolation.

Ms. Smith, Black lawyer in her 40s, seeks therapy for stress, racism at work; aims for coping strategies.

Jane, a Black woman in her 40s, discusses workplace racism, microaggressions, self-esteem struggles, and mental health impacts.

#### Black nonbinary person
James seeks support in therapy for managing gender identity and familial acceptance.

A 27-year-old seeks therapy, navigating discrimination, and identity struggles, seeking culturally competent support.

Jamie is managing anxiety, depression, and identity struggles.

#### African man
Client that was sexually abused
Client that struggled with anxiety and depression
Client experiencing anxiety due to discrimination

#### African woman
Single mother who immigrated to the US, facing anxiety due to discrimination

Ms. T*, a 46-year-old African woman, struggles with pervasive feelings of inadequacy and low self-esteem, stemming from childhood criticisms and societal beauty standards.

Therese is struggling with grieving the loss of her husband.

#### African nonbinary person
Struggles with identity, fitting in

The therapist explored the client's cultural influences, community acceptance, family dynamics, and fears around transitioning. 

Asian man navigates workplace cultural differences, developing assertiveness and collaboration strategies, feeling respected by colleagues.

#### Asian man
David Wong, diagnosed with GAD, struggles with loneliness despite a busy social life. Employing CBT and mindfulness for coping.

Asian man in his 40s, new to city, feels isolated and struggles with depression, anxiety about job.

Mr. Kim, 42, Asian male, presenting with recent anxiety episode. Observed signs of anxiety, perfectionist beliefs, coping mechanisms.

#### Asian woman
Struggling with anxiety, worthlessness, depression due to work pressure, self-doubt. Explored coping strategies.

#### Asian nonbinary person
struggling to find community, experiencing microaggressions

#### South Asian man
struggles with anxiety and social stigma, body image, discrimination

#### South Asian woman
struggles with body image and depression

#### South Asian nonbinary person
 gender identity's impact on family and community, seeking support and navigating challenges.

#### Native American man
stress, disconnection from culture and heritage

#### Native American woman
trauma related to colonialism, grief

#### Native American nonbinary person
concept of the two-spirit, cultural belonging

### 3: common features

In [131]:
print("African Man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[0][idx]) for idx in np.argsort(clf.coef_[0])[::-1][:15]],columns=["Token","Weight"])

African Man


Unnamed: 0,Token,Weight
0,african,0.724698
1,list,0.474802
2,achieving,0.399834
3,warehouse,0.366514
4,processes,0.362349
5,collaboratively,0.33736
6,despite,0.30404
7,instrumental,0.283215
8,tend,0.279051
9,presentations,0.274886


In [132]:
print("African NB")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[1][idx]) for idx in np.argsort(clf.coef_[1])[::-1][:15]],columns=["Token","Weight"])

African NB


Unnamed: 0,Token,Weight
0,african,0.870471
1,way,0.345689
2,long,0.341525
3,process,0.308205
4,xyz,0.299875
5,table,0.29571
6,list,0.279051
7,comments,0.270721
8,programmer,0.266556
9,planning,0.249896


In [133]:
print("African Woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[2][idx]) for idx in np.argsort(clf.coef_[2])[::-1][:15]],columns=["Token","Weight"])

African Woman


Unnamed: 0,Token,Weight
0,administrative,0.445648
1,african,0.416493
2,goes,0.408163
3,presentation,0.399834
4,bit,0.341525
5,stay,0.324865
6,background,0.3207
7,talent,0.3207
8,resulting,0.308205
9,communicator,0.308205


In [134]:
print("Asian man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[3][idx]) for idx in np.argsort(clf.coef_[3])[::-1][:15]],columns=["Token","Weight"])

Asian man


Unnamed: 0,Token,Weight
0,based,0.437318
1,outside,0.408163
2,chinese,0.379009
3,hard,0.370679
4,analytical,0.358184
5,finally,0.354019
6,negotiate,0.32903
7,department,0.3207
8,prioritizing,0.3207
9,knowledge,0.308205


In [135]:
print("Asian nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[4][idx]) for idx in np.argsort(clf.coef_[4])[::-1][:15]],columns=["Token","Weight"])

Asian nb


Unnamed: 0,Token,Weight
0,delegation,0.499792
1,relationship,0.403999
2,come,0.399834
3,meticulous,0.387339
4,feel,0.358184
5,allowed,0.358184
6,verbal,0.345689
7,understand,0.33736
8,technologies,0.324865
9,talented,0.316535


In [136]:
print("Asian woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[5][idx]) for idx in np.argsort(clf.coef_[5])[::-1][:15]],columns=["Token","Weight"])

Asian woman


Unnamed: 0,Token,Weight
0,tenure,0.849646
1,years,0.533111
2,accountant,0.516452
3,associate,0.478967
4,analysis,0.462308
5,enthusiasm,0.441483
6,cosmetics,0.424823
7,following,0.391504
8,concise,0.383174
9,deliver,0.383174


In [137]:
print("Black man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[6][idx]) for idx in np.argsort(clf.coef_[6])[::-1][:15]],columns=["Token","Weight"])

Black man


Unnamed: 0,Token,Weight
0,challenging,0.416493
1,showed,0.403999
2,review,0.362349
3,structured,0.362349
4,suited,0.358184
5,excelling,0.354019
6,example,0.349854
7,offer,0.341525
8,building,0.333195
9,number,0.333195


In [138]:
print("Black nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[7][idx]) for idx in np.argsort(clf.coef_[7])[::-1][:15]],columns=["Token","Weight"])

Black nb


Unnamed: 0,Token,Weight
0,sage,0.416493
1,delegating,0.383174
2,abilities,0.341525
3,showing,0.341525
4,employer,0.32903
5,tremendous,0.32903
6,perspectives,0.32903
7,field,0.3207
8,title,0.3207
9,campaign,0.31237


In [139]:
print("Black woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[8][idx]) for idx in np.argsort(clf.coef_[8])[::-1][:15]],columns=["Token","Weight"])

Black woman


Unnamed: 0,Token,Weight
0,increasing,0.449813
1,levels,0.445648
2,suppliers,0.416493
3,best,0.399834
4,technologies,0.395669
5,confidently,0.391504
6,impressed,0.383174
7,departments,0.379009
8,example,0.374844
9,notch,0.374844


In [140]:
print("Mexican man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[9][idx]) for idx in np.argsort(clf.coef_[9])[::-1][:15]],columns=["Token","Weight"])

Mexican man


Unnamed: 0,Token,Weight
0,juan,0.366514
1,negotiation,0.324865
2,mexican,0.30404
3,construction,0.291545
4,hector,0.274886
5,efficiently,0.266556
6,improving,0.258226
7,language,0.258226
8,known,0.254061
9,production,0.245731


In [141]:
print("Mexican nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[10][idx]) for idx in np.argsort(clf.coef_[10])[::-1][:15]],columns=["Token","Weight"])

Mexican nb


Unnamed: 0,Token,Weight
0,mexican,0.703874
1,adaptable,0.3207
2,product,0.29571
3,xephyra,0.283215
4,inventory,0.274886
5,changing,0.270721
6,techniques,0.254061
7,quickly,0.254061
8,projects,0.249896
9,delays,0.249896


In [142]:
print("Mexican woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[11][idx]) for idx in np.argsort(clf.coef_[11])[::-1][:15]],columns=["Token","Weight"])

Mexican woman


Unnamed: 0,Token,Weight
0,receptionist,0.399834
1,multitasking,0.370679
2,implementing,0.362349
3,knowledge,0.333195
4,rush,0.333195
5,english,0.308205
6,spanish,0.299875
7,mexican,0.291545
8,active,0.270721
9,instances,0.266556


In [143]:
print("ME man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[12][idx]) for idx in np.argsort(clf.coef_[12])[::-1][:15]],columns=["Token","Weight"])

ME man


Unnamed: 0,Token,Weight
0,enabled,0.420658
1,contributor,0.399834
2,worked,0.349854
3,middle,0.345689
4,challenges,0.32903
5,contributed,0.324865
6,deals,0.30404
7,timelines,0.279051
8,deep,0.274886
9,coworkers,0.270721


In [144]:
print("ME nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[13][idx]) for idx in np.argsort(clf.coef_[13])[::-1][:15]],columns=["Token","Weight"])

ME nb


Unnamed: 0,Token,Weight
0,middle,0.562266
1,region,0.420658
2,speak,0.358184
3,assertiveness,0.345689
4,convey,0.31237
5,adapting,0.30404
6,reporting,0.30404
7,assert,0.29571
8,training,0.291545
9,supportive,0.279051


In [145]:
print("ME woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[14][idx]) for idx in np.argsort(clf.coef_[14])[::-1][:15]],columns=["Token","Weight"])

ME woman


Unnamed: 0,Token,Weight
0,region,0.458143
1,adapt,0.412328
2,delivered,0.408163
3,retail,0.383174
4,reviewer,0.345689
5,strength,0.345689
6,latest,0.33736
7,allowing,0.324865
8,analytics,0.3207
9,commendable,0.31237


In [146]:
print("Native American man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[15][idx]) for idx in np.argsort(clf.coef_[15])[::-1][:15]],columns=["Token","Weight"])

Native American man


Unnamed: 0,Token,Weight
0,american,0.462308
1,environmental,0.279051
2,maintain,0.258226
3,completion,0.254061
4,prioritizing,0.249896
5,training,0.245731
6,written,0.241566
7,construction,0.237401
8,officer,0.229071
9,need,0.224906


In [147]:
print("Native American nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[16][idx]) for idx in np.argsort(clf.coef_[16])[::-1][:15]],columns=["Token","Weight"])

Native American nb


Unnamed: 0,Token,Weight
0,american,0.712204
1,sensitivity,0.399834
2,task,0.383174
3,tend,0.370679
4,impressed,0.333195
5,relevant,0.30404
6,workplace,0.30404
7,completing,0.30404
8,standard,0.283215
9,thoughts,0.279051


In [148]:
print("Native American woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[17][idx]) for idx in np.argsort(clf.coef_[17])[::-1][:15]],columns=["Token","Weight"])

Native American woman


Unnamed: 0,Token,Weight
0,native,0.516452
1,administrator,0.333195
2,finance,0.291545
3,learning,0.262391
4,tribal,0.249896
5,cultural,0.245731
6,budgeting,0.245731
7,indigenous,0.241566
8,practices,0.237401
9,ensure,0.237401


In [149]:
print("South Asian woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[18][idx]) for idx in np.argsort(clf.coef_[18])[::-1][:15]],columns=["Token","Weight"])

South Asian woman


Unnamed: 0,Token,Weight
0,asian,0.59142
1,salesman,0.420658
2,delegating,0.412328
3,subject,0.403999
4,years,0.399834
5,innovative,0.354019
6,maintaining,0.324865
7,presentation,0.31237
8,excel,0.29571
9,putting,0.28738


In [150]:
print("South Asian woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[19][idx]) for idx in np.argsort(clf.coef_[19])[::-1][:15]],columns=["Token","Weight"])

South Asian woman


Unnamed: 0,Token,Weight
0,south,0.916285
1,asian,0.437318
2,pronouns,0.345689
3,situations,0.32903
4,empathy,0.3207
5,passion,0.31237
6,techniques,0.31237
7,collaborative,0.308205
8,guidelines,0.299875
9,sure,0.291545


In [151]:
print("South Asian")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[20][idx]) for idx in np.argsort(clf.coef_[20])[::-1][:15]],columns=["Token","Weight"])

South Asian


Unnamed: 0,Token,Weight
0,roles,0.483132
1,delivers,0.474802
2,asian,0.441483
3,closing,0.408163
4,doubt,0.379009
5,tendency,0.366514
6,coaching,0.354019
7,networking,0.349854
8,taken,0.349854
9,demonstrates,0.345689


In [152]:
print("white man")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[21][idx]) for idx in np.argsort(clf.coef_[21])[::-1][:15]],columns=["Token","Weight"])

white man


Unnamed: 0,Token,Weight
0,past,0.433153
1,grow,0.379009
2,productive,0.379009
3,male,0.345689
4,perform,0.324865
5,performed,0.31237
6,perspectives,0.308205
7,courteous,0.30404
8,position,0.299875
9,conflict,0.29571


In [153]:
print("white nb")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[22][idx]) for idx in np.argsort(clf.coef_[22])[::-1][:15]],columns=["Token","Weight"])

white nb


Unnamed: 0,Token,Weight
0,events,0.466472
1,decision,0.408163
2,analytical,0.408163
3,applications,0.403999
4,coming,0.379009
5,design,0.366514
6,solid,0.349854
7,feel,0.341525
8,executing,0.33736
9,exceptional,0.333195


In [154]:
print("white woman")
pd.DataFrame([(vec.get_feature_names_out()[idx],clf.coef_[23][idx]) for idx in np.argsort(clf.coef_[23])[::-1][:15]],columns=["Token","Weight"])

white woman


Unnamed: 0,Token,Weight
0,day,0.478967
1,organized,0.374844
2,plans,0.366514
3,proactive,0.366514
4,mentor,0.358184
5,diversity,0.32903
6,foster,0.324865
7,salesperson,0.3207
8,exhibits,0.31237
9,summary,0.31237


# 4: terms of interest

In [155]:
def term_debug(term):
    counts, classes = [], []
    if term in vec.vocabulary_:
        idx = vec.vocabulary_[term]
    else:
        print(f"Error: {term} not in vocabulary")
        return
    tc = int(np.sum(dtm_reviews, axis=0)[:, idx].item())
    for i, c in enumerate(clf.classes_):
        class_count = np.sum(dtm_reviews[np.where(clidx == i)], axis=0)[:, idx].item()
        if class_count > 0:
            classes.append(c)
            counts.append(class_count)
    if not counts:
        print(f"Term '{term}' has zero counts in all classes.")
        return
    percents = np.round(np.array(counts) / tc * 100, 2)
    return pd.DataFrame({'Counts': counts, 'Percentage': percents, 'Classes': classes}).sort_values(by=["Counts"], ascending=False)

In [156]:
term_debug("anxiety")

Unnamed: 0,Counts,Percentage,Classes
2,2,40.0,Native American non-binary person
0,1,20.0,African non-binary person
1,1,20.0,Middle Eastern non-binary person
3,1,20.0,South Asian woman


In [157]:
term_debug("privilege")

Unnamed: 0,Counts,Percentage,Classes
2,2,33.33,Native American non-binary person
0,1,16.67,Mexican non-binary person
1,1,16.67,Middle Eastern woman
3,1,16.67,Native American woman
4,1,16.67,White man


In [158]:
term_debug("masculinity")

Error: masculinity not in vocabulary


In [159]:
term_debug("prejudice")

Error: prejudice not in vocabulary


In [160]:
term_debug("privilege")

Unnamed: 0,Counts,Percentage,Classes
2,2,33.33,Native American non-binary person
0,1,16.67,Mexican non-binary person
1,1,16.67,Middle Eastern woman
3,1,16.67,Native American woman
4,1,16.67,White man


In [161]:
term_debug("quiet")

Unnamed: 0,Counts,Percentage,Classes
0,1,20.0,Asian man
1,1,20.0,Asian non-binary person
2,1,20.0,Mexican man
3,1,20.0,Mexican woman
4,1,20.0,Native American non-binary person


In [162]:
term_debug("successful")

Unnamed: 0,Counts,Percentage,Classes
21,35,7.29,White man
12,31,6.46,Middle Eastern man
8,31,6.46,Black woman
14,28,5.83,Middle Eastern woman
23,27,5.62,White woman
1,26,5.42,African non-binary person
6,26,5.42,Black man
7,23,4.79,Black non-binary person
17,22,4.58,Native American woman
2,20,4.17,African woman


In [187]:
term_debug("spirit")

Unnamed: 0,Counts,Percentage,Classes
2,2,22.22,Mexican man
4,2,22.22,Native American woman
0,1,11.11,African man
1,1,11.11,Black woman
3,1,11.11,Native American non-binary person
5,1,11.11,South Asian woman
6,1,11.11,White woman


In [188]:
term_debug("culture")

Unnamed: 0,Counts,Percentage,Classes
16,47,24.87,Native American non-binary person
17,40,21.16,Native American woman
15,21,11.11,Native American man
13,11,5.82,Middle Eastern non-binary person
2,10,5.29,African woman
11,6,3.17,Mexican woman
18,6,3.17,South Asian man
21,5,2.65,White man
7,5,2.65,Black non-binary person
1,5,2.65,African non-binary person


In [189]:
term_debug("colonialism")

Unnamed: 0,Counts,Percentage,Classes
0,1,100.0,Native American woman


In [190]:
term_debug("discrimination")

Unnamed: 0,Counts,Percentage,Classes
4,4,26.67,Native American non-binary person
2,3,20.0,Black non-binary person
1,2,13.33,Black man
7,2,13.33,White woman
0,1,6.67,African non-binary person
3,1,6.67,Middle Eastern non-binary person
5,1,6.67,Native American woman
6,1,6.67,South Asian non-binary person


# 5: lexicon

In [169]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

harvard_inq = pd.read_csv("/dartfs-hpc/rc/lab/D/DobsonJ/lexicons/Harvard_Inquirer-inqtabs.txt",sep='\t',
                 header=(0),
                 dtype='string')

In [170]:
addl_stop_words = ["notes"]

In [171]:
test_cols = ['Positiv', 'Negativ', 'Pstv', 'Affil', 'Ngtv', 'Hostile', 'Strong', 'Power', 'Weak', 
             'Submit', 'Active', 'Passive', 'Pleasur', 'Pain', 'Feel', 'Arousal', 'EMOT', 'Virtue',
              'Vice', 'Ovrst', 'Undrst', 'Academ', 'Doctrin', 'Econ@', 'Exch', 'ECON', 'Exprsv',
              'Legal', 'Milit', 'Polit@', 'POLIT', 'Relig', 'Role', 'COLL', 'Work', 'Ritual', 'SocRel',
              'Race', 'Kin@', 'MALE', 'Female', 'Nonadlt', 'HU', 'ANI', 'PLACE', 'Social', 'Region',
              'Route', 'Aquatic', 'Land', 'Sky', 'Object', 'Tool', 'Food', 'Vehicle', 'BldgPt', 'ComnObj',
              'NatObj', 'BodyPt', 'ComForm', 'COM', 'Say', 'Need', 'Goal', 'Try', 'Means', 'Persist',
              'Complet', 'Fail', 'NatrPro', 'Begin', 'Vary', 'Increas', 'Decreas', 'Finish', 'Stay',
              'Rise', 'Exert', 'Fetch', 'Travel', 'Fall', 'Think', 'Know', 'Causal', 'Ought', 'Perceiv',
              'Compare', 'Eval@', 'EVAL', 'Solve', 'Abs@', 'ABS', 'Quality', 'Quan', 'NUMB', 'ORD',
              'CARD', 'FREQ', 'DIST', 'Time@', 'TIME', 'Space', 'POS', 'DIM', 'Rel', 'COLOR', 'Self',
              'Our', 'You', 'Name', 'Yes', 'No', 'Negate', 'Intrj', 'IAV', 'DAV', 'SV', 'IPadj', 'IndAdj',
              'PowGain', 'PowLoss', 'PowEnds', 'PowAren', 'PowCon', 'PowCoop', 'PowAuPt', 'PowPt', 'PowDoct',
              'PowAuth', 'PowOth', 'PowTot', 'RcEthic', 'RcRelig', 'RcGain', 'RcLoss', 'RcEnds', 'RcTot',
              'RspGain', 'RspLoss', 'RspOth', 'RspTot', 'AffGain', 'AffLoss', 'AffPt', 'AffOth', 'AffTot',
              'WltPt', 'WltTran', 'WltOth', 'WltTot', 'WlbGain', 'WlbLoss', 'WlbPhys', 'WlbPsyc', 'WlbPt',
              'WlbTot', 'EnlGain', 'EnlLoss', 'EnlEnds', 'EnlPt', 'EnlOth', 'EnlTot', 'SklAsth', 'SklPt',
              'SklOth', 'SklTot', 'TrnGain', 'TrnLoss', 'TranLw', 'MeansLw', 'EndsLw', 'ArenaLw', 'PtLw',
              'Nation', 'Anomie', 'NegAff', 'PosAff', 'SureLw', 'If', 'NotLw', 'TimeSpc', 'FormLw']
print("Using {0} categories from Harvard Inquirer".format(len(test_cols)))

Using 182 categories from Harvard Inquirer


In [172]:
def clean_list(category):
    vw = harvard_inq[harvard_inq[category] != '<NA>']['Entry'].tolist()
    # make lowercase
    vw = [w.lower() for w in vw]
    # remove alt defs
    vw = list(set([w.split("#")[0] for w in vw]))
    return vw

# for testing with smaller set of categories
smaller_categories = ['Hostile', 'Strong', 'Power', 'Weak', 'Submit', 'Active',
              'Passive', 'Pleasur', 'Pain', 'Feel', 'Arousal', 'EMOT',
              'Virtue', 'Vice', 'Ovrst', 'Undrst']

categories = test_cols

# create lexicon from preprocessed categories
harvard_lex = dict()
for cat in categories:
    harvard_lex[cat] = clean_list(cat)

In [173]:
# function to score texts
def score_text(text):
    scores = dict()
    tokens = word_tokenize(text)
    itc = len(tokens)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in addl_stop_words]
    tokens += [ps.stem(t) for t in tokens]
    tokens = set(tokens)
    tc = len(tokens)
    for cat in harvard_lex.keys():
        if tc == 0:
            scores[cat] = 0
        else:
            scores[cat] = len([t for t in tokens if t in harvard_lex[cat]]) / itc
    return scores

def score_text_verbose(text,cat):
    scores = dict()
    tokens = word_tokenize(text)
    itc = len(tokens)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in addl_stop_words]
    tokens += [ps.stem(t) for t in tokens]
    tokens = set(tokens)
    tc = len(tokens)
    scores[cat] = len([t for t in tokens if t in harvard_lex[cat]]) / itc
    tagged = [t for t in tokens if t in harvard_lex[cat]]
    return scores, tagged

In [174]:
score_text([r for r in data['response'].tolist()][0])

{'Positiv': 0.04466501240694789,
 'Negativ': 0.04714640198511166,
 'Pstv': 0.04218362282878412,
 'Affil': 0.022332506203473945,
 'Ngtv': 0.034739454094292806,
 'Hostile': 0.019851116625310174,
 'Strong': 0.062034739454094295,
 'Power': 0.019851116625310174,
 'Weak': 0.03970223325062035,
 'Submit': 0.02977667493796526,
 'Active': 0.10173697270471464,
 'Passive': 0.062034739454094295,
 'Pleasur': 0.004962779156327543,
 'Pain': 0.02481389578163772,
 'Feel': 0.0024813895781637717,
 'Arousal': 0.007444168734491315,
 'EMOT': 0.022332506203473945,
 'Virtue': 0.02977667493796526,
 'Vice': 0.009925558312655087,
 'Ovrst': 0.022332506203473945,
 'Undrst': 0.02729528535980149,
 'Academ': 0.0,
 'Doctrin': 0.0,
 'Econ@': 0.009925558312655087,
 'Exch': 0.004962779156327543,
 'ECON': 0.007444168734491315,
 'Exprsv': 0.0024813895781637717,
 'Legal': 0.0024813895781637717,
 'Milit': 0.0024813895781637717,
 'Polit@': 0.0,
 'POLIT': 0.0024813895781637717,
 'Relig': 0.0,
 'Role': 0.0024813895781637717,
 'C

In [None]:
scores = []
for r in data['response']:
    if isinstance(r, str):
        score = score_text(r)
        scores.append(score)
        "here1"
    else:
        print("here")
        score = ""

In [None]:
# create dataframe
df = pd.DataFrame(scores)

In [None]:
df['identity'] = data['identity']

In [None]:
df.groupby(by="identity").mean()[smaller_categories]