## Maryam Afshari


## Gender Bias Analysis Notebook

In this notebook, I conducted a comparative analysis of gender bias in discussions on two selected shows from `discussions.p`. The choice of these shows was made based on their relevance to the context of gender representation. After selecting the shows, I formulated a hypothesis regarding the expected gender bias in the discussions.

Next, I trained two word embedding models, one for each show, to analyze the language used in the discussions. I compiled a list of male- and female-related words considered relevant for the corpus.

Using the method outlined in the paper by Wevers, I compared the gender biases present in the discussions of both shows. Following this analysis, I interpreted the results and discussed their implications in relation to my initial hypothesis.

Ultimately, I explored whether the observed gender biases in the discussions shed light on the shows themselves or merely reflected the dynamics of the discussions surrounding them. This comprehensive approach aimed to provide insights into the portrayal and perception of gender within the contexts of the chosen shows.



### Comparison
##### Game of Thrones and The Witcher were chosen because:
- Both shows have male and female characters
- Both shows are of a similar genre (fantasy drama)
- Both shows portray the same period of time (medieval times)
- Both have a sufficient amount of data in the dataset

### Hypothesis
Given the context of both shows, I expect to see gender bias towards males in the comments on both shows.


In [1]:
import spacy
import pickle
import pandas as pd
import numpy as np

In [2]:
# Comparison and hypothesis
# Game of Thrones and The Witcher are selected for the following reasons:
#
# - Both shows are of a similar genre (fantasy drama)
# - Both have female characters (as main characters)
# - Both have a medieval setting, where women are expected to take more of a background role than men
# - Both have a sufficient amount of data in the original dataset
#
# Hypothesis: given the context of both shows, I expect to see gender bias
# towards males in the comments about both shows.
#
# Conclusion: The Witcher is more biased towards men.

In [4]:
df = pd.read_pickle('discussions.p')
df.head()

Unnamed: 0,title,type,year,post
0,Better Call Saul,linear,2017,Walter. And there the chain ends.
1,Better Call Saul,linear,2016,I love this show. But it's hard to argue again...
2,Better Call Saul,linear,2017,What am I missing? A lot of reference to ribs...
3,Better Call Saul,linear,2018,"Oh come on Mike, he's a good little boy."
4,Better Call Saul,linear,2017,Look again 👀


In [63]:
#df.title.value_counts()

In [64]:
# subset of the posts about The Witcher

df_tw = df[(df.title == 'The Witcher')]
#df_tw.tail()

In [65]:
# inspect column post in the witcher show
#df_tw.post[:50]

In [42]:
# inspect column post in Game of Thrones
# df_gt.post[:50]

In [66]:
# subset of the posts about Game of Thrones
df_gt = df[df.title == 'Game of Thrones']
#df_gt.head()

### Preprocessing Text

In [18]:
nlp = spacy.load("en_core_web_sm")


posts_gt = df_gt.post.values 
processed_texts_gt= [text for text in nlp.pipe(posts_gt, 
                                              disable=["ner",
                                                       "parser"])]

posts_tw = df_tw.post.values 
processed_texts_tw= [text for text in nlp.pipe(posts_tw, 
                                              disable=["ner",
                                                       "parser"])]


In [12]:
# list all spaCy stop words to check whether 'he' and 'she' are among them
# stop_words = nlp.Defaults.stop_words
# stop_words

In [30]:
pd.options.mode.chained_assignment = None  # silence the SettingWithCopy warning
# add a processed_texts column to The Witcher df
df_tw['processed_texts'] = processed_texts_tw
df_tw.head()

Unnamed: 0,title,type,year,post,processed_texts,tokenized_texts
43692,The Witcher,netflix,2019,Well that and you’ll notice the queen of Cintr...,"(Well, that, and, you, ’ll, notice, the, queen...","[well, that, and, you, ’ll, notice, the, queen..."
43693,The Witcher,netflix,2020,"Yen sucked at using magic, but she had the mos...","(Yen, sucked, at, using, magic, ,, but, she, h...","[yen, sucked, at, using, magic, but, she, had,..."
43694,The Witcher,netflix,2020,Blessed silence.,"(Blessed, silence, .)","[blessed, silence]"
43695,The Witcher,netflix,2019,"""It's true, he has the face of a cad and a cow...","("", It, 's, true, ,, he, has, the, face, of, a...","[it, 's, true, he, has, the, face, of, a, cad,..."
43696,The Witcher,netflix,2019,I'll give ya 10 marks for that duck.,"(I, 'll, give, ya, 10, marks, for, that, duck, .)","[i, 'll, give, ya, 10, marks, for, that, duck]"


In [31]:
# add a processed_texts column to the Game of Thrones df
df_gt['processed_texts'] = processed_texts_gt
df_gt.head()

Unnamed: 0,title,type,year,post,processed_texts,tokenized_texts
19096,Game of Thrones,linear,2016,And manifested itself into a psychological dis...,"(And, manifested, itself, into, a, psychologic...","[and, manifested, itself, into, a, psychologic..."
19097,Game of Thrones,linear,2017,LITTLEFINGER IS FUCKING DEAD UPVOTE PARTY,"(LITTLEFINGER, IS, FUCKING, DEAD, UPVOTE, PARTY)","[littlefinger, is, fucking, dead, upvote, party]"
19098,Game of Thrones,linear,2015,"Oh shit, hey schnitz!!!","(Oh, shit, ,, hey, schnitz, !, !, !)","[oh, shit, hey, schnitz]"
19099,Game of Thrones,linear,2016,"Seriously dude. \n \n""Let's go back to Meree...","(Seriously, dude, ., \n \n, "", Let, 's, go, ...","[seriously, dude, \n \n, let, 's, go, back, ..."
19100,Game of Thrones,linear,2017,At least it would be a reason for Samwell to p...,"(At, least, it, would, be, a, reason, for, Sam...","[at, least, it, would, be, a, reason, for, sam..."


In [21]:
# lowercase the tokens and drop punctuation for the Game of Thrones posts
tokenized_texts_gt = [[token.text.lower() for token in processed_text if not token.is_punct] 
                      for processed_text in processed_texts_gt]

In [32]:
df_gt['tokenized_texts'] = tokenized_texts_gt
df_gt.head()

Unnamed: 0,title,type,year,post,processed_texts,tokenized_texts
19096,Game of Thrones,linear,2016,And manifested itself into a psychological dis...,"(And, manifested, itself, into, a, psychologic...","[and, manifested, itself, into, a, psychologic..."
19097,Game of Thrones,linear,2017,LITTLEFINGER IS FUCKING DEAD UPVOTE PARTY,"(LITTLEFINGER, IS, FUCKING, DEAD, UPVOTE, PARTY)","[littlefinger, is, fucking, dead, upvote, party]"
19098,Game of Thrones,linear,2015,"Oh shit, hey schnitz!!!","(Oh, shit, ,, hey, schnitz, !, !, !)","[oh, shit, hey, schnitz]"
19099,Game of Thrones,linear,2016,"Seriously dude. \n \n""Let's go back to Meree...","(Seriously, dude, ., \n \n, "", Let, 's, go, ...","[seriously, dude, \n \n, let, 's, go, back, ..."
19100,Game of Thrones,linear,2017,At least it would be a reason for Samwell to p...,"(At, least, it, would, be, a, reason, for, Sam...","[at, least, it, would, be, a, reason, for, sam..."


In [33]:
# lowercase the tokens and drop punctuation for The Witcher posts
tokenized_texts_tw = [[token.text.lower() for token in processed_text if not token.is_punct] 
                      for processed_text in processed_texts_tw]
df_tw['tokenized_texts'] = tokenized_texts_tw
df_tw.head()

Unnamed: 0,title,type,year,post,processed_texts,tokenized_texts
43692,The Witcher,netflix,2019,Well that and you’ll notice the queen of Cintr...,"(Well, that, and, you, ’ll, notice, the, queen...","[well, that, and, you, ’ll, notice, the, queen..."
43693,The Witcher,netflix,2020,"Yen sucked at using magic, but she had the mos...","(Yen, sucked, at, using, magic, ,, but, she, h...","[yen, sucked, at, using, magic, but, she, had,..."
43694,The Witcher,netflix,2020,Blessed silence.,"(Blessed, silence, .)","[blessed, silence]"
43695,The Witcher,netflix,2019,"""It's true, he has the face of a cad and a cow...","("", It, 's, true, ,, he, has, the, face, of, a...","[it, 's, true, he, has, the, face, of, a, cad,..."
43696,The Witcher,netflix,2019,I'll give ya 10 marks for that duck.,"(I, 'll, give, ya, 10, marks, for, that, duck, .)","[i, 'll, give, ya, 10, marks, for, that, duck]"


### Word Embeddings of Game of Thrones

In [24]:
# word embeddings of Game of Thrones
import gensim
from gensim.models import Word2Vec

tokenized_texts_gt = df_gt['tokenized_texts'].values

SIZE = 300 # dimensions of the embeddings
SG = 1 # whether to use skip-gram or CBOW (we use skip-gram)
WINDOW = 10 # the window size
N_WORKERS = 1 # number of workers to use
MIN_COUNT = 1

model = Word2Vec(size=SIZE,
                sg=SG,
                window=WINDOW, 
                min_count=MIN_COUNT,
                workers=N_WORKERS)

model.build_vocab(tokenized_texts_gt)

model.train(tokenized_texts_gt,
           total_examples=model.corpus_count,
           epochs=model.epochs) # grab some coffee while training

(1167930, 1534780)

In [34]:
# most similar words to 'women' in Game of Thrones
model.wv.most_similar('women')

[('gods', 0.9604064226150513),
 ('respected', 0.9590197801589966),
 ('bodies', 0.9589460492134094),
 ('murdered', 0.9575037956237793),
 ('whom', 0.9573748111724854),
 ('targaryens', 0.9563419818878174),
 ('above', 0.9548050165176392),
 ('apparently', 0.9536995887756348),
 ('guards', 0.9535270929336548),
 ('sons', 0.9532104730606079)]

In [26]:
# most similar words to 'man' in Game of Thrones
model.wv.most_similar('man')

[('dream', 0.8980989456176758),
 ('friend', 0.8870887160301208),
 ('ass', 0.8870283365249634),
 ('bro', 0.880265474319458),
 ('bitch', 0.8754574656486511),
 ('boss', 0.8747919797897339),
 ('taught', 0.874590277671814),
 ('dude', 0.871314525604248),
 ('girl', 0.8708091974258423),
 ('pod', 0.8691800832748413)]

In [35]:
# similarity between 'man' and 'genius'
print(model.wv.similarity('man', 'genius'))
# similarity between 'woman' and 'genius'
print(model.wv.similarity('woman', 'genius'))

0.8039272
0.89606255




### Word Embeddings of The Witcher

In [28]:
# word embeddings of The Witcher
import gensim
from gensim.models import Word2Vec

tokenized_texts_tw = df_tw['tokenized_texts'].values

SIZE = 300 # dimensions of the embeddings
SG = 1 # whether to use skip-gram or CBOW (we use skip-gram)
WINDOW = 10 # the window size
N_WORKERS = 1 # number of workers to use
MIN_COUNT = 1

model_tw = Word2Vec(size=SIZE,
                sg=SG,
                window=WINDOW, 
                min_count=MIN_COUNT,
                workers=N_WORKERS)

model_tw.build_vocab(tokenized_texts_tw)

model_tw.train(tokenized_texts_tw,
           total_examples=model_tw.corpus_count,
           epochs=model_tw.epochs) # grab some coffee while training

(147828, 199585)

In [37]:
# most similar words to 'women' in The Witcher
model_tw.wv.most_similar('women')

[('sorceresses', 0.9998741745948792),
 ('stand', 0.9998688697814941),
 ('hedgehog', 0.9998676180839539),
 ('destined', 0.9998660087585449),
 ('eel', 0.9998648762702942),
 ('cause', 0.999861478805542),
 ('complete', 0.9998611807823181),
 ('djinns', 0.9998590350151062),
 ('cute', 0.9998587369918823),
 ('prepared', 0.9998579621315002)]

In [38]:
# most similar words to 'man' in The Witcher
model_tw.wv.most_similar('man')

[('elves', 0.9998028874397278),
 ('room', 0.9997982978820801),
 ('beast', 0.9997875690460205),
 ('old', 0.9997825622558594),
 ('daughter', 0.9997678399085999),
 ('white', 0.999767541885376),
 ('down', 0.9997608661651611),
 ('attack', 0.9997596740722656),
 ('sleep', 0.9997522830963135),
 ('mage', 0.9997521638870239)]

In [39]:
# similarity between 'man' and 'genius'
print(model_tw.wv.similarity('man', 'genius'))

# similarity between 'woman' and 'genius'
print(model_tw.wv.similarity('woman', 'genius'))

0.9909216
0.9930625




# Gender bias in Game of Thrones

In [47]:
# male mean embedding
# note: multi-word entries such as 'boy friend' are not single tokens, so the vocabulary filter below drops them
male_list = ['he', 'his', 'boy friend', 'father', 'brother', 'man', 'husband', 'son', 'grandpa', 'lord', 'king', 'actor']
words = [word for word in male_list if word in model.wv.vocab]  # keep only words the model has seen
male_mean_embedding = np.mean([model.wv[word] for word in words], axis=0)
print(male_mean_embedding.shape)

(300,)
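
Because the posts are split into single tokens, a multi-word entry such as 'boy friend' cannot match any vocabulary key, so the membership filter drops it without warning. A minimal sketch of that behaviour, assuming a made-up vocabulary (`toy_vocab` is a hypothetical stand-in for `model.wv.vocab`, not the trained model):

```python
# Hypothetical stand-in for model.wv.vocab: a set of single-token keys.
toy_vocab = {"he", "his", "father", "man", "king", "boyfriend"}

male_list = ["he", "his", "boy friend", "father", "man", "grandpa", "king"]

# Same membership test as in the notebook cell.
kept = [word for word in male_list if word in toy_vocab]
dropped = [word for word in male_list if word not in toy_vocab]

print(kept)     # words that would contribute to the mean embedding
print(dropped)  # words that are silently ignored
```

Printing the dropped words is a cheap way to check how much of each list actually contributes to the mean embedding.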


In [48]:
# female mean embedding
female_list = ['she', 'her', 'girl', 'mother', 'women', 'girl friend', 'wife', 'daughter', 'grandma', 'queen', 'lady', 'actress']
words = [word for word in female_list if word in model.wv.vocab]  # keep only words the model has seen
female_mean_embedding = np.mean([model.wv[word] for word in words], axis=0)
print(female_mean_embedding.shape)

(300,)


In [49]:
# load the Wevers word categories
with open('word_cats.p', 'rb') as f:
    df_wever = pd.DataFrame(pickle.load(f))
df_wever.head()

Unnamed: 0,affect,posemo,negemo,social,family,cogproc,percept,body,work,leisure,money,relig,occupation
0,protesting,incentive,destruction,chick,ma's,comply,squeez,pussy,dotcom,dnd,portfolio,goddess,accountant
1,pretty,luck,beaten,ma's,niece,luck,sand,wears,employee,vacation,sale,karma,actor
2,sighs,freeing,battl,lets,stepkid,unquestion,moist,hearts,paper,hobb,stores,pastor,actress
3,warmth,pretty,protesting,son's,son's,pretty,warmth,asleep,earns,band,bets,temple,actuary
4,mooch,nicely,dumber,daddies,daddies,become,gloomy,gums,assign,skat,bank,holy,acupuncturist


In [54]:
# bias in occupation category
bias = {}
for word in df_wever.occupation:
    if word in model.wv.vocab:
        distance_male =np.linalg.norm(np.subtract(model.wv[word],male_mean_embedding))
        distance_female =np.linalg.norm(np.subtract(model.wv[word],female_mean_embedding))
        gender_bias = distance_male - distance_female
        bias[word] = gender_bias
    
bias = pd.DataFrame.from_dict(bias,orient ='index')
bias_occupation = bias.rename(columns ={0 :"bias_gender"})
bias_occupation.head()

Unnamed: 0,bias_gender
actor,0.032011
actress,0.34151
actuary,0.019624
agent,0.029762
arts,0.05635


In [55]:
# occupations most biased towards females (largest positive values first)
bias_occupation.sort_values(by ="bias_gender",ascending=False)

Unnamed: 0,bias_gender
actress,0.34151
servant,0.144495
teacher,0.141397
judge,0.116434
nun,0.111305
doctor,0.093893
soldier,0.090522
turner,0.069812
representative,0.057186
janitor,0.056669


In [56]:
# bias per category
bias_category={}
for column in df_wever:
    words = df_wever[column]
    for word in words:
        if word in model.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model.wv[word],male_mean_embedding))
            distance_female = np.linalg.norm(np.subtract(model.wv[word],female_mean_embedding))
            gender_bias = distance_male - distance_female 
    mean = np.mean(gender_bias)
    bias_category[column] = mean

bias_category_gt = pd.DataFrame.from_dict(bias_category,orient ='index')# keys should be rows 
bias_category_gt = bias_category_gt.rename(columns ={0 :"bias_per_category"})
bias_category_gt

Unnamed: 0,bias_per_category
affect,0.091893
posemo,0.091893
negemo,0.09951
social,0.092533
family,-0.330979
cogproc,0.304908
percept,0.088271
body,0.125751
work,0.14277
leisure,0.091893
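
In the loop above, `gender_bias` is reassigned for every word, so `np.mean(gender_bias)` averages a single number: the value of the last in-vocabulary word in the column, or a value left over from an earlier column when no word matches. The Witcher cell further down follows the same pattern. One possible way to average over all words of a category is to collect the per-word values first. The sketch below is an illustration only: it uses plain Python and made-up 2-d vectors, and `toy_vectors`, `toy_categories` and `category_bias` are hypothetical names that do not appear in the notebook.

```python
import math
from statistics import mean

# Made-up 2-d "embeddings"; in the notebook these would come from model.wv.
toy_vectors = {
    "nurse": [0.0, 1.0],
    "guard": [1.0, 0.0],
    "cook": [0.5, 0.5],
}
male_mean = [1.0, 0.0]    # stand-in for male_mean_embedding
female_mean = [0.0, 1.0]  # stand-in for female_mean_embedding

toy_categories = {
    "work": ["nurse", "guard", "cook"],
    "money": ["unseen"],  # no word of this category is in the vocabulary
}

def category_bias(words, vectors, male, female):
    """Mean of (distance to male - distance to female) over known words."""
    biases = [
        math.dist(vectors[w], male) - math.dist(vectors[w], female)
        for w in words
        if w in vectors
    ]
    # Return None instead of reusing a value from a previous category.
    return mean(biases) if biases else None

for name, words in toy_categories.items():
    print(name, category_bias(words, toy_vectors, male_mean, female_mean))
```

Returning `None` for a category without any known word keeps empty categories visible instead of letting them inherit another category's value.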


# Gender bias in The Witcher

In [57]:
#male mean embedding
words_tw = [word for word in male_list if word in model_tw.wv.vocab] # checks if word is in vocabulary (i.e. has been seen by the model before)
male_mean_embedding_tw = np.mean([model_tw.wv[word] for word in words_tw], axis=0)
print(male_mean_embedding_tw.shape)

(300,)


In [58]:
#female mean embedding
words_tw = [word for word in female_list if word in model_tw.wv.vocab] # checks if word is in vocabulary (i.e. has been seen by the model before)
female_mean_embedding_tw = np.mean([model_tw.wv[word] for word in words_tw], axis=0)
print(female_mean_embedding_tw.shape)

(300,)


In [59]:
# bias in occupation category
bias_occupation_tw = {}
for word in df_wever.occupation:
    if word in model_tw.wv.vocab:
        distance_male =np.linalg.norm(np.subtract(model_tw.wv[word],male_mean_embedding_tw))
        distance_female =np.linalg.norm(np.subtract(model_tw.wv[word],female_mean_embedding_tw))
        gender_bias = distance_male - distance_female
        bias_occupation_tw[word] = gender_bias
    
bias_occupation_tw = pd.DataFrame.from_dict(bias_occupation_tw,orient ='index')
bias_occupation_tw = bias_occupation_tw.rename(columns ={0 :"gender_bias"})
bias_occupation_tw.head()

Unnamed: 0,gender_bias
actor,0.135365
actress,0.180476
butcher,-0.17738
cameraman,-0.178224
doctor,-0.173959


In [60]:
# occupation biases for The Witcher; sort_values returns a new DataFrame, so the frame displayed on the last line is still unsorted
bias_occupation_tw.sort_values(by ="gender_bias",ascending=False)
bias_occupation_tw

Unnamed: 0,gender_bias
actor,0.135365
actress,0.180476
butcher,-0.17738
cameraman,-0.178224
doctor,-0.173959
farmer,-0.178913
judge,-0.178367
mechanic,-0.178829
pilot,-0.169301
postman,-0.178168


In [61]:
# bias per category
bias_category={}
for column in df_wever:
    words = df_wever[column]
    for word in words:
        if word in model_tw.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model_tw.wv[word],male_mean_embedding_tw))
            distance_female = np.linalg.norm(np.subtract(model_tw.wv[word],female_mean_embedding_tw))
            gender_bias = distance_male - distance_female
    mean = np.mean(gender_bias)
    bias_category[column] = mean

bias_category = pd.DataFrame.from_dict(bias_category,orient ='index') # keys should be rows 
bias_category= bias_category.rename(columns ={0 :"bias_per_category"})
bias_category

Unnamed: 0,bias_per_category
affect,-0.176955
posemo,-0.177832
negemo,-0.179464
social,0.144715
family,-0.176849
cogproc,0.160146
percept,-0.178007
body,-0.178619
work,-0.179092
leisure,0.16479


## Steps taken:
### Data preparation
1. Selecting two shows.
2. Inspecting the "post" column of shows.
3. Preprocessing text using SpaCy.
4. Training a Word2Vec model on the text of each show.
5. Exploring most similar words to "man" and "woman" and getting an idea of gender bias with some comparison between words.

### Detecting Gender Bias
1. Defining lists of words related to male and female. (In this step, I inspected the "post" column in the dataset related to each show to define female and male-related words. I ensured that I did not remove stop words, including gender-related words such as 'he' and 'she'.)
2. Calculating the mean female and male embedding (having two vectors representing male and female).
3. Within each category of df_wever, calculating each word's Euclidean distance to the mean male embedding and to the mean female embedding, and then subtracting the distance to female from the distance to male. This calculation yields a gender bias metric: if the resulting number is positive, it indicates a bias towards female, while if it is negative, it suggests a bias towards male.
4. Finally, calculating the mean of the gender bias per category to gain insight into the level of gender bias within each category.

**Some Explanations**
- Positive numbers indicate a bias towards female, and negative numbers a bias towards male.
- The closer the value is to 0, the smaller the bias.
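
The metric described in the steps above can be sketched with toy 2-d vectors. The vectors and the helper names (`mean_vector`, `gender_bias`) are illustrative assumptions; the notebook itself works with 300-dimensional Word2Vec vectors and NumPy.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]

def gender_bias(word_vec, male_mean, female_mean):
    """Distance to the male mean minus distance to the female mean.

    Positive: the word lies closer to the female mean.
    Negative: the word lies closer to the male mean.
    """
    return math.dist(word_vec, male_mean) - math.dist(word_vec, female_mean)

# Made-up vectors standing in for the embeddings of the gendered word lists.
male_mean = mean_vector([[1.0, 0.0], [0.8, 0.2]])    # e.g. 'he', 'man'
female_mean = mean_vector([[0.0, 1.0], [0.2, 0.8]])  # e.g. 'she', 'woman'

print(gender_bias([0.1, 0.9], male_mean, female_mean) > 0)  # near the female mean
print(gender_bias([0.9, 0.1], male_mean, female_mean) < 0)  # near the male mean
```

A word that is equally far from both means gets a value of 0, which is why values near 0 are read as little or no bias.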


## Discussion

I anticipated observing a gender bias towards males due to the historical context of Medieval Times, characterized by significant gender inequality. 

While examining similarities between certain words, I observed that the word 'genius' is closer to women than men in Game of Thrones, whereas in The Witcher, it is close to both women and men. However, we cannot draw any conclusions regarding gender bias from this observation.
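
The similarity scores compared here are cosine similarities between word vectors, which is what gensim's `similarity` method computes. A minimal pure-Python sketch with made-up vectors (the `cosine_similarity` helper is a hypothetical name used only for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```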

In Game of Thrones, the majority of occupations exhibit a bias towards females. Only 'priest' and 'warden' show bias towards males. Surprisingly, even the occupation 'groom' leans towards females.

Furthermore, in Wevers's categories, only the 'family' category exhibits a bias towards males, whereas all other categories are biased towards females.

It seems the hypothesis does not align with the observed data in this context.

In The Witcher, among the occupations, only "actor" and "actress" show a bias towards females, while the rest are biased towards males.

It is interesting to note that in Wevers's categories, "social," "cogproc," and "leisure" are biased towards females, while the remaining categories, including "work," "money," "relig," "occupation," "percept," "family," "negemo," "posemo," and "affect," are biased towards males.

Based on the observed biases in the categories and occupations within the texts of "The Witcher" and "Game of Thrones," it seems reasonable to conclude that posts written on "The Witcher" exhibit a higher gender bias towards males, while posts written on "Game of Thrones" show a higher bias towards females.

The hypothesis is partly rejected because we have two shows set in the same historical period, yet posts written about one show exhibit a bias towards females, while for the other show, the bias is towards males. Earlier in this analysis, I assumed that comments on the show would mirror its content. Additionally, I presumed that a show set in Medieval Times would inherently reflect the biases prevalent during that era.

These assumptions formed the basis of my hypothesis. However, posts written about Game of Thrones challenge this hypothesis.

It is essential to acknowledge that the audience watching these shows doesn't exist within the context of Medieval Times. Consequently, the gender biases reflected in posts written about these shows might diverge from those depicted within the shows themselves.

Determining whether the gender bias evident in posts written about these shows reflects gender bias within the shows themselves warrants further analysis, particularly focusing on the scripts. This could be a valuable next step in this analysis. Comparing the results of gender bias in the show scripts with the findings of this analysis would be intriguing and potentially enlightening.

