# Text Analysis


Note: some code in this notebook is taken or derived from the manual provided for the university course "data mining" at UU.

In [1]:
import pickle
import pandas as pd

df = pd.read_pickle('discussions.p')

In [2]:
df.head()

Unnamed: 0,title,type,year,post
0,Better Call Saul,linear,2017,Walter. And there the chain ends.
1,Better Call Saul,linear,2016,I love this show. But it's hard to argue again...
2,Better Call Saul,linear,2017,What am I missing? A lot of reference to ribs...
3,Better Call Saul,linear,2018,"Oh come on Mike, he's a good little boy."
4,Better Call Saul,linear,2017,Look again 👀


In [3]:
df.tail()

Unnamed: 0,title,type,year,post
49995,Twin Peaks,linear,2017,Anyone else think that the top of the mushroom...
49996,Twin Peaks,linear,2017,Shit I thought it was mini van lady with shoot...
49997,Twin Peaks,linear,2017,Did Janey say that Dougie was absent for a nig...
49998,Twin Peaks,linear,2017,From what I've read and seen they were mutual ...
49999,Twin Peaks,linear,2017,PBR... blech!!!


## Statement on the relevance of the comparison and hypothesis

Orange is the New Black (OITNB) and Breaking Bad (BB) are both popular shows, but they differ in the number of male and female protagonists.
OITNB takes place in a women's prison; the main protagonists and most side characters are female.
BB, on the other hand, revolves mainly around male characters, and its main protagonists are male.
As a result, discussions about OITNB may contain more words associated with the female gender, and discussions about BB may contain more words associated with the male gender.

I expect both shows' discussions to have a gender bias. Specifically, I expect discussions about OITNB to be biased toward women and discussions about BB to be biased toward men.

Given the differences between the shows explained above, it is interesting to investigate whether the discussions actually show a gender bias and, if so, whether the bias for each show points in the expected direction.

Checking the amount of data available for both shows.

In [4]:
df.title.value_counts()

Game of Thrones            15462
Breaking Bad                6424
Better Call Saul            5268
Black Mirror                4720
Stranger Things             2891
True Detective              2721
Twin Peaks                  2399
Dark                        2004
Ozark                       1417
Mr. Robot                   1347
Orange is the New Black     1208
The Witcher                 1188
Fargo                        680
Mindhunter                   657
The Newsroom                 484
Succession                   389
The Crown                    338
House of Cards               158
La Casa de Papel             150
The Mandelorian               95
Name: title, dtype: int64

Creating subsets of data for each show.

In [5]:
df_oitnb = df[df['title'] == 'Orange is the New Black']
df_oitnb.head()

Unnamed: 0,title,type,year,post
36870,Orange is the New Black,netflix,2016,Maria stole Piper's business ... fuck Maria.
36871,Orange is the New Black,netflix,2016,I think it was more of he was too distracted b...
36872,Orange is the New Black,netflix,2016,Yeah it's issue 100.
36873,Orange is the New Black,netflix,2016,They weren't that bad given the situation. I'm...
36874,Orange is the New Black,netflix,2016,Maritza bartending in Miami? It's like watchin...


In [6]:
df_bb = df[df['title'] == 'Breaking Bad']
df_bb.head()

Unnamed: 0,title,type,year,post
9988,Breaking Bad,linear,2012,&gt;Mike and his granddaughter.
9989,Breaking Bad,linear,2013,Cool! Thanks for the update! Much appreciated.
9990,Breaking Bad,linear,2011,The last thing they need is a wildcard. Jesse...
9991,Breaking Bad,linear,2013,I don't think there's any doubt that Walt pois...
9992,Breaking Bad,linear,2012,Don't you have to also note that you watched i...


Preprocessing:
Converting the dataframe column 'post' to a list, then removing punctuation, lowercasing, and tokenizing each post.

In [7]:
def tokenize(text):
    punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    for punctuation in punctuations:
        text = text.replace(punctuation, '')
    text = text.lower() 
    text = text.split()
    return text
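
To see what this preprocessing does to a single post, here is a small self-contained check. It is only a sketch: it uses `str.translate` with `string.punctuation` (the same character set as the `punctuations` string above) as an assumed equivalent of the loop, and both the function name `tokenize_translate` and the sample post are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# string.punctuation is the same set of characters as `punctuations` above.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize_translate(text):
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    return text.translate(_PUNCT_TABLE).lower().split()

sample_post = "I don't think there's any doubt, Walt!"  # made-up example post
tokens = tokenize_translate(sample_post)
print(tokens)
```

Two side effects are worth keeping in mind: contractions collapse ("don't" becomes "dont"), and non-ASCII characters such as emoji or curly quotes are not removed, because only ASCII punctuation is stripped.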

#### BB

In [8]:
bb_list = df_bb['post'].tolist()

In [9]:
tokenized_texts_bb = [tokenize(text) for text in bb_list]

#### OITNB

In [10]:
oitnb_list = df_oitnb['post'].tolist()

In [11]:
tokenized_texts_oitnb = [tokenize(text) for text in oitnb_list]

Training the word embedding models (Word2Vec) for each show.
For both models I use a window size of 10, 300 embedding dimensions (as the data is large enough for both shows), a minimum count of 5 (a word has to appear at least 5 times to be included, which excludes rare words that are unlikely to be relevant), and the skip-gram architecture.
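
To make the roles of the window size and the minimum count more concrete, the sketch below lists the (target, context) pairs a skip-gram model would be trained on for a toy corpus. It is a simplified illustration in plain Python, not gensim's implementation: real Word2Vec implementations also subsample frequent words and vary the effective window, which is not modelled here. The function name `skipgram_pairs` and the toy sentences are invented for this example.

```python
from collections import Counter

def skipgram_pairs(sentences, window=2, min_count=1):
    """Return the (target, context) pairs a skip-gram model would see.

    Simplified: no subsampling of frequent words and a fixed window.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # Words below min_count are dropped before the windows are formed
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, target in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs

# Toy corpus, invented for illustration
toy_sentences = [["walt", "cooks", "with", "jesse"],
                 ["walt", "calls", "jesse"]]
pairs = skipgram_pairs(toy_sentences, window=1, min_count=2)
print(pairs)
```

With `min_count=2` only "walt" and "jesse" survive, so they become direct neighbours even though other words stood between them in the original sentences. A larger window would additionally pair words that are further apart.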

In [12]:
import gensim
from gensim.models import Word2Vec

#### BB

In [13]:
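SIZE = 300       # number of embedding dimensions
SG = 1           # 1 = skip-gram, 0 = CBOW
WINDOW = 10      # context window size
N_WORKERS = 1    # number of training threads
MIN_COUNT = 5    # drop words that occur fewer than 5 times

# `size` is the gensim 3.x keyword; gensim 4.x renamed it to `vector_size`.
model_bb = Word2Vec(size=SIZE,
                    sg=SG,
                    window=WINDOW,
                    min_count=MIN_COUNT,
                    workers=N_WORKERS)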

model_bb.build_vocab(tokenized_texts_bb)


model_bb.train(tokenized_texts_bb,
               total_examples=model_bb.corpus_count,
               epochs=model_bb.epochs)

(334244, 518180)

#### OITNB

In [14]:
# Same hyperparameters as for the BB model
SIZE = 300
SG = 1
WINDOW = 10
N_WORKERS = 1
MIN_COUNT = 5

model_oitnb = Word2Vec(size=SIZE,
                       sg=SG,
                       window=WINDOW,
                       min_count=MIN_COUNT,
                       workers=N_WORKERS)

model_oitnb.build_vocab(tokenized_texts_oitnb)

model_oitnb.train(tokenized_texts_oitnb,
                    total_examples=model_oitnb.corpus_count,
                    epochs=model_oitnb.epochs)

(92488, 167750)

Printing the most common proper nouns of both shows using spaCy, in order to include the two most common female and male character names of each show in the gender-related word lists.

In [15]:
import spacy 
nlp = spacy.load("en_core_web_sm")

#### BB

In [16]:
texts = bb_list

processed_texts = [text for text in nlp.pipe(texts, 
                                              n_process=4,
                                              disable=["ner",
                                                       "parser"])]

tokenized_propn_bb = [[word.lemma_.lower() for word in processed_text if word.pos_ == 'PROPN'
                                and not word.is_stop and not word.is_punct] for processed_text in processed_texts]

In [17]:
flatten = lambda t: [item for sublist in t for item in sublist]

tokenized_propn_bb_flat = flatten(tokenized_propn_bb)

In [18]:
from collections import Counter
propn_counts_bb = Counter(tokenized_propn_bb_flat)

In [19]:
propn_counts_bb.most_common()[:10] 

[('walt', 1207),
 ('jesse', 728),
 ('hank', 417),
 ('mike', 296),
 ('gus', 243),
 ('todd', 184),
 ('skyler', 155),
 ('walter', 149),
 ('marie', 132),
 ('saul', 106)]
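
The flatten-and-count steps above can also be sketched with the standard library only. This is an alternative formulation under stated assumptions, not a change to the notebook: `itertools.chain.from_iterable` replaces the `flatten` lambda, and the nested list `toy_propn` is made up to stand in for the per-post proper-noun lists.

```python
from collections import Counter
from itertools import chain

# Made-up per-post lists of lemmatised, lowercased proper nouns
toy_propn = [["walt", "jesse"], [], ["walt", "hank", "walt"]]

# chain.from_iterable flattens one level of nesting lazily,
# so Counter can consume the words without an intermediate list
toy_counts = Counter(chain.from_iterable(toy_propn))
top_two = toy_counts.most_common(2)
print(top_two)
```

Passing the number of items to `most_common` directly avoids building the full sorted list before slicing it.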

#### OITNB

In [20]:
texts = oitnb_list

processed_texts = [text for text in nlp.pipe(texts, 
                                              n_process=4,
                                              disable=["ner",
                                                       "parser"])]

tokenized_propn_oitnb = [[word.lemma_.lower() for word in processed_text if word.pos_ == 'PROPN'
                                and not word.is_stop and not word.is_punct] for processed_text in processed_texts]

In [21]:
flatten = lambda t: [item for sublist in t for item in sublist]

tokenized_propn_oitnb_flat = flatten(tokenized_propn_oitnb)

In [22]:
propn_counts_oitnb = Counter(tokenized_propn_oitnb_flat)

In [23]:
propn_counts_oitnb.most_common()[:10] 

[('piper', 98),
 ('alex', 50),
 ('suzanne', 38),
 ('poussey', 34),
 ('healy', 31),
 ('caputo', 30),
 ('daya', 26),
 ('stella', 24),
 ('bennett', 24),
 ('boo', 21)]

Creating lists of male- and female-related words (including the two most common female and male character names of each show).

#### BB

In [24]:
male_rel_bb = ["walt", "jesse", "he", "his", "male", "men", "man", "brother", "father", "son"]

In [25]:
female_rel_bb = ["skyler", "marie", "she", "her", "female", "woman", "women", "mother", "sister", "daughter"]

#### OITNB

In [26]:
male_rel_oitnb = ["caputo", "bennett", "he", "his", "male", "men", "man", "brother", "father", "son"]

In [27]:
female_rel_oitnb = ["piper", "alex", "she", "her", "female", "woman", "women", "mother", "sister", "daughter"]

#### Wevers' method

Loading word categories.

In [28]:
df_cats = pd.read_pickle('word_cats.p')
df_cats.head()

Unnamed: 0,affect,posemo,negemo,social,family,cogproc,percept,body,work,leisure,money,relig,occupation
0,protesting,incentive,destruction,chick,ma's,comply,squeez,pussy,dotcom,dnd,portfolio,goddess,accountant
1,pretty,luck,beaten,ma's,niece,luck,sand,wears,employee,vacation,sale,karma,actor
2,sighs,freeing,battl,lets,stepkid,unquestion,moist,hearts,paper,hobb,stores,pastor,actress
3,warmth,pretty,protesting,son's,son's,pretty,warmth,asleep,earns,band,bets,temple,actuary
4,mooch,nicely,dumber,daddies,daddies,become,gloomy,gums,assign,skat,bank,holy,acupuncturist


Calculate mean embeddings of female and male related words.

In [29]:
import numpy as np

#### BB

In [30]:
words = [word for word in female_rel_bb if word in model_bb.wv.vocab] 
mean_embedding_female_bb = np.mean([model_bb.wv[word] for word in words], axis=0)

In [31]:
words = [word for word in male_rel_bb if word in model_bb.wv.vocab] 
mean_embedding_male_bb = np.mean([model_bb.wv[word] for word in words], axis=0)

#### OITNB

In [32]:
words = [word for word in female_rel_oitnb if word in model_oitnb.wv.vocab] 
mean_embedding_female_oitnb = np.mean([model_oitnb.wv[word] for word in words], axis=0)

In [33]:
words = [word for word in male_rel_oitnb if word in model_oitnb.wv.vocab] 
mean_embedding_male_oitnb = np.mean([model_oitnb.wv[word] for word in words], axis=0)
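
As a minimal sketch of what these cells compute, the example below averages toy two-dimensional vectors after dropping out-of-vocabulary words. It is written in plain Python so that it runs without dependencies (the notebook itself uses `np.mean(..., axis=0)`); the vectors, the helper name `mean_embedding`, and the explicit error for an empty word list are assumptions added for illustration.

```python
# Toy 2-dimensional "embeddings"; the real models use 300 dimensions
toy_vectors = {
    "she": [1.0, 0.0],
    "her": [0.8, 0.2],
    "he": [0.0, 1.0],
}
female_words = ["she", "her", "daughter"]  # "daughter" is deliberately missing

def mean_embedding(words, vectors):
    """Average, per dimension, the vectors of the words found in `vectors`."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # With an empty list, np.mean would silently return nan instead
        raise ValueError("none of the words are in the vocabulary")
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

mean_female = mean_embedding(female_words, toy_vectors)
print(mean_female)
```

Because words below the minimum count are absent from the model vocabulary, the mean for a small corpus may rest on fewer words than the list suggests, so it can be worth printing the filtered `words` list for each show.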

Calculation of gender bias: for each word in one of Wevers' categories, the distance between the word vector and each gender vector (male and female) is calculated. The two distances are then subtracted, and the result indicates the bias of that word.
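
Written out, the calculation in the cells below takes the following form. The symbols are introduced here for illustration and do not appear in the notebook: $v_w$ is the embedding of word $w$, $\mu_m$ and $\mu_f$ are the mean male and female embeddings, and $C'$ is the set of words of a category $C$ that occur in the model vocabulary.

```latex
b(w) = \lVert v_w - \mu_m \rVert_2 - \lVert v_w - \mu_f \rVert_2,
\qquad
B(C) = \frac{1}{|C'|} \sum_{w \in C'} b(w)
```

A positive $b(w)$ means the word lies closer to the female vector, a negative one that it lies closer to the male vector. By the triangle inequality, $|b(w)| \le \lVert \mu_m - \mu_f \rVert_2$, so the distance between the two gender vectors bounds the bias any word can have.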

#### BB

Difference between the gender vectors of BB.

In [34]:
difference_gender_bb = np.linalg.norm(np.subtract(mean_embedding_male_bb, mean_embedding_female_bb))
difference_gender_bb

0.6565799

To get the average bias in discussions about Breaking Bad, the mean bias of the words in each of Wevers' categories is calculated, and these category means are then averaged.

In [35]:
bias = {}
total_bias_bb = []

for column in df_cats.columns:
    for word in df_cats[column]:
        if word in model_bb.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model_bb.wv[word], mean_embedding_male_bb))
            distance_female = np.linalg.norm(np.subtract(model_bb.wv[word], mean_embedding_female_bb))
            gender_bias_word = distance_male - distance_female
            bias[word] = gender_bias_word
    bias_calc = pd.DataFrame.from_dict(bias, orient = 'index', columns= ['col'])
    mean_bias = bias_calc['col'].mean()
    #print(column, ":", mean_bias)
    total_bias_bb.append(mean_bias)
    bias = {}
print(np.mean(total_bias_bb))

-0.1692482244791621


The bias for each of Wevers' categories is the mean of the biases of all words in that category.

In [36]:
bias = {}

for column in df_cats.columns:
    for word in df_cats[column]:
        if word in model_bb.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model_bb.wv[word], mean_embedding_male_bb))
            distance_female = np.linalg.norm(np.subtract(model_bb.wv[word], mean_embedding_female_bb))
            gender_bias_word = distance_male - distance_female
            bias[word] = gender_bias_word
    bias_calc = pd.DataFrame.from_dict(bias, orient = 'index', columns= ['col'])
    mean_bias = bias_calc['col'].mean()
    print(column, ":", mean_bias)
    bias = {}

affect : -0.1729731326646144
posemo : -0.16990755615281125
negemo : -0.17475669961614707
social : -0.15973079005877178
family : -0.14413301302836493
cogproc : -0.19673039964059505
percept : -0.16080044291831636
body : -0.16532285460110369
work : -0.20211017339728599
leisure : -0.17945732921361923
money : -0.18830093068461265
relig : -0.0955503053135342
occupation : -0.19045329093933105


#### OITNB

Difference between the gender vectors of OITNB.

In [37]:
difference_gender_oitnb = np.linalg.norm(np.subtract(mean_embedding_male_oitnb, mean_embedding_female_oitnb))
difference_gender_oitnb

0.05801101

Average bias in discussions about Orange is the New Black.

In [38]:
bias = {}
total_bias_oitnb = []

for column in df_cats.columns:
    for word in df_cats[column]:
        if word in model_oitnb.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model_oitnb.wv[word], mean_embedding_male_oitnb))
            distance_female = np.linalg.norm(np.subtract(model_oitnb.wv[word], mean_embedding_female_oitnb))
            gender_bias_word = distance_male - distance_female
            bias[word] = gender_bias_word
    bias_calc = pd.DataFrame.from_dict(bias, orient = 'index', columns= ['col'])
    mean_bias = bias_calc['col'].mean()
    #print(column, ":", mean_bias)
    total_bias_oitnb.append(mean_bias)
    bias = {}
print(np.mean(total_bias_oitnb))

0.0019175960883425268


Bias per category.

In [39]:
bias = {}

for column in df_cats.columns:
    for word in df_cats[column]:
        if word in model_oitnb.wv.vocab:
            distance_male = np.linalg.norm(np.subtract(model_oitnb.wv[word], mean_embedding_male_oitnb))
            distance_female = np.linalg.norm(np.subtract(model_oitnb.wv[word], mean_embedding_female_oitnb))
            gender_bias_word = distance_male - distance_female
            bias[word] = gender_bias_word
    bias_calc = pd.DataFrame.from_dict(bias, orient = 'index', columns= ['col'])
    mean_bias = bias_calc['col'].mean()
    print(column, ":", mean_bias)
    bias = {}

affect : 0.004558481622998621
posemo : 0.004853530786931515
negemo : 0.004251630492508412
social : -0.0007804765997604392
family : -0.0019715074449777603
cogproc : 0.001793845419908737
percept : 0.00402089090245526
body : 0.0005251049167580075
work : -0.0002679368481040001
leisure : 0.0016652430097262064
money : 0.0009889466067155202
relig : 0.002361629158258438
occupation : 0.0029293671250343323


# Interpretation and discussion 

A smaller distance from a word to a gender vector suggests that the word is more strongly associated with that gender, and a larger distance suggests that it is less associated with that gender (Wevers, 2019).
The (average) difference in distance between Wevers' category words and the two gender vectors expresses the gender bias (Wevers, 2019). A positive value indicates a bias toward women and a negative value a bias toward men (given that the difference is calculated as gender bias = distance to male vector - distance to female vector).

The results show that there is an (average) gender bias in the discussions of BB (about -0.169), as expected. There is only a very small (average) gender bias in the discussions of OITNB (about 0.002), which contradicts the assumption of a clear bias.

As assumed, the biases point toward women in the discussions of OITNB (though this bias may be too small to count as a real bias) and toward men in the discussions of BB.
Moreover, in the BB discussions every one of Wevers' categories is biased in the same direction as the overall average, whereas in the OITNB discussions the direction is mixed: three categories (social, family, work) are slightly negative and the rest are slightly positive.

Measured against the distance between the male and female gender vectors of each show (0.657 for BB, 0.058 for OITNB), the average gender bias of the BB discussions is far larger in relative terms than that of the OITNB discussions.
This suggests that discussions of OITNB contain male- and female-related words almost equally, whereas discussions of BB contain more male- than female-related words.
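
The relative comparison can be made explicit with a little arithmetic on the outputs reported above. Dividing the absolute mean bias by the distance between the gender vectors is an assumed normalisation chosen for this sketch, and the helper `relative_bias` is hypothetical. The notebook compares the two quantities but does not state a formula.

```python
# Values copied from the notebook outputs above
mean_bias_bb, gender_distance_bb = -0.1692482244791621, 0.6565799
mean_bias_oitnb, gender_distance_oitnb = 0.0019175960883425268, 0.05801101

def relative_bias(mean_bias, gender_distance):
    """Absolute mean bias as a fraction of the male-female vector distance."""
    return abs(mean_bias) / gender_distance

rel_bb = relative_bias(mean_bias_bb, gender_distance_bb)
rel_oitnb = relative_bias(mean_bias_oitnb, gender_distance_oitnb)
print(f"BB: {rel_bb:.3f}  OITNB: {rel_oitnb:.3f}")
```

Under this normalisation the BB bias comes to roughly a quarter of the gender-vector distance, while the OITNB bias is only a few percent of it. The very small gender-vector distance for OITNB also means that its male and female mean embeddings are nearly identical, which limits how large any bias measured against them can be.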

As the dataset (more specifically the 'post' column) contains discussion posts about the shows but not the actual content of the shows themselves, it is important to note that statements can only be made about the gender bias of the discussions. Although the origin of a bias in the discussions may be connected to the content of the shows, it would be necessary to analyse that content to say something about the gender bias of the shows themselves.

Wevers, M. (2019). Using Word Embeddings to Examine Gender Bias in Dutch Newspapers, 1950-1990, 92–97. https://doi.org/10.18653/v1/w19-4712