Let's set up Spark NLP.

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2024-07-20 04:00:30--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 3.86.22.73
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|3.86.22.73|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2024-07-20 04:00:30--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’

Installing PySpark 3.2.3 and Spark NLP 5.4.1

2024-07-20 04:00:30 (30.8 MB/s) - written to stdout [1191/1191]

setup Colab for PySpark 3.2.3 and Spark NLP 5.4.1
[2

In [2]:
import sparknlp
spark = sparknlp.start()

from sparknlp.pretrained import PretrainedPipeline

import regex as re
from pyspark.sql import SparkSession
from pyspark import SparkContext

from bs4 import BeautifulSoup

In [3]:
pipeline = PretrainedPipeline("explain_document_ml")

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[OK!]


**Introduction**

English pronouns can serve as either the subject or the object of a sentence. Which form appears depends on context, and the pronoun must take the right case to make grammatical sense. Focusing on the gendered pronouns (he/him/his and she/her/hers), one might assume that the two sets correspond one-to-one and can replace each other in any sentence. They do not: 'her' and 'his' each have two uses. Take the sentences 'I saw her' and 'That is her name'. In 'I saw her', 'her' is a personal pronoun (strictly speaking, the object of the verb) and corresponds to 'him' in 'I saw him'. In 'That is her name', 'her' works like an adjective describing 'name'. Here it is a possessive determiner, always used with a noun, and it corresponds to 'his' in 'That is his name'. 'His' also has two uses. We have already seen it as a possessive determiner, but it can also be a possessive pronoun standing in place of a noun, as in 'That is his'. This version of 'his' expresses ownership and corresponds to the feminine 'hers' in 'That is hers'. Both versions of 'his' are possessive forms. In the analysis below, the personal forms (he/him and she/her) are counted as the 'subject' group and compared against all gendered pronoun forms.
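The correspondence described above can be written down as a small lookup table. This is only an illustrative sketch in standard grammatical terms; `FORMS` and `roles_of` are hypothetical names introduced here and are not part of the notebook's pipeline.

```python
# Masculine/feminine counterparts by grammatical role.
# FORMS and roles_of are illustrative names introduced for this sketch.
FORMS = {
    "subject": ("he", "she"),                 # "He saw it" / "She saw it"
    "object": ("him", "her"),                 # "I saw him" / "I saw her"
    "possessive determiner": ("his", "her"),  # "his name" / "her name"
    "possessive pronoun": ("his", "hers"),    # "That is his" / "That is hers"
}

def roles_of(word):
    """Return every role in which `word` appears in FORMS."""
    return [role for role, pair in FORMS.items() if word in pair]

for word in ("her", "his", "him", "hers"):
    print(word, "->", roles_of(word))
```

'her' and 'his' each map to two roles, while 'him' and 'hers' map to one, which is why the two sets cannot simply be swapped form for form.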

To look for ingrained biases in writing, we can examine how frequently these pronouns appear in their different forms. For this, I have decided to explore how they show up in online FanFiction. FanFiction consists of unpublished stories, traditionally based on an existing original story. A story might expand the original with the author's own interpretations, insert original characters into the world of an existing piece of literature, present an alternate version of an existing universe, or do something else entirely. The term 'FanFiction' has moved beyond stories based on existing universes and can now colloquially refer to any unpublished story.

Due to the open nature of the internet, anyone with a connection can upload and share their stories. Many popular FanFiction websites exist, including Wattpad and FanFiction.net, as well as communities on social media platforms such as Tumblr and Reddit. Though not representative of the entire amateur writing community, these sites offer a large selection of stories in many genres and from many existing universes.

To carry out this analysis, I will scrape FanFiction stories and analyze their text, looking at the uses and context of gendered pronouns throughout. I will compare different genres and existing universes to see whether there are significant differences in how often subject versus object gendered pronouns appear.

In [4]:
#Fanfiction dataset
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/horror_1.txt" -o horror_1.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/horror_2.txt" -o horror_2.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/horror_3.txt" -o horror_3.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/horror_4.txt" -o horror_4.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/horror_5.txt" -o horror_5.txt

!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/harrypotter_1.txt" -o harrypotter_1.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/harrypotter_2.txt" -o harrypotter_2.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/harrypotter_3.txt" -o harrypotter_3.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/harrypotter_4.txt" -o harrypotter_4.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/harrypotter_5.txt" -o harrypotter_5.txt

!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/romance_1.txt" -o romance_1.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/romance_2.txt" -o romance_2.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/romance_3.txt" -o romance_3.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/romance_4.txt" -o romance_4.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/romance_5.txt" -o romance_5.txt

!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/twilight_1.txt" -o twilight_1.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/twilight_2.txt" -o twilight_2.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/twilight_3.txt" -o twilight_3.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/twilight_4.txt" -o twilight_4.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/twilight_5.txt" -o twilight_5.txt

!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/warewolf_1.txt" -o warewolf_1.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/warewolf_2.txt" -o warewolf_2.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/warewolf_3.txt" -o warewolf_3.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/warewolf_4.txt" -o warewolf_4.txt
!curl "https://raw.githubusercontent.com/Jwarmflash/PronounTextAnalysis/main/warewolf_5.txt" -o warewolf_5.txt

In [5]:
def read_genre(name, n_files=5):
  # Concatenate name_1.txt ... name_<n_files>.txt into a single string
  parts = []
  for j in range(1, n_files + 1):
    # 'with' closes each file after reading; UTF-8 is assumed for the downloaded files
    with open(name + '_' + str(j) + '.txt', encoding='utf-8') as f:
      parts.append(f.read())
  return ''.join(parts)

horror = read_genre('horror')
romance = read_genre('romance')
warewolf = read_genre('warewolf')
harrypotter = read_genre('harrypotter')
twilight = read_genre('twilight')

In [74]:
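# Inspect a single character of a split sentence: it is a non-breaking space,
# which the cleaning cells below replace with a regular space.
# (horror_sentences here comes from an earlier run of the splitting code below.)
horror_sentences[11][97]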

'\xa0'

In [106]:
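# Normalize whitespace: non-breaking spaces, newlines, tabs, and runs of spaces
cleaned_text = horror.replace(u'\xa0', u' ')
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
cleaned_text = re.sub(r'\t+', ' ', cleaned_text)
cleaned_text = re.sub(r' +', ' ', cleaned_text)
# Replace literal '+' characters with a space
cleaned_text = re.sub(r'\+', ' ', cleaned_text)

# Drop numbers that directly follow sentence-ending punctuation (e.g. '12' or '1.2K'),
# keeping a period and, where present, a closing quote
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', cleaned_text)
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+', '.', cleaned_text)

horror_text_str = cleaned_text
# Split on spaces that follow ., ! or ? (optionally with a closing quote), but not
# after Mr./Mrs./Ms./Dr. or dotted abbreviations. The variable-width lookbehinds
# need the third-party `regex` module, imported above as `re`.
horror_sentences = re.split(r'(?<!\b(?:Mr|Mrs|Ms|Dr)\.)(?<!\w\.\w\.)(?<=[.!?][”"]?) +', cleaned_text)
horror_sentences[50:60]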

['Then you will feel a presence behind you...',
 'You will feel a tap on your shoulder...',
 'The last thing you will see before you die will be the face of the Blind Maiden, staring mercilessly at you with her horrible eyes.',
 "It is also said, that the blind maiden will rip out your eyes and take a snapshot of your face so that you will forever be a part of the website's picture gallery.",
 'Baby Blue is an urban legend about a strange game that kids play in bathrooms.',
 'If you perform the ritual, they say an evil ghostly infant will appear in your arms.',
 "This urban legend is related to the myth of 'Bloody Mary'.",
 'To play "Blue Baby Blue", you have to go into the bathroom on your own, turn off the lights and lock the door.',
 'Then you stare into the mirror, hold out your arms like you are rocking a baby and repeat the words, "Baby Blue, Blue Baby" 13 times without making a mistake.',
 'If you do it right, you will suddenly feel the weight of an invisible baby in your arms.']
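To see what the clean-up steps do, here is a condensed sketch applied to a made-up string. It uses only the standard-library `re` module, which is sufficient for these particular patterns; the `clean` helper and the sample text are assumptions for illustration, and the `'+'` replacement step is left out.

```python
import re  # the standard-library module is enough for these patterns

def clean(text):
    """Apply the whitespace and number clean-up steps to a raw string."""
    text = text.replace("\xa0", " ")       # non-breaking space -> space
    text = re.sub(r"[\n\t]+", " ", text)   # newlines and tabs -> space
    text = re.sub(r" +", " ", text)        # collapse runs of spaces
    # Numbers right after closing punctuation, with or without a closing quote
    text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', text)
    text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', text)
    return text

sample = "She ran.” 12\nHe hid!\xa01.2K Then\t\tsilence."  # made-up text
print(clean(sample))
```

Note that the same patterns also remove a number that legitimately starts a sentence, and they turn a '!' or '?' before a number into '.', so the clean-up is lossy.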

In [107]:
cleaned_text = romance.replace(u'\xa0', u' ')
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
cleaned_text = re.sub(r'\t+', ' ', cleaned_text)
cleaned_text = re.sub(r' +', ' ', cleaned_text)
cleaned_text = re.sub(r'\+', ' ', cleaned_text)
cleaned_text = re.sub(r'[\*~•°]+', '', cleaned_text)

cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', cleaned_text)
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+', '.', cleaned_text)

romance_text_str = cleaned_text
romance_sentences = re.split(r'(?<!\b(?:Mr|Mrs|Ms|Dr)\.)(?<!\w\.\w\.)(?<=[.!?][”"]?) +', cleaned_text)
romance_sentences[608:611]

['"Why would you want to date Tyson?',
 'I have a way bigger dick.',
 'My dick is so long it goes from a all the way to z."']

In [108]:
cleaned_text = warewolf.replace(u'\xa0', u' ')
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
cleaned_text = re.sub(r'\t+', ' ', cleaned_text)
cleaned_text = re.sub(r' +', ' ', cleaned_text)
cleaned_text = re.sub(r'\+', ' ', cleaned_text)
cleaned_text = re.sub(r'\. \. \.', '...', cleaned_text)
cleaned_text = re.sub(r'[•~\*]+', '', cleaned_text)

cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', cleaned_text)
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+', '.', cleaned_text)

warewolf_text_str = cleaned_text
warewolf_sentences = re.split(r'(?<!\b(?:Mr|Mrs|Ms|Dr)\.)(?<!\w\.\w\.)(?<=[.!?][”"]?) +', cleaned_text)
print(warewolf_sentences[8:10])
print(warewolf_sentences[10793])

['The pack it led by the Alpha and their mate who is called the Luna (usually female), second in command is the Beta.', 'The Gamma comes in third and is mostly in charge of the warriors.']
Can I please bury my face in your neck to smell more of your delicious earthy scent that I pick up with my supernatural nose?


In [109]:
cleaned_text = harrypotter.replace(u'\xa0', u' ')
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
cleaned_text = re.sub(r'\t+', ' ', cleaned_text)
cleaned_text = re.sub(r' +', ' ', cleaned_text)
cleaned_text = re.sub(r'\+', ' ', cleaned_text)
cleaned_text = re.sub(r'\. \. \.', '...', cleaned_text)
cleaned_text = re.sub(r'[•~\*°ϟ]+', '', cleaned_text)

cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', cleaned_text)
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+', '.', cleaned_text)

harrypotter_text_str = cleaned_text
harrypotter_sentences = re.split(r'(?<!\b(?:Mr|Mrs|Ms|Dr)\.)(?<!\w\.\w\.)(?<=[.!?][”"]?) +', cleaned_text)
harrypotter_sentences[5000:5020]

['"Shut up Lily," he glanced at the picture of mom and dad behind me.',
 '"Look it\'s all Dad\'s fault."',
 "I glanced at Dad's equally messy hair in the picture that sat behind me.",
 "Harry was the spitting image of Dad, but with Mum's emerald green, almond shaped eyes.",
 'I was the opposite.',
 "I was the spitting image of Mum, but with Dad's round, hazel eyes.",
 'It was nearly impossible to tell we were twins.',
 'We sat around for a little while longer, mostly playing games on my iPod, of taking pictures of one another and laughing at how bad we looked.',
 'A few minutes before we figured the Dursleys, or rather Dudley, would be finish, I took a glance toward Harry.',
 'The knee of his jeans were finally beginning to wear, and his skin was visible through a quickly growing rip.',
 '"No funny business," snarled Uncle Vernon as we stepped out of the car.',
 '"Yes Uncle Vernon" Harry droned monotonously while at the same time I scoffed what I thought was quietly.',
 'Vernon glared 

In [110]:
cleaned_text = twilight.replace(u'\xa0', u' ')
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
cleaned_text = re.sub(r'\t+', ' ', cleaned_text)
cleaned_text = re.sub(r' +', ' ', cleaned_text)
cleaned_text = re.sub(r'\+', ' ', cleaned_text)
cleaned_text = re.sub(r'\. \. \.', '...', cleaned_text)
cleaned_text = re.sub(r'[•~·☾♡☽·︵‿⍣]+', '', cleaned_text)

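# Match the longer forms (decimals, 'K' suffix) first, as in the other genre cells,
# so that no partial residue such as a stray 'K' is left behind
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+\.?[0-9]*K?', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+\.?[0-9]*K?', '.', cleaned_text)
cleaned_text = re.sub(r'[\.!?][”"][ ]*[0-9]+', '."', cleaned_text)
cleaned_text = re.sub(r'[\.!?][ ]*[0-9]+', '.', cleaned_text)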

twilight_text_str = cleaned_text
twilight_sentences = re.split(r'(?<!\b(?:Mr|Mrs|Ms|Dr)\.)(?<!\w\.\w\.)(?<=[\.!?][”"]?) +', cleaned_text)
twilight_sentences[5000:5020]

["Paul chuckled, sniffing the air in hopes of catching Driana's scent.",
 'However, his happiness fell because there was something odd.',
 '"You smell different," Paul told her.',
 '"I never changed my products," Driana told him, confused.',
 'The girl furrowed her brows as she stared into his hard eyes.',
 'Paul inhaled deeply.',
 "He glared and pointed at Driana's jacket.",
 '"Where\'d you get that?"',
 'The girl hugged the jacket closer to her.',
 '"A friend."',
 '"I don\'t think you should be friends with them," Paul harshly said.',
 '"Since when do you control who I hang out with?"',
 'Driana raised a brow at him.',
 '"If you know what\'s good for you, return that to the owner," Paul ordered.',
 '"I\'m not one of your wolves to command, Paul," Driana argued.',
 '"Besides, how would you know what\'s good for me?"',
 '"Can\'t you listen, at least?"',
 'Paul growled.',
 '"Do you even know who that jacket belongs to?"',
 '"He\'s one of Dr. Carlisle Cullen\'s kids," Driana glared.']
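The sentence splitter relies on variable-width lookbehinds, which the third-party `regex` module accepts but the standard-library `re` module rejects. As a rough sketch of the same idea, the pattern can be rewritten with fixed-width lookbehinds only. The `SPLIT` pattern, the `split_sentences` helper, and the sample string are assumptions for illustration and are not the notebook's own code.

```python
import re  # standard library: lookbehinds must be fixed-width here

# One fixed-width lookbehind per title instead of a single alternation
SPLIT = re.compile(
    r'(?<!\bMr\.)(?<!\bMrs\.)(?<!\bMs\.)(?<!\bDr\.)'  # no split after titles
    r'(?<!\w\.\w\.)'                                   # no split after e.g. "a.m."
    r'(?:(?<=[.!?])|(?<=[.!?][”"]))'                   # after . ! ? or one of those plus a quote
    r' +'                                              # the spaces that are consumed
)

def split_sentences(text):
    """Split text into sentences at the spaces matched by SPLIT."""
    return SPLIT.split(text)

sample = 'He is one of Dr. Cullen\'s kids. "Where\'d you get that?" Paul growled.'
for sentence in split_sentences(sample):
    print(sentence)
```

The title 'Dr.' does not end a sentence here, while the closing quote after the question mark does, which mirrors the behaviour visible in the output above.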

In [123]:
spark = SparkSession.builder.appName("demo").getOrCreate()

In [124]:
# Define the prefixes
prefixes = ['he', 'him', 'his', 'she', 'her', 'hers']

In [121]:
#horror parts of speech
dfs = [pipeline.annotate(i.lower()) for i in horror_sentences]

# Extract words and parts-of-speech
tok_tag = [(df['token'],df['pos']) for df in dfs]

# fuse pos to word
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
horror_sentences_tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

#Count tagged words
horror_tagged_spark = spark.sparkContext.parallelize(horror_sentences_tagged)

horror_counts = (
    horror_tagged_spark.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

horror_counts.collect()[:10]


[('urbanJJ', 120),
 ('legendNN', 182),
 ('storyNN', 167),
 ('contemporaryVB', 1),
 ('itPRP', 2733),
 ('typeNN', 4),
 ('popularJJ', 14),
 ('beliefNN', 3),
 (',,', 9554),
 ('sometimesRB', 39)]
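The fuse-and-count step can be tried without a Spark session. In the sketch below, `collections.Counter` stands in for the `flatMap`/`map`/`reduceByKey` chain, and the `annotations` list is a hand-written stand-in for `pipeline.annotate()` results; its tags are illustrative and were not produced by the tagger.

```python
from collections import Counter

# Hand-written stand-ins for annotate() results (dicts with 'token' and 'pos' lists).
annotations = [
    {"token": ["she", "saw", "her", "name", "."],
     "pos":   ["PRP", "VBD", "PRP$", "NN", "."]},
    {"token": ["he", "saw", "her", "."],
     "pos":   ["PRP", "VBD", "PRP", "."]},
]

def fuse(annotation):
    """Join each token to its tag with no separator, e.g. 'she' + 'PRP'."""
    pairs = zip(annotation["token"], annotation["pos"])
    return " ".join(tok + tag for tok, tag in pairs)

tagged = [fuse(a) for a in annotations]

# Count every fused token+tag string across all sentences
counts = Counter(word for line in tagged for word in line.split(" "))

print(tagged[0])
print(counts["herPRP"], counts["herPRP$"], counts["sawVBD"])
```

Because the tag is glued directly onto the lower-cased token, the two uses of 'her' end up under different keys ('herPRP' and 'herPRP$'), which is what the later filtering step depends on.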

In [130]:
#romance parts of speech
dfs = [pipeline.annotate(i.lower()) for i in romance_sentences]

# Extract words and parts-of-speech
tok_tag = [(df['token'],df['pos']) for df in dfs]

# fuse pos to word
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
romance_sentences_tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

#Count tagged words
romance_tagged_spark = spark.sparkContext.parallelize(romance_sentences_tagged)

romance_counts = (
    romance_tagged_spark.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

romance_counts.collect()[:10]

 ('haveVBP', 1074),
 ('hatingVBG', 2),
 ('theDT', 7735),
 ('playerNN', 30),
 ('iNNP', 15725),
 ('wantVBP', 477),
 ('absolutelyRB', 31),
 ('horribleJJ', 43),
 ('languageNN', 9)]

In [132]:
#warewolf parts of speech
dfs = [pipeline.annotate(i.lower()) for i in warewolf_sentences]

# Extract words and parts-of-speech
tok_tag = [(df['token'],df['pos']) for df in dfs]

# fuse pos to word
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
warewolf_sentences_tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

#Count tagged words
warewolf_tagged_spark = spark.sparkContext.parallelize(warewolf_sentences_tagged)

warewolf_counts = (
    warewolf_tagged_spark.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

warewolf_counts.collect()[:10]

[('hereRB', 640),
 ('knowVB', 620),
 ('aboutIN', 793),
 ('thisDT', 2117),
 ('particularJJ', 20),
 ('werewolfNN', 136),
 ('someDT', 512),
 ('appliesVBZ', 2),
 ('onlyRB', 519),
 ('myPRP$', 6591)]

In [134]:
#harrypotter parts of speech
dfs = [pipeline.annotate(i.lower()) for i in harrypotter_sentences]

# Extract words and parts-of-speech
tok_tag = [(df['token'],df['pos']) for df in dfs]

# fuse pos to word
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
harrypotter_sentences_tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

#Count tagged words
harrypotter_tagged_spark = spark.sparkContext.parallelize(harrypotter_sentences_tagged)

harrypotter_counts = (
    harrypotter_tagged_spark.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

harrypotter_counts.collect()[:10]

[('12CD', 6),
 ('theDT', 10886),
 ('endNN', 71),
 (',,', 14040),
 ('shePRP', 3554),
 ('wasVBD', 3498),
 ('hereRB', 273),
 ('onlyRB', 361),
 ('everRB', 150),
 ('girlNN', 132)]

In [136]:
#twilight parts of speech
dfs = [pipeline.annotate(i.lower()) for i in twilight_sentences]

# Extract words and parts-of-speech
tok_tag = [(df['token'],df['pos']) for df in dfs]

# fuse pos to word
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
twilight_sentences_tagged = [" ".join(["".join(word) for word in hl]) for hl in zips]

#Count tagged words
twilight_tagged_spark = spark.sparkContext.parallelize(twilight_sentences_tagged)

twilight_counts = (
    twilight_tagged_spark.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

twilight_counts.collect()[:10]

[('disclaimerNN', 1),
 ('iNNP', 3063),
 ('doVBP', 239),
 ('notRB', 611),
 ('ownVB', 6),
 ('theDT', 7124),
 ('twilightNN', 29),
 ("don'tVBP", 149),
 ('charactersNNS', 9),
 ('anythingNN', 170)]

In [129]:
#Horror
#Filter the RDD for keys that match the criteria
filtered_rdd = horror_counts.filter(
    lambda kv: any(
        kv[0].lower().startswith(prefix) and kv[0][len(prefix):][0].isupper()
        for prefix in prefixes
    )
)

# Collect and print the results
print(filtered_rdd.collect())

#male subject pronouns and total male pronouns
male_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP', 'hisPRP$'])
male_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP'])
summed_male_pronouns = male_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_male_pronouns_subjects = male_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#female subject pronouns and total female pronouns
female_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP', 'herPRP$', 'hersNNS', 'hersPRP'])
female_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP'])
summed_female_pronouns = female_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_female_pronouns_subjects = female_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#pronoun proportions
horror_male_pronouns_subject_proportion = summed_male_pronouns_subjects / summed_male_pronouns
horror_female_pronouns_subject_proportion = summed_female_pronouns_subjects / summed_female_pronouns

print("Horror male subject pronoun proportion: " + str(horror_male_pronouns_subject_proportion))
print("Horror female subject pronoun proportion: " + str(horror_female_pronouns_subject_proportion))

[('shePRP', 2391), ('hePRP', 3809), ('herPRP', 408), ('herPRP$', 2165), ('himPRP', 1100), ('hisPRP$', 2592), ('hersNNS', 2), ('hersPRP', 1)]
Horror male subject pronoun proportion: 0.6544460738568191
Horror female subject pronoun proportion: 0.5635192268975237
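The filter and the proportion can be checked by hand in plain Python. The sketch below reuses the horror counts printed above; `is_pronoun_key` and `personal_share` are hypothetical helper names, and the `hereRB` entry is a made-up count added only to exercise the filter.

```python
prefixes = ["he", "him", "his", "she", "her", "hers"]

def is_pronoun_key(key):
    """True if key is a listed pronoun followed directly by an upper-case tag."""
    return any(
        key.lower().startswith(p) and key[len(p):len(p) + 1].isupper()
        for p in prefixes
    )

def personal_share(counts, personal, all_forms):
    """Share of the `all_forms` occurrences that belong to `personal`."""
    total = sum(counts.get(k, 0) for k in all_forms)
    return sum(counts.get(k, 0) for k in personal) / total

# Pronoun counts copied from the horror output above; 'legendNN' is from the
# horror word counts, and 'hereRB' is a made-up entry the filter should reject.
horror = {"shePRP": 2391, "hePRP": 3809, "herPRP": 408, "herPRP$": 2165,
          "himPRP": 1100, "hisPRP$": 2592, "hersNNS": 2, "hersPRP": 1,
          "legendNN": 182, "hereRB": 1}

filtered = {k: v for k, v in horror.items() if is_pronoun_key(k)}

male = personal_share(filtered, ["hePRP", "himPRP"],
                      ["hePRP", "himPRP", "hisPRP$"])
female = personal_share(filtered, ["shePRP", "herPRP"],
                        ["shePRP", "herPRP", "herPRP$", "hersNNS", "hersPRP"])
print(male, female)
```

Written out, the male proportion is (3809 + 1100) / (3809 + 1100 + 2592) = 4909 / 7501 and the female proportion is (2391 + 408) / (2391 + 408 + 2165 + 2 + 1) = 2799 / 4967, which agree with the printed values.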


In [131]:
#Romance
#Filter the RDD for keys that match the criteria
filtered_rdd = romance_counts.filter(
    lambda kv: any(
        kv[0].lower().startswith(prefix) and kv[0][len(prefix):][0].isupper()
        for prefix in prefixes
    )
)

# Collect and print the results
print(filtered_rdd.collect())

#male subject pronouns and total male pronouns
male_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP', 'hisPRP$'])
male_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP'])
summed_male_pronouns = male_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_male_pronouns_subjects = male_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#female subject pronouns and total female pronouns
female_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP', 'herPRP$', 'hersNNS', 'hersPRP'])
female_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP'])
summed_female_pronouns = female_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_female_pronouns_subjects = female_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#pronoun proportions
romance_male_pronouns_subject_proportion = summed_male_pronouns_subjects / summed_male_pronouns
romance_female_pronouns_subject_proportion = summed_female_pronouns_subjects / summed_female_pronouns

print("Romance male subject pronoun proportion: " + str(romance_male_pronouns_subject_proportion))
print("Romance female subject pronoun proportion: " + str(romance_female_pronouns_subject_proportion))

[('hePRP', 4905), ('herPRP', 651), ('shePRP', 2928), ('himPRP', 2255), ('herPRP$', 2457), ('hisPRP$', 2763), ('hersPRP', 11), ('hersNNS', 12)]
Romance male subject pronoun proportion: 0.7215559810541167
Romance female subject pronoun proportion: 0.590691533256313


In [133]:
#Warewolf
#Filter the RDD for keys that match the criteria
filtered_rdd = warewolf_counts.filter(
    lambda kv: any(
        kv[0].lower().startswith(prefix) and kv[0][len(prefix):][0].isupper()
        for prefix in prefixes
    )
)

# Collect and print the results
print(filtered_rdd.collect())

#male subject pronouns and total male pronouns
male_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP', 'hisPRP$'])
male_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP'])
summed_male_pronouns = male_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_male_pronouns_subjects = male_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#female subject pronouns and total female pronouns
female_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP', 'herPRP$', 'hersNNS', 'hersPRP'])
female_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP'])
summed_female_pronouns = female_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_female_pronouns_subjects = female_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#pronoun proportions
warewolf_male_pronouns_subject_proportion = summed_male_pronouns_subjects / summed_male_pronouns
warewolf_female_pronouns_subject_proportion = summed_female_pronouns_subjects / summed_female_pronouns

print("Warewolf male subject pronoun proportion: " + str(warewolf_male_pronouns_subject_proportion))
print("Warewolf female subject pronoun proportion: " + str(warewolf_female_pronouns_subject_proportion))

[('hePRP', 3662), ('shePRP', 3085), ('herPRP', 798), ('herPRP$', 4442), ('hisPRP$', 4217), ('himPRP', 1710), ('hersNNS', 41), ('hersPRP', 8)]
Warewolf male subject pronoun proportion: 0.5602252581082491
Warewolf female subject pronoun proportion: 0.46369715786959637


In [135]:
#Harry potter
#Filter the RDD for keys that match the criteria
filtered_rdd = harrypotter_counts.filter(
    lambda kv: any(
        kv[0].lower().startswith(prefix) and kv[0][len(prefix):][0].isupper()
        for prefix in prefixes
    )
)

# Collect and print the results
print(filtered_rdd.collect())

#male subject pronouns and total male pronouns
male_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP', 'hisPRP$'])
male_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP'])
summed_male_pronouns = male_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_male_pronouns_subjects = male_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#female subject pronouns and total female pronouns
female_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP', 'herPRP$', 'hersNNS', 'hersPRP'])
female_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP'])
summed_female_pronouns = female_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_female_pronouns_subjects = female_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#pronoun proportions
harrypotter_male_pronouns_subject_proportion = summed_male_pronouns_subjects / summed_male_pronouns
harrypotter_female_pronouns_subject_proportion = summed_female_pronouns_subjects / summed_female_pronouns

print("Harry Potter male subject pronoun proportion: " + str(harrypotter_male_pronouns_subject_proportion))
print("Harry Potter female subject pronoun proportion: " + str(harrypotter_female_pronouns_subject_proportion))

[('shePRP', 3554), ('herPRP', 415), ('hePRP', 3592), ('herPRP$', 2633), ('hisPRP$', 2503), ('himPRP', 1347), ('hersNNS', 17), ('hersPRP', 6)]
Harry Potter male subject pronoun proportion: 0.6636656812684762
Harry Potter female subject pronoun proportion: 0.5990943396226415


In [137]:
#Twilight
#Filter the RDD for keys that match the criteria
filtered_rdd = twilight_counts.filter(
    lambda kv: any(
        kv[0].lower().startswith(prefix) and kv[0][len(prefix):][0].isupper()
        for prefix in prefixes
    )
)

# Collect and print the results
print(filtered_rdd.collect())

#male subject pronouns and total male pronouns
male_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP', 'hisPRP$'])
male_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['hePRP', 'himPRP'])
summed_male_pronouns = male_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_male_pronouns_subjects = male_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#female subject pronouns and total female pronouns
female_pronouns = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP', 'herPRP$', 'hersNNS', 'hersPRP'])
female_pronouns_subjects = filtered_rdd.filter(lambda kv: kv[0] in ['shePRP', 'herPRP'])
summed_female_pronouns = female_pronouns.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
summed_female_pronouns_subjects = female_pronouns_subjects.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)

#pronoun proportions
twilight_male_pronouns_subject_proportion = summed_male_pronouns_subjects / summed_male_pronouns
twilight_female_pronouns_subject_proportion = summed_female_pronouns_subjects / summed_female_pronouns

print("Twilight male subject pronoun proportion: " + str(twilight_male_pronouns_subject_proportion))
print("Twilight female subject pronoun proportion: " + str(twilight_female_pronouns_subject_proportion))

[('shePRP', 3554), ('hePRP', 2031), ('herPRP', 622), ('herPRP$', 4040), ('himPRP', 1016), ('hisPRP$', 1534), ('hersNNS', 15), ('hersPRP', 8)]
Twilight male subject pronoun proportion: 0.6651386160227024
Twilight female subject pronoun proportion: 0.5068576283529554
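For a side-by-side view, the proportions printed in the cells above can be collected in one table. The `gap` column and the four-decimal rounding are additions of this sketch, and nothing here tests whether the differences are statistically significant.

```python
# (male, female) proportions copied from the cell outputs above
reported = {
    "horror":      (0.6544460738568191, 0.5635192268975237),
    "romance":     (0.7215559810541167, 0.590691533256313),
    "warewolf":    (0.5602252581082491, 0.46369715786959637),
    "harrypotter": (0.6636656812684762, 0.5990943396226415),
    "twilight":    (0.6651386160227024, 0.5068576283529554),
}

# Difference between the male and female proportion in each genre
gaps = {genre: m - f for genre, (m, f) in reported.items()}

# Print genres from the widest gap to the narrowest
for genre, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    m, f = reported[genre]
    print(f"{genre:<12} male={m:.4f} female={f:.4f} gap={gap:.4f}")
```

In every genre sampled, the male proportion is higher than the female one; the gap is widest for twilight and narrowest for harrypotter.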
