# Pronoun Wars: Gendered Language Showdown in Mid-1800s Literature

Authors: Alaina Holland and Jonathan McGechie

*July 25, 2024*

PRESENTATION LINK: https://docs.google.com/presentation/d/1mnwp6gMs2t4wr-hH0glf7vDPN0Pls8ho/edit?usp=sharing&ouid=101740956676251134360&rtpof=true&sd=true


## Introduction



This study aims to determine if there is a statistically significant difference in the frequencies of subject and object pronouns for feminine and masculine personal pronouns in historical texts. Additionally, we explore whether the gender of the authors influences these results, shedding light on gender bias and representation in mid-19th century literature.

Understanding gendered language in literature is crucial as it reflects and reinforces societal norms and biases. By examining texts from the mid-1800s, we gain insight into the historical context of gender representation, which can inform contemporary discussions about gender bias in media and literature.

#### Grammatical Gender

- **Feminine:** Refers to nouns and pronouns typically associated with female entities (e.g., she, her, woman, girl).
- **Masculine:** Refers to nouns and pronouns typically associated with male entities (e.g., he, him, man, boy).

#### Syntactic Roles

- **Subject:** The noun or pronoun performing the action in a sentence (e.g., She threw the ball).
- **Object:** The noun or pronoun receiving the action in a sentence (e.g., He gave the book to her).

#### Putting it Together

- **Feminine Subject:** A feminine pronoun used as the subject of a sentence or clause (e.g., She went to the store).
- **Feminine Object:** A feminine pronoun used as the object of a verb or preposition (e.g., Give the flower to her).
- **Masculine Subject:** A masculine pronoun used as the subject of a sentence or clause (e.g., He is playing soccer).
- **Masculine Object:** A masculine pronoun used as the object of a verb or preposition (e.g., I saw him at the park).

**Important Note:**
*Grammatical gender doesn't always align with biological sex or gender identity. Many languages have grammatical genders that apply to inanimate objects or concepts.*

## Methodology

### Datasets

1. [**Moby-Dick**](https://www.gutenberg.org/ebooks/2701) (1851)  by Herman Melville (male-identified author)
2. [**Little Women**](https://www.gutenberg.org/ebooks/37106) (1868-1869) by Louisa May Alcott (female-identified author)
3. [**Little Men**](https://www.gutenberg.org/ebooks/2788) (1871) by Louisa May Alcott (female-identified author)

### Technologies

- **SparkNLP**: An open-source library providing advanced natural language processing capabilities.
- **PySpark**: A Python API for Apache Spark, enabling large-scale data processing.

### Data Processing

1. **Text Preprocessing**: Clean and tokenize the text to extract individual words.
2. **Part-of-Speech Tagging**: Use SparkNLP to annotate the text, identifying personal pronouns and their grammatical roles.
3. **Pronoun Frequency Analysis**: Calculate the frequencies of subject (e.g., he, she) and object (e.g., him, her) pronouns.
4. **Statistical Testing**: Perform chi-square tests to determine if differences in pronoun frequencies are statistically significant.



### Data Sources and Integrity

The texts used in this study were sourced from Project Gutenberg, a well-established repository of free eBooks. Project Gutenberg provides high-quality, digitized versions of public domain texts, ensuring that the works of Herman Melville and Louisa May Alcott are accurately represented. The following steps were taken to ensure data integrity:
1. **Text Verification**: Each text was verified to ensure it matched the original publication as closely as possible, minimizing errors and inconsistencies.
2. **Cleaning and Standardization**: Texts were cleaned to remove any extraneous characters, metadata, and formatting issues that could affect the analysis.
3. **Tokenization**: Texts were tokenized into individual words and sentences to facilitate accurate part-of-speech tagging and pronoun identification.
4. **Quality Control**: Multiple checks were performed to ensure that the processed texts maintained their integrity and that no relevant content was inadvertently removed.
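
The cleaning step above could include removing the Project Gutenberg license header and footer, which otherwise contribute words (and pronouns) that are not part of the novels. The following is a minimal sketch, not the procedure used in this study: the helper name `strip_gutenberg_boilerplate` is hypothetical, and the `*** END OF ...` marker is assumed to mirror the `*** START OF ...` line found in the downloaded files.

```python
import re

# Marker lines that surround the body of a Project Gutenberg eBook.
# The END pattern is an assumption modeled on the START line.
START_RE = re.compile(r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers (hypothetical helper)."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0        # keep everything if no START marker
    finish = end.start() if end else len(raw)  # keep everything if no END marker
    return raw[begin:finish].strip()

sample = (
    "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If a marker is missing, the helper falls back to the start or end of the file, so no novel text is dropped silently.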


## Hypothesis



**Hypothesis 1: Author Gender Influences Pronoun Usage**

We hypothesize that the gender of the authors significantly influences pronoun usage in the texts "Moby-Dick" and "Little Women," both published in the mid-19th century (1851 and 1868-1869, respectively). We expect "Little Women" to exhibit a higher frequency of feminine pronouns and "Moby-Dick" a higher frequency of masculine pronouns. This period is relevant because gender roles and biases were more pronounced in literature while the gender representation among writers was still roughly balanced (40% women to 60% men).

* **Null Hypothesis (H0):** There is no significant difference in the frequency of masculine and feminine pronouns between the texts.
* **Alternative Hypothesis (H1):** There is a significant difference in the frequency of masculine and feminine pronouns between the texts.

**Hypothesis 2: Gender Narration Perspective Impacts Pronoun Usage**

Furthermore, by comparing "Little Women" and "Little Men," we hypothesize that pronoun usage will statistically vary. We presume that "Little Women," written from a female perspective, will differ significantly from "Little Men," which is not from a female perspective.

* **Null Hypothesis (H0):** There is no significant difference in feminine pronoun frequency between "Little Women" and "Little Men".
* **Alternative Hypothesis (H1):** There is a significant difference in feminine pronoun frequency between "Little Women" and "Little Men".


#### Higher Frequency of Masculine Pronouns in "Moby-Dick"

1. **Implication:**  This could indicate a male-centric narrative focus, possibly reflecting the author's gender and societal norms of the time, which often emphasized male experiences and perspectives.

2. **Discussion:**  This outcome would align with the hypothesis that Herman Melville, as a male author, might have used more masculine pronouns, mirroring the predominant gender roles and biases of the mid-1800s.





#### Higher Frequency of Feminine Pronouns in "Little Women"

1. **Implication:**  This could suggest a female-centric narrative, emphasizing female experiences and perspectives. It may reflect Louisa May Alcott's intention to highlight women's roles and lives during the period.

2. **Discussion:**  This would support the hypothesis that Louisa May Alcott, as a female author, focused more on feminine pronouns, offering a counter-narrative to the male-dominated literature of the time.

#### Significant Difference in Pronoun Usage Between "Little Women" and "Little Men"

1. **Implication:**  A notable difference in pronoun usage between these two texts by the same author could indicate deliberate stylistic and thematic choices to reflect the gendered focus of each narrative.

2. **Discussion:**  This outcome would support the idea that even within the works of a single author, the gender perspective can significantly influence language use.

#### No Significant Difference in Pronoun Usage Across Texts

1. **Implication:**  This could suggest that the gender of the author does not significantly impact pronoun usage, or that the themes and contexts of the narratives override gendered language tendencies.

2. **Discussion:**  This outcome would challenge the hypothesis, suggesting that factors other than the author's gender might play a more crucial role in determining pronoun usage in literature.

#### Higher Frequency of Masculine Pronouns in Both Texts

1. **Implication:**  This might reflect broader societal norms and literary conventions of the time, where male experiences were more commonly represented, regardless of the author's gender.

2. **Discussion:**  This outcome would indicate that mid-19th century literature, in general, favored masculine pronouns, perhaps due to prevailing gender biases and expectations in society.

#### Higher Frequency of Feminine Pronouns in Both Texts

1. **Implication:**  This could be indicative of an exceptional focus on female experiences and perspectives, possibly challenging the typical gender biases of the time.

2. **Discussion:**  This outcome would be unexpected and would prompt further investigation into the specific themes and contexts of the texts that led to this result.



## Statistical Analysis



### Justification for Statistical Methods




#### Why Chi-square?

We use chi-square tests to determine if the observed differences in pronoun frequencies are statistically significant. The chi-square test is appropriate for this analysis because it evaluates whether the distribution of categorical variables differs from what would be expected if those variables (here, text and pronoun category) were independent.

#### Example of Chi-Square Evaluation Criteria for Hypothesis 1

| Step                         | Description                                                                                                             |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| 1. Define Hypotheses         | **Null Hypothesis (H0):** No significant difference in pronoun frequency between texts.<br> **Alternative Hypothesis (H1):** Significant difference in pronoun frequency between texts. |
| 2. Collect Data              | Count occurrences of masculine and feminine pronouns in the texts.                                                      |
| 3. Calculate Expected Frequencies (E) | Calculate expected frequency for each cell: <br> \[ E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}} \] |
| 4. Calculate Chi-Square Statistic | Use the formula: <br> \[ \chi^2 = \sum \frac{(O - E)^2}{E} \] |
| 5. Determine Degrees of Freedom (df) | \[ df = (rows - 1) \times (columns - 1) \]                                                                           |
| 6. Compare to Chi-Square Distribution | Find critical value for \( df = 1 \) at significance level \( \alpha = 0.05 \).                                      |
| 7. Make a Decision            | If \(\chi^2 \) > critical value, reject the null hypothesis.                                                          |
| 8. Interpretation             | Interpret the result in context.                                                                                       |
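
The steps in the table can be sketched in plain Python. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow; `chi_square` is an illustrative helper, not part of the notebook. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so it reports a somewhat smaller statistic than this uncorrected formula.

```python
def chi_square(observed):
    """Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Step 3: E = (row total * column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Step 4: sum of (O - E)^2 / E over every cell
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Step 5: df = (rows - 1) * (columns - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected

# Hypothetical counts (rows: two texts; columns: feminine, masculine)
example = [[30, 10], [20, 40]]
stat, df, expected = chi_square(example)

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
print(f"chi2 = {stat:.3f}, df = {df}, reject H0: {stat > CRITICAL_DF1_ALPHA_05}")
```

With these made-up counts the expected table is `[[20, 20], [30, 30]]`, which gives a statistic of about 16.67 on 1 degree of freedom, well above the critical value.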


#### Chi-Square Evaluation Criteria for Hypothesis 2

| Step                         | Description                                                                                                             |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| 1. Define Hypotheses         | **Null Hypothesis (H0):** There is no significant difference in pronoun frequency between "Little Women" and "Little Men".<br> **Alternative Hypothesis (H1):** There is a significant difference in pronoun frequency between "Little Women" and "Little Men". |
| 2. Collect Data              | Count occurrences of masculine and feminine pronouns in "Little Women" and "Little Men".                                                      |
| 3. Calculate Expected Frequencies (E) | Calculate expected frequency for each cell: <br> \[ E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}} \] |
| 4. Calculate Chi-Square Statistic | Use the formula: <br> \[ \chi^2 = \sum \frac{(O - E)^2}{E} \] |
| 5. Determine Degrees of Freedom (df) | \[ df = (rows - 1) \times (columns - 1) \]                                                                           |
| 6. Compare to Chi-Square Distribution | Find critical value for \( df = 1 \) at significance level \( \alpha = 0.05 \).                                      |
| 7. Make a Decision            | If \(\chi^2 \) > critical value, reject the null hypothesis.                                                          |
| 8. Interpretation             | Interpret the result in context.                                                                                       |


## Implications and Expected Impact

### Implications of Potential Pronoun Frequencies

### Expected Impact



**Confirmation of Gender Bias:**  
If the study finds significant differences in pronoun usage aligned with the author's gender, it would confirm that gender bias and representation were prevalent in mid-19th century literature. This could lead to a deeper understanding of how historical gender roles influenced literary expression and narrative focus.

**Challenge to Existing Assumptions:**  
If no significant differences are found, or if unexpected patterns emerge, it might challenge existing assumptions about gendered language in literature. This could prompt reevaluation of how gender influences writing and whether other factors (e.g., themes, genre, audience) play a more significant role.

**Implications for Modern Literature:**  
The findings could inform contemporary discussions about gender representation in modern literature and media. Understanding historical patterns of gender bias can provide context for evaluating and addressing gender biases in current literary practices.

By exploring these potential outcomes, the study aims to contribute to a nuanced understanding of gendered language in literature, highlighting the complex interplay between author gender, societal norms, and narrative focus.

## Evaluation


### SparkNLP Setup and Data Preparation


### Step 1: Install and Initialize SparkNLP

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# Install SparkNLP (this pin overrides the Spark NLP version set up by colab.sh above)
!pip install spark-nlp==4.2.8

# Import SparkNLP and start the session
import sparknlp
spark = sparknlp.start()

# Import PretrainedPipeline
from sparknlp.pretrained import PretrainedPipeline

# Load a pretrained pipeline
pipeline = PretrainedPipeline("explain_document_ml")


Installing PySpark 3.2.3 and Spark NLP 5.4.1
setup Colab for PySpark 3.2.3 and Spark 

### Step 2: Annotate Sample Sentences

Define a list of sample sentences containing various pronouns and use the pretrained pipeline to annotate them.

In [2]:
hls = [
    # Subject pronouns
    "She ran", "He ran", "I ran", "We ran", "They ran", "You ran", "It ran",
    # Object pronouns
    "I saw her", "I saw him", "I saw them", "I saw us", "I saw you", "I saw it",
    # Possessive determiners (modifying a noun)
    "I know her name", "I know his name", "I know their name", "I know our name", "I know your name", "I know its name",
    # Independent possessive pronouns (standing alone)
    "That is hers", "That is his", "That is theirs", "That is ours", "That is yours", "That is its",
    # Reflexive pronouns
    "She saw herself", "He saw himself", "I saw myself", "We saw ourselves", "They saw themselves", "You saw yourself", "It saw itself"
]

# Annotate each sentence, then pair every token with its POS tag
dfs = [pipeline.annotate(hl) for hl in hls]
tok_tag = [(df['token'], df['pos']) for df in dfs]
zips = [list(zip(tt[0], tt[1])) for tt in tok_tag]
# Render each sentence as "token/TAG token/TAG ..." so the tags are readable
tagged = [" ".join("/".join(word) for word in hl) for hl in zips]


### Step 3: Download and Read Texts

Download the texts from Project Gutenberg and read them into Python.

In [3]:
# Grab the texts from Project Gutenberg

!curl "https://www.gutenberg.org/cache/epub/2701/pg2701.txt" -o mobydick.txt
!curl "https://www.gutenberg.org/cache/epub/37106/pg37106.txt" -o littlewomen.txt
!curl "https://www.gutenberg.org/cache/epub/2788/pg2788.txt" -o littlemen.txt




In [4]:
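# Read each file as UTF-8; "utf-8-sig" also drops the byte-order mark at the start of the file
from pathlib import Path

mobydick = Path('mobydick.txt').read_text(encoding='utf-8-sig')
littlewomen = Path('littlewomen.txt').read_text(encoding='utf-8-sig')
littlemen = Path('littlemen.txt').read_text(encoding='utf-8-sig')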

In [5]:

# Sample output
print(mobydick[:1000])
pipeline.annotate(mobydick[:100])['pos']

print(littlewomen[:1000])
pipeline.annotate(littlewomen[:100])['pos']

print(littlemen[:1000])
pipeline.annotate(littlemen[:100])['pos']


﻿The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Moby Dick; Or, The Whale

Author: Herman Melville

Release date: July 1, 2001 [eBook #2701]
                Most recently updated: August 18, 2021

Language: English

Credits: Daniel Lazarus, Jonesey, and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***




MOBY-DICK;

or, THE WHALE.

By Herman Melville



CONTENTS

ETYMOLOGY.

EXTRACTS (Supplied by a Sub-Sub-Librarian).

CHAPTER 1. Loomings.

CHAPTER 2. The Carpet-Bag.

CHAPTER 3. The Spouter-Inn

['DT',
 'NNP',
 'NNP',
 'NN',
 'IN',
 'NNP',
 'NNP',
 ':',
 'NNP',
 'IN',
 'NNP',
 'IN',
 'NNS',
 'NNP',
 'DT',
 'NN',
 'VBZ',
 'IN']

### Step 4: Word Count Exploration

Use Spark to perform a simple word count on the texts.

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Pronoun Frequency Analysis").getOrCreate()

texts = {
    'mobydick': 'mobydick.txt',
    'littlewomen': 'littlewomen.txt',
    'littlemen': 'littlemen.txt'
}

for name, file in texts.items():
    text_rdd = spark.sparkContext.textFile(file)
    counts = (
        text_rdd.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    print(f"{name} word counts: {counts.take(10)}")


mobydick word counts: [('The', 634), ('Project', 80), ('of', 6642), ('Moby', 79), ('', 4320), ('ebook', 2), ('is', 1585), ('use', 36), ('anyone', 5), ('anywhere', 11)]
littlewomen word counts: [('The', 341), ('Project', 82), ('of', 3622), ('Women;', 3), ('Jo,', 406), ('Amy', 356), ('', 24604), ('ebook', 2), ('is', 758), ('use', 54)]
littlemen word counts: [('The', 218), ('Project', 79), ('of', 2019), ('Life', 3), ('at', 545), ('Boys', 3), ('', 3383), ('ebook', 2), ('is', 436), ('use', 21)]
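
The counts above include an empty-string key (`''`) and punctuated keys such as `'Jo,'` and `'Women;'`, because splitting each line on a single space keeps punctuation and produces empty strings wherever spaces repeat. A minimal sketch of a normalized count in plain Python follows; the helper `word_counts` and the sample sentence are illustrative assumptions, not part of the original notebook. The same regular expression could in principle be used inside the `flatMap` call above.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count lowercase word tokens; punctuation and empty strings are dropped."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

# Made-up sample with a double space and trailing punctuation
sample = "Jo, Jo  and Amy. The Project"

# Naive split: keeps "Jo," and "Jo" apart and yields an empty token
naive = Counter(sample.split(" "))
print("split(' '):", dict(naive))

# Normalized: one entry per word, regardless of case or punctuation
print("normalized:", dict(word_counts(sample)))
```

Lowercasing matters for the later pronoun analysis as well, since a sentence-initial "She" and a mid-sentence "she" are otherwise counted as different words.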


### Step 5: Pronoun Frequency Analysis
Perform a detailed analysis of pronoun frequencies.

In [7]:
import pyspark
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, PerceptronModel
from pyspark.sql.functions import col, explode, arrays_zip, when, count
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from pyspark.ml import Pipeline
import numpy as np


In [8]:
# Initialize Spark session
spark = SparkSession.builder.appName("Pronoun Frequency Analysis").getOrCreate()


In [9]:
# Load the texts
texts_data = [(mobydick, 'Moby-Dick'), (littlewomen, 'Little Women'), (littlemen, 'Little Men')]
df = spark.createDataFrame(texts_data, ["text", "book"])


In [10]:
# Define Spark NLP pipeline
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pos_tagger = PerceptronModel.pretrained().setInputCols(["document", "token"]).setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, tokenizer, pos_tagger])

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


In [11]:
# Process the texts
model = pipeline.fit(df)
result = model.transform(df)

In [12]:
# Extract tokens and POS tags
tokens_pos = result.select("book", explode(arrays_zip(result.token.result, result.pos.result)).alias("cols")) \
    .select("book", col("cols")["0"].alias("token"), col("cols")["1"].alias("pos"))


In [13]:
# Pronoun lists
feminine_pronouns = ['she', 'her']
masculine_pronouns = ['he', 'him']

In [14]:
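# Count pronoun occurrences by POS tag.
# Caveat: Penn Treebank tags do not encode syntactic role. 'PRP' covers both subject
# and object personal pronouns and 'PRP$' marks possessives, so these lists only
# approximate the subject/object split.
subject_tags = ['PRP']
object_tags = ['PRP$', 'DT']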
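
One way to separate roles more directly is to use the pronoun form itself, since English marks case on the word ("she"/"he" versus "him"), and to use the tag only to disambiguate "her". The sketch below is a heuristic under that assumption and is not the method used to produce the results in this notebook; `FORM_ROLES` and `classify_pronoun` are hypothetical names.

```python
# Pronoun form -> (gender, role). "her" is ambiguous and is resolved by its tag below.
FORM_ROLES = {
    "she": ("feminine", "subject"),
    "he": ("masculine", "subject"),
    "him": ("masculine", "object"),
}

def classify_pronoun(token, pos):
    """Return (gender, role) for a tagged token, or None if it is not tracked."""
    word = token.lower()  # so that sentence-initial "She" / "He" are matched too
    if word in FORM_ROLES:
        return FORM_ROLES[word]
    if word == "her":
        # PRP = personal pronoun ("I saw her"); PRP$ = possessive ("her name")
        if pos == "PRP":
            return ("feminine", "object")
        if pos == "PRP$":
            return ("feminine", "possessive")
    return None

tagged_tokens = [("She", "PRP"), ("him", "PRP"), ("her", "PRP"), ("her", "PRP$"), ("whale", "NN")]
for token, pos in tagged_tokens:
    print(token, pos, classify_pronoun(token, pos))
```

A dependency parser would be needed to handle the remaining edge cases (for example "It was she"), so this should be read as an approximation.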

In [15]:
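def get_count(count_list, index=0):
    # count_list holds the Rows from groupBy("book").count().collect(); row order is
    # not guaranteed, and a book with no matching tokens has no row at all
    return count_list[index]['count'] if index < len(count_list) else 0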
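
Because the order of grouped rows is not guaranteed, looking counts up by book title is safer than looking them up by position. A minimal sketch, assuming rows shaped like the output of `groupBy("book").count().collect()`; the helper `counts_by_name` and the numbers are made up for illustration.

```python
def counts_by_name(rows):
    """Map book title -> count for rows that expose "book" and "count" fields."""
    return {row["book"]: row["count"] for row in rows}

# Stand-in rows with made-up numbers; pyspark Row objects support the same row["field"] access
rows = [{"book": "Little Women", "count": 12}, {"book": "Moby-Dick", "count": 3}]
lookup = counts_by_name(rows)

books = ["Moby-Dick", "Little Women", "Little Men"]
per_book = [lookup.get(book, 0) for book in books]  # a missing book counts as 0
print(per_book)
```

With this lookup, a book that has no matching tokens yields 0 instead of shifting the counts of the other books.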

In [None]:
# Collect per-book counts for each gender/tag combination (one Row per book that has matches)
feminine_subject_count = tokens_pos.filter((col("token").isin(feminine_pronouns)) & (col("pos").isin(subject_tags))).groupBy("book").count().collect()
feminine_object_count = tokens_pos.filter((col("token").isin(feminine_pronouns)) & (col("pos").isin(object_tags))).groupBy("book").count().collect()
masculine_subject_count = tokens_pos.filter((col("token").isin(masculine_pronouns)) & (col("pos").isin(subject_tags))).groupBy("book").count().collect()
masculine_object_count = tokens_pos.filter((col("token").isin(masculine_pronouns)) & (col("pos").isin(object_tags))).groupBy("book").count().collect()


In [None]:
# Build a 2x2 contingency table (gender x role) from the first row of each result
data = [
    [
        get_count(feminine_subject_count),
        get_count(feminine_object_count)
    ],
    [
        get_count(masculine_subject_count),
        get_count(masculine_object_count)
    ]
]

In [None]:
# Perform chi-squared test
chi2, p, dof, ex = chi2_contingency(data)

In [None]:
# Visualize the data
labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']
counts_data = [
    get_count(feminine_subject_count),
    get_count(feminine_object_count),
    get_count(masculine_subject_count),
    get_count(masculine_object_count)
]

plt.figure(figsize=(10, 6))
plt.bar(labels, counts_data, color=['#FF69B4', '#FF69B4', '#1E90FF', '#1E90FF'])
plt.xlabel('Pronoun Type and Syntactic Role')
plt.ylabel('Frequency')
plt.title('Pronoun Frequencies by Syntactic Role and Text')
plt.show()

In [None]:
# Output results
print("Chi-Squared Test")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")

Visualize by book


In [None]:
# Visualization
books = ['Moby-Dick', 'Little Women', 'Little Men']
pronoun_labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']

# Extract counts for each book
# Positional lookup: assumes every collect() result lists the books in the same order as `books`
counts_by_book = {
    book: [
        get_count(feminine_subject_count, books.index(book)),
        get_count(feminine_object_count, books.index(book)),
        get_count(masculine_subject_count, books.index(book)),
        get_count(masculine_object_count, books.index(book))
    ] for book in books
}

In [None]:
# Create a bar chart for each book
for book, counts in counts_by_book.items():
    plt.figure(figsize=(10, 6))
    plt.bar(pronoun_labels, counts, color=['#FF69B4', '#FF69B4', '#1E90FF', '#1E90FF'])
    plt.xlabel('Pronoun Type and Syntactic Role')
    plt.ylabel('Frequency')
    plt.title(f'Pronoun Frequencies by Syntactic Role in "{book}"')
    plt.show()

    # Perform chi-squared test for each book
    chi2, p, dof, ex = chi2_contingency([counts[:2], counts[2:]])
    print(f"Chi-Squared Test for {book}")
    print(f"Chi2 Statistic: {chi2}")
    print(f"P-value: {p}")


Comparing Grammatical Gender by Feminine and Masculine Subject and Object by Book

In [None]:
# Compare the counts of feminine and masculine pronouns in subject and object positions for each book

books = ['Moby-Dick', 'Little Women', 'Little Men']
pronoun_labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']

# Extract counts for each book
counts_by_book = {
    book: [
        get_count(feminine_subject_count, books.index(book)),
        get_count(feminine_object_count, books.index(book)),
        get_count(masculine_subject_count, books.index(book)),
        get_count(masculine_object_count, books.index(book))
    ] for book in books
}

# Print the counts for comparison
for book, counts in counts_by_book.items():
    print(f"Pronoun Counts in '{book}':")
    for label, count in zip(pronoun_labels, counts):
        print(f"  {label}: {count}")
    print()


In [None]:
import matplotlib.pyplot as plt

# Pronoun counts for each book
books = ['Moby-Dick', 'Little Women', 'Little Men']
feminine_subjects = [127, 663, 2327]
feminine_objects = [291, 766, 2968]
masculine_subjects = [2532, 1998, 2083]
masculine_objects = [0, 0, 0]

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

# Create a bar chart for each book
for i, book in enumerate(books):
    ax = axes[i]
    labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']
    counts = [feminine_subjects[i], feminine_objects[i], masculine_subjects[i], masculine_objects[i]]

    ax.bar(labels, counts, color=['#FF69B4', '#D8BFD8', '#1E90FF', '#ADD8E6'])
    ax.set_xlabel('Pronoun Type and Syntactic Role')
    ax.set_ylabel('Frequency')
    ax.set_title(f'Pronoun Frequencies in "{book}"')

axes[-1].set_visible(False)  # the 2x2 grid has four panels but only three books
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Pronoun counts for each book
books = ['Moby-Dick', 'Little Women', 'Little Men']
feminine_subjects = [127, 663, 2327]
feminine_objects = [291, 766, 2968]
masculine_subjects = [2532, 1998, 2083]
masculine_objects = [0, 0, 0]

# Create separate figures for each pronoun type
fig1, ax1 = plt.subplots(figsize=(8, 6))
fig2, ax2 = plt.subplots(figsize=(8, 6))
fig3, ax3 = plt.subplots(figsize=(8, 6))
fig4, ax4 = plt.subplots(figsize=(8, 6))

# Create bar charts
ax1.bar(books, feminine_subjects, color='#FF69B4')
ax1.set_ylabel('Frequency')
ax1.set_title('Feminine Subject Pronoun Frequencies')

ax2.bar(books, feminine_objects, color='#D8BFD8')
ax2.set_ylabel('Frequency')
ax2.set_title('Feminine Object Pronoun Frequencies')

ax3.bar(books, masculine_subjects, color='#1E90FF')
ax3.set_ylabel('Frequency')
ax3.set_title('Masculine Subject Pronoun Frequencies')

ax4.bar(books, masculine_objects, color='#ADD8E6')
ax4.set_ylabel('Frequency')
ax4.set_title('Masculine Object Pronoun Frequencies')

# Display the charts
plt.show()

% of Pronoun Usage by book

In [None]:
# Calculate each category's share of the counted pronouns in every book

total_pronouns_by_book = {
    book: sum(counts)
    for book, counts in counts_by_book.items()
}

percentage_by_book = {
    book: {
        label: count / total_pronouns_by_book[book] * 100
        for label, count in zip(pronoun_labels, counts)
    }
    for book, counts in counts_by_book.items()
}

for book, percentages in percentage_by_book.items():
    print(f"Percentage of Pronoun Usage in {book}:")
    for label, percentage in percentages.items():
        print(f"  {label}: {percentage:.2f}%")


### **Hypothesis 1: Author Gender Influences Pronoun Usage**


* **Null Hypothesis (H0):** There is no significant difference in the frequency of masculine and feminine pronouns between the texts.
* **Alternative Hypothesis (H1):** There is a significant difference in the frequency of masculine and feminine pronouns between the texts.


In [None]:
# Compare pronoun usage in Little Women and Moby-Dick with a chi-squared test

# Extract counts for Little Women and Moby-Dick
little_women_counts = counts_by_book['Little Women']
moby_dick_counts = counts_by_book['Moby-Dick']

# Drop the masculine-object column: it is 0 for every book, and an all-zero column
# would give expected frequencies of 0. This leaves a 2x3 table.
little_women_counts = little_women_counts[:-1]
moby_dick_counts = moby_dick_counts[:-1]

# Perform chi-squared test
chi2, p, dof, ex = chi2_contingency([little_women_counts, moby_dick_counts])

# Output results
print("Chi-Squared Test for Little Women vs. Moby Dick")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")

In [None]:
# Visualize pronoun usage in Little Women and Moby-Dick

# Extract counts for Little Women and Moby-Dick
little_women_counts = counts_by_book['Little Women']
moby_dick_counts = counts_by_book['Moby-Dick']

# Create a bar chart for comparison
labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, little_women_counts, width, label='Little Women', color='#FF69B4')
rects2 = ax.bar(x + width/2, moby_dick_counts, width, label='Moby Dick', color='#1E90FF')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Frequency')
ax.set_title('Pronoun Frequencies in Little Women vs. Moby Dick')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()


### Results
* Chi2 Statistic: 607.79
* P-value: 1.0457e-132
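
As a sanity check on these numbers: the comparison cell passes a table with 2 rows (books) and 3 columns (the masculine-object column is dropped), so the test has 2 degrees of freedom. For that special case the chi-square tail probability has a simple closed form, which the sketch below evaluates from the rounded statistic. It is only a cross-check under those assumptions, not a replacement for `chi2_contingency`.

```python
import math

chi2_stat = 607.79        # statistic reported above (rounded)
df = (2 - 1) * (3 - 1)    # 2 books x 3 pronoun categories

# For df = 2 the chi-square survival function has the closed form exp(-x / 2)
p_value = math.exp(-chi2_stat / 2)
print(f"df = {df}, p = {p_value:.3e}")
```

The result is on the order of 1e-132, in line with the reported p-value; small differences come from rounding the statistic to two decimals.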

**Interpretation**

The chi-square test comparing "Little Women" and "Moby-Dick" yields a chi2 statistic of 607.79 with a p-value of about 1.05e-132, far below the 0.05 significance level. A difference in pronoun usage this large between the two books is therefore very unlikely to arise by chance alone.

**Conclusion**

**Reject the Null Hypothesis (H0):** We can confidently say that there is a significant difference in pronoun frequency between "Little Women" and "Moby-Dick."

**Support the Alternative Hypothesis (H1):** The data strongly supports the idea that there are significant differences in how pronouns are used in these two books.

**Addressing the Hypotheses**

Hypothesis 1: Higher Frequency of Masculine Pronouns in "Moby-Dick" vs. Higher Frequency of Feminine Pronouns in "Little Women"

The results show that "Moby-Dick" has a higher frequency of masculine pronouns, while "Little Women" has a higher frequency of feminine pronouns, which is consistent with our hypothesis.

In [None]:
# Chi-squared test and chart restricted to feminine pronouns in Little Women and Moby-Dick

# Extract counts for Little Women and Moby-Dick
little_women_counts = counts_by_book['Little Women'][:2]  # Feminine subject and object
moby_dick_counts = counts_by_book['Moby-Dick'][:2]  # Feminine subject and object

# Perform chi-squared test
chi2, p, dof, ex = chi2_contingency([little_women_counts, moby_dick_counts])

# Output results
print("Chi-Squared Test for Feminine Pronoun Usage in Little Women vs. Moby Dick")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")

# Create a bar chart for comparison
labels = ['Feminine Subject', 'Feminine Object']
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(8, 6))
rects1 = ax.bar(x - width/2, little_women_counts, width, label='Little Women', color='#FF69B4')
rects2 = ax.bar(x + width/2, moby_dick_counts, width, label='Moby Dick', color='#1E90FF')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Frequency')
ax.set_title('Feminine Pronoun Frequencies in Little Women vs. Moby Dick')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()


**Interpretation**

The **chi-square test** comparing feminine pronoun usage in "Little Women" and "Moby-Dick" yields a chi2 statistic of 33.23 and a p-value of 8.1948e-09, far below the typical significance level of 0.05. Because the table contains only the feminine subject and feminine object counts, the test compares how each book splits its feminine pronouns between subject and object roles. A difference of this size in that split is highly unlikely to have occurred by chance.

The **chart** visually confirms that "Little Women" uses feminine pronouns (both subject and object) far more frequently than "Moby-Dick."
This supports the hypothesis that "Little Women," written from a female perspective, emphasizes feminine pronouns to reflect the female-centric narrative.
In contrast, "Moby-Dick" has fewer feminine pronouns, aligning with the male-centric narrative focus of the text.
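
The chi-square test on the two feminine columns is sensitive to how each book divides its feminine pronouns between subject and object roles, so it can help to look at within-book shares alongside the raw frequencies in the chart. A minimal sketch, assuming placeholder counts rather than the study's data; `role_shares` is a hypothetical helper.

```python
def role_shares(counts, labels):
    """Return each category's share of the book's total, keyed by label."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return {label: count / total for label, count in zip(labels, counts)}

role_labels = ["Feminine Subject", "Feminine Object"]

# Placeholder counts (NOT the study's data), one list per book.
book_a_counts = [600, 400]
book_b_counts = [30, 70]

for name, counts in [("Book A", book_a_counts), ("Book B", book_b_counts)]:
    shares = role_shares(counts, role_labels)
    summary = ", ".join(f"{label}: {share:.0%}" for label, share in shares.items())
    print(f"{name}: {summary}")
```

Shares make two books of very different pronoun totals directly comparable, which raw bar heights do not.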


### **Hypothesis 2: Gender Narration Perspective Impacts Pronoun Usage**


* **Null Hypothesis (H0):** There is no significant difference in pronoun frequency between "Little Women" and "Little Men".
* **Alternative Hypothesis (H1):** There is a significant difference in pronoun frequency between "Little Women" and "Little Men".

In [None]:
# prompt: Compare Little Women and Little Men Pronoun Usage and chi-square

# Extract counts for Little Women and Little Men
lw_counts = counts_by_book['Little Women']
lm_counts = counts_by_book['Little Men']

# Remove Masculine Object Pronouns as they are all 0
lw_counts = lw_counts[:-1]
lm_counts = lm_counts[:-1]

# Perform chi-squared test
chi2, p, dof, ex = chi2_contingency([lw_counts, lm_counts])

# Print results
print("Chi-Squared Test for Little Women vs. Little Men")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")

#### Results and Interpretation
* Chi2 Statistic: 902.28
* P-value: 1.1794e-196

**Interpretation**

The chi-square test results show a very high chi2 statistic of 902.28, which indicates a significant difference between the observed and expected frequencies of pronoun usage in "Little Women" and "Little Men." The p-value is extremely small, much lower than the common threshold of 0.05, suggesting that the observed differences in pronoun frequencies are highly unlikely to have occurred by chance.

**Conclusion**

**Reject the Null Hypothesis (H0):** We can confidently reject the null hypothesis, which claims there is no significant difference in pronoun frequency between "Little Women" and "Little Men."

**Support the Alternative Hypothesis (H1):** The data strongly supports the alternative hypothesis, indicating a significant difference in pronoun usage between the two texts.

**Addressing the Hypotheses**

Hypothesis 2: Significant Difference in Pronoun Usage Between "Little Women" and "Little Men"

The test results confirm a statistically significant difference in pronoun usage between "Little Women" and "Little Men." This supports the idea that the narrative perspective and thematic focus of each text influence pronoun usage patterns.

In [None]:
# prompt: Visualize Little Women and Little Men Pronoun Usage Chi-Square

# Extract counts for Little Women and Little Men
lw_counts = counts_by_book['Little Women']
lm_counts = counts_by_book['Little Men']

# Create a bar chart for comparison
labels = ['Feminine Subject', 'Feminine Object', 'Masculine Subject', 'Masculine Object']
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(10, 6))
rects1 = ax.bar(x - width/2, lw_counts, width, label='Little Women', color='#FF69B4')
rects2 = ax.bar(x + width/2, lm_counts, width, label='Little Men', color='#1E90FF')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Frequency')
ax.set_title('Pronoun Frequencies in Little Women vs. Little Men')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()


In [None]:
# prompt: Visualize Little Women and Little Men Pronoun Usage

# Extract counts for Little Women and Little Men
lw_counts = counts_by_book['Little Women'][:2]  # Feminine subject and object
lm_counts = counts_by_book['Little Men'][:2]  # Feminine subject and object

# Perform chi-squared test
chi2, p, dof, ex = chi2_contingency([lw_counts, lm_counts])

# Output results
print("Chi-Squared Test for Feminine Pronoun Usage in Little Women vs. Little Men")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")

# Create a bar chart for comparison
labels = ['Feminine Subject', 'Feminine Object']
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(8, 6))
rects1 = ax.bar(x - width/2, lw_counts, width, label='Little Women', color='#FF69B4')
rects2 = ax.bar(x + width/2, lm_counts, width, label='Little Men', color='#D8BFD8')  # Lighter shade for Little Men

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Frequency')
ax.set_title('Feminine Pronoun Frequencies in Little Women vs. Little Men')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()


### Results

**Overall Pronoun Usage**
- **Chi2 Statistic:** 902.28
- **P-value:** 1.1794e-196

**Feminine Pronoun Usage**
- **Chi2 Statistic:** 2.63
- **P-value:** 0.1045

### Interpretation

**Overall Pronoun Usage:**

The chi-square test for overall pronoun usage shows a very high chi2 statistic (902.28) and an extremely small p-value (1.1794e-196). This indicates a significant difference in pronoun usage between "Little Women" and "Little Men," suggesting that the two texts use pronouns in markedly different ways.

**Feminine Pronoun Usage:**

The chi-square test for feminine pronoun usage shows a chi2 statistic of 2.63 and a p-value of 0.1045. Since the p-value is above the common significance level of 0.05, this suggests that there is no statistically significant difference in the usage of feminine pronouns between "Little Women" and "Little Men."
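
For a 2x2 table such as the feminine subject/object comparison, `chi2_contingency` applies Yates' continuity correction by default, which lowers the statistic and raises the p-value slightly. When a p-value sits near the 0.05 threshold, it may be worth checking both versions. The sketch below uses placeholder counts, not the study's data.

```python
from scipy.stats import chi2_contingency

# Placeholder 2x2 counts (NOT the study's data):
# rows are books, columns are feminine subject and feminine object.
table = [
    [520, 480],
    [300, 320],
]

# Default behaviour: Yates' continuity correction is applied when dof == 1.
chi2_corr, p_corr, dof, _ = chi2_contingency(table)

# Same table with the correction switched off.
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(f"dof = {dof}")
print(f"with correction:    chi2 = {chi2_corr:.3f}, p = {p_corr:.4f}")
print(f"without correction: chi2 = {chi2_raw:.3f}, p = {p_raw:.4f}")
```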

### Conclusion

**Overall Pronoun Usage:**
- **Reject the Null Hypothesis (H0):** We can confidently reject the null hypothesis that there is no significant difference in overall pronoun frequency between "Little Women" and "Little Men."
- **Support the Alternative Hypothesis (H1):** The data strongly supports the alternative hypothesis, indicating significant differences in overall pronoun usage between the two texts.

**Feminine Pronoun Usage:**
- **Fail to Reject the Null Hypothesis (H0):** We cannot reject the null hypothesis for feminine pronoun usage. The data do not provide sufficient evidence of a difference in feminine pronoun usage between "Little Women" and "Little Men."

### Addressing the Hypotheses

**Hypothesis 2: Significant Difference in Pronoun Usage Between "Little Women" and "Little Men"**
- **Overall Pronouns:** The results confirm a significant difference in overall pronoun usage, supporting the hypothesis that the narrative perspective and thematic focus influence pronoun usage patterns.
- **Feminine Pronouns:** The results indicate no significant difference in feminine pronoun usage, suggesting that both texts use feminine pronouns similarly, despite their different narrative focuses.

### Visualization Support

The bar chart provided earlier supports these findings:

- **Feminine Pronouns:** "Little Men" shows higher frequencies for feminine pronouns, but the statistical test indicates this difference is not significant.
- **Overall Pronouns:** The chart and statistical test together highlight significant differences in overall pronoun usage, particularly in the distribution of masculine and feminine pronouns.
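
To see which categories drive a significant overall result, one option is to inspect Pearson residuals, (observed - expected) / sqrt(expected), for each cell. Cells with large positive or negative residuals contribute most to the chi2 statistic. This is a sketch with placeholder counts, not the study's data, and `pearson_residuals` is a hypothetical helper.

```python
import math

def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [(obs - r * c / n) / math.sqrt(r * c / n) for obs, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]

categories = ["Feminine Subject", "Feminine Object", "Masculine Subject"]

# Placeholder counts (NOT the study's data): one row per book.
placeholder_table = [
    [40, 20, 40],
    [20, 10, 70],
]

residuals = pearson_residuals(placeholder_table)
for book, row in zip(["Book A", "Book B"], residuals):
    cells = ", ".join(f"{cat}: {res:+.2f}" for cat, res in zip(categories, row))
    print(f"{book}: {cells}")

# The squared residuals sum to the (uncorrected) chi2 statistic.
chi2_from_residuals = sum(res ** 2 for row in residuals for res in row)
print(f"chi2 = {chi2_from_residuals:.2f}")
```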


## Conclusion


Our study looked at the differences in pronoun usage between "Little Women," "Little Men," and "Moby-Dick" to see how gender and narrative perspective influenced language in the mid-1800s. Here's what we found:




### Hypotheses

**Hypothesis 1: Author Gender Influences Pronoun Usage**
- We thought the gender of the authors would significantly influence pronoun usage in "Moby-Dick" and "Little Women." Specifically, we expected "Little Women" to have more feminine pronouns and "Moby-Dick" to have more masculine pronouns.

**Hypothesis 2: Gender Narration Perspective Impacts Pronoun Usage**
- We also believed that pronoun usage would differ between "Little Women" and "Little Men," expecting "Little Women" to differ significantly from "Little Men" because of its female perspective.

#### Overall Pronoun Usage

**Chi-Squared Test**
- **Chi2 Statistic:** 1947.80
- **P-value:** reported as 0.0 (too small to represent as a 64-bit float)

There is a significant difference in overall pronoun usage among "Little Women," "Moby-Dick," and "Little Men." This supports our idea that different narrative focuses lead to different pronoun usage patterns.
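
A p-value printed as 0.0 has underflowed floating-point precision rather than being exactly zero. One way to recover its order of magnitude is to work with the logarithm of the chi-square survival function, which has a simple closed form when the degrees of freedom are even. The sketch below assumes 4 degrees of freedom (three books by three pronoun categories); that value is an assumption, not something stated in the analysis, and `log_chi2_sf_even_dof` is a hypothetical helper.

```python
import math

def log_chi2_sf_even_dof(x, dof):
    """Natural log of the chi-square survival function for even dof.

    Uses the closed form sf(x) = exp(-x/2) * sum_{k < dof/2} (x/2)**k / k!,
    which holds only when dof is a positive even integer.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2
    series = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return -half + math.log(series)

chi2_stat = 1947.80  # three-book statistic reported in this section, rounded
dof = 4  # ASSUMPTION: 3 books x 3 pronoun categories -> (3 - 1) * (3 - 1)

log10_p = log_chi2_sf_even_dof(chi2_stat, dof) / math.log(10)
print(f"log10(p) is about {log10_p:.1f}")
```

Reporting the result as a power of ten keeps it comparable with the other p-values in this section.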

### Little Women vs. Moby-Dick

#### Overall Pronoun Usage

- **Chi2 Statistic:** 607.79
- **P-value:** 1.0457e-132

Comparing "Little Women" and "Moby-Dick," we found a significant difference in pronoun usage. "Moby-Dick" has more masculine pronouns, while "Little Women" has more feminine pronouns, which aligns with the different perspectives of the authors.

#### Feminine Pronoun Usage

- **Chi2 Statistic:** 33.23
- **P-value:** 8.1948e-09

For feminine pronouns, there's a clear difference between "Little Women" and "Moby-Dick." "Little Women" uses more feminine pronouns, reflecting its focus on female experiences.

### Little Women vs. Little Men

#### Overall Pronoun Usage

- **Chi2 Statistic:** 902.28
- **P-value:** 1.1794e-196

There is a significant difference in overall pronoun usage between "Little Women" and "Little Men," showing that different narrative perspectives and themes lead to different pronoun usage.

#### Feminine Pronoun Usage

- **Chi2 Statistic:** 2.63
- **P-value:** 0.1045

For feminine pronouns, there is no significant difference between "Little Women" and "Little Men." This suggests both texts use feminine pronouns in similar ways, despite their different focuses.

### Notable Absence of Masculine Object Pronouns

One interesting finding is that the masculine object count is zero in all three works. The masculine object pronoun is "him" ("his" is possessive rather than objective). The authors' narrative styles and the societal norms of the mid-1800s could contribute to low counts, but because "him" is common in English prose of the period, a count of exactly zero more likely reflects how the pronouns were tallied and should be verified against the texts before it is interpreted.
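
A quick way to sanity-check a zero count is to run a simple whole-word tally on a sentence whose counts are known in advance. The sketch below is not the counting method used in this study; the `PRONOUN_CATEGORIES` mapping and `count_pronouns` helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical mapping from pronoun to category; the study's own word lists
# may differ.
PRONOUN_CATEGORIES = {
    "she": "Feminine Subject",
    "her": "Feminine Object",
    "he": "Masculine Subject",
    "him": "Masculine Object",
}

def count_pronouns(text):
    """Count whole-word, case-insensitive pronoun matches by category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        category = PRONOUN_CATEGORIES.get(token)
        if category:
            counts[category] += 1
    return counts

sample = "He gave the book to her. She thanked him, and then he left."
counts = count_pronouns(sample)
print(counts)
```

Note that "her" also serves as a possessive ("her book"), so a plain word tally overstates feminine object counts relative to "him"; separating the two uses would require part-of-speech tagging.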

### Implications of 1800s Gendered Writing

These findings show how gender and narrative perspective influenced writing in the mid-1800s. "Moby-Dick" by Herman Melville, with its higher use of masculine pronouns, reflects a male-centric view typical of that era. Louisa May Alcott's "Little Women" and "Little Men," on the other hand, show a more balanced use of feminine and masculine pronouns, highlighting female experiences even while addressing male perspectives in "Little Men."

This analysis helps us understand the historical context of gender representation in literature. The significant differences in pronoun usage suggest how authors' gender and societal norms may have shaped their writing. These insights are valuable for today's discussions about gender bias in media and literature, showing how historical gender roles have influenced language and storytelling.

By examining these texts, we can better appreciate how gendered language was shaped by cultural contexts, providing a basis for addressing and challenging gender biases in modern literature and media.


## References
1. https://phys.org/news/2018-02-professor-big-history-gender-fiction.html

2. http://gendernovels.digitalhumanitiesmit.org/info/subject_object_pronoun_analysis

3. With help from [Gemini](https://gemini.google.com/)