<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/11.nlp/HW10_SyntacticRelations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/11.nlp/HW10_SyntacticRelations.ipynb)

# HW10: Exploring gender in books

This notebook explores dependency parsing by identifying the actions and objects that are characteristically associated with characters as a function of their referential gender ("he"/"she").

In [1]:
import math
import operator

from collections import Counter

import spacy
from tqdm import tqdm

In [2]:
nlp = spacy.load('en_core_web_sm')

## Load data

We'll run seven novels by Jane Austen through spaCy (this will take a few minutes).

In [3]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/emma.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/lady_susan.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/mansfield_park.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/northanger_abbey.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/persuasion.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/pride.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/sense_and_sensibility.txt

--2025-11-09 18:10:20--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/emma.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 927445 (906K) [text/plain]
Saving to: ‘emma.txt’


2025-11-09 18:10:20 (5.35 MB/s) - ‘emma.txt’ saved [927445/927445]

--2025-11-09 18:10:20--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/lady_susan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 149538 (146K) [text/plain]
Saving to: ‘lady_susan.txt’


2025-11-09 18:10:21 (1.78 MB/s) - 

In [4]:
files = ["emma.txt", "lady_susan.txt", "mansfield_park.txt", "northanger_abbey.txt", "persuasion.txt", "pride.txt", "sense_and_sensibility.txt"]

def read_all_files(filenames):
    all_tokens = []

    for filename in tqdm(filenames):
        data = open(filename, encoding="utf-8").read()
        tokens = nlp(data)
        all_tokens.extend(tokens)
    return all_tokens

all_tokens = read_all_files(files)

100%|██████████| 7/7 [03:01<00:00, 25.91s/it]


In [5]:
print(len(all_tokens))

972810


## Setting up log odds

In [6]:
def logodds(counter1, counter2, display=25):
    """
    Function that takes two Counter objects as inputs and prints out a ranked list of terms
    more characteristic of the first counter than the second.  Here we'll use log-odds
    with an uninformative prior (from Monroe et al 2008, "Fightin Words", eqn. 22) as our metric.

    "Category 1" corresponds to the category of the counter1 object (the first argument)
    "Category 2" corresponds to the category of the counter2 object (the second argument)
    """
    vocab=dict(counter1)
    vocab.update(dict(counter2))
    count1_sum=sum(counter1.values())
    count2_sum=sum(counter2.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha

    for word in vocab:

        log_odds_ratio=math.log( (counter1[word] + alpha) / (count1_sum+alphaV-counter1[word]-alpha) ) - math.log( (counter2[word] + alpha) / (count2_sum+alphaV-counter2[word]-alpha) )
        variance=1./(counter1[word] + alpha) + 1./(counter2[word] + alpha)

        ranks[word]=log_odds_ratio/math.sqrt(variance)

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)

    print("Most category 1:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))

    print("\nMost category 2:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))

## Dependency parsing with SpaCy

SpaCy uses the [ClearNLP dependency labels](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md), which are very close to the Stanford typed dependencies.  See the [Stanford dependencies manual](http://people.ischool.berkeley.edu/~dbamman/DependencyManual.pdf) for more information about each tag.  Parse information is contained in the spacy token object; see the following for which attributes encode the token text, idx (position in sentence), part of speech, and dependency relation.  The syntactic head for a token is another token given in `token.head` (where all of those same token attributes are accessible).

In [7]:
test_doc = nlp("He started his car.")
for token in test_doc:
    print("\t".join(str(x) for x in [token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_]))


He	0	PRP	nsubj	started	3	VBD
started	3	VBD	ROOT	started	3	VBD
his	11	PRP$	poss	car	15	NN
car	15	NN	dobj	started	3	VBD
.	18	.	punct	started	3	VBD


**Q1**. Find the verbs that men are more characteristically the *subject* of than women.  Feel free to only consider subjects that are "he" and "she" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) which counts the number of times a given verb has "he" (`he_counter`) and "she" (`she_counter`) as its syntactic subject.

In [10]:
def count_subjects(tokens):
    he_counter = Counter()
    she_counter = Counter()

    for token in tokens:
        if token.lemma_ == 'he':
            for verb in token.head.children:
                if verb.head.pos_ == 'VERB' and verb.dep_ == 'nsubj':
                    verb_lemma = verb.head.lemma_
                    he_counter[verb_lemma] += 1
        elif token.lemma_ == 'she':
            for verb in token.head.children:
                if verb.head.pos_ == 'VERB' and verb.dep_ == 'nsubj':
                    verb_lemma = verb.head.lemma_
                    she_counter[verb_lemma] += 1

    return he_counter, she_counter

In [11]:
he_counts, she_counts = count_subjects(all_tokens)
logodds(he_counts, she_counts, display=10)

Most category 1:
8.634	come
5.232	reply
4.589	say
4.247	seem
4.010	talk
3.716	do
3.458	mean
3.374	continue
3.030	take
2.912	beg

Most category 2:
-8.431	feel
-4.644	hear
-3.175	think
-2.892	resolve
-2.792	fear
-2.753	expect
-2.729	cry
-2.683	find
-2.552	strike
-2.514	see


**Q2**. Find the verbs that men are more characteristically the *object* of than women.  Feel free to only consider objects that are "him" and "her" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) which counts the number of times a given verb has "him" (`he_counter`) and "her" (`she_counter`) as its syntactic direct object.

In [14]:
def count_objects(tokens):
    he_counter=Counter()
    she_counter=Counter()

    for token in tokens:
        if token.dep_ == 'dobj':
            verb = token.head
            if verb.pos_ != 'VERB':
                continue
            obj_text = token.text.lower()
            verb_lemma = verb.lemma_.lower()

            if obj_text == 'him':
                he_counter[verb.lemma_] += 1
            elif obj_text == 'her':
                she_counter[verb.lemma_] += 1

    return he_counter, she_counter

In [15]:
he_counts, she_counts = count_objects(all_tokens)
logodds(he_counts, she_counts, display=10)

Most category 1:
5.309	see
4.798	like
3.145	know
2.790	wish
2.570	introduce
2.454	send
2.333	believe
2.322	suspect
2.175	recommend
2.018	dislike

Most category 2:
-3.500	leave
-2.794	attend
-2.546	please
-2.317	strike
-2.182	prevent
-2.075	oblige
-2.068	support
-2.032	escape
-2.032	treat
-1.968	amuse


**Q3**. Find the objects that are *possessed* more frequently by men than women. Feel free to only consider possessors that are "his" and "her" pronouns.   This function should return two Counter objects (`he_counter` and `she_counter`) which counts the number of times a given term is possessed by "he" (`he_counter`) and "she" (`she_counter`).

In [20]:
def count_possessions(tokens):
    he_counter  = Counter()
    she_counter = Counter()

    for token in tokens:
        obj_text = token.text.lower()
        if token.dep_ == 'poss' and obj_text in ['his', 'her']:
            possessed_noun = token.head
            if possessed_noun.pos_ == 'NOUN':
                noun_lemma = possessed_noun.lemma_.lower()
                if obj_text == 'his':
                    he_counter[noun_lemma] += 1
                elif obj_text == 'her':
                    she_counter[noun_lemma] += 1

    return he_counter, she_counter

In [21]:
he_counts, she_counts = count_possessions(all_tokens)
logodds(he_counts, she_counts, display=10)

Most category 1:
4.569	manner
4.351	house
4.291	return
4.090	name
3.890	attachment
3.823	horse
3.745	address
3.610	profession
3.523	behaviour
3.461	son

Most category 2:
-7.223	mother
-4.576	aunt
-4.056	uncle
-3.909	sister
-3.882	spirit
-3.835	eye
-3.580	room
-3.504	heart
-3.102	brother
-3.046	thought


**Q4**. Find the actions that are men do *to women* more frequently than women do *to men*.  Feel free to only consider subjects and objects that are "she"/"he"/"her"/"him" pronouns.   This function should return two Counter objects (`he_counter` and `she_counter`) which counts the number of times a given verb has "he" as the subject and "her" as the object (`he_counter`) and "she" as the subject and "him" as the object (`she_counter`).

In [26]:
def count_SVO_tuples(tokens):
    he_counter  = Counter()
    she_counter = Counter()

    for token in tokens:
        if token.pos_ == 'VERB':
            verb_lemma = token.lemma_.lower()

            subj = None
            obj = None
            for child in token.children:
                if child.dep_ in ['nsubj', 'nsubjpass']:
                    subj = child
                elif child.dep_ == 'dobj':
                    obj = child

            if subj and obj:
                subj_text = subj.text.lower()
                obj_text  = obj.text.lower()
                if subj_text == 'he' and obj_text == 'her':
                    he_counter[verb_lemma] += 1
                elif subj_text == 'she' and obj_text == 'him':
                    she_counter[verb_lemma] += 1

    return he_counter, she_counter

In [27]:
he_counts, she_counts = count_SVO_tuples(all_tokens)
logodds(he_counts, she_counts, display=10)

Most category 1:
2.779	tell
2.089	leave
1.340	hear
1.340	join
1.229	give
1.075	assure
1.043	forget
1.043	address
0.923	ask
0.885	love

Most category 2:
-3.095	see
-1.593	have
-1.166	entreat
-0.997	know
-0.875	wish
-0.875	understand
-0.875	watch
-0.714	like
-0.663	accept
-0.633	refuse


**Q5**. **In a few sentences,** reflect on the analysis you did above. What claims can you make about the data? What are some limitations?

Across all analyses above, we see that there is a very clear difference in when terms are applied with reference to gender labels. It is also clear from the results that category 1 and 2 tend to follow certain themes in each case. For example, in the first analysis, we see that terms associated with 'he' tend to involve some form of speaking or doing. With 'she' the terms broadly embody emotion and sensory. With these analyses, we can find these relationships in the language used, and can claim that terms used in the text are not applied evenly for both 'he' and 'she'. A limitation, however, could be that the text is simply missing several cases where, for example, a term is possessed by one gender over another. In such a case, it is possible to have lots of data for one category, but very little for the other, impacting the validity of the results.