In [None]:
import spacy # text analysis
from spacy import displacy # tree plotting
import pandas as pd # data-frame manipulation
from tqdm.auto import tqdm # progress bar
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting

sns.set(context='paper', style='ticks', font_scale=1) # set the plot style

Let's load the spaCy model for the English language:

In [None]:
nlp = spacy.load("en_core_web_sm")

Let's also load the book corpus:

In [None]:
harry_potter_corpus = pd.read_csv("https://raw.githubusercontent.com/" +
                                  "alexis-raymond/NLP-HP-Books/refs/" +
                                  "heads/main/data/processed/training_df.csv")

You can see that each row stores one sentence, and each sentence is annotated with the book it is taken from.

In [None]:
harry_potter_corpus.head(10)

## 0. Inspiration

The contents of this practical are heavily inspired by this [paper](https://www.pnas.org/doi/10.1073/pnas.2319514121), so if you want to know more about this kind of work you are encouraged to read it.

## 1. Syntax in spaCy

Besides morphological analysis, spaCy can also be used for syntactic analysis. For instance, you can get the syntactic category of a word, i.e. its part of speech (POS).

Let's analyse the sentence "John saw a man with a telescope":


In [None]:
text = "John saw a man with a telescope."

Let's look at the POS and lemmas of the word-forms in this sentence:

In [None]:
for token in nlp(text):
    print(f'{token.text:{12}} {token.lemma_:{12}} {token.pos_}')
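
The `:{12}` in the f-string above is a format specification: it pads each field to a width of 12 characters so that the columns line up. A minimal sketch on a hand-typed row (the values are an assumption for illustration, not spaCy output):

```python
# Hand-typed values for illustration, not spaCy output
word, lemma, pos = "saw", "see", "VERB"

# {word:{12}} pads `word` with spaces on the right up to 12 characters;
# it is equivalent to the shorter {word:12}
row = f'{word:{12}} {lemma:{12}} {pos}'
print(row)
```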

Beyond this, we can also use spaCy to look at the syntactic relationships between the words in a sentence. However, the result is not quite like the binary trees that we saw in class. Instead, spaCy produces so-called dependency trees:

In [None]:
displacy.render(nlp(text), style='dep', jupyter=True)

These visualizations of syntax are called dependency trees because they show the relationships between heads and their dependents in a phrase, instead of grouping words together into nested structures. For us, the most important thing about this type of analysis is that it can detect the subject (`nsubj`) of a sentence, as well as the direct object (`dobj`).
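
To make the idea of heads and dependents concrete, here is a minimal sketch that stores a dependency tree as a plain list. The parse is written by hand (an assumption for illustration, not spaCy output), and `toy_parse` and `dependents` are hypothetical names. Here "with" is attached to "saw"; attaching it to "man" would encode the other reading of the sentence.

```python
# Hand-written toy parse (an assumption for illustration, not spaCy output).
# Each entry is (word, dependency label, index of the head word);
# the root points to itself.
toy_parse = [
    ("John", "nsubj", 1),
    ("saw", "ROOT", 1),
    ("a", "det", 3),
    ("man", "dobj", 1),
    ("with", "prep", 1),
    ("a", "det", 6),
    ("telescope", "pobj", 4),
]

def dependents(parse, head_index, label):
    """Return the words attached to `head_index` with the given label."""
    return [word for word, dep, head in parse
            if head == head_index and dep == label]

# The root is the only token whose label is ROOT
root = next(i for i, (_, dep, _) in enumerate(toy_parse) if dep == "ROOT")
print("subject:", dependents(toy_parse, root, "nsubj"))
print("object:", dependents(toy_parse, root, "dobj"))
```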

## 2. Which character has more agency?

Let's pick three characters (Harry, Hermione and Ron) and find all of the sentences where they are the subject (nsubj) of a verb. First, we need to find all the sentences that contain the words 'Harry', 'Hermione' or 'Ron'. For simplicity, we will start by looking only at book 1.

In [None]:
book_1 = harry_potter_corpus[harry_potter_corpus['book'] == 1]
book_1.shape

Let's find all the sentences with at least one of the three characters. We will be using the pandas `str` functionality to do this:

In [None]:
characters = ['Harry',
              'Hermione',
              'Ron']

sents = book_1[book_1['sentence'].str.contains('|'.join(characters))]
sents.shape

Great, 1/3 of the sentences in the first book contain the name of at least one of the characters we are interested in. Let's start by counting the number of sentences in which each character appears:

In [None]:
counts_frec = dict()

for sent in sents['sentence'].values:
    for char in characters:
        if char in sent:
            if char in counts_frec:
                counts_frec[char] += 1
            else:
                counts_frec[char] = 1

As you can see below, Harry is the most frequent of the three characters.

In [None]:
counts_frec
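
One caveat about the matching used so far: both `char in sent` and `str.contains` look for substrings, so 'Ron' also matches longer words that happen to contain it. A possible refinement, sketched here on made-up sentences (`toy_sentences` is a hypothetical name, not part of the corpus), is a regular expression with word boundaries:

```python
import re

characters = ['Harry', 'Hermione', 'Ron']

# \b marks a word boundary, so 'Ron' no longer matches inside 'Ronan'
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, characters)) + r')\b')

# Made-up sentences for illustration only
toy_sentences = [
    "Ron looked at the chessboard.",
    "Ronan pawed the ground.",
    "Nobody spoke.",
]

substring_hits = [s for s in toy_sentences if any(c in s for c in characters)]
boundary_hits = [s for s in toy_sentences if pattern.search(s)]

print(len(substring_hits), "sentences match as substrings")
print(len(boundary_hits), "sentences match as whole words")
```

The same pattern string (`pattern.pattern`) could be passed to `str.contains` in place of `'|'.join(characters)`. Whether the difference matters depends on how often such longer words occur in the corpus.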

Now, let's process those sentences with spaCy and find the ones where one of the characters has the `nsubj` relationship with a verb. To do this, we first find the tokens whose dependency label is `nsubj`, then look at their ancestors and retrieve the verb that governs them.

In [None]:
# find the tokens where a character is nsubj and record the verb that governs it
subjs = []

for sent in tqdm(sents['sentence'].values):
    doc = nlp(sent)

    for token in doc:
        if token.dep_ == 'nsubj' and token.text in characters:
            # ancestors are yielded from the closest head up to the root
            for ancestor in token.ancestors:
                if ancestor.pos_ == 'VERB':
                    subjs.append((token.text, ancestor.lemma_))
                    break  # keep only the closest verb, not every verb above it

Let's look at the results:

In [None]:
subjs[:10]

Now let's convert this to a dataframe:

In [None]:
subjects_df = pd.DataFrame(subjs, columns=['subject', 'verb'])
subjects_df.head(10)
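
The ancestor lookup used to find the governing verb can be pictured as walking up the tree one head at a time. The sketch below does this on a hand-written parse (an assumption for illustration, not spaCy output; `toy_parse`, `ancestors` and `governing_verb` are hypothetical names). Note that a subject inside an embedded clause has more than one verb above it, and only the closest one governs it.

```python
# Hand-written toy parse of "Hermione said that Harry left"
# (an assumption for illustration, not spaCy output).
# Each entry is (word, part of speech, index of the head);
# the root points to itself.
toy_parse = [
    ("Hermione", "PROPN", 1),
    ("said", "VERB", 1),
    ("that", "SCONJ", 4),
    ("Harry", "PROPN", 4),
    ("left", "VERB", 1),
]

def ancestors(parse, index):
    """Yield the indices of the heads above `index`, closest first."""
    while parse[index][2] != index:
        index = parse[index][2]
        yield index

def governing_verb(parse, index):
    """Return the closest ancestor tagged VERB, or None if there is none."""
    for head in ancestors(parse, index):
        if parse[head][1] == "VERB":
            return parse[head][0]
    return None

harry = 3
all_verbs = [toy_parse[i][0] for i in ancestors(toy_parse, harry)
             if toy_parse[i][1] == "VERB"]
print("all verbs above Harry:", all_verbs)
print("governing verb:", governing_verb(toy_parse, harry))
```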

Let's count the number of times each of the characters is a subject in a sentence. First, we extract the counts:

In [None]:
counts = subjects_df['subject'].value_counts()
counts

Now we plot them:

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=counts.index, y=counts.values)
plt.title('Number of times a character is a subject in a sentence')
plt.xlabel('Character')
plt.ylabel('Count')
plt.show()

While this plot provides some information, it is not very meaningful on its own, since Harry appears in the most sentences anyway. Let's divide each count by the number of sentences in which the character appears to get the percentage of `subjecthood`:

In [None]:
counts_df = pd.DataFrame(counts).reset_index()
counts_df.columns = ['subject', 'count']

counts_df['count_frec'] = counts_df['subject'].apply(lambda x: counts_frec[x])
counts_df['percentage_nsubj'] = counts_df['count'] / counts_df['count_frec']

counts_df.head(10)

Let's plot the percentages. What can this plot tell you about the first book?

In [None]:
### YOUR CODE HERE ###

As you can see from the `subjects_df` dataframe, we also stored the verb corresponding to each subject. Which verbs are used most frequently with which character?

In [None]:
### YOUR CODE HERE ###
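
As a hint, one possible way to count verbs per character is sketched below on made-up (subject, verb) pairs rather than on the real data (`toy_pairs` and `verbs_by_character` are hypothetical names):

```python
from collections import Counter, defaultdict

# Made-up (subject, verb) pairs for illustration, not results from the corpus
toy_pairs = [
    ("Harry", "look"), ("Harry", "say"), ("Harry", "look"),
    ("Ron", "say"), ("Ron", "say"), ("Hermione", "whisper"),
]

# One Counter of verbs per character
verbs_by_character = defaultdict(Counter)
for subject, verb in toy_pairs:
    verbs_by_character[subject][verb] += 1

for subject, verb_counts in verbs_by_character.items():
    # most_common(n) returns the n most frequent (verb, count) pairs
    print(subject, verb_counts.most_common(2))
```

With the real data, `subjects_df.groupby('subject')['verb'].value_counts()` should give a comparable table directly in pandas.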

## 3. Agency surplus

The interesting thing about spaCy is that we can detect not only subjects but also objects, i.e. the participants toward which the action is directed.

For instance, in the sentence "John saw a man with a telescope", "a man" or "a man with a telescope" are objects of the verb "see", depending on the interpretation.

We can see this in the dependency tree, where the objects of a verb are usually labelled `dobj`:

In [None]:
displacy.render(nlp(text), style='dep', jupyter=True)

Let's modify our code from above to also record cases in which one of the characters is an object. For that, we will add a new list `objs`, in which we will store the verbs of which the corresponding character is the `dobj`:

In [None]:
# find the tokens where a character is nsubj or dobj and record the governing verb
subjs = []
objs = []

for sent in tqdm(sents['sentence'].values):
    doc = nlp(sent)

    for token in doc:
        if token.dep_ == 'nsubj' and token.text in characters:
            # ancestors are yielded from the closest head up to the root
            for ancestor in token.ancestors:
                if ancestor.pos_ == 'VERB':
                    subjs.append((token.text, ancestor.lemma_))
                    break  # keep only the closest verb
        elif token.dep_ == 'dobj' and token.text in characters:
            for ancestor in token.ancestors:
                if ancestor.pos_ == 'VERB':
                    objs.append((token.text, ancestor.lemma_))
                    break  # keep only the closest verb

Now let's combine all of this into one dataframe:

In [None]:
# let's convert both lists to dataframes
subjs_df = pd.DataFrame(subjs, columns=['character', 'verb'])
objs_df = pd.DataFrame(objs, columns=['character', 'verb'])

# add columns with roles
subjs_df['role'] = 'subject'
objs_df['role'] = 'object'

# concatenate the dataframes
agency_df = pd.concat([subjs_df, objs_df])

agency_df.head(5)

Now we can count how many times each character appears as a subject and as an object:

In [None]:
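# name the counts column explicitly so that it is called 'count' in older pandas versions too
character_summary = agency_df.groupby('character')['role'].value_counts().reset_index(name='count')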
character_summary

Now let's convert the counts into percentages by summing them for each character and then dividing each entry by this sum:

In [None]:
character_summary['percentage'] = character_summary.groupby('character')['count'].transform(lambda x: x / x.sum())
character_summary

Now compute the agency surplus by applying the following equation to each character:

$$\text{Agency surplus} = \text{count}(\text{object}) - \text{count}(\text{subject})$$

If this value is positive, the character is more frequently an object, but if it's negative, the character is more likely to be a subject, i.e. an active participant.
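
To illustrate the sign convention just described (positive means more often an object, negative means more often a subject), here is a small worked sketch on made-up records. The data and the names `toy_roles` and `agency_surplus` are assumptions for illustration, not results from the corpus.

```python
from collections import Counter

# Made-up (character, role) records, for illustration only
toy_roles = [
    ("Harry", "subject"), ("Harry", "subject"), ("Harry", "object"),
    ("Ron", "subject"), ("Ron", "object"), ("Ron", "object"),
]
role_counts = Counter(toy_roles)

def agency_surplus(character):
    """Object count minus subject count for one character."""
    # Counter returns 0 for (character, role) pairs that never occur
    return (role_counts[(character, "object")]
            - role_counts[(character, "subject")])

surplus = {name: agency_surplus(name) for name in ("Harry", "Ron")}
print(surplus)
```

In this toy data Harry is a subject twice and an object once, so his value is negative; Ron is the other way round, so his value is positive.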

In [None]:
### YOUR CODE HERE ####

Finally, can you extend this analysis to all 6 books and compare the agency surplus of each character? To do this, you will need to write a function that takes the set of sentences in a book and returns the character info with the corresponding agency surplus.

In [None]:
### YOUR CODE HERE ####
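
As a hint for organising the extension, the sketch below assumes that the parsing step has already produced (book, character, role) records and only shows the bookkeeping per book. The function name `surplus_by_book`, the record format and the toy data are all assumptions for illustration; the spaCy parsing step is left out.

```python
from collections import defaultdict

def surplus_by_book(records):
    """Map each (book, character) pair to object count minus subject count.

    `records` is an iterable of (book, character, role) tuples, where role
    is 'subject' or 'object'. This input format is an assumption of the sketch.
    """
    surplus = defaultdict(int)
    for book, character, role in records:
        # an object adds 1, a subject subtracts 1
        surplus[(book, character)] += 1 if role == "object" else -1
    return dict(surplus)

# Made-up records for illustration only
toy_records = [
    (1, "Harry", "subject"), (1, "Harry", "subject"), (1, "Harry", "object"),
    (2, "Harry", "object"), (2, "Ron", "subject"),
]
result = surplus_by_book(toy_records)
print(result)
```

Keeping the book number in the key makes it straightforward to compare the same character across books afterwards.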