# 2) Adverbs: Do Great Writers Avoid Them?

**Goal:** Estimate the rate of -ly adverbs in each text and compare the two.

# Setup: Load Texts

This notebook needs **The Fellowship of the Ring** and **The Return of the King** as input texts.

**How to provide the texts:**
1. Acquire both books as plain-text (`.txt`) files.

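2. Place the two text files in the `data` folder (one level above this notebook, i.e. `../data/`) with these names, which match the defaults in `load_texts`:
   - `Fellowship.txt`
   - `TheKing.txt`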

In [1]:
import re
from pathlib import Path

In [None]:

def load_texts(local_fellow: str = '../data/Fellowship.txt',
               local_king: str = '../data/TheKing.txt'):
    """Load both texts from disk.

    Parameters
    ----------
    local_fellow : str
        Path to the Fellowship text file. Defaults to '../data/Fellowship.txt'.
    local_king : str
        Path to the TheKing text file. Defaults to '../data/TheKing.txt'.

    Returns
    -------
    tuple[str, str]
        (fellowship_text, theking_text).

    Raises
    ------
    FileNotFoundError
        If either file is missing.

    Extra Notes
    -----------
    - Using UTF-8 with `errors='ignore'` avoids codec exceptions on
      older Project Gutenberg dumps or inconsistent encodings.
    """
    p1, p2 = Path(local_fellow), Path(local_king)

    # Fail fast with a clear message if a file is missing
    if not p1.exists():
        raise FileNotFoundError(
            f"Missing file: {p1}\n"
            "→ Please place 'Fellowship.txt' at this path or update load_texts(...)."
        )
    if not p2.exists():
        raise FileNotFoundError(
            f"Missing file: {p2}\n"
            "→ Please place 'TheKing.txt' at this path or update load_texts(...)."
        )

    # Read the files (UTF-8; ignore undecodable bytes to stay robust)
    fellowship = p1.read_text(encoding='utf-8', errors='ignore')
    theking = p2.read_text(encoding='utf-8', errors='ignore')
    return fellowship, theking

def normalize(text: str, is_fellowship: bool = False) -> str:
    """Trim front matter and normalize line endings before tokenization."""
    if not text:
        return ''

    # Normalize line endings first so the '\n\n' markers below can match
    text = text.replace('\r\n', '\n')

    # Skip front matter: the Foreword and Prologue in Fellowship,
    # the table of contents in The Return of the King
    if is_fellowship:
        marker = 'Chapter 1\n\nA Long-expected Party'
    else:
        marker = 'Book V\n\nChapter 1. Minas Tirith'

    start = text.find(marker)
    if start != -1:
        text = text[start:]

    return text

# Load raw texts
fellowship_raw, theking_raw = load_texts()

# Normalize for tokenization
fellowship = normalize(fellowship_raw, is_fellowship=True)
theking = normalize(theking_raw, is_fellowship=False)

print(f"Fellowship chars: {len(fellowship):,} | TheKing chars: {len(theking):,}")

Fellowship chars: 948,198 | TheKing chars: 709,796


### Helpers: Tokenization

In [None]:
# Match words that start with a letter and may contain apostrophes
# (e.g. "don't", "frodo's"); leading and trailing quote marks are left out.
WORD_RE = re.compile(r"\b[A-Za-z][A-Za-z']*\b")

def words(text: str):
    """Smarter word tokenizer (lowercased, ASCII letters + internal apostrophes)."""
    return WORD_RE.findall(text.lower())


def sentences(text: str):
    """Naive sentence splitter using punctuation boundaries."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


# --- Run the tokenizers ---
fellowship_words = words(fellowship)
theking_words = words(theking)

fellowship_sentences = sentences(fellowship)
theking_sentences = sentences(theking)

# Save total word counts for later
nF = len(fellowship_words) # Total words in Fellowship
nR = len(theking_words) # Total words in TheKing

print(f"Fellowship words: {nF:,} | TheKing words: {nR:,}")
print(f"Fellowship sentences: {len(fellowship_sentences):,} | TheKing sentences: {len(theking_sentences):,}")

Fellowship words: 179,144 | TheKing words: 136,735
Fellowship sentences: 10,880 | TheKing sentences: 7,449
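
To see what these two helpers actually do, here is a small standalone check. It repeats the same regular expressions so it can run on its own, and the sample strings are made up for illustration (they are not taken from the texts).

```python
import re

# Same patterns as the helpers above, repeated so this snippet runs standalone.
WORD_RE = re.compile(r"\b[A-Za-z][A-Za-z']*\b")

def words(text: str):
    return WORD_RE.findall(text.lower())

def sentences(text: str):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

# Made-up sample strings (assumptions, for illustration only).
sample_words = words("Don't touch Frodo's ring, said the hobbits' friend.")
sample_sentences = sentences("Mr. Baggins left. He did not return!")

print(sample_words)
print(sample_sentences)
```

Internal apostrophes survive ("don't", "frodo's"), while the trailing apostrophe in "hobbits'" is dropped. The sentence splitter also breaks after "Mr.", so abbreviations inflate the sentence counts somewhat; treat those counts as rough estimates.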


### Estimate -ly Adverb Rate

In [18]:

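def adverb_rate(tokens):
    """Return (-ly word count, total words, percentage) using a naive suffix rule."""
    ly_words = [w for w in tokens if w.endswith('ly') and len(w) > 2]
    pct = (len(ly_words) / len(tokens)) * 100 if tokens else 0.0
    return len(ly_words), len(tokens), pct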

f_adv, f_total, f_pct = adverb_rate(fellowship_words)
r_adv, r_total, r_pct = adverb_rate(theking_words)
print(f"Fellowship: {f_adv}/{f_total} = {f_pct:.2f}%")
print(f"TheKing: {r_adv}/{r_total} = {r_pct:.2f}%")


Fellowship: 2078/179144 = 1.16%
TheKing: 1113/136735 = 0.81%
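
To make the two books easier to compare, the same counts can be restated per 10,000 words and as a ratio. This is a minimal sketch: the counts are copied from the output above, and `rate_per_10k` is a hypothetical helper name used only in this snippet.

```python
# Counts copied from the output above: (-ly words, total words).
counts = {
    "Fellowship": (2078, 179144),
    "TheKing": (1113, 136735),
}

def rate_per_10k(hits: int, total: int) -> float:
    """Occurrences per 10,000 words; returns 0.0 for an empty text."""
    return hits / total * 10_000 if total else 0.0

rates = {book: rate_per_10k(h, n) for book, (h, n) in counts.items()}
ratio = rates["Fellowship"] / rates["TheKing"]

for book, rate in rates.items():
    print(f"{book}: {rate:.1f} per 10k words")
print(f"Fellowship / TheKing ratio: {ratio:.2f}")
```

By this naive suffix rule, Fellowship uses -ly words roughly 1.4 times as often as TheKing. The rule still counts non-adverbs, so the ratio is only a first estimate.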


**Prompt:** Inspect a sample of detected -ly words. Which are true adverbs vs. adjectives/nouns? How would you refine the rule?

In [17]:
# Show the first -ly words detected in Fellowship (not all are true adverbs)
adverbs_fellowship = [w for w in fellowship_words if w.endswith('ly') and len(w)>2]
print(f"Fellowship adverbs (first 20): {adverbs_fellowship[:20]}")

Fellowship adverbs (first 20): ['shortly', 'popularly', 'apparently', 'reputedly', 'finally', 'comfortably', 'lively', 'only', 'suddenly', 'commonly', 'mainly', 'friendly', 'constantly', 'especially', 'seemingly', 'family', 'only', 'suddenly', 'mostly', 'firmly']
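
The sample above contains at least three false positives: "lively" and "friendly" are adjectives, and "family" is a noun. One possible refinement, sketched below, is a hand-made exclusion list. `NOT_ADVERBS` and `ly_adverb_candidates` are hypothetical names, and the list is an incomplete assumption rather than a vetted lexicon.

```python
# Hypothetical exclusion list: -ly words that are usually not adverbs.
# Hand-picked and incomplete; extend it after inspecting your own sample.
NOT_ADVERBS = {
    "family", "lively", "friendly", "lonely", "lovely", "elderly",
    "holy", "ugly", "silly", "belly", "folly", "lily", "july",
    "fly", "ally", "rely", "reply", "supply", "apply",
}

def ly_adverb_candidates(tokens):
    """Keep -ly tokens longer than two characters that are not excluded."""
    return [
        w for w in tokens
        if len(w) > 2 and w.endswith("ly") and w not in NOT_ADVERBS
    ]

# The first 20 detected words, copied from the output above.
sample = [
    'shortly', 'popularly', 'apparently', 'reputedly', 'finally',
    'comfortably', 'lively', 'only', 'suddenly', 'commonly', 'mainly',
    'friendly', 'constantly', 'especially', 'seemingly', 'family',
    'only', 'suddenly', 'mostly', 'firmly',
]

kept = ly_adverb_candidates(sample)
removed = [w for w in sample if w not in kept]

print(f"Kept {len(kept)} of {len(sample)}")
print(f"Removed: {removed}")
```

An exclusion list handles the obvious cases but cannot resolve words such as "only", which is an adverb in some sentences and an adjective in others. Deciding those requires looking at context.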


The suffix rule cannot tell adverbs from adjectives or nouns that happen to end in -ly, so let's use spaCy's part-of-speech tagger instead:

In [None]:
# Run this cell once
import spacy
from spacy.cli import download

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    download("en_core_web_sm")          # downloads the small English model
    nlp = spacy.load("en_core_web_sm")  # try again


In [19]:
def strip_gutenberg_markup(t: str) -> str:
    # remove lone underscores and _italic_ markup
    t = re.sub(r"\b_+\b", " ", t)
    t = re.sub(r"_([A-Za-z]+)_", r"\1", t)
    return t

# This helper function is from Notebook 1
def per_10k(count: int, total_words: int) -> float:
    """Normalize a raw count per 10,000 words for fair comparisons."""
    return (count / max(1, total_words)) * 10000.0

# 1) Pre-clean both texts
fellowship_clean = strip_gutenberg_markup(fellowship)
theking_clean = strip_gutenberg_markup(theking)

# 2) Process both texts with nlp()
#    This may take 1-2 minutes
print("Running spaCy analysis on Fellowship...")
f_doc = nlp(fellowship_clean)
print("Running spaCy analysis on TheKing...")
r_doc = nlp(theking_clean)

# 3) Count adverbs (POS tag == 'ADV') that end in '-ly'
f_adv_spacy = len([t for t in f_doc if t.pos_ == 'ADV' and t.text.endswith('ly')])
r_adv_spacy = len([t for t in r_doc if t.pos_ == 'ADV' and t.text.endswith('ly')])

# 4) Get rates per 10,000 words (using nF and nR from the tokenization cell)
f_adv_rate_spacy = per_10k(f_adv_spacy, nF)
r_adv_rate_spacy = per_10k(r_adv_spacy, nR)

print("\n--- Final spaCy-powered Adverb Rate ---")
print(f"Fellowship spaCy rate: {f_adv_rate_spacy:5.2f}/10k")
print(f"TheKing spaCy rate: {r_adv_rate_spacy:5.2f}/10k")

Running spaCy analysis on Fellowship...
Running spaCy analysis on TheKing...

--- Final spaCy-powered Adverb Rate ---
Fellowship spaCy rate: 102.54/10k
TheKing spaCy rate: 72.55/10k
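
spaCy refuses texts longer than `nlp.max_length` (1,000,000 characters by default), and Fellowship, at 948,198 characters, sits close to that limit. If a longer edition fails, or memory runs short, one option is to split the text at paragraph boundaries and feed the pieces to `nlp.pipe`. The sketch below is a minimal version under that assumption; `chunk_paragraphs` and `max_chars` are hypothetical names, and the demo text is made up.

```python
def chunk_paragraphs(text: str, max_chars: int = 100_000):
    """Yield chunks made of whole paragraphs, each at most max_chars long.

    A single paragraph longer than max_chars is yielded as its own chunk.
    """
    current, size = [], 0
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and size + len(para) + 2 > max_chars:
            yield "\n\n".join(current)
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 accounts for the paragraph separator
    if current:
        yield "\n\n".join(current)

# Tiny made-up text so the behaviour is easy to inspect.
demo = "\n\n".join(["alpha " * 5, "beta " * 5, "gamma " * 5])
chunks = list(chunk_paragraphs(demo, max_chars=60))
print(len(chunks), [len(c) for c in chunks])

# With spaCy loaded as `nlp` (as in the cell above), counting could look like:
# total = sum(
#     1
#     for doc in nlp.pipe(chunk_paragraphs(fellowship_clean))
#     for t in doc
#     if t.pos_ == "ADV" and t.text.endswith("ly")
# )
```

Splitting only at blank lines keeps every sentence inside one chunk, so the tagger should see the same local context as before. Counts may still differ slightly from a single-pass run.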


In [20]:
import pandas as pd
import altair as alt

# Create a DataFrame for the graph
data = [
    {'Book': 'Fellowship', 'Rate (per 10k words)': f_adv_rate_spacy},
    {'Book': 'The King', 'Rate (per 10k words)': r_adv_rate_spacy}
]
chart_df = pd.DataFrame(data)

# Build the chart
chart = alt.Chart(chart_df).mark_bar().encode(
    x=alt.X('Book', sort=None),  # Use sort=None to keep the order
    y=alt.Y('Rate (per 10k words)', title='-ly Adverb Rate (per 10k words)'),
    color='Book',
    tooltip=['Book', 'Rate (per 10k words)']
).properties(
    title='Stylistic Fingerprint (Tone): Adverb Rate in Tolkien'
).interactive()

# Save the chart
chart.save('adverb_rate_chart.json')
print("Graph saved to 'adverb_rate_chart.json'")
chart

Graph saved to 'adverb_rate_chart.json'
