
# 1) Frequent Words = Literary Fingerprints

This notebook compares **word frequency** between our two toy texts:
- *Pet Sematary* (referred to here as **pet**)
- *The Shining* (referred to here as **shining**)

We practice simple tokenization and frequency analysis, then discuss
what is **meaningful signal** vs. **noise** in the results, and how to
improve the method (normalization, keyness, etc.).


In [1]:
from pathlib import Path
import re
from collections import Counter
import math
# at least one letter, apostrophes only allowed inside (keeps "don't", drops "'" alone)
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

In [2]:
def load_texts(local_pet: str = '../data/PetSemetary.txt',
               local_shining: str = '../data/TheShining.txt'):
    p1, p2 = Path(local_pet), Path(local_shining)

    if not p1.exists():
        raise FileNotFoundError(f"Missing file: {p1}")
    if not p2.exists():
        raise FileNotFoundError(f"Missing file: {p2}")

    pet    = p1.read_text(encoding='utf-8', errors='ignore')
    shining = p2.read_text(encoding='utf-8', errors='ignore')
    return pet, shining

In [3]:
def normalize(text: str) -> str:
    if not text:
        return ''
    # normalize curly quotes to ASCII '
    text = text.replace("’", "'").replace("‘", "'")
    # normalize Windows endings
    text = text.replace('\r\n', '\n')
    # join hyphenated line breaks
    text = re.sub(r"-\s*\n", "", text)
    return text

def words(text: str):
    return WORD_RE.findall(text.lower())

def sentences(text: str):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]



## Load & Normalize
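We load both texts, checking that each file exists before reading it, and then apply a light normalization: straighten curly apostrophes, unify line endings, and rejoin words hyphenated across line breaks.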
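To make the three normalization steps concrete, here is a minimal sketch on a made-up sentence (not a quotation from either novel). `normalize_demo` is a hypothetical stand-in that repeats the same steps as the notebook's `normalize()` so the snippet runs on its own.

```python
import re

def normalize_demo(text: str) -> str:
    """Hypothetical stand-in repeating the three steps of normalize()."""
    # curly apostrophes -> straight ASCII apostrophe
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Windows line endings -> Unix line endings
    text = text.replace("\r\n", "\n")
    # rejoin a word split by a hyphen at a line break
    text = re.sub(r"-\s*\n", "", text)
    return text

# Invented sample with a curly apostrophe, a CRLF, and a hyphenated line break
sample = "Louis didn\u2019t look back at the ceme-\r\ntery."
cleaned = normalize_demo(sample)
print(cleaned)
```

One caveat worth discussing: the hyphen rule cannot tell a line-break hyphen from a real compound, so `"well-\nknown"` would come out as `"wellknown"`.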


In [4]:

# === Lightweight stopworded top-words helper ===

def top_words(words_list, min_len=4, extra_stop=None, n=30):
    """Return top-N frequent words after lightweight filtering."""
    base_stop = {
        'the','and','to','of','a','i','it','in','that','was','he','you','is','for','on','as',
        'with','his','her','at','be','she','had','not','but','said','they','them','this','so','all','one','very',
        'there','what','were','from','have','would','could','when','been','their','we','my','me','or','by','up','no','out','if',
        'pet'   # book-specific; at 3 letters it only matters when min_len <= 3
    }
    if extra_stop:
        base_stop |= set(extra_stop)

    c = Counter(w for w in words_list if len(w) >= min_len and w not in base_stop)
    return c.most_common(n)


# === Load & Normalize ===

# Load raw texts
pet_raw, shining_raw = load_texts()

# Normalize
pet_norm     = normalize(pet_raw)
shining_norm = normalize(shining_raw)

print(f"Pet Sematary chars: {len(pet_norm):,} | The Shining chars: {len(shining_norm):,}")

# Tokenize (RAW tokens before cleaning)
pet_tokens_raw     = words(pet_norm)
shining_tokens_raw = words(shining_norm)

pet_sentences     = sentences(pet_norm)
shining_sentences = sentences(shining_norm)

print(f"Pet Sematary words (raw): {len(pet_tokens_raw):,} | The Shining words (raw): {len(shining_tokens_raw):,}")
print(f"Pet Sematary sentences: {len(pet_sentences):,} | The Shining sentences: {len(shining_sentences):,}")





Pet Sematary chars: 812,353 | The Shining chars: 905,869
Pet Sematary words (raw): 147,144 | The Shining words (raw): 162,085
Pet Sematary sentences: 9,269 | The Shining sentences: 12,914



## Tokenize
We use a simple regex tokenizer (letters + apostrophes). For more serious work,
consider spaCy or stanza for tagging and lemmatization.
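As a quick illustration of what the regex keeps and drops, the sketch below runs the same pattern on an invented sentence. It also shows what happens to a contraction written with a curly apostrophe that has not been normalized, which may be one source of fragments such as `don` and `t`.

```python
import re

# Same pattern as the notebook's WORD_RE
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# Invented sample sentence, not taken from either novel
sample = "Don't go up there, Louis. It's 'sour' ground."
tokens = WORD_RE.findall(sample.lower())
print(tokens)

# A curly apostrophe is not matched by the pattern, so the contraction splits
curly = WORD_RE.findall("don\u2019t")
print(curly)
```

Apostrophes inside a word survive, quotation marks around a word are dropped, and punctuation never becomes a token.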


In [5]:
# === Tokenize ===

# Load and normalize again: the tokenizer below needs text strings
pet_raw, shining_raw = load_texts()
pet_norm     = normalize(pet_raw)
shining_norm = normalize(shining_raw)

# Tokenize the normalized strings (WORD_RE.findall expects a str, not a list of tokens)
pet_words_raw     = WORD_RE.findall(pet_norm.lower())
shining_words_raw = WORD_RE.findall(shining_norm.lower())

pet_sentences     = sentences(pet_norm)
shining_sentences = sentences(shining_norm)

print(f"Pet Sematary words (raw): {len(pet_words_raw):,} | The Shining words (raw): {len(shining_words_raw):,}")
print(f"Pet Sematary sentences: {len(pet_sentences):,} | The Shining sentences: {len(shining_sentences):,}")


Pet Sematary words (raw): 147,144 | The Shining words (raw): 162,085
Pet Sematary sentences: 9,269 | The Shining sentences: 12,914



## Top Words (after basic stopwords)
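The list is **partly signal, partly noise**; use it to start the discussion. Before counting, the next cell drops contraction fragments (such as *ll*, *ve*, *didn*) and single-letter tokens.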
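To see how a stoplist and a minimum-length filter reshape a frequency list, here is a small sketch on a toy token list. The tokens, the three-word stoplist, and the helper name `top_words_demo` are all made up for illustration; the notebook's own `top_words` uses a longer stoplist.

```python
from collections import Counter

# Toy token list (invented, not from either novel)
tokens = ["the", "cat", "walked", "down", "the", "road", "and", "the",
          "road", "was", "dark", "and", "the", "road", "was", "long"]

# Tiny illustrative stoplist
stop = {"the", "and", "was"}

def top_words_demo(tokens, stop, min_len=4, n=3):
    """Count tokens that are long enough and not in the stoplist."""
    counts = Counter(w for w in tokens if len(w) >= min_len and w not in stop)
    return counts.most_common(n)

raw_top = Counter(tokens).most_common(3)       # dominated by function words
filtered_top = top_words_demo(tokens, stop)    # content words surface
print(raw_top)
print(filtered_top)
```

Note that the length filter also removes *cat*, a content word: filtering choices are part of the method and can hide signal as well as noise.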


In [6]:
JUNK = {
    "ll","s","t","ve","re","m","d",
    "didn","don","doesn","isn","wasn","aren","weren","ain",
    "couldn","wouldn","shouldn","hadn","haven","hasn","mustn"
}

def clean_tokens(tokens):
    cleaned = []
    for w in tokens:
        if w in JUNK:
            continue
        if len(w) == 1:    # removes 'i', 'a', single letters etc.
            continue
        cleaned.append(w)
    return cleaned

pet_words     = clean_tokens(pet_tokens_raw)
shining_words = clean_tokens(shining_tokens_raw)

print(f"Pet Sematary words (clean): {len(pet_words):,} | The Shining words (clean): {len(shining_words):,}")
print("Sample Pet tokens:", pet_words[:20])
print("Sample Shining tokens:", shining_words[:20])


Pet Sematary words (clean): 142,499 | The Shining words (clean): 156,970
Sample Pet tokens: ['pet', 'sematary', 'by', 'stephen', 'king', 'published', 'jjjjj', 'iiiii', 'table', 'of', 'contents', 'dedication', 'introduction', 'part', 'the', 'pet', 'sematary', 'chapter', 'thru', 'chapter']
Sample Shining tokens: ['the', 'shining', 'by', 'stephen', 'king', 'this', 'is', 'for', 'joe', 'hill', 'king', 'who', 'shines', 'on', 'my', 'editor', 'on', 'this', 'book', 'as']



## Discussion
- Which frequent words are **thematically meaningful** vs. artifacts of stopwording?
- Do **setting terms** (e.g., *hotel*, *snow*, *overlook*) show higher distinctiveness in *The Shining*?
- Do terms such as *cat* and *church* show higher distinctiveness in *Pet Sematary*, and what does each one refer to in the story?
- How would **lemmatization** (e.g., *think/thinks/thought*) change results?
- Implement **per_10k(count, total_words)** and **log_likelihood(k1, n1, k2, n2)** (Dunning's G²), then list the 20 most distinctive words between *Pet Sematary* and *The Shining* with per-10k rates and briefly argue which are meaningful vs. artifacts.
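For the lemmatization question, the sketch below shows the basic effect on counts using a hand-written lookup table. The table `LEMMAS` and the helper `lemmatize_demo` are hypothetical and only for illustration; a real lemmatizer (for example in spaCy or stanza) uses part-of-speech context, which matters because *thought* can also be a noun.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only
LEMMAS = {"thinks": "think", "thought": "think", "thinking": "think"}

def lemmatize_demo(tokens):
    """Map each token to its lemma if listed, else leave it unchanged."""
    return [LEMMAS.get(w, w) for w in tokens]

# Toy tokens (invented)
tokens = ["think", "thinks", "thought", "thinking", "road", "road"]

surface = Counter(tokens)                  # each inflected form counted separately
merged = Counter(lemmatize_demo(tokens))   # inflected forms pooled under one lemma
print(surface.most_common(1))
print(merged.most_common(1))
```

Pooling the forms moves *think* ahead of *road* in this toy list, which is the kind of rank change to look for in the real results.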


## Optional continuation:


## Distinctiveness via Log-Likelihood (Keyness)
Raw frequency alone is not enough, because many words are frequent in both books. Compute **G²** (log-likelihood) to find words that are *distinctive* of each book.
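For reference, the two-term form of G² that the `log_likelihood` cell follows can be written out as below. Here $k_i$ is the word's count in book $i$, $n_i$ is that book's total token count, and $E_i$ is the count expected if the word were used at the same rate in both books. This is a compact restatement of one common variant; some references add further terms for all the other words in each corpus.

```latex
E_i = n_i \,\frac{k_1 + k_2}{n_1 + n_2}, \qquad i \in \{1, 2\}

G^2 = 2 \sum_{i=1}^{2} k_i \ln\frac{k_i}{E_i},
\qquad \text{where } k_i \ln\frac{k_i}{E_i} := 0 \text{ if } k_i = 0
```

When a word occurs at the same rate in both books, $k_i = E_i$ and every term vanishes, so $G^2 = 0$; the further the observed counts drift from the expected ones, the larger the score.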


In [7]:
def per_10k(count: int, total_words: int) -> float:
    """Normalize a raw count per 10,000 words for fair comparisons."""
    return (count / max(1, total_words)) * 10000.0


def log_likelihood(k1: int, n1: int, k2: int, n2: int) -> float:
    """Log-likelihood (G^2) keyness score for word distinctiveness.

    Parameters
    ----------
    k1 : int  Frequency in corpus A
    n1 : int  Total words in corpus A
    k2 : int  Frequency in corpus B
    n2 : int  Total words in corpus B

    Returns
    -------
    float
        G^2 value (non-negative); larger values indicate stronger distinctiveness.
        G^2 carries no direction, so compare rates (per_10k) or counts to see
        which corpus uses the word more.

    Notes
    -----
    - Symmetric measure widely used for corpus comparison: swapping the
      two corpora gives the same score.
    - A zero count contributes 0 to the sum, which avoids log(0).
    - A classroom-friendly upgrade over raw frequency lists.
    """
    E1 = n1 * (k1 + k2) / max(1, (n1 + n2))
    E2 = n2 * (k1 + k2) / max(1, (n1 + n2))

    def term(k, E):
        return 0.0 if k == 0 or E == 0 else k * math.log(k / E)

    return 2.0 * (term(k1, E1) + term(k2, E2))


In [8]:

# Build frequency dictionaries
# cw / nW: Pet Sematary counts and total; cg / nG: The Shining counts and total
cw = Counter(pet_words)
cg = Counter(shining_words)
nW, nG = sum(cw.values()), sum(cg.values())

# Compare a candidate set (union of top ~500 from each to keep it fast)
candidates = set([w for w,_ in cw.most_common(500)] + [w for w,_ in cg.most_common(500)])

rows = []
for w in candidates:
    g2 = log_likelihood(cw[w], nW, cg[w], nG)
    rows.append((g2, w, per_10k(cw[w], nW), per_10k(cg[w], nG)))

# Sort by distinctiveness (descending)
rows.sort(reverse=True)

# Column labels in the output: W = Pet Sematary rate, LG = The Shining rate
print("Most distinctive (either direction):")
for g2, w, a10, b10 in rows[:20]:
    print(f"{w:>12}  G2={g2:7.1f}  W:{a10:6.2f}/10k  LG:{b10:6.2f}/10k")


Most distinctive (either direction):
       louis  G2= 2398.8  W:113.33/10k  LG:  0.00/10k
       danny  G2= 1165.3  W:  0.00/10k  LG: 57.46/10k
        jack  G2=  918.6  W:  0.28/10k  LG: 47.46/10k
      rachel  G2=  760.6  W: 36.56/10k  LG:  0.06/10k
         jud  G2=  751.6  W: 35.51/10k  LG:  0.00/10k
       ellie  G2=  604.5  W: 28.56/10k  LG:  0.00/10k
        gage  G2=  582.3  W: 27.51/10k  LG:  0.00/10k
       wendy  G2=  463.8  W:  0.00/10k  LG: 22.87/10k
   hallorann  G2=  436.7  W:  0.00/10k  LG: 21.53/10k
      church  G2=  307.8  W: 16.84/10k  LG:  0.38/10k
    overlook  G2=  236.4  W:  0.00/10k  LG: 11.66/10k
      ullman  G2=  231.3  W:  0.00/10k  LG: 11.40/10k
       hotel  G2=  217.7  W:  0.21/10k  LG: 12.04/10k
      gage's  G2=  173.8  W:  8.21/10k  LG:  0.00/10k
     danny's  G2=  173.1  W:  0.00/10k  LG:  8.54/10k
         cat  G2=  172.1  W: 10.88/10k  LG:  0.57/10k
        tony  G2=  160.2  W:  0.07/10k  LG:  8.41/10k
        snow  G2=  157.7  W:  0.77/10k  LG: 1