# Comparing Word Frequencies Across Two Texts
## Introduction
In this lesson, you’ll learn how to compare the most common words in two different texts using Python. This is a foundational technique in digital text analysis and can help you spot patterns, themes, or stylistic differences between books, chapters, or authors.

We’ll use two sample texts from Project Gutenberg: *Alice’s Adventures in Wonderland* by Lewis Carroll and *The Adventures of Sherlock Holmes* by Arthur Conan Doyle.

### Downloading the Texts

We use the requests library to download the full text of both books from Project Gutenberg. The text of each book is stored as a string in the variables alice_text and sherlock_text.

In [13]:
import requests

# URLs for the texts
url_alice = "https://www.gutenberg.org/files/11/11-0.txt"
url_sherlock = "https://www.gutenberg.org/files/1661/1661-0.txt"

# Download each file; .text returns the response body decoded to a string
alice_text = requests.get(url_alice).text
sherlock_text = requests.get(url_sherlock).text
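
Project Gutenberg files wrap the book in a header and a license footer, and those words would otherwise be counted along with the story. The sketch below shows one way to trim them before tokenizing. The helper name strip_gutenberg_boilerplate and the sample text are hypothetical, and the marker wording is an assumption: it matches the "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***" lines found in many files, but individual files may differ. If either marker is missing, the text is returned unchanged.

```python
import re

# Assumed marker format; the exact wording in Project Gutenberg files has varied over time
START_MARKER = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)
END_MARKER = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, or the text unchanged if a marker is missing."""
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    if not start or not end or end.start() <= start.end():
        return text
    return text[start.end():end.start()].strip()

# Hypothetical miniature file, for illustration only
sample = (
    "Title: A Hypothetical Book\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A HYPOTHETICAL BOOK ***\n"
    "Down the rabbit hole.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A HYPOTHETICAL BOOK ***\n"
    "License text follows here.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Applying a helper like this to alice_text and sherlock_text would change the word counts slightly, so the outputs shown later in this lesson reflect the untrimmed files.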

### Basic Text Cleaning and Tokenization
Next, we define a function, clean_and_tokenize, to preprocess the raw text. It first normalizes curly quotes and apostrophes to straight ones and converts the text to lowercase. It then uses a regular expression (re.findall) to extract all words, treating a contraction or possessive with an internal apostrophe (such as don't or alice's) as a single token. Finally, it drops any single-letter tokens except 'i' and 'a', which are valid English words. The function is applied to both alice_text and sherlock_text, and the resulting lists of cleaned words are stored in alice_words and sherlock_words.

In [14]:
import re

def clean_and_tokenize(text):
    # Normalize curly quotes and apostrophes to straight ones
    text = text.replace('“', '"').replace('”', '"').replace("’", "'").replace("‘", "'")
    text = text.lower()
    # Regex to match words with optional internal apostrophes (e.g. don't, it's)
    words = re.findall(r"\b[a-z]+(?:'[a-z]+)?\b", text)
    # Drop single-letter tokens except 'i' and 'a'
    words = [word for word in words if len(word) > 1 or word in ('i', 'a')]
    return words

alice_words = clean_and_tokenize(alice_text)
sherlock_words = clean_and_tokenize(sherlock_text)
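
To see what each stage of the function does, it can help to trace the same steps on a single sentence. The sketch below repeats the normalization, regular expression, and single-letter filter from clean_and_tokenize outside the function; the sample sentence is a made-up example, not a quotation from either book.

```python
import re

# Hypothetical sample with curly quotes, contractions, a possessive, and a stray letter
sample = "“Don’t mark it with an X,” said Alice’s sister. I can’t!"

# Step 1: normalize curly quotes and apostrophes, then lowercase
normalized = (
    sample.replace("“", '"').replace("”", '"')
          .replace("’", "'").replace("‘", "'")
          .lower()
)

# Step 2: extract words, keeping an internal apostrophe part when present
tokens = re.findall(r"\b[a-z]+(?:'[a-z]+)?\b", normalized)

# Step 3: drop single-letter tokens other than 'i' and 'a'
kept = [t for t in tokens if len(t) > 1 or t in ("i", "a")]

print(tokens)
# ["don't", 'mark', 'it', 'with', 'an', 'x', 'said', "alice's", 'sister', 'i', "can't"]
print(kept)
# ["don't", 'mark', 'it', 'with', 'an', 'said', "alice's", 'sister', 'i', "can't"]
```

Note that "alice's" is kept as its own token, so possessive forms are counted separately from "alice".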


### Removing Stopwords
We define a set of common English stopwords (words that are frequent but carry little meaning on their own) and then create a function, remove_stopwords, to eliminate these words from the tokenized lists.

The function uses a list comprehension to return a new list containing only the words that are not in the stopwords set. This is applied to both alice_words and sherlock_words, saving the results as alice_words_clean and sherlock_words_clean.

In [9]:
# A simple stopword set
stopwords = {
    'the', 'and', 'to', 'of', 'a', 'i', 'it', 'in', 'or', 'is', 'd', 's', 'as',
    'for', 'on', 'with', 'was', 'he', 'she', 'at', 'by', 'an', 'be', 'this', 'that',
    'from', 'but', 'not', 'then', 'are', 'his', 'her', 'they', 'you', 'had', 'have', 'my',
    'me', 'we', 'so', 'what', 'there', 'said', 'very', 'him', 'your', 'were',
    'would', 'if', 'when', 'which', 'do', 'been', 'could', 'about', 'them', 'has', 'into',
    'know', 'like', 'herself', 'went', 'again', 'up', 'out', 'all', 'one', 'no',
    'some', 'upon', 'will', 'who', 'now', 'our', 'how', 'am', 'did', 'us'
}

def remove_stopwords(words):
    return [word for word in words if word not in stopwords]

alice_words_clean = remove_stopwords(alice_words)
sherlock_words_clean = remove_stopwords(sherlock_words)
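
The filter compares whole tokens, so only exact matches are removed. Because the tokenizer keeps contractions and possessives as single tokens, forms such as "it's" or "don't" pass through even though 'it' and 'do' are stopwords. The sketch below illustrates this with a hypothetical miniature stopword set and token list.

```python
# Hypothetical miniature stopword set and token list, for illustration only
sample_stopwords = {"the", "it", "is", "do", "not", "was"}
sample_tokens = ["it's", "the", "queen", "don't", "do", "not", "alice's", "was", "late"]

# Only exact matches are removed; contractions and possessives are different strings
filtered = [t for t in sample_tokens if t not in sample_stopwords]
removed = len(sample_tokens) - len(filtered)

print(filtered)
# ["it's", 'queen', "don't", "alice's", 'late']
print(f"Removed {removed} of {len(sample_tokens)} tokens")
# Removed 4 of 9 tokens
```

If contractions should be filtered too, one option is to add them to the stopword set explicitly.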

### Counting Word Frequencies
We'll use Python's collections.Counter class to count how often each word appears in the cleaned word lists. Counter is a dictionary subclass in which the keys are the unique words and the values are their counts; looking up a word that never occurs returns 0 instead of raising an error. The word counts for each text are stored in alice_counts and sherlock_counts.

In [10]:
from collections import Counter

alice_counts = Counter(alice_words_clean)
sherlock_counts = Counter(sherlock_words_clean)
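
A small example may make Counter's behavior easier to see before applying it to a whole book. The word list below is a toy example, not data from either text.

```python
from collections import Counter

# Toy word list, for illustration only
toy_words = ["alice", "queen", "alice", "king", "alice", "queen"]
toy_counts = Counter(toy_words)

print(toy_counts["alice"])         # 3
print(toy_counts["rabbit"])        # 0 (a missing word counts as zero, no KeyError)
print(toy_counts.most_common(2))   # [('alice', 3), ('queen', 2)]
print(sum(toy_counts.values()))    # 6, the total number of words counted
```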

### Displaying the Most Common Words
Let's see the top 10 words in each text. The most_common() method of a Counter returns a list of (word, count) pairs sorted from most to least frequent, so most_common(10) gives the 10 most frequent words and their counts, which we then print.

In [11]:
print("Top 10 words in Alice's Adventures in Wonderland:")
print(alice_counts.most_common(10))

print("\nTop 10 words in The Adventures of Sherlock Holmes:")
print(sherlock_counts.most_common(10))

Top 10 words in Alice's Adventures in Wonderland:
[('alice', 386), ('little', 127), ('down', 103), ('thought', 74), ('off', 73), ('time', 71), ('queen', 68), ('see', 67), ('well', 63), ('king', 61)]

Top 10 words in The Adventures of Sherlock Holmes:
[('holmes', 466), ('man', 291), ('mr', 275), ('little', 269), ('see', 232), ('down', 230), ('may', 212), ('should', 211), ('well', 201), ('over', 183)]
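
Raw counts from texts of different lengths are not directly comparable: a word can have a higher count in the longer text simply because that text has more words. One common adjustment is to express each count as a rate per 1,000 counted words. The sketch below is a minimal illustration using a hypothetical helper, per_thousand, and toy counters rather than the real books.

```python
from collections import Counter

def per_thousand(counts):
    """Convert raw counts to occurrences per 1,000 counted words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 1000 * n / total for word, n in counts.items()}

# Toy counters of different sizes (hypothetical data, not the real books)
short_text = Counter({"little": 5, "alice": 15})
long_text = Counter({"little": 8, "holmes": 20, "man": 12})

short_rates = per_thousand(short_text)
long_rates = per_thousand(long_text)

print(short_rates["little"])  # 250.0
print(long_rates["little"])   # 200.0
```

Here "little" has the higher raw count in the longer toy text (8 versus 5) but the lower rate. Applied to alice_counts and sherlock_counts, the rates would be relative to the words remaining after stopword removal.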


### Comparing the Top Words
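Which words appear in both top-10 lists, and which appear in only one? To find out, we convert each top-10 list to a set and use set intersection (&) and set difference (-). Note that "unique" here means a word is in one text's top 10 but not the other's; the word may still occur in the other text, just less often. Feel free to compare texts of your choosing, and see what the results bring to light.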

In [12]:
alice_top_words = {word for word, _ in alice_counts.most_common(10)}
sherlock_top_words = {word for word, _ in sherlock_counts.most_common(10)}

common_words = alice_top_words & sherlock_top_words
alice_unique = alice_top_words - sherlock_top_words
sherlock_unique = sherlock_top_words - alice_top_words

print("Words common to both texts:", common_words)
print("Words unique to Alice:", alice_unique)
print("Words unique to Sherlock:", sherlock_unique)

Words common to both texts: {'well', 'see', 'little', 'down'}
Words unique to Alice: {'alice', 'off', 'time', 'king', 'queen', 'thought'}
Words unique to Sherlock: {'man', 'mr', 'over', 'holmes', 'should', 'may'}
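
The set comparison only says whether a word made each top list. To see how often a top word from one text occurs in the other, the counts can be looked up side by side. The sketch below uses toy counters and a top-3 cutoff, both chosen only for illustration.

```python
from collections import Counter

# Toy counters (hypothetical data, not the real books)
counts_a = Counter({"alice": 9, "little": 4, "time": 3, "man": 1})
counts_b = Counter({"holmes": 10, "man": 6, "little": 5, "time": 2})

top_a = {word for word, _ in counts_a.most_common(3)}
top_b = {word for word, _ in counts_b.most_common(3)}

# For every word in either top list, look up its count in both texts
rows = [(word, counts_a[word], counts_b[word]) for word in sorted(top_a | top_b)]
for word, a, b in rows:
    print(f"{word:<8} {a:>3} {b:>3}")
```

In this toy data, "time" is in the first top list only, yet it still occurs twice in the second text. The same lookup with alice_counts and sherlock_counts would show how the real top words are distributed across both books.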
