**Choose two built-in texts, remove their stopwords, and observe how the two texts differ.**

This exercise shows how stopwords (most of which are function words) can be filtered out so that the analysis focuses on the lexical (content) words that carry meaning.

Step 1: Install/Import the Required Packages and Resources



In [None]:
# Step 1: Install/Import the Required Packages and Resources


# If you have not installed NLTK yet, do so via pip:
# pip install nltk

import nltk
from nltk.corpus import gutenberg, stopwords
import string

# Download the Gutenberg corpus and English stopwords
nltk.download('gutenberg')
nltk.download('stopwords')


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Step 2: Choose Two Different Types of Books


In [None]:
# List all files in the Gutenberg corpus
print(gutenberg.fileids())

# Choose two file IDs
book1_id = 'austen-emma.txt'   # A novel
book2_id = 'bible-kjv.txt'     # The Bible


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Step 3: Read the Text and Perform Preprocessing

3.1 Load the Tokenized Words


In [None]:
raw_words_book1 = gutenberg.words(book1_id)
raw_words_book2 = gutenberg.words(book2_id)


3.2 Remove Stopwords and Punctuation

1. Convert words to lowercase.

2. Remove punctuation.

3. Remove stopwords.


In [None]:
stop_words = set(stopwords.words('english'))

def preprocess_and_remove_stopwords(words):
    """Lowercase the tokens, drop non-alphabetic ones, and remove stopwords."""
    # isalpha() discards punctuation and numbers, so no separate
    # punctuation set is needed; lowercasing lets tokens match the
    # lowercase entries in stop_words
    words = [w.lower() for w in words if w.isalpha()]
    # Remove stopwords
    words = [w for w in words if w not in stop_words]
    return words

book1_clean = preprocess_and_remove_stopwords(raw_words_book1)
book2_clean = preprocess_and_remove_stopwords(raw_words_book2)
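
To see how much each preprocessing step discards, the two filters can be counted separately. The following is a minimal sketch, not part of the original notebook: `removal_summary`, `toy_tokens`, and `toy_stop_words` are hypothetical names, and the toy data stands in for the Gutenberg tokens and the NLTK stopword list so that the example runs without any downloads.

```python
def removal_summary(raw_tokens, stop_words):
    """Count how many tokens survive each preprocessing step."""
    # Step 1: keep alphabetic tokens only, lowercased
    alpha = [w.lower() for w in raw_tokens if w.isalpha()]
    # Step 2: drop stopwords
    kept = [w for w in alpha if w not in stop_words]
    total = len(raw_tokens)
    return {
        "raw": total,
        "alphabetic": len(alpha),
        "kept": len(kept),
        "share_kept": len(kept) / total if total else 0.0,
    }

# Toy input (illustrative only, not taken from the corpus files)
toy_tokens = ["In", "the", "beginning", "God", "created",
              "the", "heaven", "and", "the", "earth", "."]
toy_stop_words = {"in", "the", "and"}  # tiny stand-in for the NLTK list

summary = removal_summary(toy_tokens, toy_stop_words)
print(summary)
```

In the notebook itself, the same function could be called as `removal_summary(raw_words_book1, stop_words)` and `removal_summary(raw_words_book2, stop_words)` to compare how much of each book is filtered out.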


Step 4: Simple Analyses

4.1 Word Frequency Analysis


In [None]:
from nltk import FreqDist

freq_book1 = FreqDist(book1_clean)
freq_book2 = FreqDist(book2_clean)

print("Top 10 Most Common Words in 'Emma':")
print(freq_book1.most_common(10))

print("\nTop 10 Most Common Words in 'Bible':")
print(freq_book2.most_common(10))


Top 10 Most Common Words in 'Emma':
[('mr', 1153), ('emma', 865), ('could', 837), ('would', 820), ('mrs', 699), ('miss', 599), ('must', 567), ('harriet', 506), ('much', 486), ('said', 484)]

Top 10 Most Common Words in 'Bible':
[('shall', 9838), ('unto', 8997), ('lord', 7964), ('thou', 5474), ('thy', 4600), ('god', 4472), ('said', 3999), ('ye', 3983), ('thee', 3827), ('upon', 2748)]
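
The raw counts above are not directly comparable, because the two books differ greatly in length. One common remedy is to express each count per 1,000 tokens. The sketch below uses `collections.Counter` on toy token lists; `per_thousand`, `toy_a`, and `toy_b` are hypothetical names and the toy data is invented for illustration.

```python
from collections import Counter

def per_thousand(tokens, top_n=3):
    """Return the top_n words with their frequency per 1,000 tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, round(1000 * n / total, 1))
            for word, n in counts.most_common(top_n)]

# Toy token lists of different lengths (illustrative only)
toy_a = ["emma", "mr", "emma", "harriet", "emma", "mr"]
toy_b = ["lord", "god", "lord", "lord", "shall",
         "lord", "god", "shall", "god", "unto"]

rates_a = per_thousand(toy_a)
rates_b = per_thousand(toy_b)
print(rates_a)
print(rates_b)
```

In the notebook, `per_thousand(book1_clean, 10)` and `per_thousand(book2_clean, 10)` would give length-normalized versions of the two lists. NLTK's `FreqDist` also provides a `freq()` method that returns the relative frequency of a single word.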


4.2 Bigram Analysis


In [None]:
from nltk import bigrams

bigrams_book1 = list(bigrams(book1_clean))
bigrams_book2 = list(bigrams(book2_clean))

freq_bigrams_book1 = FreqDist(bigrams_book1)
freq_bigrams_book2 = FreqDist(bigrams_book2)

print("Top 10 Bigrams in 'Emma':")
print(freq_bigrams_book1.most_common(10))

print("\nTop 10 Bigrams in 'Bible':")
print(freq_bigrams_book2.most_common(10))


Top 10 Bigrams in 'Emma':
[(('mr', 'knightley'), 299), (('mrs', 'weston'), 256), (('mr', 'elton'), 229), (('miss', 'woodhouse'), 173), (('mr', 'weston'), 167), (('frank', 'churchill'), 151), (('mrs', 'elton'), 150), (('mr', 'woodhouse'), 135), (('every', 'thing'), 126), (('miss', 'fairfax'), 125)]

Top 10 Bigrams in 'Bible':
[(('said', 'unto'), 1697), (('thou', 'shalt'), 1250), (('lord', 'god'), 960), (('saith', 'lord'), 859), (('thou', 'hast'), 773), (('ye', 'shall'), 770), (('children', 'israel'), 648), (('unto', 'lord'), 629), (('unto', 'thee'), 504), (('came', 'pass'), 458)]
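
Note that these bigrams are built after stopword removal, so a pair such as ('children', 'israel') or ('came', 'pass') most likely joins words that were separated by a stopword in the original text ("children of Israel", "came to pass"). The sketch below contrasts the two orders of operation on a toy sequence. `adjacent_pairs` is a hypothetical stdlib-only stand-in for `nltk.bigrams`, and the token list and stopword set are invented for illustration.

```python
def adjacent_pairs(tokens):
    """Return consecutive token pairs (stdlib-only stand-in for nltk.bigrams)."""
    return list(zip(tokens, tokens[1:]))

# Toy sequence and stopword set (illustrative only)
toy_tokens = ["the", "children", "of", "israel", "came", "to", "pass"]
toy_stop_words = {"the", "of", "to"}

# Option A (the order used in the notebook): filter first, then pair.
# Words on either side of a removed stopword become neighbours.
filtered = [w for w in toy_tokens if w not in toy_stop_words]
pairs_after_filtering = adjacent_pairs(filtered)

# Option B: pair first, then keep only pairs that contain no stopword.
# Only words that were adjacent in the original text are paired.
pairs_truly_adjacent = [
    (a, b) for a, b in adjacent_pairs(toy_tokens)
    if a not in toy_stop_words and b not in toy_stop_words
]

print(pairs_after_filtering)
print(pairs_truly_adjacent)
```

Option A surfaces fixed phrases even when a stopword sits inside them, while Option B keeps only pairs that were literally adjacent. Which one is preferable depends on the question being asked.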


1. Which high-frequency words are function words and which are lexical words?

2. How does removing function words help highlight the lexical words that convey the main themes or ideas in each text?

Answer 1:

Function words are words that primarily serve grammatical purposes, such as prepositions (e.g., "of," "in," "on"), conjunctions (e.g., "and," "but," "or"), articles (e.g., "a," "an," "the"), pronouns (e.g., "he," "she," "it"), and auxiliary verbs (e.g., "can," "will," "must"). Stopword lists consist mostly of function words, although the two terms are not strictly synonymous.

Lexical words (also called content words) are words that carry the main meaning of a sentence, such as nouns (e.g., "cat," "house," "idea"), verbs (e.g., "run," "eat," "think"), adjectives (e.g., "happy," "big," "red"), and adverbs (e.g., "quickly," "loudly," "happily").

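Looking at the top 10 most frequent words in 'Emma' and 'Bible' from the code output:

'Emma':

Lexical words: mr, emma, mrs, miss, harriet, said (titles, character names, and a reporting verb).
Function words: could, would, must. These are modal auxiliaries, which are normally classed as function words; they appear in the output only because they passed the stopword filter. "much" is a borderline case, since it works as a quantifier or degree adverb.

'Bible':

Lexical words: lord, god, said.
Function words: shall (a modal auxiliary), unto and upon (prepositions), and thou, thy, thee, and ye (archaic pronouns). These also passed the filter, which suggests that the stopword list is built around modern English.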
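
The split between lexical and residual function words can also be made explicit in code. The sketch below takes the two top-10 lists printed earlier and separates them using hand-written supplementary sets. `split_top_words`, `extra_emma`, and `extra_bible` are hypothetical names, and the sets reflect one possible classification rather than a standard resource.

```python
def split_top_words(top_words, function_words):
    """Split (word, count) pairs into lexical and function-word lists."""
    lexical = [w for w, _ in top_words if w not in function_words]
    function = [w for w, _ in top_words if w in function_words]
    return lexical, function

# Top-10 lists copied from the notebook output above
emma_top10 = [('mr', 1153), ('emma', 865), ('could', 837), ('would', 820),
              ('mrs', 699), ('miss', 599), ('must', 567), ('harriet', 506),
              ('much', 486), ('said', 484)]
bible_top10 = [('shall', 9838), ('unto', 8997), ('lord', 7964), ('thou', 5474),
               ('thy', 4600), ('god', 4472), ('said', 3999), ('ye', 3983),
               ('thee', 3827), ('upon', 2748)]

# Hand-written supplementary sets (assumptions for illustration)
extra_emma = {"could", "would", "must"}                                # modals
extra_bible = {"shall", "unto", "upon", "thou", "thy", "thee", "ye"}   # modal, prepositions, archaic pronouns

emma_lexical, emma_function = split_top_words(emma_top10, extra_emma)
bible_lexical, bible_function = split_top_words(bible_top10, extra_bible)

print(emma_lexical, emma_function)
print(bible_lexical, bible_function)
```

Adding such a set to `stop_words` before preprocessing would push more content words into the top 10, at the cost of a list tailored to these particular texts.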

Answer 2:

Removing function words helps highlight lexical words and themes in several ways:

Reduced Noise: Function words account for a large share of all tokens but say little about what a text is about. Removing them leaves a frequency list that is no longer dominated by words common to nearly every English text.
Emphasis on Content: With function words gone, the remaining lexical words rise to the top of the list, so the key people and concepts in each text are visible immediately.
Improved Analysis: This makes the main themes easier to compare. In 'Emma', words like "emma," "harriet," "mr," and "miss," together with bigrams such as "mr knightley" and "miss woodhouse," point to character relationships and social forms of address. In 'Bible', "lord" and "god," together with bigrams such as "children israel," reflect religious themes. Residual function words such as "shall," "unto," and "thou" show that the result depends on how well the stopword list fits the text.