# Stop Words (spaCy)

**Stop words** are very common words (e.g., *the*, *is*, *and*) that often add little meaning for tasks like keyword extraction, search, and topic modeling.

In **spaCy**, stop words are maintained per language in `nlp.Defaults.stop_words`, and each token/lexeme has an `is_stop` flag. You can:

- **Inspect** the default stop-word list
- **Check** whether a specific word is marked as a stop word
- **Add** domain-specific stop words (and set `nlp.vocab[word].is_stop = True`)
- **Remove** custom stop words when you no longer need them

Whether you should remove stop words depends on your task: removal can reduce noise, but it can also remove meaning (e.g., negations like *not*).
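
The negation caveat can be made concrete with a short sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which should be enough for tokenization and stop-word flags without downloading a model. The `NEGATIONS` set and the `content_tokens` helper are illustrative names, not part of spaCy.

```python
import spacy

# Assumption: a blank English pipeline is sufficient here, because the
# stop-word flags come from the language defaults, not from a trained model.
nlp = spacy.blank("en")

# Hypothetical set of negation words we want to keep despite being stop words
NEGATIONS = {"not", "no", "never", "n't"}

def content_tokens(text, keep=NEGATIONS):
    """Return tokens that are not stop words, plus any token listed in `keep`."""
    return [t.text for t in nlp(text) if not t.is_stop or t.lower_ in keep]

sentence = "The movie was not good"
plain = [t.text for t in nlp(sentence) if not t.is_stop]
kept = content_tokens(sentence)

print("Plain filtering:   ", plain)
print("Keeping negations: ", kept)
```

With plain filtering the sentence loses its negation, which flips its apparent sentiment; keeping a small allow-list is one way to avoid that.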

In [18]:
# Import spaCy and load the pre-trained English model
import spacy

nlp = spacy.load("en_core_web_sm")

In [19]:
# Print all default stop words
print(nlp.Defaults.stop_words)

{'an', 'nothing', '’s', 'may', 'alone', 'as', 'up', 'without', 'very', 'must', 'seeming', "'d", 'himself', 'six', 'behind', 'put', 'third', 'upon', 'therein', 'anything', 'mine', 'when', 'so', 'whoever', 'does', 'should', 'first', 'thereupon', 'meanwhile', 'if', 'already', 'become', 'that', 'whether', 'one', 'during', 'n’t', 'many', 'ever', 'afterwards', 'really', "'re", 'me', 'give', 'myself', 'nowhere', 'thence', 'more', 'noone', 'go', 'enough', 'been', 'into', 'is', '‘m', 'your', 'no', 'off', 'fifty', 'to', 'fox', 'who', 'amount', 'call', 'herself', 'someone', 'only', 'their', 'hereafter', 'neither', 'another', 'beside', 'whom', 'between', 'along', 'within', 'always', '’ll', '’ve', 'serious', 'n‘t', 'on', 'seem', 'unless', 'onto', 'of', 'after', '‘d', 'fifteen', 'for', '‘ll', 'becomes', 'back', 'wherein', 'towards', 'ourselves', 'am', 'down', "'s", 'ca', 'before', 'whenever', 'either', "n't", 'has', 'everywhere', 'here', 'thru', '’d', 'these', 'please', 'do', 'well', 'forty', 'least

In [20]:
# Count the stop words currently in the set (this includes any custom words added earlier in the session)
len(nlp.Defaults.stop_words)

328
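
Because the stop-word list is a Python `set`, printing it directly gives an arbitrary order. A minimal sketch of a more readable inspection follows; it assumes a blank English pipeline and also imports `STOP_WORDS` from `spacy.lang.en.stop_words`, which, to my understanding, is the set the English defaults are built from.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.blank("en")  # assumption: no trained model is needed for this check

# Sets are unordered, so sort before printing to get a stable, readable sample
sample = sorted(nlp.Defaults.stop_words)[:10]
print(sample)

# Membership tests are the quickest way to check individual words
for word in ("the", "not", "spacy"):
    print(f"{word!r}: {word in STOP_WORDS}")
```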

In [21]:
# Access the vocabulary entry for a word
nlp.vocab["btw"]

<spacy.lexeme.Lexeme at 0x16edad294c0>

In [22]:
# Check if "btw" is marked as a stop word (False by default)
nlp.vocab["btw"].is_stop

False

In [24]:
# Mark "btw" as a stop word
nlp.vocab["btw"].is_stop = True

In [25]:
# Verify that "btw" is now marked as a stop word
nlp.vocab["btw"].is_stop

True

In [26]:
# Unmark "btw" as a stop word and verify the change
# Note: nlp.Defaults.stop_words.remove("btw") would raise a KeyError, since "btw" is not in the default list
nlp.vocab["btw"].is_stop = False
nlp.vocab["btw"].is_stop

False
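
One detail worth knowing: vocabulary entries are keyed by exact spelling, so flagging `"btw"` does not flag `"BTW"`. The sketch below illustrates this under the assumption of a blank English pipeline; `mark_stop_word` is a hypothetical helper, not a spaCy function.

```python
import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline; the flags live in nlp.vocab

# Setting the flag on one spelling does not touch other casings
nlp.vocab["btw"].is_stop = True
before = nlp.vocab["BTW"].is_stop
print("lowercase:", nlp.vocab["btw"].is_stop, "| uppercase:", before)

def mark_stop_word(nlp, word, value=True):
    """Hypothetical helper: set is_stop on the common casings of `word`."""
    for variant in {word.lower(), word.capitalize(), word.upper()}:
        nlp.vocab[variant].is_stop = value

mark_stop_word(nlp, "btw")
doc = nlp("BTW the meeting moved, btw")
flags = [(t.text, t.is_stop) for t in doc]
print(flags)
```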

In [27]:
# Process a sample sentence to use in the examples below
doc = nlp("The quick brown fox jumps over the lazy dog")

## Filtering Stop Words from Text

This section removes stop words from a sample document, a common preprocessing step in NLP tasks such as text analysis and information retrieval.

In [30]:
# Display all tokens
print("All tokens:")
all_tokens = [token.text for token in doc]
print(all_tokens)

# Filter out stop words
print("\nNon-stop words only:")
non_stop_tokens = [token.text for token in doc if not token.is_stop]
print(non_stop_tokens)

# Show which words are stop words
print("\nStop word analysis:")
for token in doc:
    print(f"{token.text:<12} - Stop word: {token.is_stop}")

All tokens:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Non-stop words only:
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Stop word analysis:
The          - Stop word: True
quick        - Stop word: False
brown        - Stop word: False
fox          - Stop word: False
jumps        - Stop word: False
over         - Stop word: True
the          - Stop word: True
lazy         - Stop word: False
dog          - Stop word: False
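
The same filter extends naturally to several documents at once. The following is a rough sketch of counting keywords after dropping stop words, punctuation, and whitespace; the two sample sentences are made up for illustration, and a blank English pipeline is assumed.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# Illustrative sample texts (not from a real dataset)
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy dog sleeps while the fox runs.",
]

counts = Counter()
for doc in nlp.pipe(texts):
    # Lowercase so that "Fox" and "fox" are counted together
    counts.update(
        t.lower_ for t in doc
        if not (t.is_stop or t.is_punct or t.is_space)
    )

print(counts.most_common(3))
```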


## Adding Custom Stop Words

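You can add your own words to the stop-word list for domain-specific NLP applications. The example below updates both `nlp.Defaults.stop_words` and the `is_stop` flag on each word's vocabulary entry.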

In [31]:
# Add custom words to the stop words list
custom_stop_words = ["fox", "dog"]

for word in custom_stop_words:
    nlp.Defaults.stop_words.add(word)
    nlp.vocab[word].is_stop = True

# Test the document again with custom stop words
print("Original document:", "The quick brown fox jumps over the lazy dog")

print("\nNon-stop words (with custom stop words):")
non_stop_tokens_custom = [token.text for token in doc if not token.is_stop]
print(non_stop_tokens_custom)

print("\nComparison:")
print("Default stop words only:", non_stop_tokens)
print("With custom stop words:", non_stop_tokens_custom)

Original document: The quick brown fox jumps over the lazy dog

Non-stop words (with custom stop words):
['quick', 'brown', 'jumps', 'lazy']

Comparison:
Default stop words only: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
With custom stop words: ['quick', 'brown', 'jumps', 'lazy']
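
Changing `nlp.Defaults.stop_words` and `nlp.vocab` affects every later use of the pipeline in the session. If the custom words are only needed for one filtering step, an alternative is to keep them in a local set and leave spaCy's state alone. A minimal sketch, assuming a blank English pipeline; `extra_stop_words` is an illustrative name.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")

# Hypothetical per-call stop list; nothing in nlp.Defaults or nlp.vocab changes
extra_stop_words = {"fox", "dog"}

filtered = [
    t.text for t in doc
    if not t.is_stop and t.lower_ not in extra_stop_words
]
print(filtered)

# The shared vocabulary flag for "fox" is left as it was
print(nlp.vocab["fox"].is_stop)
```

The trade-off is that `token.is_stop` itself no longer reflects the custom words, so every filter has to apply the extra set explicitly.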


## Removing Custom Stop Words

In [32]:
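# Remove the custom stop words and clear their is_stop flags
for word in custom_stop_words:
    # discard() is a no-op if the word is absent, whereas remove() raises KeyError
    nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = False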
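
Adding and removing custom stop words by hand is easy to get out of sync. One possible pattern is a context manager that sets the flags temporarily and restores the previous values on exit. This is only a sketch: `temporary_stop_words` is a hypothetical helper, it touches the `is_stop` flags in `nlp.vocab` only (not `nlp.Defaults.stop_words`), and it assumes a blank English pipeline.

```python
from contextlib import contextmanager

import spacy

nlp = spacy.blank("en")

@contextmanager
def temporary_stop_words(nlp, words):
    """Mark `words` as stop words inside the `with` block, then restore them."""
    # Remember the current flag of each word so it can be restored exactly
    previous = {w: nlp.vocab[w].is_stop for w in words}
    try:
        for w in words:
            nlp.vocab[w].is_stop = True
        yield nlp
    finally:
        for w, flag in previous.items():
            nlp.vocab[w].is_stop = flag

doc = nlp("The quick brown fox jumps over the lazy dog")

with temporary_stop_words(nlp, ["fox", "dog"]):
    inside = [t.text for t in doc if not t.is_stop]

after = [t.text for t in doc if not t.is_stop]

print("Inside the block:", inside)
print("After the block: ", after)
```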