# Stopword Removal with NLTK and spaCy

Purpose: Removing stop words reduces the size of the dataset and focuses the analysis on the words that carry more meaning.

Common Stop Words: "the", "is", "at", "of", "for", "to", "in", etc.
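
To make the idea concrete, here is a minimal sketch of stop word filtering that uses only the handful of words listed above. The names `TINY_STOP_WORDS` and `drop_stop_words` are hypothetical and exist only for illustration; whitespace splitting stands in for a proper tokenizer, and real stop word lists are considerably longer.

```python
from typing import List

# Hypothetical miniature stop word list built from the examples above.
TINY_STOP_WORDS = {"the", "is", "at", "of", "for", "to", "in"}

def drop_stop_words(text: str) -> List[str]:
    """Split on whitespace and keep tokens whose lowercase form is not in the tiny list."""
    return [tok for tok in text.split() if tok.lower() not in TINY_STOP_WORDS]

print(drop_stop_words("The cat is in the garden"))
# ['cat', 'garden']
```

The comparison is done on the lowercase form so that "The" at the start of a sentence is treated the same as "the".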

Tools for Stop Word Removal:
- NLTK: `nltk.corpus.stopwords`
- spaCy: `nlp.Defaults.stop_words`

In [5]:
# Import necessary libraries
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [9]:
# Download necessary NLTK resources
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

# Load spaCy model (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

In [10]:
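# English stop words from NLTK, built once so each call does not reload the list
NLTK_STOP_WORDS = set(stopwords.words("english"))

# Function to remove stopwords using NLTK
def remove_stopwords_nltk(text: str) -> str:
    """Tokenize text and drop tokens whose lowercase form is an NLTK stop word."""
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in NLTK_STOP_WORDS]
    # Tokens are rejoined with single spaces, so punctuation stays detached
    return " ".join(filtered_tokens)

# Function to remove stopwords using spaCy
def remove_stopwords_spacy(text: str) -> str:
    """Run the spaCy pipeline and drop tokens flagged by token.is_stop."""
    doc = nlp(text)
    filtered_tokens = [token.text for token in doc if not token.is_stop]
    return " ".join(filtered_tokens)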


In [12]:
# Example usage
if __name__ == "__main__":
    sample_text = "This is an example sentence demonstrating the removal of stopwords."

    # Remove stopwords using NLTK
    nltk_filtered_text = remove_stopwords_nltk(sample_text)
    print("NLTK Stopword Removal:", nltk_filtered_text)

    # Remove stopwords using spaCy
    spacy_filtered_text = remove_stopwords_spacy(sample_text)
    print("spaCy Stopword Removal:", spacy_filtered_text)

NLTK Stopword Removal: example sentence demonstrating removal stopwords .
spaCy Stopword Removal: example sentence demonstrating removal stopwords .
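
Both outputs above still end with the period as a separate token, since it is not treated as a stop word by either library. If that is unwanted, one possible follow-up step is to filter the tokens against `string.punctuation` from the standard library. The sketch below works on an already tokenized list; `drop_punctuation` is a hypothetical helper name, and it only removes tokens made up entirely of ASCII punctuation.

```python
import string
from typing import List

def drop_punctuation(tokens: List[str]) -> List[str]:
    """Keep tokens that contain at least one non-punctuation character."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

tokens = ["example", "sentence", "demonstrating", "removal", "stopwords", "."]
print(" ".join(drop_punctuation(tokens)))
# example sentence demonstrating removal stopwords
```

With spaCy, the token attribute `is_punct` offers a similar check without a separate helper.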
