### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Removing Stopwords

**Steps**:
1. Dataset: Use a set of product descriptions in text form.
2. Stopword Removal: Remove stopwords from the descriptions using an NLP library (e.g., NLTK) or, as in the cell below, a small custom stopword list.
3. Assess Impact: Compare word frequencies before and after removal to see how much of the text consisted of stopwords.
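
Step 2 names NLTK as one option. The following is a minimal sketch of how that could look, assuming the `nltk` package is installed and its `stopwords` corpus has already been fetched with `nltk.download("stopwords")`. The helper names `load_stopwords` and `remove_stopwords_nltk` and the fallback list are illustrative and not part of the original exercise.

```python
import re

# Illustrative fallback list, used only when NLTK or its corpus is unavailable.
FALLBACK_STOPWORDS = {
    "a", "an", "the", "is", "are", "for", "to", "with", "of", "and",
    "or", "in", "on", "at", "by", "from", "this", "that", "very", "be",
}

def load_stopwords():
    """Return NLTK's English stopword list, or the fallback set if it cannot be loaded."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: nltk is not installed; LookupError: corpus not downloaded.
        return set(FALLBACK_STOPWORDS)

def remove_stopwords_nltk(text, stop_set):
    """Lowercase and tokenize the text, then drop tokens found in stop_set."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_set]

stop_set = load_stopwords()
filtered = remove_stopwords_nltk("The product is durable and very easy to use.", stop_set)
print(filtered)
```

NLTK's English list is considerably longer than a hand-written one, so it may remove words that a custom list keeps. It is worth inspecting the result on a few real descriptions before applying it to a full dataset.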

In [7]:
import pandas as pd
import re
from collections import Counter

# Sample product descriptions
data = {
    "product_id": [101, 102, 103],
    "description": [
        "This is a fantastic wireless mouse with ergonomic design.",
        "An excellent choice for gaming and office work.",
        "The product is durable and very easy to use."
    ]
}
df = pd.DataFrame(data)

# Basic stopword list
custom_stopwords = set("""
a an the is are for to with of and or in on at by from this that very be
""".split())

# Tokenizer and stopword remover
def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(text):
    """Tokenize the text and drop any token found in custom_stopwords."""
    return [word for word in tokenize(text) if word not in custom_stopwords]

# Apply functions
df["tokens_before"] = df["description"].apply(tokenize)
df["tokens_after"] = df["description"].apply(remove_stopwords)

# Word frequency
all_before = [word for tokens in df["tokens_before"] for word in tokens]
all_after = [word for tokens in df["tokens_after"] for word in tokens]

freq_before = Counter(all_before)
freq_after = Counter(all_after)

print("Word Frequency Before Stopword Removal:\n", freq_before)
print("\nWord Frequency After Stopword Removal:\n", freq_after)

Word Frequency Before Stopword Removal:
 Counter({'is': 2, 'and': 2, 'this': 1, 'a': 1, 'fantastic': 1, 'wireless': 1, 'mouse': 1, 'with': 1, 'ergonomic': 1, 'design': 1, 'an': 1, 'excellent': 1, 'choice': 1, 'for': 1, 'gaming': 1, 'office': 1, 'work': 1, 'the': 1, 'product': 1, 'durable': 1, 'very': 1, 'easy': 1, 'to': 1, 'use': 1})

Word Frequency After Stopword Removal:
 Counter({'fantastic': 1, 'wireless': 1, 'mouse': 1, 'ergonomic': 1, 'design': 1, 'excellent': 1, 'choice': 1, 'gaming': 1, 'office': 1, 'work': 1, 'product': 1, 'durable': 1, 'easy': 1, 'use': 1})
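
With a sample this small, every remaining word has a count of 1, so the two `Counter` printouts alone say little about impact. One way to make step 3 more concrete is to reduce the two counters to a few summary numbers. The sketch below is one possible approach. The function name `summarize_removal` and the demo tokens are assumptions made for illustration.

```python
from collections import Counter

def summarize_removal(freq_before, freq_after):
    """Summarize how many tokens and distinct words stopword removal dropped."""
    total_before = sum(freq_before.values())
    total_after = sum(freq_after.values())
    removed = total_before - total_after
    return {
        "tokens_before": total_before,
        "tokens_after": total_after,
        "vocab_before": len(freq_before),
        "vocab_after": len(freq_after),
        # Guard against an empty input to avoid division by zero.
        "share_removed": removed / total_before if total_before else 0.0,
    }

# Demo on the tokens of a single description (illustrative input).
demo_before = Counter(["the", "product", "is", "durable", "and", "very", "easy", "to", "use"])
demo_after = Counter(["product", "durable", "easy", "use"])
summary = summarize_removal(demo_before, demo_after)
print(summary)
```

In the notebook, passing `freq_before` and `freq_after` from the cell above would give 26 tokens before removal and 14 after, meaning 12 of the 26 tokens (about 46%) were stopwords.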
