# ***Engr. Muhammad Javed***

# 3. Stop Words

## What are Stop Words?
Stop words are very common words (such as 'the', 'is', 'in', and 'and') that are often removed during text preprocessing because they carry little distinguishing information for many NLP tasks.

## Why Remove?
- Reduces the token count and vocabulary size, which can lower memory use and training time.
- Focuses the model on content-bearing words.
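
To make the size reduction concrete, the sketch below counts tokens and distinct words before and after filtering. It uses a tiny hand-written stop list (`TOY_STOP_WORDS`, an assumption for illustration, not NLTK's list) and plain whitespace splitting, so it runs without any downloads.

```python
# Toy stop list for illustration only; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "in", "and", "a", "on"}

def count_tokens(docs, stop_words=frozenset()):
    """Return (total tokens, distinct tokens) after dropping stop words."""
    tokens = [
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    ]
    return len(tokens), len(set(tokens))

docs = [
    "the cat is in the garden",
    "a dog and a cat sat on the mat",
]

before = count_tokens(docs)
after = count_tokens(docs, TOY_STOP_WORDS)
print("Before (tokens, vocabulary):", before)  # (15, 11)
print("After  (tokens, vocabulary):", after)   # (6, 5)
```

On real corpora the exact savings depend on the text and on the stop list used.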

## When NOT to Remove?
- When context matters (e.g., machine translation, text summarization).
- When exact phrase matching is needed.
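
Negation is a further case worth checking, especially for sentiment or emotion labels. NLTK's English list includes words such as 'not' and 'no', so removing every stop word can change what a sentence says. The sketch below uses a small hand-written list (`TOY_STOP_WORDS`) and a hypothetical `NEGATIONS` set, both assumptions for illustration, to show one way of keeping negations.

```python
# Toy stop list for illustration; like NLTK's list, it contains "not" and "no".
TOY_STOP_WORDS = {"i", "am", "the", "is", "a", "not", "no", "with"}
# Hypothetical set of words we choose to keep because they flip meaning.
NEGATIONS = {"not", "no"}

def filter_tokens(text, stop_words):
    """Drop stop words from lowercased, whitespace-separated text."""
    return [word for word in text.lower().split() if word not in stop_words]

sentence = "I am not happy with the result"

# Removing every stop word loses the negation.
all_removed = filter_tokens(sentence, TOY_STOP_WORDS)
# Subtracting the negations from the stop list keeps it.
negation_kept = filter_tokens(sentence, TOY_STOP_WORDS - NEGATIONS)

print(all_removed)    # ['happy', 'result']
print(negation_kept)  # ['not', 'happy', 'result']
```

The same set difference can be applied to the `stop_words` set built from `stopwords.words('english')` in this notebook.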

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

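# Ensure the stop word list and the tokenizer models used by word_tokenize are available
resources = {
    'corpora/stopwords': 'stopwords',
    'tokenizers/punkt': 'punkt',          # used by older NLTK releases
    'tokenizers/punkt_tab': 'punkt_tab',  # used by newer NLTK releases
}
for path, name in resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(name)

stop_words = set(stopwords.words('english'))
print(f"Number of English stopwords: {len(stop_words)}")
# Sets are unordered, so sort for a reproducible preview
print(sorted(stop_words)[:10])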

## Applying Removal

In [None]:
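def remove_stopwords(text):
    """Tokenize text and drop tokens whose lowercase form is an English stop word."""
    tokens = word_tokenize(text)
    # Punctuation tokens are kept; only words found in stop_words are removed
    return ' '.join(word for word in tokens if word.lower() not in stop_words)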

sample_text = "This is a sample sentence showing off the stop words filtration."
print("Original:", sample_text)
print("Cleaned:", remove_stopwords(sample_text))

# Apply to the training set (semicolon-separated file with text and emotion columns)
df_train = pd.read_csv('../Dataset/train.txt', sep=';', names=['text', 'emotion'])
df_train['text_no_stop'] = df_train['text'].apply(remove_stopwords)
df_train.head()
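
Stop word filtering removes only tokens that appear in the stop list, so punctuation tokens produced by `word_tokenize` remain in the cleaned text. If punctuation should be dropped as well, the tokenization step can be made stricter. The sketch below is self-contained: it uses a regular expression as a stand-in for `word_tokenize` and a toy stop list (`TOY_STOP_WORDS`), both assumptions for illustration, and the function name `remove_stopwords_and_punct` is hypothetical.

```python
import re

# Toy stop list for illustration only; swap in NLTK's set in the notebook.
TOY_STOP_WORDS = {"this", "is", "a", "off", "the"}

def remove_stopwords_and_punct(text, stop_words=TOY_STOP_WORDS):
    """Keep word tokens (letters and apostrophes) that are not stop words."""
    # The regex only matches runs of letters and apostrophes, so punctuation
    # is never emitted as a token (a simplification of word_tokenize).
    tokens = re.findall(r"[A-Za-z']+", text)
    return ' '.join(t for t in tokens if t.lower() not in stop_words)

cleaned = remove_stopwords_and_punct(
    "This is a sample sentence showing off the stop words filtration."
)
print(cleaned)  # sample sentence showing stop words filtration
```

This simplification also discards digits and splits on hyphens, so it may not suit every dataset.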