# Stopwords in NLP

## What are Stopwords?

**Stopwords** are very common words (such as "the", "is", "in", and "at") that are often **removed** from text during preprocessing. They carry little **meaning** on their own and can be ignored in many, though not all, NLP tasks.
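
As a minimal sketch of the idea, the snippet below filters a sentence against a small hand-written stopword set. The set `TOY_STOPWORDS` and the helper `remove_stopwords` are illustrative assumptions, not part of NLTK, and real stopword lists are considerably longer.

```python
# Hypothetical, deliberately tiny stopword set used only for illustration.
TOY_STOPWORDS = {"the", "is", "in", "at", "a", "of", "on"}


def remove_stopwords(text, stop_words=TOY_STOPWORDS):
    """Return the whitespace-separated tokens of `text` that are not stopwords."""
    # Lowercase each token for the lookup so "The" and "the" are treated alike.
    return [tok for tok in text.split() if tok.lower() not in stop_words]


filtered = remove_stopwords("The cat is sleeping in the garden")
print(filtered)  # ['cat', 'sleeping', 'garden']
```

Splitting on whitespace is a simplification here; a real tokenizer would also separate punctuation from words.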

## Why Remove Stopwords?

- **Improve performance**: Reducing the size of the text data can speed up processing.
- **Focus on relevant words**: Helps highlight important words in tasks like **text classification**, **sentiment analysis**, and **search indexing**.
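
Both effects can be illustrated with a small frequency count. This is a sketch on a made-up sentence with a hypothetical toy stopword set (`TOY_STOPWORDS`), so the exact numbers only show the shape of the effect.

```python
from collections import Counter

# Hypothetical toy stopword set; a real pipeline would use a fuller list.
TOY_STOPWORDS = {"the", "is", "on", "and", "a", "of"}

text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# Token frequencies before and after dropping stopwords.
before = Counter(tokens)
after = Counter(tok for tok in tokens if tok not in TOY_STOPWORDS)

print(len(tokens), sum(after.values()))  # 13 6
print(before.most_common(1))             # [('the', 4)]
print(after.most_common(1))              # [('sat', 2)]
```

Fewer tokens remain to process, and the most frequent remaining token is a content word rather than an article.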

## When to Keep Stopwords?

- **Information retrieval**: Stopwords can carry context in queries, for example in phrase searches where word order and function words matter.
- **Part-of-speech tagging**: Taggers rely on sentence structure, and stopwords such as articles and prepositions help define it.
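
A related case is negation: NLTK's English list includes "no", "nor", and "not", and dropping them can flip the apparent meaning of a sentence. The sketch below shows one way to keep them by subtracting a set of protected words from the stopword set. The names `TOY_STOPWORDS`, `NEGATIONS`, and `filter_tokens` are hypothetical, and which words to protect is an assumption that depends on the task.

```python
# Hypothetical toy list that, like many general-purpose lists, contains negations.
TOY_STOPWORDS = {"the", "is", "was", "a", "not", "no", "this"}

# Words to protect for a sentiment-style task (an assumption, adjust per task).
NEGATIONS = {"not", "no", "nor"}

# Set difference keeps every stopword except the protected ones.
sentiment_stopwords = TOY_STOPWORDS - NEGATIONS


def filter_tokens(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens found in `stop_words`."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


review = "This movie was not good"
print(filter_tokens(review, TOY_STOPWORDS))        # ['movie', 'good']
print(filter_tokens(review, sentiment_stopwords))  # ['movie', 'not', 'good']
```

With NLTK, the same subtraction can be applied to `set(stopwords.words('english'))`.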

In [2]:
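import nltk
from nltk.corpus import stopwords

# Fetch the stopword lists; if they are already present, NLTK reports the package as up-to-date.
nltk.download('stopwords')

# The English stopword list, returned as a plain Python list of lowercase strings.
stopwords.words('english')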

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/u5c2dbc0bf2849dd5288e3311262c709/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [3]:
paragraph = """Natural Language Processing is a fascinating field of artificial intelligence. 
It allows computers to understand, interpret, and generate human language. 
Many applications like chatbots, language translation, and sentiment analysis rely heavily on NLP techniques. 
With the growth of digital content, the ability to analyze large volumes of text has become essential. 
NLP helps in extracting useful information, automating tasks, and enhancing user experiences across different domains."""

## LancasterStemmer

In [4]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
# Build the stopword set once instead of once per token.
stop_words = set(stopwords.words('english'))

# sent_tokenize and word_tokenize need NLTK's Punkt tokenizer data.
sentences = nltk.sent_tokenize(paragraph)
print(type(sentences))
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Compare in lowercase so capitalized stopwords such as "It" and "With" are removed too.
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)

<class 'list'>
['nat langu process fascin field art intellig .', 'allow comput understand , interpret , gen hum langu .', 'many apply lik chatbot , langu transl , senty analys rely heavy nlp techn .', 'grow digit cont , abl analys larg volum text becom ess .', 'nlp help extract us inform , autom task , enh us expery across diff domain .']


## Lemmatization

In [7]:
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs NLTK's WordNet data.
lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of once per token.
stop_words = set(stopwords.words('english'))

sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # pos='v' lemmatizes every token as a verb, so nouns such as "computers" keep their plural form.
    # Compare in lowercase so capitalized stopwords such as "It" and "With" are removed too.
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)

['natural language process fascinate field artificial intelligence .', 'allow computers understand , interpret , generate human language .', 'many applications like chatbots , language translation , sentiment analysis rely heavily nlp techniques .', 'growth digital content , ability analyze large volumes text become essential .', 'nlp help extract useful information , automate task , enhance user experience across different domains .']
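
The stemming and lemmatization cells share the same loop and differ only in how each surviving token is normalized. One way to factor that out is sketched below. The helper `clean_sentence`, its parameters, and the toy stopword set are hypothetical, and `str.split` stands in for `nltk.word_tokenize` so that the block runs without NLTK data.

```python
def clean_sentence(sentence, stop_words, normalize=str.lower, tokenize=str.split):
    """Tokenize, drop stopwords case-insensitively, then normalize what remains."""
    # Filter first so the normalizer only runs on tokens that are kept.
    kept = [tok for tok in tokenize(sentence) if tok.lower() not in stop_words]
    return " ".join(normalize(tok) for tok in kept)


# Hypothetical toy stopword set for the demonstration.
toy_stopwords = frozenset({"it", "to", "and", "the"})

result = clean_sentence("It allows computers to understand language", toy_stopwords)
print(result)  # allows computers understand language
```

With NLTK available, `normalize` could be the `stem` method of a `LancasterStemmer` instance or a small function wrapping `WordNetLemmatizer.lemmatize`, and `tokenize` could be `nltk.word_tokenize`.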
