# Why do we remove stopwords in NLP?

In Natural Language Processing (NLP), stopwords are very common words that contribute little on their own to the meaning of a sentence. Examples in English include "the," "is," "and," and "a"; every language has its own set.

Here are some reasons why stopwords are often removed in NLP preprocessing:

1. Noise reduction: Stopwords typically appear frequently in a corpus without carrying much semantic information. By removing stopwords, we can reduce the noise in the text data and focus on the more important words that carry more meaning.

2. Computational efficiency: Removing stopwords can help reduce the size of the vocabulary and the dimensionality of the feature space. This reduction can lead to more efficient and faster computation for NLP tasks such as text classification, information retrieval, and clustering.

3. Improved model performance: Stopwords often occur in many different contexts, making it challenging for models to learn meaningful patterns from them. Removing stopwords can improve the performance of certain NLP tasks, such as sentiment analysis or topic modeling, by allowing the models to focus on more discriminative and informative words.

4. Better interpretability: Stopwords can obscure the interpretation of results in some cases. By eliminating them, it becomes easier to identify and understand the most relevant and distinctive terms in a document or corpus.
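The first two points in the list above can be illustrated with a minimal sketch. The three-sentence corpus and the tiny `STOP_WORDS` set are made up for this example (they are not NLTK's list); the point is only to show how the token count and vocabulary shrink once very frequent words are dropped.

```python
from collections import Counter

# A tiny hand-written stopword set, used only for illustration.
STOP_WORDS = {"the", "is", "and", "a", "of", "on", "in"}

# A made-up toy corpus.
corpus = [
    "the cat is on the mat",
    "a dog and a cat sat in the garden",
    "the garden is full of flowers",
]

# Flatten the corpus into one list of whitespace-separated tokens.
tokens = [word for doc in corpus for word in doc.split()]
kept = [word for word in tokens if word not in STOP_WORDS]

vocab_before = set(tokens)
vocab_after = set(kept)

print(len(tokens), "tokens and", len(vocab_before), "distinct words before removal")
print(len(kept), "tokens and", len(vocab_after), "distinct words after removal")

# The most frequent token in the raw corpus is a stopword.
print(Counter(tokens).most_common(1))
```

On real corpora the effect on the vocabulary is proportionally smaller, since the stopword list is short, but the drop in total token count is usually substantial.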

However, it's important to note that there are cases where stopwords may carry specific meanings or contribute to the context. In such cases, removing stopwords may not be desirable. The decision to remove or retain stopwords depends on the specific NLP task, the characteristics of the corpus, and the goals of the analysis.
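Negation is a common case where removal hurts. NLTK's English list includes "not", so two sentences with opposite sentiment can collapse into the same text. The sketch below uses a small hand-written set (an assumption for the sake of a self-contained example) and a hypothetical helper `content_words`.

```python
# Hand-written stand-in for a stopword list; like NLTK's English list, it contains "not".
STOP_WORDS = {"the", "was", "not", "is", "a"}

def content_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop any stopwords."""
    return [word for word in text.lower().split() if word not in stop_words]

positive = "The movie was good"
negative = "The movie was not good"

print(content_words(positive, STOP_WORDS))
print(content_words(negative, STOP_WORDS))

# Both sentences reduce to the same tokens, so the negation is lost.
print(content_words(positive, STOP_WORDS) == content_words(negative, STOP_WORDS))
```

For tasks such as sentiment analysis, one option is to remove negation words from the stopword set before filtering.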


# Stopwords



We will be using [this tweet](https://twitter.com/ivan_bezdomny/status/1367160747537682438) (don't worry, we will get to train some models):

In [7]:
tweet = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""

We will use the **NLTK** library to remove stopwords. NLTK ships with stopword lists for several languages; we will use the English one, which contains common words like *a*, *the*, *be*, *for*, *do*, and so on.

In [9]:
# Requires a one-time download of the corpus: nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

print(stop_words[:5])
print(len(stop_words))

179


Now we have a list of stopwords. When we process our text we will check each word and drop it if it appears in `stop_words`. To speed up that lookup we can convert `stop_words` to a `set`, which tests membership by hashing instead of scanning every element.

In [6]:
stop_words = set(stop_words)
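To get a rough feel for the difference, here is a small timing sketch. It uses a synthetic list of 179 strings as a stand-in for the stopword list so that it runs without NLTK; the names `words_list`, `words_set`, and `probe` are made up for this example, and the absolute timings will vary by machine.

```python
import timeit

# Synthetic stand-in for the stopword list (same length as the English list above).
words_list = [f"word{i}" for i in range(179)]
words_set = set(words_list)

# A word that is not in the collection: the worst case for a list,
# because every element has to be compared before giving up.
probe = "notastopword"

list_time = timeit.timeit(lambda: probe in words_list, number=20_000)
set_time = timeit.timeit(lambda: probe in words_set, number=20_000)

print(f"list lookup: {list_time:.4f}s")
print(f"set lookup:  {set_time:.4f}s")
```

Most words in real text are not stopwords, so this worst case for the list is also the common case, which is why the conversion is worthwhile even for a short list.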

First we need to lowercase our text (because all of our stopwords are lowercase). Then we split the text into a list of tokens, where each token is a run of characters separated by whitespace.

In [12]:
tweet = tweet.lower().split()

tweet

['i’m',
 'amazed',
 'how',
 'often',
 'in',
 'practice,',
 'not',
 'only',
 'does',
 'a',
 '@huggingface',
 'nlp',
 'model',
 'solve',
 'your',
 'problem,',
 'but',
 'one',
 'of',
 'their',
 'public',
 'finetuned',
 'checkpoints,',
 'is',
 'good',
 'enough',
 'for',
 'the',
 'job.',
 'both',
 'impressed,',
 'and',
 'a',
 'little',
 'disappointed',
 'how',
 'rarely',
 'i',
 'get',
 'to',
 'actually',
 'train',
 'a',
 'model',
 'that',
 'matters',
 ':(']
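Splitting on whitespace leaves punctuation attached (`'practice,'`, `'job.'`) and keeps the typographic apostrophe in `'i’m'`, so such tokens cannot match entries in a stopword list that stores contractions like "don't" with a plain ASCII apostrophe. One possible alternative, sketched here with a hypothetical `tokenize` helper and a regular expression chosen for this example, normalizes the apostrophe and strips surrounding punctuation. We keep the simple `split()` in the rest of this section.

```python
import re

def tokenize(text):
    """Lowercase, normalize curly apostrophes, and extract word-like tokens."""
    # Replace the typographic apostrophe (U+2019) with the ASCII one.
    text = text.lower().replace("\u2019", "'")
    # Match runs of word characters, optionally joined by apostrophes,
    # with an optional leading @ or # for handles and hashtags.
    return re.findall(r"[@#]?\w+(?:'\w+)*", text)

sample = "I\u2019m amazed how often in practice, a @huggingface NLP model is good enough."
print(tokenize(sample))
```

Note that this pattern discards tokens made only of punctuation, such as the `:(` emoticon, which may or may not be what you want for tweets.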

Now we can iterate through the list and check whether each word exists in `stop_words`; if it does, we discard it.

In [13]:
tweet_no_stopwords = [word for word in tweet if word not in stop_words]

print("With stopwords:", ' '.join(tweet))
print("Without:", ' '.join(tweet_no_stopwords))

With stopwords: i’m amazed how often in practice, not only does a @huggingface nlp model solve your problem, but one of their public finetuned checkpoints, is good enough for the job. both impressed, and a little disappointed how rarely i get to actually train a model that matters :(
Without: i’m amazed often practice, @huggingface nlp model solve problem, one public finetuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :(
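If the original casing matters downstream, the steps above can be wrapped in a small function that lowercases each token only for the comparison. This is a sketch: `remove_stopwords` and `demo_stop_words` are names made up for this example, and the small set stands in for the NLTK list so that the block runs on its own.

```python
def remove_stopwords(text, stop_words):
    """Drop stopwords, comparing case-insensitively but keeping the original casing."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)

# Small stand-in set; in the notebook you would pass the NLTK-based `stop_words`.
demo_stop_words = {"a", "the", "is", "for", "of"}

print(remove_stopwords("The NLP model is good enough for the job", demo_stop_words))
# → NLP model good enough job
```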


It's that easy! We'll move on to more preprocessing methods in the following sections.