<a href="https://colab.research.google.com/github/Raj-dot-GitHub/NLP-Notes/blob/main/Stopwords/NLP_Removing_Stopwords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### In this notebook we will learn about stopwords and how to remove them from our text.

## **What are Stopwords?**
> Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.

Generally, the most common words used in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at” etc.

E.g. "There is a pen on the table."

So, in the above example the words "is", "a", "on" and "the" add little meaning to the statement when it is parsed, whereas the keywords of the statement are "pen" and "table". ("There" is also found on most stopword lists, including NLTK's.)

### **Note:-** We perform tokenization before removing stopwords.
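
A minimal sketch of that order of operations (tokenize first, then filter), using only the standard library. The `TINY_STOPWORDS` set and the regex tokenizer are illustrative assumptions, not any library's real list or tokenizer:

```python
import re

# Hypothetical, hand-picked stopword set for illustration only.
TINY_STOPWORDS = {"there", "is", "a", "on", "the"}

def simple_tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def drop_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = simple_tokenize("There is a pen on the table.")
keywords = drop_stopwords(tokens, TINY_STOPWORDS)
print(tokens)    # all word tokens, lowercased
print(keywords)  # tokens left after filtering
```

Without the tokenization step, "table." (with the trailing period) would be treated as a different string from "table", which is why the filter runs on tokens rather than on the raw sentence.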

## **Why do we need to remove "Stopwords"?**
> Removing stopwords is not a hard and fast rule in NLP. It depends upon the task that we are working on. For tasks like text classification, where the text is to be classified into different categories, stopwords are removed or excluded from the given text so that more focus can be given to those words which define the meaning of the text.

## **Advantages of removing Stopwords**

1. On removing stopwords, dataset size decreases and the time to train the model also decreases.
2. Removing stopwords can potentially help improve the performance as there are fewer and only meaningful tokens left. Thus, it could increase classification accuracy.
3. Search engines have traditionally ignored very common words when indexing documents and matching queries, which keeps the index smaller and retrieval faster.
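
As a rough illustration of the first point, the sketch below counts tokens in a toy corpus before and after filtering. The documents and the stopword set are made up for the example, so the exact reduction says nothing about real datasets:

```python
# Hypothetical stopword set and toy corpus, for illustration only.
STOP_WORDS = {"the", "is", "a", "on", "of", "and", "to", "in"}
docs = [
    "the cat is on the mat",
    "a dog and a cat",
    "the price of the ticket",
]

def token_count(documents, stop_words=frozenset()):
    """Count whitespace-separated tokens that are not in stop_words."""
    return sum(1 for doc in documents for tok in doc.split() if tok not in stop_words)

before = token_count(docs)
after = token_count(docs, STOP_WORDS)
print("Tokens before: {}".format(before))
print("Tokens after:  {}".format(after))
print("Reduction:     {:.1%}".format(1 - after / before))
```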

## **When should we remove stopwords?**

I’ve summarized this into two parts: when we can remove stopwords and when we should avoid doing so.

### **Remove Stopwords**

We can remove stopwords when performing tasks such as:

*   Text Classification
    *   Spam Filtering
    *   Language Classification
    *   Genre Classification

* Caption Generation
* Auto-Tag Generation

### **Avoid Stopword removal**

* Machine Translation
* Language Modeling
* Text Summarization
* Question-Answering problems

And many more...
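
One reason these tasks suffer is that stopword lists usually contain negations such as "not" (NLTK's English list does). The sketch below, which uses a hypothetical mini stopword set and plain whitespace splitting, shows two sentences with opposite meanings collapsing into the same string:

```python
# Hypothetical mini stopword list; it includes the negation "not" on purpose.
STOP_WORDS = {"the", "was", "not", "is", "a"}

def strip_stopwords(sentence, stop_words):
    """Whitespace-tokenize, drop stopwords case-insensitively, and rejoin."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

positive = strip_stopwords("The movie was good", STOP_WORDS)
negative = strip_stopwords("The movie was not good", STOP_WORDS)

print(positive)
print(negative)
print(positive == negative)  # the negation has been lost
```

For translation, summarization or question answering, losing words like "not" changes what the text says, so the full sentence is usually kept.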






## **Different NLP libraries which we will use to remove stopwords are:**

1. NLTK
2. spaCy
3. Gensim

## **Stopwords removal using NLTK**

In [14]:
import nltk
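nltk.download('punkt')      # tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords')  # stopword lists for several languages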
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
print(set(stopwords.words("english")))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'out', 'my', "needn't", 'now', 'or', 'this', 'these', 'was', 'd', 'wouldn', 'y', 'all', 'myself', 'we', 'after', 'has', 'both', 'until', 'for', 'ain', 'yours', 'before', 'but', 'there', 'are', 'up', 'of', "hasn't", 'their', 'then', "couldn't", 'isn', 'than', 'and', 'here', 'mustn', 'with', 'been', "aren't", "it's", 'were', 'you', 'yourself', 'few', 'needn', 'be', 'above', 'he', 'is', 'each', 'below', 'more', 's', 've', 'am', 'if', 'haven', 'ma', 'through', 'those', 'won', 'it', 'i', 'our', 'll', 'such', 'very', 'at', "shouldn't", 'over', 'm', "you've", 'how', "should've", 'only', 'nor', "you'd", 'does', 'your', 'she', 'into', 'most', "that'll", 'yourselves', 'can', 'when', 'themselves', 'weren', 'mightn', 'against', 'no', 'not', "weren't", 'hadn', 'herself', 

In [15]:
# Count of English stopwords in NLTK.
len(set(stopwords.words("english")))

179

In [16]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words("english"))
# Tokenization
word_token = word_tokenize(text)


In [17]:
# Keep only the tokens that are not in the stopword set.
# The lookup is case-sensitive and the NLTK list is lowercase, so "He" is kept.
filtered_sentence = [word for word in word_token if word not in stop_words]

print("\n\nOriginal text\n\n")
print(" ".join(word_token))

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))




Original text


He determined to drop his litigation with the monastry , and relinguish his claims to the wood-cuting and fishery rihgts at once . He was the more ready to do this becuase the rights had become much less valuable , and he had indeed the vaguest idea where the wood and river in question were .


Filtered Sentence 


He determined drop litigation monastry , relinguish claims wood-cuting fishery rihgts . He ready becuase rights become much less valuable , indeed vaguest idea wood river question .


In [21]:
print("Length of Original text {}".format(len(word_token)))
print("Length of Filtered text {}".format(len(filtered_sentence)))

Length of Original text 56
Length of Filtered text 28
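
The filtered sentence printed earlier still contains "He", because the NLTK list is all lowercase and the membership test is case-sensitive. One possible variant lowercases each token for the comparison only and keeps the original casing in the result. The sketch uses a small hand-written stand-in set so that it runs without any downloads; in the notebook, the `stop_words` set built earlier would be passed instead:

```python
def remove_stopwords_case_insensitive(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words; keep original casing otherwise."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Stand-in for set(stopwords.words("english")); an assumption for illustration.
demo_stop_words = {"he", "was", "the", "to", "do", "this", "more"}
demo_tokens = ["He", "was", "the", "more", "ready", "to", "do", "this"]

demo_filtered = remove_stopwords_case_insensitive(demo_tokens, demo_stop_words)
print(demo_filtered)
```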


## **Stopwords removal using spaCy**

In [28]:
from spacy.lang.en import English

# Create a blank English pipeline (tokenizer only; no tagger, parser, NER or word vectors)
nlp = English()
# Default stopwords in spaCy
print(nlp.Defaults.stop_words)
print("There are {} default stopwords in Spacy.".format(len(nlp.Defaults.stop_words)))

{'somewhere', 'beforehand', 'this', 'within', 'therefore', 'becomes', 'toward', "'ll", 'has', 'since', 'ever', 'until', 'keep', 'for', 'latterly', 'yours', 'twelve', 'n’t', 'there', '‘ll', 'afterwards', 'up', 'of', 'moreover', 'give', 'serious', 'bottom', 'always', 'above', 'empty', 'is', 'each', 'everywhere', 'quite', 'less', 'anyone', 'it', 'such', 'towards', 'over', 'thru', 'nor', 'into', 'most', 'neither', 'various', 'otherwise', 'thereupon', 'when', 'former', 'themselves', 'besides', 'herself', 'its', 'hereby', 'amount', 'own', 'made', 'who', 'eight', 'should', 'her', 'due', 'name', 'too', 'elsewhere', 'back', 'whom', 'beyond', 'done', 'eleven', 'indeed', 'everything', 'ours', 'side', 'do', 'doing', 'seeming', 'one', '‘d', 'did', 'make', 'anything', 'nine', 'some', 'along', 'whereby', 'see', 'others', 'everyone', 'per', 'out', 'someone', 'none', '‘s', 'three', 'wherein', 'two', 'really', 'could', 'least', 'show', 'than', 'six', 'you', 'also', 'few', 'be', 'often', 'he', 'neverthel

In [19]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# The "nlp" object turns raw text into a Doc made of tokens.
my_doc = nlp(text)

# Create list of word tokens
token_list = [token.text for token in my_doc]

# Create list of word tokens after removing stopwords.
# token.is_stop reads the same flag as nlp.vocab[token.text].is_stop.
filtered_sentence = [token.text for token in my_doc if not token.is_stop]

print(token_list)
print(filtered_sentence)

['He', 'determined', 'to', 'drop', 'his', 'litigation', 'with', 'the', 'monastry', ',', 'and', 'relinguish', 'his', 'claims', 'to', 'the', 'wood', '-', 'cuting', 'and', '\n', 'fishery', 'rihgts', 'at', 'once', '.', 'He', 'was', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'rights', 'had', 'become', 'much', 'less', 'valuable', ',', 'and', 'he', 'had', '\n', 'indeed', 'the', 'vaguest', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'were', '.']
['determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood', '-', 'cuting', '\n', 'fishery', 'rihgts', '.', 'ready', 'becuase', 'rights', 'valuable', ',', '\n', 'vaguest', 'idea', 'wood', 'river', 'question', '.']


In [22]:
print(len(token_list))
print(len(filtered_sentence))

60
26
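
The filtered list above still contains punctuation and newline tokens, because `is_stop` only looks at the stopword list. A possible extension, sketched here on the assumption that spaCy is installed, also checks the token attributes `is_punct` and `is_space`. The helper name `content_tokens` is hypothetical:

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no model download needed.
nlp = English()

def content_tokens(text):
    """Return token texts that are not stopwords, punctuation, or whitespace."""
    doc = nlp(text)
    return [t.text for t in doc if not (t.is_stop or t.is_punct or t.is_space)]

sample_result = content_tokens("He was ready, and he had \nindeed the vaguest idea.")
print(sample_result)
```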


In [29]:
# To check if a word is a stopword or not.
nlp.vocab["myself"].is_stop

True

In [30]:
# 'myself' is in spaCy's default stopword list, so is_stop returns True.

In [31]:
nlp.vocab["mystery"].is_stop

False

In [32]:
# Add any word to the stopword list.
nlp.Defaults.stop_words.add("mystery")

In [35]:
# The lexeme for 'mystery' already exists in the vocab, so set its is_stop flag explicitly
nlp.vocab['mystery'].is_stop = True

In [36]:
len(nlp.Defaults.stop_words)  # It has changed from 326 to 327.

327

In [37]:
# Let's check if 'mystery' is really added to the stopwords list.
nlp.vocab["mystery"].is_stop

True

In [38]:
# Remove any word from the default stopwords list.
nlp.Defaults.stop_words.remove("beyond")

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [39]:
nlp.vocab['beyond'].is_stop

False

## **Stopwords removal using Gensim**

We can use gensim's remove_stopwords function from the gensim.parsing.preprocessing module.

In [40]:
from gensim.parsing.preprocessing import remove_stopwords


In [42]:
result = remove_stopwords("""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, 
and he had indeed the vaguest idea where the wood and river in question were.""")

print('\n\n Filtered Sentence \n\n')
print(result)



 Filtered Sentence 


He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts once. He ready becuase rights valuable, vaguest idea wood river question were.


In [43]:
# result is a string, so len() counts characters, not words.
print(len(result))

163


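**Note:-** gensim's remove_stopwords works directly on the raw text: it splits the string on whitespace itself, so there is no need for a separate tokenization step. The trade-off is that the match is exact, so capitalized words and words with attached punctuation ("He", "once." and "were." in the output above) are not recognized as stopwords and are kept.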
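
The effect of whitespace-only splitting can be reproduced in plain Python. The sketch below is not gensim's implementation; it uses a hypothetical mini stopword set to contrast an exact whitespace-based lookup with one that strips surrounding punctuation and lowercases each word before the lookup:

```python
import string

# Hypothetical mini stopword set for illustration.
STOP_WORDS = {"at", "once", "were", "in", "the"}

def whitespace_filter(text, stop_words):
    """Split on whitespace only; punctuation stays glued to the words."""
    return " ".join(w for w in text.split() if w not in stop_words)

def punctuation_aware_filter(text, stop_words):
    """Strip surrounding punctuation and lowercase before the lookup."""
    kept = []
    for w in text.split():
        if w.strip(string.punctuation).lower() not in stop_words:
            kept.append(w)
    return " ".join(kept)

sample = "rights at once. The river in question were."
loose = whitespace_filter(sample, STOP_WORDS)
strict = punctuation_aware_filter(sample, STOP_WORDS)
print(loose)   # "once.", "The" and "were." survive the exact match
print(strict)  # they are removed once punctuation and case are normalized
```

Note that the second variant drops the sentence-final period together with the stopword it was attached to, which may or may not be acceptable depending on the task.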

## **That's It!**