# What are Stopwords?


Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.

    Generally, the most common words used in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at” etc.


a about after all also always am an and any are at be been being but by came can cant come 
could did didn't do does doesn't doing don't else for from get give goes going had happen 
has have having how i if ill i'm in into is isn't it its i've just keep let like made make 
many may me mean more most much no not now of only or our really say see some something 
take tell than that the their them then they thing this to try up us use used uses very 
want was way we what when where which who why will with without wont you your youre
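
As a minimal sketch of the idea, the snippet below filters a sentence against a handful of words taken from the list above. The `SAMPLE_STOPWORDS` set and the `remove_stopwords_simple` helper are illustrative names, not part of any library, and the whitespace tokenization is a simplifying assumption.

```python
# A small, hand-picked subset of the stopword list above (illustrative only).
SAMPLE_STOPWORDS = {"the", "is", "in", "for", "to", "at", "a", "and", "of"}


def remove_stopwords_simple(sentence, stopwords=SAMPLE_STOPWORDS):
    """Return the words of `sentence` that are not in `stopwords`.

    Tokenization is a plain whitespace split, and matching is
    case-insensitive because each word is lowercased before the lookup.
    """
    return [word for word in sentence.split() if word.lower() not in stopwords]


print(remove_stopwords_simple("The cat is in the garden"))
# → ['cat', 'garden']
```

The libraries covered later do the same thing with larger, curated lists and proper tokenizers.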

# When Should We Remove Stopwords?

I’ve summarized this into two parts: when we can remove stopwords and when we should avoid doing so.

Remove Stopwords

We can remove stopwords while performing the following tasks:

    Text Classification
        Spam Filtering
        Language Classification
        Genre Classification
    Caption Generation
    Auto-Tag Generation



Avoid Stopword Removal

We should avoid removing stopwords while performing the following tasks:

    Machine Translation
    Language Modeling
    Text Summarization
    Question-Answering problems


# Different Methods to Remove Stopwords

# 1. Stopword Removal using NLTK

Use the code below to see the list of stopwords in NLTK:

In [12]:
import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus

In [13]:
from nltk.corpus import stopwords
set(stopwords.words('english'))

In [14]:
# The following code removes stop words from a sentence using NLTK
# Created by - ANALYTICS VIDHYA

# importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [16]:
# sample sentence
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

In [17]:
# set of stop words
stop_words = set(stopwords.words('english'))

In [18]:
# tokens of words (word_tokenize requires NLTK's punkt tokenizer data)
word_tokens = word_tokenize(text)

# keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens))

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))

Here is the list we obtained after tokenization:

He determined to drop his litigation with the monastry, and relinguish his claims to the 
wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights
had become much less valuable, and he had indeed the vaguest idea where the wood and river
 in question were

And the list after removing stopwords:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He 
ready becuase rights become much less valuable, indeed vaguest idea wood river question.
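
Notice that the capitalized "He" survives in the filtered output. A likely reason is that the stopword list is all lowercase while the membership test is case-sensitive. The sketch below shows the difference with a small hypothetical stopword set (not the NLTK list itself); lowercasing each token only for the lookup is one possible fix.

```python
# Hypothetical lowercase stopword set standing in for a library-provided list.
stop_words = {"he", "to", "his", "with", "the", "and", "at", "once"}

tokens = ["He", "determined", "to", "drop", "his", "litigation"]

# Case-sensitive comparison: "He" is not equal to "he", so it is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive comparison: lowercase each token only for the lookup,
# so the surviving tokens keep their original capitalization.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # → ['He', 'determined', 'drop', 'litigation']
print(case_insensitive)  # → ['determined', 'drop', 'litigation']
```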

# 2. Stopword Removal using spaCy

Here’s how you can remove stopwords using spaCy in Python:

In [21]:
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
nlp = English()

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# The "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = [token.text for token in my_doc]

# Create list of word tokens after removing stopwords
filtered_sentence = []

for word in token_list:
    lexeme = nlp.vocab[word]
    if not lexeme.is_stop:
        filtered_sentence.append(word)
print(token_list)
print(filtered_sentence)

This is the list we obtained after tokenization:

He determined to drop his litigation with the monastry and relinguish his claims to the 
wood-cuting and \n fishery rihgts at once. He was the more ready to do this becuase the 
rights had become much less valuable, and he had \n indeed the vaguest idea where the wood
 and river in question were.

And the list after removing stopwords:

determined drop litigation monastry, relinguish claims wood-cuting \n fishery rihgts. ready
becuase rights become valuable, \n vaguest idea wood river question.
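
The stray \n entries above come from the line breaks inside the triple-quoted sample text. One simple way to avoid them, sketched here with only the standard library and a shortened sample string, is to collapse all whitespace before passing the text to the tokenizer.

```python
# Shortened sample containing a line break, as in the triple-quoted text above.
raw = "claims to the wood-cuting and \nfishery rihgts at once."

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining with a single space collapses it.
cleaned = " ".join(raw.split())

print(repr(cleaned))
# → 'claims to the wood-cuting and fishery rihgts at once.'
```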

# 3. Stopword Removal using Gensim

In [23]:
# The following code removes stop words using gensim
# Created by - ANALYTICS VIDHYA
from gensim.parsing.preprocessing import remove_stopwords

# pass the sentence to the remove_stopwords function
result = remove_stopwords("""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, 
and he had indeed the vaguest idea where the wood and river in question were.""")

print('\n\n Filtered Sentence \n\n')
print(result)


He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts once.
He ready becuase rights valuable, vaguest idea wood river question were.
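
The result above still contains "once." and "were.". One plausible explanation is that the filter splits on whitespace only, so a word with punctuation attached no longer matches its entry in the stopword list. The sketch below reproduces that effect with a hypothetical stopword set; it is an illustration under that assumption, not Gensim's implementation.

```python
import string

# Hypothetical stopword set for illustration (not Gensim's list).
stop_words = {"in", "were", "at", "once", "the"}

sentence = "the wood and river in question were."

# Whitespace split: the last token is "were." (with the period attached),
# which does not match "were", so it is kept.
naive = [w for w in sentence.split() if w not in stop_words]

# Stripping surrounding punctuation before the lookup catches it.
stripped = [
    w for w in sentence.split()
    if w.strip(string.punctuation) not in stop_words
]

print(naive)     # → ['wood', 'and', 'river', 'question', 'were.']
print(stripped)  # → ['wood', 'and', 'river', 'question']
```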

# Introduction to Text Normalization

What are Stemming and Lemmatization?

Stemming and lemmatization are both ways of normalizing words: each reduces a word to its root form.

# Stemming

Let’s first understand stemming:

    Stemming is a text normalization technique that cuts off the end or beginning of a word by taking into account a list of common prefixes or suffixes that could be found in that word.
    It is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
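
To make the rule-based idea concrete, here is a toy suffix stripper that uses the four suffixes mentioned above. The `naive_stem` function and its `min_stem_length` guard are assumptions made for this sketch; real stemmers such as Porter's apply many more rules and conditions.

```python
# Suffixes from the list above, ordered so that "es" is tried before "s".
SUFFIXES = ("ing", "ly", "es", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least `min_stem_length` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for w in ["claims", "rights", "indeed", "walking", "quickly", "boxes"]:
    print(w, "->", naive_stem(w))
```

Because the rules ignore meaning, a stripper like this can produce stems that are not real words, which is the main trade-off compared with lemmatization.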


# Lemmatization

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of known words and their base forms) and morphological analysis (word structure and grammatical relations).
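
The dictionary-driven nature of lemmatization can be sketched with a tiny lookup table. The `LEMMA_TABLE` entries and the `toy_lemmatize` helper below are hypothetical and only illustrate the principle; real lemmatizers rely on much larger resources such as WordNet.

```python
# Hypothetical miniature "vocabulary": (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("had", "v"): "have",
    ("rights", "n"): "right",
    ("claims", "n"): "claim",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for `word` with part of speech `pos`; fall back to the word."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("were", "v"))    # → be
print(toy_lemmatize("rights", "n"))  # → right
print(toy_lemmatize("less", "n"))    # → less
```

Unlike suffix stripping, the lookup depends on the part of speech, which is why some lemmatizers ask for a pos argument.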

# Methods to Perform Text Normalization

# 1. Text Normalization using NLTK

Stemming

In [24]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer

In [26]:
set(stopwords.words('english'))

In [28]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

# drop the stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# stem each remaining token with the Porter stemmer
stem_words = []
ps = PorterStemmer()
for w in filtered_sentence:
    root_word = ps.stem(w)
    stem_words.append(root_word)
print(filtered_sentence)
print(stem_words)

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He 
ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas
right become much less valuabl, inde vaguest idea wood river question.


Lemmatization

In [29]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

# drop the stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

# lemmatize each token as a noun, then a verb, then an adjective
# (requires the WordNet data: nltk.download('wordnet'))
lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He 
ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He 
ready becuase right become much le valuable, indeed vaguest idea wood river question.



# 2. Text Normalization using spaCy

In [None]:
# make sure to download the English model with "python -m spacy download en_core_web_sm"

import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were.""")

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)
lemma_word1

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim
to the wood-cuting and \n fishery rihgts at once. -PRON- be the more ready to do this 
becuase the right have become much less valuable, and -PRON- have \n indeed the vague idea
where the wood and river in question be.

Here, -PRON- is the placeholder lemma that spaCy v2.x assigns to pronouns (newer releases no longer emit this placeholder); it can easily be removed using regular expressions. The benefit of spaCy is that we do not have to pass any pos parameter to perform lemmatization.
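
A minimal sketch of that regular-expression cleanup, using only the standard library and a shortened copy of the output above as input:

```python
import re

# Shortened lemmatized output containing the pronoun placeholder.
lemmatized = "-PRON- determine to drop -PRON- litigation with the monastry"

# Remove each "-PRON-" marker together with the whitespace that follows it.
without_pron = re.sub(r"-PRON-\s*", "", lemmatized).strip()

print(without_pron)
# → determine to drop litigation with the monastry
```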


# 3. Text Normalization using TextBlob

TextBlob is a Python library made for processing text data. It is based on the NLTK library. We can use TextBlob to perform lemmatization. It has no dedicated stemming module, although its Word objects expose a stem() method that wraps NLTK's stemmers.

So let’s see how to perform lemmatization using TextBlob in Python:

In [33]:
# import the Word class from the textblob library
from textblob import Word

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# lemmatize each whitespace-separated word as a noun, then a verb, then an adjective
lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
print(lem)

He determine to drop his litigation with the monastry, and relinguish his claim to the 
wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right
have become much le valuable, and he have indeed the vague idea where the wood and river
in question were.

Stopwords play an important role in problems like sentiment analysis and question answering. That’s why removing stopwords on such tasks can significantly hurt our model’s accuracy.
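
A small sketch makes the risk concrete for sentiment analysis. It uses a hypothetical stopword set that includes "not" (a word that also appears in the stopword list at the start of this article) and a plain whitespace tokenizer; both are simplifying assumptions.

```python
# Hypothetical stopword set; note that it contains the negation "not".
stop_words = {"the", "was", "not", "is", "this"}


def strip_stopwords(sentence):
    """Drop stopwords from a whitespace-tokenized, lowercased sentence."""
    return [w for w in sentence.lower().split() if w not in stop_words]


positive = strip_stopwords("The movie was good")
negative = strip_stopwords("The movie was not good")

print(positive)  # → ['movie', 'good']
print(negative)  # → ['movie', 'good']
print(positive == negative)  # → True
```

After filtering, the two reviews are indistinguishable even though they express opposite opinions, so a sentiment model would be given identical input for both.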