# Stop Words Removal

Assume a project that identifies the category of a news article as one of politics, sports, business, or other.
For this project, suppose we have already used an efficient sentence segmenter and word tokenizer.

Some of the words in the corpus, such as a, an, the, of, and in, do not add much value to the information required for the classification.

We typically remove such words, called stop words.

### Why remove stop words

Stop words are present in abundance in almost any text.

Removing them strips low-level information from the content so that the model can focus on the important information.

Because these words carry little meaning, removing them retains most of the important information and usually has no negative consequences for the model.

Removing stop words also reduces the size of the dataset.


### Why not remove stop words

The decision to remove stop words depends on the task and the goal we want to achieve.
For sentiment analysis, for example, removing stop words can be disastrous, because stop word lists often contain negations such as no, nor, and not.
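
To make the sentiment analysis risk concrete, here is a minimal sketch in plain Python. The stop word set `STOP_WORDS` and the helper `remove_stop_words` are hypothetical names used only for illustration; real lists are much longer. It shows how a positive and a negative sentence can collapse into the same text once negations are removed.

```python
# Tiny illustrative stop word set (hypothetical; real lists are much longer).
STOP_WORDS = {"the", "was", "is", "not", "no", "a", "this"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token whose lowercase form is a stop word."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

positive = "the movie was good"
negative = "the movie was not good"

print(remove_stop_words(positive))  # movie good
print(remove_stop_words(negative))  # movie good (the negation is lost)

# Taking the negations out of the stop word set preserves the difference.
keep_negations = STOP_WORDS - {"not", "no"}
print(remove_stop_words(negative, keep_negations))  # movie not good
```

For tasks where polarity matters, one option is to start from a library list and subtract the negation words before filtering.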

## Stop Word Libraries 

### 1. NLTK

In [8]:
import nltk

In [9]:
from nltk.corpus import stopwords

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
stopwords_nltk = stopwords.words('english')

In [13]:
print(stopwords_nltk)  # print all nltk stopwords

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [14]:
print("total number of stopwords in nltk library is : ",len(stopwords_nltk))

total number of stopwords in nltk library is :  179


#### Let’s remove the stop words

In [101]:
text = '''Doglapan by Ashneer Grover is hands down one of the most interesting books I've ever read and now my favourite.
Ashneer's life according to me is a full-blown masala Bollywood story, and that's one of the best parts of this book! Nowhere does he shy away from admitting his vulnerability, his failures, or his insecurities.
His book maybe titled as "Doglapan", but has a lot of "Sachhai". 
This book, although written in English, avoids technical jargon like a plague. Simple, non heavy words, and heavy recall value.
As Ashneer in his inimitable style says, "4 shabd se zyada mein baat samjhani pade to bekaar hai phir." Ashneer, in this book, takes you through a founder's birth, his life, his highs and lows, and finally a martyr's death.
But, as we all know, "ke picture abhi baaki hai mere dost", so waiting for the Phoenix to rise yet again, and to take the world by storm, with another idea, another product. We believe in you Ashneer.
And we know you'll emerge stronger and with another market disrupter.'''

In [32]:
words = [word for word in text.split() if word.lower() not in stopwords_nltk]

In [33]:
new_text = " ".join(words)

In [34]:
new_text

'Doglapan Ashneer Grover hands one interesting books I\'ve ever read favourite. Ashneer\'s life according full-blown masala Bollywood story, that\'s one best parts book! Nowhere shy away admitting vulnerability, failures, insecurities. book maybe titled "Doglapan", lot "Sachhai". book, although written English, avoids technical jargon like plague. Simple, non heavy words, heavy recall value. Ashneer inimitable style says, "4 shabd se zyada mein baat samjhani pade bekaar hai phir." Ashneer, book, takes founder\'s birth, life, highs lows, finally martyr\'s death. But, know, "ke picture abhi baaki hai mere dost", waiting Phoenix rise yet again, take world storm, another idea, another product. believe Ashneer. know emerge stronger another market disrupter.'

In [35]:
print("length of old text : ",len(text))

length of old text :  1012


In [36]:
print("length of new text : ",len(new_text))

length of new text :  756
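
The filter above splits on whitespace only, so punctuation stays attached to tokens: `But,` and `again,` survive in the output even though `but` and `again` are in the NLTK list. A possible refinement, sketched here with a small stand-in stop word set (an assumption; in the notebook it would be `set(stopwords_nltk)`) and a hypothetical helper `filter_tokens`, strips surrounding punctuation before the comparison. Using a set instead of a list also makes each membership test faster.

```python
import string

# Small stand-in stop word set (an assumption for this sketch).
stop_words = {"but", "as", "we", "all", "so", "for", "the", "to", "again", "and"}

def filter_tokens(text, stop_words):
    """Keep tokens whose lowercase, punctuation-stripped form is not a stop word."""
    kept = []
    for token in text.split():
        # Compare a cleaned copy, but keep the original token in the output.
        core = token.strip(string.punctuation).lower()
        if core not in stop_words:
            kept.append(token)
    return " ".join(kept)

sample = "But, as we all know, so waiting for the Phoenix to rise yet again, and"
print(filter_tokens(sample, stop_words))  # know, waiting Phoenix rise yet
```

Since a word tokenizer is assumed to be available for the project, passing its tokens to the filter instead of `text.split()` would be another way to get the same effect.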


### 2. spaCy

In [47]:
#!python -m spacy download en_core_web_sm

In [44]:
import spacy

In [46]:
s = spacy.load("en_core_web_sm")

In [48]:
stopwords_spacy = s.Defaults.stop_words

In [49]:
print(stopwords_spacy)

{'but', 'i', 'his', 'serious', '’d', 'whence', 'around', 'cannot', 'where', 'fifteen', 'are', 'may', 'three', 'several', 'amount', 'still', 'anyone', 'have', 'without', 'regarding', 'nine', 'first', 'side', 'has', 'them', 'again', 'upon', 'while', 'been', 'did', 'with', 'therefore', 'although', 'so', 'we', 'former', 'being', 'often', 'above', 'full', 'before', 'meanwhile', 'besides', 'become', 'out', 'please', 'not', 'something', "'s", 'himself', 'less', 'whole', 'done', 'since', 'am', 'put', 'others', 'go', 'alone', 'nor', 'anywhere', 'more', "'ll", 'unless', 'yourself', 'sometimes', 'whose', 'here', '‘d', 'used', 'do', 'once', 'against', 'her', 'though', 'beside', 'a', 'everyone', 'than', 'really', 'up', 'keep', 'whereupon', 'neither', 'make', 'therein', 'n‘t', '’m', 'yours', 'whither', 'either', 'rather', 'moreover', 'show', 'top', 'else', 'thence', 'nobody', 'give', 'whereas', 'down', 'thereafter', 'becomes', 'this', '‘m', 'during', 'onto', 'it', 'forty', 'whenever', 'whatever', 'w

In [50]:
print("number of stopwords in spacy library : ",len(stopwords_spacy))

number of stopwords in spacy library :  326


#### Removing stop words from the text with spaCy

In [51]:
words = [word for word in text.split() if word.lower() not in stopwords_spacy]

In [52]:
new_text = " ".join(words)

In [53]:
print(new_text)

Doglapan Ashneer Grover hands interesting books I've read favourite. Ashneer's life according full-blown masala Bollywood story, that's best parts book! shy away admitting vulnerability, failures, insecurities. book maybe titled "Doglapan", lot "Sachhai". book, written English, avoids technical jargon like plague. Simple, non heavy words, heavy recall value. Ashneer inimitable style says, "4 shabd se zyada mein baat samjhani pade bekaar hai phir." Ashneer, book, takes founder's birth, life, highs lows, finally martyr's death. But, know, "ke picture abhi baaki hai mere dost", waiting Phoenix rise again, world storm, idea, product. believe Ashneer. know you'll emerge stronger market disrupter.


In [54]:
print("old text length : ",len(text))

old text length :  1012


In [55]:
print("new text length : ",len(new_text))

new text length :  700


Removing stop words with spaCy reduced the length of the text from 1012 to 700 characters.

This is shorter than the 756 characters left by NLTK, because the spaCy list has more stop words (326) than the NLTK list (179).

The results, in this case, are quite similar though.

### 3. Gensim
Gensim ("Generate Similar") is an open-source library for topic modelling and document similarity that uses modern statistical machine learning.

In [57]:
import gensim

In [58]:
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS

In [59]:
print(STOPWORDS)

frozenset({'but', 'i', 'his', 'amoungst', 'serious', 'whence', 'around', 'don', 'cannot', 'where', 'fifteen', 'are', 'may', 'three', 'several', 'amount', 'still', 'anyone', 'have', 'without', 'regarding', 'nine', 'first', 'sincere', 'side', 'has', 'them', 'again', 'upon', 'while', 'been', 'did', 'with', 'therefore', 'although', 'so', 'we', 'former', 'being', 'often', 'above', 'full', 'before', 'meanwhile', 'besides', 'doesn', 'become', 'out', 'please', 'not', 'something', 'himself', 'less', 'whole', 'done', 'since', 'inc', 'con', 'am', 'put', 'others', 'go', 'alone', 'nor', 'anywhere', 'more', 'unless', 'yourself', 'whose', 'sometimes', 'here', 'cry', 'used', 'do', 'once', 'against', 'her', 'though', 'beside', 'a', 'everyone', 'than', 'eg', 'up', 'really', 'keep', 'whereupon', 'neither', 'make', 'therein', 'ie', 'yours', 'whither', 'either', 'rather', 'moreover', 'show', 'top', 'else', 'thence', 'nobody', 'give', 'whereas', 'down', 'thereafter', 'becomes', 'this', 'during', 'onto', 'sy

In [60]:
print(len(STOPWORDS))

337


In [64]:
new_text = remove_stopwords(text)

In [65]:
print(new_text)

Doglapan Ashneer Grover hands interesting books I've read favourite. Ashneer's life according full-blown masala Bollywood story, that's best parts book! Nowhere shy away admitting vulnerability, failures, insecurities. His book maybe titled "Doglapan", lot "Sachhai". This book, written English, avoids technical jargon like plague. Simple, non heavy words, heavy recall value. As Ashneer inimitable style says, "4 shabd se zyada mein baat samjhani pade bekaar hai phir." Ashneer, book, takes founder's birth, life, highs lows, finally martyr's death. But, know, "ke picture abhi baaki hai mere dost", waiting Phoenix rise again, world storm, idea, product. We believe Ashneer. And know you'll emerge stronger market disrupter.


In [67]:
print("length of new_text : ",len(new_text))

length of new_text :  727
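
The Gensim output above still contains capitalized words such as "His", "This", "As", "We" and "And", which suggests that tokens are compared with the stop word list exactly as written. Below is a minimal sketch of a case-insensitive variant in plain Python. The helper `remove_stopwords_ci` and the set `stand_in` are hypothetical; in the notebook one would pass `STOPWORDS` instead.

```python
def remove_stopwords_ci(text, stop_words):
    """Drop tokens whose lowercase form is in stop_words (case-insensitive match)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

# Stand-in stop word set (an assumption for this sketch).
stand_in = frozenset({"his", "this", "as", "we", "and", "in", "you"})
sample = "We believe in you Ashneer. And we know"

# Case-sensitive comparison keeps the capitalized stop words.
case_sensitive = " ".join(w for w in sample.split() if w not in stand_in)
print(case_sensitive)                          # We believe Ashneer. And know

# Lowercasing each token before the lookup removes them as well.
print(remove_stopwords_ci(sample, stand_in))   # believe Ashneer. know
```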


### 4. Sklearn

In [76]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [77]:
print(ENGLISH_STOP_WORDS)

frozenset({'but', 'i', 'his', 'amoungst', 'serious', 'whence', 'around', 'cannot', 'where', 'fifteen', 'are', 'may', 'three', 'several', 'amount', 'still', 'anyone', 'have', 'without', 'nine', 'first', 'sincere', 'side', 'has', 'them', 'again', 'upon', 'while', 'been', 'with', 'therefore', 'although', 'so', 'we', 'former', 'being', 'often', 'above', 'full', 'before', 'meanwhile', 'besides', 'become', 'out', 'please', 'not', 'something', 'himself', 'less', 'whole', 'done', 'since', 'inc', 'am', 'con', 'put', 'others', 'go', 'alone', 'nor', 'anywhere', 'more', 'yourself', 'sometimes', 'whose', 'here', 'cry', 'do', 'once', 'against', 'her', 'though', 'beside', 'a', 'everyone', 'than', 'eg', 'up', 'keep', 'whereupon', 'neither', 'therein', 'ie', 'yours', 'whither', 'either', 'rather', 'moreover', 'show', 'top', 'else', 'thence', 'nobody', 'give', 'whereas', 'down', 'thereafter', 'becomes', 'this', 'during', 'onto', 'system', 'it', 'forty', 'whenever', 'whatever', 'why', 'after', 'these', '

In [78]:
print(len(ENGLISH_STOP_WORDS))

318


In [81]:
n = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]

In [82]:
new_text = " ".join(n)

In [83]:
print(new_text)

Doglapan Ashneer Grover hands interesting books I've read favourite. Ashneer's life according full-blown masala Bollywood story, that's best parts book! does shy away admitting vulnerability, failures, insecurities. book maybe titled "Doglapan", lot "Sachhai". book, written English, avoids technical jargon like plague. Simple, non heavy words, heavy recall value. Ashneer inimitable style says, "4 shabd se zyada mein baat samjhani pade bekaar hai phir." Ashneer, book, takes founder's birth, life, highs lows, finally martyr's death. But, know, "ke picture abhi baaki hai mere dost", waiting Phoenix rise again, world storm, idea, product. believe Ashneer. know you'll emerge stronger market disrupter.


In [84]:
print("old text length : ",len(text))

old text length :  1012


In [85]:
print("new text length: ",len(new_text))

new text length:  705
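
To compare the four libraries on this text, the character counts printed in the sections above can be turned into relative reductions. This is a small sketch; the helper `reduction_percent` is a hypothetical name, and the numbers are copied from the notebook outputs.

```python
# Character counts reported by the notebook cells above.
original_length = 1012
lengths_after = {"NLTK": 756, "spaCy": 700, "Gensim": 727, "sklearn": 705}

def reduction_percent(before, after):
    """Percentage by which the text shrank, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)

reductions = {
    name: reduction_percent(original_length, n) for name, n in lengths_after.items()
}

# Print the largest reduction first.
for name, pct in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct}% shorter")
```

A larger reduction is not automatically better: it only means that more tokens matched the list, not that the removed tokens were uninformative for the task.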


## Stop Words Custom Lists 

In [86]:
stopwords_nltk.extend(['first','second','third','why'])

In [87]:
print(len(stopwords_nltk))

183


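The number of entries in the list increased from 179 to 183. Note that 'why' is already part of the NLTK list, so `extend` adds it a second time as a duplicate entry.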

## Remove Words 

In [91]:
stopwords_nltk.remove('why')

In [92]:
print(len(stopwords_nltk))

182


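The number of entries in the list dropped from 183 to 182. Note that `list.remove` deletes only the first matching entry; since 'why' was in the original NLTK list and was also added again with `extend`, one 'why' entry is still in the list.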
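
Because `stopwords_nltk` is a plain Python list, it can hold the same word twice, and `remove` deletes only the first occurrence. The sketch below contrasts that with a set, using a short stand-in list `base` (an assumption in place of `stopwords.words('english')`).

```python
# Stand-in for the library list (an assumption); it already contains 'why'.
base = ["when", "where", "why", "how"]

# List version: extend adds a duplicate 'why', remove deletes only one of them.
as_list = list(base)
as_list.extend(["first", "why"])
as_list.remove("why")
print("why" in as_list)   # True, one copy is still there

# Set version: duplicates cannot occur, so discard really removes the word.
as_set = set(base) | {"first", "why"}
as_set.discard("why")
print("why" in as_set)    # False
```

Building the custom collection as a new set also leaves the original library list untouched, which helps when the same notebook compares several stop word lists.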

## Create Custom List

In [93]:
custom_list = ['was','a','many','in','the','after','of','where','her','they']

In [96]:
w = [word for word in text.split() if word.lower() not in custom_list]

In [97]:
new_text = " ".join(w)

In [104]:
print(new_text)

Doglapan by Ashneer Grover is hands down one most interesting books I've ever read and now my favourite. Ashneer's life according to me is full-blown masala Bollywood story, and that's one best parts this book! Nowhere does he shy away from admitting his vulnerability, his failures, or his insecurities. His book maybe titled as "Doglapan", but has lot "Sachhai". This book, although written English, avoids technical jargon like plague. Simple, non heavy words, and heavy recall value. As Ashneer his inimitable style says, "4 shabd se zyada mein baat samjhani pade to bekaar hai phir." Ashneer, this book, takes you through founder's birth, his life, his highs and lows, and finally martyr's death. But, as we all know, "ke picture abhi baaki hai mere dost", so waiting for Phoenix to rise yet again, and to take world by storm, with another idea, another product. We believe you Ashneer. And we know you'll emerge stronger and with another market disrupter.


In [105]:
print(len(new_text))

961


by  - Ayush Singh Rawat

email - ayush191302013@gmail.com