## Stop Words
Stop words are very common words in a language that are typically filtered out of text before or after processing. English examples include "a," "an," "the," "and," and "in." Because these words carry little information for many kinds of text analysis, they are often removed to save computation in natural language processing tasks such as text classification and information retrieval. Removing stop words also reduces the dimensionality of the data, which can improve results for some machine learning models.
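As a rough illustration of the dimensionality point, the sketch below counts distinct words before and after filtering against a tiny hand-picked stop list. `TOY_STOP_WORDS` and `vocabulary` are hypothetical names made up for this example; real libraries ship much larger lists.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
TOY_STOP_WORDS = {"a", "an", "the", "and", "in", "is", "of", "to"}

def vocabulary(text, stop_words=frozenset()):
    """Return the set of distinct lowercase words in text, minus stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in stop_words}

sample = "The cat sat in the hat and the dog sat in a box"

full_vocab = vocabulary(sample)
filtered_vocab = vocabulary(sample, TOY_STOP_WORDS)

# Distinct words before and after filtering
print(len(full_vocab), len(filtered_vocab))  # 9 5
```

In a bag-of-words model each distinct word becomes one feature, so a smaller vocabulary means fewer columns to store and learn from.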

In [1]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS


In [2]:
len(STOP_WORDS)

326

In [3]:
!python -m spacy download en_core_web_lg --quiet
nlp = spacy.load("en_core_web_lg")

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [4]:
doc = nlp("We just opened our wings, the flying part is coming soon")

In [5]:
for tok in doc:
    if tok.is_stop:
        print(tok)

We
just
our
the
part
is


In [6]:
text = "We just opened our wings, the flying part is coming soon"
type(text)
text

'We just opened our wings, the flying part is coming soon'

In [7]:
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [8]:
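def process(text):
    doc = nlp(text)
    # Keep only the tokens that spaCy does not flag as stop words
    no_stop_words = [tok.text for tok in doc if not tok.is_stop]
    # Joining with '' glues the tokens together and keeps punctuation
    return ''.join(no_stop_words)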

In [9]:
process(text)

'openedwings,flyingcomingsoon'

In [29]:
def process1(text):
    doc = nlp(text)
    # Drop stop words and punctuation tokens
    no_stop_words = [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]
    # Join with a space so the remaining words stay separated
    return ' '.join(no_stop_words)

In [30]:
process1(text)

'opened wings flying coming soon'

### Remove stop words from a pandas DataFrame text column

In [21]:
import pandas as pd
df = pd.read_json('doj_press.json',lines=True)
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


In [22]:
filt = df['topics'].str.len() != 0
df[filt]

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] |
| ... | ... | ... | ... | ... | ... | ... |
| 13081 | 14-1377 | Yuba City, California, Man Sentenced to 46 Mon... | Anthony Merrell Tyler, 34, of Yuba City, Calif... | 2014-12-09T00:00:00-05:00 | [Hate Crimes] | [Civil Rights Division, Civil Rights - Crimina... |
| 13082 | 16-735 | Yuengling to Upgrade Environmental Measures to... | The Department of Justice and the U.S. Environ... | 2016-06-23T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 13084 | 17-045 | Zimmer Biomet Holdings Inc. Agrees to Pay $17.... | Subsidiary Agrees to Plead Guilty to Violating... | 2017-01-12T00:00:00-05:00 | [Foreign Corruption] | [Criminal Division, Criminal - Criminal Fraud ... |
| 13085 | 17-252 | ZTE Corporation Agrees to Plead Guilty and Pay... | ZTE Corporation has agreed to enter a guilty p... | 2017-03-07T00:00:00-05:00 | [Asset Forfeiture, Counterintelligence and Exp... | [National Security Division (NSD), USAO - Texa... |


In [23]:
df_new = df.head(50)
df_new.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


In [31]:
df_new['contents_new'] = df_new['contents'].apply(process1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['contents_new'] = df_new['contents'].apply(process1)
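The warning above appears because `df_new` was created by slicing `df`, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it is to take an explicit copy before adding the column. The sketch below uses a small made-up frame (`df_demo`) and a stand-in function (`shout`) in place of `process1`, so it runs without the dataset or the spaCy model; both names are assumptions for illustration.

```python
import pandas as pd

# Made-up data standing in for the press-release frame.
df_demo = pd.DataFrame(
    {"contents": ["first release", "second release", "third release"]}
)

def shout(text):
    """Stand-in for process1: any str -> str function works with apply."""
    return text.upper()

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves df_demo untouched.
df_small = df_demo.head(2).copy()
df_small["contents_new"] = df_small["contents"].apply(shout)

print(df_small)
```

Applied to the notebook, the same idea would be `df_new = df.head(50).copy()` before the `apply` call.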


In [25]:
df_new.head()

| | id | title | contents | date | topics | components | contents_new |
|---|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] | PORTLAND Oregon Mohamed Osman Mohamud 23 convi... |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] | WASHINGTON North Carolina Waccamaw River wa... |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] | BOSTON $ 1 million settlement reached n... |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] | WASHINGTON federal grand jury Las Vegas tod... |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] | U.S. Department Justice U.S. Environmental Pro... |


In [32]:
df_new['contents'][0][:100]

'PORTLAND, Oregon. – Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weap'

In [33]:
df_new['contents_new'][0][100]

's'

### Examples where removing stop words can create a problem
#### (1) Sentiment detection: depending on your dataset, removing stop words can sometimes change the sentiment of a sentence

In [36]:
process1("this is a good movie")


'good movie'

In [37]:
process1("this is not a good movie")

'good movie'
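One possible mitigation for the case above is to keep negation words out of the stop list before filtering. The sketch below shows the idea in plain Python with a small hypothetical stop list (`TOY_STOP_WORDS`) and negation set (`NEGATIONS`). It is not spaCy's API, and which words to protect depends on the task.

```python
# Hypothetical toy lists for illustration; not spaCy's STOP_WORDS.
TOY_STOP_WORDS = {"this", "is", "a", "not", "the", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(text, stop_words):
    """Drop whitespace-separated words that appear in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

sentence = "this is not a good movie"

# Default list: the negation is lost.
print(remove_stop_words(sentence, TOY_STOP_WORDS))              # good movie

# Sentiment-aware list: negation words are protected.
print(remove_stop_words(sentence, TOY_STOP_WORDS - NEGATIONS))  # not good movie
```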

#### (2) Language translation: say you want to translate the following sentence from English to Telugu. If you remove stop words before translating, the result will be very poor

In [38]:
process1("how are you doing dhaval?")


'dhaval'

#### (3) Chatbot or any Q&A system: removing stop words can drop the negation and change what the user is asking

In [39]:
process1("I don't find yoga mat on your website. Can you help?")


'find yoga mat website help'